This tutorial combines material from the [Software Carpentry Python tutorial](http://swcarpentry.github.io/python-novice-gapminder/) and the [Matplotlib tutorial](https://matplotlib.org/stable/tutorials/introductory/usage.html).

# matplotlib
matplotlib is a very popular Python library for creating plots. One of the original motivations behind matplotlib was to recreate the types of plotting functions available in MATLAB, hence the name.

## matplotlib basics

In [None]:
# Import matplotlib and display plots inline in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

Let's make a simple set of data and plot a line graph.

In [None]:
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')


In our Colab notebook session, plots appear automatically once we execute each code cell. However, in a Python script or command-line session, we need to tell Python to display the plot with the following code:
`plt.show()`

## Plotting data from a Pandas DataFrame (wide format)
Let's grab some data that tracks large city populations since 1950. We'll load it as a Pandas DataFrame.

In [None]:
!wget https://raw.githubusercontent.com/shaunmahony/BMMB554-2022/main/data/city-populations-reformat.csv

In [None]:
!head city-populations-reformat.csv

In [None]:
import numpy as np
import pandas as pd

cities = pd.read_csv('city-populations-reformat.csv')

cities.head()

What format is this table in? Wide or long (tidy)?

Matplotlib is an example of a library that works best with wide format data. 

Let's plot the New York column. 

In [None]:
cities['New York'].plot()
plt.ylabel('Population')

Why isn't the x-axis showing the years? It's actually showing the index, which by default is set to 0,1,.. etc. 

In [None]:
cities.index.values

Instead, let's set the index to the Year column, which also removes that column from the DataFrame: `cities.set_index('Year',drop=True,inplace=True)`

Note that we could alternatively have plotted the New York column against the Year column directly from the DataFrame: `cities.plot(x='Year', y='New York')`

Or, when we loaded the data, we could have defined the first column to be the index: `cities = pd.read_csv('city-populations-reformat.csv', index_col=0)`

In [None]:
# Let's set index to be the Year column
cities.set_index('Year',drop=True,inplace=True)


In [None]:
cities['New York'].plot()
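The three ways of getting years onto the x-axis described above can be compared side by side. This is a minimal sketch on a tiny made-up table (the `City A`/`City B` names and all numbers are hypothetical stand-ins, not the real city data), read from an in-memory string so that it runs without the downloaded file.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers for two hypothetical cities; NOT real population data.
csv_text = """Year,City A,City B
1950,100,50
1960,120,80
1970,150,130
"""

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Option 1: load normally, then move Year into the index.
df1 = pd.read_csv(io.StringIO(csv_text)).set_index('Year')
df1['City A'].plot(ax=axes[0], title='set_index')

# Option 2: keep Year as a column and name the x and y columns explicitly.
df2 = pd.read_csv(io.StringIO(csv_text))
df2.plot(x='Year', y='City A', ax=axes[1], title="plot(x=..., y=...)")

# Option 3: make the first column the index at load time.
df3 = pd.read_csv(io.StringIO(csv_text), index_col=0)
df3['City A'].plot(ax=axes[2], title='index_col=0')
```

All three panels should show the same line with years on the x-axis. Options 1 and 3 are convenient when every later plot should use the year as the x-axis.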

Plotting multiple cities at once:

In [None]:
cities[['New York','Paris','Beijing','Mumbai']].plot()

## Plot styles/types
There are lots of other plot styles and types available. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) for the DataFrame plot method. 



In [None]:
plt.style.use('ggplot')
cities[['New York','Paris','Beijing','Mumbai']].plot(kind='bar')
plt.ylabel('Population')
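To get a feel for the `kind` argument, the sketch below draws the same small table three ways. The data are made up for illustration (hypothetical `City A` and `City B`), and the choice of `'line'`, `'bar'` and `'area'` is just a sample of the kinds listed in the DataFrame plot documentation.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers for two hypothetical cities; NOT real population data.
df = pd.DataFrame(
    {'City A': [100, 120, 150], 'City B': [50, 80, 130]},
    index=pd.Index([1950, 1960, 1970], name='Year'),
)

# One panel per plot kind, all drawn from the same wide-format DataFrame.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, kind in zip(axes, ['line', 'bar', 'area']):
    df.plot(kind=kind, ax=ax, title=kind)
    ax.set_ylabel('Population')

fig.tight_layout()
```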

You can also create plots by calling matplotlib's `plt.plot` function directly.

In [None]:
years = cities.index.values
newyork = cities['New York']

plt.plot(years, newyork)
plt.ylabel('Population')
plt.xlabel('Years')

We can add multiple elements onto the same plot.

In [None]:
years = cities.index.values
newyork = cities['New York']
delhi = cities['Delhi']
rio = cities['Rio de Janeiro']

plt.plot(years, newyork, label='New York')
plt.plot(years, delhi, label='Delhi')
plt.plot(years, rio, label='Rio de Janeiro')
plt.ylabel('Population')
plt.xlabel('Years')
plt.legend()
plt.show()
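The same kind of multi-line plot can also be written with matplotlib's object-oriented interface, where you hold explicit `Figure` and `Axes` objects instead of relying on the "current" plot. This is a sketch with made-up values for two hypothetical cities rather than the real data.

```python
import matplotlib.pyplot as plt

# Made-up numbers for two hypothetical cities; NOT real population data.
years = [1950, 1960, 1970]
city_a = [100, 120, 150]
city_b = [50, 80, 130]

# plt.subplots() returns the figure and a single Axes to draw on.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, city_a, label='City A')
ax.plot(years, city_b, label='City B')

# Axes methods use a set_ prefix where pyplot uses plt.xlabel / plt.ylabel.
ax.set_xlabel('Years')
ax.set_ylabel('Population')
ax.legend()
```

Holding on to `fig` and `ax` becomes useful once a notebook contains several figures, because each call says explicitly which plot it modifies.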

Let's plot New York versus Rio as a scatter.

In [None]:
plt.scatter(newyork, rio)
plt.show()

## Saving plots

Once you are happy with the format and style of the plot, you will likely want to save it as a file. You can do so with the `savefig` method. 

`plt.savefig('my_figure.png')`

This will save the current figure to the file my_figure.png. The file format will be determined from the file name extension (other formats include pdf, ps, eps and svg).

The `savefig` method operates on the current `figure`. Once the plot has been displayed to the screen, matplotlib starts a new empty figure. Thus, you will need to call `savefig` before the plot has been displayed. 

When plotting from DataFrames, there is an added complication that the plot is made and displayed in a single line, so you can't call `savefig` before displaying the figure. In this case, the DataFrame `plot()` call returns a matplotlib Axes object. You can store it in a local variable, get a reference to its figure with the `get_figure()` method, and then call `savefig` on that figure.

In [None]:
plot = cities[['New York','Paris','Beijing','Mumbai']].plot()
plt.ylabel('Population')
fig = plot.get_figure()
fig.savefig('my_figure.png')
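Another option, sketched here under the assumption that you are happy to create the figure yourself, is to make the `Figure` and `Axes` first and pass the Axes to the DataFrame `plot` method through its `ax` argument. The data are made up, and the temporary output folder is only there so the example does not clutter your working directory.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers for two hypothetical cities; NOT real population data.
df = pd.DataFrame(
    {'City A': [100, 120, 150], 'City B': [50, 80, 130]},
    index=pd.Index([1950, 1960, 1970], name='Year'),
)

# Create the figure first, then tell pandas to draw on our Axes.
fig, ax = plt.subplots()
df.plot(ax=ax)
ax.set_ylabel('Population')

# Save through the figure object; dpi and bbox_inches are optional extras.
out_path = os.path.join(tempfile.mkdtemp(), 'my_figure.png')
fig.savefig(out_path, dpi=150, bbox_inches='tight')
plt.close(fig)
```

Because `fig` is an explicit variable, the save works regardless of which figure matplotlib currently considers "current".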

## Matplotlib plots from a tidy Pandas DataFrame

Matplotlib works best with wide format tables, but could we have done the same thing with tidy data? Yes, but we're more limited in terms of easily making multi-series plots *in matplotlib* while the data is in tidy format. Let's go through some steps again, but this time tidying the dataframe first. 

Let's load the data again to make everything clear. 

In [None]:
import numpy as np
import pandas as pd

cities2 = pd.read_csv('city-populations-reformat.csv')

cities2.head()

Let's melt the dataframe to get a tidy representation.

In [None]:
# var_name is passed as a plain string; recent pandas versions reject a list here
cities2 = pd.melt(cities2, value_name='Population', var_name='City', value_vars=cities2.columns[1:], id_vars=['Year'])

cities2.head()
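To see exactly what melting does, it can help to try it on a table small enough to read in full. The sketch below uses made-up values for two hypothetical cities, melts the wide table into tidy form, and then pivots it back to wide with `DataFrame.pivot`.

```python
import pandas as pd

# Made-up numbers for two hypothetical cities; NOT real population data.
wide = pd.DataFrame({
    'Year': [1950, 1960],
    'City A': [100, 120],
    'City B': [50, 80],
})

# Wide -> long: one row per (Year, City) pair.
long = wide.melt(id_vars='Year', var_name='City', value_name='Population')
print(long)

# Long -> wide: one row per Year, one column per City.
wide_again = long.pivot(index='Year', columns='City', values='Population')
print(wide_again)
```

The wide table has 2 rows and the tidy one has 4, one for each combination of year and city.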

Now we want to get the rows where the City equals 'New York', and we want to plot the Year and Population for those rows.

In [None]:
cities2.loc[cities2['City']== 'New York', : ].plot(x='Year', y='Population')
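Multi-series plots from tidy data are still possible in plain matplotlib, they just take a little more work. One possible approach, sketched here on made-up data with hypothetical city names, is to group the rows by city and draw one line per group.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up tidy data for two hypothetical cities; NOT real population data.
long = pd.DataFrame({
    'Year': [1950, 1960, 1970, 1950, 1960, 1970],
    'City': ['City A'] * 3 + ['City B'] * 3,
    'Population': [100, 120, 150, 50, 80, 130],
})

fig, ax = plt.subplots()

# groupby yields (city name, rows for that city); draw one line per city.
for city, group in long.groupby('City'):
    ax.plot(group['Year'], group['Population'], label=city)

ax.set_xlabel('Year')
ax.set_ylabel('Population')
ax.legend()
```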

# Seaborn

seaborn is a plotting library that is built on top of matplotlib. It complements matplotlib to make aesthetically pleasing statistical plots. The creator of seaborn, Michael Waskom, puts the goal of seaborn as follows:

> If Matplotlib tries to make easy things easy and hard things possible, seaborn tries to make a well-defined set of hard things easy too




## Basic plots in seaborn

In [None]:
# Import Seaborn for plotting and styling
import seaborn as sns

Okay, so let's see what our city population line plot looks like in seaborn. Note that seaborn is naturally compatible with Pandas DataFrames, and favors the tidy data representation.

**Remember: `cities` here is in wide format while `cities2` is in long format**

Let's try plotting columns from the wide format data first.

In [None]:
sns.lineplot(data=cities[['New York','Paris','Beijing','Mumbai']])
plt.ylabel('Population')
plt.show()

Now let's do the same thing from the long form dataframe (let's extract the relevant cities first).

In [None]:
cities2_subset = cities2.loc[cities2['City'].isin(['New York','Paris','Beijing','Mumbai'])]

In [None]:
sns.lineplot(data=cities2_subset, x='Year', y='Population', hue='City')

Not very different from the basic matplotlib style here. Let's change style and color palette. More info:
https://seaborn.pydata.org/tutorial/aesthetics.html 

In [None]:
sns.set_theme(style='white', palette='pastel')
sns.lineplot(data=cities2_subset, x='Year', y='Population', hue='City')

Great, but I want to show this plot in a talk, and I'm afraid the text elements are too small to be legible on the screen. Seaborn provides a function called `set_context()`, which lets you easily scale the various elements using four preset 'contexts':

*   paper
*   notebook (default)
*   talk
*   poster



In [None]:
sns.set_theme(style='white', palette='pastel')
sns.set_context('talk')
sns.lineplot(data=cities2_subset, x='Year', y='Population', hue='City')

## Combining plots on tidy data

Let's say we want to take another view of our population data. Instead of tracking individual cities, we want to look at the distribution of populations of these 30 cities over time. One type of plot that might be suitable here is a violin plot. Let's try it in seaborn on the tidy format data. 

In [None]:
#Reset our theme
sns.set_theme()

sns.violinplot(data=cities2, x='Year', y='Population')


This is what we wanted, but it's a bit crowded. Let's make it bigger. 

In [None]:
plt.figure(figsize=(12,8))
sns.set_context('talk')
sns.violinplot(data=cities2, x='Year', y='Population')
plt.show()

Another plot style that would show the same information, but using the individual data points, is a swarmplot.

In [None]:
plt.figure(figsize=(12,8))
sns.set_context('talk')
sns.swarmplot(data=cities2, x='Year', y='Population')
plt.show()

Could we show both a violinplot and a swarmplot together? Yes!

In [None]:
plt.figure(figsize=(12,8))
sns.set_context('talk')

sns.violinplot(
    data=cities2, 
    x='Year', 
    y='Population',
    inner=None) # Remove the bars inside the violins

sns.swarmplot(
    data=cities2, 
    x='Year', 
    y='Population',
    color='k', # Make points black
    alpha=0.7) # and slightly transparent

plt.show()

## Heatmaps
Heatmaps are a very popular and useful type of plot in genomics. They may not be particularly useful for displaying our city population data, but let's stick with the theme for now. Let's see what a default heatmap would display on our wide format data. Note that in this case, seaborn wants "a 2D dataset that can be coerced into an ndarray".

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(data=cities, 
            cmap=sns.color_palette("Reds", as_cmap=True)) #Sets a new color palette
plt.show()

What if we want to automatically cluster cities with similar population trends? Seaborn has a related heatmap function called `clustermap()` that enables hierarchical clustering on rows, columns, or both. 

In [None]:
sns.clustermap(data=cities, 
               col_cluster=True, 
               row_cluster=False,
               figsize=(14,8), # clustermap creates its own figure, so set the size here
               cmap=sns.color_palette("Reds", as_cmap=True))
plt.show()
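`clustermap()` returns a grid object that can be inspected after plotting. The sketch below, on made-up data for three hypothetical cities, reads the clustered column order back from the returned object. It also passes `standard_scale=1`, which rescales each column to the 0 to 1 range. That is an optional choice made here on the assumption that we care about the shape of each trend rather than absolute city size. Clustering requires SciPy to be installed.

```python
import pandas as pd
import seaborn as sns

# Made-up numbers for three hypothetical cities; NOT real population data.
df = pd.DataFrame(
    {
        'City A': [100, 120, 150],
        'City B': [50, 80, 130],
        'City C': [300, 310, 320],
    },
    index=pd.Index([1950, 1960, 1970], name='Year'),
)

# Cluster the columns (cities) only; keep the years in their original order.
g = sns.clustermap(
    data=df,
    col_cluster=True,
    row_cluster=False,
    standard_scale=1,  # rescale each column to 0-1 before clustering
    cmap='Reds',
    figsize=(6, 4),
)

# Column positions in the order they appear on the clustered heatmap.
order = g.dendrogram_col.reordered_ind
clustered_cities = [df.columns[i] for i in order]
print(clustered_cities)
```

The returned object also has a `savefig` method, so the clustered heatmap can be written to a file in the same way as other figures.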