# Python Tutorial: Intro to Matplotlib

Wednesday May 13th, 2020.

This tutorial introduces participants to the basic functionality of the Python library **Matplotlib**. The tutorial is divided into four parts: basic plots, plot attributes, subplots, and plotting the *iris dataset*.

## General Setup
We are using Python 3.

In [None]:
import sys
print("Python version :", sys.version)

Import the **Matplotlib** library and check the version

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt

In [None]:
print("Matplotlib version :", mpl.__version__)

To test various plots and plot attributes, we will be using synthetic datasets. To generate them, we need to import some libraries.

In [None]:
import numpy as np
import random

## Basic Plots

### Line Plot
Resource: https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html

Line plots are useful when we want to visualize $y = f(x)$ functions.

Let's use $\sin$ and $\cos$ functions for our examples.

In [None]:
## NumPy arange(start, stop, step)
x_line = np.arange(0, 15, 0.1)

y_sin = np.sin(x_line)
y_cos = np.cos(x_line)

We can easily draw lines using the ```plot``` function: ```plt.plot(x, y, **kwargs)```

To display a plot without saving it to a file, we can use the ```show``` function: ```plt.show()```
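As a minimal sketch of these two calls, the example below draws a single line and displays it. The data (a damped oscillation) and the variable names are hypothetical and chosen only for illustration; they are not part of the tutorial's datasets.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for illustration: a damped oscillation
t = np.arange(0, 10, 0.1)
damped = np.exp(-0.3 * t) * np.cos(2 * t)

# plot returns a list of Line2D objects, one per drawn line
lines = plt.plot(t, damped)

# show displays the current figure without saving it to a file
plt.show()
```

Keeping the return value of ```plot``` is optional; it is only useful when we want to inspect or modify the line afterwards.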

In [None]:
## Plot the sin function


Let's generate two new datasets

In [None]:
y_sin_shift = [i+2 for i in y_sin]
y_cos_shift = [i+2 for i in y_cos]

When plotting line graphs, we can specify the **line type** with the ```ls``` (or ```linestyle```) argument.

The main **line types** are:
1. Solid ______ (```'-'```)
2. Dashed ----- (```'--'```)
3. Dashdot -.-.-.- (```'-.'```)
4. Dotted ...... (```':'```)

The **line width** can also be modified using the ```lw``` argument.

In [None]:
## Plot the sin function with a solid linestyle (ls)
plt.plot(x_line, y_sin, ls='-')

## Plot the cos function with a dashed linestyle
plt.plot(x_line, y_cos, ls='--')

## Plot the shifted sin function with a dashdot linestyle and a thicker line (lw)
plt.plot(x_line, y_sin_shift, ls='-.', lw=2)

## Plot the shifted cos function with a dotted linestyle
plt.plot(x_line, y_cos_shift, ls=':')
plt.show()

We can also define a specific ```marker``` to highlight the individual data points that the line connects.
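The sketch below shows one way to combine a line style with a marker. The random walk, the seed and the square marker are illustrative assumptions; any marker from the Matplotlib marker list could be used instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: a short random walk (seeded so the figure is reproducible)
rng = np.random.default_rng(0)
steps = np.arange(12)
walk = np.cumsum(rng.normal(0, 1, steps.size))

# marker draws a symbol at every data point; ms sets the marker size
(line,) = plt.plot(steps, walk, ls='--', marker='s', ms=6)

plt.show()
```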

In [None]:
x = range(10)
y = np.random.normal(0, 10, 10)

## Plot the x-y relationship and identify the datapoint with 'o'


### Scatter plot
Resource: https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html

There are two ways to draw scatter plots:
1. using the ```plot``` function (faster when markers are identical in size and color) -> ```plt.plot(x, y, 'o', **kwargs)```
2. using the ```scatter``` function -> ```plt.scatter(x, y, **kwargs)```
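To make the difference between the two approaches concrete, here is a small sketch on hypothetical data (the arrays ```px``` and ```py``` are assumptions, not the tutorial's datasets). It draws the same cloud twice: once with ```plot``` and once, shifted to the right, with ```scatter``` using per-point sizes and colors.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
px = rng.normal(0, 1, 50)
py = rng.normal(0, 1, 50)

# 1. plot with a marker-only format string: returns a list of Line2D objects,
#    all markers share the same size and color
(points_line,) = plt.plot(px, py, 'o')

# 2. scatter: returns a PathCollection; size (s) and color (c) may vary per point
points_coll = plt.scatter(px + 5, py, s=np.linspace(10, 200, 50), c=py)

plt.show()
```

When every marker looks the same, ```plot``` is usually sufficient; ```scatter``` becomes useful once size or color should encode a third variable.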

We will be using three randomly generated Gaussian datasets.

In [None]:
N = 150

x_scatter = [np.random.normal(0, 2, N), np.random.normal(5, 2, N), np.random.normal(10, 2, N)]
y_scatter = [np.random.normal(4, 5, N), np.random.normal(0, 2, N), np.random.normal(8, 2, N)]

Let's first try with the ```plot``` function...

Remember the format of our $x$ and $y$ data: each one is a list of three arrays!

In [None]:
## Plot the three datasets as scatter plots


Now, let's try with the ```scatter``` function.

In [None]:
## Plot the three datasets as scatter plots


Fundamentally, ```scatter``` works with 1-D arrays; $x$ and $y$ may be input as 2-D arrays, but they will be flattened. This is why we are losing our three distinct datasets. To draw the datasets as distinct groups within the same plot, we can use a ```for``` loop. Each dataset will then be assigned the next color of the default color cycle.

We can use the ```alpha``` parameter to play with our marker transparency, which is useful when plotting a lot of overlapping points.
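The loop idea can be sketched as follows, assuming two hypothetical overlapping clusters stored as lists of arrays (the names ```xs``` and ```ys``` are illustrative only).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Hypothetical data: two overlapping clusters stored as lists of arrays
xs = [rng.normal(0, 1, 200), rng.normal(1.5, 1, 200)]
ys = [rng.normal(0, 1, 200), rng.normal(1.5, 1, 200)]

collections = []
for cx, cy in zip(xs, ys):
    # One scatter call per dataset: each call takes the next color of the cycle.
    # alpha=0.5 makes the markers half transparent so overlaps stay readable.
    collections.append(plt.scatter(cx, cy, alpha=0.5))

plt.show()
```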

In [None]:
## Plot the three datasets as distinct scatter plots


The default marker is ```o``` but there exist many more options.

In [None]:
filled_markers = ['o', 'v', '^', '<', '>', '8', 's', 'p', '*', 'h', 'H', 'D', 'd', 'P', 'X']

To change the marker style of a scatter plot, we must specify the ```marker``` argument. We can also specify the marker size with the ```s``` argument.

In [None]:
x_marker = range(len(filled_markers))
y_marker = x_marker

## Plot each data point with a different marker style and a larger size (s)
for i in x_marker:
    plt.scatter(x_marker[i], y_marker[i], marker=filled_markers[i], s=100)

plt.show()

### Histogram
Resource: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html

We can plot the histogram of $x$ using the ```hist``` function.

Let's generate a dataset of five samples, each drawn from a different Gaussian distribution.

In [None]:
M = 250

mu = [0, 4, 6, 10, 15]
sigma = [4, 2, 1, 5, 3]

norm_dist = [np.random.normal(mu[i], sigma[i], M) for i in range(len(mu))]

Multiple datasets can be provided via ```x```, either as a list of datasets of potentially different lengths or as a 2-D ndarray in which each column is a dataset.

The default ```histtype``` is set to ```bar```. 
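As a small sketch of passing several datasets at once, the example below uses two hypothetical samples of different lengths and inspects what ```hist``` returns. The sample sizes and parameters are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Hypothetical data: two samples of different lengths
samples = [rng.normal(0, 1, 300), rng.normal(3, 2, 500)]

# counts holds one array per dataset; edges is shared by all datasets
counts, edges, patches = plt.hist(samples)

print("Datasets:", len(counts), "| bin edges:", len(edges))

plt.show()
```

With the default ```bar``` type, the bars of the different datasets are drawn side by side inside each shared bin.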

In [None]:
## Plot the histograms of the various distributions


In our case, we would like to see the distributions, and the ```bar``` representation might not be appropriate. We can try to plot individual histograms using a ```for``` loop.

In [None]:
## Plot the histograms of the various distribution


The result is closer to what we expect, but it is not quite there yet: we want the various histograms to share the same bins. To do so, we can pass all the datasets to a single ```hist``` call and set ```histtype``` to either ```barstacked``` or ```stepfilled```.

We can also ask for the density distribution rather than the number of occurrences by setting the ```density``` argument to ```True``` (it is ```False``` by default).
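To see what ```density=True``` changes, the sketch below checks on a single hypothetical sample that the bar heights multiplied by the bin widths sum to 1. The sample and the number of bins are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
values = rng.normal(10, 3, 1000)  # hypothetical sample

# With density=True, hist returns densities instead of counts
density, edges, _ = plt.hist(values, bins=30, density=True, histtype='stepfilled')

# Bar heights times bin widths give the total area under the histogram
area = np.sum(density * np.diff(edges))
print("Area under the histogram:", area)

plt.show()
```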

In [None]:
## Plot the histograms of the distributions using barstacked or stepfilled
plt.hist(norm_dist, histtype='stepfilled', density=True, alpha=0.7)
plt.show()

It can also be useful to set the number of bins with the ```bins``` argument.

In [None]:
## Change the number of bins to 50
plt.hist(norm_dist, bins=50, histtype='stepfilled', density=True, alpha=0.7)
plt.show()

## Plot Attributes

### Axis labels and plot title

We can add a title and axis labels to our plot by using the ```title```, ```xlabel``` and ```ylabel``` functions. They all take a string as input, plus optional arguments (e.g., font size and color).

We can include LaTeX notation with ```$...$``` and insert variable values with ```%d```, ```%s``` and ```%f```.
* ```'$K=3$, $n_{k}=150$'``` -> $K=3$, $n_{k}=150$
* ```'There are %d points' % (len(x_scatter[0]))``` -> There are 150 points
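The sketch below puts these pieces together on hypothetical data (the heights and weights are made up for illustration). It combines ```$...$``` math, ```%d``` formatting and the optional ```fontsize``` and ```color``` arguments.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
heights = rng.normal(170, 8, 120)   # hypothetical measurements
weights = rng.normal(65, 10, 120)

plt.scatter(heights, weights)

# %d is replaced by the integer on the right; $...$ is rendered as math
title = plt.title('Sample of $n=$%d points' % len(heights), fontsize=14)
xlab = plt.xlabel('Height (cm)', color='gray')
ylab = plt.ylabel('Weight (kg)', color='gray')

plt.show()
```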

In [None]:
for i in range(len(x_scatter)) :
    plt.scatter(x_scatter[i], y_scatter[i], marker=filled_markers[i*3], alpha=0.7)

## Add a title to our plot


## Add axis labels to our plot


plt.show()

### Plot style and colors

It is possible to use predefined plot styles. Matplotlib has a lot of options to choose from!

In [None]:
print(plt.style.available)

To select a style, we use the ```style.use``` function. We can also define other attributes, such as the font size, with the ```rc``` function. The style and these attributes will be applied to all new plots.

Resource: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.rc.html
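As a rough sketch of how these settings behave, the example below selects the ```ggplot``` style (one of the names listed in ```plt.style.available```), changes two attributes with ```rc```, and reads them back from ```mpl.rcParams```. The final call to ```plt.rcdefaults()``` is an assumption about the workflow: it is only needed if we want to return to Matplotlib's built-in defaults afterwards.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# Select a predefined style, then override individual attributes
plt.style.use('ggplot')
plt.rc('font', size=15)
plt.rc('lines', linewidth=3)

# Both calls update the global mpl.rcParams dictionary
font_size = mpl.rcParams['font.size']
line_width = mpl.rcParams['lines.linewidth']
print("font.size:", font_size, "| lines.linewidth:", line_width)

# Every plot created from now on uses the new settings
plt.plot([0, 1, 2], [0, 1, 4])
plt.show()

# Assumption: restore Matplotlib's built-in defaults once we are done
plt.rcdefaults()
```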

In [None]:
## Note: since Matplotlib 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
plt.rc('font', size=15)

Let's see what our scatter plot looks like with this new style!

In [None]:
for i in range(len(x_scatter)) :
    plt.scatter(x_scatter[i], y_scatter[i], marker=filled_markers[i*3], alpha=0.7)
    
plt.title('Scatter Plot ($K=3$, $n_{k}=$%s)' % (len(x_scatter[0])))
plt.xlabel('First Component')
plt.ylabel('Second Component')

plt.show()

Commands which take color arguments can use several formats to specify the colors.

Each style has its own color cycle: these are the colors used by default when we do not specify a color. If we plot more elements than there are colors in the cycle, colors will repeat in our plot.

We can access the colors of the cycle with ```C``` followed by a digit: ```C0``` is the first color of our cycle. 

Resource: https://matplotlib.org/2.0.2/api/colors_api.html
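To look at the cycle itself, the sketch below reads it from ```mpl.rcParams``` and resolves two ```C``` names to hex codes. It assumes a recent Matplotlib 3.x release, where an index past the end of the cycle wraps around to the beginning; older releases may only accept ```C0``` to ```C9```.

```python
import matplotlib as mpl
from matplotlib.colors import to_hex

# The active color cycle is stored in the rcParams of the current style
cycle = mpl.rcParams['axes.prop_cycle'].by_key()['color']
print("Number of colors in the cycle:", len(cycle))

# 'C0' resolves to the first color of the cycle
first = to_hex('C0')

# Assumption (Matplotlib 3.x): an index equal to the cycle length wraps to 'C0',
# which is why colors repeat when we plot more elements than the cycle holds
wrapped = to_hex('C%d' % len(cycle))

print("C0:", first, "| wrapped:", wrapped)
```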

In [None]:
for i in range(20) :
    plt.scatter(i, i, s=50)
    #plt.scatter(i, i, s=50, color='C%d' % (i))
    
plt.show()

We can also use the basic built-in colors and refer to them either with a single letter or their complete name: 
* ```b```: blue
* ```g```: green
* ```r```: red
* ```c```: cyan
* ```m```: magenta
* ```y```: yellow
* ```k```: black
* ```w```: white

In [None]:
cl = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'w']

for i in range(len(cl)) :
    plt.scatter(i, i, s=50, color=cl[i])
    
plt.show()

We can also explicitly define the color with a named color from the available list, or with a hex color code.

Resource: https://matplotlib.org/2.0.0/_images/named_colors.png

In [None]:
cl = ['paleturquoise',  'mediumorchid', 'goldenrod', 'antiquewhite', '#ff800d']

for i in range(len(cl)) :
    plt.scatter(i, i, s=50, color=cl[i])
    
plt.show()

### Adding horizontal and vertical lines

Horizontal and vertical lines can be very informative! They can illustrate mean and median, the bounds of a confidence interval, a threshold, a reference value, etc.

They are very simple to plot :D 
* ```plt.axhline(y, **kwargs)```
* ```plt.axvline(x, **kwargs)```

The arguments are similar to the ones used when drawing a line plot: ```ls```, ```lw```, ```color```, etc.

Resource (horizontal line): https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.axhline.html
Resource (vertical line): https://matplotlib.org/api/_as_gen/matplotlib.pyplot.axvline.html 
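Here is a minimal sketch of both functions on hypothetical data: a set of made-up test scores, their mean, and an assumed pass mark of 60. None of these values come from the tutorial's datasets.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
scores = rng.normal(50, 10, 200)   # hypothetical test scores
threshold = 60                     # hypothetical pass mark

plt.plot(scores, 'o', alpha=0.5)

# Horizontal lines take a y value and span the full width of the axes
mean_line = plt.axhline(np.mean(scores), ls='--', color='k')
pass_line = plt.axhline(threshold, ls=':', lw=2, color='r')

# Vertical lines take an x value; here we mark the middle of the sample
mid_line = plt.axvline(len(scores) / 2, color='gray')

plt.show()
```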

In [None]:
for i in range(len(x_scatter)) :
    plt.scatter(x_scatter[i], y_scatter[i], marker=filled_markers[i*3], alpha=0.7, color='C%d' % (i))
    
    ## Plot the x-mean
    
    
    ## Plot the y-mean
    
    
plt.title('Scatter Plot ($K=3$, $n_{k}=$%s)' % (len(x_scatter[0])))
plt.xlabel('First Component')
plt.ylabel('Second Component')

plt.show()

Here's an example with histograms:

In [None]:
plt.hist(norm_dist, bins=50, histtype='stepfilled', density=True, alpha=0.7)

for i in range(len(norm_dist)) :
    plt.axvline(np.mean(norm_dist[i]), color='C%d' % (i))
    
plt.title('Normal Distributions\n($N=$%d, $n_{k}=$%d)' % (len(norm_dist), len(norm_dist[0])))
plt.xlabel('Value')
plt.ylabel('Density')
    
plt.show()

### Labels and Legend
Resource: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html
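As a short sketch of the idea, the example below attaches a ```label``` to each element and then calls ```legend```. The curves are illustrative, and the ```loc``` and ```bbox_to_anchor``` values are one possible choice for placing the legend outside the axes, not the only one.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# label attaches a name to each artist; legend collects the labeled artists
plt.plot(x, np.sin(x), label='sin')
plt.plot(x, np.cos(x), label='cos')
plt.axhline(0, color='k', lw=1, label='zero')

# bbox_to_anchor pins the legend's upper-left corner just right of the axes
leg = plt.legend(loc='upper left', bbox_to_anchor=(1.02, 1))

# tight_layout shrinks the axes so the outside legend stays within the figure
plt.tight_layout()
plt.show()
```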

In [None]:
## Define the labels


## Add the labels
plt.hist(norm_dist, bins=50, histtype='stepfilled', density=True, alpha=0.7)

for i in range(len(norm_dist)) :
    plt.axvline(np.mean(norm_dist[i]), color='C%d' % (i))
    
plt.title('Normal Distributions\n($N=$%d, $n_{k}=$%d)' % (len(norm_dist), len(norm_dist[0])))
plt.xlabel('Value')
plt.ylabel('Density')

## Show the legend and place it outside

    
plt.show()

## Figure and Subplots
Resource: https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html

```plt.subplots(nrows, ncols, figsize, sharex, sharey)```
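Before working on the larger grid, a smaller sketch may help to see how the returned objects are used. It assumes two hypothetical samples and a 2 x 2 grid; ```fig``` is the figure and ```ax``` is a 2-D array of axes indexed as ```ax[row][col]```.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
data = rng.normal(0, 1, (2, 300))   # hypothetical: two samples of 300 values

# 2 rows x 2 columns; axes in the same column share their x-axis
fig, ax = plt.subplots(2, 2, figsize=(8, 6), sharex='col')

for r in range(2):
    # Left column: the raw values; right column: their histogram
    ax[r][0].plot(data[r], lw=0.5)
    ax[r][0].set_ylabel('Sample %d' % r)
    ax[r][1].hist(data[r], bins=20)

ax[1][0].set_xlabel('Index')
ax[1][1].set_xlabel('Value')

plt.tight_layout()
plt.show()
```

On the axes objects, the functions seen so far keep their names (```plot```, ```hist```, ```axvline```), while the label and title functions gain a ```set_``` prefix (```set_xlabel```, ```set_ylabel```, ```set_title```).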

In [None]:
N = len(x_scatter)
fig, ax = plt.subplots(len(x_scatter), len(y_scatter), figsize=(12, 9), sharex='row', sharey='row')

### First row: scatter plot of FC vs. SC

    
### Second row: plot distribution of first component w/ mean

### Third row: plot distribution of second component w/ mean


plt.tight_layout()
plt.show()

Here's an example if we plot the datasets column-wise rather than row-wise. This visualization can be very helpful when comparing various datasets.

In [None]:
N = len(x_scatter)
fig, ax = plt.subplots(len(x_scatter), len(y_scatter), figsize=(12, 9), sharex='col', sharey='col')

### First col: scatter plot of FC vs. SC
c = 0

for r in range(N) :
    ax[r][c].scatter(x_scatter[r], y_scatter[r], color='C%d' % (r))
    ax[r][c].set_xlabel('First Component')
    ax[r][c].set_ylabel('Second Component')
    ax[r][c].set_title('Dataset %s' % r)
    
### Second col: plot distribution of first component w/ mean
c = 1

for r in range(N) :
    ax[r][c].hist(x_scatter[r], bins=20, histtype='stepfilled', density=True, color='C%d' % (r))
    ax[r][c].axvline(np.mean(x_scatter[r]), ls='-', color='k')
    ax[r][c].set_xlabel('First Component')
    ax[r][c].set_ylabel('Density')
    
### Third col: plot distribution of second component w/ mean
c = 2

for r in range(N) :
    ax[r][c].hist(y_scatter[r], bins=20, histtype='stepfilled', density=True, color='C%d' % (r))
    ax[r][c].axvline(np.mean(y_scatter[r]), ls='-', color='k')
    ax[r][c].set_xlabel('Second Component')
    ax[r][c].set_ylabel('Density')

plt.tight_layout()
plt.show()

## Plotting the Iris Dataset (example)

In [None]:
## Import the scikit-learn library
from sklearn import datasets
from sklearn.decomposition import PCA

In [None]:
## Import the iris dataset
iris = datasets.load_iris()
names = iris.target_names
feature = iris.feature_names
target = iris.target

In [None]:
X_reduced = PCA(n_components=len(feature)).fit_transform(iris.data)

In [None]:
fig, ax = plt.subplots(1, 6, figsize=(24, 4))

## Every pair of principal components (0-based column indices of X_reduced)
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

for k, (a, b) in enumerate(pairs):
    ## One scatter call per species, so the legend gets one entry per species
    for tg, name in enumerate(names):
        mask = target == tg
        ax[k].scatter(X_reduced[mask, a], X_reduced[mask, b], color='C%d' % tg, label=name)

    ax[k].set_xlabel('PC%d' % (a + 1))
    ax[k].set_ylabel('PC%d' % (b + 1))

ax[0].legend()

plt.tight_layout()
plt.show()