# Module 2 - describing and representing tabulated data 
-------------------------------------------------------------------------------------------------------

## Table of Contents <a id='toc'></a>

1. **[A quick intro to matplotlib](#plt0)**


2. **[Basic representation - plotting one column](#plt1)**

   2.1 [the displot function](#21)

   2.2 [seaborn "axes-level" functions](#22)

   [Exercise 1 - histograms](#ex1)  
   <br>

3. **[Accounting for categories in the data](#descr.3)**  

   [Exercise 2 - representing categories](#ex2)  
   <br>
   
4. **[Representing the relationship between 2 numerical variables](#plot2)**
   <br>

5. **[Writing plots to disk](#writing)**
   <br>
   
6. **[Free form exercise](#exo)**
   <br>

**Supplementary Material:**  
   * [Annex - seaborn themes + python graph gallery](#gallery)


<br>

# Introduction

Before we begin this chapter in earnest, let's import some libraries and read the data:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

print("All modules loaded successfully!")

> Note: if you are missing some of the above modules, you should install them.
>
>    * Installation with **pip**: `pip install --user pandas matplotlib seaborn numpy` 
>    * Installation with **conda**: `conda install -c conda-forge pandas matplotlib seaborn numpy`

<br>
<br>

The configuration cell further below (the one starting with `%matplotlib inline`) only makes the plots easier to read when presenting online: you do **not need to run that cell**.

There are a number of plotting libraries in Python. Here we will mostly be using **matplotlib** and **seaborn**, two very popular ones.

* **[matplotlib](https://matplotlib.org)** can be seen as a base library, which defines a lot of
  low-level plotting functions.
* **[seaborn](https://seaborn.pydata.org)** is built on top of matplotlib, and provides more high-level
  functions. It's designed to interface very well with DataFrames.

We will not delve into all the details of these two libraries, but will instead try to convey their general philosophies and look at their most common functions and arguments. This should give you a head start to get the most out of their very well-made tutorials and galleries:
 * [matplotlib](https://matplotlib.org) -
   [tutorial](https://matplotlib.org/stable/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py) -
   [gallery](https://matplotlib.org/stable/gallery/index.html)
 * [seaborn](https://seaborn.pydata.org) -
   [tutorial](https://seaborn.pydata.org/tutorial.html) -
   [gallery](https://seaborn.pydata.org/examples/index.html)

<br>

> Note: `pandas` also offers plotting functions, which are themselves built on `matplotlib`.
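
As a small illustration of the note above, here is a minimal sketch of a pandas plotting call. The `toy` table and its `fare` column are made up for this example; they are not the course data.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({"fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46]})

# The pandas .plot accessor draws with matplotlib and returns a matplotlib Axes,
# so the result can be customized with the usual Axes methods.
ax = toy["fare"].plot.hist(bins=3)
ax.set_xlabel("fare")
```

Because the returned object is a regular matplotlib `Axes`, everything shown in the matplotlib introduction below should also apply to plots created through pandas.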

<br>


In [None]:
## This cell configures the default size of plots and font size
## This is useful for me when presenting over zoom
%matplotlib inline

plt.rcParams['figure.figsize'] = 10, 10
plt.rc("font", size=20)

In [None]:
df = pd.read_csv("data/titanic.csv")
df.head()

In [None]:
## checking for NAs
df.isna().sum()

<br>
<br>


[Back to ToC](#toc)

# 1. A quick intro to matplotlib <a id="plt0"></a>
-----------------------------------------------------------

As seaborn is built on top of matplotlib, and as it is possible (and common) to mix matplotlib and seaborn elements in plots,
we will quickly review some basics of matplotlib here, in particular the ones which can help us complement seaborn.


**A first plot:** `plt.plot( ... )` is the base function to create line and scatter plots

In [None]:
x = [1,2,3,5]
y = [1,5,3,8]

# Passing 2 arguments to plt.plot().
plt.plot(x, y)
plt.show()

If we want a scatter plot, we can use the `plt.scatter()` function or tweak the options of `plt.plot()`:

In [None]:
plt.plot(x, y, marker = 'o' , linewidth = 0)
plt.show()

Matplotlib can be used in two different modes:
* **Procedural** (also referred to as **state-based** or **MATLAB-like**): **this is the mode we have been using**. In this mode, successive calls
  to `pyplot` functions are made, and the state of the plot (or figure) is preserved between calls. This mode
  was developed with the intention to resemble MATLAB usage, and is also similar to basic plotting in R.


* **Object-oriented:** in this usage mode, a figure consists of a **figure** object that can contain one or
  more **axes**. An **axes** object represents one plot inside a figure - not just an axis, as its name would
  suggest! Elements such as data content, legends or axis labels are all drawn onto **axes** objects.

Importantly, both the procedural and object-oriented approaches can be used to achieve the same results. So the choice is really up to your personal preference. Both approaches will be illustrated in this course module.
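
To make the comparison concrete, here is a minimal sketch drawing the same line with both modes. It reuses the `x` and `y` values from above; `plt.subplots()` is explained in more detail just after.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 5]
y = [1, 5, 3, 8]

# Procedural style: pyplot keeps track of the "current" figure and axes.
plt.figure()
plt.plot(x, y)
plt.title("procedural")
ax_proc = plt.gca()  # the Axes that pyplot has been drawing on

# Object-oriented style: we hold the Figure and Axes objects explicitly.
fig, ax_oo = plt.subplots()
ax_oo.plot(x, y)
ax_oo.set_title("object-oriented")
```

Both figures contain the same line; only the way we address the plot differs.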

---

The object-oriented mode is quite useful for the interaction with seaborn, so we will focus on that part:

In [None]:
## plt.subplots() creates the Figure and Axes objects
fig, ax = plt.subplots()

ax.plot( x , y )

As its name suggests, `plt.subplots` can be used to create multiple subfigures (i.e., multiple plots):

In [None]:
## plt.subplots() creates the Figure and an array of Axes objects
fig, ax = plt.subplots(2,1)

ax[0].plot( x , y )
ax[1].plot( y , x , color = 'pink' )

It can also be used to change the plot size:

In [None]:
fig, ax = plt.subplots( figsize = (12,3) )

ax.plot( x , y )
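
When `plt.subplots` is given more than one row and more than one column, the returned axes come as a 2D array. The sketch below, using the same `x` and `y` values as above, shows the two usual ways to address the panels.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 5]
y = [1, 5, 3, 8]

# With 2 rows and 2 columns, `axes` is a 2D numpy array of Axes objects.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot(x, y)                # top-left panel
axes[1, 1].plot(y, x, color="pink")  # bottom-right panel

# `axes.flat` iterates over all panels, row by row.
for i, ax in enumerate(axes.flat):
    ax.set_title(f"panel {i}")
```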

We can call `.plot()` or other plotting functions multiple times to add several elements to the plot.

Adding a `label` argument and then calling `ax.legend()` adds a legend:

In [None]:
fig, ax = plt.subplots()

ax.plot( x , y , label = "line 1")
ax.plot( x , [2,1,5,5] , label = "line 2")
ax.legend()

We can add labels and title:

In [None]:
fig, ax = plt.subplots()

ax.plot( x , y )
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("title")

Also very useful, `ax.axhline()` and `ax.axvline()` add a horizontal or a vertical line:

In [None]:
fig, ax = plt.subplots()

ax.plot( x , y , label = "line 1")
ax.axhline( y = 3 , color = 'grey' , linestyle='--' ) 
ax.axvline( x = 4.5 , color = 'green' , linestyle=':' ) 

Also nice, `ax.grid()` adds a grid:

In [None]:
fig, ax = plt.subplots()

ax.plot( x , y , label = "line 1")
ax.grid()

That's it for this whirlwind tour of the basics of matplotlib we will need here.

It would be possible to craft any of the visualizations we are going to see in this notebook using only matplotlib functions, but as it is a **low-level** library, this would take a lot of code.

Instead, from here on we will see how `seaborn` lets us create useful plots in a couple of lines.


<br>
<br>


[Back to ToC](#toc)

# 2. Basic representation - plotting one column <a id='plt1'></a>
-----------------------------------------------------------

## 2.1 the displot function <a id='21'></a>

The simplest way to represent the distribution of a set of values using **seaborn** is to use its **`sns.displot()`** function (`displot` stands for "distribution plot").

**Example:** let's plot the distribution of passenger fares from the Titanic dataset.

In [None]:
sns.__version__

In [None]:
import seaborn as sns

sns.displot(df.Fare)

OK, that is nice, but tiny... Let's make it larger:

In [None]:
sns.displot(df.Fare, height=7)

Above, you should now see a histogram (in blue):
* The *x-axis* corresponds to the fare paid by each passenger.
* The *y-axis* corresponds to the **count**: the number of values falling in a given *bin*
  (i.e. a bar of the histogram).

By using the **`kind`** argument, one can change the type of plot:
* `kind="hist"`: histogram (this is the default).
* `kind="kde"`: density line (kernel density estimation).
* `kind="ecdf"`: empirical cumulative distribution function.

<br>

**Example:**

In [None]:
# Note: kde stands for 'kernel density estimation', which is the method used to compute the density line.
sns.displot(df.Fare, kind="kde", height=7)

> *Note:* the y-axis is now in **density** units.

Naturally, **`sns.displot()`** has many options that let you combine plot types and pimp your plot.  
For instance, using `kde=True` will superimpose a density line:

In [None]:
sns.displot(df.Age, kde=True, color="teal", height=7)

# Note: some attributes of the KDE line can be changed via the `line_kws` argument, but not its color.
sns.displot(df.Age, kde=True, color="darkorange", height=7, line_kws={"ls": "--"})

> *Note:* having both the histogram and the density line switches the y-axis from density to count,
  which is sometimes more interpretable.

`sns.displot()` is what we could call a **figure-level** function:
 * It offers several *kinds* of representation.
 * It does not play nicely in **multi-panel** figures.

<br>
<br>

<br>

<div class="alert alert-block alert-success">

### Micro-Exercise 1

1. run the cell below to load some new data (about the 1880 Swiss census)
2. use seaborn's `displot` function to represent the `Reformed` column (which contains the fraction of the population registered as being of Reformed Christian faith in each town)
3. change the color to something you like
 


</div>


In [None]:
# 1. Load some new data (about the 1880 Swiss census).
df_census = pd.read_csv("data/census1880_fractions.csv", index_col=0)
df_census.head()

In [None]:
# 2. use seaborn's displot function to represent the `Reformed` column (which contains the fraction of the population registered as being of Reformed Christian faith in each town)
# 3. change the color to something you like


[Back to ToC](#toc)

## 2.2 seaborn "axes-level" functions <a id='22'></a>

<br>


The different values of the **`kind`** argument (`hist`, `kde`, `ecdf`) of `sns.displot()` can all be called individually using their own functions:
* **`histplot()`**: histogram.
* **`kdeplot()`**: kernel density estimation line.
* **`ecdfplot()`**: cumulative distribution function.

In contrast to `sns.displot()`, which is **figure-level**, these more specialized functions are **axes-level**: they play nicely in a multi-panel context.  
They interface with `matplotlib`'s axes using the **`ax`** argument.

**Example:** plotting the passenger age and fare from the Titanic dataset.

In [None]:
# Creating a figure with multiple panels: 1 row, 2 columns.
fig, axes = plt.subplots(1, 2 , figsize = (12,6))

# Plotting on the first axes == left panel.
sns.histplot(df.Age, ax=axes[0])

# Plotting on the second axes == right panel.
sns.kdeplot(df.Fare, ax=axes[1])

plt.tight_layout()
plt.show()

Alternatively, the seaborn function returns an `Axes` object, which we can use for further customization:

In [None]:
ax = sns.kdeplot(df.Fare)
ax.set_xlabel("Passenger fare (£)")


---
<br>

In a **histogram**, different numbers of *bins* reveal different aspects of the distribution of your data, so it is important to choose one that is appropriate to your sample size and data.  
By default, seaborn automatically infers the number of *bins*. You may stick with this default or change it.

This can be done with several arguments:
 * `bins`: number of bins
 * `binwidth`: bin width
 * `binrange`: Lowest and highest value for bin edges



In [None]:
# Set up a figure with multiple panels, here a 2 by 2 grid.
f, axes = plt.subplots(2, 2, figsize=(14, 7))

# Plot a simple histogram with bin size determined automatically.
sns.histplot(df.Fare,                    ax=axes[0,0]).set_title("automatic")
sns.histplot(df.Fare, bins=5           , ax=axes[0,1]).set_title("bins")
sns.histplot(df.Fare, binwidth=25      , ax=axes[1,0]).set_title("binwidth")
sns.histplot(df.Fare, binrange=[0,100] , ax=axes[1,1]).set_title("binrange")

plt.tight_layout()  # This makes the panels margins and setup more graceful.

<br>

To recap, **when to use `sns.histplot` or `sns.displot`**: 
 * `histplot`: gives a lower-level kind of plot which is easier to manipulate in multi-panel figures.
 * `displot`: high-level function, with a lot of capabilities, but does not play nicely when it has to be embedded in a larger figure.
     


<br>

We can combine graphical elements coming from `matplotlib` and `seaborn` quite seamlessly.

Here is a more advanced example where we represent the **median** and the **10th and 90th percentiles** of a distribution on top of a histogram.

In [None]:
deciles = df.Age.quantile([0.10,0.90])
median  = df.Age.median()

ax = sns.histplot(df.Age, kde=True)  # Histogram with a superimposed density line.

ax.axvline(median, color="red", linestyle="-", label="Median")
ax.axvline(deciles[0.1], color="grey", linestyle="--", label="10% and 90% percentiles")
ax.axvline(deciles[0.9], color="grey", linestyle="--")
ax.legend()

Of note here :
 * `axvline`: creates a vertical line.
 * `label`: legend label associated with the element created by this function.
   Common to a lot of matplotlib plotting functions.
 * `ax.legend()`: makes the legend appear.
 

<br>
<br>

<div class="alert alert-block alert-success">

## Exercise 2.1 - histograms <a id='ex1'></a>

Exercises are located in the dedicated notebook `exercises_course1.ipynb`.

</div>



<br>
<br>

[Back to ToC](#toc)

# 3. Accounting for categories in the data <a id='descr.3'></a>


Here the categories would be the passenger's gender, but depending on the dataset they could be a genotype (wild-type versus KO), sex, experimental condition, ...



In **`seaborn`** we account for categories with arguments which control a graphical aspect:
 * argument **`hue`** controls **colors**
 * argument **`style`** controls **plotting symbol**

They work in a similar fashion, and `hue` is used more frequently, so we will focus on that one.

* **`hue`** is quite ubiquitous in seaborn functions, and lets you determine a **categorical column** by
  which you want to split your representation.
* Note how we switched our syntax too: we now use **column names**, and use the **`data=`** argument to
  specify from which `DataFrame` the data comes.


In [None]:
sns.displot(x="Age", hue="Sex", data=df, kind="kde", height=7)

<br>

The default seaborn scheme is fairly nice, but maybe you want to change it:
> Note: use the argument **`linewidth=`** to get wider/narrower lines.


In [None]:
f, axes = plt.subplots(2, 2, figsize=(14, 7))

# Default
sns.kdeplot(x="Age", hue="Sex", linewidth=3, data=df , ax=axes[0][0])

# Using an existing seaborn palette
sns.set_palette(sns.color_palette("Set2"))
sns.kdeplot(x="Age", hue="Sex", data=df, linewidth=3, ax=axes[0][1])


# Setting manually, with hex values
sns.kdeplot(
    x="Age", hue="Sex", data=df,
    linewidth=3, ax=axes[1][0], 
    palette = ["#FFCC04","#1F8AB3"]
)

# Setting semi-manually, using the xkcd palette https://xkcd.com/color/rgb
sns.kdeplot(
    x="Age", hue="Sex", data=df,
    linewidth=3, ax=axes[1][1], 
    palette=["xkcd:lavender", "xkcd:dark mint"]
)

plt.tight_layout()

We can also assign a color to each category explicitly by passing a `dict` to `palette`:

In [None]:
# Mapping each category to a color with a dict, using the xkcd palette https://xkcd.com/color/rgb
sns.kdeplot(
    x="Age", hue="Sex", data=df,
    linewidth=3, 
    palette={'female':"xkcd:lavender", 'male':"xkcd:aqua"}
)

You can learn more about all the options here:
 * [matplotlib colors](https://matplotlib.org/stable/tutorials/colors/colors.html)
 * [seaborn colors](https://seaborn.pydata.org/tutorial/color_palettes.html)
 

<br>

Sometimes, the `displot` options don't really cut it, perhaps because you have many different categories, or maybe because you want to have more than one column to do the categorization.

This is when **`sns.catplot()`**, another **figure-level** function with a lot of `kind` options, saves the day!


In [None]:
sns.catplot(
    x="Fare", y="Pclass",
    orient="horizontal",
    data=df,
    aspect=2.0, height=7
)

Like `sns.displot`, `sns.catplot` is a master function which is designed to create a full figure at once rather than be embedded in a larger figure. 

This is why its arguments are a bit different when it comes to setting the figure *height and width*, which is done using :
 * `height` : height of the figure (no trick there)
 * `aspect` : width/height ratio of the figure ( high aspect -> wider figure )
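
The relation between `height`, `aspect` and the final figure size can be checked directly, as in the sketch below. The `toy` table is made up for this example, and the check assumes that no legend is drawn (a legend placed outside the plot may widen the figure).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy table: a numerical column and a categorical one.
rng = np.random.default_rng(4)
toy = pd.DataFrame({
    "fare": rng.lognormal(3, 1, 90),
    "pclass": np.repeat(["1", "2", "3"], 30),
})

g = sns.catplot(data=toy, x="fare", y="pclass", height=4, aspect=2.0)

# Without a legend, the figure is `height * aspect` inches wide and `height` inches tall.
width, height = g.figure.get_size_inches()
print(width, height)
```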

The **`kind`** argument lets you control the overall look of the plot.
It can be:
* 'strip' : this is the default, showing all data points. 
* **'box'** : the famous boxplot.
* **'violin'** : an alternative to the boxplot using density lines.
* **'bar'** : the (in)famous barplot.
* 'swarm' : similar to 'strip' but with another way of arranging the points.
* 'boxen' : some intermediary between a boxplot and a violin plot.
* 'point' : alternative to the barplot where only the top point is shown.


In [None]:
kinds = ["box", "violin", "bar", "boxen", "strip", "point"]  #,"swarm"] # swarm takes a long time to compute

for k in kinds:
    sns.catplot(x="Fare", y="Pclass", orient="horizontal", data=df, kind=k, aspect=5, height=2)


So here we can see:
 * **boxplot:** represents quartiles and a few other things.
 * **violinplot:** density line plot with an (optional) boxplot in the center.
 * **barplot:** just shows the mean + an error bar (by default 95% CI computed using bootstrapping).
 * **boxen:** the unholy child of a boxplot and a violinplot.
 * **strip:** the default, shows the points (super nice, except when you have a lot of points).
 * **point:** basically a barplot without bars and a line between categories.
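
To give an idea of what "95% CI computed using bootstrapping" means for the barplot, here is a rough numpy sketch of a percentile bootstrap on a randomly generated sample. It illustrates the general technique only; seaborn's own implementation may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(5)
fares = rng.lognormal(mean=3, sigma=1, size=100)  # hypothetical sample

# Resample the data with replacement many times and record each resample's mean.
n_boot = 1000
boot_means = np.array([
    rng.choice(fares, size=fares.size, replace=True).mean()
    for _ in range(n_boot)
])

# The central 95% of those means gives the percentile bootstrap confidence interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {fares.mean():.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

The error bar therefore describes the uncertainty on the mean, not the spread of the individual data points, which is one reason why a barplot hides much of the distribution.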

Given all this, one can reasonably ask the question : **which one is the *best* ?**

While the answer is not clear about the best way to represent a distribution over one or multiple categories, we definitely have some [good answers regarding bad/misleading ways](https://stekhoven.shinyapps.io/barplotNonsense/) (hint, barplot does not shine).  

<br>

With `sns.catplot()`, the **`hue`** argument can also be used to great effect:

In [None]:
sns.catplot(
    x="Fare", y="Pclass",
    hue="Sex",
    orient="horizontal",
    kind="box",
    data=df,
    aspect=2.0, height=4
)
# It is also very nice with kind="point".

<br>
<br>

<div class="alert alert-block alert-success">

## Exercise 2.2 - Representing categories <a id='ex2'></a>

Exercises are located in the dedicated notebook `exercises_course1.ipynb`.

</div>



<br>
<br>
<br>

[Back to ToC](#toc)

# 4. Representing the relationship between 2 numerical variables <a id="plot2" ></a>
-----------------------------------------------------------------------------------------------

**`sns.scatterplot()`** lets us represent the relationship between two numerical variables.

In [None]:
df = pd.read_table("data/titanic.csv", sep=",")  # Just in case we need to re-read the data.

In [None]:
# Restore matplotlib's original settings, undoing seaborn changes such as set_palette().
sns.reset_orig()

In [None]:
fig = plt.figure(figsize=(14, 6))
ax = sns.scatterplot(x="Fare", y="Age", data=df)
plt.show()

<br>

OK, it is not so nice... let's set the x-axis to a log scale:

In [None]:
fig = plt.figure(figsize=(14, 6))
ax = sns.scatterplot(x="Fare", y="Age", data=df)
ax.set(xscale="log")      # Setting axis to log scale.
ax.set_xlim((3, 1000))    # Manually setting the limit of the x axis.
plt.show()

Of course, we can customize:
> Note the argument **`style=`** used to associate a marker shape to a column.

In [None]:
fig = plt.figure(figsize=(15, 10))
ax = sns.scatterplot(
    x="Fare", y="Age", 
    hue="Pclass",
    palette=["xkcd:tomato", "xkcd:teal", "xkcd:mustard"],
    style="Sex",
    data=df,
    s=100
)
ax.set(xscale="log")    # Setting axis to log scale.
ax.set_xlim((3, 1000))  # Manually setting the limit of the x axis.
plt.show()

<br>

An alternative with more colors and a single marker.
This was adapted from [this stackoverflow question](https://stackoverflow.com/questions/62230507/multiple-columns-for-hue-parameter-in-seaborn-violinplot).


In [None]:
fig = plt.figure(figsize=(15, 10))

# We combine the 2 columns to later apply them as categories.
CAT = df[["Pclass", "Sex"]].apply(tuple, axis=1)

ax = sns.scatterplot(
    x="Fare", y="Age", 
    hue=CAT,
    hue_order=sorted(CAT.unique()),           # Control the order of categories in the legend.
    palette=sns.color_palette("Paired")[:6] , # Here the paired palette is great.
    data=df, s=100
)
ax.set(xscale="log")     # Setting axis to log scale.
ax.set_xlim((3, 1000))   # Manually setting the limit of the x axis.
plt.show()

Sometimes a 2D density plot may give you a higher level view of your data:

In [None]:
iris = sns.load_dataset("iris")
# Draw a contour plot to represent each bi-variate density
sns.kdeplot(
    data=iris,
    x="sepal_width",
    y="sepal_length",
    hue="species",
    thresh=.1,
)
plt.show()

In other cases you may prefer to represent lines:

In [None]:
# Load an example dataset with long-form data
fmri = sns.load_dataset("fmri")

# Plot the responses for different events and regions
#   with 95% confidence interval, 
#    other options include ("ci", "pi", "se", or "sd") or something custom
sns.lineplot(x="timepoint", y="signal",
             hue="region", style="event",
             errorbar = ('ci',95) , 
             data=fmri)
plt.show()

Finally, a little data exploration trick : **`sns.pairplot()`**

In [None]:
sns.pairplot(df_census.iloc[:, 3:10])
plt.show()

<br>
<br>
<br>

[Back to ToC](#toc)


# 5. Writing plots to disk <a id='writing'></a>
------------------------------------

To save figures to disk, we rely on the **`savefig()`** method:
* Different output formats can be specified by changing the output file name extension, 
  e.g. `.pdf`, `.svg`, `.jpg`, ...
* For raster formats (e.g. `.png`, `.jpg`), the `dpi` argument (dots per inch) sets the image resolution.
* See `help(plt.savefig)` or [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html?highlight=savefig#matplotlib.pyplot.savefig) for more info.

In [None]:
k = "box"

# Make your plot as usual, assigning it to a variable (my_plot).
represented_variable, category = "Fare", "Pclass"
my_plot = sns.catplot(
    x=represented_variable,
    y=category,
    data=df,
    kind=k, orient="h",
    height=10, aspect=2
)

# Save plot to disk, using the savefig() method:
my_plot.savefig("output.png")

In [None]:
## When using the axes-level functions, you can just call plt.savefig()
sns.histplot(df.Fare, kde=False)

plt.savefig("output_simple.pdf")

In [None]:
# For multi-panel figures:
f, axes = plt.subplots(2, 2, figsize=(7, 7))

sns.histplot(df.Fare, kde=False, color="b", ax=axes[0, 0]).set_title("automatic")
sns.histplot(df.Fare, bins=5   , kde=False, color="b", ax=axes[0, 1]).set_title("5 bins")
sns.histplot(df.Fare, bins=10  , kde=False, color="b", ax=axes[1, 0]).set_title("10 bins")
sns.histplot(df.Fare, bins=1000 , kde=False, color="b", ax=axes[1, 1]).set_title("1000 bins")
plt.tight_layout()  # This makes the panels margins and setup more graceful.

f.savefig("output_multipanel.pdf")
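
To see the effect of `dpi` on a raster output, the sketch below saves the same small figure at two resolutions and reads the pixel dimensions back. The temporary output directory and file names are choices made for this example only.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))  # size in inches
ax.plot([1, 2, 3, 5], [1, 5, 3, 8])

# Hypothetical output location: a temporary directory, to avoid cluttering the project.
out_dir = tempfile.mkdtemp()
shapes = {}
for dpi in (100, 200):
    path = os.path.join(out_dir, f"plot_{dpi}dpi.png")
    fig.savefig(path, dpi=dpi)
    # Read the image back: the first two dimensions are (height, width) in pixels.
    shapes[dpi] = plt.imread(path).shape[:2]

print(shapes)
```

The pixel dimensions are the figure size in inches multiplied by `dpi`, so doubling `dpi` doubles both the width and the height of the image.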

<br>
<br>
<br>

[Back to ToC](#toc)

<div class="alert alert-block alert-success">

## Exercise 2.3 - Free form exercise <a id="exo"></a>

Exercises are located in the dedicated notebook `exercises_course1.ipynb`.

</div>


<br>
<br>
<br>

# Additional Material
------------------------------

## Annex 1 - seaborn themes + python graph gallery <a id="gallery"></a>

Seaborn and matplotlib offer nigh-endless possibilities when it comes to customizing graph elements.
* A simple way to change the general feeling of a figure is to play with
  [seaborn themes](https://seaborn.pydata.org/tutorial/aesthetics.html), or
  [matplotlib styles](https://www.dunderdata.com/blog/view-all-available-matplotlib-styles).

* In general, when composing a figure we often start from an existing example.
  Here are useful resources to do just that:
  * [matplotlib gallery](https://matplotlib.org/stable/gallery/index.html)
  * [seaborn gallery](https://seaborn.pydata.org/examples/index.html)
  * [python-graph-gallery](https://www.python-graph-gallery.com/)
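
As a starting point for playing with themes, here is a minimal sketch of three seaborn calls: `sns.axes_style()`, `sns.set_theme()` and `sns.reset_orig()`. The style names used are two of seaborn's built-in ones; the plotted values are arbitrary.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# `sns.axes_style()` returns the matplotlib parameters that define a seaborn style.
whitegrid = sns.axes_style("whitegrid")
print(whitegrid["axes.grid"])

# Used as a context manager, it applies the style temporarily.
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot([1, 2, 3, 5], [1, 5, 3, 8])

# `sns.set_theme()` applies a theme to all later figures;
# `sns.reset_orig()` restores the matplotlib settings saved when seaborn was imported.
sns.set_theme(style="ticks")
sns.reset_orig()
```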