## 3b. Seaborn

Seaborn is a library for statistical data visualization, built on top of Matplotlib. It provides a higher-level interface for more complex visualizations, and a slightly changed style.

### Table of contents

- Seaborn
  - Relationship Plots
    - Scatter
    - Line
  - Categorical Plots
    - Between categories
    - Distribution
  - Distributions
    - Univariate
    - Bivariate
    - Linear relationships
    - Heatmaps

---

In [None]:
import numpy as np   # we'll use numpy to generate dummy data
import pandas as pd  # we'll use pandas to read and manipulate datasets

import warnings
warnings.simplefilter('ignore', FutureWarning)

import matplotlib.pyplot as plt
# display figures alongside cell output
%matplotlib inline

import matplotlib

**ℹ️ Tip**: the following cell is deliberately separate from the previous one. A small bug prevents it from being executed correctly if both are run at the same time. This is not limited to Jupyter notebooks.

In [None]:
matplotlib.rcParams['figure.dpi'] = 100  # make figures large
%config InlineBackend.figure_format = 'retina'  # make figures crisp

---

In [None]:
import seaborn as sns
sns.set()  # apply Seaborn's style to future charts

### Relationship Plots

To make examples more meaningful, throughout this section, we'll plot actual datasets. One such dataset is the `tips` one, which logs the bills and tips in a restaurant:

In [None]:
tips = sns.load_dataset('tips')
tips.head()

#### Scatter

Similar to Matplotlib's counterpart, but with a slightly changed style:

In [None]:
sns.relplot(data=tips, x='total_bill', y='tip');

Quickly add additional information such as the time of the meal (color), the customer's gender (shape) and the party size (size of marker), from the underlying dataset:

In [None]:
sns.relplot(
    data=tips,
    x='total_bill', 
    y='tip', 
    hue='time',
    style='sex',
    size='size',
);

#### Line

Another example dataset, of continuous measurements (over time):

In [None]:
fmri = sns.load_dataset('fmri')
fmri.head()

In [None]:
sns.relplot(data=fmri, x='timepoint', y='signal');

Aggregating it into a line, with the mean and a 95% confidence interval, is more informative:

In [None]:
sns.relplot(data=fmri, x='timepoint', y='signal', kind='line');
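To make the aggregation concrete, here is a minimal sketch of computing a mean and a 95% interval per timepoint with a percentile bootstrap. The data is synthetic and the helper `bootstrap_ci` is a hypothetical name used only for illustration; Seaborn's own implementation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 20 repeated measurements at each of 5 timepoints
timepoints = np.arange(5)
signal = rng.normal(loc=timepoints[:, None] * 0.1, scale=0.05, size=(5, 20))

def bootstrap_ci(values, n_boot=1000, level=95):
    """Percentile bootstrap interval for the mean of a 1-D array."""
    # Resample the observations with replacement, n_boot times
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    half = (100 - level) / 2
    return np.percentile(boot_means, [half, 100 - half])

means = signal.mean(axis=1)                                   # one mean per timepoint
intervals = np.array([bootstrap_ci(row) for row in signal])  # shape (5, 2): lower, upper
```

The line in the chart corresponds to `means`, and the shaded band to the lower and upper columns of `intervals`.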

Show additional information: the region (color) and event (line style):

In [None]:
sns.relplot(
    kind='line',
    data=fmri,
    x='timepoint',
    y='signal',
    hue='region',
    style='event',
    alpha=.75,
);

### Categorical plots

We'll use the same `tips` dataset for these examples.

#### Between categories

The x axis is categorical, so points are grouped by category and shifted sideways a little so that they do not overlap, while still showing the number of points in each category and total bill segment. This is called a swarmplot:

In [None]:
sns.catplot(x='day', y='total_bill', kind='swarm', data=tips, color='C0');
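The idea of shifting points sideways can be sketched with plain random jitter. This is a simplification: a swarmplot places points deterministically so that none overlap, whereas the sketch below only adds a small random offset (closer to what a strip plot does). The data and the `jitter_width` value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a category index per observation (e.g. 0 = Thur, 1 = Fri, 2 = Sat)
categories = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
values = np.array([10.0, 10.1, 10.2, 15.0, 15.1, 20.0, 20.1, 20.2, 20.3])

# Small random horizontal offset, so nearby values are not drawn on top of each other
jitter_width = 0.2
x_positions = categories + rng.uniform(-jitter_width, jitter_width, size=len(categories))
```

Plotting `values` against `x_positions` keeps each point near its category while revealing how many points share similar values.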

Show additional information:

In [None]:
sns.catplot(
    kind='swarm',
    data=tips,
    x='day',
    y='total_bill',
    hue='time',
);

#### Distribution

Boxplot: the box shows the three quartiles, and the whiskers extend to the smallest and largest values, except for outliers, which are plotted separately.

The three quartiles are:
 1. lower quartile (25% of elements are less than it)
 2. median (50% of elements are less than it)
 3. upper quartile (75% of elements are less than it)
 
A point is considered an outlier if it lies more than 1.5 IQR below the lower quartile or above the upper quartile.
IQR, the inter-quartile range, is simply the distance between the lower and upper quartiles.
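The outlier rule can be worked through by hand. The following is a small sketch on made-up numbers; it uses NumPy's default (linear interpolation) percentile method, which may differ slightly from the quartile method used when drawing the box.

```python
import numpy as np

# Made-up bills; 50.0 is deliberately far from the rest
bills = np.array([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 50.0])

q1, median, q3 = np.percentile(bills, [25, 50, 75])  # 12.75, 14.5, 16.5
iqr = q3 - q1                                        # 3.75

# Anything beyond these fences counts as an outlier
lower_fence = q1 - 1.5 * iqr  # 7.125
upper_fence = q3 + 1.5 * iqr  # 22.125

outliers = bills[(bills < lower_fence) | (bills > upper_fence)]  # only 50.0
```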

In [None]:
tips.groupby('day').total_bill.describe()

In [None]:
sns.catplot(kind='box', data=tips, x='day', y='total_bill');

Violinplot: similar to the boxplot, but shows more information about the distribution. Instead of only the quartiles and ranges, it shows a KDE (kernel density estimate). Think of it as a continuous histogram. Because a violin has two sides, it can show data for two types of observations for each x-axis categorical value:

In [None]:
sns.catplot(
    kind='violin',
    split=True,
    
    data=tips,
    x='day',
    y='total_bill',
    hue='sex',
    scale='count',
);

The width of each KDE shows the number of observations falling in that segment.

### Distributions

We'll exemplify on the famous `iris` dataset, containing measurements of various species of flowers:

In [None]:
iris = sns.load_dataset('iris')
iris.head()

In [None]:
len(iris)

In [None]:
iris.species.value_counts()

#### Univariate

A **histogram** (the columns) shows how many observations fall in each _bin_.

A **KDE**, Kernel Density Estimation, fits a probability density function over the distribution. You can think of it as a continuous approximation of the histogram.

In [None]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(iris.sepal_length, kde=True, stat='density')
plt.gca().xaxis.grid(False)
plt.ylabel('Density');
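To show what the two curves represent, here is a minimal sketch that computes a histogram and a Gaussian KDE with NumPy only. The samples are synthetic, `simple_kde` is a hypothetical helper, and the bandwidth is picked by hand; Seaborn chooses its bandwidth automatically, so its curve will not match exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.8, scale=0.8, size=150)  # synthetic measurements

# Histogram: count the observations that fall in each bin
counts, bin_edges = np.histogram(samples, bins=10)
# density=True rescales the bars so that their total area is 1
density, _ = np.histogram(samples, bins=bin_edges, density=True)

def simple_kde(data, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each observation."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

grid = np.linspace(samples.min() - 3, samples.max() + 3, 500)
kde = simple_kde(samples, grid, bandwidth=0.3)  # bandwidth is an assumption

# Like the density histogram, the KDE encloses an area of about 1
area = np.sum(kde) * (grid[1] - grid[0])
```

A larger bandwidth gives a smoother curve; a smaller one follows the individual observations more closely.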

**ℹ️ Tip**: It seems like our distribution is made up of several component distributions. Since the data comes from natural phenomena, we expect each of them to be somewhat normally shaped. Plotting the KDE for each species reveals the underlying distributions:

In [None]:
for species, sub_df in iris.groupby('species'):
    sns.kdeplot(sub_df.sepal_length, label=species)

plt.legend(title='Species')
plt.xlabel('Sepal length')
plt.ylabel('Density')
plt.yticks([])
plt.title('Length Distribution by Species');

#### Bivariate

Scatterplot in the center with univariate histograms on the sides:

In [None]:
sns.jointplot(data=iris, x='sepal_length', y='sepal_width');

Bivariate (2D) analogue of the KDE:

In [None]:
# the deprecated shade_lowest=False is the default behaviour in recent seaborn versions
sns.jointplot(kind='kde', data=iris, x='sepal_length', y='sepal_width');

Similarly, we can decompose the distributions:

In [None]:
with sns.axes_style('white'):
    for species, sub_df in iris.groupby('species'):
        # x and y must be passed by keyword; fill replaces the deprecated shade
        sns.kdeplot(x=sub_df.sepal_length, y=sub_df.sepal_width, label=species,
                    fill=True, alpha=.5)

plt.legend(title='Species')
plt.title('Length and Width Distribution by Species');

More than two variables: draw a grid of pairwise plots.

In [None]:
g = sns.PairGrid(iris, diag_sharey=False, hue='species')

g.map_diag(sns.kdeplot)
g.map_upper(plt.scatter, alpha=.5)
g.map_lower(sns.kdeplot, fill=True)  # fill replaces the deprecated shade

g.add_legend(title='Species');

#### Linear relationships

Best-fit line and confidence interval:

In [None]:
sns.regplot(data=tips, x='total_bill', y='tip');
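As a rough illustration of what a best-fit line is, the sketch below fits a straight line by least squares with `np.polyfit`. The numbers are made up and lie exactly on a line, so the fit recovers it; the confidence band drawn in the chart is not reproduced here.

```python
import numpy as np

# Made-up bills and tips lying exactly on the line tip = 0.15 * bill + 1
bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = 0.15 * bill + 1.0

# Degree-1 polynomial fit; coefficients come back highest degree first
slope, intercept = np.polyfit(bill, tip, deg=1)
predicted = slope * bill + intercept
```

With real, noisy data the points scatter around the line, and `slope` estimates how much the tip grows per unit of total bill.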

Show histograms on the sides:

In [None]:
sns.jointplot(kind='reg', data=tips, x='total_bill', y='tip');

#### Heatmaps

We'll use the `flights` dataset, which contains the number of airline passengers per month over several years:

In [None]:
# keyword arguments are required by recent pandas versions
flights = sns.load_dataset('flights').pivot(index='month', columns='year', values='passengers')

In [None]:
flights
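The pivot turns a long table (one row per observation) into a wide one (one row per month, one column per year). A small sketch on a made-up four-row table, using the same column names:

```python
import pandas as pd

# Made-up long-format records: one row per (year, month) pair
long_df = pd.DataFrame({
    'year': [1949, 1949, 1950, 1950],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'passengers': [112, 118, 115, 126],
})

# Rows become months, columns become years, cells hold the passenger counts
wide_df = long_df.pivot(index='month', columns='year', values='passengers')
```

The wide layout is what a heatmap expects: one cell per (row, column) pair.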

Present data in three dimensions. The z-axis (color intensity) represents the number of passengers:

In [None]:
sns.heatmap(flights, cbar_kws=dict(label='# Passengers'));

**ℹ️ Tip**: it is intuitive to represent larger values by darker colors:

In [None]:
sns.heatmap(
    flights, 
    cbar_kws=dict(label='# Passengers'),
    cmap='Blues',
    linewidths=.1,  # thin lines between cells
);

**ℹ️ Tip**: reverse any colormap by appending `_r` to its name.
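A quick way to check what the `_r` suffix does is to look up both colormaps and compare their end points. This sketch assumes the built-in Matplotlib colormap `Blues`:

```python
import numpy as np
import matplotlib.pyplot as plt

blues = plt.get_cmap('Blues')
blues_r = plt.get_cmap('Blues_r')

# A colormap maps a number in [0, 1] to an RGBA tuple
lightest = blues(0.0)
darkest = blues(1.0)

# The reversed colormap starts where the original ends, and vice versa
same_ends = np.allclose(blues_r(0.0), darkest) and np.allclose(blues_r(1.0), lightest)
```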

Other sequential colormaps:

![pic](https://i.imgur.com/oqfPvJX.png)

---

Sometimes you have diverging data, such as correlation: two variables can be correlated either positively (they increase and decrease together) or negatively (when one increases, the other decreases). For such data, use a diverging colormap.
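To illustrate the two signs of correlation, here is a small sketch with made-up variables that are perfectly linearly related to `x`, so the coefficients sit at the two extremes:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rises_with_x = 2 * x + 1     # increases whenever x increases
falls_with_x = -3 * x + 10   # decreases whenever x increases

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r_positive = np.corrcoef(x, rises_with_x)[0, 1]  # close to +1
r_negative = np.corrcoef(x, falls_with_x)[0, 1]  # close to -1
```

Real data usually falls somewhere between these extremes, with 0 meaning no linear relationship, which is why a colormap centered on 0 suits it.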

In [None]:
crashes = sns.load_dataset('car_crashes')
crashes.head()

In [None]:
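# correlate only the numeric columns (the dataset also has a text column, `abbrev`)
corr = crashes.select_dtypes('number').corr()

# True marks cells to hide: the upper triangle and the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))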

with sns.axes_style("white"):
    ax = sns.heatmap(corr, 
                     mask=mask,
                     cbar_kws=dict(label='Correlation'),
                     cmap='RdYlGn', 
                     center=0, #vmin=-.5,
                     annot=True, 
                     fmt='.1f', 
                     lw=1,
                     square=True)
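A correlation matrix is symmetric and its diagonal is always 1, so the upper triangle and the diagonal carry no extra information. The heatmap hides every cell where `mask` is true. A minimal sketch of such a mask for a hypothetical 3x3 matrix:

```python
import numpy as np

n = 3  # hypothetical number of variables

# True on and above the diagonal, False below it
mask = np.triu(np.ones((n, n), dtype=bool))

# Only the cells below the diagonal remain visible
visible_cells = int((~mask).sum())  # 3 for a 3x3 matrix
```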

Other diverging colormaps:

![pic](https://i.imgur.com/9H9J71j.png)

**ℹ️ Tip**: see the rest of available colormaps and color palettes: [matplotlib](https://matplotlib.org/tutorials/colors/colormaps.html), [seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html). Use tools such as [Color Brewer](http://colorbrewer2.org) to help pick color schemes. [Adobe Color Wheel](https://color.adobe.com/create/color-wheel/) is a good tool for general-purpose palette selection. Online [palette generators](https://coolors.co/app) make exploring colors easy.

---

Set the style back to the original Matplotlib defaults:

In [None]:
sns.reset_orig()