<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:27%; left:10%;">
     INE Bootcamp
</h1>
<h2 style="color: white; position: absolute; top:36%; left:10%;">
    Data Analysis, Visualization and Predictive Modeling
</h2> 

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:58%; left:10%;">
    <b>David Mertz, Ph.D.</b>
</h3>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:63%; left:10%;">
    <b>Data Scientist</b>
</h3>
</div>

<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:27%; left:10%;">
     Day 3 Agenda
</h1>
<h3 style="color: white; position: absolute; top:36%; left:10%; line-height: 1.5;">
Seaborn statistical plots<br/>
Break and programming exercises<br/>
Linear and Polynomial Fitting<br/>
Break and programming exercises<br/>
Data Analysis for Machine Learning<br/>
Break and programming exercises<br/>
Review and evaluation<br/>
</h3>
</div>

<div style="width: 100%; height: 200px; background-color: #222; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Seaborn statistical plots
</h1>

<br><br> 
</div>

<img src="https://user-images.githubusercontent.com/7065401/110562366-1cfdc480-8128-11eb-8391-ec0ba196619d.png" style="width:300px; float: right; margin: 0 40px 40px 40px;"/>

> Seaborn extends the capabilities of Matplotlib by providing high-level plotting methods for statistical analysis. It also provides a set of style options for Matplotlib.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

Seaborn accepts Pandas DataFrames as input and plots data by column name.

<h2 style="font-weight: bold;">
    Visualizing distributions
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Let us load a couple of commonly used, public-domain datasets.

In [None]:
auto = pd.read_csv('data/auto-mpg.csv')
auto

In [None]:
fmri = sns.load_dataset("fmri")
fmri.head()

<h2 style="font-weight: bold;">
    Univariate distributions
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

**displot()** and **histplot()** draw histograms; passing `kde=True` overlays a kernel density estimate.

In [None]:
sns.histplot(auto.mpg, kde=True, bins=35, stat='density');

In [None]:
sns.displot(fmri.signal, kde=True); 

In [None]:
sns.displot(data=fmri, x="signal", rug=True, rug_kws={"color": "r", "alpha":0.5});
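
To make the KDE overlay less of a black box, here is a minimal sketch of a one-dimensional Gaussian kernel density estimate written with NumPy only. The helper name `gaussian_kde_1d`, the synthetic sample, and the hand-picked bandwidth are assumptions made for illustration; Seaborn chooses its bandwidth automatically, so its curve will generally differ from this one.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid` (illustrative helper)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One Gaussian bump per sample, evaluated at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale so the curve integrates to 1
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=25, scale=5, size=500)  # synthetic stand-in for a numeric column
grid = np.linspace(0, 50, 501)
density = gaussian_kde_1d(samples, grid, bandwidth=1.5)

# Riemann sum over the grid; should be close to 1 for a proper density
area = density.sum() * (grid[1] - grid[0])
print(f"area under the KDE curve: {area:.3f}")
```

A smaller bandwidth gives a bumpier curve that follows individual samples; a larger one gives a smoother curve that can hide real structure.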

<h2 style="font-weight: bold;">
    Error bars
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

We can show data with a 95% confidence interval in one function call.

In [None]:
fmri

In [None]:
sns.relplot(data=fmri, kind="line",  x="timepoint", y="signal", 
            col="region", hue="event", style="event");
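
By default, the shaded band in a Seaborn line plot is a bootstrapped 95% confidence interval for the mean at each x value. The sketch below reproduces that idea for a single group of observations. The helper `bootstrap_ci` and the synthetic `signal` values are assumptions for illustration, not Seaborn's own code.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Cut off equal tails on each side of the bootstrap distribution
    tail = (100 - ci) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

rng = np.random.default_rng(1)
signal = rng.normal(loc=0.1, scale=0.05, size=50)  # synthetic readings at one timepoint
low, high = bootstrap_ci(signal)
print(f"mean={signal.mean():.4f}, 95% CI=({low:.4f}, {high:.4f})")
```

Repeating this for every timepoint and every `hue` level gives the bands drawn around each line.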

<h2 style="font-weight: bold;">
    Bivariate distributions
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

2-dimensional distributions plotted with **jointplot()** include marginal histograms or 1-D KDEs along each axis.

The `kind=` keyword parameter adjusts the display. Available options are
* `scatter` (default)
* `hex`: hexagonal binning
* `kde`: contour mapping
* `reg`: scatter plot with a regression line and marginal KDEs
* `resid`: residuals (deviations) from a regression

In [None]:
sns.jointplot(data=auto, x='weight', y='hp', height=8);

If we add `hue` as a categorical variable, the marginal distributions are shown as KDEs rather than histograms.

In [None]:
sns.jointplot(data=auto, x='weight', y='hp', height=7, hue='origin');

We can create a 2-D KDE as the main plot.

In [None]:
sns.jointplot(data=auto, x="mpg", y="weight", hue="origin", kind="kde", height=7);

In [None]:
jp = sns.jointplot(x='mpg', y='hp', kind='hex', data=auto, height=7)

# add colorbar
cax = jp.fig.add_axes([1, .08, .04, .75])
plt.colorbar(cax=cax);

A 2-D histogram is similar to a hexplot, but tiles the plane with rectangles rather than hexagons.  Visually, hexagons generally make the binning look less artificial, since a hexagon approximates a circle more closely than a rectangle does.  The example below "reaches down" to use some underlying Matplotlib tweaks; sometimes we need to do this to get the effect we want.  `.histplot()` is somewhat "lower level" than `.jointplot()`, so fewer extra elements are added automatically.  Matplotlib, in turn, is one layer lower still.

In [None]:
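# Create the Matplotlib figure and axes ourselves to control the size
fig, ax = plt.subplots(figsize=(7, 8))
sns.histplot(auto, x='mpg', y='hp', hue=auto.weight.quantile(0.2), ax=ax)
# Drop the legend that Seaborn adds for the hue variable
ax.get_legend().remove()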
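
The rectangular binning behind a 2-D histogram can be sketched with `np.histogram2d`. The data here are synthetic stand-ins for the `mpg` and `hp` columns, and the bin counts are arbitrary; Seaborn picks its own bins, so this only illustrates the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, loosely anti-correlated stand-ins for mpg and hp
mpg = rng.normal(23, 7, size=400)
hp = 250 - 6 * mpg + rng.normal(0, 15, size=400)

# Count how many points fall into each cell of a 10 x 8 rectangular grid
counts, mpg_edges, hp_edges = np.histogram2d(mpg, hp, bins=(10, 8))

print(counts.shape)       # one row per mpg bin, one column per hp bin
print(int(counts.sum()))  # every point lands in exactly one cell
```

Each cell's count is what gets mapped to a color in the plot.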

Somewhat analogous to a 2-D histplot is a heatmap.  This is often useful for representing a grid of numbers.  It needs tabular data to plot, in the shape of the grid.

In [None]:
corr = auto[['displ', 'hp', 'accel', 'weight', 'cyl', 'mpg']].corr()
corr

In [None]:
fig, ax = plt.subplots(figsize=(8, 7))
sns.heatmap(corr, annot=True, cmap="PiYG")
ax.set_title("Correlations among auto features");
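
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a sanity check, the sketch below computes one coefficient by hand and compares it with Pandas. The data are synthetic and the `pearson` helper is introduced only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weight = rng.normal(3000, 800, size=200)
df = pd.DataFrame({
    "weight": weight,
    # Heavier cars get lower mpg, plus some noise (made-up relationship)
    "mpg": 45 - 0.007 * weight + rng.normal(0, 3, size=200),
})

def pearson(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    xd = x - x.mean()
    yd = y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum())

r_manual = pearson(df["weight"], df["mpg"])
r_pandas = df.corr().loc["weight", "mpg"]
print(r_manual, r_pandas)
```

A diverging colormap such as `PiYG` suits this kind of table because the values range from -1 to 1 with a meaningful midpoint at 0.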

<h2 style="font-weight: bold;">
    Regressions
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Regressions of arbitrary order can be visualized with **regplot()**.

In [None]:
ax = sns.regplot(x='weight', y='mpg', order=2, 
                 line_kws={'color':'red'}, data=auto)
ax.set_title("Weight vs MPG (2nd order fit)");

Sometimes we need to "drop down" to the Matplotlib level to adjust some features.  Here we make the underlying axis object larger.

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.regplot(x='weight', y='mpg', order=2, ax=ax,
            line_kws={'color':'red'}, data=auto)
ax.set_title("Weight vs MPG (2nd order fit)");
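
A second-order fit like the red curve above is an ordinary least-squares polynomial fit. The sketch below does the same kind of fit directly with `np.polyfit` on synthetic data; the coefficients used to generate the data are invented for illustration and do not come from the auto dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5000, size=300)
# Made-up quadratic relationship plus Gaussian noise
true_curve = 60 - 0.02 * weight + 2e-6 * weight**2
mpg = true_curve + rng.normal(0, 2, size=300)

# Least-squares fit of a degree-2 polynomial; coefficients come highest power first
coeffs = np.polyfit(weight, mpg, deg=2)
fitted = np.polyval(coeffs, weight)

# Root-mean-square error of the fit; should be close to the noise level
rmse = np.sqrt(np.mean((mpg - fitted) ** 2))
print(coeffs, rmse)
```

Raising `deg` always lowers the error on the data used for fitting, which is not the same as a better model.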

<h2 style="font-weight: bold;">
    Factoring variables
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

**catplot()** plots distributions with groupby operations applied along the x-axis.<br>Secondary grouping is performed with `hue=`.

Multiple plots can be factored with `col=`.

The `kind=` keyword can be
* `point`
* `count`
* `bar`
* `box`
* `violin`
* `strip`

Simple grouping along the column on the x-axis.

In [None]:
sns.catplot(x='origin', y='hp', kind='bar', data=auto, height=5);

In [None]:
sns.catplot(x='origin', y='mpg', kind='violin', data=auto, height=5);

Let's also break down the average horsepower by region and number of cylinders.

In [None]:
sns.catplot(x='origin', y='hp', hue='cyl', kind='bar', data=auto, height=5);
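
The bar heights in a plot like this are group means, so the same numbers can be obtained with a Pandas groupby. The tiny `cars` table below is made up for illustration; its values and region labels are not taken from the auto dataset.

```python
import pandas as pd

# Hypothetical miniature dataset
cars = pd.DataFrame({
    "origin": ["US", "US", "US", "Europe", "Europe", "Asia", "Asia"],
    "cyl":    [8, 8, 4, 4, 4, 4, 6],
    "hp":     [150, 170, 90, 80, 70, 65, 120],
})

# Primary grouping (x-axis) by origin, secondary grouping (hue) by cyl
mean_hp = cars.groupby(["origin", "cyl"])["hp"].mean().unstack()
print(mean_hp)
```

Combinations that never occur show up as `NaN` in the table and as missing bars in the plot.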

In this dataset, the US is the only manufacturer of 8-cylinder vehicles.

In [None]:
sns.catplot(y='origin', kind='count', hue='cyl', data=auto, height=5);

By binning the dataset in 5-year increments we can further factor the plots into columns.

In [None]:
auto['bins'] = pd.cut(auto['yr'], bins=np.arange(65, 90, 5))
auto['bins'].value_counts().sort_index()
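
One detail of `pd.cut` is easy to miss: by default the intervals are open on the left and closed on the right. The sketch below uses a handful of made-up two-digit years (assuming `yr` is stored that way) to show where boundary values land.

```python
import numpy as np
import pandas as pd

years = pd.Series([65, 70, 71, 75, 76, 82])  # hypothetical two-digit model years
bins = pd.cut(years, bins=np.arange(65, 90, 5))

# 65 sits on the open left edge of the first interval, so it gets no bin (NaN);
# 70 and 75 fall into the intervals they close: (65, 70] and (70, 75]
print(pd.DataFrame({"yr": years, "bin": bins}))
```

Passing `right=False` to `pd.cut` would flip the closed side, so that a bin such as `[70, 75)` starts at 70.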

The distributions changed dramatically in Europe and Asia from 1970 to 1984.

The median MPG value rose in the US during this time.

In [None]:
sns.catplot(x='origin', y='mpg', kind='box', col='bins', data=auto);

<h2 style="font-weight: bold;">
    Pairwise relationships
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

**pairplot()** performs a `jointplot()` over all pairs of columns. 

It is generally helpful to select only the columns you wish to plot from the DataFrame. `vars=` will also accept a list of column names.

In [None]:
sns.pairplot(auto[['mpg','yr','hp','weight']]);

As with `catplot()`, `hue=` performs a groupby operation.

In [None]:
sns.pairplot(auto, vars=['mpg','yr','hp','weight'], hue='origin');

<div style="width: 100%; height: 200px; background-color: #ef7d22; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Exercises
</h1>

<br><br> 
</div>

<h2 style="font-weight: bold;">
    Titanic Survivors
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


These exercises can be performed with either Pandas `.plot()` or Seaborn methods.

In [None]:
# Note: there are missing data in this DataFrame
titanic = pd.read_csv('data/titanic.csv')

In [None]:
titanic.info()

<h2 style="font-weight: bold;">
    Age distributions
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


Plot the distribution of the ages of all passengers and visually determine the mode.

In [None]:
# your solution here

<h2 style="font-weight: bold;">
    Survived
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


Group the ages by survival and show a box plot.

In [None]:
# your solution here

<h2 style="font-weight: bold;">
    Plot CDF
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


Plot the cumulative distribution functions of the ages of the passengers by survival.

*Is age a reasonable determining factor for survival?*

In [None]:
# your solution here

<h2 style="font-weight: bold;">
    Tips
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


In [None]:
tips = pd.read_csv('data/tips.csv')

In [None]:
tips.info()

---
### Tip fraction

1. Compute the tip fraction
2. Plot the tip fraction against the total bill with a regression
   1. Is the tipping rate constant across total bill sizes?
   2. Do smoking, sex, or time matter?

In [None]:
# your solution here

<div style="width: 100%; height: 400px; background-color: #222; text-align: center; padding-top: 120px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Review and questions
</h1>

<br><br> 
</div>

---
<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

<img src="https://user-images.githubusercontent.com/7065401/98864025-08deda80-2448-11eb-9600-22aa17884cdf.png" style="height: 100%; max-height: inherit; position: absolute; top: 20%; left: 0px;"></img>
<br>

<h2 style="font-weight: bold;">
    David Mertz, Ph.D.
</h2>

<h3 style="color: #ef7d22; margin-top: 0.8em">
    Data Scientist
</h3>
<hr>
<br><br>

<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    david.mertz@gmail.com
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    linkedin.com/in/dmertz/
</p>

</div>

<br><br><br>