# About the Instructor

* Teddy Petrou
* Author of Master Data Analysis with Python
* Founder of [Dunder Data](http://dunderdata.com)

# Syllabus

## Day 1

### Module 1: An overview of the Matplotlib Figure and Axes (30 min)
* Training Overview 
* Matplotlib Figure and Axes objects
* An object-oriented approach to Matplotlib
* Seaborn as a “wrapper” for Matplotlib
 
### Module 2: Introduction to Seaborn (30 min)
* Seaborn documentation and API
* Integration with Pandas DataFrames
* Main plotting parameters
* Grid vs Axes plots
* A different categorization for Seaborn plots
 
### Module 3: Distribution Plots (75 min)
* Univariate distribution plots
* Box plots
* Changing the orientation of plots
* Histograms and KDEs
* Other distribution plots
 
### Module 4: Automatic Grouping by Category (75 min)
* Plotting a continuous by a categorical variable
* Grouping with two continuous variables
* Splitting groups by setting hue
 
## Day 2
 
### Module 5: Grouping and Aggregating Plots (60 min)
* Univariate grouping and aggregating
* Choosing the aggregation function
* Bar, Point, Count, and Line plots
 
### Module 6: Tidy Data (30 min)
* Definition of tidy data
* Long data vs wide data
* Seaborn automatically groups and aggregates
* Manually grouping and aggregating with pandas
 
### Module 7: Raw Data Plots (60 min)
* Scatterplots
* Regression plots
* Heatmaps
 
### Module 8: Grid Plots (60 min)
* Relationship with Matplotlib Figures
* Discovering the Grid plotting functions
* Setting the plot with the kind parameter
* Creating multiple Axes with row/col parameters
* Bivariate distribution plots
* Hierarchical cluster maps
 
### Module 9: Seaborn Styles and Palettes (30 min)
* Run configuration parameters
* Specific Seaborn styles
* Color palettes and widgets

## Getting Started 

* Install matplotlib and seaborn with either
    * `pip install seaborn matplotlib`
    * `conda install seaborn matplotlib`

# Seaborn Axes Plots

* Seaborn is a high-level, easy-to-use interface for creating powerful and beautiful visualizations
* Relies on matplotlib entirely
* Relatively few functions

## Matplotlib Figure - Axes Review

* Figure contains one or more axes
* Axes ("Axeeez") - a single plot within a figure
* `plt.subplots` creates a figure and one or more axes
    * Set first two arguments to number of rows and columns of Axes
    * `figsize` - tuple of width, height of the Figure in inches
    * `dpi` - dots per inch - set to monitor dpi to get matplotlib inches to equal screen inches
    * `facecolor` - background color
    * `tight_layout` - boolean for nice spacing
    * `plt.subplots(2, 3, figsize=(7, 4), dpi=147, facecolor='tan', tight_layout=True)`
    * Returns a two-item tuple, where first item is a Figure. Second item is
        * Axes if 1x1
        * Numpy array of Axes if anything else

In [None]:
import matplotlib.pyplot as plt

### Create matplotlib figure with different number of axes
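
A minimal sketch of the return values described in the review above. The grid shapes and figure sizes are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# 1x1 grid: the second returned item is a single Axes object
fig1, ax1 = plt.subplots(figsize=(4, 3))

# 2x3 grid: the second returned item is a NumPy array of Axes
fig2, axes = plt.subplots(2, 3, figsize=(7, 4), tight_layout=True)

print(type(ax1).__name__)
print(type(axes).__name__, axes.shape)

# Index the array to reach an individual Axes
axes[0, 1].set_title('row 0, col 1')
```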

### Retrieve the figure from the axes
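
One way to do this, sketched with an arbitrary 1x2 grid: every Axes keeps a reference to its parent Figure in its `figure` attribute.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(6, 2.5))

# Every Axes keeps a reference to the Figure that contains it
same_fig = axes[0].figure
print(same_fig is fig)

# The Figure, in turn, lists the Axes it contains
print(len(fig.axes))
```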

## The seaborn API and User Guide

Keep the [seaborn API page][1] and [user guide][2] open throughout the lesson

* Relational
* Categorical
* Distribution
* Regression
* Matrix

### Axes and Grid plots

All of the seaborn plotting functions return either a matplotlib axes or a seaborn grid (wrapper around a figure). 

* Functions that return a grid - `relplot`, `displot`, `catplot`, `lmplot`, `jointplot`, `pairplot`, and `clustermap`
* Focus on axes plots in this chapter

### A different categorization of plots

Seaborn's API page lists several distribution plots, such as `boxplot` and `violinplot`, under its categorical section.

* **Distribution plots** - `boxplot`, `violinplot`, `histplot`, `kdeplot`.
* **Grouping and aggregating plots** - `barplot`, `countplot`, `pointplot`, `lineplot`
* **Raw data plots** - `scatterplot`, `lineplot`, `regplot`, `heatmap`

[1]: http://seaborn.pydata.org/api.html
[2]: https://seaborn.pydata.org/tutorial.html

## seaborn integration with pandas

* seaborn is tightly integrated with pandas.
* Plotting functions have a `data` parameter that accepts a pandas DataFrame.
* This allows you to use **strings** of the column names for the function arguments.

### The four common seaborn plotting function parameters - `x`, `y`, `hue`, and `data`


```python
>>> sns.plotting_func(x='col1', data=df)
>>> sns.plotting_func(y='col1', data=df)
>>> sns.plotting_func(x='col1', y='col2', data=df)
>>> sns.plotting_func(x='col1', y='col2', hue='col3', data=df)
```
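
`plotting_func` above is a placeholder, not a real seaborn function. As a concrete sketch, the same pattern with `boxplot` and a small made-up DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd
import seaborn as sns

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    'price': [80, 120, 95, 200, 150, 60],
    'room_type': ['private', 'entire', 'private', 'entire', 'entire', 'shared'],
})

# Column names are passed as strings; seaborn looks them up in `data`
ax = sns.boxplot(x='price', y='room_type', data=df)

# The column names become the axis labels
print(ax.get_xlabel(), ax.get_ylabel())
```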

## Distribution Plots

* How numeric data is distributed across its range. 
* Where is it located? 
* What kind of shape does it have?
* boxplots, histograms, KDEs

### Airbnb data from Washington D.C.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
airbnb = pd.read_csv('../data/airbnb.csv')
airbnb.head(3)

### Univariate distribution plots

* Single numeric variable
* Only set one of `x` or `y`
* Create horizontal or vertical plots
* Axes plot

#### Plot price as horizontal boxplot

In [None]:
sns.boxplot(x='price', data=airbnb.query('price < 1000'))

## An axes is returned

* Use `;` to suppress output
* Assign result to `ax`
* Retrieve underlying figure with `ax.figure`
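
A short sketch of these three points, using a made-up `price` column rather than the Airbnb data:

```python
import pandas as pd
import seaborn as sns

# Made-up values for illustration
df = pd.DataFrame({'price': [80, 120, 95, 200, 150, 60]})

# Assigning the result to `ax` also keeps the notebook from echoing it
ax = sns.boxplot(x='price', data=df)

# The Figure that contains this Axes
fig = ax.figure

# Both objects can still be modified after seaborn has drawn the plot
fig.set_size_inches(4, 1.5)
ax.set_title('Price distribution')
```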

## Setting plot aesthetics

* Default plot settings not great
* Use `sns.set_theme` function
    * `style` - white, dark, whitegrid, darkgrid, ticks
    * `rc` - matplotlib run configuration parameters
        * Find values under `plt.rcParams`
        * Use dictionary - `{'figure.dpi': 147}`
    * `font_scale` - scales the font size

In [None]:
def find_dpi(w, h, d):
    """
    w : width in pixels
    h : height in pixels
    d : diagonal in inches
    """
    # Pythagorean theorem: d**2 = w_inches**2 * (1 + (h / w)**2)
    w_inches = (d ** 2 / (1 + h ** 2 / w ** 2)) ** 0.5
    return round(w / w_inches)

In [None]:
MY_DPI = find_dpi(1920, 1200, 15.4)
MY_DPI

### Set theme now

Sets theme for all plots in this notebook. Not permanent across all notebooks.

In [None]:
sns.set_theme(style='darkgrid', rc={'figure.dpi': MY_DPI}, font_scale=0.7)
sns.boxplot(x='price', data=airbnb.query('price < 1000'));

## Controlling the size of an axes plot

* Must create the figure and axes beforehand with `fig, ax = plt.subplots(figsize=(w, h))`
* Then set `ax` parameter - `sns.plotting_func(..., ax=ax)`
* Should be easier...

### Vertical plots

* Set the `y` parameter
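
A sketch combining the two ideas above: the Figure is created first with a tall, narrow `figsize`, and the box plot is drawn vertically by setting `y` instead of `x`. The data is made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for illustration
df = pd.DataFrame({'price': [80, 120, 95, 200, 150, 60]})

# Create the Figure and Axes first to control the size
fig, ax = plt.subplots(figsize=(1.5, 3))

# Setting `y` (and not `x`) draws a vertical box plot on the given Axes
returned = sns.boxplot(y='price', data=df, ax=ax)

# seaborn returns the same Axes that was passed in
print(returned is ax)
```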

### Recreate box plot with matplotlib

* Seaborn uses matplotlib's `boxplot` function

In [None]:
fig, ax = plt.subplots(figsize=(4, 1.5))
ax.boxplot(x='price', data=airbnb.query('price < 1000'), 
           widths=.8, vert=False, patch_artist=True,
           medianprops={'color': '.25', 'lw': 1.5},
           boxprops={'ec': '.25', 'lw': 1.5, 'fc': '#3274a1'},
           whiskerprops={'color': '.25', 'lw': 1.5}, 
           capprops={'color': '.25', 'lw': 1.5},
           flierprops={'marker': 'd', 'mfc': '.25', 'mec': '.25', 'ms': 5});

### Still have to use matplotlib for fine tuning

* plot full range of price

In [None]:
from matplotlib import ticker
fig, ax = plt.subplots(figsize=(4, .6))
sns.boxplot(x='price', data=airbnb, whis=(5, 95), ax=ax)
ax.set_xscale('log')
ax.xaxis.set_major_locator(ticker.LogLocator(base=10, subs=(1, 2, 5)))
func = lambda x, pos: f'${x:,.0f}' if x < 1000 else f'${x // 1000:.0f}k'
ax.xaxis.set_major_formatter(ticker.FuncFormatter(func))
ax.xaxis.set_minor_locator(ticker.NullLocator())
ax.yaxis.set_major_locator(ticker.NullLocator())
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

### Histograms

* [histplot function](https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.histplot)
* Shows raw count per bin




In [None]:
fig, ax = plt.subplots(figsize=(4.5, 2))
sns.histplot(x='price', data=airbnb, binrange=(0, 400), ax=ax);

### Function parameters

* All functions have many parameters
* Only memorize `data`, `x`, `y`, `hue` - common to all
* Use help/api to learn about function specific parameters

### Practice making histograms with different parameter combinations

* `bins` - number of bins to use, default 'auto'
* `binrange` - (left, right)
* `binwidth` - width of bin - cannot be used with `bins`
* `stat` - type of statistic to show - defaults to raw count - 'count', 'frequency', 'density', 'probability', 'percent'
* `cumulative` - boolean - each bin shows the running total of all bins up to and including it
* `element` - 'bar', 'step', 'poly' - type of line to draw
* `fill` - boolean whether or not to fill with color
* `kwargs` - extra keyword arguments forwarded to `plt.bar`
    * `ec` - edgecolor
    * `lw` - linewidth
    * `alpha` - color opacity

### Create several histograms with different parameters

### KDE Plots

* Kernel Density Estimation - `kdeplot` function.
* Estimates probability density function
* Parameters
    * `cumulative`, `log_scale`, `fill` - bool
    * `color`, `ec`, `lw`

## Plot histograms and kde together

### Bivariate KDE plots

* Two numeric values co-occurring from two different variables
* Set `clip=((x_min, x_max), (y_min, y_max))`

### Read in diamonds dataset

In [None]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head()

In [None]:
diamonds.shape

Filter diamonds to smaller dataset and then create bivariate KDE

## Practice with the other distribution plots

* `violinplot`
* `stripplot`
* `swarmplot`
* `boxenplot`

## Long vs Wide data

* Long data == Tidy data
    * Every variable forms a column
    * Every observation forms a row

In [None]:
mort = pd.read_csv('../data/mortality.csv', parse_dates=['week'])
mort.head(10)

### Pivot to create wide data

* Use `pivot` method to transform
* With wide data, all values have the same units
* With long data, each column has different units of values

In [None]:
df_wide = mort.pivot(index=['week', 'age'], columns='status', values='deaths')
df_wide.head(12)

### Pivot a different set of variables to make wide data
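
A sketch on a tiny stand-in table that reuses the column names from the `pivot` call above (`week`, `age`, `status`, `deaths`). The values, age bands, and status labels are invented. Here `age` is moved into the columns instead of `status`.

```python
import pandas as pd

# Tiny long-format stand-in; all values and labels are made up
mort_demo = pd.DataFrame({
    'week': pd.to_datetime(['2021-01-02'] * 4 + ['2021-01-09'] * 4),
    'age': ['0-49', '0-49', '50+', '50+'] * 2,
    'status': ['status_a', 'status_b'] * 4,
    'deaths': [1, 5, 10, 40, 2, 6, 12, 38],
})

# One row per (week, status), one column per age group
wide = mort_demo.pivot(index=['week', 'status'], columns='age', values='deaths')
print(wide)
```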

### Read in the stocks dataset. Is it long or wide?

In [None]:
stocks = pd.read_csv('../data/stocks.csv', index_col='date', parse_dates=['date'])
stocks.head()

### Change it to the other form
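
A sketch of converting between the two forms, assuming a wide table with one column per ticker. The tickers `AAA` and `BBB`, the prices, and the names `symbol` and `close` are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one column per ticker, dates in the index
stocks_demo = pd.DataFrame(
    {'AAA': [10.0, 10.5, 11.0], 'BBB': [20.0, 19.5, 19.0]},
    index=pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06']),
)
stocks_demo.index.name = 'date'

# Wide -> long: one row per (date, symbol) pair
long = stocks_demo.reset_index().melt(
    id_vars='date', var_name='symbol', value_name='close')

# Long -> wide again with pivot
back = long.pivot(index='date', columns='symbol', values='close')
```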

## Automatic grouping by category

* Big benefit of seaborn is automatically splitting the data into distinct groups
* Must have long (tidy) data and not wide data
* One **grouping column**
* One **graphing column**

### Select grouping and graphed columns

By default:

* Grouping column is non-numeric
* Graphing column is numeric

#### Graph price by neighborhood

* Select top 5 most common neighborhoods
* Both `x` and `y` can be grouping/graphing column
* Use `order` to limit grouping column
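
A sketch of these steps on randomly generated data. The neighborhood labels `A` through `G` stand in for real neighborhood names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in for the Airbnb data
rng = np.random.default_rng(2)
hoods = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
df = pd.DataFrame({
    'neighborhood': rng.choice(hoods, size=400,
                               p=[.3, .2, .15, .12, .1, .08, .05]),
    'price': rng.gamma(shape=2, scale=60, size=400),
})

# value_counts sorts by frequency, so the first five are the most common
top5 = df['neighborhood'].value_counts().index[:5].tolist()

fig, ax = plt.subplots(figsize=(4, 3))

# `order` both limits and orders the groups that get drawn
sns.boxplot(x='price', y='neighborhood', data=df, order=top5, ax=ax)
```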

### Practice other box plots

### Grouping with two numeric variables

* Grouping column can be numeric
* Set `orient` to 'h' to force seaborn to use other column as grouping column

### Grouping within groups with `hue`

* `hue` is second grouping column
* `hue_order` - specific order for hue categories
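
A sketch with two grouping columns on randomly generated data. The column names `room_type` and `superhost` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated data with hypothetical column names
rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({
    'room_type': rng.choice(['Entire home', 'Private room'], size=n),
    'superhost': rng.choice(['yes', 'no'], size=n),
    'price': rng.gamma(shape=2, scale=60, size=n),
})

fig, ax = plt.subplots(figsize=(4.5, 2.5))

# `y` is the first grouping column; `hue` splits each group a second time
sns.boxplot(x='price', y='room_type', hue='superhost', data=df,
            hue_order=['yes', 'no'], ax=ax)
```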

### Grouping with violin plots

* Set `cut=0` to limit each violin to the range of the observed data

### Splitting violin plots when there are exactly two unique categories
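
When the `hue` column has exactly two categories, `split=True` draws one half-violin per category. A sketch on randomly generated data with hypothetical column names:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated data with hypothetical column names
rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({
    'room_type': rng.choice(['Entire home', 'Private room'], size=n),
    'superhost': rng.choice(['yes', 'no'], size=n),
    'price': rng.gamma(shape=2, scale=60, size=n),
})

fig, ax = plt.subplots(figsize=(4.5, 2.5))

# Each violin shows 'yes' on one side and 'no' on the other
sns.violinplot(x='room_type', y='price', hue='superhost', data=df,
               split=True, cut=0, ax=ax)
```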

### Practice grouping and splitting with violin plots 

## Grouping and Aggregating Plots

The next category of plots groups and aggregates the data.

* One grouping variable
* One aggregating variable - summarized into a single number by a function (min, max, mean, median, etc.)

### Seaborn grouping and aggregating functions

* `barplot`
* `pointplot`
* `countplot`
* `lineplot`

### Bar plots

* Set `x` and `y` to grouping and aggregating columns
* Default aggregation is the mean
* Set title to clarify the aggregating function used

### Change aggregating function with the `estimator` parameter

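* `estimator` - pass a function; string names such as `'median'` are only accepted in newer seaborn versions (0.12+)
    * np.min
    * np.max
    * np.mean
    * np.median
* Set `ci=None` to remove the confidence interval (computed by bootstrapping); in seaborn 0.12+ `ci` is deprecated in favor of `errorbar=None`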

### Plot median price of top 10 neighborhoods
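
A sketch on randomly generated data, with the labels `N00` to `N11` standing in for neighborhood names. The confidence intervals are left at their defaults here, since the argument that removes them differs between seaborn versions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in for the Airbnb data
rng = np.random.default_rng(5)
hoods = [f'N{i:02d}' for i in range(12)]
df = pd.DataFrame({
    'neighborhood': rng.choice(hoods, size=600),
    'price': rng.gamma(shape=2, scale=60, size=600),
})

# Ten most common neighborhoods
top10 = df['neighborhood'].value_counts().index[:10].tolist()

fig, ax = plt.subplots(figsize=(4, 3.5))
sns.barplot(x='price', y='neighborhood', data=df, order=top10,
            estimator=np.median, ax=ax)
ax.set_title('Median price by neighborhood')

# The same numbers computed manually with pandas
medians = df.groupby('neighborhood')['price'].median().loc[top10]
```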

### Wrapping text

In [None]:
import textwrap
def wrap_labels(ax, width):
    labels = []
    for label in ax.get_xticklabels():
        text = label.get_text()
        labels.append(textwrap.fill(text, width=width, break_long_words=False))
    ax.set_xticklabels(labels, rotation=0)

We recreate the same plot, but make it wider and use the new labels at a smaller size.
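
A sketch of that recreation on a few made-up rows with long labels. It uses a variant of `wrap_labels` that pins the tick positions before replacing the labels; the label size of 7 is an arbitrary choice.

```python
import textwrap

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def wrap_labels(ax, width):
    """Wrap the x tick labels of `ax` at about `width` characters, keeping words whole."""
    labels = []
    for label in ax.get_xticklabels():
        text = label.get_text()
        labels.append(textwrap.fill(text, width=width, break_long_words=False))
    # Pin the tick positions before replacing their labels
    ax.set_xticks(ax.get_xticks())
    ax.set_xticklabels(labels, rotation=0)


# Made-up rows with long labels
names = ['Capitol Hill, Lincoln Park',
         'Dupont Circle, Connecticut Avenue/K Street',
         'Union Station, Stanton Park, Kingman Park']
df = pd.DataFrame({'neighborhood': names * 2,
                   'price': [120, 150, 110, 130, 170, 90]})

# A wider figure leaves room for the wrapped labels
fig, ax = plt.subplots(figsize=(6, 2.5))
sns.barplot(x='neighborhood', y='price', data=df, order=names, ax=ax)
wrap_labels(ax, width=10)
ax.tick_params(axis='x', labelsize=7)
```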

### Split again with `hue`

Set `hue` to be a categorical variable

### Point plots

Similar to barplot, but places a point at the calculated statistic and connects them with a line

* `scale` - relative size of the line and point
* `errwidth`, `capsize` - control confidence interval appearance

### Count plots

* `countplot` function can be thought of as a specific case of `barplot` 
    * Calculates the size of each group

In [None]:
fig, ax = plt.subplots(figsize=(3, 1.8))
sns.countplot(x='response_time', data=airbnb, ax=ax)
wrap_labels(ax, width=10)

### Use `hue` to split

### Line plots

* `lineplot` is similar to `pointplot` but does not draw markers at every point
* It only aggregates the `y` variable and expects `x` to be either numeric or datetime but not categorical
* It aggregates points with the same x

In [None]:
covid = pd.read_csv('../data/covid.csv', parse_dates=['date'])
covid

## Raw data plots

* `scatterplot` - can also aggregate
* `regplot`
* `heatmap`

### Scatter plots

* `scatterplot` - `x` and `y` usually numeric

In [None]:
ax = sns.scatterplot(x='longitude', y='latitude', data=airbnb, s=16, hue='neighborhood')
ax.legend(bbox_to_anchor=(1, 1))

### Mini-project - finding the listings within 1 mile of the White House

In [None]:
wh_coords = -77.0365, 38.8977
fig, ax = plt.subplots(figsize=(4, 4))
ax.set_aspect('equal')
sns.scatterplot(x='longitude', y='latitude', data=airbnb, s=16, ax=ax)

# easier to use matplotlib to plot a single point
ax.scatter(*wh_coords, marker='*', c='white', ec='red', lw=1.5, s=150)
ax.annotate('White House', xy=wh_coords, xytext=(-77.09, 38.86), 
            arrowprops={'arrowstyle': '->', 'shrinkB': 7, 'color': 'black'})
ax.set_title('Washington, D.C. Airbnb Listings');

### Estimate distance

In [None]:
dist_degree = ((airbnb['longitude'] - wh_coords[0]) ** 2 + 
               (airbnb['latitude'] - wh_coords[1]) ** 2) ** .5
dist_degree.head(3)

### Approximate miles per degree

In [None]:
miles_per_degree = 25000 / 360
miles_per_degree

### Find miles from White House

In [None]:
airbnb['miles_from_wh'] = (dist_degree * miles_per_degree).round(2)
airbnb['miles_from_wh'].head(3)

### Create boolean column 

In [None]:
airbnb['near_whitehouse'] = airbnb['miles_from_wh'] < 1

### Function to set up plotting of White House

In [None]:
def setup_wh_plot(figsize=(4, 4)):
    wh_coords = -77.0365, 38.8977
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_aspect('equal')
    ax.scatter(*wh_coords, marker='*', c='white', ec='red', lw=1.5, s=150, zorder=3)
    ax.set_title('Airbnb Listings Near White House')
    return ax

### Use `hue` to color

In [None]:
ax = setup_wh_plot()
sns.scatterplot(x='longitude', y='latitude', data=airbnb, s=16, ax=ax, hue='near_whitehouse')

## Set color, style, and size by other variables

* `hue` - column to color by - use with `hue_order`, `hue_norm`
* `style` - column to style by - use with `style_order`
* `size` - column to size by - use with `size_order`, `size_norm`
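
A sketch that uses all three at once on randomly generated coordinates. The columns `room_type` and `price` and the `sizes` range are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated points with hypothetical columns
rng = np.random.default_rng(6)
n = 150
room_types = ['Entire home', 'Private room', 'Shared room']
df = pd.DataFrame({
    'longitude': rng.uniform(-77.10, -76.95, size=n),
    'latitude': rng.uniform(38.85, 38.98, size=n),
    'room_type': rng.choice(room_types, size=n),
    'price': rng.gamma(shape=2, scale=60, size=n),
})

fig, ax = plt.subplots(figsize=(5, 4))

# Color and marker shape follow a categorical column,
# marker size follows a numeric column
sns.scatterplot(x='longitude', y='latitude', data=df,
                hue='room_type', hue_order=room_types,
                style='room_type', style_order=room_types,
                size='price', sizes=(10, 80), ax=ax)
```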