<img style="float: right;" width="120" src="../Images/supplier-logo.png">
<img style="float: left; margin-top: 0" width="80" src="../Images/client-logo.png">
<br><br><br>

# Synopsis

This notebook presents an overview of plotting data and, more importantly, the benefits of visualizing data and the pitfalls of not doing so.

It starts with the most common plotting package used in Python, `matplotlib`, and then covers `seaborn` and the plotting functions built into `pandas`.

The data it uses is called **Anscombe's quartet**
- developed by Frank Anscombe
- designed to show the importance of statistical graphs
- contains 4 sets of data
- each set contains 2 continuous variables
- each set has (almost exactly) the same mean, variance, correlation and regression line

Only when the data is visualized does it become apparent that the datasets do not follow the same pattern.
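
The claim about matching statistics can be checked without any plotting library. The sketch below is not part of the original notebook: it hard-codes the published values of the quartet (rather than reading the notebook's Excel file) and uses a hypothetical helper, `summarize`, to compute the mean, sample variance, Pearson correlation and least-squares line for each set.

```python
from statistics import mean, variance

# Published values of Anscombe's quartet, hard-coded so the sketch
# does not depend on the notebook's Excel file.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # sets I-III share the same x
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Hypothetical helper: summary statistics for one (x, y) dataset."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx  # least-squares slope of y on x
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(x),  # sample variance
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places, the four rows printed by the loop are hard to tell apart, which is why the plots later in the notebook are needed to see the differences.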



In [None]:
import pandas as pd

# format for floats
pd.options.display.float_format = '{:,.2f}'.format

df = pd.read_excel(io="../Data/plotting_data.xls", sheet_name='anscombe', index_col='ID')

# Matplotlib

`%matplotlib inline` is a **magic** command.<br>
It tells the notebook to embed matplotlib charts directly in the output of the cell that creates them.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# plot only dataset 'I'
fltr = df['dataset'] == 'I'
df_I = df[fltr]

# extract the x and y values
xI = df_I['x']
yI = df_I['y']

# plot
plt.plot(xI, yI)
plt.plot(xI, yI, 'o')

## Plot all 4 datasets on a single figure

- Extract the 4 datasets from the dataframe
- Create a figure & add 4 subplots to the figure
- Plot each dataset on its corresponding subplot
- Add titles, etc.

In [None]:
# Extract the 4 datasets from the dataframe

df_I = df[df['dataset'] == 'I']
df_II = df[df['dataset'] == 'II']
df_III = df[df['dataset'] == 'III']
df_IV = df[df['dataset'] == 'IV']

In [None]:
# Create a Figure
fig = plt.figure(figsize=(18,6))

# Add 4 subplots
# add_subplot takes (number of rows, number of columns, plot position)
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

# plot each dataset on its corresponding sub_plot
ax1.plot(df_I['x'], df_I['y'], 'o')
ax2.plot(df_II['x'], df_II['y'], 'o')
ax3.plot(df_III['x'], df_III['y'], 'o')
ax4.plot(df_IV['x'], df_IV['y'], 'o')

# Add some titles etc
ax1.set_title('Dataset 1')
ax2.set_title('Dataset 2')
ax3.set_title('Dataset 3')
ax4.set_title('Dataset 4')

# Title of entire figure
fig.suptitle("Anscombe Data")

# Use a tight layout
fig.tight_layout()

# Anatomy of a figure

<img style="float: left; margin-top: 0" width="300" src="../Images/subplot.png">
<img style="float: centre; margin-top: 0" width="300" src="../Images/figure.webp">
<br>


**NOTE**
- there is a difference between `axes` and `axis`
- the two words look and sound very similar
- in everyday English one is singular (axis) and one is plural (axes)
- in matplotlib they refer to different parts of a figure
- an axes (a single plot area) contains multiple axis objects
- here the axes contains an x-axis and a y-axis
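
As a minimal sketch of the distinction (not part of the original notebook), the snippet below creates a figure with two axes and inspects the x-axis and y-axis objects that one of them owns. The variable names are illustrative only.

```python
import matplotlib.pyplot as plt

# One figure holding two axes (two plot areas) side by side
fig, axes = plt.subplots(1, 2)
ax = axes[0]

# Each axes owns exactly one x-axis and one y-axis object
print(type(ax).__name__)
print(type(ax.xaxis).__name__, type(ax.yaxis).__name__)

# Labels can be set on the axis object directly;
# ax.set_xlabel("x") is the more common shortcut for the same thing
ax.xaxis.set_label_text("x")
ax.yaxis.set_label_text("y")
```

In practice most customisation goes through the axes methods (`set_xlabel`, `set_title` and so on), as the rest of this notebook does.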

# Plotting statistics using matplotlib

**Univariate** - a single variable<br>
**Bivariate** - two variables<br>
**Multivariate** - more than two variables


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

df.head()

## Histograms

In [None]:
df = pd.read_excel(io="../Data/plotting_data.xls", sheet_name='tips', index_col='ID')
x_data = df['total_bill']

hist_plot = plt.figure()

ax1 = hist_plot.add_subplot(1,1,1)
ax1.hist(x = x_data, bins=10)
ax1.set_title('Histogram of Total Bill')
ax1.set_xlabel('Total Bill')
ax1.set_ylabel('Frequency')

hist_plot.show()

## Scatterplot

In [None]:
scatter_plot = plt.figure()

x_data = df['total_bill']
y_data = df['tip']

ax1 = scatter_plot.add_subplot(1,1,1)

ax1.scatter(x = x_data, y = y_data)
ax1.set_title('Scatterplot of Total Bill vs Tips')
ax1.set_xlabel('Total Bill')
ax1.set_ylabel('Tips')

scatter_plot.show()

## Boxplots

Used when a discrete variable is plotted against a continuous variable

In [None]:
box_plot = plt.figure()

female_tips = df[df['sex'] == 'Female']['tip']
male_tips = df[df['sex'] == 'Male']['tip']

ax1 = box_plot.add_subplot(1,1,1)
ax1.boxplot(x = [female_tips, male_tips], labels=['Female', 'Male'])
ax1.set_title('Boxplot of Tips by Gender')
ax1.set_xlabel('Gender')
ax1.set_ylabel('Tips')

box_plot.show()

## Multivariate Data

- Plotting this is doable but tricky
- There is no single way to visualize such data
- The main question is how to distinguish values that belong to different groups

In [None]:
# recode gender to a colour value
def recode_gender(g):
    if g == 'Female':
        return 0
    else:
        return 1
    
df['Gender(Colour)'] = df['sex'].apply(recode_gender)

scatter_plot = plt.figure()

ax1 = scatter_plot.add_subplot(1,1,1)
ax1.scatter(x=df['total_bill'], y=df['tip'], c=df['Gender(Colour)'], alpha = 0.5)


ax1.set_title('Total Bill vs Tip - Coloured by Gender')
ax1.set_xlabel('Total Bill')
ax1.set_ylabel('Tip')

scatter_plot.show()

# Seaborn


In [None]:
import seaborn as sns

df = pd.read_excel(io="../Data/plotting_data.xls", sheet_name='tips', index_col='ID')

## Histograms

In [None]:
# this subplots function returns 2 values as a tuple
# a figure
# a subplot added to the figure

hist_plot, ax1 = plt.subplots()

ax1 = sns.distplot(a = df['total_bill'])
ax1.set_title('Total Bill Histogram with density plot')

plt.show()


In [None]:
# Same thing but without the density
# supply a kde argument and set it to False
hist_plot, ax1 = plt.subplots()

ax1 = sns.distplot(a = df['total_bill'], kde=False)
ax1.set_title('Total Bill Histogram')
ax1.set_xlabel('Total Bill')
ax1.set_ylabel('Frequency')

plt.show()


## Density Plot

**Kernel Density Estimation**

In [None]:
den_plot, ax1 = plt.subplots()

ax1 = sns.distplot(a = df['total_bill'], hist=False)
ax1.set_title('Total Bill Density')
ax1.set_xlabel('Total Bill')
ax1.set_ylabel('Unit Probability')

plt.show()

## Rug Plots

- 1-dimensional representation of a variable's distribution
- usually used with other plots to enhance visualization

In [None]:
hist_den_rug_plot, ax1 = plt.subplots()

ax1 = sns.distplot(a = df['total_bill'], rug=True)
ax1.set_title('Total Bill Histogram with Density and Rug Plot')
ax1.set_xlabel('Total Bill')

plt.show()

## Count Plots

In [None]:
count_plot, ax1 = plt.subplots()

ax1 = sns.countplot(x='day', data=df)
ax1.set_title('Count of Days')
ax1.set_xlabel('Day of Week')
ax1.set_ylabel('Frequency')

plt.show()

## Scatterplot

- `regplot` plots a scatterplot **and** fits a regression line
- use `fit_reg=False` to turn the regression line off
- recent versions of `seaborn` also provide `scatterplot` for a plain scatter

In [None]:
scatter_plot, ax1 = plt.subplots()

ax1 = sns.regplot(x='total_bill', y='tip', data=df)

ax1.set_title('Scatter Plot of Total Bill and Tip')
ax1.set_xlabel('Total Bill')
ax1.set_ylabel('Tip')

plt.show()
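
The effect of `fit_reg` can be seen by drawing the same data twice, side by side. This is a sketch only: the `demo` DataFrame holds a few made-up values as a hypothetical stand-in for the tips data, so the snippet runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for the tips data (illustrative values only)
demo = pd.DataFrame({
    "total_bill": [10.3, 16.9, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9],
    "tip": [1.7, 1.0, 3.5, 3.3, 3.6, 4.7, 2.0, 3.1],
})

fig, (ax_fit, ax_plain) = plt.subplots(1, 2, figsize=(10, 4))

# Default: scatter points plus a fitted regression line
sns.regplot(x="total_bill", y="tip", data=demo, ax=ax_fit)
ax_fit.set_title("fit_reg=True (default)")

# fit_reg=False: scatter points only
sns.regplot(x="total_bill", y="tip", data=demo, fit_reg=False, ax=ax_plain)
ax_plain.set_title("fit_reg=False")

fig.tight_layout()
```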

An alternative is to use `lmplot`<br>
`lmplot` calls `regplot`<br>
`lmplot` creates figures, `regplot` creates axes<br>

In [None]:
fig = sns.lmplot(x='total_bill', y='tip', data=df)

plt.show()

Or use `jointplot`<br>
`jointplot` creates a scatter that includes a univariate plot on each axis<br>
`jointplot` does not return axes and creates its own figure, so there is no need to create one first<br>
`jointplot` creates a `JointGrid` object

In [None]:
joint_grid = sns.jointplot(x='total_bill', y='tip', data=df)
joint_grid.set_axis_labels(xlabel='Total Bill', ylabel='Tip')
joint_grid.fig.suptitle(t='Joint Plot of Total Bill and Tip', fontsize=10, y=1.03)

## Hexbins

- Group points on a scatter plot into larger points
- In the same way that a `histogram` bins a variable to create bars
- `hexbin` bins two variables to create hexagons

In [None]:
hexbin = sns.jointplot(x="total_bill", y='tip', data=df, kind='hex')
hexbin.set_axis_labels(xlabel='Total Bill', ylabel='Tip')
hexbin.fig.suptitle(t='Hexbin Joint Plot of Total Bill and Tip', fontsize=10, y=1.03)

## 2D Density Plots

- similar to the univariate density plot, using `sns.kdeplot`
- Creates a density plot across a bivariate (2 variables)
- Can show just the bivariate
- or also show the individual univariates

In [None]:
# Just the bivariate
kde, ax1 = plt.subplots()

# seaborn 0.11+: x/y and fill replace the older data/data2 and shade arguments
ax1 = sns.kdeplot(x=df['total_bill'], y=df['tip'], fill=True) # toggle fill True/False
ax1.set_title('Kernel Density Plot of Total Bill & Tip')
ax1.set_xlabel("Total Bill")
ax1.set_ylabel('Tip')

plt.show()

In [None]:
# Include the univariates
kde_joint = sns.jointplot(x='total_bill', y='tip', data=df, kind='kde')

## Bar Plots

The default is to calculate the `mean` of each group<br>
Use the `estimator` parameter to pass in a different aggregation function<br>

In [None]:
bar, ax1 = plt.subplots()

ax1 = sns.barplot(x='time', y='total_bill', data=df)
ax1.set_title('Bar plot of average total bill for time of day')
ax1.set_xlabel('Time of day')
ax1.set_ylabel('Average total bill')

plt.show()
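
As a sketch of the `estimator` parameter, the snippet below plots the median instead of the mean by passing `np.median`. The `demo` DataFrame is a hypothetical stand-in with made-up values, chosen so that one large bill per group pulls the mean well above the median.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for the tips data (illustrative values only)
demo = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 12.0, 30.0, 20.0, 22.0, 60.0],
})

fig, ax = plt.subplots()

# estimator accepts a function that reduces each group to a single number
sns.barplot(x="time", y="total_bill", data=demo, estimator=np.median, ax=ax)

ax.set_title("Bar plot of median total bill for time of day")
ax.set_xlabel("Time of day")
ax.set_ylabel("Median total bill")
```

The median is less sensitive to a few unusually large bills than the mean, which is one reason to swap the estimator.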

## Box Plots

- Used to show multiple statistics
- e.g. quartiles, max, min, outliers etc

In [None]:
box, ax1 = plt.subplots()

ax1 = sns.boxplot(x='time', y='total_bill', data=df)
ax1.set_title('Box plot of total bill by time of day')
ax1.set_xlabel('Time of day')
ax1.set_ylabel('Total bill')

plt.show()

## Violin Plot

Similar to a box plot, but also shows the distribution of the data

In [None]:
violin, ax1 = plt.subplots()

ax1 = sns.violinplot(x='time', y='total_bill', data=df)
ax1.set_title('Violin plot of total bill by time of day')
ax1.set_xlabel('Time of day')
ax1.set_ylabel('Total bill')

plt.show()

## Pairwise Plots

- `pairplot` visualizes **ALL** pairwise relationships
- This creates a lot of redundant information, because the top half mirrors the bottom half
- Use `PairGrid` to manually assign different plots to the top half and bottom half

In [None]:
pair_grid = sns.PairGrid(data = df)

pair_grid = pair_grid.map_upper(sns.regplot) # or use plt.scatter
pair_grid = pair_grid.map_lower(sns.kdeplot)
pair_grid = pair_grid.map_diag(sns.distplot, rug=True)

plt.show()

## Multivariate Data & Colours

No standard way to show this.

Use colours to distinguish between genders, as in the previous matplotlib example

Using violin plot with and without `split` parameter

In [None]:
violin, ax1 = plt.subplots()

ax1 = sns.violinplot(x='time', y='total_bill', data=df, hue='sex')

plt.show()

In [None]:
violin, ax1 = plt.subplots()

ax1 = sns.violinplot(x='time', y='total_bill', data=df, hue='sex', split=True)

plt.show()

In [None]:
scatter = sns.lmplot(x='total_bill', y='tip', data=df, hue='sex', fit_reg=True)

In [None]:
fig = sns.pairplot(df, hue='sex')

# Pandas Plotting Functions

Pandas comes with its own built-in plotting functions<br>
The general format is usually
- `DataFrame.plot.PLOT_TYPE` <br>
- `Series.plot.PLOT_TYPE`
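
The pandas plotting functions are built on matplotlib and return matplotlib axes, so the titles and labels used earlier still apply. The sketch below illustrates this with a hypothetical `demo` DataFrame of made-up values. Passing `ax=` is optional and is used here only to place the two plots side by side.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the tips data (illustrative values only)
demo = pd.DataFrame({
    "total_bill": [10.3, 16.9, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9],
    "tip": [1.7, 1.0, 3.5, 3.3, 3.6, 4.7, 2.0, 3.1],
})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Series.plot.PLOT_TYPE: a histogram of one column
hist_ax = demo["total_bill"].plot.hist(bins=5, ax=ax_left)
hist_ax.set_title("Series.plot.hist")
hist_ax.set_xlabel("Total Bill")

# DataFrame.plot.PLOT_TYPE: a scatter plot of two columns
scatter_ax = demo.plot.scatter(x="total_bill", y="tip", ax=ax_right)
scatter_ax.set_title("DataFrame.plot.scatter")

fig.tight_layout()
```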

## Histograms

In [None]:
# a series
df['total_bill'].plot.hist()

In [None]:
# an entire DataFrame
df.plot.hist()

## Density Plots

In [None]:
# Series
df['tip'].plot.kde()

In [None]:
# DataFrame
cols=['total_bill', 'tip']
df[cols].plot.kde()

## Scatterplot

In [None]:

df.plot.scatter(x='total_bill', y='tip')

## Hexbin

Use `gridsize` argument where appropriate

In [None]:
df.plot.hexbin(x='total_bill', y='tip', gridsize=10)

## Box plot


In [None]:
df.plot.box()

# Seaborn Styles

Seaborn comes with some built-in styles.

To use these, call `sns.set_style` once (usually at the top of your code)

The built-in styles are `darkgrid`, `whitegrid`, `dark`, `white` and `ticks`

In [None]:
# Default Style
fig, ax1 = plt.subplots()

ax1 = sns.violinplot(x='time', y='total_bill', hue='sex', data=df, split=True)

plt.show()

In [None]:
# Darkgrid
sns.set_style('darkgrid')

fig, ax1 = plt.subplots()

ax1 = sns.violinplot(x='time', y='total_bill', hue='sex', data=df, split=True)

plt.show()

In [None]:
# Show a few styles

fig = plt.figure(figsize=(18,9))

styles = ['darkgrid', 'whitegrid', 'dark', 'white', 'ticks']

for idx, style in enumerate(styles):
    plot_position = idx + 1
    with sns.axes_style(style):
        ax = fig.add_subplot(2,3,plot_position)
        violin = sns.violinplot(x='time', y='total_bill', hue='sex', data=df, split=True)
        violin.set_title(style)

fig.tight_layout()
plt.show()