# Visualizing Data w/ Pandas, Matplotlib and Seaborn
---
**Author:** Robert Kelley  
**Version:** 1.1  
**Semester:** Spring 2022 
**Summary:**  

I developed this notebook so we could walk through the approaches for visualizing data.  The dataset for this notebook was obtained from: https://github.com/allisonhorst/palmerpenguins.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import statsmodels.api as sm

## Why is Visualization Important?

In [None]:
anscombe = sns.load_dataset('anscombe')

In [None]:
I = anscombe.query("dataset == 'I'")
II = anscombe.query("dataset == 'II'")
III = anscombe.query("dataset == 'III'")
IV = anscombe.query("dataset == 'IV'")

In [None]:
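# numeric_only=True skips the string 'dataset' column, which newer pandas versions refuse to average
print(I.mean(numeric_only=True), II.mean(numeric_only=True), III.mean(numeric_only=True), IV.mean(numeric_only=True), sep='\n')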
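
The column means are not the only statistics the Anscombe datasets share. As a minimal sketch, the cell below compares the mean, standard deviation and x/y correlation for datasets I and II. The values are typed in by hand so the cell runs offline; they are assumed to match what sns.load_dataset('anscombe') returns, and the names one, two and summary are only for illustration.

```python
import pandas as pd

# Anscombe datasets I and II, entered by hand (assumed to match seaborn's copy)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_one = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_two = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

one = pd.DataFrame({"x": x, "y": y_one})
two = pd.DataFrame({"x": x, "y": y_two})

# One column per dataset, one row per summary statistic
summary = pd.DataFrame({
    "I": {"mean_y": one.y.mean(), "std_y": one.y.std(), "corr_xy": one.x.corr(one.y)},
    "II": {"mean_y": two.y.mean(), "std_y": two.y.std(), "corr_xy": two.x.corr(two.y)},
})
print(summary.round(2))
```

The two columns come out nearly identical, which is why the summary numbers alone cannot tell these datasets apart and we need to plot them.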

In [None]:
plt.scatter(I.x, I.y)  # Then do II to show difference

Let's build a plot with subplots.

In [None]:
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(221)
ax1.scatter(I.x, I.y)
ax2 = fig.add_subplot(222)
ax2.scatter(II.x, II.y)
ax3 = fig.add_subplot(223)
ax3.scatter(III.x, III.y)
ax4 = fig.add_subplot(224)
ax4.scatter(IV.x, IV.y)
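
The same 2x2 grid can also be built with a single plt.subplots call and a loop. This is only a sketch: the random points are hypothetical stand-ins so the cell runs without the Anscombe data, and the panel titles and shared axes are additions that the original cell does not have.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in data: one (x, y) pair of arrays per panel
rng = np.random.default_rng(0)
panels = {
    name: (rng.uniform(4, 14, 11), rng.uniform(3, 13, 11))
    for name in ["I", "II", "III", "IV"]
}

# plt.subplots returns the figure and a 2x2 array of Axes in one call
fig, axes = plt.subplots(2, 2, figsize=(12, 6), sharex=True, sharey=True)

# axes.flat walks the grid row by row, so each dataset gets its own panel
for ax, (name, (xs, ys)) in zip(axes.flat, panels.items()):
    ax.scatter(xs, ys)
    ax.set_title(f"Dataset {name}")

fig.tight_layout()
```

Sharing the x and y axes keeps the four panels on the same scale, which makes differences in shape easier to compare.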

## Working with Other Data Sets
Seaborn provides several data sets you can learn with. You can list them with get_dataset_names().

In [None]:
sns.get_dataset_names()

And you can load them as follows:

In [None]:
tips = sns.load_dataset('tips')

## Line Charts

In [None]:
fmri = sns.load_dataset('fmri')

In [None]:
fmri.tail()

Let's grab the frontal-region stimulus signal for two subjects that we can plot.

In [None]:
s13 = fmri.query("subject=='s13' and region=='frontal' and event=='stim'")
s13 = s13.sort_values(by='timepoint')
s13.index = s13.timepoint
s1 = fmri.query("subject=='s1' and region=='frontal' and event=='stim'")
s1 = s1.sort_values(by='timepoint')
s1.index = s1.timepoint

Now let's plot.

In [None]:
fig = plt.figure(figsize=(12,6))
plt.plot(s13.signal, label='s13')
plt.plot(s1.signal, label='s1')
plt.title('Stimulus Signal at Various Timepoints')
plt.xlabel('Timepoint')
plt.ylabel('Signal')
plt.axis([0,18,-1,1])
plt.legend()  # label the two subjects so the lines can be told apart

## Bar Charts

In [None]:
mpg = sns.load_dataset('mpg')

In [None]:
mpg.head()

In [None]:
cylinders = mpg.cylinders.groupby(mpg.cylinders).count()

In [None]:
cylinders

In [None]:
cylinders.sort_index()
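
The groupby/count idiom above can also be written with value_counts, which pandas provides for exactly this kind of frequency table. A minimal sketch, using a short made-up Series in place of mpg.cylinders:

```python
import pandas as pd

# Hypothetical stand-in for mpg.cylinders
cyl = pd.Series([4, 8, 4, 6, 4, 8, 3], name="cylinders")

# Approach used in this notebook: group the column by itself and count
by_group = cyl.groupby(cyl).count()

# Equivalent shortcut: value_counts sorts by frequency, so re-sort by index
by_value = cyl.value_counts().sort_index()

print(by_value)
```

Both give one count per distinct value; value_counts just needs the extra sort_index() if you want the categories in numeric order.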

In [None]:
plt.barh(cylinders.index,cylinders.values)
plt.title('Cars/Cylinders')
# barh puts the categories on the y-axis and the counts on the x-axis
plt.xlabel('Number of Models')
plt.ylabel('Number of Cylinders')

In [None]:
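ax = cylinders.sort_index(ascending=False).plot(kind='barh')
# Set the labels on the Axes directly; how plot() maps xlabel/ylabel for barh has varied across pandas versions
ax.set_xlabel('Number of Vehicles')
ax.set_ylabel('Number of Cylinders')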

## Box/Whisker Plots

In [None]:
plt.boxplot(mpg.weight)

In [None]:
ax1 = mpg.boxplot(column='weight', by='cylinders')
ax1.set_title('')
ax1.set_ylabel('Weight')
ax1.grid(False)

## Histograms

In [None]:
mpg.columns

In [None]:
plt.hist(mpg.acceleration, bins=20)

In [None]:
print(mpg.acceleration.mean(), mpg.acceleration.std())

In [None]:
sns.displot(mpg.acceleration)

## Kernel Density Plots

In [None]:
sns.displot(mpg.acceleration, kind='kde')

In [None]:
sns.displot(mpg.mpg, kde=True)  # kde expects a boolean, not the string 'true'
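
To see what a kernel density plot is doing, here is a rough sketch of a Gaussian kernel density estimate in plain NumPy: every observation contributes a small bell curve and the curves are averaged. The function gaussian_kde_1d, the made-up sample data and the hand-picked bandwidth are all assumptions for illustration; seaborn chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump of width `bandwidth` per sample over `grid`."""
    samples = np.asarray(samples, dtype=float)
    # z[i, j] = scaled distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Made-up data standing in for a continuous column such as acceleration
rng = np.random.default_rng(0)
samples = rng.normal(loc=15.5, scale=2.8, size=200)

grid = np.linspace(0, 31, 621)
density = gaussian_kde_1d(samples, grid, bandwidth=1.0)

# A density should integrate to roughly 1 over the range of the data
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths away detail, much like changing the number of bins in a histogram.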

## Heatmaps

In [None]:
flights = sns.load_dataset("flights")

In [None]:
flights.head()

In [None]:
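# Keyword arguments are required by pandas 2.0+, which no longer accepts these positionally
flights = flights.pivot(index="month", columns="year", values="passengers")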

In [None]:
sns.heatmap(flights)
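
The pivot step is what turns the long table (one row per month and year) into the grid that the heatmap needs. A small sketch of the reshaping, using a made-up four-row frame with the same column names as the flights data:

```python
import pandas as pd

# Made-up long-format data: one row per (year, month) observation
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [10, 12, 11, 14],
})

# Rows become months, columns become years, cells hold the passenger counts
wide_df = long_df.pivot(index="month", columns="year", values="passengers")
print(wide_df)
```

Each cell of the wide table maps to one colored square in the heatmap. Note that the plain string months here sort alphabetically; whether the real flights data keeps calendar order depends on how its month column is stored.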

## Visualizing the Penguin Data

In [None]:
p = pd.read_csv('processed_penguins.csv')

In [None]:
p.sample(5)

Let's look at the distributions of the continuous data.

In [None]:
fig, axes = plt.subplots(2,3, figsize=(12,8))
sns.histplot(data=p, x="flipper_length_mm", kde=True, ax=axes[0,0])
sns.histplot(data=p, x="bill_length_mm", kde=True, ax=axes[0,1])
sns.histplot(data=p, x="bill_depth_mm", kde=True, ax=axes[0,2])
sns.histplot(data=p, x="body_mass_g", kde=True, ax=axes[1,0])
sns.histplot(data=p, x="delta_15", kde=True, ax=axes[1,1])
sns.histplot(data=p, x="delta_13", kde=True, ax=axes[1,2])

We can also look at these a different way with a pair plot.

In [None]:
sns.pairplot(p[['bill_length_mm','bill_depth_mm','flipper_length_mm', 'body_mass_g', 'species']], hue='species')

In [None]:
sns.pairplot(p[['delta_15', 'delta_13', 'island', 'species','sex']], hue='island')

Let's look at the box plots for the various physical measurements (not including the blood isotope columns, delta_15 and delta_13).

In [None]:
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(221)
ax1.boxplot(p.flipper_length_mm, notch=True, labels=['Flipper Length'], vert=False)
ax2 = fig.add_subplot(222)
ax2.boxplot(p.bill_depth_mm,  notch=True,labels=['Bill Depth'], vert=False)
ax3 = fig.add_subplot(223)
ax3.boxplot(p.bill_length_mm, notch=True, labels=['Bill Length'], vert=False)
ax4 = fig.add_subplot(224)
ax4.boxplot(p.body_mass_g, notch=True, labels=['Body Mass'], vert=False)
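
It helps to know what the box, whiskers and outlier points stand for. The sketch below works out those pieces by hand with NumPy, using the 1.5 x IQR rule that matplotlib's boxplot applies by default. The ten values are made up for illustration and are not penguin measurements.

```python
import numpy as np

# Made-up sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = values[(values < low_fence) | (values > high_fence)]

# Whiskers stop at the most extreme data points still inside the fences
inside = values[(values >= low_fence) & (values <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the upper whisker stops at 9 and the value 30 is flagged as an outlier, which is how a single extreme measurement would show up in the plots above.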

Line chart of observations per study day.

In [None]:
fig = plt.figure(figsize=(12,6))
p.study_day.groupby(p.study_day).count().plot(kind='line', ylabel='Number of Penguins Observed')

Bar Chart of different species.

In [None]:
fig = plt.figure(figsize=(5,3))
p.species.groupby(p.species).count().plot(kind='barh')

In [None]:
fig = plt.figure(figsize=(5,3))
p.sex.groupby(p.sex).count().plot(kind='barh', color='purple')

## Visualizing Linear Regression

In [None]:
X = p[['flipper_length_mm']]
y = p['body_mass_g']
X = sm.add_constant(X)

In [None]:
model = sm.OLS(y,X).fit()

In [None]:
sns.regplot(x='flipper_length_mm', y='body_mass_g', data=p, color='gray', line_kws={'color': 'blue'}, scatter_kws={})

In [None]:
sns.residplot(x='flipper_length_mm', y='body_mass_g', data=p, color='gray', scatter_kws={'alpha': .50} )
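
A residual plot shows what is left over after the fitted line is subtracted from the data. As a rough sketch of that calculation, the cell below fits a straight line with np.polyfit (swapped in for statsmodels to keep the example small) and computes the residuals. The flipper and mass arrays are randomly generated stand-ins, not the penguin data.

```python
import numpy as np

# Made-up data with a linear trend plus noise (not the real penguin values)
rng = np.random.default_rng(0)
flipper = rng.uniform(170, 230, 100)
mass = 50 * flipper - 5800 + rng.normal(0, 400, 100)

# Least-squares straight line: mass is approximated by slope * flipper + intercept
slope, intercept = np.polyfit(flipper, mass, deg=1)
fitted = slope * flipper + intercept

# Residual = observed value minus the value predicted by the line
residuals = mass - fitted

print(round(float(residuals.mean()), 6))
```

For a least-squares line with an intercept, the residuals average out to zero and are uncorrelated with x, so a healthy residual plot looks like a shapeless cloud around the horizontal zero line. A curve or funnel shape suggests the straight-line model is missing something.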

In [None]:
p.columns