# Visualization in Matplotlib and Pandas

## Learning Objectives:
*After this lesson, you will be able to:*

* Understand the importance of using visualizations
* Identify appropriate plots for different situations
* Create bar charts, line charts, histograms, scatterplots, and more using Matplotlib and pandas


# Opening: Visualizations / Review of YSFB

Why are visualizations important?

[Anscombe's Quartet](https://blog.heapanalytics.com/anscombes-quartet-and-why-summary-statistics-dont-tell-the-whole-story/)
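
To make the point concrete, here is a minimal sketch that rebuilds Anscombe's four datasets by hand (the variable names `x123`, `quartet`, and `summary` are just illustrative). It prints the mean of x, the mean of y, and the correlation for each dataset, then plots them side by side. The summary statistics come out nearly identical, while the scatter plots look nothing alike.

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's quartet: datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Summary statistics: (mean of x, mean of y, Pearson correlation)
summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    summary[name] = (x.mean(), y.mean(), np.corrcoef(x, y)[0, 1])
    print(name, [round(value, 2) for value in summary[name]])

# The plots tell a very different story for each dataset
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    ax.scatter(x, y)
    ax.set_title('Dataset ' + name)
plt.show()
```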

Think back to Joseph Nelson's lecture last week, You Should ____ Blog. What should we consider when making visualizations?

Visualizations gone wrong: http://viz.wtf/

# Intro to Matplotlib

Matplotlib is a widely used, comprehensive visualization library for Python. Nearly every element of a figure can be customized, from colors and fonts to ticks and legends.

# Demo: Building a Histogram

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
url = 'https://raw.githubusercontent.com/josephofiowa/DAT8/master/data/drinks.csv'
drinks = pd.read_csv(url, header=0, names=drink_cols, na_filter=False)

In [None]:
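# North America is here now! na_filter=False keeps the continent code 'NA' as a string
# instead of letting pandas read it as a missing value.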
drinks.continent.value_counts()
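
Why does `na_filter=False` matter here? By default, `pd.read_csv` treats the string `NA` as a missing value, which would silently erase North America. The sketch below shows the difference on a tiny made-up CSV (the rows and the names `csv_text`, `default_read`, and `unfiltered_read` are illustrative, not taken from the drinks data).

```python
import io
import pandas as pd

# A tiny, made-up CSV where 'NA' is meant to be a continent code
csv_text = "country,continent\nCanada,NA\nFrance,EU\n"

# Default behaviour: 'NA' is parsed as a missing value (NaN)
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read['continent'].isna().sum())

# With na_filter=False, pandas keeps the raw string 'NA'
unfiltered_read = pd.read_csv(io.StringIO(csv_text), na_filter=False)
print(unfiltered_read['continent'].tolist())
```

Note that `na_filter=False` turns off missing-value detection for every column, so it is best suited to files that have no genuinely missing data.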

In [None]:
drinks.head()

In [None]:
plt.hist(drinks['beer'],bins=3)
plt.show()

In [None]:
# I can also plot this using pandas syntax

drinks.beer.plot(kind='hist',bins=3)
plt.show()

In [None]:
plt.hist(drinks['beer'])
plt.show()

In [None]:
# Add black separating lines
plt.hist(drinks['beer'],ec='black')
plt.show()

In [None]:
# HD! Render inline figures at retina (high-DPI) resolution
%config InlineBackend.figure_format = 'retina'

In [None]:
plt.hist(drinks['beer'],ec='black')
plt.show()

In [None]:
# This will allow me to generate figures without running plt.show()

%matplotlib inline

In [None]:
# Let's make it a beer-ish color: https://matplotlib.org/users/colors.html
plt.hist(drinks['beer'], ec='black', color='goldenrod')

In [None]:
# Use plt.xlabel and plt.ylabel to label axes
# plt.title allows us to add a title. Bonus Activity: Come up with a better title than this!
# (Think: Joseph's YSFB lecture)
plt.hist(drinks['beer'], ec='black', color='goldenrod')
plt.xlabel('beer')
plt.ylabel('frequency')
plt.title('Beer consumption')

In [None]:
# Increase figure size
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 17

In [None]:
plt.hist(drinks['beer'], ec='black', color='goldenrod')
plt.xlabel('beer')
plt.ylabel('frequency')
plt.title('Beer consumption')

In [None]:
# You can see all the available styles matplotlib has to offer
plt.style.available

In [None]:
# Set a style for the notebook
# (Matplotlib 3.6 renamed the seaborn styles; older versions call this one 'seaborn-white')
plt.style.use('seaborn-v0_8-white')

In [None]:
plt.hist(drinks['beer'], ec='black', color='goldenrod')
plt.xlabel('beer')
plt.ylabel('frequency')
plt.title('Beer consumption')

In [None]:
plt.hist(drinks['beer'], ec='black', color='goldenrod', bins=20)
plt.xlabel('beer')
plt.ylabel('frequency')
plt.title('Beer consumption')

In [None]:
plt.hist(drinks['beer'], ec='black', color='goldenrod', bins=20, alpha = 0.5)
plt.hist(drinks['wine'], ec='black', color='plum', bins=20, alpha = 0.5)
plt.xlabel('Drinks')
plt.ylabel('Frequency')
plt.title('Wine > Beer')

In [None]:
# Adding a legend (don't forget the label argument in each histogram)
# Making the legend, the title, and the axes labels bigger

plt.hist(drinks['beer'], ec='black', color='goldenrod', bins=20, alpha = 0.5, label = 'beer')
plt.hist(drinks['wine'], ec='black', color='plum', bins=20, alpha = 0.5, label = 'wine')
plt.legend(fontsize = 'large') # Legend
plt.xlabel('Drinks', fontsize = 'xx-large')
plt.ylabel('Frequency', fontsize = 'xx-large')
plt.title('Wine > Beer', fontsize = 'xx-large')
plt.show()


#Hmmmm, the bins don't match up. Why might this be?

In [None]:
# Hint hint hint: compare the largest value in each column

print(max(drinks['beer']))
print(max(drinks['wine']))

In [None]:
# Creating list of multiples of 20 so that bins for beer and wine are the same
binz = [i * 20 for i in range(20)]

plt.hist(drinks['beer'], ec='black', color='goldenrod', bins=binz, alpha = 0.5, label = 'beer')
plt.hist(drinks['wine'], ec='black', color='plum', bins=binz, alpha = 0.5, label = 'wine')
plt.legend(fontsize = 'large') # Legend
plt.xlabel('Drinks', fontsize = 'xx-large')
plt.ylabel('Frequency', fontsize = 'xx-large')
plt.title('Wine > Beer', fontsize = 'xx-large')
plt.show()
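
Why does passing a list of edges fix the mismatch? When `bins` is an integer, the edges are spread evenly between each dataset's own minimum and maximum, so two columns with different ranges get different bin widths. The sketch below demonstrates this with `np.histogram` (which `plt.hist` uses under the hood) on two small made-up arrays; the values and variable names are illustrative only.

```python
import numpy as np

# Two made-up arrays with slightly different maximums (not the drinks data)
a = np.array([0.0, 100.0, 376.0])
b = np.array([0.0, 50.0, 370.0])

# Integer bins: edges are computed from each array's own min and max
_, edges_a = np.histogram(a, bins=20)
_, edges_b = np.histogram(b, bins=20)
print(edges_a[1] - edges_a[0])  # bin width for a
print(edges_b[1] - edges_b[0])  # bin width for b
print(np.allclose(edges_a, edges_b))

# Explicit edges: both histograms share the same bins
shared_edges = np.arange(0, 401, 20)  # 0, 20, ..., 400
_, shared_a = np.histogram(a, bins=shared_edges)
_, shared_b = np.histogram(b, bins=shared_edges)
print(np.array_equal(shared_a, shared_b))
```

Any sequence of edges works as long as it covers the full range of both columns; values outside the outermost edges are simply not counted.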

In [None]:
## Bonus activity: Make the tick labels on the x-axis line up with the edges that separate the bins.
## I.e., instead of 0, 50, 100, ..., 400, make them 20, 40, 60, 80, ..., 400



# Demo: Other plots in Matplotlib

In [None]:
# Smoothed curve showing the distribution of all our observations using pandas plotting.
drinks.beer.plot(kind='density', xlim=(0, 500))
plt.xlabel('beer')

In [None]:
# Scatter plot
plt.scatter(x=drinks['beer'], y=drinks['wine'])
plt.xlim(0,400)
plt.ylim(0,400)

In [None]:
# Scatter plot using pandas

drinks.plot(kind='scatter', x='beer', y='wine', xlim=(0,400), ylim=(0,400))

In [None]:
plt.scatter(x=drinks['beer'], y=drinks['wine'], alpha=0.3)
plt.xlim(0,400)
plt.ylim(0,400)

In [None]:
# Use a colormap to add another dimension to data
drinks.plot(kind='scatter', x='beer', y='wine', c='spirit', colormap='Blues', xlim=(0,400), ylim=(0,400))

In [None]:
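# We can do a scatter matrix to look at relationships among several variables. This can be helpful during EDA.
# (scatter_matrix lives in pd.plotting; the old top-level pd.scatter_matrix has been removed.)
pd.plotting.scatter_matrix(drinks[['beer', 'spirit', 'wine']])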

## Introducing: Seaborn!

In [None]:
import seaborn as sns

In [None]:
sns.pairplot(drinks)

In [None]:
# count the number of countries in each continent
drinks.continent.value_counts()

In [None]:
# compare with bar plot
drinks.continent.value_counts().plot(kind='bar')

In [None]:
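# calculate the mean alcohol amounts for each continent
# (numeric_only=True skips the text column 'country', which recent pandas versions refuse to average)
drinks.groupby('continent').mean(numeric_only=True)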

In [None]:
# side-by-side bar plots
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')

In [None]:
# drop the liters column
drinks.groupby('continent').mean(numeric_only=True).drop('liters', axis=1).plot(kind='bar')

In [None]:
# stacked bar plots
drinks.groupby('continent').mean(numeric_only=True).drop('liters', axis=1).plot(kind='bar', stacked=True)

## Box Plots

In [None]:
# Box Plot: show quartiles (and outliers) for one or more numerical variables
# Five-number summary:
# min = minimum value
# 25% = first quartile (Q1) = median of the lower half of the data
# 50% = second quartile (Q2) = median of the data
# 75% = third quartile (Q3) = median of the upper half of the data
# max = maximum value
# (More useful than mean and standard deviation for describing skewed distributions)
# Interquartile Range (IQR) = Q3 - Q1
# Outliers:
# below Q1 - 1.5 * IQR
# above Q3 + 1.5 * IQR
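
The definitions above can be checked by hand. This is a minimal sketch on a made-up sample (`sample` is illustrative, not the drinks data). It uses `np.percentile`, whose default linear interpolation can give slightly different quartiles than the "median of each half" rule described above.

```python
import numpy as np

# Made-up sample with one obviously extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# Quartiles (np.percentile interpolates linearly by default)
q1, q2, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Outlier fences: 1.5 * IQR beyond the first and third quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print('five-number summary:', sample.min(), q1, q2, q3, sample.max())
print('IQR:', iqr)
print('fences:', lower_fence, upper_fence)
print('outliers:', outliers)
```

In a default Matplotlib box plot, the whiskers extend to the most extreme data points that still fall inside these fences, and anything beyond them is drawn as an individual point.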

In [None]:
# show "five-number summary" for beer
drinks.beer.describe()

In [None]:
# compare with box plot
drinks.beer.plot(kind='box')

In [None]:
# include multiple variables
drinks.drop('liters', axis=1).plot(kind='box')

## Line Plots: show the trend of a numerical variable over time

In [None]:
url = 'https://raw.githubusercontent.com/josephofiowa/DAT8/master/data/ufo.csv'
ufo = pd.read_csv(url)

In [None]:
ufo.head()

In [None]:
# Hmmmm, we don't want time to be a string
ufo.dtypes

In [None]:
# Introducing datetime: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year

In [None]:
# Nice
ufo.dtypes

In [None]:
ufo.head()

In [None]:
ufo.Year.value_counts().sort_index().plot()
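
The chain above does three things: convert strings to datetimes, pull out the year, and count rows per year in chronological order. Here is a small sketch of the same steps on a made-up frame (the `sightings` data and the explicit `format` string are assumptions for illustration). The `sort_index()` call matters because `value_counts()` orders by count, not by year, and a line plot needs the x-axis in time order.

```python
import pandas as pd

# Made-up timestamps stored as strings
sightings = pd.DataFrame({
    'Time': ['6/1/1930 22:00', '6/30/1930 20:00', '2/15/1931 14:00',
             '6/1/1931 13:00', '4/18/1933 19:00'],
})

# Convert the strings to real datetimes, then extract the year
sightings['Time'] = pd.to_datetime(sightings['Time'], format='%m/%d/%Y %H:%M')
sightings['Year'] = sightings['Time'].dt.year

# Count rows per year and put the years in chronological order
per_year = sightings['Year'].value_counts().sort_index()
print(per_year)
```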

In [None]:
drinks.continent.value_counts().plot()
#value count of every continent

In [None]:
drinks.continent.value_counts()

## Line plots don't make sense if there's no logical order
<img src="https://s-media-cache-ak0.pinimg.com/originals/26/d8/88/26d888978d61cadf1834e10c40c0516c.jpg">

## Back to Box Plots

In [None]:
# What if we want to show one box plot for each group?

# Let's start by looking at a box plot of beer servings again
drinks.beer.plot(kind='box')

In [None]:
# box plot of beer servings grouped by continent
drinks.boxplot(column='beer', by='continent')


In [None]:
# box plot of all numeric columns grouped by continent
drinks.boxplot(by='continent')

In [None]:
# histogram of beer servings grouped by continent
drinks.hist(column='beer', by='continent')

In [None]:
# share the x axes
drinks.hist(column='beer', by='continent', sharex=True)

In [None]:
# share the x and y axes
drinks.hist(column='beer', by='continent', sharex=True, sharey=True)

In [None]:
# change the layout
drinks.hist(column='beer', by='continent', sharex=True, layout=(2, 3))

In [None]:
# Last but not least, saving figures:

drinks.beer.plot(kind='hist', bins=20, title='Histogram of Beer Servings')
plt.xlabel('Beer Servings')
plt.ylabel('Frequency')
plt.savefig('beer_histogram.png')
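
The same figure can also be built with Matplotlib's object-oriented interface, which makes it explicit which figure is being saved. This is a minimal sketch with made-up values (`values` is illustrative). It writes the PNG to an in-memory buffer so that no file is left behind; passing a filename such as `'beer_histogram.png'` to `fig.savefig` writes to disk instead.

```python
import io
import numpy as np
import matplotlib.pyplot as plt

# Made-up servings data standing in for a real column
values = np.array([0, 5, 20, 20, 45, 90, 130, 200, 250, 376])

# Create the figure and axes explicitly, then draw on the axes
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(values, bins=5, ec='black', color='goldenrod')
ax.set_xlabel('Beer Servings')
ax.set_ylabel('Frequency')
ax.set_title('Histogram of Beer Servings')

# dpi controls resolution; bbox_inches='tight' trims extra whitespace
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=150, bbox_inches='tight')
png_bytes = buffer.getvalue()
plt.close(fig)

print(len(png_bytes), 'bytes written')
```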

# Conclusion:

In [None]:
# This figure shows the names of the main matplotlib elements that make up a figure
# https://matplotlib.org/examples/showcase/anatomy.html
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator, MultipleLocator, FuncFormatter


np.random.seed(19680801)

X = np.linspace(0.5, 3.5, 100)
Y1 = 3+np.cos(X)
Y2 = 1+np.cos(1+X/0.75)/2
Y3 = np.random.uniform(Y1, Y2, len(X))

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1, aspect=1)


def minor_tick(x, pos):
    if not x % 1.0:
        return ""
    return "%.2f" % x

ax.xaxis.set_major_locator(MultipleLocator(1.000))
ax.xaxis.set_minor_locator(AutoMinorLocator(4))
ax.yaxis.set_major_locator(MultipleLocator(1.000))
ax.yaxis.set_minor_locator(AutoMinorLocator(4))
ax.xaxis.set_minor_formatter(FuncFormatter(minor_tick))

ax.set_xlim(0, 4)
ax.set_ylim(0, 4)

ax.tick_params(which='major', width=1.0)
ax.tick_params(which='major', length=10)
ax.tick_params(which='minor', width=1.0, labelsize=10)
ax.tick_params(which='minor', length=5, labelsize=10, labelcolor='0.25')

ax.grid(linestyle="--", linewidth=0.5, color='.25', zorder=-10)

ax.plot(X, Y1, c=(0.25, 0.25, 1.00), lw=2, label="Blue signal", zorder=10)
ax.plot(X, Y2, c=(1.00, 0.25, 0.25), lw=2, label="Red signal")
ax.plot(X, Y3, linewidth=0,
        marker='o', markerfacecolor='w', markeredgecolor='k')

ax.set_title("Anatomy of a figure", fontsize=20, verticalalignment='bottom')
ax.set_xlabel("X axis label")
ax.set_ylabel("Y axis label")

ax.legend()


def circle(x, y, radius=0.15):
    from matplotlib.patches import Circle
    from matplotlib.patheffects import withStroke
    circle = Circle((x, y), radius, clip_on=False, zorder=10, linewidth=1,
                    edgecolor='black', facecolor=(0, 0, 0, .0125),
                    path_effects=[withStroke(linewidth=5, foreground='w')])
    ax.add_artist(circle)


def text(x, y, text):
    ax.text(x, y, text, backgroundcolor="white",
            ha='center', va='top', weight='bold', color='blue')


# Minor tick
circle(0.50, -0.10)
text(0.50, -0.32, "Minor tick label")

# Major tick
circle(-0.03, 4.00)
text(0.03, 3.80, "Major tick")

# Minor tick
circle(0.00, 3.50)
text(0.00, 3.30, "Minor tick")

# Major tick label
circle(-0.15, 3.00)
text(-0.15, 2.80, "Major tick label")

# X Label
circle(1.80, -0.27)
text(1.80, -0.45, "X axis label")

# Y Label
circle(-0.27, 1.80)
text(-0.27, 1.6, "Y axis label")

# Title
circle(1.60, 4.13)
text(1.60, 3.93, "Title")

# Blue plot
circle(1.75, 2.80)
text(1.75, 2.60, "Line\n(line plot)")

# Red plot
circle(1.20, 0.60)
text(1.20, 0.40, "Line\n(line plot)")

# Scatter plot
circle(3.20, 1.75)
text(3.20, 1.55, "Markers\n(scatter plot)")

# Grid
circle(3.00, 3.00)
text(3.00, 2.80, "Grid")

# Legend
circle(3.70, 3.80)
text(3.70, 3.60, "Legend")

# Axes
circle(0.5, 0.5)
text(0.5, 0.3, "Axes")

# Figure
circle(-0.3, 0.65)
text(-0.3, 0.45, "Figure")

color = 'blue'
ax.annotate('Spines', xy=(4.0, 0.35), xycoords='data',
            xytext=(3.3, 0.5), textcoords='data',
            weight='bold', color=color,
            arrowprops=dict(arrowstyle='->',
                            connectionstyle="arc3",
                            color=color))

ax.annotate('', xy=(3.15, 0.0), xycoords='data',
            xytext=(3.45, 0.45), textcoords='data',
            weight='bold', color=color,
            arrowprops=dict(arrowstyle='->',
                            connectionstyle="arc3",
                            color=color))

ax.text(4.0, -0.4, "Made with http://matplotlib.org",
        fontsize=10, ha="right", color='.5')

# Resources:

[DataCamp Intro to Visualization](https://campus.datacamp.com/courses/introduction-to-data-visualization-with-python/)

[pythonprogramming.net](https://pythonprogramming.net/matplotlib-intro-tutorial/)

[Seaborn Documentation](http://seaborn.pydata.org/)

[Simple Plotting Tutorials from Matplotlib Documentation](https://matplotlib.org/users/pyplot_tutorial.html)


