# Overview
In this lecture we will learn about different methods for visualizing data, and the exploratory data analysis (EDA) that you can do purely with visual techniques. Some specific topics we will cover:
- Basic plotting with matplotlib and seaborn: scatter plots, histograms, box plots, time series
- Exploratory Data Analysis
- Parameter settings & $\LaTeX$ rendering for publication-quality plots
- Multi-panel plots
- 2D heatmaps
- 3D surface plots
- Saving videos

# The Basics

## Scatter plots
One of the most common plots you will make as a data scientist is a scatter plot, which helps you determine whether two variables are related to each other. In other words: as the value of one variable increases, does the value of the other variable also increase? Or does it decrease? Or remain approximately constant? We can generate some random data to explore this technique:

In [None]:
import numpy as np
import matplotlib.pyplot as plt  # Matplotlib is the standard plotting package, and this is the standard way to import it 

In [None]:
x = np.linspace(-10, 10, 100)  # 100 values linearly spaced between -10 and 10

In [None]:
x

In [None]:
noise = np.random.randn(len(x)) * 2  # Normally distributed random noise

In [None]:
y = 1.5 * x + noise

We have now defined two arrays, `x` and `y`, that are the same size (100 elements each). We can create an x vs y scatter plot like so:

In [None]:
plt.plot(x, y, 'o')

Or equivalently:

In [None]:
plt.scatter(x, y)

But if you turned in a plot like this for your homework, you would lose many points. Let's add some key elements, like labels. This would be the bare minimum required for an acceptable plot; we'll learn later about ways to make publication-quality graphics in matplotlib. 

In [None]:
plt.plot(x, y, 'o')
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Example Scatter Plot")

It seems pretty clear that `x` and `y` are positively correlated with each other. But what about `x` and `noise`?

In [None]:
plt.plot(x, noise, 'o')
plt.xlabel("X")
plt.ylabel("Gaussian Noise")
plt.title("Example Scatter Plot")

Not so much. Even without calculating any statistics, we can see visually that the two variables don't have a relationship to each other. 

But what if we wanted to quantify the level of correlation between the variables? For that, we can calculate the correlation coefficient:
$$ \rho_{xy} = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_{i=1}^n(x_i - \overline{x})^2}\sqrt{\sum_{i=1}^n(y_i - \overline{y})^2}} $$

The numpy function `corrcoef` calculates a 2x2 correlation matrix. The diagonal holds the correlation of each variable with itself ($x$ with $x$ and $y$ with $y$, both equal to 1), and the off-diagonal components hold the correlation between $x$ and $y$ (the matrix is symmetric, so both off-diagonal entries are the same):

In [None]:
corr = np.corrcoef(x, y)
corr

In [None]:
rho_xy = corr[0, 1] # extract from row 0, column 1
rho_xy

In [None]:
corr = np.corrcoef(x, noise)
corr

In [None]:
rho_xn = corr[0, 1]
rho_xn
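To connect the formula to the `corrcoef` output, here is a minimal sketch that evaluates the sums term by term and compares the result to numpy. The names `pearson_r`, `x_demo`, and `y_demo` and the seeded random generator are assumptions made for this example so that it is reproducible; they are not part of the cells above.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (an assumption for this example)
rng = np.random.default_rng(0)
x_demo = np.linspace(-10, 10, 100)
y_demo = 1.5 * x_demo + rng.normal(scale=2, size=x_demo.size)


def pearson_r(a, b):
    """Correlation coefficient computed directly from the formula."""
    da = a - a.mean()  # deviations from the mean of a
    db = b - b.mean()  # deviations from the mean of b
    return np.sum(da * db) / (np.sqrt(np.sum(da**2)) * np.sqrt(np.sum(db**2)))


r_manual = pearson_r(x_demo, y_demo)
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]
print(r_manual, r_numpy)  # the two values should agree to floating-point precision
```

The hand-rolled version is only for building intuition; in practice `np.corrcoef` is the more convenient choice.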

## Histograms
Sometimes we only have one variable to look at rather than two variables to compare to each other. In this case, a histogram is a good visualization tool. As a reminder: histograms split your variable up into intervals/ranges/bins of values, and then count how many values fall into each bin. The bins will be on the x-axis, and the counts will be on the y-axis.  
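Before plotting anything, it can help to see the binning and counting as plain numbers. The sketch below uses `np.histogram` on made-up, seeded data (the name `samples` is hypothetical) to show the bin edges and the count in each bin.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded so the sketch is reproducible
samples = rng.normal(scale=2, size=100)  # made-up data for illustration

# Split the range of the data into 5 equal-width bins and count values per bin
counts, edges = np.histogram(samples, bins=5)

# Each bin covers [left, right); the last bin also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count} values")
```

Note that there is always one more edge than there are bins, and the counts add up to the number of samples.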

### Plotting techniques

In [None]:
plt.hist(noise, bins="auto")
plt.xlabel("Noise Magnitude")
plt.ylabel("Counts")

Here, we used the optional argument `bins="auto"` to tell matplotlib to automatically determine how many bins to use. We can manually specify this with an integer if we want:

In [None]:
plt.hist(noise, bins=3)
plt.xlabel("Noise Magnitude")
plt.ylabel("Counts")

Or provide an array of bin edges, which sets the boundaries of each bin explicitly (values that fall outside the outermost edges are not counted):

In [None]:
plt.hist(noise, bins=np.arange(-5, 5, 0.1))
plt.xlabel("Noise Magnitude")
plt.ylabel("Counts")

As you can see, the histogram will look very different depending on how you bin it. A good histogram will give an accurate sense of the *distribution* of the data, e.g., unimodal vs bimodal, skewed vs symmetric. If you use too many bins, then most of the counts will be around 1. If you use too few bins, you'll just get a couple big bars. Something in between is what you want, and it can be more of an art than a science.
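If you are curious how many bins a given rule would pick, one way to check is `np.histogram_bin_edges`, which is the numpy routine that matplotlib hands the `bins` argument to. This is a rough sketch on made-up, seeded data; the names `samples` and `n_bins` are hypothetical, and the exact bin counts will depend on your data and numpy version.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded so the sketch is reproducible
samples = rng.normal(scale=2, size=100)  # made-up data for illustration

n_bins = {}
for rule in ("auto", "sturges", "fd", 3):
    # Returns the bin edges only, without counting anything
    edges = np.histogram_bin_edges(samples, bins=rule)
    n_bins[rule] = len(edges) - 1
    print(f"bins={rule!r}: {n_bins[rule]} bins")
```

The numpy documentation describes `"auto"` as a combination of the Sturges and Freedman-Diaconis (`"fd"`) estimators, so it is a reasonable default but not a substitute for looking at the plot yourself.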

One more note about histograms: instead of having counts on the y-axis, we often have *density*, which is a way to normalize the histogram bar heights such that the total area of all the bars (i.e., the area under the curve) is equal to 1. This is helpful because it means that the probability of sampling a value in a certain range can be estimated as the area under the curve between the endpoints of that range. 

In [None]:
plt.hist(noise, bins="auto", density=True)
plt.xlabel("Noise Magnitude")
plt.ylabel("Density")

We can confirm that the area under the curve is equal to 1:

In [None]:
density, bins, _ = plt.hist(noise, bins="auto", density=True)

In [None]:
y = density
dx = np.diff(bins)

In [None]:
np.sum(y * dx)

Another plotting package we will use is Seaborn. It has better integration with `pandas` than matplotlib, and it has nice built-in functionality to spruce up histograms:

In [None]:
import seaborn as sns

In [None]:
ax = sns.histplot(x=noise, stat="density", kde=True)
ax.set_xlabel("Noise Magnitude")

In the plot above, the smooth line that we add using the `kde=True` option is an estimate of the underlying distribution that the data are sampled from. This is calculated using *kernel density estimation*.
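To get a feel for what kernel density estimation does, here is a minimal sketch using `scipy.stats.gaussian_kde` as a stand-in. It is not necessarily the exact computation seaborn performs, and the names `samples`, `grid`, and `density_curve` are hypothetical. The idea is that a small Gaussian bump is placed on each data point and the bumps are summed into a smooth curve.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)  # seeded so the sketch is reproducible
samples = rng.normal(scale=2, size=100)  # made-up data for illustration

kde = gaussian_kde(samples)  # Gaussian kernels; bandwidth chosen by scipy's default rule

# Evaluate the smooth density estimate on a grid that extends past the data
grid = np.linspace(samples.min() - 3, samples.max() + 3, 400)
density_curve = kde(grid)

# Like a density histogram, the curve should enclose an area of about 1
area = np.sum(density_curve) * (grid[1] - grid[0])  # simple Riemann sum
print(area)
```

Plotting `grid` against `density_curve` with `plt.plot` would give a smooth line similar in spirit to the one seaborn draws.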

### Histogram properties
When we make a histogram of a variable, the shape tells us a lot about which values of that variable are most common. I'll add made-up labels to the plots to give a more concrete idea of what they could represent.

In [None]:
from scipy.stats import norm, skewnorm, uniform

We would classify this histogram as *symmetric* and *unimodal*. The distribution is centered near `0`, which means it is the most likely (expected) value, and larger/smaller values get increasingly less common. 

In [None]:
values = norm.rvs(loc=0, scale=1, size=1000)
ax = sns.histplot(x=values, stat="density", kde=True)
ax.set_xlabel("Forecast error in predicted wave height (m)")

This histogram is *right-skewed* or *positive-skewed*. We would say that it has a long tail -- though they aren't very likely, large positive values do appear. Large negative values are extremely rare or impossible. 

In [None]:
values = skewnorm.rvs(10, loc=0.5, scale=1, size=1000)
ax = sns.histplot(x=values, stat="density", kde=True)
ax.set_xlabel("Daily total rainfall (in.)")

Similarly, this one is *left-skewed* or *negative-skewed*. It has a long tail in the other direction. 

In [None]:
values = skewnorm.rvs(-3, loc=-1, scale=3, size=1000)
ax = sns.histplot(x=values, stat="density", kde=True)
ax.set_xlabel("Actively-managed portfolio returns versus S&P 500 (%)")

This distribution is *uniform*: any value within the range is equally likely to appear. 

In [None]:
values = uniform.rvs(loc=0, scale=1, size=1000)
ax = sns.histplot(x=values, stat="density", kde=True)
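The visual labels above can also be backed up with a number. As a rough sketch, `scipy.stats.skew` computes the sample skewness, which is positive for right-skewed data, negative for left-skewed data, and near zero for symmetric data. The distribution parameters match the cells above; the fixed `random_state` and the dictionary names are assumptions added for reproducibility.

```python
from scipy.stats import norm, skew, skewnorm, uniform

# Same distributions as in the histograms above, with a fixed seed
samples = {
    "symmetric": norm.rvs(loc=0, scale=1, size=1000, random_state=0),
    "right-skewed": skewnorm.rvs(10, loc=0.5, scale=1, size=1000, random_state=0),
    "left-skewed": skewnorm.rvs(-3, loc=-1, scale=3, size=1000, random_state=0),
    "uniform": uniform.rvs(loc=0, scale=1, size=1000, random_state=0),
}

# Sample skewness of each dataset
skewness = {name: skew(values) for name, values in samples.items()}
for name, value in skewness.items():
    print(f"{name:>12}: skewness = {value:+.2f}")
```

Notice that the uniform distribution also has skewness near zero: skewness measures asymmetry only, so it cannot distinguish a bell shape from a flat one.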

## Box plots
If you want to compare the distributions of multiple variables to each other, one way to do it would be to make multiple histograms and stack them on top of each other. This can get a little messy though, so it is more common to rely on Box Plots. Let's see how they work with the Iris dataset and Seaborn.

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

In [None]:
sns.boxplot(data=df, x="sepal_length", hue="species")

Here, we told seaborn to plot data contained in the dataframe `df`. We wanted the `sepal_length` column, and we wanted a separate box plot for each unique value of `species`, denoted by the `hue` argument. 

But what is the box plot actually showing? Let's discuss the components of the diagram below:

In [None]:
from IPython.display import Image
Image("boxplot.png")
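The pieces of a box plot can also be computed by hand. This sketch assumes the common convention (which is also the matplotlib/seaborn default, `whis=1.5`) that the whiskers extend to the most extreme data points within 1.5 times the interquartile range of the box, and that anything beyond is drawn as an outlier. The data are made up, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)  # seeded so the sketch is reproducible
values = rng.normal(loc=5.8, scale=0.8, size=150)  # made-up stand-in for one column

# The box: first quartile, median, and third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range = height of the box

# Fences at 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are plotted individually as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"box: {q1:.2f} to {q3:.2f}, median {median:.2f}")
print(f"whiskers: {whisker_low:.2f} to {whisker_high:.2f}, outliers: {len(outliers)}")
```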

## Breakout: Seaborn for Scatter Plots and Histograms
1. Use Seaborn to make scatter plots of some of the different variables in the Iris dataset. Which variables are correlated to each other? How do these correlations change when your scatter plots are separated by species?
2. Use Seaborn to make a single plot showing overlapping histograms for sepal length for the setosa and virginica species. How do the distributions of these variables differ from each other? How are they similar? 

## Time series data
So far we haven't seen any data with a temporal (time) component. Any dataset where the sampling time is relevant is called time series data. This requires special handling in Python, as we will demonstrate with a sample air quality dataset. 

In [None]:
df = pd.read_csv("air_quality_no2_long.csv")

In [None]:
df.head()

In [None]:
df = df.loc[df["city"] == "Paris"].copy()  # .copy() avoids a SettingWithCopyWarning when we modify a column later

In [None]:
df.info()

Let's say we want to plot the time (given by `date.utc`) vs the NO2 value (given by `value`). If we try to work with the data as it is read in, we run into trouble because the dates are interpreted as strings:

In [None]:
plt.plot(df["date.utc"], df["value"])

We will almost always need to convert times into a pandas Datetime type before working with them:

In [None]:
df["date.utc"] = pd.to_datetime(df["date.utc"])

In [None]:
df.info()

Trying again:

In [None]:
plt.plot(df["date.utc"], df["value"])

A little better, but still not great. Let's mess with some options:

In [None]:
fig = plt.figure()  # Initialize a figure object
plt.plot(df["date.utc"], df["value"])  # Make the plot
plt.xticks(pd.date_range(start="2019-05-08", end="2019-06-22", freq="5D"))  # Set the x-axis tick positions
plt.ylabel("NO2 concentration")
plt.title("Paris Air Quality")
fig.autofmt_xdate()  # Rotates and offsets the x-axis labels
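As a side note on why the datetime conversion matters, here is a small sketch with a made-up three-row table (the values and the names `raw`, `converted`, and `span` are hypothetical). Once the strings are converted, pandas can do date arithmetic and pull out components like the hour, which is not possible with plain strings.

```python
import pandas as pd

# Made-up rows that mimic the shape of the air quality file
raw = pd.DataFrame({
    "date.utc": [
        "2019-06-21 00:00:00+00:00",
        "2019-06-20 23:00:00+00:00",
        "2019-06-20 22:00:00+00:00",
    ],
    "value": [10.0, 12.5, 11.0],
})

print(raw["date.utc"].dtype)  # still text at this point

converted = pd.to_datetime(raw["date.utc"])
print(converted.dtype)  # now a timezone-aware datetime type

# Datetime-only operations: extracting components and subtracting times
print(converted.dt.hour.tolist())
span = converted.max() - converted.min()
print(span)
```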

# More Advanced Plotting
The techniques above will give you a lot of mileage for exploratory data analysis, but for professional purposes you will want to give your plots a bit more pizzazz. Let's start by making plots look better. 

## Plot params
We can improve a simple plot dramatically by modifying the default parameters that matplotlib uses. Most of these are just related to the font style and size. By default it is hard to read the labels on a matplotlib plot unless it is on a computer screen in front of you. Remember that your audience may be squinting at the screen from across the room, and they need to be able to read your graphics. 

In [None]:
params = {
    "axes.labelsize": 18, 
    "font.size": 18,
    "legend.fontsize": 16,
    "xtick.labelsize": 18,
    "ytick.labelsize": 18,
    "text.usetex": True,
    "font.family": "serif",
}
plt.rcParams.update(params)
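Keep in mind that `plt.rcParams.update` changes the settings for every plot made afterwards in the session. If you only want the settings for a single figure, one option is the `plt.rc_context` context manager, sketched below with a couple of the same parameters. The toy data and variable names are assumptions for illustration only.

```python
import matplotlib.pyplot as plt

size_before = plt.rcParams["axes.labelsize"]

# Settings passed here apply only inside the with-block
with plt.rc_context({"axes.labelsize": 18, "font.size": 18}):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4], "o")  # toy data
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    size_inside = plt.rcParams["axes.labelsize"]

# Outside the block, the previous settings are restored
size_after = plt.rcParams["axes.labelsize"]
print(size_before, size_inside, size_after)
plt.close(fig)  # free the figure once we are done with it
```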

In [None]:
import matplotlib.dates as mdate  # Package for formatting dates

In [None]:
fig, ax = plt.subplots()  # object-oriented figure initialization gives us more control over the options
ax.plot(df["date.utc"], df["value"], linewidth=1.5, alpha=0.8)  # Slightly thicker line with a bit of transparency
ax.set_xticks(pd.date_range(start="2019-05-08", end="2019-06-22", freq="5D"))
ax.set_ylabel(r"NO$_2$ concentration ($\mu$g/m$^3$)")  # Format labels with LaTeX; the dataset's unit column is µg/m³
ax.set_title("Paris Air Quality, 2019")
ax.xaxis.set_major_formatter(mdate.DateFormatter("%m/%d"))  # Custom x-tick label format
fig.autofmt_xdate()
fig.set_size_inches(10, 5)  # Optimal size will depend on the plot, but with time series you often want something pretty wide

## Multi-panel plots
What if we want multiple subplots in a single figure? Matplotlib makes this easy. First, though, I'm going to turn off TeX rendering because it's slow.

In [None]:
params = {
    "axes.labelsize": 18,
    "font.size": 18,
    "legend.fontsize": 16,
    "xtick.labelsize": 18,
    "ytick.labelsize": 18,
    "text.usetex": False,
    "font.family": "sans-serif",  # sans-serif is usually cleaner looking if you aren't rendering equations with TeX
}
plt.rcParams.update(params)

In [None]:
df_full = pd.read_csv("air_quality_no2_long.csv")
df_full["date.utc"] = pd.to_datetime(df_full["date.utc"])
cities = df_full["city"].unique()

In [None]:
fig, ax = plt.subplots(3, 1, sharex=True)  # Initialize a 3x1 grid of plots
for i, city in enumerate(cities):
    df = df_full.loc[df_full["city"] == city]
    ax[i].plot(df["date.utc"], df["value"], linewidth=1.5, alpha=0.8)  # Slightly thicker line with a bit of transparency
    ax[i].set_xticks(pd.date_range(start="2019-05-08", end="2019-06-22", freq="5D"))
    ax[i].set_ylabel(r"NO$_2$ concentration ($\mu$g/m$^3$)")  # With TeX off, $...$ is rendered by matplotlib's built-in mathtext
    ax[i].set_title(f"{city} Air Quality, 2019")
    ax[i].xaxis.set_major_formatter(mdate.DateFormatter("%m/%d"))  # Custom x-tick label format
fig.autofmt_xdate()
fig.set_size_inches(10, 15)  # Same width as before, but taller so each of the three stacked panels stays readable


## Heatmaps
Sometimes you want to visualize how a variable changes as a function of two other variables. For that, we need a heatmap, which Seaborn is particularly good at helping us create. 

In [None]:
flights = sns.load_dataset("flights")

In [None]:
flights.head()

In [None]:
flights = flights.pivot(index="month", columns="year", values="passengers") # Creates a 2d grid to plot

In [None]:
flights

In [None]:
ax = sns.heatmap(flights, cbar_kws={'label': 'Passengers'})
ax.set_title("Heatmap Flight Data")
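The `pivot` step is the part that tends to trip people up, so here is a tiny sketch of what it does to a made-up four-row table (the numbers and the names `long_df` and `wide_df` are hypothetical). It turns a "long" table with one row per observation into a "wide" 2D grid, which is the shape a heatmap needs.

```python
import pandas as pd

# Long format: one row per (year, month) observation, made-up values
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [10, 20, 30, 40],
})

# Wide format: months become rows, years become columns
wide_df = long_df.pivot(index="month", columns="year", values="passengers")
print(wide_df)

# Each cell is looked up by (row label, column label)
print(wide_df.loc["Jan", 1950])
```

Here the months are plain strings, so they come out in alphabetical order; in the seaborn `flights` dataset the month column is categorical, which keeps the calendar order.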

## Surface Plots
Surface plots come up often enough in optimization settings that it's worth seeing how they work. 

In [None]:
# Make data.
x = np.linspace(-8, 8, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)  # Turns 1d vectors x and y into 2d grids X and Y
Z = X**2 + Y**2  # Defines our values Z = f(X,Y)

In [None]:
X.shape

In [None]:
Y.shape

In [None]:
X

In [None]:
Y

In [None]:
Z
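The 100x100 grids are hard to read when printed, so here is the same idea on a grid small enough to inspect by eye. The `_small` names are hypothetical and used only for this sketch.

```python
import numpy as np

x_small = np.array([0, 1, 2])  # 3 values along x
y_small = np.array([10, 20])   # 2 values along y

X_small, Y_small = np.meshgrid(x_small, y_small)

print(X_small.shape)  # (2, 3): one row per y value, one column per x value
print(X_small)        # x values repeated down the rows
print(Y_small)        # y values repeated across the columns

# Elementwise operations evaluate f(x, y) at every grid point at once
Z_small = X_small**2 + Y_small**2
print(Z_small)
```

Every entry `Z_small[i, j]` is the function evaluated at the point `(x_small[j], y_small[i])`, which is the layout `plot_surface` expects.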

In [None]:
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap="coolwarm",
                       linewidth=0, antialiased=False)

ax.set_zticks(np.arange(0, 90, 20))
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")
ax.zaxis.labelpad=-2
# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=2, location="left")

plt.show()

## Animations
Too many animations in a presentation can be distracting, but every once in a while they can be a really effective (and impressive) way to visualize your data. 

In [None]:
import matplotlib.animation as animation

In [None]:
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Same data as our surface plot above
x = np.linspace(-8, 8, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)  # Turns 1d vectors x and y into 2d grids X and Y
Z = X**2 + Y**2  # Defines our values Z = f(X,Y)

def init():
    ax.plot_surface(X, Y, Z, cmap="coolwarm",
                       linewidth=0, antialiased=False)
    ax.set_zticks(np.arange(0, 90, 20))
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_zlabel("Z")
    ax.zaxis.labelpad=-2
    return fig,

def animate(i):
    ax.view_init(elev=10., azim=i)
    return fig,

# Animate
anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=360, interval=20, blit=False)  # Rotating the view redraws the whole axes, so blitting does not help here
# Save
anim.save('basic_animation.gif', fps=30)

## Geospatial Visualization
Unfortunately, we don't have time in this course to dig deep into geospatial data visualization. It can get very complicated very quickly, so I will simply recommend checking out the [plotly](https://plotly.com/python/maps/) graphing package if you are interested in visualizing geospatial data in Python. Here's an example showing the power of plotly:

In [None]:
import plotly
from urllib.request import urlopen
import json
import pandas as pd
import plotly.express as px

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv",
                   dtype={"fips": str})

fig = px.choropleth_map(df, geojson=counties, locations='fips', color='unemp',
                           color_continuous_scale="Viridis",
                           range_color=(0, 12),
                           map_style="carto-positron",
                           zoom=3, center = {"lat": 37.0902, "lon": -95.7129},
                           opacity=0.5,
                           labels={'unemp':'unemployment rate'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
plotly.offline.plot(fig, filename='unemployment_map.html')

# Breakout: Make a Nice Plot
We've now seen how to make lots of different types of plots using matplotlib and seaborn. In this breakout, you'll get a chance to practice these methods on a new dataset. Please do the following:
1. Load in a default dataset from seaborn using the [load_dataset](https://seaborn.pydata.org/generated/seaborn.load_dataset.html) functionality.
2. Explore the dataset using the pandas methods we have learned about (e.g., `df.head()`, `df.info()`, etc.)
3. Make a plot of *something* in the dataset. It could be a scatter plot showing a correlation between two variables, or a time series showing some trend over time, or a heatmap showing how a trend evolves with two different variables.
4. Make the plot prettier by updating the plot params.
5. Share your plot with your neighbor and explain what your plot is showing. 