# Some Things About Visualization

In [1]:
%matplotlib inline 
#this line above prepares IPython notebook for working with matplotlib


import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm #allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes


## Efficiencies in Visualization

![](images/longersteeper.jpeg)

*your answer here*


![](images/largerdarker.jpeg)

*your answer here*


![](images/biggervalue.jpeg)

*your answer here*


## mtcars dataset

 

The documentation for this data is [here](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) but I have extracted some relevant parts below:

```
Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Usage

mtcars

Format

A data frame with 32 observations on 11 variables.

[, 1]	mpg	Miles/(US) gallon
[, 2]	cyl	Number of cylinders
[, 3]	disp	Displacement (cu.in.)
[, 4]	hp	Gross horsepower
[, 5]	drat	Rear axle ratio
[, 6]	wt	Weight (1000 lbs)
[, 7]	qsec	1/4 mile time
[, 8]	vs	V/S
[, 9]	am	Transmission (0 = automatic, 1 = manual)
[,10]	gear	Number of forward gears
[,11]	carb	Number of carburetors

Source

Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
```

In [2]:
dfcars=pd.read_csv("data/mtcars-edited.csv")
dfcars.head()

There are multiple data types here:

![](images/datatypes.png)

In [26]:
dfcars.dtypes

We can construct the `av_mpg` series using the "split-apply-combine" paradigm: split the rows by maker, then summarize each group's `mpg` by its mean:

In [3]:
av_mpg = dfcars.groupby('maker').mpg.mean()
av_mpg
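To make the three steps concrete, here is a minimal sketch on a small made-up frame. The `toy` data and its values are hypothetical and are not taken from `mtcars-edited.csv`. It performs the split, apply, and combine steps by hand and compares the result with the `groupby` one-liner.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "maker": ["A", "A", "B", "B", "B"],
    "mpg": [20.0, 30.0, 10.0, 20.0, 30.0],
})

# Split: one sub-frame per maker.
groups = {maker: sub for maker, sub in toy.groupby("maker")}

# Apply: summarize each group with a mean.
means = {maker: sub["mpg"].mean() for maker, sub in groups.items()}

# Combine: collect the per-group results into one Series.
by_hand = pd.Series(means, name="mpg").rename_axis("maker")

# groupby does all three steps in one expression.
one_liner = toy.groupby("maker")["mpg"].mean()

print(by_hand)
print(one_liner)
```

The two printed series should hold the same per-maker means, which is all `groupby(...).mean()` is doing behind the scenes.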

In [4]:
import seaborn as sns

## Using Colors

Now we see matplotlib's default color palette, 'viridis'.

![](images/viridis.png)

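This is an example of a **perceptually linear** (also called perceptually uniform) colormap: equal steps in the data look like roughly equal steps in color. Rainbow colormaps, long the default in Matlab and used by countless other software, do not have this property.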

In [5]:
sns.palplot(sns.color_palette());

In [37]:
sns.palplot(sns.color_palette("viridis"))
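One rough way to see the difference is to check how brightness changes along each colormap. The sketch below is an illustration under stated assumptions: it uses matplotlib's `jet` as a stand-in for "the rainbow colormap", and it approximates brightness with the Rec. 709 luma weights rather than a proper perceptual lightness such as CIELAB L*. The helper `approx_luma` is a name made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

def approx_luma(cmap_name, n=16):
    """Rough brightness of n evenly spaced colors from a colormap.

    Applies the Rec. 709 luma weights to the RGB values. This is only a
    simple stand-in for a true perceptual lightness measure.
    """
    rgba = plt.get_cmap(cmap_name)(np.linspace(0, 1, n))
    return rgba[:, :3] @ np.array([0.2126, 0.7152, 0.0722])

viridis_luma = approx_luma("viridis")
jet_luma = approx_luma("jet")

# viridis gets steadily brighter from one end to the other;
# jet's brightness rises and then falls again.
print("viridis monotonic:", bool(np.all(np.diff(viridis_luma) > 0)))
print("jet monotonic:    ", bool(np.all(np.diff(jet_luma) > 0)))
```

A colormap whose brightness goes up and then down can make two very different data values look similarly bright, which is one reason rainbow maps are hard to read.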

How should you choose colors?

![](images/colorscales.png)

### Bar Charts

We see that Pandas series very niftily give us bar graphs. Pandas has many other chart types, and they all use the standard palette. As long as you are using lengths to represent data, you **do not** need more than one color. And length is a better encoding than color for numerical values!

In [6]:
plt.figure(figsize=(8, 10))
av_mpg.plot(kind="barh")

Here we are clearly using length to encode the numerical values, all in one color. What modes of visual expression are most efficient?

![](images/datatypesmodes.jpeg)

## Correlations and Diverging Palettes

Color is indeed not the most efficient encoding. But when the spatial dimensions are already *taken up* by something else, we must use it, as in a map, a correlation matrix, or a probability density in more than one dimension.

In [7]:
dfcars[['mpg', 'wt', 'hp', 'qsec', 'disp']].corr()

Since correlations range from -1 to 1 through 0, a diverging palette is probably our best bet.

In [9]:
dpal = sns.choose_colorbrewer_palette('diverging', as_cmap=True)

We use `matplotlib`'s `matshow` to plot the correlation matrix. You'll end up making plots like this both for EDA and to see misclassification from your machine learning algorithms. In other words, EDA is useful even at the analysis stage.

In [27]:
plt.matshow(dfcars[['mpg', 'wt', 'hp', 'qsec', 'disp']].corr(), cmap=dpal)
ax = plt.gca()
ax.tick_params(axis='both', which='both',length=0);
plt.title("Correlation Matrix")
plt.xticks(range(5), ['mpg', 'wt', 'hp', 'qsec', 'disp'])
plt.yticks(range(5), ['mpg', 'wt', 'hp', 'qsec', 'disp']);
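A diverging palette only reads correctly if its neutral midpoint sits at zero. By default `matshow` stretches the colormap between the smallest and largest value in the data, so the midpoint can drift away from 0. The sketch below shows one way to avoid that by pinning the color limits to the full correlation range. It uses made-up data and the built-in `"RdBu_r"` colormap as an assumption, in place of an interactively chosen palette.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: 'b' moves with 'a', 'c' moves against it.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),
    "c": -a + rng.normal(scale=0.5, size=200),
})
corr = toy.corr()

# Pin the color limits to [-1, 1] so that the neutral middle of the
# diverging colormap corresponds to a correlation of exactly 0.
im = plt.matshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
plt.colorbar(im)
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns);
```

With the limits fixed, the same color always means the same correlation, so matrices from different datasets can be compared side by side.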


### Density Plots and Sequential Palettes

Here we make a KDE plot of samples drawn from a multivariate normal distribution. Since a probability density is never negative, and values near 0 are not so interesting, a sequential palette is our ticket. Seaborn provides such a palette by default for KDE plots, but you can use your own!

In [12]:
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 1000)
df = pd.DataFrame(data, columns=["x", "y"])
df.head()

In [13]:
seqpal = sns.choose_colorbrewer_palette("sequential", as_cmap=True)

In [14]:
sns.kdeplot(x=df.x, y=df.y, cmap=seqpal, fill=True);
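To see why a one-directional color ramp fits here, it can help to compute a density estimate directly and look at its range. The following is a minimal sketch, not what seaborn does internally: it uses `scipy.stats.gaussian_kde` with its default bandwidth, a grid whose extent is chosen by hand, and the built-in `"Blues"` colormap, all of which are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Same distribution as above, drawn with a seeded generator.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal([0, 1], [(1, .5), (.5, 1)], 1000)

# gaussian_kde expects one row per dimension, hence the transpose.
kde = gaussian_kde(samples.T)

# Evaluate the estimated density on a regular grid.
xs = np.linspace(-4, 4, 80)
ys = np.linspace(-3, 5, 80)
xx, yy = np.meshgrid(xs, ys)
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

# The estimate never drops below 0, so a light-to-dark ramp is enough.
print("smallest density on the grid:", density.min())
plt.contourf(xx, yy, density, levels=10, cmap="Blues")
plt.colorbar(label="estimated density");
```

Because there is no meaningful midpoint to emphasize, only "more" versus "less", a sequential palette carries all the information.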

In 1-D you don't need such a palette: one color is enough, since you are once again using spatial position to encode the value. For a 2-D density you could use a 3-D plot instead of a colormap, but these are harder to interpret. For example, a good topographic map of mountains is often easier to read than a fake-3-D rendering. Indeed, a subtle use of color and other markings can render far better than 3-D:

![](https://s3.amazonaws.com/edwardtufte.com/Thinking%20Eye/SwissMap1.jpg)

In [31]:
sns.kdeplot(x=df.x, fill=True);

## The Interest in Science Problem: which graphical mode to use?

![](images/questionasked.png)

You are given this data. Let us think of some ways to display this...

In [15]:
science = {
    'interest': ['Excited', 'Kind of interested', 'OK', 'Not great', 'Bored'],
    'before': [19, 25, 40, 5, 11],
    'after': [38, 30, 14, 6, 12]
}
dfscience = pd.DataFrame.from_dict(science).set_index("interest")[['before', 'after']]
dfscience

### Multiple Pie Charts

In [16]:
fig, axs = plt.subplots(1,2, figsize = (10,6))
dfscience.before.plot(kind="pie", ax=axs[0], labels=None);
axs[0].legend(loc="upper left", ncol=5, labels=dfscience.index)
dfscience.after.plot(kind="pie", ax=axs[1], labels=None);


### Before and after bar charts

In [17]:
plt.figure(figsize=(12,8))
ax = plt.gca()
dfscience.plot(ax = ax, kind="bar");

### Stacked Charts


In [18]:
plt.figure(figsize=(12,8));
ax = plt.gca()
dfscience.plot(ax=ax, kind="barh", stacked=True);

This is hard to read. We want to compare before and after easily, but here they are stacked end to end within each bar. Sometimes the solution is a transpose!

In [19]:
dfscience.transpose()

We have to play some games to get the ordering right...

In [20]:
dfst = dfscience.transpose()
dfst.iloc[[1, 0],:]
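The reordering matters because a horizontal bar chart draws the first row at the bottom of the axis, so listing 'after' first puts 'before' on top. As a small sketch, the position-based selection used here and a label-based selection pick the same rows; the label-based form states the intent more directly. The names `by_position` and `by_label` are introduced only for this example.

```python
import pandas as pd

science = {
    'interest': ['Excited', 'Kind of interested', 'OK', 'Not great', 'Bored'],
    'before': [19, 25, 40, 5, 11],
    'after': [38, 30, 14, 6, 12]
}
dfscience = pd.DataFrame(science).set_index("interest")[['before', 'after']]
dfst = dfscience.transpose()

# Position-based: take row 1, then row 0.
by_position = dfst.iloc[[1, 0], :]

# Label-based: name the rows in the order we want them.
by_label = dfst.loc[['after', 'before'], :]

print(by_position.equals(by_label))
print(list(by_label.index))
```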

In [21]:
plt.figure(figsize=(12,8));
ax = plt.gca()
dfscience.transpose().loc[['after', 'before'], :].plot(ax=ax, kind="barh", stacked=True)
plt.legend(loc=2, ncol=5);

Now the shift is easy to see: the first two segments, 'Excited' and 'Kind of interested', take up much more of the 'after' bar. Try to improve this plot; it still has too much chartjunk.

### Difference Bar chart

Sometimes a small data transformation makes the effect blindingly obvious. Here we just make a bar chart of the change!

In [22]:
plt.figure(figsize=(12,8));
ax = plt.gca()
(dfscience.after - dfscience.before).plot(kind="barh", ax=ax);
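A possible refinement, sketched here as one option rather than the definitive version, is to sort the changes and draw a reference line at zero so that gains and losses separate visually. The series are rebuilt from the numbers above so the example runs on its own; the variable name `change` is introduced for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

labels = ['Excited', 'Kind of interested', 'OK', 'Not great', 'Bored']
before = pd.Series([19, 25, 40, 5, 11], index=labels)
after = pd.Series([38, 30, 14, 6, 12], index=labels)

# Change per category, sorted so the biggest loss is at the bottom
# of the chart and the biggest gain at the top.
change = (after - before).sort_values()

# Both columns add up to 100, so the gains and losses must cancel.
print(change.sum())

fig, ax = plt.subplots(figsize=(12, 8))
change.plot(kind="barh", ax=ax)
ax.axvline(0, color="black", linewidth=1);  # reference line at "no change"
```

Sorting keeps the single-color, length-based encoding while making the ranking of the changes immediate.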

### Slopegraph

In [23]:
dfst.head()

In [24]:
plt.figure(figsize=(10,8))
ax = plt.gca()
ax.tick_params(axis='both', which='both',length=0)
ax.set_yticklabels([])
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
for kind in ['Excited', 'Kind of interested', 'OK', 'Not great', 'Bored']:
    vals = dfst[kind]
    plt.plot(vals, 'o-')
    plt.annotate(kind+", "+str(vals['before']), xy = ('before', vals['before']+0.5), ha='right')
    plt.annotate(kind+", "+str(vals['after']), xy = ('after', vals['after']+0.5), ha='left')

plt.title("How do you feel about doing Science?");