###  Practice - Python for data scientists III - eda viz

Answer all **Questions**

References:  
https://matplotlib.org/users/index.html   
https://github.com/cs109  
Python Data Science Handbook, Jake VanderPlas, 2017.    
The Visual Display of Quantitative Information, 2001.  
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2017.
Applied Multivariate Statistical Analysis, 2015.  

In [0]:
# The %... is an IPython magic command, and is not part of the Python language.
# In this case we're telling the plotting library to draw things in
# the notebook instead of in a separate window.
%matplotlib inline 

import numpy as np # imports a fast numerical programming library
import scipy as sp # imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm # allows us easy access to colormaps
import matplotlib.pyplot as plt # sets up plotting under plt
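import pandas as pd # lets us handle data as dataframes
import seaborn as sns # statistical plots built on matplotlib (FacetGrid, PairGrid, kdeplot are used below)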

#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

### Getting the mtcars dataset into shape

 

The documentation for this data is [here](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) but I have extracted some relevant parts below:

```
Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Usage

mtcars
Format

A data frame with 32 observations on 11 variables.

[, 1]	mpg	Miles/(US) gallon
[, 2]	cyl	Number of cylinders
[, 3]	disp	Displacement (cu.in.)
[, 4]	hp	Gross horsepower
[, 5]	drat	Rear axle ratio
[, 6]	wt	Weight (1000 lbs)
[, 7]	qsec	1/4 mile time
[, 8]	vs	V/S
[, 9]	am	Transmission (0 = automatic, 1 = manual)
[,10]	gear	Number of forward gears
[,11]	carb	Number of carburetors
Source

Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
```

In [0]:
dfcars=pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/mtcars.csv")
dfcars.head()

#### Question 1

There is a poorly named column here. Change the "Unnamed: 0" column to "name".

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html
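
As a hint, here is a minimal sketch of how `DataFrame.rename` behaves on a toy frame. The frame and the column names `old_name` and `label` are made up for illustration and are not part of the mtcars data.

```python
import pandas as pd

# Toy frame with hypothetical column names (not the mtcars data).
toy = pd.DataFrame({"old_name": ["a", "b"], "value": [1, 2]})

# rename takes a mapping {old: new} and returns a new DataFrame;
# the original is left unchanged unless you reassign the result.
renamed = toy.rename(columns={"old_name": "label"})

print(toy.columns.tolist())
print(renamed.columns.tolist())
```

Note that the result has to be assigned back (or `inplace=True` used) for the change to persist.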


In [0]:
# your work here


In [0]:
dfcars.shape

#### Question 2

Parse out the car `maker` from column $0$, i.e., the column you just renamed, and create a new `maker` column with this information. Display the first 10 lines of this new column.
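
As a hint, pandas exposes vectorized string methods through the `.str` accessor. The sketch below takes the first word of each entry in a made-up Series; the data is illustrative only, and it assumes the piece you want is the first whitespace-separated token.

```python
import pandas as pd

# Made-up strings, used only to illustrate the .str accessor.
titles = pd.Series(["Data Science Handbook", "Visual Display", "Python"])

# Split each string on spaces, then take element 0 of each resulting list.
first_word = titles.str.split(" ").str[0]

print(first_word.head(10))
```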


In [0]:
# your work here


This is what the dataframe looks like now:

In [0]:
dfcars.head()

#### Question 3

Construct and display the `avg_mpg` series by using the "split-apply-combine" paradigm and summarizing within group data by a mean.

https://pandas.pydata.org/pandas-docs/version/0.23.4/groupby.html   

Your results should look similar to the following:

```
maker
AMC         15.200000
Cadillac    10.400000
Camaro      13.300000
Chrysler    14.700000
...
```

In [0]:
# Your work here


In [0]:
dfcars.hp.mean()

In [0]:
dfcars['hp'].mean()

In [0]:
dfcars.groupby('maker').hp.mean()

In [0]:
dfcars.groupby('maker')['hp'].mean()

In [0]:
g = dfcars.groupby('maker')
g['hp'].mean()

### Basic  Exploratory Data Analysis (EDA)  

Basic objectives for EDA:  

1. **Build** a DataFrame from the data (ideally, put all data into this object)
2. **Clean** the DataFrame. It should have the following properties:
    - Each row describes a single object
    - Each column describes a property of that object
    - Columns are numeric whenever appropriate
    - Columns contain atomic properties that cannot be further decomposed  
3. Explore **global properties**. Use histograms, scatter plots, and aggregation functions to summarize the data.
4. Explore **group properties**. Use groupby and small multiples to compare subsets of the data.

This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up in subsequent analysis.
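
The four objectives above can be sketched on a tiny, made-up table. This is only one possible sequence of checks, not a prescribed recipe; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# 1. Build: a made-up frame in which a numeric column arrived as strings.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "mpg": ["21.0", "22.8", "18.7", None],
    "cyl": [6, 4, 8, 4],
})

# 2. Clean: make the column numeric (None becomes NaN) and count missing values.
toy["mpg"] = pd.to_numeric(toy["mpg"])
missing = toy.isna().sum()

# 3. Global properties: summary statistics of the numeric columns.
summary = toy.describe()

# 4. Group properties: the same summary within each group (NaN is skipped).
group_means = toy.groupby("cyl")["mpg"].mean()

print(missing)
print(summary)
print(group_means)
```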

So far we have **built** the dataframe, and carried out very minimal cleaning (renaming) in this dataframe. 

### Exploring global properties

We are going to focus on visualizing global properties of the data set below. For now, we'll focus on `mpg` to illustrate the concepts, but you should be doing this for all the columns. It may identify interesting properties and even errors in the data.

While we do this, we will see several examples of the  `matplotlib` plotting experience.

Below, we set our matplotlib style to `ggplot`, which is modeled after an R library. Without this call, matplotlib uses its `default` style (the `classic` style reproduces the pre-2.0 look). Feel free to experiment with other styles:   

https://matplotlib.org/users/style_sheets.html


In [0]:
plt.style.use('ggplot')

#### Bar Charts

Matplotlib is accessible via Pandas series. We can use the `plot` function with `kind="barh"` to generate very nice horizontal bar charts.

In [0]:
avg_mpg.plot(kind="barh")

In [0]:
avg_mpg.plot(kind="barh")
plt.show() # we can remove the '<matplotlib.axes...' by adding a function that does not return anything.


#### Histograms

Numerical data leads to distributions, and distributions to histograms. Here is the Pandas default histogram:

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.hist.html   

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html  

In [0]:
#Using pandas interface:
dfcars.mpg.hist()
plt.xlabel("mpg");

And matplotlib interface:

In [0]:
plt.hist(dfcars.mpg.values);

#### Question 4

Generate a histogram of `mpg` with 50 bins. Add a blue vertical line at the mean `mpg`, extending to 75% of the plot height.

Your plot should look something like the following:  
    
<img src='hist_with_mean.png' width='500px'>    

In [0]:
# Your work here


We can add a kernel density estimate (KDE) to our histogram as follows:
    
https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.plot.kde.html

In [0]:
fig, ax = plt.subplots()
dfcars.mpg.hist(bins=10, density=True, ax=ax)
dfcars.mpg.plot.kde(ax=ax, legend=False, title='Car MPG')
plt.axvline(dfcars.mpg.mean(), 0, 0.75, color='b', label='Mean')
plt.xlabel("mpg");
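
The `density=True` argument matters when overlaying a KDE: it rescales the bar heights so that the total bar area is 1, which puts the histogram on the same scale as a density curve. A minimal sketch with `np.histogram` (which matplotlib's `hist` uses for binning) on a few illustrative numbers:

```python
import numpy as np

# Illustrative values only, not the full mpg column.
values = np.array([10.4, 15.2, 18.7, 21.0, 21.4, 22.8, 30.4, 33.9])

# Raw counts per bin, and the same bins normalized to a density.
counts, edges = np.histogram(values, bins=4)
density, _ = np.histogram(values, bins=4, density=True)

# density = counts / (n * bin width), so height * width sums to 1.
widths = np.diff(edges)
area = np.sum(density * widths)

print(counts, area)
```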



### Plotting features against other features

Sometimes we want to see co-variation amongst our columns. A scatter-plot does this for us.

In [0]:
plt.scatter(dfcars.wt, dfcars.mpg);

Usually we use `plt.show()` at the end of every plot to display it. The magic function `%matplotlib inline` takes care of this for us, so we don't have to do it in the Jupyter notebook. But if you run your Python program from a file, you will need an explicit call to `plt.show()`. It does not hurt to include it, and it suppresses the printed object reference.

In [0]:
plt.plot(dfcars.wt, dfcars.mpg, marker='o')
plt.show()

If we want to save our figure into a file, the `savefig` call needs to be in the same cell as the plotting commands. After running the cell, look for the saved files in the working directory.

In [0]:
plt.plot(dfcars.wt, dfcars.mpg, 'o', markersize=4, alpha=0.5)
plt.savefig('scatter1.png')
plt.savefig('scatter2.png', bbox_inches='tight') #less whitespace around image

In [0]:
from IPython.display import Image
Image('scatter2.png')

#### Trend

The correlation that we saw might suggest a trend. We can capture it with a "regression". We'll learn more about regressions soon, but we show a quadratic fit here with a 1 standard deviation bar to show the graphics aspect of this. Also see the Seaborn `sns.regplot`.

In [0]:
x = dfcars.wt
y = dfcars.mpg
params = np.polyfit(x, y, 2)
xp = np.linspace(x.min(), x.max(), 20)
yp = np.polyval(params, xp)
plt.plot(xp, yp, 'k', alpha=0.8, linewidth=1)
plt.plot(dfcars.wt, dfcars.mpg, 'o', markersize=4, alpha=0.5)
sig = np.std(y - np.polyval(params, x))
plt.fill_between(xp, yp - sig, yp + sig, 
                 color='k', alpha=0.2);

#### Question 5

Generate a scatter plot with a regression like the plot above for hp vs. mpg. Use 2 standard deviations. Please feel free to experiment.

Note the use of numpy polyfit above to fit a second-order polynomial to the data.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html
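
To see what `np.polyfit` returns, here is a small sketch on synthetic, noise-free data generated from a known quadratic. The coefficients come back ordered from the highest power to the constant term, which is the order `np.polyval` expects. The numbers are made up for illustration.

```python
import numpy as np

# Synthetic data from y = 2x^2 - 3x + 1 (no noise).
x = np.linspace(-2, 2, 9)
y = 2 * x**2 - 3 * x + 1

# Fit a second-order polynomial; params is ordered [x^2, x^1, x^0].
params = np.polyfit(x, y, 2)

# Evaluate the fit and measure the spread of the residuals,
# as done for the shaded band in the plot above.
y_hat = np.polyval(params, x)
sig = np.std(y - y_hat)

print(params, sig)
```

With real data the residual spread `sig` is not zero, and a band of `2 * sig` around the fitted curve gives the two-standard-deviation version.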

In [0]:
# your work here


### Group Properties

"Co-variational" plots, and single-variable plots, can be more interesting when we look at them *conditioned* upon the value of a categorical variable.

Such conditionality is behind the notion of grouping, where we group our data by various values of categorical variables, for example, whether our cars have an automatic transmission or not.

### Grouping of one outcome variable

The notion of grouping based on combinations of factors is used to make various easy-to-see exploratory visualizations for us. 

First, we make a boxplot of `mpg` for the whole sample, as a baseline before grouping by transmission style.

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html

In [0]:
# Create a figure instance
fig = plt.figure(1, figsize=(9, 6))

# Create an axes instance
ax = fig.add_subplot(111)

# Create the boxplot
bp = ax.boxplot(dfcars.mpg)
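
A grouped version can be sketched by passing one array per group to `ax.boxplot`. The frame below is made up for illustration (it is not the mtcars data); only the `0 = automatic, 1 = manual` coding follows the dataset description.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values; am uses the 0 = automatic, 1 = manual coding.
toy = pd.DataFrame({
    "mpg": [15.0, 17.5, 19.2, 21.0, 26.0, 30.4],
    "am": [0, 0, 0, 1, 1, 1],
})
names = {0: "automatic", 1: "manual"}

# Split the mpg values by transmission type.
labels, groups = [], []
for am_value, sub in toy.groupby("am"):
    labels.append(names[am_value])
    groups.append(sub["mpg"].values)

# One box per group; boxes are drawn at x = 1, 2, ...
fig, ax = plt.subplots(figsize=(6, 4))
bp = ax.boxplot(groups)
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(labels)
ax.set_ylabel("mpg")
```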


#### Question 6

Create boxplots for all of `mpg`, `hp`, and `disp` on a single plot.

In [0]:
# Your work here



One can see that the difference in mpg is more significant between 6 and 8 cylinder cars, for manual transmissions. And that the large-range effect in automatics is coming almost entirely through 4-cylinder cars.  

What about the better mpg for automatics? We can see how representative these are in our sample. We'll show this using a cross-tabulation. Note: We can combine the cross-tab with a graph.


In [0]:
pd.crosstab(dfcars.am, dfcars.cyl)
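
One way to combine a cross-tab with a graph is to plot the resulting table directly, since `pd.crosstab` returns an ordinary DataFrame. The sketch below uses a made-up frame, not the mtcars data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up transmission / cylinder combinations for illustration.
toy = pd.DataFrame({
    "am": [0, 0, 0, 1, 1, 1, 1],
    "cyl": [4, 8, 8, 4, 4, 6, 8],
})

# Rows are am values, columns are cyl values, cells are counts.
table = pd.crosstab(toy.am, toy.cyl)

# A stacked bar per row shows how each group splits across cylinders.
ax = table.plot(kind="bar", stacked=True)
ax.set_ylabel("number of cars")

print(table)
```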

#### Question 7

Examine the `dfcars` dataset. Create a cross tab of two parameters of your choosing.

In [0]:
# Your work here


### Faceting for general grouping

The Seaborn package, which is built on matplotlib, provides a nice construct: the `FacetGrid`. You decide which variables to facet over, and then decide the kind of plot you want. Here we want the hue to be `am`, and the columns of the plot grid to be the cylinder counts (`cyl`). We then ask for a scatter plot of `mpg` against `wt` in each facet.

https://seaborn.pydata.org/generated/seaborn.FacetGrid.html

Such plots are often called small multiple plots. They repeat the same plot based on categories, making sure that all plotting parameters are the same so that we have direct comparability.

In [0]:
g = sns.FacetGrid(dfcars, col="cyl", hue="am", palette="Set1")
g.map(plt.scatter, "mpg", "wt", alpha=0.5, s=10);

We can see that the "regression-like" effect is "cleanest" for automatic transmissions in 4 cylinder cars.

#### SPLOM, or Scatter Plot Matrix

Building a scatter plot by hand for every pair of continuously co-varying features can get tedious. The `PairGrid`, colorable by transmission type, lets us do this comparison for the 5 continuous features here, with the diagonal showing a kernel density estimate.

https://seaborn.pydata.org/generated/seaborn.PairGrid.html


In [0]:
g = sns.PairGrid(dfcars, vars=['mpg', 'hp', 'wt', 'qsec', 'disp'], hue="am")
g.map_diag(sns.kdeplot)
g.map_offdiag(plt.scatter, s=15)

In many places, for example `mpg` vs `disp`, you will see two separate trends for the different transmissions. This suggests adding a transmission term as an **indicator** variable in regressions for `mpg` against various features. This changes the intercept of the regression. But the trends have different slopes as well, which suggests that `disp` may interact with `am`, the transmission indicator, to create a varying slope as well.
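
To make the indicator and interaction idea concrete, here is a minimal least-squares sketch on synthetic, noise-free data. The coefficients used to generate the data are made up; the point is only that the `am` column shifts the intercept and the `disp * am` column shifts the slope.

```python
import numpy as np

# Synthetic data: mpg = 35 - 0.05*disp + 4*am - 0.02*disp*am (assumed values).
disp = np.array([100.0, 150.0, 200.0, 250.0, 300.0, 350.0])
am = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
mpg = 35.0 - 0.05 * disp + 4.0 * am - 0.02 * disp * am

# Design matrix: intercept, disp, indicator, interaction.
X = np.column_stack([np.ones_like(disp), disp, am, disp * am])
coef, *_ = np.linalg.lstsq(X, mpg, rcond=None)

intercept, slope, d_intercept, d_slope = coef
print("am=0 line:", intercept, slope)
print("am=1 line:", intercept + d_intercept, slope + d_slope)
```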

#### Question 8

Experiment with sns.PairGrid using coloring for categorical variables other than `am` and see if you can identify any changes in scatter plot pairs.

In [0]:
# Your work here



#### Correlation

The SPLOM seems to suggest correlations. We can calculate correlation with the Pandas `corr()` function.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html

In [0]:
dfcars[['mpg', 'wt', 'hp', 'qsec', 'disp']].corr()

Since correlations range from -1 to 1 through 0, a diverging palette is usually a good choice for visualizations.

In [0]:
dpal = sns.choose_colorbrewer_palette('diverging', as_cmap=True)

We can also plot the correlation matrix with `matplotlib`'s `matshow`. Matrix plots like this are helpful during EDA, and also later for seeing misclassifications from your machine learning algorithms, so EDA-style plots remain useful at the analysis stage.

In [0]:
plt.matshow(dfcars[['mpg', 'wt', 'hp', 'qsec', 'disp']].corr(), cmap=dpal)
ax = plt.gca()
ax.tick_params(axis='both', which='both',length=0);
plt.title("Correlation Matrix")
plt.xticks(range(5), ['mpg', 'wt', 'hp', 'qsec', 'disp'])
plt.yticks(range(5), ['mpg', 'wt', 'hp', 'qsec', 'disp']);
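
If the interactive palette chooser is not available, a non-interactive sketch is to use one of matplotlib's built-in diverging colormaps and pin the color limits to -1 and 1 so that zero sits at the neutral midpoint. The choice of `RdBu_r` and the random toy data below are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random toy data; column c is built to be strongly anti-correlated with a.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
toy["c"] = -toy["a"] + 0.1 * toy["c"]
corr = toy.corr()

# Diverging colormap with symmetric limits, so 0 maps to the middle color.
fig, ax = plt.subplots()
im = ax.matshow(corr.values, cmap="RdBu_r", vmin=-1, vmax=1)
fig.colorbar(im)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
```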


### KDE plots and sequential palettes.

We can make a KDE plot of a multivariate normal distribution. Since a probability density is strictly positive, with values near $0$ not being so interesting, a sequential palette is a good approach. Seaborn will by default provide such a palette for KDE plots, but you can also make your own.

In [0]:
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 1000)
df = pd.DataFrame(data, columns=["x", "y"])
df.head()

In [0]:
seqpal = sns.choose_colorbrewer_palette("sequential", as_cmap=True)

In [0]:
# Recent seaborn releases expect x= and y= keywords, and fill= in place of shade=.
sns.kdeplot(x=df.x, y=df.y, cmap=seqpal, fill=True);

### Matplotlib and multiple plots: Small Multiples

There are many cases where we want to see plots side by side. For example, SPLOMS and Facet grids. 

Here is an example of a plot with one column and 3 rows. 

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html



In [0]:
fig = plt.figure(figsize=(5, 9))

ax1 = fig.add_subplot(311)
ax1.plot([1, 2, 3], [1, 2, 3])
ax1.set_xticklabels([])
ax1.set_ylim([1.0, 3.0])

ax2 = fig.add_subplot(312)
ax2.scatter([1, 2, 3], [1, 2, 3])
ax2.set_xticklabels([])
ax2.set_ylim([1.0, 3.0])

ax3 = fig.add_subplot(313)
ax3.plot([1, 2, 3], [1, 2, 3])
ax3.set_ylim([1.0, 3.0])


fig.tight_layout()

### Small multiples, another approach

Here is another approach, which might be more straightforward than using `add_subplot`. It basically creates an array of plots and zips this array up with the various data grouped by categories.

In [0]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
print(axes)
print(axes.ravel())
carbs = ['==1', '==2', '==3', '>=4']
bins = np.arange(10, 30, 2)
for ax, carb in zip(axes.ravel(), carbs):
    data = dfcars.query("carb%s" % carb)
    print(data.shape)
    #ax.plot(data.wt, data.mpg, 'o', markersize=10, alpha=0.5)
    ax.hist(data.mpg, bins=bins, histtype='stepfilled', density=True, color='r', alpha=.3)  # density= replaces the removed normed=
    ax.annotate("carb"+str(carb), xy=(12, 0.35), fontsize=14)
    #ax.set_yticks([])
    ax.set_ylim((0,0.4))
    ax.set_xlabel('mpg');

#### Question 9

Take a few moments and re-examine the original dataset. Identify a couple of variables that you believe would be interesting to investigate. Generate subplots (small multiples) for different values of one of the two variables, similar to the plot above.

In [0]:
# Your work here


#### Question 10

Create one additional plot you believe would be relevant to understanding the dataset. You may use any combination of variables and plot type.

In [0]:
# Your work here
