# W9 Lab Assignment

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss

sns.set_style('white')

%matplotlib inline

# KDE

Import the IMDb data.

In [None]:
movie_df = pd.read_csv('imdb.csv', delimiter='\t')
movie_df.head()

We can plot a histogram and a KDE using pandas:

In [None]:
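# density=True normalizes the histogram so it is comparable with the KDE curve
# (the older normed= argument has been removed from matplotlib)
movie_df['Rating'].hist(bins=10, density=True)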
movie_df['Rating'].plot(kind='kde')

Or using seaborn:

In [None]:
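# distplot() is deprecated in recent seaborn releases; histplot() with kde=True is the replacement
sns.histplot(movie_df['Rating'], bins=10, stat='density', kde=True)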

Can you plot the histogram and KDE of the log of movie votes?

In [None]:
# TODO: implement this using pandas

In [None]:
# TODO: implement this using seaborn

We can get a random sample using pandas' [**`sample()`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) function. The [**`kdeplot()`**](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.kdeplot.html) function in seaborn provides many options (like kernel types) for KDE.

In [None]:
f = plt.figure(figsize=(15,8))
plt.xlim(0, 10)

sample_sizes = [10, 50, 100, 500, 1000, 10000]
for i, N in enumerate(sample_sizes, 1):
    plt.subplot(2,3,i)
    plt.title("Sample size: {}".format(N))
    for j in range(5):
        s = movie_df['Rating'].sample(N)
        # Gaussian kernel is the default; recent seaborn releases support only this kernel
        sns.kdeplot(s, legend=False)
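
To see what the KDE is doing numerically, here is a minimal sketch that uses scipy's `gaussian_kde` directly. It runs on a synthetic stand-in for the ratings (an assumption, not the IMDb data), and the names `population`, `grid`, and `areas` are made up for this illustration. For each sample size it prints the location of the density peak and the area under the estimated curve, which should be close to 1.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)            # fixed seed so the sketch is repeatable
grid = np.linspace(0, 10, 201)            # evaluation points on a 0-10 scale

# Synthetic stand-in for a ratings column (an assumption, not the IMDb data)
population = np.clip(rng.normal(loc=6.5, scale=1.2, size=20000), 0, 10)

areas = {}
for n in [10, 100, 10000]:
    sample = rng.choice(population, size=n, replace=False)
    density = gaussian_kde(sample)(grid)   # Gaussian-kernel density on the grid
    # Trapezoid rule: the estimated density should integrate to roughly 1
    areas[n] = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))
    print(n, round(float(grid[density.argmax()]), 2), round(areas[n], 3))
```

Small samples tend to give a peak that moves around from draw to draw, which is the same effect the subplots show visually.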

# Regression

Remember [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)? Let's plot the four datasets and do linear regression, which can be done with scipy's [**`linregress()`**](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html) function.

**TODO**: display the fitted lines using the [**`text()`**](http://matplotlib.org/users/text_intro.html) function.

In [None]:
X1 = [10.0, 8.0,  13.0,  9.0,  11.0, 14.0, 6.0,  4.0,  12.0,  7.0,  5.0]
Y1 = [8.04, 6.95, 7.58,  8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

X2 = [10.0, 8.0,  13.0,  9.0,  11.0, 14.0, 6.0,  4.0,  12.0,  7.0,  5.0]
Y2 = [9.14, 8.14, 8.74,  8.77, 9.26, 8.10, 6.13, 3.10, 9.13,  7.26, 4.74]

X3 = [10.0, 8.0,  13.0,  9.0,  11.0, 14.0, 6.0,  4.0,  12.0,  7.0,  5.0]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15,  6.42, 5.73]

X4 = [8.0,  8.0,  8.0,   8.0,  8.0,  8.0,  8.0,  19.0,  8.0,  8.0,  8.0]
Y4 = [6.58, 5.76, 7.71,  8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

data = [ (X1,Y1),(X2,Y2),(X3,Y3),(X4,Y4) ]

plt.figure(figsize=(10,8))

for i,p in enumerate(data, 1):
    X, Y = p[0], p[1]
    plt.subplot(2, 2, i)
    plt.scatter(X, Y, s=30, facecolor='#FF4500', edgecolor='#FF4500')
    slope, intercept, r_value, p_value, std_err = ss.linregress(X, Y)
    plt.plot([0, 20], [intercept, slope*20+intercept], color='#1E90FF') #plot the fitted line Y = slope * X + intercept
    
    # TODO: display the fitted line (Y = slope * X + intercept) using the text() function.
    plt.text(????)

    plt.xlim(0,20)
    plt.xlabel('X'+str(i))
    plt.ylabel('Y'+str(i))
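
As a quick numerical check alongside the plots, the sketch below fits the first and fourth datasets and prints the fitted coefficients. It uses only `linregress()` and the values listed above; the dictionary `fits` is a name introduced here for illustration. The result object also exposes the fields by name (`slope`, `intercept`, `rvalue`), which can be easier to read than unpacking a tuple.

```python
import scipy.stats as ss

x1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

fits = {}
for name, (x, y) in {"I": (x1, y1), "IV": (x4, y4)}.items():
    res = ss.linregress(x, y)              # least-squares fit of y = slope * x + intercept
    fits[name] = res
    print("{}: slope={:.3f}, intercept={:.3f}, r^2={:.3f}".format(
        name, res.slope, res.intercept, res.rvalue ** 2))
```

The two fits come out nearly identical even though the scatter plots look nothing alike, which is the point of the quartet.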

Actually, the dataset is included in seaborn and we can load it. 

In [None]:
df = sns.load_dataset("anscombe")
df.head()

All four datasets are in this single data frame and the 'dataset' indicator is one of the columns. This is a form often called [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf), which is easy to manipulate and plot. In tidy data, each row is an observation and columns are the properties of the observation. Seaborn makes use of the tidy form. 
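
To make the tidy form concrete, here is a small sketch that reshapes a hypothetical "wide" table (one y column per dataset) into the tidy layout with pandas' `melt()`. The frame `wide` and its column names `y_I` and `y_II` are invented for this example and are not how seaborn stores the data.

```python
import pandas as pd

# Hypothetical "wide" table: one column of y-values per dataset
wide = pd.DataFrame({
    "x": [10.0, 8.0, 13.0],
    "y_I": [8.04, 6.95, 7.58],
    "y_II": [9.14, 8.14, 8.74],
})

# One row per observation; the dataset label becomes an ordinary column
tidy = wide.melt(id_vars="x", var_name="dataset", value_name="y")
tidy["dataset"] = tidy["dataset"].str.replace("y_", "", regex=False)
print(tidy)
```

In the tidy version, grouping or coloring by dataset only requires naming the `dataset` column, which is what seaborn's `col` and `hue` arguments rely on.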

We can show the linear regression results for each dataset. [Here](https://stanford.edu/~mwaskom/software/seaborn/examples/anscombes_quartet.html) is the example:

In [None]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", height=4,  # 'size' was renamed to 'height' in seaborn 0.9
           scatter_kws={"s": 50, "alpha": 1})

What do these parameters mean? The documentation for `lmplot()` is [here](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.lmplot.html).

In [None]:
# TODO: explain what the parameters (x, y, col, hue, etc.) mean.
# Change the values of these parameters and see the results.

# 2-D scatter plot and KDE

Select movies released in the 1990s:

In [None]:
geq = movie_df['Year'] >= 1990
leq = movie_df['Year'] <= 1999
subset = movie_df[ geq & leq ]
subset.head()
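
The two comparisons above build boolean masks that are combined element-wise with `&`. As a sketch of an equivalent form, pandas' `Series.between()` expresses the same inclusive range in one call. The frame `toy` and its `Title` column are hypothetical stand-ins, not the IMDb data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": list("abcde"),
    "Year": [1985, 1990, 1995, 1999, 2003],
})

# Two masks combined with & (parentheses are needed if written on one line)
mask_pair = (toy["Year"] >= 1990) & (toy["Year"] <= 1999)

# between() is inclusive on both ends by default
mask_between = toy["Year"].between(1990, 1999)

nineties = toy[mask_between]
print(nineties)
```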

We can draw a scatter plot of movie votes and ratings using the [**`scatter()`**](http://matplotlib.org/examples/shapes_and_collections/scatter_demo.html) function.

In [None]:
plt.scatter(subset['Votes'], subset['Rating'])
plt.xlabel('Votes')
plt.ylabel('Rating')

There are too many data points to read the plot. We can shrink the symbols, leave them unfilled, and make them transparent.

In [None]:
plt.scatter(subset['Votes'], subset['Rating'], s=20, alpha=0.6, facecolors='none', edgecolors='b')
plt.xlabel('Votes')
plt.ylabel('Rating')

The number of votes is broadly distributed, so let's set the x-axis to a log scale.

In [None]:
plt.scatter(subset['Votes'], subset['Rating'], s=10, alpha=0.6, facecolors='none', edgecolors='b')
plt.xscale('log')
plt.xlabel('Votes')
plt.ylabel('Rating')

We can combine a scatter plot with 1D histograms using seaborn's [**`jointplot()`**](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.jointplot.html) function.

In [None]:
# pass the data by keyword; recent seaborn releases no longer accept x and y positionally
sns.jointplot(x=np.log(subset['Votes']), y=subset['Rating'])

## Hexbin

There are too many data points. We need to bin them, which can be done by using the [**`jointplot()`**](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.jointplot.html) and setting the `kind` parameter.

In [None]:
# TODO: draw a joint plot with hexbins and two histograms for each marginal distribution

## KDE

We can also do 2D KDE using seaborn's [**`kdeplot()`**](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.kdeplot.html) function.

In [None]:
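# In recent seaborn releases, fill= replaces shade=, and the default threshold already leaves the lowest level unshaded
sns.kdeplot(x=np.log(subset['Votes']), y=subset['Rating'], cmap="Reds", fill=True)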

Or using [**`jointplot()`**](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.jointplot.html) by setting the `kind` parameter.

In [None]:
# TODO: draw a joint plot with bivariate KDE as well as marginal distributions with KDE

# High dimensional data

In the IMDb dataset, we have two dimensions (number of votes and rating). What if we have high-dimensional data? In many cases, the number of dimensions is not too large. For instance, the ["Iris" dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) contains four measurements on flowers from three iris species. That is more than two dimensions, yet still manageable.

This dataset is also included in seaborn, so we can load it.

In [None]:
iris = sns.load_dataset('iris')
iris.head()

It's often useful to look at the basic statistics of variables:

In [None]:
iris.describe()

We get four dimensions (sepal_length, sepal_width, petal_length, petal_width). One direct way to visualize them is to have a scatter plot for each pair of dimensions. We can use the [**`pairplot()`**](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.pairplot.html) function in seaborn to do this.

In [None]:
sns.pairplot(iris)

We can also color the symbols based on species:

In [None]:
sns.pairplot(iris, hue='species')

## PCA 

[Principal component analysis (PCA)](http://setosa.io/ev/principal-component-analysis/) is a nice dimensionality reduction method. The goal of dimensionality reduction is, of course, to reduce the number of variables (dimensions, measurements, columns).

For example, in the Iris dataset we have four variables (`sepal_length`, `sepal_width`, `petal_length`, `petal_width`). If we can reduce the number of variables to two, then we can easily visualize them. PCA offers one way to do this.

PCA is already implemented in the [scikit-learn](http://scikit-learn.org/stable/) package, a machine learning library in Python, which should have been included in Anaconda. If not, to install scikit-learn, run:

`conda install scikit-learn`

or

`pip install scikit-learn`

Before running PCA, we need to convert `iris` from a [`DataFrame`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) to a [NumPy array](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html). [DataFrame.values](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html) returns the NumPy representation of a `DataFrame`.

In [None]:
print(iris.values)

Extract the four variables as X and the species as Y:

In [None]:
X = iris.values[:, 0:4] # extract the 1st to the 4th columns of all rows (the four measurements)
Y = iris.values[:, 4] # extract the 5th column of all rows (the species)
print(X)
print(Y)
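
One detail worth knowing: when a `DataFrame` mixes numeric and string columns, `.values` returns a single array with `object` dtype, so numeric slices of it are also `object` arrays. The sketch below shows this on a tiny hypothetical frame (`toy` is made up for illustration) along with two ways to get a float array.

```python
import pandas as pd

toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

mixed = toy.values                                        # numbers and strings together
numeric = toy[["sepal_length", "petal_length"]].values    # select numeric columns first
as_float = mixed[:, 0:2].astype(float)                    # or cast the slice afterwards

print(mixed.dtype, numeric.dtype, as_float.dtype)
```

Libraries such as scikit-learn generally convert such input to floats on their own, but an explicit cast avoids surprises when doing arithmetic on the array directly.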

We can now do the PCA on the four variables (`X`). The first step is to initialize a [**`PCA`**](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) object.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2) # set the number of components to 2

We then fit `X` to get the model and then transform the original data of four variables (`X`) to two variables (components).

In [None]:
X_r = pca.fit(X).transform(X)
print(X_r)
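
To get a feel for what fitting and transforming involve, here is a rough sketch of PCA done by hand with NumPy's SVD instead of scikit-learn. It runs on synthetic data (an assumption, not the Iris measurements), and names such as `latent`, `mixing`, and `scores` are invented for this example. The signs of the components may differ from what scikit-learn returns, since a principal direction is only defined up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 150 rows, 4 correlated columns (not the Iris data)
latent = rng.normal(size=(150, 2))
mixing = np.array([[2.0, 0.5, 1.5, 0.3],
                   [0.2, 1.0, -0.5, 0.8]])
data = latent @ mixing + 0.1 * rng.normal(size=(150, 4))

centered = data - data.mean(axis=0)                 # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt[:2]                                 # top two principal directions
scores = centered @ components.T                    # 2-D representation of each row
explained = S ** 2 / np.sum(S ** 2)                 # share of variance per component

print(scores.shape, explained[:2].round(3))
```

The first column of `scores` carries the most variance and the second the next most, and the two columns are uncorrelated with each other.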

Now we can assemble the two components and the `species` column into a DataFrame.

In [None]:
df = pd.DataFrame(X_r, columns=['PC1', 'PC2'])
df['species'] = Y  # the species labels extracted earlier (uppercase Y)
df.head()

In [None]:
# TODO: show a scatter plot of the two components using pairplot()