# W5 Lab Assignment

This lab covers some fundamental plots of 1-D data.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

sns.set_style('white')

%matplotlib inline 

# Q1 1-D Scatter Plot


Remember that you can use synthetic data as well as real data when you want to experiment with visualization tools. Synthetic data is a nice way to experiment because you control every aspect of it. Let's create some random numbers.

The function [**`np.random.randn()`**](http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html) generates a sample of size $N$ from the [standard normal distribution](https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution).

In [None]:
print( np.random.randn(10) )

The following small function generates $N$ normally distributed numbers:

In [None]:
def generate_many_numbers(N=10, mean=5, sigma=3):
    return mean + sigma * np.random.randn(N)

Generate 10 normally distributed numbers with mean 5 and sigma 3:

In [None]:
data = generate_many_numbers(N=10)
print(data)
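
As a quick sanity check, the sketch below draws a much larger sample and compares its sample mean and standard deviation with the requested values. It redefines the function so that it runs on its own; the seed and the sample size of 100,000 are arbitrary choices for illustration and are not part of the lab.

```python
import numpy as np

def generate_many_numbers(N=10, mean=5, sigma=3):
    # Shift and scale standard normal draws: mean + sigma * Z
    return mean + sigma * np.random.randn(N)

np.random.seed(0)  # arbitrary seed, only to make the check repeatable
sample = generate_many_numbers(N=100_000)

sample_mean = sample.mean()
sample_std = sample.std()
print(sample_mean, sample_std)  # should be close to 5 and 3
```

With only 10 numbers, as in the cell above, the sample mean and standard deviation can differ noticeably from 5 and 3.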

The most immediate method to visualize 1-D data is just plotting it. Here we can use the [**`scatter()`**](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter) function to draw a scatter plot. The most basic usage of this function is to provide x and y.

In [None]:
x = np.arange(1,11)
y = x + 5
print(x)
print(y)
plt.scatter(x, y)

But here we only have x (the generated data). We can set the y values to 0. The [**`np.zeros_like(data)`**](http://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros_like.html) function creates a NumPy array of zeros with the same shape and type as its argument.

In [None]:
print(np.zeros_like(data))
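
A minimal sketch of what `np.zeros_like()` returns, using a small hypothetical array (`values` is made up for illustration and is not the lab's `data`):

```python
import numpy as np

values = np.array([1.5, -2.0, 3.25])  # hypothetical 1-D data
zeros = np.zeros_like(values)

print(zeros)
print(zeros.shape == values.shape)  # same shape as the input
print(zeros.dtype == values.dtype)  # same dtype as the input
```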

Now let's plot the generated 1-D data.

In [None]:
plt.figure(figsize=(10,1)) # set figure size, width = 10, height = 1
plt.scatter(data, np.zeros_like(data), s=50) # set size to 50
plt.gca().axes.get_yaxis().set_visible(False) # set y axis invisible

What if we have more numbers?

In [None]:
data = generate_many_numbers(N=100)
plt.figure(figsize=(10,1))
plt.scatter(data, np.zeros_like(data), s=50)
plt.gca().axes.get_yaxis().set_visible(False)

Of course, we can't see much at the center because the points overlap. We can add "jitter" (small random vertical offsets) using the [**`np.random.rand()`**](http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.rand.html) function.

In [None]:
data = generate_many_numbers(N=100)

# TODO: implement this
# jittered_ypos = ??

plt.figure(figsize=(10,1))
plt.scatter(data, jittered_ypos, s=50)
plt.gca().axes.get_yaxis().set_visible(False)

Let's also make the points transparent. Here is [a useful Google query](https://www.google.com/search?client=safari&rls=en&q=matplotlib+scatter+transparent+symbole&ie=UTF-8&oe=UTF-8#q=matplotlib+scatter+transparent+symbol), and the documentation of [**`scatter()`**](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter) also helps.

In [None]:
data = generate_many_numbers(N=200)

# From the last question
# jittered_ypos = ??

# TODO: implement this
# plt.figure(figsize=(10,1))
# plt.scatter( ?? )
# plt.gca().axes.get_yaxis().set_visible(False)

We can combine transparency with empty symbols.

* Increase the number of points to 1,000
* Make the symbols empty (no fill) and set the edge color to red ([a useful query](https://www.google.com/search?client=safari&rls=en&q=matplotlib+scatter+empty+symbols&ie=UTF-8&oe=UTF-8))

In [None]:
# TODO: implement this
# data = ??
# jittered_ypos = ??


# TODO: implement this
# plt.figure(figsize=(10,1))
# plt.scatter( ?? )
# plt.gca().axes.get_yaxis().set_visible(False)

Let's use real data. Load the IMDb dataset that we used before.

In [None]:
movie_df = pd.read_csv('imdb.csv', delimiter='\t')
movie_df.head()

Try to plot the 'Rating' column using a 1-D scatter plot. Does it work?

In [None]:
# TODO: plot 'Rating'


# Q2 Histogram 

There are too many data points for a 1-D scatter plot to be readable! Let's try a histogram.


In [None]:
movie_df['Rating'].hist()

Looks good! Can you increase or decrease the number of bins? Find the documentation [here](https://www.google.com/search?client=safari&rls=en&q=pandas+plotting&ie=UTF-8&oe=UTF-8). 

In [None]:
# TODO: try different numbers of bins


A nice way to explore this is to visualize "[small multiples](https://www.google.com/search?client=safari&rls=en&q=small+multiples&ie=UTF-8&oe=UTF-8)". It is possible to draw many plots in a single "figure". Read about [subplot](https://www.google.com/search?client=safari&rls=en&q=matplotlib+subplot&ie=UTF-8&oe=UTF-8). For instance, you can do something like:

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
movie_df['Rating'].hist(bins=3)
plt.subplot(1,2,2)
movie_df['Rating'].hist(bins=100)

OK, now create 8 subplots (2 rows and 4 columns), one for each value in the given `binsizes`.

In [None]:
binsizes = [2, 3, 5, 10, 30, 40, 60, 100 ]

plt.figure(1, figsize=(18,8))
for i, bins in enumerate(binsizes): 
    # TODO: use subplot and hist() function to draw 8 plots
    pass  # placeholder so the cell runs; replace with your code

Do you notice the weird patterns that emerge from `bins=40`? Can you guess why you see such patterns? What are the peaks and what are the empty bars?

In [None]:
# Provide your answer and evidence here

# Q3 Boxplot

Now let's try a boxplot. We can use pandas' plotting functions. The usage of boxplot is described [here](http://pandas.pydata.org/pandas-docs/version/0.15.0/visualization.html#visualization-box).

In [None]:
movie_df['Rating'].plot(kind='box', vert=False)

Or try seaborn's [**`boxplot()`**](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.boxplot.html) function:

In [None]:
sns.boxplot(x=movie_df['Rating'])  # pass the data by keyword to draw a horizontal boxplot
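
To make the "box" concrete, here is a small sketch that computes the numbers a boxplot summarizes: the first quartile, the median, and the third quartile. The `ratings` array is hypothetical and is not taken from the IMDb data. With default settings, the whiskers usually extend to the most extreme points within 1.5 times the interquartile range of the box edges, but check the documentation of the plotting function you use.

```python
import numpy as np

ratings = np.array([2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])  # hypothetical ratings

# The box spans the first quartile (q1) to the third quartile (q3),
# and the line inside the box marks the median.
q1, median, q3 = np.percentile(ratings, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the length of the box
print(q1, median, q3, iqr)
```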

We can also easily draw a series of boxplots grouped by category. For example, let's draw boxplots of movie ratings for different decades.

In [None]:
df = movie_df.sort_values('Year')  # DataFrame.sort() was removed from pandas; use sort_values()
df.head()

One easy way to convert a particular year to its decade (e.g., 1874 -> 1870) is to integer-divide it by 10 and then multiply the result by 10.

In Python 3, the `//` operator performs integer (floor) division.

In [None]:
print(1874//10)
print(1874//10*10)
decade = (df['Year']//10) * 10
decade.head()

In [None]:
ax = sns.boxplot(x=decade, y=df['Rating'])
ax.figure.set_size_inches(12, 8)

Can you draw boxplots of movie votes for different decades?

In [None]:
# TODO

What do you see? Can you actually see the "box"? The number of votes spans a very wide range, from 1 to more than 1.4 million. One way to deal with this is to log-transform the votes, which can be done with the [**`numpy.log()`**](http://docs.scipy.org/doc/numpy/reference/generated/numpy.log.html) function.

In [None]:
log_votes = np.log(df['Votes'])
log_votes.head()
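
The sketch below illustrates how the log transformation compresses a wide range, using a few hypothetical vote counts (`votes` is made up for illustration; only the endpoints 1 and 1.4 million echo the range mentioned above). It also shows `np.log1p()`, which may be useful if a dataset contains zeros, since `np.log(0)` evaluates to `-inf`.

```python
import numpy as np
import pandas as pd

votes = pd.Series([1, 10, 1_000, 1_400_000])  # hypothetical vote counts

log_votes = np.log(votes)  # natural logarithm, applied element-wise
print(log_votes.round(2).tolist())

# np.log1p(x) computes log(1 + x), which stays finite when x is 0
print(np.log1p(0))
```

Values that differ by six orders of magnitude end up within a range of roughly 14 units, which is why the boxes become visible after the transformation.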

Can you draw boxplots of log-transformed movie votes for different decades?

In [None]:
# TODO
