# W6 Lab Assignment

A deep dive into histograms and boxplots.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

sns.set_style('white')

%matplotlib inline 

# Histogram


Let's revisit the table from class:

| Hours | Frequency |
|-------|-----------|
| 0-1   | 4,300     |
| 1-3   | 6,900     |
| 3-5   | 4,900     |
| 5-10  | 2,000     |
| 10-24 | 2,100     |

You can draw a histogram by just providing bins and counts instead of a list of numbers. So, let's do that for convenience. 

In [None]:
bins = [0, 1, 3, 5, 10, 24]
data = {0.5: 4300, 2: 6900, 4: 4900, 7: 2000, 15: 2100} 

Draw a histogram using this data. Useful query: [Google search: matplotlib histogram pre-counted](https://www.google.com/search?client=safari&rls=en&q=matplotlib+histogram+already+counted&ie=UTF-8&oe=UTF-8#q=matplotlib+histogram+pre-counted)

In [None]:
# TODO: draw a histogram with pre-counted data. 
#plt.xlabel("Hours")

As you can see, the **default histogram does not normalize by bin width and simply shows the counts**! This can be very misleading if you are working with variable bin widths. One simple way to fix this is the [`density`](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist) option (called `normed` in older versions of Matplotlib and since removed from newer releases).
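
To see why bin width matters, the normalized height of each bar can be worked out by hand from the table above: divide each count by the total count times the bin width. The following is a minimal sketch using only NumPy; the variable names are illustrative and not part of the assignment.

```python
import numpy as np

# Bin edges and pre-counted frequencies from the table above.
edges = np.array([0, 1, 3, 5, 10, 24])
counts = np.array([4300, 6900, 4900, 2000, 2100])

widths = np.diff(edges)                     # width of each bin in hours
density = counts / (counts.sum() * widths)  # bar height per hour

for lo, hi, c, d in zip(edges[:-1].tolist(), edges[1:].tolist(),
                        counts.tolist(), density.tolist()):
    print(lo, hi, c, round(d, 4))

# The bar areas (height times width) add up to 1.
total_area = float((density * widths).sum())
print(total_area)
```

Note how the 10-24 bin holds more people than the 5-10 bin, yet its normalized bar is lower because the bin is almost three times as wide.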

In [None]:
# TODO: fix it with the density option (called normed in older Matplotlib).

## IMDb data

How does matplotlib decide the bin width? Let's try with the IMDb data.

In [None]:
# TODO: Load IMDb data into movie_df using pandas


Plot the histogram of movie ratings using the `plt.hist()` function.

In [None]:
plt.hist(movie_df['Rating'])


Have you noticed that this function returns three objects? Take a look at the documentation [here](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist) to figure out what they are.

To capture the three returned objects:

In [None]:
n_raw, bins_raw, patches = plt.hist(movie_df['Rating'])
print(n_raw)
print(bins_raw)

`n_raw` contains the bin counts, i.e., the number of movies in each of the 10 bins. Thus, the sum of the elements in `n_raw` should be equal to the total number of movies:

In [None]:
# TODO: test whether the sum of the numbers in n_raw is equal to the number of movies. 

The second returned object (`bins_raw`) is an array containing the 11 edges of the 10 bins: the first bin is \[1.0,1.89\], the second \[1.89,2.78\], and so on. Every bin except the last includes its left edge but not its right edge. We can calculate the width of each bin.
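
The relationship between counts and edges can be checked on a small sample with `np.histogram`, which Matplotlib's `hist()` uses for binning (it returns the counts and edges without drawing anything). The sample values below are made up for illustration.

```python
import numpy as np

# Made-up sample, not the IMDb data.
sample = np.array([1.0, 2.5, 2.5, 4.0, 5.5, 7.0, 7.0, 8.5, 9.0, 10.0])

counts, edges = np.histogram(sample, bins=3)
widths = np.diff(edges)  # difference between neighboring edges

print(counts)   # number of values per bin
print(edges)    # always one more edge than there are bins
print(widths)   # equal-width bins
```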

In [None]:
# TODO: calculate the width of each bin and print them. 


The above `for` loop can be conveniently rewritten as the following, using [list comprehension](https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions) and the [**`zip()`**](https://docs.python.org/3/library/functions.html#zip) function. Can you explain what's going on inside the zip?

In [None]:
[ j-i for i,j in zip(bins_raw[:-1],bins_raw[1:]) ]
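
To see what `zip()` does with the two slices, here is the same idea on a short toy list of edges (the values are invented for illustration):

```python
edges = [1.0, 1.5, 2.0, 2.5]

# edges[:-1] drops the last element and edges[1:] drops the first,
# so zip() pairs every edge with its right-hand neighbor.
pairs = list(zip(edges[:-1], edges[1:]))
print(pairs)    # [(1.0, 1.5), (1.5, 2.0), (2.0, 2.5)]

widths = [right - left for left, right in pairs]
print(widths)   # [0.5, 0.5, 0.5]
```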

Did you notice that the width of each bin is the same? This is called equal-width binning. We can calculate the width as:

In [None]:
min_rating = min(movie_df['Rating'])
max_rating = max(movie_df['Rating'])
print(min_rating, max_rating)
print( (max_rating-min_rating) / 10 )

Now, let's plot the histogram with a normalized y axis.

In [None]:
n, bins, patches = plt.hist(movie_df['Rating'], density=True)  # `normed` in older Matplotlib
print(n)
print(bins)

In this case, the edges of the 10 bins do not change. But now `n` represents the heights of the bars rather than the counts. Can you verify that matplotlib has correctly normalized the heights?

Hint: the area of each bin should be equal to the fraction of movies in that bin.
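
The hint can be tried on synthetic data first. The sketch below uses `np.histogram` with `density=True` as a stand-in for `plt.hist()`; the random sample is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.uniform(1, 10, size=500)   # synthetic "ratings"

counts, edges = np.histogram(sample, bins=10)
heights, _ = np.histogram(sample, bins=10, density=True)

areas = heights * np.diff(edges)     # area of each bar
fractions = counts / counts.sum()    # share of the sample in each bin

print(np.allclose(areas, fractions))
print(areas.sum())
```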

In [None]:
# TODO: verify that it is properly normalized. 


## Selecting the bin size

A nice way to explore this is to use "[small multiples](https://www.google.com/search?client=safari&rls=en&q=small+multiples&ie=UTF-8&oe=UTF-8)" with a set of sample bin sizes. In other words, pick some bin sizes that you want to see and draw many plots within a single "figure". Read about [subplot](https://www.google.com/search?client=safari&rls=en&q=matplotlib+subplot&ie=UTF-8&oe=UTF-8). For instance, you can do something like:

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
movie_df['Rating'].hist(bins=3)
plt.subplot(1,2,2)
movie_df['Rating'].hist(bins=100)

What does the argument in `plt.subplot(1,2,1)` mean?  
http://stackoverflow.com/questions/3584805/in-matplotlib-what-does-the-argument-mean-in-fig-add-subplot111
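
In short, the three arguments are the number of rows, the number of columns, and a 1-based index that counts across each row starting from the top-left panel. A small pure-Python sketch of that index-to-position mapping follows; the helper name `panel_position` is hypothetical and not part of matplotlib.

```python
def panel_position(nrows, ncols, index):
    """Return the 0-based (row, col) of a 1-based subplot index."""
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows * ncols")
    return divmod(index - 1, ncols)

# Positions of the 8 panels in a grid with 2 rows and 4 columns.
for index in range(1, 9):
    print(index, panel_position(2, 4, index))
```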

Ok, so create 8 subplots (2 rows and 4 columns) with the given `binsizes`. 

In [None]:
binsizes = [2, 3, 5, 10, 30, 40, 60, 100 ]

plt.figure(1, figsize=(18,8))
for i, bins in enumerate(binsizes): 
    # TODO: use subplot and hist() function to draw 8 plots
    pass  # placeholder so the cell runs; replace with your plotting code


Do you notice the weird patterns that emerge with `bins=40`? Can you guess why you see such patterns? What are the peaks and what are the empty bars? What do they tell you about choosing the bin size in histograms?

In [None]:
# TODO: Provide your answer and evidence here

Now, let's try to apply several algorithms for finding the number of bins. 
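
NumPy ships several such rules, which is handy for cross-checking bin counts computed by hand. The sketch below runs `np.histogram_bin_edges` on a synthetic sample (an assumption for illustration, not the IMDb data). NumPy's implementation rounds the bin count up, so its result can differ by one from a formula that truncates with `int()`.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=6.0, scale=1.5, size=1000)  # synthetic data

# Number of bins chosen by each built-in rule.
nbins = {}
for rule in ("sqrt", "sturges", "fd"):
    edges = np.histogram_bin_edges(sample, bins=rule)
    nbins[rule] = len(edges) - 1

print(nbins)
```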

In [None]:
N = len(movie_df['Rating'])

# TODO: plot three histograms based on three formulae

plt.figure(figsize=(12,4))


# Sqrt 
nbins = int(np.sqrt(N))

plt.subplot(1,3,1)
plt.title("SQRT, {} bins".format(nbins))


# Sturges' formula
nbins = int(np.ceil(np.log2(N) + 1))

# Freedman-Diaconis
data = movie_df['Rating'].sort_values()  # Series.order() was removed from pandas
iqr = np.percentile(data, 75) - np.percentile(data, 25)
width = 2*iqr/np.power(N, 1.0/3)  # 1.0/3 avoids integer division in Python 2
nbins = int((max(data) - min(data)) / width)



# Investigating the anomalies in the histogram

Let's investigate the anomalies in the histogram.

In [None]:
# TODO: draw the histogram with 120 bins


We can locate the empty bins by checking which values in `n` are zero.
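
As a warm-up, here is the idea on toy counts and edges (invented for illustration, not taken from the movie data): `zip()` lines up each bin's left edge, right edge, and count, and a filter keeps the bins whose count is zero.

```python
# Toy counts and edges, invented for illustration.
n = [4, 0, 7, 0, 0, 3]
bins = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

empty = [(left, right)
         for left, right, count in zip(bins[:-1], bins[1:], n)
         if count == 0]
print(empty)   # [(1.5, 2.0), (2.5, 3.0), (3.0, 3.5)]
```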

In [None]:
# TODO: print out the bins that don't contain any values. Check whether they fall into ranges like [1.8XX, 1.8XX]
# useful zip: zip(bins[:-1], bins[1:], n)  what does this do?


In [None]:
# TODO: draw the histogram with 120 bins


One way to identify a peak is to compare each bin's count with the count of the next bin and see whether it is much higher.
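
A minimal sketch of this comparison on toy counts follows. The factor of 2 used for "much higher" is an assumption chosen for illustration; a suitable threshold for the real data is up to you.

```python
# Toy counts and edges, invented for illustration.
n = [9, 2, 3, 12, 1, 2, 10, 4]
bins = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8]
ratio = 2  # assumed meaning of "much higher"

# Pair each bin (left edge, right edge, count) with the next bin's count.
# The last bin has no successor, so zip() skips it.
peaks = [(left, right)
         for left, right, count, next_count
         in zip(bins[:-1], bins[1:], n[:-1], n[1:])
         if count > ratio * next_count]
print(peaks)   # [(1.0, 1.1), (1.3, 1.4), (1.6, 1.7)]
```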

In [None]:
# TODO: identify peaks and print the bins with the peaks 
# e.g. 
# [1.0, 1.1]
# [1.3, 1.4]
# [1.6, 1.7]
# ...
#
# you can use zip again like zip(bins[:-1], bins[1:]  ... ) to access the data in two adjacent bins.


OK. The peaks don't necessarily fall on the integer values. Let's see the minimum number of votes.

In [None]:
movie_df.describe()

OK, the minimum number of votes is 5, not 1. IMDb may only keep the rating information for movies with at least 5 votes. This may explain why the most frequent ratings are values like 6.4 and 6.6. Let's plot the histogram using only the rows with exactly 5 votes. Set the number of bins to 30.

In [None]:
# TODO: plot the histogram only with ratings that have the minimum number of votes. 

Then, print out the most frequent rating values. Use the `value_counts()` method on the ratings column.

In [None]:
# TODO: select the rows with the minimum number of votes (5) and then call `value_counts()` on their ratings.
# sort the result to see what are the most common numbers. 


So, the most frequent values are not "x.0". Let's see the CDF. 

In [None]:
# TODO: plot the CDF of votes.


What's going on? The number of votes is heavily skewed and most datapoints are at the left end. 
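
A toy example shows why the curve looks like a step at the left edge: when a single huge value stretches the x axis, almost all of the cumulative mass sits in a sliver near zero. The vote counts below are made up for illustration.

```python
import numpy as np

# Made-up, heavily skewed vote counts.
votes = np.array([5, 5, 6, 7, 8, 10, 12, 20, 50, 100000])

x = np.sort(votes)
cdf = np.arange(1, len(x) + 1) / len(x)  # cumulative share after each sorted point

share_up_to_100 = float(np.mean(votes <= 100))
print(share_up_to_100)   # 0.9
```

Plotting `cdf` against `x` on a linear axis would squeeze nine of the ten points into the first 0.1% of the x range, which is why limiting the x range helps.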

In [None]:
# TODO: plot the same thing but limit the xrange (xlim) to [0, 100]. 



Draw a histogram focused on the range [0, 10] to see how many data points there are.

In [None]:
# TODO: set the xlim to [0, 10] adjust ylim and bins so that 
# we can see how many datapoints are there for each # of votes. 


Let's assume that each of the 5 votes is a rating from 5 to 8 and see what averages we get. You can use the `itertools.product` function to generate the fake ratings.

In [None]:
from itertools import product  # needed here; this cell runs before the import below

list(product([5,6,7,8], repeat=5))[:10]

In [None]:
from itertools import product
from collections import Counter

c = Counter()
for x in product([5,6,7,8], repeat=5):
    c[str(round(np.mean(x), 1))]+=1
sorted(c.items(), key=lambda x: x[1], reverse=True)
    
# or sorted(Counter(str(round(np.mean(x), 1)) for x in product([5,6,7,8], repeat=5)).items(), key=lambda x: x[1], reverse=True)

# Boxplot

Let's look at the example data that we looked at during the class. 

In [None]:
data = [-1, 3, 3, 4, 15, 16, 16, 17, 23, 24, 24, 25, 35, 36, 37, 46]

The [**`numpy.percentile()`**](http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html) function calculates percentiles. When a percentile falls between two data points, the `interpolation` option (renamed `method` in NumPy 1.22) specifies which value to take. The default is linear interpolation.

In [None]:
print(np.percentile(data, 25))
print(np.percentile(data, 50), np.median(data))
print(np.percentile(data, 75))

Can you explain why you get those first and third quartile values? The first quartile value is not 4, not 15, and not 9.5. Why?
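
One way to reason about it: with the default linear method, NumPy places the q-th percentile at the fractional 0-based index `(n - 1) * q / 100` of the sorted data and interpolates between the two neighboring values. The helper below, `linear_percentile`, is a hypothetical re-implementation written for illustration, not NumPy's own code.

```python
import numpy as np

data = [-1, 3, 3, 4, 15, 16, 16, 17, 23, 24, 24, 25, 35, 36, 37, 46]

def linear_percentile(values, q):
    """Percentile by linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100    # fractional 0-based index
    lower = int(position)                      # index just below the position
    upper = min(lower + 1, len(ordered) - 1)   # index just above the position
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

print(linear_percentile(data, 25))   # 12.25, three quarters of the way from 4 to 15
print(linear_percentile(data, 75))   # 27.5, one quarter of the way from 25 to 35
```

With 16 values, the first quartile sits at index 3.75, so it lies between the 4th value (4) and the 5th value (15) but much closer to 15.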

Let's draw a boxplot with matplotlib. 
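
Before plotting, it helps to know what the box and whiskers will encode. With matplotlib's default `whis=1.5`, the whiskers extend to the most extreme data points within 1.5 times the IQR of the box, and anything beyond is drawn as an outlier. The following sketch computes those numbers for the data above with NumPy only; the variable names are illustrative.

```python
import numpy as np

data = np.array([-1, 3, 3, 4, 15, 16, 16, 17, 23, 24, 24, 25, 35, 36, 37, 46])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences that separate regular points from outliers (1.5 x IQR rule).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

For this data no point falls outside the fences, so the whiskers should reach the minimum and maximum values.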

In [None]:
# TODO: draw a boxplot of the data
