# W3 Lab Assignment

Submit the .ipynb file to Canvas with file name `w03_lab_lastname_firstname.ipynb`.

In this lab, we will introduce [pandas](http://pandas.pydata.org/), [matplotlib](http://matplotlib.org/), and [seaborn](http://stanford.edu/~mwaskom/software/seaborn/index.html) and continue to use the `imdb.csv` file from the last lab.

There will be some exercises; as usual, write your code in the empty cells to answer them.


## Importing libraries

Some of you may have already used `pandas`. Pandas is a library for high-performance data analysis that makes the tedious jobs of reading, manipulating, and analyzing data much easier. You can even plot directly using `pandas`. If you have used R before, you'll see many similarities between R's data frames and pandas's `DataFrame`.

In [None]:
import pandas as pd  
import numpy as np

### Matplotlib magic

`Jupyter` notebook provides several [**magic** commands](https://ipython.org/ipython-doc/dev/interactive/tutorial.html#magic-functions). These are commands that you can use only in the notebook (not in IDLE, for instance). One of the most useful magic commands is `%matplotlib inline`, which displays plots within the notebook instead of creating figure files.

In [None]:
%matplotlib inline 

There are many ways to import `matplotlib`, but the most common way is:

In [None]:
import matplotlib.pyplot as plt 

# Q1: Revisiting W2 lab

Let's revisit last week's exercise with `pandas`. It's very easy to read `CSV` files with `pandas`, using the [**`pandas.read_csv()`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function. This function has many options, and it is worthwhile to look through them. The options you need to be careful about are:

1. `delimiter` or `sep`: the data file may use ',', tab, or any weird character to separate fields. You can't read data properly if this option is incorrect. 
1. `header`: some data files have a "header" row that contains the names of the columns, and some do not. If you read a header row as data, or treat the first data row as the header when the file has none, you'll have problems.
1. `na_values` or `na_filter`: often the dataset is incomplete and contains missing data (`NA`, `NaN` (not a number), etc.). It's very important to take care of them properly. 
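
The effect of these three options can be seen on a tiny in-memory table. This is only a sketch: the `raw` string below is made-up data, not the contents of `imdb.csv`, and `io.StringIO` is used so that no file is needed.

```python
import io
import pandas as pd

# Hypothetical tab-separated data with a header row and "NA" as a missing-value marker.
raw = "Title\tYear\tRating\nMovie A\t1994\t8.1\nMovie B\t1995\tNA\n"

# Matching settings: tab separator, first row is the header, "NA" means missing.
good = pd.read_csv(io.StringIO(raw), sep="\t", header=0, na_values=["NA"])
print(good.shape)                    # rows and columns as intended
print(good["Rating"].isna().sum())   # number of missing ratings

# Wrong separator: every line is read as one single field.
bad_sep = pd.read_csv(io.StringIO(raw), sep=",")
print(bad_sep.shape)

# header=None: the column names end up as the first data row.
no_header = pd.read_csv(io.StringIO(raw), sep="\t", header=None)
print(no_header.iloc[0].tolist())
```

Checking `shape` and the first row right after reading is a quick way to notice a wrong `sep` or `header` setting.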

You don't need to create dictionaries and other data structures. `Pandas` just imports the whole table into a data structure called [**`DataFrame`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). You can do all kinds of interesting manipulation with the `DataFrame`. 

In [None]:
df = pd.read_csv('imdb.csv', delimiter='\t')

Let's look at the first few rows to get some sense of the data. 

In [None]:
df.head()

You can see more or fewer lines, of course:

In [None]:
df.head(2)

You can extract one column by using a dictionary-like expression:

In [None]:
df['Year'].head(3)

or select multiple columns

In [None]:
df[['Year','Rating']].head(3)
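
The two forms of column selection return different types, which matters for the methods you can call afterwards. A minimal sketch, assuming a small made-up table named `toy` in place of the IMDb data:

```python
import pandas as pd

# Hypothetical stand-in for the IMDb table.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Year": [1994, 1995, 1994],
    "Rating": [8.1, 6.5, 7.2],
})

one_col = toy["Year"]                # a single label gives a Series
two_cols = toy[["Year", "Rating"]]   # a list of labels gives a DataFrame
also_frame = toy[["Year"]]           # a one-element list still gives a DataFrame

print(type(one_col).__name__, type(two_cols).__name__, type(also_frame).__name__)
```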

To get the first 10 rows

In [None]:
df[:10]

We can also select both rows and columns. For example, to select the first 10 rows of the 'Year' and 'Rating' columns:

In [None]:
df[['Year','Rating']][:10]

You can swap the order of the row and column selections.

But when you deal with large datasets, you may want to stick to this principle:

> Try to reduce the size of the dataset you are handling as soon as possible, and as much as possible. 

For instance, if you have a billion rows with three columns, taking a small row slice first (`df[:10]`) and working with it can be much better than taking a column slice first (`df['Year']`), which still contains a billion items.

In [None]:
df[:10][['Year','Rating']]
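
Both orders select the same cells; only the size of the intermediate object differs. The sketch below checks this on a made-up table (`toy` is hypothetical) and also shows `.loc`, which selects rows and columns in a single step.

```python
import pandas as pd

# Hypothetical stand-in for the IMDb table.
toy = pd.DataFrame({
    "Title": list("ABCDEF"),
    "Year": [1990, 1991, 1992, 1993, 1994, 1995],
    "Rating": [7.0, 6.1, 8.3, 5.5, 9.0, 6.8],
})

rows_first = toy[:3][["Year", "Rating"]]   # intermediate: 3 rows, all columns
cols_first = toy[["Year", "Rating"]][:3]   # intermediate: all rows, 2 columns
print(rows_first.equals(cols_first))

# .loc does both at once; note that its label slice includes the end label (0, 1, 2).
one_step = toy.loc[:2, ["Year", "Rating"]]
print(one_step.equals(rows_first))
```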

It is very easy to answer the question of the number of movies per year. The [**`value_counts()`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function counts how many times each data value (year) appears. 

In [None]:
print( min(df['Year']), df['Year'].min(), max(df['Year']), df['Year'].max() )
year_nummovies = df["Year"].value_counts()
year_nummovies.head()
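
Note that `value_counts()` orders its result by count, most frequent first, so `head()` shows the busiest years rather than the earliest ones. A small sketch with made-up years; `sort_index()` puts the counts back into chronological order:

```python
import pandas as pd

# Hypothetical list of release years.
years = pd.Series([1994, 1995, 1994, 1996, 1994, 1995], name="Year")

counts = years.value_counts()    # ordered by count, descending
by_year = counts.sort_index()    # ordered by year

print(counts.index[0], counts.iloc[0])   # most frequent year and its count
print(by_year.to_dict())
```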

To calculate average ratings and votes

In [None]:
print( np.mean(df['Rating']), np.mean(df['Votes']) )

or you can even do 

In [None]:
print( df['Rating'].mean() )
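
One difference worth knowing when the data has missing values: the pandas `mean()` method skips `NaN` by default, while `np.mean` on a plain NumPy array does not. A minimal sketch with made-up ratings:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value.
ratings = pd.Series([8.0, 6.0, np.nan])

print(ratings.mean())                    # NaN is skipped (skipna=True by default)
print(np.mean(ratings.to_numpy()))       # NaN propagates on a plain array
print(np.nanmean(ratings.to_numpy()))    # NumPy's NaN-aware mean
```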

To get the median ratings of movies in the 1990s, we first select only the movies in that decade:

In [None]:
geq = df['Year'] >= 1990
leq = df['Year'] <= 1999
movie_nineties = df[geq & leq]

In [None]:
movie_nineties.head()
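
The comparison above produces Boolean Series, and `&` combines them element by element. The sketch below repeats the selection on a made-up table (`toy` is hypothetical) and shows `Series.between()`, which includes both endpoints by default, as an equivalent shortcut.

```python
import pandas as pd

# Hypothetical stand-in for the IMDb table.
toy = pd.DataFrame({
    "Title": list("ABCDE"),
    "Year": [1989, 1990, 1995, 1999, 2000],
})

geq = toy["Year"] >= 1990
leq = toy["Year"] <= 1999
nineties = toy[geq & leq]

# When written inline, each comparison needs its own parentheses,
# because & binds more tightly than >= and <=.
inline = toy[(toy["Year"] >= 1990) & (toy["Year"] <= 1999)]

# between() includes both endpoints by default.
shortcut = toy[toy["Year"].between(1990, 1999)]

print(nineties["Year"].tolist())
```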

Then, we can do the calculation

In [None]:
print( movie_nineties['Rating'].median(), movie_nineties['Votes'].median() )

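Finally, if we want to know the top 10 movies of the 1990s, we can use the `sort_values()` function. Older pandas releases used [**`sort()`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html) for this, but that method has since been removed:

In [None]:
# sort_values() replaces the removed DataFrame.sort()
sorted_by_rating = movie_nineties.sort_values('Rating', ascending=False)
sorted_by_rating[:10]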
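
A small sketch of ranking rows by a column with `DataFrame.sort_values()`, the sorting method in current pandas releases, on a made-up table (`toy` is hypothetical). `nlargest()` is shown as a shortcut for the same "top N" question.

```python
import pandas as pd

# Hypothetical stand-in for the 1990s movies.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [7.0, 9.1, 5.4, 8.2],
})

# Sort the whole table, then keep the first two rows.
top2 = toy.sort_values("Rating", ascending=False)[:2]

# nlargest() returns the two highest-rated rows directly.
top2_alt = toy.nlargest(2, "Rating")

print(top2["Title"].tolist())
```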

### Exercise

Calculate the following basic characteristics of ratings of movies only in 1994: 10th percentile, median, mean, 90th percentile.


* http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
* http://pandas.pydata.org/pandas-docs/stable/text.html

Write your code in the cell below

In [None]:
# implement here


# Q2: Basic plotting with pandas

`Pandas` provides some easy ways to draw plots by using `matplotlib`. The `DataFrame` object has several plotting functions. For instance,

In [None]:
df['Year'].hist()

### Exercise

Can you plot the histogram of ratings of the movies between 2000 and 2014?

In [None]:
# implement here


# Q3: Basic plotting with matplotlib

Let's plot the histogram of ratings using the [**`pyplot.hist()`**](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist) function.

In [None]:
plt.hist(df['Rating'], bins=10)
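
To see what the `bins` argument does, it can help to look at the numbers behind the bars. `plt.hist` bins the data with `numpy.histogram`, so the sketch below calls that function directly on made-up ratings; three bins are used only to keep the output short.

```python
import numpy as np

# Hypothetical ratings between 1 and 10.
ratings = np.array([1.0, 2.5, 3.0, 4.5, 5.0, 6.5, 7.0, 8.5, 9.0, 10.0])

# bins=3 splits the range [min, max] into three equal-width intervals.
counts, edges = np.histogram(ratings, bins=3)

print(edges)    # four edges delimit three bins
print(counts)   # how many ratings fall into each bin
```

Every bin includes its left edge and excludes its right edge, except the last bin, which includes both.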

### Exercise

Let's try to make some style changes to the plot:

* change the color from blue to whatever you want
  - http://matplotlib.org/users/pyplot_tutorial.html#working-with-text
  - http://matplotlib.org/api/colors_api.html
* add labels to the x and y axes
* change the number of bins to 20

In [None]:
# implement here


# Q4: Basic plotting with Seaborn

Seaborn sits on top of matplotlib and makes it easier to draw statistical plots. Most plots that you create with Seaborn can also be created with matplotlib; doing so just typically requires a lot more work.

Make sure seaborn is installed on your computer. If it is not, run

`conda install seaborn`

In [None]:
import seaborn as sns

Let's change nothing else and just run the histogram again:

In [None]:
plt.hist(df['Rating'], bins=10)

We can use the [**`distplot()`**](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html) function to plot the histogram.

In [None]:
sns.distplot(df['Rating'])

### Exercise

Read the documentation for the function and make the following changes: http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html

* change the number of bins to 10
* do not show the KDE

In [None]:
# implement here
