# W3 Lab Assignment

Submit the .ipynb file to Canvas with file name `w03_lab_lastname_firstname.ipynb`.

In this lab, we will introduce [pandas](http://pandas.pydata.org/), [matplotlib](http://matplotlib.org/), and [seaborn](http://stanford.edu/~mwaskom/software/seaborn/index.html) and continue to use the `imdb.csv` file from the last lab.

The lab includes several exercises. As usual, write your code in the empty cells to answer them.


## Importing libraries

In [None]:
# pandas makes tedious jobs of reading and manipulating data super easy and nice. You can even plot 
# directly using pandas. 
import pandas as pd  
import numpy as np

### Matplotlib magic

The `IPython` (`Jupyter`) notebook provides several [**magic** commands](https://ipython.org/ipython-doc/dev/interactive/tutorial.html#magic-functions). One of the most useful magic commands is `%matplotlib inline`, which displays plots within the notebook instead of creating figure files. 

In [None]:
%matplotlib inline 

There are many ways to import `matplotlib`, but the most common way is:

In [None]:
import matplotlib.pyplot as plt 

# Q1: Revisiting W2 lab

Let's revisit last week's exercise with `pandas`. It's very easy to read `CSV` files with `pandas`, using the [**`pandas.read_csv()`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function. You don't need to create dictionaries and other data structures yourself. `Pandas` imports the whole table into a data structure called a [**`DataFrame`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), which supports all kinds of useful manipulations. 

In [None]:
df = pd.read_csv('imdb.csv', delimiter='\t')
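
Although the file is named `imdb.csv`, the `delimiter='\t'` argument indicates that its fields are separated by tabs rather than commas. The sketch below shows the same call on a small in-memory sample; the sample text and its column names are made up for illustration and may not match the real file.

```python
import io
import pandas as pd

# Made-up, tab-separated sample; the real imdb.csv may have different columns.
sample_text = "Title\tYear\tRating\nMovie A\t1994\t8.1\nMovie B\t2015\t6.4\n"

# delimiter='\t' tells pandas to split fields on tabs instead of commas.
sample = pd.read_csv(io.StringIO(sample_text), delimiter="\t")

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # pandas infers a type for each column
```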

Let's look at the first few rows to get some sense of the data. 

In [None]:
df.head()

You can extract one column by using a dictionary-like expression:

In [None]:
df['Year'].head()

or select multiple columns by passing a list of column names:

In [None]:
df[['Year','Rating']].head()

To get the first 10 rows, slice the `DataFrame`:

In [None]:
df[:10]

We can also select both rows and columns. For example, to select the first 10 rows of the 'Year' and 'Rating' columns:

In [None]:
df[['Year','Rating']][:10]

The order in which you select rows and columns does not matter; selecting the rows first gives the same result:

In [None]:
df[:10][['Year','Rating']]
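
The two-step selections above can also be written as a single indexing step with `.iloc` (position-based) or `.loc` (label-based). This is a minimal sketch on a made-up `toy` table, not on the lab data; note that a `.loc` slice includes its end label, so `:1` covers labels 0 and 1.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [1994, 1999, 2005, 2015],
    "Rating": [8.1, 7.3, 6.0, 6.4],
})

# Two-step selection, as in the cells above: columns first, then rows.
chained = toy[["Year", "Rating"]][:2]

# Position-based: the first 2 rows, then the two columns.
by_position = toy.iloc[:2][["Year", "Rating"]]

# Label-based, rows and columns in one step; the slice :1 includes label 1.
by_label = toy.loc[:1, ["Year", "Rating"]]

print(chained.equals(by_position), chained.equals(by_label))
```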

Counting the number of movies per year is just as easy. The [**`value_counts()`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method counts how many times each data value (here, each year) appears. 

In [None]:
print( min(df['Year']), max(df['Year']) )
year_nummovies = df["Year"].value_counts()
year_nummovies.head()
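
By default `value_counts()` orders its result by count, most frequent value first, so `head()` shows the years with the most movies rather than the earliest years. The sketch below, on made-up values, shows how `sort_index()` puts the counts back in chronological order.

```python
import pandas as pd

# Made-up years for illustration only.
years = pd.Series([1994, 1994, 1999, 2015, 2015, 2015], name="Year")

counts = years.value_counts()  # indexed by year, most frequent year first
by_year = counts.sort_index()  # same counts, ordered by year

print(counts)
print(by_year)
```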

To calculate the average rating and the average number of votes:

In [None]:
print( np.mean(df['Rating']), np.mean(df['Votes']) )

To get the median rating of movies in the 1990s, we first select only the movies from that decade:

In [None]:
geq = df['Year'] >= 1990
leq = df['Year'] <= 1999
movie_nineties = df[geq & leq]
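
Each comparison above produces a boolean `Series` with one `True`/`False` entry per row; `&` combines the two element-wise, and indexing with the result keeps only the rows marked `True`. Here is a small sketch on a made-up `toy` table. It also shows `Series.between()`, which is inclusive at both ends by default, as one possible shorthand.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Year": [1989, 1990, 1995, 1999, 2000],
    "Rating": [5.0, 6.0, 7.0, 8.0, 9.0],
})

# When written on one line, each comparison needs its own parentheses,
# because & binds more tightly than >= and <=.
mask = (toy["Year"] >= 1990) & (toy["Year"] <= 1999)
nineties = toy[mask]

# Equivalent selection with between(), inclusive at both ends by default.
same = toy[toy["Year"].between(1990, 1999)]

print(mask.tolist())
print(nineties)
```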

In [None]:
movie_nineties.head()

Then, we can do the calculation

In [None]:
print( np.median(movie_nineties['Rating']), np.median(movie_nineties['Votes']) )

Finally, if we want to know the top 10 movies in the 1990s, we can use the [**`sort_values()`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) function:

In [None]:
sorted_by_rating = movie_nineties.sort_values('Rating', ascending=False)
sorted_by_rating[:10]
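
In recent pandas releases, sorting a `DataFrame` by a column is done with `sort_values()`. The sketch below uses a made-up `toy` table to show the sort-then-slice pattern, along with `nlargest()`, which is one possible shortcut when only the top few rows are needed.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [6.0, 8.1, 7.3, 9.0],
})

# Sort all rows by rating, highest first, then keep the first 2.
top2_sorted = toy.sort_values("Rating", ascending=False)[:2]

# Ask directly for the 2 rows with the largest ratings.
top2_nlargest = toy.nlargest(2, "Rating")

print(top2_sorted)
print(top2_nlargest)
```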

### Exercise

Calculate the following basic characteristics of ratings of movies only in 1994: 10th percentile, median, mean, 90th percentile.


* http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
* http://pandas.pydata.org/pandas-docs/stable/text.html

Write your code in the cell below

In [None]:
# implement here


# Q2: Basic plotting with pandas

`Pandas` provides some easy ways to draw plots by using `matplotlib`. `DataFrame` and `Series` objects have several plotting functions. For instance, 

In [None]:
df['Year'].hist()

### Exercise

Can you plot the histogram of ratings of the movies from 2015?

In [None]:
# implement here


# Q3: Basic plotting with matplotlib

Let's plot the histogram of ratings using the [**`pyplot.hist()`**](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist) function.

In [None]:
plt.hist(df['Rating'], bins=10)
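
With an integer `bins` argument, the histogram splits the range between the smallest and largest value into that many equal-width intervals and counts the values falling in each. `pyplot.hist()` relies on `numpy.histogram()` for this binning, so the numbers behind the bars can be inspected directly. A minimal sketch on made-up ratings, using 3 bins to keep the output short:

```python
import numpy as np

# Made-up ratings for illustration only.
ratings = np.array([1.0, 2.5, 3.0, 4.5, 5.0, 6.5, 7.0, 8.5, 9.0, 10.0])

# 3 equal-width bins between min(ratings) and max(ratings).
counts, edges = np.histogram(ratings, bins=3)

print(edges)   # 4 edges delimit 3 bins
print(counts)  # number of values in each bin
```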

### Exercise

Let's try to make some style changes to the plot:

* change the color from blue to whatever you want
  - http://matplotlib.org/users/pyplot_tutorial.html#working-with-text
  - http://matplotlib.org/api/colors_api.html
* add labels to the x and y axes


In [None]:
# implement here


# Q4: Basic plotting with Seaborn

Seaborn sits on top of matplotlib and makes it easier to draw statistical plots. Most plots that you create with Seaborn can also be created with matplotlib alone; it just typically requires a lot more work. 

Make sure seaborn is installed on your computer. If it is not, run

`conda install seaborn`

In [None]:
import seaborn as sb

Without changing anything else, let's just draw the histogram again:

In [None]:
plt.hist(df['Rating'], bins=10)

We can use the [**`distplot()`**](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html) function to plot the histogram.

In [None]:
sb.distplot(df['Rating'])

### Exercise

Read the documentation for the function and make the following changes: http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html

* change the number of bins to 10;
* do not show the KDE.

In [None]:
# implement here
