# W3 Lab Assignment

Submit the .ipynb file to Canvas with file name `w03_lab_lastname_firstname.ipynb`.

In this lab, we will introduce [pandas](http://pandas.pydata.org/), [matplotlib](http://matplotlib.org/), and [seaborn](http://stanford.edu/~mwaskom/software/seaborn/index.html) and continue to use the `imdb.csv` file from the last lab.


## Importing libraries

In [3]:
# pandas makes tedious jobs of reading and manipulating data super easy and nice. You can even plot 
# directly using pandas. 
import pandas as pd  
import numpy as np

### Matplotlib magic

The `IPython` (`Jupyter`) notebook provides several **magic** commands. One of the most useful is `%matplotlib inline`, which displays plots within the notebook instead of creating figure files.

In [4]:
%matplotlib inline 

There are many ways to import `matplotlib`, but the most common way is:

In [5]:
import matplotlib.pyplot as plt 

# Q1: Revisiting W2 lab

Let's revisit last week's exercise with `pandas`. It's very easy to read `CSV` files with `pandas`, using the [`pandas.read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function. You don't need to create dictionaries and other data structures. `Pandas` just imports the whole table into a data structure called [**`DataFrame`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). You can do all kinds of interesting manipulations with the `DataFrame`.

In [6]:
df = pd.read_csv('../../lab02_imdb/imdb.csv', delimiter='\t')
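As a minimal sketch of what `pd.read_csv` does with tab-delimited input, the snippet below parses a tiny in-memory table instead of `imdb.csv`. The two sample rows, the `io.StringIO` wrapper, and the names `raw` and `sample` are assumptions made only so the example runs without the data file.

```python
import io

import pandas as pd

# Two rows of tab-separated text standing in for imdb.csv (illustrative only).
raw = "Title\tYear\tRating\tVotes\n!Next?\t1994\t5.4\t5\n#1 Single\t2006\t6.1\t61\n"

# sep='\t' has the same effect as the delimiter='\t' argument used above.
sample = pd.read_csv(io.StringIO(raw), sep="\t")

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred by pandas
```

The header row becomes the column names, and each column gets a type inferred from its values, which is why no manual parsing is needed.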

Let's look at the first few rows to get some sense of the data. 

In [7]:
df.head()

Unnamed: 0,Title,Year,Rating,Votes
0,!Next?,1994,5.4,5
1,#1 Single,2006,6.1,61
2,#7DaysLater,2013,7.1,14
3,#Bikerlive,2014,6.8,11
4,#ByMySide,2012,5.5,13


You can extract one column by using a dictionary-like expression:

In [8]:
df['Year'].head()

0    1994
1    2006
2    2013
3    2014
4    2012
Name: Year, dtype: int64

or select multiple columns by passing a list of column names:

In [9]:
df[['Year','Rating']].head()

Unnamed: 0,Year,Rating
0,1994,5.4
1,2006,6.1
2,2013,7.1
3,2014,6.8
4,2012,5.5
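The two selection styles above return different types, which the following sketch illustrates on a small hand-made table. The `toy` DataFrame is a stand-in for `df`, not part of the lab data.

```python
import pandas as pd

# A small stand-in for df (illustrative values only).
toy = pd.DataFrame({
    "Title": ["!Next?", "#1 Single", "#7DaysLater"],
    "Year": [1994, 2006, 2013],
    "Rating": [5.4, 6.1, 7.1],
})

one_column = toy["Year"]               # a single label returns a Series
two_columns = toy[["Year", "Rating"]]  # a list of labels returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```

Note that `toy[["Year"]]`, with a one-element list, would also return a DataFrame rather than a Series.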


To get the first 10 rows, slice the `DataFrame`:

In [11]:
df[:10]

Unnamed: 0,Title,Year,Rating,Votes
0,!Next?,1994,5.4,5
1,#1 Single,2006,6.1,61
2,#7DaysLater,2013,7.1,14
3,#Bikerlive,2014,6.8,11
4,#ByMySide,2012,5.5,13
5,#LawstinWoods,2013,7.0,6
6,#lovemilla,2013,6.7,17
7,#nitTWITS,2011,7.1,9
8,$#*! My Dad Says,2010,6.3,4349
9,"$1,000,000 Chance of a Lifetime",1986,6.4,16


We can also select both rows and columns. For example, to select the first 10 rows of the 'Year' and 'Rating' columns:

In [12]:
df[['Year','Rating']][:10]

Unnamed: 0,Year,Rating
0,1994,5.4
1,2006,6.1
2,2013,7.1
3,2014,6.8
4,2012,5.5
5,2013,7.0
6,2013,6.7
7,2011,7.1
8,2010,6.3
9,1986,6.4


The order in which you apply the row and column selections does not matter:

In [13]:
df[:10][['Year','Rating']]

Unnamed: 0,Year,Rating
0,1994,5.4
1,2006,6.1
2,2013,7.1
3,2014,6.8
4,2012,5.5
5,2013,7.0
6,2013,6.7
7,2011,7.1
8,2010,6.3
9,1986,6.4
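The chained selections above can also be written as a single indexing step. The sketch below, on a hypothetical `toy` table, checks that both chained orders and a single `.loc` call give the same result.

```python
import pandas as pd

# Illustrative stand-in for df.
toy = pd.DataFrame({
    "Title": ["!Next?", "#1 Single", "#7DaysLater", "#Bikerlive"],
    "Year": [1994, 2006, 2013, 2014],
    "Rating": [5.4, 6.1, 7.1, 6.8],
})
cols = ["Year", "Rating"]

cols_then_rows = toy[cols][:2]
rows_then_cols = toy[:2][cols]

# .loc selects rows and columns in one step; toy.index[:2] picks the first two row labels.
single_step = toy.loc[toy.index[:2], cols]

print(cols_then_rows.equals(rows_then_cols))
print(cols_then_rows.equals(single_step))
```

A single `.loc` call avoids creating an intermediate DataFrame, which tends to matter more when assigning values than when only reading them.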


It is very easy to count the number of movies per year. The [`value_counts()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method counts how many times each data value (here, each year) appears and returns the counts sorted from most to least frequent.

In [17]:
print( min(df['Year']), max(df['Year']) )
year_nummovies = df["Year"].value_counts()
year_nummovies.head()

1874 2017


2011    13944
2012    13887
2013    13048
2010    12931
2009    12268
dtype: int64
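Because `value_counts()` orders its result by count, the years above are not chronological. A small sketch with made-up years shows how `sort_index()` can restore chronological order; the `years` Series is an assumption for illustration.

```python
import pandas as pd

# Made-up years standing in for df['Year'].
years = pd.Series([2011, 2012, 2011, 2010, 2011, 2012])

counts = years.value_counts()   # most frequent value first
by_year = counts.sort_index()   # reorder by the year label instead

print(counts)
print(by_year)
```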

To calculate the average rating and the average number of votes:

In [19]:
print( np.mean(df['Rating']), np.mean(df['Votes']) )

6.29619534138 1691.2317746
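The same averages can be obtained with the `mean()` method that pandas provides on columns. The sketch below uses a hypothetical three-row `toy` table rather than the full data set.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df.
toy = pd.DataFrame({"Rating": [5.4, 6.1, 7.1], "Votes": [5, 61, 14]})

mean_rating = toy["Rating"].mean()            # method on a single column
col_means = toy[["Rating", "Votes"]].mean()   # one mean per selected column

print(mean_rating, np.mean(toy["Rating"]))
print(col_means)
```

Calling `mean()` on several columns at once returns a Series indexed by column name, which saves repeating the call for each column.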


To get the median rating of movies in the 1990s, we first select only the movies from that decade:

In [22]:
geq = df['Year'] >= 1990
leq = df['Year'] <= 1999
movie_nineties = df[geq & leq]

In [23]:
movie_nineties.head()

Unnamed: 0,Title,Year,Rating,Votes
0,!Next?,1994,5.4,5
23,'N Sync TV,1998,7.5,11
33,'t Zal je gebeuren...,1998,6.0,7
34,'t Zonnetje in huis,1993,6.1,148
42,.COM,1999,3.8,5
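The selection above relies on boolean masks: each comparison yields a Series of `True`/`False` values, and `&` combines them element by element. The sketch below, on a made-up `toy` table, shows the inline form and the `between` shortcut, which is inclusive on both ends by default.

```python
import pandas as pd

# Illustrative stand-in for df.
toy = pd.DataFrame({"Title": list("abcd"), "Year": [1989, 1990, 1999, 2000]})

# Parentheses are needed when the comparisons are written inline,
# because & binds more tightly than >= and <=.
mask = (toy["Year"] >= 1990) & (toy["Year"] <= 1999)

# Equivalent mask using Series.between.
same_mask = toy["Year"].between(1990, 1999)

print(toy[mask])
```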


Then we can compute the medians:

In [25]:
print( np.median(movie_nineties['Rating']), np.median(movie_nineties['Votes']) )

6.3 32.0
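For comparison, the filter and the median can be combined using pandas methods only. This is a sketch on a hypothetical six-row table; `nineties`, `med_np`, and `med_pd` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df.
toy = pd.DataFrame({
    "Year": [1994, 1998, 1998, 1993, 1999, 2006],
    "Rating": [5.4, 7.5, 6.0, 6.1, 3.8, 6.1],
})

nineties = toy[toy["Year"].between(1990, 1999)]

med_np = np.median(nineties["Rating"])  # NumPy function, as used above
med_pd = nineties["Rating"].median()    # equivalent pandas method

print(med_np, med_pd)
```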


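Finally, if we want to know the 10 highest-rated movies of the 1990s, we can use the `sort_values` method. Older pandas versions provided this as [`sort`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html), which has since been removed:

In [30]:
# sort_values replaces the removed DataFrame.sort method
sorted_by_rating = movie_nineties.sort_values('Rating', ascending=False)
sorted_by_rating[:10]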

Unnamed: 0,Title,Year,Rating,Votes
131241,Girls Loving Girls,1996,9.8,5
202778,Nicole's Revenge,1995,9.5,13
38899,The Beatles Anthology,1995,9.4,3822
39429,The Civil War,1990,9.4,4615
218444,Pink Floyd: P. U. L. S. E. Live at Earls Court,1994,9.3,3202
279320,The Shawshank Redemption,1994,9.3,1511933
72171,Bardot,1992,9.2,5
42590,The Sopranos,1999,9.2,163406
29419,Otvorena vrata,1994,9.1,2337
3955,Baseball,1994,9.1,2463
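Sorting the whole table and slicing is one way to get the top rows; `nlargest` is a shortcut for the same task. The sketch below compares the two on a made-up table with no tied ratings.

```python
import pandas as pd

# Illustrative stand-in for movie_nineties (ratings chosen without ties).
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [6.1, 9.3, 7.5, 9.4],
})

top2_sorted = toy.sort_values("Rating", ascending=False)[:2]
top2_nlargest = toy.nlargest(2, "Rating")

print(top2_sorted)
print(top2_nlargest)
```

Both approaches keep the original row labels, which is why the index in the output above is not 0 through 9.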


### Exercise

Calculate various basic characteristics (10th percentile, median, mean, 90th percentile) of the ratings of movies released in 1994.

How about if we want to select movies whose titles contain a particular word, say 'girls', 'boys', or 'war'?

* http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
* http://pandas.pydata.org/pandas-docs/stable/text.html
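As a hint, here is a sketch of the two tools from the links above, applied to a four-row table that is not the exercise data. The `toy` DataFrame and the choice of a case-insensitive match are assumptions for illustration; the exercise itself still needs the 1994 filter on `df`.

```python
import numpy as np
import pandas as pd

# Illustrative table, not the exercise data.
toy = pd.DataFrame({
    "Title": ["Girls Loving Girls", "The Civil War", "Baseball", "Bardot"],
    "Rating": [9.8, 9.4, 9.1, 9.2],
})

# np.percentile accepts a list of percentiles and returns one value per entry.
p10, p90 = np.percentile(toy["Rating"], [10, 90])

# str.contains builds a boolean mask; case=False ignores letter case.
has_war = toy["Title"].str.contains("war", case=False)

print(p10, p90)
print(toy[has_war])
```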

Write your code in the cell below

In [40]:
# implement here


# Q2: Basic plotting with pandas

In [None]:
df['Year'].hist()
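The `hist()` call returns a matplotlib axes object, so the plot can be adjusted afterwards. The sketch below uses a short made-up `years` Series; the bin count and axis labels are illustrative choices, not requirements of the lab.

```python
import pandas as pd

# Made-up years standing in for df['Year'].
years = pd.Series([1994, 2006, 2013, 2014, 2012, 2013, 2013, 2011, 2010, 1986])

# bins controls how many equal-width intervals the year range is split into.
ax = years.hist(bins=5)
ax.set_xlabel("Year")
ax.set_ylabel("Number of titles")
```

In a notebook with `%matplotlib inline` enabled, the figure appears directly below the cell.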


# Q3: Basic plotting with matplotlib

# Q4: Basic plotting with Seaborn

In [None]:
import seaborn as sb