# Lab Assignment 01

Submit the .ipynb file to Canvas with file name 'lab02_lastname_firstname'.

[Internet Movie Database (IMDb)](http://www.imdb.com/) provides various information about movies, such as total budgets, lengths, actors, and user ratings. The data are publicly available from [here](http://www.imdb.com/interfaces). In this lab, let's explore a processed dataset named 'imdb.csv', which contains some basic information about movies.

Download the file from Canvas. It has 4 tab-separated columns:

1. Title: title of the movie;
1. Year: release year;
1. Rating: average IMDb user rating;
1. Votes: number of IMDb users who rated this movie.

These are the questions to explore:

1. What are the first and last years in this dataset? How many movies were released in each year?
1. What are the average rating and the average number of votes?
1. Which 10 movies have the highest ratings/votes?
1. What is the median rating of movies released in each decade?
1. Which 5 movies have the highest ratings in the 1980s and in the 1990s?

Things to note:

1. Let's use Python 3.4;
2. There are 313,012 lines in the file. When printing things, print selectively.


# Q1: How many movies were released in each year?

To do this, we first need to read the CSV file. Python provides the [csv](https://docs.python.org/2/library/csv.html) module to read and write CSV files. The [`csv.reader`](https://docs.python.org/2/library/csv.html#csv.reader) function returns an object that iterates over the lines of the given file. Each line is returned as a list of strings, so we can access a particular column with a list index. To ignore the first line, we can use [`islice`](https://docs.python.org/2/library/itertools.html#itertools.islice). It works like slicing a list, but it can slice any iterator (e.g. a file stream). For instance, `islice(reader, 0, 5)` means "give me the first 5 items from the `reader`", and `islice(reader, 1, 5)` means "give me the 4 items starting from the second item".

A basic usage example that reads the first 3 lines of 'imdb.csv' (the header and two movies):

In [9]:
import csv
from itertools import islice

f = open('imdb.csv', 'r')
reader = csv.reader(f, delimiter='\t')
for row in islice(reader, 0, 3):
    print(row)
    print(row[1])

['Title', 'Year', 'Rating', 'Votes']
Year
['!Next?', '1994', '5.4', '5']
1994
['#1 Single', '2006', '6.1', '61']
2006
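
To skip the header row rather than print it, the slice can start at index 1 and use `None` as the stop value, which means "until the end". The sketch below assumes a tiny in-memory sample (the `sample` string is made up for illustration and reuses the two rows shown above) instead of the real file:

```python
import csv
import io
from itertools import islice

# Hypothetical in-memory stand-in for 'imdb.csv' (tab-separated, with a header).
sample = "Title\tYear\tRating\tVotes\n!Next?\t1994\t5.4\t5\n#1 Single\t2006\t6.1\t61\n"

reader = csv.reader(io.StringIO(sample), delimiter='\t')
# Start at index 1 to skip the header; stop=None means "until the end".
rows = list(islice(reader, 1, None))
for row in rows:
    print(row[0], row[1])
```

With the real file, `io.StringIO(sample)` would be replaced by the opened file object.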


There are many ways to do Q1. One way is to use [dictionaries](https://docs.python.org/2/tutorial/datastructures.html#dictionaries) where the key: value pairs are:

- key: year
- value: a list of movie titles or number of movies


In [10]:
dt = {}
year = 2013
if year not in dt:
    dt[year] = 1
else:
    dt[year] += 1
print(dt)

{2013: 1}


Python automates the job above with [`Counter`](https://docs.python.org/3.4/library/collections.html#collections.Counter), which treats a missing key as a count of 0.

In [2]:
from collections import Counter

movie_counter = Counter()
movie_counter[1972] += 1
print(movie_counter[1972])
print(movie_counter[1970])

1
0
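
A `Counter` can also be built directly from a list, which saves the explicit `+= 1` loop. A minimal sketch, assuming the years have already been collected into a list (the `years` values here are made up for illustration):

```python
from collections import Counter

# Hypothetical list of release years, e.g. collected from column 1 of the file.
years = ['1994', '2006', '1994', '1972', '2006', '1994']

year_counts = Counter(years)          # counts every item in one pass
print(year_counts['1994'])            # count for a single year
print(year_counts.most_common(2))     # the two most frequent years with their counts
```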


Once all lines are read, we want to print the dictionary, which can be done by iterating its key: value pairs.

In [5]:
for key,val in dt.items():
    print(key,val)
for key,val in movie_counter.items():
    print(key,val)

(1972, 1)
(1972, 1)


You can get the keys (the years) by using the `.keys()` method.

In [6]:
movie_counter[1980] += 5
movie_counter[2015] += 1
movie_counter.keys()

[1980, 1972, 2015]

You also have convenient functions like [`min()`](https://docs.python.org/2/library/functions.html#min) and [`max()`](https://docs.python.org/2/library/functions.html#max) for finding the smallest and largest value of a list or other iterable.

In [25]:
alist = [23,3,5,4,2,1,1,0,1000]
print(min(alist))
print(max(alist))

0
1000
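
One caveat when applying `min()` and `max()` to values read from a CSV file: they arrive as strings, and strings compare character by character rather than numerically. The sketch below uses made-up values (including a three-digit year that is not claimed to be in the dataset) to show the difference:

```python
# Hypothetical years as csv.reader would return them: strings, not numbers.
years_as_text = ['1874', '2017', '999']

# String comparison goes character by character, so '1874' sorts before '999'.
print(min(years_as_text), max(years_as_text))

# Converting to int first gives the numeric answer.
years = [int(y) for y in years_as_text]
print(min(years), max(years))
```

When every year has exactly four digits, both approaches agree, so converting is a safeguard rather than a requirement.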


**Code for Q1**

In [11]:
# implement below
import csv
from itertools import islice
from collections import Counter

years = []  # release year of every movie, kept as strings

with open('imdb.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    # start at index 1 to skip the header row
    for row in islice(reader, 1, None):
        years.append(row[1])

print('\n')
print('the min year is')
# comparing the year strings works as long as every year has four digits
print(min(years))
print('the max year is')
print(max(years))
print('\n')
print('the movies released in each year sorted with max no of movies first:')
print('\n')
# printing a Counter lists its entries from most to least common
print(Counter(years))
print('\n')






the min year is
1874
the max year is
2017


the movies released in each year sorted with max no of movies first:


Counter({'2011': 13944, '2012': 13887, '2013': 13048, '2010': 12931, '2009': 12268, '2008': 11095, '2014': 10862, '2007': 10147, '2006': 10115, '2005': 9508, '2004': 8584, '2003': 7355, '2002': 6694, '2001': 6042, '2000': 5575, '1999': 5138, '1998': 4651, '2015': 4402, '1997': 4353, '1996': 3923, '1995': 3698, '1994': 3415, '1989': 3193, '1992': 3136, '1993': 3128, '1990': 3093, '1988': 3054, '1987': 3049, '1991': 2993, '1985': 2908, '1986': 2882, '1984': 2779, '1983': 2647, '1982': 2537, '1979': 2526, '1981': 2485, '1972': 2445, '1980': 2438, '1976': 2399, '1974': 2392, '1978': 2386, '1971': 2370, '1973': 2325, '1969': 2320, '1975': 2286, '1977': 2264, '1970': 2240, '1968': 2199, '1967': 2086, '1966': 2025, '1965': 1896, '1964': 1823, '1962': 1669, '1963': 1635, '1961': 1623, '1957': 1604, '1959': 1572, '1960': 1567, '1958': 1533, '1956': 1479, '1955': 1476, '1954': 139

# Q2: Average ratings/votes

We can store the ratings/votes column as a list and then calculate various basic statistics (mean, median, etc.). To do this, we can use the [NumPy](http://www.numpy.org/) library and call the functions [`numpy.mean`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and [`numpy.median`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.median.html). For example,

In [14]:
import numpy as np

alist = [1,3,6,2,5,2]
print(np.mean(alist))
print(np.median(alist))

3.16666666667
2.5
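
Because `csv.reader` returns every field as a string, the Rating and Votes columns need to be converted to numbers before any statistics are computed. The sketch below uses made-up column values and, as a stand-in for NumPy, the standard-library `statistics` module; `np.mean` and `np.median` could be applied to the converted lists in the same way:

```python
from statistics import mean, median

# Hypothetical Rating and Votes columns as csv.reader returns them (strings).
ratings_text = ['5.4', '6.1', '9.2']
votes_text = ['5', '61', '1000']

# Convert once, then reuse the numeric lists for every statistic.
ratings = [float(r) for r in ratings_text]
votes = [int(v) for v in votes_text]

mean_rating, median_rating = mean(ratings), median(ratings)
mean_votes, median_votes = mean(votes), median(votes)
print(mean_rating, median_rating)
print(mean_votes, median_votes)
```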


**Code for Q2**

In [2]:
# implement below
import csv
import numpy as np
from itertools import islice

ratings = []
with open('imdb.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    # skip the header row; column 2 holds the rating
    for row in islice(reader, 1, None):
        ratings.append(float(row[2]))
a = np.array(ratings)

# median rating
print(np.median(a))

# sorted ratings (NumPy abbreviates long arrays when printing)
a.sort()
print(a)

# mean rating
x = np.mean(a)
print(x)



6.5
[ 1.   1.   1.  ...,  9.8  9.8  9.9]
6.29619534138


# Q3: Top 10 movies

Store the movie titles and ratings as a dictionary:

- key: movie title
- value: movie rating

Then, we can sort the dictionary's items by value, which returns a list of [tuples](https://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences). Remember to print only the top 10 movies.

In [10]:
import operator

dt = {1971: 2, 1975: 10, 1962: 1, 1980: 50, 1981: 55}
sorted_x_by_val = sorted( dt.items(), key=operator.itemgetter(1), reverse=True )
print(sorted_x_by_val)
for elem in sorted_x_by_val:
    print(elem[0],elem[1])

[(1981, 55), (1980, 50), (1975, 10), (1971, 2), (1962, 1)]
(1981, 55)
(1980, 50)
(1975, 10)
(1971, 2)
(1962, 1)
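
To keep only the top entries, the sorted list can be sliced. As an alternative sketch, `heapq.nlargest` from the standard library picks the n largest items without sorting everything. The titles and ratings below are placeholders invented for illustration:

```python
import heapq
import operator

# Hypothetical title -> rating mapping (values are illustrative only).
title_to_rating = {'A': 7.1, 'B': 9.0, 'C': 5.5, 'D': 8.2}

# Option 1: sort all (title, rating) pairs, then slice off the first n.
top2_sorted = sorted(title_to_rating.items(),
                     key=operator.itemgetter(1), reverse=True)[:2]

# Option 2: heapq.nlargest returns the n largest pairs, largest first.
top2_heap = heapq.nlargest(2, title_to_rating.items(),
                           key=operator.itemgetter(1))

print(top2_sorted)
print(top2_heap)
```

Both options give the same result here; slicing is simpler, while `nlargest` may be preferable when n is small relative to the data.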


**Code for Q3**

In [23]:
# implement below
import operator
import csv
from itertools import islice

# movie title -> rating (a repeated title keeps the rating of its last row)
title_to_rating = {}
with open('imdb.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    # skip the header row
    for row in islice(reader, 1, None):
        # column 2 is the rating; convert it so the sort is numeric
        title_to_rating[row[0]] = float(row[2])

# sort (title, rating) pairs by rating, highest first
sorted_by_rating = sorted(title_to_rating.items(), key=operator.itemgetter(1), reverse=True)

for title, rating in sorted_by_rating[:10]:
    print(title, rating)


# Q4: Median ratings of movies in each decade

We first need to transform year to decade, e.g., 1984 -> 1980s.


In [13]:
year = '1984'
print(year[:3])
print(year[:3]+'0s')

198
1980s
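
Slicing the first three characters assumes the year is a four-digit string. A minimal alternative sketch uses integer division instead; `to_decade` is a hypothetical helper name, not part of the lab materials:

```python
def to_decade(year):
    """Return a decade label such as '1980s' for a year given as str or int."""
    # Integer-divide by 10 to drop the last digit, then scale back up.
    return str(int(year) // 10 * 10) + 's'

print(to_decade('1984'))
print(to_decade(2006))
```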


We then use a dictionary to store rating information:

- key: decade
- value: a list of ratings of movies released in the decade


In [15]:
decade_to_r = {}
decade = '1980s'
if decade not in decade_to_r:
    decade_to_r[decade] = []
decade_to_r[decade].append(7.5)
print(decade_to_r)

{'1980s': [7.5]}


Python automates the job above with [defaultdict](https://docs.python.org/2/library/collections.html#collections.defaultdict): accessing a missing key creates it with an empty list.

In [17]:
from collections import defaultdict

dec_to_r = defaultdict(list)
dec_to_r['1980s'].append(7.5)
dec_to_r['1980s'].append(9.1)
print(dec_to_r)
print(dec_to_r['1970s'])

defaultdict(<type 'list'>, {'1980s': [7.5, 9.1]})
[]
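
Once the ratings are grouped by decade, the remaining step is one median per list. A small sketch on made-up `(year, rating)` pairs, using the standard-library `statistics.median` as a stand-in (`np.median` could be called on each list in the same way):

```python
from collections import defaultdict
from statistics import median

# Hypothetical (year, rating) pairs standing in for rows of the file.
rows = [('1984', 7.5), ('1989', 9.1), ('1972', 9.2), ('1975', 6.0), ('1979', 8.0)]

# Group the ratings by decade label.
dec_to_r = defaultdict(list)
for year, rating in rows:
    dec_to_r[year[:3] + '0s'].append(rating)

# One median per decade, in chronological order.
medians = {decade: median(dec_to_r[decade]) for decade in sorted(dec_to_r)}
print(medians)
```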


**Code for Q4**

In [80]:
# implement below
import csv
import numpy as np
from itertools import islice
from collections import defaultdict

# decade label (e.g. '1980s') -> list of ratings of movies from that decade
dec_to_r = defaultdict(list)

with open('imdb.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    # skip the header row, then read every movie
    for row in islice(reader, 1, None):
        # a string such as '1980s' is hashable, so it can be a dictionary key
        # (using the whole list of years as a key raises
        # "TypeError: unhashable type: 'list'")
        decade = row[1][:3] + '0s'
        dec_to_r[decade].append(float(row[2]))

# print the median rating per decade, in chronological order
for decade in sorted(dec_to_r):
    print(decade, np.median(dec_to_r[decade]))

# Q5: 5 movies with highest ratings in each decade

Unlike Q4, we now need to store not only ratings but also movie titles. This can be done by making the values of the dictionary dictionaries themselves.

- key: decade
- value: a dictionary mapping movie titles to ratings


In [19]:
dec_to_title_to_rating = defaultdict(dict)
dec_to_title_to_rating['1970s']['The Godfather'] = 9.2
print(dec_to_title_to_rating)
print(dec_to_title_to_rating['1970s'])

defaultdict(<type 'dict'>, {'1970s': {'The Godfather': 9.2}})
{'The Godfather': 9.2}
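
With this nested structure in place, the top movies of a decade can be found by sorting that decade's inner dictionary by rating. The sketch below is one possible approach on invented rows (placeholder titles and ratings), not a reference solution:

```python
from collections import defaultdict
import operator

# Hypothetical (title, year, rating) rows; the values are made up for illustration.
rows = [
    ('Movie A', '1981', 7.0), ('Movie B', '1985', 8.5), ('Movie C', '1989', 6.2),
    ('Movie D', '1992', 9.0), ('Movie E', '1997', 7.7),
]

# decade -> {title -> rating}
dec_to_title_to_rating = defaultdict(dict)
for title, year, rating in rows:
    dec_to_title_to_rating[year[:3] + '0s'][title] = rating

top_n = 2  # the lab asks for 5; 2 keeps this toy output short
top_by_decade = {}
for decade in ('1980s', '1990s'):
    # Sort this decade's (title, rating) pairs by rating, highest first.
    ranked = sorted(dec_to_title_to_rating[decade].items(),
                    key=operator.itemgetter(1), reverse=True)
    top_by_decade[decade] = ranked[:top_n]
    print(decade, top_by_decade[decade])
```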


**Code for Q5**