# 2489-1819 Data Curation T1 Final Exam

The final exam contains one question with subquestions, worth 70% of the total points (20% for 5 questions in SQL and 50% for Python).

## Game of Thrones or The Big Bang Theory?

“What movie should I watch this evening?” You probably ask yourself this question often. I certainly do, and more than once. From Netflix to Hulu, building robust movie recommendation systems is extremely important, given modern consumers' huge demand for personalized content.

We are going to examine a dataset from MovieLens, a service that provides non-commercial, personalized movie recommendations.

This dataset describes 5-star ratings from MovieLens. It contains ratings and tag applications across movies, created by users. Users were selected at random for inclusion. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `movies_fe.xlsx` and `ratings_fe.csv`. More details about the contents and use of these files follow.

**Ratings Data File Structure (ratings_fe.csv)**
All ratings are contained in the file `ratings_fe.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
`userId,movieId,rating,timestamp`

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

**Movies Data File Structure (movies_fe.xlsx)**
Movie information is contained in the file `movies_fe.xlsx`. Each row of this file after the header row represents one movie, and has the following format:
`movieId,title,year,genres`

The database `movielens_small` contains 4 tables: *ratings, tags, movies and links*. Use these tables to answer the multiple-choice questions.


Answer the following questions using the provided dataset. You can write down intermediate results on the way to the final answers.

In [1]:
import pandas as pd
import numpy as np

### Question 1 (10 points)

The files may contain errors and inconsistencies, as described below:

Ratings in `ratings_fe.csv` should be on a 5-star scale with half-star increments (0.5 stars - 5.0 stars). Any rating larger than 5 must be changed to 5, and any rating smaller than 0.5 must be changed to 1. For example, a movie rated 8 was probably rated wrongly, so change the value to 5. Similarly, a rating of -1, if any, should be changed to 1.

In [2]:
movies = pd.read_excel('movies_fe.xlsx', skiprows=15)

In [3]:
movies.head()

Unnamed: 0,movieId,title,year,genres
0,1,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji,1995.0,Adventure|Children|Fantasy
2,3,Grumpier Old Men,1995.0,Comedy|Romance
3,4,Waiting to Exhale,1995.0,Comedy|Drama|Romance
4,5,Father of the Bride Part II,1995.0,Comedy


In [4]:
ratings = pd.read_csv('ratings_fe.csv', skiprows=10, index_col=0)

In [5]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,6,2.0,980730861
1,1,22,3.0,980731380
2,1,32,2.0,980731926
3,1,50,5.0,980732037
4,1,110,4.0,980730408


In [6]:
ratings['rating'].max()

10.0

In [7]:
ratings['rating'].min()

-1.0

Let's correct the wrong values:

In [8]:
ratings.loc[ratings['rating']>5, 'rating'] = 5
ratings.loc[ratings['rating']<0.5, 'rating'] = 1
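
The correction rule can be checked on a small example before trusting it on the full file. The sketch below uses a hypothetical `toy` frame (not part of the exam data) with two out-of-range values, applies the same two assignments, and confirms that every rating ends up inside the valid range.

```python
import pandas as pd

# Hypothetical ratings, including two out-of-range values (8 and -1).
toy = pd.DataFrame({"rating": [8.0, 4.5, -1.0, 0.5, 5.0]})

# Same rule as in the exam statement: above 5 becomes 5, below 0.5 becomes 1.
toy.loc[toy["rating"] > 5, "rating"] = 5
toy.loc[toy["rating"] < 0.5, "rating"] = 1

# Every rating should now lie between 0.5 and 5 inclusive.
in_range = toy["rating"].between(0.5, 5)
print(toy["rating"].tolist())
print(in_range.all())
```

Note that this only checks the range; it does not verify that the values are half-star increments.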

### Question 2 (5 points)

Show the 5 years with the largest number of movies:

In [9]:
movies['year'].value_counts().head(5)

2006.0    315
2002.0    294
1996.0    280
1995.0    273
1999.0    266
Name: year, dtype: int64

In [10]:
(movies.groupby('year')[['movieId']].count().reset_index()
       .rename(columns={'movieId': 'number_movies'})
       .sort_values(by='number_movies', ascending=False).head())

Unnamed: 0,year,number_movies
90,2006.0,315
86,2002.0,294
80,1996.0,280
79,1995.0,273
83,1999.0,266
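
The years in the output above print as floats (for example `2006.0`), which usually happens when a numeric column contains missing values. As an optional cleanup, and assuming the non-missing years are whole numbers, the column can be cast to pandas' nullable integer type before counting. The `years` series below is hypothetical toy data.

```python
import pandas as pd

# Hypothetical year column with one missing value, stored as float.
years = pd.Series([1995.0, 1995.0, 2006.0, None], name="year")

# "Int64" (capital I) is the nullable integer dtype, so missing values survive the cast.
years_int = years.astype("Int64")

# value_counts() ignores missing values by default and sorts by count, descending.
counts = years_int.value_counts()
print(counts)
```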


### Question 3 (5 points)
Show the average rating of the movie with ID 100:

In [11]:
ratings.loc[ratings['movieId']==100, 'rating'].mean()

3.328125

### Question 4 (5 points)

Show the median rating given by the user with ID 500:

In [12]:
ratings.loc[ratings['userId']==500, 'rating'].median()

4.0
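
Questions 3 and 4 both filter the ratings and then aggregate a single group. When the same statistic is needed for every movie or every user, `groupby` computes all groups at once. This is a sketch on a hypothetical `toy` frame, not the exam data.

```python
import pandas as pd

# Hypothetical ratings: three users, two movies.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [100, 200, 100, 200, 100],
    "rating": [3.0, 4.0, 4.0, 5.0, 5.0],
})

# Average rating of each movie (Question 3, for all movies at once).
mean_by_movie = toy.groupby("movieId")["rating"].mean()

# Median rating given by each user (Question 4, for all users at once).
median_by_user = toy.groupby("userId")["rating"].median()

print(mean_by_movie.loc[100])
print(median_by_user.loc[1])
```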

### Question 5 (10 points)

Among all movies that the user with ID 500 has rated, show his/her 10 most recently rated favorite movies (i.e., the movies he/she rated 5) as three columns: `movieId, title, rating`

In [13]:
movie_ratings = pd.merge(movies, ratings, on='movieId')

In [14]:
movie_ratings.loc[movie_ratings['userId']==500].sort_values(['rating', 'timestamp'], ascending=False)[['movieId', 'title', 'rating']].head(10)

Unnamed: 0,movieId,title,rating
46923,1924,Plan 9 from Outer Space,5.0
40242,1391,Mars Attacks!,5.0
51497,2162,"NeverEnding Story II: The Next Chapter, The",5.0
80453,5502,Signs,5.0
52040,2232,Cube,5.0
23276,671,Mystery Science Theater 3000: The Movie,5.0
69078,3671,Blazing Saddles,5.0
52638,2291,Edward Scissorhands,5.0
60174,2788,Monty Python's And Now for Something Completel...,5.0
8455,235,Ed Wood,5.0
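
The solution above relies on sorting by rating first, so it returns only favorites as long as the user has at least 10 movies rated 5. A slightly more explicit variant filters on the rating before sorting by time. The sketch below uses a hypothetical `toy` frame with made-up titles and timestamps.

```python
import pandas as pd

# Hypothetical merged movie/rating rows.
toy = pd.DataFrame({
    "userId": [500, 500, 500, 500, 7],
    "movieId": [10, 20, 30, 40, 10],
    "title": ["A", "B", "C", "D", "A"],
    "rating": [5.0, 4.0, 5.0, 5.0, 5.0],
    "timestamp": [100, 400, 300, 200, 500],
})

# Keep only user 500's movies rated 5, newest first, at most 10 rows.
favorites = (
    toy.loc[(toy["userId"] == 500) & (toy["rating"] == 5)]
    .sort_values("timestamp", ascending=False)
    .loc[:, ["movieId", "title", "rating"]]
    .head(10)
)
print(favorites)
```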


### Question 6 (15 points)

Now you need to implement a **recommender system using the collaborative filtering method**. The idea is simply to recommend movies on the basis that "people who like this movie also like these movies". For example, people who like to watch Star Wars are very likely to watch Star Trek.

To do so, find the users who like a given movie (i.e., gave it a rating of 5), then count which other movies these users also like, ranked by the number of likes.

Show the list of the top 10 recommended movies that users who like *Titanic* may also like.

In [15]:
titanic_user = movie_ratings.loc[(movie_ratings['title'] == 'Titanic') & (movie_ratings['rating'] == 5), 'userId']
titanic_user

44874     46
44879     71
44894    130
44896    145
44899    160
44921    247
44931    289
44932    290
44934    302
44946    352
44950    374
44956    396
44958    413
44961    426
44964    430
44971    463
44977    491
44983    526
44986    550
44990    570
44992    576
44998    599
45001    608
45006    633
45016    683
66760    247
66765    536
Name: userId, dtype: int64
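
In the output above, user 247 appears twice, and the last two rows sit in a different index range from the rest. One possible explanation, which is an assumption here and not verified against the data, is that more than one movie shares the title "Titanic". A minimal sketch of how to check for such ambiguous titles, using a hypothetical `toy_movies` frame with made-up IDs and years:

```python
import pandas as pd

# Hypothetical movie table in which two different movies share one title.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Titanic", "Titanic", "Casino"],
    "year": [1997.0, 1953.0, 1995.0],
})

# Count distinct movie IDs per title; more than one means the title is ambiguous.
ids_per_title = toy_movies.groupby("title")["movieId"].nunique()
ambiguous = ids_per_title[ids_per_title > 1].index.tolist()
print(ambiguous)

# A user list with repeats can be reduced to unique users.
toy_users = pd.Series([46, 247, 247, 536], name="userId")
unique_users = toy_users.drop_duplicates().tolist()
print(unique_users)
```

If a title turns out to be ambiguous, selecting by `movieId` (or by title and year together) avoids mixing fans of different movies. Repeated user IDs on their own do not change the result of `isin`.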

After extracting the list of all users who gave Titanic a rating of 5, we need to see which other movies these users have also rated 5.

In [16]:
liked_movies = movie_ratings.loc[(movie_ratings['userId'].isin(titanic_user)) & (movie_ratings['rating'] == 5)
                                & (movie_ratings['title'] != 'Titanic')]
liked_movies

Unnamed: 0,movieId,title,year,genres,userId,rating,timestamp
95,1,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,290,5.0,1342626325
162,1,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,463,5.0,1307067849
179,1,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,526,5.0,970541258
501,5,Father of the Bride Part II,1995.0,Comedy,570,5.0,905444039
762,10,GoldenEye,1995.0,Action|Adventure|Thriller,145,5.0,945467546
917,11,"American President, The",1995.0,Comedy|Drama|Romance,160,5.0,913130528
931,11,"American President, The",1995.0,Comedy|Drama|Romance,247,5.0,994218976
935,11,"American President, The",1995.0,Comedy|Drama|Romance,302,5.0,945111256
972,11,"American President, The",1995.0,Comedy|Drama|Romance,608,5.0,988149788
1117,16,Casino,1995.0,Crime|Drama,491,5.0,1008829180


Now we can count the number of users who like each movie and rank the movies by that count:

In [17]:
liked_movies = (liked_movies.groupby('movieId')[['userId']].count().reset_index()
                .rename(columns={'userId': 'number_users'})
                .sort_values(by='number_users', ascending=False))

In [18]:
pd.merge(liked_movies, movies, on='movieId').head(10)

Unnamed: 0,movieId,number_users,title,year,genres
0,593,8,"Silence of the Lambs, The",1991.0,Crime|Horror|Thriller
1,2571,8,"Matrix, The",1999.0,Action|Sci-Fi|Thriller
2,356,8,Forrest Gump,1994.0,Comedy|Drama|Romance|War
3,527,8,Schindler's List,1993.0,Drama|War
4,318,8,"Shawshank Redemption, The",1994.0,Crime|Drama
5,589,7,Terminator 2: Judgment Day,1991.0,Action|Sci-Fi
6,110,7,Braveheart,1995.0,Action|Drama|War
7,2028,7,Saving Private Ryan,1998.0,Action|Drama|War
8,1784,7,As Good as It Gets,1997.0,Comedy|Drama|Romance
9,2959,6,Fight Club,1999.0,Action|Crime|Drama|Thriller
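
The steps above can be wrapped into one reusable function. This is a sketch under stated assumptions, not the exam's reference solution: the function name `also_liked`, its parameters, and the `toy` data are hypothetical, and ties are broken by `movieId` for a deterministic order.

```python
import pandas as pd

def also_liked(movie_ratings, title, top_n=10, like=5):
    """Rank other movies by how many fans of `title` also rated them `like`."""
    # Keep only "likes", i.e. rows with the top rating.
    liked = movie_ratings[movie_ratings["rating"] == like]
    # Users who like the given movie.
    fans = liked.loc[liked["title"] == title, "userId"].unique()
    # Other movies liked by those users.
    others = liked[liked["userId"].isin(fans) & (liked["title"] != title)]
    counts = (
        others.groupby(["movieId", "title"])["userId"]
        .nunique()
        .rename("number_users")
        .reset_index()
        .sort_values(["number_users", "movieId"], ascending=[False, True])
    )
    return counts.head(top_n).reset_index(drop=True)

# Hypothetical merged rows: (userId, movieId, title, rating).
rows = [
    (1, 10, "Titanic", 5.0), (1, 20, "Casino", 5.0), (1, 30, "Cube", 4.0),
    (2, 10, "Titanic", 5.0), (2, 20, "Casino", 5.0), (2, 30, "Cube", 5.0),
    (3, 10, "Titanic", 3.0), (3, 30, "Cube", 5.0),
    (4, 20, "Casino", 5.0),
]
toy = pd.DataFrame(rows, columns=["userId", "movieId", "title", "rating"])

result = also_liked(toy, "Titanic")
print(result)
```

Here users 1 and 2 are the only ones who rated Titanic 5, so Casino is counted twice and Cube once. Users 3 and 4 do not contribute.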


Congratulations! You just built your first recommender system, and it is worth 1 million dollars :D