# Movie recommendations
## Yoav Ram

In this session we will apply our Pandas and Scikit-learn skills to provide movie recommendations using K-nearest neighbors.

The data we use is the [MovieLens](http://movielens.org) dataset, available [here](https://grouplens.org/datasets/movielens/).

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
print('Pandas version:', pd.__version__)

import os
import urllib.request
import zipfile

Pandas version: 0.25.3


# Movies dataset

The main file contains the titles and genres of movies.

In [2]:
path = '../data/movies/'
movies = pd.read_csv(path + 'movies.csv', index_col='movieId')
movies.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


We split the pipe-separated genres string of each movie into a list.

In [3]:
movies['genres'] = movies['genres'].str.split('|')
movies.head()

movieId,title,genres
1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]"
2,Jumanji (1995),"[Adventure, Children, Fantasy]"
3,Grumpier Old Men (1995),"[Comedy, Romance]"
4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]"
5,Father of the Bride Part II (1995),[Comedy]
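
With the genres stored as lists, per-genre questions become easy to answer. The following is a minimal sketch on a three-movie toy frame; `toy_movies` is a hypothetical name that is not part of the notebook, and the sketch assumes pandas 0.25 or later, where `Series.explode` is available.

```python
import pandas as pd

# Toy stand-in for the movies table (three rows copied from the output above).
toy_movies = pd.DataFrame(
    {
        'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
        'genres': [
            'Adventure|Animation|Children|Comedy|Fantasy',
            'Adventure|Children|Fantasy',
            'Comedy|Romance',
        ],
    },
    index=pd.Index([1, 2, 3], name='movieId'),
)
toy_movies['genres'] = toy_movies['genres'].str.split('|')

# One row per (movie, genre) pair, then count the movies in each genre.
genre_counts = toy_movies['genres'].explode().value_counts()

# Boolean mask that selects the movies tagged as comedies.
is_comedy = toy_movies['genres'].apply(lambda genres: 'Comedy' in genres)
comedies = toy_movies.loc[is_comedy, 'title'].tolist()

print(genre_counts)
print(comedies)
```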


# Ratings dataset

The ratings dataset contains the movie ratings by users.

In [4]:
ratings = pd.read_csv(path + 'ratings.csv')
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


We convert the timestamp to a date using [`pandas.to_datetime`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html).

In [5]:
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,2000-07-30 18:45:03
1,1,3,4.0,2000-07-30 18:20:47
2,1,6,4.0,2000-07-30 18:37:04
3,1,47,5.0,2000-07-30 19:03:35
4,1,50,5.0,2000-07-30 18:48:51
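
The `unit='s'` argument tells pandas that the integers count seconds since the Unix epoch. A small sketch using two of the raw values shown above (the variable names are illustrative only):

```python
import pandas as pd

# Timestamps in ratings.csv are seconds since 1970-01-01 00:00:00 UTC.
raw = pd.Series([964982703, 964981247])
as_dates = pd.to_datetime(raw, unit='s')
print(as_dates)

# The .dt accessor exposes date parts, for example the year of each rating.
years = as_dates.dt.year
print(years.tolist())
```

Once the column holds datetimes, the same `.dt` accessor can be used to group or filter ratings by year or month.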


# Ratings summary

We summarize the ratings with mean, standard deviation, and count of ratings per movie.

For this, we use `groupby` and `agg`.

In [6]:
grp = ratings.groupby('movieId')
movies_stats = grp.agg({'rating': ['mean', 'std', 'size']}) # string names instead of NumPy functions, which newer pandas versions deprecate in agg
movies['rating_mean'] = movies_stats['rating']['mean'] # column names with two levels
movies['rating_std'] = movies_stats['rating']['std']
movies['rating_count'] = movies_stats['rating']['size']
movies.head()

movieId,title,genres,rating_mean,rating_std,rating_count
1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0
2,Jumanji (1995),"[Adventure, Children, Fantasy]",3.431818,0.881713,110.0
3,Grumpier Old Men (1995),"[Comedy, Romance]",3.259615,1.054823,52.0
4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",2.357143,0.852168,7.0
5,Father of the Bride Part II (1995),[Comedy],3.071429,0.907148,49.0
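
As an alternative to the two-level column names, pandas 0.25 and later support named aggregation, which produces flat column names directly. This is a sketch on made-up toy ratings (`toy_ratings` and `toy_stats` are hypothetical names), not a replacement for the cell above:

```python
import pandas as pd

# Made-up ratings: movie 1 has three ratings, movie 2 has two, movie 3 has one.
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 3],
    'movieId': [1, 1, 1, 2, 2, 3],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
})

# Named aggregation: each keyword becomes a column in the result.
toy_stats = toy_ratings.groupby('movieId')['rating'].agg(
    rating_mean='mean',
    rating_std='std',      # sample standard deviation (ddof=1)
    rating_count='count',  # number of non-missing ratings
)
print(toy_stats)
```

Note that the standard deviation of a movie with a single rating is `NaN`, because the sample standard deviation is undefined for one observation.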


# Recommend movies

First, we merge the movies and ratings data frames on the `movieId` column.

In [7]:
movies_full = pd.merge(movies, ratings, on='movieId')
movies_full.head()

,movieId,title,genres,rating_mean,rating_std,rating_count,userId,rating,timestamp
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,1,4.0,2000-07-30 18:45:03
1,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,5,4.0,1996-11-08 06:36:02
2,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,7,4.5,2005-01-25 06:52:26
3,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,15,2.5,2017-11-13 12:59:30
4,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,17,4.5,2011-05-18 05:28:03


We now have one row for every rating, with the full information on the movie in question.
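
`pd.merge` performs an inner join by default, so a movie without any ratings does not appear in the merged table. The sketch below illustrates this on toy frames (all names are hypothetical); as in the notebook, `movieId` is the index of the movies frame and a regular column of the ratings frame.

```python
import pandas as pd

toy_movies = pd.DataFrame(
    {'title': ['A', 'B', 'C']},
    index=pd.Index([1, 2, 3], name='movieId'),
)
# Movie 3 has no ratings at all.
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 1],
    'movieId': [1, 1, 2],
    'rating':  [4.0, 5.0, 3.0],
})

# Default inner join: one row per rating, unrated movies are dropped.
inner = pd.merge(toy_movies, toy_ratings, on='movieId')

# Left join: unrated movies are kept, with NaN in the rating columns.
left = pd.merge(toy_movies, toy_ratings, on='movieId', how='left')

print(len(inner), len(left))
```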

Next, **normalize each rating around the mean movie rating**, so that we know which users liked or disliked a movie more than the average user. 
To do that, subtract the mean rating of each movie from each user rating; remember that each row is a single rating of a single movie by a single user.

In [8]:
movies_full['rating'] -= movies_full['rating_mean']
movies_full.head()

,movieId,title,genres,rating_mean,rating_std,rating_count,userId,rating,timestamp
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,1,0.07907,2000-07-30 18:45:03
1,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,5,0.07907,1996-11-08 06:36:02
2,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,7,0.57907,2005-01-25 06:52:26
3,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,15,-1.42093,2017-11-13 12:59:30
4,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",3.92093,0.834859,215.0,17,0.57907,2011-05-18 05:28:03
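
The same centering can also be computed directly on a ratings table, without merging first, using `groupby(...).transform`. This is a sketch of that alternative on made-up data (`toy_ratings` and the `centered` column are hypothetical names):

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2],
    'movieId': [1, 1, 1, 2, 2],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.0],
})

# transform('mean') returns the movie mean aligned with every rating row.
movie_mean = toy_ratings.groupby('movieId')['rating'].transform('mean')
toy_ratings['centered'] = toy_ratings['rating'] - movie_mean
print(toy_ratings)

# After centering, the ratings of each movie sum to zero.
per_movie_sum = toy_ratings.groupby('movieId')['centered'].sum()
print(per_movie_sum)
```

A positive centered rating means that the user liked the movie more than its average rater did; a negative one means less.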


Now, **keep only movies rated at least 25 times**, dropping movies with fewer than 25 ratings.

In [9]:
highly_rated = movies_full['rating_count'] >= 25
movies_full = movies_full[highly_rated]

This is an important step: we __pivot__ the data frame to get a table where the element at row `i` and column `j` is the (mean-centered) rating of user `j` for movie `i`. Missing ratings are filled with zeros.

This table will be the input for our machine learning algorithm: each row is a sample (movie), each column is a feature (user rating).

In [10]:
movies_user_rating = movies_full.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)
movies_user_rating.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId
1,0.07907,0.0,0.0,0.0,0.07907,0.0,0.57907,0.0,0.0,0.0,...,0.07907,0.0,0.07907,-0.92093,0.07907,-1.42093,0.07907,-1.42093,-0.92093,1.07907
2,0.0,0.0,0.0,0.0,0.0,0.568182,0.0,0.568182,0.0,0.0,...,0.0,0.568182,0.0,1.568182,0.068182,0.0,0.0,-1.431818,0.0,0.0
3,0.740385,0.0,0.0,0.0,0.0,1.740385,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.259615,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,1.928571,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-0.071429,0.0,0.0,0.0,0.0,0.0,0.0
6,0.053922,0.0,0.0,0.0,0.0,0.053922,0.0,0.0,0.0,0.0,...,0.0,-0.946078,0.053922,-0.946078,0.0,0.0,0.0,0.0,0.0,1.053922
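
To see what `pivot` and `fillna(0)` do, here is a minimal sketch on a made-up long table with three movies and three users (all names are hypothetical):

```python
import pandas as pd

# Long format: one row per (movie, user) rating, already mean-centered.
toy = pd.DataFrame({
    'movieId': [1, 1, 2, 3],
    'userId':  [10, 20, 10, 30],
    'rating':  [0.5, -0.5, 1.0, -1.0],
})

# Wide format: rows are movies, columns are users, NaN where no rating exists.
wide = toy.pivot(index='movieId', columns='userId', values='rating')
n_missing = int(wide.isna().sum().sum())

# Replace the missing ratings with 0, which is the movie mean after centering.
filled = wide.fillna(0)
sparsity = (filled == 0).values.mean()

print(filled)
print(n_missing, sparsity)
```

One consequence of this design: after `fillna(0)`, a missing rating cannot be told apart from a rating that equals the movie mean exactly. Also note that `pivot` raises an error if the same (movie, user) pair appears more than once.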


Again, `movies_user_rating` is a matrix where each row is a movie and each column holds the ratings of one user. We can now train a KNN model on it.
But we are not trying to classify or regress; we just want to build a model that, given a movie (row), provides its neighbors (recommendations).

The relevant scikit-learn model is `NearestNeighbors`.

**Import it and train it** on the movies-user-rating table.

You should determine the hyper-parameters:
- `n_neighbors`: not too few, not too many.
- `metric` and `algorithm`: the standard Euclidean distance doesn't work well when most entries are zeros, because most users didn't rate most movies; not every algorithm works with every metric, so try `brute`, which supports many metrics.
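
To illustrate the problem with Euclidean distance on sparse rating vectors, the sketch below compares it with cosine distance, one of the metrics supported by `brute`. The vectors and helper functions are made up for illustration and are not part of the notebook.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean (L2) distance between two vectors."""
    return np.linalg.norm(u - v)

def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    return 1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Columns are six hypothetical users; 0 means "did not rate".
a = np.array([1.0, 0.0, 0.0, 0.0, -1.0, 0.0])
b = np.array([3.0, 0.0, 0.0, 0.0, -3.0, 0.0])  # same raters, same pattern, stronger opinions
c = np.array([0.0, 0.0, 0.1, 0.0, 0.0, 0.0])   # rated by a different user only

print(euclidean(a, b), euclidean(a, c))
print(cosine_distance(a, b), cosine_distance(a, c))
```

Euclidean distance places `c` closer to `a` than `b` is, even though `a` and `c` share no raters, simply because `c` is almost all zeros. Cosine distance ignores the magnitude of the vectors: it is zero between `a` and `b`, which point in the same direction, and one between `a` and `c`, which have no raters in common.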

The next cell imports the `NearestNeighbors` model and prints the valid metrics for the `brute` algorithm (there is an [open issue](https://github.com/scikit-learn/scikit-learn/issues/4521) for adding this list of metrics to the documentation).

In [11]:
from sklearn.neighbors import NearestNeighbors, VALID_METRICS
print(sorted(VALID_METRICS['brute']))

['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'cosine', 'dice', 'euclidean', 'hamming', 'haversine', 'jaccard', 'kulsinski', 'l1', 'l2', 'mahalanobis', 'manhattan', 'matching', 'minkowski', 'nan_euclidean', 'precomputed', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'wminkowski', 'yule']


In [12]:
knn = NearestNeighbors(metric='cosine', algorithm='brute', 
                       n_neighbors=20, n_jobs=-1)
knn.fit(movies_user_rating);

The `recommend` function takes a movie title and the number of recommendations `k`, and:
1. finds the `movieId` of the first movie whose title starts with the given string.
1. extracts the features (user ratings) for that movie. The features must be a 2D array, hence the `values.reshape(...)` call.
1. finds the `k+1` nearest neighbors and their distances, and discards the nearest neighbor, which is the movie itself (as it was part of the training set).
1. finds the `movieId` and title of each recommendation.
1. prints the results!

In [13]:
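def recommend(title, k=3):
    # first movie whose title starts with `title` (IndexError if there is no match)
    movieId = movies[movies['title'].str.startswith(title)].index[0]
    # features of that movie as a 2D array with a single row
    # (KeyError if the movie was dropped for having fewer than 25 ratings)
    user_rating = movies_user_rating.loc[movieId].values.reshape((1, -1))

    # ask for k+1 neighbors and drop the first one, which is the movie itself
    distances, indices = knn.kneighbors(user_rating, n_neighbors=k + 1)
    distances = distances.squeeze()[1:]
    indices = indices.squeeze()[1:]

    # translate row positions in the training table back to movie IDs and titles
    recommended_movieIds = movies_user_rating.index[indices]
    titles = movies.loc[recommended_movieIds, 'title'].values

    for i, d, t in zip(range(1, k + 1), distances, titles):
        print('{}. {} <{:.3f}>'.format(i, t, d))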
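
The neighbor lookup inside `recommend` can be checked on a toy matrix. The following sketch uses made-up data and hypothetical names (`toy`, `toy_knn`), and it assumes scikit-learn is installed. It shows that the query movie comes back as its own nearest neighbor and is then dropped.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Four made-up movies (rows) rated by four made-up users (columns).
toy = pd.DataFrame(
    [[1.0, -1.0, 0.0, 0.0],
     [0.9, -0.8, 0.0, 0.1],    # almost the same pattern as movie 10
     [0.0, 0.0, 1.0, -1.0],    # no raters in common with movie 10
     [-1.0, 1.0, 0.0, 0.0]],   # the opposite pattern of movie 10
    index=pd.Index([10, 20, 30, 40], name='movieId'),
)

toy_knn = NearestNeighbors(metric='cosine', algorithm='brute')
toy_knn.fit(toy.values)

# Query with movie 10; ask for k + 1 = 3 neighbors because the first is movie 10 itself.
query = toy.loc[10].values.reshape(1, -1)
distances, indices = toy_knn.kneighbors(query, n_neighbors=3)

# Map row positions back to movie IDs, dropping the query movie.
neighbor_ids = toy.index[indices.squeeze()[1:]].tolist()
neighbor_distances = distances.squeeze()[1:]
print(neighbor_ids, neighbor_distances)
```

Dropping the first neighbor relies on the query movie being at distance zero from itself. If two movies had identical rating vectors, the movie itself might not come first, so a more careful version could filter by `movieId` instead.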

Let's try `recommend` with some favorite movies:

In [14]:
recommend('Toy Story')

1. Toy Story 2 (1999) <0.681>
2. Finding Nemo (2003) <0.682>
3. Aladdin (1992) <0.683>


In [15]:
recommend('Toy Story 2')

1. Toy Story (1995) <0.681>
2. Jaws (1975) <0.732>
3. RoboCop (1987) <0.738>


In [16]:
recommend('Iron Man', 10)

1. Star Trek (2009) <0.550>
2. Iron Man 2 (2010) <0.583>
3. District 9 (2009) <0.616>
4. Ratatouille (2007) <0.622>
5. Grindhouse (2007) <0.627>
6. Dark Knight, The (2008) <0.679>
7. Avengers, The (2012) <0.681>
8. Zootopia (2016) <0.690>
9. Batman Begins (2005) <0.692>
10. WALL·E (2008) <0.699>


In [17]:
recommend('Iron Man 2', 10)

1. Iron Man (2008) <0.583>
2. Harry Potter and the Half-Blood Prince (2009) <0.628>
3. Thor (2011) <0.634>
4. Toy Story 3 (2010) <0.646>
5. Star Trek Into Darkness (2013) <0.656>
6. X-Men Origins: Wolverine (2009) <0.666>
7. Superman Returns (2006) <0.679>
8. Iron Man 3 (2013) <0.683>
9. Captain America: The First Avenger (2011) <0.684>
10. The Hunger Games (2012) <0.692>


In [18]:
recommend('Iron Man 3', 10)

1. Avengers: Age of Ultron (2015) <0.372>
2. Avengers, The (2012) <0.491>
3. Amazing Spider-Man, The (2012) <0.499>
4. Brave (2012) <0.533>
5. Dodgeball: A True Underdog Story (2004) <0.534>
6. Wolf of Wall Street, The (2013) <0.562>
7. Django Unchained (2012) <0.582>
8. Captain America: The First Avenger (2011) <0.597>
9. Interstellar (2014) <0.620>
10. Rise of the Planet of the Apes (2011) <0.634>


**Now try it with your own favorite movie!**

In [19]:
recommend('Godfather', 10)

1. Godfather: Part II, The (1974) <0.486>
2. Super Size Me (2004) <0.737>
3. Fight Club (1999) <0.754>
4. Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966) <0.755>
5. Rocky (1976) <0.763>
6. Star Wars: Episode V - The Empire Strikes Back (1980) <0.777>
7. Saving Private Ryan (1998) <0.782>
8. Slumdog Millionaire (2008) <0.785>
9. Schindler's List (1993) <0.795>
10. Rear Window (1954) <0.797>


# Colophon
This notebook was written by [Yoav Ram](http://python.yoavram.com).

The notebook was written using [Python](http://python.org/) 3.7.
Dependencies listed in [environment.yml](../environment.yml).

This work is licensed under a CC BY-NC-SA 4.0 International License.

![Python logo](https://www.python.org/static/community_logos/python-logo.png)