We are building a Movie Recommendation System with user ratings. 

There will be two parts in this project:
- Part 1: the user gives a favorite movie title, and we recommend other movies they might like
- Part 2: given a user id, we recommend movies based on that user's past ratings

Please feel free to try it at the end of this notebook!

In [1]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# import warnings
# warnings.simplefilter(action = 'ignore', category = FutureWarning)

## Data Collection and Pre-Processing

In [2]:
# loading the dataset
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
movies = pd.read_csv('movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
# number of ratings
n_ratings = len(ratings)
n_ratings

100836

In [5]:
# number of unique movies in the ratings
n_movies = len(ratings['movieId'].unique())
n_movies

9724

In [6]:
# number of unique movies in the movies table
num_movies = len(movies['movieId'].unique())
num_movies

9742

In [7]:
# number of users in the ratings
n_users = len(ratings['userId'].unique())
n_users

610

## Exploratory Data Analysis

In [8]:
# average number of ratings per user
round(n_ratings/n_users, 2)

165.3

In [9]:
# average number of ratings per movie
round(n_ratings/n_movies, 2)

10.37

In [10]:
# calculate user rating frequency
user_freq = ratings[['userId', 'movieId']].groupby('userId').count().reset_index()
user_freq.columns = ['userId', 'n_ratings']
user_freq.head()

Unnamed: 0,userId,n_ratings
0,1,232
1,2,29
2,3,39
3,4,216
4,5,44


In [11]:
# mean rating per movie, used below to find the lowest- and highest-rated movies
mean_rating = ratings.groupby('movieId')[['rating']].mean()

mean_rating.head()

movieId,rating
1,3.92093
2,3.431818
3,3.259615
4,2.357143
5,3.071429


In [12]:
# lowest rated movies
lowest_rated = mean_rating['rating'].idxmin()
movies.loc[movies['movieId'] == lowest_rated]

Unnamed: 0,movieId,title,genres
2689,3604,Gypsy (1962),Musical


In [13]:
# highest rated movies
highest_rated = mean_rating['rating'].idxmax()
movies.loc[movies['movieId'] == highest_rated]

Unnamed: 0,movieId,title,genres
48,53,Lamerica (1994),Adventure|Drama


In [14]:
# show the users who rated the lowest rated movie
ratings[ratings['movieId'] == lowest_rated]

Unnamed: 0,userId,movieId,rating,timestamp
13633,89,3604,0.5,1520408880


In [15]:
# show the users who rated the highest rated movie
ratings[ratings['movieId'] == highest_rated]

Unnamed: 0,userId,movieId,rating,timestamp
13368,85,53,5.0,889468268
96115,603,53,5.0,963180003


In [17]:
# The movies above have very few ratings, so their mean ratings are unreliable.
# A Bayesian average is more robust; start by collecting the count and mean per movie.
movie_stats = ratings.groupby('movieId')[['rating']].agg(['count', 'mean'])
movie_stats.columns = movie_stats.columns.droplevel()
movie_stats.head()

movieId,count,mean
1,215,3.92093
2,110,3.431818
3,52,3.259615
4,7,2.357143
5,49,3.071429
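
The notebook mentions a Bayesian average but stops at the per-movie count and mean. Below is a minimal sketch of one common formulation, (C * m + sum of ratings) / (C + n), assuming C is the average number of ratings per movie and m is the average of the per-movie means. The toy data and the names `toy_ratings`, `toy_stats` and `bayesian_avg` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy ratings (illustrative values, not the MovieLens file)
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 1, 2, 3, 3],
    "rating":  [4.0, 4.5, 3.5, 4.0, 5.0, 0.5, 3.0],
})

toy_stats = toy_ratings.groupby("movieId")["rating"].agg(["count", "mean"])

C = toy_stats["count"].mean()  # prior weight: average number of ratings per movie
m = toy_stats["mean"].mean()   # prior mean: average of the per-movie mean ratings

# count * mean recovers the sum of ratings for each movie
toy_stats["bayesian_avg"] = (
    (C * m + toy_stats["count"] * toy_stats["mean"]) / (C + toy_stats["count"])
)
print(toy_stats)
```

Movies with only one or two ratings are pulled toward the overall mean, so a single 5.0 or 0.5 no longer decides the ranking on its own.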


## Feature Engineering

In [18]:
# Now let's create the user-item matrix using a SciPy CSR matrix
from scipy.sparse import csr_matrix 

Compressed Sparse Row matrix:

*Advantages of the CSR format*: 
- efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
- efficient row slicing
- fast matrix vector products

*Disadvantages of the CSR format*: 
- slow column slicing operations (consider CSC)
- changes to the sparsity structure are expensive (consider LIL or DOK)
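
To make the format concrete, here is a small sketch that builds a 3 x 4 CSR matrix from (value, (row, column)) triples and inspects its internal arrays. The values and the names `toy`, `row0` and `density` are made up for illustration.

```python
from scipy.sparse import csr_matrix

# 3 movies x 4 users; unrated cells are implicit zeros and take no storage
rows = [0, 0, 1, 2, 2]            # movie indices
cols = [0, 2, 1, 0, 3]            # user indices
vals = [4.0, 5.0, 3.0, 2.5, 4.5]  # ratings
toy = csr_matrix((vals, (rows, cols)), shape=(3, 4))

print(toy.toarray())  # dense view, for inspection only
print(toy.data)       # stored values, row by row
print(toy.indices)    # column index of each stored value
print(toy.indptr)     # row i occupies data[indptr[i]:indptr[i + 1]]

row0 = toy[0]         # row slicing is cheap in CSR
density = toy.nnz / (toy.shape[0] * toy.shape[1])
print(density)
```

Row slicing only needs two consecutive entries of `indptr`, which is why selecting one movie's rating vector from a movie-by-user matrix is fast in this format.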

In [19]:
def create_matrix(df):
    """Build a sparse movie-by-user rating matrix plus id <-> index lookup dicts."""
    user_ids = np.unique(df['userId'])
    movie_ids = np.unique(df['movieId'])
    N = len(user_ids)
    M = len(movie_ids)

    # map Ids to indices
    user_mapper = dict(zip(user_ids, range(N)))
    movie_mapper = dict(zip(movie_ids, range(M)))

    # map indices to Ids
    user_inv_mapper = dict(zip(range(N), user_ids))
    movie_inv_mapper = dict(zip(range(M), movie_ids))

    user_index = [user_mapper[i] for i in df['userId']]
    movie_index = [movie_mapper[i] for i in df['movieId']]

    # rows are movies, columns are users
    X = csr_matrix((df["rating"], (movie_index, user_index)), shape=(M, N))

    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper

In [20]:
X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_matrix(ratings)

In [26]:
print(X[:1])

  (0, 0)	4.0
  (0, 4)	4.0
  (0, 6)	4.5
  (0, 14)	2.5
  (0, 16)	4.5
  (0, 17)	3.5
  (0, 18)	4.0
  (0, 20)	3.5
  (0, 26)	3.0
  (0, 30)	5.0
  (0, 31)	3.0
  (0, 32)	3.0
  (0, 39)	5.0
  (0, 42)	5.0
  (0, 43)	3.0
  (0, 44)	4.0
  (0, 45)	5.0
  (0, 49)	3.0
  (0, 53)	3.0
  (0, 56)	5.0
  (0, 62)	5.0
  (0, 63)	4.0
  (0, 65)	4.0
  (0, 67)	2.5
  (0, 70)	5.0
  :	:
  (0, 559)	3.0
  (0, 560)	4.0
  (0, 561)	4.5
  (0, 566)	3.5
  (0, 569)	4.0
  (0, 571)	4.0
  (0, 572)	5.0
  (0, 578)	4.0
  (0, 579)	3.0
  (0, 583)	5.0
  (0, 586)	5.0
  (0, 589)	4.0
  (0, 595)	4.0
  (0, 596)	4.0
  (0, 598)	3.0
  (0, 599)	2.5
  (0, 600)	4.0
  (0, 602)	4.0
  (0, 603)	3.0
  (0, 604)	4.0
  (0, 605)	2.5
  (0, 606)	4.0
  (0, 607)	2.5
  (0, 608)	3.0
  (0, 609)	5.0


## Model Training

In [27]:
from sklearn.neighbors import NearestNeighbors

In [28]:
# find similar movies using KNN
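def find_similar_movies(movie_id, X, k, metric='cosine', show_distance=False):
    """Return the ids of the k movies nearest to movie_id (rows of X are movies).

    Relies on the global movie_mapper / movie_inv_mapper built by create_matrix.
    If show_distance is True, return (movie_id, distance) pairs instead.
    """
    movie_ind = movie_mapper[movie_id]
    movie_vec = X[movie_ind].reshape(1, -1)

    # ask for k + 1 neighbours because the nearest one is normally the movie itself
    kNN = NearestNeighbors(n_neighbors=k + 1, algorithm="brute", metric=metric)
    kNN.fit(X)
    distances, indices = kNN.kneighbors(movie_vec, return_distance=True)

    neighbour_ids = [movie_inv_mapper[n] for n in indices[0]]
    if show_distance:
        return list(zip(neighbour_ids, distances[0]))[1:]
    return neighbour_ids[1:]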
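
The `cosine` metric used above is the cosine distance, i.e. 1 minus the cosine similarity of two rating vectors. A minimal NumPy sketch of that quantity follows; the helper `cosine_distance` and the three toy movie vectors are illustrative and not part of the notebook.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity of two 1-D rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Each vector holds one movie's ratings across four users (0 means "not rated")
movie_a = [5.0, 4.0, 0.0, 0.0]
movie_b = [4.0, 5.0, 0.0, 0.0]  # rated by the same users as movie_a
movie_c = [0.0, 0.0, 5.0, 4.0]  # rated by a disjoint set of users

d_ab = cosine_distance(movie_a, movie_b)
d_ac = cosine_distance(movie_a, movie_c)
print(d_ab, d_ac)
```

Movies rated by overlapping users end up close together, while movies with no raters in common sit at distance 1, which is why the neighbours of a movie tend to share its audience.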

In [30]:
movie_titles = dict(zip(movies['movieId'], movies['title']))
# print the first 3 movie titles
str({key: movie_titles[key] for key in list(movie_titles)[:3]})

"{1: 'Toy Story (1995)', 2: 'Jumanji (1995)', 3: 'Grumpier Old Men (1995)'}"

## Part 1: User Input --> Find potential favorite movies 

In [31]:
# getting the movie name from the user
movie_name = input(' Enter your favorite movie name: ')

 Enter your favorite movie name: tangled


In [32]:
movie_titles_ori = movies['title'].tolist()
movie_titles_no_year = [title.split('(')[0] for title in movie_titles_ori]

In [33]:
movie_titles_no_year[0]

'Toy Story '

In [34]:
import difflib
close_matches = difflib.get_close_matches(movie_name, movie_titles_no_year)
closest_match = close_matches[0]
closest_match

'Tangled '

In [35]:
close_matches

['Tangled ', 'Strangeland ', 'Triangle ']
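
`difflib.get_close_matches` returns at most `n` candidates (3 by default) whose similarity ratio is at least `cutoff` (0.6 by default), and an empty list when nothing qualifies, in which case `close_matches[0]` raises an `IndexError`. The sketch below wraps the call in a hypothetical helper, `closest_title`, that lowercases both sides and returns `None` when there is no match; the helper and the short title list are assumptions for illustration.

```python
import difflib

titles = ["Tangled ", "Strangeland ", "Triangle ", "Toy Story "]

def closest_title(query, candidates, cutoff=0.6):
    """Hypothetical helper: case-insensitive lookup, None if nothing is close enough."""
    # map the normalised title back to the original spelling
    lowered = {c.strip().lower(): c for c in candidates}
    matches = difflib.get_close_matches(
        query.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None

best = closest_title("TANGLED", titles)
none_found = closest_title("zzzzqqqq", titles)
print(best, none_found)
```

Checking for `None` before looking up the movie id avoids a crash on input that matches no title.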

In [36]:
# a substring filter on the title doesn't work here because it can match more than one movie
movie_id = movies[movies['title'].str.contains(closest_match)]['movieId']
movie_id

7467     81847
8447    112006
Name: movieId, dtype: int64

In [37]:
# instead, take the position of the closest match in the title list and read the movieId at that row
movie_id = movies.iloc[movie_titles_no_year.index(closest_match)]['movieId']
movie_id

81847

In [38]:
movie_titles_no_year.index(closest_match)

7467

In [39]:
similar_ids = find_similar_movies(movie_id, X, k=10)
movie_title = movie_titles[movie_id]
  
print(f"Since you watched {movie_title}\n")
print(f"The movies that you might like are: \n")
for i in similar_ids:
    print(movie_titles[i])

Since you watched Tangled (2010)

The movies that you might like are: 

Princess and the Frog, The (2009)
Brave (2012)
Frozen (2013)
Bolt (2008)
Ugly Truth, The (2009)
Legally Blonde 2: Red, White & Blonde (2003)
Despicable Me (2010)
Megamind (2010)
Ratatouille (2007)
Enchanted (2007)


In [40]:
similar_ids

[72737, 95167, 106696, 63859, 70183, 6535, 79091, 81564, 50872, 56152]

## Part 2: Given a user id, automatically find similar movies to recommend

In [42]:
# getting the user id
user_id = input(' Enter the user id: ')

 Enter the user id: 1


In [43]:
# find the ratings from the user
user_rating = ratings[ratings['userId'] == int(user_id)]
user_rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [44]:
# find out the favorite movies of the user
sorted_user_ratings = user_rating.sort_values(by = ['rating'], ascending = False)
sorted_user_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
231,1,5060,5.0,964984002
185,1,2872,5.0,964981680
89,1,1291,5.0,964981909
90,1,1298,5.0,964984086
190,1,2948,5.0,964982191


In [45]:
# find the movies with the highest ratings

# first, find the highest rating from the user
highest_user_rating = sorted_user_ratings[:1].rating.tolist()[0]
highest_user_rating

5.0

In [46]:
# check highest rated movies
highest_rating_movies = sorted_user_ratings[sorted_user_ratings['rating'] == highest_user_rating].movieId.tolist()
highest_rating_movies[:5]

[5060, 2872, 1291, 1298, 2948]

In [47]:
len(highest_rating_movies)

124

In [48]:
# check the first 5 of this user's top-rated movies
for movie_id in highest_rating_movies[:5]:
    print(movie_titles[movie_id])

M*A*S*H (a.k.a. MASH) (1970)
Excalibur (1981)
Indiana Jones and the Last Crusade (1989)
Pink Floyd: The Wall (1982)
From Russia with Love (1963)


In [49]:
# then we can use the same function as above to build a list of similar movies
# note that when several movies share the highest rating, we collect the neighbours of each one
# and keep the movies that appear most often across those neighbour lists
if len(highest_rating_movies) == 1:
    movie_id = highest_rating_movies[0]
    similar_ids = find_similar_movies(movie_id, X, k = 10)
    movie_title = movie_titles[movie_id]

    print(f"Since you watched {movie_title}\n")
    print(f"The movies that you might like are: \n")
    for i in similar_ids:
        print(movie_titles[i])
else:
    i = 1
    similar_movies_all = []
    for movie_id in highest_rating_movies:
        if i < 6:
            print(f"One of the user's favorite movies is {movie_titles[movie_id]}\n")
            i += 1
        similar_ids = find_similar_movies(movie_id, X, k = 10)
        similar_movies_all = similar_movies_all + similar_ids
    final_movie_ids = [ele for ele, _ in Counter(similar_movies_all).most_common(10)]
    # to be more precise, we would also need to exclude movies the user has already rated
    print(f"The movies that you might like are: \n")
    for i in final_movie_ids:
        print(movie_titles[i])

One of the user's favorite movies is M*A*S*H (a.k.a. MASH) (1970)

One of the user's favorite movies is Excalibur (1981)

One of the user's favorite movies is Indiana Jones and the Last Crusade (1989)

One of the user's favorite movies is Pink Floyd: The Wall (1982)

One of the user's favorite movies is From Russia with Love (1963)

The movies that you might like are: 

Indiana Jones and the Temple of Doom (1984)
Reservoir Dogs (1992)
Terminator, The (1984)
Star Wars: Episode VI - Return of the Jedi (1983)
Fight Club (1999)
RoboCop (1987)
Star Wars: Episode V - The Empire Strikes Back (1980)
Newton Boys, The (1998)
Who Framed Roger Rabbit? (1988)
Ferris Bueller's Day Off (1986)
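
The comment in the cell above notes that movies the user has already rated are not excluded. One possible way to do that, sketched with a hypothetical helper `top_unseen` and made-up ids, is to drop the seen movies before counting:

```python
from collections import Counter

def top_unseen(candidate_ids, seen_ids, n=10):
    """Hypothetical helper: rank candidates by frequency, skipping already-seen movies."""
    seen = set(seen_ids)
    counts = Counter(mid for mid in candidate_ids if mid not in seen)
    return [mid for mid, _ in counts.most_common(n)]

# Illustrative ids only
similar_movies_all_demo = [10, 20, 10, 30, 20, 10, 40]
already_rated_demo = [20]

result = top_unseen(similar_movies_all_demo, already_rated_demo, n=2)
print(result)
```

In the notebook, `candidate_ids` would correspond to `similar_movies_all` and `seen_ids` to `user_rating['movieId']`.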


In [50]:
# pick a user id deliberately: look at how many ratings each user has
user_freq.head()

Unnamed: 0,userId,n_ratings
0,1,232
1,2,29
2,3,39
3,4,216
4,5,44


In [51]:
user_freq.sort_values(by = 'n_ratings').head()

Unnamed: 0,userId,n_ratings
441,442,20
405,406,20
146,147,20
193,194,20
568,569,20


## Sum up: Movie Recommendation System

### Part 1: user input

In [52]:
# getting the movie name from the user
movie_name = input(' Enter your favorite movie name: ')

close_matches = difflib.get_close_matches(movie_name, movie_titles_no_year)
closest_match = close_matches[0]

movie_id = movies.iloc[movie_titles_no_year.index(closest_match)]['movieId']

similar_ids = find_similar_movies(movie_id, X, k=10)
movie_title = movie_titles[movie_id]
  
print(f"Since you watched {movie_title}\n")
print(f"The movies that you might like are: \n")
for i in similar_ids:
    print(movie_titles[i])

 Enter your favorite movie name: The Shawshank Redemption
Since you watched Shawshank Redemption, The (1994)

The movies that you might like are: 

Forrest Gump (1994)
Pulp Fiction (1994)
Silence of the Lambs, The (1991)
Usual Suspects, The (1995)
Schindler's List (1993)
Fight Club (1999)
Braveheart (1995)
Matrix, The (1999)
Apollo 13 (1995)
Seven (a.k.a. Se7en) (1995)


### Part 2: given a user id (aka past behavior), recommend similar movies

In [53]:
# getting the user id
user_id = input(' Enter the user id: ')

user_rating = ratings[ratings['userId'] == int(user_id)]
sorted_user_ratings = user_rating.sort_values(by = ['rating'], ascending = False)
highest_user_rating = sorted_user_ratings[:1].rating.tolist()[0]
highest_rating_movies = sorted_user_ratings[sorted_user_ratings['rating'] == highest_user_rating].movieId.tolist()

if len(highest_rating_movies) == 1:
    movie_id = highest_rating_movies[0]
    similar_ids = find_similar_movies(movie_id, X, k = 10)
    movie_title = movie_titles[movie_id]

    print(f"Since the user likes {movie_title}\n")
    print(f"The movies that the user might like are: \n")
    for i in similar_ids:
        print(movie_titles[i])
else:
    i = 1
    similar_movies_all = []
    for movie_id in highest_rating_movies:
        if i < 6:
            print(f"One of the user's favorite movies is {movie_titles[movie_id]}\n")
            i += 1
        similar_ids = find_similar_movies(movie_id, X, k = 10)
        similar_movies_all = similar_movies_all + similar_ids
    final_movie_ids = [ele for ele, _ in Counter(similar_movies_all).most_common(10)]
    # to be more precise, we would also need to exclude movies the user has already rated
    print(f"The movies that you might like are: \n")
    for i in final_movie_ids:
        print(movie_titles[i])

 Enter the user id: 406
One of the user's favorite movies is 27 Dresses (2008)

One of the user's favorite movies is Sisterhood of the Traveling Pants, The (2005)

One of the user's favorite movies is Sweet Home Alabama (2002)

One of the user's favorite movies is Ever After: A Cinderella Story (1998)

The movies that you might like are: 

Sweet Home Alabama (2002)
Penelope (2006)
Maid in Manhattan (2002)
Wedding Planner, The (2001)
Legally Blonde 2: Red, White & Blonde (2003)
Holiday, The (2006)
13 Going on 30 (2004)
Ugly Truth, The (2009)
Legally Blonde (2001)
Bewitched (2005)
