# Movie Recommendation Project

The aim of this project is to build an interactive movie recommendation system. By the end, we should be able to type a movie name and get recommendations for other movies we might like.

I will be using the MovieLens 25M dataset, available at https://files.grouplens.org/datasets/movielens/ml-25m.zip. We will first build a search engine that finds a specific movie title in our data, and then a recommendation engine that suggests movies based on that title.

## Reading in our Movie Data with Pandas
We will read our data directly from the website, pulling only movies.csv and ratings.csv from the zip file.

In [1]:
# Load movies.csv and ratings.csv from the online zip archive
import pandas as pd
import numpy as np
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
url = urlopen("https://files.grouplens.org/datasets/movielens/ml-25m.zip")

#Download Zipfile and create pandas DataFrame
zipfile = ZipFile(BytesIO(url.read()))
movies = pd.read_csv(zipfile.open("ml-25m/movies.csv"))
ratings = pd.read_csv(zipfile.open("ml-25m/ratings.csv"))
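
Because the archive is large, downloading it again on every run can be slow. Below is a minimal sketch of one way to cache it locally, assuming we are allowed to write to the working directory. `fetch_once` and the local file name are hypothetical names introduced here for illustration; they are not part of the notebook above.

```python
import os
from urllib.request import urlretrieve

def fetch_once(url, path):
    """Download url to path unless a file already exists there; return path."""
    if not os.path.exists(path):
        urlretrieve(url, path)  # standard-library download straight to disk
    return path

# Hypothetical usage (commented out so nothing is downloaded by accident):
# zip_path = fetch_once("https://files.grouplens.org/datasets/movielens/ml-25m.zip", "ml-25m.zip")
# zipfile = ZipFile(zip_path)  # ZipFile also accepts a file path
```

With this in place, only the first run would touch the network; later runs would open the local copy.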

In [2]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


## Cleaning our Movie Titles

We will use Python's `re` module (regular expressions) to clean the movie titles. Extra characters such as parentheses would make searching harder, so we will write a function that removes every character from a title that is not a letter, a digit, or a space.

In [3]:
import re

def clean_title(title):
    # Look for characters that are not letters, digits or spaces and remove them
    return re.sub(r"[^a-zA-Z0-9 ]", "", title)

In [4]:
# Create a new column as clean_title and use the apply method to use our clean_title function
movies["clean_title"] = movies["title"].apply(clean_title)

In [5]:
movies.head()

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
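
To see what the pattern keeps and what it drops, here is a small self-contained check. It redefines the same `clean_title` so that it runs on its own, and the sample titles are only illustrations. Note that accented letters fall outside `a-zA-Z` and are removed as well, which may or may not be what we want for search.

```python
import re

def clean_title(title):
    # Same pattern as above: keep ASCII letters, digits and spaces only
    return re.sub("[^a-zA-Z0-9 ]", "", title)

print(clean_title("Toy Story (1995)"))       # Toy Story 1995
print(clean_title("Ant-Man (2015)"))         # AntMan 2015
print(clean_title("Am\u00e9lie (2001)"))     # Amlie 2001 (the accented letter is dropped)
```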


## Build a Term Frequency Matrix

We will turn our titles into numbers using `TfidfVectorizer` from sklearn. With the `ngram_range=(1,2)` parameter, the vectorizer looks at single words as well as pairs of consecutive words, so a search for "Toy Story" can match the phrase and not just the individual words.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Turn our titles to numbers
vectorizer = TfidfVectorizer(ngram_range=(1,2))

# Use the vectorizer to turn our set of titles to matrix
tfidf = vectorizer.fit_transform(movies["clean_title"])
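
To make `ngram_range=(1,2)` concrete, the sketch below lists the terms a title would contribute: every single word plus every pair of consecutive words. It is a plain-Python approximation that assumes the vectorizer's default settings (lowercasing, and tokens of at least two word characters). `unigrams_and_bigrams` is a hypothetical helper written for this illustration; the real vectorizer additionally weights each term by TF-IDF.

```python
import re

def unigrams_and_bigrams(text):
    # Approximate the default tokenizer: lowercase, words of 2+ characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Pair each token with the one that follows it
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(unigrams_and_bigrams("Toy Story 1995"))
# ['toy', 'story', '1995', 'toy story', 'story 1995']
print(unigrams_and_bigrams("Iron Man 2 2010"))
# ['iron', 'man', '2010', 'iron man', 'man 2010']
```

Under these assumptions the single-character token "2" is dropped, so sequels that differ only by a one-digit number may be harder to tell apart by title alone.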

## Create our Search Function

We will compute the cosine similarity between the search term that we enter and every title in our data, and return the titles that are most similar.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Write a function to search for the title
def search(title):
    title = clean_title(title)
    query_vec = vectorizer.transform([title]) #Use the vectorizer to turn the search term we enter into numbers
    similarity = cosine_similarity(query_vec, tfidf).flatten() #find similarity between our search term and all of the titles in our data
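    indices = np.argpartition(similarity, -5)[-5:] #find the 5 most similar titles to our search term (argpartition returns them in no particular order)
    indices = indices[np.argsort(similarity[indices])[::-1]] #sort those 5 indices from most similar to least similar
    results = movies.iloc[indices] #index our movies data by these indices to get the titles, starting with the most similar result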
    
    return results
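
Cosine similarity measures the angle between two vectors: 1 means they point the same way, and 0 means they share no terms. Below is a minimal pure-Python sketch of the calculation, using made-up term weights of 1 instead of real TF-IDF values. `cosine` is a hypothetical helper for illustration, not the scikit-learn function used above.

```python
import math

def cosine(a, b):
    # a and b map terms to weights; terms missing from a vector count as 0
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

query = {"toy": 1.0, "story": 1.0}
toy_story = {"toy": 1.0, "story": 1.0, "1995": 1.0}
jumanji = {"jumanji": 1.0, "1995": 1.0}

print(round(cosine(query, toy_story), 3))  # 0.816
print(round(cosine(query, jumanji), 3))    # 0.0
```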

In [8]:
# pip install ipywidgets
#jupyter labextension install @jupyter-widgets/jupyterlab-manager

## Build our Interactive Search Box with Jupyter

In [9]:
import ipywidgets as widgets
from IPython.display import display

#Create our input widget
movie_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)

#Create an output widget
movie_list = widgets.Output()

#Define a function to be called whenever anything is typed in the widget box
def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            display(search(title))

movie_input.observe(on_type, names='value')

#Display both the input box and the output area
display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()

## Reading in Movie Ratings Data

We will now find movies that are similar to a movie we liked, so that we can search for a title and get recommendations. For this we will use our ratings dataframe.

In [10]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [11]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

What we want to do here is find all of the users who also liked the movie that we type into our search. The other movies those users liked will serve as recommendations for us.

## Finding the users who liked the same movie

In [12]:
#hard coding our movie id
movie_id = 89745

#def find_similar_movies(movie_id):
movie = movies[movies["movieId"] == movie_id]

In [13]:
# Find similar users (their unique userid) who liked the movie we searched and gave a rating more than 4
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
similar_users

array([    21,    187,    208, ..., 162469, 162485, 162532])

In [14]:
# Find the other movies that our similar users liked and rated greater than 4 and return the movieId
similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
similar_user_recs

3741           318
3742           527
3743           541
3744           589
3745           741
             ...  
24998517     91542
24998518     92259
24998522     98809
24998523    102125
24998524    112852
Name: movieId, Length: 577796, dtype: int64

We are returning the movieId above because that is the column that joins the two datasets. Now we will keep only the movies that more than 10% of the users similar to us liked.

In [15]:
# Convert into a fraction by dividing the number of similar users who liked each movie by the total number of similar users
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

# Take only the ones that are greater than 10%
similar_user_recs = similar_user_recs[similar_user_recs > .10]

In [16]:
similar_user_recs

89745    1.000000
58559    0.573393
59315    0.530649
79132    0.519715
2571     0.496687
           ...   
47610    0.103545
780      0.103380
88744    0.103048
1258     0.101226
1193     0.100895
Name: movieId, Length: 193, dtype: float64
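
The same counting can be followed by hand on a tiny made-up ratings list. This is a pure-Python sketch: the users, the movie labels and the 0.5 cut-off are all invented so that the filter has a visible effect (the notebook itself uses pandas and a 10% cut-off).

```python
from collections import Counter

# (userId, movieId, rating) rows, invented for illustration
ratings_toy = [
    (1, "A", 5.0), (1, "B", 4.5), (1, "C", 5.0),
    (2, "A", 4.5), (2, "B", 5.0),
    (3, "A", 5.0), (3, "D", 3.0),
    (4, "B", 5.0), (4, "D", 5.0),
]
target = "A"

# Users who rated the target movie above 4
similar_users = {u for u, m, r in ratings_toy if m == target and r > 4}

# How many of those users rated each movie above 4
liked = Counter(m for u, m, r in ratings_toy if u in similar_users and r > 4)

# Turn counts into a share of the similar users, then apply the cut-off
share = {m: count / len(similar_users) for m, count in liked.items()}
kept = {m: s for m, s in share.items() if s > 0.5}

print(sorted(similar_users))  # [1, 2, 3]
print(sorted(kept))           # ['A', 'B']
```

The target movie always ends up with a share of 1.0, which is why movie 89745 tops the list above.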

Now we have a set of 193 movies that more than 10% of the users similar to us liked. Next, we would like to find which of these movies are specific to our taste. In a nutshell, we don't want to see every movie that users similar to us liked, because some of those movies are simply popular with everyone. We want the movies that are liked much more by users similar to us than by users in general.

## Finding how much all users like movies

In [17]:
# Find every rating above 4 that any user in our dataset gave to one of these movies
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]

In [18]:
all_users

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
29,1,4973,4.5,1147869080
48,1,7361,5.0,1147880055
72,2,110,5.0,1141416589
76,2,260,5.0,1141417172
...,...,...,...,...
25000065,162541,5952,5.0,1240952617
25000078,162541,7153,5.0,1240952613
25000081,162541,7361,4.5,1240953484
25000086,162541,31658,4.5,1240953287


The result above contains every rating above 4 that any user in the dataset gave to one of the movies recommended to us. Now we will find what fraction of these users liked each of the movies.

In [19]:
# Find the fraction of the users in all_users (those who rated at least one of these movies above 4) that liked each movie
all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

In [20]:
all_user_recs

318       0.346395
296       0.288146
2571      0.247010
356       0.238136
593       0.228665
            ...   
86332     0.010142
91630     0.009324
122900    0.008573
122926    0.008070
106072    0.005289
Name: movieId, Length: 193, dtype: float64

## Creating a Recommendation Score
We will now put the two sets of percentages side by side so that we can compare them.

In [21]:
rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
rec_percentages.columns = ["similar", "all"]

In [22]:
rec_percentages

Unnamed: 0,similar,all
1,0.236083,0.126250
32,0.103877,0.101516
47,0.203115,0.146232
50,0.211067,0.202959
110,0.182240,0.162835
...,...,...
134853,0.198641,0.036444
152081,0.133532,0.020652
164179,0.128728,0.029124
166528,0.124751,0.014411


The table above lists each of the movies recommended to us, along with how much users similar to us liked it and how much the average user liked it. We want movies with a big difference between these two numbers, so we will create a score by dividing the first by the second.

In [23]:
# Divide similar users by all users to create a recommendation score
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
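
As a quick check of what the score means, the sketch below recomputes it for two rows copied from the table above (movieId 32 and movieId 134853). The `score` helper is a hypothetical name for the same division; a value near 1 means similar users like the movie about as often as everyone else, while a larger value means it is specific to users like us.

```python
def score(similar_share, all_share):
    # Ratio of the share among similar users to the share among all users
    return similar_share / all_share

# (similar, all) pairs copied from the rec_percentages output above
rows = {
    32: (0.103877, 0.101516),
    134853: (0.198641, 0.036444),
}

scores = {movie_id: score(s, a) for movie_id, (s, a) in rows.items()}

# Print the biggest score first, as the sorting step does
for movie_id, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(movie_id, round(value, 2))
# 134853 5.45
# 32 1.02
```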

In [24]:
# Sort the recommendations to give us the biggest value first
rec_percentages = rec_percentages.sort_values("score", ascending=False)

In [25]:
#Take our top 10 recommendations and merge with our movies dataset
rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
17067,1.0,0.040459,24.716368,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX,Avengers The 2012
20513,0.103711,0.005289,19.610199,106072,Thor: The Dark World (2013),Action|Adventure|Fantasy|IMAX,Thor The Dark World 2013
25058,0.241054,0.012367,19.49177,122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi,Avengers Age of Ultron 2015
19678,0.216534,0.012119,17.867419,102125,Iron Man 3 (2013),Action|Sci-Fi|Thriller|IMAX,Iron Man 3 2013
16725,0.215043,0.012052,17.843074,88140,Captain America: The First Avenger (2011),Action|Adventure|Sci-Fi|Thriller|War,Captain America The First Avenger 2011
16312,0.175447,0.010142,17.299824,86332,Thor (2011),Action|Adventure|Drama|Fantasy|IMAX,Thor 2011
21348,0.287608,0.016737,17.183667,110102,Captain America: The Winter Soldier (2014),Action|Adventure|Sci-Fi|IMAX,Captain America The Winter Soldier 2014
25071,0.214049,0.012856,16.649399,122920,Captain America: Civil War (2016),Action|Sci-Fi|Thriller,Captain America Civil War 2016
25061,0.136017,0.008573,15.865628,122900,Ant-Man (2015),Action|Adventure|Sci-Fi,AntMan 2015
14628,0.242876,0.015517,15.651921,77561,Iron Man 2 (2010),Action|Adventure|Sci-Fi|Thriller|IMAX,Iron Man 2 2010


## Building our Recommendation Function and Creating an Interactive Recommendation Widget

In [26]:
def find_similar_movies(movie_id):
    # Users who rated the searched movie above 4
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    # Other movies those users rated above 4, as a fraction of the similar users
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

    # Keep only the movies liked by more than 10% of the similar users
    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    # How often users in general rated those same movies above 4
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]

    # Score = share among similar users divided by share among all users, biggest first
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]

In [27]:
import ipywidgets as widgets
from IPython.display import display

movie_name_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')

display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()