We'll be working with two datasets, movies.csv and ratings.csv. The movies.csv dataset is a listing of movie titles: each row holds a single movie and its genres. Using this dataset, we can build a movie search engine that finds the movie we want.

We'll use the other dataset, ratings.csv, to build the recommendation engine.

In [1]:
import pandas as pd

In [2]:
movies=pd.read_csv('movies.csv')

movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


The first thing we'll do is build our search engine. To do that, we should clean the movie titles, because extra characters such as parentheses make searching harder. We can remove them with a regular expression.

In [3]:
#making a function to clean the title column

import re

def cleaned_title(title):
  res = re.sub(r'[^a-zA-Z0-9 ]', '', title)
  return res



In [4]:
#making a new column called cleaned_title that uses the function that we had made

movies['cleaned_title']=movies['title'].apply(cleaned_title)

In [5]:
movies.head()

Unnamed: 0,movieId,title,genres,cleaned_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
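The character class in cleaned_title keeps only ASCII letters, digits, and spaces, so it strips more than parentheses: hyphens and accented letters disappear too. A quick sketch of that behavior follows. The accented title is an illustrative input, not a row checked against movies.csv.

```python
import re

def cleaned_title(title):
    # Same pattern as above: drop everything except ASCII letters, digits, and spaces
    return re.sub(r'[^a-zA-Z0-9 ]', '', title)

samples = [
    "Toy Story (1995)",
    "Ant-Man (2015)",        # the hyphen is removed, joining the two words
    "Am\u00e9lie (2001)",    # illustrative input: the accented letter is dropped
]
cleaned = [cleaned_title(t) for t in samples]
for original, result in zip(samples, cleaned):
    print(f"{original!r} -> {result!r}")
```

Because search applies the same cleaning to the query, a query and a title are stripped in the same way, so this mostly matters when reading the cleaned_title column by eye.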



# Creating a TFIDF Matrix

Next, we'll use TfidfVectorizer to create a TFIDF Matrix.

While creating the matrix, we'll pass ngram_range=(1,2) when we initialize the class. Instead of looking only at individual words in the title, the vectorizer will then also look at n-grams, here groups of two consecutive words. So for "toy story 1995", it considers not just toy, story, and 1995 on their own, but also "toy story" and "story 1995" together, which makes our search more accurate.
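To make the effect of ngram_range=(1,2) concrete, here is a simplified pure-Python sketch that lists the single words and consecutive word pairs in a title. The helper word_ngrams is hypothetical and is not scikit-learn's implementation; for instance, the real vectorizer's default tokenizer ignores single-character tokens.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all runs of min_n..max_n consecutive lowercase words in text."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the title
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("Toy Story 1995")
print(grams)
```

Each of these strings becomes a candidate column in the TF-IDF matrix, so a query containing "toy story" matches on the pair as well as on the individual words.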

In [6]:


from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer(ngram_range=(1,2))

tdif=vectorizer.fit_transform(movies['cleaned_title'])

# Creating a Search Function

Next, we'll compute the similarity between our search term and all the titles in our data. To do this, we're going to use something called cosine_similarity, which is available in scikit-learn — we don't need to implement it ourselves.

We'll then write a function called search, which takes in a search term; in this case, the term is a title we want to search. The function will then do the following:



*   Clean the title
*   Convert the title into a set of numbers
*   Use cosine_similarity to find the similarity between our search term and all the titles in our data
*   Return the five most similar titles to our search term
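Cosine similarity measures the angle between two vectors: it is near 1 when they point the same way and 0 when they share no terms. The sketch below computes it by hand on made-up term weights. The function cosine and the three-word vocabulary are assumptions for illustration; the search function itself uses scikit-learn's cosine_similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term weights over the vocabulary ["toy", "story", "jumanji"]
query = [1.0, 1.0, 0.0]
toy_story = [1.0, 1.0, 0.0]   # shares both query terms
half_match = [1.0, 0.0, 0.0]  # shares one query term
jumanji = [0.0, 0.0, 1.0]     # shares none

sim_same = cosine(query, toy_story)
sim_half = cosine(query, half_match)
sim_none = cosine(query, jumanji)
print(sim_same, sim_half, sim_none)
```

Titles that share more weighted terms with the query score higher, which is what lets us rank them.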








In [7]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def search(title):
  title=cleaned_title(title)
  vec=vectorizer.transform([title])
  #flatten the 1 x n similarity matrix into a one-dimensional array
  cosine_sim=cosine_similarity(vec,tdif).flatten()
  #argpartition finds the five most similar titles, but in no particular order
  inde=np.argpartition(cosine_sim,-5)[-5:]
  #sort those five indices from most to least similar
  inde=inde[np.argsort(cosine_sim[inde])[::-1]]
  #index our movies so that we can get the titles
  results=movies.iloc[inde]
  return results



In [8]:
#seeing if our search function works
search('Toy Story (1995)')

Unnamed: 0,movieId,title,genres,cleaned_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
3021,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
59767,201588,Toy Story 4 (2019),Adventure|Animation|Children|Comedy,Toy Story 4 2019
14813,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
20497,106022,Toy Story of Terror (2013),Animation|Children|Comedy,Toy Story of Terror 2013
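One detail worth knowing about np.argpartition: it only guarantees which indices hold the largest values, not the order among them. When the ranking matters, for example when the first row is treated as the best match, the selected indices need a final sort. A small sketch with made-up scores:

```python
import numpy as np

# Made-up similarity scores for seven titles
scores = np.array([0.10, 0.90, 0.30, 0.70, 0.20, 0.80, 0.40])
k = 3

# argpartition guarantees the k largest scores sit in the last k slots,
# but says nothing about their order within those slots
top_k = np.argpartition(scores, -k)[-k:]

# Sort just those k indices by score, highest first
ranked = top_k[np.argsort(scores[top_k])[::-1]]
print(ranked)
```

Partitioning first and then sorting only k items avoids sorting the whole array, which helps when there are tens of thousands of titles.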


# Building an Interactive Search Box in Jupyter

Now that we have created our search function, we're going to build an interactive Jupyter Notebook widget wherein we can type in the name of a movie and see the search results.

We need to import ipywidgets. Widgets are small, interactive elements we can embed in notebooks; they let us enter input and then use that input. We also need to import display from IPython.display, a function we can use to display output.

In [9]:
import ipywidgets as widgets
from IPython.display import display

In [10]:
#creating a text input widget

movie_input = widgets.Text(
    #default value
    value='Toy Story',
    #label shown next to the text box
    description='Movie Title:',

    disabled=False
)

#to make the text box useful, we pair it with an output widget

movie_list=widgets.Output()

#this function runs whenever the text in the box changes
def on_type(data):
  #send everything displayed here to our output widget
  with movie_list:
    #clear the output first
    movie_list.clear_output()
    #get the title from our input
    #data behaves like a dictionary, and its 'new' field holds the updated text
    title=data['new']
    #arbitrary threshold: only search once more than 5 characters have been typed
    if len(title)>5:
      #display the matching movies
      display(search(title))

#observe calls on_type every time the widget's value changes,
#that is, whenever we type something into the box
movie_input.observe(on_type,names='value')

#display the movie_input and movie_list widgets
#typing in the box shows the titles most similar to what we typed


display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()
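The handler passed to observe receives a change record whose 'new' entry holds the updated text. Because it behaves like a dictionary, the length check can be exercised outside a notebook with plain dicts. In this sketch, fake_search and the collected list are hypothetical stand-ins for search and the Output widget.

```python
def fake_search(title):
    # Hypothetical stand-in for search(); it just echoes the query
    return f"results for {title!r}"

collected = []  # stands in for what the Output widget would display

def on_type(change):
    title = change["new"]
    # Same rule as above: only search once more than 5 characters are typed
    if len(title) > 5:
        collected.append(fake_search(title))

# Simulate the change records produced while typing "Jumanji" one key at a time
typed = "Jumanji"
for end in range(1, len(typed) + 1):
    on_type({"name": "value", "old": typed[:end - 1], "new": typed[:end]})

print(collected)
```

Only the last two keystrokes trigger a search here. One consequence of the threshold is that a title of five or fewer characters typed on its own never triggers one.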

# Reading in Movie Ratings Data

We've finished the first half of the project. The second half is more exciting because we'll build the actual recommendation system. We need to find movies similar to a movie we liked, so that we can search for that movie and get recommendations. The ratings.csv file will help us do this.

In the ratings.csv file, each row records a userId, a movieId, the rating that user gave the movie, and a timestamp. We'll create a function to find all the users who liked the movie that we typed in. For example, if we type Hulk, we want to find all users who also liked Hulk. Then we want to see the other movies they liked, because those will probably be good recommendations for us.

In [11]:
ratings=pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


# Finding Users Who Liked the Same Movie


Now we'll find the users who liked the same movie we liked. Then we need to find the other movies that they liked. Once we have done that, we'll establish a threshold for recommendations. For example, we could require that more than 10% of the users similar to us liked a movie before we include it in our recommendations.

In [12]:
movie_id = 89745

#def find_similar_movies(movie_id):
movie = movies[movies["movieId"] == movie_id]

In [13]:
#finding all the users who liked the same movie, that is, rated it higher than 4

similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
similar_users

array([    21,    187,    208, ..., 162469, 162485, 162532])

In [14]:
#now we need to find the other movies that these users also rated higher than 4

similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
similar_user_recs

3741           318
3742           527
3743           541
3744           589
3745           741
             ...  
24998517     91542
24998518     92259
24998522     98809
24998523    102125
24998524    112852
Name: movieId, Length: 577796, dtype: int64

In [15]:
#convert the counts into the fraction of similar users who liked each movie,
#then keep only the movies liked by more than 10% of them

similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

similar_user_recs = similar_user_recs[similar_user_recs > .10]

similar_user_recs

89745    1.000000
58559    0.573393
59315    0.530649
79132    0.519715
2571     0.496687
           ...   
47610    0.103545
780      0.103380
88744    0.103048
1258     0.101226
1193     0.100895
Name: movieId, Length: 193, dtype: float64
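The same steps can be followed by hand on a tiny, made-up ratings table. The data and the name toy_ratings below are assumptions for illustration only; the filtering mirrors the cells above.

```python
import pandas as pd

# Made-up ratings: users 1-3 all rated movie 10 highly; user 4 never rated it
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 30, 20, 30],
    "rating":  [5.0, 4.5, 3.0, 4.5, 5.0, 5.0, 2.0, 5.0, 4.5],
})
movie_id = 10

liked = toy_ratings["rating"] > 4
similar_users = toy_ratings[(toy_ratings["movieId"] == movie_id) & liked]["userId"].unique()

# Highly rated movies among those users, as a fraction of the similar users
recs = toy_ratings[toy_ratings["userId"].isin(similar_users) & liked]["movieId"]
fractions = recs.value_counts() / len(similar_users)
print(fractions)
```

Movie 10 comes out at 1.0 because every similar user liked it by definition, which is why the searched movie also tops the real output. Movie 20 was liked by two of the three similar users, and movie 30 does not appear because none of them rated it higher than 4.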

# Determining How Much Users Like Movies

Now we'll find what fraction of all the users in our dataset like these movies. We want recommendations that are specific to our niche. For example, if someone likes The Avengers, we want to find other movies they like that are similar to The Avengers. We don't want every movie they like, because they probably like many movies that have nothing to do with The Avengers.

In [16]:
#now look at all the users in our dataset: find every rating of a movie
#in our set of recommended movies, keeping only the ratings higher than 4

all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
all_users

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
29,1,4973,4.5,1147869080
48,1,7361,5.0,1147880055
72,2,110,5.0,1141416589
76,2,260,5.0,1141417172
...,...,...,...,...
25000065,162541,5952,5.0,1240952617
25000078,162541,7153,5.0,1240952613
25000081,162541,7361,4.5,1240953484
25000086,162541,31658,4.5,1240953287


In [18]:
all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
all_user_recs

318       0.346395
296       0.288146
2571      0.247010
356       0.238136
593       0.228665
            ...   
86332     0.010142
91630     0.009324
122900    0.008573
122926    0.008070
106072    0.005289
Name: movieId, Length: 193, dtype: float64

# Creating a Recommendation Score

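Now that we have both sets of percentages, we can compare them. Dividing the share of similar users who liked a movie by the share of all users who liked it gives a score: the higher the score, the more specific the movie is to users like us.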

In [19]:
#making a dataframe that shows the percentages for similar users and for all users
rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
rec_percentages.columns = ["similar", "all"]

In [20]:
rec_percentages

Unnamed: 0,similar,all
1,0.236083,0.126250
32,0.103877,0.101516
47,0.203115,0.146232
50,0.211067,0.202959
110,0.182240,0.162835
...,...,...
134853,0.198641,0.036444
152081,0.133532,0.020652
164179,0.128728,0.029124
166528,0.124751,0.014411


In [21]:
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]

In [22]:
rec_percentages = rec_percentages.sort_values("score", ascending=False)


In [23]:
rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,cleaned_title
17067,1.0,0.040459,24.716368,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX,Avengers The 2012
20513,0.103711,0.005289,19.610199,106072,Thor: The Dark World (2013),Action|Adventure|Fantasy|IMAX,Thor The Dark World 2013
25058,0.241054,0.012367,19.49177,122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi,Avengers Age of Ultron 2015
19678,0.216534,0.012119,17.867419,102125,Iron Man 3 (2013),Action|Sci-Fi|Thriller|IMAX,Iron Man 3 2013
16725,0.215043,0.012052,17.843074,88140,Captain America: The First Avenger (2011),Action|Adventure|Sci-Fi|Thriller|War,Captain America The First Avenger 2011
16312,0.175447,0.010142,17.299824,86332,Thor (2011),Action|Adventure|Drama|Fantasy|IMAX,Thor 2011
21348,0.287608,0.016737,17.183667,110102,Captain America: The Winter Soldier (2014),Action|Adventure|Sci-Fi|IMAX,Captain America The Winter Soldier 2014
25071,0.214049,0.012856,16.649399,122920,Captain America: Civil War (2016),Action|Sci-Fi|Thriller,Captain America Civil War 2016
25061,0.136017,0.008573,15.865628,122900,Ant-Man (2015),Action|Adventure|Sci-Fi,AntMan 2015
14628,0.242876,0.015517,15.651921,77561,Iron Man 2 (2010),Action|Adventure|Sci-Fi|Thriller|IMAX,Iron Man 2 2010
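To see why franchise titles rank above broadly popular films, the score can be worked by hand for two rows, using the rounded fractions printed in the tables above. This is only an arithmetic sketch; the dictionary rows is an illustrative name.

```python
# (similar, all) fractions copied from the tables above
rows = {
    "Toy Story (1995)": (0.236083, 0.126250),             # movieId 1
    "Thor: The Dark World (2013)": (0.103711, 0.005289),  # movieId 106072
}

# score = share of similar users who liked it / share of all users who liked it
scores = {title: similar / overall for title, (similar, overall) in rows.items()}
for title, score in scores.items():
    print(f"{title}: {score:.2f}")
```

Toy Story is liked by more similar users than Thor: The Dark World, but it is also popular with everyone, so its ratio stays below 2. The Thor sequel is rarely liked outside this group, so its ratio is close to 20. The hand-computed values differ slightly from the score column because the printed fractions are rounded.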


# Building a Recommendation Function

Now we need to put all of these steps into a function. It should return the following columns for our top 10 movie recommendations:



*   score
*   title
*   genres







In [24]:
def find_similar_movies(movie_id):
    #users who rated this movie higher than 4
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    #every movie those users rated higher than 4
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    #fraction of similar users who liked each movie
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

    #keep only the movies liked by more than 10% of similar users
    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    #high ratings of those movies across all users
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    #fraction of those users who liked each movie
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    #put both fractions side by side, one row per movie
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    
    #score: how much more similar users like a movie than users in general
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]

# Create an Interactive Recommendation Widget

Now we can build the widget that will do this automatically so we can type in a movie title and get recommendations.



In [25]:
import ipywidgets as widgets
from IPython.display import display

movie_name_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')

display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()