# Movie recommendation system that uses cosine similarity on plot overviews to find similar movies

In the cell below we import the packages used in this project. numpy and pandas let us work with arrays and dataframes. CountVectorizer builds a matrix of word counts from the plot overviews, and cosine_similarity computes similarity scores from the sparse matrix that CountVectorizer returns. Lastly, we import pickle, which we will use to dump the data so that it can be loaded by the Streamlit front-end.

In [29]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

Here we use pandas to load both our datasets into the variables movies and credits. 

In [30]:
#Reading in both files from the TMDB dataset
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")


In [31]:
#Viewing the first rows of the credits dataframe
credits.head()



Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [33]:
#Renaming the column movie_id to id so that we can merge the dataframes and work with all the relevant data in one place
credits = credits.rename(columns  = {'movie_id': 'id'})
#checking if the column name has actually been changed
credits.head()

Unnamed: 0,id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [34]:
full_dataframe = credits.merge(movies, on = 'id')
full_dataframe.head()


Unnamed: 0,id,title_x,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title_y,vote_average,vote_count
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
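
The title_x and title_y columns in the output above appear because both dataframes contain a title column that is not part of the merge key, so pandas adds its default suffixes. The sketch below reproduces this on two small made-up frames (toy_credits and toy_movies are illustrative stand-ins, not the TMDB data).

```python
import pandas as pd

# Hypothetical toy frames standing in for credits and movies.
toy_credits = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "cast": ["x", "y"]})
toy_movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "overview": ["p", "q"]})

# Merging on id keeps one id column; the overlapping non-key column
# "title" is duplicated with the default suffixes _x (left) and _y (right).
merged = toy_credits.merge(toy_movies, on="id")
print(merged.columns.tolist())
# → ['id', 'title_x', 'cast', 'title_y', 'overview']
```

The default merge is an inner join, so only ids present in both frames survive.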


In [35]:
#We can now drop the duplicate title columns (title_x and title_y) created by the merge and clean up the dataframe
full_dataframe.columns.tolist()
full_dataframe = full_dataframe.drop(columns= ['title_x', 'title_y'])
full_dataframe.columns.tolist()

['id',
 'cast',
 'crew',
 'budget',
 'genres',
 'homepage',
 'keywords',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'vote_average',
 'vote_count']

In [36]:
#Now we can create a new dataframe that contains only the columns that we require for the cosine similarity.
#We will be using the other dataframe to create a front end for our recommender using streamlit
cosine_dataframe = full_dataframe[['id', 'original_title', 'overview']].copy()
cosine_dataframe.head()

Unnamed: 0,id,original_title,overview
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [37]:
#We now check if our cosine dataframe has any na values in the overview column
cosine_dataframe.isnull().sum()



id                0
original_title    0
overview          3
dtype: int64

In [38]:
#As we can see from our output we have 3 movies without an overview. We can simply change the value of these overviews to be an empty string. 

cosine_dataframe = cosine_dataframe.fillna('')
cosine_dataframe['original_title']


0                                         Avatar
1       Pirates of the Caribbean: At World's End
2                                        Spectre
3                          The Dark Knight Rises
4                                    John Carter
                          ...                   
4798                                 El Mariachi
4799                                   Newlyweds
4800                   Signed, Sealed, Delivered
4801                            Shanghai Calling
4802                           My Date with Drew
Name: original_title, Length: 4803, dtype: object
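
As a small illustration of the check-and-fill step above, the sketch below counts missing values per column on a made-up frame and fills only the overview column. Passing a dict to fillna is an alternative to the blanket fillna('') used in the notebook; toy_frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing overviews.
toy_frame = pd.DataFrame({
    "id": [1, 2, 3],
    "original_title": ["A", "B", "C"],
    "overview": ["a plot", np.nan, None],
})

missing_before = toy_frame.isnull().sum()        # per-column count of missing values
filled = toy_frame.fillna({"overview": ""})      # fill only the overview column
missing_after = filled.isnull().sum()

print(missing_before["overview"], missing_after["overview"])
```

Restricting the fill to one column avoids silently turning missing values in other columns into empty strings.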

In [39]:
#Checking that no null values remain and confirming the shape (only the last expression, the shape, is displayed)
cosine_dataframe.isnull().sum()
cosine_dataframe.shape


(4803, 3)

In the cell below we use the CountVectorizer class, which creates a matrix where the rows are our movies and the columns are the words from the plot overviews (with English stop words removed). Each value is the number of times that word occurs in that movie's overview.

In [40]:
countvectorizer = CountVectorizer(stop_words='english')
sparse_cosine_dataframe = countvectorizer.fit_transform(cosine_dataframe['overview'])
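
To make the shape of this matrix concrete, here is a minimal sketch on three made-up overviews (toy_overviews is hypothetical). It shows that stop words are dropped from the vocabulary and that an empty overview becomes an all-zero row.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical overviews; the last one is empty, like the filled-in missing values.
toy_overviews = [
    "A marine explores an alien planet",
    "An alien hunts a marine on a spaceship",
    "",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_overviews)  # sparse matrix: rows = documents

# vocabulary_ maps each kept word to its column index.
print(sorted(toy_vectorizer.vocabulary_))
print(toy_counts.toarray())
```

Words such as "an" and "on" do not get a column, and the row for the empty overview contains only zeros.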

In this cell we calculate the cosine similarity between every pair of rows in our sparse matrix. Because the word counts are non-negative, each score lies between 0 and 1, and a higher score means that two plot overviews share more of their vocabulary.

In [41]:
#We can now determine the cosine similarity 
cosine_sim = cosine_similarity(sparse_cosine_dataframe,sparse_cosine_dataframe) 
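
For reference, the cosine similarity of two count vectors a and b is their dot product divided by the product of their lengths, (a · b) / (|a| |b|). The sketch below computes this by hand on made-up vectors and compares it with scikit-learn's result. manual_cosine is a hypothetical helper, and returning 0 for a zero vector is a convention chosen here to mirror scikit-learn's behaviour.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical count vectors; the last row mimics an empty overview.
toy_vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])

def manual_cosine(a, b):
    """Dot product divided by the product of the norms; 0.0 if either norm is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

toy_sim = cosine_similarity(toy_vectors)  # pairwise similarities between the rows

print(manual_cosine(toy_vectors[0], toy_vectors[1]), toy_sim[0, 1])
print(toy_sim[2, 2])  # an all-zero row has similarity 0 even with itself
```

The last line matters for this project: a movie with an empty overview scores 0 against every movie, including itself.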


We now need a function that can take in a movie so that we can recommend similar movies. To do this we need to locate the row index of the movie so that we can access this row index in the similarity matrix and thus locate similar movies which will be located in the columns. 

In [42]:
#We now need a way to extract the row position of the movie we want to compare against (raises an IndexError if the title is not in the dataframe)
def get_index(movie):
    row_index = cosine_dataframe.index.get_loc(cosine_dataframe[cosine_dataframe['original_title']== movie].index[0])
    return row_index
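
As one possible variation on the lookup above, the sketch below returns None for an unknown title instead of raising, and takes the first match when a title appears more than once. find_index and toy_frame are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical frame with a repeated title.
toy_frame = pd.DataFrame({"original_title": ["Avatar", "Spectre", "Avatar"]})

def find_index(frame, title):
    """Return the first row position whose original_title equals title, or None."""
    # The boolean mask is converted to the positions where it is True.
    matches = (frame["original_title"] == title).to_numpy().nonzero()[0]
    return int(matches[0]) if len(matches) else None

print(find_index(toy_frame, "Spectre"))   # → 1
print(find_index(toy_frame, "Missing"))   # → None
```

Returning None lets a front-end show a "movie not found" message rather than crash on a typo.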


We now need a function that takes the user input and returns movie recommendations. The function first calls the function defined above to get the row index of the movie, and then reads that row from the cosine similarity matrix. After this we sort by decreasing similarity score and take the top 10 similar movies, excluding the movie the user entered.

In [43]:
def get_similar(movie):
    index = get_index(movie)
    sim_row = cosine_sim[index]
    #Pair each movie index with its score, sort by decreasing score, and skip the first entry (assumed to be the movie itself)
    top_10 = sorted(enumerate(sim_row), reverse=True, key=lambda x: x[1])[1:11]
    for movie_index, _ in top_10:
        print(cosine_dataframe['original_title'].iloc[movie_index])


In [44]:
get_similar("Avatar") #Testing that the get_similar function prints recommendations


Apollo 18
Tears of the Sun
Beowulf
The American
Obitaemyy Ostrov
The Matrix
Aliens vs Predator: Requiem
Hanna
Just Visiting
In Bruges
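
Skipping the first sorted entry assumes that the queried movie always sorts first. That may not hold when another movie ties with a score of 1.0, or when the overview is empty. The sketch below is one possible alternative that excludes the query by its index and returns a list instead of printing. top_similar, toy_titles and toy_sim are hypothetical and not part of the notebook.

```python
import numpy as np

def top_similar(sim_matrix, index, titles, n=10):
    """Return up to n titles most similar to row `index`, excluding that row."""
    scores = np.asarray(sim_matrix[index], dtype=float)
    # Stable sort by decreasing score, so ties keep their original order.
    order = np.argsort(-scores, kind="stable")
    return [titles[i] for i in order if i != index][:n]

# Hypothetical similarity matrix where movies A and D are identical (score 1.0).
toy_titles = ["A", "B", "C", "D"]
toy_sim = np.array([
    [1.0, 0.2, 0.9, 1.0],
    [0.2, 1.0, 0.1, 0.2],
    [0.9, 0.1, 1.0, 0.9],
    [1.0, 0.2, 0.9, 1.0],
])

print(top_similar(toy_sim, 3, toy_titles, n=2))
# → ['A', 'C']
```

With this toy matrix, sorting and slicing off the first entry for movie D would drop A and keep D itself, which is the case the explicit index check avoids.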


In [46]:
#Here we use pandas' to_pickle and numpy's save to dump all relevant data so that we can use it when creating the streamlit front-end (np.save writes the file similar.npy).
cosine_dataframe.to_pickle('cosine.pkl')
np.save('similar', cosine_sim)
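
For completeness, here is a minimal sketch of how the saved files could be read back, for example in the front-end. It uses made-up data and a temporary directory so that it does not touch the real files. The file names mirror the ones above, but how the Streamlit app actually loads them is an assumption.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for cosine_dataframe and cosine_sim.
toy_frame = pd.DataFrame({"id": [1, 2], "original_title": ["A", "B"], "overview": ["p", ""]})
toy_sim = np.eye(2)

with tempfile.TemporaryDirectory() as tmp:
    frame_path = os.path.join(tmp, "cosine.pkl")
    sim_path = os.path.join(tmp, "similar")  # np.save appends ".npy" to this name

    toy_frame.to_pickle(frame_path)
    np.save(sim_path, toy_sim)

    # Reading the data back in.
    loaded_frame = pd.read_pickle(frame_path)
    loaded_sim = np.load(sim_path + ".npy")

print(loaded_frame.equals(toy_frame), np.array_equal(loaded_sim, toy_sim))
```

Pickle files should only be loaded from sources we trust, since unpickling can execute arbitrary code.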