Our recommender is a content-based system: it recommends movies that are similar to the one the user has viewed. The TMDB 5000 movies dataset contains many features, but only three are needed here: 'id', a unique identifier for each movie; 'original_title', the title of the movie before translation or adaptation; and 'overview', a description of the movie. After cleaning the dataset, we build a TF-IDF matrix with the sklearn class 'TfidfVectorizer'. Each row of this matrix holds the TF-IDF weights of the words in one movie description. We then compute the cosine similarity matrix from the TF-IDF matrix and use its values to find related movies: two movies are considered similar if their descriptions are similar. Finally, we write a function that returns the 10 movies in the dataset that are most similar to a given movie.
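
For reference, the two quantities this notebook relies on can be written out as follows. This is a sketch that assumes the smoothed IDF variant with a natural logarithm, which to our understanding is the TfidfVectorizer default. Here tf(t, d) is the count of term t in description d, df(t) is the number of descriptions containing t, n is the number of descriptions, and a and b are the TF-IDF rows of two movies.

```latex
\begin{aligned}
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), &
\mathrm{idf}(t) &= \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \\
\cos(\mathbf{a}, \mathbf{b}) &= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}.
\end{aligned}
```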

In [1]:
#importing the necessary libraries

import numpy as np
import pandas as pd

In [2]:
#read data

data=pd.read_csv('https://raw.githubusercontent.com/noahjett/Movie-Goodreads-Analysis/master/tmdb_5000_movies.csv')
data.head(2)

,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [3]:
#selection of necessary features

data=data[['id', 'original_title', 'overview' ]]
data.columns

Index(['id', 'original_title', 'overview'], dtype='object')

In [4]:
data.head(2)

,id,original_title,overview
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."


In [5]:
#check for missing values

data.isnull().sum()

id                0
original_title    0
overview          3
dtype: int64

In [6]:
#deleting missing values

data=data.dropna()
data.isnull().sum()

id                0
original_title    0
overview          0
dtype: int64

In [7]:
#check for duplicate rows

data.duplicated().sum()

0

In [8]:
#importing a module for working with movie descriptions

from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
#creating TF-IDF matrix

TFIDF = TfidfVectorizer(stop_words='english')
TFIDF_matrix=TFIDF.fit_transform(data['overview'])
TFIDF_matrix.shape

(4800, 20978)
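
To make the entries of this matrix less opaque, here is a minimal pure-Python sketch of how TF-IDF rows can be computed. It assumes the smoothed IDF and L2 row normalization that TfidfVectorizer is understood to apply by default. The whitespace tokenizer, the toy descriptions, and the helper name `tfidf_rows` are illustrative assumptions rather than part of the notebook, and stop-word removal is left out.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows): one L2-normalized TF-IDF row per document."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # Scale the row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

# Toy descriptions (made up for illustration, not taken from the dataset)
toy_overviews = [
    "a marine explores an alien moon",
    "a pirate sails to the edge of the world",
    "an alien ship lands on the moon",
]
vocabulary, rows = tfidf_rows(toy_overviews)
print(len(rows), len(vocabulary))
```

As with the real matrix, there is one row per description and one column per distinct term. A term that occurs in fewer descriptions, such as "marine", receives a larger weight than a term shared by several of them, such as "a".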

In [10]:
#importing a module for creating cosine similarity matrix

from sklearn.metrics.pairwise import linear_kernel

In [11]:
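#compute the cosine similarity matrix
#(TfidfVectorizer L2-normalizes each row by default, so the plain dot product from linear_kernel equals the cosine similarity)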

cosine_similarity_matrix = linear_kernel(TFIDF_matrix,TFIDF_matrix)
cosine_similarity_matrix

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.02160368, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.01488031, 0.        ,
        0.        ],
       ...,
       [0.        , 0.02160368, 0.01488031, ..., 1.        , 0.01608882,
        0.00701802],
       [0.        , 0.        , 0.        , ..., 0.01608882, 1.        ,
        0.01171476],
       [0.        , 0.        , 0.        , ..., 0.00701802, 0.01171476,
        1.        ]])
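
The diagonal of ones reflects that every row has unit length. The following small check, using only the standard library, illustrates why a dot product is enough once vectors are L2-normalized. The vectors and helper names are made up for illustration.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, computed from the definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def l2_normalize(u):
    """Scale u to unit Euclidean length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

# Two arbitrary example vectors
u = [1.0, 2.0, 0.0, 3.0]
v = [0.0, 1.0, 4.0, 1.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

# For unit-length vectors the denominator is 1, so both values agree
print(math.isclose(cosine_similarity(u, v), dot(u_hat, v_hat)))
```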

In [12]:
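#construct a reverse map from movie titles to row positions
#(positions rather than data.index labels, because dropna() leaves gaps in the labels while the similarity matrix rows and .iloc are positional)

movie_indices=pd.Series(range(len(data)), index=data['original_title'])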
movie_indices.head()

original_title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64

In [13]:
#creating a function that recommends movies

def recommend_movies(movie_name):
    
    # Get the index of the movie matching the title
    movie_index=movie_indices[movie_name]
    
    # Get the pairwise similarity scores between this movie and every movie
    similarity_score=list(enumerate(cosine_similarity_matrix[movie_index]))
    
    # Sort movies by similarity scores
    similarity_score=sorted(similarity_score, key=lambda x: x[1], reverse=True)
    
    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    similarity_score=similarity_score[1:11]
    
    # Get movie indices
    given_movie_indices=[i[0] for i in similarity_score]
    
    # Return the 10 most similar movies
    return data['original_title'].iloc[given_movie_indices]

In [14]:
#testing the function on one of the movies

recommend_movies('Avatar')

3604                       Apollo 18
2130                    The American
634                       The Matrix
1341                Obitaemyy Ostrov
529                 Tears of the Sun
1610                           Hanna
311     The Adventures of Pluto Nash
847                         Semi-Pro
775                        Supernova
2628             Blood and Chocolate
Name: original_title, dtype: object
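
The function above sorts every score and relies on the queried movie ranking first. That can fail if another movie with an identical description appears earlier in the dataset. A possible alternative for the selection step, sketched here with only the standard library, excludes the query position explicitly and keeps just the top k scores. The helper name `top_k_similar` and the toy scores are assumptions for illustration.

```python
import heapq

def top_k_similar(similarity_row, query_index, k=10):
    """Return the positions of the k highest scores, excluding the query itself."""
    candidates = (
        (position, score)
        for position, score in enumerate(similarity_row)
        if position != query_index
    )
    # nlargest avoids sorting the full row when only k items are needed
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [position for position, _ in best]

# Toy similarity row for the movie at position 0 (made-up scores)
toy_row = [1.0, 0.12, 0.0, 0.47, 0.47, 0.05]
print(top_k_similar(toy_row, query_index=0, k=3))
```

In the notebook this could be called with one row of the similarity matrix and the position of the queried movie, and the returned positions passed to `.iloc` as before.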