# Project Name: Movie Recommendation System
In this project we build a content- and popularity-based movie recommendation system using several movie attributes available in the movie dataset. At the end of the project we create a function that asks the user for a favourite movie and then recommends 20 other movies that are similar to it (content- and popularity-based similarity).

### Importing Libraries

In [3]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

### Loading the dataset

In [4]:
# now we are going to import the movies dataset into a pandas dataframe object
df = pd.read_csv('movies.csv')
df.head()

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


In [5]:
# let's check out the shape of the dataset
print(f'''
No of rows: {df.shape[0]}
No of columns: {df.shape[1]}
''')


No of rows: 4803
No of columns: 24



In [6]:
# since we are building a content- and popularity-based recommendation system, we extract only
# the few features from the dataset that are listed below
selected_features = ['genres','keywords','tagline','cast','director']
print(selected_features)

['genres', 'keywords', 'tagline', 'cast', 'director']


In [7]:
# Now we are going to check the presence of missing values inside the selected features
for feature in selected_features:
    print(f'{feature}: {df[feature].isnull().sum()} missing values')

genres: 28 missing values
keywords: 412 missing values
tagline: 844 missing values
cast: 43 missing values
director: 30 missing values


In [8]:
# the selected columns contain missing values, so we replace them with an empty string
for feature in selected_features:
    df[feature]=df[feature].fillna('')

In [9]:
for feature in selected_features:
    print(f'{feature}: {df[feature].isnull().sum()} missing values')

genres: 0 missing values
keywords: 0 missing values
tagline: 0 missing values
cast: 0 missing values
director: 0 missing values
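
Filling the gaps before combining the columns matters because pandas propagates missing values through string concatenation: a single missing field would blank out the whole combined row for that movie. A minimal sketch on made-up data (the `toy` frame and its values are illustrative and not taken from `movies.csv`):

```python
import pandas as pd

# toy frame with one missing genre; the values are invented for illustration
toy = pd.DataFrame({'genres': ['Action', None], 'keywords': ['hero', 'spy']})

# without filling, the row with the missing genre becomes missing as a whole
combined_raw = toy['genres'] + ' ' + toy['keywords']

# after filling with an empty string, every row keeps its remaining text
filled = toy.fillna('')
combined_filled = filled['genres'] + ' ' + filled['keywords']

print(combined_raw.isna().tolist())
print(combined_filled.tolist())
```

The filled version keeps a leading space where a field was empty, which is harmless here because the vectorizer ignores whitespace when it tokenizes.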


In [10]:
# combining the selected features
combined_data = ''
for feature in selected_features:
    combined_data = combined_data + ' ' + df[feature]

combined_data

0        Action Adventure Fantasy Science Fiction cult...
1        Adventure Fantasy Action ocean drug abuse exo...
2        Action Adventure Crime spy based on novel sec...
3        Action Crime Drama Thriller dc comics crime f...
4        Action Adventure Science Fiction based on nov...
                              ...                        
4798     Action Crime Thriller united states\u2013mexi...
4799     Comedy Romance  A newlywed couple's honeymoon...
4800     Comedy Drama Romance TV Movie date love at fi...
4801       A New Yorker in Shanghai Daniel Henney Eliz...
4802     Documentary obsession camcorder crush dream g...
Length: 4803, dtype: object

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# now we are going to convert the text data into numeric TF-IDF feature vectors
vectorizer = TfidfVectorizer()
vectorized_data = vectorizer.fit_transform(combined_data)

print(vectorized_data)

  (0, 2432)	0.17272411194153
  (0, 7755)	0.1128035714854756
  (0, 13024)	0.1942362060108871
  (0, 10229)	0.16058685400095302
  (0, 8756)	0.22709015857011816
  (0, 14608)	0.15150672398763912
  (0, 16668)	0.19843263965100372
  (0, 14064)	0.20596090415084142
  (0, 13319)	0.2177470539412484
  (0, 17290)	0.20197912553916567
  (0, 17007)	0.23643326319898797
  (0, 13349)	0.15021264094167086
  (0, 11503)	0.27211310056983656
  (0, 11192)	0.09049319826481456
  (0, 16998)	0.1282126322850579
  (0, 15261)	0.07095833561276566
  (0, 4945)	0.24025852494110758
  (0, 14271)	0.21392179219912877
  (0, 3225)	0.24960162956997736
  (0, 16587)	0.12549432354918996
  (0, 14378)	0.33962752210959823
  (0, 5836)	0.1646750903586285
  (0, 3065)	0.22208377802661425
  (0, 3678)	0.21392179219912877
  (0, 5437)	0.1036413987316636
  :	:
  (4801, 17266)	0.2886098184932947
  (4801, 4835)	0.24713765026963996
  (4801, 403)	0.17727585190343226
  (4801, 6935)	0.2886098184932947
  (4801, 11663)	0.21557500762727902
  (4801, 1672
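
Each printed entry above has the form `(row, column)  weight`: the row is the movie, the column is a term in the learned vocabulary, and the weight is that term's TF-IDF score in the movie's combined text. To illustrate where such weights come from, here is a pure-Python sketch of the formula `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows). It is a teaching sketch on an invented three-document corpus, not the library's implementation, and the helper name `tfidf_rows` is hypothetical.

```python
import math
import re
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row maps a term to its TF-IDF weight."""
    # lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # scale each row to unit length; an empty document stays empty
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return sorted(idf), rows

# invented mini corpus standing in for the combined movie text
docs = ['Action Adventure space war', 'Action Crime spy', 'Comedy Romance']
vocabulary, rows = tfidf_rows(docs)
for term, weight in sorted(rows[0].items()):
    print(f'{term}: {weight:.3f}')
```

In the first toy document, `action` gets a lower weight than `space` because it also appears in the second document, which is the behaviour that lets rare, distinctive keywords dominate the similarity computed in the next step.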

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

# computing the pairwise cosine similarity between all movies
similarity = cosine_similarity(vectorized_data)
print(similarity)

[[1.         0.07219487 0.037733   ... 0.         0.         0.        ]
 [0.07219487 1.         0.03281499 ... 0.03575545 0.         0.        ]
 [0.037733   0.03281499 1.         ... 0.         0.05389661 0.        ]
 ...
 [0.         0.03575545 0.         ... 1.         0.         0.02651502]
 [0.         0.         0.05389661 ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.02651502 0.         1.        ]]
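
Entry `[i][j]` of this matrix is the cosine of the angle between the vectors of movie `i` and movie `j`, which explains the 1s on the diagonal (every movie compared with itself) and the symmetry around it. The computation for a single pair can be sketched in plain Python as follows. The vectors are invented, the function name `cosine` is hypothetical, and returning 0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as term -> weight dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no terms at all: treat as "nothing in common"
    return dot / (norm_u * norm_v)

# invented term weights for three movies
a = {'action': 1.0, 'space': 2.0}
b = {'action': 1.0, 'spy': 1.0}
c = {'comedy': 1.0}

print(round(cosine(a, a), 6))  # a movie compared with itself
print(round(cosine(a, b), 6))  # one shared term
print(round(cosine(a, c), 6))  # no shared terms
```

Movies that share no terms score exactly 0, which is why the real matrix contains so many zeros.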


In [13]:
# getting the movie name from the user
movie_name = input('Enter your favourite movie: ')

In [14]:
# creating a list of all the movies in the dataset
list_of_all_movies = df['title'].tolist()

In [15]:
import difflib

# finding the titles in the movie list that are closest to what the user entered
close_matches_found = difflib.get_close_matches(movie_name, list_of_all_movies)
print(close_matches_found)
closest_match = close_matches_found[0]

['Iron Man', 'Iron Man 3', 'Iron Man 2']
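
By default `difflib.get_close_matches` returns at most 3 candidates (`n=3`) that score at least 0.6 (`cutoff=0.6`), best match first, and it returns an empty list when nothing is close enough, in which case indexing with `[0]` raises an `IndexError`. The comparison is also case-sensitive. A small sketch of a more forgiving lookup, where the helper name `find_title` and the sample titles are assumptions made for illustration:

```python
import difflib

def find_title(query, titles, cutoff=0.6):
    """Return the best-matching title, or None when nothing is close enough."""
    # compare in lowercase so that 'iron man' and 'Iron Man' match exactly
    lowered = {title.lower(): title for title in titles}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

sample_titles = ['Iron Man', 'Iron Man 2', 'Batman']
print(find_title('iron man', sample_titles))
print(find_title('zzzzqqq', sample_titles))
```

Returning `None` instead of raising lets the caller decide how to tell the user that no title matched.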


In [16]:
# finding the index of the closest-matching movie so that we can look up its similarity scores later
index_of_the_selected_movie = int(df[df['title']==closest_match]['index'].values[0])
print(index_of_the_selected_movie)

68


In [17]:
# getting the list of similarity scores between the selected movie and every movie in the dataset
similarity_score = list(enumerate(similarity[index_of_the_selected_movie]))
similarity_score

[(0, 0.033570748780675445),
 (1, 0.0546448279236134),
 (2, 0.013735500604224323),
 (3, 0.006468756104392058),
 (4, 0.03268943310073386),
 (5, 0.013907256685755473),
 (6, 0.07692837576335507),
 (7, 0.23944423963486405),
 (8, 0.007882387851851008),
 (9, 0.07599206098164225),
 (10, 0.07536074882460438),
 (11, 0.01192606921174529),
 (12, 0.013707618139948929),
 (13, 0.012376074925089967),
 (14, 0.09657127116284188),
 (15, 0.007286271383816743),
 (16, 0.22704403782296803),
 (17, 0.013112928084103857),
 (18, 0.04140526820609594),
 (19, 0.07883282546834255),
 (20, 0.07981173664799915),
 (21, 0.011266873271064948),
 (22, 0.006892575895462364),
 (23, 0.006599097891242659),
 (24, 0.012665208122549737),
 (25, 0.0),
 (26, 0.21566241096831154),
 (27, 0.030581282093826635),
 (28, 0.061074402219665376),
 (29, 0.014046184258938898),
 (30, 0.0807734659476981),
 (31, 0.31467052449477506),
 (32, 0.02878209913426701),
 (33, 0.13089810941050173),
 (34, 0.0),
 (35, 0.035350090674865595),
 (36, 0.03185325269

In [18]:
# let's sort all the movies in descending order of similarity score
# (the selected movie itself normally ranks first with a score of 1 and is skipped in the next step)
sorted_similar_movies = sorted(similarity_score, key=lambda x:x[1], reverse=True) 

In [19]:
# now we are going to find the names of the top 10 movies that are most similar to the selected movie
n_top_movie = 10
top_similar_movie_index = [movie[0] for movie in sorted_similar_movies[1:n_top_movie+1]]
top_similar_movie_name = []

for index in top_similar_movie_index:
    # read the title value directly instead of parsing the printed Series
    name = df[df['index']==index]['title'].values[0]
    top_similar_movie_name.append(name)

print(top_similar_movie_name)

['Iron Man 2', 'Iron Man 3', 'Avengers: Age of Ultron', 'The Avengers', 'Captain America: Civil War', 'Captain America: The Winter Soldier', 'Ant-Man', 'X-Men', 'Made', 'X-Men: Apocalypse']
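
Slicing with `[1:]` assumes that the selected movie is the first entry after sorting. That usually holds because a movie's similarity with itself is 1, but if another movie with a lower index had identical combined text it would also score 1 and, because Python's sort is stable, would be placed first. A sketch of an alternative that excludes the selected movie by its index instead of by position (the helper name `top_n_similar` and the scores are made up for illustration):

```python
import heapq

def top_n_similar(scores, selected_index, n=10):
    """Return the n highest (index, score) pairs, never including selected_index."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != selected_index)
    # nlargest avoids sorting the full list when only a few items are needed
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# invented scores: index 3 is the selected movie, index 1 is an exact duplicate of it
toy_scores = [0.1, 1.0, 0.5, 1.0, 0.3]
print(top_n_similar(toy_scores, selected_index=3, n=2))
```

With plain sorting and `[1:]`, the duplicate at index 1 would be dropped and the selected movie at index 3 would be recommended to itself.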


### Finalize the Recommender

In [20]:
import difflib
def movies_recommender():
    movie_name = input('Enter your favourite movie: ')
    close_matches_found = difflib.get_close_matches(movie_name, list_of_all_movies)
    # get_close_matches returns an empty list when no title is similar enough
    if not close_matches_found:
        print(f'No movie similar to "{movie_name}" was found in the dataset.')
        return
    closest_match = close_matches_found[0]
    index_of_the_selected_movie = int(df[df['title']==closest_match]['index'].values[0])
    similarity_score = list(enumerate(similarity[index_of_the_selected_movie]))
    sorted_similar_movies = sorted(similarity_score, key=lambda x:x[1], reverse=True) 
    n_top_movie = 20
    top_similar_movie_index = [movie[0] for movie in sorted_similar_movies[1:n_top_movie+1]]
    top_similar_movie_name = []

    for index in top_similar_movie_index:
        # read the title value directly instead of parsing the printed Series
        name = df[df['index']==index]['title'].values[0]
        top_similar_movie_name.append(name)

    print(f'{closest_match} is your favourite movie!!\n')
    print('You should also try:')
    for index, movie in enumerate(top_similar_movie_name):
        print(f'{index+1}. {movie}')

In [21]:
movies_recommender()

Batman is your favourite movie!!

You should also try:
1. Batman Returns
2. Batman & Robin
3. The Dark Knight Rises
4. Batman Begins
5. The Dark Knight
6. A History of Violence
7. Superman
8. Beetlejuice
9. Bedazzled
10. Mars Attacks!
11. The Sentinel
12. Planet of the Apes
13. Man of Steel
14. Suicide Squad
15. The Mask
16. Salton Sea
17. Spider-Man 3
18. The Postman Always Rings Twice
19. Hang 'em High
20. Spider-Man 2
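
Because `movies_recommender()` reads from `input()` and prints its result, it is awkward to reuse or test. One possible variant, sketched here with plain Python lists, takes the movie name, the titles, and the similarity matrix as arguments and returns the result. The function name `recommend`, its parameters, and the toy data are assumptions for illustration, and the sketch assumes that the position of a title in `titles` matches its row in the similarity matrix, as the `index` column does in this notebook.

```python
import difflib

def recommend(movie_name, titles, similarity, n=20):
    """Return (matched_title, recommendations); matched_title is None if nothing matches."""
    matches = difflib.get_close_matches(movie_name, titles, n=1)
    if not matches:
        return None, []
    selected = titles.index(matches[0])
    # rank every other movie by its similarity to the selected one
    ranked = sorted(
        ((i, score) for i, score in enumerate(similarity[selected]) if i != selected),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return matches[0], [titles[i] for i, _ in ranked[:n]]

# invented titles and similarity values, not taken from the dataset
toy_titles = ['Batman', 'Batman Returns', 'Superman', 'Notting Hill']
toy_similarity = [
    [1.0, 0.8, 0.4, 0.0],
    [0.8, 1.0, 0.3, 0.0],
    [0.4, 0.3, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
]
matched, picks = recommend('batman', toy_titles, toy_similarity, n=2)
print(matched, picks)
```

With the notebook's objects, a call along the lines of `recommend(name, list_of_all_movies, similarity)` should work the same way, since indexing a NumPy matrix row behaves like indexing the nested list used here.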
