# MOVIE RECOMMENDATION SYSTEM

In this project, I've built a model that suggests movies based on a movie you've watched or, say, your favourite movie. It relies on cosine similarity, a measure that quantifies how similar two vectors in a multi-dimensional space are by taking the cosine of the angle between them. It is widely used in natural language processing, information retrieval, and recommendation systems to compare documents, words, or other entities represented as vectors. I've also used the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer, a feature extraction technique that converts textual data into a numerical representation.
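
To make the idea concrete, here is a minimal sketch of cosine similarity written from scratch. The helper name `cosine_sim` and the toy vectors are my own illustrations and are not part of the notebook, which uses scikit-learn's `cosine_similarity` instead.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score close to 1, unrelated ones score 0.
same_direction = cosine_sim([1, 2, 0], [2, 4, 0])
orthogonal = cosine_sim([1, 0, 0], [0, 1, 0])
partial = cosine_sim([1, 1, 0], [1, 0, 0])
print(same_direction, orthogonal, partial)
```

Because only the angle matters, a long vector and a short vector pointing in the same direction are treated as equally similar, which is useful when movie descriptions differ in length.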


In [1]:
#importing the libraries

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import difflib
from sklearn.metrics.pairwise import cosine_similarity

The difflib library finds the closest matches to a given string. We'll use it to match the movie title entered by the user against the titles in our dataset.
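
As a small illustration of how `difflib.get_close_matches` behaves, the sketch below runs it on a hand-picked list of titles. The misspelled query and the short title list are assumptions made for the example only.

```python
import difflib

titles = ['Avatar', 'The Avengers', 'Avengers: Age of Ultron', 'Spectre']

# n caps how many candidates come back; cutoff (0 to 1) is the minimum
# similarity ratio a candidate needs. The comparison is case-sensitive.
matches = difflib.get_close_matches('Avengrs', titles, n=3, cutoff=0.6)

# When nothing clears the cutoff, the result is an empty list.
no_match = difflib.get_close_matches('zzzz', titles)

print(matches)
print(no_match)
```

The empty-list case matters later on, because indexing the result with `[0]` raises an `IndexError` when there is no match.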

In [2]:
data = pd.read_csv('/content/movies.csv')

For this project, I used a movies dataset that I uploaded to the session storage.

In [3]:
data

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4798,4798,220000,Action Crime Thriller,,9367,united states\u2013mexico barrier legs arms pa...,es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,...,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",El Mariachi,6.6,238,Carlos Gallardo Jaime de Hoyos Peter Marquardt...,"[{'name': 'Robert Rodriguez', 'gender': 0, 'de...",Robert Rodriguez
4799,4799,9000,Comedy Romance,,72766,,en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,...,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,5,Edward Burns Kerry Bish\u00e9 Marsha Dietlein ...,"[{'name': 'Edward Burns', 'gender': 2, 'depart...",Edward Burns
4800,4800,0,Comedy Drama Romance TV Movie,http://www.hallmarkchannel.com/signedsealeddel...,231617,date love at first sight narration investigati...,en,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...",1.444476,...,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,"Signed, Sealed, Delivered",7.0,6,Eric Mabius Kristin Booth Crystal Lowe Geoff G...,"[{'name': 'Carla Hetland', 'gender': 0, 'depar...",Scott Smith
4801,4801,0,,http://shanghaicalling.com/,126186,,en,Shanghai Calling,When ambitious New York attorney Sam is sent t...,0.857008,...,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Yorker in Shanghai,Shanghai Calling,5.7,7,Daniel Henney Eliza Coupe Bill Paxton Alan Ruc...,"[{'name': 'Daniel Hsia', 'gender': 2, 'departm...",Daniel Hsia


In [4]:
data.shape

(4803, 24)

The dataframe contains 4803 movies in total.

We need to choose the features that are useful for recommending movies similar to the one the user picks.

In [5]:
features = ['genres','keywords','tagline','cast','director']

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   index                 4803 non-null   int64  
 1   budget                4803 non-null   int64  
 2   genres                4775 non-null   object 
 3   homepage              1712 non-null   object 
 4   id                    4803 non-null   int64  
 5   keywords              4391 non-null   object 
 6   original_language     4803 non-null   object 
 7   original_title        4803 non-null   object 
 8   overview              4800 non-null   object 
 9   popularity            4803 non-null   float64
 10  production_companies  4803 non-null   object 
 11  production_countries  4803 non-null   object 
 12  release_date          4802 non-null   object 
 13  revenue               4803 non-null   int64  
 14  runtime               4801 non-null   float64
 15  spoken_languages     

Here, we can see that some columns contain null entries. We will replace the nulls in the selected features with empty strings so that the text columns can be concatenated.

In [7]:
#Replacing null values with empty string
for f in features:
  data[f] = data[f].fillna('')
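
The following sketch shows why this step is needed, using a tiny made-up DataFrame (`toy` and `cols` are hypothetical names, not the notebook's data). Concatenating a string column with a missing value yields a missing value, so rows with gaps would otherwise lose all of their text.

```python
import pandas as pd

toy = pd.DataFrame({
    'genres': ['Action', None, 'Comedy'],
    'tagline': [None, None, 'A New Yorker in Shanghai'],
})
cols = ['genres', 'tagline']

# Count the missing entries per column before cleaning.
null_counts_before = toy[cols].isnull().sum()

# Concatenation propagates missing values to the combined text.
combined_before = toy['genres'] + ' ' + toy['tagline']

# After filling with empty strings, every row produces usable text.
toy[cols] = toy[cols].fillna('')
null_counts_after = toy[cols].isnull().sum()
combined_after = toy['genres'] + ' ' + toy['tagline']

print(null_counts_before)
print(combined_after.tolist())
```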

Now, we will combine all the features that we chose for recommending the movies.

In [8]:
#Combining all the selected features
combined_features = data['genres']+' '+data['keywords']+' '+data['tagline']+' '+data['cast']+' '+data['director']

We need to convert the text data into numerical data, because similarity measures such as cosine similarity operate on numeric vectors rather than raw text. We'll use feature extraction for that purpose.
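
Here is a minimal sketch of what the vectorizer produces, run on three made-up descriptions rather than the real dataset. The names `docs`, `toy_vectorizer`, and `toy_vectors` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'action adventure space',
    'action crime thriller',
    'romance comedy',
]

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per document and one column per distinct term.
toy_vectors = toy_vectorizer.fit_transform(docs)

vocab = sorted(toy_vectorizer.vocabulary_)  # vocabulary_ maps term -> column
row0 = toy_vectors[0].toarray()[0]

print(toy_vectors.shape)
print(vocab)
print(row0)
```

In the first row, "action" gets a lower weight than "space" because it also appears in the second document. Terms shared by many movies therefore contribute less to the similarity than rare, distinctive ones.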

In [9]:
#converting the textual data to numeric data
vectorizer=TfidfVectorizer()

In [10]:
feature_vectors = vectorizer.fit_transform(combined_features)

In [11]:
print(feature_vectors)

  (0, 2432)	0.17272411194153
  (0, 7755)	0.1128035714854756
  (0, 13024)	0.1942362060108871
  (0, 10229)	0.16058685400095302
  (0, 8756)	0.22709015857011816
  (0, 14608)	0.15150672398763912
  (0, 16668)	0.19843263965100372
  (0, 14064)	0.20596090415084142
  (0, 13319)	0.2177470539412484
  (0, 17290)	0.20197912553916567
  (0, 17007)	0.23643326319898797
  (0, 13349)	0.15021264094167086
  (0, 11503)	0.27211310056983656
  (0, 11192)	0.09049319826481456
  (0, 16998)	0.1282126322850579
  (0, 15261)	0.07095833561276566
  (0, 4945)	0.24025852494110758
  (0, 14271)	0.21392179219912877
  (0, 3225)	0.24960162956997736
  (0, 16587)	0.12549432354918996
  (0, 14378)	0.33962752210959823
  (0, 5836)	0.1646750903586285
  (0, 3065)	0.22208377802661425
  (0, 3678)	0.21392179219912877
  (0, 5437)	0.1036413987316636
  :	:
  (4801, 17266)	0.2886098184932947
  (4801, 4835)	0.24713765026963996
  (4801, 403)	0.17727585190343226
  (4801, 6935)	0.2886098184932947
  (4801, 11663)	0.21557500762727902
  (4801, 1672

Now we will use cosine similarity to measure how similar each movie is to every other movie, based on the features we gave to our model.

In [12]:
#Getting the similarity score using the cosine similarity
similarity= cosine_similarity(feature_vectors)

In [13]:
print(similarity)

[[1.         0.07219487 0.037733   ... 0.         0.         0.        ]
 [0.07219487 1.         0.03281499 ... 0.03575545 0.         0.        ]
 [0.037733   0.03281499 1.         ... 0.         0.05389661 0.        ]
 ...
 [0.         0.03575545 0.         ... 1.         0.         0.02651502]
 [0.         0.         0.05389661 ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.02651502 0.         1.        ]]


In [14]:
similarity.shape

(4803, 4803)

Here, the shape shows that a similarity score is computed between every pair of movies in the dataset.
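
The structure of this matrix is easier to see on a small example. The sketch below uses three made-up descriptions (an assumption for illustration) and checks the properties that also hold for the full 4803 x 4803 matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['action adventure space', 'action crime thriller', 'romance comedy']
toy_similarity = cosine_similarity(TfidfVectorizer().fit_transform(docs))

# One row and one column per document.
print(toy_similarity.shape)
# Entry [i, j] is the similarity between document i and document j.
print(np.round(toy_similarity, 3))
```

The matrix is symmetric, its diagonal is 1 (every item is identical to itself), and documents that share no terms score 0. This is why the movie itself always appears first when a row is sorted in descending order.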

Now that we have the similarity scores, we can proceed to take a movie title from the user.

In [24]:
#Getting a movie name from the user
movie_name=input("Enter the movie: ")

Enter the movie: the wolf of wall street


In [25]:
#Getting the title of all the movies from the dataset
all_movies=data['title'].tolist()
all_movies

['Avatar',
 "Pirates of the Caribbean: At World's End",
 'Spectre',
 'The Dark Knight Rises',
 'John Carter',
 'Spider-Man 3',
 'Tangled',
 'Avengers: Age of Ultron',
 'Harry Potter and the Half-Blood Prince',
 'Batman v Superman: Dawn of Justice',
 'Superman Returns',
 'Quantum of Solace',
 "Pirates of the Caribbean: Dead Man's Chest",
 'The Lone Ranger',
 'Man of Steel',
 'The Chronicles of Narnia: Prince Caspian',
 'The Avengers',
 'Pirates of the Caribbean: On Stranger Tides',
 'Men in Black 3',
 'The Hobbit: The Battle of the Five Armies',
 'The Amazing Spider-Man',
 'Robin Hood',
 'The Hobbit: The Desolation of Smaug',
 'The Golden Compass',
 'King Kong',
 'Titanic',
 'Captain America: Civil War',
 'Battleship',
 'Jurassic World',
 'Skyfall',
 'Spider-Man 2',
 'Iron Man 3',
 'Alice in Wonderland',
 'X-Men: The Last Stand',
 'Monsters University',
 'Transformers: Revenge of the Fallen',
 'Transformers: Age of Extinction',
 'Oz: The Great and Powerful',
 'The Amazing Spider-Man 2',

In [26]:
#Finding the closest matching title for the movie given by the user
CloseMatch= difflib.get_close_matches(movie_name, all_movies)
CloseMatch

['The Wolf of Wall Street']

In [27]:
CloseMatch=CloseMatch[0]
CloseMatch

'The Wolf of Wall Street'
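
Taking element `[0]` assumes that at least one match was found. A possible guard is sketched below. The helper `best_title_match` and the short title list are hypothetical and not part of the original notebook.

```python
import difflib

def best_title_match(query, titles, cutoff=0.6):
    """Return the closest title, or None when nothing clears the cutoff."""
    matches = difflib.get_close_matches(query, titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None

titles = ['The Wolf of Wall Street', 'Wall Street', 'The Departed']

found = best_title_match('the wolf of wall street', titles)
missing = best_title_match('qqqq', titles)

print(found)
print(missing)
```

With a guard like this, the caller can ask the user to retype the title instead of failing with an `IndexError`.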

In [28]:
#Finding the index of the movie entered by the user from its title
movieIndex=data[data['title']==CloseMatch]['index'].values[0]
movieIndex

298

In [29]:
#Getting a list of similar movies
similarity_score=list(enumerate(similarity[movieIndex]))
similarity_score

[(0, 0.0),
 (1, 0.0),
 (2, 0.009471448671593934),
 (3, 0.024385930753888903),
 (4, 0.0),
 (5, 0.0),
 (6, 0.0),
 (7, 0.0),
 (8, 0.0),
 (9, 0.0),
 (10, 0.0),
 (11, 0.00822373757227185),
 (12, 0.0),
 (13, 0.0),
 (14, 0.0),
 (15, 0.0),
 (16, 0.0),
 (17, 0.02789551876910133),
 (18, 0.00432032082243108),
 (19, 0.04726608458106889),
 (20, 0.0),
 (21, 0.0),
 (22, 0.01974266016964517),
 (23, 0.0),
 (24, 0.00313712160979796),
 (25, 0.09014095125690445),
 (26, 0.0),
 (27, 0.0),
 (28, 0.0),
 (29, 0.0),
 (30, 0.0),
 (31, 0.0),
 (32, 0.0),
 (33, 0.0),
 (34, 0.0),
 (35, 0.0),
 (36, 0.0),
 (37, 0.0),
 (38, 0.0),
 (39, 0.0),
 (40, 0.009582303198668266),
 (41, 0.02105748189994935),
 (42, 0.004518391966898306),
 (43, 0.0),
 (44, 0.0),
 (45, 0.003016567825067032),
 (46, 0.0),
 (47, 0.0),
 (48, 0.0),
 (49, 0.07491888318865163),
 (50, 0.0),
 (51, 0.0),
 (52, 0.0),
 (53, 0.0),
 (54, 0.0),
 (55, 0.00475714316076741),
 (56, 0.0),
 (57, 0.005028910404597939),
 (58, 0.015601395452918342),
 (59, 0.0),
 (60, 0.003

In [31]:
#Sorting the similarity scores in the list of similar movies in descending order
similar_movies=sorted(similarity_score,key=lambda x:x[1],reverse=True)
similar_movies

[(298, 1.0000000000000004),
 (351, 0.19688411891594998),
 (971, 0.17698475654076765),
 (439, 0.17263403763584168),
 (316, 0.17033081399610314),
 (250, 0.14457842328925033),
 (652, 0.12366504726040514),
 (908, 0.12329487328560951),
 (72, 0.12106969682793685),
 (1119, 0.12078152940214512),
 (1682, 0.11951514616356444),
 (4766, 0.11629269837033196),
 (883, 0.11572083995099185),
 (62, 0.1143811033854019),
 (1500, 0.11366599536731903),
 (1646, 0.11198228391977733),
 (1409, 0.10446259941385336),
 (2943, 0.10356363000394236),
 (4204, 0.10185060561840945),
 (911, 0.10177864013159156),
 (1478, 0.09935447458784039),
 (2045, 0.09915551824045445),
 (2854, 0.09832857995224419),
 (681, 0.09683447152601826),
 (96, 0.09498955530478634),
 (2231, 0.09412812673846961),
 (928, 0.09285856779991365),
 (4772, 0.09258553678997594),
 (3376, 0.09175219127328108),
 (1879, 0.09173507913587663),
 (1221, 0.09168853263798117),
 (2392, 0.09111697639438714),
 (843, 0.09096723005982671),
 (297, 0.09079958121824755),
 (

In [32]:
#Suggesting the movies to the user
print('Suggested movies: ')
i = 1
for movie in similar_movies:
  index = movie[0]
  title = data.loc[index, 'title'] #Look the title up once by row label
  if title == CloseMatch: #This is the movie entered by the user itself, so we skip it
    continue
  if i > 30: #We suggest 30 movies, so stop instead of scanning the rest of the list
    break
  print(i, '. ', title)
  i += 1

Suggested movies: 
1 .  The Departed
2 .  The Story of Us
3 .  Shutter Island
4 .  Gangs of New York
5 .  The Aviator
6 .  Focus
7 .  Super 8
8 .  Suicide Squad
9 .  21 Jump Street
10 .  Sausage Party
11 .  The Last Waltz
12 .  Catch Me If You Can
13 .  The Legend of Tarzan
14 .  This Is the End
15 .  Alex & Emma
16 .  J. Edgar
17 .  Kinsey
18 .  Foolish
19 .  22 Jump Street
20 .  A Few Good Men
21 .  I Heart Huckabees
22 .  Def Jam's How to Be a Player
23 .  The American President
24 .  Inception
25 .  Strange Wilderness
26 .  Moneyball
27 .  Down Terrace
28 .  Cyrus
29 .  The Sitter
30 .  The Doors


# Recommendation System

In [35]:
movie_name=input("Enter the movie: ")

all_movies=data['title'].tolist()

CloseMatch= difflib.get_close_matches(movie_name, all_movies)

CloseMatch = CloseMatch[0]

movieIndex=data[data['title']==CloseMatch]['index'].values[0]

similarity_score=list(enumerate(similarity[movieIndex]))

similar_movies=sorted(similarity_score,key=lambda x:x[1],reverse=True)

print('Suggested movies: ')
i = 1
for movie in similar_movies:
  index = movie[0]
  title = data.loc[index, 'title'] #Look the title up once by row label
  if title == CloseMatch: #Skip the movie the user entered
    continue
  if i > 30: #Stop after 30 suggestions
    break
  print(i, '. ', title)
  i += 1

Enter the movie: avengers
Suggested movies: 
1 .  Avengers: Age of Ultron
2 .  Captain America: The Winter Soldier
3 .  Captain America: Civil War
4 .  Iron Man 2
5 .  Thor: The Dark World
6 .  X-Men
7 .  The Incredible Hulk
8 .  X-Men: Apocalypse
9 .  Ant-Man
10 .  Thor
11 .  X2
12 .  X-Men: The Last Stand
13 .  Deadpool
14 .  X-Men: Days of Future Past
15 .  Captain America: The First Avenger
16 .  The Amazing Spider-Man 2
17 .  The Image Revolution
18 .  Iron Man
19 .  Iron Man 3
20 .  Man of Steel
21 .  The Spirit
22 .  Superman II
23 .  X-Men: First Class
24 .  Guardians of the Galaxy
25 .  Batman v Superman: Dawn of Justice
26 .  Serenity
27 .  Spawn
28 .  Teenage Mutant Ninja Turtles: Out of the Shadows
29 .  The Helix... Loaded
30 .  What's Your Number?
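
The ranking step can also be written as a small reusable function. The sketch below is one possible variant and not the notebook's code: `top_similar`, the four placeholder titles, and the hand-written similarity matrix are all assumptions for illustration. It uses `numpy.argsort` instead of sorting a list of tuples, and it excludes the query movie by index rather than by title.

```python
import numpy as np

def top_similar(similarity, titles, movie_index, k=3):
    """Return the k titles most similar to titles[movie_index], best first."""
    scores = similarity[movie_index]
    # Negating the scores makes argsort return indices in descending order.
    order = np.argsort(-scores, kind='stable')
    picks = [i for i in order if i != movie_index][:k]
    return [titles[i] for i in picks]

titles = ['A', 'B', 'C', 'D']
similarity = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
])

result = top_similar(similarity, titles, movie_index=0, k=2)
print(result)
```

For movie `A`, the highest remaining scores belong to `C` (0.7) and `B` (0.2), so those two are returned in that order.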


## CONCLUSION


In this project, I used cosine similarity and the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer, a feature extraction technique, to build a movie recommendation system that suggests 30 movies based on the movie entered by the user.

## References and Future Work

Below are links to the resources I found useful while working on this project, where you can learn more about the tools and libraries used.



* Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
* Cosine Similarity user guide: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
* Feature Extraction user guide: https://scikit-learn.org/stable/modules/feature_extraction.html
* Difflib user guide: https://docs.python.org/3/library/difflib.html