# **Title of Project**

**MOVIE RECOMMENDATION SYSTEM**


# **Objective**

The movie recommendation system is designed to provide personalized movie suggestions based on a user's input of their favorite movie. The system utilizes cosine similarity to measure the similarity between movies based on their features or attributes.

Here are the key steps involved:

1) User Input: The user is prompted to enter the name of their favorite movie.
2) Closest Match: The system finds the closest match to the user's input by comparing it with a list of movie titles.
3) Movie Selection: The system identifies the movie with the closest match and retrieves its corresponding index or ID.
4) Similarity Calculation: The system calculates the similarity scores between the selected movie and all other movies using cosine similarity. The similarity score indicates how closely related each movie is to the selected movie.
5) Ranking Recommendations: The system ranks the movies based on their similarity scores, with higher scores indicating a stronger similarity.
6) Top Recommendations: The system presents the top 10 movies with the highest similarity scores as recommendations to the user.

By leveraging cosine similarity, the system can suggest movies that are similar to the user's favorite movie, based on shared characteristics or features. This allows users to discover new movies that align with their preferences and interests.



# **Data Source**

https://github.com/YBI-Foundation/Dataset/blob/main/Movies%20Recommendation.csv

# **Import Library**

In [1]:
import pandas as pd
import numpy as np

# **Import Data**

In [2]:
df=pd.read_csv("https://github.com/YBI-Foundation/Dataset/raw/main/Movies%20Recommendation.csv")

In [3]:
df.head()

Unnamed: 0,Movie_ID,Movie_Title,Movie_Genre,Movie_Language,Movie_Budget,Movie_Popularity,Movie_Release_Date,Movie_Revenue,Movie_Runtime,Movie_Vote,...,Movie_Homepage,Movie_Keywords,Movie_Overview,Movie_Production_House,Movie_Production_Country,Movie_Spoken_Language,Movie_Tagline,Movie_Cast,Movie_Crew,Movie_Director
0,1,Four Rooms,Crime Comedy,en,4000000,22.87623,09-12-1995,4300000,98.0,6.5,...,,hotel new year's eve witch bet hotel room,It's Ted the Bellhop's first night on the job....,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]",Twelve outrageous guests. Four scandalous requ...,Tim Roth Antonio Banderas Jennifer Beals Madon...,"[{'name': 'Allison Anders', 'gender': 1, 'depa...",Allison Anders
1,2,Star Wars,Adventure Action Science Fiction,en,11000000,126.393695,25-05-1977,775398007,121.0,8.1,...,http://www.starwars.com/films/star-wars-episod...,android galaxy hermit death star lightsaber,Princess Leia is captured and held hostage by ...,"[{""name"": ""Lucasfilm"", ""id"": 1}, {""name"": ""Twe...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]","A long time ago in a galaxy far, far away...",Mark Hamill Harrison Ford Carrie Fisher Peter ...,"[{'name': 'George Lucas', 'gender': 2, 'depart...",George Lucas
2,3,Finding Nemo,Animation Family,en,94000000,85.688789,30-05-2003,940335536,100.0,7.6,...,http://movies.disney.com/finding-nemo,father son relationship harbor underwater fish...,"Nemo, an adventurous young clownfish, is unexp...","[{""name"": ""Pixar Animation Studios"", ""id"": 3}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]","There are 3.7 trillion fish in the ocean, they...",Albert Brooks Ellen DeGeneres Alexander Gould ...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton
3,4,Forrest Gump,Comedy Drama Romance,en,55000000,138.133331,06-07-1994,677945399,142.0,8.2,...,,vietnam veteran hippie mentally disabled runni...,A man with a low IQ has accomplished great thi...,"[{""name"": ""Paramount Pictures"", ""id"": 4}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]","The world will never be the same, once you've ...",Tom Hanks Robin Wright Gary Sinise Mykelti Wil...,"[{'name': 'Alan Silvestri', 'gender': 2, 'depa...",Robert Zemeckis
4,5,American Beauty,Drama,en,15000000,80.878605,15-09-1999,356296601,122.0,7.9,...,http://www.dreamworks.com/ab/,male nudity female nudity adultery midlife cri...,"Lester Burnham, a depressed suburban father in...","[{""name"": ""DreamWorks SKG"", ""id"": 27}, {""name""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]",Look closer.,Kevin Spacey Annette Bening Thora Birch Wes Be...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4760 entries, 0 to 4759
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Movie_ID                  4760 non-null   int64  
 1   Movie_Title               4760 non-null   object 
 2   Movie_Genre               4760 non-null   object 
 3   Movie_Language            4760 non-null   object 
 4   Movie_Budget              4760 non-null   int64  
 5   Movie_Popularity          4760 non-null   float64
 6   Movie_Release_Date        4760 non-null   object 
 7   Movie_Revenue             4760 non-null   int64  
 8   Movie_Runtime             4758 non-null   float64
 9   Movie_Vote                4760 non-null   float64
 10  Movie_Vote_Count          4760 non-null   int64  
 11  Movie_Homepage            1699 non-null   object 
 12  Movie_Keywords            4373 non-null   object 
 13  Movie_Overview            4757 non-null   object 
 14  Movie_Pr

In [5]:
df.shape

(4760, 21)

In [6]:
df.columns

Index(['Movie_ID', 'Movie_Title', 'Movie_Genre', 'Movie_Language',
       'Movie_Budget', 'Movie_Popularity', 'Movie_Release_Date',
       'Movie_Revenue', 'Movie_Runtime', 'Movie_Vote', 'Movie_Vote_Count',
       'Movie_Homepage', 'Movie_Keywords', 'Movie_Overview',
       'Movie_Production_House', 'Movie_Production_Country',
       'Movie_Spoken_Language', 'Movie_Tagline', 'Movie_Cast', 'Movie_Crew',
       'Movie_Director'],
      dtype='object')

# **Get Feature Selection**

In [7]:
df_features=df[[ 'Movie_Genre','Movie_Keywords','Movie_Tagline', 'Movie_Cast','Movie_Director']].fillna('')

In [8]:
df_features.shape

(4760, 5)

In [9]:
df_features

Unnamed: 0,Movie_Genre,Movie_Keywords,Movie_Tagline,Movie_Cast,Movie_Director
0,Crime Comedy,hotel new year's eve witch bet hotel room,Twelve outrageous guests. Four scandalous requ...,Tim Roth Antonio Banderas Jennifer Beals Madon...,Allison Anders
1,Adventure Action Science Fiction,android galaxy hermit death star lightsaber,"A long time ago in a galaxy far, far away...",Mark Hamill Harrison Ford Carrie Fisher Peter ...,George Lucas
2,Animation Family,father son relationship harbor underwater fish...,"There are 3.7 trillion fish in the ocean, they...",Albert Brooks Ellen DeGeneres Alexander Gould ...,Andrew Stanton
3,Comedy Drama Romance,vietnam veteran hippie mentally disabled runni...,"The world will never be the same, once you've ...",Tom Hanks Robin Wright Gary Sinise Mykelti Wil...,Robert Zemeckis
4,Drama,male nudity female nudity adultery midlife cri...,Look closer.,Kevin Spacey Annette Bening Thora Birch Wes Be...,Sam Mendes
...,...,...,...,...,...
4755,Horror,,The hot spot where Satan's waitin'.,Lisa Hart Carroll Michael Des Barres Paul Drak...,Pece Dingo
4756,Comedy Family Drama,,It’s better to stand out than to fit in.,Roni Akurati Brighton Sharbino Jason Lee Anjul...,Frank Lotito
4757,Thriller Drama,christian film sex trafficking,She never knew it could happen to her...,Nicole Smolen Kim Baldwin Ariana Stephens Brys...,Jaco Booyens
4758,Family,,,,


In [14]:
X=df_features['Movie_Genre'] +' '+ df_features['Movie_Keywords'] +' '+ df_features['Movie_Tagline'] +' '+ df_features['Movie_Cast'] +' '+ df_features['Movie_Director']

In [15]:
X

0       Crime Comedy hotel new year's eve witch bet ho...
1       Adventure Action Science Fiction android galax...
2       Animation Family father son relationship harbo...
3       Comedy Drama Romance vietnam veteran hippie me...
4       Drama male nudity female nudity adultery midli...
                              ...                        
4755    Horror  The hot spot where Satan's waitin'. Li...
4756    Comedy Family Drama  It’s better to stand out ...
4757    Thriller Drama christian film sex trafficking ...
4758                                           Family    
4759    Documentary music actors legendary perfomer cl...
Length: 4760, dtype: object

In [16]:
X.shape

(4760,)

# **Get Feature Text Conversion to Tokens**

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

This line imports the TfidfVectorizer class from the sklearn.feature_extraction.text module of the scikit-learn library. The class is used to convert the text data in the variable X into numerical features. TF-IDF stands for Term Frequency-Inverse Document Frequency, a commonly used technique for representing text numerically: a term's weight grows with how often it appears in a document and shrinks with how many documents contain it.

In [18]:
tfidf=TfidfVectorizer()

This line creates an instance of the TfidfVectorizer class and stores it in the tfidf variable, which is then used to transform the text data.

In [19]:
X=tfidf.fit_transform(X)

This line applies the fit_transform method of the TfidfVectorizer class to convert the text data in X into numerical features. The fit_transform method fits the vectorizer to the data and then transforms it. It learns the vocabulary from the input data (X) and computes the TF-IDF values for each term in the vocabulary. The resulting transformed data is assigned back to the X variable.
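To make the effect of fit_transform concrete, here is a minimal sketch on a three-document corpus. The corpus and the names toy_docs, toy_tfidf and toy_matrix are hypothetical and stand in for the combined movie text; they are not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus standing in for the combined movie text in X
toy_docs = [
    "crime comedy hotel",
    "adventure action galaxy",
    "comedy drama romance",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)  # learn vocabulary, then weight terms

# One row per document, one column per distinct term
n_docs, n_terms = toy_matrix.shape
vocabulary = toy_tfidf.vocabulary_  # maps each term to its column index
print(n_docs, n_terms)
print(sorted(vocabulary))
```

The same relationship holds for the real data: 4760 rows for the movies and one column for each distinct term found in the combined text.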

In [20]:
X.shape

(4760, 17258)

In [21]:
print(X)

  (0, 617)	0.1633382144407513
  (0, 492)	0.1432591540388685
  (0, 15413)	0.1465525095337543
  (0, 9675)	0.14226057295252661
  (0, 9465)	0.1659841367820977
  (0, 1390)	0.16898383612799558
  (0, 7825)	0.09799561597509843
  (0, 1214)	0.13865857545144072
  (0, 729)	0.13415063359531618
  (0, 13093)	0.1432591540388685
  (0, 15355)	0.10477815972666779
  (0, 9048)	0.0866842116160778
  (0, 11161)	0.06250380151644369
  (0, 16773)	0.17654247479915475
  (0, 5612)	0.08603537588547631
  (0, 16735)	0.10690083751525419
  (0, 7904)	0.13348000542112332
  (0, 15219)	0.09800472886453934
  (0, 11242)	0.07277788238484746
  (0, 3878)	0.11998399582562203
  (0, 5499)	0.11454057510303811
  (0, 7071)	0.19822417598406614
  (0, 7454)	0.14745635785412262
  (0, 1495)	0.19712637387361423
  (0, 9206)	0.15186283580984414
  :	:
  (4757, 5455)	0.12491480594769522
  (4757, 2967)	0.16273475835631626
  (4757, 8464)	0.23522565554066333
  (4757, 6938)	0.17088173678136628
  (4757, 8379)	0.17480603856721913
  (4757, 15303)	0.07

The printed output shows X after TF-IDF vectorization. X is a sparse matrix, and the output lists only its nonzero elements.

Each line in the printed output represents one nonzero element. Take the first line as an example:

(0, 617)	0.1633382144407513

(0, 617) gives the row and column indices of the element: row 0, column 617. 0.1633382144407513 is the TF-IDF value stored at that position.

The row index corresponds to a document (in this case, a movie) in the original dataset, and the column index corresponds to a unique term (word) in the vocabulary learned during vectorization.

The TF-IDF values represent the importance of each term within each document. A higher value indicates that the term is more significant within that document.

Most elements of X are zero, which is why the sparse matrix format is used: it stores and manipulates such matrices efficiently by keeping only the nonzero entries.

The output shows column indices rather than the terms themselves, so the vocabulary learned by the vectorizer is needed to tell which word a given TF-IDF value belongs to.


# **Get Similarity Score using Cosine Similarity**

The cosine_similarity function from the sklearn.metrics.pairwise module is used to calculate the cosine similarity between the rows of matrix X.

Cosine similarity measures the similarity between two vectors, in this case rows of X, as the cosine of the angle between them. In general it ranges from -1 to 1: a value of 1 indicates that the vectors point in the same direction, 0 indicates that they are orthogonal (no similarity), and -1 indicates that they are diametrically opposed. TF-IDF values are never negative, so the scores computed here fall between 0 and 1.
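The definition can be checked by hand with NumPy. This is a small sketch of the standard formula (dot product divided by the product of the vector lengths); the helper name cosine and the sample vectors are made up for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])

same_direction = cosine(a, 3 * a)                   # parallel vectors
orthogonal = cosine(a, np.array([0.0, 0.0, 5.0]))   # no shared component
opposite = cosine(a, -a)                            # pointing the other way

print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change its score, which is why a long and a short description with the same mix of terms can still count as similar.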

By applying cosine_similarity(X), you are calculating the pairwise cosine similarity between all rows of matrix X. The resulting Similarity_Score matrix will have dimensions (n, n), where n is the number of rows in X. Each element Similarity_Score[i, j] represents the cosine similarity between the i-th and j-th rows of X.

This matrix can be useful for tasks such as finding similar items, clustering, or recommendation systems, where you want to measure the similarity between different samples or documents based on their feature vectors.

In [48]:
from sklearn.metrics.pairwise import cosine_similarity

In [47]:
Similarity_Score=cosine_similarity(X)

Similarity_Score is an array holding the cosine similarity between every pair of documents (movies) in the dataset.

The array has a shape of (4760, 4760): one row and one column for each of the 4760 documents. The value at (i, j) is the similarity score between document i and document j.

For example:

Similarity_Score[0, 0] is the similarity of document 0 with itself, which is 1.0 (maximum similarity, as it is the same document).

Similarity_Score[0, 1] is the similarity between document 0 and document 1, which is 0.01351235.

Similarity_Score[2, 617] is the similarity between document 2 and document 617.

The values in the array range from 0 to 1, where 1 indicates maximum similarity (same document) and 0 indicates no similarity (no terms in common).

The Similarity_Score array provides a pairwise measure of similarity between documents based on the cosine similarity metric. It can be useful for tasks such as document clustering, recommendation systems, or identifying similar documents in a large corpus.
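The properties described above (ones on the diagonal, symmetry, values between 0 and 1) can be verified on a small example. The three documents below are hypothetical, and toy_scores plays the role of Similarity_Score.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents: the first two share terms, the third shares none
toy_docs = ["war drama soldier", "war drama general", "family animation fish"]
toy_scores = cosine_similarity(TfidfVectorizer().fit_transform(toy_docs))

is_symmetric = np.allclose(toy_scores, toy_scores.T)      # score(i, j) == score(j, i)
diagonal_is_one = np.allclose(np.diag(toy_scores), 1.0)   # each document vs itself
in_unit_range = bool((toy_scores >= 0).all() and (toy_scores <= 1 + 1e-9).all())

print(toy_scores.round(3))
print(is_symmetric, diagonal_is_one, in_unit_range)
```

The small tolerance in the range check allows for floating-point rounding, which can push a self-similarity slightly above 1.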

In [49]:
Similarity_Score

array([[1.        , 0.01351235, 0.03570468, ..., 0.        , 0.        ,
        0.        ],
       [0.01351235, 1.        , 0.00806674, ..., 0.        , 0.        ,
        0.        ],
       [0.03570468, 0.00806674, 1.        , ..., 0.        , 0.08014876,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.08014876, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [50]:
Similarity_Score.shape

(4760, 4760)

# **Get Movie Name as Input from User and Validate for Closest Spelling**

In [62]:
Favourite_Movie_Name=input('Enter your favourite movie name:')

Enter your favourite movie name:gattace


In [63]:
All_Movies_Title_List=df['Movie_Title'].tolist()

In [64]:
import difflib

In [65]:
Movie_Recommendation=difflib.get_close_matches(Favourite_Movie_Name,All_Movies_Title_List)
print(Movie_Recommendation)

['Gattaca', 'Rat Race']


In [67]:
Close_Match=Movie_Recommendation[0]
print(Close_Match)

Gattaca
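Taking the first element of the match list raises an IndexError when difflib finds nothing close enough. A hedged sketch of a safer lookup follows; closest_title is a hypothetical helper, the lowercase comparison is an added assumption rather than something the notebook does, and 0.6 is difflib's documented default cutoff.

```python
import difflib

def closest_title(query, titles, cutoff=0.6):
    """Return the best-matching title, or None when nothing is close enough."""
    # Compare in lowercase so capitalisation does not lower the match ratio
    lowered = {t.lower(): t for t in titles}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None

titles = ["Gattaca", "Rat Race", "Avatar"]
print(closest_title("gattace", titles))
print(closest_title("zzzzzz", titles))
```

Returning None lets the caller ask the user to try again instead of failing on an empty list.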


In [68]:
Index_of_Close_Match_Movie=df[df.Movie_Title==Close_Match]['Movie_ID'].values[0]
print(Index_of_Close_Match_Movie)

348
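The rows of Similarity_Score follow the row positions of df, while the cell above reads the Movie_ID column. In the df.head() output the first row (position 0) has Movie_ID 1, so the two may not coincide. The sketch below shows the difference on a hypothetical three-row frame; toy_df and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical frame mirroring the layout seen in df.head(): Movie_ID starts at 1
toy_df = pd.DataFrame({
    "Movie_ID": [1, 2, 3],
    "Movie_Title": ["Four Rooms", "Star Wars", "Finding Nemo"],
})

close_match = "Star Wars"
mask = toy_df.Movie_Title == close_match

movie_id = toy_df[mask]["Movie_ID"].values[0]  # value stored in the Movie_ID column
row_position = toy_df.index[mask][0]           # position of the row in the frame

print(movie_id, row_position)
# Using the Movie_ID value as a row position selects a different film
print(toy_df["Movie_Title"].iloc[movie_id])
```

If the two numbers differ in the full dataset, indexing Similarity_Score by row position is the lookup that matches the selected title.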


In [69]:
Recommendation_Score=list(enumerate(Similarity_Score[Index_of_Close_Match_Movie]))
print(Recommendation_Score)

[(0, 0.03367776818551606), (1, 0.0), (2, 0.003949261104395081), (3, 0.04122590225297962), (4, 0.003050537600736829), (5, 0.0028997880939131268), (6, 0.0), (7, 0.03368769083486511), (8, 0.003139874045042835), (9, 0.0042592819869340616), (10, 0.03606379216974619), (11, 0.037703534938098554), (12, 0.023441778497374215), (13, 0.013039714229005125), (14, 0.004211656234842127), (15, 0.0026667320421110167), (16, 0.0024097673038898787), (17, 0.0), (18, 0.0071857166755995745), (19, 0.0027894879278700375), (20, 0.003024079329292719), (21, 0.0030917925190035362), (22, 0.0), (23, 0.003012862109508887), (24, 0.014185945706689549), (25, 0.006198865891592865), (26, 0.05527090857608063), (27, 0.0), (28, 0.03533141859350345), (29, 0.021906847616644312), (30, 0.0), (31, 0.0633383789460176), (32, 0.01631282891103817), (33, 0.020330077670622518), (34, 0.0030084781027295143), (35, 0.04934570347514729), (36, 0.0), (37, 0.066007916085901), (38, 0.016087086865033292), (39, 0.05886211682250396), (40, 0.0106277

In [70]:
len(Recommendation_Score)

4760

# **Get All Movies Sort Based on Recommendation Score wrt Favourite Movie**

In [71]:
Sorted_Similar_Movies=sorted(Recommendation_Score,key=lambda x:x[1],reverse=True)
print(Sorted_Similar_Movies)

[(348, 1.0000000000000002), (934, 0.2308538803727682), (4143, 0.18933612180874942), (1241, 0.1809875085588574), (184, 0.17638933361183268), (1131, 0.15999147392172172), (387, 0.14611596726892584), (4479, 0.1435969871675139), (372, 0.13734520376244974), (1730, 0.1341177988048019), (2430, 0.13361550473082068), (2725, 0.13214978067479055), (2920, 0.131993067972903), (1946, 0.1286544318234798), (852, 0.12703855954824872), (494, 0.1269426105868352), (302, 0.12632828732643214), (1656, 0.12239047474878872), (911, 0.12238445426196431), (3742, 0.1223112205607431), (2083, 0.12184217398183013), (657, 0.12095331977427747), (1896, 0.11834371511732862), (4336, 0.11606166521687887), (1974, 0.11548002659248391), (2294, 0.11454962939994544), (4345, 0.11397458455093472), (2134, 0.1126547645685942), (815, 0.11247907657889152), (1570, 0.11247890413931969), (4282, 0.11116075680240978), (2977, 0.107673030795816), (2935, 0.10526024373452701), (3731, 0.10405981771426584), (2037, 0.10384765658833338), (1160, 0

In [72]:
print('Top 30 Movies suggested for you:\n')
# Each entry is (row index, score); take the 30 highest-scoring entries
for i,(index,score) in enumerate(Sorted_Similar_Movies[:30],start=1):
  title_from_index=df[df.index==index]['Movie_Title'].values[0]
  print(i,'.',title_from_index)

Top 30 Movies suggested for you:

1 . Gandhi
2 . A Bridge Too Far
3 . The Monuments Men
4 . The Longest Day
5 . Schindler's List
6 . The Thin Red Line
7 . Saving Private Ryan
8 . The Theory of Everything
9 . Judgment at Nuremberg
10 . Hart's War
11 . The Sound of Music
12 . The Road
13 . Agora
14 . The New World
15 . Legends of the Fall
16 . Miss Congeniality
17 . Pearl Harbor
18 . Catch-22
19 . The Hunting Party
20 . Red Tails
21 . Windtalkers
22 . Mr. Holland's Opus
23 . Patton
24 . Unbroken
25 . Sweet Home Alabama
26 . The Great Raid
27 . Fury
28 . Miracle at St. Anna
29 . Flags of Our Fathers
30 . Saints and Soldiers


# **Top 10 Movie Recommendation System**

In [74]:
Favourite_Movie_Name=input('Enter your favourite movie name:')
All_Movies_Title_List=df['Movie_Title'].tolist()
Movie_Recommendation=difflib.get_close_matches(Favourite_Movie_Name,All_Movies_Title_List)
Close_Match=Movie_Recommendation[0]
Index_of_Close_Match_Movie=df[df.Movie_Title==Close_Match]['Movie_ID'].values[0]
Recommendation_Score=list(enumerate(Similarity_Score[Index_of_Close_Match_Movie]))
Sorted_Similar_Movies=sorted(Recommendation_Score,key=lambda x:x[1],reverse=True)
print('Top 10 Movies suggested for you:\n')
# Each entry is (row index, score); take the 10 highest-scoring entries
for i,(index,score) in enumerate(Sorted_Similar_Movies[:10],start=1):
  title_from_index=df[df.index==index]['Movie_Title'].values[0]
  print(i,'.',title_from_index)

Enter your favourite movie name:avtaar
Top 10 Movies suggested for you:

1 . Niagara
2 . Caravans
3 . My Week with Marilyn
4 . Brokeback Mountain
5 . Harry Brown
6 . Night of the Living Dead
7 . The Curse of Downers Grove
8 . The Boy Next Door
9 . Back to the Future
10 . The Juror
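The steps above can also be wrapped in a single function. The following is a sketch under stated assumptions, not the notebook's own cell: recommend is a hypothetical name, it looks up the movie by row position rather than by Movie_ID, it leaves the matched movie out of its own list, and it is shown on a made-up four-movie dataset. With duplicate titles it would use the first occurrence.

```python
import difflib
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(query, titles, scores, top_n=10):
    """Return up to top_n titles most similar to the closest match for query."""
    matches = difflib.get_close_matches(query, titles, n=1)
    if not matches:
        return []                                # nothing close enough to the input
    position = titles.index(matches[0])          # row position, not Movie_ID
    order = np.argsort(scores[position])[::-1]   # highest score first
    return [titles[i] for i in order if i != position][:top_n]

# Hypothetical data standing in for df and the combined feature text
toy = pd.DataFrame({
    "Movie_Title": ["Star Wars", "Star Trek", "Finding Nemo", "Shark Tale"],
    "text": ["space galaxy adventure", "space galaxy crew",
             "fish ocean family", "fish ocean comedy"],
})
toy_titles = toy["Movie_Title"].tolist()
toy_scores = cosine_similarity(TfidfVectorizer().fit_transform(toy["text"]))

print(recommend("Star Warz", toy_titles, toy_scores, top_n=1))
```

With the notebook's data, the equivalent call would pass All_Movies_Title_List and Similarity_Score as the second and third arguments.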
