# Workflow of the Notebook 🎥

This is the outline for building a **content-based movie recommendation system** as described in the video. Each step is briefly explained.

---

## **Step 1: Data 📊**
The movie dataset comes from Kaggle (https://www.kaggle.com/datasets/abdallahwagih/movies) and is loaded with `pandas`. It contains essential information such as movie titles, genres, keywords, cast, and director.

---

## **Step 2: Data Pre-Processing 🛠️**
The data is cleaned and prepared for analysis using `pandas`. This involves handling missing values and combining relevant columns into a single text-based feature.

---

## **Step 3: Feature Extraction 🧠**
Text features are transformed into numerical vectors using **TF-IDF** (Term Frequency-Inverse Document Frequency) with `scikit-learn`. TF-IDF gives a word more weight when it appears often in one movie's text but in few other movies, so distinctive words count for more than common ones.

---

## **Step 4: User Input 📝**
The user enters a movie title and the number of suggestions they want. This input is used to find similar movies based on their textual features.

---

## **Step 5: Cosine Similarity 🔍**
Cosine similarity is calculated between movie vectors using `scikit-learn` to identify the movies most similar to the input movie. Each score compares the input movie with one other movie in the dataset; because TF-IDF weights are non-negative, scores range from 0 (no shared terms) to 1 (same direction).
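
To make the measure concrete, here is a minimal hand-rolled version using `numpy`. The helper `cosine` and the three small vectors are illustrative assumptions; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-term feature vectors for three movies
movie_a = [1.0, 1.0, 0.0]
movie_b = [1.0, 0.0, 0.0]  # shares one term with movie_a
movie_c = [0.0, 0.0, 1.0]  # shares no terms with movie_a

print(cosine(movie_a, movie_a))  # same direction, approximately 1
print(cosine(movie_a, movie_b))  # partial overlap, approximately 0.707
print(cosine(movie_a, movie_c))  # no overlap, 0
```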

---

## **Step 6: List of Movies 🎬**
Based on the cosine similarity scores, the system generates a ranked list of recommended movies. The higher the similarity score, the more closely a movie's content matches the input movie, and the more likely the user is to enjoy it. These recommendations are displayed to the user.

---

🎯 This workflow provides a structured approach to building a content-based movie recommendation system, leveraging Python libraries and machine learning techniques.


In [None]:
import numpy as np
import pandas as pd
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Data Collection and Pre-Processing

In [None]:
#loading the data from the movies.csv file as Pandas DataFrame
movies_data = pd.read_csv('/content/movies.csv')
movies_data.head()#print the first 5 rows

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


In [None]:
#print the number of rows and columns
movies_data.shape

(4803, 24)

We have 4803 rows and 24 columns, but we do not need all 24 columns. Since we are building content-based and popularity-based recommendation systems, we extract only the relevant features from the dataset.

In [None]:
#selecting the relevant features for recommendation
selected_features = ['genres','keywords','tagline','cast','director']
print(selected_features)

['genres', 'keywords', 'tagline', 'cast', 'director']


In [None]:
#replace the null values
for feature in selected_features:
  movies_data[feature] = movies_data[feature].fillna('')

In [None]:
#combining selected features
combined_features = movies_data['genres']+' '+movies_data['keywords']+' '+movies_data['tagline']+' '+movies_data['cast']+' '+movies_data['director']

In [None]:
print(combined_features)

0       Action Adventure Fantasy Science Fiction cultu...
1       Adventure Fantasy Action ocean drug abuse exot...
2       Action Adventure Crime spy based on novel secr...
3       Action Crime Drama Thriller dc comics crime fi...
4       Action Adventure Science Fiction based on nove...
                              ...                        
4798    Action Crime Thriller united states\u2013mexic...
4799    Comedy Romance  A newlywed couple's honeymoon ...
4800    Comedy Drama Romance TV Movie date love at fir...
4801      A New Yorker in Shanghai Daniel Henney Eliza...
4802    Documentary obsession camcorder crush dream gi...
Length: 4803, dtype: object


In [None]:
#Term Frequency-Inverse Document Frequency
#converting the text data to feature vectors using Tf-idf Vectorizer
vectorizer = TfidfVectorizer()
feature_vectors = vectorizer.fit_transform(combined_features)

In [None]:
print(feature_vectors)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 124266 stored elements and shape (4803, 17318)>
  Coords	Values
  (0, 201)	0.07860022416510505
  (0, 274)	0.09021200873707368
  (0, 5274)	0.11108562744414445
  (0, 13599)	0.1036413987316636
  (0, 5437)	0.1036413987316636
  (0, 3678)	0.21392179219912877
  (0, 3065)	0.22208377802661425
  (0, 5836)	0.1646750903586285
  (0, 14378)	0.33962752210959823
  (0, 16587)	0.12549432354918996
  (0, 3225)	0.24960162956997736
  (0, 14271)	0.21392179219912877
  (0, 4945)	0.24025852494110758
  (0, 15261)	0.07095833561276566
  (0, 16998)	0.1282126322850579
  (0, 11192)	0.09049319826481456
  (0, 11503)	0.27211310056983656
  (0, 13349)	0.15021264094167086
  (0, 17007)	0.23643326319898797
  (0, 17290)	0.20197912553916567
  (0, 13319)	0.2177470539412484
  (0, 14064)	0.20596090415084142
  (0, 16668)	0.19843263965100372
  (0, 14608)	0.15150672398763912
  (0, 8756)	0.22709015857011816
  :	:
  (4801, 403)	0.17727585190343229
  (4801, 4835)	0.247137650

Cosine Similarity

In [None]:
#getting similarity score using cosine similarity
similarity_score = cosine_similarity(feature_vectors)

In [None]:
print(similarity_score) #this is a square matrix of shape 4803 x 4803

[[1.         0.07219487 0.037733   ... 0.         0.         0.        ]
 [0.07219487 1.         0.03281499 ... 0.03575545 0.         0.        ]
 [0.037733   0.03281499 1.         ... 0.         0.05389661 0.        ]
 ...
 [0.         0.03575545 0.         ... 1.         0.         0.02651502]
 [0.         0.         0.05389661 ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.02651502 0.         1.        ]]


Getting the input from the user

In [None]:
#getting the fav movie
movie_name = input('Enter your fav movie name: ')

Enter your fav movie name: spider man


In [None]:
#creating a list with all the titles in the dataset
list_of_all_titles = movies_data['title'].tolist()
print(list_of_all_titles)

['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre', 'The Dark Knight Rises', 'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron', 'Harry Potter and the Half-Blood Prince', 'Batman v Superman: Dawn of Justice', 'Superman Returns', 'Quantum of Solace', "Pirates of the Caribbean: Dead Man's Chest", 'The Lone Ranger', 'Man of Steel', 'The Chronicles of Narnia: Prince Caspian', 'The Avengers', 'Pirates of the Caribbean: On Stranger Tides', 'Men in Black 3', 'The Hobbit: The Battle of the Five Armies', 'The Amazing Spider-Man', 'Robin Hood', 'The Hobbit: The Desolation of Smaug', 'The Golden Compass', 'King Kong', 'Titanic', 'Captain America: Civil War', 'Battleship', 'Jurassic World', 'Skyfall', 'Spider-Man 2', 'Iron Man 3', 'Alice in Wonderland', 'X-Men: The Last Stand', 'Monsters University', 'Transformers: Revenge of the Fallen', 'Transformers: Age of Extinction', 'Oz: The Great and Powerful', 'The Amazing Spider-Man 2', 'TRON: Legacy', 'Cars 2', 'Green Lant

In [None]:
#finding the close match for the movie name given by the user
close_matches = difflib.get_close_matches(movie_name, list_of_all_titles)
print(close_matches)

['Spider-Man', 'Inside Man', 'Superman']
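
`difflib.get_close_matches` returns an empty list when no title is similar enough, so indexing the result with `[0]` can fail on an unusual query. The sketch below shows one way to guard against that. The helper name `best_title_match` and the short title list are assumptions for illustration; `n` and `cutoff` are real parameters of `get_close_matches`.

```python
import difflib

# Small hypothetical subset of titles
titles = ["Spider-Man", "Superman", "Avatar", "Tangled"]

def best_title_match(query, candidates, cutoff=0.6):
    """Return the closest title, or None when nothing clears the cutoff."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_title_match("spider man", titles))
print(best_title_match("zzzzzz", titles))  # no similar title, so None
```

Note that the comparison is case-sensitive, so a lowercase query scores slightly lower against capitalized titles.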


In [None]:
close_match = close_matches[0] #Taking the most relevant movie name
print(close_match)

Spider-Man


In [None]:
#finding the index of the close match
index_close_match = movies_data[movies_data.title == close_match]['index'].values[0]
print(index_close_match)

159


In [None]:
#getting a list of similar movies based on the index value
similarity_score_movie = list(enumerate(similarity_score[index_close_match]))
print(similarity_score_movie)

[(0, np.float64(0.058033472668241895)), (1, np.float64(0.028510860472594178)), (2, np.float64(0.027527242422615502)), (3, np.float64(0.006457320533567937)), (4, np.float64(0.07910312941508457)), (5, np.float64(0.3188331558421017)), (6, np.float64(0.0)), (7, np.float64(0.005837358717098843)), (8, np.float64(0.019848569330341657)), (9, np.float64(0.017094559158687336)), (10, np.float64(0.032895000920978516)), (11, np.float64(0.005137462350368439)), (12, np.float64(0.017699463936031612)), (13, np.float64(0.01631207363457817)), (14, np.float64(0.01722159880482989)), (15, np.float64(0.011028690492438572)), (16, np.float64(0.0055350569108376544)), (17, np.float64(0.016931592006051974)), (18, np.float64(0.005525360643470354)), (19, np.float64(0.021680168394872627)), (20, np.float64(0.04593061379094501)), (21, np.float64(0.0048534966726055005)), (22, np.float64(0.01735613282383714)), (23, np.float64(0.028252755876041345)), (24, np.float64(0.012687691446279886)), (25, np.float64(0.0119928781069

In [None]:
#sorting the movies
sorted_similar_movies = sorted(similarity_score_movie, key = lambda x:x[1], reverse = True)
print(sorted_similar_movies)

[(159, np.float64(1.0)), (5, np.float64(0.3188331558421017)), (30, np.float64(0.31791901982229703)), (1559, np.float64(0.18281312771525624)), (382, np.float64(0.16501718739122473)), (3575, np.float64(0.16167041055131562)), (2361, np.float64(0.13913552967690979)), (37, np.float64(0.1325433517724574)), (1364, np.float64(0.1315753447298594)), (1193, np.float64(0.12823533301253537)), (1793, np.float64(0.12355727076460593)), (328, np.float64(0.12188313538843791)), (677, np.float64(0.11987005023906433)), (1796, np.float64(0.11833294042448733)), (1523, np.float64(0.1167833101290838)), (1598, np.float64(0.11454660709309697)), (4441, np.float64(0.1127857020453836)), (2529, np.float64(0.10580630139750402)), (4427, np.float64(0.10339492817261245)), (976, np.float64(0.10254994461653802)), (2157, np.float64(0.10224484429143274)), (978, np.float64(0.10101660950536458)), (2369, np.float64(0.09964302137705913)), (1435, np.float64(0.09844672492278536)), (448, np.float64(0.0979633602202219)), (1533, np.

In [None]:
#take the first 10 similar movies (the first entry is the chosen movie itself)
top_10_similar_movies = [movies_data.loc[i, 'title'] for i, _ in sorted_similar_movies[:10]]
print(top_10_similar_movies)

['Spider-Man', 'Spider-Man 3', 'Spider-Man 2', 'The Notebook', 'Seabiscuit', 'Clerks II', 'The Ice Storm', 'Oz: The Great and Powerful', 'Horrible Bosses', 'The Count of Monte Cristo']


In [None]:
#print the suggested movies
print('Your Top Suggested movies are: ')
for i in top_10_similar_movies:
  print(i)

Your Top Suggested movies are: 
Spider-Man
Spider-Man 3
Spider-Man 2
The Notebook
Seabiscuit
Clerks II
The Ice Storm
Oz: The Great and Powerful
Horrible Bosses
The Count of Monte Cristo


**Movie Recommendation System**

We have successfully built a content-based recommendation system.
Putting it all together lets the user enter a favourite movie and get the desired number of recommendations.

In [None]:
movie_name = input(' Enter your favourite movie name : ')
n = int(input('How many recommendations you want?: '))

list_of_all_titles = movies_data['title'].tolist()

close_matches = difflib.get_close_matches(movie_name, list_of_all_titles)

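#stop early if difflib finds no title close to the input
if not close_matches:
  raise ValueError('No close match found for ' + movie_name)
close_match = close_matches[0]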

index_close_match = movies_data[movies_data.title == close_match]['index'].values[0]

similarity_score_movie = list(enumerate(similarity_score[index_close_match]))

sorted_similar_movies = sorted(similarity_score_movie, key = lambda x:x[1], reverse = True)

print('Movies suggested for you : \n')

#skip entry 0, which is normally the chosen movie itself, and keep the next n entries
for i, movie in enumerate(sorted_similar_movies[1:n+1], start=1):
  index = movie[0]
  title_from_index = movies_data[movies_data.index==index]['title'].values[0]
  print(i,title_from_index)

 Enter your favourite movie name : batman
How many recommendations you want?: 10
Movies suggested for you : 

1 Batman Returns
2 Batman & Robin
3 The Dark Knight Rises
4 Batman Begins
5 The Dark Knight
6 A History of Violence
7 Superman
8 Beetlejuice
9 Bedazzled
10 Mars Attacks!
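
The same steps could also be wrapped in a reusable function. The sketch below is one possible shape, not the notebook's own code: the function name `recommend`, the three titles, and the hand-written similarity matrix are all hypothetical and only serve to make the example self-contained.

```python
import difflib
import numpy as np

def recommend(query, titles, similarity, n=5):
    """Return up to n titles ranked by similarity to the closest match for query."""
    matches = difflib.get_close_matches(query, titles, n=1)
    if not matches:
        return []  # nothing close enough to the query
    idx = titles.index(matches[0])
    # Rank all other movies by their similarity to the matched one
    others = [i for i in range(len(titles)) if i != idx]
    others.sort(key=lambda i: similarity[idx][i], reverse=True)
    return [titles[i] for i in others[:n]]

# Hypothetical titles and similarity matrix, for illustration only
titles = ["Batman", "Notting Hill", "Batman Returns"]
similarity = np.array([
    [1.0, 0.0, 0.6],
    [0.0, 1.0, 0.1],
    [0.6, 0.1, 1.0],
])

print(recommend("batman", titles, similarity, n=2))
print(recommend("qqqq", titles, similarity))  # no match, so an empty list
```

With the real data, `titles` would be `movies_data['title'].tolist()` and `similarity` the matrix returned by `cosine_similarity`.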


**We can now build a second recommendation system that also takes popularity into account, with only small changes to the code above.**

In [None]:
#min-max normalization rescales the popularity score to the range [0, 1] so it can be used as a weight
movies_data['popularity'] = (movies_data['popularity'] - movies_data['popularity'].min()) / (movies_data['popularity'].max() - movies_data['popularity'].min())
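
A small worked example of the min-max formula, `(x - min) / (max - min)`, on three made-up popularity values:

```python
import pandas as pd

# Hypothetical raw popularity scores
popularity = pd.Series([10.0, 55.0, 100.0])

normalized = (popularity - popularity.min()) / (popularity.max() - popularity.min())
print(normalized.tolist())  # [0.0, 0.5, 1.0]
```

The least popular movie maps to exactly 0, so any score later multiplied by this weight becomes 0 for that movie.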

In [None]:
movie_name = input(' Enter your favourite movie name : ')
n = int(input('How many recommendations you want?: '))

list_of_all_titles = movies_data['title'].tolist()

close_matches = difflib.get_close_matches(movie_name, list_of_all_titles)

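#stop early if difflib finds no title close to the input
if not close_matches:
  raise ValueError('No close match found for ' + movie_name)
close_match = close_matches[0]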

index_close_match = movies_data[movies_data.title == close_match]['index'].values[0]

#weight each similarity score by the movie's normalized popularity so that popular titles rank higher
similarity_score_movie = list(enumerate(similarity_score[index_close_match]*movies_data['popularity'].values))

sorted_similar_movies = sorted(similarity_score_movie, key = lambda x:x[1], reverse = True)

#after weighting, the chosen movie is no longer guaranteed to sort first, so drop it by index
sorted_similar_movies = [movie for movie in sorted_similar_movies if movie[0] != index_close_match]

print('Movies suggested for you : \n')

for i, movie in enumerate(sorted_similar_movies[:n], start=1):
  index = movie[0]
  title_from_index = movies_data[movies_data.index==index]['title'].values[0]
  print(i,title_from_index)

 Enter your favourite movie name : batman
How many recommendations you want?: 10
Movies suggested for you : 

1 Minions
2 The Dark Knight
3 Batman Returns
4 The Dark Knight Rises
5 Batman Begins
6 Interstellar
7 Deadpool
8 Batman v Superman: Dawn of Justice
9 Spider-Man 3
10 Batman & Robin
