# 🎬 Movie Recommendation System using Content-Based Filtering

### ✅ Problem:
Recommend movies similar to a user-selected movie, using metadata such as genres, cast, keywords, and overview.

### ✅ Dataset:
- TMDB 5000 Movies Dataset
- TMDB 5000 Credits Dataset

### ✅ What This Notebook Covers:
- Data Cleaning & Merging
- Feature Engineering: Extract genres, cast, crew
- Text Preprocessing: Lowercase, Remove Spaces, Stemming
- Vectorization with CountVectorizer
- Cosine Similarity for Movie Matching
- Recommend Function for Top 5 Similar Movies

### ✅ Tech Stack:
- Pandas, NumPy
- Scikit-learn
- NLTK for Stemming
- Python


# Import Libraries

In [1]:
# Importing necessary libraries
import pandas as pd  # For handling data
import numpy as np   # For numerical operations
from sklearn.feature_extraction.text import CountVectorizer  # For converting text to numbers
from sklearn.metrics.pairwise import cosine_similarity       # For finding similar movies
import ast  # To convert strings to lists or dictionaries
import warnings  # To suppress warnings
warnings.filterwarnings('ignore')
print("Libraries imported successfully")

Libraries imported successfully


# Load Dataset

In [2]:
# Load the datasets (adjust the paths to wherever the CSV files are stored)
movies = pd.read_csv(r"C:\Users\HP\Downloads\tmdb_5000_movies.csv")
credits = pd.read_csv(r"C:\Users\HP\Downloads\tmdb_5000_credits.csv")
print("Datasets loaded successfully")

Datasets loaded successfully


# Data Understanding

In [3]:
# Show the first 2 rows of the movies dataset 
print(" First 2 rows of Data:")
display(movies.head(2))

# Display basic info about the dataset: number of entries, column types, and null values
movies.info()

 First 2 rows of Data:


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [4]:
print(" First 2 rows of Credits Data:")
display(credits.head(2))

print("\n Info:")
credits.info()

 First 2 rows of Credits Data:


Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."



 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


### Dataset Shape

In [5]:
# Show dataset shape (rows,columns)
print(f"movies dataset shape: {movies.shape}")
print(f"credits dataset shape: {credits.shape}")

movies dataset shape: (4803, 20)
credits dataset shape: (4803, 4)


### Merge Datasets 

In [6]:
# First, rename 'id' in movies to 'movie_id' to match with credits
movies.rename(columns={'id': 'movie_id'}, inplace=True)

# Then merge
movies = movies.merge(credits, on='movie_id')
print(movies.shape)

(4803, 23)


In [7]:
movies.columns

Index(['budget', 'genres', 'homepage', 'movie_id', 'keywords',
       'original_language', 'original_title', 'overview', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline',
       'title_x', 'vote_average', 'vote_count', 'title_y', 'cast', 'crew'],
      dtype='object')

Because both DataFrames contain a column named `title`, pandas automatically appends suffixes to tell them apart:
- 'title_x' → the title column from the left DataFrame (movies)
- 'title_y' → the title column from the right DataFrame (credits)
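
The suffixing behavior can be reproduced on a toy pair of frames. The sketch below uses made-up rows, and it also shows one way to sidestep the duplicate column by dropping `title` from the right frame before merging. That alternative is only a suggestion, not what this notebook does.

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables (rows are made up for illustration)
left = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"], "budget": [10, 20]})
right = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"], "cast": ["x", "y"]})

# Default merge: the shared non-key column 'title' receives the suffixes _x and _y
merged_default = left.merge(right, on="movie_id")
print(list(merged_default.columns))

# Alternative: drop the duplicate column first so no suffixes are needed
merged_clean = left.merge(right.drop(columns="title"), on="movie_id")
print(list(merged_clean.columns))
```

With the second form, the later rename and drop of `title_x` / `title_y` would not be needed.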

### Drop Unnecessary Columns

In [8]:
# Rename 'title_x' to 'title' to standardize
movies.rename(columns={'title_x': 'title'}, inplace=True)

# Now safely drop the unused 'title_y'
movies.drop('title_y', axis=1, inplace=True)

# Now select only the useful columns (copy so later in-place edits act on an independent DataFrame)
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']].copy()

# Show result
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


# Data Preprocessing

### Check & Handle Missing Values

In [9]:
# Checking for missing values
print("Missing values per column:")
print(movies.isnull().sum())

Missing values per column:
movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64


In [10]:
# Drop rows with null values in 'overview'
movies.dropna(subset=['overview'], inplace=True)
print("Shape after dropping rows with missing overview:", movies.shape)

Shape after dropping rows with missing overview: (4800, 7)


### Check for Duplicate Rows & Titles

In [11]:
# Check duplicated rows
print(int(movies.duplicated().sum()))

0


In [12]:
# Count duplicated titles
print(int(movies['title'].duplicated().sum()))

# Drop duplicate titles, keep the first one
movies.drop_duplicates(subset='title', keep='first', inplace=True)

print("Remaining rows after dropping duplicate titles:", movies.shape[0])

3
Remaining rows after dropping duplicate titles: 4797
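
One thing worth keeping in mind after dropping rows: pandas keeps the original index labels, so labels and row positions stop lining up. This matters whenever a row's position is later used to index a NumPy array built from the frame. A minimal sketch on made-up data, where `reset_index(drop=True)` is one possible way to realign them:

```python
import pandas as pd

# Made-up frame with one missing overview
df = pd.DataFrame({"title": ["A", "B", "C", "D"],
                   "overview": ["x", None, "y", "z"]})

df = df.dropna(subset=["overview"])
labels_before = list(df.index)   # original labels survive, leaving a gap
print(labels_before)

df = df.reset_index(drop=True)   # relabel rows 0..n-1 so labels match positions
labels_after = list(df.index)
print(labels_after)
```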


# Feature Extraction Functions

### Convert string of dictionaries to list of names

In [13]:
# Show the genres of the first movie to see how the data looks
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [14]:
# Extract the 'name' field from a stringified list of dictionaries
def extract_names(obj):
    names = []
    for item in ast.literal_eval(obj):  # safely parse the string into a list of dictionaries
        names.append(item['name'])
    return names

print("Extracting genres")
movies['genres'] = movies['genres'].apply(extract_names)
movies.head(1)    

Extracting genres


Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
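
To make the parsing step concrete, here is a small standalone sketch on a shortened version of the genres string shown above. The `json.loads` variant is an alternative, not what the notebook uses; it assumes the column holds valid JSON, and it would also cope with JSON-only literals such as `null`, which `ast.literal_eval` rejects.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# ast.literal_eval turns the string into a real list of dicts without executing code
parsed = ast.literal_eval(raw)
names = [item["name"] for item in parsed]
print(names)

# Alternative (assumes the string is valid JSON)
names_json = [item["name"] for item in json.loads(raw)]
print(names_json)
```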


In [15]:
print("Extracting keywords")
movies['keywords'] = movies['keywords'].apply(extract_names)
movies.head(1)

Extracting keywords


Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


### Extract Top 3 Cast Members

In [16]:
# Function to extract top 3 cast members
def extract_top_cast(obj):
    L = []
    for i in ast.literal_eval(obj):
        L.append(i['name'])
        if len(L) == 3:
            break
    return L
print("Extracting top 3 cast members...")
movies['cast'] = movies['cast'].apply(extract_top_cast)
movies.head(1)

Extracting top 3 cast members...


Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


### Get Director's Name from Crew

In [17]:
# Get the director's name
def extract_director(obj):
    L = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L
    
print("Extracting director's name from crew")
movies['crew'] = movies['crew'].apply(extract_director)
movies.head(1)

Extracting director's name from crew


Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


### Remove Spaces & Lowercase

In [18]:
# Split the overview into a list of words, e.g. "a hero" -> ["a", "hero"]
movies['overview'] = movies['overview'].apply(lambda x: x.split())
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [19]:
# Remove spaces inside multi-word entries, e.g. "Tom Cruise" -> "TomCruise", so each entry becomes a single token
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(" ","") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(" ","") for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(" ","") for i in x])
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]
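
A quick way to see why the spaces are removed: if names were split on whitespace, two different actors sharing a first name would appear to have a word in common. The sketch below uses a hypothetical helper `tokens` and an illustrative second name to compare both treatments.

```python
def tokens(names, join_words):
    """Turn a list of names into a set of lowercase tokens (hypothetical helper)."""
    out = set()
    for name in names:
        if join_words:
            out.add(name.replace(" ", "").lower())   # "Tom Cruise" -> "tomcruise"
        else:
            out.update(name.lower().split())         # "Tom Cruise" -> "tom", "cruise"
    return out

cast_a = ["Tom Cruise"]
cast_b = ["Tom Hanks"]   # illustrative name, not taken from the dataset output above

shared_split = tokens(cast_a, False) & tokens(cast_b, False)
shared_joined = tokens(cast_a, True) & tokens(cast_b, True)
print(shared_split)    # the shared first name creates a spurious overlap
print(shared_joined)   # no overlap once each name is a single token
```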


### Create Tags

In [20]:
# Create a new column "tags" which has all text data combined
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
# Convert list to string
movies['tags'] = movies['tags'].apply(lambda x: " ".join(x))
# show tag for first movie
print(movies['tags'].iloc[0])

In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron


In [21]:
# Convert all text in tags to lowercase for consistency
movies['tags'] = movies['tags'].apply(lambda x: x.lower())
print(movies.iloc[0]['tags'])                                                  

in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron


In [22]:
# Create a final dataframe with only required columns
print("Creating final dataframe with movie_id, title and tags")
final_movies = movies[['movie_id', 'title', 'tags']].copy()  # copy so later column assignments modify an independent DataFrame
final_movies.head(1)

Creating final dataframe with movie_id, title and tags


Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."


# Apply Stemming

In [23]:
# Apply stemming to reduce words to root form 
import nltk
from nltk.stem.porter import PorterStemmer 
ps = PorterStemmer() 

In [24]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))  # Apply stemming to each word and add to list
    return " ".join(y)       # Return the stemmed words as a single string
final_movies['tags']=final_movies['tags'].apply(stem)
print("Tags after stemming:\n")
print(final_movies['tags'].head())

Tags after stemming:

0    in the 22nd century, a parapleg marin is dispa...
1    captain barbossa, long believ to be dead, ha c...
2    a cryptic messag from bond’ past send him on a...
3    follow the death of district attorney harvey d...
4    john carter is a war-weary, former militari ca...
Name: tags, dtype: object
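
Note that in the output above "century," keeps its original form: splitting on whitespace leaves the comma attached, so the stemmer sees a different string from plain "century". One possible refinement, sketched here as an assumption and not part of the notebook's pipeline, is to extract word tokens with the same pattern `CountVectorizer` uses by default before stemming. `stem_words_only` is a hypothetical name.

```python
import re
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_words_only(text):
    # Hypothetical variant of stem(): keep only word tokens of 2+ characters,
    # so trailing punctuation never reaches the stemmer
    words = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return " ".join(ps.stem(w) for w in words)

sample = "century, century"
plain = " ".join(ps.stem(w) for w in sample.split())   # whitespace split, as in stem()
cleaned = stem_words_only(sample)
print(plain)
print(cleaned)
```

With the whitespace split the two occurrences end up as different strings, while the token-based variant maps both to the same stem.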


# Vectorization using CountVectorizer

In [25]:
# Convert text tags to numerical vectors using CountVectorizer
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(final_movies['tags']).toarray()
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(4797, 5000))

In [26]:
# Returns list of unique words (features) extracted from the dataset
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      shape=(5000,), dtype=object)

# Calculate Cosine Similarity

In [27]:
# Calculate cosine similarity between movie vectors
similarity=cosine_similarity(vectors)
similarity[1]

array([0.08346223, 1.        , 0.06063391, ..., 0.02378257, 0.        ,
       0.02615329], shape=(4797,))
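
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on which words two movies share, not on how long their tags are. A minimal pure-Python sketch for two short count vectors follows; the helper `cosine` is hypothetical, and returning 0.0 for an all-zero vector is a choice made here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric sequences (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # no words at all: treat as no similarity
    return dot / (norm_a * norm_b)

a = [1, 0, 1, 1]   # made-up word counts for movie A
b = [1, 1, 0, 0]   # made-up word counts for movie B
score = cosine(a, b)
print(round(score, 3))
```

The two vectors share a single word, so the score is 1 / (√3 · √2), roughly 0.41. A vector compared with itself scores 1, which is why the diagonal of `similarity` is 1.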

# Recommendation Function

In [28]:
def recommend(movie):
    # Convert movie title to lowercase for case-insensitive match
    movie = movie.lower()
    titles = final_movies['title'].str.lower().values

    # Check if the movie exists in the dataset
    if movie not in titles:
        print("Movie not found in the dataset.")
        return

    # Find the row position (not the index label) of the movie: rows were dropped
    # earlier, so index labels no longer match the rows of the similarity matrix
    index = np.flatnonzero(titles == movie)[0]

    # Get similarity scores for the selected movie
    distances = similarity[index]

    # Pair row positions with their similarity scores, sort in descending order,
    # and skip the first entry (the movie itself)
    movie_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]

    print(f"\nTop 5 movies similar to '{movie.title()}':")
    for i in movie_list:
        print(final_movies.iloc[i[0]].title)

In [29]:
recommend("Avatar")



Top 5 movies similar to 'Avatar':
Aliens vs Predator: Requiem
Aliens
Falcon Rising
Independence Day
Titan A.E.
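
When a title is misspelled, the function only reports that the movie was not found. One possible extension, sketched here with the standard library's `difflib` and a hypothetical helper `suggest_titles`, is to offer close matches instead. The short title list is just for illustration; in the notebook the candidates would presumably come from `final_movies['title']`.

```python
import difflib

def suggest_titles(query, titles, n=3):
    """Return up to n known titles that closely resemble the query (hypothetical helper)."""
    lookup = {t.lower(): t for t in titles}   # lowercase -> original spelling
    matches = difflib.get_close_matches(query.lower(), list(lookup), n=n, cutoff=0.6)
    return [lookup[m] for m in matches]

known = ["Avatar", "Aliens", "Titan A.E."]
suggestions = suggest_titles("Avatr", known)
print(suggestions)
```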
