# Project Title- 

# A Movie Recommendation System Project

## Project Purpose

### Objective
Develop a movie recommendation system that provides users with movie suggestions similar to their selected movie.

### How It Works
1. **Data Preprocessing:**
   - Clean and preprocess movie data, including genres, keywords, cast, crew, and overview.

2. **Feature Extraction:**
   - Transform text data from movies into numerical features using techniques like `CountVectorizer` and text stemming.

3. **Similarity Calculation:**
   - Calculate the similarity between movies using cosine similarity.

4. **Recommendation Generation:**
   - Based on the similarity scores, recommend movies similar to the one specified by the user.

### Technologies and Libraries
- **Pandas and Numpy:** For data manipulation and numerical operations.
- **Scikit-learn:** For text vectorization and similarity calculation.
- **NLTK:** For natural language processing tasks, such as stemming.
- **Matplotlib and Seaborn:** For visualization (if included in the project but it is not necessary in this project).

### Key Features
- **Data Merging:** Integrates movie details with credits information to provide a comprehensive dataset.
- **Text Processing:** Cleans and preprocesses text data for better feature extraction.
- **Recommendation Logic:** Uses cosine similarity to find and recommend similar movies.

### Impact
- **User Experience:** Enhances user experience by helping users discover movies similar to those they like.
- **Data Utilization:** Leverages rich movie metadata to provide relevant recommendations.

This system is useful for movie streaming services, entertainment platforms, and any application that seeks to enhance user engagement by suggesting content based on user preferences.




In [530]:
# Importing necessary libraries for data manipulation, visualization, and text processing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ast
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity

In [531]:
# Load the datasets
movies = pd.read_csv(r'E:\Data Set\RESTYM\tmdb_5000_movies.csv')

In [532]:
# Load the datasets
credits = pd.read_csv(r'E:\Data Set\RESTYM\tmdb_5000_credits.csv')

In [533]:
# Display the first row of each dataset to understand their structure
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [534]:
# Display the first row of each dataset to understand their structure
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


# Data preprocessing 

In [535]:
# Check the shape of the datasets to understand their dimensions
movies.shape

(4803, 20)

In [536]:
# Check the shape of the datasets to understand their dimensions
credits.shape

(4803, 4)

In [537]:
# Merging the movies and credits data on the 'title' column
movies= movies.merge(credits, on='title')

In [538]:
# Display the first row of the merged dataset
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [539]:
# Check the shape of the merged dataset
movies.shape

(4809, 23)

### Selecting Necessary Columns According to Our Projects

In [540]:
# Select relevant columns for the recommendation system
movies=movies[['movie_id', 'title', 'overview','genres', 'keywords', 'cast', 'crew']]

In [541]:
# Display the first row of the filtered dataset
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [542]:
# Check for missing values in the dataset
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [543]:
# Drop rows with missing values
movies.dropna(inplace=True)

In [544]:
# Confirm that there are no missing values left
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [545]:
# Check for and remove duplicate rows
movies.duplicated().sum()

0

In [546]:
# Display the 'genres' column for the first row to understand its format
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [547]:
# Function to convert genre information from JSON string to list of genre names
def convert(obj):
    L = []
    for i in ast.literal_eval(obj):
        L.append(i['name'])  
    return L

In [548]:
# Apply the 'convert' function to 'genres' columns
movies['genres'] = movies['genres'].apply(convert)

In [549]:
# Apply the 'convert' function to  'keywords' columns
movies['keywords'] = movies['keywords'].apply(convert)

In [550]:
# Display the first row to verify the transformation
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [551]:
# Function to extract the top 3 cast members from the 'cast' column
def convert5(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter < 3:
            L.append(i['name'])
        counter+=1
    return L 

In [552]:
# Apply the 'convert5' function to the 'cast' column
movies['cast'] = movies['cast'].apply(convert5)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [553]:
# Function to extract the director's name from the 'crew' column
def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L 

In [554]:
# Apply the 'fetch_director' function to the 'crew' column
movies['crew'] = movies['crew'].apply(fetch_director)

In [555]:
# Display the first row to verify the transformation
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [556]:
# Split the overview text into individual words
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [557]:
# Display the first row to verify the transformation
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [558]:
# Remove spaces and convert genre names to camel case format

movies['genres'] = movies['genres'].apply(lambda x:[i.replace(" ","") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(" ","") for i in x])
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(" ","") for i in x])

In [559]:
# Display the first row to verify the transformation
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]


In [560]:
# Combine 'overview', 'genres', 'keywords', 'cast', and 'crew' into a single 'tags' column
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [561]:
# Display the first row to verify the creation of 'tags' column
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [562]:
# Create a new DataFrame with only 'movie_id', 'title', and 'tags' columns
df=movies[['movie_id', 'title', 'tags']]

In [563]:
# Display the first few rows of the new DataFrame
df.head(5)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [564]:
# Join all tags into a single string per movie
df['tags'] = df['tags'].apply(lambda x: " ".join(x))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tags'] = df['tags'].apply(lambda x: " ".join(x))


In [565]:
# Display the DataFrame to verify the transformation
df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."
...,...,...,...
4804,9367,El Mariachi,El Mariachi just wants to play his guitar and ...
4805,72766,Newlyweds,A newlywed couple's honeymoon is upended by th...
4806,231617,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic..."
4807,126186,Shanghai Calling,When ambitious New York attorney Sam is sent t...


In [566]:
# Convert all tags to lowercase for consistency

df['tags'] = df['tags'].apply(lambda x:x.lower())


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tags'] = df['tags'].apply(lambda x:x.lower())


In [567]:
# Display the DataFrame to verify the transformation
df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."
...,...,...,...
4804,9367,El Mariachi,el mariachi just wants to play his guitar and ...
4805,72766,Newlyweds,a newlywed couple's honeymoon is upended by th...
4806,231617,"Signed, Sealed, Delivered","""signed, sealed, delivered"" introduces a dedic..."
4807,126186,Shanghai Calling,when ambitious new york attorney sam is sent t...


In [568]:
# Display the tags for the first movie to ensure proper conversion
df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

In [569]:
# Initialize CountVectorizer to convert text data into a matrix of token counts
cv = CountVectorizer(max_features=5000,stop_words='english')

In [570]:
# Fit and transform the 'tags' column into a matrix of token counts
vector = cv.fit_transform(df['tags']).toarray()
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [571]:
# Display the shape of the vectorized data
vector.shape

(4806, 5000)

### NLTK

In [572]:
# Initialize PorterStemmer for text stemming
ps=PorterStemmer()

In [573]:
# Function to apply stemming to each word in the text
def stem(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)
    

In [574]:
# Test the stemming function with a sample text
stem('in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron')

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'

In [575]:
# Apply stemming to the 'tags' column
df['tags'] = df['tags'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tags'] = df['tags'].apply(stem)


In [576]:
# Display the DataFrame to verify the stemming transformation
df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believ to be dead, ha c..."
2,206647,Spectre,a cryptic messag from bond’ past send him on a...
3,49026,The Dark Knight Rises,follow the death of district attorney harvey d...
4,49529,John Carter,"john carter is a war-weary, former militari ca..."
...,...,...,...
4804,9367,El Mariachi,el mariachi just want to play hi guitar and ca...
4805,72766,Newlyweds,a newlyw couple' honeymoon is upend by the arr...
4806,231617,"Signed, Sealed, Delivered","""signed, sealed, delivered"" introduc a dedic q..."
4807,126186,Shanghai Calling,when ambiti new york attorney sam is sent to s...


In [577]:
# Compute the cosine similarity matrix between movie vectors
similarity = cosine_similarity(vector)
# Display the similarity matrix
similarity

array([[1.        , 0.08964215, 0.06071767, ..., 0.02519763, 0.0277885 ,
        0.        ],
       [0.08964215, 1.        , 0.06350006, ..., 0.02635231, 0.        ,
        0.        ],
       [0.06071767, 0.06350006, 1.        , ..., 0.02677398, 0.        ,
        0.        ],
       ...,
       [0.02519763, 0.02635231, 0.02677398, ..., 1.        , 0.07352146,
        0.04774099],
       [0.0277885 , 0.        , 0.        , ..., 0.07352146, 1.        ,
        0.05264981],
       [0.        , 0.        , 0.        , ..., 0.04774099, 0.05264981,
        1.        ]])

In [578]:
# Find the index of a specific movie ('The Lego Movie') in the DataFrame
df[df['title'] == 'The Lego Movie'].index[0]

744

In [579]:
# Function to recommend movies similar to a given movie
def recommend(movie):
    # Check if the movie exists in the DataFrame
    if movie not in df['title'].values:
        print(f"Movie '{movie}' not found in the dataset.")
        return
    
 # Find the index of the movie and compute similarity scores
    movies_index = df[df['title'] == movie].index[0]
    distances = similarity[movies_index]
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]
    
        
# Print the titles of similar movies
    print(f"Movies similar to '{movie}':")
    for i in movies_list:
        print(df.iloc[i[0]].title)

# Example usage of the recommendation function
recommend('Avatar')  


Movies similar to 'Avatar':
Titan A.E.
Small Soldiers
Ender's Game
Aliens vs Predator: Requiem
Independence Day


In [580]:
# Example usage of the recommendation function
recommend('Batman Begins')

Movies similar to 'Batman Begins':
The Dark Knight
The Dark Knight Rises
Batman
Batman & Robin
Batman


# Our project is Now 100% Working 