In [72]:
# Movie Recommendation System using Content-Based Filtering
# Module E: AI Applications - Individual Open Project

"""
This notebook implements a content-based movie recommendation system using the TMDB 5000 Movies Dataset.
The system analyzes movie metadata (genres, keywords, cast, crew, overview) to recommend similar movies
based on cosine similarity of their feature vectors.

Project Track: AI Application - Content-Based Recommendation System
AI Technique: Natural Language Processing (NLP) with Cosine Similarity

Author: [Your Name]
Date: January 2026
GitHub: https://github.com/[your-username]/movie-recommendation-system
"""

# Import Required Libraries
import numpy as np  # Linear algebra operations
import pandas as pd  # Data processing and CSV file I/O
import ast  # For parsing string representations of lists/dicts
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("=" * 60)
print("🎬 MOVIE RECOMMENDATION SYSTEM")
print("   Content-Based Filtering using NLP")
print("=" * 60)
print("\n✅ Libraries imported successfully!")
print(f"   • NumPy version: {np.__version__}")
print(f"   • Pandas version: {pd.__version__}")

🎬 MOVIE RECOMMENDATION SYSTEM
   Content-Based Filtering using NLP

✅ Libraries imported successfully!
   • NumPy version: 2.0.2
   • Pandas version: 2.2.2


# 1. Problem Definition & Objective

## Selected Project Track
**AI Application: Content-Based Recommendation System**

## Problem Statement
With thousands of movies available on streaming platforms, users often struggle to find movies that match their preferences. This project aims to build a **content-based movie recommendation system** that suggests similar movies based on movie attributes like genres, keywords, cast, crew, and plot overview.

## Real-World Relevance and Motivation
- Streaming platforms such as Netflix and Amazon Prime use recommendation systems to enhance the user experience
- Personalized recommendations increase user engagement and satisfaction
- Content-based filtering needs no interaction history from other users; a single movie the user already likes is enough to produce suggestions, which eases the cold-start problem for new users
- Helps users discover movies they might enjoy based on features of movies they already like

# 2. Data Understanding & Preparation

## Dataset Source
**TMDB 5000 Movies Dataset** - A publicly available dataset from Kaggle containing metadata for approximately 5000 movies.

### Dataset Files:
- `tmdb_5000_movies.csv` - Contains movie information (title, overview, genres, keywords, etc.)
- `tmdb_5000_credits.csv` - Contains cast and crew information for each movie

In [None]:
# Load the datasets - Works on Google Colab and Local environments
import os

# Check if running on Google Colab
IN_COLAB = 'google.colab' in str(get_ipython()) if 'get_ipython' in dir() else False

if IN_COLAB:
    print("🔵 Running on Google Colab!")
    print("=" * 50)
    
    # Install gdown to download from Google Drive
    !pip install -q gdown
    import gdown
    
    # Create archive folder
    os.makedirs('/content/archive', exist_ok=True)
    
    # Google Drive file IDs - SWAPPED (movies and credits were reversed)
    movies_file_id = '1qSINlHXsBZd_dT1EncPhJGsQEuXgGoCu'   # Was credits, now movies
    credits_file_id = '1ZQ9qfqvYGn0J5mpTeqaUjwnizakPgQsf'  # Was movies, now credits
    
    movies_path = '/content/archive/tmdb_5000_movies.csv'
    credits_path = '/content/archive/tmdb_5000_credits.csv'
    
    # Remove existing files to re-download with correct IDs
    if os.path.exists(movies_path):
        os.remove(movies_path)
    if os.path.exists(credits_path):
        os.remove(credits_path)
    
    # Download files with confirmation bypass for large files
    print("\n📥 Downloading datasets from Google Drive...")
    
    try:
        gdown.download(f'https://drive.google.com/uc?id={movies_file_id}', movies_path, quiet=False, fuzzy=True)
        gdown.download(f'https://drive.google.com/uc?id={credits_file_id}', credits_path, quiet=False, fuzzy=True)
        
        # Verify downloads - check file sizes
        movies_size = os.path.getsize(movies_path) if os.path.exists(movies_path) else 0
        credits_size = os.path.getsize(credits_path) if os.path.exists(credits_path) else 0
        
        print(f"\n📁 Downloaded file sizes:")
        print(f"   Movies: {movies_size:,} bytes")
        print(f"   Credits: {credits_size:,} bytes")

        # Check if files are too small (likely HTML error page)
        if movies_size < 10000 or credits_size < 10000:
            print("\n⚠️ Files seem too small! They might be HTML error pages.")
            print("Please ensure Google Drive sharing is set to 'Anyone with the link'")
            raise Exception("Download failed - files too small")

    except Exception as e:
        print(f"\n❌ Download error: {e}")
        print("\n📋 Manual Upload Option:")
        from google.colab import files
        print("Please upload 'tmdb_5000_movies.csv' and 'tmdb_5000_credits.csv':")
        uploaded = files.upload()

        # Move uploaded files to archive folder
        for filename in uploaded.keys():
            os.rename(filename, f'/content/archive/{filename}')
            print(f"   ✅ Moved {filename} to /content/archive/")
    
    archive_path = '/content/archive'

else:
    # For local environments
    possible_paths = [
        'archive',
        '/workspaces/Minor_In_Project_Module_E/archive',
        r'd:\Minor_In_Project_Module_E\archive',
        os.path.join(os.getcwd(), 'archive'),
    ]
    
    archive_path = None
    for path in possible_paths:
        if os.path.exists(os.path.join(path, 'tmdb_5000_movies.csv')):
            archive_path = path
            print(f"✅ Found dataset at: {path}")
            break

    if archive_path is None:
        print("❌ Dataset not found!")
        print(f"Current working directory: {os.getcwd()}")
        archive_path = 'archive'
    
    movies_path = os.path.join(archive_path, 'tmdb_5000_movies.csv')
    credits_path = os.path.join(archive_path, 'tmdb_5000_credits.csv')

# Load the datasets
if os.path.exists(movies_path) and os.path.exists(credits_path):
    movies = pd.read_csv(movies_path)
    credits = pd.read_csv(credits_path)
    
    print(f"\n✅ Datasets loaded successfully!")
    print(f"   Movies dataset shape: {movies.shape}")
    print(f"   Credits dataset shape: {credits.shape}")

    # Verify correct files loaded
    print(f"\n📋 Verification:")
    print(f"   Movies columns: {list(movies.columns)[:5]}...")
    print(f"   Credits columns: {list(credits.columns)}")
else:
    print(f"\n❌ Files not found at: {archive_path}")

🔵 Running on Google Colab!


In [None]:
# Explore the movies dataset
print("Movies Dataset - First 2 rows:")
movies.head(2)

Movies Dataset - First 2 rows:


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [None]:
# Explore the credits dataset
print("Credits Dataset - First 2 rows:")
credits.head(2)

Credits Dataset - First 2 rows:


Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [None]:
# Data Quality Check - Missing Values Analysis
print("=" * 60)
print("DATA QUALITY CHECK - Missing Values")
print("=" * 60)

print("\n📊 Missing values in Movies dataset:")
movies_missing = movies.isnull().sum()
print(movies_missing[movies_missing > 0] if movies_missing.sum() > 0 else "   No missing values!")

print("\n📊 Missing values in Credits dataset:")
credits_missing = credits.isnull().sum()
print(credits_missing[credits_missing > 0] if credits_missing.sum() > 0 else "   No missing values!")

# Dataset Info
print("\n📋 Dataset Information:")
print(f"   Movies: {movies.shape[0]} rows × {movies.shape[1]} columns")
print(f"   Credits: {credits.shape[0]} rows × {credits.shape[1]} columns")

Missing values in Movies dataset:
budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64

Missing values in Credits dataset:
movie_id    0
title       0
cast        0
crew        0
dtype: int64


In [None]:
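# Merge movies and credits datasets on the 'title' column.
# Note: titles are not guaranteed to be unique, so movies that share a title
# are cross-matched and the merged frame can have more rows than either input.
movies = movies.merge(credits, on='title')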
print(f"Merged dataset shape: {movies.shape}")
movies.head(2)

Merged dataset shape: (4809, 23)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
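
Because a title-based join can pair one movie with another movie's credits, a key-based join is worth considering. The following is a minimal sketch on tiny made-up frames (the rows are illustrative, not taken from the dataset). It assumes that `id` in the movies file and `movie_id` in the credits file hold the same TMDB identifier, which the first rows shown above suggest but which this notebook does not verify.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; the rows are made up for illustration.
movies_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Remake", "Remake", "Unique"],
})
credits_demo = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Remake", "Remake", "Unique"],
    "cast": ["cast of 1", "cast of 2", "cast of 3"],
})

# Joining on title matches every "Remake" row with every other "Remake" row.
by_title = movies_demo.merge(credits_demo, on="title")

# Joining on the identifier keeps one row per movie; validate="one_to_one"
# makes pandas raise an error if either key column contains duplicates.
by_id = movies_demo.merge(
    credits_demo.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
)

print(len(by_title), len(by_id))
```

On this toy data the title join yields 5 rows and the identifier join yields 3. Switching the notebook to an identifier join would change the merged shape reported above, so later outputs would need to be regenerated.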


In [None]:
# Select relevant columns for the recommendation system
# We need: movie_id, title, overview, genres, keywords, cast, crew
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]
print("Selected columns for recommendation system:")
print(movies.columns.tolist())
movies.head()

Selected columns for recommendation system:
['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']


Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [None]:
# Drop rows with missing values
print(f"Shape before dropping null values: {movies.shape}")
movies.dropna(inplace=True)
print(f"Shape after dropping null values: {movies.shape}")

Shape before dropping null values: (4809, 7)
Shape after dropping null values: (4806, 7)


In [None]:
# 4. Core Implementation

## Feature Extraction Functions

# Function to extract names from JSON-like string (for genres, keywords)
def convert(text):
    """
    Converts JSON-like string to list of names.
    Example: '[{"id": 28, "name": "Action"}]' -> ['Action']
    """
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name'])
    return L

# Function to extract top 3 names (used for cast to limit features)
def convert3(text):
    """
    Converts JSON-like string to list of top 3 names.
    Limits cast to top 3 actors to reduce noise.
    """
    # Slice first so only the first three entries are read
    return [i['name'] for i in ast.literal_eval(text)[:3]]

# Function to extract director name from crew
def fetch_director(text):
    """
    Extracts director name(s) from crew JSON-like string.
    """
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L

# Function to remove spaces from names (for better matching)
def collapse(L):
    """
    Removes spaces from names to create single tokens.
    Example: ['Sam Worthington'] -> ['SamWorthington']
    """
    L1 = []
    for i in L:
        L1.append(i.replace(" ", ""))
    return L1

print("Feature extraction functions defined successfully!")

Feature extraction functions defined successfully!
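
The helpers above can be checked on a small literal before applying them to whole columns. The sketch below uses `json.loads` as an alternative parser, assuming the columns hold valid JSON (the double-quoted samples shown earlier suggest this). The function names `names_from_json`, `top_n_names`, and `directors_from_json` are hypothetical and are not part of this notebook. The genre sample mirrors the first row shown earlier; the second crew entry is made up.

```python
import json

def names_from_json(text):
    """Return the 'name' field of every object in a JSON array string."""
    return [item["name"] for item in json.loads(text)]

def top_n_names(text, n=3):
    """Return the first n names, mirroring the top-3 cast limit."""
    return names_from_json(text)[:n]

def directors_from_json(text):
    """Return names of crew members whose job is 'Director'."""
    return [item["name"] for item in json.loads(text) if item.get("job") == "Director"]

genres_sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
crew_sample = (
    '[{"job": "Director", "name": "James Cameron"},'
    ' {"job": "Editor", "name": "Example Editor"}]'  # second entry is made up
)

print(names_from_json(genres_sample))
print(top_n_names(genres_sample, n=1))
print(directors_from_json(crew_sample))
```

One reason to prefer a JSON parser here: `ast.literal_eval` expects Python literals, so JSON-only values such as `null` or `true` would raise an error, whereas `json.loads` handles them.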


In [None]:
# Apply feature extraction to genres column
movies['genres'] = movies['genres'].apply(convert)
print("Sample genres after conversion:")
print(movies['genres'].head(3))

Sample genres after conversion:
0    [Action, Adventure, Fantasy, Science Fiction]
1                     [Adventure, Fantasy, Action]
2                       [Action, Adventure, Crime]
Name: genres, dtype: object


In [None]:
# Apply feature extraction to keywords column
movies['keywords'] = movies['keywords'].apply(convert)
print("Sample keywords after conversion:")
print(movies['keywords'].head(3))

Sample keywords after conversion:
0    [culture clash, future, space war, space colon...
1    [ocean, drug abuse, exotic island, east india ...
2    [spy, based on novel, secret agent, sequel, mi...
Name: keywords, dtype: object


In [None]:
# Apply feature extraction to cast column (get all cast, then limit to top 3)
movies['cast'] = movies['cast'].apply(convert)
movies['cast'] = movies['cast'].apply(lambda x: x[0:3])
print("Sample cast after conversion (top 3):")
print(movies['cast'].head(3))

Sample cast after conversion (top 3):
0    [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1       [Johnny Depp, Orlando Bloom, Keira Knightley]
2        [Daniel Craig, Christoph Waltz, Léa Seydoux]
Name: cast, dtype: object


In [None]:
# Apply feature extraction to crew column (get director only)
movies['crew'] = movies['crew'].apply(fetch_director)
print("Sample crew (directors) after conversion:")
print(movies['crew'].head(3))

Sample crew (directors) after conversion:
0     [James Cameron]
1    [Gore Verbinski]
2        [Sam Mendes]
Name: crew, dtype: object


In [None]:
# Remove spaces from all extracted features to create single tokens
movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)
movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)

print("Features after removing spaces:")
movies[['title', 'genres', 'cast', 'crew']].head(3)

Features after removing spaces:


Unnamed: 0,title,genres,cast,crew
0,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,Spectre,"[Action, Adventure, Crime]","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]


In [None]:
# Convert overview to list of words
movies['overview'] = movies['overview'].apply(lambda x: x.split())
print("Sample overview after splitting:")
print(movies['overview'].head(1))

Sample overview after splitting:
0    [In, the, 22nd, century,, a, paraplegic, Marin...
Name: overview, dtype: object


In [None]:
# Create 'tags' column by combining all features
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
print("Tags column created by combining: overview + genres + keywords + cast + crew")
movies[['title', 'tags']].head(2)

Tags column created by combining: overview + genres + keywords + cast + crew


Unnamed: 0,title,tags
0,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."


In [None]:
# Create final dataframe with only required columns
new = movies.drop(columns=['overview', 'genres', 'keywords', 'cast', 'crew'])

# Convert tags list to string
new['tags'] = new['tags'].apply(lambda x: " ".join(x))

# Convert to lowercase for better matching
new['tags'] = new['tags'].apply(lambda x: x.lower())

print(f"Final dataset shape: {new.shape}")
new.head()

Final dataset shape: (4806, 3)


Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


# 3. Model / System Design

## AI Technique Used
**Content-Based Filtering using Natural Language Processing (NLP)**

## Architecture / Pipeline Explanation
1. **Data Preprocessing**: Extract and clean relevant features from JSON-like strings
2. **Feature Engineering**: Create a unified 'tags' column combining all text features
3. **Text Vectorization**: Convert text to numerical vectors using CountVectorizer (Bag of Words)
4. **Similarity Computation**: Calculate cosine similarity between movie vectors
5. **Recommendation Generation**: Find top-N most similar movies for any given movie

## Justification of Design Choices
- **CountVectorizer**: Simple yet effective for capturing word frequency; works well with combined text features
- **Cosine Similarity**: Measures angle between vectors, ideal for high-dimensional sparse data
- **Stop Words Removal**: Eliminates common words that don't contribute to meaning
- **Feature Combination**: Combining genres, keywords, cast, crew, and overview creates rich movie representations
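
To make the vectorization and similarity steps concrete, here is a minimal standard-library sketch of the same idea: count tokens per movie, then compare the count vectors by the cosine of the angle between them, i.e. the dot product divided by the product of the vector lengths. It is not the scikit-learn implementation used in this notebook, and the tag strings and titles are made up for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up tag strings in the style of the 'tags' column.
tags = {
    "space_film_a": "space war alien action sciencefiction",
    "space_film_b": "space colony alien adventure sciencefiction",
    "courtroom_film": "lawyer trial drama",
}
vectors = {title: bag_of_words(text) for title, text in tags.items()}

# Rank every other movie by similarity to the query, highest first.
query = "space_film_a"
ranked = sorted(
    (t for t in vectors if t != query),
    key=lambda t: cosine(vectors[query], vectors[t]),
    reverse=True,
)
print(ranked)
```

The two space films share three of their five tokens, giving a similarity of 3 / (√5 · √5) = 0.6, while the courtroom film shares none and scores 0.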

In [None]:
## Text Vectorization using CountVectorizer (Bag of Words)

from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with max 5000 features and English stop words removal
cv = CountVectorizer(max_features=5000, stop_words='english')

# Fit and transform the tags column
vector = cv.fit_transform(new['tags']).toarray()

print(f"Vector shape: {vector.shape}")
print(f"Number of movies: {vector.shape[0]}")
print(f"Number of features (vocabulary size): {vector.shape[1]}")

Vector shape: (4806, 5000)
Number of movies: 4806
Number of features (vocabulary size): 5000


In [None]:
# Display sample vocabulary words
print("Sample vocabulary words:")
print(cv.get_feature_names_out()[:20])

Sample vocabulary words:
['000' '007' '10' '100' '11' '12' '13' '14' '15' '16' '17' '17th' '18'
 '18th' '19' '1930s' '1940s' '1944' '1950' '1950s']


In [None]:
## Compute Cosine Similarity

from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity matrix
similarity = cosine_similarity(vector)

print(f"Similarity matrix shape: {similarity.shape}")
print(f"This is a {similarity.shape[0]}x{similarity.shape[1]} matrix where each cell represents")
print("the cosine similarity between two movies")

Similarity matrix shape: (4806, 4806)
This is a 4806x4806 matrix where each cell represents
the cosine similarity between two movies
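
The matrix shape also determines how much memory (and later, disk space) the precomputed similarities need. A back-of-the-envelope calculation, assuming the scores are stored as 8-byte floats (the usual NumPy default; the output above does not confirm the dtype):

```python
def dense_matrix_bytes(rows, cols, bytes_per_value=8):
    """Size of a dense matrix, assuming 8-byte (float64) values by default."""
    return rows * cols * bytes_per_value

n_movies = 4806  # from the similarity matrix shape printed above
size_bytes = dense_matrix_bytes(n_movies, n_movies)
size_mib = size_bytes / 1024**2

# The matrix is symmetric, so only the upper triangle (excluding the
# diagonal) holds distinct pairwise scores.
distinct_pairs = n_movies * (n_movies - 1) // 2

print(f"{size_bytes:,} bytes (about {size_mib:.0f} MiB)")
print(f"{distinct_pairs:,} distinct movie pairs")
```

Under that assumption the full matrix takes roughly 176 MiB, even though fewer than half of its cells carry distinct information.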


In [None]:
## Recommendation Function

def recommend(movie):
    """
    Recommends top 5 similar movies based on content similarity.
    
    Parameters:
    movie (str): Title of the movie to get recommendations for
    
    Returns:
    List of recommended movie titles
    """
    try:
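        # Find the row position of the movie. Use the position rather than the
        # index label: dropna() left gaps in the labels, while `similarity`
        # and `new.iloc` are both positional.
        index = np.flatnonzero((new['title'] == movie).to_numpy())[0]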
        
        # Get similarity scores for all movies with this movie
        distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x: x[1])
        
        # Get top 5 similar movies (excluding the movie itself)
        recommendations = []
        print(f"\nTop 5 movies similar to '{movie}':\n")
        print("-" * 50)
        for i, (idx, score) in enumerate(distances[1:6], 1):
            title = new.iloc[idx].title
            recommendations.append(title)
            print(f"{i}. {title} (Similarity: {score:.4f})")
        print("-" * 50)
        
        return recommendations
    except IndexError:
        print(f"Movie '{movie}' not found in database.")
        return []

print("Recommendation function defined successfully!")

Recommendation function defined successfully!
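
`recommend` requires an exact title match, so a typo or a different capitalization ends in the "not found" branch. One possible refinement is to suggest close titles in that case. The sketch below uses the standard library's `difflib`; the helper name `suggest_titles` and the 0.6 cutoff are assumptions made for illustration, not part of this notebook.

```python
import difflib

def suggest_titles(query, titles, limit=3):
    """Return up to `limit` known titles that resemble `query`, ignoring case."""
    lowered = {title.lower(): title for title in titles}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=0.6
    )
    return [lowered[m] for m in matches]

# Titles taken from the sample rows shown earlier in this notebook.
known_titles = ["Avatar", "Spectre", "The Dark Knight Rises", "John Carter"]

print(suggest_titles("avatr", known_titles))
print(suggest_titles("dark knight rises", known_titles))
```

In the notebook, `known_titles` would be `new['title'].tolist()`, and the suggestions could be printed from the `except IndexError` branch.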


# 5. Evaluation & Analysis

## Testing the Recommendation System
Let's test the recommendation system on movies from different genres and inspect the results qualitatively.

In [None]:
# Test Case 1: Action/Sci-Fi Movie
recommend('Avatar')


Top 5 movies similar to 'Avatar':

--------------------------------------------------
1. Titan A.E. (Similarity: 0.2537)
2. Small Soldiers (Similarity: 0.2511)
3. Ender's Game (Similarity: 0.2442)
4. Aliens vs Predator: Requiem (Similarity: 0.2426)
5. Independence Day (Similarity: 0.2417)
--------------------------------------------------


['Titan A.E.',
 'Small Soldiers',
 "Ender's Game",
 'Aliens vs Predator: Requiem',
 'Independence Day']

In [None]:
# Test Case 2: Superhero Movie
recommend('The Dark Knight')


Top 5 movies similar to 'The Dark Knight':

--------------------------------------------------
1. The Dark Knight Rises (Similarity: 0.4239)
2. Batman Begins (Similarity: 0.3939)
3. Batman Returns (Similarity: 0.3216)
4. Batman Forever (Similarity: 0.2879)
5. Batman & Robin (Similarity: 0.2679)
--------------------------------------------------


['The Dark Knight Rises',
 'Batman Begins',
 'Batman Returns',
 'Batman Forever',
 'Batman & Robin']

In [None]:
# Test Case 3: Drama/Biography
recommend('Gandhi')


Top 5 movies similar to 'Gandhi':

--------------------------------------------------
1. Gandhi, My Father (Similarity: 0.2611)
2. The Wind That Shakes the Barley (Similarity: 0.2474)
3. A Passage to India (Similarity: 0.2282)
4. Guiana 1838 (Similarity: 0.1936)
5. Ramanujan (Similarity: 0.1750)
--------------------------------------------------


['Gandhi, My Father',
 'The Wind That Shakes the Barley',
 'A Passage to India',
 'Guiana 1838',
 'Ramanujan']

In [None]:
# Test Case 4: Animation Movie
recommend('The Lego Movie')


Top 5 movies similar to 'The Lego Movie':

--------------------------------------------------
1. The Adventures of Rocky & Bullwinkle (Similarity: 0.2959)
2. Curious George (Similarity: 0.2874)
3. The Boxtrolls (Similarity: 0.2858)
4. Percy Jackson: Sea of Monsters (Similarity: 0.2800)
5. The Croods (Similarity: 0.2697)
--------------------------------------------------


['The Adventures of Rocky & Bullwinkle',
 'Curious George',
 'The Boxtrolls',
 'Percy Jackson: Sea of Monsters',
 'The Croods']
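
The test cases above are judged by eye. A simple numeric complement is the average genre overlap between the query movie and its recommendations, sketched here with Jaccard overlap. The helper names are hypothetical. The query genres are Avatar's, as printed earlier; the recommendation genre lists are placeholders, not values from the dataset.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two label sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_genre_overlap(query_genres, recommended_genres):
    """Average Jaccard overlap between the query and each recommendation."""
    scores = [jaccard(query_genres, g) for g in recommended_genres]
    return sum(scores) / len(scores) if scores else 0.0

query = ["Action", "Adventure", "Fantasy", "Science Fiction"]
# Hypothetical genre lists for three recommendations (illustrative only).
recommended = [
    ["Action", "Science Fiction"],        # overlap 2/4
    ["Adventure", "Fantasy", "Family"],   # overlap 2/5
    ["Drama"],                            # overlap 0/5
]
print(round(mean_genre_overlap(query, recommended), 3))
```

Since genres are themselves part of the tags, this check is partly circular. It can flag obviously unrelated recommendations, but it is not a substitute for evaluation against user feedback.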

In [None]:
## Performance Analysis & Evaluation Metrics

print("=" * 60)
print("📊 MODEL PERFORMANCE ANALYSIS")
print("=" * 60)

print(f"\n🎬 Dataset Statistics:")
print(f"   • Total movies in database: {len(new)}")
print(f"   • Feature vector dimensions: {vector.shape[1]}")
print(f"   • Similarity matrix size: {similarity.shape[0]} × {similarity.shape[1]}")
print(f"   • Total similarity calculations: {similarity.shape[0] * similarity.shape[1]:,}")

print(f"\n📈 Evaluation Metrics Used:")
print("   • Cosine Similarity: Measures the cosine of the angle between two vectors")
print("   • Range: 0 (completely different) to 1 (identical)")
print("   • Higher similarity score = more similar movies")

print(f"\n📊 Similarity Score Distribution:")
# Get upper triangle of similarity matrix (excluding diagonal)
upper_tri = similarity[np.triu_indices(len(similarity), k=1)]
print(f"   • Mean similarity: {upper_tri.mean():.4f}")
print(f"   • Max similarity: {upper_tri.max():.4f}")
print(f"   • Min similarity: {upper_tri.min():.4f}")
print(f"   • Std deviation: {upper_tri.std():.4f}")

print(f"\n⚠️ Model Limitations:")
print("   • Only considers content features, not user preferences")
print("   • Limited to movies in the dataset (no new releases)")
print("   • Doesn't account for movie quality/ratings")
print("   • May miss movies with different metadata but similar themes")
print("   • Cast limited to top 3 actors may miss important connections")

print(f"\n✅ Model Strengths:")
print("   • No cold-start problem for new users")
print("   • Transparent and explainable recommendations")
print("   • Fast inference with pre-computed similarity matrix")
print("   • Works without user history data")

MODEL PERFORMANCE ANALYSIS

📊 Dataset Statistics:
   - Total movies in database: 4806
   - Feature vector dimensions: 5000
   - Similarity matrix size: 4806x4806

📈 Metrics Used:
   - Cosine Similarity: Measures the cosine of the angle between two vectors
   - Range: 0 (completely different) to 1 (identical)
   - Higher similarity score = more similar movies

⚠️ Limitations:
   - Only considers content features, not user preferences
   - Limited to movies in the dataset (no new releases)
   - Doesn't account for movie quality/ratings
   - May miss movies with different metadata but similar themes


# 6. Ethical Considerations & Responsible AI

## Bias and Fairness Considerations
- **Popularity Bias**: Dataset may over-represent popular Western/Hollywood movies
- **Cultural Bias**: Limited representation of international cinema
- **Historical Bias**: Older movies may have less detailed metadata
- **Gender/Diversity**: Cast-based recommendations may perpetuate existing industry biases

## Dataset Limitations
- Limited to ~5000 movies (subset of all movies ever made)
- English-centric metadata and descriptions
- Missing recent movies (dataset has a cutoff date)
- Quality of metadata varies across movies

## Responsible Use of AI Tools
- Recommendations should supplement, not replace, human choice
- Users should be aware that recommendations are based on content similarity only
- The system doesn't consider age-appropriateness or content warnings
- Should be combined with additional filtering for production use

# 7. Conclusion & Future Scope

## Summary of Results
- Successfully built a content-based movie recommendation system
- System uses movie metadata (genres, keywords, cast, crew, overview) for recommendations
- CountVectorizer creates 5000-dimensional feature vectors for each movie
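- Cosine similarity over these vectors ranks movies by how much of their tag vocabulary they share
- In the four test cases the top results appear topically related (for example, 'The Dark Knight' returns five other Batman films); this is a qualitative check, not a quantitative evaluation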

## Possible Improvements and Extensions
1. **Hybrid Approach**: Combine with collaborative filtering for better recommendations
2. **TF-IDF Vectorization**: Use TF-IDF instead of simple counts for better feature importance
3. **Word Embeddings**: Use Word2Vec or BERT for semantic understanding
4. **Include Ratings**: Factor in movie ratings for quality-aware recommendations
5. **User Profiles**: Add user preference modeling for personalization
6. **Real-time Updates**: Integrate with TMDB API for latest movies
7. **Web Application**: Deploy using Streamlit/Flask for user interaction
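
To illustrate the TF-IDF idea from item 2, the sketch below hand-rolls the weighting with the standard library: each token count is multiplied by ln(N / df), where N is the number of documents and df is the number of documents containing the token. This is a textbook variant chosen for clarity; scikit-learn's `TfidfVectorizer` uses a smoothed formula and normalizes the vectors by default, so its numbers would differ. The tag strings are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {token: tf * idf} dict per document, with idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors

# Made-up tag strings; "movie" appears in every document.
docs = [
    "movie space alien",
    "movie space colony",
    "movie lawyer trial",
]
vecs = tfidf_vectors(docs)
for vec in vecs:
    print({t: round(w, 3) for t, w in vec.items()})
```

A token that appears in every document ("movie") gets weight 0, while a token unique to one document ("alien") outweighs one shared by two ("space"). With plain counts all three would contribute equally.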

In [None]:
## Save Model Artifacts for Deployment

import pickle

# Save the processed movie data
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(new, f)

# Save the similarity matrix
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)

print("=" * 60)
print("✅ MODEL ARTIFACTS SAVED SUCCESSFULLY!")
print("=" * 60)
print("\nFiles created:")
print("   📁 movie_list.pkl - Contains processed movie data")
print("   📁 similarity.pkl - Contains precomputed similarity matrix")
print("\nThese files can be loaded in a Streamlit/Flask web application for deployment.")

✅ MODEL ARTIFACTS SAVED SUCCESSFULLY!

Files created:
   📁 movie_list.pkl - Contains processed movie data
   📁 similarity.pkl - Contains precomputed similarity matrix

These files can be loaded in a Streamlit/Flask web application for deployment.
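
On the loading side, a web application would read the two pickle files back at startup. A minimal sketch of that round trip, using hypothetical helper names (`save_artifact`, `load_artifact`) and a small stand-in object rather than the real artifacts:

```python
import os
import pickle
import tempfile

def save_artifact(obj, path):
    """Pickle obj to path, closing the file even if dumping fails."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)

def load_artifact(path):
    """Load a pickled object. Only use on files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Round-trip a small stand-in object; a real app would load
# 'movie_list.pkl' and 'similarity.pkl' from the cell above instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    save_artifact({"titles": ["Avatar", "Spectre"]}, demo_path)
    restored = load_artifact(demo_path)

print(restored)
```

Unpickling can execute arbitrary code, so these files should only be loaded from a trusted source. Loading the DataFrame also requires pandas to be installed in the application environment.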
