# Movie Recommendation System 

## Overview 🎬
This Movie Recommendation System suggests movies based on a user-provided movie title. It leverages content-based filtering using TF-IDF and cosine similarity, enhanced by hybrid recommendations that combine content-based and collaborative filtering techniques. The system also integrates with the OMDB API to fetch movie details and display movie posters for an engaging user experience.

## Key Features 🚀

- **Movie Search:** Enter any movie title and retrieve recommendations.
- **Content-Based Recommendations:** Suggests movies similar in plot, genre, and other metadata.
- **Hybrid Recommendation System:** Combines content and collaborative filtering approaches for improved accuracy.
- **Pagination Support:** Browse recommended movies in pages for easier navigation.
- **Movie Details:** Displays movie overview, genres, release year, vote average, and IMDb links.
- **Poster Images:** Fetches movie posters dynamically from the OMDB API.
- **Customizable UI:** Choose between dark and light themes with CSS styling, powered by Streamlit.

## Project Architecture

### Frontend
- Built with **Streamlit** for a reactive, easy-to-use interface.
- Uses CSS files (`style_dark.css` and `style_light.css`) for styling themes.
- Integrates with OMDB API to fetch movie posters and details.

### Backend
- Written in Python.
- Uses a pre-trained recommendation model (`movie_recommender.pkl`) containing a DataFrame and TF-IDF matrix.
- Content-based filtering implemented with **scikit-learn**'s cosine similarity (`linear_kernel`).
- Movie metadata stored in a DataFrame loaded from a pickle file.

## Prerequisites 🔧

- Python 3.7+
- OMDB API Key (Get yours for free from [OMDB API](http://www.omdbapi.com/apikey.aspx))
- Required Python packages (install via `pip install -r requirements.txt`):

```plaintext
pandas
numpy
scikit-learn
requests
streamlit
fuzzywuzzy

In [1]:
# System and Data Handling
import ast
import re
import os
import time
import difflib

# Data Handling
import numpy as np
import pandas as pd

# Machine Learning Handling
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from fuzzywuzzy import process



In [2]:
df = pd.read_csv('TMDB_movie_dataset_v11.csv', low_memory=False)
print("Dataset loaded with shape:", df.shape)
df.head(2)

Dataset loaded with shape: (1298530, 24)


Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."


3. Data Cleaning
Drop duplicates
Select important columns
Fill missing values

In [3]:
# Drop duplicates and reset index
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)

# Keep only relevant columns if they exist in the DataFrame
keep_cols = set([
    'id', 'title', 'vote_average', 'vote_count', 'status', 'release_date', 'revenue', 
    'runtime', 'adult', 'backdrop_path', 'budget', 'homepage', 'imdb_id', 'original_language', 
    'original_title', 'overview', 'popularity', 'poster_path', 'tagline', 'genres', 
    'production_companies', 'production_countries', 'spoken_languages', 'keywords'
])

# Only keep columns that exist in the dataframe
df = df[[col for col in keep_cols if col in df.columns]]

# Fill missing data (NA) for relevant columns
fill_columns = ['title', 'overview', 'genres', 'keywords', 'imdb_id' 'cast', 'crew']
for col in fill_columns:
    if col in df.columns:
        df[col].fillna('', inplace=True)

# Output remaining columns and preview
print("Remaining columns:", df.columns.tolist())
df.head(2)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna('', inplace=True)


Remaining columns: ['production_companies', 'popularity', 'runtime', 'genres', 'poster_path', 'imdb_id', 'backdrop_path', 'vote_average', 'budget', 'release_date', 'adult', 'status', 'original_title', 'revenue', 'title', 'production_countries', 'overview', 'original_language', 'spoken_languages', 'id', 'homepage', 'vote_count', 'tagline', 'keywords']


Unnamed: 0,production_companies,popularity,runtime,genres,poster_path,imdb_id,backdrop_path,vote_average,budget,release_date,...,title,production_countries,overview,original_language,spoken_languages,id,homepage,vote_count,tagline,keywords
0,"Legendary Pictures, Syncopy, Warner Bros. Pict...",83.952,148,"Action, Science Fiction, Adventure",/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,tt1375666,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,8.364,160000000,2010-07-15,...,Inception,"United Kingdom, United States of America","Cobb, a skilled thief who commits corporate es...",en,"English, French, Japanese, Swahili",27205,https://www.warnerbros.com/movies/inception,34495,Your mind is the scene of the crime.,"rescue, mission, dream, airplane, paris, franc..."
1,"Legendary Pictures, Syncopy, Lynda Obst Produc...",140.241,169,"Adventure, Drama, Science Fiction",/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,tt0816692,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,8.417,165000000,2014-11-05,...,Interstellar,"United Kingdom, United States of America",The adventures of a group of explorers who mak...,en,English,157336,http://www.interstellarmovie.net/,32571,Mankind was born on Earth. It was never meant ...,"rescue, future, spacecraft, race against time,..."


important columns

In [4]:
# Check the first few entries in the 'genres' column to inspect the data format
df['genres'].head(10)

0             Action, Science Fiction, Adventure
1              Adventure, Drama, Science Fiction
2                 Drama, Action, Crime, Thriller
3    Action, Adventure, Fantasy, Science Fiction
4             Science Fiction, Action, Adventure
5                      Action, Adventure, Comedy
6             Adventure, Action, Science Fiction
7                                          Drama
8             Action, Science Fiction, Adventure
9                                Thriller, Crime
Name: genres, dtype: object

In [5]:
# Clean genres column: Handle plain string values and properly formatted lists/dictionaries
def parse_genres(x):
    if isinstance(x, str):
        try:
            # Try to parse as a list (if it's a string like "['genre1', 'genre2']")
            parsed = ast.literal_eval(x)
            if isinstance(parsed, list):
                return parsed
            elif isinstance(parsed, dict) and 'name' in parsed:
                return [parsed['name']]  # Single genre as a dictionary
        except (ValueError, SyntaxError):
            # Handle plain comma-separated genres like "Action, Adventure"
            return [genre.strip() for genre in x.split(',')]  # Split by comma
    return x  # If already in a proper format, return as is

# Apply parsing to 'genres' column
df['genres'] = df['genres'].apply(parse_genres)

# Extract genre names from lists of dictionaries (if necessary)
df['genres'] = df['genres'].apply(lambda x: [genre['name'] if isinstance(genre, dict) else genre for genre in x])

# Now extract unique genres
genres = df['genres'].explode().dropna().unique()
print("Unique genres:", genres)


Unique genres: ['Action' 'Science Fiction' 'Adventure' 'Drama' 'Crime' 'Thriller'
 'Fantasy' 'Comedy' 'Romance' 'Western' 'Mystery' 'War' 'Animation'
 'Family' 'Horror' 'Music' 'History' 'TV Movie' 'Documentary' '']


In [6]:
# Display the first few rows of the 'overview' column, ensuring no missing or empty values
df['overview'].fillna('', inplace=True)

# Optionally, truncate the overview text to a certain length for easier viewing (e.g., 200 characters)
df['overview'].head().apply(lambda x: x[:200] + '...' if len(x) > 200 else x)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['overview'].fillna('', inplace=True)


0    Cobb, a skilled thief who commits corporate es...
1    The adventures of a group of explorers who mak...
2    Batman raises the stakes in his war on crime. ...
3    In the 22nd century, a paraplegic Marine is di...
4    When an unexpected enemy emerges and threatens...
Name: overview, dtype: object

In [7]:
# Fill missing values in the 'keywords' column
df['keywords'].fillna('', inplace=True)

# Function to parse the 'keywords' column
def parse_keywords(x):
    if isinstance(x, str):
        try:
            # Attempt to parse the string as a list (e.g., string representation of a list)
            parsed = ast.literal_eval(x)
            if isinstance(parsed, list):
                return parsed  # Return the list
        except (ValueError, SyntaxError):
            # If parsing fails, assume the string is a comma-separated list of keywords
            return [keyword.strip() for keyword in x.split(',')]  # Split by commas into list
    return x  # If already in list form, return as is

# Apply the parsing function to the 'keywords' column
df['keywords'] = df['keywords'].apply(parse_keywords)

# Convert list of keywords to a comma-separated string for easier viewing (if needed)
df['keywords'] = df['keywords'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)

# Display the first few rows to check the results
df['keywords'].head()


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['keywords'].fillna('', inplace=True)


0    rescue, mission, dream, airplane, paris, franc...
1    rescue, future, spacecraft, race against time,...
2    joker, sadism, chaos, secret identity, crime f...
3    future, society, culture clash, space travel, ...
4    new york city, superhero, shield, based on com...
Name: keywords, dtype: object

4. Feature Engineering
Convert genres, keywords, cast, and crew into plain text and combine them with the overview.

In [8]:
# Function to extract names from the list or dict
def extract_names(x):
    if pd.isna(x) or x == '':
        return ''
    try:
        # Try to parse the string as a list or dict
        parsed = ast.literal_eval(x)
        
        if isinstance(parsed, list):
            # Extract 'name' from each dictionary in the list
            return ' '.join([str(d.get('name', '')) for d in parsed if isinstance(d, dict)])
        
        if isinstance(parsed, dict):
            # Extract values from the dictionary (assuming they are strings or ints)
            return ' '.join([str(v) for v in parsed.values() if isinstance(v, (str, int))])
        
    except (ValueError, SyntaxError):  # Specific exceptions
        pass
    
    # Fallback: return cleaned string (remove unwanted characters)
    return re.sub(r'[^\w\s]', ' ', str(x))  # Cleaner regex for alphanumeric and spaces

# Apply the extraction function to relevant columns
for c in ['genres', 'keywords', 'cast', 'crew']:
    if c in df.columns:
        df[c + '_clean'] = df[c].apply(lambda x: extract_names(x) if isinstance(x, str) else x)

# Components for soup creation
components = ['overview', 'genres_clean', 'keywords_clean', 'cast_clean', 'crew_clean']
components = [c for c in components if c in df.columns]

# Function to create the 'soup' column by joining relevant columns
def create_soup(row):
    return ' '.join([str(row.get(c, '')) for c in components])

# Apply the function to create the 'soup' column
df['soup'] = df.apply(create_soup, axis=1)

# Display the first 3 rows with the 'title' and 'soup' columns
df[['title', 'soup']].head(3)


Unnamed: 0,title,soup
0,Inception,"Cobb, a skilled thief who commits corporate es..."
1,Interstellar,The adventures of a group of explorers who mak...
2,The Dark Knight,Batman raises the stakes in his war on crime. ...


5. TF-IDF Vectorization
Transform text soup into numerical vectors using TF-IDF.

In [9]:
# Vectorizing the 'soup' column
tfidf = TfidfVectorizer(stop_words='english', max_features=50000)
tfidf_matrix = tfidf.fit_transform(df['soup'])

# If memory usage is a concern, apply dimensionality reduction (SVD)
svd = TruncatedSVD(n_components=100, random_state=42)
tfidf_matrix_reduced = svd.fit_transform(tfidf_matrix)

# Output the shape of the matrix
print("TF-IDF matrix shape (original):", tfidf_matrix.shape)
print("TF-IDF matrix shape (reduced):", tfidf_matrix_reduced.shape)

TF-IDF matrix shape (original): (1298153, 50000)
TF-IDF matrix shape (reduced): (1298153, 100)


6. linear_kernel

In [10]:
def recommend(movie_index, tfidf_matrix, top_n=10):
    # Compute the similarity between the target movie and all others
    sim = linear_kernel(tfidf_matrix[movie_index], tfidf_matrix).flatten()
    
    # Sort the similarities in descending order and get the top n indices
    top_idx = sim.argsort()[-top_n-1:-1][::-1]
    
    # Optional: If you are working with large datasets, consider filtering movies
    # that have a very low similarity score (e.g., threshold < 0.1)
    
    return top_idx

In [11]:
# Example movie index
movie_index = 0

# Ensure the movie_index is within the valid range
if 0 <= movie_index < len(df):
    # Get top recommendations based on movie_index
    indices = recommend(movie_index, tfidf_matrix, top_n=10)
    print("Top recommended indices:", indices)
    
    # Optionally, show titles of the recommended movies
    print("Top recommended movies:")
    for idx in indices:
        print(f"Title: {df.iloc[idx]['title']} (Index: {idx})")
else:
    print(f"Invalid movie index: {movie_index}. Please provide a valid index between 0 and {len(df)-1}.")

Top recommended indices: [ 916600  459858  807558  409485 1175773  368328 1041302 1197618 1080212
 1025729]
Top recommended movies:
Title: Slumber is Golden (Index: 916600)
Title: The Universe App (Index: 459858)
Title: Inception (Index: 807558)
Title: And For The First Time, I Was Free (Index: 409485)
Title: Mystery Reel (Index: 1175773)
Title: Sound of Waves (Index: 368328)
Title: Sweet Dreams (Index: 1041302)
Title: Domo Dreams (Index: 1197618)
Title: Dream Lover (Index: 1080212)
Title: Reflections (Index: 1025729)


show recommendation

In [12]:
def show_recommendations(movie_index, tfidf_matrix, top_n=10):
    # Get the indices of the top n similar movies
    idxs = recommend(movie_index, tfidf_matrix, top_n)
    
    # Get movie details: title, overview, and additional info if desired
    recommended_movies = df.loc[idxs, ['title', 'overview', 'release_date', 'imdb_id', 'vote_average', 'genres']]
    
    # Handle missing values for release_date, imdb_id, and vote_average
    recommended_movies['release_date'] = recommended_movies['release_date'].fillna('Release date not available')
    recommended_movies['imdb_id'] = recommended_movies['imdb_id'].fillna('No IMDb ID available')
    recommended_movies['vote_average'] = recommended_movies['vote_average'].fillna('No rating available')
    
    # Format genres: If genres are in a list, join them into a string
    recommended_movies['genres'] = recommended_movies['genres'].apply(
        lambda x: ', '.join([genre['name'] for genre in ast.literal_eval(x)] if isinstance(x, str) else x)
    )
    
    return recommended_movies

# Show recommendations for movie at index 0
show_recommendations(0, tfidf_matrix, top_n=10)

Unnamed: 0,title,overview,release_date,imdb_id,vote_average,genres
916600,Slumber is Golden,A young woman named Tyler Ann travels into her...,Release date not available,tt10929300,0.0,"Fantasy, Adventure, Drama"
459858,The Universe App,Enter deep into your subconscious...,2020-04-18,No IMDb ID available,0.0,
807558,Inception,"Dom Cobb is a skilled thief, the best in the f...",Release date not available,No IMDb ID available,0.0,
409485,"And For The First Time, I Was Free",A subconscious journey towards salvation.,2023-06-01,No IMDb ID available,0.0,
1175773,Mystery Reel,A look into the subconscious of the 20th century.,2022-08-19,No IMDb ID available,0.0,
368328,Sound of Waves,Waves break upon a young woman's subconscious.,Release date not available,tt8818400,0.0,
1041302,Sweet Dreams,An exploration of subconscious environments.,2010-02-01,No IMDb ID available,0.0,
1197618,Domo Dreams,A dream house travels across space and time in...,Release date not available,No IMDb ID available,0.0,
1080212,Dream Lover,A man dreams of a woman who is now lost in his...,2011-06-06,tt1984148,0.0,Thriller
1025729,Reflections,A young man finds himself amidst a battle with...,2017-01-23,No IMDb ID available,0.0,


In [13]:
def get_index_from_title(title):
    # Ensure the title is provided
    if not title or title.strip() == '':
        raise ValueError("Please provide a valid movie title.")
    
    # Attempt to find the index of the movie based on title
    matched_indices = df[df['title'].str.lower() == title.lower()].index.values
    
    if len(matched_indices) == 0:
        raise ValueError(f"Movie with title '{title}' not found in the dataset.")
    
    return matched_indices[0]

# Example title to search
movie_title = "The Dark Knight"

try:
    # Get the index from title and display recommendations
    movie_index = get_index_from_title(movie_title)
    recommendations = show_recommendations(movie_index, tfidf_matrix, top_n=10)
    print(recommendations)
except ValueError as e:
    print(e)

                                        title  \
487                                    Batman   
6941     Batman: The Long Halloween, Part Two   
6089     Batman: The Long Halloween, Part One   
25                      The Dark Knight Rises   
4036             Batman: Mask of the Phantasm   
2867               Batman: Under the Red Hood   
1204788                  The Batman - Part II   
538346                   Batman Gotham Awaits   
766752                          Dying Is Easy   
384550                           Joker rising   

                                                  overview  \
487      Batman must face his most ruthless nemesis whe...   
6941     As Gotham City's young vigilante, the Batman, ...   
6089     Following a brutal series of murders taking pl...   
25       Following the death of District Attorney Harve...   
4036     When a powerful criminal, who is connected to ...   
2867     One part vigilante, one part criminal kingpin,...   
1204788                Seq

7. Recommendation Function¶
We define a function to recommend similar movies given a title.

In [14]:
def recommend_movies(title, n=10):
    # Use fuzzy matching to find the closest title match in case of typos
    title_match = process.extractOne(title, df['title'])
    
    if title_match is None or title_match[1] < 80:  # Match threshold (you can adjust this)
        return f"Movie '{title}' not found or doesn't match well enough."
    
    idx = df[df['title'] == title_match[0]].index[0]
    
    # Get the top recommended movie indices based on similarity
    movie_indices = recommend(idx, tfidf_matrix, top_n=n)
    
    # Return a DataFrame with selected columns
    recommended_movies = df[['title', 'release_date', 'imdb_id', 'vote_average', 'vote_count']].iloc[movie_indices]
    
    return recommended_movies

# Example usage
recommend_movies("Avatar", 5)

Unnamed: 0,title,release_date,imdb_id,vote_average,vote_count
174332,The Brother from Space,1988-01-01,tt0380368,2.0,3
6458,Cosmic Sin,2021-03-12,tt11762434,4.139,495
151383,Space Terror,2021-03-12,tt14419274,9.7,3
645206,The Galaxy War: A Vingança do Rei,2024-08-05,,0.0,0
384815,Stort,,,0.0,0


8. Save Model Artifacts

In [15]:
import pickle

#save data and model
with open("movie_recommender.pkl", "wb") as f:
    pickle.dump((df, tfidf_matrix), f)

print("✅ Model and data saved successfully!")

✅ Model and data saved successfully!
