<a href="https://colab.research.google.com/github/sujal-dhawan/Movie-Recommendation-System/blob/main/Movie_Recommendation_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommendation System 🎬

Welcome to the **Movie Recommendation System** project! This system recommends movies based on **genre similarity** and **director information** using the **MovieLens dataset**.

## Project Overview
- **Objective**: Build a content-based movie recommendation system.
- **Features**:
  - Genre-based recommendations using **TF-IDF** and **cosine similarity**.
  - Director-based recommendations (simulated).
  - Data visualizations for ratings distribution and genre popularity.
  - User-friendly interface for real-time recommendations.
- **Technologies**: Python, pandas, NumPy, scikit-learn, Matplotlib, Seaborn, fuzzywuzzy, Gradio.

## How to Use
1. Run the cells in order to load the dataset and preprocess the data.
2. Enter a movie title in the Gradio interface to get recommendations.
3. Explore the visualizations and model evaluation metrics.

Let’s get started! 🚀

In [None]:
!pip install scikit-learn pandas numpy matplotlib seaborn ipywidgets fuzzywuzzy gradio

# Core libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn modules for text feature extraction and similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error

# Fuzzy matching for approximate title matching
from fuzzywuzzy import process

# Gradio for deployment as an interactive web app
import gradio as gr


## Load and Preprocess the Dataset

In this step, we:
1. **Download the MovieLens dataset** (`ml-latest-small`) using `wget` and unzip it.
2. **Load the movies and ratings data** into pandas DataFrames.
3. **Preprocess the data** (this happens in the first code cell of the next section):
   - Split the `genres` column on the `|` separator.
   - Handle missing values in the `genres` column.
   - Normalize movie titles by converting them to lowercase and stripping surrounding whitespace.

This prepares the dataset for further analysis and modeling.
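
To make the preprocessing steps concrete, here is a minimal plain-Python sketch on a few made-up rows. The rows, the `preprocess` helper, and the output field names are assumptions for illustration; the notebook itself does this with pandas string methods.

```python
# Toy rows in the movies.csv layout (illustrative values, not real data)
rows = [
    {"title": "  Toy Story (1995) ", "genres": "Adventure|Animation|Children"},
    {"title": "Heat (1995)", "genres": "Action|Crime|Thriller"},
    {"title": "Unknown Film (2000)", "genres": None},  # missing genres
]

def preprocess(row):
    """Return a cleaned copy of one row (hypothetical helper)."""
    genres = row["genres"] or ""  # treat a missing value as an empty string
    genre_list = genres.split("|") if genres else []
    return {
        "title": row["title"],
        "title_normalized": row["title"].lower().strip(),
        "genres": genre_list,
    }

cleaned = [preprocess(r) for r in rows]
for r in cleaned:
    print(r["title_normalized"], r["genres"])
```

Keeping the original `title` next to the normalized one means lookups can be case-insensitive while the displayed titles stay unchanged.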

In [None]:
# Download and unzip the dataset (MovieLens latest-small)
!wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip -o ml-latest-small.zip

# Load the movies and ratings datasets
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')


## Build the Recommendation Engine

In this step, we:
1. **Create a TF-IDF matrix** for movie genres using `TfidfVectorizer`. This converts the genres into numerical features.
2. **Compute cosine similarity** between movies based on their genre features.
3. **Map movie titles to indices** for easy lookup.
4. **Find the closest matching title** using fuzzy matching (to handle typos or incorrect inputs).
5. **Generate recommendations**:
   - If the input movie title is not found, suggest the closest match.
   - Use cosine similarity to recommend the top 10 most similar movies.

This forms the core of the content-based recommendation system.
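
The sketch below shows, on three made-up genre strings, what the TF-IDF and cosine-similarity steps compute. It is a simplified pure-Python stand-in, not the notebook's implementation: it tokenizes on whitespace only and uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, with L2 normalization, which is the default weighting described in the scikit-learn documentation. The names `docs`, `tfidf_vectors`, and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Toy genre strings (illustrative, not taken from the dataset)
docs = {
    "Movie A": "Action Adventure",
    "Movie B": "Action Thriller",
    "Movie C": "Comedy Romance",
}

def tfidf_vectors(texts):
    """Build L2-normalized TF-IDF vectors with smoothed IDF."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(tokenized)
    # Document frequency: number of texts that contain each token
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors

def cosine(u, v):
    # The vectors are unit length, so the dot product equals the cosine
    return sum(a * b for a, b in zip(u, v))

titles = list(docs)
vectors = tfidf_vectors(list(docs.values()))

# Rank every other movie by similarity to the first one
query = 0
scores = sorted(
    ((cosine(vectors[query], vectors[i]), titles[i])
     for i in range(len(titles)) if i != query),
    reverse=True,
)
for score, title in scores:
    print(f"{title}: {score:.3f}")
```

"Movie B" ranks first because it shares the "action" token with the query, while "Movie C" shares no genre and scores zero.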

In [None]:
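# Split genres on '|' and fill missing values; the final astype(str) stores each
# list as text (e.g. "['Action', 'Comedy']") so TfidfVectorizer can tokenize it
movies['genres'] = movies['genres'].str.split('|')
movies['genres'] = movies['genres'].fillna('').astype(str)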

# Normalize movie titles (for matching purposes)
movies['title_normalized'] = movies['title'].str.lower().str.strip()

# For demonstration: Add a dummy 'director' column (simulate director information)
np.random.seed(42)  # Ensure reproducibility
movies['director'] = np.random.choice(['Director A', 'Director B', 'Director C'], size=len(movies))


In [None]:
# Create a TF-IDF matrix for the movie genres
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['genres'])

# Compute the cosine similarity matrix based on genres
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

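# Create a mapping of normalized movie titles to DataFrame indices.
# Titles can repeat; keep the first occurrence of each so that a lookup
# returns a single index rather than a Series
indices = pd.Series(movies.index, index=movies['title_normalized'])
indices = indices[~indices.index.duplicated(keep='first')]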


## Add Director-Based Recommendations

In this step, we:
1. **Simulate a director column** by randomly assigning directors (`Director A`, `Director B`, `Director C`) to movies.
2. **Create a function** to recommend movies by the same director:
   - Normalize the input title and check if it exists in the dataset.
   - If the movie is found, recommend up to 10 movies by the same director.
   - If the movie is not found, display an error message.

Because the directors are randomly assigned, this step only demonstrates the mechanism; real director metadata would be needed for meaningful results.
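
The lookup described in the steps above can be sketched in plain Python on a toy catalogue. The catalogue entries and the `same_director` helper are assumptions for illustration. The sketch also leaves the queried movie out of its own results, which is a design choice of this example.

```python
# Toy catalogue with made-up director labels (illustrative only)
catalogue = [
    {"title": "Movie A", "director": "Director A"},
    {"title": "Movie B", "director": "Director B"},
    {"title": "Movie C", "director": "Director A"},
    {"title": "Movie D", "director": "Director A"},
]

def same_director(title, catalogue, limit=10):
    """Return (titles, message) for movies sharing the query's director."""
    title_norm = title.lower().strip()
    match = next(
        (m for m in catalogue if m["title"].lower().strip() == title_norm),
        None,
    )
    if match is None:
        return None, f"Movie '{title}' not found."
    others = [
        m["title"]
        for m in catalogue
        if m["director"] == match["director"] and m is not match
    ]
    return others[:limit], f"Director: {match['director']}"

print(same_director("  movie a ", catalogue))
print(same_director("Missing Movie", catalogue))
```

Returning a `(result, message)` pair in both the found and not-found cases lets the caller display the message without extra branching.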

In [None]:
# Function to find the closest matching title using fuzzy matching
def find_closest_title(title, titles_list):
    if isinstance(titles_list, pd.Series):
        titles_list = titles_list.tolist()
    closest_match, score = process.extractOne(title, titles_list)
    if score >= 80:  # Adjust threshold if needed
        return closest_match
    else:
        return None

# Function to get content-based recommendations
def get_recommendations(title, cosine_sim=cosine_sim):
    title_norm = title.lower().strip()
    message = ""

    if title_norm not in indices:
        # Attempt to find a close match if the title is not found
        closest_title = find_closest_title(title_norm, movies['title_normalized'])
        if closest_title is None:
            return None, f"Movie '{title}' not found. Please check the title and try again."
        else:
            message = f"Movie '{title}' not found. Showing results for '{closest_title}':\n"
            title_norm = closest_title

    idx = indices[title_norm]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # Top 10 similar movies (excluding the query movie)
    movie_indices = [i[0] for i in sim_scores]
    recommendations = movies['title'].iloc[movie_indices].tolist()
    return recommendations, message

# Function to recommend movies by the same director
def recommend_by_director(title):
    title_norm = title.lower().strip()
    matches = movies[movies['title_normalized'] == title_norm]
    if matches.empty:
        return None, f"Movie '{title}' not found. Please check the title and try again."
    director = matches['director'].iloc[0]
    # Same director, excluding the query movie itself
    same_director = movies[(movies['director'] == director) &
                           (movies['title_normalized'] != title_norm)]
    director_movies = same_director['title'].head(10).tolist()
    message = f"Director: {director}"
    return director_movies, message


## Visualize Movie Ratings Distribution

In this step, we:
1. **Create a count plot** to visualize the distribution of movie ratings.
2. **Customize the plot**:
   - Add a title and axis labels with larger fonts for clarity.
   - Use a color palette (`viridis`) and edge color for better aesthetics.
   - Add a grid for improved readability.

This visualization helps us understand how ratings are distributed across the dataset.

In [None]:
# Plot the distribution of movie ratings
plt.figure(figsize=(10, 6))
sns.countplot(x='rating', data=ratings, palette='viridis', edgecolor='black')
plt.title('Distribution of Movie Ratings', fontsize=16, fontweight='bold')
plt.xlabel('Rating', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

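# Plot the top 10 most popular genres.
# 'genres' holds the string form of a list at this point, so extract the quoted
# genre names back into real lists before exploding
movies_exploded = movies.assign(
    genres=movies['genres'].str.findall(r"'([^']*)'")
).explode('genres')
genre_counts = movies_exploded['genres'].value_counts()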
plt.figure(figsize=(12, 6))
genre_counts.head(10).plot(kind='bar', color='orange', edgecolor='black')
plt.title('Top 10 Movie Genres by Popularity', fontsize=16, fontweight='bold')
plt.xlabel('Genre', fontsize=14)
plt.ylabel('Number of Movies', fontsize=14)
plt.xticks(rotation=45, fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


## Visualize Genre Popularity

The second plot in the cell above covers genre popularity. There, we:
1. **Explode the genres column** to separate each genre into its own row.
2. **Count the occurrences** of each genre to determine popularity.
3. **Create a bar plot** to visualize the top 10 most popular genres:
   - Use a vibrant color (`orange`) and edge color for better aesthetics.
   - Add a title and axis labels with larger fonts for clarity.
   - Rotate x-axis labels for better readability.
   - Add a grid for improved visualization.

This helps us understand which genres are most common in the dataset.
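
The explode-and-count idea can be illustrated without pandas. This is a small sketch on made-up genre strings using `collections.Counter` as a stand-in for `explode` followed by `value_counts`; the sample data is an assumption.

```python
from collections import Counter

# Made-up pipe-separated genre strings, one per movie
genre_strings = [
    "Action|Adventure",
    "Action|Thriller",
    "Comedy",
    "Action|Comedy",
]

# "Explode" each movie into its individual genres, then count occurrences
counts = Counter(g for s in genre_strings for g in s.split("|"))

top = counts.most_common(2)
print(top)
```

Each movie contributes one count per genre it lists, so a movie with three genres is counted three times in total.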

## Evaluate the Model

In this step, we:
1. **Perform a dummy evaluation** to demonstrate how to calculate the **Root Mean Squared Error (RMSE)**.
2. Use sample data (`y_true` and `y_pred`) to compute the RMSE.
3. Print the RMSE value, which measures the difference between predicted and actual values.

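**Output**: The cell prints an RMSE of **0.39** (0.387 to three decimal places), a small error for this dummy example. Because `y_true` and `y_pred` are hand-written sample values, the number does not measure the quality of the recommender itself.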
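
For reference, the same calculation can be worked through step by step with the standard library only. It reuses the sample values from this section; the variable names `squared_errors` and `mse` are introduced here for illustration.

```python
import math

y_true = [4, 3, 5, 2, 4]
y_pred = [3.5, 3, 4.5, 2.5, 4]

# Squared difference for each pair of actual and predicted values
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# Mean of the squared errors, then its square root
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(f"MSE:  {mse:.2f}")    # 0.15
print(f"RMSE: {rmse:.3f}")   # 0.387
```

Three of the five predictions are off by 0.5 and two are exact, which gives a mean squared error of 0.75 / 5 = 0.15.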

In [None]:
# Dummy evaluation using RMSE (for demonstration only)
y_true = [4, 3, 5, 2, 4]
y_pred = [3.5, 3, 4.5, 2.5, 4]
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"Dummy RMSE: {rmse:.2f}")


## Create a User-Friendly Interface

In this step, we:
1. **Define a function** that handles the user's input:
   - Get recommended movies based on genre similarity.
   - Get movies by the same director (if available).
   - Handle cases where the movie is not found.
   - Combine everything into a single text output.
2. **Create a Gradio interface** with a text box for the movie title and a text box for the recommendations.
3. **Launch the app** with `share=True` to get a shareable public link.

This makes the recommendation system interactive and easy to use.

In [None]:
# Function that ties together both recommendation methods for the Gradio interface
def movie_recommender_app(movie_title):
    # Get content-based recommendations
    recommendations, msg1 = get_recommendations(movie_title)
    # Get director-based recommendations
    director_recommendations, msg2 = recommend_by_director(movie_title)

    output_text = ""
    if msg1:
        output_text += msg1 + "\n"
    if recommendations:
        output_text += "Top 10 Recommendations:\n" + "\n".join(recommendations) + "\n"
    else:
        output_text += "No recommendations found.\n"

    output_text += "\n" + msg2 + "\n"
    if director_recommendations:
        output_text += "Movies by the Same Director:\n" + "\n".join(director_recommendations)
    else:
        output_text += "No director-based recommendations found."

    return output_text

# Create a Gradio interface
iface = gr.Interface(
    fn=movie_recommender_app,
    inputs=gr.Textbox(placeholder="Enter a movie name", label="Movie Title"),
    outputs=gr.Textbox(label="Recommendations"),
    title="Movie Recommendation System 🎬",
    description="Enter a movie title to get recommendations based on content similarity and director information."
)

# Launch the Gradio app with a shareable public link
iface.launch(share=True)


In [None]:
# Search for "The Avengers" in the dataset
avengers_movies = movies[movies['title'].str.contains('Avengers', case=False)]
print(avengers_movies)