<a href="https://colab.research.google.com/github/qandeelfatima55/AI-ML-Internship-Tasks/blob/main/task3_movie_recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 3: Movie Recommendation System (Content-Based)
In this notebook, we will build a **content-based** recommender using the MovieLens dataset.

We will:
1. Load the MovieLens data (`movies.csv` and optionally `ratings.csv`).
2. Clean and prepare features (mainly **genres**; MovieLens uses a `|`-separated genre list).
3. Create a text representation with **TF‑IDF**.
4. Compute **cosine similarity** between movies.
5. Build a function `recommend('Movie Title', top_n=5)` to get similar movies.

> Note: The MovieLens *small* dataset is enough. On Colab, you can either download it via Kaggle or upload the two CSVs manually.

## 0) Setup & Imports

In [None]:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import get_close_matches

# Display options for easier reading
pd.set_option('display.max_colwidth', 120)


## 1) Load Data
### Option A (Recommended on phone): Upload files manually
On Colab, upload `movies.csv` (and optionally `ratings.csv`) to the session before running the loading cell below.

Both files come from the MovieLens small dataset; after unzipping, they are inside the extracted folder.
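
A minimal sketch of what an upload cell could look like, assuming the notebook runs in a Colab runtime where `google.colab.files` is importable. The helper name `upload_if_on_colab` is hypothetical and not part of the original notebook; outside Colab it simply returns an empty dict.

```python
def upload_if_on_colab():
    """Open Colab's file picker when available; otherwise return an empty dict."""
    try:
        from google.colab import files  # only importable inside a Colab runtime
    except ImportError:
        return {}
    # files.upload() saves the chosen files to the working directory
    # and returns a dict mapping filename -> file contents (bytes).
    return files.upload()

uploaded = upload_if_on_colab()
print("Uploaded files:", list(uploaded))
```

After the upload finishes, the default paths `'movies.csv'` and `'ratings.csv'` used in the loading cell should resolve without changes.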

### Option B: If files already exist in the environment
If your files are already present (e.g., in the same folder), just make sure the filenames below are correct.

In [None]:

# Change paths if needed
movies_path = 'movies.csv'      # required
ratings_path = 'ratings.csv'    # optional (only used to compute a simple popularity score)

movies = pd.read_csv(movies_path)
print("Movies shape:", movies.shape)
movies.head()


## 2) Basic Cleaning
MovieLens genres look like: `'Adventure|Animation|Children|Comedy|Fantasy'`. We will convert each `|` (and each hyphen, as in `Sci-Fi`) to a space so TF‑IDF can treat the genre words as tokens.
We'll also:
- lower-case the text
- fill missing values
- keep a clean copy of titles for display
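
To make the cleaning steps concrete, here is a small sketch on made-up rows in the MovieLens genre format. The variable names `sample` and `cleaned` are illustrative only and do not appear in the notebook's own cells.

```python
import pandas as pd

# Made-up rows in the MovieLens genre format, including one missing value
sample = pd.Series(['Adventure|Animation|Children|Comedy|Fantasy', 'Action|Sci-Fi', None])

cleaned = (sample
           .fillna('')                          # missing genres become an empty string
           .str.replace('|', ' ', regex=False)  # literal pipe, not regex alternation
           .str.replace('-', ' ', regex=False)  # 'Sci-Fi' -> 'Sci Fi' (two tokens)
           .str.lower())

print(cleaned.tolist())
```

Note that `regex=False` matters here: as a regular expression, `|` means alternation rather than a literal pipe character.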

In [None]:

# Keep original title for display
movies['title_display'] = movies['title'].astype(str)

# Clean genres text
movies['genres'] = movies['genres'].fillna('')
movies['genres_clean'] = (movies['genres']
                          .str.replace('|', ' ', regex=False)
                          .str.replace('-', ' ', regex=False)
                          .str.lower())

import re

def strip_year(t):
    """Remove a trailing ' (YYYY)' release year (and surrounding whitespace) from a title."""
    return re.sub(r'\s*\(\d{4}\)\s*$', '', str(t)).strip()

movies['title_clean'] = movies['title_display'].apply(strip_year).str.lower()
movies[['title_display','genres','genres_clean']].head()

## 3) Build TF‑IDF Features from Genres
This turns the genre words into numeric vectors so we can measure similarity.
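
As a rough illustration of what the vectorizer produces, the sketch below fits TF‑IDF on three made-up genre strings. The names `docs`, `vec`, `X`, and `idf` are assumptions for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "movies", already cleaned to space-separated lower-case genres
docs = ['adventure animation children comedy fantasy',
        'adventure children fantasy',
        'comedy romance']

vec = TfidfVectorizer(token_pattern=r'\b\w+\b')
X = vec.fit_transform(docs)   # sparse matrix: one row per movie, one column per genre token

print(X.shape)                # (number of movies, number of distinct genre tokens)
print(sorted(vec.vocabulary_))

# Map each token to its inverse document frequency:
# tokens that occur in fewer documents receive a larger weight.
idf = {term: vec.idf_[i] for term, i in vec.vocabulary_.items()}
print(idf['romance'] > idf['comedy'])
```

Here `romance` appears in one document and `comedy` in two, so `romance` carries the larger IDF weight.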

In [None]:

tfidf = TfidfVectorizer(token_pattern=r'\b\w+\b')
tfidf_matrix = tfidf.fit_transform(movies['genres_clean'])
tfidf_matrix.shape


## 4) Compute Cosine Similarity
Cosine similarity measures *how similar* two movies are by comparing the direction of their genre vectors: a score of 1 means identical genre profiles, and 0 means no genres in common.
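
A small sketch on made-up genre strings shows the range of scores to expect. The names `docs`, `X`, and `sim` are illustrative assumptions, not the notebook's own variables.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Four made-up movies: two with identical genres, one disjoint, one partly overlapping
docs = ['action thriller', 'action thriller', 'comedy romance', 'action comedy']
X = TfidfVectorizer(token_pattern=r'\b\w+\b').fit_transform(docs)

sim = cosine_similarity(X)    # dense (4, 4) array; sim[i, j] compares movie i with movie j
print(sim.round(2))
```

Movies 0 and 1 share every genre, movies 0 and 2 share none, and movies 0 and 3 share only `action`, so their score falls strictly between the two extremes. The full matrix holds one entry per pair of movies, so its size grows quadratically with the catalogue.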

In [None]:

cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim.shape


## 5) (Optional) Add a Simple Popularity Signal
If `ratings.csv` is available, we can compute a very rough popularity score to break ties.
This step is optional: if `ratings.csv` is missing, the cell below falls back to placeholder values (`rating_count = 0`, `rating_mean = 0.0`).
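
The aggregation and merge can be sketched on a tiny made-up ratings table. `movies_toy`, `ratings_toy`, and `merged` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Made-up data: movie 3 has no ratings at all
movies_toy = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
ratings_toy = pd.DataFrame({'movieId': [1, 1, 2], 'rating': [4.0, 5.0, 3.0]})

# One row per movie: how many ratings it received and their average
pop = (ratings_toy.groupby('movieId')
       .agg(rating_count=('rating', 'count'), rating_mean=('rating', 'mean'))
       .reset_index())

# A left merge keeps unrated movies; their popularity columns start out as NaN
merged = movies_toy.merge(pop, on='movieId', how='left')
merged['rating_count'] = merged['rating_count'].fillna(0).astype(int)
merged['rating_mean'] = merged['rating_mean'].fillna(merged['rating_mean'].mean())
print(merged)
```

The unrated movie ends up with a count of 0 and the average of the other movies' mean ratings, which keeps it in the catalogue without giving it an artificial advantage.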

In [None]:

try:
    ratings = pd.read_csv(ratings_path)
    pop = ratings.groupby('movieId').agg(rating_count=('rating','count'),
                                         rating_mean=('rating','mean')).reset_index()
    movies = movies.merge(pop, on='movieId', how='left')
    movies['rating_count'] = movies['rating_count'].fillna(0)
    movies['rating_mean']  = movies['rating_mean'].fillna(movies['rating_mean'].mean())
except Exception as e:
    # Report the underlying error so a typo in the path is easy to spot
    print(f"ratings.csv not found or unreadable ({e}); proceeding without popularity features.")
    movies['rating_count'] = 0
    movies['rating_mean']  = 0.0


## 6) Build the Recommender Function
We will:
- find the requested movie via exact or fuzzy matching,
- rank other movies by cosine similarity,
- return the top `n` titles (optionally preferring more popular ones).
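
The three steps above can be sketched on a toy catalogue. Everything here is made up for illustration: the titles, the hand-written similarity matrix `toy_sim`, the rating counts, and the helper names `toy_lookup` and `toy_recommend`.

```python
from difflib import get_close_matches
import numpy as np
import pandas as pd

# Hypothetical toy catalogue; the counts are invented, not real MovieLens figures
toy = pd.DataFrame({
    'title_clean': ['toy story', 'antz', 'shrek', 'heat'],
    'rating_count': [215, 45, 170, 102],
})
# Hand-written symmetric similarity matrix (row i = similarity of movie i to every movie)
toy_sim = np.array([
    [1.0, 0.8, 0.8, 0.0],
    [0.8, 1.0, 0.9, 0.0],
    [0.8, 0.9, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

def toy_lookup(query):
    """Exact match first, then the single closest fuzzy match, else None."""
    titles = toy['title_clean'].tolist()
    q = query.lower().strip()
    if q in titles:
        return titles.index(q)
    close = get_close_matches(q, titles, n=1, cutoff=0.6)
    return titles.index(close[0]) if close else None

def toy_recommend(query, top_n=2):
    idx = toy_lookup(query)
    if idx is None:
        return []
    # Attach the similarity row, drop the query movie itself,
    # then rank by similarity with rating_count as the tie-breaker.
    ranked = toy.assign(similarity=toy_sim[idx]).drop(index=idx)
    ranked = ranked.sort_values(['similarity', 'rating_count'], ascending=[False, False])
    return ranked['title_clean'].head(top_n).tolist()

print(toy_recommend('Toy Stroy'))   # misspelled on purpose to exercise the fuzzy match
```

In this sketch `antz` and `shrek` tie on similarity to `toy story`, so the one with more ratings is listed first. With genre-only features such ties are common, which is why a tie-breaker is useful.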

In [None]:

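# Map cleaned title -> row position for quick lookup.
# Titles that collide after year-stripping (e.g. remakes) keep the last row seen.
title_to_index = {t: i for i, t in enumerate(movies['title_clean'])}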

def find_movie_index(query):
    """Return the index in 'movies' for a given title query (case-insensitive).
    Uses exact match first, then fuzzy match (closest 1)."""
    q = strip_year(query).lower().strip()
    if q in title_to_index:
        return title_to_index[q]
    # Fuzzy: try to find the closest title_clean
    candidates = get_close_matches(q, movies['title_clean'].tolist(), n=1, cutoff=0.6)
    if candidates:
        return title_to_index[candidates[0]]
    return None

def recommend(title, top_n=5, use_popularity=True):
    idx = find_movie_index(title)
    if idx is None:
        return pd.DataFrame({'message': [f"Could not find a close match for '{title}'. Try another title."]})

    # Get similarity scores for the given index
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Exclude the movie itself
    sim_scores = [(i,s) for i,s in sim_scores if i != idx]
    # Sort by similarity score (descending)
    sim_scores.sort(key=lambda x: x[1], reverse=True)

    # Take a pool larger than top_n to allow re-sorting by popularity if available
    pool_size = max(30, top_n*5)
    top_indices = [i for i,_ in sim_scores[:pool_size]]
    candidates = movies.iloc[top_indices].copy()
    candidates['similarity'] = [s for _,s in sim_scores[:pool_size]]

    if use_popularity and 'rating_count' in candidates.columns:
        # Sort by similarity first, then break ties by rating_count (more ratings = more popular)
        candidates = candidates.sort_values(['similarity','rating_count'], ascending=[False, False])
    else:
        candidates = candidates.sort_values('similarity', ascending=False)

    # Return the rating columns only if they exist (step 5 may have been skipped entirely)
    cols = ['title_display','genres','similarity']
    cols += [c for c in ('rating_count','rating_mean') if c in candidates.columns]
    return candidates[cols].head(top_n).reset_index(drop=True)


## 7) Try It!
Edit the title in the cell below to a movie you know exists in MovieLens (e.g., `Toy Story (1995)`, `Jumanji (1995)`, `Heat (1995)`), then run it.

In [None]:

# Example usage:
result = recommend('Toy Story (1995)', top_n=5)
result