# 🎬 Movie Recommendation System

_A Python implementation of a TF‑IDF + genre recommender built on pandas, scikit-learn, and SciPy, following the same logic as your `recommender.py` module._  

## 1. Imports & Setup

All imports sit at the top so readers can see which dependencies are required: pandas, scikit-learn, and SciPy.


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import hstack



## 2. Load & Preprocess Data

Read the CSV, fill missing overviews and genres with empty strings, and split each comma-separated `genre` string into a `genres_list`.

In [2]:
# 2.1 Load dataset
df = pd.read_csv('dataset.csv')

# 2.2 Fill missing values
df['overview'] = df['overview'].fillna('')
df['genre']    = df['genre'].fillna('')

# 2.3 Extract list of genres
df['genres_list'] = df['genre'].apply(
    lambda s: [g.strip() for g in s.split(',')] if s else []
)

# Preview
df.head(5)


| Unnamed: 0 | id | title | genre | original_language | overview | popularity | release_date | vote_average | vote_count | genres_list |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 278 | The Shawshank Redemption | Drama,Crime | en | Framed in the 1940s for the double murder of h... | 94.075 | 1994-09-23 | 8.7 | 21862 | [Drama, Crime] |
| 1 | 19404 | Dilwale Dulhania Le Jayenge | Comedy,Drama,Romance | hi | Raj is a rich, carefree, happy-go-lucky second... | 25.408 | 1995-10-19 | 8.7 | 3731 | [Comedy, Drama, Romance] |
| 2 | 238 | The Godfather | Drama,Crime | en | Spanning the years 1945 to 1955, a chronicle o... | 90.585 | 1972-03-14 | 8.7 | 16280 | [Drama, Crime] |
| 3 | 424 | Schindler's List | Drama,History,War | en | The true story of how businessman Oskar Schind... | 44.761 | 1993-12-15 | 8.6 | 12959 | [Drama, History, War] |
| 4 | 240 | The Godfather: Part II | Drama,Crime | en | In the continuing saga of the Corleone crime f... | 57.749 | 1974-12-20 | 8.6 | 9811 | [Drama, Crime] |
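
The genre-splitting step can be checked in isolation on a few strings. This is a minimal sketch, not the module's code: `parse_genres` is a hypothetical helper name, and unlike the lambda in the cell above it also drops empty tokens left by stray or trailing commas.

```python
def parse_genres(raw: str) -> list[str]:
    """Split a comma-separated genre string into a list of clean labels."""
    # Strip whitespace around each token and skip tokens that end up empty.
    return [g.strip() for g in raw.split(",") if g.strip()]


# Sample inputs invented for illustration: clean, messy spacing, and empty.
examples = ["Drama,Crime", "Comedy, Drama ,Romance", ""]
parsed = [parse_genres(s) for s in examples]
print(parsed)
# → [['Drama', 'Crime'], ['Comedy', 'Drama', 'Romance'], []]
```

An empty string maps to an empty list, which matters later: a movie with no genres simply gets an all-zero genre row.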


## 3. Feature Engineering

1. **TF‑IDF** on `overview`  
2. **One‑hot encode** `genres_list`  
3. **Combine** into one sparse matrix 


In [3]:
# 3.1 TF‑IDF on overviews
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = tfidf.fit_transform(df['overview'])

# 3.2 One‑hot encode genres
mlb = MultiLabelBinarizer(sparse_output=True)
genre_matrix = mlb.fit_transform(df['genres_list'])

# 3.3 Combine
feature_matrix = hstack([tfidf_matrix, genre_matrix])
feature_matrix.shape


(10000, 5018)
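
The reported shape is consistent with 5000 TF‑IDF columns (the `max_features` cap) plus 18 genre columns. To see how the two blocks line up side by side, the same three steps can be run on a tiny corpus. This is a sketch only: the three-movie DataFrame and the `toy_*` names are invented for illustration and are not part of the dataset.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical three-movie corpus, invented for illustration only.
toy = pd.DataFrame({
    "overview": [
        "a hacker discovers reality is a simulation",
        "a mob family fights to keep its power",
        "a robot uprising threatens a simulated city",
    ],
    "genres_list": [["Action", "Sci-Fi"], ["Crime", "Drama"], ["Sci-Fi"]],
})

# Text block: one column per vocabulary term that survives stop-word removal.
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])

# Genre block: one 0/1 column per distinct genre label.
toy_mlb = MultiLabelBinarizer(sparse_output=True)
toy_genres = toy_mlb.fit_transform(toy["genres_list"])

# Stack the blocks column-wise; format="csr" keeps the result row-sliceable.
toy_features = hstack([toy_tfidf, toy_genres], format="csr")

print(toy_tfidf.shape[1], "text columns +", toy_genres.shape[1], "genre columns")
print(toy_features.shape)
print(list(toy_mlb.classes_))
```

The column count of the combined matrix is always the sum of the two blocks, and the row count is unchanged, so row `i` still describes movie `i`.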

## 4. Compute Cosine Similarity

The full pairwise similarity matrix (one row and one column per movie) is computed **once** and reused for every query.


In [5]:
cosine_sim = cosine_similarity(feature_matrix, feature_matrix, dense_output=False)
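
If holding the full pairwise matrix in memory becomes a concern, one similarity row can be computed on demand instead. This is a sketch of that alternative, not the approach used in this notebook: `features` is a small hypothetical matrix standing in for `feature_matrix`, and `similarity_row` is an illustrative helper name.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 4x3 feature matrix standing in for feature_matrix.
features = csr_matrix(np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]))


def similarity_row(matrix, idx):
    """Cosine similarity between row idx and every row, as a 1-D array."""
    # matrix[idx] is a 1 x n_features sparse row; the result has shape (1, n_rows).
    return cosine_similarity(matrix[idx], matrix).ravel()


row = similarity_row(features, 0)
full = cosine_similarity(features)  # full pairwise matrix, for comparison

print(np.round(row, 3))
print(np.allclose(row, full[0]))
```

The single row matches the corresponding row of the full matrix, so the trade-off is purely time per query versus memory. Row slicing needs a CSR matrix, so with the real data it is safest to call `feature_matrix.tocsr()` first.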

## 5. Recommendation Function

A simple function that:
- Finds the index of the chosen title  
- Extracts its similarity row  
- Sorts and returns the top _N_ similar titles  


In [6]:
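def recommend(title: str, top_n: int = 5) -> list[str]:
    """
    Return up to top_n movie titles most similar to the given title.
    Returns an empty list if the title is not in the dataset.
    """
    matches = df.index[df['title'] == title]
    if len(matches) == 0:
        return []
    # Use the first match if the title appears more than once.
    # Assumes df keeps its default RangeIndex, so labels equal row positions.
    idx = matches[0]
    sim_scores = list(enumerate(cosine_sim[idx].toarray().ravel()))
    # Drop the query movie by index: with tied scores it may not sort first
    sim_scores = [(i, s) for i, s in sim_scores if i != idx]
    # Sort by similarity score descending and keep the top_n
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]
    movie_indices = [i for i, _ in sim_scores]
    return df['title'].iloc[movie_indices].tolist()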
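
One detail worth checking in a function like this is that the query movie is excluded by index rather than by its position in the sorted list, because another movie with identical features ties at a score of 1.0 and can sort ahead of it. The selection step can be sketched with NumPy as follows; `top_n_indices` is a hypothetical helper and the scores are made up for illustration.

```python
import numpy as np


def top_n_indices(scores, query_idx, top_n=5):
    """Indices of the top_n highest scores, excluding query_idx."""
    scores = np.asarray(scores, dtype=float)
    # Negate to sort descending; a stable sort keeps tied items in original order.
    order = np.argsort(-scores, kind="stable")
    # Remove the query by index, wherever it landed in the ordering.
    order = order[order != query_idx]
    return order[:top_n].tolist()


# Rows 0 and 1 are identical movies, so both score 1.0 against either one.
scores = [1.0, 1.0, 0.0, 0.5]
print(top_n_indices(scores, query_idx=0, top_n=2))  # → [1, 3]
print(top_n_indices(scores, query_idx=1, top_n=2))  # → [0, 3]
```

In the second call, row 0 sorts ahead of the query row 1, so dropping the first sorted entry would have removed the wrong movie and returned the query itself.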


## 6. Try It Out

Pick a movie and see the top 5 recommendations.


In [11]:
# Example
movie = 'The Matrix'
print(f"Since you watched **{movie}**, you might also like:")
for i, rec in enumerate(recommend(movie, top_n=5), 1):
    print(f"{i}. {rec}")


Since you watched **The Matrix**, you might also like:
1. Logan's Run
2. Rollerball
3. Revolt
4. Kin
5. Battle: Los Angeles
