# 🎬 IntelliFlix: Semantic Movie Recommender

This notebook builds a movie recommendation engine using natural language processing (NLP).
Instead of relying on ratings or genres, it embeds each movie's plot summary (the `overview` column) with **Sentence Transformers** and retrieves semantically similar movies with **FAISS**.

**Dataset**: [TMDB Movies Dataset](https://www.kaggle.com/datasets/alanvourch/tmdb-movies-daily-updates)

**Goal**: Given a movie or a custom plot, recommend movies with similar plot themes.

## Install required packages

In [1]:
!pip install -q sentence-transformers faiss-cpu hf_xet huggingface_hub

## Import Python libraries

In [None]:
import os
import torch
import kagglehub
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

# for ignoring warnings
import warnings
warnings.filterwarnings("ignore")

## Setup directories

In [3]:
data_dir = "/kaggle/working/data"
embeddings_dir = "/kaggle/working/embeddings"
indexes_dir = "/kaggle/working/indexes"

os.makedirs(data_dir, exist_ok=True)
os.makedirs(embeddings_dir, exist_ok=True)
os.makedirs(indexes_dir, exist_ok=True)

## Accessing User Secrets with Kaggle Secrets API


In [4]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

## Authenticating with Hugging Face API Using a Token

In [5]:
hf_token = user_secrets.get_secret("HF_TOKEN")

from huggingface_hub import HfApi
api = HfApi(token=hf_token)
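
Outside Kaggle, `kaggle_secrets` cannot be imported, so the two cells above fail. A minimal sketch of a fallback follows, assuming the token is exported as an `HF_TOKEN` environment variable when running locally. The helper name `get_hf_token` is hypothetical and not part of the original notebook.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Return the Hugging Face token from Kaggle Secrets, else from the environment."""
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        # Assumption: when running locally, the token is set as an environment variable.
        return os.environ.get(secret_name)
    return UserSecretsClient().get_secret(secret_name)
```

The result can be passed to `HfApi(token=...)` exactly as `hf_token` is above. It is `None` when neither source provides a token.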

## Loading TMDB Movies Dataset

In [6]:
csv_path = "/kaggle/input/tmdb-movies-daily-updates/TMDB_all_movies.csv"
df = pd.read_csv(csv_path)

df.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,budget,imdb_id,...,spoken_languages,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes,poster_path
0,2,Ariel,7.1,346.0,Released,1988-10-21,0.0,73.0,0.0,tt0094675,...,suomi,"Merja Pulkkinen, Eetu Hilkamo, Turo Pajala, Es...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Aki Kaurismäki,,7.4,9154.0,/ojDg0PGvs6R9xYFodRct2kdI6wC.jpg
1,3,Shadows in Paradise,7.293,409.0,Released,1986-10-17,0.0,74.0,0.0,tt0092149,...,"suomi, English, svenska","Esko Nikkari, Mari Rantasila, Marina Martinoff...",Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Mika Kaurismäki,,7.4,7974.0,/nj01hspawPof0mJmlgfjuLyJuRN.jpg
2,5,Four Rooms,5.862,2688.0,Released,1995-12-09,4257354.0,98.0,4000000.0,tt0113101,...,English,"David Proval, Salma Hayek Pinault, Paul Calder...","Allison Anders, Alexandre Rockwell, Quentin Ta...","Phil Parmet, Andrzej Sekula, Guillermo Navarro...","Allison Anders, Alexandre Rockwell, Quentin Ta...","Alexandre Rockwell, Quentin Tarantino, Lawrenc...",Combustible Edison,6.7,114105.0,/75aHn1NOYXh4M7L5shoeQ6NGykP.jpg
3,6,Judgment Night,6.5,349.0,Released,1993-10-15,12136938.0,109.0,21000000.0,tt0107286,...,English,"Doug Wert, Hank McGill, Christine Harnos, Raic...",Stephen Hopkins,Peter Levy,"Lewis Colick, Jere Cunningham","Gene Levy, Marilyn Vance, Lloyd Segan",Alan Silvestri,6.6,19891.0,/3rvvpS9YPM5HB2f4HYiNiJVtdam.jpg
4,8,Life in Loops (A Megacities RMX),7.5,27.0,Released,2006-01-01,0.0,80.0,42000.0,tt0825671,...,"English, हिन्दी, 日本語, Pусский, Español",,Timo Novotny,Wolfgang Thaler,"Michael Glawogger, Timo Novotny","Ulrich Gehmacher, Timo Novotny",,8.2,284.0,/7ln81BRnPR2wqxuITZxEciCe1lc.jpg


## Uploading Raw TMDB Dataset to Hugging Face Dataset Repository

In [7]:
api.upload_file(
    path_or_fileobj=csv_path,
    path_in_repo="data/tmdb_movies_dataset_raw.csv",
    repo_id="uiuxarghya/intelliflix-store",
    repo_type="dataset",
    commit_message="feat(data): Upload raw TMDB movies dataset."
)

print("✅ Raw dataset uploaded to Hugging Face!")

Uploading...:   0%|          | 0.00/653M [00:00<?, ?B/s]

✅ Raw dataset uploaded to Hugging Face!
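
The same `upload_file` call recurs for every artifact in this notebook, with only the local path, the repo path, and the commit message changing. One possible way to factor it out is sketched below. The wrapper name `upload_artifact` is hypothetical, and it simply forwards the keyword arguments already used above.

```python
def upload_artifact(api, local_path, path_in_repo, message,
                    repo_id="uiuxarghya/intelliflix-store"):
    """Upload one file to the dataset repo; `api` is expected to be an HfApi instance."""
    result = api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
        commit_message=message,
    )
    print(f"✅ Uploaded {path_in_repo} to {repo_id}")
    return result
```

With the notebook's objects this would be called as `upload_artifact(api, csv_path, "data/tmdb_movies_dataset_raw.csv", "feat(data): Upload raw TMDB movies dataset.")`.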


## Preprocess the data

In [8]:
# For local execution, ensure the dataset is in the correct path
print("\n📊 Cleaning data...")
print(f"📄 Initial rows: {df.shape[0]}")

# Drop rows with missing overviews
df = df.dropna(subset=["overview"])
print(f"🧹 After dropping missing overviews: {df.shape[0]} rows")

# Drop duplicates
df = df.drop_duplicates(subset=["title", "overview"])
print(f"🧽 After dropping duplicates: {df.shape[0]} rows")

# Drop rows with missing release dates
df = df.dropna(subset=["release_date"])
print(f"🧹 After dropping missing release dates: {df.shape[0]} rows")

# Reset index
df = df.reset_index(drop=True)

# Save cleaned data
clean_csv_path = os.path.join(data_dir, "movies_cleaned.csv")
df.to_csv(clean_csv_path, index=False)

print(f"✅ Cleaned data saved to: {clean_csv_path}")
print(f"📦 File size: {os.path.getsize(clean_csv_path) / 1024:.2f} KB")


📊 Cleaning data...
📄 Initial rows: 1090315
🧹 After dropping missing overviews: 894147 rows
🧽 After dropping duplicates: 892261 rows
🧹 After dropping missing release dates: 808181 rows
✅ Cleaned data saved to: /kaggle/working/data/movies_cleaned.csv
📦 File size: 497284.72 KB
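
To make the effect of each step easier to see, here is a small sketch that applies the same three filters to a toy frame. The function name `clean_movies` and the toy rows are made up for illustration and are not TMDB data.

```python
import pandas as pd


def clean_movies(df):
    """Apply the three cleaning steps from the cell above and re-index the result."""
    df = df.dropna(subset=["overview"])                    # no plot text to embed
    df = df.drop_duplicates(subset=["title", "overview"])  # exact repeats
    df = df.dropna(subset=["release_date"])                # no release date
    return df.reset_index(drop=True)


# Toy data: one valid row, one duplicate, one missing overview, one missing date.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "overview": ["x", "x", None, "z"],
    "release_date": ["2000-01-01", "2000-01-01", "2001-01-01", None],
})
cleaned = clean_movies(toy)
print(len(cleaned), cleaned["title"].tolist())
```

The row order left by this step is what later ties each embedding row to a movie, so the cleaned frame should not be re-sorted or filtered again before the embeddings are generated.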


## Uploading Cleaned Dataset to Hugging Face Dataset Repository

In [9]:
api.upload_file(
    path_or_fileobj=clean_csv_path,
    path_in_repo="data/tmdb_movies_dataset_processed.csv",
    repo_id="uiuxarghya/intelliflix-store",
    repo_type="dataset",
    commit_message="feat(data): Upload processed TMDB movies dataset."
)

print("✅ Processed dataset uploaded to Hugging Face!")

Uploading...:   0%|          | 0.00/509M [00:00<?, ?B/s]

✅ Processed dataset uploaded to Hugging Face!


## Load SentenceTransformer model (GPU-enabled)
- [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2)

In [10]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"\n🚀 Using device: {device}")

model = SentenceTransformer("all-MiniLM-L12-v2", device=device)


🚀 Using device: cuda


## Generate Movie Overview Embeddings (as torch tensors for efficiency)

In [11]:
# Tune this to your GPU memory capacity.
# On Kaggle or Google Colab, 1536 is a reasonable value for a T4, P100, or A100 GPU.
batch_size = 1536

try:
    print(f"\n🔄 Generating embeddings (batch_size: {batch_size})...")
    overview_embeddings = model.encode(
        df["overview"].tolist(),
        batch_size=batch_size,
        convert_to_tensor=True,
        device=device,
        show_progress_bar=True,
    )
except RuntimeError as e:
    if "out of memory" in str(e):
        torch.cuda.empty_cache()
        print("⚠️ OOM detected. Retrying with batch_size=128...")
        overview_embeddings = model.encode(
            df["overview"].tolist(),
            batch_size=128,
            convert_to_tensor=True,
            device=device,
            show_progress_bar=True,
        )
    else:
        raise

# Convert to NumPy (for FAISS)
embeddings = overview_embeddings.cpu().numpy().astype("float32")
print(f"✅ Embeddings shape: {embeddings.shape}")

# 💾 Save embeddings
embedding_path = os.path.join(embeddings_dir, "embeddings.npy")
np.save(embedding_path, embeddings)
print(f"✅ Embeddings saved to: {embedding_path}")
print(f"📦 File size: {os.path.getsize(embedding_path) / 1024 / 1024:.2f} MB")


🔄 Generating embeddings (batch_size: 1536)...


Batches:   0%|          | 0/527 [00:00<?, ?it/s]

✅ Embeddings shape: (808181, 384)
✅ Embeddings saved to: /kaggle/working/embeddings/embeddings.npy
📦 File size: 1183.86 MB
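
The cell above retries once, with a fixed batch size of 128. A more gradual alternative is to halve the batch size after each out-of-memory error. The sketch below is one way to do that and is not the notebook's method. The helper name `encode_with_backoff` and the `min_batch_size` parameter are hypothetical, and `encode_fn` stands in for a thin wrapper around `model.encode`.

```python
def encode_with_backoff(encode_fn, texts, batch_size=1536, min_batch_size=32):
    """Call encode_fn(texts, batch_size), halving batch_size after each OOM error."""
    while True:
        try:
            return encode_fn(texts, batch_size)
        except RuntimeError as e:
            # Re-raise anything that is not an OOM, or an OOM at the smallest size.
            if "out of memory" not in str(e) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"⚠️ OOM detected. Retrying with batch_size={batch_size}...")


# Possible wiring with the notebook's objects (not executed here):
# def encode_fn(texts, bs):
#     torch.cuda.empty_cache()
#     return model.encode(texts, batch_size=bs, convert_to_tensor=True, device=device)
# overview_embeddings = encode_with_backoff(encode_fn, df["overview"].tolist())
```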


## Uploading Movie Overview Embeddings to Hugging Face

In [12]:
api.upload_file(
    path_or_fileobj=embedding_path,
    path_in_repo="embeddings/movie_ovierview_embeddings.npy",
    repo_id="uiuxarghya/intelliflix-store",
    repo_type="dataset",
    commit_message="feat(embeddings): Upload movie overview embeddings"
)

print("✅ Embeddings uploaded to Hugging Face!")

Uploading...:   0%|          | 0.00/1.24G [00:00<?, ?B/s]

✅ Embeddings uploaded to Hugging Face!
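
The saved file is reported as 1183.86 MB, while the upload progress bar shows 1.24G. Both are consistent with the embedding shape `(808181, 384)` stored as `float32`. The difference is most likely binary versus decimal units, on the assumption that the progress bar counts in powers of 1000. A quick check of the arithmetic:

```python
import math

n_rows, dim, bytes_per_float32 = 808_181, 384, 4  # shape taken from the output above

raw_bytes = n_rows * dim * bytes_per_float32
print(f"{raw_bytes / 1024**2:.2f} MiB")  # same units as getsize(...) / 1024 / 1024
print(f"{raw_bytes / 1000**3:.2f} GB")   # decimal gigabytes
print(math.ceil(n_rows / 1536))          # number of batches at batch_size=1536
```

The `.npy` file also carries a small header, which is too small to change these rounded figures.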


## Build Movie Overview FAISS index

In [13]:
print("\n📦 Building FAISS index...")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Save index
index_path = os.path.join(indexes_dir, "movie_overview_index.faiss")
faiss.write_index(index, index_path)
print(f"✅ FAISS index saved to: {index_path}")
print(f"📦 File size: {os.path.getsize(index_path) / 1024 / 1024:.2f} MB")


📦 Building FAISS index...
✅ FAISS index saved to: /kaggle/working/indexes/movie_overview_index.faiss
📦 File size: 1183.86 MB
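
`IndexFlatL2` ranks neighbours by squared Euclidean distance. If the embeddings have unit length, that ranking matches a cosine-similarity ranking, because for unit vectors ‖a − b‖² = 2 − 2·cos(a, b). This is an assumption worth checking on a few rows with `np.linalg.norm(embeddings[:5], axis=1)`. The NumPy sketch below illustrates the identity on random stand-in vectors rather than on the real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for embeddings, scaled to unit length (an assumption about the real ones).
vectors = rng.normal(size=(1000, 384))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0]

squared_l2 = ((vectors - query) ** 2).sum(axis=1)  # the quantity an L2 index ranks by
cosine = vectors @ query                           # cosine similarity for unit vectors

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(squared_l2, 2 - 2 * cosine))
# Smallest distances and largest cosines select the same top 10.
print(np.array_equal(np.argsort(squared_l2)[:10], np.argsort(-cosine)[:10]))
```

If the embeddings were not normalized, the two rankings could differ, and an inner-product index over normalized vectors would be the usual alternative.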


## Uploading FAISS Index to Hugging Face Dataset Repository

In [14]:
api.upload_file(
    path_or_fileobj=index_path,
    path_in_repo="indexes/movie_overview_index.faiss",
    repo_id="uiuxarghya/intelliflix-store",
    repo_type="dataset",
    commit_message="feat(index): Upload movie overview FAISS index"
)

print("✅ FAISS index uploaded to Hugging Face!")

Uploading...:   0%|          | 0.00/1.24G [00:00<?, ?B/s]

✅ FAISS index uploaded to Hugging Face!


## Semantic Search

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

# Semantic Search with similarity score
def search(query, k=15):
    query_vec = model.encode([query], convert_to_tensor=True, device=device)
    query_np = query_vec.cpu().numpy().astype("float32")
    # Use FAISS for fast index search
    D, I = index.search(query_np, k)

    # Compute cosine similarity for better interpretability
    similarities = cosine_similarity(query_np, embeddings[I[0]])[0]

    # Append similarity scores to result
    results = df.iloc[I[0]].copy()
    results["similarity"] = similarities

    return results.sort_values(by="similarity", ascending=False).reset_index(drop=True)
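
The stated goal also covers starting from an existing movie, whereas `search` handles the custom-plot case. A minimal sketch of a title-based variant follows. It reuses the stored embedding of the matched movie instead of re-encoding its overview. The name `similar_by_title` is hypothetical, and the example uses a brute-force NumPy dot product rather than FAISS so that it stays self-contained.

```python
import numpy as np


def similar_by_title(title, titles, embeddings, k=5):
    """Return (title, cosine similarity) pairs for the k movies closest to `title`.

    `titles[i]` must describe the movie embedded in `embeddings[i]`.
    """
    matches = [i for i, t in enumerate(titles) if t.lower() == title.lower()]
    if not matches:
        raise KeyError(f"Title not found: {title!r}")
    pos = matches[0]  # assumption: take the first match when titles repeat

    # Normalize rows so the dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ unit[pos]
    scores[pos] = -np.inf  # exclude the query movie itself

    top = np.argsort(-scores)[:k]
    return [(titles[i], float(scores[i])) for i in top]


# Toy 2-D vectors, made up for illustration.
titles = ["Space One", "Space Two", "Romance"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(similar_by_title("space one", titles, vecs, k=2))
```

With the notebook's objects, `titles` would be `df["title"].tolist()`. At the full dataset size it would be cheaper to pass `embeddings[pos:pos + 1]` to `index.search` than to normalize every row on each call.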

## Sample Query

In [16]:
query = "An adventure of explorers lost in space for a wormhole and tries to survive on a distant planet."
results = search(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

## Test Results

In [17]:
print(f"\n🔍 Query: {query}\n")
for idx, row in results.iterrows():
    print(f"🔹 Rank #{idx + 1}")
    print(f"🎬 Title: {row['title']}")
    print(f"📈 Similarity: {row['similarity']:.4f}")
    print(f"🖼️ Poster Path: {row.get('poster_path', 'N/A')}")
    print(f"📝 Overview: {row['overview']}\n{'-' * 80}\n")


🔍 Query: An adventure of explorers lost in space for a wormhole and tries to survive on a distant planet.

🔹 Rank #1
🎬 Title: Interstellar
📈 Similarity: 0.8289
🖼️ Poster Path: /gEU2QniE6E77NI6lCU6MxlNBvIx.jpg
📝 Overview: The adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage.
--------------------------------------------------------------------------------

🔹 Rank #2
🎬 Title: Beachworld
📈 Similarity: 0.6421
🖼️ Poster Path: /zyevrXJMvK0kmddn1VWBB94dIgc.jpg
📝 Overview: Thousands of years into the future a spaceship traveling through a miscalculated wormhole crashes on a distant planet. The crew struggle for survival among the shifting sands, and encounters with other worldly visitors.
--------------------------------------------------------------------------------

🔹 Rank #3
🎬 Title: Project Gemini
📈 Similarity: 0.5995
🖼️ Poster Path: /rFljUdOozFE