# 🎬 IntelliFlix: Semantic Movie Recommender

This notebook builds a movie recommendation engine using natural language processing (NLP).
Instead of relying on ratings or genres, it embeds each movie's plot summary (the `overview` column) with **Sentence Transformers** and retrieves semantically similar movies with **FAISS**.

**Dataset**: TMDB Movies Dataset

**Goal**: Given a movie or a custom plot, recommend movies with similar plot themes.

### 📦 Install required packages

In [None]:
%pip install -q kagglehub sentence-transformers faiss-cpu hf_xet

### 🔷 Import Python libraries

In [None]:
import os
import torch
import kagglehub
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

# for ignoring warnings
import warnings
warnings.filterwarnings("ignore")

### 📁 Setup directories

In [None]:
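# Relative paths keep outputs next to the notebook (e.g. /content/data on Colab)
# and avoid needing write access to the filesystem root.
data_dir = "data"
model_dir = "models"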
os.makedirs(data_dir, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)

### ⬇️ Download dataset from Kaggle

**Dataset**: [TMDB Movies Dataset](https://www.kaggle.com/datasets/alanvourch/tmdb-movies-daily-updates)

In [None]:
# If you already have the CSV, skip the download below and set the path directly:
# csv_path = "/content/data/TMDB_all_movies.csv"

# Otherwise, download the dataset from Kaggle (kagglehub caches it and returns the local path):
print("⬇️ Downloading dataset from Kaggle...")
dataset_path = kagglehub.dataset_download("alanvourch/tmdb-movies-daily-updates")
csv_path = os.path.join(dataset_path, "TMDB_all_movies.csv")
print(f"✅ Dataset downloaded at: {csv_path}")

### 🧑🏻‍💻 Load and preprocess the data

In [None]:
# csv_path must point to TMDB_all_movies.csv (set in the previous cell)
print("\n📊 Loading and cleaning data...")
df = pd.read_csv(csv_path)
print(f"📄 Initial rows: {df.shape[0]}")

# Drop rows with missing overviews
df = df.dropna(subset=["overview"])
print(f"🧹 After dropping missing overviews: {df.shape[0]} rows")

# Optional: uncomment to drop rows flagged as adult content
# df = df[df["adult"] == False]
# print(f"🧼 After dropping adult content: {df.shape[0]} rows")

# Drop duplicates
df = df.drop_duplicates(subset=["title", "overview"])
print(f"🧽 After dropping duplicates: {df.shape[0]} rows")

# Reset index
df = df.reset_index(drop=True)

# Save cleaned data
clean_csv_path = os.path.join(data_dir, "movies_cleaned.csv")
df.to_csv(clean_csv_path, index=False)

print(f"✅ Cleaned data saved to: {clean_csv_path}")
print(f"📦 File size: {os.path.getsize(clean_csv_path) / 1024 / 1024:.2f} MB")
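
The cleaning steps above can be tried on a toy table first. The sketch below uses made-up rows (not taken from the TMDB dataset) and adds one assumed extra step: `dropna` removes only missing values, so overviews that are empty or whitespace-only would otherwise survive and be embedded.

```python
import pandas as pd

# Toy rows for illustration only; the real data comes from TMDB_all_movies.csv.
toy = pd.DataFrame({
    "title": ["Alpha", "Alpha", "Beta", "Gamma", "Delta"],
    "overview": ["Lost in space.", "Lost in space.", None, "   ", "A heist goes wrong."],
})

# Same steps as the notebook: drop missing overviews, then exact duplicates.
cleaned = toy.dropna(subset=["overview"])

# Assumed extra step (not in the original cell): drop blank overviews too.
cleaned = cleaned[cleaned["overview"].str.strip() != ""]

cleaned = cleaned.drop_duplicates(subset=["title", "overview"]).reset_index(drop=True)
print(cleaned)
```

Whether blank overviews actually occur in the TMDB file is not checked here; counting them with `(df["overview"].str.strip() == "").sum()` before deciding is a reasonable first step.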

### 🤖 Load SentenceTransformer model (GPU-enabled)
- [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2)

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"\n🚀 Using device: {device}")
model = SentenceTransformer("all-MiniLM-L12-v2", device=device)

### 🧠 Generate embeddings (as torch tensors for efficiency)

In [None]:
batch_size = 900  # Adjust based on your GPU memory capacity
# For Colab, you can increase this to 1536 if using A100 GPU or T4 GPU

try:
    print(f"\n🔄 Generating embeddings (batch_size={batch_size})...")
    overview_embeddings = model.encode(
        df["overview"].tolist(),
        batch_size=batch_size,
        convert_to_tensor=True,
        device=device,
        show_progress_bar=True,
    )
except RuntimeError as e:
    if "out of memory" in str(e):
        torch.cuda.empty_cache()
        print("⚠️ OOM detected. Retrying with batch_size=128...")
        overview_embeddings = model.encode(
            df["overview"].tolist(),
            batch_size=128,
            convert_to_tensor=True,
            device=device,
            show_progress_bar=True,
        )
    else:
        raise

# Convert to NumPy (for FAISS)
embeddings = overview_embeddings.cpu().numpy().astype("float32")
print(f"✅ Embeddings shape: {embeddings.shape}")

# 💾 Save embeddings
embedding_path = os.path.join(model_dir, "embeddings.npy")
np.save(embedding_path, embeddings)
print(f"✅ Embeddings saved to: {embedding_path}")
print(f"📦 File size: {os.path.getsize(embedding_path) / 1024 / 1024:.2f} MB")
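
The cell above retries once with `batch_size=128` when the GPU runs out of memory. A more general variant, sketched below, halves the batch size repeatedly until encoding succeeds. The helper `encode_with_backoff` and the stand-in `fake_encode` are hypothetical names introduced for illustration; in the notebook, `encode_fn` would wrap `model.encode`, and calling `torch.cuda.empty_cache()` before each retry would likely still be needed.

```python
def encode_with_backoff(encode_fn, texts, batch_size, min_batch_size=16):
    """Call encode_fn(texts, batch_size), halving batch_size after each OOM error.

    Returns (result, batch_size_that_worked). Re-raises any other RuntimeError,
    and re-raises the OOM error once batch_size has reached min_batch_size.
    """
    while True:
        try:
            return encode_fn(texts, batch_size), batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err) or batch_size <= min_batch_size:
                raise
            # In the notebook, torch.cuda.empty_cache() would go here.
            batch_size = max(min_batch_size, batch_size // 2)


def fake_encode(texts, batch_size):
    """Stand-in for model.encode: pretends any batch above 200 exhausts memory."""
    if batch_size > 200:
        raise RuntimeError("CUDA out of memory (simulated)")
    return [len(t) for t in texts]


vectors, used_batch_size = encode_with_backoff(fake_encode, ["a", "bb", "ccc"], batch_size=900)
print(used_batch_size)  # 900 -> 450 -> 225 -> 112
```

Note that each retry restarts encoding from the beginning, so a large initial batch size that fails late still costs time.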

### ✨ Build FAISS index

In [None]:
print("\n📦 Building FAISS index...")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Save index
index_path = os.path.join(model_dir, "movie_index.faiss")
faiss.write_index(index, index_path)
print(f"✅ FAISS index saved to: {index_path}")
print(f"📦 File size: {os.path.getsize(index_path) / 1024 / 1024:.2f} MB")
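
`IndexFlatL2` ranks neighbours by squared Euclidean distance, while the search step reports cosine similarity. The two orderings agree when all vectors have unit length, because then the squared distance equals 2 minus twice the cosine similarity. The dependency-free sketch below checks that identity on two made-up vectors. Whether this model's embeddings are already unit length is not stated in this notebook, so treat that as an assumption to verify.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance, the quantity IndexFlatL2 ranks by."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, normalized to unit length.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

lhs = squared_l2(a, b)
rhs = 2.0 - 2.0 * cosine(a, b)
print(abs(lhs - rhs) < 1e-9)
```

If the embeddings turn out not to be normalized, passing `normalize_embeddings=True` to `model.encode` is one way to enforce unit length before building the index.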

### 🔍 Semantic Search

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Semantic Search with similarity score
def search(query, k=15):
    query_vec = model.encode([query], convert_to_tensor=True, device=device)
    query_np = query_vec.cpu().numpy().astype("float32")
    # Use FAISS for fast index search
    D, I = index.search(query_np, k)

    # Compute cosine similarity for better interpretability
    similarities = cosine_similarity(query_np, embeddings[I[0]])[0]

    # Append similarity scores to result
    results = df.iloc[I[0]].copy()
    results["similarity"] = similarities

    return results.sort_values(by="similarity", ascending=False).reset_index(drop=True)
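
The stated goal also covers recommending from an existing movie, which `search` does not handle directly. One possible approach, sketched below, looks up the movie's overview by title and passes it to the search function, dropping the movie itself from the results. `recommend_by_title`, `toy_movies`, and `toy_search` are hypothetical names for illustration. `toy_search` ranks by shared words only so that the example runs without a model; in the notebook, `df` and `search` would take their place.

```python
import pandas as pd


def recommend_by_title(title, movies, search_fn, k=5):
    """Search with the overview of the movie called `title`, excluding that movie."""
    matches = movies[movies["title"].str.lower() == title.lower()]
    if matches.empty:
        raise ValueError(f"Title not found: {title!r}")
    overview = matches.iloc[0]["overview"]
    # Request one extra hit because the query movie usually matches itself.
    hits = search_fn(overview, k + 1)
    hits = hits[hits["title"].str.lower() != title.lower()]
    return hits.head(k).reset_index(drop=True)


# Toy data for illustration only.
toy_movies = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "overview": [
        "A crew is stranded on a distant planet.",
        "Astronauts stranded on a distant moon.",
        "A baker opens a shop in Paris.",
    ],
})


def toy_search(query, k):
    """Stand-in for search(): scores each movie by words shared with the query."""
    query_words = set(query.lower().split())
    scored = toy_movies.copy()
    scored["similarity"] = [
        len(query_words & set(text.lower().split())) for text in scored["overview"]
    ]
    return scored.sort_values("similarity", ascending=False).head(k).reset_index(drop=True)


recs = recommend_by_title("alpha", toy_movies, toy_search, k=2)
print(recs[["title", "similarity"]])
```

If several movies share a title, this sketch silently picks the first match; filtering on a release year or ID column would be a safer choice when one is available.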

### 😇 Sample Query

In [None]:
query = "A group of explorers gets lost in space during a voyage and tries to survive on a distant planet"
results = search(query)

### 🥳 Test Results

In [None]:
print(f"\n🔍 Query: {query}\n")
for idx, row in results.iterrows():
    print(f"🔹 Rank #{idx + 1}")
    print(f"🎬 Title: {row['title']}")
    print(f"📈 Similarity: {row['similarity']:.4f}")
    print(f"🖼️ Poster Path: {row.get('poster_path', 'N/A')}")
    print(f"📝 Overview: {row['overview']}\n{'-' * 80}\n")