# AI-Powered Content Recommendation System

An NLP-based movie recommendation system using semantic embeddings  
(Hollywood + Bollywood)

## 1. Problem Definition & Objective

### a. Selected Project Track
**Recommendation Systems / NLP-based AI Application**

### b. Problem Statement
With the rapid growth of content on streaming platforms, users often struggle to find movies that match their interests, moods, or preferred themes. Traditional recommendation systems rely heavily on user history, ratings, or keyword-based filtering. History- and rating-based methods fail in cold-start scenarios, where a new user or a new title has no interaction data, and keyword filtering lacks semantic understanding of what the user is asking for.

### c. Objective
The objective of this project is to build an AI-powered content recommendation system that:
- Accepts free-text natural language queries
- Understands semantic meaning instead of keyword matching
- Recommends relevant movies from a global dataset
- Supports both Hollywood and Bollywood content
- Provides explainable recommendations with relevance scores

### d. Real-World Relevance
Such systems are widely used on OTT platforms such as Netflix, Prime Video, and Disney+, where content discovery is an important driver of user engagement and retention.


## 2. Data Understanding & Preparation

### a. Dataset Source
This project uses a **global movie dataset** created by combining:
- **TMDB Movie Dataset** (Hollywood and international cinema)
- **IMDb Bollywood Movie Dataset**

Both datasets are publicly available and contain metadata such as titles, descriptions, genres, and release information.


In [None]:
import pandas as pd

# Load the combined dataset
df = pd.read_csv("data/content_global.csv")

# Display first few rows
df.head()

### b. Dataset Exploration

Key columns in the dataset:
- **title**: Movie name
- **description**: Plot summary
- **genres**: Movie genres
- **source**: Dataset origin (TMDB or Bollywood_IMDB)
- **text**: Combined semantic text field

This unified dataset enables global and regional movie recommendations.

In [None]:
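# Dataset shape as (rows, columns); print explicitly, since a notebook cell
# only displays the value of its last expression
print(df.shape)

# Column names
print(df.columns.tolist())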

In [None]:
# Handle missing values
df = df.fillna("")

# Verify no missing values remain
df.isnull().sum()

### c. Data Cleaning & Preprocessing

- Missing values were replaced with empty strings
- A combined semantic text field (`text`) was used for embedding generation
- No labels were required since this is an unsupervised recommendation system

This preparation ensures clean and consistent input for the NLP model.
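
The notebook loads a dataset in which `text` already exists, so its construction is not shown. The following is a minimal sketch of one plausible recipe, assuming `text` is a concatenation of `title`, `genres`, and `description`. Both that recipe and the helper name `build_text` are assumptions for illustration, not the confirmed method used to create `content_global.csv`.

```python
import pandas as pd

def build_text(frame: pd.DataFrame) -> pd.Series:
    """Join title, genres, and description into one string per row (hypothetical recipe)."""
    parts = frame[["title", "genres", "description"]].fillna("").astype(str)
    # Strip each field, skip the empty ones, and join the rest with ". "
    return parts.apply(
        lambda row: ". ".join(s.strip() for s in row if s.strip()), axis=1
    )

# Toy rows for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["Fantasy, Adventure", None],
    "description": ["A young hero learns magic.", "Two friends go on a road trip."],
})
toy["text"] = build_text(toy)
```

Skipping empty fields keeps stray separators out of the text passed to the embedding model when a row is missing, for example, its genres.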

## 3. Model / System Design

### a. AI Technique Used
- Natural Language Processing (NLP)
- Semantic Embeddings
- Recommendation System using Similarity Search

### b. Model Selection
A pretrained **SentenceTransformer** model (`all-MiniLM-L6-v2`) was used due to:
- Good semantic similarity quality for a compact model
- Low inference latency (6 transformer layers, 384-dimensional embeddings)
- Suitability for sentence-level similarity tasks

### c. System Architecture
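1. Convert each movie's combined `text` field into a vector embedding
2. Convert the user query into an embedding with the same model
3. Compute cosine similarity between the query embedding and every movie embedding
4. Rank movies by similarity score, highest first
5. Return the top recommendations, with the similarity score serving as the explanation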
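
Steps 3 and 4 can be illustrated on toy vectors. Cosine similarity between a query vector q and a movie vector m is q·m / (‖q‖ ‖m‖), so it depends only on the angle between the vectors, not their lengths. The sketch below uses made-up 3-dimensional vectors rather than real embeddings, and the helper name `cosine_rank` is hypothetical.

```python
import numpy as np

def cosine_rank(query_vec, item_vecs, top_k=2):
    """Return (indices, scores) of the top_k items by cosine similarity."""
    # Scale every vector to unit length so a dot product equals the cosine
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q
    # argsort is ascending, so reverse it and keep the first top_k entries
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]

# Toy data for illustration only
toy_items = np.array([
    [1.0, 0.0, 0.0],   # same direction as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [1.0, 1.0, 0.0],   # 45 degrees from the query
])
toy_query = np.array([2.0, 0.0, 0.0])

indices, scores = cosine_rank(toy_query, toy_items)
print(indices.tolist(), scores.round(3).tolist())
```

The query has length 2 but still scores 1.0 against the first item, which shows that vector magnitude does not affect the ranking.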

In [None]:
from sentence_transformers import SentenceTransformer

# Load pretrained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
# Generate embeddings for all movies
movie_embeddings = model.encode(
    df["text"].tolist(),
    show_progress_bar=True
)

## 4. Core Implementation

This section demonstrates the recommendation pipeline, including:
- Query embedding
- Similarity computation
- Ranking logic

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

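def recommend(query, top_k=5):
    """Return the top_k movies most semantically similar to a free-text query."""
    # Embed the query with the same model used for the movies
    query_embedding = model.encode([query])

    # Cosine similarity between the query and every movie embedding
    similarities = cosine_similarity(query_embedding, movie_embeddings)[0]

    # Indices of the top_k highest scores, best match first
    top_indices = np.argsort(similarities)[::-1][:top_k]

    results = df.iloc[top_indices].copy()
    results["score"] = similarities[top_indices]

    return results[["title", "description", "genres", "source", "score"]]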

In [None]:
# Sample recommendation
recommend("epic fantasy adventure with magic and heroes", top_k=5)

## 5. Evaluation & Analysis

### a. Evaluation Metrics
Because the dataset has no ground-truth relevance labels, evaluation is qualitative and based on:
- Semantic relevance of recommendations
- Cosine similarity scores
- Diversity of returned results
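
Diversity can also be given a rough number. One common choice, used here as an assumption since the notebook does not define its diversity measure, is intra-list diversity: the mean pairwise cosine distance between the embeddings of the returned movies. The function name `intra_list_diversity` and the toy vectors are hypothetical.

```python
import numpy as np

def intra_list_diversity(vectors):
    """Mean pairwise cosine distance (1 - cosine similarity) over distinct pairs.

    Expects at least two vectors; 0.0 means all results point the same way.
    """
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ v.T
    # Upper triangle above the diagonal visits each unordered pair once
    upper = np.triu_indices(len(v), k=1)
    return float(np.mean(1.0 - sims[upper]))

# Toy vectors for illustration only
identical = np.array([[1.0, 0.0], [1.0, 0.0]])
orthogonal = np.array([[1.0, 0.0], [0.0, 1.0]])

print(intra_list_diversity(identical), intra_list_diversity(orthogonal))
```

Applied to the real system, the input would be the rows of `movie_embeddings` selected by the top-ranked indices for a given query.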

### b. Sample Observations
- High relevance for abstract queries
- Effective cold-start handling
- Supports both regional and global content
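
The regional versus global observation can be checked per query by restricting the ranking to one value of the `source` column. The sketch below is a hypothetical helper (`rank_within_source` is not part of the notebook) that works on a precomputed similarity array and uses toy data in place of the real dataset.

```python
import numpy as np
import pandas as pd

def rank_within_source(similarities, frame, source, top_k=5):
    """Rank only the rows whose `source` matches, best score first."""
    # Positions of the rows that belong to the requested source
    candidate_idx = np.flatnonzero((frame["source"] == source).to_numpy())
    # Sort those candidates by similarity, descending, and keep top_k
    best = np.argsort(similarities[candidate_idx])[::-1][:top_k]
    order = candidate_idx[best]
    out = frame.iloc[order].copy()
    out["score"] = similarities[order]
    return out

# Toy data for illustration only
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "source": ["TMDB", "Bollywood_IMDB", "TMDB", "Bollywood_IMDB"],
})
toy_similarities = np.array([0.9, 0.4, 0.2, 0.7])

result = rank_within_source(toy_similarities, toy, "Bollywood_IMDB", top_k=2)
```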

### c. Limitations
- No user personalization
- Depends on quality of textual metadata

## 6. Ethical Considerations & Responsible AI

### Bias & Fairness
- Dataset may overrepresent certain regions or genres
- No demographic profiling or user data used
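
Over-representation can be quantified before drawing conclusions from the recommendations. A minimal sketch, assuming the `source` column described earlier and using a toy frame in place of the real data:

```python
import pandas as pd

# Toy frame for illustration only; the real proportions are not known here
toy = pd.DataFrame({"source": ["TMDB", "TMDB", "TMDB", "Bollywood_IMDB"]})

# Fraction of rows contributed by each source
share = toy["source"].value_counts(normalize=True)
print(share)
```

Running the same call on `df["source"]` would show how balanced the combined dataset is between the two sources.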

### Responsible AI Use
- Uses pretrained models responsibly
- No personal data collection
- Explainable and transparent recommendations

### Dataset Limitations
- Genre tagging inconsistencies
- Limited coverage of niche cinema

## 7. Conclusion & Future Scope

### Conclusion
This project demonstrates how modern NLP techniques can be used to build an intelligent content recommendation system. By leveraging semantic embeddings, the system provides meaningful recommendations without relying on user history or ratings.

### Future Scope
- Personalization using user profiles
- Multilingual recommendations
- Integration with live APIs
- Hybrid collaborative + content-based filtering