### Selected Project Track: Content Recommendation System
## Problem Statement: 
Most of the recommendation systems offer movies based on user's watch history, this novel movie bot recommends content based on user's current mood, because: sometimes user feels like watching movie according to their mood or movies similar to the plot they liked before, rather than just based on the past watch history.

This movie bot can be trained on the movies database for a specific OTT platorms like: Netflix or Prime Video and this movie bot can be added as a layer on top of the Netflix or OTT platform's UI and fetch movies according to user's mood and other preferences, This model can further be extended to collect additional information like: user's tastes type of movie they might like and so on

## This project is a movie assistant that recommends movies based on:

user's mood-> predicts genres he might like -> asks language preference-> asks similar movies he might like -> prinnts top 5 movies

## The project consists of step by step approach:
1. *Dataset preparation:* extraction Pre-processing (I used TMDB dataset for movies metadata: https://www.themoviedb.org/)
2. *Creating Vector embeddings:* I used FAISS in memory vector DB to embed movies dataset
3. *LLMs - Integrated Gemini API* to process few ambigous user inputs (like mood prediction, extracting similar movie description)
4. *Create reranker model* to sort the recommendations based on semantic similarity
5. *Recommendation Module->* module that takes user preferences and matches with movie's metadata to display top 5 movies
6. *Converstation Manager Module* -> To manage questions to be ased and processes user inputs by storing user preferences in a state Class
7. *Chatbot Service Module*- used as an orchestrator for Chatbot simulation

### 1 Data Preparation: I extracted data using TMDB API Keys 

-> Dataset present in datasets/raw 
-> after cleansing will load data to datasets/cleaned which has cleaned formatted fields with no null values and additional column added: 'embedding_text' which is embedded using sentence-transformer model in FAISS vector DB

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../datasets/raw/movies_raw.csv")

In [3]:
print(df.dtypes)

movie_id               int64
title                 object
overview              object
genres                object
cast                  object
keywords              object
runtime                int64
release_year           int64
language              object
vote_average         float64
vote_count             int64
combined_features     object
dtype: object


In [4]:
print(df.isnull().sum())

movie_id               0
title                  0
overview             219
genres                 0
cast                   0
keywords               0
runtime                0
release_year           0
language               0
vote_average           0
vote_count             0
combined_features      0
dtype: int64


In [5]:
df = df.fillna({'overview': 'no overview'})

In [6]:
import ast
# this method handles text cleaning: since some lists are of type string

def normalize_text(x):
    # Handle NaN
    if pd.isna(x):
        return ""

    # Case 1: already a real list
    if isinstance(x, list):
        return ", ".join(g.strip() for g in x if isinstance(g, str) and g.strip())

    # Case 2: string that looks like a list -> parse it
    if isinstance(x, str):
        x = x.strip()

        # empty or invalid
        if x.lower() in ["", "nan", "none", "[]"]:
            return ""

        # stringified list like "['Drama', 'Romance']"
        if x.startswith("[") and x.endswith("]"):
            try:
                parsed = ast.literal_eval(x)
                if isinstance(parsed, list):
                    return ", ".join(
                        g.strip() for g in parsed if isinstance(g, str) and g.strip()
                    )
            except Exception:
                pass  # fall through

        # already pipe or comma separated
        return x.replace(",", ", ")

    return ""


In [7]:
import re
# removes brackets and single quotes from string: []'
def clean_brackets(text):
    if pd.isna(text):
        return ""
    # Remove [, ], and ' using regex
    return re.sub(r"[\[\]']", "", str(text))

In [8]:
import re
import pandas as pd

def clean_text(text):
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r"[^a-z\s,:.\'\"]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

In [9]:
df['overview']= df['overview'].apply(clean_text)
df['overview'].iloc[67:75]

67    bruce gets phone calls from a woman claiming t...
68    a part original movie featuring scorpio: an ag...
69    abel is a ghostwriter. he just finished writin...
70    many years after a deadly terrorist siege in a...
71    filmed version of the stratford festival produ...
72    following a brutal civil war, an interrogation...
73    tunein on friday midnight to watch relaxing ol...
74                                          no overview
Name: overview, dtype: object

In [10]:
df['keywords'] = df['keywords'].apply(normalize_text).apply(clean_brackets)

df['keywords'].tail(5)

df["genres"] = df["genres"].apply(normalize_text).apply(lambda x: x.lower())

df["genres"].head(2)

df['genres']= df['genres'].apply(clean_brackets)
df["keywords"].iloc[87:94]
df['cast']=df['cast'].apply(normalize_text)

df['cast']= df['cast'].apply(clean_brackets)

df['runtime'] = df['runtime'].astype(int)
df['release_year']=df['release_year'].astype(int)

# Check if the keywords at index 87 is empty (length 0)
print(type(df['overview'].iloc[87]))


<class 'str'>


In [11]:
import pycountry
import pandas as pd
# helper method to convert iso codes of language to full language: ex: 'en' is converted to 'english'
def convert_lang(code):
    if pd.isnull(code):
        return code
    
    # pycountry needs uppercase 2-letter codes (e.g., 'EN')
    code_clean = str(code).strip().upper()
    
    try:
        lang = pycountry.languages.get(alpha_2=code_clean)
        return lang.name.lower() if lang else code
    except (AttributeError, LookupError):
        return code

# Apply the helper function
df["language"] = df["language"].apply(convert_lang)

In [12]:
df['language']= df['language'].apply(lambda x: x.lower() if isinstance(x,str) else x)
df['language'].iloc[78]

'english'

In [13]:
def replace_with_keywords(row):
    """
    Takes a row (Series) from the DataFrame.
    If overview is 'no overview', joins keywords into a string.
    this method replaces empty overview with keywords
    """
    # Access columns by name from the row object
    overview = str(row['overview']).lower().strip()
    keywords = row['keywords']
    genres = str(row['genres'])
    
    # Logic: if 'no overview' and keywords exist
    if overview == 'no overview' and len(keywords) > 0:
        # If keywords is a list, join it; if it's already a string, return it
        if isinstance(keywords, str):
            row['extracted_text'] = str(keywords.strip())
        return row
    
    if overview == 'no overview' and len(genres) > 0:
        if isinstance(genres, str):
            row['extracted_text'] = str(genres.strip())
        return row

    row['extracted_text'] = pd.NA
    # Otherwise, keep the original overview
    return row

df = df.apply(replace_with_keywords, axis =1)

In [14]:
mask = (df['extracted_text'].notna())

df.loc[mask, 'extracted_text'].head()

22                          france, society, confinement
74     sports, basketball, national basketball associ...
98                                       comedy, romance
119                                   documentary, music
120    archive footage, movie star, hollywood star, g...
Name: extracted_text, dtype: object

In [15]:
# Creates a view of the data where extracted_text has values
filtered_df = df[df['extracted_text'].notna()]
print(filtered_df.count())

movie_id             170
title                170
overview             170
genres               170
cast                 170
keywords             170
runtime              170
release_year         170
language             170
vote_average         170
vote_count           170
combined_features    170
extracted_text       170
dtype: int64


In [16]:
from thefuzz import fuzz, process
# helper method to get unique keywords, so that vector embedding does not contain noise
def get_unique_fuzzy_keywords(input_str, threshold=70):
    # 1. Clean and split the string into a list
    raw_keywords = [k.strip() for k in input_str.split(',') if k.strip()]
    raw_keywords.sort(key= len, reverse= True)
    
    unique_keywords = []

    for kw in raw_keywords:
        # 2. Check if the keyword is similar to anything already accepted
        # If the list is empty, just add the first word
        if not unique_keywords:
            unique_keywords.append(kw)
            continue
        
        # 3. Find the best match score among already accepted words
        # extractOne returns (best_match, score)
        _, score = process.extractOne(kw, unique_keywords, scorer=fuzz.token_set_ratio)
        
        # 4. If the similarity score is low, it's a "unique" new concept
        if score < threshold:
            unique_keywords.append(kw)
            
    return unique_keywords

# Example Usage
# data = "sports, basketball, national basketball association (nba)"
# result = get_unique_fuzzy_keywords(data)

# print(result) 
# Output: ['sports', 'national basketball association (nba)']

In [17]:
mask = (df['extracted_text'].notna())

df.loc[mask,'extracted_text']= df.loc[mask,'extracted_text'].apply(get_unique_fuzzy_keywords)

In [18]:
df.loc[mask,'extracted_text'].head(5)

22                        [confinement, society, france]
74       [national basketball association (nba), sports]
98                                     [romance, comedy]
119                                 [documentary, music]
120    [biographical documentary, archive footage, li...
Name: extracted_text, dtype: object

In [19]:
def keywords_to_plot(keywords: list):
    if isinstance(keywords,list):
        keyword_text = ','.join(keywords)
        text_to_append = 'a movie that is about: '
        
        final_text = text_to_append + keyword_text
        return final_text
    

df.loc[mask,'extracted_text']= df.loc[mask,'extracted_text'].apply(keywords_to_plot)

df.loc[mask,'extracted_text'].head(5)

    



22     a movie that is about: confinement,society,france
74     a movie that is about: national basketball ass...
98                 a movie that is about: romance,comedy
119             a movie that is about: documentary,music
120    a movie that is about: biographical documentar...
Name: extracted_text, dtype: object

In [20]:
import pandas as pd
import numpy as np

def new_overview(row):
    overview = row['overview']
    extracted_text = row['extracted_text']
    keywords = row['keywords']

    # Use pd.isna() instead of 'is pd.nan'
    if overview == 'no overview':
        if not pd.isna(extracted_text):
            row['new_overview'] = extracted_text
        else:
            row['new_overview'] = 'no overview'
    else:
        if isinstance(keywords,str) and len(keywords):
            row['new_overview'] = overview+' '+keywords
        else:
            row['new_overview']= overview

        

    return row

# Specify axis=1 to process by row
df = df.apply(new_overview, axis=1)

In [21]:
df[df['new_overview']=='no overview'].count()

movie_id             49
title                49
overview             49
genres               49
cast                 49
keywords             49
runtime              49
release_year         49
language             49
vote_average         49
vote_count           49
combined_features    49
extracted_text        0
new_overview         49
dtype: int64

In [22]:
# Select all rows where the condition is False 
# removing rows that neither have keywords, genres or overview since they don't add any values to embeddings
df = df[~(df['new_overview'] == 'no overview')]


In [23]:
df.count()

movie_id             4951
title                4951
overview             4951
genres               4951
cast                 4951
keywords             4951
runtime              4951
release_year         4951
language             4951
vote_average         4951
vote_count           4951
combined_features    4951
extracted_text        170
new_overview         4951
dtype: int64

In [24]:
df.columns

Index(['movie_id', 'title', 'overview', 'genres', 'cast', 'keywords',
       'runtime', 'release_year', 'language', 'vote_average', 'vote_count',
       'combined_features', 'extracted_text', 'new_overview'],
      dtype='object')

In [25]:
columns_to_keep = [
    'movie_id', 'title', 'genres', 'cast', 'keywords', 
    'runtime', 'release_year', 'language', 
    'vote_average', 'vote_count', 'new_overview'
]

new_df = df[columns_to_keep]

In [26]:
new_df.head(2)

Unnamed: 0,movie_id,title,genres,cast,keywords,runtime,release_year,language,vote_average,vote_count,new_overview
0,670347,Bone Marrow,drama,"Parinaz Izadyar, Babak Hamidian, Javad Ezzati",,108,2020,persian,6.0,3,bahar has divorced her husband and is now rema...
1,695675,Fox Hunting,action,"Eva Huang Shengyi, Xu Jia, Eric Tsang Chi-Wai",,105,2020,chinese,9.0,1,fox hunting adapted from wang jianxing's popul...


In [27]:
new_df= new_df.rename(columns={'new_overview':'overview'})

In [28]:
new_df['embedding_text']= new_df['overview']

In [29]:
new_df.iloc[67:74]

Unnamed: 0,movie_id,title,genres,cast,keywords,runtime,release_year,language,vote_average,vote_count,overview,embedding_text
67,862246,Mama's Dead and Lives in the Basement,mystery,"Eugene Torres, June Cuthbertson",short film,5,2020,english,0.0,0,bruce gets phone calls from a woman claiming t...,bruce gets phone calls from a woman claiming t...
68,756167,Sitsit,horror,"Ivana Alawi, Jake Cuenca, Sarah Patricia Gill",,60,2020,tagalog,5.7,3,a part original movie featuring scorpio: an ag...,a part original movie featuring scorpio: an ag...
69,899308,G√©rard G√©rard,"comedy, romance","Pierre Hancisse, Lucie Debay, Gr√©goire Oestermann",,21,2020,french,6.0,2,abel is a ghostwriter. he just finished writin...,abel is a ghostwriter. he just finished writin...
70,603258,Conference,drama,"Natalya Pavlenkova, Olga Lapshina, Kseniya Zueva","post-traumatic stress disorder (ptsd), sense o...",129,2020,russian,5.8,11,many years after a deadly terrorist siege in a...,many years after a deadly terrorist siege in a...
71,791705,Othello,drama,"Michael Blake, Gordon S. Miller, Amelia Sargisson","jealousy, live theatre, filmed theater, shakes...",116,2020,english,8.0,1,filmed version of the stratford festival produ...,filmed version of the stratford festival produ...
72,760793,Truth,drama,"Rachel Alig, Eric Paul Erickson, Jannica Olin","civil war, criminal",107,2020,english,3.0,1,"following a brutal civil war, an interrogation...","following a brutal civil war, an interrogation..."
73,801489,TreeTV,"documentary, tv movie",,,309,2020,xx,6.0,1,tunein on friday midnight to watch relaxing ol...,tunein on friday midnight to watch relaxing ol...


In [30]:
new_df.columns

Index(['movie_id', 'title', 'genres', 'cast', 'keywords', 'runtime',
       'release_year', 'language', 'vote_average', 'vote_count', 'overview',
       'embedding_text'],
      dtype='object')

In [31]:
col = new_df["embedding_text"]

print("Total rows:", len(col))
print("NaNs:", col.isna().sum())
print("Non-strings:", sum(not isinstance(x, str) for x in col if pd.notna(x)))


Total rows: 4951
NaNs: 0
Non-strings: 0


In [32]:
import os
# this code saves cleaned df to the below path

# os.makedirs("../datasets/cleaned", exist_ok=True)
# new_df.to_csv("../datasets/cleaned/movies_cleaned_2.csv", index=False)

In [33]:
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parents[0]  # moviebot/
sys.path.append(str(PROJECT_ROOT))

In [None]:
from pathlib import Path

# ========================
# Base paths
# ========================
# PROJECT_ROOT = Path(__file__).resolve().parents[2]

DATASET_DIR = PROJECT_ROOT / "datasets" / "cleaned"
MOVIES_CSV_PATH = DATASET_DIR / "movies_cleaned.csv"
# MOVIES_CSV_PATH = DATASET_DIR / "movies_cleaned_2.csv"

#=========Mood Predictor================
MODEL_PATH = PROJECT_ROOT /"src" /"models" / "emotion_model"

# ========================
# Embedding configuration
# ========================
EMBEDDING_MODEL_NAME = "all-mpnet-base-v2"
EMBEDDING_DTYPE = "float32"

# ========================
# FAISS configuration
# ========================
FAISS_INDEX_DIR = PROJECT_ROOT / "datasets" / "faiss"
FAISS_INDEX_PATH = FAISS_INDEX_DIR / "movies.index"

# ========================
# Recommendation defaults
# ========================
TOP_K_RECOMMENDATIONS = 10

# Similarity weights (used later)
GENRE_BOOST = 0.3
LANGUAGE_BOOST = 0.3
POPULARITY_BOOST = 0.05

YES_WORDS = ["yes", "yeah", "yep", "sure", "okay", "ok", "Yah", "absolutely", "definitely","Yes please","Ya", 'y', 'yea', "sure"]
NO_WORDS = ["no", "nope", "nah", "not really", "don't", "do not","No thanks","No thank you", 'n', 'nah', 'noo']
LANGUAGES = [
    "English", "Hindi", "Telugu", "Tamil", "Korean",
    "Japanese", "Spanish", "French", "Persian", "Urdu", 
    "Arabic", "Bengali","Chinese","German"
]
genres = [
    "action",
    "adventure",
    "animation",
    "comedy",
    "crime",
    "documentary",
    "biography",
    "drama",
    "family",
    "fantasy",
    "history",
    "horror",
    "music",
    "mystery",
    "romance",
    "sci-fi",
    "thriller",
    "war",
    "western"
]
# ========================
# Validation schema
# ========================
REQUIRED_COLUMNS = {
    "movie_id",
    "title",
    "embedding_text",
    "genres",
    "cast",
    "keywords",
    "runtime",
    "language",
    "release_year",
}


2. ## Creating Vector Embeddings: I define 4 helper modules in this section
#### 2.1 movie_repository: Handles loading and querying the cleaned movie dataset.
#### 2.2 embedding_model: Creates vector embeddings for movies dataaset using Sentence Transformer
#### 2.3 index_builder: Builds indexes and creates mappings with movies metadata for respective embeddings
#### 2.4 faiss_index:  builds faiss indexes from vector

In [35]:
#2.1 src/data/movie_repository.py
from typing import Optional, List
import pandas as pd

# from src.config.settings import (
#     MOVIES_CSV_PATH,
#     REQUIRED_COLUMNS
# )


class MovieRepository:
    """
    Handles loading and querying the cleaned movie dataset.
    This class is the single source of truth for movie data access.
    """

    def __init__(self, csv_path: Optional[str] = None):
        self.csv_path = csv_path or MOVIES_CSV_PATH
        self._df: Optional[pd.DataFrame] = None

    def load(self) -> None:
        """Load the movie dataset into memory."""
        if not self.csv_path.exists():
            raise FileNotFoundError(
                f"Movie dataset not found at: {self.csv_path}"
            )

        self._df = pd.read_csv(self.csv_path)
        self._validate_schema()

    def _validate_schema(self) -> None:
        """Ensure required columns exist."""
        if self._df is None:
            raise RuntimeError("Dataset not loaded")

        missing = REQUIRED_COLUMNS - set(self._df.columns)
        if missing:
            raise ValueError(
                f"Dataset missing required columns: {missing}"
            )

    @property
    def df(self) -> pd.DataFrame:
        """Safe access to underlying dataframe."""
        if self._df is None:
            raise RuntimeError("Call load() before accessing data")
        return self._df

    def get_all_movies(self) -> pd.DataFrame:
        """Return full dataset."""
        return self.df.copy()

    def filter_movies(
        self,
        language: Optional[List[str]] = None,
        max_runtime: Optional[int] = None,
        min_runtime: Optional[int] = None,
    ) -> pd.DataFrame:
        """
        Apply hard filters only (no ranking).
        """
        df = self.df

        if language:
            df = df[df["language"].isin(language)]

        if max_runtime is not None:
            df = df[df["runtime"] <= max_runtime]

        if min_runtime is not None:
            df = df[df["runtime"] >= min_runtime]

        return df.copy()

    def get_movie_by_id(self, movie_id: int) -> pd.Series:
        """Fetch a single movie by ID."""
        movie = self.df[self.df["movie_id"] == movie_id]
        if movie.empty:
            raise ValueError(f"Movie with id {movie_id} not found")
        return movie.iloc[0]


In [36]:
#2.2 src/models/embedding_model
from typing import List
import numpy as np
from sentence_transformers import SentenceTransformer

# from src.config.settings import EMBEDDING_MODEL_NAME, EMBEDDING_DTYPE


class EmbeddingModel:
    """
    Wrapper around SentenceTransformer for generating embeddings.
    Responsible ONLY for embedding text.
    """

    def __init__(self, model_name: str = EMBEDDING_MODEL_NAME):
        self.model_name = model_name
        self._model: SentenceTransformer | None = None

    def load(self) -> None:
        """Load the embedding model into memory."""
        if self._model is None:
            self._model = SentenceTransformer(self.model_name)

    def embed_text(self, text: str) -> np.ndarray:
        """
        Embed a single string into a vector.
        """
        if not text or not isinstance(text, str):
            raise ValueError("Text must be a non-empty string")

        self.load()

        vector = self._model.encode(
            text,
            convert_to_numpy=True,
            normalize_embeddings=True
        )

        return vector.astype(EMBEDDING_DTYPE)

    def embed_texts(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """
        Embed multiple strings into a 2D array.
        Shape: (num_texts, embedding_dim)
        """
        if not texts or not isinstance(texts, list):
            raise ValueError("Texts must be a non-empty list of strings")

        self.load()

        vectors = self._model.encode(
            texts,
            batch_size=batch_size,
            convert_to_numpy=True,
            normalize_embeddings=True,
            show_progress_bar=True
        )

        return vectors.astype(EMBEDDING_DTYPE)


  from .autonotebook import tqdm as notebook_tqdm


In [37]:
# 2.3 src/index/index_builder
from pathlib import Path
import numpy as np
import pandas as pd

from src.data.movie_repository import MovieRepository
from src.models.embedding_model import EmbeddingModel
from src.index.faiss_index import FaissIndex
from src.config.settings import FAISS_INDEX_PATH


class IndexBuilder:
    """
    Builds and persists a FAISS index for movies.
    """

    def __init__(
        self,
        repository: MovieRepository,
        embedding_model: EmbeddingModel,
        index_path: Path = FAISS_INDEX_PATH,
    ):
        self.repository = repository
        self.embedding_model = embedding_model
        self.index_path = index_path
        self.mapping_path = index_path.with_suffix(".mapping.npy")

    def build(self) -> None:
        """
        Build or append to FAISS index from movie embeddings.
        """
        df = self.repository.get_all_movies()

        if "embedding_text" not in df.columns:
            raise ValueError("embedding_text column missing")

        if df.empty:
            return

        # üëâ IMPORTANT: Only embed rows we are about to add
        texts = df["embedding_text"].tolist()
        vectors = self.embedding_model.embed_texts(texts)

        # Load or create index
        if self.index_path.exists():
            index = FaissIndex.load(self.index_path)
        else:
            index = FaissIndex(dim=vectors.shape[1])

        # Append vectors (THIS is the real add)
        index.add(vectors)

        # Save index first
        index.save(self.index_path)

        # Append mapping AFTER index.add()
        self._save_mapping(df)

        # Safety check
        self._validate_index_vs_mapping(index)

    def _save_mapping(self, df: pd.DataFrame) -> None:
        """
        Append FAISS index ‚Üí movie_id mapping.
        """
        new_movie_ids = df["movie_id"].to_numpy()

        self.mapping_path.parent.mkdir(parents=True, exist_ok=True)

        if self.mapping_path.exists():
            existing_movie_ids = np.load(self.mapping_path)
            combined_movie_ids = np.concatenate(
                [existing_movie_ids, new_movie_ids]
            )
        else:
            combined_movie_ids = new_movie_ids

        np.save(self.mapping_path, combined_movie_ids)

    def _validate_index_vs_mapping(self, index: FaissIndex) -> None:
        """
        Ensure FAISS index and mapping stay aligned.
        """
        movie_ids = np.load(self.mapping_path)
        if index.index.ntotal != len(movie_ids):
            raise RuntimeError(
                f"FAISS index size ({index.index.ntotal}) "
                f"!= mapping size ({len(movie_ids)})"
            )

    def load_index(self) -> tuple[FaissIndex, np.ndarray]:
        """
        Load FAISS index and movie_id mapping.
        """
        index = FaissIndex.load(self.index_path)

        if not self.mapping_path.exists():
            raise FileNotFoundError(
                f"Mapping file not found at {self.mapping_path}"
            )

        movie_ids = np.load(self.mapping_path)
        return index, movie_ids


In [38]:
#2.4 src/index/faiss_index
from typing import Tuple
from pathlib import Path
import numpy as np
import faiss


class FaissIndex:
    """
    Thin wrapper around FAISS index for vector similarity search.
    """

    def __init__(self, dim: int):
        """
        :param dim: embedding dimension (e.g. 768 for SBERT)
        """
        self.dim = dim
        self.index: faiss.Index | None = None

    # -----------------------------
    # BUILD (fresh index)
    # -----------------------------
    def build(self, vectors: np.ndarray) -> None:
        """
        Build a NEW FAISS index from vectors.
        This overwrites any existing index in memory.
        """
        self._validate_vectors(vectors)

        faiss.normalize_L2(vectors)

        self.index = faiss.IndexFlatIP(self.dim)
        self.index.add(vectors)

    # -----------------------------
    # ADD (append vectors)
    # -----------------------------
    def add(self, vectors: np.ndarray) -> None:
        """
        Append vectors to an existing FAISS index.
        Creates index if not present.
        """
        if vectors is None or len(vectors) == 0:
            return

        self._validate_vectors(vectors)

        faiss.normalize_L2(vectors)

        if self.index is None:
            self.index = faiss.IndexFlatIP(self.dim)

        self.index.add(vectors)

    # -----------------------------
    # SAVE / LOAD
    # -----------------------------
    def save(self, path: Path) -> None:
        """
        Persist FAISS index to disk.
        """
        if self.index is None:
            raise RuntimeError("Index has not been built or loaded")

        path.parent.mkdir(parents=True, exist_ok=True)
        faiss.write_index(self.index, str(path))

    @classmethod
    def load(cls, path: Path) -> "FaissIndex":
        """
        Load FAISS index from disk and return a FaissIndex instance.
        """
        if not path.exists():
            raise FileNotFoundError(f"FAISS index not found at {path}")

        index = faiss.read_index(str(path))
        obj = cls(dim=index.d)
        obj.index = index
        return obj

    # -----------------------------
    # SEARCH
    # -----------------------------
    def search(
        self,
        query_vector: np.ndarray,
        top_k: int = 5
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Perform similarity search.
        """
        if self.index is None:
            raise RuntimeError("Index not loaded or built")

        if query_vector.ndim != 1 or query_vector.shape[0] != self.dim:
            raise ValueError(
                f"Expected query vector of shape ({self.dim},), got {query_vector.shape}"
            )

        query_vector = query_vector.astype(np.float32)
        faiss.normalize_L2(query_vector.reshape(1, -1))

        scores, indices = self.index.search(
            query_vector.reshape(1, -1),
            top_k
        )

        return scores[0], indices[0]

    # -----------------------------
    # INTERNAL HELPERS
    # -----------------------------
    def _validate_vectors(self, vectors: np.ndarray) -> None:
        if vectors.ndim != 2:
            raise ValueError(f"Vectors must be 2D, got {vectors.shape}")

        if vectors.shape[1] != self.dim:
            raise ValueError(
                f"Vector dim {vectors.shape[1]} != index dim {self.dim}"
            )

        if vectors.dtype != np.float32:
            raise ValueError("Vectors must be float32")


### 3. LLMs for interpretting user's text
I integrated LLMs using Gemini API, API Key  for calling Gemini model for interpretting mood predicting genres, extracting movie plot based on user's movie preference

In [39]:
# 3 src/llm/extract_movie_info

from typing import List
from pydantic import BaseModel, Field

class Movie(BaseModel):
    """Structured data about a movie."""
    title: str = Field(description="The full title of the movie.")
    themes: List[str] = []
    plot: str  = Field(description="brief plot describing movie intent")


from dotenv import load_dotenv
from src.movie import Movie
from google import genai
import os
from google.genai.errors import APIError

try:
    load_dotenv()
    GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") # enter your gemini api key here
    client = genai.Client(api_key=GEMINI_API_KEY)
except Exception as e:
    print(f"Error initializing Gemini client: {e}")
    client = None

def extract_movie_info(text: str):
    if not client:
      print("Gemini client not initialized. Check API key setup.")
      return None
    prompt = f"""
From the USER TEXT, perform the following steps:

1. Check whether the USER TEXT explicitly mentions a real movie, TV show, or web series
   (even partially or informally).

2. IF a real movie, TV show, or web series is identified:
   - Set official_title to the recognized title
   - Generate 3‚Äì6 concise theme keywords that reflect the tone or subject
   - Write ONE clear sentence describing the plot or narrative intent of the identified title

3. IF NO real movie, TV show, or web series is identified:
   - Leave official_title EMPTY
   - Extract themes ONLY if clearly implied by the text
   - IF the USER TEXT expresses a general movie preference or intent
     (for example: "movie with Christmas theme", "feel-good romantic comedy",
      "high school comedy", "thriller set in winter"),
     THEN copy the USER TEXT verbatim into the plot field
   - Otherwise, leave the plot field EMPTY

Return a JSON object only. Do not add explanations.

USER TEXT: "{text}" """



    try:
        # Call the API with the structured output configuration
        response = client.models.generate_content(
            model="gemini-2.5-flash-lite",
            contents=[prompt],
            config={
                #    'system_instruction': system_msg,
                   'response_mime_type': 'application/json',
                'response_schema': Movie,
            }
        )
        extracted_movie = Movie.model_validate_json(response.text)
        return extracted_movie
    # except APIError as e:
    #     print(f"Gemini API Error (Check Quota/Rate Limits): {e}")
    #     return None
    except Exception as e:
        print(f"Error generating content: {e}")
        return None


# res = extract_movie_info(" I want something like stranger things dark thriller type")
# # print(res)

Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.


In [40]:
# 3 src/llm/give_customized_mood_Response
# gives response and genres user might like based on user's mood

from pydantic import BaseModel, Field

class MovieGenre(BaseModel):
    """Structured data about a movie."""
    response_text: str = Field(description="The customized response text based on mood.")
    genres: list = Field(description="list of genres associated with the mood.")


from dotenv import load_dotenv
from src.mood_genre import MovieGenre
from google import genai
import os
# from google.genai.errors import APIError

try:
    load_dotenv()
    # GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
    GEMINI_API_KEY = 'AIzaSyCHrN533OwfanNjCscx2dZbE4o9T0_j0sg' 
    client = genai.Client(api_key=GEMINI_API_KEY)
except Exception as e:
    print(f"Error initializing Gemini client: {e}")
    client = None

def give_customized_mood_response(text: str):
    if not client:
      print("Gemini client not initialized. Check API key setup.")
      return None
    
    if len(text)==0:
        return None
    prompt = (f"Based on the user text analyze the mood he is in and suggest movie genres he might like. The response_text must contain customized response something like: Since you are in a mood to watch so and so"
            f"I want you to output a json with response_text as string and standardised genres as a list of strings in lowercase do not output none values, output empty string if no genre or no response_text \n\Mood: \"{text}\"")
    try:
        # Call the API with the structured output configuration
        response = client.models.generate_content(
            model="gemini-2.5-flash",

            contents=[prompt],
            config={
                   'response_mime_type': 'application/json',
                'response_schema': MovieGenre,
            }
        )
        extracted_response = MovieGenre.model_validate_json(response.text)
        return extracted_response

    except Exception as e:
        print(f"Error generating content: {e}")
        return None
    

    # except APIError as e:
    #     print(f"Gemini API Error (Check Quota/Rate Limits): {e}")
    #     return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None
    

# res = give_customized_mood_response("light, uplifting")
# print(f"{res.response_text} {','.join(res.genres)}")

Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.


### 4. Create reranker module

In [41]:
# src/models/reranker_model
from sentence_transformers import CrossEncoder
import numpy as np
from typing import List, Tuple

class ReRankerModel:
    """
    Wrapper for Cross-Encoder models to re-rank candidates.
    """
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model_name = model_name
        self._model: CrossEncoder | None = None

    def load(self) -> None:
        if self._model is None:
            self._model = CrossEncoder(self.model_name)

    def rank(self, query: str, documents: List[str]) -> np.ndarray:
        """
        Returns a relevance score for each document relative to the query.
        """
        if not documents:
            return np.array([])
            
        self.load()
        # Create pairs: [[query, doc1], [query, doc2]...]
        pairs = [[query, doc] for doc in documents]
        scores = self._model.predict(pairs)
        return scores

### 5. Recommendation Module: 
This module searches user's query with FAISS embeddings calculates similarity index, uses reranker module to rerank results, applies heard filters on language, gives soft boosts on Genre to improve similarity score 

In [42]:
# src/recommender/recommendation_engine
from typing import List, Dict, Any
import numpy as np
import pandas as pd
from typing import Optional
from src.models.reranker_model import ReRankerModel
from src.models.embedding_model import EmbeddingModel
from src.index.faiss_index import FaissIndex
from src.data.movie_repository import MovieRepository
from src.config.settings import (
    TOP_K_RECOMMENDATIONS,
    GENRE_BOOST,
    POPULARITY_BOOST,
)
GENRE_BOOST = 0.3 #can adjust accordingly
POPULARITY_BOOST = 0.05
TOP_K_RECOMMENDATIONS = 10

class RecommendationEngine:
    """
    Core recommendation engine combining semantic similarity,
    hard filters, and soft ranking boosts.
    """

    def __init__(
        self,
        repository: MovieRepository,
        embedding_model: EmbeddingModel,
        faiss_index: FaissIndex,
        index_to_movie_id: np.ndarray,
        reranker: ReRankerModel,
    ):
        self.repository = repository
        self.embedding_model = embedding_model
        self.index = faiss_index
        self.index_to_movie_id = index_to_movie_id
        self.reranker = reranker

    def recommend(
        self,
        user_profile: Dict[str, Any],
        top_k: int = TOP_K_RECOMMENDATIONS,
        faiss_k: int = 50,
    ) -> pd.DataFrame:
        """
        Generate movie recommendations.
        """
        query_text = user_profile.get("query_text")
        # intents = user_profile.get('intent_terms')
        if not query_text:
            raise ValueError("query_text is required")

        # 1. Embed user query
        text_to_embed = query_text
        query_vector = self.embedding_model.embed_text(text_to_embed)

        # 2. FAISS retrieval
        scores, indices = self.index.search(query_vector, top_k=faiss_k)

        candidate_ids = self.index_to_movie_id[indices]
        df_candidates = self.repository.df[
            self.repository.df["movie_id"].isin(candidate_ids)
        ].copy()

        # Attach similarity scores
        score_map = dict(zip(candidate_ids, scores))
        df_candidates["similarity_score"] = df_candidates["movie_id"].map(score_map)

        # 3. Hard filters
        df_candidates = self._apply_hard_filters(
            df_candidates,
            user_profile
        )

        if df_candidates.empty:
            return df_candidates
        
        if self.reranker and not df_candidates.empty:
            documents = df_candidates["embedding_text"].fillna("").tolist()

            rerank_scores = self.reranker.rank(
                query=query_text,
                documents=documents
            )

            df_candidates["rerank_score"] = rerank_scores

            # Combine FAISS similarity + reranker score
            df_candidates["final_score"] = (
                0.6 * df_candidates["similarity_score"]
                + 0.4 * df_candidates["rerank_score"]
            )
        else:
            df_candidates["final_score"] = df_candidates["similarity_score"]

        # 4. Soft boosts
        df_candidates["final_score"] = df_candidates["similarity_score"]
        df_candidates["final_score"] += self._genre_boost(
            df_candidates,
            user_profile.get("genres")
        )
        df_candidates["final_score"] += self._popularity_boost(df_candidates)
        df_candidates = df_candidates.drop_duplicates(subset=['movie_id'])

        # 5. Rank & return
        return (
            df_candidates
            .sort_values("final_score", ascending=False)
            .head(top_k)
            .reset_index(drop=True)
        )

    # -----------------------
    # Hard filters
    # -----------------------

    def _apply_hard_filters(
        self,
        df: pd.DataFrame,
        user_profile: Dict[str, Any],
    ) -> pd.DataFrame:
        languages = user_profile.get("language")
        # runtime = user_profile.get("runtime", {})

        if languages:
            df = df[df["language"].isin(languages)]

        # if runtime:
        #     if runtime.get("max") is not None:
        #         df = df[df["runtime"] <= runtime["max"]]
        #     if runtime.get("min") is not None:
        #         df = df[df["runtime"] >= runtime["min"]]
        #     if runtime.get("exact") is not None:
        #         df = df[df["runtime"] == runtime["exact"]]
            
        # if not runtime:
        #     df = df[df['runtime'] >= 50]

        return df

    # -----------------------
    # Soft boosts
    # -----------------------

    def _genre_boost(
        self,
        df: pd.DataFrame,
        preferred_genres: Optional[List[str]],
    ) -> np.ndarray:
        if not preferred_genres:
            return 0.0

        def match(genres: str) -> float:
            if not isinstance(genres, str):
                return 0.0
            movie_genres = set(genres.split(","))
            return GENRE_BOOST if movie_genres & set(preferred_genres) else 0.0

        return df["genres"].apply(match)

    def _popularity_boost(self, df: pd.DataFrame) -> np.ndarray:
        """
        Light popularity boost using vote_average.
        """
        if "vote_average" not in df.columns:
            return 0.0

        # Normalize to 0‚Äì1 range
        votes = df["vote_average"].fillna(0)
        norm = (votes - votes.min()) / (votes.max() - votes.min() + 1e-6)

        return norm * POPULARITY_BOOST


### 6.Conversation Manager Module:
I define a state class that manages all variables,  defined helper modules:

6.1 ConversationState -> state class that defines objects or fields in state machine

6.2 interpreter -> interprets users text, like: mood , genres selected, languages, 

6.3 ConversatioManager -> manages conversation state, asks questions processes text

6.4 get_useful_info: -> converts users preferences to a dictionary that can be fed to recommender module
 

In [43]:
# 6.1 # src/conversation_state.py

from dataclasses import dataclass, field
from typing import List, Optional

from src.mood_genre import MovieGenre

@dataclass
class ConversationState:
    mood: Optional[str] = None
    suggested_genres: List[str] = field(default_factory=list)
    selected_genres: List[str] = None
    movie_description: str = None
    language: Optional[str] = None
    res: Optional[MovieGenre] = None
    current_step: str = "ask_mood"
    is_complete = False

    def clear(self):
        self.__init__()


In [None]:
# 6.2 interpreter
# src/interpreter.py

import re
import requests
import nltk
# from spellchecker import SpellChecker
from rapidfuzz import fuzz
from thefuzz import process
import pycountry
import spacy
import re
from typing import Optional, Dict, Any
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# from src.config.settings import genres,YES_WORDS,NO_WORDS, LANGUAGES
# from src.llm.extract_movie_info import extract_movie_info

# spell = SpellChecker()



try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Error: Model 'en_core_web_sm' not found. Please run 'python -m spacy download en_core_web_sm'")
    exit()


try:
    stopwords.words('english')
except LookupError as e:
    print("Downloading NLTK stopwords...", e)


# Define the stop words set globally for maximum efficiency
STOP_WORDS_SET = set(stopwords.words('english'))


def extract_plot_text(text: str, state) -> str:
    if not text:
        return text
    
    plot=''
    enriched = extract_movie_info(text)
    if not enriched:
        return text
    
    if enriched.plot:
      plot = str(enriched.plot).strip()

    # print("========enriched plot========",plot, enriched.themes )

    if enriched.themes:
        themes= ', '.join([a for a in enriched.themes])
        plot += themes

    return plot


   

def remove_stopwords(text: str) -> str:
   word_tokens = word_tokenize(text.lower()) # Convert to lowercase first
   
   filtered_sentence = [w for w in word_tokens if w.isalnum() and w not in STOP_WORDS_SET]
 #  print(filtered_sentence)
   return ' '.join(filtered_sentence)


def detect_yes_no(text: str):
 text = text.lower().strip() if text is not None else ''

    # 1. IMMEDIATE CHECK: Exact Match
    # If the user types a perfect match, we return immediately.
 if text in YES_WORDS:
     return "yes"
 if text in NO_WORDS:
        return "no"
 best_yes_score = 0
 for w in YES_WORDS:
        score = fuzz.ratio(w, text)
        if score > best_yes_score:
            best_yes_score = score
            
    # Check against NO words
 best_no_score = 0
 for w in NO_WORDS:
        score = fuzz.ratio(w, text)
        if score > best_no_score:   
            best_no_score = score

    # Decision Logic: Check if the best score meets the threshold
    # and is significantly better than the opposing category.
    
 if best_yes_score >= 80 and best_yes_score > best_no_score:
        return "yes"
    
 if best_no_score >= 80 and best_no_score > best_yes_score:
        return "no"

 return None


# -------------------------------
# Genre extraction
# -------------------------------
def extract_genre(text: str):
    CORRECT_GENRES = [ g.lower() for g in genres]
    lowercased_text=text.lower()
    cleaned_text=remove_stopwords(lowercased_text)

    candidates = re.split(r'[,\s-]+', cleaned_text)
    best_matches = [process.extractOne(c, CORRECT_GENRES) for c in candidates]
   # candidates = [word for word in text.lower().split() if len(word) > 2]
    # Extract the corrected name if the score is above the threshold (85)
    # print(best_matches)
    corrected_genres = [match[0] for match in best_matches if match[1] > 80]
    
    # Return unique, correctly spelled genres

    if len(corrected_genres) ==0:
        return None
    return list(set(corrected_genres))


def extract_language(text: str):
    result = detect_yes_no(text)

    if result =='no':
        return None
    CORRECT_LANG = [ l.lower() for l in LANGUAGES]
  #  candidates = [word for word in text.lower().split() if len(word) > 2]
    lowercased_text=text.lower() if text is not None else ''
    cleaned_text=remove_stopwords(lowercased_text)
    candidates = re.split(r'[,\s-]+', cleaned_text)
    best_matches = [process.extractOne(c, CORRECT_LANG) for c in candidates]
    
    # Extract the corrected name if the score is above the threshold (85)
    corrected_languages = [match[0] for match in best_matches if match[1] > 75]
    if len(corrected_languages) ==0:
        return None
    
    # Return unique, correctly spelled genres
    return list(set(corrected_languages))



all_language_names = [language.name for language in pycountry.languages]





# def extract_runtime(text: str) -> Optional[Dict[str, Any]]:
#     text= text if text is not None else None
#     t = text.lower().strip()

#     # ---------------------------------
#     # 1) Detect constraint type (symbols first)
#     # ---------------------------------

#     result = detect_yes_no(t)

#     if result == 'no':
#         return None
    
#     if re.search(r"(<=|<)", t):
#         ctype = "max"
#     elif re.search(r"(>=|>)", t):
#         ctype = "min"
#     elif re.search(r"(under|below|less\s+than|at\s+most|within|no\s+more\s+than)", t):
#         ctype = "max"
#     elif re.search(r"(over|greater\s+than|more\s+than|at\s+least|minimum|min\.?)", t):
#         ctype = "min"
#     else:
#         ctype = "exact"

#     # ---------------------------------
#     # 2) Parse runtime and convert to minutes
#     # ---------------------------------
#     minutes = None

#     # handles: "1h 30m", "1 hr 20 min", "2h", "45m"
#     hm = re.search(
#         r"(?:(\d+(?:\.\d+)?)\s*(h|hr|hrs|hour|hours))?\s*"
#         r"(?:(\d+(?:\.\d+)?)\s*(m|min|mins|minute|minutes))?",
#         t
#     )
#     if hm and (hm.group(1) or hm.group(3)):
#         hours = float(hm.group(1)) if hm.group(1) else 0
#         mins = float(hm.group(3)) if hm.group(3) else 0
#         minutes = int(round(hours * 60 + mins))

#     # handles: "< 30 mins", "> 2 hours", "less than 90 minutes"
#     if minutes is None:
#         m = re.search(
#             r"(<=|>=|<|>|under|greater\s+than|below|less\s+than|over|more\s+than|at\s+least|at\s+most)?\s*"
#             r"(\d+(?:\.\d+)?)\s*(h|hr|hrs|hour|hours|m|min|mins|minute|minutes)\b",
#             t
#         )
#         if m:
#             val = float(m.group(2))
#             unit = m.group(3)
#             minutes = int(round(val * 60)) if unit.startswith("h") else int(round(val))

#     if minutes is None:
#         return None

#     return {"type": ctype, "minutes": minutes}

In [None]:
# 6.3  src/conversation/conversation_manager
# src/conversation_manager.py
# from src.llm.give_customized_mood_response import give_customized_mood_response
# from src.conversation.state import ConversationState
# from src.conversation.interpreter import (
#     extract_genre, detect_yes_no, extract_language, extract_plot_text)

#from src.predict_mood import predict_mood
class ConversationManager:

    def __init__(self):
        self.state = ConversationState()


    def next_question(self, state):
        step = state.current_step

        if step == "ask_mood":
            return "What's your mood today, I can recommend you geners you might like üßô‚Äç‚ôÇÔ∏è"
        
        if step == "print_mood_response":
            if  state.selected_genres is None:
              predicted_genres = ','.join([a for a in state.suggested_genres])
              return f"{state.res.response_text}, predicted genres:{predicted_genres} \n  Enter if any other prefered genre else type no.."
            else:
                state.current_step = 'ask_language'
                return ( f"Gotcha! noted that you are interested in: {','.join(state.selected_genre)}."
                        f"Enter language preference if any else type no"
                )
        
        
        if step == "similar_movies":
            return "Please share a similar movie's name or plot description to help me understand your taste better:"
    
        if step == "ask_language":
            return "Please enter prefered language if any else type: 'no'"


        return None

    def update_state(self, state, user_input):
        text = user_input.lower()

        if not len(text):
            return "Please enter valid input, type 'exit' to quit"

        if state.current_step == "ask_mood":

            genres = extract_genre(text) if  extract_genre(text) else None

            response_generated= give_customized_mood_response(text)

            state.suggested_genres = response_generated.genres
            state.selected_genre = genres
            state.current_step = "print_mood_response"
            state.res = response_generated
            return state

        if state.current_step == "print_mood_response":

            yn= detect_yes_no(user_input)
            if yn == "no":
                state.selected_genre = None

            else:
                genre = extract_genre(user_input)
                # print("=====selected genre", genre)
                state.current_step 
                state.selected_genres = genre
           
            state.current_step = "ask_language"
            return state



        if state.current_step == "ask_language":
            yn= detect_yes_no(user_input)
            if yn == "no":
                state.language = None
            else:
                language = extract_language(user_input)
                state.language = language
            state.current_step = "similar_movies"
            return state
        
        if state.current_step == "similar_movies":
            state.movie_description =extract_plot_text(user_input, state)
            state.current_step = "reached_end"
            return state
  

        # if state.current_step == "ask_language_value":
        #   #  language_text= correct_spelling(user_input)
        #     state.language = extract_language(user_input)
        #     state.current_step = "ask_runtime"
        #     return state

        # if state.current_step == "ask_runtime":
        #     yn= detect_yes_no(user_input)
        #     # print("========yn value for runtime:", yn)
        #     if yn == "no":
        #         state.runtime = None
        #     else:
        #         runtime = extract_runtime(user_input)
        #         state.runtime = runtime
        #     state.current_step ="runtime_done"
        #     return state



        return state


In [46]:
# 6.4 src/conversation/get_useful_info 

# from src.conversation.state import ConversationState


def get_useful_info(state: ConversationState) -> dict:
    info = {
        "query_text": state.movie_description if state.movie_description else '',
        "genres": list(state.selected_genres) if state.selected_genres else list(state.suggested_genres),
        "language": list(state.language) if state.language else [],
        # "runtime": state.runtime if state.runtime else None
    }
    return info

### 7. Chatbot Service: Which is the orchestrator of Conversation manager

In [47]:
# from src.conversation.conversation_manager import ConversationManager
# from src.conversation.state import ConversationState
from typing import Union

class ChatbotService():

    def __init__(self):

        self.waiting_for_answer = False
        self.state = ConversationState()
        self.manager = ConversationManager()
        

    def run_chatbot(self):
        print("\nHowdy! I am your movie assistant Nova!\n")

        self.state = ConversationState()
        self.manager = ConversationManager()

        while self.state.current_step != "reached_end":
            question = self.manager.next_question(self.state)
            if question:
                  print("\nNova:", question)
            user_input = input("You: ")
            if user_input.lower() in ["exit", "quit", "q", "bye"]:
               print("Exiting the conversation. Goodbye!")
               break
            elif user_input.strip() == "":
               print("Please provide a valid response.")
               continue
            else:
              self.state = self.manager.update_state(self.state, user_input)

        print("\n--- Conversation Complete ---")
        print("Final state to feed recommender:\n")
        return self.state

In [48]:
import sys
from pathlib import Path
from dotenv import load_dotenv
from src.models.reranker_model import ReRankerModel
from src.conversation.get_useful_info import get_useful_info
from src.chatbotservice.chatbot_service import ChatbotService

from src.data.movie_repository import MovieRepository
from src.models.embedding_model import EmbeddingModel
from src.index.index_builder import IndexBuilder
from src.recommender.recommendation_engine import RecommendationEngine



def print_recommendations(df):
    print("\n There you Go! üé¨ Here are some recommended movies for you!:\n")
    for i, row in df.iterrows():
        print(
            f"{i+1}. {row['title']} "
            f"(Score: {row['final_score']:.3f})"
        )
        
        # Check if genres is a string; if not (like NaN), use "Unknown"
        raw_genres = row['genres']
        if isinstance(raw_genres, str):
            genres_list = raw_genres.split('|')
            genres_display = ', '.join(genres_list)
        else:
            genres_display = "Unknown"
            
        print(f"   Genres: {genres_display}")
        print(f"   Runtime: {row['runtime']} mins")
        print()


def main():
#     # ------------------------------------------------
#     # Load data
#     # ------------------------------------------------
#     load_dotenv()
    repo = MovieRepository()
    repo.load()



    # engine.recommend(user_profile)

    # ------------------------------------------------
    # Load embedding model
    # ------------------------------------------------
    embedder = EmbeddingModel()

    re_ranker = ReRankerModel()

    # ------------------------------------------------
    # Load FAISS index + mapping
    # ------------------------------------------------
    builder = IndexBuilder(repo, embedder)
    faiss_index, mapping = builder.load_index()

    # ------------------------------------------------
    # Initialize core services
    # ------------------------------------------------



    service = ChatbotService()

    result =service.run_chatbot()

    # print("Conversation State:", vars(result))


    recommender = RecommendationEngine(
        repository=repo,
        embedding_model=embedder,
        faiss_index=faiss_index,
        index_to_movie_id=mapping,
        reranker =re_ranker
    )
     

    user_profile = get_useful_info(result)

    # user_profile = {'query_text': 'love destiny power mythology good vs evil epic magical saga', 'genres': ['comedy', 'action', 'adventure'], 'language': None, 'runtime': None}

    # user_profile = get_useful_info(result)
    print("User Profile for Recommendations:", user_profile)
    recommended_df = recommender.recommend(user_profile)
    print_recommendations(recommended_df)


 


if __name__ == "__main__":
    main()


Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.



Howdy! I am your movie assistant Nova!


Nova: What's your mood today, I can recommend you geners you might like üßô‚Äç‚ôÇÔ∏è

Nova: Since you've had a stressful day, I recommend something light and uplifting to help you unwind., predicted genres:comedy,feel-good,slice of life 
  Enter if any other prefered genre else type no..
=====selected genre ['fantasy']

Nova: Please enter prefered language if any else type: 'no'

Nova: Please share a similar movie's name or plot description to help me understand your taste better:

--- Conversation Complete ---
Final state to feed recommender:

User Profile for Recommendations: {'query_text': 'A young orphan discovers he is a wizard and is invited to attend Hogwarts School of Witchcraft and Wizardry, where he uncovers a dark plot.magic, fantasy, adventure, friendship', 'genres': ['fantasy'], 'language': ['english']}

 There you Go! üé¨ Here are some recommended movies for you!:

1. Wendy (Score: 0.854)
   Genres: fantasy, drama
   Runtime: 11

#### Evaluation and Analysis:
This system does not use a single accuracy metric like traditional ML models. Instead, it uses a ranking-based scoring approach, where each movie is given a relevance score and then ordered from most relevant to least relevant for a given query. The goal is not to predict a label, but to rank the best movies at the top.

Types of scores used in the pipeline
#### 1.	Semantic similarity score (embedding score)
The user query and each movie‚Äôs text (plot + themes) are converted into vector embeddings. A similarity score is computed using vector search (FAISS). This score measures how semantically close the movie description is to the query. This step is fast and helps retrieve relevant candidates, but it is approximate.
#### 2.	Re-ranker score (cross-encoder score)
For the shortlisted candidates, a cross-encoder model scores how well each movie actually matches the query when both texts are read together. This score focuses on intent, tone, and narrative alignment, and is used to reorder results for better relevance.
#### 3.	Final ranking score (combined score)
The final score is a weighted combination of the semantic similarity score, the re-ranker score, and small optional boosts (such as genre preference or popularity). Movies with the highest final score are returned to the user.

The model is evaluated using qualitative relevance checks, not numeric accuracy. 
We check whether:
 *	the top results match the user‚Äôs mood or genres prefered
 * incorrect genre or tone matches are pushed lower
 * 	known reference movies (e.g., Enola Holmes, Mean Girls, Oppenheimer) appear near the top for relevant queries
This approach reflects real-world recommender systems, where ranking quality and user relevance matter more than a single accuracy number




#### Ethical Considerations & Responsible AI

#### Bias in Data representation:
The dataset used for this project is currently limited to 12k and extracted from tmdb dataset which might over represent English language movies, which can bias recommendations toward widely known titles and under-represent regional, independent, or niche cinema.

#### Fairness in Recommendations
Soft boosting for genres, language, and popularity is carefully tuned to avoid disproportionately favoring highly popular movies, helping ensure lesser known films still surface when they semantically match user preferences.

#### Limitations of Mood Interpretation
Mood and intent are inferred from natural language inputs and may be ambiguous or context-dependent. The system avoids making sensitive assumptions and uses mood signals only to guide content recommendations, not to profile users.

#### Transparency and Explainability
Recommendation logic is designed to be interpretable, combining semantic similarity with explicit preference boosts, making it possible to explain why certain movies were recommended.

#### Responsible Use of LLMs
LLMs are used strictly for interpreting user input and generating conversational responses, not for decision-making that affects users in high-risk domains, ensuring safe and appropriate usage.



### Summary and Furthere Enhancements

This project successfully demonstrates an end-to-end AI-powered movie recommendation system that combines data engineering, semantic search, and LLM-based interaction. By leveraging vector embeddings, FAISS indexing, and preference-based ranking, the system is able to generate relevant and personalized movie recommendations from a dataset of approximately 13,000 movies. The conversational flow enables the assistant to understand user mood and preferences in natural language, resulting in recommendations that are both contextually relevant and efficient to retrieve.

Future enhancements could include incorporating real user feedback signals such as clicks or ratings to enable learning-to-rank and more robust evaluation metrics. The system can be extended with larger and more diverse datasets to reduce popularity and language bias. Additional improvements may include fine-tuning embedding models on movie-specific data, introducing session-level memory for better personalization, and expanding the assistant to support other domains such as TV shows or personalized content discovery platforms.