# 🎬 Advanced Movie Recommender System

This Jupyter Notebook serves as the comprehensive development environment for an advanced movie recommendation system. It covers data loading, extensive preprocessing and feature engineering, implementation of various recommendation models (popularity-based, content-based, collaborative filtering, and hybrid), and their evaluation.

## Table of Contents

1.  [**Setup and Data Loading**](#setup-data-loading)
2.  [**Phase 1: Popularity-Based Recommender & Core Preprocessing**](#popularity-recommender)
3.  [**Phase 2: Advanced Feature Engineering for Content-Based Models**](#advanced-feature-engineering)
4.  [**Phase 3: Enhancing Recommendation Models**](#enhancing-models)
    *   [3.1. Content-Based Recommenders (Plot & Metadata)](#content-based-recs)
    *   [3.2. Simple Hybrid Recommender (Content + Weighted Rating)](#simple-hybrid)
    *   [3.3. Collaborative Filtering (SVD)](#collaborative-filtering)
5.  [**Phase 4: Evaluation and Metrics**](#evaluation-metrics)

In [1]:
import pandas as pd
import numpy as np
import re
from ast import literal_eval

# NLTK imports and dummies 
try:
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    print("NLTK successfully imported and initialized.")
except ImportError:
    print("NLTK not found. Text preprocessing (lemmatization/stopwords) will be skipped.")
    print("Please install it: pip install nltk && python -c 'import nltk; nltk.download(\"punkt\"); nltk.download(\"wordnet\"); nltk.download(\"stopwords\")'")
    def word_tokenize(text): return str(text).split()
    class DummyLemmatizer:
        def lemmatize(self, word): return word
    lemmatizer = DummyLemmatizer()
    class DummyStopwords:
        def words(self, lang): return set()
    stopwords_instance = DummyStopwords()
    stop_words = stopwords_instance.words('english')


from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate, train_test_split

NLTK successfully imported and initialized.


<a id='setup-data-loading'></a>
## 1. Setup and Data Loading

In this section, we load all the necessary datasets:
- `movies_metadata.csv`: Contains core movie information including title, overview, vote counts, and averages (used for content features and popular charts).
- `credits.csv`: Contains cast and crew information (used for metadata features).
- `keywords.csv`: Contains associated keywords (used for metadata features).
- `ratings_small.csv`: Contains user ratings (used for collaborative filtering and evaluation ground truth).
- `links.csv`: Crucial for mapping MovieLens IDs (used in `ratings_small.csv`) to TMDB IDs (used in `movies_metadata.csv`) to establish a consistent ID space.

All these files are expected to be located in a `data/` subdirectory.

In [2]:
# --- 0. Data Loading ---
print("--- Loading Data ---")
try:
    metadata = pd.read_csv('data/movies_metadata.csv', low_memory=False)
    credits = pd.read_csv('data/credits.csv')
    keywords = pd.read_csv('data/keywords.csv')
    ratings = pd.read_csv('data/ratings_small.csv')
    links = pd.read_csv('data/links.csv') # Should contain movieId, imdbId, tmdbId
    print("All CSV files loaded successfully (including links.csv).")
except FileNotFoundError as e:
    print(f"Error loading CSV files. Make sure 'data/' directory exists and contains: movies_metadata.csv, credits.csv, keywords.csv, ratings_small.csv, and links.csv. Error: {e}")
    print("Exiting script as essential data is missing.")
    exit()

--- Loading Data ---
All CSV files loaded successfully (including links.csv).


<a id='popularity-recommender'></a>
## 2. Phase 1: Popularity-Based Recommender & Core Preprocessing

This phase focuses on two key aspects:

1.  **Building a Simple Top Movies Chart:** We utilize a weighted rating formula, similar to IMDb's, to rank movies. This prevents movies with few (but high) ratings from unfairly dominating the chart.
    *   **Formula:** Weighted Rating (WR) = (v / (v + m)) * R + (m / (v + m)) * C
        *   `v`: `vote_count` (number of votes)
        *   `m`: minimum votes required to be in the chart (set at the 90th percentile of `vote_count` across the dataset)
        *   `R`: `vote_average` (average rating)
        *   `C`: The mean `vote_average` across the entire dataset.

2.  **Initial Data Integration and ID Alignment:** Crucially, we merge the disparate datasets (`metadata`, `credits`, `keywords`) into a single master DataFrame (`df`). A critical step is to align movie IDs. `movies_metadata.csv` uses TMDB IDs, while `ratings_small.csv` uses MovieLens IDs. `links.csv` provides the necessary bridge to map `MovieLensId` to `TMDBId`, allowing us to connect our content data with user ratings. This creates a unified `movieId` column in our main `df` DataFrame.

In [3]:
# --- 1. Phase 1: Simple Top Movies Chart Recommender & Preprocessing ---
print("\n--- Phase 1: Simple Top Movies Chart Recommender Setup & Data Cleaning ---")

metadata = metadata[pd.to_numeric(metadata['id'], errors='coerce').notnull()]
metadata['id'] = metadata['id'].astype('int')

credits['id'] = credits['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')

df = metadata.merge(credits, on='id', how='inner')
df = df.merge(keywords, on='id', how='inner')
df.drop_duplicates(subset=['id'], inplace=True)
print(f"Merged DataFrame initial shape: {df.shape[0]} rows, {df.shape[1]} columns")


--- Phase 1: Simple Top Movies Chart Recommender Setup & Data Cleaning ---
Merged DataFrame initial shape: 45432 rows, 27 columns


In [4]:
# --- Create MovieID mapping between ratings and metadata ---
print("\n--- Creating Movie ID Mappings ---")
links = links[pd.to_numeric(links['tmdbId'], errors='coerce').notnull()]
links['tmdbId'] = links['tmdbId'].astype('int')

movieId_to_tmdbId = links.set_index('movieId')['tmdbId'].to_dict()
tmdbId_to_movieId = {v: k for k, v in movieId_to_tmdbId.items()}

df['movieId'] = df['id'].map(tmdbId_to_movieId)
df.dropna(subset=['movieId'], inplace=True) # Drop movies without a MovieLens ID mapping
df['movieId'] = df['movieId'].astype('int')

print(f"Main DataFrame `df` now aligned with MovieLens IDs (via `df['movieId']`). Shape: {df.shape}")


--- Creating Movie ID Mappings ---
Main DataFrame `df` now aligned with MovieLens IDs (via `df['movieId']`). Shape: (45432, 28)


In [5]:
# --- Robust Missing Value Handling & Type Conversion for Numerical Features ---
df['vote_count'] = pd.to_numeric(df['vote_count'], errors='coerce').fillna(0)
df['vote_average'] = pd.to_numeric(df['vote_average'], errors='coerce')

C = df['vote_average'].mean()
df['vote_average'] = df['vote_average'].fillna(C)
print(f"Overall Mean vote average (C): {C:.2f}")

m = df['vote_count'].quantile(0.90)
print(f"Minimum votes required (m for chart): {m:.0f}")

def weighted_rating(x, m_val, c_val):
    v = x['vote_count']
    R = x['vote_average']
    if v == 0:
        return c_val
    return (v / (v + m_val)) * R + (m_val / (m + v)) * c_val

df['score'] = df.apply(lambda x: weighted_rating(x, m_val=m, c_val=C), axis=1)

print("\n--- Top 15 Movies Chart (based on Weighted Rating) ---")
print(df[['title', 'vote_count', 'vote_average', 'score']].sort_values('score', ascending=False).head(15).to_string())

Overall Mean vote average (C): 5.62
Minimum votes required (m for chart): 160

--- Top 15 Movies Chart (based on Weighted Rating) ---
                                 title  vote_count  vote_average     score
314           The Shawshank Redemption      8358.0           8.5  8.445874
837                      The Godfather      6024.0           8.5  8.425445
10357      Dilwale Dulhania Le Jayenge       661.0           9.1  8.421501
12541                  The Dark Knight     12269.0           8.3  8.265480
2858                        Fight Club      9678.0           8.3  8.256389
292                       Pulp Fiction      8670.0           8.3  8.251410
522                   Schindler's List      4436.0           8.3  8.206648
23818                         Whiplash      4376.0           8.3  8.205413
5505                     Spirited Away      3968.0           8.3  8.196064
2223                 Life Is Beautiful      3643.0           8.3  8.187182
1187            The Godfather: Part II   

<a id='advanced-feature-engineering'></a>
## 3. Phase 2: Advanced Feature Engineering for Content-Based Models

This section prepares the data for sophisticated content-based recommendation.

The key improvements here are:

1.  **Robust String Parsing:** Safely converts stringified lists (e.g., `cast`, `genres`, `keywords`) into actual Python lists using `literal_eval`, handling missing values and malformed strings gracefully.
2.  **Structured Feature Extraction:** Specific helper functions `get_director` and `get_top_n_names` extract and format relevant information (e.g., top 3 cast members, director) into clean lists of strings.
3.  **Advanced Text Preprocessing:**
    *   **Normalization:** Lowercasing and removing non-alphabetic characters.
    *   **Tokenization:** Breaking text into words.
    *   **Stopword Removal:** Eliminating common words (like "the", "a") that don't add semantic value.
    *   **Lemmatization:** Reducing words to their base or dictionary form (e.g., "running" -> "run", "better" -> "good"). This significantly improves text feature quality by ensuring different word forms are treated as the same concept. This is applied to `overview` (for plot-based) and to the individual components that form the 'metadata soup'.
4.  **Creating a "Processed Soup":** A single string (`processed_soup`) is created by combining cleaned versions of director, top cast, keywords, and genres. This serves as the consolidated textual input for metadata-based similarity calculations.

In [6]:
# --- 2. Advanced Feature Extraction and Text Preprocessing ---
print("\n--- Advanced Feature Extraction & Text Preprocessing ---")

def safe_literal_eval(val):
    if isinstance(val, str):
        try:
            return literal_eval(val)
        except (ValueError, SyntaxError):
            return []
    return []

features_to_parse = ['cast', 'crew', 'keywords', 'genres']
for feature in features_to_parse:
    df[feature] = df[feature].apply(safe_literal_eval)

def get_director(crew_list):
    for member in crew_list:
        if isinstance(member, dict) and member.get('job') == 'Director':
            return member.get('name', '')
    return ''

def get_top_n_names(list_of_dicts, n=3):
    names = []
    if isinstance(list_of_dicts, list):
        for item in list_of_dicts:
            if isinstance(item, dict) and 'name' in item:
                names.append(item['name'])
    return names[:n]

df['director'] = df['crew'].apply(get_director)
df['cast'] = df['cast'].apply(get_top_n_names)
df['keywords'] = df['keywords'].apply(get_top_n_names)
df['genres'] = df['genres'].apply(get_top_n_names)

df['overview'] = df['overview'].fillna('')

def clean_and_lemmatize_text(text):
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    tokens = word_tokenize(text)
    
    processed_tokens = []
    for word in tokens:
        if word not in stop_words:
            processed_tokens.append(lemmatizer.lemmatize(word))
    return " ".join(processed_tokens)

df['clean_overview'] = df['overview'].apply(clean_and_lemmatize_text)

def clean_list_of_strings(lst):
    if isinstance(lst, list):
        return [clean_and_lemmatize_text(item) for item in lst]
    return []

df['processed_keywords'] = df['keywords'].apply(clean_list_of_strings)
df['processed_cast'] = df['cast'].apply(clean_list_of_strings)
df['processed_genres'] = df['genres'].apply(clean_list_of_strings)
df['processed_director'] = df['director'].apply(clean_and_lemmatize_text)

def create_processed_soup(x):
    keywords_str = ' '.join(x['processed_keywords'])
    cast_str = ' '.join(x['processed_cast'])
    genres_str = ' '.join(x['processed_genres'])
    director_str = x['processed_director']
    soup = f"{keywords_str} {cast_str} {director_str} {genres_str}"
    return ' '.join(soup.split())

df['processed_soup'] = df.apply(create_processed_soup, axis=1)

print("Data preprocessing and feature engineering complete.")
print(f"Final DataFrame shape: {df.shape}")


--- Advanced Feature Extraction & Text Preprocessing ---
Data preprocessing and feature engineering complete.
Final DataFrame shape: (45432, 36)


<a id='enhancing-models'></a>
## 4. Phase 3: Enhancing Recommendation Models

With a thoroughly preprocessed and enriched dataset, we now implement and enhance various recommendation models.

1.  **Content-Based Recommenders:**
    *   **Plot-Based:** Utilizes `TfidfVectorizer` on the `clean_overview` column. TF-IDF highlights words unique to a document, reducing the weight of common words.
    *   **Metadata-Based:** Utilizes `CountVectorizer` on the `processed_soup` (director, cast, keywords, genres). CountVectorizer counts word occurrences.
    *   Both models employ **Cosine Similarity** to measure the resemblance between movie content vectors. Recommendations are then generated by finding movies most similar to a given input movie.

2.  **Simple Hybrid Recommender (Content + Weighted Rating):** This model combines the content similarity score with the movie's overall `score` (weighted rating) to re-rank content-based recommendations. This aims to blend item-to-item similarity with a general measure of quality/popularity.

3.  **Collaborative Filtering (SVD):** Uses Singular Value Decomposition (SVD) from the `surprise` library to predict user-item ratings based on historical rating patterns. This enables personalized recommendations by identifying users with similar tastes. The model is trained on the `ratings_small.csv` dataset.

In [7]:
# --- 3. Enhancing Recommendation Models ---
print("\n--- Enhancing Recommendation Models ---")

# Create a dictionary mapping unique titles to their *first* corresponding DataFrame index.
title_to_index = pd.Series(df.index.values, index=df['title']).drop_duplicates().to_dict()


--- Enhancing Recommendation Models ---


In [8]:
# A. Plot-Based Recommender (using 'clean_overview')
print("\n-- Plot-Based Recommender Setup (TF-IDF) --")
tfidf_plot = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=5)
tfidf_matrix_plot = tfidf_plot.fit_transform(df['clean_overview'])
print(f"TF-IDF Matrix (Plot) Shape: {tfidf_matrix_plot.shape}")

def get_recommendations_plot_refined(title, tfidf_matrix_p=tfidf_matrix_plot, df_movies=df, title_map=title_to_index):
    if title not in title_map:
        return pd.DataFrame([{"title": "Movie not found. Please check the title (case-sensitive).", "similarity_score": np.nan}])

    idx = title_map[title]
    movie_vector = tfidf_matrix_p[idx]
    sim_scores = cosine_similarity(movie_vector, tfidf_matrix_p)
    sim_scores = list(enumerate(sim_scores[0]))

    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = [score for score in sim_scores if score[0] != idx]
    sim_scores = sim_scores[:10]

    movie_indices = [i[0] for i in sim_scores]
    recommended_movies = df_movies.iloc[movie_indices].copy()
    recommended_movies['similarity_score'] = [s[1] for s in sim_scores]
    return recommended_movies[['title', 'similarity_score']]

print("\n--- Testing Refined Plot-Based Recommender ---")
print("Recommendations for 'The Dark Knight Rises':")
print(get_recommendations_plot_refined('The Dark Knight Rises'))


-- Plot-Based Recommender Setup (TF-IDF) --
TF-IDF Matrix (Plot) Shape: (45432, 18744)

--- Testing Refined Plot-Based Recommender ---
Recommendations for 'The Dark Knight Rises':
                         title  similarity_score
27929             Remonstrance          0.176185
11286  Sketches of Frank Gehry          0.172817
830                   Basquiat          0.168594
17139              High School          0.168499
28091             Foreign Body          0.162601
31127    Nos amis les Terriens          0.161120
18319    Waiting for Happiness          0.149192
31785  Ghost of Goodnight Lane          0.146221
34876  Winter Evening in Gagry          0.142608
2120           Elstree Calling          0.134659


In [9]:
# B. Metadata-Based Recommender (using 'processed_soup')
print("\n-- Metadata-Based Recommender Setup (CountVectorizer) --")
count_metadata = CountVectorizer(stop_words='english')
cv_matrix_metadata = count_metadata.fit_transform(df['processed_soup'])
print(f"CountVectorizer Matrix (Metadata) Shape: {cv_matrix_metadata.shape}")

def get_recommendations_metadata_refined(title, cv_matrix_m=cv_matrix_metadata, df_movies=df, title_map=title_to_index):
    if title not in title_map:
        return pd.DataFrame([{"title": "Movie not found. Please check the title (case-sensitive).", "similarity_score": np.nan}])

    idx = title_map[title]
    movie_vector = cv_matrix_m[idx]
    sim_scores = cosine_similarity(movie_vector, cv_matrix_m)
    sim_scores = list(enumerate(sim_scores[0]))

    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = [score for score in sim_scores if score[0] != idx]
    sim_scores = sim_scores[:10]

    movie_indices = [i[0] for i in sim_scores]
    recommended_movies = df_movies.iloc[movie_indices].copy()
    recommended_movies['similarity_score'] = [s[1] for s in sim_scores]
    return recommended_movies[['title', 'similarity_score']]

print("\n--- Testing Refined Metadata-Based Recommender ---")
print("Recommendations for 'The Avengers':")
print(get_recommendations_metadata_refined('The Avengers'))


-- Metadata-Based Recommender Setup (CountVectorizer) --
CountVectorizer Matrix (Metadata) Shape: (45432, 52560)

--- Testing Refined Metadata-Based Recommender ---
Recommendations for 'The Avengers':
                                                   title  similarity_score
28631           Sold Out: A Threevening with Kevin Smith          0.250000
45855           Bird's-Eye View of Dock Front, Galveston          0.250000
24629                   TINY: A Story About Living Small          0.243332
32651                                 Star Spangled Girl          0.235702
27677                                        Santa Claus          0.226134
1934                             The Million Dollar Duck          0.223607
9838                                           Blackball          0.223607
17980                                 Me and the Colonel          0.223607
18057  The Enchanted World of Danny Kaye: The Emperor...          0.223607
26096                               The Hire: Po

In [10]:
# C. Simple Hybrid Recommender (Content-Based + Weighted Rating)
print("\n-- Simple Hybrid Recommender Setup --")
def get_hybrid_recommendations_content_weighted(title, content_source='plot', num_recommendations=10, df_movies=df, title_map=title_to_index):
    if title not in title_map:
        return pd.DataFrame([{"title": "Movie not found. Please check the title (case-sensitive).", "hybrid_ranking_score": np.nan}])

    idx = title_map[title]

    if content_source == 'plot':
        matrix = tfidf_matrix_plot
    elif content_source == 'metadata':
        matrix = cv_matrix_metadata
    else:
        raise ValueError("content_source must be 'plot' or 'metadata'.")

    movie_vector = matrix[idx]
    sim_scores_raw = cosine_similarity(movie_vector, matrix)[0]

    candidate_scores = [(i, score) for i, score in enumerate(sim_scores_raw) if i != idx]
    candidate_scores = sorted(candidate_scores, key=lambda x: x[1], reverse=True)[:50]

    candidate_movie_indices = [i[0] for i in candidate_scores]

    candidate_df = df_movies.iloc[candidate_movie_indices].copy()
    candidate_similarity_map = {original_idx: score for original_idx, score in candidate_scores}
    candidate_df['content_similarity'] = candidate_df.index.map(candidate_similarity_map)
    
    candidate_df['score'] = pd.to_numeric(candidate_df['score'], errors='coerce')
    max_score = candidate_df['score'].max()
    candidate_df['normalized_score'] = candidate_df['score'] / max_score if max_score > 0 else 0

    content_weight = 0.7
    popularity_weight = 0.3

    candidate_df['hybrid_ranking_score'] = (candidate_df['content_similarity'] * content_weight) + \
                                         (candidate_df['normalized_score'] * popularity_weight)

    final_recommendations = candidate_df.sort_values(by='hybrid_ranking_score', ascending=False)

    return final_recommendations[['title', 'hybrid_ranking_score', 'content_similarity', 'score', 'vote_average']].head(num_recommendations)

print("\n--- Testing Simple Hybrid Recommender (Content + Weighted Rating) ---")
print("Hybrid Recommendations for 'The Dark Knight Rises' (using Plot content):")
print(get_hybrid_recommendations_content_weighted('The Dark Knight Rises', content_source='plot'))


-- Simple Hybrid Recommender Setup --

--- Testing Simple Hybrid Recommender (Content + Weighted Rating) ---
Hybrid Recommendations for 'The Dark Knight Rises' (using Plot content):
                         title  hybrid_ranking_score  content_similarity  \
27929             Remonstrance                   NaN                 NaN   
11286  Sketches of Frank Gehry                   NaN                 NaN   
830                   Basquiat                   NaN                 NaN   
17139              High School                   NaN                 NaN   
28091             Foreign Body                   NaN                 NaN   
31127    Nos amis les Terriens                   NaN                 NaN   
18319    Waiting for Happiness                   NaN                 NaN   
31785  Ghost of Goodnight Lane                   NaN                 NaN   
34876  Winter Evening in Gagry                   NaN                 NaN   
2120           Elstree Calling                   NaN     

<a id='collaborative-filtering'></a>
### 4.4. Collaborative Filtering (SVD)

Collaborative Filtering (CF) focuses on leveraging user-item interaction data (ratings) to make recommendations. The underlying assumption is that users who agreed in the past on their ratings will likely agree again in the future.

We use **Singular Value Decomposition (SVD)** from the `surprise` library. SVD is a matrix factorization technique that decomposes the user-item interaction matrix into a set of latent factors for both users and items, which are then used to predict ratings for unrated items.

Key aspects:
-   `ratings_small.csv` provides the `userId`, `movieId`, and `rating`.
-   The `movieId`s from `ratings_small.csv` are consistent with the `movieId` column we added to our main `df` using `links.csv`, ensuring proper mapping to movie metadata.
-   `cross_validate` provides standard evaluation metrics for CF like **RMSE (Root Mean Squared Error)** and **MAE (Mean Absolute Error)**, which measure the accuracy of rating predictions.

In [12]:
# D. Collaborative Filtering (SVD)
print("\n--- Collaborative Filtering (SVD) ---")
df_rated_movies = df[df['movieId'].notna()].copy()

if not ratings.empty:
    reader = Reader(rating_scale=(0.5, 5)) # DEFINED READER HERE
    full_data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader) # DEFINED DATA HERE

    print("Performing cross-validation for SVD...")
    svd = SVD()
    cross_validate(svd, full_data, measures=['RMSE', 'MAE'], cv=5, verbose=False) # USE full_data HERE

    print("Training SVD model on full dataset for general recommendations...")
    trainset_full = full_data.build_full_trainset()
    svd_full = SVD()
    svd_full.fit(trainset_full)
    print("SVD model trained on full dataset.")

    user_id_test = 1
    if not df_rated_movies.empty and df_rated_movies['movieId'].isin(ratings['movieId']).any():
        movie_id_test_from_ratings = df_rated_movies[df_rated_movies['movieId'].isin(ratings['movieId'])]['movieId'].iloc[0]
        movie_title_for_test = df_rated_movies[df_rated_movies['movieId'] == movie_id_test_from_ratings]['title'].iloc[0]
        print(f"Testing SVD prediction for user {user_id_test} on movie {movie_id_test_from_ratings} ('{movie_title_for_test}').")
        prediction_example = svd_full.predict(user_id_test, movie_id_test_from_ratings)
        print(f"Predicted rating: {prediction_example.est:.2f}")
    else:
        movie_id_test_from_ratings = None
        print("No suitable movie found for example prediction between df_rated_movies and ratings. Skipping example.")


    def recommend_movies_svd(user_id, svd_model=svd_full, df_all_movies=df_rated_movies, num_recommendations=10):
        user_train_rated_movie_ids = set()
        if user_id in svd_model.trainset._raw2inner_id_users:
            inner_user_id = svd_model.trainset.to_inner_uid(user_id)
            user_train_rated_movie_ids_inner = {iid for iid, _ in svd_model.trainset.ur[inner_user_id]}
            user_train_rated_movie_ids = {svd_model.trainset.to_raw_iid(inner_iid) for inner_iid in user_train_rated_movie_ids_inner}

        predictions_list = []
        all_known_raw_iids_from_model = {svd_model.trainset.to_raw_iid(inner_id) for inner_id in svd_model.trainset.all_items()}
        
        movies_to_predict = df_all_movies[df_all_movies['movieId'].isin(list(all_known_raw_iids_from_model))]
        
        for _, row in movies_to_predict.iterrows():
            movie_id = row['movieId']
            if movie_id not in user_train_rated_movie_ids:
                prediction = svd_model.predict(user_id, movie_id)
                predictions_list.append({'movieId': movie_id, 'predicted_rating': prediction.est})

        if not predictions_list:
            print(f"User {user_id} has no unrated movies from the available set that also have full metadata.")
            return pd.DataFrame(columns=['title', 'predicted_rating', 'vote_average', 'vote_count', 'score'])

        predictions_df = pd.DataFrame(predictions_list)
        predictions_df = predictions_df.sort_values(by='predicted_rating', ascending=False)

        top_recommendations = pd.merge(
            predictions_df,
            df_all_movies[['movieId', 'title', 'vote_average', 'vote_count', 'score']],
            on='movieId',
            how='left'
        )
        top_recommendations.drop_duplicates(subset=['title'], inplace=True)
        return top_recommendations.head(num_recommendations)

    print("\n--- Testing Collaborative Filtering (SVD) Recommender for User 1 (trained on full data) ---")
    user_to_test_svd = 1
    top_10_recs_svd = recommend_movies_svd(user_to_test_svd, num_recommendations=10)
    print(f"Top 10 SVD Recommendations for User ID {user_to_test_svd}:")
    print(top_10_recs_svd.to_string())

else:
    print("Collaborative filtering part not executed because 'ratings_small.csv' was not loaded.")


--- Collaborative Filtering (SVD) ---
Performing cross-validation for SVD...
Training SVD model on full dataset for general recommendations...
SVD model trained on full dataset.
Testing SVD prediction for user 1 on movie 1 ('Toy Story').
Predicted rating: 2.79

--- Testing Collaborative Filtering (SVD) Recommender for User 1 (trained on full data) ---
Top 10 SVD Recommendations for User ID 1:
   movieId  predicted_rating                                title  vote_average  vote_count     score
0     2318          3.703517                            Happiness           7.4       197.0  6.601548
1      318          3.603997             The Shawshank Redemption           8.5      8358.0  8.445874
2     1204          3.582036                   Lawrence of Arabia           7.8       870.0  7.461119
3     2542          3.581224  Lock, Stock and Two Smoking Barrels           7.5      1671.0  7.335583
4     1203          3.523435                         12 Angry Men           8.2      2130.0  

<a id='evaluation-metrics'></a>
## 5. Phase 4: Evaluation and Metrics

Evaluating a recommendation system is critical to understanding its performance. Beyond basic predictive accuracy metrics (like RMSE and MAE for Collaborative Filtering), we also need to assess the quality of the *ranked lists* of recommendations generated by content-based and hybrid models.

### Key Metrics:

1.  **Precision@K:** Of the top K recommendations, what proportion are relevant (liked) to the user?
    *   `Precision@K = (Number of relevant items in top K) / K`
2.  **Recall@K:** Of all the relevant (liked) items, what proportion did the model recommend in its top K list?
    *   `Recall@K = (Number of relevant items in top K) / (Total number of relevant items)`
3.  **F1-Score@K:** The harmonic mean of Precision and Recall, providing a single metric that balances both.
    *   `F1-Score@K = 2 * (Precision * Recall) / (Precision + Recall)`

### Ground Truth for "Liked" Movies:

For evaluation, we define a movie as "liked" by a user if they rated it **4.0 stars or higher** in the `ratings_small.csv` dataset. This acts as our ground truth for determining relevance.

We will evaluate the performance for content-based and hybrid models by comparing their recommendations for certain seed movies (and associated users) against the ground truth. Collaborative filtering already provides RMSE/MAE, but we will also calculate Precision/Recall for it using a test set split.

**Note:** Achieving very high Precision/Recall on real-world, sparse datasets like MovieLens Small with limited `K` (e.g., 10) is challenging. Low but non-zero scores indicate successful matching, but also highlight the inherent difficulty and sparsity of recommendation tasks.

In [13]:
# --- Evaluation and Metrics Section ---
print("\n--- Phase 3: Evaluation and Metrics ---")

RATING_THRESHOLD = 4.0 # Definition of 'liked'

# user_liked_movies: maps userId to a SET of movieId that they rated >= RATING_THRESHOLD
user_liked_movies = {} 
for user_id in ratings['userId'].unique():
    liked_movie_ids_from_ratings = ratings[(ratings['userId'] == user_id) & (ratings['rating'] >= RATING_THRESHOLD)]['movieId'].tolist()
    # Ensure these liked movies are in our `df_rated_movies` for consistent evaluation context
    user_liked_movies[user_id] = set(df_rated_movies[df_rated_movies['movieId'].isin(liked_movie_ids_from_ratings)]['movieId'].tolist())

print(f"Generated 'liked movies' sets for {len(user_liked_movies)} users (threshold >= {RATING_THRESHOLD} stars).")


def precision_at_k(recommended_items_ids, liked_items_ids, k):
    recommended_k = set(recommended_items_ids[:k])
    num_hit_items = len(recommended_k.intersection(liked_items_ids))
    return num_hit_items / k if k > 0 else 0

def recall_at_k(recommended_items_ids, liked_items_ids, k):
    recommended_k = set(recommended_items_ids[:k])
    num_hit_items = len(recommended_k.intersection(liked_items_ids))
    num_liked_items = len(liked_items_ids)
    return num_hit_items / num_liked_items if num_liked_items > 0 else 0

def f1_score_at_k(precision, recall):
    if (precision + recall) == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)


--- Phase 3: Evaluation and Metrics ---
Generated 'liked movies' sets for 671 users (threshold >= 4.0 stars).


In [14]:
print("\n--- Evaluating Content-Based and Hybrid Recommenders (using MovieLens IDs) ---")

test_movie_titles = ['Toy Story', 'Finding Nemo', 'The Shawshank Redemption', 'Pulp Fiction']
eval_results = []
K_value = 10

for title_of_base_movie in test_movie_titles:
    if title_of_base_movie not in title_to_index:
        print(f"'{title_of_base_movie}' not found in DataFrame index map. Skipping evaluation for this movie.")
        continue
        
    base_movie_movieId = df.loc[title_to_index[title_of_base_movie], 'movieId']
    relevant_users_for_base_movie = ratings[ratings['movieId'] == base_movie_movieId]['userId'].unique()
    
    if len(relevant_users_for_base_movie) == 0:
        print(f"No users rated '{title_of_base_movie}' (MovieID: {base_movie_movieId}) in the ratings dataset. Skipping content-based evaluation for this movie.")
        continue

    # Evaluate across a few relevant users for robustness (average their P/R/F1)
    users_to_evaluate_against = relevant_users_for_base_movie[:5] 
    
    for eval_user_id in users_to_evaluate_against:
        user_true_likes_movieIds = user_liked_movies.get(eval_user_id, set())

        if not user_true_likes_movieIds:
            # print(f"User {eval_user_id} has no liked movies (>= {RATING_THRESHOLD} stars) in metadata for '{title_of_base_movie}'. Skipping for this user.")
            continue # Skip users with no liked movies

        # --- Content-Based (Plot) ---
        rec_plot_df = get_recommendations_plot_refined(title_of_base_movie, df_movies=df, title_map=title_to_index)
        # Convert recommended *titles* from `rec_plot_df` to their *MovieLens movieIds* via `df_rated_movies`
        plot_recommended_movie_ids = df_rated_movies[df_rated_movies['title'].isin(rec_plot_df['title'])]['movieId'].tolist()

        p_plot = precision_at_k(plot_recommended_movie_ids, user_true_likes_movieIds, K_value)
        r_plot = recall_at_k(plot_recommended_movie_ids, user_true_likes_movieIds, K_value)
        f1_plot = f1_score_at_k(p_plot, r_plot)
        eval_results.append({
            'Movie': title_of_base_movie, 'Method': 'Content (Plot)', 'User': eval_user_id,
            'Precision@K': f"{p_plot:.4f}", 'Recall@K': f"{r_plot:.4f}", 'F1-Score@K': f"{f1_plot:.4f}",
            'Num_Liked': len(user_true_likes_movieIds)
        })

        # --- Content-Based (Metadata) ---
        rec_meta_df = get_recommendations_metadata_refined(title_of_base_movie, df_movies=df, title_map=title_to_index)
        meta_recommended_movie_ids = df_rated_movies[df_rated_movies['title'].isin(rec_meta_df['title'])]['movieId'].tolist()

        p_meta = precision_at_k(meta_recommended_movie_ids, user_true_likes_movieIds, K_value)
        r_meta = recall_at_k(meta_recommended_movie_ids, user_true_likes_movieIds, K_value)
        f1_meta = f1_score_at_k(p_meta, r_meta)
        eval_results.append({
            'Movie': title_of_base_movie, 'Method': 'Content (Metadata)', 'User': eval_user_id,
            'Precision@K': f"{p_meta:.4f}", 'Recall@K': f"{r_meta:.4f}", 'F1-Score@K': f"{f1_meta:.4f}",
            'Num_Liked': len(user_true_likes_movieIds)
        })
        
        # --- Hybrid (Plot + Weighted Rating) ---
        rec_hybrid_df = get_hybrid_recommendations_content_weighted(title_of_base_movie, content_source='plot', df_movies=df, title_map=title_to_index)
        hybrid_recommended_movie_ids = df_rated_movies[df_rated_movies['title'].isin(rec_hybrid_df['title'])]['movieId'].tolist()

        p_hybrid = precision_at_k(hybrid_recommended_movie_ids, user_true_likes_movieIds, K_value)
        r_hybrid = recall_at_k(hybrid_recommended_movie_ids, user_true_likes_movieIds, K_value)
        f1_hybrid = f1_score_at_k(p_hybrid, r_hybrid)
        eval_results.append({
            'Movie': title_of_base_movie, 'Method': 'Hybrid (Plot+WR)', 'User': eval_user_id,
            'Precision@K': f"{p_hybrid:.4f}", 'Recall@K': f"{r_hybrid:.4f}", 'F1-Score@K': f"{f1_hybrid:.4f}",
            'Num_Liked': len(user_true_likes_movieIds)
        })


--- Evaluating Content-Based and Hybrid Recommenders (using MovieLens IDs) ---


In [16]:
print("\n--- Evaluation Results Summary (K=10, Averaged per movie/method across relevant users) ---")
if eval_results:
    eval_df = pd.DataFrame(eval_results)
    avg_eval_df = eval_df.groupby(['Movie', 'Method']).agg(
        Precision=('Precision@K', lambda x: f"{pd.to_numeric(x).mean():.4f}"),
        Recall=('Recall@K', lambda x: f"{pd.to_numeric(x).mean():.4f}"),
        F1_Score=('F1-Score@K', lambda x: f"{pd.to_numeric(x).mean():.4f}"),
        Avg_Num_Liked=('Num_Liked', 'mean')
    ).reset_index()
    print(avg_eval_df.to_string())
else:
    print("No evaluation results were generated. Check if test movies are correctly mapped or if users have liked movies.")




--- Evaluation Results Summary (K=10, Averaged per movie/method across relevant users) ---
                       Movie              Method Precision  Recall F1_Score  Avg_Num_Liked
0               Finding Nemo  Content (Metadata)    0.0200  0.0005   0.0010          116.2
1               Finding Nemo      Content (Plot)    0.0000  0.0000   0.0000          116.2
2               Finding Nemo    Hybrid (Plot+WR)    0.0000  0.0000   0.0000          116.2
3               Pulp Fiction  Content (Metadata)    0.0400  0.0058   0.0101           64.0
4               Pulp Fiction      Content (Plot)    0.0000  0.0000   0.0000           64.0
5               Pulp Fiction    Hybrid (Plot+WR)    0.0000  0.0000   0.0000           64.0
6   The Shawshank Redemption  Content (Metadata)    0.0200  0.0029   0.0051           37.2
7   The Shawshank Redemption      Content (Plot)    0.0000  0.0000   0.0000           37.2
8   The Shawshank Redemption    Hybrid (Plot+WR)    0.0000  0.0000   0.0000           37.