# Project Overview

# Actually, I used an SVD model, not a hybrid model! Need to update the above!

# Datasets

This project utilizes the MovieLens dataset, a widely used dataset in the field of recommender systems, containing movie ratings and metadata.  The data was obtained at https://www.kaggle.com/datasets/parasharmanas/movie-recommendation-system/data.  There are two data files. The first contains the unique movie Id, movie title and a list of genres the movie falls into.  The second consists of user ratings with the user ID, movie ID, rating and timestap of when the review was made.

# REVISED CODE

In [None]:
# conda install joblib

In [None]:
## Optimized Code

## Import Libraries

## Load and Merge Data

In [None]:
# Load and merge movies and ratings data with essential columns and a subset of ratings
def load_and_merge_data(movies_file, ratings_file):
    try:
        movies_df = pd.read_csv(movies_file, usecols=['movieId', 'title', 'genres'])
        ratings_df = pd.read_csv(ratings_file, usecols=['userId', 'movieId', 'rating'])

        # Remove duplicates
        movies_df.drop_duplicates(subset=['movieId'], inplace=True)
        ratings_df.drop_duplicates(subset=['userId', 'movieId'], inplace=True)

        # Handle missing values
        movies_df.fillna({'genres': '(no genres listed)'}, inplace=True)
        ratings_df.dropna(subset=['userId', 'movieId', 'rating'], inplace=True)

        merged_df = pd.merge(ratings_df, movies_df, on='movieId')
        logging.info("Data successfully merged and cleaned.")
        return merged_df
    except FileNotFoundError as e:
        logging.error(f"Error: {e}. File not found. Check the file paths.")
        return None
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
        return None


In [None]:
# CODE TO RUN

# Initial EDA

In [None]:
movies_df.info()

In [None]:
movies_df.describe(include='all')

Movies Dataframe Summary:
- There are 9,742 unique movies.
- The genres column has 951 unique genre combinations, with 'Drama' being the most frequent.  The number of genres does not appear resaonable, so will conduct further analysis on.
- The movieId ranges from 1 to 193,609, indicating a broad and possibly sparse numbering system, as there are only 9,737 unique movie titles.

In [None]:
movies_df.isna().sum()

There are no missing values in the movies dataset.

In [None]:
movies_df.duplicated().sum()

There are no duplicate rows in the movies dataset.

In [None]:
ratings_df.info()

In [None]:
ratings_df.describe(include='all')

Ratings Dataframe Summary:
- Contains 100,836 ratings.
- userId ranges from 1 to 610, indicating 610 unique users.
- Ratings range from 0.5 to 5.0, in increments of 0.5.
- The average rating is approximately 3.50.
- timestamp is an integer representing the time the user rating was made.

I will now merge the two datasets which will allow us to see which user rated which movie, along with the movie's title and genres.

In [None]:
# Merge the two df's on movieId
df = movies_df.merge(ratings_df, on='movieId')

df.shape

In [None]:
df.head()

In [None]:
df.isna().sum()

In [None]:
df.info()

In [None]:
df.describe(include='all')

Combined Dataset Statistics:
- The dataset contains 100,836 ratings for 9,719 unique movies.
- The genres column has 951 unique genre combinations, with 'Comedy' being the most frequent.
- Ratings range from 0.5 to 5.0, with an average of approximately 3.50.

In [None]:
df.isna().sum()

There are no missing values in the combined dataset

In [None]:
df.duplicated().sum()

## Univariate Analysis

In [None]:
# Plot ratings distribution
plt.figure(figsize=(10, 6))
sns.countplot(x='rating', hue='rating', data=df, legend=False, palette='muted'
             )
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


The distribution of ratings shows that:
- Ratings are discrete, in increments of 0.5.
- The most common ratings are around 3.0 to 4.0, indicating a tendency towards higher ratings.
- The extreme ratings (0.5 and 5.0) are less common, suggesting that users are generally moderate in their assessments.

In [None]:
# split genres by | and add as a list
df['genres'] = df['genres'].apply(lambda x:x.split('|'))
                                  
df.head()

# Break out genres included in list and determine count of each
import matplotlib.pyplot as plt

# Explode the genres column to have separate row for each genre
exploded_genres = df.explode('genres')

# Count the occurrences of each genre
genre_counts = exploded_genres['genres'].value_counts()
genre_counts

# Plot the genre frequencies

g = sns.catplot(genre_counts, kind='bar', palette='muted', height=5, aspect=2)
g.fig.suptitle('Frequency of Movie Genres')
plt.xlabel('Genre')
plt.ylabel('Number of Movies')
plt.xticks(rotation=45)
plt.show()

## Bivariate Analysis

In [None]:
import matplotlib.pyplot as plt

# Assuming 'ratings_df' is your DataFrame
user_rating_counts = ratings_df['userId'].value_counts()
movie_rating_counts = ratings_df['movieId'].value_counts()

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))

# Plot for number of ratings per user
axes[0].hist(user_rating_counts, bins=30, edgecolor='black')
axes[0].set_title('Number of Ratings per User')
axes[0].set_xlabel('Number of Ratings')
axes[0].set_ylabel('Number of Users')

# Plot for number of ratings per movie
axes[1].hist(movie_rating_counts, bins=30, edgecolor='black')
axes[1].set_title('Number of Ratings per Movie')
axes[1].set_xlabel('Number of Ratings')
axes[1].set_ylabel('Number of Movies')

plt.tight_layout()
plt.show()



In [None]:
# Average Rating per User

import matplotlib.pyplot as plt
import seaborn as sns

# Calculate average rating per user
user_avg_ratings = ratings_df.groupby('userId')['rating'].mean()

plt.figure(figsize=(10, 6))
sns.histplot(user_avg_ratings, bins=30, kde=True)
plt.title('Average Rating per User')
plt.xlabel('Average Rating')
plt.ylabel('Frequency')
plt.show()


# Distribution of Ratings across genres


## Need to add

In [None]:
# Ratings over Time
ratings_df['datetime'] = pd.to_datetime(ratings_df['timestamp'], unit='s')
ratings_df['year'] = ratings_df['datetime'].dt.year
ratings_df.groupby('year')['rating'].mean().plot(kind='line')
plt.title('Average Ratings Over Years')
plt.xlabel('Year')
plt.ylabel('Average Rating')
plt.show()


In [None]:
# Heatmap of Ratings Over Time

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'timestamp' is a UNIX timestamp in 'ratings_df'
ratings_df['date'] = pd.to_datetime(ratings_df['timestamp'], unit='s')
ratings_df['year'] = ratings_df['date'].dt.year
ratings_df['month'] = ratings_df['date'].dt.month

# Pivot table to prepare data for heatmap
rating_pivot = ratings_df.pivot_table(values='rating', index='month', columns='year', aggfunc='mean')

plt.figure(figsize=(12, 8))
sns.heatmap(rating_pivot, annot=True, fmt=".1f", cmap='coolwarm')
plt.title('Average Monthly Ratings Over the Years')
plt.xlabel('Year')
plt.ylabel('Month')
plt.show()



In [None]:
df.head()

## Preprocess Data

In [None]:
# Preprocess data by adding one-hot encoding for genres
def preprocess_data(df):
    if 'genres' in df.columns:
        df['genres'] = df['genres'].replace('', '(no genres listed)')
        genres_dummies = df['genres'].str.get_dummies(sep='|')
        df = pd.concat([df, genres_dummies], axis=1)
        logging.info("Genre-based features added.")
    else:
        logging.info("Genres column not available for processing.")
    return df

# Add user-genre interaction features to the dataset
def add_user_genre_features(df):
    if 'userId' in df.columns:
        # Select only genre columns for aggregation
        genre_columns = [col for col in df.columns if col not in ('user_mean_', 'title', 'movieId', 'userId', 'rating') and df[col].dtype in [np.float64, np.int64]]

        # Compute mean genre features per user
        user_genre_means = df.groupby(['userId'])[genre_columns].mean()
        user_genre_means.columns = [f'user_mean_{col}' for col in user_genre_means.columns]

        # Merge the new features back into the main DataFrame
        df = pd.merge(df, user_genre_means, on='userId', how='left')
        logging.info("User-genre interaction features added.")
    else:
        logging.info("UserId column not available for interaction features.")
    return df

## Additional EDA

In [None]:
from joblib import Parallel, delayed
from sklearn.neighbors import NearestNeighbors
import logging

# Helper function for hybrid recommendations per user
def hybrid_user_recommendation_worker(user_id, df, svd_model, movie_indices, distances, indices, weight_svd, weight_content, n):
    user_recommendations = []

    try:
        for idx in movie_indices:
            sim_scores = list(enumerate(distances[idx]))
            sorted_indices = np.argsort([x[1] for x in sim_scores])[::-1][1:11]

            movie_indices_list = [indices[idx][i] for i in sorted_indices if indices[idx][i] in df.index]

            sim_movies = df['movieId'].iloc[movie_indices_list]
            svd_recs = [svd_model.predict(user_id, iid) for iid in sim_movies]

            recommendations = [
                (
                    iid,
                    weight_svd * est.est + weight_content * distances[idx][movie_indices_list.index(iid)]
                )
                for iid, est in zip(sim_movies, svd_recs)
                if iid in movie_indices_list
            ]

            recommendations = sorted(recommendations, key=lambda x: x[1], reverse=True)

            user_recommendations.extend(recommendations[:n])

        user_recommendations = sorted(user_recommendations, key=lambda x: x[1], reverse=True)[:n]

        return [(user_id, rec[0], rec[1]) for rec in user_recommendations]

    except ValueError as e:
        logging.error(f"Error in hybrid_user_recommendation_worker: {e}")
        return []

# Function to calculate sparse cosine similarity
def calculate_sparse_cosine_similarity(tfidf_matrix, n_neighbors=10):
    logging.info("Calculating sparse cosine similarity...")
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, metric='cosine', algorithm='brute').fit(tfidf_matrix)
    distances, indices = nbrs.kneighbors(tfidf_matrix)
    logging.info("Sparse cosine similarity calculated.")
    return distances, indices

# Hybrid recommendation function combining SVD and content-based filtering
def hybrid_recommendation_to_screen(df, svd_model, test_users, weight_svd=0.7, weight_content=0.3, n=5, batch_size=2000):
    movie_indices = pd.Series(df.index, index=df['title']).drop_duplicates()

    tfidf = TfidfVectorizer(max_features=1000, stop_words='english')  # Increase vocabulary size to 1000
    tfidf_matrix = tfidf.fit_transform(df['genres'].fillna(''))
    distances, indices = calculate_sparse_cosine_similarity(tfidf_matrix, n_neighbors=10)

    all_recommendations = []

    logging.info(f"Processing users in batches of {batch_size}...")

    # Process users in batches
    for i in range(0, len(test_users), batch_size):
        batch_users = list(test_users)[i:i + batch_size]

        # Set n_jobs to -1 to use all processors
        batch_recommendations = Parallel(n_jobs=-1)(
            delayed(hybrid_user_recommendation_worker)(
                user_id, df, svd_model, movie_indices, distances, indices, weight_svd, weight_content, n
            )
            for user_id in batch_users
        )
        all_recommendations.extend([item for sublist in batch_recommendations for item in sublist])

    # Print recommendations to screen
    for rec in all_recommendations:
        print(f"User: {rec[0]}, Movie: {rec[1]}, Score: {rec[2]:.4f}")


## Train and Evaluate SVD Model

In [None]:
'''

# Train and evaluate the SVD model using fixed parameters
def train_and_evaluate_svd(df):
    reader = Reader(rating_scale=(0.5, 5.0))
    data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

    trainset, testset = train_test_split(data, test_size=0.25, random_state=RANDOM_SEED)

    svd_params = {
        'n_factors': [150, 200, 250],  # Increase factors for better training
        'n_epochs': [30, 40, 50],   # Increase epochs
        'lr_all': 0.005,
        'reg_all': 0.1,
        'biased': True,
        'random_state': RANDOM_SEED  # Add random seed for reproducibility
    }

    model = SVD(**svd_params)
    model.fit(trainset)

    predictions = model.test(testset)
    rmse = accuracy.rmse(predictions)
    mae = accuracy.mae(predictions)

    print(f"Best Model (RMSE): RMSE={rmse:.4f}, MAE={mae:.4f}")
    print(f"Best Parameters: {svd_params}")

    return model, testset
    
'''

In [None]:
'''

best_svd_model, svd_testset = train_and_evaluate_svd(augmented_ratings_df)

'''

In [None]:
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split, RandomizedSearchCV
import logging

RANDOM_SEED = 42

# Train and evaluate the SVD model using randomized search
def train_and_evaluate_svd(df, threshold=4.0, n=5):
    reader = Reader(rating_scale=(0.5, 5.0))
    data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

    trainset, testset = train_test_split(data, test_size=0.25, random_state=RANDOM_SEED)

    param_grid = {
        'n_factors': [150],  # Adjust for complexity
        'n_epochs': [100],  # Adjust for learning duration
        'lr_all': [0.01],
        'reg_all': [0.1],
        'biased': [True]
    }

    randomized_search = RandomizedSearchCV(
        SVD,
        param_distributions=param_grid,
        n_iter=10,
        random_state=RANDOM_SEED,
        measures=['rmse', 'mae'],
        cv=3,
        refit=True,
        n_jobs=-1
    )

    randomized_search.fit(data)

    best_params = randomized_search.best_params['rmse']
    best_model = randomized_search.best_estimator['rmse']

    predictions = best_model.test(testset)
    rmse = accuracy.rmse(predictions)
    mae = accuracy.mae(predictions)

    logging.info(f"Best Model (RMSE): RMSE={rmse:.4f}, MAE={mae:.4f}")
    logging.info(f"Best Parameters: {best_params}")

    print(f"Best Model (RMSE): RMSE={rmse:.4f}, MAE={mae:.4f}")
    print(f"Best Parameters: {best_params}")

    return best_model, testset


## Calculate Metrics

In [None]:
# Calculate recommendation metrics (precision, recall, and F1-score)
def calculate_recommendation_metrics(predictions, threshold=4.0, n=5):
    def get_top_n(predictions, n=5):
        top_n = defaultdict(list)
        for uid, iid, true_r, est, _ in predictions:
            top_n[uid].append((iid, est, true_r))
        for uid, user_ratings in top_n.items():
            user_ratings.sort(key=lambda x: x[1], reverse=True)
            top_n[uid] = user_ratings[:n]
        return top_n

    top_n = get_top_n(predictions, n)
    y_true, y_pred = [], []

    for uid, user_ratings in top_n.items():
        for iid, est_rating, true_rating in user_ratings:
            y_true.append(1 if true_rating >= threshold else 0)
            y_pred.append(1 if est_rating >= threshold else 0)

    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)

    print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1:.4f}")

    return precision, recall, f1


In [None]:
# pip install memory_profiler


# OLD CODE

In [None]:
import pandas as pd
import numpy as np
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import RandomizedSearchCV, cross_validate, KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics import precision_score, recall_score, f1_score, mean_squared_error, mean_absolute_error
from collections import defaultdict
from joblib import Parallel, delayed
import matplotlib.pyplot as plt
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)

# Constants for easier configuration
MAX_RATINGS = 50000
TFIDF_MAX_FEATURES = 100
TOP_N = 5
THRESHOLD = 4.0
MIN_SCORE = 4.0
SVD_PARAM_GRID = {
    'n_factors': [40, 50, 60],
    'n_epochs': [30, 40, 50],
    'lr_all': [0.005, 0.007, 0.01],
    'reg_all': [0.1, 0.2]
}
SVD_FIXED_PARAMS = {'biased': [True]}
K_VALUES = list(range(1, 26))

# Function to load and merge data
def load_and_merge_data(movies_file, ratings_file, max_ratings=MAX_RATINGS):
    try:
        movies_df = pd.read_csv(movies_file, usecols=['movieId', 'title', 'genres'])
        ratings_df = pd.read_csv(ratings_file, usecols=['userId', 'movieId', 'rating']).head(max_ratings)

        movies_df.drop_duplicates(subset=['movieId'], inplace=True)
        ratings_df.drop_duplicates(subset=['userId', 'movieId'], inplace=True)

        movies_df.fillna({'genres': '(no genres listed)'}, inplace=True)
        ratings_df.dropna(subset=['userId', 'movieId', 'rating'], inplace=True)

        merged_df = pd.merge(ratings_df, movies_df, on='movieId')
        logging.info("Data successfully merged and cleaned.")
        return merged_df
    except FileNotFoundError as e:
        logging.error(f"Error: {e}. File not found. Check the file paths.")
        return None
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
        return None

# Preprocess data by adding one-hot encoding for genres
def preprocess_data(df):
    if 'genres' in df.columns:
        df['genres'] = df['genres'].replace('', '(no genres listed)')
        genres_dummies = df['genres'].str.get_dummies(sep='|')
        df = pd.concat([df, genres_dummies], axis=1)
        logging.info("Genre-based features added.")
    else:
        logging.error("Genres column not available for processing.")
    return df

# Add user-genre interaction features to the dataset
def add_user_genre_features(df):
    if 'userId' in df.columns:
        genre_columns = [col for col in df.columns if col not in ('user_mean_', 'title', 'movieId', 'userId', 'rating') and df[col].dtype in [np.float64, np.int64]]
        
        user_genre_means = df.groupby(['userId'])[genre_columns].mean()
        user_genre_means.columns = [f'user_mean_{col}' for col in user_genre_means.columns]
        
        df = pd.merge(df, user_genre_means, on='userId', how='left')
        logging.info("User-genre interaction features added.")
    else:
        logging.error("UserId column not available for interaction features.")
    return df

# Function to calculate recommendation metrics
def calculate_recommendation_metrics(predictions, user_rated_movies, threshold=THRESHOLD, n=TOP_N):
    def get_top_n(predictions, n=5):
        top_n = defaultdict(list)
        for uid, iid, true_r, est, _ in predictions:
            if iid not in user_rated_movies.get(uid, []):
                top_n[uid].append((iid, est, true_r))
        for uid, user_ratings in top_n.items():
            user_ratings.sort(key=lambda x: x[1], reverse=True)
            top_n[uid] = [rating for rating in user_ratings if rating[1] >= threshold][:n]
        return top_n

    top_n = get_top_n(predictions, n)
    y_true, y_pred = [], []

    for uid, user_ratings in top_n.items():
        for iid, est_rating, true_rating in user_ratings:
            y_true.append(1 if true_rating >= threshold else 0)
            y_pred.append(1 if est_rating >= threshold else 0)

    if not y_true or not y_pred:
        return 0.0, 0.0, 0.0

    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)

    return precision, recall, f1

# Function to print evaluation metrics
def print_evaluation_metrics(predictions, user_rated_movies, threshold=THRESHOLD, n=TOP_N):
    precision, recall, f1 = calculate_recommendation_metrics(predictions, user_rated_movies, threshold, n)

    # Filter out seen movies for RMSE and MAE
    filtered_predictions = [
        (uid, iid, true_r, est)
        for uid, iid, true_r, est, _ in predictions
        if iid not in user_rated_movies.get(uid, []) and est >= threshold
    ]
    if not filtered_predictions:
        logging.warning("No unseen predictions available for evaluation metrics.")
        print("No unseen predictions available for evaluation metrics.")
        return

    true_ratings = [true_r for _, _, true_r, _ in filtered_predictions]
    estimated_ratings = [est_r for _, _, _, est_r in filtered_predictions]

    rmse = np.sqrt(mean_squared_error(true_ratings, estimated_ratings))
    mae = mean_absolute_error(true_ratings, estimated_ratings)

    logging.info(f"RMSE: {rmse:.4f}")
    logging.info(f"MAE: {mae:.4f}")
    logging.info(f"Precision: {precision:.4f}")
    logging.info(f"Recall: {recall:.4f}")
    logging.info(f"F1-Score: {f1:.4f}")

    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")

# Train and evaluate SVD model with cross-validation
def train_and_evaluate_svd(df):
    reader = Reader(rating_scale=(0.5, 5.0))
    data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

    param_grid = {
        'n_factors': [40, 50, 60],
        'n_epochs': [30, 40, 50],
        'lr_all': [0.005, 0.007, 0.01],
        'reg_all': [0.1, 0.2]
    }

    random_search = RandomizedSearchCV(
        SVD,
        param_distributions=param_grid,
        n_iter=10,
        measures=['rmse', 'mae'],
        cv=5,  # 5-Fold Cross-Validation
        refit=True,
        random_state=42,
        n_jobs=-1
    )

    random_search.fit(data)

    best_params = random_search.best_params['rmse']
    best_model = random_search.best_estimator['rmse']

    # Evaluate with cross-validation
    kf = KFold(n_splits=5)
    results = cross_validate(best_model, data, measures=['rmse', 'mae'], cv=kf, verbose=True)

    logging.info(f"Cross-Validation RMSE: {np.mean(results['test_rmse']):.4f}")
    logging.info(f"Cross-Validation MAE: {np.mean(results['test_mae']):.4f}")
    logging.info(f"Best Parameters: {best_params}")

    trainset = data.build_full_trainset()
    best_model.fit(trainset)

    testset = trainset.build_testset()
    predictions = best_model.test(testset)

    return best_model, predictions

# Hybrid recommendation function combining SVD and content-based filtering
def hybrid_user_recommendation_worker(user_id, df, svd_model, cosine_sim, movie_indices, weight_svd, weight_content, n, min_score):
    svd_recs = [svd_model.predict(user_id, iid) for iid in df['movieId']]
    
    # Ensure unique recommendations per user
    svd_recs = sorted(svd_recs, key=lambda x: x.est, reverse=True)
    svd_recs = list({rec.iid: rec for rec in svd_recs if rec.est >= min_score}.values())

    user_indices = df[df['userId'] == user_id].index
    content_scores = []
    for user_idx in user_indices:
        sim_scores = list(enumerate(cosine_sim[user_idx]))
        sim_scores_sorted = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        content_scores.extend([(df['movieId'][idx], score) for idx, score in sim_scores_sorted[:n]])

    recommendations = []
    seen_movies = set(df[df['userId'] == user_id]['movieId'])  # Movies already rated by the user
    for iid, est_rating in [(rec.iid, rec.est) for rec in svd_recs if rec.iid not in seen_movies]:
        content_score = next((score for mid, score in content_scores if mid == iid), 0)
        combined_score = weight_svd * est_rating + weight_content * content_score
        recommendations.append((iid, combined_score))

    recommendations = sorted(recommendations, key=lambda x: x[1], reverse=True)[:n]
    return [(user_id, rec[0], rec[1]) for rec in recommendations]

# Hybrid recommendation function combining SVD and content-based filtering
def hybrid_recommendation_to_screen(df, svd_model, test_users, weight_svd=0.7, weight_content=0.3, n=TOP_N, min_score=MIN_SCORE, batch_size=2000):
    movie_indices = pd.Series(df.index, index=df['movieId']).drop_duplicates()

    tfidf = TfidfVectorizer(max_features=TFIDF_MAX_FEATURES, stop_words='english')
    tfidf_matrix = tfidf.fit_transform(df['genres'].fillna(''))
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

    all_recommendations = []

    for i in range(0, len(test_users), batch_size):
        batch_users = list(test_users)[i:i + batch_size]
        batch_recommendations = Parallel(n_jobs=-1)(
            delayed(hybrid_user_recommendation_worker)(
                user_id, df, svd_model, cosine_sim, movie_indices, weight_svd, weight_content, n, min_score
            )
            for user_id in batch_users
        )
        all_recommendations.extend([item for sublist in batch_recommendations for item in sublist])

    recommendations_df = pd.DataFrame(all_recommendations, columns=['UserID', 'MovieID', 'Score'])
    
    # Assign unique actual ratings per user
    def assign_actual_rating(row):
        actual_ratings = df[(df['userId'] == row['UserID']) & (df['movieId'] == row['MovieID'])]['rating'].values
        return actual_ratings[0] if len(actual_ratings) > 0 else None

    recommendations_df = recommendations_df.merge(df[['movieId', 'title']].drop_duplicates(), left_on='MovieID', right_on='movieId', how='left')
    recommendations_df['Actual Rating'] = recommendations_df.apply(assign_actual_rating, axis=1)
    recommendations_df.drop(columns=['movieId'], inplace=True)
    return recommendations_df

# Print top N recommendations for sample users
def print_top_n_recommendations(recommendations_df, n=TOP_N):
    unique_users = recommendations_df['UserID'].unique()
    for user_id in unique_users[:5]:
        user_recs = recommendations_df[recommendations_df['UserID'] == user_id]
        print(f"\nTop {n} Recommendations for User {user_id}:")
        for _, row in user_recs.head(n).iterrows():
            print(f"  MovieID: {int(row['MovieID'])}, Title: {row['title']}, Recommended Score: {row['Score']:.2f}, Actual Rating: {row['Actual Rating']}")

# Plot RMSE, MAE, Precision, Recall, and F1-Score at different K-values
def plot_metrics_at_k(svd_predictions, k_values, threshold=THRESHOLD):
    def get_top_n(predictions, n):
        """Return top-N predictions for each user"""
        top_n = defaultdict(list)
        for uid, iid, true_r, est, _ in predictions:
            top_n[uid].append((iid, est, true_r))
        for uid, user_ratings in top_n.items():
            user_ratings.sort(key=lambda x: x[1], reverse=True)
            top_n[uid] = user_ratings[:n]
        return top_n

    rmse_values, mae_values = [], []
    precision_values, recall_values, f1_values = [], [], []

    for k in k_values:
        top_n_predictions = get_top_n(svd_predictions, k)

        # Prepare y_true and y_pred lists
        y_true, y_pred = [], []
        for uid, user_ratings in top_n_predictions.items():
            for _, est_rating, true_rating in user_ratings:
                y_true.append(1 if true_rating >= threshold else 0)
                y_pred.append(1 if est_rating >= threshold else 0)

        # Calculate Precision, Recall, and F1-Score
        precision = precision_score(y_true, y_pred, zero_division=0)
        recall = recall_score(y_true, y_pred, zero_division=0)
        f1 = f1_score(y_true, y_pred, zero_division=0)

        precision_values.append(precision)
        recall_values.append(recall)
        f1_values.append(f1)

        # Calculate RMSE and MAE using the original prediction results
        top_actual_ratings = [true_r for _, _, true_r in sum(top_n_predictions.values(), [])]
        top_estimated_ratings = [est_r for _, est_r, _ in sum(top_n_predictions.values(), [])]

        rmse_values.append(np.sqrt(mean_squared_error(top_actual_ratings, top_estimated_ratings)))
        mae_values.append(mean_absolute_error(top_actual_ratings, top_estimated_ratings))

    # Plot RMSE and MAE
    fig, ax1 = plt.subplots(figsize=(10, 6))
    ax1.plot(k_values, rmse_values, label='RMSE', color='tab:red')
    ax1.plot(k_values, mae_values, label='MAE', color='tab:orange')
    ax1.set_xlabel('K-Value')
    ax1.set_ylabel('RMSE/MAE', color='tab:red')
    ax1.legend(loc='upper left')
    plt.title('RMSE and MAE at Different K-Values')
    plt.show()

    # Plot Precision, Recall, and F1-Score
    fig, ax2 = plt.subplots(figsize=(10, 6))
    ax2.plot(k_values, precision_values, label='Precision', color='tab:blue')
    ax2.plot(k_values, recall_values, label='Recall', color='tab:green')
    ax2.plot(f1_values, label='F1-Score', color='tab:purple')
    ax2.set_xlabel('K-Value')
    ax2.set_ylabel('Precision/Recall/F1-Score', color='tab:blue')
    ax2.legend(loc='upper right')
    plt.title('Precision, Recall, and F1-Score at Different K-Values')
    plt.show()

# Example Usage
movies_file = '../data/movies.csv'
ratings_file = '../data/ratings.csv'

augmented_ratings_df = load_and_merge_data(movies_file, ratings_file, max_ratings=MAX_RATINGS)

if augmented_ratings_df is not None:
    augmented_ratings_df = preprocess_data(augmented_ratings_df)
    augmented_ratings_df = add_user_genre_features(augmented_ratings_df)

    best_svd_model, svd_predictions = train_and_evaluate_svd(augmented_ratings_df)

    # Create a dictionary of already rated movies per user
    user_rated_movies = defaultdict(list)
    for row in augmented_ratings_df.itertuples():
        user_rated_movies[row.userId].append(row.movieId)

    # Get prediction results and evaluate
    print_evaluation_metrics(svd_predictions, user_rated_movies, threshold=THRESHOLD, n=TOP_N)

    # Extract unique test users from predictions
    test_users = {pred[0] for pred in svd_predictions}

    # Generate hybrid recommendations and print them to screen
    recommendations_df = hybrid_recommendation_to_screen(augmented_ratings_df, best_svd_model, test_users, n=TOP_N)

    # Print top 5 recommendations for 5 sample users
    print_top_n_recommendations(recommendations_df, n=TOP_N)

    # Plot RMSE, MAE, Precision, Recall, and F1-Score at Different K-Values
    plot_metrics_at_k(svd_predictions, K_VALUES)
else:
    logging.error("Data loading failed, model training not performed.")


# Last code that run


In [None]:
import pandas as pd
import numpy as np
import logging
import re
from collections import defaultdict
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, precision_score, recall_score, f1_score
import random
from scipy.spatial.distance import pdist, squareform

def load_and_merge_data(movies_file, ratings_file, max_ratings=None):
    try:
        movies_df = pd.read_csv(movies_file, usecols=['movieId', 'title', 'genres'])
        ratings_df = pd.read_csv(ratings_file, usecols=['userId', 'movieId', 'rating']).head(max_ratings)

        # Drop duplicates if any exist
        movies_df.drop_duplicates(subset='movieId', inplace=True)
        ratings_df.drop_duplicates(subset=['userId', 'movieId'], inplace=True)

        # Perform merge
        merged_df = pd.merge(ratings_df, movies_df, on='movieId')

        # Create a unique key by combining userId and movieId
        merged_df['uniqueId'] = merged_df['userId'].astype(str) + "_" + merged_df['movieId'].astype(str)

        logging.info("Data loaded and merged successfully.")
        return merged_df, movies_df
    except FileNotFoundError as e:
        logging.error(f"Error: {e}. File not found. Check the file paths.")
        return None, None
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
        return None, None

def extract_release_year(title):
    match = re.search(r'\((\d{4})\)', title)
    if match:
        return int(match.group(1))
    return None

def preprocess_data(df):
    if 'title' in df.columns:
        df['release_year'] = df['title'].apply(extract_release_year)
        df['release_year'].fillna(0, inplace=True)  # Fill NaN with 0 for release year
        logging.info("Release year feature added.")
    else:
        logging.error("Title column not available for processing.")
    
    if 'genres' in df.columns:
        # Split genres by | and add as a list
        df['genres'] = df['genres'].apply(lambda x: x.split('|'))
        
        # Explode the genres column to have a separate row for each genre
        exploded_genres = df.explode('genres')
        
        # One-hot encode the genres
        genres_dummies = pd.get_dummies(exploded_genres['genres'], prefix='genre')
        
        # Concatenate the original DataFrame with the one-hot encoded genres
        df = pd.concat([exploded_genres, genres_dummies], axis=1)
        
        # Drop the original genres column
        df.drop(columns=['genres'], inplace=True)
        
        # Drop duplicate rows
        df = df.drop_duplicates(subset=['userId', 'movieId', 'rating', 'title', 'uniqueId', 'release_year']).reset_index(drop=True)
        
        logging.info("Genre-based features added.")
    else:
        logging.error("Genres column not available for processing.")
    
    logging.info(f"Data after preprocessing: {df.head()}")
    return df

def add_user_genre_features(df):
    if 'userId' in df.columns:
        # Ensure only numeric columns are selected for aggregation
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        genre_columns = [col for col in numeric_cols if col.startswith('genre_')]
        
        logging.info(f"Genre columns used for user genre features: {genre_columns}")
        
        if genre_columns:  # Check if there are genre columns to process
            user_genre_means = df.groupby(['userId'])[genre_columns].mean()
            user_genre_means.columns = [f'user_mean_{col}' for col in user_genre_means.columns]
            
            df = pd.merge(df, user_genre_means, on='userId', how='left')
            logging.info("User-genre interaction features added.")
        else:
            logging.warning("No genre columns found for user genre features.")
    else:
        logging.error("UserId column not available for interaction features.")
    
    logging.info(f"Data after adding user genre features: {df.head()}")
    return df

def get_item_features(df):
    # Select only numeric columns for aggregation
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    feature_columns = [col for col in numeric_cols if col not in ['userId', 'movieId', 'rating']]

    logging.info(f"Numeric feature columns used for item features: {feature_columns}")

    item_features = df.groupby('movieId')[feature_columns].mean()

    for movie_id, features in item_features.iterrows():
        if features.shape[0] != len(feature_columns):
            logging.warning(f"Movie {movie_id} has an unexpected feature vector shape: {features.shape}")
    
    return item_features.to_dict('index')

def derive_user_preferences(df):
    preferences = {}
    for user_id, group in df.groupby('userId'):
        preferred_movies = group[group['rating'] > 4]['movieId'].tolist()
        preferences[user_id] = {'preferred_movies': preferred_movies}
    return preferences

def generate_recommendations(df, movies_df, item_features, n=5):
    recommendations = {}
    all_movie_ids = df['movieId'].unique()
    
    for user in df['userId'].unique():
        user_rated_movies = df[df['userId'] == user]['movieId'].unique()
        candidate_movies = np.setdiff1d(all_movie_ids, user_rated_movies)
        candidate_movies_with_features = [movie for movie in candidate_movies if movie in item_features]
        predictions = [(movie, np.random.uniform(3, 5)) for movie in candidate_movies_with_features]
        top_recommendations = sorted(predictions, key=lambda x: x[1], reverse=True)[:n]
        
        if len(top_recommendations) < n:
            popular_movies = movies_df['movieId'].value_counts().index.tolist()
            fallback_movies = [(movie, np.random.uniform(3, 5)) for movie in popular_movies if movie in item_features and movie not in user_rated_movies and movie not in [rec[0] for rec in top_recommendations]]
            top_recommendations.extend(fallback_movies[:n - len(top_recommendations)])
        
        logging.info(f"User {user} recommendations: {[rec[0] for rec in top_recommendations]}")
        recommendations[user] = top_recommendations
    
    return recommendations

from scipy.spatial.distance import pdist, squareform

def calculate_intra_list_similarity(recommendations, item_features):
    diversity_scores = []

    for user, items in recommendations.items():
        features = []
        missing_features = []
        
        for item in items:
            movie_id = item[0]
            if movie_id in item_features:
                features.append(list(item_features[movie_id].values()))  # Ensure we are appending numerical values
            else:
                missing_features.append(movie_id)
        
        if missing_features:
            logging.warning(f"Missing features for movies {missing_features} for user {user}")

        features_array = np.array(features)
        
        if features_array.ndim == 1:
            features_array = features_array.reshape(1, -1)
        
        if features_array.shape[0] > 1:
            distances = pdist(features_array, 'cosine')
            diversity_scores.append(1 - np.mean(squareform(distances)))
        else:
            logging.warning(f"Not enough features for user {user} with feature array shape: {features_array.shape}. Skipping distance calculation.")
            logging.info(f"User {user} has feature array: {features_array}")
    
    return np.mean(diversity_scores) if diversity_scores else 0

def display_metrics(recommendations, catalog_size, item_popularity, expected_recommendations, user_preferences, item_features):
    coverage = calculate_coverage(recommendations, catalog_size)
    novelty = calculate_novelty(recommendations, item_popularity)
    personalization = calculate_personalization(recommendations)
    serendipity = calculate_serendipity(recommendations, expected_recommendations, user_preferences)
    intra_list_diversity = calculate_intra_list_similarity(recommendations, item_features)

    print(f"Catalog Coverage: {coverage:.2%}")
    print(f"Average Novelty: {novelty:.4f}")
    print(f"Personalization: {personalization:.3f}")
    print(f"Serendipity: {serendipity:.3f}")
    print(f"Intra-list Diversity: {intra_list_diversity:.3f}")

def calculate_coverage(recommendations, catalog_size):
    recommended_items = set()
    for user, items in recommendations.items():
        recommended_items.update([item[0] for item in items])
    coverage = len(recommended_items) / catalog_size
    return coverage

def calculate_novelty(recommendations, item_popularity):
    novelty_scores = []
    for user, items in recommendations.items():
        for item_id, _ in items:
            popularity = item_popularity.get(item_id, 0)
            if popularity > 0:
                novelty_scores.append(-np.log(popularity))
            else:
                novelty_scores.append(0)  # Handle items with no prior recommendations
    return np.mean(novelty_scores) if novelty_scores else 0

def calculate_personalization(recommendations):
    all_items = set()
    for items in recommendations.values():
        all_items.update([item[0] for item in items])
    
    item_list_mapping = {item: idx for idx, item in enumerate(all_items)}
    user_item_matrix = []

    for items in recommendations.values():
        item_vec = [0] * len(all_items)
        for item in items:
            item_vec[item_list_mapping[item[0]]] = 1
        user_item_matrix.append(item_vec)

    if user_item_matrix:
        jaccard_distances = pdist(user_item_matrix, metric='jaccard')
        return 1 - np.mean(jaccard_distances)
    return 0

def calculate_serendipity(recommendations, expected_recommendations, user_preferences):
    unexpected_relevant_count = 0
    total_relevant_count = 0

    for user, items in recommendations.items():
        expected_items = expected_recommendations.get(user, [])
        preferences = user_preferences.get(user, {})
        for item_id, _ in items:
            if item_id not in expected_items and item_is_relevant(item_id, preferences):
                unexpected_relevant_count += 1
            if item_is_relevant(item_id, preferences):
                total_relevant_count += 1

    return unexpected_relevant_count / total_relevant_count if total_relevant_count else 0

def calculate_item_popularity(recommendations):
    item_popularity = {}
    total_recommendations = 0

    for user, items in recommendations.items():
        for item_id, _ in items:
            if item_id in item_popularity:
                item_popularity[item_id] += 1
            else:
                item_popularity[item_id] = 1
            total_recommendations += 1

    for movie_id in item_popularity:
        item_popularity[movie_id] /= total_recommendations

    return item_popularity

def get_expected_recommendations(df, n_most_popular=100):
    most_popular = df['movieId'].value_counts().head(n_most_popular).index.tolist()
    return {user: most_popular for user in df['userId'].unique()}

def item_is_relevant(item_id, user_preferences):
    return item_id in user_preferences.get('preferred_movies', [])

def main():
    # Configure logging
    logging.basicConfig(level=logging.INFO)
    
    # Load and preprocess data
    merged_df, movies_df = load_and_merge_data('../data/movies.csv', '../data/ratings.csv')
    if merged_df is not None:
        # Process and enhance data
        merged_df = preprocess_data(merged_df)
        merged_df = add_user_genre_features(merged_df)

        item_features = get_item_features(merged_df)
        user_preferences = derive_user_preferences(merged_df)
        recommendations = generate_recommendations(merged_df, movies_df, item_features)
        item_popularity = calculate_item_popularity(recommendations)
        expected_recommendations = get_expected_recommendations(merged_df)

        display_metrics(recommendations, len(movies_df), item_popularity, expected_recommendations, user_preferences, item_features)
    else:
        print("Data loading or processing failed.")

if __name__ == "__main__":
    main()


# PRIOR MASTER CODE:

In [None]:
import pandas as pd
import numpy as np
import logging
import re
from collections import defaultdict
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, precision_score, recall_score, f1_score
from scipy.spatial.distance import pdist, squareform
import itertools

# Constants
MOVIES_FILE = '../data/movies.csv'
RATINGS_FILE = '../data/ratings.csv'
MAX_RATINGS = None
N_RECOMMENDATIONS = 5
YEAR_BINS = list(range(1900, 2025, 5))
RELEASE_YEAR_WEIGHT_DIVISOR = 100
RATING_THRESHOLD = 4.0

def load_and_merge_data(movies_file, ratings_file, max_ratings=None):
    try:
        movies_df = pd.read_csv(movies_file, usecols=['movieId', 'title', 'genres'])
        ratings_df = pd.read_csv(ratings_file, usecols=['userId', 'movieId', 'rating']).head(max_ratings)

        movies_df.drop_duplicates(subset='movieId', inplace=True)
        ratings_df.drop_duplicates(subset=['userId', 'movieId'], inplace=True)

        merged_df = pd.merge(ratings_df, movies_df, on='movieId')
        merged_df['uniqueId'] = merged_df['userId'].astype(str) + "_" + merged_df['movieId'].astype(str)

        logging.info("Data loaded and merged successfully.")
        return merged_df, movies_df
    except FileNotFoundError as e:
        logging.error(f"Error: {e}. File not found. Check the file paths.")
        return None, None
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
        return None, None

def extract_release_year(title):
    match = re.search(r'\((\d{4})\)', title)
    if match:
        return int(match.group(1))
    return None

def preprocess_data(df):
    if 'title' in df.columns:
        df['release_year'] = df['title'].apply(extract_release_year)
        df['release_year'].fillna(0, inplace=True)
        logging.info("Release year feature added.")
    else:
        logging.error("Title column not available for processing.")
    
    if 'genres' in df.columns:
        df['genres'] = df['genres'].apply(lambda x: x.split('|'))
        genre_dummies = df['genres'].str.join('|').str.get_dummies()
        df = pd.concat([df, genre_dummies], axis=1)
        logging.info("Genre-based features added.")
    else:
        logging.error("Genres column not available for processing.")
    
    df['year_category'] = pd.cut(df['release_year'], bins=YEAR_BINS, labels=YEAR_BINS[:-1], right=False)
    year_dummies = pd.get_dummies(df['year_category'], prefix='year')
    df = pd.concat([df, year_dummies], axis=1)
    df.drop(columns=['year_category'], inplace=True)

    logging.info(f"Data after preprocessing: {df.head()}")
    return df

def add_user_genre_features(df):
    if 'userId' in df.columns:
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        genre_columns = [col for col in numeric_cols if col.startswith('genre_')]
        
        logging.info(f"Genre columns used for user genre features: {genre_columns}")
        
        if genre_columns:
            user_genre_means = df.groupby(['userId'])[genre_columns].mean()
            user_genre_means.columns = [f'user_mean_{col}' for col in user_genre_means.columns]
            df = pd.merge(df, user_genre_means, on='userId', how='left')
            logging.info("User-genre interaction features added.")
        else:
            logging.warning("No genre columns found for user genre features.")
    else:
        logging.error("UserId column not available for interaction features.")
    
    logging.info(f"Data after adding user genre features: {df.head()}")
    return df

def get_item_features(df):
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    feature_columns = [col for col in numeric_cols if col not in ['userId', 'movieId', 'rating']]
    logging.info(f"Numeric feature columns used for item features: {feature_columns}")
    item_features = df.groupby('movieId')[feature_columns].mean()
    return item_features.to_dict('index')

def derive_user_preferences(df):
    preferences = {}
    for user_id, group in df.groupby('userId'):
        preferred_movies = group[group['rating'] > RATING_THRESHOLD]['movieId'].tolist()
        preferences[user_id] = {'preferred_movies': preferred_movies}
    return preferences

def generate_recommendations(algo, df, user_id, n=5):
    user_data = df[df['userId'] == user_id]
    user_movies = user_data['movieId'].unique()
    all_movies = df['movieId'].unique()
    possible_movies = np.setdiff1d(all_movies, user_movies)
    
    recommendations = []
    for movie_id in possible_movies:
        prediction = algo.predict(user_id, movie_id).est
        recommendations.append((movie_id, prediction))
    
    recommendations = sorted(recommendations, key=lambda x: x[1], reverse=True)[:n]
    return recommendations

def calculate_item_popularity(recommendations):
    item_popularity = {}
    total_recommendations = 0
    
    for user, items in recommendations.items():
        for item_id, _ in items:
            if item_id in item_popularity:
                item_popularity[item_id] += 1
            else:
                item_popularity[item_id] = 1
            total_recommendations += 1
    
    for movie_id in item_popularity:
        item_popularity[movie_id] /= total_recommendations
    
    return item_popularity

def get_expected_recommendations(df, n_most_popular=100):
    most_popular = df['movieId'].value_counts().head(n_most_popular).index.tolist()
    return {user: most_popular for user in df['userId'].unique()}

def calculate_model_metrics(predictions):
    true_ratings = [pred.r_ui for pred in predictions]
    estimated_ratings = [pred.est for pred in predictions]
    rmse = np.sqrt(mean_squared_error(true_ratings, estimated_ratings))
    mae = mean_absolute_error(true_ratings, estimated_ratings)
    y_pred = [1 if est >= RATING_THRESHOLD else 0 for est in estimated_ratings]
    y_true = [1 if true_r >= RATING_THRESHOLD else 0 for true_r in true_ratings]
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    return {
        'RMSE': rmse,
        'MAE': mae,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

def display_metrics(recommendations, catalog_size, item_popularity, expected_recommendations, user_preferences, item_features, movies_df):
    coverage = calculate_coverage(recommendations, catalog_size)
    novelty = calculate_novelty(recommendations, item_popularity)
    personalization = calculate_personalization(recommendations)
    serendipity = calculate_serendipity(recommendations, expected_recommendations, user_preferences)
    intra_list_diversity = calculate_intra_list_similarity(recommendations, item_features)

    print(f"Catalog Coverage: {coverage:.2%}")
    print(f"Average Novelty: {novelty:.4f}")
    print(f"Personalization: {personalization:.3f}")
    print(f"Serendipity: {serendipity:.3f}")
    print(f"Intra-list Diversity: {intra_list_diversity:.3f}")

    for user, items in list(recommendations.items())[:10]:  # Limit to 10 users
        print(f"User {user} recommendations:")
        for movie_id, pred_rating in items:
            title = movies_df[movies_df['movieId'] == movie_id]['title'].values[0]
            print(f"\tMovie: {title}, Predicted Rating: {pred_rating:.2f}")

def calculate_coverage(recommendations, catalog_size):
    recommended_items = set()
    for user, items in recommendations.items():
        recommended_items.update([item[0] for item in items])
    coverage = len(recommended_items) / catalog_size
    return coverage

def calculate_novelty(recommendations, item_popularity):
    novelty_scores = []
    for user, items in recommendations.items():
        for item_id, _ in items:
            popularity = item_popularity.get(item_id, 0)
            if popularity > 0:
                novelty_scores.append(-np.log2(popularity))
    return np.mean(novelty_scores) if novelty_scores else 0

def calculate_personalization(recommendations):
    user_pairs = list(itertools.combinations(recommendations.keys(), 2))
    similarity_sum = 0
    for user1, user2 in user_pairs:
        items1 = {item[0] for item in recommendations[user1]}
        items2 = {item[0] for item in recommendations[user2]}
        similarity_sum += len(items1.intersection(items2)) / len(items1.union(items2))
    personalization = 1 - (similarity_sum / len(user_pairs)) if user_pairs else 1
    return personalization

def calculate_serendipity(recommendations, expected_recommendations, user_preferences):
    serendipity_scores = []
    for user, items in recommendations.items():
        expected_set = set(expected_recommendations[user])
        user_set = set(user_preferences[user]['preferred_movies'])
        for item_id, _ in items:
            if item_id not in expected_set and item_id not in user_set:
                serendipity_scores.append(1)
            else:
                serendipity_scores.append(0)
    return np.mean(serendipity_scores) if serendipity_scores else 0

def calculate_intra_list_similarity(recommendations, item_features):
    diversity_scores = []
    for user, items in recommendations.items():
        features = np.array([list(item_features[item[0]].values()) for item in items if item[0] in item_features])
        if features.ndim == 2 and features.shape[1] > 1:
            distances = pdist(features, 'cosine')
            diversity_scores.append(1 - np.mean(squareform(distances)))
        else:
            logging.error(f"Error in calculating distances for user {user}: A 2-dimensional array must be passed.")
    return np.mean(diversity_scores) if diversity_scores else 0

def main():
    logging.basicConfig(level=logging.INFO)
    
    merged_df, movies_df = load_and_merge_data(MOVIES_FILE, RATINGS_FILE, MAX_RATINGS)
    
    if merged_df is not None:
        merged_df = preprocess_data(merged_df)
        merged_df = add_user_genre_features(merged_df)

        item_features = get_item_features(merged_df)
        user_preferences = derive_user_preferences(merged_df)

        # Load data into Surprise for model training
        reader = Reader(rating_scale=(0.5, 5.0))
        data = Dataset.load_from_df(merged_df[['userId', 'movieId', 'rating']], reader)
        trainset, testset = train_test_split(data, test_size=0.2)

        # Perform GridSearchCV to find the best parameters
        param_grid = {
            'n_epochs': [20, 40, 60],
            'lr_all': [0.002, 0.005, 0.01],
            'reg_all': [0.02, 0.05, 0.1]
        }
        gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
        gs.fit(data)

        best_params = gs.best_params['rmse']
        print(f"Best params: {best_params}")

        algo = SVD(**best_params)
        algo.fit(trainset)
        predictions = algo.test(testset)

        # Calculate model metrics
        metrics = calculate_model_metrics(predictions)
        print("Model Metrics:")
        for metric, value in metrics.items():
            print(f"{metric}: {value:.4f}")

        recommendations = {user_id: generate_recommendations(algo, merged_df, user_id, N_RECOMMENDATIONS) for user_id in merged_df['userId'].unique()}
        item_popularity = calculate_item_popularity(recommendations)
        expected_recommendations = get_expected_recommendations(merged_df)
        
        display_metrics(recommendations, len(movies_df), item_popularity, expected_recommendations, user_preferences, item_features, movies_df)
    else:
        print("Data loading or processing failed.")

if __name__ == "__main__":
    main()


# Surprise Method

In [None]:
import pandas as pd
import numpy as np
import logging
import re
from collections import defaultdict
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, precision_score, recall_score, f1_score
from scipy.spatial.distance import pdist, squareform
import itertools

# Constants
MOVIES_FILE = '../data/movies.csv'
RATINGS_FILE = '../data/ratings.csv'
MAX_RATINGS = None
N_RECOMMENDATIONS = 5
YEAR_BINS = list(range(1900, 2025, 5))
RELEASE_YEAR_WEIGHT_DIVISOR = 50
RATING_THRESHOLD = 4.0

def load_and_merge_data(movies_file, ratings_file, max_ratings=None):
    try:
        movies_df = pd.read_csv(movies_file, usecols=['movieId', 'title', 'genres'])
        ratings_df = pd.read_csv(ratings_file, usecols=['userId', 'movieId', 'rating']).head(max_ratings)

        movies_df.drop_duplicates(subset='movieId', inplace=True)
        ratings_df.drop_duplicates(subset=['userId', 'movieId'], inplace=True)

        merged_df = pd.merge(ratings_df, movies_df, on='movieId')
        merged_df['uniqueId'] = merged_df['userId'].astype(str) + "_" + merged_df['movieId'].astype(str)

        logging.info("Data loaded and merged successfully.")
        return merged_df, movies_df
    except FileNotFoundError as e:
        logging.error(f"Error: {e}. File not found. Check the file paths.")
        return None, None
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
        return None, None

def extract_release_year(title):
    match = re.search(r'\((\d{4})\)', title)
    if match:
        return int(match.group(1))
    return None

def preprocess_data(df):
    if 'title' in df.columns:
        df['release_year'] = df['title'].apply(extract_release_year)
        df['release_year'].fillna(0, inplace=True)
        logging.info("Release year feature added.")
    else:
        logging.error("Title column not available for processing.")
    
    if 'genres' in df.columns:
        df['genres'] = df['genres'].apply(lambda x: x.split('|'))
        genre_dummies = df['genres'].str.join('|').str.get_dummies()
        df = pd.concat([df, genre_dummies], axis=1)
        logging.info("Genre-based features added.")
    else:
        logging.error("Genres column not available for processing.")
    
    df['year_category'] = pd.cut(df['release_year'], bins=YEAR_BINS, labels=YEAR_BINS[:-1], right=False)
    year_dummies = pd.get_dummies(df['year_category'], prefix='year')
    df = pd.concat([df, year_dummies], axis=1)
    df.drop(columns=['year_category'], inplace=True)

    logging.info(f"Data after preprocessing: {df.head()}")
    return df

def add_user_genre_features(df):
    if 'userId' in df.columns:
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        genre_columns = [col for col in numeric_cols if col.startswith('genre_')]
        
        logging.info(f"Genre columns used for user genre features: {genre_columns}")
        
        if genre_columns:
            user_genre_means = df.groupby(['userId'])[genre_columns].mean()
            user_genre_means.columns = [f'user_mean_{col}' for col in user_genre_means.columns]
            df = pd.merge(df, user_genre_means, on='userId', how='left')
            logging.info("User-genre interaction features added.")
        else:
            logging.warning("No genre columns found for user genre features.")
    else:
        logging.error("UserId column not available for interaction features.")
    
    logging.info(f"Data after adding user genre features: {df.head()}")
    return df

def get_item_features(df):
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    feature_columns = [col for col in numeric_cols if col not in ['userId', 'movieId', 'rating']]
    logging.info(f"Numeric feature columns used for item features: {feature_columns}")
    item_features = df.groupby('movieId')[feature_columns].mean()
    return item_features.to_dict('index')

def derive_user_preferences(df):
    preferences = {}
    for user_id, group in df.groupby('userId'):
        preferred_movies = group[group['rating'] > RATING_THRESHOLD]['movieId'].tolist()
        preferences[user_id] = {'preferred_movies': preferred_movies}
    return preferences

def generate_recommendations(algo, df, user_id, n=5):
    user_data = df[df['userId'] == user_id]
    user_movies = user_data['movieId'].unique()
    all_movies = df['movieId'].unique()
    possible_movies = np.setdiff1d(all_movies, user_movies)
    
    recommendations = []
    for movie_id in possible_movies:
        prediction = algo.predict(user_id, movie_id).est
        recommendations.append((movie_id, prediction))
    
    recommendations = sorted(recommendations, key=lambda x: x[1], reverse=True)[:n]
    return recommendations

def calculate_item_popularity(recommendations):
    item_popularity = {}
    total_recommendations = 0
    
    for user, items in recommendations.items():
        for item_id, _ in items:
            if item_id in item_popularity:
                item_popularity[item_id] += 1
            else:
                item_popularity[item_id] = 1
            total_recommendations += 1
    
    for movie_id in item_popularity:
        item_popularity[movie_id] /= total_recommendations
    
    return item_popularity

def get_expected_recommendations(df, n_most_popular=100):
    most_popular = df['movieId'].value_counts().head(n_most_popular).index.tolist()
    return {user: most_popular for user in df['userId'].unique()}

def calculate_model_metrics(predictions):
    true_ratings = [pred.r_ui for pred in predictions]
    estimated_ratings = [pred.est for pred in predictions]
    rmse = np.sqrt(mean_squared_error(true_ratings, estimated_ratings))
    mae = mean_absolute_error(true_ratings, estimated_ratings)
    y_pred = [1 if est >= RATING_THRESHOLD else 0 for est in estimated_ratings]
    y_true = [1 if true_r >= RATING_THRESHOLD else 0 for true_r in true_ratings]
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    return {
        'RMSE': rmse,
        'MAE': mae,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

def display_metrics(recommendations, catalog_size, item_popularity, expected_recommendations, user_preferences, item_features, movies_df):
    coverage = calculate_coverage(recommendations, catalog_size)
    novelty = calculate_novelty(recommendations, item_popularity)
    personalization = calculate_personalization(recommendations)
    serendipity = calculate_serendipity(recommendations, expected_recommendations, user_preferences)
    intra_list_diversity = calculate_intra_list_similarity(recommendations, item_features)

    print(f"Catalog Coverage: {coverage:.2%}")
    print(f"Average Novelty: {novelty:.4f}")
    print(f"Personalization: {personalization:.3f}")
    print(f"Serendipity: {serendipity:.3f}")
    print(f"Intra-list Diversity: {intra_list_diversity:.3f}")

    for user, items in list(recommendations.items())[:10]:  # Limit to 10 users
        print(f"User {user} recommendations:")
        for movie_id, pred_rating in items:
            title = movies_df[movies_df['movieId'] == movie_id]['title'].values[0]
            print(f"\tMovie: {title}, Predicted Rating: {pred_rating:.2f}")

def calculate_coverage(recommendations, catalog_size):
    recommended_items = set()
    for user, items in recommendations.items():
        recommended_items.update([item[0] for item in items])
    coverage = len(recommended_items) / catalog_size
    return coverage

def calculate_novelty(recommendations, item_popularity):
    novelty_scores = []
    for user, items in recommendations.items():
        for item_id, _ in items:
            popularity = item_popularity.get(item_id, 0)
            if popularity > 0:
                novelty_scores.append(-np.log2(popularity))
    return np.mean(novelty_scores) if novelty_scores else 0

def calculate_personalization(recommendations):
    user_pairs = list(itertools.combinations(recommendations.keys(), 2))
    similarity_sum = 0
    for user1, user2 in user_pairs:
        items1 = {item[0] for item in recommendations[user1]}
        items2 = {item[0] for item in recommendations[user2]}
        similarity_sum += len(items1.intersection(items2)) / len(items1.union(items2))
    personalization = 1 - (similarity_sum / len(user_pairs)) if user_pairs else 1
    return personalization

def calculate_serendipity(recommendations, expected_recommendations, user_preferences):
    serendipity_scores = []
    for user, items in recommendations.items():
        expected_set = set(expected_recommendations[user])
        user_set = set(user_preferences[user]['preferred_movies'])
        for item_id, _ in items:
            if item_id not in expected_set and item_id not in user_set:
                serendipity_scores.append(1)
            else:
                serendipity_scores.append(0)
    return np.mean(serendipity_scores) if serendipity_scores else 0

def calculate_intra_list_similarity(recommendations, item_features):
    diversity_scores = []
    for user, items in recommendations.items():
        features = np.array([list(item_features[item[0]].values()) for item in items if item[0] in item_features])
        if features.ndim == 2 and features.shape[1] > 1:
            distances = pdist(features, 'cosine')
            diversity_scores.append(1 - np.mean(squareform(distances)))
        else:
            logging.error(f"Error in calculating distances for user {user}: A 2-dimensional array must be passed.")
    return np.mean(diversity_scores) if diversity_scores else 0

def main():
    logging.basicConfig(level=logging.INFO)
    
    merged_df, movies_df = load_and_merge_data(MOVIES_FILE, RATINGS_FILE, MAX_RATINGS)
    
    if merged_df is not None:
        merged_df = preprocess_data(merged_df)
        merged_df = add_user_genre_features(merged_df)

        item_features = get_item_features(merged_df)
        user_preferences = derive_user_preferences(merged_df)

        # Load data into Surprise for model training
        reader = Reader(rating_scale=(0.5, 5.0))
        data = Dataset.load_from_df(merged_df[['userId', 'movieId', 'rating']], reader)
        trainset, testset = train_test_split(data, test_size=0.2)

        # Perform GridSearchCV to find the best parameters
        param_grid = {
            'n_epochs': [20, 40, 60],
            'lr_all': [0.002, 0.005, 0.01],
            'reg_all': [0.02, 0.05, 0.1]
        }
        gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
        gs.fit(data)

        best_params = gs.best_params['rmse']
        print(f"Best params: {best_params}")

        algo = SVD(**best_params)
        algo.fit(trainset)
        predictions = algo.test(testset)

        # Calculate model metrics
        metrics = calculate_model_metrics(predictions)
        print("Model Metrics:")
        for metric, value in metrics.items():
            print(f"{metric}: {value:.4f}")

        recommendations = {user_id: generate_recommendations(algo, merged_df, user_id, N_RECOMMENDATIONS) for user_id in merged_df['userId'].unique()}
        item_popularity = calculate_item_popularity(recommendations)
        expected_recommendations = get_expected_recommendations(merged_df)
        
        display_metrics(recommendations, len(movies_df), item_popularity, expected_recommendations, user_preferences, item_features, movies_df)
    else:
        print("Data loading or processing failed.")

if __name__ == "__main__":
    main()


# BEST CODE!  WITH MODERATE DOCUMENTATION

Detailed Comments and Documentation:
Import Libraries: Necessary libraries for data processing, model training, and evaluation are imported.

Constants: Constants for file paths, model parameters, and other settings are defined.

load_and_merge_data: Function to load and merge movie and rating data from CSV files.

extract_release_year: Function to extract the release year from the movie title.

preprocess_data: Function to preprocess the data by adding release year and genre features.

add_user_genre_features: Function to add user-specific genre interaction features.

get_item_features: Function to extract item features from the dataset.

derive_user_preferences: Function to derive user preferences based on their ratings.

generate_recommendations: Function to generate movie recommendations for a given user using a trained algorithm.

calculate_item_popularity: Function to calculate item popularity based on recommendations.

get_expected_recommendations: Function to get expected recommendations based on item popularity.

calculate_model_metrics: Function to calculate various model metrics (RMSE, MAE, Precision, Recall, F1 Score).

display_metrics: Function to display various metrics for the recommendations.

calculate_coverage: Function to calculate coverage of recommendations.

calculate_novelty: Function to calculate novelty of recommendations.

calculate_personalization: Function to calculate personalization of recommendations.

calculate_serendipity: Function to calculate serendipity of recommendations.

calculate_intra_list_similarity: Function to calculate intra-list similarity of recommendations.

main: Main function to execute the recommendation system. This function loads and preprocesses data, trains the model, makes predictions, and displays recommendations and metrics.

In [None]:
import pandas as pd
import numpy as np
import itertools
import logging
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.spatial.distance import pdist, squareform

# Constants
MOVIES_FILE = '../data/movies.csv'
RATINGS_FILE = '../data/ratings.csv'
N_RECOMMENDATIONS = 5  # Changed from 10 to 5
YEAR_DIVISOR = 0.05  # Changed from 5 to 0.05
RATING_THRESHOLD = 4.0  # Changed from 3.5 to 4.0

# Logging configuration
logging.basicConfig(level=logging.INFO)

# Load datasets
def load_data(movies_file, ratings_file):
    """Load movies and ratings datasets and merge them."""
    try:
        movies_df = pd.read_csv(movies_file)
        ratings_df = pd.read_csv(ratings_file)
        merged_df = pd.merge(ratings_df, movies_df, on='movieId')
        # Ensure release_year column is extracted from title column if not present
        if 'release_year' not in merged_df.columns:
            merged_df['release_year'] = merged_df['title'].str.extract(r'\((\d{4})\)').astype(float)
        return merged_df, movies_df, ratings_df
    except Exception as e:
        logging.error(f"Error loading data: {e}")
        return None, None, None

# Calculate weighted release year
def get_weighted_release_year(year, divisor):
    """Calculate the weighted release year by dividing the year by the given divisor."""
    return year * divisor

# Get item features
def get_item_features(df):
    """Extract item features from the DataFrame."""
    if 'release_year' not in df.columns:
        raise KeyError("release_year column is missing from the DataFrame.")
    df['release_year_bucket'] = df['release_year'].apply(lambda x: get_weighted_release_year(x, YEAR_DIVISOR))
    genre_columns = [col for col in df.columns if col.startswith('user_mean_')]
    item_features = df[['movieId', 'release_year_bucket'] + genre_columns].drop_duplicates().set_index('movieId')
    item_features_dict = item_features.to_dict(orient='index')
    return item_features_dict

# Derive user preferences
def derive_user_preferences(df):
    """Calculate mean ratings for each genre and release year for each user."""
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    user_genre_means = df.groupby('userId')[numeric_columns].mean().add_prefix('user_mean_')
    return user_genre_means.to_dict(orient='index')

# Generate recommendations
def generate_recommendations(algo, df, user_id, n_recommendations):
    """Generate top N recommendations for a given user."""
    user_rated_items = df[df['userId'] == user_id]['movieId'].tolist()
    all_items = df['movieId'].unique()
    recommendations = []
    for item_id in all_items:
        if item_id not in user_rated_items:
            pred = algo.predict(user_id, item_id)
            recommendations.append((item_id, pred.est))
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:n_recommendations]

# Calculate item popularity
def calculate_item_popularity(recommendations):
    """Calculate how often each item is recommended."""
    item_popularity = {}
    for user, items in recommendations.items():
        for item_id, _ in items:
            item_popularity[item_id] = item_popularity.get(item_id, 0) + 1
    return item_popularity

# Expected recommendations for novelty
def get_expected_recommendations(df):
    """Get expected recommendations based on high ratings."""
    return df[df['rating'] >= RATING_THRESHOLD]['movieId'].unique()

# Calculate model metrics
def calculate_model_metrics(predictions):
    """Calculate various metrics to evaluate the model."""
    true_ratings = [pred.r_ui for pred in predictions]
    estimated_ratings = [pred.est for pred in predictions]
    rmse = np.sqrt(mean_squared_error(true_ratings, estimated_ratings))
    mae = mean_absolute_error(true_ratings, estimated_ratings)
    y_true = [1 if true_r >= RATING_THRESHOLD else 0 for true_r in true_ratings]
    y_pred = [1 if est >= RATING_THRESHOLD else 0 for est in estimated_ratings]
    precision = np.sum(np.array(y_true) & np.array(y_pred)) / np.sum(y_pred)
    recall = np.sum(np.array(y_true) & np.array(y_pred)) / np.sum(y_true)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {
        'RMSE': rmse,
        'MAE': mae,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

# Calculate coverage
def calculate_coverage(recommendations, catalog_size):
    """Calculate the percentage of items in the catalog that have been recommended."""
    recommended_items = set()
    for user, items in recommendations.items():
        recommended_items.update([item_id for item_id, _ in items])
    return len(recommended_items) / catalog_size

# Calculate novelty
def calculate_novelty(recommendations, item_popularity):
    """Calculate the average popularity of recommended items."""
    novelty_scores = []
    for user, items in recommendations.items():
        for item_id, _ in items:
            novelty_scores.append(item_popularity.get(item_id, 0))
    return np.mean(novelty_scores)

# Calculate personalization
def calculate_personalization(recommendations):
    """Calculate how different the recommendations are for different users."""
    user_pairs = list(itertools.combinations(recommendations.keys(), 2))
    similarity_sum = 0
    for user1, user2 in user_pairs:
        items1 = {item_id for item_id, _ in recommendations[user1]}
        items2 = {item_id for item_id, _ in recommendations[user2]}
        similarity_sum += len(items1 & items2) / len(items1 | items2)
    return 1 - (similarity_sum / len(user_pairs))

# Calculate serendipity
def calculate_serendipity(recommendations, expected_recommendations, user_preferences, item_features):
    """Calculate the serendipity of the recommendations."""
    serendipity_scores = []
    for user, items in recommendations.items():
        user_prefs = user_preferences.get(user, {})
        for item_id, _ in items:
            if item_id not in expected_recommendations:
                item_genres = item_features.get(item_id, {})
                similarity = sum(user_prefs.get(f"user_mean_{genre}", 0) * item_genres.get(genre, 0) for genre in item_genres)
                serendipity_scores.append(1 - similarity)
    return np.mean(serendipity_scores) if serendipity_scores else 0

# Calculate intra-list diversity
def calculate_intra_list_similarity(recommendations, item_features):
    """Calculate the diversity within a single user's list of recommendations."""
    diversity_scores = []
    for user, items in recommendations.items():
        features = [list(item_features[item[0]].values()) for item in items if item[0] in item_features]
        if len(features) > 1:
            distances = pdist(features, 'cosine')
            diversity_scores.append(1 - np.mean(squareform(distances)))
    return np.mean(diversity_scores) if diversity_scores else 0

# Display metrics
def display_metrics(recommendations, catalog_size, item_popularity, expected_recommendations, user_preferences, item_features):
    """Display various metrics to evaluate the recommendations."""
    coverage = calculate_coverage(recommendations, catalog_size)
    novelty = calculate_novelty(recommendations, item_popularity)
    personalization = calculate_personalization(recommendations)
    serendipity = calculate_serendipity(recommendations, expected_recommendations, user_preferences, item_features)
    intra_list_diversity = calculate_intra_list_similarity(recommendations, item_features)

    print(f"Catalog Coverage: {coverage:.2%}")
    print(f"Average Novelty: {novelty:.4f}")
    print(f"Personalization: {personalization:.4f}")
    print(f"Serendipity: {serendipity:.4f}")
    print(f"Intra-list Diversity: {intra_list_diversity:.4f}")

# Main function
def main():
    """Main function to load data, train the model, and generate recommendations."""
    merged_df, movies_df, ratings_df = load_data(MOVIES_FILE, RATINGS_FILE)
    if merged_df is not None:
        # Load data into Surprise for model training
        reader = Reader(rating_scale=(0.5, 5.0))
        data = Dataset.load_from_df(merged_df[['userId', 'movieId', 'rating']], reader)
        trainset, testset = train_test_split(data, test_size=0.2)

        # Hyperparameter tuning with GridSearchCV
        param_grid = {
            'n_factors': [50, 100],
            'n_epochs': [20, 30],
            'lr_all': [0.005, 0.01],
            'reg_all': [0.02, 0.1]
        }
        gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
        gs.fit(data)
        best_params = gs.best_params['rmse']
        print(f"Best parameters: {best_params}")

        # Train the best model
        algo = SVD(**best_params)
        algo.fit(trainset)

        # Evaluate the model
        predictions = algo.test(testset)
        metrics = calculate_model_metrics(predictions)
        print("Model Metrics:")
        for metric, value in metrics.items():
            print(f"{metric}: {value:.4f}")

        # Generate recommendations for all users
        item_features = get_item_features(merged_df)
        user_preferences = derive_user_preferences(merged_df)
        recommendations = {user_id: generate_recommendations(algo, merged_df, user_id, N_RECOMMENDATIONS) for user_id in merged_df['userId'].unique()}
        item_popularity = calculate_item_popularity(recommendations)
        expected_recommendations = get_expected_recommendations(merged_df)

        display_metrics(recommendations, len(movies_df), item_popularity, expected_recommendations, user_preferences, item_features)

        # Print recommendations for the first 10 users
        for user_id, recs in list(recommendations.items())[:10]:
            print(f"User {user_id} recommendations:")
            for movie_id, est_rating in recs:
                title = movies_df[movies_df['movieId'] == movie_id]['title'].values[0]
                print(f"  {title}: {est_rating:.2f}")
    else:
        print("Data loading or processing failed.")

if __name__ == "__main__":
    main()


# Testing

In [17]:
import pandas as pd
import numpy as np
import itertools
import logging
import re
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, precision_score, recall_score, f1_score
from scipy.spatial.distance import pdist, squareform

# Constants
MOVIES_FILE = '../data/movies.csv'
RATINGS_FILE = '../data/ratings.csv'
N_RECOMMENDATIONS = 5
YEAR_DIVISOR = 1.0  # Adjust this value based on experimentation
RATING_THRESHOLD = 4.0
RANDOM_SEED = 42  # Random seed for reproducibility
USER_SAMPLE_SIZE = 100  # Set sample size to 100 for now

# Weights for hybrid scoring
CF_WEIGHT = 0.7
CBF_WEIGHT = 0.3

# Set random seeds for reproducibility
np.random.seed(RANDOM_SEED)

# Logging configuration
logging.basicConfig(level=logging.INFO)

# Load datasets
def load_data(movies_file, ratings_file, user_sample_size=None):
    """Load movies and ratings datasets and merge them. Optionally sample a subset of users."""
    try:
        movies_df = pd.read_csv(movies_file)
        ratings_df = pd.read_csv(ratings_file)
        
        if user_sample_size:
            unique_users = ratings_df['userId'].drop_duplicates()
            actual_sample_size = min(user_sample_size, len(unique_users))
            sampled_users = unique_users.sample(n=actual_sample_size, random_state=RANDOM_SEED)
            ratings_df = ratings_df[ratings_df['userId'].isin(sampled_users)]
        
        merged_df = pd.merge(ratings_df, movies_df, on='movieId')
        if 'release_year' not in merged_df.columns:
            merged_df['release_year'] = merged_df['title'].str.extract(r'\((\d{4})\)')[0].astype(float)
        
        # Ensure proper genre handling
        genre_list = list(set(itertools.chain.from_iterable(merged_df['genres'].str.split('|'))))
        for genre in genre_list:
            genre_pattern = re.escape(genre)  # Escape genre to treat it as a literal string
            merged_df[genre] = merged_df['genres'].str.contains(r'\b' + genre_pattern + r'\b').astype(int)
        return merged_df, movies_df, ratings_df
    except Exception as e:
        logging.error(f"Error loading data: {e}")
        return None, None, None

# Calculate weighted release year
def get_weighted_release_year(year, divisor):
    """Calculate the weighted release year by normalizing and then dividing the year by the given divisor."""
    min_year = 1900  # Assuming movies are not older than 1900
    max_year = 2024  # Use the current year as the upper bound
    normalized_year = (year - min_year) / (max_year - min_year)
    return normalized_year / divisor

# Get item features
def get_item_features(df):
    """Extract item features from the DataFrame."""
    if 'release_year' not in df.columns:
        raise KeyError("release_year column is missing from the DataFrame.")
    df['release_year_bucket'] = df['release_year'].apply(lambda x: get_weighted_release_year(x, YEAR_DIVISOR))
    genre_columns = [col for col in df.columns if col not in ['userId', 'movieId', 'rating', 'title', 'genres', 'release_year_bucket']]
    item_features = df[['movieId', 'release_year_bucket'] + genre_columns].drop_duplicates().set_index('movieId')
    # Ensure index is unique
    item_features = item_features.loc[~item_features.index.duplicated(keep='first')]
    # Handle missing values by filling them with zeros or an appropriate value
    item_features = item_features.fillna(0)
    # Debugging: Print item features to ensure correctness
    print("Item features:")
    print(item_features.head())
    item_features_dict = item_features.to_dict(orient='index')
    return item_features_dict

# Derive user preferences
def derive_user_preferences(df):
    """Calculate mean ratings for each genre and release year for each user."""
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    user_genre_means = df.groupby('userId')[numeric_columns].mean().add_prefix('user_mean_')
    return user_genre_means.to_dict(orient='index')

# Normalize content scores
def normalize_scores(scores):
    """Normalize scores to be between 0.5 and 5.0."""
    scores = np.array(scores)
    min_score = np.nanmin(scores)
    max_score = np.nanmax(scores)
    if max_score != min_score:
        normalized_scores = 0.5 + 4.5 * ((scores - min_score) / (max_score - min_score))
    else:
        normalized_scores = scores
    return normalized_scores

# Generate recommendations
def generate_recommendations(algo, df, user_id, n_recommendations, item_features, item_popularity, global_popularity, verbose=False):
    """Generate top N recommendations for a given user with diversity penalty."""
    user_rated_items = df[df['userId'] == user_id]['movieId'].tolist()
    all_items = df['movieId'].unique()
    recommendations = []

    cf_scores = []
    cbf_scores = []
    items_to_recommend = []

    for item_id in all_items:
        if item_id not in user_rated_items:
            pred = algo.predict(user_id, item_id)
            cf_scores.append(pred.est)
            content_score = np.mean(list(item_features.get(item_id, {}).values()))
            cbf_scores.append(content_score)
            items_to_recommend.append(item_id)

    normalized_cbf_scores = normalize_scores(cbf_scores)

    for idx, item_id in enumerate(items_to_recommend):
        cf_score = cf_scores[idx]
        content_score = normalized_cbf_scores[idx]
        popularity_penalty = item_popularity.get(item_id, 0) / len(df['userId'].unique())
        diversity_penalty = global_popularity.get(item_id, 0) / len(df['userId'].unique())
        hybrid_score = (CF_WEIGHT * cf_score + CBF_WEIGHT * content_score) - (popularity_penalty * 0.01) - (diversity_penalty * 0.01)
        if verbose:
            print(f"User {user_id}, Item {item_id}: CF {cf_score:.2f}, CBF {content_score:.2f}, Penalty {popularity_penalty:.4f}, Diversity {diversity_penalty:.4f}, Hybrid {hybrid_score:.2f}")
        recommendations.append((item_id, hybrid_score))

    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:n_recommendations]

# Calculate item popularity
def calculate_item_popularity(df):
    """Calculate how often each item is rated."""
    item_popularity = df['movieId'].value_counts().to_dict()
    return item_popularity

# Expected recommendations for novelty
def get_expected_recommendations(df):
    """Get expected recommendations based on high ratings."""
    return df[df['rating'] >= RATING_THRESHOLD]['movieId'].unique()

# Calculate model metrics
def calculate_model_metrics(predictions):
    """Calculate various metrics to evaluate the model."""
    true_ratings = [pred.r_ui for pred in predictions]
    estimated_ratings = [pred.est for pred in predictions]
    rmse = np.sqrt(mean_squared_error(true_ratings, estimated_ratings))
    mae = mean_absolute_error(true_ratings, estimated_ratings)
    y_true = [1 if true_r >= RATING_THRESHOLD else 0 for true_r in true_ratings]
    y_pred = [1 if est >= RATING_THRESHOLD else 0 for est in estimated_ratings]
    precision = np.sum(np.array(y_true) & np.array(y_pred)) / np.sum(y_pred)
    recall = np.sum(np.array(y_true) & np.array(y_pred)) / np.sum(y_true)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {
        'RMSE': rmse,
        'MAE': mae,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

# Calculate hit rate
def calculate_hit_rate(recommendations, df, n_recommendations):
    """Calculate hit rate."""
    hits = 0
    for user, items in recommendations.items():
        user_rated_items = df[df['userId'] == user]['movieId'].tolist()
        hits += len(set(item_id for item_id, _ in items).intersection(set(user_rated_items)))
    return hits / (len(recommendations) * n_recommendations)

# Calculate mean reciprocal rank
def calculate_mrr(recommendations, df):
    """Calculate mean reciprocal rank."""
    rr_sum = 0
    for user, items in recommendations.items():
        user_rated_items = df[df['userId'] == user]['movieId'].tolist()
        for rank, (item, _) in enumerate(items, start=1):
            if item in user_rated_items:
                rr_sum += 1 / rank
                break
    return rr_sum / len(recommendations)

# Calculate average precision
def calculate_ap(recommendations, df):
    """Calculate average precision."""
    ap_sum = 0
    for user, items in recommendations.items():
        user_rated_items = df[df['userId'] == user]['movieId'].tolist()
        relevant_items = set(user_rated_items)
        hits = 0
        precision_sum = 0
        for rank, (item, _) in enumerate(items, start=1):
            if item in relevant_items:
                hits += 1
                precision_sum += hits / rank
        ap_sum += precision_sum / min(len(relevant_items), len(items))
    return ap_sum / len(recommendations)

# Calculate coverage
def calculate_coverage(recommendations, catalog_size):
    """Calculate the percentage of items in the catalog that have been recommended."""
    recommended_items = set()
    for user, items in recommendations.items():
        recommended_items.update([item_id for item_id, _ in items])
    return len(recommended_items) / catalog_size

# Calculate novelty
def calculate_novelty(recommendations, item_popularity):
    """Calculate the average popularity of recommended items."""
    novelty_scores = []
    for user, items in recommendations.items():
        for item_id, _ in items:
            novelty_scores.append(item_popularity.get(item_id, 0))
    return np.mean(novelty_scores)

# Calculate personalization
def calculate_personalization(recommendations):
    """Calculate how different the recommendations are for different users."""
    user_pairs = list(itertools.combinations(recommendations.keys(), 2))
    similarity_sum = 0
    for user1, user2 in user_pairs:
        items1 = {item_id for item_id, _ in recommendations[user1]}
        items2 = {item_id for item_id, _ in recommendations[user2]}
        similarity_sum += len(items1 & items2) / len(items1 | items2)
    return 1 - (similarity_sum / len(user_pairs))

# Calculate serendipity
def calculate_serendipity(recommendations, expected_recommendations, user_preferences, item_features):
    """Calculate the serendipity of the recommendations."""
    serendipity_scores = []
    for user, items in recommendations.items():
        user_prefs = user_preferences.get(user, {})
        for item_id, _ in items:
            if item_id not in expected_recommendations:
                item_genres = item_features.get(item_id, {})
                similarity = sum(user_prefs.get(f"user_mean_{genre}", 0) * item_genres.get(genre, 0) for genre in item_genres)
                serendipity_scores.append(1 - similarity)
    return np.mean(serendipity_scores) if serendipity_scores else 0

# Calculate intra-list diversity
def calculate_intra_list_similarity(recommendations, item_features):
    """Calculate the diversity within a single user's list of recommendations."""
    diversity_scores = []
    for user, items in recommendations.items():
        features = [list(item_features[item[0]].values()) for item in items if item[0] in item_features]
        if len(features) > 1:
            distances = pdist(features, 'cosine')
            diversity_scores.append(1 - np.mean(distances))
    return np.mean(diversity_scores) if diversity_scores else 0

# Display metrics
def display_metrics(recommendations, catalog_size, item_popularity, expected_recommendations, user_preferences, item_features, df, n_recommendations):
    """Display various metrics to evaluate the recommendations."""
    coverage = calculate_coverage(recommendations, catalog_size)
    novelty = calculate_novelty(recommendations, item_popularity)
    personalization = calculate_personalization(recommendations)
    serendipity = calculate_serendipity(recommendations, expected_recommendations, user_preferences, item_features)
    intra_list_diversity = calculate_intra_list_similarity(recommendations, item_features)
    hit_rate = calculate_hit_rate(recommendations, df, n_recommendations)
    mrr = calculate_mrr(recommendations, df)
    ap = calculate_ap(recommendations, df)

    print(f"Catalog Coverage: {coverage:.2%}")
    print(f"Average Novelty: {novelty:.4f}")
    print(f"Personalization: {personalization:.4f}")
    print(f"Serendipity: {serendipity:.4f}")
    print(f"Intra-list Diversity: {intra_list_diversity:.4f}")
    print(f"Hit Rate: {hit_rate:.4f}")
    print(f"Mean Reciprocal Rank (MRR): {mrr:.4f}")
    print(f"Average Precision (AP): {ap:.4f}")

# Function to calculate metrics on recommended movies
def calculate_recommendation_metrics(recommendations, df):
    """Calculate metrics (RMSE, MAE, Precision, Recall, F1) on the recommended movies."""
    y_true = []
    y_pred = []
    
    for user, items in recommendations.items():
        user_rated_items = df[df['userId'] == user][['movieId', 'rating']]
        for item_id, predicted_rating in items:
            actual_rating = user_rated_items[user_rated_items['movieId'] == item_id]['rating'].values
            if len(actual_rating) > 0:
                y_true.append(actual_rating[0])
                y_pred.append(predicted_rating)
    
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    precision = precision_score([1 if y >= RATING_THRESHOLD else 0 for y in y_true], [1 if y >= RATING_THRESHOLD else 0 for y in y_pred], zero_division=0)
    recall = recall_score([1 if y >= RATING_THRESHOLD else 0 for y in y_true], [1 if y >= RATING_THRESHOLD else 0 for y in y_pred], zero_division=0)
    f1 = f1_score([1 if y >= RATING_THRESHOLD else 0 for y in y_true], [1 if y >= RATING_THRESHOLD else 0 for y in y_pred], zero_division=0)
    
    return {
        'RMSE': rmse,
        'MAE': mae,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

def main():
    """Main function to load data and prepare for EDA."""
    merged_df, movies_df, ratings_df = load_data(MOVIES_FILE, RATINGS_FILE, user_sample_size=USER_SAMPLE_SIZE)
    if merged_df is not None:
        print("Data loaded successfully. You can now perform EDA on the loaded datasets.")
        return merged_df, movies_df, ratings_df
    else:
        print("Data loading or processing failed.")
        return None, None, None

if __name__ == "__main__":
    merged_df, movies_df, ratings_df = main()

    # Example EDA
    if merged_df is not None:
        print("\nMovies DataFrame:")
        print(movies_df.head())
        
        print("\nRatings DataFrame:")
        print(ratings_df.head())
        
        print("\nMerged DataFrame:")
        print(merged_df.head())
        
        # Example EDA: Distribution of ratings
        ratings_distribution = ratings_df['rating'].value_counts().sort_index()
        print("\nRatings Distribution:")
        print(ratings_distribution)
        
        # Example EDA: Number of ratings per user
        user_ratings_count = ratings_df['userId'].value_counts()
        print("\nNumber of Ratings per User:")
        print(user_ratings_count.describe())
        
        # Train the model and generate recommendations
        reader = Reader(rating_scale=(0.5, 5.0))
        data = Dataset.load_from_df(merged_df[['userId', 'movieId', 'rating']], reader)
        trainset, testset = train_test_split(data, test_size=0.2, random_state=RANDOM_SEED)

        # Hyperparameter tuning with GridSearchCV
        param_grid = {
            'n_factors': [50, 100],
            'n_epochs': [20, 30],
            'lr_all': [0.005, 0.01],
            'reg_all': [0.02, 0.1],
            'biased': [True, False]  # Adding the 'biased' parameter
        }
        gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=-1)  # Use all available CPU cores
        gs.fit(data)
        best_params = gs.best_params['rmse']
        print(f"Best parameters: {best_params}")

        # Train the best model
        algo = SVD(**best_params)
        algo.fit(trainset)

        # Evaluate the model
        predictions = algo.test(testset)
        metrics = calculate_model_metrics(predictions)
        print("Model Metrics:")
        for metric, value in metrics.items():
            print(f"{metric}: {value:.4f}")

        # Generate recommendations for all users
        item_features = get_item_features(merged_df)
        user_preferences = derive_user_preferences(merged_df)
        item_popularity = calculate_item_popularity(merged_df)
        global_popularity = item_popularity  # Assuming global popularity is the same as item popularity here
        recommendations = {user_id: generate_recommendations(algo, merged_df, user_id, N_RECOMMENDATIONS, item_features, item_popularity, global_popularity) for user_id in merged_df['userId'].unique()}
        expected_recommendations = get_expected_recommendations(merged_df)

        display_metrics(recommendations, len(movies_df), item_popularity, expected_recommendations, user_preferences, item_features, merged_df, N_RECOMMENDATIONS)

        # Calculate and display recommendation metrics
        recommendation_metrics = calculate_recommendation_metrics(recommendations, merged_df)
        print("Recommendation Metrics:")
        for metric, value in recommendation_metrics.items():
            print(f"{metric}: {value:.4f}")

        # Print recommendations for the first 10 users
        for user_id, recs in list(recommendations.items())[:10]:
            print(f"User {user_id} recommendations:")
            for movie_id, est_rating in recs:
                title = movies_df[movies_df['movieId'] == movie_id]['title'].values[0]
                print(f"  {title}: {est_rating:.2f}")


Data loaded successfully. You can now perform EDA on the loaded datasets.

Movies DataFrame:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

Ratings DataFrame:
     userId  movieId  rating   timestamp
261       3       31     0.5  1306463578
262       3      527     0.5  1306464275
263       3      647     0.5  1306463619
264       3      688     0.5  1306464228
265       3      720     0.5  1306463595

Merged DataFrame:
   userId  movieId  rating 

ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.

In [15]:
def recommend_for_profile(algo, merged_df, movies_df, item_features, item_popularity):
    """Prompt the user for a movie profile and generate recommendations."""
    print("Please enter the details of your preferred movie profile.")
    
    # Get genres from the user
    genres = []
    available_genres = list(set(itertools.chain.from_iterable(merged_df['genres'].str.split('|'))))
    for genre in available_genres:
        include = input(f"Do you like {genre} movies? (yes/no): ").strip().lower()
        if include == 'yes':
            genres.append(genre)
    
    # Get release year from the user
    release_year = float(input("Enter the preferred release year (e.g., 2000): "))
    weighted_release_year = get_weighted_release_year(release_year, YEAR_DIVISOR)
    
    # Create a profile vector
    profile_vector = [weighted_release_year] + [1 if genre in genres else 0 for genre in available_genres]
    
    # Calculate content scores for all items
    content_scores = []
    for item_id in item_features.keys():
        item_vector = list(item_features[item_id].values())
        similarity = np.dot(profile_vector, item_vector) / (np.linalg.norm(profile_vector) * np.linalg.norm(item_vector))
        content_scores.append((item_id, similarity))
    
    # Normalize content scores
    content_scores = sorted(content_scores, key=lambda x: x[1], reverse=True)
    content_scores = content_scores[:N_RECOMMENDATIONS]
    
    # Print recommendations
    print("\nRecommended movies based on your profile:")
    for item_id, score in content_scores:
        title = movies_df[movies_df['movieId'] == item_id]['title'].values[0]
        print(f"{title}: {score:.2f}")

# Example usage (run this after the main function):
recommend_for_profile(algo, merged_df, movies_df, item_features, item_popularity)


Please enter the details of your preferred movie profile.


Do you like Documentary movies? (yes/no):  no
Do you like Action movies? (yes/no):  y
Do you like (no genres listed) movies? (yes/no):  n
Do you like Film-Noir movies? (yes/no):  n
Do you like Fantasy movies? (yes/no):  y
Do you like IMAX movies? (yes/no):  y
Do you like Mystery movies? (yes/no):  n
Do you like Drama movies? (yes/no):  y
Do you like Romance movies? (yes/no):  n
Do you like Adventure movies? (yes/no):  y
Do you like Crime movies? (yes/no):  n
Do you like Western movies? (yes/no):  n
Do you like Musical movies? (yes/no):  n
Do you like Thriller movies? (yes/no):  y
Do you like Sci-Fi movies? (yes/no):  y
Do you like Children movies? (yes/no):  n
Do you like War movies? (yes/no):  y
Do you like Comedy movies? (yes/no):  y
Do you like Animation movies? (yes/no):  n
Do you like Horror movies? (yes/no):  y
Enter the preferred release year (e.g., 2000):  2000


ValueError: shapes (21,) and (23,) not aligned: 21 (dim 0) != 23 (dim 0)

In [None]:
import pandas as pd
import numpy as np
import itertools
import logging
import re
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, precision_score, recall_score, f1_score
from scipy.spatial.distance import pdist, squareform

# Constants
MOVIES_FILE = '../data/movies.csv'
RATINGS_FILE = '../data/ratings.csv'
N_RECOMMENDATIONS = 5
YEAR_DIVISOR = 1.0  # Adjust this value based on experimentation
RATING_THRESHOLD = 4.0
RANDOM_SEED = 42  # Random seed for reproducibility
USER_SAMPLE_SIZE = 100  # Set sample size to 100 for now

# Weights for hybrid scoring
CF_WEIGHT = 0.7
CBF_WEIGHT = 0.3

# Set random seeds for reproducibility
np.random.seed(RANDOM_SEED)

# Logging configuration
logging.basicConfig(level=logging.INFO)

# Load datasets
def load_data(movies_file, ratings_file, user_sample_size=None):
    """Load movies and ratings datasets and merge them. Optionally sample a subset of users."""
    try:
        movies_df = pd.read_csv(movies_file)
        ratings_df = pd.read_csv(ratings_file)
        
        if user_sample_size:
            unique_users = ratings_df['userId'].drop_duplicates()
            actual_sample_size = min(user_sample_size, len(unique_users))
            sampled_users = unique_users.sample(n=actual_sample_size, random_state=RANDOM_SEED)
            ratings_df = ratings_df[ratings_df['userId'].isin(sampled_users)]
        
        merged_df = pd.merge(ratings_df, movies_df, on='movieId')
        if 'release_year' not in merged_df.columns:
            merged_df['release_year'] = merged_df['title'].str.extract(r'\((\d{4})\)')[0].astype(float)
        
        # Ensure proper genre handling
        genre_list = list(set(itertools.chain.from_iterable(merged_df['genres'].str.split('|'))))
        for genre in genre_list:
            genre_pattern = re.escape(genre)  # Escape genre to treat it as a literal string
            merged_df[genre] = merged_df['genres'].str.contains(r'\b' + genre_pattern + r'\b').astype(int)
        return merged_df, movies_df, ratings_df
    except Exception as e:
        logging.error(f"Error loading data: {e}")
        return None, None, None

# Calculate weighted release year
def get_weighted_release_year(year, divisor):
    """Calculate the weighted release year by normalizing and then dividing the year by the given divisor."""
    min_year = 1900  # Assuming movies are not older than 1900
    max_year = 2024  # Use the current year as the upper bound
    normalized_year = (year - min_year) / (max_year - min_year)
    return normalized_year / divisor

# Get item features
def get_item_features(df):
    """Extract item features from the DataFrame."""
    if 'release_year' not in df.columns:
        raise KeyError("release_year column is missing from the DataFrame.")
    df['release_year_bucket'] = df['release_year'].apply(lambda x: get_weighted_release_year(x, YEAR_DIVISOR))
    genre_columns = [col for col in df.columns if col not in ['userId', 'movieId', 'rating', 'title', 'genres', 'release_year_bucket']]
    item_features = df[['movieId', 'release_year_bucket'] + genre_columns].drop_duplicates().set_index('movieId')
    # Ensure index is unique
    item_features = item_features.loc[~item_features.index.duplicated(keep='first')]
    # Handle missing values by filling them with zeros or an appropriate value
    item_features = item_features.fillna(0)
    # Debugging: Print item features to ensure correctness
    print("Item features:")
    print(item_features.head())
    item_features_dict = item_features.to_dict(orient='index')
    return item_features_dict

# Derive user preferences
def derive_user_preferences(df):
    """Calculate mean ratings for each genre and release year for each user."""
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    user_genre_means = df.groupby('userId')[numeric_columns].mean().add_prefix('user_mean_')
    return user_genre_means.to_dict(orient='index')

# Normalize content scores
def normalize_scores(scores):
    """Normalize scores to be between 0.5 and 5.0."""
    scores = np.array(scores)
    min_score = np.nanmin(scores)
    max_score = np.nanmax(scores)
    if max_score != min_score:
        normalized_scores = 0.5 + 4.5 * ((scores - min_score) / (max_score - min_score))
    else:
        normalized_scores = scores
    return normalized_scores

# Generate recommendations
def generate_recommendations(algo, df, user_id, n_recommendations, item_features, item_popularity, verbose=False):
    """Generate top N recommendations for a given user."""
    user_rated_items = df[df['userId'] == user_id]['movieId'].tolist()
    all_items = df['movieId'].unique()
    recommendations = []

    cf_scores = []
    cbf_scores = []
    items_to_recommend = []

    for item_id in all_items:
        if item_id not in user_rated_items:
            pred = algo.predict(user_id, item_id)
            cf_scores.append(pred.est)
            content_score = np.mean(list(item_features.get(item_id, {}).values()))
            cbf_scores.append(content_score)
            items_to_recommend.append(item_id)

    normalized_cbf_scores = normalize_scores(cbf_scores)

    for idx, item_id in enumerate(items_to_recommend):
        cf_score = cf_scores[idx]
        content_score = normalized_cbf_scores[idx]
        popularity_penalty = item_popularity.get(item_id, 0) / len(df['userId'].unique())
        hybrid_score = (CF_WEIGHT * cf_score + CBF_WEIGHT * content_score) - (popularity_penalty * 0.01)
        if verbose:
            print(f"User {user_id}, Item {item_id}: CF {cf_score:.2f}, CBF {content_score:.2f}, Penalty {popularity_penalty:.4f}, Hybrid {hybrid_score:.2f}")
        recommendations.append((item_id, hybrid_score))

    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:n_recommendations]

# Calculate item popularity
def calculate_item_popularity(df):
    """Calculate how often each item is rated."""
    item_popularity = df['movieId'].value_counts().to_dict()
    return item_popularity

# Expected recommendations for novelty
def get_expected_recommendations(df):
    """Get expected recommendations based on high ratings."""
    return df[df['rating'] >= RATING_THRESHOLD]['movieId'].unique()

# Calculate model metrics
def calculate_model_metrics(predictions):
    """Calculate various metrics to evaluate the model."""
    true_ratings = [pred.r_ui for pred in predictions]
    estimated_ratings = [pred.est for pred in predictions]
    rmse = np.sqrt(mean_squared_error(true_ratings, estimated_ratings))
    mae = mean_absolute_error(true_ratings, estimated_ratings)
    y_true = [1 if true_r >= RATING_THRESHOLD else 0 for true_r in true_ratings]
    y_pred = [1 if est >= RATING_THRESHOLD else 0 for est in estimated_ratings]
    precision = np.sum(np.array(y_true) & np.array(y_pred)) / np.sum(y_pred)
    recall = np.sum(np.array(y_true) & np.array(y_pred)) / np.sum(y_true)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {
        'RMSE': rmse,
        'MAE': mae,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

# Calculate hit rate
def calculate_hit_rate(recommendations, df, n_recommendations):
    """Calculate hit rate."""
    hits = 0
    for user, items in recommendations.items():
        user_rated_items = df[df['userId'] == user]['movieId'].tolist()
        hits += len(set(item_id for item_id, _ in items).intersection(set(user_rated_items)))
    return hits / (len(recommendations) * n_recommendations)

# Calculate mean reciprocal rank
def calculate_mrr(recommendations, df):
    """Calculate mean reciprocal rank."""
    rr_sum = 0
    for user, items in recommendations.items():
        user_rated_items = df[df['userId'] == user]['movieId'].tolist()
        for rank, (item, _) in enumerate(items, start=1):
            if item in user_rated_items:
                rr_sum += 1 / rank
                break
    return rr_sum / len(recommendations)

# Calculate average precision
def calculate_ap(recommendations, df):
    """Calculate average precision."""
    ap_sum = 0
    for user, items in recommendations.items():
        user_rated_items = df[df['userId'] == user]['movieId'].tolist()
        relevant_items = set(user_rated_items)
        hits = 0
        precision_sum = 0
        for rank, (item, _) in enumerate(items, start=1):
            if item in relevant_items:
                hits += 1
                precision_sum += hits / rank
        ap_sum += precision_sum / min(len(relevant_items), len(items))
    return ap_sum / len(recommendations)

# Calculate coverage
def calculate_coverage(recommendations, catalog_size):
    """Calculate the percentage of items in the catalog that have been recommended."""
    recommended_items = set()
    for user, items in recommendations.items():
        recommended_items.update([item_id for item_id, _ in items])
    return len(recommended_items) / catalog_size

# Calculate novelty
def calculate_novelty(recommendations, item_popularity):
    """Calculate the average popularity of recommended items."""
    novelty_scores = []
    for user, items in recommendations.items():
        for item_id, _ in items:
            novelty_scores.append(item_popularity.get(item_id, 0))
    return np.mean(novelty_scores)

# Calculate personalization
def calculate_personalization(recommendations):
    """Calculate how different the recommendations are for different users."""
    user_pairs = list(itertools.combinations(recommendations.keys(), 2))
    similarity_sum = 0
    for user1, user2 in user_pairs:
        items1 = {item_id for item_id, _ in recommendations[user1]}
        items2 = {item_id for item_id, _ in recommendations[user2]}
        similarity_sum += len(items1 & items2) / len(items1 | items2)
    return 1 - (similarity_sum / len(user_pairs))

# Calculate serendipity
def calculate_serendipity(recommendations, expected_recommendations, user_preferences, item_features):
    """Calculate the serendipity of the recommendations."""
    serendipity_scores = []
    for user, items in recommendations.items():
        user_prefs = user_preferences.get(user, {})
        for item_id, _ in items:
            if item_id not in expected_recommendations:
                item_genres = item_features.get(item_id, {})
                similarity = sum(user_prefs.get(f"user_mean_{genre}", 0) * item_genres.get(genre, 0) for genre in item_genres)
                serendipity_scores.append(1 - similarity)
    return np.mean(serendipity_scores) if serendipity_scores else 0

# Calculate intra-list diversity
def calculate_intra_list_similarity(recommendations, item_features):
    """Calculate the diversity within a single user's list of recommendations."""
    diversity_scores = []
    for user, items in recommendations.items():
        features = [list(item_features[item[0]].values()) for item in items if item[0] in item_features]
        if len(features) > 1:
            distances = pdist(features, 'cosine')
            diversity_scores.append(1 - np.mean(distances))
    return np.mean(diversity_scores) if diversity_scores else 0

# Display metrics
def display_metrics(recommendations, catalog_size, item_popularity, expected_recommendations, user_preferences, item_features, df, n_recommendations):
    """Display various metrics to evaluate the recommendations."""
    coverage = calculate_coverage(recommendations, catalog_size)
    novelty = calculate_novelty(recommendations, item_popularity)
    personalization = calculate_personalization(recommendations)
    serendipity = calculate_serendipity(recommendations, expected_recommendations, user_preferences, item_features)
    intra_list_diversity = calculate_intra_list_similarity(recommendations, item_features)
    hit_rate = calculate_hit_rate(recommendations, df, n_recommendations)
    mrr = calculate_mrr(recommendations, df)
    ap = calculate_ap(recommendations, df)

    print(f"Catalog Coverage: {coverage:.2%}")
    print(f"Average Novelty: {novelty:.4f}")
    print(f"Personalization: {personalization:.4f}")
    print(f"Serendipity: {serendipity:.4f}")
    print(f"Intra-list Diversity: {intra_list_diversity:.4f}")
    print(f"Hit Rate: {hit_rate:.4f}")
    print(f"Mean Reciprocal Rank (MRR): {mrr:.4f}")
    print(f"Average Precision (AP): {ap:.4f}")

# Function to calculate metrics on recommended movies
def calculate_recommendation_metrics(recommendations, df):
    """Calculate metrics (RMSE, MAE, Precision, Recall, F1) on the recommended movies."""
    y_true = []
    y_pred = []
    
    for user, items in recommendations.items():
        user_rated_items = df[df['userId'] == user][['movieId', 'rating']]
        for item_id, predicted_rating in items:
            actual_rating = user_rated_items[user_rated_items['movieId'] == item_id]['rating'].values
            if len(actual_rating) > 0:
                y_true.append(actual_rating[0])
                y_pred.append(predicted_rating)
    
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    precision = precision_score([1 if y >= RATING_THRESHOLD else 0 for y in y_true], [1 if y >= RATING_THRESHOLD else 0 for y in y_pred], zero_division=0)
    recall = recall_score([1 if y >= RATING_THRESHOLD else 0 for y in y_true], [1 if y >= RATING_THRESHOLD else 0 for y in y_pred], zero_division=0)
    f1 = f1_score([1 if y >= RATING_THRESHOLD else 0 for y in y_true], [1 if y >= RATING_THRESHOLD else 0 for y in y_pred], zero_division=0)
    
    return {
        'RMSE': rmse,
        'MAE': mae,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

def main():
    """Main function to load data and prepare for EDA."""
    merged_df, movies_df, ratings_df = load_data(MOVIES_FILE, RATINGS_FILE, user_sample_size=USER_SAMPLE_SIZE)
    if merged_df is not None:
        print("Data loaded successfully. You can now perform EDA on the loaded datasets.")
        return merged_df, movies_df, ratings_df
    else:
        print("Data loading or processing failed.")
        return None, None, None

if __name__ == "__main__":
    merged_df, movies_df, ratings_df = main()

    # Example EDA
    if merged_df is not None:
        print("\nMovies DataFrame:")
        print(movies_df.head())
        
        print("\nRatings DataFrame:")
        print(ratings_df.head())
        
        print("\nMerged DataFrame:")
        print(merged_df.head())
        
        # Example EDA: Distribution of ratings
        ratings_distribution = ratings_df['rating'].value_counts().sort_index()
        print("\nRatings Distribution:")
        print(ratings_distribution)
        
        # Example EDA: Number of ratings per user
        user_ratings_count = ratings_df['userId'].value_counts()
        print("\nNumber of Ratings per User:")
        print(user_ratings_count.describe())
        
        # Train the model and generate recommendations
        reader = Reader(rating_scale=(0.5, 5.0))
        data = Dataset.load_from_df(merged_df[['userId', 'movieId', 'rating']], reader)
        trainset, testset = train_test_split(data, test_size=0.2, random_state=RANDOM_SEED)

        # Hyperparameter tuning with GridSearchCV
        param_grid = {
            'n_factors': [50, 100],
            'n_epochs': [20, 30],
            'lr_all': [0.005, 0.01],
            'reg_all': [0.02, 0.1],
            'biased': [True, False]  # Adding the 'biased' parameter
        }
        gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=-1)  # Use all available CPU cores
        gs.fit(data)
        best_params = gs.best_params['rmse']
        print(f"Best parameters: {best_params}")

        # Train the best model
        algo = SVD(**best_params)
        algo.fit(trainset)

        # Evaluate the model
        predictions = algo.test(testset)
        metrics = calculate_model_metrics(predictions)
        print("Model Metrics:")
        for metric, value in metrics.items():
            print(f"{metric}: {value:.4f}")

        # Generate recommendations for all users
        item_features = get_item_features(merged_df)
        user_preferences = derive_user_preferences(merged_df)
        item_popularity = calculate_item_popularity(merged_df)
        recommendations = {user_id: generate_recommendations(algo, merged_df, user_id, N_RECOMMENDATIONS, item_features, item_popularity) for user_id in merged_df['userId'].unique()}
        expected_recommendations = get_expected_recommendations(merged_df)

        display_metrics(recommendations, len(movies_df), item_popularity, expected_recommendations, user_preferences, item_features, merged_df, N_RECOMMENDATIONS)

        # Calculate and display recommendation metrics
        recommendation_metrics = calculate_recommendation_metrics(recommendations, merged_df)
        print("Recommendation Metrics:")
        for metric, value in recommendation_metrics.items():
            print(f"{metric}: {value:.4f}")

        # Print recommendations for the first 10 users
        for user_id, recs in list(recommendations.items())[:10]:
            print(f"User {user_id} recommendations:")
            for movie_id, est_rating in recs:
                title = movies_df[movies_df['movieId'] == movie_id]['title'].values[0]
                print(f"  {title}: {est_rating:.2f}")


Explanation of Key Functions:

- load_data: Loads and merges the movie and rating datasets, extracting the release year if necessary.
- get_weighted_release_year: Computes the weighted release year by dividing the year by a divisor.
- get_item_features: Extracts item features from the DataFrame, ensuring the release year is included.
- derive_user_preferences: Calculates mean ratings for each genre and release year for each user, focusing only on numeric columns.
- generate_recommendations: Generates top N recommendations for a given user, excluding already rated items.
- calculate_item_popularity: Computes how often each item is recommended across all users.
- get_expected_recommendations: Identifies items that are expected to be recommended based on high ratings.
- calculate_model_metrics: Computes RMSE, MAE, precision, recall, and F1 score to evaluate the model.
- calculate_coverage: Determines the percentage of items in the catalog that have been recommended.
- calculate_novelty: Calculates the average popularity of recommended items.
- calculate_personalization: Measures how different the recommendations are for different users.
- calculate_serendipity: Evaluates the serendipity of the recommendations.
- calculate_intra_list_similarity: Assesses the diversity within a single user's list of recommendations.
- display_metrics: Displays various metrics to evaluate the recommendations.

In [None]:
#------------------------------------------------------------------------------------------------------

In [None]:
import pandas as pd
import numpy as np
import logging
import re
from collections import defaultdict
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse, mae
from sklearn.metrics import mean_squared_error, mean_absolute_error, precision_score, recall_score, f1_score
import random
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt
import seaborn as sns

# Use ggplto style for graphs
plt.style.use('ggplot')
sns.set_palette('colorblind')

# Set a random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Setup logging
logging.basicConfig(level=logging.INFO)



# Your function definitions here...

#if __name__ == "__main__":
#    main()


## Section 2: Compute and Plot Metrics at Different Values of K


In [None]:
def compute_metrics_for_top_k(predictions, k_values):
    metrics = {
        'k': [],
        'precision': [],
        'recall': [],
        'f1': [],
        'rmse': [],
        'mae': []
    }
    
    for k in k_values:
        top_k_preds = defaultdict(list)
        for uid, _, true_r, est, _ in predictions:
            top_k_preds[uid].append((est, true_r))
        
        for uid in top_k_preds:
            top_k_preds[uid].sort(reverse=True, key=lambda x: x[0])
            top_k_preds[uid] = top_k_preds[uid][:k]
        
        flat_preds = [pred for sublist in top_k_preds.values() for pred in sublist]
        estimated_ratings = [pred[0] for pred in flat_preds]
        true_ratings = [pred[1] for pred in flat_preds]
        
        rmse = np.sqrt(mean_squared_error(true_ratings, estimated_ratings))
        mae = mean_absolute_error(true_ratings, estimated_ratings)
        y_true = [1 if r >= THRESHOLD else 0 for r in true_ratings]
        y_pred = [1 if r >= THRESHOLD else 0 for r in estimated_ratings]
        precision = precision_score(y_true, y_pred, zero_division=0)
        recall = recall_score(y_true, y_pred, zero_division=0)
        f1 = f1_score(y_true, y_pred, zero_division=0)
        
        metrics['k'].append(k)
        metrics['precision'].append(precision)
        metrics['recall'].append(recall)
        metrics['f1'].append(f1)
        metrics['rmse'].append(rmse)
        metrics['mae'].append(mae)
    
    return metrics

def plot_metrics(metrics):
    plt.figure(figsize=(12, 8))
    
    plt.subplot(3, 2, 1)
    plt.plot(metrics['k'], metrics['precision'], marker='o', linestyle='-')
    plt.title('Precision at top k')
    plt.xlabel('k')
    plt.ylabel('Precision')
    
    plt.subplot(3, 2, 2)
    plt.plot(metrics['k'], metrics['recall'], marker='o', linestyle='-')
    plt.title('Recall at top k')
    plt.xlabel('k')
    plt.ylabel('Recall')
    
    plt.subplot(3, 2, 3)
    plt.plot(metrics['k'], metrics['f1'], marker='o', linestyle='-')
    plt.title('F1-Score at top k')
    plt.xlabel('k')
    plt.ylabel('F1 Score')
    
    plt.subplot(3, 2, 4)
    plt.plot(metrics['k'], metrics['rmse'], marker='o', linestyle='-')
    plt.title('RMSE at top k')
    plt.xlabel('k')
    plt.ylabel('RMSE')
    
    plt.subplot(3, 2, 5)
    plt.plot(metrics['k'], metrics['mae'], marker='o', linestyle='-')
    plt.title('MAE at top k')
    plt.xlabel('k')
    plt.ylabel('MAE')
    
    plt.tight_layout()
    plt.show()




## Section 3: Print Single Evaluation Metrics - Evaluation against Industry Avereges



Typical Accuracy Levels:

- General Accuracy: It's challenging to state a specific "average accuracy" because it depends highly on the context and the specific system configuration. However, good movie recommendation systems generally achieve:
    - RMSE: Values around 0.8 to 1.2 for rating predictions, with lower values indicating better accuracy.
    - Precision/Recall: Precision and recall can vary, but good systems might achieve over 20-30% precision in top-N recommendations in practical settings.
    - High-Performance Systems: In competitions like the Netflix Prize, the winning entries achieved RMSEs around 0.85, considered very high accuracy in a real-world system.

In [None]:
'''

# Ranges for industry standards (min and max)
industry_ranges = {
    'RMSE': (0.85, 0.95),       # Min and max RMSE in industry
    'MAE': (0.70, 0.75),         # Min and max MAE in industry
    'Precision': (0.70, 0.80),  # Min and max precision in industry
    'Recall': (0.10, 0.40),     # Min and max recall in industry
    'F1 Score': (0.30, 0.50),   # Min and max F1 score in industry
}

metrics = list(model_metrics.keys())
x = np.arange(len(metrics))  # label locations
bar_width = 0.35  # width of the bars

fig, ax = plt.subplots(figsize=(10, 6))

# Plotting bars for model metrics
ax.bar(x, [model_metrics[metric] for metric in metrics], width=bar_width, color='lightblue', label='Model Metrics')

# Calculate means and error margins for industry standards
industry_means = [(industry_ranges[metric][0] + industry_ranges[metric][1]) / 2 for metric in metrics]
industry_errors = [(industry_ranges[metric][1] - industry_ranges[metric][0]) / 2 for metric in metrics]

# Adding error bars to indicate the range of industry standards
ax.errorbar(x + bar_width / 2, industry_means, yerr=industry_errors, fmt='o', color='red', capsize=5, label='Industry Range')

# Adding labels, title, and custom x-axis tick labels
ax.set_ylabel('Values')
ax.set_title('Model Metrics vs Industry Ranges')
ax.set_xticks(x + bar_width / 2)
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)  # Setting the y-limit to encompass typical ranges for these metrics
ax.legend()

plt.tight_layout()
plt.show()



# After all the evaluations and plots, print the single run evaluation metrics
model_metrics = compute_model_metrics(predictions)
print("Single Evaluation Metrics:")
print(model_metrics)

'''


## Additional Metrics