# DSAIT4335 Recommender Systems
# Final Project

In this project, you will work to build different recommendation models and evaluate the effectiveness of these models through offline experiments. The dataset used for the experiments is **MovieLens100K**, a movie recommendation dataset collected by GroupLens: https://grouplens.org/datasets/movielens/100k/. For more details, check the project description on Brightspace.

# Instruction

MovieLens100K has already been split into an 80% training set and a 20% test set. Alongside these sets, movie metadata is provided as content information.

**Expected file structure** for this assignment:   
   
   ```
   RecSysProject/
   ├── training.txt
   ├── test.txt
   ├── movies.txt
   └── codes.ipynb
   ```

**Note:** Be sure to run all cells in each section sequentially, so that intermediate variables and packages are properly carried over to subsequent cells.

**Note:** Be sure to run all cells such that the submitted file contains the output of each cell.

**Note:** Feel free to add cells if you need more for answering a question.

**Submission:** Answer all the questions in this Jupyter notebook file and submit it (your answers included) to Brightspace. Rename the file to your group number, for example: group10 -> 10.ipynb.

# Setup

In [None]:
# %pip install transformers torch
# %pip install -r requirements.txt

In [None]:
import os.path
from typing import Any
from numpy import floating
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.dpi'] = 300 # for clearer plots in the notebook
plt.rcParams['savefig.dpi'] = 300

from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine, correlation
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from transformers import logging 
from recommendation_algorithms.hybrid_recommender import HybridRecommender
from recommendation_algorithms.matrix_factorization import MatrixFactorizationSGD
from recommendation_algorithms.bayesian_probabilistic_ranking import BayesianProbabilisticRanking
from recommendation_algorithms.item_knn import ItemKNN
from recommendation_algorithms.user_knn import UserKNN
from recommendation_algorithms.content_based import ContentBasedRecommender
from evaluation.grid_search import grid_search
from evaluation.score_prediction_metrics import MAE, MSE, RMSE 
logging.set_verbosity_error()
import re
import time, math
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(10)

print("Libraries imported successfully!")

# Load dataset

In [None]:
# loading the training set and test set
columns_name=['user_id','item_id','rating','timestamp']
train_data = pd.read_csv('data/training.txt', sep='\t', names=columns_name)
test_data = pd.read_csv('data/test.txt', sep='\t', names=columns_name)
display(train_data[['user_id','item_id','rating']].head())
print(f'The shape of the training data: {train_data.shape}')
print(f'The shape of the test data: {test_data.shape}')

movies = pd.read_csv('data/movies.txt',names=['item_id','title','genres','description'],sep='\t')
display(movies.head())

# Task 1) Implementation of different recommendation models as well as a hybrid model combining those recommendation models

<h3>Abstract Recommender</h3>

In [None]:
percentage = 0.25
movies_small = movies.iloc[0: int(percentage * len(movies))]
train_data_small = train_data[train_data["item_id"].isin(movies_small["item_id"])]
content = movies_small["title"] + movies_small["description"] + movies_small["description"]
content_full = movies["title"] + movies["description"] + movies["description"]
HYPERPARAMETER_TUNING_ON = False
RESTORE_STATES = True
# TODO insert Abstract Recommender

To facilitate the implementation of the hybrid recommender system, we created an abstract recommender class. Each of the recommendation algorithms implemented in this task, extends this abstract recommender class and implements a method to train the algorithm and predict a score for a user/item pair. Furthermore, the class provides functionality to save and load predictions from a csv file to facilitate evaluation.
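
As a rough illustration of this design, the sketch below shows what such a base class could look like. It is not the project's actual class: the name `AbstractRecommenderSketch`, the toy subclass `GlobalMeanSketch`, the `path` arguments, and the use of plain dictionaries and the `csv` module instead of DataFrames are all assumptions made to keep the example self-contained.

```python
import csv
from abc import ABC, abstractmethod


class AbstractRecommenderSketch(ABC):
    """Hypothetical stand-in for the project's abstract recommender class."""

    def __init__(self) -> None:
        # Maps (user_id, item_id) to the predicted score.
        self.predictions: dict[tuple[int, int], float] = {}

    @abstractmethod
    def train(self, ratings: list[tuple[int, int, float]]) -> None:
        """Fit the model on (user_id, item_id, rating) triples."""

    @abstractmethod
    def predict_score(self, user_id: int, item_id: int) -> float:
        """Return the predicted rating for one user/item pair."""

    def calculate_all_predictions(self, pairs: list[tuple[int, int]]) -> None:
        # Cache one prediction per requested user/item pair.
        self.predictions = {(u, i): self.predict_score(u, i) for u, i in pairs}

    def save_predictions_to_file(self, path: str) -> None:
        # Assumed CSV layout: user_id, item_id, predicted_score.
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["user_id", "item_id", "predicted_score"])
            for (u, i), score in self.predictions.items():
                writer.writerow([u, i, score])

    def load_predictions_from_file(self, path: str) -> None:
        with open(path, newline="") as f:
            self.predictions = {
                (int(row["user_id"]), int(row["item_id"])): float(row["predicted_score"])
                for row in csv.DictReader(f)
            }


class GlobalMeanSketch(AbstractRecommenderSketch):
    """Toy subclass: always predicts the mean training rating."""

    def train(self, ratings: list[tuple[int, int, float]]) -> None:
        self.mean_ = sum(r for _, _, r in ratings) / len(ratings)

    def predict_score(self, user_id: int, item_id: int) -> float:
        return self.mean_
```

Under these assumptions, every concrete algorithm only has to supply `train` and `predict_score`; caching and persistence come from the base class, which is what lets a hybrid model treat all recommenders uniformly.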

Below we list the implementation of each single recommendation algorithm and the tuning of hyperparameters on a small subset of the training data.

In [None]:
# TODO add grid search code

<i>Explain why we use this hyperparameter tuning approach</i>
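
For reference, an exhaustive grid search can be sketched as follows. This is a minimal illustration and not the project's `grid_search` function, whose implementation is not shown here: `grid_search_sketch`, the `evaluate` callback, and the toy objective are hypothetical names introduced for the example.

```python
from itertools import product
from typing import Any, Callable


def grid_search_sketch(
    grid: dict[str, list[Any]],
    evaluate: Callable[[dict[str, Any]], float],
) -> tuple[dict[str, Any], float]:
    """Try every combination in `grid` and return the one with the lowest score."""
    names = list(grid)
    best_params: dict[str, Any] = {}
    best_score = float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. RMSE on held-out ratings
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_rmse(params: dict[str, Any]) -> float:
    # Toy objective standing in for "train on the subset, return the RMSE".
    return abs(params["k"] - 11) / 10 + params["reg"]


best, score = grid_search_sketch({"k": [5, 9, 11, 15], "reg": [0.02, 0.2]}, toy_rmse)
print(best, score)
```

The cost grows with the product of the list lengths, which is one reason to run the search on a small subset of the training data.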

### Content-Based

In [None]:
# TODO Insert Content-Based recommender

#### Hyperparameter Tuning

In [None]:
hyperparameters_content_based = {
    "aggregation_method": ["average", "weighted_average", "avg_pos"],
    "bert_model": ['boltuix/bert-mini', 'distilbert-base-uncased'],
    "data": [train_data_small],
    "batch_size": [16],
    "content": [content]
}

if HYPERPARAMETER_TUNING_ON: 
    best_parameters_cb, params_cb = grid_search(hyperparameters_content_based, ContentBasedRecommender, train_data_small, RMSE)
else: 
    best_parameters_cb = {
        "aggregation_method": "avg_pos",
        "bert_model": 'boltuix/bert-mini',
        "data": train_data,
        "batch_size": 16,
        "content": content_full
    }

<i>Describe implementation and hyper parameters</i>

```
Best params metric 0.4182111704005534
Best params: [('aggregation_method', 'avg_pos'), ('bert_model', 'boltuix/bert-mini'), ('batch_size', 16)]
```

#### Train Best Model

In [None]:
content_based_best = ContentBasedRecommender(**best_parameters_cb)
# Restore cached results only when no new tuning was run; otherwise (re)train from scratch
if RESTORE_STATES and not HYPERPARAMETER_TUNING_ON and content_based_best.checkpoint_exists():
    content_based_best.load_predictions_from_file()
    content_based_best.load_all_rankings_from_file(train_data)
else:
    content_based_best.train(train_data)
    content_based_best.calculate_all_predictions(train_data)
    content_based_best.calculate_all_rankings(10, train_data)
    content_based_best.save_predictions_to_file()
    content_based_best.save_rankings_to_file()

### User-based Collaborative Filtering

In [None]:
# TODO insert User-KNN

In [None]:
hyperparameters_user_knn = {
    "k": [5, 7, 8, 9, 10, 11, 12, 13, 15]
}
if HYPERPARAMETER_TUNING_ON:
    u_knn = UserKNN(2)
    u_knn.train(train_data_small)  # fit before predicting, as in the Item-KNN cell
    u_knn.calculate_all_predictions(train_data_small)
    display(u_knn.predictions.head())
    u_knn.calculate_all_rankings(5, train_data_small)
    display(u_knn.get_ranking(1, 5))
    similarity_matrix = u_knn.similarity_matrix
    best_parameters_uknn, params_uknn = grid_search(hyperparameters_user_knn, UserKNN, train_data_small, RMSE, similarity_matrix=similarity_matrix)
else:
    best_parameters_uknn = {
        "k": 11
    }

<i>Describe implementation and hyper parameters</i>

```
Best params metric 0.45440612603397196
Best params: [('k', 11)]
```


#### Train Best Model

In [None]:
user_knn_best = UserKNN(**best_parameters_uknn)
# Restore cached results only when no new tuning was run; otherwise (re)train from scratch
if RESTORE_STATES and not HYPERPARAMETER_TUNING_ON and user_knn_best.checkpoint_exists():
    user_knn_best.load_predictions_from_file()
    user_knn_best.load_all_rankings_from_file(train_data)
else:
    user_knn_best.train(train_data)
    user_knn_best.calculate_all_predictions(train_data)
    user_knn_best.calculate_all_rankings(10, train_data)
    user_knn_best.save_predictions_to_file()
    user_knn_best.save_rankings_to_file()

### Item-based Collaborative Filtering

In [None]:
# TODO insert Item-KNN

#### Hyperparameter Tuning

In [None]:
hyperparameters_item_knn = {
    "k": [2, 3, 5, 7, 8, 9, 10, 11]
}
if HYPERPARAMETER_TUNING_ON:
    i_knn = ItemKNN(2)
    i_knn.train(train_data_small)
    i_knn.calculate_all_predictions(train_data_small)
    display(i_knn.predictions.head())
    i_knn.calculate_all_rankings(5, train_data_small)
    display(i_knn.get_ranking(1, 5))
    similarity_matrix = i_knn.similarity_matrix
    best_parameters_iknn, params_iknn = grid_search(hyperparameters_item_knn, ItemKNN, train_data_small, RMSE, similarity_matrix=similarity_matrix)
else:
    best_parameters_iknn = {
        "k": 11
    }

<i>Describe implementation and hyper parameters</i>

```
Best params metric 0.4124278800175163
Best params: [('k', 11)]
```


#### Train Best Model

In [None]:
item_knn_best = ItemKNN(**best_parameters_iknn)

# Restore cached results only when no new tuning was run; otherwise (re)train from scratch
if RESTORE_STATES and not HYPERPARAMETER_TUNING_ON and item_knn_best.checkpoint_exists():
    item_knn_best.load_predictions_from_file()
    item_knn_best.load_all_rankings_from_file(train_data)
else:
    item_knn_best.train(train_data)
    item_knn_best.calculate_all_predictions(train_data)
    item_knn_best.calculate_all_rankings(10, train_data)
    item_knn_best.save_predictions_to_file()
    item_knn_best.save_rankings_to_file()

### Matrix Factorization

In [None]:
# TODO insert Matrix factorization

In [None]:
hyperparameters_matrix_factorization = {
    'n_factors':[5, 10, 20, 25, 50], 
    'learning_rate':[0.001, 0.01, 0.05, 0.1], 
    'regularization':[0.002, 0.02, 0.2], 
    'n_epochs': [5, 20], 
    'use_bias':[True, False]
}

if HYPERPARAMETER_TUNING_ON:
    best_parameters_mf, params_mf = grid_search(hyperparameters_matrix_factorization, MatrixFactorizationSGD, train_data_small, RMSE)
else: 
    best_parameters_mf = {
        'n_factors':5, 
        'learning_rate': 0.001,  # matches the grid-search output reported below
        'regularization':0.2, 
        'n_epochs': 5, 
        'use_bias':True
    }

```
Best params: [('n_factors', 5), ('learning_rate', 0.001), ('regularization', 0.2), ('n_epochs', 5), ('use_bias', True)]
```

<i>Describe implementation and hyper parameters</i>

#### Train Best Model

In [None]:
mf_best = MatrixFactorizationSGD(**best_parameters_mf)
# Restore cached results only when no new tuning was run; otherwise (re)train from scratch
if RESTORE_STATES and not HYPERPARAMETER_TUNING_ON and mf_best.checkpoint_exists():
    mf_best.load_predictions_from_file()
    mf_best.load_all_rankings_from_file(train_data)
else:
    mf_best.train(train_data)
    mf_best.calculate_all_predictions(train_data)
    mf_best.calculate_all_rankings(10, train_data)
    mf_best.save_predictions_to_file()
    mf_best.save_rankings_to_file()

### Bayesian Personalized Ranking (BPR)

In [None]:
# TODO insert BPR

#### Hyperparameter Tuning

In [None]:
hyperparameters_bpr = {
    'n_factors':[5, 10, 20, 25, 50], 
    'learning_rate':[0.001, 0.01, 0.05, 0.1], 
    'regularization':[0.002, 0.02, 0.2], 
    'n_epochs': [5, 20], 
}
if HYPERPARAMETER_TUNING_ON:
    best_parameters_bpr, params_bpr = grid_search(hyperparameters_bpr, BayesianProbabilisticRanking, train_data_small, RMSE)
else: 
    best_parameters_bpr = {
        'n_factors':5, 
        'learning_rate': 0.01,  # matches the grid-search output reported below
        'regularization': 0.02,
        'n_epochs': 20, 
    }

```
Best params metric 0.4304978367972904
Best params: [('n_factors', 5), ('learning_rate', 0.01), ('regularization', 0.02), ('n_epochs', 20)]
```

<i>Describe implementation and hyper parameters</i>

#### Train Best Model

In [None]:
bpr_best = BayesianProbabilisticRanking(**best_parameters_bpr)
# Restore cached results only when no new tuning was run; otherwise (re)train from scratch
if RESTORE_STATES and not HYPERPARAMETER_TUNING_ON and bpr_best.checkpoint_exists():
    bpr_best.load_predictions_from_file()
    bpr_best.load_all_rankings_from_file(train_data)
else:
    bpr_best.train(train_data)
    bpr_best.calculate_all_predictions(train_data)
    bpr_best.calculate_all_rankings(10, train_data)
    bpr_best.save_predictions_to_file()
    bpr_best.save_rankings_to_file()

<h3>Hybrid Model</h3>

In [None]:
# TODO insert hybrid model class

The hybrid model combines the predictions of the models implemented above through a weighted sum. For the rating prediction task, the weights are found by minimizing an objective function, in our case the mean squared error (MSE). Minimizing the RMSE instead would yield the same weights, because the square root is monotonically increasing. For the minimization we use scipy's `minimize` function with the commonly used L-BFGS-B method.
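
The weight fitting for the rating task can be sketched as below. This is an illustration under stated assumptions, not the `HybridRecommender` implementation: the function name `fit_hybrid_weights`, the uniform starting point, the `[0, 1]` bounds on each weight, and the synthetic data are all choices made for the example.

```python
import numpy as np
from scipy.optimize import minimize


def fit_hybrid_weights(preds: np.ndarray, ratings: np.ndarray) -> np.ndarray:
    """Fit blend weights by minimizing MSE.

    preds has shape (n_samples, n_models); ratings has shape (n_samples,).
    """
    n_models = preds.shape[1]

    def mse(w: np.ndarray) -> float:
        return float(np.mean((preds @ w - ratings) ** 2))

    w0 = np.full(n_models, 1.0 / n_models)  # start from a uniform blend
    bounds = [(0.0, 1.0)] * n_models        # assumption: weights kept in [0, 1]
    result = minimize(mse, w0, method="L-BFGS-B", bounds=bounds)
    return result.x


# Synthetic data: two "models" that observe the true rating with different noise.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=200).astype(float)
preds = np.column_stack([
    ratings + rng.normal(0.0, 0.5, size=200),  # more accurate model
    ratings + rng.normal(0.0, 1.5, size=200),  # noisier model
])
weights = fit_hybrid_weights(preds, ratings)
print(weights)
```

On this synthetic data the more accurate model should receive the larger weight, which is the behaviour one would hope for from the fitted coefficients.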

For the ranking task we use a slightly different approach:
1. Assume we want a recommendation list of size K.
2. For each recommender, we predict a top-K list of item_ids with their predicted ratings.
3. Each predicted rating is multiplied by the (predefined) weight of the algorithm that produced it.
4. If an item is recommended by multiple algorithms, its weighted ratings are summed.
5. Finally, items are re-ranked by their combined score and the top K are taken as the new ranking.
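
The steps above can be sketched in a few lines. The function name `fuse_rankings` and the model names `"A"` and `"B"` are hypothetical; the input format (a list of `(item_id, score)` pairs per model) is an assumption for the example.

```python
from collections import defaultdict


def fuse_rankings(
    rankings: dict[str, list[tuple[int, float]]],
    weights: dict[str, float],
    k: int,
) -> list[tuple[int, float]]:
    """Weighted-sum fusion of per-model top-k lists into a single top-k list."""
    fused: dict[int, float] = defaultdict(float)
    for model_name, ranked_items in rankings.items():
        w = weights[model_name]
        for item_id, score in ranked_items[:k]:
            # Items recommended by several models accumulate their weighted scores.
            fused[item_id] += w * score
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)[:k]


rankings = {
    "A": [(1, 5.0), (2, 4.0)],
    "B": [(2, 5.0), (3, 3.0)],
}
weights = {"A": 0.5, "B": 0.5}
top = fuse_rankings(rankings, weights, k=2)
print(top)  # [(2, 4.5), (1, 2.5)]
```

Item 2 ends up first because both models recommend it, so its weighted scores add up (0.5 * 4.0 + 0.5 * 5.0 = 4.5).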

As mentioned in the steps above, the weights for the ranking task are predefined, unlike those for the rating prediction task. This is because, as mentioned in the lectures, ranking evaluation metrics such as NDCG and AP are non-smooth functions of the weights, so gradient-based optimizers cannot be applied to them directly. Smooth approximations of these metrics exist, but they are not always accurate. Therefore, we opted for manually finding near-optimal weights based on evaluation metrics (F1-score and NDCG) on a small subset of the training data, similar to the hyperparameter tuning.

In [None]:
# TODO remove all these imports when classes defined
from recommendation_algorithms.hybrid_recommender import HybridRecommender
from recommendation_algorithms.matrix_factorization import MatrixFactorizationSGD
from recommendation_algorithms.bayesian_probabilistic_ranking import BayesianProbabilisticRanking
from recommendation_algorithms.item_knn import ItemKNN
from recommendation_algorithms.user_knn import UserKNN


content_based = ContentBasedRecommender(**best_parameters_cb)
item_knn = ItemKNN(**best_parameters_iknn)
user_knn = UserKNN(**best_parameters_uknn)
matrix_factorization = MatrixFactorizationSGD(**best_parameters_mf)
bpr = BayesianProbabilisticRanking(**best_parameters_bpr)
rating_recommenders = [matrix_factorization, item_knn, user_knn]
ranking_recommenders = [matrix_factorization, bpr, item_knn, user_knn]
max_k = 10 # Recommendation list size
ranking_weights = {
    'Content Based Recommender': 0.2,
    'Matrix Factorization': 0.2,
    'Bayesian Probabilistic Ranking': 0.2,
    'Item KNN': 0.2,
    'User KNN': 0.2,
}
hybrid_recommender = HybridRecommender(train_data, rating_recommenders, ranking_recommenders, max_k, ranking_weights, True)

<h4>Ranking Weight Optimization</h4>

In [None]:
# TODO optimize ranking weights in terms of F1-score and NDCG (maybe pick one)

# TODO set ranking weights of hybrid model to optimized weights

<i>Discuss optimization approach (do not have to discuss the coefficients yet, that's a different task)</i>
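
One derivative-free option for a non-smooth metric is to score a coarse grid of weight vectors that sum to 1 and keep the best one. The sketch below illustrates that idea only; it is not the approach implemented in this notebook. `simplex_grid`, `search_ranking_weights`, and the `score_fn` callback (which would compute, for example, mean NDCG of the hybrid ranking for a given weight dictionary) are hypothetical names.

```python
from itertools import product
from typing import Callable


def simplex_grid(n_models: int, steps: int) -> list[tuple[float, ...]]:
    """All weight vectors with entries in {0, 1/steps, ..., 1} that sum to 1."""
    return [
        tuple(c / steps for c in counts)
        for counts in product(range(steps + 1), repeat=n_models)
        if sum(counts) == steps
    ]


def search_ranking_weights(
    model_names: list[str],
    score_fn: Callable[[dict[str, float]], float],
    steps: int = 5,
) -> tuple[dict[str, float], float]:
    """Return the grid point with the highest score (higher is better)."""
    best_weights: dict[str, float] = {}
    best_score = float("-inf")
    for vector in simplex_grid(len(model_names), steps):
        candidate = dict(zip(model_names, vector))
        score = score_fn(candidate)
        if score > best_score:
            best_weights, best_score = candidate, score
    return best_weights, best_score


def toy_score(w: dict[str, float]) -> float:
    # Toy stand-in for a ranking metric; peaks at MF=0.6, BPR=0.4.
    return -abs(w["MF"] - 0.6) - abs(w["BPR"] - 0.4)


found, found_score = search_ranking_weights(["MF", "BPR"], toy_score, steps=5)
print(found)
```

With five models and `steps=5` the grid has 126 points, so the search stays cheap as long as `score_fn` is evaluated on a small subset.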

# Task 2) Experiments for both rating prediction and ranking tasks, and conducting offline evaluation

In Task 2, we evaluate all individual models and the hybrid model on both the rating prediction and ranking tasks by calculating evaluation metrics (implemented below) on the test set.

## Code

In [None]:
## RATING TESTING
from recommendation_algorithms.matrix_factorization import MatrixFactorizationSGD

k=10

mf = MatrixFactorizationSGD()
mf.train(train_data)

# training data predictions
print('Getting ratings...')
mf.calculate_all_predictions(train_data)
print('Getting rankings...')
mf.calculate_all_rankings(k, train_data)

# mf.save_predictions_to_file()
# mf.save_rankings_to_file()

In [None]:
# Test data - rankings
def get_ranking_test_data(test_data: pd.DataFrame, k: int = 10) -> dict:
    """
    Create ground truth ranking series dict from test data for ranking evaluation.
    :param test_data: pd.DataFrame with columns=['user_id', 'item_id', 'rating']
    :param k: cut-off for ranking
    :return: dict where keys are user ids and values are pd.Series with index=item_id and values=rating
    """
    users = test_data['user_id'].unique().tolist()
    user_rankings = {
        user: test_data[test_data['user_id'] == user][['item_id', 'rating']]
        .sort_values(by='rating', ascending=False)
        .head(k)
        .set_index('item_id')['rating']
        for user in users
    }
    return user_rankings

# creates ranking series dict
user_rankings_test = get_ranking_test_data(test_data)


## Evaluation scripts

The evaluation scripts load from saved results to allow for batch processing of different models and baselines.

### Rating task

In [None]:
for recommender in rating_recommenders:
    recommender.calculate_rating_predictions_test_data(test_data)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300 # TODO - remove imports for final

prediction_filepaths = { # filepaths to the saved predictions from different models
    'MF': 'model_checkpoints/test/matrix_factorization/predictions.csv',
}

def load_rating_predictions(file_path: str) -> pd.DataFrame:
    """
    Load predictions from a CSV file.

    :param file_path: path to the CSV file - assumes saved with columns=['user_id', 'item_id', 'predicted_score']
    :return: pd.DataFrame with columns=['user_id', 'item_id', 'pred_rating']
    """
    predictions = pd.read_csv(file_path)
    # Only the score column needs renaming; user_id and item_id keep their names
    predictions = predictions.rename(columns={'predicted_score': 'pred_rating'})
    return predictions

def load_all_rating_predictions(filepaths: dict) -> dict:
    """
    Load predictions from multiple CSV files.

    :param filepaths: dictionary where keys are model names and values are file paths
    :return: dictionary where keys are model names and values are pd.DataFrames with predictions
    """
    all_predictions = {}
    for model_name, file_path in filepaths.items():
        all_predictions[model_name] = load_rating_predictions(file_path)
    return all_predictions


In [None]:
d = load_all_rating_predictions(prediction_filepaths)
d

In [None]:
# EVALUATION functions
from sklearn.metrics import root_mean_squared_error

def evaluate_rating(ground_truth: list[float], predictions: list[float]) -> float:
    """
    Evaluation function for one model for rating prediction task - RMSE. Takes two lists of rating values as input and returns the RMSE. Assumes that the two lists are aligned (i.e., the i-th element in each list corresponds to the same user-item pair).

    :param ground_truth: list of actual ratings
    :param predictions:  list of predicted ratings
    :return: RMSE between the two lists
    """
    return root_mean_squared_error(ground_truth, predictions)

def evaluate_rating_all(rating_prediction_dict: dict, test_data: pd.DataFrame) -> dict:
    """
    Evaluate a baseline or model against the test data for rating prediction task - RMSE for all models.
    :param rating_prediction_dict: dict of model/baseline predictions {model_name: pd.DataFrame} with columns=['user_id', 'item_id', 'pred_rating']
    :param test_data: pd.DataFrame with columns=['user_id', 'item_id', 'rating']
    :return: dict {model_name: RMSE}
    """
    res_dict = {}
    print('Evaluating rating predictions for all models...')
    for model_name, df in tqdm(rating_prediction_dict.items()):
        # Inner join keeps only user-item pairs present in both frames; dropna removes missing predictions
        df2 = df.merge(test_data[['user_id','item_id','rating']], on=['user_id','item_id']).dropna() # TODO - there is a nan for some reason
        rmse = evaluate_rating(df2['rating'].tolist(), df2['pred_rating'].tolist())
        print(f'- {model_name}: RMSE = {rmse:.4f}')
        res_dict[model_name] = rmse
    # TODO - save

    return res_dict

res_dict = evaluate_rating_all(d, test_data)
res_dict

In [None]:
# VISUALISATION

# results_rating = { # debugging data
#     'content-based' : 1.2,
#     'user-based CF' : 1.5,
#     'item-based CF' : 1.3,
#     'matrix factorisation' : 0.9,
#     'hybrid' : 0.8,
# }

def plot_rating_results(results: dict):
    """
    Plot RMSE results for different recommendation models.

    :param results: dictionary where keys are model names and values are RMSE scores
    """
    models = list(results.keys())
    rmse_scores = list(results.values())

    plt.figure(figsize=(10, 6))
    sns.barplot(x=models, y=rmse_scores)
    plt.title('RMSE of Different Recommendation Models')
    plt.xlabel('Recommendation Model')
    plt.ylabel('RMSE')
    plt.ylim(0, max(rmse_scores) + 1)
    plt.xticks(rotation=45)  # readability
    plt.show()

plot_rating_results(res_dict)

<i>Discuss rating results</i>

### Ranking task

In [None]:
for recommender in ranking_recommenders:
    recommender.calculate_ranking_predictions_test_data(test_data, max_k)

In [None]:
# Loading data in
import json
import os

max_k = 10

ranking_prediction_filepaths = { # filepaths to the saved predictions from different models
    'MF': 'model_checkpoints/matrix_factorization/rankings/',
    # TODO add other checkpoint paths
}

def load_model_ranking_predictions(folder_path: str) -> dict:
    """
    Load ranking predictions for all users from per-user CSV files.

    :param folder_path: path to the folder containing the rankings and mapping file
    :return: dictionary where keys are user IDs and values are ordered pd.Series with index=item_id and values=predicted_score
    """
    # The mapping file maps each user id to the CSV file holding that user's ranking
    with open(os.path.join(folder_path, 'user_ranking_file_map.json'), 'r') as f:
        mapping_file = json.load(f)
    user_dict = {}

    for user_id, file in mapping_file.items():
        predictions = pd.read_csv(file)
        p = predictions.set_index('item_id')['predicted_score']
        user_dict[int(user_id)] = p

    return user_dict

def load_all_ranking_predictions(filepaths: dict) -> dict:
    """
    Load ranking predictions from multiple CSV files.
    :param filepaths: dictionary where keys are model names and values are folder paths
    :return: dictionary where keys are model names and values are dictionaries { user_id: pd.Series with ranking predictions }
    """
    all_predictions = {}
    for i, d in filepaths.items():
        all_predictions[i] = load_model_ranking_predictions(d)
    return all_predictions

ranking_predictions = load_all_ranking_predictions(ranking_prediction_filepaths)

In [None]:
# Evaluation ranking task

def ndcg(ground_truth: list, rec_list: list, k: int = 10) -> float:
    """
    Calculate Normalized Discounted Cumulative Gain (NDCG) for a single user, using binary relevance.
    :param ground_truth: list of relevant item ids
    :param rec_list: ranked list of recommended item ids (best first)
    :param k: cut off for NDCG calculation
    :return: NDCG@k in [0, 1]; 0.0 if there are no relevant items or no recommendations
    """
    if k > len(rec_list):
        k = len(rec_list)
    dcg = 0.0
    for i in range(k):
        numerator = 1 if rec_list[i] in ground_truth else 0
        denominator = np.log2(i + 2)
        dcg += numerator / denominator
    ideal_len = min(k, len(ground_truth))
    if ideal_len == 0:
        return 0.0
    else:
        IDCG = sum(1.0 / np.log2(i + 2) for i in range(ideal_len))
        return dcg / IDCG


def evaluate_ranking(ground_truth: list[pd.Series], rec_list: list[pd.Series], k=10) -> tuple[
    floating[Any], floating[Any], floating[Any]]:
    """
    Calculate Precision, Recall, and NDCG for ranking task.

    Assume that items in ground_truth are relevant (rel = 1) and all other items are non-relevant (rel = 0).

    :param ground_truth: list of pd.Series of relevant items per user - index=item_ids, values=rating
    :param rec_list: list of pd.Series of recommended top-k items per user, best first - index=item_ids, values=rating
    :param k: cut-off for ndcg (may change to be for P and R as well) - TODO
    :return: mean precision (%), mean recall (%), and mean NDCG over all users
    """
    # Compute Precision & Recall
    gt_items = [set(gt.index.values) for gt in ground_truth]
    rec_ranked = [list(rl.index.values) for rl in rec_list]  # keep rank order for NDCG
    rec_items = [set(rl) for rl in rec_ranked]
    len_intersections = np.array([len(gt.intersection(rl)) for rl, gt in zip(rec_items, gt_items)])
    len_rls = np.array([len(rl) for rl in rec_items])
    len_gts = np.array([len(gt) for gt in gt_items])

    p = np.nanmean(100 * len_intersections / len_rls)  # precision
    r = np.nanmean(100 * len_intersections / len_gts)  # recall

    # Compute NDCG on the ordered lists; converting to a set first would discard the ranking
    ndcgs = [ndcg(list(gt), rl, k) for rl, gt in zip(rec_ranked, gt_items)]
    ndcg_mean = np.nanmean(ndcgs)

    return p, r, ndcg_mean

def evaluate_ranking_all(prediction_dict: dict, test_data: dict, k: int = 10, save_path: str = None) -> dict:
    """
    Evaluate a baseline or model against the test data for ranking task - Precision, Recall, NDCG for all models.
    :param prediction_dict: { model : { user_id: pd.Series with ranking predictions } }
    :param test_data: { user_id: pd.Series with ground truth ratings }
    :param k: cut-off; each user's k highest-scored predicted items are evaluated
    :param save_path: full file path to save results to, if any
    :return: dict { model_name: [precision, recall, ndcg] }
    """
    results = {}
    users = test_data.keys()
    print('Evaluating ranking predictions for all models...')

    for model_name, user_predictions in tqdm(prediction_dict.items()):
        ground_truth = []
        rec_list = []
        for user in users:
            if user in user_predictions:
                ground_truth.append(test_data[user])
                rec_list.append(user_predictions[user].nlargest(k))
        precision, recall, ndcg_mean = evaluate_ranking(ground_truth, rec_list, k)
        results[model_name] = [precision, recall, ndcg_mean]
        print(f'- {model_name}: Precision = {precision:.2f}%, Recall = {recall:.2f}%, NDCG = {ndcg_mean:.4f}')

    if save_path:
        df = pd.DataFrame.from_dict(results, orient='index', columns=['precision', 'recall', 'ndcg']).reset_index(names='model')
        df.to_csv(save_path, index=False)
    return results
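
As a small worked example of the binary-relevance NDCG used here, the sketch below evaluates two orderings of the same recommended items. The function name `ndcg_at_k` and the item ids are made up for the illustration; the formula mirrors the one in `ndcg` (gain 1 for a relevant item, discount `log2(position + 1)` with positions starting at 1).

```python
import math


def ndcg_at_k(relevant: set[int], ranked: list[int], k: int) -> float:
    """Binary-relevance NDCG@k for one user; `ranked` must be ordered best first."""
    ranked = ranked[:k]
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so the first item is divided by log2(2) = 1
        for pos, item in enumerate(ranked)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    if ideal_hits == 0:
        return 0.0
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg


relevant = {1, 2, 3}
hits_first = ndcg_at_k(relevant, [2, 3, 5], k=3)  # relevant items at ranks 1 and 2
hits_last = ndcg_at_k(relevant, [5, 3, 2], k=3)   # same items, relevant ones at ranks 2 and 3
print(round(hits_first, 3), round(hits_last, 3))
```

For the first list, DCG = 1 + 1/log2(3) ≈ 1.631 and IDCG = 1 + 1/log2(3) + 1/2 ≈ 2.131, giving an NDCG of about 0.765; the second list contains the same items but scores about 0.531. Because the metric is position-sensitive, recommendations have to be passed in ranked order.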

In [None]:
# DEBUG - example use

# ground_truth = [[1, 2, 3], [2, 3, 4], [1, 4]]
# rec_list = [[2, 3, 5], [1, 2, 3], [4, 5, 6]]
#
# ground_truth = [pd.Series(np.ones(len(gt)), index=gt) for gt in ground_truth]
# rec_list = [pd.Series(np.ones(len(rl)), index=rl) for rl in rec_list]
#
# precision, recall, ndcg_mean = evaluate_ranking(ground_truth, rec_list, k=3)
# print(f'Precision: {precision:.2f}%, Recall: {recall:.2f}%, NDCG: {ndcg_mean:.4f}')
#
# models = {'m1' : {'u1' : rec_list[0], 'u2' : rec_list[1], 'u3' : rec_list[2]}}
# test = {'u1' : ground_truth[0], 'u2' : ground_truth[1], 'u3' : ground_truth[2]}
#
# results = evaluate_ranking_all(models, test, k=3)

results = evaluate_ranking_all(ranking_predictions, user_rankings_test, k=max_k)

accuracy_metrics_df = pd.DataFrame.from_dict(results, orient='index', columns=['precision', 'recall', 'ndcg']).rename_axis('model')


# adding rmse
accuracy_metrics_df['rmse'] = accuracy_metrics_df.index.map(res_dict)

#save to csv
# accuracy_metrics_df.to_csv('accuracy_metrics_df.csv')
display(accuracy_metrics_df)

In [None]:
# VISUALISATION

# results_ranking = {  # [precision, recall, ndcg] -- DEBUG DATA
#     'content-based' : [20.0, 15.0, 0.1],
#     'user-based CF' : [10.0, 20.0, 0.6],
#     'item-based CF' : [05.0, 45.0, 0.8],
#     'matrix factorisation' : [30.0, 35.0, 0.4],
#     'hybrid' : [20.0, 40.0, 0.2],
# }


def visualise_ranking_results(results: dict, tight: bool = False):
    """
    Plot Precision and Recall, and NDCG results for different recommendation models.

    :param results: dictionary where keys are model names and values are lists of [precision, recall, ndcg]
    :param tight: whether to display the two plots (Precision & Recall, NDCG) side by side
    """
    df = pd.DataFrame.from_dict(results, orient='index', columns=['precision', 'recall', 'ndcg']).reset_index().rename(columns={'index': 'model'})
    df_melt = df.melt(id_vars='model', value_vars=['precision', 'recall'], var_name='metric', value_name='value')

    if not tight:
        plt.figure(figsize=(10, 6))
        sns.barplot(data=df_melt, x='model', y='value', hue='metric', palette=['tab:blue', 'tab:orange'], errorbar=None)
        plt.title('Precision and Recall of Different Recommendation Models')
        plt.xlabel('Recommendation Model')
        plt.ylabel('%')
        plt.xticks(rotation=45)  # readability
        plt.show()

        plt.figure(figsize=(10, 6))
        sns.barplot(data=df, x='model', y='ndcg', errorbar=None)
        plt.title('NDCG of Different Recommendation Models')
        plt.xlabel('Recommendation Model')
        plt.ylabel('NDCG')
        plt.xticks(rotation=45)

    else:
        fig, axes = plt.subplots(1, 2, figsize=(14, 6))
        # Left - grouped Precision & Recall
        sns.barplot(data=df_melt, x='model', y='value', hue='metric',
                    palette=['tab:blue', 'tab:orange'], errorbar=None, ax=axes[0])
        axes[0].set_title('Precision and Recall of Different Recommendation Models')
        axes[0].set_xlabel('Recommendation Model')
        axes[0].set_ylabel('%')
        axes[0].set_ylim(0, df_melt['value'].max() + 5)
        axes[0].tick_params(axis='x', rotation=45)
        axes[0].legend(title=None)

        # Right - NDCG
        sns.barplot(data=df, x='model', y='ndcg', errorbar=None, ax=axes[1])
        axes[1].set_title('NDCG of Different Recommendation Models')
        axes[1].set_xlabel('Recommendation Model')
        axes[1].set_ylabel('NDCG')
        axes[1].tick_params(axis='x', rotation=45)

        plt.tight_layout()
    plt.show()

visualise_ranking_results(results, tight=True)

<i>Discuss ranking results</i>

# Task 3) Implement baselines for both rating prediction and ranking tasks, and perform experiments with those baselines

## Code

<h3>Rating Baselines</h3>

In [None]:
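class AverageRater(AbstractRecommender):
    train_data: pd.DataFrame
    item_means: pd.Series
    global_mean: float

    def __init__(self, train_data: pd.DataFrame):
        self.train(train_data)

    def train(self, train_data: pd.DataFrame) -> None:
        # Precompute the mean rating per item once, plus a global fallback
        self.train_data = train_data
        self.item_means = train_data.groupby('item_id')['rating'].mean()
        self.global_mean = float(train_data['rating'].mean())

    def get_name(self) -> str:
        return "Average Item Rating Recommender"

    def predict_score(self, user_id: int, item_id: int) -> float:
        # Mean rating of the item; items without training ratings fall back to the global mean instead of NaN
        return float(self.item_means.get(item_id, self.global_mean))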

average_rater = AverageRater(train_data_small)
average_rater.predict_score(1, 2)

In [None]:
# TODO insert mean hybrid rater
# Easiest to just create new hybrid model instance, train, and set rating_weights to 1/len(rating_recommenders)

<h3>Ranking Baselines</h3>

In [None]:
class RandomRanker(AbstractRecommender):
    unseen_items: dict[int, list[int]] # For each user keep track of unseen items

    def __init__(self, train_data: pd.DataFrame):
        self.unseen_items = {}
        self.train(train_data)

    def get_name(self) -> str:
        return "Random Ranker"
    
    def train(self, train_data: pd.DataFrame) -> None:
        # Find unseen items for each user
        user_ids = train_data['user_id'].unique()
        item_ids = train_data['item_id'].unique()
        for user_id in user_ids:
            seen_items = set(train_data.loc[(train_data['user_id'] == user_id), 'item_id'])  # set for O(1) membership tests
            unseen_items_for_user = [item_id for item_id in item_ids if item_id not in seen_items]
            self.unseen_items[user_id] = unseen_items_for_user

    def predict_score(self, user_id: int, item_id: int) -> float:
        return np.random.uniform(0, 5)
    
    def calculate_all_rankings(self, k: int, train_data: pd.DataFrame) -> None:
        self.rankings = {}
        for user_id in train_data['user_id'].unique():
            unseen_items = self.unseen_items[user_id]
            items_with_scores = [(item_id, self.predict_score(user_id, item_id)) for item_id in unseen_items]
            sorted_items = sorted(items_with_scores, key= lambda x : x[1], reverse=True)[:k]
            self.rankings[user_id] = sorted_items

random_ranker = RandomRanker(train_data_small)
random_ranker.calculate_all_rankings(5, train_data_small)
random_ranker.get_ranking(1, 5)

In [None]:
class PopularRanker(AbstractRecommender):
    unseen_items: dict[int, list[int]] # For each user keep track of unseen items
    popularities: dict[int, int] # For each item keep track of the number of ratings

    def __init__(self, train_data: pd.DataFrame):
        self.unseen_items = {}
        self.popularities = {}
        self.train(train_data)

    def get_name(self) -> str:
        return "Popularity Based Ranker"

    def train(self, train_data: pd.DataFrame) -> None:
        # Find unseen items for each user
        user_ids = train_data['user_id'].unique()
        item_ids = train_data['item_id'].unique()
        for user_id in user_ids:
            seen_items = train_data.loc[(train_data['user_id'] == user_id), 'item_id'].unique()
            unseen_items_for_user = [item_id for item_id in item_ids if item_id not in seen_items]
            self.unseen_items[user_id] = unseen_items_for_user
        
        # Find popularity of each item (amount of ratings)
        for item_id in item_ids:
            user_ratings = train_data.loc[
                (train_data['item_id'] == item_id),
                'user_id'
            ].unique()
            self.popularities[item_id] = len(user_ratings)

    def predict_score(self, user_id: int, item_id: int) -> float:
        raise ValueError("Predicting score not implemented for ranker")

    def predict_ranking(self, user_id: int, k: int) -> List[tuple[int, float]]:
        # Recommend the most popular items the target user has not interacted with yet.
        # Popularity is the number of distinct users who rated the item in the training data.
        unseen_items = self.unseen_items[user_id]
        max_popularity = max(self.popularities.values())  # Computed once instead of once per item
        def normalize_popularity(popularity: int) -> float:
            return popularity / max_popularity * 5.0  # Scale to (0, 5] so scores are comparable to ratings
        items_with_popularity = [(item_id, normalize_popularity(self.popularities[item_id])) for item_id in unseen_items]
        sorted_items = sorted(items_with_popularity, key=lambda x: x[1], reverse=True)
        return sorted_items[:k]
    
    def calculate_all_rankings(self, k: int, train_data: pd.DataFrame) -> None:
        self.rankings = {}
        for user_id in train_data['user_id'].unique():
            ranking = self.predict_ranking(user_id, k)
            self.rankings[user_id] = ranking

popular_ranker = PopularRanker(train_data_small)
popular_ranker.calculate_all_rankings(5, train_data_small)
popular_ranker.get_ranking(1, 5)

In [None]:
# TODO insert mean hybrid ranker
# Easiest to just create new hybrid model instance, train, and set ranking_weights to 1/len(ranking_recommenders)

## Evaluation

The evaluation functions defined in Task 2 can be reused to evaluate the baselines, all in one batch. The available functions are (bodies elided):

```python
def evaluate_rating_all(rating_prediction_dict: dict, test_data: pd.DataFrame) -> dict:
    # Takes {model_name: pd.DataFrame} with columns=['user_id', 'item_id', 'pred_rating']
    ...

def evaluate_ranking_all(prediction_dict: dict, test_data: dict, k=10, save_path: str = None) -> dict:
    # Takes { model : { user_id: pd.Series(index=item_id, values=predicted_rating) } }
    ...
```

<h3>Rating Baselines</h3>

In [None]:
# TODO evaluate rating all

<i>Discuss rating results for baselines, compare to other models</i>

<h3>Ranking Baselines</h3>

In [None]:
# TODO evaluate ranking all

<i>Discuss ranking results for baselines, compare to other models</i>

# Task 4) Analysis of recommendation models. Analyzing the coefficients of hybrid model and the success of recommendation models for different users' groups. 

<i>Analyze the coefficients of the regression model (hybrid model) for both the rating prediction and ranking tasks -> Which models contribute the most to the prediction?</i>

<i>Where is each recommendation model successful in delivering accurate recommendations? -> For which user groups does each recommendation model achieve the highest accuracy?</i>

# Task 5) Evaluation of beyond accuracy

_Discuss your observations comparing the models in terms of both accuracy and non-accuracy metrics_

Apart from solely evaluating the models on accuracy metrics, we also look at the following non-accuracy metrics:
- Diversity (intra-list diversity)
- Novelty (surprisal)
- Calibration
- A number of fairness metrics (user- and item-side)

These metrics are first implemented below in sections 10.1-10.4, then computed and analysed in section 10.5.

## Diversity - ILD

Diversity measures how different the items in a recommendation list are from each other. A diverse recommendation list is desirable as it exposes users to a wider range of items, potentially increasing user satisfaction and engagement. In our implementation, we use intra-list diversity (ILD) as the diversity metric. We take the Jaccard distance between the items' genres as the distance function, where a higher value indicates more difference between the genres. The formula for ILD is as follows:

$$
ILD(L) = \frac{1}{|L|(|L|-1)} \sum_{i,j \in L,\, i \neq j}dist(i,j)
$$
where:
- $dist(i,j) = 1 - \frac{|G_i \cap G_j|}{|G_i \cup G_j|}$ - distance function of how different $i$ and $j$ are - Jaccard distance between the genre sets $G_i$ and $G_j$ of items $i$ and $j$.
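
As a small worked illustration of the formula, the sketch below computes ILD for a toy list of three items. The item ids and genre sets are made up for the example (they are not taken from MovieLens), and `jaccard_distance` and `ild` are hypothetical helper names, not the functions used in the rest of this notebook.

```python
from itertools import permutations

def jaccard_distance(genres_a: set, genres_b: set) -> float:
    """Jaccard distance between two genre sets (0 = identical, 1 = disjoint)."""
    union = genres_a | genres_b
    if not union:
        return 0.0
    return 1 - len(genres_a & genres_b) / len(union)

def ild(item_genres: dict) -> float:
    """Average pairwise distance over all ordered pairs of distinct items."""
    items = list(item_genres)
    n = len(items)
    if n <= 1:
        return 0.0
    total = sum(
        jaccard_distance(item_genres[i], item_genres[j])
        for i, j in permutations(items, 2)
    )
    return total / (n * (n - 1))

# Toy recommendation list: item id -> genre set (assumed values)
toy_list = {
    101: {"Action", "Thriller"},
    102: {"Action"},
    103: {"Comedy", "Romance"},
}
toy_ild = ild(toy_list)
print(f"ILD of the toy list: {toy_ild:.4f}")
```

Items 101 and 102 share one of two genres (distance 0.5), while item 103 shares nothing with either (distance 1), so the list scores fairly high on diversity.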

In [None]:
def diversity(rec_list: pd.Series, dist_func, movies: pd.DataFrame) -> float:
    """
    Calculate intra-list diversity (ILD) for a given recommendation list using a specified distance function.
    :param rec_list: top-k recommendations, index=item ids
    :param dist_func: function taking two item ids and movie data, and returning a distance value
    :param movies: movie data, passed through to dist_func
    :return: ILD score (0.0 for lists with fewer than two items)
    """
    if len(rec_list) <= 1:
        return 0.0
    items = rec_list.index.to_list()
    L = len(items)
    # Self-pairs (i == j) are skipped: their distance is 0 and the normaliser L*(L-1) counts only distinct pairs
    total_dist = np.sum([dist_func(i, j, movies) for i in items for j in items if i != j])
    return total_dist / (L * (L - 1))


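def genre_distance(item1, item2, movies):
    """
    Genre distance using Jaccard distance.
    :param item1: item id 1
    :param item2: item id 2
    :param movies: movie data; rows are looked up by label, so it is expected to be indexed by item id
    :return: Jaccard distance in [0, 1] between the two genre sets
    """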
    i1_genres = set(movies.at[item1, 'genres'].split(','))
    i2_genres = set(movies.at[item2, 'genres'].split(','))
    intersection = len(i1_genres.intersection(i2_genres))
    union = len(i1_genres.union(i2_genres))
    if union == 0:
        return 0.0
    return 1 - intersection / union

def avg_diversity(ranking_predictions: dict, movies: pd.DataFrame, dist_func) -> tuple[floating, np.ndarray]:
    """
    Calculate average diversity for all users of one model and return this along with the per-user results.
    :param ranking_predictions: { user_id: pd.Series with ranking predictions }
    :param movies: movie data
    :param dist_func: function taking two item ids and movie data, and returning a distance value
    :return: (mean diversity, distribution of diversity scores)
    """
    results = np.array([diversity(ranking, dist_func, movies) for u, ranking in ranking_predictions.items()])
    return np.mean(results), results

def diversity_all(ranking_prediction_dict: dict, movies: pd.DataFrame, dist_func) -> dict:
    """
    Calculate diversity for all models in ranking prediction dict.
    :param ranking_prediction_dict: { model : { user_id: pd.Series with ranking predictions } }
    :param movies: movie data
    :param dist_func: function taking two item ids and movie data, and returning a distance value
    :return: { model : average diversity score, model + '_distribution' : per-user diversity scores }
    """
    res_dict = {}
    print('Calculating diversity for all models...')
    for model_name, user_rankings in tqdm(ranking_prediction_dict.items()):
        avg_div, distribution = avg_diversity(user_rankings, movies, dist_func)
        print(f'- {model_name}: Diversity = {avg_div:.4f}')
        res_dict[model_name] = avg_div
        res_dict[model_name+'_distribution'] = distribution
    return res_dict

div = diversity(ranking_predictions['MF'][1], genre_distance, movies)

diversities = diversity_all(ranking_predictions, movies, genre_distance)
diversities.keys()
# TODO - code for running on results of all models

## Novelty - surprisal

Novelty aims to measure how “novel” or “unexpected” the recommended items are to the user. A novel recommendation list is desirable as it can help users discover new items they might not have found otherwise, potentially increasing user satisfaction and engagement. In our implementation, we use self-information (surprisal) as the novelty metric. The formula for novelty is as follows:

$$
novelty(i) = -\log_{2} pop(i)
$$
where:
- $pop(i) = \frac{\text{no. interactions on }i}{\text{total no. interactions}}$ - popularity of item $i$ - fraction of all interactions that involve item $i$
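
To make the formula concrete, the sketch below computes the surprisal of each item in a tiny, made-up interaction log and then averages it over a two-item recommendation list, once uniformly and once with a DCG-style rank discount. The data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy interaction log (assumed values): item 10 is popular, items 20 and 30 are rarer
toy_train = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3, 4],
    "item_id": [10, 20, 10, 30, 10, 20, 30, 10],
})

# pop(i): share of all interactions that involve item i
pop = toy_train["item_id"].value_counts(normalize=True)
# novelty(i) = -log2 pop(i), measured in bits
surprisal = -np.log2(pop)

# Recommendation list in rank order: the popular item first, a rarer one second
rec_items = [10, 30]
s = surprisal.loc[rec_items].to_numpy()

# Uniform average over the list
uniform_novelty = s.mean()

# Rank-discounted average: weight 1 / log2(rank + 1), normalised to sum to 1
discount = 1 / np.log2(np.arange(1, len(rec_items) + 1) + 1)
log_novelty = np.sum(discount / discount.sum() * s)

print(f"uniform: {uniform_novelty:.4f}, rank-discounted: {log_novelty:.4f}")
```

Item 10 accounts for half of all interactions (1 bit of surprisal), while item 30 accounts for a quarter (2 bits). Because the less novel item is ranked first, the rank-discounted score is lower than the uniform one.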

In [None]:
def popularity_matrix(train_data: pd.DataFrame) -> pd.Series:
    """
    Calculate the popularity of each item in the training data.
    :param train_data: training data
    :return: pd.Series with item ids as index and popularity as values
    """
    total_interactions = len(train_data)
    counts = train_data['item_id'].value_counts()
    popularity = counts / total_interactions
    return popularity

def novelty(rec_list: pd.Series, train_data: pd.DataFrame, weighting_scheme: str = 'uniform') -> float:
    """
    Calculate the novelty / surprisal of the items in a recommendation list
    :param rec_list: pd.Series of recommendations, index=item ids, values=predicted ratings
    :param train_data: training data
    :param weighting_scheme: 'uniform' or 'log' - how to weight the novelty of items in the list
    :return: novelty score
    """

    popularity = popularity_matrix(train_data)
    surprisal = -np.log2(popularity)

    # Find the weightings for the averaging
    if weighting_scheme == 'uniform':
        weights = np.ones(len(rec_list)) / len(rec_list)
    elif weighting_scheme == 'log':
        ranks = np.arange(1, len(rec_list) + 1)
        weights = 1 / np.log2(ranks + 1)  # DCG-style rank discount: rank 1 gets weight 1
        weights /= np.sum(weights)
    else:
        raise ValueError("weighting_scheme must be 'uniform' or 'log'")

    # Items must appear in train_data; an unseen item id raises a KeyError here
    surprisals = surprisal.loc[rec_list.index.tolist()].to_numpy()
    novelty_score = np.sum(weights * surprisals)
    return novelty_score

def avg_novelty(ranking_predictions: dict, train_data: pd.DataFrame, weighting: str = 'uniform') -> tuple[floating, np.ndarray]:
    """
    Calculate average novelty for all users of one model and return this along with the per-user results.
    :param weighting: how to average the novelty scores - 'uniform' or 'log'
    :param ranking_predictions: { user_id: pd.Series with ranking predictions }
    :param train_data: training data
    :return: (mean novelty, distribution of novelty scores)
    """
    results = np.array([novelty(ranking, train_data, weighting) for u, ranking in ranking_predictions.items()])
    return np.mean(results), results

def novelty_all(ranking_prediction_dict: dict, train_data: pd.DataFrame, weighting) -> dict:
    """
    Calculate novelty for all models in ranking prediction dict.
    :param weighting: how to average the novelty scores - 'uniform' or 'log'
    :param ranking_prediction_dict: { model : { user_id: pd.Series with ranking predictions } }
    :param train_data: training data
    :return: { model : average novelty score, model + '_distribution' : per-user novelty scores }
    """
    res_dict = {}
    print('Calculating novelty for all models...')
    for model_name, user_rankings in tqdm(ranking_prediction_dict.items()):
        avg_nov, distribution = avg_novelty(user_rankings, train_data, weighting)
        print(f'- {model_name}: Novelty = {avg_nov:.4f}')
        res_dict[model_name] = avg_nov
        res_dict[model_name+'_distribution'] = distribution
    return res_dict

nov = novelty(ranking_predictions['MF'][1], train_data, 'uniform')

novelties = novelty_all(ranking_predictions, train_data, 'uniform')
novelties.keys()

## Calibration

Calibration measures how well the recommended items align with the user's preferences. A well-calibrated recommendation list is desirable as it ensures that the recommendations are relevant to the user's interests, potentially increasing user satisfaction and engagement. In our implementation, we use Kullback-Leibler (KL) divergence as the calibration metric. The formula for calibration is as follows:

**Calibration metric** - Kullback-Leibler divergence (lower = better)
$$
\begin{align}
MC_{KL}(p,q) &= KL(p||q) = \sum_{g} p(g|u) \log \frac{p(g|u)}{q(g|u)} \\
\text{where...} \\
p(g|u) &= \frac{\sum_{i\in \mathcal{H}}w_{u,i} \times p(g|i)}{\sum_{i \in \mathcal{H}} w_{u,i}} \\
q(g|u) &= \frac{\sum_{i\in \mathcal{L}}w_{r(i)} \times p(g|i)}{\sum_{i \in \mathcal{L}} w_{r(i)}}
\end{align}
$$
where:
- $p(g|i)$ - genre-distribution of each movie - 'categorisation of item'
- $p(g|u)$ - distribution of genres $g$ in user $u$'s profile (based on training data)
    - $\mathcal{H}$ - interaction history
    - $w_{u,i}$ - weight of item $i$ - rating given by user $u$ to item $i$
- $q(g|u)$ - distribution of genres $g$ in the recommendation list for user $u$
    - $\mathcal{L}$ - recommended items
    - $w_{r(i)}$ - weight of item $i$ at rank $r(i)$ - weighting scheme as used in ranking metrics, e.g. MRR or nDCG (TODO: the implementation below currently weights by predicted rating instead)
- KL divergence is undefined where $q(g|u) = 0$ while $p(g|u) > 0$, so $q$ is smoothed; genres with $p(g|u) = 0$ are masked out, since they contribute nothing to the sum - [Link to wiki](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)
  - $\tilde{q}(g|u) = (1-\alpha) \cdot q(g|u) + \alpha \cdot p(g|u)$ with small $\alpha > 0$, s.t. $q \approx\tilde{q}$
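
The smoothing step can be sketched on toy distributions as follows. The genre distributions, the value of `alpha`, and the helper name `kl_calibration` are assumptions for illustration, not values or functions from this notebook.

```python
import numpy as np

def kl_calibration(p: np.ndarray, q: np.ndarray, alpha: float = 0.01) -> float:
    """KL(p || q~) with q~ = (1 - alpha) * q + alpha * p, summed over genres where p > 0."""
    q_tilde = (1 - alpha) * q + alpha * p  # non-zero wherever p is non-zero
    mask = p > 0                           # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q_tilde[mask])))

# Toy genre distribution of a user's profile over three genres (assumed values)
p = np.array([0.5, 0.3, 0.2])

# A recommendation list that mirrors the profile exactly
kl_same = kl_calibration(p, p.copy())

# A list made up of the first genre only: q is zero for two genres the user likes
kl_skewed = kl_calibration(p, np.array([1.0, 0.0, 0.0]))

print(f"calibrated list: {kl_same:.4f}, skewed list: {kl_skewed:.4f}")
```

Without smoothing, the skewed list would produce a division by zero. With smoothing, it yields a large but finite penalty, while the perfectly calibrated list scores 0.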

In [None]:
def genre_distribution(movies: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate genre distribution for each movie.
    :param movies: [pd.DataFrame] containing movie metadata with columns=['item_id','title','genres','description']
    :return: pd.dataframe with all genres as columns, item id's as index, and p(g|i) as values
    """
    mov_genres = movies[['item_id', 'genres']].copy()
    mov_genres['genres'] = mov_genres['genres'].apply(lambda x: x.split(',')) # make the genres a list
    item_ids = mov_genres['item_id'].unique()
    # find all the genres present in the dataset
    all_genres = set()
    for genres in mov_genres['genres']:
        all_genres.update(genres)
    all_genres = list(all_genres)

    # calculate the distributions
    genre_dist = pd.DataFrame(np.zeros((len(item_ids), len(all_genres))), columns=all_genres, index=item_ids)
    for _, row in mov_genres.iterrows():
        item_id = row['item_id']
        genres = row['genres']
        genre_count = len(genres)
        for genre in genres:
            genre_dist.at[item_id, genre] = 1 / genre_count  # uniform distribution over genres
    return genre_dist

def get_interaction_history(user_id, train_data: pd.DataFrame) -> pd.Series:
    """
    Get interaction history of a user from training data.
    :param user_id: user id
    :param train_data: training data dataframe
    :return: pd.Series of the user's ratings, index=item ids, values=ratings
    """
    user_history = train_data[train_data['user_id'] == user_id]
    return user_history[['item_id', 'rating']].set_index('item_id')['rating']

def compute_genre_distribution_of_user(genre, genre_dist: pd.DataFrame, history: pd.Series):
    """
    Helper function for calibration metric - compute p(g|u) / q(g|u) for a given genre and user interaction history.

    Formulas are basically equivalent:
        p(g|u) = (w_{u,i} * p(g|i) for items in user history) / (sum of weights)
        q(g|u) = (w_{r(i)} * p(g|i) for items in recommendation list) / (sum of weights)

    :param genre: genre to compute distribution for
    :param genre_dist: pd.DataFrame with all genres as columns, item id's as index, and p(g|i) as values
    :param history: pd.Series of items with their weights, index=item ids, values=ratings (or predicted ratings)
    :return: weighted share of the genre, or 0.0 if the weights sum to zero
    """
    pgi = [genre_dist.at[item, genre] for item in history.index.tolist()]
    ratings = history.values
    total_weight = np.sum(ratings)
    if total_weight == 0:
        return 0.0  # no weight mass to distribute
    weighted_sum = np.sum(np.array(pgi) * np.array(ratings))
    return weighted_sum / total_weight

genre_distributions = genre_distribution(movies)
u1_history = get_interaction_history(1, train_data)
user_genre_distribution = compute_genre_distribution_of_user('Action', genre_distributions, u1_history)

In [None]:
def calibration(rec_list: pd.Series, user, train_data: pd.DataFrame, movie_data: pd.DataFrame) -> float:
    """
    Calculate calibration metric (KL divergence, lower = better) for a given recommendation list and user.
    :param rec_list: pd.Series of recommendations, index=item ids, values=predicted ratings
    :param user: user for whom the recommendation was made
    :param train_data: training data
    :param movie_data: movie metadata with 'item_id' and 'genres' columns
    :return: KL divergence between the profile and the (smoothed) recommendation genre distributions
    """
    a = 0.001  # small alpha used to smooth q and avoid division by zero
    genre_dist = genre_distribution(movie_data) # p(g|i)
    genres = genre_dist.columns.tolist()

    # pgu - genre distribution in user profile
    user_history = get_interaction_history(user, train_data) # H
    pgu = np.array([compute_genre_distribution_of_user(g, genre_dist, user_history) for g in genres]) # p(g|u)

    # qgu - genre distribution in recommendation list, weighted by predicted rating
    qgu = np.array([compute_genre_distribution_of_user(g, genre_dist, rec_list) for g in genres]) # q(g|u)

    # Smooth q so that it is non-zero wherever p is non-zero
    qgu_smoothed = (1 - a) * qgu + a * pgu

    # Genres with p(g|u) = 0 contribute nothing to the sum
    mask = pgu != 0
    res = np.sum(pgu[mask] * np.log(pgu[mask] / qgu_smoothed[mask]))
    return res

def calibration_all(ranking_prediction_dict: dict, train_data: pd.DataFrame, movie_data: pd.DataFrame) -> dict:
    """
    Calculate calibration for all models in ranking prediction dict.
    :param ranking_prediction_dict: { model : { user_id: pd.Series with ranking predictions } }
    :param train_data: training data
    :param movie_data:
    :return: { model : average calibration score }
    """
    res_dict = {}
    print(f'Calculating calibration for all models...')
    for model_name, user_rankings in tqdm(ranking_prediction_dict.items()):
        results = np.array([calibration(ranking, user, train_data, movie_data) for user, ranking in user_rankings.items()])
        avg_cal = np.mean(results)
        print(f'- {model_name}: Calibration = {avg_cal:.4f}')
        res_dict[model_name] = avg_cal
        res_dict[model_name+'_distribution'] = results
    return res_dict

cal = calibration(ranking_predictions['MF'][1], 1, train_data, movies)
# it seems to run without errors, but not sure if the values are correct
# TODO - code for running on results of all models
calibrations = calibration_all(ranking_predictions, train_data, movies)
calibrations.keys()
# should take ~ 30-40 secs on new mac

## Fairness

Fairness in recommendation systems aims to ensure that the recommendations provided to users are equitable and unbiased across different user groups or item categories. This is important to prevent discrimination and promote inclusivity in the recommendations. In our implementation, we consider both user-side and item-side fairness metrics. The fairness metrics we implement are as follows:

- **User-side** - the recommender should serve individual users/groups equally
    - Group Recommendation Unfairness - GRU
    - User Popularity Deviation - UPD
- **Item-side** - fair representation of items
    - Catalog coverage - fraction of items recommended at least once (needs the ranking results of all users)
    - Equality of exposure using the Gini index

### User-side fairness

$$
\displaylines{
GRU(G_1, G_2, Q) = \left| \frac{1}{|G_1|} \sum_{i \in G_1} \mathcal{F} (Q_i) - \frac{1}{|G_2|} \sum_{i \in G_2} \mathcal{F}(Q_i) \right| \\
UPD(u) = dist(P(R_u), P(L_u))
}
$$

where:
- $\mathcal{F}(Q_i)$ - recommendation quality for user $u_i$, invoking a metric such as NDCG@K or F1 score
- $P(R_u)$ - popularity distribution of items in user $u$'s recommendation list
- $P(L_u)$ - popularity distribution of items in user $u$'s interaction history
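
A minimal numeric sketch of both user-side measures, on made-up per-user scores and item popularities. The group split, the scores, and the choice of a mean difference as the distance in UPD are assumptions for illustration.

```python
import pandas as pd

# Per-user quality scores F(Q_i), e.g. nDCG@K (assumed values), index = user id
ndcg_per_user = pd.Series({1: 0.8, 2: 0.6, 3: 0.7, 4: 0.3, 5: 0.5})

# Two hypothetical user groups, e.g. active vs. inactive users
group_active = [1, 2, 3]
group_inactive = [4, 5]

# GRU: absolute difference between the groups' average recommendation quality
gru = abs(ndcg_per_user.loc[group_active].mean() - ndcg_per_user.loc[group_inactive].mean())

# Item popularities (assumed values), index = item id
item_pop = pd.Series({10: 0.5, 20: 0.25, 30: 0.25})

# UPD for one user, taking the difference of mean popularities as the distance
recommended = [10, 20]   # items in the recommendation list
history = [20, 30]       # items in the interaction history
upd = item_pop.loc[recommended].mean() - item_pop.loc[history].mean()

print(f"GRU = {gru:.3f}, UPD = {upd:.3f}")
```

A GRU of 0 would mean both groups receive recommendations of equal average quality. A positive UPD under this distance means the recommended items are, on average, more popular than the items in the user's history.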

In [None]:
def group_rec_unfairness(group1: list, group2: list, rank_scores: pd.Series) -> float:
    """
    Calculate Group Recommendation Unfairness (GRU) between two user groups, given a quality metric.
    :param group1: list of user ids in group 1
    :param group2: list of user ids in group 2
    :param rank_scores: per-user scores of one ranking metric (e.g. nDCG), index=user ids
    :return: GRU value as a float
    """
    g1_size = len(group1)
    g2_size = len(group2)
    if g1_size == 0 or g2_size == 0:
        return 0.0  # cannot compare a group w/ no users

    # .loc selects several labels at once (.at only accepts a single label);
    # np.mean already divides by the group size
    g1_avg = np.mean(rank_scores.loc[group1])
    g2_avg = np.mean(rank_scores.loc[group2])
    return abs(g1_avg - g2_avg)

group_rec_unfairness([1, 2, 3], [4, 5, 6], accuracy_metrics_df['ndcg'])  # TODO - get metric per user
# TODO - code for running on results of all models


In [None]:
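def user_popularity_bias(user_id, rec_list: pd.Series, train_data: pd.DataFrame) -> float:
    """
    User Popularity Deviation, using the difference of mean item popularity as the distance.
    :param user_id: user for whom the recommendation was made
    :param rec_list: pd.Series of recommendations, index=item ids
    :param train_data: training data
    :return: mean popularity of recommended items minus mean popularity of the user's history
             (positive = recommendations are more popular than what the user interacted with)
    """
    item_popularity = popularity_matrix(train_data)
    user_history = get_interaction_history(user_id, train_data)
    p_ru = item_popularity.loc[rec_list.index.tolist()]
    p_lu = item_popularity.loc[user_history.index.tolist()]
    return np.mean(p_ru) - np.mean(p_lu)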

def all_user_popularity_bias(ranking_prediction_dict: dict, train_data: pd.DataFrame) -> dict:
    res_dict = {}
    print(f'Calculating user popularity bias for all models...')
    for model_name, user_rankings in tqdm(ranking_prediction_dict.items()):
        user_pop_biases = [user_popularity_bias(user_id, rec_list, train_data) for user_id, rec_list in user_rankings.items()]
        avg_pop_bias = np.mean(user_pop_biases)
        print(f'- {model_name}: Average User Popularity Bias = {avg_pop_bias:.4f}')
        res_dict[model_name] = avg_pop_bias
        res_dict[model_name+'_distribution'] = user_pop_biases
    return res_dict

user_pop_biases = [user_popularity_bias(k, v, train_data) for k, v in ranking_predictions['MF'].items()]
avg_pop_bias_MF = np.mean(user_pop_biases)
# TODO - code for running on results of all models

user_pop_biases_all = all_user_popularity_bias(ranking_predictions, train_data)
user_pop_biases_all.keys()

### Item-side fairness

$$
\displaylines{
\text{catalog coverage} = \frac{\text{no. items appearing in 1+ recommendation}}{\text{total no. items in movie data}} \\
\text{equality of exposure} = 1 - 2 \sum_{i=1}^{N} P(i) \cdot \frac{i}{N}
}
$$

where:
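- $N$ - total number of items in the movie data
- $P(i)$ - share of all recommendation slots given to the item at position $i$, assuming items are ordered by decreasing exposure (the ordering under which the expression matches the Gini index up to a $1/N$ term)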
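
Both item-side measures can be checked on toy exposure counts, as sketched below. The helper uses the standard Lorenz-curve form of the Gini index, which may differ from the expression above by a term of order $1/N$. The counts and the helper name `gini` are assumptions for illustration.

```python
import numpy as np

def gini(exposure_counts) -> float:
    """Gini index of exposure: 0 = perfectly equal, (N-1)/N = one item gets everything."""
    counts = np.sort(np.asarray(exposure_counts, dtype=float))  # ascending order
    n = counts.size
    if n == 0 or counts.sum() == 0:
        return 0.0
    lorenz = np.cumsum(counts) / counts.sum()  # cumulative exposure share
    return float((n + 1) / n - 2 * lorenz.sum() / n)

# How often each of four catalog items was recommended (assumed values)
equal_exposure = [5, 5, 5, 5]
concentrated_exposure = [0, 0, 0, 20]

gini_equal = gini(equal_exposure)
gini_concentrated = gini(concentrated_exposure)

# Catalog coverage: fraction of items recommended at least once
coverage_equal = np.count_nonzero(equal_exposure) / len(equal_exposure)
coverage_concentrated = np.count_nonzero(concentrated_exposure) / len(concentrated_exposure)

print(f"equal: gini={gini_equal:.2f}, coverage={coverage_equal:.2f}")
print(f"concentrated: gini={gini_concentrated:.2f}, coverage={coverage_concentrated:.2f}")
```

Sorting before accumulating matters: the cumulative sum of unsorted counts depends on the arbitrary order of the items and no longer traces the Lorenz curve.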

In [None]:
def catalog_coverage(rec_lists: list[pd.Series], movie_data: pd.DataFrame) -> float:
    total_no_movies = movie_data['item_id'].nunique()
    if total_no_movies == 0:
        return 0.0
    recommended_items = set()
    for rec_list in rec_lists:
        recommended_items.update(rec_list.index.tolist())
    no_recommended_items = len(recommended_items)
    return no_recommended_items / total_no_movies

def catalog_coverage_all(ranking_prediction_dict: dict, movie_data: pd.DataFrame) -> dict:
    """
    Calculate catalog coverage for all models in ranking prediction dict.
    :param ranking_prediction_dict: { model : { user_id: pd.Series with ranking predictions } }
    :param movie_data:
    :return: { model : catalog coverage score }
    """
    res_dict = {}
    print(f'Calculating catalog coverage for all models...')
    for model_name, user_rankings in tqdm(ranking_prediction_dict.items()):
        cov = catalog_coverage(user_rankings.values(), movie_data)
        print(f'- {model_name}: Catalog Coverage = {cov:.4f}')
        res_dict[model_name] = cov
    return res_dict

c = catalog_coverage(ranking_predictions['MF'].values(), movies)
# TODO - code for running on results of all models

cat_cov_all = catalog_coverage_all(ranking_predictions, movies)
cat_cov_all.keys()

In [None]:
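def equality_of_exposure(rec_lists: list[pd.Series], movie_data: pd.DataFrame) -> float:
    """
    Gini-style inequality of item exposure over all recommendation lists
    (approximately 0 = equal exposure, approaching 1 = exposure concentrated on few items).
    """
    total_no_movies = movie_data['item_id'].nunique()
    if total_no_movies == 0:
        return 0.0
    exposure_counts = pd.Series(0, index=movie_data['item_id'].unique())
    for rec_list in rec_lists:
        for item in rec_list.index.tolist():
            exposure_counts.at[item] += 1
    if exposure_counts.sum() == 0:
        return 0.0  # nothing was recommended
    exposure_probs = exposure_counts / exposure_counts.sum()
    # Sort ascending so the cumulative sum traces the Lorenz curve;
    # without sorting the result depends on the arbitrary item order
    lorenz = exposure_probs.sort_values().cumsum()
    gini_index = 1 - 2 * np.sum(lorenz * (1 / total_no_movies))
    return gini_index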

def equality_of_exposure_all(ranking_prediction_dict: dict, movie_data: pd.DataFrame) -> dict:
    """
    Calculate equality of exposure for all models in ranking prediction dict.
    :param ranking_prediction_dict: { model : { user_id: pd.Series with ranking predictions } }
    :param movie_data:
    :return: { model : equality of exposure score }
    """
    res_dict = {}
    print(f'Calculating equality of exposure for all models...')
    for model_name, user_rankings in tqdm(ranking_prediction_dict.items()):
        eq_exp = equality_of_exposure(user_rankings.values(), movie_data)
        print(f'- {model_name}: Equality of Exposure = {eq_exp:.4f}')
        res_dict[model_name] = eq_exp
    return res_dict


e = equality_of_exposure(ranking_predictions['MF'].values(), movies)
# TODO - code for running on results of all models

eq_exp_all = equality_of_exposure_all(ranking_predictions, movies)
eq_exp_all.keys()

## Evaluation of non-accuracy metrics


In [None]:
# running all non-accuracy metrics on ranking results
def summary_df(metric_results: dict) -> pd.DataFrame:
    """Keep only the per-model scores (drop the '_distribution' entries) as a (model, value) dataframe."""
    scores = {k: v for k, v in metric_results.items() if not k.endswith('_distribution')}
    return pd.Series(scores, name='value').rename_axis('model').reset_index()

diversities = diversity_all(ranking_predictions, movies, genre_distance)
diversity_df = summary_df(diversities)

novelties = novelty_all(ranking_predictions, train_data, 'uniform')
novelty_df = summary_df(novelties)

calibrations = calibration_all(ranking_predictions, train_data, movies)
calibration_df = summary_df(calibrations)

user_pop_biases_all = all_user_popularity_bias(ranking_predictions, train_data)
user_pop_biases_df = summary_df(user_pop_biases_all)

# Item-side
cat_cov_all = catalog_coverage_all(ranking_predictions, movies)
cat_cov_df = summary_df(cat_cov_all)

eq_exp_all = equality_of_exposure_all(ranking_predictions, movies)
eq_exp_df = summary_df(eq_exp_all)

non_ac_metrics = {
    'diversity': diversities,
    'novelty': novelties,
    'calibration': calibrations,
    'user_popularity_bias': user_pop_biases_all,
    'catalog_coverage': cat_cov_all,
    'equality_of_exposure': eq_exp_all
}

In [None]:
diversity_df

In [None]:
# Analysis - plot all non-accuracy metrics -> subplots for space
# (data, panel title, y-axis label) for each of the six panels, in row-major order
panels = [
    (diversity_df, 'Diversity', 'Diversity Score'),
    (novelty_df, 'Novelty', 'Surprisal Score'),
    (calibration_df, 'Calibration', 'KL Divergence'),
    (user_pop_biases_df, 'User Popularity Bias', 'Average Popularity Bias'),
    (cat_cov_df, 'Catalog Coverage', 'Coverage Score'),
    (eq_exp_df, 'Equality of Exposure', 'Gini Index'),
]
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
for ax, (df, title, ylabel) in zip(axes.flat, panels):
    sns.barplot(data=df, x='model', y='value', ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Recommendation Model')
    ax.set_ylabel(ylabel)
    ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

\[Analysis here]

In [None]:
# accuracy vs non-accuracy metrics correlation
# merge accuracy and non-accuracy metrics into one dataframe for ranking models
full_df = accuracy_metrics_df.merge(
    diversity_df.rename(columns={'value':'diversity'}),
    on='model'
).merge(
    novelty_df.rename(columns={'value':'novelty'}),
    on='model'
).merge(
    calibration_df.rename(columns={'value':'calibration'}),
    on='model'
).merge(
    user_pop_biases_df.rename(columns={'value':'user_popularity_bias'}),
    on='model'
).merge(
    cat_cov_df.rename(columns={'value':'catalog_coverage'}),
    on='model'
).merge(
    eq_exp_df.rename(columns={'value':'equality_of_exposure'}),
    on='model'
)

full_df.set_index('model', inplace=True)
os.makedirs('results', exist_ok=True)  # to_csv does not create missing directories
full_df.to_csv('results/accuracy_non_accuracy_metrics_ranking.csv')

correlation_matrix = full_df.corr()
# correlation_matrix

# sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm")
# plt.title("Correlation Matrix between Accuracy and Non-Accuracy Metrics")
# plt.show()