*UE Learning from User-generated Data, CP MMS, JKU Linz 2024*
# Exercise 4: Evaluation

In this exercise we evaluate the accuracy of the three recommender systems we already implemented. First we implement the DCG and nDCG metrics, then we build a simple evaluation framework that compares the three recommenders in terms of nDCG. The implementations of the three recommender systems are provided in the file rec.py and are imported later in the notebook.
Please consult the lecture slides and the presentation from UE Session 4 for a recap.

Make sure to rename the notebook according to the convention:

LUD24_ex04_k<font color='red'><Matr. Number\></font>_<font color='red'><Surname-Name\></font>.ipynb

for example:

LUD24_ex04_k000007_Bond_James.ipynb

## Implementation
In this exercise, as before, you are required to write a number of functions. Only implemented functions are graded. Insert your implementations into the templates provided. Please don't change the templates even if they are not pretty. Don't forget to test your implementation for correctness and efficiency. **Make sure to try your implementations on toy examples and sanity checks.**

Please **only use libraries already imported in the notebook**.

In [16]:
import pandas as pd
import numpy as np

## <font color='red'>TASK 1/2</font>: Evaluation Metrics

Implement DCG and nDCG in the corresponding templates.

### DCG Score
Implement DCG following the input/output convention:
#### Input:
* predictions - (not an interaction matrix!) numpy array with recommendations. The row index corresponds to the user id, the column index corresponds to the rank of the item in the cell. Every cell (i, j) contains the **item id** recommended to user i at position j of the list. For example:

The predictions structure [[12, 7, 99], [0, 97, 6]] means that the user with id==1 (second row) was recommended item **0** at the top of the list, item **97** in second place and item **6** in third place.

* test_interaction_matrix - (plain interaction matrix format as before!) interaction matrix constructed from interactions held out as a test set, rows - users, columns - items, cells - 0 or 1

* topK - integer - top "how many" to consider for the evaluation. By default top 10 items are to be considered
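To make the input convention concrete, the following minimal sketch looks up, for every recommended item id, whether that item appears in the user's test interactions. The toy test matrix with 100 items and the choice of relevant items are made up for illustration only.

```python
import numpy as np

# Predictions from the example above: row = user, column = rank, cell = item id.
predictions = np.array([[12, 7, 99], [0, 97, 6]])

# Hypothetical test interaction matrix: 2 users x 100 items, initially all zeros.
test_interaction_matrix = np.zeros((2, 100), dtype=int)
test_interaction_matrix[0, 7] = 1   # assumption: user 0 has item 7 in the test set
test_interaction_matrix[1, 0] = 1   # assumption: user 1 has items 0 and 6 in the test set
test_interaction_matrix[1, 6] = 1

# For every user, pick the test-matrix entries at the recommended item ids.
# hits[i, j] == 1 means that the item at rank j for user i is relevant.
hits = np.take_along_axis(test_interaction_matrix, predictions, axis=1)
print(hits)
```

The resulting array has the same shape as `predictions`, which is the form a rank-based metric such as DCG works on.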

#### Output:
* DCG score

Don't forget, DCG is calculated for every user separately and then the average is returned.
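
In formula form, the per-user score that the implementation in this notebook computes can be sketched as follows, where $rel_i \in \{0, 1\}$ is the relevance of the item at 1-based position $i$ and $K$ is topK. This is one common convention; another widely used variant divides every position by $\log_2(i + 1)$, so check the lecture slides for the one expected here.

$$\mathrm{DCG@}K = rel_1 + \sum_{i=2}^{K} \frac{rel_i}{\log_2 i}$$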


<font color='red'>**Attention!**</font> Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!

In [17]:
def get_dcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK: int = 10) -> float:
    """
    predictions - 2D np.ndarray, predictions of the recommendation algorithm for each user;
    test_interaction_matrix - 2D np.ndarray, test interaction matrix for each user;
    topK - int, number of top recommendations to evaluate;

    returns - float, mean dcg score over all users;
    """
    num_users = predictions.shape[0]
    dcgs = np.zeros(num_users)

    for user_id in range(num_users):
        gains = np.zeros(topK)
        
        for rank in range(min(topK, predictions.shape[1])):
            item_id = predictions[user_id, rank]
            relevance = test_interaction_matrix[user_id, item_id]
            # Gain is 1 at the top position and 1 / log2(rank + 1) at 0-based rank >= 1
            if relevance == 1:
                if rank == 0:
                    gains[rank] = 1
                else:
                    gains[rank] = 1 / np.log2(rank + 1)
            
        dcgs[user_id] = np.sum(gains)

    mean_dcg = np.mean(dcgs)
    return mean_dcg

In [18]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

dcg_score = get_dcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(dcg_score, 1), "1 expected"

* Can the DCG score of a single user be higher than 1?
* Can the average DCG score be higher than 1?
* Why?
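
As a starting point for these questions, here is a small toy check for a single user. The helper `dcg_single_user` is hypothetical and only mirrors the discount convention of the code above (top position undiscounted, position i >= 2 divided by log2(i)).

```python
import numpy as np

def dcg_single_user(relevances: np.ndarray) -> float:
    """DCG of one ranked 0/1 relevance list (hypothetical helper for illustration)."""
    positions = np.arange(1, len(relevances) + 1)
    # Position 1 keeps a discount of 1, position i >= 2 is divided by log2(i).
    discounts = np.ones(len(relevances))
    discounts[1:] = np.log2(positions[1:])
    return float(np.sum(relevances / discounts))

one_hit = dcg_single_user(np.array([1, 0, 0, 0]))      # a single relevant item at the top
three_hits = dcg_single_user(np.array([1, 0, 1, 1]))   # several relevant items in the list
print(one_hit, three_hits)
```

Comparing the two printed values shows how the score behaves once more than one relevant item appears in the list.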

### nDCG Score

Implement the nDCG metric following the same parameter convention as for DCG.
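
Written out with the discount convention the code in this notebook uses (base-2 logarithm, top position not discounted), the per-user score can be sketched as follows. Here $n_u$, the number of test items held out for user $u$, is notation introduced only for this sketch.

$$\mathrm{nDCG@}K(u) = \frac{\mathrm{DCG@}K(u)}{\mathrm{IDCG@}K(u)}, \qquad \mathrm{IDCG@}K(u) = 1 + \sum_{i=2}^{\min(K,\, n_u)} \frac{1}{\log_2 i} \quad (n_u \ge 1)$$

For users with $n_u = 0$ the ideal DCG is 0; the implementation in this notebook assigns such users an nDCG of 0.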

<font color='red'>**Attention!**</font> Remember that ideal DCG is calculated separately for each user and depends on the number of tracks held out for them as a Test set! Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!

<font color='red'>**Note:**</font> nDCG is calculated for **every user separately** and then the average is returned. You do not necessarily need to use the function you implemented above. Writing nDCG from scratch might be a good idea as well.

In [19]:
def get_ndcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray, predictions of the recommendation algorithm for each user;
    test_interaction_matrix - np.ndarray, test interaction matrix for each user;
    topK - int, topK recommendations should be evaluated;
    
    returns - float, average ndcg score over all users;
    """
    num_users = predictions.shape[0]
    ndcgs = np.zeros(num_users)

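    for user_id in range(num_users):
        # Item ids recommended to this user, cut off at topK
        user_predictions = predictions[user_id, :topK]
        # 0/1 relevance of every recommended item according to the test set
        relevances = test_interaction_matrix[user_id, user_predictions]

        # The recommendation list may be shorter than topK
        actual_topK = min(topK, len(user_predictions))

        # DCG: the top position is not discounted, position i >= 2 is divided by log2(i)
        if actual_topK > 1:
            discounts = np.log2(np.arange(2, actual_topK + 1))
            dcg = relevances[0] + np.sum(relevances[1:actual_topK] / discounts)
        else:
            dcg = relevances[0]

        # Ideal DCG: all held-out items of this user ranked first
        sorted_relevances = np.sort(test_interaction_matrix[user_id])[-actual_topK:][::-1]
        if actual_topK > 1:
            idcg = sorted_relevances[0] + np.sum(sorted_relevances[1:actual_topK] / discounts)
        else:
            idcg = sorted_relevances[0]

        # Users without test interactions get an nDCG of 0
        if idcg == 0:
            ndcgs[user_id] = 0
        else:
            ndcgs[user_id] = dcg / idcg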
    
    average_ndcg = np.mean(ndcgs)
    return average_ndcg

In [20]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

ndcg_score = get_ndcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(ndcg_score, 1), "ndcg score is not correct."

* Can the nDCG score be higher than 1?
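
A toy check for a single user can help with this question. The helper `ndcg_single_user` below is hypothetical; it assumes a non-empty ranked 0/1 relevance list and uses the same discount convention as the notebook code.

```python
import numpy as np

def ndcg_single_user(ranked_relevances: np.ndarray, num_test_items: int) -> float:
    """nDCG for one user (hypothetical helper, assumes a non-empty list)."""
    k = len(ranked_relevances)
    # Discount 1 for the top position, log2(i) for position i >= 2.
    discounts = np.concatenate(([1.0], np.log2(np.arange(2, k + 1))))
    dcg = np.sum(ranked_relevances / discounts)
    # Ideal ranking: all held-out items first, capped at the list length.
    ideal = np.zeros(k)
    ideal[:min(k, num_test_items)] = 1
    idcg = np.sum(ideal / discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0

perfect = ndcg_single_user(np.array([1, 1, 0, 0]), num_test_items=2)
imperfect = ndcg_single_user(np.array([0, 0, 1, 1]), num_test_items=2)
print(perfect, imperfect)
```

Because the ideal ranking places every held-out item as high as possible, no actual ranking of the same items can collect more discounted gain than the ideal one.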

## <font color='red'>TASK 2/2</font>: Evaluation
Use the provided rec.py (see imports below) to build a simple evaluation framework. It should be able to evaluate POP, ItemKNN and SVD.


*Make sure to place the provided rec.py next to your notebook for the imports to work.*


In [21]:
from rec import svd_decompose, svd_recommend_to_list  #SVD
from rec import inter_matr_implicit
from rec import recTopK  #ItemKNN
from rec import recTopKPop  #TopPop

Load the users, the items, and both the train and the test interactions
from the **new version of the lfm-tiny-tunes dataset** provided with the assignment.

In [22]:
def read(dataset, file):
    return pd.read_csv(dataset + '/' + dataset + '.' + file, sep='\t')

# TODO: YOUR IMPLEMENTATION

users = read('lfm-tiny-tunes','user')
items = read('lfm-tiny-tunes','item')
train_inters = read('lfm-tiny-tunes','inter_train')
test_inters = read('lfm-tiny-tunes','inter_test')

train_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=train_inters,
                                               dataset_name="lfm-tiny-tunes")
test_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=test_inters,
                                              dataset_name="lfm-tiny-tunes")

### Get Recommendations

Implement the function below to get recommendations from all 3 recommender algorithms. Make sure you use the provided config dictionary and pay attention to the structure of the output dictionary, since we will use it later.
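
The output dictionary expects one 2D array per recommender, with one row per user. A minimal sketch of one way to build such an array is shown below; `recommend_stub` is a hypothetical stand-in and not one of the functions in rec.py, and the toy matrix is made up.

```python
import numpy as np

def recommend_stub(inter_matr: np.ndarray, user: int, top_k: int) -> np.ndarray:
    """Hypothetical stand-in recommender: most popular items the user has not seen."""
    popularity = inter_matr.sum(axis=0).astype(float)
    popularity[inter_matr[user] > 0] = -np.inf   # push already seen items to the end
    return np.argsort(-popularity)[:top_k]

toy_train = np.array([[1, 0, 0, 1, 0],
                      [0, 1, 0, 1, 0],
                      [0, 0, 1, 1, 1]])

# Collect one row per user in a plain Python list and stack once at the end.
# (np.ndarray has no append method, so appending to np.array([]) raises an error.)
rows = [recommend_stub(toy_train, user, top_k=2) for user in range(toy_train.shape[0])]
rec_dict = {"recommenders": {"Stub": {"predictions": np.vstack(rows)}}}
print(rec_dict["recommenders"]["Stub"]["predictions"].shape)
```

Stacking once at the end also avoids repeatedly copying a growing array inside the loop.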

In [23]:
config_predict = {
    #interaction matrix
    "train_inter": train_interaction_matrix,
    #topK parameter used for all algorithms
    "top_k": 10,
    #specific parameters for all algorithms
    "recommenders": {
        "SVD": {
            "n_factors": 50
        },
        "ItemKNN": {
            "n_neighbours": 5
        },
        "TopPop": {
        }
    }
}

In [29]:
def get_recommendations_for_algorithms(config: dict) -> dict:
    """
    config - dict, configuration as defined above;

    returns - dict, already predefined below with name "rec_dict";
    """

    #use this structure to return results
    rec_dict = {"recommenders": {
        "SVD": {
            #Add your predictions here
            "predictions": np.array([])
        },
        "ItemKNN": {
            "predictions": np.array([])
        },
        "TopPop": {
            "predictions": np.array([])
        },
    }}

    train_inter = config['train_inter']
    top_k = config['top_k']
    num_users = train_inter.shape[0]

    # np.ndarray has no append(), so the per-user recommendations are collected
    # in plain Python lists and stacked into one 2D array per recommender.

    # SVD
    try:
        U_final, V_final = svd_decompose(train_inter, config['recommenders']['SVD']['n_factors'])
        svd_recs = []
        for user_id in range(num_users):
            seen_item_ids = np.where(train_inter[user_id] > 0)[0]  # items the user has interacted with
            svd_recs.append(svd_recommend_to_list(user_id, seen_item_ids, U_final, V_final, top_k))
        rec_dict['recommenders']['SVD']['predictions'] = np.vstack(svd_recs)
    except Exception as e:
        print("SVD Decomposition Failed:", str(e))

    # ItemKNN
    try:
        knn_recs = []
        for user_id in range(num_users):
            knn_recs.append(recTopK(train_inter, user_id, top_k, config['recommenders']['ItemKNN']['n_neighbours']))
        rec_dict['recommenders']['ItemKNN']['predictions'] = np.vstack(knn_recs)
    except Exception as e:
        print("ItemKNN Recommendation Failed:", str(e))

    # TopPop
    try:
        pop_recs = []
        for user_id in range(num_users):
            pop_recs.append(recTopKPop(train_inter, user_id, top_k))
        rec_dict['recommenders']['TopPop']['predictions'] = np.vstack(pop_recs)
    except Exception as e:
        print("TopPop Recommendation Failed:", str(e))

    return rec_dict


In [30]:
recommendations = get_recommendations_for_algorithms(config_predict)

assert "SVD" in recommendations["recommenders"] and "predictions" in recommendations["recommenders"]["SVD"]
assert isinstance(recommendations["recommenders"]["SVD"]["predictions"], np.ndarray)
assert "ItemKNN" in recommendations["recommenders"] and "predictions" in recommendations["recommenders"]["ItemKNN"]
assert isinstance(recommendations["recommenders"]["ItemKNN"]["predictions"], np.ndarray)
assert "TopPop" in recommendations["recommenders"] and "predictions" in recommendations["recommenders"]["TopPop"]
assert isinstance(recommendations["recommenders"]["TopPop"]["predictions"], np.ndarray)


In [None]:
users = pd.read_csv('lfm-tiny-tunes/lfm-tiny-tunes.user', sep='\t')

print(users.info())
print(users.head())

items = pd.read_csv('lfm-tiny-tunes/lfm-tiny-tunes.item', sep='\t')

print(items.info())
print(items.head())

train_inters = pd.read_csv('lfm-tiny-tunes/lfm-tiny-tunes.inter_train', sep='\t')

print(train_inters.info())
print(train_inters.head())

test_inters = pd.read_csv('lfm-tiny-tunes/lfm-tiny-tunes.inter_test', sep='\t')

print(test_inters.info())
print(test_inters.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1215 entries, 0 to 1214
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   user_id              1215 non-null   int64 
 1   country              972 non-null    object
 2   age_at_registration  1215 non-null   int64 
 3   gender               1213 non-null   object
 4   registration_date    1215 non-null   object
dtypes: int64(2), object(3)
memory usage: 47.6+ KB
None
   user_id country  age_at_registration gender    registration_date
0        0      RU                   25      m  2006-06-12 13:25:12
1        1      US                   23      m  2005-08-18 15:25:41
2        2      FR                   25      m  2006-02-26 22:39:03
3        3      DE                    2      m  2007-02-28 10:12:13
4        4      UA                   23      n  2007-10-09 15:21:20
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394 entries, 0 to 393
Data columns (t

### Evaluate Recommendations

Implement the function such that it evaluates the previously generated recommendations. Make sure you use the config dictionary passed to the function as its argument. **DO NOT** access the global *config_test* directly inside the function. Pay attention to the structure of the output dictionary.

In [None]:
config_test = {
    "top_k": 10,
    "test_inter": test_interaction_matrix,
    "recommenders": {}  # here you can access the recommendations from get_recommendations_for_algorithms

}
# add dictionary with recommendations to config dictionary
config_test.update(recommendations)

In [None]:
def evaluate_algorithms(config: dict) -> dict:
    """
    config - dict, configuration as defined above;

    returns - dict, { Recommender Key from input dict: { "ndcg": float - ndcg from evaluation for this recommender} };
    """

    metrics = {
        "SVD": {
        },
        "ItemKNN": {
        },
        "TopPop": {
        },
    }

    # Compute the mean nDCG of every recommender on the held-out test interactions.
    for recommender, results in config["recommenders"].items():
        metrics[recommender] = {
            "ndcg": float(get_ndcg_score(results["predictions"], config["test_inter"], config["top_k"]))
        }

    return metrics

### Evaluating Every Algorithm
Make sure everything works.
We expect ItemKNN to outperform the other algorithms on our small data sample.

In [None]:
evaluations = evaluate_algorithms(config_test)

assert "SVD" in evaluations and "ndcg" in evaluations["SVD"] and isinstance(evaluations["SVD"]["ndcg"], float)
assert "ItemKNN" in evaluations and "ndcg" in evaluations["ItemKNN"] and isinstance(evaluations["ItemKNN"]["ndcg"], float)
assert "TopPop" in evaluations and "ndcg" in evaluations["TopPop"] and isinstance(evaluations["TopPop"]["ndcg"], float)

In [None]:
for recommender in evaluations.keys():
    print(f"{recommender} ndcg: {evaluations[recommender]['ndcg']}")

## Questions and Potential Future Work
* How would you try to improve the performance of all three algorithms?
* What other metrics would you consider to compare these recommender systems?
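
As one possible answer to the second question, precision@K and recall@K can be computed with the same input convention as DCG and nDCG. The sketch below is a hypothetical helper, not part of the exercise templates; counting users without test items as recall 0 is an assumption made here.

```python
import numpy as np

def precision_recall_at_k(predictions: np.ndarray,
                          test_interaction_matrix: np.ndarray,
                          topK: int = 10) -> tuple:
    """Mean precision@K and recall@K over all users (hypothetical helper)."""
    top = predictions[:, :topK]
    # Number of recommended items per user that are in the test set.
    hits = np.take_along_axis(test_interaction_matrix, top, axis=1).sum(axis=1)
    precision = hits / top.shape[1]
    num_relevant = test_interaction_matrix.sum(axis=1)
    # Assumption: users without any test item contribute a recall of 0.
    recall = np.divide(hits, num_relevant,
                       out=np.zeros(len(hits)), where=num_relevant > 0)
    return float(precision.mean()), float(recall.mean())

predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])
precision, recall = precision_recall_at_k(predictions, test_interaction_matrix, topK=2)
print(precision, recall)
```

Unlike nDCG, these two metrics ignore the order of the items inside the top K, so they complement rather than replace a rank-aware metric.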

In [None]:
# The end.