*UE Learning from User-generated Data, CP MMS, JKU Linz 2023*
# Exercise 4: Evaluation

In this exercise we evaluate the accuracy of the three recommender systems we have already implemented. First we implement the DCG and nDCG metrics, then we create a simple evaluation framework to compare the three recommenders in terms of nDCG. The implementations of the three recommender systems are provided in the file rec.py and are imported later in the notebook.
Please consult the lecture slides and the presentation from UE Session 4 for a recap.

Make sure to rename the notebook according to the convention:

LUD23_ex03_k<font color='red'><Matr. Number\></font>_<font color='red'><Surname-Name\></font>.ipynb

for example:

LUD23_ex03_k000007_Bond_James.ipynb

## Implementation
In this exercise, as before, you are required to write a number of functions. Only implemented functions are graded. Insert your implementations into the templates provided. Please don't change the templates even if they are not pretty. Don't forget to test your implementation for correctness and efficiency. **Make sure to try your implementations on toy examples and sanity checks.**

Please **only use libraries already imported in the notebook**.

In [1]:
import pandas as pd
import numpy as np

## <font color='red'>TASK 1/2</font>: Evaluation Metrics

Implement DCG and nDCG in the corresponding templates.

### DCG Score
Implement DCG following the input/output convention:
#### Input:
* predictions - (not an interaction matrix!) numpy array with recommendations. The row index corresponds to the User_id, the column index corresponds to the rank of the item stored in the cell. Every cell (i,j) contains the **item id** recommended to user (i) at position (j) in the list. For example:

The following predictions structure [[12, 7, 99], [0, 97, 6]] means that the user with id==1 (second row) got recommended item **0** on the top of the list, item **97** on the second place and item **6** on the third place.

* test_interaction_matrix - (plain interaction matrix format as before!) interaction matrix constructed from interactions held out as a test set, rows - users, columns - items, cells - 0 or 1

* topK - integer - top "how many" to consider for the evaluation. By default top 10 items are to be considered

#### Output:
* DCG score

Don't forget, DCG is calculated for every user separately and then the average is returned.


<font color='red'>**Attention!**</font> Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!
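
The input convention above can be illustrated on toy data. The following is a minimal sketch, not the required template solution: the arrays are made up, and `np.take_along_axis` is just one possible way to look up whether each recommended item id occurs in the test set.

```python
import numpy as np

# Toy data (made up): 2 users, 4 items, top-3 recommendation lists of item ids.
predictions = np.array([[2, 0, 3],
                        [1, 3, 0]])
test_interaction_matrix = np.array([[1, 0, 1, 0],
                                    [0, 0, 0, 1]])

# relevance[i, j] is 1 if the item recommended to user i at rank j is a test item.
relevance = np.take_along_axis(test_interaction_matrix, predictions, axis=1)

# Base-2 discounts for zero-based ranks: the top position gets 1 / log2(2) = 1.
ranks = np.arange(predictions.shape[1])
discounts = 1.0 / np.log2(ranks + 2)

# DCG per user, then the mean over users.
dcg_per_user = (relevance * discounts).sum(axis=1)
mean_dcg = dcg_per_user.mean()
print(relevance)
print(dcg_per_user, mean_dcg)
```

Note that the lookup uses the item id stored in `predictions`, while the discount depends only on the position in the list.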

In [2]:
def get_dcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user.
    test_interaction_matrix - np.ndarray - test interaction matrix for each user.
    
    returns - float - mean dcg score over all user.
    """
    score = 0

    # TODO: YOUR IMPLEMENTATION.
    def get_no_interactions(test_interaction_matrix):
        to_del = []
        for i, row in enumerate(test_interaction_matrix):
            if np.all((row == 0)):
                to_del.append(i)
        return to_del
    
    to_delete = get_no_interactions(test_interaction_matrix)
    test_interaction_matrix = np.delete(test_interaction_matrix, to_delete, 0)
    predictions = np.delete(predictions, to_delete, 0)

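    for user in range(predictions.shape[0]):
        for rank in range(min(topK, predictions.shape[1])):
            # look up the recommended item id (not the rank) in the test matrix
            if test_interaction_matrix[user, predictions[user, rank]]:
                # discount by rank: the top position (rank 0) gets 1 / log2(2) = 1
                score += 1 / np.log2(2 + rank)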
                
    score /= predictions.shape[0]
    return score

In [3]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

dcg_score = get_dcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(dcg_score, 1), "1 expected"

* Can DCG score be higher than 1?<br>
$\quad$Yes. A relevant item at the top position already contributes a gain of 1, and every further relevant item among the top K adds a positive discounted gain. For example, two relevant items at ranks 1 and 2 give 1 + 1/log2(3) ≈ 1.63.
* Can the average DCG score be higher than 1?<br>
$\quad$If we average the unnormalized DCG scores over the users, then yes, it can get higher.<br>
$\quad$If we consider the normalized version (nDCG), then no, it cannot get higher.
* Why?<br>
$\quad$For the average DCG: DCG is calculated for each user separately, so if enough users have a DCG above 1, the average over users exceeds 1 as well.<br>
$\quad$For the normalized DCG: nDCG = DCG / ideal DCG, and the DCG of any ranking never exceeds the ideal DCG, so the maximum value is 1.<br>
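
The two bounds discussed in these answers can be checked numerically on a single toy ranking. This is only an illustrative sketch; the helper `dcg` is hypothetical and works on a list of 0/1 relevance values for one user, not on the notebook's input format.

```python
import itertools
import numpy as np

def dcg(relevance):
    """DCG of one ranked list of 0/1 relevance values (hypothetical helper)."""
    relevance = np.asarray(relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(len(relevance)) + 2)
    return float((relevance * discounts).sum())

# Two relevant items at the top of the list: DCG = 1 + 1/log2(3), which is above 1.
two_hits = dcg([1, 1, 0, 0])
print(two_hits)

# The ideal ordering puts all relevant items first, so no permutation of the
# same relevance values can reach a higher DCG, and nDCG stays at or below 1.
ideal = dcg(sorted([1, 1, 0, 0], reverse=True))
all_ndcg = [dcg(p) / ideal for p in itertools.permutations([1, 1, 0, 0])]
print(max(all_ndcg))
```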

### nDCG Score

Following the same parameter convention as for DCG implement nDCG metric.

<font color='red'>**Attention!**</font> Remember that ideal DCG is calculated separately for each user and depends on the number of tracks held out for them as a test set! Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!

<font color='red'>**Note:**</font> nDCG is calculated for **every user separately** and then the average is returned. You do not necessarily need to use the function you implemented above. Writing nDCG from scratch might be a good idea as well.
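
To make the dependence of the ideal DCG on the number of held-out items concrete, here is a small sketch. The helper name `ideal_dcg` is an assumption introduced for illustration only; it assumes binary relevance and the same base-2 discount as above.

```python
import numpy as np

def ideal_dcg(n_relevant: int, topK: int = 10) -> float:
    """Best achievable DCG for a user with n_relevant held-out items (hypothetical helper)."""
    # At most topK positions exist, so at most min(n_relevant, topK) hits are possible.
    n_hits = min(n_relevant, topK)
    return float((1.0 / np.log2(np.arange(n_hits) + 2)).sum())

# The ideal DCG grows with the number of held-out items until topK is reached.
for n in (1, 2, 3):
    print(n, ideal_dcg(n, topK=10))
```

A user with a single held-out track has an ideal DCG of 1, so for such users nDCG and DCG coincide; for users with more held-out tracks the normalization matters.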

In [4]:
def get_ndcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user.
    test_interaction_matrix - np.ndarray - test interaction matrix for each user.
    topK - int - topK recommendations should be evaluated.
    
    returns - average ndcg score over all users.
    """
    score = None
    
    # TODO: YOUR IMPLEMENTATION.
    score = 0
    def get_no_interactions(test_interaction_matrix):
        to_del = []
        for i, row in enumerate(test_interaction_matrix):
            if np.all((row == 0)):
                to_del.append(i)
        return to_del
    
    to_delete = get_no_interactions(test_interaction_matrix)
    test_interaction_matrix = np.delete(test_interaction_matrix, to_delete, 0)
    predictions = np.delete(predictions, to_delete, 0)
    
    
    def get_local_score(user, n_relevant):
        # DCG of the recommended list and ideal DCG for a single user
        perfect_score = 0
        local_score = 0
        for rank in range(min(topK, predictions.shape[1])):
            discount = 1 / np.log2(2 + rank)  # rank 0 is not discounted
            if test_interaction_matrix[user, predictions[user, rank]]:
                local_score += discount
            if rank < n_relevant:
                perfect_score += discount
        return local_score, perfect_score
            
    for user in range(predictions.shape[0]):
        n_relevant = np.sum(test_interaction_matrix[user])
        local_score, perfect_score = get_local_score(user, n_relevant)
        score += local_score / perfect_score
        
    score /= predictions.shape[0]
    return score

In [5]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

ndcg_score = get_ndcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(ndcg_score, 1), "ndcg score is not correct."

* Can nDCG score be higher than 1?<br>
No. nDCG = DCG / ideal_DCG, and the DCG of any ranking never exceeds the ideal_DCG, so the maximum value is 1.

## <font color='red'>TASK 2/2</font>: Evaluation
Use the provided rec.py (see imports below) to build a simple evaluation framework. It should be able to evaluate POP, ItemKNN and SVD.

*Make sure to place the provided rec.py next to your notebook for the imports to work.*


In [6]:
from rec import svd_decompose, svd_recommend_to_list  #SVD
from rec import inter_matr_implicit
from rec import recTopK  #ItemKNN
from rec import recTopKPop  #TopPop

Load the users, items and both the train interactions and test interactions
from the **new version of the lfm-tiny dataset** provided with the assignment

In [7]:
def read(dataset, file):
    return pd.read_csv(dataset + '/' + dataset + '.' + file, sep='\t')

# TODO: YOUR IMPLEMENTATION

users = read('lfm-tiny', 'user')
items = read('lfm-tiny', 'item')
train_inters = read('lfm-tiny', 'inter_train')
test_inters = read('lfm-tiny', 'inter_test')

train_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=train_inters,
                                               dataset_name="lfm-tiny")
test_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=test_inters,
                                              dataset_name="lfm-tiny")
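
For orientation, the shape of an implicit interaction matrix (rows are users, columns are items, cells are 0 or 1) can be sketched on toy data. This is not the implementation of `inter_matr_implicit` from rec.py, which is not shown here; the column names `user_id` and `item_id` and the toy sizes are assumptions.

```python
import numpy as np
import pandas as pd

# Toy interactions table; the column names "user_id" and "item_id" are assumptions.
toy_inters = pd.DataFrame({"user_id": [0, 0, 1, 2],
                           "item_id": [1, 3, 0, 3]})
n_users, n_items = 3, 4

# Start from all zeros and set a 1 wherever a (user, item) interaction exists.
toy_matrix = np.zeros((n_users, n_items), dtype=int)
toy_matrix[toy_inters["user_id"].to_numpy(), toy_inters["item_id"].to_numpy()] = 1
print(toy_matrix)
```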

### Get Recommendations

Implement the function below to get recommendations from all 3 recommender algorithms. Make sure you use the provided config dictionary and pay attention to the structure for the output dictionary - we will use it later.

In [8]:
config_predict = {
    #interaction matrix
    "train_inter": train_interaction_matrix,
    #topK parameter used for all algorithms
    "top_k": 10,
    #specific parameters for all algorithms
    "recommenders": {
        "SVD": {
            "n_factors": 50
        },
        "ItemKNN": {
            "n_neighbours": 5
        },
        "TopPop": {
        }
    }
}

In [9]:
def get_recommendations_for_algorithms(config: dict) -> dict:
    """
    config - dict - configuration as defined above

    returns - dict - already predefined below with name "rec_dict"
    """

    #use this structure to return results
    rec_dict = {"recommenders": {
        "SVD": {
            #Add your predictions here
            "recommendations": np.array([])
        },
        "ItemKNN": {
            "recommendations": np.array([])
        },
        "TopPop": {
            "recommendations": np.array([])
        },
    }}
    
    # TODO: YOUR IMPLEMENTATION.
    train_inter_mat = config["train_inter"]
    users = list(range(train_inter_mat.shape[0]))
    
    svd_recs = np.full((train_inter_mat.shape[0], config['top_k']), -1)
    knn_recs = np.full((train_inter_mat.shape[0], config['top_k']), -1)
    top_pop_recs = np.full((train_inter_mat.shape[0], config['top_k']), -1)

    # SVD

    U, V = svd_decompose(train_inter_mat)
    def get_seen_items(inter_matrix_train):
        # for every user, collect the ids of the items already seen in training
        return [np.where(row == 1)[0].tolist() for row in inter_matrix_train]
    seen = get_seen_items(train_inter_mat)
    
    for user_id in users:
        svd_recs[user_id] = svd_recommend_to_list(user_id, seen[user_id], U, V, config["top_k"])
        knn_recs[user_id] = recTopK(train_inter_mat, user_id, config["top_k"], config["recommenders"]["ItemKNN"]["n_neighbours"])
        top_pop_recs[user_id] = recTopKPop(train_inter_mat, user_id, config["top_k"])
        
    rec_dict["recommenders"]["SVD"]["recommendations"] = svd_recs
    rec_dict["recommenders"]["ItemKNN"]["recommendations"] = knn_recs
    rec_dict["recommenders"]["TopPop"]["recommendations"] = top_pop_recs
    
    print(rec_dict["recommenders"]["SVD"]["recommendations"][0])
    print(rec_dict["recommenders"]["ItemKNN"]["recommendations"][0])
    print(rec_dict["recommenders"]["TopPop"]["recommendations"][0])
    return rec_dict

In [10]:
recommendations = get_recommendations_for_algorithms(config_predict)

assert "SVD" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["SVD"]
assert isinstance(recommendations["recommenders"]["SVD"]["recommendations"], np.ndarray)
assert "ItemKNN" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["ItemKNN"]
assert isinstance(recommendations["recommenders"]["ItemKNN"]["recommendations"], np.ndarray)
assert "TopPop" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["TopPop"]
assert isinstance(recommendations["recommenders"]["TopPop"]["recommendations"], np.ndarray)


[227  30  45 187 124 125 197 186 129 156]
[ 56  54  45  30  12  42  58  55  43 165]
[ 43  42 151 105  96  68 104  51 167 150]
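
One possible sanity check on such recommendation arrays is to count how many recommended items a user has already interacted with in the training data. The sketch below uses a hypothetical helper, `count_seen_recommendations`, and toy arrays. Whether every recommender in rec.py filters out seen items is not stated here, so a nonzero count is a prompt to investigate rather than proof of an error.

```python
import numpy as np

def count_seen_recommendations(recommendations: np.ndarray, train_inter: np.ndarray) -> np.ndarray:
    """Per user, count recommended items that already occur in the training interactions."""
    # Slots still holding the -1 placeholder are ignored.
    valid = recommendations >= 0
    safe_ids = np.where(valid, recommendations, 0)
    # Look up every recommended item id in the user's training row.
    seen = np.take_along_axis(train_inter, safe_ids, axis=1) > 0
    return (seen & valid).sum(axis=1)

# Toy data (made up): user 1 gets item 1 recommended although it is a training item.
toy_train = np.array([[1, 0, 0, 1],
                      [0, 1, 0, 0]])
toy_recs = np.array([[1, 2],
                     [1, 3]])
seen_counts = count_seen_recommendations(toy_recs, toy_train)
print(seen_counts)
```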


### Evaluate Recommendations

Implement the function such that it evaluates the previously generated recommendations. Make sure you use the provided config dictionary and pay attention to the structure for the output dictionary.

In [11]:
config_test = {
    "top_k": 10,
    "test_inter": test_interaction_matrix,
    "recommenders": {}  # here you can access the recommendations from get_recommendations_for_algorithms

}
# add dictionary with recommendations to config dictionary
config_test.update(recommendations)

In [12]:
def evaluate_algorithms(config: dict) -> dict:
    """
    config - dict - configuration as defined above

    returns - dict - { Recommender Key from input dict: { "ndcg": float - ndcg from evaluation for this recommender} }
    """

    metrics = {
        "SVD": {
            "ndcg": get_ndcg_score(config["recommenders"]["SVD"]["recommendations"], config["test_inter"], config["top_k"])
        },
        "ItemKNN": {
            "ndcg": get_ndcg_score(config["recommenders"]["ItemKNN"]["recommendations"], config["test_inter"], config["top_k"])
        },
        "TopPop": {
            "ndcg": get_ndcg_score(config["recommenders"]["TopPop"]["recommendations"], config["test_inter"], config["top_k"])
        },
    }

    # # TODO: YOUR IMPLEMENTATION.  
    # for alg in metrics:
    #     preds = config["recommenders"][alg]["recommendations"]
    #     metrics[alg]["ndcg"] = get_ndcg_score(preds, config["test_inter"], config["top_k"])


    # print(config_test)
    return metrics

### Evaluating Every Algorithm
Make sure everything works.
We expect KNN to outperform other algorithms on our small data sample.

In [13]:
evaluations = evaluate_algorithms(config_test)

assert "SVD" in evaluations and "ndcg" in evaluations["SVD"] and isinstance(evaluations["SVD"]["ndcg"], float)
assert "ItemKNN" in evaluations and "ndcg" in evaluations["ItemKNN"] and isinstance(evaluations["ItemKNN"]["ndcg"], float)
assert "TopPop" in evaluations and "ndcg" in evaluations["TopPop"] and isinstance(evaluations["TopPop"]["ndcg"], float)

In [14]:
for recommender in evaluations.keys():
    print(f"{recommender} ndcg: {evaluations[recommender]['ndcg']}")

SVD ndcg: 0.10828262931154427
ItemKNN ndcg: 0.19129278246326312
TopPop ndcg: 0.10324700052451588


## Questions and Potential Future Work
* How would you try to improve the performance of all three algorithms?<br>
$\quad$- KNN: We can incorporate more information about the items/users to enlarge the feature space of the recommender.<br>
$\quad$- SVD: We can add some form of regularization, such as L2, since it is highly prone to overfitting.<br>
$\quad$- General: Hyperparameter tuning.<br>
* What other metrics would you consider to compare these recommender systems?<br>
$\quad$- Mean average precision<br>
$\quad$- F1 score<br>
$\quad$- Diversity<br>
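
As an illustration of one of the metrics listed above, the following sketches precision@K, recall@K and F1@K under the same input convention as the nDCG function. The function name and the choice to skip users without test items are assumptions made for this sketch, not part of the assignment.

```python
import numpy as np

def precision_recall_f1_at_k(predictions, test_interaction_matrix, topK=10):
    """Mean precision@K, recall@K and F1@K over users with at least one test item."""
    top = predictions[:, :topK]
    # Number of recommended items per user that occur in the test set.
    hits = np.take_along_axis(test_interaction_matrix, top, axis=1).sum(axis=1)
    n_relevant = test_interaction_matrix.sum(axis=1)

    # Skip users without test items, as in the nDCG implementation.
    keep = n_relevant > 0
    hits, n_relevant = hits[keep], n_relevant[keep]

    precision = hits / top.shape[1]
    recall = hits / n_relevant
    denom = precision + recall
    # F1 is defined as 0 when both precision and recall are 0.
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom, dtype=float), where=denom > 0)
    return precision.mean(), recall.mean(), f1.mean()

toy_predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
toy_test = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])
p, r, f = precision_recall_f1_at_k(toy_predictions, toy_test, topK=2)
print(p, r, f)
```

Unlike nDCG, these metrics ignore the position of a hit within the top K, which is one reason to report them alongside a rank-aware metric rather than instead of it.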


In [15]:
# The end.