# Project AI: Easer

by Michiel Téblick and Thibaut Van Goethem

In this notebook we look at the Easer model proposed in https://dl.acm.org/doi/pdf/10.1145/3308558.3313710.

The model is applied to a dataset from foods.com, which contains recipes together with user ratings and reactions.




In [6]:
import math

import pandas as pd
import numpy as np
import scipy.linalg  # explicit submodule import so scipy.linalg.inv is available
from sklearn.model_selection import KFold
import time
import pickle
from scipy import sparse
import statistics as st

## Reading and preprocessing the data

In [7]:
use_less_data = False


df_train = pd.read_csv('../folds/fold_0/train.csv')
df_test = pd.read_csv('../folds/fold_0/test.csv')
df_validate = pd.read_csv('../folds/fold_0/validate.csv')
df = pd.concat([df_train, df_test, df_validate])

print("amount of interactions in the full dataset: ",len(df))
print("amount of recipes in the full dataset: ",len(df.recipe_id.unique()))
print("amount of users in the full dataset: ",len(df.user_id.unique()))

if use_less_data:
    df = df[df['count_user'] >= 4]
    print("amount of recipes in the smaller dataset: ",len(df.recipe_id.unique()))
    print("amount of users in the smaller dataset: ",len(df.user_id.unique()))
df.reset_index()


amount of interactions in the full dataset:  733951
amount of recipes in the full dataset:  80511
amount of users in the full dataset:  32635
amount of recipes in the smaller dataset:  80494
amount of users in the smaller dataset:  24716


Unnamed: 0,index,user_id,recipe_id,date,rating,review,count_user,count_item
0,0,56680,79222,2006-11-11,5,"Oh, This was wonderful! Had a soup and salad ...",174,18
1,1,827374,79222,2010-11-29,3,We made this last night and really enjoyed it....,10,18
2,2,462571,208980,2007-07-05,5,These were a snap to whip up and were fantasti...,87,8
3,3,222139,208980,2007-09-08,5,I chose this recipe from Fall Pick A Chef. Co...,446,8
4,4,423539,342209,2009-06-02,5,These are lovely cookies not bland at all. \r\...,35,8
...,...,...,...,...,...,...,...,...
710189,151908,96177,196735,2009-01-12,5,We just loved these tatters. Quick easy and ve...,563,17
710190,151909,573325,196735,2010-09-08,5,"What a great, healthy, easy and yummy recipe!<...",880,17
710191,151910,203111,213546,2007-06-28,5,Very good potatoes! I served them with fried ...,217,6
710192,151911,41468,82303,2006-09-01,5,WOW this was great. What I love the most is th...,30,13


Set all ratings to 1: every rating, even a negative one, is treated as an interaction.

In [8]:
df.loc[:,'rating'] = 1

Below are two ways to cut down on the number of interactions used in this notebook:
- The first randomly removes x% of the users.
- The second removes all users and recipes that appear in fewer than X interactions.

We opted for the second approach because it is more representative of how the models should be used: there are fewer recipes, but more reactions per recipe. The second approach is also deterministic, which the first is not.
It does, however, mostly give slightly worse results than the first approach.
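
The two options above can be sketched on a toy interaction log. This is a minimal illustration, not the notebook's own filtering code: the helper names `drop_random_users` and `drop_rare` are hypothetical, and the counts are computed in a single pass before anything is removed.

```python
import numpy as np
import pandas as pd

# Toy interaction log; the column names mirror the ones in the dataset.
toy = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2, 3],
    "recipe_id": [10, 11, 12, 10, 11, 10],
})

def drop_random_users(frame, fraction, seed=0):
    """Option 1: drop a random fraction of the users (result depends on the seed)."""
    rng = np.random.default_rng(seed)
    users = frame["user_id"].unique()
    n_drop = int(len(users) * fraction)
    dropped = rng.choice(users, size=n_drop, replace=False)
    return frame[~frame["user_id"].isin(dropped)]

def drop_rare(frame, min_count):
    """Option 2: keep rows whose user and recipe both have at least min_count interactions."""
    user_counts = frame.groupby("user_id")["recipe_id"].transform("count")
    item_counts = frame.groupby("recipe_id")["user_id"].transform("count")
    return frame[(user_counts >= min_count) & (item_counts >= min_count)]

filtered = drop_rare(toy, min_count=2)
```

Here `drop_rare` always returns the same rows for the same input, whereas `drop_random_users` only does so when the seed is fixed.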

We need to remove data because the model requires a matrix inversion, for which no shortcut is available.
The result of the inversion is also not necessarily sparse, so the full calculation has to be done on dense matrices. This ends up scaling as O(n^3) in time and O(n^2) in memory, where n is the number of recipes.
Running on the full dataset would therefore require more than 200 GB of RAM, which we do not have.
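
As a rough back-of-the-envelope check of the memory figure, the sketch below computes the size of one dense n x n float64 matrix for the recipe count printed above. The number of matrices alive at the same time (`assumed_copies`) is an assumption made for illustration, not something measured in this notebook.

```python
def dense_matrix_bytes(n, bytes_per_value=8):
    """Memory needed for one dense n x n matrix of float64 values."""
    return n * n * bytes_per_value

n_recipes = 80511  # recipe count printed for the full dataset
one_matrix_gb = dense_matrix_bytes(n_recipes) / 1e9

# Assumption: about four n x n buffers exist at the peak
# (Gram matrix, its inverse, the weight matrix and solver workspace).
assumed_copies = 4
peak_gb = assumed_copies * one_matrix_gb

print(f"one matrix: {one_matrix_gb:.1f} GB, assumed peak: {peak_gb:.1f} GB")
```

A single matrix already takes roughly 52 GB, so a handful of simultaneous copies is enough to pass the 200 GB mark.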


### Rescaling the ids
The recipe and user ids are not contiguous (they do not run from 0 to the number of recipes or users), so putting them in a matrix directly would leave empty rows and columns. To avoid this, we reindex both user_id and recipe_id.

We must not forget this step when feeding data to the model: any input data has to be remapped with the same mapping that was built here.
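
A minimal sketch of such a remapping on a few made-up ids is shown here. The names `id_to_index` and `index_to_id` are hypothetical, and sorting the unique ids is a choice made for this example so that the mapping is easy to reproduce.

```python
import pandas as pd

raw_ids = pd.Series([56680, 827374, 56680, 462571])

# Build a contiguous index for every distinct id (original id -> new index).
id_to_index = {raw: idx for idx, raw in enumerate(sorted(raw_ids.unique()))}
# Inverse mapping, needed to translate predictions back to original ids.
index_to_id = {idx: raw for raw, idx in id_to_index.items()}

remapped = raw_ids.map(id_to_index)

# New data must go through the same dictionary; ids it has never seen become NaN.
new_ids = pd.Series([462571, 999999]).map(id_to_index)
```

Ids that are missing from the dictionary show up as NaN after `map`, which makes it easy to spot input data that was not covered by the original mapping.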

In [9]:
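# Map every original user id to a contiguous index (original id -> new index)
userSet = set(df['user_id'].to_list())
user_transform_dict = dict(map(reversed, enumerate(userSet)))
# Same for the recipes
recipeSet = set(df['recipe_id'].to_list())
recipe_transform_dict = dict(map(reversed, enumerate(recipeSet)))
# Inverse mapping for the recipes (new index -> original id)
recipe_dict = dict(enumerate(recipeSet))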

In [10]:
keep_nan_user = [k for k, v in user_transform_dict.items() if pd.isnull(v)]
keep_nan_recipe = [k for k, v in recipe_transform_dict.items() if pd.isnull(v)]


def transform_id(dataframe):
    tochange = dataframe['user_id']
    dataframe['user_id'] = tochange.map(user_transform_dict).fillna(tochange.mask(tochange.isin(keep_nan_user)))

    tochange = dataframe['recipe_id']
    dataframe['recipe_id'] = tochange.map(recipe_transform_dict).fillna(tochange.mask(tochange.isin(keep_nan_recipe)))
    return dataframe

def open_csv(filename, use_less_data=False):
    df = pd.read_csv(filename)
    if use_less_data:
        df = df[df['count_user'] >= 4]
    df = transform_id(df)
    df.loc[:,'rating'] = 1
    df.drop('review', axis=1, inplace=True)
    return df


### Creation of the folds


In [11]:
k = 10
folds = list()
for directory in ["../folds/fold_%d" % i for i in range(k)]:
    folds.append(( directory + "/train.csv", directory + "/validate.csv",directory + "/test.csv"))

## Creating the models
Here we define the models used for the experiments: the Easer predictor and a popularity predictor. The popularity predictor is used as a baseline.

In [12]:
def split_test(data_set):
    # Hold out each user's most recent interaction as the ground truth
    ground_truth = data_set.sort_values('date').groupby('user_id').tail(1)
    # Every row that was not held out is used as model input
    predict = pd.concat([data_set, ground_truth]).drop_duplicates(keep=False)
    return predict, ground_truth

def data_frame_to_matrix(dataframe):
    ratings = dataframe.rating
    # Rows are recipes and columns are users; the shape comes from the global df
    idx = (dataframe.recipe_id, dataframe.user_id)
    return sparse.csc_matrix((ratings, idx), shape=(len(df.recipe_id.unique()), len(df.user_id.unique())),
                                dtype=float)

In [13]:
class popularity:
    def __init__(self):
        pass
    def train(self, data):
        data = data.sort_values('count_user',ascending=False)
        self.pop = data[data.columns[1]].to_numpy()
    def predict(self):
        return self.pop

In [14]:
class Easer:
    def __init__(self):
        pass

    def train(self, X_train, lambda_=1250):
        #Code here is a modified version of the code provided in the paper

        G = X_train.T.dot(X_train)
        G = G.toarray()
        diagIndices = np.diag_indices(G.shape[0])
        G[diagIndices] += lambda_
        P = scipy.linalg.inv(G)
        del G
        div = -np.diag(P)
        self.B = P / div
        self.B[diagIndices] = 0

    def predict(self, xu):
        # Matrix product (not element-wise): one row of scores per row of xu
        return xu @ self.B
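
To make the closed-form training step easier to follow, here is a small dense sketch of the same computation on a toy matrix. It assumes users on the rows and items on the columns, as in the paper; `data_frame_to_matrix` above puts recipes on the rows, so the orientation differs from the notebook's. The helper name `ease_weights` and the value of `lambda_` are chosen only for this illustration.

```python
import numpy as np

def ease_weights(X, lambda_):
    """Closed-form item-item weights for a dense users x items matrix X."""
    G = X.T @ X                                # item-item Gram matrix
    G[np.diag_indices(G.shape[0])] += lambda_  # L2 regularisation on the diagonal
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                      # divide column j by -P[j, j]
    B[np.diag_indices(B.shape[0])] = 0.0       # an item may not predict itself
    return B

# Three users, three items; users 0 and 1 both interacted with items 0 and 1.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
B = ease_weights(X, lambda_=0.5)
scores = X @ B  # row u holds the score of every item for user u
```

In this toy case items 0 and 1 always occur together, so the weight from item 0 to item 1 comes out larger than the weight from item 0 to item 2.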


In [15]:
K = 20
K2 = 50

def recal_easer(model, predict_data, test_data):
    total = len(test_data)

    X_train = data_frame_to_matrix(predict_data)
    y_pred = model.predict(X_train)
    y_pred = y_pred.T

    X_test = data_frame_to_matrix(test_data)
    X_test = X_test.T

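    # Mask recipes the user already interacted with so they are never recommended
    interacted_recipes = (X_train.T == 1).toarray()
    y_pred[interacted_recipes] = -100000
    # Keep the indices of the 100 highest-scoring recipes per user
    idx_top_scores = (-y_pred).argsort()[:,:100]
    dense_X_test = X_test.toarray()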

    correct_K = 0
    correct_K2 = 0
    ndcg = 0

    for idx, row in enumerate(idx_top_scores):
        for rank, index in enumerate(row):
            if dense_X_test[idx][index] == 1:
                if rank < K:
                    correct_K += 1
                if rank < K2:
                    correct_K2 += 1
                ndcg += 1/(math.log2(rank+2))

    # print("easer recall@%s = %s" % (str(K), str(correct_K / total)))
    # print("easer recall@%s = %s" % (str(K2), str(correct_K2 / total)))
    # print("easer ndcg@%s = %s" % (100, str(ndcg / total)), end="\n\n")

    return correct_K/total, correct_K2/total, ndcg/total
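
The per-user logic of the loop above can be illustrated for a single user. Because exactly one interaction is held out per user, the ideal DCG is 1 and the NDCG reduces to 1 / log2(rank + 2) for a 0-based rank. The helper `hit_metrics` is a hypothetical name used only in this sketch.

```python
import math

def hit_metrics(ranked_items, held_out_item, k):
    """Recall@k and NDCG for one user with a single held-out item."""
    if held_out_item not in ranked_items:
        return 0.0, 0.0
    rank = ranked_items.index(held_out_item)  # 0-based position in the ranking
    recall = 1.0 if rank < k else 0.0
    ndcg = 1.0 / math.log2(rank + 2)          # ideal DCG is 1 for one relevant item
    return recall, ndcg

# The held-out item "c" sits at 0-based rank 2, outside the top 2.
recall, ndcg = hit_metrics(["a", "b", "c", "d"], held_out_item="c", k=2)
```

Averaging these per-user values over all users gives the recall and NDCG figures that the function returns.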

## Training the models and evaluation


In [None]:
# Path where the pickle files will be placed (with trailing /)
data_path="../results_aiproject/"
result_list_K = list()
result_list_K2 = list()
result_ndcg = list()

for f_idx, fold_files in enumerate(folds):
    start = time.time()
    train_data = open_csv(fold_files[0], use_less_data)
    #Here we have the user item matrix
    interactions, ground_truth = split_test(train_data)
    X_train = data_frame_to_matrix(interactions)

    #train models

    model = Easer()
    model.train(X_train, lambda_=1250)

    recall20, recall50, ndcg = recal_easer(model, interactions, ground_truth)

    result_list_K.append(recall20)
    result_list_K2.append(recall50)
    result_ndcg.append(ndcg)

    print("done fold:",str(f_idx))

    print("easer fold: %s, recall@%s = %s" % (str(f_idx), str(K), recall20))
    print("easer fold: %s, recall@%s = %s" % (str(f_idx), str(K2), recall50))
    print("easer fold: %s, ndcg@%s = %s" % (str(f_idx), 100, ndcg), end="\n\n")

    end = time.time()
    print("training took : ", end - start, "s")
    break  # stops after the first fold; remove this line to evaluate all k folds

n_folds = len(result_list_K)  # number of folds that were actually evaluated
print("mean recall@%s over %d folds: " % (str(K), n_folds), str(st.mean(result_list_K)))
print("mean recall@%s over %d folds: " % (str(K2), n_folds), str(st.mean(result_list_K2)))
print("mean ndcg@%s over %d folds: " % (str(100), n_folds), str(st.mean(result_ndcg)), end="\n\n")
print("standard deviation recall@%s over %d folds: " % (str(K), n_folds), str(st.pstdev(result_list_K)))
print("standard deviation recall@%s over %d folds: " % (str(K2), n_folds), str(st.pstdev(result_list_K2)))
print("standard deviation ndcg@%s over %d folds: " % (str(100), n_folds), str(st.pstdev(result_ndcg)))
