# project ai: Easer

by Michiel Téblick and thibaut Van Goethem

In this notebook we will look at the easer model proposed at https://dl.acm.org/doi/pdf/10.1145/3308558.3313710.

This model will be applied to a dataset from foods.com which containes a bunch of recipes with user ratings/reactions on them.




In [12]:
import math

import pandas as pd
import numpy as np
import scipy
from sklearn.model_selection import KFold
import time
import pickle
from scipy import sparse
import statistics as st

## Reading and preprocessing the data

In [16]:
df_train = pd.read_csv('../folds/fold_0/train.csv')
df_test = pd.read_csv('../folds/fold_0/test.csv')
df_validate = pd.read_csv('../folds/fold_0/validate.csv')
df = pd.concat([df_train, df_test, df_validate])
df.reset_index()

print("amount of interactions in the full dataset: ",len(df))
print("amount of recipes in the full dataset: ",len(df.recipe_id.unique()))
print("amount of users in the full dataset: ",len(df.user_id.unique()))

amount of interactions in the full dataset:  1132367
amount of recipes in the full dataset:  231637
amount of users in the full dataset:  226570


Set all ratings to 1 (even negative interactions are seen as interactions)

In [17]:
df.loc[:,'rating'] = 1

Below here are two ways to cut down on the amount of interactions that are used in this notebook
- The first one randomly removes x% of the users,
- The second one removes all user and recipes that have less than X amount of interaction containing them

We opted for the second form as this is more representative of how the models should be used due to the lower amount if recipes but more reactions per recipe. Also the second choice is a deterministic way of removing data, which the first one is not.
This does end up mostly giving slightly worse result compared to the first choice.

The reason we need to remove data is because a matrix inversion is done, which can not be done in a smart way.
Also the result of the inversion is not necessarily a sparse matrix so the full calculation needs to be done on dense matrices. This end up scaling O(n^3) in time complexity and O(n^2) for memory needed. n here is the amount of recipes.
So running on the full dataset would require more than 200gb of ram which we do not have.


### rescaling the id's
The recipes and users don't go from 0 to amount so if we were to put this in a matrix we would get empty columns and rows. This is not that handy so we reindex both the user_id and recipe_ids

This is a step we must not forget when entering the data in the model, as we also need to remap our input data using the same remapping that was used here

In [6]:
userSet = set(df['user_id'].to_list())
user_transform_dict = dict(map(reversed, enumerate(userSet)))
recipeSet = set(df['recipe_id'].to_list())
recipe_transform_dict = dict(map(reversed, enumerate(recipeSet)))
recipe_dict = dict(enumerate(recipeSet))

In [7]:
keep_nan_user = [k for k, v in user_transform_dict.items() if pd.isnull(v)]
keep_nan_recipe = [k for k, v in recipe_transform_dict.items() if pd.isnull(v)]


def transform_id(dataframe):
    tochange = dataframe['user_id']
    dataframe['user_id'] = tochange.map(user_transform_dict).fillna(tochange.mask(tochange.isin(keep_nan_user)))

    tochange = dataframe['recipe_id']
    dataframe['recipe_id'] = tochange.map(recipe_transform_dict).fillna(tochange.mask(tochange.isin(keep_nan_recipe)))
    return dataframe

def open_csv(filename):
    df = pd.read_csv(filename)
    transform_id(df)
    df.loc[:,'rating'] = 1
    df.drop('review', axis=1, inplace=True)
    df.drop('date', axis=1, inplace=True)
    return df


### Creation of the folds


In [8]:
k = 10
folds = list()
for directory in ["../folds/fold_%d" % i for i in range(k)]:
    folds.append((directory + "/test.csv", directory + "/train.csv", directory + "/validate.csv"))

## Creation model
Here we define the models used for the experiments. Both the easer predictor and a populaliry predictor are created. the popularity predictor is used as a baseline

In [9]:
class popularity:
    def __init__(self):
        pass
    def train(self, data):
        data = data.sort_values('count_user',ascending=False)
        self.pop = data[data.columns[1]].to_numpy()
    def predict(self):
        return self.pop

In [10]:
class Easer:
    def __init__(self):
        pass

    def train(self, X_train, lambda_=1250):
        #Code here is a modified version of the code provided in the paper
        self.X = X_train

        G = X_train.T.dot(X_train)
        G = G.toarray()
        diagIndices = np.diag_indices(G.shape[0])
        G[diagIndices] += lambda_
        P = scipy.linalg.inv(G)
        div = -np.diag(P)
        self.B = P / div
        self.B[diagIndices] = 0

        self.pred = self.X * self.B

    def predicts(self, xu):
        return xu * self.B

## training models + k-fold validation


In [11]:
#Please enter the path here of where you will place the pickle files (with trailing /)
data_path="../results_aiproject/"
for f_idx, fold_files in enumerate(folds):
    start = time.time()
    train_data = open_csv(fold_files[0])
    ratings = train_data.rating
    idx = (train_data.user_id, train_data.recipe_id)
    #Here we have the user item matrix
    X_train = sparse.csc_matrix((ratings, idx), shape=(len(df.user_id.unique()), len(df.recipe_id.unique())),
                                dtype=float)
    #train models
    model_pop=popularity()
    model_pop.train(train_data)
    model = Easer()
    model.train(X_train)
    print("done fold:",str(f_idx))

    end = time.time()
    print("training took : ", end - start, "s")

    #Dump data for later usage (note that the easer model files end up being pretty large (approx 5gb each))
    # datafile = open(data_path+"data_fold" + str(f_idx) + ".pkl", mode='wb')
    # pickle.dump(fold_data, datafile)
    modelfile = open(data_path+"model_fold" + str(f_idx) + ".pkl", mode='wb')
    modelpopfile = open(data_path+"model_pop_fold" + str(f_idx) + ".pkl", mode='wb')
    pickle.dump(model, modelfile)
    pickle.dump(model_pop, modelpopfile)
    # datafile.close()
    modelfile.close()
    modelpopfile.close()


done fold: 0
training took :  114.88806819915771 s
done fold: 1
training took :  113.40782117843628 s
done fold: 2
training took :  112.22119212150574 s
done fold: 3
training took :  114.10429406166077 s
done fold: 4
training took :  111.17725205421448 s
done fold: 5
training took :  111.97335028648376 s
done fold: 6
training took :  107.35915303230286 s
done fold: 7
training took :  113.66571307182312 s
done fold: 8
training took :  110.96520709991455 s
done fold: 9
training took :  113.2748670578003 s


## Evaluation results of the folds

Here we use recall@20, recal@50 and ndcg@100

In [12]:
K = 20
K2 = 50

In [13]:
result_list_K = list()
result_list_K2 = list()
result_ndcg = list()
for i in range(k):

    #Evaluate recall@k
    #Do elementwise multiplication of top K predicts and true interactions

    data = pickle.load(open(data_path+"data_fold"+str(i)+".pkl", mode='rb'))
    model = pickle.load(open(data_path+"model_fold"+str(i)+".pkl", mode='rb'))

    test_data = data[1]
    predict_data = data[0]
    total = len(test_data)

    ratings = predict_data.rating
    idx = (predict_data.user_id, predict_data.recipe_id)
    X_train = sparse.csc_matrix((ratings, idx), shape=(len(df.user_id.unique()), len(df.recipe_id.unique())), dtype=float)
    y_pred = model.pred

    ratings_test = test_data.rating
    idx_test = (test_data.user_id, test_data.recipe_id)
    X_test = sparse.csc_matrix((ratings_test, idx_test), shape=(len(df.user_id.unique()), len(df.recipe_id.unique())), dtype=np.single)

    interacted_recipes = (X_train == 1).toarray()
    y_pred[interacted_recipes] = -100000
    idx_top_scores = (-y_pred).argsort()[:,:100]
    dense_X_test = X_test.toarray()

    correct_K = 0
    correct_K2 = 0
    ndcg = 0

    for idx, row in enumerate(idx_top_scores):
        for rank, index in enumerate(row):
            if dense_X_test[idx][index] == 1:
                if rank < K:
                    correct_K += 1
                if rank < K2:
                    correct_K2 += 1
                ndcg += 1/(math.log2(rank+2))

    result_list_K.append(correct_K / total)
    result_list_K2.append(correct_K2 / total)
    result_ndcg.append(ndcg / total)

    print("easer fold: %s, recall@%s = %s" % (str(i), str(K), str(correct_K / total)))
    print("easer fold: %s, recall@%s = %s" % (str(i), str(K2), str(correct_K2 / total)))
    print("easer fold: %s, ndcg@%s = %s" % (str(i), 100, str(ndcg / total)), end="\n\n")

print("mean recall@%s over 10 folds: " % str(K), str(st.mean(result_list_K)))
print("mean recall@%s over 10 folds: " % str(K2), str(st.mean(result_list_K2)))
print("mean ndcg@%s over 10 folds: " % str(100), str(st.mean(result_ndcg)), end="\n\n")
print("standard deviation recall@%s over 10 folds: " % str(K), str(st.pstdev(result_list_K)))
print("standard deviation recall@%s over 10 folds: " % str(K2), str(st.pstdev(result_list_K2)))
print("standard deviation ndcg@%s over 10 folds: " % str(100), str(st.pstdev(result_ndcg)))

easer fold: 0, recall@20 = 0.043204387135022976
easer fold: 0, recall@50 = 0.07756533768094462
easer fold: 0, ndcg@100 = 0.030759881131819704

easer fold: 1, recall@20 = 0.04449000765791359
easer fold: 1, recall@50 = 0.0783083421852227
easer fold: 1, ndcg@100 = 0.03136378175091975

easer fold: 2, recall@20 = 0.044020651663743485
easer fold: 2, recall@50 = 0.07633210642029595
easer fold: 2, ndcg@100 = 0.03070206895817723

easer fold: 3, recall@20 = 0.044786443022652604
easer fold: 3, recall@50 = 0.07583804747906425
easer fold: 3, ndcg@100 = 0.030974543864755493

easer fold: 4, recall@20 = 0.0443170870284825
easer fold: 4, recall@50 = 0.07648032410266545
easer fold: 4, ndcg@100 = 0.030936003172855706

easer fold: 5, recall@20 = 0.04416886934611299
easer fold: 5, recall@50 = 0.07564042390257157
easer fold: 5, ndcg@100 = 0.030968373260119326

easer fold: 6, recall@20 = 0.04426768113435933
easer fold: 6, recall@50 = 0.07522047380252464
easer fold: 6, ndcg@100 = 0.030627769331440992

easer f

In [14]:
#recall score for popularity
result_list_pop_K=list()
result_list_pop_K2=list()
result_list_pop_ndcg=list()
for i in range(k):
    data = pickle.load(open(data_path+"data_fold"+str(i)+".pkl", mode='rb'))
    model = pickle.load(open(data_path+"model_pop_fold"+str(i)+".pkl", mode='rb'))
    test_data = data[1]
    predict_data = data[0]
    pop=model.predict()
    total = 0
    correct_K = 0
    correct_K2 = 0
    ndcg = 0
    for idx, interaction in test_data.iterrows():
        user = interaction['user_id']
        user_data = predict_data.loc[(predict_data['user_id'] == user)]
        already_interacted_recipes = user_data[user_data.columns[1]].to_numpy()
        newpop = pop[:150]
        newpop = newpop[~np.in1d(newpop,already_interacted_recipes)]
        newpop_K = newpop[:K]
        newpop_K2 = newpop[:K2]
        newpop_ndcg = newpop[:100]
        recipe = interaction['recipe_id']
        if recipe in newpop_K:
            correct_K += 1
        if recipe in newpop_K2:
            correct_K2 += 1
        if recipe in newpop_ndcg:
            ndcg += 1/(math.log2(np.where(newpop_ndcg == recipe)[0]+2))
        total += 1
    result_list_pop_K.append(correct_K / total)
    result_list_pop_K2.append(correct_K2 / total)
    result_list_pop_ndcg.append(ndcg / total)
    print("popularity fold: %s, recall@%s = %s" % (str(i),str(K), str(correct_K / total)))
    print("popularity fold: %s, recall@%s = %s" % (str(i),str(K2), str(correct_K2 / total)))
    print("popularity fold: %s, ndcg@%s = %s" % (str(i),str(100), str(ndcg / total)), end="\n\n")

print("mean recall@%s over 10 folds: " % str(K), str(st.mean(result_list_pop_K)))
print("mean recall@%s over 10 folds: " % str(K2), str(st.mean(result_list_pop_K2)))
print("mean ndcg@%s over 10 folds: " % str(100), str(st.mean(result_list_pop_ndcg)), end="\n\n")
print("standard deviation recall@%s over 10 folds: " % str(K), str(st.pstdev(result_list_pop_K)))
print("standard deviation recall@%s over 10 folds: " % str(K2), str(st.pstdev(result_list_pop_K2)))
print("standard deviation ndcg@%s over 10 folds: " % str(100), str(st.pstdev(result_list_pop_ndcg)))

popularity fold: 0, recall@20 = 0.002396126673583321
popularity fold: 0, recall@50 = 0.0036559458524776443
popularity fold: 0, ndcg@100 = 0.0016267585221498146

popularity fold: 1, recall@20 = 0.0010375237765865469
popularity fold: 1, recall@50 = 0.004866480571132136
popularity fold: 1, ndcg@100 = 0.0016943386433841108

popularity fold: 2, recall@20 = 0.0008151972530322868
popularity fold: 2, recall@50 = 0.002272671129665769
popularity fold: 2, ndcg@100 = 0.0013136193129889022

popularity fold: 3, recall@20 = 0.0008399002000938712
popularity fold: 3, recall@50 = 0.0017292062943109112
popularity fold: 3, ndcg@100 = 0.0007970803395989333

popularity fold: 4, recall@20 = 0.0013586620883871447
popularity fold: 4, recall@50 = 0.0026679182826511204
popularity fold: 4, ndcg@100 = 0.001116874704603834

popularity fold: 5, recall@20 = 0.0012104444060176379
popularity fold: 5, recall@50 = 0.002371482917912107
popularity fold: 5, ndcg@100 = 0.0011670027158484017

popularity fold: 6, recall@20 = 0

The next section is a demonstration that selects a random user and makes a recommendation prediction for this user.

In [15]:
import random
# read recipe data and load pre-trained model
df_recipes = pd.read_csv('../data/RAW_recipes.csv')
df_recipes.drop(['minutes', 'contributor_id', 'submitted', 'tags',
                 'nutrition', 'n_steps', 'steps', 'description', 'n_ingredients'], axis=1, inplace=True)
data = pickle.load(open(data_path+"data_fold0.pkl", mode='rb'))
model = pickle.load(open(data_path+"model_fold0.pkl", mode='rb'))
predict_data = data[0]
ratings = predict_data.rating
idx = (predict_data.user_id, predict_data.recipe_id)
x_train = sparse.csc_matrix((ratings, idx), shape=(len(df.user_id.unique()), len(df.recipe_id.unique())), dtype=float)

# get random user and make prediction
random_user = x_train.getrow(random.randint(0, len(df.user_id.unique())))
prediction = model.predicts(random_user)[0]
interacted_recipes = []
for recipe_id in random_user.indices:
    interacted_recipes.append(recipe_dict[recipe_id])
    prediction[recipe_id] = -100000


top_index = (-prediction).argsort()[:10]
recommended_recipes = []
for recipe_id in top_index:
    recommended_recipes.append(recipe_dict[recipe_id])

# get interacted recipes and recommended recipes
user_interactions = df_recipes[df_recipes['id'].isin(interacted_recipes)].drop('id', axis=1)
user_recommendations = df_recipes[df_recipes['id'].isin(recommended_recipes)].drop('id', axis=1)

display(user_interactions)
display(user_recommendations)


Unnamed: 0,name,ingredients
76978,easy lemon dijon shrimp scampi,"['large shrimp', 'fresh garlic', 'butter', 'dr..."
115988,just like loaded baked potatoes casserole,"['cauliflower', 'sour cream', 'shredded chedda..."
125954,low carb breaded chicken,"['boneless skinless chicken breasts', 'egg', '..."
158185,pepsi pork roast,"['pork shoulder butt', 'pepsi', 'cream of mush..."
188871,slow cooker chicken stew,"['boneless skinless chicken breast', 'boneless..."


Unnamed: 0,name,ingredients
62431,creamy cajun chicken pasta,"['boneless skinless chicken breast halves', 'l..."
66687,crock pot chicken with black beans cream cheese,"['boneless chicken breasts', 'black beans', 'c..."
98463,grilled cheese diner style,"['bread', 'american cheese', 'mayonnaise']"
114444,japanese mum s chicken,"['chicken drumsticks', 'water', 'balsamic vine..."
115303,jo mama s world famous spaghetti,"['italian sausage', 'onion', 'garlic cloves', ..."
126146,low carb pizza,"['cream cheese', 'eggs', 'heavy cream', 'parme..."
126216,low carb taco bake,"['cream cheese', 'eggs', 'heavy cream', 'taco ..."
163319,pork chops yum yum,"['pork chops', 'chicken broth', 'honey', 'soy ..."
165494,poverty meal,"['onion', 'garlic cloves', 'olive oil', 'groun..."
213826,to die for crock pot roast,"['beef roast', 'brown gravy mix', 'dried itali..."
