# Project AI: Easer

by Michiel Téblick and Thibaut Van Goethem

In this notebook we will look at the easer model proposed at https://dl.acm.org/doi/pdf/10.1145/3308558.3313710.

This model is applied to a dataset from foods.com that contains recipes together with user ratings and reviews.




In [1]:
import math

import pandas as pd
import numpy as np
import scipy
from sklearn.model_selection import KFold
import time
import pickle
from scipy import sparse
import statistics as st

## Reading and preprocessing the data

In [2]:
use_less_data = True

df_train = pd.read_csv('../folds/fold_0/train.csv')
df_test = pd.read_csv('../folds/fold_0/test.csv')
df_validate = pd.read_csv('../folds/fold_0/validate.csv')
full_df = pd.concat([df_train, df_test, df_validate])
full_df.reset_index()
full_df.loc[:, 'rating'] = 1
#df.reset_index()
df=full_df

print("amount of interactions in the full dataset: ", len(df))
print("amount of recipes in the full dataset: ", len(df.recipe_id.unique()))
print("amount of users in the full dataset: ", len(df.user_id.unique()))

if use_less_data:
    lesser_cuttoff=15
    df = df[df['count_item'] >= lesser_cuttoff]
    # df_smaller.reset_index()
    print("amount of recipes in the smaller dataset: ", len(df.recipe_id.unique()))
    print("amount of users in the smaller dataset: ", len(df.user_id.unique()))
df.reset_index()


amount of interactions in the full dataset:  733951
amount of recipes in the full dataset:  80511
amount of users in the full dataset:  32635
amount of recipes in the smaller dataset:  9030
amount of users in the smaller dataset:  31075


| | index | user_id | recipe_id | date | rating | review | count_user | count_item |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 56680 | 79222 | 2006-11-11 | 1 | Oh, This was wonderful! Had a soup and salad ... | 174 | 18 |
| 1 | 1 | 827374 | 79222 | 2010-11-29 | 1 | We made this last night and really enjoyed it.... | 10 | 18 |
| 2 | 21 | 89831 | 33096 | 2004-03-15 | 1 | Merlot...this is the second time that I made y... | 2572 | 27 |
| 3 | 22 | 231054 | 33096 | 2007-10-25 | 1 | I love this -- and idea behind it. I'm sure y... | 266 | 27 |
| 4 | 23 | 470894 | 33096 | 2010-04-27 | 1 | so simple to put together and very refreshing.... | 13 | 27 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 347051 | 151892 | 315805 | 55438 | 2011-12-29 | 1 | I had a ham bone leftover from Christmas dinne... | 46 | 17 |
| 347052 | 151900 | 1423741 | 39902 | 2012-07-25 | 1 | I switched out the American Cheese for Sharp C... | 7 | 15 |
| 347053 | 151907 | 422893 | 196735 | 2009-01-02 | 1 | Yum, Yum, I love potatoes with Rosemary & this... | 1130 | 17 |
| 347054 | 151908 | 96177 | 196735 | 2009-01-12 | 1 | We just loved these tatters. Quick easy and ve... | 563 | 17 |


Set all ratings to 1: every interaction, even a negative one, counts as an interaction.

In [3]:
df.loc[:, 'rating'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)


Below are two ways to cut down the number of interactions used in this notebook:
- The first randomly removes x% of the users.
- The second removes all users and recipes that appear in fewer than X interactions.

We opted for the second approach, as it is more representative of how the models should be used: there are fewer recipes, but more reactions per recipe. The second approach is also deterministic, which the first is not.
It does, however, mostly give slightly worse results than the first approach.
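
The two reduction strategies can be sketched on plain `(user, recipe)` pairs. This is a minimal illustration and not the notebook's code: the function names, the `seed` parameter and the toy data are hypothetical. The second function mirrors the `count_item` cutoff that the notebook applies.

```python
import random
from collections import Counter


def drop_random_users(interactions, fraction, seed=0):
    """Option 1: drop a random fraction of the users (result depends on the seed)."""
    users = sorted({user for user, _ in interactions})
    rng = random.Random(seed)
    dropped = set(rng.sample(users, int(len(users) * fraction)))
    return [(u, r) for u, r in interactions if u not in dropped]


def drop_rare_recipes(interactions, min_count):
    """Option 2: keep only recipes with at least min_count interactions (deterministic)."""
    counts = Counter(recipe for _, recipe in interactions)
    return [(u, r) for u, r in interactions if counts[r] >= min_count]


# Hypothetical toy data: (user_id, recipe_id) pairs
toy = [(1, "a"), (2, "a"), (3, "a"), (1, "b"), (2, "c"), (4, "c")]
print(drop_rare_recipes(toy, min_count=2))   # recipe "b" has one interaction and is removed
print(drop_random_users(toy, fraction=0.5))  # half of the users are removed
```

The count-based filter always returns the same subset for the same input, whereas the random filter changes with the seed.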

We need to remove data because the model requires a matrix inversion, which cannot exploit the sparsity of the data.
The result of the inversion is not necessarily a sparse matrix, so the full calculation has to be done on dense matrices. This ends up scaling as O(n^3) in time and O(n^2) in memory, where n is the number of recipes.
Running on the full dataset would therefore require more than 200 GB of RAM, which we do not have.
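
To get a feeling for these numbers, the sketch below estimates the size of a single dense n x n matrix and the relative cost of an O(n^3) inversion. It assumes 8-byte floats; the helper names are hypothetical. Several such matrices (the Gram matrix, its inverse and the weight matrix) can be alive at the same time, so the real peak is a multiple of the single-matrix figure.

```python
def dense_matrix_gb(n_items, bytes_per_value=8):
    """Memory in GB (10^9 bytes) of one dense n_items x n_items matrix."""
    return n_items ** 2 * bytes_per_value / 1e9


def relative_inversion_cost(n_small, n_large):
    """How many times more work an O(n^3) inversion needs for n_large than for n_small."""
    return (n_large / n_small) ** 3


full, reduced = 80511, 9030  # recipe counts of the full and the smaller dataset

print(f"one dense matrix, full dataset:    {dense_matrix_gb(full):.1f} GB")
print(f"one dense matrix, smaller dataset: {dense_matrix_gb(reduced):.2f} GB")
print(f"inversion cost ratio:              {relative_inversion_cost(reduced, full):.0f}x")
```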


### Rescaling the IDs
The recipe and user IDs do not form a contiguous range starting at 0, so using them directly as matrix indices would leave empty rows and columns. We therefore reindex both user_id and recipe_id.

We must not forget this step when feeding data to the model: input data has to be remapped with the same mapping that was built here.
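
The idea can be sketched without pandas as follows. The helper names are hypothetical, and sorting the IDs is an assumption made here only to keep the mapping reproducible.

```python
def build_id_map(raw_ids):
    """Map every distinct raw ID to a contiguous index 0..n-1."""
    # sorted() makes the mapping reproducible; this ordering is an assumption of the sketch
    return {raw: new for new, raw in enumerate(sorted(set(raw_ids)))}


def apply_id_map(raw_ids, id_map):
    """Translate raw IDs with an existing map; IDs unknown to the map become None."""
    return [id_map.get(raw) for raw in raw_ids]


train_recipes = [79222, 33096, 79222, 55438]
recipe_map = build_id_map(train_recipes)

# New data must go through the SAME map; 12345 was never seen during training
new_recipes = apply_id_map([33096, 12345, 79222], recipe_map)
print(recipe_map)
print(new_recipes)
```

Unknown IDs have to be handled explicitly (here they become `None`), because they have no column in the trained model.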

In [4]:
userSet = set(df['user_id'].to_list())
user_transform_dict = dict(map(reversed, enumerate(userSet)))
recipeSet = set(df['recipe_id'].to_list())
recipe_transform_dict = dict(map(reversed, enumerate(recipeSet)))
recipe_dict = dict(enumerate(recipeSet))

In [5]:
reverse_recipes = {v: k for k, v in recipe_transform_dict.items()}
reverse_users = {v: k for k, v in user_transform_dict.items()}

In [6]:
keep_nan_user = [k for k, v in user_transform_dict.items() if pd.isnull(v)]
keep_nan_recipe = [k for k, v in recipe_transform_dict.items() if pd.isnull(v)]


def transform_id(dataframe):
    # Users are re-indexed per dataframe: this local dict shadows the global user_transform_dict
    userSet = set(dataframe['user_id'].to_list())
    user_transform_dict = dict(map(reversed, enumerate(userSet)))
    tochange = dataframe['user_id']
    dataframe['user_id'] = tochange.map(user_transform_dict).fillna(tochange.mask(tochange.isin(keep_nan_user)))

    # Recipes use the global mapping; a recipe that is missing from it keeps its raw ID
    tochange = dataframe['recipe_id']
    dataframe['recipe_id'] = tochange.map(recipe_transform_dict).fillna(tochange.mask(tochange.isin(keep_nan_recipe)))
    return dataframe


def open_csv(filename, use_less_data=False):
    df = pd.read_csv(filename)
    if use_less_data:
        df = df[df['count_item'] >= lesser_cuttoff]
    df = transform_id(df)
    df.loc[:, 'rating'] = 1
    df.drop('review', axis=1, inplace=True)
    df['count_user'] = df.groupby(['user_id'])['user_id'].transform('size')
    # df = df.drop(df[(df['count_user'] <= 2)].index)
    df['count_item'] = df.groupby(['recipe_id'])['recipe_id'].transform('size')
    test=len(df.user_id.unique())
    # df = df.drop(df[(df['count_item'] <= 2)].index)
    #df.drop('date', axis=1, inplace=True)
    return df


### Creation of the folds


In [7]:
k = 10
folds = list()
for directory in ["../folds/fold_%d" % i for i in range(k)]:
    folds.append((directory + "/train.csv", directory + "/validate.csv", directory + "/test.csv"))

## Creating the models
Here we define the models used for the experiments: the Easer predictor and a popularity predictor. The popularity predictor serves as a baseline.
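
As a minimal sketch of what the Easer predictor computes, the closed-form solution can be tried on a tiny dense matrix. The function name `ease_weights`, the toy matrix and the small `lambda_` are hypothetical; the sketch uses dense NumPy arrays throughout instead of the sparse input used in the notebook.

```python
import numpy as np


def ease_weights(X, lambda_):
    """Closed-form item-item weights B for a dense user-item matrix X."""
    G = X.T @ X                                # item-item co-occurrence (Gram) matrix
    G[np.diag_indices_from(G)] += lambda_      # L2 regularisation on the diagonal
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                      # divides column j by -P[j, j]
    B[np.diag_indices_from(B)] = 0.0           # an item may not predict itself
    return B


# Hypothetical toy data: 4 users, 3 recipes
X = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1],
              [1, 0, 0]], dtype=np.float64)

B = ease_weights(X, lambda_=0.5)
scores = X @ B                                 # one row of recipe scores per user
print(np.round(B, 3))
print(np.round(scores, 3))
```

The zero diagonal is what prevents the trivial solution in which every recipe simply predicts itself.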

In [8]:
def split_test(data_set):
    # Leave-last-out split: the most recent interaction of every user is the ground truth
    ground_truth = data_set.sort_values('date').groupby('user_id').tail(1)
    # All remaining interactions form the history that is fed to the model
    predict = pd.concat([data_set, ground_truth]).drop_duplicates(keep=False)
    # Keep only recipes whose ID is a valid remapped index
    predict = predict[predict.recipe_id.isin(reverse_recipes)]
    return predict, ground_truth


def data_frame_to_matrix(dataframe):
    ratings = dataframe.rating
    idx = (dataframe.user_id, dataframe.recipe_id)
    # One row per user of this dataframe, one column per recipe of the global df,
    # so every matrix has the same width as the trained item-item matrix
    n_users = dataframe.user_id.max() + 1
    n_recipes = len(df.recipe_id.unique())
    return sparse.csc_matrix((ratings, idx), shape=(n_users, n_recipes), dtype=np.float32)

In [9]:
class popularity:
    def __init__(self):
        pass

    def train(self, data):
        data = data.sort_values('count_item', ascending=False)
        self.pop = data[data.columns[1]].unique()

    def predict(self):
        return self.pop

In [10]:
class Easer:
    def __init__(self):
        pass

    def train(self, X_train, lambda_=1250):
        # Modified version of the code provided in the paper
        G = X_train.T.dot(X_train)      # sparse item-item co-occurrence matrix
        G = G.toarray()                 # the inverse is dense, so continue with a dense array
        diagIndices = np.diag_indices(G.shape[0])
        G[diagIndices] += lambda_       # L2 regularisation
        P = scipy.linalg.inv(G)
        del G
        div = -np.diag(P)
        self.B = P / div                # B[i, j] = -P[i, j] / P[j, j]
        self.B[diagIndices] = 0         # an item may not predict itself

    def predict(self, xu):
        # xu is a sparse user-item matrix; the product is a dense score matrix
        return xu * self.B


In [11]:
K = 5
K2 = 10
K3 = 20


def recal_easer(model, predict_data, test_data):
    total = len(test_data)
    print(total)
    X_train = data_frame_to_matrix(predict_data)
    y_pred = model.predict(X_train)
    print(len(test_data[test_data.recipe_id.isin(reverse_recipes)==False]))
    test_data=test_data[test_data.recipe_id.isin(reverse_recipes)==True]
    X_test = data_frame_to_matrix(test_data)

    interacted_recipes = (X_train == 1).toarray()
    y_pred[interacted_recipes] = -100000
    idx_top_scores = (-y_pred).argsort()[:, :100]
    dense_X_test = X_test.toarray()

    correct_K = 0
    correct_K2 = 0
    correct_K3 = 0
    ndcg_K = 0
    ndcg_K2 = 0
    ndcg_K3 = 0

    for idx, row in enumerate(idx_top_scores):
        if(idx>=len(dense_X_test)):continue
        for rank, index in enumerate(row):
            if dense_X_test[idx][index] == 1:
                if rank < K:
                    correct_K += 1
                    ndcg_K += 1/(math.log2(rank+2))
                if rank < K2:
                    correct_K2 += 1
                    ndcg_K2 += 1/(math.log2(rank+2))
                if rank < K3:
                    correct_K3 += 1
                    ndcg_K3 += 1/(math.log2(rank+2))


    print("easer recall@5,10,20 = %s,%s,%s" % (str(correct_K / total),str(correct_K2 / total),str(correct_K3 / total)))
    print("easer ndcg@5,10,20 = %s,%s,%s" % (str(ndcg_K / total),str(ndcg_K2 / total),str(ndcg_K3 / total)), end="\n\n")

    return correct_K/total, correct_K2/total,correct_K3/total, ndcg_K/total,ndcg_K2/total,ndcg_K3/total

## Training the models and evaluation


In [12]:
#Please enter the path here of where you will place the pickle files (with trailing /)
data_path = "D:/results_aiproject_improvement/"
result_list_K = list()
result_list_K2 = list()
result_ndcg = list()

for f_idx, fold_files in enumerate(folds):
    start = time.time()
    train_data = open_csv(fold_files[0], True)
    print(len(train_data))
    #Here we have the user item matrix
    X_train = data_frame_to_matrix(train_data)

    #train models

    # model_pop = popularity()
    # model_pop.train(train_data)
    # modelpopfile = open(data_path + "model_pop_fold" + str(f_idx) + ".pkl", mode='wb')
    # pickle.dump(model_pop, modelpopfile)
    # modelpopfile.close()
    # del model_pop

    test_data = open_csv(fold_files[1], False)

    model = Easer()
    model.train(X_train, lambda_=1250)
    interactions, ground_truth = split_test(test_data)

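    # The metrics are only printed here: the appends below are commented out,
    # so the result lists stay empty and the summary statistics after the loop fail
    recal_easer(model, interactions, ground_truth)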

    # result_list_K.append(recall20)
    # result_list_K2.append(recall50)
    # result_ndcg.append(ndcg)
    #
    # print("done fold:",str(f_idx))
    #
    # print("easer fold: %s, recall@%s = %s" % (str(f_idx), str(K), recall20))
    # print("easer fold: %s, recall@%s = %s" % (str(f_idx), str(K2), recall50))
    # print("easer fold: %s, ndcg@%s = %s" % (str(f_idx), 100, ndcg), end="\n\n")

    end = time.time()
    print("training took : ", end - start, "s")
    break

print("mean recall@%s over 10 folds: " % str(K), str(st.mean(result_list_K)))
print("mean recall@%s over 10 folds: " % str(K2), str(st.mean(result_list_K2)))
print("mean ndcg@%s over 10 folds: " % str(100), str(st.mean(result_ndcg)), end="\n\n")
print("standard deviation recall@%s over 10 folds: " % str(K), str(st.pstdev(result_list_K)))
print("standard deviation recall@%s over 10 folds: " % str(K2), str(st.pstdev(result_list_K2)))
print("standard deviation ndcg@%s over 10 folds: " % str(100), str(st.pstdev(result_ndcg)))


242967
6528
2731
easer recall@5,10,20 = 0.013786764705882353,0.020833333333333332,0.03262867647058824
easer ndcg@5,10,20 = 0.009151513766222055,0.011409928793355017,0.014396394451212014

training took :  20.78832244873047 s


StatisticsError: mean requires at least one data point

In [None]:
test_data = open_csv(folds[0][1], False)
interactions, ground_truth = split_test(test_data)
recal_easer(model, interactions, ground_truth)

## Evaluation results of the folds

Here we use recall@5, recall@10 and recall@20, together with ndcg@5, ndcg@10 and ndcg@20.
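
For a single user with exactly one held-out recipe, both metrics reduce to a simple check of where that recipe ends up in the ranking. The sketch below illustrates this; the function name and the toy ranking are hypothetical.

```python
import math


def recall_and_ndcg_at_k(ranked_items, held_out_item, k):
    """Return (recall@k, ndcg@k) for one user with a single held-out item."""
    top_k = list(ranked_items[:k])
    if held_out_item not in top_k:
        return 0.0, 0.0
    rank = top_k.index(held_out_item)        # 0-based position in the ranking
    # With one relevant item the ideal DCG is 1, so the DCG is already the NDCG
    return 1.0, 1.0 / math.log2(rank + 2)


ranking = [42, 7, 13, 99, 5]                 # hypothetical recommendation list, best first
print(recall_and_ndcg_at_k(ranking, 13, k=5))
print(recall_and_ndcg_at_k(ranking, 13, k=2))
```

Averaging these per-user values over all users gives the reported scores.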

In [None]:
print(recipe_transform_dict[67256])
for i in recipe_transform_dict:
    if (recipe_transform_dict[i] in [10785, 35451, 15749, 13017, 28357, 22286, 8752, 10150, 27639, 32783]):
        # print(i, recipe_transform_dict[i])
        print(i)

In [None]:
data_path = "D:/results_aiproject_improvement/"
#recall score for popularity
result_list_pop_K5 = list()
result_list_pop_K10 = list()
result_list_pop_K20 = list()
result_list_pop_ndcg5 = list()
result_list_pop_ndcg10 = list()
result_list_pop_ndcg20 = list()
for i in range(k):
    test_data = open_csv(folds[i][1], use_less_data)
    predict_data, ground_truth = split_test(test_data)
    print(len(test_data["user_id"].unique()))
    model = pickle.load(open(data_path + "model_pop_fold" + str(i) + ".pkl", mode='rb'))
    pop = model.predict()
    total = 0
    correct_K5 = 0
    correct_K10 = 0
    correct_K20 = 0
    ndcg5 = 0
    ndcg10 = 0
    ndcg20 = 0
    for idx, interaction in ground_truth.iterrows():
        user = interaction['user_id']
        user_data = predict_data.loc[(predict_data['user_id'] == user)]
        already_interacted_recipes = user_data[user_data.columns[1]].to_numpy()
        newpop = pop[:150]
        newpop = newpop[~np.isin(newpop, already_interacted_recipes)]
        newpop_K5 = newpop[:5]
        newpop_K10 = newpop[:10]
        newpop_K20 = newpop[:20]
        # newpop_ndcg5 = newpop[:5]
        # newpop_ndcg10 = newpop[:10]
        # newpop_ndcg20 = newpop[:20]
        recipe = interaction['recipe_id']
        if recipe in newpop_K5:
            correct_K5 += 1
        if recipe in newpop_K10:
            correct_K10 += 1
        if recipe in newpop_K20:
            correct_K20 += 1

        if recipe in newpop_K5:
            ndcg5 += 1 / (math.log2(np.where(newpop_K5 == recipe)[0][0] + 2))
        if recipe in newpop_K10:
            ndcg10 += 1 / (math.log2(np.where(newpop_K10 == recipe)[0][0] + 2))
        if recipe in newpop_K20:
            ndcg20 += 1 / (math.log2(np.where(newpop_K20 == recipe)[0][0] + 2))
        total += 1
    result_list_pop_K5.append(correct_K5 / total)
    result_list_pop_K10.append(correct_K10 / total)
    result_list_pop_K20.append(correct_K20 / total)
    result_list_pop_ndcg5.append(ndcg5 / total)
    result_list_pop_ndcg10.append(ndcg10 / total)
    result_list_pop_ndcg20.append(ndcg20 / total)
    print("popularity fold: %s, recall@5,10,20 = %s,%s,%s" % (
    str(i), str(correct_K5 / total), str(correct_K10 / total), str(correct_K20 / total)))
    print("popularity fold: %s, ndcg@5,10,20 = %s,%s,%s" % (
    str(i), str(ndcg5 / total), str(ndcg10 / total), str(ndcg20 / total)), end="\n\n")

print("mean recall@5,10,20 over 10 folds: %s,%s,%s" % (
str(st.mean(result_list_pop_K5)), str(st.mean(result_list_pop_K10)), str(st.mean(result_list_pop_K20))))
print("mean ndcg@5,10,20 over 10 folds: %s,%s,%s" % (
str(st.mean(result_list_pop_ndcg5)), str(st.mean(result_list_pop_ndcg10)), str(st.mean(result_list_pop_ndcg20))),
      end="\n\n")
print("standard deviation recall@5,10,20 over 10 folds: %s,%s,%s" % (
str(st.pstdev(result_list_pop_K5)), str(st.pstdev(result_list_pop_K10)), str(st.pstdev(result_list_pop_K20))))
print("standard deviation ndcg@5,10,20 over 10 folds: %s,%s,%s" % (
str(st.pstdev(result_list_pop_ndcg5)), str(st.pstdev(result_list_pop_ndcg10)), str(st.pstdev(result_list_pop_ndcg20))))

The next section is a demonstration that selects a random user and makes a recommendation prediction for this user.
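
The core of such a demonstration is to mask the recipes a user has already interacted with and to take the best remaining scores. A minimal sketch, with a hypothetical function name and made-up scores, could look like this; it masks with negative infinity, whereas the notebook uses a large negative constant.

```python
import numpy as np


def recommend_top_n(scores, seen_items, n):
    """Return the indices of the n best-scoring items the user has not seen yet."""
    masked = np.asarray(scores, dtype=float).copy()
    seen = np.fromiter(seen_items, dtype=int)
    masked[seen] = -np.inf                   # never recommend a seen item
    return np.argsort(-masked)[:n].tolist()  # highest score first


scores = [0.1, 0.9, 0.4, 0.8, 0.3]           # hypothetical model scores for 5 recipes
print(recommend_top_n(scores, seen_items={1}, n=2))
```

The returned indices are remapped recipe indices, so they still have to be translated back to the original recipe IDs before looking up recipe names.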

In [None]:
# import random
# # read recipe data and load pre-trained model
# df_recipes = pd.read_csv('../data/RAW_recipes.csv')
# df_recipes.drop(['minutes', 'contributor_id', 'submitted', 'tags',
#                  'nutrition', 'n_steps', 'steps', 'description', 'n_ingredients'], axis=1, inplace=True)
# data = pickle.load(open(data_path+"data_fold0.pkl", mode='rb'))
# model = pickle.load(open(data_path+"model_fold0.pkl", mode='rb'))
# predict_data = data[0]
# ratings = predict_data.rating
# idx = (predict_data.user_id, predict_data.recipe_id)
# x_train = sparse.csc_matrix((ratings, idx), shape=(len(df.user_id.unique()), len(df.recipe_id.unique())), dtype=float)
#
# # get random user and make prediction
# random_user = x_train.getrow(random.randint(0, len(df.user_id.unique())))
# prediction = model.predict(random_user)[0]
# interacted_recipes = []
# for recipe_id in random_user.indices:
#     interacted_recipes.append(recipe_dict[recipe_id])
#     prediction[recipe_id] = -100000
#
#
# top_index = (-prediction).argsort()[:10]
# recommended_recipes = []
# for recipe_id in top_index:
#     recommended_recipes.append(recipe_dict[recipe_id])
#
# # get interacted recipes and recommended recipes
# user_interactions = df_recipes[df_recipes['id'].isin(interacted_recipes)].drop('id', axis=1)
# user_recommendations = df_recipes[df_recipes['id'].isin(recommended_recipes)].drop('id', axis=1)
#
# display(user_interactions)
# display(user_recommendations)
