# Project AI: Easer

by Michiel Téblick and Thibaut Van Goethem

In this notebook we look at the Easer model (EASE^R in the paper) proposed in https://dl.acm.org/doi/pdf/10.1145/3308558.3313710.

The model is applied to a dataset from Food.com, which contains recipes together with user ratings and reactions on them.




In [105]:
import pandas as pd
import numpy as np
import scipy.linalg  # explicit import so that scipy.linalg.inv is always available
from sklearn.model_selection import KFold
import time
import pickle
from scipy import sparse
import statistics as st

## Reading and preprocessing the data

In [92]:
df = pd.read_csv('../data/RAW_interactions.csv')
# Only user_id, recipe_id and rating are needed from here on
df = df.drop(columns=['review', 'date'])
# Duplicate rows are dropped later, right before the count-based filtering
print("amount of interactions in the full dataset: ",len(df))
print("amount of recipes in the full dataset: ",len(df.recipe_id.unique()))

amount of interactions in the full dataset:  1132367
amount of recipes in the full dataset:  231637


Set all ratings to 1. The model only uses whether a user interacted with a recipe, so even a negative rating counts as an interaction.

In [93]:
df.loc[:,'rating'] = 1

Below are two ways to reduce the number of interactions used in this notebook:
- The first one randomly samples a subset of the recipes (10% in the commented-out cell below) and drops all interactions with the other recipes.
- The second one removes all users and recipes that appear in fewer than X interactions (X = 15 below).

We opted for the second approach because it is more representative of how the model would be used: there are fewer recipes, but more reactions per recipe. It is also a deterministic way of removing data, which the first one is not.
It does, however, mostly give slightly worse results than the first approach.

We need to remove data because the model requires a matrix inversion, which cannot easily be sped up by exploiting the sparsity of the data.
The result of the inversion is also not necessarily a sparse matrix, so the full calculation has to be done on dense matrices. This scales as O(n^3) in time and O(n^2) in memory, where n is the number of recipes.
Running on the full dataset would therefore require more than 200 GB of RAM, which we do not have.
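The memory claim can be checked with a quick back-of-the-envelope calculation. The sketch below uses a hypothetical helper `dense_matrix_gib` (not part of the notebook) and the 231637 recipes reported for the full dataset above. A single dense float64 matrix of that size comes out on the order of 400 GiB, and the inversion needs to hold more than one such matrix at a time.

```python
def dense_matrix_gib(n_items: int, bytes_per_value: int = 8) -> float:
    """Memory in GiB needed to hold one dense n_items x n_items matrix."""
    return n_items * n_items * bytes_per_value / 2**30


# All recipes in the raw dataset, stored as float64 (8 bytes per value)
full_float64 = dense_matrix_gib(231_637)
# Same matrix stored as float32 (4 bytes per value)
full_float32 = dense_matrix_gib(231_637, bytes_per_value=4)

print(f"float64: {full_float64:.0f} GiB, float32: {full_float32:.0f} GiB")
```

Because the cost grows quadratically in n, halving the number of recipes cuts the memory needed by a factor of four.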

In [94]:
# # randomly drop a subset of data as we dont have enough resource to run the entire dataset
# unique_recipes = df.recipe_id.unique()
# subset = np.random.choice(unique_recipes, size=int(len(unique_recipes) / 10), replace=False, p=None)
# # Keep only the recipes that were in the randomly sampled df
# df = df[df['recipe_id'].isin(subset)]
# df.reset_index()
#
# # Preprocessing step where we remove all recipes that only have a single review from a person that only has a single review
# # This is done as these items will never be connected to other items and thus will never be recommended
#
# df['count_user'] = df.groupby(['recipe_id'])['recipe_id'].transform('size')
# df['count_item'] = df.groupby(['user_id'])['user_id'].transform('size')
# # 1121916 interaction in df after the and drop
# # df = df.drop(df[(df['counts'] == 1) & (df['counts_user'] == 1)].index)
#
# # 884607 interactions after the or drop
# df = df.drop(df[(df['count_item'] == 1) | (df['count_user'] == 1)].index)
# df.reset_index(drop=True, inplace=True)

In [95]:
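# Drop exact duplicate rows
df = df.drop_duplicates()
# count_item: number of interactions per recipe
g1 = df.groupby('recipe_id', as_index=False)['user_id'].size()
g1 = g1.rename({'size': 'count_item'}, axis='columns')
# count_user: number of interactions per user
g2 = df.groupby('user_id', as_index=False)['recipe_id'].size()
g2 = g2.rename({'size': 'count_user'}, axis='columns')
df = pd.merge(df, g1, how='left', on=['recipe_id'])
df = pd.merge(df, g2, how='left', on=['user_id'])
# Keep only recipes and users with at least 15 interactions
# (both counts were computed before any rows were removed)
df = df[df['count_item'] >= 15]
df = df[df['count_user'] >= 15]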

### Rescaling the ids
The recipe and user ids are not consecutive integers starting at 0, so using them directly as matrix indices would leave empty rows and columns. To avoid this we reindex both user_id and recipe_id to the range 0 to n-1.

This step must not be forgotten when feeding data to the model: any input data has to be remapped with the same mapping that was used here.
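As an illustration of why the mapping has to be kept around, here is a minimal sketch on toy data. The names `recipe_to_idx` and `idx_to_recipe` and the ids are hypothetical and not part of the notebook. It shows the forward mapping being applied to new input, ids that were never seen during preprocessing, and the inverse mapping that turns matrix columns back into original recipe ids.

```python
import pandas as pd

# Hypothetical toy interactions with non-contiguous ids
toy = pd.DataFrame({
    "user_id": [1533, 1533, 38094],
    "recipe_id": [40893, 85009, 40893],
})

# Forward mapping: original id -> contiguous index (sorted for reproducibility)
recipe_to_idx = {int(rid): i for i, rid in enumerate(sorted(toy["recipe_id"].unique()))}
# Inverse mapping: contiguous index -> original id
idx_to_recipe = {i: rid for rid, i in recipe_to_idx.items()}

# New input must go through the same mapping; 999999 was never seen
new_input = pd.Series([85009, 40893, 999999])
mapped = new_input.map(recipe_to_idx)      # unseen ids become NaN
known = mapped.dropna().astype(int)        # keep only ids the model knows

# Matrix columns can be translated back to the original recipe ids
restored = known.map(idx_to_recipe)
```

Sorting the ids before enumerating is a design choice in this sketch: it makes the mapping independent of set iteration order, so it can be rebuilt identically later.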

In [96]:
# Map every original id to a contiguous index in [0, n)
userSet = set(df['user_id'].to_list())
user_transform_dict = {user_id: idx for idx, user_id in enumerate(userSet)}
recipeSet = set(df['recipe_id'].to_list())
recipe_transform_dict = {recipe_id: idx for idx, recipe_id in enumerate(recipeSet)}

In [97]:
# Every id in df is a key of its mapping, so a plain map() never produces NaN
df['user_id'] = df['user_id'].map(user_transform_dict)
df['recipe_id'] = df['recipe_id'].map(recipe_transform_dict)


### Creation of the folds


In [98]:
k = 10
# Split the individual interactions (rows) into k folds.
# No random_state is set, so the split differs between runs.
kf = KFold(n_splits=k, shuffle=True)
folds = list()
for train_index, test_index in kf.split(df):
    X_train = df.iloc[train_index]
    X_test = df.iloc[test_index]
    folds.append((X_train, X_test))

## Creating the models
Here we define the models used in the experiments: the Easer predictor and a popularity predictor. The popularity predictor is used as a baseline.

In [99]:
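class popularity:
    def __init__(self):
        pass
    def train(self,data):
        # Sort the interactions on 'count_user' (interactions per user, see preprocessing), highest first
        data=data.sort_values('count_user',ascending=False)
        # Column 1 is recipe_id: keep the recipe ids in that order, one entry per interaction
        self.pop=data[data.columns[1]].to_numpy()
    def predict(self):
        # Same ranking for every user
        return self.pop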
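
A popularity baseline is commonly understood as ranking items by how many interactions they received. For comparison, a minimal sketch of such a ranking is given below. The helper `popularity_ranking` and the toy data are hypothetical, and the result is not necessarily equivalent to what the class above produces, which sorts on `count_user` and keeps one entry per interaction.

```python
import numpy as np
import pandas as pd


def popularity_ranking(interactions: pd.DataFrame) -> np.ndarray:
    """Return the unique recipe ids, ordered from most to least interacted with."""
    # value_counts() counts rows per recipe and sorts the counts in descending order
    counts = interactions["recipe_id"].value_counts()
    return counts.index.to_numpy()


# Hypothetical toy data: recipe 7 has three interactions, recipe 3 two, recipe 9 one
toy = pd.DataFrame({
    "user_id": [0, 1, 2, 0, 1, 0],
    "recipe_id": [7, 7, 7, 3, 3, 9],
})
ranking = popularity_ranking(toy)
```

Each recipe appears exactly once in `ranking`, so taking the first K entries after removing already seen recipes always yields K distinct recommendations when enough recipes remain.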

In [100]:
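class Easer:
    def __init__(self):
        pass

    def train(self, X_train, lambda_=0.5):
        #Code here is a modified version of the code provided in the paper
        # X_train: sparse user-item matrix, lambda_: L2 regularization strength
        self.X = X_train

        G = X_train.T.dot(X_train)  # sparse item-item Gram matrix
        G = G.toarray()  # dense from here on, the inverse is not sparse
        diagIndices = np.diag_indices(G.shape[0])
        G[diagIndices] += lambda_
        P = scipy.linalg.inv(G)
        # B[i, j] = -P[i, j] / P[j, j], with the diagonal forced to zero
        div = -np.diag(P)
        self.B = P / div
        self.B[diagIndices] = 0

        # Dense score matrix for all training users
        self.pred = self.X @ self.B

    def predicts(self, xu):
        # Score every recipe for the users in xu (rows of a user-item matrix)
        return xu @ self.B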
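
The training step boils down to a closed-form solution: with P the inverse of the regularized Gram matrix, the weight from item i to item j is -P[i, j] / P[j, j], and the diagonal is zero so that an item cannot predict itself. The sketch below replays this on a tiny dense matrix. The function `ease_weights` and the toy data are hypothetical and only meant as an illustration, not as a replacement for the class above.

```python
import numpy as np


def ease_weights(X: np.ndarray, lambda_: float = 0.5) -> np.ndarray:
    """Closed-form item-item weights with a zero diagonal (dense toy version)."""
    G = X.T @ X + lambda_ * np.eye(X.shape[1])  # regularized Gram matrix
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                       # divides column j by -P[j, j]
    np.fill_diagonal(B, 0.0)                    # an item may not predict itself
    return B


# Hypothetical toy data: 4 users x 3 items
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]], dtype=float)

B = ease_weights(X)
scores = X @ B  # one row of item scores per user
```

Each row of `scores` can then be ranked per user, exactly as is done with the output of `predicts` in the evaluation.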

## Training the models + k-fold validation


In [101]:
#Please enter the path here of where you will place the pickle files (with a trailing path separator)
data_path="D:\\results_aiproject\\"
n_users = df.user_id.nunique()
n_recipes = df.recipe_id.nunique()
for f_idx, fold_data in enumerate(folds):
    start = time.time()
    train_data = fold_data[0]
    ratings = train_data.rating
    idx = (train_data.user_id, train_data.recipe_id)
    #Here we have the user item matrix (rows: users, columns: recipes)
    X_train = sparse.csc_matrix((ratings, idx), shape=(n_users, n_recipes), dtype=float)
    #train models
    model_pop=popularity()
    model_pop.train(train_data)
    model = Easer()
    model.train(X_train)
    print("done fold:",str(f_idx))

    end = time.time()
    print("training took : ", end - start, "s")

    #Dump data for later usage (note that the easer model files end up being pretty large (approx 5gb each))
    with open(data_path+"data_fold" + str(f_idx) + ".pkl", mode='wb') as datafile:
        pickle.dump(fold_data, datafile)
    with open(data_path+"model_fold" + str(f_idx) + ".pkl", mode='wb') as modelfile:
        pickle.dump(model, modelfile)
    with open(data_path+"model_pop_fold" + str(f_idx) + ".pkl", mode='wb') as modelpopfile:
        pickle.dump(model_pop, modelpopfile)

8653
12078
done fold: 0
training took :  31.7008957862854 s
8653
12078
done fold: 1
training took :  30.615259170532227 s
8653
12078
done fold: 2
training took :  30.170352458953857 s
8653
12078
done fold: 3
training took :  31.428001642227173 s
8653
12078
done fold: 4
training took :  30.85005211830139 s
8653
12078
done fold: 5
training took :  30.393954515457153 s
8653
12078
done fold: 6
training took :  31.337446212768555 s
8653
12078
done fold: 7
training took :  30.910082817077637 s
8653
12078
done fold: 8
training took :  30.360827684402466 s
8653
12078
done fold: 9
training took :  31.39309811592102 s


## Evaluation results of the folds

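Here we use recall@20: for every held-out interaction we check whether its recipe is among the 20 highest-scoring recipes for that user, after excluding the recipes the user already interacted with in the training part of the fold. The reported score is the fraction of held-out interactions for which this is the case.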
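
To make the metric concrete, here is a minimal sketch of the per-interaction check on toy scores. The helper `hit_at_k` and the numbers are hypothetical. It masks seen items with negative infinity, whereas the evaluation cell in this notebook uses a fixed low score for the same purpose.

```python
import numpy as np


def hit_at_k(scores: np.ndarray, seen: np.ndarray, target: int, k: int = 20) -> bool:
    """True if `target` is among the k highest-scoring items that are not in `seen`."""
    masked = scores.astype(float).copy()  # leave the caller's array untouched
    masked[seen] = -np.inf                # push already seen items to the bottom
    top_k = np.argsort(-masked)[:k]       # indices of the k largest scores
    return bool(target in top_k)


# Toy example: 5 items, the user has already interacted with item 0
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
seen = np.array([0])

# Three held-out interactions (items 2, 4 and 3), evaluated at k=2
hits = [hit_at_k(scores, seen, target, k=2) for target in (2, 4, 3)]
recall = sum(hits) / len(hits)
```

Items 2 and 4 are the two best unseen items, so two of the three held-out interactions count as hits.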

In [None]:
K = 20

In [106]:
result_list=list()
n_users = df.user_id.nunique()
n_recipes = df.recipe_id.nunique()
for i in range(10):
    with open(data_path+"data_fold"+str(i)+".pkl", mode='rb') as datafile:
        data = pickle.load(datafile)
    with open(data_path+"model_fold"+str(i)+".pkl", mode='rb') as modelfile:
        model = pickle.load(modelfile)

    test_data = data[1]
    # The training interactions of the fold are the input for the predictions
    predict_data = data[0]
    ratings = predict_data.rating
    idx = (predict_data.user_id, predict_data.recipe_id)
    x_test = sparse.csc_matrix((ratings, idx), shape=(n_users, n_recipes), dtype=float)
    y_pred = model.predicts(x_test)
    # print(y_pred)

    #Evaluate recall@k
    #Count the held-out interactions whose recipe is in the top K predictions of the user

    total = 0
    correct = 0
    for _, interaction in test_data.iterrows():
        user = interaction['user_id']
        user_data = predict_data.loc[(predict_data['user_id'] == user)]
        already_interacted_recipes = user_data[user_data.columns[1]].to_numpy()
        predicted = y_pred[user]
        # Push recipes the user already interacted with to the bottom of the ranking
        np.put(predicted, already_interacted_recipes, -5)
        ind = (-predicted).argsort()[:K]
        recipe = interaction['recipe_id']
        if (recipe in ind):
            correct += 1
        total += 1
    result_list.append(correct / total)
    print("easer fold: %s, recall@%s = %s" % (str(i),str(K), str(correct / total)))
print("mean over 10 folds: ",str(st.mean(result_list)))
print("standard deviation over 10 folds: ",str(st.pstdev(result_list)))

easer fold: 0, recall@20 = 0.023421725669672905
...

In [107]:
#recall score for popularity
result_list_pop=list()
for i in range(10):
    with open(data_path+"data_fold"+str(i)+".pkl", mode='rb') as datafile:
        data = pickle.load(datafile)
    with open(data_path+"model_pop_fold"+str(i)+".pkl", mode='rb') as modelfile:
        model = pickle.load(modelfile)
    test_data = data[1]
    predict_data = data[0]
    pop=model.predict()
    total = 0
    correct = 0
    for _, interaction in test_data.iterrows():
        user = interaction['user_id']
        user_data = predict_data.loc[(predict_data['user_id'] == user)]
        already_interacted_recipes = user_data[user_data.columns[1]].to_numpy()
        # Take the first 50 candidates, drop the recipes already seen by the user, keep K
        newpop=pop[:50]
        newpop=newpop[~np.isin(newpop,already_interacted_recipes)]
        newpop=newpop[:K]
        recipe = interaction['recipe_id']
        if (recipe in newpop):
            correct += 1
        total += 1
    result_list_pop.append(correct / total)
    print("popularity fold: %s, recall@%s = %s" % (str(i),str(K), str(correct / total)))
print("mean over 10 folds: ",str(st.mean(result_list_pop)))
print("standard deviation over 10 folds: ",str(st.pstdev(result_list_pop)))

popularity fold: 0, recall@20 = 0.00188450666307713
popularity fold: 1, recall@20 = 0.001615291425494683
popularity fold: 2, recall@20 = 0.0017835509489837124
popularity fold: 3, recall@20 = 0.0010768609503297886
popularity fold: 4, recall@20 = 0.0027258042805222774
popularity fold: 5, recall@20 = 0.00131242428321443
popularity fold: 6, recall@20 = 0.002961367613406919
popularity fold: 7, recall@20 = 0.0017162471395881006
popularity fold: 8, recall@20 = 0.0016825952348902948
popularity fold: 9, recall@20 = 0.003903620944945484
mean over 10 folds:  0.002066226948445282
standard deviation over 10 folds:  0.0008212995366287753
