# Project AI: Easer

by Michiel Téblick and Thibaut Van Goethem

In this notebook we look at the Easer (EASE^R) model proposed in https://dl.acm.org/doi/pdf/10.1145/3308558.3313710.

The model is applied to a dataset from foods.com that contains recipes together with user ratings and reactions on them.




In [1]:
import pandas as pd
import numpy as np
import scipy.linalg
from sklearn.model_selection import KFold
import time
import pickle
from scipy import sparse
import statistics as st

## Reading and preprocessing the data

In [2]:
df = pd.read_csv('../data/RAW_interactions.csv')
# Only user_id, recipe_id and rating are needed
df = df.drop(columns=['review', 'date'])
# Duplicates are not removed here: the counts printed below are for the raw file.
# Duplicate rows are dropped in a later cell, before the counts per user and recipe are computed.
print("amount of interactions in the full dataset: ",len(df))
print("amount of recipes in the full dataset: ",len(df.recipe_id.unique()))

amount of interactions in the full dataset:  1132367
amount of recipes in the full dataset:  231637


Set all ratings to 1. The data is treated as implicit feedback, so even a negative rating counts as an interaction.

In [3]:
df.loc[:,'rating'] = 1

Below are two ways to cut down the number of interactions used in this notebook:
- The first randomly drops a fraction of the recipes (the commented-out cell below keeps a random 10%).
- The second removes all users and recipes that occur in fewer than X interactions.

We opted for the second approach, as it is more representative of how the models would be used: there are fewer recipes, but more interactions per recipe. The second approach is also deterministic, which the first is not.
It does mostly end up giving slightly worse results than the first approach.

We need to remove data because the model requires a matrix inversion, for which we know of no shortcut.
The result of the inversion is also not necessarily a sparse matrix, so the full calculation has to be done on dense matrices. This scales as O(n^3) in time and O(n^2) in memory, where n is the number of recipes.
Running on the full dataset would therefore require more than 200 GB of RAM, which we do not have.
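
As a rough check of that memory figure, the size of a single dense n x n matrix can be computed directly. The sketch below assumes 8-byte float64 entries (NumPy's default) and uses the recipe count printed above; the helper name `dense_matrix_gb` is made up for this example.

```python
def dense_matrix_gb(n_items: int, bytes_per_entry: int = 8) -> float:
    """Size in GB (10^9 bytes) of one dense n_items x n_items matrix."""
    return n_items * n_items * bytes_per_entry / 1e9

n_recipes = 231637  # number of recipes in the full dataset, as printed above
full_gb = dense_matrix_gb(n_recipes)
half_precision_gb = dense_matrix_gb(n_recipes, bytes_per_entry=4)
print(f"float64: {full_gb:.0f} GB, float32: {half_precision_gb:.0f} GB")
```

This counts one matrix only. The inversion needs at least the Gram matrix and its inverse, so the real requirement is higher still.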

In [4]:
# # Randomly drop a subset of the data, as we don't have enough resources to run on the entire dataset
# unique_recipes = df.recipe_id.unique()
# subset = np.random.choice(unique_recipes, size=int(len(unique_recipes) / 10), replace=False, p=None)
# # Keep only the recipes that were in the randomly sampled df
# df = df[df['recipe_id'].isin(subset)]
# df.reset_index()
#
# # Preprocessing step where we remove all recipes that only have a single review from a person that only has a single review
# # This is done as these items will never be connected to other items and thus will never be recommended
#
# df['count_user'] = df.groupby(['recipe_id'])['recipe_id'].transform('size')
# df['count_item'] = df.groupby(['user_id'])['user_id'].transform('size')
# # 1121916 interaction in df after the and drop
# # df = df.drop(df[(df['counts'] == 1) & (df['counts_user'] == 1)].index)
#
# # 884607 interactions after the or drop
# df = df.drop(df[(df['count_item'] == 1) | (df['count_user'] == 1)].index)
# df.reset_index(drop=True, inplace=True)

In [5]:
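# Drop duplicate rows (only user_id, recipe_id and rating are left at this point)
df = df.drop_duplicates()
# count_item: number of interactions per recipe
g1 = df.groupby('recipe_id', as_index=False)['user_id'].size()
g1 = g1.rename({'size': 'count_item'}, axis='columns')
# count_user: number of interactions per user
g2 = df.groupby('user_id', as_index=False)['recipe_id'].size()
g2 = g2.rename({'size': 'count_user'}, axis='columns')
df = pd.merge(df, g1, how='left', on=['recipe_id'])
df = pd.merge(df, g2, how='left', on=['user_id'])
# Keep only recipes and users with at least 15 interactions.
# Both counts are computed once, before any rows are filtered out.
df = df[df['count_item'] >= 15]
df = df[df['count_user'] >= 15]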
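
Because the counts in the cell above are computed once, before any rows are dropped, some users or recipes can end up with fewer than 15 remaining interactions after filtering. If a strict minimum is wanted, one option is to repeat the filter until nothing changes. This is not what the notebook does; the sketch below only illustrates the idea on toy data, and `filter_min_interactions` is a hypothetical helper.

```python
import pandas as pd

def filter_min_interactions(df: pd.DataFrame, min_count: int) -> pd.DataFrame:
    """Repeatedly drop users and recipes with fewer than min_count interactions."""
    while True:
        item_counts = df.groupby('recipe_id')['user_id'].transform('size')
        user_counts = df.groupby('user_id')['recipe_id'].transform('size')
        keep = (item_counts >= min_count) & (user_counts >= min_count)
        if keep.all():
            return df.reset_index(drop=True)
        df = df[keep]

# Toy data: user 4 has a single interaction, which starts a chain of removals
toy = pd.DataFrame({
    'user_id':   [1, 1, 2, 2, 3, 3, 4],
    'recipe_id': [10, 11, 10, 11, 10, 12, 12],
})
filtered = filter_min_interactions(toy, min_count=2)
print(filtered)
```

Dropping user 4 leaves recipe 12 with one interaction, and dropping that in turn leaves user 3 with one, so only users 1 and 2 survive.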

### Rescaling the IDs
The recipe and user IDs are not contiguous integers starting at 0, so using them directly as matrix indices would give empty rows and columns. To avoid this, we reindex both `user_id` and `recipe_id`.

We must not forget this step when feeding data to the model: any input data has to be remapped with the same mapping that was used here.
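
A minimal sketch of such a remapping on toy data follows. It uses `pd.factorize` rather than the dictionaries built in this notebook, and the names `raw`, `to_index` and `new_input` are made up for the example. The point it shows is that the lookup built from the training data is the one that has to be applied to later input.

```python
import pandas as pd

raw = pd.Series([1533, 40, 1533, 98765, 40], name='recipe_id')

# codes are contiguous integers 0..n-1; uniques[i] is the original id of code i
codes, uniques = pd.factorize(raw)
to_index = {original: i for i, original in enumerate(uniques)}

# Remap new input with the SAME lookup; ids that were never seen become NaN
new_input = pd.Series([40, 98765, 7])
remapped = new_input.map(to_index)
print(codes.tolist())
print(remapped.tolist())
```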

In [6]:
# Map every original id to a contiguous index 0..n-1
userSet = set(df['user_id'].to_list())
user_transform_dict = dict(map(reversed, enumerate(userSet)))  # user_id -> index
recipeSet = set(df['recipe_id'].to_list())
recipe_transform_dict = dict(map(reversed, enumerate(recipeSet)))  # recipe_id -> index
recipe_dict = dict(enumerate(recipeSet))  # index -> original recipe_id

In [7]:
# Replace the original ids by their contiguous indices.
# Every id in df is a key of the corresponding dict, so map() leaves no NaN values.
df['user_id'] = df['user_id'].map(user_transform_dict)
df['recipe_id'] = df['recipe_id'].map(recipe_transform_dict)


### Creation of the folds


In [8]:
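k = 10
# shuffle=True without a random_state: the folds differ between runs
kf = KFold(n_splits=k, shuffle=True)
# The split is over interactions (rows of df), not over users
folds = list()
for train_index, test_index in kf.split(df):
    X_train = df.iloc[train_index]
    X_test = df.iloc[test_index]
    folds.append((X_train, X_test))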

## Creating the models
Here we define the models used for the experiments: the Easer predictor and a popularity predictor, which is used as a baseline.

In [9]:
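class popularity:
    def __init__(self):
        pass
    def train(self,data):
        # Sort the interactions by count_user, which in this notebook is the
        # number of interactions per user (count_item is the per-recipe count)
        data=data.sort_values('count_user',ascending=False)
        # Column 1 is recipe_id: one entry per interaction, in sorted order
        self.pop=data[data.columns[1]].to_numpy()
    def predict(self):
        return self.pop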
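
For comparison, a popularity ranking is commonly built by counting interactions per recipe and ordering the distinct recipe ids by that count, whereas the class above orders interactions by `count_user`. The sketch below shows the per-recipe variant on toy data with made-up ids. It is an illustration only and is not the code that produced the results reported in this notebook.

```python
import pandas as pd

train = pd.DataFrame({
    'user_id':   [0, 1, 2, 0, 1, 0],
    'recipe_id': [7, 7, 7, 3, 3, 5],
})

# value_counts() returns the distinct recipe ids sorted by descending count
popular = train['recipe_id'].value_counts().index.to_numpy()
print(popular)
```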

In [10]:
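class Easer:
    def __init__(self):
        pass

    def train(self, X_train, lambda_=0.5):
        #Code here is a modified version of the code provided in the paper
        self.X = X_train

        # Item-item Gram matrix, converted to dense for the inversion
        G = X_train.T.dot(X_train)
        G = G.toarray()
        diagIndices = np.diag_indices(G.shape[0])
        # L2 regularisation: add lambda_ to the diagonal
        G[diagIndices] += lambda_
        P = scipy.linalg.inv(G)
        # Divide column j of P by -P[j, j], then zero the diagonal
        div = -np.diag(P)
        self.B = P / div
        self.B[diagIndices] = 0

        # Dense user x recipe score matrix for the training data
        # (for scipy sparse matrices, * is matrix multiplication)
        self.pred = self.X * self.B

    def predicts(self, xu):
        # Scores of all recipes for the user rows in xu
        return xu * self.B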
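
To make the computation concrete: with `G = X^T X + lambda * I` and `P = inv(G)`, the weights are `B[i, j] = -P[i, j] / P[j, j]` off the diagonal and `B[j, j] = 0`. The sketch below applies the same steps as `Easer.train` to a small made-up interaction matrix with plain NumPy. The function name `ease_weights` and the toy data are assumptions for illustration.

```python
import numpy as np

def ease_weights(X: np.ndarray, lambda_: float = 0.5) -> np.ndarray:
    """Closed-form item-item weights for a dense user-item matrix X."""
    G = X.T @ X                             # item-item Gram matrix
    G[np.diag_indices_from(G)] += lambda_   # L2 regularisation on the diagonal
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                   # column j is divided by -P[j, j]
    B[np.diag_indices_from(B)] = 0.0        # an item may not predict itself
    return B

# 4 users x 4 recipes, 1 = interaction
X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)
B = ease_weights(X)
scores = X @ B   # one row of recipe scores per user
print(np.round(scores, 2))
```

The zero diagonal is what keeps the model from learning the trivial solution in which every recipe simply predicts itself.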

## Training the models and k-fold validation


In [11]:
#Please enter the path where the pickle files will be placed (with trailing /)
data_path="../results_aiproject/"
for f_idx, fold_data in enumerate(folds):
    start = time.time()
    train_data = fold_data[0]
    ratings = train_data.rating
    idx = (train_data.user_id, train_data.recipe_id)
    #User-item matrix of the training fold
    X_train = sparse.csc_matrix((ratings, idx), shape=(len(df.user_id.unique()), len(df.recipe_id.unique())),
                                dtype=float)
    #Train models
    model_pop=popularity()
    model_pop.train(train_data)
    model = Easer()
    model.train(X_train)
    print("done fold:",str(f_idx))

    end = time.time()
    print("training took : ", end - start, "s")

    #Dump data for later usage (note that the Easer model files end up being pretty large, approx. 5 GB each)
    with open(data_path+"data_fold" + str(f_idx) + ".pkl", mode='wb') as datafile:
        pickle.dump(fold_data, datafile)
    with open(data_path+"model_fold" + str(f_idx) + ".pkl", mode='wb') as modelfile:
        pickle.dump(model, modelfile)
    with open(data_path+"model_pop_fold" + str(f_idx) + ".pkl", mode='wb') as modelpopfile:
        pickle.dump(model_pop, modelpopfile)

done fold: 0
training took :  39.41595792770386 s
done fold: 1
training took :  39.61205983161926 s
done fold: 2
training took :  39.19988489151001 s
done fold: 3
training took :  39.33319401741028 s
done fold: 4
training took :  38.85149812698364 s
done fold: 5
training took :  39.084100008010864 s
done fold: 6
training took :  39.282891035079956 s
done fold: 7
training took :  39.29134488105774 s
done fold: 8
training took :  38.910627126693726 s
done fold: 9
training took :  41.14596605300903 s


## Evaluation results of the folds

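Here we use recall@20, computed per held-out interaction: the fraction of test interactions whose recipe appears in the user's top 20 recommendations.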
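
The check behind this metric can be sketched for a single user as follows. A held-out recipe counts as a hit when it is among the K highest-scored recipes that the user has not already interacted with in the training fold. The scores are made up and `top_k_unseen` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import numpy as np

def top_k_unseen(scores: np.ndarray, seen: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores, ignoring the items in `seen`."""
    masked = scores.astype(float)   # astype returns a copy, the input stays untouched
    masked[seen] = -np.inf          # seen items can never be recommended
    return np.argsort(-masked)[:k]

scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
seen = np.array([0])                # recipe 0 is in the training data
top = top_k_unseen(scores, seen, k=2)
held_out_item = 4
print(top, held_out_item in top)
```

Averaging the hit indicator over all test interactions gives the number reported per fold.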

In [12]:
K = 20

In [13]:
result_list=list()
for i in range(10):
    with open(data_path+"data_fold"+str(i)+".pkl", mode='rb') as f:
        data = pickle.load(f)
    with open(data_path+"model_fold"+str(i)+".pkl", mode='rb') as f:
        model = pickle.load(f)

    test_data = data[1]
    predict_data = data[0]
    ratings = predict_data.rating
    idx = (predict_data.user_id, predict_data.recipe_id)
    # Rebuild the training user-item matrix and score every recipe for every user
    x_test = sparse.csc_matrix((ratings, idx), shape=(len(df.user_id.unique()), len(df.recipe_id.unique())), dtype=float)
    y_pred = model.predicts(x_test)

    # Evaluate recall@K: a test interaction is a hit if its recipe is in the user's top K
    total = 0
    correct = 0
    for _, interaction in test_data.iterrows():
        user = interaction['user_id']
        user_data = predict_data.loc[(predict_data['user_id'] == user)]
        already_interacted_recipes = user_data['recipe_id'].to_numpy()
        predicted = y_pred[user]
        # Push recipes already seen in training to the bottom of the ranking
        np.put(predicted, already_interacted_recipes, -5)
        ind = (-predicted).argsort()[:K]
        recipe = interaction['recipe_id']
        if recipe in ind:
            correct += 1
        total += 1
    result_list.append(correct / total)
    print("easer fold: %s, recall@%s = %s" % (str(i),str(K), str(correct / total)))
print("mean over 10 folds: ",str(st.mean(result_list)))
print("standard deviation over 10 folds: ",str(st.pstdev(result_list)))

easer fold: 0, recall@20 = 0.022748687575716787
easer fold: 1, recall@20 = 0.02385920043074438
easer fold: 2, recall@20 = 0.025104320904563198
easer fold: 3, recall@20 = 0.024902409476376362
easer fold: 4, recall@20 = 0.023556333288464126
easer fold: 5, recall@20 = 0.025037017095167587
easer fold: 6, recall@20 = 0.023926504240139992
easer fold: 7, recall@20 = 0.02426302328711805
easer fold: 8, recall@20 = 0.02335442186027729
easer fold: 9, recall@20 = 0.023825548526046573
mean over 10 folds:  0.024057746668461433
standard deviation over 10 folds:  0.0007334957383737352


In [14]:
#recall score for popularity
result_list_pop=list()
for i in range(10):
    with open(data_path+"data_fold"+str(i)+".pkl", mode='rb') as f:
        data = pickle.load(f)
    with open(data_path+"model_pop_fold"+str(i)+".pkl", mode='rb') as f:
        model = pickle.load(f)
    test_data = data[1]
    predict_data = data[0]
    pop=model.predict()
    total = 0
    correct = 0
    for _, interaction in test_data.iterrows():
        user = interaction['user_id']
        user_data = predict_data.loc[(predict_data['user_id'] == user)]
        already_interacted_recipes = user_data['recipe_id'].to_numpy()
        # Take the first 50 entries, drop recipes the user already interacted with, keep the first K
        newpop=pop[:50]
        newpop=newpop[~np.isin(newpop,already_interacted_recipes)]
        newpop=newpop[:K]
        recipe = interaction['recipe_id']
        if recipe in newpop:
            correct += 1
        total += 1
    result_list_pop.append(correct / total)
    print("popularity fold: %s, recall@%s = %s" % (str(i),str(K), str(correct / total)))
print("mean over 10 folds: ",str(st.mean(result_list_pop)))
print("standard deviation over 10 folds: ",str(st.pstdev(result_list_pop)))

popularity fold: 0, recall@20 = 0.0015143357114012653
popularity fold: 1, recall@20 = 0.0017162471395881006
popularity fold: 2, recall@20 = 0.0015143357114012653
popularity fold: 3, recall@20 = 0.0012451204738188181
popularity fold: 4, recall@20 = 0.0019854623771705477
popularity fold: 5, recall@20 = 0.0014133799973078476
popularity fold: 6, recall@20 = 0.002894063804011307
popularity fold: 7, recall@20 = 0.0019854623771705477
popularity fold: 8, recall@20 = 0.0019181585677749361
popularity fold: 9, recall@20 = 0.0012114685691210123
mean over 10 folds:  0.0017398034728765648
standard deviation over 10 folds:  0.0004708982581915345


The next section is a demonstration that selects a random user and makes a recommendation prediction for this user.

In [15]:
import random
# Read recipe data and load pre-trained model
df_recipes = pd.read_csv('../data/RAW_recipes.csv')
df_recipes.drop(['minutes', 'contributor_id', 'submitted', 'tags',
                 'nutrition', 'n_steps', 'steps', 'description', 'n_ingredients'], axis=1, inplace=True)
with open(data_path+"data_fold0.pkl", mode='rb') as f:
    data = pickle.load(f)
with open(data_path+"model_fold0.pkl", mode='rb') as f:
    model = pickle.load(f)
predict_data = data[0]
ratings = predict_data.rating
idx = (predict_data.user_id, predict_data.recipe_id)
x_train = sparse.csc_matrix((ratings, idx), shape=(len(df.user_id.unique()), len(df.recipe_id.unique())), dtype=float)

# Get a random user and make a prediction
# (randrange excludes the upper bound; randint includes it and could pick a row that does not exist)
random_user = x_train.getrow(random.randrange(len(df.user_id.unique())))
prediction = model.predicts(random_user)[0]
interacted_recipes = []
for recipe_id in random_user.indices:
    # Translate the matrix index back to the original recipe id
    interacted_recipes.append(recipe_dict[recipe_id])
    # Make sure recipes the user already interacted with are never recommended
    prediction[recipe_id] = -100000


top_index = (-prediction).argsort()[:10]
recommended_recipes = []
for recipe_id in top_index:
    recommended_recipes.append(recipe_dict[recipe_id])

# Get interacted recipes and recommended recipes
# (isin() keeps the row order of df_recipes, so the recommendations are not shown in ranked order)
user_interactions = df_recipes[df_recipes['id'].isin(interacted_recipes)].drop('id', axis=1)
user_recommendations = df_recipes[df_recipes['id'].isin(recommended_recipes)].drop('id', axis=1)

display(user_interactions)
display(user_recommendations)




| | name | ingredients |
|---|---|---|
| 5047 | amish white bread | ['dry yeast', 'water', 'sugar', 'salt', 'short... |
| 17450 | barb s best zucchini bread | ['eggs', 'sugar', 'zucchini', 'vegetable oil',... |
| 30137 | brown rice and lentil casserole | ['chicken broth', 'lentils', 'brown rice', 'on... |
| 39655 | cheese tortellini pesto pasta salad | ['artichoke hearts', 'grape tomatoes', 'black ... |
| 45648 | chicken scaloppine with lemon glaze low fat ... | ['boneless skinless chicken breasts', 'dijon m... |
| 50107 | chocolate chocolate chip sour cream banana bread | ['butter', 'sugar', 'vanilla extract', 'eggs',... |
| 57151 | confetti orzo salad | ['orzo pasta', 'olive oil', 'lemon juice', 'le... |
| 57732 | copycat olive garden minestrone soup by todd w... | ['olive oil', 'white onion', 'zucchini', 'ital... |
| 72879 | dot s corn muffins | ['all-purpose flour', 'cornmeal', 'sugar', 'ba... |
| 78584 | easy venison steaks | ['flour', 'mrs. dash seasoning mix', 'garlic s... |


| | name | ingredients |
|---|---|---|
| 574 | whatever floats your boat brownies | ['butter', 'unsweetened cocoa', 'sugar', 'eggs... |
| 17179 | banana chocolate chip muffins | ['bananas', 'egg', 'low-fat buttermilk', 'gran... |
| 45020 | chicken packets oamc | ['cooked chicken', 'cream cheese', 'chives', '... |
| 89197 | fudge crinkles a great 4 ingredient cake mix ... | ["devil's food cake mix", 'vegetable oil', 'eg... |
| 115303 | jo mama s world famous spaghetti | ['italian sausage', 'onion', 'garlic cloves', ... |
| 167258 | pumpkin pie dip | ['cream cheese', 'powdered sugar', 'pumpkin pi... |
| 189309 | slow cooker beef short ribs | ['flour', 'salt', 'pepper', 'boneless beef sho... |
| 193828 | spanish chicken rice bake | ['cream of chicken soup', 'salsa', 'water', 'w... |
| 203644 | sunflower centerpiece | ['chocolate cake', 'chocolate frosting', 'choc... |
| 228062 | wonderful salsa | ['tomatoes', 'onions', 'green peppers', 'jalap... |
