## LightFM model

LightFM is a hybrid matrix factorization model that represents users and items as linear combinations of their content features' latent factors. In other words, it is capable of using both collaborative and content-based information for recommendations, which makes it a good choice for dealing with both cold start and sparsity problems. 
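To make the "sum of feature latent factors" idea concrete, here is a minimal NumPy sketch. The embeddings and biases are random stand-ins (in practice they are learned during training), and all variable names are hypothetical; this is an illustration of the scoring idea, not LightFM's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_user_features, n_item_features, n_components = 4, 3, 2

# Hypothetical latent factors and biases, one row/entry per feature
user_feature_embeddings = rng.normal(size=(n_user_features, n_components))
item_feature_embeddings = rng.normal(size=(n_item_features, n_components))
user_feature_biases = rng.normal(size=n_user_features)
item_feature_biases = rng.normal(size=n_item_features)

# One user with features 0 and 2 switched on; one item with feature 1 switched on
user_feature_row = np.array([1.0, 0.0, 1.0, 0.0])
item_feature_row = np.array([0.0, 1.0, 0.0])

# Representation = weighted sum of the latent factors of the active features
user_repr = user_feature_row @ user_feature_embeddings
item_repr = item_feature_row @ item_feature_embeddings

# Raw score = dot product of the representations plus the summed feature biases
score = (
    user_repr @ item_repr
    + user_feature_row @ user_feature_biases
    + item_feature_row @ item_feature_biases
)
print(user_repr, item_repr, score)
```

A user who shares features with other users shares part of their representation, which is what lets the model say something about users or items with few interactions.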

The inputs to the LightFM model can be divided into two categories:

#### Collaborative input
This is the user-item interaction matrix: each row represents a user, each column represents an item, and each cell records the interaction between that user and that item. The interaction can come from explicit feedback (e.g., ratings) or implicit feedback (e.g., views, clicks); with implicit feedback the matrix is binary.

#### Content-based input
These are the user and item features, each supplied as a (typically binary) matrix:
- User features: each row represents a user, and each column represents a user feature.
- Item features: each row represents an item, and each column represents an item feature.

Note that if you don't provide user and/or item features, LightFM uses their identities as features, which makes it equivalent to a pure collaborative-filtering matrix factorization model.
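As a rough illustration of these inputs, the sketch below builds a toy interaction matrix and two versions of a user feature matrix with SciPy. The sizes, indices, and trait columns are made up for the example; the layout that keeps identity columns next to metadata columns is an assumption chosen for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix, identity, hstack

# Toy data: 3 users, 4 items, four observed (user_index, item_index) interactions
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 3, 0, 2])
interactions = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(3, 4)).tocsr()

# Without metadata, each user's only feature is their own identity (pure MF case)
identity_user_features = identity(3, format="csr")

# With metadata, indicator columns are appended (here: 2 hypothetical traits)
trait_columns = coo_matrix(np.array([[1, 0], [0, 1], [1, 0]], dtype=float))
hybrid_user_features = hstack([identity_user_features, trait_columns]).tocsr()

print(interactions.toarray())
print(hybrid_user_features.toarray())
```

With only the identity matrix, every user gets an independent latent vector; the extra columns are what allow information to be shared between users with the same trait.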

More info on the method: https://arxiv.org/pdf/1507.08439.pdf

https://github.com/lyst/lightfm/issues/494

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns


from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import auc_score
from lightfm.evaluation import precision_at_k

from scipy.sparse import csr_matrix



In [2]:
# Load product info
product_info = pd.read_csv('../data/raw/archive/product_info.csv')

In [3]:
product_info.rename(columns={'product_id': 'item_id'}, inplace=True)
product_info1 = product_info[['item_id','limited_edition', 'new', 'online_only', 'out_of_stock', 'sephora_exclusive']]
#, 'primary_category', 'child_count'

In [4]:
# All item ids are unique in the dataset
print(product_info1.shape)
product_info1 = product_info1.drop_duplicates(subset=['item_id'])
print(product_info1.shape)

(8494, 6)
(8494, 6)


In [5]:
product_info1.out_of_stock.unique()

array([0, 1])

In [6]:
reviews250 = pd.read_csv('../data/raw/archive/reviews_0_250.csv', low_memory=False, index_col=0)
reviews500 = pd.read_csv('../data/raw/archive/reviews_250_500.csv', low_memory=False, index_col=0)
reviews750 = pd.read_csv('../data/raw/archive/reviews_500_750.csv', low_memory=False, index_col=0)
reviews1000 = pd.read_csv('../data/raw/archive/reviews_750_1000.csv', low_memory=False, index_col=0)
reviews1500 = pd.read_csv('../data/raw/archive/reviews_1000_1500.csv', low_memory=False, index_col=0)
reviewsend = pd.read_csv('../data/raw/archive/reviews_1500_end.csv', low_memory=False, index_col=0)

In [7]:
df = pd.concat([reviews250, reviews500, reviews750, reviews1000, reviews1500, reviewsend])
print(df.shape)
df.info()

(1301136, 18)
<class 'pandas.core.frame.DataFrame'>
Index: 1301136 entries, 0 to 49976
Data columns (total 18 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   author_id                 1301136 non-null  object 
 1   rating                    1301136 non-null  int64  
 2   is_recommended            1107162 non-null  float64
 3   helpfulness               631670 non-null   float64
 4   total_feedback_count      1301136 non-null  int64  
 5   total_neg_feedback_count  1301136 non-null  int64  
 6   total_pos_feedback_count  1301136 non-null  int64  
 7   submission_time           1301136 non-null  object 
 8   review_text               1299520 non-null  object 
 9   review_title              930754 non-null   object 
 10  skin_tone                 1103798 non-null  object 
 11  eye_color                 1057734 non-null  object 
 12  skin_type                 1172830 non-null  object 
 13  hair_color          

In [8]:
# Rename to the user_id / item_id naming used throughout the rest of this notebook
df.rename(columns={'author_id': 'user_id', 'product_id':'item_id'}, inplace=True)

In [9]:
# from combined reviews data frame df we subset interaction matrix inputs and user traits
df = df[['user_id','item_id','rating', 'skin_tone', 'eye_color', 'skin_type', 'hair_color']]

In [10]:
# Drop every row that has a missing value in any of the selected columns
print(df.shape)
df = df.dropna(axis=0, how='any')
print(df.shape)

(1301136, 7)
(1010828, 7)


In [11]:
df = df.drop_duplicates(subset=['user_id', 'item_id'])
df = df.reset_index(drop=True)
print(df.shape)

(843003, 7)


In [12]:
df.head()

Unnamed: 0,user_id,item_id,rating,skin_tone,eye_color,skin_type,hair_color
0,5061282401,P420652,5,light,brown,dry,blonde
1,42802569154,P420652,4,lightMedium,brown,normal,brown
2,6941883808,P420652,2,light,blue,combination,brown
3,27926227988,P420652,5,fairLight,brown,combination,brown
4,7656791726,P420652,5,light,blue,normal,blonde


Now this df is ready for the interaction matrix, since every user/item combination is unique. To get unique user features, we need to drop all duplicate user_id rows.


In [13]:
# Building the interactions matrix
interactions_df = df[['user_id', 'item_id', 'rating']]
interactions = list(interactions_df.itertuples(index=False, name=None))
interactions[0:10]



[('5061282401', 'P420652', 5),
 ('42802569154', 'P420652', 4),
 ('6941883808', 'P420652', 2),
 ('27926227988', 'P420652', 5),
 ('7656791726', 'P420652', 5),
 ('38727834382', 'P420652', 5),
 ('37554855017', 'P420652', 5),
 ('21858073785', 'P420652', 4),
 ('1216391002', 'P420652', 3),
 ('45004161653', 'P420652', 5)]

In [14]:
user_features_df = df.drop_duplicates(subset=['user_id'])
print(user_features_df.shape)

(377879, 7)


The df data frame contains both the user features and the interactions. We now split it into the separate inputs that LightFM expects.

In [15]:
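# Note: selects from df (one row per review), so the user_id de-duplication from the previous cell is not applied here
user_features_df = df[['user_id', 'skin_tone', 'eye_color', 'skin_type', 'hair_color']]
user_features_df = pd.get_dummies(user_features_df, columns=['skin_tone', 'eye_color', 'skin_type', 'hair_color'])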

In [16]:
user_features_df.head()
# Create an iterable of tuples in the form (user_id, [feature1, feature2, ...])
#users_features = [(row[0], list(row[1:])) for row in user_features_df.itertuples(index=False)]
#items_features = [(row[0], list(row[1:])) for row in product_info1.itertuples(index=False)]

Unnamed: 0,user_id,skin_tone_dark,skin_tone_deep,skin_tone_ebony,skin_tone_fair,skin_tone_fairLight,skin_tone_light,skin_tone_lightMedium,skin_tone_medium,skin_tone_mediumTan,...,skin_type_dry,skin_type_normal,skin_type_oily,hair_color_auburn,hair_color_black,hair_color_blonde,hair_color_brown,hair_color_brunette,hair_color_gray,hair_color_red
0,5061282401,False,False,False,False,False,True,False,False,False,...,True,False,False,False,False,True,False,False,False,False
1,42802569154,False,False,False,False,False,False,True,False,False,...,False,True,False,False,False,False,True,False,False,False
2,6941883808,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
3,27926227988,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,7656791726,False,False,False,False,False,True,False,False,False,...,False,True,False,False,False,True,False,False,False,False


In [17]:
# Build (user_id, {feature_name: value}) tuples, one per row of user_features_df
users_features = [
    (row['user_id'], {col: row[col] for col in user_features_df.columns[1:]})
    for _, row in user_features_df.iterrows()
]


In [18]:
# Build (item_id, {feature_name: value}) tuples, one per row of product_info1
items_features = [
    (row['item_id'], {col: row[col] for col in product_info1.columns[1:]})
    for _, row in product_info1.iterrows()
]



In [19]:
print(interactions_df.shape)
train_set = interactions_df.iloc[0:293252]
val_set = interactions_df.iloc[293252:]
print(val_set[['user_id','item_id','rating']])

(843003, 3)
            user_id  item_id  rating
293252   5879518678  P479734       4
293253  11027774146  P479734       4
293254  11408314215  P479734       4
293255   2574379911  P479734       5
293256   2025971463  P479734       5
...             ...      ...     ...
842998   2555611347  P505392       5
842999  32344707475  P505392       5
843000   6747641337  P505392       5
843001  12175422344  P505392       5
843002   1539813076  P505392       5

[549751 rows x 3 columns]


In [None]:
train_set.shape

### Side note: calculate interaction matrix sparsity to judge what methods should handle it better

In [None]:
# Pivot the DataFrame to create a user-item matrix
matrix = interactions_df.pivot(index='user_id', columns='item_id', values='rating')

# Sparsity = share of user-item cells without a rating (NaN after the pivot)
total_entries = matrix.shape[0] * matrix.shape[1]
num_nan_entries = matrix.isna().sum().sum()
sparsity = num_nan_entries / total_entries

print(f'Sparsity: {sparsity:.4f}')
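The pivot above materializes a dense users-by-items frame, which can be large with hundreds of thousands of users. The same sparsity figure can be obtained from counts alone, as in this minimal sketch on toy IDs (the arrays are invented for the example and assume duplicate user/item pairs were already removed):

```python
import numpy as np

# Toy (user_id, item_id) pairs, one per observed interaction, no duplicates
user_ids = np.array(["u1", "u1", "u2", "u3", "u3"])
item_ids = np.array(["a", "b", "a", "c", "d"])

n_users = np.unique(user_ids).size
n_items = np.unique(item_ids).size
n_observed = len(user_ids)

# Sparsity = share of user-item cells with no interaction
sparsity = 1.0 - n_observed / (n_users * n_items)
print(f"Sparsity: {sparsity:.4f}")
```

On the notebook's data, the counts would presumably come from `interactions_df['user_id'].nunique()`, `interactions_df['item_id'].nunique()`, and `len(interactions_df)`.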

### Build dataset

In [20]:
user_features_df['user_id']

0          5061282401
1         42802569154
2          6941883808
3         27926227988
4          7656791726
             ...     
842998     2555611347
842999    32344707475
843000     6747641337
843001    12175422344
843002     1539813076
Name: user_id, Length: 843003, dtype: object

In [23]:
product_info1['item_id']

0       P473671
1       P473668
2       P473662
3       P473660
4       P473658
         ...   
8489    P467659
8490    P500874
8491    P504428
8492    P504448
8493    P505461
Name: item_id, Length: 8494, dtype: object

In [24]:
dataset = Dataset()
# Include all features in fit
dataset.fit(
    user_features_df['user_id'], 
    product_info1['item_id'],
    item_features=product_info1.columns[1:],
    user_features=user_features_df.columns[1:]
)

In [25]:
(interactions, weights) = dataset.build_interactions(interactions)

print(interactions.shape, weights.shape)

(377879, 8494) (377879, 8494)


In [30]:
print(interactions_df.shape)

(843003, 3)


In [45]:
# Building the interactions matrix
interactions_df = df[['user_id', 'item_id', 'rating']]
#interactions = list(interactions_df.itertuples(index=False, name=None))
#interactions[0:10]

train_set = interactions_df.iloc[0:800003]
val_set = interactions_df.iloc[800003:]

train_set = list(train_set.itertuples(index=False, name=None))
val_set = list(val_set.itertuples(index=False, name=None))

print(val_set[0:10], train_set[0:10])

(interactions, weights) = dataset.build_interactions(train_set)


(val_interactions, val_weights) = dataset.build_interactions(val_set)



[('23137230454', 'P465803', 4), ('2016492497', 'P465803', 1), ('7065781564', 'P465803', 5), ('5711099888', 'P465803', 5), ('1714992794', 'P465803', 5), ('7700407241', 'P465803', 3), ('8022329851', 'P465803', 5), ('35085403080', 'P465803', 5), ('2131503558', 'P465803', 5), ('11175368306', 'P465803', 5)] [('5061282401', 'P420652', 5), ('42802569154', 'P420652', 4), ('6941883808', 'P420652', 2), ('27926227988', 'P420652', 5), ('7656791726', 'P420652', 5), ('38727834382', 'P420652', 5), ('37554855017', 'P420652', 5), ('21858073785', 'P420652', 4), ('1216391002', 'P420652', 3), ('45004161653', 'P420652', 5)]
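In the printed output the rows appear grouped by item_id, so a positional slice may place some items entirely in the validation set. One possible alternative is to shuffle before splitting. The helper below is a hypothetical sketch on toy triples, not part of LightFM or of the original notebook:

```python
import numpy as np

def shuffled_split(triples, val_fraction=0.2, seed=42):
    """Shuffle (user, item, rating) triples, then split off a validation share."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(triples))
    n_val = int(round(len(triples) * val_fraction))
    val = [triples[i] for i in order[:n_val]]
    train = [triples[i] for i in order[n_val:]]
    return train, val

# Toy triples grouped by item, mimicking the ordering seen in the notebook
toy = [(f"u{i}", f"P{i // 5}", 5) for i in range(20)]
train_toy, val_toy = shuffled_split(toy, val_fraction=0.25)
print(len(train_toy), len(val_toy))
```

The resulting lists have the same tuple shape as `train_set` and `val_set`, so they could be handed to `dataset.build_interactions` in the same way.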


In [None]:
weights

If you want to use feature values as part of your model, you should turn them into feature names first. This effectively one-hot encodes your features.
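A small sketch of that idea: each (column, value) pair becomes its own feature name. The helper `to_feature_names` is hypothetical, and the two user rows reuse values shown in the `df.head()` output.

```python
def to_feature_names(record, columns):
    """Turn {column: value} into a list of 'column:value' feature names."""
    return [f"{col}:{record[col]}" for col in columns]

# Two user rows with values taken from the df.head() output
users = [
    {"user_id": "5061282401", "skin_type": "dry", "hair_color": "blonde"},
    {"user_id": "42802569154", "skin_type": "normal", "hair_color": "brown"},
]
columns = ["skin_type", "hair_color"]

# Every distinct name becomes one feature, i.e. one column of the feature matrix
all_feature_names = sorted(
    {name for u in users for name in to_feature_names(u, columns)}
)

# (user_id, [feature names]) pairs
user_feature_tuples = [(u["user_id"], to_feature_names(u, columns)) for u in users]

print(all_feature_names)
print(user_feature_tuples)
```

This plays the same role as the `pd.get_dummies` step earlier in the notebook, which produces column names such as `skin_type_dry`.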

In [26]:
user_features_list = dataset.build_user_features(users_features)

item_features_list = dataset.build_item_features(items_features)



In [52]:
# Default hyperparameters (logistic loss, 10 latent components)
model = LightFM()

model.fit(
    interactions=interactions,
    #sample_weight=weights,
    item_features=item_features_list,
    user_features=user_features_list,
    verbose=True,
    epochs=10,
)

Epoch: 100%|███████████████████████████████████████████████████████████| 10/10 [00:13<00:00,  1.34s/it]


<lightfm.lightfm.LightFM at 0x7fbb640d3a60>

In [53]:
from lightfm.evaluation import auc_score
auc_score(model, val_interactions, train_interactions=interactions,
          user_features=user_features_list, item_features=item_features_list,
          preserve_rows=False, num_threads=1, check_intersections=True).mean()

0.4954862

In [48]:
from lightfm.evaluation import precision_at_k
precision_at_k(model, val_interactions, train_interactions=interactions, k=100,
               user_features=user_features_list, item_features=item_features_list,
               preserve_rows=False, num_threads=1, check_intersections=True).mean()

0.00023891895

The LightFM model can be compared to SVD (Singular Value Decomposition), or more specifically to SVD-style matrix factorization models such as the one implemented in the Surprise package.

Both kinds of model can be used to make item recommendations. The primary difference lies in the type of information they can use:

- SVD-based models are purely collaborative filtering methods: they make predictions from user-item interaction history alone and don't typically take into account any additional user or item features.
- LightFM is a hybrid model that can use both collaborative filtering information and additional features about the users or items (such as user demographic information or item attributes).

To compare the two, fit both models to the same data and compare their predictive performance on a held-out test set using an appropriate metric (such as precision at K, recall at K, or AUC-ROC). Keep the fairness of the comparison in mind: LightFM may have access to additional information (through user/item features) that SVD models do not.
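For such a comparison it helps to have a metric that does not depend on either library. The following is a minimal NumPy sketch of precision at K computed from a dense score matrix; the function name and toy arrays are hypothetical, and it is not LightFM's own implementation (which, for instance, can also exclude training interactions).

```python
import numpy as np

def precision_at_k_dense(scores, held_out, k):
    """Mean precision@k over users with at least one held-out interaction.

    scores:   (n_users, n_items) predicted scores from any model
    held_out: (n_users, n_items) boolean, True where a test interaction exists
    """
    # Indices of the k highest-scoring items per user
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # Count how many of those items are held-out positives
    hits = np.take_along_axis(held_out, top_k, axis=1).sum(axis=1)
    has_test = held_out.any(axis=1)
    return (hits[has_test] / k).mean()

scores = np.array([[0.9, 0.1, 0.5, 0.3],
                   [0.2, 0.8, 0.4, 0.6]])
held_out = np.array([[True, False, False, True],
                     [False, False, True, False]])

p2 = precision_at_k_dense(scores, held_out, k=2)
print(p2)
```

Feeding the score matrices of both models through the same function, on the same held-out interactions, keeps the metric definition identical across the two.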



## Generate predictions

In [37]:

# Look up the ID mappings once, then take the first 100 users and 200 items
user_id_map, _, item_id_map, _ = dataset.mapping()
user_ids = list(user_id_map.keys())[:100]
item_ids = list(item_id_map.keys())[:200]

# Internal (integer) indices of the selected items
internal_item_ids = np.array([item_id_map[item_id] for item_id in item_ids], dtype=np.int32)

# Score all selected items for each selected user.
# Note: user_features/item_features are not passed to predict() here, unlike in fit(),
# so the feature matrices used in fit() are not applied to these scores.
preds = pd.DataFrame(index=user_ids, columns=item_ids, dtype=float)
for user_id in user_ids:
    preds.loc[user_id] = model.predict(user_id_map[user_id], internal_item_ids)

# Find the overall min and max of the raw scores
original_min = preds.min().min()
original_max = preds.max().max()

# Min-max scale the raw scores onto the 1-5 range
# (a rescaling for readability, not a calibrated rating prediction)
normalized_preds = 1 + 4 * ((preds - original_min) / (original_max - original_min))

normalized_preds

Unnamed: 0,P473671,P473668,P473662,P473660,P473658,P473661,P473659,P473666,P472300,P473667,...,P503754,P503762,P503732,P503763,P503767,P503759,P504779,P503731,P404916,P393281
5061282401,2.772348,2.688248,2.625838,2.692483,2.623582,2.627005,2.689822,2.632319,2.632658,2.686186,...,2.464563,2.503393,2.408705,2.437641,2.500026,2.363281,2.532597,2.417677,2.340836,2.342098
42802569154,2.288107,2.197438,2.112893,2.181214,2.137878,2.111893,2.181569,2.133735,2.110953,2.19301,...,1.905611,1.960124,1.812233,1.878367,1.938712,1.749898,1.959118,1.82341,1.797784,1.788953
6941883808,2.377938,2.295445,2.218981,2.292684,2.227798,2.222195,2.283796,2.234237,2.225886,2.285315,...,2.04031,2.082072,1.969405,2.012532,2.074875,1.917089,2.103771,1.977464,1.924192,1.922394
27926227988,2.772784,2.68418,2.627889,2.692838,2.616759,2.63251,2.689341,2.629053,2.627989,2.68601,...,2.467412,2.495752,2.407281,2.43157,2.506773,2.364477,2.533286,2.414777,2.336723,2.340135
7656791726,2.524084,2.431491,2.357018,2.423148,2.371355,2.354193,2.421789,2.371447,2.350846,2.429559,...,2.15777,2.195804,2.060462,2.114818,2.18809,2.005367,2.214824,2.072894,2.036029,2.032556
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5353168486,2.77347,2.689643,2.626781,2.691149,2.625134,2.626165,2.692058,2.633233,2.631845,2.687033,...,2.465608,2.50736,2.41048,2.439417,2.502304,2.366251,2.527407,2.419444,2.341688,2.343603
7770185909,2.25496,2.187226,2.138756,2.202941,2.108424,2.151026,2.198091,2.129907,2.157997,2.176255,...,2.027331,2.053458,2.015259,2.003101,2.071095,1.983706,2.097657,2.015215,1.892942,1.897339
37463371315,2.515904,2.418399,2.346345,2.408786,2.356459,2.342667,2.409399,2.357719,2.336732,2.421619,...,2.133038,2.173935,2.036226,2.094568,2.168759,1.981557,2.203281,2.0466,2.019089,2.012781
45595459670,2.621043,2.532435,2.483374,2.540561,2.471092,2.478792,2.539582,2.479352,2.480717,2.533864,...,2.334585,2.363902,2.273956,2.29321,2.365252,2.233622,2.382684,2.284683,2.195835,2.197029
