### Overview

![3a_train](../docs/nbs/Model_Training-training_3a.jpg)

# Model recommendation with LightFM

### Import libraries

In [1]:
import numpy as np
import pandas as pd
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm import cross_validation
import json
import mlflow

### Defining variables

In [2]:
with open('config.json', 'r') as f:
    config = json.load(f)

In [3]:
# Close any run left open by a previous execution of this notebook
if mlflow.active_run() is not None:
    mlflow.end_run()

In [4]:
mlflow.set_experiment("LightFM Grid Search")

mlflow.start_run()

<ActiveRun: >

In [5]:

TEST_PERCENTAGE = 0.20
LEARNING_RATE = 0.1
NUM_EPOCHS = 80
NUM_COMPONENTS = 10
NUM_THREADS = 3
ALPHA_REG_L2 = 1e-3
MAX_SAMPLED = 10
SEED = 42

mlflow.log_param("test_percentage", TEST_PERCENTAGE)
mlflow.log_param("learning_rate", LEARNING_RATE)
mlflow.log_param("num_epochs", NUM_EPOCHS)
mlflow.log_param("num_components", NUM_COMPONENTS)
mlflow.log_param("alpha_reg_l2", ALPHA_REG_L2)
mlflow.log_param("max_sampled", MAX_SAMPLED)

10

### Retrieve data

In [6]:
dtype_df_train_score = {
"userId" : 'string',
"userType" : 'category',
"history" : 'string',
"score" : 'Float32',
"historyFreshnessNormalized" : 'Float32'
}

In [7]:
df_merged = pd.read_csv(config["DF_TRAIN_SCORES"], dtype=dtype_df_train_score)
df_merged.drop(columns=["Unnamed: 0"],inplace=True)
df_merged

|  | userId | history | userType | historyFreshnessNormalized | score |
|---|---|---|---|---|---|
| 0 | fbb963d61eb8149e7f43b1bd905457ba5e106a830ddc27... | 80aa7bb2-adce-4a55-9711-912c407927a1 | Non-Logged | 0.980416 | 0.704604 |
| 1 | fbb963d61eb8149e7f43b1bd905457ba5e106a830ddc27... | d9e5f15d-b441-4d8b-bee4-462b106d3916 | Non-Logged | 0.613061 | 0.721303 |
| 2 | 17f1083e6079b0f28f7820a6803583d1c1b405c0718b11... | e273dba4-136c-45fb-bdd6-0cc57b13aaf0 | Non-Logged | 0.880859 | 0.637834 |
| 3 | 528a8d7a2af73101da8d6709c1ec875b449a5a58749a99... | a0562805-c7d1-4ffd-b622-87c50ae006f4 | Non-Logged | 0.945895 | 0.622225 |
| 4 | 2dd18b58a634a4e77181a202cf152df6169dfb3e4230ef... | 233f8238-2ce0-470f-a9d5-0e0ac530382a | Non-Logged | 0.13293 | 0.68147 |
| ... | ... | ... | ... | ... | ... |
| 6335310 | 5889d6ebbf62e6c115e0a280063dc8189cca490cbfea56... | 7a349b09-badc-40a9-a194-83d959aeb50c | Non-Logged | 0.966442 | 0.663881 |
| 6335311 | 5889d6ebbf62e6c115e0a280063dc8189cca490cbfea56... | 6f344c45-e731-41b4-8c65-9967ebc03096 | Non-Logged | 0.937478 | 0.843734 |
| 6335312 | 5889d6ebbf62e6c115e0a280063dc8189cca490cbfea56... | 4c586bb4-f71d-4b39-9df8-e38ac3f632a0 | Non-Logged | 0.939154 | 0.473613 |
| 6335313 | 5889d6ebbf62e6c115e0a280063dc8189cca490cbfea56... | 855d20b7-53f2-4678-a10f-55402d085018 | Non-Logged | 0.929145 | 0.669966 |


### Prepare data

Before fitting the LightFM model, we need to create an instance of `Dataset`, which maps user and item IDs to internal indices and builds the interaction matrix.

In [8]:
dataset = Dataset()

# Get unique values for users, items, and user features
unique_users = df_merged["userId"].unique()
unique_items = df_merged["history"].unique()
unique_user_features = df_merged["userType"].unique().tolist()

# Fit dataset with users, items, and user feature names
dataset.fit(
    users=unique_users,
    items=unique_items,
    user_features=unique_user_features  # Register user features
)
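
Conceptually, fitting the dataset assigns each external user and item ID a contiguous internal index, and those indices become the rows and columns of the interaction matrix. The sketch below mimics that idea in plain Python. `build_id_map` and the sample IDs are made up for illustration, and this is not LightFM's actual implementation.

```python
def build_id_map(ids):
    """Assign each distinct ID a contiguous index in order of first appearance."""
    id_map = {}
    for raw_id in ids:
        # setdefault leaves existing entries untouched, so duplicates are ignored
        id_map.setdefault(raw_id, len(id_map))
    return id_map


# Hypothetical IDs standing in for userId / history values
toy_users = ["u_a", "u_b", "u_a", "u_c"]
toy_items = ["i_x", "i_y", "i_x"]

toy_user_map = build_id_map(toy_users)
toy_item_map = build_id_map(toy_items)

# The interaction matrix would have one row per user and one column per item
toy_shape = (len(toy_user_map), len(toy_item_map))
print(toy_user_map, toy_item_map, toy_shape)
```

The real mappings built by the library can be inspected with `dataset.mapping()`.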

In [9]:
# itertuples avoids building a Series per row, which matters at ~6M rows
(interactions, weights) = dataset.build_interactions(
    df_merged[["userId", "history", "score"]].itertuples(index=False, name=None)
)

In [10]:
user_features_list = [
    (row.userId, [row.userType])  
    for _, row in df_merged.iterrows()
]

user_features = dataset.build_user_features(user_features_list)

LightFM works slightly differently from other packages: it expects the train and test sets to have the same dimensions, so a conventional train/test split will not work.

The package provides the `cross_validation.random_train_test_split` method, which splits the interaction data into two disjoint training and test sets.

However, note that **it does not validate the interactions in the test set to guarantee that all items and users have historical interactions in the training set**. This may result in a partial cold-start problem in the test set.

In [11]:
# Split train and test sets (80/20 split)
train, test = cross_validation.random_train_test_split(interactions, test_percentage=TEST_PERCENTAGE, random_state=SEED)
# Reusing the same seed keeps the weight split aligned with the interaction split,
# as long as both matrices have the same nonzero entries
train_weights, test_weights  = cross_validation.random_train_test_split(weights, test_percentage=TEST_PERCENTAGE, random_state=SEED)
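
Because the split is random, some users or items in the test set may have no interactions left in the training set. Below is a minimal sketch for quantifying this, assuming the split matrices expose the coordinates of their nonzero entries as `row` and `col` arrays (as SciPy COO matrices do). `cold_start_share` is a hypothetical helper, not part of LightFM, and the toy indices are invented.

```python
def cold_start_share(train_ids, test_ids):
    """Return the fraction of distinct test IDs that never appear in training."""
    seen = set(train_ids)
    distinct_test = set(test_ids)
    if not distinct_test:
        return 0.0
    return len(distinct_test - seen) / len(distinct_test)


# Toy row (user) indices of nonzero entries
toy_train_rows = [0, 0, 1, 2]
toy_test_rows = [1, 3]

user_share = cold_start_share(toy_train_rows, toy_test_rows)
print(f"Share of test users without training interactions: {user_share:.2f}")
```

On the real split, something like `cold_start_share(train.row.tolist(), test.row.tolist())` would cover users, and the same call on `col` would cover items.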


Double check the size of both the train and test sets.

In [12]:
print(f"Shape of train interactions: {train.shape}")
print(f"Shape of test interactions: {test.shape}")

Shape of train interactions: (530491, 230722)
Shape of test interactions: (530491, 230722)


In [13]:
print(f"Shape of train weights: {train_weights.shape}")
print(f"Shape of test weights: {test_weights.shape}")

Shape of train weights: (530491, 230722)
Shape of test weights: (530491, 230722)
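
Matching shapes do not show that the interaction and weight splits are aligned, since every split keeps the full matrix shape. A stricter check compares the coordinates of the nonzero entries. The sketch below assumes the coordinates are available as plain sequences; `same_sparsity_pattern` is a hypothetical helper and the toy coordinates are invented.

```python
def same_sparsity_pattern(rows_a, cols_a, rows_b, cols_b):
    """True if both coordinate lists describe the same set of (row, col) cells."""
    return set(zip(rows_a, cols_a)) == set(zip(rows_b, cols_b))


# Toy coordinates: the second pair differs in one column index
aligned = same_sparsity_pattern([0, 1, 2], [5, 3, 1], [0, 1, 2], [5, 3, 1])
misaligned = same_sparsity_pattern([0, 1, 2], [5, 3, 1], [0, 1, 2], [5, 3, 2])
print(aligned, misaligned)
```

Applied to the real data, the inputs could be `train.row.tolist()`, `train.col.tolist()`, `train_weights.row.tolist()`, and `train_weights.col.tolist()`, assuming the splits are COO matrices.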


### Fit the LightFM model

In this notebook, the LightFM model uses the Weighted Approximate-Rank Pairwise (WARP) loss. Further explanation of the topic can be found [here](https://making.lyst.com/lightfm/docs/examples/warp_loss.html#learning-to-rank-using-the-warp-loss).


In general, WARP maximises the rank of positive examples by repeatedly sampling negative examples until it finds one that violates the ranking. This approach is recommended when only positive interactions are present.
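
To make the sampling idea concrete, here is a toy sketch of a single WARP step for one user and one positive item. It is an illustration only: the margin of 1, the log-rank weight, and the helper name `warp_sample` are assumptions made for this example, and LightFM's internal implementation may differ. The cap on the number of draws plays the role of the `max_sampled` parameter.

```python
import math
import random


def warp_sample(scores, positive, negatives, max_sampled, rng):
    """Sample negatives until one violates the ranking; return (negative, weight).

    A violation means the negative scores within a margin of 1 of the positive.
    Returns None if no violation is found within max_sampled draws.
    """
    n_items = len(scores)
    for n_sampled in range(1, max_sampled + 1):
        negative = rng.choice(negatives)
        if scores[negative] > scores[positive] - 1.0:
            # Few draws needed => positive is ranked low => larger update weight
            rank_estimate = (n_items - 1) // n_sampled
            weight = math.log(max(1.0, rank_estimate))
            return negative, weight
    return None


toy_scores = [2.5, 0.1, 2.0, -1.0]
rng = random.Random(42)

# Item 3 scores lowest, so any sampled negative violates on the first draw
hard_case = warp_sample(toy_scores, positive=3, negatives=[0, 1, 2], max_sampled=10, rng=rng)
# Item 0 beats items 1 and 3 by more than the margin, so no violation is found
easy_case = warp_sample(toy_scores, positive=0, negatives=[1, 3], max_sampled=10, rng=rng)
print(hard_case, easy_case)
```

A badly ranked positive triggers an update with a large weight, while a well-ranked positive triggers none, which is why the loss concentrates effort on the top of the ranking.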

The LightFM model can be fitted with the following code:

In [14]:
# Weighted Approximate-Rank Pairwise (WARP) loss
model = LightFM(
    no_components=NUM_COMPONENTS,
    loss="warp",
    learning_rate=LEARNING_RATE,
    user_alpha=ALPHA_REG_L2,
    max_sampled=MAX_SAMPLED,
    random_state=np.random.RandomState(SEED),
)
model.fit(
    train,
    sample_weight=train_weights,
    epochs=NUM_EPOCHS,
    num_threads=NUM_THREADS,
    user_features=user_features,
)


<lightfm.lightfm.LightFM at 0x7190318d56c0>

### Evaluate model

In [15]:
# Import the evaluation routines
from lightfm.evaluation import auc_score

# Compute evaluation metrics
auc_train = auc_score(model, train, user_features=user_features, num_threads=NUM_THREADS).mean()
auc_test = auc_score(model, test, train_interactions=train, user_features=user_features, num_threads=NUM_THREADS).mean()

# Print evaluation results
print(f"AUC test Score: {auc_test:.4f}")
print(f"AUC train Score: {auc_train:.4f}")

AUC test Score: 0.8772
AUC train Score: 0.8725
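
For intuition, the AUC for a single user is the probability that a randomly chosen positive item scores higher than a randomly chosen negative item; the values above are means over users. The sketch below computes that quantity by brute force on invented scores. `user_auc` is a hypothetical helper written for this example, not the library routine.

```python
def user_auc(scores, positives):
    """Probability that a random positive item outscores a random negative item."""
    positive_set = set(positives)
    pos_scores = [s for i, s in enumerate(scores) if i in positive_set]
    neg_scores = [s for i, s in enumerate(scores) if i not in positive_set]
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative item")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores for four items; items 0 and 3 are the user's positives
toy_scores = [0.9, 0.2, 0.7, 0.4]
toy_auc = user_auc(toy_scores, positives=[0, 3])
print(f"Toy AUC: {toy_auc:.2f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a ranking that places every positive above every negative.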


In [16]:
mlflow.log_metric("auc_train", auc_train)
mlflow.log_metric("auc_test", auc_test)

mlflow.end_run()

### Save pkls to serve model

In [17]:
user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()
item_id_map_reverse = {v: k for k, v in item_id_map.items()}
interactions_shape = interactions.shape

In [18]:
from utils.custom_data_structs import UserItemData

user_item_data = UserItemData(
    user_id_map = user_id_map, 
    item_id_map = item_id_map, 
    user_id_map_reverse = None, 
    item_id_map_reverse = item_id_map_reverse, 
    user_feature_map = user_feature_map, 
    item_feature_map = item_feature_map, 
    interactions_shape = interactions_shape
    )

### Import Pkls to Test Serving Model

In [19]:
import pickle

# Context managers make sure the files are flushed and closed
with open('artifacts/lightfm_model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('artifacts/user_item_data.pkl', 'wb') as f:
    pickle.dump(user_item_data, f)

In [20]:
import pickle
from utils.custom_data_structs import UserItemData
from utils.model_funcs import recommend_by_model_scores

In [21]:
with open('artifacts/lightfm_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
with open('artifacts/user_item_data.pkl', 'rb') as f:
    loaded_user_item_data: UserItemData = pickle.load(f)

### Make predictions for known and unknown users with the same recommendation function and pkls

In [22]:
# predict for known user
user_hash = '5f5e17781fc2ec0ddcfb2e9356e61c5d3d4b0b3c8fabd20917feb9e807463856'
recommendation_list = recommend_by_model_scores(user_hash,loaded_user_item_data,loaded_model)
print(recommendation_list)

['64a10a17-b54e-42f9-b4f8-8653ac8d3dfa', 'cb0d3637-f60b-4c74-bffe-46bbbd3317da', '180c9fda-9dac-4527-92c0-311cf61792cd', '5170d432-dc98-4921-b7f1-9f3745147028', '95afd4ba-3f07-4705-aff0-7a77c60859d6', '149ace4e-8454-4b6b-8acd-f3210314c5d4', 'd544624c-1f1c-4143-9b8c-2d9ec40f14ca']


In [23]:
# predict for unknown user
user_hash = ''
recommendation_list = recommend_by_model_scores(user_hash,loaded_user_item_data,loaded_model)
print(recommendation_list)

['a0836cb2-4c51-4964-a4dc-2f47d647ef5b', 'cbe29bac-065d-412b-9e4f-dc0f70341daa', 'e2597c82-3c22-419c-9e38-9f72ce7edc3c', '40610df4-da21-452c-8f01-8cf07eb1afa1', '180c9fda-9dac-4527-92c0-311cf61792cd', 'cb0d3637-f60b-4c74-bffe-46bbbd3317da', '9c28556b-937b-4a94-b66a-c22214b69a48']
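
The source of `recommend_by_model_scores` is not shown in this notebook, so the following is only a guess at the general pattern such a function might follow: rank items by model score for known users and return a fixed fallback list for unknown ones. Every name in it (`recommend_top_k`, `score_fn`, `fallback_items`, the default `k=7`) is hypothetical and chosen for illustration.

```python
def recommend_top_k(user_id, user_id_map, item_id_map_reverse, score_fn, fallback_items, k=7):
    """Return k item IDs: model-ranked for known users, a fixed fallback otherwise."""
    user_index = user_id_map.get(user_id)
    if user_index is None:
        # Unknown user: there is no internal index to score with
        return list(fallback_items[:k])
    scores = score_fn(user_index)  # one score per internal item index
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [item_id_map_reverse[i] for i in ranked[:k]]


# Toy mappings and a stand-in scoring function
toy_user_id_map = {"known_user": 0}
toy_item_id_map_reverse = {0: "item_a", 1: "item_b", 2: "item_c"}
toy_fallback = ["item_c", "item_a"]


def toy_score_fn(user_index):
    return [0.1, 0.9, 0.5]


known = recommend_top_k("known_user", toy_user_id_map, toy_item_id_map_reverse, toy_score_fn, toy_fallback, k=2)
unknown = recommend_top_k("", toy_user_id_map, toy_item_id_map_reverse, toy_score_fn, toy_fallback, k=2)
print(known, unknown)
```

In practice `score_fn` would wrap the loaded model's scoring call, and the fallback could be a popularity-based list; both choices are assumptions here.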
