<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

## FastAI Recommender

This notebook shows how to use the [FastAI](https://fast.ai) recommender, which uses [PyTorch](https://pytorch.org/) under the hood.

In [4]:
# Suppress all warnings
import warnings
warnings.filterwarnings("ignore")

import os
import sys
import numpy as np
import pandas as pd
import torch
import fastai
from tempfile import TemporaryDirectory

from fastai.collab import collab_learner, CollabDataLoaders, load_learner

from recommenders.utils.constants import (
    DEFAULT_USER_COL as USER, 
    DEFAULT_ITEM_COL as ITEM, 
    DEFAULT_RATING_COL as RATING, 
    DEFAULT_TIMESTAMP_COL as TIMESTAMP, 
    DEFAULT_PREDICTION_COL as PREDICTION
) 
from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.models.fastai.fastai_utils import cartesian_product, score
from recommenders.evaluation.python_evaluation import map, ndcg_at_k, precision_at_k, recall_at_k
from recommenders.evaluation.python_evaluation import rmse, mae, rsquared, exp_var
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Fast AI version: {}".format(fastai.__version__))
print("Torch version: {}".format(torch.__version__))
print("CUDA Available: {}".format(torch.cuda.is_available()))
print("CuDNN Enabled: {}".format(torch.backends.cudnn.enabled))

System version: 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0]
Pandas version: 2.2.3
Fast AI version: 2.8.1
Torch version: 2.6.0+cu124
CUDA Available: False
CuDNN Enabled: True


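The column-name constants (`USER`, `ITEM`, `RATING`, `TIMESTAMP`, `PREDICTION`) are imported above. Next, define the top-k cutoff, the MovieLens dataset size, and the model parameters.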

In [6]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

# Model parameters
N_FACTORS = 40
EPOCHS = 5

In [7]:
ratings_df = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=[USER,ITEM,RATING,TIMESTAMP]
)

# make sure the IDs are loaded as strings to better prevent confusion with embedding ids
ratings_df[USER] = ratings_df[USER].astype('str')
ratings_df[ITEM] = ratings_df[ITEM].astype('str')

ratings_df.head()

100%|██████████| 4.81k/4.81k [00:00<00:00, 12.2kKB/s]


Unnamed: 0,userID,itemID,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [8]:
# Split the dataset
train_valid_df, test_df = python_stratified_split(
    ratings_df, 
    ratio=0.75, 
    min_rating=1, 
    filter_by="item", 
    col_user=USER, 
    col_item=ITEM
)

In [9]:
# Remove "cold" users (users with no ratings in the training data) from the test set
test_df = test_df[test_df[USER].isin(train_valid_df[USER])]

In [10]:
test_df

Unnamed: 0,userID,itemID,rating,timestamp
88028,57,1,5.0,883698581
17299,141,1,3.0,884584753
44422,184,1,4.0,889907652
16768,15,1,1.0,879455635
58395,486,1,4.0,879874870
...,...,...,...,...
52763,798,998,3.0,875915317
11582,299,998,2.0,889503774
98130,642,998,3.0,886569765
98226,682,999,2.0,888521942


## Training

In [11]:
# fix random seeds to make sure our runs are reproducible
np.random.seed(101)
torch.manual_seed(101)
torch.cuda.manual_seed_all(101)

In [12]:
with Timer() as preprocess_time:
    data = CollabDataLoaders.from_df(train_valid_df, 
                                     user_name=USER, 
                                     item_name=ITEM, 
                                     rating_name=RATING, 
                                     valid_pct=0)


In [16]:
data.show_batch()

Unnamed: 0,userID,itemID,rating
0,560,181,4.0
1,805,725,3.0
2,933,734,2.0
3,627,546,3.0
4,280,73,3.0
5,201,68,2.0
6,747,510,5.0
7,180,186,4.0
8,864,183,4.0
9,694,204,4.0


Now we will create a `collab_learner` for the data, which by default uses the [EmbeddingDotBias](https://docs.fast.ai/collab.html#EmbeddingDotBias) model. We will be using 40 latent factors. This will create an embedding for the users and the items that will map each of these to 40 floats as can be seen below. Note that the embedding parameters are not predefined, but are learned by the model.

Although ratings only range from 1 to 5, we set the range of possible predictions to 0 to 5.5. The model squashes its raw score into this range with a sigmoid, which only approaches the bounds asymptotically, so the wider range lets the model actually predict values close to 1 and 5, which improves accuracy. Lastly, we set a weight-decay value for regularization.

In [17]:
learn = collab_learner(data, n_factors=N_FACTORS, y_range=[0,5.5], wd=1e-1)
learn.model

EmbeddingDotBias(
  (u_weight): Embedding(944, 40)
  (i_weight): Embedding(1683, 40)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1683, 1)
)
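The prediction step described above can be sketched in plain Python. This is a minimal illustration, assuming the model computes a dot product of the user and item factors, adds both biases, and scales a sigmoid into `y_range`; the 3-factor vectors and bias values are made up for the example (the notebook uses 40 learned factors), and the helper names are not part of fastai.

```python
import math

def sigmoid_range(x, low, high):
    """Squash a raw score into the open interval (low, high)."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low

def dot_bias_predict(user_vec, item_vec, user_bias, item_bias, y_range=(0.0, 5.5)):
    """Dot product of the latent factors plus both biases, then range scaling."""
    raw = sum(u * i for u, i in zip(user_vec, item_vec)) + user_bias + item_bias
    return sigmoid_range(raw, *y_range)

# Hypothetical embeddings and biases, for illustration only.
user_vec = [0.5, -0.2, 0.1]
item_vec = [0.4, 0.3, -0.6]
prediction = dot_bias_predict(user_vec, item_vec, user_bias=0.3, item_bias=0.9)
print(round(prediction, 3))

# With an upper bound of 5.5, a moderately large raw score already maps above 5.0;
# with an upper bound of exactly 5.0 the same score would stay strictly below 5.0.
print(sigmoid_range(4.0, 0.0, 5.5), sigmoid_range(4.0, 1.0, 5.0))
```

The second `print` shows why the range is widened beyond the 1 to 5 rating scale: the sigmoid never reaches its bounds, so a bound of 5.0 would make a predicted rating of 5 unreachable.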

Now train the model for 5 epochs, passing the maximum learning rate. With `fit_one_cycle`, the learning rate is first warmed up to this maximum and then reduced over the rest of training using cosine annealing.

In [18]:
with Timer() as train_time:
    learn.fit_one_cycle(EPOCHS, lr_max=5e-3)

print("Took {} seconds for training.".format(train_time))

epoch,train_loss,valid_loss,time
0,0.877889,,00:06
1,0.696432,,00:05
2,0.563979,,00:05
3,0.511963,,00:05
4,0.50049,,00:05


Took 28.5196 seconds for training.
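To make the learning-rate schedule more concrete, here is a rough sketch of a one-cycle policy with cosine annealing. It assumes the defaults `div=25`, `div_final=1e5` and `pct_start=0.25`; the function names are illustrative and this is not fastai's own implementation.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from `start` (pos=0) to `end` (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(progress, lr_max=5e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at a given fraction (0..1) of the total training run."""
    if progress < pct_start:
        # Warm-up phase: rise from lr_max / div to lr_max.
        return cos_anneal(lr_max / div, lr_max, progress / pct_start)
    # Annealing phase: fall from lr_max to lr_max / div_final.
    return cos_anneal(lr_max, lr_max / div_final,
                      (progress - pct_start) / (1 - pct_start))

for progress in (0.0, 0.25, 0.5, 1.0):
    print(f"{progress:.2f} -> {one_cycle_lr(progress):.2e}")
```

Under these assumptions the rate peaks a quarter of the way through training and ends several orders of magnitude below `lr_max`.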


Save the learner so it can be loaded back later for inference, that is, for generating recommendations.

In [19]:
tmp = TemporaryDirectory()
model_path = os.path.join(tmp.name, "movielens_model.pkl")

In [22]:
learn.export(model_path)

## Generating Recommendations

Load the learner from disk.

In [23]:
learner = load_learner(model_path)

Get all users and items that the model knows. The first entry of each class list is skipped because it is a placeholder rather than a real ID.

In [24]:
total_users, total_items = learner.dls.classes.values()
total_items = total_items[1:]
total_users = total_users[1:]

In [25]:
total_users

(#943) ['1','10','100','101','102','103','104','105','106','107','108','109','11','110','111','112','113','114','115','116'...]
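The ordering in the output above (`'1'`, `'10'`, `'100'`, ...) comes from sorting the IDs as strings. A small sketch of how such a class list could be built, assuming a placeholder entry (here called `#na#`) is reserved at index 0; `build_vocab` and the sample IDs are hypothetical:

```python
def build_vocab(ids, placeholder="#na#"):
    """Sorted unique string IDs, with a placeholder reserved at index 0."""
    return [placeholder] + sorted(set(ids))

raw_user_ids = ["10", "2", "1", "100", "2"]  # hypothetical IDs
vocab = build_vocab(raw_user_ids)

# Map each ID to its embedding row index.
id_to_index = {uid: idx for idx, uid in enumerate(vocab)}

print(vocab)             # string sort: '10' and '100' come before '2'
known_users = vocab[1:]  # drop the placeholder, as the cell above does
print(known_users, len(vocab))
```

This also matches the model summary, where the user embedding has one more row (944) than there are known users (943).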

Get all users from the test set and keep only those that were known in the training set.

In [26]:
test_users = test_df[USER].unique()
test_users = np.intersect1d(test_users, total_users)

In [27]:
# Look up the learned embedding vectors for users '1' and '10' (`weight` comes from the underlying EmbeddingDotBias model)
user_embeddings = learner.weight(['1', '10'], is_item=False)

In [28]:
user_embeddings

tensor([[-1.8268e-01,  1.2263e-01,  4.2290e-01, -6.8767e-01,  3.5050e-01,
          2.7706e-01,  8.8839e-02,  2.0550e-02, -1.4047e-01,  1.8901e-01,
          8.6447e-02,  2.4860e-01, -1.7799e-01,  1.4392e-01,  4.2438e-01,
          2.0740e-01, -3.1880e-01, -8.7065e-02,  3.5110e-01, -3.5063e-01,
         -1.1037e-01, -1.0008e-01, -1.7147e-01, -2.4778e-01,  3.9967e-01,
         -6.0207e-02, -5.9832e-01, -2.4140e-01, -4.1792e-01,  1.0378e-01,
         -1.8700e-01, -2.2076e-01,  4.8409e-01,  1.7672e-01,  1.6418e-01,
         -2.3211e-02,  1.8199e-01, -2.8878e-01, -6.6121e-02,  4.2110e-04],
        [-6.0007e-02,  2.2414e-01,  3.1868e-02,  3.4639e-02,  1.0609e-01,
          4.8940e-02,  1.9685e-01, -2.7202e-01, -1.7338e-01, -7.1262e-02,
          2.7214e-01,  2.2561e-01,  1.5987e-01,  4.1024e-02,  1.3439e-01,
          1.7937e-01, -2.2741e-01, -1.5880e-01,  4.7721e-02, -2.0809e-01,
         -7.9615e-02,  2.9610e-01,  1.4738e-01, -1.0029e-01,  1.5733e-01,
         -1.4578e-01,  9.5020e-02, -2

In [29]:
len(user_embeddings[0])

40

In [39]:
test_users

array(['1', '10', '100', '101', '102', '103', '104', '105', '106', '107',
       '108', '109', '11', '110', '111', '112', '113', '114', '115',
       '116', '117', '118', '119', '12', '120', '121', '122', '123',
       '124', '125', '126', '127', '128', '129', '13', '130', '131',
       '132', '133', '134', '135', '136', '137', '138', '139', '14',
       '140', '141', '142', '143', '144', '145', '146', '147', '148',
       '149', '15', '150', '151', '152', '153', '154', '155', '156',
       '157', '158', '159', '16', '160', '161', '162', '163', '164',
       '165', '166', '167', '168', '169', '17', '170', '171', '172',
       '173', '174', '175', '176', '177', '178', '179', '18', '180',
       '181', '182', '183', '184', '185', '186', '187', '188', '189',
       '19', '190', '191', '192', '193', '194', '195', '196', '197',
       '198', '199', '2', '20', '200', '201', '202', '203', '204', '205',
       '206', '207', '208', '209', '21', '210', '211', '212', '213',
       '214', '215', '

Build the cartesian product of the test set users and all items known to the model.

In [41]:
np.array(total_items)

array(['1', '10', '100', ..., '997', '998', '999'],
      shape=(1682,), dtype='<U4')

In [42]:
users_items = cartesian_product(np.array(test_users),np.array(total_items))
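The cartesian product pairs every user with every item, so its size is the number of users times the number of items. A minimal standard-library sketch of the same idea (the `user_item_pairs` helper and the tiny ID lists are illustrative, not the `cartesian_product` utility used above):

```python
from itertools import product

def user_item_pairs(users, items):
    """Every (user, item) combination, users varying slowest."""
    return list(product(users, items))

users = ["1", "10"]
items = ["1", "10", "100"]
pairs = user_item_pairs(users, items)

print(len(pairs))   # 2 users x 3 items = 6 pairs
print(pairs[:4])
```

At the scale of this notebook, 943 users and 1682 items would give 943 × 1682 = 1,586,126 candidate pairs.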

In [43]:

users_items = pd.DataFrame(users_items, columns=[USER,ITEM])

In [45]:
users_items

Unnamed: 0,userID,itemID
0,1,1
1,1,10
2,1,100
3,1,1000
4,1,1001
...,...,...
1586121,99,995
1586122,99,996
1586123,99,997
1586124,99,998



Lastly, remove the user/item combinations that are in the training set -- we don't want to recommend a movie that the user has already watched.

In [26]:
training_removed = pd.merge(users_items, train_valid_df.astype(str), on=[USER, ITEM], how='left')
training_removed = training_removed[training_removed[RATING].isna()][[USER, ITEM]]
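The left merge followed by the `isna()` filter acts as an anti-join: it keeps only candidate pairs that have no matching training row. The same logic, sketched with plain Python sets on made-up data (the `remove_seen` helper is hypothetical):

```python
def remove_seen(candidates, seen):
    """Keep only the (user, item) pairs that do not appear in `seen`."""
    seen_set = set(seen)  # set lookup keeps the filter linear in len(candidates)
    return [pair for pair in candidates if pair not in seen_set]

# Hypothetical candidate pairs and training interactions.
candidates = [("1", "a"), ("1", "b"), ("2", "a"), ("2", "b")]
train_pairs = [("1", "a"), ("2", "b")]

unseen = remove_seen(candidates, train_pairs)
print(unseen)
```

Only the pairs the users have not interacted with remain as recommendation candidates.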

### Score the model to find the top K recommendations

In [27]:
with Timer() as test_time:
    top_k_scores = score(learner, 
                         test_df=training_removed,
                         user_col=USER, 
                         item_col=ITEM, 
                         prediction_col=PREDICTION)

print("Took {} seconds for {} predictions.".format(test_time, len(training_removed)))

Took 1.9665 seconds for 1511060 predictions.
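Once every candidate pair has a predicted rating, a top K list per user is just the K highest-scoring items for that user. A small sketch of that selection on made-up scores; it does not reproduce the internals of the `score` helper, and `top_k_per_user` is a hypothetical name:

```python
import heapq
from collections import defaultdict

def top_k_per_user(scored, k):
    """scored: iterable of (user, item, prediction) tuples.

    Returns {user: [(item, prediction), ...]} sorted by descending prediction.
    """
    by_user = defaultdict(list)
    for user, item, pred in scored:
        by_user[user].append((item, pred))
    return {
        user: heapq.nlargest(k, rows, key=lambda row: row[1])
        for user, rows in by_user.items()
    }

# Hypothetical predictions.
scored = [
    ("u1", "a", 4.2), ("u1", "b", 3.1), ("u1", "c", 4.8),
    ("u2", "a", 2.0), ("u2", "b", 3.9),
]
top2 = top_k_per_user(scored, k=2)
print(top2)
```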


Calculate some ranking metrics for our model.

In [28]:
eval_map = map(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
               col_rating=RATING, col_prediction=PREDICTION, 
               relevancy_method="top_k", k=TOP_K)

In [29]:
eval_ndcg = ndcg_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                      col_rating=RATING, col_prediction=PREDICTION, 
                      relevancy_method="top_k", k=TOP_K)

In [30]:
eval_precision = precision_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                                col_rating=RATING, col_prediction=PREDICTION, 
                                relevancy_method="top_k", k=TOP_K)

In [31]:
eval_recall = recall_at_k(test_df, top_k_scores, col_user=USER, col_item=ITEM, 
                          col_rating=RATING, col_prediction=PREDICTION, 
                          relevancy_method="top_k", k=TOP_K)

In [32]:
print("Model:\t\t" + learn.__class__.__name__,
      "Top K:\t\t%d" % TOP_K,
      "MAP:\t\t%f" % eval_map,
      "NDCG:\t\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

Model:		Learner
Top K:		10
MAP:		0.021466
NDCG:		0.133483
Precision@K:	0.123224
Recall@K:	0.049921
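For intuition about what these ranking metrics measure, here is a simplified single-user sketch with binary relevance. It is an approximation for illustration only: the library functions average over all users and may differ in details, and the `*_one_user` names and sample data are made up.

```python
import math

def precision_at_k_one_user(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k_one_user(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k_one_user(recommended, relevant, k):
    """Discounted gain of the hits, normalized by the best possible ordering."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

recommended = ["a", "b", "c", "d"]   # ranked list for one hypothetical user
relevant = {"a", "d", "x"}           # items that user has in the test set

p = precision_at_k_one_user(recommended, relevant, k=4)
r = recall_at_k_one_user(recommended, relevant, k=4)
n = ndcg_at_k_one_user(recommended, relevant, k=4)
print(p, r, n)
```

Precision and recall only count hits, while NDCG also rewards placing the hits near the top of the list.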


These numbers are lower than those of [SAR](../sar_single_node_movielens.ipynb), which is expected since the model is explicitly trying to generalize the users and items to the latent factors. Next, look at how well the model predicts the rating a user would give a movie. For this we only need to score the user-item pairs in `test_df`.

In [33]:
scores = score(learner, 
               test_df=test_df.copy(), 
               user_col=USER, 
               item_col=ITEM, 
               prediction_col=PREDICTION)

Now calculate some regression metrics.

In [34]:
eval_r2 = rsquared(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_rmse = rmse(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_mae = mae(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)
eval_exp_var = exp_var(test_df, scores, col_user=USER, col_item=ITEM, col_rating=RATING, col_prediction=PREDICTION)

print("Model:\t\t\t" + learn.__class__.__name__,
      "RMSE:\t\t\t%f" % eval_rmse,
      "MAE:\t\t\t%f" % eval_mae,
      "Explained variance:\t%f" % eval_exp_var,
      "R squared:\t\t%f" % eval_r2, sep='\n')

Model:			Learner
RMSE:			0.924832
MAE:			0.732161
Explained variance:	0.327327
R squared:		0.325991
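The four regression metrics can be written out directly from their textbook definitions. The following is a plain-Python sketch on made-up ratings, not the library implementation; `regression_metrics` is a hypothetical helper.

```python
import math

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, R squared and explained variance from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    sq_err = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mean_err = sum(errors) / n
    var_err = sum((e - mean_err) ** 2 for e in errors) / n
    return {
        "rmse": math.sqrt(sq_err / n),
        "mae": sum(abs(e) for e in errors) / n,
        "rsquared": 1 - sq_err / ss_tot,
        # Explained variance ignores a constant offset (bias) in the errors.
        "exp_var": 1 - var_err / (ss_tot / n),
    }

# Hypothetical true and predicted ratings.
metrics = regression_metrics([4, 3, 5, 2], [3.5, 3.0, 4.0, 3.0])
print(metrics)
```

Explained variance and R squared coincide when the mean prediction error is zero, which is consistent with the two values reported above being nearly equal.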


That RMSE is competitive in comparison with other models.

In [36]:
# Record results for tests - ignore this cell
store_metadata("map", eval_map)
store_metadata("ndcg", eval_ndcg)
store_metadata("precision", eval_precision)
store_metadata("recall", eval_recall)
store_metadata("rmse", eval_rmse)
store_metadata("mae", eval_mae)
store_metadata("exp_var", eval_exp_var)
store_metadata("rsquared", eval_r2)
store_metadata("train_time", train_time.interval)
store_metadata("test_time", test_time.interval)

In [27]:
tmp.cleanup()