# Incremental Matrix Factorization on ML-100k Using the River Library

## Setup

In [None]:
!pip install -U river numpy

Restart the session at this point!

In [None]:
import json
import river
from river.evaluate import progressive_val_score

## Data

In [None]:
!wget -q --show-progress https://github.com/sparsh-ai/model-retraining/raw/main/data/bronze/ml_100k.csv



In [None]:
def get_data_stream():
    data_stream = river.stream.iter_csv('ml_100k.csv',
                                        target="rating",
                                        delimiter="\t",
                                        converters={
                                            "timestamp": int,
                                            "release_date": int,
                                            "age": float,
                                            "rating": float,
                                        })
    return data_stream 

In [None]:
for x, y in get_data_stream():
    print(f'x = {json.dumps(x, indent=4)}\ny = {y}')
    break

x = {
    "user": "259",
    "item": "255",
    "timestamp": 874731910000000000,
    "title": "My Best Friend's Wedding (1997)",
    "release_date": 866764800000000000,
    "genres": "comedy, romance",
    "age": 21.0,
    "gender": "M",
    "occupation": "student",
    "zip_code": "48823"
}
y = 4.0


Let's define a routine to evaluate our models on MovieLens 100K. Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are our metrics, printed alongside each model's computation time and memory usage:

In [None]:
def evaluate(model):
    X_y = get_data_stream()
    metric = river.metrics.MAE() + river.metrics.RMSE()
    _ = progressive_val_score(X_y, model, metric, print_every=25_000, show_time=True, show_memory=True)
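
Progressive validation scores each sample before the model learns from it, so every prediction is made on unseen data. The loop below is a minimal plain-Python sketch of that idea, not River's implementation: `RunningMean` and `progressive_mae` are hypothetical names that only mimic the `predict_one` / `learn_one` interface.

```python
class RunningMean:
    """Hypothetical stand-in for a regressor: always predicts the mean seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


def progressive_mae(stream, model):
    """Test-then-train loop: predict first, update the metric, then learn."""
    total_abs_error = 0.0
    n = 0
    for x, y in stream:
        y_pred = model.predict_one(x)      # 1. predict before seeing the label
        total_abs_error += abs(y - y_pred)  # 2. score the prediction
        n += 1
        model.learn_one(x, y)              # 3. only then learn from the sample
    return total_abs_error / n


# Toy stream of (features, rating) pairs, made up for illustration
toy_stream = [
    ({"user": "a", "item": "1"}, 4.0),
    ({"user": "b", "item": "1"}, 2.0),
    ({"user": "a", "item": "2"}, 3.0),
]
mae = progressive_mae(toy_stream, RunningMean())
print(mae)
```

Because the very first prediction is made by an untrained model, the early part of the stream weighs on the reported error, which is why the metrics are printed at regular intervals.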

## Naive prediction

It's good practice in machine learning to start with a naive baseline and then iterate from simple models to more complex ones, observing progress incrementally. Let's start by predicting the running mean of the target as a first shot:



In [None]:
mean = river.stats.Mean()
metric = river.metrics.MAE() + river.metrics.RMSE()

for i, x_y in enumerate(get_data_stream(), start=1):
    _, y = x_y
    metric.update(y, mean.get())
    mean.update(y)

    if not i % 25_000:
        print(f'[{i:,d}] {metric}')

[25,000] MAE: 0.934259, RMSE: 1.124469
[50,000] MAE: 0.923893, RMSE: 1.105
[75,000] MAE: 0.937359, RMSE: 1.123696
[100,000] MAE: 0.942162, RMSE: 1.125783


## Parameters

Let's review the important parameters to tune when dealing with this family of methods:

- n_factors: the number of latent factors. The more you set, the more item aspects and user preferences the model can learn. Too many will cause overfitting; L2 regularization can help.
- *_optimizer: the optimizers. Classic stochastic gradient descent performs well; finding a good learning rate makes the difference.
- initializer: the initialization of the latent weights. Latent vectors have to be initialized with non-constant values. We generally sample them from a zero-mean normal distribution with a small standard deviation.
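
To see why the latent vectors need non-constant initial values, consider a model whose prediction is just the dot product of a user vector and an item vector. The gradient for each vector is proportional to the other one, so two zero vectors never move. The following is a small sketch under that assumption; `sgd_step` is a hypothetical helper, not a River function.

```python
import random


def sgd_step(p, q, y, lr=0.05):
    """One SGD step on the squared error of the prediction p . q (no regularization)."""
    error = sum(pf * qf for pf, qf in zip(p, q)) - y
    new_p = [pf - lr * error * qf for pf, qf in zip(p, q)]
    new_q = [qf - lr * error * pf for pf, qf in zip(p, q)]
    return new_p, new_q


n_factors = 3

# Zero initialization: both gradients are multiples of a zero vector
p_zero, q_zero = sgd_step([0.0] * n_factors, [0.0] * n_factors, y=4.0)

# Small zero-mean Gaussian initialization gives non-zero gradients
rng = random.Random(73)
p0 = [rng.gauss(0.0, 0.1) for _ in range(n_factors)]
q0 = [rng.gauss(0.0, 0.1) for _ in range(n_factors)]
p_rand, q_rand = sgd_step(p0, q0, y=4.0)

print(p_zero, q_zero)  # still all zeros
print(p_rand != p0)    # the randomly initialized vector has moved
```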

## Baseline model

Now we can do machine learning and explore the models available in the river.reco module, starting with the baseline model. It extends our naive prediction by adding two bias terms to the global running mean, one for the user and one for the item, which capture how each deviates from the general tendency. This baseline model can be viewed as a linear regression on one-hot-encoded users and items, where the intercept is replaced by the running mean of the target.

All machine learning models in river expect dicts as input, with feature names as keys and feature values as values. Specifically, models from river.reco expect a 'user' entry and an 'item' entry, without any type constraint on their values (they can be strings or numbers). Other entries, if they exist, are simply ignored. This is quite useful, as we don't need to spend time and storage on one-hot encoding.
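
The baseline described above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (squared error, plain SGD, no regularization), not River's actual code; `BaselineSketch` is a hypothetical name.

```python
from collections import defaultdict


class BaselineSketch:
    """Illustrative baseline: global running mean + user bias + item bias."""

    def __init__(self, lr=0.025):
        self.lr = lr
        self.n = 0
        self.global_mean = 0.0
        self.user_bias = defaultdict(float)  # biases start at zero
        self.item_bias = defaultdict(float)

    def predict_one(self, x):
        # Only the 'user' and 'item' entries are read; other keys are ignored
        return self.global_mean + self.user_bias[x["user"]] + self.item_bias[x["item"]]

    def learn_one(self, x, y):
        error = self.predict_one(x) - y
        # Gradient step on the squared error for both bias terms
        self.user_bias[x["user"]] -= self.lr * error
        self.item_bias[x["item"]] -= self.lr * error
        # Update the running mean of the target
        self.n += 1
        self.global_mean += (y - self.global_mean) / self.n


model = BaselineSketch()
x = {"user": "259", "item": "255", "title": "ignored by the model"}
before = model.predict_one(x)
model.learn_one(x, 4.0)
after = model.predict_one(x)
print(before, after)
```

The dictionaries play the role of the one-hot-encoded weight vector: a bias is created only when a user or an item is first seen.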

In [None]:
baseline_params = {
    'optimizer': river.optim.SGD(0.025),
    'l2': 0.,
    'initializer': river.optim.initializers.Zeros()
}

model = river.meta.PredClipper(
    regressor=river.reco.Baseline(**baseline_params),
    y_min=1,
    y_max=5
)

evaluate(model)

[25,000] MAE: 0.761844, RMSE: 0.960972 – 0:00:01.316880 – 170.36 KB
[50,000] MAE: 0.753292, RMSE: 0.951223 – 0:00:02.651206 – 239 KB
[75,000] MAE: 0.754177, RMSE: 0.953376 – 0:00:03.961798 – 282.81 KB
[100,000] MAE: 0.754651, RMSE: 0.954148 – 0:00:05.297710 – 306.41 KB


## Matrix Factorization

### Funk Matrix Factorization (FunkMF)

It's the pure form of matrix factorization, which consists of learning only the user and item latent representations. Simon Funk popularized its stochastic gradient descent optimization in 2006 during the Netflix Prize. FunkMF is sometimes referred to as Probabilistic Matrix Factorization, which is an extended probabilistic version.
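
As a rough sketch of the idea, the prediction is the dot product of a user vector and an item vector, and each rating triggers one regularized SGD step on both. The class below is a hypothetical plain-Python illustration (squared error, L2 penalty), not River's implementation.

```python
import random
from collections import defaultdict


class FunkMFSketch:
    """Illustrative FunkMF: prediction = <user latent vector, item latent vector>."""

    def __init__(self, n_factors=10, lr=0.05, l2=0.1, seed=73):
        rng = random.Random(seed)

        def new_vector():
            # Small zero-mean Gaussian initialization
            return [rng.gauss(0.0, 0.1) for _ in range(n_factors)]

        self.lr, self.l2 = lr, l2
        self.user_latents = defaultdict(new_vector)
        self.item_latents = defaultdict(new_vector)

    def predict_one(self, user, item):
        p, q = self.user_latents[user], self.item_latents[item]
        return sum(pf * qf for pf, qf in zip(p, q))

    def learn_one(self, user, item, y):
        p, q = self.user_latents[user], self.item_latents[item]
        error = self.predict_one(user, item) - y
        # Both updates use the vectors as they were before the step
        self.user_latents[user] = [
            pf - self.lr * (error * qf + self.l2 * pf) for pf, qf in zip(p, q)
        ]
        self.item_latents[item] = [
            qf - self.lr * (error * pf + self.l2 * qf) for pf, qf in zip(p, q)
        ]


model = FunkMFSketch()
first_error = abs(model.predict_one("259", "255") - 4.0)
for _ in range(200):
    model.learn_one("259", "255", 4.0)
last_error = abs(model.predict_one("259", "255") - 4.0)
print(first_error, last_error)
```

Repeating the same rating is only meant to show that the dot product moves toward the target; the L2 penalty keeps it from matching the target exactly.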

In [None]:
funk_mf_params = {
    'n_factors': 10,
    'optimizer': river.optim.SGD(0.05),
    'l2': 0.1,
    'initializer': river.optim.initializers.Normal(mu=0., sigma=0.1, seed=73)
}

model = river.meta.PredClipper(
    regressor = river.reco.FunkMF(**funk_mf_params),
    y_min=1,
    y_max=5
)

evaluate(model)

[25,000] MAE: 1.070136, RMSE: 1.397014 – 0:00:02.411592 – 938.37 KB
[50,000] MAE: 0.99174, RMSE: 1.290666 – 0:00:04.800103 – 1.13 MB
[75,000] MAE: 0.961072, RMSE: 1.250842 – 0:00:07.118442 – 1.33 MB
[100,000] MAE: 0.944883, RMSE: 1.227688 – 0:00:09.463518 – 1.5 MB


In terms of MAE, results are on par with our naive prediction (0.9448 vs 0.9421), and the RMSE is actually worse (1.2277 vs 1.1258). By focusing only on user preferences and item characteristics, the model is limited in its ability to capture other views of the problem. Despite its poor performance alone, this algorithm is quite useful when combined with other models or when we need to build dense representations for other tasks.

### Biased Matrix Factorization (BiasedMF)
This model combines the Baseline model and FunkMF. Some people call it Biased Matrix Factorization, while others refer to it as SVD or Funk SVD. That is the case for Yehuda Koren and Robert Bell in the Recommender Systems Handbook (Chapter 5, Advances in Collaborative Filtering) and for the surprise library. However, SVD could be confused with the original Singular Value Decomposition from which it is derived, and Funk SVD could also be misleading because the bias part of the model equation does not come from Simon Funk's work. For those reasons, we chose the name Biased Matrix Factorization, which fits the model more naturally.
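
Written out, the combination looks as follows. The symbols are a notational assumption introduced here for illustration: $\bar{y}$ is the running mean of the target, $b_u$ and $b_i$ are the user and item biases of the Baseline model, and $\mathbf{p}_u$ and $\mathbf{q}_i$ are the user and item latent vectors of FunkMF.

$$
\hat{y}(x) = \underbrace{\bar{y} + b_u + b_i}_{\text{Baseline}} + \underbrace{\langle \mathbf{p}_u, \mathbf{q}_i \rangle}_{\text{FunkMF}}
$$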

In [None]:
biased_mf_params = {
    'n_factors': 10,
    'bias_optimizer': river.optim.SGD(0.025),
    'latent_optimizer': river.optim.SGD(0.05),
    'weight_initializer': river.optim.initializers.Zeros(),
    'latent_initializer': river.optim.initializers.Normal(mu=0., sigma=0.1, seed=73),
    'l2_bias': 0.,
    'l2_latent': 0.
}

model = river.meta.PredClipper(
    regressor = river.reco.BiasedMF(**biased_mf_params),
    y_min=1,
    y_max=5
)

evaluate(model)

[25,000] MAE: 0.761818, RMSE: 0.961057 – 0:00:02.622680 – 1.01 MB
[50,000] MAE: 0.751667, RMSE: 0.949443 – 0:00:05.232414 – 1.28 MB
[75,000] MAE: 0.749653, RMSE: 0.948723 – 0:00:07.802655 – 1.51 MB
[100,000] MAE: 0.748559, RMSE: 0.947854 – 0:00:10.396654 – 1.69 MB


Results improved slightly (MAE of 0.7485 vs 0.7546 for the baseline), showing that user and item latent representations bring additional information.

## Factorization Machines
Steffen Rendle introduced Factorization Machines in 2010, an algorithm able to handle any real-valued feature vector, combining the advantages of general predictors with those of factorization models. It became quite popular in the field of online advertising, notably after winning several Kaggle competitions. The modeling technique starts with a linear regression to capture the effect of each variable individually.

Interaction terms are then added to learn relations between features. Instead of learning a single, specific weight per interaction (as in polynomial regression), a set of latent factors is learnt per feature (as in MF). An interaction is computed by multiplying the product of the involved feature values by the dot product of their latent vectors. The degree of factorization, or model order, is the maximum number of features per interaction considered.
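
For a model of order 2, the prediction described above can be sketched as an intercept, a linear part, and one pairwise term per couple of active features. The function and the toy weights below are hypothetical and only illustrate the arithmetic; they are not taken from River.

```python
from itertools import combinations


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def fm_predict(x, intercept, weights, latents):
    """Second-order FM prediction for a dict of numeric features."""
    # Linear part: one weight per feature
    linear = sum(weights[j] * xj for j, xj in x.items())
    # Pairwise part: <v_j, v_j'> * x_j * x_j' for every pair of features
    pairwise = sum(
        dot(latents[j1], latents[j2]) * x[j1] * x[j2]
        for j1, j2 in combinations(x, 2)
    )
    return intercept + linear + pairwise


# Toy values chosen by hand for illustration
x = {"user_259": 1, "item_255": 1, "genre_comedy": 0.5}
weights = {"user_259": 0.2, "item_255": -0.1, "genre_comedy": 0.4}
latents = {
    "user_259": [1.0, 0.0],
    "item_255": [0.5, 0.5],
    "genre_comedy": [0.0, 2.0],
}
y_pred = fm_predict(x, 3.0, weights, latents)
print(y_pred)  # 3 + 0.3 (linear) + 1.0 (pairwise), up to float rounding
```

With only the user and item features, the pairwise part reduces to the dot product of a user vector and an item vector, which is how FM can mimic matrix factorization.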

Strong emphasis must be placed on feature engineering, as it allows FM to mimic most factorization models and significantly impacts its performance. One-hot encoding of high-cardinality categorical variables is the most frequent step before feeding the model with data. For more efficiency, river's FM implementation treats string values as categorical variables and automatically one-hot encodes them. FM models have their own module ```river.facto```.
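
The effect of this automatic encoding can be pictured with the sketch below. It assumes a `name_value` key convention for the encoded features, which is an illustrative choice and may differ from what the library does internally; `one_hot_strings` is a hypothetical helper.

```python
def one_hot_strings(x):
    """Turn string values into one-hot features and keep numeric values as-is."""
    encoded = {}
    for name, value in x.items():
        if isinstance(value, str):
            # Categorical: one binary feature per (name, value) pair
            encoded[f"{name}_{value}"] = 1
        else:
            # Numeric: passed through unchanged
            encoded[name] = value
    return encoded


encoded = one_hot_strings({"user": "259", "item": "255", "age": 21.0})
print(encoded)
```

Only the features present in a sample are materialized, so the encoding stays sparse however many users and items appear in the stream.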

### Mimic Biased Matrix Factorization (BiasedMF)

In [None]:
fm_params = {
    'n_factors': 10,
    'weight_optimizer': river.optim.SGD(0.025),
    'latent_optimizer': river.optim.SGD(0.05),
    'sample_normalization': False,
    'l1_weight': 0.,
    'l2_weight': 0.,
    'l1_latent': 0.,
    'l2_latent': 0.,
    'intercept': 3,
    'intercept_lr': .01,
    'weight_initializer': river.optim.initializers.Zeros(),
    'latent_initializer': river.optim.initializers.Normal(mu=0., sigma=0.1, seed=73),
}

regressor = river.compose.Select('user', 'item')
regressor |= river.facto.FMRegressor(**fm_params)

model = river.meta.PredClipper(
    regressor=regressor,
    y_min=1,
    y_max=5
)

evaluate(model)

[25,000] MAE: 0.761778, RMSE: 0.960803 – 0:00:07.251881 – 1.16 MB
[50,000] MAE: 0.751986, RMSE: 0.949941 – 0:00:13.721247 – 1.36 MB
[75,000] MAE: 0.750044, RMSE: 0.948911 – 0:00:20.654001 – 1.58 MB
[100,000] MAE: 0.748609, RMSE: 0.947994 – 0:00:27.407404 – 1.77 MB


Both MAEs are very close (0.7486 vs 0.7485), showing that we almost reproduced the reco.BiasedMF algorithm. The cost is a slower running time (about 27 s vs 10 s here), since the FM implementation offers more flexibility.

### Feature engineering for FM models

In [None]:
def split_genres(x):
    genres = x['genres'].split(', ')
    return {f'genre_{genre}': 1 / len(genres) for genre in genres}
    

def bin_age(x):
    if x['age'] <= 18:
        return {'age_0-18': 1}
    elif x['age'] <= 32:
        return {'age_19-32': 1}
    elif x['age'] < 55:
        return {'age_33-54': 1}
    else:
        return {'age_55-100': 1}

In [None]:
fm_params = {
    'n_factors': 14,
    'weight_optimizer': river.optim.SGD(0.01),
    'latent_optimizer': river.optim.SGD(0.025),
    'intercept': 3,
    'latent_initializer': river.optim.initializers.Normal(mu=0., sigma=0.05, seed=73),
}

regressor = river.compose.Select('user', 'item')
regressor += (
    river.compose.Select('genres') |
    river.compose.FuncTransformer(split_genres)
)
regressor += (
    river.compose.Select('age') |
    river.compose.FuncTransformer(bin_age)
)
regressor |= river.facto.FMRegressor(**fm_params)

model = river.meta.PredClipper(
    regressor=regressor,
    y_min=1,
    y_max=5
)

evaluate(model)

[25,000] MAE: 0.759838, RMSE: 0.961281 – 0:00:15.857599 – 1.43 MB
[50,000] MAE: 0.751307, RMSE: 0.951391 – 0:00:33.467707 – 1.68 MB
[75,000] MAE: 0.750361, RMSE: 0.951393 – 0:00:47.431751 – 1.95 MB
[100,000] MAE: 0.749994, RMSE: 0.951435 – 0:01:01.239416 – 2.2 MB


### Higher-Order Factorization Machines (HOFM)

In [None]:
hofm_params = {
    'degree': 3,
    'n_factors': 12,
    'weight_optimizer': river.optim.SGD(0.01),
    'latent_optimizer': river.optim.SGD(0.025),
    'intercept': 3,
    'latent_initializer': river.optim.initializers.Normal(mu=0., sigma=0.05, seed=73),
}

regressor = river.compose.Select('user', 'item')
regressor += (
    river.compose.Select('genres') |
    river.compose.FuncTransformer(split_genres)
)
regressor += (
    river.compose.Select('age') |
    river.compose.FuncTransformer(bin_age)
)
regressor |= river.facto.HOFMRegressor(**hofm_params)

model = river.meta.PredClipper(
    regressor=regressor,
    y_min=1,
    y_max=5
)

evaluate(model)

[25,000] MAE: 0.761297, RMSE: 0.962054 – 0:01:09.293899 – 2.64 MB
[50,000] MAE: 0.751865, RMSE: 0.951499 – 0:02:18.506399 – 3.12 MB
[75,000] MAE: 0.750853, RMSE: 0.951526 – 0:03:26.322158 – 3.64 MB
[100,000] MAE: 0.750607, RMSE: 0.951982 – 0:04:34.358663 – 4.12 MB


High-order interactions are often hard to estimate because the data is too sparse, so we won't spend much time here.

### Field-aware Factorization Machines (FFM)
The field-aware variant of FM (FFM) improved on the original method by adding the notion of "fields". A "field" is a group of features that belong to a specific domain (e.g. the "users" field, the "items" field, or the "movie genres" field).

FFM restricts itself to pairwise interactions and factorizes separate latent spaces, one per combination of fields (e.g. users/items, users/movie genres, or items/movie genres), instead of a common one shared by all fields. Therefore, each feature has one latent vector per field it can interact with, so that it can learn the specific effect of each different field.

Note that FFM usually needs fewer latent factors than FM, as each latent vector only deals with one field.
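
The pairwise term of FFM can be sketched as follows: when features `j1` and `j2` interact, `j1` uses the latent vector it keeps for the field of `j2`, and vice versa. The explicit `field_of` mapping and the toy values are assumptions made for this illustration; the function is not River's implementation.

```python
from itertools import combinations


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def ffm_predict(x, field_of, intercept, weights, latents):
    """FFM prediction: latents[feature][field] holds one vector per opposing field."""
    linear = sum(weights.get(j, 0.0) * xj for j, xj in x.items())
    pairwise = 0.0
    for j1, j2 in combinations(x, 2):
        v1 = latents[j1][field_of[j2]]  # vector of j1 dedicated to j2's field
        v2 = latents[j2][field_of[j1]]  # vector of j2 dedicated to j1's field
        pairwise += dot(v1, v2) * x[j1] * x[j2]
    return intercept + linear + pairwise


# Toy values chosen by hand for illustration
x = {"user_259": 1, "item_255": 1, "genre_comedy": 0.5}
field_of = {"user_259": "user", "item_255": "item", "genre_comedy": "genre"}
latents = {
    "user_259": {"item": [1.0, 0.0], "genre": [0.0, 1.0]},
    "item_255": {"user": [2.0, 0.0], "genre": [1.0, 1.0]},
    "genre_comedy": {"user": [0.0, 3.0], "item": [1.0, 0.0]},
}
y_pred = ffm_predict(x, field_of, intercept=0.0, weights={}, latents=latents)
print(y_pred)  # 2.0 (user/item) + 1.5 (user/genre) + 0.5 (item/genre) = 4.0
```

The nested `latents` dictionary makes the memory cost visible: every feature stores one vector per field it can interact with.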

In [None]:
ffm_params = {
    'n_factors': 8,
    'weight_optimizer': river.optim.SGD(0.01),
    'latent_optimizer': river.optim.SGD(0.025),
    'intercept': 3,
    'latent_initializer': river.optim.initializers.Normal(mu=0., sigma=0.05, seed=73),
}

regressor = river.compose.Select('user', 'item')
regressor += (
    river.compose.Select('genres') |
    river.compose.FuncTransformer(split_genres)
)
regressor += (
    river.compose.Select('age') |
    river.compose.FuncTransformer(bin_age)
)
regressor |= river.facto.FFMRegressor(**ffm_params)

model = river.meta.PredClipper(
    regressor=regressor,
    y_min=1,
    y_max=5
)

evaluate(model)

[25,000] MAE: 0.757718, RMSE: 0.958158 – 0:00:21.345140 – 3.06 MB
[50,000] MAE: 0.749502, RMSE: 0.948065 – 0:00:42.577609 – 3.62 MB
[75,000] MAE: 0.749275, RMSE: 0.948918 – 0:01:03.900952 – 4.23 MB
[100,000] MAE: 0.749542, RMSE: 0.949769 – 0:01:25.155978 – 4.79 MB


### Field-weighted Factorization Machines (FwFM)
Field-weighted Factorization Machines (FwFM) address the memory issues of FFM caused by its large number of parameters, which is on the order of the number of features times the number of fields. Like FFM, FwFM is an extension of FM restricted to pairwise interactions, but instead of factorizing separate latent spaces, it learns a specific weight for each combination of fields that models the strength of the interaction.
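
In equation form, and with notation introduced here as an assumption ($p$ features, $f_j$ the field of feature $j$, $\mathbf{v}_j$ its single latent vector, and $r_{f_j, f_{j'}}$ the learnt weight of the field pair), the FwFM prediction can be written as:

$$
\hat{y}(x) = w_0 + \sum_{j=1}^{p} w_j x_j + \sum_{j=1}^{p} \sum_{j'=j+1}^{p} r_{f_j, f_{j'}} \, \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}
$$

Setting every $r_{f_j, f_{j'}}$ to 1 recovers the plain second-order FM, which shows that FwFM only adds one scalar per pair of fields on top of FM.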

In [None]:
fwfm_params = {
    'n_factors': 10,
    'weight_optimizer': river.optim.SGD(0.01),
    'latent_optimizer': river.optim.SGD(0.025),
    'intercept': 3,
    'seed': 73,
}

regressor = river.compose.Select('user', 'item')
regressor += (
    river.compose.Select('genres') |
    river.compose.FuncTransformer(split_genres)
)
regressor += (
    river.compose.Select('age') |
    river.compose.FuncTransformer(bin_age)
)
regressor |= river.facto.FwFMRegressor(**fwfm_params)

model = river.meta.PredClipper(
    regressor=regressor,
    y_min=1,
    y_max=5
)

evaluate(model)

[25,000] MAE: 0.761539, RMSE: 0.962241 – 0:00:25.952167 – 1.18 MB
[50,000] MAE: 0.754089, RMSE: 0.953181 – 0:00:52.040548 – 1.38 MB
[75,000] MAE: 0.754806, RMSE: 0.954979 – 0:01:18.107421 – 1.6 MB
[100,000] MAE: 0.755404, RMSE: 0.95604 – 0:01:44.092305 – 1.8 MB
