# Factorization Machines scoring

The goal of a **recommender system** is to predict a user's preference for a set of items based on past interactions. The two most popular approaches are content-based filtering and collaborative filtering.

The goal of this exercise is to compare the SVD and FM algorithms, try different parameter configurations, and explore the results.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from surprise import Dataset, Reader
from surprise import SVD, NMF
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV

import functions as f
import river

This analysis focuses on book recommendations based on the [Book-Crossing dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). To reduce the dimensionality of the dataset and avoid memory errors, it is restricted to users with at least 3 ratings and to the top 10% most frequently rated books. The filtered dataset consists of 176,594 records.
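
The filtering step itself is not shown in this notebook. A minimal sketch of how such a filter might look is given below. The function name `filter_ratings`, its parameters, and the order of the two filters (books first, then users) are assumptions made for illustration, not the preprocessing that actually produced `ratings_top.csv`.

```python
from collections import Counter

def filter_ratings(rows, min_user_ratings=3, top_book_fraction=0.10):
    """Keep ratings of the most frequently rated books, then drop sparse users.

    rows: list of (user_id, isbn, rating) tuples.
    The filter order (books first, then users) is an assumption.
    """
    # Rank books by how often they were rated and keep the top fraction.
    book_counts = Counter(isbn for _, isbn, _ in rows)
    n_keep = max(1, int(len(book_counts) * top_book_fraction))
    top_books = {isbn for isbn, _ in book_counts.most_common(n_keep)}
    rows = [r for r in rows if r[1] in top_books]

    # Keep only users who still have enough ratings after the book filter.
    user_counts = Counter(user for user, _, _ in rows)
    return [r for r in rows if user_counts[r[0]] >= min_user_ratings]

# Toy data: book "a" has 3 ratings, "b" has 2, the others 1 each.
toy_rows = [
    ("u1", "a", 8), ("u1", "b", 7), ("u1", "c", 5),
    ("u2", "a", 9), ("u2", "b", 6),
    ("u3", "a", 4), ("u3", "d", 10),
    ("u4", "e", 3),
]
kept = filter_ratings(toy_rows, min_user_ratings=2, top_book_fraction=0.4)
print(kept)
```

Filtering users first and books second would generally give a different result, so the order matters when reproducing the record count.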

The recommender systems are built using the [surprise package](https://surprise.readthedocs.io/en/stable/getting_started.html) (matrix factorization based models).

In [2]:
df = pd.read_csv('data/ratings_top.csv')

reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df[['user_id', 'isbn', 'book_rating']], reader)

In [3]:
print('Number of ratings: %d\nNumber of books: %d\nNumber of users: %d' % (len(df), len(df['isbn'].unique()), len(df['user_id'].unique())))

Number of ratings: 176594
Number of books: 16766
Number of users: 20155


## Using the River package

In [4]:
from river import datasets, stream
r_df = pd.read_csv('data/ratings_top.csv', header=0, names=['user', 'item', 'rating'])
y = r_df.pop('rating')
X_y = stream.iter_pandas(r_df, y)
cache = stream.Cache()

In [5]:
from river import optim, reco, metrics, preprocessing

def readData():
    """Return (train, test) streams built from the first 20,000 ratings."""
    df = pd.read_csv('data/ratings_top.csv', header=0, names=['user', 'item', 'rating'], nrows=20000)

    # Reproducible 80/20 split: sample the training rows, keep the rest for testing.
    r_df = df.sample(frac=0.8, random_state=10)
    test = df.drop(r_df.index)

    y = r_df.pop('rating')
    y_test = test.pop('rating')
    return stream.iter_pandas(r_df, y), stream.iter_pandas(test, y_test)
books = pd.read_csv('data/books.csv')
books.head(5)
print(books.describe(include='all'))
users = pd.read_csv('data/users.csv')
users


              isbn      book_title      book_author  year_of_publication  \
count       271379          271379           271378        271379.000000   
unique      271379          242154           102027                  NaN   
top     0195153448  Selected Poems  Agatha Christie                  NaN   
freq             1              27              632                  NaN   
mean           NaN             NaN              NaN          1959.756050   
std            NaN             NaN              NaN           258.011363   
min            NaN             NaN              NaN             0.000000   
25%            NaN             NaN              NaN          1989.000000   
50%            NaN             NaN              NaN          1995.000000   
75%            NaN             NaN              NaN          2000.000000   
max            NaN             NaN              NaN          2050.000000   

        publisher  
count      271377  
unique      16806  
top     Harlequin  
freq   

| Unnamed: 0 | user_id | location | age | country |
|---|---|---|---|---|
| 0 | 1 | nyc, new york, usa | | Usa |
| 1 | 2 | stockton, california, usa | 18.0 | Usa |
| 2 | 3 | moscow, yukon territory, russia | | Russia |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 | Portugal |
| 4 | 5 | farnborough, hants, united kingdom | | United Kingdom |
| ... | ... | ... | ... | ... |
| 278853 | 278854 | portland, oregon, usa | | Usa |
| 278854 | 278855 | tacoma, washington, united kingdom | 50.0 | United Kingdom |
| 278855 | 278856 | brampton, ontario, canada | | Canada |
| 278856 | 278857 | knoxville, tennessee, usa | | Usa |


In [6]:
biased_mf_params = {
    'n_factors': 10,
    'bias_optimizer': optim.SGD(0.025),
    'latent_optimizer': optim.SGD(0.05),
    'weight_initializer': optim.initializers.Zeros(),
    'latent_initializer': optim.initializers.Normal(mu=0., sigma=0.1, seed=73),
    'l2_bias': 0.,
    'l2_latent': 0.
}

model = reco.BiasedMF(**biased_mf_params)

metric = metrics.MAE() + metrics.RMSE()
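X_y, _ = readData()  # only the training stream is used in this cell
for x, y in X_y:
    # Progressive validation: predict first, score the prediction, then learn.
    y_pred = model.predict_one(user=x['user'], item=x['item'])
    metric.update(y_pred=y_pred, y_true=y)
    model.learn_one(user=x['user'], item=x['item'], y=y)
print(metric)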

MAE: 1.353959, RMSE: 1.73113
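
The loop above follows a predict-then-learn pattern (often called progressive validation): each sample is scored before the model is allowed to learn from it, so the running metric is always computed on unseen data. The sketch below shows the same pattern with a deliberately trivial model. `RunningMean` and `progressive_validation` are hypothetical names introduced here for illustration and are not part of River.

```python
import math

class RunningMean:
    """Toy model that always predicts the mean of the targets seen so far."""

    def __init__(self, default=5.0):
        self.default = default  # prediction used before any sample is seen
        self.total = 0.0
        self.n = 0

    def predict_one(self):
        return self.total / self.n if self.n else self.default

    def learn_one(self, y):
        self.total += y
        self.n += 1

def progressive_validation(stream, model):
    """Return (MAE, RMSE), scoring each sample before the model learns from it."""
    abs_err = sq_err = 0.0
    n = 0
    for _, y in stream:
        y_pred = model.predict_one()   # 1. predict on the unseen sample
        abs_err += abs(y - y_pred)     # 2. update the running errors
        sq_err += (y - y_pred) ** 2
        model.learn_one(y)             # 3. only now learn from it
        n += 1
    return abs_err / n, math.sqrt(sq_err / n)

toy_stream = [({}, 4), ({}, 6), ({}, 5)]
mae, rmse = progressive_validation(toy_stream, RunningMean(default=5.0))
print(mae, rmse)
```

Because early predictions come from an almost untrained model, the running error starts high and decreases as more samples are processed.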


In [8]:
from river import compose, facto
fwfm_params = {
    'n_factors': 10,
    'weight_optimizer': optim.SGD(0.01),
    'latent_optimizer': optim.SGD(0.025),
    'intercept': 3,
    'seed': 73,
}
hofm_params = {
    'degree': 3,
    'n_factors': 12,
    'weight_optimizer': optim.SGD(0.01),
    'latent_optimizer': optim.SGD(0.025),
    'intercept': 3,
    'latent_initializer': optim.initializers.Normal(mu=0., sigma=0.05, seed=73),
}
ffm_params = {
    'n_factors': 8,
    'weight_optimizer': optim.SGD(0.01),
    'latent_optimizer': optim.SGD(0.025),
    'intercept': 3,
    'latent_initializer': optim.initializers.Normal(mu=0., sigma=0.05, seed=73),
}

def split_publish(x):
    if x['year_of_publication'] <= 1990:
        return {'publication_1990' : 1}
    elif x['year_of_publication'] <= 1995:
        return {'publication_1995' : 1}
    elif x['year_of_publication'] <= 2000:
        return {'publication_2000' : 1}
    else:
        return {'publication_new': 1}

def bin_ages(x):
    if pd.isnull(x['age']):
        return {'age_isna': 1}
    elif x['age'] <= 18:
        return {'age_0-18': 1}
    elif x['age'] <= 32:
        return {'age_19-32': 1}
    elif x['age'] < 55:
        return {'age_33-54': 1}
    else:
        return {'age_55-100': 1}

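# Cast the user id to a string so the factorization machine treats it as a categorical feature.
regressor = compose.Select('user') | compose.FuncTransformer(lambda x: {'user': str(x['user'])})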
regressor += (
    compose.Select('item')
)
regressor |= facto.FwFMRegressor(**fwfm_params)

model = preprocessing.PredClipper(
    regressor=regressor,
    y_min=1,
    y_max=10
)


metric = metrics.MAE() + metrics.RMSE()
X_y, X_y_test = readData()
cnt = 0
for x, y in X_y:
    user = users.loc[users['user_id'] == x['user']]
    book = books.loc[books['isbn'] == x['item']]
    
    if user.empty or book.empty:
        continue
    # Retrieve single values using .at or .iat
    user_age = users.at[user.index[0], 'age']
    book_year = books.at[book.index[0],'year_of_publication']

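    # Attach side information to the sample. The pipeline defined above selects only
    # 'user' and 'item', so these fields are currently ignored by the model.
    x['age'] = user_age
    x['year_of_publication'] = book_year
    x['country'] = users.at[user.index[0], 'country']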
    # y_pred = model.predict_one(user=x['user'], item=x['item'], age=user_age, year_of_publication= book_year)

    y_pred = model.predict_one(x=x)
    _ = model.learn_one(x=x, y=y)
    
    metric.update(y_pred=y_pred, y_true=y) 
    if cnt % 100 == 0:
        print(cnt, metric)
    cnt += 1
print(metric)
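# Evaluate on the held-out stream with a fresh metric. The model keeps learning after
# each prediction (progressive validation), and cnt continues from the training loop.
metric = metrics.MAE() + metrics.RMSE()
for x, y in X_y_test: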
    user = users.loc[users['user_id'] == x['user']]
    book = books.loc[books['isbn'] == x['item']]
    
    if user.empty or book.empty:
        continue
    # Retrieve single values using .at or .iat
    user_age = users.at[user.index[0], 'age']
    book_year = books.at[book.index[0],'year_of_publication']

    x['age'] = user_age
    x['year_of_publication'] = book_year
    x['country'] = users.at[user.index[0], 'country']
    # y_pred = model.predict_one(user=x['user'], item=x['item'], age=user_age, year_of_publication= book_year)

    y_pred = model.predict_one(x=x)
    _ = model.learn_one(x=x, y=y)
    
    metric.update(y_pred=y_pred, y_true=y) 
    if cnt % 100 == 0:
        print(cnt, metric)
    cnt += 1
print(metric)

0 MAE: 7.019074, RMSE: 7.019074
100 MAE: 2.322432, RMSE: 2.813332
200 MAE: 1.865252, RMSE: 2.329316
300 MAE: 1.743964, RMSE: 2.170838
400 MAE: 1.651162, RMSE: 2.044683
500 MAE: 1.587331, RMSE: 1.998043
600 MAE: 1.544076, RMSE: 1.951294
700 MAE: 1.510805, RMSE: 1.922031
800 MAE: 1.491343, RMSE: 1.887875
900 MAE: 1.47202, RMSE: 1.868211
1000 MAE: 1.471308, RMSE: 1.864726
1100 MAE: 1.464081, RMSE: 1.850943
1200 MAE: 1.465554, RMSE: 1.840771
1300 MAE: 1.456654, RMSE: 1.825988
1400 MAE: 1.443485, RMSE: 1.819782
1500 MAE: 1.453949, RMSE: 1.830937
1600 MAE: 1.465473, RMSE: 1.842417
1700 MAE: 1.454585, RMSE: 1.831601
1800 MAE: 1.457244, RMSE: 1.83371
1900 MAE: 1.454484, RMSE: 1.830342
2000 MAE: 1.44307, RMSE: 1.816977
2100 MAE: 1.434866, RMSE: 1.810011
2200 MAE: 1.433056, RMSE: 1.808788
2300 MAE: 1.435019, RMSE: 1.808753
2400 MAE: 1.434857, RMSE: 1.812227
2500 MAE: 1.433437, RMSE: 1.809254
2600 MAE: 1.432461, RMSE: 1.806329
2700 MAE: 1.429058, RMSE: 1.801649
2800 MAE: 1.421238, RMSE: 1.796065
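
For reference, a field-weighted factorization machine such as the `FwFMRegressor` used above is usually written as follows. The symbols are notation introduced here: $w_0$ is the intercept, $w_j$ the linear weights, $\mathbf{v}_j$ the latent vector of feature $j$, $f_j$ the field that feature $j$ belongs to, and $r_{f_j, f_{j'}}$ the learned strength of the interaction between two fields.

$$
\hat{y}(x) = w_0 + \sum_{j=1}^{p} w_j x_j + \sum_{j=1}^{p} \sum_{j'=j+1}^{p} r_{f_j, f_{j'}} \, \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}
$$

With only the one-hot encoded `user` and `item` features, the double sum reduces to a single term, $r_{\text{user},\text{item}} \langle \mathbf{v}_u, \mathbf{v}_i \rangle$, so the model is closely related to the biased matrix factorization trained earlier. Additional fields such as age or publication year would add further pairwise terms.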


## SVD and NMF models comparison

Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) are matrix factorization techniques used for dimensionality reduction. The Surprise package provides implementations of both algorithms.

For the given dataset, SVD gives much better results in the 3-fold cross-validation below: it is more accurate (RMSE 1.60 vs 2.63, MAE 1.24 vs 2.24) and trains about three times faster (1.6 s vs 4.8 s per fold), while prediction time is about the same for both models.
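
To make the SVD model less of a black box, here is a small pure-Python sketch of a biased matrix factorization trained with SGD, where a rating is predicted as the global mean plus a user bias, an item bias, and the dot product of user and item factors. The parameter names mirror the `SVD` arguments discussed in this notebook, but the functions (`train_biased_mf`, `predict`, `rmse`) and the toy data are assumptions for illustration, not Surprise's actual implementation.

```python
import math
import random

def train_biased_mf(ratings, n_factors=2, n_epochs=20, lr_all=0.005, reg_all=0.02, seed=0):
    """Fit the global mean, user/item biases and latent factors with plain SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu, bi, pu, qi = {}, {}, {}, {}
    for u, i, _ in ratings:
        if u not in pu:
            bu[u] = 0.0
            pu[u] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
        if i not in qi:
            bi[i] = 0.0
            qi[i] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict((mu, bu, bi, pu, qi), u, i)
            # Gradient step on the regularized squared error.
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr_all * (err * q - reg_all * p)
                qi[i][f] += lr_all * (err * p - reg_all * q)
    return mu, bu, bi, pu, qi

def predict(model, u, i):
    """Global mean + biases + factor dot product; unknown ids fall back to the mean."""
    mu, bu, bi, pu, qi = model
    dot = sum(p * q for p, q in zip(pu[u], qi[i])) if u in pu and i in qi else 0.0
    return mu + bu.get(u, 0.0) + bi.get(i, 0.0) + dot

def rmse(model, ratings):
    return math.sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

toy = [
    ("u1", "a", 9), ("u1", "b", 3), ("u2", "a", 8), ("u2", "c", 5), ("u3", "b", 2),
    ("u3", "c", 6), ("u4", "a", 10), ("u4", "b", 4), ("u4", "c", 6),
]
baseline = train_biased_mf(toy, n_epochs=0)               # untrained: roughly the global mean
fitted = train_biased_mf(toy, n_epochs=100, lr_all=0.01)
print(rmse(baseline, toy), rmse(fitted, toy))
```

The comparison is on the training data only, so it shows that the updates reduce the error, not how well the model generalizes.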

In [79]:
model_svd = SVD()
cv_results_svd = cross_validate(model_svd, data, cv=3)
pd.DataFrame(cv_results_svd).mean()

test_rmse    1.603640
test_mae     1.240969
fit_time     1.578933
test_time    0.464496
dtype: float64

In [13]:
model_nmf = NMF()
cv_results_nmf = cross_validate(model_nmf, data, cv=3)
pd.DataFrame(cv_results_nmf).mean()

test_rmse    2.625657
test_mae     2.242587
fit_time     4.788131
test_time    0.451066
dtype: float64

## Optimisation of SVD algorithm

Grid Search Cross Validation computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. It's useful for finding the best configuration of parameters.

Here it is used to tune the following SVD parameters:
* n_factors - the number of factors
* n_epochs - the number of iterations of the SGD procedure
* lr_all - the learning rate for all parameters
* reg_all - the regularization term for all parameters
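
The mechanics of grid search with cross-validation can be sketched in a few lines of plain Python. The parameter grid is a dict of lists, which is also the format Surprise's `GridSearchCV` expects. Everything else here is an assumption for illustration: the toy model (item means shrunk towards the global mean), its parameters `damping` and `min_count`, and the synthetic ratings.

```python
import itertools
import math
import random
from collections import defaultdict

def k_fold(rows, k, seed=0):
    """Yield (train, test) splits of a shuffled copy of `rows`."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    for i in range(k):
        yield [r for j, r in enumerate(rows) if j % k != i], rows[i::k]

def fit_predict(train, test, damping, min_count):
    """Toy model: item offsets from the global mean, shrunk by `damping`."""
    mu = sum(r for _, _, r in train) / len(train)
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, r in train:
        sums[item] += r - mu
        counts[item] += 1
    preds = []
    for _, item, _ in test:
        n = counts.get(item, 0)
        # Items with too few training ratings fall back to the global mean.
        preds.append(mu + (sums[item] / (n + damping) if n >= min_count else 0.0))
    return preds

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def grid_search(rows, param_grid, k=3):
    """Evaluate every parameter combination with k-fold CV and return the best."""
    names = sorted(param_grid)
    results = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = [rmse([r for _, _, r in test], fit_predict(train, test, **params))
                  for train, test in k_fold(rows, k)]
        results.append((params, sum(scores) / len(scores)))
    best_params, best_score = min(results, key=lambda t: t[1])
    return best_params, best_score, results

# Synthetic ratings on a 1-10 scale with a mild item effect.
rng = random.Random(42)
ratings = [(u, i, min(10, max(1, round(5 + (i % 3) + rng.gauss(0, 1.5)))))
           for u in range(30) for i in range(6) if rng.random() < 0.6]
best_params, best_score, results = grid_search(
    ratings, {"damping": [0.5, 2.0, 10.0], "min_count": [1, 3]})
print(best_params, round(best_score, 3))
```

The cost grows with the product of the list lengths times the number of folds, which is why grids over `n_factors`, `n_epochs`, `lr_all` and `reg_all` are usually kept small.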

For most parameters, the default settings turn out to be close to optimal, and the improvement obtained with grid search is very small.