<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Factorization Machine Deep Dive

The factorization machine (FM) is one of the representative algorithms used for building content-based recommender models. The algorithm is powerful at capturing high-order interactions between input features. In addition, it offers better generalization capability and expressiveness than other classic algorithms such as SVM. Recent research extends the basic FM algorithm with deep learning techniques, which achieve remarkable improvements in a few practical use cases.

This notebook presents a deep dive into the Factorization Machine algorithm and its variants, and demonstrates some best practices for using contemporary FM implementations such as [`xlearn`](https://github.com/aksnzhy/xlearn) and [`xDeepFM`](https://github.com/microsoft/recommenders) on tasks like click-through rate prediction with the Criteo dataset.

## 1 Factorization Machine And Its Extensions
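
As a minimal sketch of how an FM captures pairwise feature interactions, the snippet below scores one example with the standard second-order FM formulation, in which the weight of each feature pair is the dot product of two latent vectors. The function names `fm_score` and `fm_score_naive` and the toy parameter values are illustrative assumptions and are not taken from `xlearn` or `xDeepFM`.

```python
def fm_score(x, w0, w, V):
    """Second-order FM score for one example.

    x  : list of feature values (length n)
    w0 : global bias
    w  : list of linear weights (length n)
    V  : list of n latent vectors, each of length k
    """
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    # Pairwise term via the O(n * k) identity:
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_score_naive(x, w0, w, V):
    """Same score computed with the explicit double loop, for checking."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score


# Toy example: three features, latent dimension k = 2 (made-up values)
x = [1.0, 0.0, 1.0]
w0, w = 0.1, [0.2, -0.3, 0.4]
V = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(fm_score(x, w0, w, V), fm_score_naive(x, w0, w, V))
```

Both functions compute the same quantity; the first avoids the explicit loop over feature pairs, which matters when the feature vector is large and sparse.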

## 2 Factorization Machine Implementation

### 2.1 xlearn

Setup for using `xlearn`:

1. `xlearn` is implemented in C++ and has Python bindings, so it can be installed directly as a Python package from PyPI. The [Recommenders repo environment setup script](https://github.com/microsoft/recommenders/blob/master/scripts/generate_conda_file.py) already includes `xlearn`, so following the general setup steps installs it along with the rest of the environment.
2. Note that `xlearn` may require some base libraries, e.g., `cmake`, to be installed on the system as prerequisites.

After the environment has been created successfully, one can load the packages to run `xlearn` in a Jupyter notebook or Python script.

In [22]:
import time
import sys
sys.path.append("../../")
import os
import papermill as pm
from tempfile import TemporaryDirectory
import xlearn as xl
import tensorflow as tf
from sklearn.metrics import roc_auc_score
import numpy as np

from reco_utils.common.constants import SEED
from reco_utils.recommender.deeprec.deeprec_utils import (
    download_deeprec_resources, prepare_hparams
)
from reco_utils.recommender.deeprec.models.xDeepFM import XDeepFMModel
from reco_utils.recommender.deeprec.IO.iterator import FFMTextIterator

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
Tensorflow version: 1.12.0


To illustrate the use of `xlearn`, the following example uses a synthesized data set from Bing News to build and evaluate a field-aware factorization machine (FFM) model with `xlearn`.
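
The data files used below are assumed to be in the `ffm` (libffm) text format, with one example per line: a label followed by space-separated `field:feature:value` triples. The helper `parse_ffm_line` and the sample line are hypothetical and only illustrate that layout; they are not part of `xlearn` or `reco_utils`, and the sample values are not taken from the data set.

```python
def parse_ffm_line(line):
    """Split one libffm-style line into (label, [(field, feature, value), ...])."""
    tokens = line.strip().split(" ")
    label = float(tokens[0])
    triples = []
    for token in tokens[1:]:
        if not token:  # skip empty tokens caused by repeated spaces
            continue
        field, feature, value = token.split(":")
        triples.append((int(field), int(feature), float(value)))
    return label, triples


# Made-up example line: label 1, three active features in fields 1, 2 and 3
sample = "1 1:3:1.0 2:17:0.5 3:42:1.0\n"
label, triples = parse_ffm_line(sample)
print(label, triples)
```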

In [11]:
tmpdir = TemporaryDirectory()

data_path = tmpdir.name
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
train_file = os.path.join(data_path, r'synthetic_part_0')
valid_file = os.path.join(data_path, r'synthetic_part_1')
test_file = os.path.join(data_path, r'synthetic_part_2')
model_file = os.path.join(data_path, r'model.out')
output_file = os.path.join(data_path, r'output.txt')

if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')

100%|██████████| 10.3k/10.3k [00:05<00:00, 2.06kKB/s]


The following steps are taken from the [official documentation of `xlearn`](https://xlearn-doc.readthedocs.io/en/latest/index.html) for building a model. To begin with, we do not modify any of the training parameter values.

NOTE: if `xlearn` is run from the command line, the training progress is displayed in the console.

In [12]:
# Training task
ffm_model = xl.create_ffm()        # Use field-aware factorization machine (ffm)
ffm_model.setTrain(train_file)     # Set the path of training dataset
ffm_model.setValidate(valid_file)  # Set the path of validation dataset

# Parameters:
#  0. task: binary classification
#  1. learning rate: 0.2
#  2. regularization lambda: 0.002
#  3. evaluation metric: auc
param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric':'auc'}

# Start to train
# The trained model will be stored in model.out

time_start = time.time()

ffm_model.fit(param, model_file)

time_train = time.time() - time_start

# Prediction task
ffm_model.setTest(test_file)  # Set the path of test dataset
ffm_model.setSigmoid()        # Convert output to 0-1

# Start to predict
# The output result will be stored in output.txt
time_start = time.time()

ffm_model.predict(model_file, output_file)

time_predict = time.time() - time_start

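The output file contains one predicted score per example in the test data set. Because `setSigmoid()` is enabled, each score lies between 0 and 1 rather than being a hard 0/1 label. The AUC score is calculated to evaluate the model performance.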

In [13]:
with open(output_file) as f:
    predictions = f.readlines()

with open(test_file) as f:
    truths = f.readlines()

truths = np.array([float(truth.split(' ')[0]) for truth in truths])
predictions = np.array([float(prediction.strip()) for prediction in predictions])

auc_score = roc_auc_score(truths, predictions)

In [6]:
auc_score

0.8680859736493857
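
To make the metric concrete, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all positive/negative pairs. `pairwise_auc` and the toy data are illustrative assumptions; this is not how `roc_auc_score` is implemented, although the two should agree on the same inputs.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, not taken from the notebook's data
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.2]
print(pairwise_auc(toy_labels, toy_scores))
```

The double loop is quadratic in the number of examples, so it is only suitable for small illustrations; it is shown here because it matches the definition directly.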

In [7]:
print('Training takes {0:.2f}s and predicting takes {1:.2f}s.'.format(time_train, time_predict))

Training takes 0.40s and predicting takes 0.03s.


It can be seen that model building and scoring are fast (0.40s for training and 0.03s for prediction) and that the model reaches an AUC of about 0.87 on the test set.

### 2.2 xDeepFM

In [21]:
EPOCHS_FOR_SYNTHETIC_RUN = 15
BATCH_SIZE_SYNTHETIC = 128
RANDOM_SEED = SEED  # Set None for non-deterministic result

In [24]:
hparams = prepare_hparams(
    yaml_file,     
    FEATURE_COUNT=1000, 
    FIELD_COUNT=10, 
    cross_l2=0.0001, 
    embed_l2=0.0001, 
    learning_rate=0.001, 
    epochs=EPOCHS_FOR_SYNTHETIC_RUN,
    use_FM_part=True,
    use_CIN_part=False,
    use_DNN_part=False,
    batch_size=BATCH_SIZE_SYNTHETIC
)

print(hparams)

[('DNN_FIELD_NUM', None), ('FEATURE_COUNT', 1000), ('FIELD_COUNT', 10), ('MODEL_DIR', None), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('activation', ['relu', 'relu']), ('attention_activation', None), ('attention_dropout', 0.0), ('attention_layer_sizes', None), ('batch_size', 128), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0001), ('cross_layer_sizes', [1]), ('cross_layers', None), ('data_format', 'ffm'), ('dim', 10), ('doc_size', None), ('dropout', [0.0, 0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0001), ('enable_BN', False), ('entityEmb_file', None), ('entity_dim', None), ('entity_embedding_method', None), ('entity_size', None), ('epochs', 15), ('fast_CIN_d', 0), ('filter_sizes', None), ('init_method', 'tnormal'), ('init_value', 0.3), ('is_clip_norm', 0), ('iterator_type', None), ('kg_file', None), ('kg_training_interval', 5), ('layer_l1', 0.0), ('layer_l2', 0.0001), ('layer_sizes', [100, 100]), ('learning_rate', 0.001), ('load_model_name', 'yo

In [26]:
input_creator = FFMTextIterator
model = XDeepFMModel(hparams, input_creator, seed=RANDOM_SEED)

Add FM part.
Instructions for updating:
keep_dims is deprecated, use keepdims instead


In [27]:
model.fit(train_file, valid_file)

at epoch 1 train info: auc:0.5308, logloss:0.809 eval info: auc:0.5019, logloss:0.8444
at epoch 1 , train time: 2.8 eval time: 3.5
at epoch 2 train info: auc:0.5579, logloss:0.7493 eval info: auc:0.5043, logloss:0.8024
at epoch 2 , train time: 2.6 eval time: 3.3
at epoch 3 train info: auc:0.5863, logloss:0.7104 eval info: auc:0.5067, logloss:0.7764
at epoch 3 , train time: 2.6 eval time: 3.3
at epoch 4 train info: auc:0.6151, logloss:0.6832 eval info: auc:0.5095, logloss:0.7595
at epoch 4 , train time: 2.6 eval time: 3.4
at epoch 5 train info: auc:0.6434, logloss:0.6627 eval info: auc:0.5132, logloss:0.7481
at epoch 5 , train time: 2.6 eval time: 3.2
at epoch 6 train info: auc:0.6705, logloss:0.6461 eval info: auc:0.5178, logloss:0.7402
at epoch 6 , train time: 2.6 eval time: 3.3
at epoch 7 train info: auc:0.6959, logloss:0.6317 eval info: auc:0.5235, logloss:0.7343
at epoch 7 , train time: 2.6 eval time: 3.3
at epoch 8 train info: auc:0.7193, logloss:0.6183 eval info: auc:0.5305, logl

<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x7fad5c18a8d0>

In [28]:
res_syn = model.run_eval(test_file)
print(res_syn)

pm.record("res_syn", res_syn)

{'auc': 0.6202, 'logloss': 0.6925}
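
The `logloss` value reported alongside AUC is the binary cross-entropy averaged over examples. The snippet below is a minimal sketch assuming the usual definition with clipped probabilities; `binary_logloss` and the toy inputs are illustrative and are not the `reco_utils` implementation. As a reference point, a model that predicts 0.5 for every example has a log loss of ln 2, roughly 0.6931.

```python
import math


def binary_logloss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Made-up labels and predictions for illustration only
labels = [1, 0, 1, 0]
baseline = binary_logloss(labels, [0.5] * 4)             # constant 0.5 prediction
confident = binary_logloss(labels, [0.9, 0.1, 0.8, 0.2])  # mostly correct scores
print(baseline, confident)
```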


## 3 FM on Criteo data set

## References