<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# xDeepFM: the eXtreme Deep Factorization Machine
This notebook will give you a quick example of how to train an [xDeepFM model](https://arxiv.org/abs/1803.05170). 
xDeepFM \[1\] is a deep learning-based model that aims to capture both lower- and higher-order feature interactions for precise recommender systems. It can therefore learn feature interactions more effectively and substantially reduce manual feature engineering effort. To summarize, xDeepFM has the following key properties:
* It contains a component, named CIN (Compressed Interaction Network), that learns feature interactions explicitly and at the vector-wise level.
* It contains a traditional DNN component that learns feature interactions implicitly and at the bit-wise level.
* The implementation makes the model quite configurable. We can enable different subsets of components by setting hyperparameters such as `use_Linear_part`, `use_FM_part`, `use_CIN_part`, and `use_DNN_part`. For example, enabling only `use_Linear_part` and `use_FM_part` yields a classical FM model.
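
To make the phrase "vector-wise" more concrete, here is a minimal pure-Python sketch of a single CIN layer following the definition in the xDeepFM paper \[1\]: each output feature map is a weighted sum of element-wise (Hadamard) products between the previous layer's vectors and the original field embeddings. The function name `cin_layer` and the toy numbers are illustrative assumptions; this is not the library's TensorFlow implementation.

```python
def cin_layer(x_prev, x0, weights):
    """Compute one CIN layer (illustrative sketch, not the library code).

    x_prev:  H_prev feature maps from the previous layer, each a list of D floats
    x0:      m field embeddings, each a list of D floats
    weights: H_next matrices, each of shape H_prev x m
    returns: H_next feature maps, each a list of D floats
    """
    dim = len(x0[0])
    out = []
    for w in weights:
        feature_map = [0.0] * dim
        for i, prev_vec in enumerate(x_prev):
            for j, field_vec in enumerate(x0):
                # One scalar weight per pair of vectors; the product itself
                # is taken element by element along the embedding dimension.
                for d in range(dim):
                    feature_map[d] += w[i][j] * prev_vec[d] * field_vec[d]
        out.append(feature_map)
    return out


# Toy input: m = 2 fields with embedding dimension D = 2 (made-up numbers)
x0 = [[1.0, 2.0], [3.0, 4.0]]
# One output feature map that only looks at the pair (field 0, field 1)
weights = [[[0.0, 1.0], [0.0, 0.0]]]
first_layer = cin_layer(x0, x0, weights)
print(first_layer)  # [[3.0, 8.0]]
```

Each weight scales the product of a whole pair of embedding vectors rather than individual embedding entries, which is what separates CIN from the bit-wise DNN component.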

In this notebook, we test xDeepFM on two datasets: 1) a small synthetic dataset and 2) the [Criteo dataset](http://labs.criteo.com/category/dataset).

## 0. Global Settings and Imports

In [1]:
import sys
import os
import scrapbook as sb
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from reco_utils.common.constants import SEED
from reco_utils.recommender.deeprec.deeprec_utils import (
    download_deeprec_resources, prepare_hparams
)
from reco_utils.recommender.deeprec.models.xDeepFM import XDeepFMModel
from reco_utils.recommender.deeprec.io.iterator import FFMTextIterator

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))


System version: 3.6.11 | packaged by conda-forge | (default, Aug  5 2020, 20:09:42) 
[GCC 7.5.0]
Tensorflow version: 1.15.2


#### Parameters

In [2]:
EPOCHS_FOR_SYNTHETIC_RUN = 15
EPOCHS_FOR_CRITEO_RUN = 10
BATCH_SIZE_SYNTHETIC = 128
BATCH_SIZE_CRITEO = 4096
RANDOM_SEED = SEED  # Set to None for non-deterministic result


xDeepFM uses the FFM format as data input: `<label> <field_id>:<feature_id>:<feature_value>`  
Each line represents an instance. `<label>` is a binary value, with 1 meaning a positive instance and 0 meaning a negative instance. 
Features are divided into fields. For example, a user's gender is a field with three possible values: male, female, and unknown. Occupation can be another field, which contains many more possible values than the gender field. Both the field index and the feature index start from 1. <br>
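
As an illustration of this format, the sketch below parses one FFM line into a label and a list of `(field_id, feature_id, feature_value)` triples. The helper `parse_ffm_line` and the sample line are hypothetical and only meant to show the layout; the notebook itself relies on the library's iterator to read these files.

```python
from typing import List, NamedTuple, Tuple


class FFMInstance(NamedTuple):
    label: int
    features: List[Tuple[int, int, float]]  # (field_id, feature_id, value)


def parse_ffm_line(line: str) -> FFMInstance:
    """Parse '<label> <field_id>:<feature_id>:<feature_value> ...'."""
    tokens = line.split()
    label = int(tokens[0])
    features = []
    for token in tokens[1:]:
        field_id, feature_id, value = token.split(":")
        features.append((int(field_id), int(feature_id), float(value)))
    return FFMInstance(label, features)


# Made-up sample line: a positive instance with two active features
example = parse_ffm_line("1 1:2:1.0 2:7:0.5")
print(example.label)     # 1
print(example.features)  # [(1, 2, 1.0), (2, 7, 0.5)]
```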

## 1. Synthetic data
Now let's start with a small synthetic dataset. In this dataset, there are 10 fields and 1000 features, and the label is generated according to the result of a set of preset pair-wise feature interactions. 

In [3]:
tmpdir = TemporaryDirectory()
data_path = tmpdir.name
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
train_file = os.path.join(data_path, r'synthetic_part_0')
valid_file = os.path.join(data_path, r'synthetic_part_1')
test_file = os.path.join(data_path, r'synthetic_part_2')
output_file = os.path.join(data_path, r'output.txt')

if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')


100%|██████████| 10.3k/10.3k [00:01<00:00, 5.24kKB/s]


#### 1.1 Prepare hyper-parameters
`prepare_hparams()` creates a full set of hyper-parameters for model training, such as learning rate, feature number, and dropout ratio. We can put those parameters in a yaml file, or pass them as keyword arguments to the function (which will override the yaml settings).

In [4]:
hparams = prepare_hparams(yaml_file, 
                          FEATURE_COUNT=1000, 
                          FIELD_COUNT=10, 
                          cross_l2=0.0001, 
                          embed_l2=0.0001, 
                          learning_rate=0.001, 
                          epochs=EPOCHS_FOR_SYNTHETIC_RUN,
                          batch_size=BATCH_SIZE_SYNTHETIC)
print(hparams)

kg_file=None,user_clicks=None,FEATURE_COUNT=1000,FIELD_COUNT=10,data_format=ffm,PAIR_NUM=None,DNN_FIELD_NUM=None,n_user=None,n_item=None,n_user_attr=None,n_item_attr=None,iterator_type=None,SUMMARIES_DIR=None,MODEL_DIR=None,wordEmb_file=None,entityEmb_file=None,contextEmb_file=None,news_feature_file=None,user_history_file=None,use_entity=True,use_context=True,doc_size=None,history_size=None,word_size=None,entity_size=None,entity_dim=None,entity_embedding_method=None,transform=None,train_ratio=None,dim=10,layer_sizes=[100, 100],cross_layer_sizes=[1],cross_layers=None,activation=['relu', 'relu'],cross_activation=identity,user_dropout=False,dropout=[0.0, 0.0],attention_layer_sizes=None,attention_activation=None,attention_dropout=0.0,model_type=xDeepFM,method=classification,load_saved_model=False,load_model_name=you model path,filter_sizes=None,num_filters=None,mu=None,fast_CIN_d=0,use_Linear_part=False,use_FM_part=False,use_CIN_part=True,use_DNN_part=False,init_method=tnormal,init_value=0
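
The way keyword arguments take precedence over yaml settings can be pictured as a plain dictionary merge. The snippet below is only a mental model with made-up values; `merge_settings` is a hypothetical helper and not how `prepare_hparams` is implemented.

```python
def merge_settings(file_settings, **overrides):
    """Return a new dict in which keyword overrides replace file values."""
    merged = dict(file_settings)  # copy so the file settings stay untouched
    merged.update(overrides)
    return merged


# Made-up stand-in for values that would be read from a yaml file
file_settings = {"learning_rate": 0.01, "batch_size": 64, "epochs": 5}
merged = merge_settings(file_settings, learning_rate=0.001, epochs=15)
print(merged)  # {'learning_rate': 0.001, 'batch_size': 64, 'epochs': 15}
```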

#### 1.2 Create data loader
Designate a data iterator for the model. xDeepFM uses FFMTextIterator. 

In [5]:
input_creator = FFMTextIterator

#### 1.3 Create model
When both hyper-parameters and data iterator are ready, we can create a model:

In [6]:
model = XDeepFMModel(hparams, input_creator, seed=RANDOM_SEED)

## sometimes we don't want to train a model from scratch
## then we can load a pre-trained model like this: 
#model.load_model(r'your_model_path')

Add CIN part.


Now let's see how the model performs at this point (before any training):

In [7]:
print(model.run_eval(test_file))

{'auc': 0.5043, 'logloss': 0.7515}


An AUC of 0.5 corresponds to random guessing. We can see that before training, the model behaves like random guessing.
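
To see why 0.5 is the baseline, the following sketch computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, counting ties as half, together with the average log loss. The functions and toy labels are illustrative assumptions written in plain Python; they are not the evaluation code used by `run_eval`.

```python
import math


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


def logloss(labels, scores, eps=1e-15):
    """Average binary cross-entropy, with scores clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, scores):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


labels = [1, 0, 1, 0]              # made-up ground truth
uninformative = [0.5, 0.5, 0.5, 0.5]  # a model that cannot tell classes apart
well_ranked = [0.9, 0.1, 0.8, 0.2]    # every positive above every negative

print(auc(labels, uninformative))                # 0.5
print(round(logloss(labels, uninformative), 4))  # 0.6931
print(auc(labels, well_ranked))                  # 1.0
```

A model that always predicts 0.5 has a log loss of ln 2 (about 0.693), which is a useful reference point when reading the log loss values reported in this notebook.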

#### 1.4 Train model
Next we want to train the model on a training set, and check the performance on a validation dataset. Training the model is as simple as a function call:

In [8]:
model.fit(train_file, valid_file)

at epoch 1
train info: logloss loss:0.7556826068773302
eval info: auc:0.504, logloss:0.7042
at epoch 1 , train time: 4.3 eval time: 0.6
at epoch 2
train info: logloss loss:0.7263523231666932
eval info: auc:0.5066, logloss:0.6973
at epoch 2 , train time: 4.3 eval time: 0.8
at epoch 3
train info: logloss loss:0.7177084291104189
eval info: auc:0.5099, logloss:0.6953
at epoch 3 , train time: 3.8 eval time: 0.7
at epoch 4
train info: logloss loss:0.7118660186983875
eval info: auc:0.5147, logloss:0.6946
at epoch 4 , train time: 3.7 eval time: 0.8
at epoch 5
train info: logloss loss:0.7055103289302682
eval info: auc:0.523, logloss:0.6941
at epoch 5 , train time: 3.8 eval time: 0.7
at epoch 6
train info: logloss loss:0.6954095556154284
eval info: auc:0.5416, logloss:0.6929
at epoch 6 , train time: 3.6 eval time: 0.8
at epoch 7
train info: logloss loss:0.6723950118133702
eval info: auc:0.5916, logloss:0.6831
at epoch 7 , train time: 4.0 eval time: 0.7
at epoch 8
train info: logloss loss:0.61198

<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x7f9d74f7ff60>

#### 1.5 Evaluate model

Again, let's see how the model performs now (after training):

In [9]:
res_syn = model.run_eval(test_file)
print(res_syn)


{'auc': 0.9716, 'logloss': 0.2278}


In [10]:
sb.glue("res_syn", res_syn)

If we want to get the full prediction scores rather than evaluation metrics, we can do this:

In [11]:
model.predict(test_file, output_file)

<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x7f9d74f7ff60>
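
Assuming the output file holds one predicted score per line (an assumption about the file layout that is worth checking against your own `output.txt`), the scores can be loaded back with a few lines of Python. The demo below writes a made-up file so that it runs on its own; `read_scores` is a hypothetical helper.

```python
import os
from tempfile import TemporaryDirectory


def read_scores(path):
    """Read one float per non-empty line (assumed file layout)."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


with TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, "output.txt")
    # Made-up scores standing in for real model predictions
    with open(demo_file, "w") as f:
        f.write("0.91\n0.07\n0.55\n")
    scores = read_scores(demo_file)

print(scores)  # [0.91, 0.07, 0.55]
```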

## 2. Criteo data 

We have now successfully launched an experiment on a synthetic dataset. Next, let's try a real-world dataset: a small sample from the [Criteo dataset](http://labs.criteo.com/category/dataset). The Criteo dataset is a well-known industry benchmark for developing CTR prediction models, and research papers frequently adopt it as an evaluation dataset. The original dataset is too large for a lightweight demo, so we sample a small portion of it as a demo dataset.

In [12]:
print('demo with Criteo dataset')
hparams = prepare_hparams(yaml_file, 
                          FEATURE_COUNT=2300000, 
                          FIELD_COUNT=39, 
                          cross_l2=0.01, 
                          embed_l2=0.01, 
                          layer_l2=0.01,
                          learning_rate=0.002, 
                          batch_size=BATCH_SIZE_CRITEO, 
                          epochs=EPOCHS_FOR_CRITEO_RUN, 
                          cross_layer_sizes=[20, 10], 
                          init_value=0.1, 
                          layer_sizes=[20,20],
                          use_Linear_part=True, 
                          use_CIN_part=True, 
                          use_DNN_part=True)



demo with Criteo dataset


In [13]:
train_file = os.path.join(data_path, r'cretio_tiny_train')
valid_file = os.path.join(data_path, r'cretio_tiny_valid')
test_file = os.path.join(data_path, r'cretio_tiny_test')

In [14]:
model = XDeepFMModel(hparams, FFMTextIterator, seed=RANDOM_SEED)

Add linear part.
Add CIN part.
Add DNN part.


In [15]:
# check the predictive performance before the model is trained
print(model.run_eval(test_file)) 


{'auc': 0.4728, 'logloss': 0.7113}


In [16]:
model.fit(train_file, valid_file)

at epoch 1
train info: logloss loss:744.3602027893066
eval info: auc:0.6637, logloss:0.5342
at epoch 1 , train time: 21.7 eval time: 4.3
at epoch 2
train info: logloss loss:385.66927337646484
eval info: auc:0.7137, logloss:0.5109
at epoch 2 , train time: 21.4 eval time: 4.3
at epoch 3
train info: logloss loss:191.50830841064453
eval info: auc:0.7283, logloss:0.5037
at epoch 3 , train time: 21.4 eval time: 4.2
at epoch 4
train info: logloss loss:92.20774269104004
eval info: auc:0.7359, logloss:0.4991
at epoch 4 , train time: 21.6 eval time: 4.4
at epoch 5
train info: logloss loss:43.159456968307495
eval info: auc:0.74, logloss:0.4963
at epoch 5 , train time: 21.6 eval time: 4.3
at epoch 6
train info: logloss loss:19.656921446323395
eval info: auc:0.7426, logloss:0.4946
at epoch 6 , train time: 21.3 eval time: 4.2
at epoch 7
train info: logloss loss:8.77035716176033
eval info: auc:0.7441, logloss:0.4934
at epoch 7 , train time: 21.5 eval time: 4.3
at epoch 8
train info: logloss loss:3.92

<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x7f9d64b4a2e8>

In [17]:
# check the predictive performance after the model is trained
res_real = model.run_eval(test_file)
print(res_real)

{'auc': 0.7356, 'logloss': 0.5017}


In [18]:
sb.glue("res_real", res_real)

In [19]:
# Cleanup
tmpdir.cleanup()

## Reference
\[1\] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018.<br>