<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# xDeepFM: the eXtreme Deep Factorization Machine
This notebook gives a quick example of how to train an [xDeepFM model](https://arxiv.org/abs/1803.05170). 
xDeepFM \[1\] is a deep learning-based model that aims to capture both lower- and higher-order feature interactions for precise recommender systems. Because it learns feature interactions automatically, it can substantially reduce manual feature engineering effort. To summarize, xDeepFM has the following key properties:
* It contains a component, named CIN (Compressed Interaction Network), that learns feature interactions explicitly and at the vector-wise level.
* It contains a traditional DNN component that learns feature interactions implicitly and at the bit-wise level.
* The implementation makes this model quite configurable. We can enable different subsets of components by setting hyperparameters like `use_Linear_part`, `use_FM_part`, `use_CIN_part`, and `use_DNN_part`. For example, by enabling only the `use_Linear_part` and `use_FM_part`, we can get a classical FM model.
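
To make "explicit" and "vector-wise" more concrete, here is a minimal NumPy sketch of a single CIN layer as described in the paper \[1\]. It is not the implementation used by `XDeepFMModel`; the function name `cin_layer`, the shapes, and the random weights are illustrative assumptions. Each output feature map is a weighted sum of element-wise (Hadamard) products between rows of the previous layer and rows of the raw field embeddings, so the embedding dimension is never mixed across positions.

```python
import numpy as np


def cin_layer(x_prev, x0, weights):
    """One CIN layer (illustrative sketch, not the library code).

    x_prev:  (H_prev, D) feature maps from the previous layer
    x0:      (m, D)      raw field embeddings, one row per field
    weights: (H_next, H_prev, m) one weight per output map and (i, j) pair
    returns: (H_next, D)
    """
    # Hadamard product of every row pair: shape (H_prev, m, D)
    pairwise = x_prev[:, None, :] * x0[None, :, :]
    # Each output map is a weighted sum over all (i, j) pairs
    return np.einsum("hij,ijd->hd", weights, pairwise)


rng = np.random.default_rng(0)
m, D = 4, 3  # 4 fields, embedding size 3 (toy values)
x0 = rng.normal(size=(m, D))

w1 = rng.normal(size=(5, m, m))  # first layer: 5 feature maps
x1 = cin_layer(x0, x0, w1)       # second-order interactions
w2 = rng.normal(size=(2, 5, m))  # second layer: 2 feature maps
x2 = cin_layer(x1, x0, w2)       # third-order interactions

# Sum pooling over the embedding dimension, concatenated across layers
pooled = np.concatenate([x1.sum(axis=1), x2.sum(axis=1)])
print(x1.shape, x2.shape, pooled.shape)
```

Stacking layers raises the interaction order by one each time, which is why the depth of the CIN controls the highest explicit interaction order.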

In this notebook, we test xDeepFM on the [Criteo dataset](http://labs.criteo.com/category/dataset).

## 0. Global Settings and Imports

In [1]:
import os
import sys
import scrapbook as sb
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources, prepare_hparams
from recommenders.models.deeprec.models.xDeepFM import XDeepFMModel
from recommenders.models.deeprec.io.iterator import FFMTextIterator

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.7.13 (default, Oct 18 2022, 18:57:03) 
[GCC 11.2.0]
Tensorflow version: 2.7.4


#### Parameters

In [2]:
EPOCHS = 10
BATCH_SIZE = 4096
RANDOM_SEED = 42  # Set this to None for non-deterministic result


xDeepFM uses the FFM format as data input: `<label> <field_id>:<feature_id>:<feature_value>`  
Each line represents one instance. `<label>` is a binary value, with 1 denoting a positive instance and 0 a negative instance. 
Features are divided into fields. For example, a user's gender is a field with three possible values: male, female, and unknown. Occupation can be another field, with many more possible values than the gender field. Both field indices and feature indices start from 1.
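
As a rough illustration of the format described above, the following sketch parses one FFM line into a label and a list of `(field_id, feature_id, feature_value)` triples. The helper `parse_ffm_line` and the sample line are hypothetical; the notebook itself relies on `FFMTextIterator` to read the files.

```python
from typing import List, Tuple


def parse_ffm_line(line: str) -> Tuple[int, List[Tuple[int, int, float]]]:
    """Split an FFM line into its label and (field, feature, value) triples."""
    tokens = line.split()
    # The first token is the binary label (tolerate "1" as well as "1.0")
    label = int(float(tokens[0]))
    features = []
    for token in tokens[1:]:
        field_id, feature_id, value = token.split(":")
        features.append((int(field_id), int(feature_id), float(value)))
    return label, features


# Made-up example: a positive instance with one feature in each of two fields
label, features = parse_ffm_line("1 1:3:1.0 2:17:0.5")
print(label, features)
```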

In [3]:
tmpdir = TemporaryDirectory()
data_path = tmpdir.name
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
output_file = os.path.join(data_path, r'output.txt')
train_file = os.path.join(data_path, r'cretio_tiny_train')
valid_file = os.path.join(data_path, r'cretio_tiny_valid')
test_file = os.path.join(data_path, r'cretio_tiny_test')

if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')


100%|██████████| 10.3k/10.3k [00:01<00:00, 7.96kKB/s]


## 2. Criteo data 

Now let's try xDeepFM on a real-world dataset: a small sample from the [Criteo dataset](http://labs.criteo.com/category/dataset). The Criteo dataset is a well-known industry benchmark for developing CTR prediction models, and it is frequently used as an evaluation dataset in research papers. 

The original dataset is too large for a lightweight demo, so we sample a small portion from it as a demo dataset.

In [4]:
print('Demo with Criteo dataset')
hparams = prepare_hparams(yaml_file, 
                          FEATURE_COUNT=2300000, 
                          FIELD_COUNT=39, 
                          cross_l2=0.01, 
                          embed_l2=0.01, 
                          layer_l2=0.01,
                          learning_rate=0.002, 
                          batch_size=BATCH_SIZE, 
                          epochs=EPOCHS, 
                          cross_layer_sizes=[20, 10], 
                          init_value=0.1, 
                          layer_sizes=[20,20],
                          use_Linear_part=True, 
                          use_CIN_part=True, 
                          use_DNN_part=True)
print(hparams)

Demo with Criteo dataset
HParams object with values {'use_entity': True, 'use_context': True, 'cross_activation': 'identity', 'user_dropout': False, 'dropout': [0.0, 0.0], 'attention_dropout': 0.0, 'load_saved_model': False, 'fast_CIN_d': 0, 'use_Linear_part': True, 'use_FM_part': False, 'use_CIN_part': True, 'use_DNN_part': True, 'init_method': 'tnormal', 'init_value': 0.1, 'embed_l2': 0.01, 'embed_l1': 0.0, 'layer_l2': 0.01, 'layer_l1': 0.0, 'cross_l2': 0.01, 'cross_l1': 0.0, 'reg_kg': 0.0, 'learning_rate': 0.002, 'lr_rs': 1, 'lr_kg': 0.5, 'kg_training_interval': 5, 'max_grad_norm': 2, 'is_clip_norm': 0, 'dtype': 32, 'optimizer': 'adam', 'epochs': 10, 'batch_size': 4096, 'enable_BN': False, 'show_step': 200000, 'save_model': False, 'save_epoch': 2, 'write_tfevents': False, 'train_num_ngs': 4, 'need_sample': True, 'embedding_dropout': 0.0, 'EARLY_STOP': 100, 'min_seq_length': 1, 'slots': 5, 'cell': 'SUM', 'FIELD_COUNT': 39, 'FEATURE_COUNT': 2300000, 'data_format': 'ffm', 'load_model_n
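
The printed hyperparameters show the linear, CIN, and DNN parts enabled and `use_FM_part` disabled. The sketch below groups these four switches into named variants that could be passed as keyword overrides to `prepare_hparams`, in the same way the cell above passes them. The helper `component_flags` and the variant names are hypothetical; only the "linear plus FM gives a classical FM model" combination and the combination used in this demo come from the notebook.

```python
COMPONENT_FLAGS = ("use_Linear_part", "use_FM_part", "use_CIN_part", "use_DNN_part")

# Hypothetical variant names mapped to the components they enable
VARIANTS = {
    "fm": {"use_Linear_part", "use_FM_part"},
    "xdeepfm_demo": {"use_Linear_part", "use_CIN_part", "use_DNN_part"},
}


def component_flags(variant):
    """Return a dict with every component flag set explicitly to True or False."""
    enabled = VARIANTS[variant]
    return {flag: flag in enabled for flag in COMPONENT_FLAGS}


fm_flags = component_flags("fm")
print(fm_flags)

# Possible usage (assumes the names defined earlier in this notebook):
# hparams = prepare_hparams(yaml_file, FEATURE_COUNT=2300000, FIELD_COUNT=39, **fm_flags)
```

Setting all four flags explicitly avoids silently inheriting a default from the YAML file when switching between variants.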

In [6]:
model = XDeepFMModel(hparams, FFMTextIterator, seed=RANDOM_SEED)

Add linear part.
Add CIN part.
Add DNN part.


2022-11-16 11:30:58.632305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15397 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 6.0


In [7]:
# check the predictive performance before the model is trained
print(model.run_eval(test_file)) 

2022-11-16 11:31:03.488364: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8401
2022-11-16 11:31:03.969312: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory


{'auc': 0.4728, 'logloss': 0.7113}


In [8]:
%%time
model.fit(train_file, valid_file)

at epoch 1
train info: logloss loss:744.3602104187012
eval info: auc:0.6637, logloss:0.5342
at epoch 1 , train time: 20.7 eval time: 4.0
at epoch 2
train info: logloss loss:385.66929054260254
eval info: auc:0.7137, logloss:0.5109
at epoch 2 , train time: 19.9 eval time: 3.9
at epoch 3
train info: logloss loss:191.5083179473877
eval info: auc:0.7283, logloss:0.5037
at epoch 3 , train time: 19.7 eval time: 4.1
at epoch 4
train info: logloss loss:92.20774817466736
eval info: auc:0.7359, logloss:0.4991
at epoch 4 , train time: 20.1 eval time: 3.9
at epoch 5
train info: logloss loss:43.15945792198181
eval info: auc:0.74, logloss:0.4963
at epoch 5 , train time: 20.0 eval time: 3.9
at epoch 6
train info: logloss loss:19.656923294067383
eval info: auc:0.7426, logloss:0.4946
at epoch 6 , train time: 20.3 eval time: 3.9
at epoch 7
train info: logloss loss:8.770357608795166
eval info: auc:0.7441, logloss:0.4934
at epoch 7 , train time: 19.9 eval time: 4.0
at epoch 8
train info: logloss loss:3.922

<recommenders.models.deeprec.models.xDeepFM.XDeepFMModel at 0x7f861579ebd0>

In [9]:
# check the predictive performance after the model is trained
result = model.run_eval(test_file)
print(result)

{'auc': 0.7356, 'logloss': 0.5017}
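
For readers unfamiliar with the two reported metrics, here is a small pure-Python sketch of their standard definitions. It is not the evaluation code used by `run_eval`, and the toy labels and scores are made up. AUC is the fraction of (positive, negative) pairs that the model ranks correctly, so a random model scores about 0.5, which is consistent with the untrained result above; logloss is the mean negative log-likelihood of the true labels.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def auc(labels, probs):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: three of the four (positive, negative) pairs are ordered correctly
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
print(auc(labels, probs), log_loss(labels, probs))
```

The pairwise AUC is quadratic in the number of instances, which is fine for a toy check but too slow for a real test set; rank-based implementations compute the same value more efficiently.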


In [10]:
sb.glue("result", result)

In [11]:
# Cleanup
tmpdir.cleanup()

## Reference
\[1\] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018.