# xDeepFM Single Node 
This notebook will give you a quick example of how to train an xDeepFM model. 
xDeepFM \[1\] is a deep learning-based model that aims to capture both lower- and higher-order feature interactions for precise recommender systems. It has the following two key properties:
* It contains a component, named CIN (Compressed Interaction Network), that learns feature interactions explicitly and at the vector-wise level;
* It contains a traditional DNN component that learns feature interactions implicitly and at the bit-wise level.
Thus xDeepFM can learn feature interactions more effectively, and manual feature engineering effort can be substantially reduced.
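
To make "vector-wise" concrete, here is a minimal pure-Python sketch of a single CIN layer following the description in \[1\]. Each output feature map is a weighted sum of element-wise products between the previous layer's vectors and the original field embeddings, with one scalar weight per pair of vectors shared across all embedding dimensions. The function name `cin_layer`, the toy embeddings, and the weights are illustrative assumptions, not the code used inside `XDeepFMModel`.

```python
def cin_layer(x_prev, x0, weights):
    """Compute one CIN layer (illustrative sketch).

    x_prev:  list of H_prev vectors (each of length D) from the previous layer
    x0:      list of m field embedding vectors (each of length D)
    weights: H_next matrices, each of shape H_prev x m
    returns: list of H_next vectors, each of length D
    """
    dim = len(x0[0])
    out = []
    for w in weights:  # one weight matrix per output feature map
        row = [0.0] * dim
        for i, xi in enumerate(x_prev):
            for j, xj in enumerate(x0):
                # The same scalar w[i][j] scales every coordinate of the
                # element-wise product, hence "vector-wise" interaction.
                for d in range(dim):
                    row[d] += w[i][j] * xi[d] * xj[d]
        out.append(row)
    return out


# Toy input: 2 fields with embedding dimension 2 (made-up values).
x0 = [[1.0, 2.0], [3.0, 4.0]]
# One output feature map that keeps only the (0, 0) and (1, 1) pairs.
weights = [[[1.0, 0.0], [0.0, 1.0]]]

# For the first CIN layer, the "previous layer" is the embedding matrix itself.
x1 = cin_layer(x0, x0, weights)
print(x1)  # [[10.0, 20.0]]
```

Because each weight multiplies a whole vector rather than a single coordinate, the embedding dimension is preserved from layer to layer, which is the contrast with the bit-wise interactions of the DNN component.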


## Global Settings and Imports

In [2]:
import os
import sys
sys.path.append("../../")
from reco_utils.recommender.deeprec.deeprec_utils import *
from reco_utils.recommender.deeprec.models.xDeepFM import *
from reco_utils.recommender.deeprec.IO.iterator import *


## Download data
xDeepFM uses the FFM format as data input: `<label> <field_id>:<feature_id>:<feature_value>`  
Each line represents an instance. \<label\> is a binary value, where 1 marks a positive instance and 0 a negative instance. 
Features are divided into fields. For example, a user's gender is a field with three possible values: male, female, and unknown. Occupation can be another field, which contains many more possible values than the gender field. Both field indices and feature indices start from 1. 
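
As a rough illustration of this format, the sketch below parses FFM lines and reports the largest field and feature indices it sees, which is one way to sanity-check values such as `FIELD_COUNT` and `FEATURE_COUNT` before training. The helper names `parse_ffm_line` and `max_ids` and the two sample lines are hypothetical and are not part of `reco_utils`.

```python
def parse_ffm_line(line):
    """Split one FFM line into (label, [(field_id, feature_id, value), ...])."""
    tokens = line.split()
    label = int(tokens[0])
    features = []
    for token in tokens[1:]:
        field_id, feature_id, value = token.split(":")
        features.append((int(field_id), int(feature_id), float(value)))
    return label, features


def max_ids(lines):
    """Return the largest field index and feature index found in the lines."""
    max_field = 0
    max_feature = 0
    for line in lines:
        _, features = parse_ffm_line(line)
        for field_id, feature_id, _ in features:
            max_field = max(max_field, field_id)
            max_feature = max(max_feature, feature_id)
    return max_field, max_feature


# Two made-up instances with three fields each (indices start from 1).
sample = [
    "1 1:1:1.0 2:4:1.0 3:9:0.5",
    "0 1:2:1.0 2:5:1.0 3:9:0.25",
]

print(parse_ffm_line(sample[0]))  # (1, [(1, 1, 1.0), (2, 4, 1.0), (3, 9, 0.5)])
print(max_ids(sample))            # (3, 9)
```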

In [3]:
data_path = '../../tests/resources/deeprec/xdeepfm'
yaml_file = os.path.join(data_path, r'xDeepFM.yaml')
train_file = os.path.join(data_path, r'synthetic_part_0')
valid_file = os.path.join(data_path, r'synthetic_part_1')
test_file = os.path.join(data_path, r'synthetic_part_1')
output_file = os.path.join(data_path, r'output.txt')

if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')


## Create hyper-parameters
`prepare_hparams()` creates a full set of hyper-parameters for model training, such as learning rate, feature number, and dropout ratio. We can put these parameters in a YAML file, or pass them as keyword arguments to the function.

In [17]:
hparams = prepare_hparams(yaml_file, FEATURE_COUNT=1000, FIELD_COUNT=10, cross_l2=0.0001, embed_l2=0.0001, learning_rate=0.001, epochs=15)
print(hparams)

[('DNN_FIELD_NUM', None), ('FEATURE_COUNT', 1000), ('FIELD_COUNT', 10), ('MODEL_DIR', None), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('activation', ['relu', 'relu']), ('attention_activation', None), ('attention_dropout', 0.0), ('attention_layer_sizes', None), ('batch_size', 128), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0001), ('cross_layer_sizes', [1]), ('cross_layers', None), ('data_format', 'ffm'), ('dim', 10), ('doc_size', None), ('dropout', [0.0, 0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0001), ('enable_BN', False), ('entityEmb_file', None), ('entity_dim', None), ('entity_embedding_method', None), ('entity_size', None), ('epochs', 15), ('fast_CIN_d', 0), ('filter_sizes', None), ('init_method', 'tnormal'), ('init_value', 0.3), ('is_clip_norm', 0), ('iterator_type', None), ('kg_file', None), ('kg_training_interval', 5), ('layer_l1', 0.0), ('layer_l2', 0.0001), ('layer_sizes', [100, 100]), ('learning_rate', 0.001), ('load_model_name', 'yo

## Create data loader
Designate a data iterator for the model. xDeepFM uses FFMTextIterator. 

In [18]:
input_creator = FFMTextIterator

## Create model
Once the hyper-parameters and the data iterator are ready, we can create a model:

In [19]:
model = XDeepFMModel(hparams, input_creator)

## Sometimes we don't want to train a model from scratch;
## in that case we can load a pre-trained model like this:
#model.load_model(r'your_model_path')

Add CIN part.


Now let's see how the model performs at this point (without any training):

In [15]:
print(model.run_eval(valid_file))

{'auc': 0.4995, 'logloss': 0.7267}
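
For readers less familiar with the two metrics reported here, the following is a minimal pure-Python sketch of AUC and log loss. It is not how `run_eval` computes them internally; the function names and the toy labels and scores are assumptions made for illustration. It shows that a model with no ranking ability, here one that outputs the same score for every instance, gets an AUC of 0.5 and a log loss of ln 2 ≈ 0.693.

```python
import math


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def logloss(labels, scores, eps=1e-15):
    """Mean binary cross-entropy, with scores clipped away from 0 and 1."""
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1.0 - eps)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(labels)


labels = [1, 0, 1, 0]
constant = [0.5, 0.5, 0.5, 0.5]  # uninformative model: same score everywhere
ranked = [0.9, 0.1, 0.8, 0.2]    # every positive scored above every negative

print(auc(labels, constant))      # 0.5
print(logloss(labels, constant))  # ln 2, about 0.6931
print(auc(labels, ranked))        # 1.0
```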


An AUC of 0.5 corresponds to random guessing, so before training the model behaves like a random guesser. Next we train the model on a training file and check its performance on a validation dataset. Training the model is as simple as a function call:

In [20]:
model.fit(train_file, valid_file)

at epoch 1 train info: auc:0.5299, logloss:0.6919 eval info: auc:0.4979, logloss:0.6958
at epoch 1 , train time: 4.8 eval time: 4.7
at epoch 2 train info: auc:0.5552, logloss:0.6891 eval info: auc:0.5046, logloss:0.6943
at epoch 2 , train time: 4.7 eval time: 4.9
at epoch 3 train info: auc:0.5924, logloss:0.6837 eval info: auc:0.5195, logloss:0.6933
at epoch 3 , train time: 4.6 eval time: 4.7
at epoch 4 train info: auc:0.6689, logloss:0.6609 eval info: auc:0.5731, logloss:0.6847
at epoch 4 , train time: 4.3 eval time: 4.1
at epoch 5 train info: auc:0.8033, logloss:0.5583 eval info: auc:0.719, logloss:0.6157
at epoch 5 , train time: 4.0 eval time: 4.0
at epoch 6 train info: auc:0.8952, logloss:0.4199 eval info: auc:0.8332, logloss:0.5036
at epoch 6 , train time: 4.1 eval time: 3.9
at epoch 7 train info: auc:0.9391, logloss:0.324 eval info: auc:0.8844, logloss:0.4292
at epoch 7 , train time: 4.1 eval time: 3.9
at epoch 8 train info: auc:0.964, logloss:0.2523 eval info: auc:0.9133, loglos

<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x2451c0c9278>

Again, let's see how the model performs now (after training):

In [21]:
print(model.run_eval(valid_file))

{'auc': 0.9851, 'logloss': 0.1693}


If we want to get the full prediction scores rather than evaluation metrics, we can do this:

In [22]:
model.predict(test_file, output_file)

<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x2451c0c9278>

Now we have successfully launched an experiment on a synthetic dataset. Next, let's try a real-world dataset, which is a small sample from the Criteo dataset \[2\].

In [5]:
print('demo with Criteo dataset')
hparams = prepare_hparams(yaml_file, FEATURE_COUNT=2300000, FIELD_COUNT=39, cross_l2=0.01, embed_l2=0.01, layer_l2=0.01,
                          learning_rate=0.002, batch_size=4096, epochs=30, cross_layer_sizes=[20, 10], init_value=0.1, layer_sizes=[20, 20],
                          use_Linear_part=True, use_CIN_part=True, use_DNN_part=True)

train_file = os.path.join(data_path, r'cretio_tiny_train')
valid_file = os.path.join(data_path, r'cretio_tiny_valid')
test_file = os.path.join(data_path, r'cretio_tiny_test')

model = XDeepFMModel(hparams, FFMTextIterator)

print(model.run_eval(valid_file))
model.fit(train_file, valid_file)
print(model.run_eval(valid_file))

demo with Criteo dataset
Add linear part.
Add CIN part.
Add DNN part.
{'auc': 0.4827, 'logloss': 0.7696}
at epoch 1 train info: auc:0.6427, logloss:0.548 eval info: auc:0.6413, logloss:0.5458
at epoch 1 , train time: 52.5 eval time: 29.1
at epoch 2 train info: auc:0.7001, logloss:0.519 eval info: auc:0.7004, logloss:0.5182
at epoch 2 , train time: 50.8 eval time: 28.8
at epoch 3 train info: auc:0.7221, logloss:0.5071 eval info: auc:0.7199, logloss:0.5075
at epoch 3 , train time: 50.7 eval time: 29.0
at epoch 4 train info: auc:0.7337, logloss:0.5007 eval info: auc:0.7304, logloss:0.5016
at epoch 4 , train time: 50.7 eval time: 28.9
at epoch 5 train info: auc:0.7405, logloss:0.4967 eval info: auc:0.7367, logloss:0.498
at epoch 5 , train time: 50.3 eval time: 29.2
at epoch 6 train info: auc:0.7442, logloss:0.4944 eval info: auc:0.74, logloss:0.4958
at epoch 6 , train time: 50.5 eval time: 28.9
at epoch 7 train info: auc:0.7461, logloss:0.4932 eval info: auc:0.7418, logloss:0.4946
at epoch

## Reference
\[1\] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018.
\[2\] The Criteo datasets: http://labs.criteo.com/category/dataset/.