<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# xDeepFM : the eXtreme Deep Factorization Machine 
This notebook will give you a quick example of how to train an xDeepFM model. 
xDeepFM \[1\] is a deep learning-based model that aims to capture both lower- and higher-order feature interactions for precise recommender systems. It can therefore learn feature interactions more effectively, and the manual feature engineering effort can be substantially reduced. To summarize, xDeepFM has the following key properties:
* It contains a component, named CIN, that learns feature interactions explicitly and at the vector-wise level;
* It contains a traditional DNN component that learns feature interactions implicitly and at the bit-wise level;
* The implementation makes this model quite configurable. We can enable different subsets of components by setting hyperparameters like `use_Linear_part`, `use_FM_part`, `use_CIN_part`, and `use_DNN_part`. For example, by enabling only the `use_Linear_part` and `use_FM_part`, we can get a classical FM model.
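
To make "explicit" and "vector-wise" concrete, here is a minimal pure-Python sketch of a single CIN (Compressed Interaction Network) layer as defined in \[1\]: each output feature map is a weighted sum of element-wise products between rows of the previous layer and rows of the field-embedding matrix. The function name `cin_layer` and the toy numbers are assumptions made for illustration. This is not the code used inside `XDeepFMModel`, and it leaves out the sum pooling and output layer that follow the CIN in the full model.

```python
from typing import List

Matrix = List[List[float]]  # each row is one feature vector of length D


def cin_layer(x_prev: Matrix, x0: Matrix, weights: List[Matrix]) -> Matrix:
    """Compute one CIN layer (illustrative sketch).

    x_prev:  H_prev x D feature maps from the previous layer
    x0:      m x D field embeddings (the layer-0 input)
    weights: one H_prev x m weight matrix per output feature map
    Returns a matrix with len(weights) rows and D columns.
    """
    dim = len(x0[0])
    out = []
    for w in weights:
        row = [0.0] * dim
        for i, xi in enumerate(x_prev):
            for j, xj in enumerate(x0):
                # Whole embedding vectors interact (vector-wise),
                # scaled by a single scalar weight w[i][j].
                for d in range(dim):
                    row[d] += w[i][j] * xi[d] * xj[d]
        out.append(row)
    return out


# Two fields with 2-dimensional embeddings (toy values).
x0 = [[1.0, 2.0], [3.0, 4.0]]
# One output feature map that keeps only the (0,0) and (1,1) interactions.
weights = [[[1.0, 0.0], [0.0, 1.0]]]
first_layer = cin_layer(x0, x0, weights)
print(first_layer)
```

Because the weight multiplies a whole element-wise product of two embedding vectors, the interaction order grows by one with every stacked layer, which is what distinguishes the CIN from the bit-wise mixing done by a plain DNN.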


## Global Settings and Imports

In [23]:
import os  # used below for os.path.join; do not rely on the wildcard imports to provide it
import sys
sys.path.append("../../")
import papermill as pm
import tensorflow as tf
import pandas as pd

from reco_utils.recommender.deeprec.deeprec_utils import *
from reco_utils.recommender.deeprec.models.xDeepFM import *
from reco_utils.recommender.deeprec.IO.iterator import *
from reco_utils.dataset.pandas_df_utils import LibffmConverter
from reco_utils.dataset import movielens
from reco_utils.common.constants import (
    DEFAULT_USER_COL as USER_COL,
    DEFAULT_ITEM_COL as ITEM_COL,
    DEFAULT_RATING_COL as RATING_COL,
    DEFAULT_PREDICTION_COL as PREDICT_COL,
)
from reco_utils.common.constants import SEED
from reco_utils.dataset.python_splitters import python_random_split
print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.5.5 |Anaconda custom (64-bit)| (default, May 13 2018, 21:12:35) 
[GCC 7.2.0]
Tensorflow version: 1.10.1


## 1. Prepare data

xDeepFM uses the FFM format as data input: `<label> <field_id>:<feature_id>:<feature_value>`

    
### 1.1 Load Movie rating and genres data
First, download the MovieLens data. Each movie in the dataset is tagged with one or more genres, out of 19 genres in total (including 'unknown'). We load the movie genres to use them as item features.

In [24]:
MOVIELENS_DATA_SIZE='100k'
RANDOM_SEED = SEED  # Set seed for deterministic result
# The genres of each movie are returned as '|' separated string, e.g. "Animation|Children's|Comedy".
data = movielens.load_pandas_df(
        size=MOVIELENS_DATA_SIZE,
        header=[USER_COL, ITEM_COL, RATING_COL],
        genres_col='Genres_string'  # load genres as a temporary column 'Genres_string'
    )
display(data.head())
data['userID'] = data.userID.astype(str)
data['itemID'] = data.itemID.astype(str)

100%|██████████| 4.81k/4.81k [00:00<00:00, 17.4kKB/s]


Unnamed: 0,userID,itemID,rating,Genres_string
0,196,242,3.0,Comedy
1,63,242,3.0,Comedy
2,226,242,5.0,Comedy
3,154,242,3.0,Comedy
4,306,242,5.0,Comedy


### 1.2 Convert to libffm format. 
The converter emits each feature as `<field_index>:<field_feature_index>:1` or `<field_index>:<field_feature_index>:<field_feature_value>`, depending on the data type of the feature in the original dataframe. The current libffm converter does not support multiple features in one field, so the following cells are a workaround.

There are 3 fields in this dataset: userID, itemID and genres.
The 'userID' and 'itemID' fields each carry a single feature per row, whereas the 'genres' field has 19 features, one for each genre.
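
As a rough illustration of the target layout, the sketch below turns one multi-hot genre vector into 19 FFM tokens that all share field 3 and use feature indices 1 to 19. The helper name `multi_hot_to_ffm` is hypothetical and is not part of `LibffmConverter`; the notebook itself reaches the same layout through the converter plus the re-indexing workaround in the cells that follow.

```python
def multi_hot_to_ffm(field_id, multi_hot):
    """Return one 'field:feature:value' token per position, all in the same field."""
    return [
        "{}:{}:{}".format(field_id, feature_id, value)
        for feature_id, value in enumerate(multi_hot, start=1)
    ]


# A movie tagged only with the fifth genre, 'Comedy'.
genres = [0, 0, 0, 0, 1] + [0] * 14
tokens = multi_hot_to_ffm(3, genres)
print(tokens[:5])
```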



In [25]:
import sklearn.preprocessing
genres_encoder = sklearn.preprocessing.MultiLabelBinarizer()
data['genres'] = genres_encoder.fit_transform(
        data['Genres_string'].apply(lambda s: s.split("|"))
    ).tolist()
print("Genres:", genres_encoder.classes_)
display(data.drop_duplicates(ITEM_COL)[[ITEM_COL, 'Genres_string', 'genres']].head())

Genres: ['Action' 'Adventure' 'Animation' "Children's" 'Comedy' 'Crime'
 'Documentary' 'Drama' 'Fantasy' 'Film-Noir' 'Horror' 'Musical' 'Mystery'
 'Romance' 'Sci-Fi' 'Thriller' 'War' 'Western' 'unknown']


Unnamed: 0,itemID,Genres_string,genres
0,242,Comedy,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
117,302,Crime|Film-Noir|Mystery|Thriller,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, ..."
414,377,Children's|Comedy,"[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
427,51,Drama|Romance|War|Western,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, ..."
508,346,Crime|Drama,"[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ..."


In [26]:
#generate new column names for field 'genre'
data[['genre1','genre2','genre3','genre4','genre5','genre6','genre7',\
      'genre8','genre9','genre10','genre11','genre12','genre13','genre14',\
      'genre15','genre16','genre17','genre18','genre19']] \
=pd.DataFrame(data.genres.values.tolist(), index= data.index)

In [27]:
data=data.drop(["genres","Genres_string"],axis=1)

In [28]:
data.head()

Unnamed: 0,userID,itemID,rating,genre1,genre2,genre3,genre4,genre5,genre6,genre7,...,genre10,genre11,genre12,genre13,genre14,genre15,genre16,genre17,genre18,genre19
0,196,242,3.0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,63,242,3.0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,226,242,5.0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,154,242,3.0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,306,242,5.0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
## feed the data into libffm converter. 
converter = LibffmConverter().fit(data, col_rating=RATING_COL)
data_transformed = converter.transform(data)

In [30]:
##change to classification label
data_transformed.loc[data_transformed['rating'] <= 3, 'rating'] = 0
data_transformed.loc[data_transformed['rating'] > 3, 'rating'] = 1


In [31]:
data_transformed.head()

Unnamed: 0,rating,userID,itemID,genre1,genre2,genre3,genre4,genre5,genre6,genre7,...,genre10,genre11,genre12,genre13,genre14,genre15,genre16,genre17,genre18,genre19
0,0.0,1:1:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
1,0.0,1:2:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
2,1.0,1:3:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
3,0.0,1:4:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
4,1.0,1:5:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0


In [32]:
## this is a workaround for current libffmconverter
## keep field number for genre to 3 and reset genre index to 1:19
num = int(data_transformed['genre1'][0][2:-2])

genre_index=1
old_field_index=genre_index+2

for i in range(num,num+19):
    
    old_index=str(old_field_index)+':'+str(i)
    new_index='3:'+str(genre_index)
    
    coln='genre'+str(genre_index)
    data_transformed[coln]=data_transformed[coln].str.replace(old_index,new_index)
    
    
    genre_index=genre_index+1
    old_field_index=old_field_index+1


In [33]:
data_transformed.tail()

Unnamed: 0,rating,userID,itemID,genre1,genre2,genre3,genre4,genre5,genre6,genre7,...,genre10,genre11,genre12,genre13,genre14,genre15,genre16,genre17,genre18,genre19
99995,1.0,1:505:1,2:2621:1,3:1:0,3:2:0,3:3:0,3:4:0,3:5:0,3:6:0,3:7:0,...,3:10:0,3:11:0,3:12:0,3:13:0,3:14:0,3:15:0,3:16:0,3:17:0,3:18:0,3:19:0
99996,0.0,1:89:1,2:2622:1,3:1:0,3:2:0,3:3:0,3:4:0,3:5:0,3:6:0,3:7:0,...,3:10:0,3:11:0,3:12:0,3:13:0,3:14:0,3:15:0,3:16:0,3:17:0,3:18:0,3:19:0
99997,0.0,1:89:1,2:2623:1,3:1:0,3:2:0,3:3:0,3:4:0,3:5:0,3:6:0,3:7:0,...,3:10:0,3:11:0,3:12:0,3:13:0,3:14:0,3:15:0,3:16:0,3:17:0,3:18:0,3:19:0
99998,0.0,1:89:1,2:2624:1,3:1:0,3:2:0,3:3:0,3:4:0,3:5:0,3:6:0,3:7:0,...,3:10:0,3:11:0,3:12:0,3:13:0,3:14:0,3:15:0,3:16:0,3:17:0,3:18:0,3:19:0
99999,0.0,1:89:1,2:2625:1,3:1:0,3:2:0,3:3:0,3:4:0,3:5:0,3:6:0,3:7:1,...,3:10:0,3:11:0,3:12:0,3:13:0,3:14:0,3:15:0,3:16:0,3:17:0,3:18:0,3:19:0


### Parameters

## Download data
xDeepFM uses the FFM format as data input: `<label> <field_id>:<feature_id>:<feature_value>`  
Each line represents one instance. `<label>` is a binary value, where 1 marks a positive instance and 0 a negative instance. 
Features are divided into fields. For example, a user's gender is a field with three possible values: male, female and unknown. Occupation can be another field, with many more possible values than the gender field. Both the field index and the feature index start from 1. <br>
Now let's start with the MovieLens dataset.
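
To make the line layout concrete, here is a small stdlib-only sketch that splits one FFM line into its label and `(field_id, feature_id, feature_value)` triples. The function `parse_ffm_line` is an illustrative helper, not a `reco_utils` API, and the sample line is assembled from values of the kind shown in the transformed dataframe above.

```python
from typing import List, Tuple


def parse_ffm_line(line: str) -> Tuple[float, List[Tuple[int, int, float]]]:
    """Split '<label> <field>:<feature>:<value> ...' into typed parts."""
    parts = line.split()
    label = float(parts[0])
    features = []
    for token in parts[1:]:
        field_id, feature_id, value = token.split(":")
        features.append((int(field_id), int(feature_id), float(value)))
    return label, features


label, features = parse_ffm_line("1.0 1:505:1 2:2621:1 3:1:0 3:7:1")
print(label, features)
```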

In [34]:
data_path = '../../tests/resources/deeprec/movielens'
yaml_file = os.path.join(data_path, r'network_xdeepFM.yaml')
output_file = os.path.join(data_path, r'output.txt')
train_file= os.path.join(data_path,'train.csv')
test_file=os.path.join(data_path,'test.csv')

In [35]:
train, test = python_random_split(
        data_transformed,
        ratio=0.75,
        seed=RANDOM_SEED,
    )

In [36]:
train.to_csv(train_file, sep=' ',header=False, index=False)
test.to_csv(test_file, sep=' ',header=False, index=False)

## Create hyper-parameters
`prepare_hparams()` creates a full set of hyper-parameters for model training, such as the learning rate, feature number, and dropout ratio. We can put those parameters in a YAML file, or pass them as arguments to the function (which override the YAML settings).
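
The override rule described above can be pictured as a plain dictionary merge in which explicitly passed values win over file values. The sketch below only models that precedence. `merge_hparams` is a hypothetical stand-in, not the implementation of `prepare_hparams`, and the sample values mirror a few entries from the printed hyper-parameters in this notebook.

```python
def merge_hparams(file_settings, **overrides):
    """Return a new dict where keyword overrides take precedence over file settings."""
    merged = dict(file_settings)  # copy, so the file settings stay untouched
    merged.update(overrides)
    return merged


yaml_settings = {"learning_rate": 0.001, "epochs": 10, "batch_size": 128}
merged_settings = merge_hparams(yaml_settings, epochs=20)
print(merged_settings)
```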

In [37]:
hparams = prepare_hparams(yaml_file)
print(hparams)

[('DNN_FIELD_NUM', None), ('FEATURE_COUNT', 22), ('FIELD_COUNT', 3), ('MODEL_DIR', None), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('activation', ['relu', 'relu']), ('attention_activation', None), ('attention_dropout', 0.0), ('attention_layer_sizes', None), ('batch_size', 128), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0), ('cross_layer_sizes', [100, 100, 50]), ('cross_layers', None), ('data_format', 'ffm'), ('dim', 10), ('doc_size', None), ('dropout', [0.0, 0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0), ('enable_BN', False), ('entityEmb_file', None), ('entity_dim', None), ('entity_embedding_method', None), ('entity_size', None), ('epochs', 10), ('fast_CIN_d', 0), ('filter_sizes', None), ('init_method', 'tnormal'), ('init_value', 0.01), ('is_clip_norm', 0), ('iterator_type', None), ('kg_file', None), ('kg_training_interval', 5), ('layer_l1', 0.0), ('layer_l2', 0.0), ('layer_sizes', [400, 400]), ('learning_rate', 0.001), ('load_model_name', Non

## Create data loader
Designate a data iterator for the model. xDeepFM uses `FFMTextIterator`.
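
The internals of `FFMTextIterator` are not shown in this notebook. As a conceptual sketch only, a text iterator over FFM data reads lines, separates the label from the feature tokens, and groups instances into batches. The generator `iter_ffm_batches` below is a hypothetical illustration of that idea and makes no claim about how the library builds its tensors.

```python
import io


def iter_ffm_batches(lines, batch_size):
    """Yield lists of (label, tokens) pairs with at most batch_size items each."""
    batch = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        batch.append((float(parts[0]), parts[1:]))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


sample = io.StringIO("1.0 1:1:1 2:944:1\n0.0 1:2:1 2:944:1\n1.0 1:3:1 2:944:1\n")
batches = list(iter_ffm_batches(sample, batch_size=2))
print(len(batches), batches[-1])
```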

In [38]:
input_creator = FFMTextIterator

## Create model
Once both the hyper-parameters and the data iterator are ready, we can create a model:

In [39]:
model = XDeepFMModel(hparams, input_creator)

## sometimes we don't want to train a model from scratch
## then we can load a pre-trained model like this: 
#model.load_model(r'your_model_path')

Add CIN part.


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


### Train and fit model on validation data

In [None]:
model.fit(train_file, test_file)

step 20 , total_loss: 0.7359, data_loss: 0.7359
step 40 , total_loss: 0.7060, data_loss: 0.7060
step 60 , total_loss: 0.7343, data_loss: 0.7343
step 80 , total_loss: 0.7110, data_loss: 0.7110
step 100 , total_loss: 0.6539, data_loss: 0.6539
step 120 , total_loss: 0.6569, data_loss: 0.6569
step 140 , total_loss: 0.6155, data_loss: 0.6155
step 160 , total_loss: 0.6448, data_loss: 0.6448
step 180 , total_loss: 0.6185, data_loss: 0.6185
step 200 , total_loss: 0.5997, data_loss: 0.5997
step 220 , total_loss: 0.6209, data_loss: 0.6209
step 240 , total_loss: 0.6007, data_loss: 0.6007
step 260 , total_loss: 0.5915, data_loss: 0.5915
step 280 , total_loss: 0.5860, data_loss: 0.5860
step 300 , total_loss: 0.5399, data_loss: 0.5399
step 320 , total_loss: 0.5666, data_loss: 0.5666
step 340 , total_loss: 0.5556, data_loss: 0.5556
step 360 , total_loss: 0.5446, data_loss: 0.5446
step 380 , total_loss: 0.5519, data_loss: 0.5519
step 400 , total_loss: 0.5372, data_loss: 0.5372
step 420 , total_loss: 0

step 540 , total_loss: 0.4980, data_loss: 0.4980
step 560 , total_loss: 0.4930, data_loss: 0.4930
step 580 , total_loss: 0.4901, data_loss: 0.4901
at epoch 5 train info: auc:0.5096, exp_var:0.00443500280380249, logloss:0.6844, mae:0.4914151, rmse:0.49578223, rsquare:0.0044314562769761645 eval info: auc:0.5072, exp_var:0.0018025636672973633, logloss:0.6909, mae:0.49314696, rmse:0.4974937, rsquare:0.0013879297202703533
at epoch 5 , train time: 13.6 eval time: 15.8
step 20 , total_loss: 0.4961, data_loss: 0.4961
step 40 , total_loss: 0.4960, data_loss: 0.4960
step 60 , total_loss: 0.4856, data_loss: 0.4856
step 80 , total_loss: 0.4920, data_loss: 0.4920
step 100 , total_loss: 0.5009, data_loss: 0.5009
step 120 , total_loss: 0.4965, data_loss: 0.4965
step 140 , total_loss: 0.5020, data_loss: 0.5020
step 160 , total_loss: 0.4948, data_loss: 0.4948
step 180 , total_loss: 0.5030, data_loss: 0.5030
step 200 , total_loss: 0.5013, data_loss: 0.5013
step 220 , total_loss: 0.4986, data_loss: 0.498

### Evaluate model on test data

In [None]:
res_syn = model.run_eval(test_file)
print(res_syn)
#pm.record("res_syn", res_syn)

### Evaluation of top-N recommendation

In [None]:
res_syn = model.run_eval_topN(test_file,hparams)

## Reference
\[1\] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018.<br>
\[2\] The Criteo datasets: http://labs.criteo.com/category/dataset/.

In [None]:
data_path = '../../tests/resources/deeprec/movielens'
yaml_file = os.path.join(data_path, r'network_xdeepFM.yaml')
#train_file = os.path.join(data_path, r'ua.base.classification.final')
#train_file = os.path.join(data_path, r'ua.base.regression.final')
#valid_file = os.path.join(data_path, r'ua.test.regression.final')
#test_file = os.path.join(data_path, r'ua.test.regression.final')

####the following files are for classification
train_file = os.path.join(data_path, r'ua.base.classification.final')
valid_file = os.path.join(data_path, r'ua.test.classification.final')
#test_file = os.path.join(data_path, r'ua.test.classification.final')
test_file = os.path.join(data_path, r'ua.test.classification_topN.final')

#test_file = os.path.join(data_path, r'ua.test.classification_topN.final')
output_file = os.path.join(data_path, r'output.txt')

#if not os.path.exists(yaml_file):
#    download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')
