<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# xDeepFM : the eXtreme Deep Factorization Machine 
This notebook will give you a quick example of how to train an xDeepFM model. 
xDeepFM \[1\] is a deep learning-based model that aims to capture both lower- and higher-order feature interactions for precise recommender systems. It can therefore learn feature interactions more effectively and substantially reduce manual feature engineering effort. To summarize, xDeepFM has the following key properties:
* It contains a component, named CIN, that learns feature interactions explicitly and at the vector-wise level.
* It contains a traditional DNN component that learns feature interactions implicitly and at the bit-wise level.
* The implementation makes this model quite configurable. We can enable different subsets of components by setting hyperparameters such as `use_Linear_part`, `use_FM_part`, `use_CIN_part`, and `use_DNN_part`. For example, by enabling only `use_Linear_part` and `use_FM_part`, we get a classical FM model.
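
To make the "vector-wise" wording concrete, here is a minimal NumPy sketch of a single CIN layer following the formulation in \[1\]. It is an illustration only, not the code used by `XDeepFMModel`; the function name `cin_layer` and the toy shapes are assumptions made for this example.

```python
import numpy as np


def cin_layer(x0, xk, w):
    """One CIN layer (illustrative sketch).

    x0: (m, D) embeddings of the m input fields.
    xk: (h_prev, D) feature maps of the previous CIN layer.
    w:  (h_next, h_prev, m) layer weights.
    Returns the next feature maps with shape (h_next, D).
    """
    # Hadamard product of every (previous map, input field) pair,
    # computed along the embedding dimension D -> shape (h_prev, m, D).
    z = xk[:, None, :] * x0[None, :, :]
    # Each output map is a weighted sum over all pairs; D is left untouched.
    return np.einsum("hij,ijd->hd", w, z)


rng = np.random.default_rng(0)
m, D = 3, 4                       # 3 fields, embedding size 4 (toy values)
x0 = rng.normal(size=(m, D))
w1 = rng.normal(size=(5, m, m))   # 5 feature maps in the first layer
x1 = cin_layer(x0, x0, w1)        # the first layer interacts x0 with itself
print(x1.shape)  # (5, 4)
```

Each output row is a weighted sum of Hadamard products of whole embedding vectors, so the weights never mix individual coordinates of an embedding. This is the sense in which CIN interactions are vector-wise, whereas the fully connected layers of a DNN mix individual coordinates (bit-wise).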


## Global Settings and Imports

In [5]:
import sys
import os
sys.path.append("../../")
import papermill as pm
import tensorflow as tf
import pandas as pd

from reco_utils.recommender.deeprec.deeprec_utils import *
from reco_utils.recommender.deeprec.models.xDeepFM import *
from reco_utils.recommender.deeprec.IO.iterator import *
from reco_utils.dataset.pandas_df_utils import LibffmConverter
from reco_utils.dataset import movielens
from reco_utils.common.constants import (
    DEFAULT_USER_COL as USER_COL,
    DEFAULT_ITEM_COL as ITEM_COL,
    DEFAULT_RATING_COL as RATING_COL,
    DEFAULT_PREDICTION_COL as PREDICT_COL,
)
from reco_utils.common.constants import SEED
from reco_utils.dataset.python_splitters import python_random_split
print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.5.5 |Anaconda custom (64-bit)| (default, May 13 2018, 21:12:35) 
[GCC 7.2.0]
Tensorflow version: 1.10.1


### 1. Prepare data

xDeepFM uses the FFM format as data input: `<label> <field_id>:<feature_id>:<feature_value>`

    
1.1 Load movie rating and genre data
First, download the MovieLens data. Each movie in the dataset is tagged with one or more genres, out of 19 genres in total, including 'unknown'. We load the movie genres to use them as item features.

In [2]:
MOVIELENS_DATA_SIZE='100k'
RANDOM_SEED = SEED  # Set seed for deterministic result
# The genres of each movie are returned as '|' separated string, e.g. "Animation|Children's|Comedy".
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=[USER_COL, ITEM_COL, RATING_COL],
    genres_col='Genres_string'  # load genres as a temporary column 'Genres_string'
)
display(data.head())
data['userID'] = data.userID.astype(str)
data['itemID'] = data.itemID.astype(str)

100%|██████████| 4.81k/4.81k [00:00<00:00, 18.6kKB/s]


Unnamed: 0,userID,itemID,rating,Genres_string
0,196,242,3.0,Comedy
1,63,242,3.0,Comedy
2,226,242,5.0,Comedy
3,154,242,3.0,Comedy
4,306,242,5.0,Comedy


In [3]:
import sklearn.preprocessing
genres_encoder = sklearn.preprocessing.MultiLabelBinarizer()
data['genres'] = genres_encoder.fit_transform(
        data['Genres_string'].apply(lambda s: s.split("|"))
    ).tolist()
print("Genres:", genres_encoder.classes_)
display(data.drop_duplicates(ITEM_COL)[[ITEM_COL, 'Genres_string', 'genres']].head())

Genres: ['Action' 'Adventure' 'Animation' "Children's" 'Comedy' 'Crime'
 'Documentary' 'Drama' 'Fantasy' 'Film-Noir' 'Horror' 'Musical' 'Mystery'
 'Romance' 'Sci-Fi' 'Thriller' 'War' 'Western' 'unknown']


Unnamed: 0,itemID,Genres_string,genres
0,242,Comedy,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
117,302,Crime|Film-Noir|Mystery|Thriller,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, ..."
414,377,Children's|Comedy,"[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
427,51,Drama|Romance|War|Western,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, ..."
508,346,Crime|Drama,"[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ..."


In [6]:
# Spread the 19-element multi-hot genre vector over columns genre1 ... genre19.
genre_cols = ['genre{}'.format(i) for i in range(1, 20)]
data[genre_cols] = pd.DataFrame(data.genres.values.tolist(), index=data.index)

In [7]:
data = data.drop(["genres", "Genres_string"], axis=1)

1.2 Convert to libffm format. The output features are `<field_index>:<field_feature_index>:1` or `<field_index>:<field_feature_index>:<field_feature_value>`, depending on the data type of the features in the original dataframe.
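
The following sketch illustrates the two output forms on a toy dataframe. It is not the `LibffmConverter` implementation; the helper name `to_libffm_sketch`, the toy data, and the indexing scheme (global feature indices assigned in order of first appearance) are assumptions chosen for this example.

```python
import pandas as pd


def to_libffm_sketch(df, label_col):
    """Encode every non-label column as '<field>:<feature>:<value>' strings."""
    out = pd.DataFrame({label_col: df[label_col]}, index=df.index)
    next_feature = 1  # feature indices are shared across fields and start from 1
    fields = [c for c in df.columns if c != label_col]
    for field_idx, col in enumerate(fields, start=1):
        if not pd.api.types.is_numeric_dtype(df[col]):
            # Categorical (string) column: one feature index per distinct value,
            # and the feature value is always 1.
            codes = {v: next_feature + i for i, v in enumerate(pd.unique(df[col]))}
            out[col] = ["{}:{}:1".format(field_idx, codes[v]) for v in df[col]]
            next_feature += len(codes)
        else:
            # Numeric column: a single feature index for the whole field,
            # and the original number is kept as the feature value.
            out[col] = ["{}:{}:{}".format(field_idx, next_feature, v) for v in df[col]]
            next_feature += 1
    return out


toy = pd.DataFrame({
    "rating": [3, 5, 4],
    "userID": ["196", "63", "196"],   # strings -> categorical
    "itemID": ["242", "242", "302"],  # strings -> categorical
    "genre1": [0, 1, 0],              # integers -> numeric
})
print(to_libffm_sketch(toy, "rating"))
```

This also shows why the user and item IDs are cast to strings beforehand: left as integers, they would be treated as numeric values rather than as categories.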

In [8]:
converter = LibffmConverter().fit(data, col_rating=RATING_COL)
data_transformed = converter.transform(data)

In [9]:
data_transformed['rating'] = data_transformed.rating.astype(int)
data_transformed.head()

Unnamed: 0,rating,userID,itemID,genre1,genre2,genre3,genre4,genre5,genre6,genre7,...,genre10,genre11,genre12,genre13,genre14,genre15,genre16,genre17,genre18,genre19
0,3,1:1:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
1,3,1:2:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
2,5,1:3:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
3,3,1:4:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
4,5,1:5:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0


### Parameters

## Download data
xDeepFM uses the FFM format as data input: `<label> <field_id>:<feature_id>:<feature_value>`  
Each line represents an instance. `<label>` is a binary value, where 1 marks a positive instance and 0 a negative instance. 
Features are divided into fields. For example, a user's gender is a field with three possible values: male, female, and unknown. Occupation can be another field, with many more possible values than the gender field. Both field indices and feature indices start from 1. <br>
Now let's start with the MovieLens dataset.
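
As a quick illustration of the format, the sketch below splits one FFM line into its label and its `(field, feature, value)` triples. The helper `parse_ffm_line` is hypothetical and written only for this example; it is not part of `reco_utils`.

```python
def parse_ffm_line(line):
    """Split one FFM-formatted line into a label and (field, feature, value) triples."""
    label, *tokens = line.split()
    features = []
    for token in tokens:
        field_id, feature_id, value = token.split(":")
        features.append((int(field_id), int(feature_id), float(value)))
    return int(label), features


label, features = parse_ffm_line("1 1:3:1 2:944:1 3:2626:0")
print(label)     # 1
print(features)  # [(1, 3, 1.0), (2, 944, 1.0), (3, 2626, 0.0)]
```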

In [10]:
data_path = '../../tests/resources/deeprec/movielens'
yaml_file = os.path.join(data_path, r'network_xdeepFM.yaml')
output_file = os.path.join(data_path, r'output.txt')

In [11]:
# Binarize the ratings: ratings above 3 become positive (1), the rest negative (0).
# The order matters: zeroing the low ratings first leaves only 4 and 5 above the threshold.
data_transformed.loc[data_transformed['rating'] <= 3, 'rating'] = 0
data_transformed.loc[data_transformed['rating'] > 3, 'rating'] = 1

In [12]:
data_transformed.head()

Unnamed: 0,rating,userID,itemID,genre1,genre2,genre3,genre4,genre5,genre6,genre7,...,genre10,genre11,genre12,genre13,genre14,genre15,genre16,genre17,genre18,genre19
0,0,1:1:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
1,0,1:2:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
2,1,1:3:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
3,0,1:4:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0
4,1,1:5:1,2:944:1,3:2626:0,4:2627:0,5:2628:0,6:2629:0,7:2630:1,8:2631:0,9:2632:0,...,12:2635:0,13:2636:0,14:2637:0,15:2638:0,16:2639:0,17:2640:0,18:2641:0,19:2642:0,20:2643:0,21:2644:0


In [13]:
train, test = python_random_split(
        data_transformed,
        ratio=0.75,
        seed=RANDOM_SEED,
    )

## Create hyper-parameters
`prepare_hparams()` creates a full set of hyper-parameters for model training, such as the learning rate, feature number, and dropout ratio. We can put those parameters in a YAML file or pass them as keyword arguments to the function, which override the YAML settings.

In [14]:
hparams = prepare_hparams(yaml_file)
print(hparams)

[('DNN_FIELD_NUM', None), ('FEATURE_COUNT', 213), ('FIELD_COUNT', 5), ('MODEL_DIR', None), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('activation', ['relu', 'relu']), ('attention_activation', None), ('attention_dropout', 0.0), ('attention_layer_sizes', None), ('batch_size', 128), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0), ('cross_layer_sizes', [100, 100, 50]), ('cross_layers', None), ('data_format', 'ffm'), ('dim', 10), ('doc_size', None), ('dropout', [0.0, 0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0), ('enable_BN', False), ('entityEmb_file', None), ('entity_dim', None), ('entity_embedding_method', None), ('entity_size', None), ('epochs', 10), ('fast_CIN_d', 0), ('filter_sizes', None), ('init_method', 'tnormal'), ('init_value', 0.01), ('is_clip_norm', 0), ('iterator_type', None), ('kg_file', None), ('kg_training_interval', 5), ('layer_l1', 0.0), ('layer_l2', 0.0), ('layer_sizes', [400, 400]), ('learning_rate', 0.001), ('load_model_name', No
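
The printed settings list `FIELD_COUNT` as 5 and `FEATURE_COUNT` as 213, while the transformed rows displayed earlier contain field indices up to 21 and feature indices up to 2644. These values most likely need to agree with the data. The sketch below shows one way to read both counts from a libffm-style dataframe; the helper `count_fields_and_features` and the sample frame are assumptions made for illustration.

```python
import pandas as pd


def count_fields_and_features(df, label_col):
    """Return the largest field index and feature index in libffm-style string columns."""
    max_field, max_feature = 0, 0
    for col in df.columns:
        if col == label_col:
            continue
        # "field:feature:value" -> three columns named 0, 1 and 2
        parts = df[col].str.split(":", expand=True)
        max_field = max(max_field, int(parts[0].astype(int).max()))
        max_feature = max(max_feature, int(parts[1].astype(int).max()))
    return max_field, max_feature


sample = pd.DataFrame({
    "rating": [0, 1],
    "userID": ["1:1:1", "1:2:1"],
    "itemID": ["2:944:1", "2:945:1"],
    "genre1": ["3:2626:0", "3:2626:1"],
})
field_count, feature_count = count_fields_and_features(sample, "rating")
print(field_count, feature_count)  # 3 2626

# Possible follow-up, relying on the keyword overrides described above:
# hparams = prepare_hparams(yaml_file, FIELD_COUNT=field_count, FEATURE_COUNT=feature_count)
```

Since indices start from 1, the largest index equals the count only when the indices are contiguous, which holds for the sample above.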

## Create data loader
Designate a data iterator for the model. xDeepFM uses FFMTextIterator. 

In [15]:
input_creator = FFMTextIterator

## Create model
When both hyper-parameters and data iterator are ready, we can create a model:

In [16]:
model = XDeepFMModel(hparams, input_creator)

## sometimes we don't want to train a model from scratch
## then we can load a pre-trained model like this: 
#model.load_model(r'your_model_path')

Add CIN part.


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Now let's see what the model's performance is at this point (without starting training):

### Train and fit model on validation data

In [17]:
train.to_csv('train.csv', sep=' ',header=False, index=False)
#train.columns = [''] * len(train.columns)

In [None]:
train.head()

In [18]:
# Move all 19 genre columns into field 3 and renumber their features 1..19,
# e.g. '4:2627:0' becomes '3:2:0'.
for i in range(1, 20):
    col = 'genre{}'.format(i)
    old_prefix = '{}:{}'.format(i + 2, 2625 + i)
    new_prefix = '3:{}'.format(i)
    train[col] = train[col].str.replace(old_prefix, new_prefix)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

In [21]:
train_file = 'train.csv'
train.to_csv(train_file, sep=' ', header=False, index=False)

In [19]:
train.head()

Unnamed: 0,rating,userID,itemID,genre1,genre2,genre3,genre4,genre5,genre6,genre7,...,genre10,genre11,genre12,genre13,genre14,genre15,genre16,genre17,genre18,genre19
98980,0,1:163:1,2:2330:1,3:1:0,3:2:0,3:3:0,3:4:0,3:5:1,3:6:0,3:7:0,...,3:10:0,3:11:0,3:12:0,3:13:0,3:14:0,3:15:0,3:16:0,3:17:0,3:18:0,3:19:0
69824,1,1:353:1,2:1499:1,3:1:0,3:2:0,3:3:0,3:4:0,3:5:1,3:6:0,3:7:0,...,3:10:0,3:11:0,3:12:0,3:13:0,3:14:0,3:15:0,3:16:0,3:17:0,3:18:0,3:19:0
9928,1,1:159:1,2:1004:1,3:1:0,3:2:0,3:3:0,3:4:0,3:5:0,3:6:0,3:7:0,...,3:10:0,3:11:1,3:12:0,3:13:0,3:14:0,3:15:0,3:16:1,3:17:0,3:18:0,3:19:0
75599,0,1:475:1,2:1588:1,3:1:0,3:2:0,3:3:0,3:4:0,3:5:1,3:6:0,3:7:0,...,3:10:0,3:11:0,3:12:0,3:13:0,3:14:0,3:15:0,3:16:0,3:17:0,3:18:0,3:19:0
95621,0,1:501:1,2:2053:1,3:1:0,3:2:0,3:3:0,3:4:1,3:5:1,3:6:0,3:7:0,...,3:10:0,3:11:0,3:12:0,3:13:0,3:14:0,3:15:0,3:16:0,3:17:0,3:18:0,3:19:0


In [22]:
model.fit(train_file, train_file)

InvalidArgumentError: Input to reshape is a tensor with 6380 values, but the requested shape requires a multiple of 50
	 [[Node: XDeepFM/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](XDeepFM/embedding/embedding_lookup_sparse, XDeepFM/Reshape/shape)]]
	 [[Node: Sqrt/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_333_Sqrt", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'XDeepFM/Reshape', defined at:
  File "<ipython-input-16-7988be51621b>", line 1, in <module>
    model = XDeepFMModel(hparams, input_creator)
  File "../../reco_utils/recommender/deeprec/models/base_model.py", line 46, in __init__
    self.logit = self._build_graph()
  File "../../reco_utils/recommender/deeprec/models/xDeepFM.py", line 46, in _build_graph
    embed_out, res=True, direct=False, bias=False, is_masked=True
  File "../../reco_utils/recommender/deeprec/models/xDeepFM.py", line 164, in _build_CIN
    nn_input = tf.reshape(nn_input, shape=[-1, int(field_num), hparams.dim])
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6199, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 6380 values, but the requested shape requires a multiple of 50
	 [[Node: XDeepFM/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](XDeepFM/embedding/embedding_lookup_sparse, XDeepFM/Reshape/shape)]]
	 [[Node: Sqrt/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_333_Sqrt", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
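
The numbers in the error message can be checked by hand. The failing reshape in `_build_CIN` targets the shape `[-1, field_num, hparams.dim]`, and the hyper-parameters printed earlier list `FIELD_COUNT` as 5 and `dim` as 10. The short check below only restates that arithmetic. Reading it as a sign that `FIELD_COUNT` does not agree with the fields actually present in `train.csv` is an interpretation (it assumes `field_num` is taken from `FIELD_COUNT`), not something the traceback states directly.

```python
field_count, dim = 5, 10          # FIELD_COUNT and dim as printed by prepare_hparams
tensor_values = 6380              # tensor size reported in the error message

row_width = field_count * dim     # the reshape needs a multiple of this many values
print(row_width)                  # 50
print(tensor_values % row_width)  # 30, so the reshape cannot succeed
print(tensor_values // dim)       # 638 embedding vectors, not a multiple of 5
```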


### Evaluate model on test data

In [None]:
res_syn = model.run_eval(test)
print(res_syn)
pm.record("res_syn", res_syn)

### Evaluation of  top N recommendation

In [None]:
res_syn = model.run_eval_topN(test_file,hparams)

## Reference
\[1\] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining, KDD 2018, London, UK, August 19-23, 2018.<br>
\[2\] The Criteo datasets: http://labs.criteo.com/category/dataset/. 

In [None]:
data_path = '../../tests/resources/deeprec/movielens'
yaml_file = os.path.join(data_path, r'network_xdeepFM.yaml')
#train_file = os.path.join(data_path, r'ua.base.classification.final')
#train_file = os.path.join(data_path, r'ua.base.regression.final')
#valid_file = os.path.join(data_path, r'ua.test.regression.final')
#test_file = os.path.join(data_path, r'ua.test.regression.final')

####the following files are for classification
train_file = os.path.join(data_path, r'ua.base.classification.final')
valid_file = os.path.join(data_path, r'ua.test.classification.final')
#test_file = os.path.join(data_path, r'ua.test.classification.final')
test_file = os.path.join(data_path, r'ua.test.classification_topN.final')

#test_file = os.path.join(data_path, r'ua.test.classification_topN.final')
output_file = os.path.join(data_path, r'output.txt')

#if not os.path.exists(yaml_file):
#    download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')


In [None]:
data['Genres_string']=data['Genres_string'].apply(lambda s: s.split("|"))
##unpivot but it leads to problem for libffm converter
#import pandas as pd
s=data.apply(lambda x: pd.Series(x['Genres_string']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'Genres_string'
data_trans=data.drop('Genres_string', axis=1).join(s)