## Introduction

[Field-aware Factorization Machines](https://www.csie.ntu.edu.tw/~cjlin/libffm) (FFM) are a powerful representation learning technique for sparse categorical data.

The libffm source code is [on GitHub](https://github.com/ycjuan/libffm).

This notebook demonstrates one way to use the libffm binaries in a Kaggle kernel.

Release notes:
 - V4: new version with out-of-fold (OOF) predictions
 - V6: fixed the encoder; the previous version acted as a kind of regularizer :)
 

In [1]:
import numpy as np
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/libffm-binaries/ffm-train
/kaggle/input/libffm-binaries/ffm-predict
/kaggle/input/cat-in-the-dat-ii/sample_submission.csv
/kaggle/input/cat-in-the-dat-ii/test.csv
/kaggle/input/cat-in-the-dat-ii/train.csv


## Read the data

In [2]:
train = pd.read_csv('/kaggle/input/cat-in-the-dat-ii/train.csv')
test = pd.read_csv('/kaggle/input/cat-in-the-dat-ii/test.csv')
test.insert(1, 'target', 0)

## Label Encode to ease creation of libffm format

In [3]:
features = [_f for _f in train if _f not in ['id', 'target']]

def factor_encoding(train, test):
    
    assert sorted(train.columns) == sorted(test.columns)
    
    full = pd.concat([train, test], axis=0, sort=False)
    # Factorize every column: each distinct value becomes an integer code
    for f in full:
        full[f], _ = pd.factorize(full[f])
        full[f] += 1  # factorize codes missing values as -1; shift so no code is negative
        
    return full.iloc[:train.shape[0]], full.iloc[train.shape[0]:]

train_f, test_f = factor_encoding(train[features], test[features])
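
To make the `+ 1` shift concrete, here is a small sketch on a made-up column (the values are illustrative, not taken from the competition data). `pd.factorize` assigns codes in order of first appearance and marks missing values with -1, so after the shift missing values become 0 and real categories start at 1.

```python
import pandas as pd

# Toy column with a repeated category and a missing value (illustrative only).
col = pd.Series(["Nike", "ESPN", None, "Nike"])

codes, uniques = pd.factorize(col)
shifted = codes + 1  # same shift as in factor_encoding

print(list(codes))    # the missing value gets the sentinel code -1
print(list(uniques))  # categories in order of first appearance
print(list(shifted))  # after the shift, no code is negative
```

Because train and test are concatenated before factorizing, the same category gets the same code in both sets.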

## Create LibFFM files


LIBFFM expects its input in a specific text format (taken from the [libffm page](https://github.com/ycjuan/libffm)):
```
<label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...
.
.
.
```

`field` and `feature` should be non-negative integers.

It is important to understand the difference between `field` and `feature`. For example, if we have raw data like this:

| Click | Advertiser | Publisher |
|:-----:|:----------:|:---------:|
|    0 |       Nike |       CNN |
|    1 |       ESPN |       BBC |

Here, we have 
 
 - 2 fields: Advertiser and Publisher
 - 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC

Usually you will need to build two dictionaries, one for fields and one for features, like this:
    
    DictField[Advertiser] -> 0
    DictField[Publisher]  -> 1
    
    DictFeature[Advertiser-Nike] -> 0
    DictFeature[Publisher-CNN]   -> 1
    DictFeature[Advertiser-ESPN] -> 2
    DictFeature[Publisher-BBC]   -> 3

Then, you can generate FFM format data:

    0 0:0:1 1:1:1
    1 0:2:1 1:3:1

Note that because these features are categorical, the values here are all ones.
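
The mapping described above can be sketched in a few lines of plain Python. This is only an illustration of the format on the two example rows; the names `dict_field`, `dict_feature` and `to_ffm_line` are chosen here for the sketch and are not part of libffm.

```python
# Raw rows from the table above: (label, {field name: category}).
rows = [
    (0, {"Advertiser": "Nike", "Publisher": "CNN"}),
    (1, {"Advertiser": "ESPN", "Publisher": "BBC"}),
]

dict_field = {}    # field name -> field index
dict_feature = {}  # "Field-Value" -> feature index


def to_ffm_line(label, record):
    """Format one row as '<label> <field>:<feature>:<value> ...'."""
    parts = [str(label)]
    for field_name, category in record.items():
        # setdefault assigns the next free index the first time a key is seen.
        field_id = dict_field.setdefault(field_name, len(dict_field))
        feature_id = dict_feature.setdefault(
            f"{field_name}-{category}", len(dict_feature)
        )
        parts.append(f"{field_id}:{feature_id}:1")  # categorical, so value is 1
    return " ".join(parts)


ffm_lines = [to_ffm_line(label, record) for label, record in rows]
print("\n".join(ffm_lines))
```

Indices are handed out in order of first appearance, which reproduces the two dictionaries and the two FFM lines listed above.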

The class defined below goes through all rows and features and updates a Python dict as new values are encountered.

In [4]:
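class LibFFMEncoder(object):
    """Maps (field position, value) pairs to libffm feature ids on the fly."""

    def __init__(self):
        self.encoder = 1     # next free feature id
        self.encoding = {}   # (field position, value) -> feature id

    def encode_for_libffm(self, row):
        # row[0] is the target; the remaining entries are the factorized features
        txt = f"{row[0]}"
        for i, r in enumerate(row[1:]):
            try:
                txt += f' {i+1}:{self.encoding[(i, r)]}:1'
            except KeyError:
                # First time this value is seen in this field: assign a new id
                self.encoding[(i, r)] = self.encoder
                self.encoder += 1
                txt += f' {i+1}:{self.encoding[(i, r)]}:1'

        return txt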

# Create files for testing and OOF
from sklearn.model_selection import KFold
fold_ids = [
    [trn_, val_] for (trn_, val_) in KFold(n_splits=5, shuffle=True, random_state=1).split(train)
]
for fold_, (trn_, val_) in enumerate(fold_ids):
    # Fit the encoder
    encoder = LibFFMEncoder()
    libffm_format_trn = pd.concat([train['target'].iloc[trn_], train_f.iloc[trn_]], axis=1).apply(
        lambda row: encoder.encode_for_libffm(row), raw=True, axis=1
    )
    # Encode validation set
    libffm_format_val = pd.concat([train['target'].iloc[val_], train_f.iloc[val_]], axis=1).apply(
        lambda row: encoder.encode_for_libffm(row), raw=True, axis=1
    )
    
    print(train['target'].iloc[trn_].shape, train['target'].iloc[val_].shape, libffm_format_val.shape)
    
    libffm_format_trn.to_csv(f'libffm_trn_fold_{fold_+1}.txt', index=False, header=False)
    libffm_format_val.to_csv(f'libffm_val_fold_{fold_+1}.txt', index=False, header=False)
    
    
# Create files for final model
encoder = LibFFMEncoder()
libffm_format_trn = pd.concat([train['target'], train_f], axis=1).apply(
        lambda row: encoder.encode_for_libffm(row), raw=True, axis=1
)
libffm_format_tst = pd.concat([test['target'], test_f], axis=1).apply(
    lambda row: encoder.encode_for_libffm(row), raw=True, axis=1
)

libffm_format_trn.to_csv('libffm_trn.txt', index=False, header=False)
libffm_format_tst.to_csv('libffm_tst.txt', index=False, header=False)

(480000,) (120000,) (120000,)
(480000,) (120000,) (120000,)
(480000,) (120000,) (120000,)
(480000,) (120000,) (120000,)
(480000,) (120000,) (120000,)
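
One property of this encoder is worth spelling out: it keeps assigning new ids while encoding validation (or test) rows, so a category that never appeared in the training rows still gets an id, just one the model never saw during training. The sketch below shows this on made-up rows. `ToyEncoder` is a hypothetical, compact equivalent of `LibFFMEncoder` written only for this illustration.

```python
class ToyEncoder:
    """Compact equivalent of LibFFMEncoder, for illustration only."""

    def __init__(self):
        self.encoding = {}  # (field position, value) -> feature id

    def encode(self, row):
        label, values = row[0], row[1:]
        parts = [str(label)]
        for i, value in enumerate(values):
            # Ids start at 1 and are shared across all fields.
            feature_id = self.encoding.setdefault(
                (i, value), len(self.encoding) + 1
            )
            parts.append(f"{i + 1}:{feature_id}:1")
        return " ".join(parts)


enc = ToyEncoder()
train_rows = [(0, 1, 1), (1, 2, 1)]  # (target, field-1 code, field-2 code)
val_rows = [(0, 3, 1)]               # field-1 code 3 never occurs in train_rows

train_lines = [enc.encode(r) for r in train_rows]
n_train_ids = len(enc.encoding)      # ids known after the training rows
val_lines = [enc.encode(r) for r in val_rows]

print(train_lines)
print(val_lines)
print(n_train_ids, len(enc.encoding))
```

Values shared between training and validation rows keep the same id, which is what makes the validation file compatible with the trained model.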


## Make ffm-train and ffm-predict executable

In [5]:
!cp /kaggle/input/libffm-binaries/ffm-train .
!cp /kaggle/input/libffm-binaries/ffm-predict .
!chmod u+x ffm-train
!chmod u+x ffm-predict

## Run OOF

In [6]:
from sklearn.metrics import log_loss, roc_auc_score

!./ffm-train -p libffm_val_fold_1.txt -r 0.05 -l 0.00001 -k 50 -t 7 libffm_trn_fold_1.txt libffm_fold_1_model
!./ffm-predict libffm_val_fold_1.txt libffm_fold_1_model val_preds_fold_1.txt
(
    log_loss(train['target'].iloc[fold_ids[0][1]], pd.read_csv('val_preds_fold_1.txt', header=None).values[:,0]),
    roc_auc_score(train['target'].iloc[fold_ids[0][1]], pd.read_csv('val_preds_fold_1.txt', header=None).values[:,0])
)

First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (3.5 seconds)
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.9 seconds)
iter   tr_logloss   va_logloss      tr_time
   1      0.42856      0.40905         15.2
   2      0.40279      0.40268         29.6
   3      0.39916      0.40121         44.5
   4      0.39644      0.39983         59.3
   5      0.39326      0.39833         73.6
   6      0.39011      0.39752         88.3
   7      0.38750      0.39748        103.3
logloss = 0.39748


(0.3974836592468221, 0.786498277780076)

In [7]:
!./ffm-train -p libffm_val_fold_2.txt -r 0.05 -l 0.00001 -k 50 -t 7 libffm_trn_fold_2.txt libffm_fold_2_model
!./ffm-predict libffm_val_fold_2.txt libffm_fold_2_model val_preds_fold_2.txt
(
    log_loss(train['target'].iloc[fold_ids[1][1]], pd.read_csv('val_preds_fold_2.txt', header=None).values[:,0]),
    roc_auc_score(train['target'].iloc[fold_ids[1][1]], pd.read_csv('val_preds_fold_2.txt', header=None).values[:,0])
)

First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (3.5 seconds)
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.9 seconds)
iter   tr_logloss   va_logloss      tr_time
   1      0.42893      0.40702         14.5
   2      0.40329      0.40086         29.1
   3      0.39970      0.39950         44.4
   4      0.39708      0.39783         59.2
   5      0.39391      0.39636         73.6
   6      0.39079      0.39530         88.1
   7      0.38817      0.39508        102.9
logloss = 0.39508


(0.39507581966645594, 0.7885609295518949)

In [8]:
!./ffm-train -p libffm_val_fold_3.txt -r 0.05 -l 0.00001 -k 50 -t 7 libffm_trn_fold_3.txt libffm_fold_3_model
!./ffm-predict libffm_val_fold_3.txt libffm_fold_3_model val_preds_fold_3.txt
(
    log_loss(train['target'].iloc[fold_ids[2][1]], pd.read_csv('val_preds_fold_3.txt', header=None).values[:,0]),
    roc_auc_score(train['target'].iloc[fold_ids[2][1]], pd.read_csv('val_preds_fold_3.txt', header=None).values[:,0])
)

First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (3.5 seconds)
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.8 seconds)
iter   tr_logloss   va_logloss      tr_time
   1      0.42856      0.40824         14.3
   2      0.40298      0.40184         28.6
   3      0.39945      0.40053         43.4
   4      0.39680      0.39908         57.8
   5      0.39363      0.39756         72.2
   6      0.39043      0.39670         86.6
   7      0.38779      0.39657        101.4
logloss = 0.39657


(0.3965666088899384, 0.7859854083982429)

In [9]:
!./ffm-train -p libffm_val_fold_4.txt -r 0.05 -l 0.00001 -k 50 -t 7 libffm_trn_fold_4.txt libffm_fold_4_model
!./ffm-predict libffm_val_fold_4.txt libffm_fold_4_model val_preds_fold_4.txt
(
    log_loss(train['target'].iloc[fold_ids[3][1]], pd.read_csv('val_preds_fold_4.txt', header=None).values[:,0]),
    roc_auc_score(train['target'].iloc[fold_ids[3][1]], pd.read_csv('val_preds_fold_4.txt', header=None).values[:,0])
)

First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (3.4 seconds)
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.8 seconds)
iter   tr_logloss   va_logloss      tr_time
   1      0.42814      0.41042         14.5
   2      0.40242      0.40413         29.0
   3      0.39874      0.40280         43.9
   4      0.39600      0.40136         58.3
   5      0.39274      0.39989         72.4
   6      0.38957      0.39934         86.9
   7      0.38701      0.39936        101.6
logloss = 0.39936


(0.39936217602452234, 0.7862040041925855)

In [10]:
!./ffm-train -p libffm_val_fold_5.txt -r 0.05 -l 0.00001 -k 50 -t 7 libffm_trn_fold_5.txt libffm_fold_5_model
!./ffm-predict libffm_val_fold_5.txt libffm_fold_5_model val_preds_fold_5.txt
(
    log_loss(train['target'].iloc[fold_ids[4][1]], pd.read_csv('val_preds_fold_5.txt', header=None).values[:,0]),
    roc_auc_score(train['target'].iloc[fold_ids[4][1]], pd.read_csv('val_preds_fold_5.txt', header=None).values[:,0])
)

First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (3.5 seconds)
First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.8 seconds)
iter   tr_logloss   va_logloss      tr_time
   1      0.42891      0.40734         15.1
   2      0.40322      0.40085         30.8
   3      0.39959      0.39974         45.6
   4      0.39691      0.39812         59.9
   5      0.39369      0.39662         74.0
   6      0.39054      0.39579         88.3
   7      0.38796      0.39573        102.7
logloss = 0.39573


(0.39573211651073126, 0.788560796057336)
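
The five cells above differ only in the fold number, so they could be collapsed into a loop. The following is a minimal sketch rather than the notebook's actual code: the helper names `train_cmd`, `predict_cmd` and `run_fold` are hypothetical, the flags and file names are copied from the cells above, and it assumes the `ffm-train` and `ffm-predict` binaries sit in the working directory.

```python
import subprocess

# Hyperparameter flags copied verbatim from the cells above.
FFM_FLAGS = ["-r", "0.05", "-l", "0.00001", "-k", "50", "-t", "7"]


def train_cmd(fold):
    """Build the ffm-train command for one fold (hypothetical helper)."""
    return [
        "./ffm-train",
        "-p", f"libffm_val_fold_{fold}.txt",
        *FFM_FLAGS,
        f"libffm_trn_fold_{fold}.txt",
        f"libffm_fold_{fold}_model",
    ]


def predict_cmd(fold):
    """Build the ffm-predict command for one fold (hypothetical helper)."""
    return [
        "./ffm-predict",
        f"libffm_val_fold_{fold}.txt",
        f"libffm_fold_{fold}_model",
        f"val_preds_fold_{fold}.txt",
    ]


def run_fold(fold):
    """Train and predict for one fold; raises if either binary fails."""
    subprocess.run(train_cmd(fold), check=True)
    subprocess.run(predict_cmd(fold), check=True)


# Usage inside the notebook (requires the binaries and the fold files):
# for fold in range(1, 6):
#     run_fold(fold)
```

Passing the command as a list avoids shell quoting issues, and `check=True` stops the loop as soon as one fold fails instead of silently continuing.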

## Compute OOF score

In [11]:
oof_preds = np.zeros(train.shape[0])
for fold_, (_, val_) in enumerate(fold_ids):
    oof_preds[val_] = pd.read_csv(f'val_preds_fold_{fold_+1}.txt', header=None).values[:, 0]
oof_score = roc_auc_score(train['target'], oof_preds)
print(oof_score)

0.7871313254064711
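
A single AUC over `oof_preds` only makes sense if every training row received exactly one out-of-fold prediction. The sketch below checks that property on synthetic indices. It uses a shuffled `np.array_split` as a stand-in for the `KFold` split above and fake predictions, so it illustrates the bookkeeping rather than the notebook's real data.

```python
import numpy as np

n_rows, n_folds = 20, 5
rng = np.random.default_rng(1)

# Stand-in for a shuffled K-fold split: shuffle row indices, then cut into folds.
val_folds = np.array_split(rng.permutation(n_rows), n_folds)

oof = np.full(n_rows, np.nan)              # NaN marks "not yet predicted"
times_filled = np.zeros(n_rows, dtype=int)

for fold, val_idx in enumerate(val_folds):
    # Fake predictions; in the notebook these come from val_preds_fold_*.txt.
    oof[val_idx] = fold
    times_filled[val_idx] += 1

all_rows_predicted = not np.isnan(oof).any()
each_row_once = bool((times_filled == 1).all())
print(all_rows_predicted, each_row_once)
```

Initializing with NaN instead of zeros makes a missed row easy to detect, since a zero could be mistaken for a legitimate prediction.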


## Train a libffm model

In [12]:
!./ffm-train -r 0.05 -l 0.00001 -k 50 -t 7 libffm_trn.txt libffm_model

First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (4.3 seconds)
iter   tr_logloss      tr_time
   1      0.42416         17.7
   2      0.40157         36.0
   3      0.39853         53.9
   4      0.39514         71.5
   5      0.39180         89.1
   6      0.38928        107.3
   7      0.38727        125.0


## Predict for test set

The log loss printed by `ffm-predict` in this step is computed against the placeholder `target` of 0 that was inserted into `test` when the data was read, so it is not a meaningful score.

In [13]:
!./ffm-predict libffm_tst.txt libffm_model tst_preds.txt

logloss = 0.23661


## Prepare submission

In [14]:
submission = test[['id']].copy()
submission['target'] = pd.read_csv('tst_preds.txt', header=None).values[:,0]
submission.to_csv('libffm_prediction.csv', index=False)