# First Thing to do
* Upload Train.csv and Test.csv.
* Activate GPU Runtime.
* Preferably, get a Tesla T4 GPU.

When finished, run all cells below. The submission file will be generated at the end of the run (submission_stack.csv).

In [None]:
!pip install catboost
!git clone --recursive https://github.com/Microsoft/LightGBM
%cd /content/LightGBM
!mkdir build
!cmake -DUSE_GPU=1 .
!make -j$(nproc)
%cd /content/LightGBM/python-package
!sudo python setup.py install --precompile
%cd /content/
!pip install xgboost --upgrade

# Environment



*   **Platform** : Google Colab GPU Runtime
*   **GPU** : Tesla T4
*   **Gradient Boost Algorithms used** : CatBoost (version 0.24.1), LightGBM (version 3.0.0.99) and XGBoost (version 1.2.0)
*   **Estimated Runtime** : No more than 25 minutes.



# Methodology



1.   Prepare the training and test sets (using the DatasetPreparer class).
2.   Split the training set into two parts: Set1 (90% of the data) and Set2 (10% of the data).
3.   Train XGBoost, LightGBM and CatBoost on Set1.
4.   Use Set2 to find the best weights for combining the three models above (using the hyperopt library). This is a simple weighted-average stacking.
5.   Retrain the three models on the full training set and stack them using the weights found in Step 4.
6.   Use the stacked model to make predictions on the test set.
7.   Create the submission file.
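
The weight search in Step 4 can be illustrated on synthetic data. This is only a minimal sketch, not the notebook's code: it swaps hyperopt's annealing search for plain random search, and the helpers `log_loss`, `blend` and `fake_model`, as well as all sizes and noise levels, are assumptions made for illustration.

```python
import numpy as np

def log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    p = np.clip(proba[np.arange(len(y_true)), y_true], eps, 1.0)
    return float(-np.log(p).mean())

def blend(probas, weights):
    """Weighted average of per-model probabilities.

    probas has shape (n_models, n_samples, n_classes).
    """
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w, probas, axes=1) / w.sum()

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 3
y_val = rng.integers(0, n_classes, size=n_samples)  # stand-in for Set2 labels

def fake_model(noise):
    """Synthetic stand-in for a trained model's predict_proba on Set2."""
    logits = 2.0 * np.eye(n_classes)[y_val] + rng.normal(0, noise, (n_samples, n_classes))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

probas = np.stack([fake_model(noise) for noise in (0.5, 1.0, 2.0)])

# Start from equal weights, then keep the best of 500 random draws in [0, 1).
best_w = np.ones(len(probas))
best_loss = log_loss(y_val, blend(probas, best_w))
for _ in range(500):
    w = rng.uniform(0, 1, size=len(probas))
    loss = log_loss(y_val, blend(probas, w))
    if loss < best_loss:
        best_w, best_loss = w, loss

print("best weights:", best_w, "log loss:", best_loss)
```

Because only the ratio between the weights matters (the blend divides by their sum), searching each weight in the interval from 0 to 1 is enough to cover every possible mixture.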



# Preparation of Training and Test

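Given a row in the initial DataFrame, for every product that was purchased by the client, we treat it as if it had not been bought and let the model try to predict that it was in fact purchased. A client with n purchased products therefore produces n training rows, each with a different product masked and used as the target.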
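
A minimal sketch of this expansion on a toy row, assuming only three product columns. The helper `expand_row` and the client ID are hypothetical; the notebook's own implementation is `DatasetPreparer.get_train_test`.

```python
# Three of the product codes from the dataset, used as a toy example.
PRODUCTS = ["P5DA", "RIBP", "8NN1"]

def expand_row(client_id, purchases):
    """Return one training sample per purchased product.

    `purchases` is a 0/1 list aligned with PRODUCTS. For each product
    that was bought, a copy of the row is made in which that product is
    set to 0, and the product code becomes the label to predict.
    """
    samples = []
    for k, bought in enumerate(purchases):
        if bought == 1:
            masked = list(purchases)
            masked[k] = 0
            samples.append({"ID": client_id, "features": masked, "target": PRODUCTS[k]})
    return samples

rows = expand_row("A1", [1, 0, 1])
for r in rows:
    print(r)
```

The client above bought two products, so the sketch yields two samples: one where `P5DA` is hidden and must be predicted, and one where `8NN1` is.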

# Features
*   **Purchased Products (P5DA, RIBP, ...)** : It is a binary vector where a one means that the client purchased the product and a zero means he did not.
*   **Sex** : Male or Female (Categorical Feature).
*   **Marital status** : (Categorical Feature)
*   **Branch code** : (Categorical Feature)
*   **Occupation code** : (Categorical Feature)
*   **Occupation category code** : (Categorical Feature)
*   **Birth year**
*   **date1** : Day of the month of the date when the client joined Zimnat
*   **date3** : The year when the client joined Zimnat
*   **date4** : The day name (Monday, Sunday, ...) of the date when the client joined Zimnat (Categorical Feature).
*   **date_diff** : The age of the client when he joined Zimnat. It is equal to (date3 - Birth year).
*   **num_products** : The number of products the client purchased (minus one because we removed a product that the model needs to predict).
*   **popularity_score** : Represents the popularity of the products purchased by the client (the higher, the more popular). It is calculated as follows:
    1. Divide each row of the **Purchased Products** matrix by its sum (sum on axis=1, for normalization purposes). The result is the matrix P.
    2. Sum the matrix P on axis=0. The result is the vector C.
    3. Divide C by its sum.
    4. Compute the dot product P @ C. The result is a popularity score for every instance in the dataset.
*   **p_(Product_Code)** : Represents the popularity of the product (Product_Code) within each join year (**date3** feature). It is calculated as follows:
    1. Group the instances of the dataset by the **date3** feature.
    2. For each group, sum the **Purchased Products** matrix and divide the resulting vector by its sum.
    3. Join each vector with its corresponding group.
*   **popularity_score_year** : This feature was meant to be the equivalent of **popularity_score** computed per year (**date3** feature). Unfortunately, a mistake in the code that creates it leaves its value at 0 for every instance of the dataset. I will leave it as it is.
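
The two popularity computations above can be sketched on toy data. The purchase matrix, the years and the variable names below are assumptions for illustration; only three product columns are used.

```python
import numpy as np
import pandas as pd

# Toy purchase matrix: 3 clients x 3 products.
# Every row needs at least one purchase, otherwise step 1 divides by zero.
B = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 0, 1]], dtype=float)

# popularity_score
P = B / B.sum(axis=1, keepdims=True)   # step 1: normalise each row
C = P.sum(axis=0)                      # step 2: sum over clients
C = C / C.sum()                        # step 3: normalise to shares
popularity_score = P @ C               # step 4: one score per client
print(popularity_score)

# p_(Product_Code): share of each product among purchases of the same join year
products = ["P5DA", "RIBP", "8NN1"]
df = pd.DataFrame(B, columns=products)
df["date3"] = [2018, 2018, 2019]

sums = df.groupby("date3")[products].sum()                     # steps 1 and 2: per-year totals
shares = sums.div(sums.sum(axis=1), axis=0).add_prefix("p_")   # step 2: divide by the sum
df = df.join(shares, on="date3")                               # step 3: attach to each row
print(df)
```

In this toy example the first product is bought by everyone, so the client who holds only that product receives the highest `popularity_score`.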

**Note** : A few join_date values were missing in the dataset. They were replaced by the mean date (computed separately for the training set and the test set).
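
A minimal sketch of this imputation with pandas, using made-up dates:

```python
import pandas as pd

# Toy join dates with one missing value (assumed data).
join_date = pd.to_datetime(pd.Series(["2019-01-01", None, "2019-01-03"]))

# The mean of a datetime column is itself a timestamp,
# so it can be used directly as the fill value.
filled = join_date.fillna(join_date.mean())
print(filled)
```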




# Dataset Preparer 

In [None]:
import copy
import random

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator, RegressorMixin
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder, StandardScaler

def add_dummies(df, column_dummies, threshold=None):
    """Add dummies columns to a DataFrame. If threshold is not None, categories in a 
    column that represent less than threshold% of the data are regrouped in a
    single category named '_rare__'"""
    column_dummies = [x for x in column_dummies if x in df.columns]
    for col in column_dummies:
        if threshold:
            t = df[col].value_counts(normalize=True) <= threshold
            t = t[t].index.to_list()
            df[col].replace(t, value='_rare__', inplace=True)
        c = pd.get_dummies(df[col], prefix=col, dtype=np.int8)
        df = pd.concat([df, c], axis=1)
    df.drop(columns=column_dummies, inplace=True)
    return df


class DatasetPreparer:
    def __init__(self, dummies=False, columns_drop=None, popularity_score=False,
                 popularity_product_year=False, popularity_score_year=False):
        """This class prepares data for training, testing and submission.
        dummies (bool): Whether to convert categorical columns to dummies or not.
        columns_drop (list) : Columns to drop at the end of data preparation.
        popularity_score (bool): Whether to include the popularity_score column.
        popularity_product_year (bool): Whether to include the popularity_product_year column.
        popularity_score_year (bool): Whether to include the popularity_score_year column.
        """

        self.called = False
        self.le = None
        self.true_values = None
        self.dummies = dummies
        self.columns_drop = columns_drop
        self.popularity_score = popularity_score
        self.popularity_product_year = popularity_product_year
        self.popularity_score_year = popularity_score_year

    def get_train_test(self, train, test):
        """Preparation of training and test set"""
        self.called = True
        X_train = []
        X_train_columns = train.columns
        c = 0
        for v in train.values:
            info = v[:8]
            binary = v[8:]
            index = [k for k, i in enumerate(binary) if i == 1]
            for i in index:
                c += 1
                for k in range(len(binary)):
                    if k == i:
                        binary_transformed = list(copy.copy(binary))
                        binary_transformed[i] = 0
                        X_train.append(list(info) + binary_transformed + [X_train_columns[8 + k]] + [c])

        X_train = pd.DataFrame(X_train)
        X_train.columns = ['ID', 'join_date', 'sex', 'marital_status', 'birth_year', 'branch_code',
                           'occupation_code', 'occupation_category_code', 'P5DA', 'RIBP', '8NN1',
                           '7POT', '66FJ', 'GYSR', 'SOP4', 'RVSZ', 'PYUQ', 'LJR9', 'N2MW', 'AHXO',
                           'BSTQ', 'FM3X', 'K6QO', 'QBOL', 'JWFN', 'JZ9D', 'J9JW', 'GHYX', 'ECY3', 'product_pred',
                           'ID2']

        X_test = []
        true_values = []
        c = 0
        for v in test.values:
            c += 1
            info = v[:8]
            binary = v[8:]
            index = [k for k, i in enumerate(binary) if i == 1]
            X_test.append(list(info) + list(binary) + [c])
            for k in test.columns[8:][index]:
                true_values.append(v[0] + ' X ' + k)

        X_test = pd.DataFrame(X_test)
        X_test.columns = ['ID', 'join_date', 'sex', 'marital_status', 'birth_year', 'branch_code',
                          'occupation_code', 'occupation_category_code', 'P5DA', 'RIBP', '8NN1',
                          '7POT', '66FJ', 'GYSR', 'SOP4', 'RVSZ', 'PYUQ', 'LJR9', 'N2MW', 'AHXO',
                          'BSTQ', 'FM3X', 'K6QO', 'QBOL', 'JWFN', 'JZ9D', 'J9JW', 'GHYX', 'ECY3', 'ID2']

        features_train = []
        features_test = []
        columns = []

        append_features = ['P5DA', 'RIBP', '8NN1', '7POT', '66FJ', 'GYSR', 'SOP4', 'RVSZ', 'PYUQ', 'LJR9',
                           'N2MW', 'AHXO', 'BSTQ', 'FM3X', 'K6QO', 'QBOL', 'JWFN', 'JZ9D', 'J9JW', 'GHYX',
                           'ECY3', 'ID', 'ID2', 'join_date', 'sex', 'marital_status', 'branch_code', 'occupation_code',
                           'occupation_category_code',
                           'birth_year']
        for v in append_features:
            features_train.append(X_train[v].values.reshape(-1, 1))
            features_test.append(X_test[v].values.reshape(-1, 1))
            columns.append(np.array([v]))

        y_train = X_train[['product_pred']]

        features_train = np.concatenate(features_train, axis=1)
        features_test = np.concatenate(features_test, axis=1)
        columns = np.concatenate(np.array(columns))

        X_train = pd.DataFrame(features_train)
        X_train.columns = columns
        X_test = pd.DataFrame(features_test)
        X_test.columns = columns

        X_train['join_date'] = pd.to_datetime(X_train['join_date'])
        X_test['join_date'] = pd.to_datetime(X_test['join_date'])

        X_train['join_date'].fillna(X_train['join_date'].mean(), inplace=True)
        X_test['join_date'].fillna(X_test['join_date'].mean(), inplace=True)

        X_train['date1'] = X_train['join_date'].dt.day
        X_train['date2'] = X_train['join_date'].dt.month
        X_train['date3'] = X_train['join_date'].dt.year

        X_test['date1'] = X_test['join_date'].dt.day
        X_test['date2'] = X_test['join_date'].dt.month
        X_test['date3'] = X_test['join_date'].dt.year

        X_train['date_diff'] = X_train['date3'] - X_train['birth_year']
        X_test['date_diff'] = X_test['date3'] - X_test['birth_year']

        X_train['date4'] = X_train['join_date'].dt.day_name()
        X_test['date4'] = X_test['join_date'].dt.day_name()

        X_train.drop('join_date', axis=1, inplace=True)
        X_test.drop('join_date', axis=1, inplace=True)

        le = LabelEncoder()
        data = X_train.append(X_test)
        for v in ['sex', 'marital_status', 'branch_code', 'occupation_code', 'occupation_category_code', 'date4']:
            data.loc[:, v] = le.fit_transform(data.loc[:, v])

        if self.popularity_product_year:
            def f(x):
                x = x.iloc[:, :21].sum()
                x /= x.sum()
                return x

            t = data.groupby(['date3']).apply(f)
            t = t.add_prefix('p_')
            data = data.join(t, on='date3')
        if self.popularity_score_year:
            data['popularity_score_year'] = 0
            for d in data['date3'].unique():
                c = data[data['date3'] == d]
                w = c.iloc[:, :21].values.copy()
                w = w / w.sum(axis=1, keepdims=True)
                w2 = w.sum(axis=0)
                w2 = w2 / w2.sum()
                c['popularity_score_year'] = w @ w2
        if self.dummies:
            data = add_dummies(data, column_dummies=['sex', 'marital_status', 'branch_code', 'occupation_code',
                                                     'occupation_category_code', 'date4'], threshold=0.001)
        data['num_products'] = data.iloc[:, :21].sum(axis=1)

        if self.popularity_score:
            t = data.iloc[:, :21].values.copy()
            t = t / t.sum(axis=1, keepdims=True)
            t2 = t.sum(axis=0)
            t2 = t2 / t2.sum()
            data['popularity_score'] = t @ t2

        X_train = data[:X_train.shape[0]]
        X_test = data[-X_test.shape[0]:]

        le.fit(y_train.iloc[:, 0])
        y_train = pd.DataFrame(le.transform(y_train.iloc[:, 0]))
        y_train.columns = ['target']

        self.le = le
        self.true_values = true_values

        if self.columns_drop:
            X_train.drop(columns=self.columns_drop, inplace=True)
            X_test.drop(columns=self.columns_drop, inplace=True)

        return X_train, X_test, y_train

    def submission(self, model=None, X_test=None, prediction_proba=None):
        """Creating submission DataFrame"""
        if not self.called:
            raise RuntimeError('Run get_train_test first!')
        if prediction_proba is None:
            proba = model.predict_proba(X_test.drop(columns=['ID', 'ID2'], axis=1))
        else:
            proba = prediction_proba
        y_test = pd.DataFrame(proba)
        y_test.columns = self.le.inverse_transform(y_test.columns)

        answer_mass = []
        for i in range(X_test.shape[0]):
            id = X_test['ID'].iloc[i]
            for c in y_test.columns:
                answer_mass.append([id + ' X ' + c, y_test[c].iloc[i]])

        df_answer = pd.DataFrame(answer_mass)
        df_answer.columns = ['ID X PCODE', 'Label']
        for i in range(df_answer.shape[0]):
            if df_answer['ID X PCODE'].iloc[i] in self.true_values:
                df_answer['Label'].iat[i] = 1.0

        df_answer.reset_index(drop=True, inplace=True)
        return df_answer

# Train models on Set1
See Methodology for more information about what Set1 is.

The hyperparameters used below were found with a hyperparameter search library (hyperopt).

In [None]:
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from lightgbm.sklearn import LGBMClassifier
from sklearn.pipeline import Pipeline
import xgboost as xgb

params2 = {
    'n_estimators': 6600,
    'depth': 5,
    'one_hot_max_size': 45,
    'l2_leaf_reg': 8,
    'task_type': 'GPU',
    'loss_function': 'MultiClass',
    'max_bin': 254
}

params = {
 'n_estimators': 100,
 'bagging_fraction': 0.4957961459694631,
 'feature_fraction': 0.6037064857092298,
 'lambda_l2': 0.21378633320033757,
 'learning_rate': 0.08561797859363587,
 'min_data_in_leaf': 100,
 'num_leaves': 18}
params['boosting_type'] = 'gbdt'
params['objective'] = 'multiclass'
params['metric'] = 'multi_logloss'
params['num_class'] = 21
params['device_type'] = 'gpu'
params['verbose'] = -1


params3 = { 'n_estimators': 149,
 'colsample_bytree': 0.8506284701578654,
 'eta': 0.12646057899961385,
 'gamma': 0.03203531113764864,
 'lambda': 2.863535707219281,
 'max_depth': 6,
 'min_child_weight': 1,
 'subsample': 0.9645884644210851,
 'objective': 'multi:softprob',
 'eval_metric': 'mlogloss',
 'num_class': 21,
 'tree_method': 'gpu_hist'}

models = [
    xgb.XGBClassifier(**params3),
    CatBoostClassifier(**params2),
    LGBMClassifier(**params),
]

train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')

dp = DatasetPreparer(dummies=False, columns_drop='date2', 
                     popularity_score_year=True, popularity_product_year=True, popularity_score=True)
X_train, X_test, y_train = dp.get_train_test(train, test)
dp_dummies = DatasetPreparer(dummies=True, columns_drop='date2', 
                             popularity_score_year=True, popularity_product_year=True, popularity_score=True)
X_train_d, X_test_d, y_train_d = dp_dummies.get_train_test(train, test)


# Class 8 is too rare so:
# Oversample class 8 for cross validation to work
X_train, y_train = X_train.reset_index(drop=True), y_train.reset_index(drop=True)
t = X_train[y_train.target == 8]
X_train = X_train.append(X_train.loc[t.index.repeat(2)].copy())
y_train = y_train.target.append(pd.Series([8, ] * 8))
# End of oversampling
# Oversample class 8 for cross validation to work
X_train_d, y_train_d = X_train_d.reset_index(drop=True), y_train_d.reset_index(drop=True)
t = X_train_d[y_train_d.target == 8]
X_train_d = X_train_d.append(X_train_d.loc[t.index.repeat(2)].copy())
y_train_d = y_train_d.target.append(pd.Series([8, ] * 8))
# End of oversampling


X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify=y_train.values.ravel(), random_state=12345,
                                          test_size=0.1)
X_train_d, X_val_d, y_train_d, y_val_d = train_test_split(X_train_d, y_train_d, stratify=y_train_d.values.ravel(), random_state=12345,
                                          test_size=0.1)

X_train, X_val, X_test = X_train.infer_objects(), X_val.infer_objects(), X_test.infer_objects()
X_train_d, X_val_d, X_test_d = X_train_d.infer_objects(), X_val_d.infer_objects(), X_test_d.infer_objects()
for model in models:
    if isinstance(model, CatBoostClassifier):
        model.fit(X_train.drop(columns=['ID', 'ID2']), y_train.values.ravel(),
              cat_features=['sex', 'marital_status', 'branch_code', 'occupation_code', 'occupation_category_code', 'date4'])
    elif isinstance(model, LGBMClassifier):
        model.fit(X_train.drop(columns=['ID', 'ID2']), y_train.values.ravel(),
              categorical_feature=['sex', 'marital_status', 'branch_code', 'occupation_code', 'occupation_category_code', 'date4'])
    elif isinstance(model, (xgb.XGBClassifier, Pipeline)):
        model.fit(X_train_d.drop(columns=['ID', 'ID2']), y_train_d)

# Find weights for Weighted Average Stacking

In [None]:

import os

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import log_loss
from hyperopt import hp, Trials, fmin, anneal

models_path = 'models_to_stack'
features = []
features_test = []
for i, model in enumerate(models):
    if isinstance(model, (xgb.XGBClassifier, Pipeline)):
        t1 = pd.DataFrame(model.predict_proba(X_val_d.drop(columns=['ID', 'ID2', ]))).add_prefix(str(i) + '_')
        t2 = pd.DataFrame(model.predict_proba(X_test_d.drop(columns=['ID', 'ID2']))).add_prefix(str(i) + '_')
    elif isinstance(model, (LGBMClassifier, CatBoostClassifier)):
        t1 = pd.DataFrame(model.predict_proba(X_val.drop(columns=['ID', 'ID2']))).add_prefix(str(i) + '_')
        t2 = pd.DataFrame(model.predict_proba(X_test.drop(columns=['ID', 'ID2']))).add_prefix(str(i) + '_')
    features.append(t1)
    features_test.append(t2)

print('Models losses :')
for i, f in enumerate(features):
    print('Model', i, ': ', log_loss(y_val, f))

features = pd.concat(features, axis=1)
features_test = pd.concat(features_test + [X_test[['ID', 'ID2']]], axis=1)

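def gb_cv(params, random_state=11837198, nfold=4, X=features, y=y_val):
    # Objective for hyperopt: log loss on Set2 of the weighted average of the models.
    # (random_state and nfold are not used.)
    # Reshape (n, n_models * 21) -> (n, n_models, 21), then move the model axis last.
    X = X.values.reshape(-1, len(models), 21).transpose((0,2,1))
    # Read the weights in model order so they line up with the model axis.
    weight = np.array([params[str(k)] for k in range(len(models))])
    
    X = (X * weight).sum(-1) / weight.sum()
    return log_loss(y, X)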

# possible values of parameters
space = {str(k) : hp.uniform(str(k), 0, 1) for k in range(len(models))}

# trials will contain logging information
trials = Trials()

best = fmin(fn=gb_cv,  # function to optimize
            space=space,
            algo=anneal.suggest,
            max_evals=500,
            trials=trials,
            rstate=np.random.RandomState(32465237)
            )

print(best)

# Train on Full Dataset

In [None]:
# Retrain the three models (XGBoost, CatBoost, LightGBM) on the full training set

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

params2 = {
    'n_estimators': 6600,
    'depth': 5,
    'one_hot_max_size': 45,
    'l2_leaf_reg': 8,
    'task_type': 'GPU',
    'loss_function': 'MultiClass',
    'max_bin': 254
}

params = {'bagging_fraction': 0.4957961459694631,
 'feature_fraction': 0.6037064857092298,
 'lambda_l2': 0.21378633320033757,
 'learning_rate': 0.08561797859363587,
 'min_data_in_leaf': 100,
 'num_leaves': 18}
params['boosting_type'] = 'gbdt'
params['objective'] = 'multiclass'
params['metric'] = 'multi_logloss'
params['num_class'] = 21
params['device_type'] = 'gpu'
params['verbose'] = -1

params3 = { 'n_estimators': 149,
 'colsample_bytree': 0.8506284701578654,
 'eta': 0.12646057899961385,
 'gamma': 0.03203531113764864,
 'lambda': 2.863535707219281,
 'max_depth': 6,
 'min_child_weight': 1,
 'subsample': 0.9645884644210851,
 'objective': 'multi:softprob',
 'eval_metric': 'mlogloss',
 'num_class': 21,
 'tree_method': 'gpu_hist'}

models = [
    xgb.XGBClassifier(**params3),
    CatBoostClassifier(**params2),
    LGBMClassifier(**params),
]

train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')

dp = DatasetPreparer(dummies=False,columns_drop='date2', popularity_score_year=True, popularity_product_year=True, popularity_score=True)
X_train, X_test, y_train = dp.get_train_test(train, test)
dp_dummies = DatasetPreparer(dummies=True, columns_drop='date2', popularity_score_year=True, popularity_product_year=True, popularity_score=True)
X_train_d, X_test_d, y_train_d = dp_dummies.get_train_test(train, test)

X_train, X_test = X_train.infer_objects(), X_test.infer_objects()
X_train_d, X_test_d = X_train_d.infer_objects(), X_test_d.infer_objects()

for model in models:
    if isinstance(model, CatBoostClassifier):
        model.fit(X_train.drop(columns=['ID', 'ID2']), y_train.values.ravel(),
              cat_features=['sex', 'marital_status', 'branch_code', 'occupation_code', 'occupation_category_code', 'date4'])
    elif isinstance(model, LGBMClassifier):
        model.fit(X_train.drop(columns=['ID', 'ID2']), y_train.values.ravel(),
              categorical_feature=['sex', 'marital_status', 'branch_code', 'occupation_code', 'occupation_category_code', 'date4'])
    elif isinstance(model, xgb.XGBClassifier):
        model.fit(X_train_d.drop(columns=['ID', 'ID2']), y_train_d.values.ravel())


features_test = []
for i, model in enumerate(models):
    if isinstance(model, xgb.XGBClassifier):
        t2 = pd.DataFrame(model.predict_proba(X_test_d.drop(columns=['ID', 'ID2']))).add_prefix(str(i) + '_')
    elif isinstance(model, (LGBMClassifier, CatBoostClassifier)):
        t2 = pd.DataFrame(model.predict_proba(X_test.drop(columns=['ID', 'ID2']))).add_prefix(str(i) + '_')
    features_test.append(t2)
features_test = pd.concat(features_test, axis=1)



# Weighted Average
Use the weights found previously to stack the three models.
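
The reshape and transpose used for the averaging are easier to follow on a small array. The sketch below assumes 3 models and only 2 classes (the notebook has 21), and the weights are made-up values, not the ones found by the search.

```python
import numpy as np

n_models, n_classes = 3, 2

# One row per test sample; the columns are the models' class probabilities
# placed side by side: [model 0 | model 1 | model 2].
stacked = np.array([[0.9, 0.1, 0.6, 0.4, 0.3, 0.7],
                    [0.2, 0.8, 0.5, 0.5, 0.4, 0.6]])
weights = np.array([0.5, 0.3, 0.2])  # assumed values for illustration

per_model = stacked.reshape(-1, n_models, n_classes)   # (samples, models, classes)
per_class = per_model.transpose((0, 2, 1))             # (samples, classes, models)
proba = (per_class * weights).sum(-1) / weights.sum()  # (samples, classes)
print(proba)
```

Moving the model axis to the last position lets the weight vector broadcast against it, so each class probability becomes a weighted mean over the three models and every row still sums to 1.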

In [None]:
proba = features_test.values.reshape(-1, len(models), 21).transpose((0,2,1))
weights = best
weights = [weights[str(i)] for i in range(len(models))]
proba = (proba * weights).sum(-1) / sum(weights)

# Submission

In [None]:
submission = dp.submission(X_test=X_test, prediction_proba=proba)
submission.to_csv('submission_stack.csv', index=False)

In [None]:
!nvidia-smi

Sat Aug 29 09:36:01 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   76C    P0    33W /  70W |    415MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces