## Install dependencies

In [2]:
! conda install -c conda-forge xgboost --yes

Collecting package metadata: done
Solving environment: done


  current version: 4.6.14
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - xgboost


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _py-xgboost-mutex-2.0      |            cpu_0           8 KB  conda-forge
    ca-certificates-2020.6.20  |       hecda079_0         145 KB  conda-forge
    certifi-2020.6.20          |   py36h9f0ad1d_0         151 KB  conda-forge
    libxgboost-1.0.2           |       he1b5a44_1         2.8 MB  conda-forge
    py-xgboost-1.0.2           |   py36h9f0ad1d_1         2.2 MB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    xgboost-1.0.2              |   py36h831f99a_1          11 KB  conda-forge
    ----------------------------

In [1]:
import xgboost

In [4]:
! pip install 'scikit-learn==0.23.1'

Collecting scikit-learn==0.23.1
  Downloading https://files.pythonhosted.org/packages/d9/3a/eb8d7bbe28f4787d140bb9df685b7d5bf6115c0e2a969def4027144e98b6/scikit_learn-0.23.1-cp36-cp36m-manylinux1_x86_64.whl (6.8MB)
Collecting numpy>=1.13.3 (from scikit-learn==0.23.1)
  Downloading https://files.pythonhosted.org/packages/00/16/476826a84d545424084499763248abbbdc73d065168efed9aa71cdf2a7dc/numpy-1.19.0-cp36-cp36m-manylinux1_x86_64.whl (13.5MB)

In [5]:
! pip install scipy --upgrade

Collecting scipy
  Downloading https://files.pythonhosted.org/packages/ab/f9/6eeed6d5cd8dd435bbf105d10d778c2d76de1a5838fdbc315a59fb7fad64/scipy-1.5.1-cp36-cp36m-manylinux1_x86_64.whl (25.9MB)
Installing collected packages: scipy
  Found existing installation: scipy 0.19.1
    Uninstalling scipy-0.19.1:
      Successfully uninstalled scipy-0.19.1
Successfully instal

In [2]:
import sklearn

In [3]:
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Load artifacts and datasets

In [4]:
path_artifacts = "preproc_artifacts.pkl"

with open(path_artifacts, 'rb') as fin:
    artifacts = pickle.load(fin)

In [10]:
path_train = "Udacity_MAILOUT_052018_TRAIN_preproc.csv"
path_original_train = "../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv"
path_test = "Udacity_MAILOUT_052018_TEST_preproc.csv"

In [6]:
train_df = pd.read_csv(path_train, dtype=artifacts['type_converter'])
y_df = pd.read_csv(path_original_train, sep=';', usecols=['RESPONSE'])
test_df = pd.read_csv(path_test, dtype=artifacts['type_converter'])

In [14]:
x_train = train_df
y_train = y_df

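Check that the "LNR" column is not among the features: it is a per-row identifier (it is used as the ID column of the submission), so a model trained on it could overfit to it.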

In [28]:
"LNR" in x_train.columns

False
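A minimal sketch of a defensive way to make sure identifier columns never reach the model. The helper name `drop_id_columns` and the toy frame are hypothetical and only illustrate the check above; they are not part of the original preprocessing.

```python
import pandas as pd


def drop_id_columns(df, id_columns=("LNR",)):
    """Return a copy of df without any of the listed ID columns that are present."""
    present = [col for col in id_columns if col in df.columns]
    return df.drop(columns=present)


# Toy frame standing in for the real features (hypothetical values).
toy = pd.DataFrame({"LNR": [101, 102, 103], "feature_a": [0.1, 0.5, 0.9]})
features = drop_id_columns(toy)
print("LNR" in features.columns)
```

Calling the helper on a frame that has no "LNR" column is a no-op, so it can be applied to both the train and test partitions.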

In [21]:
x_train = x_train.astype(float)

In [15]:
del train_df
del y_df

As the test set has no labels, we split the train set into train and validation partitions so we can check for potential overfitting.

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.25, random_state=42)
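With so few positive labels, a plain random split can leave the two partitions with different positive rates. One option, shown here as a sketch on synthetic data (not the Arvato data), is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both partitions. The array sizes and the 1.2% positive rate below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 12 positives (1.2%).
rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:12] = 1

# stratify=y preserves the positive rate in both partitions.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(y_tr.mean(), y_va.mean())
```

Both printed rates match the overall positive rate, which makes the validation AUC less dependent on how many positives happened to land in the validation partition.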

## Prepare model

Let's check how balanced the classes are:

In [62]:
y_train['RESPONSE'].value_counts(normalize=True)

0    0.988372
1    0.011628
Name: RESPONSE, dtype: float64

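The classes are heavily imbalanced: only about 1.2% of the training samples are positive. We need to reweight the loss to compensate, which XGBoost supports through `scale_pos_weight`, set here to the ratio of negative to positive samples.

A useful reference: https://machinelearningmastery.com/xgboost-for-imbalanced-classification/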

In [80]:
num_samples = y_train['RESPONSE'].value_counts()
scale_pos_weight = num_samples[0] / num_samples[1]
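As a worked example of the ratio above: the proportions 0.988372 and 0.011628 imply roughly 85 negatives per positive (0.988372 / 0.011628 ≈ 85). The sketch below reproduces that arithmetic on a toy label list with the same 85:1 ratio; the helper name and the toy labels are hypothetical, not the real counts.

```python
from collections import Counter


def negative_to_positive_ratio(labels):
    """Ratio of negative (0) to positive (1) labels, usable as scale_pos_weight."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive samples, ratio is undefined")
    return counts[0] / counts[1]


# Toy labels with the same 85:1 imbalance as the training partition.
toy_labels = [0] * 85 + [1]
ratio = negative_to_positive_ratio(toy_labels)
print(ratio)
```

With this weight, each positive sample counts about as much in the loss as 85 negative samples.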

In [93]:
clf = xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight)

To limit overfitting, we use early stopping on the validation set: https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/

In [94]:
eval_set = [(x_val.values, y_val.values.ravel())]

In [96]:
clf = clf.fit(x_train.values, y_train.values.ravel(),
              early_stopping_rounds=10, eval_metric="auc",
              eval_set=eval_set, verbose=True)

[0]	validation_0-auc:0.58232
Will train until validation_0-auc hasn't improved in 10 rounds.
[1]	validation_0-auc:0.55712
[2]	validation_0-auc:0.57525
[3]	validation_0-auc:0.56756
[4]	validation_0-auc:0.58341
[5]	validation_0-auc:0.57625
[6]	validation_0-auc:0.57088
[7]	validation_0-auc:0.55987
[8]	validation_0-auc:0.57297
[9]	validation_0-auc:0.56719
[10]	validation_0-auc:0.57270
[11]	validation_0-auc:0.56332
[12]	validation_0-auc:0.56641
[13]	validation_0-auc:0.57181
[14]	validation_0-auc:0.56978
Stopping. Best iteration:
[4]	validation_0-auc:0.58341
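The log above can be read as a simple patience rule: training stops once the validation metric has gone 10 rounds without beating its best value. The sketch below replays that rule on the AUC values copied from the log. It is a simplified illustration of the stopping criterion, not XGBoost's internal implementation, and the function name is hypothetical.

```python
def early_stopping_trace(scores, patience):
    """Return (best_round, best_score, last_round) for a metric that is maximised."""
    best_round, best_score = 0, float("-inf")
    for round_idx, score in enumerate(scores):
        if score > best_score:
            # New best value: remember it and reset the patience window.
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, best_score, round_idx
    return best_round, best_score, len(scores) - 1


# Validation AUC per round, copied from the training log above.
logged_auc = [0.58232, 0.55712, 0.57525, 0.56756, 0.58341,
              0.57625, 0.57088, 0.55987, 0.57297, 0.56719,
              0.57270, 0.56332, 0.56641, 0.57181, 0.56978]

best_round, best_score, last_round = early_stopping_trace(logged_auc, patience=10)
print(best_round, best_score, last_round)  # 4 0.58341 14
```

The replay agrees with the log: the best round is 4, and training ends at round 14, ten rounds later.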



## Evaluate model

From the logs, the best ROC-AUC on the validation set is 0.583 (iteration 4). That is only slightly better than chance (0.5). This is a heavily imbalanced dataset, and more advanced feature engineering will likely be required to improve the score.
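To make the "slightly better than chance" reading concrete: ROC-AUC is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, so 0.5 means the ranking is no better than random. The pure-Python sketch below computes it by counting pairs on toy values. The function name is hypothetical, and in practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
# A constant score carries no ranking information.
print(pairwise_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5
```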

## Score test partition and prepare Kaggle submission

In [107]:
y_test_proba_pred = clf.predict_proba(test_df.drop(columns=['LNR']).values)

In [113]:
submission = pd.DataFrame()
submission['LNR'] = test_df['LNR'].astype(int)
# Column 1 holds the probability of the positive class (RESPONSE == 1)
submission['RESPONSE'] = y_test_proba_pred[:, 1]
submission.to_csv("submission.csv", index=False)
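For scikit-learn-style classifiers, `predict_proba` returns one column per class, ordered as in `classes_`, so with labels 0 and 1 the probability of a positive response sits in column 1. The sketch below shows this on toy data; `LogisticRegression` is only a stand-in for the classifier trained above, and the data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature problem: larger x means more likely positive.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)

# Look up the positive-class column instead of hard-coding an index.
positive_col = list(model.classes_).index(1)
p_positive = proba[:, positive_col]
print(model.classes_, positive_col)
```

Looking the column up through `classes_` avoids silently submitting the probability of the negative class.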