This is a very slightly modified fork of Microsoft's [Fast Retraining](https://github.com/Azure/fast_retraining/).

# Experiment 1: HIGGS boson 

This experiment uses the data from the [HIGGS dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS) to predict the appearance of the Higgs boson. The dataset consists of 11 million of observations.

Dataset of atomic particles measurements. The total size of the data is 11 millions of observations. 
It can be used in a classification problem to distinguish between a signal process which produces Higgs 
bosons and a background process which does not.
The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic 
properties measured by the particle detectors in the accelerator. The last seven features are functions of 
the first 21 features; these are high-level features derived by physicists to help discriminate between the 
two classes. The first column is the class label (1 for signal, 0 for background), followed by the 28 
features (21 low-level features then 7 high-level features): lepton pT, lepton eta, lepton phi, 
missing energy magnitude, missing energy phi, jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, 
jet 2 phi, jet 2 b-tag, jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, 
jet 4 b-tag, m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb.

Link to the source: https://archive.ics.uci.edu/ml/datasets/HIGGS

Testing is done on my home computer

    CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
    RAM: 32GiB (4 x 8GiB DIMM Synchronous 2133 MHz (0.5 ns)
    OS: Ubuntu 16.04.4 LTS
    Storage: Samsung SSD 850

In [1]:
import json
import sys
import os
import warnings
import pkg_resources

import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from timer import Timer

print("System version: {}".format(sys.version))
print("XGBoost version: {}".format(pkg_resources.get_distribution('xgboost').version))
print("LightGBM version: {}".format(pkg_resources.get_distribution('lightgbm').version))
print("CatBoost version: {}".format(pkg_resources.get_distribution('catboost').version))

warnings.filterwarnings("ignore", category=DeprecationWarning)

System version: 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
XGBoost version: 0.71
LightGBM version: 2.1.1
CatBoost version: 0.8.1.1


In [2]:
import notebook_memory_management
notebook_memory_management.start_watching_memory()

In [2] used 0.2461 MiB RAM in 0.10s, total RAM usage 120.56 MiB


In [3]:
HIGGS_PATH = "data/HIGGS.csv.gz"

def load_higgs():
    """ Loads HIGGS data
    
    Dataset of atomic particles measurements. The total size of the data is 11 millions of observations. 
    It can be used in a classification problem to distinguish between a signal process which produces Higgs 
    bosons and a background process which does not.
    The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic 
    properties measured by the particle detectors in the accelerator. The last seven features are functions of 
    the first 21 features; these are high-level features derived by physicists to help discriminate between the 
    two classes. The first column is the class label (1 for signal, 0 for background), followed by the 28 
    features (21 low-level features then 7 high-level features): lepton pT, lepton eta, lepton phi, 
    missing energy magnitude, missing energy phi, jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, 
    jet 2 phi, jet 2 b-tag, jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, 
    jet 4 b-tag, m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb.
    Link to the source: https://archive.ics.uci.edu/ml/datasets/HIGGS
    
    Returns
    -------
    pandas DataFrame
    """
    cols = ['boson','lepton_pT','lepton_eta','lepton_phi','missing_energy_magnitude','missing_energy_phi','jet_1_pt','jet_1_eta','jet_1_phi','jet_1_b-tag','jet_2_pt','jet_2_eta','jet_2_phi','jet_2_b-tag','jet_3_pt','jet_3_eta','jet_3_phi','jet_3_b-tag','jet_4_pt','jet_4_eta','jet_4_phi','jet_4_b-tag','m_jj','m_jjj','m_lv','m_jlv','m_bb','m_wbb','m_wwbb']
    return pd.read_csv(HIGGS_PATH, names=cols)

In [3] used 0.0000 MiB RAM in 0.10s, total RAM usage 120.56 MiB


In [4]:
%%time
df = load_higgs()
print(df.shape)

(11000000, 29)
CPU times: user 1min 39s, sys: 2.43 s, total: 1min 41s
Wall time: 1min 41s
In [4] used 4807.1523 MiB RAM in 101.85s, total RAM usage 4927.71 MiB


In [5]:
df.head()

Unnamed: 0,boson,lepton_pT,lepton_eta,lepton_phi,missing_energy_magnitude,missing_energy_phi,jet_1_pt,jet_1_eta,jet_1_phi,jet_1_b-tag,...,jet_4_eta,jet_4_phi,jet_4_b-tag,m_jj,m_jjj,m_lv,m_jlv,m_bb,m_wbb,m_wwbb
0,1.0,0.869293,-0.635082,0.22569,0.32747,-0.689993,0.754202,-0.248573,-1.092064,0.0,...,-0.010455,-0.045767,3.101961,1.35376,0.979563,0.978076,0.920005,0.721657,0.988751,0.876678
1,1.0,0.907542,0.329147,0.359412,1.49797,-0.31301,1.095531,-0.557525,-1.58823,2.173076,...,-1.13893,-0.000819,0.0,0.30222,0.833048,0.9857,0.978098,0.779732,0.992356,0.798343
2,1.0,0.798835,1.470639,-1.635975,0.453773,0.425629,1.104875,1.282322,1.381664,0.0,...,1.128848,0.900461,0.0,0.909753,1.10833,0.985692,0.951331,0.803252,0.865924,0.780118
3,0.0,1.344385,-0.876626,0.935913,1.99205,0.882454,1.786066,-1.646778,-0.942383,0.0,...,-0.678379,-1.360356,0.0,0.946652,1.028704,0.998656,0.728281,0.8692,1.026736,0.957904
4,1.0,1.105009,0.321356,1.522401,0.882808,-1.205349,0.681466,-1.070464,-0.921871,0.0,...,-0.373566,0.113041,0.0,0.755856,1.361057,0.98661,0.838085,1.133295,0.872245,0.808487


In [5] used 0.0000 MiB RAM in 0.14s, total RAM usage 4927.71 MiB


In [6]:
import pandas as pd
cols = ['boson','lepton_pT','lepton_eta','lepton_phi','missing_energy_magnitude','missing_energy_phi','jet_1_pt','jet_1_eta','jet_1_phi','jet_1_b-tag','jet_2_pt','jet_2_eta','jet_2_phi','jet_2_b-tag','jet_3_pt','jet_3_eta','jet_3_phi','jet_3_b-tag','jet_4_pt','jet_4_eta','jet_4_phi','jet_4_b-tag','m_jj','m_jjj','m_lv','m_jlv','m_bb','m_wbb','m_wwbb']
df = pd.read_csv("data/HIGGS-sample.csv", names=cols)

In [11]:
df.boson.mean()

0.5274975002272521

In [6]:
num_rounds = 200
number_processors = os.cpu_count()
print("number_processors =", number_processors)

number_processors = 8
In [6] used 0.0000 MiB RAM in 0.10s, total RAM usage 4927.71 MiB


In [7]:
metrics_dict = {
    'Accuracy': accuracy_score,
    'Precision': precision_score,
    'Recall': recall_score,
    'AUC': roc_auc_score,
    'F1': f1_score,
}

def classification_metrics(metrics, y_true, y_pred):
    return {metric_name:metric(y_true, y_pred) for metric_name, metric in metrics.items()}

In [7] used 0.0000 MiB RAM in 0.11s, total RAM usage 4927.71 MiB


In [8]:
def generate_feables(df):
    X = df[df.columns.difference(['boson'])]
    y = df['boson']
    return X,y

In [8] used 0.0000 MiB RAM in 0.10s, total RAM usage 4927.71 MiB


In [9]:
X, y = generate_feables(df)

In [9] used 2350.2422 MiB RAM in 0.84s, total RAM usage 7277.95 MiB


In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=77, test_size=500000)

In [10] used 2243.7266 MiB RAM in 6.78s, total RAM usage 9521.68 MiB


In [11]:
results_dict = dict()

def extend_result(name, train_t, test_t, y_pred):
    results_dict[name] = {
        'train_time': t_train.interval,
        'train_mem': t_train.memory,
        'test_time': t_test.interval,
        'train_mem': t_test.memory,
        'performance': classification_metrics(metrics_dict, y_test, y_pred)
    }

In [11] used 0.0000 MiB RAM in 0.11s, total RAM usage 9521.68 MiB


### XGBoost

In [12]:
xgb_hist_clf_pipeline = XGBClassifier(max_depth=0,
                                      learning_rate=0.1,
                                      scale_pos_weight=2,
                                      n_estimators=num_rounds,
                                      gamma=0.1,
                                      min_child_weight=1,
                                      reg_lambda=1,
                                      subsample=1,
                                      max_leaves=2**5,
                                      grow_policy='lossguide',
                                      tree_method='hist',
                                      nthread=number_processors)

In [12] used 0.0000 MiB RAM in 0.11s, total RAM usage 9521.68 MiB


In [13]:
with Timer() as t_train:
    xgb_hist_clf_pipeline.fit(X_train, y_train)

with Timer() as t_test:
    y_pred = xgb_hist_clf_pipeline.predict(X_test)

In [13] used 4682.0898 MiB RAM in 199.80s, total RAM usage 14203.77 MiB


In [14]:
extend_result('xgb_hist', t_train, t_test, y_pred)

In [14] used 0.0000 MiB RAM in 0.55s, total RAM usage 14203.77 MiB


### LightGBM

In [15]:
lgbm_clf_pipeline = LGBMClassifier(num_leaves=2**5, 
                                   learning_rate=0.1, 
                                   scale_pos_weight=2,
                                   n_estimators=num_rounds,
                                   min_split_gain=0.1,
                                   min_child_weight=1,
                                   reg_lambda=1,
                                   subsample=1,
                                   nthread=number_processors)

In [15] used 0.0000 MiB RAM in 0.10s, total RAM usage 14203.77 MiB


In [16]:
with Timer() as t_train:
    lgbm_clf_pipeline.fit(X_train, y_train)
    
with Timer() as t_test:
    y_pred = lgbm_clf_pipeline.predict(X_test)

In [16] used 2243.6211 MiB RAM in 123.14s, total RAM usage 16447.39 MiB


In [17]:
extend_result('lgbm', t_train, t_test, y_pred)

In [17] used 0.0000 MiB RAM in 0.63s, total RAM usage 16447.39 MiB


### CatBoost

In [18]:
catboost_clf1 = CatBoostClassifier(iterations=50,  depth=10, learning_rate=0.75, use_best_model=True)
catboost_clf2 = CatBoostClassifier(iterations=100, depth=10, learning_rate=0.75, use_best_model=True)
catboost_clf3 = CatBoostClassifier(iterations=200, depth=10, learning_rate=0.75, use_best_model=True)

In [18] used 0.0000 MiB RAM in 0.10s, total RAM usage 16447.39 MiB


In [19]:
with Timer() as t_train: 
    catboost_clf1.fit(X_train, y_train, cat_features=[], eval_set=(X_test, y_test), verbose=True, plot=False)
with Timer() as t_test:  
    y_pred = catboost_clf1.predict(X_test)
extend_result('catboost1', t_train, t_test, y_pred)

0:	learn: 0.5984543	test: 0.5978012	best: 0.5978012 (0)	total: 3.21s	remaining: 2m 37s
1:	learn: 0.5769844	test: 0.5759589	best: 0.5759589 (1)	total: 6.31s	remaining: 2m 31s
2:	learn: 0.5657624	test: 0.5648693	best: 0.5648693 (2)	total: 9.42s	remaining: 2m 27s
3:	learn: 0.5581784	test: 0.5570759	best: 0.5570759 (3)	total: 12.6s	remaining: 2m 24s
4:	learn: 0.5526425	test: 0.5516414	best: 0.5516414 (4)	total: 15.7s	remaining: 2m 21s
5:	learn: 0.5483744	test: 0.5474873	best: 0.5474873 (5)	total: 18.7s	remaining: 2m 17s
6:	learn: 0.5454438	test: 0.5444752	best: 0.5444752 (6)	total: 21.9s	remaining: 2m 14s
7:	learn: 0.5433435	test: 0.5424801	best: 0.5424801 (7)	total: 25s	remaining: 2m 11s
8:	learn: 0.5414438	test: 0.5406867	best: 0.5406867 (8)	total: 28s	remaining: 2m 7s
9:	learn: 0.5390634	test: 0.5382806	best: 0.5382806 (9)	total: 31.2s	remaining: 2m 4s
10:	learn: 0.5371265	test: 0.5363774	best: 0.5363774 (10)	total: 34.2s	remaining: 2m 1s
11:	learn: 0.5350353	test: 0.5344325	best: 0.534

In [20]:
with Timer() as t_train: 
    catboost_clf2.fit(X_train, y_train, cat_features=[], eval_set=(X_test, y_test), verbose=True, plot=False)
with Timer() as t_test:  
    y_pred = catboost_clf2.predict(X_test)
extend_result('catboost2', t_train, t_test, y_pred)

0:	learn: 0.5985638	test: 0.5978893	best: 0.5978893 (0)	total: 3.16s	remaining: 5m 13s
1:	learn: 0.5788223	test: 0.5778533	best: 0.5778533 (1)	total: 6.28s	remaining: 5m 7s
2:	learn: 0.5652920	test: 0.5643083	best: 0.5643083 (2)	total: 9.39s	remaining: 5m 3s
3:	learn: 0.5578412	test: 0.5571759	best: 0.5571759 (3)	total: 12.5s	remaining: 5m
4:	learn: 0.5525221	test: 0.5519408	best: 0.5519408 (4)	total: 15.6s	remaining: 4m 56s
5:	learn: 0.5493542	test: 0.5487470	best: 0.5487470 (5)	total: 18.8s	remaining: 4m 54s
6:	learn: 0.5452024	test: 0.5446291	best: 0.5446291 (6)	total: 22s	remaining: 4m 52s
7:	learn: 0.5432375	test: 0.5427703	best: 0.5427703 (7)	total: 25.1s	remaining: 4m 48s
8:	learn: 0.5400723	test: 0.5396200	best: 0.5396200 (8)	total: 28.2s	remaining: 4m 44s
9:	learn: 0.5384773	test: 0.5379934	best: 0.5379934 (9)	total: 31.2s	remaining: 4m 40s
10:	learn: 0.5360142	test: 0.5356756	best: 0.5356756 (10)	total: 34.4s	remaining: 4m 38s
11:	learn: 0.5345165	test: 0.5341581	best: 0.5341

93:	learn: 0.4973080	test: 0.5013014	best: 0.5013014 (93)	total: 4m 53s	remaining: 18.7s
94:	learn: 0.4970630	test: 0.5010601	best: 0.5010601 (94)	total: 4m 56s	remaining: 15.6s
95:	learn: 0.4968842	test: 0.5009511	best: 0.5009511 (95)	total: 4m 59s	remaining: 12.5s
96:	learn: 0.4967703	test: 0.5008993	best: 0.5008993 (96)	total: 5m 2s	remaining: 9.36s
97:	learn: 0.4965697	test: 0.5007545	best: 0.5007545 (97)	total: 5m 5s	remaining: 6.24s
98:	learn: 0.4963017	test: 0.5005246	best: 0.5005246 (98)	total: 5m 9s	remaining: 3.12s
99:	learn: 0.4961731	test: 0.5004488	best: 0.5004488 (99)	total: 5m 12s	remaining: 0us

bestTest = 0.50044878
bestIteration = 99

Shrink model to first 100 iterations.
In [20] used 10.2266 MiB RAM in 380.94s, total RAM usage 20757.16 MiB


In [21]:
with Timer() as t_train: 
    catboost_clf3.fit(X_train, y_train, cat_features=[], eval_set=(X_test, y_test), verbose=True, plot=False)
with Timer() as t_test:  
    y_pred = catboost_clf3.predict(X_test)
extend_result('catboost3', t_train, t_test, y_pred)

0:	learn: 0.5986898	test: 0.5980496	best: 0.5980496 (0)	total: 3.14s	remaining: 10m 25s
1:	learn: 0.5771127	test: 0.5761156	best: 0.5761156 (1)	total: 6.26s	remaining: 10m 19s
2:	learn: 0.5657691	test: 0.5649751	best: 0.5649751 (2)	total: 9.38s	remaining: 10m 16s
3:	learn: 0.5583664	test: 0.5574668	best: 0.5574668 (3)	total: 12.6s	remaining: 10m 15s
4:	learn: 0.5526046	test: 0.5518329	best: 0.5518329 (4)	total: 15.7s	remaining: 10m 11s
5:	learn: 0.5484880	test: 0.5478387	best: 0.5478387 (5)	total: 18.7s	remaining: 10m 5s
6:	learn: 0.5456264	test: 0.5448630	best: 0.5448630 (6)	total: 21.9s	remaining: 10m 5s
7:	learn: 0.5433129	test: 0.5424412	best: 0.5424412 (7)	total: 25s	remaining: 10m
8:	learn: 0.5406706	test: 0.5399141	best: 0.5399141 (8)	total: 28.2s	remaining: 9m 58s
9:	learn: 0.5381005	test: 0.5374030	best: 0.5374030 (9)	total: 31.3s	remaining: 9m 54s
10:	learn: 0.5360710	test: 0.5354316	best: 0.5354316 (10)	total: 34.4s	remaining: 9m 50s
11:	learn: 0.5345192	test: 0.5338550	best

92:	learn: 0.4975785	test: 0.5011863	best: 0.5011863 (92)	total: 4m 49s	remaining: 5m 33s
93:	learn: 0.4973843	test: 0.5010351	best: 0.5010351 (93)	total: 4m 53s	remaining: 5m 30s
94:	learn: 0.4971849	test: 0.5008694	best: 0.5008694 (94)	total: 4m 56s	remaining: 5m 27s
95:	learn: 0.4969902	test: 0.5007166	best: 0.5007166 (95)	total: 4m 59s	remaining: 5m 24s
96:	learn: 0.4968632	test: 0.5006230	best: 0.5006230 (96)	total: 5m 2s	remaining: 5m 21s
97:	learn: 0.4967079	test: 0.5005497	best: 0.5005497 (97)	total: 5m 5s	remaining: 5m 18s
98:	learn: 0.4966036	test: 0.5005104	best: 0.5005104 (98)	total: 5m 8s	remaining: 5m 14s
99:	learn: 0.4963641	test: 0.5003361	best: 0.5003361 (99)	total: 5m 11s	remaining: 5m 11s
100:	learn: 0.4962340	test: 0.5002660	best: 0.5002660 (100)	total: 5m 14s	remaining: 5m 8s
101:	learn: 0.4960315	test: 0.5001330	best: 0.5001330 (101)	total: 5m 18s	remaining: 5m 5s
102:	learn: 0.4959010	test: 0.5000654	best: 0.5000654 (102)	total: 5m 21s	remaining: 5m 2s
103:	learn

182:	learn: 0.4857267	test: 0.4940868	best: 0.4940860 (181)	total: 9m 31s	remaining: 53.1s
183:	learn: 0.4853943	test: 0.4938091	best: 0.4938091 (183)	total: 9m 34s	remaining: 49.9s
184:	learn: 0.4852998	test: 0.4937834	best: 0.4937834 (184)	total: 9m 37s	remaining: 46.8s
185:	learn: 0.4852528	test: 0.4937783	best: 0.4937783 (185)	total: 9m 40s	remaining: 43.7s
186:	learn: 0.4850810	test: 0.4936348	best: 0.4936348 (186)	total: 9m 43s	remaining: 40.6s
187:	learn: 0.4849878	test: 0.4936230	best: 0.4936230 (187)	total: 9m 46s	remaining: 37.5s
188:	learn: 0.4849135	test: 0.4936021	best: 0.4936021 (188)	total: 9m 49s	remaining: 34.3s
189:	learn: 0.4847696	test: 0.4935386	best: 0.4935386 (189)	total: 9m 53s	remaining: 31.2s
190:	learn: 0.4846980	test: 0.4935306	best: 0.4935306 (190)	total: 9m 56s	remaining: 28.1s
191:	learn: 0.4846100	test: 0.4935099	best: 0.4935099 (191)	total: 9m 59s	remaining: 25s
192:	learn: 0.4843260	test: 0.4932621	best: 0.4932621 (192)	total: 10m 2s	remaining: 21.9s
1

## Results

In [22]:
# Results
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "catboost1": {
        "performance": {
            "AUC": 0.7419166857815135,
            "Accuracy": 0.743498,
            "F1": 0.7604625231831247,
            "Precision": 0.752742640995966,
            "Recall": 0.7683423913043478
        },
        "test_time": 1.0937234009616077,
        "train_mem": 12.3203125,
        "train_time": 220.42882245406508
    },
    "catboost2": {
        "performance": {
            "AUC": 0.748363189747947,
            "Accuracy": 0.749914,
            "F1": 0.7664271344487428,
            "Precision": 0.7587327886859303,
            "Recall": 0.7742791364734299
        },
        "test_time": 1.1106334659270942,
        "train_mem": 4.22265625,
        "train_time": 379.0510009857826
    },
    "catboost3": {
        "performance": {
            "AUC": 0.7535612424280542,
            "Accuracy": 0.755168,
            "F1": 0.7716000089556918,
            "Precision": 0.7629846648856877,
            "Recall": 0.7804121376811595
        },
 