<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Spark-MLlib-Tuning" data-toc-modified-id="Spark-MLlib-Tuning-1"><span class="toc-item-num">1&nbsp;&nbsp;</span><a href="https://spark.apache.org/docs/latest/ml-tuning.html" target="_blank">Spark MLlib Tuning</a></a></span></li><li><span><a href="#Hyperopt" data-toc-modified-id="Hyperopt-2"><span class="toc-item-num">2&nbsp;&nbsp;</span><a href="https://github.com/hyperopt/hyperopt" target="_blank">Hyperopt</a></a></span><ul class="toc-item"><li><span><a href="#XGBoost-Tuning" data-toc-modified-id="XGBoost-Tuning-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><a href="https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/" target="_blank">XGBoost Tuning</a></a></span><ul class="toc-item"><li><span><a href="#Objective-function" data-toc-modified-id="Objective-function-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Objective function</a></span></li><li><span><a href="#Tune-number-of-trees" data-toc-modified-id="Tune-number-of-trees-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Tune number of trees</a></span></li><li><span><a href="#Tune-tree-specific-parameters" data-toc-modified-id="Tune-tree-specific-parameters-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Tune tree-specific parameters</a></span><ul class="toc-item"><li><span><a href="#Tune-max_depth,-min_child_weight" data-toc-modified-id="Tune-max_depth,-min_child_weight-2.1.3.1"><span class="toc-item-num">2.1.3.1&nbsp;&nbsp;</span>Tune max_depth, min_child_weight</a></span></li><li><span><a href="#Tune-gamma" data-toc-modified-id="Tune-gamma-2.1.3.2"><span class="toc-item-num">2.1.3.2&nbsp;&nbsp;</span>Tune gamma</a></span></li><li><span><a href="#Tune-subsample,-colsample_bytree" data-toc-modified-id="Tune-subsample,-colsample_bytree-2.1.3.3"><span class="toc-item-num">2.1.3.3&nbsp;&nbsp;</span>Tune subsample, colsample_bytree</a></span></li></ul></li><li><span><a 
href="#Tune-regularization-parameters" data-toc-modified-id="Tune-regularization-parameters-2.1.4"><span class="toc-item-num">2.1.4&nbsp;&nbsp;</span>Tune regularization parameters</a></span></li><li><span><a href="#Lower-the-learning-rate-and-decide-the-optimal-parameters" data-toc-modified-id="Lower-the-learning-rate-and-decide-the-optimal-parameters-2.1.5"><span class="toc-item-num">2.1.5&nbsp;&nbsp;</span>Lower the learning rate and decide the optimal parameters</a></span></li></ul></li><li><span><a href="#LogisticRegression-Tuning" data-toc-modified-id="LogisticRegression-Tuning-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LogisticRegression Tuning</a></span></li><li><span><a href="#Optional-MongoTrials" data-toc-modified-id="Optional-MongoTrials-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Optional <a href="https://hyperopt.github.io/hyperopt/scaleout/mongodb/" target="_blank">MongoTrials</a></a></span><ul class="toc-item"><li><span><a href="#XGBoost-Tuning" data-toc-modified-id="XGBoost-Tuning-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>XGBoost Tuning</a></span></li></ul></li></ul></li><li><span><a href="#Results" data-toc-modified-id="Results-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Results</a></span></li></ul></div>

We continue working on the CTR-prediction task using the Criteo dataset.

A description of the task and the data can be found in the notebook from the previous practice session (`sgd_logreg_nn/notebooks/ctr_prediction_mllib.ipynb`).

In [1]:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

import os
import sys
import glob
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

import pyspark
import pyspark.sql.functions as F
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql import Row

COMMON_PATH = '/workspace/common'

sys.path.append(os.path.join(COMMON_PATH, 'utils'))

os.environ['PYSPARK_SUBMIT_ARGS'] = """
--jars {common}/xgboost4j-spark-0.72.jar,{common}/xgboost4j-0.72.jar
--py-files {common}/sparkxgb.zip pyspark-shell
""".format(common=COMMON_PATH).replace('\n', ' ')

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("spark_sql_examples") \
    .config("spark.executor.memory", "23g") \
    .config("spark.driver.memory", "23g") \
    .getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext(sc)

from metrics import rocauc, logloss, ne, get_ate
from processing import split_by_col

from sparkxgb.xgboost import *

In [2]:
DATA_PATH = '/workspace/gradient_boosting/notebooks'

TRAIN_PATH = os.path.join(DATA_PATH, 'train.csv')

In [3]:
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("delimiter", ",") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('file:///' + TRAIN_PATH)

**Remark** You do not have to use half of the dataset and only two categorical variables. You can use more data if your configuration allows it.

In [5]:
df = df.sample(False, 0.5)

In [6]:
num_columns = ['_c{}'.format(i) for i in range(1, 14)]
cat_columns = ['_c{}'.format(i) for i in range(14, 40)][:2]
len(num_columns), len(cat_columns)

(13, 2)

In [7]:
df = df.fillna(0, subset=num_columns)

In [8]:
from pyspark.ml import PipelineModel


pipeline_model = PipelineModel.load(os.path.join(DATA_PATH, 'pipeline'))

In [9]:
df = pipeline_model \
    .transform(df) \
    .select(F.col('_c0').alias('label'), 'features', 'id') \
    .cache()

df.count()

1831000

In [9]:
train_df, val_df, test_df = split_by_col(df, 'id', [0.8, 0.1, 0.1])

# [Spark MLlib Tuning](https://spark.apache.org/docs/latest/ml-tuning.html)

The HPO method built into Spark has two significant drawbacks that make it poorly suited to our task:

1. `ParamGridBuilder` - grid search only
2. `TrainValidationSplit` - splits the data randomly
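To make the first drawback concrete, here is a minimal sketch that counts how many models an exhaustive grid would have to fit. The candidate lists are the ones used in the tuning cells of this notebook; combining all of them into one joint grid is an assumption made for illustration, and `grid_size` and `iter_grid` are hypothetical helper names.

```python
from itertools import product

# Candidate values copied from the tuning cells of this notebook.
grid = {
    'num_round': [10, 20, 40, 100],
    'eta': [0.5, 0.10, 0.15, 0.20, 0.30],
    'max_depth': [5, 6, 8, 10],
    'min_child_weight': [0.0, 1.0, 5.0, 10.0, 25.0, 40.0, 50.0],
    'gamma': [0.1, 0.2, 0.4, 0.9, 2.0],
    'subsample': [0.05, 0.15, 0.4, 1.0],
    'colsample_bytree': [0.1, 0.3, 0.5, 1.0],
    'alpha': [0.0, 0.1, 0.3, 0.7, 1.5, 3.0],
    'reg_lambda': [0.0, 0.1, 0.5, 1.0, 1.5, 3.0],
}


def grid_size(grid):
    """Number of parameter combinations an exhaustive grid search would fit."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


def iter_grid(grid):
    """Lazily yield every combination as a dict, in grid-search order."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


n_models = grid_size(grid)
first = next(iter_grid(grid))
print(n_models)
print(first)
```

The joint grid has more than a million combinations, while the staged search in this notebook fits about 20 models per stage.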

# [Hyperopt](https://github.com/hyperopt/hyperopt)

Install `hyperopt`

In [4]:
!pip3.5 install hyperopt

Collecting hyperopt
  Downloading hyperopt-0.2.3-py3-none-any.whl (1.9 MB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
Collecting tqdm
  Downloading tqdm-4.43.0-py2.py3-none-any.whl (59 kB)
Collecting cloudpickle
  Downloading cloudpickle-1.3.0-py2.py3-none-any.whl (26 kB)
Collecting networkx==2.2
  Downloading networkx-2.2.zip (1.7 MB)
Installing collected packages: future, tqdm, cloudpickle, networkx, hyperopt
    Running setup.py install for future ... done
    Running setup.py install for networkx ... done
Successfully installed cloudpickle-1.3.0 future-0.18.2 hyperopt-0.2.3 networkx-2.2 tqdm-4.43.0


## [XGBoost Tuning](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)

> [Notes on Parameter Tuning](https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html)

### Objective function

In [11]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK, STATUS_FAIL


def objective(space):
    """Fit XGBoost with the sampled parameters and score it on val_df."""
    estimator = XGBoostEstimator(**space)
    print('SPACE:', estimator._input_kwargs_processed())
    model = None
    attempts = 0
    # Allow one retry if fitting raises.
    while model is None and attempts < 2:
        try:
            model = estimator.fit(train_df)
        except Exception as e:
            attempts += 1
            print(e)
            print('Try again')

    if model is None:
        # Both attempts failed: mark the trial as failed instead of scoring None.
        return {'status': STATUS_FAIL}

    log_loss = logloss(model, val_df, probabilities_col='probabilities')
    roc_auc = rocauc(model, val_df, probabilities_col='probabilities')

    print('LOG-LOSS: {}, ROC-AUC: {}'.format(log_loss, roc_auc))

    return {'loss': log_loss, 'rocauc': roc_auc, 'status': STATUS_OK}

In [19]:
static_params = {
    'featuresCol': "features", 
    'labelCol': "label", 
    'predictionCol': "prediction",
    'eval_metric': 'logloss',
    'objective': 'binary:logistic',
    'nthread': 1,
    'silent': 0,
    'nworkers': 1
}

Fix baseline parameters and train baseline model

In [24]:
CONTROL_NAME = 'xgb baseline'

start_params = {
    'colsample_bytree': 0.9,
    'eta': 0.15,
    'gamma': 0.9,
    'max_depth': 6,
    'min_child_weight': 50.0,
    'subsample': 0.9,
    'num_round': 20
}

baseline_model = XGBoostEstimator(**{**static_params, **start_params}).fit(train_df)

In [14]:
baseline_rocauc = rocauc(baseline_model, val_df, probabilities_col='probabilities')
baseline_rocauc

0.7267792837479905

In [15]:
all_metrics = {}

In [16]:
baseline_test_metrics = {
    'logloss': logloss(baseline_model, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(baseline_model, test_df, probabilities_col='probabilities')
}

all_metrics[CONTROL_NAME] = baseline_test_metrics

### Tune number of trees

> Choose a relatively high learning rate. Generally a learning rate of 0.1 works but somewhere between 0.05 to 0.3 should work for different problems. Determine the optimum number of trees for this learning rate.

In [17]:
%%time

num_round_choice = [10, 20, 40, 100]
eta_choice = [0.5, 0.10, 0.15, 0.20, 0.30]

space = {
    # Optimize
    'num_round': hp.choice('num_round', num_round_choice),
    'eta': hp.choice('eta', eta_choice),
    
    # Fixed    
    'max_depth': start_params['max_depth'],
    'min_child_weight': start_params['min_child_weight'],
    'subsample': start_params['subsample'],
    'gamma': start_params['gamma'],
    'colsample_bytree': start_params['colsample_bytree'],
    
    **static_params
}


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials)

SPACE:                                                
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'eta': 0.15, 'eval_metric': 'logloss', 'gamma': 0.9, 'objective': 'binary:logistic', 'num_round': 40, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.5059750051229666, ROC-AUC: 0.7316424878716304
SPACE:                                                                          
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'eta': 0.2, 'eval_metric': 'logloss', 'gamma': 0.9, 'objective': 'binary:logistic', 'num_round': 40, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.5049240827093872, ROC-AUC: 0.7328122734181619                       
SPACE:                                                                          
{'labelCol': 'label', 'nwor

LOG-LOSS: 0.5049240827093872, ROC-AUC: 0.7328122734181647                        
SPACE:                                                                           
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'eta': 0.15, 'eval_metric': 'logloss', 'gamma': 0.9, 'objective': 'binary:logistic', 'num_round': 40, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.5059750051229666, ROC-AUC: 0.7316424878716293                        
SPACE:                                                                           
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'eta': 0.5, 'eval_metric': 'logloss', 'gamma': 0.9, 'objective': 'binary:logistic', 'num_round': 20, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.5046411635761301, ROC-AUC: 0.73299651976396

In [18]:
best

{'eta': 0, 'num_round': 3}

Note that with `hp.choice` the `best` variable stores not the actual hyperparameter value but its index in the corresponding list, for example `num_round_choice`.
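A small helper can do this index-to-value lookup for several parameters at once. This is a minimal sketch in plain Python; `decode_choices` is a hypothetical name, not part of hyperopt. hyperopt itself also provides `space_eval(space, best)`, which resolves a `best` dict against the original search space.

```python
def decode_choices(best, choices):
    """Map hp.choice indices returned by fmin back to the actual values.

    best    -- dict of indices, e.g. {'eta': 0, 'num_round': 3}
    choices -- dict mapping each parameter name to its list of candidates
    """
    return {name: choices[name][index] for name, index in best.items()}


# Values copied from the cells above.
best = {'eta': 0, 'num_round': 3}
choices = {
    'eta': [0.5, 0.10, 0.15, 0.20, 0.30],
    'num_round': [10, 20, 40, 100],
}

decoded = decode_choices(best, choices)
print(decoded)  # {'eta': 0.5, 'num_round': 100}
```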

In [19]:
eta = eta_choice[best['eta']]  # change me!
num_round = num_round_choice[best['num_round']]  # change me!

In [10]:
eta, num_round = [0.5, 0.10, 0.15, 0.20, 0.30][0], [10, 20, 40, 100][3]
eta, num_round

(0.5, 100)

### Tune tree-specific parameters

> Tune tree-specific parameters ( max_depth, min_child_weight, gamma, subsample, colsample_bytree) for decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.
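The staged procedure quoted above amounts to a coordinate-wise search: tune one group of parameters, freeze the winners, then move on to the next group. Below is a toy sketch of that idea. The objective `toy_loss`, the helper `tune_stage`, and the exhaustive search inside each stage are assumptions made for illustration; this notebook uses hyperopt's TPE and real model training instead.

```python
from itertools import product


def toy_loss(params):
    """Stand-in for validation log-loss; minimal at max_depth=6, gamma=0.4."""
    return (params['max_depth'] - 6) ** 2 + (params['gamma'] - 0.4) ** 2


def tune_stage(fixed, candidates, loss):
    """Search only `candidates` while `fixed` stays frozen; return the merged best params."""
    names = list(candidates)
    best_params, best_loss = None, float('inf')
    for combo in product(*(candidates[n] for n in names)):
        params = dict(fixed, **dict(zip(names, combo)))
        value = loss(params)
        if value < best_loss:
            best_params, best_loss = params, value
    return best_params


# Stage order mirrors this notebook: tree shape first, then gamma.
params = {'max_depth': 10, 'gamma': 2.0}
stages = [
    {'max_depth': [5, 6, 8, 10]},
    {'gamma': [0.1, 0.2, 0.4, 0.9, 2.0]},
]
for stage in stages:
    params = tune_stage(params, stage, toy_loss)

print(params)
```

Staging keeps the number of fitted models small, but it can miss interactions between parameter groups, which is one reason to re-search a narrow neighbourhood of the winners at the end.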

#### Tune max_depth, min_child_weight

In [None]:
%%time

max_depth_choice = [5, 6, 8, 10]
min_child_weight_choice = [0.0, 1.0, 5.0, 10.0, 25.0, 40.0, 50.0]

space_1 = {
    # Optimize
    'max_depth': hp.choice('max_depth', max_depth_choice),
    'min_child_weight': hp.choice('min_child_weight', min_child_weight_choice),
    
    # Fixed    
    'num_round': num_round,
    'eta': eta,
    'subsample': start_params['subsample'],
    'gamma': start_params['gamma'],
    'colsample_bytree': start_params['colsample_bytree'],
    
    **static_params
}


trials_1 = Trials()
best_1 = fmin(fn=objective,
            space=space_1,
            algo=tpe.suggest,
            max_evals=15,
            trials=trials_1)
print(best_1)
max_depth = max_depth_choice[best_1['max_depth']]
min_child_weight = min_child_weight_choice[best_1['min_child_weight']]

In [11]:
max_depth = [5, 6, 8, 10][1]
min_child_weight = [0.0, 1.0, 5.0, 10.0, 25.0, 40.0, 50.0][6]

max_depth, min_child_weight

(6, 50.0)

#### Tune gamma

In [23]:
%%time

gamma_choice = [0.1, 0.2, 0.4, 0.9, 2.0]

space_2 = {
    # Optimize
    'gamma': hp.choice('gamma', gamma_choice),
    
    # Fixed    
    'num_round': num_round,
    'eta': eta,
    'subsample': start_params['subsample'],
    'colsample_bytree': start_params['colsample_bytree'],
    'max_depth': max_depth,
    'min_child_weight': min_child_weight,
    
    **static_params
}


trials_2 = Trials()
best_2 = fmin(fn=objective,
            space=space_2,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_2)
print(best_2)


SPACE:                                                
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'eta': 0.5, 'eval_metric': 'logloss', 'gamma': 0.9, 'objective': 'binary:logistic', 'num_round': 100, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.4999719312076353, ROC-AUC: 0.739497136284216
SPACE:                                                                           
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'eta': 0.5, 'eval_metric': 'logloss', 'gamma': 2.0, 'objective': 'binary:logistic', 'num_round': 100, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.500091270990817, ROC-AUC: 0.7392182630119916                         
SPACE:                                                                           
{'labelCol': 'label', 'n

LOG-LOSS: 0.4999344937271374, ROC-AUC: 0.7394154688180656                         
SPACE:                                                                            
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'eta': 0.5, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'num_round': 100, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.4997659706582391, ROC-AUC: 0.7396264087238456                         
SPACE:                                                                            
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'eta': 0.5, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'num_round': 100, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.4997659706582391, ROC-AUC: 0.739626408

KeyError: 'gamma'

In [12]:
gamma = [0.1, 0.2, 0.4, 0.9, 2.0][2]  # gamma_choice[best_2['gamma']]
gamma

0.4

#### Tune subsample, colsample_bytree

In [28]:
%%time

subsample_choice = [0.05, 0.15, 0.4, 1.0] 
colsample_bytree_choice = [0.1, 0.3, 0.5, 1.0]

space_3 = {
    # Optimize
    'subsample': hp.choice('subsample', subsample_choice),
    'colsample_bytree': hp.choice('colsample_bytree', colsample_bytree_choice),
    
    # Fixed    
    'num_round': num_round,
    'eta': eta,
    'gamma': gamma,
    'max_depth': max_depth,
    'min_child_weight': min_child_weight,
    
    **static_params
}


trials_3 = Trials()
best_3 = fmin(fn=objective,
            space=space_3,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_3)
print(best_3)

SPACE:                                                
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 1.0, 'eta': 0.5, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'num_round': 100, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 1.0, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.5000928239315975, ROC-AUC: 0.7392247189052934
SPACE:                                                                           
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 1.0, 'eta': 0.5, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'num_round': 100, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.5, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.500531698468729, ROC-AUC: 0.7384748074102702                         
SPACE:                                                                           
{'labelCol': 'label', '

LOG-LOSS: 0.5055654615978574, ROC-AUC: 0.7319119998521211                        
SPACE:                                                                           
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 0.15, 'eta': 0.5, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'num_round': 100, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 1.0, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.5067085627484498, ROC-AUC: 0.7300070171857189                        
SPACE:                                                                           
{'labelCol': 'label', 'nworkers': 1, 'nthread': 1, 'subsample': 1.0, 'eta': 0.5, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'num_round': 100, 'max_depth': 6, 'silent': 0, 'predictionCol': 'prediction', 'colsample_bytree': 0.1, 'min_child_weight': 50.0, 'featuresCol': 'features'}
LOG-LOSS: 0.5024113621014259, ROC-AUC: 0.736307925301

In [29]:
subsample = subsample_choice[best_3['subsample']]
colsample_bytree = colsample_bytree_choice[best_3['colsample_bytree']]

In [13]:
subsample = [0.05, 0.15, 0.4, 1.0][3]
colsample_bytree = [0.1, 0.3, 0.5, 1.0][3]

subsample, colsample_bytree

(1.0, 1.0)

### Tune regularization parameters

> Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.

In [33]:
%%time

alpha_choice = [0.0, 0.1, 0.3, 0.7, 1.5, 3.0]
lambda_choice = [0.0, 0.1, 0.5, 1.0, 1.5, 3.0]

space_4 = {
    # Optimize
    'alpha': hp.choice('alpha', alpha_choice),
    'reg_lambda': hp.choice('reg_lambda', lambda_choice),

    # Fixed    
    'num_round': num_round,
    'eta': eta,
    'gamma': gamma,
    'max_depth': max_depth,
    'min_child_weight': min_child_weight,
    'subsample': subsample,
    'colsample_bytree': colsample_bytree,
    
    **static_params
}


trials_4 = Trials()
best_4 = fmin(fn=objective,
            space=space_4,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_4)
print(best_4)

SPACE:                                                
{'labelCol': 'label', 'colsample_bytree': 1.0, 'nthread': 1, 'min_child_weight': 50.0, 'eta': 0.5, 'nworkers': 1, 'alpha': 3.0, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'featuresCol': 'features', 'silent': 0, 'subsample': 1.0, 'predictionCol': 'prediction', 'max_depth': 6, 'num_round': 100, 'lambda': 0.0}
LOG-LOSS: 0.5006333418478331, ROC-AUC: 0.738428914222012
SPACE:                                                                           
{'labelCol': 'label', 'colsample_bytree': 1.0, 'nthread': 1, 'min_child_weight': 50.0, 'eta': 0.5, 'nworkers': 1, 'alpha': 0.7, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'featuresCol': 'features', 'silent': 0, 'subsample': 1.0, 'predictionCol': 'prediction', 'max_depth': 6, 'num_round': 100, 'lambda': 0.0}
LOG-LOSS: 0.5004349764545957, ROC-AUC: 0.7389036774607389                        
SPACE:                                          

LOG-LOSS: 0.5004349764545957, ROC-AUC: 0.7389036774607407                         
SPACE:                                                                            
{'labelCol': 'label', 'colsample_bytree': 1.0, 'nthread': 1, 'min_child_weight': 50.0, 'eta': 0.5, 'nworkers': 1, 'alpha': 0.3, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'featuresCol': 'features', 'silent': 0, 'subsample': 1.0, 'predictionCol': 'prediction', 'max_depth': 6, 'num_round': 100, 'lambda': 1.5}
LOG-LOSS: 0.5002497242057075, ROC-AUC: 0.7391441974556907                         
SPACE:                                                                            
{'labelCol': 'label', 'colsample_bytree': 1.0, 'nthread': 1, 'min_child_weight': 50.0, 'eta': 0.5, 'nworkers': 1, 'alpha': 0.1, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'featuresCol': 'features', 'silent': 0, 'subsample': 1.0, 'predictionCol': 'prediction', 'max_depth': 6, 'num_round': 100, 'lambda

In [34]:
alpha = alpha_choice[best_4['alpha']]
reg_lambda = lambda_choice[best_4['reg_lambda']]

In [14]:
alpha = [0.0, 0.1, 0.3, 0.7, 1.5, 3.0][2]
reg_lambda = [0.0, 0.1, 0.5, 1.0, 1.5, 3.0][5]

alpha, reg_lambda

(0.3, 3.0)

### Lower the learning rate and decide the optimal parameters

In [35]:
%%time

max_depth_choice_final = [6, 7]
min_child_weight_choice_final = [50.0, 60.0, 70.0]
gamma_choice_final = [0.4, 0.5, 0.6]
subsample_choice_final = [0.9, 1.0] 
colsample_bytree_choice_final = [0.9, 1.0]
alpha_choice_final = [0.2, 0.3, 0.4]
lambda_choice_final = [2.0, 3.0, 4.0]
num_round_choice_final = [100, 150]


space_final = {
    # Optimize
    'max_depth': hp.choice('max_depth', max_depth_choice_final),
    'min_child_weight': hp.choice('min_child_weight', min_child_weight_choice_final),
    'gamma': hp.choice('gamma', gamma_choice_final),
    'subsample': hp.choice('subsample', subsample_choice_final),
    'colsample_bytree': hp.choice('colsample_bytree', colsample_bytree_choice_final),
    'alpha': hp.choice('alpha', alpha_choice_final),
    'reg_lambda': hp.choice('reg_lambda', lambda_choice_final),
    'num_round': hp.choice('num_round', num_round_choice_final),
    
    # Fixed
    'eta': 0.05,
    
    **static_params
}


trials_final = Trials()
best_final = fmin(fn=objective,
            space=space_final,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_final)
print(best_final)

final_params = {
    'colsample_bytree': colsample_bytree_choice_final[best_final['colsample_bytree']],
    'eta': eta,
    'gamma': gamma_choice_final[best_final['gamma']],
    'max_depth': max_depth_choice_final[best_final['max_depth']],
    'min_child_weight': min_child_weight_choice_final[best_final['min_child_weight']],
    'subsample': subsample_choice_final[best_final['subsample']],
    'num_round': num_round_choice_final[best_final['num_round']],
    'alpha': alpha_choice_final[best_final['alpha']],
    'reg_lambda': lambda_choice_final[best_final['reg_lambda']],
}

SPACE:                                                
{'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'min_child_weight': 70.0, 'eta': 0.05, 'nworkers': 1, 'alpha': 0.3, 'eval_metric': 'logloss', 'gamma': 0.5, 'objective': 'binary:logistic', 'featuresCol': 'features', 'silent': 0, 'subsample': 0.9, 'predictionCol': 'prediction', 'max_depth': 6, 'num_round': 150, 'lambda': 4.0}
LOG-LOSS: 0.505043543869428, ROC-AUC: 0.7329799266302088
SPACE:                                                                            
{'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'min_child_weight': 70.0, 'eta': 0.05, 'nworkers': 1, 'alpha': 0.2, 'eval_metric': 'logloss', 'gamma': 0.6, 'objective': 'binary:logistic', 'featuresCol': 'features', 'silent': 0, 'subsample': 0.9, 'predictionCol': 'prediction', 'max_depth': 6, 'num_round': 150, 'lambda': 2.0}
LOG-LOSS: 0.5050005749054789, ROC-AUC: 0.7329889466585511                         
SPACE:                                      

LOG-LOSS: 0.505657776077862, ROC-AUC: 0.7321075859855927                          
SPACE:                                                                            
{'labelCol': 'label', 'colsample_bytree': 1.0, 'nthread': 1, 'min_child_weight': 50.0, 'eta': 0.05, 'nworkers': 1, 'alpha': 0.4, 'eval_metric': 'logloss', 'gamma': 0.4, 'objective': 'binary:logistic', 'featuresCol': 'features', 'silent': 0, 'subsample': 0.9, 'predictionCol': 'prediction', 'max_depth': 6, 'num_round': 150, 'lambda': 2.0}
LOG-LOSS: 0.504896653993036, ROC-AUC: 0.7331208859742133                          
SPACE:                                                                              
{'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'min_child_weight': 50.0, 'eta': 0.05, 'nworkers': 1, 'alpha': 0.4, 'eval_metric': 'logloss', 'gamma': 0.6, 'objective': 'binary:logistic', 'featuresCol': 'features', 'silent': 0, 'subsample': 1.0, 'predictionCol': 'prediction', 'max_depth': 6, 'num_round': 150, 'la

In [15]:
max_depth_choice_final = [6, 7]
min_child_weight_choice_final = [50.0, 60.0, 70.0]
gamma_choice_final = [0.4, 0.5, 0.6]
subsample_choice_final = [0.9, 1.0] 
colsample_bytree_choice_final = [0.9, 1.0]
alpha_choice_final = [0.2, 0.3, 0.4]
lambda_choice_final = [2.0, 3.0, 4.0]
num_round_choice_final = [100, 150]

final_params = {
    'colsample_bytree': colsample_bytree_choice_final[1],
    'eta': eta,
    'gamma': gamma_choice_final[0],
    'max_depth': max_depth_choice_final[1],
    'min_child_weight': min_child_weight_choice_final[1],
    'subsample': subsample_choice_final[0],
    'num_round': num_round_choice_final[1],
    'alpha': alpha_choice_final[1],
    'reg_lambda': lambda_choice_final[1],
}

final_params

{'alpha': 0.3,
 'colsample_bytree': 1.0,
 'eta': 0.5,
 'gamma': 0.4,
 'max_depth': 7,
 'min_child_weight': 60.0,
 'num_round': 150,
 'reg_lambda': 3.0,
 'subsample': 0.9}

---
## LogisticRegression Tuning

Let's tune the hyperparameters of the logistic regression from the previous practice sessions.

In [21]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK, STATUS_FAIL
from pyspark.ml.classification import LogisticRegression


def objective_lr(space):
    """Fit LogisticRegression with the sampled parameters and score it on val_df."""
    estimator = LogisticRegression(**space)
    print('SPACE:', estimator._input_kwargs)
    model = None
    attempts = 0
    # Allow one retry if fitting raises.
    while model is None and attempts < 2:
        try:
            model = estimator.fit(train_df)
        except Exception as e:
            attempts += 1
            print(e)
            print('Try again')

    if model is None:
        # Both attempts failed: mark the trial as failed instead of scoring None.
        return {'status': STATUS_FAIL}

    log_loss = logloss(model, val_df, probabilities_col='probability')
    roc_auc = rocauc(model, val_df, probabilities_col='probability')

    print('LOG-LOSS: {}, ROC-AUC: {}'.format(log_loss, roc_auc))

    return {'loss': log_loss, 'rocauc': roc_auc, 'status': STATUS_OK}

In [22]:
CONTROL_NAME_LR = 'lr baseline'

static_params_lr = {
    'featuresCol': "features", 
    'labelCol': "label", 
    'maxIter': 15,
}
prev_params_lr = {'elasticNetParam': 1, 'fitIntercept': True, 'regParam': 0}

merged = {**static_params_lr, **prev_params_lr}
baseline_model_lr = LogisticRegression(**merged).fit(train_df)

In [27]:
baseline_test_metrics_lr = {
    'logloss': logloss(baseline_model_lr, test_df, probabilities_col='probability'),
    'rocauc': rocauc(baseline_model_lr, test_df, probabilities_col='probability')
}

all_metrics[CONTROL_NAME_LR] = baseline_test_metrics_lr
all_metrics

{'lr baseline': {'logloss': 0.5311226907739438, 'rocauc': 0.7007963446486266},
 'xgb baseline': {'logloss': 0.5136422657623795, 'rocauc': 0.725208836729057}}

In [28]:
%%time

regParam_choice = [0., 0.01, 0.5, 0.1]
elasticNetParam_choice = [0., 0.1, 0.2, 0.5, 1]
fitIntercept_choice = [True, False]

space_lr = {
    'regParam': hp.choice('regParam', regParam_choice),
    'elasticNetParam': hp.choice('elasticNetParam', elasticNetParam_choice),
    'fitIntercept': hp.choice('fitIntercept', fitIntercept_choice),
    
    **static_params_lr
}


trials_lr = Trials()
best_lr = fmin(fn=objective_lr,
            space=space_lr,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_lr)

print(best_lr)

SPACE:                                                
{'elasticNetParam': 1, 'featuresCol': 'features', 'regParam': 0.0, 'labelCol': 'label', 'maxIter': 15, 'fitIntercept': True}
LOG-LOSS: 0.5279326290586741, ROC-AUC: 0.7031880786148457
SPACE:                                                                          
{'elasticNetParam': 0.5, 'featuresCol': 'features', 'regParam': 0.0, 'labelCol': 'label', 'maxIter': 15, 'fitIntercept': False}
LOG-LOSS: 0.5278683217330382, ROC-AUC: 0.7031251222098683                       
SPACE:                                                                          
{'elasticNetParam': 0.1, 'featuresCol': 'features', 'regParam': 0.01, 'labelCol': 'label', 'maxIter': 15, 'fitIntercept': True}
LOG-LOSS: 0.5283276953838569, ROC-AUC: 0.7037407395505592                       
SPACE:                                                                          
{'elasticNetParam': 0.5, 'featuresCol': 'features', 'regParam': 0.01, 'labelCol': 'label', 'maxIter':

NameError: name 'best_final' is not defined

In [29]:
final_params_lr = {
    'elasticNetParam': elasticNetParam_choice[best_lr['elasticNetParam']], 
    'fitIntercept': fitIntercept_choice[best_lr['fitIntercept']], 
    'regParam': regParam_choice[best_lr['regParam']]
}

final_params_lr

{'elasticNetParam': 0.5, 'fitIntercept': False, 'regParam': 0.0}

In [16]:
final_params_lr = {'elasticNetParam': 0.5, 'fitIntercept': False, 'regParam': 0.0}

---
## Optional [MongoTrials](https://hyperopt.github.io/hyperopt/scaleout/mongodb/)

> For parallel search, hyperopt includes a MongoTrials implementation that supports asynchronous updates.

**TLDR** Advantages of using `MongoTrials`:
* `MongoTrials` lets you run several evaluations of the objective function in parallel
* Dynamic level of parallelism - workers that evaluate the objective function can be added or removed
* All results are stored in the database - the run history is never lost

*Completing this task earns an additional +0.4 to the final grade*
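As a starting point, the wiring for a `MongoTrials`-backed search could look like the sketch below. It is not a complete solution: the host, port, database name and both helper names are hypothetical, and it assumes a MongoDB server is already running and at least one `hyperopt-mongo-worker` process has been started separately. The connection string format and the worker command follow the hyperopt MongoDB documentation linked above.

```python
def mongo_trials_url(host, port, db_name):
    """Build the connection string MongoTrials expects: mongo://host:port/db/jobs."""
    return 'mongo://{}:{}/{}/jobs'.format(host, port, db_name)


def run_parallel_search(objective, space, url, exp_key, max_evals=20):
    """Run fmin with trials stored in MongoDB so that separate workers evaluate them.

    Assumes a worker was started beforehand, for example:
        hyperopt-mongo-worker --mongo=localhost:27017/ctr_db --poll-interval=0.1
    """
    # Imported lazily so the URL helper works without hyperopt/pymongo installed.
    from hyperopt import fmin, tpe
    from hyperopt.mongoexp import MongoTrials

    trials = MongoTrials(url, exp_key=exp_key)
    return fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=max_evals, trials=trials)


# Hypothetical host, port and database name.
url = mongo_trials_url('localhost', 27017, 'ctr_db')
print(url)
```

Because the workers run in separate processes, the objective function has to be importable and picklable on their side; an objective that closes over Spark DataFrames, like the ones in this notebook, would likely need to load its data inside the function.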

### XGBoost Tuning

In [40]:
######################################
######### YOUR CODE HERE #############
######################################

# Results

Let's summarize.

Train the models with the (optimal) hyperparameters you found and compare them on the held-out set.

In [17]:
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("delimiter", ",") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('file:///' + TRAIN_PATH)

df = df.fillna(0, subset=num_columns)

df = pipeline_model \
    .transform(df) \
    .select(F.col('_c0').alias('label'), 'features', 'id') \
    .cache()

train_df, val_df, test_df = split_by_col(df, 'id', [0.8, 0.1, 0.1])

In [25]:
all_metrics = {}
prev_xgb_params = {**static_params, **start_params}
baseline_model = XGBoostEstimator(**prev_xgb_params).fit(train_df)

all_metrics["xgb baseline"] = {
    'logloss': logloss(baseline_model, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(baseline_model, test_df, probabilities_col='probabilities')
}

In [28]:
final_xgb_params = {**static_params, **final_params}
xgb_new_model = XGBoostEstimator(**final_xgb_params).fit(train_df)

all_metrics['xgb final'] = {
    'logloss': logloss(xgb_new_model, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(xgb_new_model, test_df, probabilities_col='probabilities')
}

In [29]:
lr_model = LogisticRegression(**{**static_params_lr, **prev_params_lr}).fit(train_df)

all_metrics['lr baseline'] = {
    'logloss': logloss(lr_model, test_df, probabilities_col='probability'),
    'rocauc': rocauc(lr_model, test_df, probabilities_col='probability')
}

In [30]:
lr_final_model = LogisticRegression(**{**static_params_lr, **final_params_lr}).fit(train_df)

all_metrics['lr final'] = {
    'logloss': logloss(lr_final_model, test_df, probabilities_col='probability'),
    'rocauc': rocauc(lr_final_model, test_df, probabilities_col='probability')
}

Final table

In [35]:
get_ate(all_metrics, "xgb baseline")

| metric | lr baseline ate % | lr final ate % | xgb final ate % |
|---|---|---|---|
| rocauc | -3.18866 | -3.19287 | 2.163623 |
| logloss | 3.148918 | 3.148291 | -2.345035 |
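The `ate %` values are consistent with a relative difference against the control model, `(value - control) / control * 100`. The sketch below reproduces that arithmetic for the XGBoost models. `relative_diff_percent` is a hypothetical helper; the actual `get_ate` implementation lives in the `metrics` module and is not shown in this notebook.

```python
def relative_diff_percent(metrics, control_name):
    """Percent change of each metric of each model relative to the control model."""
    control = metrics[control_name]
    return {
        name: {m: (values[m] - control[m]) / control[m] * 100.0 for m in control}
        for name, values in metrics.items()
        if name != control_name
    }


# Test-set metrics copied from the `all_metrics` output of this notebook.
all_metrics_example = {
    'xgb baseline': {'logloss': 0.5138036910102037, 'rocauc': 0.7248619097892304},
    'xgb final': {'logloss': 0.501754812085975, 'rocauc': 0.7405451857937936},
}

ate = relative_diff_percent(all_metrics_example, 'xgb baseline')
print(ate['xgb final'])
```

For ROC-AUC a positive value is an improvement over the control; for log-loss a negative value is.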


In [32]:
all_metrics

{'lr baseline': {'logloss': 0.5299829461896662, 'rocauc': 0.7017485246832023},
 'lr final': {'logloss': 0.5299797247351329, 'rocauc': 0.701718010310289},
 'xgb baseline': {'logloss': 0.5138036910102037, 'rocauc': 0.7248619097892304},
 'xgb final': {'logloss': 0.501754812085975, 'rocauc': 0.7405451857937936}}

In [36]:
TEST_PATH = os.path.join(DATA_PATH, 'test.csv')

submit_df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("delimiter", ",") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('file:///' + TEST_PATH)

In [39]:
from pyspark.ml import PipelineModel

num_columns = ['_c{}'.format(i) for i in range(1, 14)]
cat_columns = ['_c{}'.format(i) for i in range(14, 40)][:2]

submit_df = submit_df.fillna(0, subset=num_columns)
pipeline_model = PipelineModel.load(os.path.join(DATA_PATH, 'pipeline'))

submit_df = pipeline_model \
    .transform(submit_df) \
    .select('features', 'id') \
    .cache()

In [45]:
result = xgb_new_model.transform(submit_df)

In [47]:
result = result\
    .withColumnRenamed('probabilities', 'proba')\
    .select('id', 'proba').rdd\
    .map(lambda arr: [int(arr[0]), float(arr[1][1])]) \
    .collect()

In [48]:
import csv

RESULT_PATH = os.path.join(DATA_PATH, 'xgb_model_opt_submission.csv')

# newline='' lets the csv module control line endings (avoids blank rows on Windows).
with open(RESULT_PATH, 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['id', 'proba'])
    writer.writerows(result)