## XGBoost

In [1]:
from sklearn import metrics, datasets

In [2]:
datasets.make_classification

<function sklearn.datasets._samples_generator.make_classification(n_samples=100, n_features=20, *, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)>

In [3]:
help(datasets.make_classification)

Help on function make_classification in module sklearn.datasets._samples_generator:

make_classification(n_samples=100, n_features=20, *, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
    Generate a random n-class classification problem.
    
    This initially creates clusters of points normally distributed (std=1)
    about vertices of an ``n_informative``-dimensional hypercube with sides of
    length ``2*class_sep`` and assigns an equal number of clusters to each
    class. It introduces interdependence between these features and adds
    various types of further noise to the data.
    
    Without shuffling, ``X`` horizontally stacks features in the following
    order: the primary ``n_informative`` features, followed by ``n_redundant``
    linear combinations of the informative features, followed by ``n_repeated``
    duplicates, dr

In [4]:
x, y = datasets.make_classification(10000, n_features=10, n_informative=5, n_redundant=0, n_clusters_per_class=3, class_sep=0.5, random_state=150)

In [5]:
x.shape

(10000, 10)

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
train, test, y_train, y_test = train_test_split(x, y, random_state=10, test_size=0.25)


In [8]:
train.shape


(7500, 10)

In [9]:
test.shape

(2500, 10)

In [10]:

import xgboost as xgb

## General Parameters
These define the overall functionality of XGBoost.

1. booster [default=gbtree]
    - Select the type of model to run at each iteration. It has 2 options:
        - gbtree: tree-based models
        - gblinear: linear models
2. silent [default=0]
    - Silent mode is activated when this is set to 1, i.e. no running messages will be printed.
    - It’s generally good to keep it at 0, as the messages might help in understanding the model.
    - Newer XGBoost releases deprecate silent in favor of the verbosity parameter.
 

## Booster Parameters
Though there are 2 types of boosters, I’ll consider only the tree booster here because it usually outperforms the linear booster, so the latter is rarely used.

1. eta [default=0.3, alias: learning_rate]
    - Makes the model more robust by shrinking the weights on each step
    - Typical final values to be used: 0.01-0.2
2. min_child_weight [default=1]
    - Defines the minimum sum of weights of all observations required in a child.
3. max_depth [default=6]
    - The maximum depth of a tree. Used to control overfitting, as a higher depth allows the model to learn relations very specific to a particular sample.
    - Should be tuned using CV.
    - Typical values: 3-10
4. max_leaf_nodes (exposed as max_leaves in XGBoost)
    - The maximum number of terminal nodes or leaves in a tree.
    - Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
5. gamma [default=0]
    - A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
    - Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
6. max_delta_step [default=0]
    - The maximum delta step allowed for each tree’s weight estimation. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
    - Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.
7. subsample [default=1]
    - Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
    - Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to underfitting.
    - Typical values: 0.5-1
8. colsample_bytree [default=1]
    - Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
    - Typical values: 0.5-1
9. colsample_bylevel [default=1]
    - Denotes the subsample ratio of columns for each new depth level of a tree.
    - I don’t use this often because subsample and colsample_bytree usually do the job, but you can explore it further if you wish.
10. lambda [default=1]
    - L2 regularization term on weights (analogous to Ridge regression).
    - This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
11. alpha [default=0]
    - L1 regularization term on weights (analogous to Lasso regression).
    - Can be used in case of very high dimensionality so that the algorithm runs faster when implemented
12. scale_pos_weight [default=1]
    - A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.
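
Two of the rules of thumb above can be checked with plain arithmetic. The sketch below computes the leaf-count bound for a given depth and a commonly suggested starting value for scale_pos_weight (number of negatives divided by number of positives). The helper names and the toy label list are illustrative assumptions, not part of XGBoost, and the ratio is only a starting point for tuning.

```python
from collections import Counter


def max_leaves_for_depth(depth):
    """Upper bound on the number of leaves in a binary tree of this depth."""
    return 2 ** depth


def suggested_scale_pos_weight(labels):
    """Heuristic starting value: count(negative) / count(positive)."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("labels contain no positive examples")
    return counts[0] / counts[1]


# Hypothetical, heavily imbalanced labels: 95 negatives and 5 positives.
toy_labels = [0] * 95 + [1] * 5

print(max_leaves_for_depth(6))                 # bound for the default max_depth
print(suggested_scale_pos_weight(toy_labels))  # ratio of negatives to positives
```

With the default max_depth of 6, a single tree can have at most 64 leaves, which is why max_depth and a leaf limit are alternative ways of capping tree size.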
 

## Learning Task Parameters
These parameters are used to define the optimization objective and the metric to be calculated at each step.

1. objective [default=reg:linear, renamed reg:squarederror in newer releases]
    - This defines the loss function to be minimized. Mostly used values are:
        - binary:logistic – logistic regression for binary classification, returns predicted probability (not class)
        - multi:softmax – multiclass classification using the softmax objective, returns predicted class (not probabilities)
            - you also need to set an additional num_class parameter defining the number of unique classes
        - multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class
2. eval_metric [ default according to objective ]
    - The metric to be used for validation data.
    - The default values are rmse for regression and error for classification.
    - Typical values are:
        - rmse – root mean square error
        - mae – mean absolute error
        - logloss – negative log-likelihood
        - error – Binary classification error rate (0.5 threshold)
        - merror – Multiclass classification error rate
        - mlogloss – Multiclass logloss
        - auc – area under the ROC curve
3. seed [default=0]
    - The random number seed.
    - Can be used for generating reproducible results and also for parameter tuning.
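
To make the binary metrics above concrete, here is a minimal pure-Python sketch of error, logloss and auc. These are textbook definitions written for illustration, not XGBoost's own implementation, and the four example labels and probabilities are made up.

```python
import math


def error_rate(y_true, y_prob, threshold=0.5):
    """Fraction of examples misclassified when p > threshold means class 1."""
    wrong = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) != bool(y))
    return wrong / len(y_true)


def logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]

print(error_rate(y_true, y_prob))  # 0.5
print(auc(y_true, y_prob))         # 0.75
print(logloss(y_true, y_prob))
```

Note that error depends on the 0.5 threshold, while auc only depends on how the predictions are ranked, which is one reason auc is often preferred when the classes are imbalanced.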

In [11]:
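# Wrap the arrays in XGBoost's DMatrix container
dtrain = xgb.DMatrix(train, label=y_train)
dtest = xgb.DMatrix(test, label=y_test)
params = {
    'learning_rate': 0.1,
    'colsample_bytree': 0.3,
    'max_depth': 5,
    'objective': 'binary:logistic',
    'n_estimators': 100,  # only used by the scikit-learn wrapper; xgb.train uses num_boost_round
    'alpha': 10,
    'silent': True,
    'tree_method': 'gpu_hist',  # requires a GPU-enabled build of XGBoost
    'eval_metric': 'auc'}

# Train the model, reporting AUC on the train and test sets after each round
trained_model = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train'), (dtest, 'test')])

# Predict probabilities for the test set
prediction = trained_model.predict(dtest)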

[0]	train-auc:0.55593	test-auc:0.48753
[1]	train-auc:0.64252	test-auc:0.58759
[2]	train-auc:0.64416	test-auc:0.59567
[3]	train-auc:0.66299	test-auc:0.61016
[4]	train-auc:0.74745	test-auc:0.70718
[5]	train-auc:0.77451	test-auc:0.73081
[6]	train-auc:0.76914	test-auc:0.73244
[7]	train-auc:0.77888	test-auc:0.73552
[8]	train-auc:0.78009	test-auc:0.73401
[9]	train-auc:0.78143	test-auc:0.73298
[10]	train-auc:0.78055	test-auc:0.73599
[11]	train-auc:0.79227	test-auc:0.74858
[12]	train-auc:0.80013	test-auc:0.75588
[13]	train-auc:0.80082	test-auc:0.75512
[14]	train-auc:0.80302	test-auc:0.75511
[15]	train-auc:0.80156	test-auc:0.75591
[16]	train-auc:0.80294	test-auc:0.75958
[17]	train-auc:0.80141	test-auc:0.75764
[18]	train-auc:0.80222	test-auc:0.75714
[19]	train-auc:0.80279	test-auc:0.75781
[20]	train-auc:0.80575	test-auc:0.76079
[21]	train-auc:0.80748	test-auc:0.76089
[22]	train-auc:0.80698	test-auc:0.75984
[23]	train-auc:0.80769	test-auc:0.76021
[24]	train-auc:0.80622	test-auc:0.75843
[25]	train
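
Because the objective is binary:logistic, the native `predict` call returns probabilities rather than class labels. A minimal sketch of turning such probabilities into labels and scoring them follows; the helper names and the four example values are hypothetical, and 0.5 is just the conventional cut-off.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Map each probability to class 1 if it exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical probabilities, standing in for a few entries of `prediction`.
example_probs = [0.12, 0.81, 0.47, 0.66]
example_truth = [0, 1, 1, 1]

example_labels = to_class_labels(example_probs)
print(example_labels)                           # [0, 1, 0, 1]
print(accuracy(example_truth, example_labels))  # 0.75
```

The scikit-learn wrapper used later applies this thresholding itself, which is why `model.predict` returns class labels directly.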

In [24]:

from xgboost import XGBClassifier
model = XGBClassifier(**params)
model.fit(train, y_train)

XGBClassifier(alpha=10, base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.3, eval_metric='auc',
              gamma=0, gpu_id=0, importance_type='gain',
              interaction_constraints=None, learning_rate=0.1, max_delta_step=0,
              max_depth=5, min_child_weight=1, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=0,
              num_parallel_tree=1, random_state=0, reg_alpha=10, reg_lambda=1,
              scale_pos_weight=1, silent=True, subsample=1,
              tree_method='gpu_hist', validate_parameters=1, verbosity=None)

In [18]:
type(model.predict(test))

numpy.ndarray

In [23]:
import matplotlib.pyplot as plt


0

In [38]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
skf = StratifiedKFold(n_splits=4, shuffle = True, random_state = 1001)

params={
          'learning_rate': [0.2, 0.3, 0.4],
          'colsample_bytree' : [0.3],
          'max_depth': [5, 6, 7],
          'objective': ['binary:logistic'],
          'n_estimators':[100, 200],
          'alpha' : [10, 20],
          'silent': [True],
          'tree_method':['gpu_hist'],
          'eval_metric':['auc']}
#model = XGBClassifier(**params)
#tparams={
#          'learning_rate': [0.1, 0.2, 0.3],
#          'max_depth': [5, 6, 7]}

grid_search = GridSearchCV(model, params, scoring='roc_auc', n_jobs=1, cv=skf, verbose=3)

grid_search.fit(train, y_train)

Fitting 4 folds for each of 36 candidates, totalling 144 fits
[CV] alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=100, objective=binary:logistic, silent=True, tree_method=gpu_hist 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=100, objective=binary:logistic, silent=True, tree_method=gpu_hist, score=0.801, total=   0.5s
[CV] alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=100, objective=binary:logistic, silent=True, tree_method=gpu_hist 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s


[CV]  alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=100, objective=binary:logistic, silent=True, tree_method=gpu_hist, score=0.802, total=   0.4s
[CV] alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=100, objective=binary:logistic, silent=True, tree_method=gpu_hist 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s


[CV]  alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=100, objective=binary:logistic, silent=True, tree_method=gpu_hist, score=0.799, total=   0.4s
[CV] alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=100, objective=binary:logistic, silent=True, tree_method=gpu_hist 
[CV]  alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=100, objective=binary:logistic, silent=True, tree_method=gpu_hist, score=0.788, total=   0.4s
[CV] alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=200, objective=binary:logistic, silent=True, tree_method=gpu_hist 
[CV]  alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, n_estimators=200, objective=binary:logistic, silent=True, tree_method=gpu_hist, score=0.817, total=   0.6s
[CV] alpha=10, colsample_bytree=0.3, eval_metric=auc, learning_rate=0.2, max_depth=5, 

[Parallel(n_jobs=1)]: Done 144 out of 144 | elapsed:  1.2min finished


GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=1001, shuffle=True),
             estimator=XGBClassifier(alpha=10, base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=0.3, eval_metric='auc',
                                     gamma=None, gpu_id=None,
                                     importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=0.1, max_delta_step=None,
                                     max_depth=5, min_...
                                     scale_pos_weight=None, silent=True,
                                     subsample=None, tree_method='gpu_hist',
                                     validate_parameters=False,
                                     verbosity=None),
             n_jobs=1,
             param_grid={'alp
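
The "36 candidates, totalling 144 fits" line in the log follows directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A small sketch that reproduces the count with `itertools.product` (this mirrors the arithmetic only; it is not how GridSearchCV is implemented internally):

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'learning_rate': [0.2, 0.3, 0.4],
    'colsample_bytree': [0.3],
    'max_depth': [5, 6, 7],
    'objective': ['binary:logistic'],
    'n_estimators': [100, 200],
    'alpha': [10, 20],
    'silent': [True],
    'tree_method': ['gpu_hist'],
    'eval_metric': ['auc'],
}
n_splits = 4  # folds used by the StratifiedKFold splitter

# One candidate per combination of values, one fit per candidate per fold.
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * n_splits

print(len(candidates))  # 36
print(total_fits)       # 144
```

Adding one more value to any list multiplies the number of fits, so it is worth pinning parameters you do not intend to tune to a single value, as done here for colsample_bytree.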

In [39]:
grid_search.best_estimator_

XGBClassifier(alpha=10, base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.3, eval_metric='auc',
              gamma=0, gpu_id=0, importance_type='gain',
              interaction_constraints=None, learning_rate=0.2, max_delta_step=0,
              max_depth=7, min_child_weight=1, missing=nan,
              monotone_constraints=None, n_estimators=200, n_jobs=0,
              num_parallel_tree=1, random_state=0, reg_alpha=10, reg_lambda=1,
              scale_pos_weight=1, silent=True, subsample=1,
              tree_method='gpu_hist', validate_parameters=False,
              verbosity=None)

In [35]:
grid_search.best_estimator_

XGBClassifier(alpha=10, base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.3, eval_metric='auc',
              gamma=0, gpu_id=0, importance_type='gain',
              interaction_constraints=None, learning_rate=0.3, max_delta_step=0,
              max_depth=7, min_child_weight=1, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=0,
              num_parallel_tree=1, random_state=0, reg_alpha=10, reg_lambda=1,
              scale_pos_weight=1, silent=True, subsample=1,
              tree_method='gpu_hist', validate_parameters=False,
              verbosity=None)

In [41]:
grid_search.best_params_

{'alpha': 10,
 'colsample_bytree': 0.3,
 'eval_metric': 'auc',
 'learning_rate': 0.2,
 'max_depth': 7,
 'n_estimators': 200,
 'objective': 'binary:logistic',
 'silent': True,
 'tree_method': 'gpu_hist'}

In [42]:
grid_search.best_score_

0.815843731070787
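
`best_score_` is the mean cross-validated score of the best candidate, averaged over the folds. The per-fold scores of the winning candidate are not visible in the truncated log, so the sketch below illustrates the averaging with the four fold scores printed for the first candidate (learning_rate=0.2, max_depth=5, n_estimators=100).

```python
from statistics import mean

# Fold scores reported in the log for the first candidate.
first_candidate_fold_scores = [0.801, 0.802, 0.799, 0.788]

# The figure GridSearchCV compares across candidates is the mean over folds.
first_candidate_mean = mean(first_candidate_fold_scores)
print(round(first_candidate_mean, 4))  # 0.7975
```

The first candidate averages about 0.7975, below the 0.8158 reached by the selected combination.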