# A Guide to XGBoost Hyperparameter Tuning

* XGBoost provides a large range of hyperparameters. Knowing how to tune them lets us take full advantage of the model.

* Model training typically starts with the parameters initialized to some values (random values or zeros). As training progresses, an optimization algorithm (e.g. gradient descent) updates these values. The learning algorithm keeps updating the parameters throughout training, but the hyperparameter values set by the model designer remain unchanged.

* In machine learning, you choose and set the hyperparameter values that the learning algorithm will use before training even begins.
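
To make the distinction concrete, here is a minimal sketch in plain Python (not XGBoost). The toy data and the names `learning_rate`, `n_steps` and `w` are made up for illustration: the first two are hyperparameters fixed before training, while `w` is a parameter that gradient descent updates.

```python
# Toy data generated from y = 3 * x (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

# Hyperparameters: chosen before training and never changed by it.
learning_rate = 0.01
n_steps = 500

# Parameter: initialized to zero and updated by gradient descent.
w = 0.0
for _ in range(n_steps):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(f"learned w = {w:.4f}, learning_rate is still {learning_rate}")
```

After training, `w` has moved close to 3 while `learning_rate` and `n_steps` are exactly what we set them to.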



# XGBoost Parameters

Before running XGBoost, we must set three types of parameters: general parameters, booster parameters and task parameters.

1. **General parameters** relate to which booster we are using to do boosting, commonly a tree or linear model.

2. **Booster parameters** depend on which booster you have chosen.

3. **Learning task parameters** decide on the learning scenario. For example, regression tasks may use different parameters from ranking tasks.
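
As a rough illustration of the three groups, the sketch below builds one dictionary per group and merges them. The variable names are ours and the values are simply the defaults quoted in this guide. A merged dictionary like this is the kind of object that would be passed as the `params` argument of `xgb.train`.

```python
# General parameters: which booster to use and how verbose to be.
general_params = {"booster": "gbtree", "verbosity": 1}

# Booster parameters: these apply to the tree booster selected above.
booster_params = {"eta": 0.3, "max_depth": 6, "subsample": 1.0}

# Learning task parameters: what is optimized and how it is scored.
task_params = {"objective": "reg:squarederror", "eval_metric": "rmse"}

# XGBoost takes a single flat dictionary, so the three groups are merged.
all_params = {**general_params, **booster_params, **task_params}
print(all_params)
```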

# Relevant General Parameters

1. **booster** [default= gbtree ] 
  * Which booster to use. Can be gbtree, gblinear or dart; 
  * gbtree and dart use tree based models while gblinear uses linear functions.

2. **verbosity** [default=1]
  * Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). 
  * Sometimes XGBoost tries to change configurations based on heuristics, which is displayed as a warning message.
  * If there is unexpected behaviour, try increasing the value of verbosity.

3. **nthread** [default to maximum number of threads available if not set]
  * Number of parallel threads used to run XGBoost. When choosing it, please keep thread contention and hyperthreading in mind.




# Parameters for Tree Booster
1. **eta [default=0.3]**
* alias: learning_rate
* Step size shrinkage used in updates to prevent overfitting.
* After each boosting step, we can directly get the weights of new features.
* eta shrinks these weights at each step, which makes the boosting process more conservative and the model more robust.
* range: [0,1]

2. **gamma [default=0]**
* Minimum loss reduction required to make a further partition on a leaf node of the tree. 
* The larger gamma is, the more conservative the algorithm will be.
* range: [0,∞]

3. **max_depth [default=6]**

* Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit.
* 0 is only accepted with the lossguide growing policy when tree_method is set to hist or gpu_hist, and it indicates no limit on depth.
* Beware that XGBoost aggressively consumes memory when training a deep tree.
* range: [0,∞]

4. **min_child_weight [default=1]**

* Minimum sum of instance weight (hessian) needed in a child.
* If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning.
* In a linear regression task, this simply corresponds to the minimum number of instances needed in each node.
* The larger min_child_weight is, the more conservative the algorithm will be.
* range: [0,∞]

5. **max_delta_step [default=0]**

* Maximum delta step we allow each leaf output to be.
* If the value is set to 0, there is no constraint.
* If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update.
* range: [0,∞]

6. **subsample [default=1]**

* Subsample ratio of the training instances, i.e. the fraction of observations randomly sampled for each tree.
* Setting it to 0.5 means that XGBoost randomly samples half of the training data before growing trees, which helps prevent overfitting.
* Subsampling occurs once in every boosting iteration.
* Lower values make the algorithm more conservative and help prevent overfitting, but values that are too small may lead to underfitting.
* typical values: 0.5-1
* range: (0,1]



7. **sampling_method [default= uniform]**

* The method to use to sample the training instances.
* **uniform:** each training instance has an equal probability of being selected. Typically set subsample >= 0.5 for good results.
* **gradient_based:** the selection probability for each training instance is proportional to the regularized absolute value of gradients 

8. **colsample_bytree, colsample_bylevel, colsample_bynode [default=1]**

**This is a family of parameters for subsampling of columns.**

All **colsample_by*** parameters have a range of (0, 1], a default value of 1, and specify the fraction of columns to be subsampled.

**colsample_bytree** is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.

**colsample_bylevel** is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.

**colsample_bynode** is the subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.

**colsample_by*** parameters work cumulatively. For instance, the combination **{'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5}** with 64 features will leave 8 features to choose from at each split.
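
The arithmetic behind that example can be checked with a few lines of Python. This is only a sketch of the cumulative effect. How XGBoost rounds when the product is not a whole number is not covered here.

```python
n_features = 64
colsample = {
    "colsample_bytree": 0.5,
    "colsample_bylevel": 0.5,
    "colsample_bynode": 0.5,
}

# Each ratio is applied to the columns that survived the previous stage.
remaining = n_features
for name, ratio in colsample.items():
    remaining = int(remaining * ratio)
    print(f"after {name}: {remaining} features")
```

The count goes 64, 32, 16, 8, which matches the 8 features quoted above.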

9. **lambda [default=1]**
* alias: reg_lambda
* L2 regularization term on weights.
* Increasing this value will make the model more conservative.

10. **alpha [default=0]**
* alias: reg_alpha
* L1 regularization term on weights.
* Increasing this value will make the model more conservative.

11. **grow_policy [default= depthwise]**
* Controls the way new nodes are added to the tree.
* Currently supported only if tree_method is set to hist or gpu_hist.
* **Choices:** depthwise, lossguide
* **depthwise:** split at nodes closest to the root.
* **lossguide:** split at nodes with highest loss change.

12. **max_leaves [default=0]**
* Maximum number of nodes to be added. 
* Only relevant when grow_policy=lossguide is set.

[For the full list, see the XGBoost parameter documentation](https://xgboost.readthedocs.io/en/latest/parameter.html)
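
To see how gamma, lambda, min_child_weight and eta interact, here is a simplified sketch of the split decision. It uses the standard gradient boosting formulas (leaf weight -G / (H + lambda), and split gain from the sums of gradients G and hessians H). The function names are ours, alpha and max_delta_step are ignored, and this is not XGBoost's actual implementation.

```python
def leaf_weight(g_sum, h_sum, reg_lambda, eta):
    """Leaf weight -G / (H + lambda), shrunk by the learning rate eta."""
    return -eta * g_sum / (h_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Loss reduction of a split, minus the gamma penalty for the extra leaf."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda, gamma, min_child_weight):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Made-up gradient sums: three instances on each side, hessian 1 per instance
# (as with squared error).
print(split_allowed(-6, 3, 6, 3, reg_lambda=1, gamma=0, min_child_weight=1))
print(split_allowed(-6, 3, 6, 3, reg_lambda=1, gamma=10, min_child_weight=1))
print(split_allowed(-6, 3, 6, 3, reg_lambda=1, gamma=0, min_child_weight=4))
print(leaf_weight(-6, 3, reg_lambda=1, eta=0.3))
```

Raising gamma or min_child_weight rejects the same split, and a larger lambda or smaller eta shrinks the leaf weight. This is what "more conservative" means in the descriptions above.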

# Learning Task Parameters 

1. **objective [default=reg:squarederror]**

It defines the loss function to be minimized. Most commonly used values are given below -

* reg:squarederror : regression with squared loss.

* reg:squaredlogerror : regression with squared log loss $\frac{1}{2}[\log(pred+1)-\log(label+1)]^2$. All input labels are required to be greater than -1.

* reg:logistic : logistic regression

* binary:logistic : logistic regression for binary classification, output probability

* binary:logitraw: logistic regression for binary classification, output score before logistic transformation

* binary:hinge : hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.

* multi:softmax : set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (number of classes).

* multi:softprob : same as softmax, but outputs a vector of ndata * nclass values, which can be reshaped to an ndata * nclass matrix. The result contains the predicted probability of each data point belonging to each class.
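
Two of the objectives above are easy to check by hand. The sketch below uses our own helper names (not XGBoost functions). It computes the squared log loss used by reg:squaredlogerror and the logistic transformation that turns a binary:logitraw score into a binary:logistic probability.

```python
import math


def squared_log_error(pred, label):
    """0.5 * (log(pred + 1) - log(label + 1)) ** 2 for a single instance."""
    return 0.5 * (math.log1p(pred) - math.log1p(label)) ** 2


def sigmoid(score):
    """Logistic transformation from a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


print(squared_log_error(9.0, 9.0))   # identical prediction and label
print(squared_log_error(99.0, 9.0))  # prediction ten times too large (after the +1)
print(sigmoid(0.0))                  # a raw score of 0
```

Because the loss works on log(value + 1), it penalizes relative rather than absolute errors, which is why labels must be greater than -1.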

2. **eval_metric [default according to objective]**
* The metric to be used for validation data.
* The default values are rmse for regression, error for classification (logloss since XGBoost 1.3) and mean average precision for ranking.
* We can add multiple evaluation metrics.
* Python users must pass the metrics as a list of parameter pairs instead of a map, so that a later eval_metric does not override an earlier one.
* The most common values are given below -

 * rmse : root mean square error
 * mae : mean absolute error
 * logloss : negative log-likelihood
 * error : Binary classification error rate (0.5 threshold). 
 * merror : Multiclass classification error rate.
 * mlogloss : Multiclass logloss
 * auc : Area under the ROC curve
 * aucpr : Area under the PR curve
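
As a quick sanity check of what these metrics measure, here is a plain-Python sketch that computes rmse, mae, logloss and error by hand on small made-up data. It follows the textbook definitions, not XGBoost's internal code.

```python
import math

# Made-up regression targets and predictions.
y_reg = [3.0, 5.0, 2.0]
p_reg = [2.0, 5.0, 4.0]
rmse = math.sqrt(sum((p - y) ** 2 for y, p in zip(y_reg, p_reg)) / len(y_reg))
mae = sum(abs(p - y) for y, p in zip(y_reg, p_reg)) / len(y_reg)

# Made-up binary labels and predicted probabilities.
y_cls = [1, 0, 1, 1]
p_cls = [0.9, 0.2, 0.4, 0.7]
logloss = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(y_cls, p_cls)
) / len(y_cls)
# error: predictions above the 0.5 threshold count as the positive class.
error = sum((p > 0.5) != (y == 1) for y, p in zip(y_cls, p_cls)) / len(y_cls)

print(f"rmse={rmse:.4f} mae={mae:.4f} logloss={logloss:.4f} error={error:.2f}")
```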

In [None]:
import numpy as np
import pandas as pd 
import xgboost as xgb
from sklearn.metrics import accuracy_score
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
import warnings
warnings.filterwarnings('ignore')

In [None]:
data = '../input/tabular-playground-series-oct-2021/train.csv'
df = pd.read_csv(data)
test=pd.read_csv('../input/tabular-playground-series-oct-2021/test.csv')

In [None]:
test

In [None]:
X = df.drop(['loss','id'], axis=1)
test=test.drop('id',axis=1)
y = df['loss']


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 0)

# The most common hyperopt search-space expressions are -

* hp.choice(label, options) : Returns one of the options, which should be a list or tuple.

* hp.randint(label, upper) : Returns a random integer in the range [0, upper).

* hp.uniform(label, low, high) : Returns a value uniformly distributed between low and high.

* hp.quniform(label, low, high, q) : Returns a value like round(uniform(low, high) / q) * q.

* hp.normal(label, mu, sigma) : Returns a real value that is normally distributed with mean mu and standard deviation sigma.
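
To get a feel for what these expressions produce, the sketch below imitates them with Python's `random` module. These are stand-in functions written for illustration, not hyperopt's own code, and they drop the `label` argument.

```python
import random

rng = random.Random(0)  # fixed seed so the draws are repeatable


def choice(options):
    return rng.choice(options)


def randint(upper):
    return rng.randrange(upper)  # integer in [0, upper)


def uniform(low, high):
    return rng.uniform(low, high)


def quniform(low, high, q):
    # Snap a uniform draw to the nearest multiple of q; the result is a float.
    return float(round(rng.uniform(low, high) / q) * q)


def normal(mu, sigma):
    return rng.gauss(mu, sigma)


n_estimators_draws = [quniform(5000, 10000, 1000) for _ in range(5)]
depth_draws = [quniform(3, 18, 1) for _ in range(5)]
print(n_estimators_draws)
print(depth_draws)
```

Note that quniform yields floats such as 6000.0 rather than integers, so integer-valued hyperparameters need an `int()` cast before they are passed to a model.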

In [None]:
space = {'max_depth': hp.quniform('max_depth', 3, 18, 1),
         'gamma': hp.uniform('gamma', 1, 9),
         'reg_alpha': hp.quniform('reg_alpha', 50, 150, 1),
         'reg_lambda': hp.quniform('reg_lambda', 40, 100, 1),
         'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
         'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
         'n_estimators': hp.quniform('n_estimators', 5000, 10000, 1000),
         'tree_method': 'gpu_hist',
         # subsample must lie in (0, 1]; 0.5-1 is the typical range.
         'subsample': hp.uniform('subsample', 0.5, 1),
         'learning_rate': hp.uniform('learning_rate', 0.000001, 1),
         'seed': 0
         }



In [None]:
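from sklearn.metrics import mean_squared_error

def objective(space):
    # quniform returns floats, so integer-valued parameters are cast with int().
    # colsample_bytree, subsample and learning_rate are fractions and stay floats.
    model = xgb.XGBRegressor(
        n_estimators=int(space['n_estimators']),
        max_depth=int(space['max_depth']),
        gamma=space['gamma'],
        reg_alpha=int(space['reg_alpha']),
        reg_lambda=int(space['reg_lambda']),
        min_child_weight=int(space['min_child_weight']),
        colsample_bytree=space['colsample_bytree'],
        subsample=space['subsample'],
        learning_rate=space['learning_rate'],
        tree_method=space['tree_method'],
        random_state=space['seed'],
        # XGBoost >= 1.6 takes these in the constructor;
        # passing them to fit() is deprecated.
        eval_metric="rmse",
        early_stopping_rounds=10)

    evaluation = [(X_train, y_train), (X_test, y_test)]
    model.fit(X_train, y_train, eval_set=evaluation, verbose=False)

    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print("SCORE:", mse)
    # fmin minimizes 'loss', so return the error itself, not its negative.
    return {'loss': mse, 'status': STATUS_OK}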

In [None]:

trials = Trials()

best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 10,
                        trials = trials)

In [None]:
best_hyperparams
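
The values found by `fmin` for `hp.quniform` parameters come back as floats (for example 11.0), so they need a cast before being reused. Below is a minimal sketch using a hypothetical result dictionary. The numbers are made up, not the output of the search above.

```python
# Hypothetical fmin result; the values are made up for illustration.
best = {
    "max_depth": 11.0,
    "n_estimators": 6000.0,
    "min_child_weight": 10.0,
    "gamma": 7.56,
    "learning_rate": 0.035,
}

# Parameters that XGBoost expects as integers.
int_params = {"max_depth", "n_estimators", "min_child_weight"}

clean = {k: int(v) if k in int_params else v for k, v in best.items()}
print(clean)
```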

Public score: 7.87

{'n_estimators': 5000,
 'learning_rate': 0.02,
 'subsample': 0.5,
 'colsample_bytree': 0.7,
 'max_depth': 6,
 'booster': 'gbtree',
 'tree_method': 'gpu_hist',
 'reg_lambda': 60,
 'reg_alpha': 60,
 'n_jobs': 4}


In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

params = {'colsample_bytree': 0.824275003154649,
          'gamma': 7.565239251710113,
          'learning_rate': 0.035137857511338515,
          'max_depth': 11,
          'min_child_weight': 10,
          'n_estimators': 6000,
          'reg_alpha': 54,
          'reg_lambda': 54,
          'subsample': 0.7862419047343154}
params['tree_method'] = 'gpu_hist'
# XGBoost >= 1.6 takes eval_metric in the constructor; passing it to fit() is deprecated.
params['eval_metric'] = 'rmse'

splits = 12
# StratifiedKFold needs a discrete target; use KFold if the target is continuous.
stf = StratifiedKFold(n_splits=splits, shuffle=True)
oof = np.zeros(X.shape[0])  # out-of-fold predictions
prediction = 0              # test predictions averaged over folds

for num, (train_id, valid_id) in enumerate(stf.split(X, y)):
    # split() yields positional indices, so index with iloc rather than loc.
    X_train, X_valid = X.iloc[train_id], X.iloc[valid_id]
    y_train, y_valid = y.iloc[train_id], y.iloc[valid_id]

    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train,
              eval_set=[(X_train, y_train), (X_valid, y_valid)],
              verbose=0)

    prediction += model.predict(test) / splits
    oof[valid_id] = model.predict(X_valid)
    oof[oof < 0] = 0  # the loss cannot be negative

    fold_rmse = np.sqrt(mean_squared_error(y_valid, oof[valid_id]))
    print(f"Fold {num} RMSE: {fold_rmse}")

sub=pd.read_csv('../input/tabular-playground-series-oct-2021/sample_submission.csv')
sub['loss']=prediction
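
The out-of-fold pattern used in the loop above can be tried without a GPU or the competition data. The sketch below is an assumption-laden stand-in: it uses synthetic data, a manual K-fold split, and a NumPy least-squares fit in place of the XGBoost model. All names (`X_demo`, `y_demo`, `X_new`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (made up): y is a noisy linear function of X.
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)
X_new = rng.normal(size=(10, 3))  # plays the role of the test set

n_splits = 4
folds = np.array_split(rng.permutation(len(X_demo)), n_splits)
oof = np.zeros(len(X_demo))      # one out-of-fold prediction per training row
test_pred = np.zeros(len(X_new))  # test predictions averaged over folds

for k, valid_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    # A least-squares fit stands in for the XGBoost model.
    coef, *_ = np.linalg.lstsq(X_demo[train_idx], y_demo[train_idx], rcond=None)
    oof[valid_idx] = X_demo[valid_idx] @ coef
    test_pred += X_new @ coef / n_splits

oof_rmse = np.sqrt(np.mean((y_demo - oof) ** 2))
print(f"OOF RMSE: {oof_rmse:.4f}")
```

Every training row is predicted exactly once by a model that never saw it, so the RMSE over `oof` is an honest estimate of generalization error, while `test_pred` is the average of the fold models.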

In [None]:
sub.to_csv('submission.csv',index=False)