# XGBoost Library

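XGBoost is an implementation of gradient-boosted decision tree ensembles. The tree ensemble model consists of a set of classification and regression trees (CART). A CART differs slightly from an ordinary decision tree, in which each leaf contains only a decision value: in a CART, each leaf holds a real-valued score, and the scores from all trees are summed to give the final prediction. XGBoost provides these advantages: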
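To make the idea of leaf scores concrete, here is a minimal pure-Python sketch. It is not how XGBoost is implemented internally. The two hand-written trees, their split rules, their leaf scores, and the feature names age and uses_computer_daily are all hypothetical; the only point is that each tree maps an input to a real-valued leaf score and the ensemble prediction is the sum of those scores.

```python
# Each "tree" is a function that routes an example to a leaf and returns
# that leaf's real-valued score. Split rules and scores are made up.

def tree_1(x):
    # One split on a numeric feature.
    if x["age"] < 15:
        return 2.0
    return -1.0

def tree_2(x):
    # One split on a boolean feature.
    if x["uses_computer_daily"]:
        return 0.9
    return -0.9

def ensemble_score(x, trees):
    # The ensemble prediction is the sum of the leaf scores of all trees.
    return sum(tree(x) for tree in trees)

trees = [tree_1, tree_2]
child = {"age": 10, "uses_computer_daily": True}
adult = {"age": 40, "uses_computer_daily": False}

child_score = ensemble_score(child, trees)  # 2.0 + 0.9
adult_score = ensemble_score(adult, trees)  # -1.0 + -0.9
print(child_score, adult_score)
```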

Speed and performance : The core library is written in C++, and it is comparatively faster than other ensemble classifiers.

Parallelizable algorithms : Because the core XGBoost algorithm is parallelizable, it can harness the power of multi-core computers. It can also run on GPUs and across networks of computers, making it feasible to train on very large datasets.

Predictive performance : It has shown strong results on a variety of machine learning benchmark datasets.

Tuning parameters : XGBoost has built-in support for cross-validation, regularization, user-defined objective functions, missing values, and tree parameters, and it offers a scikit-learn compatible API.


### XGBoost in action

To use XGBoost, import the xgboost package as xgb.

```python
import xgboost as xgb

# Read in the training and test data
dtrain = xgb.DMatrix('../../../data/agaricus.train')
dtest = xgb.DMatrix('../../../data/agaricus.test')
# Specify parameters via a dict; 'verbosity': 0 silences log messages
# (it replaces the deprecated 'silent' flag)
param = {'max_depth': 2, 'eta': 1, 'verbosity': 0, 'objective': 'binary:logistic'}
num_round = 2  # number of boosting rounds
bst = xgb.train(param, dtrain, num_round)
```
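The binary:logistic objective means the model's raw output (a sum of tree outputs) is passed through the logistic function to give a probability. The sketch below illustrates that mapping in plain Python, and shows how a smaller eta shrinks each tree's contribution. The leaf outputs are made-up values for a single example, the helper name predict_proba is hypothetical, and any base score is ignored, so this is a simplified picture rather than XGBoost's exact computation.

```python
import math

def sigmoid(margin):
    # Logistic function: maps a real-valued margin to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-margin))

def predict_proba(leaf_outputs, eta=1.0):
    # Sum the (hypothetical) per-tree leaf outputs, each shrunk by eta,
    # then squash the total margin into a probability.
    margin = sum(eta * w for w in leaf_outputs)
    return sigmoid(margin)

# Two boosting rounds -> two trees, each contributing one leaf output.
leaf_outputs = [1.5, 0.8]  # made-up values for one example
p_full = predict_proba(leaf_outputs, eta=1.0)
p_shrunk = predict_proba(leaf_outputs, eta=0.3)
print(round(p_full, 4), round(p_shrunk, 4))
```

With the same trees, the smaller step size moves the prediction less far from 0.5, which is why a lower eta usually needs more boosting rounds.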

### Exercise

Import the XGBoost library as xgb, train a model on the dataset above, and print the predictions preds.

```python
import xgboost as xgb

# Read in the training and test data
dtrain = xgb.DMatrix('../../../data/agaricus.train')
dtest = xgb.DMatrix('../../../data/agaricus.test')
# Specify parameters via a dict; 'verbosity': 0 replaces the deprecated 'silent' flag
param = {'max_depth': 2, 'eta': 1, 'verbosity': 0, 'objective': 'binary:logistic'}
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# Make predictions (probabilities of the positive class)
preds = bst.predict(dtest)
preds
```

```
[11:29:10] 6513x127 matrix with 143286 entries loaded from ../../../data/agaricus.train
[11:29:10] 1611x127 matrix with 35442 entries loaded from ../../../data/agaricus.test

array([0.28583017, 0.9239239 , 0.28583017, ..., 0.9239239 , 0.05169873,
       0.9239239 ], dtype=float32)
```

### Solution

Run the cell above.
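The values in preds are probabilities of the positive class, not class labels. A common next step is to threshold them at 0.5 and compare against the true labels. The sketch below does this in plain Python using the six probabilities visible in the output above; the labels list is invented purely for illustration and is not the real agaricus test data.

```python
# The six predicted probabilities visible in the output above.
preds = [0.28583017, 0.9239239, 0.28583017, 0.9239239, 0.05169873, 0.9239239]
# Hypothetical true labels, invented for illustration only.
labels = [0, 1, 0, 1, 0, 0]

threshold = 0.5
# Probability above the threshold -> class 1, otherwise class 0.
predicted_classes = [1 if p > threshold else 0 for p in preds]
errors = sum(1 for c, y in zip(predicted_classes, labels) if c != y)
error_rate = errors / len(labels)
print(predicted_classes, error_rate)
```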

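Let us try regression with XGBoost on the Boston Housing data. Note that load_boston was removed in scikit-learn 1.2, so the code below requires an older scikit-learn release.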



### Exercise

In this exercise, create an XGBoost regressor to predict the Boston Housing median value (MEDV) and fit it to the dataset. Use k-fold cross-validation to split the data before fitting the model.


```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

rng = np.random.RandomState(31337)
print("Boston Housing: regression")
boston = load_boston()
y = boston['target']
X = boston['data']
xgbr_params = {'objective': 'reg:squarederror'}

# Two-fold cross-validation: fit on one half, report MSE on the other
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor(**xgbr_params).fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print(mean_squared_error(actuals, predictions))

print("Parameter optimization")
xgb_model = xgb.XGBRegressor(**xgbr_params)
clf = GridSearchCV(xgb_model,
                   {'max_depth': [2, 4, 6],
                    'n_estimators': [50, 100, 200]}, verbose=1)
clf.fit(X, y)
# best_score_ is the mean cross-validated R^2, the regressor's default score
print(clf.best_score_)
print(clf.best_params_)
```

```
Boston Housing: regression
22.247195111946144
9.914663303584955
Parameter optimization
Fitting 3 folds for each of 9 candidates, totalling 27 fits
0.5984879606490934
{'max_depth': 4, 'n_estimators': 100}

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    1.4s finished
```


### Solution

Run the cell above.
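To see what the k-fold loop is doing, the following pure-Python sketch splits row indices into folds, fits a trivial baseline on each training part (it always predicts the training mean), and reports the mean squared error on the held-out part. The helper kfold_indices is a hypothetical stand-in written for this illustration, not scikit-learn's KFold implementation, and the target values are invented.

```python
import random

def kfold_indices(n_samples, n_splits, seed=0):
    # Shuffle the row indices once, then deal them into n_splits folds.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

def mean_squared_error(actuals, predictions):
    # Average of the squared differences between targets and predictions.
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals)

# Toy regression targets, invented for illustration.
y = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]

fold_mse = []
for train_idx, test_idx in kfold_indices(len(y), n_splits=2):
    # Baseline "model": always predict the mean of the training targets.
    baseline = sum(y[i] for i in train_idx) / len(train_idx)
    predictions = [baseline] * len(test_idx)
    actuals = [y[i] for i in test_idx]
    fold_mse.append(mean_squared_error(actuals, predictions))
print(fold_mse)
```

Each row lands in exactly one test fold, so every observation is used for evaluation once and for training in the remaining folds.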

## Using GridSearch with XGBoost

Let us use GridSearchCV to tune the parameters for regression.

### Exercise

Run GridSearchCV on XGBoost to fine-tune its parameters. Use n_estimators from 100 to 1000 and max_depth from 2 to 6.


```python
cv_params = {
    'n_estimators': np.arange(100, 1001, 100),  # 100, 200, ..., 1000
    'max_depth': np.arange(2, 7),               # 2 to 6 inclusive (the stop value is exclusive)
}

xgbr_params = {'objective': 'reg:squarederror', 'n_jobs': -1, 'random_state': 4444,
               'min_child_weight': 1, 'eta': 0.3, 'subsample': 0.8, 'gamma': 0.5,
               'colsample_bytree': 0.8}
```




### Solution

```python
reg = GridSearchCV(
    xgb.XGBRegressor(**xgbr_params),
    param_grid=cv_params,
    scoring='r2',
    cv=5,
    n_jobs=-1,
    return_train_score=True,
    verbose=3,
)

reg.fit(X, y)
print(reg.best_score_)
print(reg.best_params_)
```
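A grid search simply tries every combination of the listed values, so its cost grows with the product of the grid sizes. The sketch below enumerates the combinations in plain Python with itertools.product, assuming the grid stated in the exercise (n_estimators from 100 to 1000 in steps of 100, max_depth from 2 to 6 inclusive) and the 5 folds used in the solution. It only counts the work; it does not fit any model.

```python
from itertools import product

cv_params = {
    "n_estimators": list(range(100, 1001, 100)),  # 100, 200, ..., 1000
    "max_depth": list(range(2, 7)),               # 2, 3, 4, 5, 6
}

# Cartesian product of all parameter values, one dict per candidate.
names = list(cv_params)
candidates = [dict(zip(names, values)) for values in product(*cv_params.values())]

n_folds = 5
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

Because every candidate is fitted once per fold, widening either range multiplies the running time, which is worth keeping in mind before adding more parameters to the grid.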

## More Tuning in XGBoost

Most parameters in XGBoost concern the bias-variance tradeoff. The best model should balance model complexity against predictive accuracy.

The XGBoost documentation describes two scenarios.

### Handling Overfitting

When you observe high training accuracy but low test accuracy, you have likely run into overfitting.

There are, in general, two ways to control overfitting in XGBoost:

The first way is to control model complexity directly, using max_depth, min_child_weight and gamma.

The second way is to add randomness to make training robust to noise, using subsample and colsample_bytree. You can also reduce the step size eta; remember to increase num_round when you do so.
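The advice above can be sketched as a small helper that takes a parameter dict and returns a more conservative copy. The function name make_conservative, the starting values, and the specific adjustments (depth reduced by 2, a shrink factor of 3, and so on) are illustrative assumptions, not recommendations from the XGBoost documentation; only the parameter names are the ones discussed above.

```python
def make_conservative(params, num_round, shrink=3):
    """Return a more conservative copy of params and a larger num_round."""
    p = dict(params)  # copy, so the caller's dict is left unchanged
    # 1) Control model complexity directly.
    p["max_depth"] = max(1, p["max_depth"] - 2)         # shallower trees
    p["min_child_weight"] = p["min_child_weight"] * 5   # require heavier leaves
    p["gamma"] = p["gamma"] + 1.0                       # require more gain per split
    # 2) Add randomness by sampling rows and columns.
    p["subsample"] = min(p["subsample"], 0.8)
    p["colsample_bytree"] = min(p["colsample_bytree"], 0.8)
    # 3) Reduce the step size and compensate with more boosting rounds
    #    (heuristic assumption: keep eta * num_round roughly constant).
    p["eta"] = p["eta"] / shrink
    return p, num_round * shrink

# Hypothetical starting point.
base_params = {"max_depth": 6, "min_child_weight": 1, "gamma": 0.0,
               "subsample": 1.0, "colsample_bytree": 1.0, "eta": 0.3}

tuned_params, tuned_rounds = make_conservative(base_params, num_round=100)
print(tuned_params, tuned_rounds)
```

In practice each of these settings would be chosen by validation rather than by a fixed rule; the helper only shows which direction each parameter moves when the goal is to reduce overfitting.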

### Working with Imbalanced Dataset

An imbalanced dataset can affect the training of an XGBoost model. There are two ways to handle it, depending on your goal:

<ul>
<li>If you care only about the overall performance metric (AUC) of your predictions, balance the positive and negative weights via scale_pos_weight and use AUC for evaluation.</li>
<li>If you care about predicting the right probability, you cannot re-balance the dataset. Instead, set the parameter max_delta_step to a finite number (say 1) to help convergence.</li>
</ul>
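For the first case, a commonly used starting value for scale_pos_weight is the ratio of negative to positive examples. The sketch below computes that ratio in plain Python and places it in a parameter dict. The label list is hypothetical, and the ratio is a heuristic starting point to be validated, not a guaranteed optimum.

```python
# Hypothetical binary labels: 1 = positive (rare), 0 = negative (common).
labels = [0] * 90 + [1] * 10

n_pos = sum(1 for y in labels if y == 1)
n_neg = len(labels) - n_pos

# Heuristic: weight each positive example by the negative-to-positive ratio.
scale_pos_weight = n_neg / n_pos

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",              # evaluate ranking quality with AUC
    "scale_pos_weight": scale_pos_weight,
}
print(params)
```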

### Exercise

Use an XGBoost classifier on the Carseats dataset. After tuning the hyperparameters, predict the binary class High (0 or 1).

```python
df3 = pd.read_csv('../../../data/Carseats.csv').drop('Unnamed: 0', axis=1)
# Binary target: 1 if Sales exceeds 8, else 0
df3['High'] = df3.Sales.map(lambda x: 1 if x > 8 else 0)
# Encode the categorical columns as integers
df3.ShelveLoc = pd.factorize(df3.ShelveLoc)[0]
df3.Urban = df3.Urban.map({'No': 0, 'Yes': 1})
df3.US = df3.US.map({'No': 0, 'Yes': 1})
df3.info()

# Sales is dropped from the features because High is derived from it
X = df3.drop(['Sales', 'High'], axis=1)
y = df3.High

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# 'verbosity': 0 replaces the deprecated 'silent' flag
xgbc_params = {'max_depth': 2, 'verbosity': 0, 'objective': 'binary:logistic',
               'n_jobs': -1, 'random_state': 4444, 'min_child_weight': 1,
               'eta': 0.3, 'subsample': 0.8, 'gamma': 0.5, 'colsample_bytree': 0.8}
clf = xgb.XGBClassifier(**xgbc_params)
# Fit on the training split only, so the test rows stay unseen
clf.fit(X_train, y_train)
print('Test accuracy is', clf.score(X_test, y_test))
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 12 columns):
Sales          400 non-null float64
CompPrice      400 non-null int64
Income         400 non-null int64
Advertising    400 non-null int64
Population     400 non-null int64
Price          400 non-null int64
ShelveLoc      400 non-null int64
Age            400 non-null int64
Education      400 non-null int64
Urban          400 non-null int64
US             400 non-null int64
High           400 non-null int64
dtypes: float64(1), int64(11)
memory usage: 37.6 KB
```

Fitting on the full X and y instead (clf.fit(X, y)) lets the model see the test rows during training, which inflates the score; a run done that way reported a test accuracy of 0.945.


### Solution

Run the cell above.
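A test score is only informative when the test rows were held out of the fit. The toy sketch below makes this visible with a deliberately extreme model that memorizes its training rows. The Memorizer class and the data are invented for illustration and have nothing to do with XGBoost or the Carseats data; the same reasoning applies to any model scored on rows it was trained on.

```python
class Memorizer:
    """Toy classifier: remembers every training row, predicts 0 otherwise."""

    def fit(self, X, y):
        self.seen = {tuple(row): label for row, label in zip(X, y)}
        return self

    def predict(self, X):
        return [self.seen.get(tuple(row), 0) for row in X]

    def score(self, X, y):
        # Accuracy: fraction of predictions that match the true labels.
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

# Made-up data: one feature per row; the label is 1 for multiples of 3.
X = [[i] for i in range(20)]
y = [1 if i % 3 == 0 else 0 for i in range(20)]

# Hold out every second row as the test set.
X_train, y_train = X[0::2], y[0::2]
X_test, y_test = X[1::2], y[1::2]

leaky = Memorizer().fit(X, y).score(X_test, y_test)               # test rows seen in training
honest = Memorizer().fit(X_train, y_train).score(X_test, y_test)  # test rows held out
print(leaky, honest)
```

The leaky score is perfect only because the model has already seen the answers; the held-out score is the one that says something about new data.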