# Extreme Gradient Boosting

### Importing Libraries

In [1]:
#Importing required libraries
import pandas as pd 
import numpy as np

### Loading the dataset

In [2]:
#reading the data
data=pd.read_csv('data_cleaned.csv')

In [3]:
#first five rows of the data
data.head()

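| Unnamed: 0 | Survived | Age | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | SibSp_0 | SibSp_1 | ... | Parch_0 | Parch_1 | Parch_2 | Parch_3 | Parch_4 | Parch_5 | Parch_6 | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 7.925 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 53.1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 8.05 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |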


### Separating independent and dependent variables

In [4]:
#independent variables
x = data.drop(['Survived'], axis=1)

#dependent variable
y = data['Survived']
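
The preview shows an `Unnamed: 0` column, which usually appears when a DataFrame's index was written to the CSV. Whether that is the case for `data_cleaned.csv` is an assumption; if it is, the column is a row identifier rather than a feature and could be dropped along with the target. A minimal sketch on a tiny stand-in frame (the `demo` names and values are illustrative, not the real file):

```python
import pandas as pd

# Tiny stand-in for the real data; values are illustrative only.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# Drop the target and, assuming it is a leftover index, the 'Unnamed: 0' column.
demo_x = demo.drop(columns=["Survived", "Unnamed: 0"])
demo_y = demo["Survived"]

print(list(demo_x.columns))
```

The scores reported in this notebook were obtained with `Unnamed: 0` left in `x`, so dropping it may change them.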

### Creating the train and test dataset

In [5]:
#import the train-test split
from sklearn.model_selection import train_test_split

In [6]:
#divide into train and test sets
train_x,test_x,train_y,test_y = train_test_split(x,y, random_state = 101, stratify=y)
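
Because no `test_size` is passed, `train_test_split` holds out 25% of the rows by default, and `stratify=y` keeps the class ratio the same in both parts. The sketch below checks both behaviours on synthetic labels (the `demo_` arrays and their sizes are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 40% positive class (illustrative only).
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.array([1] * 40 + [0] * 60)

tr_X, te_X, tr_y, te_y = train_test_split(
    demo_X, demo_y, random_state=101, stratify=demo_y
)

print(len(tr_y), len(te_y))      # sizes of the train and test parts
print(tr_y.mean(), te_y.mean())  # share of positives in each part
```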

## Install XGBoost

Run the following command in a terminal or command prompt:

_**$ pip install xgboost**_

## Building an XGBoost Model

In [7]:
#importing the XGBoost classifier
from xgboost import XGBClassifier

In [8]:
#creating an extreme Gradient boosting instance
clf = XGBClassifier(random_state=96)

In [9]:
#training the model
clf.fit(train_x,train_y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=96, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

In [10]:
#calculating score on training data
clf.score(train_x, train_y)

0.87874251497005984

In [11]:
#calculating score on test data
clf.score(test_x, test_y)

0.82511210762331844
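
For a classifier, `score` returns mean accuracy: the fraction of rows whose predicted label matches the true label. A minimal sketch of that computation with made-up labels (`accuracy`, `y_true`, and `y_pred` are hypothetical and not taken from the model above):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(accuracy(y_true, y_pred))  # 3 of the 4 labels match
```

The gap between the training score (about 0.879) and the test score (about 0.825) suggests the model fits the training rows somewhat better than rows it has not seen.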

# Hyperparameter Tuning

Most of these parameters work the same way as in GBDT:

1. **n_estimators:** Total number of trees (boosting rounds)
2. **learning_rate:** Shrinks the contribution of each tree to the final prediction; smaller values usually need more trees
3. **random_state:** The random number seed, so that the same random numbers are generated on every run
4. **max_depth:** Maximum depth to which a tree can grow (a stopping criterion)
5. **subsample:** The fraction of observations randomly sampled for each tree
6. **objective:** Defines the loss function (*binary:logistic* is for binary classification and outputs probabilities, *reg:logistic* is logistic regression, *reg:linear* is for regression with squared error and is named *reg:squarederror* in newer XGBoost releases)
7. **colsample_bylevel:** The fraction of features randomly sampled at each depth level of a tree
8. **colsample_bytree:** The fraction of features randomly sampled for each tree
In [12]:
#set parameters
clf = XGBClassifier(random_state=96, colsample_bytree=0.7, max_depth=6)

In [13]:
#training the model
clf.fit(train_x,train_y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.7, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=6, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=96, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

In [14]:
#calculating score on test data
clf.score(test_x, test_y)

0.80717488789237668

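## Regularization

1. **gamma:** Minimum reduction in loss required to split a node further; larger values make the model more conservative
2. **reg_alpha:** L1 regularization on leaf weights; can push some leaf weights to exactly 0
3. **reg_lambda:** L2 regularization on leaf weights; shrinks leaf weights more smoothly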
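
The effect of these three parameters can be seen in the leaf-weight and split-gain formulas from the XGBoost paper, where G and H are the sums of the first and second derivatives of the loss over the rows in a node. The sketch below is a plain-Python illustration, not the library's code: the function names and example numbers are made up, and modelling the L1 term as soft-thresholding of G is an assumption about how `reg_alpha` is applied.

```python
def threshold_l1(g, alpha):
    """Soft-threshold the gradient sum: shrink |g| by alpha, floor at 0."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def leaf_weight(g, h, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf weight for gradient sum g and hessian sum h."""
    return -threshold_l1(g, reg_alpha) / (h + reg_lambda)

def node_score(g, h, reg_alpha=0.0, reg_lambda=1.0):
    """Quality of a node; larger means a lower regularized loss."""
    return threshold_l1(g, reg_alpha) ** 2 / (h + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right,
               gamma=0.0, reg_alpha=0.0, reg_lambda=1.0):
    """Loss reduction from a split minus gamma; split only if positive."""
    parent = node_score(g_left + g_right, h_left + h_right, reg_alpha, reg_lambda)
    children = (node_score(g_left, h_left, reg_alpha, reg_lambda)
                + node_score(g_right, h_right, reg_alpha, reg_lambda))
    return 0.5 * (children - parent) - gamma

# A larger reg_lambda shrinks the weight; a large enough reg_alpha zeroes it.
print(leaf_weight(-2.0, 3.0))
print(leaf_weight(-2.0, 3.0, reg_lambda=5.0))
print(leaf_weight(-2.0, 3.0, reg_alpha=2.0))

# A weak split passes with gamma=0 but is rejected with gamma=0.1.
print(split_gain(-0.6, 2.0, 0.1, 2.0, gamma=0.0))
print(split_gain(-0.6, 2.0, 0.1, 2.0, gamma=0.1))
```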

In [15]:
clf = XGBClassifier(gamma=0.1, random_state=96)

In [16]:
#training the model
clf.fit(train_x,train_y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0.1, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=96, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

In [17]:
#calculating score on test data
clf.score(test_x, test_y)

0.82511210762331844