Hello Fellow Kagglers!

This is **Abel Ofinni** from **Nigeria**, a **Data Science Nigeria AI+ Community Member**. This Kernel teaches XGBoost in a simple style for beginners, so I will try to make it as beginner-friendly as I can. It's an excerpt from a recent training, which I also intend to save here for future reference. **Kindly upvote this Kernel if you find it helpful.** Thanks.

**Prologue**

XGBoost is an implementation of gradient boosting that is being used to win Machine Learning competitions. It is powerful, but it can be hard to get started with. In this guide you will find a 7-part crash course on **XGBoost with Python**. This mini course is designed for Python machine learning beginners who are already comfortable with scikit-learn and the SciPy ecosystem.

Now, let’s get started.


**Course Overview** (what to expect)

This course stems from a 7-day crash course on XGBoost from one of my Mentors. I will mention him once he approves. The course is divided into 7 parts. Each topic is designed to take the average developer about 30 minutes. You might finish some much sooner, and for others you may choose to go deeper and spend more time on research. You can complete each part as quickly or as slowly as you like. A comfortable schedule is to complete one lesson per day over a one-week period, which I highly recommend.

The 7 Topics you will cover are as follows:

1. **Introduction to Gradient Boosting.**

2. **Introduction to XGBoost.**

3. **Develop Your First XGBoost Model.**

4. **Monitor Performance and Early Stopping.**

5. **Feature Importance with XGBoost.**

6. **How to Configure Gradient Boosting.**

7. **XGBoost Hyperparameter Tuning.**

We will be using the UCI Machine Learning Pima-Indians-Diabetes Dataset in this short tutorial.

Grab your **Coffee** and let's explore Extreme Gradient Boosting (XGBoost) together.


Topic 01: **INTRODUCTION TO GRADIENT BOOSTING**

Gradient boosting is one of the most powerful techniques for building predictive models. The idea of boosting came out of the question of whether a weak learner can be modified to become better. The first realization of boosting that saw great success in application was **Adaptive Boosting**, or **AdaBoost** for short. The weak learners in **AdaBoost** are **Decision Trees** with a single split, called **Decision Stumps** for their shortness. AdaBoost and related algorithms were recast in a statistical framework and became known as **Gradient Boosting Machines** (GBM). The statistical framework cast boosting as a *Numerical Optimization* problem where the objective is to minimize the loss of the model by adding *Weak Learners* using a gradient descent-like procedure, hence the name. The Gradient Boosting *algorithm* involves three elements:

1. A loss function to be optimized, such as cross entropy for classification or mean squared error for regression problems.

2. A weak learner to make predictions, such as a greedily constructed decision tree.

3. An additive model, used to add weak learners to minimize the loss function.

New weak learners are added to the model in an effort to correct the residual errors of all previous trees. The result is a powerful predictive modeling algorithm, perhaps more powerful than random forest.
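To make the three elements concrete, here is a minimal from-scratch sketch, not how XGBoost itself is implemented. It assumes a squared-error loss, one input feature, and decision stumps as the weak learners; the function names (`fit_stump`, `predict_stump`, `gradient_boost`) and the toy data are made up for illustration.

```python
import numpy as np


def fit_stump(x, residuals):
    """Find the single-split regression stump that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # every value except the largest
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    """Predict the left or right leaf value depending on the split."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Additive model: start from the mean, then add stumps fitted to residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual
        residuals = y - prediction
        stump = fit_stump(x, residuals)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Toy one-feature regression data (made up for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

base, stumps, fitted = gradient_boost(x, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print("MSE with mean only: %.4f" % mse_start)
print("MSE after boosting: %.4f" % mse_end)
```

Each round fits a new stump to what the current model still gets wrong, then adds a shrunken copy of it to the running prediction, so the training error drops as rounds are added.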

Hang on there! Now, let's take a look at the XGBoost implementation of gradient boosting.


Topic 02: **INTRODUCTION TO XGBOOST**

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. XGBoost stands for **eXtreme Gradient Boosting**. It was developed by ***Tianqi Chen*** and is laser-focused on *computational speed* and *model performance*; as such, there are few frills. In addition to supporting all key variations of the technique, its real appeal is the speed provided by the careful engineering of the implementation, including:

* Parallelization of tree construction using all of your CPU cores during training.

* Distributed Computing for training very large models using a cluster of machines.

* Out-of-Core Computing for very large datasets that don’t fit into memory.

* Cache Optimization of data structures and algorithms to make best use of hardware.

Traditionally, gradient boosting implementations are slow because of the sequential nature in which each tree must be constructed and added to the model. The focus on performance in the development of XGBoost has resulted in one of the best predictive modeling algorithms, one that can harness the full capability of your hardware platform, or of very large computers you might rent in the cloud. As such, XGBoost has been a cornerstone in competitive machine learning, being the technique used to win (http://goo.gl/AHkmWx) and recommended by winners (http://goo.gl/sGyGtu).


**We shall be developing our first XGB model right away**

You will need XGBoost installed. See the XGBoost documentation for an installation guide: https://xgboost.readthedocs.io/en/latest/ (in most Python environments, `pip install xgboost` is enough).

If you already have it installed, let's move on.

Topic 03: **DEVELOP YOUR FIRST XGBOOST MODEL**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
from numpy import loadtxt

#Import XGBoost Model
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
#Success
print ('Run Successful')
# Any results you write to the current directory are saved as output.

In [None]:
#Import the Pima Indians Diabetes Dataset
dataset = loadtxt("../input/Diabetes.csv", delimiter=",")
print ("Run Successfully")

Before modelling, we should look at an overview of our dataset. The profile of each feature helps us understand the data.

From such an overview you might have quite a few things to say about our dataset. But to keep things short: we have 0 missing values, so let's proceed.
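One way to produce the kind of overview mentioned above is sketched below. It is a minimal sketch using only NumPy; the helper name `profile_features`, the column names, and the tiny demo array are all made up for illustration and stand in for the real `dataset`.

```python
import numpy as np


def profile_features(data, names):
    """Print count, missing values, min, mean and max for each column."""
    rows = []
    for j, name in enumerate(names):
        column = data[:, j]
        missing = int(np.isnan(column).sum())  # NaN entries count as missing
        rows.append((name, column.size, missing,
                     np.nanmin(column), np.nanmean(column), np.nanmax(column)))
    print("%-8s %6s %8s %10s %10s %10s"
          % ("feature", "count", "missing", "min", "mean", "max"))
    for name, count, missing, low, mean, high in rows:
        print("%-8s %6d %8d %10.3f %10.3f %10.3f"
              % (name, count, missing, low, mean, high))
    return rows


# Tiny illustrative array standing in for `dataset` (3 rows, 3 columns)
demo = np.array([[6.0, 148.0, 1.0],
                 [1.0, 85.0, 0.0],
                 [8.0, 183.0, 1.0]])
summary = profile_features(demo, ["col_0", "col_1", "label"])
```

In this Kernel you would call `profile_features(dataset, names)` with a list of nine column names of your choosing.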

In [None]:
#Split the Dataset into X and Y
X = dataset[:, 0:8]
Y = dataset [:,8]
print ('Ran Successfully')

In [None]:
#Split the Dataset into Train and Test
seed = 7
test_size = 0.33
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=test_size, random_state = seed)
print ('Ran Successfully')

**We will fit the default XGBoost model on our training dataset**

In [None]:
#Let's fit our model on the training data
xgb = XGBClassifier()
xgb.fit(X_train, Y_train)
print('Ran Successfully')

We have trained our default XGB model. Let's use it to make predictions on our test dataset to see how well it performs.

In [None]:
#Predict using our model now
predictions1 = xgb.predict(X_test)

In [None]:
#Evaluate Predictions
accuracy = accuracy_score(Y_test, predictions1)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Our model reaches an **Accuracy Score of 77.95%**. We need to improve it to perform better, so we can conveniently move on to our next topic.

Take a sip, your coffee is now cold or still hot. </Winks/>

Let's move to the next topic

Topic 04: **MONITOR PERFORMANCE AND EARLY STOPPING**

The XGBoost model can evaluate and report on its performance on a test set during training. It supports this capability by specifying both a test dataset and an evaluation metric on the call to `model.fit()` when training the model, and by specifying verbose output (`verbose=True`). In XGBoost 2.0 and later, the evaluation metric and the early stopping setting are passed to the `XGBClassifier` constructor instead, while `eval_set` stays on `fit()`. For example, we can report on the binary classification error rate (`error`) on a standalone test set (`eval_set`) while training an XGBoost model.

We can use this evaluation to stop training once no further improvements have been made to the model. We can do this by setting the `early_stopping_rounds` parameter to the number of consecutive iterations without improvement on the validation dataset after which training is stopped. The full example using the Pima Indians Onset of Diabetes dataset is provided below.


In [None]:
# split data into train and test sets 
seed = 7
test_size = 0.33 
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
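# XGBoost 2.0+ takes eval_metric and early_stopping_rounds in the constructor;
# releases before 1.6 took them as arguments to fit() instead.
#Set eval_metric as logloss, early_stopping_rounds as 5
model = XGBClassifier(eval_metric="logloss", early_stopping_rounds=5)
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_set=eval_set, verbose=True)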
# make predictions for test data 
y_predictions = model.predict(X_test)  
# evaluate predictions
accuracy = accuracy_score(y_test, y_predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
# fit model on training data
#Set eval_metric as error, early_stopping_rounds as 5
# (XGBoost 2.0+ takes both in the constructor; releases before 1.6 took them in fit())
model = XGBClassifier(eval_metric="error", early_stopping_rounds=5)
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_set=eval_set, verbose=True)
# make predictions for test data
y_predictions = model.predict(X_test) 
# evaluate predictions 
accuracy = accuracy_score(y_test, y_predictions) 
print("Accuracy: %.2f%%" % (accuracy * 100.0))
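The stopping rule described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not XGBoost's internal code; the function name `early_stopping_round` and the validation losses are made up, and it assumes a metric where lower is better (such as logloss or error).

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for per-round validation losses.

    Lower is better. Training stops once `patience` rounds in a row
    fail to improve on the best loss seen so far.
    """
    best_round = 0
    for current_round, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = current_round  # new best, reset the waiting period
        elif current_round - best_round >= patience:
            return best_round, current_round  # waited long enough, stop
    return best_round, len(scores) - 1  # never triggered, ran all rounds


# Made-up validation log loss per boosting round (illustration only)
val_logloss = [0.66, 0.61, 0.58, 0.56, 0.55, 0.56, 0.55, 0.57, 0.56, 0.58, 0.59]
best, stopped = early_stopping_round(val_logloss, patience=5)
print("best round: %d, stopped at round: %d" % (best, stopped))
```

With a patience of 5, the loop keeps going for five rounds after the last improvement and then gives up, which is why a larger value trains longer before stopping.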

A subtle introduction of my Mentor.
**Enjoy reading this** https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/

I have some articles you can read to better understand the XGBoost model. I will update the links in the next version of this Kernel. I'm still learning too. Do not hesitate to share any article that could help beginners in the comment section. I'll appreciate it. Thanks in advance.

Let's move on to the next topic.

Topic 05: **FEATURE IMPORTANCE WITH XGBOOST**

A benefit of using ensembles of decision tree methods like gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model. A trained XGBoost model automatically calculates feature importance on your predictive modeling problem. These importance scores are available in the `feature_importances_` member variable of the trained model.

The XGBoost library provides a built-in function to plot features ordered by their importance. The function is called `plot_importance()` and is used in the code below. The importance scores can help us decide what input variables to keep or discard. They can also be used as the basis for automatic feature selection techniques. We will now plot the feature importance scores using the Pima Indians Onset of Diabetes dataset.


In [None]:
# plot feature importance using built-in function 
# fit model on training data
model = XGBClassifier() 
model.fit(X_train, y_train) 
# plot feature importance 
plot_importance(model) 
pyplot.show()

Our model's feature importance is now plotted. It's pretty simple; you just need to know the syntax.
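To show how importance scores can help decide which inputs to keep or discard, here is a small sketch. The helper `rank_features`, the threshold, and the scores are all made up for illustration; in this Kernel you would pass `model.feature_importances_` in place of the made-up array.

```python
import numpy as np


def rank_features(importances, names, threshold):
    """Sort features by importance (highest first) and list those at or
    above `threshold` as candidates to keep."""
    order = np.argsort(importances)[::-1]
    ranked = [(names[i], float(importances[i])) for i in order]
    keep = [name for name, score in ranked if score >= threshold]
    return ranked, keep


# Hypothetical names and made-up scores, for illustration only
names = ["f0", "f1", "f2", "f3"]
scores = np.array([0.10, 0.45, 0.05, 0.40])

ranked, keep = rank_features(scores, names, threshold=0.10)
for name, score in ranked:
    print("%s: %.2f" % (name, score))
print("keep:", keep)
```

The threshold is a judgment call: a sensible way to pick it is to retrain the model on the kept features and check whether the accuracy holds up.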

Topic 06: **HOW TO CONFIGURE GRADIENT BOOSTING**

**Gradient boosting** is one of the most powerful techniques for applied machine learning and as such is quickly becoming one of the most popular. **But how do you configure gradient boosting on your problem?**

A number of configuration heuristics were published in the original gradient boosting papers. They can be summarized as:

* Learning rate or shrinkage (`learning_rate` in XGBoost) should be set to 0.1 or lower, and smaller values will require the addition of more trees.

* The depth of trees (`max_depth` in XGBoost) should be configured in the range of 2 to 8, where not much benefit is seen with deeper trees.

* Row sampling (`subsample` in XGBoost) should be configured in the range of 30% to 80% of the training dataset, and compared to a value of 100% for no sampling.

These are good starting points when configuring your model. A good general configuration strategy is as follows:

1. Run the default configuration and review plots of the learning curves on the training and validation datasets.

2. If the system is overlearning, decrease the learning rate and/or increase the number of trees.

3. If the system is underlearning, speed the learning up to be more aggressive by increasing the learning rate and/or decreasing the number of trees.

Owen Zhang, the former #1 ranked Competitor on Kaggle and now CTO at DataRobot, proposes an interesting strategy to configure XGBoost. He suggests setting the number of trees to a target value such as 100 or 1000, then tuning the learning rate to find the best model. This is an efficient strategy for quickly finding a good model. In the next and final lesson, we will look at an example of tuning the XGBoost hyperparameters.

Check Owen's proposal here - http://www.slideshare.net/odsc/owen-zhangopen-sourcetoolsanddscompetitions1
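As a rough sketch, the heuristics above can be written down as a grid of candidate settings. The parameter names are the XGBoost ones mentioned above, but the specific values are my own illustrative picks inside the suggested ranges, not values from the original papers.

```python
from itertools import product

# Candidate values drawn from the heuristics above; the exact values are
# illustrative choices within the suggested ranges.
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],   # 0.1 or lower
    "max_depth": [2, 4, 6, 8],            # range of 2 to 8
    "subsample": [0.3, 0.5, 0.8, 1.0],    # 30% to 80%, plus 1.0 for no sampling
}

# Expand the grid into every combination of settings
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

print("number of configurations:", len(combinations))
print("first:", combinations[0])
print("last:", combinations[-1])
```

A dictionary shaped like `param_grid` is also what `GridSearchCV` expects, so the same grid can be handed to it together with an `XGBClassifier`. Keep in mind that the number of models to train grows quickly with every value you add.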


Topic 07: **XGBOOST HYPERPARAMETER TUNING**

The scikit-learn framework provides the capability to search combinations of parameters. This capability is provided in the `GridSearchCV` class and can be used to discover the best way to configure the model for top performance on your problem. For example, we can define a grid of the number of trees (`n_estimators`) and tree sizes (`max_depth`) to evaluate, and then evaluate each combination of parameters using 10-fold cross-validation.

We can then review the results to determine the best combination and the general trends in varying the combinations of parameters. This is the best practice when applying XGBoost to your own problems. 

The parameters to consider tuning are:

* The number and size of trees (`n_estimators` and `max_depth`).

* The learning rate and number of trees (`learning_rate` and `n_estimators`).

* The row and column subsampling rates (`subsample`, `colsample_bytree` and `colsample_bylevel`).


Now we should tune our model's learning rate

In [None]:
# Tune learning_rate 
from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import KFold, StratifiedKFold

Grid Search for the best model

In [None]:
#Split Dataset
X = dataset[:,0:8] 
Y = dataset[:,8] 
# grid search 
model = XGBClassifier() 
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3] 
param_grid = dict(learning_rate=learning_rate) 
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) 
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold) 
grid_result = grid_search.fit(X, Y) 

Let's provide a summary report of our results now, as we conclude this tutorial.

In [None]:
# summarize results 
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) 
means = grid_result.cv_results_['mean_test_score'] 
stds = grid_result.cv_results_['std_test_score'] 
params = grid_result.cv_results_['params'] 
for mean, stdev, param in zip(means, stds, params): 
    print("%f (%f) with: %r" % (mean, stdev, param))


**Just before You Go...**

You made it. Well done! Take a moment and look back at how far you have come:

* You learned about the gradient boosting algorithm and the XGBoost library.

* You developed your first XGBoost model.

* You learned how to use advanced features like early stopping and feature importance.

* You learned how to configure gradient boosted models and how to design controlled experiments to tune XGBoost hyperparameters.

Don’t make light of this; you have come a long way in a short amount of time. This is just the beginning of your journey with XGBoost in Python. Keep practicing and developing your skills. I have not stopped learning and sharing what I've learnt. I'm continually inspired by https://www.kaggle.com/bayoadekanmbi and www.datasciencenigeria.org. Thanks to https://www.kaggle.com/afolaborn for his immense contribution too. The learning continues as I proceed in the Data Science journey.


**If you enjoyed this Kernel, kindly upvote** it. It's my first kernel. I look forward to your awesome comments and upvotes. I love you all.

In [None]:
print ('Thank you all for stopping by to learn and comment')