# XGBoost

Extreme Gradient Boosting (XGBoost) was developed at the University of Washington, USA. It is an optimized implementation of gradient boosting.

**Boosting**

Boosting is an ensemble method: it combines a number of weak classifiers, trained in series, into one strong classifier. First, a model is built on the training dataset. Next, a second model is built that tries to correct the errors made by the first. This procedure continues until either the complete training dataset is predicted correctly or the maximum number of models has been added.

![boosting.png](attachment:boosting.png)

**Gradient Boosting**

Gradient boosting is a popular boosting algorithm. Each predictor in gradient boosting corrects the errors of its predecessor. Unlike AdaBoost, which adjusts the weights of the training instances, each predictor is trained using the residual errors of its predecessor as labels.

The Gradient Boosted Trees approach uses CART (Classification and Regression Trees) as its base learner.
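
The residual-fitting idea described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: a single feature, squared-error loss, and one-split "stumps" standing in for full CART trees. The names `fit_stump`, `gradient_boost`, and `mse` are made up for this sketch and are not part of XGBoost or scikit-learn.

```python
# Minimal gradient boosting for squared error, using one-split "stumps"
# as weak learners. All names are illustrative, not a library API.

def fit_stump(xs, residuals):
    """Find the threshold on a single feature that best splits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit stumps in sequence, each one to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.3]

baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
model = gradient_boost(xs, ys)
boosted_mse = mse(ys, [model(x) for x in xs])
print(f"MSE of the mean-only model: {baseline_mse:.3f}")
print(f"MSE after boosting:         {boosted_mse:.3f}")
```

Each round adds only a fraction (`learning_rate`) of the new stump's output, which is the same role the `learning_rate` hyperparameter plays in the XGBoost code later in this notebook.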

* XGBoost is an implementation of gradient boosted decision trees.
* In XGBoost, decision trees are created sequentially.
* Weights play a crucial role in XGBoost: each leaf of a tree is assigned a weight (its output score), and a prediction is the sum of the weights of the leaves an observation falls into across all trees.
* Each new tree is fit to the errors (more precisely, the loss gradients) of the ensemble built so far, so observations that are currently predicted poorly have the most influence on the next tree.
* Combining these individual trees yields a stronger and more accurate model.
* It can be used for regression, classification, ranking, and user-defined prediction problems.
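
To make the role of weights more concrete, here is a small sketch of how a single leaf weight can be computed. It assumes the formulation from the XGBoost paper, where the optimal weight of a leaf is `-G / (H + lambda)`, with `G` and `H` the sums of first and second derivatives of the loss over the observations in that leaf. It also assumes squared-error loss and no L1 penalty. The function name `leaf_weight` and the numbers are invented for illustration.

```python
def leaf_weight(y_true, y_pred, reg_lambda=1.0):
    """Optimal leaf weight for squared-error loss 0.5 * (y_pred - y_true)**2.

    For this loss the gradient of each observation is (y_pred - y_true)
    and the hessian is 1. reg_lambda is the L2 regularization strength.
    """
    grads = [p - y for y, p in zip(y_true, y_pred)]
    hessians = [1.0] * len(y_true)
    return -sum(grads) / (sum(hessians) + reg_lambda)


# Three observations that land in the same leaf; the current ensemble
# predicts 2.0 for each of them (toy numbers).
y_true = [3.0, 5.0, 4.0]
y_pred = [2.0, 2.0, 2.0]

unregularized = leaf_weight(y_true, y_pred, reg_lambda=0.0)
regularized = leaf_weight(y_true, y_pred, reg_lambda=1.0)
print(unregularized)  # 2.0, the mean residual
print(regularized)    # 1.5, shrunk towards zero by the L2 penalty
```

Without regularization the leaf weight is simply the mean residual of the leaf; the L2 term shrinks it, which is one way XGBoost guards against overfitting.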

* XGBoost uses an ensemble strategy based on the gradient boosting algorithm.
* XGBoost is optimized for speed, usability, and performance on large datasets. Its default parameters are a usable starting point, although tuning hyperparameters such as `learning_rate`, `max_depth`, and `n_estimators` usually improves results.
* XGBoost is used for two main reasons:
   * Execution speed
   * Model performance

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import XGBClassifier, plot_tree

# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook requires scikit-learn < 1.2
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error

plt.style.use('ggplot')

In [None]:
# Loading the data (load_boston is imported in the first cell)
boston = load_boston()


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [None]:
# To check the dataset shape
print(boston.data.shape)

(506, 13)


In [None]:
# To see the column names
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [None]:
# To see the description of the dataset
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [None]:
# Build a DataFrame from the 13 feature columns (the target, boston.target, is not included)
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data.head()

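|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |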


In [None]:
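# Split into features X and target y. Because `data` holds only the 13 feature
# columns, the last column (LSTAT) serves as the target here; the median house
# price is available separately as boston.target.
X, y = data.iloc[:, :-1], data.iloc[:, -1]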


In [None]:
# Convert the dataset into a DMatrix, the optimized data structure that XGBoost supports
data_dmatrix = xgb.DMatrix(data=X, label=y)

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
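
As a rough picture of what such a split does, the following plain-Python sketch shuffles row indices with a fixed seed and holds out a fraction of them. It is not scikit-learn's implementation and will not reproduce the exact rows that `train_test_split` selects; `simple_train_test_split` is a hypothetical helper written only for this illustration.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.2, seed=123):
    """Shuffle the row indices reproducibly and hold out a fraction as test data."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# 506 stand-in rows, matching the number of instances in the dataset above.
rows = list(range(506))
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 404 102
```

Using the same seed always yields the same partition, which is why the notebook fixes `random_state`.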


In [None]:
# Instantiate an XGBoost regressor by calling XGBRegressor() with the hyperparameters passed as arguments.
# 'reg:squarederror' replaces the deprecated 'reg:linear' objective (same loss).
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)

In [None]:
# Fit the regressor to the training set and make predictions on the test set
# using the familiar .fit() and .predict() methods
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)




In [None]:
# Compute the RMSE using the mean_squared_error function from sklearn's metrics module
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))


RMSE: 6.455243
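
For reference, the RMSE is the square root of the mean of the squared differences between true and predicted values. A minimal hand-rolled version, using toy numbers rather than the notebook's data, might look like this (the function name `rmse_manual` is made up for the example):

```python
import math


def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Squared errors are 1, 0 and 4, so the result is sqrt(5 / 3).
example_rmse = rmse_manual([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(f"RMSE: {example_rmse:.6f}")  # RMSE: 1.290994
```

Because the errors are squared before averaging, a few large mistakes raise the RMSE more than many small ones.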


In [None]:
# Define the hyperparameters to pass to XGBoost's cv() method
# ('reg:squarederror' replaces the deprecated 'reg:linear' objective)
params = {"objective": "reg:squarederror", "colsample_bytree": 0.3, "learning_rate": 0.1,
          "max_depth": 5, "alpha": 10}

In [None]:
# Run 3-fold cross-validation with cv() and store the results in the cv_results DataFrame
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123)
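
The `early_stopping_rounds=10` argument asks for boosting to stop once the validation metric has not improved for 10 consecutive rounds. The sketch below shows that logic on a made-up list of per-round scores. It is a simplified illustration, not XGBoost's internal code, and `best_round` is a hypothetical helper.

```python
def best_round(scores, patience):
    """Scan per-round scores (lower is better) and stop once the best score
    has not improved for `patience` consecutive rounds.

    Returns (index_of_best_round, best_score).
    """
    best_idx, best_score = 0, scores[0]
    for i, score in enumerate(scores[1:], start=1):
        if score < best_score:
            best_idx, best_score = i, score
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx, best_score


# Invented validation RMSE values, one per boosting round.
scores = [5.0, 4.2, 3.9, 4.0, 4.1, 4.05, 3.8]

print(best_round(scores, patience=3))   # (2, 3.9): stops before seeing round 6
print(best_round(scores, patience=10))  # (6, 3.8): patience never runs out
```

A small `patience` can stop too early, as in the first call, while a large one may never trigger within the available rounds.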



In [None]:
# Extract and print the metric from the final boosting round
print(cv_results["test-rmse-mean"].tail(1))

49    3.995269
Name: test-rmse-mean, dtype: float64
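
The `test-rmse-mean` column holds, for each boosting round, the RMSE averaged over the held-out folds. As a sketch of how rows can be divided into folds, the following plain-Python helper partitions row indices into `nfold` contiguous groups. It omits the shuffling that a real cross-validation routine would typically apply first, and `kfold_indices` is a name invented for this example.

```python
def kfold_indices(n_samples, n_folds):
    """Partition range(n_samples) into n_folds contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# 506 rows and 3 folds, matching the dataset size and nfold used above.
folds = kfold_indices(506, 3)
print([len(fold) for fold in folds])  # [169, 169, 168]

# Each fold serves once as the held-out set; the rest form the training set.
for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    assert len(train) + len(held_out) == 506
```

Averaging the metric over folds gives a more stable estimate than a single train/test split, because every row is used for validation exactly once.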
