## XGBoost Regression

##This is a basic implementation of XGBoost regression. The notebook walks through importing the data, preparing it, setting up an XGBoost regression model, fitting the model, and evaluating it with several methods.

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

##To check that xgboost was imported correctly and is working, run the command below to print its version.

In [None]:
print(xgb.__version__)

0.90


##We are using the red wine quality data from Kaggle. The dataset can be downloaded here: [dataset](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009)

In [None]:
data = pd.read_csv("winequality-red.csv")

##We look at the first few rows of the dataset to get an idea of its contents.

In [None]:
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


##The dataset has a shape of (1599, 12), which means that we have 1599 rows (samples) and 12 columns: 11 features plus the target, which is the last column.

In [None]:
data.shape

(1599, 12)

##In the step below we separate the target from the features, assigning the last column (i.e. the quality column) to y.

In [None]:
X,y = data.iloc[:,:-1],data.iloc[:,-1]

In [None]:
X.shape, y.shape

((1599, 11), (1599,))
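
Selecting the target by position works only while the target stays in the last column. A minimal sketch of the same split done by column name, using a tiny made-up frame in the style of the wine data (the rows below are for illustration, the real notebook loads the CSV instead):

```python
import pandas as pd

# Tiny illustrative frame; the real notebook loads winequality-red.csv instead.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 5],
})

# Split by column name so the result does not depend on column order.
X_toy = toy.drop(columns="quality")
y_toy = toy["quality"]

print(X_toy.shape, y_toy.shape)
```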

##Before we build the model we need to separate the data into training and testing datasets using train_test_split from sklearn. Here the test size is 20% of the original data, and shuffle is set to **True** so that the rows are shuffled before splitting (this is already the default value, so the argument can be omitted).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, shuffle=True)

In [None]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((1279, 11), (320, 11), (1279,), (320,))
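
Because the split is shuffled, every run produces a different partition and therefore slightly different scores. A minimal sketch of making it repeatable with `random_state` (the seed 42 and the stand-in arrays are arbitrary assumptions, not part of the original notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows and features as the wine data.
X_demo = np.arange(1599 * 11).reshape(1599, 11)
y_demo = np.arange(1599)

# Fixing random_state makes the shuffled split reproducible across runs.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=True, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=True, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(y_te, split_b[3]))
```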

#Processing the data

In [None]:
X_train.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
dtype: int64

In [None]:
y_train.isna().sum()

0

##As the count of null values is 0 for every column of the training set, we don't have any missing values to handle.

In [None]:
X_train.dtypes

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
dtype: object

##Check the datatypes of the features. If any of them is not an integer or a float, we would have to convert it to an integer or float accordingly. Here we don't need to do anything like that.
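
One way to list any non-numeric columns in a single step is `select_dtypes`. A small sketch on a made-up frame (the `label` column is hypothetical and only there to show a non-numeric case):

```python
import pandas as pd

frame = pd.DataFrame({
    "alcohol": [9.4, 9.8],
    "quality": [5, 6],
    "label": ["low", "high"],  # hypothetical text column
})

# Columns whose dtype is not numeric would need encoding or conversion.
non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```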

In [None]:
X["fixed acidity"].unique().shape,X["volatile acidity"].unique().shape

((96,), (143,))

# Build a model

In [None]:
model = xgb.XGBRegressor(max_depth=3,learning_rate=0.3,verbosity=0,objective='reg:linear',booster='gbtree')

##XGBRegressor is a class in the xgboost package that is used to build a regression model. The parameters here are:
1. max_depth: The maximum depth of each tree built by the regressor (3 is a common starting point and values above 6 are often unnecessary, although the best value depends on the dataset).
2. learning_rate: The shrinkage applied to each new tree's contribution; smaller values make the boosting more conservative.
3. verbosity: The default value is 1, which prints warnings; here we have set it to 0 in order to silence the warnings.
4. objective: Specifies the learning task. "reg:linear" means regression with a squared-error loss (it does not mean a linear model) and is a deprecated alias of "reg:squarederror". You can use different objectives for different purposes.
5. booster: Specifies which booster is used for boosting, commonly a tree model ("gbtree") or a linear model ("gblinear").

In [None]:
print(model)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.3, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=0)


##The above output is a summary of the model: it lists every parameter, including the default values of those we did not set.

## The model is fit on the training data

In [None]:
model.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.3, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=0)

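##To make predictions on the testing data we use **model.predict**. The model returns continuous values, so the predictions are rounded here in order to get whole-number quality scores. (Rounding does not change the model itself, but a metric computed on the rounded values will differ from the same metric computed on the raw predictions.) Note that the raw predictions are stored in y, which overwrites the target variable defined earlier.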

In [None]:
y = model.predict(X_test)
predictions = [round(value) for value in y]
print(predictions)

[5, 6, 5, 5, 6, 4, 6, 7, 7, 5, 6, 6, 6, 5, 6, 5, 6, 6, 5, 6, 5, 6, 5, 5, 5, 6, 5, 6, 6, 6, 5, 5, 5, 6, 5, 6, 6, 5, 5, 6, 5, 6, 5, 6, 5, 5, 5, 6, 5, 6, 5, 6, 6, 5, 5, 6, 5, 5, 5, 6, 7, 5, 5, 6, 4, 5, 5, 6, 6, 7, 6, 6, 6, 6, 6, 7, 6, 6, 6, 5, 7, 6, 6, 6, 6, 7, 6, 6, 7, 5, 6, 5, 6, 5, 5, 6, 6, 5, 6, 5, 6, 6, 6, 5, 5, 7, 6, 5, 6, 5, 7, 7, 7, 5, 5, 5, 7, 5, 5, 6, 5, 6, 5, 6, 5, 7, 6, 6, 6, 7, 5, 6, 5, 6, 6, 7, 7, 6, 5, 5, 5, 6, 6, 6, 5, 6, 6, 6, 5, 5, 5, 5, 5, 7, 5, 6, 6, 7, 7, 6, 6, 5, 6, 6, 5, 5, 6, 6, 6, 5, 6, 5, 6, 5, 5, 5, 5, 6, 6, 5, 6, 6, 7, 6, 6, 4, 6, 5, 6, 7, 6, 5, 6, 6, 5, 5, 5, 5, 5, 5, 6, 6, 5, 7, 6, 5, 5, 5, 5, 5, 5, 6, 7, 5, 5, 5, 7, 5, 6, 7, 6, 5, 6, 5, 5, 6, 6, 5, 6, 6, 6, 6, 7, 5, 7, 5, 5, 6, 5, 5, 6, 6, 5, 6, 5, 5, 5, 5, 5, 5, 6, 5, 6, 6, 6, 5, 5, 5, 5, 7, 7, 6, 5, 6, 6, 5, 7, 6, 5, 6, 5, 4, 6, 5, 7, 5, 6, 6, 5, 5, 6, 6, 7, 5, 5, 6, 5, 7, 6, 5, 6, 6, 5, 5, 5, 7, 6, 5, 6, 5, 5, 6, 5, 7, 6, 5, 7, 5, 5, 6, 7, 6, 7, 5, 5, 6, 6, 5, 5, 5]


##Different Evaluation metrics

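##accuracy_score is a classification metric from sklearn: it gives the fraction of rounded predictions that exactly match the true quality score. It can be used here only because the predictions were rounded to whole numbers.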

In [None]:
accuracy = accuracy_score(y_test,predictions)
print(accuracy)

0.603125
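
Exact-match accuracy treats a prediction that is off by one the same as one that is off by three. A minimal sketch contrasting it with mean absolute error on made-up values (the four scores below are illustrative, not taken from the model):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

y_true = np.array([5, 6, 5, 7])
y_raw = np.array([5.4, 5.6, 4.9, 6.2])  # illustrative raw regression outputs

# Round to the nearest whole quality score before using a classification metric.
y_rounded = np.rint(y_raw).astype(int)

acc = accuracy_score(y_true, y_rounded)   # fraction of exact matches
mae = mean_absolute_error(y_true, y_raw)  # average size of the error

print(acc, mae)
```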


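## This is the absolute error for each sample of the test data (the square root of a single squared error is its absolute value, so no mean has been taken yet)

In [None]:
abs_error = np.sqrt((y - y_test)**2)
abs_error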

1012    0.421551
658     0.583838
1483    1.201342
1552    0.843307
1311    0.168949
          ...   
1044    0.383334
497     0.881669
89      0.142319
1062    0.204480
946     0.119686
Name: quality, Length: 320, dtype: float64
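
The square root of a single squared error is just its absolute value, so a per-sample result like the one above still needs to be averaged. A sketch of a single-number RMSE, which takes the mean of the squared errors before the square root (the values are made up for illustration):

```python
import numpy as np

y_true = np.array([5, 6, 5, 7])
y_pred = np.array([5.4, 5.6, 4.9, 6.2])  # illustrative predictions

per_sample = np.sqrt((y_pred - y_true) ** 2)     # identical to np.abs(y_pred - y_true)
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # one number for the whole set

print(per_sample)
print(rmse)
```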

#The training score, i.e. how well the model performs on the training data. For a regressor, score returns the R² coefficient of determination, not an accuracy.

In [None]:
score = model.score(X_train, y_train)   
print("Training score: ", score) 

Training score:  0.7983299484360415
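
For regressors that follow the scikit-learn API, `score` returns the coefficient of determination R². A minimal sketch of that formula, written out with NumPy on made-up values:

```python
import numpy as np

y_true = np.array([5.0, 6.0, 5.0, 7.0])
y_pred = np.array([5.4, 5.6, 4.9, 6.2])  # illustrative predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(round(r2, 4))
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the target.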


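##The cross validation score. By default cross_val_score uses the model's own score method, so these are R² values and can be compared with the training score above

In [None]:
# 5-fold cross-validation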
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())

Mean cross-validation score: 0.36


##K-fold score: the same cross-validation, this time with 10 shuffled folds

In [None]:
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(model, X_train, y_train, cv=kfold )
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

K-fold CV average score: 0.37
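
Cross-validation can also report an error in the target's own units by passing a `scoring` argument. A sketch on synthetic data with scikit-learn's `GradientBoostingRegressor` as a stand-in estimator (swapped in so the snippet runs without xgboost; an `XGBRegressor` can be passed to `cross_val_score` in the same way):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
y_syn = 2.0 * X_syn[:, 0] + rng.normal(scale=0.1, size=200)

stand_in = GradientBoostingRegressor(max_depth=3, n_estimators=50, random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so MSE is reported negated; flip the sign back.
neg_mse = cross_val_score(stand_in, X_syn, y_syn, cv=kfold, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)

print(rmse_per_fold.mean())
```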
