In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

In [3]:
dataset_classification = pd.read_csv("Social_Network_Ads.csv")

In [4]:
dataset_classification.head()

,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [5]:
dataset_classification.drop(['User ID'], inplace = True, axis=1)

In [6]:
dataset_classification.shape

(400, 4)

In [7]:
from sklearn.preprocessing import LabelEncoder

In [8]:
le = LabelEncoder()

In [9]:
dataset_classification['Gender'] = le.fit_transform(dataset_classification['Gender'])

In [10]:
dataset_classification.head()

,Gender,Age,EstimatedSalary,Purchased
0,1,19,19000,0
1,1,35,20000,0
2,0,26,43000,0
3,0,27,57000,0
4,1,19,76000,0
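
The integer codes in the `Gender` column come from `LabelEncoder`, which sorts the distinct labels and replaces each one with its position. A minimal sketch of that mapping, using a hypothetical toy series in place of `dataset_classification['Gender']`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for dataset_classification['Gender'].
gender = pd.Series(["Male", "Male", "Female", "Female", "Male"])

le = LabelEncoder()
encoded = le.fit_transform(gender)

# classes_ holds the sorted labels; each label is replaced by its index there.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)                                  # {'Female': 0, 'Male': 1}
print(encoded.tolist())                         # [1, 1, 0, 0, 1]
print(le.inverse_transform(encoded).tolist())   # back to the original strings
```

This matches the table above, where Male rows show 1 and Female rows show 0.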


In [11]:
from sklearn.model_selection import train_test_split

In [12]:
y = dataset_classification['Purchased']

In [13]:
X = dataset_classification.iloc[:,:-1]

In [14]:
X.shape

(400, 3)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=10)

In [16]:
X_test.shape

(100, 3)

## XGBoost (eXtreme Gradient Boosting)

#### Why is XGBoost good?

    1. Parallel Computing: It supports parallel processing (using OpenMP). The core library uses all available cores by default, but the scikit-learn wrapper used in this notebook reports n_jobs=1, so set n_jobs explicitly if you want more threads.
    
    2. Regularization: I believe this is the biggest advantage of xgboost. Standard GBM has no explicit regularization term in its objective, whereas xgboost adds penalties on tree complexity and leaf weights. Regularization is a technique used to reduce overfitting in linear and tree-based models.
    
    3. Built-in Cross Validation: In R, we usually use external packages such as caret and mlr to obtain CV results. xgboost, however, ships with its own CV function (xgb.cv), which is also available in Python.
    
    4. Missing Values: XGBoost is designed to handle missing values internally. The missing values are treated in such a manner that if there exists any trend in missing values, it is captured by the model.
    
    5. Flexibility: In addition to regression, classification, and ranking problems, it supports user-defined objective functions also. An objective function is used to measure the performance of the model given a certain set of parameters. Furthermore, it supports user defined evaluation metrics as well.
    
    6. Availability: Currently, available for programming languages such as R, Python, Java, Julia, and Scala.
    
    7. Save and Reload: XGBoost lets us save the data matrix and the model and reload them later. If we have a large data set, we can save the trained model and reuse it in the future instead of wasting time redoing the computation.
    
    8. Tree Pruning: Unlike GBM, which stops splitting a node as soon as it encounters a negative loss reduction, XGBoost grows the tree up to max_depth and then prunes backward, removing splits whose improvement in the loss function is below a threshold.

#### Types of problems XGBoost can solve:

##### 1. Classification:
    a. It uses the booster = gbtree parameter; i.e., trees are grown one after another, and each attempts to reduce the misclassification left by the previous iterations.
    b. Each new tree is fit to the gradient of the loss at the current predictions, so points that the previous trees predicted poorly have the most influence on the next tree.
    
##### 2. Regression:
    a. We have two methods: booster = gbtree and booster = gblinear.
    b. With gblinear, it builds a generalized linear model and optimizes it using regularization (L1, L2) and gradient descent.
    c. In both cases, the subsequent models are built on residuals (actual - predicted) generated by previous iterations.
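
The idea of building each model on the residuals of the previous ones can be sketched by hand. This is plain gradient boosting for squared error with scikit-learn trees, not XGBoost's regularized algorithm, and the data is synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (an assumption for this sketch).
rng = np.random.default_rng(10)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())        # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction                # actual - predicted
    tree = DecisionTreeRegressor(max_depth=3, random_state=10)
    tree.fit(X, residuals)                    # the next tree learns the residuals
    prediction += learning_rate * tree.predict(X)   # shrunken update
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after 50 rounds: {mse_history[-1]:.4f}")
```

Each round removes part of the remaining error, which is why a smaller `learning_rate` needs more rounds to reach the same training error.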
    

### Understanding XGBoost Tuning Parameters:

XGBoost parameters can be divided into three categories (as suggested by its authors):

#### 1. General Parameters:
These control the booster type, which in turn drives the overall functioning of the model:
        a. booster[default=gbtree]
                Sets the booster type (gbtree, gblinear or dart) to use. 
                    For classification problems, you can use gbtree, dart. 
                    For regression, you can use any.
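        b. n_jobs[default=1]
                Sets the number of threads used for parallel computation. 
                With the default of 1 only a single thread is used; raising it towards the number of available cores generally gives the fastest computation. 
        c. silent[default=None]
            If you set it to 1, running messages are suppressed; with 0 they are printed to the console.
            It is deprecated in favor of verbosity, so it is better not to change it.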

#### 2. Booster Parameters
As mentioned above, parameters for tree and linear boosters are different. Let's understand each one of them:

######    A. Parameters for Tree Booster
        a. n_estimators[default=100]
            It controls the maximum number of iterations. 
            For classification, it is similar to the number of trees to grow.
            Should be tuned using CV
        b. learning_rate[default=0.1][range: (0,1)]
            It controls the learning rate, i.e., the rate at which our model learns patterns in data. 
            After every round, it shrinks the feature weights to reach the best optimum.
            A lower learning_rate (eta in the native API) makes the model learn more slowly, so it must be supported by an increase in n_estimators (nrounds in R).
            Typically, it lies between 0.01 - 0.3
        c. gamma[default=0][range: [0,Inf)]
            It controls regularization (or prevents overfitting) by setting the minimum loss reduction required to split a leaf node further.
            The optimal value of gamma depends on the data set and other parameter values.
            Higher the value, higher the regularization. 
            A candidate split whose gain is below gamma is not kept, so splits that barely improve the model's performance are penalized. 
            Default = 0 means no such penalty.
            Tune trick: Start with 0 and check CV error rate. 
                If you see test error >>> train error, bring gamma into action. 
                Higher the gamma, lower the difference in train and test CV. 
                If you have no clue what value to use, use gamma=5 and see the performance. 
                Remember that gamma brings improvement when you want to use shallow (low max_depth) trees.
        d. max_depth[default=3][range: (0,Inf)] 
            It controls the depth of the tree.
            Larger the depth, more complex the model; higher chances of overfitting. 
            There is no standard value for max_depth. Larger data sets require deep trees to learn the rules from data.
            Should be tuned using CV
        e. min_child_weight[default=1][range:(0,Inf)]
            In regression, it refers to the minimum number of instances required in a child node. 
            In classification, if the leaf node has a minimum sum of instance weight (calculated by second order partial derivative) lower than min_child_weight, the tree splitting stops.
            In simple words, it blocks the potential feature interactions to prevent overfitting. Should be tuned using CV.
        f. subsample[default=1][range: (0,1]]
            It controls the fraction of samples (observations) supplied to a tree.
            Typically, its values lie between (0.5-0.8)
        g. colsample_bytree[default=1][range: (0,1]]
            It controls the fraction of features (variables) supplied to a tree.
            Typically, its values lie between (0.5-0.9)
        h. reg_lambda[default=1]
            It controls L2 regularization (equivalent to Ridge regression) on weights. 
            It is used to avoid overfitting.
        i. reg_alpha[default=0]
            It controls L1 regularization (equivalent to Lasso regression) on weights. 
            In addition to shrinkage, enabling alpha also results in feature selection. 
            Hence, it's more useful on high dimensional data sets.
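
How gamma, reg_lambda and min_child_weight interact is easier to see on a single candidate split. The sketch below uses the split-gain expression from the XGBoost documentation with reg_alpha taken as 0. The helper names `leaf_score` and `split_gain` and the eight toy rows are made up for illustration.

```python
import numpy as np

def leaf_score(g_sum, h_sum, reg_lambda):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return g_sum ** 2 / (h_sum + reg_lambda)

def split_gain(grad, hess, left_mask, reg_lambda=1.0, gamma=0.0):
    """Gain of a split minus the gamma penalty, plus the hessian sum per child."""
    g_left, h_left = grad[left_mask].sum(), hess[left_mask].sum()
    g_right, h_right = grad[~left_mask].sum(), hess[~left_mask].sum()
    gain = 0.5 * (leaf_score(g_left, h_left, reg_lambda)
                  + leaf_score(g_right, h_right, reg_lambda)
                  - leaf_score(g_left + g_right, h_left + h_right, reg_lambda))
    return gain - gamma, h_left, h_right

# Binary logistic loss evaluated at the initial prediction p = 0.5.
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
p = np.full(len(y), 0.5)
grad = p - y            # first derivative of the log loss
hess = p * (1 - p)      # second derivative: 0.25 per row at p = 0.5

# Candidate split that puts all zeros on the left and all ones on the right.
left_mask = y == 0

gain_no_gamma, h_left, h_right = split_gain(grad, hess, left_mask, gamma=0.0)
gain_gamma_5, _, _ = split_gain(grad, hess, left_mask, gamma=5.0)
gain_no_lambda, _, _ = split_gain(grad, hess, left_mask, reg_lambda=0.0)

print(gain_no_gamma)    # 2.0  -> split is kept
print(gain_gamma_5)     # -3.0 -> split is rejected when gamma=5
print(gain_no_lambda)   # 4.0  -> smaller lambda makes the same split look better
print(h_left, h_right)  # 1.0 1.0 -> each child just meets min_child_weight=1
```

Since each row contributes a hessian of 0.25 at p = 0.5, a child needs at least four rows to reach min_child_weight=1 in the first round.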


######    B. Parameters for Linear Booster
The linear booster has fewer parameters to tune, hence it computes much faster than the gbtree booster.

    a. n_estimators[default=100]
        It controls the maximum number of iterations (steps) required for gradient descent to converge.
        Should be tuned using CV
    b. reg_lambda[default=1]
        It enables Ridge Regression. Same as above
    c. reg_alpha[default=0]
        It enables Lasso Regression. Same as above

#### 3. Learning Task Parameters: 
Sets and evaluates the learning process of the booster from the given data

These parameters specify methods for the loss function and model evaluation. In addition to the parameters listed below, you are free to use a customized objective / evaluation function.

    a. objective[default=binary:logistic]
            reg:linear - for linear regression (renamed to reg:squarederror in newer XGBoost releases)
            binary:logistic - logistic regression for binary classification. It returns class probabilities
            multi:softmax - multiclassification using softmax objective. 
                            It returns predicted class labels. 
                            It requires setting num_class parameter denoting number of unique prediction classes.
            multi:softprob - multiclassification using softmax objective. It returns predicted class probabilities.
    b. eval_metric [no default, depends on objective selected] - not a constructor argument in the Python wrapper version used here; it is passed to fit() instead
            These metrics are used to evaluate a model's accuracy on validation data. 
                For regression, default metric is RMSE. 
                For classification, default metric is error.
                    Available error functions are as follows:
                        mae - Mean Absolute Error (used in regression)
                        logloss - Negative loglikelihood (used in classification)
                        auc - Area under curve (used in classification)
                        rmse - Root mean square error (used in regression)
                        error - Binary classification error rate [#wrong cases/#all cases]
                        mlogloss - multiclass logloss (used in classification)

#### XGBoost Classifier

In [17]:
xgb = XGBClassifier(random_state=10)

In [18]:
xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=10,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [19]:
y_pred_xgb = xgb.predict(X_test)

In [20]:
roc_auc_score(y_test, y_pred_xgb)

0.9242636746143058

In [21]:
accuracy_score(y_test, y_pred_xgb)

0.92

#### XGB Random Forest Classifier

In [32]:
xgbrf = XGBRFClassifier(random_state=10)

In [33]:
xgbrf.fit(X_train, y_train)

XGBRFClassifier(base_score=0.5, colsample_bylevel=1, colsample_bynode=0.8,
                colsample_bytree=1, gamma=0, learning_rate=1, max_delta_step=0,
                max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
                n_jobs=1, nthread=None, objective='binary:logistic',
                random_state=10, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                seed=None, silent=None, subsample=0.8, verbosity=1)

In [34]:
y_pred_xgbrf = xgbrf.predict(X_test)

In [35]:
roc_auc_score(y_test, y_pred_xgbrf)

0.9242636746143058

In [36]:
accuracy_score(y_test, y_pred_xgbrf)

0.92