### Name of the classifiers: XGBoost 

Accoring to machinelearning mastery, XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

#### 1. Import packages

In [96]:
from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from itertools import product
import warnings
warnings.filterwarnings('ignore')

#### 2. Load Data

In [97]:
X_train, y_train = load_svmlight_file("a9a.txt")
X_test, y_test = load_svmlight_file("a9a.t")


NOTE:
    For each of learning algorithms, you will need to set various hyperparameters (e.g. For XGBoost: the
tree method, max depth, number of weak classifiers, objective, etc)


#### 3. Fit model on training data. Fit based on the default values of the hyperparameters for simplicity

In [98]:
model = XGBClassifier()
model.fit(X_train, y_train)
print("Default values of the hyperparameters:\n", model)


Default values of the hyperparameters:
 XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)


#### 4. Make predictions for the test data

In [5]:
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]


#### 5. Evaluate predictions

In [6]:
accuracy = accuracy_score(y_test, predictions)
print("\nAccuracy: %.2f%%" % (accuracy * 100))



Accuracy: 84.83%


NOTES: 

The list of hyperparameters and brief description of each hyperparameter you tuned in training, their default values, and the final hyperparameter settings you use to get the best result.
    Parameters to be tuned for XGBoost:
    1. n estimators : 
    2. max depth
    3. lambda
    4. learning rate
    5. missing
    6. objective



#### 6. Tuning hyperparameters.
Tune the hyperparameters by seting the list of values. This will allow different combination of parameters. Meaning that the more values, more time it will take to compute. 


In [99]:
n_estimators = [100, 200, 300]
max_depth = [2, 3, 4, 5]
reg_lambda = [0.0, 1.0]
learning_rate = [0.05, 0.1, 0.2, 0.5]
missing = [None, 0]
objective = ('binary:logistic', 'binary:hinge')

In [17]:
hyperparameters = []
for n_estimate, depth, lam, rate, miss, obj in product(n_estimators, max_depth, reg_lambda, learning_rate, missing, objective):
    hyperparameters.append(
        [n_estimate, depth, lam, rate, miss, obj])


NOTES:

Number of hyperparameters: We have about 384 parameters combination with the above tuning

In [92]:
count = 0
for i in hyperparameters:
    count+=1

print (count)
    
    

384


#### 7. Time it took:
It took me about 45 minutes to run 384 hyperparameters on mac os

In [102]:
best_accuracy = 0
count = 0
long = [5,10,50,100,300,350]

for parameter in hyperparameters: 
    parameters = {'n_estimators': parameter[0], 'max_depth': parameter[1], 'reg_lambda': parameter[2], 'learning_rate': parameter[3], 'missing': parameter[4], 'objective': parameter[5]}

    model = XGBClassifier(n_estimators= parameter[0], max_depth= parameter[1], reg_lambda= parameter[2], learning_rate= parameter[3], missing= parameter[4], objective= parameter[5])
                          

    kfold = KFold()
    cross_val_scores = cross_val_score(model, X_train, y_train, cv=kfold) #takes long
    accuracy = cross_val_scores.mean() * 100
    
    count += 1


    if (best_accuracy < accuracy):
        best_accuracy = accuracy
        best_model = parameters
        
    for j in long:
        if count == j: 
             print(count)
        
        
print("Best accuracy that we can get: \n", accuracy)
print("\nThe best model parameters: \n", best_model)

5
10
50
100
300
350
Best accuracy that we can get: 
 nan

The best model parameters: 
 {'max_depth': 2, 'learning_rate': 0.5, 'missing': None, 'n_estimators': 200, 'reg_lambda': 1.0, 'objective': 'binary:logistic'}


#### 8. Analysis with new hyperparameter:

After several trial, I've noticed that the learning rate should be in between 0.2 to 0.7. With learning rate 0.7 and other fixed values, it resulted in highest accuracy. Lower or higher values would lower the accuracy. Also accuracy reduced when 150 < n_estimators < 260 . Within the given interval the accuracy would fluctuate but would not decrease dramatically. I've noticed that values around 190 -200 and 240-250 result in higher accuracy. Also, deeper max_depth would significantly reduce the accuracy. 



In [103]:
##TRIAL
new_model = XGBClassifier(n_estimators=249, max_depth=2, reg_lambda=1, learning_rate= 0.7, missing=None, objective='binary:logistic')

new_model.fit(X_train, y_train)
print("New values of the hyperparameters: \n", new_model)


New values of the hyperparameters: 
 XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.7, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=None, monotone_constraints='()',
              n_estimators=249, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)


In [104]:
kfold = KFold()
cross_val_scores = cross_val_score(new_model, X_train, y_train, cv=kfold)
accuracy = cross_val_scores.mean() * 100

In [105]:
print ("New Model Stats")
print("\nAccuracy: \n ", accuracy)

print("Cross Validation Training Error Rate:\n ",
      1-cross_val_scores.mean())

print("Test Error Rate: \n",
      1-new_model.score(X_test, y_test))

New Model Stats

Accuracy: 
  85.09260820638065
Cross Validation Training Error Rate:
  0.1490739179361935
Test Error Rate: 
 0.15011362938394446


**Accuracy when hyperparameter (Default) :** 84.17%

**Accuracy when hyperparameter (New) :** 85.09%

| Hyperparameter | Default Value | New Value |
| :- | -: | :-: |
 *n_estimators* | 200 | 249
 *max_depth* | 2 | 2
 *reg_lambda* | 1 | 1
 *learning_rate* | 0.5 | 0.7   
 *missing* | None | .None
 *objective* | binary:logistic | binary:logistic


**Cross Validation Training Error Rate:** 0.1490739179361935

**Test Error Rate:** 0.15011362938394446
 
 
