# Applying the XGBoost Algorithm

So far you've learned to apply a variety of different models. In this notebook, you will prepare the data, import the `XGBClassifier`, train it, and make predictions on your own. Feel free to tune the hyperparameters at the end using a grid search or a random search.

We'll use the Pima Native Americans diabetes dataset for this task. You can find it in the data folder. The file has no header row, so we provide the column names as a list in one of the cells below. Have a look at the documentation of `pd.read_csv`: it is possible to import the data and assign the column names directly (the list is in the correct order).
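As a minimal sketch of assigning the names at import time: `pd.read_csv` accepts a `names` argument alongside `header=None`. The in-memory `raw` buffer below is only a stand-in for the CSV file (its two rows are the first two rows shown later in this notebook), and `df_named` is a hypothetical variable name.

```python
import io

import pandas as pd

column_names = ['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness',
                'insulin', 'bmi', 'diabetes_pedigree_function', 'age', 'outcome']

# Stand-in for the header-less CSV file (assumption: same layout as the real file)
raw = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# header=None: the first line is data, not column labels
# names=...:   use our own labels, in file order
df_named = pd.read_csv(raw, header=None, names=column_names)
print(df_named.columns.tolist())
```

With the real file, the same call would presumably take the path `'data/pima-native-americans-diabetes.csv'` in place of `raw`.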

If you need help or inspiration, have a look at this [blog post](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/), which describes how to use the XGBoost algorithm.

## Import and Setup

In [1]:
# Import modules (as many as you need)
from xgboost import XGBClassifier
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

RSEED = 42

In [2]:
# Import diabetes data
df = pd.read_csv('data/pima-native-americans-diabetes.csv', header=None)
df.head(2)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |


In [3]:
# Column names for the dataset (in file order)
column_names = ['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness', 'insulin', 'bmi', 'diabetes_pedigree_function', 'age', 'outcome']

In [4]:
df.columns = column_names
df.head(2)

| | pregnancies | glucose | blood_pressure | skin_thickness | insulin | bmi | diabetes_pedigree_function | age | outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |


In [5]:
df.columns

Index(['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness', 'insulin',
       'bmi', 'diabetes_pedigree_function', 'age', 'outcome'],
      dtype='object')

In [6]:
# Separate the features from the target
X = df.drop('outcome', axis=1)
y = df['outcome']

In [7]:
# Split the data (default test_size=0.25; stratify=y keeps the class ratio in both parts)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=RSEED)

In [8]:
y_test.shape

(192,)
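The test size of 192 follows from the default `test_size=0.25` of `train_test_split`. The sketch below reproduces it on synthetic data and checks that `stratify` keeps the share of positive labels about equal in both parts. It assumes 768 rows with 268 positive labels, which matches the commonly distributed version of this dataset but is not shown in this notebook; `X_demo` and `y_demo` are made-up stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RSEED = 42
rng = np.random.default_rng(RSEED)

# Synthetic stand-in for the real data (assumed: 768 rows, 268 positives)
X_demo = rng.normal(size=(768, 8))
y_demo = np.array([1] * 268 + [0] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=RSEED
)

# 25% of 768 rows end up in the test set
print(y_te.shape)

# Share of positive labels overall, in the training part, and in the test part
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```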

In [9]:
# Fit model to training data
model = XGBClassifier()
model.fit(X_train, y_train)

In [10]:
# Make predictions on test set 
y_pred = model.predict(X_test)

In [11]:
# predict() already returns class labels (0 or 1), so rounding leaves them unchanged
predictions = [round(value) for value in y_pred]

In [12]:
# Evaluate your model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 74.48%
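Accuracy is the fraction of test samples whose predicted label equals the true label. The sketch below shows this on a small made-up example (`y_true_demo` and `y_pred_demo` are illustrative values, not results from this notebook) and adds a confusion matrix, which separates the two kinds of error that a single accuracy number hides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only
y_true_demo = np.array([1, 0, 0, 1, 1, 0, 0, 0])
y_pred_demo = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# accuracy_score is the share of positions where both arrays agree
acc_sklearn = accuracy_score(y_true_demo, y_pred_demo)
acc_manual = (y_true_demo == y_pred_demo).mean()
print("Accuracy: %.2f%%" % (acc_sklearn * 100.0))  # Accuracy: 75.00%

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

In the notebook, the same two calls could be applied to `y_test` and `predictions`.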


In [13]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

In [14]:
# Fresh, untuned model to use as the base estimator for the search
model = XGBClassifier()

In [15]:
param_dist = {
    "n_estimators": randint(100, 600),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
    "colsample_bytree": uniform(0.6, 0.4),
    "gamma": uniform(0, 5),
    "min_child_weight": randint(1, 10)
}
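A detail that is easy to misread in the distributions above: `scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, not `[loc, scale]`, and `scipy.stats.randint(low, high)` excludes `high`. So `uniform(0.6, 0.4)` covers 0.6 to 1.0 and `randint(3, 10)` yields the integers 3 to 9. A small sketch to check this empirically (the variable names are illustrative):

```python
from scipy.stats import randint, uniform

lr_dist = uniform(0.01, 0.3)   # loc=0.01, scale=0.3 -> values in [0.01, 0.31]
depth_dist = randint(3, 10)    # integers from 3 up to and including 9

# Draw reproducible samples from each distribution
lr_samples = lr_dist.rvs(size=1000, random_state=42)
depth_samples = depth_dist.rvs(size=1000, random_state=42)

# Smallest and largest sampled values stay inside the ranges described above
print(lr_samples.min(), lr_samples.max())
print(depth_samples.min(), depth_samples.max())
```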

In [16]:
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=50,              # increase if you have time
    scoring="roc_auc",      # good default for binary classification
    cv=5,
    verbose=1,
    n_jobs=-1,
    random_state=RSEED
)

In [17]:
random_search.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [18]:
best_model = random_search.best_estimator_
best_params = random_search.best_params_

print(best_params)

{'colsample_bytree': 0.9376852562905246, 'gamma': 4.650084174054159, 'learning_rate': 0.031124839254863167, 'max_depth': 4, 'min_child_weight': 6, 'n_estimators': 250, 'subsample': 0.6560336060946096}


In [19]:
# Make predictions on the test set with the tuned model
y_pred = best_model.predict(X_test)

In [20]:
# predict() already returns class labels (0 or 1), so rounding leaves them unchanged
predictions = [round(value) for value in y_pred]

In [21]:
# Evaluate the tuned model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 76.04%
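On 192 test samples, the step from 74.48% to 76.04% corresponds to 3 more correctly classified samples (143 versus 146), so the gain should be read with some caution. Since the search was scored with `roc_auc`, it can also help to report ROC AUC on the test set next to accuracy. The helper below is a hypothetical sketch: `report_scores` is not part of any library, and a `LogisticRegression` on synthetic data stands in for the XGBoost models so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def report_scores(fitted_model, X_eval, y_eval):
    """Return accuracy and ROC AUC for a fitted binary classifier."""
    labels = fitted_model.predict(X_eval)
    # ROC AUC needs scores, so use the predicted probability of class 1
    proba_pos = fitted_model.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, labels),
        "roc_auc": roc_auc_score(y_eval, proba_pos),
    }


# Synthetic data and a stand-in model (assumptions for illustration only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)
stand_in = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

scores = report_scores(stand_in, X_te, y_te)
print(scores)
```

In the notebook, `report_scores(best_model, X_test, y_test)` should work the same way, since `XGBClassifier` also provides `predict` and `predict_proba`.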
