## XGBoost Tutorial

[Tutorial](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/)
  
[Dataset](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv) and [details](http://mlearn.ics.uci.edu/databases/pima-indians-diabetes/pima-indians-diabetes.names)

Attributes (all numeric-valued):
   1. Number of times pregnant
   2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
   
**A few helpful videos**
* What is [boosting?](https://www.youtube.com/watch?v=GM3CDQfQ4sw)
* [Boosting](https://www.youtube.com/watch?v=0Xc9LIb_HTw) and decision trees.

In [1]:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [4]:
dataset = loadtxt('pimaData.csv', delimiter=",")

---
I really like how, when working with a dataframe, you can refer to a column directly by its label. With loadtxt() we refer to each attribute by its numerical position, which can be confusing.
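For comparison, here is a minimal sketch of the dataframe approach, assuming pandas is installed. The column names are my own labels (the CSV has no header row), and the two sample rows only mirror the file's nine-column layout so the snippet runs without the file.

```python
import io

import pandas as pd

# Hypothetical column names chosen for readability; the CSV itself has no header.
COLUMNS = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "pedigree", "age", "outcome",
]

# Two sample rows in the same nine-column layout, standing in for the real file.
sample_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

df = pd.read_csv(sample_csv, header=None, names=COLUMNS)

# Features and label are now selected by name instead of by position.
x_df = df.drop(columns="outcome")
y_series = df["outcome"]

print(x_df.columns.tolist())
print(y_series.tolist())
```

With the real file, passing 'pimaData.csv' to read_csv in place of sample_csv should work the same way.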

Let's split our data up. 

In [5]:
x = dataset[:,0:8]
y = dataset[:,8]

In [6]:
seed = 7
test_size = 0.33
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=seed)
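
To see roughly how many rows end up in each split, here is a quick back-of-the-envelope sketch. It assumes the file holds 768 rows and that scikit-learn rounds a fractional test_size up to the next whole row; both are my understanding rather than something stated in the tutorial.

```python
import math

n_samples = 768   # assumed number of rows in the dataset file
test_size = 0.33  # same fraction as in the cell above

# Assumption: a fractional test_size is rounded up to a whole number of rows,
# and the training set takes whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

Checking x_train.shape and x_test.shape after the split is the direct way to confirm the actual sizes.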

---
Let's train our model!

In [7]:
model = XGBClassifier()
model.fit(x_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

---
Now we can make some predictions! To store our predictions, we will use a list comprehension.

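>By default, the predictions made by XGBoost are probabilities. Because this is a binary classification problem, each prediction is the probability of the input pattern belonging to the first class. We can easily convert them to binary class values by rounding them to 0 or 1.

Note that the scikit-learn wrapper used here, XGBClassifier, already returns class labels (0 or 1) from predict(), so the rounding step below leaves the values unchanged. The probabilities themselves are available from predict_proba().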

In [8]:
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]
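
As a small illustration of turning probabilities into class labels, here is a sketch on made-up values (the numbers are hypothetical, not model output). It also shows one reason an explicit threshold can be safer than round(): Python rounds exact halves to the nearest even integer.

```python
# Hypothetical predicted probabilities of the positive class.
probabilities = [0.12, 0.5, 0.73, 0.49]

# An explicit 0.5 threshold makes the decision rule visible.
thresholded = [int(p >= 0.5) for p in probabilities]

# Built-in round() uses round-half-to-even, so 0.5 becomes 0.
rounded = [round(p) for p in probabilities]

print(thresholded)
print(rounded)
```

The two lists differ only at the exact 0.5 boundary, which rarely matters in practice but is worth knowing about.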

---
So how did the model perform?

In [10]:
accuracy = accuracy_score(y_test, predictions)

In [11]:
accuracy

0.77952755905511806

In [13]:
print(f'Accuracy: {round(accuracy*100,2)}%')

Accuracy: 77.95%
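
For intuition, accuracy is just the share of predictions that match the true labels. The sketch below computes it by hand on made-up toy labels; accuracy_manual is a hypothetical helper name of my own, not part of scikit-learn.

```python
def accuracy_manual(y_true, y_pred):
    """Return the fraction of positions where prediction and label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels for illustration only: three of the four predictions match.
toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]

toy_accuracy = accuracy_manual(toy_true, toy_pred)
print(f"Accuracy: {round(toy_accuracy * 100, 2)}%")
```

If the test set has 254 rows, the 77.95% reported above would correspond to 198 correct predictions, since 198 / 254 is roughly 0.7795.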


---
### Out of curiosity, let's try that with a logistic regression model...

[Scikit-Learn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) docs



In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
clf = LogisticRegression().fit(x_train, y_train)

In [16]:
y_predLR = clf.predict(x_test)

In [18]:
predictions = [round(value) for value in y_predLR]

In [19]:
accuracy = accuracy_score(y_test, predictions)

---
When I saw the score method, I was curious how it compared with the accuracy_score approach used in the tutorial above. It gives the same result (75.59%) in a single call, with no separate prediction step, which makes it much quicker to use.

In [17]:
clf.score(x_test, y_test)

0.75590551181102361
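
To check that the two routes agree, here is a minimal sketch on synthetic data, assuming scikit-learn is installed. The dataset from make_classification and the names X_demo, y_demo, and demo_clf are illustrative only, not the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data with 8 features, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

demo_clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Route 1: the classifier's own score method (mean accuracy).
via_score = demo_clf.score(X_demo, y_demo)

# Route 2: predict first, then compare with the true labels.
via_metric = accuracy_score(y_demo, demo_clf.predict(X_demo))

print(via_score, via_metric)
```

Both routes report mean accuracy, so the convenience of score is mainly that it saves the explicit predict step.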

In [20]:
print(f'Accuracy: {round(accuracy*100,2)}%')

Accuracy: 75.59%
