<a href="https://colab.research.google.com/github/stephenfrein/tree_models/blob/master/Boosted_Trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Boosted Trees

Now we'll use boosted decision trees to classify a well-known data set, the Pima Indians Diabetes data, and compare the results with a single decision tree and a random forest.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
from xgboost import XGBClassifier
# get Pima data set and examine it
url = 'https://query.data.world/s/4zv4vc2itsj3aaf2ge67urxkss553o'
pima_diabetes = pd.read_csv(url)
pima_diabetes.head(n=10)

In [None]:
pima_diabetes.describe()

In [None]:
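# predictor variables - all but column called Outcome
# (use the columns keyword; a positional axis argument is no longer accepted by recent pandas versions)
X = pima_diabetes.drop(columns="Outcome")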
# target variable
y = pima_diabetes["Outcome"]
# create test and training sets - 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [None]:
# decision tree classifier with max depth to avoid overfitting
from sklearn.tree import DecisionTreeClassifier
simple_tree = DecisionTreeClassifier(max_depth=3, criterion='gini')
# train decision tree classifier
simple_tree = simple_tree.fit(X_train,y_train)
y_pred=simple_tree.predict(X_test)
# how many were classified correctly
print("Simple Tree Accuracy:", metrics.accuracy_score(y_test, y_pred))
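
An accuracy number is easier to judge next to a baseline. The sketch below, in plain Python, shows what an accuracy score computes (the fraction of correct predictions) and compares it with the accuracy of always predicting the most common training label. The helper names and the toy labels are hypothetical and are not taken from the Pima data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority_label] * len(y_test))


# hypothetical toy labels, not the Pima data
toy_train = [0, 0, 0, 1, 1, 0, 1, 0]
toy_test = [0, 1, 0, 0, 1]
toy_pred = [0, 1, 1, 0, 1]
print("Model accuracy:", accuracy(toy_test, toy_pred))
print("Baseline accuracy:", majority_baseline_accuracy(toy_train, toy_test))
```

With the notebook's variables, the same comparison would be `majority_baseline_accuracy(list(y_train), list(y_test))` against the tree's accuracy.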

In [None]:
## random forest approach
from sklearn.ensemble import RandomForestClassifier
# create new classifier - 500 trees with max depth of 3 per tree
# fixed random_state so the reported accuracy is reproducible between runs
rf=RandomForestClassifier(n_estimators=500, max_depth=3, random_state=123)
# train the model
rf.fit(X_train,y_train)
# predict the response for test dataset
y_pred=rf.predict(X_test)
# how many were classified correctly
print("Random Forest Accuracy:", metrics.accuracy_score(y_test, y_pred))

In [None]:
# now try boosted trees
boosted_tree_model = XGBClassifier()
boosted_tree_model.fit(X_train, y_train)
# predictions for test data
y_pred = boosted_tree_model.predict(X_test)
# evaluate predictions
print("Boosted Trees Accuracy:", metrics.accuracy_score(y_test, y_pred))
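
To make the idea behind boosting concrete, here is a heavily simplified sketch in plain Python. It is not XGBoost's actual algorithm (which, among other things, uses a logistic loss for classification and regularization); it only illustrates the core loop, assuming squared-error loss and one-split trees ("stumps"): each round fits a small tree to the residuals of the ensemble so far and adds a scaled-down version of it. All function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split that minimizes squared error on the residuals."""
    best = None
    # skip the largest value so the right-hand side is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    """Return the leaf value of the stump for a single input."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Add one stump per round, each fit to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    predictions = [base] * len(xs)
    stumps, errors = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, errors


# hypothetical one-feature toy data, not the Pima data
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0, 0, 0, 1, 0, 1, 1, 1]
base, stumps, errors = boost(toy_x, toy_y)
print("Training MSE after round 1:", round(errors[0], 4))
print("Training MSE after round 20:", round(errors[-1], 4))
```

The training error shrinks as rounds are added, which is also why boosted models can overfit if the number of rounds is not controlled.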

In [None]:
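# XGBoost defaults to 100 trees - would adding more help?
# add evaluation set with early stopping if model isn't improving - verbose output so we can see improvement
# (recent XGBoost versions take early_stopping_rounds and eval_metric in the constructor, not in fit)
# note: early stopping on the test set makes the accuracy below optimistic; a separate validation split is safer
eval_set = [(X_test, y_test)]
boosted_tree_model = XGBClassifier(early_stopping_rounds=10, eval_metric="logloss")
boosted_tree_model.fit(X_train, y_train, eval_set=eval_set, verbose=True)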
# predictions for test data
y_pred = boosted_tree_model.predict(X_test)
# evaluate predictions
print("Boosted Trees Accuracy:", metrics.accuracy_score(y_test, y_pred))
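
The early stopping rule used above can be sketched in a few lines of plain Python. This is an illustration of the general patience rule, not XGBoost's internal code: training stops once a given number of consecutive rounds pass without a new best validation loss. The function name and the loss values are made up for the example.

```python
def early_stopping_round(losses, patience):
    """Return (best_round, stop_round) for per-round validation losses.

    Rounds are 0-indexed. Stops at the first round where `patience`
    rounds have passed since the best loss was seen.
    """
    best_loss = float("inf")
    best_round = -1
    for current_round, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            best_round = current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    # never ran out of patience: training used every round
    return best_round, len(losses) - 1


# made-up validation losses, one per boosting round
toy_losses = [0.65, 0.58, 0.54, 0.52, 0.53, 0.525, 0.53, 0.54]
best_round, stop_round = early_stopping_round(toy_losses, patience=3)
print("Best round:", best_round, "- stopped at round:", stop_round)
```

Here the loss bottoms out at round 3 and training stops three rounds later, so the last three trees added nothing on the validation data.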

In [None]:
# try different numbers of "rounds" (trees added)
boosted_tree_model = XGBClassifier(n_estimators=50)
boosted_tree_model.fit(X_train, y_train)
# predictions for test data
y_pred = boosted_tree_model.predict(X_test)
# evaluate predictions
print("Boosted Trees Accuracy:", metrics.accuracy_score(y_test, y_pred))

In [None]:
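# what features are most important?
# (plot_importance defaults to importance_type="weight": how many times each feature is used in a split)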
from xgboost import plot_importance
from matplotlib import pyplot
plot_importance(boosted_tree_model)
pyplot.show()

# Exercise #4

Try to improve your performance on the IMDB data set using boosted decision trees.

In [None]:
# enjoy