
##Decision tree in depth

XGBoost is an ensemble method: it combines many machine learning models that
work together to produce a prediction. The individual models that make up the
ensemble in XGBoost are called base learners.

Decision trees, the most commonly used XGBoost base learners, work differently from
most other models in the machine learning landscape. Instead of multiplying column
values by numeric weights, as linear regression and logistic regression do,
decision trees split the data by asking questions about the columns.
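
To make "asking questions about the columns" concrete, here is a minimal sketch that prints the questions a small tree has learned. It uses scikit-learn's bundled iris data rather than the census data from this notebook, and `max_depth=2` is an arbitrary choice to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data bundled with scikit-learn (an assumption for illustration,
# not the census dataset used later in this notebook).
iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=2)
tree.fit(iris.data, iris.target)

# Each printed line is a question of the form "feature <= threshold"
# or its complement; indented lines are asked only after the parent question.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Every internal node compares one column against a threshold, so no weights are multiplied against the feature values at any point.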

A decision tree can keep creating branches, thousands if necessary, until it maps every
sample in the training set to its correct target. The tree can therefore reach 100% accuracy
on the training set. Such a model, however, will not generalize well to new data.
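
The gap between training and test accuracy can be seen in a small sketch. It assumes synthetic data from `make_classification` with some label noise (`flip_y=0.1`), not the census data; the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y=0.1 assigns a random class to roughly
# 10% of the samples, so some labels cannot be predicted from the features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=2,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

# No depth limit: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=2).fit(X_tr, y_tr)

train_acc = full_tree.score(X_tr, y_tr)  # accuracy on data the tree has seen
test_acc = full_tree.score(X_te, y_te)   # accuracy on held-out data
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"leaves: {full_tree.get_n_leaves()}, depth: {full_tree.get_depth()}")
```

The unrestricted tree memorizes the training samples, noisy labels included, which is why its training score says little about how it will do on new data.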

Decision trees are prone to overfitting the data. In other words, decision trees can map too
closely to the training data, a problem explored later in this chapter in terms of variance
and bias. 

Hyperparameter tuning is one way to reduce overfitting. Another is to aggregate
the predictions of many trees, a strategy that both random forests and XGBoost
employ.
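
Both remedies can be sketched side by side. The example below assumes the same kind of synthetic data as a stand-in for real data, uses `max_depth=3` as one arbitrary hyperparameter setting, and uses `RandomForestClassifier` to illustrate aggregation; XGBoost itself is not used here, and it combines trees differently (by boosting rather than averaging).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=2,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

# Remedy 1: restrict the tree with a hyperparameter (here, its depth).
shallow = DecisionTreeClassifier(max_depth=3, random_state=2).fit(X_tr, y_tr)

# Remedy 2: aggregate the predictions of many trees.
forest = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_tr, y_tr)

# Baseline for comparison: a single unrestricted tree.
full = DecisionTreeClassifier(random_state=2).fit(X_tr, y_tr)

scores = {
    "unrestricted tree": full.score(X_te, y_te),
    "max_depth=3 tree": shallow.score(X_te, y_te),
    "random forest": forest.score(X_te, y_te),
}
for name, score in scores.items():
    print(f"{name}: test accuracy {score:.3f}")
```

Which remedy helps more depends on the data, so the printed scores are worth comparing rather than assuming.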


##Setup

In [1]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [None]:
!wget https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn/raw/master/Chapter02/census_cleaned.csv

##Exploring decision trees

Decision trees work by splitting the data into branches. Each sample follows the
branches down to a leaf, where the prediction is made.
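
As a rough illustration of a sample travelling from the root to a leaf, the sketch below traces one sample through a small tree. It assumes the bundled iris data and a depth-2 tree purely for readability; `decision_path` and `apply` are existing scikit-learn methods.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Toy data and a small tree (assumptions for illustration only).
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=2)
tree.fit(iris.data, iris.target)

sample = iris.data[:1]  # a single row, kept 2-D as scikit-learn expects

# Ids of the nodes the sample passes through, starting at the root (node 0).
node_path = tree.decision_path(sample).indices

# Id of the leaf where the sample ends up.
leaf_id = tree.apply(sample)[0]

print("nodes visited:", list(node_path))
print("leaf reached:", leaf_id)
print("predicted class:", tree.predict(sample)[0])
```

The last node on the path is the leaf, and the prediction is the majority class of the training samples that landed in that leaf.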

In [10]:
df_census = pd.read_csv("census_cleaned.csv")
df_census.head()

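| Unnamed: 0 | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | ... | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam | native-country_ Yugoslavia | income_ >50K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |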


In [7]:
# declare your predictor and target columns, X and y
X = df_census.iloc[:, :-1]
y = df_census.iloc[:, -1]

In [8]:
# split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=2)
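
When `train_test_split` is called without `test_size` or `train_size`, as above, scikit-learn holds out 25% of the rows for testing. The sketch below checks this on a made-up array (`X_toy` and `y_toy` are hypothetical) and also shows the optional `stratify` argument, which the notebook does not use, for keeping the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, with 80 samples of class 0 and 20 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# Default split: 75% of the rows for training, 25% for testing.
X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, random_state=2)
print("train rows:", len(X_a), "test rows:", len(X_b))

# Stratified split: the test part keeps the 80/20 class ratio of y_toy.
X_c, X_d, y_c, y_d = train_test_split(
    X_toy, y_toy, random_state=2, stratify=y_toy
)
print("class-1 share in stratified test part:", y_d.mean())
```

Fixing `random_state` makes the split reproducible, which is why the notebook passes `random_state=2`.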

In [9]:
# decision tree classifier
clf = DecisionTreeClassifier(random_state=2)
clf.fit(x_train, y_train)

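# predict on the held-out test set and measure accuracy
y_pred = clf.predict(x_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy_score(y_test, y_pred)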

0.8131679154894976
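
An accuracy figure is easier to judge next to a baseline that always predicts the most frequent class. The sketch below shows the idea on made-up labels with an 80/20 class balance, which is an assumption and not the balance of the census data; `DummyClassifier` is scikit-learn's built-in baseline model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 80 samples of class 0 and 20 of class 1.
y_true = np.array([0] * 80 + [1] * 20)
X_placeholder = np.zeros((100, 1))  # features are ignored by this baseline

# The baseline always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
baseline_pred = baseline.predict(X_placeholder)

# Accuracy is the fraction of predictions that match the true labels.
baseline_acc = accuracy_score(y_true, baseline_pred)
manual_acc = np.mean(y_true == baseline_pred)
print("baseline accuracy:", baseline_acc)
print("same value computed by hand:", manual_acc)
```

On the census data, the same comparison could be made by fitting the baseline on `x_train, y_train` and scoring it on `x_test, y_test`.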