#### Decision Tree for Classification

tree models have high flexibility and can capture complex non-linear relationships, but they're prone to memorizing the noise present in a dataset
one solution is to aggregate the predictions of trees that are trained differently, which is called an ensemble method

CART (classification and regression trees) are a set of supervised learning models used for problems involving classification and regression
this chapter talks about the CART algorithm

a **classification tree** learns a sequence of if-else questions about individual features when given a labeled dataset, its objective is to infer class labels
- trees are able to capture non-linear relationships between features and labels (whereas linear models can't)
- trees don't require features to be on the same scale, so you don't have to do something like standardization
- an example problem is predicting whether a tumor is malignant or benign using only 2 features

when a classification tree is trained on a dataset like this, the tree learns a sequence of if-else questions, with each question involving one feature and one split-point

the maximum depth is the maximum number of branches between the top of the tree (the root) and an extreme end (a leaf), so a tree with 2 questions on its longest path has a maximum depth of 2 (3 levels of nodes)
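
a quick way to check the depth of a fitted tree, as a minimal sketch: the built-in breast cancer dataset is only a stand-in here (an assumption, the notes don't say which data is loaded into `X` and `y`)

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# stand-in dataset (an assumption, not necessarily the data used in these notes)
X, y = load_breast_cancer(return_X_y=True)

# cap the tree at 2 questions along any root-to-leaf path
dt = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)

depth = dt.get_depth()        # number of questions on the longest path
n_leaves = dt.get_n_leaves()  # a depth-2 binary tree has at most 4 leaves
print(depth, n_leaves)
```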

a classification model divides the feature space into regions
- a **decision region** is a region of the feature space where all instances are assigned to one and only one class label
- decision regions are separated by surfaces called **decision boundaries**, the boundary is like the dividing line and the regions are the sections that line makes
- a linear model like logistic regression produces a single straight-line decision boundary, while a decision tree divides the feature space into rectangular regions because only one feature is involved at each split made by the tree
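
to see why the regions come out rectangular, you can print the questions a small tree has learned: each one compares a single feature to a single threshold. this is a sketch that assumes the built-in breast cancer dataset and the two columns `mean radius` and `mean concave points` as stand-ins for the 2-feature tumor example

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
names = list(data.feature_names)

# keep only 2 features (this choice of columns is an assumption for illustration)
chosen = ['mean radius', 'mean concave points']
X2 = data.data[:, [names.index(c) for c in chosen]]
y = data.target

dt = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X2, y)

# text dump of the if-else questions, one feature and one threshold per question
rules = export_text(dt, feature_names=chosen)
print(rules)

# the same information from the fitted tree arrays (leaves store a negative feature index)
tree = dt.tree_
splits = [(int(f), float(t)) for f, t in zip(tree.feature, tree.threshold) if f >= 0]
print(splits)
```

every entry in `splits` is an axis-aligned cut, which is what carves the plane into rectangles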

In [None]:
# classification-tree in scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# split the dataset into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

# instantiate the decision tree with a maximum depth of 2
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# fit dt to the training set
dt.fit(X_train, y_train)
# predict the test set labels
y_pred = dt.predict(X_test)

# evaluate the test-set accuracy
accuracy_score(y_test, y_pred)

#### Classification Tree Learning

a **decision tree** is a data structure consisting of a hierarchy of individual units called nodes
a **node** is a question or prediction, there are 3 kinds of nodes:
- the **root** is where the decision tree starts growing, it has no parent node, its question gives rise to 2 children nodes
- an **internal node** has 1 parent and its question gives rise to 2 children nodes
- a **leaf** has 1 parent node and 0 children (no questions), it's where a prediction is made
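
the three kinds of nodes can be counted on a fitted tree through its `tree_` attribute. a minimal sketch, again assuming the built-in breast cancer dataset as stand-in data

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# stand-in dataset (an assumption)
X, y = load_breast_cancer(return_X_y=True)
dt = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

tree = dt.tree_
is_leaf = tree.children_left == -1       # scikit-learn stores -1 when a node has no children
n_leaves = int(is_leaf.sum())
n_internal = int((~is_leaf).sum()) - 1   # nodes that ask a question, minus the root (node 0)

print('total:', tree.node_count, 'root: 1', 'internal:', n_internal, 'leaves:', n_leaves)
```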

a decision tree aims to create the purest leaves possible, the nodes of a classification tree are grown recursively (whether a node becomes an internal node or a leaf depends on the state of its predecessors)
to create pure leaves, at each node the tree asks a question involving one feature (f) and a split-point (sp), but how does it know which feature and which split-point to pick?

the tree answers this question by maximizing information gain: it treats every node as containing information and picks the feature and split-point that maximize the **information gain** (IG) obtained after each split
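
for reference, the usual way to write the information gain of a split. the symbols are chosen here for illustration: $I$ is an impurity measure, $N$ is the number of samples in the parent node, and $N_{\text{left}}$, $N_{\text{right}}$ are the numbers of samples sent to each child

$$IG(f, sp) = I(\text{parent}) - \left( \frac{N_{\text{left}}}{N}\, I(\text{left}) + \frac{N_{\text{right}}}{N}\, I(\text{right}) \right)$$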

another question is what criterion is used to measure the impurity of a node? two of the options are:
- gini index
- entropy

most of the time the gini index and entropy lead to the same results, but the gini index is slightly faster to compute and is the default criterion used in the DecisionTreeClassifier model of scikit-learn
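
both criteria are easy to compute by hand from the class proportions in a node. a small sketch using the standard textbook definitions (gini = 1 - sum of p², entropy = -sum of p·log2(p)); the helper names `gini` and `entropy` are made up for this example

```python
import numpy as np

def gini(labels):
    """Gini index of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Entropy of a node in bits: -sum of p * log2(p) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

pure = np.array([1, 1, 1, 1])    # only one class: zero impurity under both criteria
mixed = np.array([0, 0, 1, 1])   # 50/50 node: the most impure case for 2 classes

print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```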


classification-tree learning, how the tree learns:
- when an unconstrained tree is trained, the nodes are grown recursively (a node exists based on the state of its predecessors)
- at each non-leaf node, the data is split based on feature f and split-point sp to maximize IG
- if the information gain obtained by splitting a node is null then the node is declared a leaf (IG(node) = 0 makes it a leaf)
- if the tree is constrained, like a max depth of 2, then all nodes having a depth of 2 will be declared leaves even if the information gain obtained by splitting such nodes is not null
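
the split search in the steps above can be sketched as a brute-force loop over every feature and every candidate split-point. this is only an illustration under stated assumptions (gini as the impurity, midpoints between distinct values as candidates, a hypothetical helper named `best_split`), not how scikit-learn implements it internally

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, split-point, IG) of the best split, or None if no split has IG > 0."""
    n, n_features = X.shape
    parent_impurity = gini(y)
    best, best_ig = None, 0.0
    for f in range(n_features):
        values = np.unique(X[:, f])
        # candidate split-points: midpoints between consecutive distinct values
        for sp in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, f] <= sp], y[X[:, f] > sp]
            ig = parent_impurity - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
            if ig > best_ig:
                best, best_ig = (f, float(sp), float(ig)), ig
    return best

# toy data: feature 0 separates the classes cleanly, feature 1 does not
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])

split = best_split(X, y)
print(split)

# a node that is already pure gains nothing from any split, so it would become a leaf
pure_split = best_split(X, np.array([1, 1, 1, 1]))
print(pure_split)
```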

In [None]:
# you could now update the previous code with a criterion using the gini-index
dt = DecisionTreeClassifier(criterion='gini', random_state=1)
# the rest of the code is the same too

In [None]:
# exercise example, train a classification tree using entropy as an information criterion
# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# Instantiate dt_entropy, set 'entropy' as the information criterion
dt_entropy = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)

# Fit dt_entropy to the training set
dt_entropy.fit(X_train, y_train)

In [None]:
# exercise example, compare entropy to gini index
# Import accuracy_score from sklearn.metrics
from sklearn.metrics import accuracy_score

# Use dt_entropy to predict test set labels
y_pred = dt_entropy.predict(X_test)

# Evaluate accuracy_entropy
accuracy_entropy = accuracy_score(y_test, y_pred)

# Print accuracy_entropy
print(f'Accuracy achieved by using entropy: {accuracy_entropy:.3f}')

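# accuracy_gini is not defined in this cell, so train a gini tree with the same settings to compare
dt_gini = DecisionTreeClassifier(max_depth=8, criterion='gini', random_state=1)
dt_gini.fit(X_train, y_train)
accuracy_gini = accuracy_score(y_test, dt_gini.predict(X_test))

# Print accuracy_gini
print(f'Accuracy achieved by using the gini index: {accuracy_gini:.3f}')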

#### Decision Tree for Regression

time to learn how to train a decision tree for a regression problem!

remember: the target variable in regression is continuous, the output of your model is a real value

you could make a scatterplot of mpg versus the displacement of a car, you'd see that mpg decreases nonlinearly (kinda banana-shaped) with displacement, linear models like linear regression wouldn't be able to capture a non-linear trend like this

when a regression tree is trained on a dataset, the impurity of a node is measured using the mean squared error of the targets in that node
this means the regression tree tries to find the splits that produce leaves where the target values are, on average, as close as possible to the mean value of the labels in that particular leaf
the RMSE of a model measures, on average, how much the model's predictions differ from the actual labels
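
written out, with notation chosen here for illustration ($N_{\text{node}}$ is the number of training samples in the node and $\hat{y}_{\text{node}}$ is the mean of their targets):

$$\text{MSE}(\text{node}) = \frac{1}{N_{\text{node}}} \sum_{i \in \text{node}} \left( y_i - \hat{y}_{\text{node}} \right)^2, \qquad \hat{y}_{\text{node}} = \frac{1}{N_{\text{node}}} \sum_{i \in \text{node}} y_i$$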

as a new instance traverses the tree and reaches a certain leaf, its target variable y is computed as the average of the target variables contained in that leaf
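
you can check this on a fitted regressor by comparing a prediction with the mean of the training targets that landed in the same leaf. a sketch on synthetic, banana-shaped data that is made up for illustration (it is not the mpg dataset)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data: the target decreases nonlinearly with the single feature
rng = np.random.default_rng(3)
X = rng.uniform(50, 450, size=(200, 1))
y = 4000 / X[:, 0] + rng.normal(0, 1, size=200)

dt = DecisionTreeRegressor(max_depth=3, random_state=3).fit(X, y)

x_new = np.array([[120.0]])
leaf_id = dt.apply(x_new)[0]            # index of the leaf reached by the new instance
in_same_leaf = dt.apply(X) == leaf_id   # training rows that ended up in that leaf

leaf_mean = y[in_same_leaf].mean()
pred = dt.predict(x_new)[0]
print(pred, leaf_mean)                  # the two values should agree
```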

In [None]:
# regression tree in scikit-learn
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

# split the dataset into 80% train, 20% test
# (no stratify here: stratification needs class labels and the regression target is continuous)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

# instantiate the decision tree regressor
# min_samples_leaf is a stopping condition: here each leaf has to contain at least 10% of the training data
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)

# fit dt to the training set
dt.fit(X_train, y_train)
# predict test-set labels
y_pred = dt.predict(X_test)

# obtain the root mean squared error of the model on the test set 
# evaluate the mean squared error
mse_dt = MSE(y_test, y_pred)
# you could also compute the 10-fold CV RMSE by taking the square root of the average MSE
# RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# raise the obtained value to the power 1/2
rmse_dt = mse_dt**(1/2)
print(rmse_dt)
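
the 10-fold CV RMSE mentioned in the comments could be computed along these lines. a sketch on synthetic stand-in data (made up for illustration), using `cross_val_score` with the `neg_mean_squared_error` scorer

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data (an assumption, not the dataset used in these notes)
rng = np.random.default_rng(3)
X = rng.uniform(50, 450, size=(200, 1))
y = 4000 / X[:, 0] + rng.normal(0, 1, size=200)

dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)

# scikit-learn scorers follow "higher is better", so the MSE comes back negated
neg_mse_scores = cross_val_score(dt, X, y, cv=10, scoring='neg_mean_squared_error')
MSE_CV_scores = -neg_mse_scores

# square root of the average MSE across the 10 folds
RMSE_CV = MSE_CV_scores.mean() ** (1 / 2)
print(RMSE_CV)
```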