### UBC Extended Learning
#### Instructor: Socorro Dominguez
#### Week 02 - Module 02

- [ ] Broadly describe how decision trees make predictions.
- [ ] Use `DecisionTreeClassifier()` and `DecisionTreeRegressor()` to build decision trees using scikit-learn.
- [ ] Explain the `.fit()` and `.predict()` paradigm and use the `.score()` method of ML models.
- [ ] Explain the concept of decision boundaries.
- [ ] Explain the difference between parameters and hyperparameters.
- [ ] Explain how decision boundaries change with `max_depth`.
- [ ] Explain the concept of generalization.

### How do Decision Trees Work?

- `Decision trees` are used for both classification and regression tasks.
- They make predictions by recursively partitioning the input data into subsets and assigning a target value or class label to each subset. 
- A `Decision Tree` consists of **nodes** and **branches**.
    - It starts with a **root node** that represents the entire `data set`.
    - It splits (recursively) into **internal nodes**, which represent subsets of the data.
    - The **leaves** represent the final predictions.

- `Feature Selection`: At each **internal node**, the tree algorithm selects a feature (and a split point on it) to split the data into two **child nodes**. The feature is chosen using a criterion such as **Gini impurity** or **entropy**, which measures the quality of the split in terms of **class homogeneity**.

- `Recursive Splitting`: The data is divided into subsets based on the selected **feature** and **split point**. The process continues recursively for each child node, creating a hierarchical structure of nodes and branches until a **stopping condition** is met.

- `Stopping Conditions`: They may include reaching a **predefined depth** of the tree, having a **minimum number of data points** in a node, or **achieving a purity threshold** for classification tasks.

- `Leaf Node Predictions`: Each **leaf node** is associated with a specific **class label** or a predicted **target value**.
    - For classification, the prediction is the **majority class** of the training points in the leaf; for regression, it is their **mean target value**.

- `Prediction Process`: To make a prediction for a new, unseen data point, the decision tree follows a single path through the tree: it starts at the root and follows the splits according to the values of the input features until it lands in a leaf node. The **prediction** associated with that **leaf node** is returned as the final prediction for the input data point.
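The prediction process can be sketched by walking a fitted scikit-learn tree by hand. This is a minimal illustration, not how you would normally predict (you would call `.predict()`); the helper name `predict_one` is hypothetical, and the tree is fitted on the full Iris data with `max_depth=2` purely for the demonstration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
tree = clf.tree_  # low-level arrays describing the fitted tree


def predict_one(x):
    """Follow a single root-to-leaf path for one observation (hypothetical helper)."""
    x = np.asarray(x, dtype=np.float32)  # scikit-learn compares features as float32
    node = 0  # the root node has index 0
    while tree.children_left[node] != -1:  # -1 marks a leaf
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]   # condition holds: go left
        else:
            node = tree.children_right[node]  # otherwise: go right
    # The leaf stores per-class counts (or proportions); predict the largest
    return clf.classes_[np.argmax(tree.value[node][0])]


manual = np.array([predict_one(row) for row in X])
print((manual == clf.predict(X)).all())
```

Each observation visits at most `max_depth` internal nodes, so prediction cost depends on the depth of the tree rather than on its total number of nodes.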

One of the most popular criteria for measuring how mixed (impure) a leaf is is called **Gini impurity**, but there are other criteria such as **entropy**.

$GINI = 1 - \left(\frac{\text{Number for Yes}}{\text{Total for the Leaf}}\right)^2 - \left(\frac{\text{Number for No}}{\text{Total for the Leaf}}\right)^2$

Generally speaking, the output of a leaf is the category with the most counts.
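As a small worked example, Gini impurity is one minus the sum of the squared class proportions in a leaf. The sketch below computes it from raw class counts; the function name `gini` and the example counts are made up for illustration.

```python
def gini(counts):
    """Gini impurity of a leaf given its class counts, e.g. [n_yes, n_no]."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


pure = gini([10, 0])   # every observation is in one class
mixed = gini([5, 5])   # two classes in equal proportion
skewed = gini([3, 1])  # mostly one class

print(pure, mixed, skewed)

# The leaf's output is the class with the largest count
counts = [3, 1]
majority_class = max(range(len(counts)), key=lambda i: counts[i])
print(majority_class)
```

A pure leaf has impurity 0, and for two classes the impurity peaks at 0.5 when the classes are evenly mixed.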

![](https://i.stack.imgur.com/FgdfC.jpg)

`max_depth`
- We can put limits on how trees grow. One example is requiring 3 or more people per leaf (`min_samples_leaf=3` in scikit-learn). If we did that with our training data, we would end up with a different tree whose leaves might be more impure.
- In exchange, each leaf's prediction would rest on more observations, so we would have a better sense of how accurate it is.
- `max_depth` limits growth in a different way: it sets the maximum number of splits between the root and any leaf.
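To see the effect of `max_depth`, one option is to fit the same classifier at several depths and compare training and test accuracy. This is a sketch on the Iris data with an assumed 70/30 split; the depths tried are arbitrary, and `None` lets the tree grow until its other stopping conditions are met.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
for depth in [1, 2, 3, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    results[depth] = {
        "train": clf.score(X_train, y_train),
        "test": clf.score(X_test, y_test),
        "actual_depth": clf.get_depth(),  # depth the tree actually reached
    }
    print(depth, results[depth])
```

A large gap between training and test accuracy at higher depths would suggest the tree is memorizing the training data rather than generalizing.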

Is `max_depth` a parameter or a hyperparameter?

- In `sklearn` you will see `max_depth` described as a parameter. In the sklearn documentation, however, `parameter` is used in the programming sense of an "argument" to a function.
- `max_depth` is a hyperparameter in ML terms but a parameter in coding terms. We will refer to it by its ML term.

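- Hyperparameters are knobs we set before training and can tune to improve our tree/algorithm. Parameters are what the model learns from the data during `.fit()`: for example, the coefficients and intercept (bias) of a linear model, or the split features and thresholds of a decision tree.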
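The distinction can be made concrete with a fitted tree. In this sketch, `get_params()` returns the hyperparameters we chose before training, while the arrays under `tree_` hold what was learned during `.fit()`. Fitting on the full Iris data is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen by us, available before any data is seen
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
hyperparameters = clf.get_params()
print(hyperparameters["max_depth"])

# Parameters: learned from the data during fit
clf.fit(X, y)
is_split = clf.tree_.children_left != -1  # internal (non-leaf) nodes
learned_features = clf.tree_.feature[is_split]      # feature index used at each split
learned_thresholds = clf.tree_.threshold[is_split]  # split point at each split
print(learned_features, learned_thresholds)
```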

#### Generalization

![](img/overfitting.png)

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# Load the Iris dataset
iris = load_iris()

# Create a DataFrame for the Iris dataset
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [3]:
# Split the DataFrame into features (X) and target labels (y)
X = iris_df.drop(columns=['target'])
y = iris_df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

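`train_test_split` can also split a whole DataFrame at once, keeping the features and the target together:

```python
df_train, df_test = train_test_split(iris_df, test_size=0.2, random_state=42)
```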

In [4]:
# Create a Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth = 1, random_state=42)

# Train (fit) the classifier on the training data
clf.fit(X_train, y_train)

In [5]:
new_obs = X_test.iloc[[0]]
new_obs

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
73,6.1,2.8,4.7,1.2


In [6]:
clf.predict(new_obs)

array([1])

In [7]:
y_test.iloc[[0]]

73    1
Name: target, dtype: int64

In [8]:
# Make predictions on all the test data
y_pred = clf.predict(X_test)

# Calculate the accuracy of the classifier
accuracy_test = clf.score(X_test, y_test)
accuracy_train = clf.score(X_train, y_train)

# Print the accuracy
print(f"Test Accuracy: {accuracy_test}")
print(f"Train Accuracy: {accuracy_train}")

Test Accuracy: 0.7111111111111111
Train Accuracy: 0.6476190476190476
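For classifiers, `.score()` reports accuracy: the fraction of predictions that match the true labels. The sketch below rebuilds the same depth-1 tree in a self-contained way and computes that number three ways, including with `accuracy_score` from `sklearn.metrics`; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

by_hand = (y_pred == y_test).mean()          # fraction of correct predictions
via_metric = accuracy_score(y_test, y_pred)  # same quantity from sklearn.metrics
via_score = clf.score(X_test, y_test)        # predicts internally, then scores

print(by_hand, via_metric, via_score)
```

Note that `.score()` takes the features and true labels, while `accuracy_score` takes the true labels and predictions you have already computed.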


In [9]:
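from sklearn.dummy import DummyClassifier

# Baseline classifiers that ignore the features entirely
strategies = ['most_frequent', 'stratified', 'uniform']

for strategy in strategies:
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_train, y_train)
    print(f"{strategy}: {dummy.score(X_test, y_test)}")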

In [None]:
import numpy as np


np.random.seed(3)
np.random.rand()

In [None]:
np.random.rand()
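The two cells above illustrate what a seed does: successive calls to `np.random.rand()` give different numbers, but resetting the same seed replays the same sequence. A minimal sketch of that behaviour (the variable names are illustrative):

```python
import numpy as np

np.random.seed(3)
first = np.random.rand()   # first draw after seeding
second = np.random.rand()  # next draw in the sequence

np.random.seed(3)          # reset to the same seed
repeat = np.random.rand()  # replays the first draw

print(first == repeat, first == second)
```

This is the same idea behind the `random_state` argument of `train_test_split` and `DecisionTreeClassifier`: fixing it makes the results reproducible from run to run.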

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=342)
X_train

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
96,5.7,2.9,4.2,1.3
32,5.2,4.1,1.5,0.1
31,5.4,3.4,1.5,0.4
113,5.7,2.5,5.0,2.0
49,5.0,3.3,1.4,0.2
...,...,...,...,...
139,6.9,3.1,5.4,2.1
68,6.2,2.2,4.5,1.5
4,5.0,3.6,1.4,0.2
84,5.4,3.0,4.5,1.5
