# Module 1: Introduction to Scikit-Learn

## Part 15: Decision Trees

In this section, we explore Decision Trees, supervised learning algorithms used for both classification and regression tasks. A decision tree learns a hierarchy of if/else rules on the feature values and uses those rules to predict the target.

### 15.1 Understanding Decision Trees

A decision tree consists of a root node, internal nodes, and leaf nodes. The root node represents the entire dataset. Each internal node tests a feature value and passes the samples that reach it to one of its child nodes, so every node corresponds to a subset of the data. Leaf nodes contain the final prediction or class label.

Decision trees work by recursively partitioning the dataset into subsets based on the values of input features, so that each leaf node ends up representing a class label (in classification) or a predicted numerical value (in regression). At each node, the feature and the value at which to split are chosen by a splitting criterion. For classification, common criteria include Gini impurity and entropy, while for regression, mean squared error (MSE) is often used. A branch stops growing when its node is pure, when the tree reaches a specified maximum depth, or when the node holds fewer samples than the minimum required to split.
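To make the splitting criteria concrete, the sketch below computes Gini impurity and entropy by hand with NumPy and compares a perfect split with an uninformative one. The helper names (gini, entropy, weighted_impurity) and the toy labels are assumptions made for illustration; scikit-learn implements these criteria internally and does not expose them under these names.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, where p_k is the share of class k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(left, right, impurity=gini):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Toy parent node with a 50/50 class mix (hypothetical labels)
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A perfect split separates the classes; a useless split leaves both children mixed
perfect = (parent[:4], parent[4:])
useless = (np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]))

print(f"Parent Gini:            {gini(parent):.3f}")                   # 0.500
print(f"Parent entropy:         {entropy(parent):.3f}")                # 1.000
print(f"Gini after perfect:     {weighted_impurity(*perfect):.3f}")    # 0.000
print(f"Gini after useless:     {weighted_impurity(*useless):.3f}")    # 0.500

# Information gain = parent entropy minus weighted child entropy
gain = entropy(parent) - weighted_impurity(*perfect, impurity=entropy)
print(f"Information gain (perfect split): {gain:.3f}")                 # 1.000
```

A tree builder evaluates many candidate splits like these and keeps the one with the lowest weighted impurity, which is the same as the highest impurity reduction.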

To make a prediction, an input sample traverses the decision tree from the root to a leaf node. In classification, the majority class in a leaf node is assigned as the prediction, while in regression, the predicted value is the mean (or another measure) of the target values in the leaf node.
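The traversal described above can be inspected on a fitted scikit-learn tree. The following is a minimal sketch, assuming a shallow tree (max_depth=3, chosen only to keep the path short) fitted on the Breast Cancer dataset; it prints the tests one sample passes on its way from the root to a leaf.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Shallow tree so the root-to-leaf path stays short (illustrative choice)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

sample = X[:1]                                   # one sample, kept 2-D
node_ids = tree.decision_path(sample).indices    # nodes visited, root first
leaf_id = tree.apply(sample)[0]                  # index of the leaf reached

for node in node_ids:
    if node == leaf_id:
        print(f"node {node}: leaf reached")
    else:
        feature = tree.tree_.feature[node]
        threshold = tree.tree_.threshold[node]
        sign = "<=" if sample[0, feature] <= threshold else ">"
        print(f"node {node}: {cancer.feature_names[feature]} "
              f"= {sample[0, feature]:.3f} {sign} {threshold:.3f}")

# The leaf's class proportions give the probabilities; the largest one wins
proba = tree.predict_proba(sample)[0]
pred = tree.predict(sample)[0]
print("Class probabilities:", proba)
print("Predicted class:", cancer.target_names[pred])
```

The predicted class is the one with the highest proportion among the training samples that ended up in the same leaf.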

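Decision trees provide a measure of feature importance, which indicates how much each feature contributes to the model's predictions. In scikit-learn, a fitted tree exposes this as the feature_importances_ attribute, computed from the normalized total impurity reduction that each feature brings. This can be useful for feature selection and understanding the most influential factors in a model.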
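As an illustration, the sketch below fits a tree on the Breast Cancer dataset and lists its five most important features. The choice of dataset and of showing five features is an assumption made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
clf = DecisionTreeClassifier(random_state=42).fit(cancer.data, cancer.target)

# One non-negative score per feature; the scores sum to 1
importances = clf.feature_importances_

# Indices of the five highest scores, largest first
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(f"{cancer.feature_names[i]:<25s} {importances[i]:.3f}")
```

Features that the tree never splits on receive an importance of zero. These impurity-based scores are computed on the training data, so they are best read as a description of the fitted tree rather than as a guarantee about unseen data.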

### 15.2 Training and Evaluation

To train a Decision Tree model, we need a labeled dataset with the target variable and the corresponding feature values. The model learns the decision rules based on the training data to make predictions.

Decision Trees have several hyperparameters that control the model's behavior, such as the maximum depth of the tree (max_depth), the minimum number of samples required to split a node (min_samples_split), and the criterion used for splitting nodes (criterion). Tuning these hyperparameters can significantly impact the model's performance.
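One common way to tune these hyperparameters is a cross-validated grid search. The sketch below uses GridSearchCV on the Breast Cancer dataset; the grid values are arbitrary starting points chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate values are illustrative assumptions
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

# 5-fold cross-validation on the training set for every combination
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
print(f"Test accuracy of best model:   {search.score(X_test, y_test):.3f}")
```

The test set is held out of the search, so the final score is an estimate of how the selected configuration performs on unseen data.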

Once trained, we can evaluate the model's performance using evaluation metrics suitable for classification or regression tasks, such as accuracy, precision, recall, F1-score, or mean squared error.
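For classification, accuracy alone can hide how errors are distributed across classes. The following sketch computes a confusion matrix, precision, recall, and F1-score for a tree on the Breast Cancer dataset; it assumes the same 70/30 split used elsewhere in this section.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42
)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# By default these scores treat label 1 ("benign" in this dataset) as positive
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")

# Per-class summary of the same metrics
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
```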

#### DecisionTreeClassifier Example

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=42)

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize the decision tree
plt.figure(figsize=(8, 4), dpi=300)
plot_tree(clf, filled=True, feature_names=cancer.feature_names.tolist(), class_names=cancer.target_names.tolist())
plt.title('Decision Tree')
plt.show()

This example showcases how to build, train, and evaluate a decision tree classifier using scikit-learn.

First we load the Breast Cancer dataset and split it into training and testing sets. We then create a DecisionTreeClassifier and fit it to the training data, make predictions on the test data, and calculate accuracy. Finally, a visual representation of the decision tree is displayed.

#### DecisionTreeRegressor Example

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Create synthetic data
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a DecisionTreeRegressor
regressor = DecisionTreeRegressor(max_depth=5)

# Fit the regressor to the training data
regressor.fit(X_train, y_train)

# Predict on the test data
y_pred = regressor.predict(X_test)

# Order y_pred based on X_test
sorted_indices = X_test.ravel().argsort()
X_test_sorted = X_test[sorted_indices]
y_pred_sorted = y_pred[sorted_indices]

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared (R²): {r2:.2f}")

# Plot the results
plt.figure()
plt.scatter(X, y, c="darkorange", label="data")
plt.plot(X_test_sorted, y_pred_sorted, color="cornflowerblue", linewidth=2, label="prediction")
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

# Visualize the decision tree
plt.figure(figsize=(8, 4), dpi=300)
plot_tree(regressor, filled=True, feature_names=["x"])
plt.title('Decision Tree')
plt.show()

In this example, we generate synthetic data with some noise and use a DecisionTreeRegressor to learn the underlying pattern. The max_depth parameter limits the depth of the tree and can be adjusted for your specific regression problem: a shallow tree may underfit, while a very deep tree tends to fit the noise. The plot shows the original data points and the tree's predictions on the test set. Because a tree predicts one constant value per leaf, the predictions are step-like rather than a smooth curve.
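To see how max_depth affects the fit, the sketch below trains regressors of different depths on the same kind of synthetic sine data and compares training and test error. The depth values (2, 5, and unlimited) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine data, generated the same way as in the example above
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

train_mse, test_mse = {}, {}
for depth in (2, 5, None):  # None lets the tree grow without a depth limit
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_mse[depth] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[depth] = mean_squared_error(y_test, model.predict(X_test))
    print(f"max_depth={depth!s:>4}: "
          f"train MSE={train_mse[depth]:.4f}, test MSE={test_mse[depth]:.4f}")
```

Training error can only go down as the tree gets deeper, while test error usually stops improving at some point. A growing gap between the two is a sign of overfitting.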

### 15.3 Summary

A Decision Tree is a versatile machine learning algorithm used for both classification and regression tasks. It partitions data into subsets based on features, aiming to maximize information gain or reduce impurity at each node. Decision Trees are interpretable, making them valuable for understanding decision processes. However, they can be prone to overfitting complex data. Techniques like pruning and limiting tree depth help mitigate this. Decision Trees are a foundational component in ensemble methods like Random Forests and Gradient Boosting, enhancing predictive power. They are widely used in various domains due to their simplicity, effectiveness, and interpretability.
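As a closing illustration of the overfitting remedies mentioned above, the sketch below compares an unrestricted tree with a cost-complexity-pruned tree (the ccp_alpha parameter) and a depth-limited tree. The values ccp_alpha=0.01 and max_depth=3 are arbitrary assumptions; in practice they would be chosen by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "full":    DecisionTreeClassifier(random_state=42),
    "pruned":  DecisionTreeClassifier(random_state=42, ccp_alpha=0.01),
    "shallow": DecisionTreeClassifier(random_state=42, max_depth=3),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:<8s} leaves={model.get_n_leaves():>3d} "
          f"depth={model.get_depth():>2d} "
          f"train acc={model.score(X_train, y_train):.3f} "
          f"test acc={model.score(X_test, y_test):.3f}")
```

The pruned and shallow trees are smaller and therefore easier to interpret. Whether they also generalize better depends on the dataset and on the chosen values.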