# Unit 2 Decision Tree Classifier Basics

Welcome! Today, we're going to learn about the **Decision Tree Classifier**, one of the basic tools in machine learning. It makes decisions the way a flowchart does. Imagine deciding whether to wear a coat: if it's cold, you wear it; if not, you don't. A Decision Tree predicts outcomes from data in much the same way, by asking a sequence of simple questions.

-----

By the end of this lesson, you will know:

  * How to train a **Decision Tree Classifier** to make predictions.
  * The concept and learning process of a decision tree.
  * General parameters of a decision tree.

Let's start by looking at each of these steps one by one.

### Loading and Splitting a Dataset

Every model needs data to learn from. We will use the **wine dataset** from **Scikit-Learn**. As a reminder, this dataset contains 13 chemical measurements for each of 178 wines, and our goal is to predict which of three classes a wine belongs to.

Here's a quick reminder on how to load and split this dataset:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
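To confirm what the split produced, a quick check of the array shapes can help. This is only a sketch for inspection; the `Counter` tally of class labels is an addition for illustration and is not part of the lesson's code.

```python
from collections import Counter

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Rows are wines, columns are measurements
print(X.shape)        # (178, 13)
print(X_train.shape)  # (142, 13)
print(X_test.shape)   # (36, 13)

# How many wines belong to each class in the full dataset
print(sorted(Counter(y.tolist()).items()))
```

With `test_size=0.2`, 36 of the 178 wines are held out for testing and the remaining 142 are used for training.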

-----

### Concept of a Decision Tree

A **Decision Tree** is a type of supervised learning model used for classification and regression tasks. It is a flowchart-like structure where:

  * **Root node** is the topmost node; it tests one feature on the entire dataset.
  * **Internal nodes** each test a feature (or attribute) of the data.
  * **Branches** represent the decision rules.
  * **Leaf nodes** represent the outcome.

Here is an example:

Imagine a simple decision tree for classifying whether an animal is a mammal.

  * **Root Node:** Start with a feature, such as whether the animal has fur.
      * If **yes**, go to the next node.
      * If **no**, the animal is **not a mammal**.
  * **First Decision Node:** If the animal has fur, check if it gives birth.
      * If **yes**, the animal is a **mammal**.
      * If **no**, the animal is **not a mammal**.

This decision-making process can be visualized as a flowchart:

[Figure: decision tree for the mammal example]
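The same example tree can also be written out by hand as nested conditions. The function below is a hypothetical illustration (the name `is_mammal` and its arguments are made up for this sketch); it is not a trained model, just the flowchart expressed as code.

```python
def is_mammal(has_fur: bool, gives_birth: bool) -> bool:
    """Hand-written version of the example tree (not a trained model)."""
    # Root node: test the "has fur" feature
    if not has_fur:
        return False  # leaf: not a mammal
    # Decision node: test the "gives birth" feature
    if gives_birth:
        return True   # leaf: mammal
    return False      # leaf: not a mammal


print(is_mammal(has_fur=True, gives_birth=True))    # True
print(is_mammal(has_fur=True, gives_birth=False))   # False
print(is_mammal(has_fur=False, gives_birth=True))   # False
```

Each `if` corresponds to a node, each outcome of the test to a branch, and each `return` to a leaf. Training a decision tree amounts to discovering such conditions from data instead of writing them by hand.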

-----

### The Training Algorithm

A decision tree is trained through a process called **recursive partitioning**, which involves the following steps:

1.  **Select the Best Feature:** At each node, the algorithm evaluates all available features to determine which one best splits the data. This is typically done by calculating a metric such as information gain, Gini impurity, or entropy. The feature that provides the best split (i.e., maximizes information gain or minimizes impurity) is selected for that node.
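2.  **Split the Data:** Once the best feature is identified, the dataset is split into subsets based on that feature's unique values or ranges. For instance, if the chosen feature is "has fur" with possible values "yes" or "no," the data is split into two subsets: one subset where "has fur" is "yes" and another where it is "no." For numeric features, such as the wine measurements, the split is a threshold test instead (for example, whether a measurement is at or below some value); Scikit-Learn's trees always make binary splits of this form. Each split creates branches in the tree, leading to further splits and decision nodes.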
3.  **Repeat:** This process is repeated recursively for each subset, creating new nodes, until a stopping criterion is met (such as maximum depth or minimum number of samples per node).
4.  **Assign Outputs:** Leaf nodes are assigned an output value (class label for classification tasks).
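Steps 1 and 2 can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Scikit-Learn's actual implementation: the helper names `gini` and `best_split` and the toy animal data are made up for this example, and only a single split is searched rather than a full tree.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        # Candidate thresholds: every distinct value of this feature
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # everything landed on one side, so nothing was split
            # Impurity of the two subsets, weighted by their sizes
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Toy data (made up): columns are [has_fur, gives_birth], encoded as 0/1
rows = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 1], [0, 0]]
labels = ["mammal", "mammal", "other", "other", "other", "other"]

print(f"Impurity before splitting: {gini(labels):.3f}")  # 0.444
feature, threshold, score = best_split(rows, labels)
print(f"Best split: feature {feature} <= {threshold}, weighted impurity {score:.3f}")
# Best split: feature 0 <= 0, weighted impurity 0.222
```

On this toy data, splitting on feature 0 ("has fur") lowers the weighted impurity the most, so it would become the root node. Step 3 would then apply `best_split` again to each of the two resulting subsets until a stopping criterion is met.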

-----

### Training a Decision Tree Classifier

Now, let's train our **Decision Tree Classifier**. This is like creating the described "decision flowchart" based on our training data.

Here’s how to do it with **Scikit-Learn**:

```python
from sklearn.tree import DecisionTreeClassifier

# Create the classifier with some parameters
tree_clf = DecisionTreeClassifier(max_depth=5, min_samples_split=3)

# Train the classifier
tree_clf.fit(X_train, y_train)
```

To evaluate our trained Decision Tree Classifier, we will calculate its accuracy on the testing set. As a reminder, **accuracy** is the ratio of correctly predicted instances to the total instances in the dataset.

Here’s how you can do it:

```python
from sklearn.metrics import accuracy_score

# Predict the labels for the test set
y_pred = tree_clf.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the Decision Tree Classifier: {accuracy:.2f}")  # e.g. 0.94 (can vary between runs, since no random_state is set)
```
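To see the "decision flowchart" the classifier actually learned, the fitted tree can be printed as text. The sketch below retrains the same model so that it runs on its own; `random_state=42` is an addition here (not part of the lesson's code) so that the printed tree is the same on every run.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Same parameters as in the lesson, plus a fixed seed (an assumption of this sketch)
tree_clf = DecisionTreeClassifier(max_depth=5, min_samples_split=3, random_state=42)
tree_clf.fit(X_train, y_train)

# Size of the learned tree
print(f"Depth: {tree_clf.get_depth()}, leaves: {tree_clf.get_n_leaves()}")

# Text rendering of the learned rules, one line per node
print(export_text(tree_clf, feature_names=list(wine.feature_names)))
```

Each indented line in the output is a threshold test on one measurement, and each `class:` line is a leaf, which mirrors the root, branch, and leaf structure described earlier.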

-----

### General Parameters of a Decision Tree

When creating a decision tree, you can adjust several parameters to control its complexity and performance:

  * `max_depth`: The maximum depth of the tree.
  * `min_samples_split`: The minimum number of samples required to split an internal node.
  * `min_samples_leaf`: The minimum number of samples required to be at a leaf node.
  * `max_features`: The number of features to consider when looking for the best split.

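The first three parameters limit how far the tree can grow: `max_depth` caps the depth directly, while `min_samples_split` and `min_samples_leaf` stop splitting once nodes become too small. Restricting growth this way helps prevent **overfitting** by keeping the tree reasonably simple. `max_features` instead limits how many features are examined at each split.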
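The effect of these parameters is easiest to see by comparing a few settings side by side. The sketch below is illustrative only: the particular values (`max_depth=2`, `min_samples_leaf=10`) and the fixed `random_state=42` are arbitrary choices made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Parameter settings to compare (chosen for illustration only)
settings = {
    "unconstrained": {},
    "max_depth=2": {"max_depth": 2},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
}

results = {}
for name, params in settings.items():
    clf = DecisionTreeClassifier(random_state=42, **params)
    clf.fit(X_train, y_train)
    results[name] = {
        "depth": clf.get_depth(),
        "leaves": clf.get_n_leaves(),
        "train_acc": clf.score(X_train, y_train),
        "test_acc": clf.score(X_test, y_test),
    }

for name, r in results.items():
    print(f"{name:>20}: depth={r['depth']}, leaves={r['leaves']}, "
          f"train={r['train_acc']:.2f}, test={r['test_acc']:.2f}")
```

A large gap between training and test accuracy is the usual sign of overfitting. Constrained trees typically trade some training accuracy for a smaller, simpler tree.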

-----

### Lesson Summary

Let's recap:

  * **Loading and Splitting the Dataset:** We loaded the wine dataset and split it into training and testing sets.
  * **Concept of a Decision Tree:** We discussed how a decision tree splits data based on features.
  * **The Training Algorithm:** We explored how the decision tree algorithm recursively splits the data.
  * **Training a Decision Tree Classifier:** We trained a **Decision Tree Classifier** using the `fit` method and measured its accuracy on the test set.
  * **General Parameters:** We covered some important parameters that control the complexity of a decision tree.

Now that you have learned the theory, it's time for hands-on practice. You will get to load data, split it, and train your own **Decision Tree Classifier**. This will help solidify what you’ve just learned. Let's get to it!

## Adjust Decision Tree Depth

Hello, Space Explorer!

Let's see if we can improve the Decision Tree Classifier! Set the `max_depth` parameter to limit the tree's growth and keep it from overfitting.

Try setting `max_depth` to 2, 3, 4, and 5, and keep the value that works best!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Decision Tree Classifier
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

# Predict and print accuracy
accuracy = accuracy_score(y_test, tree_clf.predict(X_test))
print(f"Model Accuracy: {accuracy:.2f}")

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Dictionary to store accuracies for different max_depth values
accuracies = {}

# Experiment with different max_depth values
for depth in [2, 3, 4, 5]:
    # Initialize the Decision Tree Classifier with the current max_depth
    tree_clf = DecisionTreeClassifier(max_depth=depth, random_state=42)  # fixed seed for reproducibility

    # Train the classifier
    tree_clf.fit(X_train, y_train)

    # Predict and calculate accuracy
    y_pred = tree_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies[depth] = accuracy
    print(f"Model Accuracy with max_depth={depth}: {accuracy:.2f}")

# Find the best max_depth
best_depth = max(accuracies, key=accuracies.get)
best_accuracy = accuracies[best_depth]

print(f"\nBest max_depth found: {best_depth} with an accuracy of {best_accuracy:.2f}")

# Retrain with the best-performing depth (ties go to the smallest depth tried)
tree_clf = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
tree_clf.fit(X_train, y_train)

# Predict and print accuracy for the chosen model
final_accuracy = accuracy_score(y_test, tree_clf.predict(X_test))
print(f"\nModel Accuracy with max_depth={best_depth}: {final_accuracy:.2f}")
```

## Train and Predict with Decision Tree Classifier

Great job, Space Explorer!

Let's add the missing piece to train our Decision Tree Classifier and predict the labels. Follow the hints and complete your mission.

Good luck!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Create a decision tree classifier with max_depth=3 (name it tree_clf)

# TODO: Train the classifier using the training data

# Predict the labels for the test set and measure accuracy
accuracy = accuracy_score(y_test, tree_clf.predict(X_test))
print(f"Accuracy: {accuracy:.2f}")

```

The updated code now initializes the DecisionTreeClassifier with a max_depth of 3 and then trains the classifier using the provided training data. After training, it makes predictions on the test set and calculates the accuracy.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
X, y = load_wine(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree classifier; max_depth=3 limits the tree's depth
# to reduce overfitting, and random_state makes the result reproducible
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train the classifier on the training features and labels
tree_clf.fit(X_train, y_train)

# Predict the labels for the unseen test set and measure accuracy
y_pred = tree_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
```

## Train the Decision Tree Classifier

Alright, Space Explorer! Let's build on what we've learned.

Some code blocks are missing to train our decision tree classifier on the wine dataset.

Fill in the TODO lines to make everything work. Note that there are three classes, not two. This is called multiclass classification. Luckily for us, the Scikit-Learn code for this type of classification is exactly the same!

May the stars guide your way!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the wine dataset
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Create and train the Decision Tree Classifier
# Use max_depth=5 and min_samples_split=3

# TODO: Predict the labels for the test set

# Print accuracy and predictions
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Decision Tree Classifier: {accuracy:.2f}")
print(f"Predictions: {y_pred}")

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the wine dataset
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Create and train the Decision Tree Classifier
# Create an instance of DecisionTreeClassifier with specified parameters.
# max_depth limits the tree's growth to prevent overfitting.
# min_samples_split ensures that a node must have at least 3 samples to be split.
tree_clf = DecisionTreeClassifier(max_depth=5, min_samples_split=3, random_state=42)  # fixed seed for reproducibility

# Train the classifier using the training data.
tree_clf.fit(X_train, y_train)

# TODO: Predict the labels for the test set
# Use the trained classifier to predict the target labels for the test features.
y_pred = tree_clf.predict(X_test)

# Print accuracy and predictions
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Decision Tree Classifier: {accuracy:.2f}")
print(f"Predictions: {y_pred}")
```

## Comparing Logistic Regression and Decision Tree Models

Space Explorer, we've ventured through many galaxies of knowledge. Now it's time for your final challenge!

Compare the accuracies of Logistic Regression and Decision Tree models on the wine dataset. Write the code from scratch to load the dataset, split it, train both models, and calculate their accuracies. Show the accuracy printout for both models in the end.

Let's see your cosmic coding skills!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# TODO: Load the wine dataset and split it into training and testing sets

# TODO: Train a Logistic Regression model on the training data

# TODO: Make predictions with the Logistic Regression model and calculate its accuracy

# TODO: Train a Decision Tree model on the training data

# TODO: Make predictions with the Decision Tree model and calculate its accuracy

# TODO: Print the accuracies of both models

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model
logistic_regression_model = LogisticRegression(max_iter=1000) # Increased max_iter for convergence
logistic_regression_model.fit(X_train, y_train)

# Make predictions with the Logistic Regression model and calculate its accuracy
lr_predictions = logistic_regression_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_predictions)

# Train a Decision Tree model
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train, y_train)

# Make predictions with the Decision Tree model and calculate its accuracy
dt_predictions = decision_tree_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)

# Print the accuracies of both models
print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")
print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
```

**Cosmic Coding Report:**

Space Explorer, your challenge has been met! The code successfully loads the wine dataset, splits it into training and testing sets, and then trains both a Logistic Regression and a Decision Tree model. Finally, the accuracies of both models are calculated and displayed.

Here are the accuracies:

  * Logistic Regression Accuracy: 0.9722
  * Decision Tree Accuracy: 0.9444

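In this particular cosmic journey with the wine dataset, the Logistic Regression model edged out the Decision Tree in accuracy. With only 36 wines in the test set, that gap amounts to a single wine classified differently, so treat it as a narrow win rather than a decisive one. Both models performed commendably, demonstrating their ability to classify wine varieties!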