### Decision Trees

#### Introduction
A **Decision Tree** is a supervised machine learning algorithm used for **classification** and **regression** tasks. It models decisions as a hierarchy of **if-else conditions**: at each node, the data is split on the feature (and threshold) that best separates the target values according to a chosen criterion.
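
To make the if-else view concrete, here is a minimal hand-written sketch of a depth-2 tree for iris-like measurements. The function name `classify_iris` and both thresholds (2.45 cm and 1.75 cm) are illustrative assumptions, not values learned in this notebook.

```python
def classify_iris(petal_length_cm: float, petal_width_cm: float) -> str:
    """Hand-written depth-2 decision tree (thresholds are illustrative)."""
    if petal_length_cm <= 2.45:      # root node: first split
        return "setosa"              # leaf node
    if petal_width_cm <= 1.75:       # decision node: second split
        return "versicolor"          # leaf node
    return "virginica"               # leaf node


print(classify_iris(1.4, 0.2))
print(classify_iris(4.5, 1.5))
print(classify_iris(6.0, 2.2))
```

A trained tree has the same shape; the difference is that the features and thresholds are chosen from the data rather than written by hand.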

---

### Key Concepts

#### Root Node
- The **starting point** of the tree.
- Represents the entire dataset, which is split into child nodes.

#### Splitting
- The process of dividing nodes into **sub-nodes** based on feature conditions.

#### Decision Node
- A node that **splits** into further sub-nodes.

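#### Leaf Node
- The **final** output node (contains no further splits).
- In classification, it predicts a **class label** (typically the majority class of the training samples that reach the leaf).
- In regression, it holds a **continuous value** (typically the mean of the training targets that reach the leaf).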

#### Pruning
- Reducing tree size by **removing branches** to avoid **overfitting**.
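
As a rough sketch of pruning in practice, scikit-learn offers cost-complexity post-pruning through the `ccp_alpha` parameter (limiting `max_depth`, as done later in this notebook, is a form of pre-pruning). The value `0.02` and `random_state=0` below are arbitrary illustrative choices, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fully grown tree (no pruning).
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Post-pruned tree: a larger ccp_alpha removes more branches.
# The value 0.02 is an arbitrary illustrative choice.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print("Leaves without pruning:", full_tree.get_n_leaves())
print("Leaves with ccp_alpha=0.02:", pruned_tree.get_n_leaves())
```

In a real project `ccp_alpha` would normally be chosen by validation rather than fixed by hand.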

---

#### How Decision Trees Work

* **Select the Best Feature**:  
   - Every candidate split is scored with a criterion such as **Gini Impurity**, **Entropy (Information Gain)**, or **Variance Reduction**.

* **Split the Data**:  
   - The feature and threshold with the best score are used to divide the node's data into child nodes.

* **Repeat Until a Stopping Condition is Met**:  
   - The same procedure is applied recursively to each child node. Stopping criteria include **maximum depth**, **minimum samples per node**, or a node that is already pure.

* **Make Predictions**:  
   - Traverse from the root to a **leaf node** by following the branch that matches each feature value, then return the leaf's value.
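
The first two steps can be sketched in plain Python as an exhaustive search over features and thresholds. This is a simplified illustration under stated assumptions (binary splits, Gini as the criterion, midpoints as candidate thresholds, made-up toy data); the helper names `gini` and `best_split` are hypothetical, and real libraries use far more efficient implementations.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            # Keep the split with the lowest weighted impurity (first one wins ties).
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Toy data: [petal length, petal width]; labels are made up for illustration.
rows = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5], [5.1, 1.9]]
labels = ["a", "a", "b", "b", "c", "c"]

feature, threshold, score = best_split(rows, labels)
print(f"Split on feature {feature} at <= {threshold:.2f} (weighted Gini {score:.3f})")
```

Building a full tree amounts to calling `best_split` again on each child's rows until a stopping condition is met.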

---

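#### Splitting Criteria

#### **For Classification:**
1. **Gini Impurity**  
   \[
   \text{Gini} = 1 - \sum_{i=1}^{C} p_i^2
   \]  
   - Here \(p_i\) is the fraction of samples in the node that belong to class \(i\), and \(C\) is the number of classes.
   - Measures **impurity** in a node (lower is better; 0 means the node is pure).

2. **Entropy & Information Gain**  
   \[
   \text{Entropy} = -\sum_{i=1}^{C} p_i \log_2 p_i
   \]  
   - Measures the uncertainty of the class labels in a node.

   \[
   \text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{k} \frac{n_k}{n}\,\text{Entropy}(\text{Child}_k)
   \]
   - Here \(n_k\) is the number of samples in child \(k\) and \(n\) is the number of samples in the parent.
   - Higher **Information Gain** means a better split.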
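
A small worked example may help connect the formulas to numbers. The sketch below evaluates them on a made-up node of 10 samples (5 "yes", 5 "no") that is split into two children; the data and the helper names are assumptions for illustration only.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini = 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(f"Gini(parent)     = {gini(parent):.3f}")
print(f"Entropy(parent)  = {entropy(parent):.3f}")
print(f"Information gain = {information_gain(parent, [left, right]):.3f}")
```

A perfectly balanced two-class parent has the maximum Gini (0.5) and entropy (1 bit); the split lowers the uncertainty, so the information gain is positive.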

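#### **For Regression:**
- Uses **Mean Squared Error (MSE)** or **Mean Absolute Error (MAE)** as the impurity measure. Minimizing the weighted MSE of the child nodes is equivalent to maximizing **Variance Reduction**.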
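
For regression, the same idea can be sketched with the MSE criterion. The target values below are made up, and the helper names `mse` and `variance_reduction` are hypothetical.

```python
def mse(values):
    """Mean squared error around the node mean (i.e., the variance of the node)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_reduction(parent, left, right):
    """Parent MSE minus the size-weighted MSE of the two children."""
    n = len(parent)
    weighted = len(left) / n * mse(left) + len(right) / n * mse(right)
    return mse(parent) - weighted


# Made-up target values for illustration.
parent = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
left, right = parent[:3], parent[3:]

print(f"MSE(parent)        = {mse(parent):.3f}")
print(f"Variance reduction = {variance_reduction(parent, left, right):.3f}")
print("Leaf predictions:", sum(left) / len(left), sum(right) / len(right))
```

Each leaf predicts the mean of its targets, so a split that separates the low values from the high values removes almost all of the variance.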

---

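#### Advantages of Decision Trees

- **Easy to interpret & visualize**  
- **No need for feature scaling** (e.g., Standardization), because splits depend only on the ordering of feature values  
- **Handles both numerical & categorical data** (scikit-learn's implementation expects categorical features to be numerically encoded first)  
- **Can handle missing values** in some implementations (support varies by library and version)  
- **Performs feature selection automatically** (uninformative features are rarely chosen for splits)  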
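
One way to see the implicit feature selection is to inspect `feature_importances_` on a fitted scikit-learn tree. This is a minimal sketch on the iris data; `random_state=0` is an assumption added for reproducibility and is not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Importances are normalized to sum to 1; features never used in a split get 0.
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name:<20} {importance:.3f}")
```

Features with an importance of 0 were never selected for a split, which is the sense in which the tree "selects" features.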

---

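#### Disadvantages of Decision Trees

- **Prone to overfitting** (deep trees can memorize the training data)  
- **Sensitive to noisy data** (small changes in the training set can produce a very different tree)  
- **Often outperformed by ensembles on large or complex datasets** (better alternatives: Random Forests)  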
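
The overfitting risk can be illustrated by comparing a shallow tree with a fully grown one on noisy data. The sketch below uses a synthetic dataset from `make_classification`; every setting (sample count, noise level, seeds) is an illustrative assumption, and the exact scores will depend on them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 reassigns about 20% of the labels at random.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

for depth in (3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")
```

The unrestricted tree fits the noisy training labels essentially perfectly; a gap between its training and test accuracy is the usual symptom of memorization.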

---




In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import graphviz


In [3]:
iris = load_iris()
X = iris.data
y = iris.target


In [5]:
X[:10]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [6]:
y[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


In [9]:
# Create a decision tree classifier with a maximum depth of 3
clf = DecisionTreeClassifier(max_depth=3)

# Train the classifier on the training data
clf.fit(X_train, y_train)

In [11]:
# Make predictions on the testing data
y_pred = clf.predict(X_test)
y_pred

array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       2, 0, 2, 1, 0, 0, 1, 2])

In [12]:
# accuracy_score expects (y_true, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the model: {accuracy}")

Accuracy of the model: 0.9666666666666667


In [14]:
# Visualize the decision tree using graphviz
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    special_characters=True
)

graph = graphviz.Source(dot_data)
graph.render('iris_tree', view=True)

'iris_tree.pdf'




In [15]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)


Confusion Matrix:
[[11  0  0]
 [ 0 12  1]
 [ 0  0  6]]
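
The accuracy reported earlier can be recovered directly from this matrix: the diagonal holds the correctly classified samples. A short sketch, with the matrix copied by hand from the output above (rows are true classes, columns are predicted classes):

```python
# Confusion matrix copied from the output above (rows = true, columns = predicted).
cm = [[11, 0, 0],
      [0, 12, 1],
      [0, 0, 6]]

correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal entries
total = sum(sum(row) for row in cm)               # all test samples

print(f"{correct} / {total} = {correct / total:.4f}")
```

The single off-diagonal entry is one class-1 sample that was predicted as class 2.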


In [16]:
# Precision, recall and F1-score
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        13
           2       0.86      1.00      0.92         6

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30
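
The per-class rows of the report follow from the confusion matrix shown earlier. The sketch below recomputes them by hand; the matrix is copied from that output, and the variable names are illustrative.

```python
# Confusion matrix from the earlier cell (rows = true, columns = predicted).
cm = [[11, 0, 0],
      [0, 12, 1],
      [0, 0, 6]]

metrics = []  # one (precision, recall, f1, support) tuple per class
for k in range(len(cm)):
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum: samples predicted as k
    actual_k = sum(cm[k])                     # row sum: true samples of k (support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    metrics.append((precision, recall, f1, actual_k))
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={actual_k}")
```

Precision divides by the column sum and recall by the row sum, which is why the one misclassified sample lowers the recall of class 1 and the precision of class 2.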



In [18]:
import numpy as np

# Check unique values in y_test
print("Unique values in y_test:", np.unique(y_test))


Unique values in y_test: [0 1 2]


In [19]:
# Multiclass AUC score (one-vs-rest)
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

auc_score = roc_auc_score(y_test, clf.predict_proba(X_test), multi_class='ovr')
print(f"Multiclass AUC Score: {auc_score:.2f}")

Multiclass AUC Score: 0.98
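
To show what `multi_class='ovr'` computes, the sketch below rebuilds the score by hand: one binary AUC per class ("this class versus the rest"), then an unweighted mean, which matches the default macro averaging. The block retrains its own classifier so that it is self-contained; `random_state=0` is an assumption added for reproducibility, so the printed values may differ slightly from the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# One binary AUC per class: "class k" versus "everything else".
# Iris labels are 0, 1, 2, so each label doubles as its column index in proba.
per_class_auc = [
    roc_auc_score((y_test == k).astype(int), proba[:, k])
    for k in clf.classes_
]
manual_auc = float(np.mean(per_class_auc))
sklearn_auc = roc_auc_score(y_test, proba, multi_class="ovr")

print("Per-class AUC:", np.round(per_class_auc, 3))
print(f"Mean of per-class AUCs: {manual_auc:.3f}")
print(f"roc_auc_score(..., multi_class='ovr'): {sklearn_auc:.3f}")
```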
