In [None]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

In [None]:
# Step 2: Load the dataset
df = pd.read_csv("breast_cancer.csv")

In [None]:
# Display the first few rows of the DataFrame
df.head()

In [4]:
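# Step 3: Preprocess the dataset
# Replace missing values (marked as '?') with NaN and drop rows with missing values
df.replace('?', pd.NA, inplace=True)
df.dropna(inplace=True)
# Columns that contained '?' were read as strings; convert them back to numbers
# (assumption: every remaining column holds numeric values)
df = df.apply(pd.to_numeric)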

In [None]:
# Confirm that no missing values remain
print(df.isnull().sum())

In [6]:
# Map the 'Class' column: 2 (benign) -> 0, 4 (malignant) -> 1
df['Class'] = df['Class'].map({2: 0, 4: 1})

In [None]:
df.head(10)

In [8]:
# Step 4: Define features (X) and target variable (y)
X = df.drop(columns=['Class'])  # All features except the target
y = df['Class']  # Target variable (0: benign, 1: malignant)

In [9]:
# Step 5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
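
The `stratify=y` argument keeps the benign/malignant ratio roughly equal in the training and testing sets. The idea can be sketched in plain Python as below. This is a simplified illustration on hypothetical labels, not scikit-learn's implementation; `stratified_split` and `toy_labels` are made-up names.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share of that class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 70 benign (0) and 30 malignant (1).
toy_labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(toy_labels)
test_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share)
```

Because the cut is made per class, the malignant share of the test indices matches the share in the full label list.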

In [10]:
# Step 6: Initialize the decision tree classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42)

In [None]:
# Step 7: Train the model on the training set
clf.fit(X_train, y_train)

In [None]:
# Step 8: Evaluate the model on the testing set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Decision Tree Classifier: {accuracy * 100:.2f}%")
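
Accuracy alone does not show which kind of error the model makes, and a missed malignant case is a different mistake from a benign case flagged as malignant. The sketch below counts the four outcomes by hand on hypothetical label lists. `confusion_counts` is an illustrative helper, and the example labels are not results from this notebook; `sklearn.metrics.confusion_matrix` reports the same counts when applied to `y_test` and `y_pred`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Hypothetical labels (1 = malignant, 0 = benign), not results from this notebook.
example_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
example_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(example_true, example_pred)
accuracy_example = (counts["tp"] + counts["tn"]) / len(example_true)
# Recall: the fraction of malignant cases that the predictions caught.
recall_example = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, accuracy_example, recall_example)
```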

In [None]:
# Step 9: Visualize the Decision Tree with Higher Resolution
plt.figure(figsize=(15, 10), dpi=300)  # Larger figure; 300 dpi stays sharp without the rendering cost of 1200 dpi
plot_tree(
    clf,  # The trained classifier
    feature_names=list(X.columns),  # Feature names as a plain list, as in export_text below
    class_names=["Benign", "Malignant"],  # Class names in label order: 0 = benign, 1 = malignant
    filled=True,  # Use colors to represent classes
    fontsize=10  # Font size for better readability
)
plt.title("Decision Tree for Breast Cancer Classification", fontsize=16)
plt.show()

In [None]:
# Step 10: Display the tree structure in text format
tree_rules = export_text(clf, feature_names=list(X.columns))
print("Decision Tree Rules:")
print(tree_rules)

# Decision Tree Explanation

This document explains how the decision tree works, which variables were used, and the rules applied at each decision point.

---

## Overview

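The decision tree is a classification model that predicts a class label (`class: 0` for benign or `class: 1` for malignant) from the cell measurements in the dataset. At each node, the tree compares one variable against a threshold and branches accordingly. Because the model was trained with `max_depth=3`, every prediction is reached after at most three such comparisons.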

---

## Variables and Decision Rules

### 1. **Root Node: Uniformity_of_cell_size**
- The root node splits the data based on **Uniformity_of_cell_size**:
  - If `Uniformity_of_cell_size <= 3.50`, follow the left branch.
  - If `Uniformity_of_cell_size > 3.50`, follow the right branch.

### 2. **Left Branch (Uniformity_of_cell_size <= 3.50)**
- The next decision depends on **Bare_nuclei**:
  - **If Bare_nuclei <= 5.50**:
    - Check **Normall_nucleoli**:
      - If `Normall_nucleoli <= 3.50`: Predict **class: 0**.
      - If `Normall_nucleoli > 3.50`: Predict **class: 1**.
  - **If Bare_nuclei > 5.50**:
    - Check **Clump_thickness**:
      - If `Clump_thickness <= 2.00`: Predict **class: 0**.
      - If `Clump_thickness > 2.00`: Predict **class: 1**.

### 3. **Right Branch (Uniformity_of_cell_size > 3.50)**
- The next split is again on **Uniformity_of_cell_size**, this time with a threshold of 4.50:
  - **If Uniformity_of_cell_size <= 4.50**:
    - Check **Bare_nuclei**:
      - If `Bare_nuclei <= 7.50`: Predict **class: 0**.
      - If `Bare_nuclei > 7.50`: Predict **class: 1**.
  - **If Uniformity_of_cell_size > 4.50**:
    - Check **Marginal_adhesion**:
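      - If `Marginal_adhesion <= 1.50`: Predict **class: 1**.
      - If `Marginal_adhesion > 1.50`: Predict **class: 1**.
      - Both leaves predict **class: 1**, so this split changes the class proportions within each leaf but not the predicted label.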
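
The rules in this section can be written out as a plain Python function, which makes the nesting explicit. This is a hand transcription of the thresholds listed above, not output generated by scikit-learn; the function name `predict_class` and its argument order are illustrative choices.

```python
def predict_class(
    uniformity_of_cell_size,
    bare_nuclei,
    normall_nucleoli,
    clump_thickness,
    marginal_adhesion,
):
    """Return 0 (benign) or 1 (malignant) by following the rules listed above."""
    if uniformity_of_cell_size <= 3.5:
        if bare_nuclei <= 5.5:
            return 0 if normall_nucleoli <= 3.5 else 1
        return 0 if clump_thickness <= 2.0 else 1
    if uniformity_of_cell_size <= 4.5:
        return 0 if bare_nuclei <= 7.5 else 1
    # Both Marginal_adhesion leaves predict class 1, so marginal_adhesion
    # is not needed to decide the label on this branch.
    return 1


# Hypothetical feature values, chosen only to exercise two different branches.
low_risk = predict_class(1, 1, 1, 5, 1)
high_risk = predict_class(8, 10, 7, 8, 6)
print(low_risk, high_risk)
```

With these inputs, the first call follows the left branch down to class 0 and the second ends in the right-most leaves, class 1.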

---

## Summary of Decision Process

1. **Variable Selection**:
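   - The tree begins by splitting on **Uniformity_of_cell_size**, meaning that a threshold on this variable gave the largest impurity reduction on the training set (Gini impurity is the scikit-learn default criterion).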
   - Subsequent splits involve **Bare_nuclei**, **Normall_nucleoli**, **Clump_thickness**, and **Marginal_adhesion**.

2. **Class Assignment**:
   - Each leaf node represents a terminal decision where a class is assigned based on the rules applied.

3. **Predictions**:
   - **Class 0**: Benign (original label `2` in the `Class` column).
   - **Class 1**: Malignant (original label `4` in the `Class` column).
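
How a tree chooses the variable and threshold for a split can be sketched in a few lines. By default, scikit-learn's `DecisionTreeClassifier` scores splits with Gini impurity and places thresholds at midpoints between adjacent distinct feature values, which is why values such as 3.50 appear for integer-valued features. The code below is a simplified pure-Python illustration on made-up data, not scikit-learn's implementation; `gini`, `best_split`, and the toy lists are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, impurity_decrease) for the best single-feature split."""
    parent = gini(labels)
    n = len(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        decrease = parent - weighted
        if decrease > best[1]:
            best = (threshold, decrease)
    return best


# Made-up toy data: one feature separates the classes well, the other does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [1, 1, 2, 3, 4, 6, 8, 10]
uninformative = [1, 5, 2, 6, 1, 5, 2, 6]

print(best_split(informative, labels))
print(best_split(uninformative, labels))
```

On this toy data, the informative feature yields a threshold of 3.5 with an impurity decrease of 0.5, while the other feature offers no split that reduces impurity. A tree would therefore split on the first feature, in the same way the trained model placed **Uniformity_of_cell_size** at the root.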

---

## How the Tree Works
The decision tree works by:
- Splitting the dataset at each node using threshold rules.
- Branching left or right based on whether the condition is met.
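- Assigning a class at each leaf node: the majority class among the training samples that reach it.

Each decision narrows down the subset of samples, so the leaves become progressively purer. The splits are chosen greedily and the depth is capped at 3, so the tree is a compact, interpretable approximation rather than a guaranteed optimal classifier.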

---


##### The End