# Decision Tree Assignment

---

## Question 1: What is a Decision Tree, and how does it work in the context of classification?

**Answer:**  
A **Decision Tree** is a supervised machine learning algorithm used for **classification and regression** tasks. It represents decisions and their possible consequences as a **tree-like structure** of nodes and leaves.

**How it works (for classification):**  
1. The tree starts at a **root node** that contains all data.  
2. It **splits** data based on a feature and threshold that best separates classes (using metrics like Gini or Entropy).  
3. The process continues recursively, creating **child nodes**.  
4. The tree stops growing when a **stopping criterion** is met (e.g., max depth, no further information gain).  
5. Each **leaf node** assigns a class label.
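
The steps above can be sketched in plain Python. This is a simplified illustration, not how scikit-learn implements trees: the helper names (`gini`, `best_split`, `build_tree`, `predict`) and the six toy rows are made up for this example, and candidate thresholds are assumed to be midpoints between consecutive distinct values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split.

    Returns feature_index=None when no split lowers the node's impurity.
    """
    n = len(rows)
    best = (None, None, gini(labels))  # baseline: impurity without splitting
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, threshold, score)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively grow a tree; a leaf is the majority class label."""
    feature, threshold, _ = best_split(rows, labels)
    # Stopping criteria: pure node, depth limit, or no impurity-reducing split
    if gini(labels) == 0.0 or depth >= max_depth or feature is None:
        return Counter(labels).most_common(1)[0][0]
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

# Toy data, made up for illustration: [petal_length, petal_width]
rows = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5], [5.9, 2.1]]
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

tree = build_tree(rows, labels)
print(predict(tree, [5.0, 1.5]))
```

On this toy data the root splits on the first feature, and each subsequent split separates one remaining class, which mirrors steps 1 to 5.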

**Advantages:**  
- Easy to understand and visualize  
- Handles both categorical and numerical data  
- Captures non-linear relationships  

**Disadvantages:**  
- Prone to overfitting if not pruned  
- Can be unstable with small changes in data  
- Impurity-based splitting tends to favor features with many distinct values

## Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

**Answer:**  
Both **Gini Impurity** and **Entropy** measure how mixed the classes are in a node. The goal is to make nodes **as pure as possible**.

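- **Gini Impurity:**  
  Formula:  
  \( Gini = 1 - \sum_k p_k^2 \)  
  where \( p_k \) is the proportion of class *k* in the node.  
  - 0 means completely pure (only one class).  
  - Higher Gini means more impurity; the maximum is \( 1 - 1/K \) for *K* equally likely classes (0.5 for two classes).

- **Entropy:**  
  Formula:  
  \( Entropy = -\sum_k p_k \log_2(p_k) \)  
  with the convention that a class with \( p_k = 0 \) contributes 0.  
  - 0 means pure.  
  - Higher entropy means more disorder; the maximum is \( \log_2 K \) for *K* equally likely classes (1 bit for two classes).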

**Impact on splits:**  
- Both metrics prefer splits that create purer child nodes.  
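- Gini is slightly cheaper to compute because it avoids the logarithm; entropy rises more steeply as a node moves away from purity, which makes it somewhat more sensitive to small minority-class proportions.  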
- In practice, they produce similar trees.
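
As a quick check of the two formulas, the sketch below evaluates both measures for a few class distributions. The function names `gini_impurity` and `entropy` are chosen for this example only.

```python
import math

def gini_impurity(proportions):
    """Gini = 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Entropy in bits; classes with p = 0 are skipped (they contribute 0)."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

for name, props in [("pure", [1.0, 0.0]), ("90/10", [0.9, 0.1]), ("50/50", [0.5, 0.5])]:
    print(f"{name}: gini={gini_impurity(props):.3f}, entropy={entropy(props):.3f}")

# pure: gini=0.000, entropy=0.000
# 90/10: gini=0.180, entropy=0.469
# 50/50: gini=0.500, entropy=1.000
```

Both measures are 0 for a pure node and peak at an even class mix, which is why they usually rank candidate splits in a similar order.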

## Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

**Answer:**  

| Type | Description | Example | Advantage |
|------|--------------|----------|------------|
| **Pre-Pruning** | Stop tree growth early with constraints such as `max_depth`, `min_samples_split`, or `min_samples_leaf`. | Setting `max_depth=5` | Limits overfitting and shortens training, because the full tree is never built |
| **Post-Pruning** | Grow the full tree first, then remove branches that add little, judged on validation data or by cost-complexity pruning (`ccp_alpha`). | Using `DecisionTreeClassifier(ccp_alpha=0.01)` | Often generalizes better, because each branch is judged after its actual contribution is known rather than cut off in advance |
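
A minimal sketch of both approaches in scikit-learn follows. It uses the bundled breast cancer dataset as a stand-in. The variable names and the choice of a mid-range `ccp_alpha` are assumptions made for illustration; in practice the alpha would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, test_size=0.25, random_state=0, stratify=y_bc
)

# Reference: fully grown tree
full = DecisionTreeClassifier(random_state=0).fit(X_bc_train, y_bc_train)

# Pre-pruning: cap the depth before training starts
pre = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_bc_train, y_bc_train)

# Post-pruning: get the candidate alphas for the full tree, then refit with one
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_bc_train, y_bc_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-range pick, for illustration only
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_bc_train, y_bc_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: leaves={model.get_n_leaves()}, test accuracy={model.score(X_bc_test, y_bc_test):.3f}")
```

Comparing the leaf counts shows how much each strategy shrinks the tree relative to the fully grown one.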

## Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

**Answer:**  
**Information Gain (IG)** measures the **reduction in entropy** after a dataset is split based on a feature.  
It tells us how much a split improves class purity.

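Formula:  
\( IG = Entropy(Parent) - \sum_i \frac{n_i}{n} Entropy(Child_i) \)  
where \( n_i \) is the number of samples in child *i* and \( n \) is the number of samples in the parent.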

**Importance:**  
- High IG means the split produces more homogeneous subsets.  
- With the entropy criterion, the tree picks the feature and threshold with **maximum Information Gain** at each node.  
- It helps in building a model that efficiently separates the classes.
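
A small worked example makes the formula concrete. The labels and the two candidate splits below are made up for illustration, and the helper names `label_entropy` and `information_gain` are assumptions of this sketch.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * label_entropy(child) for child in children)
    return label_entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5                                # entropy = 1 bit
split_a = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]           # fairly clean children
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]   # children still mixed

print(f"IG(split_a) = {information_gain(parent, split_a):.3f}")  # 0.278
print(f"IG(split_b) = {information_gain(parent, split_b):.3f}")  # 0.029
```

Split A leaves much purer children, so it has the higher gain and would be chosen.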

## Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Answer:**  

**Applications:**  
- Healthcare: Disease diagnosis based on symptoms  
- Finance: Credit risk and loan approval  
- Marketing: Customer segmentation and churn prediction  
- Manufacturing: Fault detection  
- Education: Predicting student performance  

**Advantages:**  
- Easy to interpret and visualize  
- No need for feature scaling  
- Handles numerical and categorical data (scikit-learn’s trees require categorical features to be encoded as numbers first)  

**Limitations:**  
- Easily overfits (needs pruning or ensemble methods)  
- Sensitive to small data changes  
- Greedy, node-by-node splitting does not guarantee a globally optimal tree

## Question 6: Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train Decision Tree using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predictions and accuracy
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

# Feature importance
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f'{name}: {importance:.4f}')

Accuracy: 0.9333333333333333
sepal length (cm): 0.0062
sepal width (cm): 0.0292
petal length (cm): 0.5586
petal width (cm): 0.4060
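
The importances above are concentrated on the two petal measurements. One way to see where those features are used is to print the learned rules with `export_text`. The sketch below is a separate illustration, not part of the required answer: it fits a depth-2 tree on the full Iris data, and the names `iris_data`, `rules_clf`, and `rules` are chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()

# A shallow tree keeps the printed rules short enough to read
rules_clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=42)
rules_clf.fit(iris_data.data, iris_data.target)

# export_text renders the fitted tree as indented if/else rules
rules = export_text(rules_clf, feature_names=list(iris_data.feature_names))
print(rules)
```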


## Question 7: Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Reuses the Iris split (X_train, X_test, y_train, y_test) from Question 6

# Shallow tree (max_depth=3)
tree_shallow = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_shallow.fit(X_train, y_train)
y_pred_shallow = tree_shallow.predict(X_test)
acc_shallow = accuracy_score(y_test, y_pred_shallow)

# Fully grown tree
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

print('Accuracy (max_depth=3):', acc_shallow)
print('Accuracy (fully grown):', acc_full)

Accuracy (max_depth=3): 0.9666666666666667
Accuracy (fully grown): 0.9333333333333333
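
The test set holds only 30 samples, so the gap between 0.9667 and 0.9333 is a single prediction. Cross-validation over several depths gives a steadier comparison. The sketch below assumes 5-fold cross-validation on the full Iris data; the names `X_iris`, `y_iris`, and `depth_scores` are chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

depth_scores = {}
for depth in [1, 2, 3, 4, 5, None]:  # None lets the tree grow fully
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 stratified folds
    depth_scores[depth] = cross_val_score(model, X_iris, y_iris, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy = {depth_scores[depth]:.3f}")
```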


## Question 8: Train a Decision Tree Regressor on the Boston Housing dataset and print the MSE and feature importances

In [7]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing

# Load dataset (load_boston was removed in scikit-learn 1.2, so California Housing is used instead)
data = fetch_california_housing()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Results
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
for name, importance in zip(data.feature_names, reg.feature_importances_):
    print(f'{name}: {importance:.4f}')

Mean Squared Error: 0.495235205629094
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829
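
The California Housing target is expressed in units of $100,000, so taking the square root of the MSE gives an error in the same units as the prices. The short calculation below uses the MSE printed above.

```python
import math

mse = 0.495235205629094    # test MSE printed above
rmse = math.sqrt(mse)      # back to the target's units ($100,000)

print(f"RMSE: {rmse:.3f} (x $100,000)")
print(f"Typical error in dollars: about ${rmse * 100_000:,.0f}")
```

An RMSE of roughly 0.70 corresponds to an error of about $70,000 on a typical prediction.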


## Question 9: Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Reuses the California Housing split (X_train, X_test, y_train, y_test) from Question 8

param_grid = {
    'max_depth': [None, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# GridSearchCV maximizes its score, so regression uses the negated MSE
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Best cross-validation MSE:', -grid.best_score_)  # flip the sign back to a positive MSE

best_model = grid.best_estimator_
print('Test MSE:', mean_squared_error(y_test, best_model.predict(X_test)))

Best parameters: {'max_depth': None, 'min_samples_split': 10}
Best cross-validation MSE: 0.4624738299655961
Test MSE: 0.44398484516177295
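
The best `min_samples_split` found above (10) is the largest value in the grid, which suggests that larger values are worth trying. The sketch below shows one way to widen the grid and rank every combination using `cv_results_`. It uses the bundled diabetes dataset as an offline stand-in, and the names `X_demo`, `y_demo`, `demo_grid`, and `ranked` are assumptions of this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = load_diabetes(return_X_y=True)

# Wider ranges than the original grid, so the optimum is less likely to sit on an edge
demo_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    {"max_depth": [3, 5, 10, None], "min_samples_split": [2, 10, 20, 50, 100]},
    cv=5,
    scoring="neg_mean_squared_error",
)
demo_grid.fit(X_demo, y_demo)

# Pair each parameter combination with its mean CV MSE and sort, best first
ranked = sorted(
    zip(demo_grid.cv_results_["params"], -demo_grid.cv_results_["mean_test_score"]),
    key=lambda item: item[1],
)
for params, mse in ranked[:5]:
    print(params, round(float(mse), 1))
```

Looking at the top few rows, rather than only `best_params_`, shows whether nearby settings perform almost as well.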


## Question 10: Explain the step-by-step process to build a Decision Tree model for disease prediction in healthcare

**Answer:**  
**Step 1: Handle missing values**  
- Impute missing numerical values with the mean or median, fitted on the training data only so that no information leaks from the test set.  
- Impute categorical values with the mode, or add an explicit "Unknown" category.  

**Step 2: Encode categorical features**  
- Use one-hot encoding for nominal variables.  
- Use ordinal encoding (an explicit ordered mapping) for ordered features.  

**Step 3: Train a Decision Tree model**  
- Start with a simple tree using `criterion='gini'` or `'entropy'`.  
- Use `class_weight='balanced'` if data is imbalanced.  

**Step 4: Tune hyperparameters**  
- Use GridSearchCV to optimize `max_depth`, `min_samples_split`, `min_samples_leaf`, and `ccp_alpha`.  

**Step 5: Evaluate performance**  
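- Report precision, recall, F1-score, and ROC-AUC alongside accuracy, since accuracy alone is misleading when few patients have the disease.  
- Check the confusion matrix for false negatives (missed cases) and false positives (unnecessary follow-ups).  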
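
Steps 1 to 3 can be combined into a single scikit-learn pipeline, sketched below. The patient records, the column names (`age`, `blood_pressure`, `smoker`, `has_disease`), and the variable names are hypothetical and made up purely for illustration. A real project would also hold out data for evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy records with missing values (not real patient data)
patients = pd.DataFrame({
    "age": [34, 51, np.nan, 62, 45, 70, 29, 58],
    "blood_pressure": [118, 140, 132, np.nan, 125, 160, 110, 150],
    "smoker": pd.Series(["no", "yes", "no", "yes", np.nan, "yes", "no", "yes"], dtype=object),
    "has_disease": [0, 1, 0, 1, 0, 1, 0, 1],
})
X_pat = patients.drop(columns="has_disease")
y_pat = patients["has_disease"]

# Step 1 and Step 2: impute, then one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: a small tree with balanced class weights
disease_model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=42)),
])
disease_model.fit(X_pat, y_pat)
preds = disease_model.predict(X_pat)
print(preds)
```

Because the imputers and encoder live inside the pipeline, they are refitted on each training fold during cross-validation. For Step 4, the tree's settings can be tuned through `GridSearchCV` with prefixed names such as `tree__max_depth`.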

**Business value:**  
- Early disease prediction helps prioritize high-risk patients.  
- Supports doctors with data-driven decisions.  
- Reduces healthcare costs through preventive care.