# Decision Tree Assignment

## Assignment Questions and Answers

Question 1: What is a Decision Tree, and how does it work in the context of
classification?

Answer 1: A **Decision Tree** is a supervised learning algorithm used for both classification and regression. In the context of **classification**, it works by recursively splitting the dataset into subsets based on feature values, forming a tree-like structure. Each **internal node** represents a test on a feature (e.g., "Is age > 30?"), each **branch** represents an outcome of that test, and each **leaf node** represents a class label. To classify a new sample, the model starts at the root and follows the matching branches until it reaches a leaf.

The algorithm selects features to split using measures like **Gini Index** or **Entropy (Information Gain)**, aiming to create the most homogeneous groups possible.

For example, in a dataset predicting whether a warrior from the Mahabharata joins the Pandavas or Kauravas, features might include **loyalty**, **family ties**, and **dharma principles**. A rule could be: *If loyalty = Bhishma’s vow → Kauravas; if guided by dharma like Arjuna → Pandavas.*

Decision Trees are easy to interpret but prone to **overfitting**, often addressed using pruning or ensemble methods like Random Forests.
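
To make the node, branch, and leaf vocabulary concrete, here is a minimal sketch of a hand-built tree and the loop that classifies a sample with it. The feature names (`guided_by_dharma`, `bound_by_vow`) and the two sample warriors are invented for illustration; a real tree would be learned from data rather than written by hand.

```python
# A toy tree written by hand. Internal nodes name the feature they test,
# branches map each possible answer to a child node, leaves hold a class label.
# All feature names and values here are hypothetical.
toy_tree = {
    "feature": "guided_by_dharma",
    "branches": {
        True: {"label": "Pandavas"},
        False: {
            "feature": "bound_by_vow",
            "branches": {
                True: {"label": "Kauravas"},
                False: {"label": "Pandavas"},
            },
        },
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches the sample."""
    while "label" not in node:
        answer = sample[node["feature"]]
        node = node["branches"][answer]
    return node["label"]


warrior_a = {"guided_by_dharma": True, "bound_by_vow": False}
warrior_b = {"guided_by_dharma": False, "bound_by_vow": True}

print(predict(toy_tree, warrior_a))  # Pandavas
print(predict(toy_tree, warrior_b))  # Kauravas
```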


---

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?


Answer 2: Gini Impurity and Entropy are two common measures used in Decision Trees to determine the quality of a split at each node.

**Gini Impurity** measures the probability of incorrectly classifying a randomly chosen element if it were labeled according to the distribution of classes in the node. It is calculated as:

$$
Gini = 1 - \sum p_i^2
$$

where $p_i$ is the probability of class $i$. A Gini of 0 means the node is pure (all samples belong to one class).

**Entropy**, derived from information theory, measures the disorder or uncertainty in a dataset. It is given by:

$$
Entropy = - \sum p_i \log_2(p_i)
$$

A lower entropy value indicates a purer node, while higher entropy shows more randomness.

**Impact on Splits:**
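Decision Trees use these measures to select the best split: at each node the algorithm evaluates candidate splits and keeps the one that most reduces the weighted impurity of the child nodes. In practice Gini and Entropy usually produce very similar trees. Gini is slightly cheaper to compute because it needs no logarithm, and it is sometimes described as tending to isolate the most frequent class in its own branch, while Entropy tends to give slightly more balanced trees.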

**Examples:**

1. If a node has samples [50 Yes, 0 No], both Gini and Entropy = 0 (pure).
2. For [25 Yes, 25 No]: Gini = 0.5, Entropy = 1 (maximum impurity for two classes).
3. For [40 Yes, 10 No]: Gini = 0.32, Entropy ≈ 0.72 (moderate impurity).

Thus, both guide the tree toward splits that increase purity and improve classification.
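
The three examples above can be checked with a few lines of Python. This is a small sketch that applies the two formulas directly to class counts; the function names `gini` and `entropy` are chosen here for illustration.

```python
from math import log2


def gini(counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits: sum(p_i * log2(1 / p_i)), which equals -sum(p_i * log2(p_i)).

    Classes with zero samples are skipped, following the convention 0 * log(0) = 0.
    """
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


for counts in ([50, 0], [25, 25], [40, 10]):
    print(counts, "Gini:", round(gini(counts), 2), "Entropy:", round(entropy(counts), 2))
```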


---

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.


Answer 3: **Pre-Pruning** and **Post-Pruning** are techniques to prevent overfitting in Decision Trees by controlling their complexity.

**Pre-Pruning (Early Stopping):**
In pre-pruning, the tree growth is restricted before it becomes too deep. Constraints such as maximum depth, minimum samples per leaf, or minimum information gain are applied during training. This prevents the model from splitting nodes that do not significantly improve prediction.

* *Example:* If we set a maximum depth of 3 while predicting whether warriors in the Mahabharata would win based on their weapons and allies, the tree will stop after 3 levels, avoiding overly complex splits.
* *Advantage:* It saves computation time and reduces overfitting by keeping the tree simpler.

**Post-Pruning (Pruning After Full Growth):**
In post-pruning, the tree is first grown to its maximum possible depth, and then non-essential branches are removed based on validation performance. The pruning step eliminates nodes that do not contribute significantly to accuracy.

* *Example:* A fully grown tree predicting battle outcomes might create very specific rules like *“If warrior has bow + ally is Krishna + battlefield is Kurukshetra”*. Post-pruning would remove such overly specific branches if they don’t generalize well.
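* *Advantage:* It often generalizes better than early stopping, because the tree explores all splits first and only then trims complexity that does not help on validation data. A split that looks weak on its own can still be kept if the splits beneath it turn out to be useful.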

👉 In short, pre-pruning controls growth early, while post-pruning refines a fully grown tree.
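
Both approaches can be sketched with scikit-learn. Pre-pruning is shown here through `max_depth` and `min_samples_leaf`; post-pruning is shown through cost-complexity pruning (`ccp_alpha`), which is scikit-learn's built-in post-pruning method and only one of several possible strategies. The Iris data, the 70/30 split, and the specific limits are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Pre-pruning: limits are enforced while the tree is being grown.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then ask for the candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Keep the alpha with the best validation accuracy; ties go to the larger alpha,
# which corresponds to the simpler tree.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={tree.tree_.node_count}, depth={tree.get_depth()}, "
          f"validation accuracy={tree.score(X_val, y_val):.4f}")
```

On a dataset as small and clean as Iris the three trees may score almost identically; the difference in node counts is the more informative part of the output.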


---

Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Answer 4: Information Gain is a metric used in decision tree learning to determine the best way to split the data at each node. It quantifies how much a particular feature reduces uncertainty (entropy) about the target variable. In other words, it measures how much more predictable the data becomes after splitting on that feature.

How it works:

1. Entropy:
Entropy measures the impurity or randomness of a dataset. A node with high entropy has a mix of different classes, while a node with low entropy has mostly one class.

2. Information Gain Calculation:
Information Gain is calculated by subtracting the weighted average entropy of the child nodes (created by the split) from the entropy of the parent node. Each child is weighted by the fraction of the parent's samples it receives.

3. Choosing the Best Split:
The feature with the highest Information Gain is chosen as the best split because it leads to the largest reduction in uncertainty and creates more homogeneous child nodes.
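
Written as a formula, the calculation in step 2 looks as follows. The notation is introduced here for illustration: $S$ is the set of samples at the parent node, $A$ is the feature being split on, and $S_v$ is the subset of $S$ for which $A$ takes the value $v$.

$$
IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)
$$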

Example:
Imagine a decision tree trying to predict whether someone will play tennis based on weather conditions. One feature is "Outlook," which can be sunny, overcast, or rainy. Another feature is "Humidity," which can be high or normal.
Suppose splitting on "Outlook" produces three child nodes: one with mostly "play" outcomes, one with mostly "don't play" outcomes, and one with an even mix. Suppose splitting on "Humidity" produces two child nodes: one containing only "play" outcomes and one containing mostly "don't play" outcomes. In that case Humidity has the higher Information Gain, because its child nodes are purer on average.
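
The comparison can be worked through numerically. The class counts below are invented to match the scenario described in this example (14 days, 7 "play" and 7 "don't play"); they are not taken from any real dataset, and the helper names are chosen for illustration.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a node given its class counts (empty classes are skipped)."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent)
    weighted_child_entropy = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy


# Counts are [play, don't play]; all numbers are hypothetical.
parent = [7, 7]
outlook_children = [[4, 1], [1, 4], [2, 2]]   # mostly play, mostly don't play, even mix
humidity_children = [[6, 0], [1, 7]]          # only play, mostly don't play

ig_outlook = information_gain(parent, outlook_children)
ig_humidity = information_gain(parent, humidity_children)

print(f"Information Gain (Outlook):  {ig_outlook:.3f}")
print(f"Information Gain (Humidity): {ig_humidity:.3f}")
print("Best split:", "Humidity" if ig_humidity > ig_outlook else "Outlook")
```

With these counts the Humidity split wins because one of its children is completely pure, which removes a large share of the parent's uncertainty.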


*Why it's important:*

By selecting the feature with the highest Information Gain at each node, the decision tree creates branches that lead to increasingly pure (less uncertain) subsets of data, which in turn leads to more accurate classifications. In simpler terms, it helps the model make the most informative decision at each step of the tree construction.

---

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Answer 5: Decision trees are used across many fields. Common uses include loan approval in banking, medical diagnosis, customer churn prediction, and fraud detection. Their main strengths are interpretability and ease of understanding, while their main limitations are susceptibility to overfitting and instability under small changes in the data.


### *Real-world applications:*

1. Banking:
Decision trees help assess loan applications based on factors like credit score and income, aiding in quick and reliable approval decisions.

2. Healthcare:
They assist in disease diagnosis, such as predicting diabetes based on clinical data like glucose levels.

3. Marketing:
Businesses use them to predict customer churn (likelihood of leaving) based on behavior patterns and purchase history.

4. Fraud Detection:
They help identify fraudulent activities, like credit card fraud, by analyzing transaction patterns.

### *Advantages:*

1. Interpretability: Decision trees are easy to understand and visualize, making the decision-making process transparent.

2. Handles diverse data: In principle they can handle both numerical and categorical data, although scikit-learn's implementation expects categorical features to be encoded as numbers first.

3. Feature selection: They help identify the most relevant attributes for prediction.

4. Relatively low data preparation: They require less data preparation than many other machine learning algorithms; for example, features do not need to be scaled.

### *Limitations:*

1. Overfitting:
Large decision trees can be prone to overfitting the training data, leading to poor generalization on unseen data.

2. Instability:
Slight changes in the training data can lead to a significantly different tree structure.

3. Difficulty with smooth or diagonal relationships:
Because each split tests a single feature against a threshold, a tree approximates smooth trends and diagonal decision boundaries with many axis-aligned steps, which can require a deep tree and may still fit poorly.
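
The instability limitation can be observed directly. The sketch below, assuming the Iris dataset and five bootstrap resamples (both arbitrary choices for illustration), trains one tree per resample and records the root split and the size of each tree. The resamples differ only in which rows happen to be drawn, so any differences in the recorded splits come from small changes in the training data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)

root_splits = []
for _ in range(5):
    # Bootstrap resample: draw rows with replacement from the same dataset.
    idx = rng.choice(len(X), size=len(X), replace=True)
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])

    # Index 0 of the internal tree arrays describes the root node.
    feature = iris.feature_names[tree.tree_.feature[0]]
    threshold = round(float(tree.tree_.threshold[0]), 2)
    root_splits.append((feature, threshold, tree.tree_.node_count))

for feature, threshold, node_count in root_splits:
    print(f"root split: {feature} <= {threshold}, total nodes: {node_count}")
```

Ensemble methods such as Random Forests exploit exactly this variability by averaging many such trees.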



---

Question 6: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances


In [1]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


---

Question 7: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.


In [2]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a Decision Tree with max_depth=3
dt_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_limited.fit(X_train, y_train)
y_pred_limited = dt_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Train a fully grown Decision Tree (no max_depth restriction)
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)
y_pred_full = dt_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print results
print(f"Accuracy with max_depth=3: {accuracy_limited:.4f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.4f}")


Accuracy with max_depth=3: 1.0000
Accuracy with fully-grown tree: 1.0000


---

Question 8: Write a Python program to:

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances


In [3]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predictions
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Print results
print("Mean Squared Error (MSE):", mse)

# Feature importances
feature_importances = pd.DataFrame({
    'Feature': housing.feature_names,
    'Importance': regressor.feature_importances_
}).sort_values(by="Importance", ascending=False)

print("\nFeature Importances:")
print(feature_importances)


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
      Feature  Importance
0      MedInc    0.528509
5    AveOccup    0.130838
6    Latitude    0.093717
7   Longitude    0.082902
2    AveRooms    0.052975
1    HouseAge    0.051884
4  Population    0.030516
3   AveBedrms    0.028660


---

Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy


In [4]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Tune Decision Tree max_depth and min_samples_split using GridSearchCV
# Define the parameter grid to search
param_grid = {
    'max_depth': [None, 3, 5, 7, 10],  # None means no limit
    'min_samples_split': [2, 5, 10, 20]
}

# Create a Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Create GridSearchCV object
grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# 3. Print the best parameters and the resulting model accuracy
print("Best parameters found by GridSearchCV:")
print(grid_search.best_params_)

# Get the best estimator (model)
best_dt_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_dt_model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the best model on the test set: {accuracy:.4f}")

Best parameters found by GridSearchCV:
{'max_depth': None, 'min_samples_split': 10}
Accuracy of the best model on the test set: 1.0000


---

Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

Answer 10: The process can be broken down into the following steps, followed by the business value the model could provide:

---

### **Step 1: Handle Missing Values**

* **Identify missing data**: Use `df.isnull().sum()` to see which features have missing values.
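* **Numerical features**: Replace missing values with the **mean** or **median** (the median is more robust when the feature is skewed). Example: `SimpleImputer(strategy="median")`. Fit the imputer on the training split only and reuse it on the test split, so that test-set statistics do not leak into training.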
* **Categorical features**: Replace missing values with the **most frequent value** or introduce a new category `"Unknown"`.
* Business note: Proper handling ensures no patient records are discarded unnecessarily, keeping the dataset large and representative.
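
A minimal sketch of these imputation steps on a tiny hand-made table follows. The column names (`glucose`, `smoker`) and values are invented for illustration. For brevity the imputer is fitted on the whole table here; in a real workflow it would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with one gap in each column.
df = pd.DataFrame({
    "glucose": [95.0, np.nan, 140.0, 110.0],
    "smoker": ["no", "yes", None, "no"],
})

# Step 1: see which features have missing values.
print(df.isnull().sum())

# Step 2: numerical feature -> median of the observed values.
num_imputer = SimpleImputer(strategy="median")
df[["glucose"]] = num_imputer.fit_transform(df[["glucose"]])

# Step 3: categorical feature -> explicit "Unknown" category.
df["smoker"] = df["smoker"].fillna("Unknown")

print(df)
```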

---

### **Step 2: Encode Categorical Features**

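* **Ordinal Encoding**: For ordinal categories (e.g., "Mild", "Moderate", "Severe"), map each level to an integer that preserves the order. In scikit-learn this is `OrdinalEncoder` with an explicit category order; `LabelEncoder` is intended for the target column rather than for features.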
* **One-Hot Encoding**: For nominal categories (e.g., "Blood Type: A, B, AB, O"). Example: `OneHotEncoder(handle_unknown='ignore')`.
* Business note: Correct encoding lets the model understand non-numeric patient details like gender, region, or symptoms.

---

### **Step 3: Train a Decision Tree Model**

* Split data into **training (80%)** and **test (20%)** sets using `train_test_split`, passing `stratify=y` so both sets keep the same proportion of diseased patients.
* Initialize a `DecisionTreeClassifier(random_state=42)` and train it on the processed data.
* Business note: Decision Trees are interpretable, allowing doctors and stakeholders to understand the reasoning behind predictions.
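
Steps 1 to 3 can be tied together in a single scikit-learn `Pipeline`, which ensures the imputer and encoder are fitted on the training split only. The sketch below uses a synthetic dataset generated on the spot; the column names, the rule that produces the label, and the amount of missing data are all invented, so the printed accuracy says nothing about a real disease model. Only the numerical column has missing values in this toy data, so the categorical branch contains just the encoder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic patient data (hypothetical columns and label rule).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "glucose": rng.normal(110, 25, n),
    "age": rng.integers(20, 80, n).astype(float),
    "blood_type": rng.choice(["A", "B", "AB", "O"], n),
})
y = (df["glucose"] + 0.5 * df["age"] > 135).astype(int)

# Knock out 10% of the glucose readings to simulate missing values.
df.loc[rng.choice(n, 20, replace=False), "glucose"] = np.nan

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["glucose", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["blood_type"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)  # imputer and encoder are fitted on X_train only
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Because the preprocessing lives inside the pipeline, the same object can be passed to `GridSearchCV`; the tree's parameters are then addressed with the step prefix, for example `classifier__max_depth`.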

---

### **Step 4: Tune Hyperparameters**

* Use **GridSearchCV** or **RandomizedSearchCV** to tune:

  * `max_depth` (to prevent overfitting)
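  * `min_samples_split` (minimum number of patients a node must contain before it can be split)
  * `min_samples_leaf` (minimum number of patients in each leaf, which keeps predictions from resting on very few cases)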
  * `criterion` ("gini" or "entropy")
* Example parameter grid:

  ```python
  param_grid = {
      'max_depth': [3, 5, 7, None],
      'min_samples_split': [2, 5, 10],
      'min_samples_leaf': [1, 2, 4],
      'criterion': ['gini', 'entropy']
  }
  ```
* Business note: Tuning ensures the model is both **accurate and generalizable** for new patients.

---

### **Step 5: Evaluate Performance**

* Use metrics beyond accuracy:

  * **Precision & Recall** (recall matters most when a false negative, i.e. a missed diagnosis, is dangerous).
  * **ROC-AUC Score** (to measure discrimination power).
  * **Confusion Matrix** (to analyze true vs. false predictions).
* Apply **cross-validation** for robust evaluation.
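
These metrics are all available in `sklearn.metrics`. The sketch below uses a small hand-written set of true labels and predicted probabilities so the numbers can be checked by hand; in a real project `y_prob` would come from something like `model.predict_proba(X_test)[:, 1]`. The 0.5 decision threshold is an assumption and could be lowered to trade precision for recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score

# Hypothetical ground truth (1 = has the disease) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]

# Turn probabilities into class predictions with a 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)  # of the predicted positives, how many are real
recall = recall_score(y_true, y_pred)        # of the real positives, how many were found
auc = roc_auc_score(y_true, y_prob)          # uses the probabilities, not the thresholded labels

print("Confusion matrix:\n", cm)
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"ROC-AUC:   {auc:.4f}")
```

Here one diseased patient (probability 0.4) is missed and one healthy patient (probability 0.6) is flagged, giving precision and recall of 0.75 each.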

---

### **Business Value in Real-World Setting**

* **Early disease detection** → Helps doctors flag high-risk patients sooner.
* **Decision support** → Model explanations (feature importance) guide doctors on key risk factors.
* **Resource optimization** → Hospitals can prioritize testing and treatments for patients most likely at risk.
* **Cost savings & patient safety** → Reduces unnecessary tests for low-risk patients, ensuring faster, targeted care.

---

