Decision Tree Assignment


Q1: What is a Decision Tree, and how does it work in the context of
classification?


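-> In machine learning, a Decision Tree is a non-parametric supervised learning algorithm used for both classification and regression tasks. It resembles an upside-down tree or a flowchart: each internal node tests a data feature, each branch corresponds to one outcome of that test, and each leaf holds a final predicted outcome. To classify a sample, the model starts at the root, follows the branch that matches the sample's feature values at every node, and returns the class label of the leaf it reaches.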
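The root-to-leaf traversal described above can be sketched in a few lines of Python. This is only an illustration: the tree is written by hand, and the feature names (petal_length, petal_width) and thresholds are hypothetical rather than learned from data.

```python
# A hand-written toy tree for illustration only; the feature names and
# thresholds are hypothetical, not learned from a dataset.
toy_tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},             # taken when petal_length <= 2.5
    "right": {                                # taken when petal_length > 2.5
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following one branch per feature test."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

A trained classifier builds the same kind of structure automatically by choosing the feature and threshold at each node from the training data.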


Q2:  Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?


-> Gini Impurity

Gini Impurity measures the probability that a randomly chosen element from the dataset would be incorrectly labeled if it were randomly classified according to the class distribution in that node.
Formula: \(G=1-\sum p_{i}^{2}\), where \(p_{i}\) is the probability of an element belonging to class \(i\).
Range: From 0 to 0.5 for binary classification (in general, up to \(1-1/k\) for \(k\) classes).
0: Perfect purity (all elements belong to one class).
0.5: Maximum impurity in the binary case (elements are evenly distributed between the two classes).
Split Impact: It favors larger partitions and is computationally efficient because it does not involve logarithmic calculations. It is the default metric for the CART (Classification and Regression Trees) algorithm.
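As a quick check of the formula, here is a minimal sketch that computes Gini Impurity from a list of class labels. The function name gini_impurity is chosen here for illustration and is not part of any library.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(["a", "a", "a", "a"]))  # pure node: 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # evenly mixed binary node: 0.5
```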

Entropy

Entropy originates from information theory and measures the degree of uncertainty or randomness in a dataset.
Formula: \(H=-\sum p_{i}\log _{2}(p_{i})\).
Range: From 0 to 1 for binary classification (in general, up to \(\log _{2}k\) for \(k\) classes).
0: No uncertainty (pure node).
1: Maximum uncertainty in the binary case (perfectly mixed classes).
Split Impact: It is used to calculate Information Gain, which is the reduction in entropy after a split. Entropy is somewhat more sensitive to small changes in class probabilities than Gini and can produce more balanced splits, though it is slower to compute due to the \(\log \) function; in practice the two criteria usually yield similar trees. It is primarily used in the ID3 and C4.5 algorithms.
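The entropy formula can be checked the same way. The sketch below uses only the standard library; the function name entropy is an illustrative choice.

```python
import math
from collections import Counter


def entropy(labels):
    """Return the Shannon entropy (in bits) of the class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [count / n for count in Counter(labels).values()]
    # Only classes that occur are counted, so p > 0 and log2(p) is defined.
    return sum(-p * math.log2(p) for p in probs)


print(entropy(["a", "a", "a", "a"]))  # pure node: 0.0
print(entropy(["a", "a", "b", "b"]))  # evenly mixed binary node: 1.0
print(entropy(["a", "b", "c", "d"]))  # four equally likely classes: 2.0
```

The last call shows why the 0-to-1 range only applies to two classes: with four equally likely classes the entropy is \(\log _{2}4=2\).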


Q3:  What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.


-> Pre-Pruning (Early Stopping)

Pre-pruning halts the growth of a decision tree during the construction phase. It stops further splitting of a node if it does not meet a predefined threshold or criteria.
Common Criteria: Setting a maximum depth (max_depth), a minimum number of samples required to split a node (min_samples_split), or a minimum impurity decrease.
Practical Advantage: Computational Efficiency. Since the algorithm stops building the tree early, it saves significant time and memory, making it ideal for extremely large datasets where building a full tree would be resource-intensive.

Post-Pruning (Backward Pruning)

Post-pruning allows the decision tree to grow to its full depth (often perfectly classifying the training set) and then trims away non-significant branches.
Common Method: Cost Complexity Pruning, which uses a parameter (often called alpha) to penalize tree size relative to training error. The value of alpha is typically chosen by accuracy on a validation set or by cross-validation.
Practical Advantage: Higher Pruning Quality. It avoids the "horizon effect," where a split that seems non-significant on its own might lead to very valuable splits further down. Post-pruning ensures the model captures these deep, complex patterns before simplifying the structure.
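Post-pruning as described above can be sketched with scikit-learn's cost-complexity pruning. This is a minimal example on the Iris dataset, not a tuned workflow: it asks a fully grown tree for its candidate alpha values, then refits one pruned tree per alpha and scores it with 5-fold cross-validation. The choice of dataset, cv=5, and random_state=42 are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compute the candidate alpha values from a fully grown tree.
full_tree = DecisionTreeClassifier(random_state=42)
path = full_tree.cost_complexity_pruning_path(X, y)
# Guard against tiny negative values caused by floating-point error.
alphas = np.clip(path.ccp_alphas, 0.0, None)

results = []
for alpha in alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    cv_accuracy = cross_val_score(pruned, X, y, cv=5).mean()
    n_leaves = pruned.fit(X, y).get_n_leaves()
    results.append((alpha, n_leaves, cv_accuracy))
    print(f"alpha={alpha:.4f}  leaves={n_leaves}  CV accuracy={cv_accuracy:.4f}")
```

Larger alpha values remove more branches, so the leaf count shrinks as alpha grows. A reasonable choice is the largest alpha whose cross-validated accuracy is still close to the best one.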


Q4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?


-> In decision tree learning, Information Gain (IG) is a metric used to determine which feature (attribute) most effectively splits data into homogeneous groups. It quantifies the reduction in "disorder" or uncertainty after a dataset is partitioned based on a specific feature.

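The Formula: \(\text{Information Gain}=\text{Entropy(parent)}-\text{Weighted Average Entropy(children)}\)

In symbols, for a parent set \(S\) split into subsets \(S_{v}\): \(IG(S)=H(S)-\sum _{v}\frac{|S_{v}|}{|S|}H(S_{v})\), where \(H\) is the entropy.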

This calculation represents the difference between the uncertainty of the dataset before the split and the total uncertainty remaining in the resulting branches.
Why It Is Important for Choosing Splits

Information Gain serves as the "selection criterion" for building the tree. Its primary roles include:

Identifying the Best Split: The algorithm calculates the IG for every possible feature and threshold. The feature with the highest Information Gain is chosen as the splitting point for that node because it reduces uncertainty the most.
Maximizing Purity: By selecting features that maximize gain, the algorithm ensures that each split creates "purer" child nodes where the data points belong more consistently to a single class.
Driving Tree Structure: It dictates the order of decision nodes, typically placing the most "informative" features (those that clarify the target label fastest) near the root of the tree.
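To make the comparison between candidate splits concrete, the following sketch computes Information Gain for two hand-made splits of the same parent node. The labels and the function names are illustrative assumptions, not taken from a real dataset or library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes completely.
split_a = [["yes"] * 5, ["no"] * 5]
# Split B leaves both children as mixed as the parent.
split_b = [["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3]

print(f"IG of split A: {information_gain(parent, split_a):.4f}")
print(f"IG of split B: {information_gain(parent, split_b):.4f}")
```

Split A removes all uncertainty and gets the maximum possible gain for this parent, while split B gains nothing, so a tree builder would choose split A.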



Q5:  What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?


-> Common Real-World Applications

Finance: Banks use them for credit scoring to determine loan eligibility based on income and credit history, as well as for fraud detection by identifying anomalous transaction patterns.
Healthcare: Medical professionals apply decision trees for disease diagnosis (e.g., predicting diabetes or heart disease) and creating personalized treatment plans based on patient symptoms and medical history.
Marketing & Retail: Companies use them for customer segmentation to target specific demographics and churn prediction to identify customers likely to stop using a service. Retailers also use them for inventory management to predict sales trends.
Manufacturing: They are used for quality control to identify production conditions that lead to defects and for equipment failure prediction.

Main Advantages
Interpretability: They provide a "white box" model that is easy for non-experts to understand and visualize as a flowchart.
Minimal Data Preparation: They do not require feature scaling or normalization, and some implementations handle categorical data directly (scikit-learn's trees still need categorical features encoded as numbers).
Handling Missing Values: Some decision tree algorithms can handle missing values natively, and threshold-based splits are relatively robust to outliers in the features.
Versatility: They are applicable to both classification (discrete outcomes) and regression (continuous values) tasks.


Q6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances


In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data  # Features: sepal/petal length and width
y = iris.target  # Labels: Setosa, Versicolor, Virginica

# Split data: 70% for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train a Decision Tree Classifier using the Gini criterion
# Note: 'gini' is the default criterion for DecisionTreeClassifier in scikit-learn
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# 3. Predict and print model accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# 4. Print feature importances
print("\nFeature Importances:")
feature_importance_df = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': clf.feature_importances_
}).sort_values(by='Importance', ascending=False)

print(feature_importance_df)


Model Accuracy: 1.0000

Feature Importances:
             Feature  Importance
2  petal length (cm)    0.893264
3   petal width (cm)    0.087626
1   sepal width (cm)    0.019110
0  sepal length (cm)    0.000000


Q7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train a Decision Tree with max_depth=3 (Pre-pruning)
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

# 3. Train a fully-grown Decision Tree (No depth limit)
clf_full = DecisionTreeClassifier(max_depth=None, random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

# 4. Compare Accuracy and Depth
print("--- Depth-Limited Tree (max_depth=3) ---")
print(f"Test Accuracy: {acc_limited:.4f}")
print(f"Actual Tree Depth: {clf_limited.get_depth()}")

print("\n--- Fully-Grown Tree (Unlimited) ---")
print(f"Test Accuracy: {acc_full:.4f}")
print(f"Actual Tree Depth: {clf_full.get_depth()}")


--- Depth-Limited Tree (max_depth=3) ---
Test Accuracy: 1.0000
Actual Tree Depth: 3

--- Fully-Grown Tree (Unlimited) ---
Test Accuracy: 1.0000
Actual Tree Depth: 6


Q8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importance

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# 1. Load the California Housing dataset
# Note: fetch_california_housing downloads the data on first use, so it needs network access
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train a Decision Tree Regressor
# We use a max_depth to prevent extreme overfitting, a common issue with trees
regressor = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor.fit(X_train, y_train)

# 3. Predict and Print Mean Squared Error (MSE)
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

# 4. Print Feature Importances
print("\nFeature Importances:")
feature_importance_df = pd.DataFrame({
    'Feature': housing.feature_names,
    'Importance': regressor.feature_importances_
}).sort_values(by='Importance', ascending=False)

print(feature_importance_df)


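HTTPError: HTTP Error 403: Forbidden

The cell failed because fetch_california_housing could not download the dataset (the server rejected the request), so no MSE or feature importances were produced in this run.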

Q9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Define the Parameter Grid
# max_depth: controls the maximum levels of the tree
# min_samples_split: minimum number of samples needed to split a node
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 5, 10, 20]
}

# 3. Initialize GridSearchCV
# cv=5 means 5-fold cross-validation is used during the search
dtree = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(estimator=dtree, param_grid=param_grid, cv=5, scoring='accuracy')

# 4. Tune the Model
grid_search.fit(X_train, y_train)

# 5. Extract results
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# 6. Predict and print resulting accuracy
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Best Parameters: {best_params}")
print(f"Final Model Test Accuracy: {accuracy:.4f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Final Model Test Accuracy: 1.0000


Q10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

-> 1. Handle Missing Values

Medical datasets often contain missing values due to skipped tests or recording errors.
Identify the Mechanism: Determine whether data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR), since the mechanism determines which imputation strategies are safe.
Imputation Strategies:

Statistical Imputation: Replace missing numeric values with the median (robust to outliers) and categorical values with the mode (most frequent class).

Advanced Imputation: Use a secondary model like KNN Imputation or an SVM regressor to predict missing values based on other patient features.

Algorithm Support: Some tree-based implementations handle missing values natively. CART can use "surrogate splits," and XGBoost learns a default branch for missing values at each split.

2. Encode Categorical Features
Most machine learning implementations, including scikit-learn's decision trees, require numerical input.
One-Hot Encoding: Use for nominal features without order (e.g., "Blood Type" or "Gender").

Ordinal Encoding: Use for features with a natural ranking (e.g., "Pain Severity: Low, Medium, High").
Target Encoding: Useful for high-cardinality features (e.g., "Zip Code") to prevent sparse data, though it requires care to avoid leakage.
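The imputation and encoding steps above can be combined in one preprocessing object. The sketch below uses scikit-learn's ColumnTransformer on a tiny made-up table; the column names (age, blood_type, pain_severity) and all values are hypothetical and stand in for real patient data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical patient records; column names and values are made up.
patients = pd.DataFrame({
    "age": [54, np.nan, 37, 61],
    "blood_type": ["A", "O", np.nan, "B"],
    "pain_severity": ["Low", "High", "Medium", np.nan],
})

# Numeric column: fill gaps with the median.
numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])

# Nominal column: fill gaps with the mode, then one-hot encode.
nominal = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Ordinal column: fill gaps with the mode, then map to 0 < 1 < 2.
ordinal = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(categories=[["Low", "Medium", "High"]])),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("nom", nominal, ["blood_type"]),
    ("ord", ordinal, ["pain_severity"]),
])

X = preprocess.fit_transform(patients)
print(X.shape)  # one age column, three blood-type columns, one severity column
```

In a real project the preprocessing object would be fitted on the training split only and then placed in a Pipeline in front of the classifier, so that the same transformations are applied at prediction time.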

3. Train the Decision Tree Model
Data Split: Divide your data into training (e.g., 80%) and testing (20%) sets, and fit imputers and encoders on the training set only so that no information leaks from the test set.
Criterion Selection: Use Gini Impurity for computational efficiency or Entropy if you want splits chosen by Information Gain. In practice the two usually produce similar trees.

Training: Fit the DecisionTreeClassifier on the processed training set.

4. Tune Hyperparameters
Tuning prevents the tree from "memorizing" specific patients (overfitting) and ensures it learns general medical patterns.

Grid Search (GridSearchCV): Exhaustively test combinations of parameters like max_depth (tree height), min_samples_split (minimum patients needed to split a node), and min_samples_leaf.

Pruning: Use Cost Complexity Pruning to trim branches that do not significantly improve the model’s predictive power on validation data.

5. Evaluate Performance
Accuracy alone is insufficient in healthcare; the cost of a false negative (missing a disease) is often much higher than that of a false positive.

Primary Metrics: Focus on Recall (Sensitivity) to ensure most diseased patients are caught and Precision to avoid unnecessary treatments.

F1-Score: The harmonic mean of precision and recall, providing a balanced view of model quality.

ROC-AUC: Measures the model's ability to distinguish between diseased and healthy patients regardless of the decision threshold.
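The metrics above are all available in scikit-learn. The sketch below computes them for a decision tree on synthetic, imbalanced data generated with make_classification, which stands in for a real patient dataset. The sample size, class ratio, and max_depth=5 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data: roughly 10% positive ("diseased") cases.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                 # hard class predictions
y_score = clf.predict_proba(X_test)[:, 1]    # probability of the positive class

metrics = {
    "Recall": recall_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred),
    "F1-Score": f1_score(y_test, y_pred),
    "ROC-AUC": roc_auc_score(y_test, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that ROC-AUC is computed from the predicted probabilities rather than the hard predictions, because it summarizes performance across all decision thresholds.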

Real-World Business Value
Clinical Decision Support: Acts as a "second opinion" for doctors, flagging high-risk patients for early intervention.

Operational Efficiency: Automates the initial screening process, allowing healthcare providers to prioritize critical cases and reduce wait times.

Cost Reduction: Identifies patients who do not need expensive, invasive diagnostic tests, saving resources for both hospitals and insurance providers.

Personalized Care: Enables data-driven treatment plans tailored to a patient's specific symptoms and history.