Assignment Code: DA-AG-012

Decision Tree | Assignment

Question 1: What is a Decision Tree, and how does it work in the context of classification?

Answer: A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In the context of classification, its goal is to predict the class label of a given data point.
It works by splitting the dataset into subsets based on the value of input features. This process is repeated recursively, forming a tree-like structure. The main components of a decision tree are:
Root Node: The topmost node representing the entire dataset.
Internal Nodes: Nodes that represent a decision (or test) on a specific feature, splitting the data into branches.
Leaf Nodes: The terminal nodes that represent the final class label or outcome.
The algorithm selects the feature that best separates the data into classes (e.g., using Information Gain or Gini Impurity) at each node. It starts from the root, asks a question about a feature, and follows the branch corresponding to the answer. This process continues until it reaches a leaf node, which provides the final classification.
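
The root-to-leaf traversal described above can be sketched with a small hand-built tree. This is only an illustration: the feature names, thresholds, and the `toy_tree` / `predict` names are hypothetical and were not learned from any data.

```python
# A hand-built toy tree (hypothetical thresholds, not learned from data).
# Internal nodes test one feature against a threshold; leaves hold a class label.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch each test selects."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

A trained tree works the same way at prediction time; the difference is that the features and thresholds are chosen by the impurity criterion during training.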

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Answer: Gini Impurity and Entropy are metrics used to measure the "impurity" or "disorder" of a dataset. A node is "pure" (Gini=0, Entropy=0) if all its data points belong to a single class. The goal of a Decision Tree is to create splits that maximize the purity of the resulting child nodes.
Gini Impurity: Measures the probability of incorrectly classifying a randomly chosen element from the node if it were randomly labeled according to the class distribution in the node. It is calculated as:
$$\mathrm{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
where $p_i$ is the probability of an object being classified to class $i$, and $C$ is the number of classes.
Entropy: Measures the average amount of "information" or "surprise" inherent in the node's possible outcomes. A higher entropy means more disorder. It is calculated as:
$$\mathrm{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i)$$
Impact on Splits: The Decision Tree algorithm evaluates all candidate splits on all features. For each potential split, it calculates the weighted average impurity (Gini or Entropy) of the resulting child nodes, where each child is weighted by its share of the parent's samples. The split that results in the largest reduction in impurity (i.e., the greatest increase in purity) is chosen. This reduction is formally called Information Gain when using Entropy. In practice both metrics usually lead to very similar trees. Gini is slightly cheaper to compute because it avoids the logarithm, while Entropy is sometimes reported to produce slightly more balanced trees.
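
As a rough worked example of the two measures and of the weighted child impurity used to compare splits, the sketch below uses only the standard library. The helper names (`gini`, `entropy`, `weighted_impurity`) and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def weighted_impurity(children, impurity):
    """Average child impurity, weighting each child by its share of the samples."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * impurity(child) for child in children)


# Toy node with a 50/50 class mix, and one candidate split of it.
parent = ["A"] * 5 + ["B"] * 5
left, right = ["A"] * 4 + ["B"], ["A"] + ["B"] * 4

print(round(gini(parent), 4))     # 0.5
print(round(entropy(parent), 4))  # 1.0
print(round(weighted_impurity([left, right], gini), 4))  # 0.32
```

The split lowers the Gini impurity from 0.5 to 0.32, so it would be preferred over any candidate split that leaves the weighted impurity higher.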

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Answer: Pruning is a technique used to prevent overfitting in Decision Trees by removing parts of the tree that provide little predictive power.
Pre-Pruning (Early Stopping): This involves stopping the tree-building process before it perfectly classifies the training data. This is done by setting constraints (hyperparameters) like max_depth, min_samples_split, and min_samples_leaf.
Practical Advantage: It is computationally more efficient because it avoids building an overly complex tree in the first place.
Post-Pruning: This involves allowing the tree to fully grow (even overfitting the training data) and then removing non-critical branches or nodes afterward. A common method is Cost Complexity Pruning (CCP).
Practical Advantage: It often results in a more accurate and robust tree because it doesn't rely on hard stop criteria during the building phase and can make more nuanced decisions about which branches to cut.
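
The two approaches can be compared side by side in scikit-learn. The sketch below is a minimal illustration on the Iris data, assuming the same 80-20 split used elsewhere in this assignment; the specific constraint values (`max_depth=3`, `min_samples_leaf=5`) are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: constrain growth up front with hyperparameters.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)
print("Pre-pruned depth:", pre_pruned.get_depth())

# Post-pruning: grow the full tree, then get the candidate alphas
# for cost complexity pruning from the training data.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit once per alpha; a larger ccp_alpha prunes more of the tree.
leaf_counts = []
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)  # guard against tiny negative values from rounding
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    pruned.fit(X_train, y_train)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"ccp_alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```

In practice the value of ccp_alpha would normally be chosen by cross-validation on the training data rather than by inspecting the test set.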

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer: Information Gain (IG) measures how effectively a feature separates the data into classes. It quantifies the reduction in entropy after a dataset is split on that feature; the same calculation can be done with Gini impurity, where it is usually described as a reduction in impurity.
It is calculated as:
$$IG(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v)$$
Where:
$S$ is the original dataset.
$A$ is the feature to split on.
$\mathrm{Values}(A)$ are the possible values of feature $A$.
$S_v$ is the subset of $S$ where feature $A$ has value $v$.
Importance: Information Gain provides a quantifiable criterion for selecting the best split at each node. The algorithm computes the Information Gain for every candidate split on every feature and chooses the split with the highest value. This greedy approach tends to place the most informative features (those that create the purest child nodes) higher up in the tree. Because each split is chosen locally, the resulting tree is not guaranteed to be globally optimal.
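
The formula can be checked on a small example. The sketch below computes IG(S, A) for two categorical features using only the standard library; the function names and the six-row toy dataset (`outlook`, `windy`, `play`) are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy(S) minus the size-weighted entropy of the subsets S_v."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    n = len(labels)
    remainder = sum(len(subset) / n * entropy(subset) for subset in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: six days, two candidate features, one target.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
windy = ["yes", "no", "yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes", "yes", "yes"]

# Every outlook subset is pure, so the gain equals the parent entropy.
print(round(information_gain(outlook, play), 4))  # 0.9183

# Both windy subsets keep the parent's class mix, so the gain is about zero.
print(abs(information_gain(windy, play)) < 1e-9)  # True
```

A tree built on this data would therefore split on outlook first, since it has the higher Information Gain.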

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

 Answer: Common Real-World Applications:
Finance: Credit scoring and loan approval.
Healthcare: Medical diagnosis (e.g., identifying diseases from symptoms).
Marketing: Customer segmentation and predicting churn.
Manufacturing: Quality control and fault detection.
Main Advantages:
Interpretability: The model is white-box and easy to understand and visualize, even for non-experts.
Little Data Preprocessing: Requires little data scaling or normalization.
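Handles Mixed Data: Can work with both numerical and categorical data in principle, although some implementations (including scikit-learn) require categorical features to be encoded as numbers first.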
Main Limitations:
Overfitting: They can easily overfit the training data if not pruned properly, capturing noise as patterns.
Instability: Small changes in the data can lead to the generation of a completely different tree.
Bias: They can be biased towards features with more levels and may create biased trees if some classes are dominant.

Question 6: Write a Python program to load the Iris Dataset, train a Decision Tree Classifier using the Gini criterion, and print the model's accuracy and feature importances.



Answer:



In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data  # Feature matrix
y = iris.target  # Target vector

# Split the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Decision Tree Classifier with Gini criterion
dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, dt_classifier.feature_importances_):
    print(f"  {feature_name}: {importance:.4f}")

Model Accuracy: 1.0000

Feature Importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0167
  petal length (cm): 0.9061
  petal width (cm): 0.0772


Question 7: Write a Python program to load the Iris Dataset, train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

Answer:

In [2]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Train a Decision Tree with max_depth=3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
y_pred_pruned = pruned_tree.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# 2. Train a fully-grown Decision Tree (no restrictions)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print the comparison
print("Accuracy Comparison:")
print(f"Pruned Tree (max_depth=3): {accuracy_pruned:.4f}")
print(f"Fully-Grown Tree: {accuracy_full:.4f}")

Accuracy Comparison:
Pruned Tree (max_depth=3): 1.0000
Fully-Grown Tree: 1.0000


Question 8: Write a Python program to load the Boston Housing Dataset, train a Decision Tree Regressor, and print the Mean Squared Error (MSE) and feature importances.

Answer:

In [3]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the Boston Housing Dataset from a CSV mirror
# (load_boston was removed from scikit-learn in version 1.2)
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data = pd.read_csv(url)

# Separate features and target variable
X = data.drop('medv', axis=1)
y = data['medv']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)

# Make predictions
y_pred = dt_regressor.predict(X_test)

# Calculate and print the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(X.columns, dt_regressor.feature_importances_):
    print(f"  {feature_name}: {importance:.4f}")

Mean Squared Error (MSE): 10.4161

Feature Importances:
  crim: 0.0513
  zn: 0.0034
  indus: 0.0058
  chas: 0.0000
  nox: 0.0271
  rm: 0.6003
  age: 0.0136
  dis: 0.0707
  rad: 0.0019
  tax: 0.0125
  ptratio: 0.0110
  b: 0.0090
  lstat: 0.1933


Question 9: Write a Python program to load the Iris Dataset, tune the Decision Tree's max_depth and min_samples_split using GridSearchCV, and print the best parameters and the resulting model accuracy.

Answer:

In [4]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the classifier
dt = DecisionTreeClassifier(random_state=42)

# Define the hyperparameter grid to search
param_grid = {
    'max_depth': [2, 3, 4, 5, None],  # None means no limit
    'min_samples_split': [2, 3, 4, 5]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print("Best Hyperparameters:", grid_search.best_params_)

# Get the best model
best_dt_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred_best = best_dt_model.predict(X_test)
best_accuracy = accuracy_score(y_test, y_pred_best)

print(f"Best Model Accuracy on Test Set: {best_accuracy:.4f}")

Best Hyperparameters: {'max_depth': 4, 'min_samples_split': 2}
Best Model Accuracy on Test Set: 1.0000


Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process.

Answer:

Step-by-Step Process:

Handle Missing Values:

Numerical Features: Impute missing values with the mean, median, or mode of the column. For a more sophisticated approach, use model-based imputation (e.g., K-Nearest Neighbors).

Categorical Features: Impute with the mode (most frequent category) or create a new category like "Unknown."

Encode Categorical Features:

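Ordinal Features: Use ordinal encoding with an explicit category order if the categories have a natural order (e.g., "low," "medium," "high"). In scikit-learn this is OrdinalEncoder; LabelEncoder is intended for the target column rather than for input features.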

Nominal Features: Use One-Hot Encoding for categories without a natural order (e.g., "city A," "city B," "city C").

Train a Decision Tree Model:

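Split the data into training and testing sets (e.g., 80-20 split), stratifying on the target if the disease is rare. Fit the imputers and encoders on the training set only and then apply them to the test set, so that no information from the test set leaks into preprocessing.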

Initialize a DecisionTreeClassifier.

Train the model on the training data using the fit method.

Tune its Hyperparameters:

Use techniques like GridSearchCV or RandomizedSearchCV to find the optimal combination of hyperparameters.

Key parameters to tune include max_depth, min_samples_split, min_samples_leaf, and criterion (gini or entropy). This step is crucial to prevent overfitting and improve generalization.
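
The preprocessing, training, and tuning steps above can be combined in a single scikit-learn pipeline, which also keeps imputation and encoding inside each cross-validation fold. The sketch below is only an illustration: the data is synthetic, and every column name (`age`, `blood_pressure`, `severity`, `region`) and the rule generating the target are assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (all columns are hypothetical).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "severity": rng.choice(["low", "medium", "high"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
y = ((df["age"] > 60) & (df["severity"] == "high")).astype(int)

# Introduce missing values in one numerical and one categorical column.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "region"] = np.nan

preprocess = ColumnTransformer([
    # Numerical features: median imputation.
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    # Ordinal feature: mode imputation, then encoding with an explicit order.
    ("ord", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[["low", "medium", "high"]])),
    ]), ["severity"]),
    # Nominal feature: "Unknown" category, then one-hot encoding.
    ("nom", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["region"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# F1 is used for scoring because the positive class is the minority here.
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("Best Hyperparameters:", search.best_params_)
test_f1 = search.score(X_test, y_test)
print(f"F1 on Test Set: {test_f1:.4f}")
```

Because the imputers and encoders live inside the pipeline, GridSearchCV refits them on the training portion of each fold, which avoids leaking validation data into preprocessing.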

Evaluate its Performance:

Use the held-out test set to make predictions with the tuned model.

Evaluate using metrics appropriate for classification:

Accuracy: Overall correctness.

Precision & Recall: Especially important if the disease is rare (class imbalance).

F1-Score: Harmonic mean of precision and recall.

ROC-AUC Score: Measures the model's ability to distinguish between classes.

Analyze the confusion matrix for detailed insight.
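
To show why accuracy alone can mislead when the disease is rare, the sketch below derives the metrics listed above from confusion-matrix counts. The labels and predictions are invented for the example (10 sick patients out of 100), and `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


# Invented example: 90 healthy patients (0) followed by 10 sick patients (1).
y_true = [0] * 90 + [1] * 10
# The model flags 2 healthy patients by mistake and misses 4 sick patients.
y_pred = [0] * 88 + [1] * 2 + [1] * 6 + [0] * 4

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=6 FP=2 FN=4 TN=88
print(f"Accuracy:  {accuracy:.2f}")   # 0.94
print(f"Precision: {precision:.2f}")  # 0.75
print(f"Recall:    {recall:.2f}")     # 0.60
print(f"F1-Score:  {f1:.2f}")         # 0.67
```

Here the accuracy of 0.94 looks strong, yet the recall of 0.60 shows that 4 of the 10 sick patients were missed, which is usually the more costly error in a screening setting.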

Business Value:
This predictive model could provide immense business value by enabling early and accurate disease detection. This allows healthcare providers to:

Intervene Sooner: Initiate treatment plans earlier, potentially leading to better patient outcomes and lower treatment costs.

Optimize Resources: Efficiently allocate medical resources (like specialist time and diagnostic tests) to high-risk patients.

Develop Proactive Care: Shift from a reactive to a proactive healthcare model, focusing on prevention and early intervention, thereby improving overall population health and reducing long-term healthcare expenditures.