In [None]:
Question 1: What is a Decision Tree, and how does it work in the context of
classification?

A Decision Tree is a type of supervised machine learning algorithm used for classification and regression tasks. In the context of classification, it works by learning decision rules from the features of the data to predict the class labels of new instances.

 How It Works (Classification Context):
Tree Structure:

It is structured like a flowchart:

Root Node: Represents the entire dataset and starts the splitting.

Internal Nodes: Represent a decision based on a feature.

Leaf Nodes: Represent the final class label or outcome.

Splitting the Data:

At each node, the algorithm chooses the best feature and threshold to split the data based on a criterion such as:

Gini Impurity

Entropy / Information Gain

The goal is to create groups that are as pure as possible (mostly containing instances of a single class).
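The split search described above can be sketched for a single numeric feature. This is a minimal illustration, not how any particular library implements it: the helper names `gini` and `best_threshold` and the toy `ages`/`bought` data are hypothetical, and candidate thresholds are assumed to be midpoints between consecutive sorted values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # start from the unsplit node's impurity
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = ((lo + hi) / 2, weighted)
    return best


# Hypothetical toy data: age versus whether the person bought the product
ages = [22, 25, 28, 35, 40, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_threshold(ages, bought)
print(threshold, impurity)
```

A full tree builder would repeat this search over every feature and then recurse into the two resulting subsets.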

Recursive Partitioning:

The process continues recursively:

Each subset is split further until a stopping condition is met (e.g., maximum depth, minimum number of samples, or purity of nodes).

Classification:

To classify a new instance, the algorithm traverses the tree from the root to a leaf, making decisions based on the input features at each node.

The class label at the leaf node is the predicted class.


Suppose you're classifying whether a person will buy a product based on age and income.

          [Age < 30?]
             /    \
         Yes       No
        /            \
  [Income > 50K?]    [Buy=Yes]
     /      \
 Buy=No     Buy=Yes
You start at the root and follow the path based on feature values until you reach a decision.
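The traversal of this example tree can be written as plain nested conditions. This is a sketch under one assumption: the branches under the income node are unlabeled in the diagram, so income above 50K is taken to lead to Buy=Yes. The function name `predict_buy` is hypothetical.

```python
def predict_buy(age, income):
    """Walk the example tree from root to leaf and return the predicted label."""
    if age < 30:                # root node: Age < 30?
        if income > 50_000:     # internal node: Income > 50K? (branch labels assumed)
            return "Yes"
        return "No"
    return "Yes"                # leaf reached through the root's 'No' branch


print(predict_buy(25, 40_000))
print(predict_buy(25, 60_000))
print(predict_buy(45, 20_000))
```

Each `if` corresponds to one internal node, and each `return` corresponds to one leaf.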
 Pros:
Easy to interpret and visualize

Handles both numerical and categorical data

Requires little data preprocessing

In [None]:
Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

 Gini Impurity and Entropy are two commonly used impurity measures in decision trees. They help the algorithm decide how to split the data at each node by quantifying how "pure" or "impure" a set of class labels is.
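Gini Impurity measures the probability that a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution in the subset.
Entropy measures the average uncertainty (in bits) about the class of a sample drawn from the subset. It is 0 when every sample belongs to one class and highest when the classes are evenly mixed.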
 How They Impact Splits in a Decision Tree
At each node, the decision tree algorithm:

Calculates Gini Impurity or Entropy for all possible splits on all features.

Selects the split that results in the largest decrease in impurity (called Information Gain when using Entropy).

The goal is to make child nodes as pure as possible. In practice the two criteria usually choose similar splits, with small differences:

Gini tends to isolate the most frequent class in its own branch

Entropy tends to produce slightly more balanced trees
| Criterion | Gini Impurity                        | Entropy (Information Gain)           |
| --------- | ------------------------------------ | ------------------------------------ |
| Formula   | $1 - \sum p_i^2$                     | $-\sum p_i \log_2(p_i)$              |
| Range     | 0 (pure) to 0.5 (binary, 50/50 mix)  | 0 (pure) to 1 (binary, 50/50 mix)    |
| Speed     | Faster (no log computation)          | Slightly slower                      |
| Behavior  | Tends to isolate the most frequent class | Tends to yield slightly more balanced trees |
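The two formulas in the table can be checked numerically. The sketch below assumes class probabilities are passed in directly; the helper names `gini_impurity` and `entropy` are illustrative, not library functions.

```python
import math


def gini_impurity(probs):
    """1 - sum(p_i^2) for a list of class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """-sum(p_i * log2(p_i)), written as sum(p_i * log2(1 / p_i)); zero terms are skipped."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


# A pure node, a mostly pure node, and an evenly mixed binary node
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, round(gini_impurity(probs), 3), round(entropy(probs), 3))
```

Both measures are 0 for the pure node and reach their binary maximum (0.5 for Gini, 1 for entropy) at the 50/50 mix.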




In [None]:
Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
Pre-pruning and Post-pruning are two techniques used to prevent overfitting in decision trees by limiting their size or complexity.

1. Pre-Pruning (a.k.a. Early Stopping)
 Definition:
Pre-pruning stops the tree from growing once a certain condition is met before the tree reaches full depth.

 How It Works:
During the tree-building process, the algorithm checks:

Maximum tree depth

Minimum number of samples required to split

Minimum gain in impurity

Maximum number of leaf nodes

If the condition is met, the split is not made, and the node becomes a leaf.

 Practical Advantage:
Faster training: because growth stops early, fewer splits are evaluated, which makes training quicker and cheaper on large datasets.

2. Post-Pruning (e.g., Cost-Complexity Pruning)
 Definition:
Post-pruning allows the tree to grow fully, then trims it back by removing branches that do not improve performance on a validation set.

 How It Works:
After the tree is grown:

Evaluate subtrees using a validation set or cross-validation

Remove branches that have little or no impact on accuracy

This simplifies the model while preserving performance

 Practical Advantage:
Better generalization: because branches are removed based on measured performance rather than fixed thresholds, post-pruning often results in more accurate models on unseen data.

Summary Table
| Feature       | Pre-Pruning                  | Post-Pruning                                        |
| ------------- | ---------------------------- | --------------------------------------------------- |
| When Applied  | During tree growth           | After full tree is built                            |
| Basis         | Heuristics/thresholds        | Validation set or error analysis                    |
| Speed         | Faster training              | Slower training                                     |
| Accuracy      | Might stop too early         | Usually better generalization                       |
| Example Param | max_depth, min_samples_split | Cost-complexity pruning (ccp_alpha in scikit-learn) |
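Both approaches from the table can be tried side by side in scikit-learn. This is a sketch, assuming the Iris data and an arbitrarily chosen `ccp_alpha` taken from the middle of the pruning path; in practice the value would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: grow the tree without restrictions
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: growth is limited while the tree is being built
pre_pruned = DecisionTreeClassifier(
    max_depth=2, min_samples_split=10, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path, then refit with one alpha
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary illustrative choice
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(
    X_train, y_train
)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", tree.get_n_leaves(), "test accuracy:", tree.score(X_test, y_test))
```

Comparing the leaf counts shows how each technique shrinks the tree relative to the unrestricted baseline.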

In [None]:
Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
Information Gain (IG) measures the reduction in entropy (or impurity) achieved by splitting a dataset based on a particular feature. It tells us how much "information" a feature gives us about the class label.

In simple terms:

Information Gain = Entropy (before split) – Weighted Entropy (after split)
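
The same relation can be written symbolically. The notation here is introduced for illustration only: $S$ is the dataset at the node, $A$ the feature being split on, $S_v$ the subset that falls into branch $v$, and $p_i$ the fraction of samples in class $i$.

$$
IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v),
\qquad
H(S) = -\sum_i p_i \log_2(p_i)
$$

The weights $|S_v| / |S|$ make larger child nodes count proportionally more toward the post-split entropy.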



Why Is Information Gain Important?
 It helps choose the best feature to split on at each step.
The feature with the highest Information Gain is chosen because it provides the most significant reduction in uncertainty.

This leads to purer child nodes and helps the tree learn faster and more accurately.

 Example:
Imagine you're classifying if someone will buy a product based on "Age":

Before Split:
Mixed classes (Entropy = 0.94)

After Splitting on "Age":
One group has mostly "Yes" (low entropy)

Another group has mostly "No" (low entropy)

Then:
Information Gain is high, so "Age" is a good feature for splitting.

 Low Information Gain?
If a split doesn't reduce entropy much (i.e., child nodes are still mixed), then the Information Gain is low, and the feature is not helpful for classification.
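
The high-gain and low-gain cases can be compared with a short calculation. The class counts below are hypothetical, chosen so that the parent entropy is about 0.94 as in the example; `entropy` and `information_gain` are illustrative helper names.

```python
import math


def entropy(counts):
    """Entropy in bits of a node given its per-class sample counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted


parent = [9, 5]                   # hypothetical: 9 "Yes" and 5 "No" before the split
good_split = [[8, 1], [1, 4]]     # one mostly-"Yes" group, one mostly-"No" group
poor_split = [[5, 3], [4, 2]]     # both groups remain mixed

print("Parent entropy:", round(entropy(parent), 2))
print("Gain of good split:", round(information_gain(parent, good_split), 2))
print("Gain of poor split:", round(information_gain(parent, poor_split), 2))
```

The split that separates the classes yields a much larger gain than the one that leaves both children mixed.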

| Concept              | Description                                |
| -------------------- | ------------------------------------------ |
| **Information Gain** | Reduction in entropy after a dataset split |
| **Used For**         | Selecting the best feature to split a node |
| **Goal**             | Maximize Information Gain → Improve purity |


In [None]:
Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
Medical Diagnosis

Used to diagnose diseases based on symptoms and patient history.

Example: Predicting whether a tumor is malignant or benign.

Credit Scoring and Risk Assessment

Banks and financial institutions use decision trees to assess loan eligibility and creditworthiness.

Customer Relationship Management (CRM)

Helps identify potential customer churn, segment customers, and predict customer lifetime value.

Fraud Detection

Used to detect unusual patterns in transactions that may indicate fraud.

Marketing and Sales

Helps in targeting customers by predicting their response to marketing campaigns.

Manufacturing and Quality Control

Used for decision-making in production processes, predicting equipment failures, and quality assurance.

Human Resources

Assists in employee performance evaluation and predicting employee attrition.

Main Advantages of Decision Trees:

Easy to Understand and Interpret: The tree structure is visual and intuitive.

Requires Little Data Preparation: No need for feature scaling or normalization.

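Handles Both Numerical and Categorical Data: The algorithm itself supports both, although scikit-learn's implementation requires categorical features to be encoded as numbers first.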

Non-parametric: Makes no assumptions about the data distribution.

Works Well with Large Datasets: Training and prediction are fast, and the same approach covers both classification and regression problems.

Main Limitations of Decision Trees:

Prone to Overfitting: Especially with deep trees and noisy data.

Unstable: Small changes in data can lead to a completely different tree.

Biased with Imbalanced Datasets: May favor classes with more samples.

Can Be Less Accurate Than Ensemble Methods: Like Random Forests or Gradient Boosted Trees.

Greedy Algorithms: Decision trees use a greedy approach that may not lead to the globally optimal tree.
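
The overfitting limitation can be observed directly. The sketch below assumes a synthetic dataset from `make_classification` with 20% label noise (`flip_y=0.2`); the sample sizes and parameters are illustrative choices, not taken from any real application.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (hypothetical)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in (("unrestricted", deep), ("max_depth=3", shallow)):
    print(
        name,
        "train accuracy:", round(model.score(X_train, y_train), 3),
        "test accuracy:", round(model.score(X_test, y_test), 3),
    )
```

The unrestricted tree memorizes the noisy training labels, so the gap between its training and test accuracy is the symptom to look for.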




In [None]:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).
Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

In [1]:
# Install the dependency first if needed; in a notebook cell run: %pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Model Accuracy: {:.2f}%".format(accuracy * 100))
print("Feature Importances:")

for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"- {feature}: {importance:.4f}")

Model Accuracy: 100.00%
Feature Importances:
- sepal length (cm): 0.0000
- sepal width (cm): 0.0000
- petal length (cm): 0.6667
- petal width (cm): 0.3333



In [None]:
Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fully-grown tree (no depth limit)
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

# Tree with max_depth=3
clf_depth3 = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)
y_pred_depth3 = clf_depth3.predict(X_test)
acc_depth3 = accuracy_score(y_test, y_pred_depth3)

# Print the results
print("Accuracy of Fully-grown Tree     : {:.2f}%".format(acc_full * 100))
print("Accuracy of Tree with max_depth=3: {:.2f}%".format(acc_depth3 * 100))


In [None]:
Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

In [None]:
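from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Boston Housing dataset.
# load_boston was removed in scikit-learn 1.2, so fall back to the OpenML copy
# there (assumed equivalent; needs network access and pandas).
try:
    from sklearn.datasets import load_boston
    boston = load_boston()
except ImportError:
    from sklearn.datasets import fetch_openml
    boston = fetch_openml(name="boston", version=1, as_frame=True)
    boston.data = boston.data.astype(float)      # categorical columns -> numeric
    boston.target = boston.target.astype(float)
X = boston.data
y = boston.target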

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict and evaluate
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print results
print("Mean Squared Error (MSE): {:.2f}".format(mse))
print("Feature Importances:")

for feature, importance in zip(boston.feature_names, regressor.feature_importances_):
    print(f"- {feature}: {importance:.4f}")



In [None]:
Mean Squared Error (MSE): 10.35
Feature Importances:
- CRIM: 0.0000
- ZN: 0.0000
- INDUS: 0.0000
- CHAS: 0.0000
- NOX: 0.0235
- RM: 0.7041
- AGE: 0.0000
- DIS: 0.0000
- RAD: 0.0000
- TAX: 0.0160
- PTRATIO: 0.0164
- B: 0.0356
- LSTAT: 0.2044


In [None]:
Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy


In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8]
}

# Create Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Predict and evaluate
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output results
print("Best Parameters:", grid_search.best_params_)
print("Model Accuracy: {:.2f}%".format(accuracy * 100))


In [None]:
Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Model Accuracy: 100.00%


In [None]:
Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.



In [None]:
1. Handle Missing Values
Numerical Features:

Use mean or median imputation based on distribution.


from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='median')
Categorical Features:

Use most frequent (mode) or a placeholder like "Unknown".


cat_imputer = SimpleImputer(strategy='most_frequent')
Use ColumnTransformer or Pipeline to apply imputation separately to numerical and categorical columns.

2. Encode the Categorical Features
Use One-Hot Encoding for nominal categories:


from sklearn.preprocessing import OneHotEncoder
Use Ordinal Encoding if categories have order (e.g., low/medium/high):


from sklearn.preprocessing import OrdinalEncoder
Again, use ColumnTransformer to handle different columns appropriately.
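
Steps 1 and 2 can be combined into a single preprocessing object. This is a sketch on a hypothetical four-row patient table; the column names (`age`, `blood_pressure`, `smoker`, `severity`) and the category order are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical patient records with missing values in every column
df = pd.DataFrame({
    "age": [34, np.nan, 58, 45],
    "blood_pressure": [120.0, 135.0, np.nan, 128.0],
    "smoker": ["yes", "no", np.nan, "no"],
    "severity": ["low", "high", "medium", np.nan],
})

# Numerical columns: median imputation
numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])

# Nominal columns: mode imputation, then one-hot encoding
nominal = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Ordered columns: mode imputation, then ordinal encoding with an explicit order
ordinal = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(categories=[["low", "medium", "high"]])),
])

preprocessor = ColumnTransformer([
    ("num", numeric, ["age", "blood_pressure"]),
    ("nom", nominal, ["smoker"]),
    ("ord", ordinal, ["severity"]),
])

features = preprocessor.fit_transform(df)
print(features.shape)
```

Fitting the imputers and encoders inside one transformer keeps them from seeing test data when the object is later placed in a pipeline.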

3. Train a Decision Tree Model
After preprocessing, train the model:


from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
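Use a pipeline to chain preprocessing and model training, so the same steps are applied during cross-validation and prediction:


from sklearn.pipeline import Pipeline

# `preprocessor` is the ColumnTransformer described in steps 1 and 2
pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', DecisionTreeClassifier(random_state=42)),
])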
4. Tune Hyperparameters
Use GridSearchCV to tune:

max_depth

min_samples_split

min_samples_leaf


from sklearn.model_selection import GridSearchCV

param_grid = {
    'model__max_depth': [3, 5, 10, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 5],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
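
The grid search above relies on a `pipeline` object whose final step is named `model`. A self-contained version might look like the following sketch, which assumes synthetic numeric data with about 5% of values blanked out in place of the real patient dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (hypothetical)
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan  # inject roughly 5% missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Imputation runs inside the pipeline, so it is refit on each CV training fold
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", DecisionTreeClassifier(random_state=42)),
])

# The 'model__' prefix routes each parameter to the step named 'model'
param_grid = {
    "model__max_depth": [3, 5, 10, None],
    "model__min_samples_split": [2, 5, 10],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Test accuracy:", grid_search.score(X_test, y_test))
```

Tuning the whole pipeline rather than the bare model avoids leaking information from validation folds into the imputation step.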
5. Evaluate Model Performance
Use accuracy, precision, recall, and F1-score. Recall is especially important in healthcare, where a missed diagnosis (false negative) is usually costlier than a false alarm:


from sklearn.metrics import classification_report

y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
For imbalanced datasets, consider ROC-AUC and confusion matrix.
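
Both metrics are available in `sklearn.metrics`. The sketch below assumes a synthetic dataset with roughly 10% positive cases, and uses `class_weight="balanced"` as one possible way to handle the imbalance; neither choice comes from the original scenario.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: about 90% negative, 10% positive (hypothetical)
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
auc = roc_auc_score(y_test, y_score)

print(cm)
print("False negatives (missed cases):", fn)
print("ROC-AUC:", round(auc, 3))
```

ROC-AUC is computed from the predicted probabilities rather than the hard labels, so it reflects ranking quality across all decision thresholds.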

Business Value of the Model in Healthcare
Early Disease Detection

Supports doctors in identifying patients at risk even before severe symptoms appear.

Personalized Treatment Plans

Enables prioritization of patients based on predicted disease risk.

Cost Reduction

Reduces unnecessary lab tests and hospitalizations through accurate triage.

Operational Efficiency

Helps hospitals allocate resources more effectively by predicting patient needs.

Regulatory Reporting & Insights

Provides explainable and auditable results (important for compliance and trust).

