Question 1: What is a Decision Tree, and how does it work in the context of
classification?
>>Answer:
A Decision Tree is a supervised model that classifies a sample by asking a sequence of yes/no questions about its features, starting at the root and following one branch at a time.

Each node checks a feature (like “Is petal length > 2.5?”).

Each branch represents the answer (Yes or No).

Each leaf gives the final class (like Iris-setosa or Iris-versicolor).

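During training, the tree repeatedly splits the data on the feature and threshold that best separate the classes. It stops when a node contains a single class or a stopping rule (such as a maximum depth) is reached.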
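The node/branch/leaf structure described above can be inspected directly. The following is a minimal sketch, assuming scikit-learn is installed; the depth limit of 2 and `random_state=0` are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text prints each node's feature test and each leaf's predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the output is a node test (a question on one feature), and each `class:` line is a leaf.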

Question 2: What is Gini Impurity and Entropy? How do they affect splits?
>>Answer:
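Both Gini and Entropy measure how mixed the classes are in a node.

- Gini Impurity → Measures how often a randomly chosen sample from the node would be mislabeled if it were labeled at random according to the node's class proportions.

Formula: $G = 1 - \sum_i p_i^2$, where $p_i$ is the proportion of class $i$ in the node.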

- Entropy → Measures the uncertainty (randomness) of the class distribution in the node.

Formula: $H = -\sum_i p_i \log_2 p_i$

Effect:
The tree evaluates candidate splits and picks the one whose child nodes have the lowest weighted impurity (that is, the highest purity).

- Gini is slightly cheaper to compute because it avoids the logarithm.

- Entropy is based on information theory.
Both usually give similar results.
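To make the two formulas concrete, here is a small sketch that computes both measures from a list of class labels. The helper names `gini` and `entropy` are illustrative and not taken from any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Adding 0.0 turns the -0.0 produced for a pure node into 0.0.
    return float(-np.sum(p * np.log2(p))) + 0.0

pure = [0, 0, 0, 0]    # one class only
mixed = [0, 0, 1, 1]   # two classes, evenly split

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

A pure node scores 0 on both measures, while an even two-class split gives the two-class maximum of each (0.5 for Gini, 1 bit for entropy).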

Question 3: Difference between Pre-Pruning and Post-Pruning

>>Answer:

- Pre-Pruning: Stop the tree early (like setting max_depth=3).

Advantage: Saves training time and limits overfitting from the start.

- Post-Pruning: First grow a full tree, then remove unnecessary branches.

Advantage: Often generalizes better, because branches are removed only after checking that they do not help performance.
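The two strategies can be sketched in scikit-learn as follows. Pre-pruning uses `max_depth`; for post-pruning this sketch assumes cost-complexity pruning (`ccp_alpha`), which is one common technique and not necessarily the one the answer had in mind. The choice of the middle alpha is purely illustrative; in practice it would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: cap the depth before training starts.
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow a full tree, list the candidate alphas it implies,
# then refit with one of them so that weak subtrees are collapsed.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative choice only
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

print("Leaves (full tree):  ", full.get_n_leaves())
print("Leaves (pre-pruned): ", pre_pruned.get_n_leaves())
print("Leaves (post-pruned):", post_pruned.get_n_leaves())
```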

Question 4: What is Information Gain, and why is it important?

>>Answer:
Information Gain tells how much “impurity” is reduced after a split.

Information Gain = Impurity before the split − weighted average impurity of the child nodes after the split

The higher the gain, the better the split.
It helps the Decision Tree choose the best feature and condition for each node.
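As a worked example, the sketch below computes information gain with entropy as the impurity measure and weights each child by its share of the parent's samples. The helper names and the toy label lists are made up for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p))) + 0.0  # + 0.0 avoids printing -0.0

def information_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
perfect_split = [[0, 0, 0], [1, 1, 1]]   # each child is pure
poor_split = [[0, 0, 1], [0, 1, 1]]      # children are still mixed

print("Gain (perfect split):", information_gain(parent, perfect_split))
print("Gain (poor split):   ", round(information_gain(parent, poor_split), 4))
```

The perfect split recovers the full 1 bit of parent entropy, while the poor split gains only a small fraction of it, so the tree would prefer the first.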

Question 5: Real-world applications, advantages, and limitations

>>Answer:
- Applications:

Predicting diseases (healthcare)

Loan approval (banking)

Detecting fraud (finance)

Customer churn prediction (marketing)

- Advantages:

Easy to understand and visualize

Works with both numerical and categorical features (in scikit-learn, categories must first be encoded as numbers)

Needs little data preparation

- Limitations:

Can overfit if not pruned

Small changes in data can change the tree

Approximates smooth, continuous relationships poorly, because a single tree predicts in piecewise-constant steps (ensembles such as Random Forests handle this better)

Question 6: Write a Python program to:

- Load the Iris Dataset

- Train a Decision Tree Classifier using the Gini criterion

- Print the model’s accuracy and feature importances


In [1]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree using Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Model Accuracy:", accuracy)
print("Feature Importances:")
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


Question 7: Write a Python program to:

- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.


In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree with max_depth = 3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

# Fully-grown Decision Tree
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

# Print comparison
print("Accuracy (max_depth=3):", acc_limited)
print("Accuracy (fully-grown tree):", acc_full)


Accuracy (max_depth=3): 1.0
Accuracy (fully-grown tree): 1.0


Question 8: Write a Python program to:
- Load the California Housing dataset from sklearn
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances


In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load data
data = fetch_california_housing()
X = data.data
y = data.target

# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict and find error
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Feature Importances:")
for name, val in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {val:.4f}")


Mean Squared Error: 0.495235205629094
Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


Question 9: Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
- Print the best parameters and the resulting model accuracy

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set parameters to test
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5]
}

# Use GridSearchCV to find best parameters
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# Best parameters and accuracy
print("Best Parameters:", grid.best_params_)
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

>>Answer:
If we have patient data and want to predict disease:

- Handle missing values:

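Replace missing numeric values with the mean or median and missing categorical values with the most frequent value. Fit the imputer on the training data only, so that test-set statistics do not leak into the model.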

- Encode categorical data:

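Convert text values (like “Male”, “Female”) into numbers using One-Hot Encoding for unordered categories or Ordinal Encoding for ordered ones. (scikit-learn's LabelEncoder is intended for the target column, not for input features.)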

- Train the Decision Tree:

Split data → 80% training, 20% testing.

Use DecisionTreeClassifier from sklearn.

- Tune hyperparameters:

Try different values of max_depth and min_samples_split with GridSearchCV, which uses cross-validation on the training set to pick the best combination.

- Evaluate performance:

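Check accuracy, precision, recall, and the confusion matrix on the test set. If the disease is rare, accuracy alone can be misleading, so recall on the positive class deserves particular attention.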
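The steps above can be sketched end to end with a scikit-learn pipeline. This is a minimal sketch under stated assumptions: the patient table is synthetic, every column name (`age`, `bmi`, `sex`, `smoker`) and the labeling rule are hypothetical, and F1 is used as the tuning metric only as an example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all columns are hypothetical.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "bmi": rng.normal(27, 4, n),
    "sex": rng.choice(["Male", "Female"], n),
    "smoker": rng.choice(["yes", "no"], n),
})
# Made-up labeling rule, used only so the example has something to learn.
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Inject missing values into one numeric and one categorical column.
df.loc[rng.choice(n, 40, replace=False), "bmi"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "sex"] = np.nan

numeric = ["age", "bmi"]
categorical = ["sex", "smoker"]

# Steps 1 and 2: impute, then one-hot encode the categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and the tree live in one pipeline, so the
# imputers and encoder are fitted on training folds only.
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune the tree's hyperparameters with 5-fold cross-validation.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 5: evaluate the best pipeline on the held-out test set.
print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

Keeping the imputers and encoder inside the pipeline means GridSearchCV refits them on each training fold, which avoids leaking validation data into the preprocessing.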

Business Value:
This model helps doctors and hospitals flag likely cases early, make faster decisions, and reduce diagnosis time, which can improve patient care and lower costs.