Question 1: What is a Decision Tree, and how does it work in the context of classification? 

Ans: A Decision Tree is a non-parametric supervised learning algorithm used for both classification and regression tasks. It has a hierarchical, tree-like structure consisting of a root node, branches, internal nodes, and leaf nodes. 

How it works in classification:

1. Root Node: The process starts at the top with the entire dataset.
2. Splitting: The algorithm selects the feature and threshold that best split the data into subsets that are more "pure" (dominated by a single class).
3. Internal Nodes: These represent decision points based on specific feature values (e.g., "Is age > 30?").
4. Leaf Nodes: These represent the final classification category. Once a data point reaches a leaf node, it is assigned the majority label of that node.
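
The four steps above can be seen on a small fitted tree. The sketch below is only an illustration: it assumes scikit-learn is available, and the depth limit of 2 and `random_state=0` are arbitrary choices made to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed structure stays readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Lines with a comparison (<=, >) are tests at the root or an internal node;
# lines containing "class:" are leaf nodes
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Prediction follows one root-to-leaf path and returns that leaf's majority class
sample = iris.data[:1]
predicted_name = iris.target_names[tree.predict(sample)[0]]
print("Predicted class:", predicted_name)
```

Reading the printed rules from top to bottom reproduces the path a single sample takes from the root to a leaf.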

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree? 

Ans: Both measures quantify the "disorder" or "impurity" of a node. 

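* Gini Impurity: Measures the probability of a randomly chosen element being incorrectly classified if it were labeled at random according to the distribution of labels in the node. For class proportions p_k, Gini = 1 - Σ p_k². It is 0 for a pure node and reaches its maximum of 1 - 1/K when K classes are equally represented (0.5 for two classes).

* Entropy: Derived from information theory, it measures the level of uncertainty or randomness in the data. For class proportions p_k, Entropy = -Σ p_k log2(p_k). It is 0 for a pure node and reaches its maximum of log2(K) when K classes are equally represented (1 for two classes).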

Impact on Splits:
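The algorithm computes the chosen impurity measure for every candidate split and picks the split with the largest reduction in impurity, weighting each child node by its number of samples. When Entropy is used, this reduction is called Information Gain. Gini is slightly cheaper to compute because it involves no logarithms, and it is the default criterion in scikit-learn; in practice the two measures usually select similar splits.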
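
As a rough illustration of the two measures, the following sketch computes both from a list of class labels. The helper names `gini_impurity` and `entropy` and the toy label lists are made up for this example; they are not part of any library.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())

# Toy nodes (hypothetical labels)
pure = ["yes"] * 4                     # only one class
balanced = ["yes", "yes", "no", "no"]  # two classes, evenly mixed
three_way = ["a", "b", "c"]            # three classes, evenly mixed

for name, node in [("pure", pure), ("balanced", balanced), ("three_way", three_way)]:
    print(f"{name}: gini={gini_impurity(node):.4f}, entropy={entropy(node):.4f}")
```

The three-class node shows why the upper bounds depend on the number of classes: its Gini impurity exceeds 0.5 and its entropy exceeds 1.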

Question 3: Difference between Pre-Pruning and Post-Pruning. Give one practical advantage of each. 

Ans: Pruning is used to prevent overfitting by simplifying the tree. 

| Feature | Pre-Pruning (Early Stopping) | Post-Pruning (e.g., Cost-Complexity Pruning) |
| --- | --- | --- |
| Method | Stops the tree from growing once it hits certain criteria (e.g., `max_depth`, `min_samples_split`). | Allows the tree to grow fully, then removes branches that provide little predictive power. |
| Advantage | Saves computational time and memory by not building a massive tree. | Often generalizes better, because a split that looks weak on its own may enable useful splits below it, and those are only visible once the full tree has been grown. |
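
The two approaches in the table can be compared with scikit-learn roughly as follows. This is a sketch, not a tuning recipe: `max_depth=2` and the middle value of the pruning path are arbitrary illustrative choices, and in practice `ccp_alpha` would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-pruning: growth stops as soon as the depth limit is reached
pre_pruned = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)

# Post-pruning: grow the full tree, compute its cost-complexity pruning path,
# then refit with one candidate alpha (larger alpha removes more branches)
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary pick, for illustration only
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: {model.get_n_leaves()} leaves, test accuracy {model.score(X_test, y_test):.4f}")
```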

Question 4: What is Information Gain and why is it important for choosing the best split? 

Ans: Information Gain is the difference between the entropy of the parent node and the weighted average entropy of the child nodes, where each child is weighted by its share of the parent's samples: IG = H(parent) - Σ (n_child / n_parent) × H(child).

Importance: It serves as the selection criterion. At each node, the candidate split with the highest Information Gain is considered the most informative and is chosen. Because this choice is greedy (made one node at a time), it does not guarantee the smallest possible tree, but it tends to produce compact trees that separate the classes in few steps.
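
A small worked example may make the definition concrete. The helper functions and the two candidate splits below are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Parent node: 5 positive and 5 negative samples (entropy of 1 bit)
parent = ["yes"] * 5 + ["no"] * 5

# Hypothetical split A separates the classes fairly well; split B barely helps
split_a = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"Information Gain, split A: {gain_a:.4f}")
print(f"Information Gain, split B: {gain_b:.4f}")
```

Split A yields the larger gain, so a tree using the entropy criterion would prefer it at this node.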

Question 5: Real-world applications, advantages, and limitations. 

Ans: Applications: Credit scoring (loan default prediction), medical diagnosis, and customer churn prediction in marketing.   
Advantages:
* Easy to understand and visualize (White-box model).
* Requires little data preprocessing (no need for scaling/normalization).
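* Handles both numerical and categorical data in principle (scikit-learn's implementation requires categorical features to be encoded numerically first).   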
Limitations:
* High tendency to overfit (captures noise in data).
* Unstable: Small changes in data can lead to a completely different tree structure.

Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

In [3]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data and hold out 20% for testing
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a decision tree with the Gini criterion (scikit-learn's default)
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict with the already-fitted model; fitting a second time is unnecessary
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("Feature Importances:", clf.feature_importances_)

Accuracy: 1.0000
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [4]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
# test_size is not set, so scikit-learn's default 25% hold-out is used
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# Depth-limited tree (max_depth=3)
clf_3 = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
acc_3 = accuracy_score(y_test, clf_3.predict(X_test))

# Fully grown tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
acc_full = accuracy_score(y_test, clf_full.predict(X_test))

print(f"Accuracy (max_depth=3): {acc_3:.4f}")
print(f"Accuracy (Fully Grown): {acc_full:.4f}")

Accuracy (max_depth=3): 1.0000
Accuracy (Fully Grown): 1.0000


Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

In [None]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

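# load_boston was removed in scikit-learn 1.2, so read the raw file from its original source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Raw string avoids an invalid-escape warning for "\s"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two physical lines: 11 features, then 2 more features plus the target (MEDV)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]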

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
print(f"Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred):.4f}")
print("Feature Importances:", regressor.feature_importances_)


Mean Squared Error (MSE): 10.4161
Feature Importances: [5.12956739e-02 3.35270585e-03 5.81619171e-03 2.27940651e-06
 2.71483790e-02 6.00326256e-01 1.36170630e-02 7.06881622e-02
 1.94062297e-03 1.24638653e-02 1.10116089e-02 9.00872742e-03
 1.93328464e-01]


Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
param_grid = {'max_depth': [2, 3, 4, 5], 'min_samples_split': [2, 5, 10]}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(iris.data, iris.target)

print("Best Parameters:", grid_search.best_params_)
# best_score_ is the mean 5-fold cross-validated accuracy of the best parameter combination
print(f"Best Accuracy: {grid_search.best_score_:.4f}")

Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Best Accuracy: 0.9733


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Ans: 
Process:

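1. Handle Missing Values: Use imputation. For numerical data (like age), use the median or mean. For categorical data (like blood type), use the mode or an explicit "Missing" category. Fit the imputer on the training split only, so that no information leaks from the test set.
2. Encode Categorical Features: Use One-Hot Encoding for nominal data (e.g., Gender) and ordinal encoding with an explicit category order for ordinal data (e.g., Disease Stage: Mild < Moderate < Severe). In scikit-learn, `OrdinalEncoder` is intended for features, while `LabelEncoder` is intended for the target labels.
3. Train Model: Split the data (e.g., 80/20, stratified on the disease label so both splits keep the same class balance) and train a `DecisionTreeClassifier`.
4. Tune Hyperparameters: Use `GridSearchCV` with cross-validation on the training split to find the optimal `max_depth` and `min_samples_leaf`, which limits overfitting to noise in the patient data.
5. Evaluate Performance: Use a Confusion Matrix on the held-out test set, focusing on Recall (to minimize false negatives, as missing a disease diagnosis is critical) while also checking Precision so that false alarms stay manageable.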
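
The steps above can be sketched end to end with a scikit-learn `Pipeline`. Everything about the data here is an assumption: the patient table is synthetic, the column names (`age`, `blood_pressure`, `sex`, `smoker`) and the rule that generates the label are invented for illustration, and the parameter grid is only a starting point.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all columns and values are made up
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "sex": rng.choice(["F", "M"], n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)  # invented label rule

# Inject missing values to mimic incomplete records
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["sex", "smoker"]

# Steps 1 and 2: impute and encode inside the pipeline, so both are fit on training folds only
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([("prep", preprocess), ("tree", DecisionTreeClassifier(random_state=42))])

# Step 3: stratified 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune depth and leaf size, optimizing recall
grid = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_leaf": [1, 5, 10]},
    scoring="recall",
    cv=5,
)
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = grid.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Best parameters:", grid.best_params_)
print("Confusion matrix:\n", cm)
print(f"Recall: {recall:.4f}")
```

Keeping imputation and encoding inside the pipeline means `GridSearchCV` refits them on each training fold, which avoids leaking statistics from validation or test data.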

Business Value: This model provides Decision Support. It allows doctors to prioritize high-risk patients for early intervention, leading to better patient outcomes and reduced long-term hospital costs.