Question 1: What is a Decision Tree, and how does it work in the context of classification?

=> A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. It models decisions based on input features by splitting data at each node according to feature values. In classification, each leaf node represents a class label, and branches denote feature conditions; the tree predicts class labels by traversing nodes based on feature values until it reaches a leaf.
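
The traversal described above can be sketched in a few lines of plain Python. This is only an illustration: the tree is hand-built, and its feature names, thresholds and class labels are assumptions chosen for the example, not values learned from data.

```python
# A hand-built tree stored as nested dicts (hypothetical thresholds and labels).
# Internal nodes test one feature; leaves carry a class label.
tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},  # branch taken when value <= threshold
    "right": {
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following one branch per feature test."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


print(predict(tree, {"petal_length": 4.8, "petal_width": 1.5}))
```

Each prediction follows exactly one root-to-leaf path, which is why a tree's decision can be read off as a short chain of if/else conditions.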

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

=> Gini Impurity is the probability of misclassifying a randomly chosen element of a node if it were labeled at random according to the node's class distribution. It is 0 for a pure node, and lower values indicate purer nodes.

Entropy quantifies the uncertainty (disorder) of the class distribution in a node; lower entropy means higher purity. Decision Trees use either measure to score candidate splits and choose the split whose child nodes have the lowest weighted impurity, that is, the most homogeneous children.
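
For class proportions p_k in a node, Gini impurity is 1 - sum(p_k^2) and entropy is -sum(p_k * log2(p_k)). A minimal sketch of both measures follows; the helper names `gini` and `entropy` and the toy label lists are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits; log2(n / count) equals -log2(count / n)."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


mixed = ["a", "a", "b", "b"]  # evenly mixed node
pure = ["a", "a", "a", "a"]   # pure node

print("mixed:", gini(mixed), entropy(mixed))
print("pure: ", gini(pure), entropy(pure))
```

Both measures are zero for the pure node and reach their maximum for the evenly mixed one, which is why they usually rank candidate splits in a similar order.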

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

=> Pre-Pruning: Stops tree growth early by setting constraints (e.g., max depth, min samples per leaf). Advantage: Limits overfitting while keeping training fast, because the full tree is never built.

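Post-Pruning: Grows the full tree, then removes branches that contribute little to predictive performance (e.g., cost-complexity pruning). Advantage: Often yields a more accurate tree, because each branch is judged after its subtree has been grown, so a weak-looking split that leads to useful splits below it is not cut off prematurely.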
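
A rough sketch of both strategies in scikit-learn, assuming the Iris data and a 70/30 split. The constraint values and the way `alpha` is picked (the middle of the pruning path) are arbitrary choices for illustration; in practice `ccp_alpha` would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-pruning: constrain growth before training starts.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Illustrative choice only: take the middle alpha of the path (clamped at 0).
alpha = max(0.0, path.ccp_alphas[len(path.ccp_alphas) // 2])
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```

Comparing leaf counts alongside held-out accuracy shows how much of the full tree each strategy removes and what that costs or gains on unseen data.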

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

=> Information Gain measures the reduction in entropy achieved by splitting a dataset on a feature: the entropy of the parent node minus the weighted average entropy of the child nodes. The split with the highest gain is chosen, as it yields the purest child subsets. This makes it the criterion by which the tree identifies the most informative feature at each node.
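
As a worked sketch, the gain of a split can be computed as the parent entropy minus the size-weighted entropy of the children. The function names and the toy "yes"/"no" labels below are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split whose children are as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"],
                 ["yes", "yes", "yes", "no", "no", "no"]]

print("perfect split gain:", information_gain(parent, perfect_split))
print("useless split gain:", information_gain(parent, useless_split))
```

The first split removes all uncertainty about the class, while the second removes none, so a tree comparing these two candidates would pick the first.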

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

=> Applications: Medical diagnosis, customer segmentation, credit scoring, fraud detection.

Advantages: Intuitive, easy to interpret, work with both numeric and categorical features, and need little preprocessing such as feature scaling.

Limitations: Prone to overfitting, sensitive to small changes in data (high variance), and their feature importances can be misleading when features are highly correlated.



Question 6: Python Program (Iris, Gini criterion, Accuracy & Feature Importances)

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

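iris = load_iris()
X, y = iris.data, iris.target
# Fit a tree that scores candidate splits with Gini impurity.
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X, y)
# Predictions are made on the same data used for fitting, so this is training accuracy.
pred = clf.predict(X)
print('Accuracy:', accuracy_score(y, pred))
# Importances follow the feature order in iris.feature_names.
print('Feature Importances:', clf.feature_importances_)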


Accuracy: 1.0
Feature Importances: [0.02666667 0.         0.05072262 0.92261071]


Question 7: Python Program (Iris, max_depth=3 vs fully-grown tree accuracy)

In [2]:
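# Reuses X, y and accuracy_score from the previous cell.
from sklearn.tree import DecisionTreeClassifier
# Fully-grown tree: no depth limit.
clf_full = DecisionTreeClassifier()
clf_full.fit(X, y)
accuracy_full = accuracy_score(y, clf_full.predict(X))

# Depth-limited tree; both accuracies are measured on the training data.
clf_depth3 = DecisionTreeClassifier(max_depth=3)
clf_depth3.fit(X, y)
accuracy_depth3 = accuracy_score(y, clf_depth3.predict(X))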

print('Fully-grown tree accuracy:', accuracy_full)
print('max_depth=3 tree accuracy:', accuracy_depth3)


Fully-grown tree accuracy: 1.0
max_depth=3 tree accuracy: 0.9733333333333334


Question 8: Python Program (Boston Housing, Decision Tree Regressor, MSE & Feature Importances)

In [4]:
# load_boston was removed in scikit-learn 1.2 due to ethical concerns; the California Housing dataset is used instead.
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X, y = housing.data, housing.target
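reg = DecisionTreeRegressor()
reg.fit(X, y)
# Predictions are made on the training data, so an unpruned tree gives a near-zero MSE.
pred = reg.predict(X)
print('Mean Squared Error:', mean_squared_error(y, pred))
# Importances follow the feature order in housing.feature_names.
print('Feature Importances:', reg.feature_importances_)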


Mean Squared Error: 9.35625724797014e-32
Feature Importances: [0.52562746 0.0509582  0.05302283 0.02771999 0.0317255  0.13125989
 0.09416251 0.08552361]


Question 9: Python Program (Iris, GridSearchCV for max_depth, min_samples_split)

In [6]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()
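X, y = iris.data, iris.target  # y holds the class labels 0, 1, 2

# Try every combination of the two hyperparameters with 3-fold cross-validation.
params = {'max_depth': [2,3,4,5], 'min_samples_split': [2,3,4]}
grid = GridSearchCV(DecisionTreeClassifier(), params, cv=3)
grid.fit(X, y)
print('Best Parameters:', grid.best_params_)
# best_score_ is the mean cross-validated accuracy of the best combination.
print('Best Accuracy:', grid.best_score_)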


Best Parameters: {'max_depth': 3, 'min_samples_split': 3}
Best Accuracy: 0.9733333333333333


Question 10: Healthcare Prediction - Data Science Process?

=> Handle missing values: Impute using mean/median for numeric, mode or a placeholder for categorical.

Encode categorical features: Use one-hot encoding for nominal features and ordinal (label) encoding for features with a natural order.

Train a Decision Tree model: Split the data into train and test sets, then fit the classifier on the training set only.

Tune hyperparameters: Use GridSearchCV to optimize parameters like max_depth, min_samples_split.

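Evaluate performance: Assess on the held-out test set using metrics like accuracy, precision, recall, and F1-score; recall deserves particular attention when a missed disease case is costly.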

Business value: Enables data-driven, consistent disease prediction, improves efficiency, reduces manual analysis, and can help prioritize patient care and resources.
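
The steps above can be sketched end to end with a scikit-learn pipeline. This is a minimal sketch and not a production workflow: the patient data is synthetic, and the column names (`age`, `blood_pressure`, `smoker`, `disease`), the rule that generates the label, and the parameter grid are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and label rule).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["disease"] = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)
# Blank out some entries to simulate missing values.
df.loc[rng.choice(n, 30, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Steps 1-2: impute missing values, then one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the classifier sits at the end of the same pipeline.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Step 4: tune the tree's hyperparameters with cross-validation.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

Keeping imputation and encoding inside the pipeline means they are re-fitted on each training fold during the grid search, so no information from the validation folds or the test set leaks into preprocessing.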