# **Theoretical Questions**

Q1. What is a Decision Tree, and how does it work?
- A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. It works by splitting the data into subsets based on the value of input features, creating a tree-like model of decisions. Each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome (or class label).

The process of building a Decision Tree involves:

- Selecting the Best Feature: At each node, the algorithm selects the feature that best separates the data into classes based on a certain criterion (e.g., Gini impurity, entropy).
- Splitting the Data: The data is split into subsets based on the selected feature's value.
- Recursion: The process is repeated recursively for each subset until a stopping criterion is met (e.g., maximum depth, minimum samples per leaf, or no further improvement).
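
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular library implements it: the helper names (`gini`, `best_split`, `build_tree`, `predict_one`), the dictionary-based tree, and the toy data are all assumptions made for the example, and only numeric features with `<=` thresholds are handled.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return the (feature, threshold) pair with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for j in range(X.shape[1]):
        # Every observed value except the largest is a candidate threshold
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:  # keep only strict improvements
                best, best_score = (j, t), score
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split until max_depth is reached or no split improves impurity."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}  # majority class
    j, t = split
    left = X[:, j] <= t
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree(X[left], y[left], depth + 1, max_depth),
        "right": build_tree(X[~left], y[~left], depth + 1, max_depth),
    }

def predict_one(tree, x):
    """Follow the decision rules from the root down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

# Toy data (hypothetical): feature 0 separates the two classes at 2.0
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]])
y = np.array([0, 0, 1, 1])
toy_tree = build_tree(X, y)
```

On this toy data the root splits on feature 0 at threshold 2.0 and both children are pure leaves, so recursion stops after one level.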


Q2. What are impurity measures in Decision Trees?
- Impurity measures are metrics used to evaluate how well a feature separates the classes in a dataset. They help in determining the best feature to split the data at each node of the Decision Tree. Common impurity measures include:

  - Gini Impurity: Measures the probability of misclassifying a randomly chosen element if it were labeled at random according to the distribution of labels in the subset.
  - Entropy: Measures the amount of disorder or uncertainty in the dataset. It quantifies the unpredictability of the information content.
  - Classification Error: The proportion of instances in a subset that do not belong to its majority class, i.e. 1 - max(p_i).

Q3. What is the mathematical formula for Gini Impurity?
- The Gini Impurity for a dataset is calculated using the formula:

$$Gini(D) = 1 - \sum_{i=1}^{C} p_i^2$$

where:

- $D$ is the dataset,
- $C$ is the number of classes,
- $p_i$ is the proportion of instances belonging to class $i$ in the dataset.
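
As a quick numeric check of this definition (one minus the sum of squared class proportions), the sketch below computes Gini impurity for three small label lists. The function name `gini_impurity` and the sample labels are made up for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini_impurity(["a", "a", "a", "a"])      # one class only -> 0.0
balanced = gini_impurity(["a", "a", "b", "b"])  # 50/50 split -> 0.5
mixed = gini_impurity(["a", "a", "a", "b"])     # 1 - (9/16 + 1/16) = 0.375
```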

Q4. What is the mathematical formula for Entropy?
- The Entropy of a dataset is calculated using the formula:

$$H(D) = -\sum_{i=1}^{C} p_i \log_2 p_i$$

where:

- $D$ is the dataset,
- $C$ is the number of classes,
- $p_i$ is the proportion of instances belonging to class $i$ in the dataset.
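
The entropy definition (the negative sum of each class proportion times its base-2 logarithm) can be checked the same way. The function name `entropy` and the label lists below are illustrative assumptions; classes with zero count never appear in the `Counter`, so `log2(0)` is never evaluated.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

h_pure = entropy(["a", "a", "a", "a"])   # a single class carries no uncertainty
h_binary = entropy(["a", "a", "b", "b"]) # two equally likely classes: 1 bit
h_four = entropy(["a", "b", "c", "d"])   # four equally likely classes: log2(4) = 2 bits
```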

Q5. What is Information Gain, and how is it used in Decision Trees?

- Information Gain is a measure of the effectiveness of an attribute in classifying the training data. It quantifies the reduction in entropy or impurity after a dataset is split on an attribute. The formula for Information Gain is:

$$IG(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} H(D_v)$$

where:

- $IG(D, A)$ is the Information Gain of attribute $A$ with respect to dataset $D$,
- $H(D)$ is the entropy of dataset $D$,
- $\mathrm{Values}(A)$ are the possible values of attribute $A$,
- $D_v$ is the subset of $D$ for which attribute $A$ has value $v$,
- $|D|$ is the total number of instances in dataset $D$,
- $|D_v|$ is the number of instances in subset $D_v$.

In Decision Trees, Information Gain is used to select the attribute that provides the highest reduction in uncertainty (or impurity) when splitting the data, thus guiding the construction of the tree.
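
A small sketch of this selection step, assuming a toy dataset invented for the example: `outlook` perfectly predicts the label, `windy` tells us nothing, so the first should receive the full gain and the second none. The helper names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Parent entropy minus the size-weighted entropy of the subsets D_v."""
    groups = defaultdict(list)
    for v, label in zip(values, labels):
        groups[v].append(label)  # D_v: labels of rows where the attribute equals v
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Toy data (hypothetical)
outlook = ["sunny", "sunny", "rain", "rain"]
windy = [True, False, True, False]
play = ["no", "no", "yes", "yes"]

ig_outlook = information_gain(outlook, play)  # 1.0: both subsets are pure
ig_windy = information_gain(windy, play)      # 0.0: both subsets stay 50/50
```

A tree builder using this criterion would therefore split on `outlook` first.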

Q6. Difference between Gini Impurity and Entropy

- Definition:
  - Gini Impurity: Measures the probability of misclassifying a randomly chosen element from the dataset. It ranges from 0 (pure) to 0.5 (maximum impurity for binary classification); with C classes the maximum is 1 - 1/C.
  - Entropy: Measures the amount of uncertainty or disorder in the dataset. It ranges from 0 (pure) to log₂(C), where C is the number of classes (so the maximum is 1 for binary classification).
- Computational Complexity:
  - Gini is computationally simpler and slightly faster, as it does not involve logarithmic calculations.
  - Entropy requires logarithmic calculations, making it somewhat more expensive to compute.
- Bias:
  - In practice the two criteria usually select similar splits. Gini tends to isolate the most frequent class in its own branch.
  - Entropy tends to produce slightly more balanced trees.
- Usage:
  - Gini is commonly used in CART (Classification and Regression Trees).
  - Entropy is used in algorithms like ID3 and C4.5.
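
To make the ranges concrete, the sketch below evaluates both measures for a binary problem as the positive-class proportion `p` varies. The function names are illustrative; the point is only that both are zero for a pure node and peak at `p = 0.5`, at 0.5 for Gini and 1.0 for entropy.

```python
import math

def binary_gini(p):
    """Gini impurity of a two-class node with positive-class proportion p."""
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def binary_entropy(p):
    """Entropy in bits of a two-class node with positive-class proportion p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node; avoids evaluating log2(0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={binary_gini(p):.3f}  entropy={binary_entropy(p):.3f}")
```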


Q7. Mathematical Explanation behind Decision Trees

- Structure:
  - Decision Trees consist of nodes (decision nodes and leaf nodes) and edges (branches).
- Splitting Criteria:
  - The tree is built by recursively splitting the dataset based on feature values to minimize impurity (Gini or Entropy).
- Information Gain:
  - For a binary split of $S$ into $S_1$ and $S_2$, Information Gain is calculated as the difference in entropy before and after the split:

    $$IG(S, A) = H(S) - \left( \frac{|S_1|}{|S|} H(S_1) + \frac{|S_2|}{|S|} H(S_2) \right)$$

    where $H(S)$ is the entropy of the dataset before the split, and $H(S_1)$ and $H(S_2)$ are the entropies of the subsets after the split.
- Stopping Criteria:
  - The splitting process stops when a node reaches a maximum depth, contains fewer than a minimum number of data points, or results in a pure node.
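
As a worked instance of the binary-split Information Gain formula, with counts chosen purely for illustration: suppose $S$ holds 10 samples (5 positive, 5 negative), and a split produces $S_1$ with 4 positive samples and $S_2$ with 1 positive and 5 negative samples. Then

$$H(S) = 1, \qquad H(S_1) = 0, \qquad H(S_2) = -\frac{1}{6}\log_2\frac{1}{6} - \frac{5}{6}\log_2\frac{5}{6} \approx 0.650$$

$$IG(S, A) = 1 - \left( \frac{4}{10} \cdot 0 + \frac{6}{10} \cdot 0.650 \right) \approx 0.610$$

so this split removes roughly 61% of the original uncertainty.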


Q8. Pre-Pruning in Decision Trees

- Definition: Pre-Pruning, also known as early stopping, involves halting the growth of the decision tree during the training phase to prevent overfitting.
- Methods:
  - Set a maximum depth for the tree.
  - Specify a minimum number of samples required to split a node.
  - Define a minimum number of samples per leaf.
  - Limit the number of features considered for splitting.
- Purpose: The goal is to simplify the model and enhance generalization by avoiding overly complex trees.
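
In scikit-learn these four methods correspond to the `max_depth`, `min_samples_split`, `min_samples_leaf`, and `max_features` parameters of `DecisionTreeClassifier`. The sketch below sets all of them at once on the Iris data; the specific values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruned tree: growth stops early when any constraint would be violated
pre_pruned = DecisionTreeClassifier(
    max_depth=3,           # maximum depth of the tree
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    max_features=2,        # consider only 2 features at each split
    random_state=0,
).fit(X, y)

# Unrestricted tree for comparison
unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

print("pre-pruned:  ", pre_pruned.get_depth(), "levels,", pre_pruned.get_n_leaves(), "leaves")
print("unrestricted:", unrestricted.get_depth(), "levels,", unrestricted.get_n_leaves(), "leaves")
```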

Q9. Post-Pruning in Decision Trees

- Definition: Post-Pruning involves allowing the decision tree to grow fully and then removing branches that do not contribute significantly to predictive accuracy.
- Methods:
  - Cost-Complexity Pruning: Scores each subtree by its training error plus a complexity parameter α times its number of leaves, and keeps the subtree with the smallest score.
  - Reduced Error Pruning: Replaces a subtree with a leaf whenever doing so does not reduce accuracy on a validation set.
  - Minimum Impurity Decrease: Removes splits whose reduction in impurity is below a certain threshold. (In scikit-learn, `min_impurity_decrease` is checked while the tree is grown, so there it acts as a pre-pruning criterion.)
- Purpose: This technique aims to reduce overfitting and improve the model's performance on unseen data.
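
scikit-learn exposes cost-complexity pruning through the `ccp_alpha` parameter. A minimal sketch, assuming the Iris data and an arbitrary `ccp_alpha=0.02` picked only to show the effect: the pruned tree is a subtree of the fully grown one, so it can never have more leaves.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fully grown tree (ccp_alpha defaults to 0.0, i.e. no pruning)
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Same tree, pruned after growth; the alpha value is an illustrative assumption
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print("leaves before pruning:", full_tree.get_n_leaves())
print("leaves after pruning: ", pruned_tree.get_n_leaves())
```

In practice the value of `ccp_alpha` would be chosen by validation rather than fixed by hand.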

Q10. Difference between Pre-Pruning and Post-Pruning

- Timing:
  - Pre-Pruning occurs during the tree construction process, stopping growth based on predefined criteria.
  - Post-Pruning takes place after the tree has been fully constructed, where unnecessary branches are removed.
- Approach:
  - Pre-Pruning aims to prevent the tree from becoming too complex from the outset.
  - Post-Pruning focuses on refining the tree after it has been built to enhance its predictive power.
- Impact on Model:
  - Pre-Pruning can lead to a simpler model that may underfit if too restrictive.
  - Post-Pruning can help in retaining a more complex model while reducing overfitting.

Q11. What is a Decision Tree Regressor?
- A Decision Tree Regressor is a machine learning model that predicts continuous outcomes by learning decision rules from the features of the dataset. It operates by recursively splitting the data into subsets based on feature values, ultimately forming a tree-like structure where each leaf node represents a predicted value.
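
To see what "each leaf node represents a predicted value" means, the sketch below fits a depth-1 regressor (a single split) to made-up data with two obvious clusters. With the default squared-error criterion, each leaf predicts the mean target of the training samples that fall into it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): a low cluster and a high cluster
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 2.0, 3.0, 20.0, 22.0, 24.0])

# One split only, so the tree has exactly two leaves
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)

low, high = stump.predict([[2.5], [10.5]])
# low  is the mean of (1, 2, 3)    = 2.0
# high is the mean of (20, 22, 24) = 22.0
```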

Q12. What are the advantages and disadvantages of Decision Trees?

- Advantages:
  - Easy to Understand: Decision trees are intuitive and can be visualized easily, making them accessible for non-technical stakeholders.
  - Handles Non-Linear Relationships: They can model complex relationships between features and target variables effectively.
  - Minimal Data Preparation: Decision trees require little preprocessing; feature scaling or normalization is not needed.
  - Robust to Outliers: They are generally insensitive to outliers in the features, as extreme values do not significantly affect the splits.
  - Automatic Feature Selection: Decision trees can identify and prioritize important features during the splitting process.
- Disadvantages:
  - Overfitting: They are prone to overfitting, especially with deep trees, which can lead to poor generalization on unseen data.
  - Instability: Small changes in the data can result in different tree structures, making them less stable.
  - Bias Towards Features with More Levels: Decision trees may favor features with many categories, potentially leading to misleading results.
  - Limited Expressiveness: They may struggle to capture certain complex relationships compared to more advanced models like neural networks.

Q13. How does a Decision Tree handle missing values?
- Decision Trees can handle missing values in several ways, depending on the implementation:
  - Surrogate Splits: Some algorithms (such as CART) create alternative splits based on other features, which are used when the value of the primary split feature is missing.
  - Imputation: Missing values can be filled in with the most common value or the mean/median of the feature before training.
  - Assigning Probabilities: Some algorithms (such as C4.5) send a sample with a missing value down every branch, weighted by the fraction of training samples that followed each branch, so the model remains usable despite incomplete data.
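
Of these options, imputation is the simplest to show with scikit-learn. The sketch below is one possible setup, assuming a tiny made-up dataset: a `SimpleImputer` with the median strategy feeds a `DecisionTreeClassifier` inside a pipeline, so the same fill values are reused at prediction time. Recent scikit-learn releases can also accept `NaN` directly in tree estimators; explicit imputation is used here because it does not depend on the installed version.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical) with two missing entries
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.nan, 40.0],
    [1.5, 12.0],
    [3.5, 38.0],
])
y = np.array([0, 0, 1, 1, 0, 1])

# Fill each missing value with its column median, then fit the tree
model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

# Column medians learned from the training data: 2.0 and 30.0
medians = model.named_steps["simpleimputer"].statistics_

# A new sample with a missing first feature is imputed before prediction
pred = model.predict([[np.nan, 35.0]])
```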

Q14. How does a Decision Tree handle categorical features?
- Decision Trees handle categorical features by splitting the data based on the categories of the feature. Each category can lead to a different branch in the tree, allowing the model to make decisions based on the presence or absence of specific categories. This capability makes Decision Trees particularly effective for datasets with mixed data types.
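
How categories reach the tree depends on the library. scikit-learn's tree estimators expect numeric input, so a common approach there is to one-hot encode the categories first; each resulting split then tests the presence or absence of one category. The sketch below assumes a made-up single-column dataset of colours.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): the label is 1 only for "red"
X = np.array([["red"], ["green"], ["blue"], ["red"], ["green"], ["blue"]])
y = np.array([1, 0, 0, 1, 0, 0])

# One-hot encode the category, then fit the tree on the 0/1 indicator columns
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # unseen categories become all zeros
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

pred = model.predict(np.array([["red"], ["blue"]]))  # red -> 1, blue -> 0
```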

Q15. What are some real-world applications of Decision Trees?
- Decision Trees are widely used in various fields, including:

  - Customer Segmentation: Identifying distinct customer groups based on purchasing behavior.
  - Credit Scoring: Assessing the creditworthiness of individuals by analyzing their financial history.
  - Medical Diagnosis: Classifying patients based on symptoms and medical history to predict diseases.
  - Predicting Housing Prices: Estimating property values based on features like location, size, and amenities.
  - Risk Assessment: Evaluating potential risks in finance, insurance, and project management.

# **Practical Questions**

In [None]:
# Q16. Write a Python program to train a Decision Tree Classifier on the Iris dataset and print the model accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train classifier (random_state fixed so the result is reproducible)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict & evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")


In [None]:
# Q17. Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Train classifier with Gini impurity (random_state fixed so the importances are reproducible)
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X, y)

# Print feature importances
print("Feature importances (Gini):")
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")


In [None]:
# Q18. Write a Python program to train a Decision Tree Classifier using Entropy as the splitting criterion and print the model accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train classifier with entropy (random_state fixed so the result is reproducible)
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

# Predict & evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy (Entropy): {accuracy:.4f}")


In [None]:
# Q19. Write a Python program to train a Decision Tree Regressor on a housing dataset and evaluate using Mean Squared Error (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California housing data
# (load_boston was removed in scikit-learn 1.2; this dataset is downloaded on first use)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Regressor (random_state fixed so the result is reproducible)
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict & evaluate
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")


In [None]:
# Q20. Write a Python program to train a Decision Tree Classifier and visualize the tree using graphviz

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Train classifier
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Export tree in DOT format
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    special_characters=True
)

# Visualize using graphviz (needs the Graphviz system binaries installed, not only the Python package)
graph = graphviz.Source(dot_data)
graph.render("iris_decision_tree")  # saves as PDF file
graph.view()  # opens the PDF file in default viewer


In [None]:
# Q21. Write a Python program to train a Decision Tree Classifier with a maximum depth of 3 and compare its accuracy with a fully grown tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Tree with max_depth=3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

# Fully grown tree (default max_depth=None)
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy with max_depth=3: {acc_limited:.4f}")
print(f"Accuracy with fully grown tree: {acc_full:.4f}")


In [None]:
# Q22. Write a Python program to train a Decision Tree Classifier using min_samples_split=5 and compare its accuracy with a default tree

# Reuses the imports, iris data, and train/test split (X_train, X_test, y_train, y_test) defined in the Q21 cell

# Tree with min_samples_split=5
tree_min_split = DecisionTreeClassifier(min_samples_split=5, random_state=42)
tree_min_split.fit(X_train, y_train)
y_pred_min_split = tree_min_split.predict(X_test)
acc_min_split = accuracy_score(y_test, y_pred_min_split)

# Default tree
tree_default = DecisionTreeClassifier(random_state=42)
tree_default.fit(X_train, y_train)
y_pred_default = tree_default.predict(X_test)
acc_default = accuracy_score(y_test, y_pred_default)

print(f"Accuracy with min_samples_split=5: {acc_min_split:.4f}")
print(f"Accuracy with default min_samples_split: {acc_default:.4f}")


In [None]:
# Q23. Write a Python program to apply feature scaling before training a Decision Tree Classifier and compare its accuracy with unscaled data

from sklearn.preprocessing import StandardScaler

# Unscaled data tree
tree_unscaled = DecisionTreeClassifier(random_state=42)
tree_unscaled.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, tree_unscaled.predict(X_test))

# Scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Tree on scaled data
tree_scaled = DecisionTreeClassifier(random_state=42)
tree_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, tree_scaled.predict(X_test_scaled))

print(f"Accuracy on unscaled data: {acc_unscaled:.4f}")
print(f"Accuracy on scaled data: {acc_scaled:.4f}")


In [None]:
# Q24. Write a Python program to train a Decision Tree Classifier using One-vs-Rest (OvR) strategy for multiclass classification

from sklearn.multiclass import OneVsRestClassifier

# OvR wrapped decision tree
ovr_tree = OneVsRestClassifier(DecisionTreeClassifier(random_state=42))
ovr_tree.fit(X_train, y_train)
y_pred_ovr = ovr_tree.predict(X_test)
acc_ovr = accuracy_score(y_test, y_pred_ovr)

print(f"Accuracy using One-vs-Rest Decision Tree: {acc_ovr:.4f}")


In [None]:
# Q25. Write a Python program to train a Decision Tree Classifier and display the feature importance scores

# Train tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Feature importance
importances = tree.feature_importances_
for name, score in zip(data.feature_names, importances):
    print(f"{name}: {score:.4f}")


In [None]:
# Q26. Write a Python program to train a Decision Tree Regressor with max_depth=5 and compare its performance with an unrestricted tree

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset (California housing)
# (load_boston was removed in scikit-learn 1.2; this dataset is downloaded on first use)
data = fetch_california_housing()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Decision Tree Regressor with max_depth=5
tree_limited = DecisionTreeRegressor(max_depth=5, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)

# Unrestricted Decision Tree Regressor
tree_unrestricted = DecisionTreeRegressor(random_state=42)
tree_unrestricted.fit(X_train, y_train)
y_pred_unrestricted = tree_unrestricted.predict(X_test)

# Evaluate with MSE
mse_limited = mean_squared_error(y_test, y_pred_limited)
mse_unrestricted = mean_squared_error(y_test, y_pred_unrestricted)

print(f"MSE with max_depth=5: {mse_limited:.3f}")
print(f"MSE unrestricted tree: {mse_unrestricted:.3f}")


In [None]:
# Q27. Write a Python program to train a Decision Tree Classifier, apply Cost Complexity Pruning (CCP), and visualize its effect on accuracy

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train initial tree to get ccp_alphas
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

train_scores = []
test_scores = []

for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    train_scores.append(clf.score(X_train, y_train))
    test_scores.append(clf.score(X_test, y_test))

plt.figure(figsize=(8,5))
plt.plot(ccp_alphas, train_scores, marker='o', label='Train Accuracy')
plt.plot(ccp_alphas, test_scores, marker='o', label='Test Accuracy')
plt.xlabel('ccp_alpha')
plt.ylabel('Accuracy')
plt.title('Effect of CCP Alpha on Accuracy')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q28. Write a Python program to train a Decision Tree Classifier and evaluate its performance using Precision, Recall, and F1-Score

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))


In [None]:
# Q29. Write a Python program to train a Decision Tree Classifier and visualize the confusion matrix using seaborn

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
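# Label both axes with the class names instead of bare indices
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=data.target_names, yticklabels=data.target_names)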
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()


In [None]:
# Q30. Write a Python program to train a Decision Tree Classifier and use GridSearchCV to find the optimal values for max_depth and min_samples_split

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 5, 10, 20]
}

clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)

# Evaluate on test set
best_clf = grid_search.best_estimator_
print("Test set accuracy:", best_clf.score(X_test, y_test))
