DECISION TREE ASSIGNMENT

**Decision Tree Theoretical Answers**

1. **What is a Decision Tree, and how does it work?**  
A Decision Tree is a supervised learning algorithm used for classification and regression. It consists of a root node, internal decision nodes, branches, and leaf nodes. Starting from the root, the tree recursively splits the data on the feature and threshold that most reduces an impurity criterion (e.g., Gini Impurity or Entropy), and each leaf assigns a class label or a numeric value.
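
As a rough illustration of the node, branch, and leaf structure, the sketch below fits a shallow tree on the Iris dataset and prints its rules with scikit-learn's `export_text`. The `max_depth=2` and `random_state=0` settings are choices made here for readability, not part of the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay short
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented "|---" line is a decision node test; "class:" lines are leaves
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```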

2. **What are impurity measures in Decision Trees?**  
Impurity measures quantify how mixed the class labels in a node are; a split is good when its child nodes are purer than the parent. Common impurity measures include:
   - Gini Impurity
   - Entropy (the quantity from which Information Gain is computed)

3. **What is the mathematical formula for Gini Impurity?**  
Gini = 1 - Σ_{i=1..C} (p_i)^2  
where p_i is the probability of class i in the node and C is the total number of classes.
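
A minimal sketch of the formula in plain Python, assuming the node is represented as a list of class labels (the function name `gini_impurity` is made up for this example):

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty node
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["a", "a", "a", "a"]))  # pure node -> 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # evenly mixed two-class node -> 0.5
```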

4. **What is the mathematical formula for Entropy?**  
Entropy = - Σ_{i=1..C} p_i log_2 (p_i)  
where p_i is the probability of class i in the node and C is the total number of classes; terms with p_i = 0 contribute 0.
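
The same formula as a small Python sketch, again assuming a node is a list of labels. The term `-p * log2(p)` is written as the equivalent `p * log2(1 / p)` so that a pure node returns `0.0` rather than `-0.0`; the function name is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty node
    # Only classes that actually occur are counted, so p is never 0
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))  # pure node -> 0.0
print(entropy(["a", "a", "b", "b"]))  # evenly mixed two-class node -> 1.0
```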

5. **What is Information Gain, and how is it used in Decision Trees?**  
Information Gain (IG) measures the reduction in entropy achieved by splitting a dataset. At each node, the candidate split with the highest IG is chosen.
IG = Entropy(parent) - Σ (N_j/N) Entropy(child_j)  
where N is the total number of samples and N_j is the number of samples in child node j.
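
A short worked sketch of the IG formula, assuming a split is given as a list of child label lists. The helper names and the toy "yes"/"no" labels are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely removes all uncertainty
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split whose children are as mixed as the parent removes none
useless_split = [["yes", "yes", "no", "no"],
                 ["yes", "yes", "yes", "no", "no", "no"]]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

The first split should give a gain of about 1 bit and the second a gain of about 0, which is why the tree would prefer the first.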

6. **What is the difference between Gini Impurity and Entropy?**  
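- **Gini Impurity** is slightly cheaper to compute (no logarithm) and measures the probability of misclassifying a randomly chosen sample if it were labeled according to the node's class distribution.
- **Entropy** involves logarithm calculations and measures the disorder (uncertainty) in the node's class distribution.
- **Gini Impurity** is used in CART, while **Entropy** is used in ID3 and C4.5 (C4.5 normalizes Information Gain into a gain ratio).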

7. **What is the mathematical explanation behind Decision Trees?**  
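At each node, a Decision Tree chooses the split that maximizes **Information Gain** (the reduction in Entropy) or the reduction in **Gini Impurity**. The process follows:
1. For each candidate feature and threshold, compute the impurity of the resulting child nodes and select the best split.
2. Recursively split each child node until a stopping condition is met (for example, the node is pure).
3. Prune the tree to prevent overfitting.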
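
Step 1 can be sketched for a single numeric feature as follows. This is a simplified illustration, not how any particular library implements it: it tries the midpoints between consecutive distinct values and keeps the threshold with the lowest weighted Gini Impurity. The function names and the toy measurements are made up.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split x <= t."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds: midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: one feature, two classes that separate cleanly
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa"] * 3 + ["versicolor"] * 3

print(best_threshold(petal_length, species))  # -> (3.0, 0.0)
```

A full tree builder would repeat this search over every feature and then recurse on the left and right subsets (step 2).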

8. **What is Pre-Pruning in Decision Trees?**  
Pre-Pruning stops the tree from growing too complex by setting conditions like:
- Maximum depth
- Minimum samples per leaf
- Minimum Information Gain threshold
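
In scikit-learn these conditions map roughly onto constructor parameters. The sketch below assumes `min_impurity_decrease` as the closest counterpart to a minimum-gain threshold; the specific values (3, 5, 0.01) are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree for comparison
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: growth stops as soon as any limit is reached
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth
    min_samples_leaf=5,          # minimum samples per leaf
    min_impurity_decrease=0.01,  # minimum weighted impurity reduction per split
    random_state=0,
).fit(X, y)

print("full:      ", full.get_depth(), "levels,", full.get_n_leaves(), "leaves")
print("pre-pruned:", pre_pruned.get_depth(), "levels,", pre_pruned.get_n_leaves(), "leaves")
```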

9. **What is Post-Pruning in Decision Trees?**  
Post-Pruning removes unnecessary branches from a fully grown tree using:
- Cost Complexity Pruning (CCP)
- Reduced Error Pruning
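
A possible way to apply Cost Complexity Pruning with scikit-learn is to compute the candidate alphas from the pruning path and pick one by cross-validation. This is a sketch under assumptions: 5-fold cross-validation and mean accuracy as the selection rule are choices made here, and Reduced Error Pruning is not covered.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fully grown tree and the alphas at which subtrees would be pruned away
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# Clip to guard against tiny negative values from floating-point round-off
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# Score each candidate alpha with 5-fold cross-validation
scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean()
    for a in alphas
]
best_alpha = float(alphas[int(np.argmax(scores))])

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("best ccp_alpha:", best_alpha)
print("leaves before/after:", full.get_n_leaves(), pruned.get_n_leaves())
```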

10. **What is the difference between Pre-Pruning and Post-Pruning?**  
- **Pre-Pruning** stops tree growth early to prevent overfitting but may cause underfitting.
- **Post-Pruning** grows a full tree first and then trims unnecessary branches to reduce overfitting.

11. **What is a Decision Tree Regressor?**  
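A Decision Tree Regressor predicts continuous values. At each node it typically chooses the feature threshold that most reduces the variance (mean squared error) of the target in the resulting child nodes, and each leaf predicts the mean target value of the training samples that fall into it.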
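
The leaf-mean behaviour can be checked on a tiny made-up dataset. With `max_depth=1` the tree has a single split, so each prediction should equal the mean target of one side of that split (the numbers below are invented for the example).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two well-separated groups of points along one feature
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([5.0, 6.0, 7.0, 50.0, 60.0, 70.0])

# A depth-1 tree ("stump") makes exactly one split
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)

preds = stump.predict([[2.5], [11.5]])
print(preds)                        # one prediction per leaf
print(y[:3].mean(), y[3:].mean())   # group means: 6.0 and 60.0
```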

12. **What are the advantages and disadvantages of Decision Trees?**  
**Advantages:**
- Easy to interpret
- Handles both numerical and categorical data (some implementations require categorical features to be encoded first)
- Minimal data preprocessing required
- Can model non-linear relationships

**Disadvantages:**
- Prone to overfitting
- Sensitive to small data changes
- Computationally expensive for large datasets

13. **How does a Decision Tree handle missing values?**  
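Handling depends on the implementation. Common strategies are:
- Dropping or ignoring samples with missing values when a split is evaluated
- Imputing missing values before training (mean, median, mode)
- Using surrogate splits, where a correlated feature stands in for the missing one
- Treating "missing" as a separate category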
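
The imputation strategy can be sketched with a scikit-learn pipeline. The toy data and the choice of the median are assumptions for illustration; the other strategies are not shown.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing entries marked as NaN
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [2.0, np.nan],
    [8.0, 9.0],
    [9.0, np.nan],
    [np.nan, 8.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# The imputer learns per-column medians during fit and reuses them at predict time
model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

medians = model.named_steps["simpleimputer"].statistics_
print("learned medians:", medians)
print("prediction for a row with a missing value:", model.predict([[np.nan, 8.5]]))
```

Keeping the imputer inside the pipeline means the medians are computed from the training data only, which avoids leaking information from a test set.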

14. **How does a Decision Tree handle categorical features?**  
- Label Encoding (assigning an integer to each category; this imposes an arbitrary order)
- One-Hot Encoding (one binary column per category)
- Direct splitting on category subsets (supported in some implementations)
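
The first two options might look like this in scikit-learn. `OrdinalEncoder` is used here as the feature-side counterpart of label encoding (scikit-learn's `LabelEncoder` is intended for targets); the `colors` and `is_warm` arrays are made-up data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

colors = np.array([["red"], ["green"], ["blue"], ["green"], ["red"], ["blue"]])
is_warm = np.array([1, 0, 0, 0, 1, 0])

# Integer codes: categories are sorted, so blue=0, green=1, red=2
ordinal = OrdinalEncoder().fit(colors)
X_ord = ordinal.transform(colors)
print(ordinal.categories_[0], X_ord.ravel())

# One binary column per category
one_hot = OneHotEncoder(handle_unknown="ignore").fit(colors)
X_hot = one_hot.transform(colors).toarray()
print(X_hot.shape)

# Train on the one-hot columns and classify a single color
clf_cat = DecisionTreeClassifier(random_state=0).fit(X_hot, is_warm)
pred = clf_cat.predict(one_hot.transform([["red"]]).toarray())
print(pred)
```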

15. **What are some real-world applications of Decision Trees?**  
- Medical Diagnosis
- Customer Churn Prediction
- Credit Risk Assessment
- Fraud Detection
- Spam Detection



PRACTICAL QUESTIONS

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.datasets import load_iris, fetch_california_housing

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text, plot_tree

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.metrics import accuracy_score, mean_squared_error, precision_score, recall_score, f1_score, confusion_matrix

from sklearn.preprocessing import StandardScaler
import graphviz


# Question 16: Train a Decision Tree Classifier on the Iris dataset and print model accuracy

iris = load_iris()

X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))

# Question 17: Train a Decision Tree Classifier using Gini Impurity and print feature importances
clf_gini = DecisionTreeClassifier(criterion='gini')

clf_gini.fit(X_train, y_train)

print("Feature Importances:", clf_gini.feature_importances_)

# Question 18: Train a Decision Tree Classifier using Entropy and print accuracy
clf_entropy = DecisionTreeClassifier(criterion='entropy')

clf_entropy.fit(X_train, y_train)

y_pred_entropy = clf_entropy.predict(X_test)

print("Accuracy with Entropy:", accuracy_score(y_test, y_pred_entropy))

# Question 19: Train a Decision Tree Regressor on a housing dataset and evaluate using MSE
housing = fetch_california_housing()

X_housing, y_housing = housing.data, housing.target

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42)

regressor = DecisionTreeRegressor()

regressor.fit(X_train_h, y_train_h)

y_pred_h = regressor.predict(X_test_h)

print("Mean Squared Error:", mean_squared_error(y_test_h, y_pred_h))

# Question 20: Visualize the Decision Tree
plt.figure(figsize=(12,8))

plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)

plt.show()

# Question 21: Compare max depth of 3 vs fully grown tree
clf_depth3 = DecisionTreeClassifier(max_depth=3)

clf_depth3.fit(X_train, y_train)

y_pred_depth3 = clf_depth3.predict(X_test)

print("Accuracy with max_depth=3:", accuracy_score(y_test, y_pred_depth3))

print("Accuracy with full tree:", accuracy_score(y_test, y_pred))

# Question 22: Train with min_samples_split=5
clf_split5 = DecisionTreeClassifier(min_samples_split=5)

clf_split5.fit(X_train, y_train)

y_pred_split5 = clf_split5.predict(X_test)

print("Accuracy with min_samples_split=5:", accuracy_score(y_test, y_pred_split5))

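# Question 23: Apply feature scaling before training
# (tree splits compare one feature against a threshold, so scaling is not expected to change accuracy much)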
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)

clf_scaled = DecisionTreeClassifier()

clf_scaled.fit(X_train_scaled, y_train)

y_pred_scaled = clf_scaled.predict(X_test_scaled)

print("Accuracy with scaling:", accuracy_score(y_test, y_pred_scaled))

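# Question 24: Train a Decision Tree Classifier using One-vs-Rest (OVR) strategy
from sklearn.multiclass import OneVsRestClassifier

# Wrap the tree so that one binary classifier is trained per class
clf_ovr = OneVsRestClassifier(DecisionTreeClassifier())

clf_ovr.fit(X_train, y_train)

print("Accuracy with OVR:", accuracy_score(y_test, clf_ovr.predict(X_test)))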

# Question 25: Display feature importance scores
print("Feature Importance Scores:", clf.feature_importances_)

# Question 26: Compare Decision Tree Regressor with max_depth=5
regressor_depth5 = DecisionTreeRegressor(max_depth=5)

regressor_depth5.fit(X_train_h, y_train_h)

y_pred_depth5 = regressor_depth5.predict(X_test_h)

print("MSE with max_depth=5:", mean_squared_error(y_test_h, y_pred_depth5))

# Question 27: Apply Cost Complexity Pruning (CCP) and visualize effect
path = clf.cost_complexity_pruning_path(X_train, y_train)

ccp_alphas = path.ccp_alphas

pruned_clfs = [DecisionTreeClassifier(ccp_alpha=alpha).fit(X_train, y_train) for alpha in ccp_alphas]

pruned_accuracies = [accuracy_score(y_test, model.predict(X_test)) for model in pruned_clfs]

plt.plot(ccp_alphas, pruned_accuracies, marker='o', linestyle='--')

plt.xlabel('CCP Alpha')

plt.ylabel('Accuracy')

plt.title('Effect of CCP on Accuracy')

plt.show()

# Question 28: Evaluate using Precision, Recall, and F1-Score
precision = precision_score(y_test, y_pred, average='weighted')

recall = recall_score(y_test, y_pred, average='weighted')

f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1}")

# Question 29: Visualize confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)

sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)

plt.xlabel('Predicted')

plt.ylabel('Actual')

plt.title('Confusion Matrix')

plt.show()

# Question 30: Use GridSearchCV to find optimal max_depth and min_samples_split
param_grid = {'max_depth': [3, 5, 10, None], 'min_samples_split': [2, 5, 10]}

grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
