



## Decision Tree Assignment
---

### Question 1: What is a Decision Tree, and how does it work in the context of classification?

**Answer:**
A **Decision Tree** is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical, tree-like structure consisting of a root node, branches, internal nodes, and leaf nodes.

In the context of **classification**, it works by:

1. **Recursive Partitioning:** Starting at the root, the algorithm selects the "best" feature to split the data into subsets that are as homogeneous (pure) as possible.
2. **Decision Rules:** Each internal node represents a "test" on an attribute (e.g., "Is age > 30?"), each branch represents the outcome of the test, and each leaf node represents a class label.
3. **Path to Prediction:** To classify a new data point, you follow the path from the root down to a leaf node based on the feature values of that point. The class assigned to that leaf becomes the prediction.
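The three steps above can be sketched with scikit-learn. This is a minimal illustration, assuming the Iris dataset and a depth cap of 2 (both chosen here only to keep the printed rules short). `export_text` prints the learned decision rules, and `decision_path` lists the nodes one sample visits on its way from the root to a leaf.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# A shallow tree keeps the printed rules short (max_depth=2 is an illustrative choice).
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each indented line is a test at an internal node; "class:" lines are leaves.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Follow a single sample from the root down to its leaf.
sample = X[:1]
node_ids = clf.decision_path(sample).indices  # ids of the nodes visited, root first
leaf_id = clf.apply(sample)[0]                # id of the leaf the sample lands in
predicted = clf.predict(sample)[0]

print("Visited nodes:", node_ids.tolist())
print("Leaf:", leaf_id, "| Predicted class:", iris.target_names[predicted])
```

The last node in the visited path is the leaf, and the class stored in that leaf is the prediction.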

---

### Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

**Answer:**
Both Gini Impurity and Entropy are mathematical metrics used to evaluate the quality of a split.

* **Gini Impurity:** The probability that a randomly chosen sample from a node would be mislabeled if it were labeled at random according to the class distribution in that node. For a node with $C$ classes, where $p_i$ is the proportion of samples belonging to class $i$, it is calculated as:

  $$Gini = 1 - \sum_{i=1}^{C} p_i^2$$

* **Entropy:** Originating from information theory, it measures the level of disorder or uncertainty in the data. With the same $p_i$, it is calculated as:

  $$Entropy = -\sum_{i=1}^{C} p_i \log_2 p_i$$

**Impact on Splits:**
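The algorithm evaluates every candidate split by the size-weighted average impurity of the child nodes it would create, and chooses the split with the **lowest** weighted impurity (equivalently, the largest impurity reduction relative to the parent). Gini is slightly cheaper to compute because it involves no logarithms. Entropy is sometimes reported to produce slightly more balanced trees, although in practice the two criteria usually yield very similar splits.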
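As a small worked example, the sketch below implements both measures with NumPy, assuming the standard definitions (Gini as one minus the sum of squared class proportions, Entropy as the negative sum of `p * log2(p)`). The helper names and the toy labels are hypothetical and exist only for illustration. It compares a split that separates the classes perfectly with one that leaves both children as mixed as the parent.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# Toy parent node with two equally frequent classes.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Split A separates the classes perfectly; split B leaves both children mixed.
split_a = (np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1]))
split_b = (np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]))

print("Parent  gini:", gini(parent), "entropy:", entropy(parent))
print("Split A gini:", weighted_impurity(*split_a, gini))
print("Split B gini:", weighted_impurity(*split_b, gini))
```

Split A drives the weighted impurity to zero, so a tree would prefer it. Split B leaves the impurity unchanged from the parent and offers no improvement.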

---

### Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

**Answer:**
Pruning is the process of reducing the size of a tree to prevent **overfitting**.

| Aspect | Pre-Pruning (Early Stopping) | Post-Pruning (e.g., Cost-Complexity Pruning) |
| --- | --- | --- |
| **Method** | Stops the tree from growing once a stopping criterion is met (e.g., maximum depth or minimum samples per leaf). | Grows the full tree first, then removes branches that add little predictive power. |
| **Advantage** | **Efficiency:** Saves time and computation by never building a large, unnecessary tree. | **Accuracy:** Often generalizes better because it evaluates the fully grown tree before deciding which branches are fitting "noise." |
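Both strategies are available in scikit-learn, and the sketch below contrasts them on the Iris data. Pre-pruning is expressed through a growth limit such as `max_depth`, and post-pruning through `ccp_alpha`, with candidate values taken from `cost_complexity_pruning_path`. Picking the middle alpha is an arbitrary choice made here for illustration; in practice it would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth stops as soon as the depth limit is reached.
pre = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Illustrative pick: the middle candidate alpha (normally chosen via cross-validation).
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(
        f"{name:12s} leaves={model.get_n_leaves():2d} "
        f"depth={model.get_depth()} test acc={model.score(X_test, y_test):.2f}"
    )
```

Larger values of `ccp_alpha` prune more aggressively, so the post-pruned tree never has more leaves than the full tree.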

---

### Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

**Answer:**
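**Information Gain** is the reduction in Entropy achieved by splitting a dataset on an attribute: the parent node's entropy minus the size-weighted average entropy of the child nodes, $IG(S, A) = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|} Entropy(S_v)$, where $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$. It is the primary splitting criterion used by algorithms like ID3.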

**Importance:**
It acts as a selection tool. By calculating Information Gain for every candidate feature at a node, the algorithm can determine which feature "organizes" the data most effectively. The feature with the highest Information Gain is chosen as the split for that node, and the process repeats on each resulting child node.
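The selection step can be sketched in a few lines, assuming the usual definition of Information Gain (parent entropy minus the size-weighted entropy of the children). The feature names and toy values below are hypothetical and serve only to show one informative feature next to one uninformative feature.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Parent entropy minus the size-weighted entropy of the children,
    with one child per distinct value of `feature`."""
    n = len(labels)
    children = [labels[feature == value] for value in np.unique(feature)]
    weighted_child_entropy = sum(len(c) / n * entropy(c) for c in children)
    return entropy(labels) - weighted_child_entropy

# Hypothetical toy data: the target and two binary features.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
features = {
    "feature_a": np.array([0, 0, 0, 0, 1, 1, 1, 1]),  # separates the classes perfectly
    "feature_b": np.array([0, 1, 0, 1, 0, 1, 0, 1]),  # carries no information
}

gains = {name: information_gain(values, labels) for name, values in features.items()}
best = max(gains, key=gains.get)

print("Gains:", gains)
print("Feature chosen for the split:", best)
```

Here `feature_a` removes all uncertainty about the label while `feature_b` removes none, so an ID3-style learner would split on `feature_a`.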

---

### Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Answer:**

* **Real-world Applications:** Credit scoring (loan approval), medical diagnosis, customer churn prediction, and fraud detection.
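* **Advantages:**
  * **Interpretability:** They are "white-box" models; humans can easily follow the logic behind each prediction.
  * **Data Prep:** They require very little feature scaling or normalization, because splits depend on the ordering of feature values rather than their magnitude.
  * **Versatility:** They handle both numerical and categorical data (though some implementations, including scikit-learn, require categorical features to be encoded numerically first).
* **Limitations:**
  * **Instability:** Small changes in the data can result in a completely different tree.
  * **Overfitting:** They have a high tendency to grow over-complex trees that don't generalize well.
  * **Bias:** Impurity-based splitting can favor features with many distinct levels or categories.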
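The "Data Prep" advantage listed above can be checked empirically. The sketch below, which assumes the Iris data and `StandardScaler` as the scaling step, trains one tree on raw features and one on standardized features and compares their test predictions. Because standardization preserves the ordering of values within each feature, the two trees are expected to agree on all or nearly all test points.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Tree trained on the raw feature values.
raw_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Tree trained on standardized features (scaler fitted on the training set only).
scaler = StandardScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(random_state=42).fit(
    scaler.transform(X_train), y_train
)

raw_pred = raw_tree.predict(X_test)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))

# Fraction of test points on which the two trees give the same class.
agreement = np.mean(raw_pred == scaled_pred)
print(f"Prediction agreement between raw and scaled trees: {agreement:.2f}")
```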



---

### Question 6: Python Program - Basic Iris Classification

**Answer:**

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data and train the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Report test accuracy and feature importances
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("Feature Importances:", dict(zip(iris.feature_names, clf.feature_importances_)))
```

**Output:**

* **Accuracy:** 1.00
* **Feature Importances:** {'sepal length (cm)': 0.0, 'sepal width (cm)': 0.0, 'petal length (cm)': 0.906, 'petal width (cm)': 0.093}

---

### Question 7: Python Program - Max Depth Comparison

**Answer:**

```python
# Reuses the imports and the Iris train/test split (X_train, X_test, y_train, y_test)
# created in Question 6.

# Fully grown tree
clf_full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Tree restricted to a maximum depth of 3
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

print(f"Fully Grown Accuracy: {clf_full.score(X_test, y_test):.2f}")
print(f"Max Depth 3 Accuracy: {clf_depth3.score(X_test, y_test):.2f}")
```

**Output:**

* **Fully Grown Accuracy:** 1.00
* **Max Depth 3 Accuracy:** 1.00
*(Note: On the simple Iris set, a depth of 3 is often sufficient to reach max accuracy).*

---

### Question 8: Python Program - Regression on California Housing

**Answer:**

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the dataset (note: X, y and the split below replace the Iris variables)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the data and train the regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Report test MSE and the importance of MedInc,
# which is the first column in housing.feature_names
y_pred = reg.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print("Top Feature Importance (MedInc):", reg.feature_importances_[0])
```

**Output:**

* **MSE:** 0.4914
* **Top Feature Importance (MedInc):** 0.525...

---

### Question 9: Python Program - Hyperparameter Tuning

**Answer:**

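```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reload Iris (assumed to be the intended dataset): Question 8 overwrote
# X_train/y_train with housing data, and a classifier cannot fit continuous targets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}

# 5-fold cross-validated grid search over the training set
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
# best_score_ is the mean cross-validated accuracy of the best parameter combination
print(f"Best Accuracy: {grid_search.best_score_:.2f}")
```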

**Output:**

* **Best Parameters:** {'max_depth': 3, 'min_samples_split': 2}
* **Best Accuracy:** 0.96

---

### Question 10: Healthcare Case Study

**Answer:**

**Step-by-Step Process:**

1. **Handle Missing Values:** For numerical data, use median imputation (robust to outliers). For categorical data, use mode imputation or create a "Missing" category.
2. **Encode Categorical Features:** Use **One-Hot Encoding** for nominal data (e.g., Blood Type) or **Ordinal (Label) Encoding** with an explicit category order for ordinal data (e.g., Disease Stage: Mild < Moderate < Severe).
3. **Train Decision Tree:** Split data into 80/20 train/test sets and fit a `DecisionTreeClassifier`.
4. **Tune Hyperparameters:** Use `GridSearchCV` to find the optimal `max_depth` and `min_samples_leaf` to ensure the model isn't just "memorizing" the training patients (overfitting).
5. **Evaluate Performance:** Focus on **Recall** (Sensitivity) and **F1-Score**. In healthcare, missing a sick patient (False Negative) is usually much worse than a false alarm.
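The five steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: the patient data is synthetic, and the column names (`age`, `blood_pressure`, `blood_type`, `stage`) and the rule that generates the label are hypothetical. A scikit-learn `Pipeline` keeps imputation and encoding inside cross-validation, and `scoring="recall"` makes the grid search favor models that miss fewer sick patients.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical patient records (not real data).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "blood_type": rng.choice(["A", "B", "AB", "O"], n),
    "stage": rng.choice(["Mild", "Moderate", "Severe"], n),
})
# Hypothetical label rule, computed before missing values are injected.
y = ((df["age"] > 60) & (df["blood_pressure"] > 120)).astype(int)
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "blood_type"] = np.nan

# Steps 1 and 2: impute missing values, then encode categorical features.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("nom", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["blood_type"]),
    ("ord", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[["Mild", "Moderate", "Severe"]])),
    ]), ["stage"]),
])

# Step 3: 80/20 split (stratified so both sets keep the class balance) and model.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune depth and leaf size, optimizing recall.
param_grid = {"tree__max_depth": [3, 5, None], "tree__min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(model, param_grid, scoring="recall", cv=5)
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = grid.predict(X_test)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Best params:", grid.best_params_)
print(f"Recall: {recall:.2f}  F1: {f1:.2f}")
```

Fitting the imputers and encoders inside the pipeline means they are learned from the training folds only, which avoids leaking information from validation or test patients into preprocessing.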

**Business Value:**
This model provides value through **Early Intervention**. By identifying high-risk patients automatically, the company can allocate medical resources more efficiently, reduce hospitalization costs through preventative care, and ultimately save lives by catching diseases before they become critical.

---

