

## **Decision Tree | Assignment**
---

## **Question 1:**  
What is a Decision Tree, and how does it work in the context of classification?

### **Answer:**
A **Decision Tree** is a supervised machine learning algorithm used for both **classification** and **regression** tasks.  
It works by splitting the dataset into branches based on feature values, forming a tree-like structure where each internal node represents a **decision rule**, each branch represents an **outcome**, and each leaf node represents a **class label** (in classification) or a **value** (in regression).

In classification:
- The algorithm selects the feature that provides the **best split** of the data (using impurity measures such as Gini or Entropy).
- It continues splitting recursively until a stopping criterion (like maximum depth or minimum samples) is met.
- The final prediction for a sample is made based on the **majority class** in the leaf node it falls into.

Decision Trees mimic human decision-making, making them easy to interpret and visualize.
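
To make the node, branch, and leaf structure described above concrete, here is a minimal sketch that fits a shallow tree on the Iris dataset and prints its learned rules as text. The depth limit of 2 and the variable names are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (assumed depth limit of 2) keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each indented line is a decision rule on one feature;
# lines containing "class:" are leaf nodes holding the predicted class
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows the same path a sample takes from the root to a leaf.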

---

## **Question 2:**  
Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

### **Answer:**
**Gini Impurity** and **Entropy** are measures of how mixed the class labels are at a node: the lower the value, the purer the node.

- **Gini Impurity:**
  \[
  Gini = 1 - \sum_{i=1}^{C} p_i^2
  \]
  where \( p_i \) is the proportion of samples in the node that belong to class \( i \), and \( C \) is the number of classes.
  - It measures how often a randomly chosen element would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the node.
  - A **Gini value of 0** means the node is pure (contains only one class).

- **Entropy:**
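  \[
  Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)
  \]
  where \( p_i \) is the proportion of samples in the node that belong to class \( i \), and \( C \) is the number of classes.
  - It measures the amount of disorder or uncertainty in the class labels at the node.
  - An entropy of 0 means the node is pure; the maximum, \( \log_2 C \), occurs when all \( C \) classes are equally likely.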

**Impact on Splits:**  
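Both metrics are used the same way: at each node, the tree evaluates candidate features and thresholds and selects the split that gives the **largest decrease in weighted impurity**. With entropy, this decrease is called *Information Gain*; with Gini, it is usually called the Gini gain or simply the impurity decrease. In practice the two criteria often choose similar splits, and Gini is slightly cheaper to compute because it avoids the logarithm.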
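
The two formulas can be checked by hand with a few lines of plain Python. This is a minimal sketch; the helper names (`class_probabilities`, `gini_impurity`, `entropy`) and the toy label lists are made up for illustration.

```python
from collections import Counter
from math import log2

def class_probabilities(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i))."""
    return sum(-p * log2(p) for p in class_probabilities(labels))

# Hypothetical nodes: one pure, one evenly mixed between two classes
pure_node = ["yes"] * 6
mixed_node = ["yes"] * 3 + ["no"] * 3

print("Pure node :", gini_impurity(pure_node), entropy(pure_node))    # 0.0 0.0
print("Mixed node:", gini_impurity(mixed_node), entropy(mixed_node))  # 0.5 1.0
```

The evenly mixed two-class node reaches the maximum value for both measures (0.5 for Gini, 1.0 for entropy), while the pure node scores 0 on both.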

---

## **Question 3:**  
What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

### **Answer:**
- **Pre-Pruning:**  
  Stops tree growth early, before the tree becomes too complex.  
  Example: Setting `max_depth` or raising `min_samples_split`.  
  - **Advantage:** Reduces training time and memory use, because the unneeded branches are never built, while also limiting overfitting.

- **Post-Pruning:**  
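  Grows the full tree first, then removes branches that add complexity without a matching gain in performance, typically judged on validation data or by cross-validation.  
  Example: Cost-complexity pruning (`ccp_alpha` in sklearn), where a larger `ccp_alpha` prunes more aggressively.  
  - **Advantage:** Pruning decisions are made with the whole tree available, so a split that looks weak on its own but enables useful splits below it is not discarded prematurely.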
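
The two approaches can be compared side by side in scikit-learn. The sketch below is illustrative only: the specific `max_depth`, `min_samples_split`, and the mid-range `ccp_alpha` are arbitrary picks, and in practice `ccp_alpha` would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: fully grown tree with no pruning
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: constrain growth up front (values chosen for illustration)
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute candidate alphas from the fully grown tree,
# then refit with one of them (a mid-range alpha, picked arbitrarily here)
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", model.get_n_leaves(), "test accuracy:", model.score(X_test, y_test))
```

Comparing the leaf counts shows how much each strategy simplifies the tree relative to the unpruned baseline.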

---

## **Question 4:**  
What is Information Gain in Decision Trees, and why is it important for choosing the best split?

### **Answer:**
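**Information Gain (IG)** measures the reduction in entropy after a dataset is split on a particular feature. The same weighted difference computed with Gini instead of entropy gives the corresponding Gini impurity decrease.

\[
IG = Entropy(parent) - \sum_{k} \frac{n_k}{n} \times Entropy(child_k)
\]

where \( n_k \) is the number of samples in child node \( k \) and \( n \) is the number of samples in the parent node.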

- A higher **Information Gain** means the feature provides a better split.
- It lets the decision tree algorithm **choose the most informative feature** at each step. Because this choice is greedy (made one node at a time), it builds a good tree quickly but does not guarantee the globally optimal tree.

Hence, Information Gain guides the tree-building process by identifying splits that **maximize purity and minimize uncertainty**.
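
As a worked example of the formula above, the following sketch computes Information Gain for two candidate splits of the same parent node. The function names and the toy label lists are hypothetical and chosen only to show the arithmetic.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n_total = len(parent)
    weighted_child_entropy = sum(
        len(child) / n_total * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy

# Hypothetical parent node: 5 positive and 5 negative samples
parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes perfectly; split B barely changes the mix
perfect_split = [["yes"] * 5, ["no"] * 5]
weak_split = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

print("IG (perfect split):", information_gain(parent, perfect_split))
print("IG (weak split):   ", round(information_gain(parent, weak_split), 3))
```

The perfect split recovers the full 1 bit of parent entropy, while the weak split gains only about 0.03 bits, so the tree would prefer the first.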

---

## **Question 5:**  
What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

### **Answer:**

**Applications:**
- Medical diagnosis (disease prediction)
- Financial risk analysis and loan approval
- Customer segmentation and churn prediction
- Fraud detection
- Manufacturing defect classification

**Advantages:**
- Easy to understand and visualize
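- Works with both numerical and categorical data (scikit-learn's implementation requires categorical features to be encoded as numbers first)
- Requires little data preprocessing (no feature scaling or normalization needed)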
- Handles non-linear relationships

**Limitations:**
- Prone to overfitting (especially deep trees)
- Small data changes can lead to a completely different tree
- Impurity-based splits and feature importances can be biased toward features with many distinct values

---


## **Question 6:**   

  Write a Python program to: \
● Load the Iris Dataset \
● Train a Decision Tree Classifier using the Gini criterion \
● Print the model’s accuracy and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)
print("Feature Importances:", clf.feature_importances_)

Model Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]


## **Question 7:**  
 Write a Python program to: \
● Load the Iris Dataset \
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

In [2]:
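# Reuses the Iris train/test split (X_train, X_test, y_train, y_test) from Question 6
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score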

# Full tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_pred = full_tree.predict(X_test)
full_acc = accuracy_score(y_test, full_pred)

# Limited depth tree
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_pred = limited_tree.predict(X_test)
limited_acc = accuracy_score(y_test, limited_pred)

print("Accuracy (Full Tree):", full_acc)
print("Accuracy (Max Depth = 3):", limited_acc)

Accuracy (Full Tree): 1.0
Accuracy (Max Depth = 3): 1.0



## **Question 8:**

 Write a Python program to: \
● Load the Boston Housing Dataset \
● Train a Decision Tree Regressor \
● Print the Mean Squared Error (MSE) and feature importances

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

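# Load data
# Note: load_boston was removed in scikit-learn 1.2, so the California Housing
# dataset is used here in place of the Boston Housing dataset.
housing = fetch_california_housing()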
X, y = housing.data, housing.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Evaluate
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Feature Importances:", regressor.feature_importances_)

Mean Squared Error: 0.5280096503174904
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]


## **Question 9:**
Write a Python program to: \
● Load the Iris Dataset \
● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV \
● Print the best parameters and the resulting model accuracy

In [None]:
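from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Reload Iris: X and y were overwritten by the housing data in Question 8
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Candidate values for each hyperparameter
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 5, 10]
}

# 5-fold cross-validated search over every combination in the grid
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)

# Evaluate the refitted best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Model Accuracy:", accuracy_score(y_test, y_pred))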


---

## **Question 10:**  
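Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to handle the missing values, encode the categorical features, train a Decision Tree model, tune its hyperparameters, and evaluate its performance. Then describe the business value the model could provide.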

### **Answer:**

**Step-by-step process:**

1. **Handle Missing Values**
   - Use `SimpleImputer` to fill missing numeric values (mean/median).
   - Fill missing categorical values with the most frequent category.

2. **Encode Categorical Features**
   - Use `OrdinalEncoder` for ordinal features (`LabelEncoder` is intended for the target labels, not for input features).
   - Use `OneHotEncoder` for nominal features to avoid introducing order bias.

3. **Train a Decision Tree Model**
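   - Split the data into training and testing sets. Do this before fitting the imputers and encoders (ideally inside a `Pipeline`) so that no information from the test set leaks into preprocessing.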
   - Train `DecisionTreeClassifier(criterion='gini')`.

4. **Tune Hyperparameters**
   - Use `GridSearchCV` to tune `max_depth`, `min_samples_split`, and `criterion`.
   - Select the model with the best cross-validation score.

5. **Evaluate Model Performance**
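   - Use accuracy, precision, recall, F1-score, and a confusion matrix. For disease prediction, pay particular attention to recall, since a missed case (false negative) is usually costlier than a false alarm.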
   - Evaluate on unseen test data to check generalization.
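
The five steps above can be sketched as a single scikit-learn `Pipeline`. Everything about the data here is an assumption made for illustration: the patient records are synthetic, and the column names (`age`, `blood_pressure`, `smoker`, `has_disease`), the parameter grid, and the F1 scoring choice are hypothetical stand-ins for whatever the real dataset and business requirements call for.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all column names are hypothetical
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["has_disease"] = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to simulate missing data
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

# Steps 1 and 2: impute and encode, fitted on training folds only
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: preprocessing and the tree combined into one estimator
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X = df.drop(columns="has_disease")
y = df["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Keeping the imputers and encoder inside the pipeline means `GridSearchCV` refits them on each training fold, which avoids leaking statistics from validation or test data into preprocessing.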

**Business Value:**
- Enables **early disease detection**, improving patient outcomes.
- Helps **prioritize patients** for further tests.
- Assists doctors in **decision support systems**.
- Reduces healthcare costs through **predictive analytics** and efficient resource allocation.