# *Question 1: What is a Decision Tree, and how does it work in the context of classification?*
- A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In classification, it predicts a class label by learning simple decision rules inferred from the features of the data.


**How it works:**

- Structure: the tree consists of:
  - Root Node: represents the entire dataset and starts the splitting process.
  - Internal Nodes: represent tests or decisions based on feature values.
  - Leaf Nodes: represent class labels or final decisions.
- Splitting: the tree splits the dataset based on feature values using a criterion such as:
  - Gini Index
  - Entropy / Information Gain
  - Chi-Square
- Decision Path: a new data point follows a path from the root node to a leaf node based on the decisions at each node. The leaf node gives the predicted class label.
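
The structure and decision path described above can be made concrete with a small sketch. It assumes scikit-learn is available and uses the Iris data with `max_depth=2` purely to keep the printed rules short; those choices are illustrative, not part of the question.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Text rendering of the learned rules: the root test comes first,
# indented lines are internal nodes, and "class: ..." lines are leaves.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Trace a single sample from the root node to the leaf it lands in.
sample = iris.data[:1]
node_indicator = tree.decision_path(sample)      # sparse matrix of visited nodes
visited_nodes = node_indicator.indices.tolist()  # node ids along the path
leaf_id = int(tree.apply(sample)[0])             # id of the leaf reached
predicted = iris.target_names[tree.predict(sample)[0]]

print("Nodes visited:", visited_nodes)
print("Leaf reached:", leaf_id, "-> predicted class:", predicted)
```

Node `0` is always the root in scikit-learn's tree representation, so the printed path starts there and ends at the leaf that supplies the prediction.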

# *Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?*

📌 What are Impurity Measures?
In a Decision Tree, impurity measures decide how to split the data at each node. They quantify how "pure" or "mixed" a node is, i.e., whether its samples mostly belong to one class or are spread across several.

1️⃣ Gini Impurity
📘 Definition:
Gini Impurity is the probability that a randomly chosen sample from the node would be misclassified if it were labeled at random according to the node's class distribution.

🔢 Formula:
$$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$$

Where:

- $p_i$ = probability of class $i$ at a node
- $n$ = total number of classes

🔍 Interpretation:
Gini = 0 → Node is pure (only one class)

Gini increases as classes are more mixed
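
As a quick check of the formula and its interpretation, here is a minimal sketch of the Gini computation in plain Python. The function name `gini_impurity` and the toy label lists are illustrative assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2), where p_i is the share of class i in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention for an empty node
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


pure_node = gini_impurity(["a", "a", "a", "a"])     # only one class
binary_mixed = gini_impurity(["a", "a", "b", "b"])  # two classes, evenly mixed
three_mixed = gini_impurity(["a", "b", "c"])        # three classes, evenly mixed

print(pure_node)     # 0.0
print(binary_mixed)  # 0.5
print(three_mixed)
```

For $n$ equally likely classes the value is $1 - 1/n$, so the maximum is 0.5 in the binary case and approaches 1 as the number of classes grows.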

2️⃣ Entropy (Information Gain)
📘 Definition:
Entropy measures the amount of uncertainty or disorder in the data. It comes from information theory.

🔢 Formula:
$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

Where:

- $p_i$ = probability of class $i$

🔍 Interpretation:
Entropy = 0 → Pure node

Entropy = 1 (max) → Completely mixed classes (in binary classification)
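
The same kind of check works for entropy. This is a minimal sketch in plain Python; the function name `entropy` and the toy labels are illustrative. It uses the equivalent form $\sum_i p_i \log_2(1/p_i)$ so that a pure node prints as `0.0`.

```python
import math
from collections import Counter


def entropy(labels):
    """Return -sum(p_i * log2(p_i)), where p_i is the share of class i."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention for an empty node
    counts = Counter(labels)
    # p * log2(1 / p) is the same quantity as -p * log2(p).
    return sum(
        (count / total) * math.log2(total / count) for count in counts.values()
    )


pure_node = entropy(["a", "a", "a", "a"])     # only one class
binary_mixed = entropy(["a", "a", "b", "b"])  # two classes, evenly mixed
three_mixed = entropy(["a", "b", "c"])        # three classes, evenly mixed

print(pure_node)     # 0.0
print(binary_mixed)  # 1.0
print(three_mixed)
```

With more than two classes the maximum is $\log_2(n)$ rather than 1, which is why the "Entropy = 1 (max)" statement is specific to binary classification.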

🔁 Information Gain:
When splitting a node, the Information Gain is calculated as:

$$\text{Information Gain} = \text{Entropy(parent)} - \text{Weighted Entropy(children)}$$

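📊 Impact on Decision Tree Splits

| Measure       | Used For                                   | Objective                      |
| ------------- | ------------------------------------------ | ------------------------------ |
| Gini Impurity | CART (Classification and Regression Trees) | Minimizes impurity after split |
| Entropy       | ID3, C4.5 algorithms                       | Maximizes information gain     |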

# Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.


| Feature            | Pre-Pruning                      | Post-Pruning                         |
| ------------------ | -------------------------------- | ------------------------------------ |
| Timing             | During tree building             | After tree is fully grown            |
| Goal               | Avoid unnecessary splits         | Remove overfitted branches           |
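| Control Parameters | `max_depth`, `min_samples_split` | Validation performance, pruning cost (e.g. `ccp_alpha` in scikit-learn) |
| Speed              | Faster: pruned branches are never built | Slightly slower: the full tree is grown first |
| Accuracy           | Might underfit if growth stops too early | Often better generalization, since branches are judged after the tree is grown |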
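
The two approaches in the table can be sketched side by side with scikit-learn. This assumes the Iris data and the same 80/20 split used elsewhere in this notebook. Picking the middle value of the pruning path as `ccp_alpha` is an arbitrary choice for illustration; in practice it would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: no pruning at all.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: growth is stopped early by constraints set before fitting.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path of the full tree,
# then refit with one of the candidate alphas (middle value, for illustration).
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(
    X_train, y_train
)

for name, model in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(
        f"{name:12s} nodes={model.tree_.node_count:3d} "
        f"depth={model.get_depth()} "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```

Larger `ccp_alpha` values remove more of the tree, so the post-pruned model never has more nodes than the unpruned one.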


# *Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?*
- Information Gain is a metric used to measure the effectiveness of an attribute (feature) in classifying the training data. It quantifies the reduction in entropy (uncertainty or impurity) achieved after a dataset is split based on an attribute.

🔢 Formula:
$$\text{Information Gain} = \text{Entropy(parent)} - \text{Weighted Entropy(children)}$$

Where:

Entropy measures the disorder/impurity in the dataset.

Weighted Entropy = sum of the entropy of each child node, weighted by the proportion of instances.

📊 Why is Information Gain Important?
At each node, the decision tree needs to choose which feature to split on.

Information Gain helps identify the feature that gives the highest reduction in impurity.

A higher Information Gain means the feature does a better job at classifying the data.
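
A small worked example shows how the gain ranks candidate splits. This is a sketch in plain Python; the helper names and the toy "yes"/"no" labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Parent node: 5 "yes" and 5 "no", so its entropy is 1 bit.
parent = ["yes"] * 5 + ["no"] * 5

# Three hypothetical splits of the same parent node.
perfect_split = [["yes"] * 5, ["no"] * 5]
partial_split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
useless_split = [["yes"] * 2 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]

gain_perfect = information_gain(parent, perfect_split)
gain_partial = information_gain(parent, partial_split)
gain_useless = information_gain(parent, useless_split)

print(f"perfect split: {gain_perfect:.3f}")  # 1.000
print(f"partial split: {gain_partial:.3f}")  # 0.278
print(f"useless split: {gain_useless:.3f}")  # 0.000
```

The tree would choose the first split: it removes all uncertainty, while the last one leaves both children exactly as mixed as the parent.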

## Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?


✅ Advantages of Decision Trees:

| Feature                 | Benefit                                                            |
| ----------------------- | ------------------------------------------------------------------ |
| Easy to Understand      | Visual and logical flow; even non-experts can interpret it         |
| Handles Both Data Types | Works with both categorical and numerical data                     |
| Requires Little Prep    | No need for feature scaling or normalization                       |
| Fast Prediction         | Once built, predictions are quick and require minimal computation  |
| Feature Importance      | Helps in identifying the most influential variables                |

⚠️ Limitations of Decision Trees:

| Limitation                   | Impact                                                                        |
| ---------------------------- | ----------------------------------------------------------------------------- |
| Overfitting                  | Deep trees may fit training data too closely and perform poorly on new data   |
| Instability                  | Small data changes can lead to a completely different tree structure          |
| Bias toward dominant classes | Imbalanced datasets can affect performance                                    |
| Greedy nature                | Makes locally optimal decisions that may not be globally optimal              |
| Less accurate alone          | Often outperformed by ensemble methods like Random Forest or XGBoost          |
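
The overfitting limitation is easy to observe directly. The sketch below assumes synthetic data from `make_classification` with about 10% of labels flipped (`flip_y=0.1`); the dataset sizes and the `max_depth=4` comparison are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data: flip_y randomly reassigns about 10% of the labels.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until every training leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A depth-limited tree cannot memorize the noise to the same degree.
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

for name, model in (("unrestricted", deep), ("max_depth=4", shallow)):
    print(
        f"{name:13s} depth={model.get_depth():2d} "
        f"train={model.score(X_train, y_train):.3f} "
        f"test={model.score(X_test, y_test):.3f}"
    )
```

The gap between training and test accuracy for the unrestricted tree is the symptom of overfitting; how the depth-limited tree compares on the test set depends on the data.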

1. Iris Dataset

In [1]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target


In [3]:
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

2. California Housing Dataset (used in place of Boston Housing, which was removed from scikit-learn in version 1.2)

In [2]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X = housing.data
y = housing.target


In [7]:
X

array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]])

In [8]:
y

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

In [9]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on test data
y_pred = clf.predict(X_test)

# Calculate and print model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Display feature importances
feature_importances = pd.Series(clf.feature_importances_, index=feature_names)
print("\nFeature Importances:")
print(feature_importances.sort_values(ascending=False))


Model Accuracy: 1.00

Feature Importances:
petal length (cm)    0.906143
petal width (cm)     0.077186
sepal width (cm)     0.016670
sepal length (cm)    0.000000
dtype: float64
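
An accuracy of 1.00 on a 30-sample test set is a fairly coarse estimate. As a complementary check, the sketch below runs 5-fold stratified cross-validation over the whole Iris dataset with the same Gini-based classifier; the choice of five shuffled folds is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=42)

# Stratified folds keep the three classes balanced in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Every sample is used for testing exactly once, so the mean is less dependent on one lucky or unlucky split.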


# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


In [10]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model 1: Decision Tree with max_depth = 3 (shallow tree)
clf_shallow = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_shallow.fit(X_train, y_train)
y_pred_shallow = clf_shallow.predict(X_test)
acc_shallow = accuracy_score(y_test, y_pred_shallow)

# Model 2: Fully grown Decision Tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

# Print comparison
print(f"Shallow Tree Accuracy (max_depth=3): {acc_shallow:.2f}")
print(f"Fully Grown Tree Accuracy           : {acc_full:.2f}")


Shallow Tree Accuracy (max_depth=3): 1.00
Fully Grown Tree Accuracy           : 1.00
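
Since both trees reach the same test accuracy here, it can help to look at how they differ in size and in training accuracy. The sketch below repeats the same split and reports depth and leaf count through `get_depth()` and `get_n_leaves()`; the dictionary of models is just a convenience for the loop.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(
        f"{name:12s} depth={model.get_depth()} leaves={model.get_n_leaves():2d} "
        f"train={model.score(X_train, y_train):.3f} "
        f"test={model.score(X_test, y_test):.3f}"
    )
```

When a smaller tree matches the larger one on held-out data, the smaller tree is usually the easier one to interpret and explain.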


# Question 8: Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

In [11]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load the California Housing dataset, used here in place of Boston Housing
# (load_boston was removed from scikit-learn in version 1.2)
data = fetch_california_housing()
X = data.data
y = data.target
feature_names = data.feature_names

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on test data
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Display feature importances
importances = pd.Series(regressor.feature_importances_, index=feature_names)
print("\nFeature Importances:")
print(importances.sort_values(ascending=False))


Mean Squared Error (MSE): 0.50

Feature Importances:
MedInc        0.528509
AveOccup      0.130838
Latitude      0.093717
Longitude     0.082902
AveRooms      0.052975
HouseAge      0.051884
Population    0.030516
AveBedrms     0.028660
dtype: float64
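
MSE is expressed in squared target units, so it is often easier to read alongside RMSE and R². The California Housing target is in units of $100,000, so the MSE of 0.50 reported above corresponds to an RMSE of about 0.71, or roughly $71,000. The sketch below shows the conversion. It uses scikit-learn's bundled diabetes dataset as a stand-in so that it runs without downloading data; that swap and the `max_depth=3` comparison are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset that ships with scikit-learn (no download).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for label, depth in (("fully grown", None), ("max_depth=3", 3)):
    reg = DecisionTreeRegressor(max_depth=depth, random_state=42)
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    results[label] = {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),      # back in the target's own units
        "r2": r2_score(y_test, y_pred),   # 1.0 would be a perfect fit
    }
    print(
        f"{label:12s} MSE={mse:9.2f} "
        f"RMSE={results[label]['rmse']:6.2f} R2={results[label]['r2']:.3f}"
    )
```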


# Question 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# ● Print the best parameters and the resulting model accuracy

In [12]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Define the parameter grid to tune
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate on test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Best Parameters: {accuracy:.2f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.00
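
Beyond `best_params_`, the fitted search object keeps the cross-validated score of every combination. The sketch below repeats the same grid and prints the top rows of `cv_results_`; converting the results to a pandas DataFrame is a convenience choice, not a requirement.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5],
}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# Mean accuracy of the winning combination across the 5 validation folds.
print(f"Best cross-validated accuracy: {grid_search.best_score_:.3f}")

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(grid_search.cv_results_)
columns = [
    "param_max_depth",
    "param_min_samples_split",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
print(results[columns].sort_values("rank_test_score").head())
```

Several combinations often tie or differ by less than one standard deviation on a dataset this small, which is worth knowing before reading much into the single "best" setting.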


# Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
# Explain the step-by-step process you would follow to:

# ● Handle the missing values

# ● Encode the categorical features

# ● Train a Decision Tree model

# ● Tune its hyperparameters

# ● Evaluate its performance

# And describe what business value this model could provide in the real-world setting




- STEP-BY-STEP PROCESS

1️⃣ **Handle Missing Values**

🔍 Explore missing data:

df.isnull().sum()

✅ Handling strategies:

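**Numerical features**: impute using the median (robust to outliers) or the mean:

from sklearn.impute import SimpleImputer

# numerical_cols and categorical_cols are lists of column names in df
imputer_num = SimpleImputer(strategy='median')
df[numerical_cols] = imputer_num.fit_transform(df[numerical_cols])

**Categorical features**: impute using the most frequent value:

imputer_cat = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = imputer_cat.fit_transform(df[categorical_cols])

To avoid data leakage, fit the imputers on the training split only and reuse them to transform the test split.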

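2️⃣ **Encode Categorical Features**

✅ Use One-Hot Encoding for nominal features (e.g., gender, region):

import pandas as pd

# nominal_cols: the categorical columns that have no natural order
df = pd.get_dummies(df, columns=nominal_cols)

✅ Use ordinal encoding if the feature is ordered (e.g., mild < moderate < severe). LabelEncoder is intended for target labels and assigns codes alphabetically, so it is safer to state the order explicitly with OrdinalEncoder:

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[['mild', 'moderate', 'severe']])
df[['severity']] = oe.fit_transform(df[['severity']])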

3️⃣ **Train the Decision Tree Model**

✅ Split the data:

from sklearn.model_selection import train_test_split

# 'disease' is a hypothetical name for the target column
X = df.drop(columns=['disease'])
y = df['disease']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

✅ Train the model:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

4️⃣ **Tune Hyperparameters**

✅ Use GridSearchCV to find the best parameters:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
best_model = grid.best_estimator_

5️⃣ **Evaluate Model Performance**

✅ Use common metrics. For disease prediction, accuracy alone can mislead on imbalanced data, so check recall and precision for the positive class as well:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

📈 Visualize the decision tree:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plot_tree(best_model, filled=True, feature_names=list(X.columns), class_names=['No Disease', 'Disease'])
plt.show()
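
The five steps can also be combined into a single scikit-learn `Pipeline`, so that imputation and encoding are fit only on the training folds during tuning. The sketch below is one way to do that under stated assumptions. The patient data is synthetic, the column names (`age`, `blood_pressure`, `region`, `severity`) are hypothetical, and scoring on recall assumes that a missed diagnosis is the costlier error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data; names and values are illustrative only.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "region": pd.Series(rng.choice(["north", "south", "east"], n), dtype=object),
    "severity": pd.Series(rng.choice(["mild", "moderate", "severe"], n), dtype=object),
})
# Made-up rule for the target, computed before values are removed.
y = ((df["age"] > 60) & (df["severity"] != "mild")).astype(int)
# Blank out roughly 10% of each feature to simulate missing values.
for col in df.columns:
    df.loc[rng.random(n) < 0.1, col] = np.nan

numeric_cols = ["age", "blood_pressure"]
nominal_cols = ["region"]
ordinal_cols = ["severity"]

# Steps 1 and 2: impute, then encode, separately per column type.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("nom", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), nominal_cols),
    ("ord", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[["mild", "moderate", "severe"]])),
    ]), ordinal_cols),
])

# Step 3: preprocessing and the tree form one estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune the tree; the "tree__" prefix targets that pipeline step.
param_grid = {
    "tree__max_depth": [3, 5, 10, None],
    "tree__min_samples_split": [2, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = grid.predict(X_test)
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, y_pred, target_names=["No Disease", "Disease"]))
```

Keeping preprocessing inside the pipeline means each cross-validation fold learns its own imputation values, which avoids leaking information from the validation fold into training.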
