Question 1:  What is a Decision Tree, and how does it work in the context of classification? 

Ans- A decision tree is a supervised learning algorithm, used for both classification and regression, that models decisions and their possible consequences as a tree-like structure. In classification, it recursively partitions the data into subsets based on feature values: each internal node tests one feature, each branch corresponds to an outcome of that test, and each leaf assigns a class. A new sample is classified by following the tests from the root down to a leaf.
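
To make the recursive partitioning concrete, here is a minimal sketch that fits a shallow tree on the Iris data and prints the learned rules with scikit-learn's `export_text`. The depth limit of 2 and the `random_state` value are arbitrary choices for illustration, not part of the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (depth 2 is an arbitrary choice)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a threshold test on one feature;
# each "class:" line is a leaf that assigns a label
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows the same path a sample takes from the root to a leaf.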

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree? 

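Ans- Gini Impurity and Entropy are both impurity measures that decision tree algorithms use to choose the split at each node. They quantify how mixed a node is with respect to the target variable. With `p_i` denoting the proportion of samples in class `i`, Gini impurity is `1 - sum(p_i^2)` and entropy is `-sum(p_i * log2(p_i))`. Both equal 0 for a pure node and reach their maximum when the classes are evenly mixed. At each node the algorithm evaluates the candidate splits and keeps the one whose child nodes have the lowest weighted impurity, so lower impurity means a more homogeneous node (better split) and higher impurity a more mixed node (worse split).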
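
As a rough illustration, both measures can be computed by hand from the class counts at a node. The helper names `gini` and `entropy` and the example counts below are made up for this sketch.

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; p * log2(1/p) equals -p * log2(p)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

# Class counts at three hypothetical nodes: pure, mostly one class, evenly mixed
for counts in ([10, 0], [8, 2], [5, 5]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```

The pure node scores 0 under both measures, while the evenly mixed two-class node reaches the maximum of 0.5 for Gini and 1 bit for entropy.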

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each. 

Ans- Pre-pruning and post-pruning both reduce overfitting in decision trees by simplifying the tree structure. Pre-pruning stops the tree from growing during construction once certain criteria are met, while post-pruning removes branches from a fully grown tree. A practical advantage of pre-pruning is faster training, because the tree stays small. A practical advantage of post-pruning is robustness: each branch is judged after the full tree is built, so a split that looks weak on its own is not discarded before the splits beneath it have been seen.<br>
Pre-Pruning Example- A pre-pruning technique could stop a node from splitting if the split doesn't improve the model by a certain threshold, or if a resulting leaf node would have fewer than a specified number of samples.<br>
Post-Pruning Example- A post-pruning technique might replace a subtree with a leaf node if the subtree's error rate is higher than a certain threshold or if the subtree contains too few samples. 
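
The two approaches can be sketched in scikit-learn as follows. Pre-pruning is expressed through constructor limits such as `max_depth` and `min_samples_leaf`. For post-pruning, this sketch uses scikit-learn's cost-complexity pruning (`ccp_alpha`), which is a different criterion from the error-rate threshold described above. The specific limits and the mid-range alpha are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Pre-pruning: limits set before training stop the tree from growing further
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then compute candidate pruning strengths
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick a mid-range alpha for illustration only;
# in practice it would be chosen by cross-validation
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: {model.get_n_leaves()} leaves, "
          f"test accuracy {model.score(X_test, y_test):.4f}")
```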

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split? 

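Ans- Information gain measures how much a split on a particular feature reduces uncertainty (entropy) in the data. It is the entropy of the parent node minus the weighted average entropy of the child nodes, where each child is weighted by its share of the samples. It is crucial for selecting the best split because, at each node, the tree builder picks the feature and threshold with the highest information gain, which yields the purest child nodes and tends to produce a more accurate and efficient decision tree.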
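
A small worked example may help. The sketch below computes information gain for two candidate splits of the same node; the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node holding 5 positive and 5 negative samples
parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes perfectly; split B leaves both children mixed
split_a = [["yes"] * 5, ["no"] * 5]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"Gain of split A: {gain_a:.3f}")
print(f"Gain of split B: {gain_b:.3f}")
```

Split A removes all uncertainty (a gain of 1 bit), while split B barely changes the class mix, so a tree builder using information gain would choose split A.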

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations? 

Ans- Decision trees are widely used in various real-world applications due to their simplicity and interpretability. They are valuable in areas like loan approval, medical diagnosis, customer churn prediction, and fraud detection. While decision trees offer advantages like ease of understanding and handling both numerical and categorical data, they also have limitations, including potential overfitting and instability. 

Question 6:   Write a Python program to:
<br>●	Load the Iris Dataset 
<br>●	Train a Decision Tree Classifier using the Gini criterion 
<br>●	Print the model’s accuracy and feature importances 


In [20]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = load_iris()

df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target
X = df.drop('Target', axis=1)
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier(criterion='gini')

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(f'Accuracy Score: {round(accuracy_score(y_test, y_pred), 4)}')

print("Feature Importance....")
for name, importance in zip(X.columns, clf.feature_importances_):
    print(f"{name}: {round(importance * 100, 2)}%")


Accuracy Score: 0.9556
Feature Importance....
sepal length (cm): 2.15%
sepal width (cm): 2.15%
petal length (cm): 57.2%
petal width (cm): 38.51%


Question 7:  Write a Python program to:
<br>●	Load the Iris Dataset 
<br>●	Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree. 


In [27]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load iris dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target

# Split features and target
X = df.drop('Target', axis=1)
y = df['Target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Model 1: Full tree
clf_one = DecisionTreeClassifier()
clf_one.fit(X_train, y_train)
y_pred_one = clf_one.predict(X_test)
clf_one_acc = accuracy_score(y_test, y_pred_one)

# Model 2: Tree with max_depth=3
clf_two = DecisionTreeClassifier(max_depth=3)
clf_two.fit(X_train, y_train)
y_pred_two = clf_two.predict(X_test)
clf_two_acc = accuracy_score(y_test, y_pred_two)

# Results
print(f"Accuracy Score for fully grown tree: {round(clf_one_acc, 4)}")
print(f"Accuracy Score for max_depth=3 tree: {round(clf_two_acc, 4)}")


Accuracy Score for fully grown tree: 0.9556
Accuracy Score for max_depth=3 tree: 0.9556


Question 8: Write a Python program to: 
<br>●	Load the Boston Housing Dataset 
<br>●	Train a Decision Tree Regressor 
<br>●	Print the Mean Squared Error (MSE) and feature importances 


In [29]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

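# Load the California Housing dataset as a stand-in for Boston Housing
# (load_boston was removed in scikit-learn 1.2)
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target

# Split features and target
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']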

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predictions
y_pred = regressor.predict(X_test)

# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Results
print(f"Mean Squared Error: {mse:.4f}")
print("\nFeature Importances:")
for feature, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error: 0.5280

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888


Question 9: Write a Python program to: 
<br>●	Load the Iris Dataset 
<br>●	Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV 
<br>●	Print the best parameters and the resulting model accuracy 


In [30]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Target'] = iris.target

# Split features and target
X = df.drop('Target', axis=1)
y = df['Target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Parameter grid
param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 6],
    'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5
)

grid_search.fit(X_train, y_train)

# Best model
best_dt = grid_search.best_estimator_

# Predictions
y_pred = best_dt.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Results
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Model Accuracy: {accuracy:.4f}")


Best Parameters: {'max_depth': None, 'min_samples_split': 6}
Best Model Accuracy: 1.0000


Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. 
 Explain the step-by-step process you would follow to: 
<br>●	Handle the missing values 
<br>●	Encode the categorical features 
<br>●	Train a Decision Tree model 
<br>●	Tune its hyperparameters 
<br>●	Evaluate its performance 
<br>Also describe what business value this model could provide in a real-world setting. 


Ans- 
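1. **Handle Missing Values**  
   - For numeric features → fill with the mean or median (the median is more robust to outliers).  
   - For categorical features → fill with the most frequent category.  
   - Compute the fill values from the training data only, so nothing leaks from the test set.

2. **Encode Categorical Features**  
   - Convert categories to numbers: one-hot encoding for unordered categories, label (ordinal) encoding for ordered ones.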

3. **Train Decision Tree Model**  
   - Split the dataset into training and testing sets.  
   - Train the Decision Tree using the training set.

4. **Tune Hyperparameters**  
   - Use GridSearchCV or RandomizedSearchCV to find the best  
     `max_depth`, `min_samples_split`, and `min_samples_leaf`.

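5. **Evaluate Performance**  
   - Test the model on the held-out test set using accuracy, precision, recall,  
     F1-score, and ROC-AUC.  
   - For disease prediction, watch recall closely: a missed case (false negative) is usually costlier than a false alarm.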
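
The five steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the patient table is synthetic, and the column names (`age`, `blood_pressure`, `smoker`, `disease`), the rule that generates the label, and the parameter grid are invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (all columns are hypothetical)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(125, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["disease"] = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Blank out some values to mimic missing data
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Steps 1-2: impute and encode; inside a pipeline these are fitted on training folds only
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: preprocessing followed by the decision tree
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: tune the tree's hyperparameters with 5-fold cross-validation
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__min_samples_leaf": [1, 5],
}
search = GridSearchCV(model, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
print(search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Keeping the imputers and the encoder inside the pipeline means `GridSearchCV` refits them on each training fold, which avoids leaking statistics from the validation or test data.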

---

### Business Value

- **Early Detection:** Helps doctors identify high-risk patients sooner.  
- **Efficiency:** Reduces time spent on manual diagnosis.  
- **Cost Savings:** Early treatment is cheaper and more effective.  
- **Improved Outcomes:** Leads to better patient recovery rates.