### DECISION TREE

Question 1: What is a Decision Tree, and how does it work in the context of
classification?

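Answer=> A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It works by recursively splitting the dataset into smaller subsets based on feature values, forming a tree-like structure of decisions. In classification, each internal node tests one feature (for example, a threshold on a numeric feature), each branch corresponds to an outcome of that test, and each leaf assigns a class label, typically the majority class of the training samples that reach it. A new sample is classified by following the tests from the root down to a leaf.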
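To make the root-to-leaf idea concrete, here is a minimal hand-written sketch of what a small classification tree amounts to once it is trained. The function name `classify_flower` and both thresholds are hypothetical, chosen only for illustration; a real tree learns its own tests from the data.

```python
def classify_flower(petal_length_cm, petal_width_cm):
    """Hand-written stand-in for a tiny trained tree (hypothetical thresholds)."""
    # Root node: the first test separates one class from the rest.
    if petal_length_cm < 2.5:
        return "setosa"          # leaf
    # Internal node: a second test splits the remaining samples.
    if petal_width_cm < 1.8:
        return "versicolor"      # leaf
    # Reached only when both tests above fail.
    return "virginica"           # leaf


for sample in [(1.4, 0.2), (4.5, 1.5), (6.0, 2.2)]:
    print(sample, "->", classify_flower(*sample))
```

Each `if` plays the role of an internal node, and each `return` is a leaf.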



Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Answer=> Gini Impurity measures how often a randomly chosen element would be incorrectly classified if it were labeled at random according to the class distribution in the node. It ranges from 0 (pure node) to 0.5 for binary classification; in general the maximum is 1 - 1/K for K classes. A lower Gini value indicates a purer node.

Entropy measures the level of randomness or disorder in a node's class labels. It is 0 for a pure node and 1 bit for an evenly split binary node; higher entropy means more uncertainty. Information Gain is calculated using entropy.

Both metrics guide the tree to create splits that result in child nodes with higher purity. The split that minimizes the weighted impurity of the children (Gini) or maximizes Information Gain (Entropy) is selected.
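As a rough illustration, both measures can be computed directly from a node's labels using the standard definitions: Gini = 1 - sum of squared class proportions, and Entropy = -sum of p * log2(p). The helper names below are my own and are not part of scikit-learn.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of samples belonging to each class in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    return sum(-p * log2(p) for p in class_proportions(labels))


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]

print("pure :", gini(pure_node), entropy(pure_node))
print("mixed:", gini(mixed_node), entropy(mixed_node))
```

A pure node scores 0 on both measures, while the evenly mixed binary node reaches the binary maximum of each (0.5 for Gini, 1 bit for entropy).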


Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Answer=> Pre-Pruning stops the tree from growing further during training by applying constraints such as a maximum depth or a minimum number of samples per split.

Advantage: Reduces overfitting and training time.

Post-Pruning allows the tree to grow fully and then removes unnecessary branches after training.

Advantage: Often achieves better generalization by simplifying the tree based on validation data.
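The two approaches might look like this in scikit-learn. This is a sketch under assumptions: post-pruning is done here with cost-complexity pruning (`ccp_alpha`), which is the mechanism scikit-learn provides, and the alpha is simply taken from the middle of the pruning path for illustration. In practice it would be chosen on validation data or by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: limit growth up front with depth and split-size constraints.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: a mid-range alpha from the pruning path.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned),
                   ("post-pruned", post_pruned)]:
    print(name, "nodes:", tree.tree_.node_count,
          "test accuracy:", tree.score(X_test, y_test))
```

Comparing node counts shows how much each strategy simplifies the tree relative to the fully grown one.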



Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?


Answer=> Information Gain measures the reduction in entropy after splitting a dataset on a particular feature: the entropy of the parent node minus the size-weighted average entropy of the child nodes. It tells us how much information a feature provides about the target class.
It is important because the tree is built greedily. At each node, the split with the highest Information Gain produces the most homogeneous child nodes, which tends to yield a smaller tree that separates the classes well.
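A small worked sketch of the calculation follows, using the standard definition (parent entropy minus the size-weighted entropy of the children). The function names and the toy labels are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
clean_split = [["yes", "yes"], ["no", "no"]]      # children are pure
useless_split = [["yes", "no"], ["yes", "no"]]    # children mirror the parent

print("clean split gain  :", information_gain(parent, clean_split))
print("useless split gain:", information_gain(parent, useless_split))
```

The clean split recovers the full 1 bit of parent entropy, while the useless split gains nothing, so a tree using this criterion would choose the first.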


Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Answer=>
Applications:

● Medical diagnosis
● Credit risk analysis
● Fraud detection
● Customer churn prediction
● Recommendation systems

Advantages:

● Easy to interpret and visualize
● Works with numerical and categorical data
● Requires little data preprocessing

Limitations:

● Prone to overfitting
● Sensitive to small data changes
● Can create biased trees if data is imbalanced




In [None]:
# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

# Answer=>

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

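# Load the data and hold out 20% of it for testing
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fix random_state so ties between equally good splits are broken the same way on every run
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Feature Importances:")
# Pair each importance with its feature name for readability
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"  {name}: {importance:.3f}")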

In [None]:
# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

# Answer=>
# Reuses the imports and the Iris train/test split from the Question 6 cell

# Depth-limited tree (pre-pruned)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_tree.fit(X_train, y_train)
shallow_acc = accuracy_score(y_test, shallow_tree.predict(X_test))

# Fully grown tree (no depth limit)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

print("Shallow Tree Accuracy:", shallow_acc)
print("Full Tree Accuracy:", full_acc)

In [None]:
# Question 8: Write a Python program to:
# ● Load the California Housing dataset from sklearn
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

# Answer=>

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fix random_state so the fitted tree is reproducible across runs
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", reg.feature_importances_)


In [None]:
# Question 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# ● Print the best parameters and the resulting model accuracy


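# Answer=>
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Reload Iris: the Question 8 cell overwrote X_train/y_train with the housing data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

params = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# 5-fold cross-validated search over every parameter combination
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=5)
grid.fit(X_train, y_train)

# best_estimator_ is refit on the full training set with the best parameters
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))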


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

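Answer=>
Handle Missing Values:

● Numerical: mean or median imputation (the median is more robust to outliers)
● Categorical: most frequent value, or a separate "missing" category
● Fit the imputers on the training split only, so no test-set statistics leak into training

Encode Categorical Features:

● One-Hot Encoding for nominal features
● Ordinal (label-style) encoding for features with a natural order

Train Decision Tree Model:

● Split data into training and testing sets, stratified on the target if the classes are imbalanced
● Train using an appropriate criterion (Gini or Entropy)

Hyperparameter Tuning:

● Use GridSearchCV to tune max_depth, min_samples_split, etc.

Evaluate Performance:

● Metrics: Accuracy, Precision, Recall, F1-score, ROC-AUC
● When a missed diagnosis is costlier than a false alarm, weight Recall more heavily than Accuracy

Business Value:

● Early disease detection
● Reduced diagnostic costs
● Improved patient outcomes
● Data-driven clinical decision support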
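The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the records are synthetic, and the column names (`age`, `blood_pressure`, `smoker`, `region`) are hypothetical stand-ins for real patient fields. The choices of `roc_auc` as the tuning metric and `class_weight="balanced"` are also illustrative rather than prescribed by the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Synthetic label driven by age and smoking, so there is signal to learn.
risk = (df["age"] > 60).astype(int) + (df["smoker"] == "yes").astype(int)
y = ((risk + rng.random(n)) > 1.2).astype(int)
# Blank out roughly 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "region"]

# Imputation and encoding live inside the pipeline, so they are fit
# on training folds only and never see held-out data.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Tune the tree's hyperparameters with 5-fold cross-validation.
grid = GridSearchCV(
    model,
    {"tree__max_depth": [3, 5, None], "tree__min_samples_split": [2, 10, 25]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

# Evaluate once on the held-out test set.
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
print("ROC-AUC:", auc)
```

Keeping preprocessing inside the `Pipeline` means the grid search re-fits the imputers and encoder within each cross-validation fold, which avoids leaking information from validation data into training.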