Question 1: What is a Decision Tree, and how does it work in classification?
A Decision Tree is a supervised machine learning algorithm used for both classification and regression.
It represents decisions as a tree consisting of a single root node, internal decision nodes, and leaf
nodes. In classification, the training data is split recursively on feature values to separate the
classes. At each node the algorithm picks the feature and threshold that produce the purest child
subsets, as measured by an impurity criterion such as Gini or entropy. A new data point follows the
decision rules from the root down to a leaf, and the majority class of that leaf is the prediction.
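
The root-to-leaf traversal described above can be sketched in plain Python. The tree below is built by hand purely for illustration: its feature names, thresholds, and class labels are assumptions, not values learned from data.

```python
# A hand-built tree: internal nodes test one feature against a threshold,
# leaf nodes carry a class label. All values here are illustrative.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},              # petal_length <= 2.5
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},      # petal_width <= 1.7
        "right": {"label": "virginica"},      # petal_width > 1.7
    },
}

def predict(node, sample):
    """Follow the decision rules from the root until a leaf is reached."""
    while "label" not in node:
        goes_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]

print(predict(tree, {"petal_length": 4.8, "petal_width": 1.5}))
```

Here the sample fails the root test (4.8 > 2.5), goes right, passes the second test (1.5 <= 1.7), and lands in the "versicolor" leaf.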

----------------------------------------------------------------------------
Question 2: Gini Impurity and Entropy
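Gini Impurity is the probability that a randomly selected sample from a node would be misclassified
if it were labeled at random according to the node's class distribution: Gini = 1 - sum(p_k^2),
where p_k is the proportion of class k in the node. Lower Gini means higher purity. Entropy measures
the disorder in a node: Entropy = -sum(p_k * log2(p_k)). Higher entropy indicates more mixed
classes. Both are 0 for a pure node, and both are used to select splits that reduce impurity. The
best split produces the most homogeneous child nodes.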
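
As a minimal sketch of the two measures above, the functions below compute Gini impurity and entropy (in bits) from a list of class labels. The helper names and the toy label lists are illustrative assumptions.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Fraction of samples belonging to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k))."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

pure = ["a", "a", "a", "a"]    # one class only
mixed = ["a", "a", "b", "b"]   # 50/50 split between two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For the pure node both measures are 0. For the evenly mixed two-class node, Gini reaches 0.5 and entropy reaches 1 bit, the maximum for two classes.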

---------------------------------------------------------------------------
Question 3: Pre-Pruning vs Post-Pruning
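Pre-Pruning stops tree growth early using conditions such as a maximum depth or a minimum number of
samples per split or leaf. Advantage: it limits overfitting from the start and saves computation.
Post-Pruning grows the full tree first and then removes branches that contribute little to predictive
performance. Advantage: it can keep useful deep splits that an early stopping rule would have cut
off, which often generalizes better.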
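
One way to see the difference in scikit-learn is sketched below. Pre-pruning is expressed through growth limits such as max_depth and min_samples_leaf, while cost-complexity pruning (ccp_alpha) is one form of post-pruning. The specific values (max_depth=3, min_samples_leaf=5, ccp_alpha=0.02) are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unpruned baseline: grows until the leaves are pure
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: growth stops as soon as a limit is reached
pre = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then collapse subtrees whose impurity
# reduction does not justify their size (cost-complexity pruning)
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "nodes:", model.tree_.node_count,
          "test accuracy:", model.score(X_test, y_test))
```

In practice ccp_alpha would be chosen by cross-validation rather than fixed by hand.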

---------------------------------------------------------------------------
Question 4: Information Gain
Information Gain measures the reduction in entropy achieved by a split: the entropy of the parent
node minus the size-weighted average entropy of the child nodes. It helps select the most
informative feature for splitting. Higher information gain means better class separation.
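
The calculation can be sketched as follows: information gain is the parent's entropy minus the size-weighted entropy of the children. The function names and the toy "yes"/"no" labels are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split that leaves both children almost as mixed as the parent
poor_split = [["yes", "yes", "yes", "no", "no"],
              ["yes", "yes", "no", "no", "no"]]

print("perfect split gain:", information_gain(parent, perfect_split))
print("poor split gain   :", information_gain(parent, poor_split))
```

The perfect split removes all of the parent's 1 bit of entropy, while the poor split removes only a small fraction of it, so a tree builder using this criterion would prefer the first.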

----------------------------------------------------------------------------
Question 5: Applications, Advantages and Limitations
Applications include medical diagnosis, credit scoring, fraud detection, and customer churn
prediction.
Advantages:
• Easy to interpret
• Handles categorical and numerical data
• Minimal preprocessing
Limitations:
• Can overfit
• Sensitive to small data changes
• Often less accurate than ensemble models

---------------------------------------------------------------------------
Question 6: Iris Dataset - Gini Classifier
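from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris data and hold out 30% of it for testing
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a tree that splits on Gini impurity; random_state makes the fit reproducible
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Report test accuracy and the importance of each feature
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)
print("Feature Importances:")
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"  {name}: {importance:.3f}")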

-------------------------------------------------------------------------
Question 7: Compare max_depth=3 vs Fully Grown Tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully grown tree: no depth limit, so it splits until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
print("Fully grown accuracy:", full_tree.score(X_test, y_test))

# Depth-limited tree: growth stops after three levels of splits
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
print("Max depth=3 accuracy:", limited_tree.score(X_test, y_test))

-------------------------------------------------------------------------
Question 8: Boston Housing Decision Tree Regressor
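import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# load_boston was removed in scikit-learn 1.2. This reads the original data
# file instead, following the workaround from scikit-learn's deprecation
# notice (each record spans two lines in that file).
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the regressor; random_state makes the result reproducible
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out data
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)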

---------------------------------------------------------------------------
Question 9: GridSearchCV Tuning
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate hyperparameter values; every combination is evaluated
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 5, 10]
}

# 5-fold cross-validation on the training set only
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
# grid.score evaluates the refit best estimator on the held-out test set
print("Test Accuracy:", grid.score(X_test, y_test))

---------------------------------------------------------------------------
Question 10: Healthcare Disease Prediction Workflow
Steps:
1. Handle missing values using mean, median, or mode imputation.
2. Encode categorical variables using label or one-hot encoding.
3. Train a decision tree using training data.
4. Tune hyperparameters using grid search and cross-validation.
5. Evaluate using accuracy, precision, recall, and ROC-AUC.
Business Value:
• Early disease detection
• Better treatment decisions
• Reduced healthcare costs
• Improved patient outcomes
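
The five steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not a clinical pipeline: the patient records are synthetic, and the column names (age, blood_pressure, smoker, disease), the labeling rule, and the parameter grid are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns, random values)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
rule = (df["age"] > 55) | (df["smoker"] == "yes")   # invented labeling rule
noise = rng.random(n) < 0.1                         # flip about 10% of labels
df["disease"] = (rule ^ noise).astype(int)
df.loc[rng.choice(n, 30, replace=False), "blood_pressure"] = np.nan  # missing values

X, y = df.drop(columns="disease"), df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Steps 1 and 2: impute missing values, then one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: decision tree trained on the preprocessed training data
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: grid search with 5-fold cross-validation
param_grid = {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)

# Step 5: accuracy, precision, and recall from the report, plus ROC-AUC
y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]
print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))
```

Keeping imputation and encoding inside the pipeline means they are refit on each cross-validation fold, so no information from the validation folds leaks into preprocessing.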