#**1**: What is a Decision Tree, and how does it work in the context of classification?
- A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. It recursively splits the data into subsets based on the values of input features, forming a tree-like structure of decision nodes and leaf nodes. To classify a sample, the tree is followed from the root, testing one feature condition at each node, until a leaf is reached; the leaf's class label is the prediction.

- Root Node: The top node, which represents the entire dataset.

- Decision Nodes: Points where the data is split based on a feature condition.

- Leaf Nodes (Terminal Nodes): Represent the final output or class label.
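
To make the root, decision, and leaf nodes concrete, here is a minimal sketch that fits a very shallow tree on Iris and prints it as text. The depth limit of 2 and the variable names are choices made only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2 is an arbitrary choice) keeps the printout readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# The first condition printed is the root split, indented conditions are
# decision nodes, and lines starting with "class:" are leaf nodes
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom mirrors how a single sample is classified: one feature test per level until a `class:` line is reached.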

#**2**: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

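Gini Impurity and Entropy both measure how mixed the classes are in a node. With p_i the proportion of class i in the node, Gini = 1 - sum(p_i^2) and Entropy = -sum(p_i * log2(p_i)). Both are 0 for a pure node and largest when the classes are evenly mixed.

- The Decision Tree algorithm evaluates candidate splits and chooses the one that minimizes the weighted impurity of the resulting child nodes.

- Lower impurity → purer child nodes → a more informative split.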
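
As a rough illustration of the two measures, the sketch below computes Gini impurity and entropy from a list of class labels using only the standard library. The helper names (`class_proportions`, `gini_impurity`, `entropy`) are made up for this example and are not scikit-learn functions.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Return the share of each class among the labels in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy = sum(p_i * log2(1 / p_i)), equivalent to -sum(p_i * log2(p_i))."""
    return sum(p * math.log2(1.0 / p) for p in class_proportions(labels))

pure_node = ["a", "a", "a", "a"]    # one class only
mixed_node = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(gini_impurity(pure_node), entropy(pure_node))    # 0.0 0.0
print(gini_impurity(mixed_node), entropy(mixed_node))  # 0.5 1.0
```

Both measures agree on the ordering here: the pure node scores 0 and the evenly mixed node scores the maximum for two classes.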

#**3**:  What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

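- Pre-Pruning stops the tree from growing during training by setting limits such as max_depth or min_samples_split. Practical advantage: training is faster and cheaper because the full tree is never built.

- Post-Pruning grows the full tree first and then removes branches that add little predictive value (for example, cost-complexity pruning). Practical advantage: it often generalizes better, because pruning decisions are made with the whole tree in view instead of stopping early.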
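
The sketch below contrasts the two approaches in scikit-learn, assuming the Iris data as a stand-in. Pre-pruning is expressed through growth limits, and post-pruning through cost-complexity pruning (`ccp_alpha`). The specific limits and the way `alpha` is picked (the middle candidate) are arbitrary choices for illustration; in practice `alpha` would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Pre-pruning: limit growth while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with cost-complexity pruning
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary pick, for illustration only
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()}, test accuracy={model.score(X_test, y_test):.3f}")
```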

#**4**: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

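- Information Gain measures how much a split on a feature reduces uncertainty: it is the entropy of the parent node minus the weighted average entropy of the child nodes.
Decision Trees use it to choose the best feature to split on at each step, leading to purer, more informative branches.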
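
A small worked example may help here. The sketch below computes information gain for two hand-made splits of the same parent node; the function names and the toy labels are hypothetical and chosen only to show the arithmetic.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a node: sum(p_i * log2(1 / p_i)) over the classes present."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy

parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split whose children are as mixed as the parent
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The tree would prefer the first split, since it removes all of the parent's uncertainty, while the second removes none.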



#**5**: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

- Decision Trees are widely used in finance, healthcare, marketing, and manufacturing because they’re interpretable and flexible, but they can overfit easily and are less stable on noisy data.
That’s why, in practice, they’re often used as the base model in ensemble methods like Random Forests and Gradient Boosted Trees.


#**6**: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances


In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data          # Features
y = iris.target        # Target labels

# 2. Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create and train the Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# 4. Make predictions on the test data
y_pred = clf.predict(X_test)

# 5. Calculate and print model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# 6. Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Model Accuracy: 100.00%

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


#**7**: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


In [2]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data          # Features
y = iris.target        # Target labels

# 2. Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a Decision Tree Classifier with max_depth=3
clf_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# 4. Train a fully grown Decision Tree (no depth limit)
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# 5. Compare accuracies
print("Decision Tree Comparison on Iris Dataset")
print("----------------------------------------")
print(f"Accuracy (max_depth=3): {accuracy_limited * 100:.2f}%")
print(f"Accuracy (fully-grown): {accuracy_full * 100:.2f}%")

# Optional: show feature importances of the limited tree
print("\nFeature Importances (max_depth=3):")
for feature_name, importance in zip(iris.feature_names, clf_limited.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Decision Tree Comparison on Iris Dataset
----------------------------------------
Accuracy (max_depth=3): 100.00%
Accuracy (fully-grown): 100.00%

Feature Importances (max_depth=3):
sepal length (cm): 0.0000
sepal width (cm): 0.0000
petal length (cm): 0.9346
petal width (cm): 0.0654


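#**8**: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

Note: load_boston was removed from scikit-learn in version 1.2, so the program below uses the California Housing dataset (fetch_california_housing) instead.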

In [3]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data           # Features
y = housing.target         # Target variable (median house value)

# 2. Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(criterion='squared_error', random_state=42)
regressor.fit(X_train, y_train)

# 4. Make predictions
y_pred = regressor.predict(X_test)

# 5. Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# 6. Print feature importances
print("\nFeature Importances:")
for name, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")


Mean Squared Error (MSE): 0.4952

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


#**9**: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
● Print the best parameters and the resulting model accuracy


In [None]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data        # features
y = iris.target      # target labels

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Define the Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# 4. Define the hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# 5. Perform GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# 6. Get the best parameters and best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# 7. Evaluate on test data
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 8. Print results
print("Best Parameters:", best_params)
print(f"Model Accuracy: {accuracy:.4f}")


#**10**: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world setting.

- 1) Understand the data first

Inspect data types, value counts, and missingness by column (df.info(), df.describe(), df.isna().mean()).

Identify categorical vs numeric features, and flag ordinal variables (e.g., disease stage).

Check class balance (value_counts()), outliers, and obvious data quality issues (duplicates, impossible values).

Split dataset into train / validation / test before doing any target-related preprocessing to avoid leakage (typical: 60–20–20 or 70–15–15). Use stratified split if classes are imbalanced.
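
One way to get a stratified 60-20-20 split is to call `train_test_split` twice, as sketched below. The synthetic imbalanced dataset from `make_classification` is only a stand-in for the real patient data, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient data: roughly 10% positive cases
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# First carve off the 20% test set, keeping the class ratio (stratify=y)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Then split the remaining 80% into 60% train / 20% validation (0.25 * 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())  # positive rate per split
```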

- 2) Handle missing values

Principles: avoid leaking target info and prefer imputing within cross-validation/pipeline.

Options by feature type:

- Numeric features
  - If missingness is small and missing completely at random (MCAR): SimpleImputer(strategy='median') (robust to outliers).
  - If not MCAR or missingness is informative: add a binary indicator column (e.g., feat_missing) and impute (median/mean).
  - For heavy missingness, consider dropping the feature or using model-based imputation (IterativeImputer) if justified.

- Categorical features
  - Treat missing as a separate category ("MISSING") or impute with the most frequent value; sometimes "MISSING" carries signal.

- Time-dependent / grouped data
  - Impute within groups (e.g., patient ID) if appropriate.

Always fit imputers only on training data (use pipelines so this is automatic).
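
As a small sketch of median imputation with a missingness indicator, the example below fits a `SimpleImputer` on a toy training array and applies it to an unseen row. The numbers are made up; `add_indicator=True` appends one 0/1 column for each feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data (made-up values); both columns contain a missing entry
X_train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [100.0, 40.0],
])
X_new = np.array([[np.nan, np.nan]])  # unseen row with both values missing

# Medians are learned from the training data only
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)

# Columns: imputed feature 0, imputed feature 1, missing flag 0, missing flag 1
X_new_imputed = imputer.transform(X_new)
print(X_new_imputed)  # [[ 3. 20.  1.  1.]]
```

The median of the first column is 3.0 despite the outlier 100.0, which is the robustness the median strategy is chosen for.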

- 3) Encode categorical features

Decision Trees in common libraries expect numeric input.

Nominal categories (no order): use OneHotEncoder(handle_unknown='ignore'). Works well for low-cardinality features.

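High-cardinality nominal: consider target encoding, frequency encoding, or embedding approaches.

If using target encoding, fit it inside cross-validation (on the training folds only) to prevent leakage, and apply smoothing to reduce overfitting on rare categories.

Frequency encoding (replacing each category with its frequency in the training data) is simpler and does not use the target, so it cannot leak it.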

Ordinal categories: use OrdinalEncoder with explicit ordering.

If you use sklearn pipelines, combine with ColumnTransformer.

Why: Correct encoding preserves relationships and avoids exploding dimensionality.
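
Frequency encoding can be sketched in a few lines of pandas. The `hospital` column and its values are hypothetical; the key point is that the frequencies are learned from the training rows and categories unseen in training fall back to 0.

```python
import pandas as pd

# Hypothetical high-cardinality column; values are made up for illustration
train = pd.DataFrame({"hospital": ["A", "A", "B", "C", "A", "B"]})
new = pd.DataFrame({"hospital": ["A", "C", "Z"]})  # "Z" never appears in training

# Learn category frequencies from the training data only
freq = train["hospital"].value_counts(normalize=True)

# Map each category to its training frequency; unseen categories get 0.0
train["hospital_freq"] = train["hospital"].map(freq)
new["hospital_freq"] = new["hospital"].map(freq).fillna(0.0)

print(new)
```

This produces a single numeric column regardless of how many distinct categories exist, which is why it avoids the dimensionality growth of one-hot encoding.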

- 5) Train the Decision Tree model

Start with sensible defaults (e.g., criterion='gini', class_weight='balanced' if classes unequal).

Fit on training data only.

Evaluate baseline on validation set.

Note: trees do not require feature scaling; the StandardScaler step in the pipeline code at the end of this answer is harmless but optional.

- 6) Tune hyperparameters (GridSearchCV or RandomizedSearchCV)

Important hyperparameters for Decision Trees:

max_depth (limits tree size)

min_samples_split and min_samples_leaf (stability, overfit control)

max_leaf_nodes

criterion (gini vs entropy)

class_weight (or handle imbalance with sampling)

Use StratifiedKFold and include the entire pipeline inside the grid search to avoid leakage.
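
The sketch below shows one way to tune a pipeline, not just the tree, with `GridSearchCV` and `StratifiedKFold`. The synthetic data, the grid values, and the choice of `roc_auc` as the scoring metric are assumptions made for this example. Parameters of the tree are addressed through the `model__` prefix, which comes from the step name in the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (about 20% positives)
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)

# Preprocessing lives inside the pipeline, so it is refit on every training fold
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", DecisionTreeClassifier(random_state=42)),
])

# Illustrative grid; the values are arbitrary starting points
param_grid = {
    "model__max_depth": [3, 5, None],
    "model__min_samples_leaf": [1, 5, 20],
    "model__criterion": ["gini", "entropy"],
    "model__class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated ROC AUC: {search.best_score_:.3f}")
```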

- 7) Model evaluation (metrics and plots)

Use a held-out test set only once for final evaluation.

Recommended metrics for a disease prediction problem:

Primary: ROC AUC (probability ranking), PR AUC (precision-recall area) — PR AUC is better when positive class is rare.

Threshold-dependent: precision, recall (sensitivity), specificity, F1-score.

Confusion matrix — visualize false positives vs false negatives.

Calibration: reliability diagram / Brier score — very important in healthcare if probabilities are used in decisions.

Decision analysis: compute cost-sensitive metrics if false negatives are much worse than false positives.

Confidence intervals: bootstrap metrics to estimate uncertainty.
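
Several of the metrics listed above can be computed as in the sketch below. It uses synthetic imbalanced data and a fixed 0.5 threshold, both of which are assumptions for illustration; in a real healthcare setting the threshold would be chosen from the cost of false negatives versus false positives.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (about 10% positives)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Predicted probability of the positive class, then a 0.5 threshold (illustrative)
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

# Threshold-free metrics
roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)
brier = brier_score_loss(y_test, proba)

# Threshold-dependent metrics from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)   # recall on the positive (disease) class
specificity = tn / (tn + fp)

print(f"ROC AUC={roc_auc:.3f}  PR AUC={pr_auc:.3f}  Brier={brier:.3f}")
print(f"Sensitivity={sensitivity:.3f}  Specificity={specificity:.3f}")
```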

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

num_cols = [...]
cat_lowcard = [...]
cat_highcard = [...]
ord_cols = [...]

num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())   # optional for trees but OK to keep
])

cat_low_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

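cat_high_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='MISSING')),
    # add a frequency/target encoder step here (custom transformer) before
    # wiring cat_highcard into the ColumnTransformer
])

# Ordinal features: pass categories=[...] to OrdinalEncoder to enforce an explicit order
ord_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ord', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])

preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat_low', cat_low_pipe, cat_lowcard),
    ('ord', ord_pipe, ord_cols),
    # add ('cat_high', cat_high_pipe, cat_highcard) once cat_high_pipe ends in an encoder
])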

clf = Pipeline([
    ('pre', preprocessor),
    ('model', DecisionTreeClassifier(random_state=42, class_weight='balanced'))
])
