1. What is a Decision Tree, and how does it work in the context of classification?
- A **Decision Tree** is a supervised machine learning algorithm used for both classification and regression, but it is most commonly applied in classification tasks.
- It works by splitting the dataset into subsets based on the most significant features, using conditions or rules at each internal node. Each node represents a decision point on a feature, branches represent possible outcomes, and leaf nodes represent the final class labels.
- The algorithm recursively divides the data, choosing at each node the split that most reduces impurity as measured by **Gini Index** or **Entropy** (the reduction in entropy is called **Information Gain**), aiming to create pure subsets where most or all data points belong to a single class.
- In classification, the decision tree predicts the class of a new instance by traversing from the root node down to a leaf node, following the decision rules based on the feature values of that instance.
- This makes decision trees easy to interpret, as they mimic human decision-making with clear “if-else” rules.
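The traversal described above can be sketched with a hand-written toy tree. This is only an illustration: the feature names, thresholds, and the `toy_tree` / `predict` names are made up for the example and are not learned from data.

```python
# A hand-written toy tree (hypothetical thresholds, not fitted to any dataset).
# Internal nodes hold a feature and a threshold; leaf nodes hold a class label.
toy_tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, applying one if-else rule per node."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 5.0, "petal_width": 2.0}))  # virginica
```

Each loop iteration corresponds to one decision point, which is why the path to a prediction can be read out as a chain of "if-else" rules.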


2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
- **Gini Impurity** and **Entropy** are measures used in Decision Trees to determine how mixed or impure a node is. 
- Gini Impurity is the probability of misclassifying a randomly chosen sample if it were labeled at random according to the class distribution in that node.
- Entropy, on the other hand, measures the level of disorder or uncertainty in the data, with higher values indicating that the classes are more evenly mixed.
- In the context of a Decision Tree, these measures guide the algorithm in selecting the best splits: the tree tries to divide the data so that each resulting group is as pure as possible, meaning it contains mostly samples of a single class.
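- Splits that reduce impurity produce nodes dominated by a single class, so the decision at each node becomes less uncertain. In practice the two measures usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.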
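As a rough illustration of the two measures, the sketch below computes them for a list of class labels using the standard definitions: Gini = 1 − Σ pₖ² and Entropy = −Σ pₖ log₂ pₖ, where pₖ is the fraction of samples in class k. The helper names `gini` and `entropy` are chosen for this example only.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


mixed = ["a", "a", "b", "b"]  # evenly mixed node
pure = ["a", "a", "a", "a"]   # pure node

print(gini(mixed), entropy(mixed))  # 0.5 1.0
print(gini(pure), entropy(pure))    # 0.0 0.0
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is the property the tree exploits when ranking candidate splits.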


3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
- **Pre-pruning** and **Post-pruning** are techniques used to prevent overfitting in Decision Trees.
- Pre-pruning, also called *early stopping*, stops the tree from growing too deep by applying constraints such as limiting the maximum depth, requiring a minimum number of samples for a split, or setting a threshold for impurity reduction.
- Its practical advantage is that it makes the model simpler and faster to train, which is useful when working with large datasets.
- Post-pruning, on the other hand, allows the tree to grow fully and then prunes back branches that do not improve performance, usually by evaluating accuracy on a validation set.
- Its practical advantage is that it often generalizes better than pre-pruning, since branches are removed based on measured performance rather than fixed rules that might stop a useful split too early.
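One possible way to try both approaches in scikit-learn is sketched below. Pre-pruning uses the `max_depth` and `min_samples_split` constraints. For post-pruning, the sketch uses scikit-learn's minimal cost-complexity pruning (`ccp_alpha`), which is one specific post-pruning method rather than the only one, and picks the pruning strength by validation accuracy. The particular parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: constrain growth up front (values here are illustrative).
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=4, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the candidate pruning strengths of the full tree,
# refit one tree per candidate, and keep the one that scores best on the
# validation set. Alphas are clipped at 0 to guard against tiny negative
# values caused by floating-point error.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
candidates = [
    DecisionTreeClassifier(ccp_alpha=max(float(alpha), 0.0), random_state=42).fit(
        X_train, y_train
    )
    for alpha in path.ccp_alphas
]
post_pruned = max(candidates, key=lambda tree: tree.score(X_val, y_val))

print("Pre-pruned leaves: ", pre_pruned.get_n_leaves())
print("Post-pruned leaves:", post_pruned.get_n_leaves())
```

Because the validation set is used here to choose the pruning strength, a separate held-out test set would be needed for an unbiased final accuracy estimate.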


4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
- **Information Gain** in Decision Trees is a measure of how well a feature separates the data into distinct classes.
- It represents the reduction in uncertainty or impurity after splitting the dataset based on a specific feature. In simple terms, it tells us how much "useful information" a split provides about the target variable.
- A higher information gain means the split creates more homogeneous groups, making classification easier.
- It is important because the Decision Tree uses this measure to decide the best feature and threshold for splitting at each node, ensuring that the tree becomes more accurate and efficient in distinguishing between classes as it grows.
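A small worked example may make the definition concrete. Assuming entropy as the impurity measure, information gain is the parent's entropy minus the size-weighted average entropy of the child nodes. The function names and the toy labels below are chosen for this sketch only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy


parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes perfectly.
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split that leaves both children as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The tree-building algorithm evaluates candidate splits in this way and keeps the one with the highest gain.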



5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
- Decision Trees are widely used in real-world applications such as medical diagnosis, where they help doctors identify diseases based on symptoms, credit scoring and risk assessment in finance, customer segmentation and churn prediction in marketing, fraud detection, and even in recommendation systems.
- Their main advantage is that they are easy to understand, interpret, and visualize since they follow simple “if-else” rules that resemble human decision-making.
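- They can work with both numerical and categorical data and need no feature scaling, although some implementations, including scikit-learn's, require categorical features to be encoded numerically first.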
- However, their limitations include a tendency to overfit if not properly pruned, sensitivity to small changes in the dataset that can lead to very different tree structures, and often lower predictive accuracy than ensemble methods built on them, such as Random Forests or Gradient Boosted Trees.


In [None]:
# 6. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data   # features
y = iris.target # labels

# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Make predictions on test set
y_pred = clf.predict(X_test)

# Print model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", round(accuracy, 4))

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")




In [None]:
# 7. Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training (70%) and testing (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully grown Decision Tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Decision Tree with max_depth = 3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
y_pred_pruned = pruned_tree.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# Print results
print("Accuracy of Fully Grown Tree:", round(accuracy_full, 4))
print("Accuracy of Tree with max_depth=3:", round(accuracy_pruned, 4))


In [None]:
# 8. Write a Python program to:
#● Load the Boston Housing Dataset
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances


# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset (used in place of the Boston dataset,
# which was removed from scikit-learn in version 1.2)
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split dataset into training (70%) and testing (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on test data
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", round(mse, 4))

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


In [None]:
# 9. Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#● Print the best parameters and the resulting model accuracy


# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training (70%) and testing (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define parameter grid for tuning
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 6, 10]
}

# Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# GridSearchCV for tuning
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,  # 5-fold cross validation
    scoring="accuracy"
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Get best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate on test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", best_params)
print("Model Accuracy with Best Parameters:", round(accuracy, 4))


10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world setting.

- To build a disease prediction model, the first step would be to handle missing values carefully since healthcare data is rarely complete.
- For numerical features, missing values can be imputed using the mean or median depending on the distribution, while categorical features can be filled with the most frequent category or a new label such as “Unknown.”
- Once the data is complete, categorical features need to be encoded into numerical form so that the model can process them: one-hot encoding suits nominal categories, while ordinal encoding suits categories with a natural order.
- After preprocessing, a Decision Tree model can be trained on the dataset. It works with the imputed numerical features and the encoded categorical features together and does not require feature scaling.
- To improve its performance, hyperparameters such as maximum depth, minimum samples per split, and criterion (Gini or entropy) can be tuned using GridSearchCV with cross-validation.
- Once the best model is selected, its performance should be evaluated on a held-out test set using accuracy, precision, recall, and F1-score. Recall deserves particular attention, since in healthcare a false negative (a missed diagnosis) is often the costliest error.
- In a real-world setting, such a model could provide significant business value by enabling early detection of diseases, helping doctors prioritize high-risk patients, reducing diagnostic costs, and supporting data-driven decision-making in patient care.
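The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not a production workflow: the patient data is synthetic, the column names (`age`, `blood_pressure`, `smoker`) and the labeling rule are invented for the example, and recall is chosen as the tuning metric only to reflect the emphasis on false negatives.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and label rule).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["smoker"] = df["smoker"].astype(object)
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Introduce missing values in one numerical and one categorical column.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

# Steps 1 and 2: impute missing values, then one-hot encode the categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: Decision Tree on top of the preprocessing.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: tune hyperparameters with cross-validation, optimizing recall.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Placing the imputers and encoder inside the pipeline means they are refit on each training fold during cross-validation, which avoids leaking information from the validation folds into preprocessing.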

