
Question 1: What is a Decision Tree, and how does it work in the context of classification?

Answer:
A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. In classification, it splits the dataset into subsets based on feature values, forming a tree structure where:

Each internal node represents a feature test.

Each branch represents an outcome of that test.

Each leaf node represents a class label.


The model makes decisions by traversing from the root to a leaf, following the path determined by the feature values of the input data.
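The root-to-leaf traversal described above can be sketched with a small hand-built tree. The feature names, thresholds, and class labels below are illustrative assumptions, not values learned from data; the point is only to show how a prediction follows one path through the tree.

```python
# Hypothetical hand-built tree: each internal node tests one feature against
# a threshold, and each leaf carries a class label. Values are illustrative.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf and return that leaf's class label."""
    while "label" not in node:
        # Follow the left branch if the test passes, otherwise the right one
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

A trained scikit-learn tree stores the same kind of structure internally; this dictionary form is just a readable stand-in.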


---

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Answer:

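Gini Impurity measures the probability of misclassifying a randomly chosen element if it were labeled at random according to the class distribution in the node.
Formula:

$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$

Entropy measures the amount of uncertainty in the class distribution of the node.
Formula:

$$\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2 p_i$$

Here $p_i$ is the proportion of samples in the node that belong to class $i$, and $C$ is the number of classes.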


These measures quantify the quality of a split. Both are 0 for a pure node (all samples in one class) and largest when the classes are evenly mixed, so the lower the impurity after the split, the better. The tree algorithm chooses the feature and threshold that lead to the greatest reduction in impurity.
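As a minimal sketch, both measures can be computed directly from a list of class labels. The helper names `gini` and `entropy` and the toy label lists are assumptions made for illustration; they implement the standard definitions (one minus the sum of squared class proportions, and the negative sum of each proportion times its base-2 logarithm).

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits; log2(n / count) equals -log2(count / n)."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())


pure = ["a", "a", "a", "a"]    # one class only
mixed = ["a", "a", "b", "b"]   # two classes, evenly split

print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

The pure node scores zero on both measures, while the evenly mixed two-class node reaches the maximum for two classes (0.5 for Gini, 1 bit for entropy).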


---

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Answer:

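Pre-Pruning stops the tree from growing as soon as a stopping condition is met (e.g., a maximum depth is reached or a node has too few samples to split).
Advantage: Limits overfitting from the start and reduces training time, because the full tree is never built.

Post-Pruning lets the tree grow fully and then removes branches that contribute little to predictive performance (e.g., cost-complexity pruning).
Advantage: Pruning decisions are made with the whole tree in view, so useful deep splits are kept while the final model is simpler and often generalizes better.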
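The two strategies can be compared in scikit-learn, where pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`, and post-pruning is available as cost-complexity pruning through `ccp_alpha`. The sketch below assumes the Iris dataset and picks a pruning strength from the middle of the pruning path purely for illustration; in practice the value would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth stops once a limit is hit
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take an alpha from the middle of the path
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, tree in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "nodes:", tree.tree_.node_count, "test accuracy:", tree.score(X_test, y_test))
```

Larger `ccp_alpha` values remove more of the tree, so scanning the values in `path.ccp_alphas` gives a sequence of progressively smaller trees to choose from.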



---

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer:
Information Gain is the reduction in impurity (usually Entropy) achieved by splitting a dataset on a feature. It is calculated as:

$$\text{Information Gain} = \text{Entropy}(\text{parent}) - \sum_{i} \frac{n_i}{n} \cdot \text{Entropy}(\text{child}_i)$$

where $n$ is the number of samples in the parent node and $n_i$ is the number of samples in child node $i$.

It helps the Decision Tree select the feature that provides the most informative split, improving the model's classification ability.
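A small worked example may make the formula concrete. The helper names and the toy "yes"/"no" labels below are assumptions for illustration: one candidate split separates the classes perfectly, the other leaves both children as mixed as the parent.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes completely
perfect_split = [["yes"] * 5, ["no"] * 5]
# Split B leaves each child with the same 50/50 mix as the parent
useless_split = [["yes"] * 2 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]

gain_a = information_gain(parent, perfect_split)
gain_b = information_gain(parent, useless_split)
print("Gain for split A:", gain_a)
print("Gain for split B:", gain_b)
```

Split A recovers the full 1 bit of parent entropy, while split B gains nothing, so a tree builder comparing the two would choose split A.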


---

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Answer:
Applications:

Medical diagnosis

Credit scoring

Fraud detection

Customer segmentation


Advantages:

Easy to understand and interpret

Requires little data preprocessing

Handles both numerical and categorical data


Limitations:

Prone to overfitting

Unstable: small changes in the training data can produce a very different tree

Can create biased trees with imbalanced data



---

Question 6: Python Program (Iris Dataset, Gini Criterion)

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris data and hold out 30% of it for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Train a tree that splits on Gini impurity; random_state makes the result reproducible
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)
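The raw importance array is easier to read when each value is paired with its feature name. The sketch below is one way to do that; for brevity it fits on the full Iris dataset rather than a train split, which is an assumption of this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Fitted on all samples here only to keep the example short
model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(iris.data, iris.target)

# Pair each importance with its feature name and sort from most to least important
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a feature's share of the total impurity reduction in the tree.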


Question 7: Python Program (Iris Dataset, max_depth=3 vs full tree)


# Reuses X_train, X_test, y_train, y_test from the Iris split in Question 6

# Full tree: grown with no depth limit
full_model = DecisionTreeClassifier(random_state=42)
full_model.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_model.predict(X_test))

# Depth-limited tree (pre-pruned with max_depth=3)
pruned_model = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_model.fit(X_train, y_train)
pruned_acc = accuracy_score(y_test, pruned_model.predict(X_test))

print("Full Tree Accuracy:", full_acc)
print("Pruned Tree Accuracy:", pruned_acc)




Question 8: Python Program (Boston Housing, Regression)

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# load_boston() no longer exists in current scikit-learn, so California Housing is used instead
housing = fetch_california_housing()
# Separate variable names keep the Iris split from Question 6 available to other cells
Xh_train, Xh_test, yh_train, yh_test = train_test_split(housing.data, housing.target, test_size=0.3, random_state=42)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(Xh_train, yh_train)
yh_pred = regressor.predict(Xh_test)

print("Mean Squared Error:", mean_squared_error(yh_test, yh_pred))
print("Feature Importances:", regressor.feature_importances_)

Note: load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so the code above uses fetch_california_housing() as the replacement dataset.



Question 9: Python Program (GridSearchCV Tuning on Iris Dataset)

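from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Re-create the Iris split so this cell does not pick up the regression data from Question 8
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 3, 4]
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
# best_score_ is the mean cross-validated accuracy of the best parameter combination
print("Best CV Accuracy:", grid_search.best_score_)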
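The score reported by the grid search is a cross-validation average over the training data, not a held-out test result. A minimal sketch of checking the tuned tree on the test split follows; it repeats the setup so that it runs on its own, and the variable names `search`, `best_tree`, and `test_acc` are assumptions of this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 3, 4]},
    cv=3,
)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on the full training set
best_tree = search.best_estimator_
test_acc = accuracy_score(y_test, best_tree.predict(X_test))

print("CV accuracy:", search.best_score_)
print("Test accuracy:", test_acc)
```

Reporting both numbers makes it easier to spot a tuned model that looks good in cross-validation but does worse on unseen data.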


Question 10: Healthcare Use Case – Step-by-Step Process

Answer:

Step 1: Handle Missing Values

Use imputation strategies (mean/median for numerical, mode for categorical)

Optionally drop rows/columns with too many missing values


Step 2: Encode Categorical Features

Use One-Hot Encoding for unordered categories and Ordinal (label) Encoding for ordered ones, since integer codes imply an order that the tree will split on


Step 3: Train a Decision Tree Model

Split data into train/test

Use DecisionTreeClassifier() and fit on training data


Step 4: Tune Hyperparameters

Use GridSearchCV to find the best max_depth, min_samples_split, etc.


Step 5: Evaluate Performance

Use metrics like Accuracy, Precision, Recall, F1-score, and the Confusion Matrix; when the disease is rare, Recall and Precision are more informative than Accuracy alone
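The five steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is a hypothetical stand-in: the patient table is randomly generated, and the column names (`age`, `blood_pressure`, `smoker`) and the rule that defines the label are assumptions made so the example runs without a real healthcare dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical patient data (not real records)
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
# Made-up labeling rule so the tree has something to learn
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to simulate missing data
df.loc[rng.choice(n, 30, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

# Steps 1 and 2: impute missing values, then one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: decision tree on top of the preprocessing
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune the tree's hyperparameters with cross-validation
search = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_split": [2, 5, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Putting imputation and encoding inside the pipeline means they are refit on each cross-validation fold, which avoids leaking information from the validation folds into the preprocessing.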


Business Value:

Enables early and accurate disease prediction

Supports better patient care and treatment planning

Saves costs and improves efficiency in medical diagnosis
