Task 3: GaussianNB with Iris or Wine Dataset 
● Train a GaussianNB classifier on a numeric dataset. 

● Split data into train/test sets. 

● Evaluate model performance. 

● Compare with Logistic Regression or Decision Tree briefly.

In [1]:
# Step 1: Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Step 2: Load dataset
data = load_iris()
X = data.data
y = data.target

In [3]:
# Step 3: Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [4]:
# Step 4: Train GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)

In [6]:
# Step 5: Train Logistic Regression for comparison
lr = LogisticRegression(max_iter=200)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

In [7]:
# Step 6: Train Decision Tree for comparison
dt = DecisionTreeClassifier(random_state=42)  # fixed seed so the fitted tree is reproducible
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

In [8]:
# Step 7: Evaluate all models
print("GaussianNB Accuracy:", round(accuracy_score(y_test, y_pred_gnb), 4))
print("Logistic Regression Accuracy:", round(accuracy_score(y_test, y_pred_lr), 4))
print("Decision Tree Accuracy:", round(accuracy_score(y_test, y_pred_dt), 4))

print("\n--- GaussianNB Report ---\n", classification_report(y_test, y_pred_gnb))
print("\n--- Logistic Regression Report ---\n", classification_report(y_test, y_pred_lr))
print("\n--- Decision Tree Report ---\n", classification_report(y_test, y_pred_dt))

GaussianNB Accuracy: 0.9778
Logistic Regression Accuracy: 1.0
Decision Tree Accuracy: 1.0

--- GaussianNB Report ---
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45


--- Logistic Regression Report ---
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45


--- Decision Tree Report ---
               precision    recall  f1-score   support

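| Model                   | Accuracy (this run) | Typical range (varies with split) | Notes                                                                          |
| ----------------------- | ------------------- | --------------------------------- | ------------------------------------------------------------------------------ |
| **GaussianNB**          | 0.9778              | 0.93–0.96                         | Fast to train; assumes features are independent and Gaussian within each class |
| **Logistic Regression** | 1.0                 | 0.95–0.98                         | Strong, interpretable baseline                                                 |
| **Decision Tree**       | 1.0                 | 0.93–1.00                         | Captures non-linear relationships, but prone to overfitting if left unpruned   |

The test set has only 45 samples, so a single misclassification changes accuracy by about 0.022. The gap between GaussianNB and the other two models in this run is one sample and should not be read as a meaningful difference.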
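Because one train/test split gives a noisy estimate, the comparison can be cross-checked with k-fold cross-validation. The following is a minimal sketch, not part of the original notebook: the choice of 5 stratified folds, the shuffle seed, and the `cv_scores` dictionary are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same three model families as in the notebook above.
models = {
    "GaussianNB": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Assumption: 5 stratified folds, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold; the spread shows how much a split matters.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

The per-fold standard deviation gives a rough sense of how far a single-split accuracy can drift from the mean.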


Task 4: Conceptual Questions 
Answer briefly: 
1. What is entropy and information gain? 

2. Explain the difference between Gini Index and Entropy. 

3. How can a decision tree overfit? How can this be avoided? 

1. What is entropy and information gain?

   Entropy measures the impurity of a set of labels. It is 0 when every sample belongs to the same class and highest when the classes are evenly mixed.

   Information Gain is the reduction in entropy after a dataset is split on a feature: the parent's entropy minus the size-weighted average entropy of the child subsets. The tree selects the split with the highest gain, because it separates the classes best.
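A small worked example may make the two definitions concrete. This is an illustrative sketch using only the standard library; the helper names `entropy` and `information_gain` and the toy labels are hypothetical, not taken from the notebook.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a non-empty sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy data: 10 samples, evenly mixed, split into two mostly-pure halves.
parent = ["A"] * 5 + ["B"] * 5
left = ["A"] * 4 + ["B"] * 1
right = ["A"] * 1 + ["B"] * 4

parent_entropy = entropy(parent)
gain = information_gain(parent, [left, right])

print(f"Parent entropy: {parent_entropy:.4f}")  # 1.0000 (evenly mixed, 2 classes)
print(f"Information gain: {gain:.4f}")          # 0.2781
```

Each child has entropy of about 0.7219 bits, so the split removes about 0.2781 bits of uncertainty from the parent's 1 bit.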

2. Explain the difference between Gini Index and Entropy.

   Both are measures of impurity, but calculated differently:

   Entropy uses logarithms: $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the proportion of samples in class $i$.

   Gini Index uses squared probabilities: $G = 1 - \sum_i p_i^2$

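   Gini is slightly cheaper to compute because it needs no logarithm. Both measures are 0 for a pure node and peak when classes are evenly mixed, so in practice they usually choose similar splits. Where they differ, Gini tends to isolate the most frequent class in its own branch, while entropy tends to produce slightly more balanced trees.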
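To see how the two measures track each other, both can be evaluated on a few two-class distributions. This is a minimal sketch; the function names `gini` and `entropy_bits` and the chosen probabilities are assumptions made for illustration.

```python
import math


def gini(probs):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in probs)


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


impurities = {}
for p in (0.5, 0.7, 0.9, 1.0):
    probs = [p, 1.0 - p]
    impurities[p] = (gini(probs), entropy_bits(probs))
    print(f"p={p:.1f}  gini={impurities[p][0]:.4f}  entropy={impurities[p][1]:.4f}")
```

Both values fall as the node gets purer and reach 0 at `p=1.0`. For two classes, Gini peaks at 0.5 and entropy at 1.0, so the raw numbers are not directly comparable, only their ranking of candidate splits.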

3. How can a decision tree overfit? How can this be avoided?
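   A decision tree overfits when it grows deep enough to memorize noise or quirks of the training data, which gives near-perfect training accuracy but worse accuracy on unseen data.
   This can be avoided by:

   ● Limiting tree depth (max_depth)

   ● Requiring a minimum number of samples per leaf (min_samples_leaf)

   ● Pruning the tree (for example, cost-complexity pruning via ccp_alpha) or using ensemble methods such as Random Forests.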
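The effect of these settings can be checked by comparing tree size and train/test accuracy. The sketch below reuses the notebook's Iris split; the specific values (`max_depth=3`, `min_samples_leaf=5`, `ccp_alpha=0.02`) and the `results` dictionary are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Assumption: example values chosen only to show the effect of each control.
configs = {
    "unconstrained": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=5": {"min_samples_leaf": 5},
    "ccp_alpha=0.02": {"ccp_alpha": 0.02},
}

results = {}
for name, params in configs.items():
    tree = DecisionTreeClassifier(random_state=42, **params)
    tree.fit(X_train, y_train)
    results[name] = {
        "depth": tree.get_depth(),
        "leaves": tree.get_n_leaves(),
        "train_acc": tree.score(X_train, y_train),
        "test_acc": tree.score(X_test, y_test),
    }
    print(name, results[name])
```

A large gap between `train_acc` and `test_acc` is the usual sign of overfitting. Iris is small and easy to separate, so the gap here may be minimal; the difference in depth and leaf count is the clearer signal on this dataset.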