# Assignment 9 - Data Analytics III

Problem Statement

Implement a simple Naïve Bayes classification algorithm using Python/R on the iris.csv dataset.   
Compute the confusion matrix to find TP, FP, TN, FN, accuracy, error rate, precision, and recall on the given dataset.

# Theory Before Assignment

Naive Bayes is a **supervised** machine learning algorithm used mainly for classification tasks.
It is based on Bayes' Theorem and the **naive assumption** that **all features are independent of each other** once the class is known.

**P(A∣B)= (P(B∣A)×P(A))/P(B)**

Where:
𝑃(𝐴∣𝐵) = Probability of A (class) given B (features)  
𝑃(𝐵∣𝐴) = Probability of B (features) given A (class)  
𝑃(𝐴) = Probability of A (class) (prior probability)  
𝑃(𝐵) = Probability of B (features)  
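
To make the formula concrete, here is a small worked example. The numbers are hypothetical and chosen only for illustration (a spam filter looking at one word); they do not come from the iris dataset. The sketch assumes that P(B) can be obtained from the law of total probability over the two classes.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only.
p_spam = 0.2               # P(A): prior probability that an email is spam
p_word_given_spam = 0.5    # P(B|A): the word appears in a spam email
p_word_given_ham = 0.05    # P(B|not A): the word appears in a non-spam email

# P(B) via the law of total probability over both classes.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 4))  # 0.7143
```

Seeing the word raises the probability of spam from the prior of 0.2 to roughly 0.71, because the word is ten times more likely in spam than in non-spam.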

**2. Why is it called "Naive"?**  
It is called naive because it assumes that all features (variables) are completely independent of each other, which is rarely true in real-world data.
Even with this unrealistic assumption, it works surprisingly well for many problems!
 
**3. How does Naive Bayes work?**   
Step 1: It calculates the probability of each class given the input features.   
Step 2: It selects the class with the highest probability.    

It estimates these probabilities from frequencies and other statistics of the training data.
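
The two steps can be sketched from scratch for continuous features. This is a minimal illustration under stated assumptions, not scikit-learn's `GaussianNB` implementation: each feature is modeled with one Gaussian per class, the toy data and the names `fit_gaussian_nb`, `log_gaussian`, and `predict` are hypothetical, and log-probabilities are used to avoid numeric underflow.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        params[label] = (n / len(X), means, variances)
    return params

def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(params, row):
    """Step 1: score every class. Step 2: return the highest-scoring class."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        # Independence assumption: per-feature log-likelihoods simply add up.
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(scores, key=scores.get)

# Hypothetical toy data: two features, two classes.
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_toy = ["a", "a", "a", "b", "b", "b"]
params = fit_gaussian_nb(X_toy, y_toy)
print(predict(params, [1.1, 2.0]), predict(params, [3.1, 4.0]))  # a b
```

The score is only proportional to the posterior: P(B) is the same for every class, so it can be dropped when all that matters is which class scores highest.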

**4. Where is Naive Bayes used?**
1. Spam Detection (email is spam or not)
2. Sentiment Analysis (positive or negative review)
3. Medical Diagnosis (disease detection)
4. Text Classification (news categories)

**5. Types of Naive Bayes Models:**

| Type           | When to use                                   |
|----------------|------------------------------------------------|
| Gaussian NB    | When features are continuous and normally distributed |
| Multinomial NB | When features are counts (e.g., word counts in text) |
| Bernoulli NB   | When features are binary (0 or 1) |


# Assignment 9 - Data Analytics III

In [28]:
import pandas as pd

In [29]:
df = pd.read_csv('iris.csv')
df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [30]:
df = df.drop(columns=["Id"])
df

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [31]:
# Let us look at the correlation of each feature with the target.
# In order to do so, we need to encode the target feature first.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["encoded_species"] = le.fit_transform(df["Species"])
df

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,encoded_species
0,5.1,3.5,1.4,0.2,Iris-setosa,0
1,4.9,3.0,1.4,0.2,Iris-setosa,0
2,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5.0,3.6,1.4,0.2,Iris-setosa,0
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,2
146,6.3,2.5,5.0,1.9,Iris-virginica,2
147,6.5,3.0,5.2,2.0,Iris-virginica,2
148,6.2,3.4,5.4,2.3,Iris-virginica,2


In [32]:
le.classes_

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [33]:
df.corr(numeric_only=True)["encoded_species"]

SepalLengthCm      0.782561
SepalWidthCm      -0.419446
PetalLengthCm      0.949043
PetalWidthCm       0.956464
encoded_species    1.000000
Name: encoded_species, dtype: float64

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, recall_score, precision_score
import numpy as np

In [35]:
X = df.select_dtypes(include=["number"]).drop(columns=["encoded_species"]).values
y = df.select_dtypes(include=["number"])["encoded_species"].values

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [37]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [38]:
model = GaussianNB()
model.fit(X_train, y_train)

In [39]:
y_pred = model.predict(X_test)

In [40]:
accuracy_score(y_test, y_pred)

1.0

In [42]:
precision_score(y_test, y_pred, average='weighted')

1.0
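
The problem statement also asks for the confusion matrix, TP, FP, TN, FN, error rate, and recall, which the cells above do not compute. In the notebook itself, `confusion_matrix(y_test, y_pred)` and `recall_score(y_test, y_pred, average='weighted')` from the imports above would give the matrix and recall directly. The sketch below shows how those quantities could be derived by hand for a multi-class problem, counting each class one-vs-rest. The helper names and the demo labels are hypothetical; they are not the notebook's actual predictions.

```python
def confusion_matrix_counts(y_true, y_pred, labels):
    """Build a matrix whose rows are true classes and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_metrics(matrix, k):
    """One-vs-rest TP, FP, FN, TN, precision and recall for class index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp   # predicted k, but truly another class
    fn = sum(matrix[k]) - tp                  # truly k, but predicted another class
    tn = total - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall}

# Hypothetical labels, for illustration only.
y_true_demo = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred_demo = [0, 0, 1, 2, 2, 2, 1, 1]

cm = confusion_matrix_counts(y_true_demo, y_pred_demo, labels=[0, 1, 2])
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_true_demo)
error_rate = 1 - accuracy

print(cm)                         # [[2, 0, 0], [0, 2, 1], [0, 1, 2]]
print(accuracy, error_rate)       # 0.75 0.25
print(per_class_metrics(cm, 1))
```

Error rate is simply 1 minus accuracy, so a test accuracy of 1.0 corresponds to an error rate of 0.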

# K-Fold Cross Validation

K-Fold Cross Validation checks whether a model is really good, not just lucky on one train-test split.

Instead of splitting the data into a single train-test pair,   
✅ it splits the data into K parts ("folds")  
✅ trains the model K times  
✅ tests it each time on a different fold  
✅ then averages the results.  
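
The splitting step can be sketched as follows. This is a simplified illustration with a hypothetical helper `k_fold_indices` that uses contiguous, unshuffled folds. It is not what scikit-learn does internally: for a classifier with an integer `cv`, `cross_val_score` uses stratified folds that preserve class proportions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train size:", len(train_idx))
# test: [0, 1, 2, 3] train size: 6
# test: [4, 5, 6] train size: 7
# test: [7, 8, 9] train size: 7
```

Every sample appears in exactly one test fold, so each data point is used for evaluation once and for training K-1 times.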

In [43]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print("Cross-validation accuracy scores:", scores)
print("Mean accuracy:", scores.mean())

Cross-validation accuracy scores: [0.93333333 0.96666667 0.93333333 0.93333333 1.        ]
Mean accuracy: 0.9533333333333334
