# Task 1: Theory Questions


## Q1. What is the core assumption of Naive Bayes?

The core assumption of Naive Bayes is that features (predictors) are conditionally independent of each other given the class label.
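
Written as a formula, the assumption lets the class-conditional likelihood factor into one term per feature. The notation below is introduced here for illustration: $y$ is the class label and $x_1, \dots, x_n$ are the features.

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The predicted class is the $y$ that maximizes the right-hand side.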

## Q2. Differentiate between GaussianNB, MultinomialNB, and BernoulliNB.

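GaussianNB: Assumes each continuous feature follows a normal (Gaussian) distribution within each class. Example: predicting classes from measurements such as height and weight.

MultinomialNB: Models discrete counts, such as how many times each word occurs in a document. Example: text classification using word frequencies.

BernoulliNB: Models binary features, recording only whether each feature is present or absent. Example: text classification on word presence/absence (e.g., spam detection).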
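
One way to see the difference between the three variants is through the per-feature likelihood each one assumes. The sketch below is a simplified illustration in plain Python, not scikit-learn's implementation; the function names and all parameter values are made up for the example.

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log density of a continuous value x under N(mean, var) (GaussianNB-style)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def multinomial_log_likelihood(counts, probs):
    """Sum of count * log(prob) over features (MultinomialNB-style).
    The multinomial coefficient is dropped because it is the same for every class."""
    return sum(c * math.log(p) for c, p in zip(counts, probs))

def bernoulli_log_likelihood(present, probs):
    """Each feature adds log(p) if present and log(1 - p) if absent (BernoulliNB-style)."""
    return sum(math.log(p) if b else math.log(1 - p) for b, p in zip(present, probs))

# Hypothetical per-class parameters, chosen only for illustration.
height_ll = gaussian_log_likelihood(170.0, mean=165.0, var=25.0)
counts_ll = multinomial_log_likelihood([2, 0, 1], probs=[0.5, 0.3, 0.2])
binary_ll = bernoulli_log_likelihood([1, 0, 1], probs=[0.5, 0.3, 0.2])

print(height_ll, counts_ll, binary_ll)
```

Note that the absent second feature contributes nothing to the multinomial score but does contribute `log(1 - 0.3)` to the Bernoulli score.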


## Q3. Why is Naive Bayes considered suitable for high-dimensional data?

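1. Simple and fast: Because of the independence assumption, the number of parameters grows linearly with the number of features, not exponentially.

2. Tolerant of irrelevant features: It often works reasonably well without feature selection, because an uninformative feature tends to have similar likelihoods under every class and so has little effect on the prediction.

3. Efficient training: Parameters are estimated separately for each feature in a single pass over the data, so training is fast, can work with relatively small datasets, and suits real-time applications.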
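
The linear-versus-exponential point can be made concrete with a small count. This is a sketch assuming binary features; the two function names are hypothetical helpers, not part of any library.

```python
def naive_bayes_param_count(n_features, n_classes):
    # One Bernoulli parameter per (class, feature) pair, plus the class priors.
    return n_classes * n_features + (n_classes - 1)

def full_joint_param_count(n_features, n_classes):
    # One probability per feature combination (2**n_features of them, minus one
    # because they sum to 1) for each class, plus the class priors.
    return n_classes * (2 ** n_features - 1) + (n_classes - 1)

for d in (5, 10, 20):
    print(d, naive_bayes_param_count(d, 2), full_joint_param_count(d, 2))
```

With two classes and 20 binary features, the naive model needs 41 parameters while the unrestricted joint model needs 2,097,151.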

# Task 2: Spam Detection using MultinomialNB  

In [24]:
#Import Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

● Load a text dataset (e.g., SMS Spam Collection or any public text dataset).

In [25]:
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [26]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [27]:
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)

● Preprocess using CountVectorizer or TfidfVectorizer.

In [28]:
vectorizer = CountVectorizer()  # or TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

● Train a MultinomialNB classifier.

In [29]:
model = MultinomialNB()
model.fit(X_train_vec, y_train)

In [30]:
y_pred = model.predict(X_test_vec)

● Evaluate:

○ Accuracy

○ Precision

○ Recall

○ Confusion Matrix

In [31]:
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

In [33]:
print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print("\nConfusion Matrix:")
print(cm)

Accuracy:  0.9919
Precision: 1.0000
Recall:    0.9396

Confusion Matrix:
[[966   0]
 [  9 140]]
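
The three scores above can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below assumes scikit-learn's layout for binary labels 0 and 1, where rows are true classes and columns are predicted classes; the variable names are chosen for this example.

```python
# Confusion matrix printed above: [[TN, FP], [FN, TP]] for labels 0 = ham, 1 = spam.
cm_values = [[966, 0], [9, 140]]
(tn, fp), (fn, tp) = cm_values

accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all messages classified correctly
precision = tp / (tp + fp)                  # share of predicted spam that is really spam
recall = tp / (tp + fn)                     # share of real spam that was caught

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
```

Read this way, the model never flagged a ham message as spam (0 false positives) but missed 9 of the 149 spam messages.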


# Task 3: GaussianNB with Iris or Wine Dataset


In [35]:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report


● Train a GaussianNB classifier on a numeric dataset.

In [36]:
iris = load_iris()
X = iris.data      
y = iris.target    

● Split data into train/test sets.

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

● Evaluate model performance.

In [38]:
# Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)

In [41]:
accuracy_score(y_test, y_pred_gnb)

0.9777777777777777

● Compare with Logistic Regression or Decision Tree briefly.

In [None]:
# Logistic Regression
lr = LogisticRegression(max_iter=200)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

In [43]:
accuracy_score(y_test, y_pred_lr)

1.0

In [None]:
# Decision Tree
dt = DecisionTreeClassifier()  # no random_state set, so results may vary slightly between runs
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

In [44]:
accuracy_score(y_test, y_pred_dt)

1.0
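
On this split the test set holds 45 samples, so the gap between GaussianNB (0.978) and the other two models (1.0) is a single misclassified flower. A steadier comparison is to average over several splits. The following is a minimal sketch using 5-fold cross-validation; the fold count, the `random_state` value, and the variable names are choices made for this example rather than part of the original task.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

candidates = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=200),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

# Mean accuracy over 5 stratified folds for each model.
cv_means = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_iris, y_iris, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name:20s} mean accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```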