# SVM and Naive Bayes: Assignment


Question 1: What is a Support Vector Machine (SVM), and how does it work?

Answer:

SVM is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that separates data points of different classes with the maximum margin. The data points closest to the hyperplane are called support vectors, and they determine the position and orientation of the hyperplane.
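
As a minimal sketch of this idea, the snippet below fits a linear SVM to six hand-picked 2-D points and reads off the hyperplane parameters. The toy data and the large `C` value (chosen to approximate a hard margin) are assumptions made purely for illustration and are not part of the assignment.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative values, not from the assignment)
X = np.array([[0, 0], [0, 1], [1, 1], [3, 3], [4, 3], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

w = clf.coef_[0]                        # normal vector of the separating hyperplane
b = clf.intercept_[0]                   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
```

Only the points closest to the boundary end up in `support_vectors_`; moving any other point slightly would leave the hyperplane unchanged.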

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Answer:
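- Hard Margin SVM assumes the data is perfectly linearly separable and allows no misclassification: every training point must lie on the correct side of the margin. A single outlier or overlapping point makes the problem infeasible.
- Soft Margin SVM allows some points to violate the margin or be misclassified, so it can handle noisy or overlapping data. A regularization parameter (C) balances margin maximization against classification error: a large C penalizes violations heavily, while a small C tolerates more violations in exchange for a wider margin.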
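
For reference, the two formulations are commonly written as below. The notation is the usual textbook one and is an assumption here, not taken from the assignment: labels $y_i \in \{-1, +1\}$, weight vector $w$, bias $b$, and slack variables $\xi_i$ that measure how far each point violates the margin.

```latex
% Hard margin: no violations allowed
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{s.t.} \quad y_i\,(w^{\top} x_i + b) \ge 1 \quad \forall i

% Soft margin: violations \xi_i are allowed but penalized by C
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_i
\quad \text{s.t.} \quad y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0 \quad \forall i
```

As C grows, the penalty on the slack terms dominates and the soft-margin problem approaches the hard-margin one.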

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

Answer:
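The Kernel Trick lets an SVM work in a high-dimensional feature space without explicitly computing the coordinates of the data in that space. A kernel function returns the inner product of two points in the feature space directly from the original inputs, so data that is not linearly separable in the input space can be separated by a hyperplane in the feature space.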

Example:
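- RBF Kernel (Radial Basis Function): Useful when the decision boundary is non-linear. It corresponds to an implicit mapping into an infinite-dimensional feature space, where the similarity between two points decays as a Gaussian function of the distance between them.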
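
The identity behind the kernel trick can be checked numerically. The sketch below uses a degree-2 polynomial kernel, because its feature map is small enough to write out, and then evaluates an RBF kernel for comparison. The helper names `phi`, `poly2_kernel`, and `rbf_kernel`, the sample points, and `gamma=0.5` are all illustrative assumptions.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return np.dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # inner product computed in the feature space
implicit = poly2_kernel(x, z)      # same value computed from the original inputs

print("explicit:", explicit, "implicit:", implicit)
print("rbf(x, z):", rbf_kernel(x, z), "rbf(x, x):", rbf_kernel(x, x))
```

For the RBF kernel no finite `phi` can be written down, which is why evaluating the kernel directly is the only practical route.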

Question 4: What is a Naive Bayes Classifier, and why is it called "naive"?

Answer:

Naive Bayes is a probabilistic classifier based on Bayes' Theorem. It assumes that features are conditionally independent given the class label. This assumption is "naive" because it rarely holds in real-world data, where features such as the words in a sentence are usually correlated.
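
Written out, the independence assumption turns the class posterior into a product of per-feature terms. The notation below (features $x_1, \dots, x_n$, class $y$, predicted class $\hat{y}$) is standard textbook notation assumed for illustration.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Each factor $P(x_i \mid y)$ is estimated separately from the training data, which is what makes the model fast to train.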

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naive Bayes variants. When would you use each one?

Answer:
- Gaussian NB: For continuous features (e.g., height, weight). Assumes features follow a normal distribution.
- Multinomial NB: For discrete count data (e.g., word frequencies in text).
- Bernoulli NB: For binary features (e.g., presence or absence of words).
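
A minimal sketch of the three variants on matching feature types follows. All arrays are tiny hypothetical examples (heights and weights, word counts over a three-word vocabulary, word presence flags) invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Gaussian NB: continuous measurements (hypothetical height in cm, weight in kg)
X_cont = np.array([[150.0, 50.0], [155.0, 55.0], [180.0, 80.0], [185.0, 85.0]])
gnb = GaussianNB().fit(X_cont, y)
pred_gauss = gnb.predict([[152.0, 52.0], [183.0, 83.0]])

# Multinomial NB: word counts per document (hypothetical 3-word vocabulary)
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 3, 1], [0, 2, 2]])
mnb = MultinomialNB().fit(X_counts, y)
pred_multi = mnb.predict([[2, 0, 1]])

# Bernoulli NB: word presence (1) or absence (0)
X_binary = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
bnb = BernoulliNB().fit(X_binary, y)
pred_bern = bnb.predict([[1, 0, 0]])

print(pred_gauss, pred_multi, pred_bern)
```

The three classes share the same `fit` and `predict` interface, so the choice comes down to which likelihood model matches the feature type.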


Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

Answer:

In [5]:
import warnings

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Train an SVM classifier with a linear kernel
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Results
print(f"Accuracy : {accuracy_score(y_test, y_pred)}\n")
print(f"Support Vectors are :\n {clf.support_vectors_}")

Accuracy : 1.0

Support Vectors are :
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.

Answer:  

In [13]:
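from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the Breast Cancer dataset (569 samples, 30 features; 0 = malignant, 1 = benign)
dataset = load_breast_cancer()
X = dataset.data
y = dataset.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Train a Gaussian Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Results: precision, recall, and F1-score per class
print(f'Classification report :\n {classification_report(y_test, y_pred)}')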

Classification report :
               precision    recall  f1-score   support

           0       0.93      0.90      0.92        63
           1       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.

Answer:

In [32]:
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Load the Wine dataset (178 samples, 13 features, 3 classes)
dataset = load_wine()
X = dataset.data
y = dataset.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Hyperparameter grid; note that gamma has no effect when kernel='linear'
params = {
    'C': [1, 2, 3],
    'gamma': [0.1, 0.2, 0.3],
    'kernel': ['linear', 'rbf']
}
model = GridSearchCV(SVC(), param_grid=params, cv=3, verbose=2, n_jobs=-1)
model.fit(X_train, y_train)

# GridSearchCV refits the best estimator on the full training set by default
y_pred = model.best_estimator_.predict(X_test)

# Results
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}')
print(f'Hyperparameters : {model.best_params_}')

Fitting 3 folds for each of 18 candidates, totalling 54 fits
Accuracy: 0.981
Hyperparameters : {'C': 1, 'gamma': 0.1, 'kernel': 'linear'}
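
Because the linear kernel ignores `gamma`, the grid above evaluates each linear setting three times with identical results. One possible alternative, sketched below as an assumption rather than as the required solution, passes `param_grid` as a list of dictionaries so that `gamma` is searched only together with the RBF kernel. The names `search`, `n_candidates`, and `test_accuracy` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Separate grids: gamma is searched only with the RBF kernel
param_grid = [
    {"kernel": ["linear"], "C": [1, 2, 3]},
    {"kernel": ["rbf"], "C": [1, 2, 3], "gamma": [0.1, 0.2, 0.3]},
]

search = GridSearchCV(SVC(), param_grid=param_grid, cv=3)
search.fit(X_train, y_train)

n_candidates = len(search.cv_results_["params"])  # 3 linear + 9 RBF settings
test_accuracy = accuracy_score(y_test, search.predict(X_test))

print("Candidates evaluated:", n_candidates)
print("Best hyperparameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy:.3f}")
```

This cuts the search from 18 candidates to 12 without removing any distinct model from consideration.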


Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.


Answer:

In [38]:
# Import libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# 1. Load dataset (subset for speed)
categories = ['rec.sport.baseball', 'sci.space', 'talk.politics.mideast']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# 3. Convert text to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# 4. Train a Naïve Bayes Classifier
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# 5. Get prediction probabilities
y_prob = model.predict_proba(X_test_tfidf)

# 6. Calculate ROC-AUC Score (multi-class)
roc_auc = roc_auc_score(y_test, y_prob, multi_class='ovr')

# 7. Print result
print(f"ROC-AUC Score of Naive Bayes Classifier: {roc_auc:.5f}")


ROC-AUC Score of Naive Bayes Classifier: 0.99328


Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

Answer:

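- Preprocessing:
  - Use TfidfVectorizer for text vectorization.
  - Handle missing data by replacing missing email text with a placeholder or an empty string, or by removing rows that have no usable content.
- Model Choice:
  - Naïve Bayes (MultinomialNB) is a strong baseline for text because it is fast to train and performs well on sparse data.
  - SVM works well with high-dimensional text data and can find non-linear boundaries with kernels.
  - For a production spam classifier, I would start with an SVM with a linear kernel because:
    1. It handles high-dimensional text data effectively.
    2. It often achieves higher accuracy than Naïve Bayes on spam filtering tasks, which should be confirmed by comparing both models on held-out data.
- Class Imbalance:
  - Use class weights (e.g. class_weight='balanced') or oversample the minority class with SMOTE, applied to the training split only.
- Evaluation Metrics:
  - Precision, Recall, F1-score, ROC-AUC, and the Confusion Matrix. Accuracy alone is misleading when most emails are legitimate.
- Business Impact:
  - Reduces manual filtering, improves productivity, and protects users from phishing. Precision on the spam class matters most, because a legitimate email wrongly marked as spam may never be read.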
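
To illustrate why accuracy alone is not enough under class imbalance, the sketch below computes the listed metrics for a small set of hypothetical labels (8 legitimate emails and 2 spam emails, invented for this example).

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for 10 emails: 8 legitimate (0) and 2 spam (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / len(y_true)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp) for the spam class
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) for the spam class
f1 = f1_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy looks high even though the classifier catches only one of the two spam emails, which is the gap that precision, recall, and F1 expose.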


In [40]:

# SPAM vs NOT SPAM CLASSIFIER


# 1. Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# 2. Create a small simulated dataset
data = {
    'text': [
        'Win a free iPhone now!!!',
        'Meeting at 10am with the sales team',
        'Congratulations, you have won a lottery prize!',
        'Lunch with client tomorrow',
        'Get cheap medicines online',
        None,  # missing data example
        'Important update about your account',
        'Earn money from home easily',
        'Schedule weekly report discussion',
        'Claim your free vacation tickets now'
    ],
    'label': ['spam','ham','spam','ham','spam','ham','ham','spam','ham','spam']
}

df = pd.DataFrame(data)

# 3. Handle missing data
df['text'] = df['text'].fillna("missing_text")

# 4. Encode labels (spam = 1, ham = 0)
df['label'] = df['label'].map({'ham':0, 'spam':1})

# 5. Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.3, random_state=42, stratify=df['label']
)

# 6. Vectorize text using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# 7. Handle class imbalance by adjusting class weights in SVM
model = LinearSVC(class_weight='balanced', random_state=42)
model.fit(X_train_tfidf, y_train)

# 8. Make predictions
y_pred = model.predict(X_test_tfidf)

# 9. Evaluate the model
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# For ROC-AUC, we need decision scores (since LinearSVC doesn't give probabilities)
y_scores = model.decision_function(X_test_tfidf)
roc_auc = roc_auc_score(y_test, y_scores)
print("ROC-AUC Score:", round(roc_auc, 4))


Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.33      1.00      0.50         1

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3

Confusion Matrix:
 [[0 2]
 [0 1]]
ROC-AUC Score: 1.0

With only three test emails, these scores are not a reliable estimate of performance. The model labels every test email as spam (accuracy 0.33), while the ROC-AUC of 1.0 shows that the decision scores still rank the spam email above both legitimate ones, so it is the default decision threshold of 0 that fails here. A realistic evaluation needs a much larger labeled dataset.
