1. What is a Support Vector Machine (SVM), and how does it work?

Support Vector Machine (SVM)

Definition
A Support Vector Machine is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that separates data points of different classes with the maximum margin.

Key Concepts

Hyperplane – A decision boundary that separates different classes in feature space.

Margin – The distance between the hyperplane and the nearest data points from each class.

Support Vectors – The data points closest to the hyperplane, which directly influence its position and orientation.

Linear vs. Non-linear SVM –

Linear SVM: Used when the classes can be separated, at least approximately, by a flat hyperplane in the original feature space.

Non-linear SVM: Uses the kernel trick to map data into higher-dimensional space for separation.

Kernel Functions – Examples: Linear, Polynomial, Radial Basis Function (RBF), Sigmoid.

Regularization Parameter (C) – Controls the trade-off between maximizing margin and minimizing classification errors.

How it Works (Step-by-Step) (8 marks)

Input Data: SVM takes labeled training data (features + target classes).

Choosing a Hyperplane: It searches for a hyperplane that best separates the classes.

Maximizing the Margin: It ensures the chosen hyperplane has the largest possible margin from the nearest points.

Support Vectors: Identifies the critical boundary points that define the margin.

Kernel Trick (if needed): Transforms data into higher dimensions to make separation possible in non-linear cases.

Prediction: For a new data point, SVM determines on which side of the hyperplane it lies to classify it.
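
The steps above can be sketched with scikit-learn on a tiny hand-made 2-D dataset. The data points and variable names are illustrative assumptions, not part of the original answer. The sketch fits a linear SVM, lists the support vectors, computes the margin width as 2/‖w‖, and classifies a new point by the sign of the decision function.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data
model = SVC(kernel="linear", C=1e3)
model.fit(X, y)

w = model.coef_[0]       # normal vector of the hyperplane w.x + b = 0
b = model.intercept_[0]  # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", model.support_vectors_)
print("Margin width:", margin_width)

# Prediction: the sign of w.x + b decides the side of the hyperplane
new_point = np.array([[4.5, 4.5]])
score = model.decision_function(new_point)[0]
predicted = model.predict(new_point)[0]
print("Decision score:", score, "-> class", predicted)
```

Only the points nearest the boundary appear in `support_vectors_`; removing any other training point would leave the fitted hyperplane unchanged.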

Advantages

Works well for both linear and non-linear data.

Effective in high-dimensional spaces.

Robust against overfitting in many cases.

Limitations

Computationally expensive for large datasets.

Choice of kernel and parameters is crucial for performance.

✅ Final Summary:
SVM is a powerful classification algorithm that finds the optimal decision boundary with maximum margin. It uses support vectors and kernel functions to handle both linear and non-linear data, making it widely used in text classification, image recognition, and bioinformatics.

2. Explain the difference between Hard Margin and Soft Margin SVM.

Introduction

Support Vector Machine (SVM) aims to find a decision boundary (hyperplane) that separates classes. The concept of margin defines the distance between the hyperplane and the nearest data points.
Depending on how strictly we separate the data, SVM can be:

Hard Margin SVM

Soft Margin SVM

1. Hard Margin SVM

Definition: A hard margin SVM finds the hyperplane that perfectly separates all training points without any misclassification.

Assumption: Data must be linearly separable.

Properties:

No tolerance for misclassification.

Maximizes margin strictly.

Works well for noise-free datasets.

Advantages:

Simple and gives perfect separation for clean data.

Limitations:

Fails if data is noisy or overlapping.

Not suitable for real-world datasets with outliers.

2. Soft Margin SVM

Definition: A soft margin SVM allows some misclassification by introducing slack variables to handle non-separable data.

Assumption: Data need not be perfectly linearly separable; classes may overlap or contain noise and outliers.

Properties:

Balances margin maximization and classification errors.

Controlled by regularization parameter C:

Large C → margin violations are penalized heavily, giving fewer training misclassifications and a narrower margin (approaches hard-margin behavior).

Small C → wider margin but more misclassifications allowed.
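
For reference, the trade-off controlled by C is usually written as the optimization problem below. This is the standard textbook form rather than something stated in the original answer; the symbols w (weight vector), b (bias), and ξᵢ (slack variable for training point i) are introduced here for illustration.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i \,(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```

Forcing every ξᵢ to zero recovers the hard-margin problem, which is why a very large C behaves like a hard margin.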

Advantages:

Works well with noisy data.

More robust for real-world applications.

Limitations:

Requires tuning of C for best results.

3. Key Differences Table (4 marks)

| Feature | Hard Margin SVM | Soft Margin SVM |
| --- | --- | --- |
| Data Requirement | Perfectly linearly separable data | Can handle overlapping/noisy data |
| Misclassification | Not allowed | Allowed (controlled by slack variables) |
| Robustness to Noise | Very low | High |
| Parameter C | Not required | Required to control trade-off |
| Use Case | Ideal for clean datasets | Suitable for most real-world datasets |

Conclusion (1 mark)

Hard margin SVM is a strict approach for perfectly separable data, while soft margin SVM is flexible, allowing better performance on noisy, real-world datasets.
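
A small experiment can make the role of C concrete. The sketch below is an illustration under assumed settings (synthetic overlapping clusters from `make_blobs`, and two arbitrary C values); it compares the margin width 2/‖w‖ and the number of support vectors for a small and a large C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so perfect separation is not expected
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

def fit_and_describe(C):
    """Fit a linear soft-margin SVM; return margin width and support-vector count."""
    model = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(model.coef_[0])
    return margin_width, int(model.n_support_.sum())

small_c_margin, small_c_sv = fit_and_describe(0.01)
large_c_margin, large_c_sv = fit_and_describe(100.0)

print(f"C=0.01 -> margin width {small_c_margin:.3f}, support vectors {small_c_sv}")
print(f"C=100  -> margin width {large_c_margin:.3f}, support vectors {large_c_sv}")
```

The smaller C should report the wider margin, matching the "Small C → wider margin" row of the comparison.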

3. What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.

Definition

The Kernel Trick is a mathematical technique in Support Vector Machines (SVM) that allows the algorithm to perform classification in a higher-dimensional space without explicitly computing the coordinates of the data in that space.
It uses a kernel function to compute the inner product between two data points in the transformed space directly.

Why It’s Needed

Some datasets are not linearly separable in their original feature space.

By mapping data into a higher dimension, it can become linearly separable.

Directly computing in high dimensions is computationally expensive; the kernel trick avoids that cost.

How It Works

Original space: Data points cannot be separated by a straight line (hyperplane).

Mapping function φ(x): Transforms data to a higher-dimensional space where separation is possible.

Kernel function K(xᵢ, xⱼ): Calculates dot products in the higher-dimensional space without explicitly transforming the data.

This makes SVM capable of solving non-linear classification problems efficiently.
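
The idea can be checked numerically on a small case. The sketch below uses a degree-2 polynomial kernel (chosen here only because its feature map is easy to write out; it is not the kernel discussed in the example that follows) and shows that the kernel value computed in the original 2-D space equals the dot product after an explicit mapping φ into 3-D.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (x1, x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return float(np.dot(x, z) ** 2)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = float(np.dot(phi(x), phi(z)))  # dot product in the mapped 3-D space
implicit = poly_kernel(x, z)              # same value, computed in the original 2-D space

print(explicit, implicit)
```

Both numbers agree, but the kernel never builds φ(x); for kernels such as RBF the mapped space is infinite-dimensional, so the explicit route is not even available.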

Example – Radial Basis Function (RBF) Kernel

Formula:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

Parameters:

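γ (gamma): Controls how far the influence of a single training point reaches. A large γ gives each point a short reach (tighter, more complex boundary); a small γ gives a long reach (smoother boundary).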

How it Works:

Measures similarity between two points.

Closer points → higher similarity value (near 1).

Farther points → lower similarity value (near 0).
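
The similarity behaviour described above can be verified directly. This is a minimal sketch with made-up points and an arbitrary γ; it evaluates the RBF formula by hand and cross-checks one value against scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    return float(np.exp(-gamma * np.sum((x_i - x_j) ** 2)))

a = np.array([0.0, 0.0])
near = np.array([0.1, 0.1])
far = np.array([3.0, 3.0])
gamma = 0.5  # illustrative value

k_same = rbf(a, a, gamma)     # identical points
k_near = rbf(a, near, gamma)  # close points
k_far = rbf(a, far, gamma)    # distant points
print(k_same, k_near, k_far)

# Cross-check against scikit-learn's implementation
k_sklearn = rbf_kernel(a.reshape(1, -1), far.reshape(1, -1), gamma=gamma)[0, 0]
```

Identical points give exactly 1, nearby points stay close to 1, and distant points fall toward 0.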

Use Case:

Handwritten digit recognition: Digits are often non-linearly separable in raw pixel space. The RBF kernel maps them into a higher-dimensional space where SVM can separate them effectively.

Advantages of Kernel Trick (2 marks)

Handles complex, non-linear relationships.

Avoids high computational cost of explicit transformations.

✅ Final Summary:
The Kernel Trick lets SVM work in higher dimensions without direct computation, enabling efficient non-linear classification. For example, the RBF kernel is widely used in image recognition tasks to separate complex patterns.

4. What is a Naïve Bayes Classifier, and why is it called “naïve”?

Definition

Naïve Bayes is a supervised machine learning algorithm based on Bayes’ Theorem. It is mainly used for classification tasks.
It predicts the probability that a data point belongs to a particular class based on the prior probability of each class and the likelihood of the features given the class.

Bayes’ Theorem formula:

$$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}$$

Where:

$P(C \mid X)$ = Posterior probability (probability of class given features)

$P(X \mid C)$ = Likelihood (probability of features given class)

$P(C)$ = Prior probability of class

$P(X)$ = Evidence

Why It’s Called “Naïve” (6 marks)

It assumes that all features are independent of each other given the class label.

In reality, features in a dataset are often correlated (e.g., in email classification, “free” and “win” often appear together).

The term “naïve” refers to this simplifying assumption. Although it is unrealistic, the classifier often works surprisingly well in practice.

How It Works

Training phase:

Calculate the prior probabilities $P(C)$ for each class.

Calculate the likelihood $P(X_i \mid C)$ for each feature given the class.

Prediction phase:

Apply Bayes’ Theorem to compute posterior probabilities for each class.

Choose the class with the highest posterior probability.

Example Use Case

Spam Email Detection:

Features = presence of certain words (“offer”, “free”, “buy”)

Classes = “Spam” or “Not Spam”

Naïve Bayes predicts whether an email is spam based on the likelihood of these words in spam vs. non-spam emails.
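
The arithmetic behind such a prediction can be worked through by hand. All probabilities below are hypothetical values chosen only to illustrate the calculation, and for simplicity the sketch scores only the words that are present in the email.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic
priors = {"spam": 0.3, "not_spam": 0.7}
likelihoods = {
    "spam":     {"free": 0.60, "offer": 0.50},
    "not_spam": {"free": 0.05, "offer": 0.10},
}
email_words = ["free", "offer"]  # words present in the email to classify

# Numerator of Bayes' theorem under the naive independence assumption:
# P(C) * product of P(word | C)
scores = {}
for label, prior in priors.items():
    score = prior
    for word in email_words:
        score *= likelihoods[label][word]
    scores[label] = score

evidence = sum(scores.values())  # P(X), the normalizing constant
posteriors = {label: s / evidence for label, s in scores.items()}
prediction = max(posteriors, key=posteriors.get)

print(posteriors, prediction)
```

With these numbers the spam score is 0.3 × 0.6 × 0.5 = 0.09 and the not-spam score is 0.7 × 0.05 × 0.1 = 0.0035, so the posterior for spam is 0.09 / 0.0935, roughly 0.96.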

Advantages

Fast and works well with large datasets.

Effective for text classification problems.

Limitations

The independence assumption is rarely true in real-world data.

Poor performance if features are highly dependent.

✅ Final Summary:
Naïve Bayes is a probability-based classifier that applies Bayes’ Theorem with the naïve assumption of feature independence. Despite this assumption, it is widely used and effective for many classification problems like spam filtering and sentiment analysis.

5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

Naïve Bayes Variants

Naïve Bayes classifiers come in different forms depending on the type of feature distribution assumed. The three most common are:

Gaussian Naïve Bayes

Multinomial Naïve Bayes

Bernoulli Naïve Bayes

1. Gaussian Naïve Bayes

Definition: Assumes that continuous features follow a normal (Gaussian) distribution.

Probability formula for feature $x$ given class $C$:

$$P(x \mid C) = \frac{1}{\sqrt{2\pi\sigma_C^2}} \exp\left(-\frac{(x - \mu_C)^2}{2\sigma_C^2}\right)$$

When to use:

For continuous numeric data (e.g., age, height, temperature).

Example: Classifying flowers in the Iris dataset based on petal/sepal length.
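
The Gaussian likelihood is simple enough to compute directly. The sketch below implements the formula above for a single feature; the per-class means and variances are hypothetical values, not statistics taken from the Iris dataset.

```python
import math

def gaussian_likelihood(x, mean, variance):
    """Density of N(mean, variance) at x, i.e. P(x | C) for one feature."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)

# Hypothetical (mean, variance) of a "petal length" feature per class, in cm
class_stats = {"setosa": (1.5, 0.03), "versicolor": (4.3, 0.22)}

x = 1.6  # observed petal length
likelihoods = {
    name: gaussian_likelihood(x, mu, var)
    for name, (mu, var) in class_stats.items()
}
print(likelihoods)
```

A value of 1.6 lies close to the assumed setosa mean, so its likelihood under setosa is far larger than under versicolor. Note that a density can exceed 1 when the variance is small.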

2. Multinomial Naïve Bayes

Definition: Assumes that features are discrete counts (e.g., frequency of words in text).

Works well for:

Document classification, text mining, and Natural Language Processing (NLP).

When to use:

For count-based features (e.g., number of times a word appears in a document).

Example: Classifying news articles into topics based on word counts.

3. Bernoulli Naïve Bayes

Definition: Assumes binary features (0 or 1) indicating whether a particular feature is present or absent.

When to use:

For binary-valued features.

Example: Spam detection where each feature indicates whether a specific word appears in an email (1 = present, 0 = absent).

Comparison Table

| Variant | Feature Type | Typical Use Case |
| --- | --- | --- |
| Gaussian | Continuous numeric | Medical data, sensor readings, physical measurements |
| Multinomial | Discrete counts | Text classification (word frequency) |
| Bernoulli | Binary (0/1) | Binary text features, document presence/absence of words |

✅ Final Summary:

Gaussian NB → continuous data following normal distribution.

Multinomial NB → discrete count data (e.g., word frequencies).

Bernoulli NB → binary features indicating presence or absence.
Choosing the correct variant ensures better performance because each assumes a specific data distribution.
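
As a rough illustration of matching the variant to the feature type, the sketch below fits each scikit-learn class on a tiny toy matrix of the corresponding kind. The data are invented for the example and far too small to say anything about real performance.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])  # toy labels shared by all three examples

# Continuous measurements -> GaussianNB
X_continuous = np.array([[5.1, 1.4], [4.9, 1.5], [6.7, 5.2], [6.3, 5.0]])
gaussian_pred = GaussianNB().fit(X_continuous, y).predict([[6.5, 5.1]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
multinomial_pred = MultinomialNB().fit(X_counts, y).predict([[0, 2, 1]])

# Word present (1) / absent (0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)
bernoulli_pred = BernoulliNB().fit(X_binary, y).predict([[0, 1, 1]])

print(gaussian_pred, multinomial_pred, bernoulli_pred)
```

Each query point resembles the class-1 training rows, so all three models are expected to predict class 1.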

6. Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
(You can use any suitable dataset such as Iris, Breast Cancer, or Wine from sklearn.datasets, or a CSV file you have.)

# Question 6: SVM Classifier with Linear Kernel on Iris Dataset

# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = datasets.load_iris()
X = iris.data   # Features
y = iris.target # Target labels

# 2. Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Create SVM model with linear kernel
svm_model = SVC(kernel='linear')

# 4. Train the model
svm_model.fit(X_train, y_train)

# 5. Make predictions on test set
y_pred = svm_model.predict(X_test)

# 6. Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# 7. Print results
print("SVM Classifier with Linear Kernel")
print("-----------------------------------")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Number of Support Vectors for each class: {svm_model.n_support_}")
print("Support Vectors:\n", svm_model.support_vectors_)


Sample output (illustrative; run the code to see the actual accuracy and support vectors):

SVM Classifier with Linear Kernel
-----------------------------------
Accuracy: 100.00%
Number of Support Vectors for each class: [2 3 3]
Support Vectors:
 [[5.1 3.5 1.4 0.2]
  [4.9 3.  1.4 0.2]
  [6.9 3.1 4.9 1.5]
  [5.6 2.9 3.6 1.3]
  [6.5 3.  5.8 2.2]
  [7.2 3.6 6.1 2.5]]


7. Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.

# Question: Gaussian Naïve Bayes on Breast Cancer Dataset

# 1. Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# 2. Load the Breast Cancer dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data   # Features
y = breast_cancer.target # Target labels

# 3. Split dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Create Gaussian Naïve Bayes model
gnb = GaussianNB()

# 5. Train the model
gnb.fit(X_train, y_train)

# 6. Make predictions on test set
y_pred = gnb.predict(X_test)

# 7. Print classification report
print("Gaussian Naïve Bayes - Breast Cancer Dataset")
print("---------------------------------------------")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))


Sample Output (Will Vary Slightly)

Gaussian Naïve Bayes - Breast Cancer Dataset
---------------------------------------------
              precision    recall  f1-score   support

   malignant       0.96      0.91      0.94        42
      benign       0.94      0.98      0.96        72

    accuracy                           0.95       114
   macro avg       0.95      0.95      0.95       114
weighted avg       0.95      0.95      0.95       114


8. Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.

# Question 8: SVM Classifier with GridSearchCV on Wine Dataset

# 1. Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 2. Load the Wine dataset
wine = datasets.load_wine()
X = wine.data   # Features
y = wine.target # Labels

# 3. Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],       # Regularization parameter
    'gamma': [1, 0.1, 0.01, 0.001], # Kernel coefficient
    'kernel': ['rbf']             # Using RBF kernel
}

# 5. Create SVM model
svm_model = SVC()

# 6. Apply GridSearchCV (5-fold cross-validation)
grid_search = GridSearchCV(svm_model, param_grid, refit=True, verbose=0, cv=5)
grid_search.fit(X_train, y_train)

# 7. Get the best parameters and best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# 8. Make predictions with the best model
y_pred = best_model.predict(X_test)

# 9. Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# 10. Print results
print("SVM with GridSearchCV - Wine Dataset")
print("--------------------------------------")
print(f"Best Hyperparameters: {best_params}")
print(f"Accuracy: {accuracy * 100:.2f}%")



Sample Output (Will Vary)

SVM with GridSearchCV - Wine Dataset
--------------------------------------
Best Hyperparameters: {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
Accuracy: 100.00%
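
The RBF kernel is distance-based, and the Wine features sit on very different scales, so results from the program above can be sensitive to feature scaling. One possible variant, not part of the original answer, wraps a `StandardScaler` and the SVM in a `Pipeline` and searches the same C and gamma grid. The step names `scaler` and `svc` are arbitrary labels chosen for this sketch.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# The "svc__" prefix routes each parameter to the SVC step
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best hyperparameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy * 100:.2f}%")
```

Keeping the scaler inside the pipeline, rather than scaling the whole training set up front, avoids leaking validation-fold statistics into the cross-validation scores.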


9. Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.

# Question 9: Naïve Bayes on a Text Dataset (20 Newsgroups, a real corpus) with ROC-AUC Score

# 1. Import required libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# 2. Load a subset of the 20 Newsgroups dataset (binary classification for ROC-AUC)
categories = ['comp.graphics', 'sci.space']  # Two classes for binary classification
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = newsgroups.data   # Text data
y = newsgroups.target # Labels (0 or 1)

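# 3. Split the raw text into training and testing sets first
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Fit TF-IDF on the training text only, then transform the test text
#    (fitting on all documents would leak test-set statistics into training)
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)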

# 5. Create and train Multinomial Naïve Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# 6. Predict probabilities for ROC-AUC
y_proba = nb_model.predict_proba(X_test)[:, 1]  # Probability of class 1

# 7. Calculate ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)

# 8. Print results
print("Naïve Bayes on Synthetic Text Dataset (20 Newsgroups)")
print("----------------------------------------------------")
print(f"ROC-AUC Score: {roc_auc:.4f}")


Sample Output (Will Vary)

Naïve Bayes on Synthetic Text Dataset (20 Newsgroups)
----------------------------------------------------
ROC-AUC Score: 0.9895


10. Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.





1. Data Preprocessing

a) Handling Missing Data

Emails with missing text: Replace missing content with an empty string ("") or remove if too incomplete.

Missing labels: Remove affected rows to avoid training noise.

b) Text Cleaning

Convert to lowercase (to ensure "Free" and "free" are treated the same).

Remove punctuation, numbers, and special characters.

Remove stopwords (“the”, “is”, “and”) to reduce noise.

Apply stemming/lemmatization (e.g., "running" → "run") to normalize words.

c) Text Vectorization

Use TF-IDF Vectorization instead of simple bag-of-words for better weighting of important terms.

Limit vocabulary size and set max_df and min_df to remove overly common/rare words.
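
The preprocessing steps above might look roughly like this in scikit-learn. The emails are invented for illustration, and the `min_df`/`max_df` values are placeholders that would need tuning on real data; stemming or lemmatization is not shown because it requires an additional library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy emails (invented for illustration); None marks a missing body
raw_emails = [
    "WIN a FREE prize now!!!",
    None,
    "Meeting moved to 3pm, see agenda attached.",
    "Free offer: buy one, get one free",
]

# Replace missing bodies with an empty string so the vectorizer accepts them
emails = [text if text is not None else "" for text in raw_emails]

vectorizer = TfidfVectorizer(
    lowercase=True,        # "Free" and "free" become the same token
    stop_words="english",  # drop common function words
    min_df=1,              # on real data, raise this to drop very rare terms
    max_df=0.9,            # drop terms that appear in over 90% of emails
)
X = vectorizer.fit_transform(emails)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The email with the missing body becomes an all-zero row, which the downstream classifier can still score; whether to keep or drop such rows is a judgment call that depends on how many there are.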

2. Choosing & Justifying the Model (4 marks)

Naïve Bayes:

Works extremely well for text classification.

Assumes feature independence — not always true, but surprisingly effective.

Very fast to train and works well with sparse data from TF-IDF.

SVM:

High accuracy for high-dimensional data like text.

Handles non-linear decision boundaries with kernels.

More computationally expensive than Naïve Bayes.

Choice:

For speed and scalability → Multinomial Naïve Bayes.

If accuracy is the absolute priority and computation time is acceptable → SVM with linear kernel.

3. Addressing Class Imbalance

Resampling:

Oversample the minority class (spam) using SMOTE, applied to the training split only so that synthetic samples never reach the test set.

Or undersample majority class (not spam) if dataset is very large.

Class weights:

Use class_weight='balanced' in SVM or adjust priors in Naïve Bayes to give spam more influence.

Threshold tuning:

Adjust the decision threshold using the ROC or precision-recall curve to reach the desired spam recall without flagging too many legitimate emails.
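
To see what class_weight='balanced' does, the weights can be computed explicitly. This is a minimal sketch on an assumed toy label vector with a 90/10 split; the resulting dictionary has the same form that scikit-learn estimators accept through their `class_weight` parameter.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy label vector: 90 legitimate emails (0) and 10 spam emails (1)
y = np.array([0] * 90 + [1] * 10)

classes = np.unique(y)
# "balanced" uses n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

Here the spam class receives weight 100 / (2 × 10) = 5.0 and the legitimate class 100 / (2 × 90) ≈ 0.56, so each spam error costs about nine times as much during training.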

4. Performance Evaluation

Metrics:

Precision: Proportion of emails classified as spam that are actually spam (reduces false positives).

Recall: Proportion of actual spam emails correctly identified (reduces false negatives).

F1-Score: Balance between precision and recall.

ROC-AUC: Summarizes the trade-off between true positive rate and false positive rate across all decision thresholds.

Why not just Accuracy?

In imbalanced datasets, accuracy can be misleading (e.g., predicting all emails as “not spam” still gives high accuracy).
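
The point about accuracy can be demonstrated with a deliberately useless classifier. The labels below are a made-up 95/5 split, and the "model" simply predicts "not spam" for every email.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground truth: 95 legitimate emails (0) and 5 spam emails (1)
y_true = np.array([0] * 95 + [1] * 5)

# A useless "classifier" that labels every email as not spam
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no email is predicted as spam
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
```

Accuracy comes out at 0.95 even though not a single spam email is caught, while recall and F1 for the spam class are both 0.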

5. Business Impact (2 marks)

Reduces risk: Prevents spam emails from reaching employees or customers, lowering phishing and fraud exposure.

Improves productivity: Less time wasted sorting through junk mail.

Enhances customer trust: Legitimate emails reach inboxes without being mistakenly flagged as spam.

Cost savings: Automated spam filtering reduces manual intervention and IT overhead.

✅ Final Summary:
For spam classification, I would clean and vectorize text with TF-IDF, handle imbalance via resampling or class weighting, and choose Naïve Bayes for speed or SVM for maximum accuracy. I would evaluate with precision, recall, F1, and ROC-AUC to ensure a reliable, business-impactful solution.