Question 1: What is a Support Vector Machine (SVM), and how does it work?

A Support Vector Machine (SVM) is a supervised machine learning algorithm commonly used for classification and regression tasks. For classification, it finds the hyperplane that separates data points of different classes with the largest possible margin in a high-dimensional space.

How SVM Works (for binary classification):
Hyperplane Concept:
A hyperplane is a decision boundary that separates the data into different classes. In two dimensions, it’s a line; in three dimensions, it's a plane; in higher dimensions, it's called a hyperplane.

Maximum Margin:
SVM finds the hyperplane that maximizes the margin, which is the distance between the hyperplane and the nearest data points from each class. These nearest points are called support vectors.

Support Vectors:
These are the critical data points closest to the decision boundary. They are the most informative elements for the classifier, as the position of the hyperplane is entirely determined by them.

Linearly Separable Data:
For linearly separable data, SVM directly finds the best hyperplane with the maximum margin.
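
As a minimal sketch of the maximum-margin idea, the snippet below fits a linear SVM on two well-separated synthetic clusters and reads the margin width off the learned weight vector as 2/‖w‖. The cluster centers, the random seed, and the very large `C` (used here to approximate a hard margin) are illustrative assumptions, not values taken from the text above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters (illustrative data)
X, y = make_blobs(n_samples=100, centers=[[-3, -3], [3, 3]],
                  cluster_std=0.5, random_state=0)

# A very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                         # normal vector of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin boundaries
train_accuracy = clf.score(X, y)

print("Number of support vectors:", len(clf.support_vectors_))
print(f"Margin width: {margin_width:.3f}")
print(f"Training accuracy: {train_accuracy:.2f}")
```

Only the few points reported as support vectors determine `w`; removing any other training point would leave the fitted hyperplane unchanged.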

Non-Linearly Separable Data:
SVM uses a technique called the kernel trick to implicitly map input data into a higher-dimensional space where a linear separator can be found. Common kernels include:

Linear

Polynomial

Radial Basis Function (RBF)

Sigmoid

Soft Margin:
In real-world scenarios, data may not be perfectly separable. SVM then uses a soft margin: some points are allowed to violate the margin or be misclassified, and the regularization parameter C controls how heavily those violations are penalized. This helps avoid overfitting to noisy points.

Advantages of SVM:
Effective in high-dimensional spaces

Works well even when the number of dimensions exceeds the number of samples

Versatile with different kernel functions


Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

The difference between Hard Margin and Soft Margin SVM lies in how strictly the SVM separates the classes and how it handles misclassified data.

 Hard Margin SVM
Definition: Hard Margin SVM does not allow any misclassification. It assumes that the data is perfectly linearly separable.

Goal: Find the hyperplane that perfectly separates the classes with the maximum margin.

Constraints: All data points must lie on the correct side of the hyperplane, on or outside the margin.

Use Case: Works well only when there is no noise or overlap in the data.

 Limitations:
Fails if the data is not perfectly separable (which is common in real-world data).

Very sensitive to outliers — even one mislabeled or noisy point can prevent it from finding a solution.

 Soft Margin SVM
Definition: Soft Margin SVM allows some misclassifications to achieve better generalization.

Goal: Find a balance between maximizing the margin and minimizing classification errors.

How: Introduces slack variables (ξ) and a regularization parameter (C):

Slack variables (ξ): Measure how much a data point violates the margin.

C (regularization parameter): Controls the trade-off between having a wider margin and fewer classification errors.

Large C: Penalizes misclassification heavily, giving a narrower margin → low bias, high variance (risk of overfitting).

Small C: Allows more violations, giving a wider margin → high bias, low variance (risk of underfitting).
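
For reference, the two optimization problems are usually written as below. This is the standard textbook formulation rather than something stated in the text above; the symbols w (weight vector), b (bias), y_i ∈ {−1, +1} (labels), and n (number of training points) are introduced here for illustration.

```latex
% Hard margin
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n

% Soft margin
\min_{w,\,b,\,\xi} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```

A point with ξ_i = 0 respects the margin, a point with 0 < ξ_i ≤ 1 lies inside the margin but on the correct side, and a point with ξ_i > 1 is misclassified.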

 Advantages:
Works well with noisy or overlapping data that is not perfectly separable.

More robust to outliers than hard margin SVM.

| Feature            | Hard Margin SVM              | Soft Margin SVM                   |
| ------------------ | ---------------------------- | --------------------------------- |
| Misclassification  | Not allowed                  | Allowed                           |
| Data Requirement   | Perfectly linearly separable | Can handle overlapping/noisy data |
| Robust to outliers | No                           | Yes                               |
| Flexibility        | Low                          | High (tunable with `C`)           |
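
The effect of `C` can be checked empirically. The sketch below, which assumes two overlapping synthetic clusters as stand-in data, fits a linear SVM with a small and a large `C` and counts the support vectors in each case. A smaller `C` tolerates more violations, widens the margin, and therefore typically pulls more points into the set of support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters (illustrative data)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.5, random_state=0)

n_support = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_support[C] = len(clf.support_vectors_)
    print(f"C={C}: {n_support[C]} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```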


Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.


The kernel trick is a method used in Support Vector Machines (SVMs) to handle non-linearly separable data by implicitly mapping it into a higher-dimensional space, without actually computing the coordinates in that space.

 Why use the Kernel Trick?
Some datasets cannot be separated by a straight line (or hyperplane) in their original space. But in a higher-dimensional space, they might become linearly separable. The kernel trick allows SVM to find that optimal separator efficiently, without explicitly transforming the data.

How it works:
Instead of transforming the data manually (which could be computationally expensive, or impossible when the target space is infinite-dimensional), the kernel function directly computes the inner product that two data points would have in the higher-dimensional space.
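
A small numeric check makes this concrete. The sketch below uses the homogeneous polynomial kernel of degree 2, chosen because its feature map is small enough to write out by hand; the helper names `phi` and `poly_kernel` and the two sample vectors are illustrative assumptions.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))   # inner product computed in the 3-D feature space
implicit = poly_kernel(x, z)        # same value computed in the original 2-D space

print(explicit, implicit)
```

Both numbers agree up to floating-point rounding, yet the kernel never constructs the 3-D vectors. For kernels such as RBF, the corresponding feature space is infinite-dimensional, so the explicit route is not available at all.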

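Example: the RBF (Gaussian) kernel, K(x, x') = exp(-γ‖x - x'‖²), where γ > 0 controls how quickly similarity decays with distance.

Use Case for RBF Kernel:
Non-linear classification problems, like: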

Image recognition

Handwriting detection (e.g., MNIST digit classification)

Bioinformatics (e.g., protein classification)

It’s widely used when you suspect your data is not linearly separable, and you want a flexible decision boundary.
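
To illustrate the point about flexible decision boundaries, the sketch below compares a linear and an RBF kernel on concentric circles, a synthetic dataset that no straight line can separate. The dataset parameters and random seeds are assumptions chosen for the demonstration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner circle = one class, outer ring = the other (not linearly separable)
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(f"{kernel} kernel test accuracy: {scores[kernel]:.2f}")
```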

| Aspect         | Kernel Trick                                               |
| -------------- | ---------------------------------------------------------- |
| Purpose        | Handle non-linear data efficiently                         |
| How            | Uses kernel functions to simulate high-dimensional mapping |
| Common Kernels | Linear, Polynomial, **RBF**, Sigmoid                       |
| Benefit        | Avoids explicit transformation (computational efficiency)  |


Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

A Naïve Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ Theorem. It’s primarily used for classification tasks, such as spam detection, sentiment analysis, and document categorization.

$$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}$$

It’s called “naïve” because it assumes that all features are conditionally independent given the class label.

In reality, features are often correlated — but the model still performs surprisingly well in many practical cases despite this naïve assumption.

Example Use Case: Spam Detection
Suppose we’re classifying emails as Spam or Not Spam. Given the presence of certain words (features), Naïve Bayes estimates the probability of an email being spam and chooses the class with the highest posterior probability.

Even if "free" and "money" often appear together in spam, Naïve Bayes treats them as independent, which simplifies computation.
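
The arithmetic behind that decision can be sketched in a few lines. All probabilities below are hypothetical values chosen only to illustrate the calculation; in practice they would be estimated from training data.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic
prior = {"spam": 0.4, "not_spam": 0.6}
likelihood = {
    "spam": {"free": 0.6, "money": 0.5},
    "not_spam": {"free": 0.05, "money": 0.1},
}
words_in_email = ["free", "money"]

# Unnormalized posterior: P(C) times the product of P(word | C),
# which is where the independence assumption is used
scores = {}
for label in prior:
    score = prior[label]
    for word in words_in_email:
        score *= likelihood[label][word]
    scores[label] = score

# Dividing by the evidence P(X) turns the scores into posterior probabilities
evidence = sum(scores.values())
posterior = {label: score / evidence for label, score in scores.items()}
predicted = max(posterior, key=posterior.get)

print(posterior)
print("Predicted class:", predicted)
```

Because P(X) is the same for every class, the predicted class can be read off the unnormalized scores directly; the division is only needed when calibrated probabilities are wanted.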

| Aspect             | Description                                             |
| ------------------ | ------------------------------------------------------- |
| Type               | Probabilistic classifier                                |
| Based on           | Bayes' Theorem                                          |
| “Naïve” assumption | Feature independence given the class label              |
| Pros               | Simple, fast, works well with high-dimensional data     |
| Common use cases   | Text classification, spam filtering, sentiment analysis |


Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

The Naïve Bayes algorithm has several variants, each suited to a different type of data. The three most common variants are:

1. Gaussian Naïve Bayes
Use When: Features are continuous and assumed to follow a normal (Gaussian) distribution.

Example Use Cases:

Iris flower classification (numeric features like petal width, sepal length)

Medical data (blood pressure, cholesterol levels)

2. Multinomial Naïve Bayes
Use When: Features are discrete counts, especially word frequencies in text classification.

How It Works:

Assumes features represent counts of occurrences (e.g., number of times a word appears in a document).

Models the likelihood with a multinomial distribution.

Example Use Cases:

Text classification (e.g., spam filtering, topic categorization)

Document classification based on word counts (e.g., bag-of-words or TF-IDF features)

Important: Feature values must be non-negative. The multinomial model is defined for integer counts, although fractional values such as TF-IDF weights also work in practice.

 3. Bernoulli Naïve Bayes
Use When: Features are binary (0 or 1), representing presence or absence of a feature.

How It Works:

Assumes each feature is a binary variable (e.g., whether a word is present in a document).

Uses a Bernoulli distribution to model the probability of each feature being 1 or 0.

Example Use Cases:

Binary text features (e.g., "does the word 'buy' appear in this email?")

Binary sentiment classification (e.g., presence of specific positive/negative words)

| Variant              | Feature Type         | Distribution Used | Typical Use Case                          |
| -------------------- | -------------------- | ----------------- | ----------------------------------------- |
| Gaussian Naïve Bayes | Continuous (numeric) | Gaussian          | Medical data, sensor readings             |
| Multinomial NB       | Discrete counts      | Multinomial       | Text classification (word counts)         |
| Bernoulli NB         | Binary (0 or 1)      | Bernoulli         | Text with binary features (word presence) |
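
The three variants map directly onto scikit-learn classes. The sketch below pairs each one with a matching feature type; the four-document corpus and its labels are hypothetical and far too small for a real model, so treat the output as an illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements (Iris)
X_iris, y_iris = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X_iris, y_iris)
gnb_train_accuracy = gnb.score(X_iris, y_iris)

# Hypothetical toy corpus: 1 = spam, 0 = not spam
docs = ["free money now", "win free prize", "meeting at noon", "project meeting notes"]
labels = [1, 1, 0, 0]
new_doc = ["free prize money"]

# Multinomial NB: word counts
count_vec = CountVectorizer()
mnb = MultinomialNB().fit(count_vec.fit_transform(docs), labels)
mnb_pred = mnb.predict(count_vec.transform(new_doc))[0]

# Bernoulli NB: word presence/absence (binary=True caps every count at 1)
binary_vec = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(binary_vec.fit_transform(docs), labels)
bnb_pred = bnb.predict(binary_vec.transform(new_doc))[0]

print(f"GaussianNB training accuracy on Iris: {gnb_train_accuracy:.2f}")
print("MultinomialNB prediction:", mnb_pred)
print("BernoulliNB prediction:", bnb_pred)
```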


● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.
Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = datasets.load_iris()
X = iris.data      # Features
y = iris.target    # Labels

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train an SVM classifier with a linear kernel
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# 4. Make predictions
y_pred = model.predict(X_test)

# 5. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
support_vectors = model.support_vectors_

# 6. Print results
print(f"Model Accuracy: {accuracy:.2f}")
print("Support Vectors:")
print(support_vectors)


Output:
Model Accuracy: 1.00
Support Vectors:
[[5.1 3.5 1.4 0.2]
 [4.9 3.0 1.4 0.2]
 ...
]


Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data      # Features
y = data.target    # Labels

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train a Gaussian Naïve Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# 4. Make predictions
y_pred = model.predict(X_test)

# 5. Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Output:
Classification Report:
              precision    recall  f1-score   support

   malignant       0.94      0.91      0.92        64
      benign       0.95      0.96      0.96       107

    accuracy                           0.95       171
   macro avg       0.95      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171


Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Define the SVM model
svm = SVC()

# 4. Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # RBF kernel for non-linear classification
}

# 5. Use GridSearchCV to find the best parameters
grid_search = GridSearchCV(svm, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# 6. Evaluate on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 7. Print results
print("Best Hyperparameters:", grid_search.best_params_)
print(f"Test Accuracy: {accuracy:.2f}")


Output:
Best Hyperparameters: {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
Test Accuracy: 1.00
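
RBF kernels depend on distances between samples, so features measured on very different scales, as in the Wine dataset, can dominate the kernel. One possible variant of the program above, sketched under the assumption that standardizing the features is acceptable here, wraps a `StandardScaler` and the SVM in a pipeline and tunes the same grid. The step name `svc` comes from `make_pipeline`, which names each step after its lowercased class name.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

test_accuracy = grid_search.score(X_test, y_test)
print("Best Hyperparameters:", grid_search.best_params_)
print(f"Test Accuracy: {test_accuracy:.2f}")
```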


Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# 1. Load a subset of the 20 Newsgroups dataset (binary classification)
categories = ['rec.sport.baseball', 'sci.space']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

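# 2. Split the raw documents first so the vectorizer never sees the test set
texts_train, texts_test, y_train, y_test = train_test_split(
    newsgroups.data, newsgroups.target, test_size=0.3, random_state=42
)

# 3. Fit TF-IDF on the training documents only, then transform the test documents
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)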

# 4. Train a Naïve Bayes Classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# 5. Predict probabilities and compute ROC-AUC
y_proba = model.predict_proba(X_test)[:, 1]  # Probabilities for class 1
roc_auc = roc_auc_score(y_test, y_proba)

# 6. Print the ROC-AUC score
print(f"ROC-AUC Score: {roc_auc:.2f}")


Output:
ROC-AUC Score: 0.98


Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.


1. Preprocessing the Data
 a. Text Cleaning
Lowercasing

Remove punctuation, special characters, HTML tags

Tokenization (splitting text into words)

Remove stop words (e.g., "the", "and")

Stemming or Lemmatization (e.g., "running" → "run")
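
The cleaning steps above can be sketched with the standard library alone. The stop-word list here is a tiny illustrative set and the function name `clean_email` is hypothetical; stemming or lemmatization is left out because it would need an external library such as NLTK or spaCy.

```python
import html
import re

# Tiny illustrative stop-word list; a real project would use a fuller one
STOP_WORDS = {"the", "and", "a", "an", "to", "of", "is", "in"}

def clean_email(text):
    """Lowercase, strip HTML tags and punctuation, tokenize, drop stop words."""
    text = html.unescape(text).lower()
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation and special characters
    tokens = text.split()                      # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = clean_email("<p>Claim the FREE prize, and win money!</p>")
print(tokens)  # ['claim', 'free', 'prize', 'win', 'money']
```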

 b. Handling Missing Data
If the email text is missing: drop the record, since the body is the core signal.

For other metadata (e.g., sender, subject): impute with a placeholder like 'unknown' or 'missing'.
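
A minimal sketch of both rules, assuming each email is a dictionary with hypothetical `body`, `sender`, and `subject` fields:

```python
# Hypothetical email records; field names are assumptions for illustration
emails = [
    {"body": "Win a free prize now", "sender": "promo@example.com", "subject": "Prize"},
    {"body": None, "sender": "alice@example.com", "subject": "Hello"},
    {"body": "Meeting moved to 3pm", "sender": None, "subject": None},
]

cleaned = []
for email in emails:
    # Drop emails with no usable body text
    if not email["body"] or not email["body"].strip():
        continue
    # Fill missing metadata with an explicit placeholder
    cleaned.append({
        "body": email["body"],
        "sender": email["sender"] or "unknown",
        "subject": email["subject"] or "unknown",
    })

print(cleaned)
```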

 c. Vectorization
Use TF-IDF Vectorizer to convert text to numerical features.

Captures term importance across the corpus

Reduces bias from frequent words



In [None]:
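from sklearn.feature_extraction.text import TfidfVectorizer

# email_texts is assumed to be a list of cleaned email bodies (one string per email)
# Keep the 5,000 most frequent unigrams and bigrams
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(email_texts)  # sparse matrix, one row per email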


2. Choosing the Model (SVM vs. Naïve Bayes)
| Criteria                             | Naïve Bayes               | SVM                        |
| ------------------------------------ | ------------------------- | -------------------------- |
| Works well with text data            | ✅ Yes                     | ✅ Yes                      |
| Assumes feature independence         | ✅ Yes                     | ❌ No                       |
| Handles high-dimensional sparse data | ✅ Yes                     | ✅ Yes                      |
| Training speed                       | ✅ Fast                    | ❌ Slower                   |
| Handles class imbalance well         | ❌ No (without adjustment) | ✅ Yes (with class weights) |
| Robust to overlapping features       | ❌ Less                    | ✅ More                     |


 Recommendation:
Start with Multinomial Naïve Bayes for speed and good performance on TF-IDF features.

Upgrade to SVM (with class weights) for better precision/recall trade-offs if needed.

Logistic Regression is also a solid alternative in this domain.

 3. Addressing Class Imbalance
Spam filters often face class imbalance (for example, 90–95% legitimate and 5–10% spam).

 Solutions:
Use class_weight='balanced' in SVM or Logistic Regression

Resampling techniques:

Oversample spam (e.g., using SMOTE)

Undersample legitimate emails

Threshold tuning on predicted probabilities to favor recall (catch more spam)
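
Two of these options, class weighting and threshold tuning, can be sketched as follows. The data is a synthetic stand-in for vectorized emails generated with `make_classification`, and the 0.3 threshold is an arbitrary illustrative choice rather than a recommended value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for vectorized emails: about 90% legitimate (0), 10% spam (1)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# class_weight='balanced' reweights errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

spam_proba = clf.predict_proba(X_test)[:, 1]

# Lowering the decision threshold flags more emails as spam,
# so recall on the spam class cannot go down
recall_by_threshold = {}
for threshold in (0.5, 0.3):
    y_pred = (spam_proba >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y_test, y_pred)
    print(f"threshold={threshold}: spam recall = {recall_by_threshold[threshold]:.2f}")
```

The gain in recall usually comes at the cost of precision, so the threshold should be chosen on a validation set with the cost of blocking a legitimate email in mind.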

 4. Model Evaluation Metrics
Accuracy is misleading in imbalanced settings. Focus on