**Question 1:** What is a Support Vector Machine (SVM), and how does it work?

**Answer:** A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used primarily for classification tasks, although it can also be used for regression. The main idea behind SVM is to find the optimal hyperplane that best separates data points of different classes in a high-dimensional space.

**How SVM Works:**

**Linear Separation:**

- SVM tries to find a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that separates the data points of different classes with the maximum margin.

- The margin is the distance between the hyperplane and the closest data points from each class. These closest points are called support vectors.

**Optimal Hyperplane:**

- The optimal hyperplane is the one that maximizes the margin between the support vectors of the two classes.

**Non-Linearly Separable Data:**

When the data isn't linearly separable, SVM uses a technique called the kernel trick to transform the data into a higher-dimensional space where a linear separator might exist.

**Common kernels:**

- Linear

- Polynomial

- Radial Basis Function (RBF) or Gaussian

- Sigmoid

**Soft Margin (for overlapping classes):**

- SVM allows some misclassification using a soft margin and introduces a parameter C to control the trade-off between maximizing the margin and minimizing classification errors.

**Mathematical Insight:**

For linearly separable data:

- The hyperplane is defined by:

  $w \cdot x + b = 0$

- Maximize the margin:

  $\dfrac{2}{\|w\|}$

- Subject to the constraint for each data point:

  $y_i (w \cdot x_i + b) \geq 1$

**Advantages of SVM:**

- Works well with high-dimensional data.

- Effective when the number of features > number of samples.

- Can model non-linear decision boundaries using kernels.

- Relatively robust to overfitting, because maximizing the margin acts as regularization (provided the kernel and C are tuned properly).
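The margin and constraint from the Mathematical Insight above can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes synthetic, well-separated data from `make_blobs` and uses a very large `C` (an illustrative choice) so that scikit-learn's soft-margin `SVC` behaves approximately like a hard-margin SVM.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters (synthetic, illustrative data)
X, y01 = make_blobs(n_samples=60, centers=[[-3, -3], [3, 3]],
                    cluster_std=0.8, random_state=0)
y = np.where(y01 == 0, -1, 1)  # SVM theory uses labels in {-1, +1}

# A very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane
b = clf.intercept_[0]   # offset
margin_width = 2.0 / np.linalg.norm(w)

# Functional margins y_i (w . x_i + b); the constraint says each is >= 1
functional_margins = y * (X @ w + b)
min_margin = functional_margins.min()

print(f"w = {w}, b = {b:.3f}")
print(f"margin width 2/||w|| = {margin_width:.3f}")
print(f"smallest y_i(w.x_i + b) = {min_margin:.3f}")  # close to 1 at the support vectors
print(f"number of support vectors: {len(clf.support_vectors_)}")
```

The smallest functional margin should come out close to 1 (up to solver tolerance), and it is attained by the support vectors.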



**Question 2:** Explain the difference between Hard Margin and Soft Margin SVM.

**Answer:**  The difference between Hard Margin and Soft Margin SVM lies in how strictly the algorithm enforces the separation of data points when creating the decision boundary (hyperplane).

**1. Hard Margin SVM:**

- Hard Margin SVM strictly separates the data.

- It assumes the data is linearly separable, meaning there exists a clear margin between the two classes with no overlap or misclassification.

**How it works:**
- Finds a hyperplane that perfectly separates the data with the maximum margin.

- No tolerance for misclassified data points.

**Mathematical Constraint:**

$y_i (w \cdot x_i + b) \geq 1$ for all $i$

**Limitation:**
- Highly sensitive to outliers.

- Not suitable for real-world noisy data.

**2. Soft Margin SVM:**

**Definition:**
- Soft Margin SVM allows some misclassification of data points to achieve better generalization.

- Designed to handle data that is not perfectly separable, such as overlapping or noisy classes.

**How it works:**
- Introduces slack variables (ξᵢ) that allow some violations of the margin constraints.

- Includes a regularization parameter $C$ that balances margin width and classification error:

  - Large $C$: Less tolerance for misclassification (tries to fit the training data closely).

  - Small $C$: More tolerance for margin violations and a wider margin, which often generalizes better but can underfit if $C$ is too small.

**Mathematical Constraint:**

$y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$

**Key Differences Table:**

| Feature              | Hard Margin SVM              | Soft Margin SVM                       |
| -------------------- | ---------------------------- | ------------------------------------- |
| Data Requirement     | Perfectly linearly separable | Works with non-separable, noisy data  |
| Misclassification    | Not allowed                  | Allowed (with penalty)                |
| Flexibility          | Rigid                        | Flexible                              |
| Sensitivity to Noise | Very high                    | Lower                                 |
| Use of Slack (ξᵢ)    | No                           | Yes                                   |
| Regularization (C)   | Not used                     | Used to control margin–error tradeoff |
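To make the role of C concrete, here is a small sketch that fits a linear SVM at several values of C and reports the margin width and the number of support vectors. The overlapping `make_blobs` data and the specific C values are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so no hard-margin solution exists (synthetic data)
X, y = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                  cluster_std=1.2, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])      # margin width 2/||w||
    results[C] = (width, int(clf.n_support_.sum()))  # total support vectors
    print(f"C={C:<6} margin width={width:.3f}  support vectors={results[C][1]}")
```

A smaller C typically gives a wider margin with more points inside it, while a larger C narrows the margin to reduce training errors.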


**Question 3:** What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.


**Answer:** The Kernel Trick is a mathematical technique used in Support Vector Machines (SVM) to enable the algorithm to solve non-linearly separable problems by implicitly mapping data into a higher-dimensional space—without actually computing the coordinates in that space.

This allows SVM to find a linear hyperplane in that higher-dimensional space, which corresponds to a non-linear decision boundary in the original space.

**Need:**
In many real-world problems, data cannot be separated by a straight line (linear boundary). The kernel trick helps by:

- Transforming input features into a higher-dimensional space where a linear separator may exist.

- Doing this transformation efficiently using kernel functions instead of explicitly calculating the mapping.

SVM optimization depends on the data only through dot products of feature vectors, so each dot product can be replaced by a kernel function:

$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$

The kernel function $K(x_i, x_j)$ returns the value of this dot product in the high-dimensional space while working only with the original inputs, avoiding the need to compute $\phi(x)$ explicitly.
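The identity above can be verified on a small case. This sketch assumes a homogeneous degree-2 polynomial kernel on 2-D inputs, for which one valid explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the vectors are arbitrary illustrative values.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative choice)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(u, v):
    """Homogeneous degree-2 polynomial kernel: K(u, v) = (u . v)^2."""
    return np.dot(u, v) ** 2

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, -1.0])

explicit = np.dot(phi(x_i), phi(x_j))   # dot product in the 3-D feature space
implicit = poly2_kernel(x_i, x_j)       # same value computed from the 2-D inputs

print(explicit, implicit)
```

Both numbers agree, but the kernel never builds the 3-D vectors. For kernels such as RBF the implied feature space is infinite-dimensional, so the explicit route is not available at all.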

**Example of a Kernel Function:**

Radial Basis Function (RBF) Kernel / Gaussian Kernel:

$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$

$\gamma$ controls how far the influence of a single training example reaches. A higher $\gamma$ means only nearby points are influenced, which produces a more tightly curved decision boundary.

**Use Case:**

Image classification (e.g., handwritten digit recognition like MNIST dataset)

- Data points (images) are complex and not linearly separable in raw pixel form.

- RBF kernel maps them into a space where SVM can find a hyperplane that separates digits like ‘3’ and ‘8’ even when they have similar shapes.
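As a quick illustration of the RBF formula and the effect of γ, the sketch below evaluates the kernel by hand and compares it with scikit-learn's `rbf_kernel`. The two points and the γ values are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2), written out directly."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

x_i = np.array([0.0, 0.0])
x_j = np.array([1.0, 1.0])   # squared distance to x_i is 2

for gamma in (0.1, 1.0, 10.0):
    manual = rbf(x_i, x_j, gamma)
    library = rbf_kernel(x_i.reshape(1, -1), x_j.reshape(1, -1), gamma=gamma)[0, 0]
    print(f"gamma={gamma:<5} manual={manual:.6f}  sklearn={library:.6f}")
```

As γ grows, the similarity between the same two points shrinks toward zero, which is why a large γ makes each training example matter only in its immediate neighborhood.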

**Other Common Kernels:**

| Kernel Type | Formula                                         | Use Case                                     |
| ----------- | ----------------------------------------------- | -------------------------------------------- |
| Linear      | $K(x_i, x_j) = x_i \cdot x_j$                   | High-dimensional but linearly separable data |
| Polynomial  | $K(x_i, x_j) = (x_i \cdot x_j + c)^d$           | Text classification, NLP                     |
| Sigmoid     | $K(x_i, x_j) = \tanh(\alpha x_i \cdot x_j + c)$ | Neural network similarity                    |


**Question 4:** What is a Naïve Bayes Classifier, and why is it called “naïve”?

**Answer:** The Naïve Bayes Classifier is a probabilistic machine learning algorithm based on Bayes' Theorem. It is primarily used for classification tasks, such as text classification, spam filtering, and sentiment analysis.

It calculates the probability of a class given a set of features and chooses the class with the highest posterior probability.

**Bayes’ Theorem Recap:**

$P(C \mid X) = \dfrac{P(X \mid C) \cdot P(C)}{P(X)}$

Where:

- $P(C \mid X)$: Posterior probability (class given features)

- $P(X \mid C)$: Likelihood (features given class)

- $P(C)$: Prior probability (class probability)

- $P(X)$: Evidence (probability of features)

It’s called naïve because it assumes all features are independent of each other given the class label, which is rarely true in real-world data.

**Example:**
In spam filtering, features might be the presence of words like "free", "money", and "click".

- Naïve Bayes assumes the appearance of the word “free” is independent of the word “money” given the message is spam — which is obviously a simplifying assumption.

Despite this unrealistic assumption, Naïve Bayes often works surprisingly well, especially in text-based applications.
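The spam example can be worked through numerically. All probabilities below are hypothetical values chosen only to show the arithmetic; they do not come from any dataset.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"free": 0.60, "money": 0.50},
    "ham":  {"free": 0.05, "money": 0.10},
}

def posterior(words):
    """Posterior P(class | words) under the conditional-independence assumption."""
    scores = {}
    for label in prior:
        # Unnormalized score: P(C) times the product of P(word | C)
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    evidence = sum(scores.values())  # P(X), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}

result = posterior(["free", "money"])
print(result)
```

The "naïve" step is the inner loop: each word's likelihood is multiplied in independently, with no term for how "free" and "money" co-occur.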

**Types of Naïve Bayes Classifiers:**

| Type               | Feature Type          | Example Use Case    |
| ------------------ | --------------------- | ------------------- |
| **Gaussian NB**    | Continuous features   | Medical diagnosis   |
| **Multinomial NB** | Discrete word counts  | Text classification |
| **Bernoulli NB**   | Binary features (0/1) | Spam detection      |


**Advantages:**

- Very fast and scalable.

- Works well with high-dimensional data (e.g., text).

- Performs well even with relatively small datasets.

**Question 5:** Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?


**Answer:** Naïve Bayes classifiers have three main variants, each suited to a different type of feature distribution in your data. The choice of variant depends on the nature of the input features.

**1. Gaussian Naïve Bayes**
- Assumes that the features follow a normal (Gaussian) distribution.

- Typically used when features are continuous numerical values.

**Formula:**

The likelihood $P(x_i \mid y)$ is computed using the Gaussian (normal) distribution:

$P(x_i \mid y) = \dfrac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\dfrac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$

Here $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $x_i$ among the training samples of class $y$.


**Use Case:**

- Medical diagnosis (e.g., classifying diseases based on blood pressure, cholesterol, etc.)

- Iris flower classification (based on petal/sepal lengths)

- Sensor data classification.

**2. Multinomial Naïve Bayes**

**Description:**
- Assumes features are discrete counts (e.g., word frequencies).

- Used when data represents counts or frequencies, especially in document classification.

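**Formula:**
The likelihood is based on the frequency of each feature (like word count) in a class, with add-one (Laplace) smoothing so that unseen words do not get zero probability:

$P(x_i \mid y) = \dfrac{\text{count}(x_i \text{ in class } y) + 1}{\text{total count of all features in class } y + n}$

where $n$ is the number of distinct features (the vocabulary size).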

**Use Case:**
Text classification, such as:

- Spam detection

- News categorization

- Sentiment analysis

- Works well with Bag of Words or TF-IDF feature representations
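The smoothed multinomial formula can be reproduced on a toy count matrix and compared with `MultinomialNB`. The counts and labels below are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix (rows: documents, columns: words); values are made up
X = np.array([
    [2, 1, 0],
    [3, 0, 1],
    [0, 2, 4],
    [1, 0, 3],
])
y = np.array([0, 0, 1, 1])

n_features = X.shape[1]
manual = np.zeros((2, n_features))
for c in (0, 1):
    counts = X[y == c].sum(axis=0)                          # count(x_i in class c)
    manual[c] = (counts + 1) / (counts.sum() + n_features)  # add-one smoothing

nb = MultinomialNB(alpha=1.0).fit(X, y)   # alpha=1.0 corresponds to Laplace smoothing
print(manual)
print(np.exp(nb.feature_log_prob_))
```

The third word never appears in class 0 documents more than once, yet no probability is zero, which is the purpose of the smoothing term.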

**3. Bernoulli Naïve Bayes**

- Assumes binary features (0 or 1), representing presence or absence of a feature.

- Suitable when features are booleans (e.g., “Is the word ‘offer’ present in the email?”).

**Formula:**
Likelihoods are computed for the two binary outcomes:

$P(x_i = 1 \mid y)$ and $P(x_i = 0 \mid y) = 1 - P(x_i = 1 \mid y)$

**Summary Table:**

| Variant            | Feature Type    | Distribution Assumed | Typical Use Case                        |
| ------------------ | --------------- | -------------------- | --------------------------------------- |
| **Gaussian NB**    | Continuous      | Normal (bell curve)  | Sensor data, medical stats              |
| **Multinomial NB** | Discrete counts | Multinomial          | Word count-based text classification    |
| **Bernoulli NB**   | Binary (0/1)    | Bernoulli (yes/no)   | Binary-feature models, keyword presence |
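For the Bernoulli variant, the two likelihoods per feature can be estimated from document frequencies. The sketch below uses a made-up presence/absence matrix and the add-one smoothing that `BernoulliNB` applies when `alpha=1.0` (an assumption about the smoothing choice, not something stated in the text above).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy presence/absence matrix (1 = word present); values are made up
X = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
])
y = np.array([1, 1, 0, 0, 0])

manual = np.zeros((2, X.shape[1]))
for c in (0, 1):
    docs = X[y == c]
    # Smoothed P(x_i = 1 | y = c): (docs containing the feature + 1) / (docs in class + 2)
    manual[c] = (docs.sum(axis=0) + 1) / (len(docs) + 2)

bnb = BernoulliNB(alpha=1.0).fit(X, y)
p_present = np.exp(bnb.feature_log_prob_)   # P(x_i = 1 | y)
p_absent = 1.0 - p_present                  # P(x_i = 0 | y)
print(p_present)
print(p_absent)
```

Unlike the multinomial variant, the absence of a word contributes to the score here through the `p_absent` term.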


**Answer:**

```python
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = datasets.load_iris()
X = iris.data        # Features
y = iris.target      # Labels

# Step 2: Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train an SVM Classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Step 4: Make predictions on the test data
y_pred = svm_model.predict(X_test)

# Step 5: Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Step 6: Print the support vectors
print("\nSupport Vectors:")
print(svm_model.support_vectors_)

# Optionally: print the training-set indices of the support vectors (grouped by class)
print("\nSupport Vector Indices for each class:")
print(svm_model.support_)
```

```
Model Accuracy: 100.00%

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]

Support Vector Indices for each class:
[ 31  33  91  22  45  54  59  60  62  73  79  80 105 110   5  16  30  42
  68  81  87 101 112 113 116]
```


**Question 7:** Write a Python program to:

- Load the Breast Cancer dataset
- Train a Gaussian Naïve Bayes model
- Print its classification report including precision, recall, and F1-score.

**Answer:**

```python
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Step 1: Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data        # Features
y = data.target      # Target labels

# Step 2: Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Step 4: Make predictions
y_pred = gnb.predict(X_test)

# Step 5: Print the classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

```
Classification Report:

              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
```



**Question 8:** Write a Python program to:
- Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
- Print the best hyperparameters and accuracy.

**Answer:**

```python
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 1: Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Step 2: Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1]
}

# Step 4: Initialize SVM with RBF kernel
svm = SVC(kernel='rbf')

# Step 5: Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Step 6: Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 7: Output results
print("Best Hyperparameters:", grid_search.best_params_)
print(f"Test Accuracy: {accuracy * 100:.2f}%")
```

```
Best Hyperparameters: {'C': 100, 'gamma': 0.001}
Test Accuracy: 83.33%
```
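The grid search above runs on unscaled features. Because the RBF kernel depends on Euclidean distances, features with large ranges (in the Wine data, proline is in the hundreds while hue is around 1) can dominate the kernel. A possible variation, sketched here under the assumption that standardization is appropriate for this data, wraps a `StandardScaler` and the SVM in a `Pipeline` so that scaling is refit inside each cross-validation fold. The step names `scale` and `svm` are arbitrary labels.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Same grid as above; the "svm__" prefix routes parameters to the SVC step
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

scaled_accuracy = search.score(X_test, y_test)
print("Best hyperparameters:", search.best_params_)
print(f"Test accuracy with scaling: {scaled_accuracy * 100:.2f}%")
```

Comparing this result with the unscaled run shows how much of the earlier error was due to feature scale rather than to the choice of C and gamma.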


**Question 9:** Write a Python program to:
- Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
- Print the model's ROC-AUC score for its predictions.

**Answer:**

```python
# Import required libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Step 1: Load a subset of the 20 Newsgroups dataset (binary classification for ROC-AUC)
categories = ['rec.sport.baseball', 'sci.med']  # binary classes
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = data.data
y = data.target  # Binary: 0 or 1

# Step 2: Vectorize the text using TF-IDF
# Note: fitting the vectorizer before the split lets test-set vocabulary statistics
# influence training; fit it on the training split only for a stricter evaluation.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_vec = vectorizer.fit_transform(X)

# Step 3: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)

# Step 4: Train a Multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 5: Get predicted probabilities and compute ROC-AUC
y_probs = model.predict_proba(X_test)[:, 1]  # probability for class 1
roc_auc = roc_auc_score(y_test, y_probs)

# Step 6: Print ROC-AUC Score
print(f"ROC-AUC Score: {roc_auc:.4f}")
```

```
ROC-AUC Score: 0.9971
```


**Question 10:** Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
- Text with diverse vocabulary
- Potential class imbalance (far more legitimate emails than spam)
- Some incomplete or missing data

Explain the approach you would take to:
- Preprocess the data (e.g. text vectorization, handling missing data)
- Choose and justify an appropriate model (SVM vs. Naïve Bayes)
- Address class imbalance
- Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

**Answer:** Here's a structured approach to building an email spam classifier under real-world constraints:

**1. Preprocessing the Data**
Emails are unstructured and noisy, so preprocessing is essential.

**Text Cleaning & Normalization**
- Lowercase conversion

- Remove punctuation, numbers, and special characters

- Remove HTML tags and email headers (if present)

- Remove stopwords (e.g., "the", "and", "is")

- Apply stemming or lemmatization

**Handling Missing Data**
- Missing text: Replace with a placeholder like "no content" (drop the record only if every field is empty).

- Missing metadata (e.g., subject line): Use domain knowledge to decide whether to impute, drop, or treat as a separate feature.

**Vectorization (Feature Extraction)**
- Use TF-IDF vectorization (Term Frequency–Inverse Document Frequency) to convert emails to numerical features while capturing word importance.

- Consider adding additional features, computed on the raw text before cleaning so that capitalization and links are still visible:

  - Number of links

  - Number of capital letters

  - Presence of spammy words (e.g., “free”, “offer”)

  - Email length
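The preprocessing steps above might be sketched as follows. The helper names `clean_email` and `extra_features`, the regular expressions, and the three example emails are all hypothetical; stemming and lemmatization are left out to keep the example dependency-free.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_email(text):
    """Basic normalization; None or blank input becomes a placeholder."""
    if text is None or not text.strip():
        return "no content"
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)       # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip()

def extra_features(text):
    """Hand-crafted counts taken from the raw text, before cleaning."""
    raw = text or ""
    return {
        "n_links": len(re.findall(r"https?://", raw)),
        "n_capitals": sum(ch.isupper() for ch in raw),
        "length": len(raw),
    }

# Hypothetical example emails
emails = [
    "<p>Click NOW for FREE money: http://example.com</p>",
    None,
    "Meeting moved to 3pm, see agenda attached.",
]

cleaned = [clean_email(e) for e in emails]
vectorizer = TfidfVectorizer(stop_words="english")  # also removes stopwords
X_tfidf = vectorizer.fit_transform(cleaned)

print(cleaned)
print(X_tfidf.shape)
print([extra_features(e) for e in emails])
```

In a real system the vectorizer would be fit on the training split only and the extra features would be concatenated with the TF-IDF matrix before training.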

**2. Model Selection: SVM vs. Naïve Bayes**

| Aspect                             | Naïve Bayes                                  | SVM                                         |
| ---------------------------------- | -------------------------------------------- | ------------------------------------------- |
| **Speed**                          | Very fast                                    | Slower with large data                      |
| **Assumptions**                    | Assumes word independence (naïve)            | No distribution assumptions                 |
| **Works well on**                  | Text classification with clear word patterns | High-dimensional sparse data (e.g., TF-IDF) |
| **Performance on Imbalanced Data** | Decent with class priors                     | Needs tuning and imbalance handling         |


**Recommendation:**

- Start with Multinomial Naïve Bayes for a baseline: fast, interpretable, great for word-count-based models.

- Try SVM with class weighting if higher accuracy or margin-based classification is needed.

- Use ensemble methods (e.g., Random Forest, XGBoost) if interpretability is less of a concern and resources allow.

**3. Handling Class Imbalance**

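When spam makes up only a small share of the data (say 10-20%), a model can reach high accuracy simply by favoring the legitimate class, so the imbalance has to be handled explicitly.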

**Techniques:**

- **Class weights:** Set class_weight='balanced' in SVM or tree-based models.

- **Resampling:**

  - Oversample the minority class (e.g., using SMOTE)

  - Undersample the majority class if the dataset is large

- **Threshold tuning:** Adjust the decision threshold based on the ROC or Precision-Recall curve.
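Two of these techniques, class weights and threshold tuning, can be sketched on synthetic data. The `make_classification` dataset with about 10% positives stands in for a spam corpus, and the 0.90 precision target is an arbitrary assumption; a real project would pick the target from business requirements.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for an imbalanced spam problem: about 10% positives
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Class weights: errors on the rare class cost more during training
plain = SVC(kernel="linear").fit(X_train, y_train)
weighted = SVC(kernel="linear", class_weight="balanced").fit(X_train, y_train)
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"recall on the rare class: plain={recall_plain:.2f}, balanced={recall_weighted:.2f}")

# Threshold tuning: pick the lowest score cutoff that keeps precision >= 0.90
scores = weighted.decision_function(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, scores)
ok = precision[:-1] >= 0.90          # precision[:-1] aligns with thresholds
chosen = thresholds[ok][0] if ok.any() else thresholds[-1]
tuned_pred = (scores >= chosen).astype(int)
print(f"chosen threshold={chosen:.3f}, "
      f"recall at that threshold={recall_score(y_test, tuned_pred):.2f}")
```

Class weighting usually raises recall on the rare class at some cost in precision; moving the threshold afterwards lets the two be traded off without retraining.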

**4. Evaluation Metrics**

| Metric               | Why it's important                                         |
| -------------------- | ---------------------------------------------------------- |
| **Precision**        | Avoids false positives (e.g., legit emails marked as spam) |
| **Recall**           | Captures actual spam (minimize false negatives)            |
| **F1-score**         | Balances precision & recall                                |
| **ROC-AUC**          | Overall ability to separate classes                        |
| **PR-AUC**           | Better for imbalanced data                                 |
| **Confusion Matrix** | Visualizes type of errors made                             |


**5. Business Impact**
**Benefits of an Accurate Spam Filter:**

- Customer trust: Ensures important legitimate emails aren’t mislabeled

- Security: Filters phishing or scam emails early

- Efficiency: Reduces manual review of junk emails

- Productivity: Employees face fewer distractions

- Reputation: Keeps customer-facing communication clean

**Summary**

| Step              | Action                                                            |
| ----------------- | ----------------------------------------------------------------- |
| **Preprocessing** | Clean, tokenize, use TF-IDF, handle missing data                  |
| **Model Choice**  | Start with Naïve Bayes, upgrade to SVM if needed                  |
| **Imbalance**     | Use class weights or resampling                                   |
| **Evaluation**    | Focus on precision, recall, F1-score, and ROC/PR-AUC              |
| **Impact**        | Improves productivity, protects from threats, enhances user trust |





```python
# Step 1: Import required libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

# Step 2: Load binary classes as spam-like vs. ham-like
categories = ['rec.sport.hockey', 'talk.politics.misc']  # Hockey = not spam, Politics = spam-like
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X_raw = data.data
y = data.target  # 0 and 1

# Step 3: Handle missing data
X = ['no content' if x.strip() == '' else x for x in X_raw]

# Step 4: Text vectorization using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_vec = vectorizer.fit_transform(X)

# Step 5: Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42, stratify=y)

# ----------- Naïve Bayes Model ----------- #
print("=== Naïve Bayes Classifier ===")
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
nb_preds = nb_model.predict(X_test)
nb_probs = nb_model.predict_proba(X_test)[:, 1]

# Evaluation
print("\nClassification Report:")
print(classification_report(y_test, nb_preds))
print("ROC-AUC Score:", roc_auc_score(y_test, nb_probs))
print("Confusion Matrix:\n", confusion_matrix(y_test, nb_preds))

# ----------- SVM Model (with class weight handling) ----------- #
print("\n=== SVM Classifier with Class Weight ===")
svm_model = SVC(kernel='linear', probability=True, class_weight='balanced', random_state=42)
svm_model.fit(X_train, y_train)
svm_preds = svm_model.predict(X_test)
svm_probs = svm_model.predict_proba(X_test)[:, 1]

# Evaluation
print("\nClassification Report:")
print(classification_report(y_test, svm_preds))
print("ROC-AUC Score:", roc_auc_score(y_test, svm_probs))
print("Confusion Matrix:\n", confusion_matrix(y_test, svm_preds))
```

```
=== Naïve Bayes Classifier ===

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.96      0.95       200
           1       0.95      0.92      0.94       155

    accuracy                           0.95       355
   macro avg       0.95      0.94      0.95       355
weighted avg       0.95      0.95      0.95       355

ROC-AUC Score: 0.9924677419354838
Confusion Matrix:
 [[193   7]
 [ 12 143]]

=== SVM Classifier with Class Weight ===

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.92      0.94       200
           1       0.90      0.95      0.92       155

    accuracy                           0.93       355
   macro avg       0.93      0.93      0.93       355
weighted avg       0.93      0.93      0.93       355

ROC-AUC Score: 0.9742741935483871
Confusion Matrix:
 [[183  17]
 [  8 147]]
```
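The precision, recall, and F1 values in the reports follow directly from the confusion matrices. As a worked check, the sketch below recomputes the class 1 (spam-like) metrics from the two matrices printed above; the helper name `binary_metrics` is hypothetical.

```python
import numpy as np

# Confusion matrices copied from the output above (rows: true class, columns: predicted)
confusion = {
    "Naive Bayes": np.array([[193, 7], [12, 143]]),
    "SVM (balanced)": np.array([[183, 17], [8, 147]]),
}

def binary_metrics(cm):
    """Precision, recall, and F1 for class 1 from a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {name: binary_metrics(cm) for name, cm in confusion.items()}
for name, (p, r, f1) in metrics.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The numbers show the trade-off in this run: the class-weighted SVM misses fewer class 1 messages (8 versus 12) but flags more class 0 messages incorrectly (17 versus 7), which matters if false positives on legitimate email are the costlier error.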
