# SVM & Naïve Bayes Assignment

## Q1. What is a Support Vector Machine (SVM), and how does it work?

**Answer:**  
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks.  
It works by finding the **optimal hyperplane** that best separates data points of different classes.  

- The hyperplane is chosen to **maximize the margin**, i.e., the distance between the hyperplane and the nearest data points from each class (called **support vectors**).  
- For **linearly separable data**, SVM finds a straight-line (or hyperplane) separator.  
- For **non-linear data**, SVM uses the **kernel trick** to implicitly map the data into a higher-dimensional space where a linear separator may exist.  
- This approach makes SVM powerful for handling both linear and non-linear classification problems.
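
The geometry above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a hypothetical 2-D hyperplane `x1 + x2 - 3 = 0` and three made-up points; the helper names `decision_value` and `distance_to_hyperplane` are illustrative and not part of any library.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm_w

# Hypothetical hyperplane and points, chosen only for illustration
w, b = [1.0, 1.0], -3.0
points = {"A": [1.0, 1.0], "B": [2.0, 2.0], "C": [4.0, 3.0]}

for name, x in points.items():
    score = decision_value(w, b, x)
    predicted = 1 if score >= 0 else -1
    print(name, "score:", score, "class:", predicted,
          "distance:", round(distance_to_hyperplane(w, b, x), 3))

# With the usual scaling |w.x + b| = 1 on the support vectors,
# the full margin width is 2 / ||w||.
margin_width = 2 / math.sqrt(sum(wi * wi for wi in w))
print("Margin width:", round(margin_width, 3))
```

Points A and B have a score of exactly -1 and +1, so under this scaling they would be the support vectors, while C lies well outside the margin and does not influence the boundary.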

## Q2. Explain the difference between Hard Margin and Soft Margin SVM.

**Answer:**  
Support Vector Machines can separate data using two main approaches: **Hard Margin** and **Soft Margin**.  

- **Hard Margin SVM**  
  - Assumes the data is perfectly linearly separable.  
  - No misclassification is allowed.  
  - Finds the hyperplane with the maximum margin that correctly classifies all points.  
  - Very sensitive to noise and outliers.  

- **Soft Margin SVM**  
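  - Allows some points to violate the margin or be misclassified, using slack variables.  
  - Balances between maximizing the margin and minimizing classification errors.  
  - Controlled by the regularization parameter **C**: a large C penalizes violations heavily (narrower margin, closer to hard margin), while a small C tolerates more violations in exchange for a wider margin.  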
  - More robust and widely used in real-world problems where data is noisy or overlapping.
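
The trade-off controlled by **C** can be made concrete by evaluating the standard soft-margin objective, `0.5 * ||w||^2 + C * sum(slack)`, where each slack is the hinge loss `max(0, 1 - y * (w.x + b))`. The sketch below assumes a fixed, hypothetical hyperplane and four made-up points; it only evaluates the objective and does not solve the optimization problem.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for the soft-margin SVM primal problem."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    slacks = []
    for x, label in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Hinge loss: zero when the point is on the correct side, outside the margin
        slacks.append(max(0.0, 1.0 - label * score))
    return regularizer + C * sum(slacks), slacks

# Hypothetical data: the third point is labelled +1 but sits on the negative side
X = [[1.0, 1.0], [2.0, 2.0], [1.8, 1.0], [4.0, 3.0]]
y = [-1, 1, 1, 1]
w, b = [1.0, 1.0], -3.0

for C in (0.1, 10.0):
    objective, slacks = soft_margin_objective(w, b, X, y, C)
    print("C =", C, "objective =", round(objective, 3),
          "slacks =", [round(s, 3) for s in slacks])
```

Only the misplaced point has non-zero slack. With a small C its violation barely changes the objective, whereas with a large C it dominates, which pushes the optimizer toward a boundary that fits every point, as hard margin does.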

## Q3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

**Answer:**  
The **Kernel Trick** is a technique in SVM that allows the algorithm to handle non-linear data by mapping it into a higher-dimensional space, without explicitly computing the transformation.  
Instead of building the transformed features, SVM uses a **kernel function** that returns the dot product two points would have in the higher-dimensional space, computed directly from the original features, which keeps the method efficient.  

- **Example: Radial Basis Function (RBF) Kernel**  
  - Formula: \( K(x, x') = \exp(-\gamma \|x - x'\|^2) \)  
  - Use case: When data has complex, non-linear decision boundaries (e.g., classifying images, medical data, or text).  
  - RBF can capture local patterns and flexible boundaries, making it one of the most commonly used kernels in practice.
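
The claim that a kernel computes a higher-dimensional dot product without building the features can be verified on a small case. The sketch below uses a degree-2 polynomial kernel, because its feature map is finite and easy to write down (the RBF kernel corresponds to an infinite-dimensional map, so it cannot be checked this way). The function names and the sample vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x.z)^2."""
    return dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2), matching the formula above."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

x, z = [1.0, 2.0], [3.0, 0.5]
explicit = dot(phi(x), phi(z))   # map to 3-D first, then take the dot product
implicit = poly2_kernel(x, z)    # stay in 2-D and apply the kernel
print("Explicit:", explicit, "Kernel:", implicit)
print("RBF similarity:", rbf_kernel(x, z))
```

Both routes give the same value, but the kernel never constructs `phi(x)`. The RBF value behaves like a similarity score: it equals 1 for identical points and decays toward 0 as they move apart, at a rate set by `gamma`.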

## Q4. What is a Naïve Bayes Classifier, and why is it called “naïve”?

**Answer:**  
A **Naïve Bayes Classifier** is a probabilistic machine learning model based on **Bayes’ Theorem**.  
It predicts the probability of a class given input features and chooses the class with the highest probability.  

- Formula:  
  \( P(y \mid X) = \frac{P(X \mid y) \cdot P(y)}{P(X)} \)  

- Why it is called **“naïve”**:  
  - It assumes that all features are **conditionally independent** given the class label.  
  - In reality, features are often correlated, so this assumption is rarely true.  
  - Despite this simplification, it works very well in many applications, especially in **text classification** and **spam filtering**.
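
Under the independence assumption, the likelihood factorizes as \( P(X \mid y) = \prod_i P(x_i \mid y) \), so the posterior can be computed by hand. The following is a small worked sketch; the prior and per-word probabilities are made-up numbers for illustration, not estimates from any dataset.

```python
# Hypothetical probabilities, chosen only for illustration
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": {"free": 0.6, "meeting": 0.05},
    "ham": {"free": 0.1, "meeting": 0.4},
}

def posterior(words):
    """P(class | words) using Bayes' theorem with the naive independence assumption."""
    joint = {}
    for label, prior in priors.items():
        p = prior
        for word in words:
            p *= likelihoods[label][word]  # multiply per-feature likelihoods
        joint[label] = p
    evidence = sum(joint.values())  # P(X), the normalizing constant
    return {label: p / evidence for label, p in joint.items()}

print(posterior(["free"]))
print(posterior(["free", "meeting"]))
```

Seeing only "free" gives a spam posterior of 0.18 / 0.25 = 0.72. Adding "meeting" multiplies in a second, independent factor and drops the spam posterior to roughly 0.24, so the predicted class flips to ham.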

## Q5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

**Answer:**  
Naïve Bayes has different variants depending on the type of feature distribution:  

- **Gaussian Naïve Bayes**  
  - Assumes features follow a **normal (Gaussian) distribution**.  
  - Suitable for **continuous data** (e.g., medical measurements, sensor values).  

- **Multinomial Naïve Bayes**  
  - Works with **discrete counts** (e.g., word frequencies).  
  - Commonly used in **text classification** with word-count features; fractional TF-IDF values are not true counts but often work well in practice.  

- **Bernoulli Naïve Bayes**  
  - Assumes **binary features** (presence/absence).  
  - Useful when only the occurrence of a feature matters (e.g., whether a word appears in a document, yes/no).  

**Summary:**  
- Use **Gaussian NB** for continuous data.  
- Use **Multinomial NB** for count-based text data.  
- Use **Bernoulli NB** for binary feature data.
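
The three variants differ only in how they score a feature vector given a class. The sketch below writes out each per-class log-likelihood in plain Python; the parameter values are hypothetical, and the multinomial version omits the multinomial coefficient because it is the same for every class and does not affect the prediction.

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log density of one continuous feature under a normal distribution."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def multinomial_log_likelihood(counts, word_probs):
    """Each occurrence of a word contributes log P(word | class)."""
    return sum(c * math.log(p) for c, p in zip(counts, word_probs))

def bernoulli_log_likelihood(present, probs):
    """Presence contributes log p; absence contributes log (1 - p)."""
    return sum(math.log(p) if flag else math.log(1 - p)
               for flag, p in zip(present, probs))

# Hypothetical document over a 3-word vocabulary
counts = [3, 0, 1]
present = [c > 0 for c in counts]

multinomial_score = multinomial_log_likelihood(counts, [0.5, 0.2, 0.3])
bernoulli_score = bernoulli_log_likelihood(present, [0.8, 0.3, 0.4])
gaussian_score = gaussian_log_likelihood(5.1, mean=5.0, var=0.25)

print(multinomial_score, bernoulli_score, gaussian_score)
```

Note the practical difference between the two discrete variants: the multinomial score ignores words that do not occur, while the Bernoulli score explicitly penalizes or rewards the absence of each vocabulary word.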

## Q6. Write a Python program to:
- Load the Iris dataset  
- Train an SVM Classifier with a linear kernel  
- Print the model's accuracy and support vectors.  

**Answer:**



In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42, stratify=iris.target
)

# Train SVM with linear kernel
svm_linear = SVC(kernel="linear", random_state=42)
svm_linear.fit(X_train, y_train)

# Predictions and accuracy
y_pred = svm_linear.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Accuracy:", accuracy)
print("Number of support vectors per class:", svm_linear.n_support_)
print("First 5 support vectors:\n", svm_linear.support_vectors_[:5])


Accuracy: 1.0
Number of support vectors per class: [ 3 10  9]
First 5 support vectors:
 [[5.1 3.8 1.9 0.4]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [6.9 3.1 4.9 1.5]
 [6.  2.9 4.5 1.5]]


## Q7. Write a Python program to:
- Load the Breast Cancer dataset  
- Train a Gaussian Naïve Bayes model  
- Print its classification report including precision, recall, and F1-score.  

**Answer (using Breast Cancer dataset):**

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.25, random_state=42, stratify=cancer.target
)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predictions
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred))


Classification Report:

              precision    recall  f1-score   support

           0       0.96      0.87      0.91        53
           1       0.93      0.98      0.95        90

    accuracy                           0.94       143
   macro avg       0.94      0.92      0.93       143
weighted avg       0.94      0.94      0.94       143



## Q8. Write a Python program to:
- Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.  
- Print the best hyperparameters and accuracy.  

**Answer (using Wine dataset):**


In [3]:
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=42, stratify=wine.target
)

# Define parameter grid for GridSearchCV
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["rbf"]
}

# Train SVM with GridSearchCV
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_

# Predictions and accuracy
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid.best_params_)
print("Accuracy on Test Set:", accuracy)


Best Parameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Accuracy on Test Set: 0.7777777777777778
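
A likely contributor to the modest test accuracy above is that the Wine features sit on very different scales, and the RBF kernel depends on distances between raw feature values. The variant below is a sketch, assuming a `StandardScaler` step placed inside a pipeline so that scaling statistics are learned on each training fold only; the `svc__` prefixes follow the step names that `make_pipeline` generates. It is offered as a comparison, not as the result reported above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=42, stratify=wine.target
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

scaled_grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
scaled_grid.fit(X_train, y_train)

# score() uses the refitted best pipeline and the same accuracy metric
scaled_accuracy = scaled_grid.score(X_test, y_test)
print("Best Parameters:", scaled_grid.best_params_)
print("Accuracy on Test Set:", scaled_accuracy)
```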


## Q9. Write a Python program to:  
- Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using `sklearn.datasets.fetch_20newsgroups`).  
- Print the model's ROC-AUC score for its predictions.  

**Answer:**

In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

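# Load a two-category subset of the 20 Newsgroups text dataset.
# Headers, footers, and quotes are kept, which makes the task easier;
# pass remove=("headers", "footers", "quotes") for a more realistic benchmark.
newsgroups = fetch_20newsgroups(
    subset="all",
    categories=["rec.sport.baseball", "sci.med"],
    shuffle=True,
    random_state=42,
)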

X, y = newsgroups.data, newsgroups.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Text vectorization using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train Naive Bayes Classifier
nb = MultinomialNB()
nb.fit(X_train_vec, y_train)

# Predict probabilities
y_prob = nb.predict_proba(X_test_vec)[:, 1]

# ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)

print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 0.9999187031526917
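
To interpret the ROC-AUC value, it helps to know that it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on four made-up scores; `roc_auc_pairwise` is an illustrative helper and is quadratic in the number of samples, so it is meant for understanding rather than for large datasets.

```python
def roc_auc_pairwise(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for label, s in zip(y_true, y_score) if label == 1]
    negatives = [s for label, s in zip(y_true, y_score) if label == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(y_true, y_score)
print("ROC-AUC:", auc)
```

Three of the four positive-negative pairs are ordered correctly here, giving 0.75. A score near 1.0, as in the output above, means almost every positive document is ranked above almost every negative one.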


## Q10. Imagine you’re working as a data scientist for a company that handles email communications.  
Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:  
- Text with diverse vocabulary  
- Potential class imbalance (far more legitimate emails than spam)  
- Some incomplete or missing data  

Explain the approach you would take to:  
- Preprocess the data (e.g. text vectorization, handling missing data)  
- Choose and justify an appropriate model (SVM vs. Naïve Bayes)  
- Address class imbalance  
- Evaluate the performance of your solution with suitable metrics  

And explain the business impact of your solution.  

**Answer:**  
### 1. Data Preprocessing
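- **Handling Missing Data:** Replace missing text fields with empty strings so the vectorizer always receives a string; impute or flag missing values in any non-text fields that are used as features.  
- **Text Vectorization:** Use **TF-IDF Vectorizer** or **CountVectorizer** to convert raw text into numerical features.  
- **Feature Scaling:** Not required for Naïve Bayes, but useful for SVM; TF-IDF vectors are L2-normalized by default in scikit-learn, which is usually sufficient.  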
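
The preprocessing steps above can be illustrated without any library. This sketch replaces missing emails with empty strings and computes TF-IDF weights using the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the default behaviour documented for scikit-learn's `TfidfVectorizer`. The tokenizer is a plain whitespace split, which is a simplifying assumption, and the emails are made up.

```python
import math
from collections import Counter

def clean(text):
    """Treat missing or non-string entries as empty text."""
    return text if isinstance(text, str) else ""

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [clean(d).lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: v / norm for t, v in raw.items()} if norm else {})
    return vectors

emails = ["win a free prize now", None, "meeting moved to friday", "free free lunch"]
vectors = tfidf(emails)
for email, vector in zip(emails, vectors):
    print(repr(email), {t: round(w, 3) for t, w in vector.items()})
```

The missing email becomes an all-zero (empty) vector rather than raising an error, and every non-empty vector has unit length, which is why no further scaling is usually needed before a linear SVM.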

### 2. Model Choice  
- **Naïve Bayes (Preferred):** Fast to train and effective on high-dimensional sparse text features; it often performs well even though the independence assumption does not hold for words.  
- **SVM:** A linear SVM can also perform well on text, but kernel SVMs may become computationally expensive on very large datasets.  

For spam classification, **Multinomial Naïve Bayes** is a strong, efficient baseline and a reasonable first choice.  

### 3. Handling Class Imbalance  
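- Use **SMOTE (Synthetic Minority Oversampling Technique)** or **Random Oversampling** to balance the dataset, applied to the training split only so that the test set keeps the real class distribution.  
- Alternatively, apply **class weights** to penalize misclassification of the minority class (spam), for example `class_weight="balanced"` in an SVM; for Naïve Bayes, sample weights or adjusted class priors play a similar role.  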
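
As a concrete view of class weighting, the "balanced" heuristic used by scikit-learn sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula on a hypothetical 90/10 label split; `balanced_class_weights` is an illustrative helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label distribution: 90 legitimate emails, 10 spam
labels = ["ham"] * 90 + ["spam"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Each spam example ends up weighted nine times as heavily as a legitimate one, so both classes contribute equally to the total training loss without duplicating or synthesizing any data.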

### 4. Evaluation Metrics  
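- **Accuracy alone is not enough**: with far more legitimate emails than spam, a model that never flags anything can still score high accuracy.  
- Use **Precision, Recall, and F1-score** for the spam class to evaluate spam detection.  
- **ROC-AUC score** summarizes how well the model ranks spam above legitimate mail across all thresholds; under heavy imbalance, a precision-recall curve is often more informative.  
- High **recall** is important to catch as much spam as possible, while good **precision** avoids flagging legitimate emails; the decision threshold sets the balance between the two.  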
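
The point about accuracy can be shown with arithmetic on confusion-matrix counts. The sketch below compares a do-nothing classifier with a working filter on a hypothetical mailbox of 1000 emails, 50 of them spam; all counts are invented for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Classifier that labels every email as legitimate
always_ham = binary_metrics(tp=0, fp=0, fn=50, tn=950)
# Filter that catches 40 of 50 spam emails and wrongly flags 10 legitimate ones
spam_filter = binary_metrics(tp=40, fp=10, fn=10, tn=940)

print("Always ham :", always_ham)
print("Spam filter:", spam_filter)
```

The do-nothing classifier reaches 95% accuracy while catching no spam at all, whereas precision, recall, and F1 immediately expose the difference between the two models.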

### 5. Business Impact  
- **Improved productivity:** Employees waste less time deleting spam manually.  
- **Better security:** Reduces risk of phishing and malware from spam emails.  
- **Customer trust:** Keeping false positives low means important legitimate emails are rarely misclassified as spam.  
- **Cost savings:** Automating spam detection reduces reliance on manual checks and IT interventions.  

---
### Example Python Code:

In [5]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import RandomOverSampler

# Load dataset (simulating spam vs not spam using 2 categories)
data = fetch_20newsgroups(subset="all",
                          categories=["sci.space", "rec.autos"],
                          shuffle=True, random_state=42)

X, y = data.data, data.target

# Handle missing data (replace None with empty string)
X = [text if text is not None else "" for text in X]

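# Convert text into numerical features (TF-IDF)
# Caveat: fitting the vectorizer on all documents lets test-set vocabulary and
# IDF statistics influence training; in practice, fit it on the training split only.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_vec = vectorizer.fit_transform(X)

# Handle class imbalance (oversampling minority class)
# Caveat: oversampling before the split can place copies of the same document in
# both the training and test sets, which can inflate the scores reported below;
# in practice, split first and resample the training data only.
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_vec, y)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.25, random_state=42
)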

# Train Naïve Bayes model
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predictions
y_pred = nb.predict(X_test)
y_prob = nb.predict_proba(X_test)[:, 1]

# Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))


Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99       252
           1       1.00      0.99      0.99       243

    accuracy                           0.99       495
   macro avg       0.99      0.99      0.99       495
weighted avg       0.99      0.99      0.99       495

ROC-AUC Score: 0.9998203671043177
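
For comparison, the same workflow can be ordered so that nothing from the test set reaches training: split first, oversample the minority class in the training split only, and keep the vectorizer inside a pipeline. The sketch below assumes a tiny made-up corpus (so it runs offline) and uses `sklearn.utils.resample` in place of `RandomOverSampler`; its scores are not comparable with the output above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample

# Tiny hypothetical corpus: 12 legitimate emails (0) and 4 spam emails (1)
ham_texts = [
    "meeting moved to friday afternoon", "please review the attached report",
    "lunch at noon tomorrow", "project deadline extended by a week",
    "notes from the team call", "can you send the slides",
    "invoice for last month attached", "see you at the conference",
    "updated schedule for the workshop", "thanks for the quick reply",
    "agenda for monday standup", "draft of the proposal for review",
]
spam_texts = [
    "win a free prize now", "cheap loans approved instantly",
    "claim your free gift card", "exclusive offer click now",
]
texts = ham_texts + spam_texts
labels = [0] * len(ham_texts) + [1] * len(spam_texts)

# 1. Split first, so the test set keeps the original class distribution
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# 2. Oversample the minority class within the training split only
train_ham = [t for t, label in zip(train_texts, y_train) if label == 0]
train_spam = [t for t, label in zip(train_texts, y_train) if label == 1]
spam_upsampled = list(
    resample(train_spam, replace=True, n_samples=len(train_ham), random_state=42)
)
balanced_texts = train_ham + spam_upsampled
balanced_labels = [0] * len(train_ham) + [1] * len(spam_upsampled)

# 3. The vectorizer is fitted inside the pipeline, on training text only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(balanced_texts, balanced_labels)

y_pred = model.predict(test_texts)
print(classification_report(y_test, y_pred, zero_division=0))
```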
