Assignment code - DA-AG-013

### Question 1: What is a Support Vector Machine (SVM), and how does it work?

A Support Vector Machine (SVM) is a supervised machine learning algorithm mainly used for classification tasks, but it can also be used for regression. It is especially powerful for binary classification problems, where the goal is to separate data points into two categories.

**How SVM Works:**  
SVM works by finding the best decision boundary (called a hyperplane) that separates the data into different classes. This hyperplane is chosen in such a way that the margin between the classes is as large as possible. The margin is defined as the distance between the hyperplane and the closest data points from each class. These closest points are called support vectors, and they play a key role in defining the position and orientation of the hyperplane.

**Key Concepts:**

- **Hyperplane:**  
  A line (in 2D), a plane (in 3D), or a higher-dimensional surface that divides the dataset into different classes.

- **Margin:**  
  The distance between the hyperplane and the nearest data points from each class. A larger margin usually leads to better generalization.

- **Support Vectors:**  
  These are the data points closest to the hyperplane. They are "supporting" the hyperplane and directly affect its position.

- **Linear vs. Non-linear Separation:**  
  - If data is linearly separable, SVM finds a straight line (or plane) to divide the classes.  
  - If data is not linearly separable, SVM uses the **kernel trick** to work implicitly in a higher-dimensional feature space where the classes may become linearly separable.

- **Kernel Trick:**  
  A kernel function computes the similarity (inner product) between two points as if they had been mapped into a higher-dimensional space, without performing that mapping explicitly. This makes it possible to separate non-linear data using a linear boundary in that space.

- **Regularization (Soft Margin):**  
  SVM can allow some misclassifications to prevent overfitting. This is called using a soft margin, which gives the model flexibility when dealing with noisy or overlapping data.
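
As a rough illustration of the concepts above, the following sketch fits a linear SVM to a tiny hand-made 2D dataset and reads off the hyperplane, the support vectors, and the margin width `2 / ||w||`. The six points are made up for this example, and the very large `C` (used to approximate a hard margin) is an assumption of the sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM (assumption of this sketch)
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points nearest the boundary end up in `support_vectors_`; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.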

**Advantages of SVM:**
- Works well for high-dimensional data.
- Effective even when the number of features is greater than the number of samples.
- Can handle non-linear data using kernels.

**Disadvantages of SVM:**
- Training can be slow for very large datasets.
- Performance depends heavily on the choice of kernel and its parameters.
- Tends to perform poorly on datasets with a lot of noise and heavily overlapping classes.

**Real-life Example:**  
Suppose we are building an email spam detector. An SVM can learn to classify emails as spam or not spam by analyzing the words in the emails and learning the optimal decision boundary that separates spam from legitimate messages.
"""
cells.append(nbf.new_markdown_cell(q1))



### Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

In Support Vector Machine (SVM), the goal is to find the best boundary (called a hyperplane) that separates the data into classes. Depending on how strictly the model separates the classes, there are two types of margins: Hard Margin and Soft Margin.

#### 1. Hard Margin SVM
A Hard Margin SVM is the strictest version of SVM. It tries to separate the data with no errors at all.

**Characteristics:**
- Assumes the data is perfectly linearly separable.
- No data points are allowed to lie within the margin or on the wrong side of the boundary.
- Creates the maximum margin between classes without any tolerance.

**Advantages:**
- Simple and works well for clean, noise-free data.
- Perfect classification on training data.

**Disadvantages:**
- Very sensitive to outliers.
- Fails when the data is not perfectly separable or contains noise.
- Can lead to overfitting in practical datasets.

#### 2. Soft Margin SVM
A Soft Margin SVM allows some misclassifications to occur.

**Characteristics:**
- Accepts that real-world data may be noisy or overlapping.
- Introduces flexibility by allowing some points to be within the margin or misclassified.
- Controlled by parameter **C**:
  - A large C penalizes margin violations heavily: stricter separation, a narrower margin, and less tolerance to errors.
  - A small C penalizes them lightly: more tolerance and a wider margin.
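
The effect of **C** can be seen in a small experiment. The sketch below trains a linear SVM on overlapping synthetic data with a small and a large `C` and compares the margin width `2 / ||w||` and the number of support vectors. The dataset, the cluster centers, and the two `C` values are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])   # margin width for this C
    n_sv = int(clf.n_support_.sum())               # total support vectors
    results[C] = (margin, n_sv)
    print(f"C={C}: margin width={margin:.3f}, support vectors={n_sv}")
```

With the smaller `C` the margin is wider and more points fall inside it, so more of them become support vectors.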

**Advantages:**
- Works better on real-world datasets with noise and overlapping classes.
- Helps prevent overfitting.
- More robust than hard margin SVM.

**Disadvantages:**
- Requires careful tuning of the C parameter.
- Slightly more complex to implement and understand.

**Summary Table:**

| Feature              | Hard Margin SVM        | Soft Margin SVM          |
|----------------------|------------------------|---------------------------|
| Tolerance to errors  | No (strict separation) | Yes (allows errors)       |
| Handles noisy data   | Poorly                 | Better                    |
| Real-world use       | Limited                | Preferred                 |
| Tuning parameter     | None                   | Parameter C               |
| Overfitting risk     | High                   | Lower                     |

**Example:**  
Classifying emails as spam: hard margin assumes perfect separation, while soft margin allows some flexibility for better generalization.
"""
cells.append(nbf.new_markdown_cell(q2))



### Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

In Support Vector Machine (SVM), we try to find a hyperplane that separates the classes of data. However, in many real-world problems, the data is not linearly separable.

#### What is the Kernel Trick?
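The Kernel Trick is a mathematical technique that lets an SVM work in a higher-dimensional space, where non-linearly separable data may become linearly separable, by computing inner products in that space through a kernel function K(x, y) instead of transforming the data explicitly.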

**Advantages:**
- Allows SVM to solve non-linear problems.
- Saves computation by not transforming the data explicitly.
- Suitable for complex data like images, texts, or biological sequences.
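
To make the "no explicit transformation" point concrete, here is a minimal sketch for the degree-2 polynomial kernel K(x, y) = (x • y)² in two dimensions. The feature map `phi` and the two sample vectors are chosen for illustration; the point is that the kernel value equals the inner product of the explicitly mapped vectors.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (no constant term)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Polynomial kernel K(a, b) = (a . b)^2, computed in the original 2D space."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(y))   # inner product after mapping to 3D
implicit = poly2_kernel(x, y)       # same value without ever building phi

print(explicit, implicit)           # both are 16.0 (up to floating-point rounding)
```

For kernels such as RBF the implicit feature space is infinite-dimensional, so the explicit route is not even available; the kernel function is the only practical way to compute these inner products.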

#### Common Kernel Functions:
- **Linear Kernel:** K(x, y) = x • y  
- **Polynomial Kernel:** K(x, y) = (x • y + c)^d  
- **Radial Basis Function (RBF) / Gaussian Kernel:** K(x, y) = exp(-γ ||x − y||²)  
- **Sigmoid Kernel:** K(x, y) = tanh(α x • y + c)

#### Example: RBF Kernel
Suppose we are classifying data arranged in concentric circles. In 2D, the two classes cannot be separated by a straight line. The RBF kernel implicitly maps the data into a higher-dimensional space where a hyperplane can separate the classes.
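
The concentric-circles case can be tried directly. This sketch uses `make_circles` from scikit-learn to generate such data and compares a linear kernel with an RBF kernel; the sample size, noise level, and split are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic concentric circles: inner circle = one class, outer ring = the other
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same data, two kernels
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

The linear kernel should stay close to chance level here, while the RBF kernel separates the two rings almost perfectly.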

#### When to Use Which Kernel?

| Kernel     | Use Case                                 |
|------------|-------------------------------------------|
| Linear     | Linearly separable data                  |
| Polynomial | Medium-complexity data with interactions |
| RBF        | Non-linear data with complex boundaries  |
| Sigmoid    | Rarely used, inspired by neural nets     |

**Real-Life Example:**  
In handwriting recognition, an SVM with an RBF kernel can learn non-linear boundaries between digit classes directly from pixel features.




### Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

A Naïve Bayes Classifier is a simple and efficient supervised learning algorithm based on **Bayes’ Theorem**, mainly used for classification.

#### Bayes’ Theorem:
P(Class | Data) = [P(Data | Class) × P(Class)] / P(Data)

#### Why is it called “Naïve”?
Because it assumes **feature independence**: given the class, each feature is assumed to contribute independently to the final probability. This assumption is rarely true in practice, but it simplifies computation and often works surprisingly well.

#### How it works:
1. Calculates the probability of the data given each class (the likelihood).
2. Multiplies it by the prior probability of each class.
3. Chooses the class with the highest posterior probability.

#### Example:
Classifying an email as spam based on word presence: the model assumes that each word’s appearance is independent of the others, given the class.
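
The three steps can be worked through by hand. In the sketch below all probabilities are invented for illustration (they are not estimated from any dataset), and the email is assumed to contain just the words "free" and "win".

```python
# Hypothetical probabilities, invented for illustration only
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.50, "win": 0.40},   # P(word | spam)
    "ham":  {"free": 0.05, "win": 0.02},   # P(word | ham)
}
email_words = ["free", "win"]

# Steps 1 and 2: multiply the prior by each word's likelihood
# (the product over words is where the independence assumption is used)
scores = {}
for label in priors:
    score = priors[label]
    for word in email_words:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalise so the posteriors sum to 1 (this is the division by P(Data))
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}

# Step 3: pick the class with the highest posterior
prediction = max(posteriors, key=posteriors.get)
print(posteriors, "->", prediction)
```

Here the unnormalised scores are 0.4 × 0.5 × 0.4 = 0.08 for spam and 0.6 × 0.05 × 0.02 = 0.0006 for ham, so the posterior for spam is about 0.99.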

#### Advantages:
- Fast, simple, and scalable.
- Works well with high-dimensional data (like text).
- Works with relatively little training data.

#### Disadvantages:
- Assumes feature independence.
- Struggles with correlated features.
- Cannot model complex relationships.

#### Real-Life Applications:
- Spam filtering
- Sentiment analysis
- Disease prediction
- Document categorization
"""
cells.append(nbf.new_markdown_cell(q4))




### Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

#### 1. Gaussian Naïve Bayes
- Used for continuous numeric features.
- Assumes each feature is normally distributed within each class.
- Example: temperature, age, sensor data.

#### 2. Multinomial Naïve Bayes
- Used for count-based data (e.g., word frequency).
- Inputs must be non-negative. Integer counts are the typical case, though fractional values such as TF-IDF weights also work in practice.
- Example: spam detection using word counts, document classification.

#### 3. Bernoulli Naïve Bayes
- Used for binary features (presence/absence).
- Ignores frequency; only cares about existence of features.
- Example: if a word appears in an email (1) or not (0).

#### Summary Table:

| Variant              | Data Type      | Feature Type      | Use Case                        |
|----------------------|----------------|--------------------|----------------------------------|
| Gaussian             | Continuous      | Real-valued        | Medical/sensor data             |
| Multinomial          | Discrete counts | Word frequency     | Text classification, spam       |
| Bernoulli            | Binary          | 0/1 (presence)     | Email filtering, short texts    |

#### Important Note:
- Use **Gaussian** for real numbers.
- Use **Multinomial** for count data.
- Use **Bernoulli** for binary features.
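
A minimal sketch of matching each variant to its data type follows. The four-document corpus and its labels are hypothetical, and Gaussian NB is scored on its own training data only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements (Iris flower dimensions)
X_iris, y_iris = load_iris(return_X_y=True)
gnb_acc = GaussianNB().fit(X_iris, y_iris).score(X_iris, y_iris)

# Hypothetical toy corpus: 1 = spam, 0 = not spam
docs = ["win money now", "win prize now",
        "meeting schedule today", "project meeting notes"]
labels = np.array([1, 1, 0, 0])
new_docs = ["win prize money", "project meeting"]

# Multinomial NB: word counts
count_vec = CountVectorizer()
mnb = MultinomialNB().fit(count_vec.fit_transform(docs), labels)
mnb_pred = mnb.predict(count_vec.transform(new_docs))

# Bernoulli NB: word presence/absence (binary=True clips counts to 0/1)
binary_vec = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(binary_vec.fit_transform(docs), labels)
bnb_pred = bnb.predict(binary_vec.transform(new_docs))

print("GaussianNB training accuracy:", gnb_acc)
print("MultinomialNB predictions:", mnb_pred)
print("BernoulliNB predictions:", bnb_pred)
```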
"""
cells.append(nbf.new_markdown_cell(q5))




Dataset Info:
● You can use any suitable dataset, such as Iris, Breast Cancer, or Wine from sklearn.datasets, or a CSV file you have.


Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
(Include your Python code and output in the code box below.)


**Question 6: Train an SVM Classifier on the Iris Dataset**

We will use the `Iris` dataset from `sklearn.datasets` and train an **SVM (Support Vector Machine)** classifier using a **linear kernel**.

Steps:
1. Load the Iris dataset
2. Split the data into training and testing sets
3. Train an SVM classifier
4. Print:
   - Accuracy of the model on the test set
   - Support vectors used by the model


In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data  # Features
y = iris.target  # Target labels

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create SVM model with linear kernel
model = SVC(kernel='linear')

# Train the model
model.fit(X_train, y_train)

# Make predictions on test set
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy and support vectors
print("Model Accuracy:", accuracy)
print("\nNumber of Support Vectors for Each Class:", model.n_support_)
print("\nSupport Vectors:")
print(model.support_vectors_)


Model Accuracy: 1.0

Number of Support Vectors for Each Class: [ 3 11 11]

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
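
An accuracy of 1.0 on a 30-sample test set is a single, fairly noisy estimate. As an optional check, the sketch below uses 5-fold cross-validation on the full Iris dataset with the same linear-kernel `SVC`; the choice of 5 folds is an assumption of this sketch, not part of the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Each fold trains on 4/5 of the data and tests on the remaining 1/5
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```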


Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
(Include your Python code and output in the code box below.)

**Question 7: Train a Gaussian Naïve Bayes Model on the Breast Cancer Dataset**

We will use the `Breast Cancer` dataset from `sklearn.datasets` and perform classification using the **Gaussian Naïve Bayes** algorithm.

Steps:
1. Load the dataset
2. Split it into training and testing sets
3. Train a Gaussian Naïve Bayes classifier
4. Print the classification report (precision, recall, F1-score)


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data  # Features
y = data.target  # Labels

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
● Print the best hyperparameters and accuracy.
(Include your Python code and output in the code box below.)

**Question 8: SVM Classifier on Wine Dataset with Hyperparameter Tuning (GridSearchCV)**

We will:
1. Load the Wine dataset from sklearn
2. Train an SVM classifier
3. Use `GridSearchCV` to find the best `C` and `gamma`
4. Print the best parameters and accuracy


In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # using RBF kernel
}

# Create the SVM model and apply GridSearchCV
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best model and evaluate
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output results
print("Best Hyperparameters:", grid_search.best_params_)
print("Test Set Accuracy:", round(accuracy * 100, 2), "%")


Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Set Accuracy: 83.33 %
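
The program above feeds the raw Wine features to the RBF kernel. Because the RBF kernel depends on distances between samples, features with large numeric ranges can dominate it. A possible variation, shown here only as a sketch, standardizes the features inside a `Pipeline` so that scaling is learned from the training folds only. The step name `svc` in the parameter grid comes from `make_pipeline`'s automatic naming; the grid values are reused from the code above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling and the classifier are tuned together as one estimator
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
test_acc = search.score(X_test, y_test)

print("Best Hyperparameters:", search.best_params_)
print("Test Set Accuracy:", round(test_acc * 100, 2), "%")
```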


Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
(Include your Python code and output in the code box below.)


**Question 9: Naïve Bayes on Text Data with ROC-AUC Score**

We will:
1. Load a subset of the `20newsgroups` dataset using `sklearn.datasets`
2. Preprocess the text using `TfidfVectorizer`
3. Train a Multinomial Naïve Bayes classifier
4. Use `roc_auc_score` to evaluate the model

Note: ROC-AUC is most directly defined for binary classification (multi-class problems need a one-vs-rest or one-vs-one extension), so we'll use only two categories.


In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load 2 classes for binary classification (e.g., 'sci.space' and 'rec.autos')
categories = ['sci.space', 'rec.autos']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Naïve Bayes Classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predict probabilities for ROC-AUC
y_prob = nb.predict_proba(X_test)[:, 1]

# Compute ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC Score:", round(roc_auc, 4))


ROC-AUC Score: 0.9993
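
If more than two categories were kept, `roc_auc_score` could still be used by passing the full probability matrix together with `multi_class="ovr"` (one-vs-rest). The sketch below shows this on the three-class Iris dataset with Gaussian Naïve Bayes rather than on 20newsgroups, purely so that it runs without downloading data; that substitution is an assumption of the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# predict_proba returns one column per class; pass all of them for multi-class AUC
proba = GaussianNB().fit(X_train, y_train).predict_proba(X_test)
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr")

print("One-vs-rest ROC-AUC:", round(auc_ovr, 4))
```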


Question 10: Imagine you’re working as a data scientist for a company that handles email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
(Include your Python code and output in the code box below.)


**Question 10: Spam Email Classification — Full Pipeline**

Our task is to classify emails as Spam or Not Spam.

### 🧹 1. Data Preprocessing:
- **Handle Missing Values:** Replace missing email bodies with empty strings.
- **Text Vectorization:** Use `TfidfVectorizer` to convert text into numerical features.
- **Train-test split:** Reserve part of the data for testing.

### 🤖 2. Model Selection:
- **Naïve Bayes (MultinomialNB)** is a strong baseline for text data: fast to train and efficient on sparse features.
- **SVM** (especially with a linear kernel) also handles high-dimensional sparse text well, but it is slower to train on large corpora and needs tuning of `C`.
- We’ll use **MultinomialNB** for a practical, real-world solution.

### ⚖️ 3. Handle Class Imbalance:
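- Options include **class weights** (for models that support them, such as `SVC`), **resampling** (e.g., oversampling with SMOTE or downsampling the majority class), and **stratified sampling** when splitting.
- `MultinomialNB` has no `class_weight` parameter, so the code below uses `resample` to downsample the majority class.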
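
One way to get a class-weight effect with `MultinomialNB`, which accepts `sample_weight` in `fit` but has no `class_weight` option, is to compute balanced per-sample weights and pass them through a pipeline. The sketch below assumes a made-up corpus of 8 legitimate and 2 spam emails; the texts, labels, and test sentence are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical imbalanced toy corpus: 8 legitimate emails, 2 spam
texts = [
    "meeting at noon", "project deadline tomorrow",
    "invoice attached", "lunch with client",
    "quarterly report draft", "team schedule update",
    "notes from the call", "travel itinerary",
    "win a free prize now", "free offer win cash",
]
labels = np.array([0] * 8 + [1] * 2)

# 'balanced' gives each sample the weight n_samples / (n_classes * class_count)
weights = compute_sample_weight(class_weight="balanced", y=labels)

# Vectorizer and classifier in one pipeline, so the vocabulary is learned
# only from the data passed to fit
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels, multinomialnb__sample_weight=weights)

pred = model.predict(["win free prize now"])
print("Spam sample weight:", weights[labels == 1][0])   # 10 / (2 * 2) = 2.5
print("Prediction:", pred[0])
```

Unlike downsampling, this keeps every legitimate email in the training set while still letting the minority class count equally.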

### 📊 4. Evaluation Metrics:
- **Precision**: Important to reduce false positives (marking legit email as spam).
- **Recall**: Important to catch as much spam as possible.
- **F1-Score**: Harmonic mean of precision and recall.
- **ROC-AUC**: Overall model discrimination ability.

### 💼 5. Business Impact:
- Preventing spam saves time and reduces risk.
- Avoiding false positives ensures important emails aren’t lost.
- Balanced filtering improves user trust in the email platform.


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.utils import resample

# 🔸 Simulate Dataset (You can replace this with a real CSV file)
data = {
    'email_text': [
        "Congratulations! You've won a lottery, claim now!",
        "Dear user, your invoice is attached",
        "Win big prizes!!! Click here now",
        np.nan,
        "Meeting at 3pm with the sales team.",
        "Limited time offer, buy now!",
        "Reminder: Project deadline tomorrow",
        "Exclusive deal just for you, buy today!",
        "Lunch with client at noon",
        "You are selected for a prize. Send details!"
    ],
    'label': [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]  # 1 = Spam, 0 = Not Spam
}

df = pd.DataFrame(data)

# 🔹 Step 1: Handle Missing Data
df['email_text'] = df['email_text'].fillna("")

# 🔹 Step 2: Balance Classes by Downsampling the Majority Class (a no-op here: the toy data is already 5 vs. 5)
df_majority = df[df.label == 0]
df_minority = df[df.label == 1]
df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)
df_balanced = pd.concat([df_majority_downsampled, df_minority])

# 🔹 Step 3: Vectorize Text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df_balanced['email_text'])
y = df_balanced['label']

# 🔹 Step 4: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.3)

# 🔹 Step 5: Train Model
model = MultinomialNB()
model.fit(X_train, y_train)

# 🔹 Step 6: Evaluate Model
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("🔹 Classification Report:\n")
print(classification_report(y_test, y_pred))

print("\n🔹 Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\n🔹 ROC-AUC Score:", round(roc_auc_score(y_test, y_prob), 4))


🔹 Classification Report:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.33      1.00      0.50         1

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3


🔹 Confusion Matrix:
[[0 2]
 [0 1]]

🔹 ROC-AUC Score: 1.0
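
The spam-class figures in the report can be reproduced from the confusion matrix printed above. This short sketch recomputes them by hand using the standard definitions; only the variable names are new.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[0, 2],
               [0, 1]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)                           # 1 / 3, reported as 0.33
recall = tp / (tp + fn)                              # 1 / 1 = 1.00
f1 = 2 * precision * recall / (precision + recall)   # 0.50
accuracy = (tp + tn) / cm.sum()                      # 1 / 3, reported as 0.33

print(precision, recall, f1, accuracy)
```

All three test emails were predicted as spam, so no email was predicted as class 0. Precision for class 0 is therefore undefined, which the report shows as 0.00. With only three test samples, none of these numbers says much about real-world performance.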


