
## 🧩 **Assignment 2: Email Spam Classification**

### 🎯 **Title**

> Classify the email using binary classification.
> Use **K-Nearest Neighbors (KNN)** and **Support Vector Machine (SVM)** for classification and compare their performance.

---

## 🧠 **Objective**

To build and evaluate a **machine learning classifier** that detects whether an email is **Spam (1)** or **Not Spam (0)** using:

* **K-Nearest Neighbors (KNN)**
* **Support Vector Machine (SVM)**

---

## 📊 **Dataset Details**

**Source:** [Kaggle: Email Spam Classification Dataset](https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv)

| Attribute            | Description                                   |
| -------------------- | --------------------------------------------- |
| `Email No.`          | Unique email ID (numeric name)                |
| `word1` – `word3000` | Frequency count of top 3000 most common words |
| `label`              | Target variable: 1 = spam, 0 = not spam       |

* Total rows: **5172**
* Total columns: **3002 (1 ID + 3000 word features + 1 label)**

---

## ⚙️ **Theory You Must Know**

### 1️⃣ **Data Preprocessing**

Preparing raw data before applying models:

* Remove irrelevant features (like Email ID)
* Handle missing/null values
* Normalize/scale data (important for KNN and SVM)
* Split into train and test data
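The first two steps can be sketched on a tiny stand-in table. This is only an illustration: the word columns `free` and `meeting`, the sample values, and the choice to treat a missing count as 0 are assumptions, not properties of the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset: an ID column, two word-count
# columns, and a label column (all values invented for illustration).
df = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4"],
    "free": [3, 0, None, 1],
    "meeting": [0, 2, 1, 0],
    "label": [1, 0, 0, 1],
})

# Count null values in each column before deciding how to handle them.
missing_per_column = df.isnull().sum()

# Drop the ID and label columns to get the features, and treat a missing
# word count as 0 (an assumption; dropping the row is another option).
features = df.drop(columns=["Email No.", "label"]).fillna(0)
labels = df["label"]

print(missing_per_column["free"], features.shape)  # prints: 1 (4, 2)
```

Scaling and the train/test split are left out here because they are covered in the code flow later in these notes.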

---

### 2️⃣ **Binary Classification**

Binary classification predicts **two categories** (e.g., spam or not spam).

Model output:

```
1 → Spam
0 → Not Spam
```

---

### 3️⃣ **K-Nearest Neighbors (KNN)**

| Concept         | Explanation                                                                    |
| --------------- | ------------------------------------------------------------------------------ |
| Type            | Supervised Learning (Classification)                                           |
| Idea            | A data point is classified by the **majority vote** of its K nearest neighbors |
| Distance Metric | Usually **Euclidean Distance**                                                 |
| Formula         | $d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}$            |
| Hyperparameter  | ‘k’ = number of neighbors to consider                                          |

**Advantages:**

* Simple and intuitive
* Works well with clean data

**Disadvantages:**

* Slow for large datasets
* Needs scaling
* Sensitive to irrelevant features
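To make the majority-vote idea concrete, here is a minimal from-scratch sketch using Euclidean distance. The helper names (`euclidean`, `knn_predict`) and the two-feature toy vectors are made up for illustration; the notebook itself uses `KNeighborsClassifier`.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy word-count vectors: [count of "free", count of "meeting"] (invented data).
train_X = [[5, 0], [4, 1], [6, 0], [0, 3], [1, 4], [0, 5]]
train_y = [1, 1, 1, 0, 0, 0]

pred_spam = knn_predict(train_X, train_y, [5, 1], k=3)
pred_ham = knn_predict(train_X, train_y, [0, 4], k=3)
print(pred_spam, pred_ham)  # prints: 1 0
```

Every prediction scans the whole training set, which is why KNN becomes slow as the dataset grows.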

---

### 4️⃣ **Support Vector Machine (SVM)**

| Concept     | Explanation                                                |
| ----------- | ---------------------------------------------------------- |
| Type        | Supervised Learning (Classification)                       |
| Idea        | Finds a **hyperplane** that best separates the two classes |
| Key Concept | **Maximizes margin** between spam and not-spam             |
| Kernels     | Linear, RBF (Radial Basis Function), Polynomial, etc.      |

**Advantages:**

* Works well with high-dimensional data (like word frequencies)
* Handles non-linear separation using kernel trick

**Disadvantages:**

* Slower on very large datasets
* Needs proper parameter tuning
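The hyperplane and margin ideas can be seen on a small separable example. This is a sketch on invented 2-D points, not on the email data. For a linear SVM the margin width is $2 / \lVert w \rVert$, where $w$ is the learned weight vector; the large `C` value is an assumption chosen to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D points (illustrative, not email data).
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
svm = SVC(kernel="linear", C=1000.0)
svm.fit(X, y)

w = svm.coef_[0]                         # normal vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin boundaries

# Only the points lying on the margin boundaries define the hyperplane.
print("Support vectors:\n", svm.support_vectors_)
print("Margin width:", round(float(margin_width), 3))
```

Points far from the boundary do not appear among the support vectors, so moving them slightly would not change the fitted hyperplane.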

---

### 5️⃣ **Model Evaluation Metrics**

| Metric               | Meaning                             | Ideal Value                  |
| -------------------- | ----------------------------------- | ---------------------------- |
| **Accuracy**         | (TP + TN) / (Total)                 | Closer to 1                  |
| **Precision**        | TP / (TP + FP)                      | High → fewer false positives |
| **Recall**           | TP / (TP + FN)                      | High → fewer false negatives |
| **F1 Score**         | Harmonic mean of Precision & Recall | High = balanced model        |
| **Confusion Matrix** | Table showing prediction results    | 2×2 matrix (TP, TN, FP, FN)  |
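
As a worked example of the formulas in the table, the sketch below computes each metric from raw confusion-matrix counts. The function name `binary_metrics` and the counts are invented for illustration, and the code assumes no denominator is zero.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # of predicted spam, how much was spam
    recall = tp / (tp + fn)             # of actual spam, how much was caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Invented counts: 40 spam caught, 45 ham kept, 5 ham flagged, 10 spam missed.
acc, prec, rec, f1 = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# prints: 0.85 0.889 0.8 0.842
```

Here precision is higher than recall: the classifier rarely flags legitimate mail, but it lets one in five spam emails through.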

---

## 💻 **Typical Code Flow in Your Notebook (B2.ipynb)**

### Step 1: Import Libraries

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```

---

### Step 2: Load Dataset

```python
df = pd.read_csv('emails.csv')
df.head()
```

---

### Step 3: Data Preprocessing

```python
# Drop ID column
df = df.drop('Email No.', axis=1)

# Separate features and labels
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

---

### Step 4: Feature Scaling

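KNN relies directly on **distances** between points, and SVM's margin and kernels also depend on feature magnitudes, so both are sensitive to feature scale. Fit the scaler on the training set only, then apply it to the test set.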

```python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

---

### Step 5: Apply **KNN Classifier**

```python
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
```

---

### Step 6: Apply **SVM Classifier**

```python
svm = SVC(kernel='linear')  # or kernel='rbf'
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
```

---

### Step 7: Evaluate Models

```python
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))
print(confusion_matrix(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(confusion_matrix(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))
```

---

### 🧮 **Typical Results**

| Model            | Accuracy | Precision | Recall | F1-Score |
| ---------------- | -------- | --------- | ------ | -------- |
| **KNN (k=5)**    | ~0.92    | 0.91      | 0.93   | 0.92     |
| **SVM (Linear)** | ~0.96    | 0.95      | 0.96   | 0.95     |

✅ **Conclusion:**
SVM performs slightly better than KNN due to its ability to handle high-dimensional (word-based) data efficiently.

---

## 🧩 **Viva Questions You Might Be Asked**

| Question                          | Short Answer                                                                         |
| --------------------------------- | ------------------------------------------------------------------------------------ |
| What is KNN?                      | Instance-based ML algorithm that classifies based on nearest neighbors.              |
| What is SVM?                      | Algorithm that separates classes using the best hyperplane.                          |
| Why scale features?               | To give equal weight to all features (distance-based models are sensitive to scale). |
| What is Kernel in SVM?            | Function that measures similarity between points as if they were mapped to a higher-dimensional space, enabling non-linear separation. |
| What is binary classification?    | Predicting one of two categories (spam or not spam).                                 |
| What is Confusion Matrix?         | Table showing true and false predictions for each class.                             |
| Why does SVM perform better here? | High-dimensional sparse data is handled well by SVM.                                 |
| What is F1 score?                 | Harmonic mean of precision and recall.                                               |
| How do you choose ‘k’ in KNN?     | Try multiple values; choose the one with best validation accuracy.                   |
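
The answer to the last question can be sketched as a small search over candidate values of k. This runs on synthetic data from `make_classification` as a stand-in for the email features, and the candidate list is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the email features (assumption: not the real dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

candidate_ks = [1, 3, 5, 7, 9, 11]   # odd values avoid ties in a two-class vote
mean_accuracy = {}
for k in candidate_ks:
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_accuracy[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print("Mean accuracy per k:", {k: round(float(v), 3) for k, v in mean_accuracy.items()})
print("Best k:", best_k)
```

Selecting k by cross-validation on the training data keeps the test set untouched for the final evaluation.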

---

## 🧠 **Possible “Modification” Questions During Viva**

| Asked To Do                          | How To Do It                                       |
| ------------------------------------ | -------------------------------------------------- |
| “Try another kernel”                 | `SVC(kernel='rbf')` or `SVC(kernel='poly')`        |
| “Change value of k”                  | `KNeighborsClassifier(n_neighbors=7)`              |
| “Show confusion matrix as a heatmap” | `sns.heatmap(confusion_matrix(...))`               |
| “Show F1 Score only”                 | Import `f1_score` from `sklearn.metrics`, then call `f1_score(y_test, y_pred_svm)` |
| “Use cross-validation”               | `cross_val_score(model, X, y, cv=5)` (from `sklearn.model_selection`) |
| “Normalize instead of standardize”   | Use `MinMaxScaler()` instead of `StandardScaler()` |
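
The cross-validation row can be combined with scaling in one pipeline, so the scaler is refit inside each fold rather than on the full data. The sketch below uses synthetic data as a stand-in for the email features; the RBF kernel and F1 scoring are illustrative choices, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the email features (assumption: not the real dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# The pipeline refits the scaler on each training fold, so no statistics
# from the held-out fold leak into the scaling step.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(model, X, y, cv=5, scoring="f1")

print("F1 per fold:", scores.round(3))
print("Mean F1:", round(float(scores.mean()), 3))
```

Swapping `StandardScaler()` for `MinMaxScaler()` in the same pipeline covers the last row of the table as well.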

---

## 📈 **Visualization Ideas (Optional but Good for Marks)**

```python
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, fmt='d', cmap='coolwarm')
plt.title("SVM Confusion Matrix")
plt.show()
```

---

## 🧩 **Key Takeaways**

* **KNN** works on **distance similarity**, best for simple patterns.
* **SVM** separates classes using a **hyperplane**, better for text-like, high-dimensional data.
* **Scaling** matters for both, since each is sensitive to feature magnitudes.
* Evaluate with **Accuracy, Precision, Recall, F1**.
* SVM often outperforms KNN on high-dimensional spam datasets such as this one.

