### Question 1: What is a Support Vector Machine (SVM), and how does it work?

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the hyperplane that best separates the data into different classes with the maximum margin. The margin is defined as the distance between the hyperplane and the closest data points from each class, known as support vectors. The idea is to choose a hyperplane that maximizes this margin to ensure better generalization on unseen data.

In the case of non-linearly separable data, SVM uses kernel functions to map the data into a higher-dimensional space where a linear separation is possible.
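
As a minimal sketch of the margin idea, the snippet below fits a linear SVM on four hand-picked points (invented for illustration) and recovers the distance from the hyperplane to the closest points as 1/‖w‖. The very large `C` is an assumption used only to approximate a maximum-margin fit with no slack; it is not a recommended setting.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (invented): class 0 sits on x1 = 0, class 1 sits on x1 = 2
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C leaves almost no room for slack
clf = SVC(kernel="linear", C=1e6)
clf.fit(X_toy, y_toy)

w = clf.coef_[0]                   # normal vector of the separating hyperplane
b = clf.intercept_[0]              # offset of the hyperplane
margin = 1.0 / np.linalg.norm(w)   # distance from the hyperplane to the closest points

print("w =", w, "b =", b)
print("margin =", margin)
print("support vectors:")
print(clf.support_vectors_)
```

For this layout the separating hyperplane should fall near x1 = 1, midway between the two classes.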

### Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

- **Hard Margin SVM** assumes that the data is linearly separable and tries to find a hyperplane that perfectly separates the data without any misclassification. It works only when there are no noisy points or overlaps in the classes.
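- **Soft Margin SVM**, on the other hand, allows some margin violations and misclassifications (measured by slack variables) to find a hyperplane that generalizes better. It introduces a regularization parameter `C` that controls the trade-off between maximizing the margin and minimizing the classification error: a large `C` penalizes violations heavily and approaches hard-margin behavior, while a small `C` tolerates more violations in exchange for a wider margin.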

Soft Margin SVM is more practical and is widely used in real-world problems where perfect separation is not always possible.
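
The role of `C` can be sketched by evaluating the standard soft-margin objective, 0.5‖w‖² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)), for a fixed hyperplane. The points, weights, and the helper name `soft_margin_objective` are invented for illustration; this evaluates the objective only and does not train a model.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slack) for labels y in {-1, +1}."""
    functional_margins = y * (X @ w + b)
    # Slack is zero for points on or outside the margin, positive otherwise
    slack = np.maximum(0.0, 1.0 - functional_margins)
    objective = 0.5 * np.dot(w, w) + C * slack.sum()
    return objective, slack

# Two points sit exactly on the margin; the third is on the wrong side
X_pts = np.array([[2.0, 0.0], [0.0, 0.0], [0.5, 0.0]])
y_pts = np.array([1, -1, 1])
w_fixed = np.array([1.0, 0.0])
b_fixed = -1.0

obj_small_c, slack = soft_margin_objective(w_fixed, b_fixed, X_pts, y_pts, C=0.1)
obj_large_c, _ = soft_margin_objective(w_fixed, b_fixed, X_pts, y_pts, C=10.0)

print("slack per point:", slack)
print("objective with C=0.1:", obj_small_c)
print("objective with C=10:", obj_large_c)
```

The same violation costs far more under the larger `C`, which is why an optimizer with large `C` prefers hyperplanes that avoid violations even at the price of a narrower margin.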

### Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

The Kernel Trick is a technique used in SVM to handle non-linearly separable data. It transforms the data into a higher-dimensional space where a linear separator (hyperplane) can be found.

Instead of explicitly computing the transformation, the kernel trick computes the inner product of the transformed features directly using a kernel function.

**Example: Radial Basis Function (RBF) Kernel** - Defined as `K(x, z) = exp(-γ‖x - z‖²)`, it is useful when the decision boundary between classes is non-linear and complex. The RBF kernel can capture intricate patterns in the data because it corresponds to an inner product in an infinite-dimensional feature space.
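
A small numerical check can make the trick concrete. The RBF feature space is infinite-dimensional and cannot be written out, so this sketch uses a degree-2 polynomial kernel, whose explicit feature map is small enough to compare directly. The function names (`phi`, `poly2_kernel`, `rbf_kernel`) and the sample vectors are assumptions made for illustration.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel (x . z)^2 in two dimensions."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Same inner product, computed without building phi(x) or phi(z)."""
    return np.dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel value; its feature space cannot be enumerated explicitly."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # inner product in the mapped space
implicit = poly2_kernel(x, z)       # kernel evaluated in the original space

print("explicit:", explicit)
print("implicit:", implicit)
print("rbf(x, z):", rbf_kernel(x, z))
```

Both routes give the same number, but the kernel version never constructs the higher-dimensional vectors.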

### Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem. It predicts the class of a data point by calculating the posterior probability for each class and selecting the one with the highest probability.

It is called “naïve” because it assumes that all features are independent of each other given the class label. This assumption is rarely true in real-world data, but the classifier often still performs well: classification only requires the correct class to receive the highest score, not accurate probability estimates. It is also simple and fast to train.
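
The calculation can be sketched by hand. All probabilities below are invented for illustration; the point is that the independence assumption lets each class score be a prior multiplied by one likelihood per feature.

```python
# Invented numbers for illustration only
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}
words = ["free", "meeting"]  # words observed in the email

# Naive assumption: multiply per-word likelihoods as if they were independent
scores = {}
for label in priors:
    score = priors[label]
    for word in words:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize the scores to obtain posterior probabilities
total = sum(scores.values())
posterior = {label: s / total for label, s in scores.items()}
prediction = max(posterior, key=posterior.get)

print(posterior)
print("predicted class:", prediction)
```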

### Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

- **Gaussian Naïve Bayes**: Assumes each feature follows a normal distribution within each class. Used when features are continuous (e.g., real-valued attributes like height, weight).
- **Multinomial Naïve Bayes**: Used for discrete count data such as word counts in text classification. Ideal for bag-of-words models.
- **Bernoulli Naïve Bayes**: Used for binary/boolean features indicating the presence or absence of a feature (e.g., whether a word appears in a document).
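
The three variants map onto scikit-learn's `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. The sketch below fits each one on a tiny invented dataset of the matching feature type; the arrays are assumptions chosen so that the two classes are clearly separated.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y_small = np.array([0, 0, 0, 1, 1, 1])

# Continuous feature (e.g., a measurement): GaussianNB
X_cont = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
gnb_pred = GaussianNB().fit(X_cont, y_small).predict([[1.1], [5.1]])

# Count features (e.g., occurrences of two words): MultinomialNB
X_counts = np.array([[3, 0], [4, 1], [5, 0], [0, 3], [1, 4], [0, 5]])
mnb_pred = MultinomialNB().fit(X_counts, y_small).predict([[3, 0], [0, 3]])

# Binary features (word present or absent): BernoulliNB
X_bin = (X_counts > 0).astype(int)
bnb_pred = BernoulliNB().fit(X_bin, y_small).predict([[1, 0], [0, 1]])

print("GaussianNB:", gnb_pred)
print("MultinomialNB:", mnb_pred)
print("BernoulliNB:", bnb_pred)
```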

# Machine Learning Assignment 2 – SVM & Naive Bayes (Coding Part)

*Solutions with code, output, and explanations*

### Question 6: Train SVM on Iris Dataset

In [None]:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM with linear kernel
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
support_vectors = model.support_vectors_

print(f"Accuracy: {accuracy:.2f}")
print("Support Vectors:")
print(support_vectors)


### Question 7: Gaussian Naive Bayes on Breast Cancer Dataset

In [None]:

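from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Hold out a test set so the report reflects performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train GaussianNB model
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred, target_names=data.target_names))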


### Question 8: GridSearchCV with SVM on Wine Dataset

In [None]:

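from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load Wine dataset
data = load_wine()
X, y = data.data, data.target

# Scale inside the pipeline so each CV fold is standardized on its own training split;
# the RBF kernel is sensitive to feature scale
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Define parameter grid (make_pipeline names the SVC step 'svc')
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

# Output best parameters and mean cross-validated accuracy
print("Best Parameters:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)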


### Question 9: Naive Bayes on Text Dataset with ROC-AUC

In [None]:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

# Load the predefined train and test splits for two categories
categories = ['rec.sport.hockey', 'sci.med']
remove = ('headers', 'footers', 'quotes')
train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

# Pipeline with TF-IDF + Naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)

# Probability of the positive class (label 1) on held-out documents
y_prob = model.predict_proba(test.data)[:, 1]

# ROC-AUC score on the test split, not on the training data
roc_auc = roc_auc_score(test.target, y_prob)
print(f"ROC-AUC Score: {roc_auc:.2f}")


### Question 10: Email Spam Classification Strategy

**Scenario**: Automatically classify emails as Spam or Not Spam.

**Approach**:

1. **Preprocessing**:
   - Text cleaning: lowercasing, removing punctuation and stopwords, stemming/lemmatization.
   - Handling missing values: fill with empty strings (or use `SimpleImputer`), or remove rows, depending on how much data is missing.
   - Text vectorization: use `TfidfVectorizer` to convert emails into numerical features.

2. **Model Selection**:
   - **Multinomial Naive Bayes (`MultinomialNB`)** is preferred when speed and scalability are the priority. It is effective on sparse, high-dimensional text features such as word frequencies.
   - **SVM** can be used when higher accuracy is desired, at a higher computational cost. It may be slower on large, sparse datasets.

3. **Handling Class Imbalance**:
   - Use `class_weight='balanced'` in SVM, or SMOTE oversampling.
   - Resample the dataset (oversample the minority class or undersample the majority class).
   - Use a stratified split to maintain the class distribution in the train and test sets.

4. **Evaluation Metrics**:
   - Accuracy, Precision, Recall, F1-Score. Accuracy alone can be misleading when spam is rare.
   - ROC-AUC Score for a threshold-independent view of performance.

5. **Business Impact**:
   - Reduces manual spam filtering, which improves productivity.
   - Protects users from phishing and other malicious spam.
   - Increases trust in the email system.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Load a two-category subset of 20 Newsgroups as a stand-in for an email corpus
categories = ['sci.crypt', 'talk.politics.misc']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))
X, y = data.data, data.target

# Introduce some missing values artificially
X = [x if i % 10 != 0 else None for i, x in enumerate(X)]

# Preprocessing pipeline
def clean_missing_text(X):
    return [" " if x is None else x for x in X]

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
model = MultinomialNB()

pipeline = Pipeline([
    ('clean_text', FunctionTransformer(func=clean_missing_text, validate=False)),
    ('tfidf', vectorizer),
    ('clf', model)
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

# Evaluation
print(classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))
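
To illustrate the class-imbalance steps from the strategy, here is a minimal sketch that combines a stratified split with a class-weighted linear SVM. The toy corpus is invented, and `LinearSVC` is one possible choice of SVM for sparse TF-IDF features, not necessarily the one intended in the answer above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: 12 legitimate emails (0) and 4 spam emails (1)
ham = [f"meeting notes for project update number {i + 10}" for i in range(12)]
spam = [f"win a free prize now claim offer number {i + 10}" for i in range(4)]
emails = ham + spam
labels = [0] * 12 + [1] * 4

# stratify keeps the 3:1 class ratio in both the train and test portions
X_tr, X_te, y_tr, y_te = train_test_split(
    emails, labels, test_size=0.25, random_state=0, stratify=labels
)

# class_weight='balanced' scales the misclassification penalty
# inversely to class frequency, so the minority class is not ignored
clf = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("spam in train:", sum(y_tr), "of", len(y_tr))
print("spam in test:", sum(y_te), "of", len(y_te))
print("predictions:", list(pred))
```

SMOTE is not part of scikit-learn itself; it is provided by the separate imbalanced-learn package, so it is left out of this sketch.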
