SVM & Naive Bayes | Assignment

Question 1: What is a Support Vector Machine (SVM), and how does it work?
Ans: A Support Vector Machine (SVM) is a powerful and versatile supervised machine learning algorithm used for both classification and regression tasks. Its primary objective is to find the optimal decision boundary, known as a hyperplane, that separates data points of different classes.

The key idea behind SVM is to not just separate the classes, but to do so with the largest possible margin. The margin is the distance between the hyperplane and the closest data points from each class. These closest points are called support vectors, and they play a crucial role in defining the hyperplane.



How SVM Works
Finding the Optimal Hyperplane: SVM's goal is to find a hyperplane that maximizes the margin between the two classes. A larger margin generally leads to better generalization on new, unseen data, as it provides a wider "buffer zone" between the classes.


Support Vectors: The data points that are closest to the hyperplane and essentially "support" or define the margin are the support vectors. Removing a support vector generally changes the position and orientation of the hyperplane, whereas removing any other data point leaves the final boundary unchanged.
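
The role of the support vectors can be checked on a small example. The sketch below is only an illustration: the six toy points and the very large C (used to approximate a hard margin) are assumptions, not part of the assignment. It fits a linear SVC, computes the margin width as 2 / ||w||, and refits after dropping one point that is not a support vector.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (toy data, assumed for illustration).
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# For a linear SVM the margin width is 2 / ||w||.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)
print("Support vector indices:", clf.support_)
print("Margin width:", margin_width)

# Drop one point that is NOT a support vector and refit.
non_sv = [i for i in range(len(X)) if i not in clf.support_]
keep = [i for i in range(len(X)) if i != non_sv[0]]
clf_reduced = SVC(kernel="linear", C=1e6).fit(X[keep], y[keep])

# The weight vector should match up to solver tolerance.
unchanged = np.allclose(clf.coef_, clf_reduced.coef_, atol=1e-2)
print("Boundary unchanged after removing a non-support vector:", unchanged)
```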



Handling Non-Linear Data with the Kernel Trick: Many real-world datasets aren't linearly separable, meaning you can't draw a straight line (or flat hyperplane) to divide the classes. For these cases, SVM uses a technique called the kernel trick. The kernel trick implicitly maps the data into a higher-dimensional space where it can be linearly separated. By applying a kernel function (like the Radial Basis Function or RBF), SVM can find a non-linear decision boundary in the original space, effectively handling complex relationships between data points without the computational cost of explicitly transforming the data.

Soft vs. Hard Margin:

Hard Margin SVM: This approach assumes the data is perfectly separable and aims for a perfect separation with no misclassifications. It works well when the data is clean and there are no outliers.


Soft Margin SVM: This is a more flexible approach that allows for some misclassifications or points to be on the wrong side of the margin. It introduces a regularization parameter (C) that balances the trade-off between maximizing the margin and minimizing the number of classification errors. This makes it more robust to noisy data and outliers.

Question 2: Explain the difference between Hard Margin and Soft Margin SVM
Ans: The primary difference between Hard Margin and Soft Margin SVM lies in how they handle data that isn't perfectly separable.

Hard Margin SVM
A Hard Margin SVM seeks to find a hyperplane that perfectly separates the data points of different classes. This means there can be no misclassifications, and all data points must be on the correct side of the margin.


When to use: This approach is only feasible when the data is linearly separable—meaning a single straight line or flat hyperplane can perfectly divide the classes—and there are no outliers or noise.

Key characteristic: It's very sensitive to outliers. A single misplaced data point can make it impossible to find a separating hyperplane, causing the algorithm to fail.


Analogy: Imagine trying to separate two different colors of marbles on a table with a ruler. A Hard Margin SVM is like insisting on placing the ruler so that none of the marbles are on the wrong side or even touching the ruler.

Soft Margin SVM
A Soft Margin SVM is a more flexible and practical approach that allows for some misclassifications or data points to be within the margin. This is achieved by introducing a penalty term for these violations.


When to use: This is the more common and robust approach, used for real-world data that often contains noise, outliers, or overlapping classes.

Key characteristic: It introduces a regularization parameter (C) which controls the trade-off between maximizing the margin and minimizing misclassification errors.

A low C value tolerates more errors, leading to a wider margin but potentially underfitting the data.

A high C value penalizes errors more severely, resulting in a narrower margin but a stricter separation, similar to a Hard Margin SVM.

Analogy: Using the same marble analogy, a Soft Margin SVM allows you to place the ruler with a few marbles slightly on the wrong side or within a "buffer zone" to get a better, more generalized separation overall.

In summary, Hard Margin SVM aims for perfection but is fragile, while Soft Margin SVM is more realistic and robust by balancing the goal of a wide margin with the reality of imperfect data.
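
The effect of C can be seen by fitting the same linear SVM with a low and a high value. This is a rough sketch on synthetic, overlapping blobs; the dataset parameters and the two C values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping two-class toy data (synthetic, for illustration only).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Margin width of a linear SVM is 2 / ||w||.
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    n_sv = int(clf.n_support_.sum())
    results[C] = (width, n_sv)
    print(f"C={C}: margin width={width:.3f}, support vectors={n_sv}")
```

The low-C model should report the wider margin, since margin violations are penalized less.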

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.
Ans: The Kernel Trick is a powerful technique used by Support Vector Machines (SVMs) to handle non-linearly separable data without explicitly transforming the data into a higher-dimensional space. Instead of performing the computationally expensive task of mapping the data, the kernel function calculates the dot product of the data points as if they were already in that higher-dimensional space. This allows SVM to find a linear decision boundary in the higher dimension, which translates to a non-linear boundary in the original feature space.
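
The "dot product without the mapping" idea can be verified numerically. The sketch below uses a degree-2 polynomial kernel rather than RBF, because its feature map is small enough to write out (the RBF feature space is infinite-dimensional). The helper names `phi` and `poly_kernel` are made up for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (3-D output)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # dot product after mapping to 3-D
implicit = poly_kernel(x, z)       # same quantity, computed directly in 2-D

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both values agree, but the kernel never constructs the higher-dimensional vectors.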

Example: Radial Basis Function (RBF) Kernel
The Radial Basis Function (RBF) kernel is one of the most popular and versatile kernels used with SVM.

Formula: The RBF kernel is defined as:
$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$
where $\lVert x - x' \rVert^2$ is the squared Euclidean distance between two points, and γ is a hyperparameter that controls the influence of each support vector.

Use Case: The RBF kernel is particularly effective for datasets where the classes are arranged in complex, non-linear patterns. For example, consider a classification problem where the data points of one class form a circle in the center of data points from another class. A simple linear boundary would fail. The RBF kernel, however, can handle this because its value depends only on the distance between two points, so the resulting decision function can enclose the inner cluster with a roughly circular boundary. It is a good default choice for many classification problems as it can handle a wide variety of non-linear relationships.
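
The concentric-circle case can be reproduced with scikit-learn's `make_circles`. This is a minimal sketch; the sample size, noise level, and split are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One class forms an inner ring, the other an outer ring (synthetic data).
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same data, two kernels.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```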

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?
Ans: A Naïve Bayes classifier belongs to a family of simple probabilistic supervised machine learning algorithms used for classification. It's based on Bayes' Theorem and operates under a strong, simplifying assumption: that all features are conditionally independent of each other given the class. This means that, within a class, the presence or absence of one feature does not affect the presence or absence of any other feature.



Why it’s called “naïve”
The "naïve" part of the name comes directly from this simplifying assumption of independence. In reality, this assumption is almost always false. For example, a classifier predicting whether a fruit is an orange might consider its features like "round shape," "orange color," and "citrus smell." A Naïve Bayes classifier would assume that the orange color is completely independent of the round shape, which isn't true.


However, despite this unrealistic assumption, Naïve Bayes classifiers often perform surprisingly well in practice, especially for tasks like text classification, spam filtering, and sentiment analysis. The simplicity of the model makes it computationally efficient and fast to train.
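
A hand computation shows how the independence assumption turns Bayes' Theorem into a simple product. All probabilities below are hypothetical numbers chosen for illustration, and the sketch only multiplies likelihoods for the words that are present.

```python
# Hypothetical class priors and per-class word likelihoods (illustration only).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def posterior(words):
    """Return P(class | words) assuming words are independent given the class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            # The "naive" step: multiply per-feature likelihoods.
            score *= likelihoods[label][word]
        scores[label] = score
    total = sum(scores.values())  # normalizing constant P(words)
    return {label: s / total for label, s in scores.items()}

result = posterior(["free", "meeting"])
print(result)
```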




Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?
Ans: While all Naïve Bayes classifiers are based on the same core principle of Bayes' Theorem and the "naïve" assumption of feature independence, they differ in how they model the distribution of the features. The choice of variant depends on the type of data you're working with.

Gaussian Naïve Bayes
The Gaussian Naïve Bayes variant assumes that the features follow a normal (Gaussian) distribution. This is particularly useful for continuous numerical data. It calculates the mean and variance for each feature within each class to estimate the probability of a data point belonging to a certain class.



When to use it: Use Gaussian Naïve Bayes when your features are continuous, such as height, weight, temperature, or other measurements that can be described by a bell-shaped curve. It's often applied in fields like medical diagnosis or fraud detection where features are numerical.


Multinomial Naïve Bayes
The Multinomial Naïve Bayes variant is designed for data where the features are discrete counts. It models the probability of observing these counts. For example, in text classification, a document can be represented by a vector of word counts.


When to use it: This is the go-to classifier for text classification problems like spam filtering, sentiment analysis, and document categorization. It works with features that represent the frequency of events, such as how many times a particular word appears in a document.


Bernoulli Naïve Bayes
The Bernoulli Naïve Bayes variant is also suited for discrete data, but it's specifically for binary features. It models the presence or absence of a feature, not its frequency. The feature vector is a binary vector (0s and 1s) indicating whether a specific word or feature exists in a document or not.


When to use it: This is ideal for tasks where the simple presence or absence of a feature is more important than its frequency. A common use case is also text classification, but when the model should only consider if a word is present, not how many times it appears.
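
The three variants can be compared side by side by pairing each with the kind of feature it models. The tiny arrays below are invented toy data, meant only to show which input type goes with which class.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0], [1.2], [5.0], [5.2]])
gauss_pred = GaussianNB().fit(X_cont, y).predict([[1.1], [5.1]])

# Word counts -> MultinomialNB
X_counts = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
multi_pred = MultinomialNB().fit(X_counts, y).predict([[4, 0], [0, 4]])

# Binary presence/absence -> BernoulliNB
X_bin = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bern_pred = BernoulliNB().fit(X_bin, y).predict([[1, 0], [0, 1]])

print("GaussianNB:   ", gauss_pred)
print("MultinomialNB:", multi_pred)
print("BernoulliNB:  ", bern_pred)
```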

Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors

In [None]:
#ans:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target # Target variable (species)

# Split the dataset into training and testing sets
# We'll use 80% of the data for training and 20% for testing
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train an SVM Classifier with a linear kernel
# SVC stands for Support Vector Classification
# kernel='linear' specifies a linear decision boundary
svm_model = SVC(kernel='linear', random_state=42)

# Fit the model to the training data
svm_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test)

# 3. Print the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# 4. Print the support vectors
# Support vectors are the data points closest to the hyperplane
# They play a crucial role in defining the decision boundary
print("\nSupport Vectors:")
# svm_model.support_vectors_ contains the actual support vectors
print(svm_model.support_vectors_)

# You can also get the indices of the support vectors from the training set
print("\nIndices of Support Vectors (from training set):")
print(svm_model.support_)


Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.


In [None]:
#Ans:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# 1. Load the Breast Cancer dataset
# This dataset is commonly used for binary classification tasks.
# It contains features computed from a digitized image of a fine needle aspirate (FNA)
# of a breast mass, describing characteristics of the cell nuclei present in the image.
breast_cancer = load_breast_cancer()
X = breast_cancer.data  # Features
y = breast_cancer.target # Target variable (malignant or benign)

# Split the dataset into training and testing sets
# We'll use 80% of the data for training and 20% for testing.
# random_state ensures reproducibility of the split, meaning you'll get the same split
# every time you run the code with the same random_state value.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train a Gaussian Naïve Bayes model
# Gaussian Naïve Bayes is a probabilistic classifier based on Bayes' theorem
# with the assumption of independence among predictors.
# It assumes that the features follow a Gaussian (normal) distribution.
gnb_model = GaussianNB()

# Fit the model to the training data
# This step trains the model using the provided training features (X_train)
# and their corresponding target labels (y_train).
gnb_model.fit(X_train, y_train)

# Make predictions on the test set
# Once the model is trained, we use it to predict the target labels
# for the unseen test data (X_test).
y_pred = gnb_model.predict(X_test)

# 3. Print its classification report including precision, recall, and F1-score.
# The classification report is a text summary of the precision, recall, F1-score
# for each class, and support (number of occurrences of each class in y_test).
# Precision: The ratio of correctly predicted positive observations to the total predicted positives.
# Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations in the actual class.
# F1-score: The harmonic mean of Precision and Recall. It balances the two and is
# pulled toward the lower of them.
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))
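
The three metrics in the report can also be worked out by hand. The counts below are hypothetical and are not results from the Breast Cancer model; they only illustrate the arithmetic.

```python
# Hypothetical counts for the positive class (illustration only).
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)  # correct positives among predicted positives
recall = tp / (tp + fn)     # correct positives among actual positives

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f} (arithmetic mean would be {arithmetic_mean:.3f})")
```

When precision and recall differ, the harmonic mean is lower than the arithmetic mean, so F1 penalizes a model that does well on only one of the two.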


Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy

In [None]:
#Ans:
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
# The Wine dataset is a classic dataset for classification,
# containing chemical analysis of wines grown in the same region in Italy
# but derived from three different cultivars.
wine = load_wine()
X = wine.data  # Features (chemical properties)
y = wine.target # Target variable (wine cultivar)

# Split the dataset into training and testing sets
# A test size of 20% is used, and random_state ensures reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid for GridSearchCV
# 'C' is the regularization parameter: larger values penalize training errors more
# heavily (narrower margin), smaller values allow more errors (wider margin).
# 'gamma' defines how far the influence of a single training example reaches.
# Small gamma means a far-reaching influence, large gamma means a local influence.
param_grid = {
    'C': [0.1, 1, 10, 100],  # Common values for C
    'gamma': [1, 0.1, 0.01, 0.001], # Common values for gamma
    'kernel': ['rbf'] # We'll use the Radial Basis Function (RBF) kernel, which is common with gamma
}

# Create an SVM Classifier model
# We start with a basic SVC model. GridSearchCV will tune its parameters.
svm_model = SVC(random_state=42)

# 2. Train an SVM Classifier using GridSearchCV to find the best C and gamma
# GridSearchCV exhaustively searches over specified parameter values for an estimator.
# cv=5 means 5-fold cross-validation will be used on the training data.
# verbose=3 provides a detailed output during the search process.
grid_search = GridSearchCV(svm_model, param_grid, cv=5, verbose=3, scoring='accuracy')

# Fit GridSearchCV to the training data
# This step performs the cross-validation and hyperparameter tuning.
grid_search.fit(X_train, y_train)

# 3. Print the best hyperparameters and accuracy
print("\nBest Hyperparameters found by GridSearchCV:")
print(grid_search.best_params_)

# Get the best estimator (model with the best hyperparameters)
best_svm_model = grid_search.best_estimator_

# Make predictions on the test set using the best model
y_pred = best_svm_model.predict(X_test)

# Calculate and print the accuracy of the best model on the test set
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy of the best SVM model on the test set: {accuracy:.2f}")
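
RBF SVMs are usually sensitive to feature scale, and the Wine features span very different ranges. One possible variation, sketched here as an assumption rather than as part of the assignment's required answer, is to wrap the classifier in a pipeline with `StandardScaler` so that scaling is refit inside each cross-validation fold. The `svc__` prefix comes from the step name that `make_pipeline` assigns automatically.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so it is fit on training folds only.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

scaled_accuracy = search.score(X_test, y_test)
print("Best hyperparameters:", search.best_params_)
print(f"Test accuracy with scaling: {scaled_accuracy:.2f}")
```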


Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.

In [None]:
#Ans:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import numpy as np

# 1. Load a text dataset (20 Newsgroups, a corpus of real newsgroup posts)
# We'll select two categories to make it a binary classification problem for ROC-AUC calculation.
# 'alt.atheism' and 'soc.religion.christian' are used as the two classes.
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_data = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

X_text = newsgroups_data.data  # The raw text documents
y = newsgroups_data.target     # The target labels (0 for alt.atheism, 1 for soc.religion.christian)

# Split the raw documents into training and testing sets first
# 80% for training, 20% for testing, with random_state for reproducibility.
X_train_text, X_test_text, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

# Convert text data into numerical features using TF-IDF Vectorizer
# TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic
# that is intended to reflect how important a word is to a document in a collection or corpus.
# The vocabulary and IDF weights are fit on the training documents only,
# so no information from the test set leaks into the features.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000) # Limit features for efficiency
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# 2. Train a Naïve Bayes Classifier
# Multinomial Naïve Bayes is designed for discrete features such as word counts;
# fractional values such as TF-IDF weights also tend to work in practice.
nb_model = MultinomialNB()

# Fit the model to the training data
nb_model.fit(X_train, y_train)

# 3. Print the model's ROC-AUC score for its predictions.
# ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) is a performance metric
# for binary classifiers. It represents the ability of the model to distinguish between classes.
# A higher AUC indicates a better model.
# We need probability estimates for ROC-AUC. predict_proba returns the probability of each class.
# For binary classification, we usually take the probability of the positive class (class 1).
y_pred_proba = nb_model.predict_proba(X_test)[:, 1] # Probability of the positive class (soc.religion.christian)

# Calculate the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Model's ROC-AUC Score: {roc_auc:.2f}")
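
ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on four made-up scores; `pairwise_auc` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities (illustration only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, scores)
print(f"Pairwise AUC: {auc:.2f}")
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75.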


Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# --- 1. Create a Synthetic Text Dataset (Simulating Emails) ---
# This dataset simulates an imbalanced scenario: 10 legitimate emails, 2 spam emails.
# In a real scenario, you'd load actual email data.
emails = [
    "Hello team, please find the meeting minutes attached. Regards, John.", # Legitimate (0)
    "Hi, just a reminder about our project deadline next Friday.",         # Legitimate (0)
    "Your invoice is ready for download. Click here to view.",             # Spam (1) - suspicious link
    "Meeting rescheduled to 3 PM today. See you there.",                   # Legitimate (0)
    "Congratulations! You've won a free iPhone. Claim your prize now!",   # Spam (1) - typical scam
    "Regarding your recent inquiry, here is the information you requested.",# Legitimate (0)
    "Weekly report for Q2 is due by end of day.",                          # Legitimate (0)
    "Important security update: Verify your account details immediately.", # Legitimate (0) - could be spam, but let's assume legitimate for this example
    "Please review the attached document and provide your feedback.",      # Legitimate (0)
    "Confirm your email address to avoid account suspension.",             # Legitimate (0) - could be spam, but let's assume legitimate for this example
    "New policy document has been uploaded to the shared drive.",          # Legitimate (0)
    "Exclusive offer: Limited time discount on all products!",            # Legitimate (0) - marketing, not spam for this example
]

# 0 for Not Spam, 1 for Spam
labels = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# --- 2. Preprocess the Data (Text Vectorization) ---
# TF-IDF Vectorizer to convert text into numerical features.
# max_features limits the vocabulary size, stop_words removes common words.
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(emails)
y = np.array(labels)

# Split the dataset into training and testing sets
# 30% of this tiny dataset (4 emails) is held out; stratify=y keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print("--- Data Preprocessing Complete ---")
print(f"Original data points: {len(emails)}")
print(f"Training data points: {X_train.shape[0]}, Test data points: {X_test.shape[0]}")
print(f"Training label distribution (0: Not Spam, 1: Spam): {np.bincount(y_train)}")
print(f"Test label distribution (0: Not Spam, 1: Spam): {np.bincount(y_test)}")

# --- 3. Choose and Justify an Appropriate Model (Multinomial Naïve Bayes) ---
# Multinomial Naïve Bayes is well-suited for text classification with TF-IDF features.
# It's efficient and provides a good baseline.

# --- 4. Address Class Imbalance ---
# MultinomialNB has no class_weight parameter, so per-sample weights are used instead.
# compute_sample_weight('balanced', ...) weights each sample inversely proportional to
# its class frequency, giving more importance to the minority class (Spam).
from sklearn.utils.class_weight import compute_sample_weight

nb_model = MultinomialNB()
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)

# Train the Naïve Bayes model
print("\n--- Training Naïve Bayes Model ---")
nb_model.fit(X_train, y_train, sample_weight=sample_weights)
print("Model training complete.")

# --- 5. Evaluate the Performance of Your Solution ---
# Make predictions on the test set
y_pred = nb_model.predict(X_test)

print("\n--- Model Evaluation ---")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred)) # Shows TP, TN, FP, FN

# Print the classification report including precision, recall, and F1-score
# target_names provides readable labels for the classes.
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Spam', 'Spam']))

# Note: For a very small synthetic dataset, results might not be perfectly indicative
# of real-world performance. Real datasets require more extensive preprocessing
# and potentially more advanced techniques for imbalance and feature engineering.
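
The question also mentions incomplete or missing data, which the synthetic list above does not contain. One simple way to handle it before vectorization is sketched below, assuming each email arrives as a record with optional `subject` and `body` fields. The record layout and the `to_text` helper are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw records; None marks a missing field.
raw_emails = [
    {"subject": "Meeting moved", "body": "See you at 3 PM."},
    {"subject": None, "body": "Claim your free prize now!"},
    {"subject": "Weekly report", "body": None},
    {"subject": None, "body": None},
]
raw_labels = [0, 1, 0, 0]

def to_text(record):
    """Join subject and body, treating missing fields as empty strings."""
    parts = [record.get("subject") or "", record.get("body") or ""]
    return " ".join(p.strip() for p in parts if p).strip()

# Keep text and label together so they stay aligned after filtering.
pairs = [(to_text(r), label) for r, label in zip(raw_emails, raw_labels)]
pairs = [(text, label) for text, label in pairs if text]  # drop empty emails

texts = [text for text, _ in pairs]
labels = [label for _, label in pairs]

X_clean = TfidfVectorizer().fit_transform(texts)
print("Usable emails:", len(texts))
print("Feature matrix shape:", X_clean.shape)
```

Emails with one missing field are kept with whatever text remains, while emails with no text at all are dropped together with their labels.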
