Theory Questions
1. What is a Support Vector Machine (SVM)?

Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that best separates different classes in the dataset. SVM aims to maximize the margin between the closest data points (support vectors) from different classes.


2. What is the difference between Hard Margin and Soft Margin SVM?

Hard Margin SVM: Strictly separates data points without allowing misclassification. Suitable for linearly separable data but sensitive to noise.
Soft Margin SVM: Allows some misclassification to handle overlapping or non-linearly separable data, using a penalty parameter (C) to balance margin maximization and classification errors.


3. What is the mathematical intuition behind SVM?

SVM finds the hyperplane that maximizes the margin between two classes. Mathematically, it solves the following optimization problem:
$$\min_{w, b} \frac{1}{2} \|w\|^2$$
Subject to:
$$y_i (w \cdot x_i + b) \geq 1 \quad \forall i$$
For non-linearly separable data, a slack variable $\xi_i$ is introduced:
$$y_i (w \cdot x_i + b) \geq 1 - \xi_i$$
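
For reference, the full soft-margin problem is usually written with the slack variables added to the objective. The following is a sketch in the standard textbook form; the sample count $n$ is notation introduced here, and $C$ is the penalty parameter discussed under Hard Margin vs. Soft Margin. The geometric margin between the two classes is $2 / \|w\|$, which is why minimizing $\|w\|^2$ maximizes the margin.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i (w \cdot x_i + b) \geq 1 - \xi_i, \\
& \xi_i \geq 0 \quad \forall i
\end{aligned}
```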

4. What is the role of Lagrange Multipliers in SVM?

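Lagrange multipliers convert the constrained optimization problem in SVM into its dual form. The dual depends on the training data only through inner products between pairs of points, so a kernel function can be substituted for the inner product. The training points with non-zero multipliers are the support vectors.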
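
As a sketch, the dual of the soft-margin problem in its standard textbook form is shown below. Here $\alpha_i$ is the multiplier attached to training point $i$, $n$ is the number of training points, and $C$ is the soft-margin penalty. The data enters only through $x_i \cdot x_j$.

```latex
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) \\
\text{subject to} \quad & 0 \leq \alpha_i \leq C \quad \forall i, \qquad
  \sum_{i=1}^{n} \alpha_i y_i = 0 \\
\text{with} \quad & w = \sum_{i=1}^{n} \alpha_i y_i x_i
\end{aligned}
```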



5. What are Support Vectors in SVM?

Support vectors are the data points that lie closest to the decision boundary (hyperplane). They define the margin and influence the position of the hyperplane.
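
A minimal sketch of how to inspect support vectors, assuming scikit-learn's SVC and a six-point toy dataset made up for illustration. The fitted attributes `support_vectors_`, `support_`, and `n_support_` hold the support vectors, their row indices in the training data, and the count per class.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (illustrative only): two well-separated clusters
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the points nearest the boundary are kept as support vectors
print("Support vectors:\n", clf.support_vectors_)
print("Row indices in X:", clf.support_)
print("Count per class:", clf.n_support_)
```

Points that are not support vectors could be removed from the training set without changing the fitted hyperplane.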


6. What is a Support Vector Classifier (SVC)?

A Support Vector Classifier (SVC) is the SVM applied to classification tasks. It uses a soft margin, so it can handle noisy and overlapping data. In scikit-learn, the corresponding class is named SVC.


7. What is a Support Vector Regressor (SVR)?

SVR applies the SVM concept to regression problems. It finds a function that approximates the data while allowing a margin of tolerance (epsilon, $\varepsilon$).
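
The margin of tolerance corresponds to the epsilon-insensitive loss: residuals smaller than epsilon cost nothing, and larger ones are penalized linearly. The sketch below illustrates that loss with NumPy; the function name and sample values are made up for illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Return max(0, |y_true - y_pred| - epsilon) for each sample."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])

# The first prediction is inside the epsilon tube, so its loss is zero
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```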


8. What is the Kernel Trick in SVM?

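The kernel trick enables SVM to handle non-linearly separable data by implicitly mapping input features into a higher-dimensional space where a linear separator may exist. The mapping is never computed explicitly: a kernel function returns the inner product of two points in that space directly from the original features.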
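
A small numerical check can make this concrete. The sketch below assumes the degree-2 kernel k(x, z) = (x · z)^2 on 2-D inputs, whose explicit feature map is known in closed form. The names `phi` and `poly2_kernel` are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (x . z)^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Same inner product, computed without building the 3-D features."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # map first, then take the inner product
implicit = poly2_kernel(x, z)       # kernel evaluated in the original space
print(explicit, implicit)
```

Both routes give the same value, but the kernel route never constructs the higher-dimensional vectors.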


9. Compare Linear Kernel, Polynomial Kernel, and RBF Kernel

Linear Kernel: Suitable for linearly separable data.
Polynomial Kernel: Captures non-linear relationships with polynomial degrees.
RBF Kernel: Uses Gaussian function to model complex decision boundaries and is widely used for non-linear data.
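
The three kernels can be written as short functions. This is a sketch using the common parameterization with `gamma`, `coef0`, and `degree`; the default values here are chosen for illustration and are not the defaults of any particular library.

```python
import numpy as np

def linear_kernel(x, z):
    # Plain inner product
    return float(np.dot(x, z))

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    # (gamma * <x, z> + coef0) ** degree
    return float((gamma * np.dot(x, z) + coef0) ** degree)

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2)
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])

k_lin = linear_kernel(x, z)
k_poly = polynomial_kernel(x, z)
k_rbf = rbf_kernel(x, z)
print(k_lin, k_poly, k_rbf)
```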

10. What is the effect of the C parameter in SVM?

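The parameter C controls the trade-off between maximizing the margin and minimizing misclassification of training points. A higher C penalizes errors more heavily, which narrows the margin, reduces training errors, and raises the risk of overfitting. A lower C tolerates more training errors in exchange for a wider margin and a smoother decision boundary.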

11. What is the role of the Gamma parameter in RBF Kernel SVM?

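Gamma determines how far the influence of a single training example reaches. With a high gamma, each point influences only its immediate neighborhood, which produces a more complex boundary that can overfit. With a low gamma, each point's influence reaches far, which produces a smoother boundary that can underfit.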
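
The reach of a training example can be seen directly in the RBF kernel value exp(-gamma * d^2), where d is the distance between two points. The sketch below tabulates that value for a few illustrative gammas and distances.

```python
import numpy as np

distances = np.array([0.0, 0.5, 1.0, 2.0])
similarities = {}

for gamma in (0.1, 1.0, 10.0):
    # RBF kernel value between two points separated by each distance
    similarities[gamma] = np.exp(-gamma * distances ** 2)
    print(f"gamma={gamma}: {np.round(similarities[gamma], 4)}")
```

With a large gamma the similarity drops toward zero within a short distance, so each training point affects only nearby predictions.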


12. What is the Naïve Bayes classifier, and why is it called "Naïve"?

Naïve Bayes is a probabilistic classifier based on Bayes' Theorem, assuming independence among features. It is called "naïve" because it assumes that all features contribute independently to the probability of a class, which is rarely true in real-world scenarios.
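
Written out, the independence assumption lets the class posterior factor into one term per feature. This is the standard form, with $x_1, \dots, x_n$ denoting the feature values and $c$ a class label (notation introduced here):

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```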


13. What is Bayes' Theorem?

Bayes' Theorem describes the probability of an event based on prior knowledge:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
where:
$P(A \mid B)$ is the posterior probability.
$P(B \mid A)$ is the likelihood.
$P(A)$ is the prior probability.
$P(B)$ is the marginal probability.
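
A short worked example may help. The numbers below are hypothetical: a condition with 1% prevalence, and a test with a 95% true-positive rate and a 5% false-positive rate. The code computes the probability of having the condition given a positive test.

```python
# Hypothetical numbers, chosen only to illustrate the formula
prior = 0.01                 # P(A): prevalence of the condition
sensitivity = 0.95           # P(B|A): positive test given the condition
false_positive_rate = 0.05   # P(B|not A): positive test without the condition

# Marginal probability of a positive test, P(B), by the law of total probability
evidence = sensitivity * prior + false_positive_rate * (1 - prior)

# Posterior P(A|B) from Bayes' Theorem
posterior = sensitivity * prior / evidence
print(f"P(condition | positive test) = {posterior:.3f}")
```

The posterior stays well below the test's 95% true-positive rate because the prior is small.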

14. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes

Gaussian Naïve Bayes: Used for continuous data assuming normal distribution.
Multinomial Naïve Bayes: Used for discrete data like word counts in text classification.
Bernoulli Naïve Bayes: Used for binary features (presence/absence of words).

15. When should you use Gaussian Naïve Bayes over other variants?

Gaussian Naïve Bayes is preferable when dealing with continuous numerical data that follows a normal distribution, such as height, weight, or sensor data.
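
For reference, the per-feature likelihood that Gaussian Naïve Bayes assumes is the normal density, with a mean $\mu_{c,i}$ and variance $\sigma_{c,i}^2$ estimated for each class $c$ and feature $i$ (subscript notation introduced here):

```latex
P(x_i \mid c) = \frac{1}{\sqrt{2 \pi \sigma_{c,i}^2}}
\exp\!\left( -\frac{(x_i - \mu_{c,i})^2}{2 \sigma_{c,i}^2} \right)
```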


16. What are the key assumptions made by Naïve Bayes?

Features are independent given the class.
Each feature contributes equally to the outcome.
Data follows a specific distribution (e.g., Gaussian for continuous data).

17. What are the advantages and disadvantages of Naïve Bayes?

Advantages:
Fast and efficient for large datasets.
Works well with high-dimensional data.
Performs well in text classification tasks.
Disadvantages:
Assumes feature independence, which may not hold.
Struggles with correlated features.
Requires sufficient data to estimate probabilities accurately.


18. Why is Naïve Bayes a good choice for text classification?

Naïve Bayes performs well in text classification because:
It handles high-dimensional data efficiently.
It is robust to irrelevant features.
It often works reasonably well with relatively little training data.

19. Compare SVM and Naïve Bayes for classification tasks

SVM is a margin-based classifier that works well for complex, high-dimensional data but is computationally expensive. Naïve Bayes is a probabilistic classifier that is fast and effective, especially for text classification, but struggles with correlated features. SVM excels in structured datasets, while Naïve Bayes is ideal for quick, efficient classification tasks.

20. How does Laplace Smoothing help in Naïve Bayes?

Laplace Smoothing (also called additive smoothing) prevents zero probabilities by adding a small constant to every count before the probabilities are estimated. This ensures that a word never seen with a class during training does not force the probability of that class to zero.
$$P(w \mid c) = \frac{\text{count}(w, c) + \alpha}{\text{count}(c) + \alpha \, |V|}$$
where $\text{count}(c)$ is the total number of words in class $c$, $|V|$ is the vocabulary size, and $\alpha$ is a smoothing parameter, typically set to 1.
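
The formula can be sketched in a few lines of Python. The function name, vocabulary, and word list below are made up for illustration.

```python
from collections import Counter

def smoothed_word_probs(words_in_class, vocabulary, alpha=1.0):
    """Estimate P(w|c) for every vocabulary word with additive smoothing."""
    counts = Counter(words_in_class)
    denominator = len(words_in_class) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denominator for w in vocabulary}

vocabulary = ["free", "meeting", "offer"]
spam_words = ["free", "free", "offer", "free"]  # "meeting" never appears in this class

probs = smoothed_word_probs(spam_words, vocabulary, alpha=1.0)
print(probs)
```

The unseen word "meeting" receives a small non-zero probability, and the estimates still sum to 1.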





Practical Questions

21.
from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import accuracy_score

# Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train SVM classifier
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)

# Predict
y_pred = svm_clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("SVM Classifier Accuracy:", accuracy)


22.

from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import accuracy_score

# Load dataset
wine = datasets.load_wine()
X, y = wine.data, wine.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)

# Train SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)

# Compare Accuracies
accuracy_linear = accuracy_score(y_test, y_pred_linear)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

print(f"Linear Kernel Accuracy: {accuracy_linear}")
print(f"RBF Kernel Accuracy: {accuracy_rbf}")


23.

from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.svm import SVR

from sklearn.metrics import mean_squared_error

# Load dataset
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target  # Diabetes dataset, used in place of a housing dataset

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train SVR model
svr = SVR(kernel='rbf')
svr.fit(X_train, y_train)

# Predict
y_pred = svr.predict(X_test)

# Evaluate MSE
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)


24.

import numpy as np

import matplotlib.pyplot as plt

from sklearn.svm import SVC

from sklearn.datasets import make_moons

from sklearn.preprocessing import StandardScaler

# Generate dataset
X, y = make_moons(n_samples=100, noise=0.1, random_state=42)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train SVM with Polynomial Kernel
svm_poly = SVC(kernel='poly', degree=3, C=1)
svm_poly.fit(X_scaled, y)

# Plot decision boundary
xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))

Z = svm_poly.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)


plt.contourf(xx, yy, Z, alpha=0.3)

plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, edgecolors='k')

plt.title("SVM with Polynomial Kernel")

plt.show()


25.
from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score

# Load dataset
cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gaussian Naïve Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Gaussian Naïve Bayes Accuracy:", accuracy)


26.
from sklearn.datasets import fetch_20newsgroups

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score

# Load dataset
newsgroups_train = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos', 'comp.graphics'])

newsgroups_test = fetch_20newsgroups(subset='test', categories=['sci.space', 'rec.autos', 'comp.graphics'])

# Create a pipeline for text classification
text_clf = Pipeline([
    ('vect', CountVectorizer()),  
    ('tfidf', TfidfTransformer()),  
    ('clf', MultinomialNB()),  
])

# Train model
text_clf.fit(newsgroups_train.data, newsgroups_train.target)

# Predict
y_pred = text_clf.predict(newsgroups_test.data)

# Evaluate accuracy
accuracy = accuracy_score(newsgroups_test.target, y_pred)

print("Multinomial Naïve Bayes Accuracy:", accuracy)


27.
import numpy as np

import matplotlib.pyplot as plt

from sklearn.svm import SVC

from sklearn.datasets import make_moons

from sklearn.preprocessing import StandardScaler

# Generate dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Define different C values
C_values = [0.1, 1, 10]

plt.figure(figsize=(15, 5))
for i, C in enumerate(C_values, 1):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_scaled, y)

    # Plot decision boundary
    xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.subplot(1, 3, i)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, edgecolors='k')
    plt.title(f"SVM Decision Boundary (C={C})")

plt.show()


28.
from sklearn.naive_bayes import BernoulliNB

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_classification

from sklearn.metrics import accuracy_score

# Generate binary feature dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

X = (X > 0).astype(int)  # Convert features to binary

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bernoulli Naïve Bayes
bnb = BernoulliNB()

bnb.fit(X_train, y_train)

# Predict and evaluate
y_pred = bnb.predict(X_test)

print("Bernoulli Naïve Bayes Accuracy:", accuracy_score(y_test, y_pred))


29.
from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import accuracy_score

# Load dataset
wine = datasets.load_wine()
X, y = wine.data, wine.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train SVM without scaling
svm_no_scaling = SVC(kernel='rbf')
svm_no_scaling.fit(X_train, y_train)
accuracy_no_scaling = accuracy_score(y_test, svm_no_scaling.predict(X_test))

# Apply Standard Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train SVM with scaling
svm_with_scaling = SVC(kernel='rbf')
svm_with_scaling.fit(X_train_scaled, y_train)
accuracy_with_scaling = accuracy_score(y_test, svm_with_scaling.predict(X_test_scaled))

print(f"Without Scaling Accuracy: {accuracy_no_scaling}")

print(f"With Scaling Accuracy: {accuracy_with_scaling}")


30.
from sklearn.datasets import load_breast_cancer

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GaussianNB has no Laplace smoothing; var_smoothing adds a fraction of the
# largest feature variance to every variance for numerical stability.
# Train model with the default var_smoothing (1e-9)
gnb_no_smoothing = GaussianNB(var_smoothing=1e-9)
gnb_no_smoothing.fit(X_train, y_train)

# Train model with a much larger var_smoothing
gnb_smoothing = GaussianNB(var_smoothing=1e-2)
gnb_smoothing.fit(X_train, y_train)

# Compare accuracy
print("Default var_smoothing Accuracy:", accuracy_score(y_test, gnb_no_smoothing.predict(X_test)))

print("Larger var_smoothing Accuracy:", accuracy_score(y_test, gnb_smoothing.predict(X_test)))


31.
from sklearn.model_selection import GridSearchCV

from sklearn.svm import SVC

# Define parameters grid
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto'], 'kernel': ['linear', 'rbf']}

# Perform Grid Search
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)


32.
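from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import accuracy_score

# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Split data, stratifying so both sets keep the 90/10 class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)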

# Train normal SVM
svm_no_weight = SVC()
svm_no_weight.fit(X_train, y_train)

accuracy_no_weight = accuracy_score(y_test, svm_no_weight.predict(X_test))

# Train SVM with class weighting
svm_weighted = SVC(class_weight='balanced')

svm_weighted.fit(X_train, y_train)

accuracy_weighted = accuracy_score(y_test, svm_weighted.predict(X_test))

print(f"Without Class Weighting Accuracy: {accuracy_no_weight}")

print(f"With Class Weighting Accuracy: {accuracy_weighted}")


33.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from sklearn.pipeline import Pipeline

from sklearn.naive_bayes import MultinomialNB

from sklearn.datasets import fetch_openml

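# Load spam dataset
# Assumes an OpenML dataset named 'sms_spam' that exposes a 'text' column
data = fetch_openml('sms_spam', version=1, as_frame=True)
X_text, y_text = data.data['text'], data.target

# Split the text data (separate names leave the numeric split from earlier cells intact)
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(X_text, y_text, test_size=0.2, random_state=42)

# Create pipeline
spam_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Train and evaluate
spam_clf.fit(X_text_train, y_text_train)

print("Spam Detection Accuracy:", accuracy_score(y_text_test, spam_clf.predict(X_text_test)))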


34.
from sklearn.naive_bayes import GaussianNB

# Train SVM
svm = SVC(kernel='linear')

svm.fit(X_train, y_train)

accuracy_svm = accuracy_score(y_test, svm.predict(X_test))

# Train Naïve Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)

accuracy_nb = accuracy_score(y_test, nb.predict(X_test))

print(f"SVM Accuracy: {accuracy_svm}")
print(f"Naïve Bayes Accuracy: {accuracy_nb}")


35.
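from sklearn.datasets import load_breast_cancer

from sklearn.feature_selection import SelectKBest, chi2

# Load data with non-negative features, as chi2 requires
# (the breast cancer dataset is an assumption; the original does not name one)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Select best features, fitting the selector on the training split only
selector = SelectKBest(chi2, k=10)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Train Naïve Bayes with selected features
nb_selected = GaussianNB()

nb_selected.fit(X_train_sel, y_train)

accuracy_selected = accuracy_score(y_test, nb_selected.predict(X_test_sel))

print("Accuracy after Feature Selection:", accuracy_selected)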


36.
from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

from sklearn.metrics import accuracy_score

# Load dataset
wine = datasets.load_wine()
X, y = wine.data, wine.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train One-vs-Rest SVM
ovr_svm = OneVsRestClassifier(SVC(kernel='linear'))
ovr_svm.fit(X_train, y_train)
accuracy_ovr = accuracy_score(y_test, ovr_svm.predict(X_test))

# Train One-vs-One SVM
ovo_svm = OneVsOneClassifier(SVC(kernel='linear'))
ovo_svm.fit(X_train, y_train)
accuracy_ovo = accuracy_score(y_test, ovo_svm.predict(X_test))

print(f"One-vs-Rest Accuracy: {accuracy_ovr}")
print(f"One-vs-One Accuracy: {accuracy_ovo}")


37.
from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import accuracy_score

# Load dataset
cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train models with different kernels
kernels = ['linear', 'poly', 'rbf']
for kernel in kernels:
    svm = SVC(kernel=kernel)
    
    svm.fit(X_train, y_train)
    
    accuracy = accuracy_score(y_test, svm.predict(X_test))
    
    print(f"SVM with {kernel} kernel Accuracy: {accuracy}")


38.
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Define stratified K-Fold
skf = StratifiedKFold(n_splits=5)

# Train and evaluate using cross-validation
svm = SVC(kernel='linear')
scores = cross_val_score(svm, X, y, cv=skf)

print(f"Average Accuracy using Stratified K-Fold: {scores.mean()}")


39.
from sklearn.naive_bayes import GaussianNB

# Train Naïve Bayes with different priors
priors_list = [[0.7, 0.3], [0.5, 0.5], [0.3, 0.7]]

for priors in priors_list:
    nb = GaussianNB(priors=priors)
    
    nb.fit(X_train, y_train)
    
    accuracy = accuracy_score(y_test, nb.predict(X_test))
    
    print(f"Accuracy with priors {priors}: {accuracy}")


40.
from sklearn.feature_selection import RFE

# Feature selection using RFE
svm = SVC(kernel='linear')

rfe = RFE(estimator=svm, n_features_to_select=10)

# Fit RFE on the training split only, so the test set stays unseen
X_train_rfe = rfe.fit_transform(X_train, y_train)

# Train and evaluate SVM with selected features
svm.fit(X_train_rfe, y_train)

print("SVM Accuracy after RFE:", accuracy_score(y_test, svm.predict(rfe.transform(X_test))))


41.
from sklearn.metrics import precision_score, recall_score, f1_score

# Train SVM
svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

# Evaluate
precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1}")


42.
from sklearn.metrics import log_loss

# Train Naïve Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
y_prob = nb.predict_proba(X_test)

# Compute Log Loss
print("Log Loss:", log_loss(y_test, y_prob))


43.
import seaborn as sns

from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt

# Train SVM
svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

# Plot Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')

plt.xlabel("Predicted")

plt.ylabel("Actual")

plt.title("Confusion Matrix")

plt.show()


44.
from sklearn.svm import SVR

from sklearn.metrics import mean_absolute_error

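# Load regression dataset
diabetes = datasets.load_diabetes()
X_reg, y_reg = diabetes.data, diabetes.target

# Split data (separate names leave the classification split from earlier cells intact)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Train SVR
svr = SVR(kernel='rbf')
svr.fit(X_train_reg, y_train_reg)

# Evaluate MAE
y_pred_reg = svr.predict(X_test_reg)
print("Mean Absolute Error:", mean_absolute_error(y_test_reg, y_pred_reg))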


45.
from sklearn.metrics import roc_auc_score

# Train Naïve Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
y_prob = nb.predict_proba(X_test)[:, 1]

# Compute ROC-AUC
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))


46.
from sklearn.metrics import precision_recall_curve

# Train SVM
svm = SVC(probability=True)
svm.fit(X_train, y_train)
y_prob = svm.predict_proba(X_test)[:, 1]

# Compute Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)

# Plot Precision-Recall Curve
plt.plot(recall, precision, marker='.')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
