# **THEORETICAL QUESTIONS**

Q1. What is a Support Vector Machine (SVM)?
- A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that best separates the data points of different classes with the maximum margin between them.


Q2. What is the difference between Hard Margin and Soft Margin SVM?

- The key difference between Hard Margin and Soft Margin SVM lies in how they handle data that is not perfectly separable. A Hard Margin SVM assumes that the data is completely linearly separable, meaning it tries to find a hyperplane that classifies all training examples correctly without allowing any misclassifications. This approach works well only when there is a clear margin between the classes and no overlap, but it is very sensitive to noise and outliers.

- On the other hand, a Soft Margin SVM introduces flexibility by allowing some data points to be on the wrong side of the margin or even misclassified. This is achieved by introducing slack variables and a regularization parameter C, which balances the trade-off between maximizing the margin and minimizing classification error. The Soft Margin approach is more robust and better suited for real-world datasets where perfect separation is rare due to noise or overlapping classes.

Q3. What is the mathematical intuition behind SVM?
- SVM tries to:

- Maximize the margin between the support vectors (closest points of each class).

- Formulate the problem as an optimization task:

$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^T x_i + b) \ge 1$$

- In Soft Margin, slack variables $\xi_i$ are introduced to allow margin violations:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,(w^T x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0$$

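The soft-margin objective can be evaluated directly for any candidate hyperplane. The sketch below is illustrative only: the helper name `soft_margin_objective` and the toy data are assumptions, and it relies on the fact that for a fixed $(w, b)$ the smallest feasible slack is $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Return (objective, slacks) for labels y in {-1, +1}."""
    margins = y * (X @ w + b)                 # y_i (w^T x_i + b)
    slacks = np.maximum(0.0, 1.0 - margins)   # xi_i, zero when the margin is respected
    objective = 0.5 * np.dot(w, w) + C * slacks.sum()
    return objective, slacks

# Toy data: the third point lies inside the margin of the hyperplane x1 = 0.
X_toy = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y_toy = np.array([1, -1, 1])
w_toy = np.array([1.0, 0.0])
b_toy = 0.0

objective, slacks = soft_margin_objective(w_toy, b_toy, X_toy, y_toy, C=1.0)
print("objective:", objective)
print("slacks:", slacks)
```

Only the point at `[0.5, 0.0]` gets a non-zero slack, so raising `C` increases the cost of that single violation.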

Q4. What is the role of Lagrange Multipliers in SVM?
- Lagrange multipliers are used to:

- Convert the constrained primal problem into its dual, whose constraints are much simpler ($\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$ in the hard-margin case).

- Allow solving the problem more efficiently, especially with kernels.

- In dual form, only support vectors have non-zero Lagrange multipliers $\alpha_i$, which determine the final model:

$$w = \sum_i \alpha_i y_i x_i$$

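The relation $w = \sum_i \alpha_i y_i x_i$ can be checked numerically. The following is a minimal sketch, assuming scikit-learn's `SVC` with a linear kernel on a binary toy dataset (the dataset and variable names are illustrative); for a binary problem, `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative two-class dataset
X_demo, y_demo = make_blobs(n_samples=60, centers=2, random_state=0, cluster_std=1.0)
clf_demo = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

# dual_coef_ has shape (1, n_support_vectors) here and holds alpha_i * y_i,
# so the matrix product below is the sum over support vectors of alpha_i y_i x_i.
w_from_dual = clf_demo.dual_coef_ @ clf_demo.support_vectors_

print("w from dual form:", w_from_dual)
print("coef_           :", clf_demo.coef_)
print("support vectors :", len(clf_demo.support_), "of", len(X_demo), "points")
```

All training points that are not support vectors have $\alpha_i = 0$ and therefore do not appear in the sum.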


Q5. What are Support Vectors in SVM?
- Support vectors are the data points that lie closest to the decision boundary (hyperplane).

- They are critical because they define the position and orientation of the hyperplane.

- Only these points directly affect the model: removing a support vector can change the margin, while removing any other training point leaves the solution unchanged.
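
One way to see that only the support vectors matter is to refit the model on the support vectors alone and compare the resulting hyperplane. This is a sketch under assumptions: the toy dataset is illustrative, and `tol=1e-6` is chosen only so that the two fits agree closely; the match is up to solver tolerance, not exact.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_sv, y_sv = make_blobs(n_samples=80, centers=2, random_state=1, cluster_std=1.0)

full_model = SVC(kernel="linear", C=1.0, tol=1e-6).fit(X_sv, y_sv)
sv_idx = full_model.support_          # indices of the support vectors in X_sv

# Refit using only the support vectors; every other point is discarded.
reduced_model = SVC(kernel="linear", C=1.0, tol=1e-6).fit(X_sv[sv_idx], y_sv[sv_idx])

max_diff = np.abs(full_model.coef_ - reduced_model.coef_).max()
print("support vectors:", len(sv_idx), "of", len(X_sv))
print("largest coefficient difference after refit:", max_diff)
```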



Q6. What is a Support Vector Classifier (SVC)?
- A Support Vector Classifier (SVC) is a type of Support Vector Machine (SVM) used for classification tasks. It finds the optimal hyperplane that best separates data points of different classes by maximizing the margin between the closest points (support vectors) of each class.


Q7. What is a Support Vector Regressor (SVR)?
- A Support Vector Regressor (SVR) is the regression version of SVM. Instead of classifying data, it tries to fit a function within a margin of tolerance (ε) from the actual data points. The goal is to find a line (or curve) that predicts continuous values with minimal error, ignoring errors within the epsilon margin.
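
The effect of the ε-tube can be observed by counting support vectors, since in SVR only points on or outside the tube become support vectors. A minimal sketch, assuming a synthetic noisy sine curve and arbitrary illustrative values for `epsilon`:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_reg = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y_reg = np.sin(X_reg).ravel() + rng.normal(scale=0.1, size=100)

sv_counts = {}
for eps in (0.01, 0.5):
    svr = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X_reg, y_reg)
    sv_counts[eps] = len(svr.support_)   # points on or outside the epsilon-tube
    print(f"epsilon={eps}: {sv_counts[eps]} support vectors")
```

A wider tube ignores more of the noise, so fewer points end up as support vectors and the fitted function is simpler.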


Q8. What is the Kernel Trick in SVM?
- The Kernel Trick is a method used in SVM to transform data into a higher-dimensional space without explicitly computing the transformation. It enables SVM to solve non-linear problems by applying kernel functions (like RBF or Polynomial) to compute the dot product in the transformed space, allowing linear separation in that space.
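
A small worked example makes the idea concrete. For the degree-2 polynomial kernel $k(x, z) = (x^T z)^2$ in two dimensions, an explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The sketch below (function names are illustrative) checks that the kernel gives the same value as the dot product in the transformed space:

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel k(x, z) = (x . z)^2 in 2-D."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Same inner product, computed without leaving the original 2-D space."""
    return np.dot(x, z) ** 2

x_a = np.array([1.0, 2.0])
z_a = np.array([3.0, -1.0])

explicit = np.dot(phi(x_a), phi(z_a))   # dot product in the 3-D feature space
implicit = poly2_kernel(x_a, z_a)       # kernel evaluated in the 2-D input space
print(explicit, implicit)
```

Both values agree, but the kernel never builds the 3-D vectors. For the RBF kernel the feature space is infinite-dimensional, so the implicit route is the only practical one.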

Q9. Compare Linear Kernel, Polynomial Kernel, and RBF Kernel:
- The Linear Kernel, Polynomial Kernel, and RBF (Radial Basis Function) Kernel are commonly used kernel functions in SVM, each suited for different types of data. The Linear Kernel is the simplest and is used when the data is linearly separable—meaning it can be divided by a straight line (or hyperplane). It is efficient and works well in high-dimensional spaces, such as text classification. The Polynomial Kernel introduces more complexity by mapping the data into a higher-dimensional space using polynomial functions. This is useful when the relationship between the features and the target is curved or non-linear. The RBF Kernel, also known as the Gaussian Kernel, is the most widely used and can handle complex and highly non-linear relationships by considering the distance between data points. It transforms the data into an infinite-dimensional space, allowing SVM to draw flexible boundaries between classes. In summary, the choice of kernel depends on the nature of the data: use linear when the data is simple, polynomial for moderate non-linearity, and RBF for highly non-linear and complex patterns.

Q10. What is the effect of the C parameter in SVM?
- The C parameter controls the trade-off between a smooth decision boundary and classifying training points correctly:

- A small C makes the margin wider and tolerates some misclassifications (stronger regularization, which often generalizes better).

- A large C penalizes misclassified training points heavily, giving a narrower margin that may lead to overfitting.
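
The effect of C on the margin can be measured directly, since for a linear SVM the margin width is $2 / \lVert w \rVert$. A minimal sketch, assuming a synthetic two-class dataset and two illustrative values of `C`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_c, y_c = make_blobs(n_samples=100, centers=2, random_state=6, cluster_std=1.5)

margin_widths = {}
for C in (0.01, 100):
    clf_c = SVC(kernel="linear", C=C).fit(X_c, y_c)
    margin_widths[C] = 2.0 / np.linalg.norm(clf_c.coef_)   # margin width = 2 / ||w||
    print(f"C={C}: margin width={margin_widths[C]:.3f}, "
          f"support vectors={clf_c.n_support_.sum()}")
```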

Q11. What is the role of the Gamma parameter in RBF Kernel SVM?
- Gamma (γ) defines how far the influence of a single training example reaches in an RBF (Radial Basis Function) kernel. A low gamma means a large similarity radius, so points far away from each other are considered similar, leading to smoother decision boundaries. A high gamma makes the similarity radius smaller, meaning the model focuses tightly around each support vector, which can lead to overfitting. Essentially, gamma controls the curvature of the decision boundary.
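
The RBF kernel is $k(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$, so the role of gamma can be seen by evaluating the similarity of one fixed pair of points. A short sketch using scikit-learn's `rbf_kernel`; the two points and the gamma values are arbitrary illustrations:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x_pt = np.array([[0.0, 0.0]])
z_pt = np.array([[1.0, 1.0]])   # squared distance to x_pt is 2

similarities = {}
for gamma in (0.1, 1.0, 10.0):
    # rbf_kernel computes exp(-gamma * ||x - z||^2)
    similarities[gamma] = rbf_kernel(x_pt, z_pt, gamma=gamma)[0, 0]
    print(f"gamma={gamma}: similarity={similarities[gamma]:.6f}")
```

With a low gamma the two points still look similar; with a high gamma the similarity is close to zero, so each training point influences only its immediate neighborhood.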

Q12. What is the Naïve Bayes classifier, and why is it called "Naïve"?
- Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem. It assumes that all features are independent given the class label. It is called "naïve" because this independence assumption is rarely true in real data — yet the model still works well in many practical scenarios, especially in text classification and spam filtering.

Q13. What is Bayes’ Theorem?
- Bayes’ Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is mathematically stated as:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Where:

- $P(A \mid B)$ is the posterior probability: probability of A given B.

- $P(B \mid A)$ is the likelihood: probability of B given A.

- $P(A)$ is the prior probability of A.

- $P(B)$ is the evidence, or probability of B.
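
A worked example helps here. The numbers below are hypothetical (a test for a condition with 1% prevalence, 95% sensitivity, and a 5% false-positive rate), and the helper name `posterior` is an assumption for illustration:

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) from P(A), P(B | A) and P(B | not A)."""
    # Evidence P(B) by the law of total probability
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p_disease_given_positive = posterior(prior=0.01, likelihood=0.95,
                                     false_positive_rate=0.05)
print(f"P(disease | positive test) = {p_disease_given_positive:.4f}")
```

Despite the accurate test, the posterior is only about 16%, because the low prior (1%) dominates the result.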

Q14. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes:

- Gaussian Naïve Bayes: Assumes that the features follow a normal (Gaussian) distribution. Suitable for continuous data, such as heights, weights, or sensor readings.

- Multinomial Naïve Bayes: Used for discrete/count data, especially in text classification where features are word counts or frequencies.

- Bernoulli Naïve Bayes: Designed for binary/boolean features (0s and 1s), such as whether a word is present or not in a document.

Q15. When should you use Gaussian Naïve Bayes over other variants?

- You should use Gaussian Naïve Bayes when your features are continuous and approximately normally distributed. It is ideal for datasets where attributes are real-valued and do not represent counts or binary indicators — for example, in medical data, sensor data, or financial indicators.

Q16. What are the key assumptions made by Naïve Bayes?


The main assumption of Naïve Bayes is conditional independence:

- It assumes that all features are independent of each other given the class label.

- That is, the presence (or absence) of a feature does not depend on the presence (or absence) of any other feature, given the class.

This assumption is rarely true in real-world data, but the model still performs well in many applications due to its simplicity and efficiency.
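
Under the independence assumption, the class score is simply the prior multiplied by the per-feature likelihoods. The sketch below uses made-up probabilities for two binary word features; all names and numbers are hypothetical.

```python
import numpy as np

# Hypothetical parameters for two binary word features: "free" and "meeting".
priors = {"spam": 0.4, "ham": 0.6}
p_word_given_class = {
    "spam": np.array([0.8, 0.1]),   # P(free=1 | spam), P(meeting=1 | spam)
    "ham":  np.array([0.1, 0.6]),   # P(free=1 | ham),  P(meeting=1 | ham)
}

def naive_bayes_posterior(x):
    """Posterior over classes for a binary feature vector x."""
    scores = {}
    for label, p in p_word_given_class.items():
        # Independence assumption: multiply the per-feature likelihoods.
        likelihood = np.prod(np.where(x == 1, p, 1.0 - p))
        scores[label] = priors[label] * likelihood
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# A message that contains "free" but not "meeting"
nb_posterior = naive_bayes_posterior(np.array([1, 0]))
print(nb_posterior)
```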

Q17. What are the advantages and disadvantages of Naïve Bayes?
- Advantages:

  - Simple and fast to train and predict.

  - Works well with high-dimensional data, such as text classification.

  - Performs well even with small datasets, since it needs little training data to estimate its parameters.

  - Relatively insensitive to irrelevant features.

- Disadvantages:

  - Strong assumption of feature independence, which is often unrealistic.

  - Predicted probabilities are often poorly calibrated, even when the predicted class is correct.

  - Doesn't perform well when features are highly correlated.

  - Suffers from the zero-frequency problem unless smoothing (e.g., Laplace smoothing) is applied.



Q18. Why is Naïve Bayes a good choice for text classification?

- Text data is typically high-dimensional and sparse, and Naïve Bayes handles both efficiently.

- The bag-of-words representation treats words (features) as conditionally independent given the class. This is only an approximation, but it is usually good enough for classification.

- It is fast to train and predict, even on large text datasets.

- It works well with bag-of-words and TF-IDF features.

- Despite its simplicity, it often achieves competitive accuracy in real-world NLP tasks.



Q19. Compare SVM and Naïve Bayes for classification tasks:
- Support Vector Machines (SVM) and Naïve Bayes are both popular classification algorithms, but they differ significantly in their approach. Naïve Bayes is a generative, probabilistic model that assumes the features are conditionally independent given the class. In contrast, SVM is a discriminative, non-probabilistic model that finds the hyperplane separating the classes with the largest margin, without making any assumption about feature independence. Naïve Bayes is very fast to train and performs well on high-dimensional, sparse datasets such as text, where the independence assumption holds well enough for classification. However, it tends to degrade when features are highly correlated, and its probability estimates are often poorly calibrated. SVM generally achieves higher accuracy on complex problems and can model non-linear decision boundaries through the kernel trick, though it is more computationally intensive, especially on large training sets. Unlike Naïve Bayes, SVM does not natively produce probability estimates; these require an additional calibration step. Overall, Naïve Bayes is preferred for speed and simplicity, especially in text tasks, while SVM is often chosen when accuracy on complex decision boundaries matters more than training cost.

Q20. How does Laplace Smoothing help in Naïve Bayes?
- Laplace Smoothing (also called add-one smoothing) helps Naïve Bayes by addressing the zero-frequency problem.

If a word (feature) in test data was never seen in training data for a particular class, its probability becomes zero, which can zero out the entire product of probabilities.

Laplace smoothing adds a small constant (usually 1) to all feature counts to avoid zero probabilities.

Formula (for categorical data):

$$P(x_i \mid y) = \frac{\text{count}(x_i, y) + 1}{\text{count}(y) + V}$$


Where V is the total number of possible feature values (e.g., vocabulary size).
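
The formula can be applied by hand to a small count vector. In this sketch the helper name `laplace_smoothed_probs`, the three-word vocabulary, and the counts are all hypothetical; `alpha=1.0` corresponds to the add-one case above.

```python
import numpy as np

def laplace_smoothed_probs(counts, alpha=1.0):
    """P(x_i | y) = (count(x_i, y) + alpha) / (count(y) + alpha * V)."""
    counts = np.asarray(counts, dtype=float)
    V = counts.size                         # number of possible feature values
    return (counts + alpha) / (counts.sum() + alpha * V)

# Hypothetical word counts in the "spam" class for the vocabulary
# ["free", "money", "meeting"]; "meeting" never appears in spam.
spam_counts = [3, 2, 0]

unsmoothed = np.asarray(spam_counts) / np.sum(spam_counts)
smoothed = laplace_smoothed_probs(spam_counts, alpha=1.0)
print("unsmoothed:", unsmoothed)   # the last entry is exactly zero
print("smoothed:  ", smoothed)
```

After smoothing, the unseen word gets a small non-zero probability and the probabilities still sum to 1.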

# **PRACTICAL QUESTIONS**

In [None]:
# Q21. Write a Python program to train an SVM Classifier on the Iris dataset and evaluate accuracy
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM classifier
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Iris SVM Classifier Accuracy: {accuracy:.2f}")



In [None]:
# Q22. Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Linear kernel
linear_svm = SVC(kernel='linear')
linear_svm.fit(X_train, y_train)
linear_accuracy = accuracy_score(y_test, linear_svm.predict(X_test))

# RBF kernel
rbf_svm = SVC(kernel='rbf')
rbf_svm.fit(X_train, y_train)
rbf_accuracy = accuracy_score(y_test, rbf_svm.predict(X_test))

print(f"Wine Dataset - Linear Kernel Accuracy: {linear_accuracy:.2f}")
print(f"Wine Dataset - RBF Kernel Accuracy: {rbf_accuracy:.2f}")


In [None]:
# Q23.  Write a Python program to train an SVM Regressor (SVR) on a housing dataset and evaluate it using Mean Squared Error (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVR model
model = SVR(kernel='rbf')
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Housing SVR Model Mean Squared Error: {mse:.2f}")


In [None]:
# Q24. Write a Python program to train an SVM Classifier with a Polynomial Kernel and visualize the decision boundary
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Generate synthetic 2D dataset
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train SVM with Polynomial Kernel
clf = SVC(kernel='poly', degree=3, C=1.0)
clf.fit(X_scaled, y)

# Plot decision boundary
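def plot_decision_boundary(clf, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    # Predict the class of every grid point and reshape back to the grid
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.2, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=30, edgecolors='k')
    plt.title("SVM with Polynomial Kernel (degree=3)")
    plt.show()

plot_decision_boundary(clf, X_scaled, y)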




In [None]:
# Q25. Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the classifier
gnb.fit(X_train, y_train)

# Predict on test data
y_pred = gnb.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Gaussian Naive Bayes classifier: {accuracy:.4f}")



In [None]:
# Q26. Write a Python program to train a Multinomial Naïve Bayes classifier for text classification using the 20 Newsgroups dataset.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Load the full 20 Newsgroups dataset (train and test subsets combined)
newsgroups = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)

X = newsgroups.data
y = newsgroups.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a pipeline to vectorize, transform with TF-IDF, and train MultinomialNB
model = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

# Train the classifier
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate accuracy and classification report
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}\n")
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=newsgroups.target_names))


In [None]:
# Q27. Write a Python program to train an SVM Classifier with different C values and compare the decision boundaries visually
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC

# Generate a simple 2D dataset for visualization
X, y = datasets.make_blobs(n_samples=100, centers=2, random_state=6, cluster_std=1.5)

# Define function to plot decision boundary
def plot_decision_boundary(clf, X, y, ax, title):
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=30, edgecolors='k')
    ax.set_title(title)
    ax.set_xlim(X[:, 0].min()-1, X[:, 0].max()+1)
    ax.set_ylim(X[:, 1].min()-1, X[:, 1].max()+1)

    # Create grid to evaluate model
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 500),
                         np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 500))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.2, cmap='coolwarm')

# Different C values to try
C_values = [0.1, 1, 10, 100]

fig, axes = plt.subplots(1, len(C_values), figsize=(16, 4))

for i, C in enumerate(C_values):
    clf = SVC(kernel='linear', C=C)
    clf.fit(X, y)
    plot_decision_boundary(clf, X, y, axes[i], f'C = {C}')

plt.tight_layout()
plt.show()


In [None]:
# Q28. Write a Python program to train a Bernoulli Naïve Bayes classifier for binary classification on a dataset with binary features
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example binary feature dataset (toy data)
X = np.array([[1,0,1,0,1],
              [0,1,0,1,0],
              [1,1,1,0,0],
              [0,0,0,1,1],
              [1,0,0,1,0],
              [0,1,1,0,1],
              [1,1,0,0,0],
              [0,0,1,1,1]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # Binary target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

# Predict & evaluate
y_pred = bnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")


In [None]:
# Q29. Write a Python program to apply feature scaling before training an SVM model and compare results with unscaled data
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# For simplicity, do binary classification (class 0 and 1)
X = X[y != 2]
y = y[y != 2]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train SVM without scaling
svm_unscaled = SVC(kernel='rbf', C=1)
svm_unscaled.fit(X_train, y_train)
y_pred_unscaled = svm_unscaled.predict(X_test)
acc_unscaled = accuracy_score(y_test, y_pred_unscaled)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train SVM with scaled data
svm_scaled = SVC(kernel='rbf', C=1)
svm_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = svm_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy without scaling: {acc_unscaled:.2f}")
print(f"Accuracy with scaling: {acc_scaled:.2f}")


In [None]:
# Q30. Write a Python program to train a Gaussian Naïve Bayes model and compare the predictions before and after Laplace Smoothing
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load data
data = load_iris()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train GaussianNB with the default variance smoothing (var_smoothing=1e-9)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_no_smooth = gnb.predict(X_test)
acc_no_smooth = accuracy_score(y_test, y_pred_no_smooth)

# Laplace (additive) smoothing applies to count-based models such as MultinomialNB;
# GaussianNB has no such parameter. Its closest analogue is var_smoothing, which adds
# a fraction of the largest feature variance to every variance for numerical stability.
laplace_gnb = GaussianNB(var_smoothing=1e-2)  # increased smoothing
laplace_gnb.fit(X_train, y_train)
y_pred_smooth = laplace_gnb.predict(X_test)
acc_smooth = accuracy_score(y_test, y_pred_smooth)

print(f"Accuracy without smoothing: {acc_no_smooth:.4f}")
print(f"Accuracy with smoothing (variance smoothing): {acc_smooth:.4f}")


In [None]:
# Q31. Write a Python program to train an SVM Classifier and use GridSearchCV to tune the hyperparameters (C, gamma, kernel)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load data
data = load_iris()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define model
svc = SVC()

# Hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1],
    'kernel': ['linear', 'rbf', 'poly']
}

# GridSearchCV
grid_search = GridSearchCV(svc, param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print("Best parameters found:", grid_search.best_params_)

# Predict and evaluate on test set
best_svc = grid_search.best_estimator_
y_pred = best_svc.predict(X_test)
print(classification_report(y_test, y_pred))



In [None]:
# Q32. Write a Python program to train an SVM Classifier on an imbalanced dataset, apply class weighting, and check whether it improves accuracy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           weights=[0.9, 0.1], flip_y=0, random_state=42)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM without class weighting
svc_no_weight = SVC()
svc_no_weight.fit(X_train, y_train)
y_pred_no_weight = svc_no_weight.predict(X_test)
acc_no_weight = accuracy_score(y_test, y_pred_no_weight)

# Train SVM with class weighting balanced
svc_weighted = SVC(class_weight='balanced')
svc_weighted.fit(X_train, y_train)
y_pred_weighted = svc_weighted.predict(X_test)
acc_weighted = accuracy_score(y_test, y_pred_weighted)

print(f"Accuracy without class weighting: {acc_no_weight:.4f}")
print(f"Accuracy with class weighting: {acc_weighted:.4f}")

print("\nClassification report without weighting:\n", classification_report(y_test, y_pred_no_weight))
print("\nClassification report with weighting:\n", classification_report(y_test, y_pred_weighted))


In [None]:
# Q33. Write a Python program to implement a Naïve Bayes classifier for spam detection using email data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Example dataset (you can replace this with your own email dataset)
data = {
    'text': [
        "Free money now!!!",
        "Hi, how are you?",
        "Win a free ticket",
        "Are we meeting today?",
        "You won a prize",
        "Let's catch up tomorrow",
        "Cheap meds available",
        "Can you call me back?",
        "Urgent! Claim your reward",
        "See you at the party"
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}

df = pd.DataFrame(data)

# Split data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.3, random_state=42)

# Convert text data to numeric vectors
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_counts, y_train)

# Predict and evaluate
y_pred = nb_classifier.predict(X_test_counts)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


In [None]:
# Q34.  Write a Python program to train an SVM Classifier and a Naïve Bayes Classifier on the same dataset and compare their accuracy
from sklearn.svm import SVC

# Using the same dataset and vectorizer as above

# Train SVM classifier
svm_classifier = SVC(kernel='linear', random_state=42)
svm_classifier.fit(X_train_counts, y_train)

# Predict using SVM
y_pred_svm = svm_classifier.predict(X_test_counts)

# Predict using Naive Bayes (already trained above)
y_pred_nb = nb_classifier.predict(X_test_counts)

# Compare accuracies
accuracy_svm = accuracy_score(y_test, y_pred_svm)
accuracy_nb = accuracy_score(y_test, y_pred_nb)

print(f"SVM Accuracy: {accuracy_svm:.4f}")
print(f"Naive Bayes Accuracy: {accuracy_nb:.4f}")


In [None]:
# Q35. Write a Python program to perform feature selection before training a Naïve Bayes classifier and compare results
from sklearn.feature_selection import SelectKBest, chi2

# Feature selection - select top 5 features (adjust k as needed)
selector = SelectKBest(chi2, k=5)
X_train_selected = selector.fit_transform(X_train_counts, y_train)
X_test_selected = selector.transform(X_test_counts)

# Train Naive Bayes with selected features
nb_classifier_fs = MultinomialNB()
nb_classifier_fs.fit(X_train_selected, y_train)

# Predict and evaluate
y_pred_fs = nb_classifier_fs.predict(X_test_selected)

print("Accuracy without feature selection:", accuracy_score(y_test, y_pred))
print("Accuracy with feature selection:", accuracy_score(y_test, y_pred_fs))
print("\nClassification Report with feature selection:\n", classification_report(y_test, y_pred_fs))


In [None]:
# Q36. Write a Python program to train an SVM Classifier using One-vs-Rest (OvR) and One-vs-One (OvO) strategies on the Wine dataset and compare their accuracy
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = datasets.load_wine()
X, y = wine.data, wine.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# OvR SVM
ovr_clf = OneVsRestClassifier(SVC(kernel='linear', random_state=42))
ovr_clf.fit(X_train, y_train)
y_pred_ovr = ovr_clf.predict(X_test)
accuracy_ovr = accuracy_score(y_test, y_pred_ovr)

# OvO SVM
ovo_clf = OneVsOneClassifier(SVC(kernel='linear', random_state=42))
ovo_clf.fit(X_train, y_train)
y_pred_ovo = ovo_clf.predict(X_test)
accuracy_ovo = accuracy_score(y_test, y_pred_ovo)

print(f"One-vs-Rest Accuracy: {accuracy_ovr:.4f}")
print(f"One-vs-One Accuracy: {accuracy_ovo:.4f}")


In [None]:
# Q37. Write a Python program to train an SVM Classifier using Linear, Polynomial, and RBF kernels on the Breast Cancer dataset and compare their accuracy
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

kernels = ['linear', 'poly', 'rbf']

for kernel in kernels:
    clf = SVC(kernel=kernel, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"SVM with {kernel} kernel accuracy: {accuracy:.4f}")


In [None]:
# Q38. Write a Python program to train an SVM Classifier using Stratified K-Fold Cross-Validation and compute the average accuracy
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

# Load Breast Cancer dataset (or any dataset)
cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

accuracies = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    clf = SVC(kernel='rbf', random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

print(f"Average accuracy with Stratified K-Fold CV: {np.mean(accuracies):.4f}")


In [None]:
# Q39. Write a Python program to train a Naïve Bayes classifier using different prior probabilities and compare performance
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Default priors
model_default = GaussianNB()
model_default.fit(X_train, y_train)
y_pred_default = model_default.predict(X_test)
acc_default = accuracy_score(y_test, y_pred_default)

# Custom priors (example: uniform distribution)
custom_priors = [1/3, 1/3, 1/3]
model_custom = GaussianNB(priors=custom_priors)
model_custom.fit(X_train, y_train)
y_pred_custom = model_custom.predict(X_test)
acc_custom = accuracy_score(y_test, y_pred_custom)

print(f"Accuracy with default priors: {acc_default:.4f}")
print(f"Accuracy with custom priors: {acc_custom:.4f}")



In [None]:
# Q40. Write a Python program to perform Recursive Feature Elimination (RFE) before training an SVM Classifier and compare accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Base SVM without RFE
model_base = SVC(kernel='linear')
model_base.fit(X_train, y_train)
acc_base = accuracy_score(y_test, model_base.predict(X_test))

# SVM with RFE (select top 10 features)
rfe = RFE(estimator=SVC(kernel='linear'), n_features_to_select=10)
rfe.fit(X_train, y_train)
model_rfe = SVC(kernel='linear')
model_rfe.fit(X_train[:, rfe.support_], y_train)
acc_rfe = accuracy_score(y_test, model_rfe.predict(X_test[:, rfe.support_]))

print(f"Accuracy without RFE: {acc_base:.4f}")
print(f"Accuracy with RFE (10 features): {acc_rfe:.4f}")


In [None]:
# Q41. Write a Python program to train an SVM Classifier and evaluate its performance using Precision, Recall, and F1-Score instead of accuracy
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
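from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Load dataset
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM; the RBF kernel is sensitive to feature scale, so standardize first
model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
model.fit(X_train, y_train)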
y_pred = model.predict(X_test)

# Evaluate
report = classification_report(y_test, y_pred)
print("Classification Report:\n")
print(report)
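The per-class numbers in the report come from counts of true positives, false positives, and false negatives. The sketch below recomputes them for one class on a small set of made-up labels (the toy data and the helper name `precision_recall_f1` are assumptions for illustration only).

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels for a three-class problem
toy_true = [0, 0, 1, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 2, 2, 0]

toy_p, toy_r, toy_f1 = precision_recall_f1(toy_true, toy_pred, positive=1)
print(f"Class 1 -> precision: {toy_p:.3f}, recall: {toy_r:.3f}, F1: {toy_f1:.3f}")
```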


In [None]:
# Q42. Write a Python program to train a Naïve Bayes Classifier and evaluate its performance using Log Loss (Cross-Entropy Loss)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import log_loss

# Load dataset
X, y = load_iris(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Naïve Bayes
model = GaussianNB()
model.fit(X_train, y_train)

# Predict probabilities
y_proba = model.predict_proba(X_test)

# Calculate log loss
loss = log_loss(y_test, y_proba)
print("Log Loss (Cross-Entropy):", loss)
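Log loss is the average negative log of the probability assigned to the true class. The sketch below computes it by hand on made-up probabilities, assuming integer labels 0..K-1 that match the column order of the probability rows; the clipping constant `eps` is an assumption used only to avoid `log(0)`.

```python
import math

def manual_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability of the true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        p = min(max(row[label], eps), 1 - eps)  # keep p away from 0 and 1
        total -= math.log(p)
    return total / len(y_true)

# Hypothetical predicted probabilities for three samples
toy_labels = [0, 1, 2]
toy_proba = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
]

toy_loss = manual_log_loss(toy_labels, toy_proba)
print(f"Manual log loss: {toy_loss:.4f}")
```

Lower values are better: a model that assigns probability 1 to every true class would score close to 0, and confident wrong predictions are penalized heavily.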


In [None]:
# Q43. Write a Python program to train an SVM Classifier and visualize the Confusion Matrix using seaborn
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Load dataset
X, y = load_wine(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM
model = SVC(kernel='rbf')
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


In [None]:
# Q44. Write a Python program to train an SVM Regressor (SVR) and evaluate its performance using Mean Absolute Error (MAE) instead of MSE
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train SVR; the RBF kernel is sensitive to feature scale, so standardize first
model = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate using MAE
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error (MAE):", mae)
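MAE and MSE react differently to large errors: MAE grows linearly with an error, MSE quadratically. The sketch below compares the two on made-up targets that include one large miss (the values are illustrative assumptions, not housing data).

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_sq_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions; the last prediction is off by 4.0
toy_targets = [2.0, 3.0, 4.0, 5.0]
toy_predictions = [2.5, 2.5, 4.0, 9.0]

toy_mae = mean_abs_error(toy_targets, toy_predictions)  # 1.25
toy_mse = mean_sq_error(toy_targets, toy_predictions)   # 4.125
print(f"MAE: {toy_mae}, MSE: {toy_mse}")
```

The single outlier accounts for nearly all of the MSE, while the MAE stays in the units of the target and is less dominated by it.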


In [None]:
# Q45. Write a Python program to train a Naïve Bayes classifier and evaluate its performance using the ROC-AUC score
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Naïve Bayes model
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Predict probabilities
y_probs = nb_model.predict_proba(X_test)[:, 1]

# Evaluate using ROC-AUC score
roc_auc = roc_auc_score(y_test, y_probs)
print(f"ROC-AUC Score: {roc_auc:.4f}")
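ROC-AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive/negative pair on made-up scores (the helper name `pairwise_auc` and the data are assumptions; ties count as half a win).

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical binary labels and predicted scores
toy_auc_labels = [0, 0, 1, 1]
toy_auc_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_auc_labels, toy_auc_scores)  # 3 of 4 pairs correct -> 0.75
print(f"Pairwise ROC-AUC: {toy_auc:.2f}")
```

This quadratic loop is only meant to illustrate the definition; it is not how one would compute the score on a large test set.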


In [None]:
# Q46. Write a Python program to train an SVM Classifier and visualize the Precision-Recall Curve
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve, average_precision_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM with probability estimates enabled
svm_model = SVC(probability=True, kernel='rbf')
svm_model.fit(X_train, y_train)

# Predict probabilities
y_scores = svm_model.predict_proba(X_test)[:, 1]

# Compute Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_scores)
avg_precision = average_precision_score(y_test, y_scores)

# Plot
plt.figure(figsize=(8, 5))
plt.plot(recall, precision, label=f'AP = {avg_precision:.2f}')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (SVM)")
plt.legend()
plt.grid()
plt.show()
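The AP value in the legend summarizes the curve as a weighted mean of precisions, where each weight is the increase in recall at that threshold. The sketch below reproduces that idea on made-up scores, assuming no two samples share the same score (the helper name `average_precision` and the data are illustrative).

```python
def average_precision(labels, scores):
    """Mean of the precision values taken at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / n_pos

# Hypothetical binary labels and predicted scores
toy_ap_labels = [0, 0, 1, 1]
toy_ap_scores = [0.1, 0.4, 0.35, 0.8]

toy_ap = average_precision(toy_ap_labels, toy_ap_scores)
print(f"Average precision: {toy_ap:.4f}")
```

Here the two positives sit at ranks 1 and 3, so the precisions averaged are 1/1 and 2/3.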
