# SVM ASSIGNMENT

QUES.1

ANS; A Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression, though it is best known for classification.

🔹 How It Works

Input Data: You have training data with features and labels.
Example: Two classes, say red and blue points, in a 2D feature space.

Find Hyperplane: SVM looks for a hyperplane (a line in 2D) that separates the two classes.

If several hyperplanes separate them, SVM chooses the one with the maximum margin, that is, the largest distance between the hyperplane and the nearest training points of each class.

A wide margin tends to generalize better, which makes the model less prone to overfitting.

Support Vectors: The training points closest to the hyperplane are the support vectors, and they alone determine the boundary. Points farther away can be moved or removed without changing it, as long as they stay outside the margin.
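The steps above can be sketched on a tiny 2D example. This is a minimal illustration, not part of the assignment: the six points are made up, and the very large `C` is an assumption used to approximate a hard margin on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin when the data is separable
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]

# The margin width is 2 / ||w||
margin_width = 2.0 / np.linalg.norm(w)

print("w:", w, "b:", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Only the points listed in `support_vectors_` define the boundary; the remaining points could be removed without changing `w` or `b`.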



QUES.2

ANS;            
| Aspect                | Hard Margin SVM       | Soft Margin SVM                         |
| --------------------- | --------------------- | --------------------------------------- |
| **Data requirement**  | Perfectly separable   | Can handle overlap/noise                |
| **Misclassification** | Not allowed           | Allowed (slack variables, penalty `C`)  |
| **Robustness**        | Sensitive to outliers | More tolerant of noise and outliers     |
| **Use case**          | Clean datasets        | Real-world datasets (common)            |
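The difference in the table can be seen by varying the penalty `C` in scikit-learn's `SVC`. This is a rough sketch under assumptions: the overlapping data comes from `make_blobs` with arbitrary settings, and a large `C` only approximates hard-margin behaviour, since a true hard margin does not exist for overlapping classes.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping data: two noisy clusters
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

# Small C: wide, soft margin that tolerates many violations
soft = SVC(kernel="linear", C=0.01).fit(X, y)

# Large C: violations are penalized heavily, closer to a hard margin
hard_like = SVC(kernel="linear", C=100).fit(X, y)

n_sv_soft = soft.support_vectors_.shape[0]
n_sv_hard_like = hard_like.support_vectors_.shape[0]

print("Support vectors with C=0.01:", n_sv_soft)
print("Support vectors with C=100: ", n_sv_hard_like)
```

A smaller `C` widens the margin, so more points fall inside it and become support vectors.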


QUES.3

ANS; 🔹 What is the Kernel Trick?

Many datasets are not linearly separable in their original feature space.

Instead of explicitly transforming data into higher dimensions (which is expensive), the Kernel Trick lets us compute inner products in higher-dimensional space without actually performing the transformation.
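A small numeric check makes this concrete. The sketch below uses a degree-2 polynomial kernel rather than RBF, because its feature map is small enough to write out; the function names `phi` and `poly_kernel` are illustrative and not taken from any library.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: (u . v)^2."""
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

# Inner product computed after mapping both points into 3D
explicit = np.dot(phi(u), phi(v))

# Same quantity computed directly in the original 2D space
implicit = poly_kernel(u, v)

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both values agree, so the kernel gives the higher-dimensional inner product without ever building `phi(u)` or `phi(v)`.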

🔹 Example of a Kernel: Radial Basis Function (RBF) Kernel

Formula:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

Here, $\gamma$ controls the spread of the kernel.
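As a quick check of the formula, the RBF value can be computed by hand and compared with scikit-learn's `rbf_kernel`. The two points and the `gamma` values below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical points, stored as 1-row matrices for rbf_kernel
x_i = np.array([[1.0, 2.0]])
x_j = np.array([[2.0, 4.0]])
gamma = 0.5  # illustrative value

# Direct evaluation of exp(-gamma * ||x_i - x_j||^2)
manual = np.exp(-gamma * np.sum((x_i - x_j) ** 2))

# Same value from scikit-learn
library = rbf_kernel(x_i, x_j, gamma=gamma)[0, 0]

# A larger gamma shrinks the similarity at the same distance
narrow = rbf_kernel(x_i, x_j, gamma=5.0)[0, 0]

print("Manual: ", manual)
print("sklearn:", library)
print("gamma=5:", narrow)
```

A larger `gamma` makes the kernel fall off faster with distance, so each training point influences a smaller neighbourhood.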

🔹 Use Case:

Works well when the boundary between classes is curved or circular.

Example: Imagine data points arranged in concentric circles (one class inside, another outside).

A linear SVM cannot separate them.

But with the RBF kernel, the SVM implicitly works in a higher-dimensional space where the classes can be separated by a hyperplane. In the original 2D space, that hyperplane appears as a roughly circular boundary.
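The concentric-circles case can be reproduced with scikit-learn's `make_circles`. This is a sketch under assumed settings (`factor`, `noise`, and the random seeds are arbitrary), meant only to contrast the two kernels.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One class forms an inner circle, the other an outer ring
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Linear kernel: no straight line can separate the rings
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# RBF kernel: can learn a circular boundary
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:   ", rbf_acc)
```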



QUES.4

ANS; 🔹 What is a Naïve Bayes Classifier?

Naïve Bayes is a probabilistic supervised learning algorithm based on Bayes’ Theorem.

It is used for classification tasks.

The key idea: it combines each class's prior probability with the likelihood of the observed features to estimate the posterior probability of every class, then assigns the class with the highest posterior.


🔹 Why is it called “Naïve”?

Because it makes a strong assumption:

All features are independent of each other given the class label.

In reality, features are often correlated (e.g., in spam detection, "discount" and "offer" often appear together).

But this “naïve” independence assumption makes the model much simpler and computationally efficient.
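The independence assumption means the class score is just the prior multiplied by one likelihood per feature. The sketch below works this out by hand for the spam example; all probabilities are made-up numbers chosen for illustration, not estimates from real data.

```python
# Hypothetical probabilities, chosen only for illustration
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"discount": 0.6, "offer": 0.5},
    "ham": {"discount": 0.05, "offer": 0.1},
}

# Words observed in the email to classify
observed_words = ["discount", "offer"]

# Unnormalized posterior: P(class) * product of P(word | class)
scores = {}
for label, prior in priors.items():
    score = prior
    for word in observed_words:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize so the posteriors sum to 1
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}

# Pick the class with the highest posterior
predicted = max(posteriors, key=posteriors.get)

print("Posteriors:", posteriors)
print("Predicted class:", predicted)
```

Because each word contributes its own factor, the model never needs the joint probability of "discount" and "offer" appearing together, which is what keeps it simple and fast.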

QUES.5

ANS;
| Variant            | Feature Type                               | Example Use Case                                     |
| ------------------ | ------------------------------------------ | ---------------------------------------------------- |
| **Gaussian NB**    | Continuous (real values, assumed Gaussian) | Medical diagnosis, Iris dataset                      |
| **Multinomial NB** | Discrete counts (integers)                 | Text classification (spam detection, topic labeling) |
| **Bernoulli NB**   | Binary (0/1, yes/no)                       | Sentiment analysis, presence/absence of words        |
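The three variants in the table map onto three scikit-learn classes. The sketch below fits each one to a tiny, made-up dataset of the matching feature type; the arrays are hypothetical and far too small for real use.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.9]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Word present / absent -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

# Fit each variant on its matching feature type and predict the training rows
predictions = {}
for name, (model, X) in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, predictions[name])
```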


QUES.6

ANS;

In [1]:
# Import libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = datasets.load_iris()
X = iris.data   # features
y = iris.target # labels

# 2. Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train an SVM Classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# 4. Make predictions
y_pred = svm_model.predict(X_test)

# 5. Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6. Print support vectors
print("\nSupport Vectors:\n", svm_model.support_vectors_)


Model Accuracy: 1.0

Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


QUES.7

ANS;

In [2]:
# Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data   # features
y = data.target # labels

# 2. Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# 4. Make predictions
y_pred = gnb.predict(X_test)

# 5. Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



QUES.8

ANS;

In [3]:
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']  # Using RBF kernel
}

# 4. Apply GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=0, cv=5)
grid.fit(X_train, y_train)

# 5. Best hyperparameters
print("Best Hyperparameters:", grid.best_params_)

# 6. Evaluate accuracy on test set
y_pred = grid.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", accuracy)


Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Test Set Accuracy: 0.7777777777777778
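The RBF kernel depends on distances between samples, and the Wine features are on very different scales, so the search above may be limited by the lack of feature scaling. A possible variation, sketched here as an assumption rather than as part of the original answer, standardizes the features inside a pipeline before running the same grid search.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same data and split as the cell above
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)

# Scaling happens inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names the SVC step "svc", hence the "svc__" prefix
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

grid_scaled = GridSearchCV(pipe, param_grid, cv=5)
grid_scaled.fit(X_train, y_train)

scaled_accuracy = grid_scaled.score(X_test, y_test)
print("Best Hyperparameters (scaled):", grid_scaled.best_params_)
print("Test Set Accuracy (scaled):", scaled_accuracy)
```

Comparing this accuracy with the unscaled result shows how much of the gap comes from feature scale rather than from the choice of `C` and `gamma`.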


QUES.9

ANS;

In [4]:
# Import libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# 1. Load the text dataset (using only 2 categories for binary classification)
categories = ['rec.sport.baseball', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=categories)

X = data.data
y = data.target

# 2. Split the raw documents into train and test sets
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Convert text to numeric features using TF-IDF
#    (fit on the training documents only, so no test-set information leaks in)
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# 4. Train a Naïve Bayes Classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

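# 5. Predict probabilities for the positive class and print the ROC-AUC score
y_prob = nb_model.predict_proba(X_test)[:, 1]
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))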


QUES.10

ANS;
 | Step                         | Technique / Method                                                                                                                                                                                                          | Notes                                            |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------ |
| **Data Preprocessing**       | - Handle missing values (empty string or placeholder) <br> - Clean text (lowercase, remove stopwords, punctuation) <br> - Lemmatization/Stemming <br> - TF-IDF vectorization (or embeddings)                                | Ensures clean, structured input for the model    |
| **Model Choice**             | - **Multinomial Naïve Bayes** (fast, baseline, good with text counts) <br> - **SVM (linear kernel)** (robust, better accuracy, but computationally heavier)                                                                 | Start with NB → try SVM for improvement          |
| **Handling Class Imbalance** | - Oversampling spam (SMOTE) <br> - Undersampling non-spam <br> - Class weights (`class_weight="balanced"`) <br> - Threshold tuning                                                                                          | Prevents bias toward “Not Spam” class            |
| **Evaluation Metrics**       | - **Precision** → avoid flagging legit emails as spam <br> - **Recall** → catch as much spam as possible <br> - **F1-score** → balance precision & recall <br> - **ROC-AUC** → overall separability <br> - Confusion Matrix | Accuracy alone is misleading due to imbalance    |
| **Business Impact**          | - Reduce phishing & malware risks <br> - Save employee time (less spam filtering manually) <br> - Ensure legit customer emails are delivered <br> - Build customer trust <br> - Retrain periodically as spam evolves        | Directly improves security, productivity & trust |
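The preprocessing, model choice, and imbalance handling from the table can be combined in one scikit-learn pipeline. The sketch below is only an outline under assumptions: the eight emails and their labels are invented, the class ratio is a stand-in for real imbalance, and a linear-kernel `SVC` with `class_weight="balanced"` is one of several options listed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical, imbalanced toy data: 0 = not spam, 1 = spam
emails = [
    "Meeting moved to 3pm, see agenda attached",
    "Please review the quarterly report",
    "Lunch tomorrow with the project team?",
    "Invoice for last month's consulting work",
    "Can you send the updated slides",
    "Reminder: submit your timesheet by Friday",
    "Win a free prize, click this offer now",
    "Exclusive discount offer, claim your free gift",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# Replace missing bodies with an empty string before vectorizing
emails = [text if text is not None else "" for text in emails]

# TF-IDF handles lowercasing and stopword removal;
# class_weight="balanced" up-weights the rarer spam class
spam_clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    SVC(kernel="linear", class_weight="balanced"),
)
spam_clf.fit(emails, labels)

new_emails = [
    "Claim your free discount offer now",
    "Agenda for the project meeting",
]
predictions = spam_clf.predict(new_emails)
print(list(predictions))
```

On real data, the model would be evaluated on a held-out set with precision, recall, F1-score, and ROC-AUC rather than by inspecting a couple of predictions.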


# THANK YOU ASSIGNMENT COMPLETED..