# Practical Questions:

#### Question 6

Write a Python program to:

- Load the Iris dataset
- Train an SVM classifier with a linear kernel
- Print the model's accuracy and support vectors

In [2]:
# Step 1: Import required libraries
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 2: Load the Iris dataset
iris = datasets.load_iris()

# Step 3: Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target  # Add the target column

# Step 4: Separate the features (X) and the target (y)
X = df.drop('target', axis=1)  # All columns except target
y = df['target']               # Only the target column

# Step 5: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Train the SVM model with a linear kernel
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Step 7: Predict on the test set
y_pred = model.predict(X_test)

# Step 8: Print the accuracy and support vectors
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model:", accuracy)
print("Support Vectors:\n", model.support_vectors_)


Accuracy of the model: 1.0
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


#### Question 7

Write a Python program to:

- Load the Breast Cancer dataset
- Train a Gaussian Naïve Bayes model
- Print its classification report, including precision, recall, and F1-score

In [5]:
# Step 1: Import the required libraries
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Step 2: Load the Breast Cancer dataset
cancer = datasets.load_breast_cancer()

# Step 3: Convert to a pandas DataFrame
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Step 4: Separate the features and target
X = df.drop('target', axis=1)
y = df['target']

# Step 5: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Train the Gaussian Naïve Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Step 7: Make predictions
y_pred = model.predict(X_test)

# Step 8: Print the classification report
report = classification_report(y_test, y_pred, target_names=cancer.target_names)
print("Classification Report:\n")
print(report)


Classification Report:

              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



#### Question 8

Write a Python program to:

- Train an SVM classifier on the Wine dataset using GridSearchCV to find the best C and gamma
- Print the best hyperparameters and accuracy

In [6]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 2: Load the Wine dataset
wine = datasets.load_wine()

# Step 3: Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df['target'] = wine.target

# Step 4: Split the features and target
X = df.drop('target', axis=1)
y = df['target']

# Step 5: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Create an SVM model and set up parameter grid
svm = SVC()

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# Step 7: Use GridSearchCV to find best hyperparameters
grid = GridSearchCV(svm, param_grid, cv=5)
grid.fit(X_train, y_train)

# Step 8: Make predictions and check accuracy
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 9: Print the results
print("Best Hyperparameters found by GridSearchCV:")
print(grid.best_params_)
print(f"\nAccuracy on test set: {accuracy:.2f}")


Best Hyperparameters found by GridSearchCV:
{'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}

Accuracy on test set: 0.83
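
The RBF kernel is computed from distances between samples, so features measured on large numeric scales can dominate it, and the Wine features span very different ranges. One variation worth trying is to standardize the features inside the grid search. The sketch below is not part of the original solution: the step names `scale` and `svc` and the variable names are illustrative choices, and the grid values are the ones used above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_w, y_w, test_size=0.2, random_state=42)

# Putting the scaler inside the pipeline means it is re-fitted on the
# training portion of every CV fold, so validation folds never leak into it.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
scaled_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid_scaled = GridSearchCV(pipe, scaled_grid, cv=5)
grid_scaled.fit(X_tr, y_tr)

scaled_accuracy = grid_scaled.score(X_te, y_te)  # accuracy of the refitted best pipeline
print("Best hyperparameters with scaling:", grid_scaled.best_params_)
print(f"Accuracy on test set with scaling: {scaled_accuracy:.2f}")
```

Comparing this accuracy with the unscaled run shows how much of the result depends on feature scale rather than on the choice of C and gamma.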


#### Question 10

Imagine you’re working as a data scientist for a company that handles email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

- Text with diverse vocabulary
- Potential class imbalance (far more legitimate emails than spam)
- Some incomplete or missing data

Explain the approach you would take to:

- Preprocess the data (e.g. text vectorization, handling missing data)
- Choose and justify an appropriate model (SVM vs. Naïve Bayes)
- Address class imbalance
- Evaluate the performance of your solution with suitable metrics

Also explain the business impact of your solution.

#### Objective:

To build a machine learning model that automatically classifies emails as Spam or Not Spam based on their text content. The task includes:

- Preprocessing the text data
- Choosing the right model
- Dealing with class imbalance
- Evaluating performance
- Understanding business impact

#### Step-by-Step Explanation with Python Code:

## Step 1: Importing Required Libraries

In [7]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score

#### Explanation:

- `fetch_20newsgroups` provides a text dataset that stands in for real emails.
- `TfidfVectorizer` transforms raw text into numeric features using TF-IDF, which weights each word by how informative it is.
- `train_test_split` divides the data into a training set (to fit the model) and a test set (to evaluate it).
- `MultinomialNB` is a Naive Bayes model that works well for text classification.
- `classification_report` and `roc_auc_score` are metrics that measure how well the model performs.

## Step 2: Loading the Dataset

In [8]:
data = fetch_20newsgroups(subset='all', categories=['sci.space', 'rec.sport.hockey'], shuffle=True)
X_raw = data.data
y_raw = data.target

#### Explanation:

- We pick two categories to simulate a binary classification problem:
  - rec.sport.hockey = Not Spam
  - sci.space = Spam
- `X_raw` contains the text data (the posts standing in for emails).
- `y_raw` contains the target labels. scikit-learn numbers them in alphabetical order of category name, regardless of the order passed to `categories` (check `data.target_names` to confirm):
  - 0 for rec.sport.hockey
  - 1 for sci.space
- The labels come with the dataset, much as in real projects where the data team hands you a labeled dataset.

## Step 3: Text Preprocessing with TF-IDF

In [9]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=2000)
X = vectorizer.fit_transform(X_raw)


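#### Explanation:

- Raw text must be converted into numbers, since ML models operate on numeric features rather than plain English.
- TF-IDF combines two quantities:
  - Term Frequency (TF): how often a word appears in a document
  - Inverse Document Frequency (IDF): how rare that word is across all documents
- `max_features=2000` keeps only the 2000 terms with the highest total frequency across the corpus, after English stop words are removed.
- The result is a sparse matrix `X` in which each row represents one email and each column holds one word’s TF-IDF score.
- Here the vectorizer is fitted on the full dataset before splitting, so document-frequency statistics from the test set influence the features. In production, fit the vectorizer on the training set only.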
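
To make the TF-IDF weighting concrete, here is a minimal sketch on a made-up three-document corpus (the documents and the names `docs`, `demo_vectorizer`, and `X_demo` are illustrative, not part of the solution above). It shows the shape of the resulting matrix and that a rarer word receives a higher weight than a more common one within the same document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus, for illustration only
docs = [
    "free prize inside",
    "meeting notes inside",
    "free free offer",
]

demo_vectorizer = TfidfVectorizer()
X_demo = demo_vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index
col = demo_vectorizer.vocabulary_

print(X_demo.shape)  # (3, 6): 3 documents, 6 distinct terms

# Every term in the first document occurs once, so the weights differ only by IDF:
# "prize" appears in one document, "free" in two, so "prize" scores higher.
print(X_demo[0, col["prize"]] > X_demo[0, col["free"]])  # True
```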

## Step 4: Splitting the Dataset

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y_raw, test_size=0.2, random_state=42)


#### Explanation:

- 80% of the data is used to train the model.
- The remaining 20% is held out to test how well the model performs on unseen data.
- `random_state=42` fixes the shuffle so the same split is produced on every run (for reproducibility).

## Step 5: Training the Naive Bayes Model

In [11]:
model = MultinomialNB()
model.fit(X_train, y_train)

#### Explanation:

- We initialize a Multinomial Naive Bayes model, which is well suited to text data such as emails.
- It assumes the words (features) are conditionally independent given the class, an assumption that works surprisingly well in practice.
- The model learns which words are associated with spam and not spam by estimating per-class word probabilities.
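
The question also asks about missing data and class imbalance, which the balanced 20 Newsgroups stand-in does not exhibit. The following is a minimal sketch of one possible way to handle both, under stated assumptions: the toy inbox, the column names `body` and `is_spam`, and the choice of `ComplementNB` (a Naive Bayes variant that scikit-learn documents as particularly suited to imbalanced data sets) are all illustrative and not part of the solution above. Missing email bodies are replaced with empty strings before vectorization.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy inbox: six legitimate emails (one with a missing body), two spam
emails = pd.DataFrame({
    "body": [
        "meeting agenda for monday",
        "please review the attached report",
        None,  # missing body
        "lunch tomorrow with the team",
        "quarterly budget review notes",
        "project status update attached",
        "win a free prize now",
        "free money click now",
    ],
    "is_spam": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Handle missing text: an empty string becomes an all-zero TF-IDF row
texts = emails["body"].fillna("")

# The pipeline fits the vectorizer and the classifier together on the training text
spam_clf = make_pipeline(TfidfVectorizer(stop_words="english"), ComplementNB())
spam_clf.fit(texts, emails["is_spam"])

new_emails = ["claim your free prize now", "agenda for the budget meeting"]
pred = spam_clf.predict(new_emails)
print(pred)  # [1 0]: first flagged as spam, second as legitimate
```

Other options for imbalance, such as resampling or a linear SVM with `class_weight="balanced"`, follow the same pipeline shape. Which one works best depends on the real data.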

## Step 6: Predicting and Evaluating

In [12]:
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

#### Explanation:

- `predict()` returns the predicted class (0 or 1) for each email.
- `predict_proba()` returns the predicted probability of each class.
- We take column `[:, 1]` to get the probability of being spam (class 1), which is the score ROC-AUC needs.

## Step 7: Print the Classification Report

In [13]:
print("Classification Report:\n")
print(classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))


Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       202
           1       0.99      1.00      1.00       196

    accuracy                           1.00       398
   macro avg       1.00      1.00      1.00       398
weighted avg       1.00      1.00      1.00       398

ROC-AUC Score: 0.9999242271165892


#### Explanation:

- `classification_report` shows:
  - Precision: of the emails predicted as spam, what fraction were actually spam?
  - Recall: of the actual spam emails, what fraction did we catch?
  - F1-score: the harmonic mean of precision and recall
- `roc_auc_score` measures how well the model ranks spam above legitimate email: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A score close to 1.0 is excellent.
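
As a worked illustration of these metrics, the sketch below uses eight made-up labels and scores (all values and variable names are hypothetical) and checks the ROC-AUC against its ranking interpretation by counting positive/negative pairs directly.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical ground truth (1 = spam) and model scores
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_score_demo = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.05, 0.45]

# Threshold the scores at 0.5 to get hard predictions: [1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [int(s >= 0.5) for s in y_score_demo]

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP) = 2 / 2 = 1.0
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true_demo, y_pred_demo)                # 2 * 1.0 * (2/3) / (1.0 + 2/3) = 0.8

# ROC-AUC as a ranking measure: fraction of (positive, negative) pairs
# in which the positive example has the higher score.
pos = [s for s, t in zip(y_score_demo, y_true_demo) if t == 1]
neg = [s for s, t in zip(y_score_demo, y_true_demo) if t == 0]
pairwise_auc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))  # 13 / 15

auc = roc_auc_score(y_true_demo, y_score_demo)
print(precision, recall, f1, auc, pairwise_auc)
```

Note that precision and recall depend on the 0.5 threshold, while ROC-AUC depends only on how the scores are ordered.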

## Business Impact

| Area             | Benefit                                                           |
| ---------------- | ----------------------------------------------------------------- |
| **Time saving**  | Reduces manual email checking effort by classifying automatically |
| **Security**     | Helps detect spam, phishing, or scam emails — reducing risk       |
| **Productivity** | Important emails stay visible while distractions are filtered out |
| **Scalability**  | Naive Bayes is fast and can scale to millions of emails easily    |

## Final Thoughts
 
- We used a real **ML pipeline: loading → preprocessing → modeling → evaluating**.
- We chose Naive Bayes because it's efficient, text-friendly, and works well without heavy tuning.
- We used TF-IDF to turn messy raw text into a numeric form that models understand.
- We measured success using precision, recall, F1-score, and ROC-AUC.

#### Question 9

Write a Python program to:

- Train a Naïve Bayes classifier on a synthetic text dataset (e.g., using sklearn.datasets.fetch_20newsgroups)
- Print the model's ROC-AUC score for its predictions

In [14]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Step 1: Load a two-category subset of 20 Newsgroups as a binary text dataset
data = fetch_20newsgroups(subset='all', categories=['rec.autos', 'sci.electronics'], shuffle=True)
X_text = data.data
y = data.target

# Step 2: Convert text into numeric features using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=2000)
X = vectorizer.fit_transform(X_text)

# Step 3: Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 5: Predict probabilities and calculate ROC-AUC
y_proba = model.predict_proba(X_test)[:, 1]
roc_score = roc_auc_score(y_test, y_proba)

print("ROC-AUC Score:", roc_score)


ROC-AUC Score: 0.9939548284200237


# Theoretical Questions:

### Question 1: What is a Support Vector Machine (SVM), and how does it work?

**Answer:**

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. Its main goal is to find the best boundary (called a **hyperplane**) that separates different classes in the data.

SVM works by:
- Optionally mapping the input data into a higher-dimensional space with a kernel, when the classes are not linearly separable in the original space
- Finding the hyperplane that **maximizes the margin**, the distance between the hyperplane and the nearest points of each class (called **support vectors**)
- Making predictions by checking which side of the hyperplane a new point lies on

It’s especially useful in binary classification problems and performs well with clear class separation.
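
The steps above can be sketched on a tiny, made-up 2-D dataset (the points and variable names are illustrative). With two points per class along a diagonal, the maximum-margin hyperplane sits halfway between the two closest points, which become the support vectors, and the sign of `decision_function` tells which side a new point falls on.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])

svm_toy = SVC(kernel="linear", C=1000).fit(X_toy, y_toy)

# The closest points of each class define the margin
print(svm_toy.support_vectors_)  # [[1. 1.] [3. 3.]]

# Hyperplane w . x + b = 0; here w is about [0.5, 0.5] and b about -2,
# i.e. the line x1 + x2 = 4, halfway between the support vectors.
print(svm_toy.coef_, svm_toy.intercept_)

new_points = np.array([[2.5, 2.5], [0.5, 1.0]])
scores = svm_toy.decision_function(new_points)  # about [0.5, -1.25]
toy_pred = svm_toy.predict(new_points)          # positive score -> class 1, negative -> class 0
print(scores, toy_pred)
```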

---

### Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

**Answer:**

- **Hard Margin SVM** assumes the data is perfectly linearly separable and draws a boundary that allows no training errors. It is strict and does not tolerate noise or outliers.

- **Soft Margin SVM** allows some misclassification in order to handle noisy or overlapping data. It introduces a **penalty parameter (C)** to control the trade-off between margin width and misclassification: a large C penalizes errors heavily and yields a narrower margin, while a small C tolerates more errors in exchange for a wider margin.

**Key Difference:**  
Hard margin = no error tolerance; Soft margin = balances accuracy and flexibility (real-world choice).
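
One way to see the effect of C is to count support vectors on overlapping data. The sketch below uses synthetic blobs (the centers, spread, and C values are arbitrary choices for illustration). A small C gives a wide, tolerant margin that many points fall inside, so more of them become support vectors. A large C narrows the margin and typically leaves fewer.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters (hypothetical data)
X_blob, y_blob = make_blobs(
    n_samples=200, centers=[[-2, 0], [2, 0]], cluster_std=1.2, random_state=0
)

tolerant = SVC(kernel="linear", C=0.01).fit(X_blob, y_blob)  # wide margin, many violations allowed
strict = SVC(kernel="linear", C=100).fit(X_blob, y_blob)     # narrow margin, errors penalized heavily

# n_support_ holds the number of support vectors per class
n_tolerant = int(tolerant.n_support_.sum())
n_strict = int(strict.n_support_.sum())
print("Support vectors with C=0.01:", n_tolerant)
print("Support vectors with C=100: ", n_strict)
```

A true hard margin is not available here because the clusters overlap. A very large C on separable data approximates it.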

---

### Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

**Answer:**

The **Kernel Trick** allows SVM to work in high-dimensional spaces without explicitly transforming the data. It computes the inner product of transformed features using a **kernel function**, saving time and resources.

**Example:**  
The **RBF (Radial Basis Function)** kernel is commonly used when the relationship between features is non-linear. It helps SVM classify data that is **not linearly separable** in its original space.
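
The identity behind the kernel trick can be checked numerically. The RBF kernel corresponds to an infinite-dimensional feature space, so this sketch uses a degree-2 polynomial kernel instead, whose feature map is small enough to write out (the function names `phi` and `poly2_kernel` are illustrative). Both routes give the same inner product, but the kernel never builds the transformed vectors.

```python
import numpy as np

def phi(v):
    """Explicit feature map for k(x, z) = (x . z)^2 on 2-D inputs."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Same quantity computed directly in the original 2-D space."""
    return float(np.dot(x, z) ** 2)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = float(np.dot(phi(x), phi(z)))  # transform first, then inner product
implicit = poly2_kernel(x, z)             # kernel only: (1*3 + 2*4)^2 = 11^2

print(explicit, implicit)  # both equal 121 (up to floating-point rounding)
```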

---

### Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

**Answer:**

Naïve Bayes is a **probabilistic classifier** based on **Bayes’ Theorem**, which calculates the probability of a class given certain features.

It is called “naïve” because it **assumes that all features are independent** of each other — which is rarely true in real data. Despite this unrealistic assumption, it performs very well in many scenarios like **text classification**.
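
Written out, for a class $y$ and features $x_1, \dots, x_n$ (notation introduced here for illustration), Bayes' Theorem and the independence assumption give:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
\;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The "naïve" step is the replacement of the joint likelihood $P(x_1, \dots, x_n \mid y)$ with the product of per-feature terms. The denominator does not depend on $y$, so it can be dropped when picking the most probable class.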

---

### Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

**Answer:**

- **Gaussian Naive Bayes**: Used when features are continuous and assumed to follow a normal distribution within each class. Example: height, weight, age.
  
- **Multinomial Naive Bayes**: Used for **count-based** features like word frequencies in text data. Example: spam classification.
  
- **Bernoulli Naive Bayes**: Used for **binary/boolean** features — when a feature is either present or not (1 or 0). Example: sentiment analysis with presence/absence of words.

Each variant is chosen based on the **type of data** you're working with.
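
A minimal sketch of matching the variant to the data type, using tiny made-up datasets (all values, and the feature meanings given in the comments, are hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

labels = np.array([0, 0, 1, 1])

# Gaussian: continuous measurements (hypothetical heights in cm)
heights = np.array([[150.0], [155.0], [180.0], [185.0]])
gnb = GaussianNB().fit(heights, labels)
gnb_pred = gnb.predict([[152.0], [183.0]])
print(gnb_pred)  # [0 1]

# Multinomial: non-negative counts (hypothetical counts of two words per document)
counts = np.array([[0, 3], [0, 2], [3, 0], [2, 0]])
mnb = MultinomialNB().fit(counts, labels)
mnb_pred = mnb.predict([[0, 4], [5, 0]])
print(mnb_pred)  # [0 1]

# Bernoulli: binary presence/absence of each word
presence = (counts > 0).astype(int)
bnb = BernoulliNB().fit(presence, labels)
bnb_pred = bnb.predict([[0, 1], [1, 0]])
print(bnb_pred)  # [0 1]
```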

---