In [None]:
# What is a Support Vector Machine (SVM), and how does it work?

'''
->A Support Vector Machine (SVM) is a supervised learning algorithm typically used for classification (and sometimes for regression). It aims to find the best possible decision boundary, a hyperplane, between data points of different classes by maximizing the margin between them.

How SVM Works
1. Linear SVM & Maximum Margin
SVM finds the hyperplane that maximizes the margin, that is, the distance between the boundary and the closest training points from each class. These closest points are called support vectors.

2. Soft Margin & Hinge Loss
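Real-world data often overlaps. Soft-margin SVM allows some misclassification or margin violations to improve generalization. Each violation is penalized by the hinge loss, max(0, 1 - y * f(x)), which is zero for points correctly classified outside the margin and grows linearly with the size of the violation.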

3. Kernel Trick for Non‑Linear Data
When data isn’t linearly separable in the original feature space, SVM uses a kernel function to implicitly map it into a higher-dimensional space where separation is easier.

4. Training via Optimization
The SVM solves a convex quadratic optimization problem (often via the dual formulation using Lagrange multipliers) to determine which data points become support vectors and define the hyperplane.
'''
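The maximum-margin idea above can be sketched on a toy dataset. This is a minimal illustration, assuming scikit-learn's `SVC` and six made-up 2-D points; the very large `C` is only a stand-in for a hard margin, not a recommended setting.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D points (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# so the fit approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                        # normal vector of the hyperplane
b = clf.intercept_[0]                   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

# Support vectors sit on the margin, where |w.x + b| is close to 1.
sv_scores = clf.decision_function(clf.support_vectors_)

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
print("Decision values at the support vectors:", sv_scores)
```

Only the points closest to the other class end up as support vectors; moving any of the remaining points (without crossing the margin) would leave the hyperplane unchanged.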

In [None]:
# Explain the difference between Hard Margin and Soft Margin SVM.

'''
->Hard Margin SVM
Definition: Used when data is perfectly linearly separable. Every training point must lie on or outside the margin, and no misclassification is allowed.

Pros:

Maximizes margin for perfect separation
Unique, well-defined solution

Cons:

Fails when data is not linearly separable
Highly sensitive to outliers or noise: a single mislabeled or outlying point can make the problem infeasible or shift the boundary drastically

Soft-Margin SVM
Designed for real-world datasets that may be noisy or not perfectly separable.
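Trade-off parameter (C):

Large C → penalizes misclassification heavily → behaves almost like a hard margin (narrower margin, higher risk of overfitting)

Small C → allows more violations → wider margin and often better generalization, though a very small C can underfit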
'''
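The effect of C can be checked empirically. The sketch below assumes two overlapping synthetic Gaussian clouds and two arbitrary values of `C`; the specific numbers are illustrative, not tuned.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clouds (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

margin_widths = {}
support_counts = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Margin width is 2 / ||w|| for a linear SVM.
    margin_widths[C] = 2.0 / np.linalg.norm(clf.coef_[0])
    support_counts[C] = int(clf.n_support_.sum())
    print(f"C={C}: margin width={margin_widths[C]:.3f}, "
          f"support vectors={support_counts[C]}")
```

The small-C model should report the wider margin, since violations are cheap and the optimizer prefers a smaller weight vector.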

In [None]:
# What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

'''
->The Kernel Trick is a powerful technique used in Support Vector Machines (SVMs) to enable non-linear classification without explicitly mapping data to high-dimensional feature spaces.

Example Kernel: Radial Basis Function (RBF), K(x, z) = exp(-gamma * ||x - z||^2)
When to Use It
Ideal for highly non-linear and complex datasets with unknown structure.

Frequently the default choice in real-world SVM implementations because of its flexibility and smooth decision boundaries.

Intuition
The RBF kernel implicitly maps data into an infinite-dimensional feature space, capturing very subtle distinctions between points without ever having to compute explicit features. The gamma parameter controls how quickly similarity decays with distance: larger values give more local, more flexible boundaries.
'''
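A short sketch of the RBF kernel in practice. It computes the kernel matrix by hand, compares it with scikit-learn's `rbf_kernel`, and then fits linear and RBF SVMs on XOR-style data, which no straight line can separate. The value of `gamma` and the four data points are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

gamma = 2.0  # illustrative value, not a recommendation

# XOR-style data: no straight line separates the two classes.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 0, 1, 1])

# RBF kernel by hand: K(x, z) = exp(-gamma * ||x - z||^2).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)
K_sklearn = rbf_kernel(X, X, gamma=gamma)
print("Kernels match:", np.allclose(K_manual, K_sklearn))

# Same data, two kernels: only the RBF model can fit XOR.
linear_acc = SVC(kernel="linear", C=10.0).fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y).score(X, y)
print("Linear kernel training accuracy:", linear_acc)
print("RBF kernel training accuracy:", rbf_acc)
```

The SVM only ever needs the kernel matrix, never the coordinates in the higher-dimensional space, which is the point of the kernel trick.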

In [None]:
# What is a Naïve Bayes Classifier, and why is it called “naïve”?

'''
->A Naïve Bayes Classifier is a straightforward yet effective probabilistic classification method grounded in Bayes’ theorem, leveraging the principle of conditional independence among features to make predictions efficiently.

Why It’s Called “Naïve”
The classifier’s naïvety stems from its strong assumption that features are completely independent of each other, conditional on the class label.
“We assume conditional independence because it makes it easier to compute the probabilities, even though we know it does not exactly reflect the real world. This is why we call it 'naive'.”
'''
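The independence assumption turns the posterior into a simple product, which can be worked through by hand. The probabilities below are hypothetical numbers chosen only to show the arithmetic; `posterior` is an illustrative helper, not a library function.

```python
from math import prod

# Hypothetical probabilities, chosen only to illustrate the arithmetic.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.6, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def posterior(words):
    """Return P(class | words) under the conditional-independence assumption."""
    # Unnormalized score: P(class) * product of P(word | class).
    scores = {c: priors[c] * prod(likelihoods[c][w] for w in words)
              for c in priors}
    total = sum(scores.values())
    # Normalize so the class probabilities sum to 1.
    return {c: s / total for c, s in scores.items()}

print(posterior(["free"]))
print(posterior(["free", "meeting"]))
```

For the single word "free", the spam score is 0.4 * 0.6 = 0.24 and the ham score is 0.6 * 0.1 = 0.06, so the normalized spam probability is 0.24 / 0.30 = 0.8.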

In [None]:
# Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

'''
->1. Gaussian Naïve Bayes
Model assumption: Each feature (predictor) is continuous and follows a Gaussian (normal) distribution within each class. You estimate mean and variance per feature per class.
Use case: Ideal for datasets with numeric measurements—e.g. height, weight, temperature, sensor data, or the classic Iris dataset.

2. Multinomial Naïve Bayes
Model assumption: Features are discrete counts or frequencies, assumed to follow a multinomial distribution. The likelihood of a sample is proportional to the product of each feature's class-conditional probability raised to the power of its count.
Use case: Very common in text classification (spam detection, sentiment analysis, topic categorization), where features are word counts or term frequencies. It also applies to any other scenario modeled with counts.

3. Bernoulli Naïve Bayes
Model assumption: Features are binary (0/1), modeled as independent Bernoulli distributions. It explicitly considers presence versus absence.
Use case: Well-suited for binary-valued text features (e.g. word occurs or doesn't in a document), or any scenario with Boolean-type predictors.

Choosing the Right Variant
If your features are continuous numeric values, go with Gaussian Naïve Bayes.

If your features represent counts or frequencies, especially in NLP or document classification, use Multinomial NB.

If your features are binary (yes/no, true/false), such as word presence, choose Bernoulli NB.

In rare cases where none fits perfectly (e.g., continuous TF-IDF features), people often still use Multinomial NB or fall back to Gaussian NB, but careful validation is recommended.
'''
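To make the pairing of variant and feature type concrete, the sketch below fits each variant on a matching kind of synthetic feature. The data generation (class-dependent means and Poisson rates, and the `> 2` threshold for binarizing) is entirely made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)  # synthetic binary labels

# Continuous features whose mean depends on the class -> Gaussian NB.
X_cont = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(200, 3))

# Count features whose rates depend on the class -> Multinomial NB.
rates = np.where(y[:, None] == 1, [4.0, 1.0, 1.0], [1.0, 1.0, 4.0])
X_counts = rng.poisson(rates)

# Binary presence/absence features derived from the counts -> Bernoulli NB.
X_bin = (X_counts > 2).astype(int)

scores = {}
for name, model, X in [
    ("GaussianNB", GaussianNB(), X_cont),
    ("MultinomialNB", MultinomialNB(), X_counts),
    ("BernoulliNB", BernoulliNB(), X_bin),
]:
    # Training accuracy only; a real comparison would use held-out data.
    scores[name] = model.fit(X, y).score(X, y)
    print(f"{name}: training accuracy = {scores[name]:.2f}")
```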

In [None]:
# Write a Python program to:
# ● Load the Iris dataset
# ● Train an SVM Classifier with a linear kernel
# ● Print the model's accuracy and support vectors
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# 2. Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train SVM with linear kernel
svm_model = SVC(kernel='linear', random_state=0)
svm_model.fit(X_train, y_train)

# 4. Evaluate accuracy
y_pred = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the SVM model: {accuracy:.2f}")

# 5. Get support vectors
print("Support vectors (coordinates):")
print(svm_model.support_vectors_)
print("Number of support vectors for each class:")
print(svm_model.n_support_)
print("Indices of support vectors:")
print(svm_model.support_)


Accuracy of the SVM model: 1.00
Support vectors (coordinates):
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
Number of support vectors for each class:
[ 3 11 10]
Indices of support vectors:
[ 16  18  76   7  30  39  44  45  47  58  64  65  90  95   1  15  27  53
  66  72  86  97  98 101]


In [None]:
# Write a Python program to:
# ● Load the Breast Cancer dataset
# ● Train a Gaussian Naïve Bayes model
# ● Print its classification report including precision, recall, and F1-score
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = datasets.load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Gaussian Naïve Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.93      0.90      0.92        63
           1       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



In [None]:
# Write a Python program to:
# ● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
# ● Print the best hyperparameters and accuracy.
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid to search over
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# Create an SVM classifier
svm = SVC()

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5, n_jobs=-1, verbose=3)

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and corresponding accuracy score
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best Cross-Validation Accuracy: {:.2f}%".format(grid_search.best_score_ * 100))

# Evaluate the model on the test set
best_svm = grid_search.best_estimator_
y_pred = best_svm.predict(X_test)
print("Test Accuracy: {:.2f}%".format(best_svm.score(X_test, y_test) * 100))

# Print the classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))



Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best Hyperparameters:  {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Best Cross-Validation Accuracy: 69.47%
Test Accuracy: 77.78%

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87        19
           1       0.83      0.71      0.77        21
           2       0.62      0.71      0.67        14

    accuracy                           0.78        54
   macro avg       0.77      0.77      0.77        54
weighted avg       0.79      0.78      0.78        54
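The cross-validation accuracy of 69.47% above is low for this dataset. A likely cause is that the Wine features are on very different scales, and an RBF kernel is sensitive to feature scale. The sketch below is one possible variation, not part of the original solution: it wraps `StandardScaler` and `SVC` in a `Pipeline` (the step names `scaler` and `svc` are arbitrary) so that scaling is refit inside each cross-validation fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled
# using statistics from its own training portion only.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Same grid as above; the "svc__" prefix routes parameters to the SVC step.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

scaled_test_accuracy = search.score(X_test, y_test)
print("Best hyperparameters:", search.best_params_)
print("Best cross-validation accuracy: {:.2f}%".format(search.best_score_ * 100))
print("Test accuracy: {:.2f}%".format(scaled_test_accuracy * 100))
```

Comparing these numbers with the unscaled run shows how much of the gap comes from preprocessing rather than from the choice of C and gamma.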



In [None]:
# Write a Python program to:
# ● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
# ● Print the model's ROC-AUC score for its predictions
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load the 20 Newsgroups dataset
newsgroups = datasets.fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
X = newsgroups.data
y = newsgroups.target

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3, random_state=42)

# Initialize and train the Naïve Bayes classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predict probabilities on the test set
y_pred_prob = nb.predict_proba(X_test)

# Compute the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred_prob, multi_class='ovr')
print(f"ROC-AUC Score: {roc_auc:.4f}")

ROC-AUC Score: 0.9607


In [None]:
# Imagine you’re working as a data scientist for a company that handles email communications.
# Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
# ● Text with diverse vocabulary
# ● Potential class imbalance (far more legitimate emails than spam)
# ● Some incomplete or missing data
# Explain the approach you would take to:
# ● Preprocess the data (e.g. text vectorization, handling missing data)
# ● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
# ● Address class imbalance
# ● Evaluate the performance of your solution with suitable metrics
# And explain the business impact of your solution.

'''
->1. Data Preprocessing
Text Vectorization:

TF-IDF Vectorizer: Converts text into numerical features by evaluating the importance of words in the documents. It reduces the weight of commonly used words and highlights more informative terms.
Handling Missing Data:
Imputation: Replace missing email content with empty strings to maintain data consistency and ensure the model can process all entries.
Text Cleaning:
Lowercasing: Standardizes text by converting all characters to lowercase.
Stopword Removal: Eliminates common words (e.g., "the", "is") that don't contribute to distinguishing spam.
2. Model Selection
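Naïve Bayes Classifier:
Multinomial Naïve Bayes is effective for text classification tasks. It assumes feature independence and is computationally efficient.
Performance: Often a strong baseline; the ~98% accuracy sometimes quoted for spam detection depends on the dataset and should be verified on your own data.
Support Vector Machine (SVM):
Linear SVM is suitable for high-dimensional, sparse spaces, like TF-IDF text data, and can outperform Naïve Bayes at a higher training cost.
Choice: A reasonable approach is to start with Multinomial Naïve Bayes as a fast baseline and adopt a linear SVM if cross-validation shows a clear improvement.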
3. Addressing Class Imbalance
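SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic samples for the minority class (spam) to balance the dataset. It should be applied to the training split only, after vectorization, so that the test set keeps the real class distribution.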
4. Evaluation Metrics
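Accuracy: Measures the overall correctness of the model, but it can be misleading when legitimate emails far outnumber spam.
Precision: Indicates the proportion of true positives among all positive predictions; high precision means few legitimate emails are flagged as spam.
Recall: Indicates the proportion of actual spam that the model catches.
F1-score: The harmonic mean of precision and recall, useful as a single summary under class imbalance.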
5. Business Impact
Cost Reduction: Automates spam detection, reducing the need for manual intervention.

Improved Productivity: Ensures employees focus on legitimate emails, enhancing efficiency.
'''
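The approach above can be sketched end to end on a tiny made-up corpus. Several choices here are assumptions for illustration: the emails and labels are invented, `class_weight="balanced"` is swapped in for SMOTE (which needs the separate imbalanced-learn package), and two-fold cross-validation is used only because the corpus is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus; 1 = spam, 0 = not spam. None marks a missing body.
emails = [
    "Win free money now", "Free prize, claim your reward", "Cheap loans, free offer",
    "Meeting at 10 tomorrow", "Project report attached", None,
    "Please review the budget", "Team meeting notes", "Quarterly report draft",
    "Schedule for next week",
]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Handle missing data: replace missing bodies with empty strings.
cleaned = [text if text is not None else "" for text in emails]

# TF-IDF handles lowercasing and stopword removal; class_weight="balanced"
# reweights errors so the rare spam class is not ignored.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LinearSVC(class_weight="balanced"),
)

# Out-of-fold predictions, so each email is scored by a model that never saw it.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
predictions = cross_val_predict(model, cleaned, labels, cv=cv)

precision = precision_score(labels, predictions, zero_division=0)
recall = recall_score(labels, predictions, zero_division=0)
f1 = f1_score(labels, predictions, zero_division=0)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```

With so few examples the scores themselves mean little; the point is the shape of the workflow: clean, vectorize inside a pipeline, correct for imbalance, and report precision, recall, and F1 rather than accuracy alone.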
