SVM & Naïve Bayes

Assignment Questions

Theoretical

Q1. What is a Support Vector Machine (SVM)?

A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression. It finds the optimal hyperplane that separates data points of different classes with the maximum margin.

Q2. What is the difference between Hard Margin and Soft Margin SVM?

Hard Margin SVM assumes the data is perfectly linearly separable and allows no misclassification, while Soft Margin SVM tolerates some margin violations, controlled by a penalty parameter (C), so it can handle noisy or overlapping classes.

Q3. What is the mathematical intuition behind SVM?

SVM maximizes the margin between classes by minimizing $\frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$. A larger margin implies better generalization of the classifier.
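
As a minimal sketch of the objective and constraint in this answer, the following checks a hand-picked hyperplane against four toy points. The data, `w`, and `b` are hypothetical values chosen for illustration, not the output of a solver.

```python
import math

# Hypothetical toy data: ((x1, x2), label) with labels in {-1, +1}.
points = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Hand-picked hyperplane w.x + b = 0 (an assumption, not a fitted model).
w = (0.5, 0.5)
b = -1.0

def constraint_value(x, y):
    """Return y * (w.x + b); the hard-margin constraint requires this to be >= 1."""
    return y * (w[0] * x[0] + w[1] * x[1] + b)

values = [constraint_value(x, y) for x, y in points]
feasible = all(v >= 1 for v in values)

norm_w = math.hypot(w[0], w[1])
objective = 0.5 * norm_w ** 2   # the quantity SVM minimizes
margin_width = 2 / norm_w       # distance between the two margin lines

print("constraint values:", values)
print("feasible:", feasible)
print(f"objective = {objective:.3f}, margin width = {margin_width:.3f}")
```

The points at (2, 2) and (0, 0) meet the constraint with equality, so shrinking `w` any further to widen the margin would violate it for them.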

Q4. What is the role of Lagrange Multipliers in SVM?

Lagrange multipliers are used to convert the constrained optimization problem into its dual form, making it easier to solve. They help identify support vectors and enable the use of kernel functions.
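
For reference, the standard hard-margin dual obtained by introducing one multiplier $\alpha_i \ge 0$ per constraint is sketched below. The symbol $\alpha_i$ is notation introduced here for illustration.

$$\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to} \quad \alpha_i \ge 0, \; \sum_{i} \alpha_i y_i = 0$$

The weight vector is recovered as $w = \sum_i \alpha_i y_i x_i$. Points with $\alpha_i > 0$ are the support vectors, and because the data appear only through the inner product $x_i \cdot x_j$, that product can be replaced by a kernel function.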

Q5. What are Support Vectors in SVM?

Support Vectors are the data points closest to the separating hyperplane. They define the position and orientation of the hyperplane, and removing them can change the decision boundary.
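
A small sketch of this idea using scikit-learn's `SVC` on hypothetical toy data: the fitted model exposes its support vectors, and refitting after dropping two points far from the boundary should leave the hyperplane essentially unchanged. The data and the choice of `C=1000.0` (to approximate a hard margin) are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data.
X = np.array([[0, 0], [-1, 0], [0, -1], [-2, -2],
              [2, 2], [3, 3], [3, 2], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1000.0).fit(X, y)
print("support vectors:\n", clf.support_vectors_)

# Drop two points that lie far from the boundary, then refit.
keep = [i for i in range(len(X)) if i not in (3, 7)]
clf_reduced = SVC(kernel="linear", C=1000.0).fit(X[keep], y[keep])

print("w before:", clf.coef_[0], "b before:", clf.intercept_[0])
print("w after: ", clf_reduced.coef_[0], "b after: ", clf_reduced.intercept_[0])
```

Removing one of the reported support vectors instead would generally move the boundary.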

Q6. What is a Support Vector Classifier (SVC)?

A Support Vector Classifier (SVC) is the practical version of SVM that uses a soft margin to allow some misclassifications. It balances maximizing the margin and minimizing classification errors.

Q7. What is a Support Vector Regressor (SVR)?

Support Vector Regression (SVR) adapts SVM for regression by fitting a function that keeps predictions within an epsilon-wide tube around the targets. Errors smaller than epsilon are ignored; only data points lying on or outside the tube become support vectors and affect the model.
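
The epsilon margin corresponds to the epsilon-insensitive loss, sketched below in plain Python. The function name and the sample values are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical (target, prediction) pairs.
pairs = [(1.00, 1.05), (1.00, 1.50), (2.00, 1.20)]
for y_true, y_pred in pairs:
    loss = epsilon_insensitive_loss(y_true, y_pred)
    status = "inside tube" if loss == 0.0 else "outside tube"
    print(f"target={y_true:.2f} pred={y_pred:.2f} loss={loss:.2f} ({status})")
```

Only the pairs with a non-zero loss would pull on the fitted function.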

Q8. What is the Kernel Trick in SVM?

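The Kernel Trick allows SVM to perform non-linear classification by implicitly mapping input data to a higher-dimensional space. A kernel function such as the polynomial or RBF kernel computes inner products in that space directly from the original inputs, so the transformation is never carried out explicitly (the linear kernel is the special case with no mapping).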
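
To make the "no explicit transformation" point concrete, this sketch compares a degree-2 polynomial kernel with the dot product of an explicit 3D feature map for 2D inputs. The helper names and the two input points are hypothetical.

```python
import math

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def explicit_map(x):
    """Explicit feature map for the same kernel, from 2D to 3D."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 4.0)
k_implicit = poly_kernel(x, z)                       # computed in the input space
k_explicit = dot(explicit_map(x), explicit_map(z))   # computed in the mapped space
print(k_implicit, k_explicit)
```

Both values agree up to floating-point rounding, but the kernel never builds the mapped vectors.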

Q9. Compare Linear Kernel, Polynomial Kernel, and RBF Kernel.

Linear Kernel is best for linearly separable data, Polynomial Kernel handles moderate non-linearity, and RBF Kernel captures complex relationships but may overfit if not tuned properly.

Q10. What is the effect of the C parameter in SVM?

The C parameter controls the trade-off between maximizing the margin and minimizing classification errors. A high C reduces errors but may overfit, while a low C increases margin but may underfit.
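
The trade-off can be seen by evaluating the soft-margin objective, $\frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i(w \cdot x_i + b))$, for two candidate boundaries. The 1D data and both candidates below are hypothetical values picked by hand, not fitted models.

```python
def soft_margin_objective(w, b, data, C):
    """0.5 * w^2 + C * (sum of hinge losses), for 1D inputs."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in data)
    return 0.5 * w ** 2 + C * hinge

# Hypothetical 1D data; the positive point at x = -0.5 sits close to the negatives.
data = [(-2.0, -1), (-1.0, -1), (-0.5, 1), (2.0, 1), (3.0, 1)]

wide = (1.0, 0.0)     # wide margin, violates the constraint for x = -0.5
narrow = (4.0, 3.0)   # narrow margin, satisfies every constraint

for C in (0.1, 100.0):
    obj_wide = soft_margin_objective(*wide, data, C)
    obj_narrow = soft_margin_objective(*narrow, data, C)
    better = "wide" if obj_wide < obj_narrow else "narrow"
    print(f"C={C}: wide={obj_wide:.2f}, narrow={obj_narrow:.2f} -> prefers {better}")
```

With the small C the wide-margin candidate scores lower despite its violation; with the large C the violation dominates and the narrow-margin candidate wins.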

Q11. What is the role of the Gamma parameter in RBF Kernel SVM?

Gamma determines how far the influence of a training sample reaches. A high gamma gives each sample only short-range influence, producing a more complex boundary that may overfit, while a low gamma gives far-reaching influence and smoother, simpler boundaries.
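
A short sketch of the RBF kernel, $\exp(-\gamma \|x - z\|^2)$, shows how gamma controls the reach of a sample. The points and gamma values are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = (0.0, 0.0)
neighbours = [(0.5, 0.0), (1.0, 0.0), (2.0, 0.0)]  # hypothetical points

for gamma in (0.1, 1.0, 10.0):
    sims = [rbf_kernel(x, z, gamma) for z in neighbours]
    print(f"gamma={gamma}: " + ", ".join(f"{s:.4f}" for s in sims))
```

As gamma grows, the similarity to even nearby points drops toward zero, so each training sample affects only a small neighbourhood.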

Q12. What is the Naïve Bayes classifier, and why is it called "Naïve"?

Naïve Bayes is a probabilistic classifier based on Bayes' theorem. It is called "naïve" because it assumes that all features are conditionally independent given the class label.
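
Under this independence assumption, the class score is the prior multiplied by one likelihood term per feature. A minimal sketch, with hypothetical priors and word likelihoods:

```python
# Hypothetical class priors and per-class word likelihoods.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}
observed = ["free", "meeting"]

# Naive assumption: P(features | class) is the product of per-feature terms.
scores = {}
for label, prior in priors.items():
    score = prior
    for word in observed:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize the scores so they sum to 1 and can be read as posteriors.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(posteriors)
```

The classifier predicts the label with the largest posterior, which is "ham" for these made-up numbers.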

Q13. What is Bayes' Theorem?

Bayes' Theorem is a formula for updating the probability of a hypothesis ($A$) when new evidence ($B$) is observed. It revises an initial belief (the prior probability) using how well the hypothesis explains the observed data (the likelihood) to give an updated belief (the posterior probability).

Formula and Interpretation

The formula for Bayes' Theorem is:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Here $P(A)$ is the prior, $P(B|A)$ is the likelihood, $P(B)$ is the probability of the evidence, and $P(A|B)$ is the posterior.
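
A worked numeric sketch of the formula, where $P(B)$ is expanded as $P(B|A)P(A) + P(B|\neg A)P(\neg A)$. All probabilities below are hypothetical.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' theorem, expanding P(B) with the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: A = "has the condition", B = "test is positive".
p_a = 0.01              # prior P(A)
p_b_given_a = 0.95      # likelihood P(B|A)
p_b_given_not_a = 0.05  # false positive rate P(B|not A)

p_a_given_b = posterior(p_a, p_b_given_a, p_b_given_not_a)
print(f"P(A|B) = {p_a_given_b:.4f}")
```

Even with an accurate test, the low prior keeps the posterior far below the likelihood.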



Q14. Explain the differences between Gaussian, Multinomial, and Bernoulli Naïve Bayes.

Gaussian NB handles continuous features, Multinomial NB is used for count data like word frequencies, and Bernoulli NB is used for binary features indicating presence or absence.
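
A compact sketch pairing each scikit-learn variant with the feature type it expects. The tiny arrays are hypothetical and chosen so that the two classes are clearly separated.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0], [1.2], [5.0], [5.2]])
gnb_pred = GaussianNB().fit(X_cont, y).predict([[1.1], [5.1]])

# Non-negative counts (e.g. word frequencies) -> MultinomialNB
X_counts = np.array([[3, 0], [4, 1], [0, 3], [1, 4]])
mnb_pred = MultinomialNB().fit(X_counts, y).predict([[5, 0], [0, 5]])

# Binary presence/absence indicators -> BernoulliNB
X_bin = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bnb_pred = BernoulliNB().fit(X_bin, y).predict([[1, 0], [0, 1]])

print("GaussianNB:   ", gnb_pred)
print("MultinomialNB:", mnb_pred)
print("BernoulliNB:  ", bnb_pred)
```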

Q15. When should you use Gaussian Naïve Bayes over other variants?

Gaussian Naïve Bayes should be used when features are continuous and approximately follow a normal (Gaussian) distribution within each class, such as in sensor or medical data.

Q16. What are the key assumptions made by Naïve Bayes?

Naïve Bayes assumes that features are conditionally independent given the class label and that each feature contributes equally to the outcome.

Q17. What are the advantages and disadvantages of Naïve Bayes?

Advantages: Simple, fast, and effective with small or high-dimensional data.
Disadvantages: Assumes feature independence and can perform poorly when features are correlated.

Q18. Why is Naïve Bayes a good choice for text classification?

Naïve Bayes works well for text data because it stays effective even though the independence assumption holds only approximately for words, and it efficiently handles high-dimensional, sparse data such as word-count or TF-IDF vectors.

Q19. Compare SVM and Naïve Bayes for classification tasks.

SVM is a discriminative model that focuses on the decision boundary, while Naïve Bayes is a generative model based on class-conditional probabilities. SVM is typically slower to train and often more accurate when features are correlated; Naïve Bayes is faster but relies on the independence assumption.

Q20. How does Laplace Smoothing help in Naïve Bayes?

Laplace Smoothing prevents zero probabilities for unseen features by adding one to every feature count (and the number of distinct features to the denominator). This ensures all probabilities remain non-zero and improves model robustness.
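
A minimal sketch of add-one smoothing for a single class, using hypothetical word counts in which one word never occurred during training:

```python
def smoothed_probabilities(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary_size)."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab_size) for w, c in counts.items()}

# Hypothetical word counts for one class; "win" never appeared in training.
counts = {"free": 3, "hello": 1, "win": 0}

unsmoothed = {w: c / sum(counts.values()) for w, c in counts.items()}
smoothed = smoothed_probabilities(counts)

print("unsmoothed:", unsmoothed)
print("smoothed:  ", smoothed)
```

Without smoothing, a single unseen word would multiply the whole class score by zero; with `alpha=1.0` it contributes a small but non-zero probability.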

Practical

21. Write a Python program to train an SVM Classifier on the Iris dataset and evaluate accuracy.

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# --- 1. Load the Dataset ---
# The Iris dataset is a classic and simple dataset for classification tasks.
print("Loading Iris dataset...")
iris = load_iris()
X = iris.data  # Features (Sepal length, Sepal width, Petal length, Petal width)
y = iris.target # Target (Species: Setosa, Versicolor, Virginica)

# --- 2. Split Data into Training and Testing Sets ---
# We split the data to evaluate how well the model generalizes to unseen data.
# test_size=0.3 means 30% of the data will be used for testing, and 70% for training.
# random_state ensures reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Total samples: {len(X)}")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print("-" * 30)

# --- 3. Initialize and Train the SVM Classifier ---
# We use the Support Vector Classifier (SVC). The default kernel is 'rbf' (Radial Basis Function),
# which is often highly effective for non-linear classification tasks like this.
print("Initializing and training the SVM (SVC) classifier...")
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train, y_train)
print("Training complete.")
print("-" * 30)

# --- 4. Make Predictions ---
# Use the trained model to predict the class labels for the test set.
y_pred = svm_model.predict(X_test)

# --- 5. Evaluate Accuracy ---
# Calculate the classification accuracy by comparing predicted labels (y_pred)
# against the true labels (y_test).
accuracy = accuracy_score(y_test, y_pred)

print(f"Model predictions on test set: {y_pred}")
print(f"Actual labels for test set: {y_test}")
print("-" * 30)

# The accuracy is the percentage of correctly classified instances.
print(f"Classification Accuracy: {accuracy * 100:.2f}%")

# --- Example Usage (Predicting a single new sample) ---
# Let's use the first sample from the test set as an example new flower.
example_sample = X_test[0].reshape(1, -1)
example_prediction = svm_model.predict(example_sample)[0]
actual_species = iris.target_names[y_test[0]]
predicted_species = iris.target_names[example_prediction]

print("-" * 30)
print("Example Prediction:")
print(f"Input Features: {example_sample[0]}")
print(f"Actual Species: {actual_species}")
print(f"Predicted Species: {predicted_species}")

22. Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

In [None]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# --- 1. Load the Dataset ---
# The Wine dataset is used for multi-class classification based on chemical analysis.
print("Loading Wine dataset...")
wine = load_wine()
X = wine.data  # Features (13 chemical constituents)
y = wine.target # Target (3 types of wine)

# --- 2. Split Data into Training and Testing Sets ---
# We split the data to evaluate how well the model generalizes to unseen data.
# test_size=0.3 means 30% of the data will be used for testing, and 70% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Total samples: {len(X)}")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print("-" * 30)

# --- 3. Initialize and Train Two SVM Classifiers ---

# Classifier 1: Linear Kernel
print("Initializing and training SVM with LINEAR Kernel...")
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)
print("Linear SVM training complete.")

# Classifier 2: RBF (Radial Basis Function) Kernel
print("Initializing and training SVM with RBF Kernel...")
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)
print("RBF SVM training complete.")
print("-" * 30)

# --- 4. Make Predictions ---
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# --- 5. Evaluate and Compare Accuracies ---

# Calculate accuracy for Linear SVM
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# Calculate accuracy for RBF SVM
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

print("--- Accuracy Comparison (Wine Dataset) ---")
print(f"Linear SVM Accuracy: {accuracy_linear * 100:.2f}%")
print(f"RBF SVM Accuracy:    {accuracy_rbf * 100:.2f}%")
print("-" * 30)

# Provide a brief analysis
if accuracy_linear > accuracy_rbf:
    print("Conclusion: The Linear kernel performed slightly better on this test set.")
elif accuracy_rbf > accuracy_linear:
    print("Conclusion: The RBF kernel performed slightly better on this test set.")
else:
    print("Conclusion: Both kernels achieved the same level of accuracy on this test set.")

23. Write a Python program to train an SVM Regressor (SVR) on a housing dataset and evaluate it using Mean Squared Error (MSE).

In [None]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR # Import Support Vector Regressor
from sklearn.metrics import mean_squared_error # Import Mean Squared Error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# --- 1. Load the Dataset ---
# We use the California Housing dataset for regression (predicting house prices).
print("Loading California Housing dataset...")
# Data is loaded as a Bunch object
housing = fetch_california_housing()
X = housing.data    # Features (8 features like median income, house age, etc.)
y = housing.target  # Target (Median house value, in hundreds of thousands of dollars)

# --- 2. Split Data into Training and Testing Sets ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Total samples: {len(X)}")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print("-" * 30)

# --- 3. Initialize and Train the SVR Model ---
# SVR is highly sensitive to feature scaling, so we'll use a pipeline to scale data
# before feeding it to the RBF SVR model.
print("Initializing and training the SVR (Support Vector Regressor) with RBF Kernel...")

# C=10 is an untuned starting point; gamma='auto' sets gamma to 1 / n_features.
svr_model = make_pipeline(
    StandardScaler(), # Feature scaling is crucial for SVR
    SVR(kernel='rbf', C=10, gamma='auto')
)

svr_model.fit(X_train, y_train)
print("SVR training complete.")
print("-" * 30)

# --- 4. Make Predictions ---
# Use the trained model to predict the house values for the test set.
y_pred = svr_model.predict(X_test)

# --- 5. Evaluate Performance using Mean Squared Error (MSE) ---

# MSE measures the average squared difference between the estimated values and the actual value.
# A lower MSE indicates better model performance.
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("--- Regression Evaluation (California Housing Dataset) ---")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print("-" * 30)

# Example: Inspecting the first few predictions vs. actual values
print("Sample Predictions (Predicted vs. Actual):")
for i in range(5):
    # Remember the target value is in hundreds of thousands (e.g., 2.34 means $234,000)
    print(f"Predicted: {y_pred[i]:.2f} | Actual: {y_test[i]:.2f}")


24. Write a Python program to train an SVM Classifier with a Polynomial Kernel and visualize the decision boundary.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC # Support Vector Classifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# --- 1. Load the Dataset ---
# We use the Iris dataset for classification.
print("Loading Iris dataset...")
iris = load_iris()
# For 2D visualization, we only use the first two features (Sepal Length and Sepal Width)
X = iris.data[:, :2]
y = iris.target

# --- 2. Split Data into Training and Testing Sets ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Total samples: {len(X)}")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print("-" * 30)

# --- 3. Initialize and Train the SVM Classifier with Polynomial Kernel ---
print("Initializing and training the SVC (Support Vector Classifier) with Polynomial Kernel...")

# C controls regularization, degree specifies the polynomial degree.
svm_poly_model = make_pipeline(
    StandardScaler(), # Scaling is highly recommended for SVM
    SVC(kernel='poly', degree=3, C=1.0, random_state=42)
)

svm_poly_model.fit(X_train, y_train)
print("Polynomial SVM training complete.")
print("-" * 30)

# --- 4. Evaluate Accuracy ---
y_pred = svm_poly_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("--- Classification Evaluation (Iris Dataset - 2 Features) ---")
print(f"Polynomial SVM Accuracy on Test Set: {accuracy * 100:.2f}%")
print("-" * 30)

# --- 5. Visualization of Decision Boundary ---

print("Generating decision boundary visualization...")

# Define the boundaries of the plot based on feature ranges
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

# Create a mesh grid (a grid of points across the feature space)
h = .02
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict the class for every point in the mesh (Z)
# The pipeline automatically scales the mesh points before prediction
Z = svm_poly_model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the decision boundary as a colored background
plt.figure(1, figsize=(10, 7))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.2)

# Plot the training points
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
            cmap=plt.cm.RdYlBu, edgecolor='k', s=60, label="Training Data")

# Plot the test points (using an 'X' marker for distinction)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
            cmap=plt.cm.RdYlBu, marker='X', s=80, label="Test Data")

# Add labels and title
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("SVM Classifier with Polynomial Kernel (Degree 3) Decision Boundary")
plt.legend()
plt.axis('tight')
plt.show()

25. Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# --- 1. Load the Dataset ---
# We use the Breast Cancer Wisconsin (Diagnostic) dataset for binary classification.
print("Loading Breast Cancer dataset...")
cancer = load_breast_cancer()
X = cancer.data   # Features (30 different measurements)
y = cancer.target # Target (Malignant or Benign)

# --- 2. Split Data into Training and Testing Sets ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Total samples: {len(X)}")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print("-" * 30)

# --- 3. Initialize and Train the Gaussian Naïve Bayes Classifier ---
print("Initializing and training the Gaussian Naïve Bayes (GNB) Classifier...")

# GNB assumes features follow a Gaussian (Normal) distribution.
# It doesn't require feature scaling, unlike SVM.
gnb_model = GaussianNB()

gnb_model.fit(X_train, y_train)
print("Gaussian Naïve Bayes training complete.")
print("-" * 30)

# --- 4. Make Predictions ---
y_pred = gnb_model.predict(X_test)

# --- 5. Evaluate Accuracy ---
accuracy = accuracy_score(y_test, y_pred)

print("--- Classification Evaluation (Breast Cancer Dataset) ---")
print(f"Gaussian Naïve Bayes Accuracy on Test Set: {accuracy * 100:.2f}%")
print("-" * 30)

# Report the raw counts behind the accuracy figure
print(f"Correctly classified {(y_pred == y_test).sum()} of {len(y_test)} test samples.")

26. Write a Python program to train a Multinomial Naïve Bayes classifier for text classification using the 20 Newsgroups dataset.

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# --- Configuration ---
# We will use a subset of the categories to make the example run quickly and focus the results.
# The full list is: 'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
# 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
# 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
# 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast',
# 'talk.politics.misc', 'talk.religion.misc'
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Use 80% for training and 20% for testing
TEST_SIZE = 0.2

def train_and_evaluate_mnb(categories_to_use):
    """
    Loads the 20 Newsgroups dataset, trains a Multinomial Naive Bayes classifier,
    and prints the classification report and accuracy.
    """
    print(f"--- Loading data for categories: {categories_to_use} ---")

    # 1. Load the 20 Newsgroups dataset
    # We remove headers, footers, and quotes which often contain metadata
    # that makes classification artificially easy.
    newsgroups_data = fetch_20newsgroups(
        subset='all',
        categories=categories_to_use,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )

    X, y = newsgroups_data.data, newsgroups_data.target
    target_names = newsgroups_data.target_names

    print(f"Total samples loaded: {len(X)}")
    print(f"Classes: {target_names}")

    # 2. Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=42
    )

    print(f"\nTraining samples: {len(X_train)}")
    print(f"Testing samples: {len(X_test)}")

    # 3. Feature Extraction (Vectorization)
    # TfidfVectorizer converts text into a matrix of TF-IDF features.
    # TF-IDF stands for Term Frequency-Inverse Document Frequency.
    vectorizer = TfidfVectorizer(
        stop_words='english', # Remove common English words
        lowercase=True,       # Convert text to lowercase
        max_df=0.5,           # Ignore terms that appear in more than 50% of the documents
        ngram_range=(1, 2)    # Use unigrams and bigrams
    )

    # Fit the vectorizer on the training data and transform both training and testing data
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    print(f"Feature space size (vocabulary): {X_train_vec.shape[1]}")

    # 4. Train the Multinomial Naïve Bayes Classifier
    print("\n--- Training Multinomial Naïve Bayes Classifier ---")
    # MultinomialNB is designed for discrete counts (such as word counts), but
    # fractional values such as TF-IDF weights also work in practice.
    mnb_classifier = MultinomialNB()
    mnb_classifier.fit(X_train_vec, y_train)
    print("Training complete.")

    # 5. Make Predictions
    y_pred = mnb_classifier.predict(X_test_vec)

    # 6. Evaluate the Model
    print("\n--- Model Evaluation ---")

    # Calculate overall accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy Score: {accuracy:.4f}")

    # Print detailed classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=target_names))

if __name__ == "__main__":
    # Ensure scikit-learn and numpy are installed:
    # pip install scikit-learn numpy

    # Run the classification
    train_and_evaluate_mnb(categories)

27. Write a Python program to train an SVM Classifier with different C values and compare the decision boundaries visually.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# --- Configuration for Visualization ---
# To visualize the decision boundary, we use only two categories to create a
# clear binary classification problem.
categories = [
    'rec.autos',
    'comp.graphics',
]
# Use 80% for training and 20% for testing
TEST_SIZE = 0.2

def visualize_svm_decision_boundaries(categories_to_use):
    """
    Loads a subset of the 20 Newsgroups data, reduces dimensionality to 2D using PCA,
    trains SVC with different C values, and plots the decision boundaries.
    """
    print(f"--- Loading data for categories: {categories_to_use} ---")

    # 1. Load the 20 Newsgroups dataset (binary classification)
    newsgroups_data = fetch_20newsgroups(
        subset='all',
        categories=categories_to_use,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )

    X, y = newsgroups_data.data, newsgroups_data.target
    target_names = newsgroups_data.target_names

    print(f"Total samples loaded: {len(X)}")
    print(f"Classes: {target_names}")

    # 2. Feature Extraction (TF-IDF Vectorization)
    vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, max_df=0.5)
    X_vec = vectorizer.fit_transform(X)
    print(f"Feature space size (vocabulary) before PCA: {X_vec.shape[1]}")

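    # 3. Dimensionality Reduction using PCA
    # We reduce the feature space to 2 dimensions for visualization.
    # Note: the vectorizer and PCA are fitted on all samples before the split,
    # so the reported test accuracy is only indicative.
    pca = PCA(n_components=2, random_state=42)
    X_2d = pca.fit_transform(X_vec.toarray())  # Densify the sparse TF-IDF matrix for PCA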

    # 4. Split the 2D data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X_2d, y, test_size=TEST_SIZE, random_state=42
    )
    print(f"\nData successfully reduced to 2 dimensions for visualization.")
    print(f"Training samples: {len(X_train)}, Testing samples: {len(X_test)}")

    # 5. Define different C values to compare regularization effects
    C_values = [0.01, 1.0, 100.0]

    plt.figure(figsize=(15, 5))

    # Create meshgrid for plotting the decision boundary
    h = .02  # Step size in the mesh
    x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
    y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # 6. Train and Visualize SVM for each C value
    for i, C in enumerate(C_values):
        # We use a linear kernel for better interpretability in this context
        svm_classifier = SVC(kernel='linear', C=C, random_state=42)
        svm_classifier.fit(X_train, y_train)

        # Predict class for every point in the mesh
        Z = svm_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Create subplot for the current C value
        ax = plt.subplot(1, len(C_values), i + 1)
        ax.set_title(f"SVM Decision Boundary (C={C})")

        # Shade the predicted class regions; the edge between them is the decision boundary
        ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.3)

        # Plot the training points
        scatter = ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.coolwarm, edgecolors='k')

        # Predict on the held-out test set (used for the accuracy report below)
        y_pred = svm_classifier.predict(X_test)

        # Add legend
        ax.legend(handles=scatter.legend_elements()[0], labels=target_names)

        # Print a quick report for this C value
        report = classification_report(y_test, y_pred, target_names=target_names, output_dict=True)
        accuracy = report['accuracy']
        print(f"C={C}: Test Accuracy = {accuracy:.4f}")
        ax.text(0.05, 0.95, f'Acc: {accuracy:.2f}', transform=ax.transAxes, fontsize=10, verticalalignment='top')

    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    # Ensure scikit-learn, numpy, and matplotlib are installed:
    # pip install scikit-learn numpy matplotlib

    # Run the visualization program
    visualize_svm_decision_boundaries(categories)


28. Write a Python program to train a Bernoulli Naïve Bayes classifier for binary classification on a dataset with binary features.

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# --- Configuration for Binary Classification ---
# We use only two categories for a clear binary classification problem.
categories = [
    'rec.autos',
    'comp.graphics',
]
# Use 80% for training and 20% for testing
TEST_SIZE = 0.2

def train_and_evaluate_bnb(categories_to_use):
    """
    Loads the 20 Newsgroups dataset (binary), extracts binary features
    using CountVectorizer(binary=True), trains a Bernoulli Naive Bayes classifier,
    and prints the classification report and accuracy.
    """
    print(f"--- Loading data for categories: {categories_to_use} (Binary Classification) ---")

    # 1. Load the 20 Newsgroups dataset
    newsgroups_data = fetch_20newsgroups(
        subset='all',
        categories=categories_to_use,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )

    X, y = newsgroups_data.data, newsgroups_data.target
    target_names = newsgroups_data.target_names

    print(f"Total samples loaded: {len(X)}")
    print(f"Classes: {target_names}")

    # 2. Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=42
    )

    print(f"\nTraining samples: {len(X_train)}")
    print(f"Testing samples: {len(X_test)}")

    # 3. Feature Extraction (Binary Vectorization)
    # CountVectorizer(binary=True) is used to ensure that the features are binary (0 or 1),
    # representing only the presence or absence of a word, which is suitable
    # for Bernoulli Naive Bayes.
    vectorizer = CountVectorizer(
        stop_words='english',
        lowercase=True,
        max_df=0.5,
        binary=True # CRUCIAL: Makes the feature vectors binary (0 or 1)
    )

    # Fit and transform the data
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    print(f"Feature space size (vocabulary): {X_train_vec.shape[1]}")

    # 4. Train the Bernoulli Naïve Bayes Classifier
    print("\n--- Training Bernoulli Naïve Bayes Classifier ---")
    bnb_classifier = BernoulliNB()
    bnb_classifier.fit(X_train_vec, y_train)
    print("Training complete.")

    # 5. Make Predictions
    y_pred = bnb_classifier.predict(X_test_vec)

    # 6. Evaluate the Model
    print("\n--- Model Evaluation ---")

    # Calculate overall accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy Score: {accuracy:.4f}")

    # Print detailed classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=target_names))

if __name__ == "__main__":
    # Ensure scikit-learn and numpy are installed:
    # pip install scikit-learn numpy

    # Run the classification program
    train_and_evaluate_bnb(categories)


29. Write a Python program to apply feature scaling before training an SVM model and compare results with unscaled data.

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# --- Configuration for Binary Classification ---
# We use only two categories for a clear binary classification problem.
categories = [
    'rec.autos',
    'comp.graphics',
]
# Use 80% for training and 20% for testing
TEST_SIZE = 0.2

def train_and_compare_svm_scaling(categories_to_use):
    """
    Loads the 20 Newsgroups dataset, extracts TF-IDF features, and compares
    the performance of an SVM classifier on unscaled versus scaled features.
    """
    print(f"--- Loading data for categories: {categories_to_use} (Binary Classification) ---")

    # 1. Load the 20 Newsgroups dataset
    newsgroups_data = fetch_20newsgroups(
        subset='all',
        categories=categories_to_use,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )

    X, y = newsgroups_data.data, newsgroups_data.target
    target_names = newsgroups_data.target_names

    print(f"Total samples loaded: {len(X)}")
    print(f"Classes: {target_names}")

    # 2. Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=42
    )

    print(f"\nTraining samples: {len(X_train)}")
    print(f"Testing samples: {len(X_test)}")

    # 3. Feature Extraction (TF-IDF Vectorization)
    # Using TF-IDF, which produces continuous weights, making feature scaling relevant for SVM.
    vectorizer = TfidfVectorizer(
        stop_words='english',
        lowercase=True,
        max_df=0.5,
        ngram_range=(1, 2)
    )

    # Fit the vectorizer on the training data and transform both training and testing data
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    print(f"Feature space size (vocabulary): {X_train_vec.shape[1]}")

    # Convert sparse matrices to dense for scaling.
    # Note: StandardScaler centers features by default, which requires dense input.
    # For large vocabularies, StandardScaler(with_mean=False) on the sparse matrix avoids the memory cost.
    X_train_dense = X_train_vec.toarray()
    X_test_dense = X_test_vec.toarray()

    # --- SCENARIO 1: SVM on UNSCALED Data ---
    print("\n--- SCENARIO 1: SVM on UNSCALED Data (TF-IDF vectors) ---")

    # Train the SVM model on the original (unscaled) dense TF-IDF vectors.
    svm_unscaled = SVC(kernel='linear', random_state=42)
    svm_unscaled.fit(X_train_dense, y_train)
    y_pred_unscaled = svm_unscaled.predict(X_test_dense)

    accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
    print(f"Unscaled Data Accuracy: {accuracy_unscaled:.4f}")
    print("\nUnscaled Data Classification Report:")
    print(classification_report(y_test, y_pred_unscaled, target_names=target_names))

    # --- SCENARIO 2: SVM on SCALED Data ---
    print("\n--- SCENARIO 2: SVM on SCALED Data (TF-IDF vectors) ---")

    # 4. Feature Scaling (StandardScaler)
    # Fit the scaler ONLY on the training data to prevent data leakage.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_dense)
    X_test_scaled = scaler.transform(X_test_dense)

    # 5. Train the SVM Classifier on scaled data
    svm_scaled = SVC(kernel='linear', random_state=42)
    svm_scaled.fit(X_train_scaled, y_train)

    # 6. Make Predictions and Evaluate
    y_pred_scaled = svm_scaled.predict(X_test_scaled)

    accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
    print(f"Scaled Data Accuracy: {accuracy_scaled:.4f}")
    print("\nScaled Data Classification Report:")
    print(classification_report(y_test, y_pred_scaled, target_names=target_names))

    # Final comparison summary
    print("\n--- Comparison Summary ---")
    print(f"Accuracy (Unscaled): {accuracy_unscaled:.4f}")
    print(f"Accuracy (Scaled):   {accuracy_scaled:.4f}")


if __name__ == "__main__":
    # Ensure scikit-learn and numpy are installed:
    # pip install scikit-learn numpy

    # Run the comparison program
    train_and_compare_svm_scaling(categories)
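
Densifying a TF-IDF matrix can use a lot of memory once bigrams are included. The following is a minimal sketch of scaling while keeping the data sparse. The corpus (`docs`, `labels`) is made up for illustration and is not part of the assignment data. It relies on `StandardScaler(with_mean=False)` and `MaxAbsScaler`, both of which accept sparse input, and on `SVC` accepting sparse matrices directly.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy corpus, for illustration only.
docs = [
    "the engine and brakes of this car",
    "new tires for my car engine",
    "rendering a 3d image with shading",
    "image rendering and polygon graphics",
]
labels = [0, 0, 1, 1]

X_sparse = TfidfVectorizer().fit_transform(docs)

# with_mean=False skips centering, so the matrix stays sparse.
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)

# MaxAbsScaler divides each column by its maximum absolute value.
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# SVC can be fitted on the sparse matrix without calling .toarray().
clf = SVC(kernel="linear").fit(X_std, labels)
train_predictions = clf.predict(X_std)
print(f"Still sparse after scaling: {issparse(X_std)}")
```

Whether scaling helps TF-IDF features at all is an empirical question, since TF-IDF rows are already L2-normalized by default; the comparison in the program above is one way to check.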


30.Write a Python program to train a Gaussian Naïve Bayes model and compare the predictions before and
after Laplace Smoothing.

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer # Changed to CountVectorizer for MNB (count-based features)
from sklearn.naive_bayes import MultinomialNB # Switched to Multinomial Naive Bayes
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# --- Configuration for Binary Classification ---
# We use only two categories for a clear binary classification problem.
categories = [
    'rec.autos',
    'comp.graphics',
]
# Use 80% for training and 20% for testing
TEST_SIZE = 0.2

def train_and_compare_mnb_smoothing(categories_to_use):
    """
    Loads the 20 Newsgroups dataset, extracts count features, and compares
    the performance of Multinomial Naive Bayes with and without Laplace Smoothing.

    Note: Laplace Smoothing is a core feature of MultinomialNB (via the 'alpha' parameter),
    which is why we use MNB instead of GaussianNB for this comparison.
    """
    print(f"--- Loading data for categories: {categories_to_use} (Binary Classification) ---")

    # 1. Load the 20 Newsgroups dataset
    newsgroups_data = fetch_20newsgroups(
        subset='all',
        categories=categories_to_use,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )

    X, y = newsgroups_data.data, newsgroups_data.target
    target_names = newsgroups_data.target_names

    print(f"Total samples loaded: {len(X)}")
    print(f"Classes: {target_names}")

    # 2. Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=42
    )

    print(f"\nTraining samples: {len(X_train)}")
    print(f"Testing samples: {len(X_test)}")

    # 3. Feature Extraction (Count Vectorization)
    # Using CountVectorizer to get integer counts (frequency of words),
    # which is the input required by Multinomial Naive Bayes.
    vectorizer = CountVectorizer(
        stop_words='english',
        lowercase=True,
        max_df=0.5
    )

    # Fit the vectorizer on the training data and transform both training and testing data
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    print(f"Feature space size (vocabulary): {X_train_vec.shape[1]}")

    # --- SCENARIO 1: Multinomial NB with Standard Laplace Smoothing (alpha=1.0) ---
    # This is the standard, robust setting for Naive Bayes text classification.
    print("\n--- SCENARIO 1: MNB with Standard Laplace Smoothing (alpha=1.0) ---")

    # alpha=1.0 implements standard Laplace/Additive Smoothing
    mnb_smoothed = MultinomialNB(alpha=1.0)
    mnb_smoothed.fit(X_train_vec, y_train)
    y_pred_smoothed = mnb_smoothed.predict(X_test_vec)

    accuracy_smoothed = accuracy_score(y_test, y_pred_smoothed)
    print(f"Smoothed Data Accuracy: {accuracy_smoothed:.4f}")
    print("\nSmoothed Data Classification Report:")
    print(classification_report(y_test, y_pred_smoothed, target_names=target_names))

    # --- SCENARIO 2: Multinomial NB with Minimal Smoothing (alpha=1e-5) ---
    # We use a very small alpha instead of alpha=0: without smoothing, a word never seen
    # in a class gets probability zero and its log-probability is undefined (log 0).
    # A tiny alpha approximates the effect of no smoothing while staying numerically safe.
    print("\n--- SCENARIO 2: MNB with Minimal/No Smoothing (alpha=1e-5) ---")

    mnb_unsmoothed = MultinomialNB(alpha=1e-5)
    mnb_unsmoothed.fit(X_train_vec, y_train)
    y_pred_unsmoothed = mnb_unsmoothed.predict(X_test_vec)

    accuracy_unsmoothed = accuracy_score(y_test, y_pred_unsmoothed)
    print(f"Minimal Smoothing Accuracy: {accuracy_unsmoothed:.4f}")
    print("\nMinimal Smoothing Classification Report:")
    print(classification_report(y_test, y_pred_unsmoothed, target_names=target_names))

    # Final comparison summary
    print("\n--- Comparison Summary ---")
    print(f"Accuracy (Standard Laplace Smoothing, alpha=1.0): {accuracy_smoothed:.4f}")
    print(f"Accuracy (Minimal Smoothing, alpha=1e-5):         {accuracy_unsmoothed:.4f}")


if __name__ == "__main__":
    # Ensure scikit-learn and numpy are installed:
    # pip install scikit-learn numpy

    # Run the comparison program
    train_and_compare_mnb_smoothing(categories)
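
`GaussianNB` has no `alpha` parameter. Its closest analogue is `var_smoothing`, which adds a fraction of the largest feature variance to every per-class variance for numerical stability. It is not Laplace smoothing, but it plays a similar stabilizing role. A minimal sketch, assuming the Iris dataset as a stand-in (the dataset choice and the values tried are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 1e-9 is the scikit-learn default; the larger values are arbitrary choices
# meant to show the effect of stronger variance smoothing.
results = {}
for var_smoothing in (1e-9, 1e-2, 1.0):
    model = GaussianNB(var_smoothing=var_smoothing).fit(X_train, y_train)
    results[var_smoothing] = accuracy_score(y_test, model.predict(X_test))
    print(f"var_smoothing={var_smoothing:g}: accuracy={results[var_smoothing]:.4f}")
```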


31.Write a Python program to train an SVM Classifier and use GridSearchCV to tune the hyperparameters (C,
gamma, kernel)

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from scipy.sparse import issparse

# --- Configuration for Binary Classification ---
# We use only two categories for a clear binary classification problem.
categories = [
    'rec.autos',
    'comp.graphics',
]
# Use 80% for training and 20% for testing
TEST_SIZE = 0.2

def train_and_tune_svm(categories_to_use):
    """
    Loads the 20 Newsgroups dataset, extracts TF-IDF features, and uses
    GridSearchCV to find the best hyperparameters for an SVM Classifier.
    """
    print(f"--- Loading data for categories: {categories_to_use} (Binary Classification) ---")

    # 1. Load the 20 Newsgroups dataset
    newsgroups_data = fetch_20newsgroups(
        subset='all',
        categories=categories_to_use,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )

    X, y = newsgroups_data.data, newsgroups_data.target
    target_names = newsgroups_data.target_names

    # 2. Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=42
    )

    print(f"\nTraining samples: {len(X_train)}")
    print(f"Testing samples: {len(X_test)}")

    # 3. Feature Extraction (TF-IDF Vectorization)
    vectorizer = TfidfVectorizer(
        stop_words='english',
        lowercase=True,
        max_df=0.5
    )

    # Fit and transform the data
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    print(f"Feature space size (vocabulary): {X_train_vec.shape[1]}")

    # Convert sparse matrices to dense arrays. This is optional: SVC accepts sparse
    # input for every kernel, including 'rbf'; dense arrays simply cost more memory.
    if issparse(X_train_vec):
        X_train_dense = X_train_vec.toarray()
        X_test_dense = X_test_vec.toarray()
    else:
        X_train_dense = X_train_vec
        X_test_dense = X_test_vec

    # 4. Define Hyperparameter Grid for SVM
    param_grid = [
      # Linear kernel: tune C (regularization parameter)
      {'C': [0.1, 1, 10], 'kernel': ['linear']},
      # RBF kernel: tune C and gamma (kernel coefficient)
      {'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1], 'kernel': ['rbf']},
     ]

    # 5. Initialize and Run GridSearchCV
    print("\n--- Starting GridSearchCV for SVM (may take a few minutes) ---")

    # GridSearchCV performs 3-fold cross-validation (cv=3) for each parameter combination
    grid_search = GridSearchCV(
        SVC(random_state=42),
        param_grid,
        cv=3,
        scoring='accuracy',
        verbose=1,
        n_jobs=-1 # Use all available cores for parallel processing
    )

    grid_search.fit(X_train_dense, y_train)

    print("Grid Search complete.")

    # 6. Evaluate the Best Estimator

    # Get the best model determined by the grid search
    best_svm = grid_search.best_estimator_

    # Print the tuning results
    print("\n--- Hyperparameter Tuning Results ---")
    print(f"Best Parameters found on training set: {grid_search.best_params_}")
    print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")

    # Make predictions on the held-out test set using the best model
    y_pred = best_svm.predict(X_test_dense)

    # 7. Print Final Test Set Performance
    test_accuracy = accuracy_score(y_test, y_pred)
    print(f"\n--- Final Test Set Evaluation using Best Model ---")
    print(f"Test Accuracy Score: {test_accuracy:.4f}")
    print("\nClassification Report on Test Set:")
    print(classification_report(y_test, y_pred, target_names=target_names))

if __name__ == "__main__":
    # Ensure scikit-learn and numpy are installed:
    # pip install scikit-learn numpy

    # Run the SVM tuning program
    train_and_tune_svm(categories)


--- Loading data for categories: ['rec.autos', 'comp.graphics'] (Binary Classification) ---

Training samples: 1570
Testing samples: 393
Feature space size (vocabulary): 18309

--- Starting GridSearchCV for SVM (may take a few minutes) ---
Fitting 3 folds for each of 12 candidates, totalling 36 fits
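
Fitting the vectorizer once on the full training set means each cross-validation fold sees vocabulary and IDF statistics computed partly from its own validation documents. One way to avoid this, sketched below under stated assumptions, is to put the vectorizer and the classifier in a `Pipeline` and tune them together. The corpus is made up for illustration, and the step names `tfidf` and `svc` are arbitrary labels chosen here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus, for illustration only.
docs = [
    "car engine oil change",
    "brakes and tires for the car",
    "engine noise when driving",
    "best tires for winter driving",
    "rendering polygons with shading",
    "image compression and graphics formats",
    "ray tracing graphics rendering",
    "image shading with polygons",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [1, 10], "svc__gamma": [0.01, 0.1]},
]

# The vectorizer is refitted on the training part of every fold.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
```

The fitted `search` object can then be used directly on raw text, for example `search.predict(["new tires for my car"])`.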


32.Write a Python program to train an SVM Classifier on an imbalanced dataset, apply class weighting, and
check whether it improves accuracy.

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from scipy.sparse import issparse

# --- Configuration for Binary Classification ---
# We use two categories that are generally well-separated but will be artificially
# made imbalanced in the training set for demonstration.
categories = [
    'rec.autos',      # Class 0
    'comp.graphics',  # Class 1
]
# Use 80% for training and 20% for testing
TEST_SIZE = 0.2
MINORITY_SAMPLES = 50 # Artificially restrict one class in the training set to this size

def train_and_compare_class_weighting(categories_to_use):
    """
    Loads the 20 Newsgroups dataset, creates a synthetic imbalance in the
    training data, and compares the performance of unweighted vs. class-weighted SVM.
    """
    print(f"--- Loading data for categories: {categories_to_use} (Binary Classification) ---")

    # 1. Load the 20 Newsgroups dataset
    newsgroups_data = fetch_20newsgroups(
        subset='all',
        categories=categories_to_use,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )

    X, y = newsgroups_data.data, newsgroups_data.target
    target_names = newsgroups_data.target_names

    # 2. Split the data into initial training and testing sets
    # The test set remains representative of the original, balanced distribution.
    X_train_raw, X_test, y_train_raw, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=42
    )

    # 3. Create Synthetic Imbalance in the Training Set
    class_0_indices = np.where(y_train_raw == 0)[0]
    class_1_indices = np.where(y_train_raw == 1)[0]

    # Identify minority and majority for the demonstration
    if len(class_0_indices) < len(class_1_indices):
        minority_indices = class_0_indices
        majority_indices = class_1_indices
        minority_label_idx = 0
    else:
        minority_indices = class_1_indices
        majority_indices = class_0_indices
        minority_label_idx = 1

    minority_class_name = target_names[minority_label_idx]

    # Subsample the minority class to create the imbalance:
    # we keep the full majority class and heavily restrict the minority class
    # in the training set.
    np.random.seed(42)
    np.random.shuffle(minority_indices)

    # Restrict the minority class to MINORITY_SAMPLES
    minority_indices_kept = minority_indices[:MINORITY_SAMPLES]

    # Combine the restricted minority and the full majority to form the imbalanced training set
    imbalanced_train_indices = np.concatenate([minority_indices_kept, majority_indices])

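    # Index the list of documents directly; np.array on long strings pads every entry to the longest one.
    X_train = [X_train_raw[i] for i in imbalanced_train_indices]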
    y_train = y_train_raw[imbalanced_train_indices]

    print(f"\nTraining set samples (Imbalanced): {len(X_train)}")
    print(f"  Class '{target_names[0]}' count: {np.sum(y_train == 0)}")
    print(f"  Class '{target_names[1]}' count: {np.sum(y_train == 1)}")
    print(f"  Minority Class (Heavily Undersampled): {minority_class_name}")
    print(f"Testing set samples (Original Distribution): {len(X_test)}")

    # 4. Feature Extraction (TF-IDF Vectorization)
    vectorizer = TfidfVectorizer(
        stop_words='english',
        lowercase=True,
        max_df=0.5
    )

    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Convert sparse to dense for SVC, as done previously
    if issparse(X_train_vec):
        X_train_dense = X_train_vec.toarray()
        X_test_dense = X_test_vec.toarray()
    else:
        X_train_dense = X_train_vec
        X_test_dense = X_test_vec

    # --- SCENARIO 1: SVM on Imbalanced Data (UNWEIGHTED) ---
    print("\n--- SCENARIO 1: SVM (Unweighted, class_weight=None) ---")

    svm_unweighted = SVC(kernel='linear', C=1.0, random_state=42, class_weight=None)
    svm_unweighted.fit(X_train_dense, y_train)
    y_pred_unweighted = svm_unweighted.predict(X_test_dense)

    print(f"Accuracy (Unweighted): {accuracy_score(y_test, y_pred_unweighted):.4f}")
    print("\nClassification Report (Unweighted):")
    print(classification_report(y_test, y_pred_unweighted, target_names=target_names))

    # --- SCENARIO 2: SVM on Imbalanced Data (CLASS WEIGHTED) ---
    print("\n--- SCENARIO 2: SVM (Class-Weighted, class_weight='balanced') ---")

    # 'balanced' automatically adjusts weights inversely proportional to class frequencies
    # in the input data, giving more importance to the minority class.
    svm_weighted = SVC(kernel='linear', C=1.0, random_state=42, class_weight='balanced')
    svm_weighted.fit(X_train_dense, y_train)
    y_pred_weighted = svm_weighted.predict(X_test_dense)

    print(f"Accuracy (Weighted): {accuracy_score(y_test, y_pred_weighted):.4f}")
    print("\nClassification Report (Weighted):")
    print(classification_report(y_test, y_pred_weighted, target_names=target_names))

    # Final comparison summary focusing on the minority class
    print("\n--- Comparison Summary (Focus on Minority Class Metrics) ---")

    # Extract the minority-class metrics from each classification report (returned as a dict)
    minority_report_unweighted = classification_report(y_test, y_pred_unweighted, target_names=target_names, output_dict=True)[minority_class_name]
    minority_report_weighted = classification_report(y_test, y_pred_weighted, target_names=target_names, output_dict=True)[minority_class_name]

    print(f"Minority Class: {minority_class_name}")
    print(f"Unweighted Recall: {minority_report_unweighted['recall']:.4f} | Weighted Recall: {minority_report_weighted['recall']:.4f}")
    print(f"Unweighted F1-Score: {minority_report_unweighted['f1-score']:.4f} | Weighted F1-Score: {minority_report_weighted['f1-score']:.4f}")


if __name__ == "__main__":
    # Ensure scikit-learn and numpy are installed:
    # pip install scikit-learn numpy

    # Run the SVM class weighting comparison program
    train_and_compare_class_weighting(categories)
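
The `'balanced'` setting weights each class by `n_samples / (n_classes * count_of_class)`, and `SVC` multiplies `C` by that weight for samples of the class. A minimal sketch of the arithmetic, assuming a hypothetical label vector with 50 minority and 800 majority samples (the counts are illustrative, not the ones produced by the program above):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 50 samples of class 0, 800 samples of class 1.
y = np.array([0] * 50 + [1] * 800)
classes = np.unique(y)

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same formula written out by hand.
manual = len(y) / (len(classes) * np.bincount(y))

for cls, weight in zip(classes, weights):
    print(f"class {cls}: weight = {weight:.5f}")
```

With these counts the minority class gets a weight of 8.5 and the majority class 0.53125, so a misclassified minority sample is penalized sixteen times more heavily.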

33.Write a Python program to implement a Naïve Bayes classifier for spam detection using email data

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer # Using CountVectorizer for Naive Bayes
from sklearn.naive_bayes import MultinomialNB # Using Multinomial Naive Bayes
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# --- Configuration for Binary Classification (Simulating Spam/Ham) ---
# We use two distinct newsgroup categories to simulate the content difference
# between two classes, like Spam and Legitimate (Ham) emails.
categories = [
    'alt.atheism',      # Simulating one class (e.g., "Spam")
    'comp.graphics',    # Simulating the other class (e.g., "Ham")
]
# Use 80% for training and 20% for testing
TEST_SIZE = 0.2

def train_naive_bayes_spam_classifier(categories_to_use):
    """
    Loads a subset of the 20 Newsgroups dataset, uses CountVectorizer for features,
    and trains a Multinomial Naive Bayes classifier for binary classification
    (simulated spam detection).
    """
    print(f"--- Loading data for categories: {categories_to_use} (Simulated Spam/Ham) ---")

    # 1. Load the 20 Newsgroups dataset
    newsgroups_data = fetch_20newsgroups(
        subset='all',
        categories=categories_to_use,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )

    X, y = newsgroups_data.data, newsgroups_data.target
    target_names = newsgroups_data.target_names

    # Rename targets to reflect Spam/Ham analogy
    target_names_map = {0: 'Class_A (alt.atheism)', 1: 'Class_B (comp.graphics)'}
    mapped_target_names = [target_names_map[i] for i in range(len(target_names))]

    print(f"Total samples loaded: {len(X)}")
    print(f"Classes: {mapped_target_names}")

    # 2. Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=42
    )

    print(f"\nTraining samples: {len(X_train)}")
    print(f"Testing samples: {len(X_test)}")

    # 3. Feature Extraction (Count Vectorization)
    # Naive Bayes generally performs well with raw word counts.
    vectorizer = CountVectorizer(
        stop_words='english',
        lowercase=True,
        max_df=0.7 # Ignore terms that appear in more than 70% of the documents
    )

    # Fit the vectorizer on the training data and transform both sets
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    print(f"Feature space size (vocabulary): {X_train_vec.shape[1]}")

    # 4. Train the Multinomial Naive Bayes Model
    # MNB uses Additive (Laplace) Smoothing by default (alpha=1.0) for stability.
    print("\n--- Training Multinomial Naive Bayes Classifier ---")

    mnb_classifier = MultinomialNB(alpha=1.0)
    mnb_classifier.fit(X_train_vec, y_train)

    # 5. Evaluate the Classifier
    y_pred = mnb_classifier.predict(X_test_vec)

    test_accuracy = accuracy_score(y_test, y_pred)
    print(f"\nFinal Test Accuracy: {test_accuracy:.4f}")

    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=mapped_target_names))


if __name__ == "__main__":
    # Ensure scikit-learn and numpy are installed:
    # pip install scikit-learn numpy

    # Run the Naive Bayes spam detection program
    train_naive_bayes_spam_classifier(categories)
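
Once trained, the same kind of model can score unseen messages. The sketch below assumes a handful of made-up messages with `spam`/`ham` labels (not a real email dataset) and wraps the vectorizer and classifier in a pipeline so that raw text can be passed to `predict` and `predict_proba`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical messages, made up for illustration; not a real email dataset.
messages = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch on monday with the team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Word counts feed Multinomial Naive Bayes with Laplace smoothing (alpha=1.0).
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
spam_filter.fit(messages, labels)

new_messages = ["claim your free cash prize", "agenda for the team meeting"]
predictions = spam_filter.predict(new_messages)
probabilities = spam_filter.predict_proba(new_messages)

# Columns of predict_proba follow the order of spam_filter.classes_.
for text, label, probs in zip(new_messages, predictions, probabilities):
    print(f"{text!r} -> {label} {dict(zip(spam_filter.classes_, probs.round(3)))}")
```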

34.Write a Python program to train an SVM Classifier and a Naïve Bayes Classifier on the same dataset and
compare their accuracy.

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# --- 1. Load the Dataset ---
# We will use the Iris dataset, a classic dataset for classification tasks.
X, y = load_iris(return_X_y=True)

# --- 2. Split the Data ---
# Split the data into training (70%) and testing (30%) sets.
# We set a random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# --- 3. Initialize and Train Classifiers ---

# A. Support Vector Machine (SVC)
# We use a radial basis function (rbf) kernel, a common choice for non-linear data.
svm_classifier = SVC(kernel='rbf', random_state=42)
print("Training SVM Classifier...")
svm_classifier.fit(X_train, y_train)

# B. Gaussian Naïve Bayes (GNB)
# GaussianNB is suitable for continuous data, assuming features follow a Gaussian distribution.
gnb_classifier = GaussianNB()
print("Training Gaussian Naïve Bayes Classifier...")
gnb_classifier.fit(X_train, y_train)

# --- 4. Predict and Evaluate ---

# A. SVM Prediction and Accuracy
svm_predictions = svm_classifier.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)

# B. Naïve Bayes Prediction and Accuracy
gnb_predictions = gnb_classifier.predict(X_test)
gnb_accuracy = accuracy_score(y_test, gnb_predictions)


# --- 5. Print Comparison Results ---
print("\n" + "="*50)
print("CLASSIFIER ACCURACY COMPARISON")
print("="*50)
print(f"Dataset: Iris (30% Test Set)")
print("-" * 50)
print(f"| {'Model':<30} | {'Accuracy':<15} |")
print("-" * 50)
print(f"| {'Support Vector Machine (SVC)':<30} | {svm_accuracy:.4f} ({(svm_accuracy*100):.2f}%) |")
print(f"| {'Gaussian Naïve Bayes (GNB)':<30} | {gnb_accuracy:.4f} ({(gnb_accuracy*100):.2f}%) |")
print("-" * 50)

# Provide a brief analysis
if svm_accuracy > gnb_accuracy:
    print(f"\nConclusion: SVM outperformed Naïve Bayes by {(svm_accuracy - gnb_accuracy):.4f} accuracy points on this dataset.")
elif gnb_accuracy > svm_accuracy:
    print(f"\nConclusion: Naïve Bayes outperformed SVM by {(gnb_accuracy - svm_accuracy):.4f} accuracy points on this dataset.")
else:
    print("\nConclusion: Both classifiers achieved the exact same accuracy score.")

print("="*50)
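
A single 70/30 split on a 150-sample dataset gives a noisy estimate, so a difference of one or two test samples can flip the conclusion. A minimal sketch of a steadier comparison using 5-fold stratified cross-validation (the fold count and seed are arbitrary choices made here):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The same folds are reused for both models so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {"SVC (rbf)": SVC(kernel="rbf"), "GaussianNB": GaussianNB()}
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```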

35.Write a Python program to perform feature selection before training a Naïve Bayes classifier and compare
results.

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, f_classif # NEW: Imports for feature selection

# --- 1. Load the Dataset ---
# We will use the Iris dataset, a classic dataset for classification tasks.
X, y = load_iris(return_X_y=True)

# --- 2. Split the Data ---
# Split the data into training (70%) and testing (30%) sets.
# We set a random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# --- 3. Feature Selection ---
# Use SelectKBest with f_classif (ANOVA F-value) to find the best 2 features (out of 4).
k_best_features = 2
selector = SelectKBest(f_classif, k=k_best_features)
selector.fit(X_train, y_train)

# Transform the training and test sets using the selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

print(f"Feature Selection: Selected {k_best_features} features based on f_classif.")
# Print the names of the selected features for clarity
feature_names = load_iris().feature_names
selected_indices = selector.get_support(indices=True)
print(f"Selected feature names: {[feature_names[i] for i in selected_indices]}")


# --- 4. Initialize and Train Classifiers ---

# A. Support Vector Machine (SVC) - Trained on ALL features
# We use a radial basis function (rbf) kernel, a common choice for non-linear data.
svm_classifier = SVC(kernel='rbf', random_state=42)
print("\nTraining SVM Classifier (on ALL features)...")
svm_classifier.fit(X_train, y_train)

# B. Gaussian Naïve Bayes (GNB) - Trained on ALL features
# GaussianNB is suitable for continuous data, assuming features follow a Gaussian distribution.
gnb_classifier = GaussianNB()
print("Training Gaussian Naïve Bayes Classifier (on ALL features)...")
gnb_classifier.fit(X_train, y_train)

# C. Gaussian Naïve Bayes (GNB) - Trained on SELECTED features (NEW MODEL)
gnb_selected_classifier = GaussianNB()
print(f"Training GNB Classifier (on SELECTED {k_best_features} features)...")
gnb_selected_classifier.fit(X_train_selected, y_train)


# --- 5. Predict and Evaluate ---

# A. SVM Prediction and Accuracy (on ALL features)
svm_predictions = svm_classifier.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)

# B. Naïve Bayes Prediction and Accuracy (on ALL features)
gnb_predictions = gnb_classifier.predict(X_test)
gnb_accuracy = accuracy_score(y_test, gnb_predictions)

# C. Naïve Bayes Prediction and Accuracy (on SELECTED features)
gnb_selected_predictions = gnb_selected_classifier.predict(X_test_selected)
gnb_selected_accuracy = accuracy_score(y_test, gnb_selected_predictions)


# --- 6. Print Comparison Results --- (Updated to include all three models)
print("\n" + "="*70)
print("CLASSIFIER ACCURACY COMPARISON WITH FEATURE SELECTION")
print("="*70)
print(f"Dataset: Iris (30% Test Set, Random State 42)")
print(f"Feature Selection Method: SelectKBest (k={k_best_features}) with f_classif")
print("-" * 70)
print(f"| {'Model':<50} | {'Accuracy':<15} |")
print("-" * 70)
print(f"| {'Support Vector Machine (SVC) - All Features':<50} | {svm_accuracy:.4f} ({(svm_accuracy*100):.2f}%) |")
print(f"| {'Gaussian Naïve Bayes (GNB) - All Features':<50} | {gnb_accuracy:.4f} ({(gnb_accuracy*100):.2f}%) |")
print(f"| {'GNB - Selected Features (k=2)':<50} | {gnb_selected_accuracy:.4f} ({(gnb_selected_accuracy*100):.2f}%) |")
print("-" * 70)

# Provide a brief analysis (focusing on GNB comparison)
print("\nAnalysis of Naïve Bayes Models:")

if gnb_selected_accuracy > gnb_accuracy:
    print(f"- Feature selection IMPROVED GNB accuracy by {(gnb_selected_accuracy - gnb_accuracy):.4f}.")
elif gnb_accuracy > gnb_selected_accuracy:
    print(f"- Feature selection DECREASED GNB accuracy by {(gnb_accuracy - gnb_selected_accuracy):.4f}.")
else:
    print("- Feature selection resulted in the same GNB accuracy.")

# Overall winner
accuracies = {
    'SVC - All Features': svm_accuracy,
    'GNB - All Features': gnb_accuracy,
    'GNB - Selected Features (k=2)': gnb_selected_accuracy
}
best_model = max(accuracies, key=accuracies.get)
best_accuracy = accuracies[best_model]

print(f"\nOverall Best Model: {best_model} with an accuracy of {best_accuracy:.4f}.")
print("="*70)
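
The value `k=2` above is fixed by hand. A minimal sketch of comparing several values of `k`, with the selector placed inside a pipeline so that it is refitted on the training part of each cross-validation fold (the range of `k` and the use of 5 folds are assumptions made for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

accuracy_by_k = {}
for k in (1, 2, 3, 4):
    # Selection and classification are fitted together on each training fold.
    pipeline = make_pipeline(SelectKBest(f_classif, k=k), GaussianNB())
    accuracy_by_k[k] = cross_val_score(pipeline, X, y, cv=5).mean()
    print(f"k={k}: mean CV accuracy = {accuracy_by_k[k]:.4f}")

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(f"Best k by cross-validation: {best_k}")
```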

36.Write a Python program to train an SVM Classifier using One-vs-Rest (OvR) and One-vs-One (OvO)
strategies on the Wine dataset and compare their accuracy.

In [None]:
import numpy as np
from sklearn.datasets import load_wine # Changed from load_iris to load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier # NEW: Import for explicit One-vs-Rest strategy
from sklearn.metrics import accuracy_score

# --- 1. Load the Dataset ---
# We will use the Wine dataset, which has 3 classes and 13 features.
X, y = load_wine(return_X_y=True)

# --- 2. Split the Data ---
# Split the data into training (70%) and testing (30%) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# --- 3. Initialize and Train SVM Classifiers ---

# The underlying SVC model used for both strategies
base_svc = SVC(kernel='rbf', random_state=42)

# A. One-vs-One (OvO) Strategy
# SVC always trains one-vs-one classifiers internally for multiclass problems.
# decision_function_shape='ovo' only changes the shape of the decision_function output
# (the default is 'ovr'); it is set here to make the strategy explicit.
svc_ovo = SVC(kernel='rbf', decision_function_shape='ovo', random_state=42)
print("Training SVM Classifier with One-vs-One (OvO) strategy...")
svc_ovo.fit(X_train, y_train)

# B. One-vs-Rest (OvR) Strategy
# We use the OneVsRestClassifier meta-estimator to explicitly enforce the OvR strategy.
svc_ovr = OneVsRestClassifier(base_svc)
print("Training SVM Classifier with One-vs-Rest (OvR) strategy...")
svc_ovr.fit(X_train, y_train)


# --- 4. Predict and Evaluate ---

# A. OvO Prediction and Accuracy
ovo_predictions = svc_ovo.predict(X_test)
ovo_accuracy = accuracy_score(y_test, ovo_predictions)

# B. OvR Prediction and Accuracy
ovr_predictions = svc_ovr.predict(X_test)
ovr_accuracy = accuracy_score(y_test, ovr_predictions)


# --- 5. Print Comparison Results ---
print("\n" + "="*70)
print("SVM MULTI-CLASS STRATEGY ACCURACY COMPARISON (WINE DATASET)")
print("="*70)
print(f"Dataset: Wine (30% Test Set, Random State 42)")
print("-" * 70)
print(f"| {'Strategy':<50} | {'Accuracy':<15} |")
print("-" * 70)
print(f"| {'Support Vector Machine (OvO - SVC default)':<50} | {ovo_accuracy:.4f} ({(ovo_accuracy*100):.2f}%) |")
print(f"| {'Support Vector Machine (OvR - OneVsRestClassifier)':<50} | {ovr_accuracy:.4f} ({(ovr_accuracy*100):.2f}%) |")
print("-" * 70)

# Provide a brief analysis
if ovo_accuracy > ovr_accuracy:
    print(f"\nConclusion: The One-vs-One (OvO) strategy performed better on the Wine dataset by {(ovo_accuracy - ovr_accuracy):.4f} accuracy points.")
elif ovr_accuracy > ovo_accuracy:
    print(f"\nConclusion: The One-vs-Rest (OvR) strategy performed better on the Wine dataset by {(ovr_accuracy - ovo_accuracy):.4f} accuracy points.")
else:
    print("\nConclusion: Both OvO and OvR strategies resulted in the exact same accuracy score on the Wine dataset.")

print("="*70)
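
With three classes, OvO and OvR both train three binary classifiers, so the structural difference is hard to see on Wine. The sketch below uses the 10-class Digits dataset instead (an assumption made purely for illustration) together with the explicit `OneVsOneClassifier` and `OneVsRestClassifier` wrappers, and counts the fitted binary models: OvO trains n(n-1)/2 of them, OvR trains n.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
n_classes = len(set(y))

# Each wrapper fits one binary SVC per pair (OvO) or per class (OvR).
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

print(f"Classes: {n_classes}")
print(f"OvO binary classifiers: {len(ovo.estimators_)}")
print(f"OvR binary classifiers: {len(ovr.estimators_)}")
```

For 10 classes this gives 45 pairwise models for OvO and 10 models for OvR. Each OvO model sees only the samples of its two classes, while each OvR model is trained on the full dataset.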


37.Write a Python program to train an SVM Classifier using Linear, Polynomial, and RBF kernels on the Breast
Cancer dataset and compare their accuracy.

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer # NEW: Switched to Breast Cancer dataset
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler # NEW: Added for feature scaling, critical for SVM
from sklearn.pipeline import make_pipeline # NEW: For chaining scaling and SVC
from sklearn.metrics import accuracy_score

# --- 1. Load the Dataset ---
# We will use the Breast Cancer dataset (binary classification).
X, y = load_breast_cancer(return_X_y=True)

# --- 2. Split the Data ---
# Split the data into training (70%) and testing (30%) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# --- 3. Initialize and Train SVM Classifiers with Different Kernels ---
# SVM performance is highly dependent on feature scaling, so we use a pipeline
# to standardize the data before training each classifier.

# A. Linear Kernel SVM
# Good for linearly separable data.
pipeline_linear = make_pipeline(
    StandardScaler(),
    SVC(kernel='linear', C=1.0, random_state=42)
)
print("Training SVM Classifier with Linear Kernel...")
pipeline_linear.fit(X_train, y_train)

# B. Polynomial Kernel SVM
# Good for non-linear data; 'degree' controls the complexity of the non-linearity.
pipeline_poly = make_pipeline(
    StandardScaler(),
    SVC(kernel='poly', degree=3, C=1.0, random_state=42)
)
print("Training SVM Classifier with Polynomial (Degree 3) Kernel...")
pipeline_poly.fit(X_train, y_train)

# C. RBF (Radial Basis Function) Kernel SVM
# The most common choice, good for highly non-linear data.
pipeline_rbf = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', gamma='scale', C=1.0, random_state=42)
)
print("Training SVM Classifier with RBF Kernel...")
pipeline_rbf.fit(X_train, y_train)


# --- 4. Predict and Evaluate ---

# A. Linear Kernel Accuracy
linear_predictions = pipeline_linear.predict(X_test)
linear_accuracy = accuracy_score(y_test, linear_predictions)

# B. Polynomial Kernel Accuracy
poly_predictions = pipeline_poly.predict(X_test)
poly_accuracy = accuracy_score(y_test, poly_predictions)

# C. RBF Kernel Accuracy
rbf_predictions = pipeline_rbf.predict(X_test)
rbf_accuracy = accuracy_score(y_test, rbf_predictions)


# --- 5. Print Comparison Results ---
print("\n" + "="*80)
print("SVM KERNEL ACCURACY COMPARISON (BREAST CANCER DATASET)")
print("="*80)
print(f"Dataset: Breast Cancer (30% Test Set, Random State 42)")
print(f"Note: All models include feature scaling (StandardScaler).")
print("-" * 80)
print(f"| {'Kernel Type':<60} | {'Accuracy':<15} |")
print("-" * 80)
print(f"| {'1. Linear Kernel':<60} | {linear_accuracy:.4f} ({(linear_accuracy*100):.2f}%) |")
print(f"| {'2. Polynomial Kernel (Degree 3)':<60} | {poly_accuracy:.4f} ({(poly_accuracy*100):.2f}%) |")
print(f"| {'3. RBF (Radial Basis Function) Kernel':<60} | {rbf_accuracy:.4f} ({(rbf_accuracy*100):.2f}%) |")
print("-" * 80)

# Identify the best performing kernel
accuracies = {
    'Linear': linear_accuracy,
    'Polynomial (Degree 3)': poly_accuracy,
    'RBF': rbf_accuracy
}
best_kernel = max(accuracies, key=accuracies.get)
best_accuracy = accuracies[best_kernel]

print(f"\nOverall Best Kernel: {best_kernel} with an accuracy of {best_accuracy:.4f}.")
print("="*80)
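
A single 70/30 split can make the ranking of kernels depend on which samples land in the test set. As a hedged alternative, the sketch below repeats the comparison with 5-fold cross-validation in a loop. The names `X_bc`, `y_bc` and `kernel_cv_means` are illustrative, and the hyperparameters are left at the same defaults as above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)

kernel_cv_means = {}
for kernel in ('linear', 'poly', 'rbf'):
    # Scaling lives inside the pipeline, so it is re-fitted on each training fold
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    scores = cross_val_score(model, X_bc, y_bc, cv=5, scoring='accuracy')
    kernel_cv_means[kernel] = scores.mean()

for kernel, acc in kernel_cv_means.items():
    print(f"{kernel:<8} mean CV accuracy: {acc:.4f}")
```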

38. Write a Python program to train an SVM Classifier using Stratified K-Fold Cross-Validation and compute the
average accuracy.

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score # Cross-validation utilities
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from statistics import mean, stdev # For calculation of mean and std deviation

# --- 1. Load the Dataset ---
# We will use the Breast Cancer dataset (binary classification).
X, y = load_breast_cancer(return_X_y=True)

# Define the number of splits for cross-validation
N_SPLITS = 5

# --- 2. Setup the Model Pipeline ---
# SVM performance is highly dependent on feature scaling, so we use a pipeline
# to standardize the data before training the RBF kernel classifier.

# We use the RBF (Radial Basis Function) Kernel, the most common choice.
# The 'gamma' and 'C' parameters are left at default for simplicity.
model_pipeline = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', gamma='scale', C=1.0, random_state=42)
)

# --- 3. Setup Stratified K-Fold Cross-Validation ---
# StratifiedKFold ensures that each fold has the same proportion of class labels (malignant/benign)
# as the full dataset, which is important for binary classification problems.
cv_strategy = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

print(f"Model: RBF Kernel SVM with StandardScaler")
print(f"Validation Strategy: Stratified {N_SPLITS}-Fold Cross-Validation (Shuffle=True)")
print("Starting cross-validation...")

# --- 4. Evaluate the Model using Cross-Validation ---
# The cross_val_score function automatically trains and evaluates the model
# N_SPLITS times, fitting the pipeline on the training folds and scoring on the test fold.
cv_scores = cross_val_score(
    model_pipeline,
    X,
    y,
    cv=cv_strategy,
    scoring='accuracy',
    n_jobs=-1 # Use all available processors for faster computation
)

# Calculate the statistics
mean_accuracy = mean(cv_scores)
std_dev = stdev(cv_scores)


# --- 5. Print Comparison Results ---
print("\n" + "="*80)
print("SVM RBF KERNEL ACCURACY WITH STRATIFIED CROSS-VALIDATION")
print("="*80)
print(f"Dataset: Breast Cancer (Total Samples: {X.shape[0]})")
print(f"Cross-Validation Folds: {N_SPLITS}")
print("-" * 80)
print(f"Individual Fold Accuracy Scores:")
# Print scores in a formatted list
for i, score in enumerate(cv_scores):
    print(f"  Fold {i+1}: {score:.4f} ({(score*100):.2f}%)")
print("-" * 80)
print(f"| {'Metric':<30} | {'Value':<15} |")
print("-" * 80)
print(f"| {'Mean Cross-Validation Accuracy':<30} | {mean_accuracy:.4f} ({(mean_accuracy*100):.2f}%) |")
# Report mean +/- 2 * standard deviation as a rough indicator of fold-to-fold spread.
# With only N_SPLITS scores this is not a formal 95% confidence interval.
print(f"| {'Accuracy (Mean +/- 2*Std Dev)':<30} | {mean_accuracy:.4f} +/- {(std_dev*2):.4f} |")
print("-" * 80)

print("\nConclusion: The average accuracy across all folds provides a robust estimate of the model's performance on unseen data.")
print("="*80)
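
To see what "stratified" means in practice, the following sketch (an illustration, not part of the original solution; `fold_rates` and `overall_rate` are assumed names) checks the share of class 1 in each test fold against the share in the full dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X_bc, y_bc = load_breast_cancer(return_X_y=True)
overall_rate = y_bc.mean()  # fraction of samples labelled 1 (benign)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Fraction of class 1 inside each test fold
fold_rates = [y_bc[test_idx].mean() for _, test_idx in skf.split(X_bc, y_bc)]

print(f"Overall class-1 rate: {overall_rate:.3f}")
for i, rate in enumerate(fold_rates, start=1):
    print(f"  Fold {i} class-1 rate: {rate:.3f}")
```

Each fold's rate should sit very close to the overall rate, which is the property that plain `KFold` does not guarantee.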

39. Write a Python program to train a Naïve Bayes classifier using different prior probabilities and compare
performance.

40. Write a Python program to perform Recursive Feature Elimination (RFE) before training an SVM Classifier and
compare accuracy.

41. Write a Python program to train an SVM Classifier and evaluate its performance using Precision, Recall, and
F1-Score instead of accuracy.

42. Write a Python program to train a Naïve Bayes Classifier and evaluate its performance using Log Loss
(Cross-Entropy Loss).

43. Write a Python program to train an SVM Classifier and visualize the Confusion Matrix using seaborn.

44. Write a Python program to train an SVM Regressor (SVR) and evaluate its performance using Mean Absolute
Error (MAE) instead of MSE.

45. Write a Python program to train a Naïve Bayes classifier and evaluate its performance using the ROC-AUC
score.

46. Write a Python program to train an SVM Classifier and visualize the Precision-Recall Curve.

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes classifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from statistics import mean, stdev

# --- 1. Load the Dataset ---
# We use the Breast Cancer dataset (binary classification: 0=malignant, 1=benign).
X, y = load_breast_cancer(return_X_y=True)

# Define the number of splits for cross-validation
N_SPLITS = 5
N_CLASSES = 2

# --- 2. Setup Model Pipelines with Different Priors ---
# Note: StandardScaler is included for consistency with the SVM examples. GaussianNB fits
# one Gaussian per feature and class, so scaling matters far less than for an SVM; any
# difference in results comes mainly through the var_smoothing term.

# A. Naive Bayes with Default Priors (Priors are estimated from class frequencies in training folds)
pipeline_default_priors = make_pipeline(
    StandardScaler(),
    GaussianNB(priors=None)
)

# B. Naive Bayes with Custom (Uniform) Priors (Manually set to 50/50, ignoring actual class imbalance)
# The custom priors list MUST sum to 1.0. Order corresponds to class labels (0, 1).
custom_priors = [0.5, 0.5]
pipeline_custom_priors = make_pipeline(
    StandardScaler(),
    GaussianNB(priors=custom_priors)
)

# --- 3. Setup Stratified K-Fold Cross-Validation ---
cv_strategy = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

print(f"Dataset: Breast Cancer (Total Samples: {X.shape[0]})")
print(f"Validation Strategy: Stratified {N_SPLITS}-Fold Cross-Validation (Shuffle=True)")
print("Starting cross-validation for both Naïve Bayes models...")

# --- 4. Evaluate Models using Cross-Validation ---

# Evaluate model with default priors
scores_default = cross_val_score(
    pipeline_default_priors,
    X,
    y,
    cv=cv_strategy,
    scoring='accuracy',
    n_jobs=-1
)

# Evaluate model with custom priors
scores_custom = cross_val_score(
    pipeline_custom_priors,
    X,
    y,
    cv=cv_strategy,
    scoring='accuracy',
    n_jobs=-1
)

# Calculate statistics
mean_default = mean(scores_default)
std_default = stdev(scores_default)

mean_custom = mean(scores_custom)
std_custom = stdev(scores_custom)


# --- 5. Print Comparison Results ---
print("\n" + "="*80)
print("NAÏVE BAYES ACCURACY COMPARISON: DEFAULT vs. CUSTOM PRIORS")
print("="*80)
print("-" * 80)
print(f"| {'Model Configuration':<45} | {'Mean Accuracy':<15} | {'Std Dev':<10} |")
print("-" * 80)
print(f"| {'1. Default Priors (Estimated from Data)':<45} | {mean_default:.4f} ({(mean_default*100):.2f}%) | {std_default:.4f} |")
print(f"| {'2. Custom Priors (Uniform [0.5, 0.5])':<45} | {mean_custom:.4f} ({(mean_custom*100):.2f}%) | {std_custom:.4f} |")
print("-" * 80)

# Provide analysis
if mean_default > mean_custom:
    print("\nConclusion: The model using default priors (estimated from the training data) performed better.")
    print("This indicates that incorporating the actual class distribution is important for this dataset.")
elif mean_custom > mean_default:
    print("\nConclusion: The model using uniform custom priors performed better.")
    print("This might suggest that the small difference in class frequency is irrelevant or misleading in this context.")
else:
    print("\nConclusion: Both prior settings resulted in the exact same mean accuracy.")

print("="*80)
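
The effect of the `priors` argument can be inspected directly on a fitted model. The sketch below is illustrative only: the prior `[0.9, 0.1]` is a hypothetical choice made to exaggerate the effect, and predictions are made on the training data purely to count labels, not to estimate performance.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X_bc, y_bc = load_breast_cancer(return_X_y=True)

nb_default = GaussianNB().fit(X_bc, y_bc)
# Hypothetical prior that strongly favours class 0 (malignant)
nb_skewed = GaussianNB(priors=[0.9, 0.1]).fit(X_bc, y_bc)

empirical = np.bincount(y_bc) / len(y_bc)  # class frequencies in the data
print("Default class_prior_:", nb_default.class_prior_)
print("Skewed class_prior_: ", nb_skewed.class_prior_)

# A larger prior for class 0 can only move borderline samples toward class 0
n0_default = int((nb_default.predict(X_bc) == 0).sum())
n0_skewed = int((nb_skewed.predict(X_bc) == 0).sum())
print(f"Predicted as class 0: default={n0_default}, skewed={n0_skewed}")
```

Because the per-class Gaussian parameters do not depend on the priors, only the decision threshold shifts between the two models.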


In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import RFE # Recursive Feature Elimination
from statistics import mean, stdev

# --- 1. Load the Dataset ---
# We use the Breast Cancer dataset (binary classification).
X, y = load_breast_cancer(return_X_y=True)
N_FEATURES = X.shape[1] # Total number of features (30 for Breast Cancer)

# Define the number of splits for cross-validation
N_SPLITS = 5
# Define the number of features to select using RFE (arbitrarily chosen 15 out of 30)
N_FEATURES_TO_SELECT = 15

# --- 2. Setup Base Model and Cross-Validation ---
# Base SVM Classifier (Linear kernel is required for RFE to rank features based on coefficients)
base_svc = SVC(kernel='linear', C=1.0, random_state=42)

# Standard Scaler for preprocessing
scaler = StandardScaler()

# Stratified K-Fold Cross-Validation strategy
cv_strategy = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

# --- 3. Setup Model Pipelines ---

# A. Pipeline with ALL Features (Standard SVC)
pipeline_all_features = make_pipeline(
    scaler,
    base_svc
)

# B. Pipeline with RFE Feature Selection
# RFE wraps the SVC, fits it, ranks features, and recursively eliminates the worst ones
# until N_FEATURES_TO_SELECT remain.
pipeline_rfe_features = make_pipeline(
    scaler,
    RFE(estimator=base_svc, n_features_to_select=N_FEATURES_TO_SELECT),
    base_svc # The RFE output is passed to a final SVC for training/scoring
)

print(f"Dataset: Breast Cancer (Total Samples: {X.shape[0]}, Total Features: {N_FEATURES})")
print(f"Validation Strategy: Stratified {N_SPLITS}-Fold Cross-Validation")
print(f"RFE Target: Selecting {N_FEATURES_TO_SELECT} out of {N_FEATURES} features.")
print("Starting cross-validation for both models...")

# --- 4. Evaluate Models using Cross-Validation ---

# Evaluate model with ALL features
scores_all = cross_val_score(
    pipeline_all_features,
    X,
    y,
    cv=cv_strategy,
    scoring='accuracy',
    n_jobs=-1
)

# Evaluate model with RFE-selected features
scores_rfe = cross_val_score(
    pipeline_rfe_features,
    X,
    y,
    cv=cv_strategy,
    scoring='accuracy',
    n_jobs=-1
)

# Calculate statistics
mean_all = mean(scores_all)
std_all = stdev(scores_all)

mean_rfe = mean(scores_rfe)
std_rfe = stdev(scores_rfe)


# --- 5. Print Comparison Results ---
print("\n" + "="*90)
print("SVM ACCURACY COMPARISON: ALL FEATURES vs. RECURSIVE FEATURE ELIMINATION (RFE)")
print("="*90)
print("-" * 90)
print(f"| {'Model Configuration':<45} | {'Mean Accuracy':<15} | {'Std Dev':<10} | {'Features':<10} |")
print("-" * 90)
print(f"| {'1. Standard SVM (All Features)':<45} | {mean_all:.4f} ({(mean_all*100):.2f}%) | {std_all:.4f} | {N_FEATURES:<10} |")
print(f"| {'2. RFE-Optimized SVM':<45} | {mean_rfe:.4f} ({(mean_rfe*100):.2f}%) | {std_rfe:.4f} | {N_FEATURES_TO_SELECT:<10} |")
print("-" * 90)

# Provide analysis
print("\nAnalysis of Results:")
if mean_rfe > mean_all:
    print(f"-> RFE improved the model performance! By reducing features from {N_FEATURES} to {N_FEATURES_TO_SELECT}, the model became slightly more accurate, potentially by eliminating noisy or redundant features.")
elif mean_all > mean_rfe:
    print(f"-> RFE led to a decrease in mean accuracy. This suggests that some eliminated features were still informative, or that the chosen number of features ({N_FEATURES_TO_SELECT}) was too small.")
else:
    print("-> Both models achieved identical mean accuracy, indicating that feature selection did not have a measurable impact on the generalized accuracy in this case.")
print("="*90)
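
The cross-validated comparison above does not show which features RFE keeps. A minimal sketch for inspecting the selection follows. It fits RFE once on the whole (scaled) dataset, which is fine for inspection but should not be used for performance estimates; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

# A linear kernel exposes coef_, which RFE uses to rank features
rfe = RFE(estimator=SVC(kernel='linear', C=1.0), n_features_to_select=15)
rfe.fit(X_scaled, data.target)

selected = data.feature_names[rfe.support_]  # support_ is a boolean mask of kept features
print(f"Kept {rfe.n_features_} features:")
for name in selected:
    print("  ", name)
```

The `ranking_` attribute gives rank 1 to every kept feature and higher ranks to features eliminated earlier.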

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_validate # cross_validate supports multiple metrics
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import RFE

# --- 1. Load the Dataset ---
# We use the Breast Cancer dataset (binary classification).
X, y = load_breast_cancer(return_X_y=True)
N_FEATURES = X.shape[1] # Total number of features (30 for Breast Cancer)

# Define the number of splits for cross-validation
N_SPLITS = 5
# Define the number of features to select using RFE (arbitrarily chosen 15 out of 30)
N_FEATURES_TO_SELECT = 15
# Metrics for multi-metric evaluation
SCORING_METRICS = ['accuracy', 'precision', 'recall', 'f1']


# --- 2. Setup Base Model and Cross-Validation ---
# Base SVM Classifier (Linear kernel is required for RFE to rank features based on coefficients)
base_svc = SVC(kernel='linear', C=1.0, random_state=42)

# Standard Scaler for preprocessing
scaler = StandardScaler()

# Stratified K-Fold Cross-Validation strategy
cv_strategy = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

# --- 3. Setup Model Pipelines ---

# A. Pipeline with ALL Features (Standard SVC)
pipeline_all_features = make_pipeline(
    scaler,
    base_svc
)

# B. Pipeline with RFE Feature Selection
# RFE wraps the SVC, fits it, ranks features, and recursively eliminates the worst ones
# until N_FEATURES_TO_SELECT remain.
pipeline_rfe_features = make_pipeline(
    scaler,
    RFE(estimator=base_svc, n_features_to_select=N_FEATURES_TO_SELECT),
    base_svc # The RFE output is passed to a final SVC for training/scoring
)

print(f"Dataset: Breast Cancer (Total Samples: {X.shape[0]}, Total Features: {N_FEATURES})")
print(f"Validation Strategy: Stratified {N_SPLITS}-Fold Cross-Validation")
print(f"RFE Target: Selecting {N_FEATURES_TO_SELECT} out of {N_FEATURES} features.")
print("Starting cross-validation for both models using Precision, Recall, and F1-Score...")

# --- 4. Evaluate Models using Cross-Validation ---

# Evaluate model with ALL features
results_all = cross_validate(
    pipeline_all_features,
    X,
    y,
    cv=cv_strategy,
    scoring=SCORING_METRICS, # Use list of metrics
    n_jobs=-1,
    return_train_score=False # Only need test scores
)

# Evaluate model with RFE-selected features
results_rfe = cross_validate(
    pipeline_rfe_features,
    X,
    y,
    cv=cv_strategy,
    scoring=SCORING_METRICS, # Use list of metrics
    n_jobs=-1,
    return_train_score=False
)

# --- 5. Calculate and Prepare Results ---

def get_stats(results_dict):
    """Calculates mean and std dev for requested metrics from cross_validate results."""
    stats = {}
    for metric in SCORING_METRICS:
        scores = results_dict[f'test_{metric}']
        stats[metric] = {
            'mean': np.mean(scores),
            'std': np.std(scores)
        }
    return stats

stats_all = get_stats(results_all)
stats_rfe = get_stats(results_rfe)


# --- 6. Print Comparison Results ---
print("\n" + "="*90)
print("SVM PRECISION / RECALL / F1 COMPARISON: ALL FEATURES vs. RECURSIVE FEATURE ELIMINATION (RFE)")
print("="*90)
print(f"Features: | 1. Standard SVM: {N_FEATURES} | 2. RFE-Optimized SVM: {N_FEATURES_TO_SELECT}")
print("-" * 90)

# Print Header Row
print(f"| {'Model':<15} | {'Mean Precision':<15} | {'Std Precision':<15} | {'Mean Recall':<15} | {'Std Recall':<15} | {'Mean F1-Score':<15} | {'Std F1-Score':<15} |")
print("-" * 90)

# Print All Features Results
print(f"| {'All Features':<15} | {stats_all['precision']['mean']:.4f} | {stats_all['precision']['std']:.4f} | {stats_all['recall']['mean']:.4f} | {stats_all['recall']['std']:.4f} | {stats_all['f1']['mean']:.4f} | {stats_all['f1']['std']:.4f} |")
print("-" * 90)

# Print RFE Results
print(f"| {'RFE-Optimized':<15} | {stats_rfe['precision']['mean']:.4f} | {stats_rfe['precision']['std']:.4f} | {stats_rfe['recall']['mean']:.4f} | {stats_rfe['recall']['std']:.4f} | {stats_rfe['f1']['mean']:.4f} | {stats_rfe['f1']['std']:.4f} |")
print("-" * 90)

print(f"\nOverall Mean Accuracy (All Features): {stats_all['accuracy']['mean']:.4f}")
print(f"Overall Mean Accuracy (RFE-Optimized): {stats_rfe['accuracy']['mean']:.4f}")


# Provide analysis based on F1-Score, as it balances Precision and Recall
f1_all = stats_all['f1']['mean']
f1_rfe = stats_rfe['f1']['mean']

print("\nAnalysis of Results (Based on F1-Score):")
if f1_rfe > f1_all:
    print(f"-> RFE improved the model performance! The RFE-optimized model achieved a higher F1-Score ({f1_rfe:.4f} vs {f1_all:.4f}) with fewer features.")
elif f1_all > f1_rfe:
    print(f"-> The Standard SVM performed better. The RFE process resulted in a lower F1-Score ({f1_rfe:.4f} vs {f1_all:.4f}), suggesting important features were removed.")
else:
    print("-> Both models achieved identical mean F1-Score, indicating that feature selection did not have a measurable impact on the generalized performance.")
print("="*90)
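
For reference, the three metrics reported above can be derived by hand from the four confusion-matrix counts. The sketch below uses a small made-up label vector (not results from the Breast Cancer dataset) and checks the manual formulas against scikit-learn's functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels, invented for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of all predicted positives, how many are correct
recall = tp / (tp + fn)     # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
print("sklearn:", precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```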

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes classifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# --- 1. Load the Dataset ---
# We use the Breast Cancer dataset (binary classification).
X, y = load_breast_cancer(return_X_y=True)
N_FEATURES = X.shape[1] # Total number of features (30 for Breast Cancer)

# Define the number of splits for cross-validation
N_SPLITS = 5
# Define metrics for evaluation. 'neg_log_loss' is used because lower log loss is better,
# and scikit-learn's scoring convention requires higher scores to be better.
SCORING_METRICS = ['accuracy', 'neg_log_loss']


# --- 2. Setup Base Model and Cross-Validation ---
# Base Naïve Bayes Classifier
nb_classifier = GaussianNB()

# Standard Scaler for preprocessing (good practice)
scaler = StandardScaler()

# Stratified K-Fold Cross-Validation strategy
cv_strategy = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

# --- 3. Setup Model Pipeline ---

# Pipeline for Gaussian Naïve Bayes
nb_pipeline = make_pipeline(
    scaler,
    nb_classifier
)

print(f"Dataset: Breast Cancer (Total Samples: {X.shape[0]}, Total Features: {N_FEATURES})")
print(f"Classifier: Gaussian Naïve Bayes")
print(f"Validation Strategy: Stratified {N_SPLITS}-Fold Cross-Validation")
print("Starting cross-validation using Accuracy and Negative Log Loss...")

# --- 4. Evaluate Model using Cross-Validation ---

results_nb = cross_validate(
    nb_pipeline,
    X,
    y,
    cv=cv_strategy,
    scoring=SCORING_METRICS,
    n_jobs=-1,
    return_train_score=False
)

# --- 5. Calculate and Prepare Results ---

# Log Loss (Negative scores are converted to positive loss values)
log_loss_scores = -results_nb['test_neg_log_loss']
mean_log_loss = np.mean(log_loss_scores)
std_log_loss = np.std(log_loss_scores)

# Accuracy
accuracy_scores = results_nb['test_accuracy']
mean_accuracy = np.mean(accuracy_scores)
std_accuracy = np.std(accuracy_scores)


# --- 6. Print Comparison Results ---
print("\n" + "="*80)
print("NAÏVE BAYES PERFORMANCE EVALUATION (Log Loss & Accuracy)")
print("="*80)
print(f"Validation Folds: {N_SPLITS}")
print("-" * 80)
print(f"| {'Metric':<20} | {'Mean Score':<20} | {'Std Dev':<20} |")
print("-" * 80)
print(f"| {'Log Loss (Cross-Entropy)':<20} | {mean_log_loss:.4f} | {std_log_loss:.4f} |")
print(f"| {'Accuracy':<20} | {mean_accuracy:.4f} ({(mean_accuracy*100):.2f}%) | {std_accuracy:.4f} |")
print("-" * 80)

print("\nInterpretation:")
print(f"-> Mean Log Loss of {mean_log_loss:.4f}: This indicates the average cross-entropy between the predicted probabilities and the true labels. Lower values are better, representing a more confident and accurate model.")
print(f"-> Mean Accuracy of {mean_accuracy:.4f}: This represents the average proportion of correctly classified samples across all {N_SPLITS} folds.")
print("="*80)
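
The log loss value reported above is the mean negative log-probability assigned to the true class. A small worked sketch with invented probabilities (not model output) shows the formula and compares it with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities of class 1, invented for illustration
y_true = np.array([1, 0, 1, 1])
p_pos = np.array([0.9, 0.2, 0.6, 0.3])

# Binary cross-entropy: -mean( y*log(p) + (1-y)*log(1-p) )
manual_ll = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))
sk_ll = log_loss(y_true, p_pos)

print(f"manual log loss : {manual_ll:.4f}")
print(f"sklearn log loss: {sk_ll:.4f}")
```

The last sample (true label 1, predicted probability 0.3) contributes the largest term, which illustrates how log loss penalizes confident mistakes more heavily than accuracy does.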


In [None]:
import numpy as np
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Heatmap for the confusion matrix

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split # Single train/test split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report # Evaluation metrics and matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# --- 1. Load the Dataset ---
# We use the Breast Cancer dataset (binary classification).
X, y = load_breast_cancer(return_X_y=True)
N_FEATURES = X.shape[1] # Total number of features (30 for Breast Cancer)
TARGET_NAMES = load_breast_cancer().target_names # Names for labels (malignant, benign)


# --- 2. Data Split ---
# Split the data into training and testing sets for a single evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# --- 3. Setup Model Pipeline ---
# Use an RBF kernel SVM, which generally performs well.
svc_classifier = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)

# Pipeline for SVC with Standardization
svc_pipeline = make_pipeline(
    StandardScaler(),
    svc_classifier
)

print(f"Dataset: Breast Cancer (Total Samples: {X.shape[0]}, Features: {N_FEATURES})")
print(f"Classifier: Support Vector Classifier (RBF Kernel)")
print(f"Evaluation: Standard 70/30 Train/Test Split")
print("Training model and generating Confusion Matrix...")

# --- 4. Train Model and Predict ---

# Train the model
svc_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svc_pipeline.predict(X_test)

# --- 5. Evaluate Performance ---

# Calculate Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=TARGET_NAMES)

# --- 6. Plot Confusion Matrix ---

# Set a figure size for better visualization
plt.figure(figsize=(8, 6))
# Create the seaborn heatmap
sns.heatmap(
    cm,
    annot=True, # Annotate cells with the numeric value
    fmt='d', # Format as integer
    cmap='Blues', # Color map
    cbar=False, # Do not show the color bar
    xticklabels=TARGET_NAMES, # X-axis labels (Predicted)
    yticklabels=TARGET_NAMES # Y-axis labels (Actual)
)
plt.title('Confusion Matrix for SVM Classifier')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
# Save the plot to a file; in a notebook, call plt.show() before closing to render it inline
plt.savefig('confusion_matrix_svc.png')
# Close the figure to free memory
plt.close()


# --- 7. Print Results ---
print("\n" + "="*80)
print("SVM PERFORMANCE EVALUATION (RBF Kernel)")
print("="*80)
print(f"Accuracy: {accuracy:.4f}")
print("-" * 80)
print("Classification Report:")
print(report)
print("-" * 80)
print("The Confusion Matrix plot has been saved to confusion_matrix_svc.png.")
print("="*80)
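
Raw counts in a confusion matrix can be hard to compare when classes have different sizes. As a hedged addition, the sketch below uses the `normalize='true'` option of `confusion_matrix` on invented labels to turn each row into per-class rates.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration (4 samples of class 0, 6 of class 1)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

cm_counts = confusion_matrix(y_true, y_pred)                   # rows = true class, columns = predicted
cm_rates = confusion_matrix(y_true, y_pred, normalize='true')  # each row divided by its total

print("Counts:\n", cm_counts)
print("Row-normalized rates:\n", np.round(cm_rates, 3))
```

The diagonal of the normalized matrix is the recall of each class, and the same array can be passed to `sns.heatmap` with `fmt='.2f'` instead of `fmt='d'`.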

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_california_housing # Regression dataset
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR # Support Vector Regressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # Regression metrics
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# --- 1. Load the Dataset ---
# We use the California Housing dataset (regression problem).
housing = fetch_california_housing(as_frame=False)
X, y = housing.data, housing.target
N_FEATURES = X.shape[1] # Total number of features (8 for California Housing)

# --- 2. Data Split ---
# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# --- 3. Setup Model Pipeline ---
# Use an RBF kernel SVR, which is a common choice for non-linear regression.
svr_regressor = SVR(kernel='rbf', C=10.0, gamma='scale') # Increased C for better fitting

# Pipeline for SVR with Standardization
svr_pipeline = make_pipeline(
    StandardScaler(),
    svr_regressor
)

print(f"Dataset: California Housing (Total Samples: {X.shape[0]}, Features: {N_FEATURES})")
print(f"Model: Support Vector Regressor (SVR, RBF Kernel)")
print(f"Evaluation: Standard 70/30 Train/Test Split")
print("Training model and evaluating performance...")

# --- 4. Train Model and Predict ---

# Train the model
svr_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svr_pipeline.predict(X_test)

# --- 5. Evaluate Performance (Mean Absolute Error - MAE) ---

# Calculate MAE (The requested metric)
mae = mean_absolute_error(y_test, y_pred)

# Calculate other common regression metrics for context
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# --- 6. Visualize Prediction vs. Actual (For Context) ---

# Create a figure to plot predictions vs. actual values
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.6)
# Plot the ideal 45-degree line where predictions equal actuals
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')

plt.title('SVR: Actual vs. Predicted House Prices')
plt.xlabel('True Values (Actual Price in $100,000s)')
plt.ylabel('Predicted Values (Price in $100,000s)')
plt.grid(True)
# Save the plot to a file
plt.savefig('svr_prediction_vs_actual.png')
plt.close()


# --- 7. Print Results ---
print("\n" + "="*80)
print("SVR PERFORMANCE EVALUATION (California Housing)")
print("="*80)
print(f"Mean Absolute Error (MAE): {mae:.4f}") # Requested Metric
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R2 Score): {r2:.4f}")
print("-" * 80)

print("\nInterpretation:")
print(f"-> MAE of {mae:.4f}: This means the average magnitude of error in predicting the house price is {mae:.4f} units (where a unit is $100,000). For example, if MAE is 0.5, the model is off by $50,000 on average.")
print("-> The scatter plot (svr_prediction_vs_actual.png) shows how closely predictions align with the true values.")
print("="*80)
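
One reason to prefer MAE over MSE-based metrics is its lower sensitivity to a few large errors. The sketch below uses two invented prediction vectors with the same MAE: one with evenly spread errors and one with a single large miss. It is an illustration, not output from the SVR model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predictions = {
    'even': y_true + 0.5,                                       # every prediction off by 0.5
    'outlier': y_true + np.array([0.0, 0.0, 0.0, 0.0, 2.5]),    # one prediction off by 2.5
}

results = {}
for label, y_hat in predictions.items():
    mae = mean_absolute_error(y_true, y_hat)
    rmse = np.sqrt(mean_squared_error(y_true, y_hat))
    results[label] = (mae, rmse)
    print(f"{label:<8} MAE={mae:.4f}  RMSE={rmse:.4f}")
```

Both cases have the same total absolute error, so MAE is identical, while RMSE grows in the outlier case because squaring weights the single large error more heavily.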

In [None]:
import numpy as np

from sklearn.datasets import load_breast_cancer # Binary classification dataset
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB # Gaussian Naïve Bayes classifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# --- 1. Load the Dataset ---
# We use the Breast Cancer dataset (binary classification).
X, y = load_breast_cancer(return_X_y=True)
N_FEATURES = X.shape[1]

# Define the number of splits for cross-validation
N_SPLITS = 5
# Define the scoring metric
SCORING_METRIC = 'roc_auc'


# --- 2. Setup Base Model and Cross-Validation ---
# Base Naïve Bayes Classifier
nb_classifier = GaussianNB()

# Stratified K-Fold Cross-Validation strategy
cv_strategy = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

# --- 3. Setup Model Pipeline ---

# Pipeline for Gaussian Naïve Bayes with Standardization
nb_pipeline = make_pipeline(
    StandardScaler(),
    nb_classifier
)

print(f"Dataset: Breast Cancer (Total Samples: {X.shape[0]}, Total Features: {N_FEATURES})")
print(f"Classifier: Gaussian Naïve Bayes")
print(f"Validation Strategy: Stratified {N_SPLITS}-Fold Cross-Validation")
print("Starting cross-validation using ROC-AUC score...")

# --- 4. Evaluate Model using Cross-Validation ---

# Calculate ROC-AUC score for each fold. AUC uses predicted probabilities.
auc_scores = cross_val_score(
    nb_pipeline,
    X,
    y,
    cv=cv_strategy,
    scoring=SCORING_METRIC, # Score using Area Under the ROC Curve
    n_jobs=-1
)

# --- 5. Calculate and Prepare Results ---

mean_auc = np.mean(auc_scores)
std_auc = np.std(auc_scores)


# --- 6. Print Results ---
print("\n" + "="*80)
print("NAÏVE BAYES PERFORMANCE EVALUATION (ROC-AUC)")
print("="*80)
print(f"Validation Folds: {N_SPLITS}")
print("-" * 80)
print(f"| {'Metric':<20} | {'Mean Score':<20} | {'Std Dev':<20} |")
print("-" * 80)
print(f"| {'ROC AUC Score':<20} | {mean_auc:.4f} | {std_auc:.4f} |")
print("-" * 80)

print("\nInterpretation:")
print(f"-> Mean ROC-AUC of {mean_auc:.4f}: The ROC-AUC score represents the model's ability to distinguish between classes. A score close to 1.0 is excellent, while 0.5 is no better than random guessing.")
print("="*80)
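
The interpretation printed above can be made concrete: ROC-AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). The sketch below checks this on invented scores against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
sk_auc = roc_auc_score(y_true, scores)

print(f"pairwise AUC: {pairwise_auc:.4f}")
print(f"sklearn AUC : {sk_auc:.4f}")
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so both values equal 8/9.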

In [None]:
import numpy as np
import matplotlib.pyplot as plt # Plotting the curve

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split # Single train/test split
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay, average_precision_score # PR curve utilities
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# --- 1. Load the Dataset ---
# We use the Breast Cancer dataset (binary classification).
X, y = load_breast_cancer(return_X_y=True)
N_FEATURES = X.shape[1]
TARGET_NAMES = load_breast_cancer().target_names

# --- 2. Data Split ---
# Split the data into training and testing sets for a single evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# --- 3. Setup Model Pipeline ---
# Use an RBF kernel SVM. Set probability=True to enable predict_proba,
# which is needed for the Precision-Recall Curve.
# Note: probability=True slows training because Platt scaling is fitted
# with an internal 5-fold cross-validation.
svc_classifier = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42, probability=True)

# Pipeline for SVC with Standardization
svc_pipeline = make_pipeline(
    StandardScaler(),
    svc_classifier
)

print(f"Dataset: Breast Cancer (Total Samples: {X.shape[0]}, Features: {N_FEATURES})")
print("Classifier: Support Vector Classifier (RBF Kernel with Probability Output)")
print("Evaluation: Standard 70/30 Train/Test Split")
print("Training model and generating Precision-Recall Curve...")

# --- 4. Train Model and Get Probabilities ---

# Train the model
svc_pipeline.fit(X_train, y_train)

# Get the predicted probability of the positive class (label 1).
# With probability=True, predict_proba returns one column per class.
y_scores = svc_pipeline.predict_proba(X_test)[:, 1]

# --- 5. Calculate Precision-Recall Curve components ---

# Calculate precision, recall, and thresholds
precision, recall, _ = precision_recall_curve(y_test, y_scores)

# Calculate Average Precision Score
avg_precision = average_precision_score(y_test, y_scores)

# --- 6. Plot Precision-Recall Curve ---

# Create a figure for the plot
plt.figure(figsize=(8, 6))

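# Use PrecisionRecallDisplay to plot the curve.
# The display appends "(AP = ...)" to the legend label on its own,
# so the name is kept short to avoid printing the score twice.
disp = PrecisionRecallDisplay(precision=precision, recall=recall, average_precision=avg_precision)
disp.plot(ax=plt.gca(), name='SVM')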

plt.title('Precision-Recall Curve for SVM Classifier')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.grid(True)
plt.legend(loc="lower left")

# Save the plot to a file
plt.savefig('precision_recall_curve_svc.png')
plt.close()

# --- 7. Print Results ---
print("\n" + "="*80)
print("SVM PERFORMANCE EVALUATION (PRECISION-RECALL)")
print("="*80)
print(f"Average Precision Score (AP): {avg_precision:.4f}")
print("-" * 80)
print("The Precision-Recall Curve was saved to precision_recall_curve_svc.png.")
print("="*80)
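The Average Precision reported above summarizes the curve as a weighted mean of precisions, where each threshold's weight is the increase in recall it brings: AP = Σ (R_n − R_{n−1}) · P_n. The sketch below works through that sum in pure Python as an illustration of the definition. The helper name `average_precision` and the toy data are assumptions for this example, and the function is not a replacement for `average_precision_score`.

```python
def average_precision(y_true, y_score):
    """Step-wise AP: sum over thresholds of (recall increase) * precision."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("Average precision needs at least one positive label.")

    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)

    ap = 0.0
    true_pos = 0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        # Samples sharing a score fall under the same threshold,
        # so they are counted together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            true_pos += ranked[j][1]
            j += 1
        precision = true_pos / j          # j samples are predicted positive
        recall = true_pos / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Hypothetical labels and scores, for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(f"{average_precision(toy_labels, toy_scores):.4f}")  # 0.5*1 + 0.5*(2/3) -> 0.8333
```

Only thresholds that raise recall contribute to the sum, so negatives ranked above a positive lower AP by reducing the precision at the point where that positive is recovered.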