<a href="https://colab.research.google.com/github/waelrash1/predictive_analytics_DT302/blob/main/SVC_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to SVM
1. What is a Support Vector Machine (SVM)? An SVM is a supervised machine learning algorithm that can be used for both classification and regression. It looks for the decision boundary that separates the classes with the widest possible margin. In this notebook, we focus on classification.
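
To make the margin idea concrete, here is a minimal sketch on a hand-made, linearly separable toy set. The six points and the names `X_toy`, `y_toy`, and `toy_clf` are illustrative assumptions, not part of the lab data. It fits a linear `SVC`, then reads off the support vectors and the margin width, which for a linear SVM is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D
X_toy = np.array([[1, 1], [2, 1], [1, 2],
                  [4, 4], [5, 4], [4, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear SVM; only the points closest to the boundary become support vectors
toy_clf = SVC(kernel='linear', C=1.0)
toy_clf.fit(X_toy, y_toy)

# For a linear kernel the boundary is w . x + b = 0 and the margin width is 2 / ||w||
w = toy_clf.coef_[0]
b = toy_clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", toy_clf.support_vectors_)
print("Weights w:", w, "bias b:", b)
print("Margin width:", margin_width)
```

Points that are not support vectors could be moved (without crossing the margin) and the fitted boundary would stay the same, which is why the method is named after them.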



2. Importing Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix

3. Loading the Breast Cancer Dataset


In [None]:
# Load dataset
cancer = datasets.load_breast_cancer()

# Convert to DataFrame
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))

4. Data Preprocessing

In [None]:
# Display the first few rows of the DataFrame
print(df.head())

# Dataset dimensions
print(df.shape)

Splitting the Data


In [None]:

# Splitting the dataset into a training set and a testing set
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Create an SVC pipeline with a linear kernel


In [None]:
# Creating a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='linear'))
])


5. Train the pipeline

In [None]:
# Training the pipeline (scaler followed by the linear SVC)
pipeline.fit(X_train, y_train)



6. Model Evaluation

In [None]:


# Making predictions
y_pred = pipeline.predict(X_test)

In [None]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

In [None]:
# Classification Report
class_report = classification_report(y_test, y_pred)
print(class_report)
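
The numbers in the classification report follow directly from the confusion matrix. As a rough illustration, the sketch below recomputes precision, recall, F1, and accuracy for class 1 from a made-up 2x2 matrix. The counts are invented for the example and are not results from this notebook.

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows = true class, columns = predicted class
cm = [[50, 10],
      [5, 100]]

tn, fp = cm[0]  # true class 0: predicted 0 (TN), predicted 1 (FP)
fn, tp = cm[1]  # true class 1: predicted 0 (FN), predicted 1 (TP)

precision = tp / (tp + fp)   # of everything predicted as 1, how much was right
recall = tp / (tp + fn)      # of everything that truly is 1, how much was found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```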

In [None]:
# Visualizing the Confusion Matrix
# fmt='d' prints the counts as integers instead of scientific notation
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
plt.show()

## Polynomial SVC Kernel

In [None]:
# Polynomial kernel
poly_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_poly', SVC(kernel='poly', degree=1))  # degree=1 behaves like a linear kernel; try degree=2 or 3
])
poly_pipeline.fit(X_train, y_train)
y_pred_poly = poly_pipeline.predict(X_test)

# Evaluate
print(confusion_matrix(y_test, y_pred_poly))
print(classification_report(y_test, y_pred_poly))


## RBF Kernel

In [None]:
# RBF kernel
rbf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_rbf', SVC(kernel='rbf'))
])
rbf_pipeline.fit(X_train, y_train)
y_pred_rbf = rbf_pipeline.predict(X_test)

# Evaluate
print(confusion_matrix(y_test, y_pred_rbf))
print(classification_report(y_test, y_pred_rbf))


## SVC Parameter Tuning
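Parameter tuning matters for most machine learning models, and for SVMs in particular, because the fitted decision boundary can change substantially with the hyperparameters. The main ones are:

* C: Regularization parameter. The strength of the regularization is inversely proportional to C, so a smaller C gives a wider, more tolerant margin (less overfitting), while a larger C fits the training data more tightly.
* Kernel: Specifies the kernel type to be used in the algorithm, for example 'linear', 'poly', 'rbf', or 'sigmoid'.
* Gamma: Kernel coefficient for 'rbf', 'poly', and 'sigmoid'; the linear kernel ignores it. It defines how far the influence of a single training example reaches: low values mean far, high values mean close.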
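
One way to see the effect of gamma is to hold C fixed and compare training and test accuracy across a few values. The following is only a sketch under assumptions: the gamma values and the names `X_demo`, `gamma_scores`, and `model` are chosen for illustration. A large gap between training and test accuracy at high gamma would point to overfitting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

gamma_scores = {}
for gamma in [0.001, 0.01, 0.1, 1, 10]:  # illustrative values
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='rbf', C=1, gamma=gamma)),
    ])
    model.fit(Xtr, ytr)
    train_acc = model.score(Xtr, ytr)
    test_acc = model.score(Xte, yte)
    gamma_scores[gamma] = (train_acc, test_acc)
    print(f"gamma={gamma:<6} train acc={train_acc:.3f}  test acc={test_acc:.3f}")
```

The same loop can be repeated over C with gamma fixed to compare the two parameters separately.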

2. Setting Up Grid Search
First, import the necessary module for Grid Search:

In [None]:

from sklearn.model_selection import GridSearchCV
# Now, set up the parameter grid to test:

param_grid = {
    'C': [0.1, 1, 10, 100],  # A range of values for C
    'gamma': [1, 0.1, 0.01, 0.001],  # A range of values for gamma
    'kernel': ['linear','rbf', 'poly', 'sigmoid']  # Different kernel types
}



3. Applying Grid Search

Create a GridSearchCV object and fit it to the training data:





In [None]:
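# Scale inside a pipeline so every CV fold is standardized; 'svm__' routes each parameter to the SVC step
pipeline_grid = {f'svm__{name}': values for name, values in param_grid.items()}
grid = GridSearchCV(Pipeline([('scaler', StandardScaler()), ('svm', SVC())]), pipeline_grid, refit=True, verbose=1)
grid.fit(X_train, y_train)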


4. Evaluating the Best Model

After fitting, we can check the best parameter combination found by Grid Search:



In [None]:

print("Best Parameters Found: ", grid.best_params_)
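
Beyond the single best combination, the fitted search object also keeps the cross-validation score of every candidate in `cv_results_`. The sketch below is an assumption-laden illustration: it uses a deliberately small RBF-only grid and the hypothetical names `demo_pipeline`, `demo_grid`, and `demo_search`, so it runs quickly and does not reproduce the grid above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

demo_pipeline = Pipeline([('scaler', StandardScaler()),
                          ('svm', SVC(kernel='rbf'))])
# Small illustrative grid: 3 x 3 = 9 candidates
demo_grid = {'svm__C': [0.1, 1, 10],
             'svm__gamma': [0.001, 0.01, 0.1]}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# One row per candidate, sorted from best to worst mean CV accuracy
results = pd.DataFrame(demo_search.cv_results_)
columns = ['param_svm__C', 'param_svm__gamma',
           'mean_test_score', 'std_test_score']
ranked = results.sort_values('rank_test_score')[columns]

print("Best CV accuracy:", demo_search.best_score_)
print(ranked.head())
```

Looking at the top few rows shows whether the winner is clearly ahead or whether several settings perform about equally well.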


In [None]:

#Use the best estimator to make predictions:
grid_predictions = grid.predict(X_test)

# Evaluate
print(confusion_matrix(y_test, grid_predictions))
print(classification_report(y_test, grid_predictions))

# Logistic Regression

Importing Logistic Regression
The other required libraries are already imported, so only `LogisticRegression` from scikit-learn needs to be added.




In [None]:
from sklearn.linear_model import LogisticRegression

# Creating a pipeline for logistic regression
logistic_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logistic', LogisticRegression())
])

# Training the logistic regression pipeline
logistic_pipeline.fit(X_train, y_train)

In [None]:
# Making predictions
y_pred_logistic = logistic_pipeline.predict(X_test)

# Evaluating Logistic Regression Model

# Confusion Matrix for Logistic Regression
conf_matrix_logistic = confusion_matrix(y_test, y_pred_logistic)
print(conf_matrix_logistic)

# Classification Report for Logistic Regression
class_report_logistic = classification_report(y_test, y_pred_logistic)
print(class_report_logistic)

# Visualizing the Confusion Matrix for Logistic Regression
# fmt='d' prints the counts as integers instead of scientific notation
sns.heatmap(conf_matrix_logistic, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
plt.title('Confusion Matrix for Logistic Regression')
plt.show()



7. Conclusion
Summarize the performance of each model and discuss the results, in particular how changing the kernel type and tuning C and gamma affected performance compared with logistic regression.
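
A compact way to support such a summary is to score every model with the same cross-validation setup. The sketch below is one possible approach rather than the notebook's own method: the candidate list, the names `candidates` and `summary`, and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_cmp, y_cmp = load_breast_cancer(return_X_y=True)

# Illustrative set of models to compare
candidates = {
    'svc_linear': SVC(kernel='linear'),
    'svc_poly_deg3': SVC(kernel='poly', degree=3),
    'svc_rbf': SVC(kernel='rbf'),
    'logistic': LogisticRegression(max_iter=1000),
}

summary = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each training fold
    pipe = Pipeline([('scaler', StandardScaler()), ('model', estimator)])
    scores = cross_val_score(pipe, X_cmp, y_cmp, cv=5)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name:<14} mean acc={scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread alongside the mean helps judge whether differences between models are larger than the fold-to-fold variation.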

8. Additional Exercises and Resources
Try different kernels ('rbf', 'poly', etc.) and vary the C and gamma parameters. The scikit-learn user guide on support vector machines is a good starting point for further reading.

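## Digits Dataset (scikit-learn's small MNIST-style set)
Note: `load_digits` loads scikit-learn's built-in digits dataset (1,797 images of 8x8 pixels), not the full 28x28 MNIST set.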

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


In [None]:
# Load dataset
digits = datasets.load_digits()

# Displaying the shape of data and target
print("Image Data Shape: ", digits.data.shape)
print("Label Data Shape: ", digits.target.shape)

# Displaying the first few images and labels
fig, axes = plt.subplots(1, 10, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Training: %i' % label)
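
The dataset exposes the same pixels in two layouts: `digits.images` holds 8x8 arrays for plotting, while `digits.data` holds the flattened 64-feature rows that the classifier consumes. A minimal sketch of that relationship follows; the names `digits_demo` and `flattened` are illustrative.

```python
import numpy as np
from sklearn import datasets

digits_demo = datasets.load_digits()

# Flattening each 8x8 image row by row gives one 64-feature row of `data`
flattened = digits_demo.images.reshape(len(digits_demo.images), -1)

print("images shape:", digits_demo.images.shape)
print("data shape:  ", digits_demo.data.shape)
print("identical after flattening:",
      np.array_equal(flattened, digits_demo.data))
```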


In [None]:
# Split data into 70% train and 30% test subsets
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, shuffle=False)


In [None]:
# Create a classifier: a support vector classifier
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', svm.SVC(gamma=0.001))
])

# Fit to the training data
svm_pipeline.fit(X_train, y_train)


In [None]:
# Predict the value of the digit on the test subset
predicted = svm_pipeline.predict(X_test)


In [None]:
# Confusion matrix
print(f"Confusion matrix:\n{metrics.confusion_matrix(y_test, predicted)}")

# Classification report
print(f"Classification report for classifier {svm_pipeline}:\n"
      f"{metrics.classification_report(y_test, predicted)}")


## Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV
# Now, set up the parameter grid to test:

param_grid = {
    'C': [0.1, 1, 10, 100],  # A range of values for C
    'gamma': [1, 0.1, 0.01, 0.001],  # A range of values for gamma
    'kernel': ['linear','rbf', 'poly', 'sigmoid']  # Different kernel types
}


In [None]:
grid = GridSearchCV(svm.SVC(), param_grid, refit=True, verbose=1)  # svm is imported at the top of this section
grid.fit(X_train, y_train)

In [None]:
print("Best Parameters Found: ", grid.best_params_)


In [None]:

#Use the best estimator to make predictions:
grid_predictions = grid.predict(X_test)

# Evaluate
print(metrics.confusion_matrix(y_test, grid_predictions))
print(metrics.classification_report(y_test, grid_predictions))

## Experimenting with Banknote Authentication Dataset
The Banknote Authentication Dataset was derived from images of genuine and forged banknote-like specimens. It does not contain the images themselves: each row holds four features (variance, skewness, and curtosis of the wavelet-transformed image, plus the entropy of the image) and a class label.






In [None]:
# Importing Required Libraries and Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import urllib.request

In [None]:
#Loading the Banknote Authentication Dataset
# The dataset can be downloaded from the UCI Machine Learning Repository. Here's how to do it:


url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
urllib.request.urlretrieve(url, "data_banknote_authentication.csv")

In [None]:
# Read the dataset
df = pd.read_csv("data_banknote_authentication.csv", header=None)
df.columns = ["Variance", "Skewness", "Curtosis", "Entropy", "Class"]
print(df.head())


In [None]:
# Data Preprocessing
#Exploratory Data Analysis

# Basic stats
print(df.describe())

# Checking for null values
print(df.isnull().sum())

# Class distribution
sns.countplot(x='Class', data=df)  # pass the column by name so recent seaborn versions count the class labels
plt.show()


In [None]:

# Split the data into features and target label
X = df.drop('Class', axis=1)
y = df['Class']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Building and Training the SVM Model
# Creating an SVM classifier with a linear kernel
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='linear'))
])

# Train the model using the training sets
svm_pipeline.fit(X_train, y_train)
# Model Evaluation
# Making Predictions and Evaluating the Model

# Predicting the Test set results
y_pred = svm_pipeline.predict(X_test)

# Confusion Matrix
print(confusion_matrix(y_test, y_pred))

# Classification Report
print(classification_report(y_test, y_pred))


7. Experimenting with Different Kernels
Try other kernels ('rbf', 'poly', 'sigmoid') and parameters (C, gamma), and observe their impact on the model's performance.



## The Fashion MNIST Dataset
The Fashion MNIST dataset is a collection of 70,000 grayscale images (60,000 for training and 10,000 for testing) covering 10 fashion categories, including shirts, dresses, and shoes. Each image is 28x28 pixels. This dataset is often used as a more challenging replacement for the classic MNIST dataset.





In [None]:
# Importing Required Libraries and Dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import tensorflow as tf



In [None]:
# Loading the Fashion MNIST Dataset
# Fashion MNIST can be easily loaded via TensorFlow or Keras:


# Load the Fashion MNIST dataset
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()



In [None]:
# Define class names for Fashion MNIST
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Plotting a few samples from the dataset
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(X_train[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[y_train[i]])
plt.show()


In [None]:

# Normalize the data
X_train, X_test = X_train / 255.0, X_test / 255.0

# Reshape the data to fit the SVM input requirements
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

print("Training Set Shape:", X_train.shape)
print("Test Set Shape:", X_test.shape)

In [None]:
# Data Preprocessing
# The data is already scaled to [0, 1] and flattened; the StandardScaler step below additionally standardizes each pixel feature.

# Building and Training the SVM Model
# Given the size of the dataset, consider using a subset of the training data for faster processing, or use a linear kernel for quicker execution.

# Using an RBF kernel here; the next cell trains on a subset to keep the runtime manageable
svm_model = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', svm.SVC(kernel='rbf'))
])
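
When training on a subset, taking the first N rows works, but a stratified random sample keeps the class balance under control. The sketch below shows one way to draw such a sample with `train_test_split`. It uses a synthetic stand-in array with Fashion MNIST's flattened shape so it runs without downloading anything, and the names `X_full`, `y_full`, `X_sub`, and `y_sub` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 5000 rows of 784 features, 10 balanced classes
X_full = rng.random((5000, 784))
y_full = np.repeat(np.arange(10), 500)

# Keep 1000 rows, sampled so each class keeps its share of the data
X_sub, _, y_sub, _ = train_test_split(
    X_full, y_full, train_size=1000, stratify=y_full, random_state=42)

print("Subset shape:", X_sub.shape)
print("Samples per class:", np.bincount(y_sub))
```

With the real data, `X_full` and `y_full` would be replaced by the flattened training images and their labels.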



In [None]:
# Fit the model (consider using a smaller subset of data for faster processing)
svm_model.fit(X_train[:10000], y_train[:10000]) # Using first 10000 samples for training
# Model Evaluation

# Making predictions
y_pred = svm_model.predict(X_test[:1000]) # Using first 1000 samples for testing

# Confusion Matrix
print(metrics.confusion_matrix(y_test[:1000], y_pred))

# Classification Report
print(metrics.classification_report(y_test[:1000], y_pred))

In [None]:

# Visualizing the Predictions
# Visualizing predictions can help in understanding where the model performs well or poorly.

fig, axes = plt.subplots(3, 3, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(28, 28), cmap='gray')
    # Show class names rather than raw label indices
    ax.set_title(f"True: {class_names[y_test[i]]}, Predicted: {class_names[y_pred[i]]}")
    ax.set_axis_off()
plt.show()