<h1 align="center">Support Vector Machines with Python and Scikit-Learn</h1>

In this project, I build a Support Vector Machine (SVM) classifier to identify pulsar stars. I use the Predicting a Pulsar Star dataset, which I downloaded from the Kaggle website.

#### Table of Contents

- Introduction to Support Vector Machines
- Support Vector Machines intuition
- Kernel trick
- The problem statement
- Dataset description
- Import libraries
- Import dataset
- Exploratory data analysis
- Declare feature vector and target variable
- Split data into separate training and test set
- Feature scaling
- Run SVM with default hyperparameters
- Run SVM with linear kernel
- Run SVM with polynomial kernel
- Run SVM with sigmoid kernel
- Confusion matrix
- Classification metrics
- ROC - AUC
- Stratified k-fold Cross Validation with shuffle split
- Hyperparameter optimization using GridSearch CV
- Results and conclusion

# 1. Introduction to Support Vector Machines

**Support Vector Machines** (SVMs for short) are supervised machine learning algorithms used for classification, regression and outlier detection. An SVM classifier builds a model that assigns new data points to one of the given categories. In its basic form, it can therefore be viewed as a non-probabilistic binary linear classifier.

The original SVM algorithm was developed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. At that time, the algorithm was in its early stages and could only draw hyperplanes for linear classification. In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The current standard was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.

SVMs can be used for linear classification. In addition, they can efficiently perform non-linear classification using the **kernel trick**, which enables us to implicitly map the inputs into high-dimensional feature spaces.

# 2. Support Vector Machines intuition

Before going further, we should become familiar with some SVM terminology.

**Hyperplane:**

A hyperplane is a decision boundary that separates data points with different class labels. The SVM classifier separates data points using the hyperplane with the maximum margin. This hyperplane is known as the `maximum margin hyperplane` and the linear classifier it defines is known as the `maximum margin classifier`.

**Support Vectors:**

Support vectors are the sample data points that lie closest to the hyperplane. These data points determine the position of the separating line or hyperplane, because the margin is calculated from them.

**Margin:**

A margin is the separation gap between the two lines drawn through the closest data points of each class. It is calculated as the perpendicular distance from the separating line to the support vectors (the closest data points). In SVMs, we try to maximize this separation gap so that we get the maximum margin.

**SVM Under the hood**

In SVMs, our main objective is to select a hyperplane with the maximum possible margin between support vectors in the given dataset. SVM searches for the maximum margin hyperplane in the following two-step process:
  - Generate hyperplanes that segregate the classes in the best possible way. There are many hyperplanes that might classify the data. We should look for the best hyperplane, the one that represents the largest separation, or margin, between the two classes.

  - Choose the hyperplane for which the distance to the support vectors on each side is maximized. If such a hyperplane exists, it is known as the `maximum margin hyperplane` and the linear classifier it defines is known as a `maximum margin classifier`.
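
The idea can be sketched on a toy problem. The six 2-D points below are invented for illustration (they are not pulsar data), and the very large `C` is an assumption used to approximate a hard-margin classifier. For a linear SVM with weight vector `w`, the margin width is `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (toy data, not the pulsar dataset).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                             # normal vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)       # distance between the two margin lines

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points listed in `support_vectors_` influence the position of the hyperplane; moving any other point slightly would leave the solution unchanged.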
 
**Problem with dispersed datasets**

Sometimes, the sample data points are so dispersed that it is not possible to separate them using a linear hyperplane. In such a situation, SVMs use the `kernel trick` to transform the input space into a higher-dimensional space, as shown in the diagram below. For example, a mapping function can transform a 2-D input space into a 3-D feature space, where the data points can then be segregated using linear separation.


# 3. Kernel trick

In practice, the SVM algorithm is implemented using a `kernel`, through a technique called the `kernel trick`. In simple words, a kernel is a function that computes the inner product of two data points as if they had been mapped to a higher-dimensional space where the data is separable, without computing that mapping explicitly. In effect, it transforms a low-dimensional input space into a higher-dimensional space, and so converts non-linearly separable problems into linearly separable ones by adding more dimensions. Thus, the kernel trick helps us build a more accurate classifier for non-linear separation problems.

In the context of SVMs, there are 4 popular kernels: `Linear kernel`, `Polynomial kernel`, `Radial Basis Function (RBF)` kernel (also called Gaussian kernel) and `Sigmoid kernel`.
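
A small numerical check may make the kernel trick more concrete. The sketch below uses a homogeneous degree-2 polynomial kernel, `K(u, v) = (u . v)^2`, chosen here purely for illustration; the helper names `phi` and `poly2_kernel` are hypothetical. It shows that the kernel value computed in the 2-D input space equals the inner product after an explicit mapping into 3-D.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly2_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: K(u, v) = (u . v)^2."""
    return float(np.dot(u, v)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = float(np.dot(phi(x), phi(z)))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)             # same quantity, computed in the 2-D input space

print(explicit, implicit)
```

Both values agree, which is why an SVM can work in the higher-dimensional space while only ever evaluating the kernel on the original inputs.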

# 4. The problem statement
In this project, I try to classify a pulsar candidate as a legitimate or spurious pulsar star. The legitimate pulsar stars form the minority positive class and the spurious pulsar stars form the majority negative class. I implement the Support Vector Machines (SVMs) classification algorithm with Python and Scikit-Learn to solve this problem.

To answer the question, I build an SVM classifier to classify the `pulsar star` as legitimate or spurious. I have used the Predicting a Pulsar Star dataset (<a href="https://www.kaggle.com/datasets/chandrashekhargt/pulsar-star">Pulsar_Star_Dataset</a>), downloaded from the Kaggle website, for this project.

# 5. Dataset description

Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter. Classification algorithms in particular are being adopted to identify them, treating the candidate data sets as binary classification problems. Here the legitimate pulsar examples form the minority positive class and the spurious examples form the majority negative class.

The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive).

**Attribute Information:** 

Each candidate is described by 8 continuous variables and a single class variable. The first four are simple statistics obtained from the integrated pulse profile. The remaining four variables are similarly obtained from the DM-SNR curve. These are summarised below:

 - Mean of the integrated profile.

 - Standard deviation of the integrated profile.

 - Excess kurtosis of the integrated profile.

 - Skewness of the integrated profile.

 - Mean of the DM-SNR curve.

 - Standard deviation of the DM-SNR curve.

 - Excess kurtosis of the DM-SNR curve.

 - Skewness of the DM-SNR curve.

 - Class

# 6. Import libraries

I will start off by importing the required Python libraries.
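
A plausible set of imports for the steps listed in the table of contents is sketched below. The exact list used in the original notebook is not shown, so this selection is an assumption.

```python
import numpy as np   # numerical arrays
import pandas as pd  # tabular data handling

from sklearn.model_selection import train_test_split  # train/test splitting
from sklearn.preprocessing import StandardScaler      # feature scaling
from sklearn.svm import SVC                           # support vector classifier
from sklearn.metrics import accuracy_score            # basic evaluation metric
```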



# 7. Import dataset
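
Loading the data could look like the sketch below. The helper `load_pulsar_data`, the file name `pulsar_stars.csv`, the column names and the two sample rows are all assumptions made so that the example runs without the Kaggle download; the values are placeholders, not real records.

```python
import io
import pandas as pd

def load_pulsar_data(source):
    """Read the pulsar CSV from a path or file-like object and tidy the column names."""
    df = pd.read_csv(source)
    df.columns = df.columns.str.strip()  # headers may carry stray whitespace
    return df

# In-memory stand-in with placeholder values (not real records).
# In practice, pass the path of the downloaded file, e.g. "pulsar_stars.csv" (assumed name).
sample_csv = io.StringIO(
    " Mean of the integrated profile, Standard deviation of the integrated profile,target_class\n"
    "140.5,55.6,0\n"
    "99.3,41.5,1\n"
)

df = load_pulsar_data(sample_csv)
print(df.shape)
print(df.head())
```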



# 8. Exploratory data analysis

Now, I will explore the data to gain insights into it.
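
Typical first checks are the shape, missing values, summary statistics and the class balance. The sketch below runs them on a synthetic imbalanced stand-in built with `make_classification`; the column names, including `target_class`, are assumptions and the numbers it prints say nothing about the real pulsar data.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Synthetic stand-in: 8 features, roughly 90% negative / 10% positive.
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, n_informative=5,
                                     weights=[0.9, 0.1], random_state=0)
df = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(8)])
df["target_class"] = y_demo

print(df.shape)               # number of rows and columns
print(df.isnull().sum())      # missing values per column
print(df.describe().T)        # summary statistics per column

# Share of each class; a strong skew signals an imbalanced problem.
class_share = df["target_class"].value_counts(normalize=True)
print(class_share)
```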

# 9. Declare feature vector and target variable
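
A minimal sketch of this step, assuming the label column is named `target_class` (an assumption) and using a tiny placeholder frame with made-up values:

```python
import pandas as pd

# Placeholder rows with made-up values; column names are assumptions.
df = pd.DataFrame({
    "feature_a": [0.1, 0.4, 0.35, 0.8],
    "feature_b": [1.2, 0.7, 1.9, 0.3],
    "target_class": [0, 0, 1, 1],
})

X = df.drop(columns=["target_class"])  # feature matrix: every column except the label
y = df["target_class"]                 # target vector: the class label

print(X.shape, y.shape)
```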

# 10. Split data into separate training and test set
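
One way to perform the split is `train_test_split`. The 80/20 ratio, the `random_state` and the use of `stratify=y` (which keeps the class proportions equal in both parts) are assumptions, and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the pulsar features and labels.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Hold out 20% for testing; stratify to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print("Positive share (train, test):", y_train.mean(), y_test.mean())
```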

# 11. Feature Scaling
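
SVMs are sensitive to the scale of the features, so standardising them is a common choice. The sketch below assumes `StandardScaler` (the original may have used a different scaler) and fits it on the training data only, so that no information from the test set leaks into the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in for the pulsar data.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training set
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(np.round(X_train_scaled.mean(axis=0), 3))
print(np.round(X_train_scaled.std(axis=0), 3))
```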

# 12. Run SVM with default hyperparameters

Default hyperparameters means C=`1.0`, kernel=`rbf` and gamma=`scale` among other parameters (scikit-learn versions before 0.22 used gamma=`auto` as the default).
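
A minimal sketch of fitting `SVC` with its defaults follows. It runs on a synthetic stand-in, so the accuracy it prints is not the result reported for the pulsar data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the pulsar data.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)

svc = SVC()  # all hyperparameters left at their defaults
svc.fit(scaler.transform(X_train), y_train)

accuracy = accuracy_score(y_test, svc.predict(scaler.transform(X_test)))
print("C:", svc.C, "kernel:", svc.kernel, "gamma:", svc.gamma)
print("Test accuracy:", accuracy)
```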

# 13. Run SVM with linear kernel
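
The linear kernel can be tried at several values of `C`, as sketched below on a synthetic stand-in. The two `C` values are illustrative. With a linear kernel the fitted model also exposes `coef_`, one weight per feature.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the pulsar data.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

accuracies = {}
for C in (1.0, 100.0):
    # Scaling and the classifier are chained so both are fitted on training data only.
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    model.fit(X_train, y_train)
    accuracies[C] = model.score(X_test, y_test)
    print(f"C={C}: test accuracy = {accuracies[C]:.4f}")

# Feature weights of the last fitted linear model.
weights = model.named_steps["svc"].coef_[0]
print(weights)
```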

# 14. Run SVM with polynomial kernel
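
For the polynomial kernel, the `degree` parameter controls the order of the polynomial. The sketch below compares two illustrative degrees on a synthetic stand-in; the values are assumptions, not the settings used in the original run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the pulsar data.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

poly_accuracies = {}
for degree in (2, 3):
    model = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=degree, C=1.0))
    model.fit(X_train, y_train)
    poly_accuracies[degree] = model.score(X_test, y_test)
    print(f"degree={degree}: test accuracy = {poly_accuracies[degree]:.4f}")
```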

**Comments**

We get the maximum accuracy, `0.9832`, with the `rbf` and `linear` kernels at `C=100.0`. Based on this analysis alone, we might conclude that our classification model is very accurate and is doing a very good job of predicting the class labels.

But this conclusion is misleading. Here, we have an imbalanced dataset, and accuracy is an inadequate measure of predictive performance on imbalanced data: a model that always predicts the majority class already scores a high accuracy.
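
To see why, consider the accuracy of a trivial model that labels every candidate as spurious, often called the null accuracy. The class counts below are the ones given in the dataset description.

```python
# Class counts from the dataset description.
n_spurious = 16259  # majority negative class
n_pulsars = 1639    # minority positive class

# Accuracy of always predicting the majority (negative) class.
null_accuracy = n_spurious / (n_spurious + n_pulsars)
print(f"Null accuracy: {null_accuracy:.4f}")  # prints 0.9084
```

Such a model finds no pulsars at all, yet it is right about 91% of the time, so a reported accuracy has to be judged against this baseline.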

So, we must explore alternative metrics that provide better guidance in selecting models. In particular, we would like to know the underlying distribution of values and the type of errors our classifier is making.

One such tool for analyzing model performance on imbalanced classes is the `Confusion matrix`.

# 16. Confusion matrix

A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.

Four types of outcomes are possible when evaluating the performance of a classification model. These four outcomes are described below:

**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.

**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.

**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called Type I error.

**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This is a very serious error and it is called Type II error.

These four outcomes are summarized in a confusion matrix given below.
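
In scikit-learn, `confusion_matrix` returns the counts with true classes as rows and predicted classes as columns, so for binary labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The labels and predictions below are a made-up toy example, not model output.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions, invented for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # row-major order: TN, FP, FN, TP

print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```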

# 17. Classification metrics

**Classification report** is another way to evaluate the classification model performance. It displays the **precision**, **recall** and **f1** scores for each class, along with the **support** (the number of true instances of each class). I describe these terms later.
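
As a brief sketch, precision is `TP / (TP + FP)`, recall is `TP / (TP + FN)`, and the f1 score is their harmonic mean. The toy labels below are invented for illustration and give 3 true positives, 1 false positive and 1 false negative.

```python
from sklearn.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

# Toy labels and predictions, invented for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(precision, recall, f1)
print(classification_report(y_true, y_pred))
```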

# 18. ROC - AUC

Another tool to measure the classification model performance visually is `ROC Curve`. ROC Curve stands for **Receiver Operating Characteristic Curve**. An ROC Curve is a plot which shows the performance of a classification model at various classification threshold levels.

The ROC Curve plots the `True Positive Rate (TPR)` against the `False Positive Rate (FPR)` at various threshold levels.

True Positive Rate (TPR) is also called Recall. It is defined as the ratio of `TP to (TP + FN)`.

False Positive Rate (FPR) is defined as the ratio of `FP to (FP + TN)`.

Each point on the ROC Curve gives the TPR (True Positive Rate) and FPR (False Positive Rate) at a single threshold. Taken together, these points show the general performance of the classifier across threshold levels. If we lower the threshold, more items are classified as positive, which increases both True Positives (TP) and False Positives (FP).

#### ROC AUC

ROC AUC stands for **Receiver Operating Characteristic - Area Under Curve**. It is a technique to compare classifier performance. In this technique, we measure the area under the curve (AUC). A perfect classifier will have an ROC AUC equal to 1, whereas a purely random classifier will have an ROC AUC equal to 0.5.

So, ROC AUC is the fraction of the ROC plot that lies underneath the curve.
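
Both the curve and the area can be computed with scikit-learn. The four labels and scores below are a toy example; for an SVM, the scores would typically come from `SVC.decision_function`, which is an assumption about how the original notebook obtained them.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and classifier scores, invented for illustration.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print("FPR:", fpr)
print("TPR:", tpr)
print("ROC AUC:", auc)  # 0.75 for this toy example
```

Here three of the four (positive, negative) pairs are ranked correctly by the scores, which is why the area is 0.75.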

# 19. Stratified k-fold Cross Validation with shuffle split

k-fold cross-validation is a very useful technique to evaluate model performance. But plain k-fold can be unreliable here because we have an imbalanced dataset: some folds may contain too few examples of the minority class. So, for an imbalanced dataset, I will use another technique to evaluate model performance. It is called stratified k-fold cross-validation.

In stratified k-fold cross-validation, we split the data such that the proportions between classes are the same in each fold as they are in the whole dataset.

Moreover, I will shuffle the data before splitting, because shuffling removes any dependence on the order of the rows and, in this case, yields better results.

**Stratified k-Fold Cross Validation with shuffle split with linear kernel**
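
A minimal sketch using `StratifiedKFold` with `shuffle=True` follows. The number of folds, the `random_state` and the scaling pipeline are assumptions, and the data is a synthetic stand-in for the pulsar dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the pulsar data.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Five shuffled folds, each preserving the overall class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

scores = cross_val_score(model, X, y, cv=cv)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())

# Share of positive examples in each test fold (nearly identical by construction).
fold_pos_share = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print(fold_pos_share)
```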

# 20. Hyperparameter Optimization using GridSearch CV
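
A grid search could be set up as sketched below. The parameter grid, the scoring metric and the number of folds are illustrative assumptions rather than the settings of the original experiment, and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the pulsar data.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Small illustrative grid; the "svc__" prefix targets the SVC step of the pipeline.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [1, 10], "svc__gamma": [0.01, 0.1]},
]

grid = GridSearchCV(
    pipe,
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))
```

Given the class imbalance discussed earlier, a scoring option such as `"f1"` or `"roc_auc"` may be a more informative choice than accuracy.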