# Introduction
Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for both classification and regression tasks. It is known for handling high-dimensional data effectively and for performing well even with limited training data.

### Classification with SVMs
- The core idea is to find an optimal hyperplane that separates the data points of different classes with the maximum margin.
- A hyperplane is a decision boundary in n-dimensional space (n = number of features).
- The margin is the distance between the hyperplane and the closest data points from each class, called the support vectors.
- SVMs aim to maximize this margin, which intuitively leads to a better separation between classes and potentially a better generalization to unseen data.
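
As a minimal sketch of this idea, the snippet below fits a linear SVM with scikit-learn's `SVC` on a tiny 2-D dataset. The data points, the value `C=1e3` (a large value chosen here to approximate a hard margin), and the two test points are assumptions made up for illustration; they are not taken from the spam dataset loaded in this notebook.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2-D.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C (an assumption here) leaves little room for margin violations.
clf = SVC(kernel="linear", C=1e3)
clf.fit(X, y)

# Points on opposite sides of the learned hyperplane get different labels.
new_points = np.array([[1.5, 1.5], [4.5, 4.5]])
predictions = clf.predict(new_points)
print(predictions)
```

The first test point lies inside the class-0 cluster and the second inside the class-1 cluster, so the two predictions differ.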

### Key components
- Support vectors: These are the data points closest to the hyperplane; they define the margin. They alone determine where the hyperplane lies, and therefore how new data points are classified.
- Kernel trick: This technique allows SVMs to handle data that is not linearly separable. A kernel function computes inner products as if the data had been mapped into a higher-dimensional space, where a linear separation might be possible, without computing that mapping explicitly. Common kernels include linear, polynomial, and radial basis function (RBF).
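
To illustrate why the kernel matters, here is a small sketch that compares a linear and an RBF kernel on data that is not linearly separable. The dataset (two concentric rings from scikit-learn's `make_circles`) and its parameters are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: an inner ring (one class) surrounded by an outer ring (the other).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same data, two kernels; accuracy is measured on the training set for simplicity.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

No straight line can separate one ring from the other, so the linear kernel should do noticeably worse than the RBF kernel on this data.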

### Advantages of SVMs
- Effective in high-dimensional spaces: SVMs can perform well even with a large number of features, making them suitable for complex datasets.
- Robust to overfitting: The focus on maximizing the margin can help reduce overfitting, especially when dealing with limited training data.
- Interpretability: In some cases, the decision boundary learned by the SVM can be visualized and interpreted, providing insights into the model's behavior.

### Disadvantages of SVMs
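- Can be computationally expensive: Training SVMs can be slower than some other algorithms, especially for large datasets, because the training time of kernel SVMs typically grows faster than linearly with the number of samples.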
- Parameter tuning: Choosing the right kernel and its hyperparameters is crucial for optimal performance and can involve experimentation.
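- Not ideal for very high-dimensional data: While SVMs can handle high dimensions, problems where the number of features far exceeds the number of training samples still pose challenges, and the choice of kernel and regularization becomes critical to avoid overfitting.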
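
The experimentation involved in parameter tuning is often automated with a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV`; the synthetic dataset and the candidate values for `kernel`, `C`, and `gamma` are assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; substitute real features and labels).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search space; gamma is ignored by the linear kernel.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.1],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```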

### Applications of SVMs
- Text classification (spam detection, sentiment analysis).
- Image classification (object detection, handwriting recognition).
- Bioinformatics data analysis (gene expression analysis).
- Anomaly detection (fraud detection, system intrusion detection).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

In [2]:
pd.set_option("display.max_columns", None)
sns.set_theme(style = "whitegrid")
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = (20, 10)

In [4]:
df = pd.read_csv("spam_processed.csv", encoding = "latin-1")
df.head()

| Unnamed: 0 | type | message | cleaned_message |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah nt think goes usf lives around though |


# SVM Algorithm
### 1. Data representation
- Each data point is represented as a vector of features ($x_i$) with a corresponding class label ($y_i$).
- For example, when classifying emails as spam or ham, the features might be word frequencies, and the class labels would be 1 (spam) or 0 (ham).

### 2. Hyperplane
- The goal is to find a hyperplane (a decision boundary) in the feature space that separates the data points of different classes with the maximum margin.
- The margin is the distance between the hyperplane and the closest data points from each class, called support vectors.

### 3. Support vectors
- These are the most critical training instances that define the margin.
- They are the data points closest to the hyperplane on either side, with at least one from each class.
- The intuition is that these points have the most influence on the classification of new data points.
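
After fitting, scikit-learn's `SVC` exposes the support vectors it found. The sketch below inspects them on a hypothetical toy dataset; the data and `C=1e3` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2-D.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

print(clf.support_)          # indices of the support vectors in X
print(clf.support_vectors_)  # the support vectors themselves
print(clf.n_support_)        # number of support vectors per class
```

Only the points nearest the opposite cluster appear as support vectors; the remaining points could be removed without changing the fitted hyperplane.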

### 4. Maximizing the margin
- The SVM algorithm aims to maximize the margin between the hyperplane and the support vectors.
- A larger margin intuitively leads to a better separation between classes and potentially better generalization to unseen data.
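
For a linear SVM scaled so that the support vectors satisfy $|w^T x + w_0| = 1$ (the usual convention, assumed here), the distance from the hyperplane to the closest points is $1 / \lVert w \rVert$. The sketch below computes it from a fitted scikit-learn `SVC`; the toy data and `C=1e3` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2-D.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_[0]                     # normal vector of the hyperplane
margin = 1.0 / np.linalg.norm(w)     # hyperplane to closest point
gap = 2.0 * margin                   # full width of the separating band

print(f"margin: {margin:.3f}, gap between classes: {gap:.3f}")
```

A smaller $\lVert w \rVert$ means a wider margin, which is why the training objective is usually stated as minimizing $\lVert w \rVert$.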

### 5. Kernel trick (for non-linear data)
- In some cases, the data might not be linearly separable in the original feature space.
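- The kernel trick addresses this by implicitly mapping the data into a higher-dimensional space where a linear separation might be possible. The mapping is never computed explicitly; only inner products between pairs of points are needed.
- Common kernel functions include:
    - Linear kernel (no transformation; suitable when the data is already close to linearly separable).
    - Polynomial kernel (corresponds to a feature space of polynomial combinations of the original features).
    - Radial Basis Function (RBF) kernel (measures similarity as a Gaussian function of the distance between two points, which corresponds to an implicit mapping into an infinite-dimensional space).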
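
The three kernels listed above can be written as short functions of two vectors. The sketch below uses the parameterization found in scikit-learn (`gamma`, `coef0`, `degree`); the default values and the two sample vectors are assumptions for illustration.

```python
import numpy as np

def linear_kernel(x, z):
    """Plain inner product: K(x, z) = x . z"""
    return float(x @ z)

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    """K(x, z) = (gamma * x . z + coef0) ** degree"""
    return float((gamma * (x @ z) + coef0) ** degree)

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

# Two hypothetical feature vectors.
x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

k_lin = linear_kernel(x, z)
k_poly = polynomial_kernel(x, z)
k_rbf = rbf_kernel(x, z)
print(k_lin, k_poly, k_rbf)
```

Each function returns a single similarity score; an SVM only ever needs these scores, never the coordinates of the points in the higher-dimensional space.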

### 6. Classification of new data points
Once the SVM is trained (hyperplane and support vectors identified), a new data point is classified by:
- Mapping the new point into the same feature space as the training data (with a kernel, this happens implicitly through kernel evaluations against the support vectors).
- Calculating the signed distance from the new point to the hyperplane.
- Assigning the class label based on which side of the hyperplane the new point falls on.
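
These steps can be sketched by hand for a fitted RBF-kernel `SVC` in the binary case: evaluate the kernel between the new point and each support vector, combine the results with the learned dual coefficients and intercept, and take the sign. The dataset, `gamma=1.0`, and the point `x_new` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic two-ring data (an assumption for illustration).
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

x_new = np.array([0.1, -0.2])  # hypothetical new data point

# Step 1: kernel value between x_new and every support vector.
sq_dists = np.sum((clf.support_vectors_ - x_new) ** 2, axis=1)
k = np.exp(-gamma * sq_dists)

# Step 2: signed score relative to the hyperplane.
score = clf.dual_coef_[0] @ k + clf.intercept_[0]

# Step 3: the sign of the score selects the class.
label = clf.classes_[int(score > 0)]

print(score, label)
```

The hand-computed score should agree with `clf.decision_function([x_new])`, and the label with `clf.predict([x_new])`.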

### Mathematical formulation (simplified)
The decision function for an SVM with a linear kernel can be expressed as $f(x) = w^T x + w_0$, where:
- $w$ = Weight vector (normal to the hyperplane).
- $w_0$ = Bias term.
- $x$ = New data point.

The goal is to find $w$ and $w_0$ that maximize the margin while correctly classifying all training points. This involves solving a constrained optimization problem.
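
For reference, a standard way to write this problem (the hard-margin form) is sketched below. It assumes the class labels are recoded as $y_i \in \{-1, +1\}$ rather than the 1/0 labels used earlier, and that the $n$ training points are linearly separable:

$$
\min_{w,\, w_0} \; \frac{1}{2} \lVert w \rVert^2
\quad \text{subject to} \quad
y_i \left( w^T x_i + w_0 \right) \ge 1, \qquad i = 1, \dots, n
$$

The constraints keep every training point on the correct side of the hyperplane. Under this scaling the distance from the hyperplane to the closest points is $1 / \lVert w \rVert$, so minimizing $\lVert w \rVert^2$ is equivalent to maximizing the margin.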

### Learning algorithms
Several algorithms are used to train SVMs. A popular one is Sequential Minimal Optimization (SMO), which solves the optimization problem efficiently by repeatedly updating a small subset of the variables at a time.
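
To give a feel for how SMO works, here is a simplified teaching sketch for a linear SVM. It is not the full SMO algorithm and not what any particular library implements: the second variable in each pair is picked at random rather than by a heuristic, and the function name, parameter defaults, and toy data are all assumptions. Labels are assumed to be -1 and +1.

```python
import numpy as np

def simplified_smo(X, y, C=10.0, tol=1e-3, max_passes=10, max_iter=1000, seed=0):
    """Simplified SMO for a linear soft-margin SVM; y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = X @ X.T              # Gram matrix of the linear kernel
    alpha = np.zeros(n)      # one dual variable per training point
    b = 0.0
    passes = 0               # consecutive sweeps without any update
    for _ in range(max_iter):
        if passes >= max_passes:
            break
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]   # prediction error on point i
            violates = ((y[i] * E_i < -tol and alpha[i] < C)
                        or (y[i] * E_i > tol and alpha[i] > 0))
            if not violates:
                continue
            # Pick a random partner j != i.
            j = int(rng.integers(n - 1))
            if j >= i:
                j += 1
            E_j = (alpha * y) @ K[:, j] + b - y[j]
            a_i, a_j = alpha[i], alpha[j]
            # Bounds that keep both variables inside [0, C].
            if y[i] != y[j]:
                lo, hi = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                lo, hi = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = 2.0 * K[i, j] - K[i, i] - K[j, j]
            if lo == hi or eta >= 0:
                continue
            # Optimize alpha_j analytically, then clip it to its bounds.
            new_a_j = float(np.clip(a_j - y[j] * (E_i - E_j) / eta, lo, hi))
            if abs(new_a_j - a_j) < 1e-5:
                continue
            # Move alpha_i so that sum(alpha * y) stays unchanged.
            new_a_i = a_i + y[i] * y[j] * (a_j - new_a_j)
            b1 = (b - E_i - y[i] * (new_a_i - a_i) * K[i, i]
                  - y[j] * (new_a_j - a_j) * K[i, j])
            b2 = (b - E_j - y[i] * (new_a_i - a_i) * K[i, j]
                  - y[j] * (new_a_j - a_j) * K[j, j])
            alpha[i], alpha[j] = new_a_i, new_a_j
            if 0 < new_a_i < C:
                b = b1
            elif 0 < new_a_j < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X      # recover the weight vector from the dual variables
    return w, b, alpha

# Hypothetical toy data with labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

w, b, alpha = simplified_smo(X, y)
print(np.sign(X @ w + b))    # predicted side of the hyperplane for each point
```

The key idea is that optimizing two dual variables at a time has a closed-form solution, so each step is cheap and no general-purpose quadratic programming solver is needed.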