# Introduction
A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for both classification and regression tasks. It is known for its effectiveness in handling high-dimensional data and its ability to perform well even with limited training data.

### Classification with SVMs
- The core idea is to find an optimal hyperplane that separates the data points of different classes with the maximum margin.
- A hyperplane is a decision boundary in n-dimensional space (n = number of features).
- The margin is the distance between the hyperplane and the closest data points from each class, which are called the support vectors.
- SVMs aim to maximize this margin, which intuitively leads to a better separation between classes and potentially a better generalization to unseen data.

### Key components
- Support vectors: These are the data points closest to the hyperplane that define the margin. They are crucial for training the model and influence the classification of new data points.
- Kernel trick: This technique allows SVMs to handle non-linearly separable data. It implicitly maps the data into a higher-dimensional space where a linear separation might be possible, without computing the mapping explicitly. Common kernels include linear, polynomial, and radial basis function (RBF).

### Advantages of SVMs
- Effective in high-dimensional spaces: SVMs can perform well even with a large number of features, making them suitable for complex datasets.
- Robust to overfitting: The focus on maximizing the margin can help reduce overfitting, especially when dealing with limited training data.
- Interpretability: In some cases, the decision boundary learned by the SVM can be visualized and interpreted, providing insights into the model's behavior.

### Disadvantages of SVMs
- Can be computationally expensive: Training SVMs can be slower than some other algorithms, especially for large datasets.
- Parameter tuning: Choosing the right kernel and its hyperparameters is crucial for optimal performance and can involve experimentation.
- Not immune to very high dimensionality: While SVMs can handle high dimensions, when the number of features greatly exceeds the number of training samples, the choice of kernel and regularization becomes critical to avoid overfitting.

### Applications of SVMs
- Text classification (spam detection, sentiment analysis).
- Image classification (object detection, handwriting recognition).
- Bioinformatics data analysis (gene expression analysis).
- Anomaly detection (fraud detection, system intrusion detection).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
```

```python
# Notebook-wide display and plotting defaults.
pd.set_option("display.max_columns", None)
sns.set_theme(style="whitegrid")
warnings.filterwarnings("ignore")
plt.rcParams["figure.figsize"] = (20, 10)
```

```python
# Load the pre-processed spam dataset and preview the first rows.
df = pd.read_csv("spam_processed.csv", encoding="latin-1")
df.head()
```

| Unnamed: 0 | type | message | cleaned_message |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah nt think goes usf lives around though |


# SVM Algorithm
### 1. Data representation
- Each data point is represented as a vector of features ($x_i$) with a corresponding class label ($y_i$).
- For example, when classifying emails as spam or ham, the features might be word frequencies, and the class labels would be 1 (spam) or 0 (ham).
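
As a minimal sketch of this representation, the snippet below turns a few messages into word-count vectors. The three messages, their labels, and the helper name `to_feature_vector` are hypothetical and chosen only for illustration; a real pipeline would typically use a library vectorizer instead.

```python
from collections import Counter

# Hypothetical mini-corpus: (cleaned message, label) with 1 = spam, 0 = ham.
messages = [
    ("free entry win cup final", 1),
    ("ok lar joking wif u oni", 0),
    ("win free tkts free", 1),
]

# Build a sorted vocabulary so every message maps to the same feature order.
vocabulary = sorted({word for text, _ in messages for word in text.split()})

def to_feature_vector(text):
    """Return word counts for `text`, ordered like `vocabulary`."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

X = [to_feature_vector(text) for text, _ in messages]  # feature vectors x_i
y = [label for _, label in messages]                   # class labels y_i
```

Each row of `X` has one entry per vocabulary word, so all messages live in the same n-dimensional feature space.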

### 2. Hyperplane
- The goal is to find a hyperplane (a decision boundary) in the feature space that separates the data points of different classes with the maximum margin.
- The margin is the distance between the hyperplane and the closest data points from each class, called support vectors.

### 3. Support vectors
- These are the most critical training instances that define the margin.
- They are typically the data points closest to the hyperplane on either side, with at least one from each class.
- The intuition is that these points have the most influence on the classification of new data points.

### 4. Maximizing the margin
- The SVM algorithm aims to maximize the margin between the hyperplane and the support vectors.
- A larger margin intuitively leads to a better separation between classes and potentially better generalization to unseen data.

### 5. Kernel trick (for non-linear data)
- In some cases, the data might not be linearly separable in the original feature space.
- The kernel trick addresses this by transforming the data into a higher-dimensional space where a linear separation might be possible.
- Common kernel functions include:
    - Linear kernel (for already linearly separable data).
    - Polynomial kernel (transforms data to a higher-dimensional polynomial space).
    - Radial Basis Function (RBF) kernel (projects data into a high-dimensional space using a Radial Basis Function).
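
The idea behind the kernel trick can be sketched with a degree-2 polynomial kernel: the kernel value computed in the original 2-D space equals an inner product in an explicit 3-D feature space. The function names and the `gamma` default below are assumptions made for this illustration.

```python
import numpy as np

def poly_feature_map(x):
    """Explicit degree-2 feature map for a 2-D input (no bias term)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return np.dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma is a free hyperparameter."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(poly_feature_map(x), poly_feature_map(z))  # inner product in 3-D
implicit = poly_kernel(x, z)                                  # same value, computed in 2-D
```

Both `explicit` and `implicit` evaluate to the same number, which is why the higher-dimensional mapping never has to be computed explicitly.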

### 6. Classification of new data points
Once the SVM is trained (hyperplane and support vectors identified), a new data point is classified by:
- Mapping the new point into the same feature space as the training data (if using a kernel).
- Calculating the signed value of the decision function for the new point, which is proportional to its distance from the hyperplane.
- Assigning the class label based on which side of the hyperplane the new point falls on.
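
These steps can be sketched for a linear kernel, where the score of a point is $w^T x + w_0$. The weight vector, bias, and new points below are hypothetical values, not parameters learned from the spam data.

```python
import numpy as np

# Hypothetical parameters of an already-trained linear SVM.
w = np.array([1.0, -1.0])   # weight vector, normal to the hyperplane
w_0 = 0.5                   # bias term

def decision_function(X):
    """Signed score w^T x + w_0 for each row of X."""
    return X @ w + w_0

def predict(X):
    """Label +1 on the positive side of the hyperplane, -1 otherwise."""
    return np.where(decision_function(X) >= 0, 1, -1)

new_points = np.array([[2.0, 0.0], [0.0, 3.0]])
scores = decision_function(new_points)           # [2.5, -2.5]
distances = np.abs(scores) / np.linalg.norm(w)   # geometric distance to the hyperplane
labels = predict(new_points)                     # [1, -1]
```

The sign of the score picks the class, while dividing its absolute value by $||w||$ gives the actual distance to the hyperplane.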

### Mathematical formulation (simplified)
The decision function for an SVM with a linear kernel can be expressed as $f(x) = w^T x + w_0$, where:
- $w$ = Weight vector (normal to the hyperplane).
- $w_0$ = Bias term.
- $x$ = New data point.

The goal is to find $w$ and $w_0$ that maximize the margin while correctly classifying all training points. This involves solving a constrained optimization problem.
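
For reference, one common way to write this constrained problem is shown below. It assumes labels $y_i \in \{+1, -1\}$ and linearly separable data; the norm is often squared in practice, which leaves the minimizer unchanged.

```latex
\min_{w,\, w_0} \; \frac{\lVert w \rVert}{2}
\quad \text{subject to} \quad
y_i \left( w^T x_i + w_0 \right) \ge 1, \qquad i = 1, \dots, n
```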

### Learning algorithms
Several algorithms are used to train SVMs. A popular one is Sequential Minimal Optimization (SMO), which efficiently solves the optimization problem for finding the optimal hyperplane.

# Maximum Margin Classifier
### Problem
The task is to classify data points (positive - green, negative - red) into separate classes using a hyperplane (decision boundary).

### Challenge
Choosing the "best" hyperplane among many possible ones.

### Traditional approach
Many other linear classifiers fit the boundary using all data points and some overall criterion, without explicitly asking how far the boundary sits from the nearest points.

### Maximum Margin Classifier (MMC)
This approach focuses on maximizing the margin between the hyperplane and the closest data points (support vectors) from each class.

### Key points
1. Margin: The distance between the hyperplane and the closest data points (one from each class) on either side.
2. Support vectors: These are the data points closest to the hyperplane and define the margin.
3. Intuition: A larger margin intuitively leads to better separation between classes and potentially better generalization to unseen data.

### Why support vectors?
Instead of considering all data points, MMC only focuses on support vectors because:
- They have the most influence on the classification of new data points.
- Maximizing the margin around them ensures a good separation between classes for most other points as well.

### Finding the best hyperplane
1. Consider 2 hyperplanes ($\pi_1$ and $\pi_2$) and their corresponding support vectors.
2. Create parallel lines ($\pi^+$ and $\pi^-$) on either side of each hyperplane, touching the support vectors.
3. Calculate the distance between these parallel lines ($d_1$ and $d_2$) for each hyperplane.
4. The hyperplane with the larger distance (larger margin) between its parallel lines is considered better (higher $d$ is better).
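
The comparison above can be sketched numerically. The data set and the two candidate hyperplanes are hypothetical, and the helper `margin_width` assumes that each candidate already separates the two classes.

```python
import numpy as np

# Hypothetical 2-D training set: +1 = green, -1 = red.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

def margin_width(w, w_0):
    """Distance between the parallels pi+ and pi- that touch the closest
    point of each class, for the hyperplane w^T x + w_0 = 0."""
    signed = (X @ w + w_0) / np.linalg.norm(w)   # signed distance of each point
    closest_pos = signed[y == 1].min()           # nearest green point
    closest_neg = -signed[y == -1].max()         # nearest red point
    return closest_pos + closest_neg

d_1 = margin_width(np.array([1.0, 1.0]), 0.0)   # candidate pi_1
d_2 = margin_width(np.array([1.0, 0.0]), 0.0)   # candidate pi_2
best = "pi_1" if d_1 > d_2 else "pi_2"
```

Here $d_1 = 4\sqrt{2}$ and $d_2 = 4$, so the first candidate is preferred.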

### Formalization and optimization
- The equation of a hyperplane can be expressed as $w^T x + w_0 = 0$, where $w$ is the weight vector and $w_0$ is the bias term.
- The margin has to be maximized. It can be expressed as $\frac{2k}{||w||}$, where $k$ is the constant in the equations of the two parallel hyperplanes, $w^T x + w_0 = \pm k$ (conventionally scaled to $k = 1$).
- Because the margin is inversely proportional to $||w||$, maximizing it is equivalent to minimizing $||w||$, so the objective function becomes $\arg\min_{w, w_0} \frac{||w||}{2}$.
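
The step from the margin to the objective can be spelled out as follows. This sketch assumes the two parallels are written as $w^T x + w_0 = \pm k$ and that $k$ is fixed to 1 by rescaling $w$ and $w_0$, which is a common normalization rather than something required by the definitions above.

```latex
\begin{aligned}
\pi^+ &: \; w^T x + w_0 = +k, \qquad \pi^- : \; w^T x + w_0 = -k \\
\text{margin} &= \frac{(+k) - (-k)}{\lVert w \rVert} = \frac{2k}{\lVert w \rVert}
  \;\xrightarrow{\;k = 1\;}\; \frac{2}{\lVert w \rVert} \\
\arg\max_{w,\, w_0} \frac{2}{\lVert w \rVert}
  &= \arg\min_{w,\, w_0} \frac{\lVert w \rVert}{2}
\end{aligned}
```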

### Why not just minimize $||w||$?
Minimizing $||w||$ alone does not guarantee a good separation. Without constraints, the objective is minimized by $w = 0$, which no longer defines a separating hyperplane and ignores where the data points lie, leading to poor performance on unseen data.

### Constraints
- To prevent the hyperplane from collapsing and to ensure a good separation, additional constraints are required (introduced in Hard Margin SVM, a specific MMC implementation).
- These constraints require every training point to lie on the correct side of the margin boundary for its class.

### Summary
The Maximum Margin Classifier focuses on maximizing the margin between the hyperplane and the closest data points (support vectors) to achieve good separation between classes and potentially better generalization to unseen data.

# Hard Margin Classifier
### Labeling in SVM
SVM uses labels +1 and -1 for positive and negative classes respectively. This is just a convention and doesn't affect the underlying concepts.

### Constraints for hard margin
- A hard margin classifier enforces stricter constraints than the Maximum Margin Classifier (MMC).
- It requires that no point be misclassified or lie between the hyperplane and the support vectors.

### Formalizing the constraints
- Two hyperplanes are defined: $\pi^+$ (above the decision boundary) and $\pi^-$ (below the decision boundary).
- Every green point should satisfy $y_i (w^T x_i + w_0) \geq 1$ (where $y_i$ is +1 for green points). This ensures it lies on or above $\pi^+$.
- Every red point should satisfy $w^T x_i + w_0 \leq -1$. Since $y_i$ is -1 for red points, this is the same condition, $y_i (w^T x_i + w_0) \geq 1$, and it ensures the point lies on or below $\pi^-$.
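
A minimal sketch of checking these constraints for a given hyperplane is shown below. The data and the two candidate weight vectors are hypothetical, and the combined condition $y_i (w^T x_i + w_0) \geq 1$ is assumed for both classes.

```python
import numpy as np

# Hypothetical separable data: +1 = green, -1 = red.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

def functional_margins(w, w_0):
    """y_i * (w^T x_i + w_0) for every training point."""
    return y * (X @ w + w_0)

def satisfies_hard_margin(w, w_0):
    """True when every point lies on or beyond its own margin boundary."""
    return bool(np.all(functional_margins(w, w_0) >= 1))

ok = satisfies_hard_margin(np.array([0.25, 0.25]), 0.0)       # all margins equal 1
too_small = satisfies_hard_margin(np.array([0.1, 0.1]), 0.0)  # margins of 0.4 violate it
```

Both candidates classify every point correctly, but only the first one keeps all points on or outside the margin boundaries.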

### Functional margin
- The functional margin of a point is its hyperplane output multiplied by its label, $y_i (w^T x_i + w_0)$.
- Multiplying by the label means that correctly classified green points (positive label) and red points (negative label) both have a positive functional margin.

### Verifying correct classification
- To check for misclassification, a point is plugged into the hyperplane equation and the result is multiplied by its label.
- A correctly classified point (either green or red) always gives a positive value after this multiplication, while a misclassified point gives a negative one.

### Challenges of hard margin
- The strict constraint of no misclassification can be difficult to achieve in practice, especially for datasets that are not perfectly linearly separable.
- If the data is not linearly separable, a Hard Margin Classifier might not be able to find a valid hyperplane that satisfies the constraint for all points.

### Why "hard" margin?
- This approach is called a Hard Margin Classifier because it enforces a strict "all or nothing" rule. There is no room for even a single misclassification.
- This makes it ideal only for scenarios with perfectly linearly separable data points, which can be rare in real-world datasets.

### Limitations and Soft Margin Classifiers
- Due to its rigid constraints, Hard Margin Classifiers can be impractical for real-world problems.
- Soft Margin Classifiers address this by allowing a certain degree of misclassification. This allows them to handle non-linearly separable data while still aiming for a good margin.

### Summary
The Hard Margin Classifier is a powerful concept but has limitations due to its strict requirement of perfect separation. While it serves as a theoretical foundation for understanding maximum margin classification, Soft Margin Classifiers offer a more practical solution for most real-world tasks where data might not be perfectly separable.

# Soft Margin Classifier
### Limitations of Hard Margin Classifiers
- Hard Margin Classifiers enforce a strict constraint of no misclassifications.
- This can be unrealistic for real-world datasets that might not be perfectly linearly separable.

### Soft margins
- Soft Margin Classifiers address this limitation by allowing a certain degree of misclassification.
- This allows for more flexibility in handling non-ideal data while still aiming for a good separation between classes.

### Error calculation
- In soft margin classification, an error term ($E_i$) is defined for each data point.
- This error measures how far a point falls short of the required functional margin (typically 1 unit).
- It is calculated as $E_i = 1 - y_i Z_i$, where $Z_i = w^T x_i + w_0$ is the hyperplane output for the point and $y_i$ is its label.

### Constraint relaxation
- Unlike Hard Margin Classifiers, where $y_i Z_i \geq 1$ must always hold, Soft Margin Classifiers introduce a slack variable $\xi_i$ (pronounced "xi") for each point.
- The constraint becomes $y_i Z_i \geq 1 - \xi_i$ with $\xi_i \geq 0$. A point may violate the margin by the amount $\xi_i$, which is zero for points that satisfy the original constraint and equals the shortfall $1 - y_i Z_i$ otherwise.
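
The slack values can be sketched directly, assuming the standard hinge form $\xi_i = \max(0, 1 - y_i Z_i)$ with labels +1 and -1. The hyperplane and points are hypothetical.

```python
import numpy as np

# Hypothetical hyperplane and points; labels are +1 / -1.
w, w_0 = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 0.0], [0.5, 1.0], [-0.5, 0.0], [-3.0, 2.0]])
y = np.array([1, 1, 1, -1])

z = X @ w + w_0                      # Z_i: raw output for each point
slack = np.maximum(0.0, 1 - y * z)   # xi_i: zero unless the margin is violated
# slack -> [0.0, 0.5, 1.5, 0.0]
```

The second point is correctly classified but sits inside the margin (slack between 0 and 1), while the third is misclassified (slack above 1).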

### Loss function with soft margin
The objective function in Soft Margin Classifiers aims to minimize two things:
- The norm of the weight vector ($w$), as in Hard Margins (this encourages a large margin).
- The total error caused by margin violations (the sum of the slack variables $\xi_i$).

### Ignoring irrelevant errors
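- Points that are correctly classified and lie on or beyond their margin boundary ($y_i Z_i \geq 1$, so $E_i \leq 0$) are not penalized. Correctly classified points that fall inside the margin still incur a small penalty.
- The slack variable ($\xi_i$) ensures a loss of zero for the unpenalized points: because $\xi_i$ is non-negative, a negative $E_i$ is clipped to zero, $\xi_i = \max(0, E_i)$.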

### Trade-off parameter ($C$)
- The degree to which misclassifications are tolerated is controlled by a hyperparameter called $C$, which weights the total slack in the objective.
- A higher $C$ penalizes margin violations more heavily. This leads to stricter classification of the training data and a narrower margin, and the model might be less robust to noise.
- A lower $C$ allows for more misclassifications in exchange for a wider margin. If $C$ is too low, the model can underfit and perform worse on unseen data.
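
The effect of $C$ can be sketched by scoring two fixed candidate hyperplanes under the objective $\frac{||w||}{2} + C \sum_i \xi_i$. This assumes the usual convention that $C$ multiplies the total slack; the data and both candidates are hypothetical, and no optimization is performed.

```python
import numpy as np

# Hypothetical data with one green point (1, 0) close to the red side.
X = np.array([[4.0, 0.0], [6.0, 0.0], [1.0, 0.0], [-4.0, 0.0], [-6.0, 0.0]])
y = np.array([1, 1, 1, -1, -1])

def soft_margin_objective(w, w_0, C):
    """||w|| / 2 + C * (total slack), the soft-margin cost of one hyperplane."""
    slack = np.maximum(0.0, 1 - y * (X @ w + w_0))
    return np.linalg.norm(w) / 2 + C * slack.sum()

wide = (np.array([0.25, 0.0]), 0.0)    # margin 8, but (1, 0) violates it
narrow = (np.array([0.5, 0.0]), 0.5)   # margin 4, no violations

def preferred(C):
    """Name of the candidate with the lower objective for this C."""
    cost_wide = soft_margin_objective(*wide, C)
    cost_narrow = soft_margin_objective(*narrow, C)
    return "wide" if cost_wide < cost_narrow else "narrow"

low_C_choice = preferred(0.1)    # slack is cheap
high_C_choice = preferred(10.0)  # slack is expensive
```

With $C = 0.1$ the wide-margin candidate has the lower cost despite its violation, while with $C = 10$ the narrow-margin candidate with no violations is preferred.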

### Summary
Soft Margin Classifiers offer a more practical approach for real-world data by allowing some degree of misclassification. They introduce a trade-off between maximizing the margin and minimizing errors, controlled by the hyperparameter $C$. This flexibility makes them widely used in various machine learning tasks, particularly when dealing with non-linearly separable data.