# 🔍 How Support Vector Machines Work

**Support Vector Machines (SVMs) aim to find the optimal decision boundary (hyperplane) that separates data points from two classes in binary classification.**

---

## 🧭 Key Concepts

### 📐 Hyperplane  
A hyperplane is a flat decision boundary that splits the feature space. In an $n$-dimensional space, it's an $(n-1)$-dimensional plane. The ideal hyperplane separates the two classes while keeping the widest possible margin between them.

The equation of the hyperplane is:

$$
w \cdot x - b = 0
$$

Where:  
- $w$ — weight vector (normal to the hyperplane)  
- $x$ — input feature vector  
- $b$ — bias term

To correctly classify a data point $x_i$ with label $y_i \in \{-1, 1\}$, SVM requires:

$$
y_i (w \cdot x_i - b) \geq 1
$$

This means every point must be correctly classified and must also lie on or outside the margin boundary.
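
As a quick numerical check of this constraint, the sketch below evaluates $y_i(w \cdot x_i - b)$ for a few points. The values of `w`, `b`, and the sample points are made up for illustration and do not come from the dataset used later.

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen only for illustration.
w = np.array([1.0, -1.0])
b = 0.5

# Three toy points with labels in {-1, +1}.
X = np.array([
    [3.0, 0.0],   # well on the positive side
    [1.0, 0.2],   # positive side, but inside the margin
    [0.0, 2.0],   # well on the negative side
])
y = np.array([1, 1, -1])

# Functional margin y_i * (w . x_i - b) for every point.
functional_margin = y * (X @ w - b)

# The constraint holds where the functional margin is at least 1.
satisfies_constraint = functional_margin >= 1

print(functional_margin)
print(satisfies_constraint)
```

The second point is on the correct side of the hyperplane but still violates the constraint, which is exactly the case the soft-margin penalty is meant to handle.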

---

### 📏 Margin and Support Vectors  
The **margin** is the distance between the hyperplane and the closest data points from either class — these critical points are known as **support vectors**.

A larger margin typically leads to better generalization. SVM tries to **maximize this margin** while correctly classifying the training data (or minimizing the error when using soft margins).
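
When the hyperplane is scaled so that the closest points satisfy $y_i(w \cdot x_i - b) = 1$, their distance to the hyperplane is $1 / \|w\|$, which is why minimizing $\|w\|$ maximizes the margin. A minimal sketch, using made-up values for `w`, `b`, and the points, that computes this distance and picks out the support vectors:

```python
import numpy as np

# Hypothetical, already-scaled hyperplane (illustration only).
w = np.array([2.0, 0.0])
b = 1.0

X = np.array([
    [1.0, 0.0],    # w . x - b =  1 -> on the positive margin boundary
    [0.0, 1.0],    # w . x - b = -1 -> on the negative margin boundary
    [3.0, 2.0],    # far on the positive side
    [-2.0, 1.0],   # far on the negative side
])
y = np.array([1, -1, 1, -1])

# Distance from the hyperplane to the margin boundary.
margin = 1.0 / np.linalg.norm(w)

# Geometric distance of every point to the hyperplane.
distances = np.abs(X @ w - b) / np.linalg.norm(w)

# Support vectors are the points that sit exactly on the margin boundary.
is_support_vector = np.isclose(y * (X @ w - b), 1.0)

print(margin)
print(distances)
print(X[is_support_vector])
```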

---

## 🔁 Gradient Update Rules

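The optimization objective combines two terms:
- a regularization term $\lambda \|w\|^2$, whose minimization maximizes the margin and helps prevent overfitting,
- a hinge-loss penalty for misclassified points and points that fall inside the margin.

For a single data point, with $\lambda$ as the regularization strength:

$$
J_i = \lambda \|w\|^2 + \max\left(0,\ 1 - y_i (w \cdot x_i - b)\right)
$$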

If a data point is **correctly classified and outside the margin** ($y_i(w \cdot x_i - b) \geq 1$):

- Gradient w.r.t. weights:  
  $$
  \frac{\partial J}{\partial w} = 2\lambda w
  $$
- Gradient w.r.t. bias:  
  $$
  \frac{\partial J}{\partial b} = 0
  $$

If a data point is **misclassified or within the margin** ($y_i(w \cdot x_i - b) < 1$):

- Gradient w.r.t. weights:  
  $$
  \frac{\partial J}{\partial w} = 2\lambda w - y_i x_i
  $$
- Gradient w.r.t. bias:  
  $$
  \frac{\partial J}{\partial b} = y_i
  $$
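
The update rules can be sketched as a per-sample subgradient descent loop. This is an illustrative sketch, not the `mlfs` implementation: the function names `fit_linear_svm_sgd` and `predict_linear_svm`, the toy data, and the mapping of 0/1 labels to -1/+1 are all assumptions. Note that with the $w \cdot x - b$ sign convention, the hinge term $1 - y_i(w \cdot x_i - b)$ has derivative $+y_i$ with respect to $b$.

```python
import numpy as np

def fit_linear_svm_sgd(X, y, lr=0.01, lambdaa=0.01, iterations=1000):
    """Hypothetical helper: per-sample subgradient descent on the hinge loss."""
    n_features = X.shape[1]
    # Map labels from {0, 1} to {-1, +1}, as the margin constraint expects.
    y_signed = np.where(y <= 0, -1.0, 1.0)
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(iterations):
        for x_i, y_i in zip(X, y_signed):
            if y_i * (x_i @ w - b) >= 1:
                # Correct and outside the margin: only regularization acts.
                grad_w = 2 * lambdaa * w
                grad_b = 0.0
            else:
                # Misclassified or inside the margin: add the hinge term.
                grad_w = 2 * lambdaa * w - y_i * x_i
                grad_b = y_i
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

def predict_linear_svm(X, w, b):
    """Return class 1 where w . x - b >= 0, otherwise class 0."""
    return np.where(X @ w - b >= 0, 1, 0)

# Toy, linearly separable data (made up for this example).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X_toy = np.vstack([X_pos, X_neg])
y_toy = np.array([1] * 20 + [0] * 20)

w, b = fit_linear_svm_sgd(X_toy, y_toy, iterations=100)
train_accuracy = (predict_linear_svm(X_toy, w, b) == y_toy).mean()
print(w, b, train_accuracy)
```

The nested Python loop visits every sample in every iteration, so training cost grows with `iterations * n_samples`.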

---



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.svm import SVC as SklearnSVM

from mlfs.svm import SVM as CustomSVM
from mlfs.preprocessing import train_test_split, standardize
from mlfs.metrics import (
    accuracy as custom_accuracy,
    precision as custom_precision,
    recall as custom_recall,
    f1_score as custom_f1,
)


In [2]:
df = pd.read_csv("../data/breast-cancer.csv")
df

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400



# Data Preprocessing

In [3]:
df.drop('id', axis=1, inplace=True) 

In [4]:
df['diagnosis'] = (df['diagnosis'] == 'M').astype(int)
corr = df.corr()
plt.figure(figsize=(20,20))
sns.heatmap(corr, cmap='viridis_r',annot=True)
plt.show()

In [5]:
# Drop features whose absolute correlation with the target is below 0.25
low_corr = corr['diagnosis'].abs() < 0.25
notincluded_columns = corr.index[low_corr].tolist()
df.drop(columns=notincluded_columns, inplace=True)

In [6]:
X = df.drop('diagnosis', axis = 1).values
y = df['diagnosis'].values
print('Shape of X', X.shape)
print('Shape of y', y.shape)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y)
X_train, mean, std = standardize(X_train, return_params=True)
X_test = (X_test - mean) / std


# Comparing

In [9]:
def benchmark_svm_custom_vs_sklearn(X, y, n_repeats=5, iterations=1000, lr=0.01, lambdaa=0.01):
    """
    Benchmarks training and prediction times for custom and sklearn SVM implementations.

    Parameters
    ----------
    X : np.ndarray
        Feature matrix.
    y : np.ndarray
        Target vector (binary: 0/1).
    n_repeats : int
        Number of repetitions for averaging time.
    iterations : int
        Number of training epochs for the custom SVM.
    lr : float
        Learning rate for the custom SVM.
    lambdaa : float
        Regularization parameter for the custom SVM.

    Returns
    -------
    pd.DataFrame
        DataFrame with average fit and predict times for both models.
    """

    custom_fit_times = []
    custom_predict_times = []
    sklearn_fit_times = []
    sklearn_predict_times = []

    for _ in range(n_repeats):
        # Custom SVM
        model_custom = CustomSVM(iterations=iterations, lr=lr, lambdaa=lambdaa)

        # perf_counter is a monotonic, high-resolution clock suited to timing
        start = time.perf_counter()
        model_custom.fit(X, y)
        custom_fit_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        model_custom.predict(X)
        custom_predict_times.append(time.perf_counter() - start)

        # Sklearn SVM
        model_sklearn = SklearnSVM(kernel='linear')

        start = time.perf_counter()
        model_sklearn.fit(X, y)
        sklearn_fit_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        model_sklearn.predict(X)
        sklearn_predict_times.append(time.perf_counter() - start)

    results = pd.DataFrame({
        'Model': ['CustomSVM', 'SklearnSVM'],
        'FitTime': [np.mean(custom_fit_times), np.mean(sklearn_fit_times)],
        'PredictTime': [np.mean(custom_predict_times), np.mean(sklearn_predict_times)]
    })

    return results


In [10]:
results = benchmark_svm_custom_vs_sklearn(X_train, y_train)
print(results)


In [11]:
def benchmark_svm_scalability_vs_sklearn(sample_sizes, n_features=4, iterations=1000, lr=0.01, lambdaa=0.01):
    """
    Benchmarks and compares fit/predict times of custom and sklearn SVM as dataset size increases.

    Parameters
    ----------
    sample_sizes : list[int]
        List of dataset sizes to test.
    n_features : int
        Number of features per sample.
    iterations : int
        Number of training iterations for the custom SVM.
    lr : float
        Learning rate for custom SVM.
    lambdaa : float
        Regularization parameter for custom SVM.

    Returns
    -------
    pd.DataFrame
        DataFrame with all benchmark results.
    """

    records = []

    for n_samples in sample_sizes:
        X, y = make_classification(
            n_samples=n_samples,
            n_features=n_features,
            n_informative=n_features,
            n_redundant=0,
            n_classes=2,
            random_state=42
        )

        # Custom SVM
        custom_model = CustomSVM(iterations=iterations, lr=lr, lambdaa=lambdaa)

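        # perf_counter is a monotonic, high-resolution clock suited to timing
        start = time.perf_counter()
        custom_model.fit(X, y)
        fit_custom = time.perf_counter() - start

        start = time.perf_counter()
        custom_model.predict(X)
        predict_custom = time.perf_counter() - start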

        records.append({
            "Samples": n_samples,
            "Model": "CustomSVM",
            "FitTime": fit_custom,
            "PredictTime": predict_custom
        })

        # Sklearn SVM
        sklearn_model = SklearnSVM(kernel='linear')

        start = time.perf_counter()
        sklearn_model.fit(X, y)
        fit_sklearn = time.perf_counter() - start

        start = time.perf_counter()
        sklearn_model.predict(X)
        predict_sklearn = time.perf_counter() - start

        records.append({
            "Samples": n_samples,
            "Model": "SklearnSVM",
            "FitTime": fit_sklearn,
            "PredictTime": predict_sklearn
        })

    df = pd.DataFrame(records)

    # Plotting
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))

    sns.lineplot(data=df, x="Samples", y="FitTime", hue="Model", marker="o", ax=axs[0])
    axs[0].set_title("Fit Time vs Sample Size")
    axs[0].set_ylabel("Time (s)")
    axs[0].grid(True)

    sns.lineplot(data=df, x="Samples", y="PredictTime", hue="Model", marker="o", ax=axs[1])
    axs[1].set_title("Predict Time vs Sample Size")
    axs[1].set_ylabel("Time (s)")
    axs[1].grid(True)

    plt.tight_layout()
    plt.show()

    return df


In [12]:
sample_sizes = [100, 500, 1000, 2000, 5000]
benchmark_df = benchmark_svm_scalability_vs_sklearn(sample_sizes)
print(benchmark_df)


In [14]:
def compare_svm_accuracy_basic(X_train, X_test, y_train, y_test):
    """
    Compares CustomSVM and SklearnSVM using classification metrics:
    - accuracy
    - precision
    - recall
    - f1_score

    Returns
    -------
    pd.DataFrame
        Comparison of all metrics.
    """

    # Custom SVM
    custom_model = CustomSVM()
    custom_model.fit(X_train, y_train)
    preds_custom = custom_model.predict(X_test)

    acc_custom = custom_accuracy(y_test, preds_custom)
    prec_custom = custom_precision(y_test, preds_custom)
    rec_custom = custom_recall(y_test, preds_custom)
    f1_custom = custom_f1(y_test, preds_custom)

    # Sklearn SVM
    sklearn_model = SklearnSVM(kernel='linear')
    sklearn_model.fit(X_train, y_train)
    preds_sklearn = sklearn_model.predict(X_test)

    acc_sklearn = accuracy_score(y_test, preds_sklearn)
    prec_sklearn = precision_score(y_test, preds_sklearn)
    rec_sklearn = recall_score(y_test, preds_sklearn)
    f1_sklearn = f1_score(y_test, preds_sklearn)

    results = pd.DataFrame({
        "Model": ["CustomSVM", "SklearnSVM"],
        "Accuracy": [acc_custom, acc_sklearn],
        "Precision": [prec_custom, prec_sklearn],
        "Recall": [rec_custom, rec_sklearn],
        "F1 Score": [f1_custom, f1_sklearn]
    })

    return results


In [16]:
results = compare_svm_accuracy_basic(X_train, X_test, y_train, y_test)
print(results)


## 🔍 Summary: Comparison of Custom SVM vs. Scikit-learn SVM

### ✅ Accuracy and Classification Metrics

| Model        | Accuracy | Precision | Recall | F1 Score |
|--------------|----------|-----------|--------|----------|
| CustomSVM    | 0.9646   | **1.0000** | 0.8947 | 0.9444   |
| SklearnSVM   | 0.9646   | 0.9722    | **0.9211** | **0.9459** |

- Both implementations achieve identical overall accuracy on the test set.
- CustomSVM exhibits perfect precision (no false positives), but slightly lower recall.
- SklearnSVM gives up a little precision for higher recall, which yields a marginally higher F1 score (0.9459 vs. 0.9444).
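
The F1 column can be sanity-checked from the precision and recall columns, since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is made up for this check. Because the inputs are the rounded values from the table, the recomputed scores can differ from the reported ones in the last decimal place.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the metrics table (rounded to four decimals).
f1_custom = f1_from_precision_recall(1.0000, 0.8947)
f1_sklearn = f1_from_precision_recall(0.9722, 0.9211)

print(f"CustomSVM  F1: {f1_custom:.3f}")
print(f"SklearnSVM F1: {f1_sklearn:.3f}")
```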

---

### 🕒 Execution Time

| Samples | Model        | Fit Time (s) | Predict Time (s) |
|---------|--------------|--------------|------------------|
| 100     | CustomSVM    | 0.76         | **0.00003**      |
| 100     | SklearnSVM   | **0.0018**   | 0.00032          |
| 5000    | CustomSVM    | 35.85        | **0.00009**      |
| 5000    | SklearnSVM   | **1.59**     | 0.20             |

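- Scikit-learn's implementation trains far faster: about 0.0018 s vs. 0.76 s at 100 samples, and 1.59 s vs. 35.85 s at 5000 samples.
- Prediction is the exception. CustomSVM predicts faster in these measurements, while SklearnSVM's predict time grows to 0.20 s at 5000 samples. A likely reason is that `SVC` evaluates the kernel against every support vector at prediction time, whereas a purely linear model needs only one dot product per sample.
- The custom SVM uses a full stochastic gradient descent loop with manual hinge loss updates, which results in much slower training, especially on larger datasets.
- `SVC` is built on the optimized C/C++ library `libsvm` (`liblinear` backs `LinearSVC` instead), which accounts for much of its training speed.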
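
CustomSVM's predict times in the table stay in the tens of microseconds, which is what one would expect if prediction reduces to a single matrix-vector product. The sketch below assumes that form (the sign of $w \cdot x - b$). The function `predict_linear` and the random weights are hypothetical stand-ins, since the `mlfs` source is not shown here.

```python
import time
import numpy as np

def predict_linear(X, w, b):
    """Hypothetical linear-SVM prediction: class 1 where w . x - b >= 0, else 0."""
    return (X @ w - b >= 0).astype(int)

rng = np.random.default_rng(42)
n_features = 4
w = rng.normal(size=n_features)  # made-up weights, not a trained model
b = 0.1

for n_samples in (100, 5000):
    X = rng.normal(size=(n_samples, n_features))

    start = time.perf_counter()
    preds = predict_linear(X, w, b)
    elapsed = time.perf_counter() - start

    print(f"{n_samples:>5} samples: {elapsed:.6f} s for {preds.shape[0]} predictions")
```

Measured times depend on the machine, so no reference output is given.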

---

### ⚖️ Conclusion

- The custom implementation produces comparable accuracy but is much slower to train.
- While it performs well in terms of precision, its lower recall suggests room for improvement, for example by tuning the learning rate, the regularization strength, or the number of iterations.
- Scikit-learn remains the preferred choice for real-world applications due to its optimization and speed, but the custom SVM serves as a solid educational baseline for understanding the underlying mechanics of margin-based classifiers.
