```{contents}
```

## Hyperparameter Tuning Intuition

This section covers the **mathematical intuition behind the key hyperparameters of SVC**, focusing on the three most important ones: **C, γ (gamma), and kernel**.

---

### Objective Function of SVC

The **primal optimization problem** of SVM is:

$$
\min_{w, b, \xi} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \xi_i
$$

subject to:

$$
y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0
$$

Where:

* $w$: weight vector
* $b$: bias term
* $\xi_i$: slack variables (allow points to violate the margin or be misclassified)
* $C$: **regularization parameter** (controls the penalty on margin violations)
* $\phi(x)$: feature mapping (depends on kernel)
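
To make the objective concrete, here is a minimal sketch that evaluates it for a fitted linear SVC, where $\phi(x) = x$. The toy data from `make_blobs`, the value `C = 1.0`, and the variable names are assumptions chosen for illustration. At the optimum each slack equals $\xi_i = \max(0, 1 - y_i (w^T x_i + b))$, which is what the code computes.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical toy data: two overlapping blobs, labels mapped to {-1, +1}.
X, y01 = make_blobs(n_samples=60, centers=[[0, 0], [3, 3]],
                    cluster_std=1.5, random_state=0)
y = np.where(y01 == 1, 1.0, -1.0)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

w = clf.coef_.ravel()      # weight vector (available for the linear kernel only)
b = clf.intercept_[0]      # bias term

# Slack variables: how far each point falls short of the margin constraint.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

# Primal objective: margin term plus C times the total slack.
objective = 0.5 * w @ w + C * xi.sum()
print("margin term:", 0.5 * w @ w)
print("slack term :", C * xi.sum())
print("objective  :", objective)
```

Points with $\xi_i = 0$ satisfy the margin constraint; points with $\xi_i > 1$ are misclassified.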

---

### Role of **C** (Regularization)

From the objective function:

* The term $\frac{1}{2} \|w\|^2$ → maximizes the margin, since the margin width is $2 / \|w\|$.
* The term $C \sum \xi_i$ → penalizes margin violations and misclassifications.

👉 Intuition:

* **Small C** → margin maximization dominates (tolerates some errors).

  * Simpler decision boundary.
  * Reduces the risk of overfitting (but a C that is too small can underfit).
* **Large C** → error penalty dominates (forces correct classification of training data).

  * Narrow margin.
  * Risk of overfitting.
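
The effect of C on the margin can be checked numerically. The following is a rough sketch, assuming overlapping toy blobs and a linear kernel so that the margin width $2 / \|w\|$ can be read off `coef_`; the three C values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (hypothetical toy data).
X, y = make_blobs(n_samples=100, centers=[[0, 0], [2, 2]],
                  cluster_std=1.5, random_state=1)

results = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_)   # margin width = 2 / ||w||
    n_sv = int(clf.n_support_.sum())                 # total support vectors
    results[C] = (margin_width, n_sv, clf.score(X, y))
    print(f"C={C:<6} margin width={margin_width:.3f}  "
          f"support vectors={n_sv}  train acc={results[C][2]:.2f}")
```

On data like this, the margin should shrink as C grows, because the optimizer accepts a larger $\|w\|$ in exchange for less total slack.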

---

### Role of **γ (Gamma)** in RBF Kernel

The **RBF kernel** is:

$$
K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)
$$

👉 Interpretation:

* If $\|x_i - x_j\|$ is small → similarity close to 1.

* If $\|x_i - x_j\|$ is large → similarity close to 0.

* $\gamma$ controls the **decay rate** of similarity.

* **Small γ:**

  * Kernel is smoother, points far apart are still considered similar.
  * Leads to a smooth, less complex decision boundary.

* **Large γ:**

  * Kernel is sharper, only very close neighbors are considered similar.
  * Leads to highly complex decision boundary (can overfit).
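
The decay behaviour can be seen by evaluating the kernel directly. This is a small sketch; the helper name `rbf`, the sample points, and the γ values are assumptions for illustration. The last lines cross-check the hand-written formula against scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x_i, x_j, gamma):
    """RBF kernel: exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

x = np.array([0.0, 0.0])
near = np.array([0.1, 0.0])   # distance 0.1 from x
far = np.array([3.0, 0.0])    # distance 3.0 from x

for gamma in [0.01, 1.0, 100.0]:
    print(f"gamma={gamma:<6} K(x, near)={rbf(x, near, gamma):.4f}  "
          f"K(x, far)={rbf(x, far, gamma):.4f}")

# Cross-check against scikit-learn's implementation.
reference = rbf_kernel(x.reshape(1, -1), near.reshape(1, -1), gamma=1.0)[0, 0]
print("matches sklearn:", np.isclose(rbf(x, near, 1.0), reference))
```

With a small γ even the far point keeps a similarity close to 1, while with a large γ the similarity drops noticeably even for the near point.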

---

### Role of **Kernel Choice**

The kernel implicitly defines $\phi(x)$, the transformation of the data, through $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$:

* **Linear kernel:**

  $$
  K(x_i, x_j) = x_i^T x_j
  $$

  → Works well if the data is (approximately) linearly separable; γ is not used.

* **Polynomial kernel:**

  $$
  K(x_i, x_j) = (x_i^T x_j + c)^d
  $$

  → Captures polynomial relationships; degree $d$ and offset $c$ are hyperparameters (`degree` and `coef0` in scikit-learn, which also scales the inner product by γ).

* **RBF kernel:**

  $$
  K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)
  $$

  → Very flexible, maps data to infinite-dimensional feature space.
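
To illustrate why kernel choice matters, the sketch below compares the three kernels on concentric circles, a dataset that no straight line can separate. The dataset, its parameters, and the polynomial settings (`degree=2`, `coef0=1`) are assumptions picked for the demonstration, not recommended defaults.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical toy data: an inner circle surrounded by an outer ring.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

models = {
    "linear": SVC(kernel="linear"),
    "poly (d=2, c=1)": SVC(kernel="poly", degree=2, coef0=1),
    "rbf": SVC(kernel="rbf"),
}

# Mean 5-fold cross-validated accuracy per kernel.
cv_scores = {name: cross_val_score(model, X, y, cv=5).mean()
             for name, model in models.items()}

for name, score in cv_scores.items():
    print(f"{name:<16} mean CV accuracy = {score:.2f}")
```

The linear kernel should perform near chance here, while the RBF kernel can separate the classes because the boundary is a closed curve in the original space.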

---

### How C and γ Interact

* **High C + High γ:**

  * Very complex model, tries to classify everything correctly.
  * Risk of overfitting.

* **Low C + Low γ:**

  * Very smooth decision boundary, high bias.
  * Risk of underfitting.

* **Balanced values:**

  * Trade-off between margin size, misclassification, and flexibility.
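
The three regimes above can be compared by looking at the gap between training and test accuracy. This is a rough sketch on noisy `make_moons` data; the specific (C, γ) pairs and the names in `settings` are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical noisy two-class data.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

settings = {
    "low C, low gamma": (0.01, 0.01),
    "balanced": (1, 1),
    "high C, high gamma": (1000, 100),
}

scores = {}
for name, (C, gamma) in settings.items():
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
    scores[name] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"{name:<20} train acc={scores[name][0]:.2f}  test acc={scores[name][1]:.2f}")
```

A large gap between training and test accuracy suggests overfitting, while low accuracy on both suggests underfitting.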

---

### Decision Function

The final decision function of SVC is:

$$
f(x) = \text{sign}\Big(\sum_{i=1}^N \alpha_i y_i K(x_i, x) + b\Big)
$$

* $\alpha_i$: learned weights (nonzero only for support vectors).
* $K(x_i, x)$: similarity function (depends on γ and kernel).
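* $C$: upper-bounds each weight ($0 \leq \alpha_i \leq C$) and influences how many support vectors exist (a smaller C widens the margin, so more points typically fall on or inside it and become support vectors).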
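
The decision function can be rebuilt by hand from a fitted model. The sketch below assumes a binary problem and an RBF kernel with a fixed `gamma`; the toy data and variable names are illustrative. In scikit-learn, `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Hypothetical binary toy data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

C, gamma = 1.0, 0.5
clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)

# K[i, j] = K(support_vector_i, x_j), shape (n_support_vectors, n_samples).
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)

# sum_i alpha_i y_i K(x_i, x) + b, using only the support vectors.
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()

# The sign of the score selects the class.
manual_pred = np.where(manual > 0, clf.classes_[1], clf.classes_[0])

print("support vectors:", len(clf.support_vectors_), "of", len(X))
print("scores match   :", np.allclose(manual, clf.decision_function(X)))
print("labels match   :", np.array_equal(manual_pred, clf.predict(X)))
```

Only the support vectors enter the sum, which is why prediction cost scales with their number rather than with the full training set size.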

---

**Summary (Mathematical Intuition):**

* **C** → controls penalty on misclassified points ($\xi_i$).
* **γ** → controls how similarity decays in RBF kernel.
* **Kernel** → defines feature space transformation.
* Together, they shape the **decision boundary**: wide vs narrow, smooth vs complex.


```python
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load dataset
X, y = datasets.load_iris(return_X_y=True)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Base SVC model
svc = SVC()

# Hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'rbf', 'poly']
}

# GridSearchCV
grid = GridSearchCV(
    estimator=svc,
    param_grid=param_grid,
    refit=True,        # keep best model
    cv=5,              # 5-fold cross-validation
    verbose=2,
    n_jobs=-1          # use all CPUs
)

# Fit
grid.fit(X_train, y_train)

# Best hyperparameters
print("Best Parameters:", grid.best_params_)

# Use best model for predictions
y_pred = grid.predict(X_test)

# Evaluation
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
```

```text
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best Parameters: {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}

Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
```

