# CS334 Lab2: Support Vector Machines

**Total:** 100 points  

This lab covers the core ideas from the SVM chapter: large margin intuition, feature scaling, soft-margin (C), kernels (RBF γ), and when to choose `LinearSVC` vs `SVC` vs `SGDClassifier`.

## Learning goals
- Train a linear SVM with proper scaling and interpret the `decision_function()` score.
- See how **C** and **γ (gamma)** control the flexibility of an RBF-kernel SVM.
- Compare `LinearSVC`, `SVC`, and `SGDClassifier` in practice and connect results to time complexity.


## Setup
Run the cell below once.

In [5]:
import numpy as np
import pandas as pd  # used for the results tables in Tasks 2 and 3
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris, make_moons, make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
import time

# Helper function to plot decision regions
def plot_decision_regions(clf, X, y, title=None, ax=None, h=0.02):
    if ax is None:
        fig, ax = plt.subplots()

    x_min, x_max = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    y_min, y_max = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    grid = np.c_[xx.ravel(), yy.ravel()]
    Z = clf.predict(grid).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.25)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    if title:
        ax.set_title(title)
    return ax

## Task 1 — Linear SVM + scaling + confidence scores (30 pts)
**Goal:** Train a linear SVM to detect *Iris virginica* using petal length/width.

Steps:
1. Load iris. Use only the 2 features: **petal length** and **petal width**.
2. Split into train/test (80/20), with `random_state=42` and `stratify=y`.
3. [10 pts] Build a pipeline: `StandardScaler()` → `LinearSVC(C=1)`. Train the pipeline and print train/test accuracy.
4. [10 pts] Compute `decision_function()` for the two points: `[[5.5, 1.7], [5.0, 1.5]]`, and print the results.
5. [10 pts] Explain what the sign and magnitude of `decision_function()` mean.
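
Hint (illustration only, not the lab solution): the sketch below uses a toy `make_blobs` dataset, which is an assumption and not the iris data, to show where `decision_function()` values come from. For a fitted `LinearSVC` inside a scaling pipeline, the score can be recomputed by hand as `w · x_scaled + b`. The names `X_toy`, `toy_clf`, and `manual` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy two-class data (assumed for illustration; NOT the lab's iris data)
X_toy, y_toy = make_blobs(n_samples=200, centers=2, cluster_std=1.0,
                          random_state=0)

toy_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=0))
toy_clf.fit(X_toy, y_toy)

# Scores and predictions for the first five points
scores = toy_clf.decision_function(X_toy[:5])
preds = toy_clf.predict(X_toy[:5])

# Recompute the same scores by hand: scale, then apply w . x + b
scaler = toy_clf.named_steps["standardscaler"]
svm = toy_clf.named_steps["linearsvc"]
manual = scaler.transform(X_toy[:5]) @ svm.coef_.ravel() + svm.intercept_[0]

for s, m, p in zip(scores, manual, preds):
    print(f"score={s:+.3f}  manual={m:+.3f}  predicted class={p}")
```

Compare the printed scores with the predicted classes before writing your answer to step 5.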

In [None]:
# Load iris data and use two features ["petal length (cm)", "petal width (cm)"].
# All three classes are kept; the target is binarized to virginica vs. not virginica.
iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2).astype(int).values  # 1 = virginica, 0 = not virginica

# Split into train/test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Your code: Build a pipeline: StandardScaler() -> LinearSVC(C=1)


# Your code: Train the pipeline, report train/test accuracy.


# Your code: Compute decision_function() for the two points: [[5.5, 1.7], [5.0, 1.5]]



Your answer to 5:

## Task 2 — RBF kernel: tuning **γ (gamma)** and **C** (30 pts)
**Goal:** See underfitting vs overfitting by sweeping γ and C.

Steps:
1. (code provided) Generate the moons dataset with `make_moons` and split it into train/test sets.
2. [10 pts] Train `SVC(kernel='rbf')` on a small grid:
   - `gamma` in `{0.1, 1, 10}`
   - `C` in `{0.1, 1, 100}`
3. [10 pts] For each combo, report train accuracy and test accuracy. Use pandas to display results in a table (sorted by gamma and C).
4. [10 pts] Pick one underfitting and one overfitting setting and explain why.
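
Hint (illustration only, not the lab solution): one possible way to organize a hyperparameter sweep is to loop over every pair with `itertools.product`, collect one dictionary per model, and build a DataFrame at the end. The sketch below assumes a different toy dataset (`make_circles`) and a different, smaller grid than the one required above. The names `Xc`, `rows`, and `demo_results` are hypothetical.

```python
from itertools import product

import pandas as pd
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data and grid (assumed for illustration; NOT the lab's moons data or grid)
Xc, yc = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    Xc, yc, test_size=0.25, random_state=0, stratify=yc
)

rows = []
for gamma, C in product([0.01, 5], [0.01, 10]):
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=C))
    model.fit(Xc_tr, yc_tr)
    rows.append({
        "gamma": gamma,
        "C": C,
        "train_acc": model.score(Xc_tr, yc_tr),  # accuracy on training data
        "test_acc": model.score(Xc_te, yc_te),   # accuracy on held-out data
    })

# One row per (gamma, C) pair, sorted for easy reading
demo_results = (
    pd.DataFrame(rows).sort_values(["gamma", "C"]).reset_index(drop=True)
)
print(demo_results)
```

A large gap between `train_acc` and `test_acc` in a row, or low values in both columns, is the kind of evidence step 4 asks you to interpret.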

In [None]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

gammas = [0.1, 1, 10]
Cs = [0.1, 1, 100]

# Your code: train SVC with rbf kernel on a small grid

# Your code: for each combo, report train accuracy and test accuracy



Your answer to 4:

## Task 3 — Choosing `LinearSVC` vs `SVC` vs `SGDClassifier` (40 pts)
**Goal:** Compare accuracy and training time, then connect to computational complexity.

Steps:
1. (code provided) Create a large dataset whose classes are well separated (roughly, though not guaranteed to be exactly, linearly separable), e.g.
   `make_classification(n_samples=8000, n_features=20, n_informative=10, class_sep=1.5, random_state=42)`.
2. [10 pts] Train these four models (each with scaling):
   - `LinearSVC(C=1)`
   - `SVC(kernel='linear', C=1)`
   - `SVC(kernel='rbf', C=1)`
   - `SGDClassifier(loss='hinge', alpha=1e-4)`
3. [10 pts] Measure training time and test accuracy for each model. Save the results in a DataFrame (3 columns: model, train_time_sec, test_acc) and display it.
4. [20 pts] Briefly justify which model you’d choose for:
   - (a) huge dataset (millions of rows)
   - (b) small/medium dataset but potentially nonlinear
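
Hint (illustration only, not the lab solution): to connect timings to computational complexity, it can help to time the same model at several sample sizes and watch how the fit time grows. The sketch below assumes small sample sizes and only two of the four models so that it runs quickly. The names `time_fit`, `timing_rows`, and `timing` are hypothetical, and the measured times will vary between machines and runs.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC


def time_fit(model, X, y):
    """Return the wall-clock seconds taken by model.fit(X, y)."""
    t0 = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - t0


timing_rows = []
for n in [500, 1000, 2000]:  # assumed sizes, kept small on purpose
    Xn, yn = make_classification(n_samples=n, n_features=20,
                                 n_informative=10, random_state=0)
    candidates = [
        ("LinearSVC", LinearSVC(C=1, random_state=0)),
        ("SVC rbf", SVC(kernel="rbf", C=1)),
    ]
    for name, estimator in candidates:
        pipe = make_pipeline(StandardScaler(), estimator)
        timing_rows.append({"n_samples": n, "model": name,
                            "fit_sec": time_fit(pipe, Xn, yn)})

# Rows: sample size; columns: model; values: fit time in seconds
timing = pd.DataFrame(timing_rows).pivot(
    index="n_samples", columns="model", values="fit_sec"
)
print(timing)
```

Comparing how each column changes as `n_samples` doubles gives an empirical view of scaling that you can reference in your answer to step 4.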

In [8]:
X, y = make_classification(
    n_samples=8000,
    n_features=20,
    n_informative=10,
    n_redundant=2,
    class_sep=1.5,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

def fit_time_and_acc(model, X_train, y_train, X_test, y_test):
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t1 = time.perf_counter()
    acc = accuracy_score(y_test, model.predict(X_test))
    return (t1 - t0), acc

# Your code: train models (each with scaling)

# Your code: Measure training time and test accuracy for each model, save in a dataframe (3 columns: model, train_time_sec, test_acc), display the dataframe.



Your answer to 4: