<a href="https://colab.research.google.com/github/wikit4/bts-face-recognition/blob/main/Classes_05_LinearSVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification: Support Vector Machine with linear kernel. Hard and soft margins

Imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, balanced_accuracy_score
from sklearn.inspection import DecisionBoundaryDisplay
import pandas as pd
import seaborn as sns
import io
from sklearn import set_config
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

sns.set_theme(style="whitegrid", palette="colorblind")
plt.rcParams["figure.figsize"] = (10,7)

In [None]:
# constants
test_size=0.2
random_state=42

In [None]:
def compute_score_classification(y_true, y_pred):
  '''
  Helper function for printing scores.

  Parameters:
  y_true: ndarray of y values from original dataset.
  y_pred: ndarray of y values predicted with given model.

  Return:
  dictionary object that consists of accuracy and classification report.

  '''
  return {
      "Accuracy": f"{accuracy_score(y_true, y_pred):.3f}",
      "Classification Report": classification_report(y_true, y_pred),
  }

## Exercise 1

1. Use the generated toy data below to create the simplest linear SVM classification model. Use the scikit-learn [`SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) class with the `kernel="linear"` parameter.

In [None]:
# we create 100 separable points
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# plot our data
plt.scatter(
    X[:, 0],
    X[:, 1],
    c=y,
    cmap=plt.cm.Paired
)
plt.show()

2. Create the classification model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

# create object of linear SVC estimator
# your code here
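
One possible way to complete this step, offered as a minimal sketch rather than the reference solution: fit `SVC(kernel="linear")` on the training split and score it on the held-out split. The name `clf` is chosen to match the plotting cell that follows, and the data setup repeats the cells above so that the block runs on its own.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# same toy data and split as in the cells above
X, y = make_blobs(n_samples=100, centers=2, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

# linear-kernel SVM with the default regularization strength (C=1.0)
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# evaluate on the held-out split
y_pred = clf.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_acc:.3f}")
print(f"Support vectors per class: {clf.n_support_}")
```

After fitting, `clf.support_vectors_` holds the training points that define the margin, which is what the plotting cell highlights.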

3. Plot decision boundaries.

The code below uses the [`DecisionBoundaryDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html#sklearn.inspection.DecisionBoundaryDisplay) class to visualize the SVM decision boundaries. It also highlights the support vectors stored in the classifier after training.

In [None]:
# plot data
plt.scatter(
    X_train[:, 0],
    X_train[:, 1],
    c=y_train,
    s=30,
    cmap=plt.cm.Paired
)

# plot the decision function
ax = plt.gca()
DecisionBoundaryDisplay.from_estimator(
    clf,
    X_train,
    plot_method="contour",
    colors="k",
    levels=[-1, 0, 1],
    alpha=0.5,
    linestyles=["--", "-", "--"],
    ax=ax,
)

# plot support vectors
ax.scatter(
    clf.support_vectors_[:, 0],
    clf.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
)

plt.show()

The plot makes it clear that this classification problem is easy: the two groups of points are well separated, so a linear decision boundary is sufficient and only a few points near the margin act as support vectors.

## Exercise 2

Now compare the classification results and the decision boundaries of estimators with **hard** and **soft margins**. Which parameter in `SVC` controls the softness of the margin? Consider which values correspond to soft and hard margins.
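
As a hedged illustration of the idea, not a complete answer: in `SVC` the penalty on margin violations is set by `C`. A very large `C` approximates a hard margin, while a small `C` gives a softer, wider margin that tolerates violations. The values `1000` and `0.01` below are arbitrary choices for illustration, and the margin width is computed as $2 / \lVert w \rVert$ from `coef_`, which is available only for the linear kernel.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# same toy data and split as in Exercise 1
X, y = make_blobs(n_samples=100, centers=2, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

margins = {}
for label, C in [("hard-like", 1000), ("soft", 0.01)]:
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    # for a linear SVM the margin width equals 2 / ||w||
    width = 2.0 / np.linalg.norm(clf.coef_)
    margins[label] = width
    print(
        f"{label:>9} (C={C:g}): margin width={width:.3f}, "
        f"support vectors={clf.n_support_.sum()}, "
        f"test accuracy={clf.score(X_test, y_test):.3f}"
    )
```

Passing each fitted `clf` to the plotting code from Exercise 1 shows the same difference visually: the dashed margin lines move apart as `C` decreases.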

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

# your code here

## Exercise 3: Real data
Now apply a linear SVM to real-world data.

### Load dataset

In [None]:
df = pd.read_csv('data_neo-ffi_religion.csv')
# log-transform the Orthodoxy scores
df['Orthodoxy'] = np.log(df['Orthodoxy'])


# add class indicator: either External Critique or Orthodoxy
df['class'] = df[['External Critique', 'Orthodoxy']].idxmax(axis=1)
df.head()

Inspect the dataset

In [None]:
df.describe()

### Model

Create a simple SVM classification model with a linear kernel: *class ~ Agreeableness + Openness*.

1. Check the accuracy and other statistics of the model.
2. Check for overfitting.
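
The workflow above could look roughly like the following sketch. The column names `Agreeableness`, `Openness`, and `class` come from this notebook, but the values are random stand-ins generated here so that the block runs without `data_neo-ffi_religion.csv`; the labelling rule is made up purely for illustration, and `demo` is a hypothetical name. With the real data, `demo` would simply be replaced by `df`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# ASSUMPTION: random stand-in data, NOT the real data_neo-ffi_religion.csv
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Agreeableness": rng.normal(30, 6, n),
    "Openness": rng.normal(28, 6, n),
})
# made-up noisy labelling rule, for illustration only
score = demo["Openness"] - demo["Agreeableness"] + rng.normal(0, 4, n)
demo["class"] = np.where(score > 0, "External Critique", "Orthodoxy")

# features and target as NumPy arrays, as the plotting cell expects
X = demo[["Agreeableness", "Openness"]].to_numpy()
y = demo["class"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = SVC(kernel="linear").fit(X_train, y_train)

# compare accuracy on seen and unseen data
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(classification_report(y_test, clf.predict(X_test)))
```

A training accuracy well above the test accuracy would suggest overfitting; similar values suggest the model generalizes about as well as it fits.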

In [None]:
# your code here

Plot the decision boundaries

In [None]:
colors = ['red' if x == 'External Critique' else 'blue' for x in y_train]
plt.scatter(
    X_train[:, 0],
    X_train[:, 1],
    c=colors,  # explicit color names, so no colormap is needed
    s=30,
)

# plot the decision function
ax = plt.gca()
DecisionBoundaryDisplay.from_estimator(
    clf,
    X_train,
    plot_method="contour",
    colors="k",
    levels=[-1, 0, 1],
    alpha=0.5,
    linestyles=["--", "-", "--"],
    ax=ax,
)
# plot support vectors
ax.scatter(
    clf.support_vectors_[:, 0],
    clf.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
)
plt.show()

### (Exercise 3.1)

Plot how the accuracy on the training and testing datasets depends on the regularization parameter `C`.
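
A minimal sketch of such a plot, assuming synthetic two-feature data from `make_classification` in place of the real dataset and an illustrative log-spaced grid of `C` values (both are assumptions, not part of the exercise):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# ASSUMPTION: synthetic noisy two-feature data standing in for the real dataset
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# log-spaced grid of C values (illustrative range)
C_values = np.logspace(-3, 2, 11)
train_scores, test_scores = [], []
for C in C_values:
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    train_scores.append(clf.score(X_train, y_train))
    test_scores.append(clf.score(X_test, y_test))

# C spans several orders of magnitude, so use a logarithmic x-axis
plt.semilogx(C_values, train_scores, marker="o", label="train")
plt.semilogx(C_values, test_scores, marker="o", label="test")
plt.xlabel("C")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```

A widening gap between the two curves as `C` grows would be one sign of overfitting.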

In [None]:
# your code here

### (Exercise 3.2)

Some of the problems might stem from class imbalance. Think: how can we address this issue? Try to create a model that accounts for the class imbalance.
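
One common option, which is an assumption here and not necessarily the intended answer, is to scale the penalty `C` per class with `class_weight="balanced"`, a parameter that `SVC` supports, and to judge the model with imbalance-aware metrics such as `balanced_accuracy_score` and the recall of the minority class. The sketch below uses synthetic data with roughly a 90/10 class split in place of the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# ASSUMPTION: synthetic data with roughly a 90/10 class split
X, y = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)
# stratify so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = {}
for name, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = SVC(kernel="linear", class_weight=class_weight).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = {
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "minority_recall": recall_score(y_test, y_pred, pos_label=1),
    }
    print(name, results[name])
```

Plain accuracy can look good on imbalanced data even when the minority class is mostly missed, which is why the comparison reports balanced accuracy and minority recall instead.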

In [None]:
# your code here