# Support Vector Machine


In [13]:
# fit an SVM on an imbalanced classification dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import precision_score, recall_score, accuracy_score,\
                            f1_score, confusion_matrix


## 1. Imbalanced Data and SVM

It is well known that, for imbalanced data, we can set the parameter 'class_weight' to deal with the skewed class distribution. In other words, we should balance the classes before fitting an SVM model to the data. At one point, this fact confused me. To my knowledge, the decision boundary of an SVM is determined only by the support vectors, the instances on the edge of the margin, so whether the class distribution is skewed or not should not affect the support vectors. After a while, I realized that, in practice, an SVM is usually a soft-margin classifier. For imbalanced data, the model should tolerate more margin violations in the majority class than in the minority class. The parameter 'C' controls this tolerance: the larger C is, the fewer violations are tolerated. So, as C increases, the effect of 'class_weight' should disappear.
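To make the interaction between 'C' and 'class_weight' concrete, here is a minimal sketch of the weights that `class_weight='balanced'` assigns, using scikit-learn's documented heuristic `n_samples / (n_classes * count)`. The helper name `balanced_class_weights`, the label list, and the value of `C` are assumptions chosen for illustration only. Since scikit-learn sets the penalty of class i to `class_weight[i] * C`, the sketch also prints the resulting per-class penalty.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Hypothetical helper: weights per the 'balanced' heuristic,
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Illustrative labels with a 99% / 1% split (an assumption, not real data)
labels = [0] * 9900 + [1] * 100
weights = balanced_class_weights(labels)

# Per-class penalty: violations in class i are penalised with C * weight[i]
C = 10
effective_C = {cls: C * w for cls, w in weights.items()}

print("class weights:", weights)
print("effective C per class:", effective_C)
```

With this split, a margin violation in the minority class costs roughly 100 times as much as one in the majority class, which is why the weighted model trades precision for recall when C is small.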

Let me check this conclusion!

In [52]:
#-------------------------------------------------------------------------
def scores_predict(y_pred, data_y):
    """
    Compute the confusion matrix and the classification metrics for the
    predictions 'y_pred' against the true labels 'data_y'.
    The metrics returned are ['precision', 'recall', 'accuracy', 'f1',
    'f1_micro', 'f1_macro', 'f1_weighted'].
    """
    scores = {'conf_matrix': confusion_matrix(data_y, y_pred),
              'precision': precision_score(data_y, y_pred),
              'recall': recall_score(data_y, y_pred),
              'accuracy': accuracy_score(data_y, y_pred),
              'f1': f1_score(data_y, y_pred),
              'f1_micro': f1_score(data_y, y_pred, average='micro'),
              'f1_macro': f1_score(data_y, y_pred, average='macro'),
              'f1_weighted': f1_score(data_y, y_pred, average='weighted')}
    #print("Confusion matrix:" )
    #print(scores['conf_matrix'])
    return scores
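As a sanity check on what these metrics mean, the sketch below computes precision, recall, accuracy, and F1 by hand from the four cells of a binary confusion matrix (scikit-learn lays it out as `[[tn, fp], [fn, tp]]`). The function name `metrics_from_counts` and the counts are hypothetical, picked only to illustrate a 1% minority class; they are not taken from an actual run.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Hypothetical helper: binary metrics for the positive class,
    computed directly from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Illustrative counts: 100 positives out of 10000 samples, 61 of them found,
# and no false positives (assumed values, not measured ones)
m = metrics_from_counts(tp=61, fp=0, fn=39, tn=9900)
for name, value in m.items():
    print(f"{name}: {value:0.4f}")
```

Note that accuracy stays above 0.99 even though 39 of the 100 positives are missed, which is why precision, recall, and F1 are the metrics to watch on imbalanced data.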


In [53]:
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)

In [56]:
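# compare an unweighted and a class-weighted LinearSVC for C = 10^1 ... 10^10
# (for large C, LinearSVC may hit its iteration limit before converging)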
for index in range(1, 11):
    model = LinearSVC(C=10**index, random_state=1, class_weight=None)
    model_weighted = LinearSVC(C=10**index, random_state=1, class_weight='balanced')
    pred = cross_val_predict(model, X, y, cv=10, n_jobs=-1)
    scores = scores_predict(pred, y)
    print(f'C = {10**index}:------------------------')
    print(f"Imbalanced----precision: {scores['precision']:0.3f}, recall: {scores['recall']:0.3f}, f1: {scores['f1']:0.3f}")
    pred = cross_val_predict(model_weighted, X, y, cv=10, n_jobs=-1)
    scores = scores_predict(pred, y)
    print(f"  Balanced----precision: {scores['precision']:0.3f}, recall: {scores['recall']:0.3f}, f1: {scores['f1']:0.3f}")

C = 10:------------------------
Imbalanced----precision: 1.000, recall: 0.610, f1: 0.758
  Balanced----precision: 0.157, recall: 0.900, f1: 0.267
C = 100:------------------------
Imbalanced----precision: 1.000, recall: 0.630, f1: 0.773
  Balanced----precision: 0.664, recall: 0.750, f1: 0.704
C = 1000:------------------------
Imbalanced----precision: 0.833, recall: 0.600, f1: 0.698
  Balanced----precision: 0.977, recall: 0.430, f1: 0.597
C = 10000:------------------------
Imbalanced----precision: 0.942, recall: 0.490, f1: 0.645
  Balanced----precision: 0.581, recall: 0.540, f1: 0.560
C = 100000:------------------------
Imbalanced----precision: 0.389, recall: 0.350, f1: 0.368
  Balanced----precision: 0.667, recall: 0.560, f1: 0.609
C = 1000000:------------------------
Imbalanced----precision: 0.742, recall: 0.660, f1: 0.698
  Balanced----precision: 0.934, recall: 0.570, f1: 0.708
C = 10000000:------------------------
Imbalanced----precision: 0.936, recall: 0.440, f1: 0.599
  Balanced----

Clearly, when C is equal to or greater than $10^8$, 'class_weight' plays no role.