# SVM

An SVM (support vector machine) tries to find the hyperplane that best separates observations of different classes. It does so by maximizing the margin, i.e. the distance between the hyperplane and the closest data points of each class (the support vectors).

SVMs usually work best on high-dimensional data in which the classes are reasonably distinct, and they can perform well even with small training sets.
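To make the margin idea concrete, here is a minimal sketch on made-up 2-D points (the data and the large `C` value are assumptions chosen for illustration, not part of our sentiment task). It uses `SVC(kernel="linear")` rather than the `LinearSVC` used later, only because `SVC` exposes the support vectors directly.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (hypothetical values, for illustration only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane

# For a linear SVM the full margin width is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(w)

# Distance from each support vector to the hyperplane (half the margin width).
sv_dist = np.abs(clf.support_vectors_ @ w + b) / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
print("support vector distances:", sv_dist)
```

Only the points closest to the boundary end up as support vectors; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.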

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

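# Load the cleaned Sentiment140 data; cast the text column to str so that
# missing values don't break the vectorizer later on.
df = pd.read_csv('sentiment140_cleaned.csv')
texts = df['clean_text'].astype(str)
labels = df['sentiment']

# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=734)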

For our baseline we relied on a simple bag-of-words (BOW) model to vectorize our text snippets. For the SVM, let's try a more refined approach, TF-IDF (term frequency-inverse document frequency), which attaches additional weight to more 'important' tokens.

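TF measures how often a term appears in a given document.

IDF measures how rare a term is across all documents.

Multiplying the two puts extra weight on terms that are frequent in a document but rare across the corpus, and down-weights common filler words like 'I' and 'the'. (TF-IDF alone does not remove such words; the `stop_words="english"` setting used below is what filters them out entirely.)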

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words = "english")),
    ("svm", LinearSVC())
])

model.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')),
                ('svm', LinearSVC())])

With the classifier trained, we can now evaluate it on the test set!

In [3]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.72      0.68      0.70      1015
           1       0.69      0.73      0.71       985

    accuracy                           0.70      2000
   macro avg       0.71      0.70      0.70      2000
weighted avg       0.71      0.70      0.70      2000



Oof, with an accuracy of 70% we don't even match the baseline of 74%. Let's use GridSearchCV to compare the cross-validated performance of several hyperparameter combinations.

In [4]:
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True,
                              stop_words="english",
                              ngram_range=(1,2),
                              max_df=0.9,
                              min_df=2)),
    ("svm", LinearSVC(class_weight='balanced', max_iter=5000))
])

param_grid = {
    "tfidf__ngram_range": [(1,1), (1,2)],  # unigrams only, or unigrams + bigrams
    "tfidf__max_df": [0.8, 0.9, 1.0],
    "svm__C": [0.1, 1, 10]
}

grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Train accuracy:", grid.score(X_train, y_train))
print("Test accuracy:", grid.score(X_test, y_test))

Best params: {'svm__C': 0.1, 'tfidf__max_df': 0.8, 'tfidf__ngram_range': (1, 2)}
Train accuracy: 0.849375
Test accuracy: 0.7165


Even our best model (72%) doesn't reach a test accuracy above our 74% baseline. Also note that the training accuracy (85%) is much higher than the test accuracy (72%), which indicates that the model is overfitting and not generalizing well.
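One way to look at this gap per hyperparameter setting is to ask `GridSearchCV` to also record training scores via `return_train_score=True`. The sketch below runs on a tiny made-up corpus so that it is self-contained (the sentences, labels, and the `C` values are assumptions for illustration); in this notebook the same pattern would be applied to `X_train` and `y_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny stand-in corpus (hypothetical); replace with X_train / y_train in practice.
pos = ["love this", "great day", "so happy today", "what a wonderful time", "really good news"]
neg = ["hate this", "awful day", "so sad today", "what a terrible time", "really bad news"]
texts = (pos + neg) * 4
labels = ([1] * len(pos) + [0] * len(neg)) * 4

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])

# return_train_score=True stores the training-fold accuracy next to the validation accuracy.
grid = GridSearchCV(pipe, {"svm__C": [0.01, 0.1, 1]}, cv=4, return_train_score=True)
grid.fit(texts, labels)

results = grid.cv_results_
for C, train_acc, val_acc in zip(results["param_svm__C"],
                                 results["mean_train_score"],
                                 results["mean_test_score"]):
    # A large positive gap between train and validation accuracy suggests overfitting.
    print(f"C={C}: train={train_acc:.3f}  cv={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

Smaller values of `C` mean stronger regularization, so if the gap shrinks as `C` decreases, extending the grid toward smaller `C` may be worth trying.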

This kind of overfitting is actually typical for these kinds of datasets, since a TF-IDF/SVM model can be sensitive to rare terms that appear in only a few training examples. Let's look for a better approach.