# Tutorial - build LinearSVC model with sklearn

This tutorial demonstrates how to use the scikit-learn (sklearn) package to build a LinearSVC model, rank features, and use the model for prediction. We will be using the Kaggle sentiment data again.

Note that sklearn provides more than one SVM classifier; the two compared here are SVC and LinearSVC.

The SVC class supports nonlinear kernels and uses a one-vs-one strategy for multi-class classification.

The LinearSVC class supports only the linear kernel and uses a one-vs-rest (one-vs-all) strategy for multi-class classification. It is built on the liblinear library rather than libsvm, so it is generally faster than SVC, especially on large data sets. Because linear kernels generally work well for text classification, this tutorial demonstrates how to use LinearSVC for text classification.
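
One visible consequence of the two multi-class strategies is the number of decision scores each model produces per example. The sketch below is an illustration on synthetic data (the `make_blobs` data and the `*_toy` names are assumptions, not part of the tutorial's data): with four classes, a one-vs-one SVC yields one score per pair of classes, while LinearSVC yields one score per class.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

# Toy data: 4 classes in 2 dimensions (not the Kaggle data)
X_toy, y_toy = make_blobs(n_samples=200, centers=4, random_state=0)

# SVC with one-vs-one scores: one column per pair of classes, 4*3/2 = 6
ovo_clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X_toy, y_toy)
ovo_shape = ovo_clf.decision_function(X_toy).shape

# LinearSVC with one-vs-rest scores: one column per class, 4
ovr_clf = LinearSVC(C=1).fit(X_toy, y_toy)
ovr_shape = ovr_clf.decision_function(X_toy).shape

print(ovo_shape, ovr_shape)
```

For the five sentiment categories used in this tutorial, the same reasoning gives ten pairwise classifiers for one-vs-one and five for one-vs-rest.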

# Step 1: Read in data

In [1]:
# this step is the same as the NB script

# read in the training data
# the data set includes four columns: PhraseId, SentenceId, Phrase, Sentiment
# In this data set a sentence is further split into phrases 
# in order to build a sentiment classification model
# that can not only predict sentiment of sentences but also shorter phrases

# A data example:
# PhraseId SentenceId Phrase Sentiment
# 1 1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 1

# the Phrase column includes the training examples
# the Sentiment column includes the training labels
# "0" for very negative
# "1" for negative
# "2" for neutral
# "3" for positive
# "4" for very positive


import pandas as pd
import numpy as np

train = pd.read_csv("data/kaggle-sentiment/train.tsv", delimiter='\t')
y = train['Sentiment'].values
X = train['Phrase'].values

# Step 2: Split train/test data for hold-out test

In [2]:
# this step is the same as the NB script

# check the sklearn documentation for train_test_split
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
# "test_size" : float or int, default=None
# If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
# If int, represents the absolute number of test samples.
# If None, the value is set to the complement of the train size.
# If train_size is also None, test_size is set to 0.25.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print(X_train[0])
print(y_train[0])
print(X_test[0])
print(y_test[0])

(93636,) (93636,) (62424,) (62424,)
almost in a class with that of Wilde
3
escape movie
2
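
The difference between a float and an int `test_size` is easy to check on a tiny array. This is a minimal sketch on made-up data (`X_toy` and `y_toy` are illustrative names): with 10 examples, `test_size=0.4` and `test_size=4` both hold out 4 examples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(10)
y_toy = np.array([0, 1] * 5)

# Float: 40% of the 10 examples go to the test split
X_tr_frac, X_te_frac, _, _ = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)
# Int: exactly 4 examples go to the test split
X_tr_abs, X_te_abs, _, _ = train_test_split(X_toy, y_toy, test_size=4, random_state=0)

print(len(X_tr_frac), len(X_te_frac))
print(len(X_tr_abs), len(X_te_abs))
```

The same arithmetic explains the shapes printed above: the full data set has 93636 + 62424 = 156060 phrases, and 40% of 156060 is 62424.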



## Step 2.1: Data Checking

In [3]:
# this step is the same as the NB script

# Check how many training examples in each category
# this is important to see whether the data set is balanced or skewed

unique, counts = np.unique(y_train, return_counts=True)
print(np.asarray((unique, counts)))

[[    0     1     2     3     4]
 [ 4141 16449 47718 19859  5469]]


The output shows that the training set is skewed: 47718/93636 = 51% of the examples are "neutral" (label 2), and every other category is much smaller.
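
The share of each category can be computed directly from the counts. The sketch below simply reuses the counts printed above; the variable names are illustrative.

```python
import numpy as np

labels = np.array([0, 1, 2, 3, 4])
counts = np.array([4141, 16449, 47718, 19859, 5469])  # copied from the output above

# Fraction of the training set in each category
shares = counts / counts.sum()
for label, share in zip(labels, shares):
    print(label, f"{share:.1%}")

# The most frequent category is what a majority-vote baseline would always predict
majority_label = int(labels[counts.argmax()])
print("majority label:", majority_label)
```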


# Step 3: Vectorization

In [4]:
# this step is the same as the NB script

# sklearn contains two vectorizers

# CountVectorizer can give you Boolean or TF vectors
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

# TfidfVectorizer can give you TF or TFIDF vectors
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

# Read the sklearn documentation to understand all vectorization options

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# several commonly used vectorizer settings

#  unigram boolean vectorizer, set minimum document frequency to 5
unigram_bool_vectorizer = CountVectorizer(encoding='latin-1', binary=True, min_df=5, stop_words='english')

#  unigram term frequency vectorizer, set minimum document frequency to 5
unigram_count_vectorizer = CountVectorizer(encoding='latin-1', binary=False, min_df=5, stop_words='english')

#  unigram and bigram term frequency vectorizer, set minimum document frequency to 5
gram12_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')

#  unigram tfidf vectorizer, set minimum document frequency to 5
unigram_tfidf_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=True, min_df=5, stop_words='english')
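
To see how the Boolean, term-frequency, and TF-IDF settings differ, it helps to vectorize a document in which a word repeats. The following is a small sketch on a two-document toy corpus (the corpus and the `*_row` names are made up for illustration; `min_df` and `stop_words` are left at their defaults so that no word is filtered out).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["good good movie", "bad movie"]

bool_vec = CountVectorizer(binary=True).fit(toy_docs)
count_vec = CountVectorizer(binary=False).fit(toy_docs)
tfidf_vec = TfidfVectorizer(use_idf=True).fit(toy_docs)

# All three vectorizers learn the same vocabulary, so the column indices match
good_idx = count_vec.vocabulary_["good"]
movie_idx = count_vec.vocabulary_["movie"]

# Vectorize the first document with each setting
bool_row = bool_vec.transform(toy_docs[:1]).toarray()[0]
count_row = count_vec.transform(toy_docs[:1]).toarray()[0]
tfidf_row = tfidf_vec.transform(toy_docs[:1]).toarray()[0]

print("boolean:", bool_row[good_idx])   # presence only
print("count:  ", count_row[good_idx])  # number of occurrences
print("tfidf:  ", tfidf_row[good_idx], tfidf_row[movie_idx])
```

In the TF-IDF row, "good" receives a larger weight than "movie" because it occurs twice in the document and appears in fewer documents overall.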


## Step 3.1: Vectorize the training data

In [5]:
# this step is the same as the NB script

# The vectorizer can do "fit" and "transform"
# fit is a process to collect unique tokens into the vocabulary
# transform is a process to convert each document to vector based on the vocabulary
# These two processes can be done together using fit_transform(), or used individually: fit() or transform()

# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_count_vectorizer.fit_transform(X_train)
print(X_train_vec.shape)


# check the size of the constructed vocabulary
print(len(unigram_count_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_count_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_count_vectorizer.vocabulary_.get('imaginative'))

(93636, 11967)
11967
[('class', 1858), ('wilde', 11742), ('derring', 2802), ('chilling', 1764), ('affecting', 313), ('meanspirited', 6557), ('personal', 7662), ('low', 6296), ('involved', 5602), ('worth', 11868)]
5224
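
What `fit` and `transform` do can be imitated in a few lines of plain Python. This is a deliberately simplified stand-in, not sklearn's implementation: `toy_fit` and `toy_transform` are hypothetical helpers that split on whitespace and ignore options such as `min_df` and stop words.

```python
from collections import Counter

def toy_fit(docs):
    """Collect the unique tokens and give each one a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def toy_transform(docs, vocabulary):
    """Turn each document into a list of term counts; unknown tokens are skipped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.lower().split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

toy_vocab = toy_fit(["escape movie", "great escape"])
toy_rows = toy_transform(["escape escape movie"], toy_vocab)
print(toy_vocab)
print(toy_rows)
```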



## Step 3.2: Vectorize the test data

In [6]:
# this step is the same as the NB script

# use the vocabulary constructed from the training data to vectorize the test data. 
# Therefore, use "transform" only, not "fit_transform", 
# otherwise "fit" would generate a new vocabulary from the test data

X_test_vec = unigram_count_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_vec.shape)

(62424, 11967)
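
The test matrix has the same number of columns as the training matrix because `transform` reuses the training vocabulary. A small sketch on toy documents (the documents and `toy_*` names are illustrative) shows that words unseen during `fit` are simply dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["good movie", "bad movie"]
toy_test = ["good unseen film"]

toy_vectorizer = CountVectorizer()
toy_train_vec = toy_vectorizer.fit_transform(toy_train)  # learns: bad, good, movie
toy_test_vec = toy_vectorizer.transform(toy_test)        # reuses that vocabulary

print(toy_train_vec.shape, toy_test_vec.shape)
# Only "good" is counted; "unseen" and "film" are not in the vocabulary
print(toy_test_vec.toarray())
```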



# Step 4: Train a LinearSVC classifier

In [7]:
# import the LinearSVC module
from sklearn.svm import LinearSVC

# initialize the LinearSVC model
svm_clf = LinearSVC(C=1)

# use the training data to train the model
svm_clf.fit(X_train_vec,y_train)

## Step 4.1: Interpret a trained LinearSVC model

In [8]:
## interpreting LinearSVC models
## http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

## LinearSVC uses a one-vs-rest (one-vs-all) strategy to extend the binary SVM classifier to multi-class problems
## for the Kaggle sentiment classification problem, there are five categories 0,1,2,3,4 with 0 as very negative and 4 as very positive
## LinearSVC builds five binary classifiers: "very negative vs. others", "negative vs. others", "neutral vs. others", "positive vs. others", "very positive vs. others",
## and then picks the most confident prediction as the final prediction.

## Each binary classifier assigns a weight to every feature; a larger positive weight means stronger evidence for that category
## svm_clf.coef_ has one row of weights per category, in the order 0,1,2,3,4
## For category "4" (very positive), pair each feature with its weight and sort the pairs in decreasing order of weight
feature_ranks = sorted(zip(svm_clf.coef_[4], unigram_count_vectorizer.get_feature_names_out()), reverse=True)

## get the top "very positive" words
topN = 20
topN_words = feature_ranks[:topN]
print("Very positive words")
for weight_word in topN_words:
    print(weight_word)
print()



Very positive words
(2.0142068736849783, 'perfection')
(1.9796890798654854, 'miraculous')
(1.8779306360018486, 'glorious')
(1.677504410914676, 'masterfully')
(1.650455163018035, 'masterful')
(1.6472517990059208, 'phenomenal')
(1.6143481059315383, 'flawless')
(1.6104866405407123, 'refreshes')
(1.599861966032188, 'astonish')
(1.5631089253679928, 'stunning')
(1.5234301668483532, 'heralds')
(1.5156776631204303, 'masterpiece')
(1.5068116113627559, 'proud')
(1.4950645642045213, 'magnificent')
(1.483797843365202, 'encourage')
(1.4751453552680465, 'succeeded')
(1.460400082928849, 'extraordinary')
(1.4207476993073285, 'captivated')
(1.4163136283342785, 'magnetic')
(1.411922534365247, 'zings')
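
The ranking and the one-vs-rest decision rule both come down to simple operations on the weight matrix. The sketch below uses hypothetical weights for three classes and a four-word vocabulary (none of these numbers come from the trained model): it ranks the words for one class, then scores a document with every binary classifier and takes the class with the highest score.

```python
import numpy as np

# Hypothetical weights: one row per class, one column per word
toy_words = np.array(["awful", "dull", "fine", "superb"])
toy_coef = np.array([
    [ 1.2,  0.6, -0.3, -1.0],   # class 0 vs. rest
    [-0.4,  0.1,  0.5, -0.2],   # class 1 vs. rest
    [-1.1, -0.7,  0.2,  1.4],   # class 2 vs. rest
])
toy_intercept = np.array([-0.5, 0.0, -0.5])

# Rank the words for class 2 from largest to smallest weight
order = np.argsort(toy_coef[2])[::-1]
ranked_words = [str(w) for w in toy_words[order]]

# Score one document (word counts) with every binary classifier, then take the argmax
doc_counts = np.array([0, 0, 1, 2])   # "fine superb superb"
scores = toy_coef @ doc_counts + toy_intercept
predicted_class = int(np.argmax(scores))

print(ranked_words)
print(scores, predicted_class)
```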



In [9]:
# Count the number of words with weights greater than 0.3 in svm_clf.coef_[4]
count = sum(1 for weight in svm_clf.coef_[4] if weight > 0.3)
print("Number of words with weights greater than 0.3:", count)

Number of words with weights greater than 0.3: 1551


In [14]:
# Filter words with weights between 0.5 and 0.51
filtered_words = [word for weight, word in feature_ranks if 0.5 <= weight <= 0.51]

print("Words with weights between 0.5 and 0.51:")
for word in filtered_words:
    print(word)

Words with weights between 0.5 and 0.51:
enlightening
calamity
technical
clarity
spirited
ultimate
aggravating
worthy
concentrates
apocalypse
successfully
documentary
goal
cuts
wide
buoyant
sprightly
package
cleverest
infectious
loopy
happily
invaluable
richly
elemental
polanski
freshness
apex


# Exercise A

In [10]:
# write code similar to the above sample code
# and print out the top 20 "very negative" words (category 0)

# Your code starts here

# Your code ends here

# Step 5: Test the LinearSVC classifier

In [15]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec,y_test)



0.6236703831859541

In [16]:
# using sklearn to calculate the majority vote baseline

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_test, y_test)
     

0.5104447007561195
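
The majority-vote baseline can also be worked out by hand: find the most frequent training label and count how often it occurs in the test set. A minimal sketch with made-up labels (`toy_y_train` and `toy_y_test` are illustrative), followed by the same arithmetic on this tutorial's test set, which contains 31864 neutral phrases out of 62424 (the support of class 2 in the classification report of this step):

```python
from collections import Counter

# Toy labels (not the Kaggle data): class 2 is the most frequent training label
toy_y_train = [2, 2, 2, 1, 3, 2, 0, 4]
toy_y_test = [2, 1, 2, 3, 2]

majority_label, _ = Counter(toy_y_train).most_common(1)[0]
baseline_accuracy = sum(label == majority_label for label in toy_y_test) / len(toy_y_test)
print(majority_label, baseline_accuracy)

# Same arithmetic on the counts reported in this tutorial
kaggle_baseline = 31864 / 62424
print(kaggle_baseline)
```

Any useful classifier should beat this number; the LinearSVC accuracy above is about 11 points higher.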

In [17]:
# print confusion matrix and classification report

from sklearn.metrics import confusion_matrix
y_pred = svm_clf.predict(X_test_vec)
cm=confusion_matrix(y_test, y_pred, labels=[0,1,2,3,4])
print(cm)
print()

from sklearn.metrics import classification_report
target_names = ['0','1','2','3','4']
print(classification_report(y_test, y_pred, target_names=target_names))

[[  918  1221   697    82    13]
 [  701  4080  5504   514    25]
 [  195  2106 27081  2310   172]
 [   34   396  6048  5532  1058]
 [    3    51   590  1772  1321]]

              precision    recall  f1-score   support

           0       0.50      0.31      0.38      2931
           1       0.52      0.38      0.44     10824
           2       0.68      0.85      0.75     31864
           3       0.54      0.42      0.48     13068
           4       0.51      0.35      0.42      3737

    accuracy                           0.62     62424
   macro avg       0.55      0.46      0.49     62424
weighted avg       0.60      0.62      0.60     62424
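
The precision, recall, and accuracy figures in the report can be recomputed from the confusion matrix. The sketch below reuses the matrix printed above (rows are true labels, columns are predicted labels); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([
    [  918,  1221,   697,    82,    13],
    [  701,  4080,  5504,   514,    25],
    [  195,  2106, 27081,  2310,   172],
    [   34,   396,  6048,  5532,  1058],
    [    3,    51,   590,  1772,  1321],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)      # divide by row totals (support)
precision = true_positives / cm.sum(axis=0)   # divide by column totals (predicted counts)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(float(accuracy), 4))
```

Reading the matrix this way also shows where the errors concentrate: most misclassified examples land in the "neutral" column, the majority class.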



## Step 5.1: Interpret the prediction result

In [18]:
## get the confidence scores for all test examples from each of the five binary classifiers
svm_confidence_scores = svm_clf.decision_function(X_test_vec)
## get the confidence score for the first test example
print(svm_confidence_scores[0])

## the five scores correspond to categories 0,1,2,3,4 in that order
## because the confidence score is the highest for category 2 (the third value),
## the prediction should be 2.

## Confirm by printing out the actual prediction, the true label, and the phrase
print(y_pred[0])
print(y_test[0])
print(X_test[0])

[-1.01697015 -0.5075494   0.22330632 -0.97509521 -1.24753842]
2
2
escape movie


In [24]:
# output prediction probs
# https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py

from sklearn.calibration import CalibratedClassifierCV
svm_calibrated = CalibratedClassifierCV(svm_clf) 
svm_calibrated.fit(X_train_vec, y_train)
y_test_proba = svm_calibrated.predict_proba(X_test_vec)
y_test_proba[51597]

array([0.89601509, 0.09938126, 0.00233236, 0.00127423, 0.00099706])
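
LinearSVC has no `predict_proba` method, which is why the model is wrapped in `CalibratedClassifierCV` here. The sketch below repeats the idea on synthetic data (the `make_classification` data and the `toy_*` names are assumptions for illustration). With these settings, the wrapper fits copies of the LinearSVC on cross-validation folds and learns a mapping from decision scores to probabilities, so each row of the result sums to 1.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy 3-class data (not the Kaggle data)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

# Wrap a LinearSVC so that it can output probabilities
toy_calibrated = CalibratedClassifierCV(LinearSVC(C=1), cv=3)
toy_calibrated.fit(X_toy, y_toy)

toy_proba = toy_calibrated.predict_proba(X_toy[:5])
print(toy_proba.shape)          # one column per class
print(toy_proba.sum(axis=1))    # each row sums to 1
```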

In [23]:
# print out the most confident predictions for each category

# Get the indices of the 10 most confident predictions for each category
most_confident_indices = {i: y_test_proba[:, i].argsort()[-10:][::-1] for i in range(5)}

# Print out the 10 most confident predictions for each category with their confidence scores
for category, indices in most_confident_indices.items():
    print(f"Category {category}:")
    for idx in indices:
        print(f"Phrase: {idx} {X_test[idx]}, Confidence Score: {y_test_proba[idx][category]}")
    print()

Category 0:
Phrase: 51597 that her life is meaningless , vapid and devoid of substance , in a movie that is definitely meaningless , vapid and devoid of substance, Confidence Score: 0.896015091018129
Phrase: 62006 her life is meaningless , vapid and devoid of substance , in a movie that is definitely meaningless , vapid and devoid of substance, Confidence Score: 0.896015091018129
Phrase: 11427 is meaningless , vapid and devoid of substance , in a movie that is definitely meaningless , vapid and devoid of substance, Confidence Score: 0.8760157946708909
Phrase: 45463 Astonishing is n't the word -- neither is incompetent , incoherent or just plain crap, Confidence Score: 0.8732493417773652
Phrase: 5239 is the sort of low-grade dreck that usually goes straight to video -- with a lousy script , inept direction , pathetic acting , poorly dubbed dialogue and murky cinematography , complete with visible boom mikes, Confidence Score: 0.8726813488021061
Phrase: 28246 is absolutely and completely

In [22]:
# Get the indices of the 10 test examples with the lowest predicted probability for each category
# (these are the phrases the model considers least likely to belong to that category)
least_confident_indices = {i: y_test_proba[:, i].argsort()[:10] for i in range(5)}

# Print out these 10 examples for each category with their probabilities
for category, indices in least_confident_indices.items():
    print(f"Category {category}:")
    for idx in indices:
        print(f"Phrase: {idx} {X_test[idx]}, Confidence Score: {y_test_proba[idx][category]}")
    print()

Category 0:
Phrase: 38066 challenges perceptions of guilt and innocence , of good guys and bad , and asks us whether a noble end can justify evil means, Confidence Score: 5.6370428866917815e-06
Phrase: 46912 made nature film and a tribute to a woman whose passion for this region and its inhabitants still shines in her quiet blue eyes, Confidence Score: 1.3835964184543684e-05
Phrase: 61838 The film would have been more enjoyable had the balance shifted in favor of water-bound action over the land-based ` drama , ' but the emphasis on the latter leaves Blue Crush waterlogged ., Confidence Score: 1.633391101107455e-05
Phrase: 47339 If you can tolerate the redneck-versus-blueblood cliches that the film trades in , Sweet Home Alabama is diverting in the manner of Jeff Foxworthy 's stand-up act ., Confidence Score: 1.699659420454118e-05
Phrase: 50805 is an intelligent flick that examines many different ideas from happiness to guilt in an intriguing bit of storytelling ., Confidence Score: 2.
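
Both selections above rely on `argsort`, which returns indices ordered from the lowest to the highest value. A minimal sketch with hypothetical probabilities for a single category (the numbers are made up) shows how the slices pick the two ends:

```python
import numpy as np

# Hypothetical probabilities for one category over six test examples
toy_col = np.array([0.10, 0.85, 0.02, 0.40, 0.91, 0.33])

ascending = toy_col.argsort()            # indices from lowest to highest probability
most_confident = ascending[-3:][::-1]    # last three, reversed: highest first
least_confident = ascending[:3]          # first three: lowest first

print(most_confident, least_confident)
```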

# Exercise B

In [None]:
# do these above least confident predictions make sense to you?
# any chance the labels are wrong?

In [34]:
# The above sample code examines LinearSVC's prediction confidence for the first example.
# Now you are going to print out another example with index=99 (the 100th example), 
# and examine the LinearSVC's prediction confidence and its final prediction
# does this prediction seem correct to you?

# Your code starts here
y_test_proba[99]
# Your code ends here

array([0.02923081, 0.19074941, 0.5905151 , 0.18406639, 0.00543828])

## Step 5.2: Error Analysis

In [25]:
# print out a specific type of error for further analysis

# print out the very positive examples that are mistakenly predicted as very negative
# according to the confusion matrix, there should be 3 such examples (row 4, column 0)
# note if you use a different vectorizer option, your result might be different

err_cnt = 0
for i in range(0, len(y_test)):
    if y_test[i] == 4 and y_pred[i] == 0:
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

10 minutes into the film you 'll be white-knuckled and unable to look away .
the film is never dull
greatly impressed by the skill of the actors involved in the enterprise
errors: 3
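
The same selection can be written without an explicit loop by combining two Boolean masks. This is one possible sketch, using toy arrays in place of `X_test`, `y_test`, and `y_pred`; `select_errors` is a hypothetical helper, not part of sklearn.

```python
import numpy as np

# Toy arrays standing in for X_test, y_test and y_pred
toy_X = np.array(["never dull", "a mess", "pure joy", "flat", "unable to look away"])
toy_true = np.array([4, 0, 4, 1, 4])
toy_pred = np.array([0, 0, 4, 1, 0])

def select_errors(texts, y_true, y_hat, true_label, pred_label):
    """Return the texts whose true label is true_label but were predicted as pred_label."""
    mask = (y_true == true_label) & (y_hat == pred_label)
    return texts[mask]

toy_errors = select_errors(toy_X, toy_true, toy_pred, true_label=4, pred_label=0)
print(list(toy_errors), len(toy_errors))
```

The number of selected examples should always match the corresponding cell of the confusion matrix, which is a quick sanity check on the label pair being used.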


# Exercise C

In [20]:
# write code to print out
# the very negative examples that were mistakenly predicted as very positive,
# and the very positive examples that were mistakenly predicted as very negative

# Try to find linguistic patterns in these two types of errors
# Based on the above error analysis, what suggestions would you give to improve the current model?

# Your code starts here

# Your code ends here

# Step 6: Cross Validation

In [26]:
# Cross validation
# for more details see https://scikit-learn.org/stable/modules/cross_validation.html
# evaluation metrics provided by sklearn - https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

# if you need to output only one metric, such as accuracy or f1_macro, use the "cross_val_score" function

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

svm_clf_pipe = Pipeline([('vect', CountVectorizer(encoding='latin-1', binary=False)),('svm', LinearSVC(C=1))])
scores = cross_val_score(svm_clf_pipe, X, y, cv=3, scoring='f1_macro')
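
Putting the vectorizer inside the pipeline means that, for every fold, the vocabulary is built from the training part of that fold only, so no information from the held-out part leaks into the features. The sketch below runs the same pattern on a tiny made-up corpus (the texts, labels, and `toy_*` names are illustrative) and prints the per-fold scores and their mean:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (not the Kaggle data): 1 = positive, 0 = negative
toy_texts = ["great film", "loved it", "superb acting",
             "a great joy", "loved the acting", "superb film",
             "awful film", "hated it", "dull acting",
             "an awful bore", "hated the acting", "dull film"]
toy_labels = [1] * 6 + [0] * 6

toy_pipe = Pipeline([("vect", CountVectorizer()), ("svm", LinearSVC(C=1))])

# For each of the 3 folds, the whole pipeline (vectorizer + classifier) is refit
toy_scores = cross_val_score(toy_pipe, toy_texts, toy_labels, cv=3, scoring="f1_macro")
print(toy_scores, toy_scores.mean())
```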

In [27]:
# if you need to output multiple metrics, such as accuracy and f1_macro, use the "cross_validate" function
from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'f1_macro', 'f1_micro', 'precision_macro', 'recall_macro']
svm_clf_pipe = Pipeline([('vect', CountVectorizer(encoding='latin-1', binary=False)),('svm', LinearSVC(C=1))])
scores = cross_validate(svm_clf_pipe, X, y, cv=3, scoring=scoring, return_train_score=True)

In [28]:
sorted(scores.keys())

['fit_time',
 'score_time',
 'test_accuracy',
 'test_f1_macro',
 'test_f1_micro',
 'test_precision_macro',
 'test_recall_macro',
 'train_accuracy',
 'train_f1_macro',
 'train_f1_micro',
 'train_precision_macro',
 'train_recall_macro']

In [29]:
# retrieve scores from a metric
def get_metric_scores(scores, metric, train_or_test, verbose=False):
    metric_name = train_or_test + '_' + metric
    print(metric_name)

    metric_scores = scores[metric_name]
    if verbose:
        print(metric_scores)
    avg = sum(metric_scores) / len(metric_scores)
    print('average')
    avg_formatted = "{:.3f}".format(avg)
    print(avg_formatted)

In [30]:
#retrieve test accuracy scores
get_metric_scores(scores, 'accuracy', 'test', verbose=True)

test_accuracy
[0.57608612 0.56412918 0.56683968]
average
0.569


In [31]:
#retrieve training accuracy scores
get_metric_scores(scores, 'accuracy', 'train')

train_accuracy
average
0.738


In [27]:
# compare performance with different choice of C values
# note that with the availability of more powerful classifiers like BERT,
# SVM is now often used as a baseline model for comparison purposes only,
# and thus we only need to try a few different hyperparameter settings
# oftentimes just using the default setting is good enough

svm_clf_pipe2 = Pipeline([('vect', CountVectorizer(encoding='latin-1', binary=False)),('svm', LinearSVC(C=0.5))])
scores2 = cross_validate(svm_clf_pipe2, X, y, cv=3, scoring=scoring, return_train_score=True)

svm_clf_pipe3 = Pipeline([('vect', CountVectorizer(encoding='latin-1', binary=False)),('svm', LinearSVC(C=2))])
scores3 = cross_validate(svm_clf_pipe3, X, y, cv=3, scoring=scoring, return_train_score=True)



In [28]:
# compare the effect of different C values
# C=1
print('C=1\n')
get_metric_scores(scores, 'accuracy', 'train')
get_metric_scores(scores, 'accuracy', 'test')

# C=0.5
print('\nC=0.5\n')
get_metric_scores(scores2, 'accuracy', 'train')
get_metric_scores(scores2, 'accuracy', 'test')

# C=2
print('\nC=2\n')
get_metric_scores(scores3, 'accuracy', 'train')
get_metric_scores(scores3, 'accuracy', 'test')



C=1

train_accuracy
average
0.738
test_accuracy
average
0.569

C=0.5

train_accuracy
average
0.732
test_accuracy
average
0.574

C=2

train_accuracy
average
0.741
test_accuracy
average
0.565
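
In LinearSVC, a smaller C means stronger regularization. One way to read the numbers above is to look at the gap between training and test accuracy for each C. The sketch below only reuses the averages printed above; the `results` dictionary is an illustrative structure, not something returned by sklearn.

```python
# Average accuracies copied from the output above
results = {
    0.5: {"train": 0.732, "test": 0.574},
    1:   {"train": 0.738, "test": 0.569},
    2:   {"train": 0.741, "test": 0.565},
}

gaps = {}
for C, acc in results.items():
    gaps[C] = acc["train"] - acc["test"]
    print(f"C={C}: train={acc['train']:.3f} test={acc['test']:.3f} gap={gaps[C]:.3f}")

best_C = max(results, key=lambda C: results[C]["test"])
print("best C by test accuracy:", best_C)
```

The gap widens as C grows, which is consistent with more overfitting at weaker regularization, although the test accuracies differ by less than one point and the differences may not be meaningful.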


# Exercise D

In [29]:
# retrieve the f1_macro scores and compare the effect of C on f1_macro 

# Your code starts here

# Your code ends here