# Sentiment analysis with support vector machines


In this notebook, we revisit a learning task that we encountered earlier in the course: predicting the sentiment (positive or negative) of a single sentence taken from a review of a movie, restaurant, or product. The data set consists of 3000 labeled sentences, which we divide into a training set of size 2500 and a test set of size 500. We will use a support vector machine as the classifier.

# 1. Loading and preprocessing the data


In [1]:
%matplotlib inline
import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

## Read in the data set.
with open("sentiment_labelled_sentences/full_set.txt") as f:
    content = f.readlines()
    
## Remove leading and trailing white space
content = [x.strip() for x in content]

In [3]:
content

['So there is no way for me to plug it in here in the US unless I go by a converter.\t0',
 'Good case, Excellent value.\t1',
 'Great for the jawbone.\t1',
 'Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!\t0',
 'The mic is great.\t1',
 'I have to jiggle the plug to get it to line up right to get decent volume.\t0',
 'If you have several dozen or several hundred contacts, then imagine the fun of sending each of them one by one.\t0',
 'If you are Razr owner...you must have this!\t1',
 'Needless to say, I wasted my money.\t0',
 'What a waste of money and time!.\t0',
 'And the sound quality is great.\t1',
 'He was very impressed when going from the original battery to the extended battery.\t1',
 'If the two were seperated by a mere 5+ ft I started to notice excessive static and garbled sound from the headset.\t0',
 'Very good quality though\t1',
 'The design is very odd, as the ear "clip" is not very comfortable at all.\t0',
 'Highly recommend for any one wh

In [4]:
## Separate the sentences from the labels
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

In [6]:
## Transform the labels from {0, 1} to {-1, +1}
y = np.array(labels, dtype='int8')
y = 2*y - 1

In [7]:
y

array([-1,  1,  1, ..., -1, -1, -1], dtype=int8)
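
The label transformation above is the affine map `y -> 2*y - 1`, which sends 0 to -1 and 1 to +1. A minimal sketch on a made-up list of four labels (`toy_labels` and `y_toy` are illustrative names, not part of the notebook):

```python
import numpy as np

# Hypothetical labels, stored as strings just like the ones read from the file
toy_labels = ['0', '1', '1', '0']

# Convert the strings to small integers, then map 0 -> -1 and 1 -> +1
y_toy = np.array(toy_labels, dtype='int8')
y_toy = 2 * y_toy - 1
print(y_toy)
```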

In [8]:
## full_remove takes a string x and a list of characters removal_list 
## returns x with all the characters in removal_list replaced by ' '
def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x
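
As a quick check of what `full_remove` does, the sketch below applies it to a shortened, made-up variant of one of the sentences and compares it with an equivalent one-pass approach based on `str.translate`. The `sample` string and the `table` name are assumptions for illustration only.

```python
def full_remove(x, removal_list):
    # Replace every character in removal_list by a single space
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

sample = "Tied to charger for 45 minutes.MAJOR PROBLEMS!!"
digits = [str(d) for d in range(10)]

# Each digit becomes one space, so the sentence keeps its length
no_digits = full_remove(sample, digits)
print(repr(no_digits))

# Alternative sketch: build a translation table once and apply it in one pass
table = str.maketrans({d: ' ' for d in digits})
print(sample.translate(table) == no_digits)
```

Because each removed character is replaced by a space rather than deleted, neighboring words are never glued together. The extra spaces disappear later, when the sentences are split on whitespace.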

In [10]:
## Remove digits
digits = [str(x) for x in range(10)]
digits

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [11]:
digit_less = [full_remove(x, digits) for x in sentences]
digit_less

['So there is no way for me to plug it in here in the US unless I go by a converter.',
 'Good case, Excellent value.',
 'Great for the jawbone.',
 'Tied to charger for conversations lasting more than    minutes.MAJOR PROBLEMS!!',
 'The mic is great.',
 'I have to jiggle the plug to get it to line up right to get decent volume.',
 'If you have several dozen or several hundred contacts, then imagine the fun of sending each of them one by one.',
 'If you are Razr owner...you must have this!',
 'Needless to say, I wasted my money.',
 'What a waste of money and time!.',
 'And the sound quality is great.',
 'He was very impressed when going from the original battery to the extended battery.',
 'If the two were seperated by a mere  + ft I started to notice excessive static and garbled sound from the headset.',
 'Very good quality though',
 'The design is very odd, as the ear "clip" is not very comfortable at all.',
 'Highly recommend for any one who has a blue tooth phone.',
 'I advise EVERYO

In [14]:
list(string.punctuation)

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [12]:
## Remove punctuation
punc_less = [full_remove(x, list(string.punctuation)) for x in digit_less]

In [15]:
punc_less

['So there is no way for me to plug it in here in the US unless I go by a converter ',
 'Good case  Excellent value ',
 'Great for the jawbone ',
 'Tied to charger for conversations lasting more than    minutes MAJOR PROBLEMS  ',
 'The mic is great ',
 'I have to jiggle the plug to get it to line up right to get decent volume ',
 'If you have several dozen or several hundred contacts  then imagine the fun of sending each of them one by one ',
 'If you are Razr owner   you must have this ',
 'Needless to say  I wasted my money ',
 'What a waste of money and time  ',
 'And the sound quality is great ',
 'He was very impressed when going from the original battery to the extended battery ',
 'If the two were seperated by a mere    ft I started to notice excessive static and garbled sound from the headset ',
 'Very good quality though',
 'The design is very odd  as the ear  clip  is not very comfortable at all ',
 'Highly recommend for any one who has a blue tooth phone ',
 'I advise EVERYO

In [16]:
## Make everything lower-case
sents_lower = [x.lower() for x in punc_less]

In [17]:
## Define our stop words
stop_set = set(['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from'])

In [18]:
## Remove stop words
sents_split = [x.split() for x in sents_lower]
sents_processed = [" ".join(w for w in x if w not in stop_set) for x in sents_split]
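
The preprocessing steps so far (remove digits, remove punctuation, lower-case, drop stop words) can also be bundled into one function. The following is a minimal sketch rather than the notebook's own code: `preprocess` and `STOP_WORDS` are hypothetical names, and `str.translate` stands in for the repeated `replace` calls.

```python
import string

# Same stop words as in the notebook
STOP_WORDS = {'the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from'}

# Map every digit and punctuation character to a space
_TABLE = str.maketrans({c: ' ' for c in string.digits + string.punctuation})

def preprocess(sentence, stop_words=STOP_WORDS):
    # Strip digits and punctuation, lower-case, then drop stop words
    cleaned = sentence.translate(_TABLE).lower()
    return " ".join(w for w in cleaned.split() if w not in stop_words)

example = preprocess("The mic is great.")
print(example)
```

Applying `preprocess` to every entry of `sentences` should give the same strings as `sents_processed`, since both pipelines perform the same four steps in the same order.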

In [19]:
## Transform to bag of words representation.
vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = 4500)
data_features = vectorizer.fit_transform(sents_processed)

In [20]:
## get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())



In [21]:
data_mat = data_features.toarray()

In [24]:
data_mat.shape

(3000, 4500)

In [28]:
## Split the data into testing and training sets:
## 250 randomly chosen sentences of each class go to the test set
np.random.seed(0)
neg_inds = np.where(y == -1)[0]
pos_inds = np.where(y == 1)[0]
test_inds = np.append(np.random.choice(neg_inds, 250, replace=False),
                      np.random.choice(pos_inds, 250, replace=False))
train_inds = list(set(range(len(labels))) - set(test_inds))

train_data = data_mat[train_inds,]
train_labels = y[train_inds]

test_data = data_mat[test_inds,]
test_labels = y[test_inds]

print("train data: ", train_data.shape)
print("test data: ", test_data.shape)

train data:  (2500, 4500)
test data:  (500, 4500)
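
The manual split above puts 250 sentences of each class into the test set. A similar class-balanced split can be sketched with scikit-learn's `train_test_split` and its `stratify` argument. The data below is synthetic and assumes an equal number of examples per class; `X_demo` and `y_demo` are stand-ins for `data_mat` and `y`, not the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3000 rows, 20 count-like features, balanced labels (assumed)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(3000, 20))
y_demo = np.repeat([-1, 1], 1500)

# stratify keeps the class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=500, stratify=y_demo, random_state=0)

print("train data: ", X_train.shape)
print("test data: ", X_test.shape)
print("positives in test set: ", int(np.sum(y_test == 1)))
```

This would select different sentences than the `np.random.choice` approach, so error rates obtained with it would not match the numbers reported in this notebook exactly.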


# 2. Fitting a support vector machine to the data


In [30]:
from sklearn import svm

def fit_classifier(C_value=1.0):
    """Fit a linear SVM with the given C and return (train_error, test_error)."""
    clf = svm.LinearSVC(C=C_value, loss='hinge')
    clf.fit(train_data, train_labels)
    ## Error rate = fraction of misclassified points
    train_preds = clf.predict(train_data)
    train_error = float(np.mean(train_preds != train_labels))
    test_preds = clf.predict(test_data)
    test_error = float(np.mean(test_preds != test_labels))
    return train_error, test_error

In [31]:
cvals = [0.01,0.1,1.0,10.0,100.0,1000.0,10000.0]
for c in cvals:
    train_error, test_error = fit_classifier(c)
    print("Error rate for C = %0.2f: train %0.3f test %0.3f" % (c, train_error, test_error))

Error rate for C = 0.01: train 0.215 test 0.250
Error rate for C = 0.10: train 0.074 test 0.174
Error rate for C = 1.00: train 0.011 test 0.152
Error rate for C = 10.00: train 0.002 test 0.188
Error rate for C = 100.00: train 0.002 test 0.200
Error rate for C = 1000.00: train 0.005 test 0.216
Error rate for C = 10000.00: train 0.001 test 0.204
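
The sweep above compares values of C by their test error. If the test set is meant to stay untouched until the final evaluation, one common alternative is to choose C by cross-validation on the training data alone. The sketch below illustrates this with `GridSearchCV` on synthetic data; it is not part of the original notebook, and `X_demo`, `y_demo`, and `c_grid` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for (train_data, train_labels); NOT the review data
X_demo, y01 = make_classification(n_samples=400, n_features=30, random_state=0)
y_demo = 2 * y01 - 1  # map {0, 1} to {-1, +1}, as in the notebook

# Try each C with 5-fold cross-validation on the training data only
c_grid = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    LinearSVC(loss='hinge', dual=True, max_iter=20000, random_state=0),
    param_grid={'C': c_grid},
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

best_C = search.best_params_['C']
cv_error = 1.0 - search.best_score_
print("best C:", best_C, "cross-validated error: %0.3f" % cv_error)
```

Large values of C can make the solver slow to converge, which is why the sketch raises `max_iter`.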




In [33]:
## Refit with C = 1 (lowest test error in the sweep above) and evaluate on the test set
clf = svm.LinearSVC(C=1, loss='hinge')
clf.fit(train_data, train_labels)
preds = clf.predict(test_data)
error = float(np.sum((preds > 0.0) != (test_labels > 0.0)))/len(test_labels)
print("Test error: ", error)

Test error:  0.152


In [34]:
from sklearn.metrics import accuracy_score

## accuracy_score expects (y_true, y_pred); accuracy (in percent) = 100 * (1 - error)
accuracy_score(test_labels, preds)*100

84.8

In [37]:
import pickle

## Save the trained classifier; the with-block closes the file even if dump fails
with open("classifier.pkl", "wb") as pickle_out:
    pickle.dump(clf, pickle_out)
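
A pickled classifier can be loaded back with `pickle.load`. The sketch below checks the round trip on a tiny made-up data set and a temporary file, so that it runs on its own. The `toy_` names and the file name `toy_classifier.pkl` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import LinearSVC

# Tiny hypothetical data set: two count features, labels in {-1, +1}
X_toy = np.array([[2, 0], [1, 0], [3, 1], [0, 2], [0, 1], [1, 3]])
y_toy = np.array([1, 1, 1, -1, -1, -1])
toy_clf = LinearSVC(C=1.0, loss='hinge', dual=True, max_iter=10000, random_state=0)
toy_clf.fit(X_toy, y_toy)

# Save the fitted model to a temporary file
path = os.path.join(tempfile.mkdtemp(), "toy_classifier.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_clf, f)

# Load it back and confirm that it predicts exactly as before
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(restored.predict(X_toy), toy_clf.predict(X_toy))
print("restored model agrees with original:", same)
```

To classify new sentences later, the fitted `vectorizer` would need to be saved as well, since the classifier only accepts bag-of-words vectors built with the same vocabulary. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.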