# Natural Language Processing: Sentiment Analysis Lab

Prerequisite: The sentiment_labelled_sentences folder must be in the same directory as this notebook.
This tutorial builds on https://towardsdatascience.com/nlp-sentiment-analysis-for-beginners-e7897f976897.

![image.png](attachment:image.png)
Image source: https://www.marketmotive.com/blog/discipline-specific/social-media/sentiment-analysis-article

## Goal: To compare various algorithms in predicting sentiment of online reviews.

Here, we are using a sentiment data set of 3000 sentences taken from reviews on imdb.com, amazon.com, and yelp.com. Each sentence is labeled according to whether it comes from a positive review (labeled 1) or a negative review (labeled 0).

In [7]:
#!pip install nltk
#import nltk; nltk.download('stopwords')
#!pip install keras
#!pip install tensorflow
%matplotlib inline
import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

Take a look at the first 10 reviews.

In [8]:
with open("sentiment_labelled_sentences/full_set.txt") as f:
    content = f.readlines()
content[0:10]

['So there is no way for me to plug it in here in the US unless I go by a converter.\t0\n',
 'Good case, Excellent value.\t1\n',
 'Great for the jawbone.\t1\n',
 'Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!\t0\n',
 'The mic is great.\t1\n',
 'I have to jiggle the plug to get it to line up right to get decent volume.\t0\n',
 'If you have several dozen or several hundred contacts, then imagine the fun of sending each of them one by one.\t0\n',
 'If you are Razr owner...you must have this!\t1\n',
 'Needless to say, I wasted my money.\t0\n',
 'What a waste of money and time!.\t0\n']

Let's first remove leading and trailing white space, then separate the sentences from the labels using list comprehensions.

In [9]:
content = [x.strip() for x in content]
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

Print out the first ten reviews along with their sentiment.

In [10]:
for i in range(10):
    print(sentences[i],'\t',labels[i])

So there is no way for me to plug it in here in the US unless I go by a converter. 	 0
Good case, Excellent value. 	 1
Great for the jawbone. 	 1
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!! 	 0
The mic is great. 	 1
I have to jiggle the plug to get it to line up right to get decent volume. 	 0
If you have several dozen or several hundred contacts, then imagine the fun of sending each of them one by one. 	 0
If you are Razr owner...you must have this! 	 1
Needless to say, I wasted my money. 	 0
What a waste of money and time!. 	 0


This is an optional step. The dataset uses 0 as negative sentiment and 1 for positive sentiment. The blog author likes to use -1 and 1, respectively, instead. Can't say that I disagree.

In [11]:
## Transform the labels from '0 vs. 1' to '-1 vs. 1'
y = np.array(labels, dtype='int8')
y = 2*y - 1

## Pre-processing the text data
To feed text into any model, the input must first be converted to vector form. We will do the following transformations:
- Remove punctuation and numbers
- Transform all words to lower-case
- Remove stop words
- Tokenize the texts
- Convert the sentences into vectors, using a bag-of-words representation

The reason for these transformations is that there is a lot in the review that isn't helpful in determining the sentiment. For example, the word "the" in a sentence isn't helpful, nor is a comma.

## Task: Definitions

Define the following terms when speaking of natural language processing:

- Stop word removal
- Tokenization
- Lemmatization
- Corpus
- Bag of words
- Stemming
- TF-IDF (Term frequency-inverse document frequency)

Pay particular attention to how TF-IDF works.

We could make our own list of "stop" words to remove.

In [12]:
## Demonstrate ##
def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt

stoppers = ['a','is','of','the','this','uhm','uh']
removeStopWords(stoppers, "this is a test of the stop word removal code")

'test stop word removal code'

It is much more common in NLP to use a built-in set of stopwords. NLTK has a good set.

In [13]:
from nltk.corpus import stopwords
stops = stopwords.words("english")  # NLTK's stop word file is named 'english' (lowercase)
removeStopWords(stops, "this is a test of the stop word removal code.")

'test stop word removal code.'

Let's remove digits and punctuation, convert all words to lowercase, strip extra whitespace, and remove stop words.

In [14]:
def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

## Remove digits
digits = [str(x) for x in range(10)]
remove_digits = [full_remove(x, digits) for x in sentences]

## Remove punctuation
remove_punc = [full_remove(x, list(string.punctuation)) for x in remove_digits]

## Make everything lower-case and strip leading/trailing white space
sents_lower = [x.lower() for x in remove_punc]
sents_lower = [x.strip() for x in sents_lower]

## Remove stop words
from nltk.corpus import stopwords
stops = stopwords.words("english")  # NLTK's stop word file is named 'english' (lowercase)
def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt
sents_processed = [removeStopWords(stops,x) for x in sents_lower]

Let's look at how the transformed sentences look.

In [15]:
sents_processed[0:20]

['way plug us unless go converter',
 'good case excellent value',
 'great jawbone',
 'tied charger conversations lasting minutes major problems',
 'mic great',
 'jiggle plug get line right get decent volume',
 'several dozen several hundred contacts imagine fun sending one one',
 'razr owner must',
 'needless say wasted money',
 'waste money time',
 'sound quality great',
 'impressed going original battery extended battery',
 'two seperated mere ft started notice excessive static garbled sound headset',
 'good quality though',
 'design odd ear clip comfortable',
 'highly recommend one blue tooth phone',
 'advise everyone fooled',
 'far good',
 'works great',
 'clicks place way makes wonder long mechanism would last']

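These reviews seem to have lost a lot of their meaning; we probably removed too much. For example, NLTK's stop word list includes negations such as "no" and "not", so a review that warns readers not to be fooled has become 'advise everyone fooled'. This will not always be the case. It really depends on the level of the writing, and reviews are generally not written at an advanced level.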
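To make the effect of an aggressive stop list concrete, here is a minimal sketch. Both stop lists are small hand-written stand-ins invented for this example (they are not NLTK's actual list); the aggressive one simply includes negation words, as NLTK's English list does.

```python
def remove_stop_words(stop_words, text):
    """Drop every whitespace-separated token that appears in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

# Hypothetical stop lists, for illustration only.
aggressive_stops = {'do', 'not', 'be', 'is', 'very', 'the'}
gentle_stops = {'the', 'a', 'an'}

review = "advise everyone do not be fooled"
aggressive_result = remove_stop_words(aggressive_stops, review)
gentle_result = remove_stop_words(gentle_stops, review)

print(aggressive_result)  # advise everyone fooled
print(gentle_result)      # advise everyone do not be fooled
```

With the negation gone, the remaining words no longer say whether the reviewer was warning against the product or not.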

Let's define our own set of stopwords.

In [16]:
stop_set = ['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from']
sents_processed = [removeStopWords(stop_set,x) for x in sents_lower]

Looks like we haven't lost as much of the meaning.

In [17]:
sents_processed[0:20]

['so there is no way for me plug in here in us unless go by converter',
 'good case excellent value',
 'great for jawbone',
 'tied charger for conversations lasting more than minutes major problems',
 'mic is great',
 'have jiggle plug get line up right get decent volume',
 'if you have several dozen or several hundred contacts then imagine fun sending each them one by one',
 'if you are razr owner you must have this',
 'needless say wasted my money',
 'what waste money and time',
 'and sound quality is great',
 'was very impressed when going original battery extended battery',
 'if two were seperated by mere ft started notice excessive static and garbled sound headset',
 'very good quality though',
 'design is very odd as ear clip is not very comfortable at all',
 'highly recommend for any one who has blue tooth phone',
 'advise everyone do not be fooled',
 'so far so good',
 'works great',
 'clicks into place in way that makes you wonder how long that mechanism would last']

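Time to stem the text, which strips suffixes from each word to reduce it to a base form (the stem). The stem need not be a dictionary word: "value" becomes "valu".

The Porter and Lancaster stemmers are both built into NLTK. They often agree, but Lancaster is the more aggressive of the two: in the demo below it reduces "your" to "yo" and "while" to "whil", which Porter leaves alone.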

In [18]:
import nltk

def stem_with_porter(words):
    porter = nltk.PorterStemmer()
    new_words = [porter.stem(w) for w in words]
    return new_words

def stem_with_lancaster(words):
    lancaster = nltk.LancasterStemmer()
    new_words = [lancaster.stem(w) for w in words]
    return new_words

## Demonstrate ##
## (named demo_sentence so the built-in str is not shadowed)
demo_sentence = "Please don't unbuckle your seat-belt while I am driving, he said"
print("porter:", stem_with_porter(demo_sentence.split()))

print("lancaster:", stem_with_lancaster(demo_sentence.split()))

porter: ['pleas', "don't", 'unbuckl', 'your', 'seat-belt', 'while', 'I', 'am', 'driving,', 'he', 'said']
lancaster: ['pleas', "don't", 'unbuckl', 'yo', 'seat-belt', 'whil', 'i', 'am', 'driving,', 'he', 'said']


Stemming can remove some of the meaning of a word, but remember that the goal is to shrink the vocabulary, and with it the length of the vectors that represent each review. Keeping every form of the same root word as a separate feature can increase computational cost substantially.
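The vocabulary-shrinking effect can be illustrated without NLTK. The sketch below uses `toy_stem`, a deliberately crude suffix stripper invented for this example (it is not the Porter or Lancaster algorithm), and compares the number of unique tokens before and after.

```python
def toy_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ('ing', 'ed', 's'):
        # Only strip when a stem of at least three letters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = "waste wasted wasting wastes work works worked working".split()
vocab_before = set(tokens)
vocab_after = {toy_stem(t) for t in tokens}

print(len(vocab_before), sorted(vocab_before))
print(len(vocab_after), sorted(vocab_after))
```

Eight distinct tokens collapse to three stems. The toy rule still keeps "waste" and "wast" apart, which is the kind of case a real stemmer handles with more careful rules.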

In [19]:
porter = [stem_with_porter(x.split()) for x in sents_processed]
porter = [" ".join(i) for i in porter]
porter[0:10]

['so there is no way for me plug in here in us unless go by convert',
 'good case excel valu',
 'great for jawbon',
 'tie charger for convers last more than minut major problem',
 'mic is great',
 'have jiggl plug get line up right get decent volum',
 'if you have sever dozen or sever hundr contact then imagin fun send each them one by one',
 'if you are razr owner you must have thi',
 'needless say wast my money',
 'what wast money and time']

## Vectorizing the Data

CountVectorizer counts how many times each word (or n-gram) occurs in each document of the corpus.

TF (term frequency) is a normalized frequency of the word within a document, whereas IDF (inverse document frequency) weights a word by how rare it is across all of the documents.

Here we create a bag of words and then find the normalized frequencies and weightings of each word.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = CountVectorizer(analyzer = "word", 
                             preprocessor = None, 
                             stop_words =  'english', 
                             max_features = 6000, ngram_range=(1,5))
data_features = vectorizer.fit_transform(sents_processed)
tfidf_transformer = TfidfTransformer()
data_features_tfidf = tfidf_transformer.fit_transform(data_features)
data_mat = data_features_tfidf.toarray()

The code above can be confusing, so let me try to explain a bit more. Imagine we took all unique words found in this review dataset and put one in each column of a spreadsheet. Each review would then be given a row in the spreadsheet. For each review, we put numbers indicating how many times each word appears in the review, like in the graphic below. That is what data_features above holds.

![image.png](attachment:image.png)
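The spreadsheet picture can be reproduced by hand on a tiny corpus. This is a minimal sketch in plain Python with three made-up documents; unlike the CountVectorizer call above, it counts single words only (no n-grams) and applies no feature cap.

```python
from collections import Counter

docs = ["good case excellent value", "great sound great value", "waste money"]

# Vocabulary: every unique word, sorted so the column order is stable.
toy_vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in toy_vocab])

print(toy_vocab)
for row in count_matrix:
    print(row)
```

Each row is mostly zeros, which is why scikit-learn stores the real matrix in sparse form until `toarray()` is called.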

This gives word counts, but certain words are very common and don't convey much information, so we have to look at the relative importance of the words. That's what the tfidf_transformer does. It takes the counts above and scales them based on their importance. In the graphic above, all numbers would be converted to a value between 0 and 1: counts and IDF weights are never negative, and by default each row is scaled to unit length.
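To see where the scaled values come from, here is a hand-rolled sketch on a toy count matrix. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn documents as the `TfidfTransformer` defaults; the words and counts are made up.

```python
import math

# Toy count matrix: rows are documents, columns are the words below.
words = ["great", "value", "waste"]
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]

n_docs = len(counts)
# Document frequency: in how many documents each word appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(words))]
# Smoothed inverse document frequency: rarer words get larger weights.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 length of the row
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

In the last document "value" and "waste" each occur once, yet "waste" ends up with the larger weight because it appears in fewer documents.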

We are ready to put the input into the model. 

Let’s create training and test sets. Here, we split the data into a training set of 2500 sentences and a test set of 500 sentences (of which 250 are positive and 250 negative).

In [21]:
np.random.seed(0)
test_index = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False), np.random.choice((np.where(y==1))[0], 250, replace=False))
train_index = list(set(range(len(labels))) - set(test_index))
train_data = data_mat[train_index,]
train_labels = y[train_index]
test_data = data_mat[test_index,]
test_labels = y[test_index]

TextBlob finds all the words and phrases that it can assign polarity and subjectivity to, and averages them together.

Sentiment Labels: Each word in TextBlob's lexicon is labeled in terms of polarity and subjectivity (there are more labels as well, but we’re going to ignore them for now). A text’s sentiment is the average of these.

Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.

Subjectivity: How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.
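The averaging idea can be sketched with a tiny hand-made lexicon. The words and scores below are illustrative assumptions, and TextBlob's real analyzer is more involved than a plain average, so treat this only as a picture of the basic mechanism.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity). Illustrative values.
lexicon = {
    'great': (0.8, 0.75),
    'good': (0.7, 0.6),
    'awful': (-1.0, 1.0),
}

def average_sentiment(text):
    """Average polarity and subjectivity over the words found in the lexicon."""
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    if not scores:
        return 0.0, 0.0  # nothing recognized: treat as neutral and objective
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

print(average_sentiment("mic is great"))
print(average_sentiment("razr owner"))
print(average_sentiment("great but awful"))
```

A sentence with no lexicon words scores (0.0, 0.0), and mixed sentences land between the scores of their words.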

In [22]:
!pip install textblob
from textblob import TextBlob
#Create polarity function and subjectivity function
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity
pol_list = [pol(x) for x in sents_processed]
sub_list = [sub(x) for x in sents_processed]



Let's see the polarity and subjectivity of the first 10 reviews in the corpus.

In [23]:
for i in range(10):
    print(sents_processed[i], '\t', pol_list[i], sub_list[i])

so there is no way for me plug in here in us unless go by converter 	 0.0 0.0
good case excellent value 	 0.85 0.8
great for jawbone 	 0.8 0.75
tied charger for conversations lasting more than minutes major problems 	 0.1875 0.3333333333333333
mic is great 	 0.8 0.75
have jiggle plug get line up right get decent volume 	 0.22619047619047616 0.6011904761904762
if you have several dozen or several hundred contacts then imagine fun sending each them one by one 	 0.09999999999999999 0.06666666666666667
if you are razr owner you must have this 	 0.0 0.0
needless say wasted my money 	 -0.35 0.5
what waste money and time 	 -0.2 0.0


## Modeling

### Logistic Regression

In [24]:
from sklearn.linear_model import SGDClassifier
## Fit logistic classifier on training data
## (recent scikit-learn releases spell these loss="log_loss" and penalty=None)
clf = SGDClassifier(loss="log", penalty="none")
clf.fit(train_data, train_labels)
## Pull out the parameters (w,b) of the logistic regression model
w = clf.coef_[0,:]
b = clf.intercept_
## Get predictions on training and test data
preds_train = clf.predict(train_data)
preds_test = clf.predict(test_data)
## Compute errors
errs_train = np.sum((preds_train > 0.0) != (train_labels > 0.0))
errs_test = np.sum((preds_test > 0.0) != (test_labels > 0.0))
print("Training error: ", float(errs_train)/len(train_labels))
print("Test error: ", float(errs_test)/len(test_labels))

Training error:  0.0116
Test error:  0.184
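The parameters (w, b) are all a logistic regression model needs to make a prediction: it computes a weighted sum of the features, adds the intercept, and passes the result through a sigmoid. A minimal sketch, with made-up weights and feature values standing in for the fitted ones:

```python
import math

def predict_label(weights, bias, features):
    """Return (+1 or -1, probability of the positive class) for one feature vector."""
    score = sum(w_j * x_j for w_j, x_j in zip(weights, features)) + bias
    prob_positive = 1.0 / (1.0 + math.exp(-score))
    label = 1 if score > 0 else -1
    return label, prob_positive

# Hypothetical weights for three features: 'great', 'waste', 'phone'.
toy_w = [2.0, -3.0, 0.1]
toy_b = -0.2

label_pos, prob_pos = predict_label(toy_w, toy_b, [0.9, 0.0, 0.4])
label_neg, prob_neg = predict_label(toy_w, toy_b, [0.0, 0.8, 0.6])
print(label_pos, round(prob_pos, 3))
print(label_neg, round(prob_neg, 3))
```

A large positive weight pushes the score up whenever its word is present, which is why the sign and size of each coefficient say something about the word's influence.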


Which words are most important in deciding whether a sentence is positive? As a first approximation to this, we simply take the words whose coefficients in w have the largest positive values.

Likewise, we look at the words whose coefficients in w have the most negative values, and we think of these as influential in negative predictions.
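The ranking step amounts to sorting feature indices by weight and reading off both ends. A small sketch with a made-up vocabulary and weights, using plain Python in place of `np.argsort`:

```python
toy_vocab = ['awesome', 'bad', 'phone', 'great', 'worst']
toy_weights = [2.1, -1.7, 0.0, 1.5, -2.4]

# Indices sorted by weight, most negative first (the order np.argsort returns).
order = sorted(range(len(toy_weights)), key=lambda i: toy_weights[i])

k = 2
most_negative = [toy_vocab[i] for i in order[:k]]
most_positive = [toy_vocab[i] for i in order[-k:]]  # [-k:] keeps the largest weight

print(most_negative)
print(most_positive)
```

Because the order is ascending, the strongest positive word sits at the very end; a slice such as `order[-k:-1]` would stop one short and leave it out.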

In [25]:
## Convert vocabulary into a list:
vocab = np.array([z[0] for z in sorted(vectorizer.vocabulary_.items(), key=lambda x:x[1])])
## Get indices of sorting w
inds = np.argsort(w)
## Words with large negative values
neg_inds = inds[0:50]
print("Highly negative words: ")
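print([x for x in list(vocab[neg_inds])])
## Words with large positive values
## Note: inds[-49:-1] selects 48 words and skips the last index, which is the
## single most positive word; inds[-50:] would give the full top 50.
pos_inds = inds[-49:-1]
print("Highly positive words: ")
print([x for x in list(vocab[pos_inds])])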

Highly negative words: 
['sucks', 'worst', 'poor', 'bad', 'disappointing', 'bland', 'disappointment', 'horrible', 'failed', 'avoid', 'cheap', 'stupid', 'unfortunately', 'doesn', 'sucked', 'rude', 'average', 'fly', 'probably', 'slow', 'piece', 'tasteless', 'awful', 'mistake', 'return', 'directing', 'dirty', 'dropped', 'blah', 'junk', 'mediocre', 'waste', 'waste time', 'appealing', 'selection food', 'improvement', 'flat', 'ok', 'torture', 'hour', 'wasted', 'eating', 'att', 'engaging', 'poorly', 'happened', 'crap', 'joke', 'remorse', 'didn']
Highly positive words: 
['exactly', 'forget', 'vegas buffet', 'crisp', 'inside', 'haven', 'entertaining', 'score', 'fun', 'art', 'fast', 'friendly', 'works great', 'highly recommend', 'audio', 'favorite', 'fall', 'hand', 'shows', 'definately', 'played', 'pleased', 'plus', 'prices', 'comfortable', 'fabulous', 'wonderful', 'bacon', 'fantastic', 'soundtrack', 'incredible', 'cool', 'delicious', 'won disappointed', 'best', 'awesome', 'assure', 'beautiful',

Let's see what the prediction is for a few fake reviews (1 is positive, -1 is negative).

In [26]:
print(clf.predict(vectorizer.transform(["It's a sad movie but very good"])))
print(clf.predict(vectorizer.transform(["Waste of my time"])))
print(clf.predict(vectorizer.transform(["It is not what like"])))
print(clf.predict(vectorizer.transform(["It is not what I m looking for"])))

[1]
[-1]
[1]
[-1]


### Naive Bayes

In [27]:
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB().fit(train_data, train_labels)
nb_preds_test = nb_clf.predict(test_data)
nb_errs_test = np.sum((nb_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(nb_errs_test)/len(test_labels))

Test error:  0.174


Let's see what the prediction is for a few fake reviews (1 is positive, -1 is negative).

In [28]:
print(nb_clf.predict(vectorizer.transform(["It's a sad movie but very good"])))
print(nb_clf.predict(vectorizer.transform(["Waste of my time"])))
print(nb_clf.predict(vectorizer.transform(["It is not what like"])))
print(nb_clf.predict(vectorizer.transform(["It is not what I m looking for"])))

[1]
[-1]
[-1]
[1]


### Support Vector Machines

In [29]:
from sklearn.linear_model import SGDClassifier
svm_clf = SGDClassifier(loss="hinge", penalty='l2')
svm_clf.fit(train_data, train_labels)
svm_preds_test = svm_clf.predict(test_data)
svm_errs_test = np.sum((svm_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(svm_errs_test)/len(test_labels))

Test error:  0.2


Let's again see what the prediction is for a few fake reviews (1 is positive, -1 is negative).

In [30]:
print(svm_clf.predict(vectorizer.transform(["It's a sad movie but very good"])))
print(svm_clf.predict(vectorizer.transform(["Waste of my time"])))
print(svm_clf.predict(vectorizer.transform(["This is not what I like"])))
print(svm_clf.predict(vectorizer.transform(["It is not what I am looking for"])))

[1]
[-1]
[-1]
[-1]


### Recurrent Neural Networks (specifically, Long Short Term Memory Networks)

In [31]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import SpatialDropout1D
from keras.layers import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
max_review_length = 200
tokenizer = Tokenizer(num_words=10000,  #max no. of unique words to keep
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', 
                      lower=True #convert to lower case
                     )
tokenizer.fit_on_texts(sents_processed)

Using TensorFlow backend.


Truncate and pad the input sequences so that they are all the same length.

In [32]:
X = tokenizer.texts_to_sequences(sents_processed)
X = sequence.pad_sequences(X, maxlen= max_review_length)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (3000, 200)
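What padding and truncating do to a single sequence can be shown in a few lines. This sketch assumes the behavior Keras documents as the `pad_sequences` defaults (zeros added at the front, and over-long sequences cut from the front); `pad_pre` is a hypothetical helper written for this example.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last maxlen items, then left-pad with value up to maxlen."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short_seq = pad_pre([5, 12, 7], maxlen=5)
long_seq = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_seq)  # [0, 0, 5, 12, 7]
print(long_seq)   # [3, 4, 5, 6, 7]
```

Since the reviews here are single sentences, nearly every row of X is mostly leading zeros followed by a handful of word indices.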


Recall that y is a vector of 1s and -1s. Now we change it to a matrix with 2 columns, one for -1 and one for 1. You've seen the get_dummies method before!

In [33]:
import pandas as pd
Y = pd.get_dummies(y).values.astype('float32')  # columns: [-1, 1]; the cast covers pandas versions that return booleans
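The two-column encoding is easy to reproduce by hand, which also makes the column order explicit. A minimal sketch in plain Python; `one_hot` is a hypothetical helper, not a pandas function.

```python
def one_hot(labels):
    """Map each label to a row with a 1 in the column of its class."""
    classes = sorted(set(labels))  # e.g. [-1, 1]
    column = {c: j for j, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1
        rows.append(row)
    return classes, rows

classes, encoded = one_hot([-1, 1, 1, -1])
print(classes)  # [-1, 1]
print(encoded)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```

Column 0 stands for the negative class and column 1 for the positive class, which matters later when reading the network's two output probabilities.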

As we did earlier, let’s create training and test sets. Here, we split the data into a training set of 2500 sentences and a test set of 500 sentences (of which 250 are positive and 250 negative).

In [34]:
np.random.seed(0)
test_inds = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False), np.random.choice((np.where(y==1))[0], 250, replace=False))
train_inds = list(set(range(len(labels))) - set(test_inds))
train_data = X[train_inds,]
train_labels = Y[train_inds]
test_data = X[test_inds,]
test_labels = Y[test_inds]

Create the recurrent neural network! This is another rabbit hole that we could spend weeks on. Here's a link if you're interested in learning more about what we're doing here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/.

In [35]:
EMBEDDING_DIM = 200
model = Sequential()
model.add(Embedding(10000, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(250, dropout=0.2,return_sequences=True))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 200)          2000000   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 200, 200)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 200, 250)          451000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               140400    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 202       
Total params: 2,591,602
Trainable params: 2,591,602
Non-trainable params: 0
_________________________________________________________________
None
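The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the helper names are made up for this example.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-long vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # A weight per input-output pair plus one bias per output.
    return input_dim * units + units

layer_counts = [
    embedding_params(10000, 200),  # embedding_1
    lstm_params(200, 250),         # lstm_1
    lstm_params(250, 100),         # lstm_2
    dense_params(100, 2),          # dense_1
]
total_params = sum(layer_counts)
print(layer_counts, total_params)
```

The embedding layer accounts for 2,000,000 of the 2,591,602 parameters, so most of the model's capacity sits in the word vectors.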


Run the model.

In [36]:
epochs = 2
batch_size = 40
model.fit(train_data, train_labels, 
          epochs=epochs, 
          batch_size=batch_size,
          validation_split=0.1)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 2250 samples, validate on 250 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.callbacks.History at 0x7f9257e51f28>

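We see below that the LSTM performs the best out of all the models trained so far, though only narrowly: its test accuracy of about 0.836 corresponds to a test error of about 0.164, compared with 0.184 for logistic regression, 0.174 for Naive Bayes, and 0.2 for the SVM.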

In [37]:
loss, acc = model.evaluate(test_data, test_labels, verbose=2,
                            batch_size=batch_size)
print(f"Loss: {loss}")
## Note: test_data is the held-out test set, so this is test accuracy despite the label
print(f"Validation accuracy: {acc}")

Loss: 0.4158953988552094
Validation accuracy: 0.8360000252723694


Now let’s see how it predicts a test case. In the probability distribution, the first number is the probability that the text is negative, and the second is the probability that the text is positive. Note that the probabilities add to one.

In [38]:
outcome_labels = ['Negative', 'Positive']
new = ["I would not recommend this movie"]
 
def predict_sentiment(text):
    seq = tokenizer.texts_to_sequences(text)
    padded = sequence.pad_sequences(seq, maxlen=max_review_length)
    pred = model.predict(padded)
    print("Probability distribution: ", pred)
    print("Is this a Positive or Negative review? ")
    print(outcome_labels[np.argmax(pred)])

predict_sentiment(new)

Probability distribution:  [[0.8749243  0.12507565]]
Is this a Positive or Negative review? 
Negative
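The two numbers in the probability distribution come from the softmax activation on the final Dense layer, which turns raw scores into probabilities that sum to one. A minimal sketch with made-up scores (the network's actual internal scores are not shown in this notebook):

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.9, -0.05])  # hypothetical scores for [Negative, Positive]
predicted = ['Negative', 'Positive'][probs.index(max(probs))]

print(probs, sum(probs))
print(predicted)
```

Picking the label with the larger probability is what `np.argmax(pred)` does in `predict_sentiment`.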


In [39]:
new = ["It is not what i am looking for"]
predict_sentiment(new)

Probability distribution:  [[0.88306546 0.11693449]]
Is this a Positive or Negative review? 
Negative


The model has to take a guess in all instances. Here we see a case where it classifies the review as positive, but isn't too confident in its guess (close to 50/50).

In [40]:
new = ["This isn't what i am looking for"]
predict_sentiment(new)

Probability distribution:  [[0.48358354 0.5164164 ]]
Is this a Positive or Negative review? 
Positive


## Task: IMDB Dataset Analysis

We next seek to use sentiment analysis on an IMDB dataset consisting of 50k movie reviews. The dataset is very simple. There are two columns: the text of the movie review, and the sentiment (positive or negative).

The file can be downloaded at https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews.

We are going to handle the pre-processing here a bit differently. Above, we stemmed the data using the Porter and Lancaster stemmers. In this new analysis, we're going to use lemmatization instead of stemming. Research and implement this...it's not difficult. Implement the other pre-processing steps (stop word removal, lowercasing, etc.) the same as we did in the tutorial above.

Then select at least two of the models above (Logistic Regression, etc.), evaluate the accuracy of each, and write a few-sentence summary of your findings.