Reference: https://github.com/adeshpande3/LSTM-Sentiment-Analysis

In [1]:
import numpy as np
import pandas as pd 
import tensorflow as tf
from tensorflow.contrib import rnn

# Sentiment Analysis with LSTM

## Read pre-trained word vectors

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. (Source: https://nlp.stanford.edu/projects/glove/)
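
To make the "linear substructures" remark concrete, here is a minimal sketch using made-up 3-dimensional vectors (not real GloVe values): the offset between two word vectors can encode a relation, so `king - man + woman` lands closest to `queen` under cosine similarity. The vectors and the helper names `cosine` and `nearest` are illustrative assumptions.

```python
import math

# Toy 3-dimensional vectors invented for illustration (not real GloVe values).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(target, vectors, exclude=()):
    """Word whose vector has the highest cosine similarity to `target`."""
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(
    toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])]
print(nearest(target, toy_vectors, exclude=("king", "man", "woman")))  # queen
```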

In [9]:
vocab_size = 400000
embed_size = 50

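# Start from random values so every row is initialized before the file is read
wordVectors = np.random.normal(0, size=[vocab_size, embed_size])
wordVectors = wordVectors.astype(np.float32)  # float32, matching the dtype used in the TensorFlow graph
wordsList = []

with open('glove.6B.50d.txt', encoding="utf-8", mode="r") as textFile:
    for word_id, line in enumerate(textFile):
        line = line.split()
        if word_id == 100:
            print(line)  # show one parsed line as a sample
        wordsList.append(line[0])  # the first token is the word itself
        wordVectors[word_id] = np.array(line[1:], dtype=np.float32)  # the remaining tokens are its vector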

['so', '0.60308', '-0.32024', '0.088857', '-0.55176', '0.53182', '0.047069', '-0.36246', '0.0057018', '-0.37665', '0.22534', '-0.13534', '0.35988', '-0.42518', '0.071324', '0.77065', '0.56712', '0.41226', '0.12451', '0.1423', '-0.96535', '-0.39053', '0.34199', '0.56969', '0.031635', '0.69465', '-1.9216', '-0.67118', '0.57971', '0.86088', '-0.59105', '3.7787', '0.30431', '-0.043103', '-0.42398', '-0.063915', '-0.066822', '0.061983', '0.56332', '-0.22335', '-0.47386', '-0.47021', '0.091714', '0.14778', '0.63805', '-0.14356', '-0.0022928', '-0.315', '-0.25187', '-0.26879', '0.36657']


In [32]:
print('Shape of Word Vector: ', wordVectors.shape)
print('Embedding vector of first word: ',wordVectors[0][:5], '...')
print('The index of word `good` is: ', wordsList.index('good'))

Shape of Word Vector:  (400000, 50)
Embedding vector of first word:  [ 0.41800001  0.24968    -0.41242     0.1217      0.34527001] ...
The index of word `good` is:  219


- `wordsList` is a list of words (400,000)
- `wordVectors` holds the embedding vector for each word, one row per word
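
`wordsList.index(word)` scans the list from the start on every call, which adds up when encoding thousands of reviews. One possible alternative, sketched here with a tiny stand-in list and assuming each word appears only once, is to build a dictionary once and look words up in constant time. The names `word_to_id` and `lookup` are assumptions, not part of the original notebook.

```python
# Stand-in for the 400,000-entry wordsList built above.
words_list = ["the", ",", ".", "of", "to", "and"]

# Build the mapping once: word -> row index in the embedding matrix.
word_to_id = {word: idx for idx, word in enumerate(words_list)}

def lookup(word, unknown_id=-1):
    """Return the row index of `word`, or `unknown_id` if it is not in the vocabulary."""
    return word_to_id.get(word, unknown_id)

print(lookup("of"))    # same result as words_list.index("of")
print(lookup("zzzz"))  # unknown words fall back to unknown_id
```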

### An example of sentence coding

![](https://github.com/adeshpande3/LSTM-Sentiment-Analysis/raw/4bb7b1e8c0e8e9f7f649d1f68cb34db0b2b6675e/Images/SentimentAnalysis5.png)

In [36]:
maxSeqLength = 10 #Maximum length of sentence
numDimensions = embed_size #Dimensions for each word vector
firstSentence = np.zeros((maxSeqLength), dtype='int32')
firstSentence[0] = wordsList.index("i")
firstSentence[1] = wordsList.index("thought")
firstSentence[2] = wordsList.index("the")
firstSentence[3] = wordsList.index("movie")
firstSentence[4] = wordsList.index("was")
firstSentence[5] = wordsList.index("incredible")
firstSentence[6] = wordsList.index("and")
firstSentence[7] = wordsList.index("inspiring")
print(firstSentence) #Shows the row index for each word

[   41   804     0  1005    15  7446     5 13767     0     0]


https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup

In [34]:
with tf.Session() as sess:
    print(tf.nn.embedding_lookup(wordVectors,firstSentence).eval().shape)

(250, 50)
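
Conceptually, `tf.nn.embedding_lookup(params, ids)` gathers the rows of `params` selected by `ids`, so the result has one embedding row per index. A plain-Python sketch of that behaviour for a 1-D list of indices, using a made-up 4x3 table:

```python
# Made-up embedding table: 4 words, 3 dimensions each.
table = [
    [0.0, 0.1, 0.2],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]
ids = [2, 0, 0, 3]  # a "sentence" of word indices

def embedding_lookup(params, indices):
    """Gather rows of `params` by index, mimicking the 1-D case of tf.nn.embedding_lookup."""
    return [params[i] for i in indices]

vectors = embedding_lookup(table, ids)
print(len(vectors), len(vectors[0]))  # sequence length, embedding size
```

With NumPy arrays the same gather is `wordVectors[firstSentence]`. The first dimension of the result always equals the number of indices, so a 10-index sentence yields shape `(10, 50)` and a 250-index one yields `(250, 50)`.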


## Read in comments

### Investigate length of comments to determine sequence length

In [35]:
from os import listdir
from os.path import isfile, join
# List and open the reviews from the same directories
positiveDir = 'data/positiveReviews/'
negativeDir = 'data/negativeReviews/'
positiveFiles = [join(positiveDir, f) for f in listdir(positiveDir) if isfile(join(positiveDir, f))]
negativeFiles = [join(negativeDir, f) for f in listdir(negativeDir) if isfile(join(negativeDir, f))]
numWords = []
for pf in positiveFiles:
    with open(pf, "r", encoding='utf-8') as f:
        line=f.readline()
        counter = len(line.split())
        numWords.append(counter)       
print('Positive files finished')

for nf in negativeFiles:
    with open(nf, "r", encoding='utf-8') as f:
        line=f.readline()
        counter = len(line.split())
        numWords.append(counter)  
print('Negative files finished')

numFiles = len(numWords)
print('The total number of files is', numFiles)
print('The total number of words in the files is', sum(numWords))
print('The average number of words in the files is', sum(numWords)/len(numWords))

Positive files finished
Negative files finished
The total number of files is 25000
The total number of words in the files is 5844680
The average number of words in the files is 233.7872


*Assign Sequence Length*

In [37]:
maxSeqLength = 250
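
A cutoff of 250 sits just above the average review length of about 234 words. A quick way to judge a candidate cutoff is to compute the share of reviews at or below it. The sketch below uses a small made-up list of lengths in place of `numWords`, and `coverage` is a hypothetical helper.

```python
def coverage(lengths, cutoff):
    """Fraction of documents whose word count is at most `cutoff`."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

# Made-up word counts standing in for numWords.
sample_lengths = [120, 180, 233, 250, 251, 400, 90, 1000]

for cutoff in (100, 250, 500):
    print(cutoff, coverage(sample_lengths, cutoff))
```

Reviews longer than the cutoff are truncated and shorter ones are zero-padded, so the choice trades lost words against wasted computation on padding.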

### Example of translating

*Before cleaning, raw text*

In [38]:
fname = positiveFiles[0]
with open(fname, "r", encoding='utf-8') as f:
    for lines in f:
        print(lines)

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!


*After cleaning*

In [39]:
# Removes punctuation, parentheses, question marks, etc., leaving only letters, digits, and spaces
import re
strip_special_chars = re.compile("[^A-Za-z0-9 ]+")

def cleanSentences(string):
    # Lowercase first, then replace HTML line breaks with spaces
    string = string.lower().replace("<br />", " ")
    return re.sub(strip_special_chars, "", string)

with open(fname, "r", encoding='utf-8') as f:
    for lines in f:
        print(cleanSentences(lines))

bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt
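
One side effect of this cleaning is that apostrophes are deleted rather than replaced by spaces, so contractions collapse into single tokens (`isn't` becomes `isnt`, as in the output above). Such tokens may or may not be present in the GloVe vocabulary. The short check below applies the same regular expression to a made-up string; `clean_sentence` is just a renamed copy for illustration.

```python
import re

strip_special_chars = re.compile("[^A-Za-z0-9 ]+")

def clean_sentence(text):
    """Lowercase, turn HTML line breaks into spaces, drop other non-alphanumeric characters."""
    text = text.lower().replace("<br />", " ")
    return strip_special_chars.sub("", text)

sample = "It isn't bad!<br />Really."
print(clean_sentence(sample))          # it isnt bad really
print(clean_sentence(sample).split())  # ['it', 'isnt', 'bad', 'really']
```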


*After encoding*

In [40]:
firstFile = np.zeros((maxSeqLength), dtype='int32')
with open(fname, "r", encoding='utf-8') as f:
    line = f.readline()
    cleanedLine = cleanSentences(line)
    split = cleanedLine.split()
    # Encode at most maxSeqLength words; shorter reviews stay zero-padded
    for indexCounter, word in enumerate(split[:maxSeqLength]):
        try:
            firstFile[indexCounter] = wordsList.index(word)
        except ValueError:
            firstFile[indexCounter] = 399999  # Vector for unknown words
firstFile

array([174943,    152,     14,      7,   7362,   2841,     20,   1421,
           22,      0,    215,     79,     19,     77,     68,   1009,
           59,    164,    214,    125,     19,   2562,    192,   1678,
           82,      6,      0,   3174,   8104,    410,    285,      4,
          733,     12, 174943,   7984,  15303,     14,    181,   2386,
            4,   2532,     73,     14,   2562,      0,  14170,      4,
         3981,   7980,      0,  34401,    543,     38,     86,    253,
          248,    131,     44,  22495,   2562,  31166,      0,  91887,
            3,      0,   1115,    794,     64,   9794,    285,      3,
            0,    888,     41,   1522,      5,     44,    543,     61,
           41,    822,      0,   1942,      6,     42,      7,   1283,
         2648,    977,      4,   6292,    135,      0,    164,     41,
         1040,   3151,     22,    152,      7,   2392,    331,   5537,
        14663,    187,      4,  11739,     48,      3,    392,   2562,
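
The encoding step pads short reviews with zeros, truncates long ones at `maxSeqLength`, and maps out-of-vocabulary words to a reserved index. A compact sketch of that logic, using a toy vocabulary and hypothetical names (`encode`, `vocab`, `UNKNOWN`):

```python
def encode(tokens, vocab, max_len, unknown_id):
    """Map tokens to ids, truncate to `max_len`, and right-pad with zeros."""
    ids = [vocab.get(tok, unknown_id) for tok in tokens[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Toy vocabulary; index 0 doubles as padding, as in the notebook.
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
UNKNOWN = 9  # stand-in for the notebook's 399999

print(encode("the movie was great".split(), vocab, 6, UNKNOWN))
print(encode("the film was really great indeed ok".split(), vocab, 6, UNKNOWN))
```

Note that index 0 is also the row for `the`, so in this scheme padding is indistinguishable from that word.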
      

### Load previous results

In [41]:
ids = np.load('./script/idsMatrix.npy')
ids.shape  # 25,000 reviews, each encoded as 250 word indices

(25000, 250)

### Define functions to get batch of train samples with half positive and half negative

In [42]:
from random import randint

def getTrainBatch():
    # Even slots draw a positive review, odd slots a negative one
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        if (i % 2 == 0):
            num = randint(1,11499)
            labels.append([1,0])  # positive
        else:
            num = randint(13499,24999)
            labels.append([0,1])  # negative
        arr[i] = ids[num-1:num]
    return arr, labels

def getTestBatch():
    # Test reviews come from the middle band of `ids` (indices 11499 to 13499)
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        num = randint(11499,13499)
        if (num <= 12499):
            labels.append([1,0])  # positive
        else:
            labels.append([0,1])  # negative
        arr[i] = ids[num-1:num]
    return arr, labels

### Example of a batch

In [43]:
batchSize = 10
arr_labels = getTrainBatch()
print('Length of x1, ', len(arr_labels[0][0])) # Shape of arr is [batch_size, max_sequence]
print('Length of x2, ', len(arr_labels[0][1])) # Shape of arr is [batch_size, max_sequence]
print('...Batch size')
print('Y, ', arr_labels[1])

Length of x1,  250
Length of x2,  250
...Batch size
Y,  [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]
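
The index ranges in `getTrainBatch` and `getTestBatch` imply that `ids` stores positive reviews first and negative reviews second, with a band in the middle reserved for testing. The sketch below restates the training-side sampling on index numbers only. The function name and the exact boundaries are illustrative assumptions, chosen here so that the training ranges do not touch the test band.

```python
import random

def sample_train_indices(batch_size, pos_range=(0, 11499), neg_range=(13500, 24999), rng=random):
    """Alternate positive and negative row indices with one-hot labels.

    Ranges are inclusive and hypothetical; rows in between are left for testing.
    """
    indices, labels = [], []
    for i in range(batch_size):
        if i % 2 == 0:
            indices.append(rng.randint(*pos_range))
            labels.append([1, 0])  # positive
        else:
            indices.append(rng.randint(*neg_range))
            labels.append([0, 1])  # negative
    return indices, labels

random.seed(0)
idx, lab = sample_train_indices(4)
print(idx, lab)
```

Alternating the two classes keeps every batch balanced, which is why the label output above is a strict `[1, 0], [0, 1]` pattern.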


## Define LSTM model

![](https://github.com/adeshpande3/LSTM-Sentiment-Analysis/raw/4bb7b1e8c0e8e9f7f649d1f68cb34db0b2b6675e/Images/SentimentAnalysis16.png)

In [51]:
batchSize = 24
lstmUnits = 64
n_classes = 2
iterations = 10000  # 100000
learning_rate = 0.001
numDimensions = 50

### Transform input

In [22]:
x = tf.placeholder(tf.int32,[batchSize, maxSeqLength]) # Matches the shape of `arr` returned by getTrainBatch/getTestBatch
y = tf.placeholder(tf.int32,[batchSize, n_classes]) # Matches the shape of `labels` returned by getTrainBatch/getTestBatch

In [45]:
# Look up the embedding of every word index: [batchSize, maxSeqLength] -> [batchSize, maxSeqLength, numDimensions]
data = tf.nn.embedding_lookup(wordVectors, x)
# Split along the time axis into a list of maxSeqLength tensors of shape [batchSize, numDimensions]
data = tf.unstack(data, maxSeqLength, 1) # https://www.tensorflow.org/api_docs/python/tf/unstack

In [46]:
print('Length:', len(data), ', Element:', data[0])
print() 

Length: 250 , Element: Tensor("unstack_1:0", shape=(24, 50), dtype=float32)
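
`tf.unstack(data, maxSeqLength, 1)` splits the `[batch, time, dim]` tensor along the time axis into a list of `maxSeqLength` tensors of shape `[batch, dim]`, which is the list-of-steps input that `static_rnn` expects. The same rearrangement on nested Python lists, with tiny made-up sizes and a hypothetical helper name:

```python
def unstack_time_axis(batch):
    """Turn [batch][time][dim] nested lists into a list over time of [batch][dim]."""
    num_steps = len(batch[0])
    return [[example[t] for example in batch] for t in range(num_steps)]

# 2 examples, 3 time steps, 2 dimensions per step.
batch = [
    [[1, 1], [2, 2], [3, 3]],
    [[4, 4], [5, 5], [6, 6]],
]
steps = unstack_time_axis(batch)
print(len(steps))  # number of time steps
print(steps[0])    # first time step for every example
```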



![](https://github.com/adeshpande3/LSTM-Sentiment-Analysis/raw/c25c41adaa68a0968bdc3540a71b0791b76860cd/Images/SentimentAnalysis13.png)

### Main Model

A typical single-layer LSTM, with dropout applied to its outputs.

In [None]:
lstmCell = rnn.BasicLSTMCell(lstmUnits)
lstmCell = rnn.DropoutWrapper(cell = lstmCell, output_keep_prob = 0.75)
outputs, _ = tf.nn.static_rnn(lstmCell, data, dtype= tf.float32)

In [27]:
out_weights = tf.Variable(tf.random_normal([lstmUnits, n_classes]))
out_bias = tf.Variable(tf.random_normal([n_classes]))
prediction = tf.matmul(outputs[-1], out_weights)+ out_bias
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits = prediction, labels = y))
opt = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(loss)
correctPred = tf.equal(tf.argmax(prediction, 1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
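
`softmax_cross_entropy_with_logits_v2` applies a softmax to the logits and then takes the cross-entropy against the one-hot labels, while accuracy compares the arg-max of prediction and label. A small pure-Python version with made-up logits shows the arithmetic; the helper names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Made-up logits for two reviews and their one-hot labels.
batch_logits = [[2.0, 0.0], [0.5, 1.5]]
batch_labels = [[1, 0], [1, 0]]

losses = [cross_entropy(row, lab) for row, lab in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)
correct = [row.index(max(row)) == lab.index(max(lab))
           for row, lab in zip(batch_logits, batch_labels)]
accuracy_value = sum(correct) / len(correct)
print(round(mean_loss, 4), accuracy_value)
```

The second example is misclassified, so it contributes most of the loss and pulls the accuracy down to one half.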

## Start Training

In [None]:
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
saver = tf.train.Saver()

for i in range(iterations):
    # Next batch of reviews
    nextBatch, nextBatchLabels = getTrainBatch()
    sess.run(opt, {x: nextBatch, y: nextBatchLabels})

    # Report accuracy and loss on the current training batch every 10 iterations
    if (i % 10 == 0):
        acc, los = sess.run([accuracy, loss], feed_dict={x: nextBatch, y: nextBatchLabels})
        print('For iter ',i,', Accuracy: ', acc, ' ,Loss: ',los)

## Apply to test dataset

In [53]:
iterations = 10
for i in range(iterations):
    nextBatch, nextBatchLabels = getTestBatch()
    print("Accuracy for this batch:", (sess.run(accuracy, {x: nextBatch, y: nextBatchLabels})) * 100)

Accuracy for this batch: 91.6666686535
Accuracy for this batch: 75.0
Accuracy for this batch: 87.5
Accuracy for this batch: 87.5
Accuracy for this batch: 75.0
Accuracy for this batch: 91.6666686535
Accuracy for this batch: 83.3333313465
Accuracy for this batch: 95.8333313465
Accuracy for this batch: 75.0
Accuracy for this batch: 87.5


# Train a second model: a CNN for Sentiment Analysis

![example](http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-8.03.47-AM-1024x413.png)

## Transform input

In [56]:
data_cnn = tf.nn.embedding_lookup(wordVectors, x)  # Same lookup as for the LSTM
data_cnn = tf.reshape(data_cnn, [batchSize,maxSeqLength,numDimensions,1]) # Add a channel axis: conv2d expects 4-D input [batch, height, width, channels]
data_cnn

<tf.Tensor 'Reshape:0' shape=(24, 250, 50, 1) dtype=float32>

## Define model parameters

In [79]:
filter_size = 2 # Number of consecutive words each filter covers
num_filters = 4 # Number of filters, matching the figures
filter_shape = [filter_size, embed_size, 1, num_filters] # [height, width, in_channels, out_channels]; `1` is the number of input channels
iterations = 1000

In [70]:
W_conv1 = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1))
b_conv1 = tf.Variable(tf.constant(0.1, shape=[num_filters]))
h_conv1 = tf.nn.relu(tf.nn.conv2d(data_cnn, W_conv1,
                                  strides=[1, 1, 1, 1], padding='VALID') + b_conv1) # VALID: the filter spans the full embedding width, so the output width is 1; SAME would keep it at 50
h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, maxSeqLength - filter_size + 1, 1, 1], 
                         strides=[1, 1, 1, 1], padding='VALID')
h_pool1_flat = tf.reshape(h_pool1, [-1, num_filters])
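
The shapes of `h_conv1` and `h_pool1` follow from the `VALID` padding rule, where each spatial output size is `floor((input - window) / stride) + 1`. A minimal helper (`valid_out` is a hypothetical name) reproduces them from the notebook's sizes:

```python
def valid_out(size, window, stride=1):
    """Output length along one axis for a VALID convolution or pooling."""
    return (size - window) // stride + 1

max_seq_length, embed_size, filter_size = 250, 50, 2

conv_h = valid_out(max_seq_length, filter_size)  # filter slides over words
conv_w = valid_out(embed_size, embed_size)       # filter spans the whole embedding
pool_h = valid_out(conv_h, conv_h)               # max over all word positions
print(conv_h, conv_w, pool_h)  # 249 1 1
```

With `SAME` padding and stride 1 the output width would stay at 50 instead of collapsing to 1, which is why `VALID` is required for this filter shape.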

In [71]:
h_conv1

<tf.Tensor 'Relu_3:0' shape=(24, 249, 1, 4) dtype=float32>

In [72]:
h_pool1

<tf.Tensor 'MaxPool_2:0' shape=(24, 1, 1, 4) dtype=float32>

In [73]:
h_pool1_flat

<tf.Tensor 'Reshape_3:0' shape=(24, 4) dtype=float32>

In [74]:
keep_prob_ = tf.placeholder(tf.float32)
h_pool1_flat_drop= tf.nn.dropout(h_pool1_flat, keep_prob_)

W_fc1 = tf.Variable(tf.truncated_normal([num_filters, n_classes], stddev= 0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape = [n_classes]))
prediction_ = tf.nn.relu(tf.matmul(h_pool1_flat_drop, W_fc1) + b_fc1) # use ReLU as the activation function

In [75]:
h_pool1_flat_drop

<tf.Tensor 'dropout_1/mul:0' shape=(24, 4) dtype=float32>

In [76]:
prediction_

<tf.Tensor 'Relu_4:0' shape=(24, 2) dtype=float32>

In [77]:
loss_ = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits = prediction_, labels = y))
opt_ = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(loss_)
correctPred_ = tf.equal(tf.argmax(prediction_, 1), tf.argmax(y,1))
accuracy_ = tf.reduce_mean(tf.cast(correctPred_, tf.float32))

## Start Training

In [None]:
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
saver = tf.train.Saver()

for i in range(iterations):
    nextBatch, nextBatchLabels = getTrainBatch()
    sess.run(opt_, {x: nextBatch, y: nextBatchLabels, keep_prob_: 0.50})

    if (i % 10 == 0):
        acc_=sess.run(accuracy_,feed_dict={x:nextBatch, y:nextBatchLabels, keep_prob_: 1.00})
        los_=sess.run(loss_,feed_dict={x:nextBatch, y:nextBatchLabels, keep_prob_: 1.00})
        print('For iter ',i,', Accuracy: ', acc_, ' ,Loss: ',los_)

# To do: Combine CNN and LSTM

In [81]:
h_conv1

<tf.Tensor 'Relu_3:0' shape=(24, 249, 1, 4) dtype=float32>

In [82]:
h_pool2 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 1, 1], 
                         strides=[1, 2, 1, 1], padding='VALID')
h_pool2

<tf.Tensor 'MaxPool_3:0' shape=(24, 124, 1, 4) dtype=float32>

In [83]:
# Goal: a list of per-time-step tensors like the LSTM input, e.g. Length: 250 , Element: Tensor("unstack:0", shape=(24, 50), dtype=float32)
h_unstack = tf.unstack(h_pool2, axis = 3)
h_unstack

[<tf.Tensor 'unstack_2:0' shape=(24, 124, 1) dtype=float32>,
 <tf.Tensor 'unstack_2:1' shape=(24, 124, 1) dtype=float32>,
 <tf.Tensor 'unstack_2:2' shape=(24, 124, 1) dtype=float32>,
 <tf.Tensor 'unstack_2:3' shape=(24, 124, 1) dtype=float32>]

In [84]:
h_concat = tf.concat(h_unstack, axis = 2)
h_concat

<tf.Tensor 'concat:0' shape=(24, 124, 4) dtype=float32>

In [85]:
data_ = tf.unstack(h_concat, 124 , 1) # https://www.tensorflow.org/api_docs/python/tf/unstack 
print('Length:', len(data_), ', Element:', data_[0])
print() 

Length: 124 , Element: Tensor("unstack_3:0", shape=(24, 4), dtype=float32)



The resulting list of 124 tensors can then be fed into the same kind of LSTM module as above.