We need the following libraries to prepare our data.

* The listdir and join functions are needed to locate the individual text files so they can be combined into one corpus.
* sent_tokenize and word_tokenize split the reviews into sentences and words.
* RegexpTokenizer extracts tokens that match a regular expression; we use it to keep alphanumeric words and drop special characters.
* Stop words are words such as "and" and "the", which occur very frequently but contribute little to the overall meaning of a sentence.

In [None]:
from os import listdir
from os.path import join
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import pandas as pd
import numpy as np

We define three variables to start with:

* stop_words: the set of common English stop words provided by NLTK.
* exclude_list: the part-of-speech tags we would like to exclude from our sentences after tagging: singular nouns (NN), plural nouns (NNS) and singular proper nouns (NNP).
* SpeCharword_tokenizer: a RegexpTokenizer whose pattern ('[A-Za-z0-9]\w*') matches tokens that start with a letter or digit followed by any number of word characters, so words and numbers are kept and special characters are ignored.
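
To make the pattern concrete, here is a minimal sketch that applies the same regular expression with Python's standard `re` module instead of NLTK. It assumes that `findall` matching behaves like the tokenizer's default (non-gap) mode; the function name `tokenize` and the sample sentence are illustrative only.

```python
import re

# Same pattern as SpeCharword_tokenizer: a letter or digit, then any word characters
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]\w*")

def tokenize(text):
    """Lowercase the text and return the alphanumeric tokens it contains."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Hello, world! It's 2 good_movies.")
print(tokens)  # ['hello', 'world', 'it', 's', '2', 'good_movies']
```

Punctuation is dropped, and the apostrophe splits "It's" into two tokens, which is worth keeping in mind when inspecting the word counts later.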

In [None]:
stop_words = set(stopwords.words('english'))
exclude_list = ['NN','NNS','NNP']
SpeCharword_tokenizer = RegexpTokenizer(r'[A-Za-z0-9]\w*')  # raw string so \w is not treated as a string escape

The read_txt_data function below reads a text file line by line, tokenizes each line, and returns the tokenized lines as a list.

In [None]:
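def read_txt_data(filename):
    """Return one list of lowercase tokens per line of the file."""
    data = []
    # The with-statement closes the file even if tokenizing raises an error
    with open(filename, "r", encoding='utf-8') as file:
        for line in file:
            # Turn HTML line breaks into spaces so adjacent words are not glued together
            cleaned = line.lower().replace('\n', '').replace('<br />', ' ')
            data.append(SpeCharword_tokenizer.tokenize(cleaned))
    return data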

Our training dataset consists of 12,500 positive and 12,500 negative labeled reviews. Every review is stored in a separate text file.

In [None]:
file_folder_pos = "/Users/rajdesai/Downloads/aclImdb/train/pos"
file_folder_neg = "/Users/rajdesai/Downloads/aclImdb/train/neg"

For now, we take the first 7,000 files from each folder and store the positive and negative reviews in two separate lists.

In [None]:
positiveFiles = []
index = 0
for textfile in listdir(file_folder_pos):
        if(index <7000):
                positiveFiles.extend(read_txt_data(join(file_folder_pos,textfile)))
                print(index)
        else:
                break
        index = index + 1

In [None]:
negativeFiles = []
index = 0
for textfile in listdir(file_folder_neg):
        if(index <7000):
                negativeFiles.extend(read_txt_data(join(file_folder_neg,textfile)))
                print(index)
        else:
                break
        index = index + 1

Similarly, we create two lists of test data with 200 reviews each.

In [None]:
file_folder_pos_test = "/Users/rajdesai/Downloads/aclImdb/test/pos"
file_folder_neg_test = "/Users/rajdesai/Downloads/aclImdb/test/neg"

In [None]:
positiveTestFiles = []

index = 0
for textfile in listdir(file_folder_pos_test ):
        if(index <200):
                positiveTestFiles.extend(read_txt_data(join(file_folder_pos_test ,textfile)))
        else:
                break
        index = index + 1



In [None]:
negativeTestFiles = []

index = 0
for textfile in listdir(file_folder_neg_test):
        if(index <200):
                negativeTestFiles.extend(read_txt_data(join(file_folder_neg_test,textfile)))
        else:
                break
        index = index + 1
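
The four loading loops above share the same shape, so they could be folded into one helper. The following is only a sketch under stated assumptions: `load_reviews` and `read_lines` are hypothetical names, `read_lines` is a simplified stand-in for read_txt_data, and the file names are made up. Sorting the directory listing is an addition that makes the selected subset reproducible, since `os.listdir` returns entries in arbitrary order.

```python
import os
import tempfile
from itertools import islice

def load_reviews(folder, limit, read_file):
    """Read at most `limit` files from `folder` and collect what `read_file` returns."""
    reviews = []
    # sorted() makes the selection reproducible; os.listdir order is arbitrary
    for name in islice(sorted(os.listdir(folder)), limit):
        reviews.extend(read_file(os.path.join(folder, name)))
    return reviews

def read_lines(path):
    """Stand-in for read_txt_data: one list of whitespace-split tokens per line."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.lower().split() for line in handle]

# Small demonstration with temporary files
with tempfile.TemporaryDirectory() as folder:
    for i in range(5):
        with open(os.path.join(folder, f"{i}_7.txt"), "w", encoding="utf-8") as handle:
            handle.write(f"Review number {i}\n")
    sample = load_reviews(folder, 3, read_lines)

print(sample)  # [['review', 'number', '0'], ['review', 'number', '1'], ['review', 'number', '2']]
```

With such a helper, the positive training list would be built with a call like `load_reviews(file_folder_pos, 7000, read_txt_data)`.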

Next, we create a bag of words: a dictionary that maps every distinct word in the training reviews to the number of times it occurs.

In [None]:
dict_bag_of_words= {}
for line in positiveFiles:
        for word in line:
                dict_bag_of_words[word] = dict_bag_of_words.get(word, 0)+1

for line in negativeFiles:
        for word in line:
                dict_bag_of_words[word] = dict_bag_of_words.get(word, 0)+1


for k,v in sorted(dict_bag_of_words.items(), key=lambda p:p[1]):
        print(k,v)
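
The same counting can be written more compactly with `collections.Counter` from the standard library. This is a sketch rather than a replacement for the cell above; `count_words` and the two toy corpora are illustrative names and data.

```python
from collections import Counter

def count_words(*corpora):
    """Count token occurrences across any number of tokenized corpora."""
    counts = Counter()
    for corpus in corpora:
        for line in corpus:
            counts.update(line)  # adds 1 for every token in the line
    return counts

positive_sample = [["great", "movie"], ["great", "cast"]]
negative_sample = [["boring", "movie"]]
bag = count_words(positive_sample, negative_sample)
print(bag["great"], bag["movie"], bag["boring"])  # 2 2 1
```

`bag.most_common(n)` then returns the n most frequent words, which is usually more useful than printing the whole vocabulary.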

I have downloaded two pre-trained model files, which will help us with feature extraction, from: https://github.com/adeshpande3/LSTM-Sentiment-Analysis

* wordsList.npy contains a list of words, and wordVectors.npy contains the pre-trained word-embedding vector for each of those words.

In [None]:
wrdlistmodel = "/Users/rajdesai/Downloads/models/wordsList.npy"
wrdvecmodel = "/Users/rajdesai/Downloads/models/wordVectors.npy"

We load both files. wordsList is initially loaded as a NumPy array of byte strings, so we convert it to a list and decode each entry to a regular string.

In [None]:
wordsList = np.load(wrdlistmodel)
wordsList = wordsList.tolist() 
wordsList = [word.decode('UTF-8') for word in wordsList]
wordVectors = np.load(wrdvecmodel)

The following function tokenizes a review, looks up each word in the word list we loaded, and records the word's index.
The result is a fixed-length array of word indices for the review, which the embedding lookup later converts to vectors.

In [None]:
MAX_SENTENCE_LENGTH = 300

In [None]:
def get_vector_array(sentence):
    """Convert a raw review into a (24, MAX_SENTENCE_LENGTH) array of word indices."""
    sentence = SpeCharword_tokenizer.tokenize(sentence.lower().replace('\n', '').replace('<br />', ' '))
    # The model expects a full batch of 24 rows; only row 0 holds the review, the rest stay zero
    vect = np.zeros((24, MAX_SENTENCE_LENGTH), dtype='int32')
    # Keep at most MAX_SENTENCE_LENGTH tokens
    for indx, word in enumerate(sentence[:MAX_SENTENCE_LENGTH]):
        try:
            vect[0][indx] = wordsList.index(word)
        except ValueError:
            vect[0][indx] = 399999  # Index reserved for unknown words
    return vect
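
Calling `wordsList.index(word)` scans the whole list for every token, which becomes slow for a large vocabulary. A possible alternative, sketched here in plain Python, is to build a dictionary once and look words up in constant time. The names `build_index`, `encode` and `UNKNOWN_INDEX`, and the toy vocabulary, are assumptions for illustration and are not part of the original notebook.

```python
UNKNOWN_INDEX = 399999  # same placeholder index the notebook uses for unknown words

def build_index(words):
    """Map each word to the position of its first occurrence in `words`."""
    index = {}
    for position, word in enumerate(words):
        index.setdefault(word, position)  # keep the first position, like list.index
    return index

def encode(tokens, index, max_length, unknown=UNKNOWN_INDEX):
    """Return exactly `max_length` word indices, zero-padded on the right."""
    ids = [index.get(token, unknown) for token in tokens[:max_length]]
    return ids + [0] * (max_length - len(ids))

vocabulary = ["the", "movie", "was", "great"]  # toy stand-in for wordsList
word_to_index = build_index(vocabulary)
print(encode(["the", "movie", "was", "awful"], word_to_index, 6))
# [0, 1, 2, 399999, 0, 0]
```

The returned list can be copied into a row of a NumPy array, so the rest of the pipeline would not need to change.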

The code below iterates through our tokenized reviews and builds a matrix as a NumPy array.

Each row of the matrix corresponds to one review (the row number serves as the review id) and holds the word indices of that review; the labels are collected in a separate list.

In [None]:
file_count = len(positiveFiles) + len(negativeFiles)
ids = np.zeros((file_count, MAX_SENTENCE_LENGTH), dtype='int32')
file_indx = 0
ip_labels = []
for line in positiveFiles:
    # Keep at most MAX_SENTENCE_LENGTH tokens per review
    for indx, word in enumerate(line[:MAX_SENTENCE_LENGTH]):
        try:
            ids[file_indx][indx] = wordsList.index(word)
        except ValueError:
            ids[file_indx][indx] = 399999  # Index reserved for unknown words
    ip_labels.append([1, 0])  # One label per review (not per word): positive
    file_indx = file_indx + 1

for line in negativeFiles:
    for indx, word in enumerate(line[:MAX_SENTENCE_LENGTH]):
        try:
            ids[file_indx][indx] = wordsList.index(word)
        except ValueError:
            ids[file_indx][indx] = 399999  # Index reserved for unknown words
    ip_labels.append([0, 1])  # One label per review (not per word): negative
    file_indx = file_indx + 1

We save the matrix, i.e. the array we have at the end of the last iteration, as idsMatrix1.

In [None]:
np.save('/Users/rajdesai/Downloads/models/idsMatrix1', ids)

In [None]:
load_id = np.load('/Users/rajdesai/Downloads/models/idsMatrix1.npy')  # np.save appends the .npy extension

In [None]:
for i in load_id[0]:
        print(i)

Similarly, we build a test matrix of word indices and a list of test labels.

In [None]:
test_file_count = len(positiveTestFiles) + len(negativeTestFiles)
test_ids = np.zeros((test_file_count, MAX_SENTENCE_LENGTH), dtype='int32')
file_indx = 0
test_labels = []
for line in positiveTestFiles:
    # Keep at most MAX_SENTENCE_LENGTH tokens per review
    for indx, word in enumerate(line[:MAX_SENTENCE_LENGTH]):
        try:
            test_ids[file_indx][indx] = wordsList.index(word)
        except ValueError:
            test_ids[file_indx][indx] = 399999  # Index reserved for unknown words
    test_labels.append([1, 0])  # One label per review (not per word): positive
    file_indx = file_indx + 1

for line in negativeTestFiles:
    for indx, word in enumerate(line[:MAX_SENTENCE_LENGTH]):
        try:
            test_ids[file_indx][indx] = wordsList.index(word)
        except ValueError:
            test_ids[file_indx][indx] = 399999  # Index reserved for unknown words
    test_labels.append([0, 1])  # One label per review (not per word): negative
    file_indx = file_indx + 1

Now that the index matrix that serves as input to our LSTM model is ready, we can start training.

First we define our training parameters:
* batchSize: number of samples per training iteration.
* lstmUnits: number of hidden units in the LSTM cell.

We then initialize a TensorFlow session and build the graph:

* We define a weight matrix and a bias vector for the output layer.

* The softmax function is applied to the output layer as part of the cross-entropy loss.

* The Adam optimizer is used to minimize the loss.
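
The training cell calls getTrain and getTest, which are not defined anywhere in this notebook. The sketch below shows one way such helpers might look, assuming each one draws a random batch (with replacement) and that there is exactly one label per row of the index matrix. `sample_batch` and the toy data are hypothetical; the original helpers may have worked differently.

```python
import random

def sample_batch(ids, labels, batch_size, rng=random):
    """Draw `batch_size` random rows of `ids` together with their labels."""
    if len(ids) != len(labels):
        raise ValueError("expected exactly one label per review")
    chosen = [rng.randrange(len(ids)) for _ in range(batch_size)]
    batch = [ids[i] for i in chosen]
    batch_labels = [labels[i] for i in chosen]
    return batch, batch_labels

# Toy data: 4 reviews of 3 word indices each, one one-hot label per review
toy_ids = [[1, 2, 0], [3, 0, 0], [4, 5, 6], [7, 8, 0]]
toy_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
batch, batch_labels = sample_batch(toy_ids, toy_labels, 2, random.Random(0))
print(len(batch), len(batch_labels))  # 2 2

# Hypothetical wrappers matching the names used in the training cell:
# def getTrain(batch_size): return sample_batch(ids, ip_labels, batch_size)
# def getTest(batch_size):  return sample_batch(test_ids, test_labels, batch_size)
```

Because the positive reviews are stored before the negative ones, random sampling matters here: reading the rows in order would feed the model long runs of a single class.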

    

In [None]:
batchSize = 24
lstmUnits = 64
numClasses = 2
iterations = 100000
numDimensions = 300

import datetime
import tensorflow as tf

tf.reset_default_graph()  # Reset first so the session is bound to the graph built below
sess = tf.Session()

labels = tf.placeholder(tf.float32, [batchSize, numClasses])
input_data = tf.placeholder(tf.int32, [batchSize, MAX_SENTENCE_LENGTH])

data = tf.Variable(tf.zeros([batchSize, MAX_SENTENCE_LENGTH, numDimensions]),dtype=tf.float32)
data = tf.nn.embedding_lookup(wordVectors,input_data)

lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.75)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32)

weight = tf.Variable(tf.truncated_normal([lstmUnits, numClasses]))
bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)

correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32)) 

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)



tf.summary.scalar('Loss', loss)
tf.summary.scalar('Accuracy', accuracy)
merged = tf.summary.merge_all()
logdir = "tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
writer = tf.summary.FileWriter(logdir, sess.graph)

sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())

for i in range(iterations):
   #Next Batch of reviews
   nextBatch, nextBatchLabels = getTrain(batchSize);
   sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})
   
   #Write summary to Tensorboard
   if (i % 50 == 0):
       summary = sess.run(merged, {input_data: nextBatch, labels: nextBatchLabels})
       writer.add_summary(summary, i)

   #Save the network every 100 training iterations
   if (i % 100 == 0 and i != 0):
       save_path = saver.save(sess, "/Users/rajdesai/Downloads/models/pretrained_lstm.ckpt", global_step=i)
       print("saved to %s" % save_path)
writer.close()


iterations = 20
for i in range(iterations):
    nextBatch, nextBatchLabels = getTest(24);
    print("Accuracy :", (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)



We define an input text, which is a sample review to test our model.

We generate the vector array of the review and pass it through our model for prediction.

In [None]:
ip_text = "This is really a new low in entertainment. Even though there are a lot worse movies out.<br /><br />In the Gangster / Drug scene genre it is hard to have a convincing storyline (this movies does not, i mean Sebastians motives for example couldn't be more far fetched and worn out cliché.) Then you would also need a setting of character relationships that is believable (this movie does not.) <br /><br />Sure Tristan is drawn away from his family but why was that again? what's the deal with his father again that he has to ask permission to go out at his age? interesting picture though to ask about the lack and need of rebellious behavior of kids in upper class family. But this movie does not go in this direction. Even though there would be the potential judging by the random Backflashes. Wasn't he already down and out, why does he do it again? <br /><br />So there are some interesting questions brought up here for a solid socially critic drama (but then again, this movie is just not, because of focusing on cool production techniques and special effects an not giving the characters a moment to reflect and most of all forcing the story along the path where they want it to be and not paying attention to let the story breath and naturally evolve.) <br /><br />It wants to be a drama to not glorify abuse of substances and violence (would be political incorrect these days, wouldn't it?) but on the other hand it is nothing more then a cheap action movie (like there are so so many out there) with an average set of actors and a Vinnie Jones who is managing to not totally ruin what's left of his reputation by doing what he always does.<br /><br />So all in all i .. just ... can't recommend it.<br /><br />1 for Vinnie and 2 for the editing."
ip_vect = get_vector_array(ip_text)

In [None]:
predict = sess.run(prediction, {input_data: ip_vect})[0]

if (predict[0] > predict[1]):
        print("Positive")
else:
        print("Negative")
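
The comparison above works on raw output scores (logits). If a confidence value is wanted as well, the scores can be passed through a softmax, as in this plain-Python sketch. The `softmax` helper and the example logits are made up for illustration; the [positive, negative] order follows the label encoding used during training.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

example_logits = [0.3, 1.7]  # made-up values in [positive, negative] order
probabilities = softmax(example_logits)
label = "Positive" if probabilities[0] > probabilities[1] else "Negative"
print(label, [round(p, 3) for p in probabilities])
```

Softmax preserves the ordering of the scores, so the predicted class is the same as with the direct comparison; only the scale of the numbers changes.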