# Project: "Sentiment Analysis from Review"

# Getting Our Machine Ready

In [1]:
# Note: the code below targets the TensorFlow 1.x API (tf.contrib, tf.placeholder, tf.Session)
import nltk
nltk.download('stopwords')
import re
import pandas as pd
import numpy as np
import tensorflow as tf
from nltk.corpus import stopwords
from string import punctuation
from collections import namedtuple
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Using TensorFlow backend.


# Data Set Preparation

In [0]:
# Load the data
train_data = pd.read_csv("https://raw.githubusercontent.com/tanvirehsan/sentiment-analysis-tensorflow/master/Data/word2vec-nlp-tutorial/labeledTrainData.tsv", delimiter="\t")
test_data = pd.read_csv("https://raw.githubusercontent.com/tanvirehsan/sentiment-analysis-tensorflow/master/Data/word2vec-nlp-tutorial/testData.tsv", delimiter="\t")

# Data Inspection

In [3]:
#Print Training Data
train_data.head()

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [4]:
#Print Test Data
test_data.head()

         id                                             review
0  12311_10  Naturally in a film who's main themes are of m...
1    8348_2  This movie is a disaster within a disaster fil...
2    5828_4  All in all, this is a movie for kids. We saw i...
3    7186_2  Afraid of the Dark left me with the impression...
4   12128_7  A very accurate depiction of small time mob li...


In [6]:
#Printing shape of Data
print(train_data.shape)
print(test_data.shape)

(25000, 3)
(25000, 2)


In [7]:
# Inspect first 3 reviews
for i in range(3):
    print(train_data.review[i])
    print()

With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally star

In [9]:
# Check for any null values
print(train_data.isnull().sum())
print(test_data.isnull().sum())

id           0
sentiment    0
review       0
dtype: int64
id        0
review    0
dtype: int64


# Method for Cleaning and Formatting the Data Set

In [0]:
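def cleanText(text, remove_stopwords=True):
    '''Clean a review and return it as a single string, optionally without stopwords'''

    # Convert to lower case and split into words
    text = text.lower().split()

    # Optionally, remove stop words.
    # Matching happens before punctuation is stripped, so tokens with attached
    # punctuation or markup (e.g. "him.<br") are not recognised as stop words.
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]

    text = " ".join(text)

    # Replace HTML line breaks and every character other than a-z with a space
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z]", " ", text)
    text = re.sub(r"   ", " ", text) # Collapse runs of three spaces
    text = re.sub(r"  ", " ", text)  # Collapse runs of two spaces

    # Remove any remaining punctuation (a no-op after the a-z filter above)
    text = ''.join([c for c in text if c not in punctuation])

    # Return the cleaned text as one space-separated string
    return text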

# Clean the Training DataSet and Testing DataSet

In [0]:
#Cleaning Training DataSet using the cleanText() method
trainData_clean = []
for review in train_data.review:
    trainData_clean.append(cleanText(review))

In [14]:
# Inspect the top 3 cleaned Train Dataset
for i in range(3):
    print(trainData_clean[i])
    print()

stuff going moment mj i ve started listening music watching odd documentary there watched wiz watched moonwalker again maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj s feeling towards press also obvious message drugs bad m kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice him the actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond me mj overheard plans nah joe pesci s character ranted wanted people know supplying drugs etc dunno maybe hates mj s music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad seq

In [0]:
#Cleaning Test DataSet using the cleanText() method
testData_clean = []
for review in test_data.review:
    testData_clean.append(cleanText(review))

In [17]:
# Inspect the top 3 cleaned Test Dataset
for i in range(3):
    print(testData_clean[i])
    print()
    
print(len(trainData_clean))
print(len(testData_clean))

naturally film who s main themes mortality nostalgia loss innocence perhaps surprising rated highly older viewers younger ones however craftsmanship completeness film anyone enjoy pace steady constant characters full engaging relationships interactions natural showing need floods tears show emotion screams show fear shouting show dispute violence show anger naturally joyce s short story lends film ready made structure perfect polished diamond small changes huston makes inclusion poem fit neatly truly masterpiece tact subtlety overwhelming beauty 

movie disaster within disaster film full great action scenes meaningful throw away sense reality let s see word wise lava burns you steam burns you can t stand next lava diverting minor lava flow difficult let alone significant one scares think might actually believe saw movie even worse significant amount talent went making film mean acting actually good effects average hard believe somebody read scripts allowed talent wasted guess suggestio

In [18]:
# Combine All review Data (Cleaned)
combined_all_cleaned_reviews = trainData_clean + testData_clean
# Tokenize the reviews
tokenizer = Tokenizer()
tokenizer.fit_on_texts(combined_all_cleaned_reviews)
print("Fitting is complete.")

trainData_seq = tokenizer.texts_to_sequences(trainData_clean)
print("Text to Sequence conversion is complete for training data.")

testData_seq = tokenizer.texts_to_sequences(testData_clean)
print("Text to Sequence conversion is complete for test data")

Fitting is complete.
Text to Sequence conversion is complete for training data.
Text to Sequence conversion is complete for test data


In [19]:
# Find the number of unique tokens
word_index = tokenizer.word_index
print("Words in index: %d" % len(word_index))

Words in index: 99425
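
As a rough illustration of what the tokenizer does with the cleaned reviews, the sketch below builds a frequency-ranked word index in plain Python and maps each review to a list of integers. It is a simplified stand-in, not the Keras implementation; the function names and the toy reviews are made up for this example. The indices start at 1, which leaves 0 free for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending frequency; index 0 is left unused so
    # that it can serve as the padding value later on.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

# Hypothetical, already-cleaned reviews
toy_reviews = ["great movie great cast", "boring movie"]
toy_index = build_word_index(toy_reviews)
toy_sequences = texts_to_sequences(toy_reviews, toy_index)
print(toy_index)
print(toy_sequences)
```

Because the highest index equals the number of distinct words, an embedding table indexed by these integers needs one more row than there are words.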


In [20]:
# Inspect the top 3 training reviews (cleaned) after they have been tokenized
for i in range(3):
    print(trainData_seq[i])
    print()

[437, 81, 481, 10863, 6, 71, 573, 2590, 115, 65, 948, 551, 51, 207, 24383, 207, 17034, 213, 188, 92, 19, 684, 2550, 118, 104, 14, 511, 3933, 188, 25, 240, 644, 2336, 1251, 17034, 85, 4772, 85, 701, 3, 298, 81, 15, 351, 1827, 533, 1209, 3566, 10863, 1, 477, 861, 3526, 22, 517, 662, 1403, 18, 60, 5290, 2073, 1109, 180, 406, 1512, 807, 2559, 5, 10863, 469, 81, 655, 80, 265, 109, 569, 10863, 33435, 29469, 141, 2, 10863, 374, 12, 57, 24, 374, 205, 14, 246, 173, 9, 740, 701, 3, 135, 334, 456, 138, 16333, 4102, 1702, 626, 865, 10467, 1009, 11946, 880, 1058, 1640, 408, 10863, 258, 18, 584, 134, 10863, 17899, 2288, 15681, 865, 10467, 1, 32, 36026, 381, 20, 46, 17447, 1403, 426, 9779, 188, 4220, 10863, 1, 115, 658, 511, 91, 5, 10863, 1541, 436, 2257, 131, 2124, 2367, 626, 22, 67, 112, 4731, 5337, 300, 1313, 29470, 18, 626, 547, 878, 655, 685, 4, 444, 190, 537, 131, 677, 3371, 1224, 779, 54, 1229, 260, 2, 20, 5, 10863, 4, 570, 73, 468, 30, 20, 239, 695, 151, 269, 108, 7617, 662, 3514, 10863, 1, 3

In [21]:
# Inspect Top 3 Test reviews(Cleaned) after they have been tokenized
for i in range(3):
    print(testData_seq[i])
    print()

[1840, 3, 243, 1, 196, 1331, 16836, 4319, 1853, 2898, 302, 1675, 1116, 462, 885, 723, 1041, 587, 102, 13587, 34637, 3, 155, 271, 975, 5520, 1809, 29, 279, 1756, 1428, 5285, 1234, 681, 280, 19332, 1650, 43, 1345, 3849, 43, 1040, 5133, 43, 13113, 489, 43, 2659, 1840, 6978, 1, 250, 13, 6932, 3, 1520, 24, 2400, 320, 5213, 3934, 316, 1267, 4449, 78, 6740, 5427, 1039, 6406, 287, 832, 17683, 3944, 4376, 827]

[2, 1494, 666, 1494, 3, 279, 21, 114, 58, 3161, 1283, 151, 198, 531, 190, 1, 15, 576, 1529, 9158, 3051, 168, 6335, 3051, 168, 87, 49, 698, 284, 9158, 18445, 1302, 9158, 2719, 798, 190, 537, 2733, 4, 2668, 30, 140, 75, 175, 120, 2, 11, 349, 2733, 1043, 549, 333, 141, 3, 297, 42, 75, 8, 202, 763, 160, 175, 1693, 254, 3020, 1507, 549, 916, 387, 5937, 12, 2, 292, 145, 79, 151, 5, 988, 3561, 290, 46, 479, 31, 79, 151, 1079, 10, 3161, 1399]

[76, 2, 264, 120, 4212, 423, 354, 7, 4, 128, 444, 1, 2343, 21, 1165, 1111, 102, 21, 242, 116, 116, 41843, 1, 1131, 1209, 2380, 10273, 1643, 435, 11326, 39

In [0]:
# Find the length of reviews ( Train and Test Dataset)
lengths = []
for review in trainData_seq:
    lengths.append(len(review))

for review in testData_seq:
    lengths.append(len(review))

# Create a dataframe so that the values can be inspected
lengths = pd.DataFrame(lengths, columns=['counts'])

In [0]:
lengths.counts.describe()

count    50000.000000
mean       132.337460
std         99.452039
min          3.000000
25%         71.000000
50%         98.000000
75%        161.000000
max       1504.000000
Name: counts, dtype: float64

In [24]:
# Percentile functions 
#To determine the maximum length of review text to be considered 
print(np.percentile(lengths.counts, 80))
print(np.percentile(lengths.counts, 85))
print(np.percentile(lengths.counts, 90))
print(np.percentile(lengths.counts, 95))

178.0
208.0
253.0
332.0
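
The percentiles answer the question "how long must the cut-off be to keep N% of the reviews intact?". The reverse question, "what share of reviews fits into a given cut-off?", can be sketched as follows. The helper name and the toy lengths are assumptions for illustration only and are not taken from the dataset.

```python
def share_within(lengths, max_len):
    """Fraction of sequences whose length does not exceed max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Hypothetical token counts for ten reviews (not taken from the dataset)
toy_lengths = [40, 75, 98, 120, 161, 180, 199, 230, 332, 1504]
for candidate in (100, 200, 300):
    print(candidate, share_within(toy_lengths, candidate))
```

On the real data the same function could be called with `lengths.counts` and a candidate cut-off.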


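**To train the model faster, we limit every review to a maximum length of 200 tokens, which lies between the 80th (178) and 85th (208) percentiles of review length. Reviews with more than 200 tokens are truncated, and reviews with fewer than 200 tokens are padded with the padding token (0) until they reach a length of 200. With its default settings, `pad_sequences` truncates and pads at the start of each sequence.**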
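
The effect of padding and truncating can be sketched in plain Python. The function below is a hypothetical helper written for this illustration; it mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`) for a single sequence and is not the Keras implementation.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Return a list of exactly max_len items, keeping the end of the sequence."""
    if len(sequence) >= max_len:
        # Too long: drop tokens from the front, keep the last max_len tokens
        start = len(sequence) - max_len
        return list(sequence[start:])
    # Too short: prepend padding tokens
    return [pad_value] * (max_len - len(sequence)) + list(sequence)

short_review = pad_or_truncate([5, 8, 2], 5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)
print(short_review)
print(long_review)
```

Keeping the end of a long review means the network always sees the reviewer's closing words, at the cost of discarding the opening.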

In [25]:
# Pad and truncate the review so that they all have the same length.
max_review_length = 200

trainData_pad = pad_sequences(trainData_seq, maxlen = max_review_length)
print("Padding of Training Data is complete.")

testData_pad = pad_sequences(testData_seq, maxlen = max_review_length)
print("Padding of Test Data is complete.")

Padding of Training Data is complete.
Padding of Test Data is complete.


In [26]:
# Inspect the Top 3 Train reviews after padding has been completed. 
for i in range(3):
    print(trainData_pad[i,:100])
    print()

[   85  4772    85   701     3   298    81    15   351  1827   533  1209
  3566 10863     1   477   861  3526    22   517   662  1403    18    60
  5290  2073  1109   180   406  1512   807  2559     5 10863   469    81
   655    80   265   109   569 10863 33435 29469   141     2 10863   374
    12    57    24   374   205    14   246   173     9   740   701     3
   135   334   456   138 16333  4102  1702   626   865 10467  1009 11946
   880  1058  1640   408 10863   258    18   584   134 10863 17899  2288
 15681   865 10467     1    32 36026   381    20    46 17447  1403   426
  9779   188  4220 10863]

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[ 6288  4709   308  9361  8885  3623  1933  7431  2248 11138  1524  2479
  2087  2931  3016   949  1709  1242  1513 33436  1851  1594  2479  1395
  9576 36028  4020  1686 50269   476  

In [27]:
# Inspect the Top 3 Test reviews after padding has been completed. 
for i in range(3):
    print(testData_pad[i,:100])
    print()

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]



In [28]:
# Creating the training and validation sets
x_train, x_valid, y_train, y_valid = train_test_split(trainData_pad, train_data.sentiment, test_size = 0.15, random_state = 2)
x_test = testData_pad
print(testData_pad.shape)

(25000, 200)


In [29]:
# Inspect the shape of the data
print(x_train.shape)
print(x_valid.shape)
print(x_test.shape)

(21250, 200)
(3750, 200)
(25000, 200)


# Build and Train the Model

In [0]:
def get_batches(x, y, batch_size):
    '''Create the batches for the training and validation data'''
    # Work on NumPy arrays so that the labels can later be reshaped with y[:, None]
    x, y = np.asarray(x), np.asarray(y)
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

In [0]:
def get_test_batches(x, batch_size):
    '''Create the batches for the testing data'''
    n_batches = len(x)//batch_size
    x = x[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size]
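
Both generators only yield full batches, because the initial RNN state is created for a fixed batch size. Any examples left over after the last full batch are silently dropped. A minimal sketch of that behaviour on a plain list (the function name and data are made up for illustration):

```python
def full_batches(items, batch_size):
    """Yield consecutive slices of exactly batch_size items, dropping any remainder."""
    usable = (len(items) // batch_size) * batch_size
    for start in range(0, usable, batch_size):
        yield items[start:start + batch_size]

# Ten items with a batch size of four: the last two items are never yielded
batches = list(full_batches(list(range(10)), 4))
print(batches)
```

With the data set sizes in this notebook (21250, 3750 and 25000 rows) and a batch size of 250, every row is used because each size is a multiple of 250.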

In [0]:
def lstm_cell(lstm_size, keep_prob):
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    return drop

In [0]:
def build_rnn(n_words, embed_size, batch_size, lstm_size, num_layers, 
              dropout, learning_rate, multiple_fc, fc_units):
    '''Build the Recurrent Neural Network'''

    tf.reset_default_graph()

    # Declare placeholders we'll feed into the graph
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')

    with tf.name_scope('labels'):
        labels = tf.placeholder(tf.int32, [None, None], name='labels')

    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Create the embeddings
    with tf.name_scope("embeddings"):
        embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs)

    # Build the RNN layers
    with tf.name_scope("RNN_layers"):
        cell = tf.contrib.rnn.MultiRNNCell([lstm_cell(lstm_size, keep_prob) for _ in range(num_layers)])

    # Set the initial state
    with tf.name_scope("RNN_init_state"):
        initial_state = cell.zero_state(batch_size, tf.float32)

    # Run the data through the RNN layers
    with tf.name_scope("RNN_forward"):
        outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                                 initial_state=initial_state)    
    
    # Create the fully connected layers
    with tf.name_scope("fully_connected"):
        
        # Initialize the weights and biases
        weights = tf.truncated_normal_initializer(stddev=0.1)
        biases = tf.zeros_initializer()
        
        dense = tf.contrib.layers.fully_connected(outputs[:, -1],
                                                  num_outputs = fc_units,
                                                  activation_fn = tf.sigmoid,
                                                  weights_initializer = weights,
                                                  biases_initializer = biases)
        dense = tf.contrib.layers.dropout(dense, keep_prob)
        
        # Optionally add a second fully connected layer
        if multiple_fc:
            dense = tf.contrib.layers.fully_connected(dense,
                                                      num_outputs = fc_units,
                                                      activation_fn = tf.sigmoid,
                                                      weights_initializer = weights,
                                                      biases_initializer = biases)
            dense = tf.contrib.layers.dropout(dense, keep_prob)
    
    # Make the predictions
    with tf.name_scope('predictions'):
        predictions = tf.contrib.layers.fully_connected(dense, 
                                                        num_outputs = 1, 
                                                        activation_fn=tf.sigmoid,
                                                        weights_initializer = weights,
                                                        biases_initializer = biases)
        tf.summary.histogram('predictions', predictions)
    
    # Calculate the cost
    with tf.name_scope('cost'):
        cost = tf.losses.mean_squared_error(labels, predictions)
        tf.summary.scalar('cost', cost)
    
    # Train the model
    with tf.name_scope('train'):    
        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Determine the accuracy
    with tf.name_scope("accuracy"):
        correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        tf.summary.scalar('accuracy', accuracy)
    
    # Merge all of the summaries
    merged = tf.summary.merge_all()    

    # Export the nodes 
    export_nodes = ['inputs', 'labels', 'keep_prob', 'initial_state', 'final_state','accuracy',
                    'predictions', 'cost', 'optimizer', 'merged']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])
    
    return graph

In [0]:
def train(model, epochs, log_string):
    '''Train the RNN'''

    saver = tf.train.Saver()
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # Used to determine when to stop the training early
        valid_loss_summary = []
        stop_early = 0
        
        # Keep track of which batch iteration is being trained
        iteration = 0

        print()
        print("Training Model: {}".format(log_string))

        train_writer = tf.summary.FileWriter('./logs/3/train/{}'.format(log_string), sess.graph)
        valid_writer = tf.summary.FileWriter('./logs/3/valid/{}'.format(log_string))
        
        # Create the checkpoint directory (no error if it already exists)
        vmodel_dir= "./Review/Model"
        tf.gfile.MakeDirs(vmodel_dir)

        for e in range(epochs):
            state = sess.run(model.initial_state)
            
            # Record progress with each epoch
            train_loss = []
            train_acc = []
            val_acc = []
            val_loss = []

            with tqdm(total=len(x_train)) as pbar:
                for _, (x, y) in enumerate(get_batches(x_train, y_train, batch_size), 1):
                    feed = {model.inputs: x,
                            model.labels: y[:, None],
                            model.keep_prob: dropout,
                            model.initial_state: state}
                    summary, loss, acc, state, _ = sess.run([model.merged, 
                                                             model.cost, 
                                                             model.accuracy, 
                                                             model.final_state, 
                                                             model.optimizer], 
                                                            feed_dict=feed)                
                    
                    # Record the loss and accuracy of each training batch
                    train_loss.append(loss)
                    train_acc.append(acc)
                    
                    # Record the progress of training
                    train_writer.add_summary(summary, iteration)
                    
                    iteration += 1
                    pbar.update(batch_size)
            
            # Average the training loss and accuracy of each epoch
            avg_train_loss = np.mean(train_loss)
            avg_train_acc = np.mean(train_acc) 

            val_state = sess.run(model.initial_state)
            with tqdm(total=len(x_valid)) as pbar:
                for x, y in get_batches(x_valid, y_valid, batch_size):
                    feed = {model.inputs: x,
                            model.labels: y[:, None],
                            model.keep_prob: 1,
                            model.initial_state: val_state}
                    summary, batch_loss, batch_acc, val_state = sess.run([model.merged, 
                                                                          model.cost, 
                                                                          model.accuracy, 
                                                                          model.final_state], 
                                                                         feed_dict=feed)
                    
                    # Record the validation loss and accuracy of each epoch
                    val_loss.append(batch_loss)
                    val_acc.append(batch_acc)
                    pbar.update(batch_size)
            
            # Average the validation loss and accuracy of each epoch
            avg_valid_loss = np.mean(val_loss)    
            avg_valid_acc = np.mean(val_acc)
            valid_loss_summary.append(avg_valid_loss)
            
            # Record the validation data's progress
            valid_writer.add_summary(summary, iteration)

            # Print the progress of each epoch
            print("Epoch: {}/{}".format(e, epochs),
                  "Train Loss: {:.3f}".format(avg_train_loss),
                  "Train Acc: {:.3f}".format(avg_train_acc),
                  "Valid Loss: {:.3f}".format(avg_valid_loss),
                  "Valid Acc: {:.3f}".format(avg_valid_acc))

            # Stop training if the validation loss does not decrease after 3 epochs
            if avg_valid_loss > min(valid_loss_summary):
                print("No Improvement.")
                stop_early += 1
                if stop_early == 3:
                    break   
            
            # Reset stop_early if the validation loss finds a new low
            # Save a checkpoint of the model
            else:
                print("New Record!")
                stop_early = 0
                checkpoint = "./Review/Model/sentiment_{}.ckpt".format(log_string)
                saver.save(sess, checkpoint)
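
The early-stopping rule used in `train()` can be isolated from the TensorFlow code. The sketch below replays it on a list of validation losses; the function name, the `patience` parameter and the loss values are assumptions made for this illustration.

```python
def epochs_until_stop(valid_losses, patience=3):
    """Return how many epochs run before early stopping triggers.

    An epoch counts as a new record when its validation loss is not higher
    than the lowest loss seen so far; `patience` consecutive epochs without
    a new record end the training.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss <= best:
            # New record: this is where a checkpoint would be saved
            best, stale = loss, 0
        else:
            stale += 1
            if stale == patience:
                return epoch
    return len(valid_losses)

# Hypothetical validation losses: the record at epoch 2 is never beaten in time
stopped_at = epochs_until_stop([0.20, 0.18, 0.19, 0.21, 0.185, 0.17])
print(stopped_at)
```

Since a checkpoint is written only on a new record, the saved model corresponds to the best validation loss rather than to the last epoch.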

In [36]:
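# The default parameters of the model
# Tokenizer indices run from 1 to len(word_index) (0 is the padding value),
# so the embedding matrix needs len(word_index) + 1 rows.
n_words = len(word_index) + 1
print(len(word_index))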
embed_size = 300
batch_size = 250
lstm_size = 128
num_layers = 2
dropout = 0.5
learning_rate = 0.001
epochs = 13 #100
multiple_fc = False
fc_units = 256


99425


In [0]:
# Train the model with the desired tuning parameters
for lstm_size in [64,128]:
    for multiple_fc in [True, False]:
        for fc_units in [128, 256]:
            log_string = 'ru={},fcl={},fcu={}'.format(lstm_size,
                                                      multiple_fc,
                                                      fc_units)
            model = build_rnn(n_words = n_words, 
                              embed_size = embed_size,
                              batch_size = batch_size,
                              lstm_size = lstm_size,
                              num_layers = num_layers,
                              dropout = dropout,
                              learning_rate = learning_rate,
                              multiple_fc = multiple_fc,
                              fc_units = fc_units)            
            train(model, epochs, log_string)


Training Model: ru=64,fcl=True,fcu=128


100%|██████████| 21250/21250 [02:00<00:00, 178.95it/s]
100%|██████████| 3750/3750 [00:06<00:00, 557.72it/s]


Epoch: 0/1 Train Loss: 0.245 Train Acc: 0.569 Valid Loss: 0.190 Valid Acc: 0.732
New Record!




Training Model: ru=64,fcl=True,fcu=256


100%|██████████| 21250/21250 [01:59<00:00, 177.79it/s]
100%|██████████| 3750/3750 [00:06<00:00, 552.41it/s]


Epoch: 0/1 Train Loss: 0.229 Train Acc: 0.626 Valid Loss: 0.162 Valid Acc: 0.779
New Record!




Training Model: ru=64,fcl=False,fcu=128


100%|██████████| 21250/21250 [01:59<00:00, 177.46it/s]
100%|██████████| 3750/3750 [00:06<00:00, 556.06it/s]


Epoch: 0/1 Train Loss: 0.225 Train Acc: 0.636 Valid Loss: 0.190 Valid Acc: 0.726
New Record!




Training Model: ru=64,fcl=False,fcu=256


100%|██████████| 21250/21250 [01:58<00:00, 179.10it/s]
100%|██████████| 3750/3750 [00:06<00:00, 554.79it/s]


Epoch: 0/1 Train Loss: 0.230 Train Acc: 0.638 Valid Loss: 0.171 Valid Acc: 0.762
New Record!




Training Model: ru=128,fcl=True,fcu=128


100%|██████████| 21250/21250 [03:31<00:00, 101.18it/s]
100%|██████████| 3750/3750 [00:11<00:00, 324.73it/s]


Epoch: 0/1 Train Loss: 0.224 Train Acc: 0.637 Valid Loss: 0.224 Valid Acc: 0.647
New Record!




Training Model: ru=128,fcl=True,fcu=256


100%|██████████| 21250/21250 [03:30<00:00, 100.42it/s]
100%|██████████| 3750/3750 [00:11<00:00, 326.00it/s]


Epoch: 0/1 Train Loss: 0.218 Train Acc: 0.656 Valid Loss: 0.145 Valid Acc: 0.803
New Record!




Training Model: ru=128,fcl=False,fcu=128


100%|██████████| 21250/21250 [03:28<00:00, 101.55it/s]
100%|██████████| 3750/3750 [00:11<00:00, 323.82it/s]


Epoch: 0/1 Train Loss: 0.200 Train Acc: 0.688 Valid Loss: 0.138 Valid Acc: 0.807
New Record!




Training Model: ru=128,fcl=False,fcu=256


100%|██████████| 21250/21250 [03:29<00:00, 101.47it/s]
100%|██████████| 3750/3750 [00:11<00:00, 324.06it/s]


Epoch: 0/1 Train Loss: 0.205 Train Acc: 0.689 Valid Loss: 0.155 Valid Acc: 0.788
New Record!


# Make the Predictions

In [0]:
def make_predictions(lstm_size, multiple_fc, fc_units, checkpoint):
    '''Predict the sentiment of the testing data'''
    
    # Record all of the predictions
    all_preds = []

    model = build_rnn(n_words = n_words, 
                      embed_size = embed_size,
                      batch_size = batch_size,
                      lstm_size = lstm_size,
                      num_layers = num_layers,
                      dropout = dropout,
                      learning_rate = learning_rate,
                      multiple_fc = multiple_fc,
                      fc_units = fc_units) 
    
    with tf.Session() as sess:
        saver = tf.train.Saver()
        # Load the model
        saver.restore(sess, checkpoint)
        test_state = sess.run(model.initial_state)
        for x in get_test_batches(x_test, batch_size):
            feed = {model.inputs: x,
                    model.keep_prob: 1,
                    model.initial_state: test_state}
            predictions = sess.run(model.predictions, feed_dict=feed)

            # predictions has shape (batch_size, 1); store one float per review
            for pred in predictions:
                all_preds.append(float(pred[0]))

    return all_preds

I compare the three models that performed best on the validation data and then average their predictions, which should produce a better set of predictions than any single model.

In [0]:
checkpoint1 = "./Review/Model/sentiment_ru=128,fcl=False,fcu=256.ckpt"
checkpoint2 = "./Review/Model/sentiment_ru=128,fcl=False,fcu=128.ckpt"
checkpoint3 = "./Review/Model/sentiment_ru=64,fcl=True,fcu=256.ckpt"

In [0]:
# Make predictions using the best 3 models
predictions1 = make_predictions(128, False, 256, checkpoint1)
predictions2 = make_predictions(128, False, 128, checkpoint2)
predictions3 = make_predictions(64, True, 256, checkpoint3)

In [0]:
# Average the best three predictions
predictions_combined = (pd.DataFrame(predictions1) + pd.DataFrame(predictions2) + pd.DataFrame(predictions3))/3
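
The averaging step amounts to an element-wise mean over the per-review probabilities of the three models. A minimal sketch on plain lists, where the function name, the probabilities and the 0.5 threshold are all assumptions for illustration:

```python
def average_predictions(*model_outputs):
    """Element-wise mean of several equally long lists of probabilities."""
    if len({len(p) for p in model_outputs}) != 1:
        raise ValueError("all models must score the same number of reviews")
    return [sum(scores) / len(scores) for scores in zip(*model_outputs)]

# Hypothetical probabilities from three models for four reviews
p1 = [0.9, 0.2, 0.6, 0.4]
p2 = [0.8, 0.1, 0.4, 0.7]
p3 = [1.0, 0.3, 0.2, 0.7]
combined = average_predictions(p1, p2, p3)

# Turn the averaged probabilities into 0/1 sentiment labels
labels = [1 if p >= 0.5 else 0 for p in combined]
print(labels)
```

In the third and fourth toy reviews the models disagree, and the average settles the label in favour of the majority, which is the effect the ensemble relies on.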

The results of the predictions are as follows:
- Predictions1: 0.919
- Predictions2: 0.914
- Predictions3: 0.916
- Combined Predictions: 0.935