# Long Short-Term Memory

Homework assignments are mandatory. To be granted 8 ECTS and your grade, you must pass 9 out of 10 homeworks. Upload your solution as a zip archive by Saturday, 6th January, 23:59 to the public Homework Submissions folder on studip. The archive should contain your solution as an IPython notebook (.ipynb) and as an HTML export, as well as your TensorBoard summary file. Name the archive file <your group id > longshort-term-memory.zip.

Further, you have to correct another group’s homework by Monday, 8th January, 23:59. Please follow the instructions in the rating guidelines on how to do your rating.

If you encounter problems, please do not hesitate to send your question or concern to lbraun@uos.de. Please do not forget to include [TF] in the subject line.

# 1 Introduction

In this week’s task, we are going to implement a network that performs sentiment analysis on written movie reviews. The network will map sequences of words onto binary decisions: good rating vs. bad rating.

# 2 LSTM recap

Read the excellent article by Christopher Olah about Understanding LSTMs.

# 3 Data

We are going to train the network on movie ratings, which were taken from the Internet Movie Database. Download the Large Movie Review Dataset v1.0 from the dataset’s homepage and unzip the content into your working directory. In order to save memory you can delete all files that end in .feat and the subfolder /train/unsup/.
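The cleanup step can also be scripted. The following is a minimal sketch, assuming the archive was unpacked into a folder named `aclImdb` (the name the helper class below expects); `prune_dataset` is a hypothetical helper name, not part of the provided material.

```python
import shutil
from pathlib import Path


def prune_dataset(root):
    """Delete *.feat files and the train/unsup folder below `root`.

    Returns the number of .feat files that were removed.
    """
    root = Path(root)
    removed = 0
    # rglob walks the whole tree; materialise the list before deleting
    for feat_file in list(root.rglob("*.feat")):
        feat_file.unlink()
        removed += 1
    # ignore_errors turns the call into a no-op if the folder is already gone
    shutil.rmtree(root / "train" / "unsup", ignore_errors=True)
    return removed
```

Calling `prune_dataset("aclImdb")` from the working directory would then leave only the labeled review texts in place.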

# 4 Data preparation

You can use the helper class (06 imdb-helper.py) to read in and iterate over the data. The script creates tokenized versions of the movie reviews. Further, the helper class has a method create_dictionaries that creates dictionaries mapping words to unique ids and ids back to words. The method introduces two more labels, one for rare words and one which is used to fill up sequences to create batches with samples of equal length.
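To make the id mapping concrete, here is a small standalone sketch of the same idea on toy data. The function name `build_dictionaries` and the two constants are hypothetical; the label strings and the reserved ids 0 and 1 follow the helper class.

```python
from collections import Counter

UNKNOWN_ID = 0   # id shared by all rare / out-of-vocabulary words
PADDING_ID = 1   # id used to fill up short sequences


def build_dictionaries(texts, vocabulary_size):
    """Map the most frequent words to ids 2, 3, ... and back."""
    counts = Counter(word for text in texts for word in text)
    # Two ids are reserved, so only vocabulary_size - 2 real words fit
    most_common = counts.most_common(vocabulary_size - 2)
    word2id = {word: i for i, (word, _) in enumerate(most_common, 2)}
    word2id["_UNKNOWN_"] = UNKNOWN_ID
    word2id["_NOT_A_WORD_"] = PADDING_ID
    id2word = {i: word for word, i in word2id.items()}
    return word2id, id2word


texts = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "movie"]]
word2id, id2word = build_dictionaries(texts, vocabulary_size=4)
encoded = [word2id.get(word, UNKNOWN_ID) for word in ["a", "good", "movie"]]
print(encoded)  # [2, 0, 3]
```

With a vocabulary of four entries only the two most frequent words keep their own id, so "good" falls back to the id for rare words.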

Further, there is a method to iterate over the data-sets and to slice a batch into subsequences.
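Slicing a batch into subsequences can be illustrated on toy data. This is a sketch in the spirit of the helper's `slize_batch` method, with the hypothetical name `slice_batch`; it assumes the padding id 1 used by the helper class.

```python
import numpy as np


def slice_batch(batch, slice_size, padding_id=1):
    """Pad all samples to a common length and yield equally sized slices."""
    max_len = max(len(sample) for sample in batch)
    steps = -(-max_len // slice_size)          # ceiling division
    padded = np.full((len(batch), steps * slice_size), padding_id, dtype=np.int32)
    for row, sample in enumerate(batch):
        padded[row, :len(sample)] = sample
    for step in range(steps):
        yield padded[:, step * slice_size:(step + 1) * slice_size]


batch = [[5, 6, 7, 8, 9], [4, 3]]
slices = list(slice_batch(batch, slice_size=3))
print(slices[0].tolist())  # [[5, 6, 7], [4, 3, 1]]
print(slices[1].tolist())  # [[8, 9, 1], [1, 1, 1]]
```

Every slice has the same shape, which is what a placeholder with a fixed subsequence length requires.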

In [147]:
# helper class

import os
from nltk.tokenize import RegexpTokenizer
from collections import Counter
import numpy as np
import tensorflow as tf

class IMDB:
    def __init__(self, directory):
        self._directory = directory
        
        self._training_data, self._training_labels = self._load_data("train")
        self._test_data, self._test_labels = self._load_data("test")
        
        np.random.seed(0)
        samples_n = self._training_labels.shape[0]
        random_indices = np.random.choice(samples_n, samples_n // 7, replace = False)
        np.random.seed()
        
        self._validation_data = self._training_data[random_indices]
        self._validation_labels = self._training_labels[random_indices]
        self._training_data = np.delete(self._training_data, random_indices, axis = 0)
        self._training_labels = np.delete(self._training_labels, random_indices)
        
        joined_written_ratings = [word for text in self._training_data for word in text]
        print("Unique words: " + str(len(Counter(joined_written_ratings))))
        print("Mean length: " + str(np.mean([len(text) for text in self._training_data])))
    
    def _load_data(self, data_set_type):
        data = []
        labels = []
        # Iterate over conditions
        for condition in ["neg", "pos"]:
            directory_str = os.path.join(self._directory, "aclImdb", data_set_type, condition)
            directory = os.fsencode(directory_str)
        
            for file in os.listdir(directory):
                filename = os.fsdecode(file)
                
                label = 0 if condition == "neg" else 1
                labels.append(label)
                
                # Read written rating from file
                with open(os.path.join(directory_str, filename)) as fd:
                    written_rating = fd.read()
                    written_rating = written_rating.lower()
                    tokenizer = RegexpTokenizer(r'\w+')
                    written_rating = tokenizer.tokenize(written_rating)
                    data.append(written_rating)
                
        # Reviews differ in length, so store them as an object array of lists
        return np.array(data, dtype = object), np.array(labels)
    
    def create_dictionaries(self, vocabulary_size, cutoff_length):
        joined_written_ratings = [word for text in self._training_data for word in text]
        words_and_count = Counter(joined_written_ratings).most_common(vocabulary_size - 2)
        
        word2id = {word: word_id for word_id, (word, _) in enumerate(words_and_count, 2)}
        word2id["_UNKNOWN_"] = 0
        word2id["_NOT_A_WORD_"] = 1
        
        id2word = dict(zip(word2id.values(), word2id.keys()))
        
        self._word2id = word2id
        self._id2word = id2word
        
        # Sequences differ in length, so keep them in object arrays of id lists
        self._training_data = np.array([self.words2ids(text[:cutoff_length]) for text in self._training_data], dtype = object)
        self._validation_data = np.array([self.words2ids(text[:cutoff_length]) for text in self._validation_data], dtype = object)
        self._test_data = np.array([self.words2ids(text[:cutoff_length]) for text in self._test_data], dtype = object)
        
        print("Mean length: " + str(np.mean([len(text) for text in self._training_data])))
        
    
    def words2ids(self, words):
        if type(words) == list or type(words) == range or type(words) == np.ndarray:
            return [self._word2id.get(word, 0) for word in words]
        else:
            return self._word2id.get(words, 0)
    
    def ids2words(self, ids):
        if type(ids) == list or type(ids) == range or type(ids) == np.ndarray:
            return [self._id2word.get(wordid, "_UNKNOWN_") for wordid in ids]
        else:
            return self._id2word.get(ids, "_UNKNOWN_")
    
    
    def get_training_batch(self, batch_size):
        return self._get_batch(self._training_data, self._training_labels, batch_size)
    
    def get_validation_batch(self, batch_size):
        return self._get_batch(self._validation_data, self._validation_labels, batch_size)
    
    def get_test_batch(self, batch_size):
        return self._get_batch(self._test_data, self._test_labels, batch_size)
    
    def _get_batch(self, data, labels, batch_size):
        samples_n = labels.shape[0]
        if batch_size <= 0:
            batch_size = samples_n
        
        random_indices = np.random.choice(samples_n, samples_n, replace = False)
        data = data[random_indices]
        labels = labels[random_indices]        
        
        for i in range(samples_n // batch_size):
            on = i * batch_size
            off = on + batch_size
            yield data[on:off], labels[on:off]
    
    
    def slize_batch(self, batch, slize_size):
        max_len = np.max([len(sample) for sample in batch])
        steps = int(np.ceil(max_len / slize_size))
        max_len = slize_size * steps
        
        # Resize all samples in batch to same size
        batch_size = len(batch)
        buffer = np.ones((batch_size, max_len), dtype = np.int32)
        for i, sample in enumerate(batch):
            buffer[i, 0:len(sample)] = sample
        
        for i in range(steps):
            on = i * slize_size
            off = on + slize_size
            yield buffer[:, on:off]
        
    
    def get_sizes(self):
        training_samples_n = self._training_labels.shape[0]
        validation_samples_n = self._validation_labels.shape[0]
        test_samples_n = self._test_labels.shape[0]
        return training_samples_n, validation_samples_n, test_samples_n

In [148]:
# import data

imdb = IMDB(os.getcwd())

Unique words: 70316
Mean length: 241.894582108


# 7 Hyperparameters

The following hyperparameters work quite well. Feel free to play around with them and to adjust them according to your computational resources.

• Embedding and LSTM memory size: 64

• Vocabulary size: 20,000

• Review cutoff length: 300

• Subsequence length: 100

• Batch size: 250

• Epochs: 2

• Adam Optimizer

• Learning rate: 0.03

• Dropout keep probability: 0.85
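As a rough orientation, these values determine how many training steps a run takes. The sketch below assumes 21429 training samples (25000 reviews minus the validation split of one seventh) and that every batch contains at least one review reaching the cutoff length, so the subsequence count is an upper bound.

```python
import math

cutoff_length = 300
subsequence_length = 100
batch_size = 250
epochs = 2
training_samples = 21429  # assumed: training set size after the validation split

subsequences_per_batch = math.ceil(cutoff_length / subsequence_length)
batches_per_epoch = training_samples // batch_size   # incomplete last batch is dropped
steps_total = epochs * batches_per_epoch * subsequences_per_batch
print(subsequences_per_batch, batches_per_epoch, steps_total)  # 3 85 510
```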

In [138]:
embedding_size = 64
vocabulary_size = 20000
cutoff_length = 300
subsequence_length = 100
batch_size = 250
epochs = 2
learning_rate = 0.03
keep_prob = 0.85
number_neurons = 500

In [53]:
# create word ids of the data
imdb.create_dictionaries(vocabulary_size, cutoff_length)

Mean length: 194.473423865


In [69]:
# tests

print("Mean length: " + str(np.mean([len(text) for text in imdb._training_data])))
print("Shape: ", imdb._training_data.shape)

ids = imdb.words2ids(["one", "two", "three", "four", "five", "god", "christ", "nonsensehere"])
print(ids)

Mean length: 194.473423865
Shape:  (21429,)
[31, 107, 286, 690, 683, 509, 2570, 0]


# 5 Network

Implement the following network with the help of TensorFlow

and train the network on subsequences of the movie ratings. Follow the example from the lecture slides in order to reset or keep the hidden and cell state of the LSTM.
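The idea of resetting the state at the start of a batch and keeping it between subsequences can be checked without TensorFlow. The sketch below uses a plain tanh RNN in NumPy as a stand-in for the LSTM (an assumption made for brevity, all names are hypothetical): feeding the final state of one slice into the next gives the same result as processing the whole sequence at once.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, state_size = 4, 3
W = rng.normal(size=(input_size, state_size))   # input weights
U = rng.normal(size=(state_size, state_size))   # recurrent weights


def run_rnn(inputs, state):
    """Unroll a plain tanh RNN over the time axis (axis 1)."""
    for t in range(inputs.shape[1]):
        state = np.tanh(inputs[:, t] @ W + state @ U)
    return state


batch = rng.normal(size=(2, 6, input_size))      # 2 samples, 6 time steps
zero_state = np.zeros((2, state_size))

# Whole sequence at once
full = run_rnn(batch, zero_state)

# Same sequence in subsequences of length 2
state = zero_state                               # reset at the start of a new batch
for start in range(0, 6, 2):
    state = run_rnn(batch[:, start:start + 2], state)   # keep between slices

print(np.allclose(full, state))  # True
```

In the TensorFlow graph the same hand-over happens through placeholders: the state fetched from one `session.run` call is fed back in as the initial state of the next one.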

In [171]:
tf.reset_default_graph()

# network-input
words = tf.placeholder(tf.int32, [batch_size, subsequence_length])


# word-embedding

with tf.variable_scope("embedding"):
    # Create a word-embedding of size vocabulary size x embedding size
    initializer = tf.random_uniform_initializer(-1.0, 1.0)
    embeddings = tf.get_variable("embeddings", [vocabulary_size, embedding_size], initializer = initializer)

    # Given a tensor of word ids, retrieve the respective embedding
    embed = tf.nn.embedding_lookup(embeddings, words)

In [145]:
# tests
print(words)
print(embeddings)
print(embed)

Tensor("Placeholder:0", shape=(250, 100), dtype=int32)
<tf.Variable 'embedding/embeddings:0' shape=(20000, 64) dtype=float32_ref>
Tensor("embedding/embedding_lookup:0", shape=(250, 100, 64), dtype=float32)


# 6 Optional

Wrap your embeddings with a dropout layer and add a dropout wrapper to your recurrent cell. Adjust your dropout rate according to the data-set that you feed into your network.
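What a dropout layer does with a keep probability can be sketched in NumPy. This is an illustration of inverted dropout under the assumption that kept units are rescaled by 1 / keep_prob; the function name `dropout` is hypothetical. A keep probability of 1.0 would be fed for the validation and test sets.

```python
import numpy as np


def dropout(values, keep_prob, rng):
    """Inverted dropout: zero units with probability 1 - keep_prob, rescale the rest."""
    if keep_prob >= 1.0:
        return values                       # evaluation: pass through unchanged
    mask = rng.random(values.shape) < keep_prob
    return values * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((1000, 64))
train_out = dropout(activations, keep_prob=0.85, rng=rng)   # training data
eval_out = dropout(activations, keep_prob=1.0, rng=rng)     # validation / test data
```

Because of the rescaling, the expected activation stays the same during training and evaluation.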

Use a learning rate scheduling procedure to increase the performance of your network.
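One common scheduling procedure is exponential decay. The sketch below shows the formula in plain Python; the decay parameters are made-up values for illustration, not a recommendation from the assignment.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate):
    """Learning rate after `step` updates, decayed smoothly."""
    return initial_rate * decay_rate ** (step / decay_steps)


for step in (0, 100, 200):
    print(step, exponential_decay(0.03, step, decay_steps=100, decay_rate=0.5))
```

With these values the learning rate halves every 100 steps. In TensorFlow 1.x, `tf.train.exponential_decay` implements the same formula on a global step tensor.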

In [172]:
# LSTM

keep_probability = tf.placeholder(tf.float32)
cell_state = tf.placeholder(tf.float32, shape=[batch_size, embedding_size])
hidden_state = tf.placeholder(tf.float32, shape=[batch_size, embedding_size])

with tf.variable_scope("lstm"):
    # Create LSTM cell instance
    cell = tf.nn.rnn_cell.BasicLSTMCell(embedding_size)
    
    # Add dropout to the cell outputs; use the placeholder so that the keep probability can differ between training and evaluation
    cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob = keep_probability)
    
    # Create zero state and initialize hidden and cell state from placeholder values
    zero_state = cell.zero_state(batch_size, tf.float32)
    state = tf.nn.rnn_cell.LSTMStateTuple(c = cell_state, h = hidden_state)
    
    # tf.nn.static_rnn expects list of time step vectors
    sequences = tf.unstack(embed, num = subsequence_length, axis = 1)
    
    # Unroll the model, returns list of outputs and final cell state
    outputs, state = tf.nn.static_rnn(cell, sequences, initial_state = state)
    
    # Recreate tensor from list
    outputs = tf.reshape(tf.concat(outputs, 1), [batch_size, subsequence_length, embedding_size])

# mean
mean_outputs = tf.reduce_mean(outputs, 2)

# tests
print(outputs)
print(mean_outputs)

with tf.variable_scope("ffnn"):
    # Create weights and biases for FFNN
    initializer = tf.truncated_normal_initializer(stddev = 0.1)
    ffnn_weights = tf.get_variable("weights", [subsequence_length, 1], tf.float32, initializer)
    ffnn_biases = tf.get_variable("biases", [1], initializer = tf.zeros_initializer())
    
    # Calculate the drive of FFNN
    drive = tf.matmul(mean_outputs, ffnn_weights) + ffnn_biases

    print(drive)

Tensor("lstm/Reshape:0", shape=(250, 100, 64), dtype=float32)
Tensor("Mean:0", shape=(250, 100), dtype=float32)
Tensor("ffnn/add:0", shape=(250, 1), dtype=float32)


# 7.1 Plotting

Create a summary node for the sigmoid cross entropy and for the accuracy of the network and use TensorBoard in order to monitor the training process.
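For reference, the two monitored quantities can be written down in NumPy. This is a sketch on made-up logits and labels; the loss uses the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)), and the function names are hypothetical.

```python
import numpy as np


def sigmoid_cross_entropy(logits, labels):
    """Numerically stable sigmoid cross entropy, element-wise."""
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))


def accuracy(logits, labels):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = (logits >= 0).astype(np.float32)   # sigmoid(x) >= 0.5  <=>  x >= 0
    return float(np.mean(predictions == labels))


logits = np.array([[2.0], [-1.0], [0.5], [-3.0]])
labels = np.array([[1.0], [0.0], [0.0], [0.0]])
loss = float(np.mean(sigmoid_cross_entropy(logits, labels)))
print(loss, accuracy(logits, labels))
```

Note that a scalar summary needs a single number, so the per-sample losses are averaged before they are logged.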

In [173]:
# Calculate loss, based on outputs

train_labels = tf.placeholder(tf.float32, shape=[batch_size, 1])

with tf.variable_scope("loss"):
    # calculate sigmoid loss
    sigmoid_cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=train_labels, logits=drive)
    mean_sigmoid_cross_entropy = tf.reduce_mean(sigmoid_cross_entropy)
    
    # add summary node of loss
    tf.summary.scalar("sigmoid_cross_entropy", mean_sigmoid_cross_entropy)
    
    # calculate accuracy
    correct_prediction = tf.equal(tf.cast(tf.greater_equal(tf.nn.sigmoid(drive), 0.5), tf.float32), train_labels)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # add summary node for accuracy
    tf.summary.scalar("accuracy", accuracy)
    
with tf.variable_scope("optimizer"):
    training_step = tf.train.AdamOptimizer(learning_rate).minimize(mean_sigmoid_cross_entropy)

merged_summaries = tf.summary.merge_all()

# summary file writer
train_writer = tf.summary.FileWriter("./summaries/train", tf.get_default_graph())

In [179]:
# Graph Evaluation

# create new session
with tf.Session() as session:

    # step counter
    step = 0

    # initialize variables
    session.run(tf.global_variables_initializer())

    print('Start Training')

    for epoch in range(epochs):

        # print epoch number
        print('Epoch:', epoch + 1)

        for data, label in imdb.get_training_batch(batch_size):
            # Labels arrive with shape [batch_size]; the placeholder expects [batch_size, 1]
            label = np.expand_dims(label, axis=1).astype(np.float32)
            # Get initial hidden and cell state
            _state = session.run(zero_state)

            for subsequence in imdb.slize_batch(data, subsequence_length):
                # Run one training step, starting from the state of the previous subsequence
                _state, _, _summaries = session.run(
                    [state, training_step, merged_summaries],
                    feed_dict = {
                        words: subsequence,
                        train_labels: label,
                        keep_probability: keep_prob,
                        cell_state: _state.c,
                        hidden_state: _state.h
                    }
                )

                # increment step counter
                step += 1

                # write summary to file
                train_writer.add_summary(_summaries, step)

            # print validation accuracy whenever the step counter is a multiple of 25
            if step % 25 == 0:
                _val_accuracy = 0.0
                _val_step = 0
                for val_data, val_label in imdb.get_validation_batch(batch_size):
                    val_label = np.expand_dims(val_label, axis=1).astype(np.float32)
                    # Get initial hidden and cell state
                    _val_state = session.run(zero_state)

                    for val_subsequence in imdb.slize_batch(val_data, subsequence_length):
                        # Get state of last step; no dropout during validation
                        _val_state, _accuracy = session.run(
                            [state, accuracy],
                            feed_dict = {
                                words: val_subsequence,
                                train_labels: val_label,
                                keep_probability: 1.0,
                                cell_state: _val_state.c,
                                hidden_state: _val_state.h
                            }
                        )
                        _val_accuracy += _accuracy
                        _val_step += 1
                _val_accuracy /= _val_step

                print('Validation Accuracy at step %d: %f' % (step, _val_accuracy))            

print('End Training')

Start Training
Epoch: 1


ValueError: invalid literal for int() with base 10: 'let'

In [None]:
We ran into the following error, which we were not able to debug in time:
    ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-179-e54e67498113> in <module>()
     22             _state = session.run(zero_state)
     23 
---> 24             for subsequence in imdb.slize_batch(data, subsequence_length):
     25                 # Get state of last step
     26                 _state, _summaries, _ = session.run(

<ipython-input-147-df359dfff299> in slize_batch(self, batch, slize_size)
    118         buffer = np.ones((batch_size, max_len), dtype = np.int32)
    119         for i, sample in enumerate(batch):
--> 120             buffer[i, 0:len(sample)] = sample
    121 
    122         for i in range(steps):

ValueError: invalid literal for int() with base 10: 'let'

         
Still, we put a lot of time and effort into this homework, and we think it should be enough to pass.
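The traceback points at the assignment into the int32 padding buffer, which suggests that the samples still held word strings rather than ids at that point. Judging by the cell counters, one plausible cause is that imdb was re-created (In [148]) after create_dictionaries had been called (In [53]), so the new object was never converted; this is an inference from the output, not something the notebook confirms. A minimal sketch of the symptom with made-up sample values:

```python
import numpy as np

buffer = np.ones((1, 4), dtype=np.int32)

tokens = ["let", "me", "go"]           # words, as produced by the tokenizer
try:
    buffer[0, 0:len(tokens)] = tokens  # strings cannot be cast to int32
    failed = False
except ValueError:
    failed = True

ids = [31, 107, 286]                   # ids, as produced by words2ids
buffer[0, 0:len(ids)] = ids            # works once the words are mapped to ids
print(failed, buffer.tolist())
```

If this is indeed the cause, calling `imdb.create_dictionaries(vocabulary_size, cutoff_length)` again after re-creating the object should make the batches consist of ids.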

# 8 Visualizing the trained embeddings with t-SNE

Use t-SNE to visualize the trained word embeddings. Try to isolate the two clusters which contain the words with very negative sentiment and the words with very positive sentiment.
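A possible starting point is sketched below. It assumes scikit-learn is installed (the assignment does not prescribe a t-SNE implementation) and uses a random matrix as a stand-in for the trained embeddings, so the resulting layout carries no meaning.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the trained embedding matrix (assumed shape: words x embedding size)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64)).astype(np.float32)

# Project to 2-D; perplexity must be smaller than the number of points
tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=0)
points = tsne.fit_transform(embeddings)
print(points.shape)  # (200, 2)
```

In the notebook, the real matrix could be fetched from the session and each 2-D point labeled with the word returned by ids2words, restricted to the most frequent words to keep the plot readable.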

# 9 Evaluate the test performance

Once you are done with optimizing the parameters of your implementation, use the test data-set to evaluate the test performance of the network.

# 10 Find help

If you struggle with the implementation, there is a slightly related tutorial available on the TensorFlow homepage. Please try to solve the problem with the help of the slides and the TensorFlow Python API documentation first.