This is a tutorial on processing batches of paragraphs of sentences in TensorFlow.
A key assumption is that every sentence can be read independently, both within and across paragraphs.

In [1]:
import tensorflow as tf
import numpy as np
from functools import reduce

In [2]:
# function to generate a sequence of one_hot vectors (i.e. a sequence of "words")
def gen_sequence(words, embedding_size):
    seq = np.zeros((len(words), embedding_size))
    for i in range(len(words)):
        seq[i,words[i]] = 1
    return seq

# generate n_sentences sentences with variable number of words
# Assume a vocabulary of size vocab_size
# maximum words per sentence will be max_words_per_sent
def generate_paragraph(vocab_size, n_sentences, max_words_per_sent):
    sentences = []
    for i in range(n_sentences):
        nwords = np.random.randint(1, max_words_per_sent+1)
        words = np.random.randint(vocab_size, size=nwords)
        sentence = gen_sequence(words, vocab_size)
        sentences.append(sentence)
    return sentences
        
def print_sent_sizes(paragraph):
    print ([len(s) for s in paragraph])

In [3]:
# generate 5 paragraphs
vocab_size=30
p1 = generate_paragraph(vocab_size, n_sentences=1+1, max_words_per_sent=int(1/1*vocab_size))
p2 = generate_paragraph(vocab_size, n_sentences=2+1, max_words_per_sent=int(1/2*vocab_size))
p3 = generate_paragraph(vocab_size, n_sentences=3+1, max_words_per_sent=int(1/3*vocab_size))
p4 = generate_paragraph(vocab_size, n_sentences=4+1, max_words_per_sent=int(1/4*vocab_size))
p5 = generate_paragraph(vocab_size, n_sentences=5+1, max_words_per_sent=int(1/5*vocab_size))

paragraphs = [p1, p2, p3, p4, p5]
for p in paragraphs: print_sent_sizes(p)

[15, 12]
[15, 15, 6]
[7, 8, 8, 4]
[3, 1, 6, 7, 3]
[2, 2, 3, 6, 3, 6]


From this point onwards, we will treat these paragraphs as a single "batch".
The goal is to read the sentences from each paragraph one at a time, in parallel.
This is somewhat difficult because both the number of sentences per paragraph and the number of words per sentence vary, which makes actions like "bucketing" and "padding" awkward and non-intuitive.

I am practicing this in a setting where every paragraph corresponds to a different example and can thus be treated independently. Further, every sentence can be processed independently of every other sentence. Thus our goal is a tensor with 3 indices, where the first index runs over all sentences from all paragraphs concatenated together, the 2nd over the word positions within a sentence (padded to the longest sentence), and the 3rd over the word embedding. I.e. a tensor of dimension $\left(\sum_{i=1}^{N} |P_i|\right) \times S_{max} \times E$, where $|P_i|$ is the number of sentences in the $i$th paragraph, $N$ is the number of paragraphs, $S_{max}$ is the maximum sentence length across all sentences, and $E$ is the embedding size.
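As a quick worked example of that shape, the sketch below plugs in the sentence sizes printed above and `vocab_size = 30` as the embedding size. The variable names are only illustrative; the cells that follow compute the same quantities from the actual `paragraphs` list.

```python
# Sentence lengths copied from the printed output above (one inner list per paragraph)
sentence_lengths = [
    [15, 12],
    [15, 15, 6],
    [7, 8, 8, 4],
    [3, 1, 6, 7, 3],
    [2, 2, 3, 6, 3, 6],
]
embedding_size = 30  # one-hot words, so E equals vocab_size

# sum_i |P_i|: total number of sentences across all paragraphs
total_sentences = sum(len(p) for p in sentence_lengths)
# S_max: the longest sentence in the whole batch
s_max = max(max(p) for p in sentence_lengths)

shape = (total_sentences, s_max, embedding_size)
print(shape)  # (20, 15, 30)

# How much of the padded tensor holds real words rather than padding
total_words = sum(sum(p) for p in sentence_lengths)
print(total_words, "of", total_sentences * s_max, "word slots are real")  # 132 of 300
```

More than half of the word slots end up as padding here, which is the price of flattening everything to a single rectangular tensor.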

In [4]:
# First let's find the total number of sentences. sanity check. should be 20
n_sentences = reduce((lambda x, y: x + y), [len(p) for p in paragraphs])
print ("number of sentences =", n_sentences)

number of sentences = 20


In [11]:
# Now let's find the maximum sentence length.

# this function goes through every value in an array and finds the maximum of size_op(value)
# it will be applied to "paragraphs", which contains lists, to take the maximum over all paragraphs,
#      and to each paragraph, which contains np.arrays, for the maximum sentence length
def get_max_length(array, size_op):
    max_len = 0
    for value in array:
        max_len = max(max_len, size_op(value))
    return max_len

# will be used for 2D np.arrays 
def np_length(arr): return arr.shape[0]

# this is slightly recursive. for each paragraph, I check the local maximum sentence length and 
# I then compare that against a "global" maximum sentence inside the main get_max_length function
def max_sent_in_par(paragraph): 
    return get_max_length(paragraph, np_length)

# sanity check
maximum_sentence_length = get_max_length(paragraphs, max_sent_in_par)
print ("maximum_sentence_length =", maximum_sentence_length)

maximum_sentence_length = 15


In [6]:
# Now that we know the number of sentences and the maximum sentence length, we can create a 3D tensor containing all of the paragraphs

paragraphs_tensor = np.zeros((n_sentences, maximum_sentence_length, vocab_size))
# fill out the tensor
s = 0
for paragraph in paragraphs:
    for sentence in paragraph:
        for k, word in enumerate(sentence):
            paragraphs_tensor[s,k] = word
        s += 1
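Once the sentences are flattened into one tensor, the true length of each sentence and the paragraph it came from are no longer visible in the data itself. A minimal sketch of keeping that bookkeeping alongside the tensor is shown below. It uses a small toy batch (`toy_paragraphs`, a hypothetical stand-in for `paragraphs`) so that it runs on its own. The resulting lengths are the kind of value that `tf.nn.dynamic_rnn` accepts through its `sequence_length` argument, although this notebook does not pass it.

```python
import numpy as np

# Hypothetical toy batch: 2 paragraphs whose sentences are (n_words, embedding_size) one-hot arrays
rng = np.random.default_rng(0)
embedding_size = 4
toy_paragraphs = [
    [np.eye(embedding_size)[rng.integers(embedding_size, size=n)] for n in (3, 1)],
    [np.eye(embedding_size)[rng.integers(embedding_size, size=n)] for n in (2, 4, 2)],
]

# One entry per row of the flattened tensor, in the same order the rows are filled
sentence_lengths = np.array([len(s) for p in toy_paragraphs for s in p])
paragraph_ids = np.array([i for i, p in enumerate(toy_paragraphs) for _ in p])

# Boolean mask: True where a real word lives, False where the row is padding
max_len = sentence_lengths.max()
mask = np.arange(max_len)[None, :] < sentence_lengths[:, None]

print(sentence_lengths)  # [3 1 2 4 2]
print(paragraph_ids)     # [0 0 1 1 1]
print(mask.shape)        # (5, 4)
```

`paragraph_ids` makes it possible to regroup per-sentence results by paragraph after the RNN has run.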

Now let's define a computational graph that will read in every tensor batch and process each sentence with an RNN. I will use a basic RNN following [this tutorial on RNNs](https://danijar.com/introduction-to-recurrent-networks-in-tensorflow/) and [this tutorial on variable length sequences](https://danijar.com/variable-sequence-lengths-in-tensorflow/) by [Danijar Hafner](https://danijar.com/).

In [7]:
x = tf.placeholder(tf.float32, (None, None, vocab_size))
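# if building the cell fails (e.g. when this notebook cell is re-run), retry with reuse=True
try:
    cell = tf.contrib.rnn.LSTMCell(num_units=256, state_is_tuple=True)
except Exception:
    cell = tf.contrib.rnn.LSTMCell(num_units=256, state_is_tuple=True, reuse=True)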
output, state = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)

In [10]:
output = tf.transpose(output, [1, 0, 2])
# print(output.get_shape()[0])
# last = tf.gather(output, int(output.get_shape()[0]) - 1)
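Because shorter sentences are padded, the final time step of the output is not always the last real word of a sentence. The sketch below shows, in plain NumPy rather than TensorFlow, one way to pick the output at each sentence's last real word. It assumes a batch-major array of shape `(n_sentences, max_len, hidden)` (i.e. before the transpose above) and a `lengths` vector. Both are made-up stand-ins, not values produced by the graph.

```python
import numpy as np

# Hypothetical stand-in for the RNN output, shape (n_sentences, max_len, hidden)
n_sentences, max_len, hidden = 3, 4, 2
output = np.arange(n_sentences * max_len * hidden, dtype=np.float32)
output = output.reshape(n_sentences, max_len, hidden)

# Hypothetical true lengths of the three sentences
lengths = np.array([4, 2, 1])

# For sentence i, take the time step lengths[i] - 1 (the last real word)
last_relevant = output[np.arange(n_sentences), lengths - 1]

print(last_relevant.shape)  # (3, 2)
print(last_relevant)
```

Each row of `last_relevant` comes from a different time step, which a single gather on the last time index cannot express.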

