# Train the model using the vocabulary embedding pickle file
- Set `FN0` to the name of the correct pickle file in the `data` folder. This pickle file should have been generated by the vocabulary-embedding notebook.
- Implementation of the "simple" model from http://arxiv.org/pdf/1512.01712v1.pdf

In [1]:
FN = 'train' # name of the file the model weights will be saved to (train.hdf5)
FN0 = 'all-the-news-1000-vocab-embedding' # name of the word embedding weights
FN1 = 'train' # name of the file the model weights to load from (if starting with a pre-existing model)

Use the GPU (`device=cuda`) if you can; if it is busy, you can always fall back to the CPU (`device=cpu`). Note that `THEANO_FLAGS` only takes effect when Keras runs on the Theano backend.

In [2]:
import os
os.environ['THEANO_FLAGS'] = 'device=cuda,floatX=float32'

In [3]:
import keras
keras.__version__

Using TensorFlow backend.


'2.1.5'

Use the token indexing from [vocabulary-embedding](./vocabulary-embedding.ipynb). That indexing does not clip word indexes to `vocab_size`.

Words whose index falls outside the vocabulary are replaced with one of several `oov` placeholders (`oov`, `oov0`, `oov1`, ...), and the same placeholder is used wherever that word appears in the same description and headline. This allows the headline generator to reuse an out-of-vocabulary word from the description in the headline.
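
As a rough illustration of the placeholder idea, the sketch below folds out-of-vocabulary indexes into the last few vocabulary slots. It is a simplified stand-in, not the notebook's own `vocab_fold` (which also consults `glove_idx2idx` first); the function name `fold_oov` and the toy vocabulary size are assumptions made for this example.

```python
def fold_oov(xs, vocab_size, nb_unknown_words):
    """Map every index >= oov0 to one of the last `nb_unknown_words` slots."""
    oov0 = vocab_size - nb_unknown_words  # first placeholder index
    distinct = sorted(set(x for x in xs if x >= oov0))
    # the i-th distinct outside word gets slot vocab_size-1-i;
    # if there are too many outside words, the extras share the last slot
    slot = {x: vocab_size - 1 - min(i, nb_unknown_words - 1)
            for i, x in enumerate(distinct)}
    return [slot.get(x, x) for x in xs]

# toy vocabulary of 20 indexes whose last 3 are placeholders (oov0 == 17)
desc_and_head = [5, 40, 7, 99, 40, 1, 99, 3]
print(fold_oov(desc_and_head, vocab_size=20, nb_unknown_words=3))
# [5, 19, 7, 18, 19, 1, 18, 3]
```

Because the same word always maps to the same slot within one example, a placeholder seen in the description can be matched to the same placeholder in the headline.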

You can start training from a pre-existing model. This allows you to run this notebook many times, each time using different parameters and passing the end result of one run as the input of the next.

I started with `maxlend=0` (see below), in which the description is ignored. I then began with a high `LR` and lowered it manually. I also started with `nflips=0`, in which the original headline is used as-is, and slowly moved to `12`, in which half of the input headline is flipped with the predictions made by the model (the paper used a fixed 10%).

### Padding and clipping of input
Input data (`X`) holds arrays representing the descriptions, with each word replaced by its corresponding index. (`X` is loaded from `%FN0.data.pickle`, the product of the vocabulary-embedding notebook.)

Each entry in `X` is left-padded with `empty` until its length reaches `maxlend`, and an `eos` is then appended. If the entry is longer than `maxlend`, the leading part is clipped off so that only the last `maxlend` words remain, and an `eos` is appended.

E.g. X[0] = [12, 34, 567], word2idx['<eos>'] = 1, word2idx['<empty>'] = 0

- maxlend = 6 ==> X[0] = [0, 0, 0, 12, 34, 567, 1]
- maxlend = 3 ==> X[0] = [12, 34, 567, 1]
- maxlend = 2 ==> X[0] = [34, 567, 1]
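
The three cases above can be reproduced with a small standalone helper. This is a sketch that mirrors the `lpadd` function defined later in this notebook; the name `left_pad_description` and the constants `EMPTY` and `EOS` are illustrative.

```python
EMPTY, EOS = 0, 1  # word2idx['<empty>'] and word2idx['<eos>'] in the example above

def left_pad_description(x, maxlend):
    """Keep the last `maxlend` tokens, left-pad with EMPTY, then append EOS."""
    if maxlend == 0:
        return [EOS]  # description is ignored entirely
    x = list(x)[-maxlend:]
    return [EMPTY] * (maxlend - len(x)) + x + [EOS]

x0 = [12, 34, 567]
for maxlend in (6, 3, 2):
    print(maxlend, left_pad_description(x0, maxlend))
```

Note that the result always has `maxlend + 1` entries, so the `eos` lands at index `maxlend`.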

Labels (`Y`) are the headline words followed by `eos` and clipped or padded to `maxlenh`

=====================

The input is made from `maxlend` description words followed by `eos`,
followed by the headline words followed by `eos`.
If the description is shorter than `maxlend`, it is left-padded with `empty`.
If the entire input is longer than `maxlen`, it is clipped; if it is shorter, it is right-padded with `empty`.

In other words the input is made from a `maxlend` half in which the description is padded from the left
and a `maxlenh` half in which `eos` is followed by a headline followed by another `eos` if there is enough space.

The labels match only the second half and 
the first label matches the `eos` at the start of the second half (following the description in the first half)
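
The layout described above can be sketched for a single example as follows. This is a minimal illustration with toy sizes (`maxlend=4`, `maxlenh=4`) and a hypothetical helper `make_example`; it omits the out-of-vocabulary folding and one-hot encoding. Following the data generator later in this notebook, the input row here does not carry the headline's closing `eos`; only the labels do.

```python
EMPTY, EOS = 0, 1

def make_example(desc, head, maxlend, maxlenh):
    """Build one input row (length maxlen) and its labels (length maxlenh)."""
    maxlen = maxlend + maxlenh
    desc = list(desc)[-maxlend:] if maxlend else []
    # first half: left-padded description; then eos; then the headline
    x = [EMPTY] * (maxlend - len(desc)) + desc + [EOS] + list(head)
    x = (x + [EMPTY] * maxlen)[:maxlen]  # right-pad or clip to maxlen
    # labels: headline, closing eos, then padding
    y = (list(head) + [EOS] + [EMPTY] * maxlenh)[:maxlenh]
    return x, y

x, y = make_example(desc=[12, 34, 567], head=[8, 9], maxlend=4, maxlenh=4)
print(x)  # [0, 12, 34, 567, 1, 8, 9, 0]
print(y)  # [8, 9, 1, 0]
```

Label `y[j]` is the word the model should emit after reading input position `maxlend + j`, so `y[0]` lines up with the `eos` that ends the description.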

In [4]:
maxlend = 25 # 0 - if we don't want to use description at all
maxlenh = 25
maxlen = maxlend + maxlenh
rnn_size = 512 # must be same as 160330-word-gen
rnn_layers = 3  # match FN1
batch_norm = False

The first `activation_rnn_size` output units of the top LSTM layer are used to compute the attention activations, and the remaining units are used to select the predicted word.

In [5]:
activation_rnn_size = 40 if maxlend else 0

In [6]:
# Training parameters
seed = 42
p_W, p_U, p_dense, weight_decay = 0, 0, 0, 0
optimizer = 'adam'
LR = 1e-4
batch_size = 64
nflips = 10

# Read word embedding

In [7]:
import pickle
with open('data/%s.pickle'%FN0, 'rb') as fp:
    embedding, idx2word, word2idx, glove_idx2idx = pickle.load(fp)
vocab_size, embedding_size = embedding.shape

In [8]:
with open('data/%s.data.pickle'%FN0, 'rb') as fp:
    X, Y = pickle.load(fp)
print('Number of descriptions:', len(X))
print('Number of headlines:', len(Y))

Number of descriptions: 998
Number of headlines: 998


In [9]:
nb_unknown_words = 10
nb_train_samples = len(X)  # Number of training samples
nb_val_samples = int(len(X) * 0.1)  # Number of validation samples (as an integer count)
# nb_train_samples = 30000
# nb_val_samples = 3000

In [10]:
print('number of examples',len(X),len(Y))
print('dimension of embedding space for words',embedding_size)
print('vocabulary size', vocab_size, 'the last %d words can be used as place holders for unknown/oov words'%nb_unknown_words)
print('total number of different words',len(idx2word), len(word2idx))
print('number of words outside vocabulary which we can substitute using glove similarity', len(glove_idx2idx))
print('number of words that will be regarded as unknown(unk)/out-of-vocabulary(oov)',len(idx2word)-vocab_size-len(glove_idx2idx))

number of examples 998 998
dimension of embedding space for words 100
vocabulary size 16182 the last 10 words can be used as place holders for unknown/oov words
total number of different words 16182 16182
number of words outside vocabulary which we can substitute using glove similarity 0
number of words that will be regarded as unknown(unk)/out-of-vocabulary(oov) 0


In [11]:
for i in range(nb_unknown_words):
    idx2word[vocab_size-1-i] = '<%d>'%i

When printing, mark words that are outside of the vocabulary with `^` at the end

In [12]:
oov0 = vocab_size-nb_unknown_words

In [13]:
for i in range(oov0, len(idx2word)):
    idx2word[i] = idx2word[i]+'^'

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=seed)
len(X_train), len(Y_train), len(X_test), len(Y_test)

(898, 898, 100, 100)

In [15]:
del X
del Y

In [16]:
empty = 0
eos = 1
# For printing purposes
idx2word[empty] = '_' # Change '<empty>' to '_' 
idx2word[eos] = '~'   # Change '<eos>' to '~'

In [17]:
import numpy as np
from keras.preprocessing import sequence
from keras.utils import np_utils
import random, sys

In [18]:
def prt(label, x):
    print(label+':', end='')
    for w in x:
        print(idx2word[w]+' ', end='')
    print()

In [19]:
i = random.randint(0, len(X_test) - 1) # Randomly look at an entry
print('i: ', i)
print('Training Set:')
prt('H',Y_train[i]) # Headline
prt('D',X_train[i]) # Description

i:  34
Training Set:
H:Is Your Workout Not Working 
D:is your workout getting you nowhere research and lived experience indicate that many people who begin a new exercise program see little if any improvement in their health and fitness even after week of studiously sticking with their new routine among fitness scientist these people are known a nonresponders their body simply don t respond to the exercise they are doing and once discouraged they often return to being nonexercisers 


In [20]:
print('Testing Set:')
prt('H',Y_test[i]) # Headline
prt('D',X_test[i]) # Description

Testing Set:
H:Skepticism and Support in South Korea a Ban Ki moon Weighs Presidential Bid The New York Times 
D:each day hundred of visitor many with young child make a pilgrimage to haengchi village where ban wa born 72 year ago they wander through a replica of mr ban s old house they learn about his personal journey to the united nations where he wa secretary general for 10 year despite criticism of his tenure there mr ban is seen a a role model by vast number of south koreans school textbook for example celebrate him a a man who made south korea proud 


# Model Configuration

In [27]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout, RepeatVector
from keras.layers.wrappers import TimeDistributed, Bidirectional
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.regularizers import l2

In [23]:
# seed weight initialization
random.seed(seed)
np.random.seed(seed)

In [24]:
regularizer = l2(weight_decay) if weight_decay else None

### Start with a standard stacked LSTM, add embedding layer

In [40]:
model = Sequential() #  Sequential model is a linear stack of layers
model.add(Embedding(vocab_size, embedding_size,
                    input_length=maxlen,
                    embeddings_regularizer=regularizer, weights=[embedding], mask_zero=True,
                    name='embedding_1'))

for i in range(rnn_layers):
    lstm = LSTM(rnn_size, return_sequences=True, # batch_norm=batch_norm,
                kernel_regularizer=regularizer, recurrent_regularizer=regularizer,
                bias_regularizer=regularizer, dropout=p_W, recurrent_dropout=p_U,
                name='lstm_%d'%(i+1)
                  )
    model.add(lstm)
    model.add(Dropout(p_dense,name='dropout_%d'%(i+1)))

## Context Layer

The attention mechanism is used when outputting each word in the decoder. For each output word the attention mechanism computes a weight over each of the input words that determines how much attention should be paid to that input word. The weights sum up to 1, and are used to compute a weighted average of the last hidden layers generated after processing each of the input words. This weighted average is referred to as the context.

Context is then input into the softmax layer along with the last hidden layer from the current step of decoding.

The context layer reduces the input to just its headline part (the second half).
For each word in this part it concatenates the output of the previous layer (RNN)
with a weighted average of the outputs of the description part.
Only the last `rnn_size - activation_rnn_size` units of each output are used for this.
The first `activation_rnn_size` units are used to compute the weights for the averaging.
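
To make the arithmetic concrete, here is a pure-Python sketch of the context computation for a single headline position. It is an illustration only: the helper name `simple_context_row` and the toy vectors are assumptions, and it leaves out the batching and the masking of padded description words that the Keras version below has to handle.

```python
import math

def simple_context_row(head_vec, desc_vecs, n):
    """Context for one headline position.

    head_vec:  top-layer output at the headline position (list of floats)
    desc_vecs: top-layer outputs at the description positions
    n:         number of leading units used to score attention
    """
    # dot product of the first n units gives one energy per description word
    energies = [sum(h * d for h, d in zip(head_vec[:n], dv[:n])) for dv in desc_vecs]
    # softmax over description words (shifted by the max for numerical stability)
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted average of the remaining units of the description outputs
    width = len(head_vec) - n
    context = [sum(w * dv[n + j] for w, dv in zip(weights, desc_vecs))
               for j in range(width)]
    # concatenate the context with the remaining units of the headline output
    return weights, context + list(head_vec[n:])

weights, out = simple_context_row(
    head_vec=[1.0, 0.5, -0.5],
    desc_vecs=[[0.0, 1.0, 2.0], [0.0, 3.0, 4.0]],
    n=1)
print(weights)  # [0.5, 0.5]
print(out)      # [2.0, 3.0, 0.5, -0.5]
```

The output has `2 * (len(head_vec) - n)` entries, which corresponds to the `2*(rnn_size - activation_rnn_size)` width declared for the layer.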

In [41]:
from keras.layers.core import Lambda
import keras.backend as K

def simple_context(X, mask, n=activation_rnn_size, maxlend=maxlend, maxlenh=maxlenh):
    desc, head = X[:,:maxlend,:], X[:,maxlend:,:]
    head_activations, head_words = head[:,:,:n], head[:,:,n:]
    desc_activations, desc_words = desc[:,:,:n], desc[:,:,n:]
    
    # RTFM http://deeplearning.net/software/theano/library/tensor/basic.html#theano.tensor.batched_tensordot
    # activation for every head word and every desc word
    activation_energies = K.batch_dot(head_activations, desc_activations, axes=(2,2))
    # make sure we dont use description words that are masked out
    activation_energies = activation_energies + -1e20*K.expand_dims(1.-K.cast(mask[:, :maxlend],'float32'),1)
    
    # for every head word compute weights for every desc word
    activation_energies = K.reshape(activation_energies,(-1,maxlend))
    activation_weights = K.softmax(activation_energies)
    activation_weights = K.reshape(activation_weights,(-1,maxlenh,maxlend))

    # for every head word compute weighted average of desc words
    desc_avg_word = K.batch_dot(activation_weights, desc_words, axes=(2,1))
    return K.concatenate((desc_avg_word, head_words))

In [42]:
if activation_rnn_size:
    model.add(Lambda(simple_context,
                     mask = lambda inputs, mask: mask[:,maxlend:],
                     output_shape = lambda input_shape: (input_shape[0], maxlenh, 2*(rnn_size - activation_rnn_size)),
                     name='simplecontext_1'))
model.add(TimeDistributed(Dense(vocab_size,
                                kernel_regularizer=regularizer, bias_regularizer=regularizer,
                                name = 'timedistributed_1')))
model.add(Activation('softmax', name='activation_1'))

## Configure the model for training

In [43]:
from keras.optimizers import Adam, RMSprop # usually I prefer Adam but article used rmsprop
# opt = Adam(lr=LR)  # keep calm and reduce learning rate
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

In [44]:
K.set_value(model.optimizer.lr,np.float32(LR))

In [45]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 100)           1618200   
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 512)           1255424   
_________________________________________________________________
dropout_1 (Dropout)          (None, 50, 512)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 50, 512)           2099200   
_________________________________________________________________
dropout_2 (Dropout)          (None, 50, 512)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 50, 512)           2099200   
_________________________________________________________________
dropout_3 (Dropout)          (None, 50, 512)           0         
__________

# Load model weights if data/train.hdf5 is present
This file is produced the first time training is run.

In [34]:
if os.path.exists('data/%s.hdf5'%FN1):
    model.load_weights('data/%s.hdf5'%FN1)

ValueError: Layer #1 (named "bidirectional_1" in the current model) was found to correspond to layer lstm_1 in the save file. However the new layer bidirectional_1 expects 6 weights, but the saved weights have 3 elements.

## Test if everything looks right so far

In [35]:
def lpadd(x, maxlend=maxlend, eos=eos):
    """left (pre) pad a description to maxlend and then add eos.
    The eos is the input to predicting the first word in the headline
    """
    assert maxlend >= 0
    if maxlend == 0:
        return [eos]
    n = len(x)
    if n > maxlend:
        x = x[-maxlend:]
        n = maxlend
    return [empty]*(maxlend-n) + x + [eos]

In [36]:
samples = [lpadd([3]*26)]
# pad from right (post) so the first maxlend will be description followed by headline
data = sequence.pad_sequences(samples, maxlen=maxlen, value=empty, padding='post', truncating='post')

In [37]:
np.all(data[:,maxlend] == eos)

True

In [38]:
data.shape,list(map(len, samples))

((1, 50), [26])

### Generates output predictions for the input samples

In [39]:
probs = model.predict(data, verbose=0, batch_size=1)
probs.shape

InvalidArgumentError: Input to reshape is a tensor with 49200 values, but the requested shape requires a multiple of 944
	 [[Node: time_distributed_1/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](simplecontext_1/concat, time_distributed_1/Reshape/shape)]]

Caused by op 'time_distributed_1/Reshape', defined at:
  File "<ipython-input-30-6016869b550d>", line 8, in <module>
    name = 'timedistributed_1')))
  File "/anaconda3/envs/cs259/lib/python3.5/site-packages/keras/models.py", line 492, in add
    output_tensor = layer(self.outputs[0])
  File "/anaconda3/envs/cs259/lib/python3.5/site-packages/keras/engine/topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)
  File "/anaconda3/envs/cs259/lib/python3.5/site-packages/keras/layers/wrappers.py", line 208, in call
    inputs = K.reshape(inputs, (-1,) + input_shape[2:])
  File "/anaconda3/envs/cs259/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1894, in reshape
    return tf.reshape(x, shape)
  File "/anaconda3/envs/cs259/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2510, in reshape
    name=name)
  File "/anaconda3/envs/cs259/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/anaconda3/envs/cs259/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/anaconda3/envs/cs259/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 49200 values, but the requested shape requires a multiple of 944
	 [[Node: time_distributed_1/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](simplecontext_1/concat, time_distributed_1/Reshape/shape)]]


# Sample generation
### This section describes the process of feeding the output of a decoder as the input in the next step

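Use a beam-search decoder, which generates the headline one word at a time, at each step extending the B highest-probability sequences (B is the parameter `k` in the code below). Note that the implementation below picks the `k` surviving candidates by sampling in proportion to `exp(-score / temperature)` rather than always taking the top `k`.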
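
For orientation, the following is a minimal deterministic beam search over a toy model. It is a sketch under stated assumptions, not the decoder used in this notebook: the function `toy_beam_search`, the bigram `table`, and the choice to always keep the top `k` candidates (instead of sampling them) are all made up for this example.

```python
import math

def toy_beam_search(next_probs, start, eos, k=2, max_len=5):
    """Keep the k best (lowest negative log-likelihood) sequences at every step."""
    def finished(seq):
        return seq[-1] == eos or len(seq) >= max_len

    live, dead = [(0.0, list(start))], []
    while live:
        # extend every live sequence by every possible next token
        candidates = [(score - math.log(p), seq + [tok])
                      for score, seq in live
                      for tok, p in next_probs(seq).items() if p > 0]
        # finished sequences compete with the new candidates for the k slots
        pool = sorted(dead + candidates, key=lambda c: c[0])[:k]
        dead = [c for c in pool if finished(c[1])]
        live = [c for c in pool if not finished(c[1])]
    return dead

# hypothetical bigram model over tokens {1: eos, 2, 3}: P(next | last token)
table = {1: {2: 0.6, 3: 0.4}, 2: {1: 0.9, 3: 0.1}, 3: {1: 0.7, 2: 0.3}}
result = toy_beam_search(lambda seq: table[seq[-1]], start=[1], eos=1, k=2)
for score, seq in result:
    print(round(math.exp(-score), 2), seq)
```

Scores are accumulated as sums of `-log(p)`, so a lower score means a more probable sequence, which is the same convention the notebook's `beamsearch` uses.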

In [288]:
# variation to https://github.com/ryankiros/skip-thoughts/blob/master/decoding/search.py
def beamsearch(predict, start=[empty]*maxlend + [eos],
               k=1, maxsample=maxlen, use_unk=True, empty=empty, eos=eos, temperature=1.0):
    """return k samples (beams) and their NLL scores, each sample is a sequence of labels,
    all samples start with an `empty` label and end with `eos` or are truncated to a length of `maxsample`.
    You need to supply `predict`, which returns the label probabilities for each sample.
    `use_unk` allows usage of `oov` (out-of-vocabulary) labels in samples
    """
    def sample(energy, n, temperature=temperature):
        """sample at most n elements according to their energy"""
        n = min(n,len(energy))
        prb = np.exp(-np.array(energy) / temperature )
        res = []
        for i in range(n):
            z = np.sum(prb)
            r = np.argmax(np.random.multinomial(1, prb/z, 1))
            res.append(r)
            prb[r] = 0. # make sure we select each element only once
        return res

    dead_k = 0 # samples that reached eos
    dead_samples = []
    dead_scores = []
    live_k = 1 # samples that have not yet reached eos
    live_samples = [list(start)]
    live_scores = [0]

    while live_k:
        # for every possible live sample calc prob for every possible label 
        probs = predict(live_samples, empty=empty)
        # total score for every sample is sum of -log of word prb
        cand_scores = np.array(live_scores)[:,None] - np.log(probs)
        cand_scores[:,empty] = 1e20
        if not use_unk:
            for i in range(nb_unknown_words):
                cand_scores[:,vocab_size - 1 - i] = 1e20
        live_scores = list(cand_scores.flatten())
        

        # find the best (lowest) scores we have from all possible dead samples and
        # all live samples and all possible new words added
        scores = dead_scores + live_scores
        ranks = sample(scores, k)
        n = len(dead_scores)
        ranks_dead = [r for r in ranks if r < n]
        ranks_live = [r - n for r in ranks if r >= n]
        
        dead_scores = [dead_scores[r] for r in ranks_dead]
        dead_samples = [dead_samples[r] for r in ranks_dead]
        
        live_scores = [live_scores[r] for r in ranks_live]

        # append the new words to their appropriate live sample
        voc_size = probs.shape[1]
        live_samples = [live_samples[r//voc_size]+[r%voc_size] for r in ranks_live]

        # live samples that should be dead are...
        # even if len(live_samples) == maxsample we dont want it dead because we want one
        # last prediction out of it to reach a headline of maxlenh
        zombie = [s[-1] == eos or len(s) > maxsample for s in live_samples]
        
        # add zombies to the dead
        dead_samples += [s for s,z in zip(live_samples,zombie) if z]
        dead_scores += [s for s,z in zip(live_scores,zombie) if z]
        dead_k = len(dead_samples)
        # remove zombies from the living 
        live_samples = [s for s,z in zip(live_samples,zombie) if not z]
        live_scores = [s for s,z in zip(live_scores,zombie) if not z]
        live_k = len(live_samples)

    return dead_samples + live_samples, dead_scores + live_scores

In [290]:
def keras_rnn_predict(samples, empty=empty, model=model, maxlen=maxlen):
    """for every sample, calculate probability for every possible label
    you need to supply your RNN model and maxlen - the length of sequences it can handle
    """
    sample_lengths = list(map(len, samples))
    assert all(l > maxlend for l in sample_lengths)
    assert all(l[maxlend] == eos for l in samples)
    # pad from right (post) so the first maxlend will be description followed by headline
    data = sequence.pad_sequences(samples, maxlen=maxlen, value=empty, padding='post', truncating='post')
    probs = model.predict(data, verbose=0, batch_size=batch_size)
    return np.array([prob[sample_length-maxlend-1] for prob, sample_length in zip(probs, sample_lengths)])

In [291]:
def vocab_fold(xs):
    """convert list of word indexes that may contain words outside vocab_size to words inside.
    If a word is outside, try first to use glove_idx2idx to find a similar word inside.
    If none exists then replace all occurrences of the same unknown word with <0>, <1>, ...
    """
    xs = [x if x < oov0 else glove_idx2idx.get(x,x) for x in xs]
    # the more popular word is <0> and so on
    outside = sorted([x for x in xs if x >= oov0])
    # if there are more than nb_unknown_words oov words then put them all in nb_unknown_words-1
    outside = dict((x,vocab_size-1-min(i, nb_unknown_words-1)) for i, x in enumerate(outside))
    xs = [outside.get(x,x) for x in xs]
    return xs

In [292]:
def vocab_unfold(desc,xs):
    # assume desc is the unfolded version of the start of xs
    unfold = {}
    for i, unfold_idx in enumerate(desc):
        fold_idx = xs[i]
        if fold_idx >= oov0:
            unfold[fold_idx] = unfold_idx
    return [unfold.get(x,x) for x in xs]

In [293]:
import sys
import Levenshtein

def gensamples(skips=2, k=10, batch_size=batch_size, short=True, temperature=1., use_unk=True):
    i = random.randint(0,len(X_test)-1)
    print('HEAD:',' '.join(idx2word[w] for w in Y_test[i][:maxlenh]))
    print('DESC:',' '.join(idx2word[w] for w in X_test[i][:maxlend]))
    sys.stdout.flush()

    print('HEADS:')
    x = X_test[i]
    samples = []
    if maxlend == 0:
        skips = [0]
    else:
        skips = range(min(maxlend,len(x)), max(maxlend,len(x)), abs(maxlend - len(x)) // skips + 1)
    for s in skips:
        start = lpadd(x[:s])
        fold_start = vocab_fold(start)
        sample, score = beamsearch(predict=keras_rnn_predict, start=fold_start, k=k, temperature=temperature, use_unk=use_unk)
        assert all(s[maxlend] == eos for s in sample)
        samples += [(s,start,scr) for s,scr in zip(sample,score)]

    samples.sort(key=lambda x: x[-1])
    codes = []
    for sample, start, score in samples:
        code = ''
        words = []
        sample = vocab_unfold(start, sample)[len(start):]
        for w in sample:
            if w == eos:
                break
            words.append(idx2word[w])
            code += chr(w//(256*256)) + chr((w//256)%256) + chr(w%256)
        if short:
            # only print a headline if its Jaro similarity to every earlier one is below 0.6
            distance = min([100] + [-Levenshtein.jaro(code,c) for c in codes])
            if distance > -0.6:
                print(score, ' '.join(words))
                # print('%s (%.2f) %f'%(' '.join(words), score, distance))
        else:
            print(score, ' '.join(words))
        codes.append(code)

In [294]:
gensamples(skips=2, batch_size=batch_size, k=10, temperature=1.)

HEAD: The Right Way to Fall The New York Times
DESC: rare is the individual who hasn t tripped over a pet or uneven pavement tumbled off a bike slipped on ice or maybe wiped out
HEADS:
242.2842559814453 racial advise Graffiti approached Threats invested cuba Troubles Defense Open callused slogan cristiano pharmacy illness alyssa forgiven supply intellectual literature same commence opportunist liberalism devoses
242.2878646850586 Smiling Nightclub residence poured kcna went bromwich louvre material impressive Couple sicily register encapsulated humans convene itching encouraged retroactively kabul damn Alarmed hijab sinking Bowl


# Data generator

The data generator produces batches of inputs and outputs (labels) for training. Each input is made of two parts. The first `maxlend` words are the original description, followed by `eos`, followed by the headline that we want to predict (except for the last word of the headline, which is always `eos`), and then `empty` padding up to `maxlen` words.

For each input, the output is the headline words (without the starting `eos` but with the ending `eos`) padded with `empty` words up to `maxlenh` words. The output is also expanded into a one-hot encoding of each word.

To be more realistic, the second part of the input should be the result of generation and not the original headline.
Instead we flip just `nflips` words to come from the generator. Even this is too hard to do properly, so flipping is
implemented in a naive way (which consumes less time): using the full input (description + `eos` + headline), generate predictions for the outputs, then for `nflips` random headline positions replace the original word with the word that has the highest probability in the prediction.
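
The naive flipping step can be sketched on a single input row as follows. This is an illustration only: `flip_row` is a hypothetical helper, and `predicted` stands in for the per-position argmax of the model's output, which the real implementation obtains from `model.predict`.

```python
import random

EMPTY, EOS = 0, 1

def flip_row(x, predicted, maxlend, nflips, rng):
    """Replace up to `nflips` headline inputs with the model's own predictions.

    predicted[j] is the model's most likely word for label position j,
    i.e. the word that follows input position maxlend + j.
    """
    x_out = list(x)
    # positions 0..maxlend-1 hold the description and position maxlend holds eos,
    # so only positions after that are candidates for flipping
    positions = rng.sample(range(maxlend + 1, len(x)), nflips)
    for pos in positions:
        if x[pos] in (EMPTY, EOS):  # never overwrite padding or eos
            continue
        x_out[pos] = predicted[pos - (maxlend + 1)]
    return x_out

x = [0, 12, 34, 567, 1, 8, 9, 0]   # maxlend=4, maxlenh=4
predicted = [70, 71, 72, 73]       # made-up argmax word per label position
print(flip_row(x, predicted, maxlend=4, nflips=3, rng=random.Random(0)))
# [0, 12, 34, 567, 1, 70, 71, 0]
```

Because positions that hold `empty` or `eos` are skipped, the number of words actually flipped can be smaller than `nflips`.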

In [295]:
def flip_headline(x, nflips=None, model=None, debug=False):
    """given a vectorized input (after `pad_sequences`) flip some of the words in the second half (headline)
    with words predicted by the model
    """
    if nflips is None or model is None or nflips <= 0:
        return x
    
    batch_size = len(x)
    assert np.all(x[:,maxlend] == eos)
    probs = model.predict(x, verbose=0, batch_size=batch_size)
    x_out = x.copy()
    for b in range(batch_size):
        # pick locations we want to flip
        # 0...maxlend-1 are descriptions and should be fixed
        # maxlend is eos and should be fixed
        flips = sorted(random.sample(range(maxlend+1,maxlen), nflips))
        if debug and b < debug:
            print(b, end=' ')
        for input_idx in flips:
            if x[b,input_idx] == empty or x[b,input_idx] == eos:
                continue
            # convert from input location to label location
            # the output at maxlend (when input is eos) is feed as input at maxlend+1
            label_idx = input_idx - (maxlend+1)
            prob = probs[b, label_idx]
            w = prob.argmax()
            if w == empty:  # replace accidental empty with oov
                w = oov0
            if debug and b < debug:
                print('%s => %s '%(idx2word[x_out[b,input_idx]],idx2word[w]), end='')
            x_out[b,input_idx] = w
        if debug and b < debug:
            print()
    return x_out

In [296]:
def conv_seq_labels(xds, xhs, nflips=None, model=None, debug=False):
    """descriptions and headlines are converted to padded input vectors. headlines are one-hot encoded as labels"""
    batch_size = len(xhs)
    assert len(xds) == batch_size
    x = [vocab_fold(lpadd(xd)+xh) for xd,xh in zip(xds,xhs)]  # the input does not have 2nd eos
    x = sequence.pad_sequences(x, maxlen=maxlen, value=empty, padding='post', truncating='post')
    x = flip_headline(x, nflips=nflips, model=model, debug=debug)
    
    y = np.zeros((batch_size, maxlenh, vocab_size))
    for i, xh in enumerate(xhs):
        xh = vocab_fold(xh) + [eos] + [empty]*maxlenh  # output does have a eos at end
        xh = xh[:maxlenh]
        y[i,:,:] = np_utils.to_categorical(xh, vocab_size)
        
    return x, y

In [297]:
def gen(Xd, Xh, batch_size=batch_size, nb_batches=None, nflips=None, model=None, debug=False, seed=seed):
    """yield batches. for training use nb_batches=None
    for validation generate deterministic results repeating every nb_batches
    
    while training it is good idea to flip once in a while the values of the headlines from the
    value taken from Xh to value generated by the model.
    """
    c = nb_batches if nb_batches else 0
    while True:
        xds = []
        xhs = []
        if nb_batches and c >= nb_batches:
            c = 0
        new_seed = random.randint(0, sys.maxsize)
        random.seed(c+123456789+seed)
        for b in range(batch_size):
            t = random.randint(0,len(Xd)-1)

            xd = Xd[t]
            s = random.randint(min(maxlend,len(xd)), max(maxlend,len(xd)))
            xds.append(xd[:s])
            
            xh = Xh[t]
            s = random.randint(min(maxlenh,len(xh)), max(maxlenh,len(xh)))
            xhs.append(xh[:s])

        # undo the seeding before we yield in order not to affect the caller
        c+= 1
        random.seed(new_seed)

        yield conv_seq_labels(xds, xhs, nflips=nflips, model=model, debug=debug)

In [298]:
r = next(gen(X_train, Y_train, batch_size=batch_size))
r[0].shape, r[1].shape, len(r)

((64, 50), (64, 25, 16182), 2)

In [299]:
def test_gen(gen, n=5):
    Xtr,Ytr = next(gen)
    for i in range(n):
        assert Xtr[i,maxlend] == eos
        x = Xtr[i,:maxlend]
        y = Xtr[i,maxlend:]
        yy = Ytr[i,:]
        yy = np.where(yy)[1]
        prt('L',yy)
        prt('H',y)
        if maxlend:
            prt('D',x)

In [300]:
test_gen(gen(X_train, Y_train, batch_size=batch_size))

L:Riot by Drug Gangs in Brazil Prison Leaves at Least 56 Dead The New York Times ~ _ _ _ _ _ _ _ _ 
H:~ Riot by Drug Gangs in Brazil Prison Leaves at Least 56 Dead The New York Times _ _ _ _ _ _ _ _ 
D:scene of slaughter in the compaj prison which held more than 1 200 inmate about triple it official capacity i never saw anything like this 
L:Donald Trump and Janet Yellen Look to Be on a Collision Course The New York Times ~ _ _ _ _ _ _ _ _ 
H:~ Donald Trump and Janet Yellen Look to Be on a Collision Course The New York Times _ _ _ _ _ _ _ _ 
D:explicitly made clear that the nation isn t far from attaining those goal now it s fair to say the economy is near maximum employment 
L:Trump Appears to Side With Assange Over Intelligence Agencies Conclusions The New York Times ~ _ _ _ _ _ _ _ _ _ _ 
H:~ Trump Appears to Side With Assange Over Intelligence Agencies Conclusions The New York Times _ _ _ _ _ _ _ _ _ _ 
D:mr trump is leaning on republican veteran in the oval office s top slot the f

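Test flipping: with `nflips=6`, some headline tokens in `H` are replaced by the model's predictions, while the labels `L` stay unchanged.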

In [301]:
test_gen(gen(X_train, Y_train, nflips=6, model=model, debug=False, batch_size=batch_size))

L:Riot by Drug Gangs in Brazil Prison Leaves at Least 56 Dead The New York Times ~ _ _ _ _ _ _ _ _ 
H:~ Riot by Drug Gangs in Brazil Prison legislation legislation precaution 56 Dead The debate York Times _ _ _ _ _ _ _ _ 
D:scene of slaughter in the compaj prison which held more than 1 200 inmate about triple it official capacity i never saw anything like this 
L:Donald Trump and Janet Yellen Look to Be on a Collision Course The New York Times ~ _ _ _ _ _ _ _ _ 
H:~ Donald Trump and legislation legislation Look to Be on legislation Collision Course legislation New York legislation _ _ _ _ _ _ _ _ 
D:explicitly made clear that the nation isn t far from attaining those goal now it s fair to say the economy is near maximum employment 
L:Trump Appears to Side With Assange Over Intelligence Agencies Conclusions The New York Times ~ _ _ _ _ _ _ _ _ _ _ 
H:~ Trump Appears to legislation With Assange Over Intelligence Agencies Conclusions The legislation legislation legislation _ _ _ _ _ _ _ _
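
The idea behind flipping can be sketched without a trained model. The code below is not the notebook's `flip_headline` (which is defined earlier and not shown here); `flip_tokens` is a hypothetical helper, `predicted` is a stand-in for the model's per-position guesses, and `empty=0` is an assumed padding id. Input position `i` in the headline part corresponds to label position `i - maxlend - 1`, because the input headline is shifted right by the separating `eos`.

```python
import numpy as np

def flip_tokens(x, predicted, maxlend, nflips, rng, empty=0):
    """Replace up to nflips non-padding headline tokens per row with predicted ones.

    x:         (batch, maxlen) int array: description, eos, headline, padding
    predicted: (batch, >= maxlen - maxlend - 1) int array aligned with the labels
    """
    x = x.copy()
    for b in range(x.shape[0]):
        # headline tokens sit after the eos at index maxlend; skip padding
        candidates = [i for i in range(maxlend + 1, x.shape[1]) if x[b, i] != empty]
        if not candidates:
            continue
        chosen = rng.choice(candidates, size=min(nflips, len(candidates)), replace=False)
        for i in chosen:
            x[b, i] = predicted[b, i - maxlend - 1]
    return x

rng = np.random.default_rng(0)
x = np.array([[4, 5, 6, 1, 7, 8, 9, 0]])   # 3 description tokens, eos=1, 3 headline tokens, padding
predicted = np.full((1, 4), 99)            # pretend the model predicts token 99 everywhere
flipped = flip_tokens(x, predicted, maxlend=3, nflips=2, rng=rng)
print((flipped != x).sum())  # 2
```

The description, the `eos` separator and the padding are never touched, which matches the output above where only words inside `H` change.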

In [302]:
valgen = gen(X_test, Y_test,nb_batches=3, batch_size=batch_size)

Check that `valgen` repeats itself after `nb_batches` batches.

In [303]:
for i in range(4):
    test_gen(valgen, n=1)

L:How to Navigate New Airline Carry On Rules The New York Times ~ _ _ _ _ _ _ _ _ _ _ _ _ 
H:~ How to Navigate New Airline Carry On Rules The New York Times _ _ _ _ _ _ _ _ _ _ _ _ 
D:a medium and technology company even so 45 percent of flier said they would buy a basic economy fare and those traveler aren t necessarily 
L:Cyberwar for Sale The New York Times ~ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
H:~ Cyberwar for Sale The New York Times _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
D:the morning of may 18 2014 violeta lagunes wa perplexed by a series of strange message that appeared in her gmail inbox it wa election 
L:Justice Department Urges Appeals Court to Reinstate Trump s Travel Ban The New York Times ~ _ _ _ _ _ _ _ _ _ 
H:~ Justice Department Urges Appeals Court to Reinstate Trump s Travel Ban The New York Times _ _ _ _ _ _ _ _ _ 
D:day of reprieve to foreign visitor from seven predominantly muslim country a well a other immigrant who initially were blocked from entering the united states
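
Reading the printed samples works, but the same property can also be asserted. This is a sketch on a toy generator: `toy_gen` and `repeats_with_period` are hypothetical names, and `toy_gen` only mimics the counter logic of `gen`, using a private `random.Random` instance instead of the global generator.

```python
import random
import numpy as np

def toy_gen(data, batch_size, nb_batches=None, seed=42):
    """Toy stand-in for gen(): yields batches that repeat every nb_batches."""
    c = nb_batches if nb_batches else 0
    while True:
        if nb_batches and c >= nb_batches:
            c = 0
        rng = random.Random(c + 123456789 + seed)  # local RNG, global state untouched
        yield np.array([data[rng.randint(0, len(data) - 1)] for _ in range(batch_size)])
        c += 1

def repeats_with_period(generator, period, cycles=2):
    """Return True if batch i equals batch i + period over `cycles` full cycles."""
    batches = [next(generator) for _ in range(period * (cycles + 1))]
    return all(np.array_equal(batches[i], batches[i + period])
               for i in range(period * cycles))

print(repeats_with_period(toy_gen(list(range(100)), batch_size=8, nb_batches=3), period=3))  # True
```

The same `repeats_with_period` check could be pointed at a real validation generator by comparing the first element of each yielded tuple, assuming flipping is disabled for validation as it is here.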

# Training Starts Here

In [304]:
history = {}

In [305]:
traingen = gen(X_train, Y_train, batch_size=batch_size, nflips=nflips, model=model)
valgen = gen(X_test, Y_test, nb_batches=nb_val_samples//batch_size, batch_size=batch_size)

In [306]:
r = next(traingen)
r[0].shape, r[1].shape, len(r)

((64, 50), (64, 25, 16182), 2)

## Train the model for a given number of epochs (iterations over the dataset)

In [307]:
for iteration in range(500):
    print('Iteration', iteration)
    h = model.fit_generator(traingen, steps_per_epoch=nb_train_samples//batch_size,
                            epochs=1, validation_data=valgen,
                            # validation_steps counts batches, not samples
                            validation_steps=nb_val_samples//batch_size)
    for k,v in h.history.items():
        history[k] = history.get(k,[]) + v
    with open('data/%s.history.pkl'%FN,'wb') as fp:
        pickle.dump(history,fp,-1)
    model.save_weights('data/%s.hdf5'%FN, overwrite=True)
    gensamples(batch_size=batch_size)

Iteration 0
Epoch 1/1
HEAD: Russia Moves to Soften Domestic Violence Law The New York Times
DESC: russian lawmaker on wednesday moved to decriminalize some form of domestic battery for offender who do not do serious physical harm to their victim members
HEADS:
242.09794521331787 containing schreiber country individual sheet immersive 1982 innuendo surreal essay Italian Before revealed Fractious navigate clintons Guardians troubled pooling vacation buzzing Companies disarray facade brewer
242.119647026062 enrich instantly Close so Letter acid McConaughey tearful Down 104 fully wis 30 sympathy Helps Owners adapted venetian presidency experiencing provides smooth Assailant compaj Medallion
Iteration 1
Epoch 1/1
HEAD: Love Interrupted A Travel Ban Separates Couples The New York Times
DESC: it took 11 day of calling lawyer beseeching immigration official and trying to book plane ticket but on tuesday osman nasreldin got the love of
HEADS:
210.6527214050293 Crash First Rockettes California Y

KeyboardInterrupt: 
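
Because the loop writes the history and the weights after every iteration, interrupting it loses at most the iteration in progress. The history bookkeeping can be sketched separately; `merge_history` is a hypothetical helper, the loss values are made-up placeholders standing in for `h.history`, and the file is written to a temporary directory rather than `data/`.

```python
import os
import pickle
import tempfile

def merge_history(history, epoch_history):
    """Append the metric lists from one fit call onto the running history."""
    for key, values in epoch_history.items():
        history.setdefault(key, []).extend(values)
    return history

history = {}
# made-up placeholder values, one dict per call to fit_generator
merge_history(history, {'loss': [7.1], 'val_loss': [7.3]})
merge_history(history, {'loss': [6.4], 'val_loss': [6.9]})

path = os.path.join(tempfile.mkdtemp(), 'train.history.pkl')
with open(path, 'wb') as fp:
    pickle.dump(history, fp, -1)      # -1 selects the highest pickle protocol
with open(path, 'rb') as fp:
    restored = pickle.load(fp)
print(restored['loss'])  # [7.1, 6.4]
```

Loading the pickle at the start of a later run and passing it as the initial `history` would let the curves continue across restarts.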