# Model training with Jack

## Prerequisites

Note: the following command needs to be run in a terminal from the root of Jack.

Download GloVe:
> `data/GloVe/download_small.sh`

In [1]:
%load_ext autoreload
%autoreload 2
import os
os.chdir('..')    # change dir to Jack root

In [2]:
from jack import readers
from jack.core import SharedResources
from jack.io.embeddings.embeddings import load_embeddings
from jack.io.load import load_jack
from jack.util.hooks import LossHook, ExamplesPerSecHook
from jack.util.vocab import Vocab
from notebooks.prettyprint import QAPrettyPrint
import tensorflow as tf

Let's check all the currently available readers from `readers.py`:

In [3]:
for reader_ in readers.readers.keys():
    print(reader_)

fastqa_reader
modular_qa_reader
fastqa_reader_torch
dam_snli_reader
cbilstm_nli_reader
modular_nli_reader
distmult_reader
complex_reader
transe_reader
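
A dictionary of readers keyed by name is a common registry pattern: each name maps to a factory function that builds the corresponding reader. The sketch below illustrates the idea only. `reader_registry`, `register_reader`, and the two toy factories are hypothetical names made up for this example and do not reflect how Jack's `readers.py` is implemented.

```python
from typing import Callable, Dict

# Hypothetical registry mapping reader names to factory functions.
reader_registry: Dict[str, Callable[[dict], str]] = {}


def register_reader(factory):
    """Decorator that records `factory` in the registry under its function name."""
    reader_registry[factory.__name__] = factory
    return factory


@register_reader
def toy_qa_reader(config):
    # Stand-in for building a real reader object from a configuration.
    return "toy QA reader with repr_dim=%d" % config["repr_dim"]


@register_reader
def toy_nli_reader(config):
    return "toy NLI reader with repr_dim=%d" % config["repr_dim"]


# List the registered names, as the notebook cell does for Jack's readers.
for name in reader_registry:
    print(name)

# Look a factory up by name and call it with a configuration.
reader = reader_registry["toy_nli_reader"]({"repr_dim": 10})
print(reader)
```

With such a registry, a reader can be selected from a string (for example one read from a configuration file) instead of being hard-coded.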


## Shared resources

To train a reader, we need to define a vocabulary. Our readers also need word embeddings, for which we'll use the downloaded GloVe [[1]](#ref1) embeddings. Both the vocabulary and the embeddings are shared between the two readers presented in this notebook.

In [4]:
glove_path = 'data/GloVe/glove.6B.50d.txt'
embeddings = load_embeddings(glove_path,
                             type='glove')
vocab = Vocab(emb=embeddings,
              init_from_embeddings=True)
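
For orientation, the GloVe text format stores one token per line, followed by its vector components separated by spaces. The sketch below parses such lines into a word-to-index mapping and a list of vectors. It only illustrates the file format and is not Jack's `load_embeddings` implementation; `parse_glove_lines` and the two sample lines are made up for the example.

```python
from typing import Dict, Iterable, List, Tuple


def parse_glove_lines(lines: Iterable[str]) -> Tuple[Dict[str, int], List[List[float]]]:
    """Parse GloVe-style text lines into (word -> row index, list of vectors)."""
    word_to_index: Dict[str, int] = {}
    vectors: List[List[float]] = []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, values = parts[0], parts[1:]
        word_to_index[word] = len(vectors)
        vectors.append([float(v) for v in values])
    return word_to_index, vectors


# Two made-up 3-dimensional entries (vectors in glove.6B.50d.txt have 50 values).
sample = ["the 0.1 0.2 0.3", "cat -0.5 0.0 0.25"]
word_to_index, vectors = parse_glove_lines(sample)
print(word_to_index["cat"], len(vectors[0]))
```

The length of each vector is the embedding dimensionality, which is what the notebook later reads from `embeddings.lookup.shape[1]`.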

## FastQA (SQuAD)

We will train a FastQA [[2]](#ref2) model on a very small subset of the SQuAD dataset [[3]](#ref3), because training is slow. If you want to train your models on large datasets (like the full SQuAD dataset), we recommend training them on GPUs.

### Data loading

Load the training data:

In [5]:
squad_path = 'data/SQuAD/snippet.jtr.json'
fastqa_train_data = load_jack(squad_path)

### Creating the reader

We need to define the hyperparameter values (representation dimensionality, input representation dimensionality, etc.) and general configuration parameters (maximum span size, etc.) for the FastQA reader:

In [6]:
fastqa_config = {"repr_dim": 10,
                 "repr_dim_input": embeddings.lookup.shape[1],
                 "max_span_size": 10}

Then we create an example reader, based on the (previously defined) vocabulary and the reader configuration:

In [7]:
fastqa_svac = SharedResources(vocab, fastqa_config)
fastqa_reader = readers.fastqa_reader(fastqa_svac)
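
The `max_span_size` entry in the configuration above presumably bounds the number of tokens an extracted answer span may contain. The sketch below shows what such a bound means for the set of candidate spans. It is an illustration under that assumption, not FastQA's actual span scoring; `candidate_spans` is a hypothetical helper.

```python
def candidate_spans(num_tokens, max_span_size):
    """Return all (start, end) token spans, end inclusive, of at most max_span_size tokens."""
    spans = []
    for start in range(num_tokens):
        # The exclusive upper bound keeps each span within max_span_size tokens.
        last_end = min(start + max_span_size, num_tokens)
        for end in range(start, last_end):
            spans.append((start, end))
    return spans


tokens = "a golden statue of the Virgin Mary".split()
spans = candidate_spans(len(tokens), max_span_size=3)
print(len(spans))
print([" ".join(tokens[s:e + 1]) for s, e in spans[:3]])
```

A smaller bound shrinks the search space, but it also rules out answers longer than the bound.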

Afterwards, we set up modules (input, model, output) given a training dataset. `is_training` set to `True` indicates we are in the training phase. After this call, all the parameters of the model will be initialised.

In [8]:
fastqa_reader.setup_from_data(fastqa_train_data, is_training=True)

### Applying the untrained reader

Our model is initialised, but has not been trained yet. We can see that from the predictions it makes:

In [9]:
questions = [q for q, a in fastqa_train_data]
for q, a in zip(questions[:5], fastqa_reader(questions)[:5]):
    print("Question: " + q.question)
    print("Answer:   %s \t %.3f" % (a[0].text, a[0].score))
    print()

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:   is the Grotto, a Marian 	 0.602

Question: What is in front of the Notre Dame Main Building?
Answer:   is the Grotto, a Marian 	 0.636

Question: The Basilica of the Sacred heart at Notre Dame is beside to which structure?
Answer:   is 	 0.691

Question: What is the Grotto at Notre Dame?
Answer:   is the Grotto 	 0.657

Question: What sits on top of the Main Building at Notre Dame?
Answer:   is 	 0.742



Of course, the output is not correct because the model has not been trained at all.

### Training

First, we set up everything necessary for training. In this case we set the `batch_size` to the size of the dataset, as we're working on a very small dataset. We define hooks which will print out useful information during training (loss and speed) and define the optimiser used (Adam).

In [10]:
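# Full-scale training normally uses the bin/jack-train.py script; here we set things up by hand.
batch_size = len(fastqa_train_data)
# Hooks report the training loss and the speed (examples per second) after every iteration.
hooks = [LossHook(fastqa_reader, iter_interval=1),
         ExamplesPerSecHook(fastqa_reader, batch_size, iter_interval=1)]
# Adam optimiser; the argument is the learning rate.
optimizer = tf.train.AdamOptimizer(0.11)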

...and we start the training procedure:

In [11]:
fastqa_reader.train(optimizer,
                    batch_size=batch_size,
                    hooks=hooks,
                    max_epochs=20,
                    training_set=fastqa_train_data)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


INFO:jack.core.reader:Number of parameters: 6341
INFO:jack.core.reader:Start training...
INFO:jack.util.hooks:Epoch 1	Iter 1	train loss 10.049612998962402
INFO:jack.util.hooks:Epoch 2	Iter 2	train loss 10.117327690124512
INFO:jack.util.hooks:Epoch 3	Iter 3	train loss 9.749606132507324
INFO:jack.util.hooks:Epoch 4	Iter 4	train loss 8.416142463684082
INFO:jack.util.hooks:Epoch 5	Iter 5	train loss 7.313817977905273
INFO:jack.util.hooks:Epoch 6	Iter 6	train loss 6.559922695159912
INFO:jack.util.hooks:Epoch 7	Iter 7	train loss 5.957869052886963
INFO:jack.util.hooks:Epoch 8	Iter 8	train loss 5.390750408172607
INFO:jack.util.hooks:Epoch 9	Iter 9	train loss 4.717708587646484
INFO:jack.util.hooks:Epoch 10	Iter 10	train loss 4.113035202026367
INFO:jack.util.hooks:Epoch 11	Iter 11	train loss 3.6502954959869385
INFO:jack.util.hooks:Epoch 12	Iter 12	train loss 2.89532208442688
INFO:jack.util.hooks:Epoch 13	Iter 13	train loss 2.441541910171509
INFO:jack.util.hooks:Epoch 14	Iter 14	train loss 1.99152
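
Because `batch_size` equals the dataset size, each epoch consists of a single iteration, which is why the epoch and iteration counters in the log advance together. The sketch below shows the general idea of a reporting hook with an `iter_interval`. It is a simplified stand-in and not Jack's `LossHook`: the class `PrintLossHook`, its method `at_iteration_end`, the loop `run_training`, and the loss values are all hypothetical.

```python
class PrintLossHook:
    """Hypothetical hook: reports the loss every `iter_interval` iterations."""

    def __init__(self, iter_interval=1):
        self.iter_interval = iter_interval
        self.iteration = 0
        self.messages = []

    def at_iteration_end(self, epoch, loss):
        self.iteration += 1
        if self.iteration % self.iter_interval == 0:
            message = "Epoch %d\tIter %d\ttrain loss %.3f" % (epoch, self.iteration, loss)
            self.messages.append(message)
            print(message)


def run_training(losses, hooks):
    """Stand-in training loop: one iteration per epoch, as with a full-dataset batch."""
    for epoch, loss in enumerate(losses, start=1):
        for hook in hooks:
            hook.at_iteration_end(epoch, loss)


# Made-up loss values; with iter_interval=2 only every second iteration is reported.
hook = PrintLossHook(iter_interval=2)
run_training([10.0, 9.0, 8.0, 7.0], [hook])
```

Keeping reporting in hooks separates logging from the training loop itself, so different reports can be combined without changing the loop.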

### Predictions from the trained reader

Let's take a look at the predictions after 20 epochs of training:

In [12]:
predictions = fastqa_reader(questions)
for q, a in zip(questions[:5], predictions[:5]):
    print("Question: " + q.question)
    print("Answer:   %s \t (score: %.3f)\n" % (a[0].text, a[0].score))

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:   Saint Bernadette Soubirous 	 (score: 22.813)

Question: What is in front of the Notre Dame Main Building?
Answer:   a copper statue of Christ 	 (score: 21.131)

Question: The Basilica of the Sacred heart at Notre Dame is beside to which structure?
Answer:   the Main Building 	 (score: 17.104)

Question: What is the Grotto at Notre Dame?
Answer:   a Marian place of prayer and reflection 	 (score: 17.628)

Question: What sits on top of the Main Building at Notre Dame?
Answer:   a golden statue of the Virgin Mary 	 (score: 19.150)



And let's take a look at one of the answers in the context of the paragraph:

In [13]:
questions[20].question

'What entity provides help with the management of time for new students at Notre Dame?'

In [14]:
QAPrettyPrint(questions[20].support[0], predictions[20][0].span)

The predicted answers look much better now. However, be aware that this is the prediction of a model trained on a very small subset of data, applied to that same data. Feel free to train your model on the full SQuAD dataset.
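
To get a more honest estimate of quality, part of the data can be held out and used only for evaluation. The sketch below splits a list of examples into training and development sets. `split_train_dev` is a hypothetical helper and `toy_data` is placeholder data, not part of Jack.

```python
import random


def split_train_dev(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and split off a held-out development set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder (question, answers) pairs standing in for the loaded dataset.
toy_data = [("question %d" % i, ["answer %d" % i]) for i in range(10)]
train_set, dev_set = split_train_dev(toy_data)
print(len(train_set), len(dev_set))
```

In the notebook, `fastqa_train_data` could take the place of `toy_data`, assuming it behaves like a list of (question, answers) pairs, as the loop `for q, a in fastqa_train_data` suggests.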

### Saving the model

We can now save the model after training it:

In [15]:
fastqa_reader.store("/tmp/fastqa_reader")

## Decomposable attention model (SNLI)

### Data loading

We load the data and prepare it for later printing:

In [16]:
snli_path = 'data/SNLI/snippet.jtr_v1.json'
snli_train_data = load_jack(snli_path)

hypotheses = []
premises = []
labels = []
for input_, output_ in snli_train_data:
    premises.append(input_.support[0])
    hypotheses.append(input_.question)
    labels.append(output_[0].text)

We reset the TensorFlow graph to clear out the previously built model:

In [17]:
tf.reset_default_graph()

### Creating the reader

As before, we set up the configuration for the model:

In [18]:
snli_config = {"repr_dim": 10,
               "repr_dim_input": embeddings.lookup.shape[1],
               "model": "dam_snli_reader"}

...create the shared resources:

In [19]:
snli_svac = SharedResources(vocab, snli_config)

...build the reader, and set it up with the dataset:

In [20]:
snli_reader = readers.readers["dam_snli_reader"](snli_svac)
snli_reader.setup_from_data(snli_train_data, is_training=True)

INFO:jack.readers.natural_language_inference.decomposable_attention:Building the Attend graph ..
INFO:jack.readers.natural_language_inference.decomposable_attention:Building the Compare graph ..
INFO:jack.readers.natural_language_inference.decomposable_attention:Building the Aggregate graph ..


### Training

We set up the training procedure, similarly to the FastQA model:

In [21]:
batch_size = len(snli_train_data)
hooks = [LossHook(snli_reader, iter_interval=1), 
         ExamplesPerSecHook(snli_reader, batch_size, iter_interval=1)]
optimizer = tf.train.AdamOptimizer(0.05)

...and run the training:

In [22]:
snli_reader.train(optimizer,
                  batch_size=batch_size,
                  hooks=hooks,
                  max_epochs=20,
                  training_set=snli_train_data)

INFO:jack.core.reader:Preparing training data...
INFO:jack.core.input_module:OnlineInputModule pre-processes data on-the-fly in first epoch and caches results for subsequent epochs! That means, first epoch might be slower.
INFO:jack.core.reader:Number of parameters: 20001443
INFO:jack.core.reader:Start training...
INFO:jack.util.hooks:Epoch 1	Iter 1	train loss 1.0986123085021973
INFO:jack.util.hooks:Epoch 2	Iter 2	train loss 1.0928868055343628
INFO:jack.util.hooks:Epoch 3	Iter 3	train loss 1.0861289501190186
INFO:jack.util.hooks:Epoch 4	Iter 4	train loss 1.0980854034423828
INFO:jack.util.hooks:Epoch 5	Iter 5	train loss 1.0843183994293213
INFO:jack.util.hooks:Epoch 6	Iter 6	train loss 1.0872703790664673
INFO:jack.util.hooks:Epoch 7	Iter 7	train loss 1.089052677154541
INFO:jack.util.hooks:Epoch 8	Iter 8	train loss 1.0898879766464233
INFO:jack.util.hooks:Epoch 9	Iter 9	train loss 1.0901975631713867
INFO:jack.util.hooks:Epoch 10	Iter 10	train loss 1.0897631645202637
INFO:jack.util.hooks:Ep
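
The first loss value in the log matches what a three-way classifier yields when it assigns equal probability to every class. Assuming the reported loss is a mean cross-entropy computed with natural logarithms (the source does not state this), a uniform prediction over the three SNLI labels gives ln 3 ≈ 1.0986. A loss that stays close to this value suggests the model has not moved far from chance on this snippet. A quick check of the arithmetic:

```python
import math

num_classes = 3  # entailment, contradiction, neutral
uniform_probability = 1.0 / num_classes
# Cross-entropy of the true class under a uniform prediction: -ln(1/3) = ln(3).
chance_loss = -math.log(uniform_probability)
print(round(chance_loss, 4))  # 1.0986
```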

### Predictions from the trained reader

In [23]:
input_ = [qa_setting for qa_setting, answers in snli_train_data]
output_ = snli_reader(input_)

In [24]:
for p, h, l, o in zip(premises[:5], hypotheses[:5], labels[:5], output_[:5]):
    print('Premise: {}'.format(p))
    print('Hypothesis: {}'.format(h))
    print('Prediction: {} (score: {:.2f})  [Label: {}]\n'.format(o[0].text, o[0].score, l))

Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is training his horse for a competition.
Prediction: neutral (score: 0.26)  [Label: neutral]

Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is at a diner, ordering an omelette.
Prediction: contradiction (score: 0.29)  [Label: contradiction]

Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is outdoors, on a horse.
Prediction: contradiction (score: 0.29)  [Label: entailment]

Premise: Children smiling and waving at camera
Hypothesis: They are smiling at their parents
Prediction: neutral (score: 0.26)  [Label: neutral]

Premise: Children smiling and waving at camera
Hypothesis: There are children present
Prediction: contradiction (score: 0.29)  [Label: entailment]
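
Inspecting individual examples is useful, but a single number such as accuracy makes it easier to judge the reader. The sketch below computes accuracy for the five examples printed above. `accuracy` is a hypothetical helper and not part of Jack.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Labels copied from the five examples printed above.
predicted = ["neutral", "contradiction", "contradiction", "neutral", "contradiction"]
gold = ["neutral", "contradiction", "entailment", "neutral", "entailment"]
print(accuracy(predicted, gold))  # 0.6
```

With the notebook's variables, the same function could be applied to the whole snippet as `accuracy([o[0].text for o in output_], labels)`.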



## References:

<a id='ref1'>[1]</a> Pennington, Jeffrey, Richard Socher, and Christopher Manning. <a href='http://www.aclweb.org/anthology/D14-1162'>"GloVe: Global Vectors for Word Representation."</a> Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.

<a id='ref2'>[2]</a> Weissenborn, Dirk, Georg Wiese, and Laura Seiffe. <a href='http://www.aclweb.org/anthology/K17-1028'>"Making Neural QA as Simple as Possible but not Simpler."</a> Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). 2017.

<a id='ref3'>[3]</a> Rajpurkar, Pranav, et al. <a href='http://www.anthology.aclweb.org/D/D16/D16-1264.pdf'>"SQuAD: 100,000+ Questions for Machine Comprehension of Text."</a> Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016.