# Model training, application and evaluation with Jack

This notebook explains how Jack can be used programmatically to set up, train and evaluate readers. It serves as an example to familiarize yourself with the framework, so both models are trained on a very small subset of their datasets to keep training fast. In practice, we use scripts (`bin/jack-train.py`, `bin/jack-eval.py`) and configs (`conf/`) to train and evaluate models. See the Jack documentation for further details.


## Prerequisites

**Note:** the following command needs to be run in a terminal from the root of Jack.

Download GloVe [[1]](#ref1):
> `sh data/GloVe/download_small.sh`

In [1]:
%load_ext autoreload
%autoreload 2
import os
os.chdir('..')    # change dir to Jack root

In [2]:
from jack import readers
from jack.core import SharedResources
from jack.eval import evaluate_reader, pretty_print_results
from jack.io.embeddings.embeddings import load_embeddings
from jack.io.load import load_jack
from jack.util.hooks import LossHook
from jack.util.vocab import Vocab
from notebooks.prettyprint import QAPrettyPrint, print_nli
import tensorflow as tf

Let's check all the currently available readers from `readers.py`:

These are the **extractive question answering readers**:

In [3]:
[r for r in readers.extractive_qa_readers.keys()]

['fastqa_reader', 'modular_qa_reader', 'fastqa_reader_torch']

...the **natural language inference readers**:

In [4]:
[r for r in readers.nli_readers.keys()]

['dam_snli_reader', 'cbilstm_nli_reader', 'modular_nli_reader']

and the **link prediction readers**:

In [5]:
[r for r in readers.link_prediction_readers.keys()]

['distmult_reader', 'complex_reader', 'transe_reader']

## Shared resources

To train a reader, we need to define a vocabulary. Our readers also need word embeddings; we will use the downloaded GloVe [[1]](#ref1) embeddings. Both the vocabulary and the embeddings are shared between the two readers presented in this notebook.

In [6]:
glove_path = 'data/GloVe/glove.6B.50d.txt'
embeddings = load_embeddings(glove_path, type='glove')
vocab = Vocab(emb=embeddings, init_from_embeddings=True)
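
For orientation, GloVe's text format stores one token per line, followed by its vector components separated by spaces. The following is a minimal sketch of parsing that format with the standard library only. It is not Jack's `load_embeddings` implementation; `parse_glove_lines` and the two sample vectors are made up for illustration.

```python
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines ("token v1 v2 ... vd") into a dict.

    Hypothetical helper for illustration; not part of Jack.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        token, values = parts[0], parts[1:]
        vectors[token] = [float(v) for v in values]
    return vectors


# Two made-up 3-dimensional vectors standing in for the real 50-dimensional file.
sample = ["the 0.1 -0.2 0.3", "cat 0.5 0.0 -1.0"]
toy_embeddings = parse_glove_lines(sample)
dim = len(next(iter(toy_embeddings.values())))
print(len(toy_embeddings), dim)
```

The vector length recovered here plays the same role as `embeddings.lookup.shape[1]`, which the reader configurations below use as the input dimensionality.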

## FastQA model (SQuAD dataset)

We will be training a FastQA [[2]](#ref2) model on a very small subset of the SQuAD dataset [[3]](#ref3).

Additionally, please note a simpler way to train FastQA from the command line:

> `python3 bin/jack-train.py with config='./conf/qa/squad/fastqa.yaml'`


### Data loading

Load the training data:

In [7]:
squad_path = 'data/SQuAD/snippet.jtr.json'
fastqa_train_data = load_jack(squad_path)

### Creating the reader

We need to define the hyperparameter values (representation dimensionality, input representation dimensionality, etc.) and general configuration parameters (maximum span size, etc.) for the FastQA reader:

In [8]:
fastqa_config = {
    "reader": "fastqa_reader",
    "repr_dim": 10,
    "repr_dim_input": embeddings.lookup.shape[1],
    "max_span_size": 10
}
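
To give an intuition for `max_span_size`: an extractive reader typically scores each token as a possible answer start and end, then picks the best-scoring span that is no longer than the limit. The sketch below illustrates that idea in plain Python. `best_span` and the scores are hypothetical, and Jack's actual decoding may differ.

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_span_size: int) -> Tuple[int, int, float]:
    """Return (start, end, score) of the best span with at most max_span_size tokens.

    Hypothetical helper; the span score is simply start score + end score.
    """
    best = (0, 0, float("-inf"))
    for start, s_score in enumerate(start_scores):
        # end is inclusive, so a span covers end - start + 1 tokens
        last_end = min(len(end_scores), start + max_span_size)
        for end in range(start, last_end):
            score = s_score + end_scores[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Made-up scores for a six-token passage.
start_scores = [0.1, 2.0, 0.3, 0.0, 0.2, 0.1]
end_scores = [0.0, 0.1, 0.4, 0.2, 3.0, 0.1]

print(best_span(start_scores, end_scores, max_span_size=10))  # limit is not binding
print(best_span(start_scores, end_scores, max_span_size=2))   # forces a short span
```

With the larger limit, the best start and end tokens can be combined freely; with the smaller one, the search is restricted to spans of one or two tokens.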

Then we create an example reader, based on the (previously defined) vocabulary and the reader configuration:

In [9]:
shared_resources = SharedResources(vocab, fastqa_config)
fastqa_reader = readers.fastqa_reader(shared_resources)
# equivalent: readers.readers[fastqa_config["reader"]](shared_resources)
# equivalent: readers.readers["fastqa_reader"](shared_resources)
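
The two "equivalent" comments rely on a name-to-constructor registry: `readers.readers` maps reader names to the functions that build them. A minimal, self-contained sketch of that pattern follows. `ToyReader`, `toy_registry` and `register` are hypothetical stand-ins, not Jack's code.

```python
from typing import Callable, Dict


class ToyReader:
    """Hypothetical stand-in for a reader; it only stores its configuration."""

    def __init__(self, name: str, config: dict):
        self.name = name
        self.config = config


# Registry mapping a reader name to the function that constructs it.
toy_registry: Dict[str, Callable[[dict], ToyReader]] = {}


def register(fn: Callable[[dict], ToyReader]) -> Callable[[dict], ToyReader]:
    """Decorator that records a constructor under its function name."""
    toy_registry[fn.__name__] = fn
    return fn


@register
def toy_qa_reader(config: dict) -> ToyReader:
    return ToyReader("toy_qa_reader", config)


config = {"reader": "toy_qa_reader", "repr_dim": 10}

# Calling the function directly and looking it up by name build the same kind of reader.
direct = toy_qa_reader(config)
by_name = toy_registry[config["reader"]](config)
print(direct.name == by_name.name)
```

Looking the constructor up by name is what lets a configuration file select the reader without any code changes.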

Afterwards, we set up modules (input, model, output) given a training dataset. `is_training` set to `True` indicates we are in the training phase. After this call, all the parameters of the model will be initialised.

In [10]:
fastqa_reader.setup_from_data(fastqa_train_data, is_training=True)

### Applying the untrained reader

Our model is initialised, but has not been trained yet. We can see that from the predictions it makes:

In [11]:
num_questions = 5
questions = [q for q, a in fastqa_train_data]
for q, a in zip(questions[:num_questions], fastqa_reader(questions)):
    print("Question: " + q.question)
    print("Answer:   %s \t (score: %.3f)\n" % (a[0].text, a[0].score))

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:   At the end of the main drive ( 	 (score: 0.001)

Question: What is in front of the Notre Dame Main Building?
Answer:   At the end of the main drive ( 	 (score: 0.001)

Question: The Basilica of the Sacred heart at Notre Dame is beside to which structure?
Answer:   At the end of the main drive ( 	 (score: 0.001)

Question: What is the Grotto at Notre Dame?
Answer:   gold dome is a golden statue of the Virgin Mary 	 (score: 0.001)

Question: What sits on top of the Main Building at Notre Dame?
Answer:   gold dome is a golden statue of the Virgin Mary 	 (score: 0.001)



Of course the output is not correct because the model was not trained at all.

### Training

First, we set up everything necessary for training. Because the dataset is very small, we set `batch_size` to the size of the dataset. We define a hook that reports the training loss after every iteration, and choose the optimiser (Adam).

In [12]:
batch_size = len(fastqa_train_data)
hooks = [LossHook(fastqa_reader, iter_interval=1)]
optimizer = tf.train.AdamOptimizer(0.11)

...and we start the training procedure:

In [13]:
fastqa_reader.train(optimizer,
                    batch_size=batch_size,
                    hooks=hooks,
                    max_epochs=20,
                    training_set=fastqa_train_data)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


INFO:jack.core.reader:Number of parameters: 6341
INFO:jack.core.reader:Start training...
INFO:jack.util.hooks:Epoch 1	Iter 1	train loss 10.044792175292969
INFO:jack.util.hooks:Epoch 2	Iter 2	train loss 9.878975868225098
INFO:jack.util.hooks:Epoch 3	Iter 3	train loss 9.1016206741333
INFO:jack.util.hooks:Epoch 4	Iter 4	train loss 7.367959499359131
INFO:jack.util.hooks:Epoch 5	Iter 5	train loss 6.420207977294922
INFO:jack.util.hooks:Epoch 6	Iter 6	train loss 6.436832427978516
INFO:jack.util.hooks:Epoch 7	Iter 7	train loss 4.776064872741699
INFO:jack.util.hooks:Epoch 8	Iter 8	train loss 4.5866827964782715
INFO:jack.util.hooks:Epoch 9	Iter 9	train loss 3.6831815242767334
INFO:jack.util.hooks:Epoch 10	Iter 10	train loss 3.3338069915771484
INFO:jack.util.hooks:Epoch 11	Iter 11	train loss 2.9347217082977295
INFO:jack.util.hooks:Epoch 12	Iter 12	train loss 2.5186946392059326
INFO:jack.util.hooks:Epoch 13	Iter 13	train loss 2.525355339050293
INFO:jack.util.hooks:Epoch 14	Iter 14	train loss 2.360
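
Because `batch_size` equals the dataset size, each epoch consists of a single batch, which is why the epoch and iteration counters in the log advance together. The arithmetic can be sketched as follows; `iterations_per_epoch` is a hypothetical helper and the dataset size of 100 is made up.

```python
import math


def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once (last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)


num_examples = 100  # made-up dataset size for illustration

# Full-batch training: one iteration per epoch, as in the log above.
print(iterations_per_epoch(num_examples, batch_size=num_examples))
# Mini-batches of 32: several iterations per epoch.
print(iterations_per_epoch(num_examples, batch_size=32))
```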

### Predictions from the trained reader

Let's take a look at the predictions after 20 epochs of training:

In [14]:
num_questions = 5
predictions = fastqa_reader(questions)
for q, a in zip(questions[:num_questions], predictions):
    print("Question: " + q.question)
    print("Answer:   %s \t (score: %.3f)\n" % (a[0].text, a[0].score))

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:   Saint Bernadette Soubirous 	 (score: 0.968)

Question: What is in front of the Notre Dame Main Building?
Answer:   a copper statue of Christ 	 (score: 0.906)

Question: The Basilica of the Sacred heart at Notre Dame is beside to which structure?
Answer:   the Main Building 	 (score: 0.675)

Question: What is the Grotto at Notre Dame?
Answer:   a Marian place of prayer and reflection 	 (score: 0.710)

Question: What sits on top of the Main Building at Notre Dame?
Answer:   a golden statue of the Virgin Mary 	 (score: 0.865)



And let's take a look at one of the answers in the context of the paragraph:

In [15]:
questions[20].question

'What entity provides help with the management of time for new students at Notre Dame?'

In [16]:
QAPrettyPrint(questions[20].support[0], predictions[20][0].span)

The predicted answers look much better now. However, be aware that this is the prediction of a model trained on a very small subset of data, applied to that same data. Feel free to train your model on the full SQuAD dataset.

### Evaluation

Let's evaluate our trained model:

In [17]:
batch_size = len(fastqa_train_data)
metrics = evaluate_reader(fastqa_reader, fastqa_train_data, batch_size)
pretty_print_results(metrics)

INFO:jack.core.input_module:OnlineInputModule pre-processes data on-the-fly in first epoch and caches results for subsequent epochs! That means, first epoch might be slower.
INFO:jack.core.reader:Start answering...


 [Elapsed Time: 0:00:00] |###################################| (Time: 0:00:00) 


Exact: 1.0
F1: 1.0
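
`Exact` and `F1` follow the usual definitions for extractive question answering: exact match checks whether the predicted string equals the gold answer, and F1 measures the token overlap between them. Below is a simplified sketch of both metrics, assuming lowercased whitespace tokenisation only (the official SQuAD script additionally strips punctuation and articles). `exact_match` and `token_f1` are hypothetical names rather than Jack functions, and the trained reader's prediction from above stands in for the gold answer.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if both strings have the same lowercased tokens in the same order, else 0.0."""
    return float(prediction.lower().split() == gold.lower().split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "a golden statue of the Virgin Mary"
print(exact_match("a golden statue of the Virgin Mary", gold))
print(token_f1("gold dome is a golden statue of the Virgin Mary", gold))
```

The second call uses the untrained reader's answer: it contains every gold token (full recall) but also three extra ones, so its F1 stays below 1 and its exact match would be 0.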


### Saving the model

We can now save the model after training it:

In [18]:
fastqa_reader.store("/tmp/fastqa_reader")

## Decomposable attention model (SNLI dataset)

We will train a decomposable attention model [[4]](#ref4) on a very small subset of the Stanford NLI dataset [[5]](#ref5).

Additionally, please note a simpler way to train the DAM from the command line:

> `python3 bin/jack-train.py with config='./conf/nli/snli/dam.yaml'`

### Data loading

We load the data and prepare it for later printing:

In [19]:
snli_path = 'data/SNLI/snippet.jtr_v1.json'
snli_train_data = load_jack(snli_path)

hypotheses = []
premises = []
labels = []
for input_, output_ in snli_train_data:
    premises.append(input_.support[0])
    hypotheses.append(input_.question)
    labels.append(output_[0].text)

We reset the TensorFlow graph to clear out the previously built model:

In [20]:
tf.reset_default_graph()

### Creating the reader

As before, we set up the configuration for the model:

In [21]:
snli_config = {"repr_dim": 10,
               "repr_dim_input": embeddings.lookup.shape[1],
               "reader": "dam_snli_reader"}

...create the shared resources:

In [22]:
shared_resources = SharedResources(vocab, snli_config)

...build the reader, and set it up with the dataset:

In [23]:
snli_reader = readers.readers["dam_snli_reader"](shared_resources)
snli_reader.setup_from_data(snli_train_data, is_training=True)

INFO:jack.readers.natural_language_inference.decomposable_attention:Building the Attend graph ..
INFO:jack.readers.natural_language_inference.decomposable_attention:Building the Compare graph ..
INFO:jack.readers.natural_language_inference.decomposable_attention:Building the Aggregate graph ..
DEBUG:jack.core.reader:Variable: <tf.Variable 'jtreader/embeddings:0' shape=(400001, 50) dtype=float32_ref> (Trainable: True)
DEBUG:jack.core.reader:Variable: <tf.Variable 'jtreader/bos_token_embedding:0' shape=(1, 1, 50) dtype=float32_ref> (Trainable: False)
DEBUG:jack.core.reader:Variable: <tf.Variable 'jtreader/transform_embeddings/fully_connected/weights:0' shape=(50, 10) dtype=float32_ref> (Trainable: True)
DEBUG:jack.core.reader:Variable: <tf.Variable 'jtreader/attend/transform_attend/fully_connected/weights:0' shape=(10, 10) dtype=float32_ref> (Trainable: True)
DEBUG:jack.core.reader:Variable: <tf.Variable 'jtreader/attend/transform_attend/fully_connected/biases:0' shape=(10,) dtype=float3

### Training

We set up the training procedure, similarly to the FastQA model:

In [24]:
batch_size = len(snli_train_data)
hooks = [LossHook(snli_reader, iter_interval=1)]
optimizer = tf.train.AdamOptimizer(0.05)

...and run the training:

In [25]:
snli_reader.train(optimizer,
                  batch_size=batch_size,
                  hooks=hooks,
                  max_epochs=20,
                  training_set=snli_train_data)

INFO:jack.core.reader:Preparing training data...
INFO:jack.core.input_module:OnlineInputModule pre-processes data on-the-fly in first epoch and caches results for subsequent epochs! That means, first epoch might be slower.
INFO:jack.core.reader:Number of parameters: 20001443
INFO:jack.core.reader:Start training...
INFO:jack.util.hooks:Epoch 1	Iter 1	train loss 1.0986123085021973
INFO:jack.util.hooks:Epoch 2	Iter 2	train loss 1.0932137966156006
INFO:jack.util.hooks:Epoch 3	Iter 3	train loss 1.0894496440887451
INFO:jack.util.hooks:Epoch 4	Iter 4	train loss 1.0861561298370361
INFO:jack.util.hooks:Epoch 5	Iter 5	train loss 1.095503568649292
INFO:jack.util.hooks:Epoch 6	Iter 6	train loss 1.0809195041656494
INFO:jack.util.hooks:Epoch 7	Iter 7	train loss 1.0815532207489014
INFO:jack.util.hooks:Epoch 8	Iter 8	train loss 1.0724213123321533
INFO:jack.util.hooks:Epoch 9	Iter 9	train loss 1.0121815204620361
INFO:jack.util.hooks:Epoch 10	Iter 10	train loss 0.8881077766418457
INFO:jack.util.hooks:Ep
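
The initial loss of about 1.0986 is what a three-class classifier yields when it assigns equal probability to entailment, contradiction and neutral: the cross-entropy of a uniform prediction is ln 3. A quick check, assuming the logged loss is the mean cross-entropy over the batch:

```python
import math

num_classes = 3  # entailment, contradiction, neutral

# Cross-entropy of the gold class under a uniform prediction: -log(1/3) = log(3).
uniform_loss = -math.log(1.0 / num_classes)
print(uniform_loss)

# First logged training loss from the run above.
first_logged_loss = 1.0986123085021973
print(abs(uniform_loss - first_logged_loss) < 1e-6)
```

A loss that stays near this value, as in the first epochs of the log, indicates the model has not yet learned to separate the classes.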

### Predictions from the trained reader

In [26]:
input_ = [qa_setting for qa_setting, answers in snli_train_data]
output_ = snli_reader(input_)

In [27]:
num_examples = 5
for p, h, l, o in zip(premises[:num_examples], hypotheses, labels, output_):
    print('Premise: {}'.format(p))
    print('Hypothesis: {}'.format(h))
    print('Prediction: {} (score: {:.2f})  [Label: {}]\n'.format(o[0].text, o[0].score, l))

Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is training his horse for a competition.
Prediction: neutral (score: 1.00)  [Label: neutral]

Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is at a diner, ordering an omelette.
Prediction: contradiction (score: 0.46)  [Label: contradiction]

Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is outdoors, on a horse.
Prediction: contradiction (score: 0.46)  [Label: entailment]

Premise: Children smiling and waving at camera
Hypothesis: They are smiling at their parents
Prediction: neutral (score: 1.00)  [Label: neutral]

Premise: Children smiling and waving at camera
Hypothesis: There are children present
Prediction: contradiction (score: 0.46)  [Label: entailment]



### Evaluation

Let's evaluate our trained model!

In [28]:
batch_size = len(snli_train_data)
metrics = evaluate_reader(snli_reader, snli_train_data, batch_size)
pretty_print_results(metrics)

INFO:jack.core.input_module:OnlineInputModule pre-processes data on-the-fly in first epoch and caches results for subsequent epochs! That means, first epoch might be slower.
INFO:jack.core.reader:Start answering...


 [Elapsed Time: 0:00:00] |###################################| (Time: 0:00:00) 


Accuracy: 0.7
Confusion Matrix:

	             	contradiction	entailment   	neutral      
	contradiction	3            	0            	0            
	entailment   	3            	0            	0            
	neutral      	0            	0            	4            
	
F1:
	contradiction: 0.6666666666666666
	entailment: 0.0
	neutral: 1.0
Precision:
	contradiction: 0.5
	entailment: 0.0
	neutral: 1.0
Recall:
	contradiction: 1.0
	entailment: 0.0
	neutral: 1.0
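
The per-class scores can be recomputed from the confusion matrix. The sketch below assumes rows are gold labels and columns are predictions (the reading consistent with the reported precision and recall), and treats an undefined ratio as 0.0. `per_class_metrics` is a hypothetical helper, not part of `jack.eval`.

```python
labels = ["contradiction", "entailment", "neutral"]

# Confusion matrix from the output above; assumed layout: matrix[gold][predicted].
matrix = [
    [3, 0, 0],  # gold contradiction
    [3, 0, 0],  # gold entailment
    [0, 0, 4],  # gold neutral
]


def safe_div(numerator: float, denominator: float) -> float:
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_class_metrics(matrix, index):
    """Return (precision, recall, f1) for the class at the given index."""
    true_pos = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column sum
    gold = sum(matrix[index])                      # row sum
    precision = safe_div(true_pos, predicted)
    recall = safe_div(true_pos, gold)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
print("Accuracy:", accuracy)
for i, label in enumerate(labels):
    print(label, per_class_metrics(matrix, i))
```

Read this way, the matrix shows that the model never predicts entailment: all three entailment examples are labelled contradiction, which halves the precision of that class.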


## References:

<a id='ref1'>[1]</a> Jeffrey Pennington, Richard Socher, and Christopher Manning. <a href='http://www.aclweb.org/anthology/D14-1162'>"Glove: Global vectors for word representation."</a> Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

<a id='ref2'>[2]</a> Dirk Weissenborn, Georg Wiese, and Laura Seiffe. <a href='http://www.aclweb.org/anthology/K17-1028'>"Making neural qa as simple as possible but not simpler."</a> Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL). 2017.

<a id='ref3'>[3]</a> Pranav Rajpurkar, et al. <a href='http://www.anthology.aclweb.org/D/D16/D16-1264.pdf'>"SQuAD: 100,000+ Questions for Machine Comprehension of Text."</a> Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2016.

<a id='ref4'>[4]</a> Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. <a href='http://www.aclweb.org/anthology/D16-1244'>"A Decomposable Attention Model for Natural Language Inference."</a> Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2016.

<a id='ref5'>[5]</a> Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. <a href='http://www.aclweb.org/anthology/D15-1075'>"A large annotated corpus for learning natural language inference."</a> Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2015.