<a href="https://colab.research.google.com/github/shivammehta007/QuestionGenerator/blob/master/QGenerator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question Generation

Additional Dependencies

In [1]:
!pip install fairseq
!pip install sacremoses subword_nmt
!pip install -U tqdm

Requirement already up-to-date: tqdm in /usr/local/lib/python3.6/dist-packages (4.43.0)


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
import os
import json
import logging
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import random

In [0]:
# Fix the seed so results are reproducible
SEED=1234
random.seed(SEED)

In [0]:
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG)

## DataSet

#### Using Author's Dataset

In [0]:
!cp -r /content/drive/My\ Drive/Data/data/processed processed

In [0]:
!mv processed/src-train.txt train.paragraphs
!mv processed/src-test.txt test.paragraphs
!mv processed/src-dev.txt valid.paragraphs
!mv processed/tgt-train.txt train.questions
!mv processed/tgt-test.txt test.questions
!mv processed/tgt-dev.txt valid.questions

The author's dataset is already preprocessed, so I can skip the preprocessing below and go directly to generating the binarized data for training.

#### Using My Own Preprocessing

In [0]:
SQUAD_DIR = '/content/drive/My Drive/Colab Notebooks/SQuAD'
SQUAD_TRAIN = os.path.join(SQUAD_DIR, 'train_v2.json')
# SQUAD_DEV = os.path.join(SQUAD_DIR, 'dev.json')
SQUAD_TEST = os.path.join(SQUAD_DIR, 'test_v2.json')
print(SQUAD_TRAIN, SQUAD_TEST) # , SQUAD_DEV

In [0]:
with open(SQUAD_TRAIN) as train_file:
    train_data = json.load(train_file)
    train_data = train_data['data']

with open(SQUAD_TEST) as test_file:
    test_data = json.load(test_file)
    test_data = test_data['data']

### PreProcessing Function

In [0]:
def convert_to_file_without_answers(dataset, dataset_type='train', get_impossible=False):
    """
    Takes parsed SQuAD data and generates dataset_type.paragraphs and dataset_type.questions
    Input:
    dataset : list -> Articles from the 'data' field of the SQuAD json
    dataset_type: string -> Type of dataset (train, test, valid)
    get_impossible: boolean -> Flag to also keep unanswerable questions
    """
    count = 0
    # Context managers close both files even if an exception is raised
    with open(dataset_type + '.paragraphs', 'w') as para_output, \
            open(dataset_type + '.questions', 'w') as question_output:
        for article in tqdm(dataset):
            for paragraph in article['paragraphs']:
                # The paragraph is written once per question, flattened to one line
                para = paragraph['context'].replace('\n', ' ').strip().lower()
                for questionanswers in paragraph['qas']:
                    # Skip unanswerable questions unless they are explicitly requested
                    if questionanswers.get('is_impossible', False) and not get_impossible:
                        continue
                    question = questionanswers['question']
                    para_output.write(para + '\n')
                    question_output.write(question.strip().lower() + '\n')
                    count += 1
    print(count)

In [0]:
convert_to_file_without_answers(train_data, 'train')
convert_to_file_without_answers(test_data, 'test')
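
The nested layout that the conversion walks over (articles, then paragraphs, then question/answer entries) can be illustrated with a tiny hand-made sample. This is only a sketch: the sample text is invented, and `iter_pairs` is a hypothetical helper that mirrors the loop above without writing any files.

```python
# Hypothetical miniature of the SQuAD v2.0 layout (content invented for illustration)
sample = [{
    "title": "Example",
    "paragraphs": [{
        "context": "Paris is the capital of France.\nIt lies on the Seine.",
        "qas": [
            {"question": "What is the capital of France?", "is_impossible": False,
             "answers": [{"text": "Paris", "answer_start": 0}]},
            {"question": "Who founded Paris?", "is_impossible": True, "answers": []},
        ],
    }],
}]


def iter_pairs(dataset, get_impossible=False):
    """Yield (paragraph, question) pairs, lowercased and flattened to one line."""
    for article in dataset:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"].replace("\n", " ").strip().lower()
            for qa in paragraph["qas"]:
                # Unanswerable questions are skipped unless explicitly requested
                if qa.get("is_impossible", False) and not get_impossible:
                    continue
                yield context, qa["question"].strip().lower()


pairs = list(iter_pairs(sample))
print(pairs)
```

Each answerable question produces one line in the `.paragraphs` file and one line in the `.questions` file, so a paragraph with several questions is repeated several times.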


In [0]:
def split_train_valid(filename_paragraph='train.paragraphs', filename_questions='train.questions', split_ratio=0.8):
    """Splits the train set into a train set and a validation set"""

    # Read everything first: with the default arguments the inputs are overwritten below
    with open(filename_paragraph) as paragraphs_file, open(filename_questions) as questions_file:
        data_paragraphs = paragraphs_file.readlines()
        data_questions = questions_file.readlines()

    train_count, valid_count = 0, 0

    # Output files, closed (and flushed) automatically when the block exits
    with open('train.paragraphs', 'w') as train_paragraphs_file, \
            open('valid.paragraphs', 'w') as valid_paragraphs_file, \
            open('train.questions', 'w') as train_questions_file, \
            open('valid.questions', 'w') as valid_questions_file:
        for i in tqdm(range(len(data_paragraphs))):
            # Each pair goes to the train set with probability split_ratio
            if random.random() < split_ratio:
                train_paragraphs_file.write(data_paragraphs[i].strip() + '\n')
                train_questions_file.write(data_questions[i].strip() + '\n')
                train_count += 1
            else:
                valid_paragraphs_file.write(data_paragraphs[i].strip() + '\n')
                valid_questions_file.write(data_questions[i].strip() + '\n')
                valid_count += 1

    logger.info('Total Trainset: {} | Total ValidSet: {}'.format(train_count, valid_count))



In [0]:
split_train_valid()
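
Before binarizing, it may be worth checking that every `.paragraphs` file still has exactly as many lines as its `.questions` counterpart, since the two files are paired line by line. A minimal sketch, where `count_lines` and `check_parallel` are hypothetical helpers and not part of fairseq:

```python
import os


def count_lines(path):
    """Count the lines of a text file without loading it into memory."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def check_parallel(prefix, src_ext='paragraphs', tgt_ext='questions'):
    """Return the number of pairs, or raise if the two files are misaligned."""
    n_src = count_lines('{}.{}'.format(prefix, src_ext))
    n_tgt = count_lines('{}.{}'.format(prefix, tgt_ext))
    if n_src != n_tgt:
        raise ValueError('{}: {} source lines vs {} target lines'.format(prefix, n_src, n_tgt))
    return n_src


# Only check the splits that exist in the working directory
for prefix in ('train', 'valid', 'test'):
    if os.path.exists(prefix + '.paragraphs') and os.path.exists(prefix + '.questions'):
        print(prefix, check_parallel(prefix))
```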

### Generate the Binarized Dataset for fairseq to Process

In [14]:
!fairseq-preprocess --source-lang paragraphs --target-lang questions \
     --trainpref train --testpref test --validpref valid\
     --destdir preprocessed_data --seed 1234 --nwordssrc 45000 --nwordstgt 28000

Namespace(align_suffix=None, alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='preprocessed_data', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=45000, nwordstgt=28000, only_source=False, optimizer='nag', padding_factor=8, seed=1234, source_lang='paragraphs', srcdict=None, target_lang='questions', task='translation', tensorboard_logdir='', testpref='test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='train', user_dir=None, validpref='valid', workers=1)
| [paragraphs] Dictionary: 44999 types
| [paragraphs] train.paragraphs: 70484 sents, 2386532 tokens, 1.32% replaced by <unk>
| [paragraphs] Dictionary: 44999 types
| [paragraphs] valid.paragraphs: 10570 sents, 368586 to
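
The `--nwordssrc` and `--nwordstgt` limits cap the vocabulary, and every token outside it is replaced by `<unk>`, which is where the percentage in the log comes from. The sketch below shows how a frequency cutoff translates into an `<unk>` rate. It is a simplified stand-in and not fairseq's dictionary code (which, among other things, also reserves entries for special symbols); `unk_rate` and the toy sentences are assumptions made for illustration.

```python
from collections import Counter


def unk_rate(lines, max_words):
    """Fraction of tokens that fall outside the max_words most frequent types."""
    counts = Counter(token for line in lines for token in line.split())
    kept = {word for word, _ in counts.most_common(max_words)}
    total = sum(counts.values())
    replaced = sum(count for word, count in counts.items() if word not in kept)
    return replaced / total if total else 0.0


toy = ["the cat sat", "the dog sat", "a bird flew"]
print(unk_rate(toy, max_words=2))
```

Running the same function over `train.paragraphs` with different cutoffs gives a rough idea of how much text a smaller vocabulary would discard.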

### Training ConvSeq2Seq Model

In [23]:
!CUDA_VISIBLE_DEVICES=0 fairseq-train preprocessed_data/ \
     --lr 0.001 --clip-norm 0.1 --dropout 0.3 --max-epoch 15 --optimizer adam\
     --arch fconv_iwslt_de_en --save-dir checkpoints/fconv --batch-size 128 --no-epoch-checkpoints \
     --encoder-embed-path glove.840B.300d.txt \
     --encoder-embed-dim 300 --decoder-embed-dim 300 --decoder-embed-path glove.840B.300d.txt --decoder-out-embed-dim 300 --num-workers 3

Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-08, arch='fconv_iwslt_de_en', best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=0.1, cpu=False, criterion='cross_entropy', curriculum=0, data='preprocessed_data/', dataset_impl=None, ddp_backend='c10d', decoder_attention='True', decoder_embed_dim=300, decoder_embed_path='glove.840B.300d.txt', decoder_layers='[(256, 3)] * 3', decoder_out_embed_dim=300, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, empty_cache_freq=0, encoder_embed_dim=300, encoder_embed_path='glove.840B.300d.txt', encoder_layers='[(256, 3)] * 4', fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1,

In [28]:
!fairseq-generate preprocessed_data \
    --path checkpoints/fconv/checkpoint_last.pt \
    --batch-size 128 > last_gen.out



In [30]:
!tail last_gen.out

S-10485	cardinals have in canon law a `` privilege of forum '' -lrb- i.e. , exemption from being judged by ecclesiastical <unk> of ordinary rank -rrb- : only the pope is competent to judge them in matters subject to ecclesiastical jurisdiction -lrb- cases that refer to matters that are spiritual or linked with the spiritual , or with regard to infringement of ecclesiastical laws and whatever contains an element of sin , where culpability must be determined and the appropriate ecclesiastical penalty imposed -rrb- .
T-10485	who is the only person who can judge a cardinal in regards to laws of the church ?
H-10485	-0.8387119770050049	what is the title given to matters that are considered to judge ?
P-10485	-0.7391 -1.5841 -1.1289 -0.9351 -0.5712 -0.2948 -1.8041 -0.0796 -0.2706 -1.3786 -0.5791 -0.0106 -2.3660 -0.0000
S-5194	tajikistan -lrb- <unk> <unk> / , / <unk> / , or / <unk> / ; persian : <unk> <unk> -lsb- <unk> -rsb- -rrb- , officially the republic of tajikistan -lrb- persian : <unk> 
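
Judging from the output above, each sentence id gets an `S` (source), `T` (target), `H` (hypothesis with its score) and `P` (per-token scores) line, with tab-separated fields. The following sketch parses that layout in Python as an alternative to `grep` and `cut`; `parse_generate_output` is a hypothetical helper, and other fairseq versions may emit additional line types that it simply ignores.

```python
def parse_generate_output(lines):
    """Collect source, target and hypothesis text per sentence id."""
    records = {}
    for line in lines:
        tag, sep, rest = line.rstrip('\n').partition('\t')
        kind, dash, sent_id = tag.partition('-')
        # Ignore log lines and line types this sketch does not handle (e.g. P)
        if not sep or not dash or kind not in ('S', 'T', 'H') or not sent_id.isdigit():
            continue
        entry = records.setdefault(int(sent_id), {})
        if kind == 'H':
            # Hypothesis lines carry the score before the text
            score, _, text = rest.partition('\t')
            entry['score'] = float(score)
            entry['hypothesis'] = text
        elif kind == 'S':
            entry['source'] = rest
        else:
            entry['target'] = rest
    return records


sample = [
    "T-10485\twho is the only person who can judge a cardinal in regards to laws of the church ?",
    "H-10485\t-0.8387119770050049\twhat is the title given to matters that are considered to judge ?",
    "P-10485\t-0.7391 -1.5841",
]
records = parse_generate_output(sample)
print(records[10485]['target'])
print(records[10485]['hypothesis'])
```

Keeping target and hypothesis under the same id makes it easy to inspect pairs side by side instead of comparing two separate files.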

In [0]:
!cp checkpoints/fconv/checkpoint_last.pt /content/drive/My\ Drive/Data

In [32]:
!grep ^H gen.out | cut -f3- > gen.out.sys
!grep ^T gen.out | cut -f2- > gen.out.ref
!fairseq-score --sys gen.out.sys --ref gen.out.ref

Namespace(ignore_case=False, order=4, ref='gen.out.ref', sacrebleu=False, sentence_bleu=False, sys='gen.out.sys')
BLEU4 = 7.49, 39.8/11.5/5.6/2.7 (BP=0.819, ratio=0.834, syslen=114971, reflen=137927)
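
The score line can be read as follows: the four numbers after `BLEU4` are the 1- to 4-gram precisions, and `BP` is the brevity penalty, which under the standard BLEU definition is `exp(1 - reflen / syslen)` when the system output is shorter than the reference. The sketch below recomputes the score from the printed parts. Because the printed precisions are rounded, the result is only expected to be close to the reported value, not identical; `bleu_from_parts` is a hypothetical helper.

```python
import math


def bleu_from_parts(precisions, syslen, reflen):
    """Recombine n-gram precisions (in percent) and lengths into BP and BLEU."""
    # Brevity penalty: only applied when the system output is shorter
    bp = 1.0 if syslen >= reflen else math.exp(1.0 - reflen / syslen)
    # Geometric mean of the n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return bp, bp * geo_mean


bp, bleu = bleu_from_parts([39.8, 11.5, 5.6, 2.7], syslen=114971, reflen=137927)
print('BP = {:.3f}, BLEU = {:.2f}'.format(bp, bleu))
```

The brevity penalty matters here: the hypotheses are noticeably shorter than the references, so part of the gap to a higher score comes from length rather than from n-gram precision alone.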


In [31]:
!grep ^H last_gen.out | cut -f3- > last_gen.out.sys
!grep ^T last_gen.out | cut -f2- > last_gen.out.ref
!fairseq-score --sys last_gen.out.sys --ref last_gen.out.ref

Namespace(ignore_case=False, order=4, ref='last_gen.out.ref', sacrebleu=False, sentence_bleu=False, sys='last_gen.out.sys')
BLEU4 = 7.78, 37.4/10.9/5.3/2.7 (BP=0.895, ratio=0.900, syslen=124120, reflen=137927)


### Trying Baseline LSTM Model

In [15]:
!wget http://nlp.stanford.edu/data/glove.840B.300d.zip

--2020-03-12 11:46:12--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2020-03-12 11:46:12--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2020-03-12 11:46:12--  http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip

In [16]:
!unzip glove.840B.300d.zip

Archive:  glove.840B.300d.zip
  inflating: glove.840B.300d.txt     


In [0]:
# --lr 0.001. --lr-shrink
! rm -rf checkpoints
# --lr 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5,0.25,0.125,0.0625,0.03125,0.015625,0.0078125

In [35]:
!CUDA_VISIBLE_DEVICES=0 fairseq-train preprocessed_data/ \
     --clip-norm 5 --batch-size 64 \
     --save-dir checkpoints/lstm \
     --arch lstm --max-epoch 15 --encoder-hidden-size 600 --encoder-layers 2 \
     --decoder-hidden-size 600 --decoder-layers 2 --optimizer adam --lr 0.001  --dropout 0.3 --encoder-embed-path glove.840B.300d.txt \
     --encoder-bidirectional --encoder-embed-dim 300 --decoder-embed-dim 300 --no-epoch-checkpoints --decoder-embed-path glove.840B.300d.txt --decoder-out-embed-dim 300 --num-workers 3


Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_softmax_cutoff='10000,50000,200000', arch='lstm', best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=5.0, cpu=False, criterion='cross_entropy', curriculum=0, data='preprocessed_data/', dataset_impl=None, ddp_backend='c10d', decoder_attention='1', decoder_dropout_in=0.3, decoder_dropout_out=0.3, decoder_embed_dim=300, decoder_embed_path='glove.840B.300d.txt', decoder_freeze_embed=False, decoder_hidden_size=600, decoder_layers=2, decoder_out_embed_dim=300, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, empty_cache_freq=0, encoder_bidirectional=True, encoder_dropout_in=0.3, encoder_dropout_out=0.3, encoder_embed_dim=300, encoder_embed_path='glove.840B.300d.txt', encoder_freeze_embed=False, encoder_hidden_size=600, encoder_layers=2, fast_stat_sync=False, f

In [38]:
!fairseq-generate preprocessed_data \
    --path checkpoints/lstm/checkpoint_last.pt \
    --batch-size 64 --beam 3 > lstm_last.out



In [39]:
!grep ^H lstm_last.out | cut -f3- > lstm_last.out.sys
!grep ^T lstm_last.out | cut -f2- > lstm_last.out.ref
!fairseq-score --sys lstm_last.out.sys --ref lstm_last.out.ref

Namespace(ignore_case=False, order=4, ref='lstm_last.out.ref', sacrebleu=False, sentence_bleu=False, sys='lstm_last.out.sys')
BLEU4 = 7.04, 39.0/11.2/5.2/2.6 (BP=0.807, ratio=0.823, syslen=113536, reflen=137927)


In [0]:
!cp checkpoints/lstm/checkpoint_best.pt /content/drive/My\ Drive/Data/lstm_best.pt

In [41]:
!head lstm.out.sys 
print('---------')
!head lstm.out.ref

what was madonna raised in ?
what do vaccine rely on ?
when will the line of the line will be launched in 2016 ?
what was the name of the earthquake in 2008 ?
who returned as <unk> 's composer ?
how many frets does <unk> 's vocal range cover ?
who was the winner of the winner ?
what is the official language of portugal ?
what is a group of offspring ?
who was influential among the british <unk> ?
---------
<<unk>> was raised in what religion ?
what do vaccines need to work ?
when will the full line appear ?
what earthquake happened in southern sichuan ?
who wrote the music for <<unk>> ?
how many octaves does <<unk>> have ?
who won this season of idol ?
what is the official name of portugal ?
what are <<unk>> offspring referred as ?
who was <<unk>> influential among ?


### Baseline LSTM with Sentence Filtering

#### Filtering the SQuAD Dataset

In [0]:
from spacy.lang.en import English

nlp_sentence = English()
nlp_sentence.add_pipe(nlp_sentence.create_pipe("sentencizer"))

In [0]:
def extract_filtered_sentences(questionanswers, para):
    """
    Returns the sentences of para that contain an answer to the given SQuAD question
    """
    tokenized_paragraph = nlp_sentence(para)
    # text_with_ws keeps trailing whitespace, so sentence lengths add up to character offsets
    sentences = [sent.text_with_ws for sent in tokenized_paragraph.sents]

    # A list keeps the sentence order deterministic (set ordering can vary between runs)
    filtered_sentences = []

    # This iterates over every answer of the question
    for answer in questionanswers["answers"]:
        answer_index = answer["answer_start"]
        length = 0

        # Find the sentence that contains the start of the answer
        for sentence in sentences:
            # Strict '<': an offset equal to the sentence end belongs to the next sentence
            if answer_index < length + len(sentence):
                cleaned = sentence.replace("\n", " ").strip()
                if cleaned not in filtered_sentences:
                    filtered_sentences.append(cleaned)
                break
            length += len(sentence)

        if not filtered_sentences:
            print("Length : {}".format(length))
            raise Exception("One of the answers had no sentence, please check the data")

    return " ".join(filtered_sentences)
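
The offset bookkeeping can be tried in isolation without spaCy. The sketch below uses a naive regex splitter as a stand-in for the sentencizer (an assumption made only to keep the example self-contained) and maps an `answer_start` character offset to its sentence. It uses a strict comparison so that an answer starting on the very first character of a sentence is assigned to that sentence and not to the previous one.

```python
import re


def split_keep_ws(text):
    """Naive splitter: a sentence ends at . ! or ? and keeps its trailing whitespace."""
    return re.findall(r'[^.!?]*[.!?]+\s*|[^.!?]+$', text)


def sentence_for_offset(sentences, answer_start):
    """Return the sentence whose character span contains answer_start."""
    length = 0
    for sentence in sentences:
        if answer_start < length + len(sentence):
            return sentence.strip()
        length += len(sentence)
    raise ValueError('answer_start lies outside the paragraph')


para = "Paris is the capital of France. It lies on the Seine."
sentences = split_keep_ws(para)
start = para.index("It")  # the answer begins exactly at a sentence boundary
print(sentence_for_offset(sentences, start))
```

The approach only works if the pieces concatenate back to the original paragraph, which is why the splitter keeps trailing whitespace attached to each sentence.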

In [0]:
def filter_sentences_on_answer(dataset, dataset_type="train", get_impossible=False):
    """
    Keeps only the sentences of each paragraph that are relevant to the answer and
    generates files with sentences and questions instead of paragraphs and questions
    Input:
    dataset: list -> Articles from the 'data' field of the SQuAD json
    dataset_type: string -> Type of dataset (train, test, valid)
    get_impossible: boolean -> Unused here: unanswerable questions have no answer
                               sentence and are always skipped
    """

    dataset_size = 0

    logger.debug("Starting to filter sentences on answer")

    # Context managers close both files even if an exception is raised
    with open(dataset_type + '.paragraphs', 'w') as para_output, \
            open(dataset_type + '.questions', 'w') as question_output:
        # This loop iterates over every article
        for article in tqdm(dataset):
            for paragraph in article["paragraphs"]:
                para = paragraph["context"]
                # This loop iterates over every question in para
                for questionanswers in paragraph["qas"]:
                    if questionanswers.get("is_impossible", False):
                        continue
                    question = questionanswers["question"]

                    filtered_sentences = extract_filtered_sentences(questionanswers, para)

                    para_output.write(filtered_sentences.strip().lower() + "\n")
                    question_output.write(question.strip().lower() + "\n")

                    dataset_size += 1

    logger.info("Size of the {} dataset: {}".format(dataset_type, dataset_size))

    logger.debug("Sentences Filtered on Answers")

In [0]:
filter_sentences_on_answer(train_data, 'train')
filter_sentences_on_answer(test_data, 'test')

In [0]:
split_train_valid()

#### Training

In [8]:
!wget http://nlp.stanford.edu/data/glove.840B.300d.zip
!unzip glove.840B.300d.zip
!rm -rf checkpoints

--2020-03-13 12:16:44--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2020-03-13 12:16:45--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2020-03-13 12:16:45--  http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip

In [0]:
!fairseq-preprocess --source-lang paragraphs --target-lang questions \
     --trainpref train --testpref test --validpref valid\
     --destdir preprocessed_data --nwordssrc 45000 --nwordstgt 28000

#--nwordssrc 45000 --nwordstgt 28000

In [0]:
!CUDA_VISIBLE_DEVICES=0 fairseq-train preprocessed_data/ \
     --clip-norm 5 --batch-size 64 \
     --arch lstm --max-epoch 15 --encoder-hidden-size 600 --encoder-layers 2 \
     --decoder-hidden-size 600 --decoder-layers 2 --optimizer sgd  --dropout 0.3 --encoder-embed-path glove.840B.300d.txt \
     --encoder-bidirectional --encoder-embed-dim 300 --decoder-embed-dim 300 --no-epoch-checkpoints --decoder-embed-path glove.840B.300d.txt --decoder-out-embed-dim 300 --num-workers 3 \
     --lr 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.5,0.25,0.125,0.0625,0.03125,0.015625,0.0078125
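
The long `--lr` list above keeps the learning rate at 1.0 for the first 8 entries and then halves it for each of the remaining 7, giving 15 values for the 15 epochs. As I understand it, fairseq's default fixed scheduler steps through such a list one entry per epoch. Writing the list by hand is error-prone, so here is a sketch that generates it; `halving_schedule` is a hypothetical helper.

```python
def halving_schedule(constant_epochs=8, total_epochs=15, base_lr=1.0):
    """Constant learning rate first, then halved once per remaining epoch."""
    lrs = [base_lr] * constant_epochs
    for k in range(1, total_epochs - constant_epochs + 1):
        lrs.append(base_lr / 2 ** k)
    return lrs


# Comma-separated string in the form expected by the --lr argument
lr_arg = ','.join(str(lr) for lr in halving_schedule())
print(lr_arg)
```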

In [0]:
!fairseq-generate preprocessed_data \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 64 --beam 3 | tee gen.out

In [0]:
!grep ^H gen.out | cut -f3- > gen.out.sys
!grep ^T gen.out | cut -f2- > gen.out.ref
!fairseq-score --sys gen.out.sys --ref gen.out.ref

In [0]:
!head -n 100 gen.out.sys

### Transformer Model

In [9]:
!fairseq-preprocess --source-lang paragraphs --target-lang questions \
     --trainpref train --testpref test --validpref valid\
     --destdir preprocessed_data --nwordssrc 45000 --nwordstgt 28000

Namespace(align_suffix=None, alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='preprocessed_data', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=45000, nwordstgt=28000, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='paragraphs', srcdict=None, target_lang='questions', task='translation', tensorboard_logdir='', testpref='test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='train', user_dir=None, validpref='valid', workers=1)
| [paragraphs] Dictionary: 44999 types
| [paragraphs] train.paragraphs: 70484 sents, 2386532 tokens, 1.32% replaced by <unk>
| [paragraphs] Dictionary: 44999 types
| [paragraphs] valid.paragraphs: 10570 sents, 368586 token

In [13]:
# fairseq-train \
#     data-bin/wmt16_en_de_bpe32k \
#     --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
#     --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
#     --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
#     --dropout 0.3 --weight-decay 0.0 \
#     --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
#     --max-tokens 3584 \
#     --fp16

# --decoder-layers 2 --encoder-layers 2 --encoder-embed-path glove.840B.300d.txt --encoder-embed-dim 300 --decoder-embed-dim 300   --decoder-embed-path glove.840B.300d.txt 

!CUDA_VISIBLE_DEVICES=0 fairseq-train preprocessed_data/ \
     --clip-norm 0.0 --batch-size 64 \
     --arch transformer --max-epoch 15  \
     --save-dir checkpoints/transformer \
     --optimizer adam  --dropout 0.3 \
     --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
     --no-epoch-checkpoints --num-workers 3

Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer', attention_dropout=0.0, best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='cross_entropy', cross_self_attention=False, curriculum=0, data='preprocessed_data/', dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encode
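
The `inverse_sqrt` scheduler used above, as I understand it, raises the learning rate linearly from `--warmup-init-lr` to `--lr` over the first `--warmup-updates` steps and afterwards decays it proportionally to the inverse square root of the update number. A small sketch of that shape with the values from the command; `inverse_sqrt_lr` is a hypothetical re-implementation for illustration, not fairseq's own code.

```python
import math


def inverse_sqrt_lr(step, lr=0.0005, warmup_updates=4000, warmup_init_lr=1e-07):
    """Learning rate at a given update: linear warmup, then 1/sqrt(step) decay."""
    if step < warmup_updates:
        # Linear interpolation between the initial and the peak learning rate
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    # After warmup the rate decays with the inverse square root of the step
    return lr * math.sqrt(warmup_updates / step)


for step in (0, 2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
```

With batches of 64 sentences, the 4000 warmup updates cover a sizeable share of the 15 epochs, which may be worth keeping in mind when interpreting the transformer results.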

In [14]:
!fairseq-generate preprocessed_data \
    --path checkpoints/transformer/checkpoint_best.pt \
    --batch-size 64 --beam 3 | tee transformer.out

Streaming output truncated to the last 5000 lines.
P-3770	-1.0566 -0.8042 -0.1405 -0.1467 -0.0053 -0.1199 -2.7720 -1.1656 -2.0467 -1.9258 -1.4245 -0.9294 -1.2462 -0.7750 -0.0000
S-5667	continental portugal is <unk> into 18 districts , while the archipelagos of the azores and madeira are governed as autonomous regions ; the largest units , established since 1976 , are either mainland portugal -lrb- portuguese : portugal continental -rrb- and the autonomous regions of portugal -lrb- azores and madeira -rrb- .
T-5667	how many districts is the continental portugal divided into ?
H-5667	-0.6435481309890747	what is the name of the largest group in the world ?
P-5667	-0.6972 -0.5679 -0.2510 -0.2680 -0.0035 -0.0555 -0.3923 -2.6520 -0.4540 -0.8963 -2.0283 -0.1002 0.0000
S-4542	major biomedical research institutions include memorial sloan -- <unk> cancer center , rockefeller university , suny <unk> medical center , albert einstein college of medicine , mount sinai school of medicin

In [15]:
!grep ^H transformer.out | cut -f3- > transformer.out.sys
!grep ^T transformer.out | cut -f2- > transformer.out.ref
!fairseq-score --sys transformer.out.sys --ref transformer.out.ref

Namespace(ignore_case=False, order=4, ref='transformer.out.ref', sacrebleu=False, sentence_bleu=False, sys='transformer.out.sys')
BLEU4 = 3.13, 27.1/4.2/1.4/0.6 (BP=1.000, ratio=1.055, syslen=145454, reflen=137927)
