- Changed evaluate.py to include softmax(logits), i.e. the confidence for labels 0 and 1, in the output JSON for validation and test (a minimal sketch of this change appears after this list).
- Added sample JSON outputs.
- Added files for the best model performance (accuracy: 58%).
- Analysed the confidence probabilities: the model is very under-confident and most options are labelled TRUE (1). See confidence-analysis.
- Contacted the developers for more information on the performance; it seems they do not know why it degraded when they updated the toolkit to incorporate newer Allen AI and Hugging Face module versions (issue thread: nyu-mll/jiant#1052).
- After manually checking the results, we observed that any option resembling a portion of the paragraph is marked TRUE, without taking the question into account.
- Researched multi-hop approaches such as Multi-hop Question Answering via Reasoning Chains (Chen et al., 2019), which produces reasoning chains and looks like a logical way to trim the paragraph down to the parts most relevant to a given question. This gave us the following idea.
- Analysed BERT-QA (fine-tuned on SQuAD) and other fine-tuned BERT models (on STS-B, QNLI) on the MultiRC dataset; details are in the experiments/ folder. While it was able to give partially correct answers, its single-span approach failed on multi-hop questions (as expected). One important observation: a frozen BERT without any pre-training gave approximately the same results. This highlights the challenging nature of the dataset and explains the low-confidence model, which could not learn or find the patterns necessary to answer the questions. Added a Python script in MultiRC_BERT_QA/ (a sketch of this kind of probing appears after this list).
- Added the task for the above approach to the baseline model, along with the dataset transformation script, under the branch "MultiRC_NLI/".
- FINAL APPROACH: implemented a named-entity-recognition (NER) based approach. Idea: use BIO tagging to train the model to produce the correct-answer tags for correct answers, and likewise the incorrect-answer tags for wrong answers. Prerequisite: transformed the MultiRC dataset into an NER dataset with different tags, one each for the paragraph, the question, correct answers and incorrect answers.
- Added Colab notebooks with the required data for the above approach to the repository under MultiRC_NER/.
- Analyse the implementation of the entailment-based approach in terms of confidence, with micro-analysis on samples of the data.
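For reference, here is a minimal sketch of the kind of change made in evaluate.py, assuming the logits arrive as a `(num_examples, 2)` tensor; the function and file names below are illustrative, not the toolkit's actual ones:

```python
import json
import torch
import torch.nn.functional as F

def write_preds_with_confidence(logits, out_path):
    """Write per-example predictions plus softmax confidences to a JSON file.

    `logits` is assumed to be a float tensor of shape (num_examples, 2),
    one column per label (0 = false, 1 = true).
    """
    probs = F.softmax(logits, dim=-1)          # convert logits to probabilities
    preds = probs.argmax(dim=-1)               # predicted label per example
    records = [
        {
            "prediction": int(p),
            "confidence_0": float(pr[0]),      # P(label = 0)
            "confidence_1": float(pr[1]),      # P(label = 1)
        }
        for p, pr in zip(preds, probs)
    ]
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)

# example usage with dummy logits
write_preds_with_confidence(torch.randn(4, 2), "val_preds_with_confidence.json")
```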
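The single-span probing mentioned above can be reproduced along these lines with the Hugging Face `transformers` question-answering pipeline; the model name and the toy paragraph/question are stand-ins, not necessarily what the script in MultiRC_BERT_QA/ uses:

```python
from transformers import pipeline

# A SQuAD-fine-tuned BERT returns exactly one answer span plus a score,
# which is why it struggles with multi-hop MultiRC questions.
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

paragraph = (
    "Sara wanted to play on a baseball team. She had never played baseball before. "
    "Her friend Tom offered to teach her how to throw and catch."
)
question = "Who helped Sara, and what did they teach her?"

result = qa(question=question, context=paragraph)
print(result["answer"], result["score"])  # a single span and its confidence
```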
Dataset page: https://cogcomp.seas.upenn.edu/multirc/
Analysis: https://docs.google.com/spreadsheets/d/1zLZw-e5Anm17ah5RsUGOAzEJpmqOGGp-nA_72XfQN-E/edit?usp=sharing
Report: https://www.overleaf.com/read/zfbzkqjzxwrb
Progress slides: https://docs.google.com/presentation/d/1Z8hRQzUXM6ZboHXiayK_s2NtFMi9Ek0osfTT1MWxj9s/edit?usp=sharing
MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
The repo consists of the following files/folders:
- baseline: the jiant software toolkit
  - output: folders for the experiments we conducted and their respective results
    - results.tsv: cumulative evaluation results over the runs
    - log.log: the complete log for the respective run
    - params.conf: a copy of the configuration used for that run
  - data: the task datasets, with each folder containing train, validation and test data
  - jiant/config: the default and custom configuration files
  - jiant: implementation scripts and related modules (Hugging Face, AllenNLP)
  - user_config.sh: the default environment variables (user configuration), such as the project base directory path and the dataset path
- output: folders for the experiments we conducted and their respective results
- dataset: the MultiRC train and dev datasets
- Docs: related research papers
- Topic notes
- experiments: details of the experimental analysis of a different approach
- MultiRC_NER: the NER-based QA approach for the MultiRC dataset, with the required notebooks and data
  - data: tokenized data
  - evaluations: output prediction files
  - models: the trained model, config file and vocab
  - MultiRC_NER notebook: code for training the NER model on the training data
  - MultiRC_NER_eval: code for evaluating the trained NER model on the evaluation data
  - parser.py: converts the given MultiRC data from the original format to the NER format
- MultiRC_BERT_QA: the BERT QA model
  - exploratory_analysis: code and analysis related to the BERT QA model
- MultiRC_NLI: the NLI model
  - preprocess_multirc.py: converts the given MultiRC data from the original format to the NLI format (a sketch of this transformation appears after this list)
- lib
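To make the entailment formulation concrete, here is a hedged sketch of what a script like preprocess_multirc.py does; the hypothesis template, field names and labels are assumptions for illustration, not necessarily the exact ones used:

```python
def to_nli_example(paragraph, question, answer_option, is_correct):
    """Turn one MultiRC (paragraph, question, answer option) triple into an
    NLI-style example: the paragraph becomes the premise, and the question plus
    the candidate answer are merged into a hypothesis.  The label is
    'entailment' when the option is a correct answer, 'not_entailment' otherwise.
    """
    hypothesis = f"{question} {answer_option}"   # simple concatenation template
    return {
        "premise": paragraph,
        "hypothesis": hypothesis,
        "label": "entailment" if is_correct else "not_entailment",
    }

# example usage on a toy item
example = to_nli_example(
    paragraph="Sara wanted to play on a baseball team. She had never played before.",
    question="Had Sara played baseball before?",
    answer_option="No, she had never played.",
    is_correct=True,
)
print(example)
```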
The table below lists the subset of configurations from default.conf that we have overridden in our custom config files.
Argument | Description | Example value
---|---|---
max_seq_len | Maximum sequence length, in tokens. Mainly needed for MultiRC, to avoid over-truncating | 256
batch_size | Training batch size | 32
lr | Initial learning rate | 0.0003
dropout | Dropout rate | 0.1
val_interval | Interval (in steps) at which to evaluate the model on the validation set during pretraining. A step is a batch update | 100
input_module | The word embedding or contextual word representation layer | "bert-base-uncased"
transformers_output_mode | How to use the output layers of the BERT model | "top"
classifier | The type of the final layer(s) in classification and regression tasks | log_reg
pair_attn | If true, use attention in sentence-pair classification/regression tasks | 0
optimizer | Use 'bert_adam' for reproducing BERT experiments | "bert_adam"
Stopping criteria | |
min_lr | Minimum learning rate. Training stops once the explicit LR decay lowers the learning rate below this value | 0.000001
max_vals | Maximum number of validation checks. Training stops once this many validation checks have been done | 1000
max_epochs | Maximum number of epochs (full passes over a task's training data) | -1
Task specification | |
pretrain_tasks | List of tasks to pretrain on | "sst"
target_tasks | List of target tasks to (train and) test on (MultiRC in our case) | "multirc"
do_pretrain | Run pretraining on the tasks listed in pretrain_tasks | 1
do_target_task_training | After do_pretrain, train on the target tasks in target_tasks | 1
do_full_eval | Test on the tasks in target_tasks | 1
load_model | If true, restore from checkpoint when starting do_pretrain. No impact on do_target_task_training | 1
load_target_train_checkpoint | Load the specified model_state checkpoint for target task training | none
load_eval_checkpoint | Load the specified model_state checkpoint for evaluation | none
write_preds | List of splits for which predictions should be written to disk | 1
Misc | |
exp_name | Name of the current experiment |
run_name | Name of the run within a given experiment |
Complete overview of JIANT: https://arxiv.org/pdf/2003.02249.pdf
Tuning the baseline jiant model to run the 'MultiRC' task.
## Setup steps
- Setup environment: `conda env create -f environment.yml`
- Activate environment: `conda activate jiant`
- Install remaining packages.
- Download dataset: `python scripts/download_superglue_data.py --data_dir data --tasks MultiRC`
- Setup environment variables (`cp user_config_template.sh user_config.sh`), e.g.:
  `export JIANT_DATA_DIR=/home/varun/PycharmProjects/jiant/data`,
  `export JIANT_PROJECT_PREFIX=/home/varun/PycharmProjects/jiant`,
  then `source user_config.sh`
- Experiment configuration (`cp jiant/config/demo.conf jiant/config/multirc.conf`), e.g. changes:
  `pretrain_tasks="sst"`, `target_tasks="multirc"`, `input_module = bert-base-cased`
- Run task: `source user_config.sh; python main.py --config_file jiant/config/multirc.conf`
- Monitor with TensorBoard (optional): `pip install tensorboard`, then `tensorboard --logdir=${LOG_DIR}`
- For the NER-based approach: convert the MultiRC dataset into the NER format using parser.py (a rough sketch of this transformation appears after these steps).
- Run the training and evaluation notebooks (replacing the folder paths for the trained model and outputs in these notebooks).
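As a rough illustration of the NER transformation performed by parser.py (the tag names and the whitespace tokenisation below are simplifying assumptions, not necessarily what the script uses):

```python
def to_ner_example(paragraph, question, answer_option, is_correct):
    """Convert one MultiRC (paragraph, question, answer option) triple into a
    token/tag sequence using BIO tags: one tag family each for the paragraph,
    the question, correct answers and incorrect answers."""
    answer_tag = "ANS_COR" if is_correct else "ANS_INC"
    tokens, tags = [], []
    for text, tag in ((paragraph, "PARA"), (question, "QUES"), (answer_option, answer_tag)):
        words = text.split()  # whitespace tokenisation, for illustration only
        tokens.extend(words)
        tags.extend([f"B-{tag}"] + [f"I-{tag}"] * (len(words) - 1))
    return tokens, tags

# example usage on a toy item
tokens, tags = to_ner_example(
    paragraph="Sara wanted to play on a baseball team.",
    question="What did Sara want to do?",
    answer_option="Play baseball.",
    is_correct=True,
)
print(list(zip(tokens, tags)))
```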
- Pick the SuperGLUE baseline BERT model and understand/explore the codebase.
- Improve the model over the baseline scores on the MultiRC dataset.
- Additional model references: (a) Google T5, (b) Facebook RoBERTa, (c) Google BERT.
- Exploring confidence probabilities.
- Increasing the low EM (exact match) score.
- Increasing the F1 score over the baseline results.