# COVID QA Analysis
## Advanced Statistical NLP (CSE 291-3) 

### Yash Khandelwal, Kaushik Ravindran

GitHub: https://github.com/yashskhandelwal/Covid_QA_Analysis



#### Expected outputs per model:

##### Evaluation:
*   Exact Match
*   F1 Score
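
As a rough illustration of these two metrics, the sketch below computes SQuAD-style Exact Match and token-level F1 for a single prediction using only the standard library. The function names are hypothetical, and the normalization (lowercasing, stripping punctuation and English articles) is an assumption modelled on the usual SQuAD convention; the notebook itself relies on the `squad` metric loaded further down, which reports both scores as percentages rather than fractions.

```python
import collections
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The virus", "virus"))
print(f1_score("novel coronavirus", "coronavirus"))
```

Exact Match rewards only a perfect normalized match, while F1 gives partial credit when the predicted span overlaps the reference.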

##### Training time:
*   Time taken to fine-tune the model
*   Average prediction time
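
One possible way to measure the average prediction time is sketched below. It is a minimal example, not the measurement code used in this project: `average_prediction_time` and `fake_predict` are hypothetical names, and the stand-in prediction function only uppercases a string. When timing a real model on a GPU, `torch.cuda.synchronize()` would normally be called before reading the clock so that pending kernels are included.

```python
import statistics
import time


def average_prediction_time(predict_fn, inputs, warmup=1):
    """Return the mean wall-clock seconds per call of predict_fn over inputs."""
    for item in inputs[:warmup]:
        predict_fn(item)  # warm-up calls are not timed
    durations = []
    for item in inputs:
        start = time.perf_counter()
        predict_fn(item)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


def fake_predict(question):
    """Stand-in for a real model call (hypothetical, for illustration only)."""
    return question.upper()


avg_seconds = average_prediction_time(
    fake_predict, ["What is COVID-19?", "How does it spread?"]
)
print(f"Average prediction time: {avg_seconds * 1000:.3f} ms")
```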

##### Environmental impact:
*   GPU Details
*   CO2 emission impact of training the model

#### List of models

*   BERT: Base, Large
*   RoBERTa: Base, Large
*   DistilBERT: Base
*   ALBERT: Base, XXL
*   ELECTRA: Base
*   Longformer: Base, Large
*   BigBird: Base

#### Main libraries:

*   PyTorch
*   transformers (HuggingFace)
*   tokenizers (HuggingFace)
*   datasets (HuggingFace)
*   codecarbon




In [5]:
%%capture
# environment setup
# install relevant libraries
!pip install datasets transformers
!pip install accelerate
!pip install humanize
!pip install millify
!pip install tqdm
!pip install codecarbon

In [6]:
# imports
import math, statistics, time
from collections import defaultdict
import numpy as np
from tqdm.auto import tqdm

import torch
from codecarbon import EmissionsTracker
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments

import warnings
warnings.filterwarnings("ignore")

In [8]:
# log in to the Hugging Face Hub
from huggingface_hub import notebook_login
notebook_login()

Login successful
Your token has been saved to /Users/yashkhandelwal/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


#### Section 1: Preparing the dataset

##### Section 1.1: Load the COVID-QA dataset and inspect it

In [9]:
raw_datasets = load_dataset("covid_qa_deepset")

Reusing dataset covid_qa_deepset (/Users/yashkhandelwal/.cache/huggingface/datasets/covid_qa_deepset/covid_qa_deepset/1.0.0/fb886523842e312176f92ec8e01e77a08fa15a694f5741af6fc42796ee9c8c46)



In [10]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document_id', 'context', 'question', 'is_impossible', 'id', 'answers'],
        num_rows: 2019
    })
})

In [11]:
raw_datasets['train'].features

{'document_id': Value(dtype='int32', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'is_impossible': Value(dtype='bool', id=None),
 'id': Value(dtype='int32', id=None),
 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}

##### Section 1.2: Print some basic stats for the dataset

In [12]:
# Context lengths, measured in characters (not tokens)
context_lengths = list(map(len, raw_datasets['train']['context']))
print('Average context length is:', statistics.mean(context_lengths))
print('Max context length is:', max(context_lengths))
print('Min context length is:', min(context_lengths))
print('Median context length is:', statistics.median(context_lengths))

Average context length is: 32051.449232293213
Max context length is: 70846
Min context length is: 2876
Median context length is: 29857


In [13]:
# Question lengths, measured in characters (not tokens)
question_lengths = list(map(len, raw_datasets['train']['question']))
print('Average question length is:', statistics.mean(question_lengths))
print('Max question length is:', max(question_lengths))
print('Min question length is:', min(question_lengths))
print('Median question length is:', statistics.median(question_lengths))

Average question length is: 58.48588410104012
Max question length is: 194
Min question length is: 11
Median question length is: 55


In [14]:
# Number of answers per question
answer_count = list(map(lambda x: len(x['answers']['text']), raw_datasets['train']))
print('Average answer count is:', statistics.mean(answer_count))
print('Max answer count is:', max(answer_count))
print('Min answer count is:', min(answer_count))
print('Median answer count is:', statistics.median(answer_count))

Average answer count is: 1
Max answer count is: 1
Min answer count is: 1
Median answer count is: 1


In [15]:
# Answer lengths, measured in characters (not tokens)
answer_lengths = list(map(lambda x: len(x['answers']['text'][0]), raw_datasets['train']))
print('Average answer length is:', statistics.mean(answer_lengths))
print('Max answer length is:', max(answer_lengths))
print('Min answer length is:', min(answer_lengths))
print('Median answer length is:', statistics.median(answer_lengths))

Average answer length is: 93.31698860822189
Max answer length is: 933
Min answer length is: 1
Median answer length is: 65
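
The four cells above repeat the same mean/max/min/median pattern. A small helper such as the one sketched below could replace them; `summarize` is a hypothetical name and the toy questions stand in for `raw_datasets['train']['question']`, so the numbers it prints are not dataset statistics.

```python
import statistics


def summarize(name, values):
    """Print and return the mean, max, min and median of a list of numbers."""
    stats = {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
    }
    for key, value in stats.items():
        print(f"{name} {key}: {value}")
    return stats


# Toy strings standing in for the dataset's question column
toy_questions = [
    "What is COVID-19?",
    "How is it transmitted?",
    "What are the symptoms?",
]
question_stats = summarize(
    "Question length (characters)", [len(q) for q in toy_questions]
)
```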


##### Section 1.3: Split dataset into train and validation

In [16]:
raw_datasets_split = raw_datasets["train"].train_test_split(train_size=0.9, seed=42)
raw_datasets_split['validation'] = raw_datasets_split.pop('test')
raw_datasets = raw_datasets_split

Loading cached split indices for dataset at /Users/yashkhandelwal/.cache/huggingface/datasets/covid_qa_deepset/covid_qa_deepset/1.0.0/fb886523842e312176f92ec8e01e77a08fa15a694f5741af6fc42796ee9c8c46/cache-7699b1bdb0a55cfe.arrow and /Users/yashkhandelwal/.cache/huggingface/datasets/covid_qa_deepset/covid_qa_deepset/1.0.0/fb886523842e312176f92ec8e01e77a08fa15a694f5741af6fc42796ee9c8c46/cache-969d64e58f6c8316.arrow


In [17]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document_id', 'context', 'question', 'is_impossible', 'id', 'answers'],
        num_rows: 1817
    })
    validation: Dataset({
        features: ['document_id', 'context', 'question', 'is_impossible', 'id', 'answers'],
        num_rows: 202
    })
})
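
The split sizes above follow from simple arithmetic: 90% of 2019 rows, rounded down, gives 1817 training rows and leaves 202 for validation. The sketch below reproduces only that arithmetic with a seeded shuffle from the standard library. It is an assumption-laden illustration, not how `train_test_split` is implemented, so the rows it assigns to each side differ from the ones the library picks; `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(num_rows, train_size, seed):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    n_train = math.floor(train_size * num_rows)
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(num_rows=2019, train_size=0.9, seed=42)
print(len(train_idx), len(val_idx))
```

Fixing the seed makes the split reproducible across runs, which matters when several models are compared on the same validation set.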

#### Section 2: Tokenize the dataset

In [18]:
pre_trained_model_checkpoint = "bert-base-cased"

In [19]:
tokenizer = AutoTokenizer.from_pretrained(pre_trained_model_checkpoint)

##### Section 2.1: Preprocessing the raw dataset

In [20]:
# Split each long context into multiple overlapping features and
# find the answer's start and end token indices in each feature.
def preprocess_training_examples(examples):
    # number of tokens shared by consecutive windows of the same context
    stride = 50

    questions = [q.strip() for q in examples["question"]]
    answers = examples["answers"]

    # use the model tokenizer to tokenize the examples
    inputs = tokenizer(
        questions,
        examples["context"],
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    # return_overflowing_tokens -- keep the overflowing windows as extra features; the tokenizer also
    #   returns overflow_to_sample_mapping, which maps each feature to the example it came from
    # return_offsets_mapping -- for each token, return the (start, end) character span it covers in its source text

    # pop offset_mapping and overflow_to_sample_mapping, since they are not model inputs
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")

    # map the start and end token of the answer in each feature
    start_positions = []
    end_positions = []
    
    #for each feature
    for i, offset in enumerate(offset_mapping): 
        sample_idx = sample_map[i] #get original example index
        answer = answers[sample_idx] #get the answer for that example
        start_char = answer["answer_start"][0] #start char of answer in original context
        end_char = answer["answer_start"][0] + len(answer["text"][0]) #end char of answer in original context
        
        #labels in tokenized input indicating whether token belongs to question (0), context (1), or special token (None)
        sequence_ids = inputs.sequence_ids(i) 

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
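
To make the role of `stride` concrete, the sketch below splits a toy token sequence into overlapping windows. It illustrates the idea only and is not the tokenizer's implementation: `sliding_windows` is a hypothetical helper, and the real tokenizer additionally reserves room in every window for the question and the special tokens.

```python
def sliding_windows(tokens, window_size, stride):
    """Split tokens into windows of at most window_size that overlap by stride."""
    if not 0 <= stride < window_size:
        raise ValueError("stride must be non-negative and smaller than window_size")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += window_size - stride  # step forward, keeping `stride` tokens
    return windows


toy_context = list(range(10))  # stand-in for 10 context token ids
windows = sliding_windows(toy_context, window_size=4, stride=2)
for window in windows:
    print(window)
```

Because consecutive windows share `stride` tokens, an answer that straddles one window boundary can still appear in full in the next window, provided it is not longer than the overlap allows.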

In [21]:
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

Loading cached processed dataset at /Users/yashkhandelwal/.cache/huggingface/datasets/covid_qa_deepset/covid_qa_deepset/1.0.0/fb886523842e312176f92ec8e01e77a08fa15a694f5741af6fc42796ee9c8c46/cache-a6cb04fbadc16da1.arrow


In [22]:
train_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 9
})
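
The label assignment inside `preprocess_training_examples` turns a character span into a token span by walking the offset mapping. The sketch below shows that step in isolation on a toy sentence. The whitespace tokenization and the helper name `char_span_to_token_span` are assumptions made for illustration; a real tokenizer produces subword offsets instead.

```python
import re


def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to (start_token, end_token).

    Returns None when the span is not fully covered by the given offsets.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    start_token = 0
    while start_token < len(offsets) and offsets[start_token][0] <= start_char:
        start_token += 1
    start_token -= 1  # last token that starts at or before start_char
    end_token = len(offsets) - 1
    while end_token >= 0 and offsets[end_token][1] >= end_char:
        end_token -= 1
    end_token += 1  # first token that ends at or after end_char
    return start_token, end_token


context = "The virus spreads through droplets."
answer = "through droplets"
# Toy offsets: one (start, end) pair per whitespace-separated token
offsets = [match.span() for match in re.finditer(r"\S+", context)]
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)
```

A window that does not fully contain the answer yields `None` here, which corresponds to the `(0, 0)` label used in the preprocessing function.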

#### Section 3: Setting up evaluation for validation

In [23]:
metric = load_metric("squad")

In [24]:
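n_best = 20  # number of top-scoring start and end positions considered per feature
max_answer_length = 30  # maximum length of a predicted span, in tokens

# `features` must carry "example_id" and "offset_mapping" columns (with None offsets for
# non-context tokens); the preprocessing step that builds them is not shown in this notebook.
def compute_metrics(start_logits, end_logits, features, examples):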
    example_to_features = defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
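
The core of `compute_metrics` is the search over the `n_best` start and end positions. The sketch below isolates that search for a single feature using plain Python lists. `best_span` and the toy logits are hypothetical, and the offset check that skips non-context tokens is left out to keep the example short.

```python
def best_span(start_logits, end_logits, n_best=20, max_answer_length=30):
    """Return (score, start_index, end_index) of the best valid span, or None."""

    def top_indices(logits):
        # indices of the n_best largest logits, highest first
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return ranked[:n_best]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # skip spans that end before they start or that are too long
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


toy_start = [0.1, 2.0, 0.3, 0.2, 1.5]
toy_end = [0.0, 0.4, 3.0, 0.1, 2.5]
result = best_span(toy_start, toy_end, n_best=3, max_answer_length=2)
print(result)
```

Restricting the search to the top `n_best` candidates on each side keeps the number of scored pairs at most `n_best * n_best` per feature instead of growing with the square of the sequence length.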

#### Section 4: Finetuning the model

In [28]:
# helpers for fine-tuning: a millisecond clock and a Trainer wrapper that tracks time and emissions
def current_milli_time():
    return round(time.time() * 1000)

def finetune_model(model, args, train_dataset, val_dataset, tokenizer):
    from transformers import Trainer
    from codecarbon import EmissionsTracker
    import torch, time

    tracker = EmissionsTracker()
    tracker.start()
    start_time = current_milli_time()

    trainer = Trainer(
      model=model,
      args=args,
      train_dataset=train_dataset,
      eval_dataset=None,  # val_dataset is not used here; no evaluation runs during training
      tokenizer=tokenizer,
    )
    trainer.train()

    emissions = tracker.stop()
    print('Emissions:', emissions, 'CO_2 eq (in KG)')
    if torch.cuda.is_available():
        print('GPU device name:', torch.cuda.get_device_properties(0).name)
        # total_memory is in bytes; dividing by 10**9 gives gigabytes (GB), not GiB
        print('GPU device memory:', torch.cuda.get_device_properties(0).total_memory/(10**9), "GB")
    # elapsed wall-clock time in minutes
    print('Training time:', (current_milli_time()-start_time)/(1000*60))
    return trainer

In [29]:
# set model and training arguments
model = AutoModelForQuestionAnswering.from_pretrained(pre_trained_model_checkpoint)
args = TrainingArguments(
    "covid_qa_analysis_bert_base",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    fp16=False,
    hub_model_id="armageddon/covid_qa_analysis_bert_base",
    push_to_hub=True,
)

# finetune model
trainer = finetune_model(model, args, train_dataset, None, tokenizer)

loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /Users/yashkhandelwal/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.16.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file https://huggingface.co/bert-bas


Saving model checkpoint to covid_qa_analysis_bert_base/checkpoint-2
Configuration saved in covid_qa_analysis_bert_base/checkpoint-2/config.json
Model weights saved in covid_qa_analysis_bert_base/checkpoint-2/pytorch_model.bin
tokenizer config file saved in covid_qa_analysis_bert_base/checkpoint-2/tokenizer_config.json
Special tokens file saved in covid_qa_analysis_bert_base/checkpoint-2/special_tokens_map.json
tokenizer config file saved in covid_qa_analysis_bert_base/tokenizer_config.json
Special tokens file saved in covid_qa_analysis_bert_base/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




Emissions: 0.00039581337392363214 CO_2 eq (in KG)
Training time: 2.7594333333333334


In [30]:
# push to the Hugging Face Hub if needed
trainer.push_to_hub(commit_message="Working Code")

Saving model checkpoint to covid_qa_analysis_bert_base
Configuration saved in covid_qa_analysis_bert_base/config.json
Model weights saved in covid_qa_analysis_bert_base/pytorch_model.bin
tokenizer config file saved in covid_qa_analysis_bert_base/tokenizer_config.json
Special tokens file saved in covid_qa_analysis_bert_base/special_tokens_map.json


Upload file runs/Feb23_11-15-41_Yashs-MacBook-Pro.local/events.out.tfevents.1645643748.Yashs-MacBook-Pro.local…

To https://huggingface.co/armageddon/covid_qa_analysis_bert_base
   9ad215c..701e101  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Question Answering', 'type': 'question-answering'}, 'dataset': {'name': 'covid_qa_deepset', 'type': 'covid_qa_deepset', 'args': 'covid_qa_deepset'}}


'https://huggingface.co/armageddon/covid_qa_analysis_bert_base/commit/701e101baf9d4d474f245806695951135538570f'