# HuggingFace Tutorial

This is a tutorial I wrote while learning how to use Transformer models with Hugging Face.

# References
- https://huggingface.co/
- https://huggingface.co/transformers/
- https://github.com/huggingface/datasets
- https://colab.research.google.com/drive/1IPkZo1Wd-DghIOK6gJpcb0Dv4_Gv2kXB?usp=sharing#scrollTo=ImupuGXDGq7b
- https://github.com/monologg/KoELECTRA
- https://github.com/monologg/KoELECTRA/blob/master/finetune/run_squad.py
- Korean Sentence Splitter: https://github.com/hyunwoongko/kss

# Installation

```bash
$ pip install transformers
```

In [8]:
import numpy as np

import torch
import torchtext

print(f"PyTorch Version: {torch.__version__}")
print(f"TorchText Version: {torchtext.__version__}")  

PyTorch Version: 1.6.0
TorchText Version: 0.8.0a0+c851c3e


# Datasets

`sentencepiece` needs to be installed as well:

```bash
$ pip install sentencepiece
$ pip install datasets
```

# How to use?

## Pipeline

- ConversationalPipeline
- FeatureExtractionPipeline
- FillMaskPipeline
- QuestionAnsweringPipeline
- SummarizationPipeline
- TextClassificationPipeline
- TextGenerationPipeline
- TokenClassificationPipeline
- TranslationPipeline
- ZeroShotClassificationPipeline
- Text2TextGenerationPipeline
- TableQuestionAnsweringPipeline

The `pipeline` function takes a task name and returns the matching pipeline class:

- "feature-extraction": will return a FeatureExtractionPipeline.
- "sentiment-analysis": will return a TextClassificationPipeline.
- "ner": will return a TokenClassificationPipeline.
- "question-answering": will return a QuestionAnsweringPipeline.
- "fill-mask": will return a FillMaskPipeline.
- "summarization": will return a SummarizationPipeline.
- "translation_xx_to_yy": will return a TranslationPipeline.
- "text2text-generation": will return a Text2TextGenerationPipeline.
- "text-generation": will return a TextGenerationPipeline.
- "zero-shot-classification": will return a ZeroShotClassificationPipeline.
- "conversation": will return a ConversationalPipeline.

The model is downloaded automatically and cached under `~/.cache/huggingface/`.
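
As a rough sketch of where that cache ends up: as far as I know the library honors the `HF_HOME` environment variable and otherwise falls back to `~/.cache/huggingface`. The helper name `hf_cache_root` is made up for this note, it is not part of `transformers`, and it ignores other variables such as `XDG_CACHE_HOME`.

```python
import os
from pathlib import Path

def hf_cache_root(env=None):
    """Hypothetical helper: guess the Hugging Face cache root from the environment."""
    env = os.environ if env is None else env
    default = Path.home() / ".cache" / "huggingface"
    # HF_HOME, when set, overrides the default location.
    return Path(env.get("HF_HOME", default)).expanduser()

print(hf_cache_root({}))                      # default location under the home directory
print(hf_cache_root({"HF_HOME": "/tmp/hf"}))  # overridden location
```

Passing the environment as a dictionary keeps the sketch easy to try without touching the real environment.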

In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis", framework="pt")

In [None]:
sentences = [
    "We are very happy to show you the 🤗 Transformers library.",
    "I'll go to Apple Store.",
    "This model covers a lot area. But, I won't use it. Since it is too hard to use."
]

In [None]:
results = classifier(sentences)
for res in results:
    print(f"label: {res['label']}, with score: {round(res['score'], 4)}")

# Fine-Tuning with Custom Dataset

https://huggingface.co/transformers/custom_datasets.html#qa-squad


## Dataset


- Title (title)
- Category of the passage (source)
- Passage (context)
- Question ID (id)
- 5W1H question type (classtype)
- Question (question)
- Start position of the answer (answer_start)
- Answer (text)
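
To make the nesting concrete, here is a toy record in the SQuAD-style layout that the reading code further down walks through (`data` → `paragraphs` → `qas` → `answers`). All values are invented, and where exactly `title`, `source`, `id` and `classtype` sit in the real files is my assumption.

```python
# Toy record; the values are invented and the placement of
# `title`, `source`, `id` and `classtype` is an assumption.
toy = {
    "data": [
        {
            "title": "Example title",
            "source": "news",
            "paragraphs": [
                {
                    "context": "The forum runs from the 22nd to the 28th in Seoul.",
                    "qas": [
                        {
                            "id": "toy-1",
                            "classtype": "when",
                            "question": "When does the forum run?",
                            "answers": [
                                {"text": "from the 22nd to the 28th", "answer_start": 15}
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}

# Flatten to one (question, answer text, answer start) row per answer.
rows = []
for group in toy["data"]:
    for paragraph in group["paragraphs"]:
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                rows.append((qa["question"], answer["text"], answer["answer_start"]))

context = toy["data"][0]["paragraphs"][0]["context"]
question, text, start = rows[0]
# The answer text should be recoverable from the context by character offset.
print(context[start:start + len(text)] == text)
```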

In [185]:
import json
from tqdm import tqdm
from pathlib import Path
repo_path = Path().absolute().parent
data_path = repo_path.parent / "data" / "AIhub" / "QA"
for p in data_path.glob("*all.json"):
    print(p)

/home/simonjisu/code/data/AIhub/QA/ko_nia_normal_squad_all.json
/home/simonjisu/code/data/AIhub/QA/ko_nia_clue0529_squad_all.json
/home/simonjisu/code/data/AIhub/QA/ko_nia_noanswer_squad_all.json


In [35]:
n_split = 10
processd_length = []
for path in data_path.glob("*all.json"):
    if path.name == "ko_nia_normal_squad_all.json":
        state = "train"
    elif path.name == "ko_nia_clue0529_squad_all.json":
        state = "val"
    else:
        continue
    # read
    with open(path, 'rb') as f:
        squad_dict = json.load(f)
    total_examples = len(squad_dict["data"])
    k = len(squad_dict["data"]) // n_split
    processed = 0
    
    for i in range(n_split):
        p = data_path / f"{state}_{i}.json"
        temp_data = dict(
            creator=squad_dict["creator"], 
            version=squad_dict["version"], 
            data=squad_dict["data"][i*k:(i+1)*k]  # i-th chunk of size k (i:i+k would make the chunks overlap)
        )
        processed += k
        with open(p, "w") as f:
            json.dump(temp_data, f)
            
    if processed < total_examples:
        p = data_path / f"{state}_{i+1}.json"
        temp_data = dict(
            creator=squad_dict["creator"], 
            version=squad_dict["version"], 
            data=squad_dict["data"][processed:]
        )
        with open(p, "w") as f:
            json.dump(temp_data, f)
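
A minimal sketch of the intended chunking, using indices only: `n_split` equal slices of size `k`, plus one extra slice for whatever is left over. The helper name `split_indices` is mine; 47314 is the number of training documents reported further down.

```python
def split_indices(total, n_split):
    """Return (start, end) pairs: n_split equal slices plus an optional remainder."""
    k = total // n_split
    bounds = [(i * k, (i + 1) * k) for i in range(n_split)]
    if n_split * k < total:
        bounds.append((n_split * k, total))
    return bounds

bounds = split_indices(total=47314, n_split=10)
print(len(bounds))             # number of output files
print(bounds[0], bounds[-1])   # first slice and remainder slice
```

Checking that the slices are contiguous and add up to `total` is a cheap way to catch overlapping chunks.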

In [7]:
len(squad_dict["data"])

47314

In [33]:
path = data_path / f"ko_nia_normal_squad_{10}.json"
with open(path, 'rb') as f:
    squad_dict = json.load(f)

## Explore Dataset

In [3]:
def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    for group in tqdm(squad_dict["data"], total=len(squad_dict["data"]), desc="Reading Dataset"):
        for paragraph in group['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers

In [4]:
train_file = "ko_nia_normal_squad_all.json"
train_path = data_path / train_file
val_file = "ko_nia_clue0529_squad_all.json"
val_path = data_path / val_file

train_contexts, train_questions, train_answers = read_squad(train_path)
val_contexts, val_questions, val_answers = read_squad(val_path)

Reading Dataset: 100%|██████████| 47314/47314 [00:00<00:00, 444901.71it/s]
Reading Dataset: 100%|██████████| 34500/34500 [00:00<00:00, 548757.41it/s]


In [5]:
print(len(train_contexts), len(val_contexts))

243425 96663


Let's look at a few samples.

In [6]:
import termcolor

for idx in np.random.randint(0, len(train_contexts), size=(2,)):
    txt = train_answers[idx]["text"]
    context = train_contexts[idx].split(txt)
    context.insert(1, termcolor.colored(txt, "red", attrs=["bold"]))
    answer_end = train_answers[idx]['answer_start'] + len(train_answers[idx]['text'])  # exclusive end, like a Python range
    print(termcolor.colored("Context: ", attrs=["bold"]))
    print("".join(context))
    print(termcolor.colored("Question: ", attrs=["bold"]))
    print(train_questions[idx])
    print(termcolor.colored("Answer: ", attrs=["bold"]))
    print(f"  Start: {train_answers[idx]['answer_start']}, End: {answer_end}")
    print()

Context: 
댓글 조작 의혹 사건으로 구속된 김모(48·닉네임 드루킹)씨는 그동안 페이스북을 통해 추미애 더불어민주당 대표, 포털사이트 네이버, 문재인 대통령 핵심 지지자들인 ‘문꿀오소리’ 등을 싸잡아 비판했다. 이밖에도 여론 전문가, 남북 관계 전문가처럼 행세하면서 정치권에 훈수를 두는 일도 마다하지 않았다. 김씨는 자신의 페이스북 계정 ‘Sj Kim(드루킹)’을 통해 지난 1월 26일 “그동안 그렇게 하라고 해도 안하더니 네이버에서 드디어 계정 접속관리하고 기사 웹페이지를 손봤다”며 “기존 매크로 같은 것은 이틀 전부터 막혀서 안 될 것”이라고 언급했다. 매크로를 이용한 댓글 추천 조작 방법을 김씨가 명백하게 알고 있었다는 점을 뒷받침하는 글이라는 분석이 많다. 김씨는 이 글에서 추 대표도 비판했다. 그는 “청와대가 압력을 넣어 네이버 웹페이지를 개편하게 하면 뭘 하느냐”며 “‘문재앙’ 단어를 프레임화한 것은 그걸 기사화시킨 추 대표의 작품”이라고 지적했다. 이어 “지지자들은 열심히 댓글 방어하고 있는데 추 대표는 휴가 가셨다죠? 민주당의 앞날이 암울하다”고 썼다. ‘문꿀오소리’에 대한 혹평도 이어졌다. 김씨는 지난해 12월 페이스북 글에서 “자유한국당 댓글부대는 문 대통령 관련 기사에 악플을 단 뒤 순식간에 7000∼8000개의 추천을 찍는 화력”이라며 “문꿀오소리나 달빛기사단(문 대통령 핵심 지지층)은 기껏해야 그 반의 반에도 미치지 못한다”고 했다. 그러면서 “지금까지 문재인 지지자들은 온라인을 완전히 장악하고 있다고 오만에 빠져 있었다”고 썼다. 여론과 남북 관계에 대한 ‘점잖은’ 훈수도 빼놓지 않았다. 김씨는 1월 초 페이스북에서 “온라인 여론 점유율이 대통령 지지율이라고 여러 차례 이야기를 해도 정치인은 알아듣지 못한다”며 “아직도 오프라인 세상이 여론을 좌우한다고 생각하고 있다”고 했다. 이어 “통일은 반드시 이뤄야만 할 숙원”이라며 “그럴 때는 북한에 대한 발언도 예의있게 해야 한다. 요즘 20, 3

In [7]:
def add_end_idx(answers, contexts):
    for idx, (answer, context) in enumerate(zip(answers, contexts)):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # sometimes squad answers are off by a character or two – fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
            print(f"type1: {idx}")
        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters
            print(f"type2: {idx}")
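
The same off-by-one repair can be tried on a toy sentence. This is only a sketch: the helper name `align_answer` is made up, and unlike the function above it returns `None` when no shift matches, a case in which the function above leaves the answer without an `answer_end`.

```python
def align_answer(context, text, start, max_shift=2):
    """Hypothetical helper: return the corrected (start, end) or None if no match."""
    for shift in range(max_shift + 1):
        s = start - shift
        if s >= 0 and context[s:s + len(text)] == text:
            return s, s + len(text)
    return None

context = "Seoul is the capital of Korea."
print(align_answer(context, "capital", 13))  # exact offset
print(align_answer(context, "capital", 14))  # label is one character too far right
print(align_answer(context, "capital", 20))  # cannot be repaired
```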

In [8]:
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

In [9]:
from transformers import ElectraModel, ElectraTokenizer, ElectraTokenizerFast, ElectraForQuestionAnswering

tokenizer = ElectraTokenizerFast.from_pretrained("monologg/koelectra-base-v3-discriminator")  
# The Fast tokenizer is needed to get the `._encodings` attribute.
# It holds a list of `Encoding` objects, for example:
# train_encodings._encodings[0]
# Encoding(num_tokens=365, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [10]:
sample_len = 5
train_encodings = tokenizer(
    train_contexts[:sample_len], train_questions[:sample_len], padding = "max_length",
    max_length=512, truncation=True
)
val_encodings = tokenizer(
    val_contexts[:sample_len], val_questions[:sample_len], padding = "max_length",
    max_length=512, truncation=True
)

In [11]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))  # char_to_token: index of the token that contains the given character
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers[:sample_len])
add_token_positions(val_encodings, val_answers[:sample_len])
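
To see why the end position is looked up at `answer_end - 1`, here is a toy version of the character-to-token lookup. It uses a plain whitespace tokenizer instead of the real one, and both function names are my own stand-ins, not the tokenizer API.

```python
def whitespace_offsets(text):
    """Return (start, end) character offsets for whitespace-separated tokens."""
    offsets, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    return offsets

def char_to_token(offsets, char_index):
    """Index of the token covering char_index, or None if no token covers it."""
    for i, (s, e) in enumerate(offsets):
        if s <= char_index < e:
            return i
    return None

context = "Seoul is the capital of Korea."
offsets = whitespace_offsets(context)
start_char, end_char = 13, 20   # "capital", end exclusive
print(char_to_token(offsets, start_char), char_to_token(offsets, end_char - 1))
print(char_to_token(offsets, end_char))  # a space: no token covers it
```

Because `answer_end` is exclusive, the character at that index may be a space or the start of the next word, so the last character of the answer is the safer thing to look up.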

In [12]:
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

In [13]:
class ARGS:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
        
args_dict = {
  "task": "korquad",
  "data_dir": "data",
  "ckpt_dir": "ckpt",
  "train_file": "KorQuAD_v1.0_train.json",
  "predict_file": "KorQuAD_v1.0_dev.json",
  "threads": 4,
  "version_2_with_negative": False,
  "null_score_diff_threshold": 0.0,
  "max_seq_length": 512,
  "doc_stride": 128,
  "max_query_length": 64,
  "max_answer_length": 30,
  "n_best_size": 20,
  "verbose_logging": True,
  "overwrite_output_dir": True,
  "evaluate_during_training": True,
  "eval_all_checkpoints": True,
  "save_optimizer": False,
  "do_lower_case": False,
  "do_train": True,
  "do_eval": True,
  "num_train_epochs": 7,
  "weight_decay": 0.0,
  "gradient_accumulation_steps": 1,
  "adam_epsilon": 1e-8,
  "warmup_proportion": 0,
  "max_steps": -1,
  "max_grad_norm": 1.0,
  "no_cuda": False,
  "model_type": "koelectra-base-v3",
  "model_name_or_path": "monologg/koelectra-base-v3-discriminator",
  "output_dir": "koelectra-base-v3-korquad-ckpt",
  "seed": 42,
  "train_batch_size": 8,
  "eval_batch_size": 32,
  "logging_steps": 1000,
  "save_steps": 1000,
  "learning_rate": 5e-5
}
     
args = ARGS(**args_dict)

In [14]:
from transformers import ElectraForQuestionAnswering, ElectraConfig
config = ElectraConfig.from_pretrained(args.model_name_or_path)
model = ElectraForQuestionAnswering.from_pretrained(args.model_name_or_path, config=config)

Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForQuestionAnswering: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForQuestionAnswering were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: ['qa_outputs.weight'

In [15]:
from torch.utils.data import DataLoader
from transformers import AdamW

device = "cpu"

model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=args.train_batch_size, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for batch in train_loader:
    optim.zero_grad()
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
    loss = outputs[0]
    loss.backward()
    optim.step()
    break

In [16]:
print("Inputs", input_ids.size())
print("Start Tokens", start_positions.size())
print("End Tokens", end_positions.size())
print("Loss", outputs.loss)
print("Start Logits", outputs.start_logits.size())
print("End Logits", outputs.end_logits.size())

Inputs torch.Size([5, 512])
Start Tokens torch.Size([5])
End Tokens torch.Size([5])
Loss tensor(6.0529, grad_fn=<DivBackward0>)
Start Logits torch.Size([5, 512])
End Logits torch.Size([5, 512])
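
A loss of about 6 is what one would expect from a head that has not learned anything yet. As far as I understand, the QA loss is the average of a cross-entropy over start positions and one over end positions (the `DivBackward0` above fits that), and with 512 candidate positions a uniform guess costs ln(512). A pure-Python sketch of that arithmetic, with made-up target positions:

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

seq_len = 512
uniform = [0.0] * seq_len            # an untrained head that prefers no position
start_loss = cross_entropy(uniform, target=42)
end_loss = cross_entropy(uniform, target=45)
loss = (start_loss + end_loss) / 2   # average of the start and end losses
print(loss, math.log(seq_len))
```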


## Is there a more efficient way?

In [183]:
from transformers import squad_convert_examples_to_features
from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor

In [186]:
processor = SquadV2Processor()
examples = processor.get_train_examples(data_dir=data_path, filename="test.json")  # examples are first tokenized on whitespace

100%|██████████| 5/5 [00:00<00:00, 314.43it/s]


In [187]:
from transformers import ElectraTokenizer, ElectraForQuestionAnswering
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

In [6]:
features, train_dataset = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=512,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
    return_dataset="pt",
    threads=4,
)

convert squad examples to features: 100%|██████████| 25/25 [00:00<00:00, 84.96it/s]
add example index and unique id: 100%|██████████| 25/25 [00:00<00:00, 67216.41it/s]


In [7]:
a = features[1]
print(a.start_position, a.end_position)

26 30


In [8]:
for i, t in enumerate(a.tokens[:35]):
    print(i, t)

0 [CLS]
1 '
2 국제
3 청소년
4 ##포
5 ##럼
6 '
7 이
8 열리
9 ##는
10 때
11 ##는
12 ?
13 [SEP]
14 한국
15 ##청
16 ##소년단
17 ##체
18 ##협
19 ##의
20 ##회
21 ##와
22 여성
23 ##가족
24 ##부
25 ##는
26 22
27 ##일
28 ##부터
29 28
30 ##일
31 ##까
32 ##지
33 서울
34 ##과


In [178]:
class ARGS:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
        
args_dict = {
  "task": "korquad",
  "data_dir": "data",
  "ckpt_dir": "ckpt",
  "train_file": "KorQuAD_v1.0_train.json",
  "predict_file": "KorQuAD_v1.0_dev.json",
  "threads": 4,
  "version_2_with_negative": False,
  "null_score_diff_threshold": 0.0,
  "max_seq_length": 512,
  "doc_stride": 128,
  "max_query_length": 64,
  "max_answer_length": 30,
  "n_best_size": 20,
  "verbose_logging": True,
  "overwrite_output_dir": True,
  "evaluate_during_training": True,
  "eval_all_checkpoints": True,
  "save_optimizer": False,
  "do_lower_case": False,
  "do_train": True,
  "do_eval": True,
  "num_train_epochs": 7,
  "weight_decay": 0.0,
  "gradient_accumulation_steps": 1,
  "adam_epsilon": 1e-8,
  "warmup_proportion": 0,
  "max_steps": -1,
  "max_grad_norm": 1.0,
  "no_cuda": False,
  "model_type": "koelectra-base-v3",
  "model_name_or_path": "monologg/koelectra-base-v3-discriminator",
  "output_dir": "koelectra-base-v3-korquad-ckpt",
  "seed": 42,
  "train_batch_size": 8,
  "eval_batch_size": 32,
  "logging_steps": 1000,
  "save_steps": 1000,
  "learning_rate": 5e-5
}
     
args = ARGS(**args_dict)

def tolist(tensor):
    return tensor.detach().cpu().tolist()

In [179]:
from transformers import ElectraForQuestionAnswering, ElectraConfig
config = ElectraConfig.from_pretrained(args.model_name_or_path)
model = ElectraForQuestionAnswering.from_pretrained(args.model_name_or_path, config=config)

Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForQuestionAnswering: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForQuestionAnswering were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: ['qa_outputs.weight'

In [180]:
device = "cpu"

In [12]:
from torch.utils.data import DataLoader
from transformers import AdamW, get_linear_schedule_with_warmup

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=args.train_batch_size)
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(t_total * args.warmup_proportion), num_training_steps=t_total
)

global_step = 1
epochs_trained = 0
steps_trained_in_current_epoch = 0
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad()
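
For intuition about the scheduler, this is the learning-rate multiplier I understand `get_linear_schedule_with_warmup` to apply: a linear ramp from 0 to 1 over the warmup steps, then a linear decay to 0 at `num_training_steps`. It is a re-implementation for illustration, not the library code, and `t_total = 100` is an arbitrary toy value.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """LR multiplier: linear ramp-up during warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

t_total = 100
for step in (0, 50, 100):
    # warmup_proportion is 0 in the args above, so there is no ramp-up phase
    print(step, 5e-5 * linear_warmup_decay(step, num_warmup_steps=0, num_training_steps=t_total))
```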

In [13]:
model.to(device)
model.train()
for step, batch in enumerate(train_dataloader):
    batch = tuple(t.to(device) for t in batch)

    inputs = {
        "input_ids": batch[0],
        "attention_mask": batch[1],
        "token_type_ids": batch[2],
        "start_positions": batch[3],
        "end_positions": batch[4],
    }
    break

In [14]:
input_ids=inputs["input_ids"]
attention_mask=inputs["attention_mask"]
token_type_ids=inputs["token_type_ids"]

In [15]:
o = model.electra.forward(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
o.last_hidden_state.size()

torch.Size([8, 512, 768])

In [16]:
fin_o = model.qa_outputs(o.last_hidden_state)
fin_o.size()

torch.Size([8, 512, 2])

In [17]:
outputs = model(**inputs)
print(f"start predict {outputs.start_logits.argmax(1)}")
print(f"start answer {inputs['start_positions']}")
print(f"end predict {outputs.end_logits.argmax(1)}")
print(f"end answer {inputs['end_positions']}")

start predict tensor([257,  92,  26, 461,  28,  28, 468,  86])
start answer tensor([  0, 108,  21,  87,  18,  16, 114,  37])
end predict tensor([243,  64,  56, 178, 109, 109,  21, 333])
end answer tensor([  0, 119,  31,  91,  18,  16, 116,  43])


**eval phase**

In [188]:
eval_examples = processor.get_dev_examples(data_dir=data_path, filename="test.json")  # examples are first tokenized on whitespace
eval_features, eval_dataset = squad_convert_examples_to_features(
    examples=eval_examples,
    tokenizer=tokenizer,
    max_seq_length=512,
    doc_stride=128,
    max_query_length=64,
    is_training=False,
    return_dataset="pt",
    threads=4,
)

100%|██████████| 5/5 [00:00<00:00, 608.93it/s]
convert squad examples to features: 100%|██████████| 25/25 [00:00<00:00, 84.13it/s]
add example index and unique id: 100%|██████████| 25/25 [00:00<00:00, 109113.01it/s]


In [189]:
# for fea in eval_features:
#     fea.unique_id -= 1000000000
eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=args.eval_batch_size)

In [37]:
from copy import deepcopy

In [43]:
features = deepcopy(eval_features)
example_index = 0
unique_id = 1000000000
previous_example_index = -1 
for fea in features:
    print(f"before: {fea.unique_id} / {fea.example_index}")
    fea.unique_id = unique_id
    unique_id += 1
    
    current_example_index = fea.example_index
    print(f" c: {current_example_index} / p: {previous_example_index} / e: {example_index}")
    if previous_example_index == current_example_index:
        fea.example_index = previous_example_index
    else:
        previous_example_index = fea.example_index
        fea.example_index = example_index
        example_index += 1
    print(f"after: {fea.unique_id} / {fea.example_index}")
    print()

before: 1000000000 / 0
 c: 0 / p: -1 / e: 0
after: 1000000000 / 0

before: 1000000001 / 1
 c: 1 / p: 0 / e: 1
after: 1000000001 / 1

before: 1000000002 / 2
 c: 2 / p: 1 / e: 2
after: 1000000002 / 2

before: 1000000003 / 3
 c: 3 / p: 2 / e: 3
after: 1000000003 / 3

before: 1000000004 / 4
 c: 4 / p: 3 / e: 4
after: 1000000004 / 4

before: 1000000005 / 5
 c: 5 / p: 4 / e: 5
after: 1000000005 / 5

before: 1000000006 / 5
 c: 5 / p: 5 / e: 6
after: 1000000006 / 5

before: 1000000007 / 6
 c: 6 / p: 5 / e: 6
after: 1000000007 / 6

before: 1000000008 / 6
 c: 6 / p: 6 / e: 7
after: 1000000008 / 6

before: 1000000009 / 7
 c: 7 / p: 6 / e: 7
after: 1000000009 / 7

before: 1000000010 / 7
 c: 7 / p: 7 / e: 8
after: 1000000010 / 7

before: 1000000011 / 8
 c: 8 / p: 7 / e: 8
after: 1000000011 / 8

before: 1000000012 / 8
 c: 8 / p: 8 / e: 9
after: 1000000012 / 8

before: 1000000013 / 9
 c: 9 / p: 8 / e: 9
after: 1000000013 / 9

before: 1000000014 / 9
 c: 9 / p: 9 / e: 10
after: 1000000014 / 9

before: 

In [44]:
features2 = deepcopy(eval_features)

previous_example_index = -1
for fea in features2:
    print(f"before: {fea.unique_id} / {fea.example_index}")
    fea.unique_id = unique_id
    unique_id += 1
    
    current_example_index = fea.example_index
    print(f" c: {current_example_index} / p: {previous_example_index} / e: {example_index}")
    if previous_example_index == current_example_index:
        fea.example_index = previous_example_index
    else:
        previous_example_index = fea.example_index
        fea.example_index = example_index
        example_index += 1
    print(f"after: {fea.unique_id} / {fea.example_index}")
    print()

before: 1000000000 / 0
 c: 0 / p: -1 / e: 25
after: 1000000032 / 25

before: 1000000001 / 1
 c: 1 / p: 0 / e: 26
after: 1000000033 / 26

before: 1000000002 / 2
 c: 2 / p: 1 / e: 27
after: 1000000034 / 27

before: 1000000003 / 3
 c: 3 / p: 2 / e: 28
after: 1000000035 / 28

before: 1000000004 / 4
 c: 4 / p: 3 / e: 29
after: 1000000036 / 29

before: 1000000005 / 5
 c: 5 / p: 4 / e: 30
after: 1000000037 / 30

before: 1000000006 / 5
 c: 5 / p: 5 / e: 31
after: 1000000038 / 5

before: 1000000007 / 6
 c: 6 / p: 5 / e: 31
after: 1000000039 / 31

before: 1000000008 / 6
 c: 6 / p: 6 / e: 32
after: 1000000040 / 6

before: 1000000009 / 7
 c: 7 / p: 6 / e: 32
after: 1000000041 / 32

before: 1000000010 / 7
 c: 7 / p: 7 / e: 33
after: 1000000042 / 7

before: 1000000011 / 8
 c: 8 / p: 7 / e: 33
after: 1000000043 / 33

before: 1000000012 / 8
 c: 8 / p: 8 / e: 34
after: 1000000044 / 8

before: 1000000013 / 9
 c: 9 / p: 8 / e: 34
after: 1000000045 / 34

before: 1000000014 / 9
 c: 9 / p: 9 / e: 35
after: 
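
In the second run above, a repeated feature falls back to its old index (for example `5`, where the first feature of the same example was given `30`), because `previous_example_index` stores the old value. A sketch of a re-indexing that remembers the newly assigned index instead; `Feature` is a small stand-in with only the two fields used here, and `reindex` is a name I made up.

```python
from dataclasses import dataclass

@dataclass
class Feature:            # stand-in for SquadFeatures, only the two fields used here
    unique_id: int
    example_index: int

def reindex(features, next_unique_id, next_example_index):
    """Renumber in place; features of one example keep sharing one new index."""
    previous_old = None
    current_new = None
    for fea in features:
        fea.unique_id = next_unique_id
        next_unique_id += 1
        if fea.example_index != previous_old:
            # first feature of a new example: hand out a fresh index
            previous_old = fea.example_index
            current_new = next_example_index
            next_example_index += 1
        fea.example_index = current_new
    return next_unique_id, next_example_index

file_a = [Feature(1000000000 + i, idx) for i, idx in enumerate([0, 1, 1, 2])]
file_b = [Feature(1000000000 + i, idx) for i, idx in enumerate([0, 0, 1])]
uid, ex = reindex(file_a, 1000000000, 0)
uid, ex = reindex(file_b, uid, ex)
print([f.example_index for f in file_a + file_b])
print([f.unique_id for f in file_b])
```

Returning the two counters lets the next file continue where the previous one stopped.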

In [35]:
for fea in features:
    print(f"previous: {fea.unique_id} / {fea.example_index}")

previous: 1000000000 / 0
previous: 1000000001 / 1
previous: 1000000002 / 1
previous: 1000000003 / 1
previous: 1000000004 / 1
previous: 1000000005 / 1
previous: 1000000006 / 2
previous: 1000000007 / 3
previous: 1000000008 / 4
previous: 1000000009 / 5
previous: 1000000010 / 5
previous: 1000000011 / 6
previous: 1000000012 / 6
previous: 1000000013 / 7
previous: 1000000014 / 7
previous: 1000000015 / 8
previous: 1000000016 / 8
previous: 1000000017 / 9
previous: 1000000018 / 9
previous: 1000000019 / 10
previous: 1000000020 / 10
previous: 1000000021 / 11
previous: 1000000022 / 11
previous: 1000000023 / 12
previous: 1000000024 / 13
previous: 1000000025 / 14
previous: 1000000026 / 15
previous: 1000000027 / 16
previous: 1000000028 / 17
previous: 1000000029 / 18
previous: 1000000030 / 19
previous: 1000000031 / 20


In [190]:
all_results = []
for batch in eval_dataloader:
    model.eval()
    batch = tuple(t.to(device) for t in batch)

    with torch.no_grad():
        inputs = {
            "input_ids": batch[0],
            "attention_mask": batch[1],
            "token_type_ids": batch[2],
        }
        example_indices = batch[3]
        outputs = model(**inputs)
        
    for i, example_index in enumerate(example_indices):
        eval_feature = eval_features[example_index.item()]
        unique_id = int(eval_feature.unique_id)
        output = [tolist(o[i]) for o in outputs.values()]
        start_logits, end_logits = output
        result = SquadResult(unique_id, start_logits, end_logits)
        all_results.append(result)

In [33]:
from transformers.data.metrics.squad_metrics import (
    compute_predictions_logits,
    squad_evaluate
)

In [34]:
output_prediction_file = "./predictions.json"
output_nbest_file = "./nbest_predictions.json"
output_null_log_odds_file = "./null_odds.json"

In [35]:
predictions = compute_predictions_logits(
    eval_examples,
    eval_features,
    all_results,
    args.n_best_size,
    args.max_answer_length,
    args.do_lower_case,
    output_prediction_file,
    output_nbest_file,
    output_null_log_odds_file,
    args.verbose_logging,
    args.version_2_with_negative,
    args.null_score_diff_threshold,
    tokenizer,
)
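
To get a feel for what `max_answer_length` does during decoding, here is a heavily simplified span search: pick the start/end pair with the highest summed logit, with the end not before the start and the span no longer than the cap. This is my own sketch, not the library's implementation, which also keeps an n-best list and maps tokens back to the original text. The logits are toy values.

```python
def best_span(start_logits, end_logits, max_answer_length):
    """Return (score, start, end) maximizing start+end score with a length cap."""
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_length)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best

start_logits = [0.1, 2.0, 0.3, 0.0, 0.2]
end_logits = [0.0, 0.5, 0.1, 3.0, 1.0]
print(best_span(start_logits, end_logits, max_answer_length=2))
print(best_span(start_logits, end_logits, max_answer_length=3))
```

With the tighter cap the best start (index 1) and best end (index 3) cannot be combined, so a different span wins.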

In [36]:
for i, j in zip(eval_examples, predictions.items()):
    if i.qas_id == j[0]:
        print(i.qas_id)
        print(f"Answer: {i.answers[0]['text']}")
        print(f"Predict: {j[1]}")
        print()
    else:
        print(i.qas_id, j[0])

c1_57059-1
Answer: 한국청소년단체협의회와 여성가족부
Predict: 선언문을 전달할 예정이다

c1_57060-1
Answer: 22일부터 28일
Predict: 선언문을 전달할 예정이다

c1_57061-1
Answer: '청소년과 뉴미디어'
Predict: 선언문을 전달할 예정이다

c1_57062-1
Answer: 기조강연을 시작으로 국가별 주제관련 사례발표, 그룹 토론 및 전체총회, '청소년선언문' 작성 및 채택 등 다양한 프로그램을 운영한다.
Predict: 선언문을 전달할 예정이다

m5_306705-1
Answer: 샐리
Predict: 육상 볼링 양궁 리듬체조 에어로빅 선수권대회(이하 아육대)'에서는

c1_151305-1
Answer: 보조 교통 경찰로 일하는 천중핑
Predict: 아이는 죽었을 것”이라며 딸을 구해준 천중핑에게 감사의 뜻을 전했습다

c1_151306-1
Answer: 지난달 28일
Predict: 아이는 죽었을 것”이라며 딸을 구해준 천중핑에게 감사의 뜻을 전했습다

c1_151307-1
Answer: 구이저우성 카일리시
Predict: 아이는 죽었을 것”이라며 딸을 구해준 천중핑에게 감사의 뜻을 전했습다

c1_151308-1
Answer: ‘중국의 좋은 이웃상’과 함께 상금 1만 위안(약 170만원)을 수여
Predict: 아이는 죽었을 것”이라며 딸을 구해준 천중핑에게 감사의 뜻을 전했습다

c1_151309-1
Answer: 이틀 간의 코마 상태 이후 의식을 회복해 지난 2일부터 중환자실에서 치료를 받고 있습니다
Predict: 아이는 죽었을 것”이라며 딸을 구해준 천중핑에게 감사의 뜻을 전했습다

c1_151310-1
Answer: 열쇠공이 문을 따는 소리에 겁을 먹고 창문 밖으로 도망을 치려다
Predict: 회복해 지난

c1_151311-1
Answer: 아이가 잠든 사이 돌보던 아이의 할머니가 쓰레기를 버리러 나갔다가 문이 잠기는 바람에 열쇠공을 불렀던 것
Predict: 아이는 죽었을 것”이

In [37]:
results = squad_evaluate(eval_examples, predictions)

In [38]:
results

OrderedDict([('exact', 0.0),
             ('f1', 5.556363636363637),
             ('total', 25),
             ('HasAns_exact', 0.0),
             ('HasAns_f1', 5.556363636363637),
             ('HasAns_total', 25),
             ('best_exact', 0.0),
             ('best_exact_thresh', 0.0),
             ('best_f1', 5.556363636363637),
             ('best_f1_thresh', 0.0)])
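
For reference, a bare-bones version of the two scores reported above: exact match and token-level F1 between a prediction and a gold answer. It splits on whitespace only and skips the text normalization that, as far as I know, the library applies first, so treat it as a sketch of the idea.

```python
from collections import Counter

def exact_match(prediction, truth):
    """1.0 if the whitespace tokens are identical, else 0.0."""
    return float(prediction.split() == truth.split())

def token_f1(prediction, truth):
    """F1 over whitespace tokens, counting each shared token at most as often as it occurs."""
    pred_tokens, true_tokens = prediction.split(), truth.split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

truth = "22일부터 28일"
print(exact_match("22일부터 28일", truth), token_f1("22일부터 28일", truth))
print(exact_match("28일 까지", truth), token_f1("28일 까지", truth))
```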

Why are some answer positions 0?

In [69]:
vocab_rev = {v: k for k, v in tokenizer.vocab.items()}
tostring = lambda x: " ".join(x).replace(" ##", "").replace("[PAD]", "").strip()
def show_original(inputs):
    tokens = [vocab_rev[i.item()] for i in inputs["input_ids"]]
    s, e = inputs["start_positions"].item(), inputs["end_positions"].item()
    print(s, e)
    print("Answer: ", tostring(tokens[s:(e+1)]))
    print(tostring(tokens))

In [78]:
idxes = [5, 6, 7, 8, 9, 10]
for i, idx in enumerate(idxes):
    batch = train_dataset[idx]
    inputs = {
        "input_ids": batch[0],
        "attention_mask": batch[1],
        "token_type_ids": batch[2],
        "start_positions": batch[3],
        "end_positions": batch[4],
    }
    if i % 2 == 0:
        print("----------------------------"*3)
    show_original(inputs)
    print()

------------------------------------------------------------------------------------
103 112
Answer:  보조 교통 경찰로 일하는 천중핑
[CLS] 중국에서 아파트에서 추락하던 3세 아이를 살리고 자신은 혼수상태에 빠진 사람은 누구야 ? [SEP] 중국의 한 여성 경찰이 아파트에서 추락하던 3세 아이를 살리고 자신은 혼수상태에 빠졌습니다 . 의인 ( 義 人 ) 의 소식이 알려지자 각박한 중국 사회에 큰 반향을 일으키고 있습니다 . 5일 귀주도시망 등 중국 현지 언론에 따르면 구이저우성 카일리시에 보조 교통 경찰로 일하는 천중핑 ( 49 ) 은 지난달 28일 한 아파트에서 비상 상황이 발생했다는 연락을 받고 현장으로 향했습니다 . 도착했을 때 아파트 4층 창문에서 여자 아이가 매달려 있었습니다 . 곧이어 아이는 손에 힘이 빠지면서 밑으로 추락했습니다 . 천중핑과 다른 세명의 이웃들이 달려갔습니다 . 그리고 아이는 바닥이 아니라 천중핑의 팔에 떨어졌습니다 . 중간 비막이 천막 때문에 속도가 줄기는 했지만 추락의 충격은 천중핑이 고스란히 감당해야 했습니다 . 아이는 즉시 병원으로 옮겨져 치료를 받았습니다 . 다리 골절로 그리 심각한 상황은 아니라고 합니다 . 하지만 생명의 은인이자 영웅은 커다란 댓가를 치러야 했다 . 뇌출혈로 인한 의식불명 상태에 빠진 것이다 . 다행히 이틀 간의 코마 상태 이후 의식을 회복해 지난 2일부터 중환자실에서 치료를 받고 있습니다 . 아이는 열쇠공이 문을 따는 소리에 겁을 먹고 창문 밖으로 도망을 치려다 사고를 당한 것으로 전해졌습니다 . 아이가 잠든 사이 돌보던 아이의 할머니가 쓰레기를 버리러 나갔다가 문이 잠기는 바람에 열쇠공을 불렀던 것입니다 . 아이의 엄마는 “ 천중핑의 도움이 없었다면 아이는 죽었을 것 ” 이라며 딸을 구해준 천중핑에게 감사의 뜻을 전했습다 . 카일리시 정부 대표와 공안부 관계자들도 천중핑이 입원한 병원을 찾아 위로하고 회복될때까지 도움

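If a tokenized example is longer than 512 tokens, the context is split into overlapping windows (controlled by `doc_stride`), so one example produces several features. A window that does not contain the answer gets start and end position 0, which is the `[CLS]` token.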
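
My understanding is that `squad_convert_examples_to_features` handles long contexts with overlapping windows and labels a window that misses the answer with position 0. The toy sketch below shows the idea on a list of integers; it leaves out the question and special tokens, and `window`, `step` and both function names are my own simplifications rather than the library's parameters.

```python
def sliding_windows(tokens, window, step):
    """Yield (offset, chunk) pairs covering `tokens` with overlapping windows."""
    offset = 0
    while True:
        yield offset, tokens[offset:offset + window]
        if offset + window >= len(tokens):
            break
        offset += step

def label_window(offset, chunk_len, answer_start, answer_end):
    """Answer position relative to the window, or (0, 0) if it is not fully inside."""
    if answer_start >= offset and answer_end < offset + chunk_len:
        return answer_start - offset, answer_end - offset
    return 0, 0

tokens = list(range(10))              # stand-in for context token ids
for offset, chunk in sliding_windows(tokens, window=6, step=4):
    print(offset, chunk, label_window(offset, len(chunk), answer_start=7, answer_end=8))
```

Only the second window contains the answer, so the first one ends up with the `(0, 0)` label.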

In [1]:
from pathlib import Path
import torch
from torch.utils.data import TensorDataset, DataLoader
import pytorch_lightning as pl
import torchmetrics
from tqdm import tqdm
from transformers import (
    ElectraForQuestionAnswering, 
    ElectraConfig, 
    ElectraTokenizer,
    AdamW,
    squad_convert_examples_to_features,
    get_linear_schedule_with_warmup
)

from transformers.data.processors.squad import SquadResult, SquadV2Processor
from transformers.data.metrics.squad_metrics import (
    compute_predictions_logits,
    squad_evaluate
)
# typing
from transformers.data.processors import SquadFeatures
from typing import List

In [3]:
train_file = "test_train*.json"
val_file = "test_val*.json"
repo_path = Path().absolute().parent
data_path = repo_path.parent / "data" / "AIhub" / "QA"
ckpt_path = repo_path.parent / "ckpt"

args_dict = {
    "task": "AIHub_QA",
    "data_path": data_path,
    "ckpt_path": ckpt_path,
    "train_file": train_file,
    "val_file": val_file,
    "cache_file": "test_{}_cache_{}",
    "random_seed": 77,
    "threads": 4,
    "version_2_with_negative": False,
    "null_score_diff_threshold": 0.0,
    "max_seq_length": 512,
    "doc_stride": 128,
    "max_query_length": 64,
    "max_answer_length": 30,
    "n_best_size": 20,
    "verbose_logging": True,
    "do_lower_case": False,
    "num_train_epochs": 10,
    "weight_decay": 0.0,
    "adam_epsilon": 1e-8,
    "warmup_proportion": 0,
    "model_type": "koelectra-base-v3",
    "model_name_or_path": "monologg/koelectra-base-v3-discriminator",
    "output_dir": "koelectra-base-v3-korquad-ckpt",
    "seed": 42,
    "train_batch_size": 2,
    "eval_batch_size": 3,
    "learning_rate": 5e-5,
    "output_prediction_file": "predictions/predictions_{}.json",
    "output_nbest_file": "nbest_predictions/nbest_predictions_{}.json",
    "output_null_log_odds_file": "null_odds/null_odds_{}.json",
}

for arg in ["output_prediction_file", "output_nbest_file", "output_null_log_odds_file"]:
    p = args_dict["ckpt_path"] / args_dict[arg]
    # create the parent directory of each output file, not a directory named after the file itself
    p.parent.mkdir(parents=True, exist_ok=True)
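
The `max_seq_length` and `doc_stride` settings control how a long context is cut into overlapping windows. The sketch below is a simplified illustration of that sliding-window idea, not the library's implementation; the function name `sliding_windows` and the numbers are assumptions chosen for the example.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, end) index pairs that cover `n_tokens` context tokens.

    Each window holds at most `window` tokens and starts `stride` tokens
    after the previous one, so neighbouring windows overlap.
    """
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # the last window reached the end of the context
            break
        start += stride
    return spans

# Toy numbers: 10 context tokens, windows of 4 tokens, stride of 2.
print(sliding_windows(10, window=4, stride=2))  # [(0, 4), (2, 6), (4, 8), (6, 10)]

# Hypothetical numbers closer to this config: 1000 context tokens, 445 slots
# left for the context after the question and special tokens, stride 128.
print(len(sliding_windows(1000, window=445, stride=128)))  # 6
```

Because the windows overlap, an answer that is cut off at the edge of one window usually appears whole in a neighbouring one.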

In [12]:

def flatten(li):
    for ele in li:
        if isinstance(ele, list):
            yield from flatten(ele)
        else:
            yield ele

class Model(pl.LightningModule):
    def __init__(self, **kwargs):
        super().__init__()
        self.save_hyperparameters() 
        self.config = ElectraConfig.from_pretrained(self.hparams.model_name_or_path)
        self.model = ElectraForQuestionAnswering.from_pretrained(
            self.hparams.model_name_or_path, 
            config=self.config
        )
        self.tokenizer = ElectraTokenizer.from_pretrained(self.hparams.model_name_or_path)
        # create dataset and cache it
        self.train_files = []
        self.val_files = []
        self.create_dataset_all(state="train")
        self.create_dataset_all(state="val")

        # 
        self.all_examples, self.all_features = [], []

        # function
        self.tolist = lambda x: x.detach().cpu().tolist()

    def create_dataset_all(self, state:str):
        self.example_index = 0
        self.unique_id = 1000000000
        if state == "train":
            file_str = self.hparams.train_file
        elif state == "val":
            file_str = self.hparams.val_file
        else:
            raise ValueError("state should be train or val")
        
        # Sort the split files by their numeric suffix, e.g. test_train_0.json -> 0.
        # `Path.stem` drops the extension; `str.strip(".json")` strips characters, not the suffix.
        file_index = lambda p: int(p.stem.split("_")[-1])
        file_iter = sorted(self.hparams.data_path.glob(file_str), key=file_index)
        for path in file_iter:
            self.create_dataset(path.name, file_index(path), state)

    def create_dataset(self, filename:str, idx:int, state:str):
        cache_file = self.hparams.cache_file.format(state, idx)
        print(f"[INFO] Processing: {filename} | Cache file name: {cache_file}")
        processed_file = self.hparams.ckpt_path / cache_file
        if processed_file.exists():
            print(f"[INFO] cache file already exists! passing the procedure")
            print(f"[INFO] Path: {processed_file}")
            if state == "train":
                self.train_files.append(cache_file)
            elif state == "val":
                self.val_files.append(cache_file)
            else:
                raise ValueError("state should be train or val")
            return None
        else:
            processor = SquadV2Processor()
            if state == "train":
                process_fn = processor.get_train_examples
                is_training = True
                self.train_files.append(cache_file)
            elif state == "val":
                process_fn = processor.get_dev_examples
                is_training = False
                self.val_files.append(cache_file)
            else:
                raise ValueError("state should be train or val")

            examples = process_fn(
                data_dir=self.hparams.data_path, 
                filename=filename
            )

            features = squad_convert_examples_to_features(
                examples=examples,
                tokenizer=self.tokenizer,
                max_seq_length=self.hparams.max_seq_length,
                doc_stride=self.hparams.doc_stride,
                max_query_length=self.hparams.max_query_length,
                is_training=is_training,
                return_dataset=False,
                threads=self.hparams.threads,
            )
            # the validation set is split across several files, so `example_index` and `unique_id`
            # must be renumbered to stay unique across all of them
            self.fix_unique_id(features, state)
            dataset = self.convert_to_tensor(state, features)
            cache = dict(dataset=dataset, examples=examples, features=features)
            torch.save(cache, processed_file)
            print(f"[INFO] cache file saved! {processed_file}")

    def convert_to_tensor(self, state:str, features:List[SquadFeatures]):
        """
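        Convert SquadFeatures into a TensorDataset.
        Reference: https://github.com/huggingface/transformers/blob/master/src/transformers/data/processors/squad.py
        Arguments:
            state {str} -- "train" (adds start/end positions) or "val" (adds unique ids)
            features {List[SquadFeatures]} -- output of `squad_convert_examples_to_features`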
        """        
        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
        all_attention_masks = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
        all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
        
        if state == "train":
            all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)
            all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)
            dataset = TensorDataset(
                all_input_ids, all_attention_masks, all_token_type_ids, all_start_positions, all_end_positions
            )
        elif state == "val":
            all_unique_ids = torch.tensor([f.unique_id for f in features], dtype=torch.long)
            dataset = TensorDataset(
                all_input_ids, all_attention_masks, all_token_type_ids, all_unique_ids
            )
        else:
            raise ValueError("state should be train or val")
        return dataset

    def fix_unique_id(self, features:list, state:str="val"):
        # only validation features need ids that stay unique across the split files
        if state != "val":
            return None
        previous_example_index = -1  # per-file index of the previous feature
        for fea in tqdm(features, total=len(features), desc="fixing index and ids"):
            fea.unique_id = self.unique_id
            self.unique_id += 1

            if fea.example_index != previous_example_index:
                # first feature of a new example: take the next global index
                previous_example_index = fea.example_index
                global_example_index = self.example_index
                self.example_index += 1
            # every window of the same example shares one global index
            fea.example_index = global_example_index

    def load_cache(self, filename:str, return_dataset:bool=True):
        processed_file = self.hparams.ckpt_path / filename
        cache = torch.load(processed_file)
        dataset, examples, features = cache["dataset"], cache["examples"], cache["features"]

        if return_dataset:
            return dataset
        else:
            return examples, features

    def create_dataloader(self, state:str="train"):
        if state == "train":
            shuffle = True
            batch_size = self.hparams.train_batch_size
            file_list = self.train_files
        elif state == "val":
            shuffle = False
            batch_size = self.hparams.eval_batch_size
            file_list = self.val_files
        else:
            raise ValueError("state should be train or val")

        file_loader = (self.load_cache(filename=file, return_dataset=True) for file in file_list)
        loaders = []
        self.val_dataset_length = []
        for dataset in file_loader: 
            dataloader = DataLoader(
                dataset=dataset,
                batch_size=batch_size,
                shuffle=shuffle,
                num_workers=self.hparams.threads
            )
            loaders.append(dataloader)
            self.val_dataset_length.append(len(dataset))
        return loaders

    def train_dataloader(self):
        return self.create_dataloader(state="train")

    def val_dataloader(self):
        return self.create_dataloader(state="val")

    def forward(self, **kwargs):
        return self.model(**kwargs)

    def training_step(self, batch, batch_idx):
        batch = list(map(torch.cat, zip(*batch)))
        inputs_ids, attention_mask, token_type_ids, start_positions, end_positions = batch

        outputs = self(
            input_ids=inputs_ids, 
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            start_positions=start_positions,
            end_positions=end_positions
        )

        loss = outputs.loss
        return  {'loss': loss}

    def validation_step(self, batch, batch_idx, dataloader_idx):
        # batch = single dataloader batch not multiple dataloader
        inputs_ids, attention_mask, token_type_ids, data_unique_ids = batch
        outputs = self(
            input_ids=inputs_ids, 
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            start_positions=None,
            end_positions=None
        )

        # outputs.values(): start_logits and end_logits, each (B, L) > batch_results: B SquadResult objects
        # B = batch size of one validation dataloader, L = sequence length
        batch_results = []
        for i, unique_id in enumerate(data_unique_ids.detach().cpu().tolist()):
            output = [self.tolist(o[i]) for o in outputs.values()]
            start_logits, end_logits = output
            result = SquadResult(unique_id, start_logits, end_logits)
            batch_results.append(result)

        # for i, example_index in enumerate(example_indices):
        #     eval_feature = self.eval_features[example_index.item()]
        #     unique_id = int(eval_feature.unique_id)
        #     output = [self.tolist(o[i]) for o in outputs.values()]
        #     start_logits, end_logits = output
        #     result = SquadResult(unique_id, start_logits, end_logits)
        #     batch_results.append(result)
            
        return batch_results
    
    def training_epoch_end(self, outputs):
        # Lightning calls the hook named `training_epoch_end`; it should log values instead of returning them
        loss = torch.stack([out["loss"].detach().cpu() for out in outputs]).mean()
        self.log("train_loss", loss, prog_bar=True)

    def validation_epoch_end(self, outputs):
        if (self.all_examples == []) or (self.all_features == []):
            for file in self.val_files:
                examples, features = self.load_cache(filename=file, return_dataset=False)
                self.all_examples.extend(examples)
                self.all_features.extend(features)
                del examples
                del features

        all_results = list(flatten(outputs))
        # TODO: See if needed?
        # outputs: [(B, 2, H)] : start_logits, end_logits list
        # B = len(datasets) * batch_size
        # all_results = []
        # for res in outputs:  # res: (B, 2, H)
        #     start_logits, end_logits = res
        #     idx = torch.arange(self.hparams.train_batch_size).repeat(len(self.val_files), 1)  # len(dataset), batch_size
        #     example_idx_to_add = torch.LongTensor([0] + self.val_dataset_length[:-1]).unsqueeze(1)
        #     idx = (idx + example_idx_to_add).view(-1)  # B
        #     for k in idx:
        #         unique_id = self.all_features[k].unique_id
        #         result = SquadResult(unique_id, start_logits, end_logits)
        #     all_results.append(result)
        
        # https://huggingface.co/transformers/_modules/transformers/data/processors/squad.html
        # NOTE: create the Trainer with `num_sanity_val_steps=0`. Otherwise a "cannot find the key unique_id"
        # error is raised, presumably because the sanity check only runs a few validation batches.

        predictions = compute_predictions_logits(
            self.all_examples,
            self.all_features,
            all_results,
            self.hparams.n_best_size,
            self.hparams.max_answer_length,
            self.hparams.do_lower_case,
            self.hparams.ckpt_path / self.hparams.output_prediction_file.format(self.global_step),
            self.hparams.ckpt_path / self.hparams.output_nbest_file.format(self.global_step),
            self.hparams.ckpt_path / self.hparams.output_null_log_odds_file.format(self.global_step),
            self.hparams.verbose_logging,
            self.hparams.version_2_with_negative,
            self.hparams.null_score_diff_threshold,
            self.tokenizer,
        )
        results = squad_evaluate(self.all_examples, predictions)
        accuracy = results["exact"]
        f1 = results["f1"]
        self.log("accuracy", accuracy, on_epoch=True, prog_bar=True)
        self.log("f1", f1, on_epoch=True, prog_bar=True)

    def configure_optimizers(self):
        t_total = self.total_steps()
        
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparams.weight_decay,
            },
            {
                "params": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)], 
                "weight_decay": 0.0
            },
        ]
        optimizer = AdamW(
            params=optimizer_grouped_parameters, 
            lr=self.hparams.learning_rate, 
            eps=self.hparams.adam_epsilon
        )
        scheduler = get_linear_schedule_with_warmup(
            optimizer=optimizer, 
            num_warmup_steps=int(t_total * self.hparams.warmup_proportion), 
            num_training_steps=t_total
        )
        
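        return {
            'optimizer': optimizer,
            # Lightning reads the scheduler from the `lr_scheduler` key; step it every batch
            # because `t_total` is counted in training steps, not epochs
            'lr_scheduler': {'scheduler': scheduler, 'interval': 'step'},
        }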

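    def total_steps(self):
        r"""
        source: https://github.com/PyTorchLightning/pytorch-lightning/issues/1038
        """
        # `train_dataloader()` returns a list of loaders, so `len()` of it counts files, not batches.
        # Assumption: Lightning cycles the shorter loaders, so an epoch lasts as long as the longest one.
        steps_per_epoch = max(len(loader) for loader in self.train_dataloader())
        return steps_per_epoch * self.hparams.num_train_epochs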
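
The scheduler built in `configure_optimizers` scales the base learning rate by a factor that rises linearly during warmup and then falls linearly to zero. The plain-Python sketch below reflects my understanding of that shape; the function name `linear_warmup_decay` and the step counts are illustrative assumptions, not the library's code.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp-up, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# 100 training steps with 10 warmup steps (made-up numbers).
multipliers = [linear_warmup_decay(s, 10, 100) for s in (0, 5, 10, 55, 100)]
print(multipliers)  # [0.0, 0.5, 1.0, 0.5, 0.0]

# With warmup_proportion = 0, as in args_dict, training starts at the full rate.
print(linear_warmup_decay(0, 0, 100))  # 1.0
```

The actual learning rate at a step is `learning_rate` times this multiplier, which is why the total number of training steps has to be known before training starts.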

In [13]:
def main(args_dict):
    print("[INFO] Using PyTorch Ver", torch.__version__)
    print("[INFO] Seed:", args_dict["random_seed"])
    checkpoint_callback = pl.callbacks.ModelCheckpoint(
        filename="{epoch}-{f1:.4f}",  # Lightning already expands {epoch} to "epoch=3" and {f1} to "f1=..."
        monitor="f1",
        save_top_k=3,
        mode="max",
    )
    pl.seed_everything(args_dict["random_seed"])
    model = Model(**args_dict)
    
    print("[INFO] Start FineTuning")
    trainer = pl.Trainer(
        callbacks=[checkpoint_callback],
        max_epochs=args_dict["num_train_epochs"],
        deterministic=torch.cuda.is_available(),
        gpus=-1 if torch.cuda.is_available() else None,
        num_sanity_val_steps=0
    )
    trainer.fit(model)
    return model
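
The checkpoint callback monitors `f1`, which `validation_epoch_end` logs next to the exact-match score. As a rough sketch of what these two metrics measure, the functions below compare whitespace tokens only. This is a simplification: the names `exact_match` and `token_f1` are hypothetical, and the SQuAD evaluation also normalizes the text before comparing.

```python
from collections import Counter

def exact_match(prediction, truth):
    """1.0 if both answers have the same whitespace tokens in the same order."""
    return float(prediction.split() == truth.split())

def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.split()
    true_tokens = truth.split()
    if not pred_tokens or not true_tokens:
        # an empty answer only matches another empty answer
        return float(pred_tokens == true_tokens)
    common = Counter(pred_tokens) & Counter(true_tokens)  # shared tokens, with counts
    n_same = sum(common.values())
    if n_same == 0:
        return 0.0
    precision = n_same / len(pred_tokens)
    recall = n_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("천중핑", "천중핑"))  # 1.0
print(token_f1("보조 교통 경찰 천중핑", "천중핑"))  # 0.4
```

The second prediction contains the right name but also three extra tokens, so its recall is 1.0 while its precision is only 0.25.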

In [14]:
main(args_dict)

Global seed set to 77


[INFO] Using PyTorch Ver 1.6.0
[INFO] Seed: 77


Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForQuestionAnswering: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForQuestionAnswering were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: ['qa_outputs.weight'
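
The warning above compares the parameter names stored in the checkpoint with the names the new model expects. A toy sketch of that comparison, using a few names from the log and from the model printout (the two sets are deliberately incomplete and serve only as an illustration):

```python
# Parameter names stored in the pre-trained discriminator checkpoint (subset).
checkpoint_keys = {
    "electra.embeddings.word_embeddings.weight",
    "discriminator_predictions.dense.weight",
    "discriminator_predictions.dense.bias",
}
# Parameter names expected by the question-answering model (subset).
model_keys = {
    "electra.embeddings.word_embeddings.weight",
    "qa_outputs.weight",
}

unused = sorted(checkpoint_keys - model_keys)             # in the checkpoint, not in the model
newly_initialized = sorted(model_keys - checkpoint_keys)  # in the model, not in the checkpoint
print(unused)
print(newly_initialized)
```

The discriminator head is dropped and the QA head starts from random weights, which is expected when fine-tuning a pre-trained encoder on a new task.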

[INFO] Processing: test_train_0.json | Cache file name: test_train_cache_0
[INFO] cache file already exists! passing the procedure
[INFO] Path: /home/simonjisu/code/ckpt/test_train_cache_0
[INFO] Processing: test_train_1.json | Cache file name: test_train_cache_1
[INFO] cache file already exists! passing the procedure
[INFO] Path: /home/simonjisu/code/ckpt/test_train_cache_1
[INFO] Processing: test_val_0.json | Cache file name: test_val_cache_0
[INFO] cache file already exists! passing the procedure
[INFO] Path: /home/simonjisu/code/ckpt/test_val_cache_0
[INFO] Processing: test_val_1.json | Cache file name: test_val_cache_1
[INFO] cache file already exists! passing the procedure
[INFO] Path: /home/simonjisu/code/ckpt/test_val_cache_1
[INFO] Start FineTuning
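
The log above shows the validation set cached as two files. Each file is converted separately, so its features restart from the same `unique_id` and from `example_index` 0. The sketch below illustrates the intended renumbering across files with a hypothetical stand-in class `Feature` and a hypothetical helper `reindex`; it is not the code used in `Model`.

```python
from dataclasses import dataclass

@dataclass
class Feature:  # hypothetical stand-in for SquadFeatures
    example_index: int
    unique_id: int

def reindex(files, first_unique_id=1000000000):
    """Give every feature a globally unique id and a global example index."""
    next_unique_id = first_unique_id
    next_example_index = 0
    for features in files:
        previous_local_index = None  # per-file index of the previous feature
        for fea in features:
            fea.unique_id = next_unique_id
            next_unique_id += 1
            if fea.example_index != previous_local_index:
                # first window of a new example: take the next global index
                previous_local_index = fea.example_index
                current_global_index = next_example_index
                next_example_index += 1
            fea.example_index = current_global_index
    return files

# Two files; the first example of each file is split into two windows.
files = [
    [Feature(0, 1000000000), Feature(0, 1000000001), Feature(1, 1000000002)],
    [Feature(0, 1000000000), Feature(0, 1000000001)],
]
reindex(files)
print([f.example_index for fs in files for f in fs])  # [0, 0, 1, 2, 2]
print([f.unique_id for fs in files for f in fs])
```

After renumbering, windows of the same example still share one index, and no id from the second file collides with an id from the first.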



Model(
  (model): ElectraForQuestionAnswering(
    (electra): ElectraModel(
      (embeddings): ElectraEmbeddings(
        (word_embeddings): Embedding(35000, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): ElectraEncoder(
        (layer): ModuleList(
          (0): ElectraLayer(
            (attention): ElectraAttention(
              (self): ElectraSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): ElectraSelfOutput(
                (dense): Linear(in_features=768, out_features

---