# HuggingFace Tutorial

These are my notes from learning how to use Transformer models with the Hugging Face libraries.

# References
- https://huggingface.co/
- https://huggingface.co/transformers/
- https://github.com/huggingface/datasets

# Installation

```bash
$ pip install transformers
```

In [149]:
import numpy as np

import torch
import torchtext

print(f"PyTorch Version: {torch.__version__}")
print(f"TorchText Version: {torchtext.__version__}")  

PyTorch Version: 1.6.0
TorchText Version: 0.8.0a0+c851c3e


# Datasets

Install `sentencepiece` together with `datasets`:

```bash
$ pip install sentencepiece
$ pip install datasets
```

# How to use?

## Pipeline

- ConversationalPipeline
- FeatureExtractionPipeline
- FillMaskPipeline
- QuestionAnsweringPipeline
- SummarizationPipeline
- TextClassificationPipeline
- TextGenerationPipeline
- TokenClassificationPipeline
- TranslationPipeline
- ZeroShotClassificationPipeline
- Text2TextGenerationPipeline
- TableQuestionAnsweringPipeline

The `pipeline` function takes a task name and returns the matching pipeline:

- "feature-extraction": will return a FeatureExtractionPipeline.
- "sentiment-analysis": will return a TextClassificationPipeline.
- "ner": will return a TokenClassificationPipeline.
- "question-answering": will return a QuestionAnsweringPipeline.
- "fill-mask": will return a FillMaskPipeline.
- "summarization": will return a SummarizationPipeline.
- "translation_xx_to_yy": will return a TranslationPipeline.
- "text2text-generation": will return a Text2TextGenerationPipeline.
- "text-generation": will return a TextGenerationPipeline.
- "zero-shot-classification": will return a ZeroShotClassificationPipeline.
- "conversation": will return a ConversationalPipeline.

Models are downloaded automatically and cached under `~/.cache/huggingface/`.

In [2]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis", framework="pt")

In [3]:
sentences = [
    "We are very happy to show you the 🤗 Transformers library.",
    "I'll go to Apple Store.",
    "This model covers a lot area. But, I won't use it. Since it is too hard to use."
]

In [4]:
results = classifier(sentences)
for res in results:
    print(f"label: {res['label']}, with score: {round(res['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.9812
label: NEGATIVE, with score: 0.9983


# Fine-Tuning with Custom Dataset

https://huggingface.co/transformers/custom_datasets.html#qa-squad


## Dataset


- Title (title)
- Category of the passage (source)
- Passage (context)
- Question number (id)
- 5W1H question type (classtype)
- Question (question)
- Start position of the answer (answer_start)
- Answer (text)

In [2]:
import json
from tqdm import tqdm
from pathlib import Path
repo_path = Path().absolute().parent
data_path = repo_path.parent / "data" / "AIhub" / "QA"
for p in data_path.glob("ko*.json"):
    print(p)

/home/simonjisu/code/data/AIhub/QA/ko_nia_normal_squad_all.json
/home/simonjisu/code/data/AIhub/QA/ko_nia_clue0529_squad_all.json
/home/simonjisu/code/data/AIhub/QA/ko_nia_noanswer_squad_all.json


In [3]:
def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    for group in tqdm(squad_dict["data"], total=len(squad_dict["data"]), desc="Reading Dataset"):
        for paragraph in group['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers
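
To make the nesting that `read_squad` walks through concrete, here is a minimal sketch on a tiny hand-made dictionary. The sample strings and the `flatten_squad` name are invented for illustration; only the `data` / `paragraphs` / `qas` / `answers` keys come from the code above.

```python
# Hypothetical miniature of a SQuAD-style file; the strings are invented.
squad_dict = {
    "data": [
        {
            "title": "toy",
            "paragraphs": [
                {
                    "context": "Seoul is the capital of Korea.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of Korea?",
                            "answers": [{"text": "Seoul", "answer_start": 0}],
                        },
                        {
                            "id": "q2",
                            "question": "Seoul is the capital of which country?",
                            "answers": [{"text": "Korea", "answer_start": 24}],
                        },
                    ],
                }
            ],
        }
    ]
}


def flatten_squad(squad_dict):
    """Return one (context, question, answer) triple per answer."""
    triples = []
    for group in squad_dict["data"]:
        for paragraph in group["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    triples.append((paragraph["context"], qa["question"], answer))
    return triples


triples = flatten_squad(squad_dict)
for context, question, answer in triples:
    # answer_start is a character offset into the context
    start = answer["answer_start"]
    end = start + len(answer["text"])
    print(question, "->", context[start:end])
```

Because a context is repeated once per answer, the flattened lists can be much longer than the number of entries in `data`, which matches the counts printed below.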

In [5]:
train_file = "ko_nia_normal_squad_all.json"
train_path = data_path / train_file
val_file = "ko_nia_clue0529_squad_all.json"
val_path = data_path / val_file

train_contexts, train_questions, train_answers = read_squad(train_path)
val_contexts, val_questions, val_answers = read_squad(val_path)

Reading Dataset: 100%|██████████| 47314/47314 [00:00<00:00, 443209.57it/s]
Reading Dataset: 100%|██████████| 34500/34500 [00:00<00:00, 554466.23it/s]


In [6]:
print(len(train_contexts), len(val_contexts))

243425 96663


Let's look at a few samples, with the answer highlighted inside its context.

In [7]:
import termcolor

for idx in np.random.randint(0, len(train_contexts), size=(2,)):
    txt = train_answers[idx]["text"]
    context = train_contexts[idx].split(txt)
    context.insert(1, termcolor.colored(txt, "red", attrs=["bold"]))
    answer_end = train_answers[idx]['answer_start'] + len(train_answers[idx]['text'])  # exclusive end, like Python's range
    print(termcolor.colored("Context: ", attrs=["bold"]))
    print("".join(context))
    print(termcolor.colored("Question: ", attrs=["bold"]))
    print(train_questions[idx])
    print(termcolor.colored("Answer: ", attrs=["bold"]))
    print(f"  Start: {train_answers[idx]['answer_start']}, End: {answer_end}")
    print()

Context: 
지난해 자산관리공사(캠코)가 매입해준 대형 저축은행 상당수에서 PF 부실이 추가로 발생하는 등 건전성이 크게 개선되지 못한 것으로 드러났다. 금융당국으로서는 '땜질식 처방으로 저축은행 부실을 키웠다'는 비판을 면키 어렵게 됐다. 금융당국은 하반기 저축은행에 대한 대대적인 경영진단을 통해 경영건전화를 추진하겠다고 밝혔지만 이번에도 미온적인 조치에 그치는 게 아니냐는 우려가 나온다. ▶관련기사 12면 5일 공적자금관리위원회가 공개한 '저축은행별 인수대상 PF대출 채권 현황'(법인채권)에 따르면 지난해 6월 캠코 인수 대상 PF대출이 1000억원을 넘는 저축은행이 16곳에 달했다. 이중 14곳은 자산 1조원 이상의 대형저축은행이었다. 솔로몬이 4879억원으로 가장 많았고, 같은 계열인 부산솔로몬이 2030억원으로 뒤를 이었다. 또 삼화(우리금융) 1913억원, 토마토 1870억원, 대전 1828억원, 현대스위스 1802억원에 달했다. 한국 1304억원, 경기 1093억원, 진흥 1004억원 등 한국계열 저축은행도 1000억원 이상 인수대상에 포함됐다. 당시 금융당국은 91개 저축은행 714개 PF사업장에 대해 사업성 평가를 실시하고 각 저축은행의 의사를 타진한 뒤 2조5277억원의 구조조정기금을 투입해 총 4조833억원의 PF부실 채권을 매입할 계획을 세웠다. 실제 매입과정에서 소송이나 경매에 들어간 채권을 제외하면서 매입규모는 3조7493억원으로 줄었지만 당초 계획에서는 크게 벗어나지 않았다. 이처럼 캠코가 지난해 대규모로 PF 부실채권을 인수해주었지만 저축은행의 사정은 나아지지 않았다. **솔로몬의 경우 지난해 6월말 9258억원이었던 PF 잔액이 지난 3월말 6131억원으로 줄었지만 연체율은 12.9%에서 25.17%로 급등했다.** 부산솔로몬 역시 PF잔액은 절반 가량 줄었지만 연체율은 두배 가량 상승했다. 토마토, 현대스위스, HK, 프라임, 한국 등 대부분 저축은행이 PF잔액이 줄었는데도 고정이하여신과 연체

In [13]:
def add_end_idx(answers, contexts):
    for idx, (answer, context) in enumerate(zip(answers, contexts)):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # sometimes squad answers are off by a character or two – fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
            print(f"type1: {idx}")
        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters
            print(f"type2: {idx}")
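
The three branches of `add_end_idx` all do the same thing with a different shift, so the idea can be sketched as a small search. This is an illustrative rewrite, not the tutorial's code: `align_answer` and the sample sentence are made up, and the function returns the span instead of mutating the answer dictionary.

```python
def align_answer(context, text, start, max_shift=2):
    """Return (start, end) such that context[start:end] == text, or None.

    Tries the annotated start first, then shifts left by up to `max_shift`
    characters, mirroring the three branches of `add_end_idx`.
    """
    for shift in range(max_shift + 1):
        s = start - shift
        e = s + len(text)
        if s >= 0 and context[s:e] == text:
            return s, e
    return None


context = "The forum opens on the 22nd in Seoul."
print(align_answer(context, "Seoul", 31))  # annotation is exact
print(align_answer(context, "Seoul", 32))  # annotation is one character late
print(align_answer(context, "Seoul", 10))  # nothing matches nearby
```

Returning `None` makes the unmatched case explicit. In `add_end_idx` an answer that matches none of the branches is left without an `answer_end` key, so a later lookup of `answer['answer_end']` would raise a `KeyError`.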

In [14]:
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

In [15]:
from transformers import ElectraModel, ElectraTokenizer, ElectraTokenizerFast, ElectraForQuestionAnswering

tokenizer = ElectraTokenizerFast.from_pretrained("monologg/koelectra-base-v3-discriminator")  
# The Fast tokenizer is required for the ._encodings attribute to exist.
# It holds a list of Encoding objects, for example:
# train_encodings._encodings[0]
# Encoding(num_tokens=365, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [30]:
sample_len = 5
train_encodings = tokenizer(
    train_contexts[:sample_len], train_questions[:sample_len], 
    max_length=512, truncation=True, padding=True
)
val_encodings = tokenizer(
    val_contexts[:sample_len], val_questions[:sample_len], 
    max_length=512, truncation=True, padding=True)

In [44]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'])

In [31]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))  # char_to_token: index of the token that contains this character
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers[:sample_len])
add_token_positions(val_encodings, val_answers[:sample_len])
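
`char_to_token` relies on the per-token character offsets that fast tokenizers record. A rough pure-Python sketch of that lookup follows. It uses a whitespace tokenizer as a stand-in, which is an assumption made for brevity: the real tokenizer splits text into WordPiece sub-tokens and adds special tokens, so the indices differ. Both function names are hypothetical.

```python
def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    offsets, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        offsets.append((start, end))
        pos = end
    return offsets


def char_to_token(offsets, char_index):
    """Return the index of the token covering char_index, or None."""
    for i, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return i
    return None


context = "The forum opens on the 22nd in Seoul."
offsets = whitespace_offsets(context)
answer_start, answer_end = 31, 36  # character span of "Seoul", end exclusive
start_token = char_to_token(offsets, answer_start)
end_token = char_to_token(offsets, answer_end - 1)  # last character of the answer
print(start_token, end_token)
print(char_to_token(offsets, 3))  # a space belongs to no token
```

The `- 1` matters because `answer_end` is exclusive: the character at `answer_end` may be whitespace or lie past the text, in which case the lookup returns `None`.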

In [32]:
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

In [33]:
from transformers import ElectraConfig

config = {
  "task": "QA",
  "data_dir": "data",
  "ckpt_dir": "ckpt",
  "train_file": train_file,
  "predict_file": val_file,
  "threads": 4,
  "version_2_with_negative": False,
  "null_score_diff_threshold": 0.0,
  "max_seq_length": 512,
  "doc_stride": 128,
  "max_query_length": 64,
  "max_answer_length": 30,
  "n_best_size": 20,
  "verbose_logging": True,
  "overwrite_output_dir": True,
  "evaluate_during_training": True,
  "eval_all_checkpoints": True,
  "save_optimizer": False,
  "do_lower_case": False,
  "do_train": True,
  "do_eval": True,
  "num_train_epochs": 7,
  "weight_decay": 0.0,
  "gradient_accumulation_steps": 1,
  "adam_epsilon": 1e-8,
  "warmup_proportion": 0,
  "max_steps": -1,
  "max_grad_norm": 1.0,
  "no_cuda": False,
  "model_type": "koelectra-base-v3",
  "model_name_or_path": "monologg/koelectra-base-v3-discriminator",
  "output_dir": "koelectra-base-v3-korquad-ckpt",
  "seed": 42,
  "train_batch_size": 2,
  "eval_batch_size": 2,
  "logging_steps": 1000,
  "save_steps": 1000,
  "learning_rate": 5e-5,
}
cfg = ElectraConfig(**config)

In [34]:
model = ElectraForQuestionAnswering(cfg).cpu()

In [23]:
from torch.utils.data import DataLoader
from transformers import AdamW

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=config["train_batch_size"], shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()

RuntimeError: CUDA error: device-side assert triggered
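
The notebook does not record what caused this error. One candidate worth ruling out: `ElectraConfig(**config)` builds a configuration from library defaults rather than from the KoELECTRA checkpoint, and `ElectraForQuestionAnswering(cfg)` creates randomly initialised weights, so the embedding table may be smaller than the tokenizer's vocabulary. A token id past the end of the table is a common source of device-side asserts. Below is a small sketch of such a check in plain Python; `out_of_range_ids` is a hypothetical helper and the numbers are illustrative, not measured from this run.

```python
def out_of_range_ids(input_ids, vocab_size):
    """Return (row, column, token_id) for every id outside [0, vocab_size)."""
    return [
        (row, col, token_id)
        for row, ids in enumerate(input_ids)
        for col, token_id in enumerate(ids)
        if not 0 <= token_id < vocab_size
    ]


# Illustrative numbers only: a config that expects 30522 embeddings
# and a batch containing one id beyond that table.
batch = [[2, 6500, 3], [2, 34000, 3]]
print(out_of_range_ids(batch, vocab_size=30522))
```

Re-running the failing cell on the CPU typically gives a more readable error than the CUDA assert. The later section loads both the config and the weights with `from_pretrained`, which keeps the model and tokenizer vocabularies consistent.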


https://colab.research.google.com/drive/1IPkZo1Wd-DghIOK6gJpcb0Dv4_Gv2kXB?usp=sharing#scrollTo=ImupuGXDGq7b

https://github.com/monologg/KoELECTRA

https://github.com/monologg/KoELECTRA/blob/master/finetune/run_squad.py

Korean Sentence Splitter: https://github.com/hyunwoongko/kss

## Is there a more efficient way?

In [3]:
from transformers import squad_convert_examples_to_features
from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor

In [4]:
processor = SquadV2Processor()
examples = processor.get_train_examples(data_dir=data_path, filename="test.json")  # examples are first tokenized on whitespace

100%|██████████| 5/5 [00:00<00:00, 377.92it/s]


In [5]:
from transformers import ElectraTokenizer, ElectraForQuestionAnswering
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

In [63]:
features, train_dataset = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=512,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
    return_dataset="pt",
    threads=4,
)

convert squad examples to features: 100%|██████████| 25/25 [00:00<00:00, 84.88it/s]
add example index and unique id: 100%|██████████| 25/25 [00:00<00:00, 59041.44it/s]
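
`max_seq_length` and `doc_stride` control how a long context is cut into overlapping windows. Here is a simplified sketch of the idea, assuming the window start advances by `doc_stride` tokens each time. The library's own bookkeeping for the question and special tokens is left out, `sliding_windows` is a hypothetical name, and the 445-token budget (512 minus a 64-token question and 3 special tokens) is purely illustrative.

```python
def sliding_windows(num_doc_tokens, max_doc_tokens, doc_stride):
    """Return (start, length) spans that together cover the whole document."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(max_doc_tokens, num_doc_tokens - start)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans


long_doc = sliding_windows(num_doc_tokens=1000, max_doc_tokens=445, doc_stride=128)
short_doc = sliding_windows(num_doc_tokens=300, max_doc_tokens=445, doc_stride=128)
print(long_doc)
print(short_doc)
```

A context that fits in one window yields a single feature, which is consistent with the 25 examples above producing 25 features.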


In [60]:
a = features[1]

In [46]:
a.start_position, a.end_position

(21, 31)

In [47]:
for i, t in enumerate(a.tokens[:35]):
    print(i, t)

0 [CLS]
1 서울
2 ##과
3 충북
4 괴산
5 ##에
6 ##서
7 '
8 국제
9 ##청
10 ##소년
11 ##포
12 ##럼
13 '
14 을
15 여
16 ##는
17 곳
18 ##은
19 ?
20 [SEP]
21 한국
22 ##청
23 ##소년단
24 ##체
25 ##협
26 ##의
27 ##회
28 ##와
29 여성
30 ##가족
31 ##부
32 ##는
33 22
34 ##일


In [7]:
class ARGS:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
        
args_dict = {
  "task": "korquad",
  "data_dir": "data",
  "ckpt_dir": "ckpt",
  "train_file": "KorQuAD_v1.0_train.json",
  "predict_file": "KorQuAD_v1.0_dev.json",
  "threads": 4,
  "version_2_with_negative": False,
  "null_score_diff_threshold": 0.0,
  "max_seq_length": 512,
  "doc_stride": 128,
  "max_query_length": 64,
  "max_answer_length": 30,
  "n_best_size": 20,
  "verbose_logging": True,
  "overwrite_output_dir": True,
  "evaluate_during_training": True,
  "eval_all_checkpoints": True,
  "save_optimizer": False,
  "do_lower_case": False,
  "do_train": True,
  "do_eval": True,
  "num_train_epochs": 7,
  "weight_decay": 0.0,
  "gradient_accumulation_steps": 1,
  "adam_epsilon": 1e-8,
  "warmup_proportion": 0,
  "max_steps": -1,
  "max_grad_norm": 1.0,
  "no_cuda": False,
  "model_type": "koelectra-base-v3",
  "model_name_or_path": "monologg/koelectra-base-v3-discriminator",
  "output_dir": "koelectra-base-v3-korquad-ckpt",
  "seed": 42,
  "train_batch_size": 8,
  "eval_batch_size": 32,
  "logging_steps": 1000,
  "save_steps": 1000,
  "learning_rate": 5e-5
}
     
args = ARGS(**args_dict)

In [8]:
from transformers import ElectraForQuestionAnswering, ElectraConfig
config = ElectraConfig.from_pretrained(args.model_name_or_path)
model = ElectraForQuestionAnswering.from_pretrained(args.model_name_or_path, config=config)

Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForQuestionAnswering: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForQuestionAnswering were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: ['qa_outputs.weight'

In [9]:
# device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cpu"

In [11]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_linear_schedule_with_warmup

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(t_total * args.warmup_proportion), num_training_steps=t_total
)

global_step = 1
epochs_trained = 0
steps_trained_in_current_epoch = 0
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad()
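
To see what the scheduler does to the learning rate, here is a sketch of the multiplier, assuming the usual linear warmup followed by linear decay to zero. `linear_warmup_decay` is a hypothetical stand-alone function; the step count is rebuilt from values in the cells above (25 features, batch size 8, 7 epochs, no warmup).

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


steps_per_epoch = -(-25 // 8)       # ceil(25 / 8): the last batch is smaller
t_total = steps_per_epoch // 1 * 7  # gradient_accumulation_steps = 1, 7 epochs
base_lr = 5e-5
for step in (0, t_total // 2, t_total):
    print(step, base_lr * linear_warmup_decay(step, 0, t_total))
```

With `warmup_proportion` set to 0 the rate starts at its full value and only decays; a non-zero proportion would ramp it up from zero over the first steps.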

In [191]:
model.to(device)
model.train()
for step, batch in enumerate(train_dataloader):
    batch = tuple(t.to(device) for t in batch)

    inputs = {
        "input_ids": batch[0],
        "attention_mask": batch[1],
        "token_type_ids": batch[2],
        "start_positions": batch[3],
        "end_positions": batch[4],
    }
    break

In [13]:
input_ids=inputs["input_ids"]
attention_mask=inputs["attention_mask"]
token_type_ids=inputs["token_type_ids"]

In [14]:
o = model.electra.forward(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

In [15]:
o.last_hidden_state.size()

torch.Size([8, 512, 768])

In [16]:
fin_o = model.qa_outputs(o.last_hidden_state)

In [17]:
fin_o.size()

torch.Size([8, 512, 2])

In [19]:
outputs = model(**inputs)

In [20]:
outputs.start_logits.argmax(1)

tensor([411, 294,  38, 490, 491, 468, 468,  98])

In [21]:
inputs["start_positions"]

tensor([460,  16, 123,  37,  89, 308, 105, 304])

In [22]:
outputs.end_logits.argmax(1)

tensor([471, 196, 274, 105,  51, 223, 221, 279])

In [23]:
inputs["end_positions"]

tensor([481,  16, 160,  43,  94, 317, 106, 324])

**eval phase**

In [286]:
eval_examples = processor.get_dev_examples(data_dir=data_path, filename="test.json")  # examples are first tokenized on whitespace
eval_features, eval_dataset = squad_convert_examples_to_features(
    examples=eval_examples,
    tokenizer=tokenizer,
    max_seq_length=512,
    doc_stride=128,
    max_query_length=64,
    is_training=False,
    return_dataset="pt",
    threads=4,
)


100%|██████████| 5/5 [00:00<00:00, 671.76it/s]

convert squad examples to features: 100%|██████████| 25/25 [00:00<00:00, 83.48it/s]

add example index and unique id: 100%|██████████| 25/25 [00:00<00:00, 237234.39it/s]


In [None]:
for fea in eval_features:
    fea.unique_id -= 1000000000
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

In [140]:
all_results = []
for batch in eval_dataloader:
    model.eval()
    batch = tuple(t.to(device) for t in batch)

    with torch.no_grad():
        inputs = {
            "input_ids": batch[0],
            "attention_mask": batch[1],
            "token_type_ids": batch[2],
        }
        example_indices = batch[3]
        outputs = model(**inputs)
        
    for i, example_index in enumerate(example_indices):
        eval_feature = eval_features[example_index.item()]
        unique_id = int(eval_feature.unique_id)
        output = [to_list(o[i]) for o in outputs.values()]
        start_logits, end_logits = output
        result = SquadResult(unique_id, start_logits, end_logits)
        all_results.append(result)

In [113]:
def to_list(tensor):
    return tensor.detach().cpu().tolist()

In [121]:
from transformers.data.metrics.squad_metrics import (
    compute_predictions_logits,
    squad_evaluate
)

In [147]:
import os 
output_prediction_file = "./predictions.json"
output_nbest_file = "./nbest_predictions.json"
output_null_log_odds_file = "./null_odds.json"

In [148]:
predictions = compute_predictions_logits(
    eval_examples,
    eval_features,
    all_results,
    args.n_best_size,
    args.max_answer_length,
    args.do_lower_case,
    output_prediction_file,
    output_nbest_file,
    output_null_log_odds_file,
    args.verbose_logging,
    args.version_2_with_negative,
    args.null_score_diff_threshold,
    tokenizer,
)

In [181]:
for i in eval_examples:
    print(i.qas_id, i.answers[0]["text"])

c1_57059-1 한국청소년단체협의회와 여성가족부
c1_57060-1 22일부터 28일
c1_57061-1 '청소년과 뉴미디어'
c1_57062-1 기조강연을 시작으로 국가별 주제관련 사례발표, 그룹 토론 및 전체총회, '청소년선언문' 작성 및 채택 등 다양한 프로그램을 운영한다.
m5_306705-1 샐리
c1_151305-1 보조 교통 경찰로 일하는 천중핑
c1_151306-1 지난달 28일
c1_151307-1 구이저우성 카일리시
c1_151308-1 ‘중국의 좋은 이웃상’과 함께 상금 1만 위안(약 170만원)을 수여
c1_151309-1 이틀 간의 코마 상태 이후 의식을 회복해 지난 2일부터 중환자실에서 치료를 받고 있습니다
c1_151310-1 열쇠공이 문을 따는 소리에 겁을 먹고 창문 밖으로 도망을 치려다
c1_151311-1 아이가 잠든 사이 돌보던 아이의 할머니가 쓰레기를 버리러 나갔다가 문이 잠기는 바람에 열쇠공을 불렀던 것
c1_36509-1 조달청
c1_36510-1 2012년
c1_36511-1 소송 제재 자체에 문제가 있었다는 의미
c1_36512-1 계약심사협의회에서 내부 심의를 거쳐 부정당업체로 등록을 하고 제재를 실시
c1_36513-1 지방계약법이 개정되면서 지방자치단체의 부정당업자 제재권한이 조달청으로 이관된 것
c1_81296-1 NASA
c1_81297-1 2004년 1월
c1_81298-1 화성
c1_81299-1 운석으로 보이는 특이한 암석
c1_81300-1 화성으로 하강할 때 사용된 열 차폐 방패(heat shield)에 접근하면서
c1_81301-1 오퍼튜니티의 적외선 분광계인 Mini-TES의 분석에 따르면, 이 암석에서는 화성 암석에서 나오는 전형적인 열 적외선이 나오지 않았기 때문에
c1_81302-1 이 암석 근처에 머물면서 이것이 운석인지를 확실하게 알아낼 것
c1_81303-1 표면이 울퉁불퉁한 홈이 파여 있기 때문에


In [173]:
predictions

OrderedDict([('c1_57059-1', "'국제청소년포럼'을 연다고 21일 밝혔다. 한국 미국 캐나다 호주"),
             ('c1_57060-1', "'국제청소년포럼'을 연다고 21일 밝혔다. 한국 미국 캐나다 호주"),
             ('c1_57061-1', "'국제청소년포럼'을 연다고 21일 밝혔다. 한국 미국 캐나다 호주"),
             ('c1_57062-1', "'국제청소년포럼'을 연다고 21일 밝혔다. 한국 미국 캐나다 호주"),
             ('m5_306705-1', '훌라후프로 무대를 꾸몄다. 그러나'),
             ('c1_151305-1', '카일리시에 보조 교통 경찰로 일하는'),
             ('c1_151306-1', '카일리시에 보조 교통 경찰로 일하는'),
             ('c1_151307-1', '카일리시에 보조 교통 경찰로 일하는'),
             ('c1_151308-1', '카일리시에 보조 교통 경찰로 일하는'),
             ('c1_151309-1', '천중핑의 팔에'),
             ('c1_151310-1', '카일리시에 보조 교통 경찰로 일하는'),
             ('c1_151311-1', '카일리시에 보조 교통 경찰로 일하는'),
             ('c1_36509-1',
              '20건에 불과했던 가처분 소송은 지난해 116건으로 4년만에 5.8배 급증했다. 올해는'),
             ('c1_36510-1',
              '20건에 불과했던 가처분 소송은 지난해 116건으로 4년만에 5.8배 급증했다. 올해는'),
             ('c1_36511-1',
              '20건에 불과했던 가처분 소송은 지난해 116건으로 4년만에 5.8배 급증했다. 올해는'),
             ('c1_36512-1', 

In [171]:
results = squad_evaluate(eval_examples, predictions)

In [172]:
results

OrderedDict([('exact', 0.0),
             ('f1', 5.591246684350133),
             ('total', 25),
             ('HasAns_exact', 0.0),
             ('HasAns_f1', 5.591246684350133),
             ('HasAns_total', 25),
             ('best_exact', 0.0),
             ('best_exact_thresh', 0.0),
             ('best_f1', 5.591246684350133),
             ('best_f1_thresh', 0.0)])
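
An exact-match score of 0 next to a small non-zero F1 means no prediction equals its gold answer but a few share some tokens. A simplified sketch of the two metrics, using whitespace tokens only (the real SQuAD metric also normalises case, punctuation and English articles, and reports percentages). The two strings are the gold answer and the prediction for `c1_151305-1` from the outputs above.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the whitespace-tokenized strings are identical, else 0.0."""
    return float(prediction.split() == gold.split())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "보조 교통 경찰로 일하는 천중핑"
pred = "카일리시에 보조 교통 경찰로 일하는"
print(exact_match(pred, gold), token_f1(pred, gold))
```

This one pair overlaps on four of five tokens, while most other predictions share nothing with their gold answers, which pulls the average F1 down to a few percent.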

Why is the exact-match score 0? First check that the training features decode back to the right answer spans:

In [65]:
vocab_rev = {v: k for k, v in tokenizer.vocab.items()}
tostring = lambda x: " ".join(x).replace(" ##", "").replace("[PAD]", "").strip()
def show_original(idx, inputs):
    tokens = [vocab_rev[i.item()] for i in inputs["input_ids"][idx]]
    s, e = inputs["start_positions"][idx].item(), inputs["end_positions"][idx].item()
    print(s, e)
    print("Answer: ", tostring(tokens[s:(e+1)]))
    print(tostring(tokens))

In [66]:
dataloader = DataLoader(train_dataset, batch_size=args.train_batch_size, shuffle=False)

for step, batch in enumerate(dataloader):
    batch = tuple(t.to(device) for t in batch)

    inputs = {
        "input_ids": batch[0],
        "attention_mask": batch[1],
        "token_type_ids": batch[2],
        "start_positions": batch[3],
        "end_positions": batch[4],
    }
    print("---------------------"*3)
    print(f"[Batch] {step}")
    print("---------------------"*3)
    for i in range(len(batch)):
        show_original(i, inputs)
        print()
    print("====================="*3)

---------------------------------------------------------------
[Batch] 0
---------------------------------------------------------------
21 31
Answer:  한국청소년단체협의회와 여성가족부
[CLS] 서울과 충북 괴산에서 ' 국제청소년포럼 ' 을 여는 곳은 ? [SEP] 한국청소년단체협의회와 여성가족부는 22일부터 28일까지 서울과 충북 괴산에서 ' 국제청소년포럼 ' 을 연다고 21일 밝혔다 . 한국 미국 캐나다 호주 등 전 세계 32개국 75여명의 대학생 , 청소년들이 모여 전 세계적 현안문제에 대한 대안과 해결책을 모색하는 자리다 . 이번 포럼의 주제는 ' 청소년과 뉴미디어 ' 다 . 스마트폰 SNS 태블릿PC 등 새로운 커뮤니케이션 매체인 ' 뉴미디어 ' 에 대한 성찰과 문제점에 대해 토론한다 . 기조강연을 시작으로 국가별 주제관련 사례발표 , 그룹 토론 및 전체총회 , ' 청소년선언문 ' 작성 및 채택 등 다양한 프로그램을 운영한다 . 개회식은 22일 서울 방화동에 있는 국제청소년센터 국제회의장에서 한다 . 전 세계 32개국 대학생ㆍ청소년 참가자와 전국의 청소년기관단체장과 청소년지도자 여성가족부 주한외교사절 등 100여명이 참석할 예정이다 . 23일에는 유엔미래포럼 박영숙 대표가 ' 뉴미디어의 균형 있는 발전을 위한 청소년의 역할 ' 에 대해 기조강연을 한다 . 뉴미디어의 올바른 활용방안과 청소년문화의 형성에 대해 설명할 계획이다 . 27일 폐회식에서는 ' 청소년선언문 ' 을 채택한다 . 선언문에는 전 세계적으로 뉴미디어의 바람직한 발전을 촉구하며 각국 청년들이 함께 실천할 수 있는 내용 등이 담길 예정이다 . 한국청소년단체협의회는 포럼이 끝난 뒤 UN 등 국제기구와 참가자 각국 정부 등 국제사회에 선언문을 전달할 예정이다 . [SEP]

26 30
Answer:  22일부터 28일
[CLS] ' 국제 청소년포럼 ' 이 열리는 때는 ? [

In [1]:
from pathlib import Path
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import torchmetrics

from transformers import (
    ElectraForQuestionAnswering, 
    ElectraConfig, 
    ElectraTokenizer,
    AdamW,
    squad_convert_examples_to_features,
    get_linear_schedule_with_warmup
)

from transformers.data.processors.squad import SquadResult, SquadV2Processor
from transformers.data.metrics.squad_metrics import (
    compute_predictions_logits,
    squad_evaluate
)
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [2]:
train_file = "ko_nia_normal_squad_all.json"
val_file = "ko_nia_clue0529_squad_all.json"

repo_path = Path().absolute().parent
data_path = repo_path.parent / "data" / "AIhub" / "QA"
ckpt_path = repo_path.parent / "ckpt"
if not ckpt_path.exists():
    ckpt_path.mkdir()
else:
    for x in ckpt_path.glob("*"):
        x.unlink()
    ckpt_path.rmdir()
    ckpt_path.mkdir()

In [3]:
train_file = "test.json"
val_file = "test.json"
args_dict = {
  "task": "AIHub_QA",
  "data_path": data_path,
  "ckpt_path": ckpt_path,
  "train_file": train_file,
  "val_file": val_file,
  "cache_file": "{}_cache",
  "random_seed": 77,
  "threads": 4,
  "version_2_with_negative": False,
  "null_score_diff_threshold": 0.0,
  "max_seq_length": 512,
  "doc_stride": 128,
  "max_query_length": 64,
  "max_answer_length": 30,
  "n_best_size": 20,
  "verbose_logging": True,
  "do_lower_case": False,
  "num_train_epochs": 7,
  "weight_decay": 0.0,
  "adam_epsilon": 1e-8,
  "warmup_proportion": 0,
  "model_type": "koelectra-base-v3",
  "model_name_or_path": "monologg/koelectra-base-v3-discriminator",
  "output_dir": "koelectra-base-v3-korquad-ckpt",
  "seed": 42,
  "train_batch_size": 8,
  "eval_batch_size": 8,
#   "logging_steps": 1000,
#   "save_steps": 1000,
  "learning_rate": 5e-5,
  "output_prediction_file": "predictions_{}.json",
  "output_nbest_file": "nbest_predictions_{}.json",
  "output_null_log_odds_file": "null_odds_{}.json",
}

In [33]:
class Model(pl.LightningModule):
    def __init__(self, **kwargs):
        super().__init__()
        self.save_hyperparameters() 
        self.config = ElectraConfig.from_pretrained(self.hparams.model_name_or_path)
        self.model = ElectraForQuestionAnswering.from_pretrained(
            self.hparams.model_name_or_path, 
            config=self.config
        )
        self.tokenizer = ElectraTokenizer.from_pretrained(self.hparams.model_name_or_path)
        # create dataset and cache it
        self.create_dataset(state="train")
        self.create_dataset(state="val")
        self.eval_examples, self.eval_features = self.load_cache(state="val", return_dataset=False)
        # function
        self.tolist = lambda x: x.detach().cpu().tolist()
        
    def create_dataset(self, state:str="train"):
        r"""
        Args:
            state: train or val
        """
        processor = SquadV2Processor()
        if state == "train":
            filename = self.hparams.train_file
            is_training = True
        elif state == "val":
            filename = self.hparams.val_file
            is_training = False
        else:
            raise ValueError("state should be train or val")
            
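        # Dev examples keep the full `answers` list that `squad_evaluate` compares against
        get_examples = processor.get_train_examples if is_training else processor.get_dev_examples
        examples = get_examples(
            data_dir=self.hparams.data_path,
            filename=filename
        )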
        features, dataset = squad_convert_examples_to_features(
            examples=examples,
            tokenizer=self.tokenizer,
            max_seq_length=self.hparams.max_seq_length,
            doc_stride=self.hparams.doc_stride,
            max_query_length=self.hparams.max_query_length,
            is_training=is_training,
            return_dataset="pt",
            threads=self.hparams.threads,
        )
        # TODO: Need to figure out why the unique id starts from 1000000000
        # https://huggingface.co/transformers/_modules/transformers/data/processors/squad.html
#         if state == "val":
#             for fea in features:
#                 fea.unique_id -= 1000000000

        cache = dict(dataset=dataset, examples=examples, features=features)
        torch.save(cache, self.hparams.ckpt_path / self.hparams.cache_file.format(state))        

    def load_cache(self, state:str="train", return_dataset=True):
        cache = torch.load(self.hparams.ckpt_path / self.hparams.cache_file.format(state))
        dataset, examples, features = cache["dataset"], cache["examples"], cache["features"]

        if return_dataset:
            return dataset
        else:
            return examples, features

    def forward(self, **kwargs):
        return self.model(**kwargs)

    def training_step(self, batch, batch_idx):
        inputs_ids, attention_mask, token_type_ids, start_positions, end_positions, *_ = batch

        outputs = self(
            input_ids=inputs_ids, 
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            start_positions=start_positions,
            end_positions=end_positions
        )

        loss = outputs.loss
        return  {'loss': loss}

    def validation_step(self, batch, batch_idx):
        inputs_ids, attention_mask, token_type_ids, example_indices, *_ = batch
        
        outputs = self(
            input_ids=inputs_ids, 
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            start_positions=None,
            end_positions=None
        )
        
        batch_results = []
        
        for i, example_index in enumerate(example_indices):
            eval_feature = self.eval_features[example_index.item()]
            unique_id = int(eval_feature.unique_id)
            output = [self.tolist(o[i]) for o in outputs.values()]
            start_logits, end_logits = output
            result = SquadResult(unique_id, start_logits, end_logits)
            batch_results.append(result)
            
        return batch_results
    
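    def training_epoch_end(self, outputs):
        # Lightning's hook is `training_epoch_end`; a method named `train_epoch_end` is never called
        loss = torch.tensor(0, dtype=torch.float)
        for out in outputs:
            loss += out["loss"].detach().cpu()
        loss = loss / len(outputs)
        self.log("train_loss", loss, prog_bar=True)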

    def validation_epoch_end(self, outputs):
        all_results = []
        for res in outputs:
            all_results += res

        predictions = compute_predictions_logits(
            self.eval_examples,
            self.eval_features,
            all_results,
            self.hparams.n_best_size,
            self.hparams.max_answer_length,
            self.hparams.do_lower_case,
            self.hparams.ckpt_path / self.hparams.output_prediction_file.format(self.global_step),
            self.hparams.ckpt_path / self.hparams.output_nbest_file.format(self.global_step),
            self.hparams.ckpt_path / self.hparams.output_null_log_odds_file.format(self.global_step),
            self.hparams.verbose_logging,
            self.hparams.version_2_with_negative,
            self.hparams.null_score_diff_threshold,
            self.tokenizer,
        )
        results = squad_evaluate(self.eval_examples, predictions)
        print(results)
        accuracy = results["exact"]
        f1 = results["f1"]
        self.log("accuracy", accuracy, on_epoch=True, prog_bar=True)
        self.log("f1", f1, on_epoch=True, prog_bar=True)

    def configure_optimizers(self):
        t_total = self.total_steps()
        
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparams.weight_decay,
            },
            {
                "params": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)], 
                "weight_decay": 0.0
            },
        ]
        optimizer = AdamW(
            params=optimizer_grouped_parameters, 
            lr=self.hparams.learning_rate, 
            eps=self.hparams.adam_epsilon
        )
        scheduler = get_linear_schedule_with_warmup(
            optimizer=optimizer, 
            num_warmup_steps=int(t_total * self.hparams.warmup_proportion), 
            num_training_steps=t_total
        )
        
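        # Lightning looks for the scheduler under the "lr_scheduler" key.
        # The warmup schedule is defined in optimizer steps (t_total), so it
        # is stepped once per batch rather than once per epoch.
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }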

    def create_dataloader(self, state:str="train"):
        r"""
        Args:
            state: train or val
        """
        if state == "train":
            shuffle = True
            batch_size = self.hparams.train_batch_size
        elif state == "val":
            shuffle = False
            batch_size = self.hparams.eval_batch_size
        else:
            raise ValueError("state should be train or val")
        dataset = self.load_cache(state, return_dataset=True)
        dataloader = DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            num_workers=self.hparams.threads
        )
        return dataloader

    def train_dataloader(self):
        return self.create_dataloader(state="train")

    def val_dataloader(self):
        return self.create_dataloader(state="val")
    
    def total_steps(self):
        r"""
        source: https://github.com/PyTorchLightning/pytorch-lightning/issues/1038
        """
        return len(self.train_dataloader()) * self.hparams.num_train_epochs
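
The learning-rate schedule configured in `configure_optimizers` can be sketched in plain Python. The function below follows the linear warmup and linear decay shape that `get_linear_schedule_with_warmup` is documented to produce; it is an illustration, not the library code. The numbers (`steps_per_epoch`, `num_train_epochs`, `warmup_proportion`, `base_lr`) are hypothetical and not taken from this notebook.

```python
def linear_warmup_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at num_training_steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Hypothetical values for illustration only.
steps_per_epoch = 250      # stands in for len(self.train_dataloader())
num_train_epochs = 4
warmup_proportion = 0.1
base_lr = 5e-5

# Same arithmetic as total_steps() and num_warmup_steps in the class above.
t_total = steps_per_epoch * num_train_epochs
num_warmup_steps = int(t_total * warmup_proportion)

for step in (0, num_warmup_steps // 2, num_warmup_steps, t_total // 2, t_total):
    lr = base_lr * linear_warmup_factor(step, num_warmup_steps, t_total)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Because `t_total` counts optimizer steps, the schedule only reaches zero at the end of training if the scheduler is advanced once per batch.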


In [34]:
def main(args_dict):
    print("[INFO] Using PyTorch Ver", torch.__version__)
    print("[INFO] Seed:", args_dict["random_seed"])
    checkpoint_callback = pl.callbacks.ModelCheckpoint(
        filename="{epoch}-{f1:.4f}",  # Lightning already prefixes each value with its metric name
        monitor="f1",
        save_top_k=3,
        mode="max",
    )
    pl.seed_everything(args_dict["random_seed"])
    model = Model(**args_dict)
    
    print("[INFO] Start FineTuning")
    trainer = pl.Trainer(
        callbacks=[checkpoint_callback],
        max_epochs=args_dict["num_train_epochs"],
        deterministic=torch.cuda.is_available(),
        gpus=-1 if torch.cuda.is_available() else None,
    )
    trainer.fit(model)

In [35]:
print("[INFO] Using PyTorch Ver", torch.__version__)
print("[INFO] Seed:", args_dict["random_seed"])
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    filename="{epoch}-{f1:.4f}",  # Lightning already prefixes each value with its metric name
    monitor="f1",
    save_top_k=3,
    mode="max",
)
pl.seed_everything(args_dict["random_seed"])
model = Model(**args_dict)

print("[INFO] Start FineTuning")
trainer = pl.Trainer(
    callbacks=[checkpoint_callback],
    max_epochs=args_dict["num_train_epochs"],
    deterministic=torch.cuda.is_available(),
    gpus=-1 if torch.cuda.is_available() else None,
)
trainer.fit(model)

Global seed set to 77


[INFO] Using PyTorch Ver 1.6.0
[INFO] Seed: 77


Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForQuestionAnswering: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForQuestionAnswering were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: ['qa_outputs.weight'

[INFO] Start FineTuning


HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validation sanity check', layout=Layout…

KeyError: 1000000016

main(args_dict)
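
A possible explanation for the `KeyError` (not confirmed by the notebook itself): the progress bar shows the failure happens during the "Validation sanity check". By default PyTorch Lightning runs only 2 validation batches there (`num_sanity_val_steps=2`), so `validation_epoch_end` receives results for only the first few features, while `compute_predictions_logits` looks up a result for every feature in `self.eval_features`. The sketch below reproduces that situation with plain dictionaries. `num_features` and `eval_batch_size` are assumed values chosen for illustration.

```python
# Hypothetical sizes for illustration; the notebook does not state them.
num_features = 32          # features produced for the validation set
eval_batch_size = 8        # assumed value of hparams.eval_batch_size
num_sanity_val_steps = 2   # PyTorch Lightning's default

first_id = 1000000000
feature_ids = [first_id + i for i in range(num_features)]

# During the sanity check only the first few batches reach validation_step.
seen = feature_ids[: eval_batch_size * num_sanity_val_steps]
unique_id_to_result = {uid: f"result-{uid}" for uid in seen}

# Post-processing walks over every feature, so the first unseen id fails.
first_missing = None
try:
    for uid in feature_ids:
        unique_id_to_result[uid]
except KeyError as err:
    first_missing = err.args[0]

print("results available:", len(unique_id_to_result))
print("first missing id :", first_missing)
```

Under these assumptions the first missing id is 1000000016, the same key reported above. Two guards that may help, depending on the Lightning version: pass `num_sanity_val_steps=0` to `pl.Trainer`, or return early from `validation_epoch_end` while the trainer is in its sanity check (the flag is `trainer.running_sanity_check` in older releases and `trainer.sanity_checking` in newer ones, so check the installed version).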

In [59]:
all_results = []
for batch_idx, batch in enumerate(model.val_dataloader()):
    outputs = model.validation_step([b.to("cuda") for b in batch], batch_idx)
    all_results += outputs

In [60]:
all_results

[<transformers.data.processors.squad.SquadResult at 0x7f4c407876d0>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c40755210>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c40755290>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c40755050>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c407555d0>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c42a66650>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c42a66f90>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c42a66b90>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c40787f50>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c42a664d0>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c42a66d50>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c42a660d0>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c42a66f50>,
 <transformers.data.processors.squad.SquadResult at 0x7f4c406e8050>,
 <transformers.data.processors.squ

In [61]:
import collections
all_features = model.eval_features
all_examples = model.eval_examples
example_index_to_features = collections.defaultdict(list)
for feature in all_features:
    example_index_to_features[feature.example_index].append(feature)


In [64]:
unique_id_to_result = {}
for result in all_results:
    unique_id_to_result[result.unique_id] = result

In [66]:
all_predictions = collections.OrderedDict()
all_nbest_json = collections.OrderedDict()
scores_diff_json = collections.OrderedDict()

for (example_index, example) in enumerate(all_examples):
    features = example_index_to_features[example_index]

In [67]:
unique_id_to_result

{1000000000: <transformers.data.processors.squad.SquadResult at 0x7f4c407876d0>,
 1000000001: <transformers.data.processors.squad.SquadResult at 0x7f4c40755210>,
 1000000002: <transformers.data.processors.squad.SquadResult at 0x7f4c40755290>,
 1000000003: <transformers.data.processors.squad.SquadResult at 0x7f4c40755050>,
 1000000004: <transformers.data.processors.squad.SquadResult at 0x7f4c407555d0>,
 1000000005: <transformers.data.processors.squad.SquadResult at 0x7f4c42a66650>,
 1000000006: <transformers.data.processors.squad.SquadResult at 0x7f4c42a66f90>,
 1000000007: <transformers.data.processors.squad.SquadResult at 0x7f4c42a66b90>,
 1000000008: <transformers.data.processors.squad.SquadResult at 0x7f4c40787f50>,
 1000000009: <transformers.data.processors.squad.SquadResult at 0x7f4c42a664d0>,
 1000000010: <transformers.data.processors.squad.SquadResult at 0x7f4c42a66d50>,
 1000000011: <transformers.data.processors.squad.SquadResult at 0x7f4c42a660d0>,
 1000000012: <transformers.d

In [74]:
for (example_index, example) in enumerate(all_examples):
    features = example_index_to_features[example_index]
    for (feature_index, feature) in enumerate(features):
        print(feature.unique_id)
        result = unique_id_to_result[feature.unique_id]
        print(result.unique_id)

1000000000
1000000000
1000000001
1000000001
1000000002
1000000002
1000000003
1000000003
1000000004
1000000004
1000000005
1000000005
1000000006
1000000006
1000000007
1000000007
1000000008
1000000008
1000000009
1000000009
1000000010
1000000010
1000000011
1000000011
1000000012
1000000012
1000000013
1000000013
1000000014
1000000014
1000000015
1000000015
1000000016
1000000016
1000000017
1000000017
1000000018
1000000018
1000000019
1000000019
1000000020
1000000020
1000000021
1000000021
1000000022
1000000022
1000000023
1000000023
1000000024
1000000024
1000000025
1000000025
1000000026
1000000026
1000000027
1000000027
1000000028
1000000028
1000000029
1000000029
1000000030
1000000030
1000000031
1000000031
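
The debugging cells above rebuild two lookups: features grouped by example, and results keyed by `unique_id`. As a minimal sketch, the same check can be written as a single coverage test that lists features with no result instead of printing every id. `Feature` and `Result` are hypothetical stand-ins for the transformers feature and `SquadResult` objects, carrying only the fields used here, and the data is made up.

```python
import collections

# Minimal stand-ins; only the fields used in this check.
Feature = collections.namedtuple("Feature", ["example_index", "unique_id"])
Result = collections.namedtuple("Result", ["unique_id", "start_logits", "end_logits"])

# Hypothetical data: example 0 was split into two features, example 1 into one.
all_features = [
    Feature(0, 1000000000),
    Feature(0, 1000000001),
    Feature(1, 1000000002),
]
# Only two results are available, as if validation stopped early.
all_results = [
    Result(1000000000, [0.1], [0.2]),
    Result(1000000001, [0.3], [0.4]),
]

# Group features by the example they were cut from.
example_index_to_features = collections.defaultdict(list)
for feature in all_features:
    example_index_to_features[feature.example_index].append(feature)

# Key results by unique_id for constant-time lookup.
unique_id_to_result = {result.unique_id: result for result in all_results}

# Any feature listed here would raise KeyError during post-processing.
missing_ids = [f.unique_id for f in all_features if f.unique_id not in unique_id_to_result]

print({idx: len(feats) for idx, feats in example_index_to_features.items()})
print("missing:", missing_ids)
```

An empty `missing_ids` list corresponds to the output above, where every `feature.unique_id` is matched by a result with the same id.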
