# Chapter 19. Question Answering with Transformer Models
## 1. Understanding Question Answering Systems

## 2. Question Answering with a Pipeline

In [1]:
from transformers import pipeline

question_answerer = pipeline("question-answering")

context = r'''Text mining, also referred to as text data mining (abbr.: TDM), similar to text analytics, 
        is the process of deriving high-quality information from text. It involves 
        "the discovery by computer of new, previously unknown information, 
        by automatically extracting information from different written resources." 
        Written resources may include websites, books, emails, reviews, and articles. 
        High-quality information is typically obtained by devising patterns and trends 
        by means such as statistical pattern learning. According to Hotho et al. (2005)
        we can distinguish between three different perspectives of text mining: 
        information extraction, data mining, and a KDD (Knowledge Discovery in Databases) process.''' 
question = "What is text mining?"
answer = question_answerer(question=question, context=context)
print(answer)
question2 = "What are the perspectives of text mining?"
answer2 = question_answerer(question=question2, context=context)
print("Question:", question2)
print("Answer:", answer2['answer'])
print("Context span used for the answer:", context[answer2['start']:answer2['end']])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.4241906404495239, 'start': 103, 'end': 161, 'answer': 'the process of deriving high-quality information from text'}
Question: What are the perspectives of text mining?
Answer: information extraction, data mining, and a KDD
Context span used for the answer: information extraction, data mining, and a KDD


The next section repeats by hand what the pipeline does internally:

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and is loaded with the weights stored in the checkpoint.
2. Define a text and a question.
3. Build a sequence from the question and the text, with the correct model-specific separators and attention masks.
4. Pass this sequence through the model. This outputs a score for every token in the sequence (question and text), for both the start and end positions.
5. Take the argmax of the start and end scores to find the most likely positions (applying a softmax first would turn the scores into probabilities without changing the argmax).
6. Fetch the tokens between the identified start and end positions and convert them to a string.
7. Print the results.
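
Steps 4 to 6 can be illustrated without downloading a model. The following is a minimal sketch in plain Python: the logits are made-up numbers, `softmax` and `best_span` are hypothetical helper names (not part of `transformers`), and scoring a span by the product of its start and end probabilities is an assumption of this sketch. Unlike two independent argmax calls, it only considers spans whose end does not precede their start.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) to probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end_exclusive, prob) of the most likely valid span."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best = (0, 1, 0.0)
    for s, ps in enumerate(p_start):
        # Only consider end positions at or after the start position
        for e in range(s, min(s + max_len, len(p_end))):
            if ps * p_end[e] > best[2]:
                best = (s, e + 1, ps * p_end[e])
    return best

# Made-up logits for a 6-token input (illustrative values only)
tokens = ["[CLS]", "what", "[SEP]", "text", "mining", "[SEP]"]
start_logits = [0.1, 0.2, 0.0, 4.0, 1.0, 0.1]
end_logits = [0.1, 0.0, 0.2, 1.5, 5.0, 0.3]

start, end, prob = best_span(start_logits, end_logits)
print(tokens[start:end], round(prob, 3))
```

The returned end index is exclusive, which matches the `+ 1` applied to the end position in the notebook code before slicing.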

## 3. Question Answering with Auto Classes

In [2]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
print("tokenizer type:", type(tokenizer))
model = AutoModelForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')
print("model type:", type(model))
# Use the GPU (cuda) if it is available; otherwise fall back to the CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)

# Tokenize the question and the context together
inputs = tokenizer(question, context, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

print("output type", type(outputs))

tokenizer type: <class 'transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast'>
model type: <class 'transformers.models.distilbert.modeling_distilbert.DistilBertForQuestionAnswering'>
output type <class 'transformers.modeling_outputs.QuestionAnsweringModelOutput'>


In [3]:
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# argmax over the start logits: index (within the full input sequence) of the token most likely to begin the answer
answer_start = torch.argmax(answer_start_scores)
# argmax over the end logits; add 1 so the index can serve as an exclusive slice end
answer_end = torch.argmax(answer_end_scores) + 1
print("start:", answer_start, ", end:", answer_end)

# Extract only the input_ids from the tokenized input
input_ids = inputs["input_ids"].tolist()[0]
# Convert the ids in the answer span to tokens, then join the tokens into a string
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)

print("Question:", question)
print("Answer:", answer)

start: tensor(35, device='cuda:0') , end: tensor(46, device='cuda:0')
Question: What is text mining?
Answer: the process of deriving high - quality information from text


In [4]:
from transformers import pipeline
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

text = "Alice is sitting on the bench. Bob is sitting next to her."

result = question_answerer(question="Who is the CEO?", context=text)
print(result)

{'score': 0.7526955008506775, 'start': 31, 'end': 34, 'answer': 'Bob'}
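
The context never mentions a CEO, yet the pipeline still returns a span ("Bob") with a fairly high score. One simple guard is to reject answers whose score falls below a threshold. The sketch below is only an illustration: `answer_or_none` and the threshold values are hypothetical choices, not part of `transformers`.

```python
def answer_or_none(result, threshold):
    """Return the answer text, or None when the score is below the threshold."""
    return result["answer"] if result["score"] >= threshold else None

text = "Alice is sitting on the bench. Bob is sitting next to her."
# The pipeline result printed above
result = {'score': 0.7526955008506775, 'start': 31, 'end': 34, 'answer': 'Bob'}

# 'start' and 'end' are character offsets into the context
print(text[result['start']:result['end']])

print(answer_or_none(result, threshold=0.5))  # the score exceeds 0.5, so the answer passes
print(answer_or_none(result, threshold=0.9))  # a stricter threshold rejects it
```

Because this wrong answer scores about 0.75, a moderate threshold does not catch it, so a score threshold alone is a weak safeguard against unanswerable questions.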


## 4. Fine-Tuning for Question Answering with the Trainer

In [5]:
import torch
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForQuestionAnswering

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
# distilbert-base-uncased has not been fine-tuned for question answering, so its QA head
# is newly initialized and the model cannot answer questions yet
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

In [6]:
context = """The city is the birthplace of many cultural movements, including the Harlem 
        Renaissance in literature and visual art; abstract expressionism 
        (also known as the New York School) in painting; and hip hop, punk, salsa, disco, 
        freestyle, Tin Pan Alley, and Jazz in music. New York City has been considered 
        the dance capital of the world. The city is also widely celebrated in popular lore, 
        frequently the setting for books, movies (see List of films set in New York City), 
        and television programs."""
question = "The dance capital of the world is what city in the US?"
inputs = tokenizer(question, context, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# argmax over the start logits: index (within the full input sequence) of the token most likely to begin the answer
answer_start = torch.argmax(answer_start_scores)
# argmax over the end logits; add 1 so the index can serve as an exclusive slice end
answer_end = torch.argmax(answer_end_scores) + 1
print("start:", answer_start, ", end:", answer_end)

# Extract only the input_ids from the tokenized input
input_ids = inputs["input_ids"].tolist()[0]
# Convert the ids in the answer span to tokens, then join the tokens into a string
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)

print("Question:", question)
print("Answer:", answer)  # no answer is produced

start: tensor(67, device='cuda:0') , end: tensor(12, device='cuda:0')
Question: The dance capital of the world is what city in the US?
Answer: 
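
The empty answer follows directly from the printed positions: the predicted start (67) lies after the predicted end (12), and a Python slice whose start is greater than its stop is empty. A small illustration, using placeholder tokens rather than the real tokenization:

```python
# Placeholder tokens; the real input is tokenized differently
tokens = [f"tok{i}" for i in range(100)]

answer_start, answer_end = 67, 12        # positions printed by the untrained QA head
span = tokens[answer_start:answer_end]   # start > end, so nothing is selected

print(span)
print(repr(" ".join(span)))
```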


In [7]:
from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")
squad = squad.train_test_split(test_size=0.2)
squad["train"][0]

Found cached dataset squad (C:/Users/user/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


{'id': '56ce1225aab44d1400b88432',
 'title': 'Frédéric_Chopin',
 'context': 'Fryderyk Chopin was born in Żelazowa Wola, 46 kilometres (29 miles) west of Warsaw, in what was then the Duchy of Warsaw, a Polish state established by Napoleon. The parish baptismal record gives his birthday as 22 February 1810, and cites his given names in the Latin form Fridericus Franciscus (in Polish, he was Fryderyk Franciszek). However, the composer and his family used the birthdate 1 March,[n 2] which is now generally accepted as the correct date.',
 'question': 'When was his birthday recorded as being?',
 'answers': {'text': ['22 February 1810'], 'answer_start': [212]}}
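
Note that `answer_start` is a character offset into `context`, not a token index. The check below uses the first two sentences of the sample above to show how the answer's character span is recovered. It only illustrates the data format.

```python
# First two sentences of the context shown above
context = (
    "Fryderyk Chopin was born in Żelazowa Wola, 46 kilometres (29 miles) west of "
    "Warsaw, in what was then the Duchy of Warsaw, a Polish state established by "
    "Napoleon. The parish baptismal record gives his birthday as 22 February 1810, "
    "and cites his given names in the Latin form Fridericus Franciscus "
    "(in Polish, he was Fryderyk Franciszek)."
)
answers = {"text": ["22 February 1810"], "answer_start": [212]}

start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])   # exclusive end offset

print(start_char, end_char)
print(context[start_char:end_char])
```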

In [8]:
def preprocess(data):
    questions = [q.strip() for q in data["question"]]  # extract the questions and strip whitespace
    # Tokenize the questions together with their contexts
    inputs = tokenizer(
        questions,
        data["context"],
        max_length=384,              # maximum length of the tokenized input
        truncation="only_second",    # truncate only the context, never the question
        return_offsets_mapping=True, # also return the offset_mapping
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = data["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]  # character offset where the answer starts in the context
        # Character offset where the answer ends (exclusive)
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Use sequence_ids to find the first and last context tokens
        idx = 0
        while sequence_ids[idx] != 1:  # advance to the first token with sequence id 1
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:  # advance past the last token with sequence id 1
            idx += 1
        context_end = idx - 1

        # If the answer lies outside the (possibly truncated) context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise use start_char and end_char to find the answer's token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

# Apply the preprocessing function to the data and drop the original squad columns
tokenized_squad = squad.map(preprocess, batched=True, remove_columns=squad["train"].column_names)
tokenized_squad

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 1000
    })
})
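
The core of `preprocess` is the conversion from character offsets to token positions through the offset mapping. The sketch below replays the same two loops on a small hand-written example. The offsets are written by hand for a whitespace-split sentence (an assumption, not real tokenizer output), and `char_span_to_token_span` is a hypothetical helper name.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token), both inclusive."""
    # Answer lies entirely outside the context tokens: label it (0, 0)
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0
    # Walk forward past every token that starts at or before start_char
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # Walk backward past every token that ends at or after end_char
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

context = "born on 22 February 1810 in Poland"
# Hand-written (start, end) character offsets, one per whitespace-separated word.
# Index 0 stands in for a special token such as [CLS].
offsets = [(0, 0), (0, 4), (5, 7), (8, 10), (11, 19), (20, 24), (25, 27), (28, 34)]
context_start, context_end = 1, 7

start_char = context.index("22 February 1810")
end_char = start_char + len("22 February 1810")

span = char_span_to_token_span(offsets, context_start, context_end, start_char, end_char)
print(span)
print([context[s:e] for s, e in offsets[span[0]:span[1] + 1]])
```

Both returned positions are inclusive, which is why the inference cells add 1 to the predicted end position before slicing.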

In [9]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

In [10]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./QandA",            # directory for predictions and checkpoints (required)
    evaluation_strategy="epoch",     # evaluate at the end of every epoch
    learning_rate=2e-5,              # learning rate
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=16,   # batch size for evaluation
    num_train_epochs=3,              # number of epochs
    weight_decay=0.01,               # weight decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

***** Running training *****
  Num examples = 4000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 750
  Number of trainable parameters = 66364418


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 2.362121        |
| 2     | 2.767500      | 1.901501        |
| 3     | 2.767500      | 1.783696        |


***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16
Saving model checkpoint to ./QandA\checkpoint-500
Configuration saved in ./QandA\checkpoint-500\config.json
Model weights saved in ./QandA\checkpoint-500\pytorch_model.bin
tokenizer config file saved in ./QandA\checkpoint-500\tokenizer_config.json
Special tokens file saved in ./QandA\checkpoint-500\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=750, training_loss=2.354102294921875, metrics={'train_runtime': 66.2499, 'train_samples_per_second': 181.132, 'train_steps_per_second': 11.321, 'total_flos': 1175877900288000.0, 'train_loss': 2.354102294921875, 'epoch': 3.0})
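
The step counts in the training log can be checked by hand. A short calculation, using only values printed above (4000 training examples, a batch size of 16 on one device, no gradient accumulation, 3 epochs):

```python
import math

num_examples = 4000   # "Num examples" in the training log
batch_size = 16       # per_device_train_batch_size on a single device
num_epochs = 3        # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

The result agrees with "Total optimization steps = 750". With 250 steps per epoch, the first evaluation happens before step 500, which would explain the "No log" entry for epoch 1 and the single `checkpoint-500`, assuming the logging and save intervals were left at their defaults of 500 steps.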

In [11]:
inputs = tokenizer(question, context, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# argmax over the start logits: index (within the full input sequence) of the token most likely to begin the answer
answer_start = torch.argmax(answer_start_scores)
# argmax over the end logits; add 1 so the index can serve as an exclusive slice end
answer_end = torch.argmax(answer_end_scores) + 1
print("start:", answer_start, ", end:", answer_end)

# Extract only the input_ids from the tokenized input
input_ids = inputs["input_ids"].tolist()[0]
# Convert the ids in the answer span to tokens, then join the tokens into a string
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)

print("Question:", question)
print("Answer:", answer)  # the fine-tuned model now produces an answer

start: tensor(71, device='cuda:0') , end: tensor(74, device='cuda:0')
Question: The dance capital of the world is what city in the US?
Answer: new york city


In [12]:
trainer.save_model("./QandA")  # save the model
# Load the saved model
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained("./QandA")
model = AutoModelForQuestionAnswering.from_pretrained("./QandA")

Saving model checkpoint to ./QandA
Configuration saved in ./QandA\config.json
Model weights saved in ./QandA\pytorch_model.bin
tokenizer config file saved in ./QandA\tokenizer_config.json
Special tokens file saved in ./QandA\special_tokens_map.json

## 5. Korean Question Answering

In [13]:
question = "수원 화성은 언제 완성되었는가?"
context = """수원 화성은 조선시대 화성유수부 시가지를 둘러싼 성곽이다. 
1789년(정조 13) 수원을 팔달산 동쪽 아래로 옮기고, 
1794년(정조 18) 축성을 시작해 1796년에 완성했다."""
context = context.strip().replace("\n","")

In [14]:
from transformers import ElectraTokenizer, ElectraForQuestionAnswering, pipeline

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-small-v2-distilled-korquad-384")
model = ElectraForQuestionAnswering.from_pretrained("monologg/koelectra-small-v2-distilled-korquad-384")
question_answerer = pipeline("question-answering", tokenizer=tokenizer, model=model)
answer = question_answerer({
    "question": question,
    "context": context,
})
print(answer)

All model checkpoint weights were used when initializing ElectraForQuestionAnswering.

All the weights of ElectraForQuestionAnswering were initialized from the model checkpoint at monologg/koelectra-small-v2-distilled-korquad-384.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ElectraForQuestionAnswering for predictions without further training.


{'score': 0.9962994456291199, 'start': 87, 'end': 93, 'answer': '1796년에'}
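
As with the English pipeline, `start` and `end` in the result are character offsets into `context`. A quick check, rebuilding the context string as it looks after `strip()` and `replace("\n", "")` (the space at the end of each original line is kept, because only the newline characters were removed):

```python
# The context from the cell above after strip() and replace("\n", "")
context = (
    "수원 화성은 조선시대 화성유수부 시가지를 둘러싼 성곽이다. "
    "1789년(정조 13) 수원을 팔달산 동쪽 아래로 옮기고, "
    "1794년(정조 18) 축성을 시작해 1796년에 완성했다."
)
# The pipeline result printed above
answer = {'score': 0.9962994456291199, 'start': 87, 'end': 93, 'answer': '1796년에'}

span = context[answer['start']:answer['end']]
print(span)
print(span == answer['answer'])
```

The pipeline answer includes the particle "에" ("1796년에"), whereas the token-level decoding in the next cell yields "1796년".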


In [15]:
import torch

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits
# argmax over the start logits: index (within the full input sequence) of the token most likely to begin the answer
answer_start = torch.argmax(answer_start_scores)
# argmax over the end logits; add 1 so the index can serve as an exclusive slice end
answer_end = torch.argmax(answer_end_scores) + 1
print("start:", answer_start, ", end:", answer_end)
# Extract only the input_ids from the tokenized input
input_ids = inputs["input_ids"].tolist()[0]
# Convert the ids in the answer span to tokens, then join the tokens into a string
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
print("Question:", question)
print("Answer:", answer)

start: tensor(57) , end: tensor(60)
Question: 수원 화성은 언제 완성되었는가?
Answer: 1796년
