# Implementing Q/A with BERT + LoRA (Small Version)

Author : 정상근 (hugmanskj@gmail.com)

> Notice
> This code was written for educational purposes.

In this notebook, we practice using LoRA to fine-tune a BERT model with less memory.
The data is SQuAD, and we implement a Q/A task with BERT+LoRA. Much of the code for applying BERT to SQuAD is taken from [this notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=K51w5LujQ97Q).


This exercise uses a small BERT model and focuses only on whether LoRA can be applied.

We will proceed using the software and frameworks below.

(Training time: 20 minutes, on Colab)

In [1]:
from IPython.display import Markdown, display

In [2]:
import os

# Use only the first GPU (required by this lab environment)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Disable Weights & Biases logging
os.environ["WANDB_MODE"] = "disabled"

## Utility functions used later

In [3]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params:,} || all params: {all_param:,} || trainable : {100 * trainable_params / all_param}%"
    )
    return trainable_params


def compare_param(ori_p, peft_p):
    """
    Compare two parameter numbers
    """
    print(f"\n# Trainable Parameter \nBefore: {ori_p:>14,d} \nAfter:  {peft_p:>14,d} \nPercentage: {round(peft_p / ori_p * 100, 2)}%")


def show_trainable_structure(model):
    """
    Print the name, shape, and element count of each trainable parameter.
    """
    num_totals = 0
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(f"{name} \t {param.shape} \t\t {param.numel():,}")
            num_totals += 1

    print(f"\nTotal Number of Parameter Names : {num_totals:,}")

## SQuAD Dataset

You can browse the dataset at the site below.

https://huggingface.co/datasets/squad

### Preparing the Dataset

In [None]:
## It takes time! (about 10min)
from datasets import load_dataset
dataset = load_dataset("squad")

In [None]:
dataset

### Exploring the Dataset

In the SQuAD dataset, each example provides a context together with a question about that context and its answer.

In [None]:
dataset["train"][0]

In [7]:
dataset["train"][0].keys()

dict_keys(['id', 'title', 'context', 'question', 'answers'])

In [8]:
print("Title : ", dataset["train"][0]['title'])
print("-"*50)
print("[Context]")
print(dataset["train"][0]['context'])
print("-"*50)
print("Question : ", dataset["train"][0]['question'])
print("Answer   : ", dataset["train"][0]['answers'])

Title :  University_of_Notre_Dame
--------------------------------------------------
[Context]
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
--------------------------------------------------
Question :  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer   :  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}

In the example above, the answer to the question is 'Saint Bernadette Soubirous', and it starts at character 515 of the context. Let's verify this.

In [9]:
ans_len = len(dataset["train"][0]['answers']['text'][0])
start_pos = dataset["train"][0]['answers']['answer_start'][0]

## -- extract answer from the context
dataset["train"][0]['context'][ start_pos : start_pos + ans_len]

'Saint Bernadette Soubirous'

### Q/A dataset Preprocessing

Processing the Q/A data requires preprocessing steps such as tokenization.

##### Model Setup

In [10]:
MODEL_ID = "bert-base-uncased" # small size

In [11]:
import transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Check that this is a fast tokenizer
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

pad_on_right = tokenizer.padding_side == "right"
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
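
To build intuition for how `max_length` and `doc_stride` interact, here is a minimal sketch of the sliding-window idea. It is a simplified model, not the tokenizer's actual implementation: it assumes `doc_stride` is the number of tokens shared by consecutive windows and that three positions are reserved for special tokens. The helper name `sliding_windows` and the token counts (15 question tokens, 500 context tokens) are hypothetical.

```python
def sliding_windows(num_context_tokens, window_size, overlap):
    """Return (start, end) token ranges that cover the context with a fixed overlap."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than the window size")
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        start += step
    return windows


max_length = 384
doc_stride = 128
question_tokens = 15                                  # hypothetical question length
context_capacity = max_length - question_tokens - 3   # room left after [CLS] and two [SEP]

# A hypothetical 500-token context no longer fits in one feature.
print(sliding_windows(500, context_capacity, doc_stride))
# [(0, 366), (238, 500)]
```

Each window becomes one feature, which is why a single example can produce several features in the preprocessing function below.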

In [12]:
# Data to Feature function
# from this link : https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=LP4YiUxrQ97Q
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lots of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
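
The labeling step in `prepare_train_features` converts a character span into a token span by using the offset mapping. The sketch below reproduces that idea on a toy sentence. It is an illustration under assumptions: whitespace splitting stands in for the real WordPiece tokenizer, and the helper names `whitespace_offsets` and `char_span_to_token_span` are hypothetical.

```python
def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token), or None if it is out of range."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    first = 0
    while first < len(offsets) and offsets[first][0] <= start_char:
        first += 1
    last = len(offsets) - 1
    while last >= 0 and offsets[last][1] >= end_char:
        last -= 1
    return first - 1, last + 1


text = "appeared to Saint Bernadette Soubirous in 1858"
answer = "Saint Bernadette Soubirous"
start_char = text.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(text)
span = char_span_to_token_span(offsets, start_char, end_char)
print(span)                                   # (2, 4)
print(text.split()[span[0]: span[1] + 1])     # ['Saint', 'Bernadette', 'Soubirous']
```

When the answer does not lie inside the current window, the function returns `None`; the notebook's version labels such features with the `[CLS]` index instead.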

In [13]:
features = prepare_train_features(dataset['train'][:5])

In [14]:
features.keys()

KeysView({'input_ids': [[101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1996, 13546, 2003, 1996, 24665, 23052, 1010, 1037, 14042, 2173, 1997, 7083, 1998, 9185, 1012, 2009, 2003, 1037, 15059, 1997, 1996, 24665, 23052, 2012, 10223, 26371, 1010, 2605, 2073, 1996, 6261, 2984, 22353, 2135, 2596, 2000, 3002, 16595, 9648, 4674, 2061, 12083, 9711, 2271, 1999, 8517, 1012, 2012, 1996, 2203, 1997, 1996, 2364, 3298, 1006, 1998, 1999, 1037, 3622, 2240, 2008, 8539, 2083, 1017, 11342, 1998

In [15]:
# tokenized input ids
print( features['input_ids'][0] )

[101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1996, 13546, 2003, 1996, 24665, 23052, 1010, 1037, 14042, 2173, 1997, 7083, 1998, 9185, 1012, 2009, 2003, 1037, 15059, 1997, 1996, 24665, 23052, 2012, 10223, 26371, 1010, 2605, 2073, 1996, 6261, 2984, 22353, 2135, 2596, 2000, 3002, 16595, 9648, 4674, 2061, 12083, 9711, 2271, 1999, 8517, 1012, 2012, 1996, 2203, 1997, 1996, 2364, 3298, 1006, 1998, 1999, 1037, 3622, 2240, 2008, 8539, 2083, 1017, 11342, 1998, 1996, 2751, 8514, 1007

In [16]:
tokenizer.decode(features['input_ids'][0])

'[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP] architecturally, the school has a catholic character. atop the main building \' s gold dome is a golden statue of the virgin mary. immediately in front of the main building and facing it, is a copper statue of christ with arms upraised with the legend " venite ad me omnes ". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the grotto, a marian place of prayer and reflection. it is a replica of the grotto at lourdes, france where the virgin mary reputedly appeared to saint bernadette soubirous in 1858. at the end of the main drive ( and in a direct line that connects through 3 statues and the gold dome ), is a simple, modern stone statue of mary. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [P

In [17]:
# tokenized token type ids
print( features['token_type_ids'][0] )

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

Look closely at where the values switch between 0 and 1.

In [18]:
# attention mask
print( features['attention_mask'][0] )

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

From the [PAD] tokens onward, the mask is all 0.
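
The layout of `token_type_ids` and `attention_mask` seen above can be reproduced with a small sketch. It assumes the pattern visible in the printed output: `[CLS] question [SEP]` gets segment 0, `context [SEP]` gets segment 1, and padding gets segment 0 with mask 0. The helper name `bert_pair_layout` and the toy lengths are hypothetical.

```python
def bert_pair_layout(num_question_tokens, num_context_tokens, max_length):
    """Build token_type_ids and attention_mask for a padded question/context pair."""
    used = num_question_tokens + num_context_tokens + 3   # [CLS], [SEP], [SEP]
    if used > max_length:
        raise ValueError("pair does not fit in max_length")
    pad = max_length - used
    token_type_ids = (
        [0] * (num_question_tokens + 2)      # [CLS] + question + [SEP]
        + [1] * (num_context_tokens + 1)     # context + [SEP]
        + [0] * pad                          # padding
    )
    attention_mask = [1] * used + [0] * pad
    return token_type_ids, attention_mask


token_type_ids, attention_mask = bert_pair_layout(3, 4, 12)
print(token_type_ids)   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
print(attention_mask)   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Note that `token_type_ids` returns to 0 in the padded region, so the attention mask, not the segment id, is what tells the model to ignore padding.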

In [19]:
# answer start/end token positions
print( "Start Pos : ", features['start_positions'][0] )
print( "End   Pos : ", features['end_positions'][0] )

Start Pos :  130
End   Pos :  137


In [20]:
start_p = features['start_positions'][0]
end_p = features['end_positions'][0]
tokenizer.decode( features['input_ids'][0][start_p:end_p+1] )

'saint bernadette soubirous'

Note that the decoded answer is lowercased, because the tokenizer is uncased.

### Tokenization

We tokenize the entire dataset up front and store it in tokenized_datasets.

In [21]:
tokenized_datasets = dataset.map(
                        prepare_train_features,
                        batched=True,
                        remove_columns=dataset["train"].column_names
                    )

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [22]:
tokenized_datasets.keys()

dict_keys(['train', 'validation'])

In [23]:
print(dataset["train"].column_names)

['id', 'title', 'context', 'question', 'answers']


> Option notes
1. With `batched=True`, the `map` function of the datasets library passes samples to the function in batches instead of one at a time.
2. `remove_columns=dataset["train"].column_names` tells `map` to drop the listed columns from the resulting dataset.
3. For example, the original SQuAD columns include 'context', 'question', and 'answers'. Once tokenization adds new columns (e.g., 'input_ids', 'attention_mask', 'start_positions', 'end_positions'), the original columns are no longer needed.

In [24]:
## full data
#train_dataset = tokenized_datasets["train"]
#test_dataset  = tokenized_datasets["validation"]

## small data (about 30min)
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(5000))
test_dataset  = tokenized_datasets["validation"].shuffle(seed=42).select(range(1000))

## Preparing BERT + LoRA

Setting up LoRA with the Hugging Face and PEFT libraries follows these steps.

1. Prepare the original model
2. Prepare the LoRA configuration
3. Prepare the PEFT (LoRA) model

### Preparing the Original Model

In [25]:
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    MODEL_ID)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


> Note!

BERT-base is small enough to fit on a modest GPU, so we do not use additional model compression techniques such as load_in_8bit here. We only show how to apply LoRA.

In [26]:
ori_p = print_trainable_parameters(model)

trainable params: 108,893,186 || all params: 108,893,186 || trainable : 100.0%


#### Checking the Model Structure

In [27]:
show_trainable_structure(model)

bert.embeddings.word_embeddings.weight 	 torch.Size([30522, 768]) 		 23,440,896
bert.embeddings.position_embeddings.weight 	 torch.Size([512, 768]) 		 393,216
bert.embeddings.token_type_embeddings.weight 	 torch.Size([2, 768]) 		 1,536
bert.embeddings.LayerNorm.weight 	 torch.Size([768]) 		 768
bert.embeddings.LayerNorm.bias 	 torch.Size([768]) 		 768
bert.encoder.layer.0.attention.self.query.weight 	 torch.Size([768, 768]) 		 589,824
bert.encoder.layer.0.attention.self.query.bias 	 torch.Size([768]) 		 768
bert.encoder.layer.0.attention.self.key.weight 	 torch.Size([768, 768]) 		 589,824
bert.encoder.layer.0.attention.self.key.bias 	 torch.Size([768]) 		 768
bert.encoder.layer.0.attention.self.value.weight 	 torch.Size([768, 768]) 		 589,824
bert.encoder.layer.0.attention.self.value.bias 	 torch.Size([768]) 		 768
bert.encoder.layer.0.attention.output.dense.weight 	 torch.Size([768, 768]) 		 589,824
bert.encoder.layer.0.attention.output.dense.bias 	 torch.Size([768]) 		 768
bert.encod

#### Checking the Original Model's Performance
We only check the loss on a very small sample.

In [28]:
from transformers import TrainingArguments

testing_args = TrainingArguments(
    output_dir='./results',  # directory for results
    do_eval=True             # enable evaluation
)

In [29]:
from transformers import Trainer

tester = Trainer(
    model=model,
    args=testing_args,
    eval_dataset=test_dataset,
)
res = tester.evaluate()

import pandas as pd
pd.DataFrame([res])

Unnamed: 0,eval_loss,eval_model_preparation_time,eval_runtime,eval_samples_per_second,eval_steps_per_second
0,6.024087,0.0024,14.7339,67.871,8.484


### Preparing the LoRA Configuration

In [30]:
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.QUESTION_ANS,  # TASK TYPE
    r=8, # Rank
    lora_alpha=1,
    lora_dropout=0.1,
    #target_modules=["query", "value"],  # We can specify target Modules
)

[This code](https://github.com/huggingface/peft/blob/main/src/peft/utils/peft_types.py) shows that the following task types are supported.

```
class TaskType(str, enum.Enum):
    SEQ_CLS = "SEQ_CLS"
    SEQ_2_SEQ_LM = "SEQ_2_SEQ_LM"
    CAUSAL_LM = "CAUSAL_LM"
    TOKEN_CLS = "TOKEN_CLS"
    QUESTION_ANS = "QUESTION_ANS"
    FEATURE_EXTRACTION = "FEATURE_EXTRACTION"
```

- This exercise performs question answering, so the TaskType is set to QUESTION_ANS.
- The LoRA alpha value is set to 1.
- When `target_modules` is not specified, PEFT falls back to a default choice for the model type. For BERT, the trainable structure printed later in this notebook shows LoRA adapters on the attention `query` and `value` projections only, not on `key` or the feed-forward layers.
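
To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear map, `W x + (alpha / r) * B (A x)`. It is an illustration rather than PEFT's implementation: dropout and bias are omitted, the sizes are toy values, and the helper names `matvec` and `lora_forward` are hypothetical. It assumes the standard LoRA initialization in which `B` starts at zero.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)                  # frozen pretrained path
    update = matvec(B, matvec(A, x))     # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, r, alpha = 6, 2, 1   # toy sizes; the notebook uses d=768, r=8, lora_alpha=1
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]   # frozen, shape (d, d)
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # trainable, shape (r, d)
B = [[0.0] * r for _ in range(d)]                                # trainable, shape (d, r)
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapted layer reproduces the frozen layer exactly,
# so training starts from the pretrained behavior.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
```

With the notebook's settings (`r=8`, `lora_alpha=1`), the scale factor `alpha / r` is 0.125, so the low-rank update is damped relative to the frozen path.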


### Preparing the PEFT (LoRA) Model

In [31]:
from peft import get_peft_model
model = get_peft_model(model, lora_config)

In [32]:
lora_p = print_trainable_parameters(model)

trainable params: 296,450 || all params: 109,189,636 || trainable : 0.2715001266237393%


In [33]:
compare_param(ori_p, lora_p)


# Trainable Parameter 
Before:    108,893,186 
After:         296,450 
Percentage: 0.27%
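
The trainable parameter count of 296,450 can be checked by hand. The arithmetic below assumes that LoRA adapters sit on the `query` and `value` projections of all 12 encoder layers (the adapter shapes appear in the trainable structure printed later) and that the question-answering head `qa_outputs` (a 768-to-2 linear layer) also stays trainable. The second point is an assumption that happens to make the numbers match.

```python
hidden = 768            # BERT-base hidden size
rank = 8                # r in LoraConfig
num_layers = 12         # encoder layers in bert-base-uncased
adapted_per_layer = 2   # query and value projections

# Each adapted module adds lora_A (8 x 768) and lora_B (768 x 8).
lora_per_module = rank * hidden + hidden * rank
lora_total = num_layers * adapted_per_layer * lora_per_module

# Assumption: the QA head (weight 2 x 768, bias 2) remains trainable.
qa_head = hidden * 2 + 2

trainable = lora_total + qa_head
print(f"{lora_per_module:,} per module, {lora_total:,} LoRA total, {trainable:,} trainable")
# 12,288 per module, 294,912 LoRA total, 296,450 trainable
```

The same 296,450 is also the difference between the two "all params" figures (109,189,636 and 108,893,186) printed before and after wrapping the model.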


## Running Fine-Tuning

### Training with the Hugging Face Trainer

In [35]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    f"{MODEL_ID}-lora-finetuned-squad",
    ## ---- Epoch & Batch ----- ##
    num_train_epochs=5,             # number of training epochs
    per_device_train_batch_size=16,  # training batch size per device

    ## ---- Learning Rate ----- ##
    #warmup_steps=500,                # number of warmup steps
    weight_decay=0.01,               # weight decay
    learning_rate=2e-5,

    ## ---- Evaluation ----- ##
    eval_strategy="epoch",

    ## - Logging & Checkpoint - ##
    logging_steps=10,                # logging interval in steps
    #save_steps=1000,                 # checkpoint interval in steps
)


> Note!

We apply LoRA only. Techniques such as fp16, gradient checkpointing, and gradient accumulation are not used here.

In [36]:
from transformers import default_data_collator
data_collator = default_data_collator

The default collator is sufficient.

In [37]:
trainer = Trainer(
    model=model,
    args=training_args,

    # on Which Dataset
    train_dataset=train_dataset, # < train
    eval_dataset=test_dataset,   # < test

    data_collator=data_collator,
    tokenizer=tokenizer
)

  trainer = Trainer(
No label_names provided for model class `PeftModelForQuestionAnswering`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [38]:
import warnings

# Hugging Face training emits many unneeded warnings (every epoch)
warnings.filterwarnings("ignore", category=UserWarning)

In [39]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,5.6943,5.720222
2,5.4454,5.440218
3,5.1866,5.199822
4,5.013,5.064651
5,4.9973,5.024916


TrainOutput(global_step=1565, training_loss=5.376961845178574, metrics={'train_runtime': 717.3224, 'train_samples_per_second': 34.852, 'train_steps_per_second': 2.182, 'total_flos': 4916389708800000.0, 'train_loss': 5.376961845178574, 'epoch': 5.0})
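
The reported `global_step=1565` follows directly from the training setup. The short check below assumes a single device and no gradient accumulation, as stated in the note above; the helper name `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps when every batch triggers one update."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# 5,000 training features, batch size 16, 5 epochs
print(total_optimizer_steps(5000, 16, 5))   # 1565
```

Each epoch takes 313 steps, because the final batch of an epoch holds only the remaining 8 features.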

## Trainable Structure of the LoRA BERT Model

In [40]:
show_trainable_structure(model)

base_model.model.bert.encoder.layer.0.attention.self.query.lora_A.default.weight 	 torch.Size([8, 768]) 		 6,144
base_model.model.bert.encoder.layer.0.attention.self.query.lora_B.default.weight 	 torch.Size([768, 8]) 		 6,144
base_model.model.bert.encoder.layer.0.attention.self.value.lora_A.default.weight 	 torch.Size([8, 768]) 		 6,144
base_model.model.bert.encoder.layer.0.attention.self.value.lora_B.default.weight 	 torch.Size([768, 8]) 		 6,144
base_model.model.bert.encoder.layer.1.attention.self.query.lora_A.default.weight 	 torch.Size([8, 768]) 		 6,144
base_model.model.bert.encoder.layer.1.attention.self.query.lora_B.default.weight 	 torch.Size([768, 8]) 		 6,144
base_model.model.bert.encoder.layer.1.attention.self.value.lora_A.default.weight 	 torch.Size([8, 768]) 		 6,144
base_model.model.bert.encoder.layer.1.attention.self.value.lora_B.default.weight 	 torch.Size([768, 8]) 		 6,144
base_model.model.bert.encoder.layer.2.attention.self.query.lora_A.default.weight 	 torch.Size([8

## Running Evaluation

Predicting with the model requires some post-processing. The model itself predicts logits for the start and end positions of the answer. If we take a batch from the validation dataloader, the model output looks like this:

In [41]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

At prediction time, the loss value is not needed except in special cases.

In [42]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([8, 384]), torch.Size([8, 384]))

The logits can be turned into predictions as follows.

In [43]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([ 38,   7,  83,  47, 125, 101,  68,  65], device='cuda:0'),
 tensor([ 61,   7, 227,  56,  28,  63,  88,  49], device='cuda:0'))

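Our model cannot be perfect, so some cases need special handling.
When predicting with BERT, the start and end positions are scored independently of each other. As a result, the predicted end position can sometimes come before the start position.

The cases to handle include:

1. The start position is greater than the end position.
2. The predicted span points into the question text rather than the context.

Taking these cases into account, we can build a predictor through post-processing.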
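
The two cases listed above can be illustrated with toy logits before looking at the full routine. This is a minimal sketch under assumptions: the logits and the `is_context` mask are made up, and the helper name `best_span` is hypothetical. It scores each candidate pair by `start_logit + end_logit` and skips pairs that are reversed, too long, or outside the context.

```python
def best_span(start_logits, end_logits, is_context, n_best=20, max_answer_length=30):
    """Return (score, start, end) of the best valid span, or None if there is none."""
    def top(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best]

    best = None
    for s in top(start_logits):
        for e in top(end_logits):
            if not (is_context[s] and is_context[e]):
                continue   # span touches the question or a special token
            if e < s or e - s + 1 > max_answer_length:
                continue   # reversed or too long
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


# Toy layout: [CLS] q q [SEP] c c c c
is_context = [False, False, False, False, True, True, True, True]
start_logits = [0.1, 5.0, 0.2, 0.0, 1.0, 3.0, 0.5, 0.2]
end_logits = [0.0, 0.3, 0.1, 0.0, 2.5, 0.4, 2.0, 0.1]

# Plain argmax would pick start=1 (inside the question) and end=4.
print(best_span(start_logits, end_logits, is_context))   # (5.0, 5, 6)
```

In this toy case the pair (start=5, end=4) has a higher raw score of 5.5, but it is rejected because the end comes before the start.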



First, we preprocess the validation data.

In [44]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lots of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [45]:
validation_features = dataset["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=dataset["validation"].column_names
)

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [46]:
raw_predictions = trainer.predict(validation_features)

In [47]:
validation_features.set_format(
                type=validation_features.format["type"],
                columns=list(validation_features.features.keys())
)

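The `set_format` call above makes every column of `validation_features` visible again, including `example_id` and `offset_mapping`, which the post-processing below needs. Some `Trainer` versions hide columns that the model does not use.

The predictions above are the raw model outputs.
The routine below post-processes them to handle the exceptional cases.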

In [48]:
from tqdm.auto import tqdm
import collections
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []

        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}

        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        predictions[example["id"]] = best_answer["text"]


    return predictions

In [49]:
final_predictions = postprocess_qa_predictions(
                        dataset["validation"],
                        validation_features,
                        raw_predictions.predictions
)

Post-processing 10570 example predictions split into 10784 features.


  0%|          | 0/10570 [00:00<?, ?it/s]

### Inspecting the Predictions

Let's try this on the first validation example.

In [50]:
max_answer_length = 30
n_best_size = 20

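# Note: `output` holds logits for the first batch of the shuffled `test_dataset`,
# so index 0 is not guaranteed to be the same feature as validation_features[0].
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()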
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to be match the example_id to
# an example index
context = dataset["validation"][0]["context"]

# Gather the indices the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 4.9952345,
  'text': 'The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10'},
 {'score': 4.8059864,
  'text': '2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10'},
 {'score': 4.76024,
  'text': 'Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10'},
 {'score': 4.6355886,
  'text': 'The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC'},
 {'score': 4.6223454,
  'text': 'anniversary" with various gold-themed initiatives, as well as temporarily'},
 {'score': 4.598525,
  'text': 'anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under'},
 {'score': 4.582387,
  

In [51]:
dataset["validation"][0]["answers"]

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 'answer_start': [177, 177, 177]}

We can see that the correct answer is not being predicted well.
This is expected: we trained a very small model for only a few epochs.
Using a larger model and training longer should give more accurate results.

### Preparing the Evaluation Metric

In [52]:
# `load_metric` is no longer available in recent `datasets` releases;
# the `evaluate` package (pip install evaluate) provides the same metric.
import evaluate
metric = evaluate.load("squad")

In [None]:
validation_features

To use the squad metric, the data must be put into the required format.

In [None]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in dataset["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

This is the performance of a model that has barely been trained.
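
For intuition about what the squad metric measures, here is a simplified sketch of exact match and token-level F1 for a single prediction and a single reference. It follows the usual SQuAD-style normalization (lowercasing, removing punctuation and English articles), but it is not the library's code: the real metric also takes the best score over all reference answers and averages over the dataset. The function names are hypothetical.

```python
import collections
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


truth = "Denver Broncos"
long_pred = "The American Football Conference (AFC) champion Denver Broncos"

print(exact_match("the Denver Broncos.", truth))   # 1.0
print(exact_match(long_pred, truth))               # 0.0
print(round(f1_score(long_pred, truth), 3))        # 0.444
```

A prediction that contains the answer but is too long, like the top candidate shown earlier, scores zero on exact match and only partial credit on F1.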