<a href="https://colab.research.google.com/github/yejijang-analyst/ESAA/blob/main/Kaggle_study/Kaggle_review_LLMScienceExam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Competition: LLM Science Exam**

Source: https://www.kaggle.com/competitions/kaggle-llm-science-exam

Reference code description:
This starter notebook walks through a basic example of using BERT to rank the answers to each question. We'll finetune BERT on the 200 public questions, then use the AutoModelForMultipleChoice class to generate probabilities that each option correctly answers the prompt, and finally we'll turn those predictions into a MAP@3-formatted prediction like A B C.
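
To make the MAP@3 target concrete, here is a minimal sketch of how the metric could be computed when each question has exactly one correct option. The function names and the toy predictions are illustrative assumptions, not part of the starter notebook.

```python
def average_precision_at_3(ranked_options, correct_option):
    """Score one question: 1, 1/2 or 1/3 if the answer is ranked 1st, 2nd or 3rd, else 0."""
    for rank, option in enumerate(ranked_options[:3], start=1):
        if option == correct_option:
            return 1.0 / rank
    return 0.0


def map_at_3(predictions, answers):
    """Mean of the per-question scores. `predictions` are strings such as 'A B C'."""
    scores = [
        average_precision_at_3(prediction.split(), answer)
        for prediction, answer in zip(predictions, answers)
    ]
    return sum(scores) / len(scores)


# Made-up predictions and answers for three questions.
toy_predictions = ["D A B", "B A C", "C E D"]
toy_answers = ["D", "A", "A"]
print(map_at_3(toy_predictions, toy_answers))  # (1 + 1/2 + 0) / 3 = 0.5
```

A correct answer in second place still earns half credit, which is why the notebook submits three ranked letters instead of a single guess.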

In [1]:
# Set up the environment

!pip install torch datasets
!pip install accelerate
!pip install transformers -U



In [2]:
!pip uninstall torch -y

Found existing installation: torch 2.2.0
Uninstalling torch-2.2.0:
  Successfully uninstalled torch-2.2.0


In [3]:
!pip install torch

Collecting torch
  Using cached torch-2.2.0-cp310-cp310-manylinux1_x86_64.whl (755.5 MB)
Installing collected packages: torch
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.1.0+cu121 requires torch==2.1.0, but you have torch 2.2.0 which is incompatible.
torchdata 0.7.0 requires torch==2.1.0, but you have torch 2.2.0 which is incompatible.
torchtext 0.16.0 requires torch==2.1.0, but you have torch 2.2.0 which is incompatible.
torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 2.2.0 which is incompatible.
Successfully installed torch-2.2.0


In [5]:
import pandas as pd

train_df = pd.read_csv('/content/train.csv')
train_df.head()

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


In [3]:
train_df.columns

Index(['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer'], dtype='object')

In [6]:
# Convert the DataFrame to a Dataset for convenience
from datasets import Dataset
train_ds = Dataset.from_pandas(train_df)

In [4]:
from transformers import BertModel, BertTokenizer

# Load the BERT model
model_name = 'bert-base-cased'
model = BertModel.from_pretrained(model_name)

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


This step converts the input data into a format the BERT model can consume for multiple-choice questions.

In [7]:
# Build dictionaries that map the answer letters A-E to indices and back

options = ['A', 'B', 'C', 'D', 'E']
indices = list(range(5))

option_to_index = {option:index for option, index in zip(options, indices)}
index_to_option = {index:option for option, index in zip(options, indices)}

def preprocess(example):
  # AutoModelForMultipleChoice expects question/answer pairs, so the question is repeated 5 times, once per option.
  first_sentence = [example['prompt']] * 5
  second_sentence = []
  for option in options:
    second_sentence.append(example[option])
  tokenized_example = tokenizer(first_sentence, second_sentence, truncation = True)
  tokenized_example['label'] = option_to_index[example['answer']] # Convert the answer letter to its index
  return tokenized_example

tokenized_train_ds = train_ds.map(preprocess, batched = False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])

Map:   0%|          | 0/200 [00:00<?, ? examples/s]
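
The pairing done inside `preprocess` can be seen without a tokenizer. The sketch below is illustrative only: the toy row and the helper `build_pairs` are assumptions that reuse the column names of train.csv, and the question text is made up.

```python
# Toy row with the same column names as train.csv; the text is made up for illustration.
toy_example = {
    "prompt": "Which planet is closest to the Sun?",
    "A": "Venus", "B": "Mercury", "C": "Earth", "D": "Mars", "E": "Jupiter",
    "answer": "B",
}
options = ["A", "B", "C", "D", "E"]


def build_pairs(example):
    """Return the (prompt, option) pairs and the integer label for one row."""
    first_sentences = [example["prompt"]] * len(options)
    second_sentences = [example[option] for option in options]
    label = options.index(example["answer"])
    return list(zip(first_sentences, second_sentences)), label


pairs, label = build_pairs(toy_example)
for pair in pairs:
    print(pair)
print("label:", label)
```

Each row therefore becomes five sentence pairs plus one label, and the tokenizer call in `preprocess` encodes those five pairs together.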

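The data collator processes the data in batches.

> It can be used to handle multiple-choice task data efficiently. For example, this class is well suited to problems that judge the relationship between sentences and to multiple-choice questions.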

In [8]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass # This class is copied from Hugging Face
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch
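
The flatten, pad, and regroup steps in `__call__` can be traced with plain lists. This is a dependency-free sketch under stated assumptions: the token ids are made up, only three choices per question are used instead of five, and `PAD_ID = 0` is an assumed padding id for illustration.

```python
# Two toy questions, three choices each (the notebook uses five); token ids are made up.
features = [
    {"input_ids": [[101, 7, 102], [101, 7, 8, 102], [101, 9, 102]]},
    {"input_ids": [[101, 5, 6, 7, 102], [101, 5, 102], [101, 6, 102]]},
]
PAD_ID = 0  # assumed padding id for this illustration

batch_size = len(features)
num_choices = len(features[0]["input_ids"])

# 1) Flatten: one sequence per (question, choice) pair.
flat = [seq for feature in features for seq in feature["input_ids"]]

# 2) Pad every sequence to the longest one in the batch.
max_len = max(len(seq) for seq in flat)
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in flat]

# 3) Regroup into batch_size x num_choices x max_len.
regrouped = [padded[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]

print(len(regrouped), len(regrouped[0]), len(regrouped[0][0]))  # 2 3 5
```

The collator above does the same thing with `tokenizer.pad` and `view(batch_size, num_choices, -1)`, so the model receives one tensor of shape (batch, choices, sequence length) per input field.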

This step instantiates the model with the transformers library so that it can be used for fine-tuning and prediction.

> Instantiation: AutoModelForMultipleChoice.from_pretrained(model_dir) creates a pretrained model object based on the AutoModelForMultipleChoice class.

In [9]:
# Now we'll instantiate the model that we'll finetune on our public dataset, then use it to
# make predictions on the private dataset.
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
model = AutoModelForMultipleChoice.from_pretrained(model_name)

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This cell sets the hyperparameters for model training.

> The same error persists even after restarting the runtime or the machine, so this part still needs to be fixed.

In [10]:
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model_name = 'finetuned_bert'
training_args = TrainingArguments(
    output_dir=model_name,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to='none'
)

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
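
One way to narrow down an error like this is to check which package version the running kernel actually sees. The sketch below uses only the standard library; the helper names `version_tuple` and `meets_minimum` are hypothetical, and the comparison is a simplification that looks only at the first three numeric components of a version string.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '0.21.0' into (0, 21, 0); non-numeric suffixes such as '+cu121' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:  # pad '1.0' to (1, 0, 0)
        parts.append(0)
    return tuple(parts)


def meets_minimum(package, minimum):
    """Return True if `package` is installed in this environment at `minimum` or newer."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


print("accelerate >= 0.21.0:", meets_minimum("accelerate", "0.21.0"))
```

If this prints `False` right after a successful `pip install`, the kernel is probably not seeing the newly installed package, which is a different problem from a failed install.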

Using the training set as the validation set is bad practice, but the training set is so small here that we decided to do it anyway.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_train_ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
)
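
If a separate validation set is wanted despite the small data size, a reproducible hold-out split of the row indices is one option. This is a sketch, not what the notebook does: the helper `split_indices`, the 20% fraction, and the seed are all assumptions.

```python
import random


def split_indices(num_rows, eval_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and eval lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_eval = round(num_rows * eval_fraction)
    return indices[num_eval:], indices[:num_eval]


train_idx, eval_idx = split_indices(200)
print(len(train_idx), len(eval_idx))  # 160 40
```

With a Hugging Face `Dataset`, these lists could then be passed to `Dataset.select`, for example `tokenized_train_ds.select(train_idx)`, at the cost of training on only 160 questions.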

In [None]:
# Takes a few minutes
trainer.train()

In [None]:
predictions = trainer.predict(tokenized_train_ds)

In [None]:
# Convert the three highest-scoring indices into answer letters (A-E)
import numpy as np
def predictions_to_map_output(predictions):
    sorted_answer_indices = np.argsort(-predictions)
    top_answer_indices = sorted_answer_indices[:,:3] # Get the first three answers in each row
    top_answers = np.vectorize(index_to_option.get)(top_answer_indices)
    return np.apply_along_axis(lambda row: ' '.join(row), 1, top_answers)

In [None]:
predictions_to_map_output(predictions.predictions)
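
The values in `predictions.predictions` are raw model scores (logits), not probabilities. The dependency-free sketch below, with made-up scores and hypothetical helper names, shows that applying a softmax does not change the ranking, so the top-3 letters can be taken from the logits directly.

```python
import math

index_to_option = {0: "A", 1: "B", 2: "C", 3: "D", 4: "E"}


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top3_letters(scores):
    """Return the three best options as a string such as 'D A B'."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return " ".join(index_to_option[i] for i in order[:3])


toy_logits = [1.2, 0.3, -0.5, 2.0, 0.1]  # made-up scores for options A to E
probabilities = softmax(toy_logits)
print(top3_letters(toy_logits))     # D A B
print(top3_letters(probabilities))  # D A B
```

This is the same ordering that `np.argsort(-predictions)` produces row by row in `predictions_to_map_output`.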

In [None]:
test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
test_df.head()

In [None]:
# preprocess() looks up example['answer'], so give the test set a placeholder answer column
test_df['answer'] = 'A'

# Other than that we'll preprocess it in the same way we preprocessed train.csv
test_ds = Dataset.from_pandas(test_df)
tokenized_test_ds = test_ds.map(preprocess, batched=False, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])

In [None]:
test_predictions = trainer.predict(tokenized_test_ds)
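
A last step would be to write the ranked letters to a submission file. The sketch below assumes the file needs an `id` column and a `prediction` column; that format should be checked against the competition page. The ids and predictions are made up, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Made-up ids and predictions; in the notebook these would come from test_df['id']
# and predictions_to_map_output(test_predictions.predictions).
toy_ids = [0, 1, 2]
toy_predictions = ["D A B", "B A C", "C E D"]

buffer = io.StringIO()  # stands in for open('submission.csv', 'w', newline='')
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "prediction"])  # assumed column names
writer.writerows(zip(toy_ids, toy_predictions))

print(buffer.getvalue())
```

Because `test_df` is already a pandas DataFrame, the same result could also be produced by adding a `prediction` column and calling `to_csv` with `index=False`.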