1. Get and prepare the training and test sets of the task dataset.
2. Load a proper foundation model and tokenizer.
3. Set a `Dataset` class for the dataset.
4. Prepare for training.
5. Train the model.
6. Evaluate the trained (fine-tuned) model on the test dataset.

In [None]:
!pip install evaluate



In [None]:
# import required libraries here
import json
import re
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
import torch.utils.data
import evaluate
import numpy as np

## 1. Get and prepare the training and test sets of the task dataset.

Task: Natural Language Inference (NLI)
- Given two sentences, the model classifies the relation between them.
  - Let `premise` be the first sentence in the pair and `hypothesis` the second.
- 3 classes (`label`s): entailment (0), contradiction (1), neutral (2)


In [None]:
!wget https://huggingface.co/datasets/tasksource/ConTRoL-nli/resolve/main/train.jsonl
!wget https://huggingface.co/datasets/tasksource/ConTRoL-nli/resolve/main/test.jsonl


In [None]:
# Define a function that reads a jsonl file.
def read_file(fname):
    # Store premises, hypotheses, and labels in separate lists.
    premises = []
    hypotheses = []
    labels = []
    label_map = {"entailment": 0, "contradiction": 1, "neutral": 2}

    with open(fname, 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            premise = data['premise'].strip().lower()
            hypothesis = data['hypothesis'].strip().lower()

            # Preprocess the strings, keeping only lowercase letters, digits, and a few punctuation marks.
            premise = re.sub(r'[^a-z0-9!@#$%^&*\(\).,? ]', '', premise)
            hypothesis = re.sub(r'[^a-z0-9!@#$%^&*\(\).,? ]', '', hypothesis)
            premises.append(premise)
            hypotheses.append(hypothesis)

            label = label_map[data['label']]  # map the label string to its integer index
            labels.append(label)

    return premises, hypotheses, labels
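
To see what this preprocessing does to a single record, here is a minimal sketch. The record is a made-up example (not taken from ConTRoL), and `clean_text` is a hypothetical helper that repeats the same `strip`, `lower`, and `re.sub` steps as `read_file`. Apostrophes, hyphens, and colons are not in the allowed character set, so they are dropped outright rather than replaced by spaces.

```python
import json
import re

# Same character whitelist as in read_file above.
ALLOWED = r'[^a-z0-9!@#$%^&*\(\).,? ]'


def clean_text(text):
    """Hypothetical helper: trim, lowercase, and drop disallowed characters."""
    return re.sub(ALLOWED, '', text.strip().lower())


# Made-up record with the same keys that read_file expects.
record = {
    "premise": "It's a Well-Known fact: 50% rise!",
    "hypothesis": " Prices rose. ",
    "label": "entailment",
}
line = json.dumps(record)      # one line of a jsonl file
data = json.loads(line)

cleaned_premise = clean_text(data["premise"])
cleaned_hypothesis = clean_text(data["hypothesis"])
print(cleaned_premise)         # its a wellknown fact 50% rise!
print(cleaned_hypothesis)      # prices rose.
```

Because removed characters are not replaced by spaces, words such as "well-known" are merged into "wellknown" before tokenization.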

In [None]:
# Store each field of the train and test data as lists.
train_premises, train_hypotheses, train_labels = read_file('train.jsonl')
test_premises, test_hypotheses, test_labels = read_file('test.jsonl')

## 2. Load a proper foundation model and tokenizer.

In [None]:
# bert-base-uncased was chosen because it is the most basic and widely used English BERT model, with good speed and performance.
# Load the model and tokenizer.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)  # NLI is a 3-class classification task, so num_labels is set to 3
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [None]:
# Tokenize the sentence pairs in the training and test data.
train_encodings = tokenizer(train_premises, train_hypotheses, truncation=True, padding="max_length")
test_encodings = tokenizer(test_premises, test_hypotheses, truncation=True, padding="max_length")
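
With `truncation=True` on sequence pairs, the tokenizer uses the `longest_first` strategy. The sketch below is a simplified pure-Python illustration of that idea, not the library's implementation: `truncate_pair_longest_first` is a hypothetical function, the token lists are placeholders, and the budget assumes BERT's 512-token limit with three special tokens (`[CLS]`, `[SEP]`, `[SEP]`) for a pair.

```python
def truncate_pair_longest_first(premise_tokens, hypothesis_tokens,
                                max_length=512, num_special_tokens=3):
    """Simplified sketch: drop one token at a time from the end of the
    currently longer sequence until the pair fits into max_length."""
    premise = list(premise_tokens)
    hypothesis = list(hypothesis_tokens)
    budget = max_length - num_special_tokens  # room left for real tokens
    while len(premise) + len(hypothesis) > budget:
        if len(premise) > len(hypothesis):
            premise.pop()
        else:
            hypothesis.pop()
    return premise, hypothesis


# Placeholder tokens: a long premise and a short hypothesis.
long_premise = ["p"] * 600
short_hypothesis = ["h"] * 20

kept_premise, kept_hypothesis = truncate_pair_longest_first(long_premise, short_hypothesis)
print(len(kept_premise), len(kept_hypothesis))  # 489 20
```

Under these assumptions, a premise that is much longer than its hypothesis absorbs all of the truncation, so any evidence near the end of a long premise is cut before the model sees it.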


## 3. Set a `Dataset` class for the dataset.

In [None]:
# Define a dataset class for the NLI task.
class NLIDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # tokenizer output (input_ids, attention_mask, etc.)
        self.labels = labels  # list of integer-encoded labels

    def __getitem__(self, idx):
        # Return each item as a dictionary.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [None]:
# Wrap the tokenized training and test data in dataset objects.
train_dataset = NLIDataset(train_encodings, train_labels)
test_dataset = NLIDataset(test_encodings, test_labels)

## 4. Prepare for training

In [None]:
# Load the accuracy and F1 metrics.
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

# Define the function that computes the metrics.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # accuracy
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    # F1 (macro-averaged, since this is a 3-class task)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")

    return {
        "accuracy": accuracy["accuracy"],
        "f1": f1["f1"]
    }
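
To make the two metrics concrete, the following sketch recomputes accuracy and macro-averaged F1 by hand in plain Python. It is an illustration only: `argmax` and `macro_f1` are hypothetical helpers, the logits and labels are toy values, and the averaging runs over the labels that appear in the predictions or references.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(predictions) | set(references))
    per_class_f1 = []
    for c in labels:
        tp = sum(1 for p, r in zip(predictions, references) if p == c and r == c)
        fp = sum(1 for p, r in zip(predictions, references) if p == c and r != c)
        fn = sum(1 for p, r in zip(predictions, references) if p != c and r == c)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
    return sum(per_class_f1) / len(per_class_f1)


# Toy logits for six examples and three classes (made-up values).
toy_logits = [
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.9, 0.4],
    [1.2, 0.3, 0.2],
    [0.1, 0.2, 0.8],
    [0.0, 0.3, 1.1],
]
toy_labels = [0, 1, 2, 0, 1, 2]

toy_predictions = [argmax(row) for row in toy_logits]   # [0, 1, 1, 0, 2, 2]
toy_accuracy = sum(p == r for p, r in zip(toy_predictions, toy_labels)) / len(toy_labels)
toy_f1 = macro_f1(toy_predictions, toy_labels)
print(toy_predictions, round(toy_accuracy, 4), round(toy_f1, 4))
```

Here class 0 has an F1 of 1.0 and classes 1 and 2 each have 0.5, so the macro average is 2/3. Macro averaging weights every class equally, regardless of how many examples it has.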


In [None]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=5,              # total number of training epochs
    per_device_train_batch_size=4,  # batch size per device during training; set to 4 because of out-of-memory crashes
    per_device_eval_batch_size=8,   # batch size for evaluation
    gradient_accumulation_steps=2,  # accumulate 2 batches so training effectively uses a batch size of 8
    fp16=True,
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    report_to="none",
)

# Create the trainer object from the model, arguments, datasets, and metric function defined above.
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,           # evaluation dataset
    compute_metrics=compute_metrics,
)

## 5. Train the model

In [None]:
# Train the model.
trainer.train()

Step,Training Loss
10,1.2498
20,1.1553
30,1.2319
40,1.2235
50,1.1975
60,1.1303
70,1.1258
80,1.0776
90,1.1098
100,1.147


TrainOutput(global_step=4200, training_loss=0.6173621597176506, metrics={'train_runtime': 1358.2052, 'train_samples_per_second': 24.735, 'train_steps_per_second': 3.092, 'total_flos': 8839295268572160.0, 'train_loss': 0.6173621597176506, 'epoch': 5.0})
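
The reported `global_step=4200` can be cross-checked against the training arguments. The sketch below is a back-of-the-envelope calculation under two assumptions: training ran on a single device, and the training-set size can be estimated from the logged throughput and runtime. The variable names are illustrative.

```python
import math

# Values copied from TrainingArguments and the TrainOutput log.
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 5
train_runtime = 1358.2052            # seconds
train_samples_per_second = 24.735
logged_global_step = 4200

# Effective batch size per optimizer update (single device assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Estimated number of training examples, derived from throughput.
est_train_size = round(train_samples_per_second * train_runtime / num_train_epochs)

# Mini-batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(est_train_size / per_device_train_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, est_train_size, updates_per_epoch, total_updates)
```

Under these assumptions the estimate is about 6,719 training examples, 840 optimizer updates per epoch, and 4,200 updates in total, which matches the log. It also means the 500 warmup steps cover roughly the first 12% of training.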

## 6. Evaluate the trained (fine-tuned) model on the test dataset.

Show both the **F1 score** and the **accuracy** on the test dataset.



In [None]:
# Evaluate the fine-tuned model using F1 and accuracy.
trainer.evaluate()

{'eval_loss': 2.9340426921844482,
 'eval_accuracy': 0.47577639751552797,
 'eval_f1': 0.47217004732975965,
 'eval_runtime': 5.9136,
 'eval_samples_per_second': 136.128,
 'eval_steps_per_second': 17.079,
 'epoch': 5.0}
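
As a rough sanity check on these numbers, the sketch below compares them with two simple reference points for a 3-class problem: the accuracy of uniform random guessing (1/3, which assumes roughly balanced classes) and the cross-entropy of a model that always predicts a uniform distribution, ln 3. The comparison is an added illustration, not part of the evaluation output.

```python
import math

num_classes = 3

# Values copied from the trainer.evaluate() output above.
eval_accuracy = 0.47577639751552797
eval_loss = 2.9340426921844482

# Reference points for a 3-class task.
chance_accuracy = 1 / num_classes       # uniform random guessing
uniform_loss = math.log(num_classes)    # cross-entropy of a uniform prediction

print(f"accuracy: {eval_accuracy:.4f} vs chance {chance_accuracy:.4f}")
print(f"loss:     {eval_loss:.4f} vs uniform {uniform_loss:.4f}")
```

The accuracy is above chance, while the evaluation loss is well above ln 3 ≈ 1.0986. One possible reading is that the model makes some confident wrong predictions on the test set; that is an interpretation, not something the log itself states.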