# Natural Language Processing

## Advanced Assignment 1: BERT Fine-tuning with Transformers

> This assignment is for those who want to try a more advanced NLP implementation.
>
> There is no answer key or reference code, so push as far as you can!

### Introduction

* In this assignment, you fine-tune a pretrained model on the IMDb movie review dataset.
* Given a movie review, build a model that classifies it as positive or negative.
* This time we use the [Transformers](https://huggingface.co/docs/transformers/index) library, which is widely used in both industry and academia. The goal is to reach the target accuracy while consulting the library documentation yourself.
* Fine-tune while varying the model, the hyperparameters, and so on, and reach a test accuracy of at least 93%!
* Reference 1) https://huggingface.co/transformers/
* Reference 2) https://paperswithcode.com/sota/text-classification-on-imdb

### 0. Environment Setup and Data Download

In [None]:
!pip install datasets
!pip install transformers[torch]

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15
Collecting accelerate>=0.20.3 (from transformers[torch])
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)

In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2024-01-14 14:55:59--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2024-01-14 14:56:03 (20.6 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



### 1. Data Preprocessing

In [None]:
import torch
from transformers import DistilBertTokenizerFast
from transformers import DistilBertConfig
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
from pathlib import Path
from sklearn.model_selection import train_test_split

In [None]:
def read_imdb_split(split_dir):
    """Read every review under split_dir/pos and split_dir/neg; return (texts, labels)."""
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text(encoding="utf-8"))
            # Compare strings with ==, not `is`: 0 = negative, 1 = positive.
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')


In [None]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
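
The split above is neither seeded nor stratified, so the validation set changes on every run and its label balance can drift slightly. The following is a minimal sketch on toy data, assuming a reproducible, label-balanced split is wanted. The `toy_*` names and the seed value `42` are illustrative choices, not part of the assignment.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the IMDb lists: 5 positive (1) and 5 negative (0) reviews.
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [1] * 5 + [0] * 5

# stratify keeps the positive/negative ratio the same in both parts;
# random_state makes the shuffle repeatable (42 is an arbitrary seed).
toy_train_texts, toy_val_texts, toy_train_labels, toy_val_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

print(len(toy_train_texts), len(toy_val_texts))
print(sorted(toy_val_labels))
```

Applied to the real data, the same two keyword arguments would be added to the existing call, with `stratify=train_labels`.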

For tokenization, we use the same tokenizer that `BERT` uses, loaded here through `DistilBertTokenizerFast`.

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)

In [None]:
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
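
The calls above pass `truncation=True` and `padding=True`. Conceptually, truncation cuts each sequence down to the model's maximum length (512 positions for this checkpoint), and padding extends shorter sequences to the longest one in the call, with an attention mask marking the real tokens. The sketch below illustrates that idea in plain Python. It is a simplification, not the library's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and special tokens such as `[CLS]` and `[SEP]` are ignored.

```python
from typing import Dict, List


def pad_and_truncate(
    batch_ids: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Truncate each sequence to max_length, then right-pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target_len = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[5, 6, 7], [8, 9, 10, 11, 12, 13], [14]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because each split is tokenized in a single call, every review in that split is padded to the longest review in the split, which is part of why the encodings take a noticeable amount of memory.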

In [None]:
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)  # needed below as the Trainer's eval_dataset
test_dataset = IMDbDataset(test_encodings, test_labels)
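
To see what a dataset of this shape hands to a `DataLoader`, the sketch below repeats the same pattern on a tiny hand-written example. `ToyEncodedDataset` mirrors `IMDbDataset`, and the token ids and labels are made up for illustration, so only the shapes are meaningful.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyEncodedDataset(Dataset):
    """Same pattern as IMDbDataset: a dict of per-example lists plus a label list."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One example becomes a dict of tensors, keyed the way the model expects.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Made-up ids, already padded to length 4.
toy_encodings = {
    "input_ids": [[11, 12, 13, 0], [11, 14, 15, 13], [11, 16, 13, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_labels = [1, 0, 1]

toy_dataset = ToyEncodedDataset(toy_encodings, toy_labels)
toy_batch = next(iter(DataLoader(toy_dataset, batch_size=2)))
batch_shapes = {name: tuple(tensor.shape) for name, tensor in toy_batch.items()}
print(batch_shapes)
```

The default collate function stacks the per-example tensors, so each batch is a dict whose keys can be passed straight to the model as keyword arguments.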

### 2. Building and Training the Model
For the model, we use `DistilBERT`, a smaller version of the pretrained `BERT` obtained through distillation.

In [None]:
config = DistilBertConfig.from_pretrained(
    'distilbert-base-uncased',
    vocab_size=30522, max_position_embeddings=512, sinusoidal_pos_embds=False,
    n_layers=6, n_heads=12, dim=768, hidden_dim=3072,
    dropout=0.1, attention_dropout=0.1, activation='gelu'
)

In [None]:
# Load the pretrained DistilBERT weights; the classification head is newly initialized.
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", config=config)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # number of training epochs
    per_device_train_batch_size=16,  # training batch size per GPU
    per_device_eval_batch_size=64,   # evaluation batch size per GPU
    warmup_steps=500,                # warm-up steps for learning-rate scheduling; the learning rate rises gradually during this period
    weight_decay=0.01,               # weight decay
    logging_dir='./logs',            # directory for logs
    logging_steps=100,               # log the training loss every 100 steps
)

trainer = Trainer(
    model=model,                         # model to train
    args=training_args,                  # training arguments
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

Step,Training Loss
100,0.6611
200,0.3497


KeyboardInterrupt: 

In [None]:
model.to("cuda")

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
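
The training arguments in this section set `warmup_steps=500` with a batch size of 16 and a single epoch. The sketch below works out what that means for the learning rate, assuming a single GPU, no gradient accumulation, 20,000 training reviews (80% of the 25,000 in `aclImdb/train`), and the `TrainingArguments` defaults of a `5e-5` learning rate with a linear schedule. `linear_warmup_decay_lr` is a hypothetical helper that mimics that schedule and is not the library's code.

```python
import math


def linear_warmup_decay_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear ramp from 0 to base_lr over warmup_steps, then linear decay to 0."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0, total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * factor


train_examples = 20_000   # assumed: 80% of the 25,000 IMDb training reviews
batch_size = 16           # per_device_train_batch_size, single GPU assumed
num_epochs = 1
warmup_steps = 500
base_lr = 5e-5            # assumed: default learning_rate of TrainingArguments

total_steps = math.ceil(train_examples / batch_size) * num_epochs
warmup_fraction = warmup_steps / total_steps

print(f"total optimizer steps: {total_steps}")
print(f"share of training spent warming up: {warmup_fraction:.0%}")
for step in (0, 250, 500, 875, total_steps):
    lr = linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Under these assumptions, a large share of the single epoch is spent below the peak learning rate, so the warm-up length and the number of epochs are natural hyperparameters to revisit when chasing the 93% target.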
 

### 3. Evaluation Code

In [None]:
from datasets import load_metric
from torch.utils.data import DataLoader
from tqdm import tqdm

In [None]:
test_dataloader = DataLoader(test_dataset, batch_size=4)

In [None]:
metric = load_metric("accuracy")
test_dataloader = DataLoader(test_dataset, batch_size=128)
model.eval()
for batch in tqdm(test_dataloader):
    batch = {k: v.to("cuda") for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
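
The loop above takes the argmax of each row of logits and lets the metric compare it with the reference label. As a cross-check that does not depend on any metric library, the same computation can be sketched in plain Python. The helper names and the toy logits and labels are illustrative, not values from the notebook.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits: List[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of rows whose highest-scoring class equals the reference label."""
    if not labels or len(logits) != len(labels):
        raise ValueError("logits and labels must be non-empty and the same length")
    predictions = [argmax(row) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


# Toy values: column 0 = negative, column 1 = positive.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 3 of the 4 predictions match
```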

In [None]:
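# Peek at one small batch of 4 reviews.
sample_batch = next(iter(DataLoader(test_dataset, batch_size=4)))

In [None]:
# Move every tensor in the batch to the GPU.
sample_batch = {k: v.to("cuda") for k, v in sample_batch.items()}

In [None]:
len(sample_batch['input_ids'])

4

In [None]:
# The batch includes labels, so the output carries a loss as well as the logits.
outputs = model(**sample_batch)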

In [None]:
outputs['loss']

tensor(0.6456, device='cuda:0', grad_fn=<NllLossBackward0>)

In [None]:
outputs['logits']

tensor([[-0.0845, -0.0350],
        [-0.1274, -0.0150],
        [-0.1603, -0.0483],
        [-0.1124,  0.0044]], device='cuda:0', grad_fn=<AddmmBackward0>)
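
To read logits like the ones printed above, it helps to convert them to probabilities with a softmax. The sketch below does this in plain Python for the four recorded rows and compares against the loss of a uniform guess over two classes, which is ln 2. The helper names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


# The four rows of logits recorded in the output above.
recorded_logits = [
    [-0.0845, -0.0350],
    [-0.1274, -0.0150],
    [-0.1603, -0.0483],
    [-0.1124, 0.0044],
]

probabilities = [softmax(row) for row in recorded_logits]
for row in probabilities:
    print([round(p, 4) for p in row])

uniform_loss = math.log(2)  # loss of a 50/50 guess on two classes
print(round(uniform_loss, 4))
```

All four rows come out close to 50/50, and the recorded loss of 0.6456 sits near ln 2. Both are consistent with a classification head that has had little or no training, so numbers like these suggest checking that the evaluated model is the one that was actually fine-tuned.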

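### **Content License**

<font color='red'><b>**WARNING**</b></font> : **The intellectual property rights to this educational content belong to the NAVER Connect Foundation. Leaking or modifying this content outside the course through any channel is strictly prohibited.** It may be used only for non-profit educational and research purposes, and only with the foundation's permission. Violations may result in liability under the applicable laws.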
