### In this assignment you will use a BERT model to classify BBC news articles by category. Clone-code the notebook, and study it by annotating the code carefully, line by line!

You can run it locally, but running it on a GPU in Colab is recommended!

## Loading and exploring the data

In [1]:
%%capture
!pip install transformers

In [2]:
import pandas as pd
import torch
import numpy as np
from transformers import BertTokenizer, BertModel
from torch import nn
from torch.optim import Adam
from tqdm import tqdm

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df = pd.read_csv('/content/drive/MyDrive/NLP/bbc-text.csv') # path to the bbc-text.csv file

In [5]:
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [6]:
print(len(df))

2225


In [7]:
df.groupby('category').count()

category,text
business,510
entertainment,386
politics,417
sport,511
tech,401
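As a quick sanity check on the counts above, the sketch below computes each class's share and the accuracy of a trivial model that always predicts the most frequent class. The counts are copied from the table; `class_counts`, `shares`, and `majority_baseline` are illustrative names, not part of the notebook.

```python
# Class counts copied from the groupby output above.
class_counts = {
    "business": 510,
    "entertainment": 386,
    "politics": 417,
    "sport": 511,
    "tech": 401,
}

total = sum(class_counts.values())  # should equal len(df) == 2225

# Share of each class in the full dataset.
shares = {name: count / total for name, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_class = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_class] / total

for name, share in shares.items():
    print(f"{name:<14}{share:.3f}")
print(f"majority class: {majority_class}, baseline accuracy: {majority_baseline:.3f}")
```

The classes are fairly balanced, so a useful classifier should score well above this baseline of roughly 23%.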


## BertTokenizer

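Load the `BertTokenizer` that belongs to a pretrained BERT checkpoint. Try several variants.

- bert-base-uncased : about 110M parameters, all text lowercased
- bert-large-cased : about 340M parameters, case-sensitive (upper and lower case)
- bert-base-cased : about 110M parameters, English only, case-sensitive (upper and lower case); for multilingual text there is a separate checkpoint, bert-base-multilingual-cased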


In [8]:
# BERT tokenizer: load the pretrained tokenizer of the bert-base-cased checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
# Map each category name to an integer label
labels = {'business':0,
          'entertainment':1,
          'sport':2,
          'tech':3,
          'politics':4
          }

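The `labels` dictionary maps category names to integer ids. To read predictions back as category names you need the reverse mapping. A minimal sketch, where `id_to_label` and the example `predicted_ids` are hypothetical names and values introduced for illustration:

```python
labels = {'business': 0, 'entertainment': 1, 'sport': 2, 'tech': 3, 'politics': 4}

# Invert the mapping: integer id -> category name.
id_to_label = {idx: name for name, idx in labels.items()}

# Hypothetical predicted class ids, e.g. what output.argmax(dim=1).tolist() might return.
predicted_ids = [2, 0, 4]
predicted_names = [id_to_label[i] for i in predicted_ids]
print(predicted_names)  # ['sport', 'business', 'politics']
```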

## Dataset

In [9]:
# Dataset class that converts the data into a format BERT can take as input
class Dataset(torch.utils.data.Dataset):

    def __init__(self, df):
        # Convert the 'category' column of the dataframe to integer labels
        self.labels = [labels[label] for label in df['category']]
        # Tokenize the 'text' column into BERT inputs: every text is padded or truncated
        # to 512 tokens, and return_tensors="pt" yields tensors of shape (1, 512) per text
        self.texts = [tokenizer(text,
                                padding='max_length', max_length = 512, truncation=True,
                                return_tensors="pt") for text in df['text']]

    def classes(self):
        return self.labels # return the class labels

    def __len__(self):
        return len(self.labels) # return the size of the dataset

    def get_batch_labels(self, idx):
        return np.array(self.labels[idx]) # return the label at the given index

    def get_batch_texts(self, idx):
        return self.texts[idx] # return the tokenized text at the given index

    def __getitem__(self, idx):
        # Return one (tokenized text, label) pair; the DataLoader groups these into batches
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_texts, batch_y
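The `padding='max_length'`, `max_length = 512`, and `truncation=True` arguments make every example the same length, so examples can be stacked into a batch. The toy function below mimics that idea on plain lists of token ids. It is not the tokenizer's real implementation (for instance, it ignores the handling of special tokens such as `[CLS]` and `[SEP]`), and `pad_or_truncate`, `pad_id=0`, and the example ids are assumptions made for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)                   # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad              # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)  # [101, 7, 8, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions.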

## Train & Evaluate BertClassifier

Load the pretrained BertModel and stack a few simple layers on top of it.

- bert-base-cased: 12-layer, 768-hidden, 12-self attention heads, 110M parameters. Trained on cased English text.


Other pretrained models are listed at the link below.

https://huggingface.co/transformers/v2.9.1/pretrained_models.html

In [10]:
class BertClassifier(nn.Module):

    def __init__(self, dropout=0.5):

        super(BertClassifier, self).__init__()
        # Load the pretrained BERT model
        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 5) # 768-dimensional BERT output -> scores for 5 classes
        self.relu = nn.ReLU() # ReLU activation

    def forward(self, input_id, mask):
        # Use BERT's pooled output (one 768-dimensional vector per input, derived from the [CLS] token)
        _, pooled_output = self.bert(input_ids= input_id, attention_mask=mask,return_dict=False)
        dropout_output = self.dropout(pooled_output) # apply dropout
        linear_output = self.linear(dropout_output) # linear layer produces one score per class
        final_layer = self.relu(linear_output) # apply ReLU (clamps negative scores to 0)

        return final_layer
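The `forward` method applies ReLU to the output of the linear layer, so negative scores are clamped to zero before they reach `nn.CrossEntropyLoss`. The pure-Python sketch below shows one consequence: if every score happens to be negative, the prediction becomes uniform over the five classes. The helper names `relu`, `softmax`, `cross_entropy` and the example scores are made up for illustration and are not part of the notebook.

```python
import math

def relu(xs):
    # Clamp negative values to zero, element by element.
    return [max(0.0, x) for x in xs]

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability of the target class.
    return -math.log(softmax(logits)[target])

raw_logits = [-1.2, -0.3, -2.0, -0.8, -0.1]  # hypothetical linear-layer output
clamped = relu(raw_logits)                    # every entry becomes 0.0

probs = softmax(clamped)                      # uniform: 0.2 for each class
loss = cross_entropy(clamped, target=4)       # equals ln(5)
print(probs)
print(round(loss, 3))
```

A common alternative is to return the linear layer's output directly as logits. That would be a change to the notebook's model, so the accuracies reported below would not carry over unchanged.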

In [11]:
def train(model, train_data, val_data, learning_rate, epochs):

    train, val = Dataset(train_data), Dataset(val_data)

    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr= learning_rate)

    if use_cuda:

            model = model.cuda()
            criterion = criterion.cuda()

    for epoch_num in range(epochs):

            model.train() # training mode: dropout is active

            total_acc_train = 0
            total_loss_train = 0

            for train_input, train_label in tqdm(train_dataloader):

                train_label = train_label.to(device)
                mask = train_input['attention_mask'].to(device) # attention mask (1 = real token, 0 = padding)
                input_id = train_input['input_ids'].squeeze(1).to(device) # input ids: (batch, 1, 512) -> (batch, 512)

                output = model(input_id, mask)

                batch_loss = criterion(output, train_label.long())
                total_loss_train += batch_loss.item()

                acc = (output.argmax(dim=1) == train_label).sum().item()
                total_acc_train += acc

                model.zero_grad()
                batch_loss.backward()
                optimizer.step()

            model.eval() # evaluation mode: dropout is disabled for validation

            total_acc_val = 0
            total_loss_val = 0

            with torch.no_grad():

                for val_input, val_label in val_dataloader:

                    val_label = val_label.to(device)
                    mask = val_input['attention_mask'].to(device)
                    input_id = val_input['input_ids'].squeeze(1).to(device)

                    output = model(input_id, mask)

                    batch_loss = criterion(output, val_label.long())
                    total_loss_val += batch_loss.item()

                    acc = (output.argmax(dim=1) == val_label).sum().item()
                    total_acc_val += acc

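            # Each batch_loss is already a mean over its batch, so dividing the summed batch
            # losses by the number of samples reports roughly (mean loss) / batch_size
            print(
                f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} | Train Accuracy: {total_acc_train / len(train_data): .3f} | Val Loss: {total_loss_val / len(val_data): .3f} | Val Accuracy: {total_acc_val / len(val_data): .3f}')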
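The accuracy in `train` is accumulated by counting, in each batch, how many rows have their largest score at the labelled class, and dividing the running total by the number of samples. A minimal pure-Python sketch of that bookkeeping, with hypothetical scores and labels (`argmax`, `count_correct`, and `batches` are illustrative names):

```python
def argmax(row):
    # Index of the largest score in one row of model outputs.
    return max(range(len(row)), key=row.__getitem__)

def count_correct(batch_outputs, batch_labels):
    # Number of rows whose predicted class matches the label.
    return sum(argmax(row) == label for row, label in zip(batch_outputs, batch_labels))

# Two hypothetical batches of size 2: (scores for 5 classes, integer labels).
batches = [
    ([[2.0, 0.1, 0.0, 0.0, 0.3], [0.0, 0.0, 1.5, 0.2, 0.0]], [0, 2]),  # both correct
    ([[0.0, 0.9, 0.0, 0.0, 0.1], [0.4, 0.0, 0.0, 0.0, 0.3]], [1, 4]),  # second row is wrong
]

total_correct = 0
n_samples = 0
for outputs, targets in batches:
    total_correct += count_correct(outputs, targets)
    n_samples += len(targets)

accuracy = total_correct / n_samples
print(accuracy)  # 0.75
```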


In [12]:
def evaluate(model, test_data):

    test = Dataset(test_data)

    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:

        model = model.cuda()

    model.eval() # evaluation mode: disable dropout

    total_acc_test = 0
    # Disable gradient computation during evaluation
    with torch.no_grad():

        for test_input, test_label in test_dataloader:

              test_label = test_label.to(device)
              mask = test_input['attention_mask'].to(device)
              input_id = test_input['input_ids'].squeeze(1).to(device)

              output = model(input_id, mask)

              acc = (output.argmax(dim=1) == test_label).sum().item()
              total_acc_test += acc

    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')

In [13]:
np.random.seed(112)
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=42),
                                     [int(.8*len(df)), int(.9*len(df))])

print(len(df_train),len(df_val), len(df_test))

1780 222 223
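The split points `int(.8*len(df))` and `int(.9*len(df))` cut the shuffled rows into roughly 80% / 10% / 10%. The sketch below reproduces the same slicing on a plain list of row indices. It uses Python's `random` module instead of pandas, so the row order differs from the notebook's `df.sample(frac=1, random_state=42)`; only the split sizes are comparable.

```python
import random

n_rows = 2225                        # len(df) in this notebook
indices = list(range(n_rows))
random.Random(42).shuffle(indices)   # shuffle once, reproducibly

train_end = int(0.8 * n_rows)        # first split point
val_end = int(0.9 * n_rows)          # second split point

train_idx = indices[:train_end]
val_idx = indices[train_end:val_end]
test_idx = indices[val_end:]

print(len(train_idx), len(val_idx), len(test_idx))  # 1780 222 223
```

With `batch_size=2`, 1780 training rows give 890 batches per epoch, which matches the `890/890` shown in the progress bars.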



In [15]:
EPOCHS = 5 # try increasing the number of epochs!
model = BertClassifier()
LR = 1e-6

train(model, df_train, df_val, LR, EPOCHS)

100%|██████████| 890/890 [03:08<00:00,  4.72it/s]


Epochs: 1 | Train Loss:  0.757 | Train Accuracy:  0.362 | Val Loss:  0.542 | Val Accuracy:  0.734


100%|██████████| 890/890 [03:08<00:00,  4.71it/s]


Epochs: 2 | Train Loss:  0.324 | Train Accuracy:  0.910 | Val Loss:  0.173 | Val Accuracy:  0.982


100%|██████████| 890/890 [03:08<00:00,  4.71it/s]


Epochs: 3 | Train Loss:  0.128 | Train Accuracy:  0.978 | Val Loss:  0.087 | Val Accuracy:  0.982


100%|██████████| 890/890 [03:08<00:00,  4.71it/s]


Epochs: 4 | Train Loss:  0.065 | Train Accuracy:  0.989 | Val Loss:  0.047 | Val Accuracy:  0.995


100%|██████████| 890/890 [03:08<00:00,  4.72it/s]


Epochs: 5 | Train Loss:  0.038 | Train Accuracy:  0.996 | Val Loss:  0.040 | Val Accuracy:  0.991


In [16]:
evaluate(model, df_test)

Test Accuracy:  0.987


In [18]:
LR_2 = 1e-3
train(model, df_train, df_val, LR_2, EPOCHS)

100%|██████████| 890/890 [02:56<00:00,  5.04it/s]


Epochs: 1 | Train Loss:  0.807 | Train Accuracy:  0.231 | Val Loss:  0.805 | Val Accuracy:  0.180


100%|██████████| 890/890 [02:55<00:00,  5.06it/s]


Epochs: 2 | Train Loss:  0.805 | Train Accuracy:  0.232 | Val Loss:  0.805 | Val Accuracy:  0.180


100%|██████████| 890/890 [02:56<00:00,  5.05it/s]


Epochs: 3 | Train Loss:  0.805 | Train Accuracy:  0.232 | Val Loss:  0.805 | Val Accuracy:  0.180


100%|██████████| 890/890 [02:55<00:00,  5.06it/s]


Epochs: 4 | Train Loss:  0.805 | Train Accuracy:  0.232 | Val Loss:  0.805 | Val Accuracy:  0.180


100%|██████████| 890/890 [02:55<00:00,  5.07it/s]


Epochs: 5 | Train Loss:  0.805 | Train Accuracy:  0.232 | Val Loss:  0.805 | Val Accuracy:  0.180


In [19]:
evaluate(model, df_test)

Test Accuracy:  0.256


(Optional) If you tried several experiments, add a short interpretation for each one! 🤗

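I changed the LR to 1e-3 and the model stopped learning: the loss stayed at 0.805 in every epoch and accuracy fell to roughly chance level (test accuracy 0.256). I first read this as underfitting, i.e. not reaching the optimum within the limited number of epochs, but the completely flat loss suggests that a learning rate this large is simply too high for fine-tuning BERT, so more epochs alone would probably not help. Note also that this run continued from the model already trained with LR = 1e-6 rather than from a fresh `BertClassifier()`, so the two runs are not an independent comparison.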
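One way to read the constant 0.805 in the LR = 1e-3 run: `train` adds up per-batch mean losses and then divides by the number of samples, so with `batch_size=2` the printed value is about half of the mean cross-entropy. The sketch below checks that arithmetic under the assumption that the model outputs a uniform distribution over the five classes; `uniform_ce` and `printed_loss` are illustrative names.

```python
import math

num_classes = 5
batch_size = 2
n_samples = 1780                      # len(df_train)
n_batches = n_samples // batch_size   # 890, as in the progress bars

# Cross-entropy of a uniform prediction over 5 classes: ln(5).
uniform_ce = -math.log(1 / num_classes)

# criterion(...) returns the mean loss of each batch; train() sums these
# batch means and divides by the number of samples, not the number of batches.
total_loss = uniform_ce * n_batches
printed_loss = total_loss / n_samples

print(round(uniform_ce, 3))    # 1.609
print(round(printed_loss, 3))  # 0.805
```

Under that assumption, the printed 0.805 corresponds to a model that no longer distinguishes between the classes at all.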