# Text Classification using BERT

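Given a dialog between a person and a chatbot, the goal is to build a text classification model that predicts which category (domain) the subject of the person's final question belongs to.

This notebook walks through preprocessing the given data, handling the data splits, designing the model, and training it.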

## Setup
* Download the data
* Install and import the libraries

In [None]:
from datetime import datetime

print(datetime.now().date())

2023-06-27


In [None]:
# Download the data
# After running the commands below, the iab_challenge_20 folder should be visible
!git clone https://github.com/hkbae20/iab_challenge_20.git
!pip install transformers

Cloning into 'iab_challenge_20'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 11 (delta 2), reused 11 (delta 2), pack-reused 0
Unpacking objects: 100% (11/11), 3.75 MiB | 3.96 MiB/s, done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_6

In [None]:
# Install and import the required libraries
!pip install torchtext==0.6.0
import os
import json
import argparse
from argparse import Namespace
from tqdm.notebook import tqdm
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from transformers import (AutoTokenizer, AutoConfig, BertPreTrainedModel, BertModel,
                          AdamW, get_linear_schedule_with_warmup)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.6.0
  Downloading torchtext-0.6.0-py3-none-any.whl (64 kB)
Collecting sentencepiece (from torchtext==0.6.0)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece, torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.15.2
    Uninstalling torchtext-0.15.2:
      Successfully uninstalled torchtext-0.15.2
Successfully installed sentencepiece-0.1.99 torchtext-0.6.0


## Part 1. Loading Data

This part loads the data used in the challenge and inspects its structure.

In [None]:
DATA_DIR = './iab_challenge_20/'

In [None]:
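# Load the dialog logs and, when available, the labels of a split
def load_data(data_dir, split):
    with open(os.path.join(data_dir, split, "logs.json"), 'r') as f:
        x = json.load(f)
    if split != "test":
        with open(os.path.join(data_dir, split, "labels.json"), 'r') as f:
            y = json.load(f)
    else:
        y = []  # the test split has no labels
    return x, y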

In [None]:
X_train, y_train = load_data(DATA_DIR, "train")
X_val, y_val = load_data(DATA_DIR, "val")
X_test, _ = load_data(DATA_DIR, "test")

print("Number of training data:", len(X_train))
print("Number of validation data:", len(X_val))
print("Number of test data:", len(X_test))


Number of training data: 19184
Number of validation data: 2673
Number of test data: 1717


### Raw data exploration
* X_train, X_val, X_test: dialogs exchanged between two speakers, User and System, stored as JSON.
* y_train, y_val: the domain that the final question asks about (hotel, restaurant, train, taxi). The test split has no labels.
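
Before training a classifier it is usually worth checking how balanced the labels are. The sketch below counts labels with `collections.Counter`; `label_distribution` and `toy_labels` are illustrative names with made-up values, not part of the notebook or the real data.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, fraction)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}

# Toy labels standing in for y_train (hypothetical values, not the real data)
toy_labels = ["hotel", "train", "hotel", "taxi", "restaurant", "hotel"]
dist = label_distribution(toy_labels)
for label, (n, frac) in dist.items():
    print(f"{label:<12}{n:>4}{frac:>8.2%}")
```

In the notebook the same function could be called as `label_distribution(y_train)` once the data is loaded.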

In [None]:
X_train[0]

[{'speaker': 'U', 'text': 'Looking for a place to eat in the city center.'},
 {'speaker': 'S',
  'text': 'There are many options to choose from. Do you have a type of food in mind?'},
 {'speaker': 'U', 'text': "I'd like to have some Chinese food."},
 {'speaker': 'S',
  'text': 'That narrows down the restaurant choices to 10. Is there a price range you would like to stay in?'},
 {'speaker': 'U',
  'text': 'I am looking for a moderately priced place to eat, I am also looking to book a room in the bridge guest house hotel.'},
 {'speaker': 'S',
  'text': 'Which dates will you be staying at the Bridge Guest Room house?'},
 {'speaker': 'U',
  'text': 'Before I commit I have a few questions. What area is the hotel located in?'},
 {'speaker': 'S', 'text': 'The hotel is in the south area.'},
 {'speaker': 'U', 'text': 'Do they have help for disabled parking?'}]

In [None]:
print("Unique labels:", set([y for y in y_train]))

Unique labels: {'train', 'restaurant', 'taxi', 'hotel'}


## Part 2. Preprocess Data and Dataset

1. Convert the dialog context into a single sequence
2. Build a Dataset using the tokenizer

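### 2-1. Converting the dialog context into a single sequence
* As X_train\[0\] shows, each dialog is a list of dictionaries, one per utterance; these must be merged into a single string.
* When concatenating the utterances, you can limit how many utterances are used, or change the order (first-to-last or last-to-first).
* The speaker can also be marked by defining special tokens (e.g. \<U>, \<S>). Tokens like these, used only for a specific task, are called special tokens. In this notebook the tags are plain strings prepended to the text and are not registered with the tokenizer, so the BERT tokenizer splits `<U>` into `<`, `u`, `>` (visible in the decoded tokens in Part 2-2).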
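
The flattening step described above can be sketched for a single dialog as follows. This is an illustrative variant, not the notebook's `process_data`: the names `flatten_dialog` and `SPEAKER_TAGS` are assumptions, and `window_size=None` is used to mean "keep everything", whereas the notebook uses `0` and relies on `log[-0:]` returning the whole list.

```python
SPEAKER_TAGS = {"U": "<U>", "S": "<S>"}

def flatten_dialog(log, window_size=None, use_speaker_tag=False):
    """Join the last `window_size` utterances of one dialog into a string.

    window_size=None keeps the whole dialog.
    """
    if window_size is not None and window_size <= 0:
        raise ValueError("window_size must be a positive integer or None")
    turns = log if window_size is None else log[-window_size:]
    parts = []
    for utt in turns:
        # Optionally mark who is speaking in front of each utterance
        prefix = SPEAKER_TAGS[utt["speaker"]] + " " if use_speaker_tag else ""
        parts.append(prefix + utt["text"])
    return " ".join(parts)

# Toy dialog in the same format as one entry of logs.json (made-up content)
toy_log = [
    {"speaker": "U", "text": "I need a hotel."},
    {"speaker": "S", "text": "Which area?"},
    {"speaker": "U", "text": "Does it have parking?"},
]
print(flatten_dialog(toy_log, use_speaker_tag=True))
print(flatten_dialog(toy_log, window_size=1))
```

A small window keeps the input short and focused on the final question, at the cost of dropping earlier turns that may name the domain.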


In [None]:
special_token = {"U": "<U>", "S": "<S>"}

def process_data(X, window_size=0, use_speaker_tag=False):
    # use_speaker_tag: if True, prefix each utterance with <U> or <S> to mark the speaker in the sequence
    # window_size: number of utterances to keep, counted from the end; the default 0 keeps the whole dialog (log[-0:] is the full list)
    X_output = []
    if use_speaker_tag:
        for log in X:
            input_seq = " ".join([(special_token[utt["speaker"]]+" "+utt["text"]) for utt in log[-window_size:]])
            X_output.append(input_seq)
    else:
        for log in X:
            input_seq = " ".join([utt["text"] for utt in log[-window_size:]])
            X_output.append(input_seq)
    return X_output

In [None]:
X_train = process_data(X_train, use_speaker_tag=True)
X_val = process_data(X_val, use_speaker_tag=True)
X_test = process_data(X_test,  use_speaker_tag=True)

In [None]:
X_train[0]

"<U> Looking for a place to eat in the city center. <S> There are many options to choose from. Do you have a type of food in mind? <U> I'd like to have some Chinese food. <S> That narrows down the restaurant choices to 10. Is there a price range you would like to stay in? <U> I am looking for a moderately priced place to eat, I am also looking to book a room in the bridge guest house hotel. <S> Which dates will you be staying at the Bridge Guest Room house? <U> Before I commit I have a few questions. What area is the hotel located in? <S> The hotel is in the south area. <U> Do they have help for disabled parking?"

### 2-2. Creating the Dataset


In [None]:
class2id = {"hotel": 0, "restaurant": 1, "taxi": 2, "train": 3}

class Dialog(Dataset):
    def __init__(self, tokenizer, split, max_len=256):
        self.tokenizer = tokenizer
        self.split = split
        self.max_len = max_len
        self.num_labels = 4
        self.data = self.load_data()

    def process_data(self, X, window_size=0, use_speaker_tag=False):
        X_output = []
        if use_speaker_tag:
            for log in X:
                input_seq = " ".join([(special_token[utt["speaker"]]+" "+utt["text"]) for utt in log[-window_size:]])
                X_output.append(input_seq)
        else:
            for log in X:
                input_seq = " ".join([utt["text"] for utt in log[-window_size:]])
                X_output.append(input_seq)
        return X_output

    def load_data(self):
        dialogs = json.load(open(os.path.join(DATA_DIR, self.split, "logs.json"), 'r'))
        dialogs = self.process_data(dialogs, use_speaker_tag=True)
        if self.split != "test":
            labels = json.load(open(os.path.join(DATA_DIR,  self.split, "labels.json"), 'r'))
        else:
            labels = ["hotel"] * len(dialogs)

        examples = []
        for dial, label in zip(dialogs, labels):
            examples.append({"input": dial,
                            "label": class2id[label]})
        return examples

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        inputs = self.tokenizer.encode_plus(example["input"], padding="max_length",
                                            truncation=True, max_length=self.max_len)

        return {"inputs": inputs["input_ids"],
                "inputs_mask": inputs["attention_mask"],
                "targets": example["label"]}

    def collate_fn(self, batch):  # stack the batch into tensors of equal length (every example is already padded to max_len)
        input_ids = torch.tensor([example["inputs"] for example in batch], dtype=torch.long)
        input_mask = torch.tensor([example["inputs_mask"] for example in batch], dtype=torch.long)
        targets = [example["targets"] for example in batch]

        return {"input_ids": input_ids,
                "input_mask": input_mask,
                "targets": targets}

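What fixed-length padding and the attention mask look like can be illustrated without a tokenizer. The sketch below is a simplification under stated assumptions: `pad_or_truncate` is a hypothetical helper, the pad id `0` matches the `[PAD]` id visible in the notebook's output, and truncation here simply cuts the list, whereas the real tokenizer also keeps the closing `[SEP]` token.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)          # 1 marks a real token
    n_pad = max_len - len(ids)
    # 0 in the mask tells the model to ignore the padded positions
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

ids, mask = pad_or_truncate([101, 2559, 2005, 102], max_len=6)
print(ids)   # [101, 2559, 2005, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Because every example comes out with the same length, a batch can be turned into a tensor directly, which is what `collate_fn` relies on.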

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [None]:
train_dataset = Dialog(tokenizer, "train")
valid_dataset = Dialog(tokenizer, "val")
test_dataset = Dialog(tokenizer, "test")

In [None]:
train_dataset.data[0]

{'input': "<U> Looking for a place to eat in the city center. <S> There are many options to choose from. Do you have a type of food in mind? <U> I'd like to have some Chinese food. <S> That narrows down the restaurant choices to 10. Is there a price range you would like to stay in? <U> I am looking for a moderately priced place to eat, I am also looking to book a room in the bridge guest house hotel. <S> Which dates will you be staying at the Bridge Guest Room house? <U> Before I commit I have a few questions. What area is the hotel located in? <S> The hotel is in the south area. <U> Do they have help for disabled parking?",
 'label': 0}

In [None]:
print(train_dataset[0])

{'inputs': [101, 1026, 1057, 1028, 2559, 2005, 1037, 2173, 2000, 4521, 1999, 1996, 2103, 2415, 1012, 1026, 1055, 1028, 2045, 2024, 2116, 7047, 2000, 5454, 2013, 1012, 2079, 2017, 2031, 1037, 2828, 1997, 2833, 1999, 2568, 1029, 1026, 1057, 1028, 1045, 1005, 1040, 2066, 2000, 2031, 2070, 2822, 2833, 1012, 1026, 1055, 1028, 2008, 25142, 2091, 1996, 4825, 9804, 2000, 2184, 1012, 2003, 2045, 1037, 3976, 2846, 2017, 2052, 2066, 2000, 2994, 1999, 1029, 1026, 1057, 1028, 1045, 2572, 2559, 2005, 1037, 17844, 21125, 2173, 2000, 4521, 1010, 1045, 2572, 2036, 2559, 2000, 2338, 1037, 2282, 1999, 1996, 2958, 4113, 2160, 3309, 1012, 1026, 1055, 1028, 2029, 5246, 2097, 2017, 2022, 6595, 2012, 1996, 2958, 4113, 2282, 2160, 1029, 1026, 1057, 1028, 2077, 1045, 10797, 1045, 2031, 1037, 2261, 3980, 1012, 2054, 2181, 2003, 1996, 3309, 2284, 1999, 1029, 1026, 1055, 1028, 1996, 3309, 2003, 1999, 1996, 2148, 2181, 1012, 1026, 1057, 1028, 2079, 2027, 2031, 2393, 2005, 9776, 5581, 1029, 102, 0, 0, 0, 0, 0, 0, 0,



In [None]:
" ".join(tokenizer.convert_ids_to_tokens(train_dataset[0]["inputs"]))

"[CLS] < u > looking for a place to eat in the city center . < s > there are many options to choose from . do you have a type of food in mind ? < u > i ' d like to have some chinese food . < s > that narrows down the restaurant choices to 10 . is there a price range you would like to stay in ? < u > i am looking for a moderately priced place to eat , i am also looking to book a room in the bridge guest house hotel . < s > which dates will you be staying at the bridge guest room house ? < u > before i commit i have a few questions . what area is the hotel located in ? < s > the hotel is in the south area . < u > do they have help for disabled parking ? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

In [None]:
print(train_dataset[0]["inputs"][:40])
print(train_dataset[0]["targets"])

[101, 1026, 1057, 1028, 2559, 2005, 1037, 2173, 2000, 4521, 1999, 1996, 2103, 2415, 1012, 1026, 1055, 1028, 2045, 2024, 2116, 7047, 2000, 5454, 2013, 1012, 2079, 2017, 2031, 1037, 2828, 1997, 2833, 1999, 2568, 1029, 1026, 1057, 1028, 1045]
0


## Part 3. Defining the Model


- Configure CUDA so that the model can run on a GPU.
- GPU usage must first be enabled in Colab from the top menu: Edit > Notebook settings.

In [None]:
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

In [None]:
device

device(type='cuda')

In [None]:
class DomainClassifier(BertPreTrainedModel):
    def __init__(self, config, args):
      super(DomainClassifier, self).__init__(config, args)
      config.num_labels = 4 # choose one of hotel, restaurant, train, taxi
      self.config = config
      self.args = args
      self.bert = BertModel(config)
      ################## TODO 1 ###########################
      # Declare the linear layer that classifies the class
      # Name the layer `classifier`
      ####################################################

      self.classifier = nn.Linear(config.hidden_size, config.num_labels)
      ################## TODO 2 ###########################
      # Set the loss function to Cross Entropy Loss
      # Name the variable `loss_fn`
      ####################################################
      self.loss_fn = CrossEntropyLoss()


    def forward(self, input_ids, attention_mask, targets):
      ################## TODO 3 ###########################
      # Feed the inputs to the BERT model
      ####################################################
      output = self.bert(input_ids = input_ids, attention_mask=attention_mask)
      pool_output = output[1]  # pooled representation of the [CLS] token
      cls_output = self.classifier(pool_output)
      loss = self.loss_fn(cls_output, targets)

      return (loss, cls_output)
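
To make the loss concrete, the following sketch computes the cross-entropy of a single example by hand from four logits. It is a plain-Python illustration rather than the PyTorch implementation; the logit values are made up, and `CrossEntropyLoss` additionally averages this quantity over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical scores in class2id order: hotel, restaurant, taxi, train
logits = [2.0, 0.5, -1.0, 0.0]
print(round(cross_entropy(logits, target=0), 4))      # low loss: class 0 has the top score
print(round(cross_entropy(logits, target=2), 4))      # high loss: class 2 has the lowest score
print(round(cross_entropy([0.0] * 4, target=2), 4))   # uniform logits give ln(4)
```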





## Part 4. Training
### Training configuration

In [None]:
args = Namespace()
args.train_batch_size = 4
args.eval_batch_size = 4
args.num_train_epochs = 1
args.learning_rate = 5e-5
args.clf_learning_rate = 1e-3
args.gradient_accumulation_steps = 8
args.warmup_steps = 0
args.weight_decay = 0.0
args.adam_epsilon = 1e-8
args.max_grad_norm = 1.0
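
These settings determine the effective batch size and the number of optimizer updates. The sketch below works through the arithmetic for the 19184 training examples and mirrors, in plain Python, the multiplier that a linear warmup/decay schedule applies to the base learning rate. The function names are illustrative assumptions, and the schedule is a re-derivation of the usual linear formula rather than a copy of the library code.

```python
def num_optimizer_steps(num_examples, batch_size, accumulation_steps, epochs):
    """Optimizer updates: batches per epoch // accumulation steps * epochs."""
    batches_per_epoch = -(-num_examples // batch_size)  # ceiling division
    return batches_per_epoch // accumulation_steps * epochs

def linear_lr_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp-up
    # linear decay from 1 down to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = num_optimizer_steps(19184, batch_size=4, accumulation_steps=8, epochs=1)
print(total)   # 599
print(4 * 8)   # effective batch size: 32
print(linear_lr_factor(0, 0, total), linear_lr_factor(total, 0, total))  # 1.0 0.0
```

With `warmup_steps = 0` the learning rate starts at its full value and decays linearly to zero over the 599 updates.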

### Declaring the model

In [None]:
device

device(type='cuda')

In [None]:
config = AutoConfig.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DomainClassifier.from_pretrained('bert-base-uncased', config=config, args=args)
model= model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing DomainClassifier: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing DomainClassifier from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DomainClassifier from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DomainClassifier were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']

In [None]:
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(
    train_dataset,
    sampler=train_sampler,
    batch_size=args.train_batch_size,
    collate_fn=train_dataset.collate_fn,
)

eval_sampler = SequentialSampler(valid_dataset)
eval_dataloader = DataLoader(
    valid_dataset,
    sampler=eval_sampler,
    batch_size=args.eval_batch_size,
    collate_fn=valid_dataset.collate_fn
)

test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(
    test_dataset,
    sampler=test_sampler,
    batch_size=args.eval_batch_size,
    collate_fn=test_dataset.collate_fn
)


In [None]:
def train(args, model, train_iterator, eval_iterator):

    t_total = len(train_iterator) // args.gradient_accumulation_steps * args.num_train_epochs
    optimizer = AdamW([{'params': model.bert.parameters()},
                       {'params': model.classifier.parameters(), 'lr': args.clf_learning_rate}],
                      lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    for epoch in range(int(args.num_train_epochs)):
        tr_loss = 0
        #model.zero_grad()
        model.train()

        for step, batch in enumerate(tqdm(train_iterator)):
            #optimizer.zero_grad()

            input_ids = batch["input_ids"].to(device)
            input_mask = batch["input_mask"].to(device)
            targets = torch.tensor(batch["targets"]).to(device)


            ################## TODO 1 ###########################
            # 1. Feed the data moved to the GPU into the model and get the result.
            #     (Hint: the model returns two values; during training the first one matters most)
            # 2. Use the first returned value to compute the gradients of the model weights
            ####################################################
            loss, _ = model(input_ids = input_ids, attention_mask = input_mask, targets = targets)

            # Scale the loss so that the accumulated gradient is an average over the accumulated batches
            (loss / args.gradient_accumulation_steps).backward()

            tr_loss += loss.item()

            # The batch size is small, so the weights are updated once every gradient_accumulation_steps batches
            # (an effective batch of batch_size * gradient_accumulation_steps examples) instead of after every batch
            if (step + 1) % args.gradient_accumulation_steps == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                ################## TODO 2 ###########################
                # Update the model weights and the learning rate
                # Reset the accumulated gradients
                ####################################################
                optimizer.step()
                scheduler.step()
                model.zero_grad()

        tr_loss = tr_loss / len(train_iterator)

        eval_acc, eval_loss = evaluate(model, eval_iterator)

        print(f"Epoch: {epoch}, Train_loss: {tr_loss}, Accuracy: {eval_acc}, Eval_loss: {eval_loss}")

    return tr_loss
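
In gradient accumulation, the update condition decides how many batches contribute to each optimizer step. The sketch below compares two common ways of writing that condition; `update_steps` is a hypothetical helper used only for illustration.

```python
def update_steps(num_batches, accumulation_steps, offset):
    """Batch indices after which the optimizer steps, for a given condition.

    offset=0 models `step % k == 0`; offset=1 models `(step + 1) % k == 0`.
    """
    return [s for s in range(num_batches) if (s + offset) % accumulation_steps == 0]

print(update_steps(10, 4, offset=0))  # [0, 4, 8]
print(update_steps(10, 4, offset=1))  # [3, 7]
```

With `step % k == 0` the first update happens after a single batch, while `(step + 1) % k == 0` makes every update cover exactly `k` batches, which also matches a step count computed as `len(train_iterator) // k`.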

In [None]:
def calculate_accuracy(preds, y):
    max_idx = np.argmax(preds, axis=1)
    correct = (max_idx == y)
    acc = correct.sum() / len(correct)

    return acc
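
Overall accuracy can hide a model that predicts one class almost all the time. As a complementary check, the sketch below computes per-class recall from predicted and true class ids; the helper names and the toy arrays are assumptions made for illustration, not results from this notebook.

```python
from collections import Counter

def confusion_counts(pred_ids, true_ids):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_ids, pred_ids))

def per_class_recall(pred_ids, true_ids, num_classes):
    """Fraction of each true class predicted correctly (None if the class is absent)."""
    counts = confusion_counts(pred_ids, true_ids)
    recall = {}
    for c in range(num_classes):
        support = sum(n for (t, _), n in counts.items() if t == c)
        recall[c] = counts[(c, c)] / support if support else None
    return recall

# Hypothetical predictions that collapse onto class 3, with made-up labels
preds = [3, 3, 2, 3, 3, 3]
truth = [0, 3, 2, 1, 3, 0]
print(per_class_recall(preds, truth, num_classes=4))
```

In the notebook, `pred_ids` would correspond to `np.argmax(preds, axis=1)` and `true_ids` to the collected labels.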

In [None]:
def evaluate(model, iterator):
    model.eval()
    labels = []
    preds = []
    eval_loss = 0.0
    with torch.no_grad():
        for batch in tqdm(iterator):
            input_ids = batch["input_ids"].to(device)
            input_mask = batch["input_mask"].to(device)
            targets = torch.tensor(batch["targets"]).to(device)


            ################## TODO ###########################
            # 1. Feed the data moved to the GPU into the model and get the result
            #     (Hint: this time the model's output (logits) matters as well)
            ####################################################
            loss, logits = model(input_ids = input_ids, attention_mask = input_mask, targets = targets)

            labels.append(targets.detach().cpu().numpy())
            preds.append(logits.detach().cpu().numpy())
            eval_loss += loss.item()

    labels = np.concatenate(labels)
    preds = np.concatenate(preds)
    acc = calculate_accuracy(preds, labels)
    eval_loss = eval_loss / len(iterator)

    return acc, eval_loss


### Running training


In [None]:
!nvidia-smi

Tue Jun 27 05:50:33 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P0    28W /  70W |   1353MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
train_loss = train(args, model, train_dataloader, eval_dataloader)




Epoch: 0, Train_loss: 1.4569701545804217, Accuracy: 0.34118967452300786, Eval_loss: 1.4407157406144078
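
As a reference point for reading these numbers, the sketch below computes the loss and accuracy of a classifier that predicts all four classes uniformly at random. The reported losses (about 1.46 for training and 1.44 for evaluation) sit slightly above the uniform-prediction value, which suggests that this particular run learned little; that reading is an interpretation, not something stated in the original notebook.

```python
import math

num_classes = 4
chance_loss = math.log(num_classes)   # cross-entropy of a uniform prediction
chance_acc = 1 / num_classes          # expected accuracy of uniform random guessing

print(f"chance-level loss: {chance_loss:.4f}")
print(f"chance-level accuracy: {chance_acc:.2f}")
```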


In [None]:
id2class = {v: k for k, v in class2id.items()}

model.eval()
preds = []
with torch.no_grad():
    for batch in tqdm(test_dataloader):
        input_ids = batch["input_ids"].to(device)
        input_mask = batch["input_mask"].to(device)
        targets = torch.tensor(batch["targets"]).to(device)

        _, logits = model(input_ids=input_ids, attention_mask=input_mask, targets=targets)

        preds.append(logits.detach().cpu().numpy())

preds = np.concatenate(preds)
max_idx = np.argmax(preds, axis=1)

test_res = [id2class[ele] for ele in max_idx]
test_res[:10]

['train',
 'train',
 'taxi',
 'taxi',
 'train',
 'train',
 'train',
 'train',
 'train',
 'train']

In [None]:
max_idx

array([3, 3, 2, ..., 3, 3, 3])

In [None]:
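# Caution: the test split has no labels (Dialog assigns the placeholder "hotel" to every test example),
# so this accuracy is only meaningful when the dataloader wraps labeled data.
test_acc, test_loss = evaluate(model, test_dataloader)
print(f"Test Accuracy: {test_acc}, Eval_loss: {test_loss}")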

Test Accuracy: 0.43273150844496217, Eval_loss: 1.4187853714754415
