<a href="https://colab.research.google.com/github/respect5716/deep-learning-paper-implementation/blob/main/03_NLP/AEDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AEDA

## 0. Info

### Paper
* title: AEDA: An Easier Data Augmentation Technique for Text Classification
* author: Akbar Karimi et al.
* url: https://arxiv.org/abs/2108.13230

### Features
* task: k-shot classification
* dataset: klue-ynat

### Reference
* https://github.com/akkarimi/aeda_nlp

## 1. Setup

In [1]:
!pip install -q transformers datasets

In [1]:
import easydict
from tqdm.auto import tqdm

import numpy as np
import pandas as pd

import torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

In [2]:
config = easydict.EasyDict(
    puncs = ['.', ',', '!', '?', ';', ':'],
    punc_ratio = 0.3,
    num_augs = 5,
    k = 5,

    model_name_or_path = 'klue/roberta-small',
    batch_size = 32,
    num_epochs = 10,
    lr = 3e-5
)

## 2. Data

In [3]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data.iloc[idx]
        inputs = self.tokenizer(item['title'], max_length=64, padding='max_length', truncation=True, return_tensors='pt')
        inputs['labels'] = torch.tensor(item['label'])
        return inputs

In [4]:
tokenizer = AutoTokenizer.from_pretrained(config.model_name_or_path)

In [5]:
data = load_dataset('klue', 'ynat')
train_data = data['train'].to_pandas()
test_data = data['validation'].to_pandas()
train_data.head()

| | guid | title | label | url | date |
|---|---|---|---|---|---|
| 0 | ynat-v1_train_00000 | 유튜브 내달 2일까지 크리에이터 지원 공간 운영 | 3 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2016.06.30. 오전 10:36 |
| 1 | ynat-v1_train_00001 | 어버이날 맑다가 흐려져…남부지방 옅은 황사 | 3 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2016.05.08. 오전 5:25 |
| 2 | ynat-v1_train_00002 | 내년부터 국가RD 평가 때 논문건수는 반영 않는다 | 2 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2016.03.15. 오후 12:00 |
| 3 | ynat-v1_train_00003 | 김명자 신임 과총 회장 원로와 젊은 과학자 지혜 모을 것 | 2 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2017.02.28. 오전 9:54 |
| 4 | ynat-v1_train_00004 | 회색인간 작가 김동식 양심고백 등 새 소설집 2권 출간 | 3 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2018.04.03. 오전 7:05 |


In [6]:
test_dataset = Dataset(test_data, tokenizer)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=config.batch_size, shuffle=False)

In [7]:
sample = []
for l in train_data['label'].unique():
    sample.append(train_data.loc[train_data['label'] == l].sample(config.k))
train_data = pd.concat(sample, ignore_index=True)
train_data.head()

| | guid | title | label | url | date |
|---|---|---|---|---|---|
| 0 | ynat-v1_train_36244 | 국립현대미술관 청주관 개관 | 3 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2018.12.13. 오후 5:56 |
| 1 | ynat-v1_train_20058 | 설 연휴 호캉스 누려볼까…호텔업계 설 패키지 잇따라 | 3 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2019.01.15. 오전 6:00 |
| 2 | ynat-v1_train_16870 | 일상의 중력에서 벗어나려는 샐러리맨의 이야기 | 3 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2019.02.15. 오후 2:02 |
| 3 | ynat-v1_train_01296 | 한 장 남은 달력 안면도 해넘이 | 3 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2017.11.30. 오후 1:37 |
| 4 | ynat-v1_train_36559 | 신간 정의를 밀어붙이는 사람 | 3 | https://news.naver.com/main/read.nhn?mode=LS2D... | 2018.11.15. 오후 1:43 |


## 3. Augmentation

In [8]:
def aeda(sent, puncs, punc_ratio):
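    words = sent.split()
    if not words:  # nothing to augment in an empty title
        return sent
    # Insert between 1 and ceil(punc_ratio * n_words) marks; randint's upper bound is exclusive.
    num_puncs = np.random.randint(1, int(np.ceil(punc_ratio * len(words))) + 1)
    punc_idxs = np.random.choice(len(words), num_puncs, replace=False).tolist()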
    
    aug_sent = []
    for i, w in enumerate(words):
        if i in punc_idxs:
            aug_sent.append(np.random.choice(puncs))
        aug_sent.append(w)
    aug_sent = ' '.join(aug_sent)
    return aug_sent
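
To see what this insertion scheme produces without loading the dataset, here is a minimal standard-library sketch of the same idea. The name `aeda_sketch`, the seeded `random.Random` instance, and the English example sentence are assumptions made for illustration; they are not part of the notebook or of the reference implementation.

```python
import math
import random

PUNCS = ['.', ',', '!', '?', ';', ':']


def aeda_sketch(sent, puncs=PUNCS, punc_ratio=0.3, rng=None):
    """Insert 1..ceil(punc_ratio * n_words) punctuation marks, each before a randomly chosen word."""
    if rng is None:
        rng = random.Random()
    words = sent.split()
    if not words:
        return sent
    max_puncs = math.ceil(punc_ratio * len(words))
    num_puncs = rng.randint(1, max_puncs)  # random.randint is inclusive on both ends
    punc_idxs = set(rng.sample(range(len(words)), num_puncs))

    out = []
    for i, w in enumerate(words):
        if i in punc_idxs:
            out.append(rng.choice(puncs))
        out.append(w)
    return ' '.join(out)


rng = random.Random(0)  # fixed seed so the demo is repeatable
sentence = "data augmentation helps when labels are scarce"
augmented_sentence = aeda_sketch(sentence, rng=rng)

# Removing the inserted marks should give back the original words in order.
recovered = [t for t in augmented_sentence.split() if t not in PUNCS]
num_inserted = len(augmented_sentence.split()) - len(sentence.split())
print(augmented_sentence)
print(recovered == sentence.split(), num_inserted)
```

Because only standalone punctuation tokens are added, the original words and their order are untouched, which is the property that makes the label safe to reuse for every augmented copy.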

In [9]:
augmented = []

for idx, item in train_data.iterrows():
    title, label = item['title'], item['label']
    augmented.append({'title': title, 'label': label})
    for _ in range(config.num_augs-1):
        aug_title = aeda(title, config.puncs, config.punc_ratio)
        augmented.append({'title': aug_title, 'label': label})
augmented = pd.DataFrame(augmented)
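
The loop above keeps each original title and adds `num_augs - 1` augmented copies, so the augmented set should be `num_augs` times the size of the k-shot sample. The short check below works through that arithmetic. The class count of 7 is an assumption about KLUE-YNAT (it is not printed in this notebook); the other values come from `config`.

```python
import math

k = 5             # config.k
num_augs = 5      # config.num_augs (original + 4 augmented copies)
batch_size = 32   # config.batch_size
num_classes = 7   # assumption: number of topic labels in klue-ynat

few_shot_size = k * num_classes
augmented_size = few_shot_size * num_augs

# DataLoader keeps the last partial batch by default, so steps per epoch is a ceiling.
steps_plain = math.ceil(few_shot_size / batch_size)
steps_aug = math.ceil(augmented_size / batch_size)

print(few_shot_size, augmented_size)  # 35 175
print(steps_plain, steps_aug)         # 2 6
```

Under these assumptions the augmented run takes three times as many optimizer steps per epoch as the plain run, which may account for part of any difference in how quickly the training loss falls.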

## 4. Train

In [10]:
def train(data, tokenizer, config):
    dataset = Dataset(data, tokenizer)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=config.batch_size, shuffle=True)

    model = AutoModelForSequenceClassification.from_pretrained(config.model_name_or_path, num_labels=data['label'].nunique())
    model.train().cuda()
    optim = torch.optim.Adam(model.parameters(), lr=config.lr)

    for ep in tqdm(range(config.num_epochs)):
        ep_loss = 0.
        for batch in dataloader:
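            # Drop only the extra dim added by return_tensors='pt'; a bare squeeze() would also drop a batch of size 1.
            batch = {k: (v.squeeze(1) if v.dim() == 3 else v).to(model.device) for k, v in batch.items()}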
            out = model(**batch)
            loss = out.loss

            optim.zero_grad()
            loss.backward()
            optim.step()

            ep_loss += loss.item()
        
        ep_loss /= len(dataloader)
        print(f'ep {ep:02d} | loss {ep_loss:.3f}')
    
    return model


def evaluate(model, test_loader):
    _ = model.eval()

    logits, labels = [], []
    for batch in test_loader:
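        # Drop only the extra dim added by return_tensors='pt'; a bare squeeze() would also drop a batch of size 1.
        batch = {k: (v.squeeze(1) if v.dim() == 3 else v).to(model.device) for k, v in batch.items()}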
        with torch.no_grad():
            out = model(**batch)
            logits.append(out.logits.cpu())
            labels.append(batch['labels'].cpu())

    logits = torch.cat(logits, dim=0)
    labels = torch.cat(labels, dim=0)
    acc = (torch.argmax(logits, dim=1) == labels).float().mean().item()
    return acc
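
`evaluate` counts a prediction as correct when the arg-max of its logits equals the label, then averages over the test set. The toy example below reproduces that computation with plain Python lists; the logits and labels are made up for illustration.

```python
# Four hypothetical examples with three classes each.
logits = [
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [0.0, 0.9, 0.4],
    [1.2, 1.1, 0.0],
]
labels = [0, 2, 2, 1]

# Arg-max per row: index of the largest logit.
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(preds)  # [0, 2, 1, 0]
print(acc)    # 0.5
```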

In [11]:
model = train(train_data, tokenizer, config)
acc = evaluate(model, test_loader)

print(f'acc: {acc:.3f}')

Some weights of the model checkpoint at klue/roberta-small were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at klue/roberta-small and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'class

ep 00 | loss 1.973
ep 01 | loss 1.923
ep 02 | loss 1.832
ep 03 | loss 1.797
ep 04 | loss 1.651
ep 05 | loss 1.686
ep 06 | loss 1.588
ep 07 | loss 1.416
ep 08 | loss 1.300
ep 09 | loss 1.229
acc: 0.450


In [12]:
model = train(augmented, tokenizer, config)
acc = evaluate(model, test_loader)

print(f'acc: {acc:.3f}')

ep 00 | loss 1.910
ep 01 | loss 1.539
ep 02 | loss 1.067
ep 03 | loss 0.689
ep 04 | loss 0.446
ep 05 | loss 0.286
ep 06 | loss 0.195
ep 07 | loss 0.143
ep 08 | loss 0.111
ep 09 | loss 0.092
acc: 0.551
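
The two runs above report an accuracy of 0.450 without augmentation and 0.551 with AEDA. The snippet below quantifies the gap. Each number comes from a single run on an unseeded random k-shot sample, so the difference should be read as indicative rather than conclusive.

```python
baseline_acc = 0.450  # printed by the run on train_data
aeda_acc = 0.551      # printed by the run on augmented

abs_gain = aeda_acc - baseline_acc
rel_gain = abs_gain / baseline_acc

print(f'absolute gain: {abs_gain:.3f}')  # absolute gain: 0.101
print(f'relative gain: {rel_gain:.1%}')  # relative gain: 22.4%
```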
