# Final Project: 2021 National Institute of Korean Language (NIKL) AI Language Proficiency Evaluation

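- The [2021 NIKL AI Language Proficiency Evaluation](https://corpus.korean.go.kr/task/taskList.do?taskId=1&clCd=END_TASK&subMenuId=sub01) is a language-ability evaluation competition on [four tasks](https://corpus.korean.go.kr/task/taskDownload.do?taskId=1&clCd=END_TASK&subMenuId=sub02) that ran from September 1 and closed on November 1.
- Carry out the tasks exactly as they were posed there, so that your results can be compared with the [final selected results](https://corpus.korean.go.kr/task/taskLeaderBoard.do?taskId=4&clCd=END_TASK&subMenuId=sub04).
- The test-set answers have not been officially released yet, so performance will be compared on the evaluation dataset included in the data for each of the four tasks.
- If the answer set is released before the final presentations, verify performance against that answer set.
- You may implement whatever approach you prefer: Transformers-based methods, neural networks, and so on.
- The competition period has ended and the data can no longer be downloaded, so refer to the attached data.
- Work individually or in a group of at most two people.
- Rename this notebook file, work in it, and submit it. Name the submitted file FinalProject_[DS or CL]_Department_Name.ipynb
- Deadline: Monday, December 6, 23:59.
- Final presentations are scheduled for December 7 and 9.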

## Leaderboard

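- Until the final presentation, each team must keep publishing its per-task performance, **continuously reporting the results of every method it has tried**, on the [leaderboard](https://docs.google.com/spreadsheets/d/1-uenfp5GolpY2Gf0TsFbODvj585IIiFKp9fvYxcfgkY/edit#gid=0) under its team name (including member names).
- On the final deadline, these rankings will be compared against the actual output of the submitted programs to verify performance.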

# Task 1. Sentence Grammaticality Judgment

In [1]:
# device (GPU)
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
print(device)

cuda


In [2]:
import os
import pandas as pd
import numpy as np

from torch.utils.data import TensorDataset, random_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

import time
import datetime

from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import accuracy_score

## Dataset

In [3]:
## read train data
train_df = pd.read_csv("./data/task1/NIKL_CoLA_train.tsv", delimiter='\t', header=0, names=['sentence_source', 'label', 'label_notes', 'sentence'])
print('Total no. of training sentences: {:,}\n'.format(train_df.shape[0]))
train_df.head(10)

Total no. of training sentences: 15,876



|   | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 0 | T00001 | 1 |  | 높은 달이 떴다. |
| 1 | T00001 | 0 | * | 달이 뜸이 높았다. |
| 2 | T00002 | 1 |  | 실없는 사람이 까불까불한다. |
| 3 | T00003 | 1 |  | 나는 철수에게 공을 던졌다. |
| 4 | T00004 | 1 |  | 내가 순이와 둘이서 다툰다. |
| 5 | T00004 | 0 | * | 내가 순이와 우리가 다툰다. |
| 6 | T00005 | 1 |  | 나는 부지런히 뛰었다. |
| 7 | T00005 | 0 | ? | 나는 부지런히 뛰어졌다. |
| 8 | T00006 | 1 |  | 사랑이 죄는 아니다. |
| 9 | T00006 | 0 | * | 죄는 사랑이 아니다. |


In [4]:
# check data types
train_df.dtypes

sentence_source    object
label               int64
label_notes        object
sentence           object
dtype: object

In [5]:
# get sentences and labels
train_sentences = train_df.sentence.values
train_labels = train_df.label.values

In [7]:
train_sentences[100]

'나는 종소리를 더 잘 듣더라.'

In [8]:
train_labels[100]

0

In [9]:
print(train_labels.dtype)

int64


## Tokenizer 

In [10]:
#!pip install transformers

In [6]:
from transformers import ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained('monologg/koelectra-base-v3-discriminator')

In [7]:
# max length check
max_len = 0

for sent in train_sentences:
    # tokenize the text / add [CLS] and [SEP] tokens
    input_ids = tokenizer.encode(sent, add_special_tokens=True)
    max_len = max(max_len, len(input_ids))
    
print('Maximum sentence length: ', max_len)

Maximum sentence length:  40


In [15]:
# original sentence
print('Original sentence: ', train_sentences[100])

# sentence split into tokens
print('Tokenized sentence: ', tokenizer.tokenize(train_sentences[100]))

# sentence mapped to token IDs
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(train_sentences[100])))

Original sentence:  나는 종소리를 더 잘 듣더라.
Tokenized sentence:  ['나', '##는', '종소리', '##를', '더', '잘', '듣', '##더라', '.']
Token IDs:  [2236, 4034, 31207, 4110, 2373, 3258, 2440, 26093, 18]
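The `##` prefix in the tokenized output marks a piece that continues the previous token rather than starting a new word. The following is a minimal pure-Python sketch of that convention, using the tokens printed above; `merge_wordpieces` is a hypothetical helper written for illustration, not part of the `transformers` API.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into words; a '##' prefix marks a continuation piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # glue the piece onto the previous word
        else:
            words.append(tok)      # start a new word
    return words


# tokens copied from the output above
tokens = ['나', '##는', '종소리', '##를', '더', '잘', '듣', '##더라', '.']
merged = merge_wordpieces(tokens)
print(merged)
```

Note that the final period stays a separate token, so rejoining the merged words with spaces does not exactly reproduce the original spacing.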


In [8]:
# process sentences
def sentence_processing(sentences, labels):
    # Tokenize all of the sentences and map the tokens to their word IDs.
    input_ids = []
    attention_masks = []
    
    for sent in sentences:
        encoded_dict = tokenizer.encode_plus(          # tokenize sentences and map tokens to their IDs
                            sent,                      
                            add_special_tokens = True, # prepend [CLS] token to the start & append [SEP] to the end
                            max_length = 64,           # target length for all sentences (max_len: 40 < 64)
                            padding='max_length',      # pad to 'max_length'
                            truncation = True,         # truncate anything longer than 'max_length'
                            return_attention_mask = True,   # construct attention masks that ignore [PAD] tokens
                            return_tensors = 'pt',     # return pytorch tensors
                       )
  
        input_ids.append(encoded_dict['input_ids'])    # add the encoded sentence  

        attention_masks.append(encoded_dict['attention_mask'])  # add the attention mask
        
    # convert the lists into tensors
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)
    
    return input_ids, attention_masks, labels


In [9]:
train_input_ids, train_attention_masks, train_labels = sentence_processing(train_sentences, train_labels)

In [18]:
# original sentence and token IDs
print('Original: ', train_sentences[100])
print('Token IDs:', train_input_ids[100])

Original:  나는 종소리를 더 잘 듣더라.
Token IDs: tensor([    2,  2236,  4034, 31207,  4110,  2373,  3258,  2440, 26093,    18,
            3,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0])
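The tensor above is the 11 real token IDs (including `[CLS]` = 2 and `[SEP]` = 3) followed by `[PAD]` = 0 up to length 64. As a rough sketch of what padding to `max_length` produces, the plain-Python function below builds the same `input_ids` together with the matching attention mask; `pad_to_max_length` is a hypothetical helper for illustration and does not reproduce the tokenizer's internals.

```python
PAD_ID = 0  # [PAD] id, as seen in the tensor above


def pad_to_max_length(ids, max_length, pad_id=PAD_ID):
    """Right-pad a list of token IDs and build the matching attention mask."""
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# [CLS] + the 9 token IDs printed earlier + [SEP]
ids = [2, 2236, 4034, 31207, 4110, 2373, 3258, 2440, 26093, 18, 3]
input_ids, attention_mask = pad_to_max_length(ids, 64)
print(len(input_ids), sum(attention_mask))
```

The attention mask is what lets the model ignore the 53 padded positions in this example.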


## Train / Validation

In [10]:
# combine the training inputs into a TensorDataset
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)

In [20]:
len(dataset)

15876

In [11]:
# train-validation split, chosen at random (9:1)
train_dataset, val_dataset = random_split(dataset, [int(len(dataset) * 0.9), len(dataset) - int(len(dataset) * 0.9)])

In [22]:
print('{:>5,} training samples'.format(len(train_dataset)))
print('{:>5,} validation samples'.format(len(val_dataset)))

14,288 training samples
1,588 validation samples
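The split sizes follow directly from the arithmetic in the `random_split` call. The sketch below recomputes them and, assuming the batch size of 32 and the 10 epochs used later in this notebook, the number of optimizer steps the learning-rate scheduler will see.

```python
import math

n_total = 15876      # training sentences loaded above
batch_size = 32      # value used later in this notebook
epochs = 10          # value used later in this notebook

n_train = int(n_total * 0.9)    # same expression as in random_split
n_val = n_total - n_train       # remainder goes to validation

# the last, smaller batch is kept, so round up
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(n_train, n_val, steps_per_epoch, total_steps)
```

The 447 steps per epoch match the `of 447` counter printed by the training loop.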


In [12]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [13]:
# specify the batch size (16 or 32)
batch_size = 32

In [14]:
# dataloaders for training set
train_dataloader = DataLoader(
            train_dataset,  
            sampler = RandomSampler(train_dataset), # sample batches randomly
            batch_size = batch_size
        )

In [15]:
# dataloaders for validation set
validation_dataloader = DataLoader(
            val_dataset, 
            sampler = SequentialSampler(val_dataset), # sample batches sequentially
            batch_size = batch_size
        )

## Model

In [16]:
from transformers import ElectraForSequenceClassification, AdamW

model = ElectraForSequenceClassification.from_pretrained(
        "monologg/koelectra-base-v3-discriminator",
        num_labels = 2, # binary classification   
        output_attentions = False,
        output_hidden_states = False,                          
)

#model.cuda()
model.to(device)

Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: 

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(35000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0): ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm

In [17]:
# optimizer AdamW from huggingface library
optimizer = AdamW(model.parameters(),
                  lr = 1e-5, # learning rate: 5e-5, 3e-5, 2e-5, 1e-5
                  eps = 1e-8 # adam_epsilon: default 1e-8
                )

In [18]:
# learning rate scheduler
from transformers import get_linear_schedule_with_warmup

epochs = 10

total_steps = len(train_dataloader) * epochs

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, 
                                            num_training_steps = total_steps)
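With `num_warmup_steps = 0`, the schedule simply decays the learning rate linearly from its initial value to zero over `total_steps`. The function below is a plain-Python sketch of that multiplier as I understand it; `linear_schedule_lr` is a hypothetical name, and the 4,470 total steps assume 447 batches per epoch for 10 epochs.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


base_lr = 1e-5
total_steps = 447 * 10   # assumed: 447 batches per epoch, 10 epochs

for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, base_lr, 0, total_steps))
```

Because the decay is spread over all 10 epochs, later epochs train with a much smaller learning rate than the first.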

In [19]:
# metrics (matthews correlation coefficient)
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import accuracy_score

# calculate the Matthews correlation coefficient (MCC) between predictions and labels
def cal_matthews_corr(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    # sklearn's signature is matthews_corrcoef(y_true, y_pred)
    matthews = matthews_corrcoef(labels_flat, pred_flat)
    
    return matthews
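For binary labels, the Matthews correlation coefficient can be written in terms of the confusion-matrix counts as $(TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$. The following pure-Python sketch computes it that way on a made-up toy example; `mcc_from_labels` is a hypothetical helper, and returning 0 when the denominator vanishes is a convention assumed here.

```python
import math


def mcc_from_labels(y_true, y_pred):
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # assumed convention: report 0 when any marginal count is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0


# toy example (not taken from the dataset)
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(round(mcc_from_labels(y_true, y_pred), 4))
```

Unlike accuracy, the score ranges from -1 to 1 and stays near 0 for a classifier that always predicts the majority class.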

In [20]:
import time
import datetime

# time format (time in seconds & return a string hh:mm:ss)
def format_time(elapsed):

    elapsed_rounded = int(round((elapsed)))     # round to the nearest second
    
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [21]:
import random  # Python's built-in RNG; NumPy's RNG is seeded separately below

# set seed for reproducibility
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [22]:
# store training loss, validation loss, validation score, and timings for each epoch
training_stats = []

# total training time
total_t0 = time.time()

In [34]:
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # training time for this epoch
    t0 = time.time()

    # reset the total loss for this epoch
    total_train_loss = 0

    # put the model into training mode 
    model.train()

    for step, batch in enumerate(train_dataloader):

        # progress update (every 40 batches)
        if step % 40 == 0 and not step == 0:
            # calculate elapsed time in minutes
            elapsed = format_time(time.time() - t0)
            # report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # unpack training batch from dataloader 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        #print(b_input_ids)
        #print(b_input_mask)
        #print(b_labels)

        # clear any previously calculated gradients before backward pass
        model.zero_grad()        

        # perform a forward pass (evaluate the model on training batch)
        outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask, 
                        labels=b_labels)
        loss, logits = outputs['loss'], outputs['logits']
        
        #outputs = model(b_input_ids, 
        #                token_type_ids=None, 
        #                attention_mask=b_input_mask, 
        #                labels=b_labels)
        
        #loss = outputs.loss
        #logits = outputs.logits

        # accumulate the training loss over all batches
        total_train_loss += loss.item()

        # perform a backward pass to calculate the gradients
        loss.backward()

        # clip the norm of the gradients to 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters and take a step using the computed gradient
        optimizer.step()

        # update the learning rate
        scheduler.step()

    # calculate the average loss over all batches
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # measure training time for this epoch
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    t0 = time.time()

    # put the model in evaluation mode
    model.eval()


    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    for batch in validation_dataloader:
        
        # unpack this validation batch from dataloader 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        with torch.no_grad():        

            # forward pass calculate logit predictions
            loss, logits = model(b_input_ids, 
                           token_type_ids=None, 
                           attention_mask=b_input_mask,
                           labels=b_labels,
                           return_dict=False)
            
            #outputs = model(b_input_ids, 
            #          token_type_ids=None, 
            #          attention_mask=b_input_mask,
            #          labels=b_labels,
            #          return_dict=False)
            #loss, logits = outputs['loss'], outputs['logits']
            
            #loss = outputs.loss
            #logits = outputs.logits
            
        # accumulate the validation loss
        total_eval_loss += loss.item()

        # move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # calculate the Matthews correlation (MCC) for this batch and accumulate it over all batches
        total_eval_accuracy += cal_matthews_corr(logits, label_ids)
        

    # batch-averaged MCC for this validation run (reported under the "Accuracy" label)
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # calculate the average loss over all batches
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    # measure validation time
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))
    
    torch.save(model.state_dict(), f'model{epoch_i + 1}.pt')

    # record all statistics from this epoch
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...
  Batch    40  of    447.    Elapsed: 0:00:08.
  Batch    80  of    447.    Elapsed: 0:00:16.
  Batch   120  of    447.    Elapsed: 0:00:24.
  Batch   160  of    447.    Elapsed: 0:00:33.
  Batch   200  of    447.    Elapsed: 0:00:41.
  Batch   240  of    447.    Elapsed: 0:00:50.
  Batch   280  of    447.    Elapsed: 0:00:58.
  Batch   320  of    447.    Elapsed: 0:01:07.
  Batch   360  of    447.    Elapsed: 0:01:16.
  Batch   400  of    447.    Elapsed: 0:01:24.
  Batch   440  of    447.    Elapsed: 0:01:33.

  Average training loss: 0.56
  Training epoch took: 0:01:35

Running Validation...
  Accuracy: 0.56
  Validation Loss: 0.48
  Validation took: 0:00:03

Training...
  Batch    40  of    447.    Elapsed: 0:00:09.
  Batch    80  of    447.    Elapsed: 0:00:18.
  Batch   120  of    447.    Elapsed: 0:00:26.
  Batch   160  of    447.    Elapsed: 0:00:35.
  Batch   200  of    447.    Elapsed: 0:00:44.
  Batch   240  of    447.    Elapsed: 0:00:53.
  Batch   280  of    4
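The training loop clips the gradient norm to 1.0 before every optimizer step. As a rough sketch of what norm clipping does, the plain-Python function below rescales a set of gradient vectors so that their combined L2 norm does not exceed `max_norm`. The function name and the small `eps` term are assumptions made for illustration; this is not the PyTorch implementation.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    clip_coef = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    clipped = [[g * clip_coef for g in vec] for vec in grads]
    return clipped, total_norm


# one "parameter" with gradient (3, 4): global norm 5, scaled down to about 1
clipped, norm_before = clip_grad_norm([[3.0, 4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))

# gradients already inside the limit are left unchanged
small, _ = clip_grad_norm([[0.3, 0.4]], max_norm=1.0)
print(small)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is limited.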

In [35]:
## summary of training process

pd.set_option('display.precision', 2)  # full option name; the bare 'precision' pattern is ambiguous in newer pandas

# dataframe from training statistics
df_stats = pd.DataFrame(data=training_stats)

# use the 'epoch' as the row index
df_stats = df_stats.set_index('epoch')

df_stats

| epoch | Training Loss | Valid. Loss | Valid. Accur. | Training Time | Validation Time |
|---|---|---|---|---|---|
| 1 | 0.56 | 0.48 | 0.56 | 0:01:35 | 0:00:03 |
| 2 | 0.44 | 0.49 | 0.56 | 0:01:38 | 0:00:03 |
| 3 | 0.36 | 0.52 | 0.58 | 0:01:37 | 0:00:03 |
| 4 | 0.31 | 0.57 | 0.56 | 0:01:38 | 0:00:03 |
| 5 | 0.26 | 0.62 | 0.55 | 0:01:36 | 0:00:03 |
| 6 | 0.23 | 0.67 | 0.55 | 0:01:35 | 0:00:03 |
| 7 | 0.2 | 0.74 | 0.54 | 0:01:38 | 0:00:03 |
| 8 | 0.18 | 0.77 | 0.54 | 0:01:37 | 0:00:03 |
| 9 | 0.17 | 0.79 | 0.54 | 0:01:36 | 0:00:03 |
| 10 | 0.16 | 0.8 | 0.54 | 0:01:38 | 0:00:03 |
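In the table, training loss keeps falling while validation loss rises from the first epoch onward, which is the usual sign of overfitting. One possible way to pick a checkpoint from such a log is sketched below with the values copied from the table. The selection criteria are assumptions for illustration; the notebook itself does not state how its checkpoint was chosen.

```python
# (epoch, validation loss, validation score) copied from the table above
stats = [
    (1, 0.48, 0.56), (2, 0.49, 0.56), (3, 0.52, 0.58), (4, 0.57, 0.56),
    (5, 0.62, 0.55), (6, 0.67, 0.55), (7, 0.74, 0.54), (8, 0.77, 0.54),
    (9, 0.79, 0.54), (10, 0.80, 0.54),
]

# criterion A (assumed): lowest validation loss
best_by_loss = min(stats, key=lambda row: row[1])[0]

# criterion B (assumed): highest validation score, earliest epoch on ties
best_by_score = max(stats, key=lambda row: row[2])[0]

# the training loop saves one file per epoch as model{epoch}.pt
print(f"lowest val. loss:   model{best_by_loss}.pt")
print(f"highest val. score: model{best_by_score}.pt")
```

The two criteria disagree here (epoch 1 versus epoch 3), so it helps to decide in advance which one matches the competition metric.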


## Test

In [23]:
# load the dev set into a pandas dataframe (used as the test set, since the official test labels are not released)
test_df = pd.read_csv("./data/task1/NIKL_CoLA_dev.tsv", delimiter='\t', header=0, names=['sentence_source', 'label', 'label_notes', 'sentence'])
print('Number of test sentences: {:,}\n'.format(test_df.shape[0]))
test_df.head(10)

Number of test sentences: 2,032



|   | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 0 | T00002 | 0 | * | 실없는 사람이 까불한다. |
| 1 | T00029 | 1 |  | 순희에게는 아무리 좋은 옷도 어울리지 않는다. |
| 2 | T00033 | 0 | * | 사람은 언제나 젊는 수는 없다. |
| 3 | T00036 | 0 | * | 나는 등산이 힘들다는 진실을 모르고 산에 따라갔다가 고생만 했다. |
| 4 | T00038 | 1 |  | 그가 팔을 곧게 뻗는다. |
| 5 | T00046 | 1 |  | 철수가 자. |
| 6 | T00050 | 0 | * | 철수는 여간해서는 웃는다. |
| 7 | T00050 | 1 |  | 나는 철수가 여간해서는 웃는다고 생각하지 않는다. |
| 8 | T00053 | 1 |  | 너 네가 데모를 해야 하는 이유를 모르고 데모를 했구나. |
| 9 | T00056 | 0 | * | 마음이 든든을 한데 개운치는 못하다. |


In [37]:
print('Total no. of test sentences: {:,}\n'.format(test_df.shape[0]))

Total no. of test sentences: 2,032



In [24]:
# sentence and label lists
test_sentences = test_df.sentence.values
test_labels = test_df.label.values

In [39]:
len(test_labels)

2032

In [25]:
# max length check for test dataset
max_len = 0

for sent in test_sentences:
    # tokenize the text & add [CLS] and [SEP] tokens
    input_ids = tokenizer.encode(sent, add_special_tokens=True)
    max_len = max(max_len, len(input_ids))
    
print('Maximum sentence length: ', max_len)

Maximum sentence length:  32


In [26]:
# tokenize all of the sentences 
# map the tokens to their word IDs
test_input_ids = []
test_attention_masks = []

for sent in test_sentences:

    encoded_dict = tokenizer.encode_plus(          # tokenize
                        sent,                      # sentence to encode
                        add_special_tokens = True, # add [CLS] and [SEP] tokens
                        max_length = 64,           # target length (max_len: 32 < 64)
                        padding = 'max_length',    # pad to max length
                        truncation = True,         # truncate anything longer than max length
                        return_attention_mask = True,   # construct attention masks
                        return_tensors = 'pt',     # return pytorch tensors
                   )
    
    # add the encoded sentence  
    test_input_ids.append(encoded_dict['input_ids'])
    
    # add attention mask
    test_attention_masks.append(encoded_dict['attention_mask'])

In [42]:
len(test_input_ids)

2032

In [27]:
# convert lists into tensors
test_input_ids = torch.cat(test_input_ids, dim=0)
test_attention_masks = torch.cat(test_attention_masks, dim=0)
test_labels = torch.tensor(test_labels)
#test_labels = test_labels.clone().detach()

In [44]:
test_input_ids.size()

torch.Size([2032, 64])

In [45]:
test_attention_masks.size()

torch.Size([2032, 64])

In [46]:
test_labels.size()

torch.Size([2032])

In [28]:
# set the batch size (16 or 32) 
batch_size = 32 

# dataloader
prediction_data = TensorDataset(test_input_ids, test_attention_masks, test_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

### Evaluation on Test Set


In [29]:
PATH = './model/'

In [30]:
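# map_location lets a checkpoint saved on GPU load on a CPU-only machine as well
model.load_state_dict(torch.load(PATH + 'task1_best_model2.pt', map_location=device))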

<All keys matched successfully>

In [93]:
#model.load_state_dict(torch.load('model2.pt'))

<All keys matched successfully>

In [31]:
# prediction on test set
print('Predicting labels for {:,} test sentences...'.format(len(test_input_ids)))

# put model in evaluation mode
model.eval()

predictions , true_labels = [], []

# prediction
for batch in prediction_dataloader:
    # move batch to the selected device (GPU if available)
    batch = tuple(t.to(device) for t in batch)
  
    # unpack the inputs from dataloader
    b_input_ids, b_input_mask, b_labels = batch
  
    with torch.no_grad():
        # forward pass, calculate logit predictions
        outputs = model(b_input_ids, token_type_ids=None, 
                        attention_mask=b_input_mask)

    logits = outputs[0]

    # move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
  
    # store predictions & true labels
    predictions.append(logits)
    true_labels.append(label_ids)


print('    DONE.')

Predicting labels for 2,032 test sentences...
    DONE.


In [32]:
matthews_set = []

# evaluate each test batch using the Matthews correlation coefficient
#print('Calculating Matthews Corr. Coef. for each batch...')

for i in range(len(true_labels)):
  
    # pick the label with the highest value 
    # turn this into a list of 0s and 1s.
    pred_labels_i = np.argmax(predictions[i], axis=1).flatten()
  
    # calculate and store the coefficient for this batch  
    matthews = matthews_corrcoef(true_labels[i], pred_labels_i)                
    matthews_set.append(matthews)

In [36]:
### MCC calculation across all batches
# combine the results across all batches 
flat_predictions = np.concatenate(predictions, axis=0)

# pick the label (0 or 1) with the higher score for each sample 
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

# combine the correct labels for each batch into a single list
flat_true_labels = np.concatenate(true_labels, axis=0)

# calculate the MCC
mcc = matthews_corrcoef(flat_true_labels, flat_predictions)

print('Total MCC: %.4f' % mcc)


Total MCC: 0.5464
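The score above is computed once over all concatenated predictions, whereas the validation loop averages per-batch MCC values. The two are generally not equal, because MCC is not a simple per-sample average. The pure-Python sketch below illustrates the difference on made-up toy batches; `mcc` is a hypothetical helper, and returning 0 for a degenerate batch is an assumed convention.

```python
import math


def mcc(y_true, y_pred):
    """Binary Matthews correlation coefficient; 0.0 if the denominator is zero (assumed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# two toy batches of (true labels, predicted labels); not taken from the dataset
batches = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),
    ([1, 1, 1, 0], [1, 1, 0, 1]),
]

# (a) average of per-batch scores, as in the validation loop
batch_mean = sum(mcc(t, p) for t, p in batches) / len(batches)

# (b) one score over all samples, as in the "Total MCC" cell
all_true = [t for ts, _ in batches for t in ts]
all_pred = [p for _, ps in batches for p in ps]
overall = mcc(all_true, all_pred)

print(round(batch_mean, 4), round(overall, 4))
```

For comparison with a leaderboard, the overall score in (b) is the safer number to report, since (a) depends on the batch size and on how samples happen to fall into batches.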
