
# PyTorch + HuggingFace
## KoELECTRA Model
Uses KoELECTRA-small by Jangwon Park<br>
https://monologg.kr/2020/05/02/koelectra-part1/<br>
https://github.com/monologg/KoELECTRA

## Dataset
Naver movie review dataset (NSMC)<br>
https://github.com/e9t/nsmc

## References
- https://huggingface.co/transformers/training.html
- https://tutorials.pytorch.kr/beginner/data_loading_tutorial.html
- https://tutorials.pytorch.kr/beginner/blitz/cifar10_tutorial.html
- https://wikidocs.net/44249

## Caution
Be sure to run this on a GPU. One epoch takes about 20 minutes.

In [None]:
# Install HuggingFace transformers and download the NSMC dataset
!pip install transformers
!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt
!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/0c/7d5950fcd80b029be0a8891727ba21e0cd27692c407c51261c3c921f6da3/transformers-4.1.1-py3-none-any.whl (1.5MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=2b565b7c9430bb55a8c

In [None]:
!head ratings_train.txt
!head ratings_test.txt

id	document	label
9976970	아 더빙.. 진짜 짜증나네요 목소리	0
3819312	흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나	1
10265843	너무재밓었다그래서보는것을추천한다	0
9045019	교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정	0
6483659	사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다	1
5403919	막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.	0
7797314	원작의 긴장감을 제대로 살려내지못했다.	0
9443947	별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단 낫겟다 납치.감금만반복반복..이드라마는 가족도없다 연기못하는사람만모엿네	0
7156791	액션이 없는데도 재미 있는 몇안되는 영화	1
id	document	label
6270596	굳 ㅋ	1
9274899	GDNTOPCLASSINTHECLUB	0
8544678	뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아	0
6825595	지루하지는 않은데 완전 막장임... 돈주고 보기에는....	0
6723715	3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??	0
7898805	음악이 주가 된, 최고의 음악영화	1
6315043	진정한 쓰레기	0
6097171	마치 미국애니에서 튀어나온듯한 창의력없는 로봇디자인부터가,고개를 젖게한다	0
8932678	갈수록 개판되가는 중국영화 유치하고 내용없음 폼잡다 끝남 말도안되는 무기에 유치한cg남무 아 그립다 동사서독같은 영화가 이건 3류아류작이다	0


In [None]:
import pandas as pd
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, ElectraForSequenceClassification, AdamW
from tqdm.notebook import tqdm

In [None]:
# Use the GPU
device = torch.device("cuda")

# Build and load the Dataset

In [None]:
class NSMCDataset(Dataset):
  
  def __init__(self, csv_file):
    # Some rows contain NaN values, so drop them
    self.dataset = pd.read_csv(csv_file, sep='\t').dropna(axis=0) 
    # Remove duplicate reviews
    self.dataset.drop_duplicates(subset=['document'], inplace=True)
    self.tokenizer = AutoTokenizer.from_pretrained("monologg/koelectra-small-v3-discriminator")

    print(self.dataset.describe())
  
  def __len__(self):
    return len(self.dataset)
  
  def __getitem__(self, idx):
    row = self.dataset.iloc[idx, 1:3].values
    text = row[0]
    y = row[1]

    inputs = self.tokenizer(
        text, 
        return_tensors='pt',
        truncation=True,
        max_length=256,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        add_special_tokens=True
        )
    
    input_ids = inputs['input_ids'][0]
    attention_mask = inputs['attention_mask'][0]

    return input_ids, attention_mask, y

In [None]:
train_dataset = NSMCDataset("ratings_train.txt")
test_dataset = NSMCDataset("ratings_test.txt")


                 id          label
count  1.461820e+05  146182.000000
mean   6.779186e+06       0.498283
std    2.919223e+06       0.499999
min    3.300000e+01       0.000000
25%    4.814832e+06       0.000000
50%    7.581160e+06       0.000000
75%    9.274760e+06       1.000000
max    1.027815e+07       1.000000
                 id         label
count  4.915700e+04  49157.000000
mean   6.752945e+06      0.502695
std    2.937158e+06      0.499998
min    6.010000e+02      0.000000
25%    4.777143e+06      0.000000
50%    7.565415e+06      1.000000
75%    9.260204e+06      1.000000
max    1.027809e+07      1.000000


# Create Model

In [None]:
train_dataset[1]



(tensor([    2,  3854,    18,    18,    18, 14061,  4275,  4219,  3461,  4991,
         22682,  4612,    18,    18,    18,    18, 11618,  4049,  4031,  4084,
          4482,  9745,  4200,  3083,  9513,     3,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,  

In [None]:
model = ElectraForSequenceClassification.from_pretrained("monologg/koelectra-small-v3-discriminator").to(device)


Some weights of the model checkpoint at monologg/koelectra-small-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-small-v3-discriminator and are newly initialized

In [None]:
# Inspect the model layers
model

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(35000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (embeddings_project): Linear(in_features=128, out_features=256, bias=True)
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0): ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=256, out_features=256, bias=True)
              (key): Linear(in_features=256, out_features=256, bias=True)
              (value): Linear(in_features=256, out_features=256, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_

# Learn

In [None]:
epochs = 3
batch_size = 128

In [None]:
optimizer = AdamW(model.parameters(), lr=1e-5)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)

In [None]:
train_loader

<torch.utils.data.dataloader.DataLoader at 0x7f1ff3dd3128>

In [None]:
losses = []
accuracies = []

for i in range(epochs):
  total_loss = 0.0
  correct = 0
  total = 0
  batches = 0

  model.train()

  for input_ids_batch, attention_masks_batch, y_batch in tqdm(train_loader):
    optimizer.zero_grad()
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    loss = F.cross_entropy(y_pred, y_batch)
    loss.backward()
    optimizer.step()

    total_loss += loss.item()

    _, predicted = torch.max(y_pred, 1)
    correct += (predicted == y_batch).sum()
    total += len(y_batch)

    batches += 1
    if batches % 100 == 0:
      print("Batch Loss:", total_loss, "Accuracy:", correct.float() / total)
  
  losses.append(total_loss)
  accuracies.append(correct.float() / total)
  print("Train Loss:", total_loss, "Accuracy:", correct.float() / total)



Batch Loss: 36.11901494860649 Accuracy: tensor(0.8446, device='cuda:0')
Batch Loss: 71.67900919914246 Accuracy: tensor(0.8457, device='cuda:0')
Batch Loss: 107.1531828045845 Accuracy: tensor(0.8468, device='cuda:0')
Batch Loss: 142.32082165777683 Accuracy: tensor(0.8472, device='cuda:0')
Batch Loss: 176.7610146254301 Accuracy: tensor(0.8480, device='cuda:0')
Batch Loss: 210.78226475417614 Accuracy: tensor(0.8490, device='cuda:0')
Batch Loss: 244.77438080310822 Accuracy: tensor(0.8497, device='cuda:0')
Batch Loss: 278.73658038675785 Accuracy: tensor(0.8502, device='cuda:0')
Batch Loss: 312.0816454589367 Accuracy: tensor(0.8511, device='cuda:0')
Batch Loss: 345.40658144652843 Accuracy: tensor(0.8519, device='cuda:0')
Batch Loss: 378.1699725687504 Accuracy: tensor(0.8526, device='cuda:0')

Train Loss: 391.7557973712683 Accuracy: tensor(0.8530, device='cuda:0')


Batch Loss: 31.583349481225014 Accuracy: tensor(0.8616, device='cuda:0')
Batch Loss: 64.83433014154434 Accuracy: tensor(0.8611, device='cuda:0')
Batch Loss: 94.99985300004482 Accuracy: tensor(0.8652, device='cuda:0')
Batch Loss: 126.57420615851879 Accuracy: tensor(0.8655, device='cuda:0')
Batch Loss: 156.1950791925192 Accuracy: tensor(0.8674, device='cuda:0')
Batch Loss: 187.92248578369617 Accuracy: tensor(0.8672, device='cuda:0')
Batch Loss: 217.67785961925983 Accuracy: tensor(0.8684, device='cuda:0')
Batch Loss: 249.11092068254948 Accuracy: tensor(0.8680, device='cuda:0')
Batch Loss: 280.31765089929104 Accuracy: tensor(0.8681, device='cuda:0')
Batch Loss: 311.323682308197 Accuracy: tensor(0.8680, device='cuda:0')
Batch Loss: 341.3087706118822 Accuracy: tensor(0.8686, device='cuda:0')

Train Loss: 353.7222055196762 Accuracy: tensor(0.8689, device='cuda:0')


Batch Loss: 30.0193130671978 Accuracy: tensor(0.8704, device='cuda:0')
Batch Loss: 58.784967020154 Accuracy: tensor(0.8744, device='cuda:0')
Batch Loss: 87.92221242189407 Accuracy: tensor(0.8754, device='cuda:0')
Batch Loss: 116.78564243018627 Accuracy: tensor(0.8764, device='cuda:0')
Batch Loss: 145.57954773306847 Accuracy: tensor(0.8770, device='cuda:0')
Batch Loss: 174.31967590749264 Accuracy: tensor(0.8773, device='cuda:0')
Batch Loss: 203.52684943377972 Accuracy: tensor(0.8770, device='cuda:0')
Batch Loss: 232.1886784285307 Accuracy: tensor(0.8771, device='cuda:0')
Batch Loss: 260.70140667259693 Accuracy: tensor(0.8775, device='cuda:0')
Batch Loss: 289.4855782240629 Accuracy: tensor(0.8776, device='cuda:0')
Batch Loss: 318.87057243287563 Accuracy: tensor(0.8773, device='cuda:0')

Train Loss: 332.15166637301445 Accuracy: tensor(0.8774, device='cuda:0')


In [None]:
losses, accuracies

([391.7557973712683, 353.7222055196762, 332.15166637301445],
 [tensor(0.8530, device='cuda:0'),
  tensor(0.8689, device='cuda:0'),
  tensor(0.8774, device='cuda:0')])
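The `losses` values above are sums of per-batch losses, so they grow with the number of batches and are hard to compare across batch sizes. A minimal sketch, assuming the training-set size (146182) and batch size (128) reported earlier, that converts them to a mean loss per batch. `num_batches` and `mean_loss_per_batch` are hypothetical helper names, not part of the notebook.

```python
import math

def num_batches(num_samples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

def mean_loss_per_batch(summed_losses, batches):
    """Turn per-epoch loss sums into per-batch averages."""
    return [total / batches for total in summed_losses]

# Values copied from the outputs above.
train_size, batch_size = 146182, 128
epoch_loss_sums = [391.7557973712683, 353.7222055196762, 332.15166637301445]

batches = num_batches(train_size, batch_size)  # 1143 batches per epoch
mean_losses = mean_loss_per_batch(epoch_loss_sums, batches)
print(batches, [round(x, 4) for x in mean_losses])
```

Reporting the mean instead of the sum keeps the numbers comparable if the batch size is changed in a later experiment.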

Check accuracy on the test dataset

In [None]:
model.eval()

test_correct = 0
test_total = 0

with torch.no_grad():  # gradients are not needed for evaluation
  for input_ids_batch, attention_masks_batch, y_batch in tqdm(test_loader):
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    _, predicted = torch.max(y_pred, 1)
    test_correct += (predicted == y_batch).sum()
    test_total += len(y_batch)

print("Accuracy:", test_correct.float() / test_total)


Accuracy: tensor(0.8767, device='cuda:0')


In [None]:
# Save the model
torch.save(model.state_dict(), "model.pt")

# Tuning
- Preprocess the text
- Use the latest model

In [None]:
from transformers import ElectraModel, ElectraTokenizer

In [None]:
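# The tokenizer adds [CLS]/[SEP] automatically. Open question after preprocessing: just clean the text, or also split sentences so that [SEP] appears between them?
# Assumption: `tokenizer` and `inputs` were defined in a cell that is missing from this export.
tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])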

In [None]:
pip install soynlp emoji kss

Collecting soynlp
  Downloading https://files.pythonhosted.org/packages/7e/50/6913dc52a86a6b189419e59f9eef1b8d599cffb6f44f7bb91854165fc603/soynlp-0.0.493-py3-none-any.whl (416kB)

It looks like the text should be split into sentences and then tokenized,
- after running clean first

# Experiment 1
- Preprocess the data with clean
- No sentence splitting with SEP inserted between sentences
- Set max_len to 128; 256 is also possible, and so are smaller values
- post / why does it have to be post?
- Training batch 128, test batch 16; is there a better batch_size?
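
On the batch-size question in the list above: at evaluation time no gradients are computed, so the test batch size should only affect memory use and speed, not the measured accuracy. A small pure-Python sketch of that claim, using made-up predictions and labels and a hypothetical `batched_accuracy` helper:

```python
def batched_accuracy(predictions, labels, batch_size):
    """Accumulate correct/total counts batch by batch, as the evaluation loop does."""
    correct = 0
    total = 0
    for start in range(0, len(labels), batch_size):
        batch_preds = predictions[start:start + batch_size]
        batch_labels = labels[start:start + batch_size]
        correct += sum(p == y for p, y in zip(batch_preds, batch_labels))
        total += len(batch_labels)
    return correct / total

# Made-up data for illustration only.
predictions = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
labels      = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

acc_16 = batched_accuracy(predictions, labels, batch_size=16)
acc_3 = batched_accuracy(predictions, labels, batch_size=3)
print(acc_16, acc_3)
```

The training batch size is a different matter: it changes the number of optimizer steps per epoch and the noise in each gradient estimate, so it interacts with the learning rate.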

## Changes from the original
1. Preprocessing
2. max_len
3. Added eps = 1e-8
- But accuracy dropped!
--- 
Let's at least check the maximum length of the tokenizer output without padding.
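
The length check proposed above can be sketched without the tokenizer. Assuming a list of unpadded token counts (the numbers below are made up, not measured from NSMC), the hypothetical helper `coverage` reports what fraction of reviews fit within a given `max_length` without truncation:

```python
def coverage(token_lengths, max_length):
    """Fraction of sequences whose unpadded length fits within max_length."""
    fitting = sum(1 for n in token_lengths if n <= max_length)
    return fitting / len(token_lengths)

# Made-up unpadded lengths (counting [CLS] and [SEP]); real values would come
# from len(tokenizer(text)['input_ids']) for each review.
token_lengths = [5, 12, 26, 40, 64, 90, 125, 18, 33, 7]

print("longest:", max(token_lengths))
for max_length in (32, 64, 128, 256):
    print(max_length, coverage(token_lengths, max_length))
```

If coverage is already 1.0 at some length, raising `max_length` further only adds padding.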

In [None]:
import re
import emoji
import kss
from soynlp.normalizer import repeat_normalize

emojis = ''.join(emoji.UNICODE_EMOJI.keys())
pattern = re.compile(f'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-힣{emojis}]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

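def clean(x):
    x = pattern.sub(' ', x)                  # replace characters outside the allowed set with a space
    x = url_pattern.sub('', x)               # remove URLs
    x = x.strip()                            # trim leading and trailing whitespace
    x = repeat_normalize(x, num_repeats=2)   # shorten runs of repeated characters
    return x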
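
To make the last step of `clean` concrete, here is a simplified stand-in for repeat normalization that uses only the standard library. It collapses runs of a single repeated character to at most `num_repeats` copies. It is not soynlp's implementation, and `squeeze_repeats` is a hypothetical name.

```python
import re

def squeeze_repeats(text, num_repeats=2):
    """Collapse runs of one repeated character to at most num_repeats copies."""
    pattern = re.compile(r'(.)\1{%d,}' % num_repeats)
    return pattern.sub(lambda match: match.group(1) * num_repeats, text)

print(squeeze_repeats('ㅋㅋㅋㅋㅋ 재밌다!!!!'))
```

Normalizing repeats like this maps many spelling variants of the same reaction onto one token sequence, at the cost of discarding whatever emphasis the extra characters carried.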

In [None]:
class NSMCDataset(Dataset):
  
  def __init__(self, csv_file):
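    # Remove missing values and duplicates
    self.dataset = pd.read_csv(csv_file, sep='\t').dropna(axis=0) 
    self.dataset.drop_duplicates(subset=['document'], inplace=True)
    # Note: the tokenizer comes from the base checkpoint, while the model trained below is koelectra-small
    self.tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")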
    print(self.dataset.describe())
  
  def __len__(self):
    return len(self.dataset)
  
  def __getitem__(self, idx):
    row = self.dataset.iloc[idx, 1:3].values
    text = row[0]
    y = row[1]

    inputs = self.tokenizer(
        clean(text), 
        return_tensors='pt',
        truncation=True,
        max_length=128,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        add_special_tokens=True
        )
    
    input_ids = inputs['input_ids'][0]
    attention_mask = inputs['attention_mask'][0]

    return input_ids, attention_mask, y

In [None]:
train_dataset = NSMCDataset("ratings_train.txt")
test_dataset = NSMCDataset("ratings_test.txt")

                 id          label
count  1.461820e+05  146182.000000
mean   6.779186e+06       0.498283
std    2.919223e+06       0.499999
min    3.300000e+01       0.000000
25%    4.814832e+06       0.000000
50%    7.581160e+06       0.000000
75%    9.274760e+06       1.000000
max    1.027815e+07       1.000000
                 id         label
count  4.915700e+04  49157.000000
mean   6.752945e+06      0.502695
std    2.937158e+06      0.499998
min    6.010000e+02      0.000000
25%    4.777143e+06      0.000000
50%    7.565415e+06      1.000000
75%    9.260204e+06      1.000000
max    1.027809e+07      1.000000


In [None]:
model = ElectraForSequenceClassification.from_pretrained("monologg/koelectra-small-v3-discriminator").to(device)

epochs = 3
batch_size = 128

optimizer = AdamW(model.parameters(), lr=1e-5, eps = 1e-8)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)
# Why do the train and test batch sizes differ? Is it because the training set is larger than the test set?

losses = []
accuracies = []

for i in range(epochs):
  total_loss = 0.0
  correct = 0
  total = 0
  batches = 0

  model.train()

  for input_ids_batch, attention_masks_batch, y_batch in tqdm(train_loader):
    optimizer.zero_grad()
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    loss = F.cross_entropy(y_pred, y_batch) # is this the right loss to use?
    loss.backward()
    optimizer.step()

    total_loss += loss.item()

    _, predicted = torch.max(y_pred, 1)
    correct += (predicted == y_batch).sum()
    total += len(y_batch)

    batches += 1
    if batches % 100 == 0:
      print("Batch Loss:", total_loss, "Accuracy:", correct.float() / total)
  
  losses.append(total_loss)
  accuracies.append(correct.float() / total)
  print("Train Loss:", total_loss, "Accuracy:", correct.float() / total)


Some weights of the model checkpoint at monologg/koelectra-small-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-small-v3-discriminator and are newly initialized


Batch Loss: 68.85002934932709 Accuracy: tensor(0.5724, device='cuda:0')
Batch Loss: 127.70813807845116 Accuracy: tensor(0.6660, device='cuda:0')
Batch Loss: 177.47694730758667 Accuracy: tensor(0.7068, device='cuda:0')
Batch Loss: 222.75618633627892 Accuracy: tensor(0.7318, device='cuda:0')
Batch Loss: 265.1090400516987 Accuracy: tensor(0.7488, device='cuda:0')
Batch Loss: 306.7903261780739 Accuracy: tensor(0.7603, device='cuda:0')
Batch Loss: 346.1428607404232 Accuracy: tensor(0.7700, device='cuda:0')
Batch Loss: 384.83808848261833 Accuracy: tensor(0.7776, device='cuda:0')
Batch Loss: 423.6716893315315 Accuracy: tensor(0.7832, device='cuda:0')
Batch Loss: 460.69942715764046 Accuracy: tensor(0.7889, device='cuda:0')
Batch Loss: 497.5945842862129 Accuracy: tensor(0.7937, device='cuda:0')

Train Loss: 513.7507925033569 Accuracy: tensor(0.7956, device='cuda:0')


Batch Loss: 35.66193188726902 Accuracy: tensor(0.8445, device='cuda:0')
Batch Loss: 70.76259037852287 Accuracy: tensor(0.8468, device='cuda:0')
Batch Loss: 105.6323453783989 Accuracy: tensor(0.8483, device='cuda:0')
Batch Loss: 140.3365130573511 Accuracy: tensor(0.8493, device='cuda:0')
Batch Loss: 173.77634803950787 Accuracy: tensor(0.8509, device='cuda:0')
Batch Loss: 208.21840286254883 Accuracy: tensor(0.8507, device='cuda:0')
Batch Loss: 241.0422344058752 Accuracy: tensor(0.8525, device='cuda:0')
Batch Loss: 274.749304369092 Accuracy: tensor(0.8526, device='cuda:0')
Batch Loss: 307.6197340488434 Accuracy: tensor(0.8530, device='cuda:0')
Batch Loss: 341.01571345329285 Accuracy: tensor(0.8536, device='cuda:0')
Batch Loss: 373.77131205797195 Accuracy: tensor(0.8540, device='cuda:0')

Train Loss: 387.56809754669666 Accuracy: tensor(0.8542, device='cuda:0')


Batch Loss: 32.04944866895676 Accuracy: tensor(0.8612, device='cuda:0')
Batch Loss: 62.993618085980415 Accuracy: tensor(0.8671, device='cuda:0')
Batch Loss: 94.58239142596722 Accuracy: tensor(0.8664, device='cuda:0')
Batch Loss: 125.21282079815865 Accuracy: tensor(0.8679, device='cuda:0')
Batch Loss: 155.7379980236292 Accuracy: tensor(0.8694, device='cuda:0')
Batch Loss: 186.41338975727558 Accuracy: tensor(0.8693, device='cuda:0')
Batch Loss: 216.7253766655922 Accuracy: tensor(0.8696, device='cuda:0')
Batch Loss: 247.9130358248949 Accuracy: tensor(0.8691, device='cuda:0')
Batch Loss: 278.3704405426979 Accuracy: tensor(0.8688, device='cuda:0')
Batch Loss: 309.137654453516 Accuracy: tensor(0.8688, device='cuda:0')
Batch Loss: 340.41836562752724 Accuracy: tensor(0.8687, device='cuda:0')

Train Loss: 353.30324913561344 Accuracy: tensor(0.8688, device='cuda:0')


In [None]:
losses, accuracies

([513.7507925033569, 387.56809754669666, 353.30324913561344],
 [tensor(0.7956, device='cuda:0'),
  tensor(0.8542, device='cuda:0'),
  tensor(0.8688, device='cuda:0')])

In [None]:
model.eval()

test_correct = 0
test_total = 0

with torch.no_grad():  # gradients are not needed for evaluation
  for input_ids_batch, attention_masks_batch, y_batch in tqdm(test_loader):
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    _, predicted = torch.max(y_pred, 1)
    test_correct += (predicted == y_batch).sum()
    test_total += len(y_batch)

print("Accuracy:", test_correct.float() / test_total)
# Save the model
torch.save(model.state_dict(), "model_2.pt")


Accuracy: tensor(0.8719, device='cuda:0')


- Accuracy actually dropped after preprocessing. Why?
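
One way to read the drop from 0.8767 to 0.8719 is to compare it with the sampling noise of a test set of 49157 reviews. The sketch below uses the binomial standard error and treats the two runs as independent, which is a simplifying assumption because both were scored on the same test set. `accuracy_std_error` is a hypothetical helper. Note also that the baseline log starts at about 0.84 accuracy within its first 100 batches while this run starts near 0.57, which may mean the baseline cell was re-run on an already fine-tuned model, so the two runs may not be directly comparable.

```python
import math

def accuracy_std_error(accuracy, num_samples):
    """Binomial standard error of an accuracy measured on num_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)

# Test accuracies and test-set size copied from the outputs above.
baseline_acc, experiment_acc, test_size = 0.8767, 0.8719, 49157

se_baseline = accuracy_std_error(baseline_acc, test_size)
se_experiment = accuracy_std_error(experiment_acc, test_size)
se_diff = math.sqrt(se_baseline ** 2 + se_experiment ** 2)  # assumes independent runs
z_score = (baseline_acc - experiment_acc) / se_diff

print(f"difference={baseline_acc - experiment_acc:.4f} se={se_diff:.4f} z={z_score:.2f}")
```

A gap of a couple of standard errors is small enough that run-to-run variation (random initialization of the classification head, shuffling order) could plausibly account for part of it. Repeating each configuration with several seeds would give a firmer answer.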


In [None]:
training_set = pd.read_csv("ratings_test.txt", sep='\t')

training_set.dropna(axis=0, inplace=True)
training_set.drop_duplicates(subset=['document'], inplace=True)

training_set['token'] = training_set['document'].apply(lambda x: tokenizer(x)['input_ids'])

# Even after tokenization, the longest sentence is 125 tokens. So why did performance drop when max_len was set to 125?
max_len = max(len(ids) for ids in training_set['token'])
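
If the longest tokenized review is around 125 tokens, padding every example to 256 spends most of each batch on padding. A common alternative is to pad each batch only to its own longest sequence. The pure-Python sketch below illustrates the idea on made-up token ids. `pad_batch` is a hypothetical helper, and pad id 0 is taken from the padded `train_dataset[1]` output shown earlier.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in the batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Made-up token ids; 2 and 3 stand for [CLS] and [SEP] as in the earlier output.
batch = [[2, 3854, 18, 3], [2, 14061, 4275, 4219, 3461, 3], [2, 3]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

In a `DataLoader` this would typically live in a `collate_fn`, with the dataset returning unpadded ids.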

In [None]:
text

'안녕하세요'

In [None]:
class NSMCDataset(Dataset):
  def __init__(self, csv_file, max_len):
    # Remove missing values and duplicates
    self.dataset = pd.read_csv(csv_file, sep='\t').dropna(axis=0) 
    self.dataset.drop_duplicates(subset=['document'], inplace=True)
    self.tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
    self.max_len = max_len
    print(self.dataset.describe())
  
  def __len__(self):
    return len(self.dataset)
  
  def __getitem__(self, idx):
    row = self.dataset.iloc[idx, 1:3].values
    text = row[0]
    y = row[1]

    inputs = self.tokenizer(
        clean(text), 
        return_tensors='pt',
        truncation=True,
        max_length=self.max_len,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        add_special_tokens=True
        )
    
    input_ids = inputs['input_ids'][0]
    attention_mask = inputs['attention_mask'][0]

    return input_ids, attention_mask, y

In [None]:
train_dataset = NSMCDataset("ratings_train.txt", max_len=256)
test_dataset = NSMCDataset("ratings_test.txt", max_len=256)

                 id          label
count  1.461820e+05  146182.000000
mean   6.779186e+06       0.498283
std    2.919223e+06       0.499999
min    3.300000e+01       0.000000
25%    4.814832e+06       0.000000
50%    7.581160e+06       0.000000
75%    9.274760e+06       1.000000
max    1.027815e+07       1.000000
                 id         label
count  4.915700e+04  49157.000000
mean   6.752945e+06      0.502695
std    2.937158e+06      0.499998
min    6.010000e+02      0.000000
25%    4.777143e+06      0.000000
50%    7.565415e+06      1.000000
75%    9.260204e+06      1.000000
max    1.027809e+07      1.000000


In [None]:
training()

Some weights of the model checkpoint at monologg/koelectra-small-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-small-v3-discriminator and are newly initialized


Batch Loss: 68.54078429937363 Accuracy: tensor(0.6013, device='cuda:0')
Batch Loss: 126.40437650680542 Accuracy: tensor(0.6866, device='cuda:0')
Batch Loss: 175.2098125219345 Accuracy: tensor(0.7248, device='cuda:0')
Batch Loss: 220.33490693569183 Accuracy: tensor(0.7461, device='cuda:0')
Batch Loss: 262.70421147346497 Accuracy: tensor(0.7612, device='cuda:0')
Batch Loss: 303.67324408888817 Accuracy: tensor(0.7715, device='cuda:0')
Batch Loss: 343.20827239751816 Accuracy: tensor(0.7799, device='cuda:0')
Batch Loss: 381.023539185524 Accuracy: tensor(0.7871, device='cuda:0')
Batch Loss: 418.39578261971474 Accuracy: tensor(0.7929, device='cuda:0')
Batch Loss: 455.9719938337803 Accuracy: tensor(0.7972, device='cuda:0')
Batch Loss: 491.7780366688967 Accuracy: tensor(0.8014, device='cuda:0')

Train Loss: 506.83726309239864 Accuracy: tensor(0.8032, device='cuda:0')


Batch Loss: 35.57343229651451 Accuracy: tensor(0.8492, device='cuda:0')
Batch Loss: 70.36522954702377 Accuracy: tensor(0.8504, device='cuda:0')
Batch Loss: 104.43516117334366 Accuracy: tensor(0.8511, device='cuda:0')
Batch Loss: 138.30078062415123 Accuracy: tensor(0.8515, device='cuda:0')
Batch Loss: 171.20911346375942 Accuracy: tensor(0.8533, device='cuda:0')
Batch Loss: 204.04435393214226 Accuracy: tensor(0.8550, device='cuda:0')
Batch Loss: 236.4515583217144 Accuracy: tensor(0.8555, device='cuda:0')
Batch Loss: 268.9024176597595 Accuracy: tensor(0.8563, device='cuda:0')
Batch Loss: 301.48681992292404 Accuracy: tensor(0.8566, device='cuda:0')
Batch Loss: 334.1799990981817 Accuracy: tensor(0.8570, device='cuda:0')
Batch Loss: 366.723610162735 Accuracy: tensor(0.8573, device='cuda:0')

Train Loss: 380.17943701148033 Accuracy: tensor(0.8576, device='cuda:0')


Batch Loss: 31.376612946391106 Accuracy: tensor(0.8652, device='cuda:0')
Batch Loss: 62.242679953575134 Accuracy: tensor(0.8661, device='cuda:0')
Batch Loss: 94.0781361758709 Accuracy: tensor(0.8659, device='cuda:0')
Batch Loss: 124.66866047680378 Accuracy: tensor(0.8668, device='cuda:0')
Batch Loss: 154.78246684372425 Accuracy: tensor(0.8677, device='cuda:0')
Batch Loss: 186.55598832666874 Accuracy: tensor(0.8671, device='cuda:0')
Batch Loss: 217.01462198793888 Accuracy: tensor(0.8673, device='cuda:0')
Batch Loss: 247.43035499751568 Accuracy: tensor(0.8675, device='cuda:0')
Batch Loss: 278.1072434037924 Accuracy: tensor(0.8677, device='cuda:0')
Batch Loss: 307.79958564043045 Accuracy: tensor(0.8685, device='cuda:0')
Batch Loss: 337.2118571102619 Accuracy: tensor(0.8690, device='cuda:0')

Train Loss: 349.3403007276356 Accuracy: tensor(0.8694, device='cuda:0')


In [None]:
evaluate('model_3')


Accuracy: tensor(0.8719, device='cuda:0')


In [None]:
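def training(epochs=3, batch_size=128):
  # Publish the results as globals so that evaluate() and the `losses, accuracies` cell
  # see this run. Without this, evaluate() scores the global model left over from the
  # previous experiment, which may explain the identical 0.8719 test accuracy.
  global model, train_loader, test_loader, losses, accuracies

  model = ElectraForSequenceClassification.from_pretrained("monologg/koelectra-small-v3-discriminator").to(device)

  # epochs and batch_size now come from the arguments instead of being overwritten here
  optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)
  train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
  test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)
  # Why do the train and test batch sizes differ? Is it because the training set is larger than the test set?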

  losses = []
  accuracies = []

  for i in range(epochs):
    total_loss = 0.0
    correct = 0
    total = 0
    batches = 0

    model.train()

    for input_ids_batch, attention_masks_batch, y_batch in tqdm(train_loader):
      optimizer.zero_grad()
      y_batch = y_batch.to(device)
      y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
      loss = F.cross_entropy(y_pred, y_batch) # is this the right loss to use?
      loss.backward()
      optimizer.step()

      total_loss += loss.item()

      _, predicted = torch.max(y_pred, 1)
      correct += (predicted == y_batch).sum()
      total += len(y_batch)

      batches += 1
      if batches % 100 == 0:
        print("Batch Loss:", total_loss, "Accuracy:", correct.float() / total)
    
    losses.append(total_loss)
    accuracies.append(correct.float() / total)
    print("Train Loss:", total_loss, "Accuracy:", correct.float() / total)

In [None]:
losses, accuracies

In [None]:
def evaluate(model_save):
  model.eval()

  test_correct = 0
  test_total = 0

  with torch.no_grad():  # gradients are not needed for evaluation
    for input_ids_batch, attention_masks_batch, y_batch in tqdm(test_loader):
      y_batch = y_batch.to(device)
      y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
      _, predicted = torch.max(y_pred, 1)
      test_correct += (predicted == y_batch).sum()
      test_total += len(y_batch)

  print("Accuracy:", test_correct.float() / test_total)
  # Save the model
  torch.save(model.state_dict(), "{}.pt".format(model_save))

In [None]:
train_dataset = NSMCDataset("ratings_train.txt", 256)
test_dataset = NSMCDataset("ratings_test.txt", 256)

TypeError: ignored