<a href="https://colab.research.google.com/github/whitechocobread/Ai-project/blob/main/5%EC%A3%BC%EC%B0%A8/nsmc_huggingface_koelectra_ipynb%EC%9D%98_%EC%82%AC%EB%B3%B8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# PyTorch + HuggingFace
## KoELECTRA Model
Uses Jangwon Park's KoELECTRA. The model cell below loads the `monologg/koelectra-base-v3-discriminator` checkpoint.<br>
https://monologg.kr/2020/05/02/koelectra-part1/<br>
https://github.com/monologg/KoELECTRA

## Dataset
Naver movie review dataset (NSMC)<br>
https://github.com/e9t/nsmc

## References
- https://huggingface.co/transformers/training.html
- https://tutorials.pytorch.kr/beginner/data_loading_tutorial.html
- https://tutorials.pytorch.kr/beginner/blitz/cifar10_tutorial.html
- https://wikidocs.net/44249

## Caution
Be sure to run this on a GPU: one epoch takes about 20 minutes.

In [None]:
# Install HuggingFace transformers and download the NSMC dataset
!pip install transformers
!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt
!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt

--2023-10-11 08:42:21--  https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4893335 (4.7M) [text/plain]
Saving to: ‘ratings_test.txt.1’


2023-10-11 08:42:21 (265 MB/s) - ‘ratings_test.txt.1’ saved [4893335/4893335]

--2023-10-11 08:42:21--  https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14628807 (14M) [text/plain]
Saving to: ‘ratings_train.txt.1’


2023-10-11 08:42:21 (448 MB/s) - ‘ratings_train.txt
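
The log above saves to `ratings_test.txt.1` and `ratings_train.txt.1`, which is what `wget` does when a file with the requested name already exists, so re-running the cell accumulates copies. A minimal sketch of a download-if-missing helper is shown below. The name `download_if_missing` and its `fetch` parameter are hypothetical, introduced only for illustration, and are not part of the original notebook.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url, path, fetch=urllib.request.urlretrieve):
    """Download `url` to `path` unless the file is already there.

    Hypothetical helper: `fetch` is any callable taking (url, filename)
    and defaults to urllib.request.urlretrieve.
    Returns True if a download happened, False if the file was reused.
    """
    target = Path(path)
    if target.exists():
        return False
    fetch(url, str(target))
    return True


# Example call (performs a network request, so it is left commented out):
# base = "https://raw.githubusercontent.com/e9t/nsmc/master/"
# for name in ("ratings_train.txt", "ratings_test.txt"):
#     download_if_missing(base + name, name)
```

Passing `fetch` as a parameter keeps the helper easy to exercise without network access.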

In [None]:
!head ratings_train.txt
!head ratings_test.txt

id	document	label
9976970	아 더빙.. 진짜 짜증나네요 목소리	0
3819312	흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나	1
10265843	너무재밓었다그래서보는것을추천한다	0
9045019	교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정	0
6483659	사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다	1
5403919	막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.	0
7797314	원작의 긴장감을 제대로 살려내지못했다.	0
9443947	별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단 낫겟다 납치.감금만반복반복..이드라마는 가족도없다 연기못하는사람만모엿네	0
7156791	액션이 없는데도 재미 있는 몇안되는 영화	1
id	document	label
6270596	굳 ㅋ	1
9274899	GDNTOPCLASSINTHECLUB	0
8544678	뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아	0
6825595	지루하지는 않은데 완전 막장임... 돈주고 보기에는....	0
6723715	3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??	0
7898805	음악이 주가 된, 최고의 음악영화	1
6315043	진정한 쓰레기	0
6097171	마치 미국애니에서 튀어나온듯한 창의력없는 로봇디자인부터가,고개를 젖게한다	0
8932678	갈수록 개판되가는 중국영화 유치하고 내용없음 폼잡다 끝남 말도안되는 무기에 유치한cg남무 아 그립다 동사서독같은 영화가 이건 3류아류작이다	0


In [None]:
import pandas as pd
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
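# AdamW comes from torch.optim; the copy exported by transformers is deprecated
from torch.optim import AdamW
from transformers import AutoTokenizer, ElectraForSequenceClassification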
from tqdm.notebook import tqdm

In [None]:
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Build and load the Dataset

In [None]:
class NSMCDataset(Dataset):

  def __init__(self, csv_file):
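    # Some rows contain NaN values, so drop them
    self.dataset = pd.read_csv(csv_file, sep='\t').dropna(axis=0)
    # Remove duplicate reviews
    self.dataset.drop_duplicates(subset=['document'], inplace=True)
    # Load the tokenizer from the same checkpoint as the model (see "Create Model")
    self.tokenizer = AutoTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")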

    print(self.dataset.describe())

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    row = self.dataset.iloc[idx, 1:3].values
    text = row[0]
    y = row[1]

    inputs = self.tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        max_length=256,
        padding='max_length',
        add_special_tokens=True
        )

    input_ids = inputs['input_ids'][0]
    attention_mask = inputs['attention_mask'][0]

    return input_ids, attention_mask, y
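
Because every review is truncated or padded to the same `max_length`, the default `DataLoader` collation can stack the returned tensors into a batch. The sketch below illustrates the idea in plain Python. `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and `pad_id=0` is an assumption. It ignores the special-token handling that the real tokenizer performs.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: keep the first max_length IDs
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # right-pad with pad_id
    mask += [0] * n_pad                  # 0 marks padding the model should ignore
    return ids, mask


ids, mask = pad_or_truncate([2, 101, 102, 3], max_length=6)
print(ids)   # [2, 101, 102, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model distinguish real tokens from padding, which is why `__getitem__` returns it alongside `input_ids`.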

In [None]:
train_dataset = NSMCDataset("ratings_train.txt")
test_dataset = NSMCDataset("ratings_test.txt")

Downloading (…)okenizer_config.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/486 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/255k [00:00<?, ?B/s]

                 id          label
count  1.461820e+05  146182.000000
mean   6.779186e+06       0.498283
std    2.919223e+06       0.499999
min    3.300000e+01       0.000000
25%    4.814832e+06       0.000000
50%    7.581160e+06       0.000000
75%    9.274760e+06       1.000000
max    1.027815e+07       1.000000
                 id         label
count  4.915700e+04  49157.000000
mean   6.752945e+06      0.502695
std    2.937158e+06      0.499998
min    6.010000e+02      0.000000
25%    4.777143e+06      0.000000
50%    7.565415e+06      1.000000
75%    9.260204e+06      1.000000
max    1.027809e+07      1.000000
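
The row counts printed above are those left after the two cleaning steps in `NSMCDataset.__init__`: dropping rows with missing values and dropping repeated `document` texts. The same idea is sketched below with the standard library only. `load_clean_rows` is a hypothetical helper and the sample rows are invented for illustration. It approximates the pandas calls rather than reproducing them exactly.

```python
import csv
import io


def load_clean_rows(tsv_text):
    """Parse NSMC-style TSV text, dropping empty and duplicate documents."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    seen = set()
    rows = []
    for row in reader:
        doc = row.get("document") or ""
        if not doc or doc in seen:
            continue  # missing text, or a text that was already kept once
        seen.add(doc)
        rows.append((doc, int(row["label"])))
    return rows


sample = (
    "id\tdocument\tlabel\n"
    "1\tgood movie\t1\n"
    "2\t\t0\n"            # empty document, dropped
    "3\tgood movie\t1\n"  # duplicate document, dropped
    "4\tboring\t0\n"
)
print(load_clean_rows(sample))  # [('good movie', 1), ('boring', 0)]
```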


# Create Model

In [None]:
model = ElectraForSequenceClassification.from_pretrained("monologg/koelectra-base-v3-discriminator").to(device)

# Try a single forward pass
# text, attention_mask, y = train_dataset[0]
# model(text.unsqueeze(0).to(device), attention_mask=attention_mask.unsqueeze(0).to(device))

Downloading (…)lve/main/config.json:   0%|          | 0.00/467 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/452M [00:00<?, ?B/s]

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Inspect the model layers
model

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(35000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0-11): 12 x ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): L

# Training

In [None]:
epochs = 5
batch_size = 16

In [None]:
optimizer = AdamW(model.parameters(), lr=5e-6)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Evaluation does not depend on sample order, so the test set is not shuffled
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
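
As a quick check on the loader sizes, the number of batches per epoch is the dataset size divided by the batch size, rounded up. The arithmetic below uses the row counts printed by `describe()` earlier in this notebook. The variable names are illustrative.

```python
import math

batch_size = 16
train_rows = 146182  # row count printed by describe() for the training set
test_rows = 49157    # row count printed by describe() for the test set

train_batches = math.ceil(train_rows / batch_size)
test_batches = math.ceil(test_rows / batch_size)
print(train_batches, test_batches)  # 9137 3073
```

These values match the totals shown by the progress bars in the training and evaluation outputs.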



In [None]:
losses = []
accuracies = []

for i in range(epochs):
  total_loss = 0.0
  correct = 0
  total = 0
  batches = 0

  model.train()

  for input_ids_batch, attention_masks_batch, y_batch in tqdm(train_loader):
    optimizer.zero_grad()
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    loss = F.cross_entropy(y_pred, y_batch)
    loss.backward()
    optimizer.step()

    total_loss += loss.item()

    _, predicted = torch.max(y_pred, 1)
    correct += (predicted == y_batch).sum()
    total += len(y_batch)

    batches += 1
    # total_loss is the running sum over the epoch so far, not a single batch's loss
    if batches % 100 == 0:
      print("Batch Loss:", total_loss, "Accuracy:", correct.float() / total)

  losses.append(total_loss)
  accuracies.append(correct.float() / total)
  print("Train Loss:", total_loss, "Accuracy:", correct.float() / total)

  0%|          | 0/9137 [00:00<?, ?it/s]



Batch Loss: 69.21856212615967 Accuracy: tensor(0.5087, device='cuda:0')
Batch Loss: 138.46102076768875 Accuracy: tensor(0.5103, device='cuda:0')
Batch Loss: 207.39337146282196 Accuracy: tensor(0.5229, device='cuda:0')
Batch Loss: 275.8445584177971 Accuracy: tensor(0.5277, device='cuda:0')
Batch Loss: 342.3590739965439 Accuracy: tensor(0.5473, device='cuda:0')
Batch Loss: 407.29491740465164 Accuracy: tensor(0.5609, device='cuda:0')
Batch Loss: 472.2934465408325 Accuracy: tensor(0.5704, device='cuda:0')
Batch Loss: 535.0667796134949 Accuracy: tensor(0.5803, device='cuda:0')
Batch Loss: 596.7737513780594 Accuracy: tensor(0.5905, device='cuda:0')
Batch Loss: 658.1542251110077 Accuracy: tensor(0.5982, device='cuda:0')
Batch Loss: 716.7165642380714 Accuracy: tensor(0.6073, device='cuda:0')
Batch Loss: 774.9012801349163 Accuracy: tensor(0.6154, device='cuda:0')
Batch Loss: 835.2888276875019 Accuracy: tensor(0.6200, device='cuda:0')
Batch Loss: 890.8794000446796 Accuracy: tensor(0.6271, device

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 69.3638727068901 Accuracy: tensor(0.5050, device='cuda:0')
Batch Loss: 138.70262604951859 Accuracy: tensor(0.5041, device='cuda:0')
Batch Loss: 208.0130781531334 Accuracy: tensor(0.5023, device='cuda:0')
Batch Loss: 277.35013204813004 Accuracy: tensor(0.5033, device='cuda:0')
Batch Loss: 346.55958753824234 Accuracy: tensor(0.5068, device='cuda:0')
Batch Loss: 415.87459963560104 Accuracy: tensor(0.5077, device='cuda:0')
Batch Loss: 485.352485537529 Accuracy: tensor(0.5036, device='cuda:0')
Batch Loss: 554.7557857632637 Accuracy: tensor(0.5027, device='cuda:0')
Batch Loss: 624.2626613974571 Accuracy: tensor(0.5004, device='cuda:0')
Batch Loss: 693.7334807515144 Accuracy: tensor(0.4994, device='cuda:0')
Batch Loss: 763.1587573885918 Accuracy: tensor(0.4994, device='cuda:0')
Batch Loss: 832.7238510251045 Accuracy: tensor(0.4971, device='cuda:0')
Batch Loss: 902.07448387146 Accuracy: tensor(0.4963, device='cuda:0')
Batch Loss: 971.3378946185112 Accuracy: tensor(0.4980, device='c

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 69.37089890241623 Accuracy: tensor(0.4975, device='cuda:0')
Batch Loss: 138.76128405332565 Accuracy: tensor(0.4919, device='cuda:0')
Batch Loss: 208.17696350812912 Accuracy: tensor(0.4940, device='cuda:0')
Batch Loss: 277.56019085645676 Accuracy: tensor(0.4964, device='cuda:0')
Batch Loss: 346.9973255991936 Accuracy: tensor(0.4924, device='cuda:0')
Batch Loss: 416.16893243789673 Accuracy: tensor(0.4975, device='cuda:0')
Batch Loss: 485.64747619628906 Accuracy: tensor(0.4968, device='cuda:0')
Batch Loss: 554.964783847332 Accuracy: tensor(0.4984, device='cuda:0')
Batch Loss: 624.312117099762 Accuracy: tensor(0.5002, device='cuda:0')
Batch Loss: 693.6667382121086 Accuracy: tensor(0.5009, device='cuda:0')
Batch Loss: 763.0274613499641 Accuracy: tensor(0.4999, device='cuda:0')
Batch Loss: 832.4396933317184 Accuracy: tensor(0.4989, device='cuda:0')
Batch Loss: 901.7402198314667 Accuracy: tensor(0.4994, device='cuda:0')
Batch Loss: 971.0797606110573 Accuracy: tensor(0.4986, device

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 69.30511689186096 Accuracy: tensor(0.5150, device='cuda:0')
Batch Loss: 138.6193505525589 Accuracy: tensor(0.5144, device='cuda:0')
Batch Loss: 207.98357212543488 Accuracy: tensor(0.5115, device='cuda:0')
Batch Loss: 277.3250471353531 Accuracy: tensor(0.5125, device='cuda:0')
Batch Loss: 346.81429493427277 Accuracy: tensor(0.5044, device='cuda:0')
Batch Loss: 416.1018435359001 Accuracy: tensor(0.5055, device='cuda:0')
Batch Loss: 485.5207711458206 Accuracy: tensor(0.5038, device='cuda:0')
Batch Loss: 554.8028053641319 Accuracy: tensor(0.5052, device='cuda:0')
Batch Loss: 624.1958383917809 Accuracy: tensor(0.5048, device='cuda:0')
Batch Loss: 693.5065541863441 Accuracy: tensor(0.5035, device='cuda:0')
Batch Loss: 762.8719091415405 Accuracy: tensor(0.5020, device='cuda:0')
Batch Loss: 832.1999333500862 Accuracy: tensor(0.5016, device='cuda:0')
Batch Loss: 901.6009470820427 Accuracy: tensor(0.5019, device='cuda:0')
Batch Loss: 970.9534012675285 Accuracy: tensor(0.5013, device=

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 69.35871714353561 Accuracy: tensor(0.4944, device='cuda:0')
Batch Loss: 138.7620500922203 Accuracy: tensor(0.4900, device='cuda:0')
Batch Loss: 208.16494888067245 Accuracy: tensor(0.4877, device='cuda:0')
Batch Loss: 277.54189002513885 Accuracy: tensor(0.4889, device='cuda:0')
Batch Loss: 346.82752829790115 Accuracy: tensor(0.4953, device='cuda:0')
Batch Loss: 416.1836247444153 Accuracy: tensor(0.4957, device='cuda:0')
Batch Loss: 485.6012945175171 Accuracy: tensor(0.4946, device='cuda:0')
Batch Loss: 554.9169694185257 Accuracy: tensor(0.4954, device='cuda:0')
Batch Loss: 624.2844794988632 Accuracy: tensor(0.4951, device='cuda:0')
Batch Loss: 693.6057306528091 Accuracy: tensor(0.4964, device='cuda:0')
Batch Loss: 762.9285252094269 Accuracy: tensor(0.4972, device='cuda:0')
Batch Loss: 832.1321687102318 Accuracy: tensor(0.5001, device='cuda:0')
Batch Loss: 901.395326256752 Accuracy: tensor(0.5017, device='cuda:0')
Batch Loss: 970.8178253173828 Accuracy: tensor(0.5017, device=

In [None]:
losses, accuracies

([5874.157319366932,
  6339.001118481159,
  6337.42215770483,
  6334.987120389938,
  6336.209636032581],
 [tensor(0.5827, device='cuda:0'),
  tensor(0.4995, device='cuda:0'),
  tensor(0.4991, device='cuda:0'),
  tensor(0.5039, device='cuda:0'),
  tensor(0.5001, device='cuda:0')])
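
The epoch losses above are sums over batches, so they are easier to read as a per-batch mean. For two classes, a model that outputs a 50/50 guess has a cross-entropy of ln 2, about 0.693. The sketch below divides each logged total by the 9137 batches shown in the progress bar. It is a reading aid for the numbers above and assumes each batch contributes its mean loss, which is the default reduction of `F.cross_entropy`.

```python
import math

batches_per_epoch = 9137  # total shown by the training progress bar
epoch_losses = [
    5874.157319366932,
    6339.001118481159,
    6337.42215770483,
    6334.987120389938,
    6336.209636032581,
]

chance_loss = math.log(2)  # cross-entropy of a 50/50 guess on two classes
mean_losses = [total / batches_per_epoch for total in epoch_losses]

print(f"chance-level loss: {chance_loss:.4f}")
for epoch, mean_loss in enumerate(mean_losses, start=1):
    print(f"epoch {epoch}: mean batch loss {mean_loss:.4f}")
```

The first epoch averages roughly 0.643, while epochs 2 to 5 sit at roughly 0.693, which is chance level. This suggests the model stopped learning after the first epoch, in line with the accuracies near 0.50.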

Check accuracy on the test dataset

In [None]:
model.eval()

test_correct = 0
test_total = 0

# Gradients are not needed for evaluation, so disable them to save memory and time
with torch.no_grad():
  for input_ids_batch, attention_masks_batch, y_batch in tqdm(test_loader):
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    _, predicted = torch.max(y_pred, 1)
    test_correct += (predicted == y_batch).sum()
    test_total += len(y_batch)

print("Accuracy:", test_correct.float() / test_total)

  0%|          | 0/3073 [00:00<?, ?it/s]

Accuracy: tensor(0.5027, device='cuda:0')
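
One way to put this number in context is to compare it with a majority-class baseline, a classifier that always predicts the more frequent label. The sketch below uses the mean of `label` printed by `describe()` for the test set. The variable names are illustrative.

```python
test_positive_rate = 0.502695  # mean of `label` from describe() on the test set
reported_accuracy = 0.5027     # accuracy printed by the evaluation cell

# Accuracy of always predicting the more frequent class
majority_baseline = max(test_positive_rate, 1 - test_positive_rate)
print(round(majority_baseline, 4))                        # 0.5027
print(abs(majority_baseline - reported_accuracy) < 1e-4)  # True
```

The match is consistent with the model predicting label 1 for every review, though confirming that would require counting the predicted labels directly.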


In [None]:
# Save the model
torch.save(model.state_dict(), "model.pt")

In [None]:
model.load_state_dict(torch.load("model.pt"))

<All keys matched successfully>