<a href="https://colab.research.google.com/github/tedsong3170/nlp/blob/main/eng.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers
!pip install torch

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/50/0c/7d5950fcd80b029be0a8891727ba21e0cd27692c407c51261c3c921f6da3/transformers-4.1.1-py3-none-any.whl (1.5MB)
[K     |████████████████████████████████| 1.5MB 15.9MB/s 
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 43.2MB/s 
[?25hCollecting tokenizers==0.9.4
[?25l  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
[K     |████████████████████████████████| 2.9MB 52.2MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=1fbe075ff9d4

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import tensorflow as tf
import torch

from transformers import ElectraTokenizer
from transformers import ElectraForSequenceClassification, AdamW
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import pandas as pd
import numpy as np
import random
import time
import datetime
import os.path
import json

# 입력 토큰의 최대 시퀀스 길이
MAX_LEN = 85
# 배치 사이즈
BATCH_SIZE = 32
TRAIN_PERCENT = 3e-5
EPSILON = 1e-8
# 에폭수
EPOCHS = 30


class EngSentimentAnalyzer:
    train = None
    test = None
    device = None
    model = None
    pretrainedModelPath = None

    def __init__(self, pretrainedModelPath=None):
        
        self.pretrainedModelPath = pretrainedModelPath
        # 디바이스 설정
        if torch.cuda.is_available():
            # GPU 디바이스 이름 구함
            device_name = tf.test.gpu_device_name()

            # GPU 디바이스 이름 검사
            if device_name == '/device:GPU:0':
                print('Found GPU at: {}'.format(device_name))
            else:
                raise SystemError('GPU device not found')

            self.device = torch.device("cuda")
            print('There are %d GPU(s) available.' % torch.cuda.device_count())
            print('We will use the GPU:', torch.cuda.get_device_name(0))
        else:
            self.device = torch.device("cpu")
            print('No GPU available, using the CPU instead.')

        # 분류를 위한 BERT 모델 생성
        if self.pretrainedModelPath is not None:
            if os.path.isdir(self.pretrainedModelPath) is True:
                self.model = ElectraForSequenceClassification.from_pretrained(self.pretrainedModelPath, num_labels=8)
                print("pretrained Model loaded")
            else:
                self.model = ElectraForSequenceClassification.from_pretrained("google/electra-small-generator", num_labels=8)
        else:
            self.model = ElectraForSequenceClassification.from_pretrained("google/electra-small-generator", num_labels=8)

        if torch.cuda.is_available():
            self.model.cuda()

    def loadJsonFile(self, path):
        with open(path, encoding='utf-8', mode='r') as f:
            data = json.load(f)
        
        df = pd.DataFrame.from_dict(data[0])

        is_first = True
        for array in data:
            if is_first:
                is_first = False
                continue
            
            temp_df = pd.DataFrame.from_dict(array)
            df = df.append(temp_df, ignore_index = True)

        return df

    def getInputsAndLabels(self, dataset):
        data = dataset.copy(deep=True)
        #data['utterance'] = data['utterance'].str.lower()

        utterances = data['utterance']
        utterances = ["[CLS] " + str(utterance) + " [SEP]" for utterance in utterances]

        encoder = LabelEncoder()
        labels = data['emotion'].values
        encoder.fit(labels)
        labels = encoder.transform(labels)

        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
        tokenized_texts = [tokenizer.tokenize(utterance) for utterance in utterances]

        input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
        input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

        attention_masks = []
        for seq in input_ids:
            seq_mask = [float(i>0) for i in seq]
            attention_masks.append(seq_mask)

        return input_ids, labels, attention_masks

    def getInputsFromTest(self, dataset):
        data = dataset.copy(deep=True)
        #data['utterance'] = data['utterance'].str.lower()

        utterances = data['utterance']
        utterances = ["[CLS] " + str(utterance) + " [SEP]" for utterance in utterances]
        
        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
        tokenized_texts = [tokenizer.tokenize(utterance) for utterance in utterances]

        input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
        input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

        attention_masks = []
        for seq in input_ids:
            seq_mask = [float(i>0) for i in seq]
            attention_masks.append(seq_mask)

        return input_ids, attention_masks

    def getIndex(self, dataset):
        data = dataset.copy(deep = True)
        input_index = data.id.tolist()
        return torch.tensor(input_index)

    def preprocess(self, target, targetPath=None):

        self.train = self.loadJsonFile('/content/gdrive/MyDrive/Friends/friends_train.json')
        self.dev = self.loadJsonFile('/content/gdrive/MyDrive/Friends/friends_dev.json')
        self.test = self.loadJsonFile('/content/gdrive/MyDrive/Friends/friends_test.json')

        train_inputs, train_labels, train_masks = self.getInputsAndLabels(self.train)
        dev_inputs, dev_labels, dev_masks = self.getInputsAndLabels(self.dev)
        test_inputs, test_masks = self.getInputsFromTest(self.test)

        if target == "train":
            # 데이터를 파이토치의 텐서로 변환
            train_inputs = torch.tensor(train_inputs)
            train_labels = torch.tensor(train_labels)
            train_masks = torch.tensor(train_masks)

            dev_inputs = torch.tensor(dev_inputs)
            dev_labels = torch.tensor(dev_labels)
            dev_masks = torch.tensor(dev_masks)

            # 파이토치의 DataLoader로 입력, 마스크, 라벨을 묶어 데이터 설정
            # 학습시 배치 사이즈 만큼 데이터를 가져옴
            train_data = TensorDataset(train_inputs, train_masks, train_labels)
            train_sampler = RandomSampler(train_data)
            train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

            dev_data = TensorDataset(dev_inputs, dev_masks, dev_labels)
            dev_sampler = SequentialSampler(dev_data)
            dev_dataloader = DataLoader(dev_data, sampler=dev_sampler, batch_size=BATCH_SIZE)

            return train_dataloader, dev_dataloader
        elif target == "test":
            # 데이터를 파이토치의 텐서로 변환
            test_index = getIndex(test)
            test_inputs = torch.tensor(test_inputs)
            test_masks = torch.tensor(test_masks)

            # 파이토치의 DataLoader로 입력, 마스크, 라벨을 묶어 데이터 설정
            # 학습시 배치 사이즈 만큼 데이터를 가져옴
            test_data = TensorDataset(test_inputs, test_masks, test_labels)
            test_sampler = RandomSampler(test_data)
            test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)

            return test_dataloader

    def makeModel(self, trainDataloader, validationDataloader):
        # 옵티마이저 설정
        optimizer = AdamW(self.model.parameters(), lr=TRAIN_PERCENT, eps=EPSILON)

        # 총 훈련 스텝 : 배치반복 횟수 * 에폭
        total_steps = len(trainDataloader) * EPOCHS

        # 처음에 학습률을 조금씩 변화시키는 스케줄러 생성
        scheduler = get_linear_schedule_with_warmup(optimizer,
                                                    num_warmup_steps=0,
                                                    num_training_steps=total_steps)

        # 재현을 위해 랜덤시드 고정
        seed_val = 42
        random.seed(seed_val)
        np.random.seed(seed_val)
        torch.manual_seed(seed_val)
        torch.cuda.manual_seed_all(seed_val)

        # 그래디언트 초기화
        self.model.zero_grad()

        # 에폭만큼 반복
        for epoch_i in range(0, EPOCHS):

            # ========================================
            #               Training
            # ========================================

            print("")
            print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, EPOCHS))
            print('Training...')

            # 시작 시간 설정
            t0 = time.time()

            # 로스 초기화
            total_loss = 0

            # 훈련모드로 변경
            self.model.train()

            # 데이터로더에서 배치만큼 반복하여 가져옴
            for step, batch in enumerate(trainDataloader):
                # 경과 정보 표시
                if step % 500 == 0 and not step == 0:
                    elapsed = self.format_time(time.time() - t0)
                    print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(trainDataloader), elapsed))

                # 배치를 GPU에 넣음
                batch = tuple(t.to(self.device) for t in batch)

                # 배치에서 데이터 추출
                b_input_ids, b_input_mask, b_labels = batch

                # Forward 수행
                outputs = self.model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

                # 로스 구함
                loss = outputs[0]

                # 총 로스 계산
                total_loss += loss.item()

                # Backward 수행으로 그래디언트 계산
                loss.backward()

                # 그래디언트 클리핑
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)

                # 그래디언트를 통해 가중치 파라미터 업데이트
                optimizer.step()

                # 스케줄러로 학습률 감소
                scheduler.step()

                # 그래디언트 초기화
                self.model.zero_grad()

            # 평균 로스 계산
            avg_train_loss = total_loss / len(trainDataloader)

            print("")
            print("  Average training loss: {0:.2f}".format(avg_train_loss))
            print("  Training epcoh took: {:}".format(self.format_time(time.time() - t0)))

            print("")
            print("Running Validation...")

            # 시작 시간 설정
            t0 = time.time()

            # 평가모드로 변경
            self.model.eval()

            # 변수 초기화
            eval_loss, eval_accuracy = 0, 0
            nb_eval_steps, nb_eval_examples = 0, 0

            # 데이터로더에서 배치만큼 반복하여 가져옴
            for batch in validationDataloader:
                # 배치를 GPU에 넣음
                batch = tuple(t.to(self.device) for t in batch)

                # 배치에서 데이터 추출
                b_input_ids, b_input_mask, b_labels = batch

                # 그래디언트 계산 안함
                with torch.no_grad():
                    # Forward 수행
                    outputs = self.model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

                # 로스 구함
                logits = outputs[0]

                # CPU로 데이터 이동
                logits = logits.detach().cpu().numpy()
                label_ids = b_labels.to('cpu').numpy()

                # 출력 로짓과 라벨을 비교하여 정확도 계산
                tmp_eval_accuracy = self.flat_accuracy(logits, label_ids)
                eval_accuracy += tmp_eval_accuracy
                nb_eval_steps += 1

            print("  Accuracy: {0:.2f}".format(eval_accuracy / nb_eval_steps))
            print("  Validation took: {:}".format(self.format_time(time.time() - t0)))

            self.saveModel(F'ver5_{epoch_i}')

        print("")
        print("Training complete!")

    def testModel(self, test_dataLoader):
        # 시작 시간 설정
        t0 = time.time()

        # 평가모드로 변경
        self.model.eval()

        # 변수 초기화
        eval_loss, eval_accuracy = 0, 0
        nb_eval_steps, nb_eval_examples = 0, 0

        # 데이터로더에서 배치만큼 반복하여 가져옴
        for step, batch in enumerate(test_dataLoader):
            # 경과 정보 표시
            if step % 100 == 0 and not step == 0:
                elapsed = self.format_time(time.time() - t0)
                print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(test_dataLoader), elapsed))

            # 배치를 GPU에 넣음
            batch = tuple(t.to(self.device) for t in batch)

            # 배치에서 데이터 추출
            b_input_ids, b_input_mask, b_labels = batch

            # 그래디언트 계산 안함
            with torch.no_grad():
                # Forward 수행
                outputs = self.model(b_input_ids,
                                     token_type_ids=None,
                                     attention_mask=b_input_mask)

            # 로스 구함
            logits = outputs[0]

            # CPU로 데이터 이동
            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()

            # 출력 로짓과 라벨을 비교하여 정확도 계산
            tmp_eval_accuracy = self.flat_accuracy(logits, label_ids)
            eval_accuracy += tmp_eval_accuracy
            nb_eval_steps += 1

        print("")
        print("Accuracy: {0:.2f}".format(eval_accuracy / nb_eval_steps))
        print("Test took: {:}".format(self.format_time(time.time() - t0)))

    def predict(self, inputPath, outputPath):
        predict = pd.read_csv(inputPath, encoding = 'utf-8')
        test_inputs, test_masks = self.getInputsFromTest(predict)
        
        tmp = predict.copy(deep = True)
        test_index = torch.tensor(tmp.id.tolist())
        test_inputs = torch.tensor(test_inputs)
        test_masks = torch.tensor(test_masks)

        test_data = TensorDataset(test_index, test_inputs, test_masks)
        test_sampler = RandomSampler(test_data)
        test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)


        # 평가모드로 변경
        self.model.eval()
        
        tmp_test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=1)
        test_result = predict.copy(deep = True)
        test_result = test_result.drop(columns = ['i_dialog', 'i_utterance', 'speaker'])
        test_result['Predicted'] = 'default'

        encoder = LabelEncoder()
        labels = self.train['emotion'].values
        encoder.fit(labels)
        labels = encoder.transform(labels)


        for step, batch in enumerate(tmp_test_dataloader):
            # 배치를 GPU에 넣음
            batch = tuple(t.to(self.device) for t in batch)
            
            # 배치에서 데이터 추출
            b_index, b_input_ids, b_input_mask = batch
            
            # 그래디언트 계산 안함
            with torch.no_grad():     
                # Forward 수행
                outputs = self.model(b_input_ids, 
                                token_type_ids=None, 
                                attention_mask=b_input_mask)
            
            # 로스 구함
            logits = outputs[0]

            # CPU로 데이터 이동
            logits = logits.detach().cpu().numpy()
            idx = b_index.item()
            test_result['Predicted'][idx] = encoder.classes_[np.argmax(logits)]

        test_result = test_result.drop(columns = ['utterance'])
        test_result.to_csv(outputPath, index=False)

    # 정확도 계산 함수
    def flat_accuracy(self, preds, labels):
        pred_flat = np.argmax(preds, axis=1).flatten()
        labels_flat = labels.flatten()

        return np.sum(pred_flat == labels_flat) / len(labels_flat)

    # 시간 표시 함수
    def format_time(self, elapsed):
        # 반올림
        elapsed_rounded = int(round((elapsed)))

        # hh:mm:ss으로 형태 변경
        return str(datetime.timedelta(seconds=elapsed_rounded))

    def saveModel(self, path):
        if os.path.isdir('/content/gdrive/MyDrive/'):
            a.model.save_pretrained(F'/content/gdrive/MyDrive/{path}')




In [4]:
a = EngSentimentAnalyzer()
train_dataloader, dev_dataloader = a.preprocess("train")
a.makeModel(train_dataloader, dev_dataloader)

Found GPU at: /device:GPU:0
There are 1 GPU(s) available.
We will use the GPU: Tesla T4


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=463.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=54236116.0, style=ProgressStyle(descrip…




Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraForSequenceClassification: ['generator_predictions.LayerNorm.weight', 'generator_predictions.LayerNorm.bias', 'generator_predictions.dense.weight', 'generator_predictions.dense.bias', 'generator_lm_head.weight', 'generator_lm_head.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-generator and are newly initializ

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…



Training...

  Average training loss: 1.46
  Training epcoh took: 0:00:34

Running Validation...
  Accuracy: 0.53
  Validation took: 0:00:01

Training...

  Average training loss: 1.22
  Training epcoh took: 0:00:35

Running Validation...
  Accuracy: 0.55
  Validation took: 0:00:01

Training...

  Average training loss: 1.12
  Training epcoh took: 0:00:36

Running Validation...
  Accuracy: 0.56
  Validation took: 0:00:01

Training...

  Average training loss: 1.06
  Training epcoh took: 0:00:35

Running Validation...
  Accuracy: 0.56
  Validation took: 0:00:01

Training...

  Average training loss: 0.99
  Training epcoh took: 0:00:35

Running Validation...
  Accuracy: 0.55
  Validation took: 0:00:01

Training...

  Average training loss: 0.94
  Training epcoh took: 0:00:35

Running Validation...
  Accuracy: 0.55
  Validation took: 0:00:01

Training...

  Average training loss: 0.87
  Training epcoh took: 0:00:35

Running Validation...
  Accuracy: 0.55
  Validation took: 0:00:01

Trai

In [None]:
a = EngSentimentAnalyzer('/content/gdrive/MyDrive/ver4_3/')
train_dataloader, dev_dataloader = a.preprocess("train")
a.predict('/content/gdrive/MyDrive/en_data.csv', '/content/gdrive/MyDrive/result_eng_4_3.csv')

Found GPU at: /device:GPU:0
There are 1 GPU(s) available.
We will use the GPU: Tesla T4
pretrained Model loaded


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
