<a href="https://colab.research.google.com/github/smartkorea/nlp/blob/main/eng/sa_eng.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Preparation**

Mount Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Create the folder for English emotion analysis.

In [None]:
mkdir /content/gdrive/MyDrive/eng_sa

mkdir: cannot create directory ‘/content/gdrive/MyDrive/eng_sa’: File exists
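
The `File exists` message above is harmless when the folder was already created in an earlier session. A minimal sketch of an idempotent alternative, assuming Python's standard `os.makedirs` is acceptable in place of the shell command; the temporary directory is only a stand-in for `/content/gdrive/MyDrive` so that the snippet runs outside Colab:

```python
import os
import tempfile

# Stand-in for /content/gdrive/MyDrive (assumption: used only so the sketch runs anywhere).
base_dir = tempfile.mkdtemp()
target_dir = os.path.join(base_dir, "eng_sa")

# exist_ok=True makes the call safe to repeat: no error if the folder already exists.
os.makedirs(target_dir, exist_ok=True)
os.makedirs(target_dir, exist_ok=True)

folder_exists = os.path.isdir(target_dir)
print(folder_exists)
```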


Change into the English emotion analysis folder.

In [None]:
cd /content/gdrive/MyDrive/eng_sa

/content/gdrive/MyDrive/eng_sa


Upload the training data (friends_train.json) and the test data (en_data.csv).

In [None]:
from google.colab import files
file_uploaded = files.upload()

Import the required packages.

In [None]:
import torch #the pytorch library, used for modeling and formatting our data to be compatible in a pytorch environment
import pandas as pd #for dataframe reading, cleaning functions
from tqdm.notebook import tqdm #used as a progress bar
import json

In [None]:
!head -n 10 friends_train.json

[
    [
        {
            "speaker": "Chandler",
            "utterance": "also I was the point person on my company\u0092s transition from the KL-5 to GR-6 system.",
            "emotion": "neutral",
            "annotation": "4100000"
        },
        {
            "speaker": "The Interviewer",


Read the training data and build a DataFrame from it.

In [None]:
with open('./friends_train.json', encoding='utf-8') as json_file:
    json_train = json.load(json_file)

# Flatten the nested dialogues into one row per utterance. Collecting the rows in a list
# and building the DataFrame once avoids DataFrame.append, which was removed in pandas 2.0.
records = [
    (row['utterance'], row['emotion'])
    for rows in json_train
    for row in rows
]

df_train = pd.DataFrame(records, columns=['utterance', 'emotion'])
df_train.index.name = 'idx'

df_train['emotion'].value_counts()
    

neutral        4752
non-neutral    2017
joy            1283
surprise       1220
anger           513
sadness         351
disgust         240
fear            185
Name: emotion, dtype: int64
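
The file is a list of dialogues, each of which is a list of utterance records, so building the table means flattening two levels. A minimal sketch of that flattening in plain Python, assuming a tiny inline sample: the utterances and emotions are copied from rows shown in this notebook, while the grouping into dialogues and the `speaker` placeholders are made up for the example.

```python
import json
from collections import Counter

# Tiny stand-in for friends_train.json (grouping and speakers are assumptions).
sample_json = """
[
  [
    {"speaker": "A", "utterance": "Really?!", "emotion": "surprise"},
    {"speaker": "B", "utterance": "Absolutely. You can relax; you did great.", "emotion": "neutral"}
  ],
  [
    {"speaker": "A", "utterance": "Ok!", "emotion": "neutral"}
  ]
]
"""

dialogues = json.loads(sample_json)

# Flatten dialogue -> utterance into one row per utterance, with a running index.
rows = [
    {"idx": i, "utterance": record["utterance"], "emotion": record["emotion"]}
    for i, record in enumerate(r for dialogue in dialogues for r in dialogue)
]

# Plain-Python equivalent of value_counts() on the emotion column.
emotion_counts = Counter(row["emotion"] for row in rows)
print(rows)
print(emotion_counts)
```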

In [None]:
df_train[10:20]

idx,utterance,emotion
10,No dont I beg of you!,fear
11,"All right then, well have a definite answer f...",neutral
12,Really?!,surprise
13,Absolutely. You can relax; you did great.,neutral
14,But then who? The waitress I went out with las...,surprise
15,You know? Forget it!,non-neutral
16,"No-no-no-no, no! Who, who were you talking about?",surprise
17,"No, I-I-I-I don't, I actually don't know",non-neutral
18,Ok!,neutral
19,"All right, well...",neutral


The value counts above list the emotion categories and the number of utterances in each.

# **2. Preprocessing**

The `contractions` package expands contracted words into their full forms.

In [None]:
!pip install contractions 



In [None]:
import contractions 
contractions.fix("im hungry and its cold yall")

df_train['utterance'] = df_train['utterance'].apply(lambda utterance: contractions.fix(utterance))

df_train[10:20]

idx,utterance,emotion
10,No dont I beg of you!,fear
11,"All right then, well have a definite answer f...",neutral
12,Really?!,surprise
13,Absolutely. You can relax; you did great.,neutral
14,But then who? The waitress I went out with las...,surprise
15,You know? Forget it!,non-neutral
16,"No-no-no-no, no! Who, who were you talking about?",surprise
17,"No, I-I-I-I do not, I actually do not know",non-neutral
18,Ok!,neutral
19,"All right, well...",neutral
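
To make the idea of contraction expansion concrete, here is a minimal sketch built on a small lookup table and one regular expression. It is not how the `contractions` package is implemented; `CONTRACTION_MAP` and `expand_contractions` are hypothetical names, and the table is deliberately tiny.

```python
import re

# Hypothetical, deliberately small lookup table; the real package covers far more cases.
CONTRACTION_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "i'm": "I am",
}

# One alternation over all known contractions, matched on word boundaries.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _pattern.sub(lambda match: CONTRACTION_MAP[match.group(0).lower()], text)

expanded = expand_contractions("No, I-I-I-I don't, I actually don't know")
print(expanded)
```

The sample sentence is row 17 of the training data, and the result matches the expanded row in the table above.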


Preprocess the utterances by removing unneeded parts such as HTML tags.

In [None]:
from bs4 import BeautifulSoup # a library for parsing HTML
import string
import unicodedata
import re

# remove HTML tags
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")    
    return soup.get_text().replace("\n", "")

# we then apply the function for removing HTML Tags
df_train['utterance'] = df_train['utterance'].apply(strip_html_tags)

# normalise accented characters i.e. convert à to a
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8','ignore')
    return text

df_train['utterance'] = df_train['utterance'].apply(remove_accented_chars)

#remove @name mentions and urls from an utterance
def remove_mentions_and_urls(text):
  text = re.sub(r'(@[A-Za-z0-9]+)|(\w+://\S+)|(www\.[A-Za-z0-9]+\.[A-Za-z0-9]+)', ' ', text)
  return text

df_train['utterance'] = df_train['utterance'].apply(remove_mentions_and_urls)

#remove punctuations except '?' and '!' and '.'
def remove_punctuation(text):
    text = re.sub(r'[\'\"\\\/\,#]', '', text)
    text = re.sub(r'[^\w\s\?\!\.]', ' ', text)
    return text

df_train['utterance'] = df_train['utterance'].apply(remove_punctuation)

#remove multiple '.', keep just one
def remove_excess_fullstops(text):
  text = re.sub(r'\.{2,}', '.', text)
  return text

df_train['utterance'] = df_train['utterance'].apply(remove_excess_fullstops)

#remove excess and trailing/leading whitespace
def remove_excess_whitespace(text):
  text = re.sub(r'\s{2,}', ' ', text).strip()
  return text

df_train['utterance'] = df_train['utterance'].apply(remove_excess_whitespace)




Check the preprocessed utterances.

In [None]:
df_train.utterance[10:20]

idx
10                                No dont I beg of you!
11    All right then well have a definite answer for...
12                                             Really?!
13             Absolutely. You can relax you did great.
14    But then who? The waitress I went out with las...
15                                 You know? Forget it!
16      No no no no no! Who who were you talking about?
17             No I I I I do not I actually do not know
18                                                  Ok!
19                                      All right well.
Name: utterance, dtype: object
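
The effect of the punctuation, full-stop and whitespace steps is easiest to see on single strings. A minimal sketch using only the standard library, with the same three steps written as equivalent regular expressions; the `clean` helper is a hypothetical convenience wrapper, and the HTML, accent and URL steps are left out for brevity.

```python
import re

def remove_punctuation(text):
    # Drop quotes, slashes, commas and '#' outright, then turn any other
    # punctuation except '?', '!' and '.' into a space.
    text = re.sub(r"['\"\\/,#]", "", text)
    return re.sub(r"[^\w\s?!.]", " ", text)

def remove_excess_fullstops(text):
    # Collapse runs of two or more full stops into one.
    return re.sub(r"\.{2,}", ".", text)

def remove_excess_whitespace(text):
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r"\s{2,}", " ", text).strip()

def clean(text):
    """Apply the three steps in the same order as the preprocessing cell."""
    for step in (remove_punctuation, remove_excess_fullstops, remove_excess_whitespace):
        text = step(text)
    return text

cleaned_16 = clean("No-no-no-no, no! Who, who were you talking about?")
cleaned_19 = clean("All right, well...")
print(cleaned_16)
print(cleaned_19)
```

Both inputs are rows 16 and 19 of the training data before cleaning, and the results agree with the preprocessed output listed above.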

Check the emotion categories and their counts in the training data.

In [None]:
df_train.emotion.value_counts()

neutral        4752
non-neutral    2017
joy            1283
surprise       1220
anger           513
sadness         351
disgust         240
fear            185
Name: emotion, dtype: int64

Convert the emotion labels to numbers.

In [None]:
#importing modules for splitting the data-set
from sklearn.model_selection import train_test_split

#now we create a list of unique emotion labels
possible_labels = df_train['emotion'].unique()

#we convert those labels to numbers, for use in our algorithm later on
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

label_dict

{'anger': 6,
 'disgust': 7,
 'fear': 2,
 'joy': 4,
 'neutral': 0,
 'non-neutral': 3,
 'sadness': 5,
 'surprise': 1}
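
The numbering follows the order in which each emotion first appears in the data, which is why the printed dictionary looks unordered. A minimal sketch of the same mapping in plain Python; the order of first appearance used here is inferred from the `label_dict` output above rather than read from the data file.

```python
# Emotions in order of first appearance (inferred from the label_dict output above).
emotions_in_order = [
    "neutral", "surprise", "fear", "non-neutral",
    "joy", "sadness", "anger", "disgust",
    "neutral", "joy",  # repeats are ignored, as with unique()
]

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order.
unique_emotions = list(dict.fromkeys(emotions_in_order))

# Forward mapping (emotion -> id) and inverse mapping (id -> emotion).
label_map = {emotion: index for index, emotion in enumerate(unique_emotions)}
label_map_inverse = {index: emotion for emotion, index in label_map.items()}
print(label_map)
```

Because the IDs depend on row order, the same dictionary has to be reused at prediction time, which is why it is hard-coded again in the testing section.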

Add the numeric emotion values from above to the training data as a `label` column.

In [None]:
#We add a new column to our original data-frame, of numbers corresponding to each emotion label
df_train['label'] = [label_dict[str_label] for str_label in df_train['emotion']]

df_train.sample(5) 

idx,utterance,emotion,label
5436,No theyre theyre all true.,neutral,0
1158,No its not too soon I had lunch at a eleven.,neutral,0
6250,And and another one had never been to a party ...,surprise,1
5372,Well I was hoping after tonight that maybe I c...,neutral,0
8152,Oh my God! Look at this!,surprise,1


Split the training data into a training set and a validation set.

In [None]:
x_train, x_val, y_train, y_val = train_test_split(
    df_train.index.values,
    df_train['label'],
    test_size = 0.20, #let's do 80-20 train-validation split
    random_state=17, #reproducible between my instance and whoever wants to reproduce
    stratify= df_train['label'].values #the command for stratification
)

df_train['data_type'] = ['not_set']*df_train.shape[0]

In [None]:
#if id of sample exists in x_train, make it 'train', otherwise existing in x_val, make it 'val'
df_train.loc[x_train, 'data_type'] = 'train'
df_train.loc[x_val, 'data_type'] = 'val'

#check stratification of training and validation data-sets
df_train.groupby(['emotion', 'data_type'])['utterance'].count()

emotion      data_type
anger        train         410
             val           103
disgust      train         192
             val            48
fear         train         148
             val            37
joy          train        1026
             val           257
neutral      train        3801
             val           951
non-neutral  train        1614
             val           403
sadness      train         281
             val            70
surprise     train         976
             val           244
Name: utterance, dtype: int64
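
Stratification means that every class gives up roughly the same fraction of its rows to the validation set, which is why each `val` count above is about 20% of its class. A minimal sketch of the idea in plain Python; `stratified_split` is a hypothetical helper and the labels are toy data. scikit-learn's `train_test_split(stratify=...)` uses its own allocation and shuffling, so it will not pick the same indices.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed):
    """Return (train_indices, val_indices), sampling test_fraction of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_indices, val_indices = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_val = round(len(indices) * test_fraction)
        val_indices.extend(indices[:n_val])       # first n_val rows go to validation
        train_indices.extend(indices[n_val:])     # the rest stay in training
    return sorted(train_indices), sorted(val_indices)

# Toy labels with a 4:1 class imbalance (made-up data).
toy_labels = ["neutral"] * 40 + ["fear"] * 10
train_idx, val_idx = stratified_split(toy_labels, test_fraction=0.20, seed=17)

val_fear = sum(1 for i in val_idx if toy_labels[i] == "fear")
print(len(train_idx), len(val_idx), val_fear)
```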

Load the tokenizer of the pre-trained BERT model.

In [None]:
!pip install transformers==3.0.0

from transformers import BertTokenizer
from torch.utils.data import TensorDataset #setting up our dataset so it's usable in a pytorch environment

#set up a tokenizer object, using pre-trained BERT's own tokenizer
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased', #we ask the tokenizer to lowercase our sentences
    do_lower_case=True
)

#getting the maximum tokenised length out of utterances in our training data-set
max_len = 0

# For every sentence...
for sent in df_train['utterance']:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(sent, add_special_tokens=True)

    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

Max sentence length:  76


Encode the training data and the validation data separately.

In [None]:
#encode our training and validation data-sets with the tokenizer above
encoded_data_train = tokenizer.batch_encode_plus(
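    #note: this encodes every row of df_train, validation rows included, so the
    #validation scores below are measured on data the model also trains on; use
    #df_train[df_train.data_type=='train'] here and for labels_train to hold them out
    df_train.utterance.values,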
    add_special_tokens=True, #add the CLS and SEP tokens
    truncation=True,
    return_attention_mask=True, 
    pad_to_max_length=True,
    max_length=max_len,
    return_tensors='pt' #returns pytorch tensor
)

encoded_data_val = tokenizer.batch_encode_plus(
    #change below to appropriate setup
    df_train[df_train.data_type=='val'].utterance.values,
    add_special_tokens=True, #adds the CLS and SEP tokens
    truncation=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=max_len,
    return_tensors='pt'
)

#encoding process above returns dictionaries. We grab input ID tokens, attention mask, and labels from this
input_ids_train = encoded_data_train['input_ids'] #each sentence as a row of token IDs
attention_masks_train = encoded_data_train['attention_mask'] #1 for real tokens, 0 for padding
#labels must cover the same rows as encoded_data_train (here: all of df_train)
labels_train = torch.tensor(df_train.label.values)

input_ids_val = encoded_data_val['input_ids'] #each sentence as a row of token IDs
attention_masks_val = encoded_data_val['attention_mask'] #1 for real tokens, 0 for padding
labels_val = torch.tensor(df_train[df_train.data_type=='val'].label.values)

#we construct a tensor dataset from input ID tokens, attention mask, and labels
dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train, labels_train)
dataset_validation = TensorDataset(input_ids_val,
                            attention_masks_val, labels_val)
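
What the encoding step produces per sentence is a fixed-length row of token IDs plus an attention mask of the same length. A minimal sketch of the padding and masking in plain Python; `pad_and_mask` is a hypothetical helper, the word-piece IDs are made up, and the naive truncation here simply cuts the list, whereas the real tokenizer keeps the special tokens in place.

```python
def pad_and_mask(token_id_lists, max_length, pad_id=0):
    """Truncate or right-pad each sequence to max_length and build its attention mask."""
    input_ids, attention_masks = [], []
    for ids in token_id_lists:
        ids = ids[:max_length]                                 # naive truncation
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)               # right padding
        attention_masks.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return input_ids, attention_masks

# Made-up token IDs; 101 and 102 are [CLS] and [SEP] in bert-base-uncased, 0 is [PAD].
batch = [[101, 2428, 102], [101, 7929, 999, 102]]
ids, masks = pad_and_mask(batch, max_length=6)
print(ids)
print(masks)
```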

# **3. Setting up the pre-trained BERT model**

Load the pre-trained BERT model.

In [None]:
from transformers import BertForSequenceClassification

#each utterance is its own sequence, which will be classified into one of 8 classes
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',  
    num_labels = len(label_dict),
    output_attentions = False, #do not return the attention weights
    output_hidden_states = False #do not return the hidden states of every layer
)
pass

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Load the data in batches of the chosen batch size.

In [None]:
#we import our DataLoader and Samplers
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 32

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train), #Randomly train on data, so we don't bias training
    batch_size=batch_size
)

dataloader_val = DataLoader(
    dataset_validation,
    sampler=SequentialSampler(dataset_validation), #Sequential sampling on validation data so we can tie results to original dataframe
    batch_size=batch_size
)

Set the optimizer hyperparameters, then declare the number of epochs and the scheduler.

In [None]:
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(
    model.parameters(),
    lr=6e-5, #1e-5, #2e-5 > 5e-5: A HYPERPARAMETER
    eps=1e-8
)

epochs=6

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataloader_train) * epochs
)
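
With `num_warmup_steps=0`, the scheduler lowers the learning rate linearly from its initial value to zero over all training steps. A minimal sketch of that multiplier in plain Python; `linear_schedule_factor` is a hypothetical name that mirrors the documented behaviour of `get_linear_schedule_with_warmup` and is not the library code itself.

```python
import math

def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 6e-5
# 10561 utterances (the sum of the class counts shown earlier) in batches of 32, for 6 epochs.
steps_per_epoch = math.ceil(10561 / 32)
total_steps = steps_per_epoch * 6

lr_start = base_lr * linear_schedule_factor(0, 0, total_steps)
lr_middle = base_lr * linear_schedule_factor(total_steps // 2, 0, total_steps)
lr_end = base_lr * linear_schedule_factor(total_steps, 0, total_steps)
print(steps_per_epoch, lr_start, lr_middle, lr_end)
```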

Declare the functions for measuring accuracy, the decision function, and related helpers.

In [None]:
import numpy as np
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    """
    Helper function for calculating F1-score between predicted and true values
    """
    preds_flat = np.argmax(preds, axis=1).flatten() #argmax over classes gives one predicted label per row; flatten guarantees a 1-D array
    return f1_score(labels, preds_flat, average='weighted') #'weighted' averages per-class F1 by class support, so rare classes such as fear count for less
    #'macro' would instead give every class equal weight


def accuracy_per_class(preds, labels):
  """
  Helper function for calculating the accuracy per class and displaying it
  Modified for sentiment Analysis. Not using emotion analysis code
  """
  preds_flat = np.argmax(preds, axis=1).flatten()
  
  for label in np.unique(labels):
      y_preds = preds_flat[labels==label]
      y_true = labels[labels==label]
      print(f'Class: {label_dict_inverse[label]}')
      print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true[y_true==label])} in percentage: {len(y_preds[y_preds==label])/len(y_true[y_true==label])}')

def softmax(matrix):
  """
  A function to normalise row values of a matrix to 1.0
  @param matrix - a numpy matrix which has non-normalised values per row
  @returns - the matrix with values all normalised to 1.0
  """
  return (np.exp(matrix.T) / np.sum(np.exp(matrix), axis=1)).T   

def emotion_prediction(normalised_matrix):
  """
  A function to grab the dominant class (i.e. the prediction)
  @param normalised_matrix - a numpy matrix, which has normalised values per row, achieved
  from applying an activation function
  """
  return np.argmax(normalised_matrix, axis=1).flatten()   

label_dict_inverse = {v:k for (k,v) in label_dict.items()}   
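
The weighted F1 score is the mean of the per-class F1 scores, with each class weighted by its number of true instances. A minimal sketch in plain Python on made-up labels, assuming that definition; `weighted_f1` is a hypothetical helper and not scikit-learn's implementation.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        total += n_true * f1          # weight each class by its support
    return total / len(y_true)

# Made-up labels: per-class F1 is 0.8, 0.8 and 1.0 with supports 3, 2 and 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
print(score)
```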

# **4. Training**

In [None]:
#again we set a seed value of 17 to make our training loop reproducible

import random

seed_val = 17 #so our results/process is reproducible by whoever wants to reproduce
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val) #include for when using a GPU

Use the GPU if one is available.

In [None]:
#to check GPU vs CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


Declare the evaluation function.

In [None]:
#quite similar to training, except for the differences mentioned below
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }
        #ignore/disable gradients
        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        #detach from the computation graph and move the values from GPU to CPU
        #so we can use numpy, which works only on CPU
        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

Run training with the settings defined above.

In [None]:
#our training loop!
for epoch in tqdm(range(1, epochs+1)):
    model.train()
    
    #set to 0 initially, then add each batch's loss iteratively
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False, #let it overwrite after each epoch
                        disable=False, 
                       )
    for batch in progress_bar:
        
        #reset gradients to zero at the start of every batch
        model.zero_grad()
        
        #dataloader has 3 variables. so it's going to be a tuple of 3 items. We make sure each item is on the correct device
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {
            'input_ids' : batch[0],
            'attention_mask' : batch[1],
            'labels' : batch[2]
        }
        
        #unpacks dictionary straight into model
        outputs = model(**inputs)
        
        #bert model returns loss and logits
        loss = outputs[0]
        loss_train_total += loss.item() #add up loss
        loss.backward() #backpropagate
        
        #clip the global gradient norm to at most 1.0 to guard against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        #update progress bar to display the mean loss of the current batch
        progress_bar.set_postfix({'training_loss' : '{:.3f}'.format(loss.item())})
        
    #outside the batch loop and inside the epoch loop, so per epoch
    #save model checkpoint and print progress
    torch.save(model.state_dict(), f'Epoch-{epoch}.model')
    
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    #loss per epoch:
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    #to detect overtraining - happens when training loss goes down and val loss goes up. Starts to
    #train perfectly on our data such that its no longer generalisable
    val_loss, predictions, true_vals = evaluate(dataloader_val) #predictions are the logits

    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (weighted): {val_f1}')


Epoch 1
Training loss: 1.225851582076254
Validation loss: 0.9825581347764428
F1 Score (weighted): 0.6439286164783297

Epoch 2
Training loss: 0.9453032105891006
Validation loss: 0.63132028170486
F1 Score (weighted): 0.7996491786981919

Epoch 3
Training loss: 0.6659017080536782
Validation loss: 0.3881627403533281
F1 Score (weighted): 0.8763153444073001

Epoch 4
Training loss: 0.4166493104416258
Validation loss: 0.20764181896376965
F1 Score (weighted): 0.9404862700833402

Epoch 5
Training loss: 0.2718008341601124
Validation loss: 0.14891787896405406
F1 Score (weighted): 0.9559554560651953

Epoch 6
Training loss: 0.1888124556920781
Validation loss: 0.11765987934794889
F1 Score (weighted): 0.9625739193008973
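
The `clip_grad_norm_` call in the training loop rescales the gradients, not the weights, so that their combined L2 norm does not exceed the given maximum. A minimal sketch of that rule in plain Python; `clip_by_global_norm` is a hypothetical helper, and a flat list of floats stands in for the model's gradient tensors.

```python
import math

def clip_by_global_norm(gradients, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        # Only shrink; gradients already inside the limit are left untouched.
        gradients = [g * scale for g in gradients]
    return gradients, total_norm

# Norm 5.0 is above the limit, so both components are scaled down together.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# Norm 0.5 is below the limit, so nothing changes.
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(norm_before, norm_after, unchanged)
```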



# **5. Model evaluation**

In [None]:
#grabbing predictions from validation data-set
_, predictions_val, labels_val = evaluate(dataloader_val)





Classify the emotion from the predictions using the decision function.

In [None]:
#grab predictions variable here and do a softmax, to visualise results against df
percent_emotions_val = softmax(predictions_val)

#from soft-maxed probabilities of emotions to picking the most dominant emotion
emotions_val = emotion_prediction(percent_emotions_val)

accuracy_per_class(predictions_val, labels_val)

Class: neutral
Accuracy: 933/951 in percentage: 0.9810725552050473
Class: surprise
Accuracy: 232/244 in percentage: 0.9508196721311475
Class: fear
Accuracy: 35/37 in percentage: 0.9459459459459459
Class: non-neutral
Accuracy: 376/403 in percentage: 0.9330024813895782
Class: joy
Accuracy: 254/257 in percentage: 0.9883268482490273
Class: sadness
Accuracy: 67/70 in percentage: 0.9571428571428572
Class: anger
Accuracy: 93/103 in percentage: 0.9029126213592233
Class: disgust
Accuracy: 44/48 in percentage: 0.9166666666666666
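
The decision step is a softmax over each row of logits followed by an argmax. A minimal sketch in plain Python with made-up logits; it subtracts the row maximum before exponentiating, a common stabilisation that the notebook's `softmax` does not apply, and `softmax_row` and `argmax` are hypothetical helper names.

```python
import math

def softmax_row(logits):
    """Softmax of one row of logits; subtracting the max avoids overflow in exp."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, 1.0, 0.1]  # made-up logits for a 3-class example
probabilities = softmax_row(logits)

# Softmax is monotonic, so the winning class is the same before and after it.
same_winner = argmax(logits) == argmax(probabilities)

# Large logits would overflow a naive exp(); the shifted version stays finite.
large = softmax_row([1000.0, 999.0])
print(probabilities, same_winner, large)
```

Since the argmax is unchanged by softmax, the probabilities matter only when the confidence of a prediction is of interest.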


Compute the F1 score.

In [None]:
#f1 score overall
print('Weighted F1 score of val dataset:')
print(f1_score_func(predictions_val, labels_val))

Weighted F1 score of val dataset:
0.9625739193008973


# **6. Testing**

Mount Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


Create the folder for English emotion analysis.

In [None]:
mkdir /content/gdrive/MyDrive/eng_sa

mkdir: cannot create directory ‘/content/gdrive/MyDrive/eng_sa’: File exists


Change into the English emotion analysis folder.

In [None]:
cd /content/gdrive/MyDrive/eng_sa

/content/gdrive/MyDrive/eng_sa


Upload the test data (en_data.csv).

In [None]:
from google.colab import files
file_uploaded = files.upload()

Install and import the required libraries.

In [None]:
!pip install transformers==3.0.0
!pip install contractions 

import torch #the pytorch library, used for modeling and formatting our data to be compatible in a pytorch environment
import pandas as pd #for dataframe reading, cleaning functions
from tqdm.notebook import tqdm #used as a progress bar
import json
from transformers import BertForSequenceClassification
import numpy as np
from sklearn.metrics import f1_score
from bs4 import BeautifulSoup # a library for parsing HTML
import string
import unicodedata
import re
from transformers import BertTokenizer
from torch.utils.data import TensorDataset #setting up our dataset so it's usable in a pytorch environment
#we import our DataLoader and Samplers
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler




Initialize the pre-trained BERT model.

In [None]:
#we convert those labels to numbers, for use in our algorithm later on
label_dict = {'anger': 6,
 'disgust': 7,
 'fear': 2,
 'joy': 4,
 'neutral': 0,
 'non-neutral': 3,
 'sadness': 5,
 'surprise': 1}

label_dict_inverse = {v:k for (k,v) in label_dict.items()}   

#fresh model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Use the GPU.

In [None]:
#to check GPU vs CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(device)

#pass on the fresh model to the correct device, either GPU or CPU
model.to(device)
pass #so we dont have all that text printed out

cuda


Load the saved model.

In [None]:
# map_location=device loads the weights onto whichever device was selected above (GPU or CPU).
model.load_state_dict(
    torch.load('Epoch-6.model',
              map_location=device))

<All keys matched successfully>

Read the test file (en_data.csv) into a DataFrame.

In [None]:
with open('en_data.csv', 'r', encoding='utf-8', newline='') as csvfile:
  df_test = pd.read_csv(csvfile)

df_test[1:5]

Unnamed: 0,id,i_dialog,i_utterance,speaker,utterance
1,1,0,1,Monica,Oh! You're awake!
2,2,0,2,Joey,Then you gotta come clean with Ma! This is not...
3,3,0,3,Mr. Tribbiani,"Yeah, but this is"
4,4,0,4,Joey,I don't wanna hear it! Now go to my room!


Preprocess the test data in the same way as the training data.

In [None]:
import contractions
df_test['utterance'] = df_test['utterance'].apply(lambda utterance: contractions.fix(utterance))

# remove HTML tags
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")    
    return soup.get_text().replace("\n", "")

# we then apply the function for removing HTML Tags
df_test['utterance'] = df_test['utterance'].apply(strip_html_tags)

# normalise accented characters i.e. convert à to a
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8','ignore')
    return text

df_test['utterance'] = df_test['utterance'].apply(remove_accented_chars)

#remove @name mentions and urls from an utterance
def remove_mentions_and_urls(text):
  text = re.sub(r'(@[A-Za-z0-9]+)|(\w+://\S+)|(www\.[A-Za-z0-9]+\.[A-Za-z0-9]+)', ' ', text)
  return text

df_test['utterance'] = df_test['utterance'].apply(remove_mentions_and_urls)

#remove punctuations except '?' and '!' and '.'
def remove_punctuation(text):
    text = re.sub(r'[\'\"\\\/\,#]', '', text)
    text = re.sub(r'[^\w\s\?\!\.]', ' ', text)
    return text

df_test['utterance'] = df_test['utterance'].apply(remove_punctuation)

#remove multiple '.', keep just one
def remove_excess_fullstops(text):
  text = re.sub(r'\.{2,}', '.', text)
  return text

df_test['utterance'] = df_test['utterance'].apply(remove_excess_fullstops)

#remove excess and trailing/leading whitespace
def remove_excess_whitespace(text):
  text = re.sub(r'\s{2,}', ' ', text).strip()
  return text

def softmax(matrix):
  """
  A function to normalise row values of a matrix to 1.0
  @param matrix - a numpy matrix which has non-normalised values per row
  @returns - the matrix with values all normalised to 1.0
  """
  return (np.exp(matrix.T) / np.sum(np.exp(matrix), axis=1)).T   

def emotion_prediction(normalised_matrix):
  """
  A function to grab the dominant class (i.e. the prediction)
  @param normalised_matrix - a numpy matrix, which has normalised values per row, achieved
  from applying an activation function
  """
  return np.argmax(normalised_matrix, axis=1).flatten()   

df_test['utterance'] = df_test['utterance'].apply(remove_excess_whitespace)

df_test.shape

(1623, 5)

Encode the preprocessed utterances into a form the model can take as input.

In [None]:
#set up a tokenizer object, using pre-trained BERT's own tokenizer
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased', #we ask the tokenizer to lowercase our sentences
    do_lower_case=True
)

max_len = 76

encoded_data_test = tokenizer.batch_encode_plus(
    #change below to appropriate setup
    df_test.utterance.values,
    add_special_tokens=True, #add the CLS and SEP tokens
    truncation=True,
    return_attention_mask=True, 
    pad_to_max_length=True,
    max_length=max_len,
    return_tensors='pt' #returns pytorch tensor
)

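#the test set has no gold labels; a dummy label column keeps the dataset layout identical to training
df_test['label'] = 0

input_ids_test = encoded_data_test['input_ids'] #each sentence as a row of token IDs
attention_masks_test= encoded_data_test['attention_mask'] #1 for real tokens, 0 for padding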
labels_test = torch.tensor(df_test.label.values)

dataset_test = TensorDataset(input_ids_test,
                            attention_masks_test,
                            labels_test)


Use the same batch size as for the training data.

In [None]:
batch_size = 32

dataloader_test = DataLoader(
    dataset_test,
    sampler=SequentialSampler(dataset_test), #Sequential sampling on test data so we can tie results to original dataframe
    batch_size=batch_size
)

In [None]:
#quite similar to training, except for the differences mentioned below
def evaluate_test(dataloader_test):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_test):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }
        #ignore/disable gradients
        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        #detach from the computation graph and move the values from GPU to CPU
        #so we can use numpy, which works only on CPU
        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_test) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

Run the model on the test data.

In [None]:
#grabbing predictions from the test data-set
_, predictions_test, labels_test = evaluate_test(dataloader_test)





Classify the emotion from the predictions using the decision function.

In [None]:
#grab predictions variable here and do a softmax, to visualise results against df
percent_emotions_test = softmax(predictions_test)

#from soft-maxed probabilities of emotions to picking the most dominant emotion
emotions_test = emotion_prediction(percent_emotions_test)

label_dict_inverse[emotions_test[2]]

'anger'

Save the test results to a file.

In [None]:
sample_data = [['Id', 'Predicted']]

j = 0
for emotion in emotions_test:
    sample_data += [[j, label_dict_inverse[emotion]]]
    j+=1
dataframe = pd.DataFrame(sample_data)
dataframe.to_csv("sample.csv", header=False, index=False, mode='w+')

In [None]:
sample_data[1:10]

[[0, 'non-neutral'],
 [1, 'surprise'],
 [2, 'anger'],
 [3, 'neutral'],
 [4, 'anger'],
 [5, 'sadness'],
 [6, 'neutral'],
 [7, 'non-neutral'],
 [8, 'surprise']]
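
The saved file has an `Id,Predicted` header followed by one row per test utterance. A minimal sketch of the same layout using only the standard `csv` module, with made-up predictions; the in-memory buffer is a stand-in for the `sample.csv` file, and `label_names` is a hypothetical subset of `label_dict_inverse`.

```python
import csv
import io

label_names = {0: "neutral", 1: "surprise", 6: "anger"}  # subset of label_dict_inverse
predicted_ids = [1, 6, 0]                                 # made-up predictions

buffer = io.StringIO()  # stand-in for open("sample.csv", "w", newline="")
writer = csv.writer(buffer, lineterminator="\n")

# Header row first, then one (Id, Predicted) row per prediction.
writer.writerow(["Id", "Predicted"])
for row_id, class_id in enumerate(predicted_ids):
    writer.writerow([row_id, label_names[class_id]])

csv_text = buffer.getvalue()
print(csv_text)
```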