requirements
- transformers : 4.27.1
- keras : 2.11.0
- tensorflow : 2.11.0
- torch : 1.13.1+cu116
- pandas : 1.4.4
- numpy : 1.22.4

# Purpose

**- Take a foreign tourist's current state (mood) as text -> classify it with BERT -> recommend food that matches the mood.**

# Data preparation

- Uses the **Emotion Dataset for Emotion Recognition Tasks** from Kaggle
  - Source: https://www.kaggle.com/datasets/parulpandey/emotion-dataset?select=training.csv
- Each text is labeled with one of six emotions (sad, joy, love, anger, fear, surprise).
- The data is split into train, test, and val sets of 16,000, 2,000, and 2,000 rows respectively.

# Data preprocessing

- Reduce the six emotions (sad, joy, love, anger, fear, surprise) to three (sad, joy, stress), with label indices 0: sad, 1: joy, 2: stress
  - Drop surprise, merge love into joy, and map anger and fear to stress
  - Dataset sizes after remapping (before removing duplicates) -> train: 17,347 (sad: 5,216, joy: 7,548, stress: 4,583), test: 1,934 (sad: 581, joy: 854, stress: 499)
- Merge val into train to increase the amount of training data
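
The remapping described in this list can be sketched as a single lookup table. This is a minimal standard-library illustration, not the notebook's own pandas code; the names `REMAP` and `remap_labels` and the sample rows are hypothetical.

```python
# Original label ids: 0 sad, 1 joy, 2 love, 3 anger, 4 fear, 5 surprise.
# New three-class ids: 0 = sad, 1 = joy, 2 = stress.
# Surprise (5) has no entry, so rows with that label are dropped.
REMAP = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}

def remap_labels(rows):
    """Return (text, new_label) pairs, skipping rows whose label is not in REMAP."""
    return [(text, REMAP[label]) for text, label in rows if label in REMAP]

# Hypothetical sample rows: a love row, an anger row, and a surprise row
rows = [("i feel romantic too", 2), ("i am feeling grouchy", 3), ("i am amazed", 5)]
print(remap_labels(rows))
```

Keeping the mapping in one table avoids the order dependence of rewriting labels in place, where an earlier assignment can be picked up again by a later one.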

In [None]:
import csv
import pandas as pd

# Load the data
# Adjust the paths below to match where the data is stored in your environment.
train = pd.read_csv('/content/drive/MyDrive/딥러닝프로젝트/김태혁 최종 폴더/감정분류데이터셋(캐글)/training.csv')
test = pd.read_csv('/content/drive/MyDrive/딥러닝프로젝트/김태혁 최종 폴더/감정분류데이터셋(캐글)/test.csv')
val = pd.read_csv('/content/drive/MyDrive/딥러닝프로젝트/김태혁 최종 폴더/감정분류데이터셋(캐글)/validation.csv')

In [None]:
train.shape,val.shape,test.shape

((16000, 2), (2000, 2), (2000, 2))

In [None]:
# Merge train and val
train = pd.concat([train,val])
train.shape

(18000, 2)

In [None]:
# Remove surprise (label 5) from train and test
train = train[train.label != 5]
test = test[test.label != 5]
train.shape, test.shape

((17347, 2), (1934, 2))

In [None]:
train.loc[(train['label'] == 2), 'label'] = 1  # love (label 2) -> joy (1)
test.loc[(test['label'] == 2), 'label'] = 1  # love (label 2) -> joy (1)

train.loc[(train['label'] == 3), 'label'] = 2  # anger (label 3) -> new stress index (2)
test.loc[(test['label'] == 3), 'label'] = 2  # anger (label 3) -> new stress index (2)

train.loc[(train['label'] == 4), 'label'] = 2  # fear (label 4) -> new stress index (2)
test.loc[(test['label'] == 4), 'label'] = 2  # fear (label 4) -> new stress index (2)

In [None]:
# Drop missing values and duplicate texts from train and test (they affect training).
train.dropna(inplace=True)
test.dropna(inplace=True)

train.drop_duplicates(subset=['text'], inplace=True)
test.drop_duplicates(subset=['text'], inplace=True)

train.shape, test.shape

((17316, 2), (1934, 2))

## Install and import modules

In [4]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
import transformers
transformers.__version__

'4.27.1'

In [6]:
!pip install keras

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
import keras
keras.__version__

'2.11.0'

In [8]:
import tensorflow as tf
import torch

from transformers import BertTokenizer
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import random
import time
import datetime

In [10]:
tf.__version__ , torch.__version__ , pd.__version__, np.__version__

('2.11.0', '1.13.1+cu116', '1.4.4', '1.22.4')

# **Preprocessing (for BERT input) - training set**

In [None]:
# Extract the sentences
sentences = train['text']
sentences[:10]

0                               i didnt feel humiliated
1     i can go from feeling so hopeless to so damned...
2      im grabbing a minute to post i feel greedy wrong
3     i am ever feeling nostalgic about the fireplac...
4                                  i am feeling grouchy
5     ive been feeling a little burdened lately wasn...
7     i feel as confused about life as a teenager or...
8     i have been with petronas for years i feel tha...
9                                   i feel romantic too
10    i feel like i have to make the suffering i m s...
Name: text, dtype: object

In [None]:
# Convert to BERT's input format
sentences = ["[CLS] " + str(sentence) + " [SEP]" for sentence in sentences]
sentences[:10]

['[CLS] i didnt feel humiliated [SEP]',
 '[CLS] i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake [SEP]',
 '[CLS] im grabbing a minute to post i feel greedy wrong [SEP]',
 '[CLS] i am ever feeling nostalgic about the fireplace i will know that it is still on the property [SEP]',
 '[CLS] i am feeling grouchy [SEP]',
 '[CLS] ive been feeling a little burdened lately wasnt sure why that was [SEP]',
 '[CLS] i feel as confused about life as a teenager or as jaded as a year old man [SEP]',
 '[CLS] i have been with petronas for years i feel that petronas has performed well and made a huge profit [SEP]',
 '[CLS] i feel romantic too [SEP]',
 '[CLS] i feel like i have to make the suffering i m seeing mean something [SEP]']

In [None]:
# Extract the labels
labels = train['label'].values
labels

array([0, 0, 2, ..., 1, 1, 1])

In [None]:
# Split the sentences into tokens with the BERT tokenizer
# Uses 'bert-base-multilingual-cased', a tokenizer built from data in many languages
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

print (sentences[0])
print (tokenized_texts[0])

[CLS] i didnt feel humiliated [SEP]
['[CLS]', 'i', 'didn', '##t', 'feel', 'hu', '##mil', '##iated', '[SEP]']
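
The `##` pieces in the output above come from WordPiece's greedy longest-match-first splitting. The sketch below illustrates that idea with a tiny hypothetical vocabulary (`toy_vocab`); it is a simplified illustration, and the real `BertTokenizer` additionally handles punctuation, casing options, and its full vocabulary file.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, chosen to mirror the pieces printed above
toy_vocab = {"i", "feel", "didn", "##t", "hu", "##mil", "##iated"}
tokens = [t for w in "i didnt feel humiliated".split() for t in wordpiece(w, toy_vocab)]
print(tokens)
```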


In [None]:
# Maximum input sequence length in tokens
MAX_LEN = 128

# Convert the tokens to integer ids
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

# Truncate each sequence to MAX_LEN and pad shorter ones with 0
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

input_ids[0]

array([  101,   177, 34420, 10123, 38008, 26506, 55177, 89771,   102,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0]

In [None]:
# Initialize the attention masks
attention_masks = []

# Set the mask to 1 for real tokens and 0 for padding,
# so that BERT does not attend to the padded positions
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)

print(attention_masks[0])

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
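
The padding and masking steps above can be shown together on one short sequence. This is a standard-library sketch with a hypothetical helper name (`pad_and_mask`); it mirrors the `truncating="post"` and `padding="post"` behavior and assumes the padding id is 0, as in the cells above.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad one id sequence at the end and build its attention mask."""
    ids = list(token_ids[:max_len])          # cut extra tokens from the end
    ids += [pad_id] * (max_len - len(ids))   # fill the remainder with the padding id
    mask = [1.0 if i != pad_id else 0.0 for i in ids]
    return ids, mask

ids, mask = pad_and_mask([101, 177, 34420, 10123, 102], max_len=8)
print(ids)
print(mask)
```

One consequence of truncating at the end is that a sequence longer than `max_len` loses its trailing `[SEP]` token, which may or may not matter for a given task.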


In [None]:
# Split into training and validation sets
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids,
                                                                                    labels, 
                                                                                    random_state=2018, 
                                                                                    test_size=0.1)

# Split the attention masks the same way (same random_state, so rows stay aligned with the inputs)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, 
                                                       input_ids,
                                                       random_state=2018, 
                                                       test_size=0.1)

# Convert the data to PyTorch tensors
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)
validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels)
validation_masks = torch.tensor(validation_masks)

print(train_inputs[0])
print(train_labels[0])
print(train_masks[0])
print(validation_inputs[0])
print(validation_labels[0])
print(validation_masks[0])

tensor([  101,   177, 68507, 17761,   169, 19826, 10111,   177, 19556, 15127,
        27925, 10135, 10435, 20363, 10107, 10108, 40421, 10473, 10992, 10108,
        10105, 38576, 10211, 61362, 18745, 10944, 10347, 21484, 83865, 10135,
        23582, 10108, 63658, 10111, 10105, 19573, 11951, 18571, 59381, 12166,
        10160, 83018,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])
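
The split above keeps inputs, masks, and labels aligned only because both calls share the same `random_state`. An alternative that makes the alignment explicit is to shuffle row indices once and reuse them for every array. The sketch below uses only the standard library, with a hypothetical helper `split_indices` and toy data; it does not reproduce the exact rows that `train_test_split` selects.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices once and split them into (train, validation) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_rows * test_size))
    return indices[n_val:], indices[:n_val]

# Toy data standing in for input ids, attention masks, and labels
toy_inputs = [[101, 5, 102], [101, 6, 102], [101, 7, 102], [101, 8, 102], [101, 9, 102]]
toy_masks = [[1.0, 1.0, 1.0] for _ in toy_inputs]
toy_labels = [0, 1, 2, 1, 0]

train_idx, val_idx = split_indices(len(toy_inputs), test_size=0.2, seed=2018)

# The same index lists select matching rows from every array
toy_train_inputs = [toy_inputs[i] for i in train_idx]
toy_train_masks = [toy_masks[i] for i in train_idx]
toy_train_labels = [toy_labels[i] for i in train_idx]
print(len(train_idx), len(val_idx))
```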

In [None]:
# Batch size
batch_size = 32

# Bundle the inputs, masks, and labels into a PyTorch DataLoader
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

# **Preprocessing - test set**

In [None]:
# Extract the sentences
sentences = test['text']
sentences[:10]

0    im feeling rather rotten so im not very ambiti...
1            im updating my blog because i feel shitty
2    i never make her separate from me because i do...
3    i left with my bouquet of red and yellow tulip...
4      i was feeling a little vain when i did this one
5    i cant walk into a shop anywhere where i do no...
6     i felt anger when at the end of a telephone call
7    i explain why i clung to a relationship with a...
8    i like to have the same breathless feeling as ...
9    i jest i feel grumpy tired and pre menstrual w...
Name: text, dtype: object

In [None]:
# Convert to BERT's input format
sentences = ["[CLS] " + str(sentence) + " [SEP]" for sentence in sentences]
sentences[:10]

['[CLS] im feeling rather rotten so im not very ambitious right now [SEP]',
 '[CLS] im updating my blog because i feel shitty [SEP]',
 '[CLS] i never make her separate from me because i don t ever want her to feel like i m ashamed with her [SEP]',
 '[CLS] i left with my bouquet of red and yellow tulips under my arm feeling slightly more optimistic than when i arrived [SEP]',
 '[CLS] i was feeling a little vain when i did this one [SEP]',
 '[CLS] i cant walk into a shop anywhere where i do not feel uncomfortable [SEP]',
 '[CLS] i felt anger when at the end of a telephone call [SEP]',
 '[CLS] i explain why i clung to a relationship with a boy who was in many ways immature and uncommitted despite the excitement i should have been feeling for getting accepted into the masters program at the university of virginia [SEP]',
 '[CLS] i like to have the same breathless feeling as a reader eager to see what will happen next [SEP]',
 '[CLS] i jest i feel grumpy tired and pre menstrual which i prob

In [None]:
# Extract the labels
labels = test['label'].values
labels

array([0, 0, 0, ..., 1, 1, 2])

In [None]:
# Split the sentences into tokens with the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

print (sentences[0])
print (tokenized_texts[0])

[CLS] im feeling rather rotten so im not very ambitious right now [SEP]
['[CLS]', 'im', 'feeling', 'rather', 'rot', '##ten', 'so', 'im', 'not', 'very', 'amb', '##iti', '##ous', 'right', 'now', '[SEP]']


In [None]:
# Maximum input sequence length in tokens
MAX_LEN = 128

# Convert the tokens to integer ids
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

# Truncate each sequence to MAX_LEN and pad shorter ones with 0
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

input_ids[0]

array([  101, 10211, 61362, 16863, 64354, 10681, 10380, 10211, 10472,
       12558, 10559, 13903, 13499, 13448, 11858,   102,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0]

In [None]:
# Initialize the attention masks
attention_masks = []

# Set the mask to 1 for real tokens and 0 for padding,
# so that BERT does not attend to the padded positions
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)

print(attention_masks[0])

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


In [None]:
# Convert the data to PyTorch tensors
test_inputs = torch.tensor(input_ids)
test_labels = torch.tensor(labels)
test_masks = torch.tensor(attention_masks)

print(test_inputs[0])
print(test_labels[0])
print(test_masks[0])

tensor([  101, 10211, 61362, 16863, 64354, 10681, 10380, 10211, 10472, 12558,
        10559, 13903, 13499, 13448, 11858,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])

In [None]:
# Batch size
batch_size = 32

# Bundle the inputs, masks, and labels into a PyTorch DataLoader
# Data is fetched batch_size rows at a time
test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = RandomSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

<br>
<br>

# **Model creation**

In [None]:
# Get the GPU device name
device_name = tf.test.gpu_device_name()

# Check the GPU device name
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [None]:
# Select the device
if torch.cuda.is_available():    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPU available, using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


In [None]:
# Create a BERT model for classification with 3 output labels
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=3)
model.cuda()

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elemen

In [None]:
# Configure the optimizer
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # learning rate
                  eps = 1e-8 # epsilon that prevents division by zero
                )

# Number of epochs
epochs = 4

# Total training steps: batches per epoch * epochs
total_steps = len(train_dataloader) * epochs

# Create a scheduler that linearly decays the learning rate to 0 over training (no warmup steps)
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)
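
To make the schedule concrete, the sketch below computes the learning-rate multiplier of a linear warmup-then-decay schedule in plain Python. The function name `linear_schedule_factor` is hypothetical, and the figure of 487 batches per epoch is an assumption derived from the shapes printed earlier (17,316 rows, 10% held out for validation, batch size 32).

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 487 * 4  # assumption: 487 batches per epoch, 4 epochs

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With `num_warmup_steps = 0` the multiplier starts at 1 and only decays, so the learning rate is at its maximum on the very first update.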



<br>
<br>

# **모델 학습**

In [None]:
# Accuracy function
def flat_accuracy(preds, labels):
    
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [None]:
# Time formatting function
def format_time(elapsed):

    # Round to the nearest second
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [None]:
# Fix the random seeds for reproducibility
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Reset the gradients
model.zero_grad()

# Loop over the epochs
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Record the start time
    t0 = time.time()

    # Reset the accumulated loss
    total_loss = 0

    # Switch to training mode
    model.train()
        
    # Iterate over the batches from the data loader
    for step, batch in enumerate(train_dataloader):
        # Print progress information
        if step % 500 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Move the batch to the GPU
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the batch
        b_input_ids, b_input_mask, b_labels = batch

        # Forward pass
        outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask, 
                        labels=b_labels)
        
        # Get the loss
        loss = outputs[0]

        # Accumulate the total loss
        total_loss += loss.item()

        # Backward pass to compute the gradients
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update the weights using the gradients
        optimizer.step()

        # Decay the learning rate with the scheduler
        scheduler.step()

        # Reset the gradients
        model.zero_grad()

    # Compute the average loss
    avg_train_loss = total_loss / len(train_dataloader)            

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    # Record the start time
    t0 = time.time()

    # Switch to evaluation mode
    model.eval()

    # Initialize the counters
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Iterate over the batches from the data loader
    for batch in validation_dataloader:
        # Move the batch to the GPU
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the batch
        b_input_ids, b_input_mask, b_labels = batch
        
        # Do not compute gradients
        with torch.no_grad():     
            # Forward pass
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask)
        
        # Get the output logits
        logits = outputs[0]

        # Move the data to the CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Compare the logits with the labels to compute accuracy
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")


Training...

  Average training loss: 0.39
  Training epoch took: 0:05:31

Running Validation...
  Accuracy: 0.97
  Validation took: 0:00:12

Training...

  Average training loss: 0.10
  Training epoch took: 0:05:27

Running Validation...
  Accuracy: 0.97
  Validation took: 0:00:12

Training...

  Average training loss: 0.06
  Training epoch took: 0:05:27

Running Validation...
  Accuracy: 0.98
  Validation took: 0:00:12

Training...

  Average training loss: 0.03
  Training epoch took: 0:05:27

Running Validation...
  Accuracy: 0.98
  Validation took: 0:00:12

Training complete!


# Summary before testing

- Device: GPU
- Model: bert-base-multilingual-cased
- Optimizer: AdamW
- epochs: 4
- batch_size: 32
- Training time: about 25 minutes


# **Test set evaluation**

In [None]:
# Record the start time
t0 = time.time()

# Switch to evaluation mode
model.eval()

# Initialize the counters
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0

# Iterate over the batches from the data loader
for step, batch in enumerate(test_dataloader):
    # Print progress information
    if step % 100 == 0 and not step == 0:
        elapsed = format_time(time.time() - t0)
        print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(test_dataloader), elapsed))

    # Move the batch to the GPU
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the batch
    b_input_ids, b_input_mask, b_labels = batch
    
    # Do not compute gradients
    with torch.no_grad():     
        # Forward pass
        outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask)
    
    # Get the output logits
    logits = outputs[0]

    # Move the data to the CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    
    # Compare the logits with the labels to compute accuracy
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

print("")
print("Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
print("Test took: {:}".format(format_time(time.time() - t0)))


Accuracy: 0.98
Test took: 0:00:13
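
The loop above reports the mean of per-batch accuracies, which gives a short final batch the same weight as a full one. The sketch below contrasts that with accuracy over all examples. The function names and the `(correct, batch_size)` counts are hypothetical and chosen to exaggerate the effect; with 1,934 test rows and a batch size of 32, only the last batch is short, so the difference in this notebook should be small.

```python
def batch_mean_accuracy(batches):
    """Average of per-batch accuracies, as computed in the evaluation loop."""
    return sum(correct / size for correct, size in batches) / len(batches)

def sample_accuracy(batches):
    """Total correct predictions divided by total examples."""
    return sum(correct for correct, _ in batches) / sum(size for _, size in batches)

# Hypothetical (correct, batch_size) counts: two full batches and a short last one
batches = [(32, 32), (32, 32), (1, 2)]
print(batch_mean_accuracy(batches))
print(sample_accuracy(batches))
```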


# Saving and loading the model

In [None]:
path = '/content/drive/MyDrive/Cloud_AI/감정모델bert저장/'
torch.save(model, path + 'bert_final_text_model.pt')  # save the entire model object

In [None]:
# The working directory must be the folder that contains the saved model
import os
import torch
!pip install transformers # transformers must be installed for the model to load
os.chdir('/content/drive/MyDrive/Cloud_AI/감정모델bert저장/')

device = torch.device("cpu")
model1 = torch.load('bert_final_text_model.pt',map_location=device) 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.2 tokenizers-0.13.2 transformers-4.27.1


<br>
<br>

# **Testing new sentences**

In [None]:
# Convert input sentences into model inputs
# Note: unlike the training pipeline above, this function does not wrap
# each sentence in "[CLS] ... [SEP]" before tokenizing.
def convert_input_data(sentences):

    # Split the sentences into tokens with the BERT tokenizer
    tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

    # Maximum input sequence length in tokens
    MAX_LEN = 128

    # Convert the tokens to integer ids
    input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
    
    # Truncate each sequence to MAX_LEN and pad shorter ones with 0
    input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

    # Initialize the attention masks
    attention_masks = []

    # Set the mask to 1 for real tokens and 0 for padding,
    # so that BERT does not attend to the padded positions
    for seq in input_ids:
        seq_mask = [float(i>0) for i in seq]
        attention_masks.append(seq_mask)

    # Convert the data to PyTorch tensors
    inputs = torch.tensor(input_ids)
    masks = torch.tensor(attention_masks)

    return inputs, masks

In [None]:
# Classify sentences
def test_sentences(sentences):

    # Switch to evaluation mode
    model1.eval()

    # Convert the sentences into model inputs
    inputs, masks = convert_input_data(sentences)

    # Move the data to the selected device
    b_input_ids = inputs.to(device)
    b_input_mask = masks.to(device)
            
    # Do not compute gradients
    with torch.no_grad():     
        # Forward pass
        outputs = model1(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask)

    # Get the output logits
    logits = outputs[0]

    # Move the data to the CPU
    logits = logits.detach().cpu().numpy()

    return logits

In [None]:
# Setup needed to run the loaded model, followed by a quick test.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np


logits = test_sentences(['i hate it when i feel fearful for absolutely no reason'])

print(logits)
print(np.argmax(logits))

[[-2.4447374 -2.8480904  5.5362806]]
2
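
The raw logits above are unnormalized scores. Applying a softmax turns them into probabilities, which can be easier to read than `argmax` alone. This is a standard-library sketch; the `LABEL_NAMES` list is an assumption that follows the index order 0: sad, 1: joy, 2: stress used during preprocessing.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABEL_NAMES = ["sad", "joy", "stress"]        # assumed index order from preprocessing
logits = [-2.4447374, -2.8480904, 5.5362806]  # the values printed above
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(LABEL_NAMES[best], round(probs[best], 4))
```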


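# Food recommendation via emotion classification with the trained BERT model

- The mapping from mood to food follows the academic paper **'대학생들의 정서에 따른 컴포트 푸드의 차이:성차를 중심으로'** (roughly, "Differences in Comfort Food by Emotion among University Students: Focusing on Gender Differences").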

In [None]:
import random

good= ["Grilled Ribs", "Yukhoe", "Steamed Ribs", "Grilled Ribs", "Grilled Tripe", "Grilled Tripe Hot Pot"]


sad=['Cold Raw Fish', 'Grilled Pollack', 'Grilled Eel', 'Grilled Chopper', 'Grilled Shellfish', 'Seaweed Soup', 'Fried Squid', 'Fried Shrimp',
"Seaweed", "Sannakji", "Seasoned raw octopus", "Seaweed", "Shrimp fried rice", "Stir-fried webfoot octopus", "Seasoned crab", "Fish pancake", "Steamed pollack",
"Braised saury", "Dongtae-jjigae", "Steamed seafood", "Seasoned chicken", "Jajangmyeon", "Jjolmyeon", "Kongguksu", "Rice balls", "Japchae",
"Yubu Sushi", "Rice Skewers", "Pumpkin Jeon", "Soy sauce marinated crab", "Grilled hairtail", "Grilled mackerel", "Steamed mackerel", "Gwamegi"]


stressful=['grilled pollack', 'spicy stir-fried chicken', 'spicy stew', 'jjolmyeon', 'yukgaejang', 'bibim naengmyeon', 'sushi salad', 'skirt salad', 'tofu kimchi',
"Stir-fried spicy pork", "Stir-fried webfoot octopus", "Tteokbokki", "Rapokki", "Seasoned crab", "Stir-fried chicken", "Steamed pollack", "Dongtae jjigae", "Steamed seafood",
"Cold Noodles", "Kongguksu", "Boiled Potatoes", "Fried Chili", "Kimchi Pancake"]

In [None]:
while True:
    input_sent = input("Please enter what you want to say : ")
    if input_sent == '0':  # entering 0 ends the loop
        break
    logits = test_sentences([input_sent])
    label = np.argmax(logits)

    if label == 0: # classified as sad
        select_food = random.choice(sad)
        print('You look sad.') 
        print(f'{select_food} is the best when you are depressed.')
    elif label == 1: # classified as joy
        select_food = random.choice(good)
        print('You must be in a good mood. Hoho! I recommend you to eat when you feel good!')
        print(f'My choice is {select_food}.')
    elif label == 2: # classified as stress
        select_food = random.choice(stressful)
        print('You look stressed!')
        print(f'{select_food} is the best when you are stressed.')

Please enter what you want to say : i have been with petronas for years i feel that petronas has performed well and made a huge profit
You must be in a good mood. Hoho! I recommend you to eat when you feel good!
My choice is Grilled Tripe.
Please enter what you want to say : i feel like i have to make the suffering i m seeing mean something
You look sad.
Stir-fried webfoot octopus is the best when you are depressed.
Please enter what you want to say : i now feel compromised and skeptical of the value of every unit of work i put in
You look stressed!
Steamed seafood is the best when you are stressed.
Please enter what you want to say : 0


- As the results above show, emotions are classified well when the input is English text.
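
The label-to-food branching in the recommendation loop could also be written as a lookup table, which keeps messages and food lists in one place. This is a minimal sketch: `RECOMMENDATIONS` and `recommend` are hypothetical names, the messages are shortened, and the two-item food lists are placeholders for the full lists defined earlier.

```python
import random

# Label index -> (message, candidate foods); the short food lists are placeholders
RECOMMENDATIONS = {
    0: ("You look sad.", ["Seaweed Soup", "Grilled Eel"]),
    1: ("You must be in a good mood.", ["Grilled Ribs", "Yukhoe"]),
    2: ("You look stressed!", ["Tteokbokki", "Kimchi Pancake"]),
}

def recommend(label, rng=random):
    """Return (message, food) for a predicted label index."""
    message, foods = RECOMMENDATIONS[int(label)]
    return message, rng.choice(foods)

# A seeded generator makes the random pick repeatable
message, food = recommend(2, random.Random(0))
print(message)
print(f"{food} is the best when you are stressed.")
```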