# Sentiment Classification

### Task
* Classify Naver movie reviews as positive or negative.
* The goal is to implement the deep learning training pipeline yourself, everything except data loading.

### Dataset
* [Naver sentiment movie corpus v1.0](https://github.com/e9t/nsmc/)

### Base code
* Dataset: split into train, val, and test
* Input data shape: (`batch_size`, `max_sequence_length`)
* Output data shape: (`batch_size`, 1)
* Training
* Evaluation

### Try some techniques
* Tune the number of training epochs
* Change model architectures (Custom model)
  * Use other cells (LSTM, GRU, etc.)
  * Use dropout layers
* Tune the embedding size
  * Or train with one-hot vectors
* Vary the number of words in the vocabulary
* Vary the `pad` options
* Data augmentation (if possible)

## Import modules

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
!pip install sentencepiece



In [3]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import time
import shutil
import tarfile

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output
import urllib.request

import pandas as pd

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

import sentencepiece as spm

from collections import Counter, defaultdict


## Load Data

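* ratings_train.txt: 150K reviews used for training
* ratings_test.txt: 50K reviews held out for testing
* All reviews are at most 140 characters long
* Each sentiment class is sampled equally (i.e., random guessing yields 50% accuracy)
* 100K negative reviews (originally rated 1-4)
* 100K positive reviews (originally rated 9-10)
* Neutral reviews (originally rated 5-8) are excluded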
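
Because the two classes are balanced, 50% accuracy is the baseline any model has to beat. A minimal sketch of that check follows; the short `labels` list is a hypothetical stand-in for the real `label` column, not data from the corpus.

```python
from collections import Counter

# Hypothetical stand-in for train_data['label']; the real column has ~150K entries.
labels = [0, 1, 0, 0, 1, 1, 0, 1]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class; a balanced corpus gives roughly 0.5 per class.
class_share = {label: count / total for label, count in counts.items()}

# Accuracy of always predicting the most frequent class (the trivial baseline).
majority_baseline = max(counts.values()) / total

print(class_share)
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

On the loaded DataFrame, `train_data['label'].value_counts(normalize=True)` should report the same kind of shares.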


In [4]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt", filename="ratings_train.txt")
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt", filename="ratings_test.txt")

('ratings_test.txt', <http.client.HTTPMessage at 0x110a0cdf0>)

In [5]:
train_data = pd.read_table('ratings_train.txt')
train_data = train_data.dropna()
test_data = pd.read_table('ratings_test.txt')
test_data = test_data.dropna()

In [6]:
train_data.head()

| | id | document | label |
|---|---|---|---|
| 0 | 9976970 | 아 더빙.. 진짜 짜증나네요 목소리 | 0 |
| 1 | 3819312 | 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 | 1 |
| 2 | 10265843 | 너무재밓었다그래서보는것을추천한다 | 0 |
| 3 | 9045019 | 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 | 0 |
| 4 | 6483659 | 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... | 1 |


In [7]:
test_data.head()

| | id | document | label |
|---|---|---|---|
| 0 | 6270596 | 굳 ㅋ | 1 |
| 1 | 9274899 | GDNTOPCLASSINTHECLUB | 0 |
| 2 | 8544678 | 뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아 | 0 |
| 3 | 6825595 | 지루하지는 않은데 완전 막장임... 돈주고 보기에는.... | 0 |
| 4 | 6723715 | 3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠?? | 0 |


### Tokenizing


In [8]:
sp = spm.SentencePieceProcessor()
sp.load('naver_review.model')  # path to the trained SentencePiece model

# Define the tokenizer function
def tokenizer(text):
    return sp.encode_as_pieces(text)
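
The cell above loads `naver_review.model`, but the notebook does not show how that file was produced. The sketch below shows one way such a model could be trained with SentencePiece. The corpus path, vocabulary size, model type, and special-token IDs are all assumptions made for illustration, not the author's confirmed settings. Reserving ID 0 for padding is a deliberate choice here, because later cells pad with 0 and treat ID 0 as padding in the attention mask.

```python
def tokenizer_train_kwargs(corpus_path="reviews.txt", model_prefix="naver_review"):
    """Build keyword arguments for SentencePieceTrainer.train (all values are assumptions)."""
    return {
        "input": corpus_path,             # plain-text file, one review per line (hypothetical path)
        "model_prefix": model_prefix,     # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 8000,               # assumed; tune for the corpus
        "model_type": "unigram",          # SentencePiece default; "bpe" is another option
        "pad_id": 0,                      # reserve ID 0 for padding
        "unk_id": 1,
        "bos_id": 2,
        "eos_id": 3,
        "user_defined_symbols": ["[SEP]"],  # kept as a single, never-split piece
    }


def train_tokenizer(**overrides):
    # Imported lazily so the sketch can be read without sentencepiece installed.
    import sentencepiece as spm

    kwargs = {**tokenizer_train_kwargs(), **overrides}
    spm.SentencePieceTrainer.train(**kwargs)
    return kwargs["model_prefix"] + ".model"
```

Calling `train_tokenizer(input="my_corpus.txt")` would write `naver_review.model` next to the notebook, assuming the corpus is large enough for the requested vocabulary size.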

In [9]:
train_data

| | id | document | label |
|---|---|---|---|
| 0 | 9976970 | 아 더빙.. 진짜 짜증나네요 목소리 | 0 |
| 1 | 3819312 | 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 | 1 |
| 2 | 10265843 | 너무재밓었다그래서보는것을추천한다 | 0 |
| 3 | 9045019 | 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 | 0 |
| 4 | 6483659 | 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... | 1 |
| ... | ... | ... | ... |
| 149995 | 6222902 | 인간이 문제지.. 소는 뭔죄인가.. | 0 |
| 149996 | 8549745 | 평점이 너무 낮아서... | 1 |
| 149997 | 9311800 | 이게 뭐요? 한국인은 거들먹거리고 필리핀 혼혈은 착하다? | 0 |
| 149998 | 2376369 | 청춘 영화의 최고봉.방황과 우울했던 날들의 자화상 | 1 |


In [10]:
for i, line in enumerate(train_data['document']):
    print(f"Original Sentence: {line}")
    print(f"Tokenized Sentence: {sp.encode_as_pieces(line)}")
    print(f"Token IDs: {sp.encode_as_ids(line)}")
    if i == 5:
        break

Original Sentence: 아 더빙.. 진짜 짜증나네요 목소리
Tokenized Sentence: ['▁아', '▁더빙', '..', '▁진짜', '▁짜증나', '네요', '▁목소리']
Token IDs: [14, 1226, 7, 88, 2990, 55, 2393]
Original Sentence: 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
Tokenized Sentence: ['▁흠', '...', '포스터', '보고', '▁초딩', '영화', '줄', '....', '오', '버', '연기', '조차', '▁가볍', '지', '▁않', '구나']
Token IDs: [1949, 16, 5829, 233, 1469, 10, 6601, 47, 6454, 6564, 355, 2103, 2338, 6387, 108, 508]
Original Sentence: 너무재밓었다그래서보는것을추천한다
Tokenized Sentence: ['▁너무', '재', '밓', '었다', '그래서', '보는', '것을', '추천', '한다']
Token IDs: [39, 6416, 1, 164, 4556, 515, 1409, 2176, 367]
Original Sentence: 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
Tokenized Sentence: ['▁교', '도', '소', '▁이야기', '구', '먼', '▁..', '솔직히', '▁재미는', '▁없다', '..', '평점', '▁조', '정']
Token IDs: [729, 6392, 6487, 372, 6478, 6879, 516, 5346, 1686, 309, 7, 1187, 188, 6424]
Original Sentence: 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다
Tokenized Sentence: ['▁사이', '몬', '페', '그', '의', '▁익', '살', '스런', '▁연기가', '▁돋보', '였

In [11]:
eos_token = '[SEP]'
eos_id = sp.piece_to_id(eos_token)

print(f"ID of token '{eos_token}': {eos_id}")

ID of token '[SEP]': 4


In [12]:
sp.encode_as_ids(['[EOS]'])

[[4379, 7127, 6566, 6866, 7344]]

In [13]:
train_text = []
for i, line in enumerate(train_data['document']):
    # Convert the result of sp.encode_as_ids(line) to a TensorFlow tensor
    train_text.append(tf.convert_to_tensor(sp.encode_as_ids(line), dtype=tf.int32))

test_text = []
for i, line in enumerate(test_data['document']):
    test_text.append(tf.convert_to_tensor(sp.encode_as_ids(line), dtype=tf.int32))

2024-01-17 11:39:27.888720: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-01-17 11:39:27.889623: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Metal device set to: Apple M1


In [14]:
print(len(train_text), len(test_text))

149995 49997


### Padding and truncating data using pad sequences
* The reviews all have different lengths, so pad or truncate them to a common length.

In [15]:
batch_size = 32
max_seq_length = 256

In [16]:
train_data_pad = pad_sequences(train_text, maxlen=max_seq_length)
test_data_pad = pad_sequences(test_text, maxlen=max_seq_length)

print(train_data_pad.shape, test_data_pad.shape)

(149995, 256) (49997, 256)
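
By default `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`) and fills with 0. The NumPy function below is a small re-implementation of that default behaviour for illustration only; `pad_pre` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                     # an empty sequence stays all padding
        kept = seq[-maxlen:]             # keep the last `maxlen` tokens
        out[row, -len(kept):] = kept     # right-align, padding on the left
    return out

padded = pad_pre([[14, 1226, 7], [5, 6, 7, 8, 9, 10]], maxlen=5)
print(padded)
```

The short sequence gains zeros on the left and the long one loses its first token. Passing `padding='post'` or `truncating='post'` to `pad_sequences` is one way to vary the `pad` options listed in the task.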


### Dataset 구성

In [17]:
batch_size = 32

# for train
train_dataset = tf.data.Dataset.from_tensor_slices((train_data_pad,train_data['label']))
train_dataset = train_dataset.shuffle(10000).repeat().batch(batch_size=batch_size)
print(train_dataset)

# for test
test_dataset = tf.data.Dataset.from_tensor_slices((test_data_pad,test_data['label']))
test_dataset = test_dataset.batch(batch_size=batch_size)
print(test_dataset)

<BatchDataset element_spec=(TensorSpec(shape=(None, 256), dtype=tf.int32, name=None), TensorSpec(shape=(None,), dtype=tf.int64, name=None))>
<BatchDataset element_spec=(TensorSpec(shape=(None, 256), dtype=tf.int32, name=None), TensorSpec(shape=(None,), dtype=tf.int64, name=None))>
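
Because the training dataset calls `.repeat()`, it never ends on its own, so the number of batches that make up one epoch has to be computed separately. The arithmetic below is a sketch using the dataset sizes printed earlier in this notebook.

```python
import math

num_train, num_test, batch_size = 149995, 49997, 32

# The training dataset repeats forever, so an "epoch" is defined by a step count.
steps_per_epoch = num_train // batch_size
leftover_train = num_train - steps_per_epoch * batch_size

# The test dataset is finite; its final batch is smaller than batch_size.
test_batches = math.ceil(num_test / batch_size)
last_test_batch = num_test - (test_batches - 1) * batch_size

print(steps_per_epoch, leftover_train, test_batches, last_test_batch)
```

With floor division, the few training examples that do not fill a batch simply roll over into the next pass through the repeated dataset.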


## Build the model


## Setup hyper-parameters

In [48]:
kargs = {'model_name': 'BERT',
         'num_layers': 4,
         'd_model': 32,
         'num_heads': 8,
         'dff': 64,
         'input_vocab_size': sp.get_piece_size(),
         'target_vocab_size': sp.get_piece_size(),
         'maximum_position_encoding': 10000,
         'segment_encoding': 2,
         'end_token_idx': sp.piece_to_id('[SEP]'),  # '[EOS]' is not a single piece in this vocabulary (see In [12])
         'rate': 0.1
        }

d_model = 32
dff = 64

In [49]:
def get_angles(pos, i, d_model):
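    # Parenthesize i // 2 so that columns 2k and 2k+1 share the same frequency;
    # without the parentheses, 2 * i // 2 evaluates to i.
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))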
    return pos * angle_rates

In [50]:
def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)
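
The sinusoidal encoding assigns each pair of columns (2k, 2k+1) a shared frequency, with sine in the even column and cosine in the odd one: PE(pos, 2k) = sin(pos / 10000^(2k/d_model)) and PE(pos, 2k+1) = cos(pos / 10000^(2k/d_model)). The NumPy sketch below is an independent illustration of that formula (the function name is hypothetical) and checks the pairing through the identity sin² + cos² = 1.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    pos = np.arange(num_positions)[:, np.newaxis]   # (num_positions, 1)
    i = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    # Columns 2k and 2k+1 share one frequency: the exponent is 2*(i//2)/d_model.
    # Note that `2 * i // 2` would parse as `(2 * i) // 2 == i`, hence the parentheses.
    angles = pos / np.power(10000.0, 2 * (i // 2) / d_model)
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

enc = sinusoidal_encoding(num_positions=50, d_model=32)

# sin^2 + cos^2 == 1 for every (even, odd) column pair that shares a frequency.
pair_norm = enc[:, 0::2] ** 2 + enc[:, 1::2] ** 2
print(enc.shape, np.allclose(pair_norm, 1.0))
```

If the paired columns used different frequencies, the check on `pair_norm` would fail for positions greater than zero.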

In [51]:
def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead)
    but it must be broadcastable for addition.

    Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
    output, attention_weights
    """

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights
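
To see what the mask does inside scaled dot-product attention, the following NumPy sketch repeats the same computation on small random tensors. The shapes and the mask are made up for illustration; a mask value of 1 marks a key position that should receive no attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9         # mask == 1 marks positions to ignore
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 3, 4))   # (batch, seq_len_q, depth)
k = rng.normal(size=(1, 3, 4))   # (batch, seq_len_k, depth)
v = rng.normal(size=(1, 3, 2))   # (batch, seq_len_v, depth_v)
mask = np.array([[[1.0, 0.0, 0.0]]])   # hide the first key position (e.g. padding)

output, weights = attention(q, k, v, mask)
print(weights.round(3))
```

Each row of `weights` still sums to 1, but the column for the masked key is effectively zero, so that position contributes nothing to `output`.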

In [52]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = kargs['num_heads']
        self.d_model = kargs['d_model']

        assert self.d_model % self.num_heads == 0

        self.depth = self.d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(kargs['d_model'])
        self.wk = tf.keras.layers.Dense(kargs['d_model'])
        self.wv = tf.keras.layers.Dense(kargs['d_model'])

        self.dense = tf.keras.layers.Dense(kargs['d_model'])

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights
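
The reshape-and-transpose in `split_heads`, and its inverse before the final dense layer, can be checked on a toy tensor. This NumPy sketch uses the notebook's `d_model = 32` and `num_heads = 8`; the batch size and sequence length are arbitrary.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 32, 8
depth = d_model // num_heads   # 4 features per head

x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

# split_heads: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch_size, -1, num_heads, depth).transpose(0, 2, 1, 3)

# Head h sees features [h * depth, (h + 1) * depth) of every position.
head3_matches = np.array_equal(heads[:, 3], x[:, :, 3 * depth:4 * depth])

# Merging reverses the two steps and restores the original tensor.
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, -1, d_model)
print(heads.shape, merged.shape, head3_matches, np.array_equal(merged, x))
```

With only 4 features per head, each head is quite narrow; raising `d_model` or lowering `num_heads` are both options when experimenting with the architecture.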

In [53]:
# def point_wise_feed_forward_network(**kargs):
#     return tf.keras.Sequential([
#             tf.keras.layers.Conv1D(batch_size,256,2048),
#             tf.keras.layers.Conv1D(batch_size,256,512)
#         ])


In [54]:
def point_wise_feed_forward_network(**kargs):
    # Position-wise feed-forward network built from kernel-size-1 convolutions.
    # Sizes are read from kargs rather than from module-level globals.
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters=kargs['dff'], kernel_size=1, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Conv1D(filters=kargs['d_model'], kernel_size=1, activation='linear')  # (batch_size, seq_len, d_model)
    ])
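
A `Conv1D` with `kernel_size=1` applies the same weight matrix to every time step, which is what makes the network "point-wise". The NumPy sketch below illustrates that property with randomly initialized weights (all names here are hypothetical): running one position on its own gives the same result as running the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, d_model, dff = 2, 6, 32, 64

x = rng.normal(size=(batch_size, seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
w2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)

def ffn(inputs):
    hidden = np.maximum(inputs @ w1 + b1, 0.0)   # expand to dff, then ReLU
    return hidden @ w2 + b2                      # project back to d_model

full = ffn(x)                  # all positions at once
single = ffn(x[:, 2:3, :])     # position 2 on its own

# Each position is transformed independently of its neighbours.
print(full.shape, np.allclose(full[:, 2:3, :], single))
```

Mixing information across positions is therefore left entirely to the attention sublayer.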


In [55]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(**kargs)

        self.ffn = point_wise_feed_forward_network(**kargs)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(kargs['rate'])
        self.dropout2 = tf.keras.layers.Dropout(kargs['rate'])

    @tf.function
    def call(self, x, mask):
        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2, attn_output

In [56]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(Encoder, self).__init__()

        self.d_model = kargs['d_model']
        self.num_layers = kargs['num_layers']

        self.embedding = tf.keras.layers.Embedding(kargs['input_vocab_size'],
                                                   self.d_model)
        self.seg_encoding = tf.keras.layers.Embedding(kargs['segment_encoding'],  # segment embedding (segment 0 vs. segment 1)
                                                   self.d_model)
        self.pos_encoding = positional_encoding(kargs['maximum_position_encoding'],
                                                     self.d_model)

        self.enc_layers = [EncoderLayer(**kargs)
                           for _ in range(self.num_layers)]

        self.dropout = tf.keras.layers.Dropout(kargs['rate'])

    def get_seg_data(self, data, token_id=4):
        token_found = tf.cumsum(tf.cast(data == token_id, tf.int32), axis=1)
        modified_data = tf.cast(token_found >= 1, tf.int32)

        return modified_data

    def call(self, x, mask):
        attn = None
        seq_len = tf.shape(x)[1]
        seg_data = self.get_seg_data(x)

        # adding embedding and position encoding.
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x += self.seg_encoding(seg_data)

        x = self.dropout(x)

        for i in range(self.num_layers):
            x, attn = self.enc_layers[i](x, mask)

        return x, attn  # (batch_size, input_seq_len, d_model)
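
The `get_seg_data` method derives segment IDs from the position of the `[SEP]` token (ID 4 in this tokenizer) using a cumulative sum. The NumPy sketch below repeats that logic on two hypothetical padded sequences so the result can be inspected directly.

```python
import numpy as np

SEP_ID = 4   # ID of '[SEP]' in this notebook's tokenizer

# Two hypothetical, already-padded ID sequences; the second has no '[SEP]'.
token_ids = np.array([
    [0, 0, 14, 88, 4, 55, 2393],
    [0, 14, 1226, 7, 88, 2990, 55],
])

# Running count of '[SEP]' tokens seen so far, collapsed to {0, 1}.
seen_sep = np.cumsum(token_ids == SEP_ID, axis=1)
segment_ids = (seen_sep >= 1).astype(np.int32)
print(segment_ids)
```

The `[SEP]` token itself and everything after it fall into segment 1, while a sequence that contains no `[SEP]` stays entirely in segment 0.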


In [57]:
class BERT(tf.keras.Model):
    def __init__(self, **kargs):
        super(BERT, self).__init__(name=kargs['model_name'])
        self.end_token_idx = kargs['end_token_idx']
        self.encoder = Encoder(**kargs)
        self.outputs_layer = tf.keras.layers.Dense(kargs['d_model'],
                                                   activation='tanh')

        self.final_layer = tf.keras.layers.Dense(2)

    def create_padding_mask(self, seq):
        seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

        # add extra dimensions to add the padding
        # to the attention logits.
        return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

    def call(self, x):
        inp = x
        mask = self.create_padding_mask(inp)

        enc_output, attn = self.encoder(inp, mask)  # (batch_size, inp_seq_len, d_model)
        enc_output = self.outputs_layer(enc_output)  # (batch_size, inp_seq_len, d_model)
        enc_output = tf.keras.layers.Flatten()(enc_output)  # (batch_size, inp_seq_len * d_model)
        final_output = self.final_layer(enc_output)  # (batch_size, 2): one logit per class

        return final_output
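
`create_padding_mask` marks every token ID equal to 0 as padding and adds two singleton axes so the mask broadcasts over heads and query positions. A NumPy sketch of the shapes involved, using a hypothetical left-padded batch:

```python
import numpy as np

# Hypothetical left-padded batch: 0 is the padding ID.
token_ids = np.array([
    [0, 0, 14, 1226, 7],
    [0, 39, 6416, 164, 367],
])

# 1.0 where the token is padding, 0.0 elsewhere; shape (batch, 1, 1, seq_len).
mask = (token_ids == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

# Attention logits have shape (batch, num_heads, seq_len_q, seq_len_k);
# the mask broadcasts over the head and query axes.
logits = np.zeros((2, 8, 5, 5), dtype=np.float32)
masked_logits = logits + mask * -1e9
print(mask.shape, masked_logits.shape)
```

Only key positions are masked. Padded query positions still produce outputs, and because the model flattens the whole sequence before the final layer, those outputs also feed the classifier.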

In [58]:
model = BERT(**kargs)

## Train the model

In [59]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')

def loss(real, pred):
    # Labels are class indices (0 = negative, 1 = positive), not padded token IDs,
    # so every example contributes to the loss and no label mask is applied.
    loss_ = loss_object(real, pred)

    return tf.reduce_mean(loss_)

def accuracy(real, pred):
    # Label 0 is a real class here, so predictions are scored without masking.
    acc = train_accuracy(real, pred)

    return tf.reduce_mean(acc)
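
For a two-class problem with integer labels, sparse categorical cross-entropy is the negative log-probability the model assigns to the correct class. The NumPy sketch below computes it from raw logits on four hypothetical examples. In this sketch label 0 is a real class (negative), so every example, including those labeled 0, is averaged into the loss and the accuracy.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, logits):
    """Per-example cross-entropy from raw logits and integer class labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

labels = np.array([0, 1, 1, 0])    # 0 = negative, 1 = positive
logits = np.array([[2.0, -1.0],    # confident and correct
                   [0.0, 0.0],     # undecided
                   [-1.0, 3.0],    # confident and correct
                   [-2.0, 2.0]])   # confident and wrong

per_example = sparse_categorical_crossentropy(labels, logits)
mean_loss = per_example.mean()
accuracy = (logits.argmax(axis=-1) == labels).mean()
print(per_example.round(3), round(float(mean_loss), 3), accuracy)
```

The confidently wrong example dominates the mean loss, which is the pressure that pushes the classifier to fix its worst mistakes first.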

In [60]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss=loss,
              metrics=[accuracy])

In [61]:
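# No validation data is passed to fit(), so monitor the training loss;
# monitoring 'val_loss' would only produce a warning and never trigger.
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,
                                                     monitor='loss',
                                                     restore_best_weights=True,
                                                     verbose=1)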
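
The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation: it ignores options such as `min_delta` and `restore_best_weights`, and the loss values are hypothetical.

```python
def early_stop_epoch(losses, patience):
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("inf")
    waited = 0
    for epoch, value in enumerate(losses):
        if value < best:
            best, waited = value, 0   # improvement: remember it and reset the counter
        else:
            waited += 1               # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

history = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53]   # hypothetical monitored values
print(early_stop_epoch(history, patience=3))
print(early_stop_epoch(history, patience=10))
```

Note that with `patience=10` the callback cannot trigger in a run of only 5 epochs; a smaller patience or a larger epoch budget would be needed for it to have any effect.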

In [None]:
max_epochs = 5
history = model.fit(
    train_dataset,
    steps_per_epoch=train_data_pad.shape[0] // batch_size,
    epochs=max_epochs,
    callbacks=[early_stopping_cb]
)

Epoch 1/5


2024-01-17 12:46:44.978908: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Epoch 2/5
Epoch 3/5
Epoch 4/5

## Test the model

In [None]:
results = model.evaluate(test_dataset)
# loss
print("loss value: {:.3f}".format(results[0]))
# accuracy
print("accuracy value: {:.3f}".format(results[1]))