## Naver Movie Review Classification Model

We build a sentiment classification model using about 200,000 Naver movie reviews.

In [19]:
import pandas as pd
import numpy as np
from konlpy.tag import Mecab
from mxnet.gluon import nn, rnn
from mxnet import gluon, autograd
import gluonnlp as nlp
from mxnet import nd 
import mxnet as mx
import multiprocessing as mp
import time
import itertools
from tqdm import tqdm


mecab = Mecab()


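### Building the Vocab

We preprocess every sentence in the dataset (the train/test split happens later) and then build the Vocab. Each review is split into morphemes with the `Mecab` morphological analyzer, and the Vocab is built from those morphemes alone.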

In [2]:
rating = pd.read_csv("ratings.txt",sep='\t')

In [81]:
rating.head()

| | id | document | label |
|---|---|---|---|
| 0 | 8112052 | 어릴때보고 지금다시봐도 재밌어요ㅋㅋ | 1 |
| 1 | 8132799 | 디자인을 배우는 학생으로, 외국디자이너와 그들이 일군 전통을 통해 발전해가는 문화산... | 1 |
| 2 | 4655635 | 폴리스스토리 시리즈는 1부터 뉴까지 버릴께 하나도 없음.. 최고. | 1 |
| 3 | 9251303 | 와.. 연기가 진짜 개쩔구나.. 지루할거라고 생각했는데 몰입해서 봤다.. 그래 이런... | 1 |
| 4 | 10067386 | 안개 자욱한 밤하늘에 떠 있는 초승달 같은 영화. | 1 |


In [21]:
dataset = [(d, l) for d,l in zip(rating['document'], rating['label'])]

In [22]:
seq_len = 30

In [23]:
length_clip = nlp.data.PadSequence(seq_len, pad_val="<pad>")

def preprocess(data):
    comment, label = data
    morphs = mecab.morphs(str(comment).strip())
    return(length_clip(morphs), label)

def preprocess_dataset(dataset):
    start = time.time()
    with mp.Pool() as pool:
        dataset = gluon.data.SimpleDataset(pool.map(preprocess, dataset))
    end = time.time()
    print('Done! Tokenizing Time={:.2f}s, #Sentences={}'
          .format(end - start, len(dataset)))
    return dataset

In [24]:
preprocessed  = preprocess_dataset(dataset)

Done! Tokenizing Time=9.45s, #Sentences=200000


Print the first 11 tokens of the first sentence.

In [25]:
preprocessed[0][0][:11]

['어릴', '때', '보', '고', '지금', '다시', '봐도', '재밌', '어요', 'ㅋㅋ', '<pad>']
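
The trailing `<pad>` above comes from `PadSequence`, which clips long sentences and right-pads short ones to `seq_len`. A minimal plain-Python sketch of that behavior follows; `pad_or_clip` is a hypothetical helper written for illustration, not part of gluonnlp.

```python
def pad_or_clip(tokens, length, pad_val="<pad>"):
    """Clip `tokens` to `length`, or right-pad with `pad_val` up to `length`."""
    if len(tokens) >= length:
        return list(tokens[:length])
    return list(tokens) + [pad_val] * (length - len(tokens))


short = ["어릴", "때", "보", "고"]
print(pad_or_clip(short, 6))  # padded on the right to 6 items
print(pad_or_clip(short, 3))  # clipped to the first 3 items
```

Every sentence ends up with the same length, which is what lets the batches be stacked into a single array later.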

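Count token frequencies over all preprocessed sentences to build `counter`, then create `vocab` from it, keeping only tokens that occur at least 15 times (`min_freq=15`).
Since this is neither text generation nor seq2seq, `bos_token` and `eos_token` are omitted.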

In [26]:
counter = nlp.data.count_tokens(itertools.chain.from_iterable([c for c, _ in preprocessed]))

vocab = nlp.Vocab(counter,bos_token=None, eos_token=None, min_freq=15)
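
To make the effect of `min_freq` concrete, here is a small plain-Python sketch of frequency-filtered vocabulary building. `build_vocab` and `encode` are hypothetical helpers, and the index order they produce is an assumption of this sketch, not a statement about how `nlp.Vocab` orders its entries.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=1, unk="<unk>", pad="<pad>"):
    """Return (idx_to_token, token_to_idx) with the special tokens first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    idx_to_token = [unk, pad]
    # Most frequent tokens first; ties broken alphabetically for determinism.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in (unk, pad):
            idx_to_token.append(tok)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


def encode(tokens, token_to_idx, unk="<unk>"):
    """Map tokens to indices; tokens outside the vocab fall back to `unk`."""
    return [token_to_idx.get(tok, token_to_idx[unk]) for tok in tokens]


sentences = [["good", "movie"], ["good", "acting"], ["bad", "movie"]]
idx_to_token, token_to_idx = build_vocab(sentences, min_freq=2)
print(idx_to_token)
print(encode(["good", "plot"], token_to_idx))
```

Tokens below the threshold ("acting", "bad") and unseen tokens ("plot") all map to the unknown index, so a higher `min_freq` trades vocabulary size against the number of words the model can distinguish.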

### Building the Training Set

Convert the tokens to indices so they can be used as training data.

In [32]:
preprocessed_encoded  = [(vocab[data], label)  for data, label in preprocessed ]

In [54]:
train, test = nlp.data.train_valid_split(preprocessed_encoded, valid_ratio=0.1)

In [55]:
batchify_fn = nlp.data.batchify.Tuple(nlp.data.batchify.Stack(),
                                      nlp.data.batchify.Stack('float32'))

train_dataloader  = gluon.data.DataLoader(train, batch_size=100, batchify_fn=batchify_fn, shuffle=True, last_batch='discard')
test_dataloader  = gluon.data.DataLoader(test, batch_size=100, batchify_fn=batchify_fn, shuffle=True, last_batch='discard')

### Model Definition

In [62]:
class SentClassificationModelAtt(gluon.HybridBlock):
    def __init__(self, vocab_size, num_embed, **kwargs):
        super(SentClassificationModelAtt, self).__init__(**kwargs)
        with self.name_scope():
            self.embed = nn.Embedding(input_dim=vocab_size, output_dim=num_embed)
            self.drop = nn.Dropout(0.3)
            self.fc = nn.Dense(100)
            self.out = nn.Dense(2)  
    def hybrid_forward(self, F ,inputs):
        em_out = self.drop(self.embed(inputs))
        fc_out = self.fc(em_out) 
        return(self.out(fc_out))
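
A rough way to sanity-check the layer sizes: with Gluon's default `flatten=True`, `nn.Dense(100)` sees each embedded sentence flattened to `seq_len * num_embed` features. The sketch below does that arithmetic in plain Python. The `vocab_size` value is a made-up placeholder, since the notebook does not print `len(vocab)`.

```python
def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units


seq_len, num_embed, hidden, classes = 30, 50, 100, 2
vocab_size = 10000  # hypothetical; the real value is len(vocab.idx_to_token)

embed_params = vocab_size * num_embed        # one 50-dim vector per token
flat_features = seq_len * num_embed          # (batch, 30, 50) -> (batch, 1500)
fc_params = dense_params(flat_features, hidden)
out_params = dense_params(hidden, classes)

print("embedding:", embed_params)
print("fc:", fc_params)
print("out:", out_params)
```

Note also that `self.fc` has no activation, so `fc` followed by `out` is still a single linear map of the flattened embeddings; passing `activation='relu'` to `nn.Dense(100)` would be one way to add a non-linearity.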

In [72]:
ctx = mx.gpu()

# Create the model instance (the trainer and loss are defined in a later cell)
model = SentClassificationModelAtt(vocab_size = len(vocab.idx_to_token), num_embed=50)


In [73]:
model.initialize(mx.init.Xavier(),ctx=ctx)
model.hybridize()

In [74]:
mx.viz.print_summary(
    model(mx.sym.var('data')), 
    shape={'data':(1,30)}, #set your shape here
)

________________________________________________________________________________________________________________________
Layer (type)                                        Output Shape            Param #     Previous Layer                  
data(null)                                          30                      0                                           
________________________________________________________________________________________________________________________
sentclassificationmodelatt3_embedding0_fwd(Embedding30x50                   0           data                            
________________________________________________________________________________________________________________________
sentclassificationmodelatt3_dropout0_fwd(Dropout)   30x50                   0           sentclassificationmodelatt3_embe
________________________________________________________________________________________________________________________
sentclassificationmodelatt3_dens

In [75]:
trainer = gluon.Trainer(model.collect_params(), 'adam')
loss = gluon.loss.SoftmaxCrossEntropyLoss()

In [76]:
def evaluate_accuracy(model, data_iter, ctx=ctx):
    acc = mx.metric.Accuracy()
    for i, (data, label) in enumerate(data_iter):
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        output = model(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    return(acc.get()[1])
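
The metric above boils down to taking the argmax over the two class scores and comparing it with the label. A plain-Python sketch of that computation, with made-up scores and hypothetical helper names:

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose argmax equals the (possibly float) label."""
    correct = sum(1 for row, y in zip(logits, labels) if argmax(row) == int(y))
    return correct / len(labels)


# Toy scores for three reviews; labels are floats, as in the batchified data.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5]]
labels = [0.0, 1.0, 0.0]
print(accuracy(logits, labels))
```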

In [77]:
def calculate_loss(model, data_iter, loss_obj, ctx=ctx):
    test_loss = []
    for i, (te_data, te_label) in enumerate(data_iter):
        te_data = te_data.as_in_context(ctx)
        te_label = te_label.as_in_context(ctx)
        te_output = model(te_data)
        loss_te = loss_obj(te_output, te_label)
        curr_loss = nd.mean(loss_te).asscalar()
        test_loss.append(curr_loss)
    return(np.mean(test_loss))

In [78]:
epochs = 4


tot_test_loss = []
tot_test_accu = []
tot_train_loss = []
for e in range(epochs):
    train_loss = []
    #batch training 
    for i, (data, label) in enumerate(tqdm(train_dataloader)):
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        with autograd.record():
            output = model(data)
            loss_ = loss(output, label)
        # backward() runs outside the record scope, the usual Gluon idiom
        loss_.backward()
        trainer.step(data.shape[0])

        curr_loss = nd.mean(loss_).asscalar()
        train_loss.append(curr_loss)

    # calculate test loss and accuracy
    test_loss = calculate_loss(model, test_dataloader, loss_obj = loss, ctx=ctx) 
    test_accu = evaluate_accuracy(model, test_dataloader,  ctx=ctx)

    print("Epoch %s. Train Loss: %s, Test Loss : %s, Test Accuracy : %s" % (e, np.mean(train_loss), test_loss, test_accu))    
    tot_test_loss.append(test_loss)
    tot_train_loss.append(np.mean(train_loss))
    tot_test_accu.append(test_accu)
    

100%|██████████| 1800/1800 [00:10<00:00, 176.26it/s]

Epoch 0. Train Loss: 0.40293667, Test Loss : 0.36049065, Test Accuracy : 0.8435


100%|██████████| 1800/1800 [00:10<00:00, 179.11it/s]

Epoch 1. Train Loss: 0.34396738, Test Loss : 0.3614813, Test Accuracy : 0.84665


100%|██████████| 1800/1800 [00:10<00:00, 179.02it/s]

Epoch 2. Train Loss: 0.3045179, Test Loss : 0.37496507, Test Accuracy : 0.84175


100%|██████████| 1800/1800 [00:10<00:00, 178.98it/s]


Epoch 3. Train Loss: 0.26582646, Test Loss : 0.3952765, Test Accuracy : 0.8358
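
In the log above, train loss keeps falling while test loss rises after the first epoch, which suggests overfitting. A small sketch that picks the best epoch from the logged numbers (copied from the output above; the variable names are only for illustration):

```python
train_loss = [0.40293667, 0.34396738, 0.3045179, 0.26582646]
test_loss = [0.36049065, 0.3614813, 0.37496507, 0.3952765]
test_accu = [0.8435, 0.84665, 0.84175, 0.8358]

epochs = range(len(test_loss))
best_by_loss = min(epochs, key=test_loss.__getitem__)
best_by_accu = max(epochs, key=test_accu.__getitem__)

# Gap between test and train loss; a steadily growing gap hints at overfitting.
gap = [te - tr for tr, te in zip(train_loss, test_loss)]

print("best epoch by test loss:", best_by_loss)
print("best epoch by test accuracy:", best_by_accu)
print("test - train loss per epoch:", [round(g, 4) for g in gap])
```

Stopping after one or two epochs, or adding stronger regularization, would be reasonable things to try before moving to a larger model.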


## TODO 

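- Raise test accuracy to 87% or higher (try a different optimizer, an RNN, convolution, a different preprocessing scheme such as using nouns only, ...).
- Compute word-to-word similarity from the learned embedding layer.
- What happens if we train on characters instead of tokens? Does performance improve?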
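
For the embedding-similarity item, one possible starting point is cosine similarity between embedding rows. The sketch below uses made-up 2-dimensional vectors; in the notebook the rows would presumably come from `model.embed.weight.data()`, indexed through `vocab`. `cosine_similarity` and `most_similar` are hypothetical helpers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, embeddings, topk=2):
    """Rank every other token by cosine similarity to `query`."""
    scores = [(tok, cosine_similarity(embeddings[query], vec))
              for tok, vec in embeddings.items() if tok != query]
    return sorted(scores, key=lambda kv: -kv[1])[:topk]


# Toy vectors for illustration only; real ones are 50-dimensional here.
embeddings = {"good": [1.0, 0.2], "great": [0.9, 0.3], "bad": [-1.0, 0.1]}
for token, score in most_similar("good", embeddings):
    print(token, round(score, 3))
```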
