# A Neural Probabilistic Language Model
---
Paper implementation: 

* paper: [A Neural Probabilistic Language Model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) - Yoshua Bengio et al., 2003
* [blog](https://simonjisu.github.io/nlp/2018/08/22/neuralnetworklm.html)
* [slide share](http://bit.ly/2OkYFkY)

## Contents

1. Preprocessing
2. Model
3. Result: 
    * Perplexity
    * Similarity versus "gensim Word2Vec"
    * Training Time
    
---

In [1]:
# Load packages
import os
import sys
# Make ../paper_code/NNLM importable (provides nnlm_data_loader and model)
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'paper_code', 'NNLM'))

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time
import numpy as np

from nnlm_data_loader import DataSet
from model import NNLM
from konlpy.tag import Twitter

In [2]:
USE_CUDA = torch.cuda.is_available()
DEVICE = "cuda" if USE_CUDA else None
BATCH = 1024
N_GRAM = 5
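# Instantiate the tagger once instead of on every call; each token becomes "word/POS"
_twitter = Twitter()
TAGGER = lambda x: ['/'.join(y) for y in _twitter.pos(x, norm=True)]
vocab_path = '../paper_code/NNLM/model/vocabulary' # saved vocabulary, so token indices stay fixed across runs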

In [3]:
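# One-time preprocessing (kept for reference): tokenize the raw text with TAGGER,
# save the tokens, and save the vocabulary so that later runs can reuse it.
# train_data, _, _ = DataSet(base_path='../data/nsmc/', train='train.txt', valid='valid.txt', test='test.txt', 
#             n_gram=N_GRAM, tokenizer=TAGGER, save_tokens=True, direct_load=False, remove_short=True).splits()
# train_data.vocab.save(vocab_path, train_data.vocab.stoi)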

In [4]:
dataset_creator = DataSet(base_path='../data/nsmc/', 
                          train='train_tokens', valid='valid_tokens', test='test_tokens',
                          n_gram=N_GRAM, tokenizer=str.split, save_tokens=False, 
                          direct_load=True, remove_short=True, device=DEVICE,
                          fixed_vocab=vocab_path)
train_data, valid_data, test_data = dataset_creator.splits()
train_loader, valid_loader, test_loader = dataset_creator.create_loader(train=train_data, valid=valid_data, test=test_data,
                                                                        batch_size=BATCH)

In [5]:
len(train_data), len(valid_data)

(2167155, 124660)

Check how many short sentences were removed, i.e. sentences with fewer than N_GRAM (= 5) tokens.

In [6]:
# training sentences actually used: 180000 - 17727 = 162273
train_data._total, train_data._removed 

(180000, 17727)

In [7]:
# validation sentences actually used: 10000 - 968 = 9032
valid_data._total, valid_data._removed

(10000, 968)
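
The removal follows from how the data is windowed: each tokenized sentence is cut into overlapping windows of N_GRAM tokens, with the first N_GRAM-1 tokens as context and the last one as the target, so a sentence with fewer than N_GRAM tokens yields no window at all. A minimal sketch of that windowing, assuming plain token lists; `make_ngrams` is a hypothetical helper, not the `get_ngrams` method of `nnlm_data_loader.DataSet`:

```python
def make_ngrams(sentences, n_gram=5):
    """Slide a window of n_gram tokens over each sentence.

    Returns (windows, n_removed): the list of windows and the number of
    sentences skipped because they have fewer than n_gram tokens.
    """
    windows, n_removed = [], 0
    for tokens in sentences:
        if len(tokens) < n_gram:
            n_removed += 1
            continue
        for i in range(len(tokens) - n_gram + 1):
            windows.append(tokens[i:i + n_gram])
    return windows, n_removed


sentences = [
    ["a", "b", "c", "d", "e", "f"],  # 6 tokens: two windows of 5
    ["a", "b", "c"],                 # 3 tokens: too short, removed
]
windows, n_removed = make_ngrams(sentences, n_gram=5)
for w in windows:
    print(w[:-1], "->", w[-1])  # context -> target
print("removed:", n_removed)
```

This is also why the number of training examples (2167155) is much larger than the number of sentences: every sentence contributes one example per window.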

In [8]:
# 16383 words : 30 ~~ 63202: 100
V = len(train_data.vocab)
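E = 100       # embedding size
H = 500       # hidden layer size
LR = 0.001    # initial learning rate
WD = 0.00001  # weight decay (L2 penalty)
STEP = 10     # number of epochs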
print("vocab size is", V)

vocab size is 63202
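
For reference, the network in the paper maps the concatenated embeddings of the n-1 previous words to a distribution over the vocabulary. The formulation below uses the paper's notation, where m is the embedding size and h the number of hidden units. How `model.NNLM` implements it (for example, whether it keeps the optional direct connection W) is not shown in this notebook, so the mapping to `E`, `H` and `V` is an assumption.

```latex
\begin{aligned}
x &= \bigl(C(w_{t-1}),\, C(w_{t-2}),\, \dots,\, C(w_{t-n+1})\bigr) \in \mathbb{R}^{(n-1)m} \\
y &= b + W x + U \tanh(d + H x) \\
\hat{P}(w_t = i \mid w_{t-1}, \dots, w_{t-n+1}) &= \frac{e^{y_i}}{\sum_{j} e^{y_j}}
\end{aligned}
```

Under that assumption, m = 100, h = 500, |V| = 63202 and n-1 = 4 give a 400-dimensional input x, a 500×400 hidden matrix (the paper's H, not the notebook variable `H`), and a 63202×500 output matrix U. The output layer therefore holds about 31.6M weights, compared with about 6.3M in the embedding table C.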


In [9]:
nnlm = NNLM(embed_size=E, hidden_size=H, vocab_size=V, num_prev_tokens=(N_GRAM-1))
if USE_CUDA:
    nnlm = nnlm.cuda()
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(nnlm.parameters(), lr=LR, weight_decay=WD)
scheduler = optim.lr_scheduler.MultiStepLR(gamma=0.1, milestones=[5], optimizer=optimizer)
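
`MultiStepLR` multiplies the learning rate by `gamma` each time the scheduler's epoch counter reaches a milestone. A pure-Python sketch of that rule with the values used above; `multistep_lr` is an illustrative helper, not part of PyTorch:

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate at scheduler epoch counter `epoch` under multi-step decay."""
    # The exponent is the number of milestones that are <= epoch.
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


schedule = [multistep_lr(0.001, [5], 0.1, epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(epoch, lr)
```

Under this rule the rate stays at 0.001 while the counter is below 5 and is ten times smaller from 5 onward. Which training epoch that corresponds to depends on where `scheduler.step()` is called and on the PyTorch version.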

In [10]:
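def perplexity(x):
    # Mean negative log of the given probabilities.
    # No exp() is applied, so this is a log-scale score rather than the
    # conventional perplexity.
    return -torch.log(x).sum() / x.size(0)

def validation(model, loader):
    """Return (number of correct top-1 predictions, batch-averaged score)."""
    model.eval()
    pp = 0
    acc = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for batch in loader:
            inputs, targets = batch[0], batch[1]
            # assumes model.predict returns a probability per vocabulary entry
            preds = model.predict(inputs)
            # probs: probability of the top-1 token, idxes: its vocabulary index
            probs, idxes = preds.max(1)
            acc += torch.eq(idxes, targets.view(-1)).sum().item()
            # scored on the top-1 probability, not on the target's probability
            pp += perplexity(probs).item()

    return acc, pp/len(loader)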
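
For comparison, perplexity is conventionally defined as the exponential of the average negative log-probability that the model assigns to the actual next token. A minimal sketch in plain Python, assuming `target_probs` holds those per-token probabilities (the helper and its argument name are illustrative):

```python
import math


def perplexity_from_target_probs(target_probs):
    """exp of the mean negative log-probability of the true tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))


# A model that spreads probability uniformly over 4 tokens has perplexity 4.
uniform = perplexity_from_target_probs([0.25, 0.25, 0.25, 0.25])
print(uniform)
```

Since `nn.CrossEntropyLoss` returns the mean negative log-probability of the targets in nats, this quantity equals `exp(loss)`. A cross-entropy of 4.4194, for example, corresponds to a perplexity of about 83.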

In [11]:
start_time = time.time()
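for step in range(STEP):  # one step = one full pass over train_loader
    nnlm.train()
    # Called at the start of the epoch, following the PyTorch < 1.1 convention;
    # newer versions expect scheduler.step() after the epoch's optimizer.step() calls.
    scheduler.step()
    losses=[]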
    for i, batch in enumerate(train_loader):
        inputs, targets = batch[0], batch[1]

        nnlm.zero_grad()

        outputs = nnlm(inputs)

        loss = loss_function(outputs, targets.view(-1))
        losses.append(loss.item())

        loss.backward()
        torch.nn.utils.clip_grad_norm_(nnlm.parameters(), 50.0)  # gradient clipping
        optimizer.step()
        
        if i % 1000 == 0:
            msg = '[{}/{}][{}/{}] train_loss: {:.4f}'.format(step+1, STEP, i, len(train_loader), np.mean(losses))
            print(msg)
            
    acc_valid, pp_valid = validation(model=nnlm, loader=valid_loader)
    print('='*30)
    msg = '[{}/{}]\n valid_perplexity: {:.4f} \n valid_accuracy: {:.4f}'.format(step+1, STEP, pp_valid, acc_valid/len(valid_data))
    print(msg)
    print('='*30)
    
end_time = time.time()
minute = int((end_time-start_time) // 60)
print('Training Execution time with validation: {:d} m {:.4f} s'.format(minute, (end_time-start_time)-minute*60))

[1/10][0/2117] train_loss: 11.0958
[1/10][1000/2117] train_loss: 6.5791
[1/10][2000/2117] train_loss: 6.2977
[1/10]
 valid_perplexity: 2.1833 
 valid_accuracy: 0.1735
[2/10][0/2117] train_loss: 5.8066
[2/10][1000/2117] train_loss: 5.3531
[2/10][2000/2117] train_loss: 5.3857
[2/10]
 valid_perplexity: 2.0390 
 valid_accuracy: 0.1914
[3/10][0/2117] train_loss: 5.3604
[3/10][1000/2117] train_loss: 5.0198
[3/10][2000/2117] train_loss: 5.0861
[3/10]
 valid_perplexity: 1.9718 
 valid_accuracy: 0.1983
[4/10][0/2117] train_loss: 5.1429
[4/10][1000/2117] train_loss: 4.8299
[4/10][2000/2117] train_loss: 4.9050
[4/10]
 valid_perplexity: 1.9294 
 valid_accuracy: 0.2015
[5/10][0/2117] train_loss: 5.0019
[5/10][1000/2117] train_loss: 4.6956
[5/10][2000/2117] train_loss: 4.7740
[5/10]
 valid_perplexity: 1.8975 
 valid_accuracy: 0.2036
[6/10][0/2117] train_loss: 4.8963
[6/10][1000/2117] train_loss: 4.5474
[6/10][2000/2117] train_loss: 4.4194
[6/10]
 valid_perplexity: 1.9439 
 valid_accuracy: 0.2112
[7/
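
The training loop clips gradients with `clip_grad_norm_(nnlm.parameters(), 50.0)`, which rescales all gradients by one common factor when their global L2 norm exceeds the threshold. A plain-Python sketch of that rule on nested lists of numbers; `clip_by_global_norm` is an illustrative helper, and the small `eps` in the denominator is an assumption added for numerical safety:

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by the same factor so the global L2 norm <= max_norm.

    grads: list of lists of floats, one inner list per parameter.
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for param in grads for g in param))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    clipped = [[g * scale for g in param] for param in grads]
    return clipped, total_norm


grads = [[30.0, 40.0], [0.0, 120.0]]  # global norm: sqrt(900 + 1600 + 14400) = 130
clipped, norm = clip_by_global_norm(grads, max_norm=50.0)
print(norm, clipped)
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its length is capped.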

In [12]:
torch.save(nnlm.state_dict(), '../paper_code/NNLM/model/nnlm.model')

## Test

In [13]:
nnlm.load_state_dict(torch.load('../paper_code/NNLM/model/nnlm.model'))

In [14]:
acc, pp = validation(model=nnlm, loader=test_loader)
msg = 'test_perplexity: {:.4f}, test_accuracy: {:.4f}'.format(pp, acc/len(test_data))
print(msg)

test_perplexity: 1.9268, test_accuracy: 0.2118


In [15]:
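# Rough English gloss: "A script of lower quality than the children's movies coming out these days,
# and on top of that, bad acting from a Will Smith younger than the one we know, as a bonus"
test_sent = '요즘 나오는 어린이 영화보다 수준 낮은 시나리오 거기다 우리가 아는 윌스미스 보다 어린 윌스미스에 발연기는 보너스'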

In [16]:
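def predict_test_sample(test_sent, model, dataset):
    """Print the next-token prediction for each n-gram of test_sent, then
    generate a continuation greedily from the first n-1 tokens."""
    # Relies on the global dataset_creator for text preprocessing.
    test_sent_tokens = TAGGER(dataset_creator._preprocessing(test_sent))
    test_sent_ngrams = dataset.get_ngrams([test_sent_tokens])
    datas = np.array(dataset.numerical(test_sent_ngrams))
    
    # Context tokens only; the last column holds the target.
    x = torch.LongTensor(datas[:, :-1])
    
    if USE_CUDA:
        model = model.cpu()  # the inputs are CPU tensors, so run the model on the CPU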
    
    for tkns, inputs in zip(test_sent_ngrams, x):
        pred = model.predict(inputs.view(1, -1)).max(1)[1]
        print(' '.join(tkns[:-1]), '-->', dataset.vocab.itos[pred.item()], '\t| target:', tkns[-1])
        
    print('='*50)
    print('given {} words: {}'.format(dataset.n_gram-1, ' '.join(test_sent_ngrams[0][:-1])))
    print('-'*50)
    preds = test_sent_ngrams[0][:-1]
    sent_length = len(test_sent_tokens)
    i = 0
    inputs = dataset.numerical([test_sent_ngrams[0][:-1]])[0]
    # Greedy generation: feed each prediction back in as the newest context token.
    while i <= sent_length:
        pred = model.predict(torch.LongTensor(inputs).view(1, -1)).max(1)[1].item()
        preds.append(dataset.vocab.itos[pred])
        inputs.pop(0)        # drop the oldest context token
        inputs.append(pred)  # append the predicted token
        i += 1
    
    print(' '.join([x.split('/')[0] for x in preds]))
    string = ' '.join(preds)
    return string

In [17]:
lm_string = predict_test_sample(test_sent, nnlm, test_data)

요즘/Noun 나오는/Verb 어린이/Noun 영화/Noun --> 가/Josa 	| target: 보다/Josa
나오는/Verb 어린이/Noun 영화/Noun 보다/Josa --> 더/Noun 	| target: 수준/Noun
어린이/Noun 영화/Noun 보다/Josa 수준/Noun --> 이/Josa 	| target: 낮은/Adjective
영화/Noun 보다/Josa 수준/Noun 낮은/Adjective --> 영화/Noun 	| target: 시나리오/Noun
보다/Josa 수준/Noun 낮은/Adjective 시나리오/Noun --> 가/Josa 	| target: 거기/Noun
수준/Noun 낮은/Adjective 시나리오/Noun 거기/Noun --> 에/Josa 	| target: 다/Josa
낮은/Adjective 시나리오/Noun 거기/Noun 다/Josa --> ./Punctuation 	| target: 우리/Noun
시나리오/Noun 거기/Noun 다/Josa 우리/Noun --> 는/Josa 	| target: 가/Josa
거기/Noun 다/Josa 우리/Noun 가/Josa --> 뭐/Noun 	| target: 아는/Verb
다/Josa 우리/Noun 가/Josa 아는/Verb --> 사람/Noun 	| target: 윌스미스/Noun
우리/Noun 가/Josa 아는/Verb 윌스미스/Noun --> 인데/Josa 	| target: 보다/Verb
가/Josa 아는/Verb 윌스미스/Noun 보다/Verb --> 가/Eomi 	| target: 어린/Verb
아는/Verb 윌스미스/Noun 보다/Verb 어린/Verb --> 애/Noun 	| target: 윌스미스/Noun
윌스미스/Noun 보다/Verb 어린/Verb 윌스미스/Noun --> 가/Josa 	| target: 에/Josa
보다/Verb 어린/Verb 윌스미스/Noun 에/Josa --> 대한/Noun 	| target: 발연기/Noun
어린/Verb 윌스미스/N

In [18]:
lm_string

'요즘/Noun 나오는/Verb 어린이/Noun 영화/Noun 가/Josa 다/Adverb 있/Adjective 나/Eomi ?/Punctuation ᄏ/Foreign 아/Exclamation 놔/Verb 서/Eomi 봤/Verb 는데/Eomi 이/Determiner 것/Noun 은/Josa 뭐/Noun ./Punctuation ᄏ/Foreign ;/Punctuation 이/Determiner 것/Noun 은/Josa 뭐/Noun'
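
The second half of `predict_test_sample` is greedy decoding: it keeps a window of the last n-1 token ids, appends the most likely next token, and slides the window by one. A self-contained sketch with a toy stand-in for the model; `greedy_generate` and `toy_predict` (including its arithmetic rule) are invented purely for illustration:

```python
def greedy_generate(predict_next, context, n_steps):
    """Extend `context` by n_steps tokens, always taking the top prediction.

    predict_next: callable mapping a list of token ids to the next token id.
    context: list of token ids used as the initial window (not modified).
    """
    window = list(context)
    generated = list(context)
    for _ in range(n_steps):
        nxt = predict_next(window)
        generated.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest token, keep the window size fixed
    return generated


def toy_predict(window):
    # Hypothetical "model": the next id is the sum of the window modulo 10.
    return sum(window) % 10


result = greedy_generate(toy_predict, [1, 2, 3, 4], n_steps=5)
print(result)
```

Because each step feeds the model's own output back in, an early mistake changes every later context, which is why the generated string above drifts away from the original sentence after the first four tokens.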