# 8. Character-Aware Neural Language Models
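We have seen that RNN models can be used for language modeling. Now it's time to see how a CNN can be used for the same task. In the paper referenced below, a convolutional network over characters builds a representation of each word, which is then passed through a highway network to an LSTM language model.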
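
To make the idea concrete, here is a minimal NumPy sketch of a character-level convolution followed by max-over-time pooling, the operation the paper uses to turn a word's characters into a fixed-size vector. This is not the code in `models.CNN`: the function name `char_cnn_features`, the shapes, and the random weights are assumptions chosen for illustration, and the bias term is left out for brevity.

```python
import numpy as np


def char_cnn_features(char_ids, embeddings, filters):
    """Embed characters, convolve along the word, then max-pool over time.

    char_ids:   (word_length,) integer character indices
    embeddings: (n_chars, embed_dim) character embedding matrix
    filters:    (n_filters, width, embed_dim) convolution filters
    """
    x = embeddings[char_ids]                  # (word_length, embed_dim)
    n_filters, width, _ = filters.shape
    n_positions = x.shape[0] - width + 1      # word must be at least `width` long
    feature_map = np.empty((n_positions, n_filters))
    for t in range(n_positions):
        window = x[t:t + width]               # (width, embed_dim)
        # Inner product of every filter with the current character window.
        scores = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
        feature_map[t] = np.tanh(scores)
    # Max over positions keeps the strongest response of each filter.
    return feature_map.max(axis=0)            # (n_filters,)


# Hypothetical toy setup: 10 characters, 4-dim embeddings, 5 filters of width 3.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 4))
filters = rng.normal(size=(5, 3, 4))
word_vector = char_cnn_features(np.array([1, 4, 7, 2, 9, 3]), embeddings, filters)
print(word_vector.shape)  # (5,)
```

Because of the pooling step, the output size depends only on the number of filters, so words of different lengths map to vectors of the same size.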

### References
- [Character-Aware Neural Language Models](https://arxiv.org/abs/1508.06615)
- [yoonkim/lstm-char-cnn](https://github.com/yoonkim/lstm-char-cnn)
- [mkroutikov/tf-lstm-char-cnn](https://github.com/mkroutikov/tf-lstm-char-cnn)

## Data Preprocessing
The preprocessing code is borrowed from [mkroutikov/tf-lstm-char-cnn](https://github.com/mkroutikov/tf-lstm-char-cnn).

Preprocessing the data inside our model class is getting harder and harder, so we will do as much of it as we can before calling the `fit_to_corpus()` method.

Select one of the following datasets: `en/es/cs/de/fr/ru/ptb`.
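
Before running the preprocessing, it can help to check that the files for the chosen dataset are in place. The sketch below assumes the layout that `build_dataset()` reads from, `data/rnnlm_datasets/<dataset>/{train,valid,test}.txt`; the helper name `dataset_files` is hypothetical.

```python
import os

DATASETS = ("en", "es", "cs", "de", "fr", "ru", "ptb")


def dataset_files(dataset, root=os.path.join("data", "rnnlm_datasets")):
    """Return the expected path of each split for the given dataset name."""
    if dataset not in DATASETS:
        raise ValueError("unknown dataset: {!r}".format(dataset))
    return {
        split: os.path.join(root, dataset, split + ".txt")
        for split in ("train", "valid", "test")
    }


paths = dataset_files("es")
# Report any split that has not been downloaded yet.
missing = [path for path in paths.values() if not os.path.exists(path)]
if missing:
    print("Missing files:", missing)
```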

In [1]:
import os
import random
from collections import defaultdict

import numpy as np

from models import CNN

In [2]:
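def build_dataset(dataset, max_word_length, eos='+'):
    """Read the train/valid/test splits and encode them at word and character level."""
    data_dir = os.path.join("data", "rnnlm_datasets", dataset)
    word_to_idx = {}
    char_to_idx = {}

    # Reserved symbols: '|' is the unknown word, ' ' is character padding,
    # and '{' / '}' mark the start and end of a word.
    word_to_idx['|'] = 0
    char_to_idx[' '] = 0
    char_to_idx['{'] = 1
    char_to_idx['}'] = 2

    # The end-of-sentence marker is stored both as a word and as a character.
    if eos:
        word_to_idx[eos] = len(word_to_idx)
        char_to_idx[eos] = len(char_to_idx)

    # Token lists keyed by split name ('train', 'valid', 'test').
    word_tokens = defaultdict(list)
    char_tokens = defaultdict(list)

    actual_max_word_length = 0
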
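    for fname in ['train', 'valid', 'test']:
        # Read as UTF-8 so non-ASCII characters decode the same way on every platform.
        with open(os.path.join(data_dir, "{}.txt".format(fname)), 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                # Remove characters that are reserved as special symbols.
                line = line.replace('{', '').replace('}', '').replace('|', '')
                line = line.replace('<unk>', '|')
                if eos:
                    line = line.replace(eos, '')

                for word in line.split():
                    # Truncate long words, leaving room for the '{' and '}' markers.
                    if len(word) > max_word_length - 2:
                        word = word[:max_word_length - 2]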
                    
                    if word not in word_to_idx:
                        word_to_idx[word] = len(word_to_idx)
                    word_tokens[fname].append(word_to_idx[word])

                    for c in word:
                        if c not in char_to_idx:
                            char_to_idx[c] = len(char_to_idx)

                    char_array = [char_to_idx[c] for c in '{' + word + '}']
                    char_tokens[fname].append(char_array)

                    actual_max_word_length = max(actual_max_word_length, len(char_array))

                if eos:
                    word_tokens[fname].append(word_to_idx[eos])

                    char_array = [char_to_idx[c] for c in '{' + eos + '}']
                    char_tokens[fname].append(char_array)
                    
    assert actual_max_word_length <= max_word_length
    
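    # Convert the token lists to arrays. Character rows are right-padded
    # with 0 (the index of ' ') up to the longest encoded word.
    word_tensors = {}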
    char_tensors = {}
    for fname in ('train', 'valid', 'test'):
        assert len(char_tokens[fname]) == len(word_tokens[fname])
        
        word_tensors[fname] = np.array(word_tokens[fname], dtype=np.int32)
        char_tensors[fname] = np.zeros([len(char_tokens[fname]), actual_max_word_length], dtype=np.int32)
        
        for i, char_array in enumerate(char_tokens[fname]):
            char_tensors[fname][i,:len(char_array)] = char_array
    
    return word_to_idx, char_to_idx, word_tensors, char_tensors, actual_max_word_length
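
To see what one row of the character tensor looks like, the following sketch encodes a single word the same way `build_dataset()` does: truncate, wrap in `{` and `}`, map to indices, and right-pad with zeros. The helper `encode_word` and the toy vocabulary are assumptions for illustration. Note that `build_dataset()` pads to `actual_max_word_length`, while this sketch pads to the `max_word_length` it is given.

```python
import numpy as np


def encode_word(word, char_to_idx, max_word_length):
    """Encode one word as a fixed-length row of character indices."""
    # Leave room for the '{' and '}' word-boundary markers.
    word = word[:max_word_length - 2]
    ids = [char_to_idx[c] for c in "{" + word + "}"]
    row = np.zeros(max_word_length, dtype=np.int32)  # 0 is the padding index
    row[:len(ids)] = ids
    return row


# Hypothetical toy vocabulary using the same reserved indices as build_dataset().
toy_char_to_idx = {" ": 0, "{": 1, "}": 2, "+": 3, "c": 4, "a": 5, "t": 6}
print(encode_word("cat", toy_char_to_idx, 8))  # [1 4 5 6 2 0 0 0]
```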

In [3]:
def train_test_dev_split(word_tensors, char_tensors):
    train_word = word_tensors['train']
    valid_word = word_tensors['valid']
    test_word = word_tensors['test']
    train_char = char_tensors['train']
    valid_char = char_tensors['valid']
    test_char = char_tensors['test']
    return train_word, valid_word, test_word, train_char, valid_char, test_char

In [4]:
word_to_idx, char_to_idx, word_tensors, char_tensors, actual_max_word_length = \
    build_dataset("es", 30)
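
A quick way to sanity-check the result is to invert the vocabularies and confirm that each row of the character tensor spells the word at the same position in the word tensor. The sketch below runs on a hand-made toy example rather than the real tensors; `decode_chars` is a hypothetical helper.

```python
def decode_chars(char_row, idx_to_char):
    """Turn a padded row of character indices back into the word it encodes."""
    chars = [idx_to_char[int(i)] for i in char_row if i != 0]  # skip padding
    return "".join(chars).strip("{}")                          # drop boundary markers


# Toy vocabularies with the same reserved indices as build_dataset().
toy_word_to_idx = {"|": 0, "+": 1, "cat": 2}
toy_char_to_idx = {" ": 0, "{": 1, "}": 2, "+": 3, "c": 4, "a": 5, "t": 6}
idx_to_word = {i: w for w, i in toy_word_to_idx.items()}
idx_to_char = {i: c for c, i in toy_char_to_idx.items()}

word_ids = [2, 1]                               # "cat", then end of sentence
char_rows = [[1, 4, 5, 6, 2], [1, 3, 2, 0, 0]]  # "{cat}", then "{+}" with padding

for word_id, row in zip(word_ids, char_rows):
    assert idx_to_word[word_id] == decode_chars(row, idx_to_char)
```

The same loop can be applied to `word_tensors['train']` and `char_tensors['train']`, with the caveat that words truncated to `max_word_length - 2` characters are stored in `word_to_idx` in their truncated form.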

In [11]:
train_word, valid_word, test_word, train_char, valid_char, test_char = \
    train_test_dev_split(word_tensors, char_tensors)

train_data = [train_word, valid_word, train_char, valid_char, word_to_idx, char_to_idx]
test_data = [test_word, test_char]

## Training!

In [7]:
model = CNN.CNN(learning_rate=5e-4)

DEBUG: 04180000


In [8]:
model.fit_to_corpus(train_data)

In [9]:
model.train(20, save_dir="save/07_cnn", log_dir="log/07_cnn", print_every=500)

--------------------------------------------------------------------------------
Created and Initialized fresh model. Size: 361502
--------------------------------------------------------------------------------
Epoch training time: 1.3180923461914062

Finished Epoch 1
train_loss = 0.60351326, train_accruacy = 0.66652174
valid_loss = 0.46909564, valid_accuracy = 0.81411764

Epoch training time: 0.8415021896362305

Finished Epoch 2
train_loss = 0.42120360, train_accruacy = 0.81043478
valid_loss = 0.41845854, valid_accuracy = 0.81764705

Epoch training time: 0.8786766529083252

Finished Epoch 3
train_loss = 0.33527417, train_accruacy = 0.86043478
valid_loss = 0.42911296, valid_accuracy = 0.81529411

Epoch training time: 0.8922944068908691

Finished Epoch 4
train_loss = 0.27726179, train_accruacy = 0.89362318
valid_loss = 0.39775787, valid_accuracy = 0.83529412

Epoch training time: 0.897108793258667

Finished Epoch 5
train_loss = 0.22240816, train_accruacy = 0.92492753
valid_loss = 0.397

In [10]:
model.test(test_data, load_dir="save/07_cnn")

INFO:tensorflow:Restoring parameters from save/07_cnn/epoch020_0.4196.model
--------------------------------------------------------------------------------
Restored model from checkpoint for testing. Size: 361502
--------------------------------------------------------------------------------
test loss = 0.35618254, test accuracy = 0.85055555
test samples: 001800, time elapsed: 0.0993, time per one batch: 0.0028
