# 8. Character-Aware Neural Language Models
We have seen that RNN models can be used as language models. Now it's time to see how a CNN over characters can be used for language modeling.

### References
- [Character-Aware Neural Language Models - Kim et al., 2016](https://arxiv.org/abs/1508.06615)
- [yoonkim/lstm-char-cnn](https://github.com/yoonkim/lstm-char-cnn)
- [mkroutikov/tf-lstm-char-cnn](https://github.com/mkroutikov/tf-lstm-char-cnn)

## Data Preprocessing
The preprocessing code is borrowed from [mkroutikov/tf-lstm-char-cnn](https://github.com/mkroutikov/tf-lstm-char-cnn). The preprocessed datasets are from [Jan Botha's website](https://bothameister.github.io/).

Preprocessing data inside the model class is getting harder and harder, so we do as much preprocessing as we can before calling the `fit_to_corpus()` method.

Select one of these datasets: `en`, `es`, `cs`, `de`, `fr`, `ru`, or `ptb`.

In [7]:
from models import LSTMCharCNN
import data.rnnlm_datasets.preprocess as preprocess

In [8]:
word_to_idx, char_to_idx, word_tensors, char_tensors, actual_max_word_length = \
    preprocess.build_dataset("ptb", 30, eos='+')


actual longest token length is: 21
size of word vocabulary: 10000
size of char vocabulary: 51
number of tokens in train: 929589
number of tokens in valid: 73760
number of tokens in test: 82430
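
To make the idea of a character tensor concrete, here is a minimal sketch of how one word could be turned into a fixed-length row of character indices. It is not the code in `preprocess.build_dataset`: the helper `encode_word`, the `{`/`}` boundary markers, the use of index 0 for padding, and the toy vocabulary are all assumptions made for illustration.

```python
def encode_word(word, char_to_idx, max_word_length):
    """Map a word to a fixed-length list of character indices (0 = padding)."""
    # Hypothetical word-boundary markers; the real preprocessing may differ.
    chars = ["{"] + list(word)[: max_word_length - 2] + ["}"]
    ids = [char_to_idx[c] for c in chars]
    # Right-pad with zeros so every word has the same length.
    return ids + [0] * (max_word_length - len(ids))


# Toy character vocabulary; index 0 is reserved for padding.
corpus = ["the", "cat", "sat"]
alphabet = sorted(set("".join(corpus)) | {"{", "}"})
toy_char_to_idx = {c: i + 1 for i, c in enumerate(alphabet)}

# One row per word, one column per character position.
toy_char_tensor = [encode_word(w, toy_char_to_idx, 7) for w in corpus]
for word, row in zip(corpus, toy_char_tensor):
    print(word, row)
```

With a layout like this, a corpus of N tokens becomes an N x `max_word_length` integer array that the character CNN can consume, alongside the usual length-N array of word indices used as prediction targets.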


In [10]:
train_word, valid_word, test_word, train_char, valid_char, test_char = \
    preprocess.train_test_dev_split(word_tensors, char_tensors)

train_data = [train_word, valid_word, train_char, valid_char, word_to_idx, char_to_idx, actual_max_word_length]
test_data = [test_word, test_char]

## Training!

In [11]:
model = LSTMCharCNN.LSTMCharCNN(
    hidden_size=650,
    num_unroll_steps=35,       # time steps per unrolled sequence
    batch_size=20,
    grad_clip=5.,              # gradient clipping threshold
    dropout_keep_prob=0.5,
    learning_rate=1.0,
    lstm_num_layer=2,
    highway_num_layer=2,
    char_embedding_size=15,
    filter_windows=[1, 2, 3, 4, 5, 6, 7],            # CNN filter widths, in characters
    num_filters=[50, 100, 150, 200, 200, 200, 200],  # number of filters per width
)
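
The `filter_windows` and `num_filters` settings are easier to read with the data flow in mind. The following NumPy sketch follows the description in the paper: embed the characters, convolve with filters of several widths, take the max over positions, concatenate, and pass the result through a highway layer. It is not the implementation inside `LSTMCharCNN`; the weights are random, convolution biases are omitted, and the function names `char_cnn` and `highway` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

char_vocab_size = 51       # from the preprocessing output
max_word_length = 21       # from the preprocessing output
char_embedding_size = 15
filter_windows = [1, 2, 3, 4, 5, 6, 7]
num_filters = [50, 100, 150, 200, 200, 200, 200]

# Random parameters stand in for trained weights (assumption for illustration).
embedding = rng.normal(size=(char_vocab_size, char_embedding_size))
kernels = [0.1 * rng.normal(size=(w, char_embedding_size, n))
           for w, n in zip(filter_windows, num_filters)]


def char_cnn(char_ids):
    """Return one feature vector for a word, given its character indices."""
    x = embedding[char_ids]                       # (word_length, emb)
    features = []
    for kernel in kernels:
        width = kernel.shape[0]
        # Every window of `width` consecutive characters: (positions, width, emb)
        windows = np.stack([x[i:i + width] for i in range(len(x) - width + 1)])
        conv = np.tanh(np.einsum("pwe,wen->pn", windows, kernel))
        features.append(conv.max(axis=0))         # max-over-time pooling
    return np.concatenate(features)               # (sum(num_filters),)


def highway(y, w_h, b_h, w_t, b_t):
    """One highway layer: z = t * relu(W_H y + b_H) + (1 - t) * y."""
    t = 1.0 / (1.0 + np.exp(-(y @ w_t + b_t)))    # transform gate
    h = np.maximum(0.0, y @ w_h + b_h)
    return t * h + (1.0 - t) * y


char_ids = rng.integers(0, char_vocab_size, size=max_word_length)
word_features = char_cnn(char_ids)

dim = word_features.shape[0]
w_h = 0.01 * rng.normal(size=(dim, dim))
w_t = 0.01 * rng.normal(size=(dim, dim))
# A negative gate bias starts the layer close to the identity (an assumption here).
word_vector = highway(word_features, w_h, np.zeros(dim), w_t, np.full(dim, -2.0))
print(word_features.shape, word_vector.shape)
```

With the settings used here, each word is represented by 50 + 100 + 150 + 4 x 200 = 1100 features, and that vector, rather than a word embedding lookup, is what the LSTM receives at each time step.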

In [12]:
model.fit_to_corpus(train_data)

Instructions for updating:
Use the retry module or similar alternatives.


In [None]:
model.train(50, save_dir="save/08_lstm_char_cnn/ptb", log_dir="log/08_lstm_char_cnn/ptb",
            load_dir="save/08_lstm_char_cnn/ptb", print_every=500)

In [13]:
model.test(test_data, load_dir="save/08_lstm_char_cnn/ptb")

INFO:tensorflow:Restoring parameters from save/08_lstm_char_cnn/ptb/epoch030_4.4012.model
--------------------------------------------------------------------------------
Restored model from checkpoint for testing. Size: 19365765
--------------------------------------------------------------------------------
test loss = 4.36729684, perplexity = 78.8302528
test samples: 002340, time elapsed: 65.6426, time per one batch: 0.5610
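
The numbers in this log can be cross-checked with a few lines of arithmetic. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, so that perplexity is its exponential, and it assumes that "test samples" counts batches times `batch_size`. Both are inferences from the printed values, not statements about the model code.

```python
import math

# Values copied from the test log and the preprocessing output.
test_loss = 4.36729684
test_tokens = 82430
elapsed_seconds = 65.6426
batch_size, num_unroll_steps = 20, 35

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = math.exp(test_loss)
print(f"perplexity = {perplexity:.2f}")           # about 78.83

# Each batch consumes batch_size * num_unroll_steps tokens; leftovers are dropped.
num_batches = test_tokens // (batch_size * num_unroll_steps)
print(num_batches, num_batches * batch_size)      # 117 2340
print(f"{elapsed_seconds / num_batches:.4f}")     # 0.5610
```

Under these assumptions the figures agree with the log: 117 batches of 20 sequences give the 2340 test samples, and 65.6426 seconds over 117 batches gives 0.5610 seconds per batch.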


In [15]:
model.sample(30, load_dir="save/08_lstm_char_cnn/ptb")

INFO:tensorflow:Restoring parameters from save/08_lstm_char_cnn/ptb/epoch030_4.4012.model


"the company 's new president + in the past two years mr. smith says he has n't seen the suit and the two sides have been working on a new york job"