In [1]:
%matplotlib inline

import utils_ted
from utils_ted import *

Using TensorFlow backend.


In [2]:
from keras.layers import TimeDistributed, Activation
from numpy.random import choice

## Setup

We haven't yet covered in detail how this works, so this notebook is provided for self-study for those who are interested. We'll look at it closely next week.

In [24]:
path=get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path, encoding='utf8').read().lower()

In [25]:
print('corpus length:', len(text))

corpus length: 600893


In [26]:
!tail -n 25 {path}

are thinkers who believe in the saints.


144

It stands to reason that this sketch of the saint, made upon the model
of the whole species, can be confronted with many opposing sketches that
would create a more agreeable impression. There are certain exceptions
among the species who distinguish themselves either by especial
gentleness or especial humanity, and perhaps by the strength of their
own personality. Others are in the highest degree fascinating because
certain of their delusions shed a particular glow over their whole
being, as is the case with the founder of christianity who took himself
for the only begotten son of God and hence felt himself sinless; so that
through his imagination--that should not be too harshly judged since the
whole of antiquity swarmed with sons of god--he attained the same goal,
the sense of complete sinlessness, complete irresponsibility, that can
now be attained by every individual through science.--In the same manner
I have viewed the saints of India

In [27]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1

In [32]:
print("total chars : %s" % vocab_size)

total chars : 58


In [34]:
chars.insert(0, "\0")  # reserve index 0 for a null/padding character

In [38]:
"".join(chars[1:-4])

'\n !"\'(),-.0123456789:;=?[]_abcdefghijklmnopqrstuvwxyz'

In [56]:
char_indices = {c:i for i, c in enumerate(chars)}
indices_char = {i:c for i, c in enumerate(chars)}

In [57]:
text_idxs = [char_indices[c] for c in text]

In [66]:
print(text_idxs[:10])

[43, 45, 32, 33, 28, 30, 32, 1, 1, 1]


In [60]:
''.join(indices_char[idx] for idx in text_idxs[:70])

'preface\n\n\nsupposing that truth is a woman--what then? is there not gro'
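The encode/decode round trip above can be checked on a tiny example. This is a minimal sketch on a toy string; the names `toy_text`, `toy_chars`, `toy_char_indices` and `toy_indices_char` are illustrative stand-ins for the notebook's variables, and the `"\0"` placeholder at index 0 is an assumption about what the reserved slot is meant to hold.

```python
# Toy corpus standing in for the Nietzsche text (illustration only).
toy_text = "hello world"

# Sorted unique characters, with a reserved placeholder at index 0.
toy_chars = sorted(set(toy_text))
toy_chars.insert(0, "\0")

# Forward and reverse lookup tables, built the same way as in the notebook.
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text to integer indices, then decode it back.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(encoded)
print(decoded == toy_text)
```

Because the placeholder never occurs in the text, index 0 never appears in `encoded`.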

## Preprocess and create model

In [80]:
maxlen = 40
sentences_list = []
next_chars_list = []
for i in range(0, (len(text_idxs) - maxlen + 1)):
    sentences_list.append(text_idxs[i: i+maxlen])
    next_chars_list.append(text_idxs[i+1: i+maxlen+1])

In [88]:
print("nb sequences:", len(sentences_list))

nb sequences: 600854


In [84]:
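# The final window's target slice runs past the end of the text and is one
# element short, so trim the tail to keep both arrays rectangular.
sentences = np.array(sentences_list[:-2])
next_chars = np.array(next_chars_list[:-2])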

In [89]:
print(sentences.shape, next_chars.shape)

(600852, 40) (600852, 40)
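Each target row is the corresponding input row shifted forward by one character, so the model learns to predict the next character at every timestep. The sketch below shows this on a toy index list. `make_windows` is a hypothetical helper, not part of the notebook, and it stops the loop at `len(idxs) - maxlen` so that every target window is full length, whereas the notebook loops one step further and trims the tail with `[:-2]`.

```python
def make_windows(idxs, maxlen):
    """Return (inputs, targets) where each target is its input shifted by one."""
    inputs, targets = [], []
    # Stop at len(idxs) - maxlen so every target window has exactly maxlen items.
    for i in range(len(idxs) - maxlen):
        inputs.append(idxs[i: i + maxlen])
        targets.append(idxs[i + 1: i + maxlen + 1])
    return inputs, targets


toy_idxs = list(range(8))  # stand-in for text_idxs
xs, ys = make_windows(toy_idxs, maxlen=3)

print(xs[0], ys[0])
print(len(xs))
```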


In [90]:
n_fac = 24

In [92]:
from keras.layers import LSTM

[Keras 2.0 release notes](https://github.com/fchollet/keras/wiki/Keras-2.0-release-notes)

```
Recurrent layers
    output_dim -> units
    init -> kernel_initializer
    inner_init -> recurrent_initializer
    added argument bias_initializer
    W_regularizer -> kernel_regularizer
    b_regularizer -> bias_regularizer
    added arguments kernel_constraint, recurrent_constraint, bias_constraint
    dropout_W -> dropout
    dropout_U -> recurrent_dropout
    consume_less -> implementation. String values have been replaced with integers: implementation 0 (default), 1 or 2.
    LSTM only: the argument forget_bias_init has been removed. Instead there is a boolean argument unit_forget_bias, defaulting to True.
```
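The pure renames in the list above can be expressed as a lookup table. This is a sketch only: `KERAS1_TO_KERAS2` and `upgrade_recurrent_kwargs` are hypothetical names, not part of Keras, and the helper deliberately leaves `consume_less` and `forget_bias_init` alone because those changes involve new values rather than just new names.

```python
# Hypothetical helper (not part of Keras): rename Keras 1 recurrent-layer
# keyword arguments to the Keras 2 names listed in the release notes.
KERAS1_TO_KERAS2 = {
    "output_dim": "units",
    "init": "kernel_initializer",
    "inner_init": "recurrent_initializer",
    "W_regularizer": "kernel_regularizer",
    "b_regularizer": "bias_regularizer",
    "dropout_W": "dropout",
    "dropout_U": "recurrent_dropout",
}


def upgrade_recurrent_kwargs(kwargs):
    """Return a copy of kwargs with Keras 1 names replaced by Keras 2 names."""
    return {KERAS1_TO_KERAS2.get(name, name): value
            for name, value in kwargs.items()}


old_kwargs = {"output_dim": 512, "dropout_W": 0.2, "dropout_U": 0.2,
              "return_sequences": True}
new_kwargs = upgrade_recurrent_kwargs(old_kwargs)
print(new_kwargs)
```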

In [93]:
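model = Sequential([
    # Map each character index to a dense n_fac-dimensional vector.
    Embedding(vocab_size, n_fac, input_length=maxlen),
    # The input size is inferred from the Embedding output, so input_dim is not needed.
    LSTM(512, return_sequences=True,
         dropout=0.2, recurrent_dropout=0.2, implementation=1),
    Dropout(0.2),
    LSTM(512, return_sequences=True,
         dropout=0.2, recurrent_dropout=0.2, implementation=1),
    Dropout(0.2),
    # Predict a distribution over the vocabulary at every timestep.
    TimeDistributed(Dense(vocab_size)),
    Activation('softmax'),
])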



In [95]:
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy')

## Train

In [None]:
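def print_example():
    seed_string = "ethics is a basic foundation of all that"
    for i in range(10):
        # Encode the last maxlen characters as a batch of one sequence.
        x = np.array([char_indices[c] for c in seed_string[-maxlen:]])[np.newaxis, :]
        # Keep only the distribution predicted for the final timestep.
        preds = np.squeeze(model.predict(x, verbose=0))[-1]
        # Renormalise so the probabilities sum to 1 before sampling.
        preds = preds / np.sum(preds)
        next_char = choice(chars, p=preds)
        seed_string = seed_string + next_char
    print(seed_string)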
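The sampling step in `print_example` can be tried in isolation, without a trained model. The sketch below uses a made-up three-character vocabulary and a fixed probability vector; `sample_index`, `toy_vocab` and `toy_probs` are illustrative names. The renormalisation mirrors the notebook's `preds/np.sum(preds)` line: float32 softmax outputs may not sum to 1 closely enough for NumPy's `choice`, which raises an error in that case.

```python
import numpy as np


def sample_index(probs, rng):
    """Draw one index from a probability vector, renormalising first."""
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))


rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
toy_vocab = ["a", "b", "c"]
toy_probs = np.array([0.1, 0.7, 0.2], dtype=np.float32)

# Draw many samples and count how often each character appears.
draws = [toy_vocab[sample_index(toy_probs, rng)] for _ in range(1000)]
print({c: draws.count(c) for c in toy_vocab})
```

Sampling from the distribution, rather than always taking the most likely character, is what lets the generated text vary from run to run.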

In [None]:
model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, epochs=1)

Epoch 1/1
126208/600852 [=====>........................] - ETA: 17810s - loss: 1.9281
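The `np.expand_dims(next_chars, -1)` call in the `fit` cell adds a trailing axis of length 1 to the integer targets. As far as I can tell, `sparse_categorical_crossentropy` in this Keras version expects targets with the same number of dimensions as the model output, which here is `(batch, timesteps, vocab_size)`. A small shape check on made-up data:

```python
import numpy as np

# Two toy target sequences of three timesteps each: shape (batch, timesteps).
toy_targets = np.array([[1, 2, 3],
                        [2, 3, 4]])

# Add a trailing singleton axis: shape (batch, timesteps, 1).
toy_targets_3d = np.expand_dims(toy_targets, -1)

print(toy_targets.shape, toy_targets_3d.shape)
```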

In [None]:
print_example()