In [1]:
%matplotlib inline
import utils; reload(utils)
from utils import *

Using Theano backend.


## Setup

We haven't yet covered the details of how this works, so this notebook is provided for self-study for those who are interested. We'll look at it closely next week.

In [2]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read()
print('corpus length:', len(text))

('corpus length:', 600901)


In [3]:
print(text[:1000])

PREFACE


SUPPOSING that Truth is a woman--what then? Is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists, have failed to understand women--that the terrible
seriousness and clumsy importunity with which they have usually paid
their addresses to Truth, have been unskilled and unseemly methods for
winning a woman? Certainly she has never allowed herself to be won; and
at present every kind of dogma stands with sad and discouraged mien--IF,
indeed, it stands at all! For there are scoffers who maintain that it
has fallen, that all dogma lies on the ground--nay more, that it is at
its last gasp. But to speak seriously, there are good grounds for hoping
that all dogmatizing in philosophy, whatever solemn, whatever conclusive
and decided airs it has assumed, may have been only a noble puerilism
and tyronism; and probably the time is at hand when it will be once
and again understood WHAT has actually sufficed for the basis of such
imposing and abso

In [4]:
chars = sorted(set(text))
# +1 reserves a slot for the "\0" character inserted in the next cell
vocab_size = len(chars)+1
print('total chars:', vocab_size)

('total chars:', 86)


In [5]:
chars.insert(0, "\0")
''.join(chars[1:-6])

'\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz'

In [6]:
# lookup tables: character -> index and index -> character
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [7]:
# encode the whole corpus as a list of character indices
idx = [char_indices[c] for c in text]

In [8]:
''.join(indices_char[i] for i in idx[:70])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'

## Preprocessing

In [119]:
# split the corpus into non-overlapping sequences of maxlen characters;
# each target is its input shifted forward by one character
maxlen = 50
sentences = []
next_chars = []
for i in range(0, len(idx) // maxlen):
    sentences.append(idx[i*maxlen: i*maxlen+maxlen])
    next_chars.append(idx[i*maxlen+1: i*maxlen+maxlen+1])
print('nb sequences:', len(sentences))

nb sequences: 12018


In [9]:
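# split the corpus into overlapping sequences (a new one every maxlen//2 characters);
# each target is its input shifted forward by one character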
maxlen = 50
sentences = []
next_chars = []
for i in range(0, len(idx)-maxlen, maxlen//2):
    sentences.append(idx[i: i+maxlen])
    next_chars.append(idx[i+1: i+maxlen+1])
print('nb sequences:', len(sentences))

('nb sequences:', 24035)
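
The overlapping split above can be checked on a toy input. The following is a minimal sketch in plain Python 3, not the notebook's own code: `make_windows` and `toy_ids` are hypothetical names introduced only for illustration, and the loop bounds are copied from the cell above.

```python
def make_windows(ids, maxlen, stride):
    """Return (inputs, targets); each target is its input shifted by one position."""
    inputs, targets = [], []
    for start in range(0, len(ids) - maxlen, stride):
        inputs.append(ids[start:start + maxlen])
        targets.append(ids[start + 1:start + maxlen + 1])
    return inputs, targets

# toy "corpus" of 10 indices, windows of length 4, a new window every 2 steps
toy_ids = list(range(10))
inputs, targets = make_windows(toy_ids, maxlen=4, stride=2)
for x, y in zip(inputs, targets):
    print(x, '->', y)

# same bounds as the notebook cell: corpus length 600901, maxlen 50, stride 25
n_windows = len(range(0, 600901 - 50, 25))
print(n_windows)  # 24035
```

With a stride of `maxlen//2`, each character (away from the ends of the corpus) appears in two windows, which is why this split yields roughly twice as many sequences as the non-overlapping one.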


In [10]:
print(''.join([indices_char[i] for i in sentences[-1]]))
print(''.join([indices_char[i] for i in next_chars[-1]]))

e
christian world as the indications of sinfulness

christian world as the indications of sinfulness.


In [11]:
# stack the lists of sequences into 2-D integer arrays
sentences = np.array(sentences)
next_chars = np.array(next_chars)

In [12]:
sentences.shape, next_chars.shape

((24035, 50), (24035, 50))

## Model training

In [13]:
def sample(probs, temperature):
    """samples an index from a vector of probabilities
    (this is not the most efficient way but is more robust)"""
    # probability distribution with temperature
    prob_temp = np.log(probs) / temperature
    exp_prob = np.exp(prob_temp)
    dist = exp_prob / np.sum(exp_prob)
    choices = range(len(probs))
    return np.random.choice(choices, p=dist)


def generate(temperature=0.75, generated_len=400, seed=0):
    # randomly select a chunk of text
    np.random.seed(seed)
    start_idx = np.random.randint(0, len(idx)-maxlen)
    seed_sent = idx[start_idx: start_idx+maxlen]
    
    sentence = np.array(seed_sent)
    generated = ''.join([indices_char[i] for i in sentence])
    
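    while len(generated) < generated_len:
        # the model outputs one distribution per timestep; keep the last one
        pred = model.predict(sentence[None, :])[0][-1]
        next_id = sample(pred, temperature)
        generated += indices_char[next_id]
        # slide the window: drop the oldest character, append the new one
        sentence = np.array(list(sentence[1:]) + [next_id])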
    
    return generated
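
To see what the `temperature` argument does, the rescaling step of `sample` can be run on a small distribution. This is a sketch in plain Python 3 using only the standard library. `apply_temperature` is a hypothetical helper, not part of the notebook, and it assumes every probability is strictly positive so that the logarithm is defined.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability vector as exp(log(p) / T), then renormalise to sum to 1."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.2]
for t in [0.2, 0.5, 1.0, 1.2]:
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Temperatures below 1 concentrate the mass on the most likely entry, a temperature of 1 leaves the distribution unchanged, and temperatures above 1 flatten it. This is consistent with the low-diversity samples in the training output repeating "the" while higher-diversity samples are more varied.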

In [14]:
n_fac = 16

In [15]:
model=Sequential([
        Embedding(vocab_size, n_fac, input_length=maxlen),
        LSTM(256, return_sequences=True, dropout_U=0.2, dropout_W=0.2),
        Dropout(0.2),
        TimeDistributed(Dense(vocab_size)),
        Activation('softmax')
    ])

In [16]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [17]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 50, 16)        1376        embedding_input_1[0][0]          
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 50, 256)       279552      embedding_1[0][0]                
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 50, 256)       0           lstm_1[0][0]                     
____________________________________________________________________________________________________
timedistributed_1 (TimeDistribut (None, 50, 86)        22102       dropout_1[0][0]                  
___________________________________________________________________________________________
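
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard counting formulas for embedding, LSTM, and dense layers. The function names are hypothetical, and the layer sizes (86 characters, 16 embedding factors, 256 LSTM units) are taken from the cells above.

```python
def embedding_params(vocab_size, n_fac):
    # one n_fac-dimensional vector per vocabulary entry
    return vocab_size * n_fac

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # weight matrix plus bias vector
    return input_dim * output_dim + output_dim

print(embedding_params(86, 16))  # 1376
print(lstm_params(16, 256))      # 279552
print(dense_params(256, 86))     # 22102
```

`TimeDistributed` applies the same `Dense` layer at every timestep, so its parameter count does not depend on `maxlen`.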

In [21]:
for iteration in range(3):
    print('-' * 50)
    print('Iteration', iteration+1)
    model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, nb_epoch=1)
    
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('\n----- diversity:', diversity)
        print(generate(diversity))
    
    print('\n')

--------------------------------------------------
('Iteration', 1)
Epoch 1/1


('----- diversity:', 0.2)
e puerilism
and tyronism; and probably the time is the the the the the the the the the the the the an the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the core the the the the the the the the the the the the the the the the the the sore an the the the the the the the the the the the the the the the the the the the the s


('----- diversity:', 0.5)
ssion for the complex state of delight of the pers whe on thoc the sen tis rarte son in weo pes the and the tiresans on an sound in the ore ferint an the merise af the the an the sust en sons the onos the herel en tha the the the sinsans in of an ins inpas the bond an the d corens and tha the corethore the mocans and cort on anle an the theamfent an the car the tole the conn the heen the sand fhis


('----- diversity:', 1.0)
without any
falsification taking pla

### Without embedding

In [22]:
# one-hot encoding of inputs and targets via broadcasting:
# (n, maxlen, 1) == (vocab_size,) -> (n, maxlen, vocab_size)
trn_inputs = (np.arange(vocab_size) == sentences[..., None]).astype(np.float32)
trn_targets = (np.arange(vocab_size) == next_chars[..., None]).astype(np.float32)

In [23]:
trn_inputs.shape, trn_targets.shape

((24035, 50, 86), (24035, 50, 86))
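
The one-hot encoding here relies on NumPy broadcasting: comparing `np.arange(vocab_size)` against a column of indices yields one row per index with a single `True` in the matching position. Here is a minimal sketch on a toy sequence. The vocabulary size of 5 and the sequence `seq` are made up for illustration.

```python
import numpy as np

toy_vocab_size = 5
seq = np.array([3, 0, 4])

# (toy_vocab_size,) compared with (len(seq), 1) broadcasts to (len(seq), toy_vocab_size)
one_hot = (np.arange(toy_vocab_size) == seq.reshape(-1, 1)).astype(np.float32)
print(one_hot)

# taking the argmax of each row recovers the original indices
recovered = one_hot.argmax(axis=1)
print(recovered)
```

This representation is what lets the model without an embedding layer use `categorical_crossentropy`: both inputs and targets are vectors of length `vocab_size` rather than integer indices.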

In [24]:
def sample(probs, temperature):
    """samples an index from a vector of probabilities
    (this is not the most efficient way but is more robust)"""
    # probability distribution with temperature
    prob_temp = np.log(probs) / temperature
    exp_prob = np.exp(prob_temp)
    dist = exp_prob / np.sum(exp_prob)
    choices = range(len(probs))
    return np.random.choice(choices, p=dist)


def generate(temperature=0.75, generated_len=400, seed=0):
    # randomly select a chunk of text
    np.random.seed(seed)
    start_idx = np.random.randint(0, len(idx)-maxlen)
    seed_sent = idx[start_idx: start_idx+maxlen]

    sentence = np.array(seed_sent)
    generated = ''.join([indices_char[i] for i in sentence])
    
    while len(generated) < generated_len:
        # one-hot encode the current window, since this model has no embedding layer
        x = (np.arange(vocab_size)==np.array(sentence).reshape(-1,1)).astype(np.float32)
        pred = model.predict(x[None, :])[0][-1]
        next_id = sample(pred, temperature)
        generated += indices_char[next_id]
        sentence = np.array(list(sentence[1:]) + [next_id])
    
    return generated

In [25]:
model=Sequential([
        LSTM(256, return_sequences=True, dropout_U=0.2, dropout_W=0.2, input_shape=(maxlen, vocab_size)),
        Dropout(0.5),
        TimeDistributed(Dense(vocab_size)),
        Activation('softmax')
    ])

In [27]:
model.compile(loss='categorical_crossentropy', optimizer=Adam())

In [26]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
lstm_2 (LSTM)                    (None, 50, 256)       351232      lstm_input_1[0][0]               
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 50, 256)       0           lstm_2[0][0]                     
____________________________________________________________________________________________________
timedistributed_2 (TimeDistribut (None, 50, 86)        22102       dropout_2[0][0]                  
____________________________________________________________________________________________________
activation_2 (Activation)        (None, 50, 86)        0           timedistributed_2[0][0]          
Total params: 373,334
Trainable params: 373,334
Non-trainable params: 0
___________________

In [28]:
for iteration in range(3):
    print('-' * 50)
    print('Iteration', iteration+1)
    model.fit(trn_inputs, trn_targets, batch_size=64, nb_epoch=1)
    
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('\n----- diversity:', diversity)
        print(generate(diversity))
    
    print('\n')

--------------------------------------------------
('Iteration', 1)
Epoch 1/1


('----- diversity:', 0.2)
ession of
genuine French curiosity and inventive the the the the and the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the to the the sore the the the the the the the the the the here the the the the the the ind and and and and the the the the the the the the the the the the the the and and ind and and and the the the the th


('----- diversity:', 0.5)
ession of
genuine French curiosity and inventive the porounithig "oretins of re the there on the meret and man the the whore han in the ans of he ar ther ond unne the the wini in the the the rose tho the the and tor in the there and the the fore ile the and to that eos the wimthe te the tich the cinters ons an the thes at toreresian whe of and touthe se an the the the and some the the in the serti


('----- diversity:', 1.0)
ession of
genuine French curiosity a