# Text Generation with Neural Networks

First, import the packages needed for preprocessing, model building, and text generation. We follow the steps described in the theoretical part of this summer school:

0. Define Research Goal (already done)
1. Retrieve Data
2. Prepare Data
3. Explore Data
4. Model Data
5. Present and automate Model

In [42]:
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils import get_file
from keras import backend as K
import numpy as np
import random
import sys
import io

# 1. Retrieve Data

Load your data! You can use text from anywhere, such as plain text, HTML, source code, etc.
You can either download it automatically with the Keras `get_file` function or download it manually and import it into this notebook.

## Example Data Set
[trump.txt](https://raw.githubusercontent.com/harshilkamdar/trump-tweets/master/trump.txt)

In [43]:
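# Option A: download the example data set; `path` is the location of the cached local copy
path = get_file('trump.txt', origin='https://raw.githubusercontent.com/harshilkamdar/trump-tweets/master/trump.txt')
# Option B: read a manually downloaded file. The outputs recorded in this notebook
# come from this local Shakespeare file; pass `path` instead to use the download.
text = io.open('resources/shakespeare.txt', encoding='utf-8').read().lower()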

print('corpus length:', len(text))

corpus length: 106191


# 2. Prepare Data

As described in the theoretical part of this workshop, we need to convert our text into a numerical representation (here: one-hot encoded characters) that can be processed by the Neural Network we define later.


## 2.1. Create Classes 
The goal of this step is to have a variable that contains the distinct characters of the text. Characters can be letters, digits, punctuation marks, new lines, spaces, etc.

### Example:
Let's assume we have the following text as input: "hallo. "

After the following step, we want to have all distinct characters, i.e.:

``[ "h", "a", "l", "o", ".", " " ] ``


In [44]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))

total chars: 40
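
The same step can be tried on the toy text from the example. This is a minimal sketch; the names `toy_text` and `toy_chars` are made up for illustration and are not used elsewhere in the notebook. Note that `sorted` orders the characters by code point, so the result is ordered differently from the list in the example, but it contains the same six characters.

```python
# Toy text from the example above (not the real corpus)
toy_text = "hallo. "

# set() drops duplicate characters, sorted() gives a reproducible order
toy_chars = sorted(set(toy_text))
print(toy_chars)  # [' ', '.', 'a', 'h', 'l', 'o']
```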


## 2.2. Create Training Set

In the following section we create our training set based on our text. The idea is to map a sequence of characters to a class. In this case, a class is one of the distinct characters defined in the previous task. This means that a sequence of characters predicts the next character, which is how the model later learns which characters follow specific sequences. The sequence length is a free parameter, so try out different sequence lengths.

### Example:
Our text is still: "hallo. "
Sequence length: 2 (i.e. 2 characters predict the next character)

The result (training set) should be defined as follows:

``
Sequences --> Class
 "ha"    --> "l"
 "al"    --> "l"
 "ll"    --> "o"
 "lo"    --> "."
 "o."    --> " "
``

You can read the previous example like this: sequence "ha" predicts the next character "l", sequence "al" predicts the next character "l", and so on.
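
The toy example can be reproduced with the same sliding-window loop that the next cell applies to the real text. This is only a sketch: the `toy_` names are hypothetical, and the step of 1 is an assumption chosen so that every pair from the example appears.

```python
toy_text = "hallo. "
toy_seqlen = 2   # 2 characters predict the next character
toy_step = 1     # shift the window by one character at a time

toy_sequences = []  # input sequences
toy_classes = []    # character that follows each sequence

for i in range(0, len(toy_text) - toy_seqlen, toy_step):
    toy_sequences.append(toy_text[i: i + toy_seqlen])
    toy_classes.append(toy_text[i + toy_seqlen])

for seq, cls in zip(toy_sequences, toy_classes):
    print(repr(seq), '-->', repr(cls))
```

A larger step produces fewer, less overlapping training examples, which is the trade-off controlled by `step` in the cell below.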

In [45]:
seqlen = 40 # Sequence length: number of characters used to predict the next one
step = 5    # Number of characters by which the window is shifted through the text
sequences = []  # List of sequences
char_class = [] # Corresponding class of each sequence

for i in range(0, len(text) - seqlen, step):
    sequences.append(text[i: i + seqlen])
    char_class.append(text[i + seqlen])
print('#no sequences:', len(sequences))

#no sequences: 21231


## 2.3. Check your Data

Now that we processed our data, it's time to understand what we have built so far.

In [46]:
for sequence, next_char in zip(sequences[:10], char_class[:10]):
    print(sequence, ":", next_char)


the tempest
shakespeare homepage | the  : t
tempest
shakespeare homepage | the tempe : s
st
shakespeare homepage | the tempest |  : e
akespeare homepage | the tempest | entir : e
eare homepage | the tempest | entire pla : y
homepage | the tempest | entire play
act :  
age | the tempest | entire play
act i
sc : e
 the tempest | entire play
act i
scene i : .
tempest | entire play
act i
scene i. on  : a
st | entire play
act i
scene i. on a shi : p


In [47]:
# Print the first 10 distinct characters
chars[:10]

['\n', ' ', '!', '&', "'", ',', '-', '.', ':', ';']

In [48]:
# Try to print the 151st to 160th character: only 40 exist, so the slice is empty :-)
chars[150:160]

[]

## 2.4. Vectorization of Training Sequences

The following section describes the desired form of our final training set.

text: "hallo. ".
As defined above, we have a number of sequences, each mapping to the character that appears next in the text (e.g. "ha" maps to "l"). First, we transform each sequence into a one-hot encoded matrix with one row per character of the sequence and one column per distinct character.

**Example:** 
sequence "ha" maps to the following matrix

|     |  h  |  a  |  l  |  o  |  .  | ' ' |
|-----|-----|-----|-----|-----|-----|-----|
|  h  |  1  |  0  |  0  |  0  |  0  |  0  |
|  a  |  0  |  1  |  0  |  0  |  0  |  0  |

next sequence "al" maps to the following matrix

|     |  h  |  a  |  l  |  o  |  .  | ' ' |
|-----|-----|-----|-----|-----|-----|-----|
|  a  |  0  |  1  |  0  |  0  |  0  |  0  |
|  l  |  0  |  0  |  1  |  0  |  0  |  0  |

... And so on

## 2.5. Vectorization of Target Classes

We build our target classes in the same way as the training set: each target (a single character) becomes a one-hot encoded vector.

**Example:** for target char "l" the vector looks like this

|     |  h  |  a  |  l  |  o  |  .  | ' ' |
|-----|-----|-----|-----|-----|-----|-----|
|  l  |  0  |  0  |  1  |  0  |  0  |  0  |
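
The tables above can be reproduced in a few lines of NumPy. This is a sketch on the toy text only; the helper names `one_hot_sequence` and `one_hot_char` are hypothetical. To match the column order of the tables, the characters are kept in order of first appearance here, whereas the notebook itself uses `sorted`, which yields a different column order but the same kind of encoding.

```python
import numpy as np

toy_text = "hallo. "
# Order of first appearance, to match the columns of the tables above
toy_chars = list(dict.fromkeys(toy_text))   # ['h', 'a', 'l', 'o', '.', ' ']
toy_index = {c: i for i, c in enumerate(toy_chars)}

def one_hot_sequence(seq):
    """One row per character in the sequence, one column per distinct character."""
    matrix = np.zeros((len(seq), len(toy_chars)), dtype=bool)
    for t, char in enumerate(seq):
        matrix[t, toy_index[char]] = True
    return matrix

def one_hot_char(char):
    """Single one-hot vector for a target character."""
    vector = np.zeros(len(toy_chars), dtype=bool)
    vector[toy_index[char]] = True
    return vector

print(one_hot_sequence("ha").astype(int))  # rows for "h" and "a"
print(one_hot_char("l").astype(int))       # target vector for "l"
```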

In [49]:
# Indexed characters as dictionary
char_indices = dict((c, i) for i, c in enumerate(chars))

# Both matrices are initialized with zeros (False); `bool` replaces the `np.bool` alias removed from recent NumPy
training_set = np.zeros((len(sequences), seqlen, len(chars)), dtype=bool)
target_char = np.zeros((len(sequences), len(chars)), dtype=bool)
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        training_set[i, t, char_indices[char]] = 1
    target_char[i, char_indices[char_class[i]]] = 1

# 3. Explore Data

In [50]:
# Let's check the shape of the training_set

training_set.shape

(21231, 40, 40)

Output: (x, y, z)

    x = number of training sequences
    y = sequence length (window size used to predict the next character)
    z = number of distinct characters in the text (length of each one-hot vector)

In [51]:
# Let's check the shape of the target_char (act as our target classes)

target_char.shape

(21231, 40)

Output: (x, y)

    x = number of training sequences
    y = number of distinct characters (one-hot vector marking the character that follows each sequence)
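
Beyond the shapes, it can help to check that the encoding is lossless. The following sketch rebuilds both arrays for the toy text "hallo. " (all `toy_` names, `X` and `y` are assumptions for illustration) and verifies that every position holds exactly one `True` and that the sequences can be decoded back with `argmax`.

```python
import numpy as np

toy_text = "hallo. "
toy_seqlen = 2
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

n = len(toy_text) - toy_seqlen
toy_sequences = [toy_text[i: i + toy_seqlen] for i in range(n)]
toy_classes = [toy_text[i + toy_seqlen] for i in range(n)]

# Same vectorization scheme as in the notebook, applied to the toy data
X = np.zeros((len(toy_sequences), toy_seqlen, len(toy_chars)), dtype=bool)
y = np.zeros((len(toy_sequences), len(toy_chars)), dtype=bool)
for i, seq in enumerate(toy_sequences):
    for t, char in enumerate(seq):
        X[i, t, toy_char_indices[char]] = True
    y[i, toy_char_indices[toy_classes[i]]] = True

print(X.shape, y.shape)  # (5, 2, 6) (5, 6)

# Exactly one True per character position and per target vector
assert (X.sum(axis=-1) == 1).all()
assert (y.sum(axis=-1) == 1).all()

# argmax along the last axis recovers the character indices
decoded = [''.join(toy_indices_char[int(k)] for k in row.argmax(axis=-1)) for row in X]
assert decoded == toy_sequences
```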


# 4. Model data

Let's get down to business! Create your model.

Try different model configurations (see the [Keras docs](https://keras.io/models/about-keras-models/#about-keras-models)).

In [71]:
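print('Build model...')

# Build the model: a single LSTM layer followed by a softmax over all characters
model = Sequential()
model.add(LSTM(128, input_shape=(seqlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
# Older Keras releases name this argument `lr` instead of `learning_rate`
optimizer = RMSprop(learning_rate=0.01)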

model.compile(loss='categorical_crossentropy', optimizer=optimizer)

model.summary()

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_13 (LSTM)               (None, 128)               86528     
_________________________________________________________________
dense_7 (Dense)              (None, 40)                5160      
_________________________________________________________________
activation_7 (Activation)    (None, 40)                0         
Total params: 91,688
Trainable params: 91,688
Non-trainable params: 0
_________________________________________________________________
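
The parameter counts in the summary can be verified by hand. The sketch below uses the standard parameter formula for an LSTM layer (four gates, each with an input kernel, a recurrent kernel and a bias) and for a dense layer; the variable names are chosen for illustration.

```python
units = 128    # LSTM units, as passed to LSTM(128, ...)
n_chars = 40   # number of distinct characters = input and output dimension

# 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * (units * (n_chars + units) + units)

# Dense layer: one weight per (unit, character) pair plus one bias per character
dense_params = units * n_chars + n_chars

print(lstm_params, dense_params, lstm_params + dense_params)  # 86528 5160 91688
```

Doubling the number of LSTM units therefore roughly quadruples the recurrent part of the parameter count, which is worth keeping in mind when trying out other configurations.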


In [61]:
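def getNextCharIdx(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    # A temperature below 1 sharpens the distribution (more conservative text),
    # a temperature above 1 flattens it (more diverse text).
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature          # rescale the log-probabilities
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)        # renormalize so the values sum to 1
    probas = np.random.multinomial(1, preds, 1)  # draw one sample
    return np.argmax(probas)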
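
To see what the temperature does, the rescaling step can be isolated and applied to a small, made-up probability vector. This is a sketch: `rescale` and `toy_preds` are hypothetical names, and the function only mirrors the rescaling lines of `getNextCharIdx` without the sampling step.

```python
import numpy as np

def rescale(preds, temperature):
    """Apply the temperature to a probability vector and renormalize it."""
    preds = np.asarray(preds, dtype='float64')
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_preds = [0.6, 0.3, 0.1]  # made-up model output for three characters
for temperature in [1.0, 0.5, 0.1]:
    print(temperature, np.round(rescale(toy_preds, temperature), 3))
```

With a temperature of 1.0 the distribution is unchanged; the lower the temperature, the more probability mass moves to the most likely character, so sampling approaches a plain `argmax`.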

In [68]:
# Creation of reverse char index, to get the char for the predicted class
indices_char = dict((i, c) for i, c in enumerate(chars))

def on_epoch_end(epoch, logs):
    # Function invoked at the end of each epoch. Prints generated text every 10th epoch.
    if epoch % 10 == 0:
        print()
        print('----- Generating text after Epoch: %d' % epoch)
        start_index = random.randint(0, len(text) - seqlen - 1)
        for diversity in [1, 0.1, 0.5]:
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + seqlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(1000):
                x_pred = np.zeros((1, seqlen, len(chars)))
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_indices[char]] = 1.

                preds = model.predict(x_pred, verbose=0)[0]
                next_index = getNextCharIdx(preds, diversity)
                next_char = indices_char[next_index]

                generated += next_char
                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()
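
The core of the callback is a sliding-window generation loop: encode the current window, predict a distribution, sample one character, append it, and drop the oldest character. The sketch below shows that loop in isolation on the toy alphabet. It makes two loudly labeled assumptions: `fake_predict` is a hypothetical stand-in for `model.predict` that returns a uniform distribution, and sampling uses NumPy's `Generator.choice` rather than `getNextCharIdx`, so the generated text is random noise.

```python
import numpy as np

toy_chars = sorted(set("hallo. "))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}
toy_seqlen = 2

def fake_predict(x_pred):
    """Hypothetical stand-in for model.predict: uniform distribution over all characters."""
    return np.full((x_pred.shape[0], len(toy_chars)), 1.0 / len(toy_chars))

rng = np.random.default_rng(0)
sentence = "ha"       # seed window, must have length toy_seqlen
generated = sentence

for _ in range(10):
    # One-hot encode the current window, shape (1, seqlen, number of characters)
    x_pred = np.zeros((1, toy_seqlen, len(toy_chars)))
    for t, char in enumerate(sentence):
        x_pred[0, t, toy_char_indices[char]] = 1.0

    preds = fake_predict(x_pred)[0]
    next_index = int(rng.choice(len(toy_chars), p=preds))
    next_char = toy_indices_char[next_index]

    generated += next_char
    sentence = sentence[1:] + next_char  # slide the window by one character

print(repr(generated))
```

Because the window always keeps its length, the model input shape stays constant no matter how many characters are generated.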

In [69]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

# 5. Evaluate Model

We are now at the sweet part of the model. Let's fit our model and see what it prints!

In [70]:
model.fit(training_set, target_char,
          batch_size=128,
          epochs=150,
          callbacks=[print_callback])

Epoch 1/150

----- Generating text after Epoch: 0
----- diversity: 1
----- Generating with seed: "ospero

    so glad of this as they i ca"
ospero

    so glad of this as they i call eae it is's sount.

meaddsai

sindal

    i hi
       thay neme freest sery bofion so han, so's't.

ant

    eres and trinculon mo it thai cas hoteen
    a cafuy, it the sureen fith het say makin!
    whither gon ester it s: fer safeld, of but cuay
    grom it berbaid all be, mang the bourg;
    e'sther to be pood mo'd man nit hat thincull
    it sill vure.

trinculo

    a list hist withes lost thy feund his that,
    when eres the dour w es thou ast uf pracesio
    sace archer monet, amenw good mo elee,
    hame loaving io; what afmist coke, tile.

tronculo

    so heatave mpoon more.
    the divesets, thet that and 't sse laveo.'stablo
    this me dove noc tadre.
    wailon: wewe hown your eatit's that  h ary,
    be thm ferd of nou!
    af nhan thmury berike you lant?
    shand and , hinculy,
    mirve

ing is the dule
    whoce be me more me not, stand is thin
    the should the rance. whou dost thou live
    ince it should poor strange most be dreepons the bast trinculo

    come of napless no lord with soft.

merastaan

    what stain beside. whou thin the canstraing
    ferdinand be purtersell how shall dad dever
    the soundst our the some more a prace lorded: where come of thin stiban st as no,
    is make to and twine isle.

antonio

    i do not the bark you be as munchee.

alonso

    the parthent in this son o' the reasure
    which i do no let me thin shape of thin stoud;
    the bet mo charg thee arand. where i care'd,
    not hom and sebastian] stoud-

antonio

    she shale on the son my sonster down heards now
    the sures my son mourn? but mo charg me
      for the surest has putto the a merry it
    the serve come of ting, and swerited, and drink.

miranda

    my soullors and he parther wot him them but this sonselo

    i do a give a firch, twith semponting inc

    this graself for the ralk you the reasue
    soundely himser, to the call. life in'd;
    enter are prospero; the soncele it.

antonio

    what see prospero the reathing on aband did;
    thou drey with aschericl: he shreeh, in my pact a surpole with a treep,
    then your indion, in a surerome brows. i hereading
    the seavles no murt of mage euss of milan,
    sir all poor but my charging to it it
    the master the ract of napler in dispert the very spiring they
    the master the ratt my not but he gred.

sebassia

    then you torchand of hing.

    exet that fald more; who
    the senst the reasure me with my hopaintion.

sebastian

    how you have tornce this rabable inst one say you
    in lowss where the sark with sle ploourser stand your
    so, that make livewring their fablr; by so,
    the fordone to stave beer liceranger 'gaher and it
    the seavles a mon
----- diversity: 0.5
----- Generating with seed: " furlongs of sea for an
    acre of barr"
 furlongs of sea f

    mishoars that lary did poor knot ofin hat veace fawther
    and was heave the him now-are broviguld her agan,
    and i say worlon, alcess, he's nont, have nep,
    compoce to the forst. thee as you loverang thee,
    sound 'twas the king on.

sebastian

    do not this incuse. that's thee forwert
    of these my for o'lt o'eld it surer consom
    they ever shall crarglnt here-sto thou didst part our sayswess it you
    lay hout do am down pallo, and trum never
    suther souftheir befode poor speak forth,
    not a gresends
    and link my faundst few werring, so corns stoce stiden sto make lastionstle me,
    wither son e't stall be ploced to ableald--
    chan so all a giving in and onceyo

    every moast to did poor weach relicle
    soundrats art master
   
----- diversity: 0.1
----- Generating with seed: "sp channels and on this green land
    a"
sp channels and on this green land
    and weacod while i did pery did poor wean epes o' this is
    abine and drewn me.

miranda


Epoch 125/150
Epoch 126/150
Epoch 127/150
Epoch 128/150
Epoch 129/150
Epoch 130/150
Epoch 131/150

----- Generating text after Epoch: 130
----- diversity: 1
----- Generating with seed: "on him; his complexion is
    perfect ga"
on him; his complexion is
    perfect gase bestep, sintlemsed hereach,
    o disporielt our caliban
    and whey wastle i serve hose pros.

trinculo

    he isle, as the grodsedwitus; thou tore,
    what you reje'd mostive thee forge that
    se'll do you were should ind good sirvil
    not him
    as mast act shall prospero's booking so,
    they hose i pro: dowst for the shall did they firs
    ap marine am amant

    serve there oo master gall rend;
    whos that wasle plosebzer: for the saneril
    the gras d bestain wat a sprace: done of pongut fere offerturt
    is this grasing sid beher
    if hat heave fies that the is so bles

boot nal
    aster with go dono. fastelt for the man the matt
    but, forlompain!

antonio

    and all and mone upon our callo

<keras.callbacks.History at 0x193dd1d0>