> **DO NOT EDIT IF INSIDE annadl_f19 folder**


# Week 5: Recurrent neural networks

Text, speech, weather, sensor output and video are but a few examples of the many types of data that are inherently sequential. So how does one predict the next word in a sentence, future temperatures or missing video frames? Using **recurrent neural networks** (RNNs)!

In [8]:
%matplotlib inline
%load_ext tensorboard

import numpy as np
import requests as rq
import random
import sys
import io
from bs4 import BeautifulSoup
import keras
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.optimizers import RMSprop

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


## Exercises

#### Modeling text

Text prediction is a good place to start when learning about RNNs, because most of us humans have a pretty well
optimized inner model for text prediction ourselves. We can, therefore, easily assess the performance of a neural
network in executing this task.

Below is some code that loads the screenplay for Tarantino's 1994 film 'Pulp Fiction'. I recommend reading through the
first 20 lines or so to get a feeling for the language and style used (and enjoy probably the best written screenplay
in the history of film).

In [3]:
response = rq.get("http://www.dailyscript.com/scripts/pulp_fiction.html")
text = BeautifulSoup(response.content, "html.parser").getText()
print(text[:2000])



"PULP FICTION" -- by Quentin Tarantino & Roger Avary


                                      "PULP FICTION"

                                            By

                             Quentin Tarantino & Roger Avary

                

               PULP [pulp] n.

               1. A soft, moist, shapeless mass or matter.

               2. A magazine or book containing lurid subject matter and 
               being characteristically printed on rough, unfinished paper.

               American Heritage Dictionary: New College Edition

               INT. COFFEE SHOP – MORNING

               A normal Denny's, Spires-like coffee shop in Los Angeles. 
               It's about 9:00 in the morning. While the place isn't jammed, 
               there's a healthy number of people drinking coffee, munching 
               on bacon and eating eggs.

               Two of these people are a YOUNG MAN and a YOUNG WOMAN. The 
               Young Man has a slight 

> **Ex. 5.1.1:** What is the most used symbol in this screenplay and what accuracy would a model constantly predicting this symbol obtain? In other words, what is the "baseline accuracy"?

In [4]:
char_freqs = {}

# Count how often each symbol occurs in the screenplay.
for symbol in text:
    if symbol in char_freqs:
        char_freqs[symbol] += 1
    else:
        char_freqs[symbol] = 1

# The most frequent symbol is what a constant-prediction model would output.
max_symbol = max(char_freqs, key=char_freqs.get)
max_freq = char_freqs[max_symbol]
print("max freq letter: %s with %d occurrences" % (repr(max_symbol), max_freq))

# Baseline accuracy: share of all symbols that equal the most frequent one.
acc = max_freq / len(text)
print("baseline accuracy: %f" % acc)

max freq letter: ' ' with 164787 occurrences
baseline accuracy: 0.541059
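
The same baseline can be computed more compactly with `collections.Counter`. The sketch below is an alternative to the loop above, not the course's reference solution; the helper name `baseline_accuracy` and the toy string are made up for illustration.

```python
from collections import Counter


def baseline_accuracy(text):
    """Return the most frequent symbol and the accuracy of always predicting it."""
    counts = Counter(text)
    symbol, count = counts.most_common(1)[0]
    return symbol, count / len(text)


# Toy example (hypothetical string, not the screenplay): 'a' is 3 of 5 symbols.
symbol, acc = baseline_accuracy("aaabc")
print("most frequent symbol: %r, baseline accuracy: %.2f" % (symbol, acc))
```

Calling `baseline_accuracy(text)` on the screenplay should give the same symbol and accuracy as the cell above.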


I've adapted some code for text generation from [this Keras example](https://keras.io/examples/lstm_text_generation/).
I've inserted some questions in the code (look for `Q:`) for you to answer in the exercise below.

In [5]:
# Q1: What is the purpose of this block? When is `char_indices` used? What about `indices_char`?
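# A1: It builds the vocabulary and two lookup tables. `char_indices` maps character -> index and is used
#     to one-hot encode text; `indices_char` maps index -> character and is used to turn a sampled
#     index back into a character during generation.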
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

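# Q2: What is the purpose of this block? What do the `seqlen` and `step` parameters do?
# A2: It cuts the text into training windows. `seqlen` is the number of input characters per window
#     (each window stores seqlen + 1 characters so that every input has a next-character target);
#     `step` is the offset between the starts of consecutive windows.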
seqlen = 40
step = seqlen
sentences = []
for i in range(0, len(text) - seqlen - 1, step):
    sentences.append(text[i: i + seqlen + 1])

# Q3: What about this block? What is `x` and what is `y`? Why do they have this dimensionality?
# A3: `x` holds the one-hot encoded input characters and `y` the one-hot encoded targets, i.e. the same
#     characters shifted one step ahead. Both have shape (number of windows, seqlen, vocabulary size).
x = np.zeros((len(sentences), seqlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), seqlen, len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    # Q3a: What happens in this loop?
    # A3a: For each time step t, it sets the one-hot entry of the input character in `x`
    #      and of the character that follows it in `y`.
    for t, (char_in, char_out) in enumerate(zip(sentence[:-1], sentence[1:])):
        x[i, t, char_indices[char_in]] = 1
        y[i, t, char_indices[char_out]] = 1


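# Q4: Here we build the model. What does the `return_sequences` argument do? Why the dense layer at the end?
# A4: `return_sequences=True` makes the LSTM output its hidden state at every time step instead of only
#     the last one, so there is a prediction for each position. The dense softmax layer maps each
#     128-dimensional hidden state to a probability distribution over the characters.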
model = Sequential()
model.add(LSTM(128, input_shape=(seqlen, len(chars)), return_sequences=True))
model.add(Dense(len(chars), activation='softmax'))

model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(lr=0.01),
    metrics=['categorical_crossentropy', 'accuracy']
)

def sample(preds, temperature=1.0):
    """Helper function to sample an index from a probability array."""
    preds = np.asarray(preds).astype('float64')
    preds = np.exp(np.log(preds) / temperature)  # softmax
    preds = preds / np.sum(preds)                #
    probas = np.random.multinomial(1, preds, 1)  # sample index
    return np.argmax(probas)                     #


def on_epoch_end(epoch, _):
    """Function invoked at end of each epoch. Prints generated text."""
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - seqlen - 1)
    
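    # Q5: What does diversity do?
    # A5: It is the sampling temperature used in `sample`. The predicted probabilities are rescaled
    #     as p**(1/diversity) and renormalized: low values make sampling nearly greedy (the most
    #     probable character wins), while higher values make the generated text more random.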
    for diversity in [0.2, 0.5, 1.0]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + seqlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, seqlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.
            
            # What is the dimensionality of `preds`? Why do we input `preds[0, -1]` to the `sample` function?
            preds = model.predict(x_pred, verbose=0)
            next_index = sample(preds[0, -1], diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)







In [9]:
from datetime import datetime

logdir = './logs_week5/' + datetime.now().strftime("%Y%m%d-%H%M%S") # log file name
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

In [10]:
model.fit(x, y,
          batch_size=128,
          epochs=50,
          callbacks=[print_callback, tensorboard_callback])



Epoch 1/50

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "      It's cool, Honey Bunny, we're stil"
      It's cool, Honey Bunny, we're stils                                                                                                                                                                                                                                                                                                                                                                                                               
----- diversity: 0.5
----- Generating with seed: "      It's cool, Honey Bunny, we're stil"
      It's cool, Honey Bunny, we're stiltsor                                                                                                                                it                                                                                                                                                      

                                   JULES
                                                                                                                                                                                                                                                                                                                                                                                                              
----- diversity: 1.0
----- Generating with seed: "                                   JULES"
                                   JULES
                                
                              VINCENT
                                 S VINCENT
                                            JULES
                           Marse.

                       YOXTFED HELDY UNGfomen it fobuving off.

                      shauge topp.

                        VINCENT
                                              To him, 

Epoch 6/50

----- Generating text after 

                                                                                                                                                                                                                                                                                                                                                                                                                               
----- diversity: 0.5
----- Generating with seed: "     What time is it?

               "
     What time is it?

                                                                                                                                                                                                                                                                                                                                                                                                                               
----- diversity: 1.0
----- Generating with seed: "     

                         an when got the starter toward and stops her, out ahteadle 
               Whereing to give the stop, seed reads, for has phone)
                         say it that Honda. Hears is I hase. Beens all youet a seet in 
                         to before theatherd endy's."Eecause all the apartment, 
                     
Epoch 15/50

----- Generating text after Epoch: 14
----- diversity: 0.2
----- Generating with seed: "                                   BUTCH"
                                   BUTCH
                                                                                                                                                                                                                                                                                                                                                                                                              
----- diversity: 0.5
----- Generating with seed: "                      

                                                                                                                                                                                                                                                                                                                                                                                                                              
----- diversity: 1.0
----- Generating with seed: "y sometime. Me, I can't 
              "
y sometime. Me, I can't 
                         can't gean countning, zendy of the drendy 
               band of here, be 
                         a tarth and ahpead, not in the corned cearact, he can happened you't 
               undexter.

               Then Mac Lanca cop up a shot!

               Mia's nobe.

               Mia hold normans and bohing Marsellus.

               The mmpomothing.

               Mi
Epoch 20/50

----- Generating text after Epoch: 19
----- diversity:

                         Vincent takes a sup.

                                                                                                                                                                                                                                                                                                                                                                                          
----- diversity: 0.5
----- Generating with seed: "ightening, 
                         Vi"
ightening, 
                         Vincent.

                                                                                                                                                                                                                                                                                                                                                                                                      
----- diversity: 1.0
----- Generating with se

               seed time, their hamatain' preven 
               want to drink time.

                           Butch's body from his song, nods off.

                                     Beeip.

               Vince hit?

                                       THE WOLF
                         Yeah, wakes a smallett make a roller, 

Epoch 29/50

----- Generating text after Epoch: 28
----- diversity: 0.2
----- Generating with seed: "    matter. You're judging this thing 
"
    matter. You're judging this thing 
                         that a great of the bathroom door and sticks in a bad shit 
                         the bathroom door and starts to the car. The 
                                                                                                                                                                                                                                                  
----- diversity: 0.5
----- Generating with seed: "    matter. You're judging this t

                                                                                                                                                                                                                                                                                                                JULES
                              (solld's a saw solver and starts to 
                         seems on th
----- diversity: 1.0
----- Generating with seed: "t through the needle.

               "
t through the needle.

               Marsil well, let her nade, the ewon.

                                                 THE WOLF
                        shit looks the ring.

               Jules noded a backs and the interact-you, I 
               makemed.

               Jules two time that's mouth ringin' 
                         scarey, face. When the from the phone, 
               a locks, hand it up of the othi
Epoch 34/50

----- Generating text after Epoch: 33
----- diversity:

                                                                                                                                                                                                                                                                                                                                                                                                                   
----- diversity: 0.5
----- Generating with seed: "  It was a wedding present from my 
   "
  It was a wedding present from my 
                         and Vincent looks to a greation that so fuckin' 
                                                                                                                                                                                                                                                                                                                                        
----- diversity: 1.0
----- Generating with seed: "  It was a weddi

                                     It was shool, he's his what they do, and you go 
                                           (to Brett)
                         Nox think I've a wallet – not now go on 
                         for him bag. You feel 'im for a 
                         gon
Epoch 43/50

----- Generating text after Epoch: 42
----- diversity: 0.2
----- Generating with seed: "                              VINCENT
 "
                              VINCENT
                                                                                                                                                                                                                                                                                                                                                                                                                 
----- diversity: 0.5
----- Generating with seed: "                              VINCENT
 "
                              VINC

                         the story get a big blow?

                                     What's he weak through the tellection.

                                                                                                                                                                                                              come to be fucki
----- diversity: 1.0
----- Generating with seed: " me on a cellular 
                    "
 me on a cellular 
                         two over to he's)

                                     remember uself wantony howdong-asbo shoe mad poinon! Diess out that 
               how call sike?

               Vince STARTS the bill up reach.

                                     Dimber! The money?

                                     On! Chaikners ain't doin', what eart 
                         see smile –

         
Epoch 48/50

----- Generating text after Epoch: 47
----- diversity: 0.2
----- Generating with seed: "in the red. Redline 
       

<keras.callbacks.History at 0x7fdb43e9ae48>

In [6]:
%tensorboard --logdir logs_week5/

> **Ex. 5.1.2**: Add a callback for TensorBoard, so you can log the training process. Start training the network (takes ~10 minutes on my computer). While it's running, move on to the next question.

> **Ex. 5.1.3**: Answer the questions in the code above (look for code comments starting with `Q:`).
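
As a hint for Q2 and Q3, the sketch below runs the same windowing and one-hot encoding on a tiny toy string, so the shapes are easy to inspect. The string `toy_text` and the small `seqlen` are assumptions chosen for illustration; the logic mirrors the code above.

```python
import numpy as np

toy_text = "hello world!"   # hypothetical stand-in for the screenplay
seqlen = 3                  # window length (40 in the notebook)
step = seqlen               # offset between window starts, as in the notebook

chars = sorted(set(toy_text))
char_indices = {c: i for i, c in enumerate(chars)}

# Each window holds seqlen + 1 symbols: seqlen inputs plus one extra so that
# every input position has a "next symbol" target.
sentences = [toy_text[i: i + seqlen + 1]
             for i in range(0, len(toy_text) - seqlen - 1, step)]

x = np.zeros((len(sentences), seqlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), seqlen, len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, (char_in, char_out) in enumerate(zip(sentence[:-1], sentence[1:])):
        x[i, t, char_indices[char_in]] = 1   # one-hot input at time t
        y[i, t, char_indices[char_out]] = 1  # one-hot target: symbol at t + 1

print(sentences)
print(x.shape, y.shape)
```

Within each window, the target at step `t` is the input at step `t + 1`, which is why `x` and `y` share the same shape.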

> **Ex. 5.1.4**: Did the network finish training? Consider the generated text across epochs.
1. In the early epochs (0-10), the generated text looks very bad. Can you explain why the low-diversity generated text contains almost only the symbol " " (that is, spaces)?
2. The high-diversity generated text is messed up too, but in a different way. Explain how.
3. In later epochs (20-30), what do you notice is off about the low-diversity generated text?

>> The low-diversity text consists almost entirely of spaces because spaces make up about 54% of the screenplay (see Ex. 5.1.1). Early in training the model has learned little beyond these symbol frequencies, and a low diversity sharpens the predicted distribution toward its most probable symbol, so a space is sampled almost every time.
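
To see why low diversity collapses onto the most probable symbol, the sketch below applies the same rescaling as `sample` to a made-up three-symbol distribution. The probabilities are assumptions for illustration, not model output, and `apply_temperature` is a hypothetical helper name.

```python
import numpy as np


def apply_temperature(preds, temperature):
    """Rescale a probability vector the same way `sample` does before drawing."""
    preds = np.asarray(preds, dtype="float64")
    scaled = np.exp(np.log(preds) / temperature)
    return scaled / np.sum(scaled)


# Hypothetical next-symbol distribution over: space, 'e', 't'.
probs = [0.5, 0.3, 0.2]
for temperature in [0.2, 0.5, 1.0]:
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

At temperature 1.0 the distribution is unchanged; lower temperatures shift more probability mass onto the most likely symbol, which early in training is the space.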

> **Ex. 5.1.5**: For the network trained over all 50 epochs, generate a longer piece of text
(say 5000 symbols long). Use the sentence `text[1486:1526]` as seed (starts with 'YOUNG MAN' ends with 'No, ')
and set diversity to 0.5.
Describe which features of the screenplay, and of language in general, the network learned in only 50 epochs.
Also describe what serious mistakes it makes.
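
The generation loop from `on_epoch_end` can be wrapped in a helper so that it is reusable for this exercise and the next. The sketch below is one possible refactoring, not part of the original notebook: `generate_text` and its `predict_fn` parameter are hypothetical names, and `predict_fn` is assumed to behave like `model.predict`, returning an array of shape `(1, seqlen, len(chars))`. To keep the example self-contained, it is exercised with a dummy predictor instead of the trained model.

```python
import numpy as np


def sample(preds, temperature=1.0):
    """Sample an index from a probability array after temperature rescaling."""
    preds = np.asarray(preds).astype("float64")
    preds = np.exp(np.log(preds) / temperature)
    preds = preds / np.sum(preds)
    return int(np.argmax(np.random.multinomial(1, preds, 1)))


def generate_text(predict_fn, seed, chars, n_symbols, diversity=0.5):
    """Generate n_symbols symbols, feeding each prediction back in as input."""
    char_indices = {c: i for i, c in enumerate(chars)}
    sentence, generated = seed, seed
    for _ in range(n_symbols):
        x_pred = np.zeros((1, len(seed), len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.0
        # preds has shape (1, seqlen, len(chars)); preds[0, -1] is the
        # distribution over the symbol that follows the last input symbol.
        preds = predict_fn(x_pred)
        next_char = chars[sample(preds[0, -1], diversity)]
        sentence = sentence[1:] + next_char  # slide the window one symbol
        generated += next_char
    return generated


# Dummy predictor (assumption, stands in for model.predict): puts nearly
# all probability on the symbol at index 0 at every time step.
chars = ["a", "b", "c"]


def dummy_predict(x_pred):
    preds = np.full(x_pred.shape, 1e-9)
    preds[..., 0] = 1.0
    return preds


print(generate_text(dummy_predict, "abc", chars, n_symbols=5))
```

With the trained model, the analogous call would be something like `generate_text(lambda x: model.predict(x, verbose=0), text[1486:1526], chars, 5000)`.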

In [13]:
start_index = 1486
seqlen = 40

# Diversity is the sampling temperature passed to `sample`: lower values give more
# predictable text, higher values give more random text.
diversity = 0.5
print('----- diversity:', diversity)

generated = ''
sentence = text[start_index: start_index + seqlen]
generated += sentence
print('----- Generating with seed: "' + sentence + '"')
sys.stdout.write(generated)

for i in range(5000):
    x_pred = np.zeros((1, seqlen, len(chars)))
    for t, char in enumerate(sentence):
        x_pred[0, t, char_indices[char]] = 1.

    # What is the dimensionality of `preds`? Why do we input `preds[0, -1]` to the `sample` function?
    preds = model.predict(x_pred, verbose=0)
    next_index = sample(preds[0, -1], diversity)
    next_char = indices_char[next_index]

    sentence = sentence[1:] + next_char

    sys.stdout.write(next_char)
    sys.stdout.flush()
print()

----- diversity: 0.5
----- Generating with seed: "YOUNG MAN
                         No, "
YOUNG MAN
                         No, I miral was a boldoway.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  the cartauran fuckin' fannents the bad, sits a 
                                                                                                           

> **Ex. 5.1.6**: Do the same as above, but for 40 random letters (e.g. smash away on your keyboard) as seed. What happens? Can you explain why?
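
A keyboard-mash seed is only usable if it has exactly `seqlen` symbols and every symbol occurs in the training text; otherwise the `char_indices` lookup raises a `KeyError` or the input array has the wrong length. The sketch below is a hypothetical helper (`check_seed` is not part of the notebook) that reports such problems before generation starts. The toy vocabulary is an assumption; in the notebook one would pass `chars`.

```python
def check_seed(seed, chars, seqlen):
    """Return a list of problems with a seed; an empty list means it is usable."""
    problems = []
    if len(seed) != seqlen:
        problems.append("seed has %d symbols, expected %d" % (len(seed), seqlen))
    unknown = sorted(set(seed) - set(chars))
    if unknown:
        problems.append("symbols not in vocabulary: %r" % unknown)
    return problems


# Toy vocabulary (assumption): lowercase letters and the space.
vocab = list("abcdefghijklmnopqrstuvwxyz ")
print(check_seed("oijawef wefijwe oujioa woo w dfaosij efk", vocab, seqlen=40))
print(check_seed("Hello?", vocab, seqlen=40))
```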

In [23]:
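seed = "oijawef wefijwe oujioa woo w dfaosij efk"
# The seed must be exactly `seqlen` symbols long to fill the model's input window.
assert len(seed) == seqlen

# Diversity is the sampling temperature passed to `sample`: lower values give more
# predictable text, higher values give more random text.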
diversity = 0.5
print('----- diversity:', diversity)

generated = ''
sentence = seed
generated += sentence
print('----- Generating with seed: "' + sentence + '"')
sys.stdout.write(generated)

for i in range(5000):
    x_pred = np.zeros((1, seqlen, len(chars)))
    for t, char in enumerate(sentence):
        x_pred[0, t, char_indices[char]] = 1.

    # What is the dimensionality of `preds`? Why do we input `preds[0, -1]` to the `sample` function?
    preds = model.predict(x_pred, verbose=0)
    next_index = sample(preds[0, -1], diversity)
    next_char = indices_char[next_index]

    sentence = sentence[1:] + next_char

    sys.stdout.write(next_char)
    sys.stdout.flush()
print()

----- diversity: 0.5
----- Generating with seed: "oijawef wefijwe oujioa woo w dfaosij efk"
oijawef wefijwe oujioa woo w dfaosij efker 
                         the stares. He leanters windows. The 
                         all you would leave wo know what the 
                         a same every one of the car, walk a 
                         and sering to him.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

> **Challenge** Download [this](https://www.yelp.com/dataset/download) Yelp dataset and train a model that predicts rating given a review text!