# Lecture 5 : Recurrent Neural Networks
### Please access this notebook at: https://github.com/sangaer/PracticalMachineLearning2019

In today's course, we'll cover 3 labs:
- Lab1. Pre-processing for Text Data
- Lab2. Long Short-Term Memory - Spam Message Classification
- Lab3. Sequence-to-Sequence Autoencoder - English to French Translation

## Package Import
----
Import the required packages, including NumPy, pandas, TensorFlow, scikit-learn, and matplotlib.

In [None]:
import os
import re
import requests

from io import BytesIO
from zipfile import ZipFile

import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
%matplotlib inline

## Lab 1. Pre-processing for Text Data
----


### Dataset Preparation
----
The **SMS Spam Collection** is a set of tagged SMS messages collected for SMS spam research. It contains 5,572 English SMS messages, each tagged as either ham (legitimate) or spam.

For more information, see: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

In [None]:
dataset_dir = os.path.join('data')
dataset_sms_path = os.path.join('data', 'SMSSpamCollection')

def download_sms_collection_dataset():
    r = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip')
    if r.status_code == 200:
        f_zip = ZipFile(BytesIO(r.content))

        dataset_raw_content = f_zip.read('SMSSpamCollection')
        with open(dataset_sms_path, 'wb') as f_dataset:
            f_dataset.write(b'label\tcontent\n')
            f_dataset.write(dataset_raw_content)
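    else:
        # A bare assert on a non-empty string always passes, so raise explicitly
        raise RuntimeError('Failed to download the SMS Spam Collection dataset (HTTP {})'.format(r.status_code))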
        

if not os.path.exists(dataset_dir):
    os.makedirs(dataset_dir)
if not os.path.exists(dataset_sms_path):
    download_sms_collection_dataset()
    
df = pd.read_csv(dataset_sms_path, sep='\t')
display(df)

### Word Splitting
----
Convert each sentence into a sequence of separate words. Note that, besides the period (`.`), other punctuation marks such as the comma (`,`), colon (`:`), and semicolon (`;`) also act as separators.
> **Your task**:  
> Split each sentence into a list of words by using `tf.keras.preprocessing.text.text_to_word_sequence`. For example, for the following sentence:
> - Recurrent neural networks have a wide array of applications, including time series analysis, document classification, speech and voice recognition.
> 
> The expected result is:
> - `['recurrent', 'neural', 'networks', 'have', 'a', 'wide', 'array', 'of', 'applications', 'including', 'time', 'series', 'analysis', 'document', 'classification', 'speech', 'and', 'voice', 'recognition']`
> ---
> Meanwhile, adjust the argument `MAX_SEQUENCE_LENGTH` based on what you observe in the sequence length histogram.
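
As a rough illustration of what a word-splitting helper does, the sketch below lowercases a sentence, replaces punctuation with spaces, and splits on whitespace. It is a pure-Python stand-in, not the Keras implementation: the function name `split_words` is hypothetical, and the filter string is assumed to mirror the Keras default.

```python
# Punctuation treated as separators (assumed to match the Keras default filter set).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=FILTERS):
    """Hypothetical helper: lowercase, turn filter characters into spaces, split."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()


samples = ['abc def', 'gh, ijkl; mnop, qrs.', 'tu! vwx, yz!']
X_str_demo = [split_words(s) for s in samples]
print(X_str_demo)
```

Each element of `X_str_demo` is a variable-length list of words, which is the shape the histogram cell expects for `X_str`.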

In [None]:
# Finish your part here
# 1. Your input is df['content'], use `for x_str in df['content']` for iterative processing
#     e.g.,
#     df['content'] = [
#         'abc def',
#         'gh, ijkl; mnop, qrs.',
#         'tu! vwx, yz!'
#     ]
#
# 2. Expected result should be a Python list `X_str`, with each element a variable-length array
#     e.g.,
#     X_str == [
#         ['abc', 'def'], 
#         ['gh', 'ijkl', 'mnop', 'qrs'],
#         ['tu', 'vwx', 'yz']
#     ]


# The following part validates whether your result is correct
for i in range(0, 20):
    print('[\'{}\'{}]'.format('\', \''.join(X_str[i][0:10]), ' ...' if len(X_str[i]) > 10 else ''))

In [None]:
plt.hist([len(x_str) for x_str in X_str], bins=100)
plt.ylabel('occurrences')
plt.xlabel('Number of Words in Sentence')
plt.show()

MAX_SEQUENCE_LENGTH = 32

### Tokenization
----
Use `tf.keras.preprocessing.text.Tokenizer` to encode each word as an integer code; the more often a word occurs, the smaller its code.

> **Your task**:  
> Adjust the argument `MAX_NUM_DICT_WORDS` so that uncommon words (e.g., occurrences < 10) are removed when the string-based sequences are converted to integer-based codes.
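
To make the effect of the vocabulary cap concrete, here is a minimal pure-Python sketch of frequency-ranked encoding. It is not the Keras `Tokenizer`; the helpers `build_word_index` and `encode` are hypothetical, and the sketch assumes an out-of-vocabulary token `UNK` placed at index 1, with index 0 left free for padding.

```python
from collections import Counter


def build_word_index(token_lists, oov_token='UNK'):
    """Hypothetical helper: rank words by frequency, reserving index 1 for UNK."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def encode(tokens, word_index, num_words, oov_token='UNK'):
    """Map words to codes; anything unknown or ranked at num_words or above becomes UNK."""
    oov = word_index[oov_token]
    codes = []
    for tok in tokens:
        idx = word_index.get(tok)
        codes.append(idx if idx is not None and idx < num_words else oov)
    return codes


corpus = [['free', 'prize', 'now'], ['call', 'now', 'free'],
          ['free', 'call', 'now'], ['now']]
demo_index = build_word_index(corpus)
print(demo_index)
print(encode(['free', 'prize', 'now', 'hello'], demo_index, num_words=4))
```

Lowering `num_words` in the last call pushes more of the rare words onto the `UNK` code, which is the trade-off the task asks you to tune.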

In [None]:
# Adjust this argument by observing the word occurrence
MAX_NUM_DICT_WORDS = 2000

# Finish your part here. The variable tokenizer is expected to be an instance of tf.keras.preprocessing.text.Tokenizer
# ....

# Fit the tokenizer on the word sequences (this fails if `tokenizer` has not been created above)
tokenizer.fit_on_texts(X_str)

In [None]:
def display_tokenizer_stat(tokenizer, max_items=None):
    list_tokenizer_stat_keys, list_tokenizer_stat_words, list_tokenizer_stat_cnts = [], [], []
    for key in sorted(tokenizer.index_word):
        word = tokenizer.index_word[key]
        cnt = -1
        if word != 'UNK':
            cnt = tokenizer.word_counts[word]
        list_tokenizer_stat_keys.append(key)
        list_tokenizer_stat_words.append(word)
        list_tokenizer_stat_cnts.append(cnt)

    df_tokenizer_stat = pd.DataFrame(columns=['Key', 'Word', 'Occurrences'],
                                     data={
                                         'Key': list_tokenizer_stat_keys,
                                         'Word': list_tokenizer_stat_words,
                                         'Occurrences': list_tokenizer_stat_cnts
                                     })
    if max_items is not None:
        display(df_tokenizer_stat[:max_items])
    else:
        display(df_tokenizer_stat)


# display here
display_tokenizer_stat(tokenizer, MAX_NUM_DICT_WORDS)

Now, use the fitted tokenizer to convert the **string-based** sequences into **integer-based** sequences:

In [None]:
X = tokenizer.texts_to_sequences(X_str)
for x in X[:20]:
    print('[{}]'.format(', '.join([str(x_) for x_ in x])))

### Sequences Padding
----
Most deep learning frameworks rely on batch processing, which is much simpler when every sequence in the dataset has the same length. Therefore, we use the helper function `tf.keras.preprocessing.sequence.pad_sequences` to convert variable-length sequences into fixed-length sequences.

> **Your task**:  
> Use `tf.keras.preprocessing.sequence.pad_sequences` to pad or truncate the sequences `X` into a fixed-length array. The length should be `MAX_SEQUENCE_LENGTH`.
> 
> Reference: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
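
The padding rule itself is simple enough to sketch in pure Python. The helper below is hypothetical and only mimics the behaviour on a single sequence; it assumes the Keras defaults of padding and truncating at the front (`'pre'`) with the value 0.

```python
def pad_or_truncate(seq, maxlen, padding='pre', truncating='pre', value=0):
    """Hypothetical helper: force one sequence to exactly maxlen items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops items from the front, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return pad + seq if padding == 'pre' else seq + pad


short_demo = pad_or_truncate([7, 8, 9], maxlen=5)
long_demo = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
post_demo = pad_or_truncate([7, 8, 9], maxlen=5, padding='post')
print(short_demo, long_demo, post_demo)
```

After this step every row has the same length, so the whole dataset can be stored as one two-dimensional array.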

In [None]:
# Finish your part here
# ...

# The following part validates whether your result is correct
for i in range(0, 10):
    print('{}'.format(X[i]))

### Label Encoder
----
Use `sklearn.preprocessing.LabelEncoder` to convert the **string-based** labels into **integer-based** labels.
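
Conceptually, label encoding sorts the distinct labels and replaces each label by its position in that sorted list. A minimal pure-Python sketch of the idea, using a handful of made-up labels instead of the real dataset:

```python
labels = ['ham', 'spam', 'ham', 'ham', 'spam']  # illustrative labels only

# Classes are the sorted unique labels; each label maps to its position in that list.
classes = sorted(set(labels))
class_to_index = {label: idx for idx, label in enumerate(classes)}
Y_demo = [class_to_index[label] for label in labels]

print(list(enumerate(classes)))
print(Y_demo)
```

With only two classes, `ham` becomes 0 and `spam` becomes 1, which is the form the classifier's loss function expects.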

In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(df['label'])

Y = label_encoder.transform(df['label'])

display( [ (idx, label) for idx, label in enumerate(label_encoder.classes_) ] )

Now you have clean, well-shaped data. Split it into a training set and a testing set for further evaluation.

- **Training Set**  
    A set of examples used for learning the model parameters (i.e., weights and biases) of the classifier.

    The data and the corresponding labels are stored in the variables (`X_train`, `Y_train`).
- **Testing Set**  
    A set of examples used for performance evaluation only.

    The data and the corresponding labels are stored in the variables (`X_test`, `Y_test`).
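
A shuffled hold-out split can be sketched in a few lines of pure Python. The helper `holdout_split` is hypothetical and will not reproduce the exact shuffle that scikit-learn produces for a given `random_state`; it only illustrates the idea of shuffling once and reserving a fraction of the indices for testing.

```python
import math
import random


def holdout_split(n_samples, test_size, seed=0):
    """Hypothetical helper: return (train_indices, test_indices) after one shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(n_samples=5572, test_size=0.15)
print(len(train_idx), len(test_idx))
```

Because the same shuffled indices are applied to both the data and the labels, each message stays paired with its own label.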

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, random_state=0)

## Lab 2. Long Short-Term Memory (LSTM) - Spam Message Classification
----
Remember the following arguments:
- `MAX_NUM_DICT_WORDS`:  
    The size of the vocabulary, i.e., the number of distinct word codes that may appear in the encoded sentences. It fixes the dimension of the one-hot encoded space and is used by both the tokenizer and the embedding layer.

    Any word that occurs less often than the last item kept in the tokenizer is removed and replaced by `UNK`.
- `MAX_SEQUENCE_LENGTH`:  
    The maximum length of the sequences.
- `WORD_EMBED_DIMENSION`:  
    The dimension of the space into which the embedding layer projects each word.

Now, we use **LSTM** and **Embedding** layers to construct the sequence classifier.

For more details, please refer to **week5.pptx**.
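
As a sanity check on the model summary, the number of trainable parameters can be worked out by hand. The sketch below assumes the layer sizes used in this notebook (a 2000-word vocabulary, 256-dimensional embeddings, 64 LSTM units, then dense layers of 64 and 2 units) and the standard LSTM layout with one bias vector per gate; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word code
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


vocab_size, embed_dim, lstm_units = 2000, 256, 64
param_counts = {
    'embedding': embedding_params(vocab_size, embed_dim),
    'lstm': lstm_params(embed_dim, lstm_units),
    'dense_relu': dense_params(lstm_units, 64),
    'dense_softmax': dense_params(64, 2),
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

Most of the parameters sit in the embedding table, so changing `MAX_NUM_DICT_WORDS` or `WORD_EMBED_DIMENSION` has a much larger effect on model size than changing the dense layers.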

In [None]:
WORD_EMBED_DIMENSION = 256

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(MAX_NUM_DICT_WORDS, WORD_EMBED_DIMENSION, input_length=MAX_SEQUENCE_LENGTH),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.summary()

tf.keras.utils.plot_model(model, show_layer_names=True, show_shapes=True)

In [None]:
history = model.fit(X_train, Y_train, epochs=10, validation_data=(X_test, Y_test), batch_size=64, shuffle=True)

In [None]:
plt.figure(figsize=(12,6))
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.legend()
plt.grid()
plt.ylabel('loss')
plt.xlabel('epoch')
plt.ylim((0.0, 1.0))
plt.show()

plt.figure(figsize=(12,6))
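# The metric key is 'acc' in older TensorFlow releases and 'accuracy' in newer ones
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key], label='Training accuracy')
plt.plot(history.history['val_' + acc_key], label='Validation accuracy')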
plt.legend()
plt.grid()
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.ylim((0.0, 1.0))
plt.show()

## Lab 3. Sequence-to-Sequence Autoencoder - English to French Translation
----

### Dataset Preparation
----
**Tatoeba** is a collection of sentences and translations. The subset `fra-eng` contains 160,872 English-to-French sentence pairs. Here, we use the first 10,000 sentence pairs to train the sequence-to-sequence autoencoder and to generate predictions.

For more information, please refer to:
- https://tatoeba.org/eng
- http://www.manythings.org/anki/
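
The preparation cell below stores each French sentence twice: once prefixed with a tab (`\t`) as the decoder input, and once suffixed with a newline (`\n`) as the decoder target. A minimal sketch of why, using one sentence pair chosen purely for illustration: at every step the decoder reads one character and is trained to predict the next one.

```python
source, target = 'Go.', 'Va !'  # illustrative pair, not read from the dataset file

decoder_input = '\t' + target    # start marker followed by the sentence
decoder_output = target + '\n'   # the sentence followed by the end marker

# At each step the decoder sees one character and should predict the next one.
for step, (seen, expected) in enumerate(zip(decoder_input, decoder_output)):
    print(step, repr(seen), '->', repr(expected))
```

The two strings have the same length and are offset by exactly one character, so the tab acts as the start signal and the newline as the stop signal during prediction.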

In [None]:
dataset_fra_eng_path = os.path.join(dataset_dir, 'fra.txt')

def download_fra_eng_dataset():
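    if not os.path.exists(dataset_fra_eng_path):
        # Install the Kaggle API token only if it is not already in place
        if not os.path.exists(os.path.expanduser('~/.kaggle/kaggle.json')):
            !mkdir -p ~/.kaggle
            !cp kaggle.json ~/.kaggle/
            !chmod 600 ~/.kaggle/kaggle.json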
        !kaggle datasets download -d myksust/fra-eng
        !unzip fra-eng.zip
        !rm _about.txt
        !rm fra-eng.zip
        !mv fra.txt ./data

def prepare_fra_eng_dataset():
    with open(dataset_fra_eng_path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')
    
    encoder_input_data_str, decoder_input_data_str, decoder_output_data_str = [], [], []
    
    max_len_src = 0
    max_len_tgt = 0
    for idx, line in enumerate(lines):
        if len(line) == 0:
            continue
        if idx == 10000:
            break
        source, target = line.split('\t')

        encoder_input_data_str.append(source)
        decoder_input_data_str.append('\t' + target)
        decoder_output_data_str.append(target + '\n')
        
    return encoder_input_data_str, decoder_input_data_str, decoder_output_data_str
        
download_fra_eng_dataset()
encoder_input_data_str, decoder_input_data_str, decoder_output_data_str = prepare_fra_eng_dataset()

In [None]:
df_eng_fra = pd.DataFrame({'English': encoder_input_data_str[:50],
                           'French (input)': decoder_input_data_str[:50],
                           'French (outputs)': decoder_output_data_str[:50]})
display(df_eng_fra)

> **Your task**:  
> The function `prepare_fra_eng_dataset` collects the English sentences (source) and the French sentences (target) as strings. Because this lab tokenizes at the character level, sequence length is measured in characters. Please adjust the arguments `MAX_SEQUENCE_LENGTH_ENG` and `MAX_SEQUENCE_LENGTH_FRA` based on what you observe in the sequence length histograms.

In [None]:
plt.hist([len(x_str) for x_str in encoder_input_data_str], bins=10)
plt.ylabel('occurrences')
plt.xlabel('Number of Characters in Sentence (English)')
plt.show()

plt.hist([len(x_str) for x_str in decoder_output_data_str], bins=10)
plt.ylabel('occurrences')
plt.xlabel('Number of Characters in Sentence (French)')
plt.show()

MAX_SEQUENCE_LENGTH_ENG = 16
MAX_SEQUENCE_LENGTH_FRA = 32

### Tokenization
----
Use `tf.keras.preprocessing.text.Tokenizer` to encode each token as an integer code according to its number of occurrences.

**Notice:** Because the dataset is relatively small, we use **character-based** tokenization.

> **Your task**:  
> Adjust the arguments `MAX_NUM_DICT_TOKENS_ENG` and `MAX_NUM_DICT_TOKENS_FRA` for English and French so that uncommon characters (e.g., occurrences < 10) are removed when the string-based sequences are converted to integer-based codes.
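
One way to pick the cap is to count how many distinct characters reach the occurrence threshold. The sketch below is a hypothetical helper, not part of Keras; it assumes lowercased input and adds one to the count because index 0 is reserved for padding and only codes below the cap are kept.

```python
from collections import Counter


def suggest_num_tokens(texts, min_count=10):
    """Hypothetical helper: cap that keeps characters seen at least min_count times."""
    counts = Counter(ch for text in texts for ch in text.lower())
    kept = sum(1 for c in counts.values() if c >= min_count)
    # +1 because index 0 is reserved for padding and is never assigned to a token
    return kept + 1


demo_texts = ['aaaa bbb', 'aab c', 'abz']
suggested = suggest_num_tokens(demo_texts, min_count=3)
print(suggested)
```

On the real data you would call the helper on `encoder_input_data_str` and on the French strings separately, then compare the result with the statistics table shown by `display_tokenizer_stat`.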

In [None]:
MAX_NUM_DICT_TOKENS_ENG = 45
MAX_NUM_DICT_TOKENS_FRA = 66


# English part
tokenizer_eng = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_NUM_DICT_TOKENS_ENG, char_level=True)
tokenizer_eng.fit_on_texts(encoder_input_data_str)

encoder_input_data = tokenizer_eng.texts_to_sequences(encoder_input_data_str)
display_tokenizer_stat(tokenizer_eng, MAX_NUM_DICT_TOKENS_ENG)


# French part
tokenizer_fra = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_NUM_DICT_TOKENS_FRA, char_level=True)
tokenizer_fra.fit_on_texts([sentence + '\n' for sentence in decoder_input_data_str])

decoder_input_data = tokenizer_fra.texts_to_sequences(decoder_input_data_str)
decoder_output_data = tokenizer_fra.texts_to_sequences(decoder_output_data_str)
display_tokenizer_stat(tokenizer_fra, MAX_NUM_DICT_TOKENS_FRA)

### Sequences Padding
----
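The cell below pads each integer sequence at the end (`padding='post'`) and then one-hot encodes every position. A small pure-Python sketch of the resulting layout, using a hypothetical `one_hot` helper and a toy vocabulary of five codes:

```python
def one_hot(seq, num_classes):
    """Hypothetical helper: turn each integer code into a one-hot row."""
    return [[1.0 if k == idx else 0.0 for k in range(num_classes)] for idx in seq]


padded_demo = [3, 1, 0, 0]  # two real tokens followed by two padding positions
encoded_demo = one_hot(padded_demo, num_classes=5)

for row in encoded_demo:
    print(row)
```

Note that padding positions are not all-zero rows: code 0 becomes a one-hot vector on class 0, so the model effectively learns a padding class.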

In [None]:
encoder_input_data = tf.keras.preprocessing.sequence.pad_sequences(encoder_input_data,
                                                                   maxlen=MAX_SEQUENCE_LENGTH_ENG,
                                                                   padding='post',
                                                                   truncating='post')
encoder_input_data = tf.keras.utils.to_categorical(encoder_input_data, num_classes=MAX_NUM_DICT_TOKENS_ENG)

decoder_input_data = tf.keras.preprocessing.sequence.pad_sequences(decoder_input_data,
                                                                   maxlen=MAX_SEQUENCE_LENGTH_FRA,
                                                                   padding='post',
                                                                   truncating='post')
decoder_input_data = tf.keras.utils.to_categorical(decoder_input_data, num_classes=MAX_NUM_DICT_TOKENS_FRA)


decoder_output_data = tf.keras.preprocessing.sequence.pad_sequences(decoder_output_data,
                                                                    maxlen=MAX_SEQUENCE_LENGTH_FRA,
                                                                    padding='post',
                                                                    truncating='post')
decoder_output_data = tf.keras.utils.to_categorical(decoder_output_data, num_classes=MAX_NUM_DICT_TOKENS_FRA)

### Sequence to Sequence Autoencoder Model
----
Please refer to **week5.pptx** for more details

#### Create Model and Training
----

In [None]:
LSTM_LATENT_DIMENSION = 256


# ENCODER
encoder_input = tf.keras.layers.Input(shape=(None, MAX_NUM_DICT_TOKENS_ENG),
                                      name='encoder_input')
encoder_lstm = tf.keras.layers.CuDNNLSTM(LSTM_LATENT_DIMENSION,
                                    return_state=True,
                                    name='encoder_lstm')
encoder_outputs, state_h, state_c = encoder_lstm(encoder_input)
encoder_states = [state_h, state_c]


# DECODER
decoder_input = tf.keras.layers.Input(shape=(None, MAX_NUM_DICT_TOKENS_FRA),
                                      name='decoder_input')
decoder_lstm = tf.keras.layers.CuDNNLSTM(LSTM_LATENT_DIMENSION,
                                    return_sequences=True,
                                    return_state=True,
                                    name='decoder_lstm')
decoder_lstm_out, _, _ = decoder_lstm(decoder_input,
                                      initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(MAX_NUM_DICT_TOKENS_FRA,
                                      activation='softmax',
                                      name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_out)


# MODEL
model = tf.keras.models.Model([encoder_input, decoder_input], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
tf.keras.utils.plot_model(model, show_layer_names=True, show_shapes=True)

In [None]:
history = model.fit([encoder_input_data, decoder_input_data],
                    decoder_output_data,
                    batch_size=64,
                    epochs=100,
                    validation_split=0.1)

In [None]:
plt.figure(figsize=(12,6))
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.legend()
plt.grid()
plt.ylabel('loss')
plt.xlabel('epoch')
plt.ylim((0.0, 1.5))
plt.show()

#### Predict Sentences with the Trained Sequence-to-Sequence Model
----
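The cells below decode one character at a time: the decoder is fed the previous character and the current states, the most probable next character is chosen, and the loop stops at the end marker or at the maximum length. A minimal sketch of that greedy loop, with a hypothetical `toy_step` function standing in for the trained decoder:

```python
def greedy_decode(step_fn, start_token, end_token, init_state, max_len):
    """Pick the most probable token at each step until end_token or max_len."""
    token, state, decoded = start_token, init_state, []
    while len(decoded) < max_len:
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == end_token:
            break
        decoded.append(token)
    return decoded


def toy_step(prev_token, state):
    """Hypothetical stand-in for the decoder: follows a fixed script of tokens."""
    script = [2, 3, 2, 1]  # token 1 plays the role of the end marker
    best = script[min(state, len(script) - 1)]
    probs = [0.1] * 4
    probs[best] = 0.7
    return probs, state + 1


decoded_demo = greedy_decode(toy_step, start_token=0, end_token=1, init_state=0, max_len=10)
print(decoded_demo)
```

Unlike the notebook's `decode_sequence`, this sketch does not append the end marker to its output; the stopping logic is otherwise the same.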

In [None]:
encoder_model = tf.keras.models.Model(encoder_input, encoder_states)

decoder_state_input_h = tf.keras.layers.Input(shape=(LSTM_LATENT_DIMENSION,),
                                              name='decoder_state_input_h')
decoder_state_input_c = tf.keras.layers.Input(shape=(LSTM_LATENT_DIMENSION,),
                                              name='decoder_state_input_c')
decoder_states_input = [decoder_state_input_h, decoder_state_input_c]

decoder_lstm_out, state_h, state_c = decoder_lstm(decoder_input, initial_state=decoder_states_input)
decoder_states = [state_h, state_c]
decoder_output = decoder_dense(decoder_lstm_out)

decoder_model = tf.keras.models.Model([decoder_input] + decoder_states_input,
                                      [decoder_output] + decoder_states)
decoder_model.summary()

tf.keras.utils.plot_model(decoder_model, show_layer_names=True, show_shapes=True)

In [None]:
token_fra_start_idx = tokenizer_fra.texts_to_sequences([['\t']])[0][0]
token_fra_end_idx = tokenizer_fra.texts_to_sequences([['\n']])[0][0]

def decode_sequence(input_seq):
    # Get the states h and c from source language (English)
    states_value = encoder_model.predict(input_seq)

    # Sampling loop for a batch of sequences
    stop_condition = False
    
    # Decoder input for the first step: one-hot encoding of the start character
    decoded_sentence_onehot = np.zeros((1, 1, MAX_NUM_DICT_TOKENS_FRA))
    decoded_sentence_onehot[0, 0, token_fra_start_idx] = 1
    
    # Decoded sentence, accumulated one character at a time
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([decoded_sentence_onehot] + states_value)

        # Sample a token
        token_idx = np.argmax(output_tokens[0, -1, :])
        decoded_sentence += tokenizer_fra.index_word[token_idx]

        # Exit condition: either hit max length
        # or find stop character.
        if (token_idx == token_fra_end_idx or len(decoded_sentence) >= MAX_SEQUENCE_LENGTH_FRA):
            stop_condition = True

        # Update the target sequence (of length 1).
        decoded_sentence_onehot = np.zeros((1, 1, MAX_NUM_DICT_TOKENS_FRA))
        decoded_sentence_onehot[0, 0, token_idx] = 1

        # Update states
        states_value = [h, c]
    
    # Return the decoded sentence as a character string
    return decoded_sentence


In [None]:
decoder_output_data_str_predicted = []
for i in range(0, 50):
    decoder_output_data_str_predicted.append(decode_sequence(encoder_input_data[i:i+1, :]))

df_eng_fra_predict = pd.DataFrame({'English': encoder_input_data_str[:50],
                                   'French (Actual)': decoder_input_data_str[:50],
                                   'French (Predict)': decoder_output_data_str_predicted[:50]})
display(df_eng_fra_predict)


**Once finished, please Submit Your Colab Notebook [Here](https://forms.gle/2VZkhTyH59hZhD249)**

----

**NOTICE:** It takes time to train a sequence to sequence autoencoder. For your convenience:
- Both Lab1 and Lab2 are required
- Lab3 is optional