# Algorithms for Big Data - Exercise 9
This exercise focuses on text generation with recurrent neural networks in Keras.

It shows how to train word-level and character-level LSTM models in TensorFlow that predict the next token of a text, and how to sample new text from their predictions.

You can download the dataset from the course [GitHub repository](https://github.com/rasvob/2020-21-ARD/tree/master/datasets).


[Open in Google colab](https://colab.research.google.com/github/rasvob/2020-21-ARD/blob/master/abd_09.ipynb)
[Download from Github](https://github.com/rasvob/2020-21-ARD/blob/master/abd_09.ipynb)

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import matplotlib.pyplot as plt # plotting
import matplotlib.image as mpimg # images
import numpy as np #numpy
import seaborn as sns
import tensorflow as tf
import tensorflow.compat.v2 as tf #use tensorflow v2 as a main 
import tensorflow.keras as keras # required for high level applications
from sklearn.model_selection import train_test_split # split for validation sets
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import normalize # normalization of the matrix
import scipy
import pandas as pd
import unicodedata # used by remove_non_ascii below
import nltk # FreqDist for word counts

tf.version.VERSION

'2.3.0'

In [2]:
import requests
from textblob import TextBlob # tokenization of the joined text
from typing import List, Tuple

In [3]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

# We need to download the data first and split the text into lines

In [None]:
req = requests.get('https://raw.githubusercontent.com/rasvob/2020-21-ARD/master/datasets/hp1.txt', allow_redirects=True)

In [None]:
txt = str(req.text).splitlines()

In [None]:
txt[:20]

## We can see that the text is far from perfect because there is some noise in the data, as in the last lecture
We need to preprocess the text so that it is suitable for the RNN. We need to remove blank lines and chapter headers. To simplify the task, we will also partially remove the punctuation for now. The final step is joining the text into one big string.

In [None]:
txt = txt[3:]
txt[:10]

#### Remove the chapter header with chapter name
We will remove the blank lines in this part as well.

In [None]:
txt = [x for x in txt if 'CHAPTER ' not in x]
txt[:10]

In [None]:
txt = [x for x in txt if not x.upper() == x]
txt[:10]

### There are other minor imperfections connected to the 't suffix that we need to fix

In [None]:
[x for x in txt if "\'" in x][25:30]

In [None]:
txt = [x.replace('"', '') for x in txt]
[x for x in txt if "a squeaky voice that" in x]

### We will join the text into one long line and tokenize it like last time

In [None]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

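def fix_nt(words):
    """Glue a detached "n't"/"nt" token back onto the preceding word."""
    st_res = []
    for i, word in enumerate(words):
        if word in ("n't", "nt"):
            continue  # already merged into the previous word
        if i + 1 < len(words) and words[i+1] in ("n't", "nt"):
            st_res.append(word + "n't")
        else:
            st_res.append(word)  # the last word is kept as well
    return st_res

def fix_s(words):
    """Glue a detached "'s" token back onto the preceding word."""
    st_res = []
    for i, word in enumerate(words):
        if word == "'s":
            continue  # already merged into the previous word
        if i + 1 < len(words) and words[i+1] == "'s":
            st_res.append(word + "'s")
        else:
            st_res.append(word)  # the last word is kept as well
    return st_res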

def normalize(words):
    words = remove_non_ascii(words)
    words = fix_nt(words)   
    words = fix_s(words)
    return words



In [None]:
txt_one_line = ' '.join(txt)

In [None]:
txt_one_line[:300]

In [None]:
tokenized = TextBlob(txt_one_line).words

In [None]:
tokenized = normalize(tokenized)

### The n't suffix should be fixed now (TextBlob's tokenization is far from ideal here)

In [None]:
[x for x in tokenized if "'s" in x or "n't" in x][:10]

### Final step of the preprocessing is joining the tokenized text back into fixed length sequences

### We distinguish 4 prediction modes for RNNs
 - 1:1 - One word is classified as one of the classes, e.g. POS tag
 - 1:N - One word is classified into multiple classes, not very common
 - N:1 - Very common, e.g. sentiment analysis
 - N:N - Also very common, e.g. machine translation, text generation
 
![rnn_pred](https://github.com/rasvob/2020-21-ARD/raw/master/images/rnn_pred.jpeg)
 
We need to define training vectors of the same length. There are multiple approaches to text generation: N:1 or N:N. The problem with the N:N approach is that it generates fixed-length sequences. Thus it is wise to transform the task into an N:1 classification task with N words in the training vector. The network will predict the next word for the input sequence, which is basically a classification task over the vocabulary.

#### Sequence length is a very important hyper-parameter!!
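
To make the N:1 framing concrete, here is a minimal sketch on a toy token list. The sentence, the helper name `make_windows` and the window length of 3 are assumptions chosen for illustration, not values from this exercise. Each run of N words becomes one training input, and the word right after it becomes the target.

```python
from typing import List, Tuple

def make_windows(tokens: List[str], n: int) -> Tuple[List[List[str]], List[str]]:
    """Return (inputs, targets): every run of n tokens and the token that follows it."""
    inputs, targets = [], []
    for i in range(len(tokens) - n):
        inputs.append(tokens[i:i + n])
        targets.append(tokens[i + n])
    return inputs, targets

# Toy sentence, purely illustrative
toy_tokens = "the boy who lived in the cupboard".split()
toy_inputs, toy_targets = make_windows(toy_tokens, n=3)

for window, target in zip(toy_inputs, toy_targets):
    print(window, "->", target)
```

A longer window gives the network more context per sample but yields fewer samples and longer training, which is one reason the sequence length needs tuning.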


# Let's take a look at the vocabulary size

In [None]:
dist = nltk.FreqDist(tokenized)

### We have 6829 unique words

In [None]:
len(dist)

In [None]:
most_common_words = sorted(list(dist.items()), key=lambda x: x[1], reverse=True)[:30]
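
The same kind of counts can be cross-checked with the standard library. The sketch below uses `collections.Counter` on a toy token list (the sentence and the `toy_` names are assumptions for illustration); it is meant as a sanity check, not a replacement for the `nltk.FreqDist` call above.

```python
from collections import Counter

# Toy token list, purely illustrative
toy_tokens = "the cat sat on the mat and the dog sat too".split()
toy_dist = Counter(toy_tokens)

toy_unique = len(toy_dist)             # number of unique words (vocabulary size)
toy_total = sum(toy_dist.values())     # number of words in the whole corpus
toy_top = toy_dist.most_common(2)      # (word, count) pairs, highest count first

print(toy_unique, toy_total, toy_top)
```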

In [None]:
fig, ax = plt.subplots(1, figsize=(20, 14)) # plt.subplots returns (figure, axes)
sns.barplot(x=[x[0] for x in most_common_words], y=[x[1] for x in most_common_words], ax=ax)

## We have 78301 words in the whole corpus

In [None]:
len(tokenized)

In [None]:
def create_vectors(tokens, sequence_length:int) -> Tuple[List[str], List[str]]:
    X, y = [], []
    
    for i in range(0, len(tokens) - sequence_length - 1):
        seq, word = tokens[i:i+sequence_length], tokens[i + sequence_length]
        X.append(' '.join(seq))
        y.append(word)
        
    return X, y

In [None]:
SEQ_LEN = 20

In [None]:
X, y = create_vectors(tokenized, SEQ_LEN)

In [None]:
X[0]

In [None]:
X[1]

In [None]:
y[0]

In [None]:
len(X)

In [None]:
from tensorflow import string as tf_string
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [None]:
embedding_dim = 50 # Dimension of the embedded representation - this is already part of the latent space and captures some dependency among words; these vectors are learned by the ANN
vocab_size = 7000 # Maximum number of unique tokens in the vocabulary
sequence_length = SEQ_LEN # Output dimension after vectorizing - words in the vectorized representation are independent

vect_layer = TextVectorization(standardize=None, max_tokens=vocab_size, output_mode='int', output_sequence_length=sequence_length)
vect_layer.adapt(X)

# Final step is integer encoding of the target words into numbers according to the defined vocabulary

In [None]:
vect_layer.get_vocabulary()[:10]

In [None]:
vocab = vect_layer.get_vocabulary()

In [None]:
dict_vocab = {vocab[i]: i  for i in range(len(vocab))}

In [None]:
len(vect_layer.get_vocabulary())

In [None]:
vocabulary_size = len(vect_layer.get_vocabulary())

In [None]:
y_enc = [dict_vocab[x] for x in y]
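
As a small illustration of the integer encoding, the sketch below maps target words to ids on a toy vocabulary. The vocabulary, the `encode_targets` helper and the fallback to an unknown-word id are assumptions made for this example; the cell above uses a plain dictionary lookup, which raises a `KeyError` for a word that is missing from the vocabulary.

```python
# Toy vocabulary; indices 0 and 1 are assumed to be reserved
# for padding ('') and unknown words ('[UNK]').
toy_vocab = ["", "[UNK]", "the", "boy", "lived"]
toy_index = {word: i for i, word in enumerate(toy_vocab)}

def encode_targets(words, index, unk_id=1):
    """Map each target word to its integer id, falling back to unk_id."""
    return [index.get(w, unk_id) for w in words]

toy_y = ["boy", "lived", "wizard"]   # 'wizard' is not in the toy vocabulary
toy_y_enc = encode_targets(toy_y, toy_index)
print(toy_y_enc)
```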

In [None]:
from tensorflow.compat.v1.keras.layers import CuDNNGRU, CuDNNLSTM
from tensorflow.keras.layers import LSTM, GRU, Bidirectional

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y_enc, test_size=0.2, random_state=13)

# We can define our model and train it using created sequences

In [None]:
input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = keras.layers.Embedding(vocab_size, embedding_dim)(x_v)
x = LSTM(512, return_sequences=True)(emb)
x = LSTM(256, return_sequences=False)(x)
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(64, 'relu')(x)
x = keras.layers.Dropout(0.2)(x)
output_layer = keras.layers.Dense(vocabulary_size, activation=tf.nn.softmax)(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer='rmsprop', loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False), metrics=['accuracy'])

In [None]:
es = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=700, restore_best_weights=True)

batch_size = 128
epochs = 5
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), callbacks=[es], epochs=epochs, batch_size=batch_size)

In [None]:
show_history(history)

In [None]:
X[0]

In [None]:
y_pred = model.predict([X[0]])

## Softmax gives you a probability for every word in the vocabulary, and these probabilities sum to 1
We need to choose the word with the highest probability.

In [None]:
y_pred

In [None]:
y_pred = np.argmax(y_pred[0])

In [None]:
y_pred

In [None]:
vocab[y_pred]

#### We won't use probabilities directly but we will sample from the predicted outputs using Temperature Softmax [see this](https://medium.com/@majid.ghafouri/why-should-we-use-temperature-in-softmax-3709f4e0161)

Basically, the idea is to re-weight the probability distribution so that you can control how surprising (i.e. higher temperature/entropy) or predictable (i.e. lower temperature/entropy) the next selected word or character will be.

In [None]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
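
To see what the temperature does before any sampling happens, the following sketch applies the same re-weighting (log, divide by the temperature, exponentiate, renormalize) to a toy distribution. The probabilities and the helper name `reweight` are assumptions for illustration only.

```python
import math

def reweight(probs, temperature):
    """Re-scale a distribution: p_i ** (1 / T), then renormalize so it sums to 1."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Toy distribution over three words, purely illustrative
toy_probs = [0.7, 0.2, 0.1]
for t in (0.5, 1.0, 10.0):
    print(t, [round(p, 3) for p in reweight(toy_probs, t)])
```

A temperature below 1 sharpens the distribution towards the most likely word, a temperature of 1 leaves it unchanged, and a large temperature such as 10 flattens it until sampling is close to picking words uniformly at random.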

First we need a 20-word sequence called the seed text. Our model uses the seed text to predict the next word, then we append the newly generated word to the seed text, keep only the last 20 words, and predict again. We repeat this process to generate new text.

In [None]:
paragraph = X[0]
whole_text = paragraph
for i in range(50):
    y_pred = model.predict([paragraph])
    y_pred = sample(y_pred[0], 10)
    word = vocab[y_pred]
    paragraph += f' {word}'
    whole_text += f' {word}'
    tokens = paragraph.split()
    paragraph = ' '.join(tokens[-SEQ_LEN:])
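
The rolling-window logic of the loop above can be tried without a trained network. In the sketch below, `fake_predict` is a hypothetical stand-in for `model.predict` followed by `sample`, and the window size of 3 and the word list are assumptions for illustration.

```python
SEQ_LEN_TOY = 3  # illustrative window size, not the value used in the exercise
TOY_WORDS = ["harry", "said", "the", "owl"]

def fake_predict(window: str) -> str:
    """Hypothetical stand-in for the model: picks a word deterministically."""
    return TOY_WORDS[len(window) % len(TOY_WORDS)]

def generate(seed: str, n_words: int) -> str:
    """Append n_words to the seed, always predicting from the last SEQ_LEN_TOY words."""
    window, whole_text = seed, seed
    for _ in range(n_words):
        word = fake_predict(window)
        whole_text += f" {word}"
        # slide the window: drop the oldest word, keep the newest
        window = " ".join(f"{window} {word}".split()[-SEQ_LEN_TOY:])
    return whole_text

toy_text = generate("mr and mrs", 5)
print(toy_text)
```

The generated text keeps growing, while the model input stays at a fixed length, which is what allows text of any length to be produced.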

In [None]:
X[0]

In [None]:
whole_text

# We can even use pre-trained embedding

# We need to download the embedding files
~~~
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
~~~

The 50-dimensional GloVe file is also available here: https://vsb.ai/vsbai/static/data/glove.6B.50d.txt

# First we need to load the file to memory and create embedding dictionary

In [None]:
path_to_glove_file = './data/glove.6B.50d.txt'

embeddings_index = {}
with open(path_to_glove_file, encoding="utf8") as f: # explicit encoding, independent of the platform default
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))
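
The parsing step can be checked on a tiny in-memory example. The two lines below are made up and only mimic the layout the loop above relies on (a word followed by space-separated coefficients); the real file has 50 coefficients per line.

```python
import io

# Two made-up lines in the assumed "word coef coef ..." layout
toy_file = io.StringIO("the 0.1 -0.2 0.3\nowl 0.5 0.0 -1.5\n")

toy_embeddings = {}
for line in toy_file:
    # split off the word; the rest of the line holds the coefficients
    word, coefs = line.split(maxsplit=1)
    toy_embeddings[word] = [float(c) for c in coefs.split()]

print(len(toy_embeddings), toy_embeddings["owl"])
```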

## We need to get the vocabulary from the Vectorizer and the integer indexes

In [None]:
embedding_dim = 50 # Dimension of the embedded representation - this is already part of the latent space and captures some dependency among words; these vectors are learned by the ANN
vocab_size = 7000 # Maximum number of unique tokens in the vocabulary
sequence_length = SEQ_LEN # Output dimension after vectorizing - words in the vectorized representation are independent

vect_layer = TextVectorization(standardize=None, max_tokens=vocab_size, output_mode='int', output_sequence_length=sequence_length)
vect_layer.adapt(X)

In [None]:
voc = vect_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [None]:
len(voc)

In [None]:
voc[:10]

In [None]:
num_tokens = len(voc) + 2
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
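
The hit and miss bookkeeping is easier to follow on a toy case. The sketch below uses plain lists instead of NumPy arrays to stay dependency-free; the vectors, the vocabulary and the `toy_` names are assumptions for illustration.

```python
toy_dim = 3
# Made-up "pre-trained" vectors
toy_vectors = {"the": [0.1, -0.2, 0.3], "owl": [0.5, 0.0, -1.5]}
# Toy vocabulary with the padding and unknown tokens at indices 0 and 1
toy_voc = ["", "[UNK]", "the", "hagrid", "owl"]

# One row per vocabulary index; a row stays all-zero when no vector is found
toy_matrix = [[0.0] * toy_dim for _ in toy_voc]
toy_hits, toy_misses = 0, 0
for i, word in enumerate(toy_voc):
    vector = toy_vectors.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)
        toy_hits += 1
    else:
        toy_misses += 1

print("Converted %d words (%d misses)" % (toy_hits, toy_misses))
```

Words that are rare or specific to the book are the most likely misses; with a trainable embedding layer their all-zero rows can still be updated during training.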

In [None]:
input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = keras.layers.Embedding(num_tokens, embedding_dim, embeddings_initializer=keras.initializers.Constant(embedding_matrix), trainable=True)(x_v)
x = LSTM(128, return_sequences=True)(emb)
x = LSTM(128, return_sequences=True)(x)
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(128, 'relu')(x)
x = keras.layers.Dense(64, 'relu')(x)
x = keras.layers.Dropout(0.5)(x)
output_layer = keras.layers.Dense(vocabulary_size, activation=tf.nn.softmax)(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer='adam', loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False), metrics=['accuracy'])

#### Let's try to train the model for a much longer time

In [None]:
es = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=700, restore_best_weights=True)

batch_size = 128
# epochs = 50
epochs = 5

history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), callbacks=[es], epochs=epochs, batch_size=batch_size)

In [None]:
show_history(history)

In [None]:
paragraph = X[0]
whole_text = paragraph
for i in range(50):
    y_pred = model.predict([paragraph])
    y_pred = sample(y_pred[0], 1)
    word = vocab[y_pred]
    paragraph += f' {word}'
    whole_text += f' {word}'
    tokens = paragraph.split()
    paragraph = ' '.join(tokens[-SEQ_LEN:])

In [None]:
X[0]

In [None]:
whole_text

# You can see that we are able to generate text of any length using this approach; unfortunately, the task is quite complex for a model this simple and a relatively small dataset
## The text usually doesn't make much sense, as you could see

# Another approach is to create a character-level model which learns how to write from scratch
## We will try to train this model and compare the obtained results

#### We will simplify the task by using only lower case letters

In [None]:
txt_one_line = txt_one_line.lower()

In [None]:
txt_one_line[:100]

In [None]:
letters = []
for x in txt_one_line:
    if 'a' <= x <= 'z' or x == ' ':
        letters.append(x)

In [None]:
letters[:10]

# We have a corpus of 412 325 characters available

In [None]:
len(letters)

In [None]:
chars = sorted(list(set(letters)))
print("Total chars:", len(chars))

In [None]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [None]:
char_indices

## We need to create fixed length sequences once again for prediction of the next character

In [None]:
SEQ_LEN = 40
step = 1
X, y = [], []
for i in range(0, len(letters) - SEQ_LEN, step):
    seq, ch = letters[i:i+SEQ_LEN], letters[i + SEQ_LEN]
    X.append(seq)
    y.append(ch)

In [None]:
X[0]

In [None]:
X[1]

In [None]:
y[0]

# OHE (one-hot encoding) is used for the character-level RNN, so we need to encode our characters

In [None]:
X_ohe = np.zeros((len(X), SEQ_LEN, len(chars)), dtype=bool) # the np.bool alias was removed in NumPy 1.24
y_ohe = np.zeros((len(X), len(chars)), dtype=bool)
for i, sentence in enumerate(X):
    for t, char in enumerate(sentence):
        X_ohe[i, t, char_indices[char]] = 1
    y_ohe[i, char_indices[y[i]]] = 1
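
A toy version of the encoding shows what one row of the tensor looks like. The alphabet of four characters and the helper `one_hot` are assumptions for illustration, and plain lists are used instead of NumPy arrays to keep the sketch dependency-free.

```python
# Toy alphabet, sorted the same way as the real one
toy_chars = sorted(set("abc "))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

def one_hot(sequence):
    """Encode a string as a list of 0/1 rows, one row per character."""
    rows = []
    for ch in sequence:
        row = [0] * len(toy_chars)
        row[toy_char_indices[ch]] = 1   # exactly one position is set per character
        rows.append(row)
    return rows

toy_encoded = one_hot("a cab")
for ch, row in zip("a cab", toy_encoded):
    print(repr(ch), row)
```

Each sequence therefore becomes a matrix with one row per time step and one column per character, which matches the `(SEQ_LEN, len(chars))` input shape of the character-level model.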

In [None]:
X_ohe[120]

In [None]:
y_ohe[0]

In [None]:
X_ohe.shape

In [None]:
input_layer = keras.layers.Input(shape=(SEQ_LEN, len(chars)))
x = LSTM(128, return_sequences=True)(input_layer)
x = LSTM(128, return_sequences=False)(x)
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(256, 'relu')(x)
x = keras.layers.Dense(128, 'relu')(x)
x = keras.layers.Dropout(0.2)(x)
output_layer = keras.layers.Dense(len(chars), activation=tf.nn.softmax)(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer='rmsprop', loss=keras.losses.CategoricalCrossentropy(), metrics=['accuracy'])

In [None]:
es = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=700, restore_best_weights=True)

batch_size = 128
epochs = 50

history = model.fit(X_ohe, y_ohe, validation_split=0.2, callbacks=[es], epochs=epochs, batch_size=batch_size)

In [None]:
X_ohe[0].reshape((1, SEQ_LEN, len(chars)))

In [None]:
y_pred = model.predict(X_ohe[0].reshape((1, SEQ_LEN, len(chars))))[0]

In [None]:
y_pred

In [None]:
c = sample(y_pred)
indices_char[c]

In [None]:
whole_text = X[10].copy()
seq = X[10].copy()
for i in range(500):
    paragraph_ohe = np.zeros((1, SEQ_LEN, len(chars)))
    for t, char in enumerate(seq):
        paragraph_ohe[0, t, char_indices[char]] = 1
    y_pred = model.predict(paragraph_ohe)
    c = sample(y_pred[0], 0.5)
    next_char = indices_char[c]
    whole_text.append(next_char)
    seq = whole_text[-SEQ_LEN:]

In [None]:
''.join(whole_text)

# Task for the lecture
 - Choose either word or character level model
 - Choose at least one other HP book (it's on my GitHub, link at the top)
 - Preprocess it the same way as the first one
 - Merge the books together
 - Use the pre-defined model from the lecture or your own and train it for a long time (epochs > 50)
 - Experiment a little - try different batch sizes, optimizers
 - Send me the Colab notebook with results and description of what you did and your final solution!