# Algorithms for Big Data - Exercise 7
This lecture focuses on more advanced examples of using RNNs for text generation.

We will use the Harry Potter books in this lecture to generate our own stories.

You can download the dataset from the course [GitHub](https://github.com/rasvob/2020-21-ARD/tree/master/datasets).


[Open in Google Colab](https://colab.research.google.com/github/rasvob/2020-21-ARD/blob/master/abd_08.ipynb)
[Download from GitHub](https://github.com/rasvob/2020-21-ARD/blob/master/abd_08.ipynb)

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import matplotlib.pyplot as plt # plotting
import matplotlib.image as mpimg # images
import numpy as np #numpy
import seaborn as sns
import tensorflow.compat.v2 as tf #use tensorflow v2 as a main 
import tensorflow.keras as keras # required for high level applications
from sklearn.model_selection import train_test_split # split for validation sets
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import normalize # normalization of the matrix
import scipy
import pandas as pd

tf.version.VERSION

'2.3.0'

In [2]:
import unicodedata, re, string
import nltk
from textblob import TextBlob

In [6]:
import requests

In [3]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

In [4]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/fei/svo0175/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# We need to download the data first and split text to lines

In [92]:
req = requests.get('https://raw.githubusercontent.com/rasvob/2020-21-ARD/master/datasets/hp1.txt', allow_redirects=True)

In [93]:
txt = str(req.text).splitlines()

In [94]:
txt[:20]

["Harry Potter and the Sorcerer's Stone",
 '',
 '',
 'CHAPTER ONE',
 '',
 'THE BOY WHO LIVED',
 '',
 'Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say',
 'that they were perfectly normal, thank you very much. They were the last',
 "people you'd expect to be involved in anything strange or mysterious,",
 "because they just didn't hold with such nonsense.",
 '',
 'Mr. Dursley was the director of a firm called Grunnings, which made',
 'drills. He was a big, beefy man with hardly any neck, although he did',
 'have a very large mustache. Mrs. Dursley was thin and blonde and had',
 'nearly twice the usual amount of neck, which came in very useful as she',
 'spent so much of her time craning over garden fences, spying on the',
 'neighbors. The Dursleys had a small son called Dudley and in their',
 'opinion there was no finer boy anywhere.',
 '']

## We can see that the text is far from perfect: as in the last lecture, there is some noise in the data
We need to preprocess the text to make it suitable for the RNN. We need to clear blank lines and remove chapter headers. To simplify the task, we will also partially strip the punctuation for now. The final step is joining the text into one big string.

In [95]:
txt = txt[3:]
txt[:10]

['CHAPTER ONE',
 '',
 'THE BOY WHO LIVED',
 '',
 'Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say',
 'that they were perfectly normal, thank you very much. They were the last',
 "people you'd expect to be involved in anything strange or mysterious,",
 "because they just didn't hold with such nonsense.",
 '',
 'Mr. Dursley was the director of a firm called Grunnings, which made']

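#### Remove the chapter headers and chapter names
Lines containing `CHAPTER ` are dropped first. Chapter names are written in capitals, so every line that equals its own uppercase form is dropped next; this removes the blank lines as well.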

In [96]:
txt = [x for x in txt if 'CHAPTER ' not in x]
txt[:10]

['',
 'THE BOY WHO LIVED',
 '',
 'Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say',
 'that they were perfectly normal, thank you very much. They were the last',
 "people you'd expect to be involved in anything strange or mysterious,",
 "because they just didn't hold with such nonsense.",
 '',
 'Mr. Dursley was the director of a firm called Grunnings, which made',
 'drills. He was a big, beefy man with hardly any neck, although he did']

In [97]:
txt = [x for x in txt if x.upper() != x]
txt[:10]

['Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say',
 'that they were perfectly normal, thank you very much. They were the last',
 "people you'd expect to be involved in anything strange or mysterious,",
 "because they just didn't hold with such nonsense.",
 'Mr. Dursley was the director of a firm called Grunnings, which made',
 'drills. He was a big, beefy man with hardly any neck, although he did',
 'have a very large mustache. Mrs. Dursley was thin and blonde and had',
 'nearly twice the usual amount of neck, which came in very useful as she',
 'spent so much of her time craning over garden fences, spying on the',
 'neighbors. The Dursleys had a small son called Dudley and in their']

### There are other minor imperfections connected to the 't suffix that we need to fix

In [98]:
[x for x in txt if "\'" in x][25:30]

['a squeaky voice that made passersby stare, "Don\'t be sorry, my dear sir,',
 "didn't approve of imagination.",
 "and it didn't improve his mood -- was the tabby cat he'd spotted that",
 '"Shoo!" said Mr. Dursley loudly. The cat didn\'t move. It just gave him a',
 "about Mrs. Next Door's problems with her daughter and how Dudley had"]

In [101]:
txt = [x.replace('"', '') for x in txt]
[x for x in txt if "a squeaky voice that" in x]

["a squeaky voice that made passersby stare, Don't be sorry, my dear sir,"]

### We will join the text to one long line and tokenize it like the last time

In [121]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

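def fix_nt(words):
    """Merge a split-off "n't"/"nt" token back into the preceding word."""
    st_res = []
    for i, word in enumerate(words):
        if word in ("n't", "nt"):
            continue  # already merged into the previous word
        # Look ahead; the last word has no successor and is kept as-is
        next_word = words[i + 1] if i + 1 < len(words) else None
        if next_word in ("n't", "nt"):
            st_res.append(word + "n't")
        else:
            st_res.append(word)
    return st_res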

def normalize(words):
    words = remove_non_ascii(words)
    words = fix_nt(words)
    return words



In [122]:
txt_one_line = ' '.join(txt)

In [123]:
txt_one_line[:300]

"Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. Mr. Dursley was the director of a fir"

In [124]:
tokenized = TextBlob(txt_one_line).words

In [125]:
tokenized = normalize(tokenized)

### The n't suffix, which the TextBlob tokenizer splits off, should be fixed now

In [129]:
[x for x in tokenized if "n't" in x][:10]

["didn't",
 "didn't",
 "hadn't",
 "didn't",
 "didn't",
 "didn't",
 "wasn't",
 "couldn't",
 "couldn't",
 "couldn't"]

### The final step of the preprocessing is joining the tokenized text back into one string and splitting it into fixed-length sequences
The training vectors need to have the same length.

In [133]:
long_sequence = ' '.join(tokenized)
long_sequence[:100]

'Mr and Mrs Dursley of number four Privet Drive were proud to say that they were perfectly normal tha'
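
The splitting into fixed-length sequences can be sketched as follows. This is only one common way to do it, not necessarily the approach used later in the course: a window slides over the word list and each window is paired with the word that follows it. The names `make_sequences`, `seq_len` and `step` are assumptions introduced for illustration.

```python
def make_sequences(words, seq_len=5, step=1):
    """Return (inputs, targets): fixed-length word windows and the word after each."""
    inputs, targets = [], []
    # Stop early enough that every window has a following word to predict.
    for start in range(0, len(words) - seq_len, step):
        inputs.append(words[start:start + seq_len])
        targets.append(words[start + seq_len])
    return inputs, targets


sample = 'Mr and Mrs Dursley of number four Privet Drive were proud to say'.split()
X, y = make_sequences(sample, seq_len=5, step=1)

for window, target in list(zip(X, y))[:3]:
    print(' '.join(window), '->', target)
```

Every input has exactly `seq_len` words, so the windows can be stacked into one training array once the words are mapped to integers.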

# Let's take a look at the data

In [None]:
df.shape

## We can see that the classification task is highly imbalanced, because we have only 2242 negative tweets compared with the positive ones

In [None]:
sns.countplot(x='label', data=df)

In [None]:
df.label.value_counts()

In [None]:
df['length'] = df.tweet.apply(len)

### We can see that the sentences are of similar lengths

In [None]:
sns.barplot(x='label', y='length', data = df)

# We can see that the text data are full of noise

- Social posts suffer the most from this effect
- The text is full of hashtags, emojis, @mentions and so on
- These parts usually don't influence the sentiment score by much
- However, the most advanced models usually extract even these features, because e.g. emojis can help with understanding sarcasm

In [None]:
for x in df.loc[:10, 'tweet']:
    print(x)
    print('---------')

## Stemming
Stemming is the process of reducing morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words “chocolates”, “chocolatey”, “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”.

## Lemmatization 
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming, but it brings context to the words, so it links words with similar meaning to one word.

Examples of lemmatization:

- rocks : rock
- corpora : corpus
- better : good
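
To make the difference concrete, the toy sketch below contrasts the two ideas. It is not the algorithm used by the NLTK stemmers or the WordNet lemmatizer: `toy_stem` strips a few hard-coded suffixes and `toy_lemmatize` uses a tiny hand-written lookup table. Both function names and their rules are assumptions made for illustration.

```python
SUFFIXES = ('ing', 'ed', 'es', 's')  # hypothetical, deliberately tiny rule set

# Hand-written lookup table; a real lemmatizer uses a full dictionary.
LEMMAS = {'rocks': 'rock', 'corpora': 'corpus', 'better': 'good'}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for w in ['rocks', 'corpora', 'better', 'retrieved']:
    print(w, '| stem:', toy_stem(w), '| lemma:', toy_lemmatize(w))
```

The suffix rules cannot map `better` to `good` or `corpora` to `corpus`, because no suffix connects them, while the lookup handles them but knows nothing about words outside its table.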

## Both techniques can be used in the preprocessing pipeline
You have to decide whether they are beneficial to you, because these steps generalize the data by themselves and you will definitely lose some information. If you use some form of pretrained embedding like Word2Vec or GloVe, it is better to skip these steps, because the embedding vocabulary skipped them as well.

In [None]:
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_numbers(words):
    """Remove all digit sequences from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r"\d+", "", word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    stop_words = set(stopwords.words('english'))  # build the set once, not per word
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    # words = remove_punctuation(words)
    words = remove_numbers(words)
    # words = remove_stopwords(words)
    return words

def form_sentence(tweet):
    tweet_blob = TextBlob(tweet)
    return tweet_blob.words

# Tokenize sentences and remove punctuation with the TextBlob library

In [None]:
df['Words'] = df['tweet'].apply(form_sentence)

In [None]:
df.head()

# Normalize sentences 
- We want only ascii, lowercase and no numbers

## You can experiment with different preprocessing steps!

In [None]:
df['Words_normalized'] = df['Words'].apply(normalize)

In [None]:
df.head()

## Remove the 'user' word from tweets

In [None]:
df['Words_normalized_no_user'] = df['Words_normalized'].apply(lambda x: [y for y in x if 'user' not in y])

In [None]:
df.head()

## We can see that no preprocessing is perfect and we have to fix some issues ourselves
- e.g. n't splitting

In [None]:
print(df.tweet.iloc[1])
print(df.Words_normalized_no_user.iloc[1])

In [None]:
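def fix_nt(words):
    """Merge a split-off "n't"/"nt" token back into the preceding word."""
    st_res = []
    for i, word in enumerate(words):
        if word in ("n't", "nt"):
            continue  # already merged into the previous word
        # Look ahead; the last word has no successor and is kept as-is
        next_word = words[i + 1] if i + 1 < len(words) else None
        if next_word in ("n't", "nt"):
            st_res.append(word + "n't")
        else:
            st_res.append(word)
    return st_res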

In [None]:
df['Words_normalized_no_user_fixed'] = df['Words_normalized_no_user'].apply(fix_nt)

## The issue is now fixed

In [None]:
print(df.tweet.iloc[1])
print(df.Words_normalized_no_user.iloc[1])
print(df.Words_normalized_no_user_fixed.iloc[1])

In [None]:
df['Clean_text'] = df['Words_normalized_no_user_fixed'].apply(lambda x: " ".join(x))

In [None]:
df.head()

# Let's take a look at the most common words in corpus

In [None]:
import itertools

In [None]:
all_words = list(itertools.chain(*df.Words_normalized_no_user_fixed))

In [None]:
dist = nltk.FreqDist(all_words)

In [None]:
dist
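
For readers without NLTK at hand, roughly the same frequency table can be built with `collections.Counter` from the standard library. The token lists below are made-up sample data, not rows from the dataset.

```python
from collections import Counter
import itertools

# Made-up tokenized documents standing in for df.Words_normalized_no_user_fixed
token_lists = [
    ['i', 'love', 'this', 'day'],
    ['this', 'is', 'not', 'a', 'good', 'day'],
    ['love', 'this'],
]

# Flatten the list of lists and count every token.
counts = Counter(itertools.chain.from_iterable(token_lists))

print(counts.most_common(3))
print('Unique words:', len(counts))
print('Longest document:', max(len(tokens) for tokens in token_lists))
```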

### We have 34289 unique words

In [None]:
len(dist)

### The longest tweet has 42 words

In [None]:
max(df.Words_normalized_no_user_fixed.apply(len))

# We will use the new TextVectorization layer to create a vector model from our text data
For those of you who are interested in the topic, there is a very good [article on Medium](https://towardsdatascience.com/you-should-try-the-new-tensorflows-textvectorization-layer-a80b3c6b00ee) about the layer and its parameters.

There is, of course, also a [documentation page](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) about the layer.


In [None]:
from tensorflow import string as tf_string
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [None]:
embedding_dim = 128 # Dimension of the embedded representation - this is already part of the latent space and captures some dependency among words; these vectors are learned in the ANN
vocab_size = 10000 # Number of unique tokens in the vocabulary
sequence_length = 30 # Output dimension after vectorizing - words in the vectorized representation are independent

vect_layer = TextVectorization(max_tokens=vocab_size, output_mode='int', output_sequence_length=sequence_length)
vect_layer.adapt(df.Clean_text.values)
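
As a rough mental model of what the layer does in `output_mode='int'`, the plain-Python sketch below builds a frequency-ordered vocabulary, reserves index 0 for padding and index 1 for out-of-vocabulary words, and pads or truncates every sequence to a fixed length. It is a simplification and not the Keras implementation (the real layer also lowercases and strips punctuation by default); `build_vocab` and `vectorize` are hypothetical helper names.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Most frequent words first; slots 0 and 1 are reserved for padding and OOV."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common


def vectorize(text, vocab, sequence_length):
    """Map words to integer indexes, then pad with 0 or truncate to sequence_length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()]  # 1 = out-of-vocabulary
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


texts = ['the cat sat', 'the cat ran away', 'the dog']
vocab = build_vocab(texts, max_tokens=4)  # keeps only the 2 most frequent words
print(vocab)
print(vectorize('the cat chased the dog', vocab, sequence_length=6))
```

Words that fall outside the `max_tokens` budget all share index 1, which is why a vocabulary that is too small throws information away.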



### We will split our dataset into train and test parts with stratification

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.Clean_text, df.label, test_size=0.20, random_state=13, stratify=df.label)

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=13, stratify=y_train)
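
Stratification means that the split is performed separately within each class, so every part keeps roughly the original class ratio. The sketch below illustrates the idea in plain Python on made-up labels; it is not how `train_test_split` is implemented, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=13):
    """Return (train_idx, test_idx) so that each class is split with the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indexes in by_class.values():
        rng.shuffle(indexes)
        n_test = round(len(indexes) * test_size)
        test_idx.extend(indexes[:n_test])
        train_idx.extend(indexes[n_test:])
    return train_idx, test_idx


labels = [0] * 90 + [1] * 10  # made-up imbalanced labels
train_idx, test_idx = stratified_split(labels, test_size=0.2)

print('Train:', Counter(labels[i] for i in train_idx))
print('Test: ', Counter(labels[i] for i in test_idx))
```

Without stratification, a random 20% sample of such data could by chance contain very few samples of the rare class, which would make the test metrics unreliable.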

In [None]:
print(X_train.shape, X_test.shape)

In [None]:
print('Train')
print(y_train.value_counts())
print('Test')
print(y_test.value_counts())

In [None]:
print('Vocabulary example: ', vect_layer.get_vocabulary()[:10])
print('Vocabulary shape: ', len(vect_layer.get_vocabulary()))

In [None]:
from tensorflow.compat.v1.keras.layers import CuDNNGRU, CuDNNLSTM
from tensorflow.keras.layers import LSTM, GRU, Bidirectional

In [None]:
input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = keras.layers.Embedding(vocab_size, embedding_dim)(x_v)
x = LSTM(64, activation='mish', return_sequences=True)(emb)
x = GRU(64, activation='mish', return_sequences=True)(x)
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(64, 'mish')(x)
x = keras.layers.Dense(32, 'mish')(x)
x = keras.layers.Dropout(0.2)(x)
output_layer = keras.layers.Dense(1, 'sigmoid')(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer='rmsprop', loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])


In [None]:
es = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=70, restore_best_weights=True)

batch_size = 128
epochs = 5
history = model.fit(X_train.values, y_train.values, validation_data=(X_valid.values, y_valid.values), callbacks=[es], epochs=epochs, batch_size=batch_size)

In [None]:
show_history(history)

In [None]:
y_test_loss, accuracy = model.evaluate(X_test, y_test)

In [None]:
y_pred = model.predict(X_test).ravel()

#### The sigmoid function gives us a real number in the range [0, 1].

#### We need to map these values to classes

In [None]:
y_pred

In [None]:
y_pred = [1 if x >= 0.5 else 0 for x in y_pred]

# We can see that accuracy is not the best metric in the imbalanced situation - why?
There are many more metrics we can use. One of the most common in this situation is the F1 score; see [this](https://en.wikipedia.org/wiki/F-score) and [this](https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/) for more info.
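
A small worked example shows why. Assume a made-up test set with 93 samples of class 0 and 7 samples of class 1, and a model that always predicts class 0. The helper functions below are written out by hand for illustration only; the notebook itself uses the `sklearn.metrics` functions.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: report 0 instead of dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [0] * 93 + [1] * 7  # made-up imbalanced labels
y_pred = [0] * 100           # "model" that always predicts the majority class

print('Accuracy:', accuracy(y_true, y_pred))  # 0.93
print('F1 score:', f1(y_true, y_pred))        # 0.0
```

The accuracy looks good although the model never finds a single sample of the rare class, while the F1 score of 0 exposes the problem.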

In [None]:
accuracy_score(y_true=y_test, y_pred=y_pred)

In [None]:
f1_score(y_true=y_test, y_pred=y_pred)

In [None]:
print(classification_report(y_true=y_test, y_pred=y_pred))

In [None]:
print(confusion_matrix(y_true=y_test, y_pred=y_pred))

# We don't have to train our own embedding
There are multiple embeddings available online which were trained on very large corpora, e.g. Wikipedia. Good examples are Word2Vec, GloVe or FastText. These embeddings contain fixed-length vectors for the words in the vocabulary.

We will use the GloVe embedding with 50-dimensional vectors. For more details see [this](https://nlp.stanford.edu/projects/glove/).
You can download the zip with the vectors from [http://nlp.stanford.edu/data/glove.6B.zip](http://nlp.stanford.edu/data/glove.6B.zip) (~800 MB).

#### Beware that the original text corpus was more general than the specific social media text data, so if you deal with very specific domains it may be beneficial to train your own embedding or at least fine-tune an existing one.

# We need to download the embedding files
~~~
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
~~~

The 50-dimensional GloVe file is also available here: https://vsb.ai/vsbai/static/data/glove.6B.50d.txt

# First we need to load the file into memory and create the embedding dictionary

In [None]:
path_to_glove_file = './data/glove.6B.50d.txt'

embeddings_index = {}
with open(path_to_glove_file, encoding='utf-8') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

In [None]:
embeddings_index['analysis']

## We need to get the vocabulary from the vectorizer and the integer indexes

In [None]:
embedding_dim = 50 # Dimension of the embedded representation - this is already part of the latent space and captures some dependency among words; here the vectors come from the pretrained GloVe file
vocab_size = 10000 # Number of unique tokens in the vocabulary
sequence_length = 20 # Output dimension after vectorizing - words in the vectorized representation are independent

vect_layer = TextVectorization(max_tokens=vocab_size, output_mode='int', output_sequence_length=sequence_length)
vect_layer.adapt(df.Clean_text.values)

In [None]:
voc = vect_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [None]:
voc[:10]

In [None]:
word_index['the']

In [None]:
embeddings_index['the']

In [None]:
num_tokens = len(voc) + 2
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

In [None]:
embedding_matrix[2]
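
The same procedure on toy data may make the indexing easier to follow. The three-dimensional vectors and the small vocabulary below are made up, and plain lists are used instead of NumPy arrays.

```python
embedding_dim = 3
# Made-up pretrained vectors standing in for the GloVe dictionary
toy_embeddings = {
    'the': [0.1, 0.2, 0.3],
    'cat': [0.4, 0.5, 0.6],
}
# Vocabulary in the layout returned by the vectorization layer: padding, OOV, then words
voc = ['', '[UNK]', 'the', 'cat', 'hogwarts']
word_index = dict(zip(voc, range(len(voc))))

# One row per token index; rows stay all-zero when no pretrained vector exists.
embedding_matrix = [[0.0] * embedding_dim for _ in voc]
hits, misses = 0, 0
for word, i in word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
        hits += 1
    else:
        misses += 1

print('Converted %d words (%d misses)' % (hits, misses))
print(embedding_matrix[word_index['cat']])       # row looked up for the token "cat"
print(embedding_matrix[word_index['hogwarts']])  # no pretrained vector -> zero row
```

The integer produced by the vectorization layer is simply the row number in this matrix, which is why the vocabulary order and the matrix rows have to match.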

In [None]:
input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = keras.layers.Embedding(num_tokens, embedding_dim, embeddings_initializer=keras.initializers.Constant(embedding_matrix), trainable=False)(x_v)
x = LSTM(64, activation='mish', return_sequences=True)(emb)
x = GRU(64, activation='mish', return_sequences=False)(x)
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(64, 'mish')(x)
x = keras.layers.Dense(32, 'mish')(x)
x = keras.layers.Dropout(0.2)(x)
output_layer = keras.layers.Dense(1, 'sigmoid')(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer='rmsprop', loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])


In [None]:
es = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=70, restore_best_weights=True)

batch_size = 128
epochs = 5
history = model.fit(X_train.values, y_train.values, validation_data=(X_valid.values, y_valid.values), callbacks=[es], epochs=epochs, batch_size=batch_size)

In [None]:
show_history(history)

In [None]:
y_pred = model.predict(X_test).ravel()
y_pred = [1 if x >= 0.5 else 0 for x in y_pred]
print(f'Accuracy: {accuracy_score(y_true=y_test, y_pred=y_pred)}')
print(f'F1 Score: {f1_score(y_true=y_test, y_pred=y_pred)}')
print(confusion_matrix(y_true=y_test, y_pred=y_pred))

# Task for the lecture
 - Try to create your own architecture
 - Experiment a little - try different batch sizes, optimizers, time lags as features, etc.
 - Send me the Colab notebook with results and description of what you did and your final solution!
 
# There is a competition for bonus points this week!
- Everyone who sends me a correct solution will be included in the F1 - Score toplist
- Deadline for the competition submission is Sunday 8th November at midnight
- The toplist will be publicly available on Monday
- There is no limitation in used layers (LSTM, CNN, ...), optimizers, etc. 
- You can use any model architecture from the internet including transfer learning
- The only limitation is that the model has to be trained/fine-tuned on Colab/Kaggle/Your machine so online sentiment scoring services are forbidden!

## The winner with the best F1 - Score on test set will be awarded with 5 bonus points
- The test set is the same as we used in the lecture