# Deep Learning - Exercise 6

This lecture focuses on using RNNs for text data analysis.

We will deal with the sentiment analysis task using Twitter data.

You can download the dataset from [this link](https://github.com/MohamedAfham/Twitter-Sentiment-Analysis-Supervised-Learning/tree/master/Data)

[Open in Google colab](https://colab.research.google.com/github/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_06.ipynb)
[Download from Github](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_06.ipynb)

##### Remember to set **GPU** runtime in Colab!

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np 
import pandas as pd
import seaborn as sns
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow import string as tf_string
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import LSTM, GRU, Bidirectional

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import normalize
import scipy
import itertools

tf.version.VERSION

# 📒 We will work with LSTM/GRU layers
* 🔎 Do you know anything about RNNs in general?
    * How are they different from FCNN?

![Meme02](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_06_meme_02.png?raw=true)

* How do a plain RNN layer and an LSTM/GRU layer differ?
    * 🔎 What issue of RNNs do they address?
* Can you imagine some use cases for RNNs?
    * Can you imagine some limits of ML/DL solutions in these use cases as well?
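
As a starting point for the discussion, the recurrence at the heart of a plain RNN can be sketched in a few lines. This is a minimal scalar sketch, not how Keras implements its layers; the function name `rnn_forward` and the weights `w_x`, `w_h` and `b` are made up for illustration.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar RNN over a sequence and return the hidden state after each step."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The same weights are reused at every time step,
        # and the new state depends on the previous one.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

early = rnn_forward([1.0, 0.0, 0.0])  # the signal arrives at the first step
late = rnn_forward([0.0, 0.0, 1.0])   # the same signal arrives at the last step
print(early[-1], late[-1])            # the final states differ: order matters
```

Both sequences contain the same values, but the final hidden state differs because the state is carried from step to step.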

## 🔎 We have some new packages today 🔎
* Below is a short description of them, check out the URLs for more details and API 

### 📌 NLTK
* NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
    * https://www.nltk.org/

### 📌 TextBlob
* TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
    * https://textblob.readthedocs.io/en/dev/

In [None]:
import unicodedata, re, string
import nltk
from textblob import TextBlob

In [None]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

## Punkt Sentence Tokenizer
* 🔎 Why do we use tokenizers?
* This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. 
    * It must be trained on a large collection of plaintext in the target language before it can be used.
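
To see why a trained model for abbreviations matters, here is a minimal sketch of a naive rule-based splitter. It is not the Punkt algorithm; `naive_sentence_split` is a hypothetical helper that simply splits after `.`, `!` or `?`, and the sample text is made up.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

text = "Dr. Smith loves NLP. Do you?"
sentences = naive_sentence_split(text)
print(sentences)  # ['Dr.', 'Smith loves NLP.', 'Do you?']
```

The abbreviation "Dr." is wrongly treated as a sentence end, which is the kind of case an abbreviation model is meant to handle.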

In [None]:
nltk.download('punkt')

## Download the dataset

In [None]:
df = pd.read_csv('https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/raw/main/datasets/train_tweets.csv')

# ⚡ Let's take a look at the data

In [None]:
df.head()

In [None]:
df.shape

In [None]:
sns.countplot(x='label', data=df)

In [None]:
df.label.value_counts()

## We can see that the classes are highly imbalanced: there are only 2242 negative tweets, far fewer than positive ones
* 🔎 What will be impacted by class imbalance?

### 💡 We can see that sentences are of similar length regardless of the class

In [None]:
df['length'] = df.tweet.apply(len)

In [None]:
sns.boxplot(x='label', y='length', data = df)

## We can see that the text data are full of noise

* 💡 Social posts suffer the most from this effect
    * The text is full of hashtags, emojis, @mentions and so on
    * These parts usually don't influence the sentiment score much
* 💡 However, more advanced models usually extract even these features, because e.g. emojis can help with understanding sarcasm

## Take a look at a few examples; they show many of the caveats we've just discussed

In [None]:
for x in df.loc[:10, 'tweet']:
    print(x, '\n', '-'*len(x))

# 📒 We have a few specific pre-processing techniques for the text data
* 💡 The benefits of using these techniques vary from approach to approach
    * However, it is good to have at least some knowledge about them

## Stemming
* Stemming is the process of reducing the morphological variants of a word to a common root/base form (the stem)
    * Stemming programs are commonly referred to as stemming algorithms or stemmers
* 💡 A stemming algorithm reduces the words “chocolates”, “chocolatey”, “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”

#### ⚡ Examples of stemming:
* chocolates, chocolatey, choco : **chocolate**
* retrieval, retrieved, retrieves : **retrieve**


## Lemmatization 
* Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item
* 💡 Lemmatization is similar to stemming but it brings context to the words
    * 💡 It links words with similar meaning to one word

#### ⚡ Examples of lemmatization:
* rocks : **rock**
* corpora : **corpus**
* better : **good**
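
The difference between the two techniques can be sketched with toy stand-ins: a rule-based suffix stripper for stemming and a lookup table for lemmatization. Neither is a real NLTK algorithm; `SUFFIXES`, `toy_stem`, `LEMMAS` and `toy_lemmatize` are hypothetical names used only for illustration.

```python
# Toy suffix stripper: a stand-in for a rule-based stemmer (NOT Porter/Lancaster).
SUFFIXES = ("al", "ed", "es", "s")  # hypothetical rule list, checked in this order

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# Toy lemma table: a stand-in for a dictionary-based lemmatizer (NOT WordNet).
LEMMAS = {"rocks": "rock", "corpora": "corpus", "better": "good"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

stems = [toy_stem(w) for w in ["retrieval", "retrieved", "retrieves"]]
lemmas = [toy_lemmatize(w) for w in ["rocks", "corpora", "better"]]
print(stems)   # ['retriev', 'retriev', 'retriev']
print(lemmas)  # ['rock', 'corpus', 'good']
```

Note that a rule-based stem does not have to be a dictionary word, and that suffix rules alone cannot map "better" to "good"; that mapping needs vocabulary knowledge.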

### Both techniques can be used in the preprocessing pipeline
* You have to decide whether they are beneficial for you, because these steps generalize the data by themselves
    * 💡 You will definitely lose some pieces of information!

# 📌 Embedding note
* **If you use a pre-trained embedding such as Word2Vec or GloVe, it is better to skip these steps, because they were skipped when the embedding vocabulary was built as well** 🙂

# You don't have to code the pre-process steps yourself 🙂
* We have already prepared the most common functions used
    * 💡 Modify the function `normalize(...)` to try different combinations of steps

In [None]:
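from nltk.corpus import stopwords  # needs nltk.download('stopwords') before first use
from nltk.stem import LancasterStemmer, WordNetLemmatizer  # the lemmatizer needs nltk.download('wordnet')

def remove_non_ascii(words):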
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_numbers(words):
    """Remove all digit sequences from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r"\d+", "", word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    stop_words = set(stopwords.words('english'))  # build the lookup set once, not per word
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    # words = remove_punctuation(words)
    words = remove_numbers(words)
    # words = remove_stopwords(words)
    return words

def form_sentence(tweet):
    tweet_blob = TextBlob(tweet)
    return tweet_blob.words

# First we must tokenize sentences and remove punctuation
* We will use the `TextBlob` library

In [None]:
df['Words'] = df['tweet'].apply(form_sentence)

In [None]:
df.head()

# Normalize sentences 
* We want only ascii and lowercase characters and we also want to get rid of numbers in the strings

### You can always experiment with different preprocessing steps! 🙂
* 💡 The choice of steps usually depends on the dataset

In [None]:
df['Words_normalized'] = df['Words'].apply(normalize)

In [None]:
df.head()

## Remove the 'user' word from tweets

In [None]:
df['Words_normalized_no_user'] = df['Words_normalized'].apply(lambda x: [y for y in x if 'user' not in y])

In [None]:
df.head()

## 💡 We can see that no pre-processing is ideal and we have to fix some issues ourselves
* e.g. n't splitting

In [None]:
print(df.tweet.iloc[1])
print(df.Words_normalized_no_user.iloc[1])

In [None]:
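def fix_nt(words):
    """Merge a split "n't"/"nt" token back into the preceding word"""
    st_res = []
    for i, word in enumerate(words):
        if word in ("n't", "nt"):
            continue  # already merged into the previous word
        # Look one token ahead; the last word has no successor but must be kept
        next_word = words[i + 1] if i + 1 < len(words) else None
        if next_word in ("n't", "nt"):
            st_res.append(word + "n't")
        else:
            st_res.append(word)
    return st_res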

In [None]:
df['Words_normalized_no_user_fixed'] = df['Words_normalized_no_user'].apply(fix_nt)

## ⚡ The issue is now fixed

In [None]:
print(df.tweet.iloc[1])
print(df.Words_normalized_no_user.iloc[1])
print(df.Words_normalized_no_user_fixed.iloc[1])

## Now we can join the tokens into a single string again for each instance

In [None]:
df['Clean_text'] = df['Words_normalized_no_user_fixed'].apply(lambda x: " ".join(x))

In [None]:
df['Clean_text'].head()

# 🚀 Let's take a look at the most common words in the corpus
* It is one of the usual EDA steps for text data
    * ⚠ Without preprocessing there will be a lot of so-called *stopwords*

## 🔎 Do you know what the term **stopword** means?

## Step 1: Tokenize each string and merge the token array into one big array using `itertools.chain()`

In [None]:
all_words = list(itertools.chain(*df.Words_normalized_no_user_fixed))

In [None]:
all_words[:20]

## Step 2: Compute frequency of every token using `nltk`
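
A minimal sketch of what steps 1 and 2 compute, using only the standard library on made-up token lists. `collections.Counter` stands in for the frequency distribution here; the variable names are illustrative.

```python
import itertools
from collections import Counter

# Made-up tokenized tweets, one list of tokens per tweet
token_lists = [["i", "love", "nlp"], ["i", "love", "rnn"], ["i", "hate", "noise"]]

# Step 1: merge all token lists into one flat list
all_tokens = list(itertools.chain(*token_lists))

# Step 2: count how often every token occurs
freq = Counter(all_tokens)
print(freq.most_common(2))  # [('i', 3), ('love', 2)]
print(len(freq))            # 6 unique tokens
```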

In [None]:
dist = nltk.FreqDist(all_words)

## 💡 The most common tokens are:

In [None]:
dist

## 💡 We have 34289 unique words (tokens)

In [None]:
len(dist)

## 💡 The longest tweet has 42 tokens

In [None]:
max(df.Words_normalized_no_user_fixed.apply(len))

# 🚀 Our dataset is ready, we can start our Deep learning experiments 
* 🔎 Can you use a regular FCNN for sentiment analysis?

![Meme01](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_06_meme_01.png?raw=true)


# We will use the `TextVectorization` layer to create a vector model from our text data
* For those of you who are interested in the topic, there is a very good [article on Medium](https://towardsdatascience.com/you-should-try-the-new-tensorflows-textvectorization-layer-a80b3c6b00ee) about the layer and its parameters
    * There is of course a [documentation page](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) about the layer as well

## 🔎 What does *text vectorization* mean in this context?
* Is it a different term from the one used in information retrieval?

## 📌 There are a few important parameters:
* `embedding_dim`
    * Dimension of the embedded representation
    * These vectors already live in a latent space that captures dependencies among words; we learn them with the ANN
* `vocab_size`
    * Number of unique tokens in the vocabulary
* `sequence_length`
    * Output length after vectorizing: every text is padded or truncated to this many token indices, and words in the vectorized representation are treated as independent
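
The roles of `vocab_size` and `sequence_length` can be sketched in plain Python. This is a simplified illustration of integer vectorization, not the Keras implementation: the helpers `build_vocab` and `vectorize` are hypothetical, tokenization is a plain lowercase whitespace split, and the corpus is made up. Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, which mirrors the layer's `int` output mode.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; index 0 is padding, index 1 is out-of-vocabulary."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    kept = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + kept

def vectorize(text, vocab, sequence_length):
    """Map tokens to integer indices, then truncate or pad to a fixed length."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in text.lower().split()]  # unknown -> 1
    ids = ids[:sequence_length]                                # truncate long texts
    return ids + [0] * (sequence_length - len(ids))            # pad short texts with 0

texts = ["i love this", "i love it", "i hate"]  # made-up corpus
vocab = build_vocab(texts, max_tokens=4)
print(vocab)                                # ['', '[UNK]', 'i', 'love']
print(vectorize("i love pizza", vocab, 5))  # [2, 3, 1, 0, 0]
```

With `max_tokens=4` only the two most frequent words get their own index, so "pizza" maps to the out-of-vocabulary index 1 and the short text is padded with zeros.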

In [None]:
embedding_dim = 128 # Dimension of embedded representation
vocab_size = 10000 # Number of unique tokens in vocabulary
sequence_length = 30 # Output dimension after vectorizing

vect_layer = TextVectorization(max_tokens=vocab_size, output_mode='int', output_sequence_length=sequence_length)
vect_layer.adapt(df.Clean_text.values)

## We will split our dataset into train and test parts with stratification

## ⚠⚠ COMPETITION TEST SET HERE ⚠⚠

### COMPETITION ?!?! 🤔
* I will provide the details at the end of the lecture 🙂

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.Clean_text, df.label, test_size=0.20, random_state=13, stratify=df.label)

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=13, stratify=y_train)

In [None]:
print(X_train.shape, X_test.shape)

In [None]:
print('Train')
print(y_train.value_counts())
print('Test')
print(y_test.value_counts())

In [None]:
print('Vocabulary example: ', vect_layer.get_vocabulary()[:10])
print('Vocabulary shape: ', len(vect_layer.get_vocabulary()))

# 🚀 Let's finally try the RNN-based model!

In [None]:
input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = keras.layers.Embedding(vocab_size, embedding_dim)(x_v)
x = LSTM(64, activation='relu', return_sequences=True)(emb)
x = GRU(64, activation='relu', return_sequences=True)(x)
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(64, 'relu')(x)
x = keras.layers.Dense(32, 'relu')(x)
x = keras.layers.Dropout(0.2)(x)
output_layer = keras.layers.Dense(1, 'sigmoid')(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer=keras.optimizers.RMSprop(), loss=keras.losses.BinaryCrossentropy(), metrics=[keras.metrics.F1Score(average='weighted', threshold=0.5, name='f1')])

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.tf',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
batch_size = 128
epochs = 5

history = model.fit(X_train.values, tf.cast(y_train.values, tf.float32), validation_data=(X_valid.values, tf.cast(y_valid.values, tf.float32)), callbacks=[model_checkpoint_callback], epochs=epochs, batch_size=batch_size)

show_history(history)

In [None]:
# Load the best setup
model.load_weights("weights.best.tf")

In [None]:
y_pred = model.predict(X_test).ravel()

## The sigmoid function gives us a real number between 0 and 1

In [None]:
y_pred[:10]

## We need to map these values to the discrete classes 0 and 1

In [None]:
y_pred = [1 if x >= 0.5 else 0 for x in y_pred]

In [None]:
y_pred[:10]

# Accuracy is not the best metric for imbalanced data - you already know the reason 🙂
* There are many more metrics we can use and one of the most common in this situation is the F1 Score, see [this](https://en.wikipedia.org/wiki/F-score) and [this](https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/) for more info
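
A minimal sketch of why accuracy can mislead on imbalanced data, computed by hand on made-up labels (90 samples of class 0 and 10 of class 1). The helper `binary_metrics` is hypothetical and only mirrors the standard definitions of accuracy, precision, recall and F1.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, f1) for binary labels, with F1 computed for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

y_true = [0] * 90 + [1] * 10                            # made-up imbalanced labels
always_majority = [0] * 100                             # never predicts the minority class
some_hits = [1] * 5 + [0] * 85 + [1] * 5 + [0] * 5      # finds 5 of the 10 positives

print(binary_metrics(y_true, always_majority))  # (0.9, 0.0)
print(binary_metrics(y_true, some_hits))        # (0.9, 0.5)
```

Both predictors reach the same accuracy, but only the F1 score shows that the first one never detects the minority class.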

In [None]:
accuracy_score(y_true=y_test, y_pred=y_pred)

In [None]:
f1_score(y_true=y_test, y_pred=y_pred)

In [None]:
print(classification_report(y_true=y_test, y_pred=y_pred))

In [None]:
sns.heatmap(confusion_matrix(y_true=y_test, y_pred=y_pred), annot=True, cmap='Greens', fmt='.0f')

# Do we need to train our own embedding from scratch? 🤔
* 💡 There are multiple embeddings available online that were trained on very large corpora, e.g. Wikipedia
* Good examples are Word2Vec, GloVe or FastText
    * These embeddings contain fixed-length vectors for words in the vocabulary

* We will use the GloVe embedding with 50-dimensional embedding vectors
    * For more details see [this](https://nlp.stanford.edu/projects/glove/)
* You can download zip with vectors from [http://nlp.stanford.edu/data/glove.6B.zip](http://nlp.stanford.edu/data/glove.6B.zip) ~ 800 MB
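
What makes such vectors useful is that related words tend to end up close to each other, which is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors, not real GloVe values; `toy_vectors` and `cosine_similarity` are illustrative names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only (real GloVe vectors have 50+ dimensions)
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_unrelated = cosine_similarity(toy_vectors["good"], toy_vectors["table"])
print(sim_related > sim_unrelated)  # True
```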

### 📌 Beware that the original text corpus was more general than the specific social media text data
* 💡 So if you deal with very specific domains, it may be beneficial to train your own embedding or at least fine-tune an existing one

# We need to download the embedding files
~~~
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
~~~

* 💡 The 50-dimensional GloVe file is also available at https://ai.vsb.cz/downloads/glove.6B.50d.txt

# First we need to load the file to memory and create embedding dictionary

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

In [None]:
path_to_glove_file = 'glove.6B.50d.txt'

embeddings_index = {}
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

## 💡 This is what the embedding latent vector looks like for the word 'analysis'

In [None]:
embeddings_index['analysis']

In [None]:
embeddings_index['analysis'].shape

# 🚀 Our goal is to use the pre-trained embedding in our model

## We need to get the vocabulary from the `TextVectorization` layer and the integer indexes in the first step

In [None]:
embedding_dim = 50 # Embedding dimension -> GloVe 50
vocab_size = 10000 # Number of unique tokens in vocabulary
sequence_length = 20 # Output dimension after vectorizing - words in vectorized representation are independent

vect_layer = TextVectorization(max_tokens=vocab_size, output_mode='int', output_sequence_length=sequence_length)
vect_layer.adapt(df.Clean_text.values)

In [None]:
voc = vect_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [None]:
voc[:10]

In [None]:
word_index['the']

In [None]:
embeddings_index['the']

## Now we can create the embedding matrix
* We just need to map the `int` indices to the embedding vectors and save the mapping to the matrix
    * 💡 Each row of the matrix corresponds to one token
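
The mapping can be sketched on toy data with plain lists. All names and values here (`toy_embeddings_index`, `toy_voc`, `toy_matrix`, `embed`) are made up for illustration and are not part of the notebook's pipeline.

```python
# Made-up 2-dimensional "pre-trained" vectors; real GloVe vectors have 50+ dimensions
toy_embeddings_index = {"good": [0.1, 0.2], "bad": [-0.1, -0.2]}
toy_voc = ["", "[UNK]", "good", "meh", "bad"]  # position in the list = integer index
toy_dim = 2

toy_matrix = [[0.0] * toy_dim for _ in toy_voc]
hits = misses = 0
for i, word in enumerate(toy_voc):
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)
        hits += 1
    else:
        misses += 1  # row stays all-zeros (padding, OOV and unknown words)

def embed(token_ids):
    """What an embedding lookup does conceptually: pick one row per token index."""
    return [toy_matrix[i] for i in token_ids]

print(hits, misses)      # 2 3
print(embed([2, 4, 0]))  # [[0.1, 0.2], [-0.1, -0.2], [0.0, 0.0]]
```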

In [None]:
num_tokens = len(voc) + 2
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representation for "padding" and "OOV"
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

In [None]:
embedding_matrix[2]

# Finally, we can use the GloVe embedding in the Embedding layer of our model
* 💡 Note the `embeddings_initializer=keras.initializers.Constant(embedding_matrix), trainable=False` part
    * You tell the model to use the GloVe vectors and that it can't modify them
    * 💡 You can also set the parameter `trainable=True` and fine-tune the embedding

In [None]:
input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = keras.layers.Embedding(num_tokens, embedding_dim, embeddings_initializer=keras.initializers.Constant(embedding_matrix), trainable=False)(x_v)
x = LSTM(64, activation='relu', return_sequences=True)(emb)
x = GRU(64, activation='relu', return_sequences=False)(x)
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(64, 'relu')(x)
x = keras.layers.Dense(32, 'relu')(x)
x = keras.layers.Dropout(0.2)(x)
output_layer = keras.layers.Dense(1, 'sigmoid')(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer=keras.optimizers.RMSprop(), loss=keras.losses.BinaryCrossentropy(), metrics=[keras.metrics.F1Score(average='weighted', threshold=0.5, name='f1')])

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.tf',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
batch_size = 128
epochs = 5

history = model.fit(X_train.values, tf.cast(y_train.values, tf.float32), validation_data=(X_valid.values, tf.cast(y_valid.values, tf.float32)), callbacks=[model_checkpoint_callback], epochs=epochs, batch_size=batch_size)

show_history(history)

In [None]:
# Load the best setup
model.load_weights("weights.best.tf")

# 🔎 Which model is better?
* The one using the pre-trained embedding or the one that we've trained from scratch?
* 🔎 Why?

In [None]:
y_pred = model.predict(X_test).ravel()
y_pred = [1 if x >= 0.5 else 0 for x in y_pred]
print(f'Accuracy: {accuracy_score(y_true=y_test, y_pred=y_pred)}')
print(f'F1 Score: {f1_score(y_true=y_test, y_pred=y_pred)}')

# ✅  Tasks for the lecture (2p)

* Create your own architecture (1p)
    * You can also try different batch sizes, optimizers, etc.

* Try to fine-tune the GloVe embedding (you can also use GloVe 100 or a higher-dimensional embedding) and compare the models (1p)
    * Does the F1-Score differ?

# 🚀 There is a competition for bonus points this week! 
* Everyone who sends me a correct solution will be included in the F1-Score toplist
    * 📌 **Send me a link to the notebook, not the .ipynb file**
* **Deadline for the competition submission is Wednesday 10th at 12:00**
    * The toplist will be publicly available on Thursday
* There is no limitation in used layers (LSTM, CNN, ...), optimizers and so on
    * 💡 You can use any model architecture from the internet including transfer learning
* The test set is the same as the one that we used in the lecture
    * 💡 It is marked with ⚠⚠ in the notebook
* ⚡ The winner with the best F1 - Score on test set will be awarded with **10 bonus points**

## 📌 The only limitation is that the model has to be trained/fine-tuned on Colab/Kaggle/Your machine so online sentiment scoring services or REST-API LLMs are forbidden!

![Meme03](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_06_meme_03.png?raw=true)
