# Deep Learning - Exercise 7

This lecture focuses on building our own unsupervised word embedding using the Word2Vec Skip-Gram method and on using RNNs for text generation.

We will use the Harry Potter books in this lecture to demonstrate Word2Vec embedding training in Keras and to generate our own stories.

The Word2Vec approach is based on the [official TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/word2vec).

[Open in Google colab](https://colab.research.google.com/github/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_06.ipynb)
[Download from Github](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_06.ipynb)

##### Remember to set **GPU** runtime in Colab!

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np 
import pandas as pd
import seaborn as sns
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow import string as tf_string
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import LSTM, GRU, Bidirectional

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import normalize
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_distances
import scipy
import itertools
import string
import re
import tqdm
import io

tf.version.VERSION

In [None]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

In [None]:
SEED = 13

# 🔎 What is word embedding?
* Why do we use it? 
* Do we need to train our own embedding?
* Does the embedding have any other use besides ANN applications?

# Word2Vec

## 💡 There are two approaches to training a Word2Vec embedding

* **Continuous bag-of-words model**: 
    * predicts the middle word based on surrounding context words. 
    * the context consists of a few words before and after the current (middle) word. 
    * this architecture is called a bag-of-words model as the order of words in the context is not important.

* **Continuous skip-gram model**: 
    * predicts words within a certain range before and after the current word in the same sentence. 
    * **we will use this one as it is an easier concept to grasp**

![w2v](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_07_skip.png?raw=true)

  
* 💡 Bag-of-words model predicts a word given the neighboring context
* 💡 Skip-gram model predicts the context (or neighbors) of a word, given the word itself

* The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (see the diagram below for an example). 
* The context of a word can be represented through a set of skip-gram pairs of *(target_word, context_word)* where *context_word* appears in the neighboring context of target_word.

## We will demonstrate the approach using a single sentence

* The context words for each of the 8 words of this sentence are defined by a window size. 
* The window size determines the span of words on either side of a target_word that can be considered a context word.

![w2v_tab](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_07_tab.png?raw=true)
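
The pairs in the table above can also be produced with a few lines of plain Python. This is only a minimal sketch of the windowing idea; the helper name `skipgram_pairs` is made up for illustration and is not part of Keras.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        # Neighbours lie at most `window_size` positions to the left or right.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the wide road shimmered in the hot sun".split()
pairs = skipgram_pairs(tokens, window_size=2)
print(len(pairs))
print(pairs[:4])
```

Words near the start or end of the sentence have fewer neighbours, which is why the number of pairs is smaller than `len(tokens) * 2 * window_size`.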



## First we will tokenize the sentence

In [None]:
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
print(len(tokens))

## Now we can build the vocabulary and mapping WORD -> ID

In [None]:
vocab, index = {}, 1  # start indexing from 1
vocab['<pad>'] = 0  # add a padding token
for token in tokens:
  if token not in vocab:
    vocab[token] = index
    index += 1
vocab_size = len(vocab)
print(vocab)

## It is also common to build the inverse mapping ID -> WORD

In [None]:
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)

## Int-encoded sentence looks like this

In [None]:
example_sequence = [vocab[word] for word in tokens]
print(example_sequence)

## You can use the [tf.keras.preprocessing.sequence.skipgrams](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence) to generate skip-gram pairs

*  Generate skip-grams from the example_sequence with a given window_size from tokens in the range [0, vocab_size)

In [None]:
window_size = 2
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
      example_sequence,
      vocabulary_size=vocab_size,
      window_size=window_size,
      negative_samples=0)
print(len(positive_skip_grams))

## We will take a look at some skip-gram examples

In [None]:
for target, context in positive_skip_grams[:5]:
  print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")

## Negative sampling for one skip-gram

* The skipgrams function returns all positive skip-gram pairs by sliding over a given window span. 

### 💡 But we need some negative examples to train the model as well

## How to generate such samples?
* To produce additional skip-gram pairs that would serve as negative samples for training, you need to sample random words from the vocabulary. 
* Use the tf.random.log_uniform_candidate_sampler function to sample num_ns number of negative samples for a given target word in a window. 
* You can call the function on one skip-gram's target word and pass the context word as the true class to exclude it from being sampled.
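
To get a feeling for what "log-uniform" means, the sketch below evaluates the class probabilities that the TensorFlow documentation gives for this sampler, `P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)`. The helper name `log_uniform_probs` is hypothetical, and `range_max=8` is simply the vocabulary size of the example sentence including the padding token.

```python
import math


def log_uniform_probs(range_max):
    """Probability of each class id in [0, range_max) under the log-uniform distribution."""
    norm = math.log(range_max + 1)
    return [(math.log(c + 2) - math.log(c + 1)) / norm for c in range(range_max)]


probs = log_uniform_probs(8)  # 8 = vocabulary size of the example sentence
for class_id, p in enumerate(probs):
    print(class_id, round(p, 3))
```

Low ids are drawn more often than high ids, so the sampler works best when the vocabulary is sorted by decreasing frequency.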

In [None]:
# Get target and context words for one positive skip-gram.
target_word, context_word = positive_skip_grams[0]

# Set the number of negative samples per positive context.
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

In [None]:
# Reduce a dimension so you can use concatenation (in the next step).
squeezed_context_class = tf.squeeze(context_class, 1)

# Concatenate a positive context word with negative sampled words.
context = tf.concat([squeezed_context_class, negative_sampling_candidates], 0)

# Label the first context word as `1` (positive) followed by `num_ns` `0`s (negative).
label = tf.constant([1] + [0]*num_ns, dtype="int64")
target = target_word

In [None]:
print(f"target_index    : {target}")
print(f"target_word     : {inverse_vocab[target_word]}")
print(f"context_indices : {context}")
print(f"context_words   : {[inverse_vocab[c.numpy()] for c in context]}")
print(f"label           : {label}")

# The whole process can be illustrated with this example

![w2v_example](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_07_example.png?raw=true)

## Skip-gram sampling table
* A large dataset means a larger vocabulary with a higher number of frequent words such as stopwords.
* Training examples obtained from sampling commonly occurring words (such as the, is, on) don't add much useful information for the model.
* Subsampling of frequent words is a helpful practice to improve embedding quality.

### sampling_table[i] denotes the probability of sampling the i-th most common word in a dataset. 

In [None]:
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
print(sampling_table)
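
The values printed above can be approximated by hand. The sketch below is a plain-Python reading of the formula, under the assumption that Keras estimates word frequencies from the rank with a Zipf-like approximation and uses a default `sampling_factor` of `1e-5`. The function name `sampling_table_sketch` is made up, so treat it as an illustration rather than the library implementation.

```python
import math


def sampling_table_sketch(size, sampling_factor=1e-5):
    """Keep-probability per frequency rank, assuming Zipf-distributed word counts."""
    gamma = 0.577  # Euler-Mascheroni constant (rounded)
    table = []
    for rank in range(size):
        r = max(rank, 1)  # rank 0 is treated like rank 1
        # Approximate inverse relative frequency of the r-th most common word.
        inv_freq = r * (math.log(r) + gamma) + 0.5 - 1.0 / (12.0 * r)
        f = sampling_factor * inv_freq
        table.append(min(1.0, f / math.sqrt(f)))
    return table


table = sampling_table_sketch(10)
print([round(p, 5) for p in table])
```

The most common words get the smallest probability of being kept, and the probability grows with the rank.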

### Compile all the steps described above into a function that can be called on a list of vectorized sentences obtained from any text dataset. 

In [None]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
    # Elements of each training example are appended to these lists.
    targets, contexts, labels = [], [], []

    # Build the sampling table for `vocab_size` tokens.
    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

    # Iterate over all sequences (sentences) in the dataset.
    for sequence in tqdm.tqdm(sequences):

        # Generate positive skip-gram pairs for a sequence (sentence).
        positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
            sequence,
            vocabulary_size=vocab_size,
            sampling_table=sampling_table,
            window_size=window_size,
            negative_samples=0)

        # Iterate over each positive skip-gram pair to produce training examples
        # with a positive context word and negative samples.
        for target_word, context_word in positive_skip_grams:
            context_class = tf.expand_dims(tf.constant([context_word], dtype="int64"), 1)
            negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
                true_classes=context_class,
                num_true=1,
                num_sampled=num_ns,
                unique=True,
                range_max=vocab_size,
                seed=seed,
                name="negative_sampling")

            # Build context and label vectors (for one target word).
            context = tf.concat([tf.squeeze(context_class, 1), negative_sampling_candidates], 0)
            label = tf.constant([1] + [0]*num_ns, dtype="int64")

            # Append each element from the training example to global lists.
            targets.append(target_word)
            contexts.append(context)
            labels.append(label)

    return targets, contexts, labels

## Now we can download the Harry Potter and the Sorcerer's Stone book and train our own embedding

In [None]:
path_to_file = tf.keras.utils.get_file('hp1.txt', 'https://raw.githubusercontent.com/rasvob/VSB-FEI-Deep-Learning-Exercises/main/datasets/hp1.txt')

## First 50 lines of the book

In [None]:
with open(path_to_file) as f:
    lines = f.read().splitlines()
for line in lines[:50]:
    print(line)

# We will employ the *TextLineDataset* from the TF data API
* It allows us to easily load text file line by line and preprocess it
* We will skip the book title and blank lines, then we will remove the CHAPTER XYZ lines as the information is not useful
    * Then we can transform the text into lowercase and remove the punctuation
    * We will use the punctuation characters from the *string* module, escaped with *re.escape* so they can be used inside a regular expression
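
The same cleaning steps can be tried on ordinary Python strings first, which makes it easier to see what each step does. This is just a sketch: the helper names `clean_line` and `keep_line` and the sample lines are made up for illustration.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCT_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")


def clean_line(line):
    """Lowercase a line and strip ASCII punctuation."""
    return PUNCT_PATTERN.sub("", line.lower())


def keep_line(line):
    """Keep non-empty lines that are not chapter headings."""
    return len(line) > 0 and re.fullmatch(r"CHAPTER.*", line) is None


# Hypothetical sample input, not read from the book file.
raw_lines = ["CHAPTER ONE", "", "Mr. and Mrs. Dursley, of number four, Privet Drive,"]
cleaned = [clean_line(line) for line in raw_lines if keep_line(line)]
print(cleaned)
```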

In [None]:
re.escape(string.punctuation)

In [None]:
text_ds = (
    tf.data.TextLineDataset(path_to_file)
    .skip(1)  # skip the book title
    .filter(lambda x: tf.cast(tf.strings.length(x), bool))  # drop blank lines
    .filter(lambda y: tf.math.logical_not(tf.strings.regex_full_match(y, 'CHAPTER.*')))  # drop chapter headings
    .map(lambda z: tf.strings.lower(z))
    .map(lambda a: tf.strings.regex_replace(a, f'[{re.escape(string.punctuation)}]', ''))
)

## Here is our pre-processed dataset

In [None]:
for element in text_ds.take(10).as_numpy_iterator():
    print(element)

## The TF dataset works as a data stream
* How do we iterate over a data stream?
* How do we count its elements?
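
A stream has no length, so counting means folding over it with a running state. `Dataset.reduce(initial_state, reduce_func)` follows the same idea as `functools.reduce` in plain Python, which the sketch below uses on a generator. The generator `line_stream` and its three lines are hypothetical stand-ins for the dataset.

```python
from functools import reduce


def line_stream():
    """Yield lines one at a time, like a dataset that is read lazily."""
    yield from ["the boy who lived", "mr and mrs dursley", "privet drive"]


# Each reduce call consumes the stream once and carries a running state.
count = reduce(lambda state, _line: state + 1, line_stream(), 0)
total_chars = reduce(lambda state, line: state + len(line), line_stream(), 0)
print(count, total_chars, total_chars // count)
```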

## Total number of lines in the book

In [None]:
text_ds.map(lambda x: tf.cast(tf.strings.length(x), tf.int32)).reduce(0, lambda x, y: x + 1).numpy()

## Total length of the text

In [None]:
text_ds.map(lambda x: tf.cast(tf.strings.length(x), tf.int32)).reduce(0, lambda x, y: x + y).numpy()

## Average length of a line

In [None]:
text_ds.map(lambda x: tf.cast(tf.strings.length(x), tf.int32)).reduce(0, lambda x, y: x + y).numpy() // text_ds.map(lambda x: tf.cast(tf.strings.length(x), tf.int32)).reduce(0, lambda x, y: x + 1).numpy()

# Now we can set up the TextVectorization for integer encoding of the tokens

In [None]:
sequence_length = 15
vectorize_layer = keras.layers.TextVectorization(max_tokens=None, output_mode='int', output_sequence_length=sequence_length)
vectorize_layer.adapt(text_ds.batch(1024))

## Vocabulary example

In [None]:
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

## Number of tokens in vocabulary

In [None]:
vocab_size = len(vectorize_layer.get_vocabulary())
vocab_size

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

## Note: *unbatch* splits the batches back into individual elements, similar to flattening the batch dimension with *ravel* in NumPy

In [None]:
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

In [None]:
for x in text_vector_ds.take(5).as_numpy_iterator():
    print(x)

## We can take a look at the number of sequences generated and some examples of the data as well

In [None]:
sequences = list(text_vector_ds.as_numpy_iterator())
len(sequences)

In [None]:
for seq in sequences[:5]:
    print(f"{seq} => {[inverse_vocab[i] for i in seq]}")

## Finally we can create the whole dataset

In [None]:
targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=4,
    vocab_size=vocab_size,
    seed=SEED)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)

print('\n')
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")

In [None]:
targets

In [None]:
contexts

In [None]:
labels

## We will form the dataset using TF data API as well

In [None]:
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets.reshape(-1, 1), contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

## Performance tweaks
* When the GPU is working on forward / backward propagation on the current batch, we want the CPU to process the next batch of data so that it is immediately ready. 
* The GPU is the most expensive part of the computer, so we want it to be fully used all the time during training. 
    * We call this consumer / producer overlap, where the consumer is the GPU and the producer is the CPU.

* With tf.data, you can do this with a simple call to dataset.prefetch(N) at the end of the pipeline (after batching). 
    * This prefetches up to N batches of data in the background so that, ideally, a batch is always ready when the GPU asks for it.
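
The consumer / producer overlap can be imitated in plain Python with a bounded queue and a background thread. This is only an analogy to illustrate the idea, not how tf.data is implemented. The names `producer`, `consume` and the prefetch depth `N = 2` are made up for the sketch.

```python
import queue
import threading


def producer(batches, buffer):
    """Prepare batches and place them in the bounded buffer."""
    for batch in batches:
        buffer.put(batch)  # blocks while the buffer already holds N batches
    buffer.put(None)  # sentinel: no more data


def consume(buffer):
    """Take batches from the buffer until the sentinel arrives."""
    seen = []
    while True:
        batch = buffer.get()
        if batch is None:
            break
        seen.append(sum(batch))  # stand-in for one training step
    return seen


N = 2  # hypothetical prefetch depth
buffer = queue.Queue(maxsize=N)
batches = [[1, 2], [3, 4], [5, 6], [7, 8]]
worker = threading.Thread(target=producer, args=(batches, buffer))
worker.start()
results = consume(buffer)
worker.join()
print(results)
```

While the consumer is busy with one batch, the producer thread can already fill the buffer with the next ones.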

In [None]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)

## The final step is to define and train the model
* We will use 2 Embedding layers
    * One for the target word and one for the context words
* Finally the dot product of the Embedding outputs will be computed to combine the vectors and the result will be taken as an output
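
What the model computes for a single training example can be written out by hand. The sketch below takes one target vector and a few context vectors, computes the dot products as logits and evaluates the softmax cross-entropy against the labels. All vectors are hypothetical 3-dimensional values chosen for illustration, and only two negative samples are used to keep it short.

```python
import math


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax_cross_entropy(logits, labels):
    """Cross-entropy between one-hot style labels and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_norm) for y, z in zip(labels, logits))


# Hypothetical embeddings, for illustration only.
target_vec = [0.5, -0.2, 0.1]
context_vecs = [
    [0.4, -0.1, 0.3],   # positive context word
    [-0.3, 0.2, 0.0],   # negative sample
    [0.1, 0.5, -0.4],   # negative sample
]
labels = [1, 0, 0]

logits = [dot(target_vec, c) for c in context_vecs]
loss = softmax_cross_entropy(logits, labels)
print(logits, loss)
```

Training pushes the dot product of the target with its true context word up and the dot products with the negative samples down, which lowers this loss.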

In [None]:
embedding_dim = 50

target_input = keras.layers.Input((1,))
context_input = keras.layers.Input((num_ns + 1,))

emb_w2v = keras.layers.Embedding(vocab_size, embedding_dim, name="w2v_embedding", embeddings_initializer='glorot_uniform', input_length=1)(target_input)
emb_ctx = keras.layers.Embedding(vocab_size, embedding_dim, name="ctx_embedding", embeddings_initializer='glorot_uniform', input_length=num_ns+1)(context_input)

dots = keras.layers.dot([emb_w2v, emb_ctx], axes=2)

fl = keras.layers.Flatten()(dots)

# out = keras.layers.Dense(num_ns+1, activation='linear')(fl)

model = keras.Model(inputs=[target_input, context_input], outputs=fl)

model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
model.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

In [None]:
history = model.fit(dataset, epochs=50)

In [None]:
show_history(history)

## Now we can save the vectors and visualize them using [TF projector](https://projector.tensorflow.org/)

In [None]:
weights = model.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [None]:
# The context manager closes both files even if writing fails.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip 0, it's padding.
        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")

## Let's inspect the vocabulary and look for the most similar words

In [None]:
inverse_vocab[:20]

In [None]:
id2word = {k: v for k, v in enumerate(vocab)}
word2id = {v: k for k, v in enumerate(vocab)}

In [None]:
inverse_vocab[20:40]

In [None]:
from sklearn.metrics.pairwise import cosine_distances

distance_matrix = cosine_distances(weights)
print(distance_matrix.shape)

similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]].argsort()[1:6]] 
                   for search_term in ['harry', 'hagrid', 'potter', 'go', 'he', 'the', 'one','hermione']}

similar_words
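
The ranking above relies on the cosine distance, which is 1 minus the cosine of the angle between two vectors. A minimal plain-Python sketch of the same nearest-neighbour lookup follows. The toy 2-dimensional vectors and the helper names are made up and are not taken from the trained model.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Return the k words closest to `word`, excluding the word itself."""
    ranked = sorted(vectors, key=lambda w: cosine_distance(vectors[word], vectors[w]))
    return [w for w in ranked if w != word][:k]


# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {
    "harry": [1.0, 0.1],
    "potter": [0.9, 0.2],
    "owl": [0.1, 1.0],
    "broom": [0.2, 0.9],
}
print(nearest("harry", toy_vectors))
```

In the cell above, the slice `[1:6]` plays the role of the exclusion step, because the closest entry to any word is the word itself at distance 0.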

In [None]:
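# Assumes the 50-dimensional GloVe file (part of the glove.6B archive) has already been downloaded to the working directory.
path_to_glove_file = 'glove.6B.50d.txt'

embeddings_index = {}
with open(path_to_glove_file, encoding='utf-8') as f: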
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

## 💡 This is what the embedding latent vector looks like for the word 'audi'

In [None]:
embeddings_index['audi']

In [None]:
embeddings_index['bmw']

In [None]:


cosine(embeddings_index['audi'], embeddings_index['bmw'])

In [None]:
from scipy.spatial.distance import cosine

cosine(embeddings_index['audi'], embeddings_index['king'])

In [None]:
num_tokens = len(embeddings_index.keys())
embedding_dim = 50
hits = 0
misses = 0
word2id = {k:i for i, (k,v) in enumerate(embeddings_index.items())}
id2word = {v:k for k, v in word2id.items()}

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word2id.items():
    embedding_vector = embeddings_index.get(word)
    embedding_matrix[i] = embedding_vector


In [None]:
embeddings_index['audi'].shape

In [None]:
len(embeddings_index.keys())

In [None]:
embeddings_index['bmw']

In [None]:
embedding_matrix[word2id['bmw']]

In [None]:
fm = cosine_distances(embedding_matrix[word2id['bmw']].reshape(-1, 50), embedding_matrix)

In [None]:
fm.shape

In [None]:
fm.argsort().ravel()[:6]

In [None]:
for x in fm.argsort().ravel()[:6]:
    print(id2word[x])

In [None]:
dist = embeddings_index['man'] - embeddings_index['woman']

In [None]:
dist

In [None]:
summed = embeddings_index['queen'] + dist

In [None]:
summed

In [None]:
fm = cosine_distances(summed.reshape(-1, 50), embedding_matrix)

In [None]:
fm.shape

In [None]:
fm.argsort().ravel()[:6]

In [None]:
for x in fm.argsort().ravel()[1:6]:
    print(id2word[x])
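
The last cells add the offset `man - woman` to `queen` and then search for the nearest vectors, which is the usual word-analogy test. The idea can be checked on toy data where the answer is known in advance. The 2-dimensional vectors below are hypothetical (one axis for gender, one for royalty) and are not GloVe values.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical vectors: first axis ~ gender, second axis ~ royalty.
toy = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "king": [1.0, 1.0],
    "queen": [-1.0, 1.0],
}

offset = [m - w for m, w in zip(toy["man"], toy["woman"])]
query = [q + o for q, o in zip(toy["queen"], offset)]

# Rank all words by their distance to the query vector.
ranked = sorted(toy, key=lambda w: cosine_distance(query, toy[w]))
print(ranked[0])
```

With real embeddings the match is only approximate, so it is worth looking at several of the nearest words and not only the first one.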