# VSB,FEI - Generative AI Workshop

The aim of the workshop is to get an overview of data analysis and deep learning techniques in the generative artificial intelligence (GenAI) domain.

* We will use [Python](https://www.python.org/), [Hugging Face](https://huggingface.co/) and [TensorFlow](https://www.tensorflow.org/).

**The exercise will cover these topics:**
* GenAI tools for image data using Hugging Face models
<!-- * LLM usage for text generating with Huggingface API -->
* Vector representation of text data and searching for similar words using vector distance 
* Designing our own deep learning model from scratch in the Keras framework to generate "Harry Potter"-like text

## ⚛ Deep learning in Python ⚛
* This lecture focuses on using word embeddings to search for similar words and on using RNNs for text generation.

* We will use Harry Potter books in this lecture to demonstrate how to train our own model in Keras and generate our own HP-like stories.

![meme01](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/dl_meme_01.jpg?raw=true)

## Import of the TensorFlow
The installed TensorFlow (TF) version is stored in the `VERSION` field of the `tf.version` module. Since TensorFlow 2.0, the high-level API is encapsulated under the Keras API.

In [None]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import unicodedata, re, string
import nltk
import requests

from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_distances
from textblob import TextBlob
from wordcloud import WordCloud
from typing import List, Tuple
from tensorflow.keras.layers import LSTM, GRU, Bidirectional


tf.version.VERSION

In [None]:
nltk.download('punkt')

# 🔎 How does a neural network work with text?
* Is it capable of processing text directly, or does it work only with numbers?
* Can you come up with a very simple way to encode text as numbers?
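One very simple option, sketched here only as an illustration, is to give every distinct word its own integer ID. The sentence and the variable names below are made up for this example and are not part of the workshop data.

```python
# Naive encoding sketch: every distinct word gets an integer ID.
sentence = "the cat sat on the mat"  # illustrative sentence (assumption)
tokens = sentence.split()

# Build the vocabulary in order of first appearance.
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

# Replace each word by its ID.
encoded = [vocab[token] for token in tokens]
print(vocab)
print(encoded)
```

The IDs are arbitrary: the number assigned to a word says nothing about its meaning, which is one of the limitations that embeddings try to address.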

# 🔎 What is a word embedding?
* Why do we use it?
* What different properties will it have compared to some naive approaches?

# Word embedding is a vector
* Do you know what a vector is?

# $$\vec{w} = \left(w_1, w_2, ..., w_n\right)$$

# 💡 You can imagine an embedding vector as an array of numbers, e.g. [0.5, 0.3, 0.1, -0.3, 1.2]

![meme03](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/dl_05_enc_arch.png?raw=true)

# The most famous word embedding is perhaps Word2Vec

## 💡 There are two approaches to training a Word2Vec embedding

* **Continuous bag-of-words model**: 
    * predicts the middle word based on surrounding context words. 
    * the context consists of a few words before and after the current (middle) word. 
    * this architecture is called a bag-of-words model as the order of words in the context is not important.

* **Continuous skip-gram model**: 
    * predicts words within a certain range before and after the current word in the same sentence. 

![w2v](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/dl_07_skip.png?raw=true)
  
* 💡 The bag-of-words model predicts a word given the neighboring context
* 💡 The skip-gram model predicts the context (or neighbors) of a word, given the word itself
* 💡 The context of a word can be represented through a set of skip-gram pairs of *(target_word, context_word)* where *context_word* appears in the neighboring context of *target_word*.

## We will demonstrate the approach using a single sentence

* The context words for each of the 8 words of this sentence are defined by a window size. 
* The window size determines the span of words on either side of a target_word that can be considered a context word.
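The pair generation described above can be sketched in a few lines of plain Python. The helper name `skipgram_pairs` and the 8-word example sentence are assumptions made for this illustration; the sentence in the figure may differ.

```python
from typing import List, Tuple

def skipgram_pairs(tokens: List[str], window_size: int) -> List[Tuple[str, str]]:
    """Return (target_word, context_word) pairs for every token."""
    result = []
    for i, target in enumerate(tokens):
        # The window covers up to `window_size` words on each side of the target.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                result.append((target, tokens[j]))
    return result

# Illustrative 8-word sentence (assumption).
sentence = "the wide road shimmered in the hot sun".split()
pairs = skipgram_pairs(sentence, window_size=2)
for pair in pairs[:6]:
    print(pair)
```

Words near the start or end of the sentence get fewer context words, because the window is cut off at the sentence boundary.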

![w2v_tab](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/dl_07_tab.png?raw=true)

# 💡 The deep learning model de facto learns which pairs of words often appear together in text and which do not
* Can you give some examples of word pairs yourself?

# A nice property of word embedding vectors is that words with similar meanings are placed close together
* If you compute the distance between two similar words, it will be smaller than the distance between two unrelated words
* E.g. dog - animal vs. car - cake

## Let's say that the vector is just 2D
* What does a 2D vector look like?
* 🔎 Can you calculate the distance between two 2D vectors?
* 🔎 What is the formula called in 2D, and what is it called in n-D?
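As a small worked example, the straight-line (Euclidean) distance can be computed by hand in 2D and with the same formula in any number of dimensions. The vectors `a` and `b` and the helper `euclidean_distance` are made up for this sketch.

```python
import numpy as np

# Two hypothetical 2D vectors.
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# 2D case written out explicitly (the Pythagorean theorem).
dist_2d = np.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def euclidean_distance(u, v):
    """Euclidean distance between two vectors of the same length (any dimension)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.sqrt(np.sum((u - v) ** 2))

print(dist_2d)
print(euclidean_distance(a, b))
print(euclidean_distance([0, 0, 0], [1, 2, 2]))
```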

# OK, enough theory!
## Let's try it in practice with pre-trained vectors! 🙂
* 🔎 Pre-trained on what!?

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

In [None]:
path_to_glove_file = 'glove.6B.50d.txt'

# We will take a look at the file structure now

In [None]:
with open(path_to_glove_file, encoding="utf-8") as f:  # explicit encoding, the file contains non-ASCII tokens
    i = 0
    for line in f:
        print(line)
        i += 1
        if i > 5:
            break

# Let's load the file into a dictionary
* key:value structure -> word:vector

In [None]:
embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:  # explicit encoding, the file contains non-ASCII tokens
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

## 💡 This is what the embedding vectors look like for the words 'audi' and 'bmw'

In [None]:
embeddings_index['audi']

In [None]:
embeddings_index['bmw']

## 💡 The cosine distance between the car brands should be smaller than the distance between unrelated words
* Why?

# Cosine similarity vs. Euclidean distance
* 🔎 What is the difference?
* 🔎 How do we compute them?

![meme03](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/dl_meme_tf_02.png?raw=true)

## $$cos(\vec{A},\vec{B}) = \frac{\sum_{i=1}^{n} A_i \cdot B_i}{\sqrt{\sum_{i=1}^{n} A_i^2 \cdot \sum_{i=1}^{n} B_i^2}}$$
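The formula above can be written almost literally in NumPy. This is a minimal sketch with made-up 2D vectors; `cosine_similarity` is a helper defined here, not a library function. Note that SciPy's `cosine` function returns the cosine *distance*, i.e. `1 - similarity`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

u = np.array([1.0, 0.0])
v = np.array([2.0, 0.0])  # same direction as u, different length
w = np.array([0.0, 3.0])  # perpendicular to u

sim_uv = cosine_similarity(u, v)
sim_uw = cosine_similarity(u, w)

# Similarity and the corresponding distance (1 - similarity).
print(sim_uv, 1 - sim_uv)
print(sim_uw, 1 - sim_uw)
```

Unlike the Euclidean distance, the cosine similarity ignores the length of the vectors and only compares their direction.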



# Let's try it out! 🙂

In [None]:
cosine(embeddings_index['audi'], embeddings_index['bmw'])

In [None]:
cosine(embeddings_index['audi'], embeddings_index['king'])

# To try the famous queen -> king example, we need to build the embedding matrix

![w2v_meme_03](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/dl_07_meme_03.png?raw=true)

In [None]:
num_tokens = len(embeddings_index.keys())
embedding_dim = 50
hits = 0
misses = 0
word2id = {k:i for i, (k,v) in enumerate(embeddings_index.items())}
id2word = {v:k for k, v in word2id.items()}

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word2id.items():
    embedding_vector = embeddings_index.get(word)
    embedding_matrix[i] = embedding_vector


## Finding the closest words is pretty easy now
* 🔎 What is the distance between two identical words?

In [None]:
c_w = cosine_distances(embedding_matrix[word2id['man']].reshape(-1, 50), embedding_matrix)

In [None]:
for x in c_w.argsort().ravel()[1:6]:
    print(id2word[x])

In [None]:
c_w = cosine_distances(embedding_matrix[word2id['woman']].reshape(-1, 50), embedding_matrix)

In [None]:
for x in c_w.argsort().ravel()[1:6]:
    print(id2word[x])

## The idea is that the difference between *man* and *woman* should be similar to the difference between *king* and *queen*, so it should be possible to use this difference to search for analogies
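The same arithmetic can be tried first on a tiny hand-made vocabulary, where the result is easy to check by eye. The 2D vectors below are invented for this sketch (the first coordinate loosely stands for "royalty", the second for "gender"); real embeddings are learned and far less tidy.

```python
import numpy as np

# Hypothetical toy embedding, NOT taken from GloVe.
toy = {
    "man":   np.array([0.2, 1.0]),
    "woman": np.array([0.2, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# queen + (man - woman) should land close to king.
query = toy["queen"] + (toy["man"] - toy["woman"])

# Rank all words by their distance to the query vector.
ranked = sorted(toy, key=lambda word: cosine_distance(query, toy[word]))
nearest = ranked[0]
print(ranked)
```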

In [None]:
dist = embeddings_index['man'] - embeddings_index['woman']

In [None]:
dist

In [None]:
summed = embeddings_index['queen'] + dist

In [None]:
summed

In [None]:
res = cosine_distances(summed.reshape(-1, 50), embedding_matrix)

In [None]:
res

# And here we go 🙂

In [None]:
for x in res.argsort().ravel()[1:6]:
    print(id2word[x])

# ⚛ Deep learning usage in a text generation task ⚛
* We will use Harry Potter books in this lecture for generating our own stories.
* The first step is data pre-processing, which transforms the data into a form suitable for a deep learning model

# We need to download the data first and split the text into lines
* Download the text file using the *requests* library
* Convert the raw HTTP response into text and split it by lines into an array

In [None]:
req = requests.get('https://raw.githubusercontent.com/rasvob/PopAI-VSB-Workshop/main/data/hp1.txt', allow_redirects=True)

In [None]:
txt = str(req.text).splitlines()

In [None]:
txt[:20]

# Let's clean the data and do a brief exploratory analysis after that 🙂

### 💡 Skip the header

In [None]:
txt = txt[3:]
txt[:10]

### 💡 Remove the chapter header with the chapter name
We will remove the blank lines in this part as well.

In [None]:
txt = [x for x in txt if 'CHAPTER ' not in x]
txt[:10]

In [None]:
txt = [x for x in txt if not x.upper() == x]
txt[:10]

### 💡 There are other minor imperfections connected to the *'t* suffix that we need to fix.

In [None]:
[x for x in txt if "\'" in x][25:30]

In [None]:
txt = [x.replace('"', '') for x in txt]
[x for x in txt if "a squeaky voice that" in x]

### We will join the text into one long line and tokenize it after that
* 💡 We have prepared a few useful functions that remove non-ASCII characters and fix some details in the text if needed

In [None]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

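def fix_nt(words):
    """Merge a detached "n't"/"nt" token back onto the preceding word."""
    st_res = []
    for i, word in enumerate(words):
        # Look one token ahead; the last word has no successor and is kept as-is
        nxt = words[i + 1] if i + 1 < len(words) else None
        if nxt in ("n't", "nt"):
            st_res.append(word + "n't")
        elif word not in ("n't", "nt"):
            st_res.append(word)
    return st_res

def fix_s(words):
    """Merge a detached "'s" token back onto the preceding word."""
    st_res = []
    for i, word in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        if nxt == "'s":
            st_res.append(word + "'s")
        elif word != "'s":
            st_res.append(word)
    return st_res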

def normalize(words):
    words = remove_non_ascii(words)
    words = fix_nt(words)   
    words = fix_s(words)
    return words

In [None]:
txt_one_line = ' '.join(txt)

In [None]:
txt_one_line[:300]

In [None]:
tokenized = TextBlob(txt_one_line).words

In [None]:
tokenized = normalize(tokenized)

# 💡 Let's take a look at the vocabulary size

In [None]:
dist = nltk.FreqDist(tokenized)

## 💡 We have 6829 unique words

In [None]:
len(dist)

### These are the most common words

In [None]:
most_common_words = sorted(list(dist.items()), key=lambda x: x[1], reverse=True)[:30]

In [None]:
fig, ax = plt.subplots(1, figsize=(20, 14))  # plt.subplots returns (figure, axes) in this order
sns.barplot(x=[x[0] for x in most_common_words], y=[x[1] for x in most_common_words], ax=ax)

## 💡 We have ~78300 words in the whole corpus

In [None]:
len(tokenized)

# 🔎 What kind of words are the most frequent? Is this information helpful?

## Another nice visualization is a **WordCloud**

In [None]:
wordcloud = WordCloud().generate(txt_one_line)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

## 💡 Limit *max_font_size* if you want to include more words

In [None]:
wordcloud = WordCloud(max_font_size=40).generate(txt_one_line)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# We can now learn how to use an ANN as a text generator! 🙂
* There are two main ways of solving the task
    * Word-based model
    * Character-based model

## 🔎 How does it work from a high-level view?

* 💡 We have a relatively small dataset, so we will use the **character-based model**, as it works better with smaller datasets
* 💡 We will also simplify the task by using only lower-case letters

# 💡 Build an array of letters from the whole text and filter out everything that is not a lower-case letter or a space

### Original text

In [None]:
txt_one_line[:240]

### Transform to lower-case

In [None]:
txt_one_line = txt_one_line.lower()

In [None]:
txt_one_line[:240]

## Split into letters

In [None]:
letters = []
for x in txt_one_line:
    if x >= 'a' and x <= 'z' or x == ' ':
        letters.append(x)

In [None]:
letters[:10]

# 💡 We have a corpus of more than 400k characters available

In [None]:
len(letters)

## 💡 But only 27 unique tokens

In [None]:
chars = sorted(list(set(letters)))
print("Total chars:", len(chars))

## We will now build ID -> CHAR and CHAR -> ID lookup tables

In [None]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [None]:
char_indices

In [None]:
indices_char

# We need to create fixed-length sequences for the model
* We will shift a sliding window of length *SEQ_LEN* by *step* and form X, y pairs
* The input is an array of *SEQ_LEN* letters; the output is just **1** letter, the one that comes right after the sequence
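To see the sliding window on something small, here is a toy sketch on a short string. The text `"hello world"` and the tiny window length are chosen only for illustration; the real data uses a much longer window.

```python
# Toy sliding-window example (illustrative values, not the workshop settings).
text = list("hello world")
seq_len, stride = 4, 1

pairs = []
for i in range(0, len(text) - seq_len, stride):
    window = "".join(text[i:i + seq_len])  # input: seq_len consecutive letters
    target = text[i + seq_len]             # output: the letter right after the window
    pairs.append((window, target))

for window, target in pairs[:3]:
    print(repr(window), "->", repr(target))
```

Each window overlaps the previous one in all but one letter, which is why even a short text produces many training examples.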

In [None]:
SEQ_LEN = 40
step = 1
X, y = [], []
for i in range(0, len(letters) - SEQ_LEN, step):
    seq, ch = letters[i:i+SEQ_LEN], letters[i + SEQ_LEN]
    X.append(seq)
    y.append(ch)

## 🔎 Let's take a look at an example
* Focus on the last letter of the second sequence

In [None]:
print(X[2])

In [None]:
print(X[3])

In [None]:
y[2]

# A character-level RNN usually uses one-hot encoding, as we work with just a few unique tokens, so no complex embedding is needed
* 🔎 What would one-hot encoding look like for the 4 letters A B C D?
    * How many bits do we need?
    * Can it be simplified even more?
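A minimal sketch of one-hot encoding for the four letters, using rows of an identity matrix. The names `letters_abcd`, `index` and `encoded_word` are made up for this illustration.

```python
import numpy as np

letters_abcd = ["A", "B", "C", "D"]
index = {c: i for i, c in enumerate(letters_abcd)}

# Each row of the identity matrix is the one-hot vector of one letter.
one_hot = np.eye(len(letters_abcd), dtype=int)
for c in letters_abcd:
    print(c, one_hot[index[c]])

# Encoding a whole word gives a matrix of shape (word length, alphabet size).
encoded_word = one_hot[[index[c] for c in "BAD"]]
print(encoded_word)
```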

In [None]:
X_ohe = np.zeros((len(X), SEQ_LEN, len(chars)), dtype=bool)
y_ohe = np.zeros((len(X), len(chars)), dtype=bool)
for i, sentence in enumerate(X):
    for t, char in enumerate(sentence):
        X_ohe[i, t, char_indices[char]] = 1
    y_ohe[i, char_indices[y[i]]] = 1

In [None]:
X_ohe.shape

In [None]:
y_ohe.shape

# The final step is the model definition and training 🙂
* What do we need the model to learn?
* How does the model learn?
* What is the input and what is the output?
* We will use the *softmax* function at the output
    * How many neurons do we need in the output layer?

* We use several types of layers
    * 🔎 Have you heard about LSTM, Dense or Dropout layers yet?
    * What about optimization algorithms? What is their purpose?
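For intuition about the softmax output, here is a small NumPy sketch that turns arbitrary scores into a probability distribution. The helper `softmax` and the example scores are assumptions for this illustration; in the model itself Keras applies softmax for us.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Three made-up scores, one per output neuron.
probs = softmax([2.0, 1.0, 0.1])
print(probs, probs.sum())
```

The output layer needs one neuron per possible character, so that each character gets its own probability.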

In [None]:
input_layer = keras.layers.Input(shape=(SEQ_LEN, len(chars)))
x = LSTM(128, return_sequences=True)(input_layer)
x = LSTM(128, return_sequences=False)(x)
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(256, 'relu')(x)
x = keras.layers.Dense(128, 'relu')(x)
x = keras.layers.Dropout(0.2)(x)
output_layer = keras.layers.Dense(len(chars), activation='softmax')(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer='rmsprop', loss=keras.losses.CategoricalCrossentropy(from_logits=False), metrics=['accuracy'])

## 🔎 What is a **batch** and an **epoch**?
* What is a **loss function**?
* What is *ModelCheckpoint* and why is it useful?
    * 💡 Hint: Overfitting
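The relation between samples, batches and epochs is simple arithmetic. The sketch below uses a hypothetical dataset size; the split is only an approximation of what Keras does with `validation_split`, which takes the validation samples from the end of the data.

```python
import math

n_samples = 1000        # hypothetical dataset size (assumption)
validation_split = 0.2
batch_size = 128
epochs = 10

n_val = int(n_samples * validation_split)  # samples held out for validation
n_train = n_samples - n_val                # samples used for weight updates

# One epoch = one pass over all training samples, processed batch by batch.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(n_train, n_val, steps_per_epoch, total_updates)
```

The last batch of an epoch is usually smaller than the others, because the number of training samples is rarely a multiple of the batch size.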

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    # Keras 3 requires the '.weights.h5' suffix when save_weights_only=True;
    # older TF 2.x releases accept it as well (HDF5 format)
    filepath='best.weights.h5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

batch_size = 128
epochs = 10

history = model.fit(X_ohe, y_ohe, validation_split=0.2, callbacks=[model_checkpoint_callback], epochs=epochs, batch_size=batch_size)

In [None]:
model.load_weights("best.weights.h5")

## Try to predict one letter

In [None]:
X_ohe[0].reshape((1, SEQ_LEN, len(chars)))

In [None]:
y_pred = model.predict(X_ohe[0].reshape((1, SEQ_LEN, len(chars))))[0]

In [None]:
y_pred

# We won't use the probabilities directly; instead, we will sample from the predicted outputs using temperature softmax [see this](https://medium.com/@majid.ghafouri/why-should-we-use-temperature-in-softmax-3709f4e0161)

* Basically, the idea is that it re-weights the probability distribution so that you can control how surprising (i.e. higher temperature/entropy) or predictable (i.e. lower temperature/entropy) the next selected character will be.
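To see the effect of the temperature on its own, without any random sampling, the re-weighting step can be applied to a small made-up distribution. The helper `apply_temperature` and the probabilities are assumptions for this sketch.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Re-weight a probability distribution: log, divide by temperature, softmax."""
    probs = np.asarray(probs, dtype=float)
    logits = np.log(probs) / temperature
    exp_logits = np.exp(logits - np.max(logits))  # shift for numerical stability
    return exp_logits / np.sum(exp_logits)

# Hypothetical prediction over three characters.
probs = np.array([0.6, 0.3, 0.1])
for t in (0.5, 1.0, 2.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

A temperature below 1 makes the most likely character even more dominant, a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it.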

In [None]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [None]:
c = sample(y_pred)
indices_char[c]

# And in the end we are able to create a feedback loop and use the model as a next-character generator for a given seed text 🙂

In [None]:
whole_text = X[10].copy()
seq = X[10].copy()
for i in range(500):
    paragraph_ohe = np.zeros((1, SEQ_LEN, len(chars)))
    for t, char in enumerate(seq):
        paragraph_ohe[0, t, char_indices[char]] = 1
    y_pred = model.predict(paragraph_ohe)
    c = sample(y_pred[0], 0.5)
    next_char = indices_char[c]
    whole_text.append(next_char)
    seq = whole_text[-SEQ_LEN:]

## You can see that the model has only seen character-level data; however, it has learnt the patterns from the data, so it is able to generate existing words/phrases

### And yes, the output is still far from ideal 🙂
* 🔎 How would you make it better?

In [None]:
''.join(whole_text)

![meme0_final](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/thats_all.jpg?raw=true)