# VSB,FEI - Generative AI Workshop

The aim of the workshop is to get an overview of data analysis and deep learning techniques in the generative artificial intelligence (GenAI) domain.

* We will use [Python](https://www.python.org/), [Huggingface](https://huggingface.co/) and [Tensorflow](https://www.tensorflow.org/).

**The exercise will cover these topics:**
* GenAI tools for image data using Huggingface models
<!-- * LLM usage for text generating with Huggingface API -->
* Vector representation of text data and searching for similar words using vector distance 
* Designing our own deep learning model from scratch with the Keras framework to generate "Harry Potter"-like text

## Deep learning in Python introduction
* This lecture focuses on using word embeddings to search for similar words and on using an RNN for text generation.

* We will use the Harry Potter books in this lecture to demonstrate training our own model in Keras and generating our own HP-like stories.

![meme01](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/dl_meme_01.jpg?raw=true)

## Importing TensorFlow
The installed TensorFlow (TF) version is stored in the `VERSION` field of the `tf.version` module. Since TensorFlow 2.0, the high-level functionality has been encapsulated under the Keras API.

In [2]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tf.version.VERSION

'2.13.0'

# 🔎 How does the neural network work with text?
* Is it capable of processing text directly, or does it work only with numbers?
* Can you come up with a very simple way to encode text as numbers?

# 🔎 What is a word embedding?
* Why do we use it?
* How will its properties differ from those of naive approaches?

# Word embedding is a vector
* Do you know what a vector is?

# $$\vec{w} = \left(w_1, w_2, ..., w_n\right)$$

![meme03](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/dl_meme_tf_02.png?raw=true)

# 💡 You can imagine an embedding vector as an array of numbers, e.g. [0.5,0.3,0.1,-0.3,1.2]

![meme03](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/dl_05_enc_arch.png?raw=true)

# The most famous word embedding is perhaps Word2Vec

## 💡 There are two approaches to training a Word2Vec embedding

* **Continuous bag-of-words model**: 
    * predicts the middle word based on surrounding context words. 
    * the context consists of a few words before and after the current (middle) word. 
    * this architecture is called a bag-of-words model as the order of words in the context is not important.

* **Continuous skip-gram model**: 
    * predicts words within a certain range before and after the current word in the same sentence. 

![w2v](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_07_skip.png?raw=true)
  
* 💡 Bag-of-words model predicts a word given the neighboring context
* 💡 Skip-gram model predicts the context (or neighbors) of a word, given the word itself

* The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (see the diagram below for an example). 
* The context of a word can be represented through a set of skip-gram pairs of *(target_word, context_word)* where *context_word* appears in the neighboring context of target_word.

## We will demonstrate the approach using a single sentence

* The context words for each of the 8 words of this sentence are defined by a window size. 
* The window size determines the span of words on either side of a target_word that can be considered a context word.
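
The pair generation described above can be sketched in a few lines. This is an illustrative sketch, not the training pipeline itself: the helper name `skipgram_pairs` is hypothetical, and the 8-word sentence is an assumed stand-in, since the original sentence appears only in the figure.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target_word, context_word) pairs for every token.

    A context word is any other token at most `window_size`
    positions to the left or right of the target word.
    """
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Assumed 8-word example sentence (hypothetical stand-in for the one in the figure).
tokens = "the wide road shimmered in the hot sun".split()
pairs = skipgram_pairs(tokens, window_size=2)

for pair in pairs[:5]:
    print(pair)
print(len(pairs))  # 26
```

A larger window produces more pairs per word and captures broader topical context, while a smaller window keeps the pairs closer to the immediate neighbors.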

![w2v_tab](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_07_tab.png?raw=true)

# 💡 The deep learning model de facto learns which pairs of words often appear together in text and which do not
* Can you give some examples of such word pairs yourself?

# A nice property of word embedding vectors is that vectors of words with similar meaning lie close together
* If you compute a distance between them, it will be close to zero

## Let's say that the vector is just 2D
* What does a 2D vector look like?
* Can you calculate the distance between two 2D vectors?
* What is the formula called in 2D, and what is it called in n-D?
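
As a hint for the questions above, here is a minimal sketch of the distance computation: the Pythagorean theorem in 2D, which generalizes to the Euclidean distance in n-D. The vectors and the helper name `euclidean_distance` are illustrative assumptions.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


# 2D case: the Pythagorean theorem on the coordinate differences.
a = (1.0, 2.0)
b = (4.0, 6.0)
dist_2d = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
print(dist_2d)  # 5.0

# The general n-D function gives the same result for 2D input...
print(euclidean_distance(a, b))

# ...and also works for longer vectors, such as 5D ones.
p = (1.0, 0.0, 2.0, -1.0, 3.0)
q = (0.0, 1.0, 2.0, 1.0, 1.0)
print(euclidean_distance(p, q))
```

With NumPy arrays, the same value can be obtained with `np.linalg.norm(a - b)`.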

# Ok, enough of theory!
## Let's try it in practice with pre-trained vectors! 🙂
* 🔎 Pre-trained on what!?

In [4]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2023-09-06 15:36:13--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-09-06 15:36:14--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-09-06 15:36:15--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [6]:
path_to_glove_file = 'glove.6B.50d.txt'

# We will take a look at the file structure now

In [8]:
with open(path_to_glove_file, encoding="utf-8") as f:
    # Print the first 11 lines; each line is a word followed by its 50 coefficients
    for i, line in enumerate(f):
        if i > 10:
            break
        print(line)

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581

, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392

. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.423

# Let's load the file into a dictionary
* key:value structure -> word:vector

In [9]:
embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        # The first token is the word; the rest of the line holds its coefficients
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.


## 💡 This is what the embedding latent vectors look like for the words 'audi' and 'bmw'

In [10]:
embeddings_index['audi']

array([ 0.051355 ,  0.11694  ,  1.0251   ,  0.12414  , -0.83236  ,
        1.0288   , -0.64566  , -1.4468   , -0.89265  , -0.32658  ,
        0.66507  , -0.65524  , -1.8323   , -1.0347   ,  0.13486  ,
       -0.033565 , -0.2208   ,  1.855    , -0.2495   , -0.84343  ,
        0.14318  , -0.81258  , -0.84232  ,  1.1247   , -0.075604 ,
       -0.30852  , -0.79071  ,  0.80721  , -0.24747  , -0.029263 ,
        0.2684   ,  0.6531   ,  0.48872  ,  1.1838   ,  0.5606   ,
       -0.68087  ,  0.25192  ,  0.98091  , -1.0433   , -0.27203  ,
        1.1912   , -0.88594  ,  0.022038 , -0.82012  , -0.0022396,
       -0.68251  ,  0.12713  ,  0.85041  ,  1.002    ,  0.33904  ],
      dtype=float32)

In [11]:
embeddings_index['bmw']

array([ 0.70038 , -0.16073 ,  1.3423  ,  0.63331 , -0.21958 ,  0.31944 ,
       -0.67042 , -0.94041 , -0.56935 , -0.67842 ,  0.39705 , -0.18964 ,
       -2.2101  , -0.90947 ,  0.95511 , -0.01321 , -0.32738 ,  1.1554  ,
       -0.48464 , -1.7606  , -0.051495, -1.0745  , -1.183   ,  0.68672 ,
       -0.107   , -0.42152 , -0.15516 ,  0.12724 , -0.42114 ,  0.30905 ,
        0.59784 ,  0.050149,  0.24022 ,  0.86494 ,  0.63488 , -0.75644 ,
       -0.09189 ,  1.0218  , -0.96638 , -0.90508 ,  0.80575 , -0.75225 ,
        0.7642  , -0.94425 ,  0.4609  ,  0.11877 ,  0.24907 ,  0.066667,
        0.59622 ,  0.1275  ], dtype=float32)

## The cosine distance between the car brands should be smaller than the distance to some random word
* Why?

# Cosine vs. Euclidean similarity
* 🔎 What is the difference?
* 🔎 How to compute it?

## $$cos(\vec{A},\vec{B}) = \frac{\sum_{i=1}^{n} A_i \cdot B_i}{\sqrt{\sum_{i=1}^{n} A_i^2 \cdot \sum_{i=1}^{n} B_i^2}}$$
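
The formula above translates almost term by term into code. The following is a minimal sketch with made-up 2D vectors; the helper names `cosine_similarity` and `cosine_distance` are illustrative. The `cosine` function called in the cells that follow is assumed to be a cosine *distance*, that is, 1 minus the similarity, which is why similar words should score close to zero there.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, following the formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def cosine_distance(a, b):
    """Cosine distance: 0 for the same direction, 1 for perpendicular vectors."""
    return 1.0 - cosine_similarity(a, b)


u = [1.0, 2.0]
v = [2.0, 4.0]   # same direction as u, only longer
w = [-2.0, 1.0]  # perpendicular to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
print(cosine_distance(u, v))
```

Unlike the Euclidean distance, this measure ignores vector length: `u` and `v` point in the same direction, so their cosine distance is zero even though the points themselves are far apart.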



In [None]:
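from scipy.spatial.distance import cosine  # assumed source of `cosine`: cosine distance = 1 - cosine similarity
cosine(embeddings_index['audi'], embeddings_index['bmw'])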

In [None]:
cosine(embeddings_index['audi'], embeddings_index['king'])

# To try the famous queen -> king example, we need to build the embedding matrix

![w2v_meme_03](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_07_meme_03.png?raw=true)

In [None]:
num_tokens = len(embeddings_index.keys())
embedding_dim = 50
hits = 0
misses = 0
word2id = {k:i for i, (k,v) in enumerate(embeddings_index.items())}
id2word = {v:k for k, v in word2id.items()}

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word2id.items():
    embedding_vector = embeddings_index.get(word)
    embedding_matrix[i] = embedding_vector


## Finding the closest words is pretty easy now

In [None]:
from sklearn.metrics.pairwise import cosine_distances  # assumed source of `cosine_distances`
c_w = cosine_distances(embedding_matrix[word2id['man']].reshape(-1, 50), embedding_matrix)

In [None]:
for x in c_w.argsort().ravel()[1:6]:
    print(id2word[x])

In [None]:
c_w = cosine_distances(embedding_matrix[word2id['woman']].reshape(-1, 50), embedding_matrix)

In [None]:
for x in c_w.argsort().ravel()[1:6]:
    print(id2word[x])

## The idea is that the difference between *man* and *woman* should be similar to the difference between *king* and *queen*, so it should be possible to use this difference to search for analogies

In [None]:
dist = embeddings_index['man'] - embeddings_index['woman']

In [None]:
dist

In [None]:
summed = embeddings_index['queen'] + dist

In [None]:
summed

In [None]:
res = cosine_distances(summed.reshape(-1, 50), embedding_matrix)
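
The cell above computes `res` but does not display it. With the real data, the closest words could be listed with the same pattern used earlier for *man* and *woman*, e.g. `for x in res.argsort().ravel()[:5]: print(id2word[x])`; the query words themselves may show up near the top. The self-contained sketch below illustrates the same arithmetic on tiny hand-made 2D vectors. These toy vectors are hypothetical and chosen only so that the analogy works; they are not GloVe values.

```python
import math

# Hypothetical 2D "embeddings": first axis ~ royalty, second axis ~ gender.
toy_embeddings = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


# Same arithmetic as in the cells above: queen + (man - woman).
diff = [m - w for m, w in zip(toy_embeddings["man"], toy_embeddings["woman"])]
summed = [q + d for q, d in zip(toy_embeddings["queen"], diff)]

# Rank all words by their cosine distance to the resulting vector.
ranked = sorted(
    toy_embeddings,
    key=lambda word: cosine_distance(summed, toy_embeddings[word]),
)
print(ranked[0])
```

In this toy setup the closest word to `queen + (man - woman)` is `king`. With real embeddings the result is noisier, so it is worth looking at several of the nearest words instead of only the first one.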

![meme0_final](https://github.com/rasvob/PopAI-VSB-Workshop/blob/main/images/thats_all.jpg?raw=true)