# Embeddings

In [1]:
%load_ext autoreload
%autoreload 2
import tensorflow as tf
from pathlib import Path



We start with a simple sentence

In [2]:
text = 'the cat sat on the mat'

How do we encode this? Let's start with one-hot encoding (OHE).

In [3]:
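from sklearn.preprocessing import OneHotEncoder
import numpy as np

oh = OneHotEncoder()
# OneHotEncoder expects a 2D array: one row per word, one column (feature).
X = np.array(text.split(" ")).reshape(-1, 1)
X = oh.fit_transform(X)  # returns a sparse matrix
X.todense()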

matrix([[0., 0., 0., 0., 1.],
        [1., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1.],
        [0., 1., 0., 0., 0.]])

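Every row is a word of the sentence and every column is a word of the vocabulary (sorted alphabetically: cat, mat, on, sat, the). So we could use one-hot encoding, like this: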

<img src=https://www.tensorflow.org/text/guide/images/one-hot.png width=400/>

But this becomes really inefficient with a 10,000-word vocabulary: every word would need a vector of 10,000 entries, of which all but one are zero.

So, let's give each word a unique number instead. First, we create a TensorFlow dataset.
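
To put a rough number on that inefficiency, the sketch below compares the storage for our six-word sentence under a dense one-hot encoding and under one integer per word. The vocabulary size of 10,000 comes from the text above; the `float32` and `int64` dtypes are assumptions chosen for illustration.

```python
import numpy as np

vocab_size = 10_000   # vocabulary size mentioned above
n_words = 6           # 'the cat sat on the mat'

# Dense one-hot: one row per word, one column per vocabulary entry.
one_hot = np.zeros((n_words, vocab_size), dtype=np.float32)

# Integer encoding: a single index per word.
indices = np.zeros(n_words, dtype=np.int64)

print(one_hot.nbytes)                    # 240000 bytes
print(indices.nbytes)                    # 48 bytes
print(one_hot.nbytes // indices.nbytes)  # 5000
```

A sparse matrix avoids storing the zeros, but every word is then still a 10,000-dimensional vector for whatever layer comes next.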

In [4]:
X = text.split(" ")
text_dataset = tf.data.Dataset.from_tensor_slices(X)


We have to set some hyperparameters for the creation of our embedding.

- What is the **maximum vocabulary size**? This depends on the problem at hand and the vocabulary available.
- What is the **maximum sequence length** at which we want to cut off the sequences? We won't be using RNNs, so we have to set a fixed length. If a sequence is too short, it will be padded with a padding token, e.g. zeros.
- What is the **dimensionality of the embedding**? This also depends on the problem at hand. An embedding with 50 dimensions can hold much more fine-grained information than one with 10 or 2 dimensions. But embeddings with 50 (or 300, or 1000) dimensions can also become too complex, and thus suffer from overfitting, long training times, etc.
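
The effect of a fixed maximum sequence length can be sketched in plain Python. The helper `pad_or_truncate` is hypothetical and written only for illustration; it assumes that padding is appended at the end and that the padding index is 0.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut a sequence off at max_len, or fill it up with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

# A short sequence is padded with zeros at the end.
print(pad_or_truncate([2, 6], max_len=6))                    # [2, 6, 0, 0, 0, 0]

# A long sequence loses everything after position max_len.
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_len=6))  # [1, 2, 3, 4, 5, 6]
```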

In [5]:
vocab_size = 5 + 2  # Maximum vocab size: 5 for the words, 2 for the padding token and the unknown token.
max_len = 6  # Maximum sequence length; shorter outputs are padded to this length.
embedding_dims = 4

With these settings, we can initialize the vectorization of the words.

In [6]:
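# TextVectorization is available directly under keras.layers since TF 2.6;
# older versions expose it under keras.layers.experimental.preprocessing.
from tensorflow.keras.layers import TextVectorization

# Create the layer.
vectorize_layer = TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=max_len)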

Now that the vectorization layer has been created, call `adapt` on the text-only dataset to build the vocabulary. Batching is not needed for this small example, but it is useful for large datasets.

In [7]:
vectorize_layer.adapt(text_dataset.batch(64))
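
Conceptually, `adapt` counts the tokens in the dataset and keeps the most frequent ones. The sketch below imitates that idea in plain Python. `build_vocabulary` is a hypothetical helper, not the layer's actual implementation, and the order of words with equal counts may differ from what TensorFlow produces.

```python
from collections import Counter

def build_vocabulary(sentences, max_tokens):
    """Padding token, unknown token, then the most frequent words."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    reserved = ['', '[UNK]']
    n_words = max_tokens - len(reserved)
    most_common = [word for word, _ in counts.most_common(n_words)]
    return reserved + most_common

vocabulary = build_vocabulary(['the cat sat on the mat'], max_tokens=7)
print(vocabulary[:3])   # ['', '[UNK]', 'the']
print(len(vocabulary))  # 7
```

With a smaller `max_tokens`, the least frequent words would fall outside the vocabulary and be mapped to the unknown token.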

Those were all the steps we had to take: we now have a functioning vectorization layer. Let's have a look at the resulting vocabulary.

In [8]:
vectorize_layer.get_vocabulary()

['', '[UNK]', 'the', 'sat', 'on', 'mat', 'cat']

We recognize the words of the text, but also additional tokens for an empty token (used for padding) and an unknown token.
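
Since the index of a word is simply its position in this list, the mapping in both directions can be sketched in plain Python. The vocabulary is copied from the output above; `encode` and `decode` are hypothetical helpers for illustration and skip the padding that the layer adds.

```python
vocabulary = ['', '[UNK]', 'the', 'sat', 'on', 'mat', 'cat']
word_to_index = {word: i for i, word in enumerate(vocabulary)}
unk = word_to_index['[UNK]']

def encode(sentence):
    """Map every word to its index; words outside the vocabulary get the [UNK] index."""
    return [word_to_index.get(word, unk) for word in sentence.split()]

def decode(indices):
    """Map every index back to its word."""
    return [vocabulary[i] for i in indices]

ids = encode('the cat')
print(ids)          # [2, 6]
print(decode(ids))  # ['the', 'cat']
```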

Let's look at a minimal example of the vectorization in action

In [10]:
# the layer works as a stand alone layer to vectorize strings
vectorize_layer("the cat")

<tf.Tensor: shape=(6,), dtype=int64, numpy=array([2, 6, 0, 0, 0, 0])>

In [11]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Embedding

# Create the model that uses the vectorize text layer
model = Sequential([
    # Start by creating an explicit input layer. It needs to have a shape of
    # (1,) (because we need to tell the model that there is exactly one string
    # input per example), and the dtype needs to be 'string'.
    InputLayer(input_shape=[1], dtype=tf.string),
    # The first layer in our model is the vectorization layer. After this
    # layer, we have a tensor of shape (batch_size, max_len) containing vocab
    # indices.
    vectorize_layer
])

In [12]:
# Now, the model can map strings to integers, and you can add an embedding
# layer to map these integers to learned embeddings.
input_data = [["where sat the cat"], ["the cat sat on the mat"]]
output = model.predict(input_data)
print(output)

[[1 3 2 6 0 0]
 [2 6 3 4 2 5]]


So, this is what we wanted! Instead of a one-hot encoding that takes a lot of space, we have a compact encoding that represents every word as a single integer: its index in the vocabulary.

Can you trace back the sentence encoding from the vocabulary?
Try to decode it yourself. Look at the sentence 'where sat the cat', look at the vocabulary we printed earlier, and look at the result `[1,3,2,6,0,0]`.

- What happened to the word 'where'? Why is that?
- What happened to the length of the input sentence?


We can feed this vectorized form to an embedding layer as a tensor of shape `(batch_size, sequence_length)`. An Embedding layer learns a vector for every index in the vocabulary. What it learns is driven by the targets we feed the model and by the loss function.

The output of the Embedding layer has shape `(batch_size, sequence_length, embedding_dimensionality)`. So, a batch of 32 sentences, each with a length of 10 words, has shape `(32, 10)`. The output of the embedding will be `(32, 10, 4)` if you use a 4-dimensional embedding for every word.
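
The shape bookkeeping can be checked with a small NumPy sketch that treats the embedding as a lookup table. The table is filled with random numbers as a stand-in for learned weights (an assumption for illustration), and the vocabulary size of 7 is taken from our example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 7, 4

# Embedding table: one row of `embedding_dim` numbers per vocabulary index.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch of 32 sequences with 10 token indices each.
token_ids = rng.integers(0, vocab_size, size=(32, 10))

# Lookup: every index is replaced by its row in the table.
embedded = table[token_ids]
print(token_ids.shape)  # (32, 10)
print(embedded.shape)   # (32, 10, 4)

# The same result as multiplying one-hot vectors with the table.
one_hot = np.eye(vocab_size)[token_ids]        # shape (32, 10, 7)
print(np.allclose(one_hot @ table, embedded))  # True
```

The last lines show why the integer encoding loses nothing compared to one-hot encoding: looking up row `i` gives the same result as multiplying the one-hot vector for `i` with the table, without ever building the one-hot vectors.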

<img src=https://www.tensorflow.org/text/guide/images/embedding2.png width=400/>

In [13]:
embedding_dimension = 5
model = Sequential([
    InputLayer(input_shape=[1], dtype=tf.string),
    vectorize_layer,
    Embedding(input_dim=vocab_size, output_dim=embedding_dimension)
])

In [14]:
result = model.predict(input_data)
result.shape

(2, 6, 5)

So, we can see:

- we have two observations (batch size)
- every observation has a sequence length of 6 tokens (words plus padding)
- every token is encoded with 5 numbers

We have effectively embedded every word in a 5-dimensional vector space.

In [15]:
result[0] # result for the first sentence

array([[ 0.00285207,  0.01483097,  0.00756167,  0.00207195,  0.02628548],
       [ 0.03857994, -0.0447681 ,  0.01171069, -0.03995188, -0.00779179],
       [ 0.00806916, -0.02925973,  0.04288353,  0.02198106,  0.01243716],
       [ 0.00224929, -0.016835  ,  0.01042606, -0.02297521,  0.04011783],
       [-0.03304671,  0.04753392, -0.02972338,  0.014785  ,  0.00142245],
       [-0.03304671,  0.04753392, -0.02972338,  0.014785  ,  0.00142245]],
      dtype=float32)

We fed the model a sentence of 4 words, which became 6 integers after padding. The output is a vector of 5 numbers for every token, so (6 x 5) numbers. The model has not been trained yet, so these values are just the random initial weights of the Embedding layer. You can look at this output as a way of generating features. In that respect, it is similar to what we have been doing with `Conv1D` and `Conv2D` layers.
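
In the output above, the last two rows are identical. That follows from the lookup: both positions hold the padding index 0, so both receive the same row of the embedding table. A small NumPy sketch, with a randomly filled table standing in for the layer's weights (an assumption, these are not the actual weights):

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embedding_dim = 7, 5

# Stand-in for the weights of the Embedding layer.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# 'where sat the cat' after vectorization, as printed earlier.
token_ids = np.array([1, 3, 2, 6, 0, 0])

embedded = table[token_ids]
print(embedded.shape)                            # (6, 5)
print(np.array_equal(embedded[4], embedded[5]))  # True: both are the padding row
print(np.array_equal(embedded[4], table[0]))     # True: row 0 of the table
```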

Another way to think about these embeddings, once they are trained, is as an encoding of the "meaning" of every word.