<a href="https://colab.research.google.com/github/shreyans-sureja/llm-101/blob/main/part4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Token Embeddings

- Input text is broken into tokens.
- Tokens are converted into token IDs.
- Token IDs are converted into token embeddings.
- Embeddings serve as the input to the LLM.

What are token embeddings and why are they needed?

- Computers do not understand words, so how can we represent words as numbers?
- That is why we created a vocabulary, which maps each token to a token ID.


Why can't we use these token IDs directly, and why do we need token embeddings?
- "cat" and "kitten" are semantically related, but their token IDs are arbitrary integers that can't capture that relationship.


What about one-hot encoding?
- It still fails to capture semantic relationships: every pair of distinct one-hot vectors is equally far apart.
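
A minimal sketch of the one-hot limitation, using only the standard library. The three-word vocabulary and its ID assignments are made up for illustration.

```python
import math

# Hypothetical toy vocabulary; the ID assignments are arbitrary.
toy_vocab = {"cat": 0, "kitten": 1, "car": 2}

def one_hot(token_id, size):
    """Return a list with 1.0 at position token_id and 0.0 elsewhere."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

one_hot_vectors = {w: one_hot(i, len(toy_vocab)) for w, i in toy_vocab.items()}

# Distinct one-hot vectors are orthogonal, so "cat" looks exactly as
# unrelated to "kitten" as it does to "car".
print(cosine(one_hot_vectors["cat"], one_hot_vectors["kitten"]))  # 0.0
print(cosine(one_hot_vectors["cat"], one_hot_vectors["car"]))     # 0.0
```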

How to encode semantic meaning?
- Semantically similar words should have similar vectors.
- Represent each word as a feature vector: words that share a feature get close values in that dimension, so related words end up closer to each other than to unrelated words.
- We can train a neural network to create these vector embeddings.
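
The feature-vector idea can be sketched with hand-picked values. The three features and all numbers below are assumptions chosen for illustration; real embeddings are learned, and their dimensions are generally not interpretable like this.

```python
import math

# Hypothetical hand-crafted features: [is_animal, is_young, is_vehicle].
features = {
    "cat":    [1.0, 0.1, 0.0],
    "kitten": [1.0, 0.9, 0.0],
    "car":    [0.0, 0.1, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_cat_kitten = cosine(features["cat"], features["kitten"])
sim_cat_car = cosine(features["cat"], features["car"])

# Words sharing features score higher than words that do not.
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```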

Many companies have released pre-trained token embeddings.
- Example: Google News word2vec, https://huggingface.co/fse/word2vec-google-news-300
- This model contains 300-dimensional vectors for 3 million words and phrases.


In [None]:
! pip install -U gensim



In [None]:
import gensim.downloader as api

In [None]:
model = api.load("word2vec-google-news-300")



In [None]:
word_vectors = model

In [None]:
print(type(word_vectors))

<class 'gensim.models.keyedvectors.KeyedVectors'>


In [None]:
print(word_vectors['india'])

[-2.34375000e-01 -7.17773438e-02  1.05590820e-02  3.26171875e-01
 -6.29882812e-02 -1.78710938e-01  3.17382812e-02 -3.96484375e-01
 -1.69921875e-01 -3.54003906e-02 -1.81640625e-01 -3.28125000e-01
  6.59179688e-02 -2.07031250e-01  1.19140625e-01  1.74804688e-01
 -1.10839844e-01  3.30078125e-01  5.20019531e-02 -2.47802734e-02
  1.48773193e-03 -1.60156250e-01  2.70996094e-02 -1.80664062e-01
 -4.14062500e-01  1.95312500e-01 -3.49609375e-01  1.03515625e-01
 -8.54492188e-02 -1.48437500e-01 -8.25195312e-02 -2.90527344e-02
 -3.02734375e-01  1.98974609e-02 -3.26171875e-01  1.70898438e-01
 -4.55078125e-01 -4.39453125e-03  4.27734375e-01 -2.13867188e-01
 -6.86645508e-03  1.23535156e-01  4.96093750e-01  3.41796875e-01
  1.70898438e-01 -1.56250000e-01 -9.42382812e-02 -5.73730469e-02
 -1.95312500e-01  6.44531250e-02 -1.49414062e-01  1.58203125e-01
  2.53906250e-01  2.13867188e-01 -2.85156250e-01 -2.77343750e-01
 -2.24609375e-01 -2.96875000e-01 -7.17163086e-03 -3.47656250e-01
 -6.89697266e-03  8.39843

In [None]:
print(word_vectors['india'].shape)

(300,)


Every word available in this model is represented as a 300-dimensional vector.

### Similar words

- Well-trained vectors can capture similarity between words.

In [None]:
# Check that the words below are present in the model

print('king' in word_vectors)
print('woman' in word_vectors)
print('man' in word_vectors)

True
True
True


## king + woman - man = ?

In [None]:
print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man']))

[('queen', 0.7118193507194519), ('monarch', 0.6189674139022827), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321839332581), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.5181134343147278), ('sultan', 0.5098593831062317), ('monarchy', 0.5087411999702454)]


As the output shows, the top result is "queen", so the vectors capture the relationship between these words.
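
Conceptually, this query is vector arithmetic followed by a nearest-neighbour search. Below is a simplified sketch on made-up 2-D vectors; the `analogy` helper and the toy values are hypothetical, and gensim's actual implementation differs in details such as how vectors are normalized and combined.

```python
import math

# Hypothetical 2-D vectors with hand-picked axes [royalty, masculinity].
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates.
    query_words = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in query_words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result[0][0])  # queen
```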

In [None]:
print('independence' in word_vectors)
print('uk' in word_vectors)

True
True


In [None]:
print(word_vectors.most_similar(positive=['India', 'war']))

[('wars', 0.6058735251426697), ('subcontinent', 0.5804473161697388), ('Alwine_Goveas_Kulshekar_Mangalore', 0.5778571367263794), ('sub_continent', 0.5734809041023254), ('Maurice_Quadras_Mangalore', 0.5726993083953857), ('Sri_Lanka', 0.5608813762664795), ('Indias', 0.560208261013031), ('Pakistan', 0.5474928617477417), ('Operation_Parakaram', 0.5458939671516418), ('subcontinent_partition', 0.5419802069664001)]


Check the similarity between word pairs.

In [None]:
print(word_vectors.similarity('man', 'woman'))
print(word_vectors.similarity('India', 'Pakistan'))
print(word_vectors.similarity('India', 'Brazil'))
print(word_vectors.similarity('paper', 'man'))
print(word_vectors.similarity('Mars', 'computer'))

0.76640123
0.6706861
0.49957347
0.085691616
0.0976148


These results show that when vector embeddings are trained well, they capture similarity between words, which helps significantly in training.

## How are token embeddings created for LLMs?

- Start with a vocabulary.
- The vocabulary consists of tokens.
- Each token has a token ID.
- Every token ID is converted to an embedding vector.
- For GPT-2, the vocabulary size was 50257 and the embedding dimension was 768 for the small model.
- So the total number of weights is 50257 * 768; this is called the embedding layer weight matrix.
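
A quick calculation of the matrix size from the figures above. The memory estimate assumes 32-bit floats, which is an assumption and not something stated in these notes.

```python
# GPT-2 small figures quoted in the list above.
gpt2_vocab_size = 50257
gpt2_embedding_dim = 768

# One row per token ID, one column per embedding dimension.
num_weights = gpt2_vocab_size * gpt2_embedding_dim
print(f"{num_weights:,}")  # 38,597,376

# Assumption: each weight is stored as a 4-byte float32.
size_mib = num_weights * 4 / (1024 ** 2)
print(f"{size_mib:.1f} MiB")
```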


## How does training actually work?
- We fix the embedding dimension and the vocabulary size.
- The embedding weights are initialized with random values.
- These random values serve as the starting point for the LLM's learning process.
- The weights are optimized as part of the LLM training process.
- Given training data (for example, the Google News data), the weights are adjusted toward good values via backpropagation, as in any neural network training.
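
The steps above can be sketched as a single training step in PyTorch. This is only an illustration under stated assumptions: the toy sizes, the `toy_head` linear layer standing in for the rest of the model, and the (current token, next token) pairs are all hypothetical.

```python
import torch

torch.manual_seed(0)

# Toy sizes chosen for illustration only.
toy_embedding = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)
toy_head = torch.nn.Linear(3, 6)  # hypothetical stand-in: embedding -> next-token logits

# Hypothetical training pairs: current token ID -> next token ID.
inputs = torch.tensor([2, 3, 5])
targets = torch.tensor([3, 5, 1])

params = list(toy_embedding.parameters()) + list(toy_head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

# Snapshot of the randomly initialized embedding weights.
before = toy_embedding.weight.detach().clone()

# One forward pass, one backward pass, one weight update.
logits = toy_head(toy_embedding(inputs))
loss = torch.nn.functional.cross_entropy(logits, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Only the rows for token IDs that appeared in the batch receive a gradient.
after = toy_embedding.weight.detach()
changed_rows = (before != after).any(dim=1)
print(changed_rows)
```

Repeating this step over a large corpus is what gradually moves the random starting vectors toward useful ones.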


## Example

In [None]:
import torch
input_ids = torch.tensor([2, 3, 5, 1])

In [None]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)

<torch._C.Generator at 0x7f9920b1bb10>

In [None]:
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [None]:
print(type(embedding_layer))

<class 'torch.nn.modules.sparse.Embedding'>


In [None]:
print(embedding_layer)

Embedding(6, 3)


In [43]:
print(embedding_layer.weight) # initial weights

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


These are the weights that will be optimized during LLM training (covered in future notes). Two sets of weights are trained:

1. The embedding layer weights.
2. The weights of the rest of the model, which are needed to predict the next word.

In [45]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


In [46]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
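
The output above matches rows 2, 3, 5 and 1 of the weight matrix, so the embedding layer acts as a lookup table. A small self-contained check (the names `lookup_layer` and `ids` are introduced here for illustration):

```python
import torch

torch.manual_seed(123)
lookup_layer = torch.nn.Embedding(6, 3)
ids = torch.tensor([2, 3, 5, 1])

# Calling the layer and indexing its weight matrix return the same rows.
looked_up = lookup_layer(ids)
indexed = lookup_layer.weight[ids]
print(torch.equal(looked_up, indexed))  # True
```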


The embedding layer is equivalent to a neural network linear layer (without bias) applied to one-hot encoded token IDs: both produce the same output.

The embedding layer is much more computationally efficient, since it looks up rows directly, whereas the linear layer performs many unnecessary multiplications by zero.
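
A minimal sketch of this equivalence, assuming the linear layer has no bias and is given the same weights as the embedding layer. The names `embedding`, `linear` and `one_hot_ids` are introduced here for illustration.

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim = 6, 3
embedding = torch.nn.Embedding(vocab_size, output_dim)
ids = torch.tensor([2, 3, 5, 1])

# A bias-free linear layer computes x @ W.T, so copy the transposed
# embedding matrix into it to make the two layers comparable.
linear = torch.nn.Linear(vocab_size, output_dim, bias=False)
with torch.no_grad():
    linear.weight.copy_(embedding.weight.T)

# Each one-hot row has a single 1, which selects one row of the matrix;
# all other products in the matrix multiplication are multiplications by zero.
one_hot_ids = torch.nn.functional.one_hot(ids, num_classes=vocab_size).float()

print(torch.allclose(linear(one_hot_ids), embedding(ids)))  # True
```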