<a href="https://colab.research.google.com/github/vkjadon/llm/blob/main/vectorembeddigs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A model cannot process raw text; it needs numbers. A tokenizer breaks the text into subword tokens, and each token is mapped to an integer token ID. Each token ID points to a row in a large learned table called the embedding matrix.
That row is the token's initial embedding. Positional information is then added, and the transformer layers refine these embeddings using self-attention so that each vector reflects its context.
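
The lookup described above can be sketched with a toy table. The vocabulary, token IDs, and dimensions below are invented for illustration, and the sinusoidal positional encoding is only one common scheme assumed here (models such as BERT and GPT-2 learn their position embeddings instead).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"[UNK]": 0, "trans": 1, "##former": 2, "##s": 3, "learn": 4}
d_model = 8

# Embedding matrix: one row per token ID (random here, learned in a real model).
embedding_matrix = rng.normal(size=(len(vocab), d_model))

tokens = ["trans", "##former", "##s", "learn"]
token_ids = np.array([vocab[t] for t in tokens])

# Row lookup: each token ID selects its row of the embedding matrix.
token_embeddings = embedding_matrix[token_ids]                   # (4, 8)

# Sinusoidal positional encoding (assumed scheme, for illustration only).
positions = np.arange(len(token_ids))[:, None]                   # (4, 1)
freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))    # (4,)
pos_enc = np.zeros((len(token_ids), d_model))
pos_enc[:, 0::2] = np.sin(positions * freqs)
pos_enc[:, 1::2] = np.cos(positions * freqs)

# What the first transformer layer would receive in this sketch.
layer_input = token_embeddings + pos_enc
print(token_ids, layer_input.shape)
```

The first transformer layer only ever sees `layer_input`; the token strings themselves are never passed to the network.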

During pretraining, these vectors are initialized randomly and then updated by gradient descent over many training steps.
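
As a rough illustration of that update, the sketch below applies one gradient descent step to a single row of a randomly initialized table. The squared-error loss, the target vector, and the learning rate are hypothetical stand-ins; a real model backpropagates a language-modeling loss through all of its layers.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(5, 4))   # random initialization
before = embedding_matrix.copy()

token_id = 2
target = np.ones(4)        # hypothetical training signal
learning_rate = 0.1

def loss(table):
    # Toy loss: squared distance between the looked-up row and the target.
    return 0.5 * np.sum((table[token_id] - target) ** 2)

loss_before = loss(embedding_matrix)

# Gradient of the toy loss with respect to the looked-up row.
# Rows that were not looked up receive zero gradient and stay unchanged.
grad_row = embedding_matrix[token_id] - target
embedding_matrix[token_id] -= learning_rate * grad_row

loss_after = loss(embedding_matrix)
print(loss_after < loss_before)
```

Only the row for `token_id` moves in this step, which is why tokens that appear often in the training data tend to receive many more updates than rare ones.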

In [None]:
import torch

In [None]:
from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')

In [None]:
from openai import OpenAI
client = OpenAI(api_key=api_key)

In [None]:
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Artificial intelligence is powerful."
)

In [None]:
print(len(response.data[0].embedding))
print(response.data[0].embedding[:5])

1536
[0.01670043170452118, -0.01469384040683508, -0.024587100371718407, 0.01469384040683508, 0.033121466636657715]


Sentence Transformers is a standalone Python library built on top of PyTorch and Hugging Face Transformers, both of which are required dependencies. It is not bundled with PyTorch itself. The library provides a convenient, high-level interface for generating sentence and text embeddings with pre-trained models, abstracting away much of the complexity of the underlying frameworks.

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")
sentence = "Transformers learn contextual embeddings."

embedding = model.encode(sentence)
print(len(embedding), embedding[:5])   # 384-dim vector

In [None]:
from transformers import AutoTokenizer, AutoModel

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Transformers are powerful."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state      # shape: (batch, tokens, hidden)
print(embeddings.shape)
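
The output above holds one vector per token, whereas the Sentence Transformers example returned a single vector for the whole sentence. One common way to bridge the two, assumed here and not necessarily what any given model does, is masked mean pooling: average the token vectors while ignoring padding. The arrays below are small stand-ins for `outputs.last_hidden_state` and `inputs["attention_mask"]`, so the sketch runs without downloading a model.

```python
import numpy as np

# Stand-in for outputs.last_hidden_state: (batch=2, tokens=4, hidden=3).
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)

# Stand-in for inputs["attention_mask"]: 1 = real token, 0 = padding.
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])

mask = attention_mask[:, :, None]            # (2, 4, 1), broadcasts over hidden
summed = (hidden * mask).sum(axis=1)         # sum of real-token vectors, (2, 3)
counts = mask.sum(axis=1)                    # number of real tokens, (2, 1)
sentence_embeddings = summed / counts        # one vector per sentence, (2, 3)
print(sentence_embeddings.shape)
```

With PyTorch tensors the same arithmetic applies; the mask is needed only when a batch contains sentences of different lengths.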


In [None]:
from transformers import GPT2Tokenizer, GPT2Model

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

text = "Embeddings capture meaning."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state      # token embeddings
print(embeddings.shape)


In [None]:
from transformers import T5Tokenizer, T5EncoderModel

In [None]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

text = "Text-to-text transformers."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state
print(embeddings.shape)


In [None]:
from transformers import AutoTokenizer, AutoModel

In [None]:
HF_TOKEN = userdata.get('HF_TOKEN')

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    "google-bert/bert-base-uncased",
    token=HF_TOKEN
)

text = "Large language models use attention."
tokens = tokenizer(text, return_tensors="pt")
print(tokens)   # dict-like: input_ids, token_type_ids, attention_mask
