# RAG, embeddings, vector databases (Week 5)

+ Unhappy: 👎
+ Anxious: 😬
+ Curious: 🤔
+ Happy: 👍

## Objectives

For this week's activities, we must do the following:

- [ ] Create a database or vector store.
- [x] Create embeddings of a data source.
- [ ] Create a simple RAG query from the embeddings.
    + Take request from end user and convert to embedding.
    + Search embeddings table or index for relevant embeddings.
    + Return original inputs from embeddings search.
    + Pass inputs into the prompt as context for LLM.
    + Call the LLM for the result.
- [ ] Wire up the application to connect to the database.
- [ ] Experiment with different prompt methods and search methods
      for different results.
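
The RAG query steps above can be sketched end to end without any external service. This is only a minimal illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory `index` list stands in for the vector store, and the function names and sample passages are hypothetical. The final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, index, top_k=2):
    """Embed the query, rank stored passages by similarity, return the originals."""
    query_vec = embed(query)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


def build_prompt(query, passages):
    """Place the retrieved passages into the prompt as context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical sample passages; a real index would hold chunks of the source texts.
passages = [
    "The Phoenicians landed at Argos and traded with the natives.",
    "Herodotus of Halicarnassus publishes his researches.",
    "The Erythraean Sea was the earlier home of the Phoenicians.",
]
index = [(p, embed(p)) for p in passages]

query = "Where did the Phoenicians trade?"
top_passages = retrieve(query, index, top_k=2)
prompt = build_prompt(query, top_passages)
print(prompt)
# The last step would pass `prompt` to the LLM; that call is omitted here.
```

Swapping `embed` for a real model and `index` for a database query should leave the other steps unchanged, which is the main reason for keeping each step in its own function.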

Nice to haves:

- [ ] Limit context passed to Gemma model based upon token count
- [ ] Train Gemma model on Guanaco dataset
- [ ] Upgrade ALL the things to Genkit
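
For the token-count item, one possible approach is to add retrieved passages in ranked order until a budget is used up. The sketch below is an assumption-heavy illustration: `limit_context` and `whitespace_token_count` are hypothetical names, and counting whitespace-separated words only approximates what the Gemma tokenizer would report, so a real version would pass in a counter backed by the model's tokenizer.

```python
def whitespace_token_count(text):
    """Rough token estimate; a real tokenizer will give different counts."""
    return len(text.split())


def limit_context(passages, max_tokens, count_tokens=whitespace_token_count):
    """Keep passages in ranked order until the token budget would be exceeded."""
    kept = []
    used = 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > max_tokens:
            break  # stop at the first passage that does not fit
        kept.append(passage)
        used += cost
    return kept


ranked = ["a b c", "d e", "f g h i"]
print(limit_context(ranked, max_tokens=5))
```

Stopping at the first passage that does not fit keeps the highest-ranked context intact; skipping it and trying smaller passages further down is another option if filling the budget matters more than rank order.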

Sources:

+ https://developers.google.com/machine-learning/crash-course/embeddings
+ https://colab.sandbox.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/embeddings/intro-textemb-vectorsearch.ipynb
+ https://cloud.google.com/blog/products/databases/get-started-with-firestore-vector-similarity-search

## Step 0. Install and import libraries

In [4]:
%%writefile -a requirements.txt
torch
certifi

Appending to requirements.txt


In [5]:
!pip install -qr requirements.txt

In [6]:
import requests
import urllib.request
import certifi

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f93426398d0>

In [7]:
PROJECT_ID = !gcloud config get-value project
PROJECT_ID = PROJECT_ID[0]
LOCATION = "us-west1"

## Step 1. Create word embeddings from sources

+ I want to build an embeddings dataset that helps my Mediterranean archaeology travel site.
+ I want to include text from the following sources:
  + Herodotus' actual works (Histories primarily)
  + Reddit posts (r/archaeology, r/travel)
+ I'm going to use PyTorch as it is my preferred library for data science.
+ 👎👎👎 I can't download the text programmatically :/ Instead, I'll download a TXT version
  manually and then open it as a file.

In [8]:
url = "https://classics.mit.edu/Herodotus/history.mb.txt"

# This doesn't work: the request fails with an SSL certificate error.
#response = requests.get(url, verify=certifi.where())
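
One way to keep the notebook runnable whether or not the download works is to try the request and fall back to the manually downloaded file. This is a sketch only: `load_text` is a hypothetical helper, it uses the standard library instead of `requests`, and it does not fix the certificate error itself; it just catches the failure.

```python
import ssl
import urllib.request


def load_text(url, local_path, timeout=10):
    """Try to download `url`; on any failure, read a local copy instead."""
    try:
        # Default context verifies certificates against the system trust store.
        context = ssl.create_default_context()
        with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as error:
        # URLError and SSL errors are OSError subclasses; malformed URLs raise ValueError.
        print(f"Download failed ({error}); reading {local_path} instead")
        with open(local_path, "r", encoding="utf-8") as f:
            return f.read()
```

Under these assumptions, calling `load_text(url, "history.mb.txt")` would return the text from whichever source is available.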

In [9]:
with open("history.mb.txt", "r") as f:
    herodotus_text = f.read()

print(herodotus_text[:500])

Provided by The Internet Classics Archive.
See bottom for copyright. Available online at
    http://classics.mit.edu//Herodotus/history.html

The History of Herodotus
By Herodotus


Translated by George Rawlinson

----------------------------------------------------------------------

BOOK I

Clio 

These are the researches of Herodotus of Halicarnassus, which he publishes,
in the hope of thereby preserving from decay the remembrance of what
men have done, and of preventing the great and wonderf


In [10]:
herodotus_test = herodotus_text[285:]
print(herodotus_test)


BOOK I

Clio 

These are the researches of Herodotus of Halicarnassus, which he publishes,
in the hope of thereby preserving from decay the remembrance of what
men have done, and of preventing the great and wonderful actions of
the Greeks and the Barbarians from losing their due meed of glory;
and withal to put on record what were their grounds of feuds. According
to the Persians best informed in history, the Phoenicians began to
quarrel. This people, who had formerly dwelt on the shores of the
Erythraean Sea, having migrated to the Mediterranean and settled in
the parts which they now inhabit, began at once, they say, to adventure
on long voyages, freighting their vessels with the wares of Egypt
and Assyria. They landed at many places on the coast, and among the
rest at Argos, which was then preeminent above all the states included
now under the common name of Hellas. Here they exposed their merchandise,
and traded with the natives for five or six days; at the end of which
time, when

In [11]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

herodotus_arr = herodotus_test.split()
ngrams = [
    (
        [herodotus_arr[i - j - 1] for j in range(CONTEXT_SIZE)],
        herodotus_arr[i]
    )
    for i in range(CONTEXT_SIZE, len(herodotus_arr))
]
print(ngrams[:3])

[(['I', 'BOOK'], 'Clio'), (['Clio', 'I'], 'These'), (['These', 'Clio'], 'are')]


In [12]:
vocab = set(herodotus_arr)
word_to_ix = {word: i for i, word in enumerate(vocab)}
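
Because the text is split on whitespace only, tokens keep their punctuation and case, so `Argos,` and `Argos` would become separate vocabulary entries. The sketch below shows one possible normalization step on a sentence taken from the excerpt; the `tokenize` helper and its regular expression are assumptions, not part of the notebook. It also sorts the vocabulary before assigning indices, since iterating over a plain `set` of strings is not guaranteed to give the same order from one Python process to the next.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sample = (
    "They landed at many places on the coast, and among the rest at Argos, "
    "which was then preeminent."
)

raw_tokens = sample.split()      # keeps "coast," and "Argos," with their commas
clean_tokens = tokenize(sample)  # yields "coast" and "argos"

# Sorting makes the word-to-index mapping reproducible across runs.
word_to_ix = {word: i for i, word in enumerate(sorted(set(clean_tokens)))}

print(len(set(raw_tokens)), len(word_to_ix))
```

Whether to lowercase is a judgment call for this corpus: it shrinks the vocabulary but also merges proper nouns with any ordinary words that share their spelling.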

In [13]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

In [14]:
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss should decrease with each pass over the training data

KeyboardInterrupt: 

In [47]:
# To get the embedding of a particular token, e.g. "Argos," (the comma is part of the token)
print(model.embeddings.weight[word_to_ix["Argos,"]])

tensor([-0.3862,  0.5477, -0.1582, -0.7363, -0.2339, -0.5388,  0.1579, -0.5973,
        -0.8833,  0.6078], grad_fn=<SelectBackward0>)
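
A single embedding vector is hard to interpret on its own; comparing vectors is usually more informative. The sketch below ranks words by cosine similarity to a target word. The vectors are made-up three-dimensional values for illustration only, not output from the model above, and `nearest_words` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_words(word, vectors, top_k=3):
    """Return the top_k other words ranked by similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up vectors for illustration. With the trained model, a similar dict could
# be built with:
#   {w: model.embeddings.weight[i].detach().tolist() for w, i in word_to_ix.items()}
toy_vectors = {
    "Argos,": [0.9, 0.1, 0.0],
    "Hellas.": [0.8, 0.2, 0.1],
    "merchandise,": [0.0, 0.1, 0.9],
}

print(nearest_words("Argos,", toy_vectors, top_k=2))
```

The same ranking idea carries over to the RAG search step, where the query embedding plays the role of the target word.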


### Sources

+ https://medium.com/@manansuri/a-dummys-guide-to-word2vec-456444f3c673
+ https://medium.com/@mervebdurna/advanced-word-embeddings-word2vec-glove-and-fasttext-26e546ffedbd
+ https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
+ https://www.promptingguide.ai/techniques/rag
