# Chapter 1

**Set up a basic RAG pipeline (BM25/TFIDF + simple QA model)**

In [1]:
import json
import os
import pathlib
from datetime import datetime
from typing import Dict, List

import dotenv
import numpy as np
import wandb
import cohere
from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import TfidfVectorizer


dotenv.load_dotenv()

True

In [2]:
WANDB_ENTITY = "rag-course"
WANDB_PROJECT = "dev"

wandb.require("core")

run = wandb.init(
    entity=WANDB_ENTITY,
    project=WANDB_PROJECT,
    group="Chapter 1",
)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mparambharat[0m ([33mrag-course[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [3]:
# TODO: Remove this once we move to the final project
# documents_artifact = wandb.Artifact(
#     name="wandb_docs",
#     type="dataset",
#     description="W&B Documentation in Markdown format",
#     metadata={
#         "total_files": 380,
#         "date_processed": datetime.now().strftime("%Y-%m-%d"),
#     },
# )

# documents_artifact.add_dir("../data/wandb_docs")
# run.log_artifact(documents_artifact)

## Data ingestion

### Loading the data

In [4]:
documents_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/wandb_docs:latest", type="dataset"
)
data_dir = "../data/wandb_docs"

docs_dir = documents_artifact.download(data_dir)



In [5]:
docs_dir = pathlib.Path(docs_dir)
docs_files = sorted(docs_dir.rglob("*.md"))

print(f"Number of files: {len(docs_files)}\n")
print("First 5 files:\n{files}".format(files="\n".join(map(str, docs_files[:5]))))

Number of files: 380

First 5 files:
../data/wandb_docs/guides/app/features/anon.md
../data/wandb_docs/guides/app/features/custom-charts/intro.md
../data/wandb_docs/guides/app/features/custom-charts/walkthrough.md
../data/wandb_docs/guides/app/features/intro.md
../data/wandb_docs/guides/app/features/notes.md


In [6]:
# Let's look at an example file
print(docs_files[0].read_text())

---
description: Log and visualize data without a W&B account
displayed_sidebar: default
---

# Anonymous Mode

Are you publishing code that you want anyone to be able to run easily? Use Anonymous Mode to let someone run your code, see a W&B dashboard, and visualize results without needing to create a W&B account first.

Allow results to be logged in Anonymous Mode with `wandb.init(`**`anonymous="allow"`**`)`

:::info
**Publishing a paper?** Please [cite W&B](https://docs.wandb.ai/company/academics#bibtex-citation), and if you have questions about how to make your code accessible while using W&B, reach out to us at support@wandb.com.
:::

### How does someone without an account see results?

If someone runs your script and you have to set `anonymous="allow"`:

1. **Auto-create temporary account:** W&B checks for an account that's already signed in. If there's no account, we automatically create a new anonymous account and save that API key for the session.
2. **Log results quickly:** T

In [7]:
# We'll store the files as dictionaries with some content and metadata
data = []
for file in docs_files:
    content = file.read_text()
    data.append(
        {
            "content": content,
            "metadata": {
                "source": str(file.relative_to(docs_dir)),
                "raw_tokens": len(content.split()),
            },
        }
    )
data[:2]

  'metadata': {'source': 'guides/app/features/anon.md', 'raw_tokens': 470}},
 {'content': '---\nslug: /guides/app/features/custom-charts\ndisplayed_sidebar: default\n---\n\nimport Tabs from \'@theme/Tabs\';\nimport TabItem from \'@theme/TabItem\';\n\n# Custom Charts\n\nUse **Custom Charts** to create charts that aren\'t possible right now in the default UI. Log arbitrary tables of data and visualize them exactly how you want. Control details of fonts, colors, and tooltips with the power of [Vega](https://vega.github.io/vega/).\n\n* **What\'s possible**: Read the[ launch announcement →](https://wandb.ai/wandb/posts/reports/Announcing-the-W-B-Machine-Learning-Visualization-IDE--VmlldzoyNjk3Nzg)\n* **Code**: Try a live example in a[ hosted notebook →](https://tiny.cc/custom-charts)\n* **Video**: Watch a quick [walkthrough video →](https://www.youtube.com/watch?v=3-N9OV6bkSM)\n* **Example**: Quick Keras and Sklearn [demo notebook →](https://colab.research.google.com/drive/1g-gNGokPWM2Qbc8p

In [8]:
total_tokens = sum(map(lambda x: x["metadata"]["raw_tokens"], data))
print(f"Total Tokens in dataset: {total_tokens}")

Total Tokens in dataset: 246998


In [9]:
# Let's store the raw data in an artifact for future use and reproducibility
raw_artifact = wandb.Artifact(
    name="raw_data",
    type="dataset",
    description="Raw wandb documentation",
    metadata={
        "total_files": len(data),
        "date_processed": datetime.now().strftime("%Y-%m-%d"),
        "total_raw_tokens": total_tokens,
    },
)
with raw_artifact.new_file("documents.jsonl", mode="w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")
run.log_artifact(raw_artifact)

<Artifact raw_data>
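The artifact above stores one JSON object per line (JSON Lines). As a minimal sketch of why that format round-trips safely even when the content contains newlines, the snippet below writes two toy records to a temporary file and reads them back. The records and the temporary path are illustrative assumptions and are not part of the W&B docs dataset.

```python
import json
import pathlib
import tempfile

# Toy records shaped like the ones built above (values are made up)
records = [
    {"content": "first doc", "metadata": {"source": "a.md", "raw_tokens": 2}},
    {"content": "second\ndoc", "metadata": {"source": "b.md", "raw_tokens": 2}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir) / "documents.jsonl"
    with path.open("w") as f:
        for record in records:
            # json.dumps escapes the newline inside "content", so each
            # record still occupies exactly one physical line
            f.write(json.dumps(record) + "\n")
    loaded = [json.loads(line) for line in path.read_text().splitlines()]

print(loaded == records)
```

Because every record is a single line, later steps can read the file back with `splitlines()` and `json.loads`.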

### Chunking the data

In [10]:
# These are hyperparameters of our ingestion pipeline

CHUNK_SIZE = 300
CHUNK_OVERLAP = 0


def split_into_chunks(
    text: str, chunk_size: int = CHUNK_SIZE, chunk_overlap: int = CHUNK_OVERLAP
) -> List[str]:
    """Split the text into chunks of at most `chunk_size` tokens.

    Tokens here are whitespace-separated words (`str.split`), not model tokens.
    Consecutive chunks share `chunk_overlap` tokens.
    """
    if not 0 <= chunk_overlap < chunk_size:
        # A larger overlap would never advance the window
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk = tokens[start:end]
        chunks.append(" ".join(chunk))
        if end >= len(tokens):
            # The last window already reached the end of the text
            break
        start = end - chunk_overlap
    return chunks
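
To see what the two hyperparameters do, here is a minimal, self-contained sketch of the same sliding-window idea on a toy string. The helper name `sliding_window_chunks` and the toy values are assumptions made for illustration; they are not part of the pipeline.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split whitespace tokens into windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already covers the tail of the text
    return chunks


toy_text = "a b c d e f g h i j"
no_overlap = sliding_window_chunks(toy_text, chunk_size=4, chunk_overlap=0)
with_overlap = sliding_window_chunks(toy_text, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With no overlap each token appears in exactly one chunk. With an overlap of 2, neighbouring chunks repeat two tokens, which costs extra storage but makes it less likely that a sentence is cut off from its context.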

In [11]:
# We'll re-use the raw dataset from the artifact in our previous step


raw_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/raw_data:latest", type="dataset"
)
artifact_dir = raw_artifact.download()
raw_data_file = pathlib.Path(f"{artifact_dir}/documents.jsonl")
raw_data = list(map(json.loads, raw_data_file.read_text().splitlines()))
raw_data[:2]



  'metadata': {'source': 'guides/app/features/anon.md', 'raw_tokens': 470}},
 {'content': '---\nslug: /guides/app/features/custom-charts\ndisplayed_sidebar: default\n---\n\nimport Tabs from \'@theme/Tabs\';\nimport TabItem from \'@theme/TabItem\';\n\n# Custom Charts\n\nUse **Custom Charts** to create charts that aren\'t possible right now in the default UI. Log arbitrary tables of data and visualize them exactly how you want. Control details of fonts, colors, and tooltips with the power of [Vega](https://vega.github.io/vega/).\n\n* **What\'s possible**: Read the[ launch announcement →](https://wandb.ai/wandb/posts/reports/Announcing-the-W-B-Machine-Learning-Visualization-IDE--VmlldzoyNjk3Nzg)\n* **Code**: Try a live example in a[ hosted notebook →](https://tiny.cc/custom-charts)\n* **Video**: Watch a quick [walkthrough video →](https://www.youtube.com/watch?v=3-N9OV6bkSM)\n* **Example**: Quick Keras and Sklearn [demo notebook →](https://colab.research.google.com/drive/1g-gNGokPWM2Qbc8p

In [12]:
chunked_data = []
for doc in raw_data:
    chunks = split_into_chunks(doc["content"])
    for chunk in chunks:
        chunked_data.append(
            {
                "content": chunk,
                "metadata": {
                    "source": doc["metadata"]["source"],
                    "raw_tokens": len(chunk.split()),
                },
            }
        )

### Cleaning the data

In [13]:
# Some of our examples contain special tokens that we need to remove; otherwise they will break the chat.completions API.


def make_text_tokenization_safe(content: str) -> str:
    special_tokens_set = {
        "<|endofprompt|>",
        "<|endoftext|>",
        "<|fim_middle|>",
        "<|fim_prefix|>",
        "<|fim_suffix|>",
    }

    def remove_special_tokens(text: str) -> str:
        """Removes special tokens from the given text.

        Args:
            text: A string representing the text.

        Returns:
            The text with special tokens removed.
        """
        for token in special_tokens_set:
            text = text.replace(token, "")
        return text

    cleaned_content = remove_special_tokens(content)
    return cleaned_content
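
As a quick sanity check of the idea, the sketch below strips the same five special tokens in a single pass with a compiled regular expression. This is an alternative formulation under our own naming (`SPECIAL_TOKEN_PATTERN`, `strip_special_tokens`), not the function used in the rest of the notebook.

```python
import re

SPECIAL_TOKENS = [
    "<|endofprompt|>",
    "<|endoftext|>",
    "<|fim_middle|>",
    "<|fim_prefix|>",
    "<|fim_suffix|>",
]
# re.escape is needed because "|" is a regex metacharacter
SPECIAL_TOKEN_PATTERN = re.compile("|".join(re.escape(t) for t in SPECIAL_TOKENS))


def strip_special_tokens(text: str) -> str:
    """Remove every occurrence of the special tokens listed above."""
    return SPECIAL_TOKEN_PATTERN.sub("", text)


sample = "Hello<|endoftext|> world <|fim_prefix|>!"
cleaned = strip_special_tokens(sample)
print(repr(cleaned))
```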

In [14]:
cleaned_data = []
for doc in chunked_data:
    # Copy the nested metadata dict as well, so that chunked_data is not mutated
    cleaned_doc = {**doc, "metadata": dict(doc["metadata"])}
    cleaned_doc["cleaned_content"] = make_text_tokenization_safe(doc["content"])
    cleaned_doc["metadata"]["cleaned_tokens"] = len(
        cleaned_doc["cleaned_content"].split()
    )
    cleaned_data.append(cleaned_doc)
cleaned_data[:2]

[{'content': '--- description: Log and visualize data without a W&B account displayed_sidebar: default --- # Anonymous Mode Are you publishing code that you want anyone to be able to run easily? Use Anonymous Mode to let someone run your code, see a W&B dashboard, and visualize results without needing to create a W&B account first. Allow results to be logged in Anonymous Mode with `wandb.init(`**`anonymous="allow"`**`)` :::info **Publishing a paper?** Please [cite W&B](https://docs.wandb.ai/company/academics#bibtex-citation), and if you have questions about how to make your code accessible while using W&B, reach out to us at support@wandb.com. ::: ### How does someone without an account see results? If someone runs your script and you have to set `anonymous="allow"`: 1. **Auto-create temporary account:** W&B checks for an account that\'s already signed in. If there\'s no account, we automatically create a new anonymous account and save that API key for the session. 2. **Log results qui

In [15]:
# Again, we'll store the cleaned data in an artifact for future use and reproducibility

total_raw_tokens = sum(map(lambda x: x["metadata"]["raw_tokens"], cleaned_data))
total_cleaned_tokens = sum(map(lambda x: x["metadata"]["cleaned_tokens"], cleaned_data))

chunked_artifact = wandb.Artifact(
    name="chunked_data",
    type="dataset",
    description="Chunked wandb documentation",
    metadata={
        "total_files": len(cleaned_data),
        "date_processed": datetime.now().strftime("%Y-%m-%d"),
        "total_raw_tokens": total_raw_tokens,
        "total_cleaned_tokens": total_cleaned_tokens,
        "chunk_size": CHUNK_SIZE,
        "chunk_overlap": CHUNK_OVERLAP,
    },
)
with chunked_artifact.new_file("documents.jsonl", mode="w") as f:
    for item in cleaned_data:
        f.write(json.dumps(item) + "\n")
run.log_artifact(chunked_artifact)

<Artifact chunked_data>

## Vectorizing the data

**TODO**: Add weave ops and traces in this section

In [16]:
# Now we can re-use the chunked data from the artifact in our previous step

chunked_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/chunked_data:latest", type="dataset"
)
artifact_dir = chunked_artifact.download()
chunked_data_file = pathlib.Path(f"{artifact_dir}/documents.jsonl")
chunked_data = list(map(json.loads, chunked_data_file.read_text().splitlines()))
chunked_data[:2]



[{'content': '--- description: Log and visualize data without a W&B account displayed_sidebar: default --- # Anonymous Mode Are you publishing code that you want anyone to be able to run easily? Use Anonymous Mode to let someone run your code, see a W&B dashboard, and visualize results without needing to create a W&B account first. Allow results to be logged in Anonymous Mode with `wandb.init(`**`anonymous="allow"`**`)` :::info **Publishing a paper?** Please [cite W&B](https://docs.wandb.ai/company/academics#bibtex-citation), and if you have questions about how to make your code accessible while using W&B, reach out to us at support@wandb.com. ::: ### How does someone without an account see results? If someone runs your script and you have to set `anonymous="allow"`: 1. **Auto-create temporary account:** W&B checks for an account that\'s already signed in. If there\'s no account, we automatically create a new anonymous account and save that API key for the session. 2. **Log results qui

In [17]:
# We'll create a simple retriever class to get the most relevant chunks of data for a given query.
# We'll use TF-IDF to vectorize the documents and cosine distance to measure the similarity between the query and the documents.
# Two methods: index_data and search
# index_data will take the data and vectorize it and store the index
# search will take a query and return the most relevant chunks from the index


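class Retriever:
    """TF-IDF retriever that ranks chunks by cosine similarity to the query."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer()
        self.index = None
        self.data = None

    def index_data(self, data):
        """Fit the vectorizer on the cleaned chunks and store the TF-IDF matrix."""
        self.data = data
        docs = [doc["cleaned_content"] for doc in data]
        self.index = self.vectorizer.fit_transform(docs)

    def search(self, query, k=5):
        """Return the k chunks closest to the query, best match first."""
        query_vec = self.vectorizer.transform([query])
        # cdist needs dense arrays; that is acceptable for a small corpus like this one
        cosine_distances = cdist(
            query_vec.toarray(), self.index.toarray(), metric="cosine"
        )[0]
        # The smallest distance corresponds to the highest similarity
        top_k_indices = cosine_distances.argsort()[:k]
        output = []
        for idx in top_k_indices:
            output.append(
                {
                    "source": self.data[idx]["metadata"]["source"],
                    "text": self.data[idx]["cleaned_content"],
                    "score": 1 - cosine_distances[idx],
                }
            )
        return output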
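
To make the scoring less of a black box, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity on a three-document toy corpus. It uses a smoothed IDF and L2 normalization, which to our understanding matches the defaults of `TfidfVectorizer`, but the whitespace tokenization is a simplification and the corpus is made up, so the numbers will not match scikit-learn exactly.

```python
import math
from collections import Counter
from typing import Dict, List

corpus = [
    "log metrics with wandb log",
    "create an anonymous account",
    "log images and videos",
]
tokenized = [doc.lower().split() for doc in corpus]

# Smoothed inverse document frequency: rarer terms get a larger weight
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf(tokens: List[str]) -> Dict[str, float]:
    """Return an L2-normalized TF-IDF vector as a sparse {term: weight} dict."""
    counts = Counter(t for t in tokens if t in idf)  # drop out-of-vocabulary terms
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two unit-length sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


doc_vectors = [tfidf(tokens) for tokens in tokenized]
query_vector = tfidf("how to log metrics".split())
scores = [cosine_similarity(query_vector, vec) for vec in doc_vectors]
ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

The first document ranks highest because it shares both `log` and `metrics` with the query, and `metrics` is the rarer term. A query with no known terms produces an empty vector and a similarity of 0 in this sketch, whereas a cosine distance computed against an all-zero vector can come out as NaN.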

In [18]:
# Let's test with a simple query


retriever = Retriever()
retriever.index_data(chunked_data)

query = "How do I use W&B to log metrics in my training script?"
search_results = retriever.search(query)
for result in search_results:
    print(result)

{'source': 'guides/technical-faq/general.md', 'text': '--- displayed_sidebar: default --- # General ### What does `wandb.init` do to my training process? When `wandb.init()` is called from your training script an API call is made to create a run object on our servers. A new process is started to stream and collect metrics, thereby keeping all threads and logic out of your primary process. Your script runs normally and writes to local files, while the separate process streams them to our servers along with system metrics. You can always turn off streaming by running `wandb off` from your training directory, or setting the `WANDB_MODE` environment variable to `offline`. ### Does your tool track or store training data? You can pass a SHA or other unique identifier to `wandb.config.update(...)` to associate a dataset with a training run. W&B does not store any data unless `wandb.save` is called with the local file name. ### What formula do you use for your smoothing algorithm? We use the s
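
The chapter title also mentions BM25, which this notebook does not implement. For comparison, the following is a minimal sketch of Okapi BM25 scoring in pure Python. The class name `BM25Index`, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the toy corpus are all assumptions chosen for illustration; a production setup would more likely rely on a dedicated library.

```python
import math
from collections import Counter
from typing import List


class BM25Index:
    """Tiny in-memory BM25 index over whitespace tokens (illustrative only)."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1  # controls term-frequency saturation
        self.b = b  # controls document-length normalization
        self.doc_tokens = [doc.lower().split() for doc in docs]
        self.doc_counts = [Counter(tokens) for tokens in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(self.doc_tokens)
        n_docs = len(docs)
        doc_freq = Counter(term for counts in self.doc_counts for term in counts)
        # Non-negative IDF variant: ln(1 + (N - df + 0.5) / (df + 0.5))
        self.idf = {
            term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query: str, doc_id: int) -> float:
        counts = self.doc_counts[doc_id]
        length_norm = 1 - self.b + self.b * len(self.doc_tokens[doc_id]) / self.avg_len
        total = 0.0
        for term in query.lower().split():
            tf = counts.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return total

    def search(self, query: str, k: int = 5) -> List[int]:
        """Return the indices of the k best-scoring documents."""
        scores = [self.score(query, i) for i in range(len(self.doc_tokens))]
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


bm25_docs = [
    "log metrics with wandb log",
    "create an anonymous account",
    "log images and videos",
]
bm25 = BM25Index(bm25_docs)
top = bm25.search("log metrics", k=2)
print(top)
```

Unlike plain TF-IDF, BM25 lets repeated occurrences of a term saturate and penalizes long documents, which often helps when chunk lengths vary.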

## Generating a response

**TODO**: Add weave ops and traces in this section

In [19]:
# Now we are ready to generate a response grounded on the documentation.


from typing import Any


class ResponseGenerator:
    def __init__(self, model: str, prompt: str):
        self.client = cohere.Client(api_key=os.environ["CO_API_KEY"])
        self.model = model
        self.prompt = prompt

    # @weave.op()
    def generate_response(self, query: str, context: List[Dict[str, Any]]) -> str:
        # Pass only the fields the model needs; the retrieval score is dropped
        documents = [
            {"source": item["source"], "text": item["text"]} for item in context
        ]
        response = self.client.chat(
            preamble=self.prompt,
            message=query,
            model=self.model,
            documents=documents,
            temperature=0.1,
            max_tokens=2000,
        )
        return response.text

In [20]:
PROMPT = "Answer the following question about W&B. Provide a helpful and complete answer based only on the provided documents."

In [21]:
response_generator = ResponseGenerator(model="command-r", prompt=PROMPT)
answer = response_generator.generate_response(query, search_results)
print(answer)

You can use the W&B API to log metrics in your training script. First, call `wandb.init()` in your script to create a run object on W&B servers and start a process to stream and collect metrics. Then, write your script as normal; metrics will be saved locally and streamed to the servers asynchronously. To log a metric, call `wandb.log()` with the metric name and value as a key-value pair. For example, `wandb.log({"epoch": epoch, "val_acc": 0.94})` would log the accuracy after each epoch.

You can also log other data types, such as pandas DataFrames, images, and videos, by using the appropriate data type wrappers from the W&B library.

Remember that if you're running your script in a Jupyter or Google Colab notebook, you'll need to call `wandb.finish()` at the end of your training to finalise the W&B run. You can view your logged metrics and data in the W&B Dashboard.


In [22]:
class RAGPipeline:
    def __init__(self, retriever: Retriever, response_generator: ResponseGenerator, top_k: int = 5):
        self.retriever = retriever
        self.response_generator = response_generator
        self.top_k = top_k

    def __call__(self, query: str):
        context = self.retriever.search(query, self.top_k)
        return self.response_generator.generate_response(query, context)
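
Because the pipeline only relies on a `search` method and a `generate_response` method, its wiring can be exercised without an index or an API key. The sketch below does this with two hypothetical stand-ins, `KeywordRetriever` and `EchoGenerator`, plus a local `SimplePipeline` class that mirrors the retrieve-then-generate shape. None of these names exist in the notebook; they are assumptions for illustration only.

```python
from typing import Dict, List


class KeywordRetriever:
    """Hypothetical stand-in retriever: ranks documents by shared query words."""

    def __init__(self, docs: List[Dict[str, str]]):
        self.docs = docs

    def search(self, query: str, k: int = 5) -> List[Dict[str, str]]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda doc: len(query_words & set(doc["text"].lower().split())),
            reverse=True,
        )
        return ranked[:k]


class EchoGenerator:
    """Hypothetical stand-in generator: makes no API call, only lists its sources."""

    def generate_response(self, query: str, context: List[Dict[str, str]]) -> str:
        sources = ", ".join(item["source"] for item in context)
        return f"Q: {query} | grounded on: {sources}"


class SimplePipeline:
    """Retrieve the top_k documents, then hand them to the generator."""

    def __init__(self, retriever, response_generator, top_k: int = 5):
        self.retriever = retriever
        self.response_generator = response_generator
        self.top_k = top_k

    def __call__(self, query: str) -> str:
        context = self.retriever.search(query, self.top_k)
        return self.response_generator.generate_response(query, context)


toy_docs = [
    {"source": "log.md", "text": "use wandb log to log metrics"},
    {"source": "anon.md", "text": "anonymous mode needs no account"},
]
offline_pipeline = SimplePipeline(KeywordRetriever(toy_docs), EchoGenerator(), top_k=1)
offline_answer = offline_pipeline("how do I log metrics")
print(offline_answer)
```

Swapping in stand-ins like these is a cheap way to test the plumbing before spending API calls on the real generator.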

In [23]:
rag_pipeline = RAGPipeline(retriever, response_generator, top_k=10)
response = rag_pipeline(query=query)
print(response)

You can log metrics in your training script by incorporating the following lines of code:

```python
import wandb
run = wandb.init()

# Log metrics inside your training loop
for epoch in range(wandb.config.epochs):
    for batch in dataloader:
        loss, accuracy = model.training_step()
        wandb.log({"accuracy": accuracy, "loss": loss})
```

This will visualize your model's performance during training. If you want to log metrics on different time scales, make sure to include your indices in the logs, such as 'batch' or 'epoch', so that they can be plotted on separate charts.

Remember that calling `wandb.log` writes a line to a local file that will later be synced to the W&B cloud. You can use different data types such as strings, integers, floats, tensors, and dictionaries, and even log media such as images and videos.


In [None]:
# TODO: Add exercise for chapter 1.