# Chapter 1

**Set up a basic RAG pipeline (BM25/TFIDF + simple QA model)**

In [1]:
import json
import os
import pathlib
from datetime import datetime
from typing import Dict, List

import dotenv
import numpy as np
import wandb
from openai import OpenAI
from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import TfidfVectorizer


dotenv.load_dotenv()

True

In [2]:
WANDB_ENTITY = "rag-course"
WANDB_PROJECT = "dev"

wandb.require("core")

run = wandb.init(
    entity=WANDB_ENTITY,
    project=WANDB_PROJECT,
    group="Chapter 1",
)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: parambharat (rag-course). Use `wandb login --relogin` to force relogin


In [3]:
# TODO: Remove this once we move to the final project
# documents_artifact = wandb.Artifact(
#     name="wandb_docs",
#     type="dataset",
#     description="W&B Documentation in Markdown format",
#     metadata={
#         "total_files": 380,
#         "date_processed": datetime.now().strftime("%Y-%m-%d"),
#     },
# )

# documents_artifact.add_dir("../data/wandb_docs")
# run.log_artifact(documents_artifact)

## Data ingestion

### Loading the data

In [4]:
documents_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/wandb_docs:latest", type="dataset"
)
data_dir = "../data/wandb_docs"

docs_dir = documents_artifact.download(data_dir)



In [5]:
docs_dir = pathlib.Path(docs_dir)
docs_files = sorted(docs_dir.rglob("*.md"))

print(f"Number of files: {len(docs_files)}\n")
print("First 5 files:\n{files}".format(files="\n".join(map(str, docs_files[:5]))))

Number of files: 380

First 5 files:
../data/wandb_docs/guides/app/features/anon.md
../data/wandb_docs/guides/app/features/custom-charts/intro.md
../data/wandb_docs/guides/app/features/custom-charts/walkthrough.md
../data/wandb_docs/guides/app/features/intro.md
../data/wandb_docs/guides/app/features/notes.md


In [6]:
# Let's look at an example file
print(docs_files[0].read_text())

---
description: Log and visualize data without a W&B account
displayed_sidebar: default
---

# Anonymous Mode

Are you publishing code that you want anyone to be able to run easily? Use Anonymous Mode to let someone run your code, see a W&B dashboard, and visualize results without needing to create a W&B account first.

Allow results to be logged in Anonymous Mode with `wandb.init(`**`anonymous="allow"`**`)`

:::info
**Publishing a paper?** Please [cite W&B](https://docs.wandb.ai/company/academics#bibtex-citation), and if you have questions about how to make your code accessible while using W&B, reach out to us at support@wandb.com.
:::

### How does someone without an account see results?

If someone runs your script and you have to set `anonymous="allow"`:

1. **Auto-create temporary account:** W&B checks for an account that's already signed in. If there's no account, we automatically create a new anonymous account and save that API key for the session.
2. **Log results quickly:** T

In [7]:
# We'll store the files as dictionaries with some content and metadata
data = []
for file in docs_files:
    content = file.read_text()
    data.append(
        {
            "content": content,
            "metadata": {
                "source": str(file.relative_to(docs_dir)),
                "raw_tokens": len(content.split()),
            },
        }
    )
data[:2]

  'metadata': {'source': 'guides/app/features/anon.md', 'raw_tokens': 470}},
 {'content': '---\nslug: /guides/app/features/custom-charts\ndisplayed_sidebar: default\n---\n\nimport Tabs from \'@theme/Tabs\';\nimport TabItem from \'@theme/TabItem\';\n\n# Custom Charts\n\nUse **Custom Charts** to create charts that aren\'t possible right now in the default UI. Log arbitrary tables of data and visualize them exactly how you want. Control details of fonts, colors, and tooltips with the power of [Vega](https://vega.github.io/vega/).\n\n* **What\'s possible**: Read the[ launch announcement →](https://wandb.ai/wandb/posts/reports/Announcing-the-W-B-Machine-Learning-Visualization-IDE--VmlldzoyNjk3Nzg)\n* **Code**: Try a live example in a[ hosted notebook →](https://tiny.cc/custom-charts)\n* **Video**: Watch a quick [walkthrough video →](https://www.youtube.com/watch?v=3-N9OV6bkSM)\n* **Example**: Quick Keras and Sklearn [demo notebook →](https://colab.research.google.com/drive/1g-gNGokPWM2Qbc8p

In [8]:
total_tokens = sum(map(lambda x: x["metadata"]["raw_tokens"], data))
print(f"Total Tokens in dataset: {total_tokens}")

Total Tokens in dataset: 246998


In [9]:
# Let's store the raw data in an artifact for future use and reproducibility
raw_artifact = wandb.Artifact(
    name="raw_data",
    type="dataset",
    description="Raw wandb documentation",
    metadata={
        "total_files": len(data),
        "date_processed": datetime.now().strftime("%Y-%m-%d"),
        "total_raw_tokens": total_tokens,
    },
)
with raw_artifact.new_file("documents.jsonl", mode="w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")
run.log_artifact(raw_artifact)

<Artifact raw_data>
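
The `documents.jsonl` file written above stores one JSON object per line. As a minimal sketch of that round trip, independent of W&B, the snippet below serializes two made-up records to an in-memory buffer and parses them back. The record contents and the helper names `to_jsonl` and `from_jsonl` are illustrative assumptions, not part of the notebook.

```python
import io
import json
from typing import Dict, List


def to_jsonl(records: List[Dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_jsonl(text: str) -> List[Dict]:
    """Parse JSON Lines text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical records shaped like the notebook's documents.
records = [
    {"content": "# Title\n\nBody text", "metadata": {"source": "a.md", "raw_tokens": 4}},
    {"content": "Second doc", "metadata": {"source": "b.md", "raw_tokens": 2}},
]

buffer = io.StringIO()
buffer.write(to_jsonl(records))
restored = from_jsonl(buffer.getvalue())
print(restored == records)
```

Newlines inside `content` are escaped by `json.dumps`, so each record stays on a single line even when the Markdown source spans many lines.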

### Chunking the data

In [10]:
# These are hyperparameters of our ingestion pipeline

CHUNK_SIZE = 500
CHUNK_OVERLAP = 0


def split_into_chunks(
    text: str, chunk_size: int = CHUNK_SIZE, chunk_overlap: int = CHUNK_OVERLAP
) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace-delimited tokens.

    Consecutive chunks overlap by `chunk_overlap` tokens. Tokens here come from
    `str.split()`, so counts only approximate a model tokenizer's token counts.
    """
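    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(" ".join(tokens[start:end]))
        if end >= len(tokens):
            break  # the final chunk already reaches the end of the text
        start = end - chunk_overlap
    return chunks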
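
To see how `chunk_size` and `chunk_overlap` interact, here is a small sketch of a sliding window over a toy token list. The function name `sliding_chunks` and the sample tokens are assumptions made for illustration; the sketch returns token lists rather than joined strings so the shared tokens are easy to spot.

```python
from typing import List


def sliding_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Return windows of at most chunk_size tokens; neighbours share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end
    return windows


tokens = "a b c d e f g h i j".split()
for window in sliding_chunks(tokens, chunk_size=4, chunk_overlap=1):
    print(window)
# ['a', 'b', 'c', 'd']
# ['d', 'e', 'f', 'g']
# ['g', 'h', 'i', 'j']
```

With `chunk_overlap=0`, as in this chapter, the windows are disjoint and the last one may be shorter than `chunk_size`.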

In [11]:
# We'll re-use the raw dataset from the artifact in our previous step


raw_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/raw_data:latest", type="dataset"
)
artifact_dir = raw_artifact.download()
raw_data_file = pathlib.Path(f"{artifact_dir}/documents.jsonl")
raw_data = list(map(json.loads, raw_data_file.read_text().splitlines()))
raw_data[:2]



  'metadata': {'source': 'guides/app/features/anon.md', 'raw_tokens': 470}},
 {'content': '---\nslug: /guides/app/features/custom-charts\ndisplayed_sidebar: default\n---\n\nimport Tabs from \'@theme/Tabs\';\nimport TabItem from \'@theme/TabItem\';\n\n# Custom Charts\n\nUse **Custom Charts** to create charts that aren\'t possible right now in the default UI. Log arbitrary tables of data and visualize them exactly how you want. Control details of fonts, colors, and tooltips with the power of [Vega](https://vega.github.io/vega/).\n\n* **What\'s possible**: Read the[ launch announcement →](https://wandb.ai/wandb/posts/reports/Announcing-the-W-B-Machine-Learning-Visualization-IDE--VmlldzoyNjk3Nzg)\n* **Code**: Try a live example in a[ hosted notebook →](https://tiny.cc/custom-charts)\n* **Video**: Watch a quick [walkthrough video →](https://www.youtube.com/watch?v=3-N9OV6bkSM)\n* **Example**: Quick Keras and Sklearn [demo notebook →](https://colab.research.google.com/drive/1g-gNGokPWM2Qbc8p

In [12]:
chunked_data = []
for doc in raw_data:
    chunks = split_into_chunks(doc["content"])
    for chunk in chunks:
        chunked_data.append(
            {
                "content": chunk,
                "metadata": {
                    "source": doc["metadata"]["source"],
                    "raw_tokens": len(chunk.split()),
                },
            }
        )

### Cleaning the data

In [13]:
# Some of our documents contain special tokens that we need to remove; otherwise they can break the chat.completions API.


def make_text_tokenization_safe(content: str) -> str:
    special_tokens_set = {
        "<|endofprompt|>",
        "<|endoftext|>",
        "<|fim_middle|>",
        "<|fim_prefix|>",
        "<|fim_suffix|>",
    }

    def remove_special_tokens(text: str) -> str:
        """Removes special tokens from the given text.

        Args:
            text: A string representing the text.

        Returns:
            The text with special tokens removed.
        """
        for token in special_tokens_set:
            text = text.replace(token, "")
        return text

    cleaned_content = remove_special_tokens(content)
    return cleaned_content
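
As a compact illustration of the same cleaning step, the sketch below removes the special tokens in a single pass with a regular expression. The token list is copied from the function above; the names `SPECIAL_TOKEN_PATTERN` and `strip_special_tokens` and the sample string are assumptions for this example only.

```python
import re

# Same special tokens the notebook strips.
SPECIAL_TOKENS = [
    "<|endofprompt|>",
    "<|endoftext|>",
    "<|fim_middle|>",
    "<|fim_prefix|>",
    "<|fim_suffix|>",
]

# re.escape is needed because "|" is a regex metacharacter.
SPECIAL_TOKEN_PATTERN = re.compile("|".join(re.escape(token) for token in SPECIAL_TOKENS))


def strip_special_tokens(text: str) -> str:
    """Delete every occurrence of a special token from text."""
    return SPECIAL_TOKEN_PATTERN.sub("", text)


sample = "Prompt text<|endoftext|> continues <|fim_prefix|>here"
print(strip_special_tokens(sample))
# Prompt text continues here
```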

In [14]:
cleaned_data = []
for doc in chunked_data:
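    # Copy the nested metadata too, so we don't mutate the entries in chunked_data
    cleaned_doc = {**doc, "metadata": dict(doc["metadata"])}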
    cleaned_doc["cleaned_content"] = make_text_tokenization_safe(doc["content"])
    cleaned_doc["metadata"]["cleaned_tokens"] = len(
        cleaned_doc["cleaned_content"].split()
    )
    cleaned_data.append(cleaned_doc)
cleaned_data[:2]

  'metadata': {'source': 'guides/app/features/anon.md',
   'raw_tokens': 470,
   'cleaned_tokens': 470},
 {'content': '--- slug: /guides/app/features/custom-charts displayed_sidebar: default --- import Tabs from \'@theme/Tabs\'; import TabItem from \'@theme/TabItem\'; # Custom Charts Use **Custom Charts** to create charts that aren\'t possible right now in the default UI. Log arbitrary tables of data and visualize them exactly how you want. Control details of fonts, colors, and tooltips with the power of [Vega](https://vega.github.io/vega/). * **What\'s possible**: Read the[ launch announcement →](https://wandb.ai/wandb/posts/reports/Announcing-the-W-B-Machine-Learning-Visualization-IDE--VmlldzoyNjk3Nzg) * **Code**: Try a live example in a[ hosted notebook →](https://tiny.cc/custom-charts) * **Video**: Watch a quick [walkthrough video →](https://www.youtube.com/watch?v=3-N9OV6bkSM) * **Example**: Quick Keras and Sklearn [demo notebook →](https://colab.research.google.com/drive/1g-gNGok

In [15]:
# Again, we'll store the cleaned data in an artifact for future use and reproducibility

total_raw_tokens = sum(map(lambda x: x["metadata"]["raw_tokens"], cleaned_data))
total_cleaned_tokens = sum(map(lambda x: x["metadata"]["cleaned_tokens"], cleaned_data))

chunked_artifact = wandb.Artifact(
    name="chunked_data",
    type="dataset",
    description="Chunked wandb documentation",
    metadata={
        "total_chunks": len(cleaned_data),
        "date_processed": datetime.now().strftime("%Y-%m-%d"),
        "total_raw_tokens": total_raw_tokens,
        "total_cleaned_tokens": total_cleaned_tokens,
        "chunk_size": CHUNK_SIZE,
        "chunk_overlap": CHUNK_OVERLAP,
    },
)
with chunked_artifact.new_file("documents.jsonl", mode="w") as f:
    for item in cleaned_data:
        f.write(json.dumps(item) + "\n")
run.log_artifact(chunked_artifact)

<Artifact chunked_data>

## Vectorizing the data

**TODO**: Add weave ops and traces in this section

In [16]:
# Now we can re-use the chunked data from the artifact in our previous step

chunked_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/chunked_data:latest", type="dataset"
)
artifact_dir = chunked_artifact.download()
chunked_data_file = pathlib.Path(f"{artifact_dir}/documents.jsonl")
chunked_data = list(map(json.loads, chunked_data_file.read_text().splitlines()))
chunked_data[:2]



  'metadata': {'source': 'guides/app/features/anon.md',
   'raw_tokens': 470,
   'cleaned_tokens': 470},
 {'content': '--- slug: /guides/app/features/custom-charts displayed_sidebar: default --- import Tabs from \'@theme/Tabs\'; import TabItem from \'@theme/TabItem\'; # Custom Charts Use **Custom Charts** to create charts that aren\'t possible right now in the default UI. Log arbitrary tables of data and visualize them exactly how you want. Control details of fonts, colors, and tooltips with the power of [Vega](https://vega.github.io/vega/). * **What\'s possible**: Read the[ launch announcement →](https://wandb.ai/wandb/posts/reports/Announcing-the-W-B-Machine-Learning-Visualization-IDE--VmlldzoyNjk3Nzg) * **Code**: Try a live example in a[ hosted notebook →](https://tiny.cc/custom-charts) * **Video**: Watch a quick [walkthrough video →](https://www.youtube.com/watch?v=3-N9OV6bkSM) * **Example**: Quick Keras and Sklearn [demo notebook →](https://colab.research.google.com/drive/1g-gNGok

In [17]:
# We'll create a simple retriever class to get the most relevant chunks of data for a given query.
# We'll use TF-IDF to vectorize the documents and cosine distance to measure the similarity between the query and the documents.
# Two methods: index_data and search
# index_data will take the data and vectorize it and store the index
# search will take a query and return the most relevant chunks from the index


class Retriever:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()
        self.index = None
        self.data = None

    def index_data(self, data):
        self.data = data
        docs = [doc["cleaned_content"] for doc in data]
        self.index = self.vectorizer.fit_transform(docs)

    def search(self, query, k=5):
        query_vec = self.vectorizer.transform([query])
        # cdist needs dense inputs; toarray() returns a plain ndarray rather than np.matrix
        cosine_distances = cdist(
            query_vec.toarray(), self.index.toarray(), metric="cosine"
        )[0]
        top_k_indices = cosine_distances.argsort()[:k]
        output = []
        for idx in top_k_indices:
            output.append(
                {
                    "source": self.data[idx]["metadata"]["source"],
                    "text": self.data[idx]["cleaned_content"],
                    "score": 1 - cosine_distances[idx],
                }
            )
        return output
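
To make the scoring inside `Retriever` less of a black box, here is a pure-Python sketch of TF-IDF weighting and cosine similarity on a toy corpus. It follows the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, with L2 normalisation, but it tokenizes with a plain lowercase split, so its scores are not expected to match `TfidfVectorizer` exactly. All function names and the three sample documents are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace split; TfidfVectorizer uses a regex tokenizer instead.
    return text.lower().split()


def fit_idf(docs: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """L2-normalised TF-IDF weights; terms outside the fitted vocabulary are dropped."""
    counts = Counter(term for term in tokenize(text) if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


docs = [
    "log metrics with wandb",
    "download an artifact with wandb",
    "sweeps tune hyperparameters",
]
idf = fit_idf(docs)
doc_vectors = [tfidf_vector(doc, idf) for doc in docs]
query_vector = tfidf_vector("how do I download an artifact", idf)

scores = [cosine_similarity(query_vector, vec) for vec in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
# download an artifact with wandb
```

Query words that never appear in the corpus ("how", "do", "I") contribute nothing, which is why a document sharing no vocabulary with the query scores exactly zero.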

In [19]:
# Let's test with a simple query


retriever = Retriever()
retriever.index_data(chunked_data)

query = "How do I get get started with wandb?"
search_results = retriever.search(query)
for result in search_results:
    print(result)

{'source': 'ref/cli/wandb-artifact/wandb-artifact-get.md', 'text': '# wandb artifact get **Usage** `wandb artifact get [OPTIONS] PATH` **Summary** Download an artifact from wandb **Options** | **Option** | **Description** | | :--- | :--- | | --root | The directory you want to download the artifact to | | --type | The type of artifact you are downloading |', 'score': 0.21027940027251169}
{'source': 'guides/artifacts/artifacts-faqs.md', 'text': 'One effective pattern for logging models in a [sweep](../sweeps/intro.md) is to have a model artifact for the sweep, where the versions will correspond to different runs from the sweep. More concretely, you would have: ```python wandb.Artifact(name="sweep_name", type="model") ``` ### How do I find an artifact from the best run in a sweep? You can use the following code to retrieve the artifacts associated with the best performing run in a sweep: ```python api = wandb.Api() sweep = api.sweep("entity/project/sweep_id") runs = sorted(sweep.runs, key
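
The chapter title also mentions BM25, although the notebook itself only implements TF-IDF. As a rough sketch of that alternative, the function below scores documents with a commonly used Okapi BM25 variant. The defaults `k1=1.5` and `b=0.75`, the whitespace tokenization, the function name `bm25_scores`, and the toy documents are all assumptions for illustration; the sketch also assumes a non-empty corpus.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each document for the query (lowercase whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        # Longer-than-average documents are penalised; b controls how strongly.
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.lower().split():
            tf = term_counts.get(term, 0)
            if tf == 0:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates: repeated occurrences add less and less.
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "wandb artifact get downloads an artifact",
    "quickstart guide to get started with wandb",
    "sweeps tune hyperparameters",
]
scores = bm25_scores("get started with wandb", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
# quickstart guide to get started with wandb
```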

## Generating a response

**TODO**: Add weave ops and traces in this section

In [20]:
# Now we are ready to generate a response grounded on the documentation.


class ResponseGenerator:
    def __init__(self, model: str, prompt: str):
        self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        self.model = model
        self.prompt = prompt

    def generate_context(self, context: List[Dict[str, object]]) -> str:
        return "\n".join(
            [f"Source: {item['source']}\nText: {item['text']}\n\n" for item in context]
        )

    # @weave.op()

    def generate_response(self, query: str, context: List[Dict[str, object]]) -> str:
        context_text = self.generate_context(context)
        system_message = {
            "role": "system",
            "content": self.prompt.format(context=context_text),
        }
        user_message = {"role": "user", "content": f"Question: {query}\n\nAnswer:"}
        response = self.client.chat.completions.create(
            model=self.model, messages=[system_message, user_message]
        )
        return response.choices[0].message.content

In [21]:
PROMPT = (
    "You are a helpful customer support assistant that can answer questions about W&B\n\n"
    "Your answers must be based only on the provided context.\n\n"
    "<context>\n{context}\n</context>"
)
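
To inspect what the model actually receives, the sketch below assembles the system and user messages the same way `ResponseGenerator` does, without calling any API. The template is copied from the cell above; the `build_messages` helper and the `fake_context` entry are hypothetical names and data used only for this illustration.

```python
from typing import Dict, List

# Copy of the notebook's prompt template.
PROMPT_TEMPLATE = (
    "You are a helpful customer support assistant that can answer questions about W&B\n\n"
    "Your answers must be based only on the provided context.\n\n"
    "<context>\n{context}\n</context>"
)


def build_messages(query: str, context: List[Dict[str, object]]) -> List[Dict[str, str]]:
    """Assemble the system and user messages without calling any API."""
    context_text = "\n".join(
        f"Source: {item['source']}\nText: {item['text']}\n\n" for item in context
    )
    return [
        {"role": "system", "content": PROMPT_TEMPLATE.format(context=context_text)},
        {"role": "user", "content": f"Question: {query}\n\nAnswer:"},
    ]


# Hypothetical retrieval result, shaped like the output of Retriever.search.
fake_context = [{"source": "quickstart.md", "text": "Run wandb.login() first.", "score": 0.4}]
for message in build_messages("How do I log in?", fake_context):
    print(f"[{message['role']}]\n{message['content']}\n")
```

Because `str.format` only parses the template, curly braces inside retrieved text pass through untouched, but any literal braces added to the template itself would need to be doubled.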

In [22]:
response_generator = ResponseGenerator(model="gpt-3.5-turbo", prompt=PROMPT)
answer = response_generator.generate_response(query, search_results)
print(answer)

To get started with W&B, you can begin by installing the W&B library in your environment using `!pip install wandb -qqq`. After installation, you can link your account by invoking `wandb.login()`. Then, set up your experiment and save hyperparameters using `wandb.init()`. For more detailed steps, you can refer to the official W&B documentation or guides available on their website.


In [23]:
class RAGPipeline:
    def __init__(self, retriever: Retriever, response_generator: ResponseGenerator, top_k: int = 5):
        self.retriever = retriever
        self.response_generator = response_generator
        self.top_k = top_k

    def __call__(self, query: str):
        context = self.retriever.search(query, self.top_k)
        return self.response_generator.generate_response(query, context)
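
Because `RAGPipeline` only relies on a `search` method and a `generate_response` method, the flow can be exercised without an index or an API key. The sketch below wires two stand-in components into the same retrieve-then-generate pattern; `StubRetriever`, `StubGenerator`, and `Pipeline` are hypothetical classes written for this example and are not part of the notebook.

```python
from typing import Dict, List


class StubRetriever:
    """Stand-in retriever that returns canned results instead of searching an index."""

    def search(self, query: str, k: int = 5) -> List[Dict[str, object]]:
        results = [
            {"source": f"doc_{i}.md", "text": f"passage {i}", "score": 1.0 / (i + 1)}
            for i in range(10)
        ]
        return results[:k]


class StubGenerator:
    """Stand-in generator that reports what it received instead of calling an LLM."""

    def generate_response(self, query: str, context: List[Dict[str, object]]) -> str:
        sources = ", ".join(str(item["source"]) for item in context)
        return f"Q: {query} | sources: {sources}"


class Pipeline:
    """Same retrieve-then-generate flow as RAGPipeline, redefined so the sketch runs alone."""

    def __init__(self, retriever, response_generator, top_k: int = 5):
        self.retriever = retriever
        self.response_generator = response_generator
        self.top_k = top_k

    def __call__(self, query: str) -> str:
        context = self.retriever.search(query, self.top_k)
        return self.response_generator.generate_response(query, context)


pipeline = Pipeline(StubRetriever(), StubGenerator(), top_k=2)
print(pipeline("How do I get started?"))
# Q: How do I get started? | sources: doc_0.md, doc_1.md
```

Swapping stubs for the real components is one way to check that `top_k` is forwarded correctly before spending API calls.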

In [24]:
rag_pipeline = RAGPipeline(retriever, response_generator, top_k=5)
rag_pipeline("How do I get get started with wandb?")

'To get started with W&B, you can begin by installing the W&B library using `!pip install wandb -qqq` in your Jupyter notebook. Then, link your account by importing `wandb` and calling `wandb.login()`. Next, set up your experiment and save hyperparameters using `wandb.init`. Finally, you can start tracking your runs and visualizing the results directly in your notebook.'

In [None]:
# TODO: Add exercise for chapter 1.