# A Comprehensive Guide for Developing and Serving RAG Applications in Production (Part 1)

- GitHub repository: https://github.com/ray-project/llm-applications
- Anyscale Endpoints: https://endpoints.anyscale.com/
- Ray documentation: https://docs.ray.io/

In this guide, we will learn how to:
- 💻 Develop a retrieval augmented generation (RAG) based LLM application.
- 🚀 Scale the major components (embed, index, serve, etc.) in our application.
- ✅ Evaluate different configurations of our application to optimize for both per-component (ex. retrieval_score) and overall performance (quality_score).
- 🔀 Implement a hybrid routing approach that closes the gap between open-source and closed-source LLMs.
- 📦 Serve the application in a highly available and scalable manner.

# Overview

Large language models (LLMs) have undoubtedly changed the way we interact with information. However, they come with their fair share of limitations as to what we can ask of them. They are aware of the information that they've been trained on and can use that to extend their reasoning but fall short when we require them to know information beyond that. Retrieval augmented generation (RAG) based LLM applications address this exact issue and extend the utility of LLMs and their generative reasoning abilities to our unique datasets.

In this guide, we're going to build an AI assistant that will help answer questions about [Ray](https://github.com/ray-project/ray). The goal here is to make it easier for developers to adopt Ray but also, as we'll see in this guide, to help improve our Ray documentation itself. Our application involves many moving pieces (embedding model, context parameters, the LLM itself, etc.) and so it's important that we experiment with different configurations to optimize for the best quality responses. But it's non-trivial to evaluate and quantitatively compare different configurations for a generative task.

**Note**: We'll be experimenting with different LLMs (OpenAI, Llama, etc.) in this guide. You will need [OpenAI credentials](https://platform.openai.com/account/api-keys) to access [ChatGPT models](https://platform.openai.com/docs/models/) and [Anyscale Endpoints](https://endpoints.anyscale.com/) to access OSS LLMs.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> overall application view.

## Set up

We're going to start by setting up our base imports, directories and initializing Ray with credentials. We'll be using [Ray](https://docs.ray.io/) to easily scale our workloads with minimal changes to our code.

In [1]:
import os
import openai
from pathlib import Path
from pprint import pprint
import ray
from tqdm import tqdm

In [2]:
import sys; sys.path.append("..")
import warnings; warnings.filterwarnings("ignore")
from dotenv import load_dotenv; load_dotenv()

True

We're going to define several directories where we'll store artifacts such as our downloaded data and experiment results. **Note**: if you cloned the [repository](https://github.com/ray-project/llm-applications), you will notice an existing `experiments` directory. You can change `EXPERIMENTS_DIR` to a new name or keep the same name to overwrite our previous experiments' results.

In [3]:
# Directories
EFS_DIR = Path("/efs/shared_storage/goku")
ROOT_DIR = Path(os.getcwd()).parent
EXPERIMENTS_DIR = Path(ROOT_DIR, "experiments")
print (f"EFS_DIR: {EFS_DIR}")
print (f"ROOT_DIR: {ROOT_DIR}")
print (f"EXPERIMENTS_DIR: {EXPERIMENTS_DIR}")

EFS_DIR: /efs/shared_storage/goku
ROOT_DIR: /home/ray/ray-assistant
EXPERIMENTS_DIR: /home/ray/ray-assistant/experiments


We're also going to initialize Ray with some required credentials for our application, such as our [OpenAI](https://platform.openai.com/docs/models/gpt-4) credentials (for ChatGPT models), [Anyscale Endpoints](https://endpoints.anyscale.com/) credentials (for OSS LLMs like Llama-2) and database connection credentials.
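
Because the cell below reads every credential with `os.environ[...]`, a missing variable surfaces as a bare `KeyError`. A minimal sketch of a pre-flight check, assuming the five variable names used in this guide (the helper `missing_env_vars` is hypothetical and not part of the repository):

```python
import os

# Variable names taken from the ray.init(...) call in this guide.
REQUIRED_ENV_VARS = [
    "OPENAI_API_BASE",
    "OPENAI_API_KEY",
    "ANYSCALE_API_BASE",
    "ANYSCALE_API_KEY",
    "DB_CONNECTION_STRING",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping so the check is easy to try out.
example_env = {"OPENAI_API_KEY": "placeholder", "DB_CONNECTION_STRING": ""}
print(missing_env_vars(REQUIRED_ENV_VARS, env=example_env))
```

Calling `missing_env_vars(REQUIRED_ENV_VARS)` without `env` checks the real process environment, which is handy right after `load_dotenv()`.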

In [4]:
# Credentials
ray.init(runtime_env={"env_vars": {
    "OPENAI_API_BASE": os.environ["OPENAI_API_BASE"],
    "OPENAI_API_KEY": os.environ["OPENAI_API_KEY"], 
    "ANYSCALE_API_BASE": os.environ["ANYSCALE_API_BASE"],
    "ANYSCALE_API_KEY": os.environ["ANYSCALE_API_KEY"],
    "DB_CONNECTION_STRING": os.environ["DB_CONNECTION_STRING"],
}})

2023-09-05 11:59:06,933	INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 10.0.7.108:6379...
2023-09-05 11:59:06,947	INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at https://session-yn5cwtau135l5cajlbkzrdyqqp.i.anyscaleuserdata-staging.com
2023-09-05 11:59:06,950	INFO packaging.py:346 -- Pushing file package 'gcs://_ray_pkg_48527ac2c6151cc0588558fb7a9d8c9e.zip' (0.25MiB) to Ray cluster...
2023-09-05 11:59:06,952	INFO packaging.py:359 -- Successfully pushed file package 'gcs://_ray_pkg_48527ac2c6151cc0588558fb7a9d8c9e.zip'.


Python version: 3.8.13
Ray version: 2.6.3
Dashboard: http://session-yn5cwtau135l5cajlbkzrdyqqp.i.anyscaleuserdata-staging.com


## Data

### Load data

We need to first download the [Ray documentation](https://docs.ray.io/) to a local directory / elastic file system:
```bash
export DOCS_PATH=/desired/output/directory
wget -e robots=off --recursive --no-clobber --page-requisites \
  --html-extension --convert-links --restrict-file-names=windows \
  --domains docs.ray.io --no-parent --accept=html \
  -P $DOCS_PATH https://docs.ray.io/en/master/
```

Then, we'll load the paths to our downloaded artifacts (html files) into a [Ray Dataset](https://docs.ray.io/en/latest/data/data.html) so that we can perform workloads on them at scale (ex. embed, index, etc.).

In [5]:
# Ray dataset
docs_path = Path(EFS_DIR, "docs.ray.io/en/master/")
ds = ray.data.from_items([{"path": path} for path in docs_path.rglob("*.html") if not path.is_dir()])
print(f"{ds.count()} documents")

3282 documents


### Chunk data

Now that we have a dataset of all the paths to the html files, we're going to develop some functions that can appropriately extract the text from these files. We want to do this in a generalized manner so that we can perform this extraction across all of our docs pages. Therefore, we identify the sections in our html page and then extract the text in between them. We save all of this into a list of dictionaries that map the text within a section to the specific url + section anchor id.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Example of sectionization process.

In [6]:
from bs4 import BeautifulSoup, NavigableString, Tag

In [7]:
def extract_text_from_section(section):
    texts = []
    for elem in section.children:
        if isinstance(elem, NavigableString):
            if elem.strip():
                texts.append(elem.strip())
        elif elem.name == 'section':
            continue
        else:
            texts.append(elem.get_text().strip())
    return '\n'.join(texts)

In [8]:
def path_to_uri(path, scheme="https://", domain="docs.ray.io"):
    return scheme + domain + str(path).split(domain)[-1]
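
To illustrate what `path_to_uri` does, here is the same function applied to a path under the `EFS_DIR` used above. Everything after the domain directory in the local path is kept and prefixed with the scheme and domain. The sketch uses `PurePosixPath` only so that it behaves the same on any operating system:

```python
from pathlib import PurePosixPath

def path_to_uri(path, scheme="https://", domain="docs.ray.io"):
    # Keep only the part of the local path that follows the domain directory.
    return scheme + domain + str(path).split(domain)[-1]

local_path = PurePosixPath("/efs/shared_storage/goku/docs.ray.io/en/master/rllib/rllib-env.html")
print(path_to_uri(local_path))
```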

In [9]:
def extract_sections(record):
    with open(record["path"], "r", encoding="utf-8") as html_file:
        soup = BeautifulSoup(html_file, "html.parser")
    sections = soup.find_all("section")
    section_list = []
    for section in sections:
        section_id = section.get("id")
        section_text = extract_text_from_section(section)
        if section_id:
            uri = path_to_uri(path=record["path"])
            section_list.append({"source": f"{uri}#{section_id}", "text": section_text})
    return section_list

In [10]:
html_file_path = Path(EFS_DIR, "docs.ray.io/en/master/rllib/rllib-env.html")
extract_sections({"path": html_file_path})[0]

{'source': 'https://docs.ray.io/en/master/rllib/rllib-env.html#environments',
 'text': '\nEnvironments#\nRLlib works with several different types of environments, including Farama-Foundation Gymnasium, user-defined, multi-agent, and also batched environments.\nTip\nNot all environments work with all algorithms. Check out the algorithm overview for more information.\n'}

We can apply our extraction function to all the paths in our dataset by using [flat_map](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.flat_map.html). This will apply our `extract_sections` function to each path in our dataset by utilizing all of our CPU workers in parallel.

In [11]:
# Extract sections
sections_ds = ds.flat_map(extract_sections)
sections = sections_ds.take_all()
print (len(sections))

2023-09-05 11:59:12,047	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[FlatMap(extract_sections)]
2023-09-05 11:59:12,048	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-09-05 11:59:12,050	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/200 [00:00<?, ?it/s]

5727


We now have a list of sections (with the text and source of each section) but we shouldn't directly use this as context for our RAG application just yet. The text lengths of the sections vary widely and many are quite large. If we were to use these large sections, we'd be inserting a lot of noisy/unwanted context and, because all LLMs have a maximum context length, we wouldn't be able to fit many other relevant contexts. Therefore, we're going to split the text within each section into smaller chunks. Intuitively, smaller chunks will encapsulate single/few concepts and will be less noisy compared to larger chunks. We're going to choose some typical text splitting values (ex. `chunk_size=300`) to create our chunks for now but we'll be experimenting with a range of values later.
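
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch that slides a fixed character window over the text. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the separators it is given); the function `chunk_text` is a hypothetical stand-in for illustration only:

```python
def chunk_text(text, chunk_size=300, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(700))  # 700 characters of dummy text
chunks = chunk_text(text, chunk_size=300, chunk_overlap=50)
print([len(chunk) for chunk in chunks])
```

The last 50 characters of each chunk are repeated at the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk more often than it would without overlap.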

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Sample chunking logic in action.

In [12]:
from langchain.document_loaders import ReadTheDocsLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [13]:
chunk_size = 300
chunk_overlap = 50
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len,
)

In [14]:
# Chunks
chunks = text_splitter.create_documents(
    texts=[section["text"] for section in sections], 
    metadatas=[{"source": section["source"]} for section in sections]
)

In [15]:
print (f"{len(chunks)} chunks\n")
pprint (chunks[0].page_content)
print (f"\nmetadata:\n{chunks[0].metadata}")

32276 chunks

('ray.tune.schedulers.PopulationBasedTrainingReplay.restore#\n'
 'PopulationBasedTrainingReplay.restore(checkpoint_path: str)#\n'
 'Restore trial scheduler from checkpoint.')

metadata:
{'source': 'https://docs.ray.io/en/master/tune/api/doc/ray.tune.schedulers.PopulationBasedTrainingReplay.restore.html#ray-tune-schedulers-populationbasedtrainingreplay-restore'}


We'll again load our chunks into a Ray Dataset so that we can perform workloads at scale on them.

In [16]:
# Ray dataset
chunks_ds = ray.data.from_items([{"text": chunk.page_content, "source": chunk.metadata["source"]} for chunk in chunks])
chunks_ds.show(1)

2023-09-05 11:59:35,022	INFO dataset.py:2180 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.


{'text': 'ray.tune.schedulers.PopulationBasedTrainingReplay.restore#\nPopulationBasedTrainingReplay.restore(checkpoint_path: str)#\nRestore trial scheduler from checkpoint.', 'source': 'https://docs.ray.io/en/master/tune/api/doc/ray.tune.schedulers.PopulationBasedTrainingReplay.restore.html#ray-tune-schedulers-populationbasedtrainingreplay-restore'}


### Embed data

Now that we've created small chunks from our dataset, we need a way to identify the most relevant ones for a given query. A very effective and quick method is to embed our data using a pretrained model and use the same model to embed the query. We can then compute the distance between all of the chunk embeddings and our query embedding to determine the top k chunks. There are many different pretrained models to choose from to embed our data but the most popular ones can be discovered through [HuggingFace's Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) leaderboard. These models were pretrained on very large text corpora through tasks such as next/masked token prediction, which allows them to learn to represent subtokens in N dimensions and capture semantic relationships. We can leverage this to represent our data and make decisions such as the most relevant contexts to use to answer a given query. We're using Langchain's Embedding wrappers ([HuggingFaceEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html) and [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html)) to easily load the models and embed our document chunks.

**Note**: embeddings aren't the only way to determine the most relevant chunks. We could also use an LLM to decide! However, because LLMs are much larger than these embedding models and have maximum context lengths, it's better to use embeddings to retrieve the top k chunks. We could then use LLMs on those k chunks to determine the <k chunks to use as the context to answer our query. We could also use reranking (ex. [Cohere Rerank](https://txt.cohere.com/rerank/)) to further identify the most relevant chunks to use.
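
The "embed, measure distance, keep the top k" idea can be sketched in a few lines of plain Python. The three-dimensional vectors below are made up for illustration (a model such as `thenlper/gte-base` produces 768 dimensions), and the helper names are hypothetical:

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_embedding, chunk_embeddings, k=2):
    """Return the indices of the k chunks closest to the query, nearest first."""
    ranked = sorted(
        range(len(chunk_embeddings)),
        key=lambda i: euclidean_distance(query_embedding, chunk_embeddings[i]),
    )
    return ranked[:k]

# Toy "embeddings"; real ones come from the embedding model.
chunk_embeddings = [
    [0.9, 0.1, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.8, 0.2, 0.1],  # chunk 2
]
query_embedding = [1.0, 0.0, 0.0]
print(top_k(query_embedding, chunk_embeddings, k=2))
```

This brute-force scan compares the query against every chunk, which is fine for a sketch; in this guide the same ranking is delegated to the vector database.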

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Represent a text chunk getting embedded.

In [17]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import numpy as np
from ray.data import ActorPoolStrategy

In [18]:
class EmbedChunks:
    def __init__(self, model_name):
        model_kwargs = {"device": "cuda"}
        encode_kwargs = {"device": "cuda", "batch_size": 100}
        if model_name == "text-embedding-ada-002":
            # Hosted OpenAI model: the device/batch kwargs only apply to local HuggingFace models
            self.embedding_model = OpenAIEmbeddings(
                model=model_name,
                openai_api_base=os.environ["OPENAI_API_BASE"],
                openai_api_key=os.environ["OPENAI_API_KEY"])
        else:
            self.embedding_model = HuggingFaceEmbeddings(
                model_name=model_name,
                model_kwargs=model_kwargs,
                encode_kwargs=encode_kwargs)
    
    def __call__(self, batch):
        embeddings = self.embedding_model.embed_documents(batch["text"])
        return {"text": batch["text"], "source": batch["source"], "embeddings": embeddings}

Here we're able to embed our chunks at scale by using [map_batches](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html). All we had to do was define the `batch_size` and the compute to use (we're using two workers, each with 1 GPU).
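
The callable-class pattern matters here: the constructor runs once per actor (so the model is loaded once), while `__call__` runs once per batch. A minimal sketch of the batch contract, assuming each batch arrives as a dict of columns; `FakeEmbedChunks` is a hypothetical stand-in that needs no GPU or model and uses plain lists instead of the array types Ray Data would pass:

```python
class FakeEmbedChunks:
    """Stand-in for EmbedChunks with the same input/output shape."""

    def __init__(self, dim=4):
        # In the real class, the (expensive) embedding model is loaded here, once per actor.
        self.dim = dim

    def __call__(self, batch):
        # Hypothetical "embedding": the text length repeated dim times.
        embeddings = [[float(len(text))] * self.dim for text in batch["text"]]
        return {"text": batch["text"], "source": batch["source"], "embeddings": embeddings}

batch = {"text": ["ray", "ray data"], "source": ["doc#a", "doc#b"]}
out = FakeEmbedChunks(dim=4)(batch)
print(out["embeddings"])
```

Each output column has one entry per input row, which is what lets downstream steps consume (`text`, `source`, `embeddings`) triplets.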

In [19]:
# Embed chunks
embedding_model_name = "thenlper/gte-base"
embedded_chunks = chunks_ds.map_batches(
    EmbedChunks,
    fn_constructor_kwargs={"model_name": embedding_model_name},
    batch_size=100, 
    num_gpus=1,
    compute=ActorPoolStrategy(size=2))

In [22]:
# Sample
sample = embedded_chunks.take(1)
print ("embedding size:", len(sample[0]["embeddings"]))
pprint(sample[0]["text"])

2023-09-04 12:36:36,645	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(EmbedChunks)]
2023-09-04 12:36:36,646	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-09-04 12:36:36,647	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2023-09-04 12:36:36,663	INFO actor_pool_map_operator.py:117 -- MapBatches(EmbedChunks): Waiting for 2 pool actors to start...


Running 0:   0%|          | 0/200 [00:00<?, ?it/s]



embedding size: 768
('It is equivalent to PENDING_CREATION,\n'
 'but means the actor was dead more than once.\n'
 'DEAD: The actor is permanatly dead.')


### Index data

Now that we have our embedded chunks, we need to index (store) them somewhere so that we can retrieve them quickly for inference. While there are many popular vector database options, we're going to use [Postgres](https://www.postgresql.org/) for its simplicity and performance. We'll create a table (`document`) and write the (`text`, `source`, `embedding`) triplets for each embedded chunk we have.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show a triplet getting indexed in a vector DB.

In [20]:
import psycopg
from pgvector.psycopg import register_vector

In [None]:
%%bash
# Set up pgvector
bash ../setup-pgvector.sh

In [21]:
%%bash
# Drop existing table if it exists
psql "$DB_CONNECTION_STRING" -c "DROP TABLE IF EXISTS document;"
sudo -u postgres psql -f ../migrations/vector-768.sql  # "thenlper/gte-base" dimension is 768

NOTICE:  table "document" does not exist, skipping


DROP TABLE
CREATE TABLE


If we have already created an index (and saved it), we can reload it:

In [22]:
%%bash
# Load index
export SQL_DUMP_FP="/efs/shared_storage/goku/sql_dumps/gte-base_300_50.sql"
echo $SQL_DUMP_FP
psql "$DB_CONNECTION_STRING" -f $SQL_DUMP_FP  # load

/efs/shared_storage/goku/sql_dumps/gte-base_300_50.sql
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
ALTER TABLE
ALTER TABLE
DROP SEQUENCE
DROP TABLE
DROP EXTENSION
CREATE EXTENSION
COMMENT
SET
SET
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
ALTER SEQUENCE
ALTER TABLE
COPY 32276
 setval 
--------
  32276
(1 row)

ALTER TABLE


In [23]:
%%bash
psql "$DB_CONNECTION_STRING" -c "SELECT count(*) FROM document;"

 count 
-------
 32276
(1 row)



Otherwise, we can index the data and save it:

In [24]:
class StoreResults:
    def __call__(self, batch):
        with psycopg.connect(os.environ["DB_CONNECTION_STRING"]) as conn:
            register_vector(conn)
            with conn.cursor() as cur:
                for text, source, embedding in zip(batch["text"], batch["source"], batch["embeddings"]):
                    cur.execute("INSERT INTO document (text, source, embedding) VALUES (%s, %s, %s)", (text, source, embedding,),)
        return {}

In [160]:
# Index data
embedded_chunks.map_batches(
    StoreResults,
    batch_size=128,
    num_cpus=1,
    compute=ActorPoolStrategy(size=28),
).count()

2023-09-04 15:34:51,436	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(EmbedChunks)] -> ActorPoolMapOperator[MapBatches(StoreResults)]
2023-09-04 15:34:51,439	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-09-04 15:34:51,440	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2023-09-04 15:34:51,465	INFO actor_pool_map_operator.py:117 -- MapBatches(EmbedChunks): Waiting for 2 pool actors to start...
2023-09-04 15:35:10,444	INFO actor_pool_map_operator.py:117 -- MapBatches(StoreResults): Waiting for 28 pool actors to start...


Running 0:   0%|          | 0/200 [00:00<?, ?it/s]



0

In [161]:
%%bash
# Save index
export SQL_DUMP_FP="/efs/shared_storage/goku/sql_dumps/gte-base_300_50.sql"
mkdir -p $(dirname "$SQL_DUMP_FP") && touch $SQL_DUMP_FP
sudo -u postgres pg_dump -c > $SQL_DUMP_FP  # save

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Retrieval

With our embedded chunks properly indexed in our vector database, we're ready to perform retrieval for a given query. We'll start by using the same embedding model we used to embed our text chunks to now embed the incoming query.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show the query getting embedded and show the retrieval process.

In [25]:
import json
import numpy as np

In [26]:
# Embed query
embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_name)
query = "What is the default batch size for map_batches?"
embedding = np.array(embedding_model.embed_query(query))
len(embedding)


768

Then, we'll retrieve the most relevant chunks by extracting the closest embedded chunks to our embedded query. We use Euclidean distance (`<->`) but there are [many options](https://github.com/pgvector/pgvector#vector-operators) to choose from. Once we retrieve the top `num_chunks`, we can collect the text for each chunk and use it as context to generate a response.
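
For reference, here is a plain-Python sketch of three distance operators described in the pgvector README: `<->` (L2 distance), `<#>` (negative inner product) and `<=>` (cosine distance). The function names are hypothetical; only the operator symbols come from pgvector:

```python
import math

def l2_distance(a, b):  # pgvector: <->
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):  # pgvector: <#>
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):  # pgvector: <=>
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two vectors pointing in the same direction but with different magnitudes.
a, b = [3.0, 4.0], [6.0, 8.0]
print(l2_distance(a, b))
print(negative_inner_product(a, b))
print(cosine_distance(a, b))
```

The example shows why the choice matters: the two vectors are far apart by L2 distance yet have zero cosine distance, so the metrics can rank the same chunks differently unless the embeddings are normalized.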

In [27]:
# Get context
num_chunks = 5
with psycopg.connect(os.environ["DB_CONNECTION_STRING"]) as conn:
    register_vector(conn)
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM document ORDER BY embedding <-> %s LIMIT %s", (embedding, num_chunks))
        rows = cur.fetchall()
        context = [{"text": row[1]} for row in rows]
        sources = [row[2] for row in rows]
for i, item in enumerate(context):
    print (sources[i])
    print (item["text"])
    print ()

https://docs.ray.io/en/master/data/api/doc/ray.data.Dataset.map_batches.html#ray-data-dataset-map-batches
entire blocks as batches (blocks may contain different numbers of rows).
The actual size of the batch provided to fn may be smaller than
batch_size if batch_size doesn’t evenly divide the block(s) sent
to a given map task. Default batch_size is 4096 with “default”.

https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-size
batch_size.
Note
The default batch size depends on your resource type. If you’re using CPUs,
the default batch size is 4096. If you’re using GPUs, you must specify an explicit
batch size.

https://docs.ray.io/en/master/data/batch_inference.html#configuring-batch-size
# Specify that each input batch should be of size 2.
ds.map_batches(assert_batch, batch_size=2)
Caution
The default batch_size of 4096 may be too large for datasets with large rows
(for example, tables with many columns or a collection of large images).

https://docs.ray.io/en/

## Generation

We can now use the context to generate a response from our LLM. Without the relevant context that we retrieved, the LLM may not have been able to accurately answer our question. And as our data grows, we can just as easily embed and index any new data and retrieve it to answer questions.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show how retrieved context + query texts are fed into API.

In [28]:
import time

In [29]:
def generate_response(
    llm, temperature=0.0, 
    system_content="", assistant_content="", user_content="", 
    max_retries=3, retry_interval=60):
    """Generate response from an LLM."""
    retry_count = 0
    while retry_count < max_retries:
        try:
            response = openai.ChatCompletion.create(
                model=llm,
                temperature=temperature,
                messages=[
                    {"role": "system", "content": system_content},
                    {"role": "assistant", "content": assistant_content},
                    {"role": "user", "content": user_content},
                ],
            )
            return response["choices"][-1]["message"]["content"]
        except Exception as e:
            print(e)
            time.sleep(retry_interval)  # default is per-minute rate limits
            retry_count += 1
    return ""

In [30]:
# Credentials
openai.api_base = os.environ["ANYSCALE_API_BASE"]
openai.api_key = os.environ["ANYSCALE_API_KEY"]

In [31]:
# Generate response
generate_response(
    llm="meta-llama/Llama-2-70b-chat-hf",
    temperature=0.0,
    system_content="Answer the query using the context provided.",
    user_content=f"query: {query}, context: {context}"
)

'The default batch size for map_batches is 4096. However, this may not always be the actual size of the batch provided to the function, as the batch size may need to be adjusted to fit the block size of the data being processed. The default batch size can be overridden by specifying a different value for the batch_size argument when calling map_batches. Note that the default batch size may vary depending on the resource type being used, with a default of 4096 for CPUs and a requirement for an explicit batch size specification when using GPUs.'

Let's combine the context retrieval and response generation into a convenient query agent that we can use to easily generate our responses.

In [32]:
class QueryAgent:
    def __init__(self, embedding_model_name="thenlper/gte-base",
                 llm="meta-llama/Llama-2-70b-chat-hf", 
                 temperature=0.0, max_context_length=4096,
                 system_content="", assistant_content=""):
        
        # Embedding model
        model_kwargs = {"device": "cuda"}
        encode_kwargs = {"device": "cuda", "batch_size": 100}
        if embedding_model_name == "text-embedding-ada-002":
            # Hosted OpenAI model: the device/batch kwargs only apply to local HuggingFace models
            self.embedding_model = OpenAIEmbeddings(
                model=embedding_model_name,
                openai_api_base=os.environ["OPENAI_API_BASE"],
                openai_api_key=os.environ["OPENAI_API_KEY"])
        else:
            self.embedding_model = HuggingFaceEmbeddings(
                model_name=embedding_model_name,
                model_kwargs=model_kwargs,
                encode_kwargs=encode_kwargs)
        
        # LLM
        self.llm = llm
        self.temperature = temperature
        self.context_length = max_context_length - len(system_content + assistant_content)
        self.system_content = system_content
        self.assistant_content = assistant_content

    def __call__(self, query, num_chunks=5):
        # Get context
        embedding = np.array(self.embedding_model.embed_query(query))
        with psycopg.connect(os.environ["DB_CONNECTION_STRING"]) as conn:
            register_vector(conn)
            with conn.cursor() as cur:
                cur.execute("SELECT * FROM document ORDER BY embedding <-> %s LIMIT %s", (embedding, num_chunks))
                rows = cur.fetchall()
                context = [{"text": row[1]} for row in rows]
                sources = [row[2] for row in rows]
            
        # Generate response
        user_content = f"query: {query}, context: {context}"
        answer = generate_response(
            llm=self.llm,
            temperature=self.temperature,
            system_content=self.system_content,
            assistant_content=self.assistant_content,
            user_content=user_content[: self.context_length],
        )

        # Result
        result = {
            "question": query,
            "sources": sources,
            "answer": answer,
        }
        return result
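
One detail of `QueryAgent` worth spelling out is how the prompt is kept within budget: the budget is computed and applied in characters, not tokens, so it is only a rough proxy for the model's real context limit. A minimal sketch of that step in isolation (the helper `build_user_content` is hypothetical and simply mirrors the slicing done in `__call__`):

```python
def build_user_content(query, context, max_context_length,
                       system_content="", assistant_content=""):
    # Character budget left after the system and assistant messages (characters, not tokens).
    budget = max_context_length - len(system_content + assistant_content)
    user_content = f"query: {query}, context: {context}"
    # Anything beyond the budget is cut off, starting with the last retrieved chunks.
    return user_content[:budget]

system_content = "Answer the query using the context provided."
context = [{"text": "x" * 100}, {"text": "y" * 100}]
user_content = build_user_content(
    query="What is the default batch size for map_batches?",
    context=context,
    max_context_length=150,
    system_content=system_content,
)
print(len(user_content))
```

Because the query comes first in the string, truncation removes context from the end, so the lowest-ranked chunks are the first to be dropped.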

In [33]:
query = "What is the default batch size for map_batches?"
system_content = "Answer the query using the context provided."
agent = QueryAgent(
    embedding_model_name="thenlper/gte-base",
    llm="meta-llama/Llama-2-7b-chat-hf",
    max_context_length=4096,
    system_content=system_content,
)
result = agent(query=query)
print(json.dumps(result, indent=2))

{
  "question": "What is the default batch size for map_batches?",
  "sources": [
    "https://docs.ray.io/en/master/data/api/doc/ray.data.Dataset.map_batches.html#ray-data-dataset-map-batches",
    "https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-size",
    "https://docs.ray.io/en/master/data/batch_inference.html#configuring-batch-size",
    "https://docs.ray.io/en/master/data/batch_inference.html#configuring-batch-size",
    "https://docs.ray.io/en/master/tune/getting-started.html#setting-up-a-tuner-for-a-training-run-with-tune"
  ],
  "answer": "Based on the provided context, the default batch size for `map_batches` is 4096. However, it's important to note that the default batch size may vary depending on the resource type being used. If using CPUs, the default batch size is 4096, while if using GPUs, an explicit batch size must be specified. Additionally, it's recommended to use a smaller batch size for datasets with large rows, such as tables with many 

# Evaluation

So far, we've chosen typical/arbitrary values for the various parts of our RAG application. But if we were to change something, such as our chunking logic, embedding model, LLM, etc., how can we know that we have a better configuration than before? A generative task like this is very difficult to quantitatively assess and so we need to develop creative ways to do so.

Because we have many moving parts in our application, we need to perform both unit/component and end-to-end evaluation. Component-wise evaluation can involve evaluating our retrieval in isolation (is the best source in our set of retrieved chunks?) and evaluating our LLM's response (given the best source, is the LLM able to produce a quality answer?). As for end-to-end evaluation, we can assess the quality of the entire system (given all data, what is the quality of the response?).
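
As a rough sketch of the retrieval component check described above, one simple option is the fraction of queries whose reference source appears among the retrieved sources. This exact formula and the toy data are assumptions for illustration, not necessarily how the `retrieval_score` mentioned earlier is computed in the repository:

```python
def retrieval_score(references, retrieved_sources):
    """Fraction of queries whose reference source appears among the retrieved sources."""
    hits = sum(1 for ref, sources in zip(references, retrieved_sources) if ref in sources)
    return hits / len(references)

# Hypothetical reference sources (one per query) and retrieved sources (a list per query).
references = ["docs/a.html#x", "docs/b.html#y", "docs/c.html#z"]
retrieved = [
    ["docs/a.html#x", "docs/q.html#r"],  # hit
    ["docs/m.html#n", "docs/o.html#p"],  # miss
    ["docs/c.html#z"],                   # hit
]
print(retrieval_score(references, retrieved))
```

A score like this isolates retrieval from generation: it can be compared across chunk sizes or embedding models without calling an LLM at all.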

We'll be asking our evaluator LLM to score the response from 1 to 5 using the context. However, we could also have it produce scores for other dimensions such as hallucination (is the generated answer using information only from the provided context?), toxicity, etc.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Component and end-to-end evaluation.

## Evaluator

We're going to start by determining our evaluator. Given a response to a query and relevant context, our evaluator should be a trusted way to score/assess the quality of the response. But before we can determine our evaluator, we need a dataset of questions and the source where the answer comes from. We can use this dataset to ask our different evaluators to provide an answer and then rate their answer (ex. score between 1-5). We can then inspect this dataset to determine if our evaluator is unbiased and has sound reasoning for the scores that are assigned.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show how the evaluator answers each question (given the dataset of questions + best source) and how we inspect the results to choose the evaluator.

In [34]:
openai.api_base = os.environ["OPENAI_API_BASE"]
openai.api_key = os.environ["OPENAI_API_KEY"]

We'll start by manually creating our dataset. We have a list of user queries and the ideal source to answer each query in [`datasets/eval-dataset-v1.jsonl`](https://github.com/ray-project/llm-applications/blob/main/datasets/eval-dataset-v1.jsonl). We will use our LLM app above to generate a reference answer for each query/source pair using `gpt-4`.

In [35]:
import re
import urllib.parse
from bs4 import BeautifulSoup
from IPython.display import clear_output, display, JSON

In [36]:
# If running tests / small samples, set num_samples to <10
num_samples = None  # None = all samples

In [37]:
with open(Path(ROOT_DIR, "datasets/eval-dataset-v1.jsonl"), "r") as f:
    data = [json.loads(item) for item in list(f)]

In [38]:
data[:5]

[{'question': 'I’m struggling a bit with Ray Data type conversions when I do map_batches. Any advice?',
  'source': 'https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-format'},
 {'question': 'How does autoscaling work in a Ray Serve application?',
  'source': 'https://docs.ray.io/en/master/serve/scaling-and-resource-allocation.html#autoscaling'},
 {'question': 'how do I get the address of a ray node',
  'source': 'https://docs.ray.io/en/master/ray-core/miscellaneous.html#node-information'},
 {'question': 'Does Ray support NCCL?',
  'source': 'https://docs.ray.io/en/master/ray-more-libs/ray-collective.html'},
 {'question': 'Is Ray integrated with DeepSpeed?',
  'source': 'https://docs.ray.io/en/master/ray-air/examples/gptj_deepspeed_fine_tuning.html#fine-tuning-the-model-with-ray-air-a-name-train-a'}]

In [39]:
def fetch_text(uri):
    url, anchor = uri.split("#") if "#" in uri else (uri, None)
    file_path = Path(EFS_DIR, url.split("https://")[-1])
    with open(file_path, "r", encoding="utf-8") as file:
        html_content = file.read()
    soup = BeautifulSoup(html_content, "html.parser")
    if anchor:
        target_element = soup.find(id=anchor)
        if target_element:
            text = target_element.get_text()
        else:
            return fetch_text(uri=url)
    else:
        text = soup.get_text()
    return text

In [40]:
# Sample
uri = "https://docs.ray.io/en/master/data/transforming-data.html#configuring-batch-format"
fetch_text(uri=uri)

'\nConfiguring batch format#\nRay Data represents batches as dicts of NumPy ndarrays or pandas DataFrames. By\ndefault, Ray Data represents batches as dicts of NumPy ndarrays.\nTo configure the batch type, specify batch_format in\nmap_batches(). You can return either format from your function.\n\n\n\nNumPy\nfrom typing import Dict\nimport numpy as np\nimport ray\n\ndef increase_brightness(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:\n    batch["image"] = np.clip(batch["image"] + 4, 0, 255)\n    return batch\n\nds = (\n    ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")\n    .map_batches(increase_brightness, batch_format="numpy")\n)\n\n\n\n\n\npandas\nimport pandas as pd\nimport ray\n\ndef drop_nas(batch: pd.DataFrame) -> pd.DataFrame:\n    return batch.dropna()\n\nds = (\n    ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")\n    .map_batches(drop_nas, batch_format="pandas")\n)\n\n\n\n\n'

In [41]:
# Content for inference
system_content = """
    "Answer the query using the context provided.
    Then, you must {score} your response between 1 and 5.
    You must return your response in a line with only the score.
    Do not add any more details.
    On a separate line provide your {reasoning} for the score as well.
    Return your response following the exact format outlined below.
    Do not add or remove anything.
    And all of this must be in a valid JSON format.
    
    {"answer": answer,
     "score": score,
     "reasoning": reasoning}
    """
assistant_content = ""

In [42]:
def extract_from_response(response):
    # Define regular expressions for extracting values
    answer_pattern = r'"answer"\s*:\s*"([^"]*)"'
    score_pattern = r'"score"\s*:\s*([0-9]+)'
    reasoning_pattern = r'"reasoning"\s*:\s*"([^"]*)"'

    # Extract values using regular expressions
    answer_match = re.search(answer_pattern, response)
    score_match = re.search(score_pattern, response)
    reasoning_match = re.search(reasoning_pattern, response)

    # Convert
    if answer_match and score_match and reasoning_match:
        answer = answer_match.group(1)
        score = float(score_match.group(1))
        reasoning = reasoning_match.group(1)
        return answer, score, reasoning

    return "", "", ""
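
The regular expressions above stop at the first `"` inside a value, so an answer containing an escaped quote would be cut short. As a hedged alternative (a sketch, not part of the original notebook), we could try strict JSON parsing first and only fall back to regular expressions when the model wraps the JSON in extra text. The function name `extract_from_response_json_first` is hypothetical.

```python
import json
import re


def extract_from_response_json_first(response):
    """Try strict JSON parsing first, then fall back to regex extraction."""
    try:
        parsed = json.loads(response)
        return str(parsed["answer"]), float(parsed["score"]), str(parsed["reasoning"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        pass  # not valid JSON (or missing keys), so fall back to regexes

    # Regex fallback; (?:[^"\\]|\\.)* also accepts escaped quotes inside values
    string_value = r'"((?:[^"\\]|\\.)*)"'
    answer_match = re.search(r'"answer"\s*:\s*' + string_value, response)
    score_match = re.search(r'"score"\s*:\s*([0-9]+(?:\.[0-9]+)?)', response)
    reasoning_match = re.search(r'"reasoning"\s*:\s*' + string_value, response)
    if answer_match and score_match and reasoning_match:
        return answer_match.group(1), float(score_match.group(1)), reasoning_match.group(1)

    # Same sentinel as the original function when nothing can be extracted
    return "", "", ""
```

Like the original, this returns empty strings when nothing can be extracted, so the downstream `if result["score"]` filter keeps working.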

In [43]:
def get_references(data, llm, temperature, max_context_length, system_content, assistant_content, num_samples=None):
    results = []
    for row in tqdm(data[:num_samples]):
        # Get context
        query = row["question"]
        context = fetch_text(uri=row["source"])

        # Generate response
        context_length = max_context_length - len(system_content + assistant_content)
        user_content = f"The query is {query} and the additional context is {context}"[:context_length]
        response = generate_response(
            llm=llm,
            temperature=temperature,
            system_content=system_content, 
            assistant_content=assistant_content, 
            user_content=user_content)

        # Extract from response
        answer, score, reasoning = extract_from_response(response=response)

        # Store result
        result = ({
                "question": query,
                "source": row["source"],
                "answer": answer,
                "score": score,
                "reasoning": reasoning,
            })
        results.append(result)
        clear_output(wait=True)
        display(JSON(json.dumps(result, indent=2)))
    return results

Let's generate reference responses with `gpt-4` first:

In [47]:
# GPT-4
openai.api_base = os.environ["OPENAI_API_BASE"]
openai.api_key = os.environ["OPENAI_API_KEY"]
results = get_references(
    data=data, llm="gpt-4", temperature=0.0, max_context_length=8192, 
    system_content=system_content, assistant_content=assistant_content,
    num_samples=num_samples)
print (np.mean([float(result["score"]) for result in results if result["score"]]))

<IPython.core.display.JSON object>

100%|██████████| 177/177 [40:45<00:00, 13.82s/it]

4.519774011299435





In [48]:
# Save to file
references_fp = Path(ROOT_DIR, EXPERIMENTS_DIR, "references", "gpt-4.json")
references_fp.parent.mkdir(parents=True, exist_ok=True)
with open(references_fp, "w") as fp:
    json.dump(results, fp, indent=4)

Let's generate reference responses with `Llama-2-70b` as well:

In [49]:
# Llama-2-70b
openai.api_base = os.environ["ANYSCALE_API_BASE"]
openai.api_key = os.environ["ANYSCALE_API_KEY"]
results = get_references(
    data=data, llm="meta-llama/Llama-2-70b-chat-hf", temperature=0.0, max_context_length=4096, 
    system_content=system_content, assistant_content=assistant_content,
    num_samples=num_samples)
print (np.mean([float(result["score"]) for result in results if result["score"]]))

<IPython.core.display.JSON object>

100%|██████████| 177/177 [28:43<00:00,  9.74s/it]

4.912751677852349





In [50]:
# Save to file
references_fp = Path(ROOT_DIR, EXPERIMENTS_DIR, "references", "llama-2-70b.json")
references_fp.parent.mkdir(parents=True, exist_ok=True)
with open(references_fp, "w") as fp:
    json.dump(results, fp, indent=4)

Now that we've seen the answers, scores and reasoning for our references dataset from both `gpt-4` and `Llama-2-70b`, we can use these responses to decide on a quality evaluator for our future experiments. This evaluator will be used to score answers for different experiment configurations, so we need to be able to trust its scores, reasoning, etc. After inspecting `Llama-2-70b`'s evaluation of its own answers, we found that it is definitely not a good evaluator. For most answers the reasoning is poor, and the scores look fairly random, with lots of 4s. Therefore, our evaluator will be `gpt-4`.

In [44]:
EVALUATOR = "gpt-4"

We may not always have a prepared dataset of questions and the best source to answer each question readily available. To address this cold start problem, we could use an LLM to look at our text chunks and generate questions that the specific chunk would answer. This provides us with quality questions and the exact source the answer is in. However, this dataset generation method could be a bit noisy. The generated questions may not always resemble what our users would ask, and the information in the chunk we label as the best source may also appear in other chunks. Nonetheless, this is a great way to start our development process while we collect + manually label a high quality dataset.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show the synthetic data generation process.

In [45]:
num_questions = 3
system_content = f"""
Create {num_questions} questions using only the context provided.
End each question with a '?' character and then in a newline write the answer to that question using only the context provided.
Separate each question/answer pair by a newline.
"""

In [36]:
# Generate questions
synthetic_data = []
for chunk in chunks[:3]:  # small samples
    response = generate_response(
        llm="gpt-4",
        temperature=0.0,
        system_content=system_content,
        user_content=f"context: {chunk.page_content}"
    )
    entries = response.split("\n\n")
    for entry in entries:
        question, answer = entry.split("\n")
        synthetic_data.append({"question": question, "source": chunk.metadata["source"], "answer": answer})
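
The loop above assumes that every entry is exactly one question line followed by one answer line. If the LLM returns a multi-line answer or a stray line, `entry.split("\n")` would fail to unpack. A more forgiving parser could look like the sketch below. The helper name `parse_qa_pairs` is hypothetical, and it assumes the question/answer layout requested in the prompt above.

```python
def parse_qa_pairs(response):
    """Parse 'question?' + answer blocks separated by blank lines, skipping malformed ones."""
    pairs = []
    for entry in response.strip().split("\n\n"):
        lines = [line.strip() for line in entry.split("\n") if line.strip()]
        if len(lines) < 2 or not lines[0].endswith("?"):
            continue  # skip entries without a question line and an answer
        question, answer = lines[0], " ".join(lines[1:])  # join multi-line answers
        pairs.append({"question": question, "answer": answer})
    return pairs


# Hypothetical response text, for illustration only
sample = "What is Ray?\nA framework.\n\nmalformed entry\n\nWhat is Tune?\nA library\nfor tuning."
print(parse_qa_pairs(sample))
```

Each parsed pair can then be extended with the chunk's `source` before being appended to `synthetic_data`.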

In [37]:
synthetic_data[:3]

[{'question': 'What is the context discussing about?',
  'source': 'https://docs.ray.io/en/master/tune/api/integration.html#external-library-integrations-for-ray-tune',
  'answer': 'The context is discussing about external library integrations for Ray Tune.'},
 {'question': 'What is Ray Tune?',
  'source': 'https://docs.ray.io/en/master/tune/api/integration.html#external-library-integrations-for-ray-tune',
  'answer': 'The context does not provide information on what Ray Tune is.'},
 {'question': 'What are external library integrations?',
  'source': 'https://docs.ray.io/en/master/tune/api/integration.html#external-library-integrations-for-ray-tune',
  'answer': 'The context does not provide information on what external library integrations are.'}]

## Experiments

With our evaluator set, we're ready to start experimenting with the various components in our LLM application. While we could perform this as a large [tuning experiment](https://docs.ray.io/en/latest/tune/index.html), where we can search across promising combinations of values/decisions, we're going to evaluate one decision at a time and fix the best value for the next experiment.

**Note**: this approach is slightly biased because many of our decisions are not independent (ex. `chunk_size` and `num_chunks` should ideally be evaluated across many combinations of values).
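
To make the trade-off concrete, here is a minimal sketch of what a joint search over two of these decisions would enumerate. It uses the same candidate values that we sweep one at a time later in this guide. The `grid` variable is only illustrative and is not used by the experiments below.

```python
from itertools import product

chunk_sizes = [100, 300, 500, 700]
num_chunks_list = [1, 3, 5, 7]

# Every (chunk_size, num_chunks) pair, instead of fixing one and sweeping the other
grid = [
    {"chunk_size": chunk_size, "num_chunks": num_chunks}
    for chunk_size, num_chunks in product(chunk_sizes, num_chunks_list)
]
print(len(grid))  # 16 configurations
```

Sweeping the two lists one after the other needs 8 runs, while the joint grid needs 16, and the gap grows quickly as more decisions (embedding model, LLM, etc.) are added.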

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Illustrate all the components that we'll be tuning.

### Utilities

Before we get started with our experiments, we're going to define some utility functions that we'll use to easily generate and evaluate responses using the different experiment configurations. We'll also define some functions to help determine our response quality score, retrieval recall score, etc.

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Represent the main experiment function doing the `generate` and `evaluate` responses.

In [46]:
import subprocess

We'll set where our labeled data and reference reports are located. We'll be using the former to generate responses and the latter dataset to evaluate those responses.

In [47]:
# Paths
DATA_PATH = str(Path(ROOT_DIR, "datasets", "eval-dataset-v1.jsonl"))
REFERENCE_LOC = str(Path(ROOT_DIR, EXPERIMENTS_DIR, "references", "gpt-4.json"))

We'll also create some mappings to record the embedding dimensions and maximum context lengths of our different embedding models and LLMs.

In [48]:
# Mappings
EMBEDDING_DIMENSIONS = {
    "thenlper/gte-base": 768,
    "BAAI/bge-large-en": 1024,
    "text-embedding-ada-002": 1536
}
MAX_CONTEXT_LENGTHS = {
    "gpt-4": 8192,
    "gpt-3.5-turbo": 4096,
    "gpt-3.5-turbo-16k": 16384,
    "meta-llama/Llama-2-7b-chat-hf": 4096,
    "meta-llama/Llama-2-13b-chat-hf": 4096,
    "meta-llama/Llama-2-70b-chat-hf": 4096,
}

In [49]:
def execute_bash(command):
    results = subprocess.run(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    return results

We'll set the index to our vector DB. If it already exists, then we'll load it from the saved SQL dump. Otherwise, we'll build a new index and save it.

In [50]:
def set_index(sections, embedding_model_name, chunk_size, chunk_overlap):
    # Drop current Vector DB and prepare for new one
    execute_bash(f'psql "{os.environ["DB_CONNECTION_STRING"]}" -c "DROP TABLE document;"')
    execute_bash(f'sudo -u postgres psql -f ../migrations/vector-{EMBEDDING_DIMENSIONS[embedding_model_name]}.sql')
    SQL_DUMP_FP = Path(EFS_DIR, "sql_dumps", f"{embedding_model_name.split('/')[-1]}_{chunk_size}_{chunk_overlap}.sql")
    
    # Vector DB
    if SQL_DUMP_FP.exists():  # Load from SQL dump
        execute_bash(f'psql "{os.environ["DB_CONNECTION_STRING"]}" -f {SQL_DUMP_FP}')
    else:  # Create new index
        # Create chunks dataset
        text_splitter = RecursiveCharacterTextSplitter(
            separators=["\n\n", "\n", " ", ""],
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )
        chunks = text_splitter.create_documents(
            texts=[section["text"] for section in sections], 
            metadatas=[{"source": section["source"]} for section in sections]
        )
        chunks_ds = ray.data.from_items([{"text": chunk.page_content, "source": chunk.metadata["source"]} for chunk in chunks])

        # Embed chunks
        embedded_chunks = chunks_ds.map_batches(
            EmbedChunks,
            fn_constructor_kwargs={"model_name": embedding_model_name},
            batch_size=100, 
            num_gpus=1,
            compute=ActorPoolStrategy(size=2))
        
        # Index data
        embedded_chunks.map_batches(
            StoreResults,
            batch_size=128,
            num_cpus=1,
            compute=ActorPoolStrategy(size=28),
        ).count()
        
        # Save to SQL dump
        execute_bash(f"sudo -u postgres pg_dump -c > {SQL_DUMP_FP}")

In [51]:
def set_credentials(llm):
    if llm.startswith("gpt"):
        openai.api_base = os.environ["OPENAI_API_BASE"]
        openai.api_key = os.environ["OPENAI_API_KEY"]
    else:
        openai.api_base = os.environ["ANYSCALE_API_BASE"]
        openai.api_key = os.environ["ANYSCALE_API_KEY"]

We'll generate responses for the dataset of questions and save the responses.

In [52]:
# Generate responses
def generate_responses(
    experiment_name, data_path, sections,
    chunk_size, chunk_overlap, num_chunks,
    embedding_model_name, 
    llm, temperature, max_context_length, 
    system_content, assistant_content="",
    num_samples=None):
    
    # Set credentials
    set_credentials(llm=llm)
    
    # Build index
    set_index(
        sections=sections,
        embedding_model_name=embedding_model_name,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    
    # Query agent
    agent = QueryAgent(
        embedding_model_name=embedding_model_name,
        llm=llm,
        temperature=temperature,
        max_context_length=max_context_length,
        system_content=system_content,
        assistant_content=assistant_content,
    )

    # Generate responses
    results = []
    with open(Path(data_path), "r") as f:
        questions = [json.loads(item)["question"] for item in list(f)][:num_samples]
    for query in tqdm(questions):
        result = agent(query=query, num_chunks=num_chunks)
        results.append(result)
        clear_output(wait=True)
        display(JSON(json.dumps(result, indent=2)))

    # Save to file
    responses_fp = Path(ROOT_DIR, EXPERIMENTS_DIR, "responses", f"{experiment_name}.json")
    responses_fp.parent.mkdir(parents=True, exist_ok=True)
    config = {
        "experiment_name": experiment_name,
        "data_path": data_path,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "num_chunks": num_chunks,
        "embedding_model_name": embedding_model_name,
        "llm": llm,
        "temperature": temperature,
        "max_context_length": max_context_length,
        "system_content": system_content,
        "assistant_content": assistant_content,
    }
    responses = {
        "config": config,
        "results": results,
    }
    with open(responses_fp, "w") as fp:
        json.dump(responses, fp, indent=4)

Next, we'll define a function to determine our retrieval score, which registers a success if the best source is anywhere among our `num_chunks` retrieved sources. We don't account for order, exact page section, etc. but we could add those constraints to have a more conservative retrieval score.

In [53]:
def get_retrieval_score(references, generated):
    matches = np.zeros(len(references))
    for i in range(len(references)):
        reference_source = references[i]["source"].split("#")[0]
        if not reference_source:
            matches[i] = 1
            continue
        for source in generated[i]["sources"]:
            # sections don't have to perfectly match
            if reference_source == source.split("#")[0]:
                matches[i] = 1
                break  # one match is enough, stop scanning this query's sources
    retrieval_score = np.mean(matches)
    return retrieval_score
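
To see how this score behaves, here is a small self-contained sketch on toy data. The URLs are hypothetical, and `retrieval_hit_rate` is a simplified stand-in for the function above. Unlike the original, it does not count a query with an empty reference source as a hit.

```python
import numpy as np


def retrieval_hit_rate(references, generated):
    """Fraction of queries whose reference page appears among the retrieved sources."""
    hits = []
    for ref, gen in zip(references, generated):
        ref_page = ref["source"].split("#")[0]  # compare pages, ignoring section anchors
        retrieved_pages = {source.split("#")[0] for source in gen["sources"]}
        hits.append(ref_page in retrieved_pages)
    return float(np.mean(hits))


# Hypothetical URLs, for illustration only
references = [
    {"source": "https://example.com/a.html#intro"},
    {"source": "https://example.com/b.html#setup"},
]
generated = [
    {"sources": ["https://example.com/a.html#details", "https://example.com/c.html"]},  # hit
    {"sources": ["https://example.com/c.html#x"]},  # miss
]
print(retrieval_hit_rate(references, generated))  # 0.5
```

The first query counts as a hit even though the retrieved section (`#details`) differs from the reference section (`#intro`), because only the page part of the URL is compared.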

With our evaluator and generated responses, we're ready to evaluate the quality of the responses with a score between 1-5. At the end, we can average the scores and use that to represent the end-to-end performance of this specific configuration.

In [54]:
def evaluate_responses(
    experiment_name, reference_loc, response_loc,
    evaluator, temperature, max_context_length,
    system_content, assistant_content="",
    num_samples=None):
    
    # Set credentials
    set_credentials(llm=evaluator)
    
    # Load answers
    with open(Path(reference_loc), "r") as f:
        references = [item for item in json.load(f)][:num_samples]
    with open(Path(response_loc), "r") as f:
        generated = [item for item in json.load(f)["results"]][:num_samples]
    assert len(references) == len(generated)

    # Quality score
    results = []
    context_length = max_context_length - len(system_content + assistant_content)
    for ref, gen in tqdm(zip(references, generated), total=len(references)):
        assert ref["question"] == gen["question"]
        user_content = str(
            {
                "question": gen["question"],
                "generated_answer": gen["answer"],
                "reference_answer": ref["answer"],
            }
        )[:context_length]

        # Generate response
        response = generate_response(
            llm=evaluator,
            temperature=temperature,
            system_content=system_content,
            assistant_content=assistant_content,
            user_content=user_content,
        )

        # Extract from response
        score, reasoning = response.split("\n", 1)

        # Store result
        result = {
            "question": gen["question"],
            "generated_answer": gen["answer"],
            "reference_answer": ref["answer"],
            "score": float(score),
            "reasoning": reasoning.lstrip("\n"),
            "sources": gen["sources"],
        }
        results.append(result)
        clear_output(wait=True)
        display(JSON(json.dumps(result, indent=2)))

    # Save to file
    evaluator_name = evaluator.split("/")[-1].lower()
    evaluation_fp = Path(ROOT_DIR, EXPERIMENTS_DIR, "evaluations", f"{experiment_name}_{evaluator_name}.json")
    evaluation_fp.parent.mkdir(parents=True, exist_ok=True)
    config = {
        "experiment_name": experiment_name,
        "reference_loc": reference_loc,
        "response_loc": response_loc,
        "evaluator": evaluator,
        "temperature": temperature,
        "max_context_length": max_context_length,
        "system_content": system_content,
        "assistant_content": assistant_content,
    }
    evaluation = {
        "config": config,
        "retrieval_score": get_retrieval_score(references, generated),
        "quality_score": np.mean([item["score"] for item in results if (item["score"] and item["reference_answer"])]),
        "results": results,
    }
    with open(evaluation_fp, "w") as fp:
        json.dump(evaluation, fp, indent=4)
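
The parsing step in `evaluate_responses` (`response.split("\n", 1)` followed by `float(score)`) raises an error if the evaluator returns only a score, or prefixes it with text such as `Score: 4`. A more defensive parser might look like the following sketch. The helper name `parse_score_and_reasoning` is hypothetical, and it assumes the score, if present, is on the first line.

```python
import re


def parse_score_and_reasoning(response):
    """Return (score, reasoning); score is None if no 1-5 score is found on the first line."""
    first_line, _, rest = response.strip().partition("\n")
    # Standalone number from 1 to 5, optionally with decimals (so "10" does not match)
    match = re.search(r"\b([1-5](?:\.\d+)?)\b", first_line)
    if match is None:
        return None, response.strip()
    return float(match.group(1)), rest.strip()


print(parse_score_and_reasoning("4\nGood answer."))
print(parse_score_and_reasoning("Score: 3.5\n\nPartially correct."))
```

Returning `None` for an unparseable score lets the caller decide whether to skip the sample or retry, instead of stopping a long evaluation run partway through.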

We'll define one encompassing function that will generate and evaluate the responses so that we can run these experiments with one function call.

In [55]:
def run_experiment(
    experiment_name, data_path, sections,
    chunk_size, chunk_overlap, num_chunks,
    embedding_model_name, llm,
    reference_loc, evaluator,
    num_samples=None):
    """Generate responses and evaluate them."""
    
    # Generate responses
    generate_responses(
        experiment_name=experiment_name, 
        data_path=data_path,
        sections=sections,
        chunk_size=chunk_size, 
        chunk_overlap=chunk_overlap, 
        num_chunks=num_chunks,
        embedding_model_name=embedding_model_name, 
        llm=llm, 
        temperature=0.0, 
        max_context_length=MAX_CONTEXT_LENGTHS[llm], 
        system_content="Answer the query using the context provided.",
        num_samples=num_samples)

    # Evaluate responses
    evaluation_system_content = """
        Your job is to rate the quality of our generated answer {generated_answer}
        given a query {query} and a reference answer {reference_answer}.
        Your score has to be between 1 and 5.
        You must return your response in a line with only the score.
        Do not return answers in any other format.
        On a separate line provide your reasoning for the score as well.
        """
    evaluate_responses(
        experiment_name=experiment_name,
        reference_loc=reference_loc, 
        response_loc=str(Path(ROOT_DIR, EXPERIMENTS_DIR, "responses", f"{experiment_name}.json")),
        evaluator=evaluator,
        temperature=0.0,
        max_context_length=MAX_CONTEXT_LENGTHS[evaluator],
        system_content=evaluation_system_content,
        num_samples=num_samples)

In [56]:
def print_experiment(experiment_name, evaluator=EVALUATOR):
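    evaluator_name = evaluator.split("/")[-1].lower()  # match the file name used in evaluate_responses
    eval_fp = Path(ROOT_DIR, EXPERIMENTS_DIR, "evaluations", f"{experiment_name}_{evaluator_name}.json")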
    with open(eval_fp, "r") as fp:
        d = json.load(fp)
    print (experiment_name)
    print ("  retrieval score:", d["retrieval_score"])
    print ("  quality score:", d["quality_score"])
    print ()

In [57]:
llm = "gpt-3.5-turbo"

### Context

We're first going to test if the additional context we provide is helpful at all. This is to validate that the RAG system is indeed worth the effort. We can do this by setting `num_chunks=0` (no context) and comparing that to `num_chunks=5`.

In [61]:
# Without context
num_chunks = 0
experiment_name = "without-context"
run_experiment(
    experiment_name=experiment_name, 
    data_path=DATA_PATH,
    sections=sections,
    chunk_size=100, 
    chunk_overlap=50,
    num_chunks=num_chunks,
    embedding_model_name="thenlper/gte-base",
    llm=llm,
    reference_loc=REFERENCE_LOC,
    evaluator=EVALUATOR,
    num_samples=num_samples)

<IPython.core.display.JSON object>

100%|██████████| 177/177 [31:30<00:00, 10.68s/it]


As a sanity check, our retrieval score should be zero since we're not using any context :)

In [62]:
print_experiment(experiment_name=experiment_name)

without-context
  retrieval score: 0.0
  quality score: 3.110169491525424



In [63]:
# With context
num_chunks = 5
experiment_name = "with-context"
run_experiment(
    experiment_name=experiment_name, 
    data_path=DATA_PATH,
    sections=sections,
    chunk_size=300, 
    chunk_overlap=50, 
    num_chunks=num_chunks,
    embedding_model_name="thenlper/gte-base",
    llm=llm,
    reference_loc=REFERENCE_LOC,
    evaluator=EVALUATOR,
    num_samples=num_samples)

<IPython.core.display.JSON object>

100%|██████████| 177/177 [21:23<00:00,  7.25s/it]


In [64]:
print_experiment(experiment_name=experiment_name)

with-context
  retrieval score: 0.5254237288135594
  quality score: 3.4491525423728815



As we can see, using context (RAG) does indeed improve the quality of our answers: the quality score rises from 3.11 without context to 3.45 with context.

### Chunk size

Next, we'll assess various chunk sizes. Smaller chunks (but not too small!) are able to encapsulate atomic concepts, which yields more precise retrieval, while larger chunks may be noisier. Popular strategies include using small chunks but retrieving a bit of the [surrounding chunks](https://gpt-index.readthedocs.io/en/latest/end_to_end_tutorials/dev_practices/production_rag.html#decoupling-chunks-used-for-retrieval-vs-chunks-used-for-synthesis) around them (since they may have relevant info) or storing [multiple embeddings](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) per document (ex. summary embedding per document).

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> illustrate small vs large and popular strategies.

In [65]:
chunk_sizes = [100, 300, 500, 700]

In [66]:
for chunk_size in chunk_sizes:
    experiment_name = f"chunk-size-{chunk_size}"
    run_experiment(
        experiment_name=experiment_name, 
        data_path=DATA_PATH,
        sections=sections,
        chunk_size=chunk_size, 
        chunk_overlap=50, 
        num_chunks=5,
        embedding_model_name="thenlper/gte-base",
        llm=llm,
        reference_loc=REFERENCE_LOC,
        evaluator=EVALUATOR,
        num_samples=num_samples)

<IPython.core.display.JSON object>

100%|██████████| 177/177 [21:14<00:00,  7.20s/it]


In [67]:
for chunk_size in chunk_sizes:
    experiment_name = f"chunk-size-{chunk_size}"
    print_experiment(experiment_name=experiment_name)

chunk-size-100
  retrieval score: 0.4180790960451977
  quality score: 3.073446327683616

chunk-size-300
  retrieval score: 0.5254237288135594
  quality score: 3.3983050847457625

chunk-size-500
  retrieval score: 0.5480225988700564
  quality score: 3.5338983050847457

chunk-size-700
  retrieval score: 0.519774011299435
  quality score: 3.573446327683616



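It seems that a larger chunk size does help, but the gains taper off between 500 and 700 characters (too much context might be too noisy): the retrieval score peaks at 500, while the quality score improves only slightly at 700.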

**Note**: If we were to use larger chunk sizes (ours is based on characters), keep in mind that [most](https://huggingface.co/spaces/mteb/leaderboard) open source embedding models have a maximum sequence length of 512 sub-word tokens. This means that if our chunk contains more than 512 sub-word tokens, the embedding wouldn't account for it anyway (unless we finetune our embedding model to have longer sequence lengths).

In [60]:
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

### Number of chunks

**Note**: Keep in mind that the `chunk_size` you chose multiplied by the `num_chunks` below must fit inside the LLM's context length. We're experimenting with the chunk size and number of chunks as if they were independent variables but they are heavily related, especially since all of our LLMs have a finite maximum context length. So ideally, we would tune for a combination of `chunk_size` * `num_chunks`.
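
As a rough illustration of this constraint, the sketch below computes how many chunks fit in a given budget. It treats the context length as a character budget, the same simplification our truncation code above makes when it slices the prompt by `max_context_length`. Real limits are in tokens, so this is usually conservative for English text. The helper `max_chunks_that_fit` and the `reserved` value are assumptions for illustration.

```python
def max_chunks_that_fit(max_context_length, chunk_size, reserved=0):
    """Largest num_chunks such that num_chunks * chunk_size fits in the remaining budget."""
    budget = max_context_length - reserved  # room left after system/assistant content
    return max(budget // chunk_size, 0)


# 4096 budget, 500-character chunks, 100 characters reserved for the prompts (assumed)
print(max_chunks_that_fit(4096, 500, reserved=100))  # 7
```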

In [69]:
num_chunks_list = [1, 3, 5, 7]

In [70]:
for num_chunks in num_chunks_list:
    experiment_name = f"num-chunks-{num_chunks}"
    run_experiment(
        experiment_name=experiment_name, 
        data_path=DATA_PATH,
        sections=sections,
        chunk_size=CHUNK_SIZE, 
        chunk_overlap=CHUNK_OVERLAP, 
        num_chunks=num_chunks,
        embedding_model_name="thenlper/gte-base",
        llm=llm,
        reference_loc=REFERENCE_LOC,
        evaluator=EVALUATOR,
        num_samples=num_samples)

<IPython.core.display.JSON object>

100%|██████████| 177/177 [24:52<00:00,  8.43s/it]


In [71]:
for num_chunks in num_chunks_list:
    experiment_name=f"num-chunks-{num_chunks}"
    print_experiment(experiment_name=experiment_name)

num-chunks-1
  retrieval score: 0.20903954802259886
  quality score: 3.1045197740112993

num-chunks-3
  retrieval score: 0.4406779661016949
  quality score: 3.477401129943503

num-chunks-5
  retrieval score: 0.5480225988700564
  quality score: 3.5706214689265536

num-chunks-7
  retrieval score: 0.6214689265536724
  quality score: 3.6016949152542375



Increasing our number of chunks improves our retrieval and quality scores. We had to stop testing at 7 chunks since our `chunk_size` is 500 characters and `Llama-2-70b`'s maximum context length is 4096 tokens (we also have to account for the system, assistant and user content to our LLM). This is a major reason to invest in extending context size via RoPE scaling (rotary position embeddings), etc. But it also seems that the benefit of increasing the number of chunks is starting to taper off.

In [61]:
NUM_CHUNKS = 7

### Embedding models

So far, we've used [`thenlper/gte-base`](https://huggingface.co/thenlper/gte-base) as our embedding model because it's a relatively small (0.22 GB) and performant option. But now, let's explore other popular options such as the current leader on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard), [`BAAI/bge-large-en`](https://huggingface.co/BAAI/bge-large-en) (1.34 GB), and OpenAI's [`text-embedding-ada-002`](https://openai.com/blog/new-and-improved-embedding-model).

In [73]:
embedding_model_names = ["thenlper/gte-base", "BAAI/bge-large-en", "text-embedding-ada-002"]

In [74]:
for embedding_model_name in embedding_model_names:
    experiment_name = f"{embedding_model_name.split('/')[-1]}"
    run_experiment(
        experiment_name=experiment_name, 
        data_path=DATA_PATH,
        sections=sections,
        chunk_size=CHUNK_SIZE, 
        chunk_overlap=CHUNK_OVERLAP, 
        num_chunks=NUM_CHUNKS,
        embedding_model_name=embedding_model_name,
        llm=llm,
        reference_loc=REFERENCE_LOC,
        evaluator=EVALUATOR,
        num_samples=num_samples)

<IPython.core.display.JSON object>

100%|██████████| 177/177 [54:21<00:00, 18.42s/it]


In [75]:
for embedding_model_name in embedding_model_names:
    experiment_name = f"{embedding_model_name.split('/')[-1]}"
    print_experiment(experiment_name=experiment_name)

gte-base
  retrieval score: 0.6214689265536724
  quality score: 3.57909604519774

bge-large-en
  retrieval score: 0.4406779661016949
  quality score: 3.3446327683615817

text-embedding-ada-002
  retrieval score: 0.5988700564971752
  quality score: 3.5112994350282487



This is an interesting outcome because the #1 (`BAAI/bge-large-en`) on the current leaderboard isn't necessarily the best for our specific task. Using the smaller `thenlper/gte-base` produced the best retrieval and quality scores in our experiments.

In [62]:
EMBEDDING_MODEL_NAME = "thenlper/gte-base"

### OSS vs. closed LLMs

We're now going to use the best configurations from above to evaluate different choices for the main LLM.

**Note**:
- We've been using a specific LLM so far to decide on the configuration so that specific LLM's performance here will be a bit biased.
- This list is not exhaustive and even for the LLMs we use, there are versions with longer context windows available.

In [63]:
llms = ["gpt-3.5-turbo",
        "gpt-4",
        "meta-llama/Llama-2-7b-chat-hf", 
        "meta-llama/Llama-2-13b-chat-hf", 
        "meta-llama/Llama-2-70b-chat-hf"]

In [64]:
for llm in llms:
    experiment_name = f"{llm.split('/')[-1].lower()}"
    run_experiment(
        experiment_name=experiment_name, 
        data_path=DATA_PATH,
        sections=sections,
        chunk_size=CHUNK_SIZE, 
        chunk_overlap=CHUNK_OVERLAP, 
        num_chunks=NUM_CHUNKS,
        embedding_model_name=EMBEDDING_MODEL_NAME,
        llm=llm,
        reference_loc=REFERENCE_LOC,
        evaluator=EVALUATOR,
        num_samples=num_samples)

100%|██████████| 177/177 [20:22<00:00,  6.91s/it]


In [65]:
for llm in llms:
    experiment_name = f"{llm.split('/')[-1].lower()}"
    print_experiment(experiment_name=experiment_name)

gpt-3.5-turbo
  retrieval score: 0.6214689265536724
  quality score: 3.57909604519774

gpt-4
  retrieval score: 0.6214689265536724
  quality score: 3.824858757062147

llama-2-7b-chat-hf
  retrieval score: 0.6214689265536724
  quality score: 2.864406779661017

llama-2-13b-chat-hf
  retrieval score: 0.6214689265536724
  quality score: 3.138418079096045

llama-2-70b-chat-hf
  retrieval score: 0.6214689265536724
  quality score: 3.4887005649717513



**Note**: Some of our LLMs have much larger context lengths, ex. `gpt-4` supports 8192 tokens and `gpt-3.5-turbo-16k` supports 16384. We could increase the number of chunks that we use for these since we saw that increasing `num_chunks` continued to improve the retrieval and quality scores. However, we will keep this value fixed for now, both because the performance started to taper off anyway and so that we can compare the LLMs under the exact same configuration.
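
To get a rough feel for how much headroom those longer context windows offer, the sketch below estimates how many chunks would fit. Everything beyond the two context lengths is an assumption made for illustration: the helper `max_chunks_for_context` is hypothetical, the 500-character chunk size is a placeholder rather than the guide's `CHUNK_SIZE`, the 4-characters-per-token ratio is a common rule of thumb for English text, and 1000 tokens are arbitrarily reserved for the question, instructions and generated answer.

```python
def max_chunks_for_context(context_length, chunk_size_chars,
                           chars_per_token=4.0, reserved_tokens=1000):
    """Estimate how many chunks fit into a context window (hypothetical helper).

    context_length:   model context window, in tokens
    chunk_size_chars: chunk size in characters (assumed unit)
    chars_per_token:  assumed average characters per token (rule of thumb)
    reserved_tokens:  assumed budget for the question, instructions and answer
    """
    tokens_per_chunk = chunk_size_chars / chars_per_token
    available_tokens = context_length - reserved_tokens
    if available_tokens <= 0:
        return 0
    return int(available_tokens // tokens_per_chunk)


# Context lengths taken from the note above; the chunk size of 500 is a placeholder.
context_lengths = {"gpt-4": 8192, "gpt-3.5-turbo-16k": 16384}
chunk_budget = {
    name: max_chunks_for_context(length, chunk_size_chars=500)
    for name, length in context_lengths.items()
}
print(chunk_budget)
```

An estimate like this only bounds what fits in the window. Whether the extra chunks actually help is still an empirical question for the evaluation loop.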

In [66]:
LLM = "meta-llama/Llama-2-70b-chat-hf"

## Cost analysis

**Note**: Our `Llama-2` models are priced at $1/M tokens with [Anyscale Endpoints](https://endpoints.anyscale.com/).

In [67]:
# Pricing details
pricing = {
    "gpt-3.5-turbo": {
        "prompt": 2e-6,
        "sampled": 2e-6
    },
    "gpt-4": {
        "prompt": 3e-5,
        "sampled": 6e-5
    },
    "llama-2-7b-chat-hf": {
        "prompt": 1e-6,
        "sampled": 1e-6
    },
    "llama-2-13b-chat-hf": {
        "prompt": 1e-6,
        "sampled": 1e-6
    },
    "llama-2-70b-chat-hf": {
        "prompt": 1e-6,
        "sampled": 1e-6
    }
}

In [68]:
import json
from pathlib import Path

def cost_analysis(llm):
    """Estimate the total and average cost of an LLM's evaluation run."""
    experiment_name = f"{llm.split('/')[-1].lower()}"
    eval_fp = Path(ROOT_DIR, EXPERIMENTS_DIR, "evaluations", f"{experiment_name}_{EVALUATOR}.json")
    with open(eval_fp, "r") as fp:
        d = json.load(fp)
    num_samples = len(d["results"])
    # len() on the question and answer strings counts characters,
    # which is used here as a rough proxy for the number of tokens.
    prompt_size, sampled_size = 0, 0
    for result in d["results"]:
        # Prompt = question + retrieved context (NUM_CHUNKS chunks of CHUNK_SIZE each)
        prompt_size += len(result["question"]) + (CHUNK_SIZE * NUM_CHUNKS)
        sampled_size += len(result["generated_answer"])
    total_cost = pricing[experiment_name]["prompt"] * prompt_size + pricing[experiment_name]["sampled"] * sampled_size
    avg_cost = total_cost / num_samples

    print(llm)
    print(f"  avg prompt size: {int(prompt_size/num_samples)}")
    print(f"  avg sampled size: {int(sampled_size/num_samples)}")
    print(f"  total cost: ${total_cost:.2f}")
    print(f"  avg cost: ${avg_cost:.2f}")
    print()
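
One caveat about `cost_analysis`: it measures the question and answer with `len()`, so the sizes are character counts, while the prices are per token. Since a token usually spans several characters in English text, the figures are likely an upper bound. The sketch below shows how the estimate would shift under an assumed characters-per-token ratio. The `estimate_cost` helper, the ratio of 4 and the example sizes are all hypothetical and not part of the guide's code. An exact count would require the model's own tokenizer.

```python
def estimate_cost(prompt_chars, sampled_chars, prompt_price, sampled_price,
                  chars_per_token=4.0):
    """Estimate the cost of one request from character counts (hypothetical helper).

    chars_per_token is an assumed average, not a measured value.
    Setting it to 1.0 treats every character as a token, as cost_analysis does.
    """
    prompt_tokens = prompt_chars / chars_per_token
    sampled_tokens = sampled_chars / chars_per_token
    return prompt_price * prompt_tokens + sampled_price * sampled_tokens


# Example sizes are made up for illustration; prices are the gpt-3.5-turbo
# entries from the pricing dictionary above.
char_based = estimate_cost(3500, 800, 2e-6, 2e-6, chars_per_token=1.0)
token_based = estimate_cost(3500, 800, 2e-6, 2e-6, chars_per_token=4.0)
print(f"character-based estimate: ${char_based:.4f}")
print(f"token-based estimate:     ${token_based:.4f}")
```

Because the same proxy is applied to every model, the relative comparison between LLMs is much less sensitive to this choice than the absolute dollar amounts.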

In [69]:
for llm in llms:
    cost_analysis(llm=llm)

gpt-3.5-turbo
  avg prompt size: 3567
  avg sampled size: 852
  total cost: $1.56
  avg cost: $0.01

gpt-4
  avg prompt size: 3567
  avg sampled size: 677
  total cost: $26.14
  avg cost: $0.15

meta-llama/Llama-2-7b-chat-hf
  avg prompt size: 3567
  avg sampled size: 2375
  total cost: $1.05
  avg cost: $0.01

meta-llama/Llama-2-13b-chat-hf
  avg prompt size: 3567
  avg sampled size: 1619
  total cost: $0.92
  avg cost: $0.01

meta-llama/Llama-2-70b-chat-hf
  avg prompt size: 3567
  avg sampled size: 1476
  total cost: $0.89
  avg cost: $0.01



## MoE routing

## Serve

## Next steps

Coming in Part II:

LlamaIndex / LangChain:
- Generate synthetic datasets (query, source, answer)
- Add context to embeddings
- Better chunking logic
- Fine-tune embedding model
- Fine-tune base LLM (gpt-3.5 and OSS)

Later:
- Additional data sources
- Longer context lengths (RoPE)
- Keyword search with semantic (embedding) search
- Reranking with LLM after results from (faster) embedding search