# RAG Overview

## Goals
In this unit, you will learn about the following:
- What RAG is and why it is important. 
- How RAG is commonly designed.
- What stages are involved in implementing RAG for our Ray QA engine.

## What is RAG?

Retrieval augmented generation (RAG) is a system design that combines the strengths of LLMs and information retrieval systems. It was first introduced by Lewis et al. in the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401). It has since been implemented in popular frameworks such as [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://langchain.com/). For a more general overview of RAG, we recommend our [Introduction to Retrieval Augmented Generation](https://learn.ray.io/llm-applications-and-ray-serve/intro-to-llm-applications-and-ray-serve/introduction-to-retrieval-augmented-generation.html) module.

## Why RAG?
RAG systems are designed to address shortcomings of LLMs. You can think of RAG as providing LLMs with a "contextual memory", much like a human using a search engine to look up information that gives context to a question.

More specifically, RAG systems enable us to:

- Reduce LLM hallucinations by providing context relevant to a user prompt.
- Provide clear attribution as to the source of the information used as context.
- Control the subset of information that a user has access to when using an LLM.
- Address inherent knowledge boundaries of LLMs by providing up-to-date information that is open to revision. 

## What does a RAG system design look like?

Without RAG, we start out with the following:
- The user query.
- A prompt that is tuned for the given model and domain.
- A model to generate a response.
- The generated output. 

See the diagram below for a visual representation of the system design without RAG.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/rag-bootcamp-mar-2024/without_rag.svg" alt="Without RAG" width="50%"/>

With RAG, we now have:

- The user query.
- A query encoder that encodes the user query.
- A document encoder that encodes documents.
- A retriever that takes the encoded query and fetches relevant documents from a store.
- An augmented prompt that includes the retrieved context.
- A model to generate a response.
- The generated output.

See the diagram below for a visual representation of the system design with RAG.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/rag-bootcamp-mar-2024/with_rag.svg" alt="With RAG" width="80%"/>


Therefore, to build a basic RAG system, we need to introduce the following steps:

1. Encoding our documents, commonly referred to as generating embeddings of our documents.
2. Storing the generated embeddings in a vector store.
3. Encoding our user query.
4. Retrieving relevant documents from our vector store given the encoded user query.
5. Augmenting the user prompt with the retrieved context.
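The five steps above can be sketched end to end with toy components. This is a minimal illustration, not the implementation used later in this unit: the bag-of-words `encode` function stands in for a real embedding model, the `vector_store` is a plain Python list, and the three documents are made-up examples.

```python
import math
from collections import Counter


def encode(text: str) -> Counter:
    """Toy encoder: a bag-of-words count vector (stand-in for an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Made-up documents, used only for illustration.
documents = [
    "ray serve deploys models behind an http endpoint",
    "ray data loads and transforms large datasets",
    "ray tune runs hyperparameter searches",
]

# Steps 1 and 2: encode the documents and store the embeddings.
vector_store = [(doc, encode(doc)) for doc in documents]


def retrieve(query: str, k: int = 1) -> list[str]:
    # Step 3: encode the query. Step 4: rank stored documents by similarity.
    query_vector = encode(query)
    ranked = sorted(
        vector_store, key=lambda item: cosine(query_vector, item[1]), reverse=True
    )
    return [doc for doc, _ in ranked[:k]]


def augment(query: str, context: list[str]) -> str:
    # Step 5: place the retrieved context in front of the user query.
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return f"Use the context to answer.\n\nContext:\n{context_block}\n\nQuestion: {query}"


query = "how do I deploy models with ray serve"
prompt = augment(query, retrieve(query))
print(prompt)
```

The augmented prompt is what would be passed to the model in place of the raw user query.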

Below is the same diagram as above, but with the RAG components highlighted.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/rag-bootcamp-mar-2024/with_rag_highlighted.svg" alt="With RAG Highlights" width="80%"/>


## What are the key stages in implementing RAG for our QA engine?

One way to break down RAG is to divide the implementation into three key stages:

- Stage 1: Indexing
    1. Loading the documents from a source like a website, API, or database.
    2. Processing the documents into "embeddable" document chunks.
    3. Encoding the document chunks into embedding vectors.
    4. Storing the document embedding vectors in a vector store.
- Stage 2: Retrieval
    1. Encoding the user query.
    2. Retrieving the most similar documents from the vector store given the encoded user query.
- Stage 3: Generation
    1. Augmenting the prompt with the provided context.
    2. Generating a response from the augmented prompt.
 
Stage 1 is a setup stage that needs to be performed only when new data is available. Stages 2 and 3 encompass the system in its operational state.
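As an illustration of the "embeddable chunks" step in Stage 1, here is a minimal sketch of a word-based splitter with overlapping windows. The function name `chunk_words` and its parameters are assumptions made for this example; production pipelines typically split on tokens or on document structure instead of whitespace.

```python
def chunk_words(text: str, chunk_size: int = 5, overlap: int = 2) -> list[str]:
    """Split text into word windows of `chunk_size` that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


print(chunk_words("a b c d e f g h", chunk_size=4, overlap=1))
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of storing some words twice.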


## Canopy Overview

Canopy is an open-source RAG framework and context engine built on top of the Pinecone vector database. Canopy enables you to quickly and easily experiment with and build applications using RAG. Start chatting with your documents or text data with a few simple commands.

Canopy takes on the heavy lifting for building RAG applications: from chunking and embedding your text data to chat history management, query optimization, context retrieval (including prompt engineering), and augmented generation.



To get started, we'll set up some environment variables:
1. The Pinecone API key
2. The Anyscale base URL: the endpoint hosting the models we'll use to generate embeddings and completions
3. The Anyscale API key

In [1]:
import os

# Replace the placeholders with your own keys; never commit real keys to a notebook.
os.environ["PINECONE_API_KEY"] = os.environ.get('PINECONE_API_KEY') or '<YOUR_PINECONE_API_KEY>'
os.environ["ANYSCALE_BASE_URL"] = 'https://api.endpoints.anyscale.com/v1'
os.environ["ANYSCALE_API_KEY"] = os.environ.get('ANYSCALE_API_KEY') or '<YOUR_ANYSCALE_API_KEY>'
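A stricter alternative to a fallback value is to fail fast when a variable is missing. The sketch below is one way to do that; the helper name `require_env` is an assumption made for this example and is not part of Canopy, Pinecone, or Anyscale.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value
```

Calling `require_env("PINECONE_API_KEY")` at the top of the notebook surfaces a missing key immediately, before any request is sent.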

Next, we'll use Pandas to read a parquet file in which we previously saved the Pinecone documentation website, and then sample it:

In [47]:
import json
import os
import shutil
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import joblib
import psutil
import ray
import torch
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec, PodSpec
from bs4 import BeautifulSoup
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

warnings.filterwarnings('ignore')



DATA_DIR = Path("/mnt/cluster_storage/")
shutil.copytree(Path("../data/"), DATA_DIR, dirs_exist_ok=True)

# Initialize an empty list to store the JSON objects
data = []

dest_dir = DATA_DIR / "simplest_pipeline"

# Initialize an empty list to store the documents
dataset = []

# Specify the path to your JSONL file
file_path = dest_dir / "air.jsonl"

# Open the JSONL file and parse it line by line
try:
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            # Convert each line to a JSON object
            json_obj = json.loads(line)
            print(line)
            
            # Extract the desired field (e.g., 'text') and append it to the dataset list
            # Replace 'text' with the actual field name you're interested in
            if 'text' in json_obj:  # Make sure the field exists
                dataset.append(json_obj['text'])
except FileNotFoundError:
    print(f"File not found: {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

# Display the first few items in the dataset
dataset[:2]

def chunk_fn(doc):
    return doc.split(" ")

chunks = []
for doc in dataset:
    chunks.extend(chunk_fn(doc))
chunks

word_to_vec = {
    "this": [0.1, 0.2],
    "is": [0.3, 0.4],
    "a": [0.5, 0.6],
    "document": [0.7, 0.8],
    "another": [0.9, 1.0],
}
word_to_vec["<UNK>"] = [0.0, 0.0]


def embed_model(word):
    return word_to_vec.get(word, word_to_vec["<UNK>"])

embeddings = [embed_model(chunk) for chunk in chunks]
embeddings




[]


[]

### Initialize a Tokenizer
Many of Canopy's components use _tokenization_, a process that splits text into tokens: basic units of text (such as words or sub-words) that are used for processing. Canopy therefore uses a singleton `Tokenizer` object, which needs to be initialized once.

In [3]:
from canopy.tokenizer import Tokenizer
Tokenizer.initialize()

Let's see how this tokenizer works:

In [4]:
tokenizer = Tokenizer()
tokenizer.tokenize("Hello world!")

['Hello', ' world', '!']
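To build intuition for the output above, note that a token can carry its leading space. The regex below is a toy stand-in that reproduces this behavior on simple English text; it is not Canopy's tokenizer, and the names `toy_tokenize` and `count_tokens` are assumptions made for this sketch. Real tokenizers are model-specific and often split words into sub-words.

```python
import re

# An optional leading space followed by word characters, or one punctuation mark.
TOKEN_PATTERN = re.compile(r" ?\w+|[^\w\s]")


def toy_tokenize(text: str) -> list[str]:
    """Split text into word-level tokens, keeping the space that precedes a word."""
    return TOKEN_PATTERN.findall(text)


def count_tokens(text: str) -> int:
    """Token count, the quantity that token budgets are measured in."""
    return len(toy_tokenize(text))


print(toy_tokenize("Hello world!"))
print(count_tokens("Hello world!"))
```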

### Creating a KnowledgeBase to store our data for search
The `KnowledgeBase` object is responsible for storing and indexing textual documents.

Once documents are indexed, the `KnowledgeBase` can be queried with a new unseen text passage, for which the most relevant document chunks are retrieved.

The `KnowledgeBase` holds a connection to a Pinecone index and provides a simple API to insert, delete and search textual documents.

The `KnowledgeBase`'s `upsert()` operation is used to index new documents or update already stored documents. The upsert process splits each document's text into smaller chunks, transforms these chunks into vector embeddings, then upserts those vectors to the underlying Pinecone index. At query time, the `KnowledgeBase` transforms the query text into a vector in the same manner, then queries the underlying Pinecone index to retrieve the `top-k` most closely matched document chunks.
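The upsert and top-k semantics described above can be illustrated with a small in-memory index. This is a sketch under simplifying assumptions: `TinyVectorIndex` is a hypothetical class written for this example, it does not mirror the Pinecone or Canopy APIs, and it scans every vector where a real index would use approximate nearest-neighbor search.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


class TinyVectorIndex:
    """In-memory stand-in for a vector index."""

    def __init__(self):
        self._vectors = {}

    def __len__(self):
        return len(self._vectors)

    def upsert(self, items):
        # Insert new ids and overwrite the vectors of ids that already exist.
        for item_id, vector in items:
            self._vectors[item_id] = vector

    def query(self, vector, top_k=2):
        # Score every stored vector and keep the top_k most similar ids.
        scored = ((cosine(vector, v), item_id) for item_id, v in self._vectors.items())
        return [item_id for _, item_id in heapq.nlargest(top_k, scored)]


index = TinyVectorIndex()
index.upsert([("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])])
first = index.query([1.0, 0.1], top_k=2)

# Upserting an existing id replaces its vector instead of adding a new entry.
index.upsert([("b", [1.0, 0.1])])
second = index.query([1.0, 0.1], top_k=2)
print(first, second, len(index))
```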

To make the `KnowledgeBase` work with the Anyscale endpoint, we first have to define an `AnyscaleRecordEncoder`:

In [5]:
from canopy.knowledge_base.record_encoder import AnyscaleRecordEncoder

anyscale_record_encoder = AnyscaleRecordEncoder(
    api_key=os.environ["ANYSCALE_API_KEY"],
    base_url=os.environ["ANYSCALE_BASE_URL"],
    batch_size=30,
)

Next, we create a `KnowledgeBase` with our desired index name (make sure to use a unique string, such as your name):



In [7]:
from canopy.knowledge_base import KnowledgeBase

INDEX_NAME = "shanker-index" # Set the index name here

kb = KnowledgeBase(index_name=INDEX_NAME, record_encoder=anyscale_record_encoder)

During the one-time setup of a new Canopy service, an underlying Pinecone index needs to be created. If you have already created a Canopy-enabled Pinecone index, you can skip this step.

Note: Since Canopy uses a dedicated data schema, it is not recommended to use a pre-existing Pinecone index that wasn't created by Canopy's `create_canopy_index()` method.

In [8]:
from canopy.knowledge_base import list_canopy_indexes
if not any(name.endswith(INDEX_NAME) for name in list_canopy_indexes()):
    kb.create_canopy_index()

You can see the index created in Pinecone's [console](https://app.pinecone.io/).

Next, we'll connect to the created `KnowledgeBase`.

In [9]:
kb.connect()

### Upsert data to our KnowledgeBase
First, we need to convert our dataset to a list of `Document` objects.

Each document object can hold `id`, `text`, `source` and `metadata`. For example:

In [10]:
from canopy.models.data_models import Document

example_docs = [
    Document(
        id="1",
        text="This is text for example",
        source="https://url.com",
    ),
    Document(
        id="2",
        text="this is another text",
        source="https://another-url.com",
        metadata={"my-key": "my-value"},
    ),
]

The data in our example dataset is already provided in this schema, so we can simply iterate over it and instantiate Document objects:



In [27]:
# `data` must be a pandas DataFrame whose columns match the Document fields
documents = [Document(**row) for _, row in data.iterrows()]

Now we are ready to upsert our data, with only a single command:



In [12]:
from tqdm.auto import tqdm

batch_size = 10

kb.upsert(documents, batch_size=batch_size, show_progress_bar=True)

Upserted vectors:   0%|          | 0/571 [00:00<?, ?it/s]

Internally, the `KnowledgeBase` handles all the processing needed to index the documents. Each document's text is chunked into smaller pieces and encoded into vector embeddings that can then be upserted directly to Pinecone. Later in this notebook we'll learn how to tune and customize this process.
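The effect of a `batch_size` argument can be pictured as slicing the input into fixed-size groups that are processed one after another. The helper below is a generic sketch of that idea, not Canopy's internal code; the name `batched` is an assumption made for this example.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# 25 placeholder items in batches of 10 give batches of 10, 10, and 5.
sizes = [len(batch) for batch in batched(range(25), 10)]
print(sizes)
```

Smaller batches keep each request small, while larger batches reduce the number of round trips.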



### Query the KnowledgeBase
Now we can query the knowledge base. The `KnowledgeBase` uses its default parameters, such as `top_k`, to execute the query:

In [13]:
def print_query_results(results):
    for query_results in results:
        print('query: ' + query_results.query + '\n')
        for document in query_results.documents:
            print('document: ' + document.text.replace("\n", "\\n"))
            print("title: " + document.metadata["title"])
            print('source: ' + document.source)
            print(f"score: {document.score}\n")

Let's use this function to query the term `"p1 pod capacity"`:

In [14]:
from canopy.models.data_models import Query
results = kb.query([Query(text="p1 pod capacity")])

print_query_results(results)

query: p1 pod capacity

document: ### s1 pods\n\n\nThese storage-optimized pods provide large storage capacity and lower overall costs with slightly higher query latencies than p1 pods. They are ideal for very large indexes with moderate or relaxed latency requirements.\n\n\nEach s1 pod has enough capacity for around 5M vectors of 768 dimensions.\n\n\n### p1 pods\n\n\nThese performance-optimized pods provide very low query latencies, but hold fewer vectors per pod than s1 pods. They are ideal for applications with low latency requirements (<100ms).\n\n\nEach p1 pod has enough capacity for around 1M vectors of 768 dimensions.
title: indexes
source: https://docs.pinecone.io/docs/indexes
score: 0.91904277

document: ## Pod storage capacity\n\n\nEach **p1** pod has enough capacity for 1M vectors with 768 dimensions.\n\n\nEach **s1** pod has enough capacity for 5M vectors with 768 dimensions.\n\n\n## Metadata\n\n\nMax metadata size per vector is 40 KB.\n\n\nNull metadata values are not supp

Next, let's limit the source by using a `metadata_filter`:

In [15]:
from canopy.models.data_models import Query
results = kb.query([Query(text="p1 pod capacity",
                          metadata_filter={"source": "https://docs.pinecone.io/docs/limits"},
                          top_k=2)])

print_query_results(results)

query: p1 pod capacity

document: ## Pod storage capacity\n\n\nEach **p1** pod has enough capacity for 1M vectors with 768 dimensions.\n\n\nEach **s1** pod has enough capacity for 5M vectors with 768 dimensions.\n\n\n## Metadata\n\n\nMax metadata size per vector is 40 KB.\n\n\nNull metadata values are not supported. Instead of setting a key to hold a null value, we recommend you remove that key from the metadata payload.\n\n\nMetadata with high cardinality, such as a unique value for every vector in a large index, uses more memory than expected and can cause the pods to become full.
title: limits
source: https://docs.pinecone.io/docs/limits
score: 0.9050136

document: # Limits\n\n[Suggest Edits](/edit/limits)This is a summary of current Pinecone limitations. For many of these, there is a workaround or we're working on increasing the limits.\n\n\n## Upserts\n\n\nMax vector dimensionality is 20,000.\n\n\nMax size for an upsert request is 2MB. Recommended upsert limit is 100 vectors per r
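Conceptually, a metadata filter restricts the candidate chunks to those whose metadata matches every key in the filter, and only those candidates compete for the `top_k` slots. The sketch below shows that logic on plain dictionaries. It is a simplification with assumed names (`matches`, `filtered_top_k`) and made-up scores, and it supports exact-match filters only.

```python
def matches(metadata: dict, metadata_filter: dict) -> bool:
    """True when every key in the filter has the same value in the metadata."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


def filtered_top_k(scored_chunks: list[dict], metadata_filter: dict, top_k: int) -> list[dict]:
    """Keep chunks that match the filter, then return the top_k by score."""
    candidates = [c for c in scored_chunks if matches(c["metadata"], metadata_filter)]
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:top_k]


# Made-up chunks and scores, used only for illustration.
scored_chunks = [
    {"id": "indexes-1", "score": 0.92, "metadata": {"source": "docs/indexes"}},
    {"id": "limits-1", "score": 0.91, "metadata": {"source": "docs/limits"}},
    {"id": "limits-2", "score": 0.80, "metadata": {"source": "docs/limits"}},
    {"id": "limits-3", "score": 0.75, "metadata": {"source": "docs/limits"}},
]

hits = filtered_top_k(scored_chunks, {"source": "docs/limits"}, top_k=2)
print([hit["id"] for hit in hits])
```

The highest-scoring chunk overall is excluded because its source does not match, which mirrors the change between the two query results above.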


### Query the Context Engine
`ContextEngine` is an object responsible for retrieving the most relevant context for a given query and token budget.

While `KnowledgeBase` retrieves the full `top-k` structured documents for each query, including all the metadata related to them, the context engine is in charge of transforming this information into a "prompt ready" context that can later be fed to an LLM. To achieve this, the context engine holds a `ContextBuilder` object that takes query results from the knowledge base and returns a `Context` object. The `ContextEngine`'s default behavior is to use a `StuffingContextBuilder`, which simply stacks retrieved document chunks in a JSON-like manner, hard limiting by the number of chunks that fit the `max_context_tokens` budget. More complex behaviors can be achieved by providing a custom `ContextBuilder` class.

In [16]:
from canopy.context_engine import ContextEngine
context_engine = ContextEngine(kb)

In [17]:
import json

result = context_engine.query([Query(text="capacity of p1 pods", top_k=5)], max_context_tokens=512)

print(result.to_text(indent=2))
print(f"\n# tokens in context returned: {result.num_tokens}")

[
  {
    "query": "capacity of p1 pods",
    "snippets": [
      {
        "source": "https://docs.pinecone.io/docs/indexes",
        "text": "### s1 pods\n\n\nThese storage-optimized pods provide large storage capacity and lower overall costs with slightly higher query latencies than p1 pods. They are ideal for very large indexes with moderate or relaxed latency requirements.\n\n\nEach s1 pod has enough capacity for around 5M vectors of 768 dimensions.\n\n\n### p1 pods\n\n\nThese performance-optimized pods provide very low query latencies, but hold fewer vectors per pod than s1 pods. They are ideal for applications with low latency requirements (<100ms).\n\n\nEach p1 pod has enough capacity for around 1M vectors of 768 dimensions."
      },
      {
        "source": "https://docs.pinecone.io/docs/limits",
        "text": "## Pod storage capacity\n\n\nEach **p1** pod has enough capacity for 1M vectors with 768 dimensions.\n\n\nEach **s1** pod has enough capacity for 5M vectors with 76

As you can see above, although we set `top_k=5`, the context engine retrieved only 3 results in order to satisfy the 512-token limit. Also, the documents in the context contain only the text and source, without the additional metadata that the LLM does not necessarily need.
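The budget behavior can be sketched as a greedy loop: add ranked chunks in order until the next one would exceed the token budget. This is a simplified illustration, not Canopy's `StuffingContextBuilder`. The function name `stuff_context` is an assumption, and tokens are approximated by whitespace-separated words.

```python
def stuff_context(chunks, max_context_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep ranked chunks, in order, while they fit the token budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_context_tokens:
            # Hard limit: stop at the first chunk that does not fit.
            break
        selected.append(chunk)
        used += cost
    return selected, used


# Placeholder chunks, ordered from most to least relevant.
ranked_chunks = ["a b c", "d e f g", "h i", "j"]
selected, used = stuff_context(ranked_chunks, max_context_tokens=8)
print(selected, used)
```

Stopping at the first chunk that does not fit preserves the relevance order of the context, even though a later, shorter chunk might still have fit.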



### Knowledgeable chat engine
Now we are ready to start chatting with our data!

Canopy's `ChatEngine` is a one-stop-shop RAG-infused chatbot. The `ChatEngine` wraps an underlying LLM such as OpenAI's GPT-4, enhancing it by providing relevant context from the user's knowledge base. It also automatically phrases search queries out of the chat history and sends them to the knowledge base.

Again, to allow the `ChatEngine` to work with the `AnyscaleLLM`, we'll have to initialize it first:

In [19]:
from canopy.chat_engine import ChatEngine
from canopy.llm.anyscale import AnyscaleLLM
from canopy.chat_engine.query_generator import InstructionQueryGenerator

anyscale_llm = AnyscaleLLM(
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
    api_key=os.environ["ANYSCALE_API_KEY"],
    base_url=os.environ["ANYSCALE_BASE_URL"],
)

chat_engine = ChatEngine(
    context_engine,
    query_builder=InstructionQueryGenerator(
        llm=anyscale_llm,
    ),
    llm=anyscale_llm,
)

Next, we'll define a `chat` function:

In [20]:
from typing import Tuple
from canopy.models.data_models import Messages, UserMessage, AssistantMessage

def chat(new_message: str, history: Messages) -> Tuple[str, Messages]:
    messages = history + [UserMessage(content=new_message)]
    response = chat_engine.chat(messages)
    assistant_response = response.choices[0].message.content
    return assistant_response, messages + [AssistantMessage(content=assistant_response)]

Let's test the chat out:

In [21]:
from IPython.display import display, Markdown

history = []
response, history = chat("What is the capacity of p1 pods?", history)
display(Markdown(response))

 Each p1 pod has enough capacity for around 1 million vectors of 768 dimensions.

Source: https://docs.pinecone.io/docs/limits

Let's test out the chat's ability to look at the chat history:

In [23]:
response, history = chat("And for what latency requirements does it fit?", history)
display(Markdown(response))



 I apologize for the repeated question. To answer your query, P1 pods fit applications with low latency requirements, specifically less than 100ms.

Source: <https://docs.pinecone.io/docs/indexes>
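The way the `chat` function threads history through successive calls can be exercised without any API keys by swapping the engine for a stub. The sketch below uses assumed stand-ins: a `Message` dataclass instead of Canopy's message models, and a hypothetical `stub_llm` that only reports how many user turns it received.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Message:
    role: str
    content: str


def make_chat(generate: Callable[[List[Message]], str]):
    """Build a chat function around any callable that maps messages to a reply."""

    def chat(new_message: str, history: List[Message]) -> Tuple[str, List[Message]]:
        messages = history + [Message("user", new_message)]
        reply = generate(messages)
        # Return a new history list so the caller's original list is not mutated.
        return reply, messages + [Message("assistant", reply)]

    return chat


def stub_llm(messages: List[Message]) -> str:
    """Stand-in for an LLM: reports how many user turns it has seen."""
    user_turns = sum(message.role == "user" for message in messages)
    return f"reply to user turn {user_turns}"


chat = make_chat(stub_llm)
history: List[Message] = []
first_reply, history = chat("What is the capacity of p1 pods?", history)
second_reply, history = chat("And for what latency requirements does it fit?", history)
print(first_reply, "|", second_reply, "|", len(history))
```

Because the full history is passed on every call, the second question can be interpreted in light of the first, which is what lets a follow-up such as "it" resolve to p1 pods.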