# Retrieval Augmented Generation

LLMs excel at a wide range of tasks, but they struggle with queries specific to a particular business context. This is where Retrieval Augmented Generation (RAG) helps. RAG lets an LLM draw on your internal knowledge bases or customer support documents, improving its ability to answer domain-specific questions. Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial and legal analysis, and more.

This short notebook demonstrates how to create a simple RAG solution using the Anthropic documentation as the knowledge base.

It sets up a basic RAG system using an in-memory vector database and embeddings from [Voyage AI](https://www.voyageai.com/).

## Setup

Install the required libraries, including:

1) `anthropic` - to interact with Claude

2) `voyageai` - to generate high-quality embeddings

3) `pandas`, `numpy` - for data processing


You'll also need an API key for [Voyage AI](https://www.voyageai.com/), which provides the embeddings.

Optionally, provide an API key for the Anthropic Cloud Service, available from [Anthropic](https://www.anthropic.com/). If no key is provided, the code uses Amazon Bedrock.

Note: This code runs with the Claude 3 Haiku model unless you change it.

In [1]:
# Install the required Python packages
!pip install anthropic
!pip install voyageai
!pip install pandas
!pip install numpy
!pip install python-dotenv
# boto3 is imported below and is used for the Amazon Bedrock path
!pip install boto3



In [2]:
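from dotenv import load_dotenv
import boto3
# ClientError is caught in LlmFacade.invoke_aws_bedrock_llm
from botocore.exceptions import ClientError
import json
import os
import anthropic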

In [3]:
# At a minimum, this notebook requires a Voyage AI account (embeddings service)
# and the VOYAGE_API_KEY value set in the file that dotenv reads.
# See the following for details: https://pypi.org/project/python-dotenv/

# load_dotenv() loads the key-value pairs for the secret keys being used:

# 1/ "VOYAGE_API_KEY"
# 2/ "ANTHROPIC_API_KEY" - only if you use the service directly from Anthropic (rather than Amazon Bedrock)

load_dotenv()
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
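
Because `os.getenv` returns `None` for an unset variable, a missing `VOYAGE_API_KEY` may only surface later, once the Voyage client is used. A minimal sketch of an early check follows; `missing_keys` is a hypothetical helper, not part of this notebook or of python-dotenv.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to os.environ; passing a plain dict makes the check easy to try out.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# VOYAGE_API_KEY is required; ANTHROPIC_API_KEY is optional (Amazon Bedrock is the fallback).
missing = missing_keys(["VOYAGE_API_KEY"])
if missing:
    print(f"Missing required settings: {', '.join(missing)}")
```

Running this right after `load_dotenv()` reports a missing key before any embedding request is attempted.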

### Define a minimal in-memory vector DB class

In this example, we're using an in-memory vector DB, but for a production application, you may want to use a hosted solution. 

In [4]:
import os
import pickle
import json
import numpy as np
import voyageai

class VectorDB:
    def __init__(self, name, api_key=None):
        if api_key is None:
            api_key = os.getenv("VOYAGE_API_KEY")
        self.client = voyageai.Client(api_key=api_key)
        self.name = name
        self.embeddings = []
        self.metadata = []
        self.query_cache = {}
        self.db_path = f"./data/{name}/vector_db.pkl"

    def load_data(self, data):
        if self.embeddings and self.metadata:
            print("Vector database is already loaded. Skipping data loading.")
            return
        if os.path.exists(self.db_path):
            print("Loading vector database from disk.")
            self.load_db()
            return

        texts = [f"Heading: {item['chunk_heading']}\n\n Chunk Text:{item['text']}" for item in data]
        self._embed_and_store(texts, data)
        self.save_db()
        print("Vector database loaded and saved.")

    # TODO: Change this function to limit the text sent to the embedding model to a max of 256 words
    def _embed_and_store(self, texts, data):
        batch_size = 128
        result = [
            self.client.embed(
                texts[i : i + batch_size],
                model="voyage-2"
            ).embeddings
            for i in range(0, len(texts), batch_size)
        ]
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = data

    def search(self, query, k=5, similarity_threshold=0.75):
        if query in self.query_cache:
            query_embedding = self.query_cache[query]
        else:
            query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
            self.query_cache[query] = query_embedding

        if not self.embeddings:
            raise ValueError("No data loaded in the vector database.")

        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[::-1]
        top_examples = []
        
        for idx in top_indices:
            if similarities[idx] >= similarity_threshold:
                example = {
                    "metadata": self.metadata[idx],
                    "similarity": similarities[idx],
                }
                top_examples.append(example)
                
                if len(top_examples) >= k:
                    break
        return top_examples

    def save_db(self):
        data = {
            "embeddings": self.embeddings,
            "metadata": self.metadata,
            "query_cache": json.dumps(self.query_cache),
        }
        os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
        with open(self.db_path, "wb") as file:
            pickle.dump(data, file)

    def load_db(self):
        if not os.path.exists(self.db_path):
            raise ValueError("Vector database file not found. Use load_data to create a new database.")
        with open(self.db_path, "rb") as file:
            data = pickle.load(file)
        self.embeddings = data["embeddings"]
        self.metadata = data["metadata"]
        self.query_cache = json.loads(data["query_cache"])
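
The TODO on `_embed_and_store` asks for the text sent to the embedding model to be capped at 256 words. One way this could look is sketched below. `truncate_words` is a hypothetical helper, and splitting on whitespace is an assumed definition of a "word" (the embedding model itself counts tokens, not words).

```python
def truncate_words(text, max_words=256):
    """Return `text` limited to its first `max_words` whitespace-separated words.

    Text that is already short enough is returned unchanged. When truncation
    happens, the kept words are re-joined with single spaces, so original
    line breaks inside the kept part are not preserved.
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])


# A 300-word string is cut down to 256 words.
long_text = " ".join(f"word{i}" for i in range(300))
print(len(truncate_words(long_text).split()))
```

It could be applied in `load_data` when building `texts`, for example by wrapping `item['text']` in `truncate_words(...)`.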

## Level 1 - Basic RAG

To get started, we'll set up a basic RAG pipeline using a bare-bones approach, often called 'Naive RAG' in the industry. A basic RAG pipeline includes the following 3 steps:

1) Chunk documents by heading, so that each chunk contains only the content under one subheading

2) Embed each chunk

3) Use cosine similarity to retrieve the chunks most relevant to the query
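
Step 3 can be sketched with plain NumPy. `VectorDB.search` scores chunks with `np.dot`, which equals cosine similarity only when the embeddings have unit length; the sketch below normalizes explicitly so that it does not rely on that. `top_k_cosine` and the toy vectors are illustrative assumptions, not part of this notebook.

```python
import numpy as np


def top_k_cosine(doc_embeddings, query_embedding, k=3, threshold=0.7):
    """Return up to k (index, score) pairs with cosine similarity >= threshold.

    Results are ordered from most to least similar. All vectors are assumed
    to be non-zero, since each one is divided by its own length.
    """
    docs = np.asarray(doc_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)

    # Scale every vector to unit length; the dot product is then the cosine.
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)

    scores = docs @ query
    order = np.argsort(scores)[::-1]  # highest score first
    hits = [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]
    return hits[:k]


# Toy 2-D "embeddings": chunk 0 points the same way as the query,
# chunk 1 is partly aligned, chunk 2 is orthogonal and falls below the threshold.
toy_docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 2.0]]
hits = top_k_cosine(toy_docs, [2.0, 0.0], k=2, threshold=0.5)
print(hits)
```

The threshold is what allows a query to return no chunks at all, rather than always returning the k nearest ones.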

In [5]:
# Load the Anthropic documentation segments into a list of dictionaries
with open('data/anthropic_docs.json', 'r') as f:
    anthropic_docs = json.load(f)

In [6]:
len(anthropic_docs)

232

In [7]:
abbreviated_docs = anthropic_docs[:10]

In [8]:
abbreviated_docs

[{'chunk_link': 'https://docs.anthropic.com/en/docs/welcome#get-started',
  'chunk_heading': 'Get started',
  'text': 'Get started\n\n\nIf you’re new to Claude, start here to learn the essentials and make your first API call.\nIntro to ClaudeExplore Claude’s capabilities and development flow.QuickstartLearn how to make your first API call in minutes.Prompt LibraryExplore example prompts for inspiration.\nIntro to ClaudeExplore Claude’s capabilities and development flow.\n\nIntro to Claude\nExplore Claude’s capabilities and development flow.\nQuickstartLearn how to make your first API call in minutes.\n\nQuickstart\nLearn how to make your first API call in minutes.\nPrompt LibraryExplore example prompts for inspiration.\n\nPrompt Library\nExplore example prompts for inspiration.\n'},
 {'chunk_link': 'https://docs.anthropic.com/en/docs/welcome#models',
  'chunk_heading': 'Models',
  'text': 'Models\n\n\nClaude consists of a family of large language models that enable you to balance intel

In [9]:
# Initialize the VectorDB
db = VectorDB("anthropic_docs")
# Import the document segments into the vector database
db.load_data(abbreviated_docs)

Loading vector database from disk.


In [10]:
len(db.embeddings)

10

### Define a minimal LLM Facade class

This facade makes it easy to invoke the LLM through either Amazon Bedrock or Anthropic Cloud.
If a value for `anthropic_api_key` is set, Anthropic Cloud is used; otherwise, Amazon Bedrock is used.

In [11]:
LLM_MAX_TOKENS = 2500
LLM_TEMPERATURE = 0.01
BEDROCK_MODEL_ID = 'anthropic.claude-3-haiku-20240307-v1:0'


class LlmFacade:
    def __init__(self, anthropic_api_key=None):
        self.max_tokens = LLM_MAX_TOKENS
        self.temperature = LLM_TEMPERATURE
        # Use Anthropic Claude via Anthropic Cloud if the key is set
        # if not, set up to use Anthropic Claude via Bedrock
        self.aws_bedrock = True

        if anthropic_api_key:
            self.anthropic_client = anthropic.Anthropic(api_key=anthropic_api_key)
            self.aws_bedrock = False
            print("Configured to use: Anthropic Cloud Service")
        else:
            session = boto3.Session()
            region = session.region_name

            # Create the Bedrock runtime client (the model id is set in BEDROCK_MODEL_ID)
            self.bedrock_client = boto3.client(service_name='bedrock-runtime', region_name=region)
            print("Configured to use: AWS Bedrock Service")

    def invoke(self, prompt: str) -> str:
        # Route the prompt to whichever backend was configured in __init__
        if self.aws_bedrock:
            return self.invoke_aws_bedrock_llm(prompt)
        else:
            return self.invoke_anthropic_cloud_llm(prompt)

    def invoke_anthropic_cloud_llm(self, prompt: str) -> str:
        # The prompt is sent as a single user message
        response = self.anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=self.max_tokens,
            messages=[{"role": "user", "content": prompt}],
            temperature=self.temperature
        )
        return response.content[0].text

    def invoke_aws_bedrock_llm(self, prompt: str) -> str:
        messages = [{"role": "user", "content": [{"text": prompt}]}]

        inference_config = {
            "temperature": self.temperature,
            "maxTokens": self.max_tokens
        }
        converse_api_params = {
            "modelId": BEDROCK_MODEL_ID,
            "messages": messages,
            "inferenceConfig": inference_config
        }
        # Send the request to the Bedrock service to generate a response
        try:
            response = self.bedrock_client.converse(**converse_api_params)

            # Extract the generated text content from the response
            text_content = response['output']['message']['content'][0]['text']

            # Return the generated text content
            return text_content

        except ClientError as err:
            message = err.response['Error']['Message']
            print(f"A client error occurred: {message}")
        return "500: Request failed"

In [12]:
llm = LlmFacade(anthropic_api_key=anthropic_api_key)

Configured to use: Anthropic Cloud Service


In [14]:
llm.invoke("how fast does a swallow fly")

"The speed at which swallows fly can vary quite a bit, but here are some general details:\n\n- Common swallows (also known as barn swallows) typically fly at speeds between 12-24 mph (19-39 km/h) during normal flight.\n\n- When diving or chasing prey, swallows can reach top speeds of around 35-45 mph (56-72 km/h).\n\n- Migratory swallows can sustain flight speeds of 20-30 mph (32-48 km/h) over long distances during their seasonal migrations.\n\n- Factors like wind conditions, the swallow's flight path, and whether it's hunting or just cruising can all affect its flight speed.\n\n- Smaller swallow species like the bank swallow or tree swallow tend to be on the lower end of the speed range, while larger swallows like the barn swallow can reach the higher end.\n\nSo in general, swallows are quite agile and fast flyers, capable of bursts of speed when needed, but maintaining more moderate cruising speeds during normal flight. Their aerial acrobatics and speed make them excellent insect hun

In [16]:
def retrieve_base(query, db, similarity_threshold=0.7):
    results = db.search(query, k=3, similarity_threshold=similarity_threshold)
    context = ""
    for result in results:
        chunk = result['metadata']
        context += f"\n{chunk['text']}\n"
    return results, context

def answer_query_base(query, db, llm):
    documents, context = retrieve_base(query, db)
    prompt = f"""
    You have been tasked with helping us to answer the following query: 
    <query>
    {query}
    </query>
    You have access to the following documents which are meant to provide context as you answer the query:
    <documents>
    {context}
    </documents>
    Please remain faithful to the underlying context, and only deviate from it if you are 100% sure that you know the answer already. 
    Answer the question now, and avoid providing preamble such as 'Here is the answer', etc
    """
    return llm.invoke(prompt)
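
When no chunk clears the similarity threshold, `retrieve_base` returns an empty list and `answer_query_base` still sends a prompt with an empty `<documents>` block. One possible variation, which this notebook does not use, is to skip the model call in that case. In the sketch below, `answer_with_fallback`, `NO_CONTEXT_MESSAGE`, and the `fake_retrieve`/`fake_generate` callables are hypothetical stand-ins so that the example runs without API keys.

```python
NO_CONTEXT_MESSAGE = "No relevant documents were found for this query."


def answer_with_fallback(query, retrieve, generate, fallback=NO_CONTEXT_MESSAGE):
    """Call `generate` only when `retrieve` returns at least one result."""
    results, context = retrieve(query)
    if not results:
        # Nothing passed the threshold: return the canned message, no LLM call.
        return fallback
    return generate(query, context)


# Stand-in callables; in the notebook these would wrap retrieve_base and llm.invoke.
def fake_retrieve(query):
    if "billing" in query:
        return [{"similarity": 0.7}], "Help Center: account and billing questions."
    return [], ""


def fake_generate(query, context):
    return f"Answer to {query!r} using: {context}"


print(answer_with_fallback("i have a billing question", fake_retrieve, fake_generate))
print(answer_with_fallback("who's cat is that", fake_retrieve, fake_generate))
```

Whether to short-circuit like this or let the model explain that it lacks context is a design choice; the cells that follow show the second behaviour.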

In [17]:
example_question = ["i have a billing question", "what capabilities are there", "who's cat is that"]

In [18]:
i = 0
results, context = retrieve_base(example_question[i], db)
print("Question:", example_question[i])
results

Question: i have a billing question


[{'metadata': {'chunk_link': 'https://docs.anthropic.com/en/docs/welcome#support',
   'chunk_heading': 'Support',
   'text': 'Support\n\n\nHelp CenterFind answers to frequently asked account and billing questions.Service StatusCheck the status of Anthropic services.\nHelp CenterFind answers to frequently asked account and billing questions.\n\nHelp Center\nFind answers to frequently asked account and billing questions.\nService StatusCheck the status of Anthropic services.\n\nService Status\nCheck the status of Anthropic services.\nQuickstartxlinkedin\nQuickstart\nxlinkedin\nGet started Models Develop with Claude Key capabilities Support\nGet startedModelsDevelop with ClaudeKey capabilitiesSupport\n'},
  'similarity': 0.7012104891700733}]

In [19]:
i = 1
results, context = retrieve_base(example_question[i], db, 0.7)
print("Question:", example_question[i])
results

Question: what capabilities are there


[{'metadata': {'chunk_link': 'https://docs.anthropic.com/en/docs/welcome#key-capabilities',
   'chunk_heading': 'Key capabilities',
   'text': 'Key capabilities\n\n\nClaude can assist with many tasks that involve text, code, and images.\nText and code generationSummarize text, answer questions, extract data, translate text, and explain and generate code.VisionProcess and analyze visual input and generate text and code from images.\nText and code generationSummarize text, answer questions, extract data, translate text, and explain and generate code.\n\nText and code generation\nSummarize text, answer questions, extract data, translate text, and explain and generate code.\nVisionProcess and analyze visual input and generate text and code from images.\n\nVision\nProcess and analyze visual input and generate text and code from images.\n'},
  'similarity': 0.7397239715695028}]

In [20]:
i = 2
results, context = retrieve_base(example_question[i], db, 0.7)
print("Question:", example_question[i])
results

Question: who's cat is that


[]

In [21]:
i = 0
result = answer_query_base(example_question[i], db, llm)
print("Question:", example_question[i])
result

Question: i have a billing question


"I'm afraid I don't have enough information to fully answer your billing question. The documents provided give some general information about Anthropic's support and service status, but do not contain specific details about billing. To get a more complete answer, I would need additional details about the specific billing issue you are facing. Please feel free to provide more details about your billing question, and I'll do my best to assist you."

In [22]:
i = 1
result = answer_query_base(example_question[i], db, llm)
print("Question:", example_question[i])
result

Question: what capabilities are there


'The key capabilities of Claude include:\n\nText and code generation: Summarize text, answer questions, extract data, translate text, and explain and generate code.\n\nVision: Process and analyze visual input and generate text and code from images.'

In [23]:
i = 2
result = answer_query_base(example_question[i], db, llm)
print("Question:", example_question[i])
result

Question: who's cat is that


'Unfortunately, without any documents provided, I do not have enough context to determine whose cat is being referred to in the query "who\'s cat is that". I would need additional information or documents to be able to provide a reliable answer to this query.'