# Retrieval Augmented Generation (RAG) with Amazon Bedrock

Retrieval Augmented Generation (RAG) combines the power of large language models with information retrieval. Instead of relying solely on a model's internal knowledge, RAG fetches relevant information from external sources and uses it to generate more accurate, up-to-date responses.

RAG doesn't have to include a vectorDB; it can pull from various sources. However, for unstructured text, vectorDBs and semantic search are the most common approach.

In this lab, we'll implement RAG using a local vector database called ChromaDB and use LlamaIndex to chunk unstructured text and ingest it into our vectorDB.

#### About ChromaDB

ChromaDB is a lightweight vector database that runs locally (in memory or persisted to disk) and makes it easy to store and query embeddings. We'll use it to: (1) store document embeddings, (2) perform semantic searches, and (3) retrieve relevant context for our LLM. We'll discuss vectorDBs later in the lab.

**Will Local DBs Scale?** Running ChromaDB locally is perfect for learning and testing. For production applications, consider Amazon Bedrock Knowledge Bases, Pinecone, or other managed vector database services. These solutions offer the persistence, scalability, and performance features that real-world applications need.

#### About LlamaIndex
For chunking, we'll use LlamaIndex. There are many tools/frameworks for ingesting documents and implementing chunking strategies. We chose LlamaIndex because it offers a lot of advanced chunking options. It creates "nodes" that can be converted into different formats for ingestion.

When using these GenAI frameworks, you want to abstract the details of the framework away from your core code. This makes mixing and matching much easier. In the example below, we'll use LlamaIndex but we'll wrap the chunking logic in a class and normalize the output to a class that we create named RAGChunk. This way, we aren't too reliant on the framework.
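One way to picture this decoupling is a small interface that the core code depends on, with the framework hidden behind it. The sketch below is only an illustration, not the lab's implementation: `Chunk`, `Chunker`, `ParagraphChunker`, and `count_chunks` are hypothetical names, and the paragraph splitter stands in for whatever framework does the real chunking.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Protocol


@dataclass
class Chunk:
    """Framework-neutral chunk (hypothetical stand-in for the lab's RAGChunk)."""
    id_: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class Chunker(Protocol):
    """Anything with this method can be plugged into the core code."""
    def chunk_text(self, text: str) -> List[Chunk]: ...


class ParagraphChunker:
    """Toy implementation: one chunk per blank-line-separated paragraph."""
    def chunk_text(self, text: str) -> List[Chunk]:
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        return [Chunk(id_=f"para_{i}", text=p) for i, p in enumerate(paragraphs)]


def count_chunks(chunker: Chunker, text: str) -> int:
    """Core logic only knows about the Chunker interface, not the framework."""
    return len(chunker.chunk_text(text))


sample = "Espresso is concentrated coffee.\n\nCold brew is steeped in cold water."
print(count_chunks(ParagraphChunker(), sample))
```

Swapping in a different chunker then means writing another class with the same `chunk_text` method; `count_chunks` does not change.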


## Setup

Let's stand up our local ChromaDB and initialize Bedrock.

In [None]:
import chromadb
import boto3

# Initialize Chroma client from our persisted store
chroma_client = chromadb.PersistentClient(path="../../data/chroma")

session = boto3.Session()
bedrock = session.client(service_name='bedrock-runtime')

print("✅ Setup complete!")

#### Splitting text
To make text easier to search, we split it into smaller chunks so that we retrieve only the most relevant information. To do this, we need to create a RAG pipeline.
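To build intuition for what chunk size and chunk overlap mean, here is a deliberately naive word-based splitter. It is a sketch under simplifying assumptions: `chunk_words` is a hypothetical helper that counts whitespace-separated words, whereas LlamaIndex's `SentenceSplitter` measures size in tokens and tries to keep sentences intact.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 8, chunk_overlap: int = 2) -> List[str]:
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = "Cold brew is coffee made by steeping coarse coffee grounds in cold water for many hours"
word_chunks = chunk_words(text)
for i, chunk in enumerate(word_chunks):
    print(i, chunk)
```

Each chunk repeats the last two words of the previous one. That overlap is what keeps a thought that straddles a chunk boundary retrievable from either side.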

#### Setup RAG Pipeline
We'll wrap our LlamaIndex and Chroma code in our own class definitions to abstract the framework away from our core logic. That way, we can easily change our minds and pick a different DB, framework, and so on.


In [None]:
from typing import List, Dict, Any
from pydantic import BaseModel
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import BaseNode
from llama_index.core import SimpleDirectoryReader

import uuid

# Create a class to use instead of LlamaIndex nodes. This way we decouple our Chroma collections from LlamaIndex.
class RAGChunk(BaseModel):
    id_: str
    text: str
    metadata: Dict[str, Any] = {}


# This is a simple chunker that uses LlamaIndex's SentenceSplitter to chunk raw text
# It can easily be extended to support files or other data sources.
class TextChunker:
    def __init__(self, chunk_size: int = 256, chunk_overlap: int = 20):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    
    def chunk_text(self, text: str, metadata: Dict = None) -> List[RAGChunk]:
        """Chunk raw text directly"""
        metadata = metadata if metadata else {'source': 'raw_text'}
        # Create a document from the text
        document: Document = Document(text=text, metadata=metadata)
        # Split into chunks
        nodes: List[BaseNode] = self.splitter.get_nodes_from_documents([document])
        # Create a unique id for the chunk
        unique_id: str = str(uuid.uuid4())
        
        # Convert to RAGChunk objects
        chunks: List[RAGChunk] = []

        for i, node in enumerate(nodes):
            rag_chunk: RAGChunk = RAGChunk(
                id_=node.node_id or f"text_chunk_{i}_{unique_id}",
                text=node.text,
                metadata={ **node.metadata, **metadata }
            )

            chunks.append(rag_chunk)
        
        return chunks

### Setup VectorDB Wrapper
Next, we'll create a wrapper around ChromaDB and our retrieval results so that we can easily swap in a different DB later. To understand what's happening, we need to introduce a couple of concepts.

##### Embeddings
An embedding is essentially a numerical representation of the meaning of a chunk of text. Imagine you have a 2D space with (X,Y) coordinates. A question like "What is espresso?" might be represented in that space at the coordinate (1,1).

You also have two chunks of text, one relevant and one not:
1. "concentrated coffee beverage made by forcing hot water under high pressure through ground coffee beans"
2. "OpenSearch is an open-source database that can perform semantic search"

Chunk one might be represented in that 2D space as (1.5, 1.5), while chunk two might be represented as (4,4). Using these embeddings, we can find semantically similar chunks of information by finding the query's nearest neighbors.

The diagram below illustrates the concept:

<img src="../assets/semanticsearch.png" width="50%" height="auto">
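The same idea can be worked through in a few lines of code. This is a toy sketch that reuses the made-up 2D coordinates from the example; the labels are ours, and Euclidean distance is just one possible choice of distance measure.

```python
import math
from typing import Dict, Tuple

# Toy 2D "embeddings" using the coordinates from the example above
query: Tuple[float, float] = (1.0, 1.0)  # "What is espresso?"
chunk_points: Dict[str, Tuple[float, float]] = {
    "espresso definition": (1.5, 1.5),
    "OpenSearch description": (4.0, 4.0),
}

# Euclidean distance from the query to each chunk
distances = {name: math.dist(query, point) for name, point in chunk_points.items()}

# The nearest neighbor is the chunk with the smallest distance
nearest = min(distances, key=distances.get)

for name, d in distances.items():
    print(f"{name}: {d:.3f}")
print("Nearest chunk:", nearest)
```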

##### Embedding Models
To get these embeddings, we use an embedding model. It consumes information (in our case, text) and outputs an embedding (a vector). The two dimensions (X,Y) in the example above are far too few to capture enough meaning to be useful. Embedding models usually output vectors with 512 or more dimensions, and 1024+ is common.

**Note**: Embeddings are often called vectors because they have a magnitude and a direction from the origin. In our 2D example, that means each embedding has a direction and a magnitude measured from (0,0).
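Magnitude and direction can be computed directly. The sketch below uses hypothetical helpers (`magnitude`, `cosine_similarity`) on toy 2D points; cosine similarity compares direction only and ignores magnitude.

```python
import math
from typing import Tuple

Vector = Tuple[float, float]


def magnitude(v: Vector) -> float:
    """Length of the vector measured from the origin (0,0)."""
    return math.hypot(v[0], v[1])


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b: 1 means same direction, 0 means perpendicular."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (magnitude(a) * magnitude(b))


query: Vector = (1.0, 1.0)
far_point: Vector = (4.0, 4.0)
perpendicular: Vector = (1.0, -1.0)

print(magnitude(query))
print(cosine_similarity(query, far_point))
print(cosine_similarity(query, perpendicular))
```

Points such as (1,1) and (4,4) lie on the same line through the origin, so a direction-only measure treats them as identical even though they are far apart. This is why the distance measure a vector store uses matters.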

##### VectorDBs
VectorDBs work by finding the closest neighbors to your input. If we expanded the example above to 100k+ documents, comparing the query against every embedding one by one would become slow. VectorDBs are optimized for finding the nearest neighbors of an embedding.
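To see what a vectorDB saves us from, here is a brute-force search that compares the query with every stored vector. It is a sketch with made-up random data, and `brute_force_top_k` is a hypothetical helper. The work grows linearly with the number of vectors, which is why vector databases typically rely on approximate nearest neighbor indexes (Chroma uses an HNSW index) instead of scanning everything.

```python
import heapq
import math
import random
from typing import List, Sequence, Tuple


def brute_force_top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int = 3
) -> List[Tuple[float, int]]:
    """Score every stored vector and return the k closest as (distance, index) pairs."""
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


random.seed(0)
dims, n_vectors = 8, 10_000  # tiny compared with real embedding sizes
vectors = [[random.random() for _ in range(dims)] for _ in range(n_vectors)]
query = vectors[42]  # reuse a stored vector so the best match is known in advance

top = brute_force_top_k(query, vectors, k=3)
for distance, index in top:
    print(index, round(distance, 4))
```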

In [None]:
from pydantic import BaseModel
from typing import List, Dict, Any, Optional
from chromadb.api.types import EmbeddingFunction

# Create abstraction around ChromaDB so that we can easily swap out the DB or embedding model
class RetrievalResult(BaseModel):
    id: str
    document: str
    embedding: List[float]
    distance: float
    metadata: Dict = {}

class ChromaDBWrapperClient:
    def __init__(self, chroma_client, collection_name: str, embedding_function: Optional[EmbeddingFunction] = None):
        self.client = chroma_client
        self.collection_name = collection_name
        self.embedding_function = embedding_function
        
        # Create or get the collection
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=embedding_function
        )

    def add_chunks_to_collection(self, chunks: List[RAGChunk]):
        # Add the chunks to the collection
        self.collection.add(
            ids=[chunk.id_ for chunk in chunks],
            documents=[chunk.text for chunk in chunks],
            metadatas=[chunk.metadata for chunk in chunks]
        )

        print(f"✅ Added {len(chunks)} chunks to collection {self.collection_name}")
        
    def retrieve(self, query_text: str, n_results: int = 1) -> List[RetrievalResult]:
        # Query the collection
        results = self.collection.query(
            query_texts=[query_text],
            n_results=n_results,
            include=['embeddings', 'documents', 'metadatas', 'distances']
        )

        # Transform the results into RetrievalResult objects
        retrieval_results = []
        for i in range(len(results['ids'][0])):
            retrieval_results.append(RetrievalResult(
                id=results['ids'][0][i],
                document=results['documents'][0][i],
                embedding=results['embeddings'][0][i],
                distance=results['distances'][0][i],
                metadata=results['metadatas'][0][i] if results['metadatas'][0] else {}
            ))

        return retrieval_results

### Setup Complete!
We've set up everything we need to do RAG with a vectorDB. Notice that in the code above, we attach an embedding model directly to the collection (think of a collection as a table). It's valuable to tie the embedding model to the place where you run your search. Different embedding models have different "embedding spaces", so they're not interchangeable. For example, creating query embeddings with a Cohere model and then searching over embeddings created with Amazon's Titan embedding model won't work.
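One way to guard against mixing embedding spaces is to record which model produced a collection's vectors and check it before searching. The sketch below is hypothetical: `EmbeddingSpace` and `check_compatible` are not part of Chroma, the Cohere model ID is only an illustrative string, and the dimension values are assumptions chosen to show that matching dimensions alone do not make two spaces compatible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpace:
    """Hypothetical record of which model produced a set of vectors."""
    model_id: str
    dimensions: int


def check_compatible(collection_space: EmbeddingSpace, query_space: EmbeddingSpace) -> None:
    """Raise if the query was embedded in a different space than the stored vectors."""
    if collection_space.dimensions != query_space.dimensions:
        raise ValueError(
            f"Dimension mismatch: {collection_space.dimensions} vs {query_space.dimensions}"
        )
    if collection_space.model_id != query_space.model_id:
        raise ValueError(
            f"Model mismatch: collection uses {collection_space.model_id}, "
            f"query uses {query_space.model_id}"
        )


# Assumed values for illustration only
titan_space = EmbeddingSpace(model_id="amazon.titan-embed-text-v2:0", dimensions=1024)
cohere_space = EmbeddingSpace(model_id="cohere.embed-english-v3", dimensions=1024)

check_compatible(titan_space, titan_space)  # same space: passes silently

try:
    check_compatible(titan_space, cohere_space)
except ValueError as err:
    print(err)
```

Attaching the embedding function to the collection, as the wrapper above does, achieves the same goal with less ceremony because queries and documents always go through the same model.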

## Creating Our Knowledge
Next, let's create some chunks and load them into our vectorDB.

In [None]:
COFFEE_KNOWLEDGE = """
Espresso is a concentrated form of coffee served in small, strong shots. It is made by forcing hot water under pressure through finely-ground coffee beans. Espresso forms the base for many coffee drinks.

Cappuccino is an espresso-based drink that's traditionally prepared with steamed milk, and milk foam. A traditional Italian cappuccino is generally a single shot of espresso topped with equal parts steamed milk and milk foam.

Latte is a coffee drink made with espresso and steamed milk. The word comes from the Italian 'caffè e latte' meaning 'coffee and milk'. A typical latte is made with one or two shots of espresso, steamed milk and a small layer of milk foam on top.

Cold Brew is coffee made by steeping coarse coffee grounds in cold water for 12-24 hours. This method creates a smooth, less acidic taste compared to hot brewed coffee. Cold brew can be served over ice or heated up.
"""

# Create our chunker. We're making the chunk size extremely small here to demonstrate the point.
text_chunker: TextChunker = TextChunker(chunk_size=128, chunk_overlap=5)

# Create our rag chunks
rag_chunks: List[RAGChunk] = text_chunker.chunk_text(COFFEE_KNOWLEDGE)

# Check out our chunks
print(f'We just created {len(rag_chunks)} chunks!\n')

for i, chunk in enumerate(rag_chunks):
    print(f'Chunk {i}:\n {chunk.text}\n\n')

print("✅ Knowledge created!")

Pretty cool! We just created 2 chunks that our knowledge base can use to look up information.

Now we need to plug it into our vector DB. 

In [None]:
from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction

# Define our embedding model. In our case, we're using Titan Embed Text v2 embedding model from Bedrock
TITAN_TEXT_EMBED_V2_ID: str = 'amazon.titan-embed-text-v2:0'

# This is a handy function Chroma implemented for calling Bedrock. Let's use it!
# You need the session object to call bedrock and the model name to use.
embedding_function: AmazonBedrockEmbeddingFunction = AmazonBedrockEmbeddingFunction(
    session=session,
    model_name=TITAN_TEXT_EMBED_V2_ID
)

# Initialize our vector store. We'll use the same client we created earlier.
coffee_vector_store = ChromaDBWrapperClient(chroma_client, "coffee_knowledge", embedding_function)

# Add our chunks to the vector store
coffee_vector_store.add_chunks_to_collection(rag_chunks)

print("✅ RAG retrieval ready!")

## Testing the System
To test the system, we just need to retrieve chunks from our vectorDB and use them as context when calling Bedrock. Let's create a RAG helper function and a simple RAG prompt.

In [None]:
# Define our system prompt. This is the prompt that will be used to generate the response.
SYSTEM_PROMPT: str = """
You are a coffee expert. You are given a question and a context. 
Your job is to answer the question based ONLY on the context provided. 
Just answer the question, avoid saying "Based on the context provided" before answering.
If the context doesn't contain the answer, say "I don't know"
"""

# Define our RAG prompt template. This is the prompt that will be used to generate the response.
RAG_PROMPT_TEMPLATE: str = """
Using the context below, answer the question.

<context>
{context}
</context>

<question>
{question}
</question>

Remember, if the context doesn't contain the answer, say "I don't know".
"""

MODEL_ID: str = "us.anthropic.claude-3-5-haiku-20241022-v1:0"

def call_bedrock(prompt: str) -> str:
    # Create the message in Bedrock's required format
    user_message: Dict[str, Any] = { "role": "user","content": [{ "text": prompt}] }
    # Configure model parameters
    inference_config: Dict[str, Any] = {
        "temperature": .4,
        "maxTokens": 1000
    }

    # Send request to Claude 3.5 Haiku via Bedrock
    response: Dict[str, Any] = bedrock.converse(
        modelId=MODEL_ID,  # Claude 3.5 Haiku (see MODEL_ID above)
        messages=[user_message],
        system=[{"text": SYSTEM_PROMPT}],
        inferenceConfig=inference_config
    )

    # Get the model's text response
    return response['output']['message']['content'][0]['text']

# Helper function that runs the full RAG flow: retrieve context, build the prompt, call Bedrock
def do_rag(input_question: str) -> str:
    # Retrieve the context from the vector store
    retrieval_results: List[RetrievalResult] = coffee_vector_store.retrieve(input_question)
    # Format the context into a string
    context: str = "\n\n".join([result.document for result in retrieval_results])
    # Create the RAG prompt
    rag_prompt: str = RAG_PROMPT_TEMPLATE.format(question=input_question, context=context)
    # Call Bedrock with the RAG prompt
    return call_bedrock(rag_prompt)

Now let's ask questions using our knowledge base.

In [None]:
questions: List[str] = [
    "How is cold brew coffee made?",
    "What is pour over coffee?"
]

for question in questions:
    print(f"\nQuestion: {question}")
    # Retrieve context from the vector store and generate an answer
    response = do_rag(question)
    print("\nAnswer:")
    print(response)

Great! We were able to answer the first question. However, information about pour over coffee doesn't exist in our vectorDB, so the model says it doesn't know. Let's add some knowledge to the database.

## Adding New Knowledge

In [None]:
new_knowledge = """
Pour Over is a method of brewing coffee where hot water is poured over ground coffee in a filter. This method gives the brewer complete control over brewing time and water temperature, leading to a clean, flavorful cup of coffee.
"""

# Reuse the same text splitter and collection wrapper to add new knowledge
new_texts: List[RAGChunk] = text_chunker.chunk_text(new_knowledge)
coffee_vector_store.add_chunks_to_collection(new_texts)

# Test new knowledge
question: str = "How is pour over coffee made?"

print(f"\nQuestion: {question}")
response = do_rag(question)

print("\nAnswer:")
print(response)

## Exercise

Now it's your turn! Try creating a knowledge base about a topic you're interested in:

1. Write 3-4 paragraphs about your chosen topic
2. Process the text using the text splitter
3. Add the data using the Chroma wrapper
4. Test with different questions

Remember: The quality of the answers depends on the quality and relevance of the information in your knowledge base!