# Trace with Langfuse - Your Retrieval Augmented Question & Answering with Amazon Bedrock using LangChain

### Context
We integrate a tracing mechanism into an implementation of the Retrieval Augmented Generation (RAG) pattern. RAG retrieves data from outside the language model (non-parametric) and augments the prompt by adding the relevant retrieved data as context. The implementation uses the documents to answer the questions. We use Langfuse to trace the interactions and gain deeper insight into how the system works.

### Challenges
- LLM traceability goes beyond tracing requests for latency and processing bottlenecks. It means understanding the prompt and the input context, and refining them as required. The challenge is finding a methodical way to surface these insights.

### Understand the RAG working first

#### Prepare documents
Before questions can be answered, the documents must be processed and stored in a document store index:
- Load the documents
- Process and split them into smaller chunks
- Create a numerical vector representation of each chunk using Amazon Bedrock Titan Embeddings model
- Create an index using the chunks and the corresponding embeddings
#### Ask question
Once the document index is prepared, you are ready to ask questions, and relevant documents are fetched based on the question being asked. The following steps are executed:
- Create an embedding of the input question
- Compare the question embedding with the embeddings in the index
- Fetch the (top N) relevant document chunks
- Add those chunks as part of the context in the prompt
- Send the prompt to the model under Amazon Bedrock
- Get the contextual answer based on the documents retrieved
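
The comparison and top-N steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not what FAISS or Titan Embeddings do internally: the three-dimensional vectors and chunk texts are made up, cosine similarity is just one possible similarity measure, and `top_n_chunks` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_chunks(question_vec, index, n=2):
    """Return the n chunk texts whose embeddings are most similar to the question."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(question_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:n]]

# Made-up "index": (chunk text, embedding) pairs.
toy_index = [
    ("chunk about OID instruments", [0.9, 0.1, 0.0]),
    ("chunk about cash payments", [0.1, 0.9, 0.0]),
    ("chunk about employer taxes", [0.0, 0.2, 0.9]),
]
question_embedding = [0.8, 0.3, 0.0]  # made-up embedding of the question

relevant_chunks = top_n_chunks(question_embedding, toy_index, n=2)
print(relevant_chunks)
```

In the real notebook the embeddings have many more dimensions and the vector store performs this ranking for us.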

#### Dataset
We are using the documents from IRS. These documents explain topics such as:
- Original Issue Discount (OID) Instruments
- Reporting Cash Payments of Over $10,000 to IRS
- Employer's Tax Guide

#### Who is interacting with our system?
A layperson who doesn't understand how the IRS works or whether certain actions have implications. The model will try to answer from the documents it ingested.

## Implementation
We are using the LangChain framework. Langfuse integrates with LangChain, so we will be able to trace the LLM interactions.

- **LLM (Large Language Model)**: Anthropic Claude 3 Sonnet available through Amazon Bedrock, used to understand the document chunks and provide an answer.
- **Embeddings Model**: Amazon Titan Embeddings available through Amazon Bedrock, used to generate embeddings (numerical representation) of the textual documents.
- **Document Loader**: PDF loader available through LangChain, which loads the documents from a source. We are loading the sample files from a local path. This could easily be replaced with a loader that reads documents from enterprise internal systems.
- **Vector Store**: FAISS available through LangChain. We are using this in-memory vector store to hold both the embeddings and the documents. In an enterprise context this could be replaced with a persistent store such as Amazon OpenSearch Service, RDS Postgres with pgvector, ChromaDB, Pinecone, or Milvus.
- **Index**: VectorIndex. The index compares the input embedding with the document embeddings to find relevant documents.
- **Wrapper**: wraps the index, vector store, embeddings model, and the LLM to abstract the logic away from the user.

In [1]:
%pip install langchain pypdf==4.1.0 langchain-community langchain-core faiss-cpu==1.8.0 tiktoken==0.6.0 sqlalchemy==2.0.28 langfuse boto3
%pip install -U langchain-aws

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
# set credentials - for langchain and for langfuse
import os
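# os.system('export ...') only affects a short-lived child shell, so set the variable in this process instead
os.environ["AWS_PROFILE"] = "default"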
os.environ["LANGFUSE_PUBLIC_KEY"] = 'pk-lf-c8ec60a4-3f7e-4e65-8eda-09e76f796b3f'
os.environ["LANGFUSE_SECRET_KEY"] = 'sk-lf-0ffdfee6-4e88-4110-85ef-b6e153382c81'
os.environ["LANGFUSE_HOST"] = 'http://localhost:3000'

In [3]:
# create a print function
import warnings
import sys
import textwrap
import os
from typing import Optional
from io import StringIO
import boto3
from botocore.config import Config

warnings.filterwarnings('ignore')

def print_ww(*args, width: int = 100, **kwargs):
    """Like print(), but wraps output to `width` characters (default 100)"""
    buffer = StringIO()
    try:
        _stdout = sys.stdout
        sys.stdout = buffer
        print(*args, **kwargs)
        output = buffer.getvalue()
    finally:
        sys.stdout = _stdout
    for line in output.splitlines():
        print("\n".join(textwrap.wrap(line, width=width)))
        

## Initialization

We are using Anthropic Claude for text generation and Amazon Titan for text embedding.
**Note:** 
Adding the model definition in Langfuse helps with cost and usage tracking.
We have added 'anthropic.claude-3-sonnet-20240229-v1:0' as a Bedrock model definition for cost and usage tracking in Langfuse.
If time is available, we will show how to add a model definition for 'amazon.titan-text-express-v1'.
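
As a rough illustration of why a model definition matters for cost tracking: once per-token prices are known for a model, the cost of a call follows from its token counts. The sketch below shows only that arithmetic. `estimate_cost` is a hypothetical helper, and the prices are placeholders for illustration, not authoritative Amazon Bedrock pricing.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost in USD for one call, given token counts and per-1,000-token prices."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost

# Placeholder token counts and prices (assumptions, check the current pricing page).
example_cost = estimate_cost(
    input_tokens=1200,
    output_tokens=300,
    input_price_per_1k=0.003,
    output_price_per_1k=0.015,
)
print(f"Estimated cost: ${example_cost:.4f}")
```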


In [4]:
import boto3
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_aws import ChatBedrock
from langchain_aws import BedrockEmbeddings

bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
)

model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

model_kwargs =  { 
    "max_tokens": 2048,
    "temperature": 0.0,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["\n\nHuman"],
}

llm = ChatBedrock(
    client=bedrock_runtime,
    model_id=model_id,
    model_kwargs=model_kwargs,
)
# We will be using the Titan Embeddings Model to generate our embeddings.
# BedrockEmbeddings is already imported from langchain_aws above.
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=bedrock_runtime)



## Data Preparation
Download some of the files to build our document store. For this example we will be using public IRS documents from [here](https://www.irs.gov/publications).

In [5]:
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p1544.pdf",
    "https://www.irs.gov/pub/irs-pdf/p15.pdf",
    "https://www.irs.gov/pub/irs-pdf/p1212.pdf",
]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

After downloading, we can load the documents with the help of [PyPDFDirectoryLoader, available through LangChain](https://python.langchain.com/en/latest/reference/modules/document_loaders.html), and split them into smaller chunks.

Note: Each retrieved chunk should be large enough to contain the information needed to answer a question, but small enough to fit into the LLM prompt. The embeddings model also limits its input to 8192 tokens, which roughly translates to ~32,000 characters. For this use case we create chunks of roughly 1000 characters with an overlap of 100 characters using [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).
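
To make the chunk size and overlap concrete, here is a minimal sketch of fixed-window splitting with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works, since that splitter also tries to break on separators such as paragraphs and sentences. `split_with_overlap` is a hypothetical helper and the sample text is synthetic.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Synthetic 2,500-character text standing in for a PDF page.
sample_text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample_text)

print([len(chunk) for chunk in chunks])
# The last 100 characters of one chunk reappear at the start of the next.
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap keeps a sentence that straddles a chunk boundary available in both neighbouring chunks, at the cost of storing some text twice.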

#### Next steps: iterate through the documents and create the vector store (skip this portion as needed)

In [6]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.document_loaders.pdf import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()
# In our testing, character-based splitting worked better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of roughly 1000 characters with a 100-character overlap
    chunk_size = 1000,
    chunk_overlap  = 100,
)
docs = text_splitter.split_documents(documents)

In [None]:
avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)
print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(docs)} documents, compared to the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')

The 3 PDF documents have been split into roughly 500 smaller chunks.

Now we can see what a sample embedding looks like for one of those chunks.

In [None]:
try:
    
    sample_embedding = np.array(bedrock_embeddings.embed_query(docs[0].page_content))
    print("Sample embedding of a document chunk: ", sample_embedding)
    print("Size of the embedding: ", sample_embedding.shape)

except ValueError as error:
    if  "AccessDeniedException" in str(error):
        print(f"\x1b[41m{error}\
        \nTo troubleshoot this issue please refer to the following resources.\
         \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
         \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\x1b[0m\n")      
        class StopExecution(ValueError):
            def _render_traceback_(self):
                pass
        raise StopExecution        
    else:
        raise error

### [Continue ...] Use VectorStoreIndexWrapper to create vector store

Following a similar pattern, embeddings can be generated for the entire corpus and stored in a vector store.

This can be done easily using the [FAISS](https://github.com/facebookresearch/faiss) implementation inside [LangChain](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html), which takes the embeddings model and the documents as input and creates the entire vector store. Using the index wrapper, we can abstract away most of the heavy lifting, such as creating the prompt, getting the embedding of the query, fetching the relevant documents, and calling the LLM. [VectorStoreIndexWrapper](https://python.langchain.com/en/latest/modules/indexes/getting_started.html#one-line-index-creation) helps us with that.

**⚠️⚠️⚠️ NOTE: it might take a few minutes to run the following cell ⚠️⚠️⚠️**

In [20]:
from langchain.chains.question_answering import load_qa_chain
from langchain_community.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langfuse.decorators import langfuse_context, observe



@observe(as_type="generation")
def create_vector_store():
    langfuse_context.update_current_observation(
        name="Vector store creation", input="PDF_Docs", output="Vector_store"
    ) 
    langfuse_context.update_current_trace(
        name="Vector store creation trace",
        session_id="Vector store creation session",
        tags=["embeddings", "vector_store"],
        public=True
    )    
    vectorstore_faiss = FAISS.from_documents(
                            docs,
                            bedrock_embeddings,
                      )
    return vectorstore_faiss
vectorstore_faiss = create_vector_store()
wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)


### Question and Answer - Traceability

In [21]:
trace_name = "RAG_Trace001"
session_id = "RAG_Session001"
user_id = "Developer_RAG001"

In [22]:
from langfuse.decorators import langfuse_context, observe
@observe()
def invokeLLM(rag_chain, question):
    # Attach trace-level metadata (name, session, user) to the current Langfuse trace
    langfuse_context.update_current_trace(
        name=trace_name, 
        session_id=session_id,
        user_id=user_id, 
    )      
    langfuse_handler=langfuse_context.get_current_langchain_handler()       
    result = rag_chain.invoke({"input": question},config={"callbacks": [langfuse_handler]})
    return result
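
To make the idea behind `@observe()` concrete, here is a toy decorator that records a function's name, input, output, and latency. It is only a conceptual sketch and not how Langfuse is implemented. `toy_observe`, `TRACE_LOG`, and `fake_llm_call` are hypothetical names introduced for illustration.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory stand-in for a tracing backend

def toy_observe(func):
    """Record name, inputs, output, and latency of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = func(*args, **kwargs)
        TRACE_LOG.append({
            "name": func.__name__,
            "input": {"args": args, "kwargs": kwargs},
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output
    return wrapper

@toy_observe
def fake_llm_call(question):
    # Stand-in for a real model invocation.
    return f"echo: {question}"

answer = fake_llm_call("What is OID?")
print(TRACE_LOG[0]["name"], "->", TRACE_LOG[0]["output"])
```

A real tracing tool additionally groups such records into traces and sessions, which is what the `name`, `session_id`, and `user_id` values in the cell above are for.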

In [23]:
#YOUR QUERY HERE ...
question = "What is the difference between market discount and qualified stated interest?"

In [25]:
from langfuse.callback import CallbackHandler
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
    
retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    )

# 2. Incorporate the retriever into a question-answering chain.
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

result = invokeLLM(rag_chain, question)
print_ww(result)

{'input': 'What is the difference between market discount and qualified stated interest?',
'context': [Document(metadata={'source': 'data/p1212.pdf', 'page': 2}, page_content="was less than
the debt instrument's issue price \nplus the total OID that accrued before you ac-\nquired it. The
market discount is the difference \nbetween the issue price plus accrued OID and \nyour adjusted
basis.\nPremium. A debt instrument is purchased at a \npremium if its adjusted basis immediately
after \npurchase is greater than the total of all amounts \npayable on the debt instrument after the
pur-\nchase date, other than qualified stated interest. \nThe premium is the excess of the adjusted
ba-\nsis over the payable amounts.\nPremium will generally eliminate the future \nreporting of OID
in income by the purchaser, as \ndiscussed under Information for Owners of OID \nDebt Instruments ,
later. See Pub. 550  for more \ninformation on the tax treatment of bond pre-\nmium.\nQualified
stated interest.  In 

## Next
Now that we have executed a Q&A interaction with the LLM, let us examine the traces in Langfuse.
# Thank You

### Flush Langfuse context (Clear off)

In [None]:
# The SDK sends events in the background; flush so that all queued events reach Langfuse
langfuse_context.flush()

### Query for similarity search [You can skip this, as these are more of verification steps]

Now that we have our vector store in place, we can start asking questions.

In [None]:
query = """Is it possible that I get sentenced to jail due to failure in filings?"""

The first step is to create an embedding of the query so that it can be compared with the documents.

In [None]:
query_embedding = vectorstore_faiss.embedding_function.embed_query(query)
np.array(query_embedding)

We can use this embedding of the query to fetch relevant documents.
Now that our query is represented as an embedding, we can run a similarity search against our data store to retrieve the most relevant information.

In [None]:
relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    print_ww(f'## Document {i+1}: {rel_doc.page_content}.......')
    print('---')

Now that we have the relevant documents, it's time to use the LLM to generate an answer based on them.

We take our initial prompt together with the relevant documents retrieved by the similarity search, and combine them into a prompt that we feed to the model to get our result.
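
Combining the prompt with the retrieved documents can be sketched as simple string assembly. This is a minimal illustration under stated assumptions: `build_prompt` is a hypothetical helper, the prompt layout is one of many possible choices, and the chunk texts are placeholders rather than real retrieval results.

```python
def build_prompt(instructions, chunks, question):
    """Place the retrieved chunks between the instructions and the question."""
    context = "\n\n".join(chunks)
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}"

# Placeholder chunks standing in for the output of the similarity search.
retrieved_chunks = [
    "Placeholder text of the first retrieved chunk.",
    "Placeholder text of the second retrieved chunk.",
]

final_prompt = build_prompt(
    "Answer using only the context below. If the answer is not there, say you don't know.",
    retrieved_chunks,
    "Is it possible that I get sentenced to jail due to failure in filings?",
)
print(final_prompt)
```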

### Knowledge share
You can also use the wrapper provided by LangChain, which wraps around the vector store and takes the LLM as input.
This wrapper performs the following steps behind the scenes:
- Take the question as input
- Create question embedding
- Fetch relevant documents
- Stuff the documents and the question into a prompt
- Invoke the model with the prompt and generate the answer in a human readable manner.