Retrieval-Augmented Generation (RAG) with LangChain, using a knowledge base populated from a PDF or a web page to answer questions. This solution uses an in-memory vector store, the Claude 3.5 Sonnet LLM for generation, and the Amazon Titan Text Embeddings V2 model for indexing and retrieval. Both models are accessed through Amazon Bedrock.

[LangChain documentation](https://python.langchain.com/docs/tutorials/rag/)


Install the libraries needed to run this notebook with LangChain.

In [37]:
%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph langchain-aws langchain-core pypdf beautifulsoup4

In [39]:
# Ensure your AWS credentials are configured

from langchain.chat_models import init_chat_model

llm = init_chat_model("anthropic.claude-3-5-sonnet-20240620-v1:0", model_provider="bedrock_converse")

Select the embeddings model. This notebook uses Amazon Titan Text Embeddings V2 through Amazon Bedrock.

In [40]:
from langchain_aws import BedrockEmbeddings

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")

Next, select a vector store. Options include an in-memory store or FAISS; this notebook uses the in-memory store.

In [41]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

This example uses an uploaded PDF, edfors-et-al-gene.pdf. The following code loads the file, splits it into overlapping chunks, and stores the chunks in the vector store. It also defines the retrieval and generation steps of Retrieval-Augmented Generation and compiles them into a graph.
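To make the `chunk_size` and `chunk_overlap` settings more concrete, here is a minimal sketch of overlapping chunks using a plain sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `split_with_overlap` is a hypothetical helper written only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window start moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Illustrative input: 2500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The tail of each chunk repeats at the head of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.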

The retrieval step pulls stored information from the vector store: the user's question is transformed into a vector, and that vector is matched against the chunk vectors in the store. The generation step then passes the best-matching chunks and the question to the LLM, which answers in text.
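The matching described above can be sketched in a few lines of plain Python. This is only an illustration, assuming cosine similarity as the metric; the three-dimensional vectors, the chunk texts, and the helper names `cosine_similarity` and `top_k` are made up for the example and are not part of LangChain. Real embedding vectors are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy "embeddings" standing in for real model output
toy_store = [
    ("chunk about protein copy numbers", [0.9, 0.1, 0.0]),
    ("chunk about mRNA levels", [0.7, 0.6, 0.1]),
    ("chunk about author affiliations", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.2, 0.0]  # pretend this is the embedded question
print(top_k(query, toy_store, k=2))
# ['chunk about protein copy numbers', 'chunk about mRNA levels']
```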

In [42]:
import bs4  # only needed for the web-page alternative below
from langchain import hub
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Load and chunk contents of the PDF

file_path = "/content/rag-langchain/edfors-et-al-gene.pdf"  # Replace with your PDF path
loader = PyPDFLoader(file_path)


"""
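# Alternative: load and chunk contents of a blog post instead of the PDF
# (disabled because it sits inside this string; move it out to use it)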
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
"""
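docs = loader.load()  # PyPDFLoader returns one Document per PDF page

# Split into chunks of up to 1000 characters, with up to 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)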

# Index chunks
_ = vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
# N.B. for non-US LangSmith endpoints, you may need to specify
# api_url="https://api.smith.langchain.com" in hub.pull.
prompt = hub.pull("rlm/rag-prompt")


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

# Define application steps
def retrieve(state: State):
    # Embed the question and fetch the most similar chunks from the vector store
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    # Join the retrieved chunks into one context string and fill in the prompt
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Build and compile the graph: START -> retrieve -> generate
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()



The graph first retrieves the chunks most similar to the user's question, and the LLM then generates an answer from that retrieved context.
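As a rough picture of what the LLM receives in the generation step, the sketch below joins retrieved chunks into one context string and fills a prompt template. The `TEMPLATE` wording and the `build_prompt` helper are hypothetical stand-ins; the actual prompt pulled from the hub may be worded differently, and the chunk texts here are invented.

```python
# Hypothetical template standing in for the hub prompt; the real wording may differ
TEMPLATE = (
    "Use the following retrieved context to answer the question.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def build_prompt(question, chunk_texts):
    """Join chunk texts with blank lines and substitute them into the template."""
    context = "\n\n".join(chunk_texts)
    return TEMPLATE.format(question=question, context=context)


# Invented chunk texts, used only to show the shape of the final prompt
prompt_text = build_prompt(
    "What was compared?",
    ["Protein copy numbers were measured.", "mRNA levels were measured."],
)
print(prompt_text)
```

Because the model only sees the chunks that were retrieved, an answer can be incomplete when the relevant passage is not among them.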

In [43]:
# response = graph.invoke({"question": "What is Task Decomposition?"})
response = graph.invoke({"question": "What are the RNA and protein levels were studied in samples from nine human cell lines (Table EV1) and 11 human tissues representing?"})
print(response["answer"])

Based on the context provided, the RNA and protein levels were studied in samples from nine human cell lines and 11 human tissues. The study aimed to compare absolute protein copy numbers with corresponding mRNA levels across these samples. However, the specific details about what the 11 tissues represent are not given in the provided context.
