# LangChain RAG Tutorial
* Learn how to implement RAG with LangChain by typing out the tutorial below by hand while reading the related documentation.
    * https://python.langchain.com/docs/tutorials/rag/#:~:text=These%20applications%20use%20a%20technique%20known

* How to use LangSmith
    * https://zenn.dev/nari007/articles/e1531d3b9370cb

# Concepts
1. **Indexing**
2. **Retrieval and generation**

## 1. Indexing
1. Load: use **Document Loaders** to load documents from various sources like CSV, JSON, PDF, etc.
2. Split: **Text Splitters** split the documents into smaller chunks for indexing.
3. Store: **VectorStore** and **Embeddings** store the indexed documents in a searchable format.

## 2. Retrieval and Generation
4. Retrieve: **Retriever** finds relevant documents from the vector store based on user queries.
5. Generate: **ChatModel**/**LLM** generates answers based on the retrieved documents.

## Conceptual guide
* https://python.langchain.com/docs/concepts/

In [16]:
from dotenv import load_dotenv
import os
from langchain_openai import ChatOpenAI
import bs4

from langchain import hub
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [17]:
# Load environment
load_dotenv()

True

# 1. Indexing: Load
We first need to load the blog post contents. We can use **DocumentLoaders** for this, which are objects that load data from a source and return a list of **Documents**. A `Document` is an object with some `page_content` (string) and `metadata` (dict).


In this case we'll use the **WebBaseLoader**, which uses `urllib` to load HTML from URLs and `BeautifulSoup` to parse it to text. We can customize the HTML -> text parsing by passing in parameters to the `BeautifulSoup` parser via `bs_kwargs`. In this case, only HTML tags with class "post-content", "post-title", or "post-header" are relevant, so we'll remove all others.
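
As a rough illustration of the class-based filtering described above, the following standard-library sketch keeps only the text inside elements whose `class` is on an allow-list and wraps it in a small `Document`-like object. It is not how `WebBaseLoader` or `BeautifulSoup` work internally; `SimpleDocument`, `ClassFilterParser`, `load_html`, and the example URL are hypothetical, and the sketch assumes reasonably well-formed HTML.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ClassFilterParser(HTMLParser):
    """Collects text only from elements whose class is in `keep`."""

    def __init__(self, keep):
        super().__init__()
        self.keep = set(keep)
        self.depth = 0  # > 0 while we are inside a kept element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0 or self.keep.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def load_html(html, source, keep):
    """Returns a one-element list, mirroring the 'list of Documents' shape."""
    parser = ClassFilterParser(keep)
    parser.feed(html)
    parser.close()
    return [SimpleDocument("".join(parser.parts), {"source": source})]


sample_html = (
    "<html><body>"
    '<h1 class="post-title">Title</h1>'
    '<div class="sidebar">skip me</div>'
    '<div class="post-content"><p>Body <b>text</b></p></div>'
    "</body></html>"
)
docs_sketch = load_html(
    sample_html, "https://example.com/post", ("post-title", "post-content")
)
print(docs_sketch[0].page_content)
```

The sidebar text is dropped because its element carries none of the allowed classes, which is the same effect the `SoupStrainer` has in the real loader.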

In [24]:
# Load, chunk and index the contents of the blog
# Only keep post title, headers, and content from the full HTML.
bs_strainer = bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))
loader = WebBaseLoader(
    web_path=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs_strainer}
)
docs = loader.load()

len(docs[0].page_content)

43131

In [28]:
print(docs[0].page_content[:500])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


* Memo:
    * `DocumentLoader`: Object that loads data from a source as a list of `Documents`.
        * https://python.langchain.com/docs/concepts/#document-loaders
        * https://python.langchain.com/docs/how_to/#document-loaders
        * https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html



# 2. Indexing: Split
Our loaded document is over 42k characters long. This is too long to fit in the context window of many models. Even for models that could fit the full post in their context window, models can struggle to find information in very long inputs.


To handle this we'll split the `Document` into chunks for embedding and vector storage. This should help us retrieve only the most relevant bits of the blog post at run time.


In this case, we'll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the **RecursiveCharacterTextSplitter**, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.


We set `add_start_index=True` so that the character index at which each split Document starts within the initial Document is preserved as metadata attribute "start_index".
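
To make the roles of chunk size, overlap, and `start_index` concrete, here is a minimal fixed-window sketch. It is deliberately simpler than `RecursiveCharacterTextSplitter`: it cuts at fixed character offsets instead of preferring separators such as new lines, so the chunk boundaries it produces will differ from the real splitter's. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cuts `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {
                "page_content": text[start:start + chunk_size],
                # Offset of this chunk within the original text.
                "metadata": {"start_index": start},
            }
        )
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample_text = "abcdefghij" * 250  # 2500 characters of filler
sample_chunks = split_with_overlap(sample_text)
print([c["metadata"]["start_index"] for c in sample_chunks])
```

Each chunk begins with the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.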


* For splitting Japanese text, see the following guide
    * https://python.langchain.com/docs/how_to/recursive_text_splitter/



In [29]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

66

In [30]:
len(all_splits[0].page_content)

969

In [31]:
all_splits[10].metadata

{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'start_index': 7056}

In [33]:
all_splits[10].page_content[:500]

'To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (because there are many common words in feedback sequences), they randomly mask 0% - 5% of past tokens during training.\nThe training dataset in their experiments is a combination of WebGPT comparisons, summarization from human feedback and human preference dataset.'

* Memo:


    * `TextSplitter`: Object that splits a list of `Documents` into smaller chunks. A subclass of `DocumentTransformer`.

        * splitting text using different methods
            * https://python.langchain.com/docs/how_to/#text-splitters



    * `DocumentTransformer`: Object that performs a transformation on a list of `Document` objects.

# 3. Indexing: Store

Now we need to index our 66 text chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of "similarity" search to identify the stored splits with the most similar embeddings to our query embedding. The simplest similarity measure is cosine similarity: we measure the cosine of the angle between each pair of embeddings (which are high-dimensional vectors).
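
The cosine similarity of two vectors is their dot product divided by the product of their lengths. A small sketch of that formula and of ranking stored vectors against a query follows; the three-dimensional `toy_embeddings` are made-up stand-ins for real embeddings, and `most_similar` is a hypothetical helper, not a vector store API.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vec, stored_vecs, k=2):
    """Returns the indices of the k stored vectors closest to the query."""
    ranked = sorted(
        range(len(stored_vecs)),
        key=lambda i: cosine_similarity(query_vec, stored_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
top = most_similar([1.0, 0.0, 0.0], toy_embeddings, k=2)
print(top)
```

Vectors pointing in nearly the same direction score close to 1, while orthogonal vectors score 0, regardless of their lengths.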


We can embed and store all of our document splits in a single command using the **Chroma** vector store and **OpenAIEmbeddings** model.

* Chroma document:
    * https://python.langchain.com/docs/integrations/vectorstores/chroma/
* OpenAIEmbeddings document:
    * https://python.langchain.com/docs/integrations/text_embedding/openai/

In [34]:
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

* Memo:
    * `Embeddings`: Wrapper around a text embedding model, used for converting text to embeddings.
        * https://python.langchain.com/docs/how_to/embed_text/

    * `VectorStore`: Wrapper around a vector database, used for storing and querying embeddings.
        * https://python.langchain.com/docs/how_to/vectorstores/


This completes the **Indexing** portion of the pipeline. At this point we have a query-able vector store containing the chunked contents of our blog post. Given a user question, we should ideally be able to return the snippets of the blog post that answer the question.

# 4. Retrieval and Generation: Retrieve

Now let's write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.

First we need to define our logic for searching over documents. LangChain defines a **Retriever** interface which wraps an index that can return relevant `Documents` given a string query.


* https://python.langchain.com/docs/concepts/#retrievers/


The most common type of `Retriever` is the **VectorStoreRetriever**, which uses the similarity search capabilities of a vector store to facilitate retrieval. Any `VectorStore` can easily be turned into a `Retriever` with `VectorStore.as_retriever()`:


* https://python.langchain.com/docs/how_to/vectorstore_retriever/
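
The relationship between a store and its retriever can be pictured with a toy example. This is only a sketch of the interface shape (a store that can search, and a thin retriever exposing `invoke(query)`); `ToyStore` and `ToyRetriever` are hypothetical classes, and word overlap stands in for embedding similarity.

```python
class ToyStore:
    """Hypothetical store that scores texts by shared words with the query."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,  # sorted() is stable, so ties keep insertion order
        )
        return ranked[:k]

    def as_retriever(self, k=4):
        return ToyRetriever(self, k)


class ToyRetriever:
    """Thin wrapper: a string query in, a list of results out."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query):
        return self.store.similarity_search(query, k=self.k)


toy_store = ToyStore(
    [
        "task decomposition splits a task",
        "memory stores history",
        "tools call external apis",
    ]
)
toy_retriever = toy_store.as_retriever(k=2)
toy_results = toy_retriever.invoke("what is task decomposition")
print(toy_results)
```

The retriever adds no search logic of its own; it fixes the search settings (here `k`) and presents a uniform `invoke` entry point.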

In [35]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")

len(retrieved_docs)

6

In [36]:
print(retrieved_docs[0].page_content)

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.


* Memo:
    * Vector stores are commonly used for retrieval, but there are other ways to do retrieval, too.
    * `Retriever`: An object that returns `Documents` given a text query
        * https://python.langchain.com/docs/how_to/#retrievers
    * Retrieval techniques:
        * `MultiQueryRetriever` **generates variants of the input question** to improve retrieval hit rate.
            * https://python.langchain.com/docs/how_to/MultiQueryRetriever/
        * `MultiVectorRetriever` instead generates **variants of the embeddings**, also in order to improve retrieval hit rate.
            * https://python.langchain.com/docs/how_to/multi_vector/
        * `Max marginal relevance` selects for **relevance and diversity** among the retrieved documents to avoid passing in duplicate context.
            * https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf
        * Documents can be filtered during vector store retrieval using metadata filters, such as with a **Self Query Retriever**.
            * https://python.langchain.com/docs/how_to/self_query/
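
Of the techniques listed above, max marginal relevance is compact enough to sketch directly. The version below greedily picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. It is a simplified illustration under assumed toy vectors, not the implementation used by any particular vector store; `mmr` and `cosine` are hypothetical helper names.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Returns indices of k documents, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


toy_query = [1.0, 0.0]
# Documents 0 and 1 are exact duplicates; document 2 is less relevant but different.
toy_docs = [[1.0, 0.1], [1.0, 0.1], [0.6, -0.8]]
diverse = mmr(toy_query, toy_docs, k=2, lambda_mult=0.5)
relevance_only = mmr(toy_query, toy_docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the duplicate is returned because only relevance counts; lowering it lets the distinct document displace the duplicate.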

# 5. Retrieval and Generation: Generate

Let's put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output.


We'll use the gpt-4o-mini OpenAI model, but any LangChain `LLM` or `ChatModel` could be substituted in.

In [37]:
llm = ChatOpenAI(model="gpt-4o-mini")

We'll use a prompt for RAG that is checked into the LangChain prompt hub.

In [39]:
# download prompt template of "rlm/rag-prompt"
prompt = hub.pull("rlm/rag-prompt")

# create prompt
example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()

example_messages

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:", additional_kwargs={}, response_metadata={})]

In [40]:
print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


We'll use the **LCEL Runnable** protocol to define the chain, allowing us to

* pipe together components and functions in a transparent way
* automatically trace our chain in LangSmith
* get streaming, async, and batched calling out of the box.

* document:
    * https://python.langchain.com/docs/concepts/#langchain-expression-language-lcel


Here is the implementation:

In [42]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)

Task Decomposition is the process of breaking down a complex task into smaller, manageable steps or subgoals. This can be accomplished using techniques like Chain of Thought (CoT) or Tree of Thoughts, which enhance model performance by allowing for step-by-step reasoning. It can be executed through simple prompts, task-specific instructions, or human inputs.

Let's dissect the LCEL to understand what's going on.


First: each of these components (`retriever`, `prompt`, `llm`, etc.) is an instance of **Runnable**. This means that they implement the same methods (such as sync and async `.invoke`, `.stream`, or `.batch`), which makes them easier to connect together. They can be connected into a **RunnableSequence**, itself another Runnable, via the `|` operator.

* https://python.langchain.com/docs/concepts/#langchain-expression-language-lcel
* https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable
* https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableSequence.html


LangChain will automatically cast certain objects to runnables when met with the `|` operator.  Here, `format_docs` is cast to a **RunnableLambda**, and the dict with `"context"` and `"question"` is cast to a **RunnableParallel**. The details are less important than the bigger point, which is that each object is a Runnable.

* https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableLambda.html
* https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableParallel.html
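
One way to picture the `|` composition and the automatic casting is through Python's `__or__` and `__ror__` hooks. The toy below is a sketch of the idea only: `ToyRunnable` and `coerce` are hypothetical names, and LangChain's real classes do considerably more (streaming, async, batching, tracing).

```python
class ToyRunnable:
    """Toy stand-in for a Runnable: wraps a one-argument function."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `self | other`: run self first, then feed its output to other.
        next_step = coerce(other)
        return ToyRunnable(lambda value: next_step.invoke(self.invoke(value)))

    def __ror__(self, other):
        # `other | self` where other is a plain dict or function.
        return coerce(other) | self


def coerce(obj):
    """Casts plain functions and dicts into ToyRunnable objects."""
    if isinstance(obj, ToyRunnable):
        return obj
    if isinstance(obj, dict):
        # Each branch receives the same input (the "parallel" case).
        branches = {key: coerce(val) for key, val in obj.items()}
        return ToyRunnable(
            lambda value: {key: b.invoke(value) for key, b in branches.items()}
        )
    if callable(obj):
        return ToyRunnable(obj)
    raise TypeError(f"cannot coerce {type(obj).__name__}")


def toy_format_docs(docs):
    return "\n\n".join(docs)


toy_retrieve = ToyRunnable(lambda q: ["doc about " + q, "another doc"])
toy_passthrough = ToyRunnable(lambda value: value)
toy_prompt = ToyRunnable(
    lambda d: "Context: " + d["context"] + "\nQuestion: " + d["question"]
)

toy_chain = (
    {"context": toy_retrieve | toy_format_docs, "question": toy_passthrough}
    | toy_prompt
)
toy_output = toy_chain.invoke("RAG")
print(toy_output)
```

The dict on the left of `|` has no idea what a `ToyRunnable` is, so Python falls back to the right operand's `__ror__`, which is where the cast happens.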


Let's trace how the input question flows through the above runnables.


As we've seen above, the input to `prompt` is expected to be a dict with keys `"context"` and `"question"`. So the first element of this chain builds runnables that will calculate both of these from the input question:


* `retriever | format_docs` passes the question through the retriever, generating **Document** objects, and then `format_docs` to generate strings;
    * https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html
* `RunnablePassthrough()` passes through the input question unchanged.


That is, if you constructed

```python
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
)
```

Then `chain.invoke(question)` would build a formatted prompt, ready for inference. (Note: when developing with LCEL, it can be practical to test with sub-chains like this.)


The last steps of the chain are `llm`, which runs the inference, and `StrOutputParser()`, which just plucks the string content out of the LLM's output message.


You can analyze the individual steps of this chain via its LangSmith trace.




## Built-in chains

If preferred, LangChain includes convenience functions that implement the above LCEL. We compose two functions:

* **create_stuff_documents_chain** specifies how retrieved context is fed into a prompt and LLM. In this case, we will "stuff" the contents into the prompt, i.e., we will include all retrieved context without any summarization or other processing. It largely implements our above `rag_chain`, with input keys `context` and `input`: it generates an answer using the retrieved context and the query.
    * https://python.langchain.com/api_reference/langchain/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html
* **create_retrieval_chain** adds the retrieval step and propagates the retrieved context through the chain, providing it alongside the final answer. It has input key `input`, and includes `input`, `context`, and `answer` in its output.
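
The data flow of these two functions can be sketched in plain Python. This is an illustration of the input and output keys described above, not the library's implementation; `make_stuff_chain`, `make_retrieval_chain`, `fake_retrieve`, and `fake_llm` are all hypothetical, and the "LLM" simply echoes its prompt.

```python
def make_stuff_chain(llm_fn, template):
    """Returns a function that joins all documents into one prompt and calls llm_fn."""

    def run(inputs):
        stuffed = "\n\n".join(inputs["context"])  # no summarization, just concatenation
        return llm_fn(template.format(context=stuffed, input=inputs["input"]))

    return run


def make_retrieval_chain(retrieve_fn, answer_chain):
    """Adds the retrieval step and returns input, context, and answer together."""

    def run(inputs):
        docs = retrieve_fn(inputs["input"])
        answer = answer_chain({"input": inputs["input"], "context": docs})
        return {"input": inputs["input"], "context": docs, "answer": answer}

    return run


def fake_retrieve(query):
    return ["chunk A", "chunk B"]  # placeholder retrieval results


def fake_llm(prompt_text):
    return "ECHO: " + prompt_text  # placeholder for a model call


toy_rag = make_retrieval_chain(
    fake_retrieve, make_stuff_chain(fake_llm, "{context}\n\nQuestion: {input}")
)
toy_response = toy_rag({"input": "What is Task Decomposition?"})
print(sorted(toy_response))
```

Because the retrieved chunks are carried through to the output, the caller can inspect exactly what context the answer was based on.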

In [45]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is Task Decomposition?"})
print(response["answer"])

Task Decomposition is the process of breaking down a complicated task into smaller, more manageable steps. Techniques like Chain of Thought (CoT) and Tree of Thoughts (ToT) are used to enhance model performance by allowing the model to think step by step and explore multiple reasoning possibilities. This approach helps in simplifying complex tasks and clarifying the model's thought process.


* Memo:
    * https://python.langchain.com/api_reference/core/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html

### Returning sources
Often in Q&A applications it's important to show users the sources that were used to generate the answer. LangChain's built-in `create_retrieval_chain` will propagate retrieved source documents through to the output in the `"context"` key:

In [46]:
for document in response["context"]:
    print(document)
    print()

page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.' metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 1585}

page_content='Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS 
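
When only the citations are needed rather than the full text, the metadata of each returned document can be collected instead. The sketch below assumes objects with a `metadata` dict holding `source` and `start_index`, as in the output above; `list_sources` is a hypothetical helper, and the `fake_context` values are invented purely for the demonstration.

```python
from types import SimpleNamespace


def list_sources(documents):
    """Returns distinct (source, start_index) pairs in retrieval order."""
    seen = []
    for document in documents:
        key = (
            document.metadata.get("source"),
            document.metadata.get("start_index"),
        )
        if key not in seen:
            seen.append(key)
    return seen


# Invented stand-ins for retrieved documents (not real retrieval results).
fake_context = [
    SimpleNamespace(page_content="first", metadata={"source": "doc-a", "start_index": 0}),
    SimpleNamespace(page_content="second", metadata={"source": "doc-a", "start_index": 100}),
    SimpleNamespace(page_content="first", metadata={"source": "doc-a", "start_index": 0}),
]
for source, start in list_sources(fake_context):
    print(f"{source} (starts at character {start})")
```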

* Memo:
    * Choosing a model
        * `ChatModel`: An LLM-backed chat model. Takes in a sequence of messages and returns a message.
            * https://python.langchain.com/docs/how_to/#chat-models
        * `LLM`: A text-in-text-out LLM. Takes in a string and returns a string.
            * https://python.langchain.com/docs/how_to/#llms

## Customizing the prompt
As shown above, we can load prompts (e.g. https://smith.langchain.com/hub/rlm/rag-prompt?organizationId=5cb6609d-ce31-51cc-962f-6052d0aff4cc) from the prompt hub. The prompt can also be easily customized.
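
Conceptually, a prompt template is a string with named placeholders that are filled in at call time. The plain-Python sketch below uses `str.format` as a rough analogy; it is not how `PromptTemplate` is implemented, and `toy_template` and `placeholders` are hypothetical names.

```python
from string import Formatter

toy_template = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

# Discover which variables the template expects.
placeholders = [name for _, name, _, _ in Formatter().parse(toy_template) if name]

# Fill the placeholders to obtain the final prompt text.
filled = toy_template.format(context="filler context", question="filler question")
print(placeholders)
print(filled)
```

Customizing a prompt then amounts to editing the surrounding text while keeping the placeholder names that the rest of the chain supplies.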

In [48]:
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition?")

"Task decomposition is the process of breaking down a complicated task into smaller, manageable steps to facilitate execution. Techniques like Chain of Thought (CoT) and Tree of Thoughts help enhance model performance by allowing the agent to think step by step or explore multiple reasoning possibilities. This approach not only simplifies complex tasks but also aids in the interpretation of the model's thought process. Thanks for asking!"