# LCEL with RAG

RAG stands for Retrieval-Augmented Generation. It combines a language model (like GPT) with an external knowledge source (like a document database) so that answers are grounded in retrieved content, which makes them more accurate and context-aware.

## 🔁 RAG Workflow Breakdown

The entire process consists of the following key steps:

1. **Load Documents**  
   Load raw data into LangChain. The source could be online websites, local files, or various platforms.

2. **Split Documents**  
   Break the loaded documents into smaller chunks. This helps fit within the model's context window and makes vector embedding and retrieval more effective.

3. **Store Embeddings**  
   Convert the document chunks into vector embeddings and store them in a vector database for future retrieval.

4. **Retrieve Documents**  
   Query the vector database to retrieve the most relevant document chunks based on the user’s question.

5. **Generate Answer**  
   Combine the retrieved context with the user query and feed it into the LLM to generate a final answer.



By following these steps, you can build a Q&A system that grounds each answer in passages retrieved from your own documents, which tends to produce more detailed and accurate responses.
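
The five steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: it uses no LangChain classes, the "embedding" is just a set of word tokens, and the function names `tokenize`, `retrieve`, and `build_prompt` are hypothetical. It stops at building the prompt and never calls a model.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# 1. Load: a hard-coded string stands in for a real document loader.
document = (
    "Chroma is a vector database. "
    "LangChain provides loaders and splitters. "
    "Retrievers return the chunks most relevant to a query."
)

# 2. Split: one chunk per sentence (real splitters are size-based).
chunks = [s.strip() + "." for s in document.split(".") if s.strip()]

# 3. Store: keep each chunk next to its token set (a stand-in for an embedding).
index = [(chunk, tokenize(chunk)) for chunk in chunks]


# 4. Retrieve: rank chunks by how many tokens they share with the question.
def retrieve(question, k=1):
    q_tokens = tokenize(question)
    ranked = sorted(index, key=lambda item: len(q_tokens & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# 5. Generate: build the prompt that would be sent to an LLM (no model call here).
def build_prompt(question):
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


print(build_prompt("What is Chroma?"))
```

Token overlap is a crude relevance measure. The guide below replaces it with learned embeddings and a vector database.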

![rag](../../images/rag.png)

### ✅ Benefits of Retrieval-Augmented Generation (RAG)

> RAG enhances LLM capabilities by combining external knowledge retrieval with generation.

#### 🔹 Reduces Hallucination  
Retrieves grounded knowledge to reduce fabricated or inaccurate responses.

#### 🔹 Updatable Knowledge  
No need to retrain models — just update the data source or vector store.

#### 🔹 Domain-Specific Intelligence  
Supports use of private, internal, or specialized content (e.g., legal, medical, technical docs).

#### 🔹 Cost Efficient  
Cheaper and faster than fine-tuning or training large models.

#### 🔹 Transparent Reasoning  
Retrieved content can be shown alongside the answer for traceability and trust.

#### 🔹 Modular Architecture  
Each component (retriever, vector DB, LLM) can be swapped or tuned independently.

---

### ❌ Limitations of Retrieval-Augmented Generation (RAG)

> While powerful, RAG introduces new design and performance challenges.

#### ⚠️ Garbage In, Garbage Out  
Poor retrieval or bad data will lead to misleading or unhelpful answers.

#### ⚠️ Latency Overhead  
Document retrieval and embedding lookup add time to each query.

#### ⚠️ Chunking & Embedding Quality  
Bad chunking or embedding can drastically lower relevance and retrieval accuracy.

#### ⚠️ Limited Cross-Chunk Reasoning  
LLMs can’t always reason well over multiple small chunks due to token window constraints.

#### ⚠️ Manual Updates Needed  
Keeping the vector store current requires periodic data reprocessing.

#### ⚠️ Overkill for Simple Tasks  
Adds complexity where a pure LLM might be faster and good enough.

---

> 🧠 RAG is ideal when your use case needs accuracy, up-to-date info, or grounded answers from private knowledge sources.

## 🧠 Summary: Building a Vector Store Involves

| **Stage**     | **Options**                                                                 |
|---------------|------------------------------------------------------------------------------|
| **Chunking**  | Sentence, paragraph, token-based, recursive splitters                        |
| **Embedding** | OpenAI, HuggingFace, local models                                            |
| **Store Backend** | FAISS, Qdrant, Pinecone, Chroma, Milvus, Weaviate                        |
| **Frameworks**| LangChain, LlamaIndex, Haystack, custom                                      |
| **Indexing**  | Flat, HNSW, IVF, filtering, hybrid search                                    |

## **RAG Development Guide**

**Build a Retrieval-Augmented Generation (RAG) application using the LangChain framework**

1. **Load Documents**:  
   Use the `WebBaseLoader` class to load content from a specified source and generate `Document` objects (depends on the `bs4` library).

2. **Split Documents**:  
   Use the `split_documents()` method from the `RecursiveCharacterTextSplitter` class to split long documents into smaller chunks.

3. **Store Embeddings**:  
   Use the `from_documents()` method from the `Chroma` class to embed the split document chunks into a vector space and store them in a vector database (using `OpenAIEmbeddings`).  
   You can confirm successful storage by checking the number of stored vectors.

4. **Retrieve Documents**:  
   Call `as_retriever()` on the vector store to obtain a `VectorStoreRetriever`, then use the retriever’s `invoke()` method to retrieve the most relevant document chunks based on a query.

5. **Generate Answer**:  
   Insert the retrieved document chunks and the user’s question into a prompt, then call the `invoke()` method of the `ChatOpenAI` class to generate an answer  
   (utilizing `RunnablePassthrough` and `StrOutputParser`).


We use the blog post *“LLM Powered Autonomous Agents”* by Lilian Weng ([https://lilianweng.github.io/posts/2023-06-23-agent/](https://lilianweng.github.io/posts/2023-06-23-agent/)) as the source document.  
The final RAG application allows us to query relevant information from this article.

In [1]:
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain import hub
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

USER_AGENT environment variable not set, consider setting it to identify your requests.


### Step 1: Load Documents

- **Description**: Use a `DocumentLoader` to load content from a specified source (e.g., a webpage) and convert it into `Document` objects.

- **Key Code Abstractions**:
  - Class: `WebBaseLoader`
  - Method: `load()`
  - Library: `bs4` (BeautifulSoup)

In [2]:
# Use WebBaseLoader to load content from a webpage, keeping only the title, headers, and main article content.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

In [3]:
print(len(docs[0].page_content))

43130


In [4]:
print(docs[0].page_content[:200])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a 


### Step 2: Split Documents

- **Description**: Use a text splitter to divide long loaded documents into smaller chunks for embedding and retrieval.

- **Key Code Abstractions**:
  - Class: `RecursiveCharacterTextSplitter`
  - Method: `split_documents()`

In [5]:
# Use RecursiveCharacterTextSplitter to split documents into chunks of 1000 characters with a 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Max size per chunk in characters
    chunk_overlap=200, # Overlap between chunks to preserve context across boundaries
    add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

In [6]:
print(len(all_splits))

66


In [7]:
print(len(all_splits[0].page_content))  # Print the character count of the first chunk.

969


In [8]:
print(all_splits[0].page_content)

LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory


In [9]:
print(all_splits[0].metadata)

{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 8}
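
To see how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` is implemented: that class prefers to break on separators such as paragraph and line breaks, which is why the first chunk above has 969 characters rather than exactly 1000. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Returns (start_index, chunk) pairs, similar in spirit to the
    `start_index` metadata shown above.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in split_with_overlap(text, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last four characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence cut at a boundary retrievable from either side.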


### Step 3: Store Embeddings

- **Description**: Embed the split document content into a vector space and store it in a vector database for later retrieval.

- **Key Code Abstractions**:
  - Class: `Chroma`
  - Method: `from_documents()`
  - Class: `OpenAIEmbeddings`

- **Code Explanation**:
  - **Storing Embeddings**: Use the `Chroma.from_documents()` method to embed all split document chunks using the `OpenAIEmbeddings` model. The resulting vectors are stored in a vector database for fast similarity-based retrieval.

#### Chroma Introduction

**Initialize a Chroma vector database (instantiation only, without storing vector data):**

**1. Initialize via Constructor**: This approach sets up a local persistent Chroma vector store.

```python
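from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Embedding model used to vectorize documents and queries
embeddings = OpenAIEmbeddings()

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally; remove if not necessary
)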
```


**2. Initialize via Client**: Useful for direct access to the underlying database or collection.

```python
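import chromadb
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Embedding model used by the LangChain wrapper below
embeddings = OpenAIEmbeddings()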

persistent_client = chromadb.PersistentClient()
collection = persistent_client.get_or_create_collection("collection_name")
collection.add(ids=["1", "2", "3"], documents=["a", "b", "c"])

vector_store_from_client = Chroma(
    client=persistent_client,
    collection_name="collection_name",
    embedding_function=embeddings,
)
```



Use the `Chroma.from_documents()` method to instantiate the store and save the data in one step.  
This method returns a Chroma instance of type `langchain_chroma.vectorstores.Chroma`.  
Detailed API documentation: https://python.langchain.com/v0.2/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html

In [10]:
# Use Chroma vector store and the OpenAIEmbeddings model to embed and store the split document chunks.
vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OpenAIEmbeddings()
)

In [11]:
type(vectorstore) 

langchain_chroma.vectorstores.Chroma
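
To make "embed, store, then search by similarity" concrete, here is a minimal sketch in plain Python. It does not use Chroma or `OpenAIEmbeddings`. The names `toy_embed`, `cosine_similarity`, and `ToyVectorStore` are hypothetical, and the word-count "embedding" is an assumption chosen so the example runs without an API key. The `count()` method illustrates the idea of confirming storage by checking the number of stored vectors.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical embedding: a sparse word-count vector (not a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self._rows = []  # list of (text, vector) pairs

    @classmethod
    def from_texts(cls, texts):
        # Embed every text once, at insertion time.
        store = cls()
        for text in texts:
            store._rows.append((text, toy_embed(text)))
        return store

    def count(self):
        return len(self._rows)

    def similarity_search(self, query, k=1):
        # Embed the query with the same function, then rank stored rows.
        query_vector = toy_embed(query)
        ranked = sorted(
            self._rows,
            key=lambda row: cosine_similarity(query_vector, row[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore.from_texts([
    "Task decomposition breaks a large task into subgoals.",
    "Memory lets an agent recall past observations.",
    "Tool use lets an agent call external APIs.",
])
print(store.count())
print(store.similarity_search("How does an agent use memory?", k=1))
```

The key design point carries over to real stores: documents and queries must be embedded with the same function, otherwise their vectors are not comparable.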

### Step 4: Retrieve Documents

- **Description**: Call `as_retriever()` on the vector store to obtain a `VectorStoreRetriever`, then call the retriever’s `invoke()` method to retrieve the most relevant document chunks from the vector database based on a query.

- **Key Code Abstractions**:
  - Class: `VectorStoreRetriever` (returned by `vectorstore.as_retriever()`)
  - Method: `invoke()`

- **Code Explanation**:
  - **Document Retrieval**: Convert the vector store into a retriever and perform similarity search based on the query to get relevant document chunks.
  - **Check Retrieval Count**: Print the number of retrieved document chunks to ensure the retrieval was successful.
  - **Validate Retrieved Content**: Output the content of the first retrieved document to verify that the results match expectations.

In LangChain, all vector databases support the **`vectorstore.as_retriever()`** method to instantiate a corresponding retriever. The returned object is of type `VectorStoreRetriever`.  
📚 [API Documentation](https://python.langchain.com/v0.2/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStoreRetriever.html)

In [12]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [13]:
type(retriever)

langchain_core.vectorstores.base.VectorStoreRetriever

In [14]:
retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")

In [15]:
# Inspect the content of the retrieved documents
print(len(retrieved_docs))

6


In [16]:
print(retrieved_docs[0].page_content)

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
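
The role of `k` in `search_kwargs={"k": 6}` can be illustrated with a small top-k sketch. This is an assumption-laden stand-in, not LangChain's retriever: `ToyRetriever` is a hypothetical class, the three-dimensional vectors are made up, and the query is passed as a vector rather than as text.

```python
import heapq
import math


class ToyRetriever:
    """Wraps a list of (text, vector) rows and returns the k nearest texts."""

    def __init__(self, rows, k=2):
        self.rows = rows
        self.k = k

    def invoke(self, query_vector):
        def score(row):
            # Cosine similarity between the query and one stored vector.
            _, vector = row
            dot = sum(q * v for q, v in zip(query_vector, vector))
            norm = math.hypot(*query_vector) * math.hypot(*vector)
            return dot / norm if norm else 0.0

        # heapq.nlargest avoids sorting the whole collection when k is small.
        return [text for text, _ in heapq.nlargest(self.k, self.rows, key=score)]


rows = [
    ("chunk about planning", [0.9, 0.1, 0.0]),
    ("chunk about memory", [0.1, 0.9, 0.0]),
    ("chunk about tools", [0.0, 0.2, 0.9]),
]
toy_retriever = ToyRetriever(rows, k=2)
print(toy_retriever.invoke([1.0, 0.2, 0.0]))
```

A larger `k` gives the model more context to draw on but also lengthens the prompt, so it is usually tuned against the chunk size.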


### Step 5: Generate Answer

- **Description**: Combine the previously built components (retriever, prompt, LLM, etc.) into a complete pipeline that retrieves relevant documents and generates an answer based on the user's question.  
  The full chain works as follows: input the user question → retrieve relevant documents → build the prompt → pass it to the model (using the `invoke()` method of the `ChatOpenAI` class) → parse the output to produce the final answer.

- **Key Code Abstractions**:
  - Class: `ChatOpenAI`
  - Method: `invoke()`
  - Class: `RunnablePassthrough`
  - Class: `StrOutputParser`
  - Module: `hub`


![retrieval](../../images/retrieval.png)


#### LangChain Hub

[LangChain Hub](https://smith.langchain.com/hub) is a community hub for sharing and reusing prompt templates, giving developers ready-to-use prompts. It is part of the **LangSmith** product suite.

Below is the prompt template used for the RAG application:  
🔗 https://smith.langchain.com/hub/rlm/rag-prompt

```text
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
```

In [17]:
llm = ChatOpenAI(model="gpt-4o-mini")

In [19]:
prompt = hub.pull("rlm/rag-prompt")

In [20]:
print(prompt.messages)

[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})]


In [21]:
example_messages = prompt.invoke(
    {"context": "color yellow", "question": "What is yellow?"}
).to_messages()

In [22]:
# Check the prompt
print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: What is yellow? 
Context: color yellow 
Answer:
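
Conceptually, invoking the prompt substitutes the `context` and `question` variables into the template text. A minimal sketch of that substitution with plain string formatting follows. The shortened `TEMPLATE` and the helper `fill_prompt` are hypothetical and only mimic the behavior; they are not the hub prompt or LangChain's implementation.

```python
# A shortened, hypothetical template with the same two placeholders.
TEMPLATE = (
    "Use the following context to answer the question.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def fill_prompt(template, **variables):
    """Substitute named placeholders; raises KeyError if one is missing."""
    return template.format(**variables)


print(fill_prompt(TEMPLATE, context="color yellow", question="What is yellow?"))
```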


#### ⭐️**Using LCEL in RAG**⭐️

##### **Overview of LCEL**

LCEL (LangChain Expression Language) is a key concept in LangChain. It provides a unified interface that allows different components (such as `retriever`, `prompt`, `llm`, etc.) to be connected via a shared `Runnable` interface. Each `Runnable` component implements standard methods like `.invoke()`, `.stream()`, and `.batch()`, allowing them to be easily chained using the `|` (pipe) operator.
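
As a rough illustration of what a shared interface buys you, the sketch below defines a hypothetical `UppercaseStep` class with the same three entry points. It is not a LangChain `Runnable`; it only shows how `invoke`, `batch`, and `stream` differ in the shape of their input and output.

```python
from typing import Iterator, List


class UppercaseStep:
    """Hypothetical component exposing invoke, batch, and stream."""

    def invoke(self, text: str) -> str:
        # Process one input and return the complete result.
        return text.upper()

    def batch(self, texts: List[str]) -> List[str]:
        # Process several inputs; here simply one after another.
        return [self.invoke(text) for text in texts]

    def stream(self, text: str) -> Iterator[str]:
        # Yield the result piece by piece instead of all at once.
        for word in text.upper().split():
            yield word + " "


step = UppercaseStep()
print(step.invoke("hello world"))
print(step.batch(["a", "b"]))
for chunk in step.stream("hello world"):
    print(chunk, end="", flush=True)
```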


##### **Components Used in LCEL**

- **Retriever**: Responsible for retrieving relevant documents based on the user query.
- **Prompt**: Builds the prompt using retrieved documents, which is then fed to the LLM.
- **LLM**: Accepts the prompt and generates the final answer.
- **StrOutputParser**: Parses the LLM output to extract a clean string for display.


##### **How LCEL Works**

- **Building the Chain**: Using the `|` operator, you can link multiple `Runnable` components into a `RunnableSequence`. LangChain automatically converts some objects into `Runnable`s, such as turning `format_docs` into a `RunnableLambda`, and a dictionary with `"context"` and `"question"` keys into a `RunnableParallel`.

- **Data Flow**: The user’s question enters the `RunnableSequence` and is handed to both branches of the `RunnableParallel`. In one branch, the `retriever` fetches relevant documents and `format_docs` joins them into a single string. In the other, `RunnablePassthrough` passes the original question through unchanged. The two results reach the `prompt` as `context` and `question`, and the filled prompt is then sent to the LLM.



##### **Key LCEL Operations**

- **Format Documents**:  
  `retriever | format_docs` passes the question to the retriever and formats the returned documents into a string.

- **Pass Through Question**:  
  `RunnablePassthrough()` forwards the original question as-is.

- **Build Prompt**:  
  `{"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt`  
  builds a complete prompt using both the context and the original question.

- **Run Model**:  
  `prompt | llm | StrOutputParser()` sends the prompt through the model and parses the output.
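
The chaining and coercion described above can be mimicked in a few lines of plain Python. This is a simplified sketch under stated assumptions, not LangChain's implementation: `Step`, `coerce`, and all the `fake_*` objects are hypothetical, the "LLM" just uppercases its input, and the dictionary branches run one after another rather than in parallel.

```python
class Step:
    """Minimal stand-in for a Runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `self | other` runs self first, then feeds its output to other.
        nxt = coerce(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def __ror__(self, other):
        # Handles `dict | step`, where the left operand is not a Step.
        return coerce(other) | self


def coerce(obj):
    """Turn plain functions and dicts into Steps, loosely mirroring LCEL's coercion."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        branches = {key: coerce(val) for key, val in obj.items()}
        # Every branch receives the same input; results are collected by key.
        return Step(lambda value: {k: b.invoke(value) for k, b in branches.items()})
    if callable(obj):
        return Step(obj)
    raise TypeError(f"cannot coerce {type(obj).__name__} into a Step")


def format_docs_stub(docs):
    return "\n\n".join(docs)


# Hypothetical stand-ins for the retriever, prompt, LLM, and passthrough.
fake_retriever = Step(lambda question: [f"doc about {question}"])
fake_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt_text: prompt_text.upper())
passthrough = Step(lambda value: value)

chain = (
    {"context": fake_retriever | format_docs_stub, "question": passthrough}
    | fake_prompt
    | fake_llm
)
print(chain.invoke("task decomposition"))
```

The dictionary on the left of the first `|` is not a `Step`, so Python falls back to `Step.__ror__`. That fallback is what lets a plain dict appear at the head of the chain.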




In [23]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [24]:
# Construct RAG Chain using LCEL
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [25]:
# Stream the answer
for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)

Task decomposition is the process of breaking down a complicated task into smaller, manageable steps. This can be accomplished using techniques like Chain of Thought (CoT) prompting, which encourages sequential reasoning, or by employing the Tree of Thoughts approach, which explores multiple reasoning paths. It can involve simple prompting, task-specific instructions, or human input to facilitate this breakdown.

In [26]:
for chunk in rag_chain.stream("What is ToT?"):
    print(chunk, end="", flush=True)

ToT, or Tree of Thoughts, is an extension of the Chain of Thought (CoT) prompting technique that involves decomposing a problem into multiple thought steps and generating several thoughts for each step, creating a tree structure. This method allows for exploring various reasoning possibilities using search techniques like BFS or DFS. It enables the model to evaluate each state via a classifier or majority vote for enhanced reasoning capability.

### Using a Custom Prompt Instead of the Prompt Template from the Hub

In [27]:
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always repeat the question at the beginning of the answer.

{context}

Question: {question}

Helpful Answer:"""

custom_rag_prompt = PromptTemplate.from_template(template)

In [None]:
# print(custom_rag_prompt.invoke({"context": "filler context", "question": "filler question"}).text)

In [28]:
custom_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

In [29]:
custom_rag_chain.invoke("What is Task Decomposition?")

'Question: What is Task Decomposition?\n\nTask Decomposition is the process of breaking down a complex task into smaller, more manageable steps. Techniques like Chain of Thought (CoT) and Tree of Thoughts help in organizing these steps and exploring multiple reasoning possibilities. This allows models to approach difficult tasks systematically and effectively.'

### Using Another Online Article

In [31]:
# Load a different article, keeping only the element with the "blog--post" class
bs4_strainer = bs4.SoupStrainer(class_="blog--post")
loader = WebBaseLoader(
    web_paths=("https://www.k2view.com/blog/llm-text-to-sql/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()
print("length of page content: ", len(docs[0].page_content))
print("first 100 characters: ", docs[0].page_content[:100])

length of page content:  7869
first 100 characters:  





Table of Contents


Please select







LLM Text-to-SQL Solutions: Top Challenges and Tips to


In [32]:
# Use RecursiveCharacterTextSplitter to split documents into chunks of 1000 characters with a 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Max size per chunk in characters
    chunk_overlap=200, # Overlap between chunks to preserve context across boundaries
    add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print("number of chunks: ", len(all_splits))
print("number of characters in the first chunk", len(all_splits[0].page_content))
print("first chunk content: ", all_splits[0].page_content)
print("first chunk metadata: ", all_splits[0].metadata)

number of chunks:  13
number of characters in the first chunk 422
first chunk content:  Table of Contents


Please select







LLM Text-to-SQL Solutions: Top Challenges and Tips to Overcoming Them








LLM Text-to-SQL Solutions: Top Challenges and Tips to Overcoming Them7:29










Iris Zarecki
Product Marketing Director

March 30, 2025










LLM-based text-to-SQL is the process of using Large Language Models (LLMs) to automatically convert natural language questions into SQL database queries.
first chunk metadata:  {'source': 'https://www.k2view.com/blog/llm-text-to-sql/', 'start_index': 6}


In [33]:
# Use Chroma vector store and the OpenAIEmbeddings model to embed and store the split document chunks.
vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OpenAIEmbeddings()
)

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.invoke("How to generate SQL using LLM?")
# Inspect the content of the retrieved documents
print(len(retrieved_docs))
print(retrieved_docs[0].page_content)

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = hub.pull("rlm/rag-prompt")

6
Realizing the potential of LLM text-to-SQL  
Using LLMs to generate SQL creates the potential for democratizing data access, enhancing the customer experience, and boosting productivity. However, it also introduces critical challenges in accuracy, performance, and security.  
To harness the benefits of LLM-based text-to-SQL, focus on improving schema awareness, using chain-of-thought prompting, establishing robust security measures, and using LLM agents and functions to make it all happen. By addressing these key areas, you can leverage AI-generated SQL to use your data more efficiently, improve your decision-making processes, and safeguard your customers’ PII and other sensitive data.


In [34]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Construct RAG Chain using LCEL
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Stream the answer
# 1 
for chunk in rag_chain.stream("How to generate SQL using LLM?"):
    print(chunk, end="", flush=True)

To generate SQL using LLMs, ensure the model is schema-aware by incorporating trusted data sources and metadata for accurate query generation. Employ chain-of-thought prompting to break down queries into simpler steps, enhancing the quality of the output. It's also crucial to implement robust security measures to protect sensitive data throughout the process.

In [35]:
# 2
for chunk in rag_chain.stream("What is the limitation of using LLM to generate SQL?"):
    print(chunk, end="", flush=True)

The limitations of using LLMs to generate SQL include challenges with schema awareness, necessitating rich metadata to ensure accuracy, especially in complex databases. Additionally, the accuracy of generated queries can suffer from issues like AI hallucinations and misunderstood schemas. Performance concerns also arise due to the ambiguity in column names and potential inefficiencies in non-optimized queries.

In [36]:
# 3
for chunk in rag_chain.stream("How to increase accuracy in sql generated?"):
    print(chunk, end="", flush=True)

To increase accuracy in SQL generated by AI, ensure that the language model (LLM) is schema-aware and can utilize curated data sources. Additionally, implement chain-of-thought prompting to break down queries into simpler steps, which improves the quality of the generated SQL. Finally, focus on addressing ambiguities in column names and complex schemas to minimize misinterpretation.