# Lab: Using LangChain to build a simple RAG pipeline

## Goal

Explore basic LangChain and other component functionality with the ultimate goal of developing a simple **Retrieval Augmented Generation (RAG)** question and answer system based on publicly available Medium articles.


### Steps

1. Install dependencies
2. Start your LLM locally
3. Quick LangChain tutorial (optional)
4. Download and transform data
5. Generate embeddings and add to vector store
6. Build the RAG chain
7. Query and test the RAG system
    
---

### Prerequisites

#### Python
This notebook has been tested with both Python `3.12` and `3.13`.

### Expected Outcome:

A robust pipeline capable of:
- Extracting content from URLs.
- Generating high-quality embeddings.
- Querying with contextually relevant answers.
- Providing accurate responses to complex user queries in a highly scalable manner.

## 1. Install dependencies


### Introduction to the JupyterLab environment

JupyterLab is an interactive development environment where you can run code in "cells" and see the results immediately.

To execute a cell in JupyterLab, use one of the following methods:

* Press <kbd>Shift</kbd> + <kbd>Enter</kbd> to run the current cell and move on to the next cell.

* Press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to run the cell and stay in the same cell.

* Click the ▶️ "Run" button in the toolbar above.

In JupyterLab notebooks, you can run shell commands from a code cell by prefixing them with an exclamation mark (`!`).

Let's use this to install the LangChain dependencies.

In [1]:
!pip install langchain_community langchain_openai langchain_milvus langchain_text_splitters langchain_huggingface langchain_core


Now we can install other important Python dependencies needed by our project.

In [4]:
!pip install beautifulsoup4 python-dotenv chromadb sentence-transformers


## 2. Start your LLM Locally

In this lab we will use the local NVIDIA GPU and the `ollama` LLM runtime to run the 8-billion-parameter `granite3.3:latest` model locally.

In [36]:
import subprocess

subprocess.Popen(["ollama", "run", "granite3.3", "--keepalive", "-1m"])

<Popen: returncode: None args: ['ollama', 'run', 'granite3.3', '--keepalive'...>

We can easily check if the LLM, Granite, is now running.

In [6]:
!ollama ps

NAME                 ID              SIZE      PROCESSOR    UNTIL   
granite3.3:latest    fd429f23b909    7.8 GB    100% GPU     Forever

We can also validate that our model is actually being served by Ollama in a way that our application will be able to access.
A simple test is to query the LLM runtime with `curl`.

In [37]:
!TOKENIZERS_PARALLELISM=false curl -s http://localhost:11434/v1/models | jq

{
  "object": "list",
  "data": [
    {
      "id": "granite3.3:latest",
      "object": "model",
      "created": 1750703969,
      "owned_by": "library"
    }
  ]
}


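The same check can be scripted from Python. The sketch below uses only the standard library and assumes the endpoint returns the JSON shape shown in the response above. The helper names `model_ids` and `fetch_model_ids` are made up for this illustration and are not part of Ollama or LangChain.

```python
import json
from urllib.request import urlopen


def model_ids(payload: dict) -> list[str]:
    """Extract the model IDs from a /v1/models style response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:11434/v1") -> list[str]:
    """Query the local runtime; assumes Ollama is listening on this URL."""
    with urlopen(f"{base_url}/models", timeout=10) as resp:
        return model_ids(json.load(resp))


# Offline check against the response shown above (no server needed).
sample = {
    "object": "list",
    "data": [
        {
            "id": "granite3.3:latest",
            "object": "model",
            "created": 1750703969,
            "owned_by": "library",
        }
    ],
}
print(model_ids(sample))
```

With Ollama running, calling `fetch_model_ids()` should return the same list of IDs that `curl` reported.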


We can also use `curl` to talk directly to `granite` for **inference** through Ollama's OpenAI-compatible API.

This command may take a while to run, though less than a minute, as the model generates 5 to 10 paragraphs.

In [38]:
%%bash
curl -s -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite3.3",
    "messages": [
      {
        "role": "user",
        "content": "Write me 5 to 10 paragraphs about RHEL"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 5000
  }' | jq

{
  "id": "chatcmpl-251",
  "object": "chat.completion",
  "created": 1751313133,
  "model": "granite3.3",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Red Hat Enterprise Linux (RHEL) is a commercial open-source operating system (OS) developed by Red Hat, an American software company. It's renowned for its stability, robustness, and extensive support, making it a popular choice for enterprise environments. Here are some key aspects of RHEL:\n\n1. **Development and Support**: RHEL is built from the source code of the community-driven Fedora project, ensuring it stays updated with the latest OS technologies. Red Hat Inc., however, provides long-term support (up to 10 years) for each major version, which is a significant advantage for businesses that require predictable and stable environments.\n\n2. **Community and Commercial Collaboration**: RHEL benefits from both community contributions 

## Getting set up

Now that we have a running LLM that we can query, let's set up the Python variables that we will need later.

In [39]:
import os
os.environ['USER_AGENT'] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

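# Ollama does not check the API key, but the OpenAI client requires one to be set
api_key = "dummy"
model = "granite3.3"
# Ollama's OpenAI-compatible endpoint
base_url = "http://localhost:11434/v1"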

## 3. Quick LangChain tutorial (optional)

*NOTE: This section does not contribute to the RAG system. It is included for testing and exploration.* 



LangChain is a powerful framework that simplifies the development of AI applications by providing a standard way to chain together language models with external data sources and other components and tools.

In LangChain, components are the modular, reusable building blocks of an AI application, such as Models, Prompt Templates, and Output Parsers. These individual components are then linked together into chains, which define a complete workflow for a specific task, such as answering a question or summarizing a document. Using the LangChain Expression Language (LCEL), you can connect these components, creating an automated data flow from the initial user input to the final, processed output.

### 3a. Models: the core engine
A Model in LangChain is a wrapper around a large language model like Granite (the model we are using for this lab).


In [40]:
# minimum import needed
from langchain_openai import ChatOpenAI

# Loose temperature guidelines: 0-0.4 LOW (deterministic); 0.5-1.0 MEDIUM (balanced creativity and coherence); 1.1-2 HIGH (creative and random)

# Initialize the model. Set a low temperature for predictable, less creative responses
testllm = ChatOpenAI(model=model, api_key=api_key, base_url=base_url, temperature=0.1)

# We can now "invoke" the model with a simple prompt
response = testllm.invoke("What is the official currency of France?")

# The response is an AI Message object, so we access its content
print(response.content)


The official currency of France is the Euro (€). The Euro was introduced as a common currency for participating countries in the European Union on January 1, 1999, and it replaced the French Franc. This monetary union, known as the Eurozone, now consists of 19 member states, with France being one of them.

The Euro is managed and controlled by the European Central Bank (ECB), which is headquartered in Frankfurt, Germany. The ECB sets monetary policy, manages foreign exchange reserves, and oversees the issuance of Euro banknotes and coins.

If you have any more questions about the Euro, French currency, or related topics, please feel free to ask!


### 3b. Output Parsers: Structuring the Response

The output from an LLM is typically an AI Message object. An output parser is a class that helps you structure the model's response into a more usable format, like a simple string, a list, or a JSON object.

In the last exercise, we used `response.content` to get the content string from our model response. Here we will use an output parser component which we can use later in a LangChain chain. 


In [41]:
from langchain_core.output_parsers import StrOutputParser

# Initialize the parser
output_parser = StrOutputParser()

# The testllm.invoke() call returns an AIMessage object
response = testllm.invoke("Give me a 60 word biography on Josephine Baker")

# The parser converts the AIMessage into a simple string
parsed_output = output_parser.invoke(response)

print(parsed_output)

Josephine Baker (1891-1975) was an American-born French entertainer and civil rights activist, celebrated for her dazzling performances in Parisian nightclubs. Known as the "Black Venus," she broke racial barriers with her exotic dance routines, captivating audiences worldwide. Beyond showbiz, Baker was a dedicated anti-colonial and civil rights campaigner, supporting the Civil Rights Movement in the US and fighting against apartheid in South Africa. She was posthumously awarded France's Legion of Honor and remains an enduring symbol of resilience and artistic brilliance.


### 3c. Prompt Templates: Crafting Your Instructions
A prompt template in LangChain is a reusable object that creates a complete and formatted prompt for a language model by dynamically inserting user inputs and other variables into a predefined text structure.

We will compare two prompt template classes.



#### PromptTemplate
This class creates a single string from a template.

In [42]:
from langchain_core.prompts import PromptTemplate

# Define a PromptTemplate object with a placeholder for a word to translate
PT_prompt = PromptTemplate(
    template="Translate the expression {toTranslate} from English to French.",
    input_variables=["toTranslate"]
)

# Invoke the template and print the output and its type
formatted_PT_prompt = PT_prompt.invoke({"toTranslate": "little by little"})

print(formatted_PT_prompt)
print(f"Type: {type(formatted_PT_prompt)}\n")

text='Translate the expression little by little from English to French.'
Type: <class 'langchain_core.prompt_values.StringPromptValue'>



#### ChatPromptTemplate
This class creates a structured list of messages for chat models.

In [43]:
from langchain_core.prompts import ChatPromptTemplate

# Define a ChatPromptTemplate object with a placeholder for a word to translate
CPT_prompt = ChatPromptTemplate.from_template(
    "Translate the expression {toTranslate} from English to French."
)

# Invoke the template and print the output and its type
formatted_CPT_prompt = CPT_prompt.invoke({"toTranslate": "little by little"})

print(formatted_CPT_prompt)
print(f"Type: {type(formatted_CPT_prompt)}\n")

messages=[HumanMessage(content='Translate the expression little by little from English to French.', additional_kwargs={}, response_metadata={})]
Type: <class 'langchain_core.prompt_values.ChatPromptValue'>



#### Comparison
`ChatPromptTemplate` creates a structured list of messages, whereas `PromptTemplate` creates a single string.

`ChatPromptTemplate` is used more commonly because modern models are optimised for a sequence of messages. They perform best when they receive a structured list of messages with roles (System for instructions, Human for user input), and `ChatPromptTemplate` is designed specifically for this.
However, it is also possible to format those roles manually with `PromptTemplate`.
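To make the difference concrete without any LangChain dependency, here is a minimal plain-Python sketch of the two prompt shapes. The helper names `as_string_prompt` and `as_chat_prompt` and the system instruction are made up for this illustration. LangChain's own classes return richer objects, as the outputs above show.

```python
TEMPLATE = "Translate the expression {toTranslate} from English to French."


def as_string_prompt(template: str, **variables: str) -> str:
    """Mimics the single-string style: one formatted block of text."""
    return template.format(**variables)


def as_chat_prompt(template: str, system: str, **variables: str) -> list[dict]:
    """Mimics the chat style: a list of messages, each with a role."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": template.format(**variables)},
    ]


text = as_string_prompt(TEMPLATE, toTranslate="little by little")
messages = as_chat_prompt(
    TEMPLATE, "You are a translator.", toTranslate="little by little"
)

print(text)
for message in messages:
    print(message["role"], "->", message["content"])
```

The chat form keeps the instruction and the user input in separate messages, which is the structure that chat-tuned models expect.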

### 3d. LangChain Expression Language (LCEL): Chaining It All Together

LCEL is the declarative syntax used to chain LangChain components together. It uses the pipe symbol ( | ). Data flows from one component to the next in the sequence.
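As a rough intuition for how the pipe symbol can chain components, the sketch below overloads Python's `|` operator on a toy class. The `Step` class and the three stand-in steps are assumptions for illustration only. This is not how LangChain implements its runnables, only the general idea of passing one component's output to the next.

```python
from typing import Any, Callable


class Step:
    """A toy stand-in for a chainable component (not LangChain's Runnable)."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Three illustrative steps standing in for prompt, model, and parser.
fill_prompt = Step(lambda word: f"Translate {word} to French.")
fake_model = Step(lambda prompt: {"content": prompt.upper()})
parse = Step(lambda message: message["content"])

chain = fill_prompt | fake_model | parse
print(chain.invoke("hello"))
```

Each `|` wraps the steps to its left into a single new step, so the whole chain can be invoked exactly like one component.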

We will create a simple chain using the `testllm` model, the `PT_prompt` template, and the `output_parser` from the previous exercises.

In [44]:
# Build the chain
test_chain=PT_prompt | testllm | output_parser

# Invoke the chain with the expression we want to translate into French
chain_output = test_chain.invoke({"toTranslate": "congratulations on building this chain"})

print(chain_output)


Félicitations pour la construction de cette chaîne.


Congratulations on your first LangChain chain!

## 4. Download and transform data

The example uses multiple Medium articles on the subject of generative AI as the source documents.

We will load the documents and split them into shorter document chunks.

In [18]:
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


# Instantiate the WebBaseLoader to fetch content from specified URLs.
loader = WebBaseLoader(
    web_paths=(
        "https://medium.com/@tuhinsharma121/mastering-prompt-engineering-a-beginners-guide-to-ai-interaction-2a28434ccb67",
        "https://medium.com/@rahuljangir2992/graph-based-prompting-revolutionizing-ai-reasoning-f316b7266c1f",
        "https://medium.com/@fassha08/transforming-search-ai-agents-and-multi-vector-intelligence-1bde1dbe66e7",
        "https://medium.com/@harshkumar1146/prompt-chaining-unlocking-the-full-potential-of-ai-assistants-4fdf2f28c1a5",
    ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
        )
    ),
)

# The load() method fetches and parses the content from the URLs returning a list of Document objects, where each object contains the text content of one webpage.
documents = loader.load()

# Initialize the RecursiveCharacterTextSplitter which will break down large texts into smaller chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

docs = text_splitter.split_documents(documents)

#### Exploring the document chunks

*Note: This does not contribute to building the RAG system*

The output, `docs`, is a flat list of all the text chunks derived from the original documents. These smaller, more granular pieces of text are now in an ideal format to be used for creating vector embeddings for a similarity search in a RAG pipeline.

If we look at the first two list items we can see the included metadata and note the overlap in the `page_content` chunks. 
Without overlap, you risk cutting a sentence or a complete thought exactly in half, leading to two incoherent chunks that lose their meaning. This severely damages the ability of the RAG system to both find the right information (retrieval) and understand it correctly (generation).
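To see the effect of overlap in isolation, here is a simplified sketch that cuts text into fixed-size character windows. It is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraphs and sentences, which this toy `split_with_overlap` function does not attempt.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "Few-shot prompting gives the model a handful of worked examples."
chunks = split_with_overlap(sample, chunk_size=30, chunk_overlap=10)

for chunk in chunks:
    print(repr(chunk))

# The tail of each chunk reappears at the head of the next one.
print(chunks[0][-10:] == chunks[1][:10])
```

Because neighbouring chunks share text, a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.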



In [19]:
print(f"The articles have been split into {len(docs)} sub-documents. \n")
print("First sub-document:")
docs[0]


The articles have been split into 38 sub-documents. 

First sub-document:


Document(metadata={'source': 'https://medium.com/@tuhinsharma121/mastering-prompt-engineering-a-beginners-guide-to-ai-interaction-2a28434ccb67', 'title': 'Mastering Prompt Engineering: A Beginner’s Guide to AI Interaction | by Tuhin Sharma | Medium', 'description': 'In today’s world of artificial intelligence (AI), prompt engineering has become a key skill. It changes how we talk to AI models and make them work better. Whether you’re experienced or just…', 'language': 'en'}, page_content='Mastering Prompt Engineering: A Beginner’s Guide to AI Interaction | by Tuhin Sharma | MediumSitemapOpen in appSign upSign inMedium LogoWriteSign upSign inMastering Prompt Engineering: A Beginner’s Guide to AI InteractionTuhin Sharma11 min read·Jul 10, 2024--1ListenShareIn today’s world of artificial intelligence (AI), prompt engineering has become a key skill. It changes how we talk to AI models and make them work better. Whether you’re experienced or just starting, knowing how to create good prompts

In [20]:
print("Second sub-document:")
docs[1]

Second sub-document:


Document(metadata={'source': 'https://medium.com/@tuhinsharma121/mastering-prompt-engineering-a-beginners-guide-to-ai-interaction-2a28434ccb67', 'title': 'Mastering Prompt Engineering: A Beginner’s Guide to AI Interaction | by Tuhin Sharma | Medium', 'description': 'In today’s world of artificial intelligence (AI), prompt engineering has become a key skill. It changes how we talk to AI models and make them work better. Whether you’re experienced or just…', 'language': 'en'}, page_content='AI models, as it showcases their ability to handle a wide variety of tasks and questions they have not explicitly been prepared for.ExamplePrompt:Classify the text into neutral, negative or positive. Text: I think the vacation is okay.Sentiment:Output:NeutralLimitationFirstly, these models often aren’t as accurate as models trained on specific tasks because they’re trying to generalize without direct examples. This can lead to errors or lower confidence in the results. Also, because they handle such a

## 5. Generate embeddings and add to vector store

### 5a. Embeddings
Vector embeddings are dense numerical representations of data (our article chunks in this lab). An embedding is essentially an array of numbers that can represent a vector in multidimensional space. These vector representations position similar concepts closer together within this high-dimensional vector space, creating a semantic map of the text.

To generate embeddings from our text chunks we use an embedding model, mxbai-embed-large-v1, from HuggingFace. This model converts each text chunk into an array with 1024 dimensions.


In [21]:
from langchain_huggingface import HuggingFaceEmbeddings

# Specify the embeddings model
embeddings = HuggingFaceEmbeddings(model_name="mixedbread-ai/mxbai-embed-large-v1")


#### Exploring embeddings
We can explore an embedding on some sample text.

*Note: This does not contribute to building the RAG system*

In [22]:
sample_text="What are the advantages of few shot prompting?"

vector_embedding = embeddings.embed_query(sample_text)

# Print some information about the vector
print(f"Number of dimensions {len(vector_embedding)}")
print(f"Data type of the numbers {type(vector_embedding[0])}")

# Print the first few numbers to give a sense of what it looks like
print("Here are the first 5 dimensions (numbers) of the vector")
print(vector_embedding[:5])


Number of dimensions 1024
Data type of the numbers <class 'float'>
Here are the first 5 dimensions (numbers) of the vector
[0.15680921077728271, 0.267135351896286, -0.11924470216035843, 0.4239083230495453, -1.2445896863937378]


### 5b. Vector store

We will add the embeddings and the document chunks to a local vector database.

The example below uses Milvus; Chroma (installed earlier) is an alternative. Both are vector databases that can run embedded on a single machine to provide fast, low-latency similarity searches for AI applications. Milvus is built for very large, complex projects that need to handle huge amounts of data, although here it runs as Milvus Lite and stores its data in the local file `./milvus_demo.db`. Chroma is designed to be simple and easy to start with for smaller, single-computer applications.

#### Milvus

In [23]:
from langchain_milvus import Milvus

vectorstore = Milvus.from_documents(  
    documents=docs,
    embedding=embeddings,
    connection_args={
        "uri": "./milvus_demo.db",
    },
    drop_old=True,  # Drop the old Milvus collection if it exists
    index_params={"index_type": "FLAT", "metric_type": "L2"},
)

# Check that the number of entries matches the number of sub-documents (chunks)
vectorstore.col.flush()
number_of_entries = vectorstore.col.num_entities
print(f"The number of entries in the vector store is: {number_of_entries}")



The number of entries in the vector store is: 38


#### About index parameters

An index in a vector database is a data structure that organizes the vectors. 
FLAT and HNSW (Hierarchical Navigable Small World) are two different types of indexes used in vector databases to manage and search through high-dimensional data. They represent a fundamental trade-off between search accuracy and speed.

**FLAT Index** 

A FLAT index is the most basic approach to vector search. It is a brute-force method where a query vector is compared directly with every other vector in the dataset. This makes it slow on large datasets, but the results are exact.
In our example above the Milvus instance uses a FLAT index. We can change this parameter setting to HNSW.
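The brute-force idea fits in a few lines of plain Python. This is a conceptual sketch only: the `flat_search` function and the tiny 2-dimensional vectors are made up for illustration, and Milvus's actual FLAT implementation is far more optimised.

```python
import math


def flat_search(
    query: list[float], vectors: list[list[float]], k: int
) -> list[tuple[int, float]]:
    """Compare the query with every stored vector and return the k closest."""
    scored = [(index, math.dist(query, vector)) for index, vector in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest L2 distance first
    return scored[:k]


# Tiny 2-dimensional stand-ins for real 1024-dimensional embeddings.
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
results = flat_search([1.0, 1.0], stored, k=2)
print(results)
```

Every query touches every stored vector, so the cost grows linearly with the size of the collection. That is acceptable for the 38 chunks in this lab and is the reason approximate indexes exist for larger datasets.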


**HNSW (Hierarchical Navigable Small World) Index** 

HNSW is a sophisticated graph-based index that provides a powerful balance between search speed and accuracy. It's an Approximate Nearest Neighbor (ANN) algorithm, meaning it finds results that are most likely the nearest neighbors, but without a 100% guarantee.
Chroma uses an HNSW index by default.

**Similarity Measures**

In the Milvus example above we specified L2 as the `metric_type` index parameter.

This parameter defines how similarity between two vector embeddings is measured. 
L2 means that similarity is measured by the Euclidean (straight-line) distance between two vectors in multi-dimensional space: the smaller the distance, the more similar the vectors.

In simpler terms, the vector embeddings can be thought of as points in space. Two vector embeddings are 'similar' if the points are close to each other.

This is how our RAG system can identify the most relevant document chunks to feed to the LLM. The embedding of the user query is compared with the embeddings of the sub-documents.

L2 distance is not the only way to compare two vector embeddings. Another common choice is cosine similarity, which measures the angle between two vectors in multi-dimensional space.
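The two measures can disagree, which a small worked example makes visible. The sketch below implements both with the standard library; the function names and sample vectors are illustrative assumptions.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance: 0 means identical, larger means further apart."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short = [1.0, 2.0]
long_same_direction = [10.0, 20.0]

print(l2_distance(short, long_same_direction))
print(cosine_similarity(short, long_same_direction))
```

The two vectors point in the same direction, so their cosine similarity is essentially 1 even though their L2 distance is large. When all vectors are normalised to unit length, both measures produce the same ranking.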

## 6. Build the RAG chain

Now that our documents have been chunked and stored in a vector database with their embeddings, we are ready to define the components needed for our RAG system.

In [25]:
# imports for this section
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

#### Initialize the model

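Note the low temperature parameter, which gives us more predictable, less creative responses. This suits a question and answer system that should stick to the facts, although a low temperature alone does not guarantee accuracy.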

In [26]:
# Initialize the model
llm = ChatOpenAI(model=model, api_key=api_key, base_url=base_url, temperature=0.1)
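For intuition about why a low temperature makes responses more predictable, the sketch below applies temperature scaling to a softmax over some made-up token scores. This is a common textbook description of temperature. It is a conceptual illustration and does not claim to reflect the exact sampling code inside Ollama or Granite.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a temperature of 0.1 almost all of the probability lands on the top-scoring token, while at 1.5 the alternatives keep a meaningful share, so sampled output varies more.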

#### Create a retriever object
The vectorstore.as_retriever() method creates a retriever object, which acts as a specialized search interface for the vector database.

This retriever takes a query, uses the embedding model to get the query embedding vector, uses the vector store's index to efficiently find the most semantically relevant documents by vector comparison, and returns the top k documents, ready to be used.

In [27]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

#### Define the prompt template

Note that we are using `PromptTemplate` and manually specifying the roles Human and Assistant (see section 3c).
You can also see two placeholders: `{question}` will be filled by the user's query and `{context}` will be filled by the article chunks returned by the retriever.

In [28]:
PROMPT_TEMPLATE = """
Human: You are an AI assistant, and provide answers to questions by using fact based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer. Specify if the answer is in the context or not.
<context>
{context}
</context>

<question>
{question}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""

prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)


#### Building the RAG chain
The chain's first step is a dictionary with two keys, context and question. 
* "context" key: The input query is passed to the retriever, which fetches relevant documents. These documents are then piped to format_docs to be converted into a single string.
* "question" key: The RunnablePassthrough() also receives the exact same input query. Its only job is to do nothing to it and pass it straight through.

This output is then fed to the prompt, where the `context` and `question` placeholders are substituted by the retrieved document string and the user input query.

The formatted prompt is passed to the LLM, and lastly the response is passed through an output parser.
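The data flow of the first two steps can be traced in plain Python. In the sketch below, `fake_retriever`, `format_chunks`, `build_prompt`, and the shortened `PROMPT` are hypothetical stand-ins for the real retriever, `format_docs`, and prompt template, so that the example runs without a vector store or an LLM.

```python
PROMPT = "<context>\n{context}\n</context>\n\n<question>\n{question}\n</question>"


def fake_retriever(query: str) -> list[str]:
    """Hypothetical stand-in for the retriever: returns canned chunks."""
    return ["Few-shot prompting supplies examples.", "Zero-shot supplies none."]


def format_chunks(chunks: list[str]) -> str:
    """Join the retrieved chunks into one context string."""
    return "\n\n".join(chunks)


def build_prompt(query: str) -> str:
    # Step 1: the same query feeds both keys of the dictionary.
    inputs = {
        "context": format_chunks(fake_retriever(query)),  # retriever | format_docs
        "question": query,  # RunnablePassthrough()
    }
    # Step 2: the dictionary fills the two placeholders in the prompt.
    return PROMPT.format(**inputs)


print(build_prompt("What is few-shot prompting?"))
```

The printed string is what the model would receive: the retrieved text inside the context tags, followed by the unchanged user question.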


In [29]:
# joins the returned documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
    
# builds the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

## 7. Query and test the RAG system
We can now invoke the RAG chain and test if it replies correctly to our queries with responses based on our stored documents.

In [30]:
query = "What are the advantages of few shot prompting?"

res = rag_chain.invoke(query)
print("--------------------------\n")
print("Question : ",query)
print("\n--------------------------\n")
print("Response : ",res)
print("\n--------------------------")


--------------------------

Question :  What are the advantages of few shot prompting?

--------------------------

Response :  Few-shot prompting offers several advantages in AI model performance, particularly with models like GPT. Here are some key benefits:

1. **Improved Contextual Understanding**: By providing a few examples related to the task at hand, few-shot prompting helps AI models understand the context better. This leads to more relevant and accurate responses compared to zero-shot prompting where no examples are given.

2. **Generalization Capability**: Few-shot learning allows models to generalize from a small set of examples to new, unseen situations. A study by Lake et al. (2017) demonstrated that few-shot learning models could classify images into categories they had not seen during training with just 1, 3, or 5 examples per class, showing the model's ability to generalize effectively.

3. **Reduced Need for Large Datasets**: Traditional machine learning often require

#### Explore the retrieved documents for comparison with the response
We can invoke the retriever directly to explore the documents that have been used to inform the response.

In [33]:
# Invoke the retriever directly to get the relevant documents

retrieved_docs = retriever.invoke(query)

# Print the retrieved documents to inspect their content
print("--- Retrieved Documents ---")
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Document {i+1} ---\n")
    print(f"Source: {doc.metadata.get('source', 'N/A')}\n")
    print(f"Content: {doc.page_content}\n")
print("---------------------------")

--- Retrieved Documents ---

--- Document 1 ---

Source: https://medium.com/@tuhinsharma121/mastering-prompt-engineering-a-beginners-guide-to-ai-interaction-2a28434ccb67

Content: is:Output:When we won the game, we all started to farduddle in celebration.LimitationStandard few-shot prompting is good for many tasks, but it’s not perfect, especially for complex thinking tasks. Let’s show why that is.Prompt:The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.A: The answer is False.The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.A: The answer is True.The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.A: The answer is True.The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.A: The answer is False.The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. A:Output:The answer is True.3. Chain-of-Thought (CoT) PromptingChain-of-thought (CoT) prompting he

### Exploring and testing the RAG system
Let's see what happens if we ask something that is not in the Medium articles. Is this the behaviour you expect?

In [34]:
query_test = "What is the capital of France?"

res = rag_chain.invoke(query_test)
print("--------------------------\n")
print("Question : ",query_test)
print("\n--------------------------\n")
print("Response : ",res)
print("\n--------------------------")


--------------------------

Question :  What is the capital of France?

--------------------------

Response :  I don't have real-time data or specific information about the capital of France. However, according to general knowledge, Paris is the capital city of France. It's not only the capital but also the most populous city in the country with approximately 2.1 million people within its metropolitan area as per the 2020 estimates.

--------------------------


#### Again explore the retrieved documents for comparison with the response
Can you see any information related to the query?

In [35]:
# Invoke the retriever directly to get the relevant documents

retrieved_docs = retriever.invoke(query_test)

# Print the retrieved documents to inspect their content
print("--- Retrieved Documents ---")
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Document {i+1} ---\n")
    print(f"Source: {doc.metadata.get('source', 'N/A')}\n")
    print(f"Content: {doc.page_content}\n")
print("---------------------------")

--- Retrieved Documents ---

--- Document 1 ---

Source: https://medium.com/@fassha08/transforming-search-ai-agents-and-multi-vector-intelligence-1bde1dbe66e7

Content: langchain_core.output_parsers import StrOutputParserfrom semantic_router import Routefrom semantic_router.encoders import HuggingFaceEncoderfrom semantic_router.layer import RouteLayerfrom langchain_community.tools.tavily_search import TavilySearchResultsimport osInstantiate the Embedding Modelembeddings = HuggingFaceEmbeddings(            model_name="BAAI/bge-large-en-v1.5")Vector DatabasesIn this blog, we are using LangChain wrappers to connect to different vector databases. For more information on setting up PgVector, please refer to the PgVector documentation. Similarly, for Qdrant, you can refer to the Qdrant documentation. Regardless of which vector store we are using, the process remains the same.Data UsedDataset1: Technical documentationDomain: Software DevelopmentVector Database: PGVectorDataset2: Customer Serv

### Next steps

You have successfully built a simple **Retrieval Augmented Generation (RAG)** question and answer system based on publicly available Medium articles.

Well done!

#### To go further
- Experiment with different parameters and prompts
- Try other models from models.corp