In [None]:
# (For Google Colab) Install everything we need.
# In a local environment, run the following in your terminal instead (ideally inside a conda or venv environment):
# pip install langchain langchain-community langchain-text-splitters langsmith pypdf beautifulsoup4 selenium requests lxml jq tiktoken sentence-transformers langchain-huggingface faiss-cpu chromadb "langchain[google-genai]"
# If you are not familiar with `pip` or `anaconda`, please look them up before continuing.
!pip install -U langchain langchain-core langchain-community langchain-text-splitters langsmith

Collecting langchain
  Downloading langchain-1.0.5-py3-none-any.whl.metadata (4.9 kB)
Collecting langchain-core
  Downloading langchain_core-1.0.4-py3-none-any.whl.metadata (3.5 kB)
Collecting langchain-community
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-text-splitters
  Downloading langchain_text_splitters-1.0.0-py3-none-any.whl.metadata (2.6 kB)
Collecting langsmith
  Downloading langsmith-0.4.42-py3-none-any.whl.metadata (14 kB)
Collecting langgraph<1.1.0,>=1.0.2 (from langchain)
  Downloading langgraph-1.0.3-py3-none-any.whl.metadata (7.8 kB)
Collecting langchain-classic<2.0.0,>=1.0.0 (from langchain-community)
  Downloading langchain_classic-1.0.0-py3-none-any.whl.metadata (3.9 kB)
Collecting requests<3.0.0,>=2.32.5 (from langchain-community)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7.0,>=0.6.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.

In [None]:
!pip install pypdf beautifulsoup4 selenium requests lxml jq tiktoken sentence-transformers langchain-huggingface faiss-cpu chromadb "langchain[google-genai]"

Collecting pypdf
  Downloading pypdf-6.2.0-py3-none-any.whl.metadata (7.1 kB)
Collecting selenium
  Downloading selenium-4.38.0-py3-none-any.whl.metadata (7.5 kB)
Collecting jq
  Downloading jq-1.10.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting langchain-huggingface
  Downloading langchain_huggingface-1.0.1-py3-none-any.whl.metadata (2.1 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting chromadb
  Downloading chromadb-1.3.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
Collecting trio<1.0,>=0.31.0 (from selenium)
  Downloading trio-0.32.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket<1.0,>=0.12.2 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting pybase64>=1.4.1 (from chromadb)
  Downloading pybase64-1.4.2-cp312-cp312-manylinux1_x86_64.manylinux2014_x86_64.man

# Why LangChain?

LangChain is an open-source framework for building applications with large language models (LLMs). We could build our RAG system from scratch by calling LLM APIs manually, writing our own document loaders, and managing our own vector database. However, LangChain simplifies this entire process with:

1. Pre-built abstractions: we just need to plug the 'blocks' that LangChain provides together to build a full-fledged RAG pipeline.

2. Orchestration: the blocks are smoothly connected into a "Chain" by LangChain.

LangChain changed substantially with the release of v1.0 (https://docs.langchain.com/oss/python/releases/langchain-v1). If you're already familiar with LangChain, it is okay to use the legacy functionality, but following this lab is still worthwhile for learning some of the new features.

# 0. Data Loading



## Goal: Learn how to collect source data and convert it into LangChain's standard `Document` format.


## What is `Document`?
`Document` (please read [this link](https://reference.langchain.com/python/langchain_core/documents/#langchain_core.documents.base.Document)) is the basic unit of data in LangChain. It has two main parts (plus a few other fields):
1. `page_content`(str): the actual text content of the data.
2. `metadata` (dict): data about the data.

In [None]:
# Example from the official LangChain reference (https://reference.langchain.com/python/langchain_core/documents/#langchain_core.documents.base.Document)
from langchain_core.documents import Document

document = Document(
    page_content="Hello, world!", metadata={"source": "https://example.com"}
)

print(document)

page_content='Hello, world!' metadata={'source': 'https://example.com'}


## What is `DocumentLoader`?

You can use `DocumentLoader` to load data from any source and convert it into a list of Documents. You can see many different loaders here (https://docs.langchain.com/oss/python/integrations/document_loaders). Just to give some examples:
* Files: CSVLoader, PyPDFLoader, JSONLoader, UnstructuredLoader, ...
* Web: WebBaseLoader, RecursiveUrlLoader, ...
* Tools: NotionLoader, SlackLoader, ...

All loaders share two common methods:

1. `.load()`: loads all the data from the source **at once** and returns a **full list** of Documents.
2. `.lazy_load()`: returns an **iterator**; it yields one Document at a time, only when you ask for it. This is extremely memory-efficient, so you can process massive datasets one document at a time.
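
To make the difference between the two methods concrete, here is a minimal sketch that uses only the standard library. `ToyLoader` and its dict-based "documents" are hypothetical stand-ins, not LangChain classes; the sketch only mirrors the `load()` / `lazy_load()` contract described above.

```python
from typing import Iterator


class ToyLoader:
    """Hypothetical stand-in for a loader (NOT a LangChain class)."""

    def __init__(self, lines: list[str]):
        self.lines = lines
        self.produced = 0  # how many documents have actually been built

    def lazy_load(self) -> Iterator[dict]:
        # A generator: each document is built only when the caller asks for it.
        for i, line in enumerate(self.lines):
            self.produced += 1
            yield {"page_content": line, "metadata": {"line": i}}

    def load(self) -> list[dict]:
        # Eager variant: materialize every document into one list.
        return list(self.lazy_load())


lines = [f"record {i}" for i in range(1000)]

eager = ToyLoader(lines)
all_docs = eager.load()
print(len(all_docs), eager.produced)

lazy = ToyLoader(lines)
first = next(lazy.lazy_load())
print(first["page_content"], lazy.produced)
```

In the eager case all 1000 toy documents are built before anything is returned; in the lazy case only the single document that was requested is built.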

In [None]:
### DO NOT RUN THIS BLOCK ###

# Example from the official LangChain reference (https://docs.langchain.com/oss/python/integrations/document_loaders)
from langchain_community.document_loaders.csv_loader import CSVLoader

# this code doesn't work since we do not give any meaningful parameters here
loader = CSVLoader(
    ...  # Integration-specific parameters here
)

# Load all documents
documents = loader.load()

# For large datasets, lazily load documents
for document in loader.lazy_load():
    print(document)

## Let's load some local files

### Use `TextLoader`

In [None]:
# Caution: we tested this code only in the Google Colab environment. It creates an example .txt file.
# If you're working on a local machine or in Docker, you can create any .txt file with your own content instead.

# After running this, click the 'folder' icon on the left to see the generated file.

%%writefile example_doc.txt
RAG (Retrieval-Augmented Generation) is a technique that supplements the limitations of LLMs.
It combines 'Retrieval' and 'Generation',
allowing the LLM to answer based on the latest information or internal documents.

Writing example_doc.txt


In [None]:
# This may take ~1 minute
from langchain_community.document_loaders.text import TextLoader

# 1. Create the loader
loader = TextLoader("example_doc.txt")

# 2. Load the document(s)
docs = loader.load()

# 3. Check the results
print(f"Loaded {len(docs)} document.")
print("\n--- Document Object Preview ---")
print(docs[0])

print("\n--- Content (page_content) ---")
print(docs[0].page_content)

print("\n--- Source (metadata) ---")
print(docs[0].metadata)

Loaded 1 document.

--- Document Object Preview ---
page_content='RAG (Retrieval-Augmented Generation) is a technique that supplements the limitations of LLMs.
It combines 'Retrieval' and 'Generation',
allowing the LLM to answer based on the latest information or internal documents.
' metadata={'source': 'example_doc.txt'}

--- Content (page_content) ---
RAG (Retrieval-Augmented Generation) is a technique that supplements the limitations of LLMs.
It combines 'Retrieval' and 'Generation',
allowing the LLM to answer based on the latest information or internal documents.


--- Source (metadata) ---
{'source': 'example_doc.txt'}


### Use `PyPDFLoader`

In [None]:
# We download an arbitrary PDF file for this lab.
# However, as above, you can use your own PDF file.
!wget -O attention.pdf "https://arxiv.org/pdf/1706.03762.pdf"
print("PDF file downloaded (attention.pdf)")

--2025-10-31 03:25:19--  https://arxiv.org/pdf/1706.03762.pdf
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.195.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /pdf/1706.03762 [following]
--2025-10-31 03:25:20--  https://arxiv.org/pdf/1706.03762
Reusing existing connection to arxiv.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/pdf]
Saving to: ‘attention.pdf’


2025-10-31 03:25:20 (163 MB/s) - ‘attention.pdf’ saved [2215244/2215244]

PDF file downloaded (attention.pdf)


In [None]:
from langchain_community.document_loaders import PyPDFLoader

# 1. Create the loader
loader = PyPDFLoader("attention.pdf")

# 2. Load and split (by page)
pages = loader.load() # .load() returns a list of Documents

# 3. Check the results
print(f"Loaded {len(pages)} pages (Documents).")

# Preview the content of page 0 (the first page)
print(f"\n--- Page 1 Content (Partial) ---")
print(pages[0].page_content[:500])

# Metadata for page 0 (source and page number)
# Notice how the metadata is automatically populated!
print(f"\n--- Page 1 Metadata ---")
print(pages[0].metadata)

Loaded 15 pages (Documents).

--- Page 1 Content (Partial) ---
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz 

--- Page 1 Metadata ---
{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1

In [None]:
# How about a lazy loader?
loader = PyPDFLoader("attention.pdf")

# We don't want to load everything, so we stop after 5 documents
num = 0
for document in loader.lazy_load():
    print("----- Loading a Document -----")
    print(document)
    print("----- Loaded a Document -----")
    num += 1
    if num == 5:
        break


----- Loading a Document -----
page_content='Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing w

### Use `WebBaseLoader`

In [None]:
from langchain_community.document_loaders import WebBaseLoader

# 1. Create the loader (Wikipedia page for Berlin)
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Berlin")

# 2. Load
docs = loader.load()

# 3. Check the results
print(f"Loaded {len(docs)} document.")
print(f"\n--- Source (metadata) ---")
print(docs[0].metadata)

print(f"\n--- Content (Partial) ---")
print(docs[0].page_content[1000:1100]) # Printing a slice, as the content is long



Loaded 1 document.

--- Source (metadata) ---
{'source': 'https://en.wikipedia.org/wiki/Berlin', 'title': 'Berlin - Wikipedia', 'language': 'en'}

--- Content (Partial) ---




Toggle Cityscape and architecture subsection





3.1
Cityscape








3.2
Architecture










#### If you're not satisfied with the result, you can scrape the page manually

Plenty of web-crawling tutorials are available online
(e.g., [here](https://minisiri.tistory.com/87) or [here](https://joy-home.tistory.com/entry/%EB%8D%B0%EC%9D%B4%ED%84%B0%EB%B6%84%EC%84%9D-%EC%9B%B9-%ED%81%AC%EB%A1%A4%EB%A7%81-%ED%8C%8C%EC%9D%B4%EC%8D%AC-BeautifulSoup-%EC%82%AC%EC%9A%A9%EB%B2%95)).

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://people.csail.mit.edu/tatbul/"

# 1. Make the HTML request
response = requests.get(url)
print(f"HTML Request Status: {response.status_code}")
print("-"*25)
# 2. Parse with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# 3. Select the part we want (here, every <big> tag, which this page uses for its headings)
headers = soup.find_all("big")
for header in headers:
  print(f"Header: {header.get_text()}")

# 4. (Important) Convert to a LangChain Document object
# Even manually scraped data must be converted to a Document to be used in a LangChain pipeline.
from langchain_core.documents import Document

my_docs = []
for header in headers:
  header_text = header.get_text().strip()

  if header_text:
    new_doc = Document(
        page_content=header_text,
        metadata={
            "source": url,
            "tag_type": "big" # You can put whatever you want in the metadata
        }
    )
    my_docs.append(new_doc)

# print("\n--- Manually Created Document ---")
print(f"\n--- We created {len(my_docs)} Documents ---")

print(f"\n--- Example 1: ---")
print(my_docs[0])

print(f"\n--- Example 2: ---")
print(my_docs[1])

print(f"\n--- Example 3: ---")
print(my_docs[2])

HTML Request Status: 200
-------------------------
Header: Nesime Tatbul
Header: Nesime Tatbul
Header: Short Bio
Header: Research

--- We created 4 Documents ---

--- Example 1: ---
page_content='Nesime Tatbul' metadata={'source': 'https://people.csail.mit.edu/tatbul/', 'tag_type': 'big'}

--- Example 2: ---
page_content='Nesime Tatbul' metadata={'source': 'https://people.csail.mit.edu/tatbul/', 'tag_type': 'big'}

--- Example 3: ---
page_content='Short Bio' metadata={'source': 'https://people.csail.mit.edu/tatbul/', 'tag_type': 'big'}


## [Own Data] In this lab, we'll proceed with the data we collected from the web

We collected data from the web and stored it as multiple JSONL files,
so we need to use `JSONLoader`.

In [None]:
# Let's define a function so that we can reuse it.

import glob # https://dream2reality.tistory.com/11
import os # https://ok-lab.tistory.com/163
from langchain_community.document_loaders import JSONLoader # https://docs.langchain.com/oss/python/integrations/document_loaders/json
from langchain_core.documents import Document # Used for type hinting

def _metadata_func(record: dict, metadata: dict) -> dict:
    """Helper function to extract metadata from each JSON object."""
    metadata["id"] = record.get("id")
    metadata["title"] = record.get("title")
    metadata["date"] = record.get("date")
    return metadata

def create_document(data_dir_path: str) -> list[Document]:
    """Load every .jsonl file in data_dir_path into a list of Documents."""
    jsonl_files = glob.glob(os.path.join(data_dir_path, "*.jsonl"))
    if not jsonl_files:
        print(f"Error: No .jsonl files found in the '{data_dir_path}' directory.")
    else:
        print(f"Found {len(jsonl_files)} data files to process: {jsonl_files}")

    all_documents = []
    for file_path in jsonl_files:
        print(f"Loading documents from {file_path}...")
        loader = JSONLoader(
            file_path=file_path,
            jq_schema='.', # Please look at the JSONL file's schema. We select the whole JSON object
            json_lines=True, # one JSON object per line in JSONL
            content_key='content', # important! page_content should come from the 'content' field
            metadata_func=_metadata_func # function applied to each line for metadata extraction
        )
        all_documents.extend(loader.load()) # load the file and append its documents

    return all_documents
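
To see roughly what this loader configuration does to each line, here is a standard-library sketch. It is not `JSONLoader`'s implementation: `parse_jsonl`, the two sample records, and the file name `sample.jsonl` are made up for illustration, and only the field names (`id`, `title`, `date`, `content`) come from the code above.

```python
import json

# Two made-up records that reuse the field names from the code above.
sample_jsonl = "\n".join([
    json.dumps({"id": 1, "title": "First release", "date": "2024-01-01",
                "content": "Body of the first record."}),
    json.dumps({"id": 2, "title": "Second release", "date": "2024-01-02",
                "content": "Body of the second record."}),
])


def parse_jsonl(text: str, source: str) -> list[dict]:
    """Turn each non-empty JSONL line into a page_content/metadata pair."""
    parsed = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)  # one JSON object per line
        parsed.append({
            "page_content": record["content"],  # mirrors content_key='content'
            "metadata": {  # same fields that _metadata_func copies
                "source": source,
                "id": record.get("id"),
                "title": record.get("title"),
                "date": record.get("date"),
            },
        })
    return parsed


parsed_records = parse_jsonl(sample_jsonl, source="sample.jsonl")
for item in parsed_records:
    print(item["metadata"]["id"], item["page_content"])
```

The point of the sketch is the mapping: the `content` field becomes the text, and the remaining fields travel along as metadata.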

#### (Google Colab) Do not forget to create a new folder named "data" and upload the JSONL files to it. (In a local environment, place the data folder at the appropriate path and adjust the path below.)

In [None]:
docs = create_document("/content/data")

print(docs[0].page_content)
print(docs[0].metadata)

Found 3 data files to process: ['/content/data/press_releases_pages_51-100.jsonl', '/content/data/press_releases_pages_101-150.jsonl', '/content/data/press_releases_pages_1-50.jsonl']
Loading documents from /content/data/press_releases_pages_51-100.jsonl...
Loading documents from /content/data/press_releases_pages_101-150.jsonl...
Loading documents from /content/data/press_releases_pages_1-50.jsonl...
외교부는 11.28.(목) 우리나라의 2025-27년 인권이사회 이사국 활동을 준비하기 위한 학계‧시민사회 간담회를 개최하여 인권이사회 이사국으로서의 활동 방향에 대한 학계와 시민사회 전문가들의 의견을 청취했다.이번 간담회를 주재한 이철 국제기구·원자력국장은 모두발언을 통해 무력분쟁‧기후변화‧AI와 같은 기술의 급격한 발전 등 도전적인 환경 속에서 국제사회의 인권보호와 증진을 위한 유엔 인권이사회의 역할이 어느 때보다도 중요해지고 있다고 했다. 이 국장은 우리나라가 이러한 중요한 시기에 2025-27년 임기 유엔 인권이사국으로 활동하게 된 만큼, 인권이사회 이사국으로서 국제사회의 인권보호와 증진을 위해 의미있는 기여를 할 수 있도록 준비해 나가고자 하며, 이 과정에서 학계·시민사회 등 다양한 주체들과 소통해 나가고자 한다고 하였다.이번 간담회에 참석한 학계·시민사회 인사들은 북한 인권을 포함한 국별 인권 문제, 장애인‧난민 등 분야별 인권 문제를 비롯한 다양한 주제에 대해 의견을 교환했다. 또한 참석자들은 전통적인 인권 문제 외에 인공지능 등 기술 진보의 인권에 대한 영향과 같이 새롭게 대두되는 인권 논의에 있어 우리가 선도적 역할을 해나가야 한다는

# 1. Data Splitting

## Goal: Learn how to split the Documents we collected in Section 0 into meaningful "chunks" suitable for RAG.


## Why do we need "Chunking"?

In Section 0, we collected data from PDFs and websites into Document objects. However, these Documents are often too large for an LLM to handle at once (e.g., a 100-page PDF).

* **Problem 1**: LLMs have a fixed "context window", so they cannot read too many tokens at once.
* **Problem 2**: Even with frontier LLMs that have large context windows, RAG works poorly if we pass in the whole document instead of a few specific paragraphs.

* **Solution**: Split the document into small "chunks".
  * These chunks fit in the LLM's context window.
  * The retrieval step can find the most relevant chunks.

In [None]:
# Let's prepare a sample document to use here.

# This is a long text with multiple paragraphs and newlines.
sample_text = """
LangChain is a framework for developing applications powered by language models.
It is designed to help build context-aware and reasoning applications.

Key components of LangChain include:
1. Models: Interfaces to various LLMs (e.g., OpenAI, Hugging Face).
2. Prompts: Tools for managing and optimizing prompts.
3. Chains: Sequences of calls (to models or other tools).
4. Agents: LLMs that can decide what actions to take.

RAG (Retrieval-Augmented Generation) is a popular use case.
RAG involves retrieving relevant documents from a knowledge base
and passing them to the LLM as context.
This reduces hallucination and allows for knowledge updating.

This entire text is one single Document.
It is too long to be efficient for retrieval.
We need to split it.
"""

# Convert it to a Document object
documents = [
    Document(
        page_content=sample_text,
        metadata={"source": "my_manual.txt", "author": "AI_Writer"}
    )
]

print("--- Original Document ---")
print(documents[0].page_content)

--- Original Document ---

LangChain is a framework for developing applications powered by language models.
It is designed to help build context-aware and reasoning applications.

Key components of LangChain include:
1. Models: Interfaces to various LLMs (e.g., OpenAI, Hugging Face).
2. Prompts: Tools for managing and optimizing prompts.
3. Chains: Sequences of calls (to models or other tools).
4. Agents: LLMs that can decide what actions to take.

RAG (Retrieval-Augmented Generation) is a popular use case.
RAG involves retrieving relevant documents from a knowledge base
and passing them to the LLM as context.
This reduces hallucination and allows for knowledge updating.

This entire text is one single Document.
It is too long to be efficient for retrieval.
We need to split it.



## LangChain provides **Text Splitters** for this purpose. (https://docs.langchain.com/oss/python/integrations/splitters)

Full reference doc: https://reference.langchain.com/python/langchain_text_splitters/?h=splitter#langchain-text-splitters

The `TextSplitter` class has the following parameters:

| Parameter | Description | Type | Default |
| -- | -- | -- | -- |
| `chunk_size` | Maximum size of chunks to return | `int` | `4000` |
| `chunk_overlap` | Overlap in characters between chunks | `int` | `200` |
| `length_function` | Function that measures the length of given chunks | `Callable[[str], int]` | `len` |
| `keep_separator` | Whether to keep the separator and where to place it in each corresponding chunk (`True` means `'start'`) | `bool` or `Literal['start', 'end']` | `False` |
| `add_start_index` | If `True`, includes the chunk's start index in metadata | `bool` | `False` |
| `strip_whitespace` | If `True`, strips whitespace from the start and end of every document | `bool` | `True` |

## `CharacterTextSplitter`

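This is the simplest splitter. It splits the text on a single separator and then packs the resulting pieces into chunks whose size is measured in characters.

* `separator`: the string to split on (default: `"\n\n"`, i.e., paragraphs).

* `chunk_size`: the maximum number of characters in a chunk.

* `chunk_overlap`: the number of characters (at most) shared between consecutive chunks. Text near a chunk boundary then also appears in the neighboring chunk, so context that straddles the boundary is not lost.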
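
The effect of the size and overlap parameters can be sketched with a plain sliding window over characters. This is a simplified illustration, not what `CharacterTextSplitter` does internally (it splits on the separator first); `sliding_chunks` is a hypothetical helper name.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


alphabet_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in alphabet_chunks:
    print(chunk)
```

Each window starts 7 characters after the previous one (10 minus 3), so the last 3 characters of one chunk reappear as the first 3 characters of the next.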

In [None]:
from langchain_text_splitters import CharacterTextSplitter

# 1. Create the splitter (200 chars, 50 char overlap)
text_splitter = CharacterTextSplitter(
    separator="\n",       # First, try to split by newline (this prevents unwanted splitting)
    chunk_size=200,       # 200 character size
    chunk_overlap=50,     # 50 character overlap
    length_function=len   # Use the standard 'len' function to count chars
)

# 2. Split the documents
chunks = text_splitter.split_documents(documents)

print(f"--- Split into {len(chunks)} chunks ---")

# 3. Check the results (each chunk is a new Document)
print("\n--- Chunk 1 ---")
print(chunks[0].page_content)
print(chunks[0].metadata) # The original metadata is copied!

print("\n--- Chunk 2 ---")
print(chunks[1].page_content)

--- Split into 5 chunks ---

--- Chunk 1 ---
LangChain is a framework for developing applications powered by language models.
It is designed to help build context-aware and reasoning applications.
Key components of LangChain include:
{'source': 'my_manual.txt', 'author': 'AI_Writer'}

--- Chunk 2 ---
Key components of LangChain include:
1. Models: Interfaces to various LLMs (e.g., OpenAI, Hugging Face).
2. Prompts: Tools for managing and optimizing prompts.


## `TokenTextSplitter`

What is a token? It's the basic unit of text that an LLM processes. (e.g., "RAG" and "Retrieval" might each be 1 token, while "Augmented" might be 2 tokens: "Aug" + "mented").

We use the tiktoken library (https://github.com/openai/tiktoken) to count tokens, but other tokenizers may split the same text differently.

In [None]:
from langchain_text_splitters import TokenTextSplitter

# 1. Let's see how tokens are counted
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
token_count = len(encoding.encode("RAG (Retrieval-Augmented Generation)"))
print(f"Token count for the test sentence: {token_count}") # 36 chars, but only 10 tokens!

# 2. Create the splitter (100 tokens, 20 token overlap)
token_splitter = TokenTextSplitter(
    chunk_size=100, # 100 'tokens'
    chunk_overlap=20  # 20 'token' overlap
)

# 3. Split the documents
token_chunks = token_splitter.split_documents(documents)

print(f"\n--- Split into {len(token_chunks)} chunks ---")

# 4. Check the results
print("\n--- Chunk 1 ---")
print(token_chunks[0].page_content)

print("\n--- Token count for Chunk 1 ---")
print(len(encoding.encode(token_chunks[0].page_content)))

print("\n--- Chunk 2 ---")
print(token_chunks[1].page_content)

print("\n--- Token count for Chunk 2---")
print(len(encoding.encode(token_chunks[1].page_content)))

print("\n--- Chunk 3 ---")
print(token_chunks[2].page_content)

print("\n--- Token count for Chunk 3 ---")
print(len(encoding.encode(token_chunks[2].page_content)))

Token count for the test sentence: 10

--- Split into 3 chunks ---

--- Chunk 1 ---

LangChain is a framework for developing applications powered by language models.
It is designed to help build context-aware and reasoning applications.

Key components of LangChain include:
1. Models: Interfaces to various LLMs (e.g., OpenAI, Hugging Face).
2. Prompts: Tools for managing and optimizing prompts.
3. Chains: Sequences of calls (to models or other tools).
4. Agents: LLMs that can decide

--- Token count for Chunk 1 ---
91

--- Chunk 2 ---
ences of calls (to models or other tools).
4. Agents: LLMs that can decide what actions to take.

RAG (Retrieval-Augmented Generation) is a popular use case.
RAG involves retrieving relevant documents from a knowledge base
and passing them to the LLM as context.
This reduces hallucination and allows for knowledge updating.

This entire text is one single Document.
It is too long to be efficient for retrieval.
We need to

--- Token count for Chunk 2---
93


## `RecursiveCharacterTextSplitter`

The two splitters above are length-based: they are straightforward and keep chunk sizes consistent, but they are not good at preserving semantics. Thus, instead of just splitting by character count, we split "recursively" using a list of separators that preserve meaning. (https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter)

* `separators`: a list of separators to try in order (default: `["\n\n", "\n", " ", ""]`).

How it works:

* First, it tries to split the text by `\n\n` (paragraphs).

* If a piece is still larger than `chunk_size`, it takes that piece and splits it by the next separator, `\n` (lines).

* Still too big? It splits that piece by `" "` (spaces).

* Still too big? It is forced to split by `""` (individual characters).
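
The fallback order above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the library's implementation: the hypothetical `recursive_split` helper drops the separators and does not merge small neighboring pieces back together or add overlap, both of which the real splitter handles.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split text so that no piece exceeds chunk_size, trying separators in order."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []  # small enough: keep as-is
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, remaining = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        # Oversized parts fall through to the next, finer separator.
        pieces.extend(recursive_split(part, chunk_size, remaining))
    return pieces


demo_pieces = recursive_split("aaaa bbbb\ncccc\n\ndddddddddddd", chunk_size=5)
print(demo_pieces)
```

In the demo, the first paragraph is broken down by line and then by space, while the unbroken run of `d` characters contains no separator at all and ends up being cut by character count.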

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Create the splitter (200 chars, 50 char overlap)
# It will automatically use the separators ["\n\n", "\n", " ", ""]
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    length_function=len
)

# 2. Split the documents
recursive_chunks = recursive_splitter.split_documents(documents)

print(f"--- Split into {len(recursive_chunks)} chunks ---")

# 3. Check the results (compare them to the CharacterTextSplitter)
print("\n--- Chunk 1 ---")
print(recursive_chunks[0].page_content)

print("\n--- Chunk 2 ---")
print(recursive_chunks[1].page_content)

print("\n--- Chunk 3 ---")
print(recursive_chunks[2].page_content)

print("\n--- Chunk 4 ---")
print(recursive_chunks[3].page_content)

--- Split into 6 chunks ---

--- Chunk 1 ---
LangChain is a framework for developing applications powered by language models.
It is designed to help build context-aware and reasoning applications.

--- Chunk 2 ---
Key components of LangChain include:
1. Models: Interfaces to various LLMs (e.g., OpenAI, Hugging Face).
2. Prompts: Tools for managing and optimizing prompts.

--- Chunk 3 ---
3. Chains: Sequences of calls (to models or other tools).
4. Agents: LLMs that can decide what actions to take.

--- Chunk 4 ---
RAG (Retrieval-Augmented Generation) is a popular use case.
RAG involves retrieving relevant documents from a knowledge base
and passing them to the LLM as context.


## [Own Data] In this lab, we'll use `RecursiveCharacterTextSplitter`.

In [None]:
print("Splitting documents into chunks...")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(docs)
print(f"Split into {len(docs)} chunks.")

Splitting documents into chunks...
Split into 2180 chunks.


# 2. Embedding

## Goal: Learn how to convert the "text chunks" from Section 1 into "number vectors" that a computer can understand.

## Why do we need "Embedding"?

In Section 1, we split our Documents into "chunks." However, the computer still sees this text as a simple string of characters.

A computer doesn't know that the word "RAG" is semantically related to the word "AI." To make a computer understand "meaning" and calculate "similarity," we must convert text into numerical "coordinates." This process is called Embedding.

We will not go deeper into embeddings in this lab. If you are interested, please read Chapter 5 of the [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) textbook.
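
As a small illustration of what "calculating similarity" between coordinates means, the sketch below computes cosine similarity, one common similarity measure, with the standard library. The three 3-dimensional vectors are hand-made values chosen for illustration; real embedding models produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (illustrative values only, not model output).
vec_rag = [0.9, 0.8, 0.1]
vec_ai = [0.8, 0.9, 0.2]
vec_cat = [0.1, 0.2, 0.9]

sim_rag_ai = cosine_similarity(vec_rag, vec_ai)
sim_rag_cat = cosine_similarity(vec_rag, vec_cat)
print(f"RAG vs AI:  {sim_rag_ai:.3f}")
print(f"RAG vs cat: {sim_rag_cat:.3f}")
```

Vectors that point in nearly the same direction score close to 1, which is how related texts end up "near" each other once they are embedded.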

## LangChain provides the **Embeddings** interface for this purpose ([here](https://docs.langchain.com/oss/python/integrations/text_embedding) and [official reference](https://reference.langchain.com/python/langchain_core/embeddings/#langchain_core.embeddings.embeddings.Embeddings))

We have to choose an embedding model to convert text into embeddings. We could use a paid model such as `OpenAIEmbeddings`, but we'll use a free local model here. The downsides are that we must first download the model and that it runs on your own CPU (or GPU, if one is available), so embedding a large dataset with a very large model can take a while.

Hugging Face is a widely used model hub that hosts a very large number of models. Embedding models in particular are easy to use through the `SentenceTransformers` library (https://sbert.net/). Since LangChain supports the `HuggingFaceEmbeddings` API (https://docs.langchain.com/oss/python/integrations/text_embedding/huggingfacehub), we do not have to worry about the details. However, it is still very useful to browse the `SentenceTransformers` website to find the most appropriate embedding model for your use case.

## Example - English Only

We use `all-MiniLM-L6-v2` [model](https://sbert.net/docs/sentence_transformer/pretrained_models.html#original-models) which is lightweight and works fine.

`Embeddings` has two main methods:
* `embed_documents(texts: List[str]) -> List[List[float]]`: embeds a list of documents.
* `embed_query(text: str) -> List[float]`: embeds a single query.

However, in our RAG pipeline, we will not need to call these methods directly.
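
To show the shape of this interface without downloading a model, here is a toy sketch. `ToyEmbeddings` and `VOCAB` are hypothetical: the class only counts words from a tiny fixed vocabulary, so its vectors carry none of the semantic information a real embedding model provides. It does expose the same two method names, and it shows how a query vector is compared against document vectors.

```python
import math
import re

# Tiny fixed vocabulary (an assumption made for this toy example).
VOCAB = ["rag", "retrieval", "generation", "cat", "mascot", "school"]


class ToyEmbeddings:
    """Hypothetical bag-of-words embedder with the two method names above."""

    def embed_query(self, text: str) -> list[float]:
        words = re.findall(r"[a-z]+", text.lower())
        counts = [float(words.count(term)) for term in VOCAB]
        norm = math.sqrt(sum(c * c for c in counts)) or 1.0  # avoid dividing by zero
        return [c / norm for c in counts]  # unit-length vector

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(t) for t in texts]


toy = ToyEmbeddings()
toy_docs = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Our school's mascot is a cat.",
]
doc_vectors = toy.embed_documents(toy_docs)
query_vector = toy.embed_query("What is RAG?")

# For unit-length vectors, the dot product equals the cosine similarity.
scores = [sum(q * d for q, d in zip(query_vector, vec)) for vec in doc_vectors]
best_index = scores.index(max(scores))
print(toy_docs[best_index])
```

One call returns one vector per document, the other returns a single vector for the query, and ranking the documents is just a matter of comparing those vectors.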

In [None]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# 1. Create the Hugging Face embedding model
model_name = "all-MiniLM-L6-v2"

# Caution: THIS IS JUST A MODEL INTERFACE! NOT EMBEDDINGS!
embeddings_model_hf = HuggingFaceEmbeddings(
    model_name=model_name,
)

print(f"--- '{model_name}' model loaded ---")

# 2. Test: .embed_query() (a single text)
query_text_hf = "What is RAG?"
query_vector_hf = embeddings_model_hf.embed_query(query_text_hf)

print(f"\n--- Query: {query_text_hf} ---")
print(f"Embedding dimension: {len(query_vector_hf)}")
print("Embedding vector (first 5 values):")
print(query_vector_hf[:5])

# 3. Test: .embed_documents() (a list of chunks)
document_chunks_hf = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Our school's mascot is a cat."
]
document_vectors_hf = embeddings_model_hf.embed_documents(document_chunks_hf)

print(f"\n--- Embedded {len(document_vectors_hf)} documents. ---")
print(f"Doc 1 vector (first 5 values): {document_vectors_hf[0][:5]}")
print(f"Doc 2 vector (first 5 values): {document_vectors_hf[1][:5]}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


--- 'all-MiniLM-L6-v2' model loaded ---

--- Query: What is RAG? ---
Embedding dimension: 384
Embedding vector (first 5 values):
[-0.06957074254751205, 0.09520001709461212, 0.016021426767110825, 0.00680149719119072, -0.08840496093034744]

--- Embedded 2 documents. ---
Doc 1 vector (first 5 values): [-0.11001930385828018, 0.06776804476976395, 0.013266481459140778, -0.005145775154232979, -0.11221088469028473]
Doc 2 vector (first 5 values): [0.09851555526256561, 0.047536689788103104, 0.020794745534658432, -0.027100034058094025, -0.035211656242609024]


## [Own Data] We choose the `distiluse-base-multilingual-cased-v1` model (see [multilingual models](https://sbert.net/docs/sentence_transformer/pretrained_models.html#multilingual-models)) for our Korean dataset

In [None]:
# Check if we can use GPU (You will need to change the Colab Runtime to use GPU such as T4)
!nvidia-smi

Fri Oct 31 03:35:39 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   44C    P0             26W /   70W |     234MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
import torch
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# The parameters for this model are documented at https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer

# Use the GPU if the backend supports CUDA; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

embeddings = HuggingFaceEmbeddings(
    model_name="distiluse-base-multilingual-cased-v1",
    model_kwargs={"device": device},
)

That's it! Creating the model object is all we need to do here. We'll see why in the next section.

# 3. Vector Store & Retriever

## Goal: Learn how to store the "vectors" from Section 2 and quickly search for the vectors that are "closest" (most similar) to a given query vector.

## Why do we need a "Vector Store"?

In Section 2, we turned our text chunks into a `list[list[float]]` (a list of numeric vectors).

**The Problem: What if you have 1 million chunks?**

When a user asks a question, you would have to compare their query vector to all 1 million document vectors (e.g., a `for` loop that runs 1 million times). Doing this for every single query is far too slow.

**The Solution**: A vector store is a "special database" that stores vectors in a way that is optimized for search. It pre-organizes the vectors so that "neighbors" like "AI" and "Deep Learning" are grouped together, typically using approximate nearest neighbor (ANN) algorithms. This makes searching almost instantaneous.
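For reference, the naive linear scan described above can be sketched in a few lines of plain Python. The vectors below are made-up 3-dimensional values (real embeddings have hundreds of dimensions), and `brute_force_search` is a name chosen here for illustration. Every query touches every stored vector, which is exactly the cost a vector store tries to avoid.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vector, doc_vectors, k=2):
    """Score every stored vector against the query: O(N * d) work per query."""
    scored = [
        (cosine_similarity(query_vector, vec), idx)
        for idx, vec in enumerate(doc_vectors)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Made-up vectors for illustration only.
doc_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
query_vector = [1.0, 0.0, 0.0]

top = brute_force_search(query_vector, doc_vectors, k=2)
print(top)  # indices of the two most similar vectors
```

With 3 vectors this is instant; with 1 million vectors of several hundred dimensions, the same loop runs for every query.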

## But why is it a "Vector Store", not a "Vector Database"?

Although there is no single **correct** answer, [this link](https://www.tigerdata.com/learn/vector-store-vs-vector-database) provides a well-grounded explanation of the difference.

LangChain uses "Vector Store."

## LangChain provides a `VectorStore` interface for this purpose (https://docs.langchain.com/oss/python/integrations/vectorstores)

Full reference doc: [here](https://reference.langchain.com/python/langchain_core/vectorstores/#langchain_core.vectorstores.base.VectorStore)

`VectorStore` has many methods but the most important ones are:
* `add_documents` - Add documents to the store.
* `delete` - Remove stored documents by ID.
* `similarity_search` - Query for semantically similar documents.

In [None]:
# Let's use the following toy example.

# Sample document and chunk data
from langchain_core.documents import Document
chunks = [
    Document(page_content="RAG stands for Retrieval-Augmented Generation.", metadata={"source": "doc1.pdf", "page": 1}),
    Document(page_content="RAG is a great technique to reduce LLM hallucination.", metadata={"source": "doc1.pdf", "page": 2}),
    Document(page_content="Our school's mascot is a cute cat.", metadata={"source": "doc2.txt"}),
    Document(page_content="AI (Artificial Intelligence) is a field of computer science.", metadata={"source": "doc3.md"}),
]

# Sample embedding model (HuggingFace)
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
model_name = "all-MiniLM-L6-v2"
embeddings_model_hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)
print("--- Embedding model ready ---")


--- Embedding model ready ---


We will use an in-memory vector index [FAISS](https://github.com/facebookresearch/faiss).

Since we already have the document chunks and the embedding model, we do not need to `add` each document one by one. We can build the whole vector store in a single call (cf. a bulk insert) with the `from_documents` method, passing the `documents` and the `embedding` (the embedding model).

The `from_documents` method does the rest: it embeds every chunk and stores the resulting vectors.

In [None]:
from langchain_community.vectorstores import FAISS

print("--- Creating FAISS DB (Embedding chunks...) ---")
vectorstore_faiss = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings_model_hf
)
print("--- FAISS DB created (in memory) ---")

--- Creating FAISS DB (Embedding chunks...) ---
--- FAISS DB created (in memory) ---


Now we can perform a similarity search. The `similarity_search` method takes a parameter `k`, the number of results to return, i.e., the k nearest neighbors.

Notice that the metadata is returned together with each result.

In [None]:
query_faiss = "Tell me about the school mascot"

# .similarity_search()
retrieved_docs_faiss = vectorstore_faiss.similarity_search(query_faiss, k=2)

print(f"\n--- Query: '{query_faiss}' ---")
print(f"--- Found {len(retrieved_docs_faiss)} relevant documents ---")

print("\n--- [Result 1] ---")
print(retrieved_docs_faiss[0].page_content)
print(retrieved_docs_faiss[0].metadata)

print("\n--- [Result 2] ---")
print(retrieved_docs_faiss[1].page_content) # k=2 forces a second result, even though it is not very similar
print(retrieved_docs_faiss[1].metadata)


--- Query: 'Tell me about the school mascot' ---
--- Found 2 relevant documents ---

--- [Result 1] ---
Our school's mascot is a cute cat.
{'source': 'doc2.txt'}

--- [Result 2] ---
RAG stands for Retrieval-Augmented Generation.
{'source': 'doc1.pdf', 'page': 1}


In [None]:
# You can also filter the results by metadata.
# Here we intentionally restrict the search to an unrelated document (doc1.pdf).
retrieved_docs_faiss_filtered = vectorstore_faiss.similarity_search(query_faiss, k=2, filter={"source": "doc1.pdf"})

print(f"\n--- Query: '{query_faiss}' ---")
print(f"--- Found {len(retrieved_docs_faiss_filtered)} relevant documents ---")

print("\n--- [Result 1] ---")
print(retrieved_docs_faiss_filtered[0].page_content)
print(retrieved_docs_faiss_filtered[0].metadata)

print("\n--- [Result 2] ---")
print(retrieved_docs_faiss_filtered[1].page_content)
print(retrieved_docs_faiss_filtered[1].metadata)


--- Query: 'Tell me about the school mascot' ---
--- Found 2 relevant documents ---

--- [Result 1] ---
RAG stands for Retrieval-Augmented Generation.
{'source': 'doc1.pdf', 'page': 1}

--- [Result 2] ---
RAG is a great technique to reduce LLM hallucination.
{'source': 'doc1.pdf', 'page': 2}
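Conceptually, a metadata filter combined with `k` can be pictured as "rank by similarity, keep only the matching documents, return the first `k`". The sketch below illustrates this idea in plain Python with made-up scores for the four toy chunks; `matches` and `filtered_top_k` are hypothetical helpers, and the actual FAISS wrapper may order these steps differently internally.

```python
def matches(metadata, metadata_filter):
    """True when every key/value pair in the filter appears in the metadata."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


def filtered_top_k(scored_docs, k, metadata_filter):
    """scored_docs: list of (score, metadata) pairs; a higher score means more similar."""
    ranked = sorted(scored_docs, key=lambda pair: pair[0], reverse=True)
    kept = [pair for pair in ranked if matches(pair[1], metadata_filter)]
    return kept[:k]


# Made-up similarity scores for the query about the school mascot.
scored_docs = [
    (0.71, {"source": "doc2.txt"}),
    (0.12, {"source": "doc1.pdf", "page": 1}),
    (0.09, {"source": "doc1.pdf", "page": 2}),
    (0.05, {"source": "doc3.md"}),
]

filtered = filtered_top_k(scored_docs, k=2, metadata_filter={"source": "doc1.pdf"})
for score, metadata in filtered:
    print(score, metadata)
```

This is why the filtered search above still returns two results: the best match (`doc2.txt`) is excluded by the filter, so the next-best documents from `doc1.pdf` take its place.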


The example below uses [Chroma](https://www.trychroma.com/), which can persist to disk.

In [None]:
from langchain_community.vectorstores import Chroma

persist_directory = 'my_chroma_db'

print("--- Creating Chroma DB (Embedding chunks...) ---")
vectorstore_chroma = Chroma.from_documents( # the same method `from_documents`
    documents=chunks,
    embedding=embeddings_model_hf,
    persist_directory=persist_directory # Folder to save to
)
print(f"--- Chroma DB created at ({persist_directory}) ---")

# Now you can see a new folder named `my_chroma_db`!

# What if you already have a DB?
# On your second run, you "load" the DB instead of creating it.
# vectorstore_chroma = Chroma(
#     persist_directory=persist_directory,
#     embedding_function=embeddings_model_hf
# )

--- Creating Chroma DB (Embedding chunks...) ---
--- Chroma DB created at (my_chroma_db) ---


In [None]:
# Searching works the same way as with FAISS.

query = "What is RAG?"

retrieved_docs = vectorstore_chroma.similarity_search(query, k=2)

print(f"\n--- Query: '{query}' ---")
print(f"--- Found {len(retrieved_docs)} relevant documents ---")

print("\n--- [Result 1] ---")
print(retrieved_docs[0].page_content)
print(retrieved_docs[0].metadata)

print("\n--- [Result 2] ---")
print(retrieved_docs[1].page_content)
print(retrieved_docs[1].metadata)


--- Query: 'What is RAG?' ---
--- Found 2 relevant documents ---

--- [Result 1] ---
RAG stands for Retrieval-Augmented Generation.
{'source': 'doc1.pdf', 'page': 1}

--- [Result 2] ---
RAG is a great technique to reduce LLM hallucination.
{'page': 2, 'source': 'doc1.pdf'}


## [Own Data] Our example will use FAISS.
Reference doc: https://docs.langchain.com/oss/python/integrations/vectorstores/faiss

We build the vector store from scratch, using the `docs` and `embeddings` we prepared above.

Although this is an in-memory index, we can save it to disk and load it later.

In [None]:
# This takes a few minutes on a CPU,
# but only a few seconds on a GPU.
myvectorstore = FAISS.from_documents(docs, embeddings)

# Save the index to disk
myvectorstore.save_local("myfaissidx")
# A folder named `myfaissidx` now exists

In [None]:
from langchain_community.vectorstores import FAISS

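# You can load the stored index.
# You need to pass the same embedding model that was used to build it.
# `allow_dangerous_deserialization=True` is required because part of the index is stored as a pickle file;
# only load index files that you created yourself or otherwise trust.
new_vector_store = FAISS.load_local(
    "myfaissidx", embeddings, allow_dangerous_deserialization=True
)

# We pass only the query here, but you can also set `k`, `filter`, etc.
# Use a new variable name so that the original `docs` list is not overwritten.
results = new_vector_store.similarity_search("아프리카")  # "Africa"
print(results)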

[Document(id='1804e31d-185c-41c6-82a0-8a58c92c190c', metadata={'source': '/content/data/press_releases_pages_101-150.jsonl', 'seq_num': 89, 'id': '375068', 'title': '‘내가 만난 아프리카, 아프리카 다시 보다’', 'date': '2024-05-31'}, page_content='외교부(장관 조태열)는 2024 한ㆍ아프리카 정상회의(6.4.-6.5.) 개최를 계기로 미래 협력 파트너인 아프리카에 대한 우리 국민들의 아프리카 사진․영상 공모전(4.17.~5.10.)을 개최하였다. 이번 공모전에는 총 265점(사진 234점, 영상 31점)이 응모하였고 이 중 26점이 수상작으로 선정되었다.이번 공모전은 2개 주제로 나누어 진행되었다. 첫 번째 주제 <내가 만난 아프리카! 내 주변의 아프리카!>는 일상 속 또는 여행 중에 접한 아프리카의 친근하고 긍정적 이미지를, 두 번째 주제 <아프리카 다시 보다 새로 보다(Africa beyond Stereotypes)>는 아프리카에 대한 정형화된 선입견을 넘어 아프리카의 발전 가능성과 긍정적인 가치를 전달할 수 있는 사진․영상을 공모하였다.* 시상 규모 및 시상 금액 : 각 주제별 최우수상 1점(3백만원), 우수상 1점(2백만원), 장려상 1점(1백만원), 입선 10점(5만원 상당 문화상품권)<내가 만난 아프리카! 내 주변의 아프리카!> 주제의 최우수상은 ‘아프리카 아이들과 즐긴 한국의 전통놀이(영상)’로, 탄자니아에서 교육 봉사활동 중, 학생들과 한국의 전통 놀이를 함께 즐겼던 시간을 영상으로 담아내었다.우수상 ‘컬러풀 아프리카!(Colorful Africa!)(사진)’는 케냐 마사이 마을을 방문했을 때 화려한 색감의 전통 의상 슈카(Shuka)를 입고 밝게 웃어주는 마을 주민의 모습에서 친근함과 편안함을 느낀 경험을 소개하였고, 장려상 ‘시장에서(사진)’는 나란히 앉은 남매의 정겨운 모습을 사진

## What is a retriever? (optional, depending on your implementation)

A retriever is another interface provided by LangChain. It is more general than a vector store: every vector store can be cast into a retriever. You do not need to worry about it here. (https://docs.langchain.com/oss/python/integrations/retrievers)

Whether you need a retriever depends on your implementation. Below, we show how to cast your vector store into a retriever.

`as_retriever` will make a retriever from your vector store. (https://docs.langchain.com/oss/python/langchain/knowledge-base#4-retrievers)

It takes `search_type` and `search_kwargs` parameters.
* `search_type`: one of "similarity" (default), "mmr" (maximal marginal relevance; see [this](https://wikidocs.net/231585)), or "similarity_score_threshold".
* `search_kwargs`: keyword arguments to pass to the search function. These can include:
  * `k`: Number of documents to return (Default: 4)
  * `score_threshold`: Minimum relevance threshold for `similarity_score_threshold`
  * `fetch_k`: Number of documents to pass to the MMR algorithm (Default: 20)
  * `lambda_mult`: Diversity of results returned by MMR; 1 for minimum diversity and 0 for maximum (Default: 0.5)
  * `filter`: Filter by document metadata
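To see what `lambda_mult` does, here is a minimal pure-Python sketch of the MMR idea: repeatedly pick the candidate that is relevant to the query but not too similar to what has already been picked. The function name `mmr` and the 2-dimensional vectors are made up for illustration; `candidate_vecs` plays the role of the `fetch_k` nearest neighbors, and this is not LangChain's own implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k candidates, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
candidate_vecs = [
    [1.0, 0.05],  # most relevant
    [1.0, 0.06],  # near-duplicate of the first
    [0.7, 0.7],   # less relevant, but different
]

relevance_only = mmr(query_vec, candidate_vecs, k=2, lambda_mult=1.0)
more_diverse = mmr(query_vec, candidate_vecs, k=2, lambda_mult=0.3)
print(relevance_only, more_diverse)
```

With `lambda_mult=1.0` the near-duplicate is returned as the second result; with a lower value, the different document is preferred instead.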

In [None]:
myretriever = myvectorstore.as_retriever(search_type="mmr", search_kwargs={'k': 3, 'filter': {"source": "/content/data/press_releases_pages_1-50.jsonl"}})

In [None]:
myretriever.invoke("아프리카")

[Document(id='5b44af54-5b5e-4dce-af5d-56c01ff5c3c2', metadata={'source': '/content/data/press_releases_pages_1-50.jsonl', 'seq_num': 312, 'id': '375880', 'title': '조태열 외교장관,  연쇄 양자회담 개최(2.21.)', 'date': '2025-02-22'}, page_content='아프리카 진출 관문임을 강조하면서, 남아공에 진출한 우리 기업들에 대한 남아공 정부의 관심과 지원을 당부하였다.조 장관은 아타프 알제리 장관과의 회담에서 양국이 활발한 고위급 교류 뿐만 아니라 양국간 교역이 역대 최대 규모(39억불)를 기록하는 등 최근 양국 관계의 발전을 평가하면서, 알제리에 진출한 우리 기업들이 겪고 있는 애로사항들에 대한 알제리 정부의 지원을 당부하였다. 아타프 장관은 건설·인프라 분야에 있어 한국의 기여에 사의를 표하고, 한국 기업들의 알제리 내에서의 원활한 기업 활동을 위해 모든 노력을 다하겠다고 하는 한편, 방산, 스타트업 등 분야에서도 양국간 협력이 심화될 수 있기를 기대한다고 하였다. 양 장관은 양국이 모두 올해 유엔 안보리 비상임이사국으로서 북한, 중동 문제 등 국제 이슈들에 있어 긴밀히 소통하고 협력해 나가기로 하였다.남아공, 알제리 외교장관과의 양자 회담은 우리의 아프리카 주요협력국인 두 국가와의 양자관계를 강화하고 글로벌 사우스와의 협력을 증진하는 계기가 된 것으로 평가된다.'),
 Document(id='17b25d2b-6fa2-483b-a28d-6c0ef09fb7b4', metadata={'source': '/content/data/press_releases_pages_1-50.jsonl', 'seq_num': 14, 'id': '376192', 'title': '외교부 2025년 상반기 여행경보단계 정기조정', 'date': '2025-06-30'}, page_content='예멘, 시리아, 리비아, 우크라이나

# 4. LLM Integration & Building a RAG Agent

## Goal: Combine the VectorStore (Retriever) from Section 3 with an LLM to build a complete RAG system that generates referenced answers to questions.

Up to Section 3, we have completed the 'R' (Retrieval) part of RAG. So far we have:

* `docs`: Original documents
* `chunks`: Split document chunks
* `embeddings`: The embedding model
* `myvectorstore`: The vector database where chunks are stored
* `myretriever`: The retriever for searching the database

Now, it's time to implement the 'G' (Generation). We will bring in an LLM and tie everything together to create an "Agent."

## 1. What is an Agent?

The simplest analogy is: if an LLM is the 'brain,' an Agent is the 'brain' plus 'hands and feet (tools).'

- Brain (LLM): Reasons, makes decisions, and generates text.

- Tools (Hands/Feet): Perform actions like 'searching,' 'calculating,' or 'calling an API.'

## LangChain provides the `create_agent` function to create an agent in a few lines of code

* For agents in LangChain in general: [here](https://docs.langchain.com/oss/python/langchain/agents)
* For the specific usage of `create_agent`: [here](https://reference.langchain.com/python/langchain/agents/#langchain.agents)


### `LangChain` vs. `LangGraph`

While LangChain enables you to easily build agents and LLM applications, LangGraph provides low-level agent orchestration. If you want a highly customizable agent, you should consider using [LangGraph](https://docs.langchain.com/oss/python/langchain/overview). **However, in our lab, we will only use LangChain since the agent itself is not our focus.** (LangChain's agents are nevertheless built on LangGraph.)


### `tools` vs. `middleware`

We can build a RAG application using either `tools` or `middleware` (or both!).

* `tools`: the agent can **optionally** choose to use these. For example, an LLM can decide to use a Google Search tool if it thinks it cannot answer a query from its own knowledge. A calculator or a Python code executor are other good examples. (https://docs.langchain.com/oss/python/langchain/tools)
* `middleware` (this is LangChain's term): the agent **always** runs this, either **before** or **after** the model makes a decision. For example, we can use this to make the agent **always** search the vector store first and then inject the relevant documents into the prompt.
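The difference can be sketched without any LLM at all. Everything below is a hypothetical stand-in (`fake_retrieve`, `fake_llm`, and the `model_wants_search` flag are made up for illustration): the middleware-style function retrieves on every call, while the tool-style function retrieves only when the "model" asks for it.

```python
retrieval_calls = []  # records every retrieval so we can see when it ran


def fake_retrieve(query):
    """Hypothetical stand-in for a vector store search."""
    retrieval_calls.append(query)
    return "RAG stands for Retrieval-Augmented Generation."


def fake_llm(system_prompt, query):
    """Hypothetical stand-in for an LLM call; it only echoes its inputs."""
    return f"[system: {system_prompt}] answer to: {query}"


def answer_with_middleware(query):
    # Middleware style: retrieval always runs before the model is called.
    context = fake_retrieve(query)
    return fake_llm(f"Use this context: {context}", query)


def answer_with_tool(query, model_wants_search):
    # Tool style: retrieval runs only when the model decides to call the tool.
    # In a real agent the LLM makes this decision; here it is a plain flag.
    if model_wants_search:
        context = fake_retrieve(query)
        return fake_llm(f"Use this context: {context}", query)
    return fake_llm("No context.", query)


answer_with_middleware("What is RAG?")
answer_with_tool("What is 1 + 1?", model_wants_search=False)
print(len(retrieval_calls))  # only the middleware-style call retrieved
```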

### In this lab, we will set `tools=[]` (no optional tools) and use the `@dynamic_prompt` middleware to implement our RAG pipeline.

## 2. Load the LLM

LangChain provides the `init_chat_model` function to initialize a chat model (https://reference.langchain.com/python/langchain/models/#langchain.chat_models).

You can choose whichever model and model provider you want (https://docs.langchain.com/oss/python/integrations/chat), but we will use Google Gemini since it offers free usage.

Be careful with your usage limit. The model may be 'overloaded' and refuse to answer your query. If you use a paid billing account, you may be charged a fee (https://ai.google.dev/gemini-api/docs/pricing?hl=ko).

You can get a Google Gemini API key here: https://aistudio.google.com/

In [None]:
import os
from langchain.chat_models import init_chat_model

os.environ["GOOGLE_API_KEY"] = "ENTER YOUR API KEY" # enter your api key

model = init_chat_model("google_genai:gemini-2.5-flash") # you may change to another model if you wish

## 3. Create the RAG Middleware

LangChain supports two types of middleware: class-based and decorator-based (https://docs.langchain.com/oss/python/langchain/middleware#when-to-use-decorators). Since we don't need complicated middleware, we will use a decorator. If you are not familiar with Python decorators, please read [this](https://bluese05.tistory.com/30).
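As a quick refresher, a decorator is just a function that takes another function and returns a replacement for it. The sketch below uses a made-up decorator named `announce`; it is far simpler than LangChain's middleware decorators and only shows the `@` syntax itself.

```python
import functools


def announce(func):
    """A decorator: takes a function and returns a wrapped version of it."""

    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


@announce  # equivalent to: build_prompt = announce(build_prompt)
def build_prompt(context):
    return f"Use the following context in your response:\n\n{context}"


prompt = build_prompt("RAG stands for Retrieval-Augmented Generation.")
print(prompt)
```

Here the wrapped function still returns a string; a library decorator is free to return a different kind of object instead.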


The `@dynamic_prompt` decorator turns a function into middleware. This function runs just before the LLM is called, dynamically generating the system prompt that will be sent to the LLM.

We will first ask the LLM to 'rewrite' the given query, then retrieve the relevant documents, and finally generate the response.

### Example

Let's first look at the following example (taken from [here](https://docs.langchain.com/oss/python/langchain/rag#rag-agents)) to understand LangChain's related interfaces.

Imagine that this function is called right before the actual LLM API call by the agent. You may want to know more about the following things to fully understand the code.

1. `ModelRequest`: the model request information for the agent. You can find its definition [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain_v1/langchain/agents/middleware/types.py#L76). Note that it contains `state: AgentState`.
2. `AgentState`: the agent uses it to manage its short-term memory (https://docs.langchain.com/oss/python/langchain/short-term-memory#customizing-agent-memory). It has a `messages` field (https://github.com/langchain-ai/langchain/blob/master/libs/langchain_v1/langchain/agents/middleware/types.py#L166).
3. `messages`: the list of messages exchanged so far. They represent the input and output of the model (https://docs.langchain.com/oss/python/langchain/messages).
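The nesting of these three pieces can be pictured with plain Python objects. `ToyMessage` and `ToyModelRequest` below are hypothetical stand-ins written for illustration (the real classes carry more fields); the sketch only shows why the user's latest query is found at `request.state["messages"][-1]`.

```python
from dataclasses import dataclass


@dataclass
class ToyMessage:
    """Hypothetical stand-in for a chat message."""
    role: str
    content: str


@dataclass
class ToyModelRequest:
    """Hypothetical stand-in for the request object passed to the middleware."""
    state: dict  # plays the role of AgentState; holds the "messages" list


request = ToyModelRequest(
    state={
        "messages": [
            ToyMessage(role="user", content="What is RAG?"),
            ToyMessage(role="assistant", content="RAG stands for Retrieval-Augmented Generation."),
            ToyMessage(role="user", content="Why does it reduce hallucination?"),
        ]
    }
)

# The most recent message is the last element of the list.
last_query = request.state["messages"][-1].content
print(last_query)
```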



In [None]:
from langchain.agents.middleware import dynamic_prompt, ModelRequest
from langchain.agents import create_agent

@dynamic_prompt
def prompt_with_context(request: ModelRequest) -> str:
    """Inject context into state messages."""
    last_query = request.state["messages"][-1].text

    retrieved_docs = myvectorstore.similarity_search(last_query) # using the faiss vector store from our own dataset

    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    system_message = (
        "You are a helpful assistant. Use the following context in your response:"
        f"\n\n{docs_content}"
    )

    return system_message


agent = create_agent(model, tools=[], middleware=[prompt_with_context]) # using the Google Gemini model

The agent can be invoked as follows. The input is a dictionary with `messages` as the key.

In [None]:
# Query: "Explain the relationship between Korea and Japan"
agent.invoke({"messages": [{"role": "user", "content": "한국과 일본 관계에 대해 설명해줘"}]})

{'messages': [HumanMessage(content='한국과 일본 관계에 대해 설명해줘', additional_kwargs={}, response_metadata={}, id='50883425-8052-405f-83bd-d4173a4284fb'),
  AIMessage(content='제공된 문맥에 따르면 한국과 일본의 관계는 협력과 갈등이 공존하는 복합적인 양상을 띠고 있습니다.\n\n**협력 분야:**\n\n*   **북한 위협 대응:** 한국, 미국, 일본은 북한의 지속적인 무기 이전 및 군사 협력 심화에 대해 강력히 규탄하며, 지역 및 세계 안보에 대한 북한의 위협에 대응하기 위해 외교 및 안보 협력을 더욱 강화할 의사를 재확인했습니다. 특히 러시아-북한 파트너십의 발전에 대해 중대한 우려를 표명하며, 대화의 길이 열려 있음을 재확인하고 북한의 도발 중단과 협상 복귀를 촉구했습니다.\n*   **사이버 안보 협력:** 한미일 3국은 북한의 악의적인 사이버 활동 및 불법 수익 창출에 대응하기 위해 긴밀히 협력하고 있습니다. 북한 IT 인력에 대한 공고문을 발표하고, 민간 부문(특히 블록체인 및 프리랜서 업계)에 사이버 위협 경감 방안을 숙지하고 북한 IT 인력 고용 위험을 줄일 것을 권고하고 있습니다. 또한 북한 사이버 행위자에 대한 제재 지정, 인도-태평양 지역 내 사이버 보안 역량 강화 등 협력을 지속해 나갈 예정입니다.\n*   **경제 및 사회 문제 협력:** 한일 경제인들은 양국 관계 정상화에 따라 경제, 기술, 공급망에서의 협력뿐만 아니라 저출산, 지방소멸 등 양국이 함께 직면한 공통 과제를 해결하기 위한 협력이 필요하다고 강조했습니다. 외교부는 이러한 논의를 바탕으로 한일 국교정상화 60주년을 내실 있게 준비하고, 미래지향적 발전을 위한 구체적 방안을 모색하고 있습니다.\n\n**갈등 분야:**\n\n*   **독도 영유권 주장:** 일본 정부가 방위백서를 통해 독도에 대한 부당한 영유권 주장을 되풀이한 것에 대해 한국 정부는 강력

However, this raw output is hard to read. We can print only the final message in a clean format as follows.

In [None]:
result = agent.invoke({"messages": [{"role": "user", "content": "한국과 일본 관계에 대해 설명해줘"}]})
# `pretty_print()` prints the message itself and returns None, so we do not wrap it in `print()`.
result["messages"][-1].pretty_print()


제공된 정보를 바탕으로 한국과 일본의 관계는 다음과 같이 설명할 수 있습니다.

**협력과 공동의 우려:**

*   **북한 문제에 대한 공조:** 한국, 미국, 일본은 북한의 동향과 전술을 주시하며 긴밀히 협력하고 있습니다. 특히 러시아와 북한 간의 군사 협력 심화(무기 이전 포함)를 강력히 규탄하며, 이는 유엔 안보리 결의 위반이자 동북아시아와 유럽의 안정을 위협하는 중대한 우려 사항으로 보고 있습니다.
*   **사이버 위협 대응:** 한미일 3국은 북한의 악의적인 사이버 활동 및 불법 수익 창출에 대응하기 위해 협력을 강화하고 있습니다. 북한 IT 인력의 민간 부문 침투에 대한 경고문을 발표하고, 사이버 보안 역량 강화 및 제재 지정을 통해 공동 대응하고 있습니다. 민관 협력을 통해 불법 가상자산 관련 정보 공유 및 분석을 강화하고 있습니다.
*   **미래지향적 관계 구축 노력:** 양국 관계 정상화에 따라 한일 경제인 및 정부 부처 간 대화가 원활해지고 있으며, 기술, 경제, 공급망 협력뿐만 아니라 저출산, 지방 소멸 등 양국이 직면한 공통 과제 해결을 위한 협력의 필요성이 강조되고 있습니다. 외교부는 국교정상화 60주년을 맞아 미래지향적 관계 발전을 위한 다양한 의견을 수렴하고 있습니다.

**갈등 및 이견:**

*   **독도 영유권 문제:** 일본 정부가 방위백서를 통해 독도에 대한 부당한 영유권 주장을 되풀이하는 것에 대해 한국 정부는 강력히 항의하며 즉각 철회를 촉구하고 있습니다. 한국은 독도가 역사적, 지리적, 국제법적으로 명백한 한국 고유의 영토임을 강조하며, 일본의 어떠한 도발에도 단호히 대응할 것임을 밝히고 있습니다. 이 문제는 미래지향적 한일 관계 구축에 도움이 되지 않는다고 지적하고 있습니다.

요약하자면, 한국과 일본은 북한의 위협과 사이버 안보 같은 공동의 안보 문제에 대해서는 미국을 포함한 3국 협력을 통해 긴밀히 공조하고 있으며, 경제 및 사회 문제 해결을 위한 미래지향적 관계 구축 노력을 기울이고 있습니다. 그러나 

In [None]:
result["messages"][-1]

AIMessage(content='제공된 문맥을 바탕으로 한국과 일본의 관계는 다음과 같이 설명할 수 있습니다:\n\n1.  **독도 영유권 주장으로 인한 갈등:**\n    *   일본 정부가 방위백서를 통해 독도에 대한 부당한 영유권 주장을 되풀이하는 것에 대해 한국 정부는 강력히 항의하며 즉각 철회를 촉구합니다.\n    *   한국은 독도가 역사적, 지리적, 국제법적으로 명백한 한국 고유의 영토임을 분명히 하며, 일본의 어떠한 주장도 한국의 주권에 영향을 미치지 못한다고 강조합니다.\n    *   한국은 독도에 대한 일본의 어떠한 도발에 대해서도 단호히 대응할 것임을 밝히며, 일본의 부당한 주장이 미래지향적 한일 관계 구축에 도움이 되지 않는다고 경고합니다.\n\n2.  **북한 문제 및 사이버 안보 협력 (주로 한미일 3국 틀 내에서):**\n    *   한국과 일본은 북한의 동향과 전술을 주시하고 있으며, 북한의 악의적 사이버 활동 및 불법 수익 창출에 대응하기 위해 긴밀히 협력합니다.\n    *   양국(미국 포함) 정부 기관들은 북한 IT 인력의 민간 부문 내부자 위협에 대한 공고문을 발표해 왔습니다.\n    *   한미일 3국은 블록체인 및 프리랜서 업계 등 민간 부문 단체들이 사이버 위협 경감 방안을 숙지하고 북한 IT 인력 고용 위험을 경감할 것을 권고합니다.\n    *   민관 협력 심화를 통해 북한의 사이버 범죄 활동을 차단하고 국제 금융 시스템의 안전을 지키는 데 필수적이라고 인식합니다.\n    *   북한 사이버 행위자에 대한 제재 지정, 사이버 보안 역량 강화 등 협력을 지속해 나갈 것입니다.\n\n3.  **외교적 교류 및 관계 발전 노력:**\n    *   양국 외교차관 회담(김홍균 외교부 제1차관-오카노 마사타카 일본 외무성 사무차관) 등을 통해 양국 관계 전반, 한일·한미일 협력 및 북한 문제 등에 대해 의견을 교환합니다.\n    *   양측은 한일관계의 중요성을 인식하고 변함없이 관계를 유지·발전시켜 나가자는

Printing only the last message is fine here because we only need the final response. If you also want to see the intermediate steps, you can use streaming (https://docs.langchain.com/oss/python/langchain/streaming).

In [None]:
# Query: "What is the current state of cooperation between Africa and Korea?"
query = "아프리카와 한국의 협력 현황은?"

for step in agent.stream(
    {"messages": [{"role": "user", "content": query}]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()


아프리카와 한국의 협력 현황은?

현재 한국은 안보리 비상임이사국으로서 아프리카 역내 평화와 안정을 위해 노력하고 있습니다.

특히 작년 한-아프리카 정상회의를 통해 동아프리카 주요 협력국인 케냐와의 우호 관계를 돈독히 하고, 양국 간 실질 협력 확대 방안을 모색하는 계기가 되었습니다.


### [Own Data] Let's add 'query rewriting' to this.

We will use `ChatPromptTemplate` (https://reference.langchain.com/python/langchain_core/prompts/) to build a prompt that first asks the LLM to rewrite the given query.

In [None]:
from langchain.agents.middleware import dynamic_prompt, ModelRequest
from langchain_core.prompts import ChatPromptTemplate

# We use 'myvectorstore' created in Section 3.
print(f"Vector store (myvectorstore) is ready: {myvectorstore is not None}")

@dynamic_prompt
def prompt_with_context_and_rewrite(request: ModelRequest) -> str:
    # 1. Extract the user's last query from request.state["messages"]
    last_query = request.state["messages"][-1].content
    print(f"\n--- [Middleware] Original Query: '{last_query}' ---")

    # 2. Request to rewrite the query
    rewrite_system_msg = """You are an expert query assistant. Your task is to rewrite the user's question into an optimized query for a vector database search. Your rewritten query will be used for similarity search.
    Only output the rewritten query."""

    # Make a template
    rewrite_template = ChatPromptTemplate(
        [
            ("system", rewrite_system_msg),
            ("human", "{user_input}")
        ]
    )

    # Fill in the template with the query content
    rewrite_prompt_value = rewrite_template.invoke(
        {
            "user_input": last_query,
        }
    )

    rewrite_response = model.invoke(rewrite_prompt_value.messages)

    rewritten_query = rewrite_response.content
    print(f"--- [Middleware] Rewritten Query: '{rewritten_query}' ---")

    # 3. Search for documents
    try:
        retrieved_docs = myvectorstore.similarity_search(rewritten_query, k=3)  # Get top 3
    except Exception as e:
        print(f"Check your vector store: {e}")
        retrieved_docs = []

    # 4. Join the page_content of the retrieved docs into a single string
    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    print(f"--- [Middleware] Retrieved {len(retrieved_docs)} docs ---")

    # 5. Dynamically create the system prompt to be sent to the LLM
    system_message = (
        "You are a helpful assistant. Use the following context in your response:"
        "\n\n--- CONTEXT ---"
        f"\n{docs_content}"
        "\n--- END CONTEXT ---"
    )

    return system_message

agent = create_agent(model, tools=[], middleware=[prompt_with_context_and_rewrite])

Vector store (myvectorstore) is ready: True


In [None]:
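# Query: "I have to do an assignment, so tell me about the current state of cooperation between Africa and Korea.
# The vector database probably contains press releases from the Ministry of Foreign Affairs."
query = "내가 과제를 해야하는데, 아프리카와 한국 협력 현황에 대해 알려줘. 지금 벡터 데이터베이스에는 아마 외교부의 보도자료들이 들어있을거야."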

for step in agent.stream(
    {"messages": [{"role": "user", "content": query}]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()


내가 과제를 해야하는데, 아프리카와 한국 협력 현황에 대해 알려줘. 지금 벡터 데이터베이스에는 아마 외교부의 보도자료들이 들어있을거야.

--- [Middleware] Original Query: '내가 과제를 해야하는데, 아프리카와 한국 협력 현황에 대해 알려줘. 지금 벡터 데이터베이스에는 아마 외교부의 보도자료들이 들어있을거야.' ---
--- [Middleware] Rewritten Query: '한국-아프리카 협력 현황' ---
--- [Middleware] Retrieved 3 docs ---

제공해주신 외교부 보도자료 내용을 바탕으로 한국과 아프리카 국가들 간의 협력 현황을 정리해 드릴게요. 과제에 활용하시기 좋을 겁니다.

---

### 한국-아프리카 협력 현황 (주요 내용)

현재 한국과 아프리카 국가들 간의 협력은 다가오는 **「한-아프리카 정상회의」**를 중심으로 매우 활발하게 진행되고 있으며, 양자 및 다자 차원에서 다양한 분야로 확대되고 있습니다.

**1. 「한-아프리카 정상회의」 개최 준비 및 의미:**
*   **주요 행사:** 오는 6월 4~5일 개최 예정인 「한-아프리카 정상회의」는 한국과 아프리카 국가들 간의 협력을 강화하는 가장 중요한 계기입니다.
*   **참여 열기:** 적도기니 대통령은 조기에 참석을 확정했으며, 적도기니 외교장관의 2006년 이후 18년 만의 공식 방한은 이 정상회의를 앞두고 양국 관계를 점검하고 실질 협력 확대 방안을 협의하는 중요한 계기가 되었습니다.
*   **AU의 지지:** 아프리카 연합(AU)은 한국 정부가 아프리카와의 협력을 강화하고자 하는 의지를 높이 평가하며, 2024 한-아프리카 정상회의의 성공적인 개최를 위해 AU 차원에서 지원을 다할 것이라고 밝혔습니다.

**2. 양자 협력 강화:**
*   **케냐:** 작년 한-아프리카 정상회의를 통해 다져진 동아프리카 주요 협력국인 케냐와의 우호 관계를 돈독히 하고, 양국 간 실질협력 확대 방안을 모색하고 있습니다.
*   **적도기니:

# 5. Evaluate the RAG Agent

This part is based on LangSmith's [tutorial](https://docs.langchain.com/langsmith/evaluate-rag-tutorial).

## Goal: Evaluate "how well" our RAG agent works using LangSmith.

LangSmith is a platform used to debug and evaluate LLM applications. You need to sign up and get a LangSmith API key first: https://docs.langchain.com/langsmith/home#get-started


In [None]:
import os
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "YOUR LANGSMITH API KEY" # enter your langsmith api key here

But why do we need to evaluate our agent? Because we don't know if it is doing well. For example:
* Retrieval Evaluation: Is our similarity search retrieving relevant documents?
* Generation Evaluation: Is our LLM generating the response faithfully based on the retrieved results?

We will need multiple criteria for evaluation:
* **Relevance**: Response vs. input (“how well does the generated response address the initial user input”)
* **Groundedness**: Response vs. retrieved docs (“to what extent does the generated response agree with the retrieved context”)
* **Retrieval relevance**: Retrieved docs vs. input (“how relevant are my retrieved results for this query”)
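To make clear which two texts each criterion compares, here is a deliberately crude sketch that scores each pair by word overlap. `token_overlap` is a made-up proxy used only for illustration; it is not how LangSmith grades responses, and a real evaluator (for example, an LLM-based grader) judges meaning rather than shared words.

```python
def token_overlap(text_a, text_b):
    """Fraction of text_a's distinct words that also appear in text_b (a crude proxy)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return len(words_a & words_b) / len(words_a) if words_a else 0.0


# Made-up example texts.
question = "what is rag"
retrieved = "rag stands for retrieval-augmented generation"
answer = "rag is retrieval-augmented generation"

scores = {
    "relevance": token_overlap(question, answer),               # response vs. input
    "groundedness": token_overlap(answer, retrieved),           # response vs. retrieved docs
    "retrieval_relevance": token_overlap(question, retrieved),  # retrieved docs vs. input
}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Note that two of the three scores need the retrieved documents, not just the question and the final answer.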

To evaluate against these criteria, we need the retrieved docs for each query. However, our agent returns only the final response; the documents retrieved inside the middleware are not exposed to the caller. That is why we need a helper to store the context.

## 1. Build an Evaluation Helper for our Agent

This is perhaps not the most elegant solution to this issue, but it lets us evaluate the agent without significantly modifying it.

What the helper does is very simple: it stores the retrieved documents and provides them later.

In [None]:
from langchain_core.documents import Document

# (1) Define a simple helper class
class RAGContextHolder:
    def __init__(self):
        # A variable to store the most recently retrieved docs
        self.last_retrieved_docs = []

    def set_docs(self, docs: list[Document]):
        """Called by the middleware to save the retrieved docs"""
        self.last_retrieved_docs = docs

    def get_docs(self) -> list[Document]:
        """Called by the evaluation function to get the saved docs"""
        return self.last_retrieved_docs

# (2) Create a "global" instance of this class
context_holder = RAGContextHolder()

print("--- Context Holder Ready ---")

--- Context Holder Ready ---


Now let's put the helper's `set_docs` method inside our agent.

In [None]:
from langchain.agents.middleware import dynamic_prompt, ModelRequest
from langchain_core.prompts import ChatPromptTemplate

print(f"Vector store (myvectorstore) is ready: {myvectorstore is not None}")

@dynamic_prompt
def prompt_with_context_and_rewrite(request: ModelRequest) -> str:
    last_query = request.state["messages"][-1].content
    print(f"\n--- [Middleware] Original Query: '{last_query}' ---")

    rewrite_system_msg = """You are an expert query assistant. Your task is to rewrite the user's question into an optimized query for a vector database search. Your rewritten query will be used for similarity search.
    Only output the rewritten query."""

    rewrite_template = ChatPromptTemplate(
        [
            ("system", rewrite_system_msg),
            ("human", "{user_input}")
        ]
    )

    rewrite_prompt_value = rewrite_template.invoke(
        {
            "user_input": last_query,
        }
    )

    rewrite_response = model.invoke(rewrite_prompt_value.messages)

    rewritten_query = rewrite_response.content
    print(f"--- [Middleware] Rewritten Query: '{rewritten_query}' ---")

    try:
        retrieved_docs = myvectorstore.similarity_search(rewritten_query, k=3) # Get top 3
    except Exception as e:
        print(f"Check your vector store: {e}")
        retrieved_docs = []

    ############### NEW STEP - store the retrieved docs ##################
    context_holder.set_docs(retrieved_docs)
    print(f"--- [Middleware] Saved {len(retrieved_docs)} docs to Context Holder ---")
    ######################################################################

    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    print(f"--- [Middleware] Retrieved {len(retrieved_docs)} docs ---")

    system_message = (
        "You are a helpful assistant. Use the following context in your response:"
        "\n\n--- CONTEXT ---"
        f"\n{docs_content}"
        "\n--- END CONTEXT ---"
    )

    return system_message

agent = create_agent(model, tools=[], middleware=[prompt_with_context_and_rewrite])

Vector store (myvectorstore) is ready: True


## 2. Create the Agent Wrapper Function for Evaluation

With the helper above, we write a wrapper around the agent so that LangSmith can call it during evaluation. LangSmith needs the agent's answer and the retrieved documents, so the wrapper only has to return these two things.

It is very important that we return a dictionary with two keys, `answer` and `documents`. The evaluators will read both.


In [None]:
def run_agent_for_evaluation(input_query: str) -> dict:
    """
    A wrapper function that the LangSmith evaluation target will call.
    input_query is the user's question as a plain string.
    """

    # 1. Run the agent
    # (This call internally triggers the 'prompt_with_context_and_rewrite' middleware)
    result = agent.invoke({"messages": [{"role": "user", "content": input_query}]})
    answer = result["messages"][-1].content

    # 2. Get the "hidden" retrieved docs
    retrieved_docs = context_holder.get_docs()

    # 3. Return in the format required by the evaluation tutorial
    return {
        "answer": answer,
        "documents": [d.page_content for d in retrieved_docs]
    }

# test
print("--- Wrapper Function Test ---")
# The query means: "What is the current state of cooperation between Korea and Africa?"
test_output = run_agent_for_evaluation("한국과 아프리카의 협력 현황은?")
print(f"Answer: {test_output['answer'][:50]}...")
print(f"Documents Count: {len(test_output['documents'])}")

--- Wrapper Function Test ---

--- [Middleware] Original Query: '한국과 아프리카의 협력 현황은?' ---
--- [Middleware] Rewritten Query: '한국 아프리카 협력 현황' ---
--- [Middleware] Saved 3 docs to Context Holder ---
--- [Middleware] Retrieved 3 docs ---
Answer: 한국과 아프리카는 다양한 분야에서 협력을 강화하고 있습니다. 제시된 문맥에 따르면 다음과 ...
Documents Count: 3
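
The return contract described above can be checked with a small helper before running a full evaluation. This is only a sketch; `check_wrapper_output` is a hypothetical name, and the sample dictionary is hand-written rather than produced by the agent.

```python
def check_wrapper_output(output: dict) -> None:
    """Raise AssertionError if `output` lacks the shape the evaluators read."""
    assert isinstance(output, dict), "wrapper must return a dict"
    assert isinstance(output.get("answer"), str), "'answer' must be a string"
    documents = output.get("documents")
    assert isinstance(documents, list), "'documents' must be a list"
    assert all(isinstance(d, str) for d in documents), "'documents' must hold strings"


# Hand-written output shaped like the wrapper's return value.
sample_output = {"answer": "some answer", "documents": ["chunk 1", "chunk 2"]}
check_wrapper_output(sample_output)
print("output shape OK")
```

In the notebook, the same check could be applied to `test_output` from the cell above.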


We also need to define a target function. LangSmith passes each example's `inputs` dictionary to it, and it forwards the `query` value to the wrapper function.

In [None]:
def target(inputs: dict) -> dict:
    return run_agent_for_evaluation(inputs["query"])
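
Conceptually, an evaluation run calls the target once per dataset example and then passes the same `inputs`, together with the target's `outputs`, to every evaluator. The loop below only illustrates that data flow; it is not LangSmith's implementation, and `fake_target`, `has_documents`, and `run_toy_evaluation` are hypothetical names.

```python
def fake_target(inputs: dict) -> dict:
    """Stand-in for `target`: returns a canned answer instead of calling the agent."""
    return {"answer": f"echo: {inputs['query']}", "documents": ["chunk 1"]}


def has_documents(inputs: dict, outputs: dict) -> int:
    """Toy evaluator: 1 if at least one document was retrieved, else 0."""
    return int(len(outputs["documents"]) > 0)


def run_toy_evaluation(target_fn, examples, evaluators):
    """Call the target on each example, then every evaluator on (inputs, outputs)."""
    rows = []
    for example in examples:
        inputs = example["inputs"]
        outputs = target_fn(inputs)
        scores = {ev.__name__: ev(inputs, outputs) for ev in evaluators}
        rows.append({"inputs": inputs, "outputs": outputs, "scores": scores})
    return rows


toy_rows = run_toy_evaluation(
    fake_target,
    [{"inputs": {"query": "hello"}}],
    [has_documents],
)
print(toy_rows[0]["scores"])
```

This is why the target, the evaluators, and the dataset all have to agree on the same key names.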

## 3. Prepare Evaluators

So far, we have slightly modified our agent to enable evaluation. Now we prepare the actual evaluators.

We will use the [LLM-as-judge](https://docs.langchain.com/langsmith/evaluation-concepts#llm-as-judge) concept for our evaluators. It is nothing special: we let another LLM read the inputs and return a score. This is useful because we do not have a "ground truth" for our RAG application, e.g., we have no reference answer about cooperation between Korea and Africa.

### Evaluator (1) [Relevance: Response vs input](https://docs.langchain.com/langsmith/evaluate-rag-tutorial#relevance%3A-response-vs-input)

We use the tutorial's evaluator. You don't need to understand everything here.

* We define a schema for the structured output of the evaluator LLM.
* We prepare a prompt to evaluate relevance.
* We create the actual evaluator function that will be passed to the LangSmith framework.

Be aware that this evaluator expects an `inputs` dictionary with a `query` key. We will need to prepare the dataset accordingly.

In [None]:
from typing_extensions import Annotated, TypedDict
from langchain.messages import HumanMessage, SystemMessage

# output schema for structured output
class RelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[
        int, ..., "Score from 1 to 5, where 5 is most relevant and 1 is least relevant"
    ]

# Grade prompt
relevance_instructions = """You are an impartial evaluator. Your task is to assess the relevance of a provided ANSWER to a given QUESTION using a 1-5 score.

You will be given a QUESTION and an ANSWER. Here are the grading criteria:
- **1 (Poor):** The ANSWER is completely off-topic, evasive, or does not address the QUESTION at all.
- **2 (Fair):** The ANSWER is tangentially related but does not directly answer the core of the QUESTION.
- **3 (Average):** The ANSWER partially addresses the QUESTION but misses key aspects or includes irrelevant information.
- **4 (Good):** The ANSWER directly addresses the QUESTION and is helpful, but could be slightly more complete or concise.
- **5 (Excellent):** The ANSWER directly, fully, and helpfully addresses the QUESTION's intent.

Explain your reasoning in a step-by-step manner. First, analyze the question's intent. Second, analyze the answer's content. Finally, provide your score from 1 to 5.
"""

# Grader LLM
relevance_llm = model.with_structured_output(
    RelevanceGrade, method="json_schema", strict=True
)

# Evaluator
def relevance(inputs: dict, outputs: dict) -> int:
    messages = [
        SystemMessage(content=relevance_instructions),
        HumanMessage(content=f"QUESTION: {inputs['query']}\nANSWER: {outputs['answer']}")
    ]
    grade = relevance_llm.invoke(messages)
    return grade["relevant"]
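
The schema types `relevant` as an integer, but the 1 to 5 range appears only in the field description and the prompt, so a grader could in principle return a value outside it. A minimal guard, assuming we prefer to fail loudly rather than record an out-of-range score, might look like this (`checked_score` is a hypothetical helper, and the sample grade is hand-written):

```python
def checked_score(grade: dict, key: str, low: int = 1, high: int = 5) -> int:
    """Return grade[key] as an int, raising ValueError if it is outside [low, high]."""
    score = int(grade[key])
    if not low <= score <= high:
        raise ValueError(f"{key} score {score} is outside {low}-{high}")
    return score


# Hand-written grade shaped like RelevanceGrade.
sample_grade = {"explanation": "Directly answers the question.", "relevant": 4}
print(checked_score(sample_grade, "relevant"))
```

An evaluator could then return `checked_score(grade, "relevant")` instead of reading the key directly.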

### Evaluator (2) [Groundedness: Response vs retrieved docs](https://docs.langchain.com/langsmith/evaluate-rag-tutorial#groundedness%3A-response-vs-retrieved-docs)

This evaluator is very similar, but it compares the answer against the retrieved documents instead of the question.

In [None]:
# Grade output schema
class GroundedGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    grounded: Annotated[
        int, ..., "Score from 1 to 5, where 5 is fully grounded and 1 is hallucinated"
    ]

# Grade prompt
grounded_instructions = """You are an impartial evaluator. Your task is to assess whether an ANSWER is "grounded in" a set of provided CONTEXTS using a 1-5 score.

You will be given a set of CONTEXTS and an ANSWER. Here are the grading criteria:
- **1 (Not Grounded):** The ANSWER contains significant information or claims that are NOT supported by the CONTEXTS (i.e., hallucination).
- **2 (Poorly Grounded):** The ANSWER contains some claims that are not supported, or significantly misrepresents the CONTEXTS.
- **3 (Partially Grounded):** The ANSWER is mostly supported by the CONTEXTS, but may contain minor claims or details not found in the CONTEXTS.
- **4 (Well Grounded):** The ANSWER is almost entirely supported by the CONTEXTS, with only very minor embellishments.
- **5 (Fully Grounded):** Every single claim in the ANSWER is explicitly supported by the provided CONTEXTS.

Explain your reasoning in a step-by-step manner. First, break down the ANSWER into individual claims. Second, for each claim, check if it is supported by the CONTEXTS. Finally, provide your score from 1 to 5.
"""

# Grader LLM
grounded_llm = model.with_structured_output(
    GroundedGrade, method="json_schema", strict=True
)

# Evaluator
def groundedness(inputs: dict, outputs: dict) -> int:
    # The 'run_agent_for_evaluation' wrapper returns a list of strings in the 'documents' key
    if not outputs["documents"]:
        # If no document was retrieved, any answer (other than "I don't know") is by definition ungrounded.
        return 1

    doc_string = "\n\n".join(outputs["documents"])

    answer_string = f"CONTEXTS: {doc_string}\n\nANSWER: {outputs['answer']}"

    messages = [
        SystemMessage(content=grounded_instructions),
        HumanMessage(content=answer_string)
    ]

    grade = grounded_llm.invoke(messages)
    return grade["grounded"]
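
To see exactly what the grader receives, the message-building step can be pulled out into a pure function and run without any LLM call. This is a sketch of the same string assembly as above; `build_grounded_message` is a hypothetical name, and returning `None` stands for the "no documents" shortcut that yields a score of 1.

```python
from typing import Optional


def build_grounded_message(outputs: dict) -> Optional[str]:
    """Return the human message for the grader, or None when no docs were retrieved."""
    if not outputs["documents"]:
        return None
    doc_string = "\n\n".join(outputs["documents"])
    return f"CONTEXTS: {doc_string}\n\nANSWER: {outputs['answer']}"


message = build_grounded_message(
    {"answer": "The sky is blue.", "documents": ["The sky is blue.", "Grass is green."]}
)
print(message)
```

Note that the documents are separated only by blank lines, so the grader cannot tell where one retrieved chunk ends and the next begins.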

### Evaluator (3) [Retrieval relevance: Retrieved docs vs input](https://docs.langchain.com/langsmith/evaluate-rag-tutorial#retrieval-relevance%3A-retrieved-docs-vs-input)

The third evaluator follows the same pattern, but it compares the retrieved documents against the question.



In [None]:
# Grade output schema
class RetrievalRelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    retrieval: Annotated[
        int,
        ...,
        "Score from 1 to 5, where 5 is highly relevant and 1 is not relevant",
    ]

# Grade prompt
retrieval_relevance_instructions = """You are an impartial evaluator. Your task is to assess the relevance of a set of retrieved CONTEXTS to a given QUESTION using a 1-5 score.

You will be given a QUESTION and a set of CONTEXTS. Here are the grading criteria:
- **1 (Poor):** ALL retrieved CONTEXTS are completely irrelevant to the QUESTION.
- **2 (Fair):** Most CONTEXTS are irrelevant, but one or two might be tangentially related.
- **3 (Average):** Some CONTEXTS are relevant to the QUESTION, but many are irrelevant or contain noise.
- **4 (Good):** Most CONTEXTS are relevant and helpful for answering the QUESTION.
- **5 (Excellent):** ALL retrieved CONTEXTS are highly relevant and crucial for answering the QUESTION.

Explain your reasoning in a step-by-step manner. First, analyze the QUESTION's intent. Second, examine each CONTEXT for its relevance. Finally, provide your score from 1 to 5 based on the overall relevance of the set.
"""

# Grader LLM
retrieval_relevance_llm = model.with_structured_output(RetrievalRelevanceGrade, method="json_schema", strict=True)

def retrieval_relevance(inputs: dict, outputs: dict) -> int:
    """An evaluator for document relevance"""

    if not outputs["documents"]:
        return 1 # No contexts retrieved, so they cannot be relevant.

    doc_string = "\n\n".join(outputs["documents"])

    answer_string = f"CONTEXTS: {doc_string}\n\nQUESTION: {inputs['query']}"

    messages = [
        SystemMessage(content=retrieval_relevance_instructions),
        HumanMessage(content=answer_string)
    ]

    # Run evaluator
    grade = retrieval_relevance_llm.invoke(messages)
    return grade["retrieval"]
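
All three evaluators discard the grader's `explanation`. To our knowledge, a LangSmith custom evaluator may also return a dictionary with `key`, `score`, and `comment`, which would keep the explanation next to the score; treat that return shape as an assumption and check the LangSmith documentation for your version. The sketch below uses a hypothetical `stub_grader` in place of the LLM so that it runs offline.

```python
def make_retrieval_evaluator(grader):
    """Build an evaluator around `grader`, any callable mapping a prompt to a grade dict."""

    def retrieval_relevance_with_comment(inputs: dict, outputs: dict) -> dict:
        if not outputs["documents"]:
            return {
                "key": "retrieval_relevance",
                "score": 1,
                "comment": "No documents retrieved.",
            }
        doc_string = "\n\n".join(outputs["documents"])
        grade = grader(f"CONTEXTS: {doc_string}\n\nQUESTION: {inputs['query']}")
        return {
            "key": "retrieval_relevance",
            "score": grade["retrieval"],
            "comment": grade["explanation"],
        }

    return retrieval_relevance_with_comment


def stub_grader(prompt: str) -> dict:
    """Hypothetical grader used only for this demo; a real one would call the LLM."""
    return {"explanation": "stub explanation", "retrieval": 3}


evaluator = make_retrieval_evaluator(stub_grader)
result = evaluator({"query": "q"}, {"documents": ["chunk"]})
print(result)
```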

## 4. Prepare Evaluation Dataset

Finally, we make a dataset. Remember to use the same keys, `inputs` and `query`, that the target function and the evaluators expect.

In [None]:
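examples = [
    {
        # "What is the current state of cooperation between Korea and Africa?"
        "inputs": {"query": "한국과 아프리카의 협력 현황은?"},
    },
    {
        # "What about trilateral coordination among Korea, the US, and Japan?"
        "inputs": {"query": "한-미-일 3국간의 공조는?"},
    },
]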
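
A mistyped key would only surface as a `KeyError` once the evaluation is running, so it can help to check the examples first. A minimal sketch, with `find_bad_examples` as a hypothetical helper and hand-written sample data:

```python
def find_bad_examples(examples: list, required_key: str = "query") -> list:
    """Return indices of examples lacking inputs[required_key] as a non-empty string."""
    bad = []
    for i, example in enumerate(examples):
        value = example.get("inputs", {}).get(required_key)
        if not isinstance(value, str) or not value.strip():
            bad.append(i)
    return bad


sample_examples = [
    {"inputs": {"query": "a valid question"}},
    {"inputs": {"question": "wrong key"}},  # `target` would raise KeyError on this one
]
print(find_bad_examples(sample_examples))
```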

We upload these examples to LangSmith through its `Client`.

In [None]:
from langsmith import Client

client = Client()
dataset_name = "RAG evaluation_01"
dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    dataset_id=dataset.id,
    examples=examples
)

{'example_ids': ['abaefa0a-493a-48b4-8568-f4309ce0c463',
  '2c34bb25-a6d6-47f9-b63c-82970b3f743a'],
 'count': 2}
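
Re-running the cell above with the same `dataset_name` will typically fail because the dataset already exists. One possible workaround is a get-or-create helper. The sketch below assumes the LangSmith `Client` methods `has_dataset`, `read_dataset`, and `create_dataset` accept a `dataset_name` keyword; `FakeClient` is a hypothetical in-memory stand-in so the snippet runs without a LangSmith account.

```python
def get_or_create_dataset(client, dataset_name: str):
    """Return the dataset named `dataset_name`, creating it only if it is absent."""
    if client.has_dataset(dataset_name=dataset_name):
        return client.read_dataset(dataset_name=dataset_name)
    return client.create_dataset(dataset_name=dataset_name)


class FakeClient:
    """Hypothetical stand-in for langsmith.Client, for demonstration only."""

    def __init__(self):
        self.datasets = {}
        self.create_calls = 0

    def has_dataset(self, *, dataset_name):
        return dataset_name in self.datasets

    def read_dataset(self, *, dataset_name):
        return self.datasets[dataset_name]

    def create_dataset(self, *, dataset_name):
        self.create_calls += 1
        self.datasets[dataset_name] = {"name": dataset_name}
        return self.datasets[dataset_name]


fake_client = FakeClient()
first_call = get_or_create_dataset(fake_client, "RAG evaluation_01")
second_call = get_or_create_dataset(fake_client, "RAG evaluation_01")
print(fake_client.create_calls)
```

When an existing dataset is reused, calling `create_examples` again would add the same examples a second time, so that step should be skipped as well.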

In [None]:
experiment_results = client.evaluate(
    target,
    data=dataset_name,
    evaluators=[groundedness, relevance, retrieval_relevance],
    experiment_prefix="rag-doc-relevance",
    metadata={"version": "none"},
)

View the evaluation results for experiment: 'rag-doc-relevance-94b88aec' at:
https://smith.langchain.com/o/97ed7887-c202-4703-b790-2ee48dc6156d/datasets/b6ada865-8ec9-44a7-839b-bf4635ca8f76/compare?selectedSessions=df5f901d-6f8b-48d9-80e6-d6a165ccf702






--- [Middleware] Original Query: '한-미-일 3국간의 공조는?' ---
--- [Middleware] Rewritten Query: '한미일 3국 공조' ---
--- [Middleware] Saved 3 docs to Context Holder ---
--- [Middleware] Retrieved 3 docs ---

--- [Middleware] Original Query: '한국과 아프리카의 협력 현황은?' ---
--- [Middleware] Rewritten Query: 'South Korea Africa cooperation current status' ---
--- [Middleware] Saved 3 docs to Context Holder ---
--- [Middleware] Retrieved 3 docs ---


In [None]:
experiment_results.to_pandas()

| | inputs.query | outputs.answer | outputs.documents | error | feedback.groundedness | feedback.relevance | feedback.retrieval_relevance | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 한-미-일 3국간의 공조는? | 제공된 정보에 따르면, 한미일 3국간의 공조는 다음과 같습니다.\n\n* **한... | [□외교부는 한일중 3국 청소년을 대상으로 ‘배우고 싶은 이웃나라의 문화’를 주제로... | | 5 | 5 | 3 | 9.875213 | 2c34bb25-a6d6-47f9-b63c-82970b3f743a | f17dbdd0-ef84-4e06-99e3-768dc7ee26e5 |
| 1 | 한국과 아프리카의 협력 현황은? | 한국과 아프리카의 협력 현황은 다음과 같습니다:\n\n* **한-아프리카 정상회... | [한반도 뿐 아니라 세계 평화와 안정을 위협하고 있는데 대해 국제사회의 단합된 대응... | | 5 | 5 | 4 | 9.648114 | abaefa0a-493a-48b4-8568-f4309ce0c463 | 1f911067-eae0-494c-8167-1129fcd21060 |
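
With more than a handful of examples, per-metric averages are easier to read than individual rows. The sketch below averages the three feedback scores using plain Python; the numbers are copied by hand from the two result rows above, and in the notebook the same values could be read from the `feedback.*` columns of the DataFrame.

```python
from statistics import mean

# Scores copied from the two result rows above.
rows = [
    {"groundedness": 5, "relevance": 5, "retrieval_relevance": 3},
    {"groundedness": 5, "relevance": 5, "retrieval_relevance": 4},
]

# Average each metric across all rows.
averages = {metric: mean(row[metric] for row in rows) for metric in rows[0]}
print(averages)
```

Here retrieval relevance is the weakest of the three metrics, which suggests looking at the retrieval step before changing the generation prompt.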


# Done!

# License

We referred to the official LangChain documentation to create this content. Google Gemini 2.5 Pro also helped write the explanatory text and code.

MIT License

Copyright (c) LangChain, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.