# Semantic Chunking with LangChain!

Today we'll be exploring Semantic Chunking!

Let's first grab the dependencies we'll be using to explore what Semantic Chunking is - and why it's useful!

In [None]:
!pip install -qU langchain_experimental langchain_openai langchain_community langchain ragas

In [None]:
!pip install -qU faiss-cpu tiktoken

Today we'll be working with "Alice's Adventures in Wonderland" as our source material - let's grab it and load it into memory.

In [None]:
!wget https://www.gutenberg.org/files/11/11-0.txt -O alice.txt

--2024-03-27 15:01:37--  https://www.gutenberg.org/files/11/11-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 154638 (151K) [text/plain]
Saving to: ‘alice.txt’


2024-03-27 15:01:37 (3.14 MB/s) - ‘alice.txt’ saved [154638/154638]



In [None]:
with open("./alice.txt") as f:
  alice_in_wonderland = f.read()

## RecursiveCharacterTextSplitter AKA "Naive Chunking"

Let's look at our documents if we use a traditional non-semantic chunking strategy!

> NOTE: The chunk size chosen here is purely for illustrative purposes.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False
)

In [None]:
naive_chunks = text_splitter.split_text(alice_in_wonderland)

In [None]:
for chunk in naive_chunks[10:15]:
  print(chunk + "\n")

dear! Oh dear! I shall be late!” (when she thought it over afterwards,
it occurred to her that she ought to have wondered at this, but at the

time it all seemed quite natural); but when the Rabbit actually _took a
watch out of its waistcoat-pocket_, and looked at it, and then hurried

on, Alice started to her feet, for it flashed across her mind that she
had never before seen a rabbit with either a waistcoat-pocket, or a

watch to take out of it, and burning with curiosity, she ran across the
field after it, and fortunately was just in time to see it pop down a
large rabbit-hole under the hedge.

In another moment down went Alice after it, never once considering how
in the world she was to get out again.



Notice how the chunk boundaries fall mid-sentence, and how closely related context winds up spread across several chunks as well.

We could use a number of awesome strategies to counter this problem - but we're going to focus on Semantic Chunking today!

## Semantic Chunking

Let's start by providing our OpenAI API key - which will be required for the specific example used in this notebook.

> NOTE: You could substitute any other embedding model here. The better the embedding method - the better the results should be!

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

Enter your OpenAI API Key:··········


Now let's implement the `SemanticChunker`!

We're going to be using the `percentile` threshold as an example today - but there are three different strategies you could use (descriptions provided by the [LangChain docs](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker) on Semantic Chunking):

- `percentile` (default) - In this method, all differences between sentences are calculated, and then any difference greater than the X percentile is split.

- `standard_deviation` - In this method, any difference greater than X standard deviations is split.

- `interquartile` - In this method, the interquartile distance is used to split chunks.
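
To make the three strategies concrete, here is a minimal standard-library sketch of how a breakpoint threshold could be derived from a list of distances. The function names, the sample distances, and the default amounts (95th percentile, 3 standard deviations, 1.5 times the interquartile range) are assumptions for illustration, not a copy of LangChain's implementation.

```python
import statistics

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def breakpoint_threshold(distances, strategy="percentile", amount=None):
    """Return the distance above which a new chunk would start."""
    if strategy == "percentile":
        # Assumed default: split on the top 5% of distances
        return percentile(distances, 95 if amount is None else amount)
    if strategy == "standard_deviation":
        # Assumed default: mean plus 3 standard deviations
        k = 3 if amount is None else amount
        return statistics.mean(distances) + k * statistics.pstdev(distances)
    if strategy == "interquartile":
        # Assumed default: mean plus 1.5 times the interquartile range
        k = 1.5 if amount is None else amount
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.mean(distances) + k * iqr
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical distances between consecutive sentence groups
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.60, 0.13]
for name in ("percentile", "standard_deviation", "interquartile"):
    threshold = breakpoint_threshold(distances, name)
    splits = [i for i, d in enumerate(distances) if d > threshold]
    print(f"{name}: threshold={threshold:.3f}, split after positions {splits}")
```

Running it on your own distances is a quick way to see how aggressively each strategy would split before paying for any embedding calls.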

The basic idea is as follows:

1. Split our document into sentences (based on `.`, `?`, and `!`)
2. Index each sentence based on position
3. Add a `buffer_size` (`int`) of sentences on either side of our selected sentence
4. Calculate distances between groups of sentences
5. Merge neighboring groups into a chunk, starting a new chunk wherever the distance exceeds the chosen threshold
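
The five steps above can be sketched end to end with the standard library alone. The bag-of-words `toy_embed` function is a deliberately crude, hypothetical stand-in for a real embedding model such as `OpenAIEmbeddings`, and `semantic_chunks_sketch` and its parameters are illustrative names rather than LangChain's API.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def percentile_of(values, pct):
    """Linear-interpolation percentile (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def semantic_chunks_sketch(text, buffer_size=1, percentile=95):
    # 1. Split into sentences on '.', '?', and '!'
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    # 2-3. For each position, combine the sentence with its neighbours
    groups = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    # 4. Distance between each group and the one that follows it
    vectors = [toy_embed(group) for group in groups]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    # 5. Start a new chunk wherever the distance exceeds the threshold
    threshold = percentile_of(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sample = (
    "The cat sat on the mat. The cat slept on the mat. "
    "The cat purred on the mat. Stocks fell sharply today. "
    "Stocks fell again on Friday."
)
chunks = semantic_chunks_sketch(sample, buffer_size=0)
for chunk in chunks:
    print(repr(chunk))
```

With word counts standing in for embeddings, only the jump from cats to stocks is large enough to trigger a split; a real embedding model would capture far subtler topic shifts.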

> NOTE: This method is currently experimental and is not in a stable final form - expect updates and improvements in the coming months



In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

semantic_chunker = SemanticChunker(OpenAIEmbeddings(model="text-embedding-3-large"), breakpoint_threshold_type="percentile")

Now we can create our documents.

In [None]:
semantic_chunks = semantic_chunker.create_documents([alice_in_wonderland])

Let's look at the chunk associated with the above naive chunks.

Notice how much more information is retained and included in each chunk.

Also notice how much larger this chunk is!

In [None]:
for semantic_chunk in semantic_chunks:
  if "waistcoat-pocket_" in semantic_chunk.page_content:
    print(semantic_chunk.page_content)
    print(len(semantic_chunk.page_content))

*** START OF THE PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN
WONDERLAND ***
[Illustration]




Alice’s Adventures in Wonderland

by Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0

Contents

 CHAPTER I. Down the Rabbit-Hole
 CHAPTER II. The Pool of Tears
 CHAPTER III. A Caucus-Race and a Long Tale
 CHAPTER IV. The Rabbit Sends in a Little Bill
 CHAPTER V. Advice from a Caterpillar
 CHAPTER VI. Pig and Pepper
 CHAPTER VII. A Mad Tea-Party
 CHAPTER VIII. The Queen’s Croquet-Ground
 CHAPTER IX. The Mock Turtle’s Story
 CHAPTER X. The Lobster Quadrille
 CHAPTER XI. Who Stole the Tarts? CHAPTER XII. Alice’s Evidence




CHAPTER I. Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“without pictures or conversations?”

So she was considering in her own m

## Creating a RAG Pipeline Utilizing Semantic Chunking

Let's create a RAG LCEL chain that leverages our created Semantic Chunks.

We'll start by creating our Retriever.

### Retrieval

We're going to use a vectorstore backed by Meta's FAISS, and we'll use `text-embedding-3-large` (the same embedding model used to do the semantic chunking).

> NOTE: There is no specific research or reason that suggests your vectorstore embedding model should be the same as your chunking embedding model - though intuition suggests they should be the same.

In [None]:
from langchain_community.vectorstores import FAISS

semantic_chunk_vectorstore = FAISS.from_documents(semantic_chunks, embedding=OpenAIEmbeddings(model="text-embedding-3-large"))

We will "limit" our semantic retriever to `k = 1` to demonstrate the power of the semantic chunking strategy while maintaining similar token counts between the semantic and naive retrieved context.

In [None]:
semantic_chunk_retriever = semantic_chunk_vectorstore.as_retriever(search_kwargs={"k" : 1})

In [None]:
semantic_chunk_retriever.invoke("Who has a pocket watch?")

[Document(page_content='“He won’t stand beating. Now, if you only kept on good terms with him, he’d do almost anything\nyou liked with the clock. For instance, suppose it were nine o’clock in\nthe morning, just time to begin lessons: you’d only have to whisper a\nhint to Time, and round goes the clock in a twinkling! Half-past one,\ntime for dinner!”\n\n(“I only wish it was,” the March Hare said to itself in a whisper.)\n\n“That would be grand, certainly,” said Alice thoughtfully: “but then—I\nshouldn’t be hungry for it, you know.”\n\n“Not at first, perhaps,” said the Hatter: “but you could keep it to\nhalf-past one as long as you liked.”\n\n“Is that the way _you_ manage?” Alice asked. The Hatter shook his head mournfully.')]

### Augmented

We'll create a classic RAG prompt to augment our question with the retrieved context.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

rag_template = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.

User's Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)
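
For intuition, filling the template behaves much like plain string formatting. The sketch below uses only `str.format`; the `format_docs` and `build_prompt` helpers and the sample strings are hypothetical and are not part of LangChain.

```python
RAG_TEMPLATE = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.

User's Query:
{question}

Context:
{context}
"""

def format_docs(chunks):
    """Hypothetical helper: join retrieved chunk texts with blank lines."""
    return "\n\n".join(chunks)

def build_prompt(question, chunks):
    """Fill both placeholders of the template with plain strings."""
    return RAG_TEMPLATE.format(question=question, context=format_docs(chunks))

# Sample inputs for illustration only
prompt_text = build_prompt(
    "Who has a pocket watch?",
    ["The Rabbit actually took a watch out of its waistcoat-pocket."],
)
print(prompt_text)
```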

### Generation

We'll use the default `ChatOpenAI` model for our generator today!

In [None]:
from langchain_openai import ChatOpenAI

base_model = ChatOpenAI()

### LCEL Chain

We'll assemble these pieces into a classic LCEL chain so we can test our RAG pipeline!

In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

semantic_rag_chain = (
    {"context" : semantic_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | base_model
    | StrOutputParser()
)
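
The `|` operator composes steps so that each step's output feeds the next. As a rough mental model only, the toy `Step` class below mimics that composition with `__or__`; it is a hypothetical illustration, not how LangChain's runnables are implemented, and every step here is a stand-in.

```python
class Step:
    """Toy composable step: wraps a function and supports `a | b`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Run self first, then feed its output into `other`
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

# Hypothetical stand-ins for the retriever, prompt, model, and parser
gather = Step(lambda q: {"context": "Dinah is Alice's cat.", "question": q})
prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_model = Step(lambda text: {"content": text.upper()})
parser = Step(lambda message: message["content"])

toy_chain = gather | prompt | fake_model | parser
print(toy_chain.invoke("Who is Dinah?"))
```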

Let's test it out!

In [None]:
semantic_rag_chain.invoke("How does Alice find herself falling down the rabbit hole into Wonderland?")

'Alice finds herself falling down the rabbit hole into Wonderland after she sees a White Rabbit with pink eyes run by her and then take a watch out of its waistcoat-pocket. Curious, she follows the rabbit and sees it disappear down a large rabbit-hole under the hedge. Without considering how to get out, Alice goes after the rabbit and falls down the rabbit hole herself.'

In [None]:
semantic_rag_chain.invoke("Who is Dinah, and what is their importance to Alice?")

"Dinah is Alice's cat, and she is important to Alice because she is a great mouse catcher."

These answers seem great!

Let's repeat this process for our naive chunking!

In [None]:
naive_chunk_vectorstore = FAISS.from_texts(naive_chunks, embedding=OpenAIEmbeddings(model="text-embedding-3-large"))

Notice that we're going to use `k = 15` here - this is to "make it a fair comparison" between the two strategies.
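
The arithmetic behind this choice: with `chunk_size=200`, fifteen naive chunks add up to roughly 3,000 characters of context at most, which is meant to land in the same range as a single semantic chunk. A small helper for checking that budget on lists of chunk strings might look like the sketch below; the function names are hypothetical and the sample strings are placeholders, not real retrieval results.

```python
def context_chars(texts):
    """Total number of characters across retrieved chunk texts."""
    return sum(len(text) for text in texts)

def budget_ratio(naive_texts, semantic_texts):
    """How many times more context (by characters) the naive retriever supplies."""
    semantic_total = context_chars(semantic_texts)
    if semantic_total == 0:
        raise ValueError("semantic context is empty")
    return context_chars(naive_texts) / semantic_total

# Placeholder data: 15 chunks at the 200-character cap vs. one longer chunk
naive_retrieved = ["x" * 200] * 15
semantic_retrieved = ["y" * 2500]
print(context_chars(naive_retrieved))
print(budget_ratio(naive_retrieved, semantic_retrieved))
```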

In [None]:
naive_chunk_retriever = naive_chunk_vectorstore.as_retriever(search_kwargs={"k" : 15})

In [None]:
naive_rag_chain = (
    {"context" : naive_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | base_model
    | StrOutputParser()
)

In [None]:
naive_rag_chain.invoke("How does Alice find herself falling down the rabbit hole into Wonderland?")

'Alice finds herself falling down the rabbit hole into Wonderland when she sees a large rabbit-hole under the hedge and runs after the White Rabbit, who went down it.'

In [None]:
naive_rag_chain.invoke("Who is Dinah, and what is their importance to Alice?")

"Dinah is Alice's cat, and she is important to Alice because she is described as the best cat in the world and Alice is very fond of her."

These answers are not bad - but they lack a certain depth that the previous answers had.

## Ragas Assessment Comparison

Let's go ahead and leverage a great tool: [Ragas](https://docs.ragas.io/en/stable/getstarted/index.html)!

We're going to split our documents utilizing a different chunking strategy to avoid any "cheating" by the naive retriever.

In [None]:
synthetic_data_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False
)

In [None]:
synthetic_data_chunks = synthetic_data_splitter.create_documents([alice_in_wonderland])

Then we will create:

- Questions - synthetically generated (`gpt-3.5-turbo`)
- Contexts - the synthetic data chunks created above
- Ground Truths - synthetically generated (`gpt-4-turbo-preview`)
- Answers - generated from our Semantic RAG Chain

In [None]:
questions = []
ground_truths_semantic = []
contexts = []
answers = []

question_prompt = """\
You are a teacher preparing a test. Please create a question that can be answered by referencing the following context.

Context:
{context}
"""

question_prompt = ChatPromptTemplate.from_template(question_prompt)

ground_truth_prompt = """\
Use the following context and question to answer this question using *only* the provided context.

Question:
{question}

Context:
{context}
"""

ground_truth_prompt = ChatPromptTemplate.from_template(ground_truth_prompt)

question_chain = question_prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser()
ground_truth_chain = ground_truth_prompt | ChatOpenAI(model="gpt-4-turbo-preview") | StrOutputParser()

for chunk in synthetic_data_chunks[10:20]:
  questions.append(question_chain.invoke({"context" : chunk.page_content}))
  contexts.append([chunk.page_content])
  ground_truths_semantic.append(ground_truth_chain.invoke({"question" : questions[-1], "context" : contexts[-1]}))
  answers.append(semantic_rag_chain.invoke(questions[-1]))

We'll format those into a dataset!

In [None]:
from datasets import Dataset

qagc_list = []

for question, answer, context, ground_truth in zip(questions, answers, contexts, ground_truths_semantic):
  qagc_list.append({
      "question" : question,
      "answer" : answer,
      "contexts" : context,
      "ground_truth" : ground_truth
  })

eval_dataset = Dataset.from_list(qagc_list)

In [None]:
eval_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 10
})

Now we can implement Ragas metrics and evaluate our created dataset.

In [None]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

You can check out our [previous webinar](https://www.youtube.com/watch?v=Anr1br0lLz8) about Ragas to learn a bit more about these metrics.

In [None]:
from ragas import evaluate

result = evaluate(
    eval_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

In [None]:
result

{'context_precision': 1.0000, 'faithfulness': 0.5000, 'answer_relevancy': 0.7399, 'context_recall': 1.0000}

In [None]:
results_df = result.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,context_precision,faithfulness,answer_relevancy,context_recall
0,"Question:\nIn the given context, how does Alic...",Alice feels desperate and ready to ask for hel...,"[“Well!” thought Alice to herself, “after such...","Alice feels that after her recent fall, tumbli...",1.0,,0.936448,1.0
1,"Question:\nIn the context provided, how far do...",Alice estimates that she has fallen about four...,"[Down, down, down. Would the fall _never_ come...",Alice estimates she has fallen about four thou...,1.0,1.0,0.923001,1.0
2,"Question: In the given context, what does Alic...",Alice wonders if she has been changed in the n...,[though this was not a _very_ good opportunity...,"In the given context, Alice wonders about what...",1.0,0.0,0.907038,1.0
3,"Question: In the given context, Alice expresse...",Alice refers to the people who walk with their...,[Presently she began again. “I wonder if I sha...,"In the given context, Alice uses the term ""The...",1.0,1.0,0.914198,1.0
4,"Question: In the provided context, what was th...",The little girl was unsure of her own identity...,"[country is, you know. Please, Ma’am, is this ...","In the provided context, the little girl was u...",1.0,0.0,0.917209,1.0
5,Question:\n\nWho or what is Dinah in the conte...,"Dinah is a cat, as mentioned in the context pr...","[Down, down, down. There was nothing else to d...",Dinah is the cat in the provided context.,1.0,1.0,0.942229,1.0
6,"Question: In the context provided, what is Ali...",I don't know.,"[very like a mouse, you know. But do cats eat ...","In the provided context, Alice is pondering ab...",1.0,,0.0,1.0
7,Question: What was Alice dreaming about just b...,I don't know.,"[that she was dozing off, and had just begun t...",Alice was dreaming about walking hand in hand ...,1.0,,0.0,1.0
8,Question: What did Alice do after she realized...,Alice took up the fan and gloves that the Whit...,"[Alice was not a bit hurt, and she jumped up o...",After realizing the White Rabbit was still in ...,1.0,0.0,0.887359,1.0
9,"Question: In the passage provided, what is the...","The setting described as ""lit up by a row of l...","[and whiskers, how late it’s getting!” She was...","The setting described as ""lit up by a row of l...",1.0,,0.971589,1.0


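These results are "fine", largely: `context_precision` and `context_recall` are both 1.0, but `faithfulness` is only 0.5, and the two "I don't know." answers score 0.0 on `answer_relevancy`, pulling its average down to about 0.74.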
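
It can help to see how the aggregate scores relate to the per-row table. The sketch below recomputes two of them from the rows shown above while skipping `NaN` entries; treating the aggregates as NaN-skipping means is an inference from these numbers, not a documented statement about Ragas internals.

```python
import math

def nan_mean(values):
    """Mean of the values that are not NaN (NaN if none remain)."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")

nan = float("nan")
# Per-row scores copied from the results table
faithfulness_scores = [nan, 1.0, 0.0, 1.0, 0.0, 1.0, nan, nan, 0.0, nan]
answer_relevancy_scores = [
    0.936448, 0.923001, 0.907038, 0.914198, 0.917209,
    0.942229, 0.0, 0.0, 0.887359, 0.971589,
]

print(round(nan_mean(faithfulness_scores), 4))      # 0.5
print(round(nan_mean(answer_relevancy_scores), 4))  # 0.7399
```

Both values match the aggregate dictionary, which also shows that only six of the ten rows contribute to the `faithfulness` score.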

But let's compare to our naive strategy!

In [None]:
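# Reuse the same questions, contexts, and ground truths so both chains are scored on identical inputs
naive_answers = [naive_rag_chain.invoke(question) for question in questions]

naive_qagc_list = []

for question, answer, context, ground_truth in zip(questions, naive_answers, contexts, ground_truths_semantic):
  naive_qagc_list.append({
      "question" : question,
      "answer" : answer,
      "contexts" : context,
      "ground_truth" : ground_truth
  })

# Build a separate dataset that holds the naive chain's answers
naive_eval_dataset = Dataset.from_list(naive_qagc_list)

In [None]:
naive_result = evaluate(
    naive_eval_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)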

In [None]:
naive_result

{'context_precision': 1.0000, 'faithfulness': 0.5000, 'answer_relevancy': 0.6492, 'context_recall': 1.0000}

In [None]:
naive_results_df = naive_result.to_pandas()
naive_results_df


In this run, the naive strategy scores lower on `answer_relevancy` (0.6492 vs. 0.7399), while the other three metrics are identical.

In [None]:
naive_result

{'context_precision': 1.0000, 'faithfulness': 0.5000, 'answer_relevancy': 0.6492, 'context_recall': 1.0000}

In [None]:
result

{'context_precision': 1.0000, 'faithfulness': 0.5000, 'answer_relevancy': 0.7399, 'context_recall': 1.0000}
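
A quick way to read the two dictionaries side by side is to compute per-metric differences. The values below are copied from the outputs above; the variable names are illustrative.

```python
# Aggregate scores copied from the two result dictionaries
semantic_scores = {
    "context_precision": 1.0,
    "faithfulness": 0.5,
    "answer_relevancy": 0.7399,
    "context_recall": 1.0,
}
naive_scores = {
    "context_precision": 1.0,
    "faithfulness": 0.5,
    "answer_relevancy": 0.6492,
    "context_recall": 1.0,
}

# Positive delta means the semantic chain scored higher on that metric
deltas = {
    metric: round(semantic_scores[metric] - naive_scores[metric], 4)
    for metric in semantic_scores
}
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.4f}")
```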