# Semantic Chunking

In this lab, we'll demonstrate how to use LangChain's Semantic Chunker to break apart our text into relevant chunks. Before we explore semantic chunking, let's try to convince ourselves that it's worth the effort by exploring some more rudimentary chunking techniques.

First, let's look at a sample document:

In [None]:
with open("launch_school_docs.txt") as f:
    launch_school_docs = f.read()

print(launch_school_docs)

Our text has three basic parts:

1. ESLint Installation Instructions
2. A Capstone FAQ
3. A forum post for a Launch School Women's Group event

In the realm of data, this is actually quite tidy. Sure, it's fabricated and messier than actual Launch School data, but it does have good formatting. Let's come up with a few rudimentary strategies we might use and see how our chunks come out.

## Strategy 1: Using Newline Characters

In [None]:
# Chunking Based on Newline Characters

import textwrap

chunks = launch_school_docs.split("\n")
print(*chunks, sep="\n-----------\n")

print('**************** Chunk data ****************')

average_chunk_size = sum([len(chunk) for chunk in chunks]) / len(chunks)
print(f"\nAverage chunk size: {average_chunk_size}")

smallest_chunk = min(chunks, key=len)
print(f"\nSmallest chunk: {smallest_chunk}")
print(f"\nSmallest chunk length: {len(smallest_chunk)}")

chunks = [chunk for chunk in chunks if len(chunk.strip()) > 0]

print(f"\nSmallest non-empty chunk: {min(chunks, key=len)}")
print(f"\nSmallest non-empty chunk length: {len(min(chunks, key=len))}")

largest_chunk = max(chunks, key=len)
print(f"\nLargest chunk: \n{textwrap.fill(largest_chunk, width=100)}")
print(f"\nLargest chunk length: {len(largest_chunk)}")

This hasn't worked very well. It would probably give us some relevant chunks, but we have a few problems.

1. The lines of code have lost all context. Stand-alone, they don't mean much. Same story with the questions and their answers from the FAQ.
2. The chunk sizes vary quite a bit. We can easily filter out empty chunks, but even then the smallest non-empty chunk is just the character "}", while the largest has 753 characters. That’s still reasonable, but it’s a wide range. And if our text contained documents without any newline characters, we could easily end up with very large chunks.

## Strategy 2: Splitting Into Even-length Chunks

Let's try a different technique that should solve some of our problems. Since we're using a small document for demonstration, we'll use smaller chunks than is typical.

In [None]:
chunks = [launch_school_docs[i:i+300] for i in range(0, len(launch_school_docs), 300)]

print(*chunks, sep="\n-----------\n")

print('**************** Chunk data ****************')

average_chunk_size = sum([len(chunk) for chunk in chunks]) / len(chunks)
print(f"\nAverage chunk size: {average_chunk_size}")

smallest_chunk = min(chunks, key=len)
print(f"\nSmallest chunk: {smallest_chunk}")
print(f"\nSmallest chunk length: {len(smallest_chunk)}")

largest_chunk = max(chunks, key=len)
print(f"\nLargest chunk: \n{textwrap.fill(largest_chunk, width=100)}")
print(f"\nLargest chunk length: {len(largest_chunk)}")

This seems a bit better! At least our code instructions have some context, and we don't have meaningless single-character chunks like "}".

That said, some chunks now straddle two different topics, like this one:

> amentals
> Databases (nosql, rdbms) & Database Design
> Full-stack Development and Frameworks
> Cloud Infrastructure
> Agile Team-based Development
> Software Architecture & System Design
> Distributed Systems
> Service Oriented Architectures
> 
> What if I don’t reside in the US?
> While we have a strong preference fo

We also cut sentences in half, though this is less of a concern: we could easily add a bit more logic to make the splits at sentence boundaries.
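As a rough illustration of splitting at sentence boundaries, here is a minimal sketch that greedily packs whole sentences into chunks of limited size. The function name `chunk_by_sentences`, the `max_chars` parameter, and the naive regex splitter are all assumptions made for this example, not part of the lab's code. The splitter will stumble on abbreviations and code snippets, so treat it as a starting point only.

```python
import re


def chunk_by_sentences(text, max_chars=300):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    # Whitespace between sentences (including newlines) is collapsed to one space.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            # Note: a single sentence longer than max_chars still becomes
            # its own (oversized) chunk.
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Install ESLint first. Then create a config file. Finally, run the linter."
sentence_chunks = chunk_by_sentences(sample, max_chars=50)
print(*sentence_chunks, sep="\n-----------\n")
```

This keeps sentences intact, but it still has no idea whether neighboring sentences belong to the same topic.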

The biggest problem we need to solve is how to create chunks based on **semantic meaning.**

## Thought Challenge: Chunking Semantically

Try to think of how we might be able to break our document into chunks based on semantic meaning, given the tools we already know about. Brainstorm a few ideas.

<details>
<summary>💡 Hint 1 💡</summary>

We don't want chunks to split mid-sentence, so as a starting point, imagine we have a list of sentences that make up a document. How can we determine if two sentences should be grouped in the same chunk, or be split into separate chunks?

</details>

<details>
<summary>💡 Hint 2 💡</summary>

We could embed sentences to give a numerical representation of their "similarity" relative to one another.

</details>

<details>
<summary>💡 Hint 3 💡</summary>

You'll need a `cosine_similarity` and `generate_embedding` function. You can snag these from previous labs.

</details>

Once you've given it some thought, go ahead and follow the steps outlined below, or implement your own idea!


<details>
<summary>📝 Steps to Chunking 📝</summary>

Your task is to implement your own chunking strategy. Here's what you need to do:

1. Split the text into sentences, preserving their order. Even if two sentences are very similar, they shouldn't be chunked together if they're in different sections.

2. Iterate over the sentences, building chunks as you go. For each sentence after the first:
- Embed the sentence and the one before it.
- Calculate a similarity score between the two embeddings.
- If the score is at or above a certain threshold, add the sentence to the current chunk.
- If the score is below the threshold, start a new chunk and add the sentence to it.

3. Return the chunks.

</details>
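<details>
<summary>💡 One Possible Sketch 💡</summary>

If you get stuck, the steps above can be sketched roughly as follows. This is only one way to do it, under some loud assumptions: `toy_embedding` is a hypothetical bag-of-words stand-in for the `generate_embedding` function from previous labs, `toy_cosine_similarity` is a matching stand-in for `cosine_similarity`, and the sentence splitter is a naive regex. Swap in the real functions to get meaningful results on actual text.

```python
import math
import re
from collections import Counter


def toy_embedding(sentence):
    """Hypothetical stand-in for generate_embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def toy_cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_chunking_sketch(text, similarity_threshold=0.5, embed=toy_embedding):
    # Step 1: split into sentences, preserving order (naive regex splitter).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    # Step 2: compare each sentence with the one directly before it.
    chunks = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if toy_cosine_similarity(previous, current) >= similarity_threshold:
            chunks[-1].append(sentence)   # similar enough: same chunk
        else:
            chunks.append([sentence])     # topic shift: start a new chunk
        previous = current

    # Step 3: return the chunks as strings.
    return [" ".join(chunk) for chunk in chunks]


demo = (
    "ESLint checks your code. ESLint reports problems in your code. "
    "The meetup starts at noon. The meetup ends at two."
)
demo_chunks = semantic_chunking_sketch(demo)
print(*demo_chunks, sep="\n-----------\n")
```

Comparing only adjacent sentences keeps the logic simple, but it also means a single off-topic sentence will split a chunk in two. Comparing against the whole current chunk is another option worth trying.

</details>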

In [None]:
def semantic_chunking(text, similarity_threshold=0.5):

  # TODO

  pass


chunks = semantic_chunking(launch_school_docs)

print(*chunks, sep="\n-----------\n")

print('**************** Chunk data ****************')

print(f"\nNumber of chunks: {len(chunks)}")

average_chunk_size = sum([len(chunk) for chunk in chunks]) / len(chunks)
print(f"\nAverage chunk size: {average_chunk_size}")

smallest_chunk = min(chunks, key=len)
print(f"\nSmallest chunk: {smallest_chunk}")
print(f"\nSmallest chunk length: {len(smallest_chunk)}")

largest_chunk = max(chunks, key=len)
print(f"\nLargest chunk: \n{textwrap.fill(largest_chunk, width=100)}")
print(f"\nLargest chunk length: {len(largest_chunk)}")

Hopefully, your implementation solved the problem of unrelated sentences ending up in the same chunk. What it might not have solved, however, is the wide spread in chunk sizes: some chunks very small, others very large. Let's move on to see how we can use a chunker provided by LangChain. After that, there are some bonus exercises you may wish to implement that address the chunk size issue.


# Chunking with LangChain

LangChain provides a chunking tool that takes a similar approach to the one we've just implemented ourselves. Let's run an example to see how it compares to our solution.

In [None]:
import textwrap
import dotenv

# Load environment variables (such as OPENAI_API_KEY) from a local .env file.
dotenv.load_dotenv()

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings


text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=50.0,
    min_chunk_size=200,
)

chunk_documents = text_splitter.create_documents([launch_school_docs])
chunks = [document.page_content for document in chunk_documents]

print(*chunks, sep="\n-----------\n")

print('**************** Chunk data ****************')

print(f"\nNumber of chunks: {len(chunks)}")

average_chunk_size = sum([len(chunk) for chunk in chunks]) / len(chunks)
print(f"\nAverage chunk size: {average_chunk_size}")

smallest_chunk = min(chunks, key=len)
print(f"\nSmallest chunk: {smallest_chunk}")
print(f"\nSmallest chunk length: {len(smallest_chunk)}")

largest_chunk = max(chunks, key=len)
print(f"\nLargest chunk: \n{textwrap.fill(largest_chunk, width=100)}")
print(f"\nLargest chunk length: {len(largest_chunk)}")

Experiment with `breakpoint_threshold_amount` to see how the chunks change. You can also take a look at [the documentation](https://python.langchain.com/api_reference/experimental/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html) to see the other parameters we can use to control the chunking.
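To build some intuition for what the percentile setting does, here is a small sketch on made-up numbers. Our understanding is that, with `breakpoint_threshold_type="percentile"`, the chunker measures the distance between neighboring sentences and splits wherever that distance exceeds the chosen percentile of all the distances. The code below illustrates that idea only; it is not LangChain's implementation, and the `percentile` and `breakpoints` helpers and the distance values are assumptions made for this example.

```python
def percentile(values, pct):
    """Percentile with simple linear interpolation between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoints(distances, threshold_percentile):
    """Indices i where the gap between sentence i and sentence i + 1 exceeds the cutoff."""
    cutoff = percentile(distances, threshold_percentile)
    return [i for i, distance in enumerate(distances) if distance > cutoff]


# Hypothetical distances between consecutive sentences (made-up values).
distances = [0.10, 0.15, 0.80, 0.20, 0.05, 0.70, 0.12]

for pct in (50, 80, 95):
    print(f"percentile={pct}: split after sentences {breakpoints(distances, pct)}")
```

In this toy setup, a lower percentile produces more breakpoints and therefore more, smaller chunks, while a higher percentile splits only at the largest topic shifts.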

## Bonus Features

These bonus features are for the custom chunking implementation you wrote above.

- **Managing Chunk Size**: Add `min_chunk_size` and `max_chunk_size` keyword arguments that allow us to set some limits. You might consider a two-step chunking process. Does it make sense to cut off a chunk as soon as it reaches its maximum, or to revisit the largest chunks later and split them evenly?

- **Overlapping Chunks**: Try implementing a sliding-window technique such that the end of one chunk is included in the beginning of the next chunk. This can help to avoid context loss between chunks.

- **Token-Based Chunking**: Use `tiktoken` to base your chunk size on token count rather than number of characters.
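As a nudge for the overlapping-chunks idea, here is one minimal sketch. It assumes your chunker keeps each chunk as a list of sentences before joining them; the `add_sentence_overlap` function and its `overlap_sentences` parameter are hypothetical names invented for this example, and your own design may look quite different.

```python
def add_sentence_overlap(chunks, overlap_sentences=1):
    """Prefix each chunk with the last few sentences of the chunk before it.

    `chunks` is a list of chunks, where each chunk is a list of sentences.
    """
    overlapped = []
    for index, chunk in enumerate(chunks):
        if index > 0 and overlap_sentences > 0:
            # Carry over the tail of the previous (original) chunk.
            carried = chunks[index - 1][-overlap_sentences:]
        else:
            carried = []
        overlapped.append(" ".join(carried + chunk))
    return overlapped


sentence_groups = [
    ["Install ESLint with npm.", "Create a config file."],
    ["Capstone is a full-time program."],
    ["The meetup starts at noon.", "Everyone is welcome."],
]
overlapped_chunks = add_sentence_overlap(sentence_groups, overlap_sentences=1)
print(*overlapped_chunks, sep="\n-----------\n")
```

Overlapping by whole sentences avoids cutting words in half, at the cost of some duplicated text across chunks.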