# Class Introduction

## Objective
Document splitting is a crucial preprocessing step for many applications. It breaks large texts into smaller, manageable chunks. This keeps processing consistent across documents of varying length, works around the input size limits of models, and improves the quality of the text representations used in retrieval systems. There are several strategies for splitting documents, each with its own advantages.


**Relevant Links**
- [HuggingFace chunk visualizer](https://huggingface.co/spaces/Nymbo/chunk_visualizer) 
- [5 Levels of text splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)
- [RAG Course](https://www.youtube.com/watch?v=sVcwVQRHIc8)

In [1]:
from langchain_openai import OpenAIEmbeddings
from utils import *
import os
os.environ["LANGSMITH_PROJECT"] = "llm-training-05-rag-p2"




In [2]:
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
)

## Character Splitting

Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.

This method isn't recommended for any application, but it's a great starting point for understanding the basics.

- **Pros:** Easy & Simple
- **Cons:** Very rigid and doesn't take into account the structure of your text

Concepts to know:

- **Chunk Size** - The number of characters you would like in your chunks, for example 50, 100, or 100,000.
- **Chunk Overlap** - The number of characters that consecutive chunks share. Overlap reduces the chance of cutting a single piece of context into multiple pieces, at the cost of duplicating data across chunks.
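
To make chunk size and chunk overlap concrete, here is a minimal pure-Python sketch of fixed-size character chunking. The function name `chunk_by_characters` is hypothetical and this is not LangChain's implementation; it only illustrates how the two parameters interact.

```python
def chunk_by_characters(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_by_characters(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last 3 characters of the previous one, which is the duplicated data mentioned above.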

In [3]:
text = "This is the text I would like to chunk up. It is the example text for this exercise"

In [4]:
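from langchain.text_splitter import CharacterTextSplitter

# separator='' splits on every character, so chunks are cut at exactly chunk_size;
# strip_whitespace=False keeps leading and trailing spaces in each chunk
text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=10, separator='', strip_whitespace=False)
text_splitter.create_documents([text])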


[Document(metadata={}, page_content='This is the text I would like to chunk up. It is t'),
 Document(metadata={}, page_content='p. It is the example text for this exercise')]

## Recursive Character Text Splitting

Let's jump a level of complexity.

The problem with Level 1 (character splitting) is that we don't take the structure of our document into account at all. We simply split by a fixed number of characters.

The Recursive Character Text Splitter helps with this. With it, we specify a series of separators that are tried in order to split our docs.

LangChain's default separators are listed below. Let's take a look at them one by one.

- "\n\n" - Double new line, or most commonly paragraph breaks
- "\n" - New lines
- " " - Spaces
- "" - Characters
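
The idea of falling back through this separator list can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual algorithm: the function name `recursive_split` is hypothetical, overlap is ignored, and separators are re-inserted only when pieces are merged back together.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones when a piece is too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long, so retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee\n\nffff"
result = recursive_split(sample, chunk_size=10)
print(result)
# ['aaaa bbbb', 'cccc dddd', 'eeee', 'ffff']
```

The first paragraph fits in one chunk, while the second is too long and gets split again on spaces.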

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap=0)
text_splitter.create_documents([text])


[Document(metadata={}, page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
 Document(metadata={}, page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
 Document(metadata={}, page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, th

## Level 3: Document Specific Splitting <a id="DocumentSpecific"></a>

Stepping up our levels ladder, let's start to handle document types other than normal prose in a .txt file. What if you have pictures, a PDF, or code snippets?

Our first two levels wouldn't work well for these, so we'll need a different tactic.

This level is all about making your chunking strategy fit your different data formats. Let's run through a bunch of examples of this in action.

The Markdown, Python, and JS splitters are similar to Recursive Character, but with different separators.

See all of LangChain's document splitters [here](https://python.langchain.com/docs/how_to/code_splitter/)

### Markdown

You can see the separators [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L1175).

Separators:
* `\n#{1,6}` - Split by new lines followed by a header (H1 through H6)
* ```` ```\n ```` - Code blocks
* `\n\\*\\*\\*+\n` - Horizontal Lines
* `\n---+\n` - Horizontal Lines
* `\n___+\n` - Horizontal Lines
* `\n\n` - Double new lines
* `\n` - New line
* `" "` - Spaces
* `""` - Character
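
To see what a header-style separator does on its own, here is a small sketch using only Python's `re` module. The function name `split_on_markdown_headers` is hypothetical, and the pattern is a simplified stand-in for the first separator in the list, not the splitter's real code path.

```python
import re


def split_on_markdown_headers(markdown: str) -> list[str]:
    """Split Markdown text into sections that each start at a header line."""
    # Split at a newline that is immediately followed by 1-6 '#' and a space;
    # the lookahead keeps the header text at the start of the next section.
    sections = re.split(r"\n(?=#{1,6} )", markdown)
    return [s.strip() for s in sections if s.strip()]


doc = "# Title\n\nIntro text\n\n## Section\n\nSection text"
sections = split_on_markdown_headers(doc)
print(sections)
# ['# Title\n\nIntro text', '## Section\n\nSection text']
```

Unlike a fixed character count, this keeps each header together with the text that follows it.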

In [7]:
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [8]:
from langchain.text_splitter import MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size = 60, chunk_overlap=0)
splitter.create_documents([markdown_text])

[Document(metadata={}, page_content='# Fun in California\n\n## Driving'),
 Document(metadata={}, page_content='Try driving on the 1 down to San Diego\n\n### Food'),
 Document(metadata={}, page_content="Make sure to eat a burrito while you're there\n\n## Hiking"),
 Document(metadata={}, page_content='Go to Yosemite')]

## Level 5: Agentic Chunking
Can we instruct an LLM to do this task like a human would?

How does a human even go about chunking in the first place?

1. I would get myself a scratch piece of paper or notepad.
2. I'd start at the top of the essay and assume the first part will be a chunk (since we don't have any yet).
3. Then I would keep going down the essay and evaluate whether each new sentence or piece of the essay should be part of the first chunk. If not, I'd create a new one.
4. I'd keep doing that all the way down the essay until I reached the end.


Example: `Greg went to the park. He likes walking` > `['Greg went to the park.', 'Greg likes walking']`
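
The human procedure above can be sketched as a loop with a pluggable decision function. Everything here is hypothetical: `agentic_chunk` and `shares_a_word` are illustrative names, and the word-overlap check is only a cheap stand-in for the judgment an LLM would make. The sketch also checks every existing chunk rather than only the first one, which is an assumption about how the idea generalizes.

```python
from typing import Callable


def agentic_chunk(propositions: list[str],
                  belongs: Callable[[list[str], str], bool]) -> list[list[str]]:
    """Place each proposition in the first chunk that accepts it, else open a new chunk."""
    chunks: list[list[str]] = []
    for prop in propositions:
        for chunk in chunks:
            if belongs(chunk, prop):
                chunk.append(prop)
                break
        else:
            # No existing chunk accepted the proposition
            chunks.append([prop])
    return chunks


STOP_WORDS = {"the", "a", "to", "is", "and"}


def content_words(sentence: str) -> set[str]:
    return {w.strip(".,").lower() for w in sentence.split()} - STOP_WORDS


def shares_a_word(chunk: list[str], prop: str) -> bool:
    """Stand-in for the LLM decision: accept if any content word overlaps."""
    chunk_words = set().union(*(content_words(s) for s in chunk))
    return bool(chunk_words & content_words(prop))


props = ["Greg went to the park.", "Greg likes walking", "Returns for performance are superlinear."]
grouped = agentic_chunk(props, shares_a_word)
print(grouped)
```

Swapping `shares_a_word` for a function that asks a model whether a proposition fits a chunk leaves the loop unchanged.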


In [None]:
from langchain_openai import ChatOpenAI
from typing import List
from pydantic import BaseModel
from langchain import hub

In [18]:
# Pydantic data class
class Sentences(BaseModel):
    sentences: List[str]


obj = hub.pull("wfh/proposal-indexing")
llm = ChatOpenAI(model='gpt-4.1-nano').with_structured_output(Sentences)

# use it in a runnable
runnable = obj | llm

# Then wrap it together in a function that'll return a list of propositions to us


def get_propositions(text):
    result = runnable.invoke({
        "input": text
    })
    print(text)
    print(result)  # Debugging: print the result to see its structure
    print("==" * 20)
    # with_structured_output(Sentences) returns a Sentences instance, so return its list of strings
    return result.sentences

In [19]:
with open('resources_rag/superlinear.txt') as file:
    essay = file.read()

In [20]:
paragraphs = essay.split("\n\n")
len(paragraphs)

# Process just the first few paragraphs
essay_propositions = []

for i, para in enumerate(paragraphs[:5]):
    propositions = get_propositions(para)
    essay_propositions.extend(propositions)
    print(f"Done with {i}")

One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.
sentences=["One of the most important things the speaker didn't understand about the world when the speaker was a child is the degree to which the returns for performance are superlinear.", 'The speaker is referencing their childhood.', "The speaker's childhood was a time when the speaker did not understand this particular concept.", 'The concept involves the relationship between performance and returns.', 'The returns for performance are superlinear, meaning they increase more than proportionally as performance improves.']
Done with 0
Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out o