# **Apply Text Splitting Techniques to Enhance Model Responsiveness**


In [1]:
%%capture
!pip install "langchain==0.2.7"
!pip install "langchain-core==0.2.20"
!pip install "langchain-text-splitters==0.2.2"
!pip install "lxml==5.2.2"

## Text splitters


### Key parameters

When using a splitter, you can customize several key parameters to fit your needs:

- `separator`: The character sequence used to split the text.
- `chunk_size`: The maximum size of each chunk, which controls how granular or broad the chunks are.
- `chunk_overlap`: The number of characters shared by consecutive chunks. Overlap maintains context between chunks and helps ensure that information isn't lost at chunk boundaries.
- `length_function`: The function used to measure the length of a chunk.
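To make the interaction between `separator`, `chunk_size`, and `length_function` concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: `pack_pieces` is a hypothetical helper that splits on a separator and greedily merges pieces while the measured length stays within the limit. Overlap is left out to keep it short.

```python
def pack_pieces(text, separator, chunk_size, length_function=len):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`.

    Hypothetical helper for illustration only; it ignores chunk overlap.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        # Keep growing the current chunk while it fits. A single oversized
        # piece is still accepted as its own chunk.
        if not current or length_function(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sentence = "one two three four five six"

# Measure length in characters (the default, `len`).
by_chars = pack_pieces(sentence, " ", chunk_size=9)
print(by_chars)  # ['one two', 'three', 'four five', 'six']

# Measure length in words instead of characters.
by_words = pack_pieces(sentence, " ", chunk_size=2,
                       length_function=lambda s: len(s.split()))
print(by_words)  # ['one two', 'three four', 'five six']
```

Swapping the length function changes what `chunk_size` means (characters versus words here) without touching the splitting logic.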

### Prepare the document


A long document has been prepared for this project to demonstrate the performance of each splitter. Run the following code to download it.


In [1]:
# !wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/YRYau14UJyh0DdiLDdzFcA/companypolicies.txt"

In [3]:
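# This is a long document you can split up.
# The path assumes the file is stored in a Data/ folder; adjust it if you downloaded the file elsewhere.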
company_policies_data = "Data/companypolicies.txt"
with open(company_policies_data) as f:
    companypolicies = f.read()

In [6]:
print(companypolicies)

1.	Code of Conduct

Our Code of Conduct outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built on integrity, respect, and accountability.
Integrity: We hold ourselves to the highest ethical standards. This means acting honestly and transparently in all our interactions, whether with colleagues, clients, or the broader community. We respect and protect sensitive information, and we avoid conflicts of interest.
Respect: We embrace diversity and value each individual's contributions. Discrimination, harassment, or any form of disrespectful behavior is unacceptable. We create an inclusive environment where differences are celebrated and everyone is treated with dignity and courtesy.
Accountability: We take responsibility for our actions and decisions. We follow all relevant laws and regulations, and we strive to continuously improve our practices. We report any potential violations of 

### Document object

Before introducing the splitters, let's take a look at the `Document` object in LangChain, a data structure used to represent and manage text data in a retrieval-augmented generation (RAG) process.

A Document object in LangChain contains information about some data. It has two attributes:

- `page_content: str`: The content of this document. Currently, this is always a string.
- `metadata: dict`: Arbitrary metadata associated with this document. Can track the document id, file name, etc.
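The two-attribute shape described above can be pictured with a small stand-in. This is a sketch, not LangChain's class: `SimpleDocument`, `to_documents`, and the `chunk_index` key are hypothetical names used only to show how content and metadata travel together.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two attributes described above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_documents(chunks, metadata):
    """Wrap each text chunk with a copy of the shared metadata.

    The `chunk_index` key is an illustrative addition, not a LangChain field.
    """
    return [
        SimpleDocument(page_content=chunk, metadata={**metadata, "chunk_index": i})
        for i, chunk in enumerate(chunks)
    ]


docs = to_documents(["1. Code of Conduct", "Our Code of Conduct outlines..."],
                    {"document": "Company Policies"})
print(docs[0].page_content)  # 1. Code of Conduct
print(docs[1].metadata)      # {'document': 'Company Policies', 'chunk_index': 1}
```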


### Split by Character

This is the simplest method. It splits the text on a single separator (by default `"\n\n"`) and measures chunk length by the number of characters.
- **How the text is split**: By a single separator.
- **How the chunk size is measured**: By number of characters.


In the following code, you will use `CharacterTextSplitter` to split the document by character. 
- Separator: Set to `''`, meaning that the text can be split between any two characters once a chunk reaches the size limit.
- Chunk size: Set to `200`, meaning that each chunk contains at most 200 characters.
- Chunk overlap: Set to `20`, meaning there will be `20` characters overlapping between chunks.
- Length function: Set to `len`.



In [12]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
)

In [13]:
texts = text_splitter.split_text(companypolicies)

In [14]:
len(texts)

87

In [15]:
texts[0]

'1.\tCode of Conduct\n\nOur Code of Conduct outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built'
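With an empty separator, the behaviour is close to a fixed-size sliding window over the text. The following pure-Python sketch illustrates that idea under simplifying assumptions (no whitespace stripping, no separator handling); `sliding_window_chunks` is a hypothetical name, not a LangChain function.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    Illustrative only; not LangChain's implementation.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk starts with the last two characters of the previous one, which is the role `chunk_overlap` plays at chunk boundaries.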

You can also use the following code to attach metadata to the text, wrapping each chunk in a LangChain `Document` object.


In [16]:
texts = text_splitter.create_documents([companypolicies], metadatas=[{"document":"Company Policies"}])  # pass the metadata as well

In [17]:
texts[0]

Document(metadata={'document': 'Company Policies'}, page_content='1.\tCode of Conduct\n\nOur Code of Conduct outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built')

### Recursively Split by Character


This text splitter is the recommended one for generic text. It is parameterized by a list of characters, and it tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`.

It first attempts to split the text on the first separator, `\n\n`. If the resulting chunks are still too large, it moves to the next separator, `\n`, and splits on that. This process continues through the list of separators until every chunk fits within the specified chunk size.

This method aims to keep all paragraphs (then sentences, then words) together as much as possible, as these are generally the most semantically related pieces of text.

- **How the text is split**: by list of characters.
- **How the chunk size is measured**: by number of characters.


The `RecursiveCharacterTextSplitter` class from LangChain implements this strategy. In the following code:
- The default separator list, `["\n\n", "\n", " ", ""]`, is used.
- Chunk size is set to `100`.
- Chunk overlap is set to `20`.
- The length function is `len`.


In [18]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

texts = text_splitter.create_documents([companypolicies])
texts

[Document(page_content='1.\tCode of Conduct'),
 Document(page_content='Our Code of Conduct outlines the fundamental principles and ethical standards that guide every'),
 Document(page_content='that guide every member of our organization. We are committed to maintaining a workplace that is'),
 Document(page_content='a workplace that is built on integrity, respect, and accountability.'),
 Document(page_content='Integrity: We hold ourselves to the highest ethical standards. This means acting honestly and'),
 Document(page_content='acting honestly and transparently in all our interactions, whether with colleagues, clients, or the'),
 Document(page_content='clients, or the broader community. We respect and protect sensitive information, and we avoid'),
 Document(page_content='and we avoid conflicts of interest.'),
 Document(page_content="Respect: We embrace diversity and value each individual's contributions. Discrimination,"),
 Document(page_content='Discrimination, harassment, or any form

In [19]:
len(texts)

215
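The recursive strategy described in this section can be sketched in plain Python. This is a simplified illustration, not the library's code: `recursive_split` is a hypothetical function, it measures length with `len`, and it omits chunk overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with the remaining separators
    only for pieces that are still too large. Illustrative sketch only."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        if len(piece) > chunk_size:
            # Flush what has been collected, then split the big piece further.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small neighbours back together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit."
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=25)
print(result)
# ['Para one is short.', 'Para two is a bit longer', 'than the limit.']
```

The short paragraph survives intact, while the longer one falls through to the word-level separator, which is the "paragraphs, then sentences, then words" preference described above.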

### Split Code


The `CodeTextSplitter` approach lets you split source code in multiple programming languages. In the cells below, it is implemented with `RecursiveCharacterTextSplitter` and language-specific separators: import the `Language` enum and specify the language.


In [20]:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
[e.value for e in Language]

['cpp',
 'go',
 'java',
 'kotlin',
 'js',
 'ts',
 'php',
 'proto',
 'python',
 'rst',
 'ruby',
 'rust',
 'scala',
 'swift',
 'markdown',
 'latex',
 'html',
 'sol',
 'csharp',
 'cobol',
 'c',
 'lua',
 'perl',
 'haskell',
 'elixir']

In [21]:
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']

#### Python

The following demonstrates how to split Python code using the `RecursiveCharacterTextSplitter` class.

The main difference from the plain `RecursiveCharacterTextSplitter` is that you construct the splitter with the `RecursiveCharacterTextSplitter.from_language` class method and pass the `language`. The other parameters work as before.


In [22]:
PYTHON_CODE = """
    def hello_world():
        print("Hello, World!")
    
    # Call the function
    hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='def hello_world():'),
 Document(page_content='print("Hello, World!")'),
 Document(page_content='# Call the function\n    hello_world()')]
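To see why language-specific separators such as `'\nclass '` and `'\ndef '` are useful, the sketch below splits source code just before each top-level definition using only the standard library. `split_at_definitions` is a hypothetical helper and covers only the first two Python separators; it does not reproduce the size-based merging that LangChain performs.

```python
import re

# First two entries of the Python separator list shown earlier.
PYTHON_SEPARATORS = ["\nclass ", "\ndef "]


def split_at_definitions(code, separators=PYTHON_SEPARATORS):
    """Split `code` right before each separator, keeping the keyword.

    Only matches definitions that start at the beginning of a line.
    Illustrative sketch only.
    """
    pattern = "|".join(re.escape(sep) for sep in separators)
    # A lookahead splits *before* the separator instead of consuming it.
    parts = re.split(f"(?=(?:{pattern}))", code)
    return [part.strip() for part in parts if part.strip()]


source = "import os\n\ndef a():\n    return 1\n\nclass B:\n    pass\n"
blocks = split_at_definitions(source)
for block in blocks:
    print(repr(block))
# 'import os'
# 'def a():\n    return 1'
# 'class B:\n    pass'
```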

In [23]:
RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)


['\nfunction ',
 '\nconst ',
 '\nlet ',
 '\nvar ',
 '\nclass ',
 '\nif ',
 '\nfor ',
 '\nwhile ',
 '\nswitch ',
 '\ncase ',
 '\ndefault ',
 '\n\n',
 '\n',
 ' ',
 '']

In [24]:
JS_CODE = """
    function helloWorld() {
      console.log("Hello, World!");
    }
    
    // Call the function
    helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs

[Document(page_content='function helloWorld() {'),
 Document(page_content='console.log("Hello, World!");\n    }'),
 Document(page_content='// Call the function\n    helloWorld();')]

### Markdown Header Text Splitter

As mentioned, chunking often aims to keep text with a common context together. With this in mind, you might want to specifically honor the structure of the document itself. For example, a Markdown file is organized by headers. Creating chunks within specific header groups is an intuitive approach.

To address this challenge, you can use `MarkdownHeaderTextSplitter`. This splitter will divide a Markdown file based on a specified set of headers.


In [25]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
md = "# Foo\n\n## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n### Boo \n\nHi this is Lance \n\n## Baz\n\nHi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(md)
md_header_splits

[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Jim  \nHi this is Joe'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='Hi this is Lance'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Hi this is Molly')]

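From the split results, you can see that the Markdown file is divided into several chunks formatted as `Document` objects. The `page_content` contains the text under the headings, and the `metadata` contains the header information corresponding to the `page_content`. By default, the headers themselves are removed from `page_content`. To keep them, pass `strip_headers=False`, as in the following cell.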


In [26]:
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(md)
md_header_splits

[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='# Foo  \n## Bar  \nHi this is Jim  \nHi this is Joe'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='### Boo  \nHi this is Lance'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='## Baz  \nHi this is Molly')]
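The header-tracking idea behind this splitter can be sketched without LangChain. The function below is a simplified illustration (hypothetical name `split_by_markdown_headers`, plain dicts instead of `Document` objects, headers always stripped): it remembers the currently active header at each level and attaches that state to the text that follows.

```python
def split_by_markdown_headers(markdown, headers_to_split_on):
    """Group text under the headers that are active when it appears.

    Illustrative sketch only; returns dicts rather than Document objects.
    """
    active = {}   # header level -> (metadata key, header title)
    buffer = []   # text lines collected under the current headers
    chunks = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"metadata": metadata, "page_content": content})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, name in headers_to_split_on:
            if stripped.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or deeper level.
                for deeper in [lvl for lvl in active if lvl >= level]:
                    del active[deeper]
                active[level] = (name, stripped[len(marker):].strip())
                break
        else:
            if stripped:
                buffer.append(stripped)
    flush()
    return chunks


md_text = ("# Foo\n\n## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n"
           "### Boo \n\nHi this is Lance \n\n## Baz\n\nHi this is Molly")
sections = split_by_markdown_headers(
    md_text, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)
for section in sections:
    print(section)
```

On this input the sketch yields three chunks whose metadata matches the header hierarchy in the LangChain output above; only the whitespace inside `page_content` differs.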

### Split by HTML

Similar in concept to the `MarkdownHeaderTextSplitter`, the `HTMLHeaderTextSplitter` is a "structure-aware" chunker that splits text at the element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with two objectives: (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.


In [27]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Baz'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Some text about Baz'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some concluding text about Foo')]
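A rough picture of header-aware HTML chunking can be built with the standard library's `html.parser`. This sketch is an assumption-laden simplification, not the `HTMLHeaderTextSplitter` algorithm: `HeaderChunker` and `split_html_by_headers` are hypothetical names, element nesting (such as `<div>` boundaries) is ignored, and header text is never emitted as its own chunk, so the results will differ from the output above on nested documents.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Attach the currently active headers to each piece of body text."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # tag -> metadata key
        self.active = {}        # tag -> header title currently in effect
        self.in_header = None   # header tag being read, if any
        self.header_text = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            # "h1" < "h2" < "h3" as strings, so this drops same-or-deeper headers.
            for other in [t for t in self.active if t >= tag]:
                del self.active[other]
            self.active[tag] = "".join(self.header_text).strip()
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            self.header_text.append(data)
        else:
            metadata = {self.header_names[t]: title
                        for t, title in sorted(self.active.items())}
            self.chunks.append({"metadata": metadata, "page_content": text})


def split_html_by_headers(html, headers_to_split_on):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    return parser.chunks


html_sample = ("<h1>Foo</h1><p>Intro.</p>"
               "<h2>Bar</h2><p>About Bar.</p>"
               "<h2>Baz</h2><p>About Baz.</p>")
html_chunks = split_html_by_headers(
    html_sample, [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
)
for chunk in html_chunks:
    print(chunk)
```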

#### Split by HTML section

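Similar to the `HTMLHeaderTextSplitter`, the `HTMLSectionSplitter` is also a "structure-aware" chunker that splits text section by section based on headings. In the output below, each chunk's `page_content` begins with its heading text, and its `metadata` records only that heading rather than the full header hierarchy.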



In [28]:
from langchain_text_splitters import HTMLSectionSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]

html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Foo'}, page_content='Foo \n Some intro text about Foo.'),
 Document(metadata={'Header 2': 'Bar main section'}, page_content='Bar main section \n Some intro text about Bar.'),
 Document(metadata={'Header 3': 'Bar subsection 1'}, page_content='Bar subsection 1 \n Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 3': 'Bar subsection 2'}, page_content='Bar subsection 2 \n Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 2': 'Baz'}, page_content='Baz \n Some text about Baz \n \n \n Some concluding text about Foo')]