**https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/**

**Text Splitters**

At a high level, text splitters work as follows:

1) Split the text up into small, semantically meaningful chunks (often sentences).
2) Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
3) Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:

1) How the text is split
2) How the chunk size is measured
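
The three steps and the two customization axes above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `split_sentences`, `chunk_text`, and the `overlap` parameter (counted in pieces rather than characters) are hypothetical names chosen for this example.

```python
import re


def split_sentences(text):
    """Axis 1: how the text is split (here, a naive sentence splitter)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text, chunk_size, overlap=1, split_fn=split_sentences, length_fn=len):
    """Merge small pieces into chunks no larger than chunk_size.

    length_fn is axis 2: how the chunk size is measured.
    overlap is the number of trailing pieces repeated in the next chunk.
    A single piece larger than chunk_size is emitted as its own chunk.
    """
    chunks, current = [], []
    for piece in split_fn(text):
        if current and length_fn(" ".join(current + [piece])) > chunk_size:
            # Step 3: the chunk is full, so emit it and start a new one.
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
            # Shrink the carried-over overlap until the new piece fits.
            while current and length_fn(" ".join(current + [piece])) > chunk_size:
                current.pop(0)
        # Step 2: keep combining pieces into the current chunk.
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two. Three four. Five six. Seven eight."

# Size measured in characters, one sentence of overlap.
by_chars = chunk_text(text, chunk_size=22, overlap=1)

# Size measured in words, no overlap.
by_words = chunk_text(text, chunk_size=4, overlap=0,
                      length_fn=lambda s: len(s.split()))

print(by_chars)
print(by_words)
```

Swapping `split_fn` changes how the text is split, and swapping `length_fn` changes how the size is measured, without touching the merging loop.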

# HTML Text Splitter

Splits HTML at header elements (such as h1, h2, h3). Notably, it adds metadata to each chunk describing where that chunk came from, based on the headers above it in the HTML.


In [8]:
# !pip install lxml

In [3]:
# Basic imports
from langchain_community.document_loaders import WebBaseLoader  # helps to load data from a website
from langchain_text_splitters import HTMLHeaderTextSplitter

In [9]:
# sample data
html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits


[Document(metadata={}, page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Baz'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Some text about Baz'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some concluding text about Foo')]
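
To make the metadata in the output above more concrete, here is a small standard-library sketch of the idea: remember the headers currently in scope and attach them to each paragraph. It is a simplification under stated assumptions, not how `HTMLHeaderTextSplitter` is implemented, and it will not reproduce that output exactly. `HeaderTracker` and its plain-dict chunks are hypothetical names for this example.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Collect <p> text together with the headers currently in scope."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.current = {}     # header tag -> header text currently in scope
        self.open_tag = None  # tag whose text is being collected
        self.buffer = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels or tag == "p":
            self.open_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.open_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.open_tag:
            return
        text = "".join(self.buffer).strip()
        if tag in self.labels:
            # A new header replaces itself and every deeper level.
            # "h1" < "h2" < ... holds under plain string comparison.
            self.current = {t: v for t, v in self.current.items() if t < tag}
            self.current[tag] = text
        else:
            metadata = {self.labels[t]: v for t, v in sorted(self.current.items())}
            self.chunks.append({"metadata": metadata, "page_content": text})
        self.open_tag = None


tracker = HeaderTracker([("h1", "Header 1"), ("h2", "Header 2")])
tracker.feed(
    "<h1>Foo</h1><p>Intro.</p>"
    "<h2>Bar</h2><p>About Bar.</p>"
    "<h2>Baz</h2><p>About Baz.</p>"
)
for chunk in tracker.chunks:
    print(chunk)
```

Each paragraph carries the `Header 1` it sits under, and a new `h2` replaces the previous `Header 2` rather than accumulating alongside it.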

In [10]:
# Let's use HTML taken from a real website
data = """<html class="no-js" lang="en"> <!--<![endif]-->
<head>
    <meta charset="utf-8">
    <meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />

    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    
        <title>langchain_text_splitters.html.HTMLHeaderTextSplitter &mdash; 🦜🔗 LangChain 0.2.17</title>
    
    <link rel="canonical"
          href="https://api.python.langchain.com/en/latest/html/langchain_text_splitters.html.HTMLHeaderTextSplitter.html"/>

    

    <link rel="stylesheet"
          href="../_static/css/vendor/bootstrap.min.css"
          type="text/css"/>
            <link rel="stylesheet" href="../_static/pygments.css" type="text/css"/>
            <link rel="stylesheet" href="../_static/css/theme.css" type="text/css"/>
            <link rel="stylesheet" href="../_static/autodoc_pydantic.css" type="text/css"/>
            <link rel="stylesheet" href="../_static/copybutton.css" type="text/css"/>
            <link rel="stylesheet" href="../_static/sphinx-dropdown.css" type="text/css"/>
            <link rel="stylesheet" href="../_static/panels-bootstrap.min.css" type="text/css"/>
            <link rel="stylesheet" href="../_static/css/custom.css" type="text/css"/>
    <link rel="stylesheet" href="../_static/css/theme.css" type="text/css"/>
    <script id="documentation_options" data-url_root="../"
            src="../_static/documentation_options.js"></script>
    <script src="../_static/jquery.js"></script> 
<script async type="text/javascript" src="/_/static/javascript/readthedocs-addons.js"></script><meta name="readthedocs-project-slug" content="langchain" /><meta name="readthedocs-version-slug" content="latest" /><meta name="readthedocs-resolver-filename" content="/html/langchain_text_splitters.html.HTMLHeaderTextSplitter.html" /><meta name="readthedocs-http-status" content="200" /></head>
<body>
<div class="banner">
    <p>This is a legacy site. Please use the latest <a href="https://python.langchain.com/v0.2/api_reference/reference.html">v0.2</a> and <a href="https://python.langchain.com/api_reference/">v0.3</a> API references instead.</p>
</div>
"""

In [13]:
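# Note: "link rel" is not an HTML tag name, so no header matches it and the
# result below is a single Document with empty metadata.
headers_to_split_on = [
    ("link rel", "Header 1"),
]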
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(data)
html_header_splits

[Document(metadata={}, page_content='This is a legacy site. Please use the latest v0.2 and v0.3 API references instead.')]
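
Before choosing `headers_to_split_on`, it can help to check which header tags a page actually contains. The sketch below uses only the standard library; `HeaderCounter` and `count_header_tags` are hypothetical helpers for this example, not part of LangChain.

```python
from collections import Counter
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeaderCounter(HTMLParser):
    """Count how often each header tag (h1 to h6) opens in the document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self.counts[tag] += 1


def count_header_tags(html):
    parser = HeaderCounter()
    parser.feed(html)
    return dict(parser.counts)


with_headers = "<h1>Foo</h1><p>Intro</p><h2>Bar</h2><h2>Baz</h2>"
without_headers = (
    '<head><link rel="stylesheet" href="a.css"/></head>'
    "<body><p>Hi</p></body>"
)

print(count_header_tags(with_headers))
print(count_header_tags(without_headers))
```

An empty result means the page has no header elements, so a header-based splitter has nothing to split on.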

In [14]:
from langchain_text_splitters import HTMLSectionSplitter, RecursiveCharacterTextSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text(html_string)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits

[Document(metadata={'Header 1': 'Foo'}, page_content='Foo \n Some intro text about Foo.'),
 Document(metadata={'Header 2': 'Bar main section'}, page_content='Bar main section \n Some intro text about Bar.'),
 Document(metadata={'Header 3': 'Bar subsection 1'}, page_content='Bar subsection 1 \n Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 3': 'Bar subsection 2'}, page_content='Bar subsection 2 \n Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 2': 'Baz'}, page_content='Baz \n Some text about Baz \n \n \n Some concluding text about Foo')]
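
Every section in the output above is shorter than `chunk_size=500`, so the second splitter returned each one whole. To show what happens to longer text, here is a simplified sketch of recursive character splitting: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. `recursive_split` is a hypothetical function for this example. It ignores overlap and is not the library's implementation.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing left to split on; keep the oversized piece
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # Still too long: retry this part with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


short = recursive_split("Baz \n Some text about Baz", chunk_size=500)
mixed = recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7)
hard = recursive_split("abcdefghij", chunk_size=4)

print(short)
print(mixed)
print(hard)
```

Paragraph breaks are preferred over line breaks, line breaks over spaces, and a hard character cut is used only when no separator helps.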

# Split text from URL

In [17]:
# Define the headers first, then build the splitter from them
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
splitted_text = html_splitter.split_text_from_url("https://api.python.langchain.com/en/latest/html/langchain_text_splitters.html.HTMLHeaderTextSplitter.html")


In [18]:
splitted_text

[Document(metadata={}, page_content='This is a legacy site. Please use the latest v0.2 and v0.3 API references instead.  \nLangChain Core Community Experimental Text splitters ai21 airbyte anthropic astradb aws azure-dynamic-sessions box chroma cohere couchbase elasticsearch exa fireworks google-community google-genai google-vertexai groq huggingface ibm milvus mistralai mongodb nomic nvidia-ai-endpoints ollama openai pinecone postgres prompty qdrant robocorp together unstructured voyageai weaviate Partner libs Docs  \nai21 airbyte anthropic astradb aws azure-dynamic-sessions box chroma cohere couchbase elasticsearch exa fireworks google-community google-genai google-vertexai groq huggingface ibm milvus mistralai mongodb nomic nvidia-ai-endpoints ollama openai pinecone postgres prompty qdrant robocorp together unstructured voyageai weaviate  \nToggle Menu  \nlangchain_text_splitters.html.HTMLHeaderTextSplitter  \nHTMLHeaderTextSplitter  \nHTMLHeaderTextSplitter.__init__() HTMLHeaderTex