# How to split by HTML header
Similar in concept to the MarkdownHeaderTextSplitter, the HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the element level and attaches to each chunk metadata for every header "relevant" to it. It can return chunks element by element or combine elements that share the same metadata, with two objectives: (a) keeping related text grouped (more or less) semantically and (b) preserving the context-rich information encoded in the document's structure. It can be used with other text splitters as part of a chunking pipeline.
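
To make the idea of headers "relevant" to a chunk concrete, here is a minimal standard-library sketch. It is not how HTMLHeaderTextSplitter is implemented. `HeaderTracker` and `combine_same_metadata` are hypothetical helpers written for illustration, and they assume the headers are `h1` to `h6` tags and ignore `<div>` nesting. Unlike the library, the sketch does not emit the header text as a chunk of its own.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Hypothetical helper: tags each text element with the headers above it."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.names = dict(headers_to_split_on)  # e.g. {"h1": "header 1"}
        self.active = {}       # header tag -> header text currently in scope
        self.in_header = None  # header tag being read, if any
        self.chunks = []       # list of (metadata dict, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            # A new header drops any active header at the same or a deeper level.
            level = int(self.in_header[1])
            self.active = {t: v for t, v in self.active.items() if int(t[1]) < level}
            self.active[self.in_header] = text
        else:
            metadata = {self.names[t]: v for t, v in sorted(self.active.items())}
            self.chunks.append((metadata, text))


def combine_same_metadata(chunks):
    """Merge consecutive chunks whose metadata is identical."""
    combined = []
    for metadata, text in chunks:
        if combined and combined[-1][0] == metadata:
            combined[-1] = (metadata, combined[-1][1] + "\n" + text)
        else:
            combined.append((metadata, text))
    return combined


sample_html = (
    "<h1>Foo</h1><p>Intro.</p>"
    "<h2>Bar</h2><p>One.</p><p>Two.</p>"
    "<h2>Baz</h2><p>Three.</p>"
)
tracker = HeaderTracker([("h1", "header 1"), ("h2", "header 2")])
tracker.feed(sample_html)
tracker.close()

chunks = combine_same_metadata(tracker.chunks)
for metadata, text in chunks:
    print(metadata, repr(text))
```

In this sketch the two paragraphs under "Bar" share the same metadata and are merged into one chunk, while the paragraph under "Baz" keeps "Foo" as its `header 1` but gets a new `header 2`. That mirrors the two return modes described above: element by element (`tracker.chunks`) or combined by metadata.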

In [6]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "header 1"),
    ("h2", "header 2"),
    ("h3", "header 3")
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
split_texts = html_splitter.split_text(html_string)
split_texts


[Document(metadata={'header 1': 'Foo'}, page_content='Foo'),
 Document(metadata={'header 1': 'Foo'}, page_content='Some intro text about Foo.'),
 Document(metadata={'header 1': 'Foo', 'header 2': 'Bar main section'}, page_content='Bar main section'),
 Document(metadata={'header 1': 'Foo', 'header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'header 1': 'Foo', 'header 2': 'Bar main section', 'header 3': 'Bar subsection 1'}, page_content='Bar subsection 1'),
 Document(metadata={'header 1': 'Foo', 'header 2': 'Bar main section', 'header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'header 1': 'Foo', 'header 2': 'Bar main section', 'header 3': 'Bar subsection 2'}, page_content='Bar subsection 2'),
 Document(metadata={'header 1': 'Foo', 'header 2': 'Bar main section', 'header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'header 

In [13]:
from langchain_text_splitters import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "header 1"),
    ("h2", "header 2"),
    ("h3", "header 3")
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)

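# Constrain chunk size: split each header-based chunk further on characters
chunk_size = 500
chunk_overlap = 50

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", " ", ""]
)

# Header metadata from the first pass is carried over to the smaller chunks
splits = text_splitter.split_documents(html_header_splits)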

splits[80:100]  # Display a few splits to see the results

[Document(metadata={'header 1': 'Kurt Gödel'}, page_content='follows:  \n(Gödel’s First Incompleteness\nTheorem) If is ω-consistent, then there is a sentence which is\nneither provable nor refutable from .  \nTheorem 3  \nP  \nP  \nBy judicious coding of syntax referred to above, write\na formula\n Prf( , ) of number theory, representable in , so that  \nProof:  \nx  \ny  \n[ ]  \n11  \nP  \ncodes a proof of φ ⇒ ⊢\nPrf( , ).  \nn  \nP  \nn  \n⌈  \nφ  \n⌉  \nand  \ndoes not code a proof of φ ⇒ ⊢ ¬Prf( , ).  \nn  \nP  \nn  \n⌈  \nφ  \n⌉  \nLet Prov( ) denote the formula ∃ Prf( , ) .'),
 Document(metadata={'header 1': 'Kurt Gödel'}, page_content='⌉  \nLet Prov( ) denote the formula ∃ Prf( , ) .\n By Theorem 2 there is a sentence φ with the property  \ny  \nx  \nx  \ny  \n[ ]  \n12  \n⊢ (φ ↔\n¬Prov( )).  \nP  \n⌈  \nφ  \n⌉  \nThus φ says ‘I am not provable.’ We now observe, if ⊢ φ, then by (1) there is such that ⊢ Prf( , ), hence ⊢ Prov( ), hence,\nby (3) ⊢ ¬φ, so is inconsistent.\nThus  \