### HTML Header Text Splitter

HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and attaches metadata for each header that is "relevant" to a given chunk, that is, each tracked header the chunk sits under.

It can return chunks element by element or combine elements with the same metadata, with the objectives of:

- keeping related text grouped (more or less semantically)
- preserving context-rich information encoded in document structures

It can be used with other text splitters as part of a chunking pipeline.
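
To make the two return modes described above more concrete, here is a minimal, standard-library-only sketch of header-scoped chunking. It is an illustration of the idea, not how HTMLHeaderTextSplitter is implemented: the `HeaderChunker` class, the `combine` helper, and the sample HTML are all hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware chunker (illustrative only, not the LangChain implementation)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.metadata = {}        # tracked headers currently in scope
        self.chunks = []          # (metadata, text) pairs, one per text fragment
        self.open_header = None   # tracked header tag we are currently inside
        self.skip_depth = 0       # > 0 while inside <script>, <style>, or <title>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style", "title"):
            self.skip_depth += 1
        elif tag in self.header_names:
            self.open_header = tag

    def handle_endtag(self, tag):
        if tag in ("script", "style", "title"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.open_header:
            self.open_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.open_header:
            # A new header replaces same-level and deeper headers in scope
            # ("h1" < "h2" < ... compares correctly as plain strings).
            for tag, name in self.header_names.items():
                if tag >= self.open_header:
                    self.metadata.pop(name, None)
            self.metadata[self.header_names[self.open_header]] = text
        self.chunks.append((dict(self.metadata), text))


def combine(chunks):
    """Merge consecutive fragments that share identical metadata."""
    merged = []
    for meta, text in chunks:
        if merged and merged[-1][0] == meta:
            merged[-1] = (meta, merged[-1][1] + "\n" + text)
        else:
            merged.append((meta, text))
    return merged


sample_html = """
<html><head><title>Demo</title></head><body>
  <h1>Intro</h1><p>First paragraph.</p><p>Second paragraph.</p>
  <h2>Details</h2><p>Nested text.</p>
  <h1>Next</h1><p>Last paragraph.</p>
</body></html>
"""

chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2")])
chunker.feed(sample_html)

per_element = chunker.chunks      # one chunk per text fragment
combined = combine(per_element)   # fragments with the same metadata merged

for meta, text in combined:
    print(meta, repr(text))
```

The per-element list keeps every fragment separate, while `combine` groups fragments that sit under the same headers, which is the trade-off between fine-grained chunks and chunks that keep related text together.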

In [5]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!doctype html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>DigitalNext</title>
  </head>
  <body>
    <div id="app">
        <h1>text in header 1</h1>
        <h2>text in header 2</h2>
        <h3>text in header 3</h3>
    </div>
    
  </body>
</html>

"""

In [7]:
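# (tag, metadata key) pairs: only these header levels are tracked as metadata.
# h3 is not listed, so its text is treated as ordinary content under the
# nearest h1/h2.
headers_to_split = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split)
html_header_splits = html_splitter.split_text(html_string)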

html_header_splits

[Document(metadata={'Header 1': 'text in header 1'}, page_content='text in header 1'),
 Document(metadata={'Header 1': 'text in header 1', 'Header 2': 'text in header 2'}, page_content='text in header 2'),
 Document(metadata={'Header 1': 'text in header 1', 'Header 2': 'text in header 2'}, page_content='text in header 3')]

In [8]:
url = "https://python.langchain.com/docs/integrations/document_loaders/"

headers_to_split = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split)
# Fetches the page over HTTP, then splits the returned HTML (requires network access).
html_header_splits = html_splitter.split_text_from_url(url)

html_header_splits[0]

Document(metadata={}, page_content='!function(){function t(t){document.documentElement.setAttribute("data-theme",t)}var e=function(){try{return new URLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{return window.localStorage.getItem("theme")}catch(t){}}();null!==e?t(e):window.matchMedia("(prefers-color-scheme: dark)").matches?t("dark"):(window.matchMedia("(prefers-color-scheme: light)").matches,t("light"))}(),function(){try{const n=new URLSearchParams(window.location.search).entries();for(var[t,e]of n)if(t.startsWith("docusaurus-data-")){var a=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(a,e)}}catch(t){}}(),document.documentElement.setAttribute("data-announcement-bar-initially-dismissed",function(){try{return"true"===localStorage.getItem("docusaurus.announcement.dismiss")}catch(t){}return!1}())  \nSkip to main content  \nOur course is now available on LangChain Academy!  \nBuilding Ambient Agents with LangGraph 

In [9]:
html_header_splits[2]

Document(metadata={}, page_content='DocumentLoaders load data into the standard LangChain Document format.  \nEach DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the .load method.\nAn example use case is as follows:  \nfrom  \nlangchain_community  \n.  \ndocument_loaders  \n.  \ncsv_loader  \nimport  \nCSVLoader  \nloader  \n=  \nCSVLoader  \n(  \n.  \n.  \n.  \n# <-- Integration specific parameters here  \n)  \ndata  \n=  \nloader  \n.  \nload  \n(  \n)')