# HTML

>[HTML](https://en.wikipedia.org/wiki/HTML) is the standard markup language for documents designed to be displayed in a web browser.

`HtmlTextSplitter` splits text along HTML element boundaries. It's implemented as a simple subclass of `RecursiveCharacterTextSplitter` with HTML-specific separators. See the source code for the separators used by default.

1. How the text is split: by a list of HTML-specific separators
2. How the chunk size is measured: by the number of characters
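
The two points above can be sketched in plain Python, without LangChain. This is a minimal illustration of recursive separator-based splitting, not the library's implementation: the separator list `HTML_SEPARATORS` and the helpers `split_keep_separator` and `recursive_split` are hypothetical names chosen for this example, the real default separators may differ, and chunk overlap is left out for brevity.

```python
from typing import List

# Hypothetical separators for illustration only; the library's defaults may differ.
# Coarser separators come first, finer ones are used only for oversized pieces.
HTML_SEPARATORS = ["<body", "<div", "<p", "<h1", "\n", " "]


def split_keep_separator(text: str, sep: str) -> List[str]:
    """Split on sep, re-attaching sep to the front of each following piece."""
    parts = text.split(sep)
    pieces = [parts[0]] + [sep + part for part in parts[1:]]
    return [piece for piece in pieces if piece]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Return chunks of at most chunk_size characters (measured with len)."""
    if len(text) <= chunk_size:
        return [text]

    # Pick the first separator that actually occurs in the text.
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = split_keep_separator(text, sep)
            remaining = separators[i + 1:]
            break
    else:
        # No separator applies: fall back to fixed-width cuts.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Flush what we have, then split the oversized piece more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
        elif len(current) + len(piece) <= chunk_size:
            # Merge small neighbouring pieces into one chunk.
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "<div><h1>Title</h1><p>First paragraph of text.</p></div>"
    "<div>Second block of plain text inside a div.</div>"
)
chunks = recursive_split(sample, HTML_SEPARATORS, chunk_size=60)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Because each separator stays attached to the piece that follows it, joining the chunks reproduces the input exactly. A real splitter may also strip whitespace and add overlap between neighbouring chunks.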

In [1]:
from langchain.text_splitter import HtmlTextSplitter

In [12]:
html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

html_splitter = HtmlTextSplitter(chunk_size=175, chunk_overlap=20)

In [13]:
docs = html_splitter.create_documents([html_text])

In [14]:
docs

[Document(page_content='<!DOCTYPE html>\n<html>', metadata={}),
 Document(page_content='<title>🦜️🔗 LangChain</title>', metadata={}),
 Document(page_content='body {\n                font-family: Arial, sans-serif;\n            }\n            h1 {\n                color: darkblue;\n            }\n        </style>\n    </head>', metadata={}),
 Document(page_content='/style>\n    </head>', metadata={}),
 Document(page_content='<div>\n            <h1>🦜️🔗 LangChain</h1>\n            <p>⚡ Building applications with LLMs through composability ⚡</p>\n        </div>', metadata={}),
 Document(page_content='As an open source project in a rapidly developing field, we are extremely open to contributions.\n        </div>\n    </body>\n</html>', metadata={})]

In [15]:
html_splitter.split_text(html_text)

['<!DOCTYPE html>\n<html>',
 '<title>🦜️🔗 LangChain</title>',
 'body {\n                font-family: Arial, sans-serif;\n            }\n            h1 {\n                color: darkblue;\n            }\n        </style>\n    </head>',
 '/style>\n    </head>',
 '<div>\n            <h1>🦜️🔗 LangChain</h1>\n            <p>⚡ Building applications with LLMs through composability ⚡</p>\n        </div>',
 'As an open source project in a rapidly developing field, we are extremely open to contributions.\n        </div>\n    </body>\n</html>']
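
To check that these chunks respect `chunk_size=175` when size is measured in characters, one can measure each string with `len`. The sketch below copies the output above into a plain list so that it runs without LangChain; the names `chunks` and `lengths` are used only for illustration. Python's `len` counts Unicode code points, so an emoji can count as more than one character.

```python
# Chunks copied verbatim from the split_text output shown above.
chunks = [
    '<!DOCTYPE html>\n<html>',
    '<title>🦜️🔗 LangChain</title>',
    'body {\n                font-family: Arial, sans-serif;\n            }\n            h1 {\n                color: darkblue;\n            }\n        </style>\n    </head>',
    '/style>\n    </head>',
    '<div>\n            <h1>🦜️🔗 LangChain</h1>\n            <p>⚡ Building applications with LLMs through composability ⚡</p>\n        </div>',
    'As an open source project in a rapidly developing field, we are extremely open to contributions.\n        </div>\n    </body>\n</html>',
]

# Chunk size is measured in characters, so len() is the relevant metric.
lengths = [len(chunk) for chunk in chunks]
for length, chunk in zip(lengths, chunks):
    print(length, repr(chunk[:40]))

# Every chunk should fit within the configured chunk_size of 175.
print(max(lengths) <= 175)
```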