
# File Based Node Parsers

The `SimpleFileNodeParser` and `FlatReader` work together to open a variety of file types and automatically select the best `NodeParser` for each one. The `FlatReader` loads a file as raw text and attaches the file information to the metadata. The `SimpleFileNodeParser` then maps the file type to one of the node parsers in `node_parser/file`, selecting the best node parser for the job.

The `SimpleFileNodeParser` does not perform token-based chunking of the text, and is intended to be used in combination with a token node parser.
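
The idea of mapping a file type to a parser can be sketched in plain Python. This is only an illustration of the dispatch pattern, not the library's implementation: `split_markdown`, `passthrough`, `PARSERS_BY_EXTENSION`, and `parse_by_extension` are hypothetical names, and returning unknown file types unchanged is an assumption of this sketch.

```python
from typing import Callable, Dict, List


def split_markdown(text: str) -> List[str]:
    """Hypothetical stand-in for a Markdown parser: new chunk at each header line."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def passthrough(text: str) -> List[str]:
    """Assumed fallback for this sketch: keep unknown file types as one chunk."""
    return [text]


# Hypothetical registry: file extension -> parsing function
PARSERS_BY_EXTENSION: Dict[str, Callable[[str], List[str]]] = {
    ".md": split_markdown,
}


def parse_by_extension(text: str, metadata: Dict[str, str]) -> List[str]:
    """Pick a parser from the 'extension' metadata key and apply it."""
    parser = PARSERS_BY_EXTENSION.get(metadata.get("extension", ""), passthrough)
    return parser(text)


chunks = parse_by_extension(
    "# Title\nintro\n## Usage\nrun it",
    {"filename": "README.md", "extension": ".md"},
)
print(chunks)
```

Note that the chunks follow the document structure only. Nothing here limits chunk length, which is why a token-based splitter is applied afterwards.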

Let's look at an example of using the `FlatReader` and `SimpleFileNodeParser` to load content. The README file used here is the LlamaIndex README and the HTML file is the Stack Overflow landing page, but any README and HTML file will work.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
%pip install llama-index-readers-file

In [None]:
!pip install llama-index

In [None]:
from llama_index.core.node_parser import SimpleFileNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path



In [None]:
reader = FlatReader()
html_file = reader.load_data(Path("./stack-overflow.html"))
md_file = reader.load_data(Path("./README.md"))
print(html_file[0].metadata)
print(html_file[0])
print("----")
print(md_file[0].metadata)
print(md_file[0])

{'filename': 'stack-overflow.html', 'extension': '.html'}
Doc ID: a6750408-b0fa-466d-be28-ff2fcbcbaa97
Text: <!DOCTYPE html>       <html class="html__responsive
html__unpinned-leftnav" lang="en">      <head>          <title>Stack
Overflow - Where Developers Learn, Share, &amp; Build Careers</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackove
rflow/Img/favicon.ico?v=ec617d715196">         <link rel="apple-touch-
icon" hr...
----
{'filename': 'README.md', 'extension': '.md'}
Doc ID: 1d872f44-2bb3-4693-a1b8-a59392c23be2
Text: # 🗂️ LlamaIndex 🦙 [![PyPI -
Downloads](https://img.shields.io/pypi/dm/llama-
index)](https://pypi.org/project/llama-index/) [![GitHub contributors]
(https://img.shields.io/github/contributors/jerryjliu/llama_index)](ht
tps://github.com/jerryjliu/llama_index/graphs/contributors) [![Discord
](https://img.shields.io/discord/1059199217496772688)](https:...
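
The metadata printed above contains only the file name and extension. A minimal sketch of that loading step, using only the standard library, might look like the following. `load_flat` is a hypothetical helper written for illustration; it is not the `FlatReader` implementation, and the UTF-8 encoding is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def load_flat(path: Path) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: return the raw text plus filename/extension metadata."""
    text = path.read_text(encoding="utf-8")  # assumes the file is UTF-8 text
    metadata = {"filename": path.name, "extension": path.suffix}
    return text, metadata


# Write a small throwaway file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.md"
    sample.write_text("# Notes\nhello", encoding="utf-8")
    text, metadata = load_flat(sample)

print(metadata)  # {'filename': 'notes.md', 'extension': '.md'}
```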


## Parsing the files

The flat reader has simply loaded the content of the files into `Document` objects for further processing. We can see that the file information is retained in the metadata. Let's pass the documents to the node parser to see how they are parsed.

In [None]:
parser = SimpleFileNodeParser()
md_nodes = parser.get_nodes_from_documents(md_file)
html_nodes = parser.get_nodes_from_documents(html_file)
print(md_nodes[0].metadata)
print(md_nodes[0].text)
print(md_nodes[1].metadata)
print(md_nodes[1].text)
print("----")
print(html_nodes[0].metadata)
print(html_nodes[0].text)

{'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙'}
🗂️ LlamaIndex 🦙
[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-index)](https://pypi.org/project/llama-index/)
[![GitHub contributors](https://img.shields.io/github/contributors/jerryjliu/llama_index)](https://github.com/jerryjliu/llama_index/graphs/contributors)
[![Discord](https://img.shields.io/discord/1059199217496772688)](https://discord.gg/dGcwcsnxhU)


LlamaIndex (GPT Index) is a data framework for your LLM application.

PyPI: 
- LlamaIndex: https://pypi.org/project/llama-index/.
- GPT Index (duplicate): https://pypi.org/project/gpt-index/.

LlamaIndex.TS (Typescript/Javascript): https://github.com/run-llama/LlamaIndexTS.

Documentation: https://gpt-index.readthedocs.io/.

Twitter: https://twitter.com/llama_index.

Discord: https://discord.gg/dGcwcsnxhU.
{'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙', 'Header 3': 'Ecosystem'}
Ecosystem

- LlamaHub (community librar

## Further processing of files

We can see that the Markdown and HTML files have been split into chunks based on the structure of the document. The Markdown node parser splits on any headers and attaches the hierarchy of headers to the metadata. The HTML node parser extracts text from common text elements to simplify the HTML file, and combines neighbouring nodes of the same element. Compared to working with raw HTML, this is already a big improvement in terms of retrieving meaningful text content.
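
To make the header hierarchy concrete, here is a simplified sketch of splitting Markdown on headers while tracking which headers are in scope. It is an approximation written for this notebook, not the library's parser: `split_on_headers` is a hypothetical function, and it ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_on_headers(text: str) -> List[Tuple[Dict[str, str], str]]:
    """Simplified sketch: split on ATX headers and record the header hierarchy."""
    sections: List[Tuple[Dict[str, str], str]] = []
    headers: Dict[int, str] = {}  # header level -> title currently in scope
    lines: List[str] = []

    def flush() -> None:
        # Emit the section collected so far, with the headers that apply to it
        body = "\n".join(lines).strip()
        if body:
            meta = {f"Header {lvl}": headers[lvl] for lvl in sorted(headers)}
            sections.append((meta, body))
        lines.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level
            for lvl in [l for l in headers if l >= level]:
                del headers[lvl]
            headers[level] = match.group(2).strip()
            lines.append(headers[level])
        else:
            lines.append(line)
    flush()
    return sections


doc = "# Project\nIntro text\n### Ecosystem\n- LlamaHub\n## Usage\nRun it"
sections = split_on_headers(doc)
for meta, body in sections:
    print(meta, repr(body))
```

The second section carries both `Header 1` and `Header 3`, mirroring the shape of the metadata shown in the output above, while the third drops `Header 3` because a level 2 header closes it.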

Because these files were only split according to the structure of the file, we can apply further processing with a text splitter to prepare the content into nodes of limited token length.

In [None]:
from llama_index.core.node_parser import SentenceSplitter

# For clarity in the demo, make small splits without overlap
splitting_parser = SentenceSplitter(chunk_size=200, chunk_overlap=0)

html_chunked_nodes = splitting_parser(html_nodes)
md_chunked_nodes = splitting_parser(md_nodes)
print(f"\n\nHTML parsed nodes: {len(html_nodes)}")
print(html_nodes[0].text)

print(f"\n\nHTML chunked nodes: {len(html_chunked_nodes)}")
print(html_chunked_nodes[0].text)

print(f"\n\nMD parsed nodes: {len(md_nodes)}")
print(md_nodes[0].text)

print(f"\n\nMD chunked nodes: {len(md_chunked_nodes)}")
print(md_chunked_nodes[0].text)



HTML parsed nodes: 67
About
Products
For Teams
Stack Overflow
Public questions & answers
Stack Overflow for Teams
Where developers & technologists share private knowledge with coworkers
Talent

								Build your employer brand
Advertising
Reach developers & technologists worldwide
Labs
The future of collective knowledge sharing
About the company
current community
















            Stack Overflow
        



help
chat









            Meta Stack Overflow
        






your communities            



Sign up or log in to customize your list.                


more stack exchange communities

company blog


HTML chunked nodes: 87
About
Products
For Teams
Stack Overflow
Public questions & answers
Stack Overflow for Teams
Where developers & technologists share private knowledge with coworkers
Talent

								Build your employer brand
Advertising
Reach developers & technologists worldwide
Labs
The future of collective knowledge sharing
About the company
current community
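
The difference between the parsed and chunked node counts comes from the second pass, which limits the length of each node. The general idea can be sketched as greedy sentence packing. This is an illustration under stated assumptions, not how `SentenceSplitter` is implemented: `pack_sentences` is a hypothetical function, a "token" is approximated by a whitespace-separated word, and there is no overlap between chunks.

```python
import re
from typing import List


def pack_sentences(text: str, max_tokens: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    Assumption: tokens are whitespace-separated words; a real tokenizer counts
    differently. A single sentence longer than max_tokens is kept whole.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
packed = pack_sentences(text, max_tokens=5)
print(packed)  # ['One two three. Four five.', 'Six seven eight nine. Ten.']
```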




## Summary

We can see that the files have been further processed within the splits created by `SimpleFileNodeParser`, and are now ready to be ingested by an index or vector store. The code cell below shows just the chaining of the parsers to go from raw file to chunked nodes:

In [None]:
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    documents=reader.load_data(Path("./README.md")),
    transformations=[
        SimpleFileNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
    ],
)

md_chunked_nodes = pipeline.run()
print(md_chunked_nodes)

[TextNode(id_='e6236169-45a1-4699-9762-c8d3d89f8fa0', embedding=None, metadata={'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e7bc328f-85c1-430a-9772-425e59909a58', node_type=None, metadata={'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙'}, hash='e538ad7c04f635f1c707eba290b55618a9f0942211c4b5ca2a4e54e1fdf04973'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='51b40b54-dfd3-48ed-b377-5ca58a0f48a3', node_type=None, metadata={'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙'}, hash='ca9e3590b951f1fca38687fd12bb43fbccd0133a38020c94800586b3579c3218')}, hash='ec733c85ad1dca248ae583ece341428ee20e4d796bc11adea1618c8e4ed9246a', text='🗂️ LlamaIndex 🦙\n[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-index)](https://pypi.org/project/llama-index/)\n[![GitHub contrib