-------------------------------
#### Understanding the simple text splitters
------------------------------

- **Text splitters** break down the document into **smaller pieces** at the raw text level.
- They are useful when the content has a **flat structure** and does not come in a specific format.
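
To make "splitting at the raw text level" concrete, here is a minimal sketch of a character-window splitter with overlap, written with plain Python only. The function name `split_text` and the `chunk_size` and `overlap` parameters are invented for this illustration and are not LlamaIndex APIs; real splitters usually respect sentence or token boundaries as well.

```python
def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that content cut at a
    boundary still appears intact in one of the two neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "In ancient Rome, the city of Rome itself was the heart of the vast Roman Empire."
chunks = split_text(sample)
for chunk in chunks:
    print(repr(chunk))
```

The overlap trades some duplicated text for context that survives a chunk boundary.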


#### SimpleFileNodeParser

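- **SimpleFileNodeParser** automatically selects the appropriate node parser based on the file type, which it reads from the `extension` entry in each document's metadata.
- This lets a single parser turn files of several formats into nodes, without choosing a parser for each file by hand.
- As implemented in `llama_index.core`, the file formats it dispatches on are:
  - **Markdown** (`.md`), handled by `MarkdownNodeParser`
  - **HTML** (`.html`), handled by `HTMLNodeParser`
  - **JSON** (`.json`), handled by `JSONNodeParser`
- Documents with any other extension, such as **plain text files**, are turned into a single node. PDF, DOCX, and CSV files need a dedicated reader to extract their text first; they are not parsed by SimpleFileNodeParser itself.
- The `FlatReader` cells in this section load a plain text file and show the `filename` and `extension` metadata that the parser relies on.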
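
The idea of picking a parser from the file type can be sketched as a lookup table keyed by extension. This is only an illustration under stated assumptions: `PARSERS`, `split_on_headers`, `split_top_level_keys`, and `parse_by_extension` are hypothetical names, the two toy parsers are far simpler than the real ones, and nothing here is the library's code.

```python
import json
from pathlib import Path
from typing import Callable


def split_on_headers(text: str) -> list[str]:
    """Toy Markdown parser: start a new chunk at every line beginning with '#'."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def split_top_level_keys(text: str) -> list[str]:
    """Toy JSON parser: one chunk per top-level key (assumes a JSON object)."""
    return [f"{key} {value}" for key, value in json.loads(text).items()]


# Hypothetical registry: file extension -> parser function
PARSERS: dict[str, Callable[[str], list[str]]] = {
    ".md": split_on_headers,
    ".json": split_top_level_keys,
}


def parse_by_extension(filename: str, text: str) -> list[str]:
    """Dispatch on the extension; unknown types fall back to a single chunk."""
    parser = PARSERS.get(Path(filename).suffix.lower(), lambda t: [t])
    return parser(text)


content = "# A\nalpha\n# B\nbeta"
md_nodes = parse_by_extension("notes.md", content)
txt_nodes = parse_by_extension("notes.txt", content)
print(md_nodes)
print(txt_nodes)
```

The same content yields two chunks when treated as Markdown and one chunk when the extension is unknown, which is the behaviour a dispatching parser is meant to provide.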


In [20]:
from llama_index.readers.file import FlatReader

In [21]:
from pathlib import Path

In [22]:
reader = FlatReader()

In [23]:
document = reader.load_data(Path("files/sample_document1.txt"))

In [24]:
document

[Document(id_='89023799-860a-4c30-8b1c-11ade2d2e0e6', embedding=None, metadata={'filename': 'sample_document1.txt', 'extension': '.txt'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="In ancient Rome, the city of Rome itself was the heart of the vast Roman Empire. It was known for its grand architecture, including iconic structures like the Colosseum and the Pantheon. The Romans were skilled engineers and builders, creating an extensive network of roads, aqueducts, and bridges that connected their far-reaching territories. The Roman Republic, with its Senate and elected officials, gave rise to the famous Roman legions, which conquered vast lands and brought them under Roman rule. The Roman civilization's influence on art, law, and governance can still be seen in modern societies today.", mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_sepera

In [25]:
print(f"Metadata: {document[0].metadata}")
print(f"Text: {document[0].text}")

Metadata: {'filename': 'sample_document1.txt', 'extension': '.txt'}
Text: In ancient Rome, the city of Rome itself was the heart of the vast Roman Empire. It was known for its grand architecture, including iconic structures like the Colosseum and the Pantheon. The Romans were skilled engineers and builders, creating an extensive network of roads, aqueducts, and bridges that connected their far-reaching territories. The Roman Republic, with its Senate and elected officials, gave rise to the famous Roman legions, which conquered vast lands and brought them under Roman rule. The Roman civilization's influence on art, law, and governance can still be seen in modern societies today.


#### HTMLNodeParser
- **HTMLNodeParser** splits raw HTML into nodes based on a list of tags and stores each node's tag name under the `tag` key of its metadata.

In [26]:
from llama_index.core.node_parser import HTMLNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path

In [27]:
reader = FlatReader()
document = reader.load_data(Path("files/others/sample.html"))

In [28]:
my_tags = ["p", "span"]  # HTML tags whose text should become nodes
html_parser = HTMLNodeParser(tags=my_tags)
nodes = html_parser.get_nodes_from_documents(document)

In [29]:
print('<span> elements:')
for node in nodes:
    if node.metadata['tag'] == 'span':
        print(node.text)

<span> elements:
Example:


In [30]:
print('<p> elements:')
for node in nodes:
    if node.metadata['tag'] == 'p':
        print(node.text)

<p> elements:
First line
Second line
Third line
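
To make the tag filtering above more concrete, here is a minimal sketch that uses only the standard library's `html.parser`. It is not how `HTMLNodeParser` is implemented; the class name `TagTextCollector` and the sample HTML are made up for illustration, since the contents of `sample.html` are not shown. The sketch assumes every tag of interest is explicitly closed.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside selected tags, one record per element."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.records = []   # finished {"tag": ..., "text": ...} records
        self._open = []     # stack of [tag, text_parts] for elements still open

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._open.append([tag, []])

    def handle_data(self, data):
        # Text is credited to the innermost open element of interest
        if self._open:
            self._open[-1][1].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            tag_name, parts = self._open.pop()
            text = "".join(parts).strip()
            if text:
                self.records.append({"tag": tag_name, "text": text})


# Hypothetical input, not the actual sample.html
html = """
<html><body>
  <h1>Title</h1>
  <span>Example:</span>
  <p>First line</p>
  <p>Second line</p>
</body></html>
"""

collector = TagTextCollector(tags=["p", "span"])
collector.feed(html)
collector.close()
records = collector.records
for record in records:
    print(record["tag"], "->", record["text"])
```

Text inside tags that were not requested, such as the `h1` here, is skipped, which mirrors the effect of passing a `tags` list to the parser.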


#### MarkdownNodeParser
- This parser processes raw Markdown text and generates nodes that reflect its structure and content.
- It creates a node for each header it encounters in the file and records the header hierarchy in the node's metadata.
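
The header-based splitting described above can be pictured with a small standalone sketch. It only recognises `#`-style headers, ignores code fences, and stores the hierarchy under a made-up `header_path` key; the function name `split_markdown` and the sample text are likewise assumptions for illustration, not the library's implementation or metadata layout.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text: str) -> list[dict]:
    """Return one node per header section, with the enclosing headers as metadata."""
    nodes = []
    path = []   # (level, title) pairs for the headers enclosing the current section
    body = []   # lines collected for the current section

    def flush():
        content = "\n".join(body).strip()
        if content:
            nodes.append({"text": content, "header_path": [title for _, title in path]})
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section before the hierarchy changes
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        body.append(line)
    flush()
    return nodes


md = """# Guide
Intro text.
## Setup
Install it.
## Usage
Run it.
# Appendix
Extra notes.
"""

nodes = split_markdown(md)
for node in nodes:
    print(node["header_path"], "|", node["text"].replace("\n", " / "))
```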

In [31]:
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path

In [32]:
reader = FlatReader()
document = reader.load_data(Path("files/others/sample.md"))

In [33]:
parser = MarkdownNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents(document)

In [34]:
for node in nodes:
    print(f"Metadata {node.metadata} \nText: {node.text}")

Metadata {'filename': 'sample.md', 'extension': '.md'} 
Text: An h1 header

Paragraphs are separated by a blank line.

2nd paragraph. *Italic*, **bold**, and `monospace`. Itemized lists
look like:

  * this one
  * that one
  * the other one

Note that --- not considering the asterisk --- the actual text
content starts at 4-columns in.

> Block quotes are
> written like so.
>
> They can span multiple paragraphs,
> if you like.

Use 3 dashes for an em-dash. Use 2 dashes for ranges (ex., "it's all
in chapters 12--14"). Three dots ... will be converted to an ellipsis.
Unicode is supported. ☺



An h2 header
------------

Here's a numbered list:

 1. first item
 2. second item
 3. third item

Note again how the actual text starts at 4 columns in (4 characters
from the left side). Here's a code sample:

    # Let me re-iterate ...
    for i in 1 .. 10 { do-something(i) }

As you probably guessed, indented 4 spaces. By the way, instead of
indenting the block, you can use delimited blocks, if

#### JSONNodeParser

This parser is specialized for structured data in JSON format. It flattens each document into lines of text, where each line holds the path of keys leading to a value, followed by the value itself. It is used in the same way as the Markdown parser:

In [35]:
from llama_index.core.node_parser import JSONNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path

In [17]:
reader = FlatReader()
document = reader.load_data(Path("files/others/sample.json"))

In [18]:
json_parser = JSONNodeParser.from_defaults()
nodes = json_parser.get_nodes_from_documents(document)

In [19]:
for node in nodes:
    print(f"Metadata {node.metadata} \nText: {node.text}")

Metadata {'filename': 'sample.json', 'extension': '.json'} 
Text: quiz sport q1 question Which one is correct team name in NBA?
quiz sport q1 options New York Bulls
quiz sport q1 options Los Angeles Kings
quiz sport q1 options Golden State Warriros
quiz sport q1 options Huston Rocket
quiz sport q1 answer Huston Rocket
quiz maths q1 question 5 + 7 = ?
quiz maths q1 options 10
quiz maths q1 options 11
quiz maths q1 options 12
quiz maths q1 options 13
quiz maths q1 answer 12
quiz maths q2 question 12 - 8 = ?
quiz maths q2 options 1
quiz maths q2 options 2
quiz maths q2 options 3
quiz maths q2 options 4
quiz maths q2 answer 4
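
The shape of this output, a key path followed by a value on each line, can be reproduced with a short recursive sketch. The function name `flatten` is hypothetical and the JSON string is a guess at one fragment of `sample.json`, reconstructed from the output above; this is an illustration of the flattening idea, not the parser's actual code.

```python
import json


def flatten(value, path=()):
    """Yield one 'key path + value' line per leaf of a nested JSON structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, path + (str(key),))
    elif isinstance(value, list):
        # List items share the path of the list itself
        for item in value:
            yield from flatten(item, path)
    else:
        yield " ".join(path + (str(value),))


# Assumed fragment, inferred from the printed output rather than read from the file
raw = """
{"quiz": {"maths": {"q1": {
    "question": "5 + 7 = ?",
    "options": ["10", "11", "12", "13"],
    "answer": "12"
}}}}
"""

lines = list(flatten(json.loads(raw)))
print("\n".join(lines))
```

Repeating the full key path on every line keeps each value self-describing, so a line still carries its context if the text is later split into smaller chunks.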
