# Document Goal
The purpose of this notebook is:
- to understand and choose a Markdown text splitter. Is the splitter splitting in a way that makes sense?
- to evaluate the contents of the text within the returned text splits. Should the text be cleaned? Are there nodes that are too large or too small?

In [2]:
# This notebook is in the eval folder.  Change to the root folder.
%cd ..
%pwd  # To verify the current working directory

c:\Users\happy\Documents\Projects\askgrowbuddy




'c:\\Users\\happy\\Documents\\Projects\\askgrowbuddy'

In [4]:
# This is a cool library for printing data in a way that is easy to read.
from rich import print


# LangChain's MarkdownTextSplitter
I'm focusing on LangChain's splitters for now. I tried LlamaIndex's Markdown splitter but found its splitting too aggressive. First I'll try LangChain's `MarkdownTextSplitter`.

The `MarkdownTextSplitter` is a `RecursiveCharacterTextSplitter` whose separators are preset to include the Markdown headers. The separators are tried in the order listed (see markdown.py):

```
elif language == Language.MARKDOWN:
    return [
        # First, try to split along Markdown headings (starting with level 2)
        "\n#{1,6} ",
        # Note the alternative syntax for headings (below) is not handled here
        # Heading level 2
        # ---------------
        # End of code block
        "```\n",
        # Horizontal lines
        "\n\\*\\*\\*+\n",
        "\n---+\n",
        "\n___+\n",
        # Note that this splitter doesn't handle horizontal lines defined
        # by *three or more* of ***, ---, or ___, but this is not handled
        "\n\n",
        "\n",
        " ",
        "",
    ]
```

As the simple example below shows:
- `chunk_size` caps the length of each chunk and `chunk_overlap` sets how much text adjacent chunks may share; both are measured in characters by default. Play around with these parameters to see how they affect the output.
- No metadata is added or maintained during the splitting process.


In [1]:
# Document Specific Splitting - Markdown
from langchain.text_splitter import MarkdownTextSplitter

splitter = MarkdownTextSplitter(chunk_size=50, chunk_overlap=5)
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""
print(splitter.create_documents([markdown_text]))

[Document(metadata={}, page_content='# Fun in California\n\n## Driving'), Document(metadata={}, page_content='Try driving on the 1 down to San Diego\n\n### Food'), Document(metadata={}, page_content="Make sure to eat a burrito while you're there"), Document(metadata={}, page_content='## Hiking\n\nGo to Yosemite')]
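
To make the separator ordering concrete, here is a minimal pure-Python sketch of the recursive idea: split on the coarsest separator first, merge neighbouring pieces while they still fit, and fall back to the next separator only for pieces that are still too long. It is a simplification, not LangChain's implementation: `recursive_split` and `sample` are names made up for illustration, the separators are plain strings rather than the full Markdown list, and chunk overlap is left out.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified recursive splitting (illustrative only, no overlap handling)."""
    # Small enough already, or nothing left to split on: keep the text as one chunk.
    if len(text) <= chunk_size or not separators:
        return [text] if text.strip() else []

    sep, finer_separators = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: merge it into the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, finer_separators, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return [chunk for chunk in chunks if chunk.strip()]


sample = (
    "# Title\n\nShort paragraph.\n\n"
    "A much longer paragraph that will not fit in one chunk."
)
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

With `chunk_size=30`, the title and the short paragraph are merged into one chunk, while the long paragraph has no finer line breaks and is therefore broken at spaces into two chunks.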


# LangChain's MarkdownHeaderTextSplitter
The `MarkdownHeaderTextSplitter` does not inherit from `RecursiveCharacterTextSplitter`. Its chunk size is not set by a `chunk_size` parameter; each chunk is defined by the header levels listed in `headers_to_split_on`. This could mean really large or small chunks, since the size depends on the user's choice of headers and on how much text sits under each one. I could imagine an approach that starts here and then uses a `RecursiveCharacterTextSplitter` to break up header-based chunks that are too large.

Play around with the `headers_to_split_on` list to see how the splitting behaves.
Notice:
- Because `strip_headers=False`, the headers are included in the text of the chunk as well as the metadata.
- The chunk size is defined by the header level. With only `#` listed, the whole sample document comes back as a single chunk.


In [3]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    # ("##", "Header 2"),
    # ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)

print(splitter.split_text(markdown_text))

[Document(metadata={'Header 1': 'Fun in California'}, page_content="# Fun in California  \n## Driving  \nTry driving on the 1 down to San Diego  \n### Food  \nMake sure to eat a burrito while you're there  \n## Hiking  \nGo to Yosemite")]
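
The two-stage idea mentioned above (split on headers first, then break up sections that are still too large) could look roughly like the sketch below. It uses plain Python rather than the LangChain classes so that it runs on its own; `split_on_headers`, `enforce_max_size`, `sample_text`, and the flat `header` metadata field are hypothetical names, and the header handling is much simpler than what `MarkdownHeaderTextSplitter` does.

```python
import re


def split_on_headers(text, max_level=1):
    """Stage 1: start a new section at every header of level <= max_level."""
    header_pattern = re.compile(r"^(#{1,%d}) (.+)$" % max_level)
    sections, title, lines = [], None, []
    for line in text.strip().splitlines():
        match = header_pattern.match(line)
        if match and lines:
            # A new qualifying header closes the section collected so far.
            sections.append({"header": title, "text": "\n".join(lines).strip()})
            lines = []
        if match:
            title = match.group(2).strip()
        lines.append(line)
    if lines:
        sections.append({"header": title, "text": "\n".join(lines).strip()})
    return sections


def enforce_max_size(sections, max_chars):
    """Stage 2: break oversized sections at blank lines, keeping the header."""
    chunks = []
    for section in sections:
        if len(section["text"]) <= max_chars:
            chunks.append(section)
            continue
        # Note: a single paragraph may still exceed max_chars in this sketch.
        for paragraph in section["text"].split("\n\n"):
            if paragraph.strip():
                chunks.append({"header": section["header"], "text": paragraph.strip()})
    return chunks


sample_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

sections = split_on_headers(sample_text, max_level=2)
chunks = enforce_max_size(sections, max_chars=60)
for chunk in chunks:
    print(chunk)
```

With `max_level=2` and `max_chars=60`, the `## Driving` section is the only one over the limit, so it is broken into its four paragraphs, each keeping `Driving` as its header.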


# Split Obsidian Notes
Based on the above, I'm focusing on the `MarkdownHeaderTextSplitter` for now. Let's load a large document and evaluate how the text splitting looks.

In [4]:
from src.ingest_service import IngestService
from src.doc_stats import DocStats
ingest_service = IngestService()
obsidian_notes_path = 'eval/obsidian_notes'
# obsidian_notes_path = r'G:\My Drive\Audios_To_Knowledge\knowledge\AskGrowBuddy\AskGrowBuddy\Knowledge\soil_test_knowlege'
docs = ingest_service.load_obsidian_notes(obsidian_notes_path)

DocStats.print_llama_index_docs_summary_stats(docs)



PydanticUserError: The `__modify_schema__` method is not supported in Pydantic v2. Use `__get_pydantic_json_schema__` instead in class `SecretStr`.

For further information visit https://errors.pydantic.dev/2.9/u/custom-json-schema

In [None]:
from src.ingest_service import IngestService
ingest_service = IngestService()
nodes = ingest_service.chunk_text(docs)


DocStats.print_llama_index_docs_summary_stats(nodes)
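
`DocStats` lives in this project's `src` folder and its internals are not shown here. As a rough stand-in, the sketch below computes the kind of length summary that helps answer the too-large or too-small question. `length_stats`, `min_chars`, and `max_chars` are hypothetical names and the thresholds are arbitrary; it works on plain strings, so for real nodes you would first pull out each node's text.

```python
from statistics import mean, median


def length_stats(texts, min_chars=50, max_chars=1000):
    """Summarize chunk lengths and count chunks outside the chosen bounds."""
    lengths = [len(text) for text in texts]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "too_small": sum(n < min_chars for n in lengths),
        "too_large": sum(n > max_chars for n in lengths),
    }


# Three made-up chunks: one tiny, one reasonable, one very large.
stats = length_stats(["a" * 10, "b" * 100, "c" * 2000])
print(stats)
```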


In [None]:
from node_view import launch_node_viewer
launch_node_viewer(nodes)