# LlamaIndex Bottoms-Up Development - Documents and Nodes
In order to answer questions about the LlamaIndex docs, we first need to load them!

Most of our documentation is in markdown format. To keep the scope manageable, we will ONLY worry about markdown files for now.

When parsing these files, there are a few things we might want to keep track of:

- Current header (and header hierarchy!)
- Code blocks
- Text
- Source file names

While LlamaIndex DOES HAVE a built-in markdown loader, we can write our own to fit our requirements exactly! Loaders are not magic -- they just read files and create documents. So building our own is easy!
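
To make "loaders are not magic" concrete, here is a minimal sketch of a header-aware markdown splitter in plain Python. It is not the loader used in this notebook and not part of LlamaIndex: `Section`, `split_markdown`, and the metadata keys `header_path` and `has_code` are hypothetical names chosen for illustration. It only shows how the items listed above (header hierarchy, code blocks, text, source file name) could be tracked while reading a file.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

FENCE = "`" * 3  # code-fence marker, built this way to avoid a literal fence in this example
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Section:
    """One chunk of a markdown file plus the metadata we want to keep (hypothetical type)."""
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def split_markdown(markdown: str, file_name: str) -> List[Section]:
    """Split markdown into one section per header, recording the header hierarchy."""
    sections: List[Section] = []
    header_stack: List[Tuple[int, str]] = []  # (level, title) from outermost to innermost
    buffer: List[str] = []
    in_code_block = False

    def flush() -> None:
        # Emit the buffered lines as a section under the current header path
        text = "\n".join(buffer).strip()
        if text:
            sections.append(Section(
                text=text,
                metadata={
                    "file_name": file_name,
                    "header_path": "/".join(title for _, title in header_stack),
                    "has_code": FENCE in text,
                },
            ))
        buffer.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code_block = not in_code_block
        # Lines starting with '#' inside a code block are comments, not headers
        match = None if in_code_block else HEADER_RE.match(line)
        if match:
            flush()
            level, title = len(match.group(1)), match.group(2).strip()
            # Leaving a subsection: drop headers at the same or a deeper level
            while header_stack and header_stack[-1][0] >= level:
                header_stack.pop()
            header_stack.append((level, title))
        else:
            buffer.append(line)
    flush()
    return sections


sample = "\n".join([
    "# Guide", "Intro text.",
    "## Install", "Run this:", FENCE, "# not a header", "pip install x", FENCE,
    "## Usage", "Call it.",
])
for section in split_markdown(sample, "guide.md"):
    print(section.metadata["header_path"], section.metadata["has_code"])
```

Keeping the header path as metadata rather than inside the text means it can later be shown to, or hidden from, the LLM and the embedding model independently of the content.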

We have provided an implementation of a custom markdown loader in the source code. Let's test it out to see how it works!

In [1]:
import os
import sys
sys.path.append(os.path.join(os.getcwd(), '..')) # Allows Python to import modules from the parent directory

In [2]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import MarkdownNodeParser

def load_markdown_docs(filepath):
    """Load the Markdown files under a directory (recursively) and split them into nodes."""
    markdown_parser = MarkdownNodeParser()
    
    try:
        loader = SimpleDirectoryReader(
            input_dir=filepath,
            required_exts=[".md"],
            recursive=True
        )

        documents = loader.load_data()
        # Split every loaded document into nodes along its markdown structure
        return markdown_parser.get_nodes_from_documents(documents)
    except Exception as e:
        print(f"Error loading documents from {filepath}: {str(e)}")
        return []

# Dictionary of document categories and their corresponding paths
doc_paths = {
    "getting_started": "../docs/getting_started",
    "community": "../docs/community",
    "data": "../docs/core_modules/data_modules",
    "agent": "../docs/core_modules/agent_modules",
    "model": "../docs/core_modules/model_modules",
    "query": "../docs/core_modules/query_modules",
    "supporting": "../docs/core_modules/supporting_modules",
    "tutorials": "../docs/end_to_end_tutorials",
    "contributing": "../docs/development"
}

# Load the Markdown files for each category and parse them into nodes.
# `doc_collections` maps a category name (e.g. "getting_started") to the list of nodes loaded from its path.
doc_collections = {category: load_markdown_docs(path) for category, path in doc_paths.items()}

# You can now access the documents like this:
# getting_started_docs = doc_collections["getting_started"]
# community_docs = doc_collections["community"]
# ... and so on

In [3]:
getting_started_docs = doc_collections["getting_started"]

In [4]:
# Make our printing look nice
from llama_index.core.schema import MetadataMode

In [5]:
print(doc_collections["agent"][5].get_content(metadata_mode=MetadataMode.ALL))

file_path: /Users/enricobusto/Documents/SOFTWARE/LlamaIndex/llama_docs_bot/2_documents_nodes/../docs/core_modules/agent_modules/agents/root.md
file_name: root.md
file_size: 2340
creation_date: 2024-07-23
last_modified_date: 2024-07-23

Reasoning Loop
The reasoning loop depends on the type of agent. We have support for the following agents: 
- OpenAI Function agent (built on top of the OpenAI Function API)
- a ReAct agent (which works across any chat/text completion endpoint).


In [6]:
print(doc_collections["agent"][5].metadata)

{'file_path': '/Users/enricobusto/Documents/SOFTWARE/LlamaIndex/llama_docs_bot/2_documents_nodes/../docs/core_modules/agent_modules/agents/root.md', 'file_name': 'root.md', 'file_size': 2340, 'creation_date': '2024-07-23', 'last_modified_date': '2024-07-23'}


Not bad! We can see that we have metadata, as well as nicely formatted content.

But we can improve the formatting even further! Custom templates give the LLM and embedding models a clearer picture of what they are reading.
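
To make the role of each template concrete, here is a plain-Python sketch of how a text template, a metadata template, and a metadata separator could combine into the final string. `render_content` is a hypothetical stand-in written for illustration, not a LlamaIndex function, and the metadata values are a shortened sample.

```python
def render_content(text, metadata, text_template, metadata_template, metadata_separator):
    """Format each metadata pair, join the pairs, then slot them in next to the content."""
    metadata_str = metadata_separator.join(
        metadata_template.format(key=key, value=str(value))
        for key, value in metadata.items()
    ).strip()
    if not metadata_str:
        # Nothing to show: fall back to the bare content
        return text
    return text_template.format(metadata_str=metadata_str, content=text).strip()


rendered = render_content(
    text="Reasoning Loop",
    metadata={"file_name": "root.md", "file_size": 2340},
    text_template="Content Metadata:\n{metadata_str}\n\nContent:\n{content}",
    metadata_template="{key}: {value},",
    metadata_separator=" ",
)
print(rendered)
```

With these sample values the metadata collapses onto a single labelled line, which keeps it visually separate from the content that follows.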

In [7]:
text_template = "Content Metadata:\n{metadata_str}\n\nContent:\n{content}"
metadata_template = "{key}: {value},"
metadata_separator = " "

# Assuming doc_collections is already defined and populated
agent_docs = doc_collections.get("agent", [])

for doc in agent_docs:
    # Templates are plain attributes on each node, so assign them directly
    doc.text_template = text_template
    doc.metadata_template = metadata_template
    # Some llama_index releases spell this field `metadata_seperator`
    if hasattr(doc, "metadata_separator"):
        doc.metadata_separator = metadata_separator
    else:
        doc.metadata_seperator = metadata_separator

In [8]:
print(doc_collections["agent"][5].get_content(metadata_mode=MetadataMode.ALL))

Content Metadata:
file_path: /Users/enricobusto/Documents/SOFTWARE/LlamaIndex/llama_docs_bot/2_documents_nodes/../docs/core_modules/agent_modules/agents/root.md, file_name: root.md, file_size: 2340, creation_date: 2024-07-23, last_modified_date: 2024-07-23,

Content:
Reasoning Loop
The reasoning loop depends on the type of agent. We have support for the following agents: 
- OpenAI Function agent (built on top of the OpenAI Function API)
- a ReAct agent (which works across any chat/text completion endpoint).


### Advanced Customization
Going even further with metadata, we can also customize which metadata fields will be seen by both the embedding model and LLM.
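
LlamaIndex nodes carry two lists for this purpose, `excluded_llm_metadata_keys` and `excluded_embed_metadata_keys`, and `MetadataMode.LLM` or `MetadataMode.EMBED` selects which list is applied when the content is rendered. The sketch below mimics that filtering in plain Python so the effect is easy to see. `visible_metadata` is a hypothetical helper, and the choice of which keys to hide is only an example, not a recommendation from the library.

```python
def visible_metadata(metadata, excluded_keys):
    """Return only the metadata entries that are not excluded for a given consumer."""
    return {key: value for key, value in metadata.items() if key not in excluded_keys}


metadata = {
    "file_name": "root.md",
    "file_size": 2340,
    "creation_date": "2024-07-23",
    "last_modified_date": "2024-07-23",
}

# Example choice: the LLM sees the file name and dates, the embedding model only the file name
excluded_llm_metadata_keys = ["file_size"]
excluded_embed_metadata_keys = ["file_size", "creation_date", "last_modified_date"]

llm_view = visible_metadata(metadata, excluded_llm_metadata_keys)
embed_view = visible_metadata(metadata, excluded_embed_metadata_keys)
print("LLM sees:", list(llm_view))
print("Embedding model sees:", list(embed_view))
```

On real nodes, the equivalent step would be assigning these lists to each node and then comparing `get_content(metadata_mode=MetadataMode.LLM)` with `get_content(metadata_mode=MetadataMode.EMBED)`.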

# Conclusion
In this notebook, we covered how to use a custom data loader, as well as how to customize the text representations of your data when including metadata for both LLMs and embedding models.