# Code Hierarchy Node Parser

The `CodeHierarchyNodeParser` splits long code files into more manageable chunks. It builds a hierarchy in which each scope body is replaced by a short comment telling the LLM which node to look up if it wants to read that body. This is called skeletonization; it is controlled by the `skeleton` option, which is `True` by default. Nodes in this hierarchy are split by scope (function, class, or method) and carry links to their children and parents so the LLM can traverse the tree.
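To make skeletonization concrete before we touch the real parser, here is a minimal sketch built on Python's standard `ast` module. It is not how `CodeHierarchyNodeParser` is implemented (that uses tree-sitter); the `skeletonize` helper is a hypothetical name, it only handles top-level scopes, and it assumes each body starts on the line after its signature.

```python
import ast
import uuid
from typing import Dict, Tuple


def skeletonize(source: str) -> Tuple[str, Dict[str, str]]:
    """Replace each top-level function/class body with a pointer comment.

    Returns the skeleton text and a mapping of node_id -> original scope text.
    Assumption: every body starts on the line after its signature.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    bodies: Dict[str, str] = {}
    scopes = [
        node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    # Edit from the bottom up so earlier line numbers stay valid.
    for scope in reversed(scopes):
        node_id = str(uuid.uuid4())
        body_start = scope.body[0].lineno  # 1-indexed first line of the body
        body_end = scope.end_lineno  # 1-indexed last line of the scope
        bodies[node_id] = "\n".join(lines[scope.lineno - 1 : body_end])
        indent = " " * scope.body[0].col_offset
        comment = f"{indent}# Code replaced for brevity. See node_id {node_id}"
        lines[body_start - 1 : body_end] = [comment]
    return "\n".join(lines), bodies


example = (
    "import os\n"
    "\n"
    "def f(x):\n"
    "    y = x + 1\n"
    "    return y\n"
    "\n"
    "class A:\n"
    "    def g(self):\n"
    "        return 1\n"
)
skeleton, bodies = skeletonize(example)
print(skeleton)
```

The signatures `def f(x):` and `class A:` stay in the skeleton, while each body moves into `bodies` under a fresh UUID, which is the same shape of output you will see from the real parser below.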

## Installation and Import

First be sure to install the necessary [tree-sitter](https://tree-sitter.github.io/tree-sitter/) libraries.

`pip install tree-sitter tree-sitter-languages`

In [2]:
from llama_index.node_parser.code_hierarchy import CodeHierarchyNodeParser
from llama_index.text_splitter import CodeSplitter
from llama_index.readers import SimpleDirectoryReader
from pathlib import Path
from pprint import pprint

In [18]:
from IPython.display import Markdown, display
def print_python(python_text):
    """This function prints python text in ipynb nicely formatted."""
    # The closing fence must sit on its own line to render correctly.
    display(Markdown("```python\n" + python_text + "\n```"))

Now, choose a directory you want to scan, and glob for all the code files you want to import.

In this case I'm going to load a single file, `code_hierarchy.py`, from the `llama_index/node_parser` directory.

In [24]:
reader = SimpleDirectoryReader(
    input_files=[Path("../../../../llama_index/node_parser/code_hierarchy.py")],
    file_metadata=lambda x: {"filepath": x},
)
nodes = reader.load_data()
len(nodes)

1

This should be the code hierarchy node parser itself. Let's have it parse itself!

In [32]:
print(f"Length of text: {len(nodes[0].text)}")
print_python(nodes[0].text[:1500]+"\n\n# ...")

Length of text: 32984


```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.extractors.metadata_extractors import BaseExtractor
from llama_index.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field


from tree_sitter import Node

from llama_index.callbacks.base import CallbackManager
from llama_index.callbacks.schema import CBEventType, EventPayload
from llama_index.schema import BaseNode, Document, NodeRelationship, TextNode
from llama_index.text_splitter import CodeSplitter
from llama_index.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    """
    Unfortunately some languages need special options for how to make a signature.

    For example, html element signatures should include their closing >, there is no
    easy way to include this using an always-exclusive system.

    However, using an always-inclusive system, python decorators don't work,
    as there isn't an easy to define terminator for decorators that is inclusive
    to their signature.
    """

    type: str = Field(description="The type string to match on.")
    inclusive: bool = Field(
        description=(
            "Whether to include the text of the node matched by this type or not."
        ),
    )


class _SignatureCaptureOptions(BaseModel):
    start_signature_types: Optional[List[_SignatureCaptureType]] = Field(

# ...
```

So what are we to do? Well, let's try splitting it. We are going to use the `CodeHierarchyNodeParser` to split the nodes into more reasonable chunks.

In [41]:
split_nodes = CodeHierarchyNodeParser(
    language="python",
    # You can further parameterize the CodeSplitter to split the code
    # into "chunks" that match your context window size using the
    # chunk_lines and max_chars parameters; here we set both explicitly.
    code_splitter=CodeSplitter(language="python", max_chars=1000, chunk_lines=10),
).get_nodes_from_documents(nodes)
print("Number of nodes after splitting:", len(split_nodes))

Number of nodes after splitting: 86


Great! So that split up our data from 1 node into 86 nodes! What's the max length of any of these nodes?

In [42]:
print(f"Longest text in nodes: {max(len(n.text) for n in split_nodes)}")

Longest text in nodes: 1160


That's much shorter than before! Let's look at a sample.

In [43]:
print_python(split_nodes[0].text)

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.extractors.metadata_extractors import BaseExtractor
from llama_index.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field


from tree_sitter import Node

from llama_index.callbacks.base import CallbackManager
from llama_index.callbacks.schema import CBEventType, EventPayload
from llama_index.schema import BaseNode, Document, NodeRelationship, TextNode
from llama_index.text_splitter import CodeSplitter
from llama_index.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 857be953-32d9-46f3-acdb-faaa20001b75


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id af1033f4-834f-4daa-9254-083cc851868c
    # Code replaced for brevity. See node_id 9076fc5d-c9a6-42a1-a8fe-1ee5ee9492c0
```

Without even needing a long printout, we can see everything this module imports in the first node (which is at the module level) and the first classes it defines. However, instead of each class body, we now see a comment:

`# Code replaced for brevity. See node_id {node_id}`

What if we go to that node_id?

In [44]:
split_nodes_by_id = {n.node_id: n for n in split_nodes}
uuid_from_text = split_nodes[0].text.splitlines()[-1].split(" ")[-1]
print("Going to print the node with UUID:", uuid_from_text)
print_python(split_nodes_by_id[uuid_from_text].text)

Going to print the node with UUID: 9076fc5d-c9a6-42a1-a8fe-1ee5ee9492c0


```python
# Code replaced for brevity. See node_id 0280460a-d023-4eb3-8a8c-d95a83011117
"""
Maps language -> Node Type -> SignatureCaptureOptions

The best way for a developer to discover these is to put a breakpoint at the TIP
tag in _chunk_node, and then create a unit test for some code, and then iterate
through the code discovering the node names.
"""
    # Code replaced for brevity. See node_id c973e5ad-6305-4189-83f5-ff10823cdfbb
```
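The lookup above grabs the last word of the last line, which only works when the placeholder comment happens to be the final line. A slightly more robust sketch is to pull every referenced id out with a regular expression. The `referenced_node_ids` helper is a hypothetical name, and the pattern assumes the placeholder wording shown in this notebook.

```python
import re
from typing import List

# Matches the placeholder comment and captures the UUID that follows it.
NODE_ID_PATTERN = re.compile(
    r"# Code replaced for brevity\. See node_id "
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def referenced_node_ids(text: str) -> List[str]:
    """Return every node_id referenced by a placeholder comment, in order."""
    return NODE_ID_PATTERN.findall(text)


skeleton_text = """\
class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 857be953-32d9-46f3-acdb-faaa20001b75


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id af1033f4-834f-4daa-9254-083cc851868c
    # Code replaced for brevity. See node_id 9076fc5d-c9a6-42a1-a8fe-1ee5ee9492c0"""

ids = referenced_node_ids(skeleton_text)
print(ids[-1])  # 9076fc5d-c9a6-42a1-a8fe-1ee5ee9492c0
```

Each id returned this way can be used as a key into `split_nodes_by_id`, exactly like `uuid_from_text`.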

We can also see the relationships on this node.

In [12]:
pprint(split_nodes_by_id[uuid_from_text].relationships)

{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='60a63ceb-3c0a-4eee-8ae1-7fdc14510edf', node_type=<ObjectType.TEXT: '1'>, metadata={'language': 'python', 'inclusive_scopes': [], 'filepath': '../../../../llama_index/node_parser/code_hierarchy.py'}, hash='9fa8b6ef0b523dabf7b1618041956504f6abd633b7d600c53250c8444a0fe906'),
 <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='60a63ceb-3c0a-4eee-8ae1-7fdc14510edf', node_type=<ObjectType.TEXT: '1'>, metadata={'language': 'python', 'inclusive_scopes': [], 'filepath': '../../../../llama_index/node_parser/code_hierarchy.py'}, hash='ae4500dea30a8e7958b560a7ddc7953adab1865453160af0870a8858b7187404'),
 <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='67278947-e13b-4dd3-bfb9-c0a319cbdd10', node_type=<ObjectType.TEXT: '1'>, metadata={'language': 'python', 'inclusive_scopes': [], 'filepath': '../../../../llama_index/node_parser/code_hierarchy.py'}, hash='02eb2776daa0b6b9e82333fc4a81327d73645eb26715b31ddcd3e1c4da597c9b'),
 <NodeR

The `NEXT` and `PREVIOUS` relationships come from the `CodeSplitter`, which is a component of the `CodeHierarchyNodeParser`. It is responsible for cutting the nodes into chunks of a bounded character length. For more information about the `CodeSplitter`, read this:

[Code Splitter](llama_index/docs/module_guides/loading/node_parsers/modules.md#CodeSplitter)

The `PARENT` and `CHILD` relationships come from the `CodeHierarchyNodeParser`, which is responsible for creating the hierarchy of nodes. Things like classes, functions, and methods are nodes in this hierarchy.

The `SOURCE` is the original file that this node came from.

For example, we can get the `NEXT` few nodes like so:

In [52]:
from llama_index.schema import NodeRelationship

node_id = uuid_from_text
for i in range(5):
    if NodeRelationship.NEXT not in split_nodes_by_id[node_id].relationships:
        print("No next node found!")
        break
    next_node_relationship_info = split_nodes_by_id[node_id].relationships[
        NodeRelationship.NEXT
    ]
    next_node = split_nodes_by_id[next_node_relationship_info.node_id]
    print_python(next_node.text)
    node_id = next_node.node_id

```python
# Code replaced for brevity. See node_id 9076fc5d-c9a6-42a1-a8fe-1ee5ee9492c0
_DEFAULT_SIGNATURE_IDENTIFIERS: Dict[str, Dict[str, _SignatureCaptureOptions]] =
    # Code replaced for brevity. See node_id 95dcf765-200b-4ac4-8efc-493f02153460
```

```python
# Code replaced for brevity. See node_id c973e5ad-6305-4189-83f5-ff10823cdfbb
{
    "python": {
        "function_definition": _SignatureCaptureOptions(
            end_signature_types=[_SignatureCaptureType(type="block", inclusive=False)],
            name_identifier="identifier",
        ),
        "class_definition": _SignatureCaptureOptions(
            end_signature_types=[_SignatureCaptureType(type="block", inclusive=False)],
            name_identifier="identifier",
        ),
    },
    "html": {
        "element": _SignatureCaptureOptions(
            start_signature_types=[_SignatureCaptureType(type="<", inclusive=True)],
            end_signature_types=[_SignatureCaptureType(type=">", inclusive=True)],
            name_identifier="tag_name",
        )
    },
    # Code replaced for brevity. See node_id 54cf6a02-6771-4869-9273-5b87b991da04
```

```python
# Code replaced for brevity. See node_id 95dcf765-200b-4ac4-8efc-493f02153460
"cpp": {
        "class_specifier": _SignatureCaptureOptions(
            end_signature_types=[_SignatureCaptureType(type="{", inclusive=False)],
            name_identifier="type_identifier",
        ),
        "function_definition": _SignatureCaptureOptions(
            end_signature_types=[_SignatureCaptureType(type="{", inclusive=False)],
            name_identifier="function_declarator",
        ),
    },
    # Code replaced for brevity. See node_id 3c34fdb4-98e9-4db8-ba54-feeb05fed250
```

```python
# Code replaced for brevity. See node_id 54cf6a02-6771-4869-9273-5b87b991da04
"typescript":
    # Code replaced for brevity. See node_id 1ab1abea-2410-4dd1-95a6-e869bea7a568
```

```python
# Code replaced for brevity. See node_id 3c34fdb4-98e9-4db8-ba54-feeb05fed250
{
        "interface_declaration": _SignatureCaptureOptions(
            end_signature_types=[_SignatureCaptureType(type="{", inclusive=False)],
            name_identifier="type_identifier",
        ),
        "lexical_declaration": _SignatureCaptureOptions(
            end_signature_types=[_SignatureCaptureType(type="{", inclusive=False)],
            name_identifier="identifier",
        ),
        "function_declaration": _SignatureCaptureOptions(
            end_signature_types=[_SignatureCaptureType(type="{", inclusive=False)],
            name_identifier="identifier",
        ),
        "class_declaration": _SignatureCaptureOptions(
            end_signature_types=[_SignatureCaptureType(type="{", inclusive=False)],
            name_identifier="type_identifier",
        ),
        "method_definition": _SignatureCaptureOptions(
            end_signature_types=[_SignatureCaptureType(type="{", inclusive=False)],
            name_identifier="property_identifier",
        ),
    }
    # Code replaced for brevity. See node_id 60b230cb-0085-473b-9099-cc185ad7d9a9
```

We can also get the children of this node.

In [59]:
from llama_index.schema import NodeRelationship


def recur_children(node_id):
    """Print the first child that has children of its own, then descend into it."""
    child_infos = split_nodes_by_id[node_id].relationships[NodeRelationship.CHILD]
    for child_info in child_infos:
        child = split_nodes_by_id[child_info.node_id]
        if NodeRelationship.CHILD in child.relationships:
            print_python(child.text)
            recur_children(child.node_id)
            break

recur_children(uuid_from_text)

```python
class _SignatureCaptureType(BaseModel):
    """
    Unfortunately some languages need special options for how to make a signature.

    For example, html element signatures should include their closing >, there is no
    easy way to include this using an always-exclusive system.

    However, using an always-inclusive system, python decorators don't work,
    as there isn't an easy to define terminator for decorators that is inclusive
    to their signature.
    """

    type: str = Field(description="The type string to match on.")
    inclusive: bool = Field(
        description=(
            "Whether to include the text of the node matched by this type or not."
        ),
    )
```
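The cell above only follows the first child that has children of its own. If you want every descendant, a depth-first walk is the usual approach. The sketch below uses a hypothetical `ToyNode` dataclass as a stand-in so it runs without llama_index; with real nodes you would read the child ids from `relationships[NodeRelationship.CHILD]` instead.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class ToyNode:
    """Hypothetical stand-in for a parsed node; not a llama_index class."""

    node_id: str
    text: str
    parent_id: Optional[str] = None
    child_ids: List[str] = field(default_factory=list)


def walk_descendants(
    nodes_by_id: Dict[str, ToyNode], node_id: str, depth: int = 0
) -> Iterator[Tuple[int, ToyNode]]:
    """Yield (depth, node) for every descendant of node_id, depth-first."""
    for child_id in nodes_by_id[node_id].child_ids:
        yield depth, nodes_by_id[child_id]
        yield from walk_descendants(nodes_by_id, child_id, depth + 1)


toy = {
    "module": ToyNode("module", "import os", child_ids=["ClassA", "func_b"]),
    "ClassA": ToyNode("ClassA", "class ClassA: ...", "module", ["method"]),
    "method": ToyNode("method", "def method(self): ...", "ClassA"),
    "func_b": ToyNode("func_b", "def func_b(): ...", "module"),
}

for depth, node in walk_descendants(toy, "module"):
    print("  " * depth + node.node_id)
```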

## Indices

Let's explore the use of this node parser in an index. Any index that allows search by keyword will work, since that lets us look up a node by its UUID or by any scope name.

Let's use a keyword index to facilitate this kind of operation. We have created a `CodeHierarchyKeywordTableIndex`, which will allow us to search for nodes by their UUID or by their scope name.
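To show the idea behind keyword lookup, here is a toy keyword table in plain Python. It is only an illustration, not the implementation of `CodeHierarchyKeywordTableIndex`; the chunk dictionaries and the `build_keyword_table` and `lookup` helpers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_keyword_table(chunks: List[dict]) -> Dict[str, Set[str]]:
    """Map each node_id and each scope name to the ids of matching chunks."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for chunk in chunks:
        table[chunk["node_id"]].add(chunk["node_id"])
        for scope_name in chunk["scopes"]:
            table[scope_name].add(chunk["node_id"])
    return dict(table)


def lookup(table: Dict[str, Set[str]], keyword: str) -> List[str]:
    """Return the sorted ids of every chunk indexed under keyword."""
    return sorted(table.get(keyword, set()))


# Toy chunks: two pieces of one class plus one standalone function.
chunks = [
    {"node_id": "id-1", "scopes": ["CodeHierarchyNodeParser"]},
    {"node_id": "id-2", "scopes": ["CodeHierarchyNodeParser", "_chunk_node"]},
    {"node_id": "id-3", "scopes": ["print_python"]},
]
table = build_keyword_table(chunks)
print(lookup(table, "CodeHierarchyNodeParser"))  # ['id-1', 'id-2']
print(lookup(table, "id-3"))  # ['id-3']
```

Searching by scope name returns every chunk inside that scope, while searching by id returns exactly one chunk, which is the behavior we rely on in the cells that follow.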

In [67]:
from llama_index.indices.code_hierarchy import (
    CodeHierarchyKeywordTableIndex,
)
from llama_index.indices.keyword_table.base import KeywordTableRetrieverMode

idx = CodeHierarchyKeywordTableIndex(
    nodes=split_nodes,
)
retriever = idx.as_retriever(retriever_mode=KeywordTableRetrieverMode.SIMPLE)

ValueError: 
******
Could not load OpenAI model. If you intended to use OpenAI, please check your OPENAI_API_KEY.
Original error:
No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys

To disable the LLM entirely, set llm=None.
******

Now we can get the same code as before.

In [None]:
pprint(retriever.retrieve(uuid_from_text)[0].node.get_content())

('class SentenceWindowNodeParser(NodeParser):\n'
 '    # Code replaced for brevity. See node_id '
 'fe68e6de-3851-4c77-8acb-445fe45fee6c')


Now what about getting the rest of the code for this scope?

In [None]:
pprint(
    [
        n.node.get_content()
        for n in retriever.retrieve("SentenceWindowNodeParser")
    ]
)

['# Code replaced for brevity. See node_id '
 'd8a4b50c-c0d9-4338-870a-1ab058d1e3c8\n'
 '@classmethod\n'
 '    def from_defaults(\n'
 '        cls,\n'
 '        sentence_splitter: Optional[Callable[[str], List[str]]] = None,\n'
 '        window_size: int = DEFAULT_WINDOW_SIZE,\n'
 '        window_metadata_key: str = DEFAULT_WINDOW_METADATA_KEY,\n'
 '        original_text_metadata_key: str = DEFAULT_OG_TEXT_METADATA_KEY,\n'
 '        include_metadata: bool = True,\n'
 '        include_prev_next_rel: bool = True,\n'
 '        callback_manager: Optional[CallbackManager] = None,\n'
 '        metadata_extractor: Optional[MetadataExtractor] = None,\n'
 '    ) -> "SentenceWindowNodeParser":\n'
 '        # Code replaced for brevity. See node_id '
 'ecb65036-88bb-4008-806e-b9c629f7d038\n'
 '\n'
 '    def get_nodes_from_documents(\n'
 '        self,\n'
 '        documents: Sequence[Document],\n'
 '        show_progress: bool = False,\n'
 '    ) -> List[BaseNode]:\n'
 '        # Code replaced for

The only difficulty is that these are out of order. The `CodeSplitter` controls how large each chunk is and how much the chunks overlap, so you can adjust its settings if the splits are confusing. However, each chunk carries the UUIDs of its neighboring splits in its own text, so an LLM should be able to recursively search these documents to put them in some kind of order.
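As a rough sketch of that recursive search, the function below follows the placeholder comments depth-first and records the order in which nodes are first reached. The `reading_order` name and the toy texts are hypothetical; the only thing taken from the parser is the wording of the placeholder comment.

```python
import re
from typing import Dict, List, Optional, Set

# Captures the id that follows each placeholder comment.
REFERENCE = re.compile(r"# Code replaced for brevity\. See node_id (\S+)")


def reading_order(
    node_id: str, texts_by_id: Dict[str, str], seen: Optional[Set[str]] = None
) -> List[str]:
    """Return node ids in the order a depth-first reader would reach them."""
    seen = set() if seen is None else seen
    # Skip ids we already visited (back-references) or cannot resolve.
    if node_id in seen or node_id not in texts_by_id:
        return []
    seen.add(node_id)
    order = [node_id]
    for referenced_id in REFERENCE.findall(texts_by_id[node_id]):
        order.extend(reading_order(referenced_id, texts_by_id, seen))
    return order


toy_texts = {
    "root": (
        "import os\n"
        "class A:\n"
        "    # Code replaced for brevity. See node_id a\n"
        "# Code replaced for brevity. See node_id tail"
    ),
    "a": (
        "class A:\n"
        "    def f(self):\n"
        "        # Code replaced for brevity. See node_id f"
    ),
    "f": "def f(self):\n    return 1",
    "tail": "# Code replaced for brevity. See node_id root\nx = 1",
}
print(reading_order("root", toy_texts))  # ['root', 'a', 'f', 'tail']
```

The `seen` set matters because chunks produced by the splitter start with a comment pointing back at the previous chunk; without it the walk would loop forever.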

## Code Hierarchy

The namesake of this node parser is the code hierarchy: a tree of scope names that you can use to search the code.

In [None]:
print(CodeHierarchyNodeParser.get_code_hierarchy_from_nodes(split_nodes))

- ..
  - ..
    - ..
      - ..
        - llama_index
          - node_parser
            - sentence_window.py
              - SentenceWindowNodeParser
                - __init__
                - text_splitter
                - from_defaults
                - get_nodes_from_documents
                - build_window_nodes_from_documents
            - code_hierarchy.py
              - _SignatureCaptureType
              - _SignatureCaptureOptions
              - _ScopeMethod
              - _CommentOptions
              - _ScopeItem
              - _ChunkNodeOutput
              - CodeHierarchyNodeParser
                - class_name
                - __init__
                - _get_node_name
                  - recur
                - _get_node_signature
                  - find_start
                  - find_end
                - _chunk_node
                - get_code_hierarchy_from_nodes
                  - get_subdict
                  - recur_inclusive_scope
                  - dict_
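A tree like this can be built from nothing more than the list of scope names leading to each node. Here is a minimal sketch, assuming each node is described by a path of names from file to innermost scope; `build_tree` and `render` are hypothetical helpers, not part of `CodeHierarchyNodeParser`.

```python
from typing import Dict, List


def build_tree(paths: List[List[str]]) -> Dict[str, dict]:
    """Merge scope paths into one nested dict, keeping first-seen order."""
    tree: Dict[str, dict] = {}
    for path in paths:
        level = tree
        for name in path:
            level = level.setdefault(name, {})
    return tree


def render(tree: Dict[str, dict], indent: int = 0) -> List[str]:
    """Render the nested dict as markdown list lines, two spaces per level."""
    lines: List[str] = []
    for name, subtree in tree.items():
        lines.append("  " * indent + f"- {name}")
        lines.extend(render(subtree, indent + 1))
    return lines


paths = [
    ["code_hierarchy.py", "_ScopeItem"],
    ["code_hierarchy.py", "CodeHierarchyNodeParser", "__init__"],
    ["code_hierarchy.py", "CodeHierarchyNodeParser", "_get_node_name", "recur"],
]
print("\n".join(render(build_tree(paths))))
```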