In [None]:
%load_ext autoreload
%autoreload 2

# Code Hierarchy Node Parser

The `CodeHierarchyNodeParser` splits long code files into more manageable chunks. It builds a hierarchy in which each scope body is replaced by a short comment telling the LLM which node to search for if it wants to read that body. This is called skeletonization, and it is controlled by the `skeleton` option, which defaults to `True`. Nodes in the hierarchy are split by scope (function, class, or method) and carry links to their parents and children so the LLM can traverse the tree.
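To make the idea of skeletonization concrete, here is a minimal sketch built on Python's standard `ast` module. It is not the parser's implementation (which is based on tree-sitter and handles nested scopes and multiple languages); the `skeletonize` function and its return shape are hypothetical, and it assumes each body starts on its own line.

```python
import ast
import uuid
from typing import Dict, List, Tuple

SCOPE_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def skeletonize(source: str) -> Tuple[str, Dict[str, str]]:
    """Toy skeletonizer for top-level scopes only (illustrative, not the library's code).

    Returns the skeleton text and a mapping of node_id -> full original scope text.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    bodies: Dict[str, str] = {}
    out: List[str] = []
    cursor = 0  # index of the next source line that has not been copied yet

    for node in tree.body:
        if not isinstance(node, SCOPE_TYPES):
            continue
        body_start = node.body[0].lineno - 1  # 0-based index of the first body line
        body_end = node.end_lineno  # 0-based exclusive end of the scope
        if body_start < node.lineno:
            # Body sits on the signature line (e.g. "def f(): return 1"); leave as is.
            continue
        node_id = str(uuid.uuid4())
        bodies[node_id] = "\n".join(lines[node.lineno - 1 : body_end])
        out.extend(lines[cursor:body_start])  # everything up to and including the signature
        out.append(f"    # Code replaced for brevity. See node_id {node_id}")
        cursor = body_end

    out.extend(lines[cursor:])
    return "\n".join(out), bodies


sample = '''import os

def greet(name):
    """Say hello."""
    return f"Hello, {name}"

class Greeter:
    def hi(self):
        return greet("you")
'''

skeleton, bodies = skeletonize(sample)
print(skeleton)
```

The skeleton keeps imports and signatures, while each removed body stays retrievable through its `node_id`, which is the same contract the real parser offers to the LLM.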

## Installation and Import

First be sure to install the necessary [tree-sitter](https://tree-sitter.github.io/tree-sitter/) libraries.

In [None]:
!pip install tree-sitter tree-sitter-languages python-dotenv


In [None]:
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core import SimpleDirectoryReader
from llama_index.packs.code_hierarchy import CodeHierarchyNodeParser
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

True

In [None]:
from IPython.display import Markdown, display


def print_python(python_text):
    """This function prints python text in ipynb nicely formatted."""
    # Put the closing fence on its own line so the Markdown block is well formed.
    display(Markdown("```python\n" + python_text + "\n```"))

# Prepare your Data

Choose a directory you want to scan, and glob for all the code files you want to import.

In this case I'm going to load a single file, `code_hierarchy.py`, from the `llama_index/packs/code_hierarchy` directory.

In [None]:
reader = SimpleDirectoryReader(
    input_files=[Path("../llama_index/packs/code_hierarchy/code_hierarchy.py")],
    file_metadata=lambda x: {"filepath": x},
)
nodes = reader.load_data()

This should be the code hierarchy node parser itself. Let's have it parse itself!

In [None]:
print(f"Length of text: {len(nodes[0].text)}")
print_python(nodes[0].text[:1500] + "\n\n# ...")

Length of text: 33427


```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.core.extractors.metadata_extractors import BaseExtractor
from llama_index.core.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field

from tree_sitter import Node

from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.schema import BaseNode, NodeRelationship, TextNode
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    """
    Unfortunately some languages need special options for how to make a signature.

    For example, html element signatures should include their closing >, there is no
    easy way to include this using an always-exclusive system.

    However, using an always-inclusive system, python decorators don't work,
    as there isn't an easy to define terminator for decorators that is inclusive
    to their signature.
    """

    type: str = Field(description="The type string to match on.")
    inclusive: bool = Field(
        description=(
            "Whether to include the text of the node matched by this type or not."
        ),
    )


class _SignatureCaptureOptions(BaseModel):
    """
    Options for capturing the signature of a node.
    """

    start_signature_types: Optional[List[_SignatureCa

# ...
```

This is way too long to fit into the context of our LLM. So what do we do? We split it, using the `CodeHierarchyNodeParser` to break the nodes into more reasonable chunks.

In [None]:
split_nodes = CodeHierarchyNodeParser(
    language="python",
    # You can further parameterize the CodeSplitter to split the code
    # into "chunks" that match your context window size using the
    # chunk_lines and max_chars parameters; here we set them explicitly
    code_splitter=CodeSplitter(language="python", max_chars=1000, chunk_lines=10),
).get_nodes_from_documents(nodes)
print("Number of nodes after splitting:", len(split_nodes))

Number of nodes after splitting: 90


Great! That split our data from 1 node into quite a few nodes. What's the max length of any of these nodes?

In [None]:
print(f"Longest text in nodes: {max(len(n.text) for n in split_nodes)}")

Longest text in nodes: 1152


That's much shorter than before! Let's look at a sample.

In [None]:
print_python(split_nodes[0].text)

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.core.extractors.metadata_extractors import BaseExtractor
from llama_index.core.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field

from tree_sitter import Node

from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.schema import BaseNode, NodeRelationship, TextNode
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 76793a53-5410-4589-901f-df08e44de8c5


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id 50d35182-2106-4328-ae9e-c076b096b5c9
# Code replaced for brevity. See node_id 7d5b7f71-a7fd-4383-9617-e17a62d9b82a
```

Without even needing a long printout we can see everything this module imported in the first document (which is at the module level) and some classes it defines.

We also see that it has put comments in place of code that was removed to make the text size more reasonable.
These can appear at the beginning or end of a chunk, or at a new scope level, like a class or function declaration.

`# Code replaced for brevity. See node_id {node_id}`
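Because the replacement comment has a fixed shape, the referenced ids can be pulled out of any chunk with a regular expression. The sketch below is illustrative only: the `referenced_node_ids` helper is hypothetical, and it assumes ids are standard 36-character UUID strings like the ones printed above.

```python
import re
from typing import List

# Matches the replacement comment and captures the 36-character UUID that follows.
NODE_ID_COMMENT = re.compile(
    r"Code replaced for brevity\. See node_id ([0-9a-fA-F-]{36})"
)


def referenced_node_ids(text: str) -> List[str]:
    """Return every node_id mentioned in a chunk, in order of appearance."""
    return NODE_ID_COMMENT.findall(text)


chunk = """class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 76793a53-5410-4589-901f-df08e44de8c5


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id 50d35182-2106-4328-ae9e-c076b096b5c9
"""

print(referenced_node_ids(chunk))
```

This is a little more robust than splitting the last line on spaces, since it finds references anywhere in the chunk.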

# Code Hierarchy

These scopes can be listed by the `CodeHierarchyNodeParser`, giving a "repo map" of sorts.
This is where the node parser gets its name: it creates a tree of scope names that can be used to search the code.
Put this in your context to give the LLM a default search hierarchy.

When an LLM uses the `CodeHierarchyKeywordQueryEngine` (shown later) as a tool, instruct it as follows:

```
"Search the tool by any element in this list, or any uuid found in the resulting code, to get more information about that element."
```

Then append this to your context:

In [None]:
print(CodeHierarchyNodeParser.get_code_hierarchy_from_nodes(split_nodes))

(defaultdict(<class 'dict'>, {'..': defaultdict(<class 'dict'>, {'llama_index': defaultdict(<class 'dict'>, {'packs': defaultdict(<class 'dict'>, {'code_hierarchy': defaultdict(<class 'dict'>, {'code_hierarchy': defaultdict(<class 'dict'>, {'_SignatureCaptureType': defaultdict(<class 'dict'>, {}), '_SignatureCaptureOptions': defaultdict(<class 'dict'>, {}), '_ScopeMethod': defaultdict(<class 'dict'>, {}), '_CommentOptions': defaultdict(<class 'dict'>, {}), '_ScopeItem': defaultdict(<class 'dict'>, {}), '_ChunkNodeOutput': defaultdict(<class 'dict'>, {}), 'CodeHierarchyNodeParser': defaultdict(<class 'dict'>, {'class_name': defaultdict(<class 'dict'>, {}), '__init__': defaultdict(<class 'dict'>, {}), '_get_node_name': defaultdict(<class 'dict'>, {'recur': defaultdict(<class 'dict'>, {})}), '_get_node_signature': defaultdict(<class 'dict'>, {'find_start': defaultdict(<class 'dict'>, {}), 'find_end': defaultdict(<class 'dict'>, {})}), '_chunk_node': defaultdict(<class 'dict'>, {}), 'get_c
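The nested dictionary printed above is hard to read in raw form. As a rough illustration of how such a structure can be turned into an indented list for a prompt, here is a small sketch. The `render_hierarchy` function is hypothetical and is not the parser's own rendering code; the sample dictionary reuses a few scope names from the output above.

```python
from typing import Dict, List


def render_hierarchy(tree: Dict[str, dict], depth: int = 0) -> str:
    """Render a nested dict of scope names as an indented Markdown list."""
    lines: List[str] = []
    for name, children in tree.items():
        lines.append("  " * depth + f"- {name}")
        if children:
            # Recurse one level deeper for nested scopes.
            lines.append(render_hierarchy(children, depth + 1))
    return "\n".join(lines)


hierarchy = {
    "code_hierarchy": {
        "_SignatureCaptureType": {},
        "CodeHierarchyNodeParser": {
            "_get_node_name": {"recur": {}},
        },
    }
}

print(render_hierarchy(hierarchy))
```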

# Exploration by the Programmer

To understand what is going on under the hood, let's follow one of these `node_id` references and look at the node it points to.

In [None]:
split_nodes_by_id = {n.node_id: n for n in split_nodes}
uuid_from_text = split_nodes[9].text.splitlines()[-1].split(" ")[-1]
print("Going to print the node with UUID:", uuid_from_text)
print_python(split_nodes_by_id[uuid_from_text].text)

Going to print the node with UUID: 5b9b71eb-ec2f-4b39-a4e1-cecda086896e


```python
class CodeHierarchyNodeParser(NodeParser):
# Code replaced for brevity. See node_id 37cab2a7-40d6-47d2-8043-a0a9a9afb697
```

This is the next split in the file. Where a chunk continues from or into a neighboring chunk, a comment naming the previous node is prepended and a comment naming the next node is appended.

We can also see the relationships on this node programmatically.

In [None]:
split_nodes_by_id[uuid_from_text].relationships

{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='5b9b71eb-ec2f-4b39-a4e1-cecda086896e', node_type=<ObjectType.TEXT: '1'>, metadata={'language': 'python', 'inclusive_scopes': [{'name': 'CodeHierarchyNodeParser', 'type': 'class_definition', 'signature': 'class CodeHierarchyNodeParser(NodeParser):'}], 'start_byte': 6293, 'end_byte': 33426, 'filepath': '../llama_index/packs/code_hierarchy/code_hierarchy.py'}, hash='0e5e84e11e403b09f986b97b1ffcd9ba7994c8bd3bf880037a0e48d513fabaf7'),
 <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='37cab2a7-40d6-47d2-8043-a0a9a9afb697', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='33db2c2bc76ae543d50770ecfbbe61e7f5de9a64ab84e72788e4a4e711f1cfe5'),
 <NodeRelationship.CHILD: '5'>: [RelatedNodeInfo(node_id='1f23f5f9-26c6-487f-ac48-5c0914ec243c', node_type=<ObjectType.TEXT: '1'>, metadata={'inclusive_scopes': [{'name': 'CodeHierarchyNodeParser', 'type': 'class_definition', 'signature': 'class CodeHierarchyNodeParser(NodeParser):'}, {'name

The `NEXT` and `PREVIOUS` relationships come from the `CodeSplitter`, a component of the `CodeHierarchyNodeParser` that cuts nodes into chunks of a bounded character length. For more information about the `CodeSplitter`, read this:

[Code Splitter](https://docs.llamaindex.ai/en/latest/api/llama_index.node_parser.CodeSplitter.html)

The `PARENT` and `CHILD` relationships come from the `CodeHierarchyNodeParser` which is responsible for creating the hierarchy of nodes. Things like classes, functions, and methods are nodes in this hierarchy.

The `SOURCE` is the original file that this node came from.

In [None]:
from llama_index.core.schema import NodeRelationship

node_id = uuid_from_text
if NodeRelationship.NEXT not in split_nodes_by_id[node_id].relationships:
    print("No next node found!")
else:
    next_node_relationship_info = split_nodes_by_id[node_id].relationships[
        NodeRelationship.NEXT
    ]
    next_node = split_nodes_by_id[next_node_relationship_info.node_id]
    print_python(next_node.text)

```python
# Code replaced for brevity. See node_id 5b9b71eb-ec2f-4b39-a4e1-cecda086896e
"""Split code using a AST parser.

    Add metadata about the scope of the code block and relationships between
    code blocks.
    """

    @classmethod
    def class_name(cls) -> str:
        # Code replaced for brevity. See node_id 1f23f5f9-26c6-487f-ac48-5c0914ec243c

    language: str = Field(
        description="The programming language of the code being split."
    )
    signature_identifiers: Dict[str, _SignatureCaptureOptions] = Field(
        description=(
            "A dictionary mapping the type of a split mapped to the first and last type"
            " of itschildren which identify its signature."
        )
    )
    min_characters: int = Field(
        default=80,
        description=(
            "Minimum number of characters per chunk.Defaults to 80 because that's about"
            " how long a replacement comment is in skeleton mode."
        ),
    )
# Code replaced for brevity. See node_id af1e10f7-f928-4160-9068-990f54981e43
```
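The single `NEXT` lookup above can be repeated to walk a whole chain of chunks. The sketch below shows the traversal pattern with plain stand-in objects so it runs without any dependencies; `StubNode` and `follow_next` are hypothetical names, and with real nodes the `next_id` attribute would be replaced by the `relationships[NodeRelationship.NEXT].node_id` lookup used in the cell above.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class StubNode:
    """Stand-in for a parsed node: an id, some text, and an optional NEXT pointer."""

    node_id: str
    text: str
    next_id: Optional[str] = None


def follow_next(nodes_by_id: Dict[str, StubNode], start_id: str) -> Iterator[StubNode]:
    """Yield nodes starting at start_id, following NEXT pointers until the chain ends."""
    seen = set()  # guards against accidental cycles
    current: Optional[str] = start_id
    while current is not None and current not in seen:
        seen.add(current)
        node = nodes_by_id[current]
        yield node
        current = node.next_id


chain = {
    "a": StubNode("a", "class Parser:", next_id="b"),
    "b": StubNode("b", "    language: str", next_id="c"),
    "c": StubNode("c", "    def parse(self): ..."),
}

for node in follow_next(chain, "a"):
    print(node.node_id, "->", node.text)
```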

# Keyword Table and Usage by the LLM

Let's explore the use of this node parser in an index. We can use any index that allows search by keyword, which lets us look up any node by its uuid or by any scope name.

We have created a `CodeHierarchyKeywordQueryEngine`, which lets us search for nodes by their uuid or by their scope name. Its `.query` method can be used as a simple search tool for any LLM. Given the repo map we created earlier, or the text of a split file, the LLM should be able to figure out what to search for very naturally.
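To illustrate the idea of keyword lookup by uuid or scope name, here is a toy index in plain Python. It is only a sketch under simplifying assumptions: `ToyKeywordIndex` is a hypothetical class, each node is reduced to an `(id, innermost scope name, text)` tuple, and the real query engine's internals may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class ToyKeywordIndex:
    """Maps both node ids and scope names to node text (illustrative only)."""

    def __init__(self, nodes: Iterable[Tuple[str, str, str]]) -> None:
        self._texts_by_key: Dict[str, List[str]] = defaultdict(list)
        for node_id, scope_name, text in nodes:
            self._texts_by_key[node_id].append(text)  # exact lookup by id
            self._texts_by_key[scope_name].append(text)  # lookup by scope name

    def query(self, key: str) -> str:
        """Return the text of all nodes registered under key, or '' if none match."""
        return "\n\n".join(self._texts_by_key.get(key.strip(), []))


toy_index = ToyKeywordIndex(
    [
        ("uuid-1", "Foo", "class Foo:\n    # Code replaced for brevity. See node_id uuid-2"),
        ("uuid-2", "bar", "def bar(self):\n    return 1"),
    ]
)

print(toy_index.query("Foo"))
print(toy_index.query("uuid-2"))
```

An LLM that reads the first result can follow the embedded id with a second query, which is the loop the agent later in this notebook performs.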

Let's create the `CodeHierarchyKeywordQueryEngine`.

In [None]:
from llama_index.packs.code_hierarchy import CodeHierarchyKeywordQueryEngine

idx = CodeHierarchyKeywordQueryEngine(
    nodes=split_nodes,
)

Now we can get the same code as before.

In [None]:
print_python(idx.query(split_nodes[0].node_id).response)

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.core.extractors.metadata_extractors import BaseExtractor
from llama_index.core.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field

from tree_sitter import Node

from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.schema import BaseNode, NodeRelationship, TextNode
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 76793a53-5410-4589-901f-df08e44de8c5


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id 50d35182-2106-4328-ae9e-c076b096b5c9
# Code replaced for brevity. See node_id 7d5b7f71-a7fd-4383-9617-e17a62d9b82a
```

But now we can also search for any node by its common-sense name.

For example, the class `_SignatureCaptureOptions` is a node in the hierarchy. We can search for it by name.

We aren't getting more detail because our `min_characters` is too low. Try increasing it to get more detail for any individual query.

In [None]:
print_python(idx.query("_SignatureCaptureOptions").response)

```python
class _SignatureCaptureOptions(BaseModel):
# Code replaced for brevity. See node_id 5f8ce1f3-7886-4081-a961-ce7ea8e25cbf
```

And by module name, in case the LLM sees something in an import statement and wants to know more about it.

In [None]:
print_python(idx.query("code_hierarchy").response)

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.core.extractors.metadata_extractors import BaseExtractor
from llama_index.core.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field

from tree_sitter import Node

from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.schema import BaseNode, NodeRelationship, TextNode
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 76793a53-5410-4589-901f-df08e44de8c5


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id 50d35182-2106-4328-ae9e-c076b096b5c9
# Code replaced for brevity. See node_id 7d5b7f71-a7fd-4383-9617-e17a62d9b82a
```

# As a Tool

To get the LangChain tool, call `as_langchain_tool` on the `CodeHierarchyKeywordQueryEngine`; it is then ready to be used by the LLM.

In [None]:
print_python(idx.as_langchain_tool().run("code_hierarchy"))

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.core.extractors.metadata_extractors import BaseExtractor
from llama_index.core.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field

from tree_sitter import Node

from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.schema import BaseNode, NodeRelationship, TextNode
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 76793a53-5410-4589-901f-df08e44de8c5


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id 50d35182-2106-4328-ae9e-c076b096b5c9
# Code replaced for brevity. See node_id 7d5b7f71-a7fd-4383-9617-e17a62d9b82a
```

This is the description your LLM reads to learn how to use the tool:

In [None]:
print("Name: " + idx.as_langchain_tool().name)
display(Markdown("Description: " + idx.as_langchain_tool().description))

Name: Code Search


Description: 
        Search the tool by any element in this list
        to get more information about that element.
        If you see "Code replaced for brevity" then a uuid, you may also search the tool for that uuid to see the full code.
        The list is:

        - ..
  - llama_index
    - packs
      - code_hierarchy
        - code_hierarchy
          - _SignatureCaptureType
          - _SignatureCaptureOptions
          - _ScopeMethod
          - _CommentOptions
          - _ScopeItem
          - _ChunkNodeOutput
          - CodeHierarchyNodeParser
            - class_name
            - __init__
            - _get_node_name
              - recur
            - _get_node_signature
              - find_start
              - find_end
            - _chunk_node
            - get_code_hierarchy_from_nodes
              - get_subdict
              - recur_inclusive_scope
              - dict_to_markdown
            - _parse_nodes
            - _get_indentation
            - _get_comment_text
            - _create_comment_line
            - _get_replacement_text
            - _skeletonize
            - _skeletonize_list
              - recur


        

Now let's finally make an agent!

In [None]:
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain import hub

# First I'm going to use a bigger max_chars and chunk_lines to get bigger chunks
split_nodes = CodeHierarchyNodeParser(
    language="python",
    chunk_min_characters=500,
    # You can further parameterize the CodeSplitter to split the code
    # into "chunks" that match your context window size using the
    # chunk_lines and max_chars parameters; here we set them explicitly
    code_splitter=CodeSplitter(language="python", max_chars=5000, chunk_lines=100),
).get_nodes_from_documents(nodes)
idx = CodeHierarchyKeywordQueryEngine(
    nodes=split_nodes,
)

llm = ChatOpenAI(model="gpt-4-turbo-preview", max_tokens=4096)
prompt = hub.pull("hwchase17/react")
tools = [idx.as_langchain_tool()]
agent = create_react_agent(llm, tools, prompt)
display(
    Markdown(
        agent.get_prompts()[0].format(
            input="$INPUT", agent_scratchpad="$AGENT_SCRATCHPAD"
        )
    )
)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
out = agent_executor.invoke(
    {
        "input": """
    How does the code hierarchy node parser work?
    Please provide as much detail as possible, and search recursively for all unknown functions starting from _parse_nodes.
    I want to know how it works as code, not just a description of its interface.
    NodeParsers start at the _parse_nodes method."""
    }
)
display(Markdown(out["output"]))

Answer the following questions as best you can. You have access to the following tools:

Code Search: 
        Search the tool by any element in this list
        to get more information about that element.
        If you see "Code replaced for brevity" then a uuid, you may also search the tool for that uuid to see the full code.
        The list is:

        - ..
  - llama_index
    - packs
      - code_hierarchy
        - code_hierarchy
          - _SignatureCaptureType
          - _SignatureCaptureOptions
          - CodeHierarchyNodeParser
            - __init__
            - _get_node_name
            - _get_node_signature
              - find_start
              - find_end
            - _chunk_node
            - get_code_hierarchy_from_nodes
              - recur_inclusive_scope
              - dict_to_markdown
            - _parse_nodes
            - _get_indentation
            - _create_comment_line
            - _get_replacement_text
            - _skeletonize
            - _skeletonize_list


        

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Code Search]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: $INPUT
Thought:$AGENT_SCRATCHPAD



> Entering new AgentExecutor chain...
To understand how the CodeHierarchyNodeParser works, starting from the _parse_nodes method and exploring each function called within it seems like the best approach.
Action: Code Search
Action Input: _parse_nodes
def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> List[BaseNode]:
# Code replaced for brevity. See node_id b7b5ad87-aeda-458a-b170-c246455faf0e
To understand how _parse_nodes works, I need to see the full code referenced by the uuid b7b5ad87-aeda-458a-b170-c246455faf0e.
Action: Code Search
Action Input: b7b5ad87-aeda-458a-b170-c246455faf0e
# Code replaced for brevity. See node_id 70002a26-2741-4d33-9b53-049ea412597c
"""
        The main public method of this class.

        Parse documents into nodes.
        """
        out: List[BaseNode] = []

        try:
            import tree_si

The `CodeHierarchyNodeParser` works as follows:

1. Initially, it attempts to import `tree_sitter_languages` to utilize its parser. If the package is not installed, it raises an ImportError.
2. It then tries to get a parser for the specified language. If it fails, it prints an error message and raises an exception.
3. The method processes each node in the provided sequence of nodes. It uses a progress bar if `show_progress` is set to True. Each node's text is parsed into a tree using the `tree_sitter` parser for the specified language.
4. For each tree, it checks if the root node's children do not start with an "ERROR" node. If the check passes:
   - The code is chunked using `_chunk_node`, and for each chunk, metadata (including the language and any node-specific metadata) is added. The source relationship of the node is also set.
   - If skeletonization is enabled, `_skeletonize_list` is called on the chunks.
   - If code splitting is enabled, it further splits the code by lines and characters, using the code splitter to get new nodes from documents. It ensures that the first new node has the same ID as the original node and annotates nodes with UUIDs of adjacent nodes for easy reference. It also updates parent and child relationships based on the new splits.
   - If metadata extraction is enabled, it processes the chunks through a metadata extractor.
5. If the root node's first child is of type "ERROR", it raises a ValueError indicating that the code could not be parsed for the specified language.
6. Finally, it returns the processed chunks or nodes as the output.

Throughout this process, several helper methods are utilized, including `_chunk_node`, `_skeletonize_list`, and various methods for handling metadata, relationships, and code splitting. Each of these methods contributes to transforming the input nodes into a structured and enriched output that represents the code hierarchy and relationships more clearly, serving the purpose of parsing and analyzing code documents in a refined manner.