In [None]:
%load_ext autoreload
%autoreload 2

# Code Hierarchy Node Parser

<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-code-hierarchy/examples/CodeHierarchyNodeParserUsage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The `CodeHierarchyNodeParser` splits long code files into more manageable chunks. It builds a hierarchy in which each scope body (function, class, or method) is replaced by a short comment telling the LLM which node to look up if it wants to read that body. This is called skeletonization, and it is controlled by the `skeleton` option, which defaults to `True`. Nodes are split by scope and carry links to their parents and children, so the LLM can traverse the tree.
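To make the idea of skeletonization concrete, here is a minimal sketch that uses only Python's standard `ast` module. It is not the pack's implementation (the pack relies on tree-sitter and supports multiple languages); the `skeletonize` helper and the generated IDs are hypothetical, and the sketch assumes every body starts on the line after its signature.

```python
import ast
import uuid
from typing import Dict, Tuple


def skeletonize(source: str) -> Tuple[str, Dict[str, str]]:
    """Replace each top-level function or class body with a placeholder comment.

    Returns the skeleton text and a mapping from a (hypothetical) node_id to
    the full text of the scope that was collapsed.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    bodies: Dict[str, str] = {}
    # Walk bottom-up so earlier line numbers stay valid while we edit `lines`.
    for node in reversed(tree.body):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        node_id = str(uuid.uuid4())
        # Remember the full scope text so it can be looked up later by ID.
        bodies[node_id] = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        first = node.body[0]
        placeholder = (
            " " * first.col_offset
            + f"# Code replaced for brevity. See node_id {node_id}"
        )
        # Keep the signature, collapse everything from the first body line on.
        lines[first.lineno - 1 : node.end_lineno] = [placeholder]
    return "\n".join(lines), bodies


sample = (
    "import os\n"
    "\n"
    "def greet(name):\n"
    "    msg = 'hi ' + name\n"
    "    return msg\n"
    "\n"
    "class A:\n"
    "    x = 1\n"
)
skeleton, bodies = skeletonize(sample)
print(skeleton)
```

The skeleton keeps the imports and signatures, while each collapsed body stays retrievable through its ID. The real parser exposes the same trade-off to the LLM.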

This notebook gives an initial demo of the pack, and then dives into a deeper technical exploration of how it works.

**NOTE:** Currently, this pack is configured to only work with `OpenAI` LLMs. But feel free to copy/download the source code and edit as needed!

## Installation and Import

First be sure to install the necessary [tree-sitter](https://tree-sitter.github.io/tree-sitter/) libraries.

In [None]:
!pip install llama-index-packs-code-hierarchy llama-index


In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.text_splitter import CodeSplitter
from llama_index.llms.openai import OpenAI
from llama_index.packs.code_hierarchy import CodeHierarchyNodeParser
from llama_index.packs.code_hierarchy import CodeHierarchyAgentPack
from pathlib import Path

In [None]:
from IPython.display import Markdown, display


def print_python(python_text):
    """This function prints python text in ipynb nicely formatted."""
    # End with a newline so the closing fence sits on its own line.
    display(Markdown("```python\n" + python_text + "\n```"))

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

## Initial Demo

First, let's run the pack using nodes from the included `CodeHierarchyNodeParser`, and then explore how it actually works.

In [None]:
llm = OpenAI(model="gpt-4", temperature=0.2)

documents = SimpleDirectoryReader(
    input_files=[Path("../llama_index/packs/code_hierarchy/code_hierarchy.py")],
    file_metadata=lambda x: {"filepath": x},
).load_data()

split_nodes = CodeHierarchyNodeParser(
    language="python",
    # You can further parameterize the CodeSplitter to split the code
    # into "chunks" that match your context window size using the
    # chunk_lines and max_chars parameters; here we set both explicitly
    code_splitter=CodeSplitter(language="python", max_chars=1000, chunk_lines=10),
).get_nodes_from_documents(documents)

pack = CodeHierarchyAgentPack(split_nodes=split_nodes, llm=llm)

In [None]:
print(
    pack.run(
        "How does the get_code_hierarchy_from_nodes function from the code hierarchy node parser work? Provide specific implementation details."
    )
)

Added user message to memory: How does the get_code_hierarchy_from_nodes function from the code hierarchy node parser work? Provide specific implementation details.
=== Calling Function ===
Calling function: code_search with args: {
  "input": "get_code_hierarchy_from_nodes"
}
Got output: def get_code_hierarchy_from_nodes(
        nodes: Sequence[BaseNode],
        max_depth: int = -1,
    ) -> Tuple[Dict[str, Any], str]:
# Code replaced for brevity. See node_id 1b2cbe9a-5846-4110-aaa5-26327110c9ab

=== Calling Function ===
Calling function: code_search with args: {
  "input": "1b2cbe9a-5846-4110-aaa5-26327110c9ab"
}
Got output: # Code replaced for brevity. See node_id ce774d77-8687-4ae5-af74-4a990c085362
"""
        Creates a code hierarchy appropriate to put into a tool description or context
        to make it easier to search for code.

        Call after `get_nodes_from_documents` and pass that output to this function.
        """
        out: Dict[str, Any] = defaultdict(dict)

 

We can see that the agent explored the hierarchy of the code by requesting specific function names and IDs, in order to provide a full explanation of how the function works!

## Technical Explanations/Exploration

### Prepare your Data

Choose the code files you want to import. For a whole directory you would glob for the files you need.

In this case we load a single file, `code_hierarchy.py`, from the pack itself.

In [None]:
documents = SimpleDirectoryReader(
    input_files=[Path("../llama_index/packs/code_hierarchy/code_hierarchy.py")],
    file_metadata=lambda x: {"filepath": x},
).load_data()

split_nodes = CodeHierarchyNodeParser(
    language="python",
    # You can further parameterize the CodeSplitter to split the code
    # into "chunks" that match your context window size using the
    # chunk_lines and max_chars parameters; here we set both explicitly
    code_splitter=CodeSplitter(language="python", max_chars=1000, chunk_lines=10),
).get_nodes_from_documents(documents)

This should be the code hierarchy node parser itself. Let's have it parse itself!

In [None]:
print(f"Length of text: {len(documents[0].text)}")
print_python(documents[0].text[:1500] + "\n\n# ...")

Length of text: 33375


```python
from collections import defaultdict
from enum import Enum
from tree_sitter import Node
from typing import Any, Dict, List, Optional, Sequence, Tuple


from llama_index.core.bridge.pydantic import BaseModel, Field
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.extractors.metadata_extractors import BaseExtractor
from llama_index.core.node_parser.interface import NodeParser
from llama_index.core.schema import BaseNode, NodeRelationship, TextNode
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    """
    Unfortunately some languages need special options for how to make a signature.

    For example, html element signatures should include their closing >, there is no
    easy way to include this using an always-exclusive system.

    However, using an always-inclusive system, python decorators don't work,
    as there isn't an easy to define terminator for decorators that is inclusive
    to their signature.
    """

    type: str = Field(description="The type string to match on.")
    inclusive: bool = Field(
        description=(
            "Whether to include the text of the node matched by this type or not."
        ),
    )


class _SignatureCaptureOptions(BaseModel):
    """
    Options for capturing the signature of a node.
    """

    start_signature_types: Optional[List[_SignatureCaptureType]] = Field(
        None,
        descripti

# ...
```

At more than 33,000 characters, this is too long to fit comfortably into the context of our LLM, so we will split it. We use the `CodeHierarchyNodeParser` to split the document into more reasonable chunks.

In [None]:
split_nodes = CodeHierarchyNodeParser(
    language="python",
    # You can further parameterize the CodeSplitter to split the code
    # into "chunks" that match your context window size using the
    # chunk_lines and max_chars parameters; here we set both explicitly
    code_splitter=CodeSplitter(language="python", max_chars=1000, chunk_lines=10),
).get_nodes_from_documents(documents)
print("Number of nodes after splitting:", len(split_nodes))

Number of nodes after splitting: 90


Great! That split our data from 1 node into quite a few nodes. What's the maximum length of any of these nodes?

In [None]:
print(f"Longest text in nodes: {max(len(n.text) for n in split_nodes)}")

Longest text in nodes: 1152


That's much shorter than before! Let's look at a sample.

In [None]:
print_python(split_nodes[0].text)

```python
from collections import defaultdict
from enum import Enum
from tree_sitter import Node
from typing import Any, Dict, List, Optional, Sequence, Tuple


from llama_index.core.bridge.pydantic import BaseModel, Field
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.extractors.metadata_extractors import BaseExtractor
from llama_index.core.node_parser.interface import NodeParser
from llama_index.core.schema import BaseNode, NodeRelationship, TextNode
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id b30b6043-4cba-420e-bd6b-e91beea08819


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id e0961aad-bd9f-4295-927d-90ac6e2b06c8
# Code replaced for brevity. See node_id 0f6bc262-ef8b-4051-8c8e-486863e4cbe2
```

Without even needing a long printout, we can see everything this module imports in the first node (which is at the module level) and some of the classes it defines.

We also see that it has put comments in place of code that was removed to make the text size more reasonable.
These can appear at the beginning or end of a chunk, or at a new scope level, like a class or function declaration.

`# Code replaced for brevity. See node_id {node_id}`
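Since these comments follow a fixed pattern, the referenced IDs can be pulled out of a chunk with a regular expression. The sketch below assumes node IDs are standard lowercase UUIDs, as in the outputs in this notebook; `referenced_node_ids` is a hypothetical helper, not part of the pack.

```python
import re
from typing import List

# Matches the replacement comment and captures the UUID that follows it.
NODE_ID_COMMENT = re.compile(
    r"Code replaced for brevity\. See node_id "
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def referenced_node_ids(text: str) -> List[str]:
    """Return every node_id mentioned in a replacement comment, in order."""
    return NODE_ID_COMMENT.findall(text)


chunk = (
    "class _SignatureCaptureType(BaseModel):\n"
    "    # Code replaced for brevity. See node_id b30b6043-4cba-420e-bd6b-e91beea08819\n"
    "\n"
    "class _SignatureCaptureOptions(BaseModel):\n"
    "    # Code replaced for brevity. See node_id e0961aad-bd9f-4295-927d-90ac6e2b06c8\n"
)
ids = referenced_node_ids(chunk)
print(ids)
```

This is less brittle than splitting the last line of a chunk on spaces, because it does not depend on where the comment appears.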

### Code Hierarchy

These scopes can be listed by the `CodeHierarchyNodeParser`, giving a "repo map" of sorts.
This hierarchy, the namesake of the node parser, is a tree of scope names that can be used to search the code.

In [None]:
print(CodeHierarchyNodeParser.get_code_hierarchy_from_nodes(split_nodes))

(defaultdict(<class 'dict'>, {'..': defaultdict(<class 'dict'>, {'llama_index': defaultdict(<class 'dict'>, {'packs': defaultdict(<class 'dict'>, {'code_hierarchy': defaultdict(<class 'dict'>, {'code_hierarchy': defaultdict(<class 'dict'>, {'_SignatureCaptureType': defaultdict(<class 'dict'>, {}), '_SignatureCaptureOptions': defaultdict(<class 'dict'>, {}), '_ScopeMethod': defaultdict(<class 'dict'>, {}), '_CommentOptions': defaultdict(<class 'dict'>, {}), '_ScopeItem': defaultdict(<class 'dict'>, {}), '_ChunkNodeOutput': defaultdict(<class 'dict'>, {}), 'CodeHierarchyNodeParser': defaultdict(<class 'dict'>, {'class_name': defaultdict(<class 'dict'>, {}), '__init__': defaultdict(<class 'dict'>, {}), '_get_node_name': defaultdict(<class 'dict'>, {'recur': defaultdict(<class 'dict'>, {})}), '_get_node_signature': defaultdict(<class 'dict'>, {'find_start': defaultdict(<class 'dict'>, {}), 'find_end': defaultdict(<class 'dict'>, {})}), '_chunk_node': defaultdict(<class 'dict'>, {}), 'get_c
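The nested structure above is easier to reason about with a small stand-alone sketch. It builds a tree of scope names from lists of scope paths and renders it as an indented list. The helpers `build_hierarchy` and `to_markdown` are hypothetical and use plain dictionaries, so treat this as an illustration of the data shape rather than the parser's own code.

```python
from typing import Any, Dict, List


def build_hierarchy(scope_paths: List[List[str]]) -> Dict[str, Any]:
    """Merge scope paths such as ["module", "Class", "method"] into one tree."""
    root: Dict[str, Any] = {}
    for path in scope_paths:
        level = root
        for name in path:
            # Descend into the child dict, creating it on first sight.
            level = level.setdefault(name, {})
    return root


def to_markdown(tree: Dict[str, Any], depth: int = 0) -> str:
    """Render the tree as a nested bullet list, two spaces per level."""
    out: List[str] = []
    for name, children in tree.items():
        out.append("  " * depth + f"- {name}")
        if children:
            out.append(to_markdown(children, depth + 1))
    return "\n".join(out)


paths = [
    ["code_hierarchy", "CodeHierarchyNodeParser", "_get_node_name", "recur"],
    ["code_hierarchy", "_ScopeItem"],
]
tree = build_hierarchy(paths)
print(to_markdown(tree))
```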

### Exploration by the Programmer

To understand what is going on under the hood, let's follow one of these `node_id` references.

In [None]:
split_nodes_by_id = {n.node_id: n for n in split_nodes}
uuid_from_text = split_nodes[9].text.splitlines()[-1].split(" ")[-1]
print("Going to print the node with UUID:", uuid_from_text)
print_python(split_nodes_by_id[uuid_from_text].text)

Going to print the node with UUID: 6d205ded-3ee7-454a-9498-7d5f63963d4c


```python
class CodeHierarchyNodeParser(NodeParser):
# Code replaced for brevity. See node_id 1b87e4b8-08ef-4b34-ac71-9fbcca8bed76
```

This is the next split in the file. Where applicable, a comment pointing to the node before it is prepended and a comment pointing to the node after it is appended.

We can also see the relationships on this node programmatically.

In [None]:
split_nodes_by_id[uuid_from_text].relationships

{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='6d205ded-3ee7-454a-9498-7d5f63963d4c', node_type=<ObjectType.TEXT: '1'>, metadata={'language': 'python', 'inclusive_scopes': [{'name': 'CodeHierarchyNodeParser', 'type': 'class_definition', 'signature': 'class CodeHierarchyNodeParser(NodeParser):'}], 'start_byte': 6241, 'end_byte': 33374, 'filepath': '../llama_index/packs/code_hierarchy/code_hierarchy.py'}, hash='714b8e8a6c2e99ae5f43521fe600587eda6d2cee8411082c4ba3255701ad443f'),
 <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='1b87e4b8-08ef-4b34-ac71-9fbcca8bed76', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='443391a4ee2bdb50953f94fe46e8d93c8be044ea84c7cc30efdfd5a2234a3c6f'),
 <NodeRelationship.CHILD: '5'>: [RelatedNodeInfo(node_id='c81c5ec6-02da-43f1-beab-70cdff2ea7e8', node_type=<ObjectType.TEXT: '1'>, metadata={'inclusive_scopes': [{'name': 'CodeHierarchyNodeParser', 'type': 'class_definition', 'signature': 'class CodeHierarchyNodeParser(NodeParser):'}, {'name

The `NEXT` and `PREV` relationships come from the `CodeSplitter`, which is a component of the `CodeHierarchyNodeParser`. It is responsible for cutting the nodes into chunks of a bounded character length. For more information about the `CodeSplitter`, read this:

[Code Splitter](https://docs.llamaindex.ai/en/latest/api/llama_index.node_parser.CodeSplitter.html)

The `PARENT` and `CHILD` relationships come from the `CodeHierarchyNodeParser`, which is responsible for creating the hierarchy of nodes. Things like classes, functions, and methods are nodes in this hierarchy.

The `SOURCE` is the original file that this node came from.
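As a rough mental model of the `NEXT` links, the sketch below walks a chain of chunks in order. It uses a hypothetical `MiniNode` dataclass instead of the real node classes, and it guards against cycles and dangling IDs, which is an assumption about what a defensive traversal should handle rather than something the library requires.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MiniNode:
    """Hypothetical stand-in for a text node with a single NEXT link."""

    node_id: str
    text: str
    next_id: Optional[str] = None


def follow_next(nodes_by_id: Dict[str, MiniNode], start_id: str) -> List[str]:
    """Return node IDs in NEXT order, stopping at the end, a cycle, or a gap."""
    seen = set()
    order: List[str] = []
    current: Optional[str] = start_id
    while current is not None and current not in seen and current in nodes_by_id:
        seen.add(current)
        order.append(current)
        current = nodes_by_id[current].next_id
    return order


chain = {
    "a": MiniNode("a", "class Foo:", next_id="b"),
    "b": MiniNode("b", "    def bar(self): ...", next_id="c"),
    "c": MiniNode("c", "    def baz(self): ..."),
}
print(follow_next(chain, "a"))
```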

In [None]:
from llama_index.core.schema import NodeRelationship

node_id = uuid_from_text
if NodeRelationship.NEXT not in split_nodes_by_id[node_id].relationships:
    print("No next node found!")
else:
    next_node_relationship_info = split_nodes_by_id[node_id].relationships[
        NodeRelationship.NEXT
    ]
    next_node = split_nodes_by_id[next_node_relationship_info.node_id]
    print_python(next_node.text)

```python
# Code replaced for brevity. See node_id 6d205ded-3ee7-454a-9498-7d5f63963d4c
"""Split code using a AST parser.

    Add metadata about the scope of the code block and relationships between
    code blocks.
    """

    @classmethod
    def class_name(cls) -> str:
        # Code replaced for brevity. See node_id c81c5ec6-02da-43f1-beab-70cdff2ea7e8

    language: str = Field(
        description="The programming language of the code being split."
    )
    signature_identifiers: Dict[str, _SignatureCaptureOptions] = Field(
        description=(
            "A dictionary mapping the type of a split mapped to the first and last type"
            " of itschildren which identify its signature."
        )
    )
    min_characters: int = Field(
        default=80,
        description=(
            "Minimum number of characters per chunk.Defaults to 80 because that's about"
            " how long a replacement comment is in skeleton mode."
        ),
    )
# Code replaced for brevity. See node_id b5ffc7d6-b795-4304-9dcc-b31568291861
```

### Keyword Table and Usage by the LLM

Let's explore the use of this node parser in an index. Any index that supports keyword search will work, which lets us look up any node by its UUID or by any scope name.

We have created a `CodeHierarchyKeywordQueryEngine`, which lets us search for nodes by their UUID or by their scope name. Its `.query` method can be used as a simple search tool for any LLM. Given the repo map we created earlier, or the text of a split file, the LLM should be able to figure out what to search for very naturally.
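Conceptually, this kind of lookup is a table from keys (UUIDs and scope names) to node text. The sketch below shows that idea with plain dictionaries. The node layout (`node_id`, `scopes`, `text`) and the `build_keyword_index` helper are hypothetical, and the real query engine may resolve names differently.

```python
from typing import Dict, List


def build_keyword_index(nodes: List[dict]) -> Dict[str, str]:
    """Map each node's ID, and the name of its innermost scope, to its text."""
    index: Dict[str, str] = {}
    for node in nodes:
        index[node["node_id"]] = node["text"]
        if node["scopes"]:
            # The first chunk seen for a scope name wins, so a name resolves
            # to the chunk that carries the scope's signature.
            index.setdefault(node["scopes"][-1], node["text"])
    return index


nodes = [
    {"node_id": "id-1", "scopes": ["code_hierarchy"], "text": "import os"},
    {"node_id": "id-2", "scopes": ["code_hierarchy", "Foo"], "text": "class Foo:"},
    {"node_id": "id-3", "scopes": ["code_hierarchy", "Foo"], "text": "    x = 1"},
]
index = build_keyword_index(nodes)
print(index["Foo"])
```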

Let's create the `CodeHierarchyKeywordQueryEngine`.

In [None]:
from llama_index.packs.code_hierarchy import CodeHierarchyKeywordQueryEngine

query_engine = CodeHierarchyKeywordQueryEngine(
    nodes=split_nodes,
)

Now we can get the same code as before.

In [None]:
print_python(query_engine.query(split_nodes[0].node_id).response)

```python
from collections import defaultdict
from enum import Enum
from tree_sitter import Node
from typing import Any, Dict, List, Optional, Sequence, Tuple


from llama_index.core.bridge.pydantic import BaseModel, Field
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.extractors.metadata_extractors import BaseExtractor
from llama_index.core.node_parser.interface import NodeParser
from llama_index.core.schema import BaseNode, NodeRelationship, TextNode
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id b30b6043-4cba-420e-bd6b-e91beea08819


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id e0961aad-bd9f-4295-927d-90ac6e2b06c8
# Code replaced for brevity. See node_id 0f6bc262-ef8b-4051-8c8e-486863e4cbe2
```

But now we can also search for any node by its common-sense name.

For example, the class `_SignatureCaptureOptions` is a node in the hierarchy. We can search for it by name.

We don't get more detail here because our `min_characters` is too low. Try increasing it to get more detail from any individual query.

In [None]:
print_python(query_engine.query("_SignatureCaptureOptions").response)

```python
class _SignatureCaptureOptions(BaseModel):
# Code replaced for brevity. See node_id f3ccdeee-207a-4d71-9451-7a9aa93bec33
```

We can also search by module name, in case the LLM sees something in an import statement and wants to know more about it.

In [None]:
print_python(query_engine.query("code_hierarchy").response)

```python
from collections import defaultdict
from enum import Enum
from tree_sitter import Node
from typing import Any, Dict, List, Optional, Sequence, Tuple


from llama_index.core.bridge.pydantic import BaseModel, Field
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.extractors.metadata_extractors import BaseExtractor
from llama_index.core.node_parser.interface import NodeParser
from llama_index.core.schema import BaseNode, NodeRelationship, TextNode
from llama_index.core.text_splitter import CodeSplitter
from llama_index.core.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id b30b6043-4cba-420e-bd6b-e91beea08819


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id e0961aad-bd9f-4295-927d-90ac6e2b06c8
# Code replaced for brevity. See node_id 0f6bc262-ef8b-4051-8c8e-486863e4cbe2
```

### As an Agent

We can convert the query engine to be used as a tool for an agent!

In [None]:
from llama_index.core.tools import QueryEngineTool

tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="code_lookup",
    description="Useful for looking up information about the code hierarchy codebase.",
)

There is also a helpful description of the tool here, which works best as a system prompt.

In [None]:
display(Markdown("Description: " + query_engine.get_tool_instructions()))

Description: Search the tool by any element in this list to get more information about that element.
If you see 'Code replaced for brevity' then a uuid, you may also search the tool with that uuid to see the full code.
You may need to use the tool multiple times to fully answer the user message.
The list is:
- ..
  - llama_index
    - packs
      - code_hierarchy
        - code_hierarchy
          - _SignatureCaptureType
          - _SignatureCaptureOptions
          - _ScopeMethod
          - _CommentOptions
          - _ScopeItem
          - _ChunkNodeOutput
          - CodeHierarchyNodeParser
            - class_name
            - __init__
            - _get_node_name
              - recur
            - _get_node_signature
              - find_start
              - find_end
            - _chunk_node
            - get_code_hierarchy_from_nodes
              - get_subdict
              - recur_inclusive_scope
              - dict_to_markdown
            - _parse_nodes
            - _get_indentation
            - _get_comment_text
            - _create_comment_line
            - _get_replacement_text
            - _skeletonize
            - _skeletonize_list
              - recur



Now let's finally make an agent!

Note that this requires some complex reasoning, and works best with GPT-4-like LLMs.

In [None]:
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4", temperature=0.1)

agent = FunctionAgent(
    tools=[tool],
    llm=llm,
    system_prompt=query_engine.get_tool_instructions(),
)

# Create a context for the agent
from llama_index.core.workflow import Context

ctx = Context(agent)

In [None]:
response = await agent.run(
    "How does the get_code_hierarchy_from_nodes function from the code hierarchy node parser work? Provide specific implementation details.",
    ctx=ctx,
)
print(response)