[Bug]CodeHierarchyNodeParser: The child text is not contained inside the parent text. #12924

Closed
josem7 opened this issue Apr 18, 2024 · 2 comments · Fixed by #12941
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

josem7 commented Apr 18, 2024

Bug Description

When parsing the irradiance.py file from the pvlib repo with CodeHierarchyNodeParser, it raised the error "The child text is not contained inside the parent text."

Version

llama-index-packs-code-hierarchy: 0.1.3

Steps to Reproduce

  1. Clone the repo: git clone https://github.com/swe-bench/pvlib__pvlib-python.git
  2. Load the irradiance.py file into CodeHierarchyNodeParser
  3. Call get_nodes_from_documents

Script:

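from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import CodeSplitter
from llama_index.packs.code_hierarchy import CodeHierarchyNodeParser

# Path to the file that triggers the error, relative to the cloned repo's parent directory
path = "pvlib__pvlib-python/pvlib/irradiance.py"
documents = SimpleDirectoryReader(
    input_files=[path],
    file_metadata=lambda x: {"filepath": x},
).load_data()

code = CodeHierarchyNodeParser(
    language="python",
    chunk_min_characters=3,
    code_splitter=CodeSplitter(language="python", max_chars=10000, chunk_lines=10),
)
# Raises ValueError while skeletonizing the parsed chunks
split_nodes = code.get_nodes_from_documents(documents)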

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "/home/jose/Desktop/Trabajo/blar/code-graph/llama_index/llama-index-packs/llama-index-packs-code-hierarchy/llama_index/packs/code_hierarchy/test.py", line 18, in <module>
    split_nodes = code.get_nodes_from_documents(documents)
  File "/home/jose/.local/lib/python3.10/site-packages/llama_index/core/node_parser/interface.py", line 76, in get_nodes_from_documents
    nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
  File "/home/jose/Desktop/Trabajo/blar/code-graph/llama_index/llama-index-packs/llama-index-packs-code-hierarchy/llama_index/packs/code_hierarchy/code_hierarchy.py", line 568, in _parse_nodes
    self._skeletonize_list(chunks)
  File "/home/jose/Desktop/Trabajo/blar/code-graph/llama_index/llama-index-packs/llama-index-packs-code-hierarchy/llama_index/packs/code_hierarchy/code_hierarchy.py", line 849, in _skeletonize_list
    recur(n)
  File "/home/jose/Desktop/Trabajo/blar/code-graph/llama_index/llama-index-packs/llama-index-packs-code-hierarchy/llama_index/packs/code_hierarchy/code_hierarchy.py", line 844, in recur
    recur(child_node)
  File "/home/jose/Desktop/Trabajo/blar/code-graph/llama_index/llama-index-packs/llama-index-packs-code-hierarchy/llama_index/packs/code_hierarchy/code_hierarchy.py", line 843, in recur
    cls._skeletonize(parent_node=node, child_node=child_node)
  File "/home/jose/Desktop/Trabajo/blar/code-graph/llama_index/llama-index-packs/llama-index-packs-code-hierarchy/llama_index/packs/code_hierarchy/code_hierarchy.py", line 826, in _skeletonize
    raise ValueError("The child text is not contained inside the parent text.")
ValueError: The child text is not contained inside the parent text.

josem7 added the bug and triage labels Apr 18, 2024

dosubot bot commented Apr 18, 2024

The error "The child text is not contained inside the parent text" you're encountering with CodeHierarchyNodeParser when parsing irradiance.py suggests an issue with how child nodes are identified within the parent text. This could be due to incorrect text splitting logic, miscalculation of start and end character indices for child nodes, or improper handling of metadata that might interfere with text matching.

To address this, ensure the text splitting logic accurately partitions the parent text into child nodes. Verify that the start and end indices of child nodes are correctly calculated, and check that metadata, if any, does not disrupt the matching of child text within the parent text. Adding detailed logging around the creation of child nodes and their matching process can help pinpoint where the mismatch occurs.

If you're not on the latest version of llama-index-packs-code-hierarchy, updating might resolve the issue if it has been identified and fixed in a newer release. For further assistance, reviewing the implementation details of CodeHierarchyNodeParser and its interaction with irradiance.py could provide more insights into the problem.


josem7 changed the title from "[Bug]: The child text is not contained inside the parent text." to "[Bug]: The child text is not contained inside the parent text. Code Hierarchy node Parser" Apr 18, 2024
josem7 changed the title from "[Bug]: The child text is not contained inside the parent text. Code Hierarchy node Parser" to "[Bug]CodeHierarchyNodeParser: The child text is not contained inside the parent text." Apr 18, 2024

josem7 commented Apr 18, 2024

The bug occurs because the file contains two function definitions whose text is exactly the same.

    if isinstance(datetime_or_doy, pd.DatetimeIndex):
        to_doy = tools._pandas_to_doy  # won't be evaluated unless necessary
        def to_datetimeindex(x): return x                       # noqa: E306
        to_output = partial(pd.Series, index=datetime_or_doy)
    elif isinstance(datetime_or_doy, pd.Timestamp):
        to_doy = tools._pandas_to_doy
        to_datetimeindex = \
            tools._datetimelike_scalar_to_datetimeindex
        to_output = tools._scalar_out
    elif isinstance(datetime_or_doy,
                    (datetime.date, datetime.datetime, np.datetime64)):
        to_doy = tools._datetimelike_scalar_to_doy
        to_datetimeindex = \
            tools._datetimelike_scalar_to_datetimeindex
        to_output = tools._scalar_out
    elif np.isscalar(datetime_or_doy):  # ints and floats of various types
        def to_doy(x): return x                                 # noqa: E306
        to_datetimeindex = partial(tools._doy_to_datetimeindex,
                                   epoch_year=epoch_year)
        to_output = tools._scalar_out
    else:  # assume that we have an array-like object of doy
        def to_doy(x): return x                                 # noqa: E306
        to_datetimeindex = partial(tools._doy_to_datetimeindex,
                                   epoch_year=epoch_year)
        to_output = tools._array_out

The function

def to_doy(x): return x                                 # noqa: E306

is defined twice. With skeletonization enabled, the function _skeletonize tries to replace each child function's text in the parent with the replacement_text. The first child node enters the function with text = "def to_doy(x): return x # noqa: E306".

Then, at line 832 of llama-index-packs/llama-index-packs-code-hierarchy/llama_index/packs/code_hierarchy/code_hierarchy.py:

        parent_node.text = parent_node.text.replace(child_node.text, replacement_text)

Because str.replace without a count replaces every occurrence, both def to_doy(x): return x # noqa: E306 definitions are replaced with the replacement text.

The second child node then enters with exactly the same text, def to_doy(x): return x # noqa: E306, but that text was already replaced when the first child node was processed, so the following check raises the error:

        if child_node.text not in parent_node.text:
            raise ValueError("The child text is not contained inside the parent text.")
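
The failure mode described above can be reproduced without llama-index at all. The following is a minimal sketch, not the library's actual code: the names skeletonize_all, parent_text, child_texts and PLACEHOLDER are hypothetical, and the placeholder string is an assumption standing in for the real replacement_text.

```python
# Standalone sketch of the failure mode; no llama-index imports needed.
# PLACEHOLDER, parent_text, child_texts and skeletonize_all are hypothetical names.
PLACEHOLDER = "# Code replaced for brevity"

parent_text = (
    "if a:\n"
    "    def to_doy(x): return x\n"
    "else:\n"
    "    def to_doy(x): return x\n"
)
# Two child nodes whose text is character-for-character identical.
child_texts = ["def to_doy(x): return x", "def to_doy(x): return x"]


def skeletonize_all(parent: str, children: list[str]) -> str:
    """Mimic the reported logic: check containment, then replace every match."""
    for child in children:
        if child not in parent:
            raise ValueError("The child text is not contained inside the parent text.")
        # str.replace without a count replaces ALL occurrences, not just one.
        parent = parent.replace(child, PLACEHOLDER)
    return parent


try:
    skeletonize_all(parent_text, child_texts)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The first iteration removes both copies of the child text, so the containment check fails on the second iteration even though the input was valid.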

Proposed Solution

When replacing the text, replace only its first occurrence:

index = parent_node.text.find(child_node.text)

# If the text is found, replace only the first occurrence
if index != -1:
    parent_node.text = parent_node.text[:index] + replacement_text + parent_node.text[index + len(child_node.text):]

I will open a PR with the proposed solution.
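
As a rough check of the proposed approach, here is a self-contained sketch showing that replacing only the first occurrence lets two identical children be processed one after the other. The helper replace_first and the PLACEHOLDER string are hypothetical names used for illustration; this is not necessarily the code that was merged in the fix.

```python
# Sketch of first-occurrence replacement; replace_first and PLACEHOLDER are hypothetical.
PLACEHOLDER = "# Code replaced for brevity"


def replace_first(parent: str, child: str, replacement: str) -> str:
    """Replace only the first occurrence of child in parent."""
    index = parent.find(child)
    if index == -1:
        raise ValueError("The child text is not contained inside the parent text.")
    return parent[:index] + replacement + parent[index + len(child):]


parent_text = (
    "if a:\n"
    "    def to_doy(x): return x\n"
    "else:\n"
    "    def to_doy(x): return x\n"
)
child_texts = ["def to_doy(x): return x", "def to_doy(x): return x"]

result = parent_text
for child in child_texts:
    # Each identical child consumes exactly one occurrence.
    result = replace_first(result, child, PLACEHOLDER)

print(result)
```

The built-in str.replace(old, new, 1) gives the same result as the find-and-slice version when the text is present. One caveat, stated as an assumption rather than something the issue confirms: if the replacement text differs per child (for example by embedding an identifier), this approach relies on the children being visited in document order.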
