[Question]: SentenceNodeParser ignores max_length of embed model #14148
The issue you're encountering with the `SentenceWindowNodeParser` relates to how the window metadata is built and how the embed model's `max_length` is applied. Here's a breakdown of the relevant configuration.

Here is the relevant code snippet from the `SentenceWindowNodeParser`:

```python
# exclude window metadata from embed and llm
node.excluded_embed_metadata_keys.extend(
    [self.window_metadata_key, self.original_text_metadata_key]
)
node.excluded_llm_metadata_keys.extend(
    [self.window_metadata_key, self.original_text_metadata_key]
)
```

To address your issue, ensure that the embed model's `max_length` is configured to match its context limit. By carefully configuring these parameters, you can manage large passages and avoid exceeding the embed model's token limit.
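As a concrete illustration of where `max_length` is set (the model name and import path below are assumptions for the sketch, not taken from this thread), LlamaIndex's HuggingFace embedding integration accepts it at construction time:

```python
# Hypothetical configuration sketch -- model name and values are examples only.
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # example model, not from the thread
    max_length=2000,  # caps the token count fed to the *embedding* model only
)
```

Note that this is a config fragment: it limits what the embedding model sees, and nothing more.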
@dosu Did I misunderstand, or is the value of `max_length` only related to the embed model's max length, not to the sentence window's max length? That would explain why the chunks are still 2000+ tokens: the setting applies only to the embed model, not to the length of the sentence window node.
Yes, you are correct. The `max_length` setting applies only to the input of the embed model, not to the sentence window built by the `SentenceWindowNodeParser`. The window is determined purely by `window_size`, which counts sentences, so its token length is unbounded. To ensure that the nodes do not exceed the embed model's limit, the window itself would need to be capped by token count.
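To see why the chunks stay oversized, here is a dependency-free sketch (a toy whitespace "tokenizer" stands in for the real embed model, purely for illustration) of where the truncation actually happens:

```python
# max_length truncates what the *embedding model* sees; the node text (and its
# window metadata) passed to the LLM is unchanged.
def embed_tokens(text: str, max_length: int) -> list[str]:
    tokens = text.split()
    return tokens[:max_length]  # truncation happens only here, at embed time

node_text = " ".join(f"w{i}" for i in range(3000))
embedded = embed_tokens(node_text, max_length=2000)
print(len(embedded))           # embedding input is capped at 2000 tokens
print(len(node_text.split()))  # the node itself is still 3000 tokens
```

So a pipeline that measures nodes (or window metadata) before embedding still sees the full, uncapped text, which is consistent with the 2000+ token chunks you observed.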
Unfortunately I don't really know how GitHub works, but I implemented a solution for my problem. I guess this would be a pull request? The size of the sentence window now also depends on the configured max token size. To make it irrelevant when you only want to use the sentence window, you could set a blatantly high token count so that it is effectively ignored. But so far I like this very much.

```python
"""Simple node parser."""

from typing import Any, Callable, List, Optional, Sequence

from transformers import AutoTokenizer

from llama_index.core.bridge.pydantic import Field
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.node_parser.interface import NodeParser
from llama_index.core.node_parser.node_utils import (
    build_nodes_from_splits,
    default_id_func,
)
from llama_index.core.node_parser.text.utils import split_by_sentence_tokenizer
from llama_index.core.schema import BaseNode, Document
from llama_index.core.utils import get_tqdm_iterable

DEFAULT_WINDOW_SIZE = 3
DEFAULT_WINDOW_METADATA_KEY = "window"
DEFAULT_OG_TEXT_METADATA_KEY = "original_text"
DEFAULT_WINDOW_TOKEN_SIZE = 2000


class SentenceWindowNodeParser(NodeParser):
    """Sentence window node parser.

    Splits a document into Nodes, with each node being a sentence.
    Each node contains a window from the surrounding sentences in the metadata.

    Args:
        sentence_splitter (Optional[Callable]): splits text into sentences
        include_metadata (bool): whether to include metadata in nodes
        include_prev_next_rel (bool): whether to include prev/next relationships
    """

    sentence_splitter: Callable[[str], List[str]] = Field(
        default_factory=split_by_sentence_tokenizer,
        description="The text splitter to use when splitting documents.",
        exclude=True,
    )
    window_size: int = Field(
        default=DEFAULT_WINDOW_SIZE,
        description="The number of sentences on each side of a sentence to capture.",
        gt=0,
    )
    window_metadata_key: str = Field(
        default=DEFAULT_WINDOW_METADATA_KEY,
        description="The metadata key to store the sentence window under.",
    )
    original_text_metadata_key: str = Field(
        default=DEFAULT_OG_TEXT_METADATA_KEY,
        description="The metadata key to store the original sentence in.",
    )
    tokenizer: AutoTokenizer = Field(
        default_factory=lambda: AutoTokenizer.from_pretrained(
            "mlabonne/NeuralDaredevil-8B-abliterated"
        ),
        description="The tokenizer to use for counting tokens.",
        exclude=True,
    )
    window_token_size: int = Field(
        default=DEFAULT_WINDOW_TOKEN_SIZE,
        description="The maximum token size for the window.",
        gt=0,
    )

    @classmethod
    def class_name(cls) -> str:
        return "SentenceWindowNodeParser"

    @classmethod
    def from_defaults(
        cls,
        sentence_splitter: Optional[Callable[[str], List[str]]] = None,
        window_size: int = DEFAULT_WINDOW_SIZE,
        window_token_size: int = DEFAULT_WINDOW_TOKEN_SIZE,
        window_metadata_key: str = DEFAULT_WINDOW_METADATA_KEY,
        original_text_metadata_key: str = DEFAULT_OG_TEXT_METADATA_KEY,
        include_metadata: bool = True,
        include_prev_next_rel: bool = True,
        callback_manager: Optional[CallbackManager] = None,
        id_func: Optional[Callable[[int, Document], str]] = None,
    ) -> "SentenceWindowNodeParser":
        callback_manager = callback_manager or CallbackManager([])
        sentence_splitter = sentence_splitter or split_by_sentence_tokenizer()
        id_func = id_func or default_id_func
        return cls(
            sentence_splitter=sentence_splitter,
            window_size=window_size,
            window_token_size=window_token_size,
            window_metadata_key=window_metadata_key,
            original_text_metadata_key=original_text_metadata_key,
            include_metadata=include_metadata,
            include_prev_next_rel=include_prev_next_rel,
            callback_manager=callback_manager,
            id_func=id_func,
        )

    def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> List[BaseNode]:
        """Parse documents into nodes."""
        all_nodes: List[BaseNode] = []
        nodes_with_progress = get_tqdm_iterable(nodes, show_progress, "Parsing nodes")
        for node in nodes_with_progress:
            window_nodes = self.build_window_nodes_from_documents([node])
            all_nodes.extend(window_nodes)
        return all_nodes

    def build_window_nodes_from_documents(
        self, documents: Sequence[Document]
    ) -> List[BaseNode]:
        """Build window nodes from documents."""
        all_nodes: List[BaseNode] = []
        for doc in documents:
            text = doc.text
            text_splits = self.sentence_splitter(text)
            nodes = build_nodes_from_splits(
                text_splits,
                doc,
                id_func=self.id_func,
            )

            # Add a token-capped window to each node.
            for i, node in enumerate(nodes):
                window_nodes = []
                window_token_count = 0
                for j in range(
                    max(0, i - self.window_size),
                    min(i + self.window_size + 1, len(nodes)),
                ):
                    window_node = nodes[j]
                    window_node_tokens = self.tokenizer(
                        window_node.text, return_tensors="pt"
                    )
                    window_token_count += window_node_tokens.input_ids.size(-1)
                    if window_token_count > self.window_token_size:
                        break
                    window_nodes.append(window_node)

                node.metadata[self.window_metadata_key] = " ".join(
                    [n.text for n in window_nodes]
                )
                node.metadata[self.original_text_metadata_key] = node.text

                # Exclude window metadata from embed and LLM.
                node.excluded_embed_metadata_keys.extend(
                    [self.window_metadata_key, self.original_text_metadata_key]
                )
                node.excluded_llm_metadata_keys.extend(
                    [self.window_metadata_key, self.original_text_metadata_key]
                )

            all_nodes.extend(nodes)
        return all_nodes
```
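The token-capping loop above can be exercised in isolation. This dependency-free sketch substitutes whitespace word counts for the real tokenizer (an assumption for illustration only):

```python
# Minimal sketch of the token-capped window logic, using whitespace word
# counts as a stand-in tokenizer.
def build_window(sentences, i, window_size, window_token_size):
    window = []
    token_count = 0
    for j in range(max(0, i - window_size), min(i + window_size + 1, len(sentences))):
        token_count += len(sentences[j].split())  # toy token count
        if token_count > window_token_size:
            break  # budget exhausted: drop this and all later sentences
        window.append(sentences[j])
    return " ".join(window)

sentences = ["one two three", "four five", "six seven eight nine", "ten"]
print(build_window(sentences, 1, window_size=1, window_token_size=5))
# -> "one two three four five"
```

One design note on this approach: because the loop counts from the left edge of the window and breaks as soon as the budget is hit, everything to the right of the break point is dropped, and with a budget smaller than the leading sentences, even the center sentence itself can be excluded from its own window.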
Question Validation
Question
I have a problem with the `SentenceWindowNodeParser`, and I think it has to do with my configuration, though I'm not entirely sure whether it's a bug. I try to increase the window size to 14 so that large chunks are passed into the LLM context, but some passages are very, very big. So I tried to set `max_length` to 2000 tokens in the embed model, because otherwise I'd get an error that the initial token count is exceeded.
This is my little snippet in which I tried to accomplish this.
Edit: The problem is that some chunks are still 2000+ tokens long, which triggers the error that there are too many tokens. Just to clarify this again.
Is there something wrong with that?