Fix text splitter by dealing with empty tokens #199

Merged
jerryjliu merged 1 commit into main from jerry/fix_text_splitter
Jan 10, 2023

Conversation

jerryjliu
Collaborator

When a text contains runs of consecutive spaces, e.g. "hello" and "world" separated by several spaces, splitting on the separator turns the gaps between those spaces into empty strings, and each empty string counts as 0 tokens when tokenized by itself. When the same run of spaces is tokenized as part of the overall text, however, it counts as its own token, so this example has three tokens rather than the two in "hello world". The splitter therefore undercounts tokens for such text.

This PR deals with that edge case. Hopefully this resolves the tokenization issues.
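The mismatch can be reproduced with a minimal sketch. Everything here is an assumption for illustration: `toy_tokenize` is a hypothetical stand-in for the real tokenizer (it treats a word with an optional leading space as one token and a run of leftover whitespace as one token), and `guarded_count` shows one possible way to avoid undercounting, not necessarily what the merged commit does.

```python
import re


def toy_tokenize(text):
    # Hypothetical tokenizer: a word (with an optional single leading space)
    # is one token; any remaining run of whitespace is one token.
    return re.findall(r" ?\S+|\s+", text)


def naive_count(text, separator=" "):
    # Sum per-piece token counts; empty pieces contribute 0 tokens.
    return sum(len(toy_tokenize(piece)) for piece in text.split(separator))


def guarded_count(text, separator=" "):
    # Assumed mitigation: charge at least one token per piece, so empty
    # pieces produced by consecutive separators are never counted as free.
    return sum(max(len(toy_tokenize(piece)), 1) for piece in text.split(separator))


text = "hello     world"  # five spaces between the words
pieces = text.split(" ")  # ['hello', '', '', '', '', 'world']

true_count = len(toy_tokenize(text))  # whole-text count
print(pieces)
print(true_count, naive_count(text), guarded_count(text))
```

With this toy tokenizer the naive per-piece sum falls below the whole-text count, while the guarded sum errs on the high side. Overestimating is the safer direction when the goal is to keep chunks under a model's token limit.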

Closes #195

@jerryjliu jerryjliu merged commit f5930ac into main Jan 10, 2023
@jerryjliu jerryjliu deleted the jerry/fix_text_splitter branch January 10, 2023 07:21
viveksilimkhan1 pushed a commit to viveksilimkhan1/llama_index that referenced this pull request Oct 30, 2023
Successfully merging this pull request may close these issues.

requested number of tokens exceed the max supported by the model