Better handling for long text #11

Open
wesslen opened this issue Jan 2, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@wesslen
Owner

wesslen commented Jan 2, 2024

Currently the text length is approximated with a whitespace-based word count before sending to the API, and longer text is truncated.

Also skips if the OpenAI API returns an error.

Need a better solution that checks the length with OpenAI's tokenizer.
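A minimal sketch of what a tokenizer-based check could look like. This is not the repository's code: the helper names are hypothetical, the `tiktoken` package is assumed as the way to reach OpenAI's tokenizer, and the model name `gpt-3.5-turbo` is an assumption because the thread does not say which model is used.

```python
from typing import Callable, List, Tuple

Encode = Callable[[str], List[int]]
Decode = Callable[[List[int]], str]


def tiktoken_codec(model: str = "gpt-3.5-turbo") -> Tuple[Encode, Decode]:
    """Return (encode, decode) backed by tiktoken. The model name is an assumption."""
    import tiktoken  # third-party package: pip install tiktoken

    enc = tiktoken.encoding_for_model(model)
    return enc.encode, enc.decode


def count_tokens(text: str, encode: Encode) -> int:
    """Count tokens with the given encoder instead of counting whitespace-separated words."""
    return len(encode(text))


def truncate_to_budget(text: str, max_tokens: int, encode: Encode, decode: Decode) -> str:
    """Return text unchanged if it fits, otherwise keep only the first max_tokens tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])
```

With `tiktoken` installed, usage would be along the lines of `encode, decode = tiktoken_codec()` followed by `truncate_to_budget(text, max_tokens, encode, decode)`. Passing the encoder in as an argument keeps the length check independent of any one tokenizer.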

@wesslen added the enhancement label Jan 2, 2024
@wesslen
Owner Author

wesslen commented Jan 2, 2024

@wesslen
Owner Author

wesslen commented Jan 3, 2024

Check whether the appendix is included for this example: https://browse.arxiv.org/html/2401.00437v1

@wesslen
Owner Author

wesslen commented Jan 3, 2024

Moved the limit to 13,500 (99add83).

Still need to improve handling.

@wesslen
Owner Author

wesslen commented Jan 4, 2024

Now skips text that is too long, but this is still unsatisfactory.

Here are some current examples:

2312.07392v1
2312.13107v1
2312.17164v1
2401.01149v1
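For reference, the skip-when-too-long behavior can be sketched roughly as below. This is an illustration rather than the actual code: the function names are hypothetical, and treating 13,500 as a whitespace word limit is an assumption based on the earlier comments.

```python
from typing import Dict, List, Tuple


def approx_word_count(text: str) -> int:
    """Approximate length by counting whitespace-separated words."""
    return len(text.split())


def partition_by_length(
    papers: Dict[str, str], max_words: int = 13_500
) -> Tuple[Dict[str, str], List[str]]:
    """Split papers into those short enough to send and the IDs that get skipped."""
    kept: Dict[str, str] = {}
    skipped: List[str] = []
    for paper_id, text in papers.items():
        if approx_word_count(text) > max_words:
            skipped.append(paper_id)  # too long: skip instead of truncating
        else:
            kept[paper_id] = text
    return kept, skipped
```

Returning the skipped IDs alongside the kept texts makes it easy to collect a list like the one above for later inspection.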

@wesslen
Owner Author

wesslen commented Jan 4, 2024

Found that many of the long texts are due to equations. Removed `.ltx_equation` with this update: 6f65cbe

@wesslen
Owner Author

wesslen commented Jan 4, 2024

Removed `.ltx_Math` and `.ltx_theorem` too.

This seems to reduce the unexpected lengths.
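A rough sketch of dropping those elements before measuring length, using only the standard library. The thread does not say which HTML library the project uses, so the parser choice and all names here are assumptions. The sketch also assumes well-formed markup where every non-void element has a closing tag.

```python
from html.parser import HTMLParser

# Class names taken from the comments above.
DROP_CLASSES = {"ltx_equation", "ltx_Math", "ltx_theorem"}
# Void elements have no closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextWithoutMath(HTMLParser):
    """Collect text, ignoring everything inside elements with a dropped class."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or classes & DROP_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_math(html: str) -> str:
    """Return the visible text with math and theorem blocks removed."""
    parser = TextWithoutMath()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Tracking nesting depth rather than a single flag means nested tags inside an equation (for example MathML children) are skipped until the outer element closes.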

@wesslen
Owner Author

wesslen commented Jan 29, 2024

Added LangChain with MapReduce in #23.
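The map-reduce idea can be sketched in plain Python as follows. This is not LangChain and not the code from #23: it is a simplified stand-in where `summarize` is a hypothetical callable wrapping whatever model call is used, and the chunk size of 3,000 words is an arbitrary assumption.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce_summarize(
    text: str, summarize: Callable[[str], str], max_words: int = 3000
) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    chunks = split_into_chunks(text, max_words)
    if len(chunks) <= 1:
        return summarize(text)  # short enough for a single call
    partials = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partials))  # reduce step
```

Because each chunk stays under the limit, long papers no longer need to be truncated or skipped. The trade-off is one extra model call per chunk plus the final combining call.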
