Add StringUtils.truncateToByteLength #1392
base: master
Agree, very useful when dealing with UTF-8 databases. I wonder whether it should have a UTF-8 variant that does not have to re-truncate and can instead just look at the byte patterns at the border. The current version does not handle UTF-16 surrogate pairs properly (substring might cut them in half).
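To make the "look at the byte patterns at the border" idea concrete, here is a minimal sketch, not the code in this PR. It assumes UTF-8 only; the class name `Utf8BoundarySketch` and the method `truncateUtf8` are hypothetical. It relies on the fact that UTF-8 continuation bytes always have the form `10xxxxxx`, so the cut point can be moved back until it no longer lands inside a multi-byte sequence.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Commons Lang.
final class Utf8BoundarySketch {

    static String truncateUtf8(String str, int maxBytes) {
        if (str == null) {
            return null;
        }
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return str; // already fits, nothing to cut
        }
        int end = maxBytes;
        // bytes[end] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut falls inside a multi-byte
        // sequence, so back up to the lead byte of that sequence.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }
}
```

With this sketch, `truncateUtf8("héllo", 2)` yields `"h"`, because the two-byte `é` does not fit in the remaining byte. Note that it keeps code points whole but does not keep grapheme clusters together, and unpaired surrogates in the input are replaced during encoding.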
Hello all,
I think you'll want tests that cover grapheme clusters to avoid problems like https://issues.apache.org/jira/browse/LANG-1770
I added some test cases for emoji characters 🚀✨🎉 After adding
@kiddos
Oh, right.
I'm not requesting support for grapheme clusters in the runtime, but we should set expectations in unit tests, whether they are supported or not. This is a larger discussion, which I raised in https://issues.apache.org/jira/browse/LANG-1770
…ller or equal then expected and not null
For the test case, I only check that the final output is not null and that its byte size is actually smaller than or equal to the specified byte size.
We sometimes need to store Unicode text in a fixed space (e.g., in a database column of type CHARACTER(32)). It's acceptable for the text to be truncated, but because we're dealing with Unicode, we can't simply treat the text as raw bytes and cut it at a fixed number of bytes, since that might split a character in the middle. The function StringUtils.truncateToByteLength(String str, int maxBytes, Charset charset) helps handle this by safely truncating the string based on byte length while preserving valid character boundaries.
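One way such a method could work, shown here as a sketch under stated assumptions and not as the implementation in this PR: encode into a `ByteBuffer` whose capacity is `maxBytes` and let the JDK's `CharsetEncoder` stop when the buffer is full. The encoder only consumes whole characters (a surrogate pair counts as one), so the input position marks a safe cut point. The class name `ByteTruncateSketch` and the null and negative-argument handling are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical sketch; the behavior for null and negative input is assumed.
final class ByteTruncateSketch {

    static String truncateToByteLength(String str, int maxBytes, Charset charset) {
        if (str == null) {
            return null;
        }
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        CharsetEncoder encoder = charset.newEncoder()
                // Mirror String.getBytes: replace bad input rather than fail.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(str);
        // Encoding stops with OVERFLOW once the next character no longer
        // fits; in.position() then counts only fully encoded chars.
        encoder.encode(in, out, true);
        return str.substring(0, in.position());
    }
}
```

For example, `truncateToByteLength("a🚀", 4, StandardCharsets.UTF_8)` returns `"a"`, because the rocket needs four bytes and only three remain. Charsets that emit a byte order mark, such as `UTF_16`, spend part of the budget on it. Grapheme clusters are not kept together by this approach, which is the open question raised in LANG-1770.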