Add StringUtils.truncateToByteLength #1392

kiddos · 2025-05-27T17:34:58Z

We sometimes need to store Unicode text in a fixed amount of space (e.g., in a database column of type CHARACTER(32)). It's acceptable for the text to be truncated, but because we're dealing with Unicode, we can't simply treat the text as raw bytes and cut it at 32 bytes, since that might split a character in the middle. The function StringUtils.truncateToByteLength(String str, int maxBytes, Charset charset) handles this by truncating the string to a maximum encoded byte length while keeping character boundaries intact.
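For readers following along, here is a minimal sketch of one way such a method could work. It is not the code from this PR: the class name `TruncateSketch` is hypothetical, and the approach (letting a `CharsetEncoder` fill a buffer of `maxBytes` and reading back how many chars it consumed) is an assumption about a reasonable implementation.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical sketch, not the implementation proposed in the PR.
final class TruncateSketch {

    static String truncateToByteLength(String str, int maxBytes, Charset charset) {
        if (str == null) {
            return null;
        }
        if (maxBytes <= 0) {
            return "";
        }
        // Replace malformed or unmappable input instead of stopping at it.
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(str);
        // The encoder only consumes whole characters, so when the output
        // buffer fills up, in.position() sits on a character boundary.
        encoder.encode(in, out, true);
        return str.substring(0, in.position());
    }
}
```

With this sketch, `truncateToByteLength("a\uD83D\uDE80", 4, StandardCharsets.UTF_8)` returns `"a"`, because the rocket emoji needs 4 bytes and only 3 remain after the first character. Note that it protects code points only, not grapheme clusters.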

ecki · 2025-05-28T00:07:55Z

Agreed, this is very useful when dealing with UTF-8 databases. I wonder whether it should have a UTF-8 variant that does not have to re-truncate; it could just look at the byte patterns at the cut point.

The current version does not handle UTF-16 surrogate pairs properly (substring might cut a pair in half).
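To illustrate the UTF-8 variant suggested above, here is a hedged sketch. The class and method names (`Utf8BoundarySketch`, `truncateUtf8`) are hypothetical and not part of the PR. It relies on the fact that every UTF-8 continuation byte has the bit pattern `10xxxxxx`, so a cut point can be moved back until it no longer lands on one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a UTF-8-only variant; not code from the PR.
final class Utf8BoundarySketch {

    static String truncateUtf8(String str, int maxBytes) {
        if (str == null) {
            return null;
        }
        if (maxBytes <= 0) {
            return "";
        }
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return str;
        }
        int end = maxBytes;
        // bytes[end] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut falls inside a character,
        // so back up to that character's lead byte and cut before it.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }
}
```

Because the cut is made on the encoded bytes, a surrogate pair (always one 4-byte sequence in UTF-8) is never split. One caveat of this sketch: `getBytes` replaces unpaired surrogates with `?`, so a truncated result may differ from a prefix of malformed input.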

garydgregory

Hello all,

I think you'll want tests that cover grapheme clusters to avoid problems like https://issues.apache.org/jira/browse/LANG-1770

kiddos · 2025-05-28T17:39:48Z

I added some test cases for emoji characters 🚀✨🎉
I did some testing and found that, with the current implementation, the escaped form ("\uD83D\uDE80\u2728\uD83C\uDF89") works,
but the literal "🚀✨🎉" doesn't.

After adding <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> in pom.xml, "🚀✨🎉" seems to work.

garydgregory · 2025-05-28T18:07:51Z

@kiddos
Please see my previous comment.

kiddos · 2025-05-28T19:11:07Z

Oh, right.
It's just tricky to handle grapheme clusters.
The code point solution you mentioned does seem to work.
I'll add more tests using grapheme clusters.

garydgregory · 2025-05-28T19:30:19Z

I'm not requesting support for grapheme clusters in the runtime, but we should set expectations in unit tests, whether they are supported or not. This is a larger discussion, which I raised in https://issues.apache.org/jira/browse/LANG-1770

kiddos · 2025-06-29T15:25:21Z

For the grapheme cluster test case, I only check that the final output is not null and that its byte size is actually smaller than or equal to the specified byte size.
Is that OK?
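As an illustration of what such a check can and cannot assert, here is a sketch under stated assumptions. `truncateUtf8ByCodePoint` is a hypothetical stand-in for the method under test (UTF-8 only, cutting on code point boundaries, ignoring unpaired surrogates), and `holdsWeakInvariants` is a hypothetical helper; neither comes from the PR.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the truncation below is a stand-in, not the PR's code.
final class GraphemeExpectationSketch {

    // Cuts on code point boundaries only; grapheme clusters are NOT preserved.
    static String truncateUtf8ByCodePoint(String str, int maxBytes) {
        int used = 0;
        int i = 0;
        while (i < str.length()) {
            int cp = str.codePointAt(i);
            // UTF-8 length of one code point (unpaired surrogates not handled).
            int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (used + len > maxBytes) {
                break;
            }
            used += len;
            i += Character.charCount(cp);
        }
        return str.substring(0, i);
    }

    // The weak checks described above, plus "the output is a prefix of the input".
    static boolean holdsWeakInvariants(String input, int maxBytes) {
        String out = truncateUtf8ByCodePoint(input, maxBytes);
        return out != null
                && out.getBytes(StandardCharsets.UTF_8).length <= maxBytes
                && input.startsWith(out);
    }
}
```

For the family emoji `"\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67"` (man, ZWJ, woman, ZWJ, girl; 18 bytes in UTF-8), the weak invariants hold for every budget, yet a 7-byte budget returns the man followed by a dangling zero-width joiner. A test that pins down that exact output would document that grapheme clusters are not supported, which the weak checks alone do not show.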

Add StringUtils.truncateToByteLength

375e9cf

garydgregory requested changes May 28, 2025

fix case for emojis

b12a9ee

kiddos added 3 commits June 8, 2025 22:34

add test cases

4ba9a41

Merge branch 'master' of github.com:apache/commons-lang

ea235b3

case with graphene cluster only check if output bytes is actually sma…

e9a1a31

…ller or equal then expected and not null
