String#truncate_bytes #27319
How do we feel about how this will behave in the presence of combining characters? ISTM we should be dropping the entire grapheme, not effectively replacing it with a different one... but the simplicity of this implementation definitely holds some competing appeal.
It's worth documenting at least.
>> "hï".truncate_bytes(2, omission: nil)
=> "hi"
>> "💅🏾".truncate_bytes(5, omission: nil)
=> "💅"
>> "👩👩👧👦".truncate_bytes(18, omission: nil)
=> "👩👩👧"
>> "👩👩👧👦".truncate_bytes(12, omission: nil)
=> "👩👩"
>> "👩👩👧👦".truncate_bytes(8, omission: nil)
=> "👩"
🤘
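To see why "hï" comes back as "hi", here is a minimal plain-Ruby sketch of a byte cut followed by a scrub. `naive_truncate_bytes` is a hypothetical helper for illustration, not the code in this PR, and the input is assumed to be the decomposed form (an "i" followed by U+0308 COMBINING DIAERESIS).

```ruby
# Hypothetical helper, not the Active Support implementation:
# slice at the byte limit, then drop any character that was cut in half.
def naive_truncate_bytes(string, limit)
  string.byteslice(0, limit).scrub("")
end

decomposed = "hi\u0308" # bytes: "h" (1), "i" (1), U+0308 (2)

naive_truncate_bytes(decomposed, 2) # => "hi", the combining mark is dropped
naive_truncate_bytes(decomposed, 3) # => "hi", half of U+0308 is scrubbed away
naive_truncate_bytes(decomposed, 4) # => the full 4-byte string, diaeresis intact
```

The result is always valid UTF-8, but the last grapheme can turn into a different one, which is the behaviour being questioned above.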
Looking at truncating around grapheme clusters: significantly trickier, but doable.
>> grapheme = "👩👩👧👦"
=> "👩👩👧👦"
>> grapheme.size
=> 7
>> grapheme.bytesize
=> 25
>> grapheme.chars
=> ["👩", "", "👩", "", "👧", "", "👦"]
>> (1..grapheme.bytesize).map { |i| grapheme.mb_chars.limit(i).to_s }
=> ["", "", "", "👩", "👩", "👩", "👩", "👩", "👩", "👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧👦"]
>> (1..grapheme.bytesize).map { |i| grapheme.byteslice(0,i).scrub('') }
=> ["", "", "", "👩", "👩", "👩", "👩", "👩", "👩", "👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧👦"]
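For comparison, a rough sketch of what truncating on grapheme cluster boundaries could look like in plain Ruby. It assumes Ruby 2.5 or later for `String#each_grapheme_cluster`; `truncate_bytes_by_grapheme` is a hypothetical name, not something this PR defines.

```ruby
# Hypothetical helper: keep whole grapheme clusters for as long as
# they fit within `limit` bytes, and stop at the first one that does not.
def truncate_bytes_by_grapheme(string, limit)
  result = +""
  string.each_grapheme_cluster do |cluster|
    break if result.bytesize + cluster.bytesize > limit
    result << cluster
  end
  result
end

decomposed = "hi\u0308" # "h" plus a 3-byte "i" + U+0308 cluster

truncate_bytes_by_grapheme(decomposed, 2) # => "h", the whole cluster is dropped
truncate_bytes_by_grapheme(decomposed, 4) # => the full string
```

With this approach a ZWJ sequence such as the 25-byte family emoji is kept or dropped as a unit, so every limit below 25 yields an empty string instead of a shrinking family.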
I suppose once we merge #26743, the grapheme handling will become much faster because it's implemented in C.
Switched to using |
This faithfully preserves grapheme clusters (characters composed of other characters and combining marks) and other multibyte characters.
Limit to N bytes without breaking multibyte chars. Can be done with foo.mb_chars.limit(n.bytes), but that's much slower. This joins our #truncate and #truncate_words family.
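As an illustration of how an omission could fit into the byte budget, here is a self-contained sketch in plain Ruby. It is an assumption-laden stand-in, not the method added by this PR: the name `truncate_bytes_sketch`, the "…" default, and the `ArgumentError` are all chosen for the example, and it relies on `String#each_grapheme_cluster` (Ruby 2.5+).

```ruby
# Hypothetical sketch: reserve room for the omission first, then fill the
# remaining byte budget with whole grapheme clusters.
def truncate_bytes_sketch(string, limit, omission: "\u2026")
  omission = omission.to_s # `omission: nil` means "append nothing"
  return string.dup if string.bytesize <= limit
  raise ArgumentError, "omission is longer than the byte limit" if omission.bytesize > limit

  budget = limit - omission.bytesize
  kept = +""
  string.each_grapheme_cluster do |cluster|
    break if kept.bytesize + cluster.bytesize > budget
    kept << cluster
  end
  kept << omission
end

truncate_bytes_sketch("hello world", 8)                # => "hello…" (5 + 3 bytes)
truncate_bytes_sketch("hello world", 8, omission: nil) # => "hello wo"
truncate_bytes_sketch("hello", 8)                      # => "hello"
```

Counting the omission against the limit keeps the result at or under N bytes, which seems to be the point of a byte-oriented sibling to #truncate.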