String#truncate_bytes #27319
Merged
Member
|
How do we feel about how this will behave in the presence of combining characters? ISTM we should be dropping the entire grapheme, not effectively replacing it with a different one... but the simplicity of this implementation definitely holds some competing appeal. |
Contributor
|
It's worth documenting at least.
>> "hï".truncate_bytes(2, omission: nil)
=> "hi"
>> "💅🏾".truncate_bytes(5, omission: nil)
=> "💅"
>> "👩👩👧👦".truncate_bytes(18, omission: nil)
=> "👩👩👧"
>> "👩👩👧👦".truncate_bytes(12, omission: nil)
=> "👩👩"
>> "👩👩👧👦".truncate_bytes(8, omission: nil)
=> "👩"🤘 |
Member
Author
|
Looking at truncating around grapheme clusters: significantly trickier, but doable.
>> grapheme = "👩👩👧👦"
=> "👩👩👧👦"
>> grapheme.size
=> 7
>> grapheme.bytesize
=> 25
>> grapheme.chars
=> ["👩", "", "👩", "", "👧", "", "👦"]
>> (1..grapheme.bytesize).map { |i| grapheme.mb_chars.limit(i).to_s }
=> ["", "", "", "👩", "👩", "👩", "👩", "👩", "👩", "👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧👦"]
>> (1..grapheme.bytesize).map { |i| grapheme.byteslice(0,i).scrub('') }
=> ["", "", "", "👩", "👩", "👩", "👩", "👩", "👩", "👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧", "👩👩👧👦"] |
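For comparison with the two character-based results above, here is a minimal sketch of truncating on grapheme cluster boundaries instead. `truncate_bytes_by_grapheme` is a hypothetical helper written for illustration, not the code from this PR; it assumes a Ruby with `String#each_grapheme_cluster` (2.5 or later).

```ruby
# Hypothetical helper (not this PR's implementation): keep whole grapheme
# clusters and stop before the one that would push us past the byte limit.
def truncate_bytes_by_grapheme(string, limit)
  result = +""
  string.each_grapheme_cluster do |cluster|
    break if result.bytesize + cluster.bytesize > limit
    result << cluster
  end
  result
end

# "h" followed by "i" plus a combining diaeresis (U+0308): 1 + 1 + 2 bytes.
word = "hi\u0308"
truncate_bytes_by_grapheme(word, 2) # => "h"  (the whole "i + mark" cluster is dropped)
truncate_bytes_by_grapheme(word, 4) # => the full string

# The family emoji from above: four 4-byte code points joined by three
# 3-byte zero-width joiners, 25 bytes in total.
family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"
family.bytesize # => 25
# When the regexp engine treats the ZWJ sequence as a single cluster,
# any limit below 25 drops the whole emoji rather than part of it.
truncate_bytes_by_grapheme(family, 24)
```

The trade-off is the one raised earlier in the thread: this never swaps one grapheme for a different one, but it walks the string cluster by cluster, which costs more than a single `byteslice`.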
Member
|
I suppose once we merge #26743, the grapheme handling will become much faster because it will be implemented in C. |
Member
Author
|
Switched to using |
This faithfully preserves grapheme clusters (characters composed of other characters and combining marks) and other multibyte characters.
Limit to N bytes without breaking multibyte chars.
Can be done with foo.mb_chars.limit(n.bytes), but that's much slower.
This joins our #truncate and #truncate_words family.
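To make the byte limit and the `omission:` option concrete, here is a rough sketch in plain Ruby built on the `byteslice` and `scrub` approach shown in the conversation. `truncate_bytes_sketch` is a hypothetical name, the default omission is this sketch's own choice, and it works on characters rather than grapheme clusters, so it can drop a combining mark while keeping its base character.

```ruby
# Hypothetical sketch, not the merged implementation: cut at a byte budget,
# then discard any partial multibyte character left at the end.
def truncate_bytes_sketch(string, limit, omission: "…")
  omission = omission.to_s
  return string.dup if string.bytesize <= limit

  # Reserve room for the omission marker inside the limit.
  budget = limit - omission.bytesize
  raise ArgumentError, "omission is longer than the byte limit" if budget.negative?

  # byteslice may cut through a character; scrub("") removes the broken tail.
  string.byteslice(0, budget).scrub("") + omission
end

text = "caf\u00E9 au lait" # 13 bytes: the e-acute takes 2 bytes

truncate_bytes_sketch(text, 7, omission: "...") # => "caf..."
truncate_bytes_sketch(text, 13)                 # => the string unchanged
```

In the first call the budget of 4 bytes lands in the middle of the 2-byte "é", so the partial character is dropped and the result is 6 bytes, one under the limit.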