New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String#truncate_bytes #27319

Merged
merged 1 commit into from Feb 18, 2018

Conversation

Projects
None yet
5 participants
@jeremy
Member

jeremy commented Dec 10, 2016

Limit to N bytes without breaking multibytes chars.

Can be done with foo.mb_chars.limit(n.bytes), but that's much slower.

This joins our #truncate and #truncate_words family.

@matthewd

This comment has been minimized.

Show comment
Hide comment
@matthewd

matthewd Dec 10, 2016

Member

How do we feel about how this will behave in the presence of combining characters?

ISTM we should be dropping the entire grapheme, not effectively replacing it with a different one... but the simplicity of this implementation definitely holds some competing appeal.

Member

matthewd commented Dec 10, 2016

How do we feel about how this will behave in the presence of combining characters?

ISTM we should be dropping the entire grapheme, not effectively replacing it with a different one... but the simplicity of this implementation definitely holds some competing appeal.

@javan

This comment has been minimized.

Show comment
Hide comment
@javan

javan Dec 10, 2016

Member

It's worth documenting at least.

>> "hï".truncate_bytes(2, omission: nil)
=> "hi"

>> "💅🏾".truncate_bytes(5, omission: nil)
=> "💅"

>> "👩‍👩‍👧‍👦".truncate_bytes(18, omission: nil)
=> "👩‍👩‍👧"

>> "👩‍👩‍👧‍👦".truncate_bytes(12, omission: nil)
=> "👩‍👩"

>> "👩‍👩‍👧‍👦".truncate_bytes(8, omission: nil)
=> "👩‍"

🤘

Member

javan commented Dec 10, 2016

It's worth documenting at least.

>> "hï".truncate_bytes(2, omission: nil)
=> "hi"

>> "💅🏾".truncate_bytes(5, omission: nil)
=> "💅"

>> "👩‍👩‍👧‍👦".truncate_bytes(18, omission: nil)
=> "👩‍👩‍👧"

>> "👩‍👩‍👧‍👦".truncate_bytes(12, omission: nil)
=> "👩‍👩"

>> "👩‍👩‍👧‍👦".truncate_bytes(8, omission: nil)
=> "👩‍"

🤘

@jeremy

This comment has been minimized.

Show comment
Hide comment
@jeremy

jeremy Dec 10, 2016

Member

Looking at truncating around grapheme clusters: significantly trickier, but doable.

Multibyte::Chars#limit also splits:

>> grapheme = "👩‍👩‍👧‍👦"
=> "👩‍👩‍👧‍👦"
>> grapheme.size
=> 7
>> grapheme.bytesize
=> 25
>> grapheme.chars
=> ["👩", "", "👩", "", "👧", "", "👦"]
>> (1..grapheme.bytesize).map { |i| grapheme.mb_chars.limit(i).to_s }
=> ["", "", "", "👩", "👩", "👩", "👩‍", "👩‍", "👩‍", "👩‍", "👩‍👩", "👩‍👩", "👩‍👩", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍👧", "👩‍👩‍👧", "👩‍👩‍👧", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍👦"]
>> (1..grapheme.bytesize).map { |i| grapheme.byteslice(0,i).scrub('') }
=> ["", "", "", "👩", "👩", "👩", "👩‍", "👩‍", "👩‍", "👩‍", "👩‍👩", "👩‍👩", "👩‍👩", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍👧", "👩‍👩‍👧", "👩‍👩‍👧", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍👦"]
Member

jeremy commented Dec 10, 2016

Looking at truncating around grapheme clusters: significantly trickier, but doable.

Multibyte::Chars#limit also splits:

>> grapheme = "👩‍👩‍👧‍👦"
=> "👩‍👩‍👧‍👦"
>> grapheme.size
=> 7
>> grapheme.bytesize
=> 25
>> grapheme.chars
=> ["👩", "", "👩", "", "👧", "", "👦"]
>> (1..grapheme.bytesize).map { |i| grapheme.mb_chars.limit(i).to_s }
=> ["", "", "", "👩", "👩", "👩", "👩‍", "👩‍", "👩‍", "👩‍", "👩‍👩", "👩‍👩", "👩‍👩", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍👧", "👩‍👩‍👧", "👩‍👩‍👧", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍👦"]
>> (1..grapheme.bytesize).map { |i| grapheme.byteslice(0,i).scrub('') }
=> ["", "", "", "👩", "👩", "👩", "👩‍", "👩‍", "👩‍", "👩‍", "👩‍👩", "👩‍👩", "👩‍👩", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍", "👩‍👩‍👧", "👩‍👩‍👧", "👩‍👩‍👧", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍", "👩‍👩‍👧‍👦"]
@amatsuda

This comment has been minimized.

Show comment
Hide comment
@amatsuda

amatsuda Feb 2, 2017

Member

I suppose once we merged #26743, the grapheme handling will become way faster because it's implemented in C.

Member

amatsuda commented Feb 2, 2017

I suppose once we merged #26743, the grapheme handling will become way faster because it's implemented in C.

@jeremy

This comment has been minimized.

Show comment
Hide comment
@jeremy

jeremy Feb 26, 2017

Member

Switched to using /\X/ to match grapheme clusters with the aim of letting this PR sit until we rely on Ruby 2.4+.

Member

jeremy commented Feb 26, 2017

Switched to using /\X/ to match grapheme clusters with the aim of letting this PR sit until we rely on Ruby 2.4+.

@jeremy jeremy added this to the 6.0.0 milestone Feb 18, 2018

String#truncate_bytes: limit to N bytes without breaking multibyte chars
This faithfully preserves grapheme clusters (characters composed of other
characters and combining marks) and other multibyte characters.

@jeremy jeremy merged commit 4940cc4 into rails:master Feb 18, 2018

0 of 2 checks passed

codeclimate Code Climate is analyzing this code.
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details

@jeremy jeremy deleted the jeremy:as/string-truncate-bytes branch Feb 18, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment