Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of Reline::Unicode.get_mbchar_width #632

Merged
merged 1 commit into from
Aug 29, 2024

Conversation

tompng
Copy link
Member

@tompng tompng commented Jan 4, 2024

Performance and Regression check

Tested with this benchmark

# Ignore some character that reline was calculating wrong
chars = (0..0x10ffff).filter_map{_1.chr('utf-8') rescue nil}.reject { _1 =~ /\p{M}/ && _1 !~ /\p{Mn}/ } - ["\u2e3b"];
measure
def Reline.ambiguous_width = 3 # Set to any number except 0, 1, 2 to make the checksum work better.
chars.map{Reline::Unicode.get_mbchar_width(_1)*_1.ord}.sum # checksum to ensure no regression

Result with yjit

master branch
processing time: 0.784064s #=> 924024865779
this branch
processing time: 0.404117s #=> 924024865779

Result without yjit

master branch
processing time: 0.900042s #=> 924024865779
this branch
processing time: 0.566599s #=> 924024865779

Implementation

Uses bsearch_index. Time complexity of bsearch_index is O(log(N)), N is total count of unicode characters.
We can also choose less-memory O(1) lookup (shallow tree with bignum, https://gist.github.com/tompng/6be795d487e1a0105ada41e24f9528c4) but the generated file will be unreadable.

I think this bsearch_index is a good choice because:

  • Most chars are ascii, so multibyte width calculation is not so important
  • The balance of performance, unicode.rb code simplicity, and readability of generated east_asian_width.rb are good

Bug fixes

Fixed these two type of chars. these are excluded from the performance/regression benchmark.

Nonspacing Mark

Reline returned 0 for /\p{M}/ (Mark). I think it was a mistake of /\p{Mn}/ (Nonspacing Mark).

# Chars matches /\p{M}/ but not /\p{Mn}/, 465 chars
marks = (0..0x10ffff).filter_map{''<<_1 rescue nil}.select{_1 =~ /\p{M}/ && _1 !~ /\p{Mn}/}
# Measure actual width in terminal emulator by "\e[6n" (Device Status Report)
marks.count{ $><< "\ra#{_1}b\e[6n";STDIN.raw{STDIN.readpartial(10)[/\e\[\d+;(\d+)R/, 1]}.to_i - 1 == 2 }
# =>
# 0 means /\p{Mn}/ is correct, 465 means /\p{M}/ is correct.
# Terminal.app: 0
# iTerm2: 36
# Alacritty: 13
# VSCode Terminal: 14

Three Em Dash

Reline returned 3 for three em dash "\u2e3b".
Reline returned 1 for two em dash "\u2e3a".
It's defined as N(Neutral) and shuold be 1. Terminal.app, VSCode, iTerm, Alacrytty uses width=1 (but overflows because font is very wide)

Comment on lines 66 to 67
chunk_index = Reline::Unicode::EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o }
size = Reline::Unicode::EastAsianWidth::CHUNK_WIDTH[chunk_index]
Copy link

@mblumtritt mblumtritt Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chunk_index can be nil and then line 67 will fail!

Suggested change
chunk_index = Reline::Unicode::EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o }
size = Reline::Unicode::EastAsianWidth::CHUNK_WIDTH[chunk_index]
chunk_index = Reline::Unicode::EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o }
size = chunk_index ? Reline::Unicode::EastAsianWidth::CHUNK_WIDTH[chunk_index] : 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback. Can you provide an input character that chunk_index will be nil?

I think utf8_mbchar.ord is always less than or equal to 0x10ffff (maximum codepoint of unicode)

The last two values of CHUNK_LAST are: 0x10fffd and 0x7fffffff.
The max value of 32bit signed integer 0x7fffffff is added at the end of CHUNK_LAST to prevent bsearch_index to return nil.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - I missed the additional integer at the end!

I only saw the comment about how the table was generated and missed the "non-unicode" addition…

Thanks for clarification!

@mblumtritt
Copy link

mblumtritt commented Aug 16, 2024

Sorry for another question: Am I right that country flags (like '🇵🇹') are not supported?

@tompng
Copy link
Member Author

tompng commented Aug 16, 2024

Am I right that country flags (like '🇵🇹') are not supported?

In my opinion, Reline supports calculating flag width correctly.
Flags are defined as N(Neutral == narrow or half), width=1 character in Unicode EastAsianWidth.txt.

# EastAsianWidth.txt
1F1E6..1F1FF   ; N  # So    [26] REGIONAL INDICATOR SYMBOL LETTER A..REGIONAL INDICATOR SYMBOL LETTER Z

But flag rendering fails in many terminal emulators with various reasons.

Mac Terminal.app

Renders with half-width visually but calculates column as full-width internally. Can't support a bug.

iTerm2

Flag width can be changed by experimental configuration Fag emoji render full-width: yes/no and the default is full-width.
Width of some half-width emojis(example: ⛴️) are also configurable.
There is no plan to detect this kind of configuration value from Reline.

Alacritty, VSCode Terminal

Does not support rendering flags. "🇵🇹" is rendered as "P"+"T"

@elfham
Copy link
Contributor

elfham commented Aug 17, 2024

IMHO, emojis of flags should be treated as double-width characters if they are combined in the same way as half-width katakana characters.

% irb
irb(main):001> Reline::Unicode.get_mbchar_width("カ")
=> 1
irb(main):002> Reline::Unicode.get_mbchar_width("ガ")
=> 2
irb(main):003> 

However, this does not seem to be well documented in UAX#11 and is handled differently by different implementations.

I think there is still room for further discussion.

So, I think it would be better to discuss the flag width as a separate Issue from this PR.

@mblumtritt
Copy link

Ok. thanks for your answer!

Let's have this PR merged asap - which will help a lot to make things better. And maybe we handle the issue with some special chars which are different handled in terminals later…

Thanks!

@st0012
Copy link
Member

st0012 commented Aug 17, 2024

@ima1zumi do you have time to give this a look? I think I don't have enough background knowledge to give a deep review.

@ima1zumi
Copy link
Member

@tompng Could you rebase it?

@tompng
Copy link
Member Author

tompng commented Aug 19, 2024

Rebase done

f.each_line do |line|
next unless m = line.match(/^(\h+)(?:\.\.(\h+))?\s*;\s*(\w+)\s+#.+/)
next unless /^(?<first>\h+)(?:\.\.(?<last>\h+))?\s*;\s*(?<type>\w+)\s+# +(?<category>[^ ]+)/ =~ line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝

0021..0023     ; Na # Po     [3] EXCLAMATION MARK..NUMBER SIGN
^first ^last     ^type ^category

https://www.unicode.org/Public/15.1.0/ucd/EastAsianWidth.txt


chunks = widths.each_with_index.chunk { |width, _idx| width || 1 }
chunk_last_ords = chunks.map { |width, chunk| [chunk.last.last, width] }
chunk_last_ords << [0x7fffffff, 1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think chunk_last_ords should be up to 0x10FFFFF.

In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF
2.4 Code Points and Characters
https://www.unicode.org/versions/Unicode15.1.0/ch02.pdf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this value(max value of int32) as an alternative of infinity. This way, we don't need to test 0x10ffff not to raise no method error.
I think it is not so obvious that chunk_index = EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o } does not return nil when CHUNK_LAST.last == 0x10FFFF and ord == 0x10FFFF .
Of course we can change it to 0x10FFFF and add a boundary value test. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I agree that using 0x7fffffff is good, just as you suggested.

@@ -5,6 +5,18 @@ if ARGV.empty?
exit 1
end

def unicode_width(type, category)
return 0 if category == 'Mn' # Nonspacing Mark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why the Nonspacing Mark should be set to 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old implementation, Mark(Mn (nonspacing mark) + Mc (spacing combining mark) + Me (enclosing mark)) are all set to 0.

MBCharWidthRE = /
  ...
  | (?<width_0>^\p{M})
  ...
/

I think this is just a simple mistake of \p{Mn} because Mark is not always zero width.
I think there are two choices and I choose the latter one.

# Choice 1, Don't change the behavior now. Leave it, or fix it in another pull request
return 0 if category =~ '/Mn|Mc|Me/'
# Choice 2, Fix it to nonspacing
return 0 if category == 'Mn'

Actual width calculated by the script below are:

Terminal emulator Mn Other marks(Mc, Me)
Terminal.app 0 >=1
Alacritty 0 mostly(97%) >=1, some(3%) are 0
iTerm, VSCode Terminal 1 >= 1
mn = (0..0x10ffff).filter_map{_1.chr('utf-8') rescue nil}.grep(/\p{Mn}/)
mn.map{
  print "\e[H"+_1+"\e[6n"
  STDIN.raw{STDIN.readpartial 10}
}.tally

Mn is not zero-width in some terminal emulators, but I prefer not changing the original intention in this pull request except bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your explanation. That makes sense to me now.

@@ -13,66 +25,32 @@ open(ARGV.first, 'rt') do |f|
unicode_version = nil
end

list = []
widths = []
letter_modifiers = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used.

Suggested change
letter_modifiers = []

Comment on lines +72 to +76
elsif size == 1 && utf8_mbchar.size >= 2
second_char_ord = utf8_mbchar[1].ord
# Halfwidth Dakuten Handakuten
# Only these two character has Letter Modifier category and can be combined in a single grapheme cluster
(second_char_ord == 0xFF9E || second_char_ord == 0xFF9F) ? 2 : 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 This is a special case specifically for when there is a character with width 1 before U+FF9E or U+FF9F. In other words, it allows the width to be calculated correctly for something like ‘ガ’. However, it cannot calculate the width correctly for ‘が’.
There are many other combining characters, and I’m not sure why this character alone is being handled, but there doesn’t seem to be a particular reason to remove it.

Copy link
Member

@ima1zumi ima1zumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -5,6 +5,18 @@ if ARGV.empty?
exit 1
end

def unicode_width(type, category)
return 0 if category == 'Mn' # Nonspacing Mark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your explanation. That makes sense to me now.


chunks = widths.each_with_index.chunk { |width, _idx| width || 1 }
chunk_last_ords = chunks.map { |width, chunk| [chunk.last.last, width] }
chunk_last_ords << [0x7fffffff, 1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I agree that using 0x7fffffff is good, just as you suggested.

@ima1zumi ima1zumi merged commit 0851e93 into ruby:master Aug 29, 2024
40 checks passed
@tompng tompng deleted the faster_mbwidth branch August 29, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants