Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of Reline::Unicode.get_mbchar_width #632

Merged
merged 1 commit into from
Aug 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
93 changes: 35 additions & 58 deletions bin/generate_east_asian_width
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,18 @@ if ARGV.empty?
exit 1
end

def unicode_width(type, category)
return 0 if category == 'Mn' # Nonspacing Mark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why the Nonspacing Mark should be set to 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old implementation, Mark(Mn (nonspacing mark) + Mc (spacing combining mark) + Me (enclosing mark)) are all set to 0.

MBCharWidthRE = /
  ...
  | (?<width_0>^\p{M})
  ...
/

I think this is just a simple mistake of \p{Mn} because Mark is not always zero width.
I think there are two choices and I choose the latter one.

# Choice 1, Don't change the behavior now. Leave it, or fix it in another pull request
return 0 if category =~ '/Mn|Mc|Me/'
# Choice 2, Fix it to nonspacing
return 0 if category == 'Mn'

Actual width calculated by the script below are:

Terminal emulator Mn Other marks(Mc, Me)
Terminal.app 0 >=1
Alacritty 0 mostly(97%) >=1, some(3%) are 0
iTerm, VSCode Terminal 1 >= 1
mn = (0..0x10ffff).filter_map{_1.chr('utf-8') rescue nil}.grep(/\p{Mn}/)
mn.map{
  print "\e[H"+_1+"\e[6n"
  STDIN.raw{STDIN.readpartial 10}
}.tally

Mn is not zero-width in some terminal emulators, but I prefer not changing the original intention in this pull request except bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your explanation. That makes sense to me now.

case type
when 'F', 'W' # Fullwidth, Wide
2
when 'H', 'Na', 'N' # Halfwidth, Narrow, Neutral
1
when 'A' # Ambiguous
-1
end
end

open(ARGV.first, 'rt') do |f|
if m = f.gets.match(/^# EastAsianWidth-(\d+\.\d+\.\d+)\.txt/)
unicode_version = m[1]
Expand All @@ -13,66 +25,31 @@ open(ARGV.first, 'rt') do |f|
unicode_version = nil
end

list = []
widths = []
f.each_line do |line|
next unless m = line.match(/^(\h+)(?:\.\.(\h+))?\s*;\s*(\w+)\s+#.+/)
next unless /^(?<first>\h+)(?:\.\.(?<last>\h+))?\s*;\s*(?<type>\w+)\s+# +(?<category>[^ ]+)/ =~ line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝

0021..0023     ; Na # Po     [3] EXCLAMATION MARK..NUMBER SIGN
^first ^last     ^type ^category

https://www.unicode.org/Public/15.1.0/ucd/EastAsianWidth.txt


first = m[1].to_i(16)
last = m[2]&.to_i(16) || first
type = m[3].to_sym
if !list.empty? and (list.last[:range].last + 1) == first and list.last[:type] == type
list.last[:range] = (list.last[:range].first..last)
else
# [\u{D800}-\u{DFFF}] cause error.
unless ((0xD800..0xDFFF).to_a & (first..last).to_a).empty?
unless (first..0xD7FF).to_a.empty?
list << {
range: (first..0xD7FF),
type: type.to_sym
}
end
unless (0xE000..last).to_a.empty?
list << {
range: (first..0xD7FF),
type: type.to_sym
}
end
else
list << {
range: (first..last),
type: type.to_sym
}
end
end
range = first.to_i(16)..(last || first).to_i(16)
widths.fill(unicode_width(type, category), range)
end
grouped = list.group_by { |item| item[:type] }.map { |item| [item.first, item.last.map { |row| row[:range] }] }.to_h
grouped = %i{F H W Na A N}.map { |type| [type, grouped[type]] }
puts <<EOH
class Reline::Unicode::EastAsianWidth
# This is based on EastAsianWidth.txt
# UNICODE_VERSION = #{unicode_version ? "'#{unicode_version}'" : 'nil'}

EOH
puts grouped.map { |item|
type, ranges = item
output = " # %s\n" %
case type
when :F then 'Fullwidth'
when :H then 'Halfwidth'
when :W then 'Wide'
when :Na then 'Narrow'
when :A then 'Ambiguous'
when :N then 'Neutral'
end
output += " TYPE_%s = /^[\#{ %%W(\n" % type.upcase
output += ranges.map { |range|
if range.first == range.last
' \u{%04X}' % range.first
else
' \u{%04X}-\u{%04X}' % [range.first, range.last]
end
}.join("\n")
output += "\n ).join }]/\n"
}.join("\n")
puts 'end'
# EscapedPairs
[*0x00..0x1F, 0x7F].each { |ord| widths[ord] = 2 }
# printable ASCII chars
(0x20..0x7E).each { |ord| widths[ord] = 1 }

chunks = widths.each_with_index.chunk { |width, _idx| width || 1 }
chunk_last_ords = chunks.map { |width, chunk| [chunk.last.last, width] }
chunk_last_ords << [0x7fffffff, 1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think chunk_last_ords should be up to 0x10FFFFF.

In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF
2.4 Code Points and Characters
https://www.unicode.org/versions/Unicode15.1.0/ch02.pdf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this value(max value of int32) as an alternative of infinity. This way, we don't need to test 0x10ffff not to raise no method error.
I think it is not so obvious that chunk_index = EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o } does not return nil when CHUNK_LAST.last == 0x10FFFF and ord == 0x10FFFF .
Of course we can change it to 0x10FFFF and add a boundary value test. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I agree that using 0x7fffffff is good, just as you suggested.


puts <<~EOH
class Reline::Unicode::EastAsianWidth
# This is based on EastAsianWidth.txt
# UNICODE_VERSION = #{unicode_version ? "'#{unicode_version}'" : 'nil'}

CHUNK_LAST, CHUNK_WIDTH = [
#{chunk_last_ords.map { |ord, width| " [0x#{ord.to_s(16)}, #{width}]" }.join(",\n")}
].transpose.map(&:freeze)
end
EOH
end
53 changes: 14 additions & 39 deletions lib/reline/unicode.rb
Original file line number Diff line number Diff line change
Expand Up @@ -56,51 +56,26 @@ def self.escape_for_print(str)

require 'reline/unicode/east_asian_width'

HalfwidthDakutenHandakuten = /[\u{FF9E}\u{FF9F}]/

MBCharWidthRE = /
(?<width_2_1>
[#{ EscapedChars.map {|c| "\\x%02x" % c.ord }.join }] (?# ^ + char, such as ^M, ^H, ^[, ...)
)
| (?<width_3>^\u{2E3B}) (?# THREE-EM DASH)
| (?<width_0>^\p{M})
| (?<width_2_2>
#{ EastAsianWidth::TYPE_F }
| #{ EastAsianWidth::TYPE_W }
)
| (?<width_1>
#{ EastAsianWidth::TYPE_H }
| #{ EastAsianWidth::TYPE_NA }
| #{ EastAsianWidth::TYPE_N }
)(?!#{ HalfwidthDakutenHandakuten })
| (?<width_2_3>
(?: #{ EastAsianWidth::TYPE_H }
| #{ EastAsianWidth::TYPE_NA }
| #{ EastAsianWidth::TYPE_N })
#{ HalfwidthDakutenHandakuten }
)
| (?<ambiguous_width>
#{EastAsianWidth::TYPE_A}
)
/x

def self.get_mbchar_width(mbchar)
ord = mbchar.ord
if (0x00 <= ord and ord <= 0x1F) # in EscapedPairs
if ord <= 0x1F # in EscapedPairs
return 2
elsif (0x20 <= ord and ord <= 0x7E) # printable ASCII chars
elsif ord <= 0x7E # printable ASCII chars
return 1
end
m = mbchar.encode(Encoding::UTF_8).match(MBCharWidthRE)
case
when m.nil? then 1 # TODO should be U+FFFD � REPLACEMENT CHARACTER
when m[:width_2_1], m[:width_2_2], m[:width_2_3] then 2
when m[:width_3] then 3
when m[:width_0] then 0
when m[:width_1] then 1
when m[:ambiguous_width] then Reline.ambiguous_width
utf8_mbchar = mbchar.encode(Encoding::UTF_8)
ord = utf8_mbchar.ord
chunk_index = EastAsianWidth::CHUNK_LAST.bsearch_index { |o| ord <= o }
size = EastAsianWidth::CHUNK_WIDTH[chunk_index]
if size == -1
Reline.ambiguous_width
elsif size == 1 && utf8_mbchar.size >= 2
second_char_ord = utf8_mbchar[1].ord
# Halfwidth Dakuten Handakuten
# Only these two character has Letter Modifier category and can be combined in a single grapheme cluster
(second_char_ord == 0xFF9E || second_char_ord == 0xFF9F) ? 2 : 1
Comment on lines +72 to +76
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 This is a special case specifically for when there is a character with width 1 before U+FF9E or U+FF9F. In other words, it allows the width to be calculated correctly for something like ‘ガ’. However, it cannot calculate the width correctly for ‘が’.
There are many other combining characters, and I’m not sure why this character alone is being handled, but there doesn’t seem to be a particular reason to remove it.

else
nil
size
end
end

Expand Down
Loading
Loading