Improve scan of regexp handling#1572
Conversation
057b0bb to
604a9d6
Compare
|
🚀 Preview deployment available at: https://b15e5470.rdoc-6cd.pages.dev (commit: 5561522) |
| # If the start positions are the same, prefer the one with higher priority (registered earlier one) | ||
| matched.sort_by! {|beg_pos, _, _, _, priority| [beg_pos, priority] } | ||
|
|
||
| chars = text.chars |
There was a problem hiding this comment.
What's the difference between char[pos...beg_pos] and text[pos...beg_pos]? That is, do we need to create this character array and join the characters to get the substring later?
There was a problem hiding this comment.
Updated to text[pos...beg_pos]. There's no reason to use char array.
Here's the wrong reason that I used chars[range]:
Worst case, extremely long multibyte string, text[range] may not be performant.
s = 'a'*1000000+'α'
byte_range = range = 1000000-20..1000000-10
10000.times{s[range]} #=> 1.615413s
10000.times{chars[range].join} #=> 0.005913s
10000.times{s.byteslice(byte_range)} #=> 0.002450sBut... MatchData#begin was already slow in this case.
s.scan(/a/){Regexp.last_match} #=> 0.220335s
s.scan(/a/){Regexp.last_match.byteoffset 0}; #=> 0.252146s, requires ruby >= 3.2
s.scan(/a/){Regexp.last_match.begin 0}; #=> 18.810536sIn normal case: relatively short ascii-only text, it won't be a bottleneck.
604a9d6 to
ddc4b67
Compare
ddc4b67 to
441de5d
Compare
There was a problem hiding this comment.
Pull request overview
This PR optimizes RDoc::Markup::Formatter#apply_regexp_handling by changing it from an incremental “find next earliest match” loop into a two-phase approach that scans each registered regexp once, collects match locations, then replays matches in order to build the final [text, converted?] segments.
Changes:
- Scan
textonce per regexp handling, collecting match ranges and metadata. - Sort collected matches by start position (and registration priority) before emitting output segments.
- Emit handled/unhandled segments in a single forward pass through the sorted match list.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @markup.regexp_handlings.each_with_index do |(pattern, name), priority| | ||
| text.scan(pattern) do | ||
| m = Regexp.last_match | ||
| idx = m[1] ? 1 : 0 | ||
| if m.begin(idx) < pos | ||
| pos = m.begin(idx) | ||
| matched_text = m[idx] | ||
| matched_name = name | ||
| end | ||
| end | ||
| output << [text[start...pos], false] if pos > start | ||
| if matched_name | ||
| handled = public_send(:"handle_regexp_#{matched_name}", matched_text) | ||
| output << [handled, true] | ||
| start = pos + matched_text.size | ||
| else | ||
| start = pos | ||
| matched << [m.begin(idx), m.end(idx), m[idx], name, priority] | ||
| end |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
lib/rdoc/markup/formatter.rb:95
- Using
text.scan(pattern)enumerates non-overlapping matches for each regexp. The previous implementation usedpattern.match(text, start)and could still find a later match that starts inside a previously found match for the samepatternif another regexp handling advancedstartinto that region (e.g., when two handlings begin at the same position and the earlier-registered one consumes fewer characters). Withscan, those overlapping candidates are never collected, which can change output for some regexp handling patterns (including user-added ones, perRDoc::Markup#add_regexp_handlingdocs). Consider generating matches with amatchloop (advancing by the capture end you actually consume) or otherwise ensuring overlapping matches are discoverable when the globalposadvances into a previously-matched region.
@markup.regexp_handlings.each_with_index do |(pattern, name), priority|
text.scan(pattern) do
m = Regexp.last_match
idx = m[1] ? 1 : 0
matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
end
end
lib/rdoc/markup/formatter.rb:109
- This implementation collects all matches for all regexp handlings into
matchedand then sorts them. This increases peak memory (O(total_matches)) and adds an O(total_matches log total_matches) sort, which can be expensive for inputs with many small matches (e.g., quote handling matching every'/"`" character). If the goal is performance, consider a streaming merge (e.g., maintain the next match for each pattern and repeatedly pick the leftmost using a small priority queue) so you don’t have to materialize and sort the full match list.
matched = []
@markup.regexp_handlings.each_with_index do |(pattern, name), priority|
text.scan(pattern) do
m = Regexp.last_match
idx = m[1] ? 1 : 0
matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
end
end
# If the start positions are the same, prefer the earlier-registered one
# (lower numeric priority from each_with_index).
matched.sort_by! {|beg_pos, _, _, _, priority| [beg_pos, priority] }
pos = 0
output = []
matched.each do |beg_pos, end_pos, s, name|
next if beg_pos < pos
output << [text[pos...beg_pos], false] if beg_pos != pos
handled = public_send(:"handle_regexp_#{name}", s)
output << [handled, true]
pos = end_pos
end
| def apply_regexp_handling(text) | ||
| output = [] | ||
| start = 0 | ||
| loop do | ||
| pos = text.size | ||
| matched_name = matched_text = nil | ||
| @markup.regexp_handlings.each do |pattern, name| | ||
| m = text.match(pattern, start) | ||
| next unless m | ||
| matched = [] | ||
| @markup.regexp_handlings.each_with_index do |(pattern, name), priority| | ||
| text.scan(pattern) do | ||
| m = Regexp.last_match | ||
| idx = m[1] ? 1 : 0 | ||
| if m.begin(idx) < pos | ||
| pos = m.begin(idx) | ||
| matched_text = m[idx] | ||
| matched_name = name | ||
| end | ||
| end | ||
| output << [text[start...pos], false] if pos > start | ||
| if matched_name | ||
| handled = public_send(:"handle_regexp_#{matched_name}", matched_text) | ||
| output << [handled, true] | ||
| start = pos + matched_text.size | ||
| else | ||
| start = pos | ||
| matched << [m.begin(idx), m.end(idx), m[idx], name, priority] | ||
| end | ||
| break if pos == text.size | ||
| end | ||
| # If the start positions are the same, prefer the earlier-registered one | ||
| # (lower numeric priority from each_with_index). | ||
| matched.sort_by! {|beg_pos, _, _, _, priority| [beg_pos, priority] } | ||
|
|
Scanning regexp handling that I implemented in #1559 was not performant.
string.scanfor each regexp handling once