Improve scan of regexp handling #1572

Merged
tompng merged 2 commits into ruby:master from tompng:regexp_handling_scan
May 13, 2026

Conversation

@tompng
Member

@tompng tompng commented Jan 19, 2026

The regexp handling scan that I implemented in #1559 was not performant. This PR changes it to:

  1. Run string.scan once for each regexp handling
  2. Collect the matched locations
  3. Scan the collected locations from the beginning
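
The three steps above can be sketched as a standalone script. This is only an illustration under stated assumptions: `HANDLINGS`, `handle`, and `apply_handlings` are hypothetical stand-ins rather than RDoc's API, the two patterns are made up, and the trailing-text handling at the end is this sketch's own choice.

```ruby
# Hypothetical stand-ins for registered regexp handlings: [pattern, name].
HANDLINGS = [
  [/\bhttps?:\/\/\S+/, :link],
  [/\b[A-Z][a-z]+[A-Z]\w+/, :crossref],
].freeze

# Hypothetical handler: wraps the matched text so converted segments are visible.
def handle(name, s)
  "<#{name}:#{s}>"
end

def apply_handlings(text)
  # Steps 1 and 2: scan once per pattern and collect the match locations.
  matched = []
  HANDLINGS.each_with_index do |(pattern, name), priority|
    text.scan(pattern) do
      m = Regexp.last_match
      matched << [m.begin(0), m.end(0), m[0], name, priority]
    end
  end
  # Earlier start first; on ties, the earlier-registered handling wins.
  matched.sort_by! { |beg_pos, _, _, _, priority| [beg_pos, priority] }

  # Step 3: replay from the beginning, skipping any match that starts
  # inside a segment that was already consumed.
  pos = 0
  output = []
  matched.each do |beg_pos, end_pos, s, name|
    next if beg_pos < pos
    output << [text[pos...beg_pos], false] if beg_pos != pos
    output << [handle(name, s), true]
    pos = end_pos
  end
  # Assumption of this sketch: emit whatever text remains after the last match.
  output << [text[pos..], false] if pos < text.size
  output
end

RESULT = apply_handlings("see FooBar at http://example.com now")
```

Each pattern is scanned exactly once here, so the cost no longer depends on how many times the output loop restarts the search.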

@tompng tompng had a problem deploying to fork-preview-protection January 19, 2026 19:24 — with GitHub Actions Failure
@tompng tompng force-pushed the regexp_handling_scan branch from 057b0bb to 604a9d6 Compare January 19, 2026 19:43
@tompng tompng temporarily deployed to fork-preview-protection January 19, 2026 19:43 — with GitHub Actions Inactive
@matzbot
Collaborator

matzbot commented Jan 19, 2026

🚀 Preview deployment available at: https://b15e5470.rdoc-6cd.pages.dev (commit: 5561522)

Comment thread lib/rdoc/markup/formatter.rb Outdated
# If the start positions are the same, prefer the one with higher priority (registered earlier one)
matched.sort_by! {|beg_pos, _, _, _, priority| [beg_pos, priority] }

chars = text.chars
Member


What's the difference between chars[pos...beg_pos] and text[pos...beg_pos]? That is, do we need to create this character array and join the characters later to get the substring?

Member Author


Updated to text[pos...beg_pos]. There's no reason to use a char array.

Here's the mistaken reason I originally used chars[range]:
In the worst case, an extremely long multibyte string, text[range] may not be performant.

s = 'a'*1000000+'α'
chars = s.chars
byte_range = range = 1000000-20..1000000-10
10000.times{s[range]} #=> 1.615413s
10000.times{chars[range].join} #=> 0.005913s
10000.times{s.byteslice(byte_range)} #=> 0.002450s

But... MatchData#begin was already slow in this case.

s.scan(/a/){Regexp.last_match} #=> 0.220335s
s.scan(/a/){Regexp.last_match.byteoffset 0}; #=> 0.252146s, requires ruby >= 3.2
s.scan(/a/){Regexp.last_match.begin 0}; #=> 18.810536s

In the normal case (relatively short, ASCII-only text), it won't be a bottleneck.
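
To make the character-index versus byte-offset distinction above concrete, here is a small sketch on a short multibyte string. It assumes a UTF-8 source encoding, and it falls back to `pre_match.bytesize` on Rubies older than 3.2, where `MatchData#byteoffset` is not available. The sample string is made up for illustration.

```ruby
s = "αβγ-abc"          # each Greek letter is 2 bytes in UTF-8
m = s.match(/abc/)

# Character indexes, as returned by MatchData#begin and MatchData#end.
char_beg = m.begin(0)
char_end = m.end(0)

# Byte offsets of the same match.
if m.respond_to?(:byteoffset)   # Ruby >= 3.2
  byte_beg, byte_end = m.byteoffset(0)
else
  byte_beg = m.pre_match.bytesize
  byte_end = byte_beg + m[0].bytesize
end

# Both ways of slicing recover the same substring.
by_chars = s[char_beg...char_end]
by_bytes = s.byteslice(byte_beg, byte_end - byte_beg)
```

The two index pairs differ as soon as a multibyte character precedes the match, which is why character-based lookups have extra work to do on such strings.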

@tompng tompng force-pushed the regexp_handling_scan branch from 604a9d6 to ddc4b67 Compare March 15, 2026 19:47
@tompng tompng temporarily deployed to fork-preview-protection March 15, 2026 19:47 — with GitHub Actions Inactive
Copilot AI review requested due to automatic review settings May 13, 2026 05:39
@tompng tompng force-pushed the regexp_handling_scan branch from ddc4b67 to 441de5d Compare May 13, 2026 05:39
@tompng tompng requested a deployment to fork-preview-protection May 13, 2026 05:39 — with GitHub Actions Waiting
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes RDoc::Markup::Formatter#apply_regexp_handling by changing it from an incremental “find next earliest match” loop into a two-phase approach that scans each registered regexp once, collects match locations, then replays matches in order to build the final [text, converted?] segments.

Changes:

  • Scan text once per regexp handling, collecting match ranges and metadata.
  • Sort collected matches by start position (and registration priority) before emitting output segments.
  • Emit handled/unhandled segments in a single forward pass through the sorted match list.


Comment on lines +89 to 94
+  @markup.regexp_handlings.each_with_index do |(pattern, name), priority|
+    text.scan(pattern) do
+      m = Regexp.last_match
       idx = m[1] ? 1 : 0
-      if m.begin(idx) < pos
-        pos = m.begin(idx)
-        matched_text = m[idx]
-        matched_name = name
-      end
-    end
-    output << [text[start...pos], false] if pos > start
-    if matched_name
-      handled = public_send(:"handle_regexp_#{matched_name}", matched_text)
-      output << [handled, true]
-      start = pos + matched_text.size
-    else
-      start = pos
+      matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
     end
Comment thread lib/rdoc/markup/formatter.rb Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 13, 2026 05:50
@tompng tompng temporarily deployed to fork-preview-protection May 13, 2026 05:50 — with GitHub Actions Inactive
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

lib/rdoc/markup/formatter.rb:95

  • Using text.scan(pattern) enumerates non-overlapping matches for each regexp. The previous implementation used pattern.match(text, start) and could still find a later match that starts inside a previously found match for the same pattern if another regexp handling advanced start into that region (e.g., when two handlings begin at the same position and the earlier-registered one consumes fewer characters). With scan, those overlapping candidates are never collected, which can change output for some regexp handling patterns (including user-added ones, per RDoc::Markup#add_regexp_handling docs). Consider generating matches with a match loop (advancing by the capture end you actually consume) or otherwise ensuring overlapping matches are discoverable when the global pos advances into a previously-matched region.
    @markup.regexp_handlings.each_with_index do |(pattern, name), priority|
      text.scan(pattern) do
        m = Regexp.last_match
        idx = m[1] ? 1 : 0
        matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
      end
    end
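
The non-overlapping behaviour described in this comment can be reproduced in isolation. The sketch below is only an illustration: the text and pattern are made up, and the offset 2 stands for a hypothetical earlier-registered handling that consumed just the first two characters.

```ruby
text = "xabcd"
pattern = /\w{3}/

# String#scan resumes after the end of each match, so it reports "xab" at
# index 0 and then finds nothing in the remaining "cd".
scan_starts = []
text.scan(pattern) { scan_starts << Regexp.last_match.begin(0) }

# Regexp#match with a start offset can still find "bcd" at index 2, which
# lies inside the span that scan already consumed.
later = pattern.match(text, 2)
later_start = later && later.begin(0)
later_text = later && later[0]
```

With the scan-based collection, the candidate at index 2 never enters the match list, so it cannot be picked after the global position advances to 2.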

lib/rdoc/markup/formatter.rb:109

  • This implementation collects all matches for all regexp handlings into matched and then sorts them. This increases peak memory (O(total_matches)) and adds an O(total_matches log total_matches) sort, which can be expensive for inputs with many small matches (e.g., quote handling matching every ', ", or ` character). If the goal is performance, consider a streaming merge (e.g., maintain the next match for each pattern and repeatedly pick the leftmost using a small priority queue) so you don’t have to materialize and sort the full match list.
    matched = []
    @markup.regexp_handlings.each_with_index do |(pattern, name), priority|
      text.scan(pattern) do
        m = Regexp.last_match
        idx = m[1] ? 1 : 0
        matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
      end
    end
    # If the start positions are the same, prefer the earlier-registered one
    # (lower numeric priority from each_with_index).
    matched.sort_by! {|beg_pos, _, _, _, priority| [beg_pos, priority] }

    pos = 0
    output = []
    matched.each do |beg_pos, end_pos, s, name|
      next if beg_pos < pos

      output << [text[pos...beg_pos], false] if beg_pos != pos
      handled = public_send(:"handle_regexp_#{name}", s)
      output << [handled, true]
      pos = end_pos
    end
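
One possible shape of the streaming merge suggested in this comment is sketched below. It is not what the PR merged. `streaming_apply`, the inline handler string, and the sample handlings are all hypothetical. It uses a linear pick over the cached matches instead of a priority queue, on the assumption that the number of handlings is small, and it assumes no pattern matches the empty string.

```ruby
# handlings: array of [pattern, name], in registration order.
def streaming_apply(text, handlings)
  # Cache the next match of each pattern. A pattern is re-run only when its
  # cached match starts before the current position.
  nexts = handlings.map { |pattern, _name| pattern.match(text, 0) }
  pos = 0
  output = []
  loop do
    best = nil
    handlings.each_with_index do |(pattern, _name), i|
      m = nexts[i]
      m = nexts[i] = pattern.match(text, pos) if m && m.begin(0) < pos
      next unless m
      # Strict < keeps the earlier-registered handling on ties.
      best = i if best.nil? || m.begin(0) < nexts[best].begin(0)
    end
    break unless best

    m = nexts[best]
    output << [text[pos...m.begin(0)], false] if m.begin(0) > pos
    # Hypothetical handler: tag the matched text with the handling name.
    output << ["<#{handlings[best][1]}:#{m[0]}>", true]
    pos = m.end(0) # assumes a non-empty match, otherwise this loop would not advance
  end
  output << [text[pos..], false] if pos < text.size
  output
end

SAMPLE = streaming_apply("xabcd", [[/xa/, :a], [/\w{3}/, :b]])
```

Because stale matches are re-searched from the current position, this variant can still find a match that starts inside a region another handling only partly consumed, at the cost of extra `match` calls.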

Comment on lines 87 to +99
 def apply_regexp_handling(text)
-  output = []
-  start = 0
-  loop do
-    pos = text.size
-    matched_name = matched_text = nil
-    @markup.regexp_handlings.each do |pattern, name|
-      m = text.match(pattern, start)
-      next unless m
+  matched = []
+  @markup.regexp_handlings.each_with_index do |(pattern, name), priority|
+    text.scan(pattern) do
+      m = Regexp.last_match
       idx = m[1] ? 1 : 0
-      if m.begin(idx) < pos
-        pos = m.begin(idx)
-        matched_text = m[idx]
-        matched_name = name
-      end
-    end
-    output << [text[start...pos], false] if pos > start
-    if matched_name
-      handled = public_send(:"handle_regexp_#{matched_name}", matched_text)
-      output << [handled, true]
-      start = pos + matched_text.size
-    else
-      start = pos
+      matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
     end
-    break if pos == text.size
   end
+  # If the start positions are the same, prefer the earlier-registered one
+  # (lower numeric priority from each_with_index).
+  matched.sort_by! {|beg_pos, _, _, _, priority| [beg_pos, priority] }

@tompng tompng merged commit 1f93543 into ruby:master May 13, 2026
34 checks passed
@tompng tompng deleted the regexp_handling_scan branch May 13, 2026 06:34

4 participants