Improve scan of regexp handling by tompng · Pull Request #1572 · ruby/rdoc

tompng · 2026-01-19T19:24:32Z

Scanning regexp handling that I implemented in #1559 was not performant.

Run string.scan for each regexp handling once
Collect matched locations
Scan collected location from the beginning

matzbot · 2026-01-19T19:46:25Z

🚀 Preview deployment available at: https://b15e5470.rdoc-6cd.pages.dev (commit: 5561522)

st0012 · 2026-03-14T23:38:04Z

+    # If the start positions are the same, prefer the one with higher priority (registered earlier one)
+    matched.sort_by! {|beg_pos, _, _, _, priority| [beg_pos, priority] }
+
+    chars = text.chars


What's the difference between char[pos...beg_pos] and text[pos...beg_pos]? That is, do we need to create this character array and join the characters to get the substring later?

Updated to text[pos...beg_pos]. There's no reason to use char array.

Here's the wrong reason that I used chars[range]:
Worst case, extremely long multibyte string, text[range] may not be performant.

s = 'a'*1000000+'α' byte_range = range = 1000000-20..1000000-10 10000.times{s[range]} #=> 1.615413s 10000.times{chars[range].join} #=> 0.005913s 10000.times{s.byteslice(byte_range)} #=> 0.002450s

But... MatchData#begin was already slow in this case.

s.scan(/a/){Regexp.last_match} #=> 0.220335s s.scan(/a/){Regexp.last_match.byteoffset 0}; #=> 0.252146s, requires ruby >= 3.2 s.scan(/a/){Regexp.last_match.begin 0}; #=> 18.810536s

In normal case: relatively short ascii-only text, it won't be a bottleneck.

Copilot

Pull request overview

This PR optimizes RDoc::Markup::Formatter#apply_regexp_handling by changing it from an incremental “find next earliest match” loop into a two-phase approach that scans each registered regexp once, collects match locations, then replays matches in order to build the final [text, converted?] segments.

Changes:

Scan text once per regexp handling, collecting match ranges and metadata.
Sort collected matches by start position (and registration priority) before emitting output segments.
Emit handled/unhandled segments in a single forward pass through the sorted match list.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    @markup.regexp_handlings.each_with_index do |(pattern, name), priority|
+      text.scan(pattern) do
+        m = Regexp.last_match
        idx = m[1] ? 1 : 0
-        if m.begin(idx) < pos
-          pos = m.begin(idx)
-          matched_text = m[idx]
-          matched_name = name
-        end
-      end
-      output << [text[start...pos], false] if pos > start
-      if matched_name
-        handled = public_send(:"handle_regexp_#{matched_name}", matched_text)
-        output << [handled, true]
-        start = pos + matched_text.size
-      else
-        start = pos
+        matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
      end


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

lib/rdoc/markup/formatter.rb:95

Using text.scan(pattern) enumerates non-overlapping matches for each regexp. The previous implementation used pattern.match(text, start) and could still find a later match that starts inside a previously found match for the same pattern if another regexp handling advanced start into that region (e.g., when two handlings begin at the same position and the earlier-registered one consumes fewer characters). With scan, those overlapping candidates are never collected, which can change output for some regexp handling patterns (including user-added ones, per RDoc::Markup#add_regexp_handling docs). Consider generating matches with a match loop (advancing by the capture end you actually consume) or otherwise ensuring overlapping matches are discoverable when the global pos advances into a previously-matched region.

    @markup.regexp_handlings.each_with_index do |(pattern, name), priority|
      text.scan(pattern) do
        m = Regexp.last_match
        idx = m[1] ? 1 : 0
        matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
      end
    end

lib/rdoc/markup/formatter.rb:109

This implementation collects all matches for all regexp handlings into matched and then sorts them. This increases peak memory (O(total_matches)) and adds an O(total_matches log total_matches) sort, which can be expensive for inputs with many small matches (e.g., quote handling matching every '/"`" character). If the goal is performance, consider a streaming merge (e.g., maintain the next match for each pattern and repeatedly pick the leftmost using a small priority queue) so you don’t have to materialize and sort the full match list.

    matched = []
    @markup.regexp_handlings.each_with_index do |(pattern, name), priority|
      text.scan(pattern) do
        m = Regexp.last_match
        idx = m[1] ? 1 : 0
        matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
      end
    end
    # If the start positions are the same, prefer the earlier-registered one
    # (lower numeric priority from each_with_index).
    matched.sort_by! {|beg_pos, _, _, _, priority| [beg_pos, priority] }

    pos = 0
    output = []
    matched.each do |beg_pos, end_pos, s, name|
      next if beg_pos < pos

      output << [text[pos...beg_pos], false] if beg_pos != pos
      handled = public_send(:"handle_regexp_#{name}", s)
      output << [handled, true]
      pos = end_pos
    end

  def apply_regexp_handling(text)
-    output = []
-    start = 0
-    loop do
-      pos = text.size
-      matched_name = matched_text = nil
-      @markup.regexp_handlings.each do |pattern, name|
-        m = text.match(pattern, start)
-        next unless m
+    matched = []
+    @markup.regexp_handlings.each_with_index do |(pattern, name), priority|
+      text.scan(pattern) do
+        m = Regexp.last_match
        idx = m[1] ? 1 : 0
-        if m.begin(idx) < pos
-          pos = m.begin(idx)
-          matched_text = m[idx]
-          matched_name = name
-        end
-      end
-      output << [text[start...pos], false] if pos > start
-      if matched_name
-        handled = public_send(:"handle_regexp_#{matched_name}", matched_text)
-        output << [handled, true]
-        start = pos + matched_text.size
-      else
-        start = pos
+        matched << [m.begin(idx), m.end(idx), m[idx], name, priority]
      end
-      break if pos == text.size
    end
+    # If the start positions are the same, prefer the earlier-registered one
+    # (lower numeric priority from each_with_index).
+    matched.sort_by! {|beg_pos, _, _, _, priority| [beg_pos, priority] }
+


tompng had a problem deploying to fork-preview-protection January 19, 2026 19:24 — with GitHub Actions Failure

tompng force-pushed the regexp_handling_scan branch from 057b0bb to 604a9d6 Compare January 19, 2026 19:43

tompng temporarily deployed to fork-preview-protection January 19, 2026 19:43 — with GitHub Actions Inactive

st0012 reviewed Mar 14, 2026

View reviewed changes

tompng force-pushed the regexp_handling_scan branch from 604a9d6 to ddc4b67 Compare March 15, 2026 19:47

tompng temporarily deployed to fork-preview-protection March 15, 2026 19:47 — with GitHub Actions Inactive

st0012 approved these changes May 13, 2026

View reviewed changes

Improve scan of regexp handling

441de5d

Copilot AI review requested due to automatic review settings May 13, 2026 05:39

tompng force-pushed the regexp_handling_scan branch from ddc4b67 to 441de5d Compare May 13, 2026 05:39

tompng requested a deployment to fork-preview-protection May 13, 2026 05:39 — with GitHub Actions Waiting

Copilot started reviewing on behalf of tompng May 13, 2026 05:40 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

Potential fix for pull request finding

5561522

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 13, 2026 05:50

tompng temporarily deployed to fork-preview-protection May 13, 2026 05:50 — with GitHub Actions Inactive

Copilot started reviewing on behalf of tompng May 13, 2026 05:51 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

tompng merged commit 1f93543 into ruby:master May 13, 2026
34 checks passed

tompng deleted the regexp_handling_scan branch May 13, 2026 06:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve scan of regexp handling#1572

Improve scan of regexp handling#1572
tompng merged 2 commits into
ruby:masterfrom
tompng:regexp_handling_scan

tompng commented Jan 19, 2026

Uh oh!

matzbot commented Jan 19, 2026 •

edited

Loading

Uh oh!

st0012 Mar 14, 2026

Uh oh!

tompng Mar 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tompng commented Jan 19, 2026

Uh oh!

matzbot commented Jan 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

st0012 Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

tompng Mar 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

matzbot commented Jan 19, 2026 •

edited

Loading