[TST] More benchmark queries for regex #4910

Sicheng-Pan · 2025-06-21T00:12:18Z

Description of changes

Summarize the changes made by this PR.

Improvements & Bug fixes
- This PR adds more regex patterns to the benchmark. The benchmark also serves as an integration test for regex, since it compares the results against brute-force evaluation.
- Updates a few dependencies. Verified that the updates should not introduce breaking changes.
- Updates some wal3 tests because the fragment size changed after the dependency update. Existing fragments should remain compatible, and existing manifests should still be valid.
New functionality
- N/A
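
To illustrate the "compare against brute force" idea from the first bullet, here is a minimal, std-only sketch. It is not the code in this PR: `TrigramIndex`, `brute_force`, and the sample documents are hypothetical names chosen for illustration, and it checks literal substring patterns rather than full regexes so that it needs no external crates. The point is only the shape of the check: run each pattern through the indexed path and through a full scan, then assert that the two result sets are identical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Brute-force reference: scan every document for the literal pattern.
fn brute_force(docs: &[&str], pattern: &str) -> BTreeSet<usize> {
    docs.iter()
        .enumerate()
        .filter(|(_, doc)| doc.contains(pattern))
        .map(|(id, _)| id)
        .collect()
}

/// Hypothetical stand-in for the indexed path: a byte-trigram inverted index.
struct TrigramIndex<'a> {
    docs: &'a [&'a str],
    postings: HashMap<[u8; 3], BTreeSet<usize>>,
}

impl<'a> TrigramIndex<'a> {
    fn build(docs: &'a [&'a str]) -> Self {
        let mut postings: HashMap<[u8; 3], BTreeSet<usize>> = HashMap::new();
        for (id, doc) in docs.iter().enumerate() {
            for w in doc.as_bytes().windows(3) {
                postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
            }
        }
        Self { docs, postings }
    }

    fn search(&self, pattern: &str) -> BTreeSet<usize> {
        let bytes = pattern.as_bytes();
        // Patterns shorter than one trigram cannot use the index; fall back to a scan.
        if bytes.len() < 3 {
            return brute_force(self.docs, pattern);
        }
        // Intersect the posting lists of every trigram in the pattern.
        let mut candidates: Option<BTreeSet<usize>> = None;
        for w in bytes.windows(3) {
            let posting = match self.postings.get(&[w[0], w[1], w[2]]) {
                Some(p) => p,
                None => return BTreeSet::new(),
            };
            candidates = Some(match candidates {
                None => posting.clone(),
                Some(c) => c.intersection(posting).copied().collect(),
            });
        }
        // Candidates may be false positives, so verify each one against the text.
        candidates
            .unwrap_or_default()
            .into_iter()
            .filter(|&id| self.docs[id].contains(pattern))
            .collect()
    }
}

fn main() {
    let docs = ["fn main() {}", "impl Display for Foo", "let x = foo();", "struct Foo;"];
    let index = TrigramIndex::build(&docs);
    for pattern in ["Foo", "fn ", "()", "missing"] {
        // The integration-style check: indexed results must equal the full scan.
        assert_eq!(
            index.search(pattern),
            brute_force(&docs, pattern),
            "mismatch for {pattern:?}"
        );
    }
}
```

Running the same assertion over every benchmark pattern is what lets a benchmark double as a correctness check: any divergence between the fast path and the scan fails loudly instead of silently skewing the timings.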

Test plan

How are these changes tested?

Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

github-actions · 2025-06-21T00:12:32Z

Sicheng-Pan · 2025-06-21T00:12:35Z

[TST] More benchmark queries for regex #4910
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

propel-code-bot · 2025-06-21T00:13:14Z

Expand Regex Benchmark Coverage and Update Dependencies

This PR significantly expands the regex and full-text search benchmark queries, using the bigcode/the-stack-dedup Rust dataset for more comprehensive and realistic benchmarks. It also updates a set of core dependencies (notably Arrow and Parquet to 55.1, with lockfile and cargo file adjustments), and adapts affected k8s WAL integration tests to new fragment sizes and manifest expectations following Arrow/Parquet upgrades. Additionally, a new dataset runner for the Rust dataset is implemented for more robust benchmarking.

Key Changes

• Substantially broadened regex and literal search benchmark patterns for more realistic evaluation, especially in rust/index/benches/literal.rs and rust/worker/benches/regex.rs
• Replaced homegrown Rust code dataset with streaming from HuggingFace bigcode/the-stack-dedup, implementing rust/benchmark/src/datasets/rust.rs
• Updated Arrow and Parquet workspace dependencies from 52.2.0/52 to 55.1, adjusted Cargo.toml and Cargo.lock accordingly
• Refactored benchmark code to use asynchronous, batched Parquet readers and a streaming approach; multiple bugfixes and cleanup for dataset utilities
• Adapted WAL integration tests (k8s) to account for different fragment sizes and stat expectations after Arrow/Parquet changes

Affected Areas

• Benchmarks: regex.rs, literal.rs
• Benchmark dataset utils: datasets/rust.rs, datasets/util.rs, Cargo.toml
• Dependency management: Arrow/Parquet in workspace
• Integration/test fixtures: wal3/tests for k8s, manifest checks
• Benchmark dataset registry: datasets/mod.rs

This summary was automatically generated by @propel-code-bot

sanketkedia

Discussed offline: we will babysit staging after this lands, to be safe that the Arrow version bump does not break old data.

Sicheng-Pan · 2025-07-07T23:24:52Z

Merge activity

Jul 7, 11:24 PM UTC: @Sicheng-Pan merged this pull request with Graphite.

Sicheng-Pan marked this pull request as ready for review June 21, 2025 00:12

Sicheng-Pan requested a review from sanketkedia June 21, 2025 00:13

Sicheng-Pan added 3 commits June 20, 2025 18:59

[TST] More benchmark queries for regex

7c530ff

Fix cargo lock

ddacc38

Fix lint

99c96f1

Sicheng-Pan force-pushed the sicheng/06-20-more-regex-bench branch from 4d57d3b to 99c96f1 Compare June 21, 2025 01:59

Update k8s tests for updated parquet dep

b967f30

sanketkedia approved these changes Jul 3, 2025

Merge branch 'main' into sicheng/06-20-more-regex-bench

b76ad4a

Sicheng-Pan merged commit 9bda3dc into main Jul 7, 2025
109 of 114 checks passed

Sicheng-Pan deleted the sicheng/06-20-more-regex-bench branch July 7, 2025 23:25

github-actions bot commented Jun 21, 2025

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality
