Ensure synthetic text datasets remain random across benchmarks #463

markurtz · 2025-11-14T21:16:38Z

Summary

Fixes the synthetic dataset so that it stays random across benchmarks run in the same session by enforcing set_epoch across the data loader and iterator chain. Requests were repeating because a new iterator was created from the DataLoader for each benchmark, so the synthetic generator replayed the same data each time. The fix preserves epoch information across datasets so they can use it if needed; SyntheticTextGeneration uses it to increment its random seed.
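The failure mode and the seed-increment idea can be illustrated with a minimal sketch. It assumes a generator seeded from a fixed base seed; `SeededPrompts` and its fields are hypothetical names for illustration, not guidellm's actual classes, and only the `set_epoch` name comes from this PR.

```python
import random


class SeededPrompts:
    """Hypothetical stand-in for a seeded synthetic dataset (not guidellm code)."""

    def __init__(self, seed: int = 42, size: int = 3) -> None:
        self.seed = seed
        self.size = size
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Record which pass over the data is about to start.
        self.epoch = epoch

    def __iter__(self):
        # Offsetting the base seed by the epoch gives each pass its own
        # stream while keeping every pass reproducible.
        rng = random.Random(self.seed + self.epoch)
        for _ in range(self.size):
            yield rng.randint(0, 10**9)


prompts = SeededPrompts()

# Two fresh iterators at the same epoch replay the same values,
# which is the repetition this PR addresses.
same_epoch_a = list(prompts)
same_epoch_b = list(prompts)

# Advancing the epoch before the next pass changes the stream.
prompts.set_epoch(1)
next_epoch = list(prompts)
```

Deriving the seed from `seed + epoch` rather than reseeding from the clock keeps a whole session reproducible from one base seed, which is presumably why an epoch counter is used here instead of fresh entropy.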

Details

Test Plan

Automation tests

Related Issues

Resolves "Request repetition across sweep rounds leads to 100% prefix-hit ratio (possible bug)" #460

[x] "I certify that all code in this PR is my own, except as noted below."

Use of AI

[x] Includes AI-assisted code completion
[ ] Includes code generated by an AI application
[ ] Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)


Copilot

Pull Request Overview

This PR fixes synthetic dataset randomness across benchmark runs by implementing epoch tracking through the data loading chain. The key change is that SyntheticTextGenerator has been refactored into SyntheticTextDataset with a proper _SyntheticTextExamplesIterable that increments random seeds based on iteration count, ensuring different data is generated for each benchmark iteration.

Key changes:

- Introduced epoch tracking in the DataLoader and DatasetsIterator classes
- Refactored SyntheticTextGenerator into SyntheticTextDataset, which extends IterableDataset and uses _SyntheticTextExamplesIterable for proper epoch/iteration support
- Updated all test references from SyntheticTextGenerator to SyntheticTextDataset
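The epoch propagation described in these changes can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in `loaders.py`: `StaticRows`, `EpochAwareRows`, `ChainIterator`, and `EpochLoader` are hypothetical names, and the only detail taken from the PR is that the loader tracks an epoch and forwards it through `set_epoch` to datasets that support it.

```python
from typing import Any, Iterable, Iterator


class StaticRows:
    """Hypothetical dataset with fixed contents and no set_epoch method."""

    def __init__(self, rows: list[str]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[str]:
        return iter(self.rows)


class EpochAwareRows:
    """Hypothetical dataset whose output depends on the current epoch."""

    def __init__(self) -> None:
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch

    def __iter__(self) -> Iterator[str]:
        return iter([f"epoch-{self.epoch}-row-{i}" for i in range(2)])


class ChainIterator:
    """Hypothetical iterator layer: forwards the epoch, then zips datasets."""

    def __init__(self, datasets: list[Iterable[Any]]) -> None:
        self.datasets = datasets

    def set_epoch(self, epoch: int) -> None:
        for dataset in self.datasets:
            # Only datasets that opt in receive the epoch.
            if hasattr(dataset, "set_epoch"):
                dataset.set_epoch(epoch)

    def __iter__(self) -> Iterator[tuple[Any, ...]]:
        return zip(*self.datasets)


class EpochLoader:
    """Hypothetical loader: one epoch per iterator it hands out."""

    def __init__(self, source: ChainIterator) -> None:
        self.source = source
        self.epoch = 0

    def __iter__(self) -> Iterator[tuple[Any, ...]]:
        # Push the current epoch down the chain before iteration starts.
        self.source.set_epoch(self.epoch)
        self.epoch += 1
        return iter(self.source)


loader = EpochLoader(ChainIterator([StaticRows(["a", "b"]), EpochAwareRows()]))
first_benchmark = list(loader)
second_benchmark = list(loader)
```

In this sketch the static dataset yields the same rows on both passes while the epoch-aware one changes, mirroring the idea that datasets use the epoch only if they need it.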

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `src/guidellm/data/loaders.py` | Added epoch tracking to `DatasetsIterator` and `DataLoader`, with propagation of epoch to datasets via `set_epoch` |
| `src/guidellm/data/deserializers/synthetic.py` | Refactored generator into `SyntheticTextDataset` with `_SyntheticTextExamplesIterable` that increments random seed per iteration |
| `tests/unit/data/deserializers/test_synthetic.py` | Updated all test references from `SyntheticTextGenerator` to `SyntheticTextDataset` |
| `src/guidellm/data/deserializers/__init__.py` | Updated exports to replace `SyntheticTextGenerator` with `SyntheticTextDataset` |

Comments suppressed due to low confidence (1)

tests/unit/data/deserializers/test_synthetic.py:314

The test is accessing the private method _create_prompt on SyntheticTextDataset, but this method is actually defined in _SyntheticTextExamplesIterable, not on SyntheticTextDataset. This test will fail because SyntheticTextDataset doesn't have a _create_prompt method. The method needs to be exposed on SyntheticTextDataset or the test needs to access it via _ex_iterable.

        result = generator._create_prompt(5, faker, "unique_prefix ")
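The second option in this comment, reaching the helper through `_ex_iterable`, might look like the sketch below. The classes are hypothetical stand-ins rather than the real `SyntheticTextDataset`, and `_create_prompt` is given a simplified signature (no `faker` argument) and a placeholder body purely for illustration.

```python
class _InnerIterable:
    """Hypothetical inner iterable that owns the private helper."""

    def _create_prompt(self, length: int, prefix: str = "") -> str:
        # Illustrative body only: pad the prefix with placeholder tokens.
        return prefix + " ".join(["tok"] * length)


class OuterDataset:
    """Hypothetical wrapper that stores the iterable on `_ex_iterable`."""

    def __init__(self) -> None:
        self._ex_iterable = _InnerIterable()


dataset = OuterDataset()

# Calling the helper on the wrapper fails because it is not defined there.
try:
    dataset._create_prompt(5, "unique_prefix ")
    wrapper_has_helper = True
except AttributeError:
    wrapper_has_helper = False

# Reaching through the wrapped iterable works.
result = dataset._ex_iterable._create_prompt(5, "unique_prefix ")
```

Accessing a private attribute from a test couples the test to internal layout, so exposing a thin delegating method on the wrapper is the other reasonable choice the comment mentions.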


src/guidellm/data/loaders.py


sjmonson

Minor clarification question but otherwise LGTM. Tested working.

src/guidellm/data/loaders.py

Fix synthetic dataset so it can preserve randomness across benchmarks run in the same session by enforcing set_epoch across the data loader and iterators chain

c073e6e

Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>

markurtz requested review from Copilot, jaredoconnell and sjmonson November 14, 2025 21:16

Copilot started reviewing on behalf of markurtz November 14, 2025 21:17

Copilot finished reviewing on behalf of markurtz November 14, 2025 21:19

Copilot AI reviewed Nov 14, 2025

src/guidellm/data/loaders.py (conversation resolved)

markurtz and others added 2 commits November 14, 2025 16:30

fix failing unit tests

6b5358b

Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>

Merge branch 'main' into bugs/fix-repeat-requests

fc9346f

sjmonson approved these changes Nov 14, 2025

src/guidellm/data/loaders.py (conversation resolved)

markurtz merged commit f6175cd into main Nov 14, 2025
18 checks passed

markurtz deleted the bugs/fix-repeat-requests branch November 14, 2025 21:55
