
Conversation

@markurtz
Collaborator

Summary

Fix the synthetic dataset so that it preserves randomness across benchmarks run in the same session, by enforcing set_epoch across the data loader and iterator chain. The bug occurred because a new iterator was created from the DataLoader for each benchmark. The fix preserves epoch information and passes it down to the datasets so they can use it if needed; in the case of SyntheticTextGeneration, it is used to increment a random seed.
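As a rough illustration of the idea described above (not the code in this PR), the sketch below shows a loader that counts how many iterators it has handed out and forwards that count to any dataset exposing `set_epoch`. The class names `SeededDataset`, `StaticDataset`, and `EpochAwareLoader` are hypothetical and use only the standard library; the real guidellm classes likely differ.

```python
import random
from typing import Any, Iterator, List, Tuple


class SeededDataset:
    """Hypothetical dataset whose samples depend on (seed, epoch)."""

    def __init__(self, seed: int, size: int = 3) -> None:
        self.seed = seed
        self.size = size
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch

    def __iter__(self) -> Iterator[int]:
        # Capture the RNG eagerly so the epoch in effect when the
        # iterator is created decides the sequence.
        rng = random.Random(self.seed + self.epoch)
        return (rng.randrange(1_000_000) for _ in range(self.size))


class StaticDataset:
    """Hypothetical dataset with no epoch support; it is left untouched."""

    def __init__(self, items: List[Any]) -> None:
        self.items = list(items)

    def __iter__(self) -> Iterator[Any]:
        return iter(self.items)


class EpochAwareLoader:
    """Hypothetical loader that bumps the epoch for every new iterator."""

    def __init__(self, datasets: List[Any]) -> None:
        self.datasets = datasets
        self.epoch = 0

    def __iter__(self) -> Iterator[Tuple[Any, ...]]:
        # Propagate the current epoch only to datasets that opt in.
        for dataset in self.datasets:
            if hasattr(dataset, "set_epoch"):
                dataset.set_epoch(self.epoch)
        self.epoch += 1
        return zip(*self.datasets)


loader = EpochAwareLoader([SeededDataset(seed=42), StaticDataset(["a", "b", "c"])])
first_pass = list(loader)   # simulates the first benchmark
second_pass = list(loader)  # simulates the next benchmark in the same session
```

Without the `set_epoch` call, both passes would reseed with the same value and replay identical samples; with it, the seeded column changes between passes while a fresh loader built with the same seed still reproduces the first pass.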

Details

Test Plan

Automation tests

Related Issues


  • [X] "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • [X] Includes AI-assisted code completion
  • [ ] Includes code generated by an AI application
  • [ ] Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

… run in the same session by enforcing set_epoch across the data loader and iterators chain

Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
Contributor

Copilot AI left a comment


Pull Request Overview

This PR fixes synthetic dataset randomness across benchmark runs by implementing epoch tracking through the data loading chain. The key change is that SyntheticTextGenerator has been refactored into SyntheticTextDataset with a proper _SyntheticTextExamplesIterable that increments random seeds based on iteration count, ensuring different data is generated for each benchmark iteration.

Key changes:

  • Introduced epoch tracking in DataLoader and DatasetsIterator classes
  • Refactored SyntheticTextGenerator into SyntheticTextDataset extending IterableDataset with _SyntheticTextExamplesIterable for proper epoch/iteration support
  • Updated all test references from SyntheticTextGenerator to SyntheticTextDataset
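The per-iteration seeding mentioned in the key changes can be sketched roughly as follows. This is a minimal standard-library stand-in, not the actual `_SyntheticTextExamplesIterable`: the class name `CountingExamplesIterable`, its fields, and the `(key, example)` shape of the yielded items are assumptions made for illustration.

```python
import random
from typing import Any, Dict, Iterator, Tuple


class CountingExamplesIterable:
    """Hypothetical examples iterable that reseeds on every iteration."""

    def __init__(self, base_seed: int, num_examples: int) -> None:
        self.base_seed = base_seed
        self.num_examples = num_examples
        self.iteration_count = 0

    def __iter__(self) -> Iterator[Tuple[int, Dict[str, Any]]]:
        # Derive this pass's seed from the base seed plus how many
        # times the iterable has already been iterated.
        seed = self.base_seed + self.iteration_count
        self.iteration_count += 1
        rng = random.Random(seed)
        return (
            (index, {"seed": seed, "value": rng.random()})
            for index in range(self.num_examples)
        )


examples = CountingExamplesIterable(base_seed=7, num_examples=4)
pass_one = list(examples)
pass_two = list(examples)
```

Each pass stays deterministic for a given base seed and iteration count, so runs remain reproducible even though consecutive passes no longer repeat the same data.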

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/guidellm/data/loaders.py | Added epoch tracking to DatasetsIterator and DataLoader, with propagation of epoch to datasets via set_epoch |
| src/guidellm/data/deserializers/synthetic.py | Refactored generator into SyntheticTextDataset with _SyntheticTextExamplesIterable that increments random seed per iteration |
| tests/unit/data/deserializers/test_synthetic.py | Updated all test references from SyntheticTextGenerator to SyntheticTextDataset |
| src/guidellm/data/deserializers/__init__.py | Updated exports to replace SyntheticTextGenerator with SyntheticTextDataset |

Comments suppressed due to low confidence (1)

tests/unit/data/deserializers/test_synthetic.py:314

  • The test is accessing the private method _create_prompt on SyntheticTextDataset, but this method is actually defined in _SyntheticTextExamplesIterable, not on SyntheticTextDataset. This test will fail because SyntheticTextDataset doesn't have a _create_prompt method. The method needs to be exposed on SyntheticTextDataset or the test needs to access it via _ex_iterable.
        result = generator._create_prompt(5, faker, "unique_prefix ")
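If the helper does live only on the inner iterable, as this comment suggests, the second option it mentions could look roughly like the sketch below. The classes `_InnerIterable` and `OuterDataset` are hypothetical stand-ins, and the simplified `_create_prompt` signature here (no faker argument) is an assumption rather than the real method.

```python
class _InnerIterable:
    """Hypothetical inner iterable that owns the prompt helper."""

    def _create_prompt(self, token_count: int, prefix: str = "") -> str:
        # Placeholder prompt: a prefix followed by numbered filler words.
        words = [f"w{i}" for i in range(token_count)]
        return prefix + " ".join(words)


class OuterDataset:
    """Hypothetical dataset wrapper that holds the iterable as _ex_iterable."""

    def __init__(self) -> None:
        self._ex_iterable = _InnerIterable()


dataset = OuterDataset()

# The helper is not defined on the wrapper, so calling it directly
# would raise AttributeError.
has_direct_helper = hasattr(dataset, "_create_prompt")

# Reaching through the wrapped iterable works instead.
result = dataset._ex_iterable._create_prompt(5, "unique_prefix ")
```

Reaching into `_ex_iterable` couples the test to a private attribute, so exposing a thin forwarding method on the dataset may be the more stable choice if the helper needs to be tested directly.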


markurtz and others added 2 commits November 14, 2025 16:30
Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
Collaborator

@sjmonson sjmonson left a comment


Minor clarification question but otherwise LGTM. Tested working.

@markurtz markurtz merged commit f6175cd into main Nov 14, 2025
18 checks passed
@markurtz markurtz deleted the bugs/fix-repeat-requests branch November 14, 2025 21:55

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Request repetition across sweep rounds leads to 100% prefix-hit ratio (possible bug)

3 participants