Data Loader tests hang when run in ASAN test job #66223

@mruberry

Current Status

ongoing

Error looks like

  test_typing (__main__.TestDataLoader) ... ok (0.002s)
  test_worker_init_fn (__main__.TestDataLoader) ... ok (0.066s)
  test_worker_seed (__main__.TestDataLoader) ... ok (0.149s)
  test_worker_seed_reproducibility (__main__.TestDataLoader) ... ok (0.292s)
  test_basics (__main__.TestDataLoader2) ... skip (0.001s)
  test_basic_threading (__main__.TestDataLoader2_EventLoop) ... skip (0.001s)
  test_batch_sampler (__main__.TestDataLoaderPersistentWorkers) ... ok (2.137s)
  test_builtin_collection_conversion (__main__.TestDataLoaderPersistentWorkers) ... ok (0.319s)
  test_bulk_loading_nobatch (__main__.TestDataLoaderPersistentWorkers) ... ok (0.107s)
Error: The action has timed out.

See https://github.com/pytorch/pytorch/runs/3811006909.

Incident timeline (all times Pacific)

Unknown. Anecdotally, this has been happening infrequently for some time.

User impact

The ASAN job sometimes failed erroneously.

Root cause

Unknown.

Mitigation

The recommended mitigation is to disable the DataLoader tests when running under ASAN.
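One way such a skip could look is sketched below. It is only an illustration: it assumes the ASAN CI job exports an environment variable (called `PYTORCH_TEST_WITH_ASAN` here, which should be checked against the actual CI configuration), and `TestDataLoaderSketch` is a hypothetical stand-in for the real test classes.

```python
import os
import unittest

# Assumption: the ASAN job sets this variable to "1". The name is illustrative
# and should be verified against the CI scripts before relying on it.
TEST_WITH_ASAN = os.getenv("PYTORCH_TEST_WITH_ASAN", "0") == "1"


# Decorating the class skips every test method in it when the flag is set.
@unittest.skipIf(TEST_WITH_ASAN, "DataLoader tests hang under ASAN, see #66223")
class TestDataLoaderSketch(unittest.TestCase):  # hypothetical stand-in class
    def test_placeholder(self):
        # Placeholder body; the real tests would exercise DataLoader here.
        self.assertEqual(sum([1, 2, 3]), 6)
```

Skipping at the class level keeps the tests visible in the report as skipped, so the gap in ASAN coverage stays easy to spot until the hang is understood.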

Prevention/followups

The DataLoader tests need to be run under ASAN, reviewed, and debugged to find the cause of the hang.
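As a possible starting point for that debugging, the standard library's `faulthandler` module can dump Python stack traces when a test runs longer than expected, instead of leaving only a job-level timeout in the log. The sketch below is an assumption about how this might be wired in, not how the test suite currently works; `HANG_TIMEOUT_SECONDS`, `HangDiagnosticsTestCase`, and `TestExample` are hypothetical names.

```python
import faulthandler
import sys
import unittest

# Hypothetical threshold; pick something well above the slowest healthy test.
HANG_TIMEOUT_SECONDS = 300


class HangDiagnosticsTestCase(unittest.TestCase):
    """Base class that dumps all thread tracebacks if a test runs too long."""

    def setUp(self):
        # faulthandler writes to a file descriptor, so use the original stderr
        # stream. exit=False only reports; the test keeps running afterwards.
        faulthandler.dump_traceback_later(
            HANG_TIMEOUT_SECONDS, file=sys.__stderr__, exit=False
        )

    def tearDown(self):
        # Cancel the pending dump so a finished test does not trigger it later.
        faulthandler.cancel_dump_traceback_later()


class TestExample(HangDiagnosticsTestCase):  # hypothetical example test
    def test_fast(self):
        self.assertTrue(True)
```

This only covers threads in the process that registers the dump. DataLoader worker processes would need their own diagnostics, so a hang inside a worker may show up only as the main process waiting on it.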

cc @ssnl @VitalyFedyunin @ejguan @NivekT @mruberry @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser

Labels: module: dataloader, module: tests, triaged
