[Feature] AsyncBatchedCollector: coordinator loop and direct submission mode #3499
Open
vmoens wants to merge 6 commits into gh/vmoens/241/base from
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3499
Note: Links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures
As of commit c583dcd with merge base 266e4aa:
NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
vmoens added a commit that referenced this pull request on Feb 12, 2026

…on mode

Rewrite the AsyncBatchedCollector to use a coordinator thread that
pipelines env stepping and batched inference without a global sync
barrier. Add a `direct=True` mode where each env thread submits
directly to the InferenceServer, eliminating the coordinator thread
and its serialization overhead.

Benchmark results (8 mock pixel envs, Nature-CNN, CPU):
AsyncBatchedCollector direct: 3183 fps (+72% vs coordinator)
AsyncBatchedCollector threading: 1850 fps (coordinator mode)
AsyncBatchedCollector mp: 1042 fps (coordinator mode)

Co-authored-by: Cursor <cursoragent@cursor.com>
ghstack-source-id: 225d2a4
Pull-Request: #3499
| Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
|---|---|---|---|---|---|
| test_tensor_to_bytestream_speed[pickle] | 85.4355μs | 83.9417μs | 11.9130 KOps/s | 12.4012 KOps/s | |
| test_tensor_to_bytestream_speed[torch.save] | 0.1443ms | 0.1439ms | 6.9514 KOps/s | 7.1860 KOps/s | |
| test_tensor_to_bytestream_speed[untyped_storage] | 0.1017s | 0.1013s | 9.8689 Ops/s | 9.7064 Ops/s | |
| test_tensor_to_bytestream_speed[numpy] | 2.5678μs | 2.5599μs | 390.6396 KOps/s | 403.8303 KOps/s | |
| test_tensor_to_bytestream_speed[safetensors] | 39.0869μs | 38.8879μs | 25.7150 KOps/s | 26.9108 KOps/s | |
| test_simple | 0.5428s | 0.5413s | 1.8473 Ops/s | 1.7838 Ops/s | |
| test_transformed | 1.0765s | 1.0751s | 0.9301 Ops/s | 0.9125 Ops/s | |
| test_serial | 1.6451s | 1.6434s | 0.6085 Ops/s | 0.6038 Ops/s | |
| test_parallel | 1.1265s | 1.0257s | 0.9749 Ops/s | 0.9832 Ops/s | |
| test_step_mdp_speed[True-True-True-True-True] | 0.1679ms | 41.5736μs | 24.0537 KOps/s | 24.6411 KOps/s | |
| test_step_mdp_speed[True-True-True-True-False] | 63.2410μs | 23.5299μs | 42.4991 KOps/s | 42.4314 KOps/s | |
| test_step_mdp_speed[True-True-True-False-True] | 59.7810μs | 23.6467μs | 42.2892 KOps/s | 42.9463 KOps/s | |
| test_step_mdp_speed[True-True-True-False-False] | 49.6410μs | 12.9889μs | 76.9889 KOps/s | 78.4521 KOps/s | |
| test_step_mdp_speed[True-True-False-True-True] | 79.3420μs | 44.6939μs | 22.3744 KOps/s | 22.8478 KOps/s | |
| test_step_mdp_speed[True-True-False-True-False] | 63.1820μs | 26.0605μs | 38.3722 KOps/s | 39.3679 KOps/s | |
| test_step_mdp_speed[True-True-False-False-True] | 61.4410μs | 26.0096μs | 38.4473 KOps/s | 40.3306 KOps/s | |
| test_step_mdp_speed[True-True-False-False-False] | 48.9500μs | 15.4105μs | 64.8909 KOps/s | 65.5748 KOps/s | |
| test_step_mdp_speed[True-False-True-True-True] | 80.8710μs | 47.8732μs | 20.8885 KOps/s | 21.6170 KOps/s | |
| test_step_mdp_speed[True-False-True-True-False] | 64.8010μs | 29.2892μs | 34.1423 KOps/s | 35.1547 KOps/s | |
| test_step_mdp_speed[True-False-True-False-True] | 56.9910μs | 26.0726μs | 38.3544 KOps/s | 39.1060 KOps/s | |
| test_step_mdp_speed[True-False-True-False-False] | 41.0310μs | 15.7919μs | 63.3236 KOps/s | 64.4999 KOps/s | |
| test_step_mdp_speed[True-False-False-True-True] | 82.5520μs | 49.7043μs | 20.1190 KOps/s | 20.6984 KOps/s | |
| test_step_mdp_speed[True-False-False-True-False] | 66.3610μs | 31.3011μs | 31.9478 KOps/s | 32.4991 KOps/s | |
| test_step_mdp_speed[True-False-False-False-True] | 62.7810μs | 28.4167μs | 35.1905 KOps/s | 35.4090 KOps/s | |
| test_step_mdp_speed[True-False-False-False-False] | 42.0500μs | 18.2045μs | 54.9315 KOps/s | 56.1476 KOps/s | |
| test_step_mdp_speed[False-True-True-True-True] | 78.9610μs | 47.2680μs | 21.1560 KOps/s | 21.1728 KOps/s | |
| test_step_mdp_speed[False-True-True-True-False] | 46.4310μs | 28.6657μs | 34.8849 KOps/s | 35.6420 KOps/s | |
| test_step_mdp_speed[False-True-True-False-True] | 2.5331ms | 30.4827μs | 32.8055 KOps/s | 32.6281 KOps/s | |
| test_step_mdp_speed[False-True-True-False-False] | 46.3110μs | 17.4470μs | 57.3165 KOps/s | 57.7083 KOps/s | |
| test_step_mdp_speed[False-True-False-True-True] | 81.2910μs | 49.6059μs | 20.1589 KOps/s | 19.5078 KOps/s | |
| test_step_mdp_speed[False-True-False-True-False] | 60.6310μs | 31.2928μs | 31.9562 KOps/s | 31.6995 KOps/s | |
| test_step_mdp_speed[False-True-False-False-True] | 58.9910μs | 32.3176μs | 30.9429 KOps/s | 30.6625 KOps/s | |
| test_step_mdp_speed[False-True-False-False-False] | 54.3110μs | 19.7764μs | 50.5652 KOps/s | 50.1203 KOps/s | |
| test_step_mdp_speed[False-False-True-True-True] | 94.9310μs | 52.9113μs | 18.8996 KOps/s | 19.1889 KOps/s | |
| test_step_mdp_speed[False-False-True-True-False] | 61.8410μs | 34.3334μs | 29.1262 KOps/s | 29.1031 KOps/s | |
| test_step_mdp_speed[False-False-True-False-True] | 78.2910μs | 32.4154μs | 30.8496 KOps/s | 30.6955 KOps/s | |
| test_step_mdp_speed[False-False-True-False-False] | 53.6900μs | 20.2274μs | 49.4379 KOps/s | 49.5752 KOps/s | |
| test_step_mdp_speed[False-False-False-True-True] | 90.0320μs | 54.5743μs | 18.3236 KOps/s | 18.2620 KOps/s | |
| test_step_mdp_speed[False-False-False-True-False] | 0.1100ms | 35.9868μs | 27.7879 KOps/s | 27.2942 KOps/s | |
| test_step_mdp_speed[False-False-False-False-True] | 67.6810μs | 34.6701μs | 28.8433 KOps/s | 28.9698 KOps/s | |
| test_step_mdp_speed[False-False-False-False-False] | 53.7910μs | 22.1387μs | 45.1698 KOps/s | 44.8807 KOps/s | |
| test_non_tensor_env_rollout_speed[1000-single-True] | 0.8436s | 0.7416s | 1.3484 Ops/s | 1.3558 Ops/s | |
| test_non_tensor_env_rollout_speed[1000-single-False] | 0.6932s | 0.6090s | 1.6420 Ops/s | 1.6617 Ops/s | |
| test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] | 1.7661s | 1.6731s | 0.5977 Ops/s | 0.6139 Ops/s | |
| test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] | 1.4822s | 1.4067s | 0.7109 Ops/s | 0.7132 Ops/s | |
| test_non_tensor_env_rollout_speed[1000-serial-buffers-True] | 1.9435s | 1.8626s | 0.5369 Ops/s | 0.5350 Ops/s | |
| test_non_tensor_env_rollout_speed[1000-serial-buffers-False] | 1.7533s | 1.6611s | 0.6020 Ops/s | 0.6077 Ops/s | |
| test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] | 4.6698s | 4.6021s | 0.2173 Ops/s | 0.2170 Ops/s | |
| test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] | 4.4581s | 4.3620s | 0.2293 Ops/s | 0.2264 Ops/s | |
| test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] | 1.9338s | 1.8449s | 0.5420 Ops/s | 0.5341 Ops/s | |
| test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] | 1.6276s | 1.5501s | 0.6451 Ops/s | 0.6390 Ops/s | |
| test_values[generalized_advantage_estimate-True-True] | 10.0011ms | 9.7916ms | 102.1281 Ops/s | 102.6640 Ops/s | |
| test_values[vec_generalized_advantage_estimate-True-True] | 20.1155ms | 17.4168ms | 57.4157 Ops/s | 56.2353 Ops/s | |
| test_values[td0_return_estimate-False-False] | 0.2075ms | 0.1254ms | 7.9713 KOps/s | 4.7963 KOps/s | |
| test_values[td1_return_estimate-False-False] | 26.7421ms | 26.3658ms | 37.9279 Ops/s | 38.4612 Ops/s | |
| test_values[vec_td1_return_estimate-False-False] | 20.5360ms | 17.7245ms | 56.4191 Ops/s | 56.1317 Ops/s | |
| test_values[td_lambda_return_estimate-True-False] | 39.6530ms | 38.9424ms | 25.6789 Ops/s | 26.0006 Ops/s | |
| test_values[vec_td_lambda_return_estimate-True-False] | 18.4655ms | 17.5660ms | 56.9281 Ops/s | 55.8712 Ops/s | |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 8.6859ms | 8.6264ms | 115.9230 Ops/s | 116.3894 Ops/s | |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 1.7215ms | 1.5087ms | 662.8021 Ops/s | 661.7838 Ops/s | |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.4881ms | 0.4053ms | 2.4674 KOps/s | 2.4899 KOps/s | |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 35.1992ms | 34.5617ms | 28.9338 Ops/s | 28.7967 Ops/s | |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 1.8853ms | 1.7202ms | 581.3412 Ops/s | 585.5632 Ops/s | |
| test_dqn_speed[False-None] | 1.4967ms | 1.3611ms | 734.6944 Ops/s | 731.0159 Ops/s | |
| test_dqn_speed[False-backward] | 1.9664ms | 1.8620ms | 537.0648 Ops/s | 541.9404 Ops/s | |
| test_dqn_speed[True-None] | 0.6844ms | 0.5424ms | 1.8435 KOps/s | 1.8134 KOps/s | |
| test_dqn_speed[True-backward] | 1.1274ms | 0.9854ms | 1.0148 KOps/s | 849.4576 Ops/s | |
| test_dqn_speed[reduce-overhead-None] | 0.5798ms | 0.5273ms | 1.8965 KOps/s | 1.8534 KOps/s | |
| test_ddpg_speed[False-None] | 3.1046ms | 2.7804ms | 359.6616 Ops/s | 352.6151 Ops/s | |
| test_ddpg_speed[False-backward] | 4.0538ms | 3.9363ms | 254.0487 Ops/s | 254.2734 Ops/s | |
| test_ddpg_speed[True-None] | 1.5716ms | 1.3964ms | 716.1406 Ops/s | 720.5199 Ops/s | |
| test_ddpg_speed[True-backward] | 2.3883ms | 2.3459ms | 426.2803 Ops/s | 366.3585 Ops/s | |
| test_ddpg_speed[reduce-overhead-None] | 1.5002ms | 1.3814ms | 723.8966 Ops/s | 700.3871 Ops/s | |
| test_sac_speed[False-None] | 8.3609ms | 7.7292ms | 129.3799 Ops/s | 129.5013 Ops/s | |
| test_sac_speed[False-backward] | 11.1905ms | 10.8433ms | 92.2232 Ops/s | 92.9572 Ops/s | |
| test_sac_speed[True-None] | 2.3166ms | 2.1374ms | 467.8517 Ops/s | 452.4028 Ops/s | |
| test_sac_speed[True-backward] | 4.0785ms | 3.9659ms | 252.1482 Ops/s | 235.5251 Ops/s | |
| test_sac_speed[reduce-overhead-None] | 2.2938ms | 2.1287ms | 469.7811 Ops/s | 475.5219 Ops/s | |
| test_redq_speed[False-None] | 13.5202ms | 10.2690ms | 97.3807 Ops/s | 100.3654 Ops/s | |
| test_redq_speed[False-backward] | 18.7753ms | 17.5491ms | 56.9829 Ops/s | 59.4519 Ops/s | |
| test_redq_speed[True-None] | 4.7231ms | 4.3835ms | 228.1287 Ops/s | 226.9820 Ops/s | |
| test_redq_speed[True-backward] | 10.0445ms | 9.7723ms | 102.3298 Ops/s | 103.5432 Ops/s | |
| test_redq_speed[reduce-overhead-None] | 4.8742ms | 4.3858ms | 228.0063 Ops/s | 223.5392 Ops/s | |
| test_redq_deprec_speed[False-None] | 11.3546ms | 10.7311ms | 93.1873 Ops/s | 93.6663 Ops/s | |
| test_redq_deprec_speed[False-backward] | 16.1891ms | 15.4068ms | 64.9063 Ops/s | 66.2743 Ops/s | |
| test_redq_deprec_speed[True-None] | 3.8098ms | 3.6277ms | 275.6598 Ops/s | 270.9693 Ops/s | |
| test_redq_deprec_speed[True-backward] | 7.5851ms | 7.3709ms | 135.6691 Ops/s | 136.9931 Ops/s | |
| test_redq_deprec_speed[reduce-overhead-None] | 3.8498ms | 3.5441ms | 282.1577 Ops/s | 283.6548 Ops/s | |
| test_td3_speed[False-None] | 7.9697ms | 7.7779ms | 128.5695 Ops/s | 128.4114 Ops/s | |
| test_td3_speed[False-backward] | 10.8510ms | 10.5028ms | 95.2131 Ops/s | 94.8757 Ops/s | |
| test_td3_speed[True-None] | 2.2732ms | 1.8403ms | 543.3861 Ops/s | 542.4061 Ops/s | |
| test_td3_speed[True-backward] | 3.8484ms | 3.6264ms | 275.7579 Ops/s | 246.7815 Ops/s | |
| test_td3_speed[reduce-overhead-None] | 1.8439ms | 1.7826ms | 560.9874 Ops/s | 549.6944 Ops/s | |
| test_cql_speed[False-None] | 29.2419ms | 25.9021ms | 38.6069 Ops/s | 39.2085 Ops/s | |
| test_cql_speed[False-backward] | 41.3921ms | 35.5981ms | 28.0914 Ops/s | 28.8870 Ops/s | |
| test_cql_speed[True-None] | 12.8935ms | 12.3717ms | 80.8295 Ops/s | 82.0127 Ops/s | |
| test_cql_speed[True-backward] | 18.5168ms | 18.1509ms | 55.0938 Ops/s | 54.1750 Ops/s | |
| test_cql_speed[reduce-overhead-None] | 12.7091ms | 12.3760ms | 80.8017 Ops/s | 80.5922 Ops/s | |
| test_a2c_speed[False-None] | 5.6249ms | 5.3467ms | 187.0301 Ops/s | 187.5239 Ops/s | |
| test_a2c_speed[False-backward] | 12.1369ms | 11.5859ms | 86.3119 Ops/s | 86.4450 Ops/s | |
| test_a2c_speed[True-None] | 4.1721ms | 3.7192ms | 268.8732 Ops/s | 271.3160 Ops/s | |
| test_a2c_speed[True-backward] | 9.1001ms | 8.5640ms | 116.7682 Ops/s | 106.5249 Ops/s | |
| test_a2c_speed[reduce-overhead-None] | 4.1763ms | 3.7042ms | 269.9619 Ops/s | 269.5602 Ops/s | |
| test_ppo_speed[False-None] | 6.2403ms | 5.8635ms | 170.5463 Ops/s | 172.4129 Ops/s | |
| test_ppo_speed[False-backward] | 12.6487ms | 12.2446ms | 81.6686 Ops/s | 82.2948 Ops/s | |
| test_ppo_speed[True-None] | 3.9716ms | 3.6038ms | 277.4881 Ops/s | 278.3743 Ops/s | |
| test_ppo_speed[True-backward] | 8.5365ms | 8.3648ms | 119.5487 Ops/s | 120.1945 Ops/s | |
| test_ppo_speed[reduce-overhead-None] | 3.7535ms | 3.6183ms | 276.3762 Ops/s | 276.8814 Ops/s | |
| test_reinforce_speed[False-None] | 4.9189ms | 4.5114ms | 221.6605 Ops/s | 226.2260 Ops/s | |
| test_reinforce_speed[False-backward] | 7.4412ms | 7.2149ms | 138.6019 Ops/s | 139.7823 Ops/s | |
| test_reinforce_speed[True-None] | 3.4590ms | 2.9003ms | 344.7940 Ops/s | 336.9547 Ops/s | |
| test_reinforce_speed[True-backward] | 8.2408ms | 7.7120ms | 129.6676 Ops/s | 132.5259 Ops/s | |
| test_reinforce_speed[reduce-overhead-None] | 3.2251ms | 2.8614ms | 349.4770 Ops/s | 354.4638 Ops/s | |
| test_iql_speed[False-None] | 20.3158ms | 19.6724ms | 50.8326 Ops/s | 50.9146 Ops/s | |
| test_iql_speed[False-backward] | 30.9451ms | 29.7813ms | 33.5781 Ops/s | 33.5752 Ops/s | |
| test_iql_speed[True-None] | 8.9041ms | 8.5168ms | 117.4157 Ops/s | 117.9447 Ops/s | |
| test_iql_speed[True-backward] | 16.9866ms | 16.5553ms | 60.4036 Ops/s | 58.8125 Ops/s | |
| test_iql_speed[reduce-overhead-None] | 9.0711ms | 8.5627ms | 116.7857 Ops/s | 117.3078 Ops/s | |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 6.5044ms | 6.0522ms | 165.2279 Ops/s | 164.1403 Ops/s | |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 2.1670ms | 0.2807ms | 3.5626 KOps/s | 3.3658 KOps/s | |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 0.5237ms | 0.3120ms | 3.2049 KOps/s | 3.8117 KOps/s | |
| test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 6.2291ms | 5.8928ms | 169.6984 Ops/s | 171.2132 Ops/s | |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 0.7945ms | 0.3024ms | 3.3064 KOps/s | 3.5550 KOps/s | |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 0.6317ms | 0.2993ms | 3.3406 KOps/s | 3.9043 KOps/s | |
| test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] | 1.6533ms | 1.2559ms | 796.2576 Ops/s | 804.0953 Ops/s | |
| test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] | 1.5566ms | 1.1667ms | 857.1273 Ops/s | 868.3154 Ops/s | |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 9.4585ms | 6.1287ms | 163.1671 Ops/s | 167.0838 Ops/s | |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 2.0487ms | 0.4263ms | 2.3459 KOps/s | 2.2328 KOps/s | |
| test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 0.8596ms | 0.5353ms | 1.8682 KOps/s | 2.4346 KOps/s | |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 6.2993ms | 5.8757ms | 170.1931 Ops/s | 173.0149 Ops/s | |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 1.7038ms | 0.3084ms | 3.2427 KOps/s | 2.7054 KOps/s | |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 0.5759ms | 0.3238ms | 3.0885 KOps/s | 2.7833 KOps/s | |
| test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 6.4909ms | 5.8655ms | 170.4888 Ops/s | 171.4604 Ops/s | |
| test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 1.0356ms | 0.3083ms | 3.2440 KOps/s | 2.7358 KOps/s | |
| test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 0.7012ms | 0.2872ms | 3.4816 KOps/s | 2.8971 KOps/s | |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 6.1998ms | 6.0488ms | 165.3213 Ops/s | 166.8789 Ops/s | |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 0.9232ms | 0.5069ms | 1.9727 KOps/s | 1.9693 KOps/s | |
| test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 0.6856ms | 0.4980ms | 2.0082 KOps/s | 2.0563 KOps/s | |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 6.6325ms | 5.0919ms | 196.3915 Ops/s | 199.6703 Ops/s | |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 4.8394ms | 2.1272ms | 470.0997 Ops/s | 452.0521 Ops/s | |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 2.0999ms | 1.0985ms | 910.3000 Ops/s | 1.1540 KOps/s | |
| test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 0.5574s | 16.2351ms | 61.5950 Ops/s | 58.0587 Ops/s | |
| test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 3.9556ms | 1.7654ms | 566.4516 Ops/s | 513.5848 Ops/s | |
| test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 1.0742ms | 0.9076ms | 1.1018 KOps/s | 794.5530 Ops/s | |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 9.5366ms | 5.3837ms | 185.7467 Ops/s | 190.4632 Ops/s | |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 9.3258ms | 2.0564ms | 486.2798 Ops/s | 493.0474 Ops/s | |
| test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 2.2197ms | 1.1651ms | 858.2984 Ops/s | 942.8862 Ops/s | |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] | 40.2344ms | 35.9140ms | 27.8443 Ops/s | 28.0547 Ops/s | |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] | 19.1506ms | 17.8059ms | 56.1610 Ops/s | 55.5942 Ops/s | |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] | 40.9200ms | 36.9788ms | 27.0425 Ops/s | 26.8316 Ops/s | |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] | 19.6681ms | 18.1378ms | 55.1333 Ops/s | 54.5772 Ops/s | |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] | 41.2425ms | 38.5927ms | 25.9116 Ops/s | 25.3430 Ops/s | |
| test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] | 21.6189ms | 20.1565ms | 49.6117 Ops/s | 49.0555 Ops/s | |
| test_storage_write_lazystack[50-img_shape0-small] | 0.8713ms | 0.2237ms | 4.4698 KOps/s | 2.3341 KOps/s | |
| test_storage_write_lazystack[100-img_shape1-atari] | 1.8481ms | 1.3802ms | 724.5442 Ops/s | 709.6567 Ops/s | |
| test_storage_write_lazystack[100-img_shape2-large_img] | 2.7119ms | 2.2625ms | 441.9856 Ops/s | 420.1617 Ops/s | |
| test_storage_write_lazystack[200-img_shape3-large_batch] | 3.3935ms | 2.9008ms | 344.7295 Ops/s | 341.3319 Ops/s | |
| test_storage_write_contiguous[50-img_shape0-small] | 0.2487ms | 0.1319ms | 7.5824 KOps/s | 7.6276 KOps/s | |
| test_storage_write_contiguous[100-img_shape1-atari] | 0.3468ms | 0.1762ms | 5.6749 KOps/s | 5.3287 KOps/s | |
| test_storage_write_contiguous[100-img_shape2-large_img] | 2.0623ms | 1.7614ms | 567.7395 Ops/s | 558.0763 Ops/s | |
| test_storage_write_contiguous[200-img_shape3-large_batch] | 1.3961ms | 1.2800ms | 781.2352 Ops/s | 777.4343 Ops/s | |
| test_collector_stack_then_write[50-img_shape0-small] | 1.5153ms | 1.1137ms | 897.9403 Ops/s | 893.6280 Ops/s | |
| test_collector_stack_then_write[100-img_shape1-atari] | 4.0130ms | 3.5400ms | 282.4870 Ops/s | 280.5371 Ops/s | |
| test_collector_stack_then_write[100-img_shape2-large_img] | 6.5416ms | 5.5577ms | 179.9293 Ops/s | 180.9658 Ops/s | |
| test_collector_stack_then_write[200-img_shape3-large_batch] | 7.4078ms | 6.9119ms | 144.6778 Ops/s | 143.4382 Ops/s | |
| test_collector_lazystack_then_write[50-img_shape0-small] | 0.7170ms | 0.2778ms | 3.6003 KOps/s | 3.6253 KOps/s | |
| test_collector_lazystack_then_write[100-img_shape1-atari] | 1.9424ms | 1.4868ms | 672.5713 Ops/s | 657.9243 Ops/s | |
| test_collector_lazystack_then_write[100-img_shape2-large_img] | 2.5302ms | 2.3929ms | 417.9048 Ops/s | 402.1758 Ops/s | |
| test_collector_lazystack_then_write[200-img_shape3-large_batch] | 3.2960ms | 3.1194ms | 320.5794 Ops/s | 317.2372 Ops/s | |
| test_collector_without_rb[100-img_shape0-atari] | 33.6044ms | 32.4504ms | 30.8163 Ops/s | 30.7904 Ops/s | |
| test_collector_without_rb[200-img_shape1-large_batch] | 65.3389ms | 63.6581ms | 15.7089 Ops/s | 15.6162 Ops/s | |
| test_collector_with_rb[100-img_shape0-atari] | 37.9688ms | 37.0339ms | 27.0023 Ops/s | 27.0913 Ops/s | |
| test_collector_with_rb[200-img_shape1-large_batch] | 72.3797ms | 71.8261ms | 13.9225 Ops/s | 13.8838 Ops/s |
Result of GPU Benchmark Tests
vmoens added a commit that referenced this pull request on Feb 12, 2026
ghstack-source-id: c4d370a
Pull-Request: #3499
vmoens added a commit that referenced this pull request on Feb 13, 2026
ghstack-source-id: 784abfc
Pull-Request: #3499
vmoens added a commit that referenced this pull request on Feb 14, 2026
ghstack-source-id: 270d3e3
Pull-Request: #3499
Stack from ghstack (oldest at bottom):
Rewrite the AsyncBatchedCollector to use a coordinator thread that
pipelines env stepping and batched inference without a global sync
barrier. Add a `direct=True` mode where each env thread submits
directly to the InferenceServer, eliminating the coordinator thread
and its serialization overhead.
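
For readers unfamiliar with the pattern, the direct submission idea described above can be sketched with the standard library alone. This is a minimal illustration, not the TorchRL implementation: `ToyInferenceServer`, `env_worker`, `collect`, the integer "observations" and the constant policy are all hypothetical names and stand-ins chosen for this example, and the real `InferenceServer` API may differ.

```python
import queue
import threading
from concurrent.futures import Future


class ToyInferenceServer:
    """Hypothetical stand-in for a batching inference server (not the TorchRL class)."""

    def __init__(self, policy, max_batch_size=8, timeout=0.01):
        self._policy = policy  # callable: list of observations -> list of actions
        self._max_batch_size = max_batch_size
        self._timeout = timeout
        self._requests = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def submit(self, obs):
        """Called from any env thread; returns a Future resolved with the action."""
        fut = Future()
        self._requests.put((obs, fut))
        return fut

    def _serve(self):
        while not self._stop.is_set():
            try:
                batch = [self._requests.get(timeout=self._timeout)]
            except queue.Empty:
                continue
            # Drain whatever else is already waiting, up to the batch limit.
            while len(batch) < self._max_batch_size:
                try:
                    batch.append(self._requests.get_nowait())
                except queue.Empty:
                    break
            actions = self._policy([obs for obs, _ in batch])
            for (_, fut), action in zip(batch, actions):
                fut.set_result(action)

    def shutdown(self):
        self._stop.set()
        self._worker.join()


def env_worker(env_id, server, num_steps, out):
    """Direct mode: each env thread talks to the server itself, no coordinator."""
    obs = 0
    for _ in range(num_steps):
        action = server.submit(obs).result()  # blocks only this env thread
        obs = obs + action  # mock env step
        out.put((env_id, obs, action))


def collect(num_envs=4, num_steps=5):
    """Run all env threads to completion and return the collected transitions."""
    server = ToyInferenceServer(policy=lambda batch: [1 for _ in batch])
    out = queue.Queue()
    threads = [
        threading.Thread(target=env_worker, args=(i, server, num_steps, out))
        for i in range(num_envs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    server.shutdown()
    return [out.get() for _ in range(num_envs * num_steps)]
```

In this sketch a slow env only delays itself: the server batches whichever requests are pending, so there is no global barrier and no intermediate thread relaying observations and actions between the envs and the server.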
Benchmark results (8 mock pixel envs, Nature-CNN, CPU):
AsyncBatchedCollector direct: 3183 fps (+72% vs coordinator)
AsyncBatchedCollector threading: 1850 fps (coordinator mode)
AsyncBatchedCollector mp: 1042 fps (coordinator mode)
Co-authored-by: Cursor <cursoragent@cursor.com>
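
As a quick sanity check on the benchmark figures above, the reported "+72% vs coordinator" appears to be measured against the threading coordinator (1850 fps), since 3183 / 1850 ≈ 1.72. The short snippet below recomputes the relative gains from the three reported throughputs; `relative_gain` is a helper name introduced here for illustration.

```python
# Throughputs reported in the benchmark results, in frames per second.
fps = {"direct": 3183, "threading": 1850, "mp": 1042}


def relative_gain(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return 100.0 * (new / baseline - 1.0)


gain_vs_threading = relative_gain(fps["direct"], fps["threading"])
gain_vs_mp = relative_gain(fps["direct"], fps["mp"])

print(f"direct vs threading coordinator: {gain_vs_threading:+.0f}%")  # +72%
print(f"direct vs mp coordinator: {gain_vs_mp:+.0f}%")
```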