[Performance] Faster RNNs #1732

Merged
merged 22 commits into from
Dec 15, 2023

@vmoens (Contributor) commented Dec 4, 2023

The following snippet benchmarks the TorchRL LSTM against torch.nn.LSTM, alone and combined with torch.compile and torch.vmap (results below).

import timeit

import torch.cuda
from torch.nn import LSTM

from tensordict import TensorDict
from torchrl.modules import LSTM as LSTM_RL

device = "cuda" if torch.cuda.device_count() else "cpu"
mode = "default"
backend = "cudagraphs"

inp = 128
outp = 256
layers = 1
B = 64
T = 64

lstm0 = LSTM(inp, outp, num_layers=layers, device=device, batch_first=True)
lstm1 = LSTM_RL(inp, outp, num_layers=layers, device=device, batch_first=True)
x = torch.randn(B, T, inp, device=device)
hx = torch.zeros(layers, B, outp, device=device)
cx = torch.zeros(layers, B, outp, device=device)

params = TensorDict.from_module(lstm1, as_module=True).expand(10).clone()


def vmap_lstm1(inp, hx, cx, params):
    with params.to_module(lstm1):
        return lstm1(inp, (hx, cx))


lstm0(x, (hx, cx))
print(
    "nn.Module",
    timeit.repeat("lstm(x, (hx, cx))", globals={"lstm": lstm0, "x": x, "hx": hx, "cx": cx}, number=200)
    )
lstm1(x)
print(
    "rl",
    timeit.repeat("lstm(x, (hx, cx))", globals={"lstm": lstm1, "x": x, "hx": hx, "cx": cx}, number=200)
    )

lstm1_comp = torch.compile(lstm1, mode=mode, fullgraph=True, backend=backend)
lstm1_comp(x, (hx, cx))
print(
    "rl - comp",
    timeit.repeat("lstm(x, (hx, cx))", globals={"lstm": lstm1_comp, "x": x, "hx": hx, "cx": cx}, number=200)
    )

v1 = torch.vmap(vmap_lstm1, (None, None, None, 0))
print(
    "rl - vmap", timeit.repeat(
        "v1(x, hx, cx, params)",
        globals={"torch": torch, "v1": v1, "x": x, "hx": hx, "cx": cx,
                 "params": params}, number=200
        )
    )

v2 = torch.compile(v1, mode=mode, backend=backend)
v2(x, hx, cx, params)
print(
    "rl - comp - vmap", timeit.repeat(
        "v2(x, hx, cx, params)",
        globals={"torch": torch, "v2": v2, "x": x, "hx": hx, "cx": cx,
                 "params": params}, number=200
        )
    )
v3 = torch.vmap(torch.compile(vmap_lstm1, mode=mode, backend=backend), (None, None, None, 0))
v3(x, hx, cx, params)
print(
    "rl - vmap - comp", timeit.repeat(
        "v3(x, hx, cx, params)",
        globals={"torch": torch, "v3": v3, "x": x, "hx": hx, "cx": cx,
                 "params": params}, number=200
        )
    )

# forward backward
print(
    "nn.Module",
    timeit.repeat("lstm(x)[0].mean().backward()", globals={"lstm": lstm0, "x": x}, number=200)
    )
lstm1(x)
print(
    "rl",
    timeit.repeat("lstm(x)[0].mean().backward()", globals={"lstm": lstm1, "x": x}, number=200)
    )
print(
    "rl - comp",
    timeit.repeat("lstm(x)[0].mean().backward()", globals={"lstm": lstm1_comp, "x": x}, number=200)
    )
print(
    "rl - vmap", timeit.repeat(
        "v1(x, hx, cx, params)[0].mean().backward()",
        globals={"torch": torch, "v1": v1, "x": x, "hx": hx, "cx": cx,
                 "params": params}, number=200
        )
    )
print(
    "rl - comp - vmap", timeit.repeat(
        "v2(x, hx, cx, params)[0].mean().backward()",
        globals={"torch": torch, "v2": v2, "x": x, "hx": hx, "cx": cx,
                 "params": params}, number=200
        )
    )
print(
    "rl - vmap - comp", timeit.repeat(
        "v3(x, hx, cx, params)[0].mean().backward()",
        globals={"torch": torch, "v3": v3, "x": x, "hx": hx, "cx": cx,
                 "params": params}, number=200
        )
    )

Results of forward calls (seconds per 200 calls, 5 repeats each):

nn.Module [0.1985985580831766, 0.21020206296816468, 0.21042610611766577, 0.21009117970243096, 0.21014584274962544]
rl [2.223391553387046, 2.2246690620668232, 2.218345619738102, 2.215493839699775, 2.216132593806833]
rl - comp [2.098668306134641, 1.9071342102251947, 1.903880384285003, 1.9047306668944657, 1.9056234462186694]
rl - vmap [5.82506698789075, 5.802758268080652, 5.797261876054108, 5.784914615098387, 5.792825186159462]
rl - comp - vmap [5.859712489880621, 5.861479918006808, 5.862077071797103, 5.8909066510386765, 5.891659554094076]
rl - vmap - comp [5.862315001897514, 5.867215735372156, 5.867957653943449, 5.866320278029889, 5.864287442993373]

Results of forward + backward calls (seconds per 200 calls, 5 repeats each):

nn.Module [0.7210929421707988, 0.4839882277883589, 0.48286808328703046, 0.48307183710858226, 0.4828226496465504]
rl [7.291498672217131, 7.13550530700013, 7.131908990908414, 7.136489370837808, 7.137898595072329]
rl - comp [19.776460175402462, 2.9335607159882784, 2.93492828309536, 2.9360084398649633, 2.9411989389918745]
rl - vmap [10.88541623018682, 10.87921998044476, 10.88731099711731, 10.892555037047714, 10.885157434269786]
rl - comp - vmap [10.932933965232223, 10.943286961875856, 10.940583536401391, 10.918374385219067, 10.939673755783588]
rl - vmap - comp [10.894704687874764, 10.90229798387736, 10.90534247783944, 10.909869996830821, 10.916080442722887]

In short: compile reduces the (otherwise very high) compute time of the TorchRL LSTM. The gain is modest for the forward pass (about 1.9 s vs 2.2 s per 200 calls) and much larger for forward + backward (about 2.9 s vs 7.1 s).
Relative to nn.LSTM, forward + backward is roughly 6x slower with compile, vs roughly 15x without. For vmap calls, compile is of little help (it isn't clear why), whether compile wraps vmap or vmap wraps compile.
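
The slowdown factors relative to nn.LSTM can be recomputed from the timings listed above. The sketch below is only an illustration: the values are copied from the results and rounded to four decimals, the helper name `slowdown_vs_baseline` is hypothetical, and using the median of the five repeats is my own choice rather than anything stated in the PR.

```python
from statistics import median

# Timings copied from the results above (seconds per 200 calls), rounded to 4 decimals.
forward = {
    "nn.Module": [0.1986, 0.2102, 0.2104, 0.2101, 0.2101],
    "rl": [2.2234, 2.2247, 2.2183, 2.2155, 2.2161],
    "rl - comp": [2.0987, 1.9071, 1.9039, 1.9047, 1.9056],
}
forward_backward = {
    "nn.Module": [0.7211, 0.4840, 0.4829, 0.4831, 0.4828],
    "rl": [7.2915, 7.1355, 7.1319, 7.1365, 7.1379],
    "rl - comp": [19.7765, 2.9336, 2.9349, 2.9360, 2.9412],
}


def slowdown_vs_baseline(timings, baseline="nn.Module"):
    """Return median(time) / median(baseline time) for every entry (hypothetical helper)."""
    base = median(timings[baseline])
    return {name: median(runs) / base for name, runs in timings.items()}


for label, timings in (("forward", forward), ("forward + backward", forward_backward)):
    for name, ratio in slowdown_vs_baseline(timings).items():
        print(f"{label:>18} | {name:<9} | {ratio:5.1f}x")
```

The median is used because the first repeat is sometimes an outlier (for example 19.8 s for "rl - comp" in the forward + backward run, which presumably includes compilation). The vmap rows are left out because they run 10 stacked parameter sets per call, so they are not directly comparable to a single nn.LSTM call.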

pytorch-bot bot commented Dec 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/1732

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (8 Unrelated Failures)

As of commit 383372c with merge base 0906206:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Dec 4, 2023

github-actions bot commented Dec 4, 2023

⚠ Warning: Result of CPU Benchmark Tests

Total Benchmarks: 89. Improved: 5. Worsened: 11.

| Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
| --- | --- | --- | --- | --- | --- |
| test_single | 64.1659ms | 63.0740ms | 15.8544 Ops/s | 14.6836 Ops/s | **+7.97%** |
| test_sync | 47.2133ms | 40.1436ms | 24.9105 Ops/s | 27.1948 Ops/s | **-8.40%** |
| test_async | 74.9472ms | 33.8204ms | 29.5680 Ops/s | 29.6069 Ops/s | -0.13% |
| test_simple | 0.4891s | 0.4330s | 2.3094 Ops/s | 2.2391 Ops/s | +3.14% |
| test_transformed | 0.6503s | 0.5972s | 1.6746 Ops/s | 1.6348 Ops/s | +2.44% |
| test_serial | 1.3511s | 1.3062s | 0.7656 Ops/s | 0.7272 Ops/s | **+5.28%** |
| test_parallel | 1.3570s | 1.3144s | 0.7608 Ops/s | 0.7606 Ops/s | +0.02% |
| test_step_mdp_speed[True-True-True-True-True] | 0.1428ms | 21.8280μs | 45.8127 KOps/s | 45.9460 KOps/s | -0.29% |
| test_step_mdp_speed[True-True-True-True-False] | 38.1110μs | 13.1503μs | 76.0437 KOps/s | 76.1189 KOps/s | -0.10% |
| test_step_mdp_speed[True-True-True-False-True] | 28.8640μs | 12.7018μs | 78.7290 KOps/s | 77.8758 KOps/s | +1.10% |
| test_step_mdp_speed[True-True-True-False-False] | 42.8600μs | 7.6481μs | 130.7508 KOps/s | 128.9362 KOps/s | +1.41% |
| test_step_mdp_speed[True-True-False-True-True] | 66.5140μs | 23.0509μs | 43.3823 KOps/s | 43.2860 KOps/s | +0.22% |
| test_step_mdp_speed[True-True-False-True-False] | 40.5960μs | 14.5870μs | 68.5541 KOps/s | 68.5451 KOps/s | +0.01% |
| test_step_mdp_speed[True-True-False-False-True] | 59.4710μs | 14.0399μs | 71.2254 KOps/s | 70.4017 KOps/s | +1.17% |
| test_step_mdp_speed[True-True-False-False-False] | 27.9220μs | 9.0281μs | 110.7654 KOps/s | 108.9540 KOps/s | +1.66% |
| test_step_mdp_speed[True-False-True-True-True] | 58.6190μs | 24.8414μs | 40.2554 KOps/s | 40.4187 KOps/s | -0.40% |
| test_step_mdp_speed[True-False-True-True-False] | 40.3460μs | 16.0488μs | 62.3099 KOps/s | 62.0565 KOps/s | +0.41% |
| test_step_mdp_speed[True-False-True-False-True] | 52.3380μs | 14.0572μs | 71.1381 KOps/s | 70.9842 KOps/s | +0.22% |
| test_step_mdp_speed[True-False-True-False-False] | 27.3210μs | 9.1000μs | 109.8898 KOps/s | 110.0616 KOps/s | -0.16% |
| test_step_mdp_speed[True-False-False-True-True] | 72.5140μs | 26.0208μs | 38.4308 KOps/s | 38.4953 KOps/s | -0.17% |
| test_step_mdp_speed[True-False-False-True-False] | 62.3360μs | 17.2837μs | 57.8578 KOps/s | 57.1183 KOps/s | +1.29% |
| test_step_mdp_speed[True-False-False-False-True] | 37.6000μs | 15.2465μs | 65.5890 KOps/s | 65.1209 KOps/s | +0.72% |
| test_step_mdp_speed[True-False-False-False-False] | 48.7210μs | 10.2202μs | 97.8452 KOps/s | 95.7264 KOps/s | +2.21% |
| test_step_mdp_speed[False-True-True-True-True] | 71.0920μs | 24.7482μs | 40.4070 KOps/s | 40.7186 KOps/s | -0.77% |
| test_step_mdp_speed[False-True-True-True-False] | 34.9960μs | 15.9040μs | 62.8773 KOps/s | 61.6696 KOps/s | +1.96% |
| test_step_mdp_speed[False-True-True-False-True] | 52.5370μs | 16.4908μs | 60.6400 KOps/s | 60.3425 KOps/s | +0.49% |
| test_step_mdp_speed[False-True-True-False-False] | 28.4730μs | 10.3532μs | 96.5888 KOps/s | 95.3432 KOps/s | +1.31% |
| test_step_mdp_speed[False-True-False-True-True] | 58.1480μs | 25.9707μs | 38.5049 KOps/s | 38.9946 KOps/s | -1.26% |
| test_step_mdp_speed[False-True-False-True-False] | 36.8880μs | 17.2250μs | 58.0550 KOps/s | 57.8578 KOps/s | +0.34% |
| test_step_mdp_speed[False-True-False-False-True] | 55.2190μs | 17.6616μs | 56.6200 KOps/s | 56.9846 KOps/s | -0.64% |
| test_step_mdp_speed[False-True-False-False-False] | 29.4750μs | 11.5327μs | 86.7098 KOps/s | 86.4967 KOps/s | +0.25% |
| test_step_mdp_speed[False-False-True-True-True] | 68.1860μs | 27.2904μs | 36.6430 KOps/s | 36.8233 KOps/s | -0.49% |
| test_step_mdp_speed[False-False-True-True-False] | 70.6840μs | 18.2187μs | 54.8887 KOps/s | 53.3471 KOps/s | +2.89% |
| test_step_mdp_speed[False-False-True-False-True] | 36.9990μs | 17.6423μs | 56.6819 KOps/s | 56.4148 KOps/s | +0.47% |
| test_step_mdp_speed[False-False-True-False-False] | 57.1670μs | 11.4624μs | 87.2419 KOps/s | 86.4171 KOps/s | +0.95% |
| test_step_mdp_speed[False-False-False-True-True] | 65.4310μs | 28.3160μs | 35.3158 KOps/s | 35.6526 KOps/s | -0.94% |
| test_step_mdp_speed[False-False-False-True-False] | 62.6970μs | 19.8041μs | 50.4946 KOps/s | 50.3332 KOps/s | +0.32% |
| test_step_mdp_speed[False-False-False-False-True] | 62.7060μs | 18.9183μs | 52.8589 KOps/s | 53.7593 KOps/s | -1.67% |
| test_step_mdp_speed[False-False-False-False-False] | 34.6140μs | 12.7367μs | 78.5130 KOps/s | 78.5891 KOps/s | -0.10% |
| test_values[generalized_advantage_estimate-True-True] | 12.8939ms | 11.8846ms | 84.1426 Ops/s | 83.9162 Ops/s | +0.27% |
| test_values[vec_generalized_advantage_estimate-True-True] | 35.3123ms | 27.7774ms | 36.0005 Ops/s | 38.2026 Ops/s | **-5.76%** |
| test_values[td0_return_estimate-False-False] | 0.2531ms | 0.1756ms | 5.6936 KOps/s | 5.5336 KOps/s | +2.89% |
| test_values[td1_return_estimate-False-False] | 25.8484ms | 25.3479ms | 39.4511 Ops/s | 39.7437 Ops/s | -0.74% |
| test_values[vec_td1_return_estimate-False-False] | 35.3737ms | 27.7551ms | 36.0295 Ops/s | 37.5707 Ops/s | -4.10% |
| test_values[td_lambda_return_estimate-True-False] | 35.8830ms | 35.3378ms | 28.2983 Ops/s | 28.4351 Ops/s | -0.48% |
| test_values[vec_td_lambda_return_estimate-True-False] | 35.7831ms | 27.8661ms | 35.8859 Ops/s | 37.4642 Ops/s | -4.21% |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 8.9521ms | 8.0259ms | 124.5968 Ops/s | 125.8645 Ops/s | -1.01% |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 10.2126ms | 1.8926ms | 528.3609 Ops/s | 519.6546 Ops/s | +1.68% |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.5332ms | 0.4244ms | 2.3562 KOps/s | 2.3105 KOps/s | +1.98% |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 43.9820ms | 38.3013ms | 26.1088 Ops/s | 26.8299 Ops/s | -2.69% |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 3.5404ms | 2.6219ms | 381.4086 Ops/s | 381.9022 Ops/s | -0.13% |
| test_dqn_speed | 10.2216ms | 1.6258ms | 615.0898 Ops/s | 604.6451 Ops/s | +1.73% |
| test_ddpg_speed | 12.1207ms | 3.6455ms | 274.3082 Ops/s | 274.6707 Ops/s | -0.13% |
| test_sac_speed | 78.0308ms | 10.9177ms | 91.5940 Ops/s | 97.5800 Ops/s | **-6.13%** |
| test_redq_speed | 27.5107ms | 19.0123ms | 52.5977 Ops/s | 52.1183 Ops/s | +0.92% |
| test_redq_deprec_speed | 23.4420ms | 15.0264ms | 66.5496 Ops/s | 65.8915 Ops/s | +1.00% |
| test_td3_speed | 18.0218ms | 10.4729ms | 95.4845 Ops/s | 94.3396 Ops/s | +1.21% |
| test_cql_speed | 47.3083ms | 38.6661ms | 25.8625 Ops/s | 25.3669 Ops/s | +1.95% |
| test_a2c_speed | 16.2878ms | 8.1006ms | 123.4472 Ops/s | 72.4170 Ops/s | **+70.47%** |
| test_ppo_speed | 17.0793ms | 8.3906ms | 119.1808 Ops/s | 117.5203 Ops/s | +1.41% |
| test_reinforce_speed | 15.9125ms | 7.1545ms | 139.7719 Ops/s | 139.3745 Ops/s | +0.29% |
| test_iql_speed | 42.7150ms | 34.1893ms | 29.2489 Ops/s | 29.0171 Ops/s | +0.80% |
| test_sample_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 2.9773ms | 1.8568ms | 538.5752 Ops/s | 493.2184 Ops/s | **+9.20%** |
| test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 0.1009s | 2.1576ms | 463.4781 Ops/s | 506.2262 Ops/s | **-8.44%** |
| test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3.2240ms | 1.9824ms | 504.4361 Ops/s | 503.3468 Ops/s | +0.22% |
| test_sample_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 2.3932ms | 1.8665ms | 535.7528 Ops/s | 465.9432 Ops/s | **+14.98%** |
| test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 98.1648ms | 2.1441ms | 466.3896 Ops/s | 500.7125 Ops/s | **-6.85%** |
| test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 2.9999ms | 1.9666ms | 508.5025 Ops/s | 508.6317 Ops/s | -0.03% |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 2.3913ms | 1.8504ms | 540.4356 Ops/s | 545.7313 Ops/s | -0.97% |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 99.8966ms | 2.1920ms | 456.2040 Ops/s | 505.9534 Ops/s | **-9.83%** |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 2.9439ms | 1.9805ms | 504.9262 Ops/s | 508.3101 Ops/s | -0.67% |
| test_iterate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 2.5308ms | 1.8642ms | 536.4214 Ops/s | 542.8105 Ops/s | -1.18% |
| test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 0.1044s | 2.1984ms | 454.8701 Ops/s | 491.8336 Ops/s | **-7.52%** |
| test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 2.9102ms | 1.9770ms | 505.8130 Ops/s | 507.9670 Ops/s | -0.42% |
| test_iterate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 2.3991ms | 1.8600ms | 537.6389 Ops/s | 530.9514 Ops/s | +1.26% |
| test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 0.1055s | 2.2338ms | 447.6629 Ops/s | 505.3037 Ops/s | **-11.41%** |
| test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 3.3809ms | 1.9824ms | 504.4270 Ops/s | 503.1591 Ops/s | +0.25% |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 2.3928ms | 1.8849ms | 530.5182 Ops/s | 541.2887 Ops/s | -1.99% |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 0.1047s | 2.1633ms | 462.2481 Ops/s | 511.2612 Ops/s | **-9.59%** |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 2.7457ms | 1.9586ms | 510.5582 Ops/s | 509.2546 Ops/s | +0.26% |
| test_populate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 0.1473s | 16.7387ms | 59.7416 Ops/s | 57.9970 Ops/s | +3.01% |
| test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 0.1030s | 15.8159ms | 63.2274 Ops/s | 63.6203 Ops/s | -0.62% |
| test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 0.1001s | 15.7780ms | 63.3793 Ops/s | 63.7730 Ops/s | -0.62% |
| test_populate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 0.1001s | 15.8805ms | 62.9704 Ops/s | 71.6455 Ops/s | **-12.11%** |
| test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 99.6975ms | 15.7293ms | 63.5755 Ops/s | 62.7885 Ops/s | +1.25% |
| test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 98.2683ms | 15.6716ms | 63.8096 Ops/s | 63.1373 Ops/s | +1.06% |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 0.1024s | 15.7388ms | 63.5374 Ops/s | 63.2792 Ops/s | +0.41% |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 0.1027s | 17.7215ms | 56.4286 Ops/s | 63.3636 Ops/s | **-10.94%** |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 0.1018s | 15.8206ms | 63.2089 Ops/s | 63.2628 Ops/s | -0.09% |
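
For readers checking the table, the Change column appears to be the relative difference between Ops and Ops on Repo HEAD. The sketch below reproduces it for a few rows. The function names are hypothetical, and the 5% threshold for counting a benchmark as improved or worsened is an assumption inferred from which rows the bot emphasizes, not something the bot states.

```python
def relative_change(ops, ops_head):
    """Percentage change of this PR's throughput relative to repo HEAD."""
    return (ops / ops_head - 1.0) * 100.0


def classify(change_pct, threshold_pct=5.0):
    """Label a change; the 5% threshold is an assumption inferred from the table."""
    if change_pct >= threshold_pct:
        return "improved"
    if change_pct <= -threshold_pct:
        return "worsened"
    return "unchanged"


# (Ops, Ops on Repo HEAD) pairs copied from the CPU table above.
rows = {
    "test_single": (15.8544, 14.6836),
    "test_sync": (24.9105, 27.1948),
    "test_a2c_speed": (123.4472, 72.4170),
    "test_async": (29.5680, 29.6069),
}

for name, (ops, ops_head) in rows.items():
    change = relative_change(ops, ops_head)
    print(f"{name:<16} {change:+7.2f}%  {classify(change)}")
```

Under that assumed threshold, the counts of emphasized rows match the reported totals (5 improved and 11 worsened on CPU).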

github-actions bot commented Dec 4, 2023

⚠ Warning: Result of GPU Benchmark Tests

Total Benchmarks: 92. Improved: 6. Worsened: 1.

| Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
| --- | --- | --- | --- | --- | --- |
| test_single | 0.1286s | 0.1276s | 7.8387 Ops/s | 7.9352 Ops/s | -1.22% |
| test_sync | 0.1035s | 0.1025s | 9.7521 Ops/s | 9.4995 Ops/s | +2.66% |
| test_async | 0.2747s | 0.1013s | 9.8718 Ops/s | 9.9094 Ops/s | -0.38% |
| test_single_pixels | 0.1357s | 0.1353s | 7.3914 Ops/s | 6.7296 Ops/s | **+9.84%** |
| test_sync_pixels | 96.2275ms | 95.3724ms | 10.4852 Ops/s | 10.4999 Ops/s | -0.14% |
| test_async_pixels | 0.2498s | 92.1901ms | 10.8471 Ops/s | 10.8136 Ops/s | +0.31% |
| test_simple | 0.9928s | 0.9382s | 1.0659 Ops/s | 1.0887 Ops/s | -2.09% |
| test_transformed | 1.2452s | 1.1831s | 0.8452 Ops/s | 0.8514 Ops/s | -0.73% |
| test_serial | 2.7244s | 2.6217s | 0.3814 Ops/s | 0.3773 Ops/s | +1.08% |
| test_parallel | 2.6005s | 2.5174s | 0.3972 Ops/s | 0.3975 Ops/s | -0.06% |
| test_step_mdp_speed[True-True-True-True-True] | 0.1066ms | 33.6404μs | 29.7262 KOps/s | 30.3768 KOps/s | -2.14% |
| test_step_mdp_speed[True-True-True-True-False] | 51.6210μs | 19.6560μs | 50.8752 KOps/s | 51.1062 KOps/s | -0.45% |
| test_step_mdp_speed[True-True-True-False-True] | 47.8810μs | 19.2425μs | 51.9683 KOps/s | 52.8304 KOps/s | -1.63% |
| test_step_mdp_speed[True-True-True-False-False] | 38.5410μs | 11.3846μs | 87.8381 KOps/s | 88.1763 KOps/s | -0.38% |
| test_step_mdp_speed[True-True-False-True-True] | 57.6420μs | 35.1438μs | 28.4545 KOps/s | 28.6411 KOps/s | -0.65% |
| test_step_mdp_speed[True-True-False-True-False] | 44.9500μs | 21.6892μs | 46.1059 KOps/s | 46.0833 KOps/s | +0.05% |
| test_step_mdp_speed[True-True-False-False-True] | 50.6210μs | 21.6389μs | 46.2130 KOps/s | 47.9023 KOps/s | -3.53% |
| test_step_mdp_speed[True-True-False-False-False] | 33.4400μs | 13.3467μs | 74.9246 KOps/s | 74.9402 KOps/s | -0.02% |
| test_step_mdp_speed[True-False-True-True-True] | 69.4200μs | 37.6978μs | 26.5267 KOps/s | 26.7853 KOps/s | -0.97% |
| test_step_mdp_speed[True-False-True-True-False] | 45.6810μs | 23.8423μs | 41.9423 KOps/s | 42.3346 KOps/s | -0.93% |
| test_step_mdp_speed[True-False-True-False-True] | 71.5300μs | 21.1906μs | 47.1906 KOps/s | 47.2039 KOps/s | -0.03% |
| test_step_mdp_speed[True-False-True-False-False] | 44.6510μs | 13.4178μs | 74.5279 KOps/s | 74.4363 KOps/s | +0.12% |
| test_step_mdp_speed[True-False-False-True-True] | 69.6610μs | 39.2170μs | 25.4992 KOps/s | 25.7909 KOps/s | -1.13% |
| test_step_mdp_speed[True-False-False-True-False] | 62.2810μs | 25.6749μs | 38.9486 KOps/s | 39.3988 KOps/s | -1.14% |
| test_step_mdp_speed[True-False-False-False-True] | 57.6800μs | 23.0967μs | 43.2963 KOps/s | 43.7986 KOps/s | -1.15% |
| test_step_mdp_speed[True-False-False-False-False] | 42.3210μs | 15.2396μs | 65.6186 KOps/s | 66.2093 KOps/s | -0.89% |
| test_step_mdp_speed[False-True-True-True-True] | 68.4210μs | 37.6604μs | 26.5531 KOps/s | 27.1070 KOps/s | -2.04% |
| test_step_mdp_speed[False-True-True-True-False] | 53.8200μs | 23.7149μs | 42.1676 KOps/s | 42.8198 KOps/s | -1.52% |
| test_step_mdp_speed[False-True-True-False-True] | 62.1800μs | 25.6522μs | 38.9830 KOps/s | 40.6276 KOps/s | -4.05% |
| test_step_mdp_speed[False-True-True-False-False] | 38.7000μs | 15.2191μs | 65.7067 KOps/s | 66.4633 KOps/s | -1.14% |
| test_step_mdp_speed[False-True-False-True-True] | 80.3110μs | 39.0703μs | 25.5949 KOps/s | 26.0996 KOps/s | -1.93% |
| test_step_mdp_speed[False-True-False-True-False] | 56.9910μs | 25.7080μs | 38.8984 KOps/s | 39.2218 KOps/s | -0.82% |
| test_step_mdp_speed[False-True-False-False-True] | 52.7310μs | 26.9675μs | 37.0817 KOps/s | 37.7768 KOps/s | -1.84% |
| test_step_mdp_speed[False-True-False-False-False] | 44.0900μs | 16.9347μs | 59.0502 KOps/s | 59.1566 KOps/s | -0.18% |
| test_step_mdp_speed[False-False-True-True-True] | 68.2720μs | 40.8300μs | 24.4918 KOps/s | 24.8342 KOps/s | -1.38% |
| test_step_mdp_speed[False-False-True-True-False] | 58.1400μs | 27.2023μs | 36.7616 KOps/s | 36.4576 KOps/s | +0.83% |
| test_step_mdp_speed[False-False-True-False-True] | 51.7600μs | 27.1397μs | 36.8464 KOps/s | 37.8655 KOps/s | -2.69% |
| test_step_mdp_speed[False-False-True-False-False] | 44.1800μs | 16.9282μs | 59.0732 KOps/s | 59.6708 KOps/s | -1.00% |
| test_step_mdp_speed[False-False-False-True-True] | 76.3800μs | 42.1745μs | 23.7110 KOps/s | 23.8361 KOps/s | -0.52% |
| test_step_mdp_speed[False-False-False-True-False] | 55.7110μs | 29.2995μs | 34.1303 KOps/s | 34.6178 KOps/s | -1.41% |
| test_step_mdp_speed[False-False-False-False-True] | 58.3110μs | 28.4964μs | 35.0922 KOps/s | 36.3273 KOps/s | -3.40% |
| test_step_mdp_speed[False-False-False-False-False] | 43.1100μs | 18.9800μs | 52.6870 KOps/s | 53.5311 KOps/s | -1.58% |
| test_values[generalized_advantage_estimate-True-True] | 27.0852ms | 26.6389ms | 37.5390 Ops/s | 39.1844 Ops/s | -4.20% |
| test_values[vec_generalized_advantage_estimate-True-True] | 0.1003s | 3.6073ms | 277.2148 Ops/s | 91.0625 Ops/s | **+204.42%** |
| test_values[td0_return_estimate-False-False] | 0.1431ms | 69.0716μs | 14.4777 KOps/s | 14.7815 KOps/s | -2.05% |
| test_values[td1_return_estimate-False-False] | 58.2017ms | 57.7733ms | 17.3090 Ops/s | 17.8141 Ops/s | -2.84% |
| test_values[vec_td1_return_estimate-False-False] | 2.0505ms | 1.8057ms | 553.7981 Ops/s | 556.0273 Ops/s | -0.40% |
| test_values[td_lambda_return_estimate-True-False] | 95.3040ms | 93.3819ms | 10.7087 Ops/s | 11.0316 Ops/s | -2.93% |
| test_values[vec_td_lambda_return_estimate-True-False] | 2.0857ms | 1.8038ms | 554.3923 Ops/s | 552.0846 Ops/s | +0.42% |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 25.6183ms | 25.3779ms | 39.4043 Ops/s | 39.1497 Ops/s | +0.65% |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 0.8993ms | 0.7475ms | 1.3378 KOps/s | 1.3256 KOps/s | +0.92% |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.7588ms | 0.6974ms | 1.4339 KOps/s | 1.4443 KOps/s | -0.72% |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 1.5506ms | 1.4946ms | 669.0898 Ops/s | 673.6982 Ops/s | -0.68% |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 0.9890ms | 0.7183ms | 1.3922 KOps/s | 1.3950 KOps/s | -0.20% |
| test_dqn_speed | 8.0848ms | 1.5029ms | 665.3943 Ops/s | 612.0605 Ops/s | **+8.71%** |
| test_ddpg_speed | 4.4461ms | 3.3759ms | 296.2137 Ops/s | 294.9319 Ops/s | +0.43% |
| test_sac_speed | 10.4121ms | 9.5279ms | 104.9551 Ops/s | 104.9302 Ops/s | +0.02% |
| test_redq_speed | 18.4687ms | 16.9026ms | 59.1625 Ops/s | 59.8014 Ops/s | -1.07% |
| test_redq_deprec_speed | 13.9432ms | 13.0756ms | 76.4782 Ops/s | 75.6222 Ops/s | +1.13% |
| test_td3_speed | 19.2978ms | 9.7598ms | 102.4607 Ops/s | 102.7311 Ops/s | -0.26% |
| test_cql_speed | 32.5304ms | 31.4896ms | 31.7566 Ops/s | 31.0982 Ops/s | +2.12% |
| test_a2c_speed | 8.2376ms | 7.1167ms | 140.5143 Ops/s | 138.2081 Ops/s | +1.67% |
| test_ppo_speed | 8.4743ms | 7.4999ms | 133.3355 Ops/s | 136.2593 Ops/s | -2.15% |
| test_reinforce_speed | 7.3595ms | 6.1454ms | 162.7234 Ops/s | 162.2473 Ops/s | +0.29% |
| test_iql_speed | 28.5740ms | 27.3240ms | 36.5978 Ops/s | 36.5266 Ops/s | +0.19% |
| test_sample_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 2.9214ms | 2.4887ms | 401.8229 Ops/s | 398.8449 Ops/s | +0.75% |
| test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 4.1663ms | 2.6940ms | 371.1988 Ops/s | 336.0483 Ops/s | **+10.46%** |
| test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3.7173ms | 2.6592ms | 376.0521 Ops/s | 374.1238 Ops/s | +0.52% |
| test_sample_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.1486ms | 2.5072ms | 398.8556 Ops/s | 397.6370 Ops/s | +0.31% |
| test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3.6310ms | 2.6726ms | 374.1740 Ops/s | 333.1903 Ops/s | **+12.30%** |
| test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 3.9533ms | 2.6917ms | 371.5122 Ops/s | 374.5692 Ops/s | -0.82% |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 3.1317ms | 2.5140ms | 397.7662 Ops/s | 398.7505 Ops/s | -0.25% |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 3.8354ms | 2.6796ms | 373.1902 Ops/s | 374.0578 Ops/s | -0.23% |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 4.2158ms | 2.6928ms | 371.3653 Ops/s | 372.8537 Ops/s | -0.40% |
| test_iterate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 3.2418ms | 2.5084ms | 398.6537 Ops/s | 397.3889 Ops/s | +0.32% |
| test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 3.7603ms | 2.6939ms | 371.2152 Ops/s | 372.9937 Ops/s | -0.48% |
| test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3.5342ms | 2.6973ms | 370.7471 Ops/s | 372.8397 Ops/s | -0.56% |
| test_iterate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.1893ms | 2.5171ms | 397.2884 Ops/s | 395.7339 Ops/s | +0.39% |
| test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3.8802ms | 2.6910ms | 371.6108 Ops/s | 372.9256 Ops/s | -0.35% |
| test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 3.7666ms | 2.6952ms | 371.0277 Ops/s | 372.9339 Ops/s | -0.51% |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 3.2146ms | 2.5249ms | 396.0539 Ops/s | 400.0923 Ops/s | -1.01% |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 3.8503ms | 2.6997ms | 370.4099 Ops/s | 371.8926 Ops/s | -0.40% |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 3.8711ms | 2.7008ms | 370.2548 Ops/s | 372.7423 Ops/s | -0.67% |
| test_populate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 0.1945s | 18.7925ms | 53.2128 Ops/s | 54.3001 Ops/s | -2.00% |
| test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 0.1224s | 15.1014ms | 66.2191 Ops/s | 58.7659 Ops/s | **+12.68%** |
| test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 0.1183s | 17.0335ms | 58.7080 Ops/s | 58.9240 Ops/s | -0.37% |
| test_populate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 0.1200s | 17.1001ms | 58.4791 Ops/s | 58.3218 Ops/s | +0.27% |
| test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 0.1195s | 17.1781ms | 58.2135 Ops/s | 66.9865 Ops/s | **-13.10%** |
| test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 0.1191s | 17.1885ms | 58.1785 Ops/s | 58.3511 Ops/s | -0.30% |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 0.1188s | 17.1288ms | 58.3812 Ops/s | 58.6521 Ops/s | -0.46% |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 0.1201s | 17.1114ms | 58.4405 Ops/s | 58.5974 Ops/s | -0.27% |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 0.1184s | 17.0642ms | 58.6022 Ops/s | 58.3415 Ops/s | +0.45% |

@vmoens added the performance label Dec 14, 2023
.pre-commit-config.yaml (outdated, resolved)
torchrl/modules/tensordict_module/rnn.py (outdated, resolved)
@@ -1342,7 +1346,7 @@ def forward(self, tensordict: TensorDictBase):
 # if splits is not None:
 #     value = torch.nn.utils.rnn.pack_padded_sequence(value, splits, batch_first=True)
 if is_init.any() and hidden is not None:
-    hidden[is_init] = 0
+    hidden = torch.where(is_init, 0, hidden)
Contributor Author

@albertbou92 this too
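
The change in the hunk above swaps an in-place masked assignment for an out-of-place `torch.where`. The PR does not state the reason; presumably out-of-place ops are easier for torch.compile and vmap to handle. The sketch below only checks that the two forms give the same values on toy data. The shapes are my own assumption (a 1-D `is_init` over the batch and a 2-D `hidden`), so the mask is unsqueezed to broadcast over the feature dimension.

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(4, 3)                                # assumed shape: (batch, features)
is_init = torch.tensor([True, False, True, False])        # assumed shape: (batch,)

# In-place masked assignment (pattern of the removed line): mutates the tensor.
hidden_inplace = hidden.clone()
hidden_inplace[is_init] = 0

# Out-of-place selection (pattern of the added line): returns a new tensor.
mask = is_init.unsqueeze(-1)                               # (batch, 1), broadcasts over features
hidden_where = torch.where(mask, 0.0, hidden)

print(torch.equal(hidden_inplace, hidden_where))
```

Note that `hidden` itself is left untouched by the `torch.where` version, which matters if other code still holds a reference to the original tensor.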

@vmoens marked this pull request as ready for review December 14, 2023 14:47
torchrl/objectives/sac.py (outdated, resolved)
torchrl/objectives/sac.py (outdated, resolved)
torchrl/objectives/sac.py (outdated, resolved)
torchrl/objectives/sac.py (outdated, resolved)
@vmoens merged commit b3d2aa6 into main Dec 15, 2023
55 of 63 checks passed
@vmoens deleted the faster-rnn branch December 15, 2023 11:58
 prob = dist.probs
-log_prob = torch.log(torch.where(prob == 0, 1e-8, prob))
+log_prob = prob.clamp_min(torch.finfo(prob.dtype).resolution)
Contributor

@vmoens Where did the log go?

Contributor Author

You're right!
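
The exchange above points out that the replacement line clamps the probabilities but no longer takes the log. A minimal sketch of what the corrected expression would presumably look like follows. This is an assumption: the fix that was eventually committed is not shown in this conversation, and the toy `prob` tensor stands in for `dist.probs`.

```python
import math

import torch

prob = torch.tensor([0.0, 0.25, 0.75])          # stand-in for dist.probs
eps = torch.finfo(prob.dtype).resolution        # smallest decimal step for this dtype

# Clamp first so that log never sees an exact zero, then take the log.
log_prob = torch.log(prob.clamp_min(eps))

print(log_prob)
```

Compared with the removed line, clamping avoids the `prob == 0` comparison and scales the floor with the dtype instead of hard-coding 1e-8.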

Labels: CLA Signed, performance
4 participants