[CI] Fix olddeps / opt-deps / gym smoke tests broken by #3738 and #3704 #3739
Merged
Conversation
* rnn.py: gate ``_has_torch_scan`` on ``torch.__version__ >= 2.6`` instead of
``importlib.util.find_spec("torch._higher_order_ops.scan")``. ``find_spec``
eagerly imports the (missing) ``torch._higher_order_ops`` parent on
torch < 2.4 and raises ``ModuleNotFoundError`` instead of returning ``None``,
which broke every gym smoke test at import time (torch 2.0.1 stack).
* olddeps install.sh: install tensordict from source on PR / nightly runs and
from PyPI only on ``release/*`` branches, matching every other CI job
(``linux/scripts/run_all.sh``, ``linux_distributed/scripts/install.sh``).
The previous ``tensordict>=0.12.0,<0.13.0`` pin (introduced in #3266 to
guard against tensordict main dropping Python 3.10) froze us below the
``generator`` kwarg added in tensordict #1689, which #3704 forwards from
``SafeProbabilisticModule``, causing 3660 ``TypeError`` failures.
* test-linux.yml: drop the now-unused ``TORCHRL_TENSORDICT_SPEC`` export.
* test_distributions.py: ``test_tanhdelta_inv_ones`` was flaky on CUDA
float32 -- with 4M unseeded ``randn`` samples a handful of draws had
``|x|`` past ``atanh(1 - finfo.resolution) ≈ 7.25`` and could not roundtrip
through ``SafeTanhTransform``. Seed for determinism and clamp inputs into
the non-saturated region.
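The gate described in the first bullet can be sketched as below. ``has_torch_scan`` and ``_parse`` are hypothetical standalone helpers; the actual ``rnn.py`` code compares with ``packaging.version.parse``, for which the tuple parser here is a minimal stand-in:

```python
def _parse(v: str) -> tuple:
    # Minimal stand-in for packaging.version.parse: keep only the leading
    # numeric release segment ("2.7.0.dev20250101+cu121" -> (2, 7, 0)).
    parts = []
    for p in v.split("+")[0].split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)


def has_torch_scan(torch_version: str) -> bool:
    # Gate on the version string instead of probing the submodule:
    # find_spec("torch._higher_order_ops.scan") imports the parent package
    # first, so on torch < 2.4 (where torch._higher_order_ops is absent) it
    # raises ModuleNotFoundError instead of returning None.
    return _parse(torch_version) >= (2, 6)
```

Probing a dotted name with ``find_spec`` is only safe when every parent package already exists, which is exactly what the torch 2.0.1 job violated.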
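The install.sh bullet's branch logic might look roughly like this; the function name and the release/pip arguments are illustrative, not the exact script contents:

```shell
install_tensordict() {
  # $1: "1" on release/* branches, "0" on PR / nightly runs
  # $2: pip command (pass "echo" for a dry run)
  release="$1"
  pip="${2:-pip}"
  if [ "$release" = "1" ]; then
    # release/* branches: stable tensordict wheel from PyPI
    "$pip" install tensordict
  else
    # PR / nightly runs: build tensordict from source so changes on main
    # (such as the generator kwarg) are available before a PyPI release
    "$pip" install "pybind11[global]"
    "$pip" install git+https://github.com/pytorch/tensordict.git
  fi
}

# Dry-run demo: print the commands instead of executing pip.
install_tensordict 0 echo
```

The ``pybind11[global]`` line matters only on the source-build path, which is why dropping the source build in #3266 also dropped that install.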
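The saturation bound in the last bullet checks out numerically. This sketch (with a hypothetical helper name, and 1e-6 standing in for ``torch.finfo(torch.float32).resolution``) reproduces the ≈ 7.25 figure and the clamp:

```python
import math

# float32 resolution (torch.finfo(torch.float32).resolution) is 1e-6.
FLOAT32_RESOLUTION = 1e-6


def clamp_into_roundtrip_region(x: float, resolution: float = FLOAT32_RESOLUTION) -> float:
    # Beyond atanh(1 - resolution), tanh(x) rounds to 1.0 at that resolution,
    # so atanh(tanh(x)) can no longer recover x.
    bound = math.atanh(1.0 - resolution)  # ~= 7.2543 for float32
    return max(-bound, min(bound, x))
```

The actual test fix also seeds the RNG, so the 4M-sample draw is deterministic on top of being clamped.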
Same bug pattern as the ``torch._higher_order_ops.scan`` probe: Triton 2.0
(shipped with torch 2.0.1 on the gym smoke matrix) lacks
``triton.language.extra``, so ``find_spec("triton.language.extra.libdevice")``
eagerly imports the missing parent and raises ``ModuleNotFoundError`` at
torchrl import time. Read the version from package metadata and gate on
``triton >= 2.2`` instead. Applied identically in ``rnn.py`` and
``_rnn_triton.py``.
Summary
Three currently red CI jobs on every PR / main:
* ``ModuleNotFoundError: No module named 'torch._higher_order_ops'``. Caused by
  [Feature][Performance] Triton backend for GRU / LSTM with intermediate resets #3738,
  which added ``_has_torch_scan = importlib.util.find_spec("torch._higher_order_ops.scan")``
  to ``rnn.py``. The job installs torch 2.0.1, and ``find_spec`` eagerly imports
  the missing parent and raises instead of returning ``None``.
* ``TypeError: ProbabilisticTensorDictModule.__init__() got an unexpected keyword argument 'generator'``.
  [Feature] Forward generator kwarg through ProbabilisticActor #3704 unconditionally
  forwards ``generator=`` from ``SafeProbabilisticModule`` to the tensordict parent
  class, but the olddeps job was pinned to ``tensordict>=0.12.0,<0.13.0`` (introduced
  in [BugFix] Fix old pytorch dependencies #3266 to guard against tensordict main
  dropping older Pythons), and that release line does not include the ``generator``
  kwarg added in tensordict [BugFix] MaxValueWriter cuda compatibility #1689.
* ``test/test_distributions.py::TestDelta::test_tanhdelta_inv_ones``: 4M unseeded
  float32 ``randn`` values occasionally exceed ``atanh(1 - finfo.resolution) ≈ 7.25``,
  so ``SafeTanhTransform`` saturates and the inv∘fwd roundtrip can't recover the input.
Changes
* ``torchrl/modules/tensordict_module/rnn.py`` — gate ``_has_torch_scan`` on
  ``version.parse(torch.__version__) >= version.parse("2.6.0")`` (matching the
  existing ``@implement_for("torch", "2.6.0", ...)`` decorator on ``_scan``).
* ``.github/unittest/linux_olddeps/scripts_gym_0_13/install.sh`` — install
  tensordict from source on PR/nightly runs, from PyPI only on ``release/*``,
  matching ``linux/scripts/run_all.sh`` and ``linux_distributed/scripts/install.sh``.
  Restores the ``pybind11[global]`` install needed by the source build.
* ``.github/workflows/test-linux.yml`` — drop the now-unused
  ``TORCHRL_TENSORDICT_SPEC`` export.
* ``test/test_distributions.py`` — seed and clamp inputs in ``test_tanhdelta_inv_ones``
  so the test stays inside the SafeTanh roundtrip region.
Trade-off
Going back to source builds for olddeps re-exposes the brittleness #3266 was
guarding against: the day tensordict main drops Python 3.10 the olddeps job
breaks until we bump the Python version or temporarily re-pin. tensordict main
currently still declares ``requires-python = ">=3.10"``, so this resolves today.
Test plan
* ``unittests-gym`` passes (every gym version)
* ``tests-olddeps`` passes (no more 3660 ``TypeError`` failures)
* ``tests-optdeps`` passes (no flake in ``test_tanhdelta_inv_ones``)