[release/2.3] Fix miopenStatusInternalError caused in new ROCm6.0 CI docker images #126942

Merged: 3 commits into pytorch:release/2.3, May 23, 2024

jithunnair-amd (Collaborator) commented May 23, 2024

Fixes the MIOpen SQLite error observed when trying to open the database file:

inductor/test_torchinductor.py::GPUTests::test_alexnet_prefix_cuda MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/MLOpen/src/sqlite_db.cpp:229: Internal error while accessing SQLite database: unable to open database file 

This was observed when recent changes to bump the Triton commit triggered a rebuild of the CI base Docker images. These errors were first observed during the ROCm 6.0 CI upgrade, and a workaround was put in place at that time.

However, that workaround apparently did not take effect during the rebuild, so this PR attempts a (better?) workaround: setting journal_mode to delete instead of off. For some reason, this works for the gfx90a and gfx908 kdbs, whose journal_mode is originally set to wal (write-ahead logging), a mode that requires write permissions for the user invoking MIOpen (i.e., jenkins). This PR also adds logic to check that the final journal_mode is either delete or off, either of which should be sufficient to get around the permission error.
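
The description above does not include the change itself. As a rough illustration only, the idea of switching journal_mode and then verifying the result could be sketched with Python's standard `sqlite3` module. The helper name `set_journal_mode` and the `ALLOWED_MODES` set are hypothetical and not taken from the PR; the actual fix may well use a different tool (for example, the `sqlite3` command-line shell in a Docker install script).

```python
import sqlite3

# Modes that, per the PR description, avoid the WAL write-permission problem.
# (Hypothetical constant name, for illustration only.)
ALLOWED_MODES = {"delete", "off"}


def set_journal_mode(db_path: str, mode: str = "delete") -> str:
    """Set journal_mode on a SQLite file and verify the resulting mode.

    Returns the final journal mode reported by SQLite. Raises RuntimeError
    if the database ends up in a mode other than "delete" or "off".
    """
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported journal mode: {mode!r}")

    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA journal_mode returns one row holding the mode now in effect,
        # which may differ from the requested one if the switch failed.
        row = conn.execute(f"PRAGMA journal_mode = {mode};").fetchone()
        final_mode = row[0].lower()
    finally:
        conn.close()

    if final_mode not in ALLOWED_MODES:
        raise RuntimeError(
            f"{db_path}: journal_mode is {final_mode!r}, expected delete or off"
        )
    return final_mode
```

Such a helper would have to run at image build time, as a user with write access to the kdb files, so that the unprivileged user running MIOpen later never has to create WAL side files.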

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang


pytorch-bot bot commented May 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126942

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a1aed3a with merge base 86a2d67:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the ciflow/rocm, module: rocm, and topic: not user facing labels May 23, 2024
jithunnair-amd (Collaborator, Author) commented:

@atalman This PR was tested with your changes from #126890 and all ROCm tests passed! https://github.com/pytorch/pytorch/actions/runs/9200896260/job/25311196428

jithunnair-amd changed the title from "Fix miopenStatusInternalError caused in new ROCm6.0 CI docker images" to "[release/2.3] Fix miopenStatusInternalError caused in new ROCm6.0 CI docker images" May 23, 2024
jithunnair-amd marked this pull request as ready for review May 23, 2024 15:18
jithunnair-amd requested review from a team and jeffdaily as code owners May 23, 2024 15:18
jithunnair-amd (Collaborator, Author) commented:

@atalman All ROCm CI passed. Ready to merge

atalman merged commit 661c3de into pytorch:release/2.3 May 23, 2024
124 of 125 checks passed
jithunnair-amd deleted the change_miopen_journal_mode branch June 4, 2024 15:42