CUDA 12.4 ARM wheel integration to CD - nightly build #126174

Closed
wants to merge 36 commits into from

Collaborator

@tinglvv tinglvv commented May 14, 2024

Rebasing #124112.
That PR had too many conflicting files, so I am starting a new PR.

Test pytorch/builder#1775 (merged) for ARM wheel addition
Test pytorch/builder#1828 (merged) for setting MAX_JOBS

Current issue to follow up:
#126980

cc @atalman @malfet @ptrblck @nWEIdia @Aidyn-A


pytorch-bot bot commented May 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126174

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 42bb184 with merge base f0366de (image):

UNSTABLE - The following jobs failed, but the failures were likely due to flakiness present on trunk, and the jobs have been marked as unstable:

  • pull / linux-focal-cuda12.4-py3.10-gcc9 / build (gh) (#127108)
    /var/lib/jenkins/workspace/aten/src/ATen/cuda/CUDASparseDescriptors.h:119:68: error: ‘cusparseStatus_t cusparseCreateBsrsm2Info(bsrsm2Info**)’ is deprecated: The routine will be removed in the next major release [-Werror=deprecated-declarations]
  • pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / build (gh) (#127104)
    /var/lib/jenkins/workspace/aten/src/ATen/cuda/CUDASparseDescriptors.h:119:68: error: ‘cusparseStatus_t cusparseCreateBsrsm2Info(bsrsm2Info**)’ is deprecated: The routine will be removed in the next major release [-Werror=deprecated-declarations]

This comment was automatically generated by Dr. CI and updates every 15 minutes.
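
Both unstable jobs fail on the same diagnostic: a deprecation warning promoted to an error by -Werror=deprecated-declarations. As a rough illustration of how such a log line breaks down, the sketch below splits a GCC-style diagnostic into its parts. The name parse_gcc_diagnostic and the regular expression are assumptions made for this example; they are not part of Dr. CI or any PyTorch tooling.

```python
import re
from typing import NamedTuple, Optional


class Diagnostic(NamedTuple):
    path: str
    line: int
    column: int
    severity: str
    message: str
    flag: Optional[str]


# Hypothetical pattern for "<path>:<line>:<col>: <severity>: <message> [<flag>]".
# The trailing "[-W...]" part is optional because not every diagnostic has one.
_PATTERN = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+):(?P<column>\d+): "
    r"(?P<severity>\w+): (?P<message>.*?)"
    r"(?: \[(?P<flag>-W[^\]]+)\])?$"
)


def parse_gcc_diagnostic(text: str) -> Optional[Diagnostic]:
    """Split one diagnostic line into fields; return None if it does not match."""
    m = _PATTERN.match(text.strip())
    if m is None:
        return None
    return Diagnostic(
        path=m["path"],
        line=int(m["line"]),
        column=int(m["column"]),
        severity=m["severity"],
        message=m["message"],
        flag=m["flag"],
    )
```

Grouping parsed diagnostics by the flag field would show at a glance that the two jobs share one root cause.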

@pytorch-bot added the "topic: not user facing" label (topic category) May 14, 2024
@Aidyn-A added the "ciflow/binaries" label (Trigger all binary build and upload jobs on the PR) May 14, 2024
@tinglvv
Collaborator Author

tinglvv commented May 16, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cuda124-arm-ci onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda124-arm-ci && git pull --rebase)

atalman added a commit to pytorch/test-infra that referenced this pull request May 16, 2024
Trying to explore options to help land pytorch/pytorch#126174.

The current [manywheel-py3_9-cuda-aarch64-build / build](https://github.com/pytorch/pytorch/actions/runs/9112985273/job/25053413689?pr=126174#logs) job takes around 6 hours (building only the sm90 arch). Hence trying a slightly bigger worker to see if we can bring the build time down to a manageable 3-3.5 hours.
@tinglvv
Collaborator Author

tinglvv commented May 17, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cuda124-arm-ci onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda124-arm-ci && git pull --rebase)

@tinglvv tinglvv marked this pull request as ready for review May 22, 2024 01:30
@tinglvv tinglvv requested a review from a team as a code owner May 22, 2024 01:30
@tinglvv tinglvv changed the title from "[draft] [rebase] cuda 124 arm wheel test" to "CUDA 12.4 ARM wheel integration to CI" May 22, 2024
Collaborator

@nWEIdia nWEIdia left a comment


Changes look great!

@@ -71,12 +71,19 @@ jobs:
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
{%- endif %}
{%- if config["gpu_arch_type"] == "cuda-aarch64" %}
Contributor


This approach is good; we just need to remove the else condition with the 210-minute timeout.
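
The template hunk under review branches on config["gpu_arch_type"], and the comment asks for the else branch with the 210-minute timeout to be dropped. A minimal Python sketch of that logic follows, assuming the cuda-aarch64 branch emits a timeout-minutes override and every other configuration simply inherits the workflow default. The helper name render_timeout_lines and the value AARCH64_TIMEOUT_MINUTES = 360 are hypothetical and chosen only for illustration; the actual values live in the workflow template.

```python
from typing import Dict, List

# Assumed value for illustration only; the PR text does not state the override.
AARCH64_TIMEOUT_MINUTES = 360


def render_timeout_lines(config: Dict[str, str]) -> List[str]:
    """Return the extra workflow lines for one build configuration."""
    if config.get("gpu_arch_type") == "cuda-aarch64":
        # Long-running ARM CUDA builds get an explicit override.
        return [f"timeout-minutes: {AARCH64_TIMEOUT_MINUTES}"]
    # No else-branch override: other builds inherit the workflow default.
    return []
```

Dropping the else branch keeps the generated workflows for every other architecture unchanged, so only the aarch64 CUDA jobs pick up a new setting.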

@tinglvv
Collaborator Author

tinglvv commented May 23, 2024

@pytorchbot rebase

@tinglvv
Collaborator Author

tinglvv commented May 24, 2024

@pytorchbot merge -i

@pytorch-bot added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) May 24, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: windows-binary-wheel / wheel-py3_8-cpu-test, windows-binary-conda / conda-py3_8-cuda12_1-test


@tinglvv
Collaborator Author

tinglvv commented May 24, 2024

The 3-10 aarch64 failure is due to a timeout (6-hour limit), possibly caused by network conditions; we can increase timeout-minutes accordingly later. Merging because all the other aarch64 builds are passing. The Windows build failure is unrelated.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-aarch64-binary-manywheel / manywheel-py3_11-cuda-aarch64-build / build

Details for Dev Infra team Raised by workflow job

@tinglvv
Collaborator Author

tinglvv commented May 24, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/126174/head returned non-zero exit code 1

Rebasing (1/34)
Rebasing (2/34)
Rebasing (3/34)
Rebasing (4/34)
Rebasing (5/34)
Rebasing (6/34)
Rebasing (7/34)
Rebasing (8/34)
Rebasing (9/34)
Rebasing (10/34)
Rebasing (11/34)
Auto-merging .github/workflows/generated-linux-binary-manywheel-main.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-main.yml
error: could not apply 5007c3953c3... move up gpu arch type check to above -test definition
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 5007c3953c3... move up gpu arch type check to above -test definition

Raised by https://github.com/pytorch/pytorch/actions/runs/9229584549
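
The rebase log above stops at the first commit that fails to apply and names the conflicting file. As a small sketch of how such output could be scanned, the Python below collects the paths from CONFLICT lines and the commit hash from the "Could not apply" line. The function name summarize_rebase_failure and its return shape are assumptions for this example; this is not how pytorchmergebot processes the log, and only "Merge conflict in" style lines are handled.

```python
import re
from typing import Dict, List, Optional, Union

Summary = Dict[str, Union[List[str], Optional[str]]]


def summarize_rebase_failure(log: str) -> Summary:
    """Extract conflicted paths and the failing commit from git rebase output."""
    conflicts: List[str] = []
    failed_commit: Optional[str] = None
    for raw in log.splitlines():
        line = raw.strip()
        # e.g. "CONFLICT (content): Merge conflict in path/to/file.yml"
        m = re.match(r"CONFLICT \([^)]*\): Merge conflict in (.+)$", line)
        if m:
            conflicts.append(m.group(1))
            continue
        # e.g. "Could not apply 5007c3953c3... commit subject"
        m = re.match(r"Could not apply ([0-9a-f]+)\.\.\.", line)
        if m:
            failed_commit = m.group(1)
    return {"conflicts": conflicts, "failed_commit": failed_commit}
```

Applied to the log in this comment, the summary would point at the generated manywheel workflow file, which suggests regenerating the workflows after resolving the template conflict by hand.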

@tinglvv
Collaborator Author

tinglvv commented May 26, 2024

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 0 checks:


@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@tinglvv
Collaborator Author

tinglvv commented May 27, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request May 28, 2024
rebasing pytorch#124112.
too many conflict files, so starting a new PR.

Test pytorch/builder#1775 (merged) for ARM wheel addition
Test pytorch/builder#1828 (merged) for setting MAX_JOBS

Current issue to follow up:
pytorch#126980

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
Pull Request resolved: pytorch#126174
Approved by: https://github.com/nWEIdia, https://github.com/atalman
Labels: ciflow/binaries (Trigger all binary build and upload jobs on the PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, open source, topic: not user facing (topic category)
6 participants