
[CUDAExtension] support all visible cards when building a cudaextension #48891

Closed
wants to merge 8 commits

Conversation

stas00 (Contributor) commented Dec 5, 2020

Currently CUDAExtension assumes that all cards on a machine are of the same type and builds the extension with the compute capability of the 0th card only. This breaks later at runtime if the machine has cards of different types.

Specifically resulting in:

RuntimeError: CUDA error: no kernel image is available for execution on the device

when a card of a type that wasn't compiled for is used (and, to the uninitiated, the error message says very little about what the actual problem is).

My current setup is:

$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
(8, 6)
$ CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_capability())"
(6, 1)

but the extension was getting built with -gencode=arch=compute_80,code=sm_80.
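
For reference, a convenience one-liner (not part of this PR) that lists the capabilities of all visible devices in one go; on the setup above it should print [(8, 6), (6, 1)]:

$ python -c "import torch; print([torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())])"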

This PR:

  • introduces a loop over all devices visible at build time, to ensure the extension will run on all of them (the resulting list is sorted, so the output is easier to debug should a card with a lower capability come last; a rough sketch follows this list)
  • adds +PTX to the last entry of the compute capabilities derived from the local cards (the if not _arch_list: branch) to support other archs
  • adds a digest of my conversation with @ptrblck on Slack in the form of docs, which hopefully helps others figure out which archs to support, how to override the defaults, when and how to add PTX, etc.
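
A rough sketch of the idea behind the first two bullets (illustrative only, not the exact code in torch/utils/cpp_extension.py):

import torch

# Collect the compute capability of every device visible at build time
# (rather than just device 0), deduplicate, and sort so that a card with
# a lower capability cannot end up hidden at the end of the list.
caps = sorted(set(torch.cuda.get_device_capability(i)
                  for i in range(torch.cuda.device_count())))
arch_list = ['{}.{}'.format(*cap) for cap in caps]
# Request PTX for the highest arch as well, so archs not in the list
# can still be JIT-compiled at runtime.
arch_list[-1] += '+PTX'
print(arch_list)  # e.g. ['6.1', '8.6+PTX'] on the mixed setup above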

Please kindly review that my prose is clear and easy to understand.

@ptrblck

ezyang (Contributor) left a comment

Ha! Nice fix.

facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

dr-ci bot commented Dec 5, 2020

💊 CI failures summary and remediations

As of commit b4ecc0d (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


This comment was automatically generated by Dr. CI. Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 13 times.

stas00 (Contributor, Author) commented Dec 8, 2020

Also, I don't quite understand what this does:

num = arch[0] + arch[2]
flags.append(f'-gencode=arch=compute_{num},code=sm_{num}')
if arch.endswith('+PTX'):
    flags.append(f'-gencode=arch=compute_{num},code=compute_{num}')

Where did PTX go, and why do we have a near-duplicate of almost the same nvcc flag, only slightly different (the last line)?


Edit: @mcarilli pointed me to https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#just-in-time-compilation which explains that this is how PTX is encoded in nvcc flags. Quote:

By specifying a virtual code architecture instead of a real GPU, nvcc postpones the assembly of PTX code until application runtime, at which the target GPU is exactly known. For instance, the command below allows generation of exactly matching GPU binary code, when the application is launched on an sm_50 or later architecture.

nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50

So that last line, flags.append(f'-gencode=arch=compute_{num},code=compute_{num}'), is what the +PTX suffix turns into: it embeds PTX for that virtual arch so it can be JIT-compiled on the target GPU at runtime.
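
To make that concrete, here is a walk-through of the snippet above for a single hypothetical entry such as '8.6+PTX' (the capability of the 0th card in this setup, with PTX requested):

arch = '8.6+PTX'
num = arch[0] + arch[2]  # '8' + '6' -> '86'
# emitted nvcc flags:
#   -gencode=arch=compute_86,code=sm_86        (real SASS for sm_86 cards)
#   -gencode=arch=compute_86,code=compute_86   (embedded PTX, JIT-compiled on newer archs)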

facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

stas00 deleted the cuda-ext-gpu-mix branch December 8, 2020 23:10
facebook-github-bot (Contributor)
@ezyang merged this pull request in 02b6385.
