Set up GPU CI #1067

Closed
5 of 6 tasks
ivirshup opened this issue Jul 21, 2023 · 19 comments

@ivirshup
Member

ivirshup commented Jul 21, 2023

Please describe your wishes and possible alternatives to achieve the desired result.

What does GPU CI need?

Will be partially solved by: #1066

cc @Intron7

@ivirshup
Member Author

New issue: caching. My understanding is that GitHub Actions caching stores all cached data on GitHub's servers. This doesn't reduce data ingress when our job isn't running on GitHub's infrastructure. Right now this job downloads about 1 GB of data per run.

We should try to enable caching through AWS.
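
One rough idea (a sketch only – the bucket name anndata-ci-cache is hypothetical, and it assumes the runner has the AWS CLI plus credentials for that bucket) would be to sync the pip cache to an S3 bucket in the same region:

steps:
  - name: Restore pip cache from S3
    run: aws s3 sync s3://anndata-ci-cache/pip ~/.cache/pip --no-progress || true
  - name: Install and run tests
    run: |
      python -m pip install -e ".[test]"   # extras name is a placeholder
      python -m pytest
  - name: Save pip cache back to S3
    if: always()
    run: aws s3 sync ~/.cache/pip s3://anndata-ci-cache/pip --no-progress

Transfers between EC2 and S3 in the same region never leave AWS, so even a naive sync like this should beat re-downloading ~1 GB per run.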

@ivirshup
Member Author

So the billing was a little higher than expected. Basically about $2 for the one PR.

Admittedly this PR had a lot of troubleshooting pushes – about 29 commits had checks start. However, a number of these checks could have been cancelled by follow-up pushes, which I'll add.
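
(For reference, cancelling superseded runs is a one-off addition at the top of the workflow – this is standard GitHub Actions syntax, nothing specific to our setup:)

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

With that, a new push to the PR cancels any GPU run still in progress for the same branch.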

Right now the entire CI run (as reported by GitHub Actions) takes about 4 minutes:

[screenshot of the GitHub Actions run summary]

But our billing console says it took about 12 minutes. So what's up with that? Is our billing console reporting time in an unexpected way, or is the machine running for longer than GitHub Actions knows?

Any thoughts @Zethson @aktech?

@Zethson
Member

Zethson commented Jul 28, 2023

I wouldn't be surprised if fetching the image and connecting to GitHub Actions takes some time? But I guess @aktech knows this better...

@aktech
Contributor

aktech commented Jul 28, 2023

But our billing console says it took about 12 minutes. So what's up with that? Is our billing console reporting time in an unexpected way, or is the machine running for longer than GitHub Actions knows?

GitHub only reports the time it took to run the job on it, nothing before or after.

There are the following times to consider:

  • If you're using NVIDIA images, they take a very long time to start (they have some internal provisioning, I don't know a lot about it)
  • There is some time for cirun to provision/initialize the GitHub Actions runner (less than a minute)
  • There is a variable delay (10 s to 2 min) for GitHub to start the job on the runner; it's a known GitHub issue: "Very slow queuing with plenty of idle runners available" (actions/runner#676)

Using your own custom AMI (it's basically just spinning up an Ubuntu machine with a GPU, installing the NVIDIA drivers yourself, and creating an AMI from it) would reduce the spin-up time significantly. I can get on a call to help with this if required.
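
For example, once a GPU instance has been set up by hand with the NVIDIA drivers installed, an AMI can be created from it with something like this (the instance ID and names are placeholders):

aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "gpu-ci-ubuntu-nvidia" \
  --description "Ubuntu + NVIDIA drivers for the cirun GPU runner" \
  --region eu-north-1

The resulting AMI ID then goes into the machine_image field of the cirun runner config.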

@flying-sheep
Member

Here are the docs on how to set up custom images with CiRun: https://docs.cirun.io/custom-images/cloud-custom-images

@aktech
Contributor

aktech commented Jul 31, 2023

You want the second section in that doc: https://docs.cirun.io/custom-images/cloud-custom-images#aws-building-custom-images-with-user-modification (the first one uses the NVIDIA image)

@ivirshup
Member Author

Thanks for the info @aktech! I've been able to get something running using that. I had been trying to create a Dockerfile for this from some AWS docs, but it turns out it gets more complicated to generate Dockerfiles for GPU setups 😢.

Right now I am trying to see how long the instance was actually around for, but I'm not sure where I can see logs for this. I think our setup is a little obfuscated here, and the view I have doesn't seem to update quickly.

@aktech
Contributor

aktech commented Jul 31, 2023

Thanks for the info @aktech! I've been able to get something running using that. I had been trying to create a Dockerfile for this from some AWS docs, but it turns out it gets more complicated to generate Dockerfiles for GPU setups 😢.

Are you planning to run the tests inside a Docker container? You'd still need nvidia/cuda in the base VM image, I think.

Right now I am trying to see how long the instance was actually around for, but I'm not sure where I can see logs for this. I think our setup is a little obfuscated here, and the view I have doesn't seem to update quickly.

Currently I don't think we have that statistic in the UI anywhere, but I can consider adding it to the check run. Meanwhile, while the instance is still visible in the AWS dashboard (it usually remains visible for some time), you can run the following commands to see how long it was alive for:

aws ec2 describe-instances --instance-ids INSTANCE_ID --query 'Reservations[0].Instances[0].LaunchTime' --region eu-north-1
aws ec2 describe-instances --instance-ids INSTANCE_ID --query 'Reservations[0].Instances[0].StateTransitionReason' --region eu-north-1

@ivirshup
Member Author

I had been trying to create a Dockerfile for this from some AWS docs, but it turns out it gets more complicated to generate Dockerfiles for GPU setups 😢.
Are you planning to run the tests inside a Docker container?

No, but this was just following the Amazon ECR instructions for "how to create an image".

I believe you need nvidia-container-toolkit (an extension to Docker) to do this kind of thing.

You'd still need nvidia/cuda in the base VM image, I think.

I would like to have a more programmatic way to construct these containers, so I will look into this. Thanks!
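
For reference, a minimal sketch of what such a container could look like (the CUDA base tag and the [test] extra are assumptions, and running it still requires nvidia-container-toolkit on the host):

# Hypothetical GPU test image – the base tag is an example
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Python plus the package under test
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
COPY . /src
RUN python3 -m pip install "/src[test]"

# Run the test suite by default
CMD ["python3", "-m", "pytest", "/src"]

It would be built with docker build -t gpu-tests . and run with docker run --rm --gpus all gpu-tests, where the --gpus flag relies on nvidia-container-toolkit being installed on the host.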


I found out that I had managed to log into a scope with very little access, which is what was making it so difficult to see anything... Still no idea how I did that – maybe via the Rackspace site? But now I can look at CloudTrail, and I have set up AWS Config, so I think we can use that.

I think times are down now. It was taking about 12 min a run last Friday (according to Rackspace); now it's more like 4.5 min (according to AWS). GitHub still says it's more like 2 min, so there's room for improvement, but it's still better. Of course it will be good to compare measurements from the same place.

@aktech, do you have any suggestions for how we could do caching for our CI? A non-trivial amount of time is spent building wheels and downloading things, which I think we could get down. However, I don't think that GitHub Actions caching is going to help a ton here since it's on GitHub's servers.

@aktech
Contributor

aktech commented Jul 31, 2023

I believe you need nvidia-container-toolkit (an extension to docker) to do this kind of thing.

Yes, correct.

I would like to have a more programmatic way to construct these containers, so I will look into this. Thanks!

Yep, makes sense. We would have built these for customers, but NVIDIA's license doesn't allow distribution. If we had a CI setup for automating this you could have used that, but currently we don't have one public; it's a WIP.

@aktech, do you have any suggestions for how we could do caching for our CI? A non-trivial amount of time is spent building wheels and downloading things, which I think we could get down. However, I don't think that GitHub Actions caching is going to help a ton here since it's on GitHub's servers.

I didn't see that, which workflow? This one seems to take less than 2.5 minutes: https://github.com/scverse/anndata/actions/runs/5716599171/job/15494250907?pr=1084

@ivirshup
Member Author

ivirshup commented Aug 1, 2023

I didn't see that, which workflow?

It's not that it takes a long time, it's that it takes longer to set up than to run the tests, so I'd like to bring that down.

@ivirshup
Member Author

ivirshup commented Aug 1, 2023

Triggering GPU CI

So after billing was a little higher than expected (which may be fixed, but I need to confirm once billing updates), we decided not to run CI on every commit. We set the action to run on workflow_dispatch so it would be manually triggered, but it seems we can't use this for branch protection since a workflow_dispatch run doesn't count towards a required check.

So, we need something else to trigger this. It seems our options are:

  • a label
  • a comment
  • an approving review
  • the merge queue

Implementation for the label option (I think):

on:
  pull_request:
    types:
      - labeled
      - edited
      - synchronize

jobs:
  test:
    if: ${{ contains(github.event.pull_request.labels.*.name, 'run-gpu-ci') }}

Currently thinking a label makes the most sense, since we can easily add and remove it, and it isn't necessarily linked with merging. It could be that either a label or auto-merge is enough.

@flying-sheep
Member

flying-sheep commented Aug 1, 2023

yup, as I thought. except for the merge queue, all of these of course mean that it’ll run for all commits after the label/comment/whatever is added.

one option would be to have the workflow remove the label again:

  1. for each commit, all tests except for GPU tests are run
  2. we add the “run GPU tests once” label
  3. the workflow job first removes the label again, then runs the GPU tests
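
A rough sketch of step 3, assuming the label is called run-gpu-ci and the runner label matches the one in our cirun config (untested):

jobs:
  gpu-tests:
    if: ${{ contains(github.event.pull_request.labels.*.name, 'run-gpu-ci') }}
    runs-on: cirun-aws-gpu
    steps:
      - name: Remove the trigger label again
        uses: actions/github-script@v6
        with:
          script: |
            await github.rest.issues.removeLabel({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              name: "run-gpu-ci",
            })
      # ...checkout, install, and run the GPU tests...

Removing the label up front means each new batch of GPU runs has to be requested deliberately by re-adding it.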

@Intron7
Member

Intron7 commented Aug 1, 2023

a comment

@ivirshup the RAPIDS team does this with a comment from a member; this triggers a CI run. But from what I can tell they use workflow_dispatch.

@flying-sheep
Member

they use workflow_dispatch

Let’s avoid this if possible. It might be possible to manually call the GitHub API to list all PRs for a branch and then create and update a check for the PR being found, but I’d rather not go down that road when it looks like there’s a much simpler solution.

@ivirshup
Member Author

ivirshup commented Aug 1, 2023

I think there is value in giving a PR the green light to use paid CI without needing to approve each individual commit.

The one-off case could be useful too, but I think triggering via a comment makes more sense in this case.

@ivirshup
Member Author

ivirshup commented Aug 1, 2023

@Intron7, I think RAPIDS is using API calls from checks to trigger workflow_dispatch. But they also have a pretty involved CI system: https://github.com/rapidsai/cudf/blob/branch-23.10/.github/workflows/pr.yaml

@aktech
Contributor

aktech commented Aug 2, 2023

@ivirshup Another tip: To reduce cost you can use preemptible (spot) instances:

runners:
  - name: aws-gpu-runner
    cloud: aws
    instance_type: g4dn.xlarge
    machine_image: ami-067a4ba2816407ee9
    region: eu-north-1
    preemptible:
      - true
      - false
    labels:
      - cirun-aws-gpu

Doc: https://docs.cirun.io/reference/fallback-runners#example-3-preemptiblenon-preemptible-instances

This would try to spin up a preemptible instance first, and if that fails, it will spin up an on-demand instance.
Spot instances are up to 90% cheaper (around 50% on average); current prices in a couple of regions (https://aws.amazon.com/ec2/spot/pricing/):

  • us-east-2: g4dn.xlarge – Spot $0.1578 per hour, On-Demand $0.3418 per hour
  • eu-north-1: g4dn.xlarge – Spot $0.1674 per hour, On-Demand $0.3514 per hour

@ivirshup
Member Author

ivirshup commented Sep 7, 2023

We've still got a little room for improvement on GPU CI, but I think it's pretty much set up!

Costs per run are now down to about 1 cent for anndata.
