ddaspit
left a comment
Why is there such a large difference in runtime?
Reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @johnml1135)
Is it because it's not running on John's smaller GPUs now? What kind of GPUs does the autoscaler spin up, Matthew?
We talked about it in our standup; recording the reason here. There is a startup cost for launching a GCP instance for the first SMT job, and again for the first NMT job. The GPU type that the autoscaler spins up for the NMT jobs is an A100 40GB.
Just saying it out loud:
(If I could add)
Point taken. There is only one GPU available for running these tests (my 3090). If, on the other hand, we use the AQuA server for the CPU jobs and the autoscaler for the GPU jobs, we can still save some time and avoid the bottlenecks. We would not save as much money, but honestly, I care more about short run times than low cost. In October, we had 27 commits to master (a pretty high month). Extrapolating from that, we could spend 27 * 12 * 0.20 (assuming half the time is spent using GPUs) = $65 in total GPU cost per year for running these tests, hardly breaking the bank.
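For reference, a quick sketch of that estimate, reading the figure as 27 commits per month times 12 months times 0.20. It assumes the 0.20 is a per-run GPU cost in dollars, which the comment does not state explicitly; the variable names are illustrative only.

```python
# Rough annual GPU cost for the E2E tests, using the numbers from the comment.
commits_per_month = 27   # October count, described as a high month
months_per_year = 12
gpu_cost_per_run = 0.20  # ASSUMPTION: dollars of GPU time per test run

runs_per_year = commits_per_month * months_per_year
annual_gpu_cost = runs_per_year * gpu_cost_per_run

print(f"{runs_per_year} runs/year, about ${annual_gpu_cost:.2f} in GPU cost")
```

Since October was a high month, this is more likely an overestimate than an underestimate.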
Enkidu93
left a comment
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
mshannon-sil
left a comment
If we run the CPU jobs on the AQuA server, might we still run into bottlenecks if the AQuA server is full?
Also, the CPU jobs are much cheaper, since I'm running them on an e2-standard-32 machine type rather than an accelerator-optimized machine type. It currently has a spot price of $0.4607/hr (and $1.3147/hr on demand).
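To put the two prices side by side, here is a minimal sketch using only the hourly rates quoted above. The helper `run_cost` and the 30-minute example duration are hypothetical, and actual billing granularity on GCP is not modeled.

```python
# Hourly prices for e2-standard-32 as quoted in the comment (USD per hour).
SPOT_PRICE_PER_HOUR = 0.4607
ON_DEMAND_PRICE_PER_HOUR = 1.3147

def run_cost(duration_minutes: float, price_per_hour: float) -> float:
    """Cost of keeping one instance up for the given number of minutes."""
    return duration_minutes / 60.0 * price_per_hour

# Fraction of the on-demand price that the spot price represents.
spot_fraction = SPOT_PRICE_PER_HOUR / ON_DEMAND_PRICE_PER_HOUR

# Hypothetical 30-minute CPU job, for illustration only.
spot_cost = run_cost(30, SPOT_PRICE_PER_HOUR)
on_demand_cost = run_cost(30, ON_DEMAND_PRICE_PER_HOUR)

print(f"spot is {spot_fraction:.0%} of on-demand")
print(f"30 min: ${spot_cost:.4f} spot vs ${on_demand_cost:.4f} on demand")
```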
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
The differences are likely less than the cost of us talking about it. Let's just do it.
This PR moves the E2E tests to the autoscaler. The testing step took 22m 43s (24m 51s total), compared to 10m 4s (12m 0s total) if not using the autoscaler. Initial startup time is really only a factor for the first NMT job and the first SMT job, since the autoscaler can reuse instances.
Also, the reason the environment variables I added have CLEARML in all caps was to distinguish that they're not official environment variables from ClearML, which uses ClearML to prepend its own environment variables. But I can change the environment variable names to match or to something else entirely if that's preferable.
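The timings quoted in the PR description can be compared with a small sketch like the one below. The helper names are hypothetical; only the four durations come from the description.

```python
import re

def parse_duration(text: str) -> int:
    """Convert a string such as '22m 43s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)m\s*(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(part) for part in match.groups())
    return minutes * 60 + seconds

def format_duration(total_seconds: int) -> str:
    """Convert a number of seconds back into the 'Xm Ys' form."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}m {seconds}s"

# Testing-step durations from the PR description.
with_autoscaler = parse_duration("22m 43s")
without_autoscaler = parse_duration("10m 4s")

overhead = with_autoscaler - without_autoscaler
slowdown = with_autoscaler / without_autoscaler

print(f"extra time with autoscaler: {format_duration(overhead)}")
print(f"slowdown factor: {slowdown:.2f}x")
```

The same helpers applied to the totals (24m 51s vs 12m 0s) give the end-to-end difference.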