switch to ported slurm docker cluster #1297
☂️ Python Coverage
Overall Coverage
New Files: No new covered files...
Modified Files: No covered modified files...
great stuff :) I only left a couple of smaller comments.
    run: uv sync --frozen
  - name: Check typing
    if: ${{ matrix.executors == 'multiprocessing' && matrix.python-version == '3.11' }}
    if: ${{ matrix.python-version == '3.11' }}
should we adapt this so that these checks are run against the newest python version?
Idk, but we could create a follow-up PR and do some other cleanup as well (e.g., removing the Kubernetes executor, because it is not used)
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
  max-parallel: 4
this is never set now. is this on purpose?
It is not easy to set across multi matrix runs and GitHub can execute all of them in parallel anyway. Restricting the number of jobs to 4 will only increase the time until all jobs are complete.
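A rough back-of-the-envelope sketch of that argument, assuming every matrix job takes about the same time and a runner is always available. `wall_clock_minutes` is a hypothetical helper, and the job count and duration are made-up numbers, not measurements from this repository.

```python
import math
from typing import Optional


def wall_clock_minutes(
    num_jobs: int, minutes_per_job: float, max_parallel: Optional[int] = None
) -> float:
    """Estimate total wall-clock time when jobs run in equal-length waves."""
    if num_jobs == 0:
        return 0
    # Without a cap, all jobs are assumed to start at once.
    slots = num_jobs if max_parallel is None else min(num_jobs, max_parallel)
    waves = math.ceil(num_jobs / slots)
    return waves * minutes_per_job


# Made-up example: 12 matrix jobs of 10 minutes each.
capped = wall_clock_minutes(12, 10, max_parallel=4)  # 3 waves of 4 jobs
uncapped = wall_clock_minutes(12, 10)                # 1 wave of 12 jobs
```

Under these assumptions the cap triples the time until the last job finishes, while the total compute spent stays the same.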
for _ in range(10):
    stdout, _, exit_code = call(
        f"sacct -P --format=JobID,State,MaxRSS,ReqMem --unit K -j {job_id_with_index}"
    )

    if exit_code != 0:
        break

    if len(stdout.splitlines()) <= 1:
        time.sleep(0.2)
        continue

    # Parse stdout into a key-value object
    memory_limit_investigation = self._investigate_memory_consumption(stdout)
    if memory_limit_investigation:
        return memory_limit_investigation

    # Parse stdout into a key-value object
    properties = parse_key_value_pairs(stdout, "\n", ":")
    break
- this doesn't do retries if sacct fails? would it do any harm to do so?
- I find the trailing break in this for-loop unintuitive to read. my suggestion would be this:
max_retries = 10
for _ in range(max_retries):
    stdout, _, exit_code = call(
        f"sacct -P --format=JobID,State,MaxRSS,ReqMem --unit K -j {job_id_with_index}"
    )
    if exit_code != 0 or len(stdout.splitlines()) <= 1:
        # Sleep and retry in the next loop
        time.sleep(0.2)
    else:
        # Parse stdout into a key-value object
        memory_limit_investigation = self._investigate_memory_consumption(stdout)
        if memory_limit_investigation:
            return memory_limit_investigation
        break
sacct should only fail if it is not installed or the command-line arguments are incorrect. If a job's accounting information is not ready yet, or the job does not exist, it returns an empty list (see: https://github.com/scalableminds/webknossos-libs/actions/runs/15167285406/job/42648346345#step:12:19)
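To make that distinction concrete, here is a minimal, self-contained sketch of the polling loop: a non-zero exit code is treated as permanent and not retried, while a header-only response triggers a short sleep and another attempt. `poll_sacct`, the injected `call` parameter, and `fake_call` with its sample output are hypothetical names and made-up data for illustration, not the actual cluster_tools code. The sketch assumes `call` returns `(stdout, stderr, exit_code)` and that `sacct -P` prints pipe-delimited rows under one header line.

```python
import time
from typing import Callable, Optional

CallFn = Callable[[str], tuple[str, str, int]]


def poll_sacct(
    call: CallFn,
    job_id_with_index: str,
    max_retries: int = 10,
    delay: float = 0.2,
) -> Optional[list[dict[str, str]]]:
    """Return one dict per sacct row, or None if no data became available."""
    cmd = (
        "sacct -P --format=JobID,State,MaxRSS,ReqMem --unit K "
        f"-j {job_id_with_index}"
    )
    for _ in range(max_retries):
        stdout, _, exit_code = call(cmd)
        if exit_code != 0:
            # Assumed permanent (missing binary or bad arguments): do not retry.
            return None
        lines = stdout.splitlines()
        if len(lines) > 1:
            header = lines[0].split("|")
            return [dict(zip(header, row.split("|"))) for row in lines[1:]]
        # Only the header came back: accounting data is not ready yet.
        time.sleep(delay)
    return None


# Hypothetical stand-in for the real shell call; the values are made up.
_responses = iter(
    [
        ("JobID|State|MaxRSS|ReqMem\n", "", 0),
        ("JobID|State|MaxRSS|ReqMem\n12_0|OUT_OF_MEMORY|2048K|1024K\n", "", 0),
    ]
)


def fake_call(command: str) -> tuple[str, str, int]:
    return next(_responses)


rows = poll_sacct(fake_call, "12_0", delay=0.0)
```

Returning from inside the loop avoids both the trailing `break` and the `else` branch, at the cost of moving the parsing into the helper.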
great stuff 👍
Co-authored-by: Philipp Otto <philippotto@users.noreply.github.com>
Description:
TODO
scalableminds/slurm-docker-cluster (Requires: "Port Slurm Docker Cluster to Debian dockerfiles" #67 to be merged)
Issues: