@robert-oleynik
Contributor

@robert-oleynik robert-oleynik commented Apr 30, 2025

Description:

TODO

Issues:

  • fixes CI build failures for Docker Slurm Cluster image

@robert-oleynik robert-oleynik self-assigned this Apr 30, 2025
@github-actions

github-actions bot commented Apr 30, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 8635  | 7212    | 84%      | 80%       | 🟢     |

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: 33d9631 by action🐍

@robert-oleynik robert-oleynik marked this pull request as draft April 30, 2025 11:37
@robert-oleynik robert-oleynik marked this pull request as ready for review May 22, 2025 12:46
Member

@philippotto philippotto left a comment

great stuff :) I only left a couple of smaller comments.

        run: uv sync --frozen
      - name: Check typing
-       if: ${{ matrix.executors == 'multiprocessing' && matrix.python-version == '3.11' }}
+       if: ${{ matrix.python-version == '3.11' }}
Member

should we adapt this so that these checks are run against the newest python version?

Contributor Author

Idk, but we could create a follow-up PR and do some other cleanup as well (e.g., removing the Kubernetes executor, because it is not used)

    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
-     max-parallel: 4
Member

this is never set now. is this on purpose?

Contributor Author

It is not easy to set across multiple matrix runs, and GitHub can execute all of them in parallel anyway. Restricting the number of parallel jobs to 4 would only increase the time until all jobs are complete.
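
A rough way to see the effect of a parallelism cap is the sketch below. It assumes jobs start in order as soon as a slot is free and that enough runners are always available, which is a simplification. The `makespan` function and the job durations are made up for illustration and are not measurements from this CI setup.

```python
import heapq
from typing import List, Optional


def makespan(durations: List[float], max_parallel: Optional[int] = None) -> float:
    """Wall time until all jobs finish with at most `max_parallel` running at once."""
    if max_parallel is not None and max_parallel < 1:
        raise ValueError("max_parallel must be a positive integer or None")
    if not durations:
        return 0.0
    # None means "no cap": every job gets its own slot.
    slots = len(durations) if max_parallel is None else min(max_parallel, len(durations))
    # Each heap entry is the time at which a slot becomes free again.
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest free slot
        heapq.heappush(free_at, start + duration)
    return max(free_at)


jobs = [10.0] * 12  # twelve hypothetical 10-minute matrix jobs
print(makespan(jobs))                  # no cap: 10.0
print(makespan(jobs, max_parallel=4))  # capped at four: 30.0
```

Under these assumptions the cap never shortens the run; it can only serialize jobs that would otherwise overlap.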

Comment on lines 439 to 456
        for _ in range(10):
            stdout, _, exit_code = call(
                f"sacct -P --format=JobID,State,MaxRSS,ReqMem --unit K -j {job_id_with_index}"
            )

            if exit_code != 0:
                break

            if len(stdout.splitlines()) <= 1:
                time.sleep(0.2)
                continue

            # Parse stdout into a key-value object
            memory_limit_investigation = self._investigate_memory_consumption(stdout)
            if memory_limit_investigation:
                return memory_limit_investigation

-        # Parse stdout into a key-value object
-        properties = parse_key_value_pairs(stdout, "\n", ":")
            break
Member

  1. this doesn't do retries if sacct fails? would it do any harm to do so?
  2. I find the trailing break in this for-loop unintuitive to read. my suggestion would be this:
        max_retries = 10
        for _ in range(max_retries):
            stdout, _, exit_code = call(
                f"sacct -P --format=JobID,State,MaxRSS,ReqMem --unit K -j {job_id_with_index}"
            )

            if exit_code != 0 or len(stdout.splitlines()) <= 1:
                # Sleep and retry in the next loop
                time.sleep(0.2)
            else:
                # Parse stdout into a key-value object
                memory_limit_investigation = self._investigate_memory_consumption(stdout)
                if memory_limit_investigation:
                    return memory_limit_investigation
                break
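
For trying out the retry pattern suggested above without a Slurm cluster, here is a self-contained sketch. The names `fetch_accounting_output` and `run_command` and the default values are hypothetical; `run_command` stands in for the `call` helper from the diff and is assumed to return `(stdout, stderr, exit_code)`. This is not the code that was merged.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for the `call` helper used in the diff:
# takes a command string and returns (stdout, stderr, exit_code).
CommandRunner = Callable[[str], Tuple[str, str, int]]


def fetch_accounting_output(
    run_command: CommandRunner,
    command: str,
    max_retries: int = 10,
    delay_seconds: float = 0.2,
) -> Optional[str]:
    """Return stdout once it contains more than a header line, else None."""
    for _ in range(max_retries):
        stdout, _, exit_code = run_command(command)
        if exit_code == 0 and len(stdout.splitlines()) > 1:
            return stdout
        # The command failed or printed only the header: wait, then retry.
        time.sleep(delay_seconds)
    return None
```

Passing the command runner as a parameter keeps the loop testable with a fake that returns canned responses.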

Contributor Author

sacct should only fail if it is not installed or the command-line arguments are incorrect. If the accounting information for a job is not ready yet, or the job does not exist, it returns an empty list (see: https://github.com/scalableminds/webknossos-libs/actions/runs/15167285406/job/42648346345#step:12:19)
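
To illustrate the distinction between a failing command and an empty result, the sketch below parses pipe-delimited `sacct -P` output and treats a header-only result as "no accounting data yet". The function `parse_sacct_rows` and the sample values are assumptions for illustration. The project's own parsing happens in `_investigate_memory_consumption`, whose implementation is not shown in this conversation.

```python
from typing import Dict, List


def parse_sacct_rows(stdout: str) -> List[Dict[str, str]]:
    """Parse `sacct -P` output: the first line is the header, fields are split by '|'."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if len(lines) <= 1:
        # Header only (or nothing at all): no accounting rows are available.
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


# Invented sample in the shape requested by --format=JobID,State,MaxRSS,ReqMem
sample = (
    "JobID|State|MaxRSS|ReqMem\n"
    "42_0|OUT_OF_MEMORY||1048576K\n"
    "42_0.batch|OUT_OF_MEMORY|2097152K|\n"
)
rows = parse_sacct_rows(sample)
for row in rows:
    print(row["JobID"], row["State"], row["MaxRSS"] or "-")
```

An empty return value here corresponds to the case where the loop in the diff sleeps and retries.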

Member

@philippotto philippotto left a comment

great stuff 👍

@robert-oleynik robert-oleynik enabled auto-merge (squash) May 28, 2025 09:18
@robert-oleynik robert-oleynik merged commit d3bc88f into master May 28, 2025
35 checks passed
@robert-oleynik robert-oleynik deleted the test-new-docker-slurm-cluster branch May 28, 2025 09:28

3 participants