Merged
47 commits
b57bb81
switch to ported slurm docker cluster
robert-oleynik Apr 30, 2025
56177ec
fix slurm config and new slurm image
robert-oleynik Apr 30, 2025
3522a3e
split ci steps for better readability
robert-oleynik May 6, 2025
190a35c
fix ci names
robert-oleynik May 6, 2025
32dc40a
fix job references
robert-oleynik May 6, 2025
065830d
set default working directory
robert-oleynik May 6, 2025
b047832
ci: docker compose: improve startup
robert-oleynik May 7, 2025
5520b80
update test script
robert-oleynik May 7, 2025
6cbc36c
only run test without config change
robert-oleynik May 7, 2025
f4ac2bc
ci: fix logging and make slurm node privilged
robert-oleynik May 7, 2025
8429f23
apply format
robert-oleynik May 7, 2025
f358045
ci: add extra slurm test run
robert-oleynik May 7, 2025
dbbcbbd
add missing executor
robert-oleynik May 7, 2025
ebf99af
register pytest mark
robert-oleynik May 15, 2025
4387776
move marker to .py and print slurm logs
robert-oleynik May 15, 2025
08d6e18
fix python read
robert-oleynik May 15, 2025
0c08004
remove unnecessary decode
robert-oleynik May 15, 2025
936cc07
add pytest_configuration type annotation
robert-oleynik May 15, 2025
d99f7d7
print seff output
robert-oleynik May 15, 2025
aa794bd
add more prints
robert-oleynik May 15, 2025
f0f35bb
make seff optional
robert-oleynik May 15, 2025
e0dc6b8
print properties
robert-oleynik May 15, 2025
a083455
replace seff with sacct
robert-oleynik May 15, 2025
a498951
detect out of memory using sacct
robert-oleynik May 20, 2025
0dc1a3c
sacct print stdout
robert-oleynik May 20, 2025
572eb03
slurm conf constrain ram space
robert-oleynik May 20, 2025
b9335af
slurm job acct: disable over memory kill
robert-oleynik May 20, 2025
a51ade2
switch cgroup v2 and ignore systemd
robert-oleynik May 20, 2025
bb5db29
fix out of memory detection
robert-oleynik May 21, 2025
11fb6fb
restart slurm
robert-oleynik May 21, 2025
b0083f5
remove pytest_configuration
robert-oleynik May 21, 2025
3e9fe4c
apply format
robert-oleynik May 22, 2025
e8964b8
retry gathering job information
robert-oleynik May 22, 2025
e9aa7c4
fix linting errors
robert-oleynik May 22, 2025
8d67f8b
remove max parallel ci jobs
robert-oleynik May 22, 2025
164e62c
decrease sacct request frequencies
robert-oleynik May 22, 2025
5f2d0d8
remove prints
robert-oleynik May 22, 2025
5b36743
remove prints 2
robert-oleynik May 22, 2025
7a3b258
use maste docker cluster
robert-oleynik May 27, 2025
76602e3
apply suggestions
robert-oleynik May 27, 2025
5286668
fix lints and format
robert-oleynik May 27, 2025
26eeaf1
add Changelog entry
robert-oleynik May 27, 2025
20c2236
add missing setuptools entry
robert-oleynik May 27, 2025
e3ed692
cluster_tools: remove setuptools entry
robert-oleynik May 27, 2025
110611d
Update cluster_tools/tests/test_slurm.py
robert-oleynik May 27, 2025
e82e36b
Merge branch 'master' into test-new-docker-slurm-cluster
robert-oleynik May 28, 2025
33d9631
Merge branch 'master' into test-new-docker-slurm-cluster
robert-oleynik May 28, 2025
229 changes: 147 additions & 82 deletions .github/workflows/ci.yml
@@ -29,121 +29,169 @@ jobs:
cluster_tools:
- 'cluster_tools/**'

cluster_tools:
cluster_tools_slurm:
needs: changes
if: ${{ needs.changes.outputs.cluster_tools == 'true' }}
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
max-parallel: 4
Member (review comment):
This is never set now. Is this on purpose?

Contributor Author (reply):
It is not easy to set across multiple matrix runs, and GitHub can execute all of them in parallel anyway. Restricting the number of parallel jobs to 4 would only increase the time until all jobs are complete.

matrix:
executors: [multiprocessing, slurm, kubernetes, dask]
python-version: ["3.13", "3.12", "3.11", "3.10"]
defaults:
run:
working-directory: cluster_tools
steps:
- uses: actions/checkout@v3
- name: Install uv
uses: astral-sh/setup-uv@v3
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v6
with:
version: "0.6.3"
enable-cache: true
cache-dependency-glob: "cluster_tools/uv.lock"

- name: Set up Python ${{ matrix.python-version }}
run: uv python install ${{ matrix.python-version }}
- name: Build/pull dockered-slurm image
if: ${{ matrix.executors == 'slurm' }}
- run: uv python install ${{ matrix.python-version }}
- name: Start Docker Cluster
run: cd ./dockered-slurm && docker compose up -d
- name: Log Core Container
run: |
cd ./dockered-slurm

echo docker compose up
docker compose up -d

# Register cluster (with retry)
for i in {1..5}; do
echo register_cluster
./register_cluster.sh && s=0 && break || s=$?
sleep 10
for name in "slurmctld" "c1" "c2"; do
docker logs "$name"
done

# Show log output for debugging
docker logs slurmctld
docker logs c1
docker logs c2

# Run setup.py on all three nodes
docker exec -w /cluster_tools slurmctld bash -c "uv sync --frozen" &
docker exec -w /cluster_tools c1 bash -c "uv sync --frozen" &
docker exec -w /cluster_tools c2 bash -c "uv sync --frozen" &
wait

- name: Setup Kubernetes-in-Docker
if: ${{ matrix.executors == 'kubernetes' }}
- name: Install UV dependencies
run: |
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.11.1/kind-linux-amd64
chmod +x ./kind
sed -i "s#__PATH__#$(pwd)#g" tests/cluster-config.yaml
./kind create cluster --config=tests/cluster-config.yaml
./kind export kubeconfig

docker build \
--build-arg PYTHON_VERSION=${{ matrix.python-version }} \
-f tests/Dockerfile \
-t scalableminds/cluster-tools:latest \
.
./kind load docker-image scalableminds/cluster-tools:latest
for name in "slurmctld" "c1" "c2"; do
docker exec -w /cluster_tools "$name" bash -c "uv sync --frozen"
done
- name: "Run Tests (test_all, test_slurm) without modified slurm.conf"
run: |
docker exec \
-w /cluster_tools/tests \
-e PYTEST_EXECUTORS=slurm \
slurmctld bash -c "uv run --frozen python -m pytest -sv test_all.py test_slurm.py -m 'not requires_modified_slurm_config'"
- name: "Run Tests (test_deref_main)"
run: |
docker exec \
-w /cluster_tools/tests \
slurmctld bash -c "uv run --frozen python test_deref_main.py"

- name: Install dependencies (without docker)
if: ${{ matrix.executors == 'multiprocessing' }}
- name: Update Slurm Config
run: |
uv sync --frozen
echo "MaxArraySize=2" >> ./dockered-slurm/slurm.conf
sed "s/JobAcctGatherFrequency=30/JobAcctGatherFrequency=1/g" ./dockered-slurm/slurm.conf > ./dockered-slurm/slurm.conf.tmp
mv ./dockered-slurm/slurm.conf.tmp ./dockered-slurm/slurm.conf
- name: Restart Slurm Cluster
run: cd ./dockered-slurm && docker compose restart slurmctld c1 c2

- name: Install dependencies (without docker)
if: ${{ matrix.executors == 'kubernetes' || matrix.executors == 'dask' }}
- name: "Run Tests (test_all, test_slurm) with modified slurm.conf"
run: |
uv sync --all-extras --frozen
# Run tests requiring a modified slurm config
docker exec \
-w /cluster_tools/tests \
-e PYTEST_EXECUTORS=slurm \
slurmctld bash -c "uv run --frozen python -m pytest -sv test_slurm.py -m 'requires_modified_slurm_config'"

cluster_tools_multiprocessing:
needs: changes
if: ${{ needs.changes.outputs.cluster_tools == 'true' }}
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
matrix:
python-version: ["3.13", "3.12", "3.11", "3.10"]
defaults:
run:
working-directory: cluster_tools
steps:
- uses: actions/checkout@v4
- name: Install uv
uses: astral-sh/setup-uv@v6
with:
version: "0.6.3"
enable-cache: true
cache-dependency-glob: "cluster_tools/uv.lock"
- name: Set up Python ${{ matrix.python-version }}
run: uv python install ${{ matrix.python-version }}
- name: Install dependencies (without docker)
run: uv sync --frozen
- name: Check typing
if: ${{ matrix.executors == 'multiprocessing' && matrix.python-version == '3.11' }}
if: ${{ matrix.python-version == '3.11' }}
Member (review comment):
Should we adapt this so that these checks are run against the newest Python version?

Contributor Author (reply):
I don't know, but we could create a follow-up PR and do some other cleanup as well (e.g., removing the Kubernetes executor, because it is not used).

run: ./typecheck.sh

- name: Check formatting
if: ${{ matrix.executors == 'multiprocessing' && matrix.python-version == '3.11' }}
if: ${{ matrix.python-version == '3.11' }}
run: ./format.sh check

- name: Lint code
if: ${{ matrix.executors == 'multiprocessing' && matrix.python-version == '3.11' }}
if: ${{ matrix.python-version == '3.11' }}
run: ./lint.sh

- name: Run multiprocessing tests
if: ${{ matrix.executors == 'multiprocessing' }}
run: |
cd tests
PYTEST_EXECUTORS=multiprocessing,sequential,multiprocessing_with_pickling,sequential_with_pickling \
uv run --frozen python -m pytest -sv test_all.py test_multiprocessing.py

- name: Run slurm tests
if: ${{ matrix.executors == 'slurm' }}
cluster_tools_kubernetes:
needs: changes
if: ${{ needs.changes.outputs.cluster_tools == 'true' }}
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
matrix:
python-version: ["3.13", "3.12", "3.11", "3.10"]
defaults:
run:
working-directory: cluster_tools
steps:
- uses: actions/checkout@v4
- name: Install uv
uses: astral-sh/setup-uv@v6
with:
version: "0.6.3"
enable-cache: true
cache-dependency-glob: "cluster_tools/uv.lock"
- name: Set up Python ${{ matrix.python-version }}
run: uv python install ${{ matrix.python-version }}
- name: Setup Kubernetes-in-Docker
run: |
cd ./dockered-slurm
docker exec \
-w /cluster_tools/tests \
-e PYTEST_EXECUTORS=slurm \
slurmctld bash -c "uv run --frozen python -m pytest -sv test_all.py test_slurm.py"
docker exec \
-w /cluster_tools/tests \
slurmctld bash -c "uv run --frozen python test_deref_main.py"
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.11.1/kind-linux-amd64
chmod +x ./kind
sed -i "s#__PATH__#$(pwd)#g" tests/cluster-config.yaml
./kind create cluster --config=tests/cluster-config.yaml
./kind export kubeconfig

- name: Run kubernetes tests
if: ${{ matrix.executors == 'kubernetes' }}
docker build \
--build-arg PYTHON_VERSION=${{ matrix.python-version }} \
-f tests/Dockerfile \
-t scalableminds/cluster-tools:latest \
.
./kind load docker-image scalableminds/cluster-tools:latest
- name: Install dependencies (without docker)
run: uv sync --all-extras --frozen
- name: "Run Kubernetes"
run: |
cd tests
PYTEST_EXECUTORS=kubernetes uv run --frozen python -m pytest -sv test_all.py test_kubernetes.py

- name: Run dask tests
if: ${{ matrix.executors == 'dask' }}
cluster_tools_dask:
needs: changes
if: ${{ needs.changes.outputs.cluster_tools == 'true' }}
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
matrix:
python-version: ["3.13", "3.12", "3.11", "3.10"]
defaults:
run:
working-directory: cluster_tools
steps:
- uses: actions/checkout@v4
- name: Install uv
uses: astral-sh/setup-uv@v6
with:
version: "0.6.3"
enable-cache: true
cache-dependency-glob: "cluster_tools/uv.lock"
- name: Set up Python ${{ matrix.python-version }}
run: uv python install ${{ matrix.python-version }}
- name: Install dependencies (without docker)
run: uv sync --all-extras --frozen
- name: "Run Dask"
run: |
cd tests
PYTEST_EXECUTORS=dask uv run --frozen python -m pytest -sv test_all.py test_dask.py
@@ -155,9 +203,8 @@ jobs:
${{ needs.changes.outputs.webknossos == 'true' }}
runs-on: ubuntu-latest
strategy:
max-parallel: 4
matrix:
python-version: ["3.12", "3.13", "3.11", "3.10"]
python-version: ["3.13", "3.12", "3.11", "3.10"]
group: [1, 2, 3]
fail-fast: false
defaults:
@@ -177,7 +224,7 @@ jobs:

- name: Install proxay
run: npm install -g proxay

- name: Set up Python ${{ matrix.python-version }}
run: uv python install ${{ matrix.python-version }}

@@ -258,12 +305,17 @@ jobs:
token: ${{ secrets.GITHUB_TOKEN }}
thresholdAll: 0.8
thresholdNew: 0.8

- name: Cleanup temporary files
run: rm -rf ~/coverage-files

webknossos_cli_docker:
needs: [cluster_tools, webknossos_linux]
needs:
- cluster_tools_slurm
- cluster_tools_multiprocessing
- cluster_tools_kubernetes
- cluster_tools_dask
- webknossos_linux
if: |
always() &&
!contains(needs.*.result, 'failure') &&
@@ -335,7 +387,12 @@ jobs:
docker push scalableminds/webknossos-cli:$NORMALIZED_CI_BRANCH

docs:
needs: [cluster_tools, webknossos_linux]
needs:
- cluster_tools_slurm
- cluster_tools_multiprocessing
- cluster_tools_kubernetes
- cluster_tools_dask
- webknossos_linux
runs-on: ubuntu-latest
if: |
always() &&
@@ -391,7 +448,12 @@ jobs:
"$SLACK_HOOK"

pypi_and_gh_release:
needs: [cluster_tools, webknossos_linux]
needs:
- cluster_tools_slurm
- cluster_tools_multiprocessing
- cluster_tools_kubernetes
- cluster_tools_dask
- webknossos_linux
if: |
always() &&
!contains(needs.*.result, 'failure') &&
@@ -429,7 +491,10 @@ jobs:
complete:
needs:
[
cluster_tools,
cluster_tools_dask,
cluster_tools_kubernetes,
cluster_tools_multiprocessing,
cluster_tools_slurm,
webknossos_linux,
webknossos_cli_docker,
docs,
1 change: 1 addition & 0 deletions cluster_tools/Changelog.md
@@ -14,6 +14,7 @@ For upgrade instructions, please check the respective *Breaking Changes* section
### Added

### Changed
- Use `sacct` to detect out of memory errors instead of `seff` for Slurm executor. [#1297](https://github.com/scalableminds/webknossos-libs/pull/1297)

### Fixed
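
The changelog entry in this hunk mentions detecting out-of-memory errors through `sacct` instead of `seff`. The following is a minimal sketch of how such a check might look; it is not the PR's actual implementation. The function names `parse_sacct_states`, `is_out_of_memory`, and `query_job_states` are hypothetical, as are the retry count and delay. The sketch assumes `sacct` is called with `--format=JobID,State --parsable2 --noheader`, which should yield one `JobID|State` line per job step.

```python
import subprocess
import time


def parse_sacct_states(output: str) -> dict[str, str]:
    """Map each job (step) ID to its state, given `JobID|State` lines."""
    states = {}
    for line in output.splitlines():
        parts = line.strip().split("|")
        if len(parts) < 2 or not parts[0]:
            continue  # skip blank or malformed lines
        job_id, state = parts[0], parts[1]
        # States such as "CANCELLED by 1000" carry a suffix; keep the first word.
        states[job_id] = state.split()[0] if state.strip() else ""
    return states


def is_out_of_memory(output: str) -> bool:
    """Return True if any job step reports the OUT_OF_MEMORY state."""
    return any(
        state == "OUT_OF_MEMORY" for state in parse_sacct_states(output).values()
    )


def query_job_states(job_id: str, retries: int = 3, delay: float = 5.0) -> str:
    """Call sacct for one job, retrying while accounting data is not yet available.

    Assumption: an empty result means the accounting database has not
    recorded the job yet, so waiting and retrying is worthwhile.
    """
    cmd = ["sacct", "-j", job_id, "--format=JobID,State", "--parsable2", "--noheader"]
    for attempt in range(retries):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout
        if attempt < retries - 1:
            time.sleep(delay)
    return ""
```

A caller could combine the two with `is_out_of_memory(query_job_states(job_id))`. Keeping the parsing separate from the `sacct` call lets the state check be tested against captured output without a running Slurm cluster.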
