
Torque integration tests broken? #677

Closed
jmaassen opened this issue Jan 18, 2021 · 9 comments
@jmaassen
Member

When I run the integration tests for xenon, the Torque tests fail with the following error:

java.lang.IllegalArgumentException: No internal port '22' for container 'torque': com.palantir.docker.compose.connection.Container$$Lambda$96/0x0000000840137840@2d2ea655
	at com.palantir.docker.compose.connection.Container.lambda$port$11(Container.java:91)

Other integration tests that use docker images (such as slurm and gridengine) seem to work as expected.

@jmaassen
Member Author

When starting the docker image manually like so:

 docker run --detach --name xenon-torque --hostname xenon-torque --publish 10022:22 --cap-add SYS_RESOURCE xenonmiddleware/torque

and running the liveTest like this:

./gradlew liveTest -Dxenon.scheduler=torque -Dxenon.username=xenon -Dxenon.password=javagat -Dxenon.scheduler.location=ssh://localhost:10022 -Dxenon.scheduler.workdir=/home/xenon

the tests run successfully. So the issue seems to be in the testing framework itself, not in the code or the docker image. Maybe the healthcheck succeeds too quickly?

@jmaassen
Member Author

Starting the docker image with docker-compose:

docker-compose -f torque-5.0.0.yml up

and running the live tests in the same fashion:

./gradlew liveTest -Dxenon.scheduler=torque -Dxenon.username=xenon -Dxenon.password=javagat -Dxenon.scheduler.location=ssh://localhost:32830 -Dxenon.scheduler.workdir=/home/xenon 

does not work. It results in the same error as with the integration tests.
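For context, a hypothetical sketch of what torque-5.0.0.yml might contain — the service name, image, and port mapping here are assumptions based on the error message and the manual docker run command above, not the actual file. The stack trace points at palantir's docker-compose-rule, which resolves the external port from the ports mapping that docker-compose reports; if that mapping is missing or not reported correctly, looking up internal port 22 fails with "No internal port '22'".

```yaml
# Hypothetical compose file; names and values are assumptions, not the
# actual torque-5.0.0.yml. The "ports" entry is what the test framework
# needs in order to resolve the external port for internal port 22.
version: '2'
services:
  torque:
    image: xenonmiddleware/torque
    hostname: xenon-torque
    cap_add:
      - SYS_RESOURCE
    ports:
      - "22"    # publish SSH on a random host port
```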

@sverhoeven
Member

Works for me

docker --version
Docker version 20.10.2, build 2291f61

docker image inspect xenonmiddleware/torque | jq '.[0].RepoDigests'
[
  "xenonmiddleware/torque@sha256:5a98982c2ad0cefc6994004ce4da69e68b8f0c4d596a9732dd7574e75f2153d4"
]

gradlew integrationTest --tests '*torque*'
Starting a Gradle Daemon, 2 incompatible Daemons could not be reused, use --status for details

Deprecated Gradle features were used in this build, making it incompatible with Gradle 6.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/5.4.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 59s
6 actionable tasks: 2 executed, 4 up-to-date

PS. xenonmiddleware/torque is the only Docker image where we don't install the scheduler/fs ourselves.

@sverhoeven
Member

The liveTest command also works for me. I did have to prime the known_hosts file by logging in manually before calling gradle.

I also saw that the healthcheck causes entries like 2021-01-18 13:58:45,612 CRIT reaped unknown pid 5465) in the docker-compose log.

@jmaassen
Member Author

The digest of xenonmiddleware/torque matches. I do have an older version of docker though: Docker version 19.03.8, build afacb8b7f0

@jmaassen
Member Author

I also see the unknown pid message, but I also see the following (when starting docker-compose manually):

Starting docker-compose_torque_1 ... done
Attaching to docker-compose_torque_1
torque_1  | /usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
torque_1  |   'Supervisord is running as root and it is searching '
torque_1  | 2021-01-18 14:27:22,213 CRIT Supervisor running as root (no user in config file)
torque_1  | 2021-01-18 14:27:22,215 INFO supervisord started with pid 1
torque_1  | 2021-01-18 14:27:23,217 INFO spawned: 'pbsmom' with pid 16
torque_1  | 2021-01-18 14:27:23,218 INFO spawned: 'sshd' with pid 17
torque_1  | 2021-01-18 14:27:23,219 INFO spawned: 'pbssched' with pid 18
torque_1  | 2021-01-18 14:27:23,220 INFO spawned: 'pbsserver' with pid 20
torque_1  | 2021-01-18 14:27:23,220 INFO spawned: 'trqauthd' with pid 21
torque_1  | 2021-01-18 14:27:23,248 CRIT reaped unknown pid 24)
torque_1  | 2021-01-18 14:27:23,345 INFO exited: pbssched (exit status 0; not expected)
torque_1  | 2021-01-18 14:27:23,352 INFO gave up: pbssched entered FATAL state, too many start retries too quickly
torque_1  | 2021-01-18 14:27:24,444 INFO success: pbsmom entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1  | 2021-01-18 14:27:24,444 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1  | 2021-01-18 14:27:24,444 INFO success: pbsserver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1  | 2021-01-18 14:27:24,444 INFO success: trqauthd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1  | 2021-01-18 14:27:24,444 CRIT reaped unknown pid 58)
torque_1  | 2021-01-18 14:27:25,604 CRIT reaped unknown pid 77)
torque_1  | 2021-01-18 14:27:26,779 CRIT reaped unknown pid 96)

Does pbssched fail?
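The "too many start retries too quickly" line comes from supervisord's start-retry logic. A hypothetical excerpt of the kind of supervisord program section that would produce the log lines above (the command path and values are assumptions, not the image's actual config):

```ini
; Hypothetical supervisord fragment; path and values are assumptions.
; A process must stay up for "startsecs" seconds to count as RUNNING;
; if it exits sooner "startretries" times in a row it enters FATAL,
; matching "gave up: pbssched entered FATAL state" in the log above.
[program:pbssched]
command=/usr/local/sbin/pbs_sched
startsecs=1
startretries=3
autorestart=unexpected
```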

@sverhoeven
Member

I have the same log message, but when I log in, the pbs_sched process is running and qsub works as expected.

@jmaassen
Copy link
Member Author

Updating docker from 19.03.8 to 20.10.2 did not help, but updating docker-compose from 1.25.0 to 1.27.4 seems to have squashed this bug.
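Since the fix turned out to be a docker-compose upgrade, a small version guard could catch this before running the integration tests. This is only a sketch based on this thread, not part of the xenon build; the version_ge helper and the 1.27.4 minimum are assumptions:

```shell
# Sketch: warn if docker-compose is older than 1.27.4, the version that
# resolved this issue. version_ge is a local helper, not a real tool.
version_ge() {
  # Succeeds if version $1 >= version $2 (relies on GNU sort -V).
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

installed="1.25.0"   # e.g. parsed from: docker-compose --version
if version_ge "$installed" "1.27.4"; then
  echo "docker-compose $installed is new enough"
else
  echo "upgrade docker-compose: $installed is older than 1.27.4"
fi
```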

@jmaassen
Member Author

Resolved as a docker-compose version issue.
