
add LSF scheduler #588

Closed · wants to merge 5 commits

Conversation

takeshi-yoshimura (Contributor)

I prototyped the LSF scheduler for torchx. At this point it supports native, Docker, and Singularity runtimes with a shared filesystem. I confirmed that it works with Gloo and NCCL on small VPC V100 clusters.

Note: the `torchx log` command is available only when the torchx host shares a filesystem with the cluster nodes (e.g., via NFS).

In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., `bsub` invocations); for distributed apps, it issues multiple `bsub` calls. I also added lsf to scripts/component_integration_tests.py. Here is the log output from my three-node LSF cluster; you can find dryrun results there.

component_integration_tests.lsf.txt
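As a rough illustration of the translation idea (the helper and flag choices below are assumptions, not the exact code in lsf_scheduler.py), each replica of a role becomes one `bsub` submission:

```python
# Hypothetical sketch of the torchx-request -> bsub translation; the helper
# name and flags are illustrative, not the actual lsf_scheduler.py code.
import shlex
from typing import List

def build_bsub(job_name: str, queue: str, num_gpus: int, cmd: List[str]) -> str:
    args = ["bsub", "-J", job_name, "-q", queue]
    if num_gpus > 0:
        # LSF GPU request syntax (with LSB_GPU_NEW_SYNTAX=extend)
        args += ["-gpu", f"num={num_gpus}"]
    return shlex.join(args + cmd)

# a distributed app with three replicas turns into three submissions
for replica_id in range(3):
    print(build_bsub(f"echo-{replica_id}", "normal", 0, ["echo", "hello_world"]))
```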

Regarding Singularity image compatibility, Singularity already automates converting Docker images into the Singularity image format, so all we have to do is generate `singularity exec` arguments from torchx requests (roughly sketched below). Note that users still need to prefix image names with docker:// if they want to use Docker images.
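A minimal sketch of that argument generation (the helper name and flag choices are assumptions; `--nv` is Singularity's flag for exposing NVIDIA GPUs):

```python
# Illustrative only: mapping a role to a "singularity exec" command line.
import shlex
from typing import List

def singularity_cmd(image: str, entrypoint: str, args: List[str], num_gpus: int) -> str:
    cmd = ["singularity", "exec"]
    if num_gpus > 0:
        cmd.append("--nv")  # expose NVIDIA GPUs inside the container
    cmd += [image, entrypoint, *args]
    return shlex.join(cmd)

# the docker:// prefix tells Singularity to pull and convert a Docker image
print(singularity_cmd("docker://alpine:latest", "echo", ["hello_world"], 0))
```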

The following are example commands.

Example: native hello_world and CLI utils

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0

Example: Docker hello_world

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3

Example: Singularity hello_world

$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3

Example: Docker Distributed

$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"

Example: Singularity Distributed

$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"

facebook-github-bot added the CLA Signed label on Aug 25, 2022
d4l3k self-requested a review on August 25, 2022 17:44
d4l3k (Contributor) left a comment
Thanks for contributing! This is very exciting to see :)

Have you looked at https://github.com/IBMSpectrumComputing/lsf-python-api at all? Just wondering if that might be a slightly cleaner approach than shell commands

For the shell commands it'd be good to write some unit tests where we wrap subprocess.run to provide expected output for each method
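Something along these lines (the method here is a hypothetical stand-in, just to show the patching pattern):

```python
# Sketch of unit-testing a CLI-backed method by patching subprocess.run.
import subprocess
from unittest import mock

def list_jobs() -> list:
    # stand-in for a scheduler method that shells out to bjobs
    proc = subprocess.run(["bjobs", "-noheader"], capture_output=True, text=True)
    return [line.split()[0] for line in proc.stdout.splitlines()]

with mock.patch("subprocess.run") as run:
    run.return_value = mock.Mock(stdout="100 user RUN\n101 user PEND\n")
    assert list_jobs() == ["100", "101"]
    run.assert_called_once()
```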

It'd also be nice to use workspaces for docker/native to enable automatic patching, though supporting both might be tricky without some other refactors. Maybe just start with Docker, since DirWorkspace doesn't seem to be used here?

@@ -51,7 +51,7 @@ def main() -> None:
     torchx_image = "dummy_image"
     dryrun = False
 
-    if scheduler in ("kubernetes", "local_docker", "aws_batch"):
+    if scheduler in ("kubernetes", "local_docker", "aws_batch", "lsf"):
d4l3k (Contributor):
Is there any easy way for us to set up a local LSF scheduler in the CI test environment (e.g., via Docker)? Would be nice to have an E2E integration test like we do for the other schedulers.


class LsfOpts(TypedDict, total=False):
    lsf_queue: Optional[str]
    jobdir: Optional[str]  # NOTE: *job_dir* cannot be used: somehow it overwrites the --image flag (bug?), so use *jobdir* (no underscore)
d4l3k (Contributor):
This is because you're using the DirWorkspace; you can remove that from the inheritance if you need to.

#------------
{self.materialize()}"""

class LsfScheduler(Scheduler[LsfOpts], DirWorkspace):
d4l3k (Contributor):
I don't think this is actually using the DirWorkspace? Ideally we could use both DockerWorkspace and DirWorkspace, though that would require #590

    host_network: Optional[bool]
    shm_size: Optional[str]

def get_docker_command(job_name: str, role: Role, cfg: LsfOpts) -> str:
d4l3k (Contributor):
Wonder if we can share the same command logic with the DockerScheduler.

    if resource.gpu > 0:
        cmds += ["--gpus", "all"]
    cmds += ["--entrypoint", role.entrypoint, "--rm", role.image] + [shlex.quote(arg) for arg in role.args]
    return " ".join(cmds).replace("$", "\\$")
d4l3k (Contributor):
Need to use something like shlex.join -- just using " ".join isn't safe from either a correctness or a security standpoint.

Applies below as well.
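For example (note that shlex.join is Python 3.8+; on 3.7 the equivalent is joining shlex.quote-d arguments):

```python
import shlex

args = ["--msg", "hello world; rm -rf /"]
print(" ".join(args))    # --msg hello world; rm -rf /   (splits apart, shell-injectable)
print(shlex.join(args))  # --msg 'hello world; rm -rf /' (one safely quoted argument)
# Python 3.7-compatible equivalent:
print(" ".join(shlex.quote(a) for a in args))
```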

takeshi-yoshimura (Contributor, Author)

@d4l3k
Thank you for reviewing my code! As you point out, it seems much better to use the Python library and focus on Docker. I will try to fix the code and push it again here next week.

d4l3k (Contributor) commented Aug 26, 2022

@takeshi-yoshimura there's a balance here -- not sure how hard it is to install the lsf library. If it's painful, maybe that's not the best option.

d4l3k (Contributor) commented Sep 16, 2022

@takeshi-yoshimura what are your plans for updating this PR? We'd like to include it but it still needs some polish.

takeshi-yoshimura (Contributor, Author)

@d4l3k
The Python library for LSF does not seem to provide prebuilt pip binaries. As far as I have tested, we need to download and build its code on running LSF nodes. That may be difficult to fit into torchx's test cases and build process.

Regarding an LSF Docker image for local tests, I have found no official one so far...

Let me search more and share something here again soon. Sorry for my late response; I think I can concentrate on revising this PR this week.

takeshi-yoshimura (Contributor, Author) commented Sep 20, 2022

I updated lsf_scheduler.py according to your comments. Can you please take a look? @d4l3k

Honestly speaking, I don't recommend using lsf-python-api. Its critical weakness is that it has no support for job submissions with GPUs (IBMSpectrumComputing/lsf-python-api#36). It also exposes only low-level, complex Python interfaces, and I couldn't find good documentation on how to use it.

I am also concerned about tests, as you pointed out. As far as I have investigated, no public container images are currently available. The issue was also discussed in dask-jobqueue (dask/dask-jobqueue#115). As discussed there, we can download LSF Suite Community Edition to build a Docker image. You need an IBM account to download it, but it is distributed under a free license that provides enough capability for testing (a single GPU and a limited number of resources). Here are my test Dockerfile and other scripts (we probably cannot put the image anywhere public). Maybe we can add this kind of code for testing.

Dockerfile (lsfsce10.2.0.12-x86_64.tar.gz is downloaded from here):

FROM nvidia/cuda:11.7.1-devel-ubuntu20.04
ARG LSFSCE10_2_0_12=lsfsce10.2.0.12-x86_64.tar.gz
COPY $LSFSCE10_2_0_12 /
COPY startserver.sh /
COPY myinstall.config /

ENV HOSTNAME lsf

RUN apt-get update && apt-get install -y python3 python3-pip swig git ed vim && rm -rf /var/cache/apt/* && \
    useradd -m lsfadmin && \
    cd / && tar xzf lsfsce10.2.0.12-x86_64.tar.gz && cd lsfsce10.2.0.12-x86_64/lsf && tar xzf lsf10.1_lsfinstall_linux_x86_64.tar.Z && cd lsf10.1_lsfinstall && \
    ./lsfinstall -f /myinstall.config && rm -rf /myinstall.config /lsfsce10.2.0.12-x86_64* && echo "LSF_ROOT_USER=Y" >> /usr/share/lsf/conf/lsf.conf && \
    echo "LSB_GPU_NEW_SYNTAX=extend" >> /usr/share/lsf/conf/lsf.conf && \
    echo 'source /usr/share/lsf/conf/profile.lsf' >> /home/lsfadmin/.bashrc && echo 'source /usr/share/lsf/conf/profile.lsf' >> /root/.bashrc && \
    cd / && git clone https://github.com/IBMSpectrumComputing/lsf-python-api.git && cd lsf-python-api && \
    . /usr/share/lsf/conf/profile.lsf && python3 setup.py build && python3 setup.py install && cd / && rm -rf /lsf-python-api

USER root

This Dockerfile includes the lsf-python-api installation as well. I found a pip package for it, but it has not been updated in years: https://pypi.org/project/platform-python-lsf-api/.

myinstall.config:

LSF_TOP=/usr/share/lsf
LSF_ADMINS=lsfadmin
LSF_CLUSTER_NAME=lsf
LSF_MASTER_LIST=lsf
SILENT_INSTALL=Y
LSF_SILENT_INSTALL_TARLIST=ALL
ACCEPT_LICENSE=Y

startserver.sh:

#!/bin/bash
source /usr/share/lsf/conf/profile.lsf
lsf_daemons start
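# assumption: when this script is the container entrypoint, a long-running
# foreground command (e.g., sleep infinity) should follow so the container
# does not exit once the daemons have started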

takeshi-yoshimura (Contributor, Author)

Public images for LSF were deleted for security reasons in the past. The official instructions for building an LSF image are at https://github.com/IBMSpectrumComputing/lsf-operator/blob/main/README-Building-the-images.md.

codecov bot commented Sep 23, 2022

Codecov Report

Merging #588 (f8e9a0b) into main (b70811e) will decrease coverage by 0.52%.
The diff coverage is 89.23%.

@@            Coverage Diff             @@
##             main     #588      +/-   ##
==========================================
- Coverage   94.94%   94.42%   -0.53%     
==========================================
  Files          67       64       -3     
  Lines        4134     4429     +295     
==========================================
+ Hits         3925     4182     +257     
- Misses        209      247      +38     
| Impacted Files | Coverage Δ |
|---|---|
| torchx/schedulers/__init__.py | 95.23% <ø> (ø) |
| torchx/schedulers/lsf_scheduler.py | 89.23% <89.23%> (ø) |
| torchx/util/entrypoints.py | 89.28% <0.00%> (-10.72%) ⬇️ |
| torchx/specs/named_resources_aws.py | 93.33% <0.00%> (-6.67%) ⬇️ |
| torchx/runner/api.py | 94.87% <0.00%> (-2.03%) ⬇️ |
| torchx/specs/__init__.py | 94.28% <0.00%> (-2.02%) ⬇️ |
| torchx/specs/api.py | 98.40% <0.00%> (ø) |
| torchx/util/types.py | 100.00% <0.00%> (ø) |
| torchx/cli/cmd_log.py | 95.74% <0.00%> (ø) |
| torchx/specs/finder.py | 96.98% <0.00%> (ø) |
... and 9 more


d4l3k (Contributor) commented Sep 23, 2022

If lsf-python-api isn't in good shape, there's no strong need to use it. That was my concern when I looked at it before, so I'm glad you have the same opinion. Lack of GPU support is a big blocker, wow.

For Slurm we just use the CLI and it's stable enough -- we can mock the inputs/outputs by using patch on subprocess, which works fairly well.

Re: testing -- having an LSF integration test isn't a blocker for landing this diff. We can mark this as prototype and add integ testing in a follow-up diff.

If there are ways to programmatically fetch the LSF scheduler and install it using the credentials, we can add some creds to GitHub secrets, which should keep them safe.

Do we have any contacts at LSF? Wondering if this policy around Docker images is something that we can get changed/exempted from. Is it an option to get a small managed LSF test cluster provided by IBM? We can chat more on Slack.

d4l3k (Contributor) commented Sep 23, 2022

@takeshi-yoshimura I think the main thing blocking this particular diff is just adding some comprehensive unit tests (and fixing lint/pyre).

takeshi-yoshimura (Contributor, Author)

@d4l3k I fixed lint and pyre and added unit tests. Please check them. To enable unit testing without subprocess calls, I separated the parser logic from the LsfScheduler methods; a rough sketch of the idea follows.
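For instance, the status parsing is now a pure function that tests can call directly (a minimal sketch with hypothetical names and column layout, not the exact code in this PR):

```python
# Pure parser over bjobs-style output; unit tests exercise it with canned
# strings instead of invoking any subprocess.
from typing import Dict, List

def parse_bjobs_output(output: str) -> List[Dict[str, str]]:
    jobs = []
    for line in output.strip().splitlines():
        fields = line.split()
        jobs.append({"job_id": fields[0], "status": fields[2]})
    return jobs

assert parse_bjobs_output("100 user RUN\n101 user PEND\n") == [
    {"job_id": "100", "status": "RUN"},
    {"job_id": "101", "status": "PEND"},
]
```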

> Do we have any contacts at LSF? Wondering if this policy around Docker images is something that we can get changed/exempted from. Is it an option to get a small managed LSF test cluster provided by IBM? We can chat more on Slack.

I'm afraid I cannot get approval for IBM-hosted machines just for this test. I also asked the LSF developers about the Docker images but got no good answers. To be honest, I have no idea how to solve this right now.

There is a forum page for LSF: https://community.ibm.com/community/user/businessanalytics/communities/community-home/digestviewer?communitykey=74d589b7-7276-4d70-acf5-0fc26430c6c0. I keep asking within IBM, but we can also raise our issue on that open channel.

By the way, I requested a Slack invitation from https://pytorch.org/resources last week but have gotten no reply yet. Is the system working?

> If there are ways to programmatically fetch the LSF scheduler and install it using the credentials, we can add some creds to GitHub secrets, which should keep them safe.

This is also difficult: the download page for the LSF Community Edition requires authentication in a web browser.

d4l3k (Contributor) left a comment
Thanks for contributing!

The one 3.7 failure seems to be unrelated -- I think it's already fixed in trunk.

It would be nice to test some of the subprocess calls with mocks, but I can quickly add a couple of those when I land it.

Sounds like the integration tests aren't going to be very feasible, which is unfortunate -- not much I can do from our side. With the auth, we might be able to do something by storing the credentials in GitHub secrets.

facebook-github-bot (Contributor)

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

d4l3k pushed a commit to d4l3k/torchx-1 that referenced this pull request Oct 7, 2022
Pull Request resolved: pytorch#588

Reviewed By: msaroufim

Differential Revision: D40184939

Pulled By: msaroufim

fbshipit-source-id: 5a13d2ee88b3b5cf1b8e5a3f6786b955d47f21f8
d4l3k mentioned this pull request Oct 7, 2022

takeshi-yoshimura (Contributor, Author)

@d4l3k
Thank you! I know this is only the first step for the LSF scheduler. I think I need to keep working on integration tests and documentation (plus Singularity, after the dependent changes land). Do we have any other TODO items for the LSF scheduler?

d4l3k pushed a commit to d4l3k/torchx-1 that referenced this pull request Oct 10, 2022
Summary:
Pull Request resolved: pytorch#610
Pull Request resolved: pytorch#588

Reviewed By: anirbanr-fb-r2p, msaroufim, kurman

Differential Revision: D40184939

fbshipit-source-id: d4a4f68d74a2ca12f95f683080c6a00137966ca6
d4l3k (Contributor) commented Oct 10, 2022

@takeshi-yoshimura The biggest gap right now is workspaces. Having support for the DockerWorkspace to allow patching the images before launching the job would be very nice.

If you want to add Singularity support, that'd be a big help -- it'd be pretty nice to add some abstraction for the container execution side here (i.e., the same interface for Docker vs Singularity) that we could use across a couple of schedulers.

Testing would also be a big help, as would circulating this around to get some user feedback.
