Submitting jobs on a remote cluster #55

Open
jdidion opened this issue Oct 1, 2014 · 9 comments
jdidion commented Oct 1, 2014

I apologize for asking a question here, but there is no dedicated support forum. The documentation is not clear on how to actually get JIP to submit jobs on a remote cluster. In my environment, and I think that of many others, we ssh into a login node to submit jobs on the cluster. I did not see any option to configure the remote host name or login password in the cluster configuration. Thanks

thasso commented Oct 7, 2014

Hey,

there is currently no way to talk to the remote side directly. The way we use the system on a "remote" cluster is to first ssh to the login node and then install, configure, and use jip there. I was in fact thinking about adding a thin "remote" layer on top so that you could execute commands directly from your local machine, and we might still implement something like this, but nonetheless jip always needs to be installed and available on the remote cluster/login node. The reason for this is that the jobs interact directly with the job database and do not go through a server. This avoids having to start a server on your cluster, and you don't need to enable connections to and from your compute cluster.

I hope it helps!

If you find a need for a thin client that you can use directly from your workstation, feel free to explain your use case in a bit more detail here and we can see if we can do something about it.

jdidion commented Oct 7, 2014

Thanks for the response! The use case I was thinking of is that sometimes I want to run jobs on the cluster and sometimes I want to run them on my local machine. It would be nice to manage both things from my local desktop. If you have all of the commands going over ssh, then I don’t think you need to have anything installed on the remote machine except the scripts that actually get executed (and the data, of course, but it’s beyond the scope of JIP to manage that). The scripts could just be copied to the remote server via scp, or kept in sync via rsync. I’d be happy to contribute to development to make this a part of JIP, but first I would want to work with you on a plan for the best way to implement it.

Thanks,

John Didion, PhD
Postdoctoral Fellow
Collins Lab, NHGRI

thasso commented Oct 7, 2014

Hi John,

okay, I think I understand the use case, and I think it would be a good idea to support a workflow where you can basically migrate jobs between jip instances, say run first locally, then remotely. Note that currently there is no easy way to avoid the installation of jip on the remote side. It is needed not only for the actual job execution but also for the creation of the pipeline graph, which imo should be created from within the execution environment (the remote side). Keeping it like this means we can rely on the pipeline graph generation and its checks to ensure the validity of the final graph. In addition, this step ensures that mandatory input files exist, etc. Imo it should also be no big problem to install jip on the remote cluster, as the installation and execution process happens completely in user space. Please note also that there is no need to start any server on the remote side if you have access to a job scheduler like SGE, Slurm, or PBS. Only the jip executable (plus dependencies) needs to be available.

With this in mind, what I see right now is essentially an SSH wrapper to delegate command execution to the remote side and get back some data that can be processed locally.
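
A minimal sketch of what such a wrapper could look like, assuming jip is installed on the remote login node and reachable via passwordless ssh (the host name and the helper function are placeholders, not an existing jip API):

import subprocess

def jip_remote(host, args):
    # run a jip command on the remote login node over ssh and return
    # its standard output so it can be processed locally
    result = subprocess.run(
        ["ssh", host, "jip"] + list(args),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# e.g. list the jobs known to the remote jip installation
print(jip_remote("login.cluster.example.org", ["jobs"]))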

What I am not sure about right now is how to deal with input files and whether it should be part of jip to take care of syncing the files automatically. But I would suggest starting with the SSH wrapper only. File syncing is for sure nice to have, but as you already mentioned, you could also simply use something like rsync.

Best,
-Thasso

jdidion commented Oct 7, 2014

Sure, I think that workflow will definitely appeal to some users. For me, I’m more interested in having a single interface to execute jobs, whether they be local or remote, but the jobs I run locally and remotely are different (the remote jobs are pipelines for processing sequencing data, while the local jobs are typically quick analyses on the processed data).

The ssh wrapper sounds like the best solution. As far as input files, I think JIP should handle syncing any JIP scripts, but (at least for now) it’s up to the user to make sure the data files are in place before executing the job. To me it doesn’t seem ideal to have JIP trying to manage keeping dozens or hundreds of huge BAM files in sync.

As far as implementation, can you briefly describe how you would go about doing it? You’re much more familiar with the structure of the code and I want to implement this in the way that makes the most sense.

Thanks

John Didion, PhD
Postdoctoral Fellow
Collins Lab, NHGRI

jdidion commented Oct 10, 2014

I’ve thought through this some more, and I think I have a good plan. The model I am working with is that there will be two separate JIP databases, one on the local machine and one on the remote cluster. For a job submitted from the local machine to the remote cluster (via the new ‘remote’ command described below), a placeholder will be inserted into the local database marking that job as a remote job. Subsequent calls to ‘jip jobs’ will fetch job information from the remote machine and update the placeholder records in the local database. That way, a user can track all job information in his local database even if some jobs are submitted locally and some are submitted remotely.

New commands:

  • create: creates a job in the local database, but does not execute it. This job could be run at a later time (with changes to jip_run.py) and/or migrated (see below).
  • migrate: copy job information from one database to another via ssh (other transport mechanisms, such as REST, could be supported later).
  • remote: send a command and arguments to a remote instance of JIP via ssh. Sending a remote run or submit command will first ensure that the specified scripts and/or tools are available on the remote machine, copying them over from the local machine if needed (migrate will also do this), and will insert a placeholder record into the local database. Subsequent calls to ‘jobs’ or any of the job manipulation commands will read the results of the remote command and update the placeholder record in the local database (a rough sketch of this flow follows below).
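
A rough sketch of the proposed remote flow (the --json flag, the output parsing, and the local_db helper are all hypothetical; they only illustrate the placeholder model, not existing jip features):

import json
import subprocess

def remote_submit(host, script, local_db):
    # submit on the cluster over ssh and keep a placeholder record locally
    out = subprocess.run(["ssh", host, "jip", "submit", script],
                         capture_output=True, text=True, check=True).stdout
    remote_id = out.strip().split()[-1]   # assumes the remote job id is printed last
    local_db.insert({"remote_host": host, "remote_id": remote_id, "state": "Submitted"})
    return remote_id

def refresh_placeholders(host, local_db):
    # fetch job states from the remote jip instance and update the local placeholders
    out = subprocess.run(["ssh", host, "jip", "jobs", "--json"],
                         capture_output=True, text=True, check=True).stdout
    for job in json.loads(out):
        local_db.update(remote_id=job["id"], state=job["state"])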

I think these changes should be fairly transparent to current users, i.e., they wouldn’t affect how people currently do things. There would have to be some kind of database migration step to upgrade the database schema for current users who want to take advantage of the new workflow, but I think SQLAlchemy has facilities for that.

Please let me know if you see any problems with this approach, or if you recommend a better way of doing it.

Thanks,

John Didion, PhD
Postdoctoral Fellow
Collins Group, NHGRI

thasso commented Oct 13, 2014

Sounds good to me. I don't see any obvious flaws at the moment, and we can iterate on this. If you want to start implementing this, please note that the current development version of JIP can be found in the "develop" branch. I would suggest you create a pull request against that branch and I can review the changes before merging them in.

jdidion commented Oct 15, 2014

I’ve scaled this back a bit. I decided mucking up the database with lots of pointers to jobs running on remote hosts was probably not worth the cost. Instead I’m implementing the following:

  • export: export jobs as a JSON object
  • submit (modification of the existing command): add the ability to specify an exported jobs JSON object to be imported. Any other command line options will override the values loaded from the JSON object.
  • migrate: export a job from the local database, copy it to the remote machine, optionally also sync scripts, import the job, and optionally submit it

This should be relatively quick to implement since I will be using an existing ssh library (although this will create an additional dependency).
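
For illustration, a minimal sketch of what migrate could look like using paramiko as the ssh library (the choice of paramiko, the JSON export format, and the --from-json submit option are assumptions about the proposed design, not existing jip features):

import paramiko

def migrate(job_json, host, remote_dir, submit=False):
    # copy an exported job JSON file to the remote machine and
    # optionally submit it there through the remote jip installation
    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.connect(host)                        # assumes key-based authentication
    sftp = client.open_sftp()
    remote_path = remote_dir + "/" + job_json.rsplit("/", 1)[-1]
    sftp.put(job_json, remote_path)             # transfer the exported job
    sftp.close()
    if submit:
        # '--from-json' stands in for the proposed submit extension
        _, stdout, _ = client.exec_command("jip submit --from-json " + remote_path)
        print(stdout.read().decode())
    client.close()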

jdidion commented Nov 12, 2014

Hi Thasso,

I am writing a pipeline where the first step is parallel alignment of multiple sets of fastq files and the second step will be to merge the resulting BAM files. I’m not sure how to implement this in JIP.

I assume the first step is to iterate over all pairs of fastq files and call run, i.e.:

p = Pipeline()
for f1, f2 in fastqs:
    p.run('align', input=(f1, f2))

But how do I make those run as a single group, and have the merge step depend on the completion of all the jobs in that group?

Thanks,

John

thasso commented Jan 2, 2015

Hi John,

Sorry for the major delay, but now there is finally some time :)

You are already on the right track with the dependencies. You can do it as you tried in your example and simply iterate over your fastq input files. In order to establish dependencies, you can use the dependsOn function exposed on the node objects. But here's the point where the jip dependency resolution and edge multiplicities can really help. I will try to lay out the full example to showcase how you can use job parameters from the pipeline and its nodes.
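
For completeness, the explicit variant of the loop from your last message could look roughly like this (dependsOn is the node method mentioned above; the exact call style is an assumption):

p = Pipeline()
aligns = []
for f1, f2 in fastqs:
    aligns.append(p.run('align', input=(f1, f2)))

merge = p.run('merge', input=[a.output for a in aligns])
for a in aligns:
    merge.dependsOn(a)   # make merge wait for every align node explicitly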

The assumption here is basically that the align job expects a single fastq file as input while the merge job takes a list of files. This allows you to

  1. Expand the align jobs based on the list of input parameters
  2. Collapse on the merge job based on the list of output alignments created by the align jobs

In pseudocode, this would look something like this:

import jip
from jip import Pipeline  # assumes Pipeline is exported at the package level

fastqs = [...]
p = Pipeline()

align = p.run('align', input=fastqs, output='${input|ext}.bam')
merge = p.run('merge', input=align)

# expand the pipeline. Now you'll have n align jobs but still one merge job
p.expand()
# create the list of jobs
jobs = jip.create_jobs(p)

Please note that the explicit expansion and job creation is not necessary if you use the jip command line tools and write the pipeline as a jip pipeline script.

Also note the output parameter of the align job. I assumed that the job allows you to specify the name of the output file. We need something dynamic here because we want to expand on the list of input files.

The final merge job's input is just the align job. This works as long as align only defines a single output. If that's not the case, you'll need to be more specific: p.run('merge', input=align.output).

Using the jip pipeline graph and expanding on the inputs and outputs of the nodes allows you to avoid specifying dependencies explicitly.

I hope this is not too late and still helps.

Best,
-Thasso
