Jobs indicated as running never actually start. #42

Open
kbruegge opened this issue Apr 3, 2015 · 6 comments

@kbruegge

kbruegge commented Apr 3, 2015

Hello!

I really like your project, but I'm having trouble running your example code in examples/manual.py.
When I run it, I get the promising output:

=====================================
========   Submit and Wait   ========
=====================================

sending function jobs to cluster.
2015-04-03 16:19:05,742 - gridmap.job - INFO - Setting up JobMonitor on tcp://10.194.168.53:52713

The output of qstat also looks fine:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 423383 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 queue_name@somecluster.com     1
 423384 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 queue_name@somecluster.com     1
 423385 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 queue_name@somecluster.com     1
 423386 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 queue_name@somecluster.com     1

As you can see, the jobs are indicated as (r)unning.

The problem, however, is that the jobs never actually seem to finish. That is odd, since the calculation takes about 10 seconds when run locally, as expected given that the function sleep_walk(10) is being called.

I then modified your example to skip the sleep function and write out a file called test.txt. But nothing ever happens.
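
Roughly, the modification looked like this (a simplified sketch from memory, not the exact code):

    def write_marker(path='test.txt'):
        # instead of sleeping, write a marker file so I can tell
        # whether the worker ever actually ran
        with open(path, 'w') as f:
            f.write('job ran\n')
        return path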

Which brings me to my second question: how do I use the JobMonitor feature? I'm afraid I didn't gather much information from your documentation.

Any help is much appreciated. Also if there is any way I can contribute please let me know.

Kai

@dan-blanchard
Contributor

There is substantial overhead in starting jobs up on SGE (about a minute), so even when qstat says "running", that may not actually be true. GridMap is intended for tasks that take at least a few minutes to run, because otherwise the overhead is not worth it. The example is admittedly a bad one in that respect: the calculations are so fast that all you'll notice is the overhead.

If you let it run for like 5 minutes and it still doesn't finish, then there's probably a real issue.

As for JobMonitor, if you want more info you can either set the logging level to DEBUG (which will give you a ton of information), or run gridmap_web, which will give you a web wrapper around JobMonitor. It isn't very feature-rich yet, so I usually just use JobMonitor with debug logging when things aren't working right.
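
Enabling that is just standard-library logging setup, something like this (the format string simply mirrors the log lines you pasted above):

    import logging

    # gridmap logs through the standard logging module, so DEBUG output
    # can be turned on with a plain basicConfig call before process_jobs
    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')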

If you want to know more about how things work, check out this detailed rundown on the wiki.

I'm well aware that the documentation for GridMap could use some work (see #39), but I actually no longer actively use gridmap because I've changed jobs and now work at a company that doesn't use SGE (or any DRMAA-compatible grid). If you want to help out with documentation or by tackling any of the open issues, please make a PR. Thanks for offering!

@kbruegge
Author

kbruegge commented Apr 3, 2015

Thanks for your reply.
My jobs just hit the walltime limit, which was set to 2 hours, so there does seem to be something wrong :)
I also started some jobs at the DEBUG log level. The output looks okay as far as I can tell, apart from the job_id: -1 entries, which I don't know what to make of. It just repeats the following lines over and over:

    .
    .
    .
    2015-04-03 17:23:26,986 - gridmap.runner - DEBUG - Connecting to JobMonitor (tcp://10.194.168.53:61096)
    2015-04-03 17:23:26,986 - gridmap.runner - DEBUG - Sending message: {'command': 'heart_beat', 'data': {}, 'ip_address': '10.194.168.53', 'host_name': 'the_host_name', 'job_id': -1}
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Received message: {'command': 'heart_beat', 'data': {}, 'ip_address': '10.194.168.53', 'host_name': 'the_host_name', 'job_id': -1}
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Checking if jobs are alive
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Sending reply:
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - 0 out of 4 jobs completed
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Waiting for message
    .
    .
    .

If I can get this to work on our clusters, I'll gladly contribute to the documentation as I go along and figure things out. If this works for what I'm trying to do, then a bunch of people from my group might use it as well.

@dan-blanchard
Contributor

The job_id: -1 means those messages are actually from the JobMonitor itself; that's how it knows to check whether the jobs are alive and whether it's heard from them. If you don't see any messages from jobs with IDs other than -1, that implies you may have some sort of firewall issue preventing the workers from connecting to the JobMonitor.
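
One quick way to test that hypothesis is to try opening a plain TCP connection from a compute node to the address the JobMonitor prints at startup. A rough sketch (the host and port here are just the example values from the log earlier in this thread):

    import socket

    # substitute the tcp://host:port that JobMonitor logged at startup
    host, port = '10.194.168.53', 52713
    try:
        sock = socket.create_connection((host, port), timeout=5)
        sock.close()
        print('port reachable, so probably not a firewall problem')
    except socket.error as exc:
        print('could not connect (%s), likely a firewall/routing issue' % exc)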

@djoffe

djoffe commented Feb 20, 2018

I am hitting the exact same issue. Was this ever fixed?
How would I see debug info from the worker jobs, to find out if these are firewall issues?

Thanks

@djoffe

djoffe commented Feb 20, 2018

Found the issue in my case; leaving some traces here in case anyone else comes along:

I am using an SGE grid. Checking the job status after the jobs finished showed:

$ qacct -j 3555
==============================================================
...
failed       26  : opening input/output file
...

It turned out that the default temp_dir (defined as /scratch/ in gridmap.conf) exists but is inaccessible in my case. This error is not caught by _append_job_to_session in job.py.
The default temp_dir can be overridden by passing temp_dir as an argument to process_jobs:

    # '/path/to/tmp/' is a placeholder for a directory that is readable
    # and writable from both the submit host and the compute nodes
    job_outputs = process_jobs(
        functionJobs,
        max_processes=4,
        temp_dir='/path/to/tmp/',
    )

I am not sure what the intended way of overriding gridmap.conf default values is.
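
In the meantime, a quick way to check whether a candidate directory is actually usable from your account (the path is a placeholder, as above):

    import tempfile

    temp_dir = '/path/to/tmp/'
    # raises an OSError if the directory is missing or not writable,
    # which seems to be the condition behind the 'failed 26' status above
    with tempfile.NamedTemporaryFile(dir=temp_dir) as handle:
        handle.write(b'ok')

Note that the same check needs to pass on the compute nodes too, since the workers read and write their input/output files there as well.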

@kalkairis

Running into the same issue as the people above me, with the following code:

import gridmap


def foo(x, y):
    return x * y


if __name__ == "__main__":
    # build ten trivial multiplication jobs
    jobs = []

    for i in range(10):
        job = gridmap.Job(foo, [i, i + 1])
        jobs.append(job)

    # submit the jobs and wait for all of them to finish
    job_outputs = gridmap.process_jobs(jobs, max_processes=4, quiet=False)
    print(job_outputs)

The code never reaches the print(job_outputs) call.
