Skip to content

Conversation

@giordano
Copy link
Contributor

@giordano giordano commented May 2, 2021

This should be mostly working, but I need to fix some details, so I'm marking the pull request as draft.

I probably also need to update the documentation to mention SGE is a supported scheduler?

Fix #1937

@pep8speaks
Copy link

pep8speaks commented May 2, 2021

Hello @giordano, Thank you for updating!

Cheers! There are no PEP8 issues in this Pull Request!Do see the ReFrame Coding Style Guide

Comment last updated at 2021-07-26 20:14:50 UTC

@jenkins-cscs
Copy link
Collaborator

Can I test this patch?

@giordano giordano force-pushed the mg/sge-scheduler branch from 72cb212 to db90f78 Compare May 2, 2021 18:53
Copy link
Contributor

@ekouts ekouts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it looks good to me, thank you @giordano for the contribution! I have some comments but I think you should wait also for @vkarak 's feedback before changing things.

First of all I was wondering if you really need to rewrite everything. As far as I see most of the methods are very similar to the PBS scheduler, so it would make sense to me to inherit from PBS and change only the methods that are different, like we do for the torque scheduler. You would have to change __init__, submit and poll. In the same logic I don't think you need a special _SgeJob, SGE_CANCEL_DELAY and SGE_OUTPUT_WRITEBACK_WAIT, you can just reuse PBS's variables.

@vkarak vkarak changed the title Add SGE scheduler [feat] Add SGE scheduler May 4, 2021
@vkarak vkarak added this to the ReFrame sprint 21.05.1 milestone May 5, 2021
@giordano
Copy link
Contributor Author

giordano commented May 5, 2021

Thanks for the suggestion about inheriting SgeJobScheduler from PbsJobScheduler, I just did it! The last remaining bit I believe is emitting the preamble: at the moment I'm skipping all settings like number of cores/number of tasks/number of CPUs, because I'm not really sure how to set them in a portable way.

What are exactly the settings that need to be in the preamble?

@vkarak
Copy link
Contributor

vkarak commented May 5, 2021

ok to test

@codecov-commenter
Copy link

codecov-commenter commented May 5, 2021

Codecov Report

Merging #1959 (2bcc6f6) into master (4642104) will decrease coverage by 0.41%.
The diff coverage is 37.97%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1959      +/-   ##
==========================================
- Coverage   86.74%   86.33%   -0.42%     
==========================================
  Files          52       53       +1     
  Lines        9258     9337      +79     
==========================================
+ Hits         8031     8061      +30     
- Misses       1227     1276      +49     
Impacted Files Coverage Δ
reframe/core/backends.py 90.00% <ø> (ø)
reframe/core/schedulers/sge.py 37.97% <37.97%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4642104...2bcc6f6. Read the comment docs.

@vkarak
Copy link
Contributor

vkarak commented May 5, 2021

Thanks @giordano for the PR! I agree with @ekouts on her comments.

The last remaining bit I believe is emitting the preamble: at the moment I'm skipping all settings like number of cores/number of tasks/number of CPUs, because I'm not really sure how to set them in a portable way.

What about setting the TASKS_OPT for the SGE scheduler such that it doesn't need the extra information? Then I guess that PBS' _emit_lselect_option() would do the job.

I am now looking into your PR in more detail and I'll get back to you with some more comments.

Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@giordano Thanks again for your PR! I think that the way you look into the XML output of the jobs can be simplified. Here is an outline of what I think would be more suitable:

@register_scheduler('sge')
class SgeJobScheduler(PbsJobScheduler):
    def poll(self, *jobs):
        ...

        # Index jobs
        job_index = {job.jobid: job for job in jobs}
        for queue_info in root:
            ...
            for job_elem in queue_info:
                jobid = job_list.find("JB_job_number").text
                if jobid not in job_index:
                    # Not a reframe job
                    continue

                job = job_index[jobid]
                job._state = _sge_state(job_list.find("state").text)
                if job.state in {'COMPLETED', 'SUSPENDED', 'ERROR'}:
                    job._completed = True

Where _sge_state() will actually do the mapping of the states, as you exactly do now. I don't think that anything else is needed from what I can imagine by reading through the code.

@vkarak vkarak requested a review from victorusu May 7, 2021 20:56
Co-authored-by: Vasileios Karakasis <vkarak@gmail.com>
@giordano
Copy link
Contributor Author

The problem of the proposed approach is that a completed job may never enter the z status and just disappear from the list. I implemented locally what you suggested but reframe hangs forever. Now I'm missing this whole block which made sure to account for jobs which simply disappear from qstat:
https://github.com/eth-cscs/reframe/blob/624c44e2b2ae64fbf383d03a1aed0ae925effca1/reframe/core/schedulers/sge.py#L154-L160
Always keep in mind that qstat is weird 🙂

@vkarak
Copy link
Contributor

vkarak commented May 17, 2021

@giordano You're right on your comment about the last loop. I've opened a PR to your branch with some small improvements. If it works, you can merge it and then we update the documentation for this one (plus the "decomposition" error message) and we're good to go, I think.

@vkarak vkarak marked this pull request as ready for review May 17, 2021 22:04
@vkarak
Copy link
Contributor

vkarak commented May 17, 2021

And I will add a unit test for the parts that do not require the scheduler, e.g., the emit_preamble().

Vasileios Karakasis and others added 2 commits May 18, 2021 00:09
@vkarak vkarak changed the title [feat] Add SGE scheduler [feat] Add SGE scheduler backedn Jul 26, 2021
@vkarak vkarak changed the title [feat] Add SGE scheduler backedn [feat] Add SGE scheduler backend Jul 26, 2021
@vkarak vkarak merged commit a88e933 into reframe-hpc:master Jul 26, 2021
@giordano giordano deleted the mg/sge-scheduler branch July 26, 2021 21:34
@giordano
Copy link
Contributor Author

Before the merge I was trying to run some benchmarks using also Spack as build system to test my full pipeline (had to fix some things), but it seems to be working as expected. Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support SGE scheduler

8 participants