Resume and reboot modes #74

Merged
merged 29 commits into from
Nov 2, 2015

Collaborator

@snim2 snim2 commented Oct 27, 2015

This PR adds two new command-line switches to krun: --resume and --reboot.

In resume mode, krun looks for an existing results file. If one is found, krun first checks that the current platform is (approximately) the same as the platform recorded in the results file. If this check passes, the schedule is built and executions which have already been run are removed from the job queue. The old results are loaded into the current job scheduler, which means the JSON results file can be dumped whole (rather than appended to), as before.
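As an illustration of the pruning step described above, here is a minimal sketch. It is not krun's actual implementation: the function name `prune_completed`, the job-key format, and the shape of the old results are all assumptions made for the example.

```python
from collections import Counter


def prune_completed(schedule, old_data):
    """Return the executions in `schedule` that still need to run.

    `schedule` holds one job key per planned execution (hypothetical format).
    `old_data` maps a job key to the list of results already recorded.
    """
    # How many executions of each job already have a stored result.
    done = Counter({key: len(execs) for key, execs in old_data.items()})
    remaining = []
    for key in schedule:
        if done[key] > 0:
            done[key] -= 1  # this execution was completed in an earlier run
        else:
            remaining.append(key)
    return remaining


# Three planned executions of one benchmark and two of another; two results
# for the first benchmark were recorded before the interruption.
schedule = ["nbody:CPython:default"] * 3 + ["fib:CPython:default"] * 2
old_data = {"nbody:CPython:default": [[1.2, 1.1], [1.3, 1.0]]}
remaining = prune_completed(schedule, old_data)
```

Because the old results stay in memory alongside any new ones, the whole set can be written out in a single dump at the end of each execution.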

In reboot mode, every time an execution finishes, krun runs a reboot command which is defined in the platform definition. (Currently krun just prints the command out, to ease testing; this needs to be fixed before merging.)
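The control flow of reboot mode can be sketched roughly as follows. This is an assumption-laden outline, not krun's code: `run_next_execution` and its callback parameters are hypothetical names, and the reboot step is injected so that it can be replaced by a harmless stand-in during testing.

```python
def run_next_execution(queue, execute, dump_results, reboot):
    """Run one execution from the queue, persist its result, then reboot.

    Returns False when the queue is empty, so the caller knows not to reboot.
    """
    if not queue:
        return False
    job = queue.pop(0)
    result = execute(job)
    # Results must reach disk before the machine goes down, otherwise
    # resume mode has nothing to pick up after the reboot.
    dump_results(job, result)
    reboot()
    return True


# Stand-ins that record what would happen instead of touching the machine.
events = []
queue = ["job-a", "job-b"]
ran = run_next_execution(
    queue,
    execute=lambda job: events.append("execute " + job) or 0.5,
    dump_results=lambda job, result: events.append("dump " + job),
    reboot=lambda: events.append("reboot"),
)
```

Combined with `--resume`, each boot would then run exactly one execution and leave the rest of the queue for the next boot.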

JSON-related code has been refactored into krun/util.py. A very basic test suite has been added to krun/tests/. The documentation in examples/README.md has been updated.

Different platforms have different conventions for starting a program on boot. An example rc.local file has been added to etc/.

Fixes #41
Fixes #54

@snim2 snim2 added this to the Ready to publish milestone Oct 27, 2015
@snim2
Collaborator Author

snim2 commented Oct 27, 2015

That Travis build failed because bz2 file objects do not support the `with` statement (they are not context managers) in Python 2.6. Is Python 2.6 needed?
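For reference, one way to keep `with`-style cleanup on an interpreter where `bz2.BZ2File` is not a context manager is to wrap it in `contextlib.closing`. The sketch below is an assumption about how such a workaround could look, not code from this PR; the helper names `dump_results` and `load_results` are hypothetical.

```python
import bz2
import contextlib
import json
import os
import tempfile


def dump_results(obj, path):
    """Write `obj` as bz2-compressed JSON, closing the file even on error."""
    # contextlib.closing calls fh.close() on exit, so this works whether or
    # not BZ2File itself supports the with statement.
    with contextlib.closing(bz2.BZ2File(path, "w")) as fh:
        fh.write(json.dumps(obj).encode("utf-8"))


def load_results(path):
    """Read a bz2-compressed JSON file back into Python objects."""
    with contextlib.closing(bz2.BZ2File(path, "r")) as fh:
        return json.loads(fh.read().decode("utf-8"))


# Round trip through a temporary file.
results = {"nbody": [[1.2, 1.1]]}
path = os.path.join(tempfile.mkdtemp(), "example_results.json.bz2")
dump_results(results, path)
loaded = load_results(path)
```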

@ltratt
Member

ltratt commented Oct 27, 2015

I think we can safely force the use of Python 2.7, on the assumption that it will be installed everywhere that we want to run.

@vext01
Member

vext01 commented Oct 27, 2015

The design sounds sane. Now I will inspect the code.

# This script is executed at the end of each multiuser runlevel.
#

nohup sudo -H -u krun python krun.py --resume --reboot /home/krun/krun/examples/example.krun
Member

Do we really want nohup? That means all krun output is going to disk twice.

Collaborator Author

I'm not sure, but the script needs to exit 0 to work with the init framework. (I need to test this later today.)

Member

You see, nohup will usually write stdout to nohup.out.

@snim2 snim2 mentioned this pull request Oct 27, 2015
@@ -86,6 +86,30 @@ $ PYTHONPATH=../ ../krun.py example.krun
You should see a log scroll past, and results will be stored in the file:
`../krun/examples/example_results.json.bz2`.

## Running in reboot and resume modes

krun can resume an interrupted benchmark by passing in the `--resume` flag:
Member

Say something about the granularity of this feature? I.e. executions.

@snim2
Collaborator Author

snim2 commented Oct 28, 2015

This now works on my machine, with some caveats:

  • I haven't found a way to re-start krun after boot whilst running as a non-root user. When I start krun on the command line I have to give a password to sudo. However, I think this is an issue with how the environment is set up rather than krun.
  • When I tried running this on bencher5 I got an error about cpufrequtils not being installed, but apt-get believes the package is there.

@ltratt
Member

ltratt commented Oct 28, 2015

If you use sudo in rc.local you should be able to do sudo -u krun <path to krun> without requiring a password (most, though perhaps not all, sudo installs allow root to call sudo without a password).

@snim2
Collaborator Author

snim2 commented Oct 28, 2015

Yes, but if I run krun from the command line as a non-root user it asks me for a sudo password. I don't think it is doing a sudo -u krun when it asks, because krun works whether or not I have a user called krun.

@ltratt
Member

ltratt commented Oct 28, 2015

/etc/rc.local is run as root, so sudo -u krun in that file almost certainly won't require a password. [It is possible someone's set a really silly sudo config that requires a password from root, but I haven't seen such a setup yet.]

Sarah Mount added 12 commits October 30, 2015 10:37
This is a basic check that benchmarking has been resumed on "the same" platform that the benchmark was started on.
Resume mode removes jobs from the schedule that have already been executed and adds old data to the set of results.
Log name either based on current time (ordinary run) or mtime of config file (resume mode).
Information provided in audits differs between platforms.
ETA emails are sent.
Fixed existing error in documentation.
Appends logs to /var/log/rc.local.log.
Linux only.
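The log-naming rule mentioned in the commit list above (current time for an ordinary run, config file mtime in resume mode) could be sketched as below. The function name `log_filename` and the exact name format are assumptions; the point is that a resumed run presumably derives the same name on every restart, as long as the config file is not modified.

```python
import os
import tempfile
import time


def log_filename(config_path, resume):
    """Pick a log file name for this run (naming scheme is hypothetical)."""
    if resume:
        # Resumed runs use the config file's mtime, which does not change
        # between restarts, so they all map to the same log file.
        stamp = os.path.getmtime(config_path)
    else:
        # An ordinary run gets a name derived from the current time.
        stamp = time.time()
    base = os.path.splitext(config_path)[0]
    formatted = time.strftime("%Y%m%d_%H%M%S", time.localtime(stamp))
    return "%s_%s.log" % (base, formatted)


# Usage: two resumed runs against an untouched config share one log name.
with tempfile.NamedTemporaryFile(suffix=".krun", delete=False) as tmp:
    config = tmp.name
first = log_filename(config, resume=True)
second = log_filename(config, resume=True)
os.remove(config)
```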
if len(self) == 0:
    debug("krun started with an empty queue of jobs")

if not resume:
Member

Logic here is wrong?

@vext01
Member

vext01 commented Nov 2, 2015

Sarah, if you are happy with my last commit, I think we can merge this.

Note however that we should address #83 and #84.

vext01 and others added 10 commits November 2, 2015 13:18
…ally starts Krun on boot.

Improvements to error messages: if the output file does not exist, don't tell the user it isn't a regular file.
Only wait for network when --started-by-init. --dry-run now simulates time.sleep.
…r. These show that a --reboot makes progress through the schedule, and should help prevent an infinite reboot loop.
vext01 added a commit that referenced this pull request Nov 2, 2015
@vext01 vext01 merged commit 1f09531 into master Nov 2, 2015
@snim2 snim2 deleted the resume-mode branch November 2, 2015 13:27
Successfully merging this pull request may close these issues.

Reboot after each execution. Easier session resume
3 participants