[jobs] Add job manager class for simple jobs python APIs #19567

Merged on Oct 22, 2021 (12 commits)

@jiaodong jiaodong commented Oct 20, 2021

Changes

The JobManager class lives in Ray's private subfolder (python/ray/_private/job_manager) and is intended to power the HTTP, CLI, and Python APIs.

  • Status is stored as an enum value via put/get on the GCS KV store
  • Logs are produced by running the command in a shell and piping stdout and stderr to a file on the head node, under /tmp/ray/session-latest/xxx/logs/jobs/unique_job_id.out
    • The file is append-only, so streaming logs via the /logs API can be kept simple in the HTTP server
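
A minimal sketch of the two mechanisms described above, not the PR's actual code. It assumes a plain dict as a stand-in for the GCS KV store, and the names `JobStatus`, `set_status`, `get_status`, `run_job`, and `read_new_logs`, as well as the status values and key prefix, are hypothetical. It shows a status enum round-tripped through byte keys and values, a shell command whose stdout and stderr are appended to one log file, and an offset-based read that works because the file is append-only.

```python
import enum
import os
import subprocess
import tempfile


class JobStatus(str, enum.Enum):
    # Hypothetical status values; the real enum may differ.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Stand-in for the GCS KV store: a dict with bytes keys and bytes values.
_kv = {}


def set_status(job_id: str, status: JobStatus) -> None:
    _kv[f"job_status:{job_id}".encode()] = status.value.encode()


def get_status(job_id: str):
    raw = _kv.get(f"job_status:{job_id}".encode())
    return None if raw is None else JobStatus(raw.decode())


def run_job(job_id: str, cmd: str, log_dir: str) -> str:
    """Run cmd in a shell, appending stdout and stderr to <job_id>.out."""
    log_path = os.path.join(log_dir, f"{job_id}.out")
    set_status(job_id, JobStatus.RUNNING)
    with open(log_path, "a") as log_file:
        # stderr=STDOUT merges both streams into the same append-only file.
        proc = subprocess.run(
            cmd, shell=True, stdout=log_file, stderr=subprocess.STDOUT
        )
    final = JobStatus.SUCCEEDED if proc.returncode == 0 else JobStatus.FAILED
    set_status(job_id, final)
    return log_path


def read_new_logs(log_path: str, offset: int):
    """Return (text, new_offset) for the bytes appended since offset."""
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data.decode(errors="replace"), offset + len(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        path = run_job("demo", "echo hello; echo oops 1>&2", log_dir)
        print(get_status("demo"), read_new_logs(path, 0))
```

Because existing bytes are never rewritten, a log-streaming endpoint only needs to remember the last offset it returned and call something like `read_new_logs` again.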

Test plan

  • simple shell command: echo
  • simple shell command: ls | grep
  • simple shell command that redirects stdout to stderr
  • Python script that raises an exception in a subprocess
  • runtime env with a script fetched from a remote S3 URL

```
❯ pytest test_job_manager.py -sv
Test session starts (platform: darwin, Python 3.8.10, pytest 5.4.3, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /Users/jiaodong/Workspace/ray/python
plugins: sugar-0.9.4, anyio-3.3.1, asyncio-0.15.1, timeout-1.4.2, lazy-fixture-0.6.3, rerunfailures-10.0
collecting ... 2021-10-21 17:38:21,719 INFO services.py:1331 -- View the Ray dashboard at http://127.0.0.1:8265

ray/_private/job_manager/tests/test_job_manager.py::test_submit_basic_echo ✓ 14% █▌
ray/_private/job_manager/tests/test_job_manager.py::test_submit_stderr ✓ 29% ██▉
ray/_private/job_manager/tests/test_job_manager.py::test_submit_ls_grep ✓ 43% ████▍
ray/_private/job_manager/tests/test_job_manager.py::test_subprocess_exception ✓ 57% █████▊
ray/_private/job_manager/tests/test_job_manager.py::test_submit_with_s3_runtime_env ✓ 71% ███████▎
ray/_private/job_manager/tests/test_job_manager.py::TestRuntimeEnv.test_inheritance ✓ 86% ████████▋
ray/_private/job_manager/tests/test_job_manager.py::TestRuntimeEnv.test_multiple_runtime_envs ✓ 100% ██████████

Results (5.94s):
7 passed
```

Next Steps

  • Add HTTP server running as dashboard module to unblock product integration
  • Support kill / stop API
  • More tests around runtime_env
  • Job supervisor actor and subprocess fate sharing

Related issue number

Closes #19414

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jiaodong jiaodong changed the title [WIP] [Ready for review] Add job manager class for simple jobs python APIs Oct 21, 2021
@jiaodong jiaodong marked this pull request as ready for review October 21, 2021 20:29
Runs a command as a child process, streaming stderr & stdout to given
log file.
"""
with open(log_file, "a+") as fin:
Contributor

this needs to gc the child if the JobSupervisor dies

Member Author

Yep, fate sharing and proper GC are among the next steps; I'm just putting up a working version first.
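
For illustration only, here is one way a supervisor could clean up its child, which is not necessarily what the follow-up work does. The sketch assumes a POSIX system, and `supervised_child` and `_signal_group` are hypothetical names. It starts the command in its own process group and signals the whole group when the supervising block exits, whether normally or through an exception. It does not cover the supervisor being hard-killed (for example with SIGKILL), which needs an OS-level or Ray-level mechanism.

```python
import os
import signal
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def supervised_child(cmd: str, grace_seconds: float = 5.0):
    """Run cmd in a shell; terminate its whole process group on exit."""
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid (POSIX only).
    proc = subprocess.Popen(cmd, shell=True, start_new_session=True)
    try:
        yield proc
    finally:
        if proc.poll() is None:
            _signal_group(proc.pid, signal.SIGTERM)
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                # The child ignored SIGTERM; escalate and reap it.
                _signal_group(proc.pid, signal.SIGKILL)
                proc.wait()


def _signal_group(pgid: int, sig: int) -> None:
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass  # The group has already exited.


if __name__ == "__main__":
    sleeper = f'"{sys.executable}" -c "import time; time.sleep(60)"'
    with supervised_child(sleeper) as child:
        print("child running:", child.poll() is None)
    print("child reaped:", child.poll() is not None)
```

Signalling the group rather than the single pid matters because the command runs through a shell, which may fork further processes that would otherwise be orphaned.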

Contributor

@edoakes edoakes left a comment

Good start; please re-request a review once you've addressed the outstanding issues.

Comment on lines 85 to 86
Created for each submitted job from JobManager, runs on head node in same
process as ray dashboard.
Contributor

This doesn't run in the same process as the Ray dashboard? Also, why are you mentioning the Ray dashboard in this file? That's a leaky abstraction. This file should be a standalone module.

Member Author

Not mentioning the dashboard here is a good idea. AFAIK the dashboard runs in a separate server process, but don't all job supervisor actors live in the same Ray process as the one where ray.init() was called in the dashboard module?

Contributor

Ray actors live in separate processes from the driver they're created in.

except ValueError: # Ray returns ValueError for nonexistent actor.
return None

def submit_job(self,
Contributor

This needs to accept arbitrary key-value metadata that gets attached to the inner job.

Member Author

Yep, we agreed on this from the beginning. I put it in a separate PR.
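
As a rough illustration of the request, and not the signature that the separate PR introduces, the sketch below shows a toy in-memory manager whose `submit_job` accepts optional key-value metadata and stores it with the job. `InMemoryJobManager`, the `entrypoint` parameter, `get_metadata`, and the restriction to string keys and values are all assumptions made for the example.

```python
import uuid
from typing import Dict, Optional


class InMemoryJobManager:
    """Toy stand-in for a job manager; it records jobs but does not run them."""

    def __init__(self) -> None:
        self._jobs: Dict[str, dict] = {}

    def submit_job(
        self, entrypoint: str, metadata: Optional[Dict[str, str]] = None
    ) -> str:
        # Copy the mapping so later mutation by the caller has no effect.
        metadata = dict(metadata or {})
        for key, value in metadata.items():
            # Assumption: restrict metadata to strings so it serializes easily.
            if not isinstance(key, str) or not isinstance(value, str):
                raise TypeError("metadata keys and values must be strings")
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"entrypoint": entrypoint, "metadata": metadata}
        return job_id

    def get_metadata(self, job_id: str) -> Dict[str, str]:
        # Return a copy so callers cannot mutate the stored metadata.
        return dict(self._jobs[job_id]["metadata"])


if __name__ == "__main__":
    manager = InMemoryJobManager()
    job_id = manager.submit_job("echo hello", metadata={"owner": "alice"})
    print(manager.get_metadata(job_id))
```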

Contributor

@edoakes edoakes left a comment

Looks good as a first cut; ping me to merge when the tests are passing.

@edoakes edoakes changed the title from "[Ready for review] Add job manager class for simple jobs python APIs" to "[jobs] Add job manager class for simple jobs python APIs" Oct 22, 2021
Development

Successfully merging this pull request may close these issues.

submit job with background mode