
[xray] Implement timeline and profiling API. #2306

Merged
merged 25 commits into ray-project:master from timeline on Jul 5, 2018

Conversation

@robertnishihara (Collaborator) commented Jun 26, 2018

This PR adds more detailed profiling information to the timeline for xray, allows you to view the timeline visualization, and improves the profiling API.

Example usage: to get a JSON dump of the profiling information that can be loaded in chrome://tracing, simply do

# Optionally pass in filename= to specify a file to write to.
ray.global_state.chrome_tracing_dump()

The dump will include profiling information for tasks (including the existing breakdown into deserializing arguments, executing the function, and getting the results) as well as ray.put, ray.get, ray.wait, task submission, and various other events. Profiling information will also be included from the driver, and this PR allows for the possibility of creating profiling events from any process that has a GCS client.
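Put together, a minimal end-to-end sketch (the task f and the output path here are illustrative, not part of this PR):

import ray

ray.init()

@ray.remote
def f():
    return 1

# Run some tasks so there is something to profile.
ray.get([f.remote() for _ in range(100)])

# Write a trace that chrome://tracing can load.
ray.global_state.chrome_tracing_dump(filename='/tmp/timeline.json')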

To add a custom span to the timeline, simply do

with ray.profile('event name'):
    # Do something.
    pass

Then this will appear in the timeline. This can be done within a task or on the driver (see the sketch below). An example timeline is attached.
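For instance, a sketch of custom spans inside a remote task (the task and span names are illustrative):

@ray.remote
def process(values):
    with ray.profile('preprocess'):
        values = [v + 1 for v in values]
    with ray.profile('reduce'):
        return sum(values)

ray.get(process.remote(list(range(10))))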

(Screenshot: an example timeline visualized in chrome://tracing, 2018-06-26.)

Not implemented in this PR

  • Timeline integration with web UI.

Possible TODOs in this PR:

  • Documentation of how to use the timeline/profiling tools.
  • Option for opening timeline in browser automatically.
  • Include task spec info in timeline (does this actually matter?).
  • Cleanups
    • Remove duplicate ProfileTableData flatbuffer definition.
    • Remove unused AddProfileEvent method.
  • Include node IP address correctly in timeline!
  • Double check that the timeline still works in legacy Ray.
  • Fix tests.
  • Check performance.

cc @alanamarzoev

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6298/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6299/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6302/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6317/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6330/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6380/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6381/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6385/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6435/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6436/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6447/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6449/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6450/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6451/

@robertnishihara (Collaborator, Author)

This PR decreases performance for

%time l = ray.get([f.remote() for _ in range(10000)])

from about 1.3s to about 1.6s.
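For reference, a self-contained version of that measurement (the no-op body of f is an assumption; the original snippet doesn't define it):

import time
import ray

ray.init()

@ray.remote
def f():
    pass  # Assumed no-op task; the original snippet doesn't define f.

start = time.time()
l = ray.get([f.remote() for _ in range(10000)])
print('took {:.2f} seconds'.format(time.time() - start))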

I think a natural solution to this is to do more batching (in particular to flush from workers on a timer instead of after every task). However, that runs into some of the thread-safety issues brought up in #2248, so I may put that on hold for now.

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6464/

@robertnishihara (Collaborator, Author)

jenkins, retest this please

def _profile_table(self, component_id):
"""Get the profile events for a given component.

TODO(rkn): This method should support limiting the number of log events
Contributor:
TODOs should not be part of the docstring (user documentation).

web browser and load the dumped file. Make sure to enable "Flow events"
in the "View Options" menu.

TODO(rkn): Support including the task specification data in the
Contributor:
TODOs should not be part of the docstring.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6466/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6467/
Test PASSed.

/// Store some profile events in the GCS.
///
/// \param conn The connection information.
/// \param event_type The type of the event.
Contributor:
This needs to be updated.

DEPENDS ${FBS_DEPENDS}
COMMENT "Running flatc compiler on ${NODE_MANAGER_FBS_SRC}"
VERBATIM)

#add_custom_target(gen_node_manager_fbs)
Contributor:
Remove this.

@pcmoritz merged commit b90e551 into ray-project:master on Jul 5, 2018
@pcmoritz deleted the timeline branch on July 5, 2018 at 06:23
@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6481/

@raulchen (Contributor) commented Jul 9, 2018

Hi @robertnishihara, it looks like worker.events is accessed from two threads but isn't protected by any lock. I think this could cause problems in some cases.

For example, if events.append is called between these two statements, the appended data will be lost:

ray/python/ray/worker.py, lines 2733 to 2737 at 4ef9d15:

worker.local_scheduler_client.push_profile_events(
    component_type, ray.ObjectID(worker.worker_id),
    worker.node_ip_address, worker.events)
worker.events = []

Is my understanding correct? If so, I can file a PR to fix this.
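For concreteness, a minimal sketch of such a fix (hypothetical; events_lock and this helper's exact signature are invented for illustration):

import threading

import ray

# Hypothetically, the worker would gain a lock next to its event buffer:
#     self.events = []
#     self.events_lock = threading.Lock()

def flush_profile_data(worker, component_type):
    # Swap the buffer out under the lock so a concurrent events.append
    # cannot land between the push and the reset and be lost.
    with worker.events_lock:
        events, worker.events = worker.events, []
    if events:
        worker.local_scheduler_client.push_profile_events(
            component_type, ray.ObjectID(worker.worker_id),
            worker.node_ip_address, events)

# Every events.append elsewhere would take the same lock:
#     with worker.events_lock:
#         worker.events.append(event)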

Another quick question: is there any reason why only raylet uses a background thread?

@raulchen (Contributor) commented Jul 9, 2018

Just found that the background thread is only used for drivers. Why do workers not use it?

@robertnishihara (Collaborator, Author)

@raulchen you're right, I think it should be protected by a lock.

It's only used in the raylet because I don't think we should add new functionality or change the behavior for the legacy code path.

It's currently only used on the driver, but I think it actually makes sense to use it in workers as well. Basically, each worker/driver could just have a background thread that calls _flush_profile_events once a second or so. Then we could remove this line from the raylet code path:

flush_profile_data()
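A rough sketch of that background thread (hypothetical; whether _flush_profile_events is a method or a free function, and its thread-safety, are assumptions here):

import threading
import time

def start_profile_flush_thread(worker, interval_seconds=1.0):
    # Periodically push this worker's buffered profile events to the GCS.
    def flush_loop():
        while True:
            time.sleep(interval_seconds)
            _flush_profile_events(worker)  # Name from the comment above; signature assumed.
    thread = threading.Thread(target=flush_loop)
    thread.daemon = True  # Don't keep the process alive just to flush events.
    thread.start()
    return thread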

# Note(rkn): This is run on a background thread in the driver. It uses the
# local scheduler client. This should be ok because it doesn't read from
# the local scheduler client and we have the GIL here. However, if either
# of those things changes, then we could run into issues.
Contributor:
After doing some more in-depth research, I believe this is still problematic: the get_task function isn't protected by the GIL.

Here's a more detailed explanation of how things can go wrong:

  1. The main thread executes the get_task function and is switched out in the middle of write_message (the request is now half-sent).
  2. The flush-profile-data thread then also calls into write_message via push_profile_events and writes its own data to the connection.
  3. The server then receives a corrupted request (see the sketch below).
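To make the interleaving concrete, a minimal sketch (illustrative only; the length-prefixed framing here is invented, not Ray's actual protocol):

import struct
import threading

write_lock = threading.Lock()

def write_message(sock, payload, use_lock=False):
    # A frame is an 8-byte length header followed by the payload. Without a
    # lock, thread A can be preempted right after sendall(header); thread B
    # then writes its own full frame, so the reader sees A's header followed
    # by B's bytes, i.e. a corrupted request.
    header = struct.pack('<Q', len(payload))
    if use_lock:
        # Holding the lock across the whole frame keeps it atomic.
        with write_lock:
            sock.sendall(header)
            sock.sendall(payload)
    else:
        sock.sendall(header)
        sock.sendall(payload)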

@raulchen mentioned this pull request on Jul 10, 2018.