Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pluggable device profiler support #389

Merged

Conversation

jzhoulon
Copy link
Contributor

@jzhoulon jzhoulon commented May 20, 2021

This RFC will be open for comment until Thursday, June 3rd 2021,

Modular TensorFlow profiler C API

Status Proposed
RFC # 389
Author(s) Zhoulong Jiang (zhoulong.jiang@intel.com), Yiqiang Li (yiqiang.li@intel.com), Eric Lin (eric.lin@intel.com), Jianhui Li (jian.hui.li@intel.com)
Sponsor Situ Yi (yisitu@google.com)
Updated 2021-05-13

Objective

Provide general profiler C API to allow new device profilers to modularly connect to the current TensorFlow runtime.

This RFC is based on the Modular TensorFlow RFC, which aims at extending the TensorFlow design to plug in capabilities like adding a new device profiler.

@yisitu
Copy link
Contributor

yisitu commented May 20, 2021

This RFC will be open for comment until Monday, June 4th , 2021.

Isn't June 4th 2021 a Friday?

Copy link
Contributor

@yisitu yisitu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round of edits.

rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
@penpornk
Copy link
Member

If you are interested in attending the design review meeting for this RFC, here are the call details:

Time: Thursday, June 3rd, 2021 from 3-4pm PT
Meeting link: Google Meet

Meeting participants are expected to read the RFC ahead of time as the meeting will focus on the remaining issues/questions.

cc: @kulinseth @wchao1115 @reedwm

@wchao1115
Copy link
Contributor

yeah, I'll be there.

jzhoulon and others added 12 commits May 20, 2021 21:13
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
@jzhoulon
Copy link
Contributor Author

This RFC will be open for comment until Monday, June 4th , 2021.

Isn't June 4th 2021 a Friday?

Thanks for the notify, I have fixed it, and thanks for helping edit the document!

jzhoulon and others added 3 commits May 23, 2021 10:38
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
@@ -0,0 +1,440 @@
# Modular TensorFlow profiler C API
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deven-amd
Please have a look at this RFC. I think it would be cleaner to have all the GPU profilers consolidated behind a common interface.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @yisitu, I am working closely with Deven in AMD. Please also CC me for further discussions for the TF profiler. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

@penpornk
Copy link
Member

penpornk commented Jun 7, 2021

Design Review Meeting Notes

@jzhoulon went through recently resolved comments then started a discussion on the design alternative.

@jzhoulon: Could you please clarify the design alternative?
@jbaiocchi: TF Profiler already has a plug-in system. This design adds another plug-in on top.

  • Forcing every vendor to conform to the same C API doesn’t seem advisable.
  • The alternative is to let each vendor own a C++ subclass of ProfilerInterface in core TensorFlow and a C API for the subclass to communicate with its corresponding profiler plug-in.
  • We don’t need to standardize the C API. Just the C++ part.

@jbaiocchi: What about a C++ boundary?
@aselle: Then people will have to recompile everything

  • C++ header-only file.
  • Specific compiler.

@jzhoulon: That wouldn’t work for us. Users should be able to just install plug-ins (alongside the default tensorflow package) and everything works.
@yisitu: C++ ABI boundary is too messy. We should focus on the current C API design, or the alternative design which adds C++ subclasses with C API gateways.
@jbaiocchi: Sounds good.

@jbaiocchi: The PluginInterfaceFactory stores a vector of interfaces. Why is one interface not sufficient?
@jbaiocchi: Who is responsible for testing if I want my own ops?

@aselle: Does the mechanism allow multiple profilers?
@jbaiocchi: The expectation is that there is one profiler per device.
@aselle: Having one profiler per .so file is a limitation. However, a workaround is to have two .so files and dlsym the entry points.
@yisitu: I think that can work.
@aselle: Metaprofiler.

@jbaiocchi: Having two levels of plug-ins within TF doesn’t convince me.
@yisitu: Do we want to own the ABI?
@jbaiocchi: The question is whether the data can be interpreted correctly by all profiler tools.
@aselle: How does TF Profiler work right now? Through Proto?
@jbaiocchi: Proto -> MetaProto that can be interpreted by tools

  • Look into properties of events by names whether they are meaningful to the tools.
  • Conform to XSpace schema.
  • @yisitu: Can be thought of as a nested map in proto where you can attach metadata.

@aselle: Binary protocol is well-defined.

@jbaiocchi: Is there a fundamental reason for having multiple interfaces? Two-level registration worries me.

  • Plug-in interface is not a ProfilerInterface. Is there a reason it shouldn’t be?
  • Should the registration go to the main registry instead of the pluggable profiler registry?
  • Should just go through necessary interfaces, not all the available ones.

@aselle: Flatter hierarchy would be good, although the 2-level hierarchy shouldn’t introduce any extra limitations either.
@jbaiocchi: I’d rather have a flat hierarchy.
@aselle: Can use different constructor strings.
@jbaiocchi: Reuse what we already have.

  • Factory that knows how to instantiate different .so files.
  • Add an enum to the lease.

@aselle: Do we have a way to initialize different things right now? The current profiler factory is not like that.
@yiqianglee: A device wouldn’t know between profiler1 vs profiler2.
@aselle: That’s a limitation of static registration. Can do name mapping with user1, user2, ...

@aselle: So pluggable profiler can be a subclass of ProfilerInterface

  • Get rid of the for loop.
  • Single DSL.
  • Need some redesign work to fix ProfileOptions.
  • Probably using strings.

@jianhuili: Right now PluggableDevices use two strings, device_type and device_alias. (Design/naming subject to change.)
@aselle: Platform name?
@reedwm: Physical device name?

@aselle: XPlane protobuf semantics need to be documented more thoroughly.
@YiqiangLi: We require plug-ins to store the exact copy of the XPlane protobuf.

@aselle: For collection calls, there are two function pointers. Can we separate them into 4 functions?
@jbaiocchi: We actually don’t want the RunMetadata one now.
@yisitu: We made it one function to minimize the surface area.
@aselle: It’s still 2 functions in one: give me the size, I’ll give you the buffer.

@aselle: What is the difference between using sizeof and TF_OFFSET_OF_END for struct size?
@yisitu: TF_OFFSET_OF_END gives the size of the structure without padding. It can be used to track the offset of the last variable in the struct.

@aselle: There are two use cases for ext, one is reserved for future use, the other is for storing user data.
@yisitu: We don’t believe it’s being used for user data.
@aselle: We should call it something more descriptive rather than ext

@jzhoulon: Many device event summary pattern-matches strings, e.g., in ClassifyGpuEvent.
@jbaiocchi: Problem of XSpace not telling you what should be in there for the tools to work.
@YiqiangLi: Do you plan to make it more general (not just pattern-matching)?
@jbaiocchi: Devices should expose the strings and we shouldn’t use hard coding.

@yisitu: Someone should write up a pseudocode for the new design, then we can have another design review meeting.

Copy link

@jbaiocchi jbaiocchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix typos.

rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
@penpornk
Copy link
Member

The RFC has been updated. There will be a second design review meeting. Here are the call details:

Time: Thursday, July 22nd, 2021 from 3-4pm PT
Meeting link: Google Meet

Meeting participants are expected to read the updated RFC ahead of time as the meeting will focus on the remaining issues/questions.

cc: @kulinseth @wchao1115 @reedwm

Copy link
Contributor

@yisitu yisitu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typos + formatting + clearing entire profiler registration struct.

rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
Copy link
Contributor

@yisitu yisitu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed typo.

rfcs/20210513-pluggable-profiler-for-tensorflow.md Outdated Show resolved Hide resolved
jzhoulon and others added 6 commits July 22, 2021 21:30
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>
Copy link
Contributor

@yisitu yisitu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As spoken in the meeting, RFC looks good to me.

@penpornk
Copy link
Member

2nd Design Review Meeting Notes

@jzhoulon: Changes from last time:

  • PluggableProfiler is now a subclass of tensorflow::profiler::ProfilerInterface.
  • Only XSpace will be passed around between TensorFlow and plug-in. (No RunMetadata anymore.)

Start()/Stop() Behaviors

@aselle: Can Stop() be called multiple times without a Start()?
@jbaiocchi: TensorFlow only has 1 global client to ensure that Start()/Stop() are called exactly once per instance.
@yisitu: After a Start() is called, there could be a hardware lock while the profiling session is active. If Start() is called twice, the plug-in needs to fail elegantly.
@jbaiocchi: From TF’s perspective, Start()/Stop() will be called only once.
Plug-ins can expect TF to be well-behaved. However, profiler libraries themselves should be robust, in case it’s not plugged into TF but something else that doesn’t have this guarantee.
@aselle: Let’s add “TF will call 1 Start() followed by 1 Stop()” to the RFC.
@yisitu: Let’s also add that plug-ins should fail when getting 2 Start()s?
@jbaiocchi: Some hardware might support multiplexing, e.g., multiple profiler sessions at the same time.
The plug-ins can just return a status warning that a Start() has already been called instead.
@yisitu: Agreed.
@jbaiocchi: We should document the contract from the TF side, e.g., what TF is promising to the libraries.

@reza-amd: Does Stop() need to be blocking?
@jzhoulon: Stop() is non-blocking.
@jbaiocchi:

  • We have two functions, Stop() and CollectData(), exactly for this. – We should also document this.
    • Stop() interacts with the hardware.
    • CollectData() converts data from native profiler format to TF format. It’s not allowed to interact with the hardware.
  • In an RPC mode where we send profiling requests to workers, the control thread of the profiler can be independent of threads that are doing the work. In that case, the control thread can be blocking.
  • Anyway, instantiating a new profiler session while converting the existing profiling data should work.

@yisitu: API can fit different types of use cases, e.g., Stop() can either flush or not flush the buffer. Interaction with hardware is intentionally not mentioned in the RFC such that the API is not too restrictive.
Maybe we need an IsReady() call?
@jbaiocchi: That can be done behind the scenes.
@yisitu: We don’t need to add anything right now. We can extend it afterwards if needed.

Data Structure

@Jianhui-Li: One last unresolved topic from the last review was about the data structure. Just wanted to confirm that it’s not part of this RFC.
@yisitu / @jbaiocchi: Yes, it should be a follow-up RFC – define standard ML metrics / construct data collection.
@Jianhui-Li: Will clarify this in the RFC.
@jbaiocchi: Right now, TraceViewer is the only tool that we promise will work.
Plug-ins own all the contents you put in XPlane.
@Jianhui-Li: Do you have a plan to sort that out?
@jbaiocchi: When they did it for AMD GPU, they looked at the API for Nvidia and created an equivalent one. We understand that this is not necessarily the case for every device.
@yisitu: We will look at what other tools users are interested in apart from TraceViewer. TraceViewer is a good first step.
@Jianhui-Li: Sounds good.

Action Items

@jbaiocchi: Two action items:

  • Outline the behaviors that plug-ins can expect from TensorFlow.
  • Document what people need to do to ensure TraceViewer will work.

Code Location and Testing

@reza-amd: Where would the code be in?
@yisitu: The interface code will be in the TensorFlow repository. Plug-in code would be in your own repositories.
@reza-amd: Who will run the unit tests?
@jbaiocchi: We won’t have all the hardware to test for all plug-ins, so the plug-in developers will have to run the tests themselves. We can do a simple mock test.
@reza-amd: You mentioned multiple-pipeline profiling sessions. What are the scenarios we should test for? Is there a test plan?
@jbaiocchi: We can create a mock library with mock tests. You can use this as an example of what we expect the real plug-ins to conform to.
@yisitu: You can also contribute tests to the TensorFlow repository.

@penpornk / @yisitu: What device are you targeting, @reza-amd?
@reza-amd: AMD GPUs. We are looking to improve the profiler support.

@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Sep 1, 2021
@ematejska ematejska moved this from Open reviews to Accepted RFCs in RFC management Sep 1, 2021
@theadactyl theadactyl merged commit 24cd8f0 into tensorflow:master Sep 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes RFC: Accepted RFC Design Document: Accepted by Review
Projects
RFC management
  
Accepted RFCs
Development

Successfully merging this pull request may close these issues.

None yet

8 participants