-
Notifications
You must be signed in to change notification settings - Fork 25.4k
[DDP] Add host-side time to CUDATimer #62770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adding timing of forward, backward comp, backward comm, etc will help detect desynchronization issues. Differential Revision: [D30115585](https://our.internmc.facebook.com/intern/diff/D30115585/) [ghstack-poisoned]
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit da941fb (more details on the Dr. CI page):
🕵️ 2 new failures recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
Adding timing of forward, backward comp, backward comm, etc will help detect desynchronization issues. Differential Revision: [D30115585](https://our.internmc.facebook.com/intern/diff/D30115585/) ghstack-source-id: 135081989 Pull Request resolved: #62770
class TORCH_API Timer { | ||
private: | ||
// The timestamp of forward call start time in each iteration. | ||
int64_t forward_start_time = kUnsetTime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Optional] Actually it is a mixed coding style. Personally prefer also using c10::optional<int64_t> instead of int64_t here, so no need create a kUnsetTime
and have separate processing (that maps kUnsetTime
to nullopt
).
I understand this is not new code brought in by this PR. No need to do this code improvement in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #62862, it can be a good ramp up task
} | ||
} | ||
|
||
// Return host-side time member variable corresponding to the given event. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's specific to "host-side", will it make more sense to move it back to CPUTime
class? Or do you want to update the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comment regarding this
|
||
virtual ~Timer() = default; | ||
|
||
// Return host-side timestamp, or nullopt if it has not yet been recorded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's specific to "host-side", will it make more sense to move it back to CPUTime class? Or do you want to update the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the logger, this is also accessed for GPU timing events so we can get the host-side time, so I think it makes sense to keep it in the parent class?
// Record the current event, i.e., mark it as having occurred now. Default | ||
// CPU implementation. | ||
virtual void record(Event event) { | ||
getTime(event) = current_time_in_nanos(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Better to rename it as getTimeRef
, since this method is not a common read-only getter, but returns a mutable reference.
Timer::Event event) { | ||
auto timestamp = timer.getTimestamp(event); | ||
if (timestamp != c10::nullopt) { | ||
// TODO: should we set this as human-readable time instead of unixtime? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that what are mainly used are the results returned by measureDifference
, so should be fine to just use unixtime.
Adding timing of forward, backward comp, backward comm, etc will help detect desynchronization issues. Differential Revision: [D30115585](https://our.internmc.facebook.com/intern/diff/D30115585/) [ghstack-poisoned]
Pull Request resolved: #62770 Adding timing of forward, backward comp, backward comm, etc will help detect desynchronization issues. ghstack-source-id: 135195680 Differential Revision: [D30115585](https://our.internmc.facebook.com/intern/diff/D30115585/)
This pull request has been merged in 44fad84. |
Stack from ghstack:
Adding timing of forward, backward comp, backward comm, etc will help
detect desynchronization issues.
Differential Revision: D30115585