@rohan-varma rohan-varma commented Aug 4, 2021

Adding host-side timestamps for forward, backward computation, backward
communication, etc. will help detect desynchronization issues.

Differential Revision: D30115585

facebook-github-bot commented Aug 4, 2021

💊 CI failures summary and remediations

As of commit da941fb (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/2)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (content): Merge conflict in torch/testing/_internal/common_methods_invocations.py
Auto-merging torch/nn/parallel/distributed.py
Auto-merging torch/csrc/distributed/c10d/reducer.hpp
Auto-merging torch/csrc/distributed/c10d/reducer.cpp
Auto-merging torch/csrc/distributed/c10d/logger.hpp
Auto-merging torch/csrc/distributed/c10d/logger.cpp
Auto-merging torch/csrc/distributed/c10d/init.cpp
Auto-merging torch/_jit_internal.py
Auto-merging tools/stats/import_test_stats.py
Auto-merging test/test_jit.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/2)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.

HEAD is now at da941fb100 Update on "[DDP] Add host-side time to CUDATimer"
+ git reset --hard da941fb100165a5386a90040528c4f21f0f63167
HEAD is now at da941fb100 Update on "[DDP] Add host-side time to CUDATimer"
+ git merge --allow-unrelated-histories --no-edit --no-ff 5c431981b5b36da6dba61f0e5d5101e72d2fd726
Auto-merging torch/testing/_internal/common_methods_invocations.py
CONFLICT (content): Merge conflict in torch/testing/_internal/common_methods_invocations.py
Auto-merging torch/_jit_internal.py
Auto-merging tools/stats/import_test_stats.py
Auto-merging test/test_jit.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1



rohan-varma added a commit that referenced this pull request Aug 4, 2021
ghstack-source-id: 135081989
Pull Request resolved: #62770
class TORCH_API Timer {
 private:
  // The timestamp of forward call start time in each iteration.
  int64_t forward_start_time = kUnsetTime;
Contributor

[Optional] This is actually a mixed coding style. Personally, I would prefer using c10::optional<int64_t> instead of int64_t here as well, so there is no need to create kUnsetTime and handle it separately (mapping kUnsetTime to nullopt).

I understand this is not new code brought in by this PR. No need to do this code improvement in this PR.

Contributor Author

Filed #62862; it could be a good ramp-up task.
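To make the two styles in this thread concrete, here is a minimal sketch. It is not the code from this PR: the struct names are hypothetical, the value of kUnsetTime is assumed, and std::optional (C++17) stands in for c10::optional so that the example compiles on its own.

```cpp
#include <cstdint>
#include <optional>

// Assumed sentinel value; the PR only shows that a kUnsetTime constant exists.
constexpr int64_t kUnsetTime = -1;

// Style A (hypothetical): a raw int64_t field plus a sentinel. Every reader
// has to translate the sentinel into "no value" on its own.
struct SentinelTimer {
  int64_t forward_start_time = kUnsetTime;

  std::optional<int64_t> getTimestamp() const {
    if (forward_start_time == kUnsetTime) {
      return std::nullopt;
    }
    return forward_start_time;
  }
};

// Style B (hypothetical): the field itself is optional, so neither the
// sentinel nor the mapping step is needed.
struct OptionalTimer {
  std::optional<int64_t> forward_start_time;

  std::optional<int64_t> getTimestamp() const {
    return forward_start_time;
  }
};
```

Both variants expose the same optional-returning accessor; the difference is only where the "unset" state is encoded.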

}
}

// Return host-side time member variable corresponding to the given event.
Contributor

Since it's specific to "host-side", would it make more sense to move it back to the CPUTime class? Or do you want to update the comment?

Contributor Author

See above comment regarding this


virtual ~Timer() = default;

// Return host-side timestamp, or nullopt if it has not yet been recorded.
Contributor

Since it's specific to "host-side", would it make more sense to move it back to the CPUTime class? Or do you want to update the comment?

Contributor Author

In the logger, this is also accessed for GPU timing events so we can get the host-side time, so I think it makes sense to keep it in the parent class?
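The argument for keeping the host-side accessor in the parent class could be sketched roughly as follows. This is an illustration under assumptions, not the reducer code: the Event values, FakeDeviceTimer, and the slot and nowNanos helpers are all hypothetical, and std::optional stands in for c10::optional.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical, simplified event set.
enum class Event { kForwardStart, kBackwardComputeStart };

class Timer {
 public:
  virtual ~Timer() = default;

  // Default CPU implementation: stamp the event with the host clock.
  virtual void record(Event event) {
    slot(event) = nowNanos();
  }

  // Host-side timestamp, or nullopt if the event has not been recorded.
  // Lives in the base class so callers can use it for any timer type.
  std::optional<int64_t> getTimestamp(Event event) const {
    return event == Event::kForwardStart ? forward_start_
                                         : backward_compute_start_;
  }

 protected:
  static int64_t nowNanos() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
               system_clock::now().time_since_epoch())
        .count();
  }

  std::optional<int64_t>& slot(Event event) {
    return event == Event::kForwardStart ? forward_start_
                                         : backward_compute_start_;
  }

 private:
  std::optional<int64_t> forward_start_;
  std::optional<int64_t> backward_compute_start_;
};

// Hypothetical device timer. A real one would enqueue a device event here;
// this stand-in only counts calls so the example runs without a GPU.
class FakeDeviceTimer : public Timer {
 public:
  void record(Event event) override {
    Timer::record(event);  // keep the host-side timestamp as well
    ++device_events_recorded_;
  }

  int deviceEventsRecorded() const { return device_events_recorded_; }

 private:
  int device_events_recorded_ = 0;
};
```

With this layout a logger can call getTimestamp on any Timer, whether it was recorded by the CPU default or by a device-specific override.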

// Record the current event, i.e., mark it as having occurred now. Default
// CPU implementation.
virtual void record(Event event) {
  getTime(event) = current_time_in_nanos();
Contributor

Nit: Better to rename it to getTimeRef, since this method is not a typical read-only getter but returns a mutable reference.
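A small sketch of the naming distinction raised in this nit, assuming a simplified Timer with two hypothetical events and an assumed sentinel of -1. Only the name getTimeRef comes from the comment; everything else is illustrative.

```cpp
#include <cstdint>

// Hypothetical, simplified event set.
enum class Event { kForwardStart, kBackwardComputeStart };

class Timer {
 public:
  // Conventional getter: const, returns a copy, cannot change the timer.
  int64_t getTime(Event event) const {
    return event == Event::kForwardStart ? forward_start_time_
                                         : backward_compute_start_time_;
  }

  // Mutable accessor: the "Ref" suffix signals that the caller receives a
  // reference and may assign through it.
  int64_t& getTimeRef(Event event) {
    return event == Event::kForwardStart ? forward_start_time_
                                         : backward_compute_start_time_;
  }

  // Writes go through the mutable accessor, as in the record() hunk above.
  void record(Event event, int64_t now_nanos) {
    getTimeRef(event) = now_nanos;
  }

 private:
  int64_t forward_start_time_ = -1;           // assumed "unset" sentinel
  int64_t backward_compute_start_time_ = -1;  // assumed "unset" sentinel
};
```

The call site `getTimeRef(event) = now_nanos` makes the assignment through a reference visible, whereas `getTime(event) = ...` reads like a write to a temporary.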

Timer::Event event) {
  auto timestamp = timer.getTimestamp(event);
  if (timestamp != c10::nullopt) {
    // TODO: should we set this as human-readable time instead of unixtime?
Contributor

@wayi1 wayi1 Aug 5, 2021

It seems that what is mainly used is the result returned by measureDifference, so it should be fine to just use unixtime.
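For illustration, the trade-off in the TODO could look like the sketch below. The free function measureDifference here is a hypothetical stand-in that takes two optional timestamps (it is not the signature used in the PR), toUtcString is an invented helper, and std::optional replaces c10::optional. Timestamps are assumed to be nanoseconds since the Unix epoch.

```cpp
#include <cstdint>
#include <ctime>
#include <optional>
#include <string>

// Hypothetical helper: elapsed nanoseconds between two recorded timestamps,
// or nullopt if either is missing or the pair is out of order.
std::optional<int64_t> measureDifference(std::optional<int64_t> start,
                                         std::optional<int64_t> end) {
  if (!start || !end || *end < *start) {
    return std::nullopt;
  }
  return *end - *start;
}

// Hypothetical helper: render a Unix timestamp in nanoseconds as a UTC string
// with second resolution. Returns an empty string if conversion fails.
std::string toUtcString(int64_t unix_nanos) {
  const std::time_t seconds =
      static_cast<std::time_t>(unix_nanos / 1000000000LL);
  // std::gmtime returns a pointer to shared static storage (not thread-safe),
  // so copy the result before doing anything else.
  const std::tm* shared = std::gmtime(&seconds);
  if (shared == nullptr) {
    return std::string();
  }
  const std::tm utc = *shared;
  char buf[32];
  const std::size_t n =
      std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &utc);
  return std::string(buf, n);
}
```

Differences only need integer subtraction on the raw values, which is why storing unixtime is sufficient; a human-readable form can be derived at display time if it is ever wanted.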

rohan-varma added a commit that referenced this pull request Aug 5, 2021
Pull Request resolved: #62770

ghstack-source-id: 135195680

@facebook-github-bot

This pull request has been merged in 44fad84.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/373/head branch August 10, 2021 14:17
Labels: cla signed, Merged, oncall: distributed