
Skip dummy node creation for autograd engine when there is a single input and place on correct queue #47592

Closed
soulitzer wants to merge 14 commits

Conversation


@soulitzer (Contributor) commented on Nov 9, 2020

Fixes #42890

  • Removes the dummy node (GraphRoot) when there is a single input
  • Places the graph root on the ready queue matching the input buffer's device, instead of defaulting to the CPU queue (see the sketch below)
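
For intuition, here is a minimal Python sketch of the queue-placement rule. This is a hypothetical illustration only: the actual change lives in C++ in torch/csrc/autograd/engine.cpp, and ready_queue_index below is an invented name that only mirrors the engine's convention of keying ready queues by device.

import torch

def ready_queue_index(input_buffer_device: torch.device) -> int:
    # The engine keeps one ready queue for the CPU thread plus one per
    # device worker thread. Previously the graph root task always landed
    # on the CPU queue; with this change, a single-input root is pushed
    # onto the queue matching its input buffer's device.
    if input_buffer_device.type == "cpu":
        return 0
    return input_buffer_device.index + 1

print(ready_queue_index(torch.device("cpu")))      # 0 (CPU queue)
print(ready_queue_index(torch.device("cuda", 0)))  # 1 (cuda:0 worker's queue)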

CPU: no significant change in wall-clock time (too noisy to measure), but up to a ~7% reduction in instruction count for small graphs.
CUDA: a reduction in wall-clock time (still very noisy) and up to a ~20% reduction in instruction count for small graphs.

CPU
Code:

import torch
from torch.utils.benchmark import Timer

# a*b produces a single output tensor, so the engine sees a single graph root.
setup = """
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""

stmt = """
torch.autograd.grad(a*b, [a, b], gradient)
"""

timer = Timer(stmt, setup)

print(timer.timeit(10000))           # wall-clock time
print(timer.collect_callgrind(100))  # instruction counts via Valgrind

Before (when dummy node is not skipped):

torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

  26.62 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7efee44ad8e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

                           All          Noisy symbols removed
    Instructions:      9755488                    9659378
    Baseline:             4300                       3784
100 runs per measurement, 1 thread

After

<torch.utils.benchmark.utils.common.Measurement object at 0x7f56961a7730>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

  26.78 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f56961a78e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

                           All          Noisy symbols removed
    Instructions:      9045508                    8939872
    Baseline:             4280                       3784
100 runs per measurement, 1 thread
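
As a sanity check, the "up to ~7%" CPU figure can be recomputed from the "Noisy symbols removed" instruction counts above:

before, after = 9659378, 8939872
print(f"{100 * (before - after) / before:.1f}% fewer instructions")  # ~7.4% for this 2x2 case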

CUDA
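Code (the PR description does not include the exact CUDA script; this reconstruction is inferred from the setup echoed in the measurements below, as an analogue of the CPU benchmark):

import torch
from torch.utils.benchmark import Timer

# out is built once in setup; since adding saves no tensors for backward,
# grad() can be called on the same graph repeatedly without retain_graph.
setup = """
x = torch.rand((2,2), requires_grad=True, device="cuda")
y = torch.rand((2,2), requires_grad=True, device="cuda")
out = x + y
gradient = torch.ones(2, 2).cuda()
"""

stmt = """
torch.autograd.grad(out, [x, y], gradient)
"""

timer = Timer(stmt, setup)
print(timer.timeit(10000))
print(timer.collect_callgrind(100))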

Before

<torch.utils.benchmark.utils.common.Measurement object at 0x7f84cbaa1ee0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  70.49 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f84cbaa1e50>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

                           All          Noisy symbols removed
    Instructions:      5054581                    4951911
    Baseline:             4105                       3735
100 runs per measurement, 1 thread

Remove dummy node only

<torch.utils.benchmark.utils.common.Measurement object at 0x7fbf29c67eb0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  55.65 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fbf29c67e20>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

                           All          Noisy symbols removed
    Instructions:      5002105                    4900841
    Baseline:             4177                       3731
100 runs per measurement, 1 thread

Remove dummy node and put in correct queue

<torch.utils.benchmark.utils.common.Measurement object at 0x7fb64438ce80>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  27.56 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fb64438cdf0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

                           All          Noisy symbols removed
    Instructions:      4104433                    4007555
    Baseline:             4159                       3735
100 runs per measurement, 1 thread
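
Likewise, the "~20%" CUDA figure follows from the "Noisy symbols removed" counts, comparing the baseline against the dummy-node removal plus correct-queue placement:

before, after = 4951911, 4007555
print(f"{100 * (before - after) / before:.1f}% fewer instructions")  # ~19.1%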

dr-ci bot commented Nov 9, 2020

💊 CI failures summary and remediations

As of commit 11f8e29 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@soulitzer marked this pull request as ready for review on November 10, 2020 18:38
[Four review threads on torch/csrc/autograd/engine.cpp, all resolved (three marked outdated)]
@albanD (Collaborator) left a comment

LGTM!

[Three review threads on torch/csrc/autograd/engine.cpp, all resolved (outdated)]
@facebook-github-bot (Contributor) left a comment

@soulitzer has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@albanD self-requested a review on November 12, 2020 14:47

codecov bot commented Nov 12, 2020

Codecov Report

Merging #47592 (11f8e29) into master (b6cb2ca) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master   #47592   +/-   ##
=======================================
  Coverage   81.21%   81.21%           
=======================================
  Files        1837     1837           
  Lines      198086   198095    +9     
=======================================
+ Hits       160874   160884   +10     
+ Misses      37212    37211    -1     


@soulitzer changed the title from "Skip dummy node creation for autograd engine when there is a single input" to "Skip dummy node creation for autograd engine when there is a single input and place on correct queue" on Nov 12, 2020
@soulitzer merged this pull request in d20483a.
