[PyTorch] Save a single add instruction in the dispatcher #52543
Conversation
This saves one (1) add instruction. New code comments should explain exactly why. In short, we store a direct pointer in `OperatorHandle` in addition to the `std::list<OperatorDef>::iterator` because converting the latter to the former requires an add instruction. It is not clear to me whether this is a particularly great tradeoff, but I spent (more) time on it (than I expected), so here it is for review. Differential Revision: [D26564192](https://our.internmc.facebook.com/intern/diff/D26564192/) [ghstack-poisoned]
💊 CI failures summary and remediations
As of commit 0e15d1f (more details on the Dr. CI page):
1 failure not recognized by patterns:
This comment was automatically generated by Dr. CI.
Codecov Report
```
@@              Coverage Diff               @@
##           gh/swolchok/136/base    #52543      +/-   ##
========================================================
- Coverage                 80.66%    80.66%    -0.01%
========================================================
  Files                      1967      1967
  Lines                    215898    215900        +2
========================================================
- Hits                     174154    174153        -1
- Misses                    41744     41747        +3
```
Feeling ambivalent about this. From a code style point of view, storing the same data twice is bad since it introduces potential inconsistencies. But if it saves an instruction and that's worth it, then go ahead.
@smessmer I think we need to balance the two kinds of redundancy though. E.g. precomputing the dispatch table is similarly redundant, in the sense that they both store intermediate results for later use. And we're doing that for essentially the same reason as this (to save a ton of redundant computation). Also from a human standpoint, accessing operator properties through the def rather than the iterator is much clearer. I mean arguably, the redundant member we're storing to save computation is the iterator rather than the definition, right? ofc this is all pending Scott's confirmation of the effect on instructions. But if the 1% perf bump holds up, that's nothing to sneeze at.
I rechecked the assembly diff. There is some noise because register allocation gets perturbed and different things end up getting spilled to the stack. However, the net savings is indeed one instruction. Landing.
This pull request has been merged in f5e7255.
Summary:
Pull Request resolved: pytorch#52543

This saves one (1) add instruction. New code comments should explain exactly why. In short, we store a direct pointer in `OperatorHandle` in addition to the `std::list<OperatorDef>::iterator` because converting the latter to the former requires an add instruction. It is not clear to me whether this is a particularly great tradeoff, but I spent (more) time on it (than I expected), so here it is for review.

ghstack-source-id: 122147199

Test Plan: Inspect assembly for at::empty in benchmark code -- see that the add instruction disappeared. Compare empty benchmark performance to baseline with perf stat.

Baseline:
```
      5,077.43 msec task-clock          #    1.000 CPUs utilized          ( +-  0.25% )
           405      context-switches    #    0.080 K/sec                  ( +-  1.37% )
             3      cpu-migrations      #    0.001 K/sec                  ( +- 18.22% )
        12,259      page-faults         #    0.002 M/sec                  ( +-  0.10% )
10,089,754,343      cycles              #    1.987 GHz                    ( +-  0.25% )  (50.04%)
29,516,000,227      instructions        #    2.93  insn per cycle         ( +-  0.04% )  (50.08%)
 5,662,629,032      branches            # 1115.256 M/sec                  ( +-  0.02% )  (50.08%)
     1,955,729      branch-misses       #    0.03% of all branches        ( +-  0.88% )  (50.04%)

        5.0796 +- 0.0128 seconds time elapsed  ( +-  0.25% )
```

After:
```
      5,017.77 msec task-clock          #    1.001 CPUs utilized          ( +-  0.19% )
           400      context-switches    #    0.080 K/sec                  ( +-  3.09% )
             4      cpu-migrations      #    0.001 K/sec                  ( +- 46.91% )
        12,240      page-faults         #    0.002 M/sec                  ( +-  0.37% )
 9,960,189,535      cycles              #    1.985 GHz                    ( +-  0.19% )  (50.02%)
29,467,149,773      instructions        #    2.96  insn per cycle         ( +-  0.11% )  (50.03%)
 5,661,074,219      branches            # 1128.206 M/sec                  ( +-  0.02% )  (50.07%)
     2,032,712      branch-misses       #    0.04% of all branches        ( +-  1.35% )  (50.07%)

        5.0151 +- 0.0101 seconds time elapsed  ( +-  0.20% )
```

1.2% cycles win, outside the noise; 0.16% instruction count win, barely outside the noise. I am surprised at the size of the cycles win.

Reviewed By: bhosmer
Differential Revision: D26564192
fbshipit-source-id: 71f731ba54ec1cb407673db691eaf77a257de4a9
Stack from ghstack:
This saves one (1) add instruction. New code comments should
explain exactly why. In short, we store a direct pointer in
`OperatorHandle` in addition to the `std::list<OperatorDef>::iterator`
because converting the latter to the former requires an add instruction.
It is not clear to me whether this is a particularly great tradeoff,
but I spent (more) time on it (than I expected), so here it is for
review.
Differential Revision: D26564192