
[PyTorch] Save a single add instruction in the dispatcher #52543

Closed

Conversation

@swolchok (Contributor) commented Feb 20, 2021

Stack from ghstack:

This saves one (1) add instruction. New code comments should
explain exactly why. In short, we store a direct pointer in
`OperatorHandle` in addition to the `std::list<OperatorDef>::iterator`
because converting the latter to the former requires an add instruction.

It is not clear to me whether this is a particularly great tradeoff,
but I spent (more) time on it (than I expected), so here it is for
review.

Differential Revision: [D26564192](https://our.internmc.facebook.com/intern/diff/D26564192/)

@facebook-github-bot (Contributor) commented Feb 20, 2021

💊 CI failures summary and remediations

As of commit 0e15d1f (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

1 failure not recognized by patterns:

  Job: GitHub Actions flake8-py3 · Step: Add annotations

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

swolchok added a commit that referenced this pull request Feb 20, 2021
ghstack-source-id: 122147199
Pull Request resolved: #52543
@codecov bot commented Feb 20, 2021

Codecov Report

Merging #52543 (0e15d1f) into gh/swolchok/136/base (7ca9776) will decrease coverage by 0.01%.
The diff coverage is 86.20%.

```
@@                   Coverage Diff                    @@
##           gh/swolchok/136/base   #52543      +/-   ##
========================================================
- Coverage                 80.66%   80.66%   -0.01%
========================================================
  Files                      1967     1967
  Lines                    215898   215900       +2
========================================================
- Hits                     174154   174153       -1
- Misses                    41744    41747       +3
```

@smessmer (Contributor) left a comment

Feeling ambivalent about this. From a code style point of view, storing the same data twice is bad since it introduces potential inconsistencies. But if it saves an instruction and that's worth it, then go ahead.

@bhosmer (Contributor) commented Feb 22, 2021

> Feeling ambivalent about this. From a code style point of view, storing the same data twice is bad since it introduces potential inconsistencies. But if it saves an instruction and that's worth it, then go ahead.

@smessmer I think we need to balance the two kinds of redundancy though. E.g. precomputing the dispatch table is similarly redundant, in the sense that they both store intermediate results for later use. And we're doing that for essentially the same reason as this (to save a ton of redundant computation).

Also from a human standpoint, accessing operator properties through the def rather than the iterator is much clearer. I mean arguably, the redundant member we're storing to save computation is the iterator rather than the definition, right?

ofc this is all pending Scott's confirmation of the effect on instructions. But if the 1% perf bump holds up, that's nothing to sneeze at.

@swolchok (Contributor, Author) commented Mar 1, 2021

I rechecked the assembly diff. There is some noise because register allocation gets perturbed and different things end up getting spilled to the stack. However, the net savings is indeed one instruction. Landing.

@facebook-github-bot (Contributor)

This pull request has been merged in f5e7255.

@facebook-github-bot deleted the gh/swolchok/136/head branch March 7, 2021 15:16
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021

Summary:
Pull Request resolved: pytorch#52543

This saves one (1) add instruction. New code comments should
explain exactly why. In short, we store a direct pointer in
`OperatorHandle` in addition to the `std::list<OperatorDef>::iterator`
because converting the latter to the former requires an add instruction.

It is not clear to me whether this is a particularly great tradeoff,
but I spent (more) time on it (than I expected), so here it is for
review.
ghstack-source-id: 122147199

Test Plan:
Inspect assembly for at::empty in benchmark code -- see add
instruction disappeared.

Compare empty benchmark performance to baseline with perf stat.

Baseline:
```
          5,077.43 msec task-clock                #    1.000 CPUs utilized            ( +-  0.25% )
               405      context-switches          #    0.080 K/sec                    ( +-  1.37% )
                 3      cpu-migrations            #    0.001 K/sec                    ( +- 18.22% )
            12,259      page-faults               #    0.002 M/sec                    ( +-  0.10% )
    10,089,754,343      cycles                    #    1.987 GHz                      ( +-  0.25% )  (50.04%)
    29,516,000,227      instructions              #    2.93  insn per cycle           ( +-  0.04% )  (50.08%)
     5,662,629,032      branches                  # 1115.256 M/sec                    ( +-  0.02% )  (50.08%)
         1,955,729      branch-misses             #    0.03% of all branches          ( +-  0.88% )  (50.04%)

            5.0796 +- 0.0128 seconds time elapsed  ( +-  0.25% )
```

After:
```
          5,017.77 msec task-clock                #    1.001 CPUs utilized            ( +-  0.19% )
               400      context-switches          #    0.080 K/sec                    ( +-  3.09% )
                 4      cpu-migrations            #    0.001 K/sec                    ( +- 46.91% )
            12,240      page-faults               #    0.002 M/sec                    ( +-  0.37% )
     9,960,189,535      cycles                    #    1.985 GHz                      ( +-  0.19% )  (50.02%)
    29,467,149,773      instructions              #    2.96  insn per cycle           ( +-  0.11% )  (50.03%)
     5,661,074,219      branches                  # 1128.206 M/sec                    ( +-  0.02% )  (50.07%)
         2,032,712      branch-misses             #    0.04% of all branches          ( +-  1.35% )  (50.07%)

            5.0151 +- 0.0101 seconds time elapsed  ( +-  0.20% )
```

1.2% cycles win, outside the noise.
0.16% instruction count win, barely outside the noise.

I am surprised at the size of the cycles win.

Reviewed By: bhosmer

Differential Revision: D26564192

fbshipit-source-id: 71f731ba54ec1cb407673db691eaf77a257de4a9
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021