
Set up per-operator input database, per-operator microbenchmarking #785

Merged: 16 commits into pytorch:main, Aug 25, 2022

Conversation

eellison (Contributor)

Introduces a benchmarking mode, toggled with --log-operator-inputs, that runs models and serializes the operators invoked along with their inputs and call frequency. Example usage:

python benchmarks/runner.py --suites=torchbench --training --dtypes=float16 --output=/scratch/eellison/work/torchdynamo/benchmarks/bench_logs/torchbench_train/ --log-operator-inputs

The outputs for torchbench, timm, and huggingface have been included in this PR as .zip files.

Here are the operators and call counts for torchbench, huggingface, and timm.

Also introduces a microbenchmark script to compare operator performance against eager and NVFuser:

Example usage (just running single input for now):
python ./benchmarks/microbenchmarks/operatorbench.py --op=aten.avg_pool2d.default --dtype=float16 --suite=timm

INFO torchinductor.scheduler: RUN buf0
INFO torchinductor.codegen.triton: codegen numel=s0*s1*IndexingDiv(s2, 2)**2 reduction_numel=1 nodes=1
INFO torchinductor.scheduler: NEW KERNEL
Perf for aten.avg_pool2d.default torch.float16 w/cudagraphs
JIT NVFuser speedup over aten 1.0
Inductor speedup over aten 1.1039353900823776

Follow-ups: do a sweep over the operators we are slow on, and prioritize lowerings that are invoked more frequently.

@@ -1616,6 +1631,44 @@ def main(runner, original_dir=None):
print_summary(output_filename)


def log_operator_inputs(model, example_inputs, model_iter_fn, name, args):
output_split = args.output.split("/")
Contributor:

Use os.path for filename manipulation.
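A minimal sketch of the os.path-based alternative being suggested; the helper name and the example path are assumptions, not the PR's code:

```python
import os

def split_output_path(output: str):
    # Hypothetical replacement for output.split("/"): portable and robust
    # to trailing separators.
    output = os.path.normpath(output)
    return os.path.dirname(output), os.path.basename(output)

parent, leaf = split_output_path("/scratch/bench_logs/torchbench_train/")
# parent == "/scratch/bench_logs", leaf == "torchbench_train"
```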

model_iter_fn(model, example_inputs, collect_outputs=False)
except Exception as e2:
print(f"{name} failed to run with real. Exception: {e2}")
raise e2
Contributor:

re-raise exception

Suggested change:
- raise e2
+ raise
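(A bare raise re-raises the active exception unchanged, with its original traceback intact; it is the idiomatic way to propagate an exception from an except block.)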

from torch.utils._pytree import tree_flatten
from torch.utils._pytree import tree_map

OP_INP_DIRECTORY = os.path.dirname(__file__) + "/operator_inp_logs/"
Contributor:

os.path.join()
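i.e., a sketch of the same constant built with os.path.join instead of string concatenation:

```python
import os

# Equivalent to os.path.dirname(__file__) + "/operator_inp_logs/", without
# hardcoding the path separator.
OP_INP_DIRECTORY = os.path.join(os.path.dirname(__file__), "operator_inp_logs")
```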

@@ -275,7 +275,8 @@ def _gen_target(self, batch_size, device):
)

def compute_loss(self, pred):
return self.loss(pred, self.target)
# calling lift so modes enabled for forward/backward can handle self.target
return self.loss(pred, torch.ops.aten.lift_fresh_copy(self.target))
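For context on the comment above, a minimal sketch (not the PR's code) of why the lift matters when a `__torch_dispatch__` mode is active; `OpLogger` is a hypothetical input-recording mode:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class OpLogger(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print(func)  # every dispatched op, including the lift, lands here
        return func(*args, **(kwargs or {}))

target = torch.zeros(4)  # created before the mode was entered
with OpLogger():
    # Routing target through aten.lift_fresh_copy gives the active mode a
    # chance to intercept and handle the pre-existing tensor.
    lifted = torch.ops.aten.lift_fresh_copy(target)
```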
Contributor:
Will this affect performance measurements?

voznesenskym (Contributor) commented Aug 12, 2022

> The outputs for torchbench, timm, and huggingface have been included in this PR as .zip files.

Thought about this a little more: instead of storing the .zip files, can we regenerate the contents of the .zip from the runner each time? Storing the .zip in a git repo makes it a hard-to-deal-with black box (you can't diff it by line), and storing .zips that the runner doesn't itself produce means there is a hidden step: zipping the output.

So instead of:
A) Run runner on some machine
B) Commit .zip
C) Load zip
D) operatorbench

We do:

A) Run operatorbench
B) as a detail of operatorbench it runs the runner
C) Use the output of the runner as input into operatorbench

What do you think @eellison ?

ezyang (Contributor) commented Aug 12, 2022

I didn't really read the PR, but you could also unzip the zip before checking it in lol

eellison (Contributor, Author)

> Thought about this a little more: instead of storing the .zip files, can we regenerate the contents of the .zip from the runner each time?

It takes far too long to generate the inputs, even with fake tensors, for this to really make sense. You don't want to wait multiple minutes every time you run a script that tests the performance of an operator-lowering change against recorded inputs.

> Storing the .zip in a git repo makes it a hard-to-deal-with black box (you can't diff it by line)

I'm not sure this is really an issue, since no one is going to compare the 6 megabytes of operator inputs from TIMM line by line before and after some change.

> but you could also unzip the zip before checking it in lol

This would be over 10 MB for the three files, which seems wasteful when the whole repo is ~3 MB (as opposed to ~0.4 MB compressed).

I don't know if anyone else has strong thoughts here; the changes discussed are pretty minimal, so we can always land this and adjust as we use it more.

if isinstance(i, (torch.memory_format, torch.storage.UntypedStorage)):
return True
# TODO: serialize/deserialize sparse arguments
if isinstance(i, torch.Tensor) and i.is_sparse:
Contributor (Author):

cc @ezyang, any idea how hard this is?

Contributor:

Well, given that you wrote the json format yourself, pretty easy. You'll need to say how many sparse and dense dims, and maybe nnz and coalesced if you want to get frisky
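A sketch of the metadata that implies recording; the dict layout is an assumption, not the PR's format:

```python
import torch

def sparse_tensor_meta(t: torch.Tensor) -> dict:
    # The fields named above: sparse/dense dims, plus nnz and coalesced.
    assert t.is_sparse
    return {
        "size": list(t.size()),
        "sparse_dim": t.sparse_dim(),
        "dense_dim": t.dense_dim(),
        "nnz": t._nnz(),
        "coalesced": t.is_coalesced(),
        "dtype": str(t.dtype),
    }

print(sparse_tensor_meta(torch.eye(3).to_sparse()))
```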

ezyang (Contributor) commented Aug 15, 2022

The gists are too big for github to load 😂

ezyang (Contributor) commented Aug 15, 2022

Elias, something that's not clear from the PR description: the input database is metadata only, right? If so, I think we should design a compact text format for describing this sort of metadata; something like Python code you could eval() to inflate the tensors would be a pretty good start. Then we should feel pretty comfortable with checking these in as plaintext; they're basically like OpInfo sample inputs but machine generated.
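A minimal sketch of what such an eval()-able format could look like, using the T(...) shorthand that appears in sample entries later in this thread; the helper itself is an assumption:

```python
import torch

f16 = torch.float16  # dtype abbreviation used in the serialized entries

def T(size, dtype, stride=None, device="cpu"):  # "cuda" when benchmarking
    # Inflate a tensor from recorded metadata: sizes, dtype, optional strides.
    if stride is not None:
        return torch.empty_strided(size, stride, dtype=dtype, device=device)
    return torch.empty(size, dtype=dtype, device=device)

# A recorded entry is then just Python source to eval():
entry = "((T([128, 512, 512], f16, stride=(262144, 1, 512)), T([128, 512, 64], f16)), {})"
args, kwargs = eval(entry)
```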

eellison requested a review from ezyang on August 17, 2022.
ezyang (Contributor) commented Aug 18, 2022

It would be good to get PR feedback from the folks who would also be using the microbenchmarks.

jansel (Contributor) commented Aug 18, 2022

It would be useful to see some data generated from this. What ops are we the slowest on?

It also might make sense to filter out view ops. Views should be represented in the strides of inputs to other ops.
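A quick illustration of the view point, in terms of the T(...) stride notation used elsewhere in the thread:

```python
import torch

x = torch.randn(4, 8)
y = x.t()                    # a view: no data movement, only swapped strides
print(y.size(), y.stride())  # torch.Size([8, 4]) (1, 8)

# A downstream op recorded with input T([8, 4], f32, stride=(1, 8)) already
# captures the transpose, so logging aten.t itself adds no information.
```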

torch.jit.trace(gm, gm_args), gm_args, copy_outputs=False
)

repeats = 3
Contributor:

Can we just have a fast correctness-checking mode? It would improve our test coverage, and may also help us identify whether any of those model accuracy errors are real.

Contributor (Author):

Will do as a follow-up. When I looked into this previously, there were a lot of NaN-handling errors, and the details of what PyTorch guarantees there are a bit tricky (will do non-empty input generation to avoid the NaN errors).
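A sketch of what such a check might look like; compiled_fn, the tolerances, and the equal_nan choice (to sidestep the NaN-handling discrepancies mentioned above) are all assumptions:

```python
import torch

def check_op(eager_fn, compiled_fn, args, kwargs, atol=1e-3, rtol=1e-3):
    expected = eager_fn(*args, **kwargs)
    actual = compiled_fn(*args, **kwargs)
    # equal_nan=True treats NaN as equal to NaN, so NaN-propagation
    # differences do not register as accuracy failures.
    return torch.allclose(actual, expected, atol=atol, rtol=rtol, equal_nan=True)
```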

Chillee (Contributor) commented Aug 18, 2022

Might be useful to get an operator count minus the ones we're already decomposing.
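A sketch of that filter, assuming the recorded counts are keyed by operator overload and that the decomposition table lives at torch._decomp.decomposition_table:

```python
from collections import Counter

from torch._decomp import decomposition_table  # location is an assumption

def undecomposed_counts(op_counts: Counter) -> Counter:
    # Keep only ops without a registered decomposition.
    return Counter(
        {op: n for op, n in op_counts.items() if op not in decomposition_table}
    )
```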

eellison (Contributor, Author)

> It would be useful to see some data generated from this. What ops are we the slowest on?

Just opened https://github.com/pytorch/torchdynamo/issues/922 and pytorch/pytorch#93636 (still need to benchmark torchbench ops).

> It also might make sense to filter out view ops

Yeah, I filter those out, along with constructors, in non_compute_operator.

ngimel (Contributor) commented Aug 19, 2022

For benchmarking we'd also need strides, not just sizes; different strides can result in completely different perf.

eellison (Contributor, Author) commented Aug 19, 2022

@ngimel those are being recorded; see the output. When the tensors are contiguous we omit serializing the strides, otherwise we serialize them: cnt: 12, ((T([128, 512, 512], f16, stride=(262144, 1, 512)), T([128, 512, 64], f16)), {})
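A sketch of that contiguity-based omission (not the PR's actual serializer); the dtype abbreviations are assumed:

```python
import torch

DTYPE_ABBRS = {torch.float16: "f16", torch.float32: "f32", torch.float64: "f64"}

def serialize_tensor(t: torch.Tensor) -> str:
    s = f"T({list(t.shape)}, {DTYPE_ABBRS.get(t.dtype, t.dtype)}"
    if not t.is_contiguous():
        s += f", stride={tuple(t.stride())}"  # only record non-default strides
    return s + ")"

print(serialize_tensor(torch.randn(4, 8).t()))  # T([8, 4], f32, stride=(1, 8))
```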

ngimel (Contributor) commented Aug 22, 2022

> @ngimel those are being recorded; see the output. When the tensors are contiguous we omit serializing the strides, otherwise we serialize them: cnt: 12, ((T([128, 512, 512], f16, stride=(262144, 1, 512)), T([128, 512, 64], f16)), {})

Cool, I'd seen a few cases w/o strides, so I didn't notice they were recorded when needed.

g.output(node)

gm = torch.fx.GraphModule({}, g)
gm, gm_inps = gen_gm_and_inputs(target, args, kwargs)
Contributor:

CheckEachNode is called with python_key, which is being removed. You'll need to update this if we want to regenerate the data in the future. I am OK with leaving it as is for this commit.
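For context, a sketch of what a gen_gm_and_inputs-style helper might do, wrapping a single op in an FX graph; this is a guess at the helper's shape, not the PR's code:

```python
import torch
import torch.fx as fx

def gen_gm_and_inputs(target, args, kwargs):
    g = fx.Graph()
    graph_args, inps = [], []
    for i, arg in enumerate(args):
        if isinstance(arg, torch.Tensor):
            # Tensors become graph placeholders; everything else is baked in.
            graph_args.append(g.placeholder(f"arg_{i}"))
            inps.append(arg)
        else:
            graph_args.append(arg)
    node = g.call_function(target, tuple(graph_args), kwargs)
    g.output(node)
    return fx.GraphModule({}, g), inps

gm, gm_inps = gen_gm_and_inputs(
    torch.ops.aten.add.Tensor, (torch.randn(2), torch.randn(2)), {}
)
```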

desertfire (Contributor) left a comment:

LGTM. I already saw you filing issues found through this approach, so it would be valuable to have the PR in soon and work on follow-ups if needed.

eellison merged commit cf26dca into pytorch:main on Aug 25, 2022.