
Merge from the main branch#796

Merged
desertfire merged 44 commits into wconstab/ltc from ltc_merge
Mar 16, 2022

Conversation

@desertfire
Contributor

No description provided.

xuzhao9 and others added 30 commits January 21, 2022 17:24
Summary:
Without CUDA Graph (with batch size 32):
```
$ python run.py resnet50 -t train -d cuda
Running train method from resnet50 on cuda in eager mode.
GPU Time:            1034.030 milliseconds
CPU Dispatch Time:   1026.865 milliseconds
CPU Total Wall Time: 1034.011 milliseconds
```

With CUDA Graph (with batch size 32):
```
$ python run.py resnet50 -t train -d cuda --train_cudagraph
Running train method from resnet50 on cuda in eager mode.
GPU Time:            1038.927 milliseconds
CPU Dispatch Time:   346.313 milliseconds
CPU Total Wall Time: 1038.941 milliseconds
```

# Latency by batch size (Train, fp32, on V100)


Batch Size | Eager (ms) | CUDA Graph (ms) | Speedup
-- | -- | -- | --
1 | 89.033 | 50.233 | 43.58%
2 | 92.854 | 56.13 | 39.55%
4 | 93.676 | 71.465 | 23.71%
8 | 105.099 | 105.381 | -0.27%
16 | 167.292 | 167.966 | -0.40%
32 | 297.315 | 297.989 | -0.23%
64 | 561.262 | 562.029 | -0.14%
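The Speedup column is the relative latency reduction of CUDA Graph versus eager mode; a small sketch (the helper name is illustrative, not part of the benchmark suite) reproduces the table's numbers:

```python
# Reproduces the Speedup column above: relative latency reduction of
# CUDA Graph vs. eager mode, as a percentage.
def speedup_pct(eager_ms: float, cudagraph_ms: float) -> float:
    return (eager_ms - cudagraph_ms) / eager_ms * 100
```

For batch size 1, (89.033 - 50.233) / 89.033 * 100 ≈ 43.58%; at batch size 8 and above the difference is within noise.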

Pull Request resolved: #706

Reviewed By: ngimel

Differential Revision: D33720966

Pulled By: xuzhao9

fbshipit-source-id: 8d422d597a879488d14361466172d32a1eeb1f19
Summary:
This PR fixes a few bugs in v2 bisection workflow. It also updates the new V2 config file to use the latest reference run results.

Pull Request resolved: #709

Reviewed By: erichan1

Differential Revision: D33747367

Pulled By: xuzhao9

fbshipit-source-id: 64ec3f967ea0efba8e44cff4802ed761c58b6113
Summary:
This fixes #710. The Dev Infra team is migrating off RDS, so we are removing the code related to RDS uploading.

Pull Request resolved: #712

Reviewed By: erichan1

Differential Revision: D33777915

Pulled By: xuzhao9

fbshipit-source-id: 35e1a97d2286d0ef236fdaa5e1693765093c69ea
Summary:
These files are not used, but they cause Dependabot to complain about deprecated numpy versions (such as #714 and #713).

Pull Request resolved: #716

Reviewed By: aaronenyeshi

Differential Revision: D33781655

Pulled By: xuzhao9

fbshipit-source-id: 85dd00d85a21ca0d1b1da68bd569b143f133487a
Summary: Pull Request resolved: #719

Reviewed By: xuzhao9

Differential Revision: D33806240

Pulled By: jansel

fbshipit-source-id: 389f0ffe6bfa18cb996720bdc89a9837b75fa5fc
Summary:
Use `os.path.realpath()` to get the absolute path of the current file.
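As a small illustration of why `realpath` is used here (the helper name is hypothetical): `os.path.realpath()` resolves symlinks as well as relative segments, while `os.path.abspath()` only normalizes the path string, so a symlinked checkout would otherwise report the wrong directory:

```python
import os

def current_file_dir(file_path: str) -> str:
    # realpath resolves any symlinks before taking the dirname, so the
    # returned directory is the true on-disk location of the file
    return os.path.dirname(os.path.realpath(file_path))
```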

Pull Request resolved: #721

Reviewed By: jansel

Differential Revision: D33829270

Pulled By: xuzhao9

fbshipit-source-id: b07ab6e5190b02f0ea0a685fa38965fad196c320
Summary:
Use `py38` and `cu113` as the default version to generate nightly configs.
This PR also fixes a bug in the abtest config generation script, which used `git_version` instead of the master commit hash.

Pull Request resolved: #718

Reviewed By: erichan1

Differential Revision: D33797706

Pulled By: xuzhao9

fbshipit-source-id: 4cfe45f8ddc13e84683ac0dc6491c8b229393d93
Summary:
When there is no anomaly detected, we should remove the "tests" and "details" keys from the dict.

Pull Request resolved: #728

Reviewed By: erichan1

Differential Revision: D33899737

Pulled By: xuzhao9

fbshipit-source-id: c3377c74bcc4e3afb7a8f36ad994915f2933d2e4
Summary:
This PR adds the testing linux.2xlarge runner to pytorch/benchmark.

Pull Request resolved: #729

Reviewed By: seemethere

Differential Revision: D33905554

Pulled By: xuzhao9

fbshipit-source-id: 1034b4a2c50e467d708884de03a324a4fcb1f1f7
Summary: Pull Request resolved: #722

Reviewed By: wconstab

Differential Revision: D33847275

Pulled By: jansel

fbshipit-source-id: 3bc63248fbfbab1d61e234e101bd6b5c5e8faf40
Summary:
Change to train_bs in training section.

Pull Request resolved: #731

Reviewed By: xuzhao9

Differential Revision: D33936661

Pulled By: erichan1

fbshipit-source-id: 6b8729234691e21a8dfd86d3deb58d4a147f00b2
Summary:
This PR enables torch_tensorrt (https://github.com/NVIDIA/Torch-TensorRT) on torchvision and timm models. It works on all timm models except timm_nfnet (pytorch/TensorRT#849), but it doesn't work on any of the torchvision models yet. Still looking into the root cause.

Run with command:
`python run.py timm_efficientnet -d cuda -t eval --torch_tensorrt` returns:
```
GPU Time:             19.196 milliseconds
CPU Dispatch Time:    10.182 milliseconds
CPU Total Wall Time:  19.181 milliseconds
```

`python run.py mnasnet1_0 -d cuda -t eval --torch_tensorrt` returns:

```
:Running eval method from mnasnet1_0 on cuda in eager mode.
Traceback (most recent call last):
  File "run.py", line 192, in <module>
    m = Model(device=args.device, jit=(args.mode == "jit"), extra_args=extra_args)
  File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/mnasnet1_0/__init__.py", line 8, in __init__
    super().__init__(model_name="mnasnet1_0", device=device, jit=jit,
  File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/framework/vision/model_factory.py", line 30, in __init__
    apply_args(self, self.args)
  File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/framework/vision/args.py", line 43, in apply_args
    model.eval_model = enable_tensortrt(model.eval_example_inputs, args.eval_fp16, model.eval_model)
  File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/framework/vision/args.py", line 55, in enable_tensortrt
    return torch_tensorrt.compile(eval_model, inputs=trt_input, enabled_precisions=enabled_precisions)
  File "/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torch_tensorrt/_compile.py", line 115, in compile
    return torch_tensorrt.ts.compile(ts_mod, inputs=inputs, enabled_precisions=enabled_precisions, **kwargs)
  File "/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torch_tensorrt/ts/_compiler.py", line 119, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError:
temporary: the only valid use of a module is looking up an attribute but found  = prim::SetAttr[name="num_batches_tracked"](%_15, %1440)
```

Pull Request resolved: #732

Reviewed By: yinghai

Differential Revision: D33986641

Pulled By: xuzhao9

fbshipit-source-id: 54d15a977b2b04dc5358f25f9cc2609554744944
Summary:
This PR adds a new argument, 'test', which can be 'train' or 'eval', to the model initialization function. It has the following benefits:
- At model initialization, we can choose to initialize only the "train" or "eval" dataset, to avoid wasting device memory.
- It separates the argument spaces. Train and eval tests commonly have different experimental features to apply; for example, optimize_for_inference and tensorrt only work for eval.
- It adds an "extra_args" argument to the model, to enable more optional features in the future.
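A minimal sketch of the constructor shape this describes; the class, attribute, and loader names are illustrative assumptions, not TorchBench's actual API:

```python
from typing import List, Optional

class BenchmarkModel:
    # Illustrative stand-in for the initialization interface described above.
    def __init__(self, test: str, device: str = "cpu",
                 extra_args: Optional[List[str]] = None):
        if test not in ("train", "eval"):
            raise ValueError(f"test must be 'train' or 'eval', got {test!r}")
        self.test = test
        self.device = device
        self.extra_args = list(extra_args or [])
        # Only build the dataset the requested test needs, so device memory
        # is not wasted on the unused one.
        self.example_inputs = (self._train_data() if test == "train"
                               else self._eval_data())

    def _train_data(self):
        return ["train-batch"]

    def _eval_data(self):
        return ["eval-batch"]
```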

Pull Request resolved: #735

Reviewed By: erichan1

Differential Revision: D34033140

Pulled By: xuzhao9

fbshipit-source-id: 449a963b1ee7ef450b0a5ccb24718d394dd8be40

Summary:
This PR removes two models, maml and maml_omiglot. We are working on another round of enhancing models with tensorrt and other features; these two models are low quality with poor GPU utilization, so they are not worth the effort, and I am removing them for now.

Pull Request resolved: #736

Reviewed By: erichan1

Differential Revision: D34018844

Pulled By: xuzhao9

fbshipit-source-id: 70da2086bf9782a351f9f8e79238d4d2029229ad
Summary:
Pull Request resolved: pytorch/pytorch#72499

Pull Request resolved: #740

Move fx2trt out of tree to reduce bloat in PyTorch core.

It's the first and major step. Next, we will move acc_tracer out of the tree and rearrange some fx passes.

Reviewed By: suo

Differential Revision: D34065866

fbshipit-source-id: c72b7ad752d0706abd9a63caeef48430e85ec56d
Summary:
This PR adds a new `run_sweep.py` script to TorchBench. It runs all specified model tests in a subprocess worker and applies optional arguments to them. Currently only batch size is supported; we will support `fx2trt` and `torch_tensorrt` in a follow-up PR.

Pull Request resolved: #742

Reviewed By: erichan1

Differential Revision: D34134739

Pulled By: xuzhao9

fbshipit-source-id: f798f415f976d7e3deee7ffed35f4cbfdb7487f0
Summary:
Still, fp16 is not supported. I will try to address the fp16 issue in a follow-up PR.

Sweep all models with command:
`python run_sweep.py -d cuda -t eval --fx2trt`

Result:
```
Running model BERT_pytorch ... [ TypeError ]
Running model Background_Matting ... [ NotImplemented ]
Running model LearningToPaint ... [ TypeError ]
Running model Super_SloMo ... [ NameError ]
Running model alexnet ... [ OK ]
Running model attention_is_all_you_need_pytorch ... [ TypeError ]
Running model dcgan ... [ OK ]
Running model demucs ... [ NameError ]
Running model densenet121 ... [ NotImplemented ]
Running model detectron2_maskrcnn ... [ UnserializableException ]
Running model dlrm ... [ RuntimeError ]
Running model drq ... [ UnserializableException ]
Running model fastNLP_Bert ... [ UnserializableException ]
Running model hf_Albert ... [ ValueError ]
Running model hf_Bart ... [ TypeError ]
Running model hf_Bert ... [ ValueError ]
Running model hf_BigBird ... [ ValueError ]
Running model hf_DistilBert ... [ ValueError ]
Running model hf_GPT2 ... [ ValueError ]
Running model hf_Longformer ... [ ValueError ]
Running model hf_Reformer ... [ ValueError ]
Running model hf_T5 ... [ TypeError ]
Running model mnasnet1_0 ... [ OK ]
Running model mobilenet_v2 ... [ OK ]
Running model mobilenet_v2_quantized_qat ... [ RuntimeError ]
Running model mobilenet_v3_large ... [ OK ]
Running model moco ... [ UnserializableException ]
Running model nvidia_deeprecommender ... [ OK ]
Running model opacus_cifar10 ... [ UnserializableException ]
Running model pyhpc_equation_of_state ... [ TypeError ]
Running model pyhpc_isoneutral_mixing ... [ TypeError ]
Running model pyhpc_turbulent_kinetic_energy ... [ TypeError ]
Running model pytorch_CycleGAN_and_pix2pix ... [ NotImplemented ]
Running model pytorch_stargan ... [ UnserializableException ]
Running model pytorch_struct ... [ NotImplemented ]
Running model pytorch_unet ... [ AttributeError ]
Running model resnet18 ... [ OK ]
Running model resnet50 ... [ OK ]
Running model resnet50_quantized_qat ... [ RuntimeError ]
Running model resnext50_32x4d ... [ OK ]
Running model shufflenet_v2_x1_0 ... [ OK ]
Running model soft_actor_critic ... [ UnserializableException ]
Running model speech_transformer ... [ TypeError ]
Running model squeezenet1_1 ... [ OK ]
Running model tacotron2 ... [ NotImplemented ]
Running model timm_efficientnet ... [ OK ]
Running model timm_nfnet ... [ UnserializableException ]
Running model timm_regnet ... [ UnserializableException ]
Running model timm_resnest ... [ OK ]
Running model timm_vision_transformer ... [ UnserializableException ]
Running model timm_vovnet ... [ UnserializableException ]
Running model tts_angular ... [ UnserializableException ]
Running model vgg16 ... [ OK ]
Running model vision_maskrcnn ... [ UnserializableException ]
Running model yolov3 ... [ UnserializableException ]
```

Pull Request resolved: #743

Reviewed By: yinghai

Differential Revision: D34151603

Pulled By: xuzhao9

fbshipit-source-id: 049d090584b2fc85499a214d96ae34c41d0a1c8e
Summary:
Run command:
```
python run.py mnasnet1_0 -d cuda -t eval
GPU Time:             12.160 milliseconds
CPU Dispatch Time:    12.070 milliseconds
CPU Total Wall Time:  12.184 milliseconds
```

```
python run.py mnasnet1_0 -d cuda -t eval --nvfuser fuser1
GPU Time:             11.600 milliseconds
CPU Dispatch Time:    11.305 milliseconds
CPU Total Wall Time:  11.604 milliseconds
```
```
python run.py mnasnet1_0 -d cuda -t eval --nvfuser fuser2
GPU Time:             11.609 milliseconds
CPU Dispatch Time:    11.377 milliseconds
CPU Total Wall Time:  11.610 milliseconds
```

Pull Request resolved: #744

Reviewed By: davidberard98

Differential Revision: D34107295

Pulled By: xuzhao9

fbshipit-source-id: 59e5d8f5d90484eb7ca744ebd5e486a21fbd0bdb
Summary:
This PR is reverting #736 because we prematurely removed `maml` and `maml_omiglot` models without adding proper replacements.

These models will later be replaced with a higher quality implementation from fewshot (https://github.com/oscarknagg/few-shot)

Pull Request resolved: #752

Reviewed By: jansel

Differential Revision: D34217403

Pulled By: xuzhao9

fbshipit-source-id: ee6ae9a553f218a8a378b491e0ddcce0cca5e965

Summary:
This PR enables [torch_trt](https://github.com/NVIDIA/Torch-TensorRT) module on all the models.
Currently the library will segfault on some models, and the subprocess_worker needs to correctly handle that (otherwise, it will just hang forever because it is blocked by `os.read()` on a pipe whose input process is dead).

We introduce the following mechanism to handle subprocess segfault:
1. The `Pipe` class stores the pid of the child process if the pipe is reading from the child process.
2. When the pipe reads, it always creates a thread that periodically checks the status of the process at the other end. If that process dies or becomes a zombie, the thread writes a special string, `_DEAD`, into the pipe, together with the exit code.
3. The main thread checks the return message in the pipe; if it finds the `_DEAD` message, it throws an exception, which is handled in `subprocess_worker`.

Pull Request resolved: #754

Reviewed By: robieta

Differential Revision: D34263322

Pulled By: xuzhao9

fbshipit-source-id: 59fc6858d3ea498c3137f406f7b3843f70316d83
Summary:
This PR fixes the following workflow failures:

https://github.com/pytorch/benchmark/actions/runs/1847380869 fails because `git lfs` updated and needs to overwrite the hook; adding the `--force` option works around the failing command.
https://github.com/pytorch/benchmark/actions/runs/1837003034 fails because there are (unexpectedly) two pytorch nightly releases on the same day. In this case, use the one with the higher version number.

Pull Request resolved: #756

Reviewed By: erichan1

Differential Revision: D34286177

Pulled By: xuzhao9

fbshipit-source-id: c66b437363e8215a54a550aa353d608a39a67e18

Summary: Pull Request resolved: #758

Reviewed By: frank-wei

Differential Revision: D34294547

Pulled By: xuzhao9

fbshipit-source-id: 5c2834ffc11ed64f82489c56a290c3b9c569a24a
Summary:
Currently the CI fails with "ModuleNotFoundError: No module named 'pygame'".
The reason seems to be a recent update to gym, see openai/gym#2634

Pull Request resolved: #766

Reviewed By: xuzhao9

Differential Revision: D34423163

Pulled By: kit1980

fbshipit-source-id: b4dc3ceb32c80d9e7d86c828441c6ddb28320e51
Summary:
We need to remove the current jit code from each model directory and use a unified entry point for all the transformations. This is because if we do the jit script first, then change the precision to fp16, the CI test will fail with this error: https://app.circleci.com/pipelines/github/pytorch/benchmark/3665/workflows/da928033-03fa-48d0-90a4-788d3ee794ed/jobs/3771

However, I noticed different models are using different `torch.jit` APIs: 1) `torch.jit.script(model)`, 2) `torch.jit.script(model, example_inputs)`, 3) `torch.jit.trace(model, example_inputs)` 4) an extra `torch.jit.optimize_for_inference()` for inference.  Which one should I use if we are sharing the jit scripting code for all the models? Krovatkin

The current design is to use `torch.jit.trace(model, example_inputs)` by default. Models that need a different `torch.jit` API (like nvidia_deeprecommender), or that need to script multiple `torch.nn.Module` instances, should add a callback function, `jit_callback(self)`, to handle the JIT enablement.
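The dispatch this describes can be sketched as below; `trace_fn` stands in for the default `torch.jit.trace` call, and the attribute names (`model`, `example_inputs`) are assumptions for illustration:

```python
def enable_jit(model_instance, trace_fn):
    """Apply JIT to a benchmark model: use the model's own jit_callback()
    when it defines one, otherwise trace with the default API."""
    if hasattr(model_instance, "jit_callback"):
        # Models needing a different torch.jit API, or scripting several
        # nn.Modules, opt out of the default path here.
        model_instance.jit_callback()
        return
    model_instance.model = trace_fn(model_instance.model,
                                    model_instance.example_inputs)
```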

Pull Request resolved: #761

Reviewed By: davidberard98

Differential Revision: D34396461

Pulled By: xuzhao9

fbshipit-source-id: b51ef60b8ee28c0bd910404d549cfb4a75c0ae28
Summary:
This PR prepares adding the correctness checking code to eval tests:

1. Each `eval()` function now returns `Tuple[torch.Tensor]`, i.e., the inference result
2. Add a test to check 1) is true for every model
3. change `run_sweep.py` to prepare for the correctness checking

A follow-up PR is #763, which adds the actual correctness calculation code.

Pull Request resolved: #762

Reviewed By: wushirong

Differential Revision: D34438166

Pulled By: xuzhao9

fbshipit-source-id: b876795485d5942727c3f3dad6ec44eef3250678
Summary:
This PR adds the correctness testing code for TensorRT using cosine similarities.

Example command and output on A100:
```
$ python run.py resnet18 -d cuda -t eval --fx2trt
GPU Time:              0.613 milliseconds
CPU Dispatch Time:     2.319 milliseconds
CPU Total Wall Time:   2.647 milliseconds
Correctness:            0.999990403652191
$ python run.py resnet18 -d cuda -t eval --fx2trt --no-fp16
GPU Time:              0.929 milliseconds
CPU Dispatch Time:     2.295 milliseconds
CPU Total Wall Time:   2.926 milliseconds
Correctness:           0.999999642372131
$ python run.py alexnet -d cuda -t eval --fx2trt
GPU Time:              0.582 milliseconds
CPU Dispatch Time:     2.338 milliseconds
CPU Total Wall Time:   2.646 milliseconds
Correctness:            1.000000000000000
$ python run.py alexnet -d cuda -t eval --fx2trt --no-fp16
GPU Time:              0.885 milliseconds
CPU Dispatch Time:     2.352 milliseconds
CPU Total Wall Time:   2.937 milliseconds
Correctness:            1.000000000000000
$ python run.py mobilenet_v3_large -d cuda -t eval --fx2trt
GPU Time:              1.695 milliseconds
CPU Dispatch Time:     4.424 milliseconds
CPU Total Wall Time:   5.561 milliseconds
Correctness:            0.999975979328156
$ python run.py mobilenet_v3_large -d cuda -t eval --fx2trt --no-fp16
GPU Time:              3.241 milliseconds
CPU Dispatch Time:     3.069 milliseconds
CPU Total Wall Time:   5.590 milliseconds
Correctness:            0.999904215335846
```
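The correctness number is a cosine similarity between the reference and accelerated outputs; a stand-in over flat Python lists (the real check operates on torch tensors) looks like:

```python
import math

def cosine_correctness(reference, candidate):
    # Cosine similarity: 1.0 for identical outputs, slightly below 1.0 when
    # fp16/TensorRT round-off perturbs the result, matching the ~0.9999
    # values printed above.
    dot = sum(a * b for a, b in zip(reference, candidate))
    norm_ref = math.sqrt(sum(a * a for a in reference))
    norm_cand = math.sqrt(sum(b * b for b in candidate))
    return dot / (norm_ref * norm_cand)
```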

Pull Request resolved: #763

Reviewed By: frank-wei

Differential Revision: D34438175

Pulled By: xuzhao9

fbshipit-source-id: c309009d9676628aa693e0037ee5068ee1a15c76
Summary:
This PR cleans up the timm model code to use the same code entry point (`model_factory.py`), making it easier to make changes or add experimental features.

It also cleans up the code related to setting up random seeds, so that all models share the same code path as part of initialization.

Pull Request resolved: #772

Reviewed By: erichan1

Differential Revision: D34524605

Pulled By: xuzhao9

fbshipit-source-id: d8445ef0c5c66e9404616aeb67dc033ac27974b1
Summary:
In OnDemand CI, the script may load `torchbenchmark` module without pytorch installed (at the first time of running the bisection script).
This PR fixes a bug when it fails because `torch` package is not found.

Pull Request resolved: #770

Reviewed By: erichan1

Differential Revision: D34481528

Pulled By: xuzhao9

fbshipit-source-id: 53068747565b33e3d7c4615aa8ad0b3562e0c46d
Summary:
Pull Request resolved: #777

The core lowering component takes an fx.GraphModule and turns it into a lowered `nn.Module` (generally speaking), or more specifically into a `TRTModule` in the case of fx2trt.

```
[nn.Module, PassContext] -> [nn.Module, PassContext]
```

As a matter of fact, the above signature is just a general module transformation pass function we should have consolidated and used across our stack.

Today this involves two steps:

1. Run TRTInterpreter
2. Turn the TRTInterpreterResult into a TRTModule

We wrap it into the above pass function.

Why? This is one step towards making it possible to swap in a different fx -> trt implementation, e.g., torch-tensorrt (see [discussion](https://fb.workplace.com/groups/890926038157430/posts/1058116424771723/)).

Reviewed By: xuzhao9

Differential Revision: D34540677

fbshipit-source-id: 3c332767dcde0496df3096a66c5be9ddffd1bd7f
Summary:
This PR adds the first end-to-end workload, hf_bert, to the suite that:
- Supports both train and inference
- By default, uses `amp.autocast()` to do fp16 train/inference
- Currently reports latency and qps as performance metrics
- Doesn't support multi-GPU workloads yet (will be supported in the future)

To run the benchmark, use: `python run_e2e.py hf_bert -t eval --fp16 [no|amp]`. For example, on A100:
```
$ python run_e2e.py hf_bert -t eval
{"device": "cuda", "device_num": 1, "test": "eval", "num_examples": 1043, "batch_size": 1, "result": {"latency": 14.56970322, "qps": 71.58690772563314}}
$ python run_e2e.py hf_bert -t train
{"device": "cuda", "device_num": 1, "test": "train", "num_examples": 8576, "batch_size": 32, "result": {"latency": 36.95959081, "qps": 232.03720095514768}}
```
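The qps figures above are consistent with dividing the example count by the wall-clock latency:

```python
def qps(num_examples: int, latency_s: float) -> float:
    # throughput metric as reported in the run_e2e.py output above
    return num_examples / latency_s
```

For example, 1043 examples / 14.5697 s ≈ 71.59 qps for the eval run.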

Pull Request resolved: #771

Reviewed By: erichan1

Differential Revision: D34529471

Pulled By: xuzhao9

fbshipit-source-id: a9f8b43c9e4e4ff30dfd76c1c88fe3948976fbd2
xuzhao9 and others added 11 commits March 7, 2022 09:59
Summary:
This PR adds fp16 amp to all models. Basically, it adds an autocast context to all eval tests:
```
with torch.cuda.amp.autocast():
    eval()
```

I have the following concerns regarding the current amp mode:
1. Some models don't support it. Example: BERT_pytorch, attention_is_all_you_need_pytorch
Reproduction:
```
$ python run.py BERT_pytorch -d cuda --fp16 amp
Running eval method from BERT_pytorch on cuda in eager mode.
  File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/BERT_pytorch/bert_pytorch/model/attention/single.py", line 19, in forward
    scores = scores.masked_fill(mask == 0, -1e9)
RuntimeError: value cannot be converted to type at::Half without overflow
```

2. Some models don't return correct result. Example: dlrm, moco, pyhpc_turbulent_kinetic_energy
Reproduction:
```
$ python run.py pyhpc_turbulent_kinetic_energy -d cuda --fp16 amp
Running eval method from pyhpc_turbulent_kinetic_energy on cuda in eager mode.
GPU Time:              7.316 milliseconds
CPU Dispatch Time:     7.251 milliseconds
CPU Total Wall Time:   7.350 milliseconds
Correctness:            0.000000000000000
```
3. About 2/3 of the models slightly regress in performance in amp mode. Examples: squeezenet1_1, alexnet
Reproduction:
```
$ python run.py alexnet -d cuda --fp16 amp
Running eval method from alexnet on cuda in eager mode.
GPU Time:              1.475 milliseconds
CPU Dispatch Time:     1.305 milliseconds
CPU Total Wall Time:   1.509 milliseconds
Correctness:            0.999999880790710
```

```
$ python run.py alexnet -d cuda --fp16 no
Running eval method from alexnet on cuda in eager mode.
GPU Time:              1.095 milliseconds
CPU Dispatch Time:     0.994 milliseconds
CPU Total Wall Time:   1.126 milliseconds
```

The amp run is about 0.74x the speed of the fp32 run (1.475 ms vs. 1.095 ms GPU time).

Pull Request resolved: #776

Reviewed By: ejguan

Differential Revision: D34559508

Pulled By: xuzhao9

fbshipit-source-id: cf585aac5e5eaedbcdca9e8292420a8beae82481
Summary:
This PR adds two new types of runners to the repository: AWS V100 and A100.
The correctness testing CI file is tentative, and will be tested in a follow-up PR.

Pull Request resolved: #779

Reviewed By: ejguan

Differential Revision: D34691844

Pulled By: xuzhao9

fbshipit-source-id: 13ad882b4ed817546f3be6b43653c519b97aae7d
Summary:
This PR enables a subset of the HuggingFace models to run with fx2trt.
It also enables `fp16` (`half`) support for hf models, but it is not the default because the `hf_BigBird` model doesn't support half for now.

Supported: hf_Bert, hf_Albert, hf_GPT2, hf_DistilBert
Not supported: hf_Bart, hf_BigBird, hf_Longformer, hf_Reformer, hf_T5

An example error log of unsupported models:
```
Traceback (most recent call last):
  File "run.py", line 177, in <module>
    m = Model(device=args.device, test=args.test, jit=(args.mode == "jit"), batch_size=args.bs, extra_args=extra_args)
  File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 13, in __call__
    obj.__post__init__()
  File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/model.py", line 81, in __post__init__
    apply_args(self, self.extra_args)
  File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/extra_args.py", line 108, in apply_args
    model.set_module(enable_fx2trt(args.batch_size, fp16=args.fp16, model=module, example_inputs=exmaple_inputs,
  File "/fsx/users/xzhao9/benchmark/torchbenchmark/util/backends/fx2trt.py", line 63, in enable_fx2trt
    traced_model = hf_symbolic_trace(
  File "/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/transformers/utils/fx.py", line 565, in symbolic_trace
    raise NotImplementedError(
NotImplementedError: Model LongformerForMaskedLM is not supported yet, supported models: AlbertModel, AlbertForPreTraining, AlbertForMaskedLM, AlbertForMultipleChoice, AlbertForQuestionAnswering, AlbertForSequenceClassification, AlbertForTokenClassification, BertModel, BertForPreTraining, BertForNextSentencePrediction, BertForMaskedLM, BertLMHeadModel, BertForMultipleChoice, BertForQuestionAnswering, BertForSequenceClassification, BertForTokenClassification, DistilBertModel, DistilBertForMaskedLM, DistilBertForMaskedLM, DistilBertForMultipleChoice, DistilBertForQuestionAnswering, DistilBertForSequenceClassification, DistilBertForTokenClassification, MobileBertModel, MobileBertForPreTraining, MobileBertForNextSentencePrediction, MobileBertForMaskedLM, MobileBertForMultipleChoice, MobileBertForQuestionAnswering, MobileBertForSequenceClassification, MobileBertForTokenClassification, ElectraModel, ElectraForPreTraining, ElectraForMaskedLM, ElectraForMultipleChoice, ElectraForQuestionAnswering, ElectraForSequenceClassification, ElectraForTokenClassification, MegatronBertModel, MegatronBertForPreTraining, MegatronBertForNextSentencePrediction, MegatronBertForMaskedLM, MegatronBertForCausalLM, MegatronBertForMultipleChoice, MegatronBertForQuestionAnswering, MegatronBertForSequenceClassification, MegatronBertForTokenClassification, GPT2Model, GPT2LMHeadModel, GPT2LMHeadModel, GPT2ForSequenceClassification, GPT2ForTokenClassification, GPTJModel, GPTJForCausalLM, GPTJForSequenceClassification, GPTNeoModel, GPTNeoForCausalLM, GPTNeoForSequenceClassification, T5Model, T5ForConditionalGeneration, T5ForConditionalGeneration, GPT2DoubleHeadsModel
```

Pull Request resolved: #778

Reviewed By: frank-wei

Differential Revision: D34757194

Pulled By: xuzhao9

fbshipit-source-id: 017bb2f8050cb28c7e9de3ab77fd2107cbbe10e1

Summary:
When a test is flagged as "NotImplemented", there are actually two cases:
1. The test itself doesn't implement or handle the config, e.g., unsupervised-learning models like pytorch_struct don't have `eval()` tests, and the pyhpc models don't have `train()` tests.
2. The test doesn't support running on our T4 CI GPU machine, but runs fine on other GPUs, such as V100 or A100.

This PR is to eliminate the second case, so that the test can still run through the `run.py` or `run_sweep.py` interfaces. Instead, we flag the test as `not_implemented` in `metadata.yaml`, and the CI scripts `test.py` and `test_bench.py` will read the metadata and determine that it is not suitable to run on the CI machine.

This fixes #688, #626, and #598

Pull Request resolved: #781

Reviewed By: aaronenyeshi

Differential Revision: D34786277

Pulled By: xuzhao9

fbshipit-source-id: d5d3d884839345f4fcad21ccf541a02d8e705f5f
Summary:
This PR uses the `fuser` context manager to manage run contexts, replacing the old "hacky" implementation.
Because an instance generated by `contextlib.contextmanager` can only be used once, we pass in a lambda and instantiate a new instance every time we run the benchmark.

Pull Request resolved: #784

Reviewed By: davidberard98

Differential Revision: D34797602

Pulled By: xuzhao9

fbshipit-source-id: 95d46301c613b796e2b4c9aafc9e4b1a7fe6e59a
Summary:
This PR sets `fp16` to be the default precision of all huggingface models.
This PR also includes an extra patch to the transformers package, because `hf_BigBird` needs to be patched in order to support fp16.

I believe this patch should also be upstreamed: huggingface/transformers#16034

Pull Request resolved: #782

Reviewed By: frank-wei

Differential Revision: D34803502

Pulled By: xuzhao9

fbshipit-source-id: 3d46f7983aa32333b12af605f69e45f1fe3134d7
Summary:
Given the importance of the `get_module()` interface, we must implement it for every model.
This PR forces the implementation of the `get_module()` interface across all models, and properly implements it for the `Background_Matting` model.

Fixes #567.

Pull Request resolved: #785

Reviewed By: jansel

Differential Revision: D34804942

Pulled By: xuzhao9

fbshipit-source-id: 96708b9042a3fcf3e5f6c86c7cdfc5de0fbc3036
Summary:
To enable TorchBench on Python 3.9, we need to remove the version locks on the dependencies.
After removing the locks, installing TorchBench on Python 3.9 no longer requires LLVM.
Fixes #498

Pull Request resolved: #787

Reviewed By: jamesr66a

Differential Revision: D34826115

Pulled By: xuzhao9

fbshipit-source-id: 5d1d24328dba5bde2387f814f19dfed7f09df4a9
Summary:
Note that the CPU run is very slow, so it is disabled in the nightly run by metadata (https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/pytorch_CycleGAN_and_pix2pix/metadata.yaml#L10).
Still, users can run the CPU train or eval test with the `run.py` command:
```
$ python run.py pytorch_CycleGAN_and_pix2pix -d cpu -t [train|eval]
```
Fixes #788

Pull Request resolved: #790

Reviewed By: anijain2305

Differential Revision: D34832041

Pulled By: xuzhao9

fbshipit-source-id: 111622206fc82defa4641bcf03d82740e035bd01
Summary:
This PR enables CPU train/eval test on speech_transformer (for accuracy test).

Pull Request resolved: #791

Reviewed By: anijain2305

Differential Revision: D34836544

Pulled By: xuzhao9

fbshipit-source-id: 1e53fe02b118f9bfa81cff74fce7d5add94cc197
xuzhao9 and others added 3 commits March 15, 2022 17:41
Summary:
Recent torchtext API changes break the legacy code we use in the `attention_is_all_you_need_pytorch` and `pytorch_struct` models. Adding the removed functions back so that we can continue using them.

Pull Request resolved: #795

Reviewed By: erichan1

Differential Revision: D34874088

Pulled By: xuzhao9

fbshipit-source-id: bc31c26a187c88169a379f26a4dc4208382bb14e
@desertfire desertfire merged commit d7c681c into wconstab/ltc Mar 16, 2022
@desertfire desertfire deleted the ltc_merge branch March 16, 2022 18:47