TorchDynamo Performance DashBoard #93794

Closed
anijain2305 opened this issue Jul 29, 2022 · 249 comments
Labels
module: dynamo, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


anijain2305 commented Jul 29, 2022

Dashboard to track the performance of different backends.

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

@anijain2305 anijain2305 changed the title [WIP/TRIAL] Setting up Automatic Benchmarking Results Setting up Automatic Benchmarking Results Aug 9, 2022
@anijain2305 anijain2305 changed the title Setting up Automatic Benchmarking Results TorchDynamo Performance DashBoard Aug 10, 2022
@Chillee Chillee pinned this issue Aug 11, 2022
@anijain2305
Contributor Author

Compilation Profile

Each table below lists the 50 worst models for one metric.

Compilation Latency

dtype=float32, unit=seconds

+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
|    suite    |                  name                   | batch_size | pytorch | eager  | aot_eager | aot_nvfuser | inductor_cudagraphs |
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
| huggingface |          MobileBertForMaskedLM          |     16     |   0.0   | 67.728 |  78.263   |   139.766   |       426.422       |
| huggingface |     MobileBertForQuestionAnswering      |     32     |   0.0   | 66.85  |  78.347   |   138.941   |       521.547       |
| torchbench  |               densenet121               |     4      |   0.0   | 3.646  |   7.496   |   77.281    |       599.583       |
| torchbench  |       mobilenet_v2_quantized_qat        |     96     |   0.0   | 3.431  |   7.875   |   70.466    |        -2.83        |
| torchbench  |            timm_efficientdet            |     1      |   0.0   | 65.366 |  65.409   |   65.612    |       -4.579        |
| timm_models |            res2net50_14w_8s             |    128     |   0.0   | 3.708  |   8.399   |   57.811    |       428.275       |
| timm_models |            res2net101_26w_4s            |     64     |   0.0   |  5.39  |  10.482   |   56.042    |       459.323       |
| timm_models |               res2next50                |    128     |   0.0   | 1.946  |   4.426   |   52.489    |       292.475       |
| timm_models |             legacy_senet154             |     32     |   0.0   | 5.437  |  11.439   |   51.941    |       335.474       |
| torchbench  |           mobilenet_v3_large            |     32     |   0.0   | 0.797  |   1.958   |   48.049    |       325.854       |
| timm_models |           gluon_inception_v3            |    128     |   0.0   | 1.879  |   4.232   |   46.426    |       472.191       |
| timm_models |              inception_v3               |    128     |   0.0   | 1.872  |   4.206   |   46.343    |       476.239       |
| torchbench  |         resnet50_quantized_qat          |     32     |   0.0   | 2.366  |   6.501   |   45.732    |       -2.707        |
| timm_models |            adv_inception_v3             |    128     |   0.0   | 0.523  |   2.818   |   45.161    |       462.487       |
| huggingface |            XLNetLMHeadModel             |     4      |   0.0   | 13.315 |  21.675   |   38.803    |       598.55        |
| huggingface |       MT5ForConditionalGeneration       |     2      |   0.0   | 13.459 |  18.646   |    37.11    |       380.514       |
| timm_models |            gluon_xception65             |     32     |   0.0   | 1.946  |   5.08    |   34.752    |       226.31        |
| torchbench  |              mobilenet_v2               |     96     |   0.0   | 0.542  |   1.57    |   33.984    |       281.45        |
| huggingface |         MegatronBertForCausalLM         |     2      |   0.0   | 17.26  |  21.586   |   33.516    |       459.281       |
| huggingface |    MegatronBertForQuestionAnswering     |     8      |   0.0   | 16.706 |   21.48   |   33.387    |       598.832       |
| timm_models |               selecsls42b               |    128     |   0.0   | 0.586  |   1.621   |   31.551    |       239.668       |
| timm_models |              nasnetalarge               |     16     |   0.0   | 30.009 |  31.092   |   31.029    |       -3.241        |
| torchbench  |                resnet50                 |     32     |   0.0   | 0.765  |   1.95    |   29.284    |       159.181       |
| torchbench  |               mnasnet1_0                |     32     |   0.0   | 0.611  |   1.694   |   27.733    |       273.082       |
| huggingface |       T5ForConditionalGeneration        |     4      |   0.0   | 7.773  |  11.156   |   27.079    |       266.568       |
| huggingface |          DebertaV2ForMaskedLM           |     1      |   0.0   | 7.919  |  12.989   |   26.708    |       -1.222        |
| torchbench  |                  hf_T5                  |     8      |   0.0   | 7.205  |  10.728   |    26.68    |       234.116       |
| huggingface |             XGLMForCausalLM             |     2      |   0.0   | 6.336  |  10.992   |   26.274    |       598.983       |
| huggingface |                 T5Small                 |     1      |   0.0   | 7.806  |  11.216   |   26.261    |       280.438       |
| huggingface |      DebertaV2ForQuestionAnswering      |     1      |   0.0   | 7.909  |   12.99   |   25.893    |        -1.24        |
| huggingface |     M2M100ForConditionalGeneration      |     2      |   0.0   | 6.279  |  12.378   |   25.882    |       598.719       |
| torchbench  |             resnext50_32x4d             |     8      |   0.0   | 0.773  |   1.949   |   25.454    |       141.483       |
| huggingface |     PegasusForConditionalGeneration     |     4      |   0.0   | 6.235  |  11.986   |   24.865    |       582.318       |
| timm_models |              pnasnet5large              |     16     |   0.0   | 22.702 |  24.632   |   23.986    |       -3.202        |
| huggingface |            YituTechConvBert             |     1      |   0.0   | 7.199  |   10.91   |   22.998    |       338.889       |
| torchbench  |             LearningToPaint             |     96     |   0.0   | 0.419  |   0.85    |   22.746    |       107.36        |
| huggingface |     GPTNeoForSequenceClassification     |     1      |   0.0   | 7.333  |  12.292   |   21.577    |       -1.179        |
| torchbench  |           shufflenet_v2_x1_0            |    128     |   0.0   | 0.885  |   2.245   |    21.15    |       190.619       |
| huggingface |            GPTNeoForCausalLM            |     1      |   0.0   | 7.335  |  12.174   |   21.138    |        -1.16        |
| torchbench  |               hf_BigBird                |     2      |   0.0   | 8.239  |  12.636   |   20.964    |       -1.487        |
| huggingface |                 BigBird                 |     1      |   0.0   | 8.339  |  12.634   |   20.934    |       -1.432        |
| huggingface | BlenderbotSmallForConditionalGeneration |     64     |   0.0   | 5.518  |   9.435   |    20.71    |       274.667       |
| timm_models |                hrnet_w18                |    128     |   0.0   | 18.748 |  20.474   |   20.252    |       -3.612        |
| huggingface |           DebertaForMaskedLM            |     4      |   0.0   | 4.385  |   7.539   |   19.249    |       172.45        |
| torchbench  |           Background_Matting            |     4      |   0.0   | 0.025  |   0.986   |   18.871    |       135.819       |
| huggingface |       DebertaForQuestionAnswering       |     4      |   0.0   | 4.343  |   7.558   |   18.614    |       -1.077        |
| timm_models |                 dpn107                  |     32     |   0.0   | 17.261 |  17.818   |   17.805    |       -2.842        |
| huggingface |           ElectraForCausalLM            |     1      |   0.0   | 5.238  |   7.486   |   17.657    |       276.946       |
| huggingface |                CamemBert                |     1      |   0.0   | 5.146  |   7.451   |   17.356    |       -0.978        |
| huggingface |           LayoutLMForMaskedLM           |     16     |   0.0   | 5.342  |   7.731   |    17.14    |       214.194       |
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
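
The rows above are in descending order of the aot_nvfuser column. A minimal sketch of how such a "worst N" table could be selected from per-model results is shown here. The `worst_n` helper and the row layout are hypothetical and are not taken from the benchmark scripts; the three sample values are copied from the table.

```python
from typing import Dict, List, Union

# Hypothetical row layout: one dict per model, keyed by column name.
Row = Dict[str, Union[str, float]]


def worst_n(rows: List[Row], metric: str, n: int = 50) -> List[Row]:
    """Return the n rows with the largest value of `metric`, largest first."""
    return sorted(rows, key=lambda row: row[metric], reverse=True)[:n]


# Three compilation latencies (seconds) copied from the table above.
rows: List[Row] = [
    {"name": "densenet121", "aot_nvfuser": 77.281},
    {"name": "MobileBertForMaskedLM", "aot_nvfuser": 139.766},
    {"name": "resnet50", "aot_nvfuser": 29.284},
]

for row in worst_n(rows, "aot_nvfuser", n=2):
    print(row["name"], row["aot_nvfuser"])
```

For latency and memory, "worst" means the largest value, so the sort is descending; a metric where lower is worse would need `reverse=False`.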

Peak Memory

dtype=float32, unit=GB

+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
|    suite    |                  name                   | batch_size | pytorch | eager  | aot_eager | aot_nvfuser | inductor_cudagraphs |
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+
| torchbench  |                  vgg16                  |     64     |   0.0   |  0.0   |   3.148   |    3.147    |        1.005        |
| torchbench  |                  hf_T5                  |     8      |   0.0   |  0.0   |   1.749   |    2.566    |        3.397        |
| timm_models |               res2next50                |    128     |   0.0   |  0.0   |   1.415   |    2.101    |        5.326        |
| timm_models |            res2net50_14w_8s             |    128     |   0.0   |  0.0   |   1.572   |    2.036    |        4.705        |
| huggingface |       BlenderbotSmallForCausalLM        |     64     |   0.0   |  0.0   |   1.916   |    1.92     |        4.27         |
| huggingface |            AlbertForMaskedLM            |     2      |   0.0   |  0.0   |   0.954   |    1.844    |        1.231        |
| timm_models |           gluon_inception_v3            |    128     |   0.0   |  0.0   |   2.006   |    1.816    |        2.53         |
| timm_models |            adv_inception_v3             |    128     |   0.0   |  0.0   |   2.006   |    1.816    |        2.528        |
| timm_models |              inception_v3               |    128     |   0.0   |  0.0   |   2.006   |    1.816    |        2.529        |
| huggingface | BlenderbotSmallForConditionalGeneration |     64     |   0.0   |  0.0   |   1.664   |    1.668    |        4.141        |
| huggingface |       AlbertForQuestionAnswering        |     2      |   0.0   |  0.0   |   0.705   |    1.595    |        0.697        |
| timm_models |            gluon_xception65             |     32     |   0.0   |  0.0   |   0.908   |    1.546    |        0.327        |
| huggingface |            XLNetLMHeadModel             |     4      |   0.0   |  0.0   |   1.514   |    1.531    |       -10.373       |
| torchbench  |                hf_Albert                |     8      |   0.0   |  0.0   |   0.356   |    1.459    |       -0.749        |
| huggingface |             BartForCausalLM             |     4      |   0.0   |  0.0   |   1.227   |    1.244    |        4.418        |
| timm_models |            res2net101_26w_4s            |     64     |   0.0   |  0.0   |   0.848   |    1.111    |        2.468        |
| timm_models |             legacy_senet154             |     32     |   0.0   |  0.0   |   0.989   |    1.106    |        0.095        |
| torchbench  |                 hf_Bart                 |     4      |   0.0   |  -0.0  |   1.026   |    1.035    |        1.541        |
| huggingface |           LayoutLMForMaskedLM           |     16     |   0.0   |  0.0   |    1.0    |     1.0     |        2.144        |
| huggingface |             BertForMaskedLM             |     64     |   0.0   |  0.0   |    1.0    |    0.975    |        2.107        |
| huggingface |       T5ForConditionalGeneration        |     4      |   0.0   |  0.0   |   0.736   |    0.944    |        2.519        |
| torchbench  |               timm_nfnet                |    128     |   0.0   | 0.891  |   0.89    |    0.89     |       -13.257       |
| timm_models |               dm_nfnet_f0               |    128     |   0.0   | 0.891  |   0.89    |    0.89     |       -13.257       |
| torchbench  |           Background_Matting            |     4      |   0.0   | -0.03  |   0.586   |    0.865    |        0.999        |
| huggingface |            MBartForCausalLM             |     16     |   0.0   |  0.0   |   0.819   |    0.82     |        3.195        |
| huggingface |    MegatronBertForQuestionAnswering     |     8      |   0.0   |  0.0   |   0.797   |    0.797    |       -3.993        |
| huggingface |            TrOCRForCausalLM             |     8      |   0.0   |  0.0   |   0.75    |    0.75     |        2.531        |
| torchbench  |             pytorch_struct              |    200     |   0.0   |  0.0   |   0.682   |    0.682    |        0.05         |
| torchbench  |                resnet50                 |     32     |   0.0   |  0.0   |   0.438   |    0.673    |        1.107        |
| huggingface |    LayoutLMForSequenceClassification    |     16     |   0.0   | 0.025  |   0.885   |    0.658    |        0.847        |
| timm_models |               selecsls42b               |    128     |   0.0   | 0.076  |   0.695   |    0.649    |        1.965        |
| torchbench  |              pytorch_unet               |     1      |   0.0   |  -0.0  |   0.623   |    0.567    |        0.667        |
| huggingface |       MT5ForConditionalGeneration       |     2      |   0.0   |  0.0   |   0.622   |    0.536    |        3.445        |
| huggingface |                 T5Small                 |     1      |   0.0   |  0.0   |   0.372   |    0.532    |        1.144        |
| huggingface |     MobileBertForQuestionAnswering      |     32     |   0.0   |  0.0   |   0.084   |    0.502    |        0.78         |
| huggingface |            PLBartForCausalLM            |     16     |   0.0   |  0.0   |   0.485   |    0.486    |        1.604        |
| huggingface |       ElectraForQuestionAnswering       |     64     |   0.0   |  0.0   |   0.716   |    0.448    |       -0.436        |
| torchbench  |                 hf_Bert                 |     4      |   0.0   |  0.0   |   0.496   |    0.447    |        1.195        |
| huggingface |                CamemBert                |     1      |   0.0   | -0.003 |   0.445   |    0.447    |       -1.415        |
| huggingface |       RobertaForQuestionAnswering       |     64     |   0.0   |  0.0   |   0.444   |    0.443    |        0.78         |
| huggingface |        BertForQuestionAnswering         |     64     |   0.0   |  0.0   |   0.444   |    0.443    |        0.779        |
| huggingface |         Speech2Text2ForCausalLM         |     64     |   0.0   | 0.101  |   0.428   |    0.433    |        1.004        |
| torchbench  |             LearningToPaint             |     96     |   0.0   | 0.021  |   0.358   |    0.401    |        0.54         |
| huggingface |            YituTechConvBert             |     1      |   0.0   |  0.0   |   0.382   |    0.39     |        1.458        |
| torchbench  |              hf_DistilBert              |     8      |   0.0   |  0.0   |   0.484   |    0.373    |        0.943        |
| torchbench  |           shufflenet_v2_x1_0            |    128     |   0.0   |  0.0   |   0.266   |    0.37     |        0.378        |
| huggingface |          MobileBertForMaskedLM          |     16     |   0.0   |  0.0   |   0.25    |    0.352    |        0.97         |
| torchbench  |               mnasnet1_0                |     32     |   0.0   |  0.0   |   0.149   |     0.3     |        0.358        |
| huggingface |               DistillGPT2               |     1      |   0.0   | 0.003  |   0.408   |    0.29     |        1.164        |
| timm_models |            convmixer_768_32             |     32     |   0.0   |  0.0   |   0.179   |    0.265    |        0.154        |
+-------------+-----------------------------------------+------------+---------+--------+-----------+-------------+---------------------+

Number of graphs

dtype=float32, unit=graphs

+-------------+-----------------------------------+------------+--------+
|    suite    |               name                | batch_size | graphs |
+-------------+-----------------------------------+------------+--------+
| huggingface |       DebertaV2ForMaskedLM        |     1      | 304.0  |
| huggingface |   DebertaV2ForQuestionAnswering   |     1      | 304.0  |
| huggingface |        DebertaForMaskedLM         |     4      | 204.0  |
| huggingface |    DebertaForQuestionAnswering    |     4      | 204.0  |
| huggingface |              BigBird              |     1      |  64.0  |
| torchbench  |            hf_BigBird             |     2      |  64.0  |
| timm_models |            convit_base            |     32     |  27.0  |
| huggingface |            GoogleFnet             |     1      |  27.0  |
| torchbench  |            hf_Reformer            |     4      |  22.0  |
| timm_models |            densenet121            |     64     |  14.0  |
| torchbench  |               moco                |     32     |  11.0  |
| huggingface |  PegasusForConditionalGeneration  |     4      |  7.0   |
| torchbench  |           fastNLP_Bert            |     6      |  10.0  |
| huggingface |  M2M100ForConditionalGeneration   |     2      |  7.0   |
| torchbench  |            tts_angular            |     64     |  4.0   |
| torchbench  |        speech_transformer         |     32     |  4.0   |
| huggingface |      Speech2Text2ForCausalLM      |     64     |  4.0   |
| huggingface |          XGLMForCausalLM          |     2      |  4.0   |
| huggingface |        PegasusForCausalLM         |     8      |  4.0   |
| timm_models |          crossvit_9_240           |     64     |  2.0   |
| timm_models |        eca_botnext26ts_256        |    128     |  2.0   |
| timm_models |         gluon_xception65          |     32     |  2.0   |
| timm_models |          gluon_senet154           |     32     |  2.0   |
| timm_models |        gluon_inception_v3         |    128     |  2.0   |
| timm_models |           ghostnet_100            |    128     |  2.0   |
| timm_models |             gernet_l              |    128     |  2.0   |
| timm_models |             fbnetv3_b             |    128     |  2.0   |
| timm_models |            fbnetc_100             |    128     |  2.0   |
| timm_models |         ese_vovnet19b_dw          |    128     |  2.0   |
| timm_models |         adv_inception_v3          |    128     |  2.0   |
| timm_models |           ecaresnet101d           |     64     |  2.0   |
| timm_models |         eca_halonext26ts          |    128     |  2.0   |
| timm_models |       beit_base_patch16_224       |     64     |  2.0   |
| huggingface | LayoutLMForSequenceClassification |     16     |  2.0   |
| timm_models |           botnet26t_256           |    128     |  2.0   |
| timm_models |           cait_m36_384            |     2      |  2.0   |
| timm_models |          coat_lite_mini           |    128     |  2.0   |
| timm_models |              dpn107               |     32     |  2.0   |
| timm_models |            dm_nfnet_f0            |    128     |  2.0   |
| timm_models |         convmixer_768_32          |     32     |  2.0   |
| timm_models |              dla102               |     64     |  2.0   |
| timm_models |           gmlp_s16_224            |     64     |  2.0   |
| timm_models |  deit_base_distilled_patch16_224  |     64     |  2.0   |
| huggingface |   GPT2ForSequenceClassification   |     4      |  2.0   |
| timm_models |           cspdarknet53            |     64     |  2.0   |
| huggingface |  GPTNeoForSequenceClassification  |     1      |  2.0   |
| timm_models |           convnext_base           |     32     |  2.0   |
| timm_models |           gmixer_24_224           |     64     |  2.0   |
| timm_models |       xcit_large_24_p8_224        |     5      |  2.0   |
| timm_models |            res2next50             |    128     |  2.0   |
+-------------+-----------------------------------+------------+--------+

@anijain2305
Contributor Author

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native pytorch. We measure speedup by normalizing against the performance of native pytorch. We report mean compilation latency and the peak memory footprint reduction ratio.

Caveats

  1. Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
  2. The experiments do not cover dynamic shapes.
  3. The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
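
As a rough illustration of the aggregation described above, the sketch below normalizes a backend's time against native pytorch and takes a geometric mean over the models that remain after filtering. It is not the dashboard's actual script. The helper names are hypothetical, and treating a 0.0 entry as "no valid result" is an assumption based on how the tables in this thread look. The sample values are a few inductor entries from the torchbench speedup table.

```python
import math
from typing import Iterable


def speedup(native_seconds: float, backend_seconds: float) -> float:
    """Speedup of a backend normalized against native pytorch (higher is better)."""
    return native_seconds / backend_seconds


def geomean(values: Iterable[float]) -> float:
    """Geometric mean of strictly positive values."""
    values = list(values)
    if not values:
        raise ValueError("geomean needs at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Assumption: 0.0 marks a model with no valid result, so it is dropped
# before averaging instead of being counted as a zero speedup.
inductor_speedups = {
    "densenet121": 4.5603,
    "resnet18": 1.7819,
    "resnet50": 0.7127,
    "yolov3": 0.0,
}
valid = [s for s in inductor_speedups.values() if s > 0.0]
print(f"{geomean(valid):.2f}x over {len(valid)} models")
```

A geometric mean is the usual choice for averaging ratios, because a 2x speedup and a 0.5x slowdown cancel out to 1x instead of averaging to 1.25x.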

Passrate

+----------------+-------------+-------------+-------------+
|    Compiler    | torchbench  | huggingface | timm_models |
+----------------+-------------+-------------+-------------+
|     eager      | 100%, 55/55 | 93%, 41/44  | 100%, 61/61 |
|   aot_eager    | 98%, 54/55  | 93%, 41/44  | 90%, 55/61  |
| aot_cudagraphs | 29%, 16/55  |  0%, 0/44   |  0%, 0/61   |
|  aot_nvfuser   | 62%, 34/55  |  2%, 1/44   | 82%, 50/61  |
|    inductor    | 87%, 48/55  | 77%, 34/44  | 74%, 45/61  |
+----------------+-------------+-------------+-------------+
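
The cells in the pass-rate table can be reproduced from the raw counts. A small sketch follows, where `format_passrate` is a hypothetical helper rather than a function from the benchmark code:

```python
def format_passrate(passed: int, total: int) -> str:
    """Format a pass rate the way the table cells read, e.g. '87%, 48/55'."""
    if total == 0:
        return "0%, 0/0"
    return f"{100 * passed / total:.0f}%, {passed}/{total}"


# Counts taken from the inductor row of the table above.
for suite, passed, total in [
    ("torchbench", 48, 55),
    ("huggingface", 34, 44),
    ("timm_models", 45, 61),
]:
    print(suite, format_passrate(passed, total))
```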

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.02x    |    0.0x     |    0.0x     |
|  aot_nvfuser   |   1.12x    |    1.12x    |    1.12x    |
|    inductor    |   1.38x    |    1.60x    |    1.23x    |
+----------------+------------+-------------+-------------+

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    5.68    |    13.69    |    11.39    |
|   aot_eager    |   10.31    |    20.58    |    17.02    |
| aot_cudagraphs |    4.47    |     0.0     |     0.0     |
|  aot_nvfuser   |   21.51    |    10.59    |    57.77    |
|    inductor    |   278.25   |   120.52    |   427.42    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.87x    |    0.88x    |    0.88x    |
| aot_cudagraphs |   0.48x    |    0.0x     |    0.0x     |
|  aot_nvfuser   |   0.84x    |    1.08x    |    0.85x    |
|    inductor    |   0.79x    |    0.74x    |    0.90x    |
+----------------+------------+-------------+-------------+

torchbench suite with float32 precision

Performance speedup

+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|               name                |  bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|            densenet121            |  4   | 0.9976 |  1.0092   |      0.0       |   1.4538    |  4.5603  |
|         timm_efficientdet         |  1   | 0.9817 |  0.8908   |      0.0       |     0.0     |  3.8319  |
|       functorch_dp_cifar10        |  64  | 1.0004 |  0.9835   |      0.0       |   1.2001    |  3.7742  |
|      timm_vision_transformer      |  8   | 0.9983 |  0.9452   |      0.0       |   1.3452    |  2.5363  |
|                drq                |  1   | 1.0117 |   0.826   |      0.0       |   1.0725    |  2.4186  |
|           BERT_pytorch            |  16  | 1.0094 |  0.8856   |      0.0       |     0.0     |   2.03   |
|             resnet18              |  16  | 1.0049 |  1.1155   |      0.0       |   1.3986    |  1.7819  |
|          pytorch_struct           | 200  | 0.9963 |  0.7395   |     0.8854     |   0.8963    |  1.7657  |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.9955 |  0.9348   |     1.1291     |   1.1909    |  1.7242  |
|           lennard_jones           | 1000 | 0.974  |  0.8405   |     1.0627     |   1.0207    |  1.7135  |
|             hf_Albert             |  8   | 1.0013 |  0.9978   |      0.0       |     0.0     |  1.6628  |
|           squeezenet1_1           |  32  | 1.0006 |  1.0042   |     1.0435     |   1.1661    |  1.6351  |
|               dcgan               |  32  | 0.9954 |   1.02    |     1.088      |   1.1569    |  1.6229  |
|          resnext50_32x4d          |  8   | 1.0027 |  1.0793   |      0.0       |   1.3534    |  1.5568  |
|        speech_transformer         |  32  | 1.003  |  0.8984   |      0.0       |     0.0     |  1.4906  |
|            timm_nfnet             | 128  | 0.9995 |  0.9997   |      0.0       |   1.2116    |  1.4697  |
|        mobilenet_v3_large         |  32  | 1.0053 |   1.121   |      0.0       |   1.3848    |  1.4662  |
|              hf_GPT2              |  4   | 1.0053 |  0.9748   |      0.0       |     0.0     |  1.4228  |
|            hf_T5_large            |  2   | 1.0242 |  0.8958   |      0.0       |     0.0     |  1.4145  |
|         soft_actor_critic         | 256  | 0.9952 |  0.7978   |     1.0393     |   1.0108    |  1.3816  |
|           fastNLP_Bert            |  6   | 0.999  |  0.9749   |      0.0       |     0.0     |  1.3503  |
|           pytorch_unet            |  1   | 0.9996 |  0.9969   |      0.0       |   1.0758    |  1.2042  |
|          LearningToPaint          |  96  | 1.0045 |  1.0546   |      0.0       |   1.2423    |  1.2032  |
|              hf_Bart              |  4   | 1.0118 |   0.974   |      0.0       |     0.0     |  1.1751  |
|            Super_SloMo            |  6   | 0.9999 |  0.9977   |      0.0       |     0.0     |  1.1742  |
|               vgg16               |  64  |  1.0   |  0.9986   |     0.7923     |   0.9962    |  1.1703  |
|              hf_Bert              |  4   | 1.0269 |  0.9881   |      0.0       |     0.0     |  1.1642  |
|              alexnet              | 128  | 0.9984 |  0.9988   |     0.777      |   1.0007    |  1.162   |
|            mnasnet1_0             |  32  | 1.001  |  1.1017   |     0.7035     |   1.3033    |  1.1612  |
|           hf_DistilBert           |  8   | 0.9997 |  0.9542   |      0.0       |     0.0     |  1.1537  |
|        Background_Matting         |  4   | 0.9996 |  1.0229   |      0.0       |    1.08     |  1.1159  |
|          pytorch_stargan          |  16  | 0.9994 |  0.9836   |     0.7288     |   0.9873    |  1.1151  |
|            hf_Reformer            |  4   | 0.9963 |    0.0    |     0.8939     |     0.0     |  1.1098  |
|            hf_BigBird             |  2   | 0.985  |  0.9444   |      0.0       |     0.0     |  1.0887  |
|        shufflenet_v2_x1_0         | 128  | 1.0011 |  1.0504   |      0.0       |   1.1836    |  1.0756  |
|         timm_efficientnet         |  32  | 0.9543 |   0.816   |      0.0       |   1.0788    |  1.0728  |
|   timm_vision_transformer_large   |  8   | 0.9992 |  0.9936   |      0.0       |   0.9822    |  1.0534  |
| attention_is_all_you_need_pytorch | 256  | 0.9979 |  0.9708   |      0.0       |     0.0     |  1.0469  |
|           timm_resnest            |  32  | 0.9996 |  1.0033   |      0.0       |   1.1829    |  1.0289  |
|            tts_angular            |  64  | 0.9959 |  0.9672   |     0.9836     |   0.9982    |  1.0112  |
|              demucs               |  4   | 1.0003 |  1.0002   |     0.9997     |   1.0006    |   1.0    |
|    mobilenet_v2_quantized_qat     |  96  | 0.9992 |  0.9996   |     0.999      |   0.9988    |  0.999   |
|      resnet50_quantized_qat       |  32  | 0.9975 |  0.9984   |     0.9983     |   0.9987    |  0.9984  |
|               dlrm                | 2048 | 0.9692 |  0.9785   |      0.0       |     0.0     |  0.9604  |
|           mobilenet_v2            |  96  | 0.9993 |  0.9979   |      0.0       |   1.0437    |  0.9574  |
|            timm_vovnet            |  32  | 0.9073 |  0.9025   |      0.0       |   1.0018    |   0.91   |
|      nvidia_deeprecommender       | 256  | 0.9993 |  0.9629   |     0.5845     |   0.9425    |  0.9044  |
|               moco                |  32  | 0.9947 |  1.0484   |      0.0       |     0.0     |  0.7591  |
|            timm_regnet            |  32  | 0.9652 |  0.9636   |      0.0       |   1.0932    |  0.7378  |
|             resnet50              |  32  | 0.9987 |  0.9933   |      0.0       |    1.161    |  0.7127  |
|              yolov3               |  16  | 0.9996 |  0.9945   |      0.0       |   1.1838    |   0.0    |
|           hf_Longformer           |  2   | 0.969  |   0.899   |     0.8164     |     0.0     |   0.0    |
|               hf_T5               |  8   | 0.9985 |  0.9942   |      0.0       |     0.0     |   0.0    |
|           hf_GPT2_large           |  4   | 0.9996 |  0.9801   |      0.0       |     0.0     |   0.0    |
|             tacotron2             |  64  | 0.9791 |  0.8546   |      0.0       |     0.0     |   0.0    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+

Accuracy

+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|               name                | bs  |      eager       |    aot_eager     |  aot_cudagraphs  |   aot_nvfuser    |     inductor     |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|           hf_GPT2_large           |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|            hf_T5_large            |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|   timm_vision_transformer_large   |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|              alexnet              |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               dcgan               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|              demucs               |  4  |       pass       |       pass       |       pass       |       pass       |       pass       |
|           lennard_jones           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|            mnasnet1_0             |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|    mobilenet_v2_quantized_qat     |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|      nvidia_deeprecommender       |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|   pytorch_CycleGAN_and_pix2pix    |  1  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_stargan          | 16  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_struct           | 200 |       pass       |       pass       |       pass       |       pass       |       pass       |
|      resnet50_quantized_qat       |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|         soft_actor_critic         | 256 |       pass       |       pass       |       pass       |       pass       |       pass       |
|           squeezenet1_1           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|            tts_angular            |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               vgg16               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          LearningToPaint          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            densenet121            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|                drq                |  1  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|       functorch_dp_cifar10        |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           mobilenet_v2            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|        mobilenet_v3_large         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           pytorch_unet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet18              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet50              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|          resnext50_32x4d          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|        shufflenet_v2_x1_0         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|         timm_efficientnet         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_nfnet             |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_regnet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           timm_resnest            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|      timm_vision_transformer      |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_vovnet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            hf_Reformer            |  2  |       pass       |       pass       |       pass       |   fail_to_run    |       pass       |
|           BERT_pytorch            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            Super_SloMo            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
| attention_is_all_you_need_pytorch |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|               dlrm                |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           fastNLP_Bert            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|             hf_Albert             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_Bart              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_Bert              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            hf_BigBird             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           hf_DistilBert           |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_GPT2              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|               hf_T5               |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|        speech_transformer         |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|         timm_efficientdet         |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|        Background_Matting         |  4  |       pass       |       pass       |   fail_to_run    |       pass       |   fail_to_run    |
|           hf_Longformer           |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|            hf_T5_base             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|               moco                |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|             tacotron2             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|          vision_maskrcnn          |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|              yolov3               |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |      0.0000      |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
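For anyone who wants to tally these results rather than read them row by row, here is a minimal sketch that parses a table in this ASCII layout and counts the status strings per backend. The helper names `parse_ascii_table` and `status_counts` are hypothetical and not part of the benchmark scripts; the sample rows are copied from the table above.

```python
from collections import Counter


def parse_ascii_table(text):
    """Parse a '+---+' / '| a | b |' style table into a list of row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip the '+----+' border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def status_counts(rows, backends):
    """Count each status string (pass, fail_to_run, ...) per backend column."""
    return {b: Counter(row[b] for row in rows) for b in backends}


# Three rows copied from the accuracy table above.
sample = """
+----------+----+-------+-----------+----------------+-------------+-------------+
|   name   | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser |  inductor   |
+----------+----+-------+-----------+----------------+-------------+-------------+
| alexnet  | 2  | pass  |   pass    |      pass      |    pass     |    pass     |
| resnet18 | 2  | pass  |   pass    |  fail_to_run   |    pass     |    pass     |
|   moco   | 2  | pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
+----------+----+-------+-----------+----------------+-------------+-------------+
"""

backends = ["eager", "aot_eager", "aot_cudagraphs", "aot_nvfuser", "inductor"]
counts = status_counts(parse_ascii_table(sample), backends)
for backend in backends:
    print(backend, dict(counts[backend]))
```

Pasting the full table into `sample` should give the per-backend totals; any cell that is not a status string (such as a numeric placeholder) is counted under its own key instead of being dropped.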

Compilation latency (sec)

+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
|               name                |  bs  |  eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
|         timm_efficientdet         |  1   | 50.9164 |  70.3788  |      nan       |     nan     | 1855.9202 |
|            densenet121            |  4   | 13.1067 |  25.4059  |      nan       |  101.5015   | 1599.4226 |
|            hf_T5_large            |  2   | 35.7166 |  66.5562  |      nan       |     nan     | 1154.4563 |
|            mnasnet1_0             |  32  | 3.1383  |  7.0386   |    23.5784     |   33.4187   | 924.0881  |
|        mobilenet_v3_large         |  32  | 3.6197  |   7.569   |      nan       |   55.8228   | 815.6827  |
|               moco                |  32  | 11.4915 |  16.8868  |      nan       |     nan     | 792.5782  |
|           mobilenet_v2            |  96  |  3.069  |  6.6873   |      nan       |   39.0419   | 673.3655  |
|          resnext50_32x4d          |  8   | 3.3393  |  7.3876   |      nan       |   31.0213   | 626.7237  |
|         timm_efficientnet         |  32  | 5.8246  |  10.4379  |      nan       |   56.7643   | 573.6101  |
|        shufflenet_v2_x1_0         | 128  | 3.6097  |  8.0917   |      nan       |   29.4511   | 415.0044  |
|           squeezenet1_1           |  32  | 0.6275  |  1.3124   |     3.1679     |   4.8972    | 379.8186  |
|           timm_resnest            |  32  |  1.351  |  3.4723   |      nan       |   36.2388   | 362.8361  |
|            timm_regnet            |  32  |  8.274  |  14.2127  |      nan       |   53.5289   | 335.4974  |
| attention_is_all_you_need_pytorch | 256  | 4.2332  |  10.1412  |      nan       |     nan     | 269.6108  |
|        speech_transformer         |  32  | 7.1452  |  13.5568  |      nan       |     nan     | 259.7565  |
|            timm_vovnet            |  32  | 2.8909  |  6.1661   |      nan       |   25.6462   | 255.4935  |
|       functorch_dp_cifar10        |  64  | 0.7904  |  2.0933   |      nan       |   5.6355    | 208.4064  |
|      timm_vision_transformer      |  8   | 2.9873  |  6.3471   |      nan       |   11.3264   | 200.3176  |
|             resnet18              |  16  | 0.9362  |  2.4353   |      nan       |   18.0277   | 195.5902  |
|   timm_vision_transformer_large   |  8   | 22.2765 |  34.0841  |      nan       |   44.7332   | 189.7259  |
|        Background_Matting         |  4   | 3.6941  |  7.5331   |      nan       |   32.8015   | 183.8065  |
|           BERT_pytorch            |  16  | 4.8027  |  10.7586  |      nan       |     nan     | 183.4356  |
|          LearningToPaint          |  96  | 0.9741  |  2.5194   |      nan       |   24.5849   | 178.7819  |
|             resnet50              |  32  | 3.2773  |  7.4179   |      nan       |   35.0054   | 175.1635  |
|              hf_Bart              |  4   | 7.0179  |  13.1991  |      nan       |     nan     | 163.5884  |
|           fastNLP_Bert            |  6   | 5.0044  |  9.9284   |      nan       |     nan     | 153.2715  |
|              hf_GPT2              |  4   | 3.4005  |  7.8867   |      nan       |     nan     | 149.3391  |
|            timm_nfnet             | 128  | 6.6204  |  11.8892  |      nan       |   34.5324   | 136.2766  |
|          pytorch_stargan          |  16  | 0.8038  |   2.764   |     9.5008     |   4.2834    |  128.577  |
|          pytorch_struct           | 200  | 0.3903  |  0.9288   |     1.4439     |   4.2379    | 106.0572  |
|            Super_SloMo            |  6   | 2.1703  |  5.8559   |      nan       |     nan     |  93.2015  |
|              hf_Bert              |  4   | 4.9761  |  9.5568   |      nan       |     nan     |  82.4714  |
|             hf_Albert             |  8   | 1.1045  |  5.7238   |      nan       |     nan     |  80.7431  |
|            hf_Reformer            |  4   | 2.9996  |    nan    |    13.0539     |     nan     |  77.5866  |
|           pytorch_unet            |  1   | 1.0533  |   2.812   |      nan       |   20.2914   |  64.8277  |
|            hf_BigBird             |  2   | 10.9112 |  16.7734  |      nan       |     nan     |  61.4215  |
|           hf_DistilBert           |  8   |  1.57   |  3.9675   |      nan       |     nan     |  54.3823  |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.7416  |  2.5789   |     7.9558     |   4.1967    |  35.9234  |
|               vgg16               |  64  | 0.3047  |  0.7761   |     2.3197     |   2.7169    |  22.3816  |
|               dlrm                | 2048 | 0.5927  |  0.9744   |      nan       |     nan     |  18.9655  |
|                drq                |  1   | 0.2629  |  0.5431   |      nan       |   3.5281    |  18.6718  |
|              alexnet              | 128  | 0.2274  |  0.5093   |     1.2017     |   2.4621    |  17.8106  |
|               dcgan               |  32  | 0.2487  |  0.5048   |      1.23      |   3.8262    |  16.923   |
|      nvidia_deeprecommender       | 256  | 0.2588  |  0.4766   |     0.7503     |   2.4694    |  12.556   |
|         soft_actor_critic         | 256  | 0.2557  |  0.3887   |     0.5963     |   1.5565    |  12.2199  |
|           lennard_jones           | 1000 | 0.2225  |  0.3672   |     0.5077     |   1.1334    |  5.8665   |
|            tts_angular            |  64  | 0.3106  |  0.3618   |     0.4935     |   1.0926    |   4.687   |
|      resnet50_quantized_qat       |  32  | 2.5256  |  2.4992   |     2.5283     |   2.4885    |   2.434   |
|    mobilenet_v2_quantized_qat     |  96  | 2.4664  |  2.4212   |     2.3653     |   2.3407    |  2.3672   |
|              demucs               |  4   | 0.8026  |  0.8012   |     0.8095     |   0.8135    |  0.7167   |
|              yolov3               |  16  | 7.1951  |  13.031   |      nan       |   47.4947   |    nan    |
|           hf_Longformer           |  2   | 11.5662 |  18.8685  |    84.9383     |     nan     |    nan    |
|           hf_GPT2_large           |  4   | 21.0635 |  34.9334  |      nan       |     nan     |    nan    |
|             tacotron2             |  64  | 13.5662 |  26.3055  |      nan       |     nan     |    nan    |
|               hf_T5               |  8   | 3.7864  |  10.4607  |      nan       |     nan     |    nan    |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
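One way to read the latency table is as a ratio against the eager column. The sketch below computes that ratio for a few rows copied from the table and skips entries where either value is `nan`. The function name `compile_overhead` is hypothetical, and treating `nan` as "no measurement available" is an assumption rather than something the dashboard states.

```python
import math


def compile_overhead(eager_s, backend_s):
    """Return backend_s / eager_s, or nan when either value is missing."""
    if math.isnan(eager_s) or math.isnan(backend_s) or eager_s <= 0:
        return float("nan")
    return backend_s / eager_s


# (name, eager seconds, inductor seconds), copied from the table above.
rows = [
    ("timm_efficientdet", 50.9164, 1855.9202),
    ("demucs", 0.8026, 0.7167),
    ("yolov3", 7.1951, float("nan")),
]

ratios = [(name, compile_overhead(e, i)) for name, e, i in rows]
# Keep only rows with a usable ratio, largest overhead first.
measured = sorted(
    (r for r in ratios if not math.isnan(r[1])),
    key=lambda r: r[1],
    reverse=True,
)
for name, ratio in measured:
    print(f"{name}: {ratio:.1f}x eager compile time")
```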

Peak Memory Compression Ratio

+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|               name                |  bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|            Super_SloMo            |  6   | 1.0024 |   0.956   |      nan       |     nan     |  1.1855  |
|         timm_efficientnet         |  32  | 0.9998 |  0.7704   |      nan       |   0.7845    |  1.0652  |
|            timm_nfnet             | 128  | 0.9393 |   0.897   |      nan       |   0.9515    |  1.022   |
|         timm_efficientdet         |  1   | 1.0142 |  0.8251   |      nan       |     nan     |  1.0218  |
|      resnet50_quantized_qat       |  32  | 0.9967 |  0.9967   |     0.9967     |   0.9967    |  1.0001  |
|    mobilenet_v2_quantized_qat     |  96  | 0.9957 |  0.9957   |     0.9957     |   0.9957    |  0.9992  |
|           mobilenet_v2            |  96  | 0.9993 |  0.7661   |      nan       |   0.7676    |  0.9975  |
|              demucs               |  4   | 0.9886 |  0.9886   |     0.9886     |   0.9886    |  0.9886  |
|            tts_angular            |  64  | 0.9884 |  0.9884   |     0.984      |   0.9884    |  0.9842  |
|              hf_GPT2              |  4   | 0.9548 |   0.887   |      nan       |     nan     |  0.9505  |
|        Background_Matting         |  4   | 1.0026 |   0.952   |      nan       |   0.9773    |  0.9139  |
|          pytorch_stargan          |  16  | 0.9975 |   1.019   |     0.2027     |   1.0085    |  0.9023  |
|        speech_transformer         |  32  | 0.9988 |  0.9152   |      nan       |     nan     |  0.8959  |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.9986 |  0.9173   |     0.2326     |   0.9114    |  0.8941  |
|             hf_Albert             |  8   | 0.9333 |  0.9333   |      nan       |     nan     |  0.8804  |
|           pytorch_unet            |  1   | 0.9985 |  0.8536   |      nan       |    0.851    |  0.859   |
|              hf_Bart              |  4   | 0.9617 |   0.878   |      nan       |     nan     |  0.853   |
|              hf_Bert              |  4   | 0.9683 |  0.8952   |      nan       |     nan     |  0.8517  |
|            timm_regnet            |  32  | 1.0013 |  0.8634   |      nan       |   0.8806    |  0.8481  |
|        shufflenet_v2_x1_0         | 128  |  1.0   |  0.9163   |      nan       |   0.8868    |  0.8447  |
|           fastNLP_Bert            |  6   | 1.0012 |  0.9152   |      nan       |     nan     |  0.8343  |
| attention_is_all_you_need_pytorch | 256  | 0.9481 |  0.9241   |      nan       |     nan     |  0.8261  |
|            timm_vovnet            |  32  | 0.9933 |  0.7644   |      nan       |   0.7778    |  0.8252  |
|           BERT_pytorch            |  16  |  1.0   |  0.8995   |      nan       |     nan     |  0.825   |
|            hf_T5_large            |  2   | 0.922  |  0.8722   |      nan       |     nan     |  0.8237  |
|            hf_BigBird             |  2   | 0.9609 |  0.9609   |      nan       |     nan     |  0.8205  |
|           squeezenet1_1           |  32  | 0.9749 |  0.8159   |     0.2781     |   0.9742    |  0.8159  |
|           hf_DistilBert           |  8   | 0.9212 |  0.9053   |      nan       |     nan     |  0.7841  |
|               dcgan               |  32  |  1.0   |  0.7784   |     0.3321     |   0.7784    |  0.767   |
|               moco                |  32  | 1.0067 |  0.9701   |      nan       |     nan     |  0.7668  |
|              alexnet              | 128  | 0.9998 |  0.7731   |     0.3805     |   0.7736    |  0.743   |
|            mnasnet1_0             |  32  | 0.9988 |  0.9087   |     0.1627     |   0.8348    |  0.7268  |
|             resnet50              |  32  | 1.0002 |  0.8763   |      nan       |   0.8011    |  0.7254  |
|   timm_vision_transformer_large   |  8   | 1.0022 |  0.8433   |      nan       |   0.8015    |  0.7222  |
|      timm_vision_transformer      |  8   |  1.0   |  0.8883   |      nan       |   0.8108    |  0.712   |
|        mobilenet_v3_large         |  32  | 0.9958 |  0.8655   |      nan       |   0.8773    |  0.7041  |
|               dlrm                | 2048 | 0.7282 |  0.7283   |      nan       |     nan     |  0.6973  |
|           timm_resnest            |  32  | 0.9935 |  0.8869   |      nan       |   0.8075    |  0.6862  |
|            densenet121            |  4   |  1.0   |  0.8812   |      nan       |   0.8571    |  0.6618  |
|          resnext50_32x4d          |  8   | 0.9994 |  0.8687   |      nan       |   0.8223    |  0.6615  |
|               vgg16               |  64  |  1.0   |  0.6663   |     0.2532     |   0.6664    |  0.6471  |
|          LearningToPaint          |  96  | 0.9442 |  0.6918   |      nan       |   0.6272    |  0.6444  |
|         soft_actor_critic         | 256  | 0.964  |   0.964   |     0.4356     |   0.9555    |  0.6428  |
|                drq                |  1   | 0.8541 |  0.8541   |      nan       |   0.8541    |  0.6427  |
|             resnet18              |  16  | 0.9846 |  0.7907   |      nan       |   0.7038    |  0.6163  |
|           lennard_jones           | 1000 |  1.0   |    1.0    |     0.3712     |   1.0947    |  0.5646  |
|      nvidia_deeprecommender       | 256  | 0.5598 |  0.5598   |     0.4734     |   0.5598    |  0.5598  |
|          pytorch_struct           | 200  |  1.0   |  0.5079   |     0.4824     |   0.5079    |  0.4222  |
|       functorch_dp_cifar10        |  64  | 0.9626 |  0.8251   |      nan       |   0.8254    |  0.4037  |
|            hf_Reformer            |  4   | 0.3011 |    nan    |     0.1803     |     nan     |  0.299   |
|              yolov3               |  16  | 1.0072 |  0.8533   |      nan       |   0.8915    |   nan    |
|           hf_Longformer           |  2   | 0.9603 |  0.9603   |     0.2879     |     nan     |   nan    |
|             tacotron2             |  64  | 0.9922 |  1.1046   |      nan       |     nan     |   nan    |
|               hf_T5               |  8   | 0.9527 |  0.9446   |      nan       |     nan     |   nan    |
|           hf_GPT2_large           |  4   | 0.936  |  0.8771   |      nan       |     nan     |   nan    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+

huggingface suite with float32 precision

Performance speedup

+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|       MT5ForConditionalGeneration       | 2  | 1.0268 |   0.919   |      0.0       |     0.0     |  4.2611  |
|           ElectraForCausalLM            | 1  | 1.0463 |  0.9209   |      0.0       |     0.0     |  4.1573  |
|            YituTechConvBert             | 1  | 1.0326 |  0.9386   |      0.0       |     0.0     |  3.1049  |
|         MegatronBertForCausalLM         | 2  | 1.043  |   0.943   |      0.0       |     0.0     |  2.8277  |
|           RobertaForCausalLM            | 4  | 1.0398 |  0.9419   |      0.0       |     0.0     |  2.7707  |
|          MobileBertForMaskedLM          | 16 | 1.0228 |   0.919   |      0.0       |     0.0     |  2.6211  |
|     M2M100ForConditionalGeneration      | 2  | 1.1203 |  1.0476   |      0.0       |     0.0     |  2.584   |
|             OPTForCausalLM              | 4  | 1.0181 |  0.9027   |      0.0       |     0.0     |  2.5794  |
|             XGLMForCausalLM             | 1  | 1.0148 |  0.8733   |      0.0       |     0.0     |  2.4465  |
|     PegasusForConditionalGeneration     | 4  | 1.0132 |   0.883   |      0.0       |     0.0     |  2.4111  |
|     MobileBertForQuestionAnswering      | 32 | 1.0191 |  0.9141   |      0.0       |     0.0     |  2.3065  |
|                CamemBert                | 1  | 1.046  |   0.945   |      0.0       |     0.0     |  2.2963  |
|               DistillGPT2               | 1  | 1.0351 |  0.9295   |      0.0       |     0.0     |  2.0155  |
|     PLBartForConditionalGeneration      | 8  | 1.0177 |  0.8977   |      0.0       |     0.0     |  1.8483  |
|               GoogleFnet                | 1  | 1.0022 |  0.8086   |      0.0       |   1.1178    |  1.7839  |
|      GPT2ForSequenceClassification      | 4  | 0.9991 |   0.977   |      0.0       |     0.0     |  1.6644  |
|    MegatronBertForQuestionAnswering     | 8  | 1.0461 |  0.9419   |      0.0       |     0.0     |  1.6081  |
|      MBartForConditionalGeneration      | 8  | 1.0126 |   0.916   |      0.0       |     0.0     |  1.4634  |
|            XLNetLMHeadModel             | 4  | 0.9991 |  0.9656   |      0.0       |     0.0     |  1.4289  |
|           PegasusForCausalLM            | 8  | 1.0088 |  0.9262   |      0.0       |     0.0     |  1.3581  |
|       T5ForConditionalGeneration        | 4  | 1.002  |  0.9661   |      0.0       |     0.0     |  1.349   |
|            TrOCRForCausalLM             | 8  | 1.0149 |  0.9561   |      0.0       |     0.0     |  1.337   |
|       AlbertForQuestionAnswering        | 2  |  1.0   |  0.9999   |      0.0       |     0.0     |  1.3032  |
|            AlbertForMaskedLM            | 2  | 1.0008 |  0.9987   |      0.0       |     0.0     |  1.2986  |
|         Speech2Text2ForCausalLM         | 64 | 1.0087 |  0.9398   |      0.0       |     0.0     |  1.2936  |
|    LayoutLMForSequenceClassification    | 16 | 0.9991 |  0.9865   |      0.0       |     0.0     |  1.2471  |
|                 T5Small                 | 1  | 1.0201 |  0.9507   |      0.0       |     0.0     |  1.2467  |
|      BartForConditionalGeneration       | 1  | 1.0117 |  0.8919   |      0.0       |     0.0     |  1.2102  |
|     DistilBertForQuestionAnswering      | 32 | 1.0287 |  0.9788   |      0.0       |     0.0     |  1.186   |
|       DebertaForQuestionAnswering       | 4  | 0.9307 |  0.7473   |     0.7971     |     0.0     |  1.1787  |
|          DistilBertForMaskedLM          | 16 | 1.0282 |  0.9804   |      0.0       |     0.0     |  1.1665  |
|            PLBartForCausalLM            | 16 | 1.0148 |  0.9447   |      0.0       |     0.0     |  1.1599  |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0103 |  0.9364   |      0.0       |     0.0     |  1.1574  |
|             BartForCausalLM             | 2  | 0.9992 |  0.9654   |      0.0       |     0.0     |  1.1029  |
|       RobertaForQuestionAnswering       | 64 | 0.9986 |  0.9812   |      0.0       |     0.0     |  1.1015  |
|        BertForQuestionAnswering         | 64 | 0.9987 |  0.9812   |      0.0       |     0.0     |  1.0921  |
|                 BigBird                 | 1  | 0.996  |  0.9401   |      0.0       |     0.0     |  1.0903  |
|            MBartForCausalLM             | 16 | 1.0061 |  0.9666   |      0.0       |     0.0     |  1.0422  |
|             BertForMaskedLM             | 64 | 0.9993 |  0.9612   |      0.0       |     0.0     |  1.0404  |
|           DebertaForMaskedLM            | 4  | 0.9338 |  0.8099   |     0.7224     |     0.0     |  1.0183  |
|       BlenderbotSmallForCausalLM        | 64 | 1.001  |  0.9056   |      0.0       |     0.0     |  1.0071  |
|          AllenaiLongformerBase          | 1  | 0.9525 |  0.8694   |     0.7836     |     0.0     |   0.0    |
|       ElectraForQuestionAnswering       | 64 | 0.9988 |  0.9837   |      0.0       |     0.0     |   0.0    |
|           LayoutLMForMaskedLM           | 16 | 0.9989 |  0.9699   |      0.0       |     0.0     |   0.0    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
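A single summary number per backend can be easier to track than the full table. The sketch below takes the geometric mean of the speedups over the models that ran. It assumes that a `0.0` entry marks a run that failed rather than a measured speedup of zero, which matches the `fail_to_run` entries in the accuracy tables but is not stated explicitly here. The function name `geomean_speedup` is hypothetical.

```python
import math


def geomean_speedup(values):
    """Geometric mean of the positive entries, plus how many were included.

    Entries equal to 0.0 are assumed to be failed runs and are excluded.
    """
    ran = [v for v in values if v > 0.0]
    if not ran:
        return float("nan"), 0
    log_mean = sum(math.log(v) for v in ran) / len(ran)
    return math.exp(log_mean), len(ran)


# Inductor speedups for four models copied from the table above;
# the last one (AllenaiLongformerBase) is 0.0 and is skipped.
inductor = [4.2611, 4.1573, 3.1049, 0.0]
gm, n = geomean_speedup(inductor)
print(f"inductor geomean over {n} of {len(inductor)} models: {gm:.4f}")
```

Reporting the count alongside the mean matters here: a backend that fails on most models can still show a high geometric mean over the few it completes.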

Accuracy

+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser |  inductor   |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
|               GoogleFnet                | 1  |  pass  |   pass    |  fail_to_run   |    pass     |    pass     |
|             BartForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             BertForMaskedLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|        BertForQuestionAnswering         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                 BigBird                 | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       BlenderbotSmallForCausalLM        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
| BlenderbotSmallForConditionalGeneration | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                CamemBert                | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           DebertaForMaskedLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       DebertaForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|          DistilBertForMaskedLM          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     DistilBertForQuestionAnswering      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|               DistillGPT2               | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           ElectraForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       ElectraForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|      GPT2ForSequenceClassification      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           LayoutLMForMaskedLM           | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|    LayoutLMForSequenceClassification    | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            MBartForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       MT5ForConditionalGeneration       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|         MegatronBertForCausalLM         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|    MegatronBertForQuestionAnswering     | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|          MobileBertForMaskedLM          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     MobileBertForQuestionAnswering      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             OPTForCausalLM              | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            PLBartForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           PegasusForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     PegasusForConditionalGeneration     | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           RobertaForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       RobertaForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|         Speech2Text2ForCausalLM         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       T5ForConditionalGeneration        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                 T5Small                 | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            TrOCRForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            XLNetLMHeadModel             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            YituTechConvBert             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            AlbertForMaskedLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|       AlbertForQuestionAnswering        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|          AllenaiLongformerBase          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|      MBartForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|     PLBartForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|      BartForConditionalGeneration       | 0  | 0.0000 |  0.0000   |     0.0000     |   0.0000    |   0.0000    |
|     M2M100ForConditionalGeneration      | 0  | 0.0000 |  0.0000   |     0.0000     |   0.0000    |   0.0000    |
|             XGLMForCausalLM             | 0  | 0.0000 |  0.0000   |     0.0000     |   0.0000    |   0.0000    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+

Compilation latency (sec)

+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
|                  name                   | bs |  eager   | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
|            XLNetLMHeadModel             | 4  | 17.6114  |  35.8594  |      nan       |     nan     | 311.0562 |
|          MobileBertForMaskedLM          | 16 | 134.9363 | 159.1989  |      nan       |     nan     | 285.2019 |
|     MobileBertForQuestionAnswering      | 32 | 131.2332 | 154.9954  |      nan       |     nan     | 264.1833 |
|     M2M100ForConditionalGeneration      | 2  | 25.6915  |  36.5483  |      nan       |     nan     | 249.2139 |
|       MT5ForConditionalGeneration       | 2  |  6.4578  |  16.8657  |      nan       |     nan     | 202.2192 |
|       T5ForConditionalGeneration        | 4  |  3.7498  |  10.6849  |      nan       |     nan     | 200.4614 |
|      MBartForConditionalGeneration      | 8  | 26.4326  |  39.5871  |      nan       |     nan     | 185.0177 |
|     PegasusForConditionalGeneration     | 4  |  25.99   |  38.2544  |      nan       |     nan     | 176.0357 |
|      BartForConditionalGeneration       | 1  | 26.0076  |  38.7586  |      nan       |     nan     |  171.48  |
|            YituTechConvBert             | 1  |  8.9211  |  16.6277  |      nan       |     nan     | 171.2474 |
|             XGLMForCausalLM             | 1  | 14.9478  |  24.9402  |      nan       |     nan     | 165.6688 |
|           DebertaForMaskedLM            | 4  |  6.9897  |  13.3816  |    50.5707     |     nan     | 159.4509 |
|         MegatronBertForCausalLM         | 2  | 16.6098  |  26.1689  |      nan       |     nan     | 156.6464 |
|                 T5Small                 | 1  |  3.849   |  10.5757  |      nan       |     nan     | 156.5866 |
|    MegatronBertForQuestionAnswering     | 8  | 16.2237  |  26.0801  |      nan       |     nan     | 152.5495 |
|     PLBartForConditionalGeneration      | 8  |  7.1207  |  13.3694  |      nan       |     nan     | 148.7384 |
| BlenderbotSmallForConditionalGeneration | 32 | 11.9825  |  20.3854  |      nan       |     nan     | 134.417  |
|       DebertaForQuestionAnswering       | 4  |  6.9424  |  13.2729  |    50.3621     |     nan     | 120.3692 |
|           RobertaForCausalLM            | 4  |  5.0343  |  9.8896   |      nan       |     nan     | 108.708  |
|    LayoutLMForSequenceClassification    | 16 |  5.2725  |  10.0526  |      nan       |     nan     | 102.2804 |
|           PegasusForCausalLM            | 8  |  9.8065  |  14.5866  |      nan       |     nan     | 98.5429  |
|            MBartForCausalLM             | 16 |  9.8899  |  14.5906  |      nan       |     nan     | 91.5006  |
|             OPTForCausalLM              | 4  |  4.6313  |  9.4156   |      nan       |     nan     | 88.7484  |
|             BertForMaskedLM             | 64 |  5.0508  |  9.6996   |      nan       |     nan     | 87.3848  |
|             BartForCausalLM             | 2  |  9.8513  |   14.44   |      nan       |     nan     | 87.1208  |
|      GPT2ForSequenceClassification      | 4  |  3.4283  |  7.9924   |      nan       |     nan     |  86.486  |
|            TrOCRForCausalLM             | 8  |  9.7915  |  14.4556  |      nan       |     nan     | 78.9711  |
|               DistillGPT2               | 1  |  1.4429  |  3.7509   |      nan       |     nan     | 75.3892  |
|           ElectraForCausalLM            | 1  |  5.088   |  9.8527   |      nan       |     nan     | 72.4505  |
|            PLBartForCausalLM            | 16 |  3.2335  |  5.4969   |      nan       |     nan     | 70.0976  |
|                CamemBert                | 1  |  4.996   |  9.9911   |      nan       |     nan     |  68.592  |
|     DistilBertForQuestionAnswering      | 32 |  1.7309  |  4.1188   |      nan       |     nan     | 68.2232  |
|         Speech2Text2ForCausalLM         | 64 |  3.1456  |   5.399   |      nan       |     nan     | 67.8609  |
|       BlenderbotSmallForCausalLM        | 64 |  4.796   |   7.892   |      nan       |     nan     | 67.5137  |
|       RobertaForQuestionAnswering       | 64 |  4.8999  |  9.8389   |      nan       |     nan     | 66.2968  |
|        BertForQuestionAnswering         | 64 |  4.8621  |  9.7433   |      nan       |     nan     | 65.7811  |
|            AlbertForMaskedLM            | 2  |  1.2227  |  6.2484   |      nan       |     nan     | 65.4464  |
|                 BigBird                 | 1  | 11.1119  |  16.9625  |      nan       |     nan     | 58.5731  |
|          DistilBertForMaskedLM          | 16 |  1.7176  |  4.1522   |      nan       |     nan     |  51.726  |
|       AlbertForQuestionAnswering        | 2  |  1.2187  |  6.0422   |      nan       |     nan     | 45.4511  |
|               GoogleFnet                | 1  |  1.9996  |  4.2959   |      nan       |   10.5864   | 44.8824  |
|          AllenaiLongformerBase          | 1  |  11.654  |  19.7453  |     85.668     |     nan     |   nan    |
|           LayoutLMForMaskedLM           | 16 |  5.3735  |  10.207   |      nan       |     nan     |   nan    |
|       ElectraForQuestionAnswering       | 64 |  4.9237  |   9.746   |      nan       |     nan     |   nan    |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+

Peak Memory Compression Ratio

+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|      GPT2ForSequenceClassification      | 4  | 0.9342 |  0.9091   |      nan       |     nan     |  1.0318  |
|            XLNetLMHeadModel             | 4  | 1.0001 |  0.8976   |      nan       |     nan     |  0.9717  |
|    LayoutLMForSequenceClassification    | 16 |  1.0   |  0.9348   |      nan       |     nan     |  0.9339  |
|        BertForQuestionAnswering         | 64 |  1.0   |  0.9467   |      nan       |     nan     |  0.9145  |
|       RobertaForQuestionAnswering       | 64 |  1.0   |  0.9467   |      nan       |     nan     |  0.9145  |
|                 T5Small                 | 1  |  1.0   |  0.9325   |      nan       |     nan     |  0.8445  |
|     DistilBertForQuestionAnswering      | 32 |  1.0   |  0.9046   |      nan       |     nan     |  0.8394  |
|             BertForMaskedLM             | 64 |  1.0   |  0.9219   |      nan       |     nan     |  0.8321  |
|             BartForCausalLM             | 2  |  1.0   |  0.8847   |      nan       |     nan     |  0.8303  |
|                 BigBird                 | 1  | 1.0001 |  0.9549   |      nan       |     nan     |  0.8224  |
|          DistilBertForMaskedLM          | 16 | 0.9998 |  0.9138   |      nan       |     nan     |  0.8055  |
|            PLBartForCausalLM            | 16 | 0.9997 |  0.8802   |      nan       |     nan     |  0.8028  |
|            MBartForCausalLM             | 16 |  1.0   |  0.8629   |      nan       |     nan     |  0.8005  |
|               DistillGPT2               | 1  | 1.0003 |  0.7721   |      nan       |     nan     |  0.7997  |
|         Speech2Text2ForCausalLM         | 64 |  1.0   |   0.88    |      nan       |     nan     |  0.7768  |
|       T5ForConditionalGeneration        | 4  |  1.0   |  0.9597   |      nan       |     nan     |  0.7754  |
|             XGLMForCausalLM             | 1  | 0.9999 |  0.9999   |      nan       |     nan     |  0.7728  |
|      BartForConditionalGeneration       | 1  |  1.0   |  0.8465   |      nan       |     nan     |  0.7708  |
| BlenderbotSmallForConditionalGeneration | 32 |  1.0   |  0.9036   |      nan       |     nan     |  0.7612  |
|     PLBartForConditionalGeneration      | 8  | 0.9997 |  0.8222   |      nan       |     nan     |  0.7547  |
|                CamemBert                | 1  | 0.998  |  0.7977   |      nan       |     nan     |  0.7369  |
|            YituTechConvBert             | 1  | 0.9858 |  0.7923   |      nan       |     nan     |  0.7299  |
|            TrOCRForCausalLM             | 8  |  1.0   |  0.8048   |      nan       |     nan     |  0.7284  |
|       BlenderbotSmallForCausalLM        | 64 |  1.0   |  0.8401   |      nan       |     nan     |  0.7277  |
|      MBartForConditionalGeneration      | 8  |  1.0   |  0.8137   |      nan       |     nan     |  0.727   |
|             OPTForCausalLM              | 4  | 0.9979 |   0.75    |      nan       |     nan     |  0.714   |
|           RobertaForCausalLM            | 4  | 0.9058 |  0.7778   |      nan       |     nan     |  0.7099  |
|           PegasusForCausalLM            | 8  |  1.0   |  0.9323   |      nan       |     nan     |  0.7012  |
|    MegatronBertForQuestionAnswering     | 8  | 0.923  |  0.8265   |      nan       |     nan     |  0.6997  |
|               GoogleFnet                | 1  | 1.0003 |  0.9447   |      nan       |   1.0813    |  0.6953  |
|     M2M100ForConditionalGeneration      | 2  | 0.9783 |  0.9777   |      nan       |     nan     |  0.6688  |
|         MegatronBertForCausalLM         | 2  | 0.7066 |  0.7066   |      nan       |     nan     |  0.6453  |
|     PegasusForConditionalGeneration     | 4  | 0.9721 |  0.9004   |      nan       |     nan     |  0.642   |
|       MT5ForConditionalGeneration       | 2  | 0.6173 |  0.6173   |      nan       |     nan     |  0.6173  |
|       AlbertForQuestionAnswering        | 2  |  1.0   |  0.9369   |      nan       |     nan     |  0.6126  |
|           ElectraForCausalLM            | 1  |  1.0   |  0.9107   |      nan       |     nan     |  0.6123  |
|            AlbertForMaskedLM            | 2  | 0.9999 |  0.9172   |      nan       |     nan     |  0.6027  |
|          MobileBertForMaskedLM          | 16 | 0.9997 |  0.9179   |      nan       |     nan     |  0.5861  |
|     MobileBertForQuestionAnswering      | 32 |  1.0   |  0.9716   |      nan       |     nan     |  0.4668  |
|           DebertaForMaskedLM            | 4  |  1.0   |  0.9851   |     0.352      |     nan     |  0.4265  |
|       DebertaForQuestionAnswering       | 4  | 0.9845 |  1.0525   |     0.3276     |     nan     |  0.3569  |
|          AllenaiLongformerBase          | 1  | 0.9988 |  0.9515   |     0.3143     |     nan     |   nan    |
|       ElectraForQuestionAnswering       | 64 |  1.0   |  0.9524   |      nan       |     nan     |   nan    |
|           LayoutLMForMaskedLM           | 16 |  1.0   |  0.9409   |      nan       |     nan     |   nan    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
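
The table above does not define the ratio. A minimal sketch, assuming it is the baseline's peak memory divided by the backend's peak memory (so values below 1.0 mean the backend used more memory), converts a ratio back into a memory multiple. The helper `relative_peak_memory` is hypothetical; the sample values are inductor entries copied from the table.

```python
import math

def relative_peak_memory(ratio):
    """Backend peak memory as a multiple of the baseline's peak memory.

    Assumption: ratio = baseline_peak / backend_peak. The table does not
    state this definition, so treat the result as illustrative only.
    """
    if math.isnan(ratio) or ratio <= 0:
        return float("nan")  # no measurement available for this backend
    return 1.0 / ratio

# Inductor column values copied from the table above.
INDUCTOR = {
    "GPT2ForSequenceClassification": 1.0318,
    "DebertaForMaskedLM": 0.4265,
    "AllenaiLongformerBase": float("nan"),
}

for name, ratio in INDUCTOR.items():
    print(f"{name}: {relative_peak_memory(ratio):.2f}x baseline peak memory")
```

Rows reported as `nan` stay `nan` rather than being treated as zero, so they are not mistaken for a large regression.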

timm_models suite with float32 precision

Performance speedup

+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|            hrnet_w18            |  2  | 1.0063 |  1.0839   |      0.0       |   1.4454    |  4.4633  |
|        res2net50_14w_8s         |  2  | 1.0006 |  1.0247   |      0.0       |   1.4422    |  4.1571  |
|           res2next50            |  2  | 1.0002 |  1.0372   |      0.0       |   1.3702    |  4.1399  |
|         coat_lite_mini          | 128 | 0.9999 |  0.9989   |      0.0       |   1.0734    |  1.7041  |
|          ghostnet_100           | 128 | 0.9986 |  0.9941   |      0.0       |    1.243    |  1.6133  |
|        tnt_s_patch16_224        | 64  | 0.9995 |  0.9981   |      0.0       |   1.5571    |  1.5001  |
|        twins_pcpvt_base         | 32  | 1.0054 |  0.9735   |      0.0       |   1.2889    |  1.4364  |
|      xcit_large_24_p8_224       |  5  | 1.0011 |  0.9919   |      0.0       |     0.0     |  1.4302  |
|         crossvit_9_240          | 64  | 1.0068 |  0.9963   |      0.0       |    1.062    |  1.4092  |
|           volo_d1_224           | 64  | 0.9996 |  0.9949   |      0.0       |   1.1385    |  1.4042  |
|            nfnet_l0             | 64  | 1.0001 |   0.797   |      0.0       |   1.0495    |  1.381   |
|          gmixer_24_224          | 64  | 0.9991 |  0.8429   |      0.0       |   0.9957    |  1.3504  |
|          jx_nest_base           | 32  | 0.9992 |  0.9943   |      0.0       |   1.2244    |  1.2882  |
|           convit_base           | 32  | 0.9991 |   0.995   |      0.0       |   1.1931    |  1.2615  |
|            lcnet_050            | 128 | 0.9547 |  0.9495   |      0.0       |   1.5025    |  1.2406  |
|          cait_m36_384           |  2  | 0.9979 |  0.9981   |      0.0       |   0.9962    |  1.2022  |
|          convnext_base          | 32  | 0.9992 |  0.9967   |      0.0       |   1.0434    |  1.172   |
|          gmlp_s16_224           | 64  | 0.9991 |   0.996   |      0.0       |   0.9991    |  1.142   |
|      beit_base_patch16_224      | 64  | 0.9998 |  0.9813   |      0.0       |   0.9537    |  1.1226  |
|           regnety_002           | 128 | 0.9787 |  0.9996   |      0.0       |   1.3613    |  1.1081  |
| deit_base_distilled_patch16_224 | 64  | 0.9997 |   0.998   |      0.0       |    1.019    |  1.1058  |
|      vit_base_patch16_224       | 64  | 0.9997 |  0.9982   |      0.0       |   0.9781    |  1.0981  |
|          mixer_b16_224          | 64  | 0.9996 |  0.9968   |      0.0       |   0.9838    |  1.0523  |
|            mixnet_l             | 64  | 0.971  |  0.8727   |      0.0       |   1.0065    |  1.0458  |
|           tf_mixnet_l           | 64  | 0.9718 |  0.8763   |      0.0       |   1.0061    |  1.0239  |
|             dpn107              | 32  | 0.9587 |  0.9505   |      0.0       |   1.0289    |  1.0034  |
|             dla102              | 64  | 0.9995 |  0.9965   |      0.0       |   1.2853    |  0.9963  |
|          resmlp_12_224          | 128 | 0.9997 |  0.9986   |      0.0       |     0.0     |  0.9746  |
|           resnest101e           | 32  | 1.0033 |  1.0192   |      0.0       |   1.1978    |  0.9554  |
|       tf_efficientnet_b0        | 128 | 0.977  |  0.7833   |      0.0       |   0.9847    |  0.8973  |
|            repvgg_a2            | 128 | 0.9645 |  0.9628   |      0.0       |   1.1198    |  0.891   |
|           selecsls42b           | 128 | 0.9994 |  0.9981   |      0.0       |   1.2083    |  0.8872  |
|          spnasnet_100           | 128 | 0.9614 |  0.9577   |      0.0       |   1.1368    |  0.886   |
|         visformer_small         | 128 |  1.0   |  1.0012   |      0.0       |   1.0216    |  0.8772  |
|            fbnetv3_b            | 128 | 0.965  |  0.9616   |      0.0       |   1.1289    |  0.8724  |
|            gernet_l             | 128 | 0.9735 |  0.9722   |      0.0       |   1.0981    |  0.8702  |
|           mnasnet_100           | 128 | 0.9667 |  0.9638   |      0.0       |   1.1557    |  0.8485  |
|      mobilenetv3_large_100      | 128 | 0.965  |  0.9626   |      0.0       |   1.1636    |  0.8457  |
|          cspdarknet53           | 64  | 0.9583 |  0.9521   |      0.0       |   1.1839    |  0.8444  |
|            tinynet_a            | 128 | 0.9667 |   0.776   |      0.0       |   0.9711    |  0.8364  |
|           mobilevit_s           | 32  | 0.9725 |  0.7645   |      0.0       |   0.9873    |  0.8216  |
|       eca_botnext26ts_256       | 64  | 0.973  |  0.7708   |      0.0       |   1.0167    |  0.7978  |
|        sebotnet33ts_256         | 64  | 0.9759 |  0.8072   |      0.0       |   1.0536    |  0.7733  |
|        eca_halonext26ts         | 64  | 0.9743 |  0.7761   |      0.0       |   1.0143    |  0.7709  |
|           fbnetc_100            | 128 | 0.9668 |  0.9628   |      0.0       |   1.1885    |  0.7567  |
|        res2net101_26w_4s        | 64  | 0.9988 |  0.9967   |      0.0       |   1.1758    |  0.7474  |
|           rexnet_100            | 128 | 0.9724 |  0.8167   |      0.0       |   0.9834    |  0.676   |
|         mobilenetv2_100         | 128 | 0.9668 |  0.9633   |      0.0       |   1.0116    |  0.669   |
|        ese_vovnet19b_dw         | 128 | 0.9789 |  0.9774   |      0.0       |   1.1447    |  0.6203  |
|          botnet26t_256          | 128 | 0.9859 |  0.9852   |      0.0       |   1.2245    |   0.0    |
|           dm_nfnet_f0           | 128 | 0.9993 |  0.9997   |      0.0       |   1.2107    |   0.0    |
|        adv_inception_v3         | 128 | 0.9998 |  0.9971   |      0.0       |   1.1256    |   0.0    |
|       gluon_inception_v3        | 128 |  1.0   |  0.9985   |      0.0       |   1.1248    |   0.0    |
|          inception_v3           | 128 | 0.9998 |  0.9968   |      0.0       |   1.1246    |   0.0    |
|     swsl_resnext101_32x16d      | 32  | 0.9995 |  0.9987   |      0.0       |    1.108    |   0.0    |
|          pnasnet5large          | 16  | 0.9988 |  0.9979   |      0.0       |    1.083    |   0.0    |
|        convmixer_768_32         | 32  | 1.0003 |  0.9997   |      0.0       |    1.061    |   0.0    |
|            pit_b_224            | 64  | 0.9998 |  0.9973   |      0.0       |   1.0594    |   0.0    |
|        gluon_xception65         | 32  | 0.9995 |  0.9967   |      0.0       |   1.0398    |   0.0    |
|         poolformer_m36          | 64  | 0.9994 |  0.9967   |      0.0       |   1.0063    |   0.0    |
|  swin_base_patch4_window7_224   | 64  | 0.9996 |  0.9715   |      0.0       |    1.003    |   0.0    |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
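
One number per backend is often easier to compare than the full table. A minimal sketch, assuming that a speedup of 0.0 marks a run that failed or was skipped (the table itself does not say so), takes the geometric mean over the non-zero entries only. The helper `geomean_speedup` is hypothetical; the sample values are copied from four rows of the table above.

```python
import math

# (aot_nvfuser, inductor) speedups copied from four rows of the table above.
SAMPLE = {
    "hrnet_w18": (1.4454, 4.4633),
    "res2net50_14w_8s": (1.4422, 4.1571),
    "res2next50": (1.3702, 4.1399),
    "xcit_large_24_p8_224": (0.0, 1.4302),
}

def geomean_speedup(values):
    """Return (geometric mean of positive speedups, number of skipped entries).

    Assumption: 0.0 means the backend produced no measurement for that model.
    """
    values = list(values)
    ok = [v for v in values if v > 0.0]
    if not ok:
        return float("nan"), len(values)
    mean = math.exp(sum(math.log(v) for v in ok) / len(ok))
    return mean, len(values) - len(ok)

nvfuser_mean, nvfuser_skipped = geomean_speedup(v[0] for v in SAMPLE.values())
inductor_mean, inductor_skipped = geomean_speedup(v[1] for v in SAMPLE.values())
print(f"aot_nvfuser: {nvfuser_mean:.3f} ({nvfuser_skipped} skipped)")
print(f"inductor:    {inductor_mean:.3f} ({inductor_skipped} skipped)")
```

Reporting the skipped count next to the mean matters here: a backend that fails on many models can otherwise look better than one that runs everything.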

Accuracy

+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
|              name               | bs | eager |   aot_eager   | aot_cudagraphs |  aot_nvfuser  |   inductor    |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
|          convnext_base          | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|          gmixer_24_224          | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|          gmlp_s16_224           | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|          mixer_b16_224          | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|           mnasnet_100           | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|            repvgg_a2            | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|          spnasnet_100           | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|        adv_inception_v3         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      beit_base_patch16_224      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          botnet26t_256          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        convmixer_768_32         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         crossvit_9_240          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          cspdarknet53           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
| deit_base_distilled_patch16_224 | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|             dla102              | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           dm_nfnet_f0           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|             dpn107              | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       eca_botnext26ts_256       | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        eca_halonext26ts         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        ese_vovnet19b_dw         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            gernet_l             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          ghostnet_100           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       gluon_inception_v3        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            hrnet_w18            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          inception_v3           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            lcnet_050            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            mixnet_l             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         mobilenetv2_100         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      mobilenetv3_large_100      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           mobilevit_s           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            nfnet_l0             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          pnasnet5large          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           regnety_002           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        res2net101_26w_4s        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        res2net50_14w_8s         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           res2next50            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           rexnet_100            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        sebotnet33ts_256         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           selecsls42b           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|  swin_base_patch4_window7_224   | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|     swsl_resnext101_32x16d      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       tf_efficientnet_b0        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           tf_mixnet_l           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            tinynet_a            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        tnt_s_patch16_224        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         visformer_small         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      vit_base_patch16_224       | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           volo_d1_224           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          resmlp_12_224          | 2  | pass  |     pass      |      pass      |  fail_to_run  |     pass      |
|           convit_base           | 2  | pass  |     pass      |  fail_to_run   |  fail_to_run  |     pass      |
|      xcit_large_24_p8_224       | 2  | pass  | fail_accuracy |  fail_to_run   |  fail_to_run  |     pass      |
|        gluon_xception65         | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy |     pass      |
|         poolformer_m36          | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy |     pass      |
|         coat_lite_mini          | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|          jx_nest_base           | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|            pit_b_224            | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|        twins_pcpvt_base         | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|           fbnetc_100            | 2  | pass  |     pass      |      pass      |     pass      | fail_accuracy |
|            fbnetv3_b            | 2  | pass  |     pass      |  fail_to_run   |     pass      | fail_accuracy |
|           resnest101e           | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy | fail_accuracy |
|          cait_m36_384           | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
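
The pass/fail pattern above can be tallied per backend. The following is a minimal sketch, not the dashboard's own tooling: `parse_rows` and `tally` are hypothetical helpers, and the embedded table contains only the header and four rows copied from the accuracy table above.

```python
from collections import Counter

# Header plus four rows copied verbatim from the accuracy table above.
TABLE = """\
|              name               | bs | eager |   aot_eager   | aot_cudagraphs |  aot_nvfuser  |   inductor    |
|          convnext_base          | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|        adv_inception_v3         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      xcit_large_24_p8_224       | 2  | pass  | fail_accuracy |  fail_to_run   |  fail_to_run  |     pass      |
|          cait_m36_384           | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy | fail_accuracy |
"""

def parse_rows(text):
    """Turn '|'-delimited table lines into a list of dicts keyed by header."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.startswith("|")  # skip '+----+' border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

def tally(rows, backends):
    """Count each status string (pass, fail_to_run, ...) per backend column."""
    return {b: Counter(row[b] for row in rows) for b in backends}

rows = parse_rows(TABLE)
counts = tally(rows, ["eager", "aot_eager", "aot_cudagraphs", "aot_nvfuser", "inductor"])
for backend, counter in counts.items():
    print(backend, dict(counter))
```

Pasting the full table into `TABLE` works the same way, since border lines starting with `+` are ignored.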

Compilation latency (sec)

+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
|              name               | bs  |  eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
|            hrnet_w18            |  2  | 99.0925 | 129.1429  |      nan       |  302.8547   | 1305.5451 |
|             dpn107              | 32  | 13.3369 |  24.7075  |      nan       |   86.8976   | 1280.7288 |
|           rexnet_100            | 128 | 6.4069  |  11.8586  |      nan       |  107.4496   | 988.7111  |
|        res2net50_14w_8s         |  2  | 19.6264 |  33.3971  |      nan       |   88.2353   | 936.7276  |
|           mobilevit_s           | 32  | 5.7236  |  11.1465  |      nan       |   45.981    | 799.1906  |
|            mixnet_l             | 64  | 13.2916 |  20.3439  |      nan       |   69.4935   | 778.2989  |
|       eca_botnext26ts_256       | 64  | 2.5909  |  6.1519   |      nan       |   49.7554   | 723.0911  |
|          ghostnet_100           | 128 | 9.0362  |  15.9147  |      nan       |   66.3032   | 685.6784  |
|            tinynet_a            | 128 |  7.425  |  13.1769  |      nan       |   67.229    | 660.1354  |
|            fbnetv3_b            | 128 | 12.7368 |  20.2478  |      nan       |   86.4999   | 650.6102  |
|           fbnetc_100            | 128 | 5.4761  |  10.6475  |      nan       |   49.3807   | 636.7924  |
|        twins_pcpvt_base         | 32  | 25.3344 |  36.9167  |      nan       |   68.5287   | 623.4564  |
|           resnest101e           | 32  | 26.2018 |  40.9862  |      nan       |  100.5019   | 610.1099  |
|         coat_lite_mini          | 128 | 3.0107  |  7.0432   |      nan       |   16.4276   | 607.5914  |
|        res2net101_26w_4s        | 64  | 25.6881 |  41.7492  |      nan       |  105.5198   |  529.815  |
|           res2next50            |  2  | 7.2984  |  14.4734  |      nan       |   48.4801   | 510.8523  |
|             dla102              | 64  | 10.5521 |  19.1407  |      nan       |   71.9313   | 507.7483  |
|        sebotnet33ts_256         | 64  | 3.8312  |  8.4416   |      nan       |   53.5113   | 491.4471  |
|           tf_mixnet_l           | 64  |  13.42  |  20.5125  |      nan       |   70.1229   | 489.7827  |
|          cspdarknet53           | 64  | 6.0697  |  11.541   |      nan       |   51.9091   | 486.8412  |
|           mnasnet_100           | 128 | 4.1071  |  7.8077   |      nan       |   40.3807   | 437.7587  |
|       tf_efficientnet_b0        | 128 | 5.6858  |  10.6249  |      nan       |   65.6932   | 426.6445  |
|        eca_halonext26ts         | 64  | 2.5793  |  6.4106   |      nan       |   51.7275   | 422.1906  |
|           regnety_002           | 128 | 4.7761  |  9.4998   |      nan       |   50.0005   | 380.0622  |
|        ese_vovnet19b_dw         | 128 | 1.9265  |  4.0512   |      nan       |   31.7936   | 376.7527  |
|          convnext_base          | 32  | 11.4469 |  15.8597  |      nan       |   30.6712   | 366.3008  |
|         mobilenetv2_100         | 128 | 3.9971  |   7.732   |      nan       |   40.4497   | 363.7838  |
|          spnasnet_100           | 128 | 5.3407  |  10.1369  |      nan       |   47.4009   | 351.9616  |
|      xcit_large_24_p8_224       |  5  | 37.1866 |  52.5417  |      nan       |     nan     | 332.3336  |
|          jx_nest_base           | 32  | 9.9406  |  17.229   |      nan       |   66.5254   | 311.8953  |
|      mobilenetv3_large_100      | 128 | 4.3523  |  8.1262   |      nan       |   67.3167   | 311.1143  |
|         visformer_small         | 128 | 2.3158  |   5.403   |      nan       |   25.7883   | 310.6265  |
|          cait_m36_384           |  2  | 47.2186 |  64.0945  |      nan       |   90.7984   | 298.0052  |
|         crossvit_9_240          | 64  | 7.4081  |  13.6019  |      nan       |   32.2106   | 266.0203  |
|           selecsls42b           | 128 | 2.3164  |  5.4867   |      nan       |   42.0583   | 257.8308  |
|            gernet_l             | 128 | 4.8222  |  9.2556   |      nan       |   39.347    | 251.3237  |
|            lcnet_050            | 128 | 1.9314  |  4.1819   |      nan       |   32.1152   |  232.143  |
|           volo_d1_224           | 64  | 6.5276  |  12.7236  |      nan       |   32.8592   |  182.781  |
|           convit_base           | 32  | 3.8981  |  8.8328   |      nan       |   21.0897   | 177.8059  |
|          gmlp_s16_224           | 64  | 9.0879  |  14.1942  |      nan       |   21.4561   | 145.7858  |
|        tnt_s_patch16_224        | 64  | 12.1016 |  20.3575  |      nan       |   34.7907   | 143.4652  |
|          gmixer_24_224          | 64  | 8.4244  |  14.0469  |      nan       |   23.4351   | 135.0839  |
|            repvgg_a2            | 128 | 4.7534  |   9.049   |      nan       |   47.3708   |  128.289  |
|            nfnet_l0             | 64  | 5.8266  |  11.3992  |      nan       |   31.553    | 103.5121  |
|          resmlp_12_224          | 128 | 2.6977  |  5.0748   |      nan       |     nan     | 101.4346  |
|          mixer_b16_224          | 64  | 2.8858  |  5.2583   |      nan       |   13.4757   |  97.4849  |
| deit_base_distilled_patch16_224 | 64  | 3.0426  |  6.3566   |      nan       |   13.0728   |  78.6364  |
|      beit_base_patch16_224      | 64  | 4.4735  |  8.5964   |      nan       |   18.2085   |  75.1216  |
|      vit_base_patch16_224       | 64  | 2.8552  |  6.5407   |      nan       |   11.5077   |  59.9998  |
|          pnasnet5large          | 16  | 60.8211 |  79.9493  |      nan       |  183.1858   |    nan    |
|          inception_v3           | 128 | 8.3458  |  15.9807  |      nan       |   75.3239   |    nan    |
|        adv_inception_v3         | 128 | 8.5007  |  15.7832  |      nan       |   75.0367   |    nan    |
|       gluon_inception_v3        | 128 | 8.1377  |  16.0286  |      nan       |   74.6521   |    nan    |
|  swin_base_patch4_window7_224   | 64  | 11.8907 |  22.2574  |      nan       |   68.2608   |    nan    |
|        gluon_xception65         | 32  | 14.9179 |  24.5631  |      nan       |   55.7975   |    nan    |
|     swsl_resnext101_32x16d      | 32  | 10.1223 |  18.5382  |      nan       |   49.2201   |    nan    |
|          botnet26t_256          | 128 |  2.287  |   5.453   |      nan       |   42.0242   |    nan    |
|           dm_nfnet_f0           | 128 | 6.4591  |  11.8769  |      nan       |   34.8338   |    nan    |
|         poolformer_m36          | 64  | 13.0015 |  19.6132  |      nan       |   34.8218   |    nan    |
|        convmixer_768_32         | 32  | 6.7607  |  11.8459  |      nan       |   19.5188   |    nan    |
|            pit_b_224            | 64  | 3.6984  |  7.7193   |      nan       |   15.3124   |    nan    |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
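
To see how much longer a backend takes to compile than the `eager` column, the latencies can be turned into ratios. This is a minimal sketch under the assumption that `nan` means no measurement was recorded; `slowdown_vs_eager` is a hypothetical helper, and the values are copied from three rows of the table above.

```python
import math

# (eager, inductor) compilation latency in seconds, copied from the table above.
LATENCY = {
    "hrnet_w18": (99.0925, 1305.5451),
    "vit_base_patch16_224": (2.8552, 59.9998),
    "pnasnet5large": (60.8211, float("nan")),
}

def slowdown_vs_eager(latency):
    """Return {model: inductor / eager}, dropping rows with a missing value."""
    out = {}
    for name, (eager, inductor) in latency.items():
        if math.isnan(eager) or math.isnan(inductor) or eager <= 0:
            continue  # assumed: nan means the backend did not finish compiling
        out[name] = inductor / eager
    return out

ratios = slowdown_vs_eager(LATENCY)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.1f}x the eager compilation latency")
```

Sorting by ratio rather than by absolute seconds surfaces small models whose compile time grows disproportionately.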

Peak Memory Compression Ratio

+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|          gmixer_24_224          | 64  | 0.9992 |  0.9684   |      nan       |   0.9825    |  1.3808  |
|            nfnet_l0             | 64  | 1.0008 |  0.8298   |      nan       |    0.813    |  1.2558  |
|            tinynet_a            | 128 |  1.0   |  0.7831   |      nan       |   0.7845    |  1.1735  |
|        eca_halonext26ts         | 64  |  1.0   |  0.7717   |      nan       |   0.7731    |  1.1316  |
|           rexnet_100            | 128 | 0.9992 |  0.7879   |      nan       |    0.871    |  1.1072  |
|           convit_base           | 32  | 1.0001 |  0.8879   |      nan       |   0.9506    |  1.068   |
|         mobilenetv2_100         | 128 | 0.9998 |  0.7664   |      nan       |   0.7679    |  1.0051  |
|           mobilevit_s           | 32  | 0.9999 |  0.7692   |      nan       |   0.7431    |  1.0012  |
|             dla102              | 64  | 0.9881 |  0.9181   |      nan       |   0.9541    |  1.001   |
|       eca_botnext26ts_256       | 64  |  1.0   |  0.7705   |      nan       |   0.7679    |  0.9703  |
|           tf_mixnet_l           | 64  | 1.0001 |   0.861   |      nan       |   0.8605    |  0.9698  |
|          cait_m36_384           |  2  | 1.0001 |  0.9024   |      nan       |   0.9202    |  0.9451  |
|       tf_efficientnet_b0        | 128 | 0.9998 |  0.7727   |      nan       |   0.8426    |  0.9413  |
|          mixer_b16_224          | 64  | 0.9956 |  0.9615   |      nan       |   0.8644    |  0.9357  |
|      beit_base_patch16_224      | 64  |  1.0   |  0.9575   |      nan       |   0.8606    |  0.9272  |
|          gmlp_s16_224           | 64  |  1.0   |  0.9766   |      nan       |    0.966    |  0.9267  |
|      vit_base_patch16_224       | 64  | 0.9963 |  0.9469   |      nan       |   0.8229    |  0.915   |
|        tnt_s_patch16_224        | 64  | 1.0001 |  0.9752   |      nan       |   0.8518    |  0.9131  |
|           volo_d1_224           | 64  | 0.9999 |  0.9247   |      nan       |   0.7472    |  0.9124  |
| deit_base_distilled_patch16_224 | 64  | 0.9964 |  0.9476   |      nan       |   0.8242    |  0.9095  |
|          spnasnet_100           | 128 | 1.0005 |  0.9207   |      nan       |   0.8496    |  0.9024  |
|           selecsls42b           | 128 | 0.9883 |  0.8982   |      nan       |   0.9039    |   0.9    |
|            mixnet_l             | 64  | 0.9995 |  0.8486   |      nan       |   0.7938    |  0.8993  |
|      mobilenetv3_large_100      | 128 | 1.0002 |  0.8686   |      nan       |   0.8819    |  0.8982  |
|      xcit_large_24_p8_224       |  5  | 0.9999 |  0.9206   |      nan       |     nan     |  0.8952  |
|           resnest101e           | 32  |  1.0   |  0.9458   |      nan       |   0.9449    |  0.8922  |
|          ghostnet_100           | 128 | 0.9998 |  0.8872   |      nan       |    0.947    |  0.8888  |
|         visformer_small         | 128 | 0.9943 |  0.9442   |      nan       |   0.9475    |  0.8883  |
|            fbnetv3_b            | 128 | 0.9995 |  0.7866   |      nan       |   0.7861    |  0.8837  |
|             dpn107              | 32  | 0.9997 |  0.9285   |      nan       |   0.8949    |  0.8762  |
|          convnext_base          | 32  | 1.0001 |  0.9077   |      nan       |   0.7678    |  0.8761  |
|        twins_pcpvt_base         | 32  | 1.0002 |  0.9127   |      nan       |   0.8351    |  0.8723  |
|          cspdarknet53           | 64  |  1.0   |  0.8562   |      nan       |   0.8797    |  0.8624  |
|          jx_nest_base           | 32  | 1.0017 |   0.898   |      nan       |   0.7112    |  0.8574  |
|        ese_vovnet19b_dw         | 128 | 0.9999 |  0.8938   |      nan       |   0.9369    |  0.8467  |
|        sebotnet33ts_256         | 64  |  1.0   |  0.7109   |      nan       |   0.6852    |  0.841   |
|          resmlp_12_224          | 128 | 0.9893 |  0.9525   |      nan       |     nan     |  0.8169  |
|        res2net101_26w_4s        | 64  | 1.0001 |  0.9307   |      nan       |   0.8959    |  0.8168  |
|         crossvit_9_240          | 64  | 1.0001 |  0.8721   |      nan       |    0.729    |  0.8108  |
|           mnasnet_100           | 128 | 1.0003 |  0.9126   |      nan       |   0.8368    |  0.7984  |
|         coat_lite_mini          | 128 | 1.0049 |  0.8826   |      nan       |   0.7873    |   0.79   |
|            lcnet_050            | 128 | 1.0005 |  0.7721   |      nan       |   0.7722    |  0.7579  |
|           regnety_002           | 128 | 0.9981 |   0.829   |      nan       |   0.7759    |  0.7465  |
|            gernet_l             | 128 |  1.0   |  0.7965   |      nan       |   0.8012    |  0.727   |
|           fbnetc_100            | 128 | 0.9998 |  0.8597   |      nan       |   0.7507    |  0.7246  |
|            hrnet_w18            |  2  | 0.9986 |  0.8792   |      nan       |   0.8869    |  0.6089  |
|           res2next50            |  2  |  1.0   |  0.8353   |      nan       |   0.8404    |  0.5946  |
|        res2net50_14w_8s         |  2  |  1.0   |  0.8387   |      nan       |   0.8474    |  0.5879  |
|            repvgg_a2            | 128 | 1.0003 |  0.8145   |      nan       |   0.6633    |  0.536   |
|          pnasnet5large          | 16  | 1.069  |   1.011   |      nan       |   1.2062    |   nan    |
|        convmixer_768_32         | 32  |  1.0   |  0.9868   |      nan       |   0.9807    |   nan    |
|           dm_nfnet_f0           | 128 | 0.9393 |   0.897   |      nan       |   0.9515    |   nan    |
|         poolformer_m36          | 64  | 1.0003 |  0.9533   |      nan       |   0.9368    |   nan    |
|        gluon_xception65         | 32  | 0.9999 |  0.9384   |      nan       |   0.9001    |   nan    |
|        adv_inception_v3         | 128 | 1.0002 |  0.8694   |      nan       |    0.88     |   nan    |
|       gluon_inception_v3        | 128 | 1.0002 |  0.8694   |      nan       |    0.88     |   nan    |
|          inception_v3           | 128 | 1.0002 |  0.8694   |      nan       |    0.88     |   nan    |
|     swsl_resnext101_32x16d      | 32  | 1.0003 |  0.8983   |      nan       |   0.8684    |   nan    |
|  swin_base_patch4_window7_224   | 64  | 0.9999 |  0.9309   |      nan       |    0.83     |   nan    |
|          botnet26t_256          | 128 |  1.0   |  0.8494   |      nan       |   0.7497    |   nan    |
|            pit_b_224            | 64  | 0.9992 |  0.7962   |      nan       |   0.6417    |   nan    |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+

Performance graphs

bench_logs/timm_models_float32.png :

bench_logs/huggingface_float32.png :

bench_logs/torchbench_float32.png :

@anijain2305
Contributor Author

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. All experiments run on A100 GPUs, and each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them against native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint compression ratio.

Caveats

  1. Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
  2. The experiments do not cover dynamic shapes.
  3. The experimental setup does not include an optimizer.

When measuring performance, compilation latency and memory footprint reduction, we exclude the models that fail accuracy checks.
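
As a rough illustration of the normalization described in the summary, a per-model speedup can be computed by dividing the native PyTorch (eager) latency by the backend latency. This is a minimal sketch, not the dashboard's actual harness: the function name `speedup` and the use of the median over repeated timings are assumptions made for illustration.

```python
from statistics import median


def speedup(eager_times, backend_times):
    """Speedup of a backend relative to native PyTorch.

    Values above 1.0 mean the backend is faster than eager.
    Using the median of repeated timings is an assumption of this sketch.
    """
    return median(eager_times) / median(backend_times)


# Hypothetical timings in seconds, not taken from the dashboard.
print(speedup([0.20, 0.21, 0.19], [0.10, 0.11, 0.10]))
```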

Passrate

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      | 98%, 52/53 | 98%, 42/43  | 100%, 61/61 |
|   aot_eager    | 98%, 52/53 | 98%, 42/43  | 90%, 55/61  |
| aot_cudagraphs | 28%, 15/53 |  2%, 1/43   |  8%, 5/61   |
|  aot_nvfuser   | 60%, 32/53 |  0%, 0/43   | 75%, 46/61  |
|    inductor    | 83%, 44/53 | 86%, 37/43  | 90%, 55/61  |
+----------------+------------+-------------+-------------+
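
Each pass rate cell has the form "percentage, passed/total". A minimal sketch of how such a cell could be produced from per-model accuracy statuses is shown below. It assumes the status strings that appear in the Accuracy tables (`pass`, `pass_due_to_skip`, `fail_to_run`, `fail_accuracy`) and assumes that any status starting with `pass` counts as passing; the helper name `pass_rate` is hypothetical.

```python
def pass_rate(statuses):
    """Format a pass rate cell such as '98%, 52/53'.

    Assumption: any status beginning with 'pass' counts as a pass.
    """
    total = len(statuses)
    passed = sum(1 for s in statuses if s.startswith("pass"))
    pct = round(100 * passed / total) if total else 0
    return f"{pct}%, {passed}/{total}"


# Hypothetical status list: 52 passing models and one failure.
print(pass_rate(["pass"] * 52 + ["fail_to_run"]))
```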

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.00x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.09x    |    1.00x    |    1.00x    |
|  aot_nvfuser   |   1.16x    |    0.0x     |    1.20x    |
|    inductor    |   1.70x    |    2.17x    |    1.30x    |
+----------------+------------+-------------+-------------+
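
The geometric mean over per-model speedups can be sketched as follows. Models that fail accuracy checks are dropped first, as stated above. Treating a speedup of 0.0 as a failed run, and returning 0.0 when no model remains, are assumptions of this sketch; the function and argument names are hypothetical.

```python
import math


def geomean_speedup(speedups, passed):
    """Geometric mean of speedups over models that passed accuracy.

    speedups: dict mapping model name -> speedup over eager
    passed:   dict mapping model name -> bool (accuracy check result)
    Assumption: non-positive speedups mark failed runs and are skipped.
    """
    vals = [s for name, s in speedups.items() if passed.get(name, False) and s > 0]
    if not vals:
        return 0.0
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


# Hypothetical inputs: model "c" fails accuracy and is excluded.
print(geomean_speedup({"a": 2.0, "b": 8.0, "c": 0.5},
                      {"a": True, "b": True, "c": False}))
```

The geometric mean is used rather than the arithmetic mean because speedups are ratios: a 2x gain on one model and a 0.5x loss on another average out to 1.0x.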

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    6.19    |    14.88    |    11.64    |
|   aot_eager    |   12.45    |    25.75    |    19.94    |
| aot_cudagraphs |   13.09    |    92.75    |    51.56    |
|  aot_nvfuser   |   29.54    |     0.0     |    80.08    |
|    inductor    |   271.08   |   116.86    |   450.74    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.85x    |    0.86x    |    0.88x    |
| aot_cudagraphs |   0.43x    |    0.38x    |    0.20x    |
|  aot_nvfuser   |   0.83x    |    0.0x     |    0.85x    |
|    inductor    |   0.78x    |    0.82x    |    0.89x    |
+----------------+------------+-------------+-------------+
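
For reading the table above, a small sketch of the ratio may help. It assumes the compression ratio is the eager peak memory divided by the backend's peak memory, which is consistent with the "higher is better" note but is not confirmed by this post; the function name is hypothetical.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Peak memory compression ratio relative to eager.

    Assumption: ratio = eager peak / backend peak, so values below 1.0
    mean the backend used more peak memory than eager.
    """
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical peaks in bytes, not taken from the dashboard.
print(compression_ratio(8 * 1024**3, 10 * 1024**3))
```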

torchbench suite with amp precision

Performance speedup

+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|               name                |  bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|       functorch_dp_cifar10        |  64  | 1.0027 |  0.9251   |      0.0       |   1.1901    |  4.8999  |
|            densenet121            |  4   | 1.0013 |  0.9144   |      0.0       |   1.3911    |  4.7967  |
|         timm_efficientdet         |  1   | 0.9864 |   0.789   |      0.0       |     0.0     |  4.1288  |
|           BERT_pytorch            |  16  | 1.0115 |  0.8389   |      0.0       |     0.0     |  3.1411  |
|      timm_vision_transformer      |  8   | 1.0012 |  0.8564   |      0.0       |   1.3359    |  3.0906  |
|                drq                |  1   | 1.0045 |  0.8048   |      0.0       |   1.0807    |  2.8848  |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.9951 |  0.9111   |     1.3023     |   1.2173    |  2.8184  |
|             resnet18              |  16  | 1.0009 |  0.9878   |      0.0       |   1.3374    |  2.7412  |
|               dcgan               |  32  | 0.9775 |  0.9129   |     1.101      |   0.7342    |  2.5263  |
|           squeezenet1_1           |  32  | 0.9953 |  0.9584   |     1.4044     |   1.1906    |  2.5217  |
|             hf_Albert             |  8   | 1.0009 |  0.9534   |      0.0       |     0.0     |  2.397   |
|              hf_Bert              |  4   | 1.0348 |  0.8618   |      0.0       |     0.0     |  2.2545  |
|               hf_T5               |  8   | 0.9992 |  0.9393   |      0.0       |     0.0     |  2.1467  |
|          resnext50_32x4d          |  8   | 1.0002 |   0.951   |      0.0       |   1.3302    |  2.1414  |
|            hf_T5_large            |  2   | 1.0183 |  0.8592   |      0.0       |     0.0     |  2.0785  |
|           lennard_jones           | 1000 | 0.9794 |  0.7615   |     1.2817     |   1.0468    |  2.0147  |
|        mobilenet_v3_large         |  32  | 1.0023 |  1.0098   |      0.0       |   1.4107    |  2.0136  |
|          pytorch_struct           | 200  | 0.9866 |   0.746   |     1.1521     |    1.011    |  2.0082  |
|              hf_GPT2              |  4   | 1.014  |  0.9867   |      0.0       |     0.0     |  1.8579  |
|          LearningToPaint          |  96  | 1.0025 |  1.0068   |      0.0       |    1.355    |  1.8566  |
|            mnasnet1_0             |  32  | 0.9949 |  1.0116   |     0.8977     |   1.4086    |  1.8302  |
|              hf_Bart              |  4   | 1.0161 |  0.8395   |      0.0       |     0.0     |  1.7504  |
|           fastNLP_Bert            |  6   | 0.9978 |  0.8872   |      0.0       |     0.0     |  1.6528  |
|        speech_transformer         |  32  | 1.0054 |  0.8358   |      0.0       |     0.0     |  1.6385  |
| attention_is_all_you_need_pytorch | 256  | 1.0061 |  0.8945   |      0.0       |     0.0     |  1.5148  |
|         timm_efficientnet         |  32  | 0.9619 |  0.8176   |      0.0       |   1.1837    |  1.4918  |
|           hf_DistilBert           |  8   | 1.0156 |   0.969   |      0.0       |     0.0     |  1.478   |
|         soft_actor_critic         | 256  | 1.0223 |  0.7463   |     1.261      |   1.0634    |  1.4398  |
|           pytorch_unet            |  1   | 0.9996 |   0.993   |      0.0       |   1.1553    |  1.3534  |
|          pytorch_stargan          |  16  | 0.9983 |  1.0034   |     0.8258     |   1.0964    |  1.343   |
|            timm_nfnet             | 128  | 0.9994 |  0.9988   |      0.0       |   1.1733    |  1.3237  |
|        shufflenet_v2_x1_0         | 128  | 0.9995 |  1.0166   |      0.0       |   1.3486    |  1.3069  |
|            Super_SloMo            |  6   | 0.9998 |  0.9956   |      0.0       |     0.0     |  1.2884  |
|               vgg16               |  64  | 0.9999 |  0.9974   |     0.7982     |   0.9961    |  1.2713  |
|        Background_Matting         |  4   | 0.9996 |  1.0182   |      0.0       |   1.1153    |  1.2157  |
|              alexnet              | 128  | 0.9993 |  0.9964   |     0.788      |   1.0031    |  1.2097  |
|   timm_vision_transformer_large   |  8   | 0.9991 |  0.9893   |      0.0       |   0.9929    |  1.1578  |
|            hf_Reformer            |  4   | 0.9958 |  0.9992   |     0.9196     |     0.0     |  1.1578  |
|           timm_resnest            |  32  | 1.0025 |  1.0206   |      0.0       |   1.3168    |  1.1577  |
|            hf_BigBird             |  2   | 0.9911 |  0.9187   |      0.0       |     0.0     |  1.1435  |
|            timm_vovnet            |  32  | 0.9224 |  0.8875   |      0.0       |   1.1275    |  1.1074  |
|            tts_angular            |  64  | 1.0135 |  0.9582   |     1.0002     |   0.9789    |  1.0026  |
|              demucs               |  4   | 1.0019 |  0.9992   |     0.9995     |   0.9981    |  0.9998  |
|      nvidia_deeprecommender       | 256  | 0.9989 |  0.9958   |     0.6966     |   0.9783    |  0.9901  |
|             resnet50              |  32  | 1.0016 |  1.0097   |      0.0       |   1.3632    |  0.9717  |
|               moco                |  32  | 0.9956 |    0.0    |      0.0       |     0.0     |  0.9496  |
|           mobilenet_v2            |  96  | 0.9989 |  0.9866   |      0.0       |   0.9244    |  0.8705  |
|            timm_regnet            |  32  | 0.9775 |  0.9387   |      0.0       |   1.1858    |  0.8539  |
|              yolov3               |  16  | 0.9991 |   0.988   |      0.0       |   0.9136    |   0.0    |
|           hf_Longformer           |  2   | 0.9636 |   0.877   |     0.8882     |     0.0     |   0.0    |
|               dlrm                | 2048 |  0.0   |   1.173   |      0.0       |     0.0     |   0.0    |
|           hf_GPT2_large           |  4   | 0.9995 |  0.9901   |      0.0       |     0.0     |   0.0    |
|             tacotron2             |  64  |  0.98  |   0.762   |      0.0       |     0.0     |   0.0    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+

Accuracy

+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|               name                | bs  |      eager       |    aot_eager     |  aot_cudagraphs  |   aot_nvfuser    |     inductor     |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|           hf_GPT2_large           |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|            hf_T5_large            |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|   timm_vision_transformer_large   |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|              alexnet              |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               dcgan               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|              demucs               |  4  |       pass       |       pass       |       pass       |       pass       |       pass       |
|           lennard_jones           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|            mnasnet1_0             |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|      nvidia_deeprecommender       |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|   pytorch_CycleGAN_and_pix2pix    |  1  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_stargan          | 16  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_struct           | 200 |       pass       |       pass       |       pass       |       pass       |       pass       |
|         soft_actor_critic         | 256 |       pass       |       pass       |       pass       |       pass       |       pass       |
|           squeezenet1_1           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               vgg16               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|        Background_Matting         |  4  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|          LearningToPaint          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            densenet121            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|                drq                |  1  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|       functorch_dp_cifar10        |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           mobilenet_v2            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           pytorch_unet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet18              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet50              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|          resnext50_32x4d          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|        shufflenet_v2_x1_0         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|         timm_efficientnet         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_nfnet             |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_regnet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           timm_resnest            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|      timm_vision_transformer      |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_vovnet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            hf_Reformer            |  2  |       pass       |       pass       |       pass       |   fail_to_run    |       pass       |
|           BERT_pytorch            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            Super_SloMo            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
| attention_is_all_you_need_pytorch |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|               dlrm                |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           fastNLP_Bert            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|             hf_Albert             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_Bart              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_Bert              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            hf_BigBird             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           hf_DistilBert           |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_GPT2              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|               hf_T5               |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            hf_T5_base             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|        speech_transformer         |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           hf_Longformer           |  2  |       pass       |       pass       |       pass       |   fail_to_run    |   fail_to_run    |
|             tacotron2             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|         timm_efficientdet         |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|          vision_maskrcnn          |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|               moco                |  2  |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|        mobilenet_v3_large         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |  fail_accuracy   |
|            tts_angular            |  2  |       pass       |       pass       |       pass       |       pass       |      0.0000      |
|              yolov3               |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |      0.0000      |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+

Compilation latency (sec)

+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
|               name                |  bs  |  eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
|         timm_efficientdet         |  1   | 53.0184 |  79.3008  |      nan       |     nan     | 1803.9091 |
|            hf_T5_large            |  2   | 36.9772 |  75.6557  |      nan       |     nan     | 1747.5354 |
|            densenet121            |  4   | 13.5869 |  29.3271  |      nan       |   139.711   | 1688.4482 |
|        mobilenet_v3_large         |  32  | 3.7848  |  9.2923   |      nan       |   75.4499   | 896.1598  |
|            mnasnet1_0             |  32  | 3.4537  |  8.6408   |    43.7596     |   46.4028   | 854.0232  |
|               moco                |  32  |  11.75  |    nan    |      nan       |     nan     | 725.5689  |
|           mobilenet_v2            |  96  | 3.3075  |  8.4906   |      nan       |   43.1604   | 646.6585  |
|          resnext50_32x4d          |  8   | 3.6179  |  9.2283   |      nan       |   39.0386   | 614.9828  |
|         timm_efficientnet         |  32  | 5.9682  |  12.6852  |      nan       |   73.4068   | 568.4268  |
|        shufflenet_v2_x1_0         | 128  | 3.7705  |  9.7518   |      nan       |   41.3993   | 446.0734  |
|            timm_nfnet             | 128  | 6.8229  |  13.575   |      nan       |   42.2533   | 420.5468  |
|           squeezenet1_1           |  32  |  0.676  |  1.7982   |     8.4293     |   6.8649    | 371.0668  |
|           timm_resnest            |  32  | 1.4485  |  4.3952   |      nan       |   43.3534   | 364.8723  |
|            timm_regnet            |  32  | 8.4026  |  17.2587  |      nan       |   66.4884   | 343.2572  |
| attention_is_all_you_need_pytorch | 256  | 4.3975  |  13.0252  |      nan       |     nan     | 277.6015  |
|            timm_vovnet            |  32  | 3.0983  |  7.3442   |      nan       |   32.2979   | 246.8202  |
|        speech_transformer         |  32  | 7.5464  |  17.1829  |      nan       |     nan     | 246.5735  |
|   timm_vision_transformer_large   |  8   | 23.1783 |  40.2448  |      nan       |   58.887    | 209.5997  |
|             resnet18              |  16  |  1.03   |  3.1392   |      nan       |   23.6353   | 207.0568  |
|       functorch_dp_cifar10        |  64  | 0.8539  |  2.5937   |      nan       |   6.4768    | 198.2407  |
|      timm_vision_transformer      |  8   |   3.2   |  8.1832   |      nan       |   16.3986   | 197.8043  |
|          LearningToPaint          |  96  | 1.0574  |  3.1713   |      nan       |   31.0316   | 196.8972  |
|           BERT_pytorch            |  16  | 5.0826  |  13.8418  |      nan       |     nan     |  177.679  |
|               hf_T5               |  8   | 3.9598  |  12.752   |      nan       |     nan     | 163.3629  |
|        Background_Matting         |  4   | 4.0825  |  9.3277   |      nan       |   45.7685   | 157.1204  |
|             resnet50              |  32  | 3.4998  |  9.0749   |      nan       |   43.7996   | 150.7682  |
|              hf_Bart              |  4   | 7.5111  |  17.2897  |      nan       |     nan     | 149.8316  |
|           fastNLP_Bert            |  6   | 5.3619  |  12.7524  |      nan       |     nan     | 148.7053  |
|              hf_GPT2              |  4   | 3.5663  |  10.0245  |      nan       |     nan     | 134.6332  |
|          pytorch_stargan          |  16  |  0.856  |  3.2566   |    11.5768     |   7.5483    | 130.0983  |
|          pytorch_struct           | 200  | 0.4439  |  1.2864   |     1.8954     |   5.4421    | 123.7179  |
|            Super_SloMo            |  6   | 2.3215  |  7.1108   |      nan       |     nan     | 114.2637  |
|             hf_Albert             |  8   | 1.5003  |  8.7612   |      nan       |     nan     |  90.632   |
|            hf_Reformer            |  4   |  3.125  |  5.8245   |    14.0523     |     nan     |  81.3988  |
|              hf_Bert              |  4   | 5.2086  |  12.6031  |      nan       |     nan     |  78.8477  |
|            hf_BigBird             |  2   | 12.0533 |  20.4705  |      nan       |     nan     |  70.7791  |
|           pytorch_unet            |  1   | 1.1442  |  3.4413   |      nan       |   26.7315   |  67.999   |
|           hf_DistilBert           |  8   | 1.8271  |  5.3273   |      nan       |     nan     |  55.4364  |
|   pytorch_CycleGAN_and_pix2pix    |  1   |  0.841  |   3.214   |    12.2729     |   5.1571    |  38.082   |
|               vgg16               |  64  | 0.3688  |  1.1116   |     4.3001     |   3.7105    |  37.4739  |
|              alexnet              | 128  | 0.2897  |  0.6979   |     2.0104     |   3.2319    |  27.3354  |
|                drq                |  1   | 0.2848  |   0.757   |      nan       |   4.4794    |  24.9175  |
|               dcgan               |  32  | 0.2625  |  0.6388   |     1.9613     |    4.319    |  19.5418  |
|      nvidia_deeprecommender       | 256  | 0.2826  |  0.6825   |     1.0449     |   3.0789    |  15.0583  |
|         soft_actor_critic         | 256  | 0.2699  |  0.4887   |     0.799      |   2.1142    |  14.7981  |
|           lennard_jones           | 1000 | 0.2467  |  0.5134   |     0.7056     |   1.5718    |  7.8075   |
|            tts_angular            |  64  | 0.3394  |   0.392   |     0.5831     |   1.1356    |  4.0722   |
|              demucs               |  4   | 0.9187  |  0.9051   |     0.889      |   0.9058    |  0.8242   |
|              yolov3               |  16  | 7.5657  |  15.5034  |      nan       |   46.3505   |    nan    |
|           hf_Longformer           |  2   | 11.7639 |  21.6948  |     92.072     |     nan     |    nan    |
|           hf_GPT2_large           |  4   | 21.5811 |  42.1326  |      nan       |     nan     |    nan    |
|             tacotron2             |  64  | 14.4366 |  30.0122  |      nan       |     nan     |    nan    |
|               dlrm                | 2048 |   nan   |  1.1963   |      nan       |     nan     |    nan    |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+

Peak Memory Compression Ratio

+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|               name                |  bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|             hf_Albert             |  8   | 0.9814 |   0.936   |      nan       |     nan     |  1.1576  |
|            Super_SloMo            |  6   | 1.0024 |  0.9697   |      nan       |     nan     |  1.1385  |
|            timm_nfnet             | 128  | 0.9761 |  0.9043   |      nan       |   0.9504    |  1.0243  |
|            tts_angular            |  64  | 1.0015 |  1.0015   |     0.9866     |   1.0015    |  0.9908  |
| attention_is_all_you_need_pytorch | 256  | 0.9976 |  0.9403   |      nan       |     nan     |  0.9875  |
|              demucs               |  4   | 0.987  |   0.987   |     0.987      |    0.987    |  0.987   |
|         timm_efficientdet         |  1   | 1.0316 |  0.8425   |      nan       |     nan     |  0.9858  |
|           BERT_pytorch            |  16  | 0.9998 |  0.8818   |      nan       |     nan     |  0.9728  |
|         timm_efficientnet         |  32  | 0.9982 |  0.7762   |      nan       |   0.7936    |  0.9689  |
|              hf_GPT2              |  4   | 0.971  |  0.8627   |      nan       |     nan     |  0.9645  |
|        Background_Matting         |  4   | 1.0201 |  0.9679   |      nan       |    0.987    |  0.9244  |
|        speech_transformer         |  32  | 1.0015 |  0.9177   |      nan       |     nan     |  0.9066  |
|           mobilenet_v2            |  96  | 1.0001 |  0.7725   |      nan       |   0.9235    |  0.8856  |
|           pytorch_unet            |  1   | 0.9968 |  0.8677   |      nan       |   0.8518    |  0.8681  |
|           fastNLP_Bert            |  6   | 1.0013 |  0.8966   |      nan       |     nan     |  0.8661  |
|   pytorch_CycleGAN_and_pix2pix    |  1   |  1.0   |  0.8624   |     0.2638     |   0.8441    |  0.8602  |
|            hf_T5_large            |  2   | 0.8541 |  0.8541   |      nan       |     nan     |  0.8535  |
|           hf_DistilBert           |  8   | 0.9505 |  0.8806   |      nan       |     nan     |  0.8387  |
|              hf_Bert              |  4   | 0.9844 |  0.8677   |      nan       |     nan     |  0.8383  |
|            timm_regnet            |  32  | 0.9999 |  0.8483   |      nan       |    0.85     |  0.8361  |
|              hf_Bart              |  4   | 0.9099 |  0.8321   |      nan       |     nan     |  0.8151  |
|            hf_BigBird             |  2   | 0.9852 |  0.9787   |      nan       |     nan     |   0.81   |
|            timm_vovnet            |  32  | 0.9903 |  0.7754   |      nan       |   0.7817    |  0.7861  |
|               moco                |  32  | 0.9667 |    nan    |      nan       |     nan     |  0.7819  |
|        shufflenet_v2_x1_0         | 128  | 1.0002 |   0.874   |      nan       |   0.8652    |  0.7813  |
|          pytorch_stargan          |  16  | 0.9929 |  0.9799   |     0.2149     |   0.8882    |  0.7783  |
|             resnet50              |  32  | 1.0004 |  0.8678   |      nan       |   0.8041    |  0.7745  |
|               dcgan               |  32  |  1.0   |  0.7949   |     0.343      |   0.7073    |  0.7527  |
|               vgg16               |  64  | 0.9998 |  0.7378   |     0.2978     |   0.7172    |  0.7491  |
|   timm_vision_transformer_large   |  8   | 0.9987 |  0.8365   |      nan       |   0.8491    |  0.7487  |
|              alexnet              | 128  | 1.0003 |  0.8082   |     0.4354     |    0.805    |  0.7352  |
|               hf_T5               |  8   | 0.9678 |  0.9371   |      nan       |     nan     |  0.7266  |
|           timm_resnest            |  32  | 0.9868 |  0.8809   |      nan       |   0.8726    |  0.722   |
|      timm_vision_transformer      |  8   | 1.0001 |  0.8868   |      nan       |   0.8871    |  0.7151  |
|            mnasnet1_0             |  32  | 0.9994 |  0.8793   |     0.173      |   0.8217    |  0.6596  |
|           squeezenet1_1           |  32  | 0.9604 |  0.7958   |     0.2952     |   0.7589    |  0.6595  |
|        mobilenet_v3_large         |  32  | 0.999  |  0.8661   |      nan       |    0.874    |  0.6573  |
|          resnext50_32x4d          |  8   |  1.0   |  0.8591   |      nan       |    0.823    |  0.6515  |
|                drq                |  1   | 0.9125 |  0.8399   |      nan       |   0.8395    |  0.6406  |
|         soft_actor_critic         | 256  | 0.964  |  0.9151   |     0.4737     |   0.9151    |  0.6279  |
|          LearningToPaint          |  96  | 0.9252 |  0.7196   |      nan       |    0.71     |  0.605   |
|            densenet121            |  4   |  1.0   |  0.8696   |      nan       |   0.8376    |  0.5739  |
|             resnet18              |  16  | 0.9782 |  0.7852   |      nan       |   0.7268    |  0.5644  |
|           lennard_jones           | 1000 |  1.0   |  1.0002   |     0.3735     |   1.0967    |  0.564   |
|      nvidia_deeprecommender       | 256  | 0.5596 |  0.5596   |     0.5262     |   0.5596    |  0.5596  |
|       functorch_dp_cifar10        |  64  | 0.9964 |  0.8131   |      nan       |    0.846    |  0.4465  |
|          pytorch_struct           | 200  |  1.0   |  0.5081   |     0.4858     |   0.5082    |  0.4235  |
|            hf_Reformer            |  4   | 0.3764 |  0.9993   |     0.2539     |     nan     |  0.3629  |
|              yolov3               |  16  | 1.0054 |  0.8488   |      nan       |   0.8244    |   nan    |
|           hf_Longformer           |  2   | 0.9734 |   0.967   |     0.3379     |     nan     |   nan    |
|           hf_GPT2_large           |  4   | 0.9586 |  0.8649   |      nan       |     nan     |   nan    |
|               dlrm                | 2048 |  nan   |  0.7282   |      nan       |     nan     |   nan    |
|             tacotron2             |  64  | 0.9879 |  0.4059   |      nan       |     nan     |   nan    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+

huggingface suite with amp precision

Performance speedup

+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|       MT5ForConditionalGeneration       | 2  | 1.0217 |  0.8664   |      0.0       |     0.0     |  6.0266  |
|          MobileBertForMaskedLM          | 16 | 1.0165 |  0.8257   |      0.0       |     0.0     |  5.6755  |
|           ElectraForCausalLM            | 1  | 1.0352 |  0.8536   |      0.0       |     0.0     |  5.5645  |
|     MobileBertForQuestionAnswering      | 32 | 1.0175 |  0.8249   |      0.0       |     0.0     |  5.2401  |
|            YituTechConvBert             | 1  | 1.0261 |  0.8468   |      0.0       |     0.0     |  5.0492  |
|           RobertaForCausalLM            | 4  | 1.0398 |  0.8465   |      0.0       |     0.0     |  4.5969  |
|         MegatronBertForCausalLM         | 2  | 1.0374 |  0.8485   |      0.0       |     0.0     |  4.0218  |
|             OPTForCausalLM              | 4  | 1.0159 |  0.8276   |      0.0       |     0.0     |  3.9227  |
|     M2M100ForConditionalGeneration      | 2  | 1.0129 |  0.8218   |      0.0       |     0.0     |  3.6354  |
|                CamemBert                | 1  | 1.0388 |   0.859   |      0.0       |     0.0     |  3.5143  |
|     PegasusForConditionalGeneration     | 4  | 1.0118 |  0.8263   |      0.0       |     0.0     |  3.1923  |
|             XGLMForCausalLM             | 1  | 1.014  |  0.8144   |      0.0       |     0.0     |  3.1413  |
|     PLBartForConditionalGeneration      | 8  | 1.0194 |  0.8247   |      0.0       |     0.0     |  2.7305  |
|    MegatronBertForQuestionAnswering     | 8  | 1.0396 |  0.8582   |      0.0       |     0.0     |  2.7135  |
|               DistillGPT2               | 1  | 1.0314 |  0.8704   |      0.0       |     0.0     |  2.619   |
|      MBartForConditionalGeneration      | 8  | 1.0167 |  0.8336   |      0.0       |     0.0     |  2.3299  |
|      GPT2ForSequenceClassification      | 4  | 0.9989 |  0.9767   |      0.0       |     0.0     |  2.1375  |
|         Speech2Text2ForCausalLM         | 64 | 1.0086 |  0.8555   |      0.0       |     0.0     |  2.108   |
|       ElectraForQuestionAnswering       | 64 | 0.9994 |  0.9793   |      0.0       |     0.0     |  1.9642  |
|            TrOCRForCausalLM             | 8  | 1.0149 |  0.8298   |      0.0       |     0.0     |  1.8799  |
|          DistilBertForMaskedLM          | 16 | 1.0299 |  0.8516   |      0.0       |     0.0     |  1.8406  |
|           PegasusForCausalLM            | 8  | 1.0109 |   0.826   |      0.0       |     0.0     |  1.8182  |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0087 |  0.8891   |      0.0       |     0.0     |  1.7899  |
|      BartForConditionalGeneration       | 1  | 1.0133 |   0.885   |      0.0       |     0.0     |  1.7522  |
|     DistilBertForQuestionAnswering      | 32 | 1.0305 |  0.8437   |      0.0       |     0.0     |  1.7502  |
|    LayoutLMForSequenceClassification    | 16 | 0.9983 |  0.9785   |      0.0       |     0.0     |  1.7243  |
|       T5ForConditionalGeneration        | 4  | 0.9926 |  0.9361   |      0.0       |     0.0     |  1.6977  |
|       AlbertForQuestionAnswering        | 2  | 1.0007 |  0.8082   |      0.0       |     0.0     |  1.6669  |
|            AlbertForMaskedLM            | 2  | 1.0004 |  0.8087   |      0.0       |     0.0     |  1.6562  |
|                 T5Small                 | 1  | 1.0258 |  0.8963   |      0.0       |     0.0     |  1.593   |
|            XLNetLMHeadModel             | 4  | 1.0008 |  0.9632   |      0.0       |     0.0     |  1.5916  |
|           LayoutLMForMaskedLM           | 16 | 0.9981 |  0.9701   |      0.0       |     0.0     |  1.5762  |
|            PLBartForCausalLM            | 16 | 1.0127 |  0.9448   |      0.0       |     0.0     |  1.5065  |
|             BartForCausalLM             | 2  | 1.0008 |  0.9636   |      0.0       |     0.0     |  1.4564  |
|       RobertaForQuestionAnswering       | 64 | 0.9977 |  0.9494   |      0.0       |     0.0     |  1.4507  |
|        BertForQuestionAnswering         | 64 | 0.9971 |  0.9668   |      0.0       |     0.0     |  1.4373  |
|            MBartForCausalLM             | 16 | 1.0105 |  0.9317   |      0.0       |     0.0     |  1.3877  |
|             BertForMaskedLM             | 64 | 0.9972 |  0.9548   |      0.0       |     0.0     |  1.3316  |
|       BlenderbotSmallForCausalLM        | 64 | 1.0012 |  0.9233   |      0.0       |     0.0     |  1.3041  |
|       DebertaForQuestionAnswering       | 4  | 0.9317 |  0.7286   |     0.9211     |     0.0     |  1.2886  |
|                 BigBird                 | 1  | 0.9945 |  0.9116   |      0.0       |     0.0     |  1.1342  |
|           DebertaForMaskedLM            | 4  | 0.9325 |  0.7359   |     0.7806     |     0.0     |  1.1239  |
|          AllenaiLongformerBase          | 1  | 0.9529 |  0.7382   |     0.8569     |     0.0     |   0.0    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
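A per-backend summary of a speedup table like the one above can be computed with a short script. The sketch below is illustrative only, not the dashboard's own tooling: `parse_ascii_table` and `geomean_speedup` are hypothetical helper names. It assumes that `0.0` and `nan` entries mark runs that did not produce a measurement, so it leaves them out of the geometric mean; the table itself does not state this.

```python
import math


def parse_ascii_table(text):
    """Parse a '+---+' / '| a | b |' style table into a list of dicts.

    Border lines (starting with '+') are ignored; the first '|' line is
    treated as the header.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(rows, backend):
    """Geometric mean of one backend column.

    Assumption: non-positive or non-finite entries (0.0, nan) are treated
    as missing measurements and skipped.
    """
    values = []
    for row in rows:
        try:
            value = float(row[backend])
        except (KeyError, ValueError):
            continue
        if math.isfinite(value) and value > 0.0:
            values.append(value)
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Three rows copied from the huggingface amp speedup table above.
sample = """
+-----------------------------+----+--------+-----------+----------------+-------------+----------+
|            name             | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------+----+--------+-----------+----------------+-------------+----------+
| DebertaForQuestionAnswering | 4  | 0.9317 |  0.7286   |     0.9211     |     0.0     |  1.2886  |
|           BigBird           | 1  | 0.9945 |  0.9116   |      0.0       |     0.0     |  1.1342  |
|     DebertaForMaskedLM      | 4  | 0.9325 |  0.7359   |     0.7806     |     0.0     |  1.1239  |
+-----------------------------+----+--------+-----------+----------------+-------------+----------+
"""

rows = parse_ascii_table(sample)
for backend in ("eager", "aot_eager", "aot_cudagraphs", "aot_nvfuser", "inductor"):
    print(backend, geomean_speedup(rows, backend))
```

A geometric mean is used here because the entries are ratios. Whether failed runs should be skipped or counted as a penalty is a reporting choice, and it changes the summary noticeably for backends with many `0.0` entries.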

Accuracy

+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser |  inductor   |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
|            AlbertForMaskedLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       AlbertForQuestionAnswering        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             BartForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             BertForMaskedLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|        BertForQuestionAnswering         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                 BigBird                 | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       BlenderbotSmallForCausalLM        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
| BlenderbotSmallForConditionalGeneration | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                CamemBert                | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           DebertaForMaskedLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|          DistilBertForMaskedLM          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     DistilBertForQuestionAnswering      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|               DistillGPT2               | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           ElectraForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       ElectraForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|      GPT2ForSequenceClassification      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           LayoutLMForMaskedLM           | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|    LayoutLMForSequenceClassification    | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            MBartForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       MT5ForConditionalGeneration       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|         MegatronBertForCausalLM         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|    MegatronBertForQuestionAnswering     | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|          MobileBertForMaskedLM          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     MobileBertForQuestionAnswering      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             OPTForCausalLM              | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            PLBartForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           PegasusForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     PegasusForConditionalGeneration     | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           RobertaForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       RobertaForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|         Speech2Text2ForCausalLM         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       T5ForConditionalGeneration        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                 T5Small                 | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            TrOCRForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            XLNetLMHeadModel             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            YituTechConvBert             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       DebertaForQuestionAnswering       | 1  |  pass  |   pass    | fail_accuracy  | fail_to_run |    pass     |
|          AllenaiLongformerBase          | 1  |  pass  |   pass    |      pass      | fail_to_run | fail_to_run |
|      BartForConditionalGeneration       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|      MBartForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|     PLBartForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|     M2M100ForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |   0.0000    |
|             XGLMForCausalLM             | 0  | 0.0000 |  0.0000   |     0.0000     |   0.0000    |   0.0000    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
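To see at a glance how many models pass for a given backend, the status cells of an accuracy table like the one above can be tallied. This is a minimal sketch with a hypothetical helper name, `tally_statuses`. Cells that are not status strings (for example `0.0000`) are counted under their literal text, since the table does not say what they mean.

```python
from collections import Counter


def tally_statuses(table_text, backend):
    """Count how often each cell value occurs in one backend column."""
    lines = [ln for ln in table_text.splitlines() if ln.strip().startswith("|")]
    header = [c.strip() for c in lines[0].strip().strip("|").split("|")]
    col = header.index(backend)  # raises ValueError for an unknown backend
    counts = Counter()
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.strip().strip("|").split("|")]
        counts[cells[col]] += 1
    return counts


# Four rows copied from the huggingface amp accuracy table above.
sample_accuracy = """
+--------------------------------+----+--------+-----------+----------------+-------------+-------------+
|              name              | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser |  inductor   |
+--------------------------------+----+--------+-----------+----------------+-------------+-------------+
|  DebertaForQuestionAnswering   | 1  |  pass  |   pass    | fail_accuracy  | fail_to_run |    pass     |
|     AllenaiLongformerBase      | 1  |  pass  |   pass    |      pass      | fail_to_run | fail_to_run |
|  BartForConditionalGeneration  | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
| M2M100ForConditionalGeneration | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |   0.0000    |
+--------------------------------+----+--------+-----------+----------------+-------------+-------------+
"""

for backend in ("eager", "aot_cudagraphs", "aot_nvfuser", "inductor"):
    print(backend, dict(tally_statuses(sample_accuracy, backend)))
```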

Compilation latency (sec)

+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
|                  name                   | bs |  eager   | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
|            XLNetLMHeadModel             | 4  | 18.4058  |  40.7914  |      nan       |     nan     | 324.9091 |
|          MobileBertForMaskedLM          | 16 | 135.1179 | 174.0517  |      nan       |     nan     | 310.7379 |
|     MobileBertForQuestionAnswering      | 32 | 132.1921 | 174.2548  |      nan       |     nan     | 288.521  |
|       T5ForConditionalGeneration        | 4  |  4.1395  |  12.6828  |      nan       |     nan     | 247.4364 |
|     M2M100ForConditionalGeneration      | 2  | 26.4684  |  45.8915  |      nan       |     nan     | 214.2353 |
|       MT5ForConditionalGeneration       | 2  |  6.5618  |  21.1144  |      nan       |     nan     | 202.0306 |
|            YituTechConvBert             | 1  |  9.4912  |  20.8653  |      nan       |     nan     | 187.9041 |
|             XGLMForCausalLM             | 1  | 15.5894  |  30.5654  |      nan       |     nan     | 170.4506 |
|      MBartForConditionalGeneration      | 8  | 26.8296  |  47.6996  |      nan       |     nan     | 170.0447 |
|     PegasusForConditionalGeneration     | 4  | 26.2079  |  45.6003  |      nan       |     nan     | 167.4475 |
|           DebertaForMaskedLM            | 4  |  7.3099  |  14.5402  |    53.1345     |     nan     | 163.0197 |
|    MegatronBertForQuestionAnswering     | 8  | 17.0641  |  31.1438  |      nan       |     nan     | 161.2878 |
|      BartForConditionalGeneration       | 1  | 26.4428  |  45.956   |      nan       |     nan     | 152.167  |
|         MegatronBertForCausalLM         | 2  | 16.4797  |  31.8313  |      nan       |     nan     | 144.6966 |
|                 T5Small                 | 1  |  3.9891  |  12.5222  |      nan       |     nan     | 144.5762 |
|     PLBartForConditionalGeneration      | 8  |  7.4848  |  17.1203  |      nan       |     nan     | 130.1847 |
| BlenderbotSmallForConditionalGeneration | 32 | 12.1662  |  25.0348  |      nan       |     nan     | 124.0423 |
|       DebertaForQuestionAnswering       | 4  |  7.3726  |  14.8011  |    53.6215     |     nan     | 120.8276 |
|           RobertaForCausalLM            | 4  |  5.3202  |  12.7009  |      nan       |     nan     | 107.3699 |
|    LayoutLMForSequenceClassification    | 16 |  5.6057  |  12.9613  |      nan       |     nan     | 92.3443  |
|           PegasusForCausalLM            | 8  |  9.9424  |  16.9838  |      nan       |     nan     | 90.8327  |
|             OPTForCausalLM              | 4  |  4.8978  |  12.0564  |      nan       |     nan     | 86.7957  |
|             BartForCausalLM             | 2  | 10.3112  |  17.1423  |      nan       |     nan     | 83.3249  |
|            MBartForCausalLM             | 16 | 10.0524  |  17.3672  |      nan       |     nan     | 83.1659  |
|       ElectraForQuestionAnswering       | 64 |  5.2311  |  12.7727  |      nan       |     nan     | 82.9515  |
|             BertForMaskedLM             | 64 |  5.2735  |  12.6394  |      nan       |     nan     | 82.4964  |
|           LayoutLMForMaskedLM           | 16 |  5.5665  |  13.2244  |      nan       |     nan     | 80.9633  |
|      GPT2ForSequenceClassification      | 4  |  3.6189  |  10.3118  |      nan       |     nan     | 76.5278  |
|           ElectraForCausalLM            | 1  |  5.3414  |  12.6918  |      nan       |     nan     | 71.6347  |
|            TrOCRForCausalLM             | 8  | 10.3861  |  17.3261  |      nan       |     nan     | 71.0615  |
|                 BigBird                 | 1  | 11.5557  |  20.1771  |      nan       |     nan     | 69.8588  |
|     DistilBertForQuestionAnswering      | 32 |  1.9056  |  5.4488   |      nan       |     nan     | 66.5726  |
|                CamemBert                | 1  |  5.3005  |  12.5095  |      nan       |     nan     | 65.4941  |
|            AlbertForMaskedLM            | 2  |  1.5705  |  8.8811   |      nan       |     nan     | 65.1659  |
|       BlenderbotSmallForCausalLM        | 64 |  4.9754  |  9.5922   |      nan       |     nan     | 63.5161  |
|            PLBartForCausalLM            | 16 |  3.2396  |  6.8243   |      nan       |     nan     | 62.7263  |
|       RobertaForQuestionAnswering       | 64 |  5.1814  |  12.7826  |      nan       |     nan     | 61.2356  |
|        BertForQuestionAnswering         | 64 |  5.1795  |  12.5753  |      nan       |     nan     | 60.1564  |
|         Speech2Text2ForCausalLM         | 64 |  3.3817  |  6.9074   |      nan       |     nan     | 59.4116  |
|               DistillGPT2               | 1  |  1.583   |  4.7509   |      nan       |     nan     | 58.5183  |
|          DistilBertForMaskedLM          | 16 |  2.0107  |  5.6044   |      nan       |     nan     | 50.1556  |
|       AlbertForQuestionAnswering        | 2  |  1.5747  |  8.7694   |      nan       |     nan     | 44.0509  |
|          AllenaiLongformerBase          | 1  | 12.4196  |  22.7413  |    92.7516     |     nan     |   nan    |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+

Peak Memory Compression Ratio

+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|      GPT2ForSequenceClassification      | 4  | 0.9675 |  0.9163   |      nan       |     nan     |  1.0699  |
|            XLNetLMHeadModel             | 4  | 0.9912 |  0.8791   |      nan       |     nan     |  1.0109  |
|       ElectraForQuestionAnswering       | 64 | 1.0016 |  0.9539   |      nan       |     nan     |  1.0002  |
|                 T5Small                 | 1  |  1.0   |  0.9124   |      nan       |     nan     |  0.9876  |
|           LayoutLMForMaskedLM           | 16 | 0.9999 |  0.9238   |      nan       |     nan     |  0.9871  |
|             BertForMaskedLM             | 64 | 0.9996 |   0.899   |      nan       |     nan     |  0.9811  |
|    LayoutLMForSequenceClassification    | 16 | 1.004  |  0.9325   |      nan       |     nan     |  0.9712  |
| BlenderbotSmallForConditionalGeneration | 32 | 0.9998 |  0.8996   |      nan       |     nan     |  0.9557  |
|             BartForCausalLM             | 2  |  1.0   |  0.8769   |      nan       |     nan     |  0.9545  |
|       T5ForConditionalGeneration        | 4  | 0.9996 |  0.9594   |      nan       |     nan     |  0.9525  |
|         Speech2Text2ForCausalLM         | 64 | 0.9954 |  0.8265   |      nan       |     nan     |  0.9452  |
|            PLBartForCausalLM            | 16 | 1.0006 |  0.8667   |      nan       |     nan     |  0.9395  |
|       BlenderbotSmallForCausalLM        | 64 | 0.9996 |  0.8172   |      nan       |     nan     |  0.9269  |
|        BertForQuestionAnswering         | 64 | 0.9995 |  0.9315   |      nan       |     nan     |  0.9256  |
|       RobertaForQuestionAnswering       | 64 | 0.9996 |  0.9315   |      nan       |     nan     |  0.9254  |
|          DistilBertForMaskedLM          | 16 | 0.9991 |  0.8698   |      nan       |     nan     |  0.9167  |
|      BartForConditionalGeneration       | 1  |  1.0   |  0.8619   |      nan       |     nan     |  0.881   |
|       AlbertForQuestionAnswering        | 2  |  1.0   |  0.6451   |      nan       |     nan     |  0.8636  |
|            MBartForCausalLM             | 16 |  1.0   |  0.8398   |      nan       |     nan     |  0.8565  |
|            AlbertForMaskedLM            | 2  |  1.0   |  0.6364   |      nan       |     nan     |  0.8515  |
|                 BigBird                 | 1  | 1.0024 |  0.9513   |      nan       |     nan     |  0.8349  |
|     DistilBertForQuestionAnswering      | 32 | 0.9987 |  0.8967   |      nan       |     nan     |  0.834   |
|     PLBartForConditionalGeneration      | 8  | 0.9999 |  0.8304   |      nan       |     nan     |  0.8252  |
|               DistillGPT2               | 1  | 1.0006 |  0.7548   |      nan       |     nan     |  0.812   |
|      MBartForConditionalGeneration      | 8  | 0.9999 |  0.8187   |      nan       |     nan     |  0.7699  |
|            TrOCRForCausalLM             | 8  |  1.0   |  0.7955   |      nan       |     nan     |  0.7566  |
|                CamemBert                | 1  | 0.9989 |  0.7872   |      nan       |     nan     |  0.7482  |
|             OPTForCausalLM              | 4  | 0.9975 |  0.7501   |      nan       |     nan     |  0.7473  |
|            YituTechConvBert             | 1  | 0.9718 |  0.7819   |      nan       |     nan     |  0.7407  |
|           PegasusForCausalLM            | 8  | 0.999  |  0.9444   |      nan       |     nan     |  0.7324  |
|           RobertaForCausalLM            | 4  | 0.9237 |  0.7741   |      nan       |     nan     |  0.7309  |
|             XGLMForCausalLM             | 1  | 0.9999 |  0.9992   |      nan       |     nan     |  0.7214  |
|    MegatronBertForQuestionAnswering     | 8  | 0.9051 |  0.8218   |      nan       |     nan     |  0.7107  |
|          MobileBertForMaskedLM          | 16 | 0.9985 |  0.8983   |      nan       |     nan     |  0.6948  |
|     PegasusForConditionalGeneration     | 4  | 0.9996 |  0.9196   |      nan       |     nan     |  0.6769  |
|           ElectraForCausalLM            | 1  | 0.9993 |  0.8955   |      nan       |     nan     |  0.6701  |
|         MegatronBertForCausalLM         | 2  | 0.7726 |  0.7726   |      nan       |     nan     |  0.6697  |
|     M2M100ForConditionalGeneration      | 2  | 0.9999 |  0.9497   |      nan       |     nan     |  0.6569  |
|     MobileBertForQuestionAnswering      | 32 | 1.0142 |  0.9796   |      nan       |     nan     |  0.6265  |
|       MT5ForConditionalGeneration       | 2  | 0.6019 |  0.6019   |      nan       |     nan     |  0.6019  |
|           DebertaForMaskedLM            | 4  | 0.9982 |  0.9826   |     0.3599     |     nan     |  0.4498  |
|       DebertaForQuestionAnswering       | 4  | 0.979  |  1.0568   |     0.3578     |     nan     |  0.3761  |
|          AllenaiLongformerBase          | 1  | 0.9996 |  0.9477   |     0.3752     |     nan     |   nan    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+

timm_models suite with amp precision

Performance speedup

+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|            hrnet_w18            |  2  | 1.0028 |  0.9644   |      0.0       |   1.3794    |  4.8666  |
|        res2net50_14w_8s         |  2  | 0.9994 |  0.9247   |      0.0       |   1.3968    |  4.7346  |
|           res2next50            |  2  | 1.0037 |  0.9304   |      0.0       |    1.362    |  4.6397  |
|        twins_pcpvt_base         | 32  | 1.0024 |  0.8988   |      0.0       |    1.36     |  2.5347  |
|      xcit_large_24_p8_224       |  5  | 1.0003 |    0.0    |      0.0       |     0.0     |  2.1071  |
|          cait_m36_384           |  2  | 1.0023 |  0.8557   |      0.0       |   1.3541    |  2.0791  |
|        tnt_s_patch16_224        | 64  | 0.9994 |  0.9944   |      0.0       |   1.8326    |  1.9956  |
|          ghostnet_100           | 128 | 1.0031 |  1.0008   |      0.0       |   1.5591    |  1.893   |
|         crossvit_9_240          | 64  | 1.0051 |  0.9639   |      0.0       |   1.1374    |  1.7206  |
|          gmixer_24_224          | 64  | 0.9987 |  0.8853   |      0.0       |   1.0128    |  1.6752  |
|           volo_d1_224           | 64  | 0.9994 |  0.9941   |      0.0       |   1.1497    |  1.6642  |
|            lcnet_050            | 128 | 0.9678 |  0.9515   |      0.0       |   1.6064    |  1.6229  |
|            nfnet_l0             | 64  | 1.006  |   0.839   |      0.0       |    1.193    |  1.5908  |
|           regnety_002           | 128 | 0.981  |   0.933   |      0.0       |   1.3813    |  1.5766  |
|  swin_base_patch4_window7_224   | 64  | 0.9992 |  0.9578   |      0.0       |   1.0465    |  1.5415  |
|         coat_lite_mini          | 128 |  1.0   |  0.9957   |      0.0       |   1.2651    |  1.4983  |
|          resmlp_12_224          | 128 | 1.0002 |  0.9982   |     0.7823     |     0.0     |  1.4718  |
|          jx_nest_base           | 32  | 0.9992 |  0.9917   |      0.0       |   1.2314    |   1.46   |
|           resnest101e           | 32  | 1.0043 |  0.9905   |      0.0       |   1.4192    |  1.4201  |
|          gmlp_s16_224           | 64  | 0.9989 |   0.983   |      0.0       |   1.0513    |  1.4139  |
|           convit_base           | 32  | 0.9994 |  0.9914   |      0.0       |     0.0     |  1.3895  |
|            pit_b_224            | 64  | 0.9995 |  0.9939   |      0.0       |   1.0686    |  1.3627  |
|           dm_nfnet_f0           | 128 | 0.9992 |  0.9992   |      0.0       |   1.1759    |  1.3014  |
|          mixer_b16_224          | 64  | 0.9992 |  0.9904   |     0.716      |   0.9657    |  1.2967  |
|      beit_base_patch16_224      | 64  | 0.9996 |  0.9776   |      0.0       |   1.0503    |  1.2906  |
| deit_base_distilled_patch16_224 | 64  | 0.9996 |  0.9913   |      0.0       |   1.0703    |  1.2895  |
|        adv_inception_v3         | 128 |  1.0   |  0.9952   |      0.0       |   1.1927    |  1.2253  |
|       gluon_inception_v3        | 128 |  1.0   |  0.9946   |      0.0       |    1.194    |  1.2168  |
|          inception_v3           | 128 |  1.0   |  0.9952   |      0.0       |   1.1935    |  1.2139  |
|         poolformer_m36          | 64  | 0.9991 |  0.9974   |      0.0       |     0.0     |  1.2087  |
|      vit_base_patch16_224       | 64  | 0.9997 |  0.9933   |      0.0       |   0.9995    |  1.1961  |
|           tf_mixnet_l           | 64  | 0.9832 |  0.8984   |      0.0       |   1.1168    |  1.1412  |
|           mobilevit_s           | 32  | 0.9752 |  0.7969   |      0.0       |   1.2175    |  1.1277  |
|            mixnet_l             | 64  | 0.9802 |   0.889   |      0.0       |   1.1177    |  1.0927  |
|         visformer_small         | 128 | 1.0003 |  1.0006   |      0.0       |   1.0867    |  1.0534  |
|          pnasnet5large          | 16  | 1.0052 |  1.0238   |      0.0       |   1.1323    |  1.0315  |
|             dla102              | 64  | 0.9994 |  1.0099   |      0.0       |   1.3742    |  1.0293  |
|            fbnetv3_b            | 128 | 0.9685 |  0.9577   |      0.0       |   1.2758    |  0.9577  |
|           mnasnet_100           | 128 | 0.9535 |  0.9394   |     0.6673     |   1.3679    |  0.9231  |
|            repvgg_a2            | 128 | 0.9416 |  0.9342   |      0.0       |   1.1287    |  0.9156  |
|           selecsls42b           | 128 | 0.9995 |  0.9942   |      0.0       |    1.356    |  0.8981  |
|            tinynet_a            | 128 | 0.9605 |  0.8048   |      0.0       |   1.0887    |  0.8876  |
|        convmixer_768_32         | 32  | 0.9997 |  0.9979   |      0.0       |   1.0523    |  0.8863  |
|             dpn107              | 32  | 0.9485 |  0.9127   |      0.0       |   0.9813    |  0.8856  |
|          cspdarknet53           | 64  | 0.9432 |   0.935   |      0.0       |   0.9008    |  0.8791  |
|          convnext_base          | 32  | 1.0058 |  0.9438   |      0.0       |   1.3613    |  0.8489  |
|        res2net101_26w_4s        | 64  | 1.0025 |   0.996   |      0.0       |   1.3914    |  0.8471  |
|      mobilenetv3_large_100      | 128 | 0.9552 |  0.9437   |      0.0       |   1.3446    |  0.8334  |
|          spnasnet_100           | 128 | 0.9462 |  0.9369   |     0.6574     |   1.3183    |  0.8288  |
|            gernet_l             | 128 | 0.9466 |  0.9359   |      0.0       |   1.1389    |  0.7974  |
|           fbnetc_100            | 128 | 0.9525 |  0.9432   |     0.6733     |   1.3758    |  0.7479  |
|        eca_halonext26ts         | 64  | 0.9639 |  0.8063   |      0.0       |   1.1003    |  0.7363  |
|        sebotnet33ts_256         | 64  | 0.9669 |  0.8367   |      0.0       |    1.116    |  0.7274  |
|       tf_efficientnet_b0        | 128 | 0.9642 |  0.8073   |      0.0       |   1.0953    |  0.7162  |
|       eca_botnext26ts_256       | 64  | 0.9627 |  0.8009   |      0.0       |   1.1043    |  0.703   |
|          botnet26t_256          | 128 | 0.9783 |  0.9756   |      0.0       |   1.3439    |  0.6823  |
|         mobilenetv2_100         | 128 | 0.9498 |  0.9402   |      0.0       |   0.8656    |  0.6635  |
|        ese_vovnet19b_dw         | 128 | 0.9693 |   0.965   |      0.0       |   1.2431    |  0.6551  |
|           rexnet_100            | 128 | 0.9775 |  0.8495   |      0.0       |   1.0358    |  0.6527  |
|     swsl_resnext101_32x16d      | 32  | 0.9995 |  0.9796   |      0.0       |   1.0735    |  0.6428  |
|        gluon_xception65         | 32  | 0.998  |  0.9783   |      0.0       |   1.0628    |  0.5736  |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+

Accuracy

+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
|              name               | bs | eager |   aot_eager   | aot_cudagraphs |  aot_nvfuser  |   inductor    |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
|           fbnetc_100            | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|           mnasnet_100           | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|            repvgg_a2            | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|        adv_inception_v3         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      beit_base_patch16_224      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          botnet26t_256          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        convmixer_768_32         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          convnext_base          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         crossvit_9_240          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          cspdarknet53           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
| deit_base_distilled_patch16_224 | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|             dla102              | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           dm_nfnet_f0           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|             dpn107              | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       eca_botnext26ts_256       | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        eca_halonext26ts         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            gernet_l             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          ghostnet_100           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       gluon_inception_v3        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          inception_v3           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            lcnet_050            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            mixnet_l             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         mobilenetv2_100         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      mobilenetv3_large_100      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           mobilevit_s           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            nfnet_l0             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          pnasnet5large          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           regnety_002           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        res2net101_26w_4s        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        res2net50_14w_8s         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           res2next50            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           rexnet_100            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        sebotnet33ts_256         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           selecsls42b           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|  swin_base_patch4_window7_224   | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|     swsl_resnext101_32x16d      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       tf_efficientnet_b0        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           tf_mixnet_l           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            tinynet_a            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        tnt_s_patch16_224        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         visformer_small         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      vit_base_patch16_224       | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           volo_d1_224           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          resmlp_12_224          | 2  | pass  |     pass      |      pass      |  fail_to_run  |     pass      |
|           convit_base           | 2  | pass  |     pass      |  fail_to_run   |  fail_to_run  |     pass      |
|      xcit_large_24_p8_224       | 2  | pass  |  fail_to_run  |  fail_to_run   |  fail_to_run  |     pass      |
|          gmixer_24_224          | 2  | pass  |     pass      |      pass      | fail_accuracy |     pass      |
|          gmlp_s16_224           | 2  | pass  |     pass      |      pass      | fail_accuracy |     pass      |
|          mixer_b16_224          | 2  | pass  |     pass      |      pass      | fail_accuracy |     pass      |
|         poolformer_m36          | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy |     pass      |
|           resnest101e           | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy |     pass      |
|         coat_lite_mini          | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|          jx_nest_base           | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|            pit_b_224            | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|        twins_pcpvt_base         | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|        ese_vovnet19b_dw         | 2  | pass  |     pass      |  fail_to_run   |     pass      | fail_accuracy |
|        gluon_xception65         | 2  | pass  |     pass      |  fail_to_run   |     pass      | fail_accuracy |
|            hrnet_w18            | 2  | pass  |     pass      |  fail_to_run   |     pass      | fail_accuracy |
|          spnasnet_100           | 2  | pass  |     pass      |      pass      | fail_accuracy | fail_accuracy |
|            fbnetv3_b            | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy | fail_accuracy |
|          cait_m36_384           | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+

Compilation latency (sec)

+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
|              name               | bs  |  eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
|            hrnet_w18            |  2  | 97.6966 | 141.0268  |      nan       |   477.271   | 1428.6487 |
|          pnasnet5large          | 16  | 59.9965 |  89.713   |      nan       |  251.7262   | 1281.2421 |
|             dpn107              | 32  | 13.8456 |  28.5265  |      nan       |  112.7519   | 1259.4166 |
|           rexnet_100            | 128 | 6.6675  |  14.2855  |      nan       |  122.0738   | 1152.8284 |
|        res2net50_14w_8s         |  2  | 20.0355 |  38.8849  |      nan       |  123.4919   | 956.6994  |
|           mobilevit_s           | 32  | 6.1429  |  13.5761  |      nan       |   62.1746   | 912.5899  |
|            mixnet_l             | 64  | 13.5148 |  22.8111  |      nan       |   89.7195   | 839.4074  |
|       eca_botnext26ts_256       | 64  | 2.6024  |  7.2121   |      nan       |   64.9197   | 837.0591  |
|        twins_pcpvt_base         | 32  | 26.7089 |  45.6423  |      nan       |   99.6458   |  834.726  |
|          ghostnet_100           | 128 | 9.3711  |  18.8885  |      nan       |   98.4747   | 771.2625  |
|            tinynet_a            | 128 | 7.7367  |  15.6026  |      nan       |   84.8686   | 727.8874  |
|            fbnetv3_b            | 128 | 13.3606 |  24.3739  |      nan       |  111.7964   |  698.534  |
|         coat_lite_mini          | 128 | 3.3191  |  9.0908   |      nan       |   34.6481   | 679.6933  |
|           resnest101e           | 32  | 26.9489 |  47.6734  |      nan       |  126.5945   | 658.9299  |
|             dla102              | 64  | 10.6743 |  22.7225  |      nan       |   97.4741   | 630.0407  |
|           fbnetc_100            | 128 |  5.671  |  12.3828  |    89.3023     |   64.1864   | 627.2809  |
|        sebotnet33ts_256         | 64  | 3.9369  |   9.979   |      nan       |   70.2461   | 608.7544  |
|          botnet26t_256          | 128 | 2.5158  |  6.7876   |      nan       |   51.1621   | 591.8039  |
|           tf_mixnet_l           | 64  | 14.0096 |  23.5027  |      nan       |   89.8185   | 550.0596  |
|          cspdarknet53           | 64  | 6.2996  |  13.5792  |      nan       |   45.3241   | 535.2317  |
|        eca_halonext26ts         | 64  | 2.7292  |  7.5233   |      nan       |   67.9176   |  531.045  |
|           res2next50            |  2  | 7.6466  |  17.1314  |      nan       |   65.4278   | 518.6104  |
|       tf_efficientnet_b0        | 128 | 6.0641  |  13.1288  |      nan       |   83.8699   | 508.1979  |
|        adv_inception_v3         | 128 |  8.67   |  18.8374  |      nan       |  106.7887   | 469.8298  |
|           mnasnet_100           | 128 | 4.1978  |  9.7492   |    60.5217     |   53.9337   | 462.3385  |
|        res2net101_26w_4s        | 64  | 25.9614 |  47.214   |      nan       |  144.0171   | 451.5853  |
|  swin_base_patch4_window7_224   | 64  |  12.4   |  26.9354  |      nan       |   82.8153   | 424.9892  |
|           regnety_002           | 128 | 4.9105  |  10.8595  |      nan       |   60.7642   | 413.4684  |
|            nfnet_l0             | 64  |  6.122  |  13.0335  |      nan       |   40.1358   | 407.7362  |
|         mobilenetv2_100         | 128 | 4.2405  |  9.2914   |      nan       |   43.8073   | 400.1731  |
|          convnext_base          | 32  | 12.0387 |  19.3516  |      nan       |   47.4608   | 400.0663  |
|        ese_vovnet19b_dw         | 128 | 2.0251  |  5.1077   |      nan       |   40.0498   | 397.4036  |
|         visformer_small         | 128 | 2.3605  |  6.7356   |      nan       |   32.1076   | 379.8813  |
|      xcit_large_24_p8_224       |  5  | 37.1179 |    nan    |      nan       |     nan     |  363.892  |
|      mobilenetv3_large_100      | 128 | 4.5824  |  10.1168  |      nan       |   86.4595   | 363.6031  |
|        gluon_xception65         | 32  | 15.4767 |  29.189   |      nan       |   78.4504   | 353.0086  |
|          jx_nest_base           | 32  | 9.7785  |  19.7435  |      nan       |   59.9364   | 327.2779  |
|          cait_m36_384           |  2  | 48.1901 |  71.6923  |      nan       |  107.6152   |  308.271  |
|         poolformer_m36          | 64  | 13.1268 |  21.8151  |      nan       |     nan     | 304.8976  |
|         crossvit_9_240          | 64  | 7.7826  |  17.0441  |      nan       |   42.6455   | 293.8263  |
|            gernet_l             | 128 | 4.9774  |  11.7724  |      nan       |   48.1579   | 285.8593  |
|           selecsls42b           | 128 | 2.4734  |  6.9182   |      nan       |   52.4839   | 275.3577  |
|          spnasnet_100           | 128 | 5.6856  |  12.2802  |    81.6643     |   61.8244   | 262.9874  |
|            lcnet_050            | 128 | 2.0093  |   5.267   |      nan       |   39.738    |  252.48   |
|       gluon_inception_v3        | 128 | 8.4342  |  18.8111  |      nan       |  107.1782   | 234.7551  |
|          inception_v3           | 128 | 8.4807  |  18.9464  |      nan       |  107.6097   | 223.2148  |
|     swsl_resnext101_32x16d      | 32  | 10.3929 |  22.1383  |      nan       |   63.0214   | 217.9656  |
|           volo_d1_224           | 64  |  6.874  |  16.0245  |      nan       |   45.317    | 210.5924  |
|           convit_base           | 32  | 4.1162  |  10.6181  |      nan       |     nan     | 190.8993  |
|            pit_b_224            | 64  | 3.9964  |   9.565   |      nan       |   27.7656   | 183.1499  |
|        tnt_s_patch16_224        | 64  | 12.6558 |  24.9605  |      nan       |   49.1967   |  166.85   |
|          gmlp_s16_224           | 64  | 9.7371  |  17.5417  |      nan       |   30.1711   | 149.1056  |
|            repvgg_a2            | 128 | 4.9392  |  10.6779  |      nan       |   65.977    | 139.9949  |
|          gmixer_24_224          | 64  | 8.6395  |  17.5991  |      nan       |   35.0491   | 131.8514  |
|           dm_nfnet_f0           | 128 | 6.6834  |  13.7387  |      nan       |   42.7846   | 128.7499  |
|          resmlp_12_224          | 128 |  2.834  |  6.1399   |     9.9394     |     nan     | 102.5117  |
|          mixer_b16_224          | 64  | 2.8878  |   7.212   |    16.3638     |   18.1333   | 100.1476  |
|        convmixer_768_32         | 32  |  7.064  |  14.717   |      nan       |   24.0237   |  85.642   |
|      beit_base_patch16_224      | 64  | 4.6764  |  10.6156  |      nan       |   22.3841   |  83.8798  |
| deit_base_distilled_patch16_224 | 64  | 3.1291  |   8.244   |      nan       |   16.9322   |  80.9149  |
|      vit_base_patch16_224       | 64  | 3.0647  |  7.9743   |      nan       |   16.4046   |  70.3833  |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+

Peak Memory Compression Ratio

+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|          gmixer_24_224          | 64  | 1.0001 |  0.9563   |      nan       |   0.8998    |  1.2577  |
|          gmlp_s16_224           | 64  |  1.0   |  0.9679   |      nan       |    0.92     |  1.2405  |
|            tinynet_a            | 128 | 1.0001 |  0.7955   |      nan       |   0.7958    |  1.1632  |
|          pnasnet5large          | 16  | 1.0583 |  0.9923   |      nan       |   1.1741    |  1.1266  |
|        eca_halonext26ts         | 64  | 0.999  |  0.7814   |      nan       |    0.786    |  1.0889  |
|           dm_nfnet_f0           | 128 | 0.9758 |  0.9039   |      nan       |    0.95     |  1.0616  |
|        tnt_s_patch16_224        | 64  |  1.0   |  0.9718   |      nan       |   0.9431    |  1.0587  |
|           volo_d1_224           | 64  | 1.0015 |  0.9518   |      nan       |   0.8587    |  1.0378  |
|           convit_base           | 32  | 0.9991 |   0.86    |      nan       |     nan     |  1.0309  |
|      beit_base_patch16_224      | 64  | 0.9999 |  0.9367   |      nan       |   0.9298    |  1.0097  |
|           mobilevit_s           | 32  |  1.0   |  0.7722   |      nan       |    0.787    |  1.0078  |
|           rexnet_100            | 128 | 0.9988 |  0.7919   |      nan       |   0.8648    |  1.001   |
|             dla102              | 64  | 0.9998 |  0.9549   |      nan       |   0.9751    |  0.9969  |
|            pit_b_224            | 64  | 1.0021 |  0.8074   |      nan       |   0.8179    |  0.9856  |
|         poolformer_m36          | 64  | 1.0015 |  0.9462   |      nan       |     nan     |  0.9797  |
|          convnext_base          | 32  | 1.0065 |   0.908   |      nan       |   0.7521    |  0.9564  |
|        twins_pcpvt_base         | 32  | 0.9963 |  0.9079   |      nan       |   0.8007    |  0.9553  |
|        convmixer_768_32         | 32  | 0.9992 |  0.9807   |      nan       |   0.9715    |  0.9513  |
|         visformer_small         | 128 | 0.9899 |  0.9353   |      nan       |   0.8884    |  0.9341  |
|           resnest101e           | 32  | 1.0002 |  0.9762   |      nan       |   0.9535    |  0.9292  |
|           tf_mixnet_l           | 64  | 0.9995 |  0.8624   |      nan       |   0.8426    |  0.9291  |
|          mixer_b16_224          | 64  | 0.9929 |  0.9425   |     0.2532     |   0.7726    |  0.9225  |
|       tf_efficientnet_b0        | 128 | 1.0006 |  0.7769   |      nan       |    0.846    |  0.9189  |
|            nfnet_l0             | 64  | 0.9993 |   0.824   |      nan       |   0.8257    |  0.9132  |
|         mobilenetv2_100         | 128 | 0.9992 |  0.7716   |      nan       |   0.9249    |  0.8963  |
|      vit_base_patch16_224       | 64  | 0.9955 |  0.9384   |      nan       |   0.8801    |  0.8916  |
| deit_base_distilled_patch16_224 | 64  | 0.9944 |  0.9376   |      nan       |   0.8794    |  0.8911  |
|      mobilenetv3_large_100      | 128 | 0.9987 |  0.8562   |      nan       |   0.8673    |  0.8885  |
|        adv_inception_v3         | 128 | 1.0003 |  0.8759   |      nan       |   0.8538    |  0.8829  |
|       gluon_inception_v3        | 128 | 1.0003 |  0.8759   |      nan       |   0.8538    |  0.8829  |
|          inception_v3           | 128 | 1.0003 |  0.8759   |      nan       |   0.8538    |  0.8829  |
|        gluon_xception65         | 32  |  1.0   |  0.8895   |      nan       |   0.8854    |  0.8712  |
|             dpn107              | 32  | 0.9981 |  0.9115   |      nan       |   0.8834    |   0.87   |
|           selecsls42b           | 128 | 0.9789 |  0.8913   |      nan       |   0.8811    |  0.866   |
|            fbnetv3_b            | 128 | 1.0003 |  0.7918   |      nan       |   0.7903    |  0.8647  |
|            mixnet_l             | 64  | 0.9989 |  0.8507   |      nan       |   0.7796    |  0.8601  |
|          spnasnet_100           | 128 | 0.9988 |  0.8961   |     0.1651     |   0.8371    |  0.8599  |
|       eca_botnext26ts_256       | 64  | 0.9998 |  0.7776   |      nan       |   0.7813    |  0.8533  |
|     swsl_resnext101_32x16d      | 32  | 1.0009 |  0.8805   |      nan       |   0.8487    |  0.8523  |
|      xcit_large_24_p8_224       |  5  | 0.9987 |    nan    |      nan       |     nan     |  0.8489  |
|          resmlp_12_224          | 128 | 0.9827 |  0.9667   |     0.2637     |     nan     |  0.845   |
|          ghostnet_100           | 128 | 1.0013 |  0.8903   |      nan       |   0.9244    |  0.833   |
|         coat_lite_mini          | 128 | 1.0338 |   0.929   |      nan       |   0.6593    |  0.8328  |
|        ese_vovnet19b_dw         | 128 |  1.0   |   0.867   |      nan       |   0.9146    |  0.8269  |
|          cspdarknet53           | 64  |  1.0   |  0.8467   |      nan       |   0.7906    |  0.813   |
|          cait_m36_384           |  2  | 0.9998 |  0.8806   |      nan       |   0.9023    |  0.8081  |
|          jx_nest_base           | 32  |  1.0   |  0.8945   |      nan       |    0.86     |   0.8    |
|         crossvit_9_240          | 64  | 1.0008 |  0.8801   |      nan       |   0.8854    |  0.7934  |
|        res2net101_26w_4s        | 64  | 0.9999 |  0.9202   |      nan       |   0.8569    |  0.7834  |
|           mnasnet_100           | 128 | 0.9993 |  0.8882   |     0.1669     |   0.8253    |  0.773   |
|  swin_base_patch4_window7_224   | 64  | 0.9998 |  0.9234   |      nan       |   0.8451    |  0.7676  |
|        sebotnet33ts_256         | 64  | 0.9999 |  0.7108   |      nan       |   0.7354    |  0.7449  |
|            gernet_l             | 128 | 0.9998 |  0.8655   |      nan       |    0.83     |  0.7238  |
|           fbnetc_100            | 128 | 0.9984 |  0.8631   |     0.1626     |   0.7352    |  0.7104  |
|            lcnet_050            | 128 | 0.9992 |  0.7927   |      nan       |   0.7885    |  0.705   |
|           regnety_002           | 128 | 0.9994 |  0.8284   |      nan       |   0.7819    |  0.6971  |
|          botnet26t_256          | 128 |  1.0   |  0.8755   |      nan       |    0.78     |  0.6615  |
|           res2next50            |  2  |  1.0   |  0.8301   |      nan       |   0.8198    |  0.6012  |
|        res2net50_14w_8s         |  2  |  1.0   |  0.8275   |      nan       |   0.8169    |  0.5927  |
|            hrnet_w18            |  2  |  1.0   |  0.8383   |      nan       |   0.8363    |  0.5746  |
|            repvgg_a2            | 128 | 1.0003 |  0.7971   |      nan       |   0.6902    |  0.5572  |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+

Performance graphs

[image: bench_logs/timm_models_amp.png]

[image: bench_logs/torchbench_amp.png]

[image: bench_logs/huggingface_amp.png]

@anijain2305
Contributor Author

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface and timm. We run these experiments on A100 GPUs. Each experiment runs one iteration of a forward and backward pass. For accuracy, we check the numerical correctness of the forward pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We report mean compilation latency and the peak memory footprint reduction ratio.
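The speedup normalization described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual harness: `median_time` and `speedup` are hypothetical helper names, timing uses `time.perf_counter` on plain Python callables, and a real GPU measurement would also need device synchronization before each timestamp.

```python
import statistics
import time


def median_time(fn, repeats=5):
    """Return the median wall-clock time (seconds) of calling fn() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, backend_seconds):
    """Speedup of a backend relative to the native baseline (values above 1 are faster)."""
    if backend_seconds <= 0:
        raise ValueError("backend_seconds must be positive")
    return baseline_seconds / backend_seconds
```

With this definition, a backend that takes half the baseline time scores 2.0, and the eager rows in the tables sit close to 1.0 as expected.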

Caveats

  1. Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
  2. Experiments do not cover dynamic shapes.
  3. The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+----------------+-------------+-------------+-------------+
|    Compiler    | torchbench  | huggingface | timm_models |
+----------------+-------------+-------------+-------------+
|     eager      | 100%, 55/55 | 93%, 41/44  | 100%, 61/61 |
|   aot_eager    | 98%, 54/55  | 93%, 41/44  | 90%, 55/61  |
| aot_cudagraphs | 29%, 16/55  |  0%, 0/44   |  0%, 0/61   |
|  aot_nvfuser   | 62%, 34/55  |  2%, 1/44   | 82%, 50/61  |
|    inductor    | 87%, 48/55  | 77%, 34/44  | 74%, 45/61  |
+----------------+-------------+-------------+-------------+
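For readers who want to reproduce the "percent, passed/total" cells, here is a small sketch. It assumes that both `pass` and `pass_due_to_skip` count as passing, which is a guess based on the status strings in the accuracy tables; `pass_rate` is a hypothetical helper, not part of the benchmark scripts.

```python
def pass_rate(statuses, passing=("pass", "pass_due_to_skip")):
    """Format a pass rate as 'NN%, passed/total' from per-model status strings.

    Assumption: statuses listed in `passing` count as successes; everything
    else (e.g. fail_to_run, fail_accuracy) counts as a failure.
    """
    total = len(statuses)
    passed = sum(1 for status in statuses if status in passing)
    percent = round(100 * passed / total) if total else 0
    return f"{percent}%, {passed}/{total}"


# Example: 41 passing models out of 44.
print(pass_rate(["pass"] * 41 + ["fail_to_run"] * 3))
```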

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.02x    |    0.0x     |    0.0x     |
|  aot_nvfuser   |   1.12x    |    1.13x    |    1.12x    |
|    inductor    |   1.37x    |    1.61x    |    1.24x    |
+----------------+------------+-------------+-------------+
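A geometric mean over per-model speedups could be computed roughly as below. This is a sketch under stated assumptions: entries of 0.0 or NaN are treated as failed runs and dropped (consistent with the note that failing models are removed), and an empty result is reported as 0.0. `geomean_speedup` is a hypothetical name, and the actual dashboard script may filter differently.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-model speedups, ignoring failed runs.

    Assumption: a speedup of 0.0 or NaN marks a model that failed to run
    or failed accuracy, so it is excluded from the mean.
    """
    valid = [s for s in speedups if not math.isnan(s) and s > 0]
    if not valid:
        return 0.0
    # exp(mean(log(x))) is the geometric mean and avoids overflow on long lists.
    return math.exp(sum(math.log(s) for s in valid) / len(valid))
```

The geometric mean is the usual choice for averaging ratios: a 2x speedup on one model and a 0.5x slowdown on another average to 1.0x rather than 1.25x.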

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    5.70    |    13.73    |    11.39    |
|   aot_eager    |   10.34    |    20.46    |    17.09    |
| aot_cudagraphs |    4.54    |     0.0     |     0.0     |
|  aot_nvfuser   |   21.31    |    10.74    |    57.51    |
|    inductor    |   265.33   |   111.78    |   417.22    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.87x    |    0.88x    |    0.88x    |
| aot_cudagraphs |   0.48x    |    0.0x     |    0.0x     |
|  aot_nvfuser   |   0.84x    |    1.08x    |    0.85x    |
|    inductor    |   0.79x    |    0.74x    |    0.89x    |
+----------------+------------+-------------+-------------+
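One plausible reading of the compression ratio, given that higher is better, is the baseline peak memory divided by the backend's peak memory. The sketch below encodes that assumption; the function name and the byte-valued inputs are illustrative and are not taken from the benchmark code.

```python
def compression_ratio(baseline_peak_bytes, backend_peak_bytes):
    """Peak memory compression ratio relative to the native baseline.

    Assumption: ratio = baseline peak / backend peak, so values above 1.0
    mean the backend used less peak memory than the baseline.
    """
    if backend_peak_bytes <= 0:
        raise ValueError("backend_peak_bytes must be positive")
    return baseline_peak_bytes / backend_peak_bytes
```

Under this reading, a value such as 0.79x would mean the backend's peak memory was about 1.27 times the baseline's.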

torchbench suite with float32 precision

Performance speedup

+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|               name                |  bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|            densenet121            |  4   | 1.0021 |  1.0072   |      0.0       |   1.4515    |  4.6393  |
|         timm_efficientdet         |  1   | 0.9831 |  0.8908   |      0.0       |     0.0     |  3.8674  |
|       functorch_dp_cifar10        |  64  | 1.0019 |  0.9777   |      0.0       |   1.1919    |  3.6153  |
|      timm_vision_transformer      |  8   | 1.003  |   0.923   |      0.0       |   1.3434    |  2.5786  |
|                drq                |  1   | 0.9972 |  0.8497   |      0.0       |   1.0702    |  2.4508  |
|           BERT_pytorch            |  16  | 1.0091 |  0.8721   |      0.0       |     0.0     |  1.855   |
|             resnet18              |  16  | 1.003  |  1.1147   |      0.0       |   1.4051    |  1.7636  |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.9993 |   0.938   |     1.1197     |   1.1919    |  1.729   |
|          pytorch_struct           | 200  | 0.9961 |  0.7502   |     0.8973     |    0.884    |  1.7059  |
|           lennard_jones           | 1000 | 0.9674 |  0.8486   |     1.0724     |   1.0278    |  1.667   |
|             hf_Albert             |  8   | 1.0012 |   0.995   |      0.0       |     0.0     |  1.6645  |
|           squeezenet1_1           |  32  | 0.9972 |  1.0037   |     0.9904     |   1.1563    |  1.6496  |
|               dcgan               |  32  | 0.9915 |  1.0198   |     1.109      |   1.1794    |  1.6235  |
|        speech_transformer         |  32  | 1.0078 |  0.9013   |      0.0       |     0.0     |  1.4912  |
|            timm_nfnet             | 128  | 0.9995 |  1.0004   |      0.0       |   1.2113    |  1.4741  |
|              hf_GPT2              |  4   | 1.0129 |  0.9793   |      0.0       |     0.0     |  1.4269  |
|            hf_T5_large            |  2   | 1.0232 |  0.9244   |      0.0       |     0.0     |  1.4038  |
|          resnext50_32x4d          |  8   | 1.0017 |  1.0845   |      0.0       |   1.3674    |  1.4019  |
|           fastNLP_Bert            |  6   | 0.9991 |  0.9746   |      0.0       |     0.0     |  1.3537  |
|        mobilenet_v3_large         |  32  | 1.0051 |  1.1141   |      0.0       |   1.3888    |  1.343   |
|         soft_actor_critic         | 256  | 0.9997 |  0.7922   |     1.0271     |   1.0222    |  1.2641  |
|          LearningToPaint          |  96  | 1.0027 |  1.0327   |      0.0       |   1.2377    |  1.262   |
|           pytorch_unet            |  1   | 0.9997 |  0.9987   |      0.0       |   1.0754    |  1.203   |
|              hf_Bart              |  4   | 1.0137 |  0.9696   |      0.0       |     0.0     |  1.1822  |
|               vgg16               |  64  | 0.9999 |  0.9984   |     0.7922     |   0.9965    |  1.1723  |
|            Super_SloMo            |  6   | 1.0001 |  0.9977   |      0.0       |     0.0     |  1.1704  |
|              alexnet              | 128  | 0.9993 |  0.9977   |     0.7784     |   1.0005    |  1.1646  |
|              hf_Bert              |  4   | 1.0249 |  1.0019   |      0.0       |     0.0     |  1.1577  |
|           hf_DistilBert           |  8   | 1.0009 |  0.9543   |      0.0       |     0.0     |  1.1516  |
|        shufflenet_v2_x1_0         | 128  | 1.0001 |  1.0777   |      0.0       |   1.2258    |  1.1504  |
|            mnasnet1_0             |  32  | 1.0009 |   1.123   |     0.748      |   1.3056    |  1.1302  |
|          pytorch_stargan          |  16  | 0.9995 |  0.9825   |     0.7291     |   0.9891    |  1.1176  |
|        Background_Matting         |  4   | 0.9996 |  1.0224   |      0.0       |   1.0822    |  1.1164  |
|            hf_Reformer            |  4   | 0.9965 |    0.0    |     0.894      |     0.0     |  1.1094  |
|         timm_efficientnet         |  32  | 0.9572 |   0.818   |      0.0       |   1.0643    |  1.095   |
|            hf_BigBird             |  2   | 0.9932 |  0.9458   |      0.0       |     0.0     |  1.0781  |
|   timm_vision_transformer_large   |  8   | 0.9994 |   0.994   |      0.0       |   0.9828    |  1.052   |
| attention_is_all_you_need_pytorch | 256  | 0.997  |  0.9694   |      0.0       |     0.0     |  1.0474  |
|           timm_resnest            |  32  | 0.9994 |   1.002   |      0.0       |   1.1837    |  1.0351  |
|              demucs               |  4   | 0.9998 |  0.9992   |     1.0002     |   0.9996    |  0.9995  |
|    mobilenet_v2_quantized_qat     |  96  | 0.9993 |  0.9991   |     0.9986     |   0.9989    |  0.9984  |
|      resnet50_quantized_qat       |  32  | 0.9972 |   0.998   |     0.9985     |    0.998    |  0.998   |
|            tts_angular            |  64  | 0.9963 |   0.96    |     0.9962     |   0.9982    |  0.9919  |
|               dlrm                | 2048 | 1.0936 |   0.932   |      0.0       |     0.0     |  0.9396  |
|            timm_vovnet            |  32  | 0.9057 |  0.9046   |      0.0       |   0.9795    |  0.9172  |
|      nvidia_deeprecommender       | 256  | 0.9994 |  0.9628   |     0.5849     |   0.9423    |  0.9044  |
|           mobilenet_v2            |  96  | 0.9996 |  0.9984   |      0.0       |   1.0439    |  0.865   |
|               moco                |  32  | 0.9926 |   1.045   |      0.0       |     0.0     |  0.8381  |
|             resnet50              |  32  | 0.9984 |  0.9932   |      0.0       |   1.1621    |  0.7785  |
|            timm_regnet            |  32  | 0.9649 |  0.9625   |      0.0       |   1.0943    |  0.7707  |
|              yolov3               |  16  | 0.9995 |  0.9943   |      0.0       |   1.1829    |   0.0    |
|           hf_Longformer           |  2   | 0.9693 |   0.901   |     0.8158     |     0.0     |   0.0    |
|               hf_T5               |  8   | 1.0007 |  0.9899   |      0.0       |     0.0     |   0.0    |
|           hf_GPT2_large           |  4   | 0.9996 |  0.9801   |      0.0       |     0.0     |   0.0    |
|             tacotron2             |  64  | 0.9808 |  0.8586   |      0.0       |     0.0     |   0.0    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+

Accuracy

+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|               name                | bs  |      eager       |    aot_eager     |  aot_cudagraphs  |   aot_nvfuser    |     inductor     |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|           hf_GPT2_large           |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|            hf_T5_large            |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|   timm_vision_transformer_large   |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|              alexnet              |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               dcgan               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|              demucs               |  4  |       pass       |       pass       |       pass       |       pass       |       pass       |
|           lennard_jones           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|            mnasnet1_0             |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|    mobilenet_v2_quantized_qat     |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|      nvidia_deeprecommender       |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|   pytorch_CycleGAN_and_pix2pix    |  1  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_stargan          | 16  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_struct           | 200 |       pass       |       pass       |       pass       |       pass       |       pass       |
|      resnet50_quantized_qat       |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|         soft_actor_critic         | 256 |       pass       |       pass       |       pass       |       pass       |       pass       |
|           squeezenet1_1           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|            tts_angular            |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               vgg16               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          LearningToPaint          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            densenet121            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|                drq                |  1  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|       functorch_dp_cifar10        |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           mobilenet_v2            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|        mobilenet_v3_large         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           pytorch_unet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet18              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet50              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|          resnext50_32x4d          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|        shufflenet_v2_x1_0         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|         timm_efficientnet         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_nfnet             |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_regnet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           timm_resnest            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|      timm_vision_transformer      |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_vovnet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            hf_Reformer            |  2  |       pass       |       pass       |       pass       |   fail_to_run    |       pass       |
|           BERT_pytorch            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            Super_SloMo            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
| attention_is_all_you_need_pytorch |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|               dlrm                |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           fastNLP_Bert            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|             hf_Albert             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_Bart              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_Bert              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            hf_BigBird             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           hf_DistilBert           |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_GPT2              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|               hf_T5               |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|        speech_transformer         |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|         timm_efficientdet         |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|        Background_Matting         |  4  |       pass       |       pass       |   fail_to_run    |       pass       |   fail_to_run    |
|           hf_Longformer           |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|            hf_T5_base             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|               moco                |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|             tacotron2             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|          vision_maskrcnn          |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|              yolov3               |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |      0.0000      |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+

Compilation latency (sec)

+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
|               name                |  bs  |  eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
|         timm_efficientdet         |  1   | 51.5766 |  70.6045  |      nan       |     nan     | 1764.6785 |
|            densenet121            |  4   | 13.3012 |  25.1495  |      nan       |   99.4384   | 1532.0131 |
|            hf_T5_large            |  2   | 35.6569 |  66.2515  |      nan       |     nan     | 1068.5147 |
|            mnasnet1_0             |  32  | 3.1714  |  6.9425   |    24.1489     |   33.4974   | 843.7609  |
|        mobilenet_v3_large         |  32  | 3.5883  |   7.421   |      nan       |   55.4373   | 787.0648  |
|               moco                |  32  | 11.1404 |  16.7881  |      nan       |     nan     | 677.8514  |
|           mobilenet_v2            |  96  | 3.0986  |  6.6705   |      nan       |   39.0118   | 623.2404  |
|          resnext50_32x4d          |  8   | 3.3002  |  7.4339   |      nan       |   30.9222   | 591.0255  |
|         timm_efficientnet         |  32  | 5.7511  |  10.4823  |      nan       |   56.0236   | 539.9275  |
|        shufflenet_v2_x1_0         | 128  | 3.5859  |  8.0994   |      nan       |   29.641    | 449.5994  |
|           squeezenet1_1           |  32  | 0.6202  |  1.3239   |     3.539      |    4.885    | 366.1201  |
|           timm_resnest            |  32  | 1.3364  |  3.5203   |      nan       |   35.8046   | 348.1886  |
|            timm_regnet            |  32  | 8.1136  |  14.0954  |      nan       |   53.1497   | 317.6042  |
|            timm_vovnet            |  32  | 2.9071  |  6.1334   |      nan       |   24.786    | 265.4777  |
| attention_is_all_you_need_pytorch | 256  |  4.266  |  10.1758  |      nan       |     nan     | 261.9771  |
|        speech_transformer         |  32  | 7.2245  |  13.6655  |      nan       |     nan     | 251.9521  |
|       functorch_dp_cifar10        |  64  | 0.7908  |  2.0897   |      nan       |   5.4668    | 204.5091  |
|      timm_vision_transformer      |  8   | 2.9851  |  6.2629   |      nan       |   11.3289   | 196.1347  |
|          LearningToPaint          |  96  | 0.9587  |  2.4854   |      nan       |   24.429    | 189.1747  |
|             resnet18              |  16  | 0.9185  |  2.4438   |      nan       |   17.9014   | 185.4883  |
|   timm_vision_transformer_large   |  8   | 22.2284 |  34.3611  |      nan       |   44.8166   | 174.6423  |
|           BERT_pytorch            |  16  |  4.836  |  10.8222  |      nan       |     nan     | 174.2309  |
|              hf_Bart              |  4   | 7.2937  |  13.3922  |      nan       |     nan     | 150.9699  |
|             resnet50              |  32  | 3.2836  |  7.3932   |      nan       |   34.4205   |  145.403  |
|          pytorch_stargan          |  16  | 0.7907  |   2.763   |     9.5307     |   4.3293    | 145.2698  |
|           fastNLP_Bert            |  6   | 4.9808  |  10.0575  |      nan       |     nan     | 142.9017  |
|        Background_Matting         |  4   | 3.6956  |  7.4423   |      nan       |   32.1955   |  141.231  |
|              hf_GPT2              |  4   | 3.5631  |   8.387   |      nan       |     nan     | 139.3171  |
|            timm_nfnet             | 128  | 6.4912  |  11.9484  |      nan       |   34.2804   | 136.1473  |
|          pytorch_struct           | 200  | 0.4001  |  0.9359   |     1.4509     |   4.2146    |  103.788  |
|            Super_SloMo            |  6   |  2.116  |  5.8313   |      nan       |     nan     |  86.5013  |
|             hf_Albert             |  8   | 1.0841  |  5.7737   |      nan       |     nan     |  79.1676  |
|              hf_Bert              |  4   | 4.9073  |  9.6611   |      nan       |     nan     |  76.0375  |
|            hf_Reformer            |  4   |  3.011  |    nan    |    13.0912     |     nan     |  73.2447  |
|            hf_BigBird             |  2   | 10.8878 |  16.7952  |      nan       |     nan     |  58.7916  |
|           pytorch_unet            |  1   | 1.0526  |  2.7433   |      nan       |   20.291    |  56.5606  |
|           hf_DistilBert           |  8   | 1.6504  |  3.9743   |      nan       |     nan     |  49.8976  |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.7386  |   2.59    |     7.9453     |   4.1358    |  31.9732  |
|               vgg16               |  64  | 0.3239  |  0.7723   |     2.3694     |   2.6377    |  19.7724  |
|                drq                |  1   | 0.2568  |  0.5426   |      nan       |    3.49     |  19.6268  |
|               dlrm                | 2048 | 0.5936  |  0.9576   |      nan       |     nan     |  17.1468  |
|              alexnet              | 128  | 0.2564  |  0.5024   |     1.1934     |   2.4487    |  15.6905  |
|               dcgan               |  32  | 0.2503  |  0.5086   |     1.2065     |    3.791    |  15.4419  |
|      nvidia_deeprecommender       | 256  |  0.255  |  0.4785   |     0.7806     |   2.4561    |  11.5989  |
|         soft_actor_critic         | 256  | 0.2525  |  0.3811   |     0.6593     |   1.5779    |  10.3899  |
|           lennard_jones           | 1000 | 0.2231  |   0.362   |     0.5064     |   1.1272    |  5.2309   |
|            tts_angular            |  64  | 0.3078  |   0.363   |     0.4981     |   1.0814    |  4.2127   |
|      resnet50_quantized_qat       |  32  | 2.4789  |  2.5093   |     2.5295     |   2.4749    |  2.4968   |
|    mobilenet_v2_quantized_qat     |  96  | 2.3837  |  2.3536   |     2.377      |   2.3057    |  2.2628   |
|              demucs               |  4   |  0.802  |  0.8072   |     0.8072     |   0.7996    |  0.7216   |
|              yolov3               |  16  | 7.2552  |  13.1212  |      nan       |   47.2727   |    nan    |
|           hf_Longformer           |  2   | 11.3734 |  19.0144  |    90.6872     |     nan     |    nan    |
|           hf_GPT2_large           |  4   | 21.1646 |  35.4272  |      nan       |     nan     |    nan    |
|             tacotron2             |  64  | 14.0298 |  26.6327  |      nan       |     nan     |    nan    |
|               hf_T5               |  8   | 3.8362  |  10.6544  |      nan       |     nan     |    nan    |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+

Peak Memory Compression Ratio

+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|               name                |  bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|            Super_SloMo            |  6   | 1.0024 |   0.956   |      nan       |     nan     |  1.1857  |
|         timm_efficientnet         |  32  | 0.9998 |  0.7704   |      nan       |   0.7845    |  1.0652  |
|            timm_nfnet             | 128  | 0.9393 |   0.897   |      nan       |   0.9515    |  1.022   |
|         timm_efficientdet         |  1   | 1.0142 |  0.8251   |      nan       |     nan     |  1.0218  |
|      resnet50_quantized_qat       |  32  | 0.9967 |  0.9967   |     0.9967     |   0.9967    |  1.0001  |
|    mobilenet_v2_quantized_qat     |  96  | 0.9957 |  0.9957   |     0.9957     |   0.9957    |  0.9992  |
|           mobilenet_v2            |  96  | 0.9993 |  0.7661   |      nan       |   0.7676    |  0.9975  |
|              demucs               |  4   | 0.9886 |  0.9886   |     0.9886     |   0.9886    |  0.9886  |
|            tts_angular            |  64  | 0.9884 |  0.9884   |     0.984      |   0.9884    |  0.9842  |
|              hf_GPT2              |  4   | 0.9548 |   0.887   |      nan       |     nan     |  0.9505  |
|        Background_Matting         |  4   | 1.0026 |   0.952   |      nan       |   0.9773    |  0.9139  |
|          pytorch_stargan          |  16  | 0.9975 |   1.019   |     0.2027     |   1.0085    |  0.9023  |
|        speech_transformer         |  32  | 0.9988 |  0.9152   |      nan       |     nan     |  0.896   |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.9986 |  0.9194   |     0.2326     |   0.9141    |  0.8941  |
|             hf_Albert             |  8   | 0.9333 |  0.9333   |      nan       |     nan     |  0.8804  |
|           pytorch_unet            |  1   | 0.9985 |  0.8536   |      nan       |    0.851    |  0.859   |
|              hf_Bart              |  4   | 0.9617 |  0.8786   |      nan       |     nan     |  0.853   |
|              hf_Bert              |  4   | 0.9683 |  0.8952   |      nan       |     nan     |  0.8517  |
|            timm_regnet            |  32  | 1.0013 |  0.8634   |      nan       |   0.8806    |  0.8481  |
|        shufflenet_v2_x1_0         | 128  |  1.0   |  0.9163   |      nan       |   0.8868    |  0.8447  |
|           fastNLP_Bert            |  6   | 1.0012 |  0.9152   |      nan       |     nan     |  0.8343  |
| attention_is_all_you_need_pytorch | 256  | 0.9481 |  0.9241   |      nan       |     nan     |  0.8264  |
|            timm_vovnet            |  32  | 0.9933 |  0.7644   |      nan       |   0.7778    |  0.8252  |
|           BERT_pytorch            |  16  |  1.0   |  0.8995   |      nan       |     nan     |  0.825   |
|            hf_T5_large            |  2   | 0.922  |  0.8722   |      nan       |     nan     |  0.8237  |
|            hf_BigBird             |  2   | 0.9609 |  0.9609   |      nan       |     nan     |  0.8205  |
|           squeezenet1_1           |  32  | 0.9749 |  0.8159   |     0.2781     |   0.9742    |  0.8159  |
|           hf_DistilBert           |  8   | 0.9212 |  0.9053   |      nan       |     nan     |  0.7841  |
|               dcgan               |  32  |  1.0   |  0.7784   |     0.3321     |   0.7784    |  0.767   |
|               moco                |  32  | 1.0067 |  0.9701   |      nan       |     nan     |  0.767   |
|              alexnet              | 128  | 0.9998 |  0.7731   |     0.3805     |   0.7736    |  0.743   |
|            mnasnet1_0             |  32  | 0.9988 |  0.9087   |     0.1627     |   0.8348    |  0.7268  |
|             resnet50              |  32  | 1.0002 |  0.8763   |      nan       |   0.8011    |  0.7255  |
|   timm_vision_transformer_large   |  8   | 1.0022 |  0.8433   |      nan       |   0.8015    |  0.7222  |
|      timm_vision_transformer      |  8   |  1.0   |  0.8883   |      nan       |   0.8108    |  0.712   |
|        mobilenet_v3_large         |  32  | 0.9958 |  0.8655   |      nan       |   0.8773    |  0.7041  |
|               dlrm                | 2048 | 0.7282 |  0.7283   |      nan       |     nan     |  0.6973  |
|           timm_resnest            |  32  | 0.9935 |  0.8869   |      nan       |   0.8075    |  0.6861  |
|            densenet121            |  4   |  1.0   |  0.8812   |      nan       |   0.8571    |  0.6617  |
|          resnext50_32x4d          |  8   | 0.9994 |  0.8687   |      nan       |   0.8223    |  0.6614  |
|               vgg16               |  64  |  1.0   |  0.6663   |     0.2532     |   0.6664    |  0.6471  |
|          LearningToPaint          |  96  | 0.9442 |  0.7168   |      nan       |   0.6504    |  0.6444  |
|         soft_actor_critic         | 256  | 0.964  |   0.964   |     0.4356     |   0.9555    |  0.6428  |
|                drq                |  1   | 0.8541 |  0.8541   |      nan       |   0.8541    |  0.6427  |
|             resnet18              |  16  | 0.9846 |  0.7907   |      nan       |   0.7038    |  0.6163  |
|           lennard_jones           | 1000 |  1.0   |    1.0    |     0.3712     |   1.0947    |  0.5646  |
|      nvidia_deeprecommender       | 256  | 0.5598 |  0.5598   |     0.4734     |   0.5598    |  0.5598  |
|          pytorch_struct           | 200  |  1.0   |  0.5079   |     0.4824     |   0.5079    |  0.4222  |
|       functorch_dp_cifar10        |  64  | 0.9626 |  0.8251   |      nan       |   0.8254    |  0.4037  |
|            hf_Reformer            |  4   | 0.3011 |    nan    |     0.1803     |     nan     |  0.299   |
|              yolov3               |  16  | 1.0072 |  0.8533   |      nan       |   0.8915    |   nan    |
|           hf_Longformer           |  2   | 0.9603 |  0.9603   |     0.288      |     nan     |   nan    |
|             tacotron2             |  64  | 0.9922 |  1.1046   |      nan       |     nan     |   nan    |
|               hf_T5               |  8   | 0.9527 |  0.9446   |      nan       |     nan     |   nan    |
|           hf_GPT2_large           |  4   | 0.936  |  0.8771   |      nan       |     nan     |   nan    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+

huggingface suite with float32 precision

Performance speedup

+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|       MT5ForConditionalGeneration       | 2  | 1.027  |  0.9168   |      0.0       |     0.0     |  4.3687  |
|           ElectraForCausalLM            | 1  | 1.0453 |  0.9369   |      0.0       |     0.0     |  4.1923  |
|            YituTechConvBert             | 1  | 1.0289 |  0.9299   |      0.0       |     0.0     |  3.4016  |
|         MegatronBertForCausalLM         | 2  | 1.0372 |  0.9357   |      0.0       |     0.0     |  2.8899  |
|     M2M100ForConditionalGeneration      | 2  | 1.0114 |  0.9048   |      0.0       |     0.0     |  2.8587  |
|     MobileBertForQuestionAnswering      | 32 | 1.0194 |  0.9122   |      0.0       |     0.0     |  2.7823  |
|          MobileBertForMaskedLM          | 16 | 1.0195 |   0.903   |      0.0       |     0.0     |  2.6125  |
|             OPTForCausalLM              | 4  | 1.0186 |   0.897   |      0.0       |     0.0     |  2.5828  |
|           RobertaForCausalLM            | 4  | 1.0437 |  0.9334   |      0.0       |     0.0     |  2.5069  |
|             XGLMForCausalLM             | 1  | 1.0146 |  0.8742   |      0.0       |     0.0     |  2.4941  |
|                CamemBert                | 1  | 1.0435 |  0.9498   |      0.0       |     0.0     |  2.2953  |
|     PegasusForConditionalGeneration     | 4  | 1.0124 |  0.8918   |      0.0       |     0.0     |  2.0816  |
|               DistillGPT2               | 1  | 1.0299 |  0.9446   |      0.0       |     0.0     |  1.9655  |
|               GoogleFnet                | 1  | 1.0046 |  0.8137   |      0.0       |   1.1324    |  1.8265  |
|    MegatronBertForQuestionAnswering     | 8  | 1.0398 |  0.9391   |      0.0       |     0.0     |  1.7417  |
|     PLBartForConditionalGeneration      | 8  | 1.0168 |  0.9089   |      0.0       |     0.0     |  1.7175  |
|      GPT2ForSequenceClassification      | 4  | 0.9988 |  0.9775   |      0.0       |     0.0     |  1.6644  |
|      MBartForConditionalGeneration      | 8  | 1.0163 |  0.9134   |      0.0       |     0.0     |  1.4676  |
|            XLNetLMHeadModel             | 4  | 0.9998 |  0.9649   |      0.0       |     0.0     |  1.4274  |
|       T5ForConditionalGeneration        | 4  | 0.9982 |  0.9723   |      0.0       |     0.0     |  1.3487  |
|            TrOCRForCausalLM             | 8  | 1.0117 |  0.9445   |      0.0       |     0.0     |  1.3447  |
|       AlbertForQuestionAnswering        | 2  |  1.0   |  1.0001   |      0.0       |     0.0     |  1.3067  |
|            AlbertForMaskedLM            | 2  | 1.0006 |  0.9979   |      0.0       |     0.0     |   1.3    |
|       DebertaForQuestionAnswering       | 4  | 0.9388 |  0.7464   |     0.794      |     0.0     |  1.2795  |
|    LayoutLMForSequenceClassification    | 16 | 0.9994 |  0.9881   |      0.0       |     0.0     |  1.2534  |
|         Speech2Text2ForCausalLM         | 64 | 1.0101 |  0.9381   |      0.0       |     0.0     |  1.2338  |
|                 T5Small                 | 1  | 1.022  |  0.9544   |      0.0       |     0.0     |  1.2217  |
|           PegasusForCausalLM            | 8  | 1.0118 |   0.92    |      0.0       |     0.0     |  1.2173  |
|      BartForConditionalGeneration       | 1  | 1.0142 |  0.9898   |      0.0       |     0.0     |  1.2117  |
|     DistilBertForQuestionAnswering      | 32 | 1.0293 |  0.9825   |      0.0       |     0.0     |  1.1948  |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0107 |  0.9413   |      0.0       |     0.0     |  1.1946  |
|          DistilBertForMaskedLM          | 16 | 1.0288 |   0.978   |      0.0       |     0.0     |  1.1572  |
|            PLBartForCausalLM            | 16 | 1.0098 |  0.9437   |      0.0       |     0.0     |  1.1312  |
|             BartForCausalLM             | 2  | 0.9998 |  0.9666   |      0.0       |     0.0     |  1.1055  |
|       RobertaForQuestionAnswering       | 64 | 0.9986 |  0.9822   |      0.0       |     0.0     |  1.0941  |
|            MBartForCausalLM             | 16 |  1.01  |  0.9621   |      0.0       |     0.0     |  1.0884  |
|                 BigBird                 | 1  | 0.9892 |  0.9347   |      0.0       |     0.0     |  1.0879  |
|        BertForQuestionAnswering         | 64 | 0.9987 |   0.981   |      0.0       |     0.0     |  1.0865  |
|             BertForMaskedLM             | 64 | 0.9988 |  0.9623   |      0.0       |     0.0     |  1.0409  |
|           DebertaForMaskedLM            | 4  | 0.9388 |  0.8149   |     0.7231     |     0.0     |  1.0161  |
|       BlenderbotSmallForCausalLM        | 64 | 1.001  |  0.9085   |      0.0       |     0.0     |  1.008   |
|          AllenaiLongformerBase          | 1  | 0.9551 |  0.8695   |     0.7833     |     0.0     |   0.0    |
|       ElectraForQuestionAnswering       | 64 | 0.999  |  0.9853   |      0.0       |     0.0     |   0.0    |
|           LayoutLMForMaskedLM           | 16 | 0.9991 |  0.9699   |      0.0       |     0.0     |   0.0    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
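
To summarize a speedup table like the one above in a single number per backend, one option is a geometric mean over the models that actually ran. The sketch below is not part of the dashboard tooling; `parse_table` and `geomean_speedup` are hypothetical helpers, and it assumes that a speedup of 0.0 marks a run that did not complete, so those rows are excluded rather than averaged in. The three sample rows are copied from the table above, trimmed to two backends.

```python
import math


def parse_table(text):
    """Parse a '+---+'-bordered ASCII table into a list of row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip the '+----+' border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(rows, backend):
    """Return (geometric mean, count) over rows with a positive speedup.

    Assumption: 0.0 (or nan) means the backend failed on that model.
    """
    values = [float(row[backend]) for row in rows]
    ok = [v for v in values if not math.isnan(v) and v > 0.0]
    if not ok:
        return float("nan"), 0
    return math.exp(sum(math.log(v) for v in ok) / len(ok)), len(ok)


sample = """
+-----------------------------+----+--------+----------+
|            name             | bs | eager  | inductor |
+-----------------------------+----+--------+----------+
| MT5ForConditionalGeneration | 2  | 1.027  |  4.3687  |
|      AlbertForMaskedLM      | 2  | 1.0006 |   1.3    |
|     LayoutLMForMaskedLM     | 16 | 0.9991 |   0.0    |
+-----------------------------+----+--------+----------+
"""

rows = parse_table(sample)
inductor_geomean, inductor_count = geomean_speedup(rows, "inductor")
print(f"inductor: geomean over {inductor_count} of {len(rows)} models = {inductor_geomean:.4f}")
```

Reporting the count next to the mean matters here: a backend that only runs on a handful of models can otherwise look better than one that runs on all of them.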

Accuracy

+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser |  inductor   |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
|               GoogleFnet                | 1  |  pass  |   pass    |  fail_to_run   |    pass     |    pass     |
|             BartForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             BertForMaskedLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|        BertForQuestionAnswering         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                 BigBird                 | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       BlenderbotSmallForCausalLM        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
| BlenderbotSmallForConditionalGeneration | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                CamemBert                | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           DebertaForMaskedLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       DebertaForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|          DistilBertForMaskedLM          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     DistilBertForQuestionAnswering      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|               DistillGPT2               | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           ElectraForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       ElectraForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|      GPT2ForSequenceClassification      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           LayoutLMForMaskedLM           | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|    LayoutLMForSequenceClassification    | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            MBartForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       MT5ForConditionalGeneration       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|         MegatronBertForCausalLM         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|    MegatronBertForQuestionAnswering     | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|          MobileBertForMaskedLM          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     MobileBertForQuestionAnswering      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             OPTForCausalLM              | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            PLBartForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           PegasusForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     PegasusForConditionalGeneration     | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           RobertaForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       RobertaForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|         Speech2Text2ForCausalLM         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       T5ForConditionalGeneration        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                 T5Small                 | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            TrOCRForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            XLNetLMHeadModel             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            YituTechConvBert             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            AlbertForMaskedLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|       AlbertForQuestionAnswering        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|          AllenaiLongformerBase          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|      MBartForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|     PLBartForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|      BartForConditionalGeneration       | 0  | 0.0000 |  0.0000   |     0.0000     |   0.0000    |   0.0000    |
|     M2M100ForConditionalGeneration      | 0  | 0.0000 |  0.0000   |     0.0000     |   0.0000    |   0.0000    |
|             XGLMForCausalLM             | 0  | 0.0000 |  0.0000   |     0.0000     |   0.0000    |   0.0000    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
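
A quick way to read an accuracy table like the one above is to tally the status strings per backend. The following is a minimal sketch, not the dashboard's own code; `parse_table` and `status_counts` are hypothetical helpers. It makes no assumption about what a placeholder such as `0.0000` means and simply counts it as its own status. The sample rows are copied from the table above.

```python
from collections import Counter


def parse_table(text):
    """Parse a '+---+'-bordered ASCII table into a list of row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip the '+----+' border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def status_counts(rows, backends):
    """Count each distinct status string (pass, fail_to_run, ...) per backend."""
    return {backend: Counter(row[backend] for row in rows) for backend in backends}


sample = """
+-------------------+----+------+-----------+----------------+-------------+-------------+
|       name        | bs | eager | aot_eager | aot_cudagraphs | aot_nvfuser |  inductor   |
+-------------------+----+------+-----------+----------------+-------------+-------------+
|    GoogleFnet     | 1  | pass |   pass    |  fail_to_run   |    pass     |    pass     |
|  BartForCausalLM  | 1  | pass |   pass    |  fail_to_run   | fail_to_run |    pass     |
| AlbertForMaskedLM | 1  | pass |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
+-------------------+----+------+-----------+----------------+-------------+-------------+
"""

backends = ["eager", "aot_eager", "aot_cudagraphs", "aot_nvfuser", "inductor"]
counts = status_counts(parse_table(sample), backends)
for backend in backends:
    print(backend, dict(counts[backend]))
```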

Compilation latency (sec)

+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
|                  name                   | bs |  eager   | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
|            XLNetLMHeadModel             | 4  | 17.8864  |  36.3251  |      nan       |     nan     | 317.1367 |
|          MobileBertForMaskedLM          | 16 | 135.2893 | 155.4058  |      nan       |     nan     | 271.7268 |
|     MobileBertForQuestionAnswering      | 32 | 133.5434 | 156.8102  |      nan       |     nan     | 252.473  |
|     M2M100ForConditionalGeneration      | 2  | 25.5586  |  37.8759  |      nan       |     nan     | 222.3239 |
|       MT5ForConditionalGeneration       | 2  |  6.4161  |  16.6703  |      nan       |     nan     | 179.1136 |
|            YituTechConvBert             | 1  |  8.9448  |  16.5143  |      nan       |     nan     | 176.3369 |
|       T5ForConditionalGeneration        | 4  |  3.7734  |  10.937   |      nan       |     nan     | 175.8095 |
|             XGLMForCausalLM             | 1  | 15.1297  |  24.758   |      nan       |     nan     | 168.856  |
|      MBartForConditionalGeneration      | 8  | 26.0665  |  38.8293  |      nan       |     nan     | 168.5349 |
|     PegasusForConditionalGeneration     | 4  | 25.6046  |  38.5024  |      nan       |     nan     | 157.4908 |
|           DebertaForMaskedLM            | 4  |  7.1369  |  13.2473  |    49.7312     |     nan     | 149.1406 |
|      BartForConditionalGeneration       | 1  | 25.5661  |  37.9984  |      nan       |     nan     | 148.0126 |
|    MegatronBertForQuestionAnswering     | 8  | 16.2073  |  25.7688  |      nan       |     nan     | 137.4447 |
|         MegatronBertForCausalLM         | 2  |  16.236  |  26.2057  |      nan       |     nan     | 136.8413 |
| BlenderbotSmallForConditionalGeneration | 32 | 11.9456  |  19.9424  |      nan       |     nan     | 134.1319 |
|                 T5Small                 | 1  |  3.7531  |   10.71   |      nan       |     nan     | 133.5697 |
|     PLBartForConditionalGeneration      | 8  |  7.3286  |   13.74   |      nan       |     nan     | 132.5687 |
|       DebertaForQuestionAnswering       | 4  |  6.9868  |  12.9985  |    50.6366     |     nan     | 114.6722 |
|           RobertaForCausalLM            | 4  |  5.2682  |  9.8593   |      nan       |     nan     | 100.9032 |
|    LayoutLMForSequenceClassification    | 16 |  5.1824  |  9.9437   |      nan       |     nan     | 92.2545  |
|           PegasusForCausalLM            | 8  |  9.8456  |  14.438   |      nan       |     nan     | 88.2178  |
|            MBartForCausalLM             | 16 |  9.8451  |  14.2179  |      nan       |     nan     | 85.4066  |
|             OPTForCausalLM              | 4  |  4.6586  |  9.5188   |      nan       |     nan     | 77.5511  |
|             BertForMaskedLM             | 64 |  4.9281  |  9.7456   |      nan       |     nan     |  77.007  |
|      GPT2ForSequenceClassification      | 4  |  3.4782  |  8.0937   |      nan       |     nan     | 76.4033  |
|             BartForCausalLM             | 2  |  9.6334  |   14.23   |      nan       |     nan     | 76.2828  |
|           ElectraForCausalLM            | 1  |  5.0797  |  9.7233   |      nan       |     nan     | 72.6091  |
|            TrOCRForCausalLM             | 8  |  10.038  |  14.4735  |      nan       |     nan     | 70.3343  |
|       BlenderbotSmallForCausalLM        | 64 |  4.7331  |  7.8131   |      nan       |     nan     | 68.4415  |
|         Speech2Text2ForCausalLM         | 64 |  3.1545  |  5.4563   |      nan       |     nan     | 65.9358  |
|               DistillGPT2               | 1  |  1.4438  |  3.7992   |      nan       |     nan     | 63.1728  |
|            PLBartForCausalLM            | 16 |  3.2604  |  5.7169   |      nan       |     nan     | 61.9116  |
|        BertForQuestionAnswering         | 64 |  4.8664  |  9.6553   |      nan       |     nan     | 60.3864  |
|     DistilBertForQuestionAnswering      | 32 |  1.7088  |  4.0654   |      nan       |     nan     | 60.3565  |
|                CamemBert                | 1  |  5.0927  |  9.6565   |      nan       |     nan     | 59.6166  |
|       RobertaForQuestionAnswering       | 64 |  4.8469  |  9.7659   |      nan       |     nan     |  59.427  |
|                 BigBird                 | 1  | 10.8289  |  16.7412  |      nan       |     nan     | 58.8768  |
|            AlbertForMaskedLM            | 2  |  1.2433  |  5.8391   |      nan       |     nan     | 56.5995  |
|       AlbertForQuestionAnswering        | 2  |  1.2235  |  5.7785   |      nan       |     nan     | 47.9866  |
|          DistilBertForMaskedLM          | 16 |  1.7344  |   4.111   |      nan       |     nan     | 46.6191  |
|               GoogleFnet                | 1  |  1.9789  |  4.2376   |      nan       |   10.744    |  42.907  |
|          AllenaiLongformerBase          | 1  | 11.4511  |  19.2509  |     86.117     |     nan     |   nan    |
|           LayoutLMForMaskedLM           | 16 |  5.5414  |  10.3348  |      nan       |     nan     |   nan    |
|       ElectraForQuestionAnswering       | 64 |  4.8934  |  9.6669   |      nan       |     nan     |   nan    |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+

Peak Memory Compression Ratio

+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|      GPT2ForSequenceClassification      | 4  | 0.9342 |  0.9091   |      nan       |     nan     |  1.0318  |
|            XLNetLMHeadModel             | 4  | 1.0001 |  0.8976   |      nan       |     nan     |  0.9717  |
|    LayoutLMForSequenceClassification    | 16 |  1.0   |  0.9348   |      nan       |     nan     |  0.9339  |
|        BertForQuestionAnswering         | 64 |  1.0   |  0.9467   |      nan       |     nan     |  0.9145  |
|       RobertaForQuestionAnswering       | 64 |  1.0   |  0.9467   |      nan       |     nan     |  0.9145  |
|                 T5Small                 | 1  |  1.0   |  0.9325   |      nan       |     nan     |  0.8445  |
|     DistilBertForQuestionAnswering      | 32 |  1.0   |  0.9046   |      nan       |     nan     |  0.8394  |
|             BertForMaskedLM             | 64 |  1.0   |  0.9219   |      nan       |     nan     |  0.8321  |
|             BartForCausalLM             | 2  |  1.0   |  0.8847   |      nan       |     nan     |  0.8303  |
|                 BigBird                 | 1  | 1.0001 |  0.9549   |      nan       |     nan     |  0.8224  |
|          DistilBertForMaskedLM          | 16 | 0.9998 |  0.9138   |      nan       |     nan     |  0.8055  |
|            PLBartForCausalLM            | 16 | 0.9997 |  0.8802   |      nan       |     nan     |  0.8028  |
|            MBartForCausalLM             | 16 |  1.0   |  0.8629   |      nan       |     nan     |  0.8005  |
|               DistillGPT2               | 1  | 1.0003 |  0.7721   |      nan       |     nan     |  0.7997  |
|         Speech2Text2ForCausalLM         | 64 |  1.0   |   0.88    |      nan       |     nan     |  0.7767  |
|       T5ForConditionalGeneration        | 4  |  1.0   |  0.9597   |      nan       |     nan     |  0.7754  |
|             XGLMForCausalLM             | 1  | 0.9999 |  0.9999   |      nan       |     nan     |  0.7728  |
|      BartForConditionalGeneration       | 1  |  1.0   |  0.8465   |      nan       |     nan     |  0.7708  |
| BlenderbotSmallForConditionalGeneration | 32 |  1.0   |  0.9036   |      nan       |     nan     |  0.7612  |
|     PLBartForConditionalGeneration      | 8  | 0.9997 |  0.8222   |      nan       |     nan     |  0.7547  |
|                CamemBert                | 1  | 0.998  |  0.7977   |      nan       |     nan     |  0.7369  |
|            YituTechConvBert             | 1  | 0.9858 |  0.7923   |      nan       |     nan     |  0.7298  |
|            TrOCRForCausalLM             | 8  |  1.0   |  0.8048   |      nan       |     nan     |  0.7284  |
|       BlenderbotSmallForCausalLM        | 64 |  1.0   |  0.8401   |      nan       |     nan     |  0.7277  |
|      MBartForConditionalGeneration      | 8  |  1.0   |  0.8137   |      nan       |     nan     |  0.727   |
|             OPTForCausalLM              | 4  | 0.9979 |   0.75    |      nan       |     nan     |  0.714   |
|           RobertaForCausalLM            | 4  | 0.9058 |  0.7778   |      nan       |     nan     |  0.7099  |
|           PegasusForCausalLM            | 8  |  1.0   |  0.9323   |      nan       |     nan     |  0.7012  |
|    MegatronBertForQuestionAnswering     | 8  | 0.923  |  0.8265   |      nan       |     nan     |  0.6997  |
|               GoogleFnet                | 1  | 1.0003 |  0.9447   |      nan       |   1.0813    |  0.6953  |
|     M2M100ForConditionalGeneration      | 2  | 0.9795 |   0.979   |      nan       |     nan     |  0.6702  |
|         MegatronBertForCausalLM         | 2  | 0.7066 |  0.7066   |      nan       |     nan     |  0.6453  |
|     PegasusForConditionalGeneration     | 4  | 0.9721 |  0.9004   |      nan       |     nan     |  0.642   |
|       MT5ForConditionalGeneration       | 2  | 0.6173 |  0.6173   |      nan       |     nan     |  0.6173  |
|       AlbertForQuestionAnswering        | 2  |  1.0   |  0.9369   |      nan       |     nan     |  0.6126  |
|           ElectraForCausalLM            | 1  |  1.0   |  0.9107   |      nan       |     nan     |  0.6123  |
|            AlbertForMaskedLM            | 2  | 0.9999 |  0.9172   |      nan       |     nan     |  0.6027  |
|          MobileBertForMaskedLM          | 16 | 0.9997 |  0.9179   |      nan       |     nan     |  0.5861  |
|     MobileBertForQuestionAnswering      | 32 |  1.0   |  0.9716   |      nan       |     nan     |  0.4668  |
|           DebertaForMaskedLM            | 4  |  1.0   |  0.9851   |     0.352      |     nan     |  0.4265  |
|       DebertaForQuestionAnswering       | 4  | 0.9845 |  1.0525   |     0.3277     |     nan     |  0.3569  |
|          AllenaiLongformerBase          | 1  | 0.9988 |  0.9515   |     0.3144     |     nan     |   nan    |
|       ElectraForQuestionAnswering       | 64 |  1.0   |  0.9524   |      nan       |     nan     |   nan    |
|           LayoutLMForMaskedLM           | 16 |  1.0   |  0.9409   |      nan       |     nan     |   nan    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
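
For readers interpreting the ratios above, here is a minimal sketch of how a peak memory compression ratio can be computed. It assumes the ratio is the eager peak memory divided by the backend's peak memory, so values above 1.0 mean the backend used less memory than eager. The function name and the byte counts are hypothetical and are not taken from the benchmark scripts.

```python
def compression_ratio(eager_peak_bytes, backend_peak_bytes):
    """Return eager peak memory divided by backend peak memory.

    Assumption: this mirrors the dashboard's definition, where a value
    above 1.0 means the backend used less peak memory than eager.
    """
    if backend_peak_bytes <= 0:
        # No usable measurement (e.g. the run failed), so report nan.
        return float("nan")
    return eager_peak_bytes / backend_peak_bytes


# Hypothetical measurements, in bytes.
eager_peak = 4_000_000_000
backend_peak = 5_000_000_000
ratio = compression_ratio(eager_peak, backend_peak)
print(f"compression ratio: {ratio:.4f}")
```

Under this assumption, a ratio below 1.0 (as in most inductor rows here) would mean the backend's peak memory was higher than eager's.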

timm_models suite with float32 precision

Performance speedup

+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|        res2net50_14w_8s         |  2  | 0.9983 |   1.027   |      0.0       |   1.4439    |  4.7917  |
|            hrnet_w18            |  2  | 1.0076 |  1.0877   |      0.0       |   1.4906    |  4.6235  |
|           res2next50            |  2  | 1.0034 |  1.0445   |      0.0       |   1.3722    |  4.1476  |
|         coat_lite_mini          | 128 |  1.0   |  0.9994   |      0.0       |   1.0739    |  1.7094  |
|          ghostnet_100           | 128 | 0.9985 |  0.9939   |      0.0       |    1.249    |  1.5956  |
|        tnt_s_patch16_224        | 64  | 0.9997 |  0.9961   |      0.0       |   1.5683    |  1.5095  |
|        twins_pcpvt_base         | 32  | 1.0037 |  0.9738   |      0.0       |   1.3525    |  1.4376  |
|      xcit_large_24_p8_224       |  5  | 1.0006 |  0.9883   |      0.0       |     0.0     |  1.4149  |
|         crossvit_9_240          | 64  | 1.0049 |  0.9992   |      0.0       |   1.0961    |  1.405   |
|           volo_d1_224           | 64  | 0.9995 |  0.9952   |      0.0       |   1.1385    |  1.3979  |
|            nfnet_l0             | 64  | 0.9996 |  0.7979   |      0.0       |   1.0535    |  1.3819  |
|          gmixer_24_224          | 64  | 0.999  |  0.8428   |      0.0       |   0.9942    |  1.3536  |
|          jx_nest_base           | 32  | 0.9995 |  0.9942   |      0.0       |   1.2243    |  1.2913  |
|            lcnet_050            | 128 | 0.9564 |  0.9466   |      0.0       |   1.5001    |  1.2739  |
|           convit_base           | 32  | 0.9992 |  0.9931   |      0.0       |   1.1944    |  1.2661  |
|          convnext_base          | 32  | 0.9994 |   0.994   |      0.0       |   1.0411    |  1.2019  |
|          cait_m36_384           |  2  | 0.9981 |  0.9894   |      0.0       |   0.9966    |  1.196   |
|          gmlp_s16_224           | 64  | 0.9989 |  0.9964   |      0.0       |   0.9982    |  1.1454  |
|      beit_base_patch16_224      | 64  | 0.9998 |  0.9743   |      0.0       |   0.9541    |  1.1235  |
| deit_base_distilled_patch16_224 | 64  | 0.9997 |   0.998   |      0.0       |   1.0189    |  1.1047  |
|           regnety_002           | 128 | 0.9778 |  0.9883   |      0.0       |   1.3588    |  1.101   |
|      vit_base_patch16_224       | 64  | 0.9998 |  0.9982   |      0.0       |   0.9778    |  1.0942  |
|          mixer_b16_224          | 64  | 0.9997 |  0.9973   |      0.0       |   0.9836    |  1.0789  |
|           tf_mixnet_l           | 64  | 0.9714 |  0.8744   |      0.0       |   1.0062    |  1.0438  |
|          resmlp_12_224          | 128 | 0.9998 |  0.9997   |      0.0       |     0.0     |  1.0094  |
|            mixnet_l             | 64  | 0.9707 |  0.8727   |      0.0       |   1.0055    |  1.0017  |
|             dpn107              | 32  | 0.9584 |  0.9514   |      0.0       |    1.029    |  0.9988  |
|             dla102              | 64  | 0.9992 |  0.9967   |      0.0       |   1.2857    |  0.9897  |
|            gernet_l             | 128 | 0.9739 |   0.969   |      0.0       |   1.0979    |  0.9142  |
|           resnest101e           | 32  | 1.0011 |   1.018   |      0.0       |    1.204    |  0.9009  |
|            repvgg_a2            | 128 | 0.9634 |  0.9621   |      0.0       |   1.1211    |  0.8987  |
|           mobilevit_s           | 32  | 0.9749 |  0.7654   |      0.0       |   0.9566    |  0.8956  |
|         visformer_small         | 128 | 1.0001 |  1.0006   |      0.0       |   1.0204    |  0.8732  |
|           selecsls42b           | 128 | 0.9998 |  0.9983   |      0.0       |   1.2088    |  0.8727  |
|          cspdarknet53           | 64  | 0.9586 |  0.9504   |      0.0       |   1.1831    |  0.8635  |
|           mnasnet_100           | 128 | 0.9646 |  0.9634   |      0.0       |   1.1533    |  0.8582  |
|            fbnetv3_b            | 128 | 0.9648 |  0.9584   |      0.0       |   1.1334    |  0.8559  |
|        sebotnet33ts_256         | 64  | 0.9761 |  0.8072   |      0.0       |   1.0537    |  0.8532  |
|            tinynet_a            | 128 | 0.9662 |  0.7755   |      0.0       |   0.9712    |  0.8438  |
|      mobilenetv3_large_100      | 128 | 0.9659 |  0.9624   |      0.0       |   1.1625    |  0.793   |
|        res2net101_26w_4s        | 64  | 0.9987 |  0.9969   |      0.0       |   1.1757    |  0.7829  |
|       tf_efficientnet_b0        | 128 | 0.9763 |  0.7833   |      0.0       |   0.9849    |  0.7726  |
|          spnasnet_100           | 128 | 0.961  |  0.9581   |      0.0       |   1.1386    |  0.7679  |
|        eca_halonext26ts         | 64  | 0.9745 |  0.7769   |      0.0       |   1.0166    |  0.7612  |
|           fbnetc_100            | 128 | 0.9657 |  0.9619   |      0.0       |   1.1839    |  0.7582  |
|         mobilenetv2_100         | 128 | 0.9666 |  0.9604   |      0.0       |   1.0141    |  0.699   |
|       eca_botnext26ts_256       | 64  | 0.9736 |  0.7695   |      0.0       |   1.0172    |  0.6956  |
|           rexnet_100            | 128 | 0.9729 |  0.8138   |      0.0       |    0.983    |  0.6949  |
|        ese_vovnet19b_dw         | 128 | 0.9788 |  0.9775   |      0.0       |   1.1442    |  0.6341  |
|          botnet26t_256          | 128 | 0.9849 |   0.985   |      0.0       |   1.2249    |   0.0    |
|           dm_nfnet_f0           | 128 | 0.9998 |  0.9994   |      0.0       |   1.2112    |   0.0    |
|        adv_inception_v3         | 128 |  1.0   |  0.9987   |      0.0       |   1.1247    |   0.0    |
|          inception_v3           | 128 |  1.0   |  0.9982   |      0.0       |   1.1244    |   0.0    |
|       gluon_inception_v3        | 128 | 0.9999 |  0.9986   |      0.0       |   1.1222    |   0.0    |
|     swsl_resnext101_32x16d      | 32  | 0.9994 |  0.9989   |      0.0       |   1.1076    |   0.0    |
|          pnasnet5large          | 16  | 0.9988 |  0.9982   |      0.0       |   1.0821    |   0.0    |
|        convmixer_768_32         | 32  | 0.9998 |  0.9999   |      0.0       |    1.061    |   0.0    |
|            pit_b_224            | 64  | 0.9998 |  0.9976   |      0.0       |   1.0601    |   0.0    |
|        gluon_xception65         | 32  | 0.9992 |  0.9976   |      0.0       |   1.0409    |   0.0    |
|         poolformer_m36          | 64  | 0.9994 |  0.9985   |      0.0       |   1.0061    |   0.0    |
|  swin_base_patch4_window7_224   | 64  | 0.9998 |  0.9787   |      0.0       |   0.9982    |   0.0    |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
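
To summarize a speedup table like the one above, the rows can be parsed and reduced to a geometric mean per backend. The sketch below is illustrative only: the helper names are hypothetical, the sample reuses three rows from the table, and it assumes that `0.0` and `nan` mark runs that did not produce a result, so they are skipped rather than averaged.

```python
import math


def parse_ascii_table(text):
    """Parse a '+---+'-bordered table into a list of dicts keyed by header."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.strip().splitlines()
        if line.strip().startswith("|")  # skip the '+---+' border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(records, backend):
    """Geometric mean of one backend's speedups, skipping 0.0 and nan."""
    values = []
    for rec in records:
        try:
            value = float(rec[backend])
        except (KeyError, ValueError):
            continue
        # Assumption: 0.0 and nan both mean "no result for this model".
        if value > 0.0 and not math.isnan(value):
            values.append(value)
    if not values:
        return float("nan")
    return math.exp(sum(math.log(v) for v in values) / len(values))


SAMPLE = """
+------------------+-----+--------+-----------+----------------+-------------+----------+
|       name       | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+------------------+-----+--------+-----------+----------------+-------------+----------+
| res2net50_14w_8s |  2  | 0.9983 |   1.027   |      0.0       |   1.4439    |  4.7917  |
|    hrnet_w18     |  2  | 1.0076 |  1.0877   |      0.0       |   1.4906    |  4.6235  |
|  botnet26t_256   | 128 | 0.9849 |   0.985   |      0.0       |   1.2249    |   0.0    |
+------------------+-----+--------+-----------+----------------+-------------+----------+
"""

records = parse_ascii_table(SAMPLE)
for backend in ("aot_eager", "aot_nvfuser", "inductor"):
    print(backend, round(geomean_speedup(records, backend), 4))
```

Skipping failed runs inflates the summary for backends that fail often, so it is worth reporting the number of models included next to each mean.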

Accuracy

+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
|              name               | bs | eager |   aot_eager   | aot_cudagraphs |  aot_nvfuser  |   inductor    |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
|          convnext_base          | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|          gmixer_24_224          | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|          gmlp_s16_224           | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|          mixer_b16_224          | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|           mnasnet_100           | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|            repvgg_a2            | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|          spnasnet_100           | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|        adv_inception_v3         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      beit_base_patch16_224      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          botnet26t_256          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        convmixer_768_32         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         crossvit_9_240          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          cspdarknet53           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
| deit_base_distilled_patch16_224 | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|             dla102              | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           dm_nfnet_f0           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|             dpn107              | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       eca_botnext26ts_256       | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        eca_halonext26ts         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        ese_vovnet19b_dw         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            gernet_l             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          ghostnet_100           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       gluon_inception_v3        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            hrnet_w18            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          inception_v3           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            lcnet_050            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            mixnet_l             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         mobilenetv2_100         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      mobilenetv3_large_100      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           mobilevit_s           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            nfnet_l0             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          pnasnet5large          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           regnety_002           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        res2net101_26w_4s        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        res2net50_14w_8s         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           res2next50            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           rexnet_100            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        sebotnet33ts_256         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           selecsls42b           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|  swin_base_patch4_window7_224   | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|     swsl_resnext101_32x16d      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       tf_efficientnet_b0        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           tf_mixnet_l           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            tinynet_a            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        tnt_s_patch16_224        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         visformer_small         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      vit_base_patch16_224       | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           volo_d1_224           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          resmlp_12_224          | 2  | pass  |     pass      |      pass      |  fail_to_run  |     pass      |
|           convit_base           | 2  | pass  |     pass      |  fail_to_run   |  fail_to_run  |     pass      |
|      xcit_large_24_p8_224       | 2  | pass  | fail_accuracy |  fail_to_run   |  fail_to_run  |     pass      |
|        gluon_xception65         | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy |     pass      |
|         poolformer_m36          | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy |     pass      |
|         coat_lite_mini          | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|          jx_nest_base           | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|            pit_b_224            | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|        twins_pcpvt_base         | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|           fbnetc_100            | 2  | pass  |     pass      |      pass      |     pass      | fail_accuracy |
|            fbnetv3_b            | 2  | pass  |     pass      |  fail_to_run   |     pass      | fail_accuracy |
|           resnest101e           | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy | fail_accuracy |
|          cait_m36_384           | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+

Compilation latency (sec)

+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
|              name               | bs  |  eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
|            hrnet_w18            |  2  | 97.691  | 128.2568  |      nan       |  297.4634   | 1326.4957 |
|             dpn107              | 32  | 13.3213 |  24.7932  |      nan       |   87.0993   | 1248.9361 |
|           rexnet_100            | 128 | 6.4621  |  12.2169  |      nan       |  106.2673   | 954.6179  |
|        res2net50_14w_8s         |  2  | 19.923  |  34.3528  |      nan       |   87.1183   | 931.7786  |
|           mobilevit_s           | 32  | 5.6473  |  11.2171  |      nan       |   45.1615   | 830.5169  |
|            mixnet_l             | 64  | 13.4325 |  20.6526  |      nan       |   69.3248   | 755.6167  |
|       eca_botnext26ts_256       | 64  | 2.4512  |  6.3653   |      nan       |   49.7973   |  739.881  |
|          ghostnet_100           | 128 | 8.9586  |  16.0244  |      nan       |   65.523    | 667.2918  |
|            tinynet_a            | 128 |  7.716  |  13.3784  |      nan       |   67.6541   | 645.5261  |
|           fbnetc_100            | 128 |  5.434  |  10.4927  |      nan       |   50.2826   | 612.6355  |
|           resnest101e           | 32  | 26.5974 |  40.6878  |      nan       |   100.07    | 606.1691  |
|        twins_pcpvt_base         | 32  | 26.1593 |  36.7425  |      nan       |   69.8539   | 604.2703  |
|            fbnetv3_b            | 128 | 12.668  |  20.6589  |      nan       |   85.5758   | 582.8326  |
|         coat_lite_mini          | 128 | 3.1411  |  7.0172   |      nan       |   16.5631   | 571.6242  |
|           res2next50            |  2  | 7.3453  |  14.7597  |      nan       |   47.873    | 543.6359  |
|             dla102              | 64  | 10.6899 |  18.9579  |      nan       |   71.5604   | 512.4648  |
|           mnasnet_100           | 128 | 4.0287  |  7.8445   |      nan       |   40.2794   | 476.7122  |
|           tf_mixnet_l           | 64  | 13.6793 |  21.1891  |      nan       |   69.8174   | 473.6279  |
|        sebotnet33ts_256         | 64  | 3.7753  |  8.4122   |      nan       |   53.6861   | 472.7106  |
|        eca_halonext26ts         | 64  | 2.5729  |   6.504   |      nan       |   51.8466   | 455.4176  |
|          cspdarknet53           | 64  | 5.8183  |  11.2112  |      nan       |   52.2796   | 454.9286  |
|        res2net101_26w_4s        | 64  | 25.839  |  41.9511  |      nan       |  106.9702   | 414.5372  |
|       tf_efficientnet_b0        | 128 | 5.8682  |  10.6433  |      nan       |   65.7641   | 404.0946  |
|        ese_vovnet19b_dw         | 128 | 1.8725  |  4.1691   |      nan       |   31.8386   | 401.7781  |
|         mobilenetv2_100         | 128 | 4.1951  |  8.0722   |      nan       |   39.9188   | 346.6298  |
|          convnext_base          | 32  | 11.3503 |  16.231   |      nan       |   31.7707   | 334.6469  |
|           regnety_002           | 128 | 4.7306  |  9.0104   |      nan       |   49.7585   | 326.9695  |
|      xcit_large_24_p8_224       |  5  | 36.843  |  52.7637  |      nan       |     nan     |  322.862  |
|          jx_nest_base           | 32  | 9.6403  |  17.4674  |      nan       |   66.114    | 322.1634  |
|      mobilenetv3_large_100      | 128 | 4.3189  |   8.215   |      nan       |   67.2751   | 296.0159  |
|         visformer_small         | 128 | 2.2803  |  5.4314   |      nan       |   25.6553   | 293.7596  |
|          cait_m36_384           |  2  | 48.6937 |  65.4215  |      nan       |   92.2057   | 279.3734  |
|            gernet_l             | 128 | 4.7024  |  9.9219   |      nan       |   39.0823   |  252.524  |
|         crossvit_9_240          | 64  | 7.4244  |  13.9238  |      nan       |   32.9177   | 251.2102  |
|           selecsls42b           | 128 | 2.3137  |  5.5553   |      nan       |   40.3432   | 243.8417  |
|          spnasnet_100           | 128 | 5.3442  |  10.3948  |      nan       |   46.8295   | 227.5039  |
|            lcnet_050            | 128 | 1.9178  |  4.1662   |      nan       |   31.8492   | 219.0054  |
|           volo_d1_224           | 64  |  6.695  |  12.6315  |      nan       |   32.6511   | 192.8301  |
|           convit_base           | 32  | 3.8807  |  8.8518   |      nan       |   21.3229   | 187.4577  |
|          gmlp_s16_224           | 64  | 9.0829  |  14.1574  |      nan       |   21.2325   | 149.2961  |
|        tnt_s_patch16_224        | 64  | 11.8226 |  21.1815  |      nan       |   34.8234   | 140.3073  |
|          gmixer_24_224          | 64  | 8.2047  |  14.0553  |      nan       |   23.6592   | 132.0265  |
|            repvgg_a2            | 128 |  4.598  |   8.933   |      nan       |   46.5715   | 124.4128  |
|          resmlp_12_224          | 128 | 2.6661  |  4.8475   |      nan       |     nan     |  98.1064  |
|            nfnet_l0             | 64  | 5.9174  |  11.4931  |      nan       |   30.9432   |  96.3515  |
|          mixer_b16_224          | 64  | 2.6958  |  5.1905   |      nan       |   12.7396   |  94.3682  |
| deit_base_distilled_patch16_224 | 64  | 3.0897  |   6.374   |      nan       |   12.9275   |  84.8878  |
|      beit_base_patch16_224      | 64  | 4.6591  |  9.1219   |      nan       |   17.496    |  83.1654  |
|      vit_base_patch16_224       | 64  | 2.8722  |  6.2339   |      nan       |   11.5018   |  68.0847  |
|          pnasnet5large          | 16  | 59.4832 |  80.4982  |      nan       |  183.5509   |    nan    |
|        adv_inception_v3         | 128 |  8.161  |  15.6215  |      nan       |   74.7227   |    nan    |
|       gluon_inception_v3        | 128 | 8.2187  |  15.8574  |      nan       |   74.6038   |    nan    |
|          inception_v3           | 128 | 8.1272  |  15.7713  |      nan       |   74.2407   |    nan    |
|  swin_base_patch4_window7_224   | 64  | 11.989  |  21.809   |      nan       |   68.8397   |    nan    |
|        gluon_xception65         | 32  | 14.9902 |  24.9327  |      nan       |   55.4597   |    nan    |
|     swsl_resnext101_32x16d      | 32  | 10.0119 |  18.3546  |      nan       |   49.483    |    nan    |
|          botnet26t_256          | 128 | 2.4012  |  5.6863   |      nan       |   42.0424   |    nan    |
|           dm_nfnet_f0           | 128 | 6.5043  |  11.8243  |      nan       |   34.6682   |    nan    |
|         poolformer_m36          | 64  | 13.1099 |  19.2828  |      nan       |   34.655    |    nan    |
|        convmixer_768_32         | 32  | 6.8749  |  11.8401  |      nan       |   20.2715   |    nan    |
|            pit_b_224            | 64  | 3.6016  |  7.4214   |      nan       |   15.3574   |    nan    |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+

Peak Memory Compression Ratio

+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|          gmixer_24_224          | 64  | 0.9992 |  0.9684   |      nan       |   0.9825    |  1.3808  |
|            nfnet_l0             | 64  | 1.0008 |  0.8298   |      nan       |    0.813    |  1.2558  |
|            tinynet_a            | 128 |  1.0   |  0.7831   |      nan       |   0.7845    |  1.1735  |
|           rexnet_100            | 128 | 0.9992 |  0.7879   |      nan       |    0.871    |  1.1072  |
|           convit_base           | 32  | 1.0001 |  0.8879   |      nan       |   0.9506    |  1.068   |
|         mobilenetv2_100         | 128 | 0.9998 |  0.7664   |      nan       |   0.7679    |  1.0051  |
|           mobilevit_s           | 32  | 0.9999 |  0.7692   |      nan       |   0.7431    |  1.0011  |
|             dla102              | 64  | 0.9881 |  0.9181   |      nan       |   0.9541    |  1.001   |
|        eca_halonext26ts         | 64  | 0.9999 |  0.7717   |      nan       |   0.7731    |  0.9711  |
|       eca_botnext26ts_256       | 64  |  1.0   |  0.7705   |      nan       |   0.7679    |  0.9703  |
|           tf_mixnet_l           | 64  | 1.0001 |   0.861   |      nan       |   0.8605    |  0.9698  |
|          cait_m36_384           |  2  | 1.0001 |  0.9024   |      nan       |   0.9202    |  0.9451  |
|       tf_efficientnet_b0        | 128 | 0.9998 |  0.7727   |      nan       |   0.8426    |  0.9413  |
|          mixer_b16_224          | 64  | 0.9956 |  0.9615   |      nan       |   0.8644    |  0.9357  |
|      beit_base_patch16_224      | 64  |  1.0   |  0.9575   |      nan       |   0.8606    |  0.9272  |
|          gmlp_s16_224           | 64  |  1.0   |  0.9766   |      nan       |    0.966    |  0.9267  |
|      vit_base_patch16_224       | 64  | 0.9963 |  0.9469   |      nan       |   0.8229    |  0.915   |
|        tnt_s_patch16_224        | 64  | 1.0001 |  0.9752   |      nan       |   0.8518    |  0.9131  |
|           volo_d1_224           | 64  | 0.9999 |  0.9247   |      nan       |   0.7472    |  0.9124  |
| deit_base_distilled_patch16_224 | 64  | 0.9964 |  0.9476   |      nan       |   0.8242    |  0.9095  |
|          spnasnet_100           | 128 | 1.0005 |  0.9207   |      nan       |   0.8496    |  0.9024  |
|           selecsls42b           | 128 | 0.9883 |  0.8982   |      nan       |   0.9039    |  0.8999  |
|            mixnet_l             | 64  | 0.9995 |  0.8486   |      nan       |   0.7938    |  0.8993  |
|      mobilenetv3_large_100      | 128 | 1.0002 |  0.8686   |      nan       |   0.8819    |  0.8982  |
|      xcit_large_24_p8_224       |  5  | 0.9999 |  0.9206   |      nan       |     nan     |  0.8952  |
|           resnest101e           | 32  |  1.0   |  0.9458   |      nan       |   0.9449    |  0.8922  |
|          ghostnet_100           | 128 | 0.9998 |  0.8872   |      nan       |    0.947    |  0.8888  |
|         visformer_small         | 128 | 0.9943 |  0.9442   |      nan       |   0.9475    |  0.8883  |
|            fbnetv3_b            | 128 | 0.9995 |  0.7866   |      nan       |   0.7861    |  0.8837  |
|             dpn107              | 32  | 0.9997 |  0.9285   |      nan       |   0.8949    |  0.8763  |
|          convnext_base          | 32  | 1.0001 |  0.9077   |      nan       |   0.7678    |  0.8762  |
|        twins_pcpvt_base         | 32  | 1.0002 |  0.9127   |      nan       |   0.8351    |  0.8723  |
|          cspdarknet53           | 64  |  1.0   |  0.8562   |      nan       |   0.8797    |  0.8624  |
|          jx_nest_base           | 32  | 1.0017 |   0.898   |      nan       |   0.7112    |  0.8574  |
|        ese_vovnet19b_dw         | 128 | 0.9999 |  0.8938   |      nan       |   0.9369    |  0.8467  |
|        sebotnet33ts_256         | 64  |  1.0   |  0.7109   |      nan       |   0.6852    |  0.841   |
|          resmlp_12_224          | 128 | 0.9893 |  0.9525   |      nan       |     nan     |  0.8169  |
|        res2net101_26w_4s        | 64  | 1.0001 |  0.9307   |      nan       |   0.8959    |  0.8167  |
|         crossvit_9_240          | 64  | 1.0001 |  0.8721   |      nan       |    0.729    |  0.8108  |
|           mnasnet_100           | 128 | 1.0003 |  0.9126   |      nan       |   0.8368    |  0.7984  |
|         coat_lite_mini          | 128 | 1.0049 |  0.8826   |      nan       |   0.7873    |   0.79   |
|            lcnet_050            | 128 | 1.0005 |  0.7721   |      nan       |   0.7722    |  0.7579  |
|           regnety_002           | 128 | 0.9981 |   0.829   |      nan       |   0.7759    |  0.7465  |
|            gernet_l             | 128 |  1.0   |  0.7965   |      nan       |   0.8012    |  0.727   |
|           fbnetc_100            | 128 | 0.9998 |  0.8597   |      nan       |   0.7507    |  0.7246  |
|            hrnet_w18            |  2  | 0.9986 |  0.8792   |      nan       |   0.8869    |  0.6089  |
|           res2next50            |  2  |  1.0   |  0.8353   |      nan       |   0.8404    |  0.606   |
|        res2net50_14w_8s         |  2  |  1.0   |  0.8387   |      nan       |   0.8474    |  0.5877  |
|            repvgg_a2            | 128 | 1.0003 |  0.8145   |      nan       |   0.6633    |  0.536   |
|          pnasnet5large          | 16  | 1.069  |   1.011   |      nan       |   1.2062    |   nan    |
|        convmixer_768_32         | 32  |  1.0   |  0.9868   |      nan       |   0.9807    |   nan    |
|           dm_nfnet_f0           | 128 | 0.9393 |   0.897   |      nan       |   0.9515    |   nan    |
|         poolformer_m36          | 64  | 1.0003 |  0.9533   |      nan       |   0.9368    |   nan    |
|        gluon_xception65         | 32  | 0.9999 |  0.9384   |      nan       |   0.9001    |   nan    |
|        adv_inception_v3         | 128 | 1.0002 |  0.8694   |      nan       |    0.88     |   nan    |
|       gluon_inception_v3        | 128 | 1.0002 |  0.8694   |      nan       |    0.88     |   nan    |
|          inception_v3           | 128 | 1.0002 |  0.8694   |      nan       |    0.88     |   nan    |
|     swsl_resnext101_32x16d      | 32  | 1.0003 |  0.8983   |      nan       |   0.8684    |   nan    |
|  swin_base_patch4_window7_224   | 64  | 0.9999 |  0.9309   |      nan       |    0.83     |   nan    |
|          botnet26t_256          | 128 |  1.0   |  0.8494   |      nan       |   0.7497    |   nan    |
|            pit_b_224            | 64  | 0.9992 |  0.7962   |      nan       |   0.6417    |   nan    |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+

Performance graphs


bench_logs/timm_models_float32.png :

bench_logs/torchbench_float32.png :

bench_logs/huggingface_float32.png :

@anijain2305
Contributor Author

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and the gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.
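The accuracy check described above can be pictured with a minimal sketch. It assumes outputs and gradients have already been flattened into plain lists of floats; the tolerances and the `fail_accuracy` label are hypothetical and are not taken from the benchmark harness.

```python
import math
from typing import Sequence


def same(reference: Sequence[float], candidate: Sequence[float],
         rel_tol: float = 1e-4, abs_tol: float = 1e-4) -> bool:
    """Return True when every candidate value is close to its reference value."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def accuracy_status(ref_outputs: Sequence[float], ref_grads: Sequence[float],
                    new_outputs: Sequence[float], new_grads: Sequence[float]) -> str:
    """Compare forward outputs and gradients of a backend against the baseline."""
    if same(ref_outputs, new_outputs) and same(ref_grads, new_grads):
        return "pass"
    return "fail_accuracy"  # hypothetical label, for illustration only
```

A backend only counts as passing here when both the outputs and the gradients agree with the baseline run, which mirrors the wording of the summary rather than the actual implementation.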

Caveats

  1. Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
  2. Experiments do not cover dynamic shapes.
  3. The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
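As a rough illustration of that filtering step, the sketch below drops failing models before taking the geometric mean of the per-model speedups. It assumes that a speedup of 0.0 in the per-model tables marks a model with no valid measurement; the function name and the `failed` argument are hypothetical.

```python
import math
from typing import Dict, Iterable


def geomean_speedup(speedups: Dict[str, float],
                    failed: Iterable[str] = ()) -> float:
    """Geometric mean over models that passed and have a positive speedup."""
    excluded = set(failed)
    kept = [
        value for name, value in speedups.items()
        if name not in excluded and value > 0.0  # assumption: 0.0 means "no result"
    ]
    if not kept:
        return float("nan")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(v) for v in kept) / len(kept))


# Three inductor entries copied from the huggingface table in this comment.
example = {
    "MT5ForConditionalGeneration": 4.6277,
    "T5Small": 1.358,
    "AlbertForMaskedLM": 0.0,
}
print(f"{geomean_speedup(example):.2f}x")
```

The geometric mean is used because speedups are ratios: a 2x gain and a 0.5x loss then average out to 1x.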

Passrate

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager | 93%, 41/44  |
| inductor  | 64%, 28/44  |
+-----------+-------------+
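The "percent, passed/total" cells above can be reproduced with a small helper. This is a sketch that assumes only an exact `pass` status counts as passing; whether the dashboard also counts statuses such as `pass_due_to_skip` is not stated here, so treat that choice as an assumption.

```python
from typing import Sequence


def passrate(statuses: Sequence[str]) -> str:
    """Format a pass rate as 'NN%, passed/total', matching the table cells."""
    total = len(statuses)
    if total == 0:
        return "0%, 0/0"
    passed = sum(1 for status in statuses if status == "pass")
    return f"{passed / total:.0%}, {passed}/{total}"


# 41 passing models out of 44, as in the aot_eager row above.
print(passrate(["pass"] * 41 + ["fail_to_run"] * 3))  # 93%, 41/44
```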

Geometric mean speedup

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    1.00x    |
| inductor  |    1.76x    |
+-----------+-------------+

Mean compilation time (seconds)

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    20.82    |
| inductor  |    80.93    |
+-----------+-------------+

Peak memory footprint compression ratio (higher is better)

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    0.88x    |
| inductor  |    0.74x    |
+-----------+-------------+
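Since higher is better, one reading of this metric is the baseline peak memory divided by the backend's peak memory. The sketch below follows that reading; the formula and the function name are assumptions, not something confirmed in this comment.

```python
def compression_ratio(baseline_peak_bytes: float, backend_peak_bytes: float) -> float:
    """Baseline peak memory divided by the backend's peak memory (assumed definition)."""
    if backend_peak_bytes <= 0:
        raise ValueError("backend peak memory must be positive")
    return baseline_peak_bytes / backend_peak_bytes


# Hypothetical numbers: the backend peaks at 12.5 GiB where the baseline peaks at 10 GiB.
gib = 1024 ** 3
print(f"{compression_ratio(10 * gib, 12.5 * gib):.2f}x")  # 0.80x
```

Under this definition, a value below 1.00x means the backend used more peak memory than the native PyTorch run.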

Metrics over time


bench_logs/geomean_over_time.png :

bench_logs/passrate_over_time.png :

huggingface suite with float32 precision


Performance speedup

+-----------------------------------------+----+-----------+----------+
|                  name                   | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+----------+
|       MT5ForConditionalGeneration       | 2  |   0.912   |  4.6277  |
|           ElectraForCausalLM            | 1  |  0.9372   |  4.1844  |
|            YituTechConvBert             | 1  |  0.9314   |  3.7368  |
|         MegatronBertForCausalLM         | 2  |  0.9425   |  3.3657  |
|             OPTForCausalLM              | 4  |  0.9837   |  2.9643  |
|          MobileBertForMaskedLM          | 16 |  0.9291   |  2.9327  |
|           RobertaForCausalLM            | 4  |  0.9599   |  2.5863  |
|     M2M100ForConditionalGeneration      | 2  |  0.9524   |  2.544   |
|             XGLMForCausalLM             | 1  |  0.8789   |  2.4742  |
|     PegasusForConditionalGeneration     | 4  |  0.8936   |  2.4345  |
|     MobileBertForQuestionAnswering      | 32 |  0.9097   |  2.3948  |
|                CamemBert                | 1  |  0.9434   |  2.2449  |
|               GoogleFnet                | 1  |  0.8119   |  2.0603  |
|               DistillGPT2               | 1  |   0.934   |  1.9454  |
|    MegatronBertForQuestionAnswering     | 8  |   0.932   |  1.8596  |
|     PLBartForConditionalGeneration      | 8  |  0.9042   |  1.6688  |
|      MBartForConditionalGeneration      | 8  |  0.8875   |  1.4768  |
|            XLNetLMHeadModel             | 4  |  0.9655   |  1.427   |
|                 T5Small                 | 1  |  0.9592   |  1.358   |
|         Speech2Text2ForCausalLM         | 64 |  0.9438   |  1.2946  |
|     DistilBertForQuestionAnswering      | 32 |  0.9767   |  1.2753  |
|            TrOCRForCausalLM             | 8  |  0.9338   |  1.2341  |
|           PegasusForCausalLM            | 8  |  0.9351   |  1.2218  |
|      BartForConditionalGeneration       | 1  |  0.9916   |  1.2055  |
| BlenderbotSmallForConditionalGeneration | 32 |  0.9314   |  1.1764  |
|       DebertaForQuestionAnswering       | 4  |  0.7412   |  1.1722  |
|          DistilBertForMaskedLM          | 16 |   0.98    |  1.163   |
|            PLBartForCausalLM            | 16 |  0.9466   |  1.1229  |
|             BartForCausalLM             | 2  |  0.9662   |  1.1018  |
|       RobertaForQuestionAnswering       | 64 |  0.9825   |  1.0993  |
|                 BigBird                 | 1  |  0.9386   |  1.0925  |
|        BertForQuestionAnswering         | 64 |  0.9818   |  1.0919  |
|            MBartForCausalLM             | 16 |  0.9638   |  1.0433  |
|       AlbertForQuestionAnswering        | 2  |  0.9998   |   0.0    |
|            AlbertForMaskedLM            | 2  |  0.9979   |   0.0    |
|    LayoutLMForSequenceClassification    | 16 |  0.9875   |   0.0    |
|       ElectraForQuestionAnswering       | 64 |   0.984   |   0.0    |
|      GPT2ForSequenceClassification      | 4  |  0.9756   |   0.0    |
|       T5ForConditionalGeneration        | 4  |  0.9709   |   0.0    |
|           LayoutLMForMaskedLM           | 16 |  0.9701   |   0.0    |
|             BertForMaskedLM             | 64 |  0.9612   |   0.0    |
|       BlenderbotSmallForCausalLM        | 64 |  0.9085   |   0.0    |
|          AllenaiLongformerBase          | 1  |  0.8731   |   0.0    |
|           DebertaForMaskedLM            | 4  |  0.8027   |   0.0    |
+-----------------------------------------+----+-----------+----------+

Accuracy

+-----------------------------------------+----+-----------+-------------+
|                  name                   | bs | aot_eager |  inductor   |
+-----------------------------------------+----+-----------+-------------+
|             BartForCausalLM             | 1  |   pass    |    pass     |
|             BertForMaskedLM             | 1  |   pass    |    pass     |
|        BertForQuestionAnswering         | 1  |   pass    |    pass     |
|                 BigBird                 | 1  |   pass    |    pass     |
|       BlenderbotSmallForCausalLM        | 1  |   pass    |    pass     |
| BlenderbotSmallForConditionalGeneration | 1  |   pass    |    pass     |
|                CamemBert                | 1  |   pass    |    pass     |
|           DebertaForMaskedLM            | 1  |   pass    |    pass     |
|       DebertaForQuestionAnswering       | 1  |   pass    |    pass     |
|          DistilBertForMaskedLM          | 1  |   pass    |    pass     |
|     DistilBertForQuestionAnswering      | 1  |   pass    |    pass     |
|               DistillGPT2               | 1  |   pass    |    pass     |
|           ElectraForCausalLM            | 1  |   pass    |    pass     |
|       ElectraForQuestionAnswering       | 1  |   pass    |    pass     |
|      GPT2ForSequenceClassification      | 1  |   pass    |    pass     |
|               GoogleFnet                | 1  |   pass    |    pass     |
|           LayoutLMForMaskedLM           | 1  |   pass    |    pass     |
|    LayoutLMForSequenceClassification    | 1  |   pass    |    pass     |
|            MBartForCausalLM             | 1  |   pass    |    pass     |
|       MT5ForConditionalGeneration       | 1  |   pass    |    pass     |
|         MegatronBertForCausalLM         | 1  |   pass    |    pass     |
|    MegatronBertForQuestionAnswering     | 1  |   pass    |    pass     |
|          MobileBertForMaskedLM          | 1  |   pass    |    pass     |
|     MobileBertForQuestionAnswering      | 1  |   pass    |    pass     |
|             OPTForCausalLM              | 1  |   pass    |    pass     |
|            PLBartForCausalLM            | 1  |   pass    |    pass     |
|           PegasusForCausalLM            | 1  |   pass    |    pass     |
|     PegasusForConditionalGeneration     | 1  |   pass    |    pass     |
|           RobertaForCausalLM            | 1  |   pass    |    pass     |
|       RobertaForQuestionAnswering       | 1  |   pass    |    pass     |
|         Speech2Text2ForCausalLM         | 1  |   pass    |    pass     |
|       T5ForConditionalGeneration        | 1  |   pass    |    pass     |
|                 T5Small                 | 1  |   pass    |    pass     |
|            TrOCRForCausalLM             | 1  |   pass    |    pass     |
|            XLNetLMHeadModel             | 1  |   pass    |    pass     |
|            YituTechConvBert             | 1  |   pass    |    pass     |
|            AlbertForMaskedLM            | 1  |   pass    | fail_to_run |
|       AlbertForQuestionAnswering        | 1  |   pass    | fail_to_run |
|          AllenaiLongformerBase          | 1  |   pass    | fail_to_run |
|      MBartForConditionalGeneration      | 1  |   pass    | fail_to_run |
|     PLBartForConditionalGeneration      | 1  |   pass    | fail_to_run |
|      BartForConditionalGeneration       | 0  |  0.0000   |   0.0000    |
|     M2M100ForConditionalGeneration      | 0  |  0.0000   |   0.0000    |
|             XGLMForCausalLM             | 0  |  0.0000   |   0.0000    |
+-----------------------------------------+----+-----------+-------------+

Compilation latency (sec)

+-----------------------------------------+----+-----------+----------+
|                  name                   | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+----------+
|          MobileBertForMaskedLM          | 16 | 161.9975  | 230.3942 |
|     MobileBertForQuestionAnswering      | 32 | 156.5321  | 229.1364 |
|     M2M100ForConditionalGeneration      | 2  |  36.8031  | 169.1486 |
|            XLNetLMHeadModel             | 4  |  36.3955  | 140.9739 |
|      MBartForConditionalGeneration      | 8  |  39.8095  | 135.2674 |
|             XGLMForCausalLM             | 1  |  25.104   | 133.6482 |
|      BartForConditionalGeneration       | 1  |  38.6818  | 127.4739 |
|     PegasusForConditionalGeneration     | 4  |  38.2538  | 119.5229 |
|       MT5ForConditionalGeneration       | 2  |  16.9923  | 111.013  |
|       DebertaForQuestionAnswering       | 4  |  13.4467  | 110.5823 |
|         MegatronBertForCausalLM         | 2  |  26.3919  | 108.5788 |
|    MegatronBertForQuestionAnswering     | 8  |  26.2115  | 106.7605 |
|     PLBartForConditionalGeneration      | 8  |  13.4489  | 91.7812  |
| BlenderbotSmallForConditionalGeneration | 32 |  20.3056  | 86.0149  |
|                 T5Small                 | 1  |  11.1881  | 80.8945  |
|            YituTechConvBert             | 1  |  16.8696  | 78.8642  |
|            TrOCRForCausalLM             | 8  |  14.528   | 68.5267  |
|             OPTForCausalLM              | 4  |  9.7842   | 64.4264  |
|            MBartForCausalLM             | 16 |  14.6384  |  60.153  |
|           PegasusForCausalLM            | 8  |  14.4482  | 60.0701  |
|             BartForCausalLM             | 2  |  14.2281  | 57.4828  |
|           RobertaForCausalLM            | 4  |  10.1821  | 57.4763  |
|           ElectraForCausalLM            | 1  |  9.8272   | 56.1576  |
|       RobertaForQuestionAnswering       | 64 |  9.9816   | 56.0056  |
|        BertForQuestionAnswering         | 64 |  9.7034   | 55.3096  |
|                CamemBert                | 1  |  9.9611   | 52.6892  |
|                 BigBird                 | 1  |  17.3435  | 52.4819  |
|         Speech2Text2ForCausalLM         | 64 |  5.4863   | 44.3714  |
|            PLBartForCausalLM            | 16 |  5.4581   | 41.5006  |
|          DistilBertForMaskedLM          | 16 |  4.3538   | 37.1118  |
|     DistilBertForQuestionAnswering      | 32 |  4.2162   | 34.2192  |
|               GoogleFnet                | 1  |  4.3239   | 33.2724  |
|               DistillGPT2               | 1  |  3.8532   | 32.1718  |
|          AllenaiLongformerBase          | 1  |  19.6187  |   nan    |
|           DebertaForMaskedLM            | 4  |  13.4496  |   nan    |
|       T5ForConditionalGeneration        | 4  |  11.1033  |   nan    |
|    LayoutLMForSequenceClassification    | 16 |  10.3767  |   nan    |
|           LayoutLMForMaskedLM           | 16 |  10.3325  |   nan    |
|             BertForMaskedLM             | 64 |  9.9551   |   nan    |
|       ElectraForQuestionAnswering       | 64 |  9.8843   |   nan    |
|      GPT2ForSequenceClassification      | 4  |  8.4326   |   nan    |
|       BlenderbotSmallForCausalLM        | 64 |  8.0904   |   nan    |
|            AlbertForMaskedLM            | 2  |  6.3343   |   nan    |
|       AlbertForQuestionAnswering        | 2  |  5.8626   |   nan    |
+-----------------------------------------+----+-----------+----------+

Peak Memory Compression Ratio

+-----------------------------------------+----+-----------+----------+
|                  name                   | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+----------+
|            XLNetLMHeadModel             | 4  |  0.8976   |  0.9807  |
|        BertForQuestionAnswering         | 64 |  0.9467   |  0.9145  |
|       RobertaForQuestionAnswering       | 64 |  0.9467   |  0.9145  |
|                 T5Small                 | 1  |  0.9325   |  0.8445  |
|     DistilBertForQuestionAnswering      | 32 |  0.9046   |  0.8405  |
|          DistilBertForMaskedLM          | 16 |  0.9138   |  0.8391  |
|             BartForCausalLM             | 2  |  0.8847   |  0.8303  |
|           ElectraForCausalLM            | 1  |  0.9107   |  0.827   |
|                 BigBird                 | 1  |  0.9549   |  0.8224  |
|            PLBartForCausalLM            | 16 |  0.8802   |  0.8028  |
|            MBartForCausalLM             | 16 |  0.8629   |  0.8005  |
|               DistillGPT2               | 1  |  0.7721   |  0.7997  |
|         Speech2Text2ForCausalLM         | 64 |   0.88    |  0.7767  |
|     PLBartForConditionalGeneration      | 8  |  0.8222   |  0.7744  |
|             XGLMForCausalLM             | 1  |  0.9999   |  0.7728  |
|      BartForConditionalGeneration       | 1  |  0.8465   |  0.7708  |
| BlenderbotSmallForConditionalGeneration | 32 |  0.9036   |  0.7612  |
|                CamemBert                | 1  |  0.7977   |  0.7369  |
|            YituTechConvBert             | 1  |  0.7923   |  0.7298  |
|            TrOCRForCausalLM             | 8  |  0.8048   |  0.7284  |
|      MBartForConditionalGeneration      | 8  |  0.8137   |  0.727   |
|             OPTForCausalLM              | 4  |   0.75    |  0.714   |
|           RobertaForCausalLM            | 4  |  0.7778   |  0.7099  |
|           PegasusForCausalLM            | 8  |  0.9323   |  0.7012  |
|    MegatronBertForQuestionAnswering     | 8  |  0.8265   |  0.6997  |
|               GoogleFnet                | 1  |  0.9447   |  0.6953  |
|     M2M100ForConditionalGeneration      | 2  |  0.9801   |  0.6643  |
|         MegatronBertForCausalLM         | 2  |  0.7066   |  0.6453  |
|     PegasusForConditionalGeneration     | 4  |  0.9004   |  0.642   |
|       MT5ForConditionalGeneration       | 2  |  0.6173   |  0.6173  |
|          MobileBertForMaskedLM          | 16 |  0.9179   |  0.5861  |
|     MobileBertForQuestionAnswering      | 32 |  0.9716   |  0.4668  |
|       DebertaForQuestionAnswering       | 4  |  1.0525   |  0.3569  |
|           DebertaForMaskedLM            | 4  |  0.9851   |   nan    |
|       T5ForConditionalGeneration        | 4  |  0.9597   |   nan    |
|       ElectraForQuestionAnswering       | 64 |  0.9524   |   nan    |
|          AllenaiLongformerBase          | 1  |  0.9515   |   nan    |
|           LayoutLMForMaskedLM           | 16 |  0.9409   |   nan    |
|       AlbertForQuestionAnswering        | 2  |  0.9369   |   nan    |
|    LayoutLMForSequenceClassification    | 16 |  0.9348   |   nan    |
|             BertForMaskedLM             | 64 |  0.9219   |   nan    |
|            AlbertForMaskedLM            | 2  |  0.9172   |   nan    |
|      GPT2ForSequenceClassification      | 4  |  0.9091   |   nan    |
|       BlenderbotSmallForCausalLM        | 64 |  0.8401   |   nan    |
+-----------------------------------------+----+-----------+----------+

Performance graphs


bench_logs/huggingface_float32.png :

@anijain2305
Contributor Author

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and the gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

  1. Batch size has been reduced to work around OOM errors. Work is in progress to reduce peak memory footprint.
  2. Experiments do not cover dynamic shapes.
  3. The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      | 98%, 52/53 | 98%, 42/43  | 100%, 61/61 |
|   aot_eager    | 98%, 52/53 | 98%, 42/43  | 90%, 55/61  |
| aot_cudagraphs | 28%, 15/53 |  2%, 1/43   |  10%, 6/61  |
|  aot_nvfuser   | 60%, 32/53 |  0%, 0/43   | 75%, 46/61  |
|    inductor    | 81%, 43/53 | 86%, 37/43  | 90%, 55/61  |
+----------------+------------+-------------+-------------+

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.09x    |    1.00x    |    1.00x    |
|  aot_nvfuser   |   1.16x    |    0.0x     |    1.20x    |
|    inductor    |   1.68x    |    2.20x    |    1.31x    |
+----------------+------------+-------------+-------------+

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    6.15    |    14.88    |    11.73    |
|   aot_eager    |   12.44    |    25.70    |    19.93    |
| aot_cudagraphs |   12.80    |    93.53    |    51.65    |
|  aot_nvfuser   |   29.54    |     0.0     |    79.13    |
|    inductor    |   258.47   |   118.80    |   452.93    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.96x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.85x    |    0.86x    |    0.88x    |
| aot_cudagraphs |   0.43x    |    0.38x    |    0.19x    |
|  aot_nvfuser   |   0.83x    |    0.0x     |    0.85x    |
|    inductor    |   0.77x    |    0.82x    |    0.89x    |
+----------------+------------+-------------+-------------+

torchbench suite with amp precision


Performance speedup

+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|               name                |  bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|            densenet121            |  4   | 1.0002 |  0.9102   |      0.0       |    1.397    |  5.0623  |
|       functorch_dp_cifar10        |  64  | 1.0015 |  0.9112   |      0.0       |   1.1939    |  4.737   |
|         timm_efficientdet         |  1   | 0.9848 |  0.8085   |      0.0       |     0.0     |  4.2687  |
|           BERT_pytorch            |  16  | 1.0107 |  0.8304   |      0.0       |     0.0     |  3.1041  |
|      timm_vision_transformer      |  8   | 1.0006 |   0.846   |      0.0       |   1.3541    |  3.0679  |
|                drq                |  1   | 1.0024 |  0.8093   |      0.0       |    1.106    |  2.9813  |
|             resnet18              |  16  | 1.0009 |   0.989   |      0.0       |   1.3483    |  2.6731  |
|               dcgan               |  32  | 0.9772 |  0.9046   |     1.1443     |   0.7307    |  2.6188  |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.998  |  0.9303   |     1.4873     |   1.2113    |  2.5972  |
|             hf_Albert             |  8   | 1.0011 |  0.9552   |      0.0       |     0.0     |  2.3953  |
|           squeezenet1_1           |  32  | 0.9933 |  0.9562   |     1.337      |   1.1937    |  2.3116  |
|          resnext50_32x4d          |  8   | 1.0027 |  0.9499   |      0.0       |   1.3374    |  2.1943  |
|        mobilenet_v3_large         |  32  | 1.0042 |  1.0057   |      0.0       |    1.411    |  2.1611  |
|               hf_T5               |  8   | 0.9984 |  0.9446   |      0.0       |     0.0     |  2.1382  |
|            hf_T5_large            |  2   | 1.0172 |  0.8568   |      0.0       |     0.0     |  2.122   |
|          pytorch_struct           | 200  | 1.0012 |  0.7441   |     1.0266     |   0.9964    |  2.0323  |
|              hf_Bert              |  4   | 1.0336 |  0.8486   |      0.0       |     0.0     |  1.8828  |
|              hf_GPT2              |  4   | 1.017  |  0.9879   |      0.0       |     0.0     |  1.8574  |
|            mnasnet1_0             |  32  | 0.9986 |  1.0159   |     0.9193     |   1.4046    |  1.7708  |
|          LearningToPaint          |  96  | 1.0045 |  1.0023   |      0.0       |   1.3491    |  1.7422  |
|              hf_Bart              |  4   | 1.0155 |  0.8359   |      0.0       |     0.0     |  1.7211  |
|           lennard_jones           | 1000 | 0.9786 |  0.7278   |     1.2952     |   1.0447    |  1.5978  |
|         timm_efficientnet         |  32  | 0.9608 |  0.8133   |      0.0       |   1.1851    |  1.5685  |
| attention_is_all_you_need_pytorch | 256  | 1.0029 |  0.9032   |      0.0       |     0.0     |  1.5158  |
|         soft_actor_critic         | 256  | 1.011  |   0.707   |     1.2513     |   1.0703    |  1.4902  |
|           hf_DistilBert           |  8   | 1.0015 |   0.969   |      0.0       |     0.0     |  1.4765  |
|           fastNLP_Bert            |  6   | 1.0004 |  0.8861   |      0.0       |     0.0     |  1.4585  |
|        shufflenet_v2_x1_0         | 128  | 1.0011 |  1.0157   |      0.0       |   1.3391    |  1.3717  |
|           pytorch_unet            |  1   | 0.9999 |  0.9926   |      0.0       |   1.1552    |  1.3528  |
|            timm_nfnet             | 128  | 0.9997 |  0.9985   |      0.0       |   1.1712    |  1.3388  |
|          pytorch_stargan          |  16  | 0.9984 |  1.0165   |     0.8265     |   1.1173    |  1.3192  |
|            Super_SloMo            |  6   | 0.9997 |   0.996   |      0.0       |     0.0     |  1.2905  |
|               vgg16               |  64  | 0.9997 |  0.9978   |     0.7975     |   0.9952    |  1.2744  |
|        Background_Matting         |  4   | 0.9993 |  1.0175   |      0.0       |   1.1152    |  1.2167  |
|              alexnet              | 128  | 0.9993 |  0.9971   |     0.788      |   1.0029    |  1.2085  |
|           timm_resnest            |  32  | 0.9995 |  1.0217   |      0.0       |   1.3245    |  1.2011  |
|            hf_Reformer            |  4   | 0.9924 |  0.9996   |     0.9192     |     0.0     |  1.1589  |
|   timm_vision_transformer_large   |  8   | 0.9991 |  0.9895   |      0.0       |   0.9926    |  1.1581  |
|            hf_BigBird             |  2   | 0.9986 |  0.9103   |      0.0       |     0.0     |  1.1491  |
|            timm_vovnet            |  32  | 0.9212 |  0.8868   |      0.0       |   1.1273    |  1.1101  |
|               moco                |  32  | 0.9968 |    0.0    |      0.0       |     0.0     |  1.0487  |
|            tts_angular            |  64  | 0.9963 |  0.9382   |     0.9949     |   0.9984    |  1.0118  |
|              demucs               |  4   | 0.9985 |  1.0008   |     0.9996     |   0.9991    |  1.0012  |
|      nvidia_deeprecommender       | 256  | 0.9985 |  0.9955   |     0.6966     |   0.9787    |  0.9905  |
|           mobilenet_v2            |  96  | 0.9988 |  0.9875   |      0.0       |   0.9305    |  0.9033  |
|             resnet50              |  32  | 1.0012 |  1.0086   |      0.0       |   1.3687    |  0.8978  |
|            timm_regnet            |  32  | 0.9812 |  0.9369   |      0.0       |   1.2152    |  0.7564  |
|              yolov3               |  16  | 0.9986 |  0.9886   |      0.0       |   0.9097    |   0.0    |
|           hf_Longformer           |  2   | 0.9639 |  0.8829   |     0.8871     |     0.0     |   0.0    |
|               dlrm                | 2048 |  0.0   |  1.2025   |      0.0       |     0.0     |   0.0    |
|           hf_GPT2_large           |  4   | 0.9996 |  0.9898   |      0.0       |     0.0     |   0.0    |
|        speech_transformer         |  32  | 1.0047 |  0.8518   |      0.0       |     0.0     |   0.0    |
|             tacotron2             |  64  | 0.9796 |  0.7578   |      0.0       |     0.0     |   0.0    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+

Accuracy

+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|               name                | bs  |      eager       |    aot_eager     |  aot_cudagraphs  |   aot_nvfuser    |     inductor     |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|           hf_GPT2_large           |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|            hf_T5_large            |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|   timm_vision_transformer_large   |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|              alexnet              |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               dcgan               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|              demucs               |  4  |       pass       |       pass       |       pass       |       pass       |       pass       |
|           lennard_jones           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|            mnasnet1_0             |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|      nvidia_deeprecommender       |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|   pytorch_CycleGAN_and_pix2pix    |  1  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_stargan          | 16  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_struct           | 200 |       pass       |       pass       |       pass       |       pass       |       pass       |
|         soft_actor_critic         | 256 |       pass       |       pass       |       pass       |       pass       |       pass       |
|           squeezenet1_1           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               vgg16               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|        Background_Matting         |  4  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|          LearningToPaint          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            densenet121            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|                drq                |  1  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|       functorch_dp_cifar10        |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           mobilenet_v2            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           pytorch_unet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet18              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet50              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|          resnext50_32x4d          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|        shufflenet_v2_x1_0         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|         timm_efficientnet         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_nfnet             |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_regnet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           timm_resnest            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|      timm_vision_transformer      |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_vovnet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            hf_Reformer            |  2  |       pass       |       pass       |       pass       |   fail_to_run    |       pass       |
|           BERT_pytorch            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            Super_SloMo            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
| attention_is_all_you_need_pytorch |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|               dlrm                |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           fastNLP_Bert            |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|             hf_Albert             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_Bart              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_Bert              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            hf_BigBird             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           hf_DistilBert           |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|              hf_GPT2              |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|               hf_T5               |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|            hf_T5_base             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |       pass       |
|           hf_Longformer           |  2  |       pass       |       pass       |       pass       |   fail_to_run    |   fail_to_run    |
|        speech_transformer         |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|             tacotron2             |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|         timm_efficientdet         |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|          vision_maskrcnn          |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|               moco                |  2  |       pass       |   fail_to_run    |   fail_to_run    |   fail_to_run    |   fail_to_run    |
|        mobilenet_v3_large         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |  fail_accuracy   |
|            tts_angular            |  2  |       pass       |       pass       |       pass       |       pass       |      0.0000      |
|              yolov3               |  2  |       pass       |       pass       |   fail_to_run    |   fail_to_run    |      0.0000      |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
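For anyone who wants per-backend totals from an accuracy table like the one above, here is a minimal sketch that tallies the status strings in each column. It is not part of the benchmark scripts; `parse_table` and `tally` are hypothetical helper names, and `SAMPLE` holds four rows copied from the table for illustration.

```python
from collections import Counter

# A few rows copied from the accuracy table above (borders omitted).
SAMPLE = """\
| name               | bs | eager | aot_eager   | aot_cudagraphs | aot_nvfuser | inductor      |
| alexnet            | 2  | pass  | pass        | pass           | pass        | pass          |
| resnet50           | 2  | pass  | pass        | fail_to_run    | pass        | pass          |
| mobilenet_v3_large | 2  | pass  | pass        | fail_to_run    | pass        | fail_accuracy |
| moco               | 2  | pass  | fail_to_run | fail_to_run    | fail_to_run | fail_to_run   |
"""


def parse_table(text):
    """Split a pipe-delimited table into a header list and a list of row dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    header, body = rows[0], rows[1:]
    return header, [dict(zip(header, row)) for row in body]


def tally(text, skip=("name", "bs")):
    """Count how often each status string appears in every backend column."""
    header, rows = parse_table(text)
    return {col: Counter(r[col] for r in rows) for col in header if col not in skip}


counts = tally(SAMPLE)
for backend, statuses in counts.items():
    print(backend, dict(statuses))
```

The same functions should work on the full table text pasted as-is, since border lines starting with `+` are skipped.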

Compilation latency (sec)

+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
|               name                |  bs  |  eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
|         timm_efficientdet         |  1   | 52.6602 |  79.1055  |      nan       |     nan     | 1818.1052 |
|            hf_T5_large            |  2   | 36.3097 |  76.3336  |      nan       |     nan     | 1734.768  |
|            densenet121            |  4   | 13.5051 |  28.8872  |      nan       |   138.834   | 1386.3113 |
|            mnasnet1_0             |  32  | 3.3962  |  8.6081   |    43.4979     |   46.1998   | 826.0046  |
|        mobilenet_v3_large         |  32  | 3.7847  |  9.2689   |      nan       |   75.0785   | 748.7895  |
|          resnext50_32x4d          |  8   | 3.6093  |  9.3411   |      nan       |   39.3272   | 707.1618  |
|               moco                |  32  |  11.37  |    nan    |      nan       |     nan     |  683.29   |
|           mobilenet_v2            |  96  | 3.3085  |  8.3878   |      nan       |   43.4906   | 635.5209  |
|         timm_efficientnet         |  32  | 5.9808  |  12.6187  |      nan       |   73.2075   | 583.9074  |
|        shufflenet_v2_x1_0         | 128  | 3.8001  |  9.9697   |      nan       |   40.5941   | 398.4956  |
|           squeezenet1_1           |  32  | 0.6697  |  1.7814   |     6.7516     |   6.8886    | 385.1688  |
|            timm_nfnet             | 128  | 6.6658  |  13.5281  |      nan       |   42.693    |  365.85   |
|           timm_resnest            |  32  | 1.4486  |  4.3564   |      nan       |   43.3691   | 349.0304  |
|            timm_regnet            |  32  |  8.429  |  16.9052  |      nan       |   67.531    | 317.4367  |
| attention_is_all_you_need_pytorch | 256  | 4.4349  |  12.7302  |      nan       |     nan     | 251.7143  |
|            timm_vovnet            |  32  |  3.013  |  7.3585   |      nan       |   32.2961   | 228.4544  |
|   timm_vision_transformer_large   |  8   | 22.8303 |  40.7575  |      nan       |   59.3202   | 203.1062  |
|          LearningToPaint          |  96  | 1.0884  |  3.1644   |      nan       |   30.8535   | 196.1728  |
|       functorch_dp_cifar10        |  64  | 0.8334  |  2.5709   |      nan       |   6.5105    | 186.6153  |
|      timm_vision_transformer      |  8   | 3.1965  |  8.1056   |      nan       |   16.1745   | 185.5642  |
|           BERT_pytorch            |  16  | 5.1553  |  13.6741  |      nan       |     nan     | 183.0714  |
|             resnet18              |  16  | 0.9908  |  3.0796   |      nan       |   23.6609   | 178.4714  |
|             resnet50              |  32  | 3.4622  |  9.1981   |      nan       |   44.2773   | 167.5082  |
|           fastNLP_Bert            |  6   | 5.3662  |  12.8031  |      nan       |     nan     | 155.3766  |
|               hf_T5               |  8   | 3.9527  |  12.7903  |      nan       |     nan     | 152.7908  |
|        Background_Matting         |  4   | 4.0586  |  9.8569   |      nan       |   45.5208   |  137.705  |
|          pytorch_stargan          |  16  | 0.8563  |  3.2896   |     11.618     |   7.5638    | 137.6237  |
|              hf_Bart              |  4   | 7.5098  |  17.1193  |      nan       |     nan     | 136.6578  |
|              hf_GPT2              |  4   | 3.6623  |  9.9582   |      nan       |     nan     | 128.4494  |
|          pytorch_struct           | 200  | 0.4445  |  1.2679   |     1.8613     |   5.4827    | 121.6668  |
|            Super_SloMo            |  6   | 2.3013  |  7.0704   |      nan       |     nan     |  91.5593  |
|             hf_Albert             |  8   | 1.5093  |  8.5484   |      nan       |     nan     |  81.4949  |
|            hf_Reformer            |  4   |  3.183  |  5.8886   |    13.7829     |     nan     |  80.2307  |
|              hf_Bert              |  4   | 5.2581  |  12.5739  |      nan       |     nan     |  72.6783  |
|            hf_BigBird             |  2   | 11.8968 |  20.2146  |      nan       |     nan     |  66.169   |
|           pytorch_unet            |  1   | 1.1321  |  3.4759   |      nan       |   26.7798   |  61.8894  |
|           hf_DistilBert           |  8   | 1.7875  |   5.408   |      nan       |     nan     |  47.2585  |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 0.8066  |   3.255   |    11.9549     |   5.2307    |  33.7224  |
|               vgg16               |  64  | 0.3752  |  1.1111   |     4.1704     |   3.7078    |  29.9942  |
|              alexnet              | 128  | 0.2813  |  0.6904   |     1.9896     |   3.2561    |  29.2225  |
|                drq                |  1   | 0.2865  |  0.7521   |      nan       |   4.4815    |  22.5066  |
|               dcgan               |  32  |  0.268  |  0.6336   |     1.8825     |   4.3205    |  17.1734  |
|      nvidia_deeprecommender       | 256  | 0.2933  |  0.6746   |     1.0105     |   2.9894    |  15.6936  |
|         soft_actor_critic         | 256  | 0.2749  |  0.4931   |     0.715      |   2.1025    |  14.6417  |
|           lennard_jones           | 1000 | 0.2403  |  0.5118   |     0.6931     |   1.5472    |  8.5416   |
|            tts_angular            |  64  | 0.3366  |  0.3937   |     0.5196     |   1.1651    |  4.0383   |
|              demucs               |  4   | 0.9022  |  0.8912   |     0.8836     |   0.8907    |   0.789   |
|              yolov3               |  16  | 7.4472  |  15.7484  |      nan       |   45.481    |    nan    |
|           hf_Longformer           |  2   | 11.7858 |  21.3262  |    90.6374     |     nan     |    nan    |
|           hf_GPT2_large           |  4   | 21.8976 |  41.7361  |      nan       |     nan     |    nan    |
|             tacotron2             |  64  | 13.9023 |  30.239   |      nan       |     nan     |    nan    |
|        speech_transformer         |  32  | 7.6548  |   17.41   |      nan       |     nan     |    nan    |
|               dlrm                | 2048 |   nan   |  1.2125   |      nan       |     nan     |    nan    |
+-----------------------------------+------+---------+-----------+----------------+-------------+-----------+
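One way to read the compilation latency table is as a ratio of the inductor column to the eager column. The sketch below does that for a few rows copied from the table. It assumes `nan` means no measurement was recorded for that backend, so those rows are skipped; `compile_overhead` is a hypothetical helper name, not something from the benchmark scripts.

```python
import math

# (name, eager_seconds, inductor_seconds) copied from the table above.
ROWS = [
    ("timm_efficientdet", 52.6602, 1818.1052),
    ("tts_angular", 0.3366, 4.0383),
    ("demucs", 0.9022, 0.789),
    ("yolov3", 7.4472, float("nan")),
]


def compile_overhead(rows):
    """Return inductor/eager compile-time ratios, skipping rows with nan."""
    ratios = {}
    for name, eager, inductor in rows:
        if math.isnan(eager) or math.isnan(inductor):
            continue  # assumption: nan means the run produced no timing
        ratios[name] = inductor / eager
    return ratios


ratios = compile_overhead(ROWS)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.1f}x")
```

Sorting by ratio rather than by absolute seconds can surface small models whose compile time grows disproportionately.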

Peak Memory Compression Ratio

+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|               name                |  bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|             hf_Albert             |  8   | 0.9814 |   0.936   |      nan       |     nan     |  1.1576  |
|            Super_SloMo            |  6   | 1.0024 |  0.9697   |      nan       |     nan     |  1.1385  |
|            timm_nfnet             | 128  | 0.9761 |  0.9043   |      nan       |   0.9504    |  1.0242  |
|            tts_angular            |  64  | 1.0015 |  1.0015   |     0.9866     |   1.0015    |  0.9908  |
| attention_is_all_you_need_pytorch | 256  | 0.9976 |  0.9403   |      nan       |     nan     |  0.9875  |
|              demucs               |  4   | 0.987  |   0.987   |     0.987      |    0.987    |  0.987   |
|         timm_efficientdet         |  1   | 1.0316 |  0.8425   |      nan       |     nan     |  0.9857  |
|           BERT_pytorch            |  16  | 0.9998 |  0.8819   |      nan       |     nan     |  0.9728  |
|         timm_efficientnet         |  32  | 0.9982 |  0.7762   |      nan       |   0.7936    |  0.9689  |
|              hf_GPT2              |  4   | 0.971  |  0.8627   |      nan       |     nan     |  0.9645  |
|        Background_Matting         |  4   | 1.0201 |  0.9679   |      nan       |    0.987    |  0.9244  |
|           mobilenet_v2            |  96  | 1.0001 |  0.7725   |      nan       |   0.9235    |  0.8856  |
|           pytorch_unet            |  1   | 0.9968 |  0.8677   |      nan       |   0.8518    |  0.8681  |
|           fastNLP_Bert            |  6   | 1.0013 |  0.8966   |      nan       |     nan     |  0.8661  |
|   pytorch_CycleGAN_and_pix2pix    |  1   |  1.0   |  0.8751   |     0.2642     |   0.8432    |  0.8602  |
|            hf_T5_large            |  2   | 0.8541 |  0.8541   |      nan       |     nan     |  0.8535  |
|           hf_DistilBert           |  8   | 0.9505 |  0.8806   |      nan       |     nan     |  0.8387  |
|              hf_Bert              |  4   | 0.9844 |  0.8677   |      nan       |     nan     |  0.8383  |
|            timm_regnet            |  32  | 0.9999 |  0.8483   |      nan       |    0.85     |  0.8361  |
|              hf_Bart              |  4   | 0.9099 |  0.8321   |      nan       |     nan     |  0.8151  |
|            hf_BigBird             |  2   | 0.9852 |  0.9787   |      nan       |     nan     |   0.81   |
|            timm_vovnet            |  32  | 0.9903 |  0.7754   |      nan       |   0.7817    |  0.7861  |
|               moco                |  32  | 0.9667 |    nan    |      nan       |     nan     |  0.782   |
|        shufflenet_v2_x1_0         | 128  | 1.0002 |   0.874   |      nan       |   0.8652    |  0.7812  |
|          pytorch_stargan          |  16  | 0.9929 |  0.9799   |     0.2149     |   0.8882    |  0.7783  |
|               dcgan               |  32  |  1.0   |  0.7949   |     0.343      |   0.7073    |  0.7527  |
|               vgg16               |  64  | 0.9998 |  0.7378   |     0.2978     |   0.7172    |  0.7491  |
|   timm_vision_transformer_large   |  8   | 0.9987 |  0.8365   |      nan       |   0.8491    |  0.7487  |
|              alexnet              | 128  | 1.0003 |  0.8082   |     0.4354     |    0.805    |  0.7352  |
|               hf_T5               |  8   | 0.9678 |  0.9371   |      nan       |     nan     |  0.7266  |
|           timm_resnest            |  32  | 0.9868 |  0.8809   |      nan       |   0.8726    |  0.7218  |
|      timm_vision_transformer      |  8   | 1.0001 |  0.8868   |      nan       |   0.8871    |  0.7151  |
|             resnet50              |  32  | 1.0004 |  0.8678   |      nan       |   0.8041    |  0.7143  |
|            mnasnet1_0             |  32  | 0.9994 |  0.8793   |     0.173      |   0.8217    |  0.6596  |
|           squeezenet1_1           |  32  | 0.9604 |  0.7958   |     0.2951     |   0.7589    |  0.6595  |
|        mobilenet_v3_large         |  32  | 0.999  |  0.8661   |      nan       |    0.874    |  0.6573  |
|          resnext50_32x4d          |  8   |  1.0   |  0.8591   |      nan       |    0.823    |  0.6514  |
|                drq                |  1   | 0.9125 |  0.8399   |      nan       |   0.8395    |  0.6406  |
|         soft_actor_critic         | 256  | 0.964  |  0.9151   |     0.4737     |   0.9151    |  0.6279  |
|          LearningToPaint          |  96  | 0.9252 |  0.7196   |      nan       |    0.71     |  0.605   |
|            densenet121            |  4   |  1.0   |  0.8696   |      nan       |   0.8376    |  0.5739  |
|             resnet18              |  16  | 0.9782 |  0.7852   |      nan       |   0.7268    |  0.5644  |
|           lennard_jones           | 1000 |  1.0   |  1.0002   |     0.3735     |   1.0967    |  0.564   |
|      nvidia_deeprecommender       | 256  | 0.5596 |  0.5596   |     0.5262     |   0.5596    |  0.5596  |
|       functorch_dp_cifar10        |  64  | 0.9964 |  0.8131   |      nan       |    0.846    |  0.4465  |
|          pytorch_struct           | 200  |  1.0   |  0.5081   |     0.4858     |   0.5082    |  0.4235  |
|            hf_Reformer            |  4   | 0.3764 |    1.0    |     0.2539     |     nan     |  0.3629  |
|              yolov3               |  16  | 1.0054 |  0.8488   |      nan       |   0.8244    |   nan    |
|           hf_Longformer           |  2   | 0.9734 |   0.967   |     0.3374     |     nan     |   nan    |
|        speech_transformer         |  32  | 1.0015 |  0.9177   |      nan       |     nan     |   nan    |
|           hf_GPT2_large           |  4   | 0.9586 |  0.8649   |      nan       |     nan     |   nan    |
|               dlrm                | 2048 |  nan   |  0.7282   |      nan       |     nan     |   nan    |
|             tacotron2             |  64  | 0.9879 |  0.4069   |      nan       |     nan     |   nan    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+

huggingface suite with amp precision

Performance speedup

+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|     MobileBertForQuestionAnswering      | 32 | 1.0156 |  0.8169   |      0.0       |     0.0     |  5.8307  |
|          MobileBertForMaskedLM          | 16 | 1.0187 |  0.8248   |      0.0       |     0.0     |  5.7089  |
|       MT5ForConditionalGeneration       | 2  | 1.0224 |  0.8508   |      0.0       |     0.0     |  5.4709  |
|           ElectraForCausalLM            | 1  | 1.0362 |  0.8465   |      0.0       |     0.0     |  5.4366  |
|            YituTechConvBert             | 1  | 1.0208 |  0.8384   |      0.0       |     0.0     |  4.6274  |
|         MegatronBertForCausalLM         | 2  | 1.0325 |  0.8502   |      0.0       |     0.0     |  4.1375  |
|     M2M100ForConditionalGeneration      | 2  | 1.0103 |  0.8308   |      0.0       |     0.0     |  4.0046  |
|           RobertaForCausalLM            | 4  | 1.0395 |  0.8381   |      0.0       |     0.0     |  3.9484  |
|             OPTForCausalLM              | 4  | 1.0159 |  0.8275   |      0.0       |     0.0     |  3.9047  |
|                CamemBert                | 1  | 1.0396 |  0.8447   |      0.0       |     0.0     |  3.4434  |
|     PegasusForConditionalGeneration     | 4  | 1.0105 |  0.8149   |      0.0       |     0.0     |  3.2421  |
|             XGLMForCausalLM             | 1  | 1.0117 |  0.8168   |      0.0       |     0.0     |  3.1117  |
|     PLBartForConditionalGeneration      | 8  | 1.0154 |  0.8245   |      0.0       |     0.0     |  2.8361  |
|    MegatronBertForQuestionAnswering     | 8  | 1.0376 |   0.859   |      0.0       |     0.0     |  2.688   |
|               DistillGPT2               | 1  | 1.024  |  0.8702   |      0.0       |     0.0     |  2.6104  |
|      MBartForConditionalGeneration      | 8  | 1.0136 |  0.8357   |      0.0       |     0.0     |  2.3857  |
|         Speech2Text2ForCausalLM         | 64 | 1.0051 |  0.8348   |      0.0       |     0.0     |  2.2561  |
|      GPT2ForSequenceClassification      | 4  | 0.9993 |  0.9755   |      0.0       |     0.0     |  2.1462  |
|       ElectraForQuestionAnswering       | 64 | 0.9999 |  0.9776   |      0.0       |     0.0     |  1.9724  |
| BlenderbotSmallForConditionalGeneration | 32 | 1.0098 |  0.8688   |      0.0       |     0.0     |  1.9514  |
|            TrOCRForCausalLM             | 8  | 1.0113 |   0.829   |      0.0       |     0.0     |  1.9288  |
|           PegasusForCausalLM            | 8  | 1.0103 |  0.8014   |      0.0       |     0.0     |  1.8377  |
|          DistilBertForMaskedLM          | 16 | 1.0299 |  0.8455   |      0.0       |     0.0     |  1.8339  |
|      BartForConditionalGeneration       | 1  | 1.0151 |  0.8364   |      0.0       |     0.0     |  1.7748  |
|     DistilBertForQuestionAnswering      | 32 | 1.034  |  0.8491   |      0.0       |     0.0     |  1.7693  |
|    LayoutLMForSequenceClassification    | 16 | 0.9972 |  0.9671   |      0.0       |     0.0     |  1.7319  |
|       T5ForConditionalGeneration        | 4  | 1.0002 |  0.9362   |      0.0       |     0.0     |  1.7017  |
|       AlbertForQuestionAnswering        | 2  | 1.0011 |   0.808   |      0.0       |     0.0     |  1.6617  |
|            AlbertForMaskedLM            | 2  | 1.0004 |   0.808   |      0.0       |     0.0     |  1.6509  |
|            PLBartForCausalLM            | 16 | 1.0101 |  0.9365   |      0.0       |     0.0     |  1.6438  |
|                 T5Small                 | 1  | 1.0281 |  0.8763   |      0.0       |     0.0     |  1.6266  |
|            XLNetLMHeadModel             | 4  | 1.0006 |  0.9605   |      0.0       |     0.0     |  1.5968  |
|           LayoutLMForMaskedLM           | 16 | 0.9985 |   0.969   |      0.0       |     0.0     |  1.5917  |
|             BartForCausalLM             | 2  | 1.0003 |  0.9618   |      0.0       |     0.0     |  1.4597  |
|       DebertaForQuestionAnswering       | 4  | 0.9344 |  0.7279   |     0.9349     |     0.0     |  1.4504  |
|        BertForQuestionAnswering         | 64 | 0.9972 |  0.9677   |      0.0       |     0.0     |  1.446   |
|       RobertaForQuestionAnswering       | 64 | 0.9979 |  0.9686   |      0.0       |     0.0     |  1.4407  |
|           DebertaForMaskedLM            | 4  | 0.9334 |  0.7268   |     0.7967     |     0.0     |  1.4123  |
|            MBartForCausalLM             | 16 | 1.0091 |   0.823   |      0.0       |     0.0     |  1.3982  |
|             BertForMaskedLM             | 64 | 0.9973 |   0.956   |      0.0       |     0.0     |  1.3317  |
|       BlenderbotSmallForCausalLM        | 64 | 1.0004 |   0.927   |      0.0       |     0.0     |  1.3061  |
|                 BigBird                 | 1  | 0.9924 |  0.9078   |      0.0       |     0.0     |  1.1488  |
|          AllenaiLongformerBase          | 1  | 0.9546 |  0.7324   |     0.854      |     0.0     |   0.0    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
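To summarize a speedup column in a single number, a geometric mean is a common choice for ratios. The sketch below is only an illustration, not how the dashboard computes anything. It assumes a speedup of `0.0` marks a run with no measurement (AllenaiLongformerBase, for example, shows `0.0` under inductor here and `fail_to_run` in the accuracy table), so those entries are excluded. `geomean_speedup` is a hypothetical helper name.

```python
import math


def geomean_speedup(values):
    """Geometric mean of speedups, ignoring 0.0 and nan entries."""
    # Assumption: 0.0 (or nan) marks a failed run, not a real measurement.
    valid = [v for v in values if not math.isnan(v) and v > 0.0]
    if not valid:
        return float("nan")
    return math.exp(sum(math.log(v) for v in valid) / len(valid))


# A few inductor speedups copied from the table above, including one 0.0.
inductor = [5.8307, 5.7089, 1.1488, 0.0]
print(f"geomean over valid runs: {geomean_speedup(inductor):.4f}")
```

Excluding failed runs makes the average look better than the backend's overall coverage, so it helps to report the pass count alongside it.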

Accuracy

+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser |  inductor   |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+
|            AlbertForMaskedLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       AlbertForQuestionAnswering        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             BartForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             BertForMaskedLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|        BertForQuestionAnswering         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                 BigBird                 | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       BlenderbotSmallForCausalLM        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
| BlenderbotSmallForConditionalGeneration | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                CamemBert                | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           DebertaForMaskedLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|          DistilBertForMaskedLM          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     DistilBertForQuestionAnswering      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|               DistillGPT2               | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           ElectraForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       ElectraForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|      GPT2ForSequenceClassification      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           LayoutLMForMaskedLM           | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|    LayoutLMForSequenceClassification    | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            MBartForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       MT5ForConditionalGeneration       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|         MegatronBertForCausalLM         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|    MegatronBertForQuestionAnswering     | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|          MobileBertForMaskedLM          | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     MobileBertForQuestionAnswering      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|             OPTForCausalLM              | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            PLBartForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           PegasusForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|     PegasusForConditionalGeneration     | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|           RobertaForCausalLM            | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       RobertaForQuestionAnswering       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|         Speech2Text2ForCausalLM         | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       T5ForConditionalGeneration        | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|                 T5Small                 | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            TrOCRForCausalLM             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            XLNetLMHeadModel             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|            YituTechConvBert             | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |    pass     |
|       DebertaForQuestionAnswering       | 1  |  pass  |   pass    | fail_accuracy  | fail_to_run |    pass     |
|          AllenaiLongformerBase          | 1  |  pass  |   pass    |      pass      | fail_to_run | fail_to_run |
|      BartForConditionalGeneration       | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|      MBartForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|     PLBartForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run | fail_to_run |
|     M2M100ForConditionalGeneration      | 1  |  pass  |   pass    |  fail_to_run   | fail_to_run |   0.0000    |
|             XGLMForCausalLM             | 0  | 0.0000 |  0.0000   |     0.0000     |   0.0000    |   0.0000    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+-------------+

Compilation latency (sec)

+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
|                  name                   | bs |  eager   | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+
|            XLNetLMHeadModel             | 4  | 17.5696  |  42.9354  |      nan       |     nan     | 327.9079 |
|          MobileBertForMaskedLM          | 16 | 135.8255 | 173.7536  |      nan       |     nan     | 308.991  |
|     MobileBertForQuestionAnswering      | 32 | 132.8536 | 171.7633  |      nan       |     nan     | 293.2592 |
|       T5ForConditionalGeneration        | 4  |  3.922   |  12.943   |      nan       |     nan     | 248.2661 |
|     M2M100ForConditionalGeneration      | 2  | 26.2972  |  44.8597  |      nan       |     nan     | 223.5051 |
|       MT5ForConditionalGeneration       | 2  |  6.6742  |  19.818   |      nan       |     nan     | 203.911  |
|            YituTechConvBert             | 1  |  9.2675  |  20.8203  |      nan       |     nan     | 195.7945 |
|      MBartForConditionalGeneration      | 8  | 26.6741  |  47.2119  |      nan       |     nan     | 173.1357 |
|             XGLMForCausalLM             | 1  | 15.5248  |  30.2576  |      nan       |     nan     | 170.6504 |
|     PegasusForConditionalGeneration     | 4  | 26.1067  |  45.261   |      nan       |     nan     | 167.2376 |
|           DebertaForMaskedLM            | 4  |  7.4344  |  14.5652  |    53.2994     |     nan     | 164.457  |
|      BartForConditionalGeneration       | 1  | 26.3695  |  45.964   |      nan       |     nan     | 155.1766 |
|         MegatronBertForCausalLM         | 2  | 16.6556  |  31.6041  |      nan       |     nan     | 151.6307 |
|    MegatronBertForQuestionAnswering     | 8  | 16.8169  |  31.9636  |      nan       |     nan     | 149.1209 |
|                 T5Small                 | 1  |  3.949   |  12.7818  |      nan       |     nan     | 148.9024 |
|     PLBartForConditionalGeneration      | 8  |  7.4476  |  17.4072  |      nan       |     nan     | 135.8023 |
| BlenderbotSmallForConditionalGeneration | 32 | 12.7961  |  25.3891  |      nan       |     nan     | 127.313  |
|       DebertaForQuestionAnswering       | 4  |  7.1998  |  14.5172  |    53.5221     |     nan     | 122.565  |
|           RobertaForCausalLM            | 4  |  5.2604  |  13.0452  |      nan       |     nan     | 104.3819 |
|    LayoutLMForSequenceClassification    | 16 |  5.4858  |  12.9984  |      nan       |     nan     | 93.8895  |
|           PegasusForCausalLM            | 8  |  9.9544  |  17.0666  |      nan       |     nan     | 92.9066  |
|       ElectraForQuestionAnswering       | 64 |  5.2746  |  12.8697  |      nan       |     nan     | 92.0522  |
|            MBartForCausalLM             | 16 |  10.398  |  17.1343  |      nan       |     nan     | 85.7616  |
|             OPTForCausalLM              | 4  |  4.9946  |  12.1428  |      nan       |     nan     | 84.3122  |
|             BertForMaskedLM             | 64 |  5.1354  |  12.5731  |      nan       |     nan     |  84.208  |
|           LayoutLMForMaskedLM           | 16 |  5.6794  |  13.827   |      nan       |     nan     | 82.2904  |
|             BartForCausalLM             | 2  | 10.0323  |  17.1562  |      nan       |     nan     | 81.4293  |
|      GPT2ForSequenceClassification      | 4  |  3.6793  |  10.0043  |      nan       |     nan     | 78.5398  |
|            TrOCRForCausalLM             | 8  |  9.9941  |  17.2084  |      nan       |     nan     | 73.7194  |
|       BlenderbotSmallForCausalLM        | 64 |  4.8936  |  9.6694   |      nan       |     nan     | 73.5185  |
|           ElectraForCausalLM            | 1  |  5.385   |  12.7792  |      nan       |     nan     | 70.1919  |
|                 BigBird                 | 1  | 11.6176  |  20.1569  |      nan       |     nan     | 67.9323  |
|     DistilBertForQuestionAnswering      | 32 |  1.921   |  5.3883   |      nan       |     nan     |  67.798  |
|         Speech2Text2ForCausalLM         | 64 |  3.2046  |  6.8548   |      nan       |     nan     | 67.6319  |
|            AlbertForMaskedLM            | 2  |  1.5751  |  8.7428   |      nan       |     nan     | 67.5946  |
|               DistillGPT2               | 1  |  1.5417  |  4.7305   |      nan       |     nan     | 66.8776  |
|            PLBartForCausalLM            | 16 |  3.3208  |  7.2321   |      nan       |     nan     | 65.4376  |
|                CamemBert                | 1  |  5.225   |  12.6121  |      nan       |     nan     | 64.2281  |
|       RobertaForQuestionAnswering       | 64 |  5.5016  |  12.5104  |      nan       |     nan     | 62.6216  |
|        BertForQuestionAnswering         | 64 |  5.228   |  12.4546  |      nan       |     nan     |  61.755  |
|          DistilBertForMaskedLM          | 16 |  1.9566  |  5.6066   |      nan       |     nan     | 51.4273  |
|       AlbertForQuestionAnswering        | 2  |  1.6953  |  8.6595   |      nan       |     nan     | 45.7502  |
|          AllenaiLongformerBase          | 1  | 12.2056  |  22.4955  |    93.5262     |     nan     |   nan    |
+-----------------------------------------+----+----------+-----------+----------------+-------------+----------+

Peak Memory Compression Ratio

+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|                  name                   | bs | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+
|      GPT2ForSequenceClassification      | 4  | 0.9675 |  0.9163   |      nan       |     nan     |   1.07   |
|            XLNetLMHeadModel             | 4  | 0.9912 |  0.8791   |      nan       |     nan     |  1.0109  |
|       ElectraForQuestionAnswering       | 64 | 1.0016 |  0.9539   |      nan       |     nan     |  1.0002  |
|                 T5Small                 | 1  |  1.0   |  0.9124   |      nan       |     nan     |  0.9876  |
|           LayoutLMForMaskedLM           | 16 | 0.9999 |  0.9238   |      nan       |     nan     |  0.9871  |
|             BertForMaskedLM             | 64 | 0.9996 |   0.899   |      nan       |     nan     |  0.9811  |
|    LayoutLMForSequenceClassification    | 16 | 1.004  |  0.9325   |      nan       |     nan     |  0.9712  |
| BlenderbotSmallForConditionalGeneration | 32 | 0.9998 |  0.8996   |      nan       |     nan     |  0.9557  |
|             BartForCausalLM             | 2  |  1.0   |  0.8769   |      nan       |     nan     |  0.9545  |
|       T5ForConditionalGeneration        | 4  | 0.9996 |  0.9594   |      nan       |     nan     |  0.9525  |
|         Speech2Text2ForCausalLM         | 64 | 0.9954 |  0.8489   |      nan       |     nan     |  0.9452  |
|            PLBartForCausalLM            | 16 | 1.0006 |  0.8667   |      nan       |     nan     |  0.9395  |
|       BlenderbotSmallForCausalLM        | 64 | 0.9996 |  0.8172   |      nan       |     nan     |  0.9269  |
|        BertForQuestionAnswering         | 64 | 0.9995 |  0.9315   |      nan       |     nan     |  0.9256  |
|       RobertaForQuestionAnswering       | 64 | 0.9996 |  0.9315   |      nan       |     nan     |  0.9254  |
|          DistilBertForMaskedLM          | 16 | 0.9991 |  0.8698   |      nan       |     nan     |  0.9167  |
|      BartForConditionalGeneration       | 1  |  1.0   |  0.8619   |      nan       |     nan     |  0.881   |
|       AlbertForQuestionAnswering        | 2  |  1.0   |  0.6451   |      nan       |     nan     |  0.8636  |
|            MBartForCausalLM             | 16 |  1.0   |  0.8398   |      nan       |     nan     |  0.8565  |
|            AlbertForMaskedLM            | 2  |  1.0   |  0.6364   |      nan       |     nan     |  0.8515  |
|                 BigBird                 | 1  | 1.0024 |  0.9513   |      nan       |     nan     |  0.8349  |
|     DistilBertForQuestionAnswering      | 32 | 0.9987 |  0.8967   |      nan       |     nan     |  0.834   |
|     PLBartForConditionalGeneration      | 8  | 0.9999 |  0.8307   |      nan       |     nan     |  0.8252  |
|               DistillGPT2               | 1  | 1.0006 |  0.7548   |      nan       |     nan     |  0.812   |
|      MBartForConditionalGeneration      | 8  | 0.9999 |  0.8187   |      nan       |     nan     |  0.7699  |
|            TrOCRForCausalLM             | 8  |  1.0   |  0.7955   |      nan       |     nan     |  0.7566  |
|                CamemBert                | 1  | 0.9989 |  0.7872   |      nan       |     nan     |  0.7482  |
|             OPTForCausalLM              | 4  | 0.9975 |  0.7501   |      nan       |     nan     |  0.7473  |
|            YituTechConvBert             | 1  | 0.9718 |  0.7819   |      nan       |     nan     |  0.7407  |
|           PegasusForCausalLM            | 8  | 0.999  |  0.9444   |      nan       |     nan     |  0.7324  |
|           RobertaForCausalLM            | 4  | 0.9237 |  0.7741   |      nan       |     nan     |  0.7309  |
|             XGLMForCausalLM             | 1  | 0.9999 |  0.9992   |      nan       |     nan     |  0.7214  |
|    MegatronBertForQuestionAnswering     | 8  | 0.9051 |  0.8218   |      nan       |     nan     |  0.7107  |
|          MobileBertForMaskedLM          | 16 | 0.9985 |  0.8983   |      nan       |     nan     |  0.6948  |
|     PegasusForConditionalGeneration     | 4  | 0.9996 |  0.9196   |      nan       |     nan     |  0.6769  |
|           ElectraForCausalLM            | 1  | 0.9993 |  0.8955   |      nan       |     nan     |  0.6701  |
|         MegatronBertForCausalLM         | 2  | 0.7726 |  0.7726   |      nan       |     nan     |  0.6697  |
|     M2M100ForConditionalGeneration      | 2  | 0.9999 |   0.954   |      nan       |     nan     |  0.6523  |
|     MobileBertForQuestionAnswering      | 32 | 1.0142 |  0.9796   |      nan       |     nan     |  0.6265  |
|       MT5ForConditionalGeneration       | 2  | 0.6019 |  0.6019   |      nan       |     nan     |  0.6019  |
|           DebertaForMaskedLM            | 4  | 0.9982 |  0.9826   |     0.3599     |     nan     |  0.4498  |
|       DebertaForQuestionAnswering       | 4  | 0.979  |  1.0568   |     0.3576     |     nan     |  0.3761  |
|          AllenaiLongformerBase          | 1  | 0.9996 |  0.9477   |     0.3752     |     nan     |   nan    |
+-----------------------------------------+----+--------+-----------+----------------+-------------+----------+

timm_models suite with amp precision

Performance speedup

+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|        res2net50_14w_8s         |  2  | 0.9966 |  0.8973   |      0.0       |   1.3904    |  5.5892  |
|            hrnet_w18            |  2  | 1.004  |  0.9636   |      0.0       |   1.3727    |  4.9403  |
|           res2next50            |  2  | 1.0004 |  0.9702   |      0.0       |    1.363    |  4.678   |
|        twins_pcpvt_base         | 32  | 1.0036 |  0.8938   |      0.0       |   1.3592    |  2.5448  |
|      xcit_large_24_p8_224       |  5  | 1.0012 |    0.0    |      0.0       |     0.0     |  2.0556  |
|          cait_m36_384           |  2  | 1.0024 |  0.8465   |      0.0       |   1.3421    |  2.0541  |
|        tnt_s_patch16_224        | 64  | 0.9997 |  0.9927   |      0.0       |   1.8446    |  2.0203  |
|          ghostnet_100           | 128 | 1.0043 |  0.9984   |      0.0       |   1.5386    |  1.8112  |
|          gmixer_24_224          | 64  | 1.0008 |  0.8843   |      0.0       |   1.0368    |  1.6807  |
|           volo_d1_224           | 64  | 0.9994 |  0.9943   |      0.0       |   1.1498    |  1.6678  |
|         crossvit_9_240          | 64  | 1.0032 |  0.9572   |      0.0       |   1.1315    |  1.5867  |
|            nfnet_l0             | 64  | 1.0066 |  0.8388   |      0.0       |   1.1434    |  1.5833  |
|  swin_base_patch4_window7_224   | 64  | 0.9993 |   0.961   |      0.0       |   1.0563    |  1.5723  |
|            lcnet_050            | 128 | 0.9684 |  0.9499   |      0.0       |   1.5746    |  1.5519  |
|         coat_lite_mini          | 128 | 1.0002 |  0.9947   |      0.0       |   1.2658    |  1.5316  |
|           regnety_002           | 128 | 0.9786 |  0.9364   |      0.0       |   1.3847    |  1.5049  |
|           resnest101e           | 32  | 1.0032 |  0.9843   |      0.0       |   1.4186    |  1.4787  |
|          resmlp_12_224          | 128 |  1.0   |  0.9975   |     0.7819     |     0.0     |  1.4644  |
|          jx_nest_base           | 32  | 0.9992 |  0.9909   |      0.0       |    1.238    |  1.4634  |
|           convit_base           | 32  | 0.9995 |  0.9916   |      0.0       |     0.0     |  1.3992  |
|          gmlp_s16_224           | 64  | 0.9989 |  0.9827   |      0.0       |    1.051    |  1.3904  |
|            pit_b_224            | 64  | 0.9997 |  0.9939   |      0.0       |   1.0687    |  1.3644  |
|           dm_nfnet_f0           | 128 | 0.9993 |  0.9976   |      0.0       |   1.1757    |  1.326   |
|          mixer_b16_224          | 64  | 0.9992 |  0.9907   |     0.7171     |   0.9682    |  1.3168  |
| deit_base_distilled_patch16_224 | 64  | 0.9994 |  0.9911   |      0.0       |    1.071    |  1.2892  |
|      beit_base_patch16_224      | 64  | 0.9997 |  0.9783   |      0.0       |   1.0509    |  1.2862  |
|        adv_inception_v3         | 128 | 0.9998 |  0.9953   |      0.0       |   1.1938    |  1.2801  |
|       gluon_inception_v3        | 128 |  1.0   |  0.9948   |      0.0       |   1.1944    |  1.2254  |
|         poolformer_m36          | 64  | 0.999  |  0.9974   |      0.0       |     0.0     |  1.209   |
|          inception_v3           | 128 | 0.9999 |   0.995   |      0.0       |   1.1944    |  1.2078  |
|           mobilevit_s           | 32  | 0.9736 |  0.7981   |      0.0       |   1.2122    |  1.2009  |
|      vit_base_patch16_224       | 64  | 0.9995 |  0.9934   |      0.0       |   1.0006    |  1.1978  |
|            mixnet_l             | 64  | 0.9791 |  0.8892   |      0.0       |   1.0867    |  1.178   |
|           tf_mixnet_l           | 64  | 0.9808 |   0.894   |      0.0       |   1.1188    |  1.1177  |
|         visformer_small         | 128 | 0.9999 |  1.0005   |      0.0       |   1.0857    |  1.0997  |
|          pnasnet5large          | 16  | 1.0052 |  1.0336   |      0.0       |   1.1349    |  1.052   |
|            fbnetv3_b            | 128 | 0.9596 |  0.9445   |      0.0       |   1.2915    |  1.0325  |
|             dla102              | 64  | 1.0033 |  0.9902   |      0.0       |   1.3766    |  1.0242  |
|             dpn107              | 32  | 0.9389 |  0.9299   |      0.0       |   0.9938    |  0.9342  |
|            repvgg_a2            | 128 | 0.9422 |  0.9332   |     0.6563     |   1.1301    |  0.9011  |
|           fbnetc_100            | 128 | 0.952  |  0.9423   |     0.6644     |   1.3738    |  0.8982  |
|           selecsls42b           | 128 | 0.9998 |  0.9936   |      0.0       |   1.3554    |  0.8981  |
|          cspdarknet53           | 64  | 0.9431 |  0.9323   |      0.0       |   0.9006    |  0.8892  |
|        convmixer_768_32         | 32  | 0.9998 |  0.9979   |      0.0       |   1.0527    |  0.8866  |
|            tinynet_a            | 128 | 0.9575 |  0.8062   |      0.0       |   1.0907    |  0.8775  |
|           mnasnet_100           | 128 | 0.9523 |  0.9433   |     0.6613     |   1.3688    |  0.8396  |
|          convnext_base          | 32  | 1.0041 |  0.9226   |      0.0       |   1.3138    |  0.8371  |
|      mobilenetv3_large_100      | 128 | 0.9548 |  0.9437   |      0.0       |   1.3436    |  0.8349  |
|        res2net101_26w_4s        | 64  |  1.0   |  0.9969   |      0.0       |   1.3864    |  0.8124  |
|            gernet_l             | 128 | 0.9461 |  0.9361   |      0.0       |   1.1391    |  0.8112  |
|          spnasnet_100           | 128 | 0.9468 |  0.9375   |     0.6531     |   1.3174    |  0.7948  |
|         mobilenetv2_100         | 128 | 0.9504 |  0.9396   |      0.0       |   0.8657    |  0.7434  |
|        sebotnet33ts_256         | 64  | 0.9669 |  0.8365   |      0.0       |   1.1144    |  0.734   |
|       tf_efficientnet_b0        | 128 | 0.9647 |  0.8063   |      0.0       |   1.0946    |  0.7245  |
|          botnet26t_256          | 128 | 0.9792 |  0.9756   |      0.0       |   1.3411    |  0.7229  |
|        eca_halonext26ts         | 64  | 0.9636 |  0.8061   |      0.0       |   1.0992    |  0.705   |
|       eca_botnext26ts_256       | 64  | 0.9616 |  0.8005   |      0.0       |   1.1086    |  0.6749  |
|           rexnet_100            | 128 | 0.9646 |  0.8483   |      0.0       |   1.0366    |  0.6448  |
|        ese_vovnet19b_dw         | 128 | 0.9691 |  0.9642   |      0.0       |   1.2435    |  0.6419  |
|     swsl_resnext101_32x16d      | 32  | 0.9989 |  0.9801   |      0.0       |   1.0755    |  0.6057  |
|        gluon_xception65         | 32  | 0.9985 |  0.9876   |      0.0       |   1.0635    |  0.5872  |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
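
For anyone who wants to summarize a speedup table like the one above, here is a minimal sketch that parses the ASCII table and computes a per-backend geometric mean. It is not part of the dashboard tooling; `parse_table` and `geomean_speedup` are hypothetical helper names. It also assumes that `0.0` and `nan` entries mark runs that failed rather than real measurements, which matches the `fail_to_run` entries in the accuracy table but is not stated explicitly in the report. The two sample rows are copied from the table above.

```python
import math


def parse_table(text):
    """Parse a '+---+' / '| a | b |' style table into a list of dict rows."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip the '+----+' border lines
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def geomean_speedup(rows, backend):
    """Geometric mean of the positive, finite speedups for one backend.

    Assumption: 0.0 and nan entries are failed runs, so they are skipped.
    """
    vals = []
    for row in rows:
        try:
            v = float(row[backend])
        except (KeyError, ValueError):
            continue
        if v > 0 and math.isfinite(v):
            vals.append(v)
    if not vals:
        return float("nan")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


sample = """
+------------------+-----+--------+-----------+----------------+-------------+----------+
|       name       | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+------------------+-----+--------+-----------+----------------+-------------+----------+
| res2net50_14w_8s |  2  | 0.9966 |  0.8973   |      0.0       |   1.3904    |  5.5892  |
|  resmlp_12_224   | 128 |  1.0   |  0.9975   |     0.7819     |     0.0     |  1.4644  |
+------------------+-----+--------+-----------+----------------+-------------+----------+
"""

rows = parse_table(sample)
for backend in ("aot_eager", "aot_cudagraphs", "aot_nvfuser", "inductor"):
    print(backend, round(geomean_speedup(rows, backend), 4))
```

Skipping failed runs makes the mean describe only the models a backend could actually run, so it should be read alongside the pass/fail counts in the accuracy table.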

Accuracy

+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
|              name               | bs | eager |   aot_eager   | aot_cudagraphs |  aot_nvfuser  |   inductor    |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+
|           fbnetc_100            | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|           mnasnet_100           | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|            repvgg_a2            | 2  | pass  |     pass      |      pass      |     pass      |     pass      |
|        adv_inception_v3         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      beit_base_patch16_224      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          botnet26t_256          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        convmixer_768_32         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          convnext_base          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         crossvit_9_240          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          cspdarknet53           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
| deit_base_distilled_patch16_224 | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|             dla102              | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           dm_nfnet_f0           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|             dpn107              | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       eca_botnext26ts_256       | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        eca_halonext26ts         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            gernet_l             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          ghostnet_100           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       gluon_inception_v3        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          inception_v3           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            lcnet_050            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            mixnet_l             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         mobilenetv2_100         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      mobilenetv3_large_100      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           mobilevit_s           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            nfnet_l0             | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          pnasnet5large          | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           regnety_002           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        res2net101_26w_4s        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        res2net50_14w_8s         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           res2next50            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           rexnet_100            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        sebotnet33ts_256         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           selecsls42b           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|  swin_base_patch4_window7_224   | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|     swsl_resnext101_32x16d      | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|       tf_efficientnet_b0        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           tf_mixnet_l           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|            tinynet_a            | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|        tnt_s_patch16_224        | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|         visformer_small         | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|      vit_base_patch16_224       | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|           volo_d1_224           | 2  | pass  |     pass      |  fail_to_run   |     pass      |     pass      |
|          resmlp_12_224          | 2  | pass  |     pass      |      pass      |  fail_to_run  |     pass      |
|           convit_base           | 2  | pass  |     pass      |  fail_to_run   |  fail_to_run  |     pass      |
|      xcit_large_24_p8_224       | 2  | pass  |  fail_to_run  |  fail_to_run   |  fail_to_run  |     pass      |
|          gmixer_24_224          | 2  | pass  |     pass      |      pass      | fail_accuracy |     pass      |
|          gmlp_s16_224           | 2  | pass  |     pass      |      pass      | fail_accuracy |     pass      |
|          mixer_b16_224          | 2  | pass  |     pass      |      pass      | fail_accuracy |     pass      |
|         poolformer_m36          | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy |     pass      |
|           resnest101e           | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy |     pass      |
|         coat_lite_mini          | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|          jx_nest_base           | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|            pit_b_224            | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|        twins_pcpvt_base         | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy |     pass      |
|        ese_vovnet19b_dw         | 2  | pass  |     pass      |  fail_to_run   |     pass      | fail_accuracy |
|        gluon_xception65         | 2  | pass  |     pass      |  fail_to_run   |     pass      | fail_accuracy |
|            hrnet_w18            | 2  | pass  |     pass      |  fail_to_run   |     pass      | fail_accuracy |
|          spnasnet_100           | 2  | pass  |     pass      |      pass      | fail_accuracy | fail_accuracy |
|            fbnetv3_b            | 2  | pass  |     pass      |  fail_to_run   | fail_accuracy | fail_accuracy |
|          cait_m36_384           | 2  | pass  | fail_accuracy |  fail_to_run   | fail_accuracy | fail_accuracy |
+---------------------------------+----+-------+---------------+----------------+---------------+---------------+

Compilation latency (sec)

+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
|              name               | bs  |  eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor  |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+
|            hrnet_w18            |  2  | 99.5471 | 142.1533  |      nan       |  471.0836   | 1399.7866 |
|             dpn107              | 32  | 13.8554 |  28.7483  |      nan       |  112.6905   | 1352.9515 |
|          pnasnet5large          | 16  | 60.4879 |  88.5364  |      nan       |  251.1834   | 1340.2779 |
|           rexnet_100            | 128 | 6.8038  |  14.4292  |      nan       |  120.8599   | 1069.0229 |
|        res2net50_14w_8s         |  2  | 20.6816 |  38.8763  |      nan       |  121.3153   | 987.9243  |
|          ghostnet_100           | 128 | 9.5437  |  19.3356  |      nan       |   96.7996   |  882.697  |
|           mobilevit_s           | 32  | 5.9457  |  13.7479  |      nan       |   61.5465   | 879.8939  |
|        twins_pcpvt_base         | 32  | 26.7453 |  43.9793  |      nan       |   95.4658   | 843.4319  |
|       eca_botnext26ts_256       | 64  | 2.6274  |   7.452   |      nan       |   63.6443   | 839.8298  |
|            mixnet_l             | 64  | 13.5416 |  23.0219  |      nan       |   88.4897   |  835.172  |
|            fbnetv3_b            | 128 | 13.3888 |  24.1748  |      nan       |  109.7611   | 772.9081  |
|            tinynet_a            | 128 | 7.7469  |  15.6905  |      nan       |   83.963    | 743.7648  |
|           resnest101e           | 32  | 27.3378 |  47.7018  |      nan       |  125.8207   | 700.9297  |
|        sebotnet33ts_256         | 64  |  3.961  |  10.0038  |      nan       |   69.1966   | 648.9216  |
|           fbnetc_100            | 128 | 5.7278  |  12.479   |    85.5777     |   63.2472   | 638.7629  |
|         coat_lite_mini          | 128 |  3.266  |  9.1575   |      nan       |   34.2188   | 636.5705  |
|          botnet26t_256          | 128 | 2.4678  |  6.7299   |      nan       |   51.027    | 588.0481  |
|           tf_mixnet_l           | 64  | 13.7087 |  23.521   |      nan       |   89.4409   | 565.3166  |
|             dla102              | 64  | 10.8136 |  22.7803  |      nan       |   96.3515   | 540.0435  |
|        eca_halonext26ts         | 64  | 2.7442  |  7.8035   |      nan       |   67.5506   | 524.7978  |
|          cspdarknet53           | 64  | 6.1577  |  13.819   |      nan       |   44.4818   | 516.8901  |
|           res2next50            |  2  | 7.5752  |  17.4257  |      nan       |   64.6808   | 508.6052  |
|           mnasnet_100           | 128 | 4.2555  |  9.7943   |    61.6838     |   53.8161   | 460.7911  |
|       tf_efficientnet_b0        | 128 | 5.9847  |  12.8606  |      nan       |   81.5061   | 453.8244  |
|          convnext_base          | 32  | 11.9958 |  19.1913  |      nan       |   46.8047   | 447.0484  |
|        res2net101_26w_4s        | 64  | 26.2277 |  46.854   |      nan       |  142.2874   | 442.9257  |
|  swin_base_patch4_window7_224   | 64  | 12.9152 |  26.0704  |      nan       |   83.2757   | 431.9341  |
|        adv_inception_v3         | 128 | 8.7126  |  19.1453  |      nan       |   105.887   | 431.2125  |
|            nfnet_l0             | 64  | 6.0435  |  13.1945  |      nan       |   38.7061   | 399.3298  |
|      mobilenetv3_large_100      | 128 | 4.5585  |  10.0823  |      nan       |   83.9393   | 397.9304  |
|         mobilenetv2_100         | 128 | 4.1958  |  9.4317   |      nan       |   43.0148   | 392.8332  |
|           regnety_002           | 128 | 4.9475  |  10.8999  |      nan       |   60.0576   | 392.0666  |
|        ese_vovnet19b_dw         | 128 | 2.0528  |  5.1975   |      nan       |   39.6613   | 388.7989  |
|         visformer_small         | 128 | 2.5656  |  6.5413   |      nan       |   31.5689   | 381.6193  |
|      xcit_large_24_p8_224       |  5  | 37.375  |    nan    |      nan       |     nan     | 352.6873  |
|        gluon_xception65         | 32  | 15.5262 |  29.4819  |      nan       |    78.65    | 347.3151  |
|          jx_nest_base           | 32  | 9.7428  |  20.6265  |      nan       |    58.19    | 320.1488  |
|          cait_m36_384           |  2  | 47.7696 |  73.0288  |      nan       |  109.6265   | 306.2536  |
|         poolformer_m36          | 64  | 13.3601 |  21.2852  |      nan       |     nan     | 303.6845  |
|            gernet_l             | 128 | 4.9595  |  11.1233  |      nan       |   47.9657   | 292.4524  |
|         crossvit_9_240          | 64  |  7.76   |  17.4312  |      nan       |   42.3485   | 281.7218  |
|           selecsls42b           | 128 | 2.5011  |  6.9082   |      nan       |   51.4771   | 280.0613  |
|       gluon_inception_v3        | 128 | 8.5703  |  18.794   |      nan       |  105.7343   | 276.2485  |
|          spnasnet_100           | 128 | 5.5887  |  12.1725  |    81.9197     |   60.788    | 274.2042  |
|            lcnet_050            | 128 | 2.0128  |   5.131   |      nan       |   39.9489   | 244.1516  |
|          inception_v3           | 128 | 8.5271  |  18.8796  |      nan       |  106.0345   | 223.4377  |
|     swsl_resnext101_32x16d      | 32  | 10.3836 |  22.0073  |      nan       |   61.957    | 221.0566  |
|           volo_d1_224           | 64  | 6.7957  |  15.1805  |      nan       |   44.0256   | 211.2936  |
|           convit_base           | 32  | 4.3998  |  10.9888  |      nan       |     nan     | 182.1969  |
|            pit_b_224            | 64  |  3.947  |  10.0387  |      nan       |   27.9232   | 182.0427  |
|        tnt_s_patch16_224        | 64  | 12.8349 |  25.0047  |      nan       |   48.7462   | 166.9154  |
|          gmlp_s16_224           | 64  | 9.5498  |  17.7147  |      nan       |   30.2121   | 151.1543  |
|          gmixer_24_224          | 64  | 8.8075  |  17.6468  |      nan       |   35.1994   | 141.2526  |
|            repvgg_a2            | 128 | 4.8536  |  10.7212  |    53.3933     |   65.0058   | 138.1295  |
|           dm_nfnet_f0           | 128 | 6.6351  |  13.4685  |      nan       |   42.0166   | 133.3581  |
|          resmlp_12_224          | 128 | 2.9325  |  6.0916   |    10.2012     |     nan     | 100.1161  |
|          mixer_b16_224          | 64  | 2.9455  |  7.0706   |     17.11      |   18.0814   |  96.6917  |
|      beit_base_patch16_224      | 64  | 4.9563  |  10.3961  |      nan       |   21.2115   |  91.1698  |
|        convmixer_768_32         | 32  | 7.1212  |  14.8516  |      nan       |   23.9174   |  89.8166  |
| deit_base_distilled_patch16_224 | 64  | 3.2165  |  8.1922   |      nan       |   16.7035   |  84.6066  |
|      vit_base_patch16_224       | 64  | 3.0903  |  7.8695   |      nan       |   16.0406   |  71.5226  |
+---------------------------------+-----+---------+-----------+----------------+-------------+-----------+

Peak Memory Compression Ratio

+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|              name               | bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+
|          gmixer_24_224          | 64  | 1.0001 |  0.9563   |      nan       |   0.8998    |  1.2577  |
|          gmlp_s16_224           | 64  |  1.0   |  0.9679   |      nan       |    0.92     |  1.2405  |
|            tinynet_a            | 128 | 1.0001 |  0.7955   |      nan       |   0.7958    |  1.1632  |
|          pnasnet5large          | 16  | 1.0583 |  0.9923   |      nan       |   1.1741    |  1.1265  |
|        eca_halonext26ts         | 64  | 0.999  |  0.7814   |      nan       |    0.786    |  1.0887  |
|           dm_nfnet_f0           | 128 | 0.9758 |  0.9039   |      nan       |    0.95     |  1.0616  |
|        tnt_s_patch16_224        | 64  |  1.0   |  0.9718   |      nan       |   0.9431    |  1.0587  |
|           volo_d1_224           | 64  | 1.0015 |  0.9518   |      nan       |   0.8587    |  1.0378  |
|           convit_base           | 32  | 0.9991 |   0.86    |      nan       |     nan     |  1.0309  |
|      beit_base_patch16_224      | 64  | 0.9999 |  0.9367   |      nan       |   0.9298    |  1.0097  |
|           mobilevit_s           | 32  |  1.0   |  0.7722   |      nan       |    0.787    |  1.0078  |
|           rexnet_100            | 128 | 0.9988 |  0.7919   |      nan       |   0.8648    |  1.0009  |
|             dla102              | 64  | 0.9998 |  0.9549   |      nan       |   0.9751    |  0.997   |
|            pit_b_224            | 64  | 1.0021 |  0.8074   |      nan       |   0.8179    |  0.9856  |
|         poolformer_m36          | 64  | 1.0015 |  0.9462   |      nan       |     nan     |  0.9797  |
|          convnext_base          | 32  | 1.0065 |   0.908   |      nan       |   0.7521    |  0.9564  |
|        twins_pcpvt_base         | 32  | 0.9963 |  0.9079   |      nan       |   0.8007    |  0.9553  |
|        convmixer_768_32         | 32  | 0.9992 |  0.9807   |      nan       |   0.9715    |  0.9508  |
|         visformer_small         | 128 | 0.9899 |  0.9353   |      nan       |   0.8884    |  0.9342  |
|           resnest101e           | 32  | 1.0002 |  0.9762   |      nan       |   0.9535    |  0.9292  |
|           tf_mixnet_l           | 64  | 0.9995 |  0.8624   |      nan       |   0.8426    |  0.9291  |
|          mixer_b16_224          | 64  | 0.9929 |  0.9425   |     0.2532     |   0.7726    |  0.9225  |
|       tf_efficientnet_b0        | 128 | 1.0006 |  0.7769   |      nan       |    0.846    |  0.9189  |
|            nfnet_l0             | 64  | 0.9993 |   0.824   |      nan       |   0.8257    |  0.913   |
|         mobilenetv2_100         | 128 | 0.9992 |  0.7716   |      nan       |   0.9249    |  0.8963  |
|      vit_base_patch16_224       | 64  | 0.9955 |  0.9384   |      nan       |   0.8801    |  0.8916  |
| deit_base_distilled_patch16_224 | 64  | 0.9944 |  0.9376   |      nan       |   0.8794    |  0.8911  |
|      mobilenetv3_large_100      | 128 | 0.9987 |  0.8562   |      nan       |   0.8673    |  0.8886  |
|        adv_inception_v3         | 128 | 1.0003 |  0.8759   |      nan       |   0.8538    |  0.8829  |
|       gluon_inception_v3        | 128 | 1.0003 |  0.8759   |      nan       |   0.8538    |  0.8829  |
|          inception_v3           | 128 | 1.0003 |  0.8759   |      nan       |   0.8538    |  0.8829  |
|        gluon_xception65         | 32  |  1.0   |  0.8895   |      nan       |   0.8854    |  0.8713  |
|             dpn107              | 32  | 0.9981 |  0.9115   |      nan       |   0.8834    |  0.8701  |
|           selecsls42b           | 128 | 0.9789 |  0.8913   |      nan       |   0.8811    |  0.8659  |
|            fbnetv3_b            | 128 | 1.0003 |  0.7918   |      nan       |   0.7903    |  0.8645  |
|            mixnet_l             | 64  | 0.9989 |  0.8507   |      nan       |   0.7796    |  0.8601  |
|          spnasnet_100           | 128 | 0.9988 |  0.8961   |     0.1651     |   0.8371    |  0.8599  |
|       eca_botnext26ts_256       | 64  | 0.9998 |  0.7776   |      nan       |   0.7813    |  0.8532  |
|     swsl_resnext101_32x16d      | 32  | 1.0009 |  0.8805   |      nan       |   0.8487    |  0.8523  |
|      xcit_large_24_p8_224       |  5  | 0.9987 |    nan    |      nan       |     nan     |  0.8489  |
|          resmlp_12_224          | 128 | 0.9827 |  0.9667   |     0.2637     |     nan     |  0.845   |
|          ghostnet_100           | 128 | 1.0013 |  0.8903   |      nan       |   0.9244    |  0.833   |
|         coat_lite_mini          | 128 | 1.0338 |   0.929   |      nan       |   0.6593    |  0.8328  |
|        ese_vovnet19b_dw         | 128 |  1.0   |   0.867   |      nan       |   0.9146    |  0.8269  |
|          cspdarknet53           | 64  |  1.0   |  0.8469   |      nan       |   0.7906    |  0.813   |
|          cait_m36_384           |  2  | 0.9998 |  0.8806   |      nan       |   0.9023    |  0.8081  |
|          jx_nest_base           | 32  |  1.0   |  0.8945   |      nan       |    0.86     |   0.8    |
|         crossvit_9_240          | 64  | 1.0008 |  0.8801   |      nan       |   0.8854    |  0.7933  |
|        res2net101_26w_4s        | 64  | 0.9999 |  0.9202   |      nan       |   0.8569    |  0.7834  |
|           mnasnet_100           | 128 | 0.9993 |  0.8882   |     0.1669     |   0.8253    |  0.773   |
|  swin_base_patch4_window7_224   | 64  | 0.9998 |  0.9234   |      nan       |   0.8451    |  0.7676  |
|        sebotnet33ts_256         | 64  | 0.9999 |  0.7108   |      nan       |   0.7354    |  0.7449  |
|            gernet_l             | 128 | 0.9998 |  0.8655   |      nan       |   0.8299    |  0.7238  |
|           fbnetc_100            | 128 | 0.9984 |  0.8631   |     0.1626     |   0.7352    |  0.7104  |
|            lcnet_050            | 128 | 0.9992 |  0.7927   |      nan       |   0.7885    |  0.705   |
|           regnety_002           | 128 | 0.9994 |  0.8284   |      nan       |   0.7819    |  0.6975  |
|          botnet26t_256          | 128 |  1.0   |  0.8755   |      nan       |    0.78     |  0.6616  |
|           res2next50            |  2  |  1.0   |  0.8301   |      nan       |   0.8198    |  0.6012  |
|        res2net50_14w_8s         |  2  |  1.0   |  0.8275   |      nan       |   0.8169    |  0.5927  |
|            hrnet_w18            |  2  |  1.0   |  0.8383   |      nan       |   0.8363    |  0.5746  |
|            repvgg_a2            | 128 | 1.0003 |  0.7971   |     0.1444     |   0.6902    |  0.5572  |
+---------------------------------+-----+--------+-----------+----------------+-------------+----------+

Performance graphs

[Figures: bench_logs/timm_models_amp.png, bench_logs/huggingface_amp.png, bench_logs/torchbench_amp.png]

@anijain2305
Contributor Author

Performance Dashboard for amp precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

  1. Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
  2. The experiments do not cover dynamic shapes.
  3. The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.
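
The filtering and aggregation described above can be sketched as follows. This is a minimal illustration rather than the dashboard's actual script: the `results` rows and the `geomean_speedup` helper are hypothetical, and the rule that only rows with status `pass` are kept is an assumption.

```python
from statistics import geometric_mean

# Hypothetical per-model records: (name, accuracy status, speedup over native PyTorch).
results = [
    ("model_a", "pass", 2.0),
    ("model_b", "pass", 8.0),
    ("model_c", "fail_to_run", 0.0),  # dropped: did not pass the accuracy check
]


def geomean_speedup(rows):
    """Geometric mean of speedups over rows whose accuracy status is 'pass'."""
    passing = [speedup for _, status, speedup in rows if status == "pass"]
    if not passing:
        # No model passed, so there is nothing to aggregate.
        return float("nan")
    return geometric_mean(passing)


print(f"{geomean_speedup(results):.2f}x")
```

Dropping failed models before aggregating matters because a single 0.0 entry would otherwise drive a geometric mean to zero.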

Passrate

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager | 98%, 42/43  |
| inductor  | 84%, 36/43  |
+-----------+-------------+

Geometric mean speedup

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    1.00x    |
| inductor  |    2.25x    |
+-----------+-------------+

Mean compilation time (seconds)

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    25.89    |
| inductor  |    87.60    |
+-----------+-------------+

Peak memory footprint compression ratio (higher is better)

+-----------+-------------+
| Compiler  | huggingface |
+-----------+-------------+
| aot_eager |    0.86x    |
| inductor  |    0.83x    |
+-----------+-------------+

Metrics over time

[Figures: bench_logs/passrate_over_time.png, bench_logs/geomean_over_time.png]

huggingface suite with amp precision


Performance speedup

+-----------------------------------------+----+-----------+----------+
|                  name                   | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+----------+
|           ElectraForCausalLM            | 1  |  0.8457   |  6.3238  |
|          MobileBertForMaskedLM          | 16 |  0.8464   |  6.2119  |
|       MT5ForConditionalGeneration       | 2  |  0.8564   |  5.4336  |
|     MobileBertForQuestionAnswering      | 32 |  0.8237   |  5.0662  |
|            YituTechConvBert             | 1  |  0.8386   |  4.728   |
|         MegatronBertForCausalLM         | 2  |  0.8526   |  4.216   |
|             OPTForCausalLM              | 4  |  0.8238   |  3.9855  |
|           RobertaForCausalLM            | 4  |  0.8428   |  3.9209  |
|     PegasusForConditionalGeneration     | 4  |  0.8341   |  3.8129  |
|     M2M100ForConditionalGeneration      | 2  |   0.815   |  3.7545  |
|             XGLMForCausalLM             | 1  |  0.8097   |  3.6427  |
|                CamemBert                | 1  |  0.8502   |  3.4717  |
|     PLBartForConditionalGeneration      | 8  |  0.8242   |  3.2557  |
|    MegatronBertForQuestionAnswering     | 8  |  0.8588   |  3.1214  |
|               DistillGPT2               | 1  |  0.8651   |  2.624   |
|      MBartForConditionalGeneration      | 8  |  0.8457   |  2.4031  |
|      GPT2ForSequenceClassification      | 4  |  0.9749   |  2.1444  |
|         Speech2Text2ForCausalLM         | 64 |  0.8355   |  2.0987  |
|       ElectraForQuestionAnswering       | 64 |  0.9657   |  1.9725  |
|            TrOCRForCausalLM             | 8  |  0.8338   |  1.9162  |
|                 T5Small                 | 1  |  0.8839   |  1.8796  |
|          DistilBertForMaskedLM          | 16 |  0.8498   |  1.8745  |
|           PegasusForCausalLM            | 8  |  0.8044   |  1.8467  |
| BlenderbotSmallForConditionalGeneration | 32 |  0.8895   |  1.7928  |
|      BartForConditionalGeneration       | 1  |  0.8342   |  1.7886  |
|     DistilBertForQuestionAnswering      | 32 |  0.8476   |  1.7631  |
|    LayoutLMForSequenceClassification    | 16 |  0.9786   |  1.7266  |
|       T5ForConditionalGeneration        | 4  |  0.9354   |  1.6896  |
|       AlbertForQuestionAnswering        | 2  |  0.8084   |  1.6586  |
|            AlbertForMaskedLM            | 2  |  0.8084   |  1.6458  |
|            XLNetLMHeadModel             | 4  |  0.9599   |  1.5933  |
|            PLBartForCausalLM            | 16 |  0.9311   |  1.5454  |
|       DebertaForQuestionAnswering       | 4  |  0.7242   |  1.4752  |
|             BartForCausalLM             | 2  |   0.963   |  1.4663  |
|       RobertaForQuestionAnswering       | 64 |  0.9577   |  1.4464  |
|        BertForQuestionAnswering         | 64 |  0.9665   |  1.4415  |
|            MBartForCausalLM             | 16 |  0.8988   |  1.4018  |
|           DebertaForMaskedLM            | 4  |   0.729   |  1.3922  |
|             BertForMaskedLM             | 64 |  0.9562   |  1.3337  |
|       BlenderbotSmallForCausalLM        | 64 |  0.9249   |  1.303   |
|                 BigBird                 | 1  |  0.9124   |  1.1505  |
|           LayoutLMForMaskedLM           | 16 |  0.9693   |   0.0    |
|          AllenaiLongformerBase          | 1  |  0.7271   |   0.0    |
+-----------------------------------------+----+-----------+----------+

Accuracy

+-----------------------------------------+----+-----------+-------------+
|                  name                   | bs | aot_eager |  inductor   |
+-----------------------------------------+----+-----------+-------------+
|            AlbertForMaskedLM            | 1  |   pass    |    pass     |
|       AlbertForQuestionAnswering        | 1  |   pass    |    pass     |
|             BartForCausalLM             | 1  |   pass    |    pass     |
|             BertForMaskedLM             | 1  |   pass    |    pass     |
|        BertForQuestionAnswering         | 1  |   pass    |    pass     |
|                 BigBird                 | 1  |   pass    |    pass     |
|       BlenderbotSmallForCausalLM        | 1  |   pass    |    pass     |
| BlenderbotSmallForConditionalGeneration | 1  |   pass    |    pass     |
|                CamemBert                | 1  |   pass    |    pass     |
|           DebertaForMaskedLM            | 1  |   pass    |    pass     |
|       DebertaForQuestionAnswering       | 1  |   pass    |    pass     |
|          DistilBertForMaskedLM          | 1  |   pass    |    pass     |
|     DistilBertForQuestionAnswering      | 1  |   pass    |    pass     |
|               DistillGPT2               | 1  |   pass    |    pass     |
|           ElectraForCausalLM            | 1  |   pass    |    pass     |
|       ElectraForQuestionAnswering       | 1  |   pass    |    pass     |
|      GPT2ForSequenceClassification      | 1  |   pass    |    pass     |
|           LayoutLMForMaskedLM           | 1  |   pass    |    pass     |
|    LayoutLMForSequenceClassification    | 1  |   pass    |    pass     |
|            MBartForCausalLM             | 1  |   pass    |    pass     |
|       MT5ForConditionalGeneration       | 1  |   pass    |    pass     |
|         MegatronBertForCausalLM         | 1  |   pass    |    pass     |
|    MegatronBertForQuestionAnswering     | 1  |   pass    |    pass     |
|          MobileBertForMaskedLM          | 1  |   pass    |    pass     |
|     MobileBertForQuestionAnswering      | 1  |   pass    |    pass     |
|             OPTForCausalLM              | 1  |   pass    |    pass     |
|            PLBartForCausalLM            | 1  |   pass    |    pass     |
|           PegasusForCausalLM            | 1  |   pass    |    pass     |
|     PegasusForConditionalGeneration     | 1  |   pass    |    pass     |
|           RobertaForCausalLM            | 1  |   pass    |    pass     |
|       RobertaForQuestionAnswering       | 1  |   pass    |    pass     |
|         Speech2Text2ForCausalLM         | 1  |   pass    |    pass     |
|       T5ForConditionalGeneration        | 1  |   pass    |    pass     |
|                 T5Small                 | 1  |   pass    |    pass     |
|            TrOCRForCausalLM             | 1  |   pass    |    pass     |
|            XLNetLMHeadModel             | 1  |   pass    |    pass     |
|            YituTechConvBert             | 1  |   pass    |    pass     |
|          AllenaiLongformerBase          | 1  |   pass    | fail_to_run |
|      BartForConditionalGeneration       | 1  |   pass    | fail_to_run |
|      MBartForConditionalGeneration      | 1  |   pass    | fail_to_run |
|     PLBartForConditionalGeneration      | 1  |   pass    | fail_to_run |
|     M2M100ForConditionalGeneration      | 1  |   pass    |   0.0000    |
|             XGLMForCausalLM             | 0  |  0.0000   |   0.0000    |
+-----------------------------------------+----+-----------+-------------+

Compilation latency (sec)

+-----------------------------------------+----+-----------+----------+
|                  name                   | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+----------+
|          MobileBertForMaskedLM          | 16 | 177.6258  | 256.3088 |
|     MobileBertForQuestionAnswering      | 32 |  173.307  | 244.0373 |
|     M2M100ForConditionalGeneration      | 2  |  46.0615  | 178.9848 |
|       T5ForConditionalGeneration        | 4  |  12.8008  | 154.8311 |
|      MBartForConditionalGeneration      | 8  |  47.4506  | 151.3125 |
|             XGLMForCausalLM             | 1  |  30.2354  | 144.8421 |
|      BartForConditionalGeneration       | 1  |  46.2761  | 140.5452 |
|       MT5ForConditionalGeneration       | 2  |  19.9685  | 140.3639 |
|            XLNetLMHeadModel             | 4  |  41.7665  | 137.2853 |
|     PegasusForConditionalGeneration     | 4  |  46.8928  | 136.7294 |
|       DebertaForQuestionAnswering       | 4  |  14.3774  | 130.8288 |
|           DebertaForMaskedLM            | 4  |  14.5131  | 120.3766 |
|         MegatronBertForCausalLM         | 2  |  31.7314  | 118.0195 |
|    MegatronBertForQuestionAnswering     | 8  |  31.7457  | 117.8187 |
|            YituTechConvBert             | 1  |  20.5728  | 101.3281 |
| BlenderbotSmallForConditionalGeneration | 32 |  25.0994  | 96.1492  |
|     PLBartForConditionalGeneration      | 8  |  17.4257  |  94.093  |
|                 T5Small                 | 1  |  12.6393  | 91.9498  |
|             OPTForCausalLM              | 4  |  12.2074  | 71.8761  |
|            MBartForCausalLM             | 16 |  17.439   | 68.2201  |
|       ElectraForQuestionAnswering       | 64 |  12.8934  | 67.2964  |
|            TrOCRForCausalLM             | 8  |  17.0497  | 65.3169  |
|    LayoutLMForSequenceClassification    | 16 |  12.9564  | 65.1504  |
|           ElectraForCausalLM            | 1  |  12.7619  | 64.8722  |
|             BartForCausalLM             | 2  |  16.9692  | 64.1101  |
|       RobertaForQuestionAnswering       | 64 |  13.3347  | 63.5298  |
|           PegasusForCausalLM            | 8  |  17.0779  | 62.4789  |
|             BertForMaskedLM             | 64 |  12.6206  | 61.9757  |
|                 BigBird                 | 1  |  20.3027  | 61.6895  |
|        BertForQuestionAnswering         | 64 |  12.604   |  61.571  |
|      GPT2ForSequenceClassification      | 4  |  10.1868  | 60.4961  |
|                CamemBert                | 1  |  12.6605  |  59.846  |
|           RobertaForCausalLM            | 4  |  12.8908  | 58.2581  |
|       BlenderbotSmallForCausalLM        | 64 |  9.5692   | 52.4237  |
|            PLBartForCausalLM            | 16 |  6.8495   | 48.1822  |
|            AlbertForMaskedLM            | 2  |  9.2523   | 45.7043  |
|       AlbertForQuestionAnswering        | 2  |  8.8283   | 45.5348  |
|               DistillGPT2               | 1  |  4.7016   | 41.7968  |
|          DistilBertForMaskedLM          | 16 |  5.5164   | 39.5652  |
|     DistilBertForQuestionAnswering      | 32 |  5.4656   | 39.5651  |
|         Speech2Text2ForCausalLM         | 64 |  6.9795   |  38.258  |
|          AllenaiLongformerBase          | 1  |  22.9042  |   nan    |
|           LayoutLMForMaskedLM           | 16 |   13.22   |   nan    |
+-----------------------------------------+----+-----------+----------+

Peak Memory Compression Ratio

+-----------------------------------------+----+-----------+----------+
|                  name                   | bs | aot_eager | inductor |
+-----------------------------------------+----+-----------+----------+
|      GPT2ForSequenceClassification      | 4  |  0.9163   |   1.07   |
|       ElectraForQuestionAnswering       | 64 |  0.9539   |  1.0237  |
|            XLNetLMHeadModel             | 4  |  0.8791   |  1.0109  |
|                 T5Small                 | 1  |  0.9124   |  0.9876  |
|             BertForMaskedLM             | 64 |   0.899   |  0.9811  |
|    LayoutLMForSequenceClassification    | 16 |  0.9325   |  0.9712  |
| BlenderbotSmallForConditionalGeneration | 32 |  0.8996   |  0.9557  |
|             BartForCausalLM             | 2  |  0.8769   |  0.9545  |
|       T5ForConditionalGeneration        | 4  |  0.9594   |  0.9525  |
|         Speech2Text2ForCausalLM         | 64 |  0.8489   |  0.9452  |
|          DistilBertForMaskedLM          | 16 |  0.8698   |  0.9448  |
|           ElectraForCausalLM            | 1  |  0.8955   |  0.941   |
|            PLBartForCausalLM            | 16 |  0.8667   |  0.9395  |
|       BlenderbotSmallForCausalLM        | 64 |  0.8172   |  0.9269  |
|        BertForQuestionAnswering         | 64 |  0.9315   |  0.9256  |
|       RobertaForQuestionAnswering       | 64 |  0.9315   |  0.9254  |
|      BartForConditionalGeneration       | 1  |  0.8619   |  0.881   |
|       AlbertForQuestionAnswering        | 2  |  0.6451   |  0.8636  |
|            MBartForCausalLM             | 16 |  0.8398   |  0.8565  |
|            AlbertForMaskedLM            | 2  |  0.6364   |  0.8515  |
|                 BigBird                 | 1  |  0.9513   |  0.8349  |
|     DistilBertForQuestionAnswering      | 32 |  0.8967   |  0.8334  |
|     PLBartForConditionalGeneration      | 8  |  0.8307   |  0.8251  |
|               DistillGPT2               | 1  |  0.7548   |  0.812   |
|          MobileBertForMaskedLM          | 16 |  0.8983   |  0.7803  |
|      MBartForConditionalGeneration      | 8  |  0.8187   |  0.7699  |
|            TrOCRForCausalLM             | 8  |  0.7955   |  0.7566  |
|                CamemBert                | 1  |  0.7872   |  0.7482  |
|             OPTForCausalLM              | 4  |  0.7501   |  0.7473  |
|            YituTechConvBert             | 1  |  0.7819   |  0.7407  |
|           PegasusForCausalLM            | 8  |  0.9444   |  0.7324  |
|           RobertaForCausalLM            | 4  |  0.7741   |  0.7309  |
|             XGLMForCausalLM             | 1  |  0.9992   |  0.7214  |
|    MegatronBertForQuestionAnswering     | 8  |  0.8218   |  0.7107  |
|     PegasusForConditionalGeneration     | 4  |  0.9196   |  0.6769  |
|         MegatronBertForCausalLM         | 2  |  0.7726   |  0.6697  |
|     M2M100ForConditionalGeneration      | 2  |  0.9497   |  0.6568  |
|     MobileBertForQuestionAnswering      | 32 |  0.9796   |  0.6265  |
|       MT5ForConditionalGeneration       | 2  |  0.6019   |  0.6019  |
|           DebertaForMaskedLM            | 4  |  0.9826   |  0.4498  |
|       DebertaForQuestionAnswering       | 4  |  1.0568   |  0.3761  |
|          AllenaiLongformerBase          | 1  |  0.9477   |   nan    |
|           LayoutLMForMaskedLM           | 16 |  0.9238   |   nan    |
+-----------------------------------------+----+-----------+----------+

Performance graphs

[Figure: bench_logs/huggingface_amp.png]

@anijain2305
Contributor Author

Performance Dashboard for float32 precision

Executive Summary

We evaluate different backends across three benchmark suites: torchbench, huggingface, and timm. All experiments run on A100 GPUs. Each experiment runs one iteration of the forward and backward pass. For accuracy, we check the numerical correctness of the forward-pass outputs and gradients by comparing them with native PyTorch. We measure speedup by normalizing against the performance of native PyTorch. We also report the mean compilation latency and the peak memory footprint reduction ratio.

Caveats

  1. Batch sizes have been reduced to work around OOM errors. Work is in progress to reduce the peak memory footprint.
  2. The experiments do not cover dynamic shapes.
  3. The experimental setup does not include an optimizer.

To measure performance, compilation latency and memory footprint reduction, we remove the models that fail accuracy checks.

Passrate

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      | 91%, 50/55 | 98%, 43/44  | 100%, 61/61 |
|   aot_eager    | 89%, 49/55 | 98%, 43/44  | 90%, 55/61  |
| aot_cudagraphs | 25%, 14/55 |  0%, 0/44   |  2%, 1/61   |
|  aot_nvfuser   | 58%, 32/55 |  2%, 1/44   | 82%, 50/61  |
|    inductor    | 84%, 46/55 | 93%, 41/44  | 95%, 58/61  |
+----------------+------------+-------------+-------------+

Geometric mean speedup

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   1.00x    |    1.01x    |    1.00x    |
|   aot_eager    |   1.01x    |    1.00x    |    1.00x    |
| aot_cudagraphs |   1.02x    |    0.0x     |    1.00x    |
|  aot_nvfuser   |   1.13x    |    1.12x    |    1.12x    |
|    inductor    |   1.39x    |    1.60x    |    1.21x    |
+----------------+------------+-------------+-------------+

Mean compilation time (seconds)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |    5.42    |    14.22    |    11.34    |
|   aot_eager    |    9.77    |    21.16    |    16.79    |
| aot_cudagraphs |    4.86    |     0.0     |    7.42     |
|  aot_nvfuser   |   22.48    |    10.56    |    57.73    |
|    inductor    |   238.15   |   109.27    |   366.65    |
+----------------+------------+-------------+-------------+

Peak memory footprint compression ratio (higher is better)

+----------------+------------+-------------+-------------+
|    Compiler    | torchbench | huggingface | timm_models |
+----------------+------------+-------------+-------------+
|     eager      |   0.95x    |    0.98x    |    1.00x    |
|   aot_eager    |   0.86x    |    0.89x    |    0.88x    |
| aot_cudagraphs |   0.41x    |    0.0x     |    0.25x    |
|  aot_nvfuser   |   0.83x    |    1.08x    |    0.85x    |
|    inductor    |   0.78x    |    0.74x    |    0.90x    |
+----------------+------------+-------------+-------------+
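
The comment does not spell out how the compression ratio is computed. One reading consistent with "higher is better" is native PyTorch peak memory divided by the backend's peak memory. The sketch below assumes that definition; the `compression_ratio` helper and the byte counts are hypothetical.

```python
def compression_ratio(baseline_peak_bytes: int, compiled_peak_bytes: int) -> float:
    """Baseline peak memory divided by compiled peak memory (assumed definition)."""
    if compiled_peak_bytes <= 0:
        raise ValueError("compiled_peak_bytes must be positive")
    return baseline_peak_bytes / compiled_peak_bytes


# Hypothetical run: the baseline peaks at 8 GiB, the compiled model at 10 GiB.
ratio = compression_ratio(8 * 2**30, 10 * 2**30)
print(f"{ratio:.2f}x")  # below 1.0: the compiled run used more memory than the baseline
```

Under this reading, values below 1.00x in the table mean the backend used more peak memory than native PyTorch.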

torchbench suite with float32 precision


Performance speedup

+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|               name                |  bs  | eager  | aot_eager | aot_cudagraphs | aot_nvfuser | inductor |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+
|            densenet121            |  4   | 0.9991 |  1.0087   |      0.0       |   1.4479    |  4.9052  |
|         timm_efficientdet         |  1   | 0.9824 |  0.8787   |      0.0       |     0.0     |  3.9475  |
|       functorch_dp_cifar10        |  64  | 0.9963 |  0.9772   |      0.0       |    1.197    |  3.6233  |
|      timm_vision_transformer      |  8   | 1.0025 |  0.9173   |      0.0       |   1.3464    |  2.5509  |
|                drq                |  1   | 1.0043 |  0.8529   |      0.0       |   1.0585    |  2.4584  |
|           BERT_pytorch            |  16  | 1.0078 |  0.8843   |      0.0       |     0.0     |  1.8656  |
|             resnet18              |  16  | 1.0033 |   1.104   |      0.0       |   1.3915    |  1.8125  |
|               dcgan               |  32  | 0.9844 |  1.0223   |     1.0738     |   1.1668    |  1.7591  |
|           lennard_jones           | 1000 | 0.9793 |  0.8541   |     1.062      |    1.027    |  1.7573  |
|          pytorch_struct           | 200  | 0.9964 |  0.7439   |     0.8929     |   0.8905    |  1.7547  |
|   pytorch_CycleGAN_and_pix2pix    |  1   | 1.0004 |  0.9318   |     1.1166     |   1.2026    |  1.7117  |
|             hf_Albert             |  8   | 1.0013 |  0.9975   |      0.0       |     0.0     |  1.6656  |
|           squeezenet1_1           |  32  | 1.0075 |  1.0042   |     0.9826     |   1.1641    |  1.6018  |
|          resnext50_32x4d          |  8   | 1.0038 |  1.0848   |      0.0       |    1.36     |  1.513   |
|        mobilenet_v3_large         |  32  | 1.0035 |  1.1165   |      0.0       |    1.397    |  1.4827  |
|            timm_nfnet             | 128  | 0.9995 |  0.9997   |      0.0       |    1.211    |  1.4715  |
|              hf_GPT2              |  4   | 1.0071 |  0.9753   |      0.0       |     0.0     |  1.4298  |
|            hf_T5_large            |  2   | 1.0245 |  0.8903   |      0.0       |     0.0     |  1.4073  |
|         soft_actor_critic         | 256  | 1.0007 |  0.7819   |     1.0121     |    1.045    |  1.377   |
|           fastNLP_Bert            |  6   | 0.9988 |  0.9748   |      0.0       |     0.0     |  1.3639  |
|              hf_Bart              |  4   | 1.0125 |  0.9707   |      0.0       |     0.0     |  1.2501  |
|          LearningToPaint          |  96  | 1.004  |  1.0612   |      0.0       |   1.2169    |  1.2114  |
|           pytorch_unet            |  1   |  1.0   |  0.9966   |      0.0       |   1.0756    |  1.2054  |
|            Super_SloMo            |  6   |  1.0   |  0.9973   |      0.0       |     0.0     |  1.1763  |
|               vgg16               |  64  | 0.9999 |  0.9982   |     0.7928     |   0.9965    |  1.1707  |
|              alexnet              | 128  | 0.9999 |  0.9985   |     0.7786     |    1.001    |  1.1615  |
|           hf_DistilBert           |  8   | 0.9997 |  0.9551   |      0.0       |     0.0     |  1.1572  |
|              hf_Bert              |  4   | 1.0291 |  0.9963   |      0.0       |     0.0     |  1.1565  |
|            mnasnet1_0             |  32  | 0.9998 |  1.1013   |     0.7453     |   1.3026    |  1.1524  |
|          pytorch_stargan          |  16  | 0.9992 |  0.9827   |     0.7293     |   1.0246    |  1.1189  |
|        Background_Matting         |  4   | 0.9997 |  1.0225   |      0.0       |   1.0825    |  1.1138  |
|            hf_Reformer            |  4   | 0.9965 |    0.0    |     0.8945     |     0.0     |  1.1108  |
|            hf_BigBird             |  2   | 0.9941 |  0.9398   |      0.0       |     0.0     |  1.0989  |
|         timm_efficientnet         |  32  | 0.961  |  0.8183   |      0.0       |   1.0739    |  1.0815  |
|        shufflenet_v2_x1_0         | 128  | 1.0008 |  1.0519   |      0.0       |   1.1884    |  1.0746  |
|   timm_vision_transformer_large   |  8   | 0.9999 |  0.9935   |      0.0       |    0.982    |  1.0531  |
| attention_is_all_you_need_pytorch | 256  | 0.9973 |  0.9715   |      0.0       |     0.0     |  1.0492  |
|           timm_resnest            |  32  |  1.0   |  1.0019   |      0.0       |   1.1832    |  1.0416  |
|            tts_angular            |  64  | 0.9854 |  0.9625   |     0.9851     |   1.0031    |  1.0091  |
|              demucs               |  4   | 1.0004 |  1.0005   |      1.0       |   1.0003    |  1.0002  |
|               dlrm                | 2048 | 0.904  |  0.8836   |      0.0       |     0.0     |  0.9304  |
|            timm_vovnet            |  32  | 0.9122 |  0.9055   |      0.0       |   0.9776    |  0.9152  |
|      nvidia_deeprecommender       | 256  | 0.9991 |  0.9632   |     0.5842     |   0.9441    |  0.9044  |
|           mobilenet_v2            |  96  | 0.9996 |  0.9986   |      0.0       |   1.0422    |  0.8514  |
|            timm_regnet            |  32  | 0.9654 |   0.964   |      0.0       |   1.0934    |  0.7595  |
|             resnet50              |  32  | 0.9986 |  0.9934   |      0.0       |   1.1608    |  0.7378  |
|              yolov3               |  16  | 0.9995 |  0.9944   |      0.0       |   1.1831    |   0.0    |
|               hf_T5               |  8   | 1.0018 |  0.9897   |      0.0       |     0.0     |   0.0    |
|           hf_GPT2_large           |  4   | 0.9999 |  0.9805   |      0.0       |     0.0     |   0.0    |
|        speech_transformer         |  32  | 1.0016 |  0.9211   |      0.0       |     0.0     |   0.0    |
|           hf_Longformer           |  0   |  0.0   |    0.0    |      0.0       |     0.0     |   0.0    |
|    mobilenet_v2_quantized_qat     |  0   |  0.0   |    0.0    |      0.0       |     0.0     |   0.0    |
|               moco                |  0   |  0.0   |    0.0    |      0.0       |     0.0     |   0.0    |
|      resnet50_quantized_qat       |  0   |  0.0   |    0.0    |      0.0       |     0.0     |   0.0    |
|             tacotron2             |  0   |  0.0   |    0.0    |      0.0       |     0.0     |   0.0    |
+-----------------------------------+------+--------+-----------+----------------+-------------+----------+

Accuracy

+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|               name                | bs  |      eager       |    aot_eager     |  aot_cudagraphs  |   aot_nvfuser    |     inductor     |
+-----------------------------------+-----+------------------+------------------+------------------+------------------+------------------+
|           hf_GPT2_large           |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|            hf_T5_large            |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|   timm_vision_transformer_large   |  2  | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip | pass_due_to_skip |
|              alexnet              |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               dcgan               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|              demucs               |  4  |       pass       |       pass       |       pass       |       pass       |       pass       |
|           lennard_jones           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|            mnasnet1_0             |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|      nvidia_deeprecommender       |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|   pytorch_CycleGAN_and_pix2pix    |  1  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_stargan          | 16  |       pass       |       pass       |       pass       |       pass       |       pass       |
|          pytorch_struct           | 200 |       pass       |       pass       |       pass       |       pass       |       pass       |
|         soft_actor_critic         | 256 |       pass       |       pass       |       pass       |       pass       |       pass       |
|           squeezenet1_1           |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|            tts_angular            |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|               vgg16               |  2  |       pass       |       pass       |       pass       |       pass       |       pass       |
|        Background_Matting         |  4  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|          LearningToPaint          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            densenet121            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|                drq                |  1  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|       functorch_dp_cifar10        |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           mobilenet_v2            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|        mobilenet_v3_large         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           pytorch_unet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet18              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|             resnet50              |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|          resnext50_32x4d          |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|        shufflenet_v2_x1_0         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|         timm_efficientnet         |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_nfnet             |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|            timm_regnet            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|           timm_resnest            |  2  |       pass       |       pass       |   fail_to_run    |       pass       |       pass       |
|      timm_vision_transformer      |  2  |       pass       |       pass       |   fail_to_run    |