Long test time for PyTorch test_fx::TestVisionTracing with dynamo enabled #93620

@huydhn

Issue

I'm investigating a list of slow tests on PyTorch CI, and one job stands out at the top of the list: linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge). The metrics show that it takes close to 3 hours to finish under normal conditions.

Taking a closer look, it turns out that the long pole, accounting for more than two thirds of the test time, is the set of TestVisionTracing tests in test_fx.py. The test log with timings is included below for reference (from https://github.com/pytorch/pytorch/runs/7774079140). My guess is that the bigger the model, the slower the test becomes.

2022-08-03T17:25:10.6477370Z   test_torchvision_models_alexnet (__main__.TestVisionTracing) ... ok (0.886s)
2022-08-03T17:26:05.7221409Z   test_torchvision_models_convnext_base (__main__.TestVisionTracing) ... ok (54.702s)
2022-08-03T17:27:02.4509609Z   test_torchvision_models_convnext_large (__main__.TestVisionTracing) ... ok (56.729s)
2022-08-03T17:27:56.6380097Z   test_torchvision_models_convnext_small (__main__.TestVisionTracing) ... ok (54.187s)
2022-08-03T17:28:12.1358494Z   test_torchvision_models_convnext_tiny (__main__.TestVisionTracing) ... ok (15.498s)
2022-08-03T17:30:15.0829924Z   test_torchvision_models_densenet121 (__main__.TestVisionTracing) ... ok (122.947s)
2022-08-03T17:33:58.2949895Z   test_torchvision_models_densenet161 (__main__.TestVisionTracing) ... ok (223.212s)
2022-08-03T17:38:06.7859626Z   test_torchvision_models_densenet169 (__main__.TestVisionTracing) ... ok (248.491s)
2022-08-03T17:38:06.7896555Z   test_torchvision_models_densenet201 (__main__.TestVisionTracing) ... ok (358.177s)
2022-08-03T17:44:04.9669275Z   test_torchvision_models_detection_fasterrcnn_mobilenet_v3_large_320_fpn ok (0.770s)
2022-08-03T17:44:05.7351427Z   test_torchvision_models_detection_fasterrcnn_mobilenet_v3_large_fpn ok (0.243s)
2022-08-03T17:44:05.9786792Z   test_torchvision_models_detection_fasterrcnn_resnet50_fpn (__main__.TestVisionTracing) ... ok (0.560s)
2022-08-03T17:44:07.0258696Z   test_torchvision_models_detection_fasterrcnn_resnet50_fpn_v2 (__main__.TestVisionTracing) ... ok (0.488s)
2022-08-03T17:44:07.0270010Z   test_torchvision_models_detection_fcos_resnet50_fpn (__main__.TestVisionTracing) ... ok (0.471s)
2022-08-03T17:44:07.4987991Z   test_torchvision_models_detection_keypointrcnn_resnet50_fpn (__main__.TestVisionTracing) ... ok (0.647s)
2022-08-03T17:44:08.1453012Z   test_torchvision_models_detection_maskrcnn_resnet50_fpn (__main__.TestVisionTracing) ... ok (0.453s)
2022-08-03T17:44:09.1151239Z   test_torchvision_models_detection_maskrcnn_resnet50_fpn_v2 (__main__.TestVisionTracing) ... ok (0.518s)
2022-08-03T17:44:09.1162174Z   test_torchvision_models_detection_retinanet_resnet50_fpn (__main__.TestVisionTracing) ...  ok (0.435s)
2022-08-03T17:44:10.0464869Z   test_torchvision_models_detection_retinanet_resnet50_fpn_v2 (__main__.TestVisionTracing) ... ok (0.496s)
2022-08-03T17:44:10.0476882Z   test_torchvision_models_detection_ssd300_vgg16 (__main__.TestVisionTracing) ... ok (4.769s)
2022-08-03T17:44:14.8171701Z   test_torchvision_models_detection_ssdlite320_mobilenet_v3_large (__main__.TestVisionTracing) ... ok (45.547s)
2022-08-03T17:45:58.0140988Z   test_torchvision_models_efficientnet_b0 (__main__.TestVisionTracing) ... ok (57.650s)
2022-08-03T17:48:00.4734725Z   test_torchvision_models_efficientnet_b1 (__main__.TestVisionTracing) ... ok (122.459s)
2022-08-03T17:50:03.9557458Z   test_torchvision_models_efficientnet_b2 (__main__.TestVisionTracing) ... ok (123.482s)
2022-08-03T17:52:38.0078169Z   test_torchvision_models_efficientnet_b3 (__main__.TestVisionTracing) ... ok (154.052s)
2022-08-03T17:56:36.5506145Z   test_torchvision_models_efficientnet_b4 (__main__.TestVisionTracing) ... ok (238.543s)
2022-08-03T18:02:31.9753962Z   test_torchvision_models_efficientnet_b5 (__main__.TestVisionTracing) ... ok (355.425s)
2022-08-03T18:10:34.0433740Z   test_torchvision_models_efficientnet_b6 (__main__.TestVisionTracing) ... ok (482.068s)
2022-08-03T18:22:45.6426801Z   test_torchvision_models_efficientnet_b7 (__main__.TestVisionTracing) ... ok (731.599s)
2022-08-03T18:22:47.1220214Z   test_torchvision_models_efficientnet_v2_l (__main__.TestVisionTracing) ...ok (1194.113s)
2022-08-03T18:53:00.3677928Z   test_torchvision_models_efficientnet_v2_m (__main__.TestVisionTracing) ... ok (620.612s)
2022-08-03T18:57:49.4276270Z   test_torchvision_models_efficientnet_v2_s (__main__.TestVisionTracing) ... ok (289.060s)
2022-08-03T18:57:49.4438332Z   test_torchvision_models_googlenet (__main__.TestVisionTracing) ... ok (29.098s)
2022-08-03T18:58:18.5393137Z   test_torchvision_models_inception_v3 (__main__.TestVisionTracing) ... ok (63.303s)
2022-08-03T18:59:46.1718149Z   test_torchvision_models_mnasnet0_5 (__main__.TestVisionTracing) ... ok (24.342s)
2022-08-03T18:59:46.1867787Z   test_torchvision_models_mnasnet0_75 (__main__.TestVisionTracing) ... ok (25.346s)
2022-08-03T19:00:36.9759874Z   test_torchvision_models_mnasnet1_0 (__main__.TestVisionTracing) ... ok (25.458s)
2022-08-03T19:01:02.9683669Z   test_torchvision_models_mnasnet1_3 (__main__.TestVisionTracing) ... ok (25.992s)
2022-08-03T19:01:32.5979077Z   test_torchvision_models_mobilenet_v2 (__main__.TestVisionTracing) ... ok (29.629s)
2022-08-03T19:02:08.5415970Z   test_torchvision_models_mobilenet_v3_large (__main__.TestVisionTracing) ... ok (35.944s)
2022-08-03T19:02:32.4524047Z   test_torchvision_models_mobilenet_v3_small (__main__.TestVisionTracing) ... ok (23.910s)
2022-08-03T19:03:29.8420213Z   test_torchvision_models_regnet_x_16gf (__main__.TestVisionTracing) ... ok (57.389s)
2022-08-03T19:04:07.9935732Z   test_torchvision_models_regnet_x_1_6gf (__main__.TestVisionTracing) ... ok (38.151s)
2022-08-03T19:05:12.0549659Z   test_torchvision_models_regnet_x_32gf (__main__.TestVisionTracing) ... ok (64.061s)
2022-08-03T19:06:22.7225415Z   test_torchvision_models_regnet_x_3_2gf (__main__.TestVisionTracing) ... ok (70.667s)
2022-08-03T19:07:18.2250778Z   test_torchvision_models_regnet_x_400mf (__main__.TestVisionTracing) ... ok (55.502s)
2022-08-03T19:07:49.3684930Z   test_torchvision_models_regnet_x_800mf (__main__.TestVisionTracing) ... ok (31.143s)
2022-08-03T19:08:51.6417110Z   test_torchvision_models_regnet_x_8gf (__main__.TestVisionTracing) ... ok (62.273s)
2022-08-03T19:11:49.2177510Z   test_torchvision_models_regnet_y_128gf (__main__.TestVisionTracing) ... ok (177.576s)
2022-08-03T19:11:49.2208120Z   test_torchvision_models_regnet_y_16gf (__main__.TestVisionTracing) ... ok (73.293s)
2022-08-03T19:15:41.0194445Z   test_torchvision_models_regnet_y_1_6gf (__main__.TestVisionTracing) ... ok (158.508s)
2022-08-03T19:17:12.7901234Z   test_torchvision_models_regnet_y_32gf (__main__.TestVisionTracing) ... ok (91.770s)
2022-08-03T19:18:48.2377446Z   test_torchvision_models_regnet_y_3_2gf (__main__.TestVisionTracing) ... ok (95.447s)
2022-08-03T19:19:45.2292131Z   test_torchvision_models_regnet_y_400mf (__main__.TestVisionTracing) ... ok (56.991s)
2022-08-03T19:20:31.7528460Z   test_torchvision_models_regnet_y_800mf (__main__.TestVisionTracing) ... ok (46.523s)
2022-08-03T19:20:32.2151362Z   test_torchvision_models_regnet_y_8gf (__main__.TestVisionTracing) ... ok (41.056s)
2022-08-03T19:22:49.0783090Z   test_torchvision_models_resnet101 (__main__.TestVisionTracing) ... ok (96.269s)
2022-08-03T19:26:08.9278863Z   test_torchvision_models_resnet152 (__main__.TestVisionTracing) ... ok (199.849s)
2022-08-03T19:26:12.7781004Z   test_torchvision_models_resnet18 (__main__.TestVisionTracing) ... ok (3.850s)
2022-08-03T19:26:24.1143151Z   test_torchvision_models_resnet34 (__main__.TestVisionTracing) ... ok (11.336s)
2022-08-03T19:26:47.0665742Z   test_torchvision_models_resnet50 (__main__.TestVisionTracing) ... ok (22.952s)
2022-08-03T19:28:18.6756152Z   test_torchvision_models_resnext101_32x8d (__main__.TestVisionTracing) ... ok (91.609s)
2022-08-03T19:29:51.5175733Z   test_torchvision_models_resnext101_64x4d (__main__.TestVisionTracing) ... ok (92.842s)
2022-08-03T19:30:15.0395468Z   test_torchvision_models_resnext50_32x4d (__main__.TestVisionTracing) ... ok (23.522s)
2022-08-03T19:30:15.0417455Z   test_torchvision_models_segmentation_deeplabv3_mobilenet_v3_large (__main__.TestVisionTracing) ... ok (41.518s)
2022-08-03T19:30:56.5593147Z   test_torchvision_models_segmentation_deeplabv3_resnet101 (__main__.TestVisionTracing) ... ok (65.539s)
2022-08-03T19:32:02.0989764Z   test_torchvision_models_segmentation_deeplabv3_resnet50 (__main__.TestVisionTracing) ... ok (19.032s)
2022-08-03T19:32:21.1306708Z   test_torchvision_models_segmentation_fcn_resnet101 (__main__.TestVisionTracing) ...ok (55.015s)
2022-08-03T19:33:16.1455914Z   test_torchvision_models_segmentation_fcn_resnet50 (__main__.TestVisionTracing) ... ok (15.044s)
2022-08-03T19:33:31.1890927Z   test_torchvision_models_segmentation_lraspp_mobilenet_v3_large (__main__.TestVisionTracing) ... ok (22.548s)
2022-08-03T19:34:29.0257884Z   test_torchvision_models_shufflenet_v2_x0_5 (__main__.TestVisionTracing) ... ok (35.290s)
2022-08-03T19:34:29.0365339Z   test_torchvision_models_shufflenet_v2_x1_0 (__main__.TestVisionTracing) ... ok (35.166s)
2022-08-03T19:35:39.0395488Z   test_torchvision_models_shufflenet_v2_x1_5 (__main__.TestVisionTracing) ... ok (34.847s)
2022-08-03T19:36:14.5451081Z   test_torchvision_models_shufflenet_v2_x2_0 (__main__.TestVisionTracing) ... ok (35.505s)
2022-08-03T19:36:17.8882872Z   test_torchvision_models_squeezenet1_0 (__main__.TestVisionTracing) ... ok (3.343s)
2022-08-03T19:36:21.1852522Z   test_torchvision_models_squeezenet1_1 (__main__.TestVisionTracing) ... ok (3.297s)
2022-08-03T19:36:23.9077391Z   test_torchvision_models_swin_b (__main__.TestVisionTracing) ... ok (5.616s)
2022-08-03T19:36:27.5764829Z   test_torchvision_models_swin_s (__main__.TestVisionTracing) ... ok (3.819s)
2022-08-03T19:36:32.4463984Z   test_torchvision_models_swin_t (__main__.TestVisionTracing) ... ok (1.826s)
2022-08-03T19:36:34.9411761Z   test_torchvision_models_vgg11 (__main__.TestVisionTracing) ... ok (2.495s)
2022-08-03T19:36:38.0211472Z   test_torchvision_models_vgg11_bn (__main__.TestVisionTracing) ... ok (3.080s)
2022-08-03T19:36:41.0269459Z   test_torchvision_models_vgg13 (__main__.TestVisionTracing) ... ok (3.006s)
2022-08-03T19:36:45.1734879Z   test_torchvision_models_vgg13_bn (__main__.TestVisionTracing) ... ok (4.146s)
2022-08-03T19:36:48.3880343Z   test_torchvision_models_vgg16 (__main__.TestVisionTracing) ... ok (3.214s)
2022-08-03T19:36:52.6734397Z   test_torchvision_models_vgg16_bn (__main__.TestVisionTracing) ... ok (4.285s)
2022-08-03T19:36:56.6082544Z   test_torchvision_models_vgg19 (__main__.TestVisionTracing) ... ok (3.935s)
2022-08-03T19:37:02.3325897Z   test_torchvision_models_vgg19_bn (__main__.TestVisionTracing) ... ok (5.724s)
2022-08-03T19:37:06.5155769Z   test_torchvision_models_video_mc3_18 (__main__.TestVisionTracing) ... ok (4.183s)
2022-08-03T19:38:55.8700574Z   test_torchvision_models_video_mvit_v1_b (__main__.TestVisionTracing) ... ok (109.354s)
2022-08-03T19:39:12.5654877Z   test_torchvision_models_video_r2plus1d_18 (__main__.TestVisionTracing) ... ok (16.695s)
2022-08-03T19:39:17.3841514Z   test_torchvision_models_video_r3d_18 (__main__.TestVisionTracing) ... ok (4.818s)
2022-08-03T19:39:21.9138265Z   test_torchvision_models_vit_b_16 (__main__.TestVisionTracing) ... ok (4.529s)
2022-08-03T19:39:25.7134707Z   test_torchvision_models_vit_b_32 (__main__.TestVisionTracing) ... ok (3.800s)
2022-08-03T19:39:25.7494094Z   test_torchvision_models_vit_h_14 (__main__.TestVisionTracing) ... ok (24.109s)
2022-08-03T19:40:02.6691677Z   test_torchvision_models_vit_l_16 (__main__.TestVisionTracing) ... ok (12.846s)
2022-08-03T19:40:15.0948724Z   test_torchvision_models_vit_l_32 (__main__.TestVisionTracing) ... ok (12.426s)
2022-08-03T19:40:16.4909689Z   test_torchvision_models_wide_resnet101_2 (__main__.TestVisionTracing) ... ok (90.164s)
2022-08-03T19:42:09.2416125Z   test_torchvision_models_wide_resnet50_2 (__main__.TestVisionTracing) ... ok (23.982s)

In contrast, these tests take only minutes without dynamo; see https://github.com/pytorch/pytorch/runs/7774078598.
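To see which models dominate the run, the timings in the log above can be tallied with a short script. This is a minimal sketch, not part of the PyTorch CI tooling: the helper names (`parse_timings`, `slowest`, `total_seconds`) are made up for illustration, and the sample only contains three lines copied from the log.

```python
import re

# Matches log lines such as:
#   ... test_torchvision_models_alexnet (__main__.TestVisionTracing) ... ok (0.886s)
# and captures the test name and the duration in seconds.
LINE_RE = re.compile(r"(test_torchvision_models_\w+).*?\((\d+(?:\.\d+)?)s\)\s*$")

# Three lines copied from the log above; paste the full log here for a real tally.
SAMPLE_LOG = """\
2022-08-03T17:25:10.6477370Z   test_torchvision_models_alexnet (__main__.TestVisionTracing) ... ok (0.886s)
2022-08-03T17:38:06.7896555Z   test_torchvision_models_densenet201 (__main__.TestVisionTracing) ... ok (358.177s)
2022-08-03T18:22:47.1220214Z   test_torchvision_models_efficientnet_v2_l (__main__.TestVisionTracing) ...ok (1194.113s)
"""


def parse_timings(log_text):
    """Return {test_name: seconds} for every timed test line in log_text."""
    timings = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings


def slowest(timings, n=5):
    """Return the n slowest (test_name, seconds) pairs, slowest first."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)[:n]


def total_seconds(timings):
    """Return the summed duration of all parsed tests."""
    return sum(timings.values())


timings = parse_timings(SAMPLE_LOG)
for name, seconds in slowest(timings):
    print(f"{seconds:10.3f}s  {name}")
print(f"total: {total_seconds(timings) / 60:.1f} min")
```

Running the same script over the full log gives a ranked list that can be used to decide which models are worth keeping in the dynamo job.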

AFAIK, dynamo tests are run with torchdynamo.optimize("eager") (see #80106), so these figures are probably expected. My questions are:

  • Is there a way to alleviate the situation, such as running these tests in a different, "lazier" mode to avoid this bottleneck?
  • Does the team think that these tests bring enough value to justify the waiting time? They run on pull requests, so it's pretty costly to make the whole workflow wait for them. Maybe we can find a middle ground.
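One possible middle ground is to keep every model in the regular job but run only a small representative subset when dynamo is enabled. The following is a rough sketch under stated assumptions, not the actual test_fx.py code: it assumes the dynamo CI job can be detected through a `PYTORCH_TEST_WITH_DYNAMO` environment variable, and the allowlist `DYNAMO_SMOKE_MODELS`, the helper `should_run`, and the class `TestVisionTracingSketch` are hypothetical names. The real tracing check is replaced by a placeholder.

```python
import os
import unittest

# Assumption: the dynamo CI job sets this environment variable to "1".
DYNAMO_ENABLED = os.getenv("PYTORCH_TEST_WITH_DYNAMO") == "1"

# Hypothetical allowlist: a few models that finish in seconds in the log above.
DYNAMO_SMOKE_MODELS = frozenset({"alexnet", "resnet18", "vgg11", "squeezenet1_0"})


def should_run(model_name, dynamo_enabled=DYNAMO_ENABLED):
    """Run every model normally, but only allowlisted models under dynamo."""
    return (not dynamo_enabled) or model_name in DYNAMO_SMOKE_MODELS


class TestVisionTracingSketch(unittest.TestCase):
    """Stand-in for the real test class; tests are attached dynamically below."""


def _make_test(model_name):
    def test(self):
        if not should_run(model_name):
            self.skipTest(f"{model_name} is skipped under dynamo to save CI time")
        # Placeholder for the real tracing check, which is not reproduced here.
        self.assertTrue(model_name)

    return test


# Illustrative subset of model names taken from the log above.
for _name in ("alexnet", "resnet18", "densenet201", "efficientnet_v2_l"):
    setattr(
        TestVisionTracingSketch,
        f"test_torchvision_models_{_name}",
        _make_test(_name),
    )
```

With this shape, the slow models still run in the non-dynamo job, while the dynamo job reports them as skipped instead of spending minutes on each one. Which models belong in the allowlist is a judgment call for the team.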

Thank you for looking into this!

Solutions

TBD

cc @ezyang @soumith @msaroufim @wconstab @ngimel @bdhirsh

    Labels

    oncall: pt2, triaged
