Description
Issue
I'm investigating slow tests on PyTorch CI, and one job stands out at the top of the list: linux-bionic-py3.7-clang9 / test (dynamo, 1, 2, linux.2xlarge). The metrics show that it takes close to 3 hours to finish under normal conditions.
Taking a closer look, the long pole, which accounts for more than two thirds of the test time, is the set of TestVisionTracing tests in test_fx.py. The test log with timings is at https://github.com/pytorch/pytorch/runs/7774079140, and the relevant excerpt is below. My guess is that the bigger the model, the slower the test becomes.
2022-08-03T17:25:10.6477370Z test_torchvision_models_alexnet (__main__.TestVisionTracing) ... ok (0.886s)
2022-08-03T17:26:05.7221409Z test_torchvision_models_convnext_base (__main__.TestVisionTracing) ... ok (54.702s)
2022-08-03T17:27:02.4509609Z test_torchvision_models_convnext_large (__main__.TestVisionTracing) ... ok (56.729s)
2022-08-03T17:27:56.6380097Z test_torchvision_models_convnext_small (__main__.TestVisionTracing) ... ok (54.187s)
2022-08-03T17:28:12.1358494Z test_torchvision_models_convnext_tiny (__main__.TestVisionTracing) ... ok (15.498s)
2022-08-03T17:30:15.0829924Z test_torchvision_models_densenet121 (__main__.TestVisionTracing) ... ok (122.947s)
2022-08-03T17:33:58.2949895Z test_torchvision_models_densenet161 (__main__.TestVisionTracing) ... ok (223.212s)
2022-08-03T17:38:06.7859626Z test_torchvision_models_densenet169 (__main__.TestVisionTracing) ... ok (248.491s)
2022-08-03T17:38:06.7896555Z test_torchvision_models_densenet201 (__main__.TestVisionTracing) ... ok (358.177s)
2022-08-03T17:44:04.9669275Z test_torchvision_models_detection_fasterrcnn_mobilenet_v3_large_320_fpn (__main__.TestVisionTracing) ... ok (0.770s)
2022-08-03T17:44:05.7351427Z test_torchvision_models_detection_fasterrcnn_mobilenet_v3_large_fpn (__main__.TestVisionTracing) ... ok (0.243s)
2022-08-03T17:44:05.9786792Z test_torchvision_models_detection_fasterrcnn_resnet50_fpn (__main__.TestVisionTracing) ... ok (0.560s)
2022-08-03T17:44:07.0258696Z test_torchvision_models_detection_fasterrcnn_resnet50_fpn_v2 (__main__.TestVisionTracing) ... ok (0.488s)
2022-08-03T17:44:07.0270010Z test_torchvision_models_detection_fcos_resnet50_fpn (__main__.TestVisionTracing) ... ok (0.471s)
2022-08-03T17:44:07.4987991Z test_torchvision_models_detection_keypointrcnn_resnet50_fpn (__main__.TestVisionTracing) ... ok (0.647s)
2022-08-03T17:44:08.1453012Z test_torchvision_models_detection_maskrcnn_resnet50_fpn (__main__.TestVisionTracing) ... ok (0.453s)
2022-08-03T17:44:09.1151239Z test_torchvision_models_detection_maskrcnn_resnet50_fpn_v2 (__main__.TestVisionTracing) ... ok (0.518s)
2022-08-03T17:44:09.1162174Z test_torchvision_models_detection_retinanet_resnet50_fpn (__main__.TestVisionTracing) ... ok (0.435s)
2022-08-03T17:44:10.0464869Z test_torchvision_models_detection_retinanet_resnet50_fpn_v2 (__main__.TestVisionTracing) ... ok (0.496s)
2022-08-03T17:44:10.0476882Z test_torchvision_models_detection_ssd300_vgg16 (__main__.TestVisionTracing) ... ok (4.769s)
2022-08-03T17:44:14.8171701Z test_torchvision_models_detection_ssdlite320_mobilenet_v3_large (__main__.TestVisionTracing) ... ok (45.547s)
2022-08-03T17:45:58.0140988Z test_torchvision_models_efficientnet_b0 (__main__.TestVisionTracing) ... ok (57.650s)
2022-08-03T17:48:00.4734725Z test_torchvision_models_efficientnet_b1 (__main__.TestVisionTracing) ... ok (122.459s)
2022-08-03T17:50:03.9557458Z test_torchvision_models_efficientnet_b2 (__main__.TestVisionTracing) ... ok (123.482s)
2022-08-03T17:52:38.0078169Z test_torchvision_models_efficientnet_b3 (__main__.TestVisionTracing) ... ok (154.052s)
2022-08-03T17:56:36.5506145Z test_torchvision_models_efficientnet_b4 (__main__.TestVisionTracing) ... ok (238.543s)
2022-08-03T18:02:31.9753962Z test_torchvision_models_efficientnet_b5 (__main__.TestVisionTracing) ... ok (355.425s)
2022-08-03T18:10:34.0433740Z test_torchvision_models_efficientnet_b6 (__main__.TestVisionTracing) ... ok (482.068s)
2022-08-03T18:22:45.6426801Z test_torchvision_models_efficientnet_b7 (__main__.TestVisionTracing) ... ok (731.599s)
2022-08-03T18:22:47.1220214Z test_torchvision_models_efficientnet_v2_l (__main__.TestVisionTracing) ... ok (1194.113s)
2022-08-03T18:53:00.3677928Z test_torchvision_models_efficientnet_v2_m (__main__.TestVisionTracing) ... ok (620.612s)
2022-08-03T18:57:49.4276270Z test_torchvision_models_efficientnet_v2_s (__main__.TestVisionTracing) ... ok (289.060s)
2022-08-03T18:57:49.4438332Z test_torchvision_models_googlenet (__main__.TestVisionTracing) ... ok (29.098s)
2022-08-03T18:58:18.5393137Z test_torchvision_models_inception_v3 (__main__.TestVisionTracing) ... ok (63.303s)
2022-08-03T18:59:46.1718149Z test_torchvision_models_mnasnet0_5 (__main__.TestVisionTracing) ... ok (24.342s)
2022-08-03T18:59:46.1867787Z test_torchvision_models_mnasnet0_75 (__main__.TestVisionTracing) ... ok (25.346s)
2022-08-03T19:00:36.9759874Z test_torchvision_models_mnasnet1_0 (__main__.TestVisionTracing) ... ok (25.458s)
2022-08-03T19:01:02.9683669Z test_torchvision_models_mnasnet1_3 (__main__.TestVisionTracing) ... ok (25.992s)
2022-08-03T19:01:32.5979077Z test_torchvision_models_mobilenet_v2 (__main__.TestVisionTracing) ... ok (29.629s)
2022-08-03T19:02:08.5415970Z test_torchvision_models_mobilenet_v3_large (__main__.TestVisionTracing) ... ok (35.944s)
2022-08-03T19:02:32.4524047Z test_torchvision_models_mobilenet_v3_small (__main__.TestVisionTracing) ... ok (23.910s)
2022-08-03T19:03:29.8420213Z test_torchvision_models_regnet_x_16gf (__main__.TestVisionTracing) ... ok (57.389s)
2022-08-03T19:04:07.9935732Z test_torchvision_models_regnet_x_1_6gf (__main__.TestVisionTracing) ... ok (38.151s)
2022-08-03T19:05:12.0549659Z test_torchvision_models_regnet_x_32gf (__main__.TestVisionTracing) ... ok (64.061s)
2022-08-03T19:06:22.7225415Z test_torchvision_models_regnet_x_3_2gf (__main__.TestVisionTracing) ... ok (70.667s)
2022-08-03T19:07:18.2250778Z test_torchvision_models_regnet_x_400mf (__main__.TestVisionTracing) ... ok (55.502s)
2022-08-03T19:07:49.3684930Z test_torchvision_models_regnet_x_800mf (__main__.TestVisionTracing) ... ok (31.143s)
2022-08-03T19:08:51.6417110Z test_torchvision_models_regnet_x_8gf (__main__.TestVisionTracing) ... ok (62.273s)
2022-08-03T19:11:49.2177510Z test_torchvision_models_regnet_y_128gf (__main__.TestVisionTracing) ... ok (177.576s)
2022-08-03T19:11:49.2208120Z test_torchvision_models_regnet_y_16gf (__main__.TestVisionTracing) ... ok (73.293s)
2022-08-03T19:15:41.0194445Z test_torchvision_models_regnet_y_1_6gf (__main__.TestVisionTracing) ... ok (158.508s)
2022-08-03T19:17:12.7901234Z test_torchvision_models_regnet_y_32gf (__main__.TestVisionTracing) ... ok (91.770s)
2022-08-03T19:18:48.2377446Z test_torchvision_models_regnet_y_3_2gf (__main__.TestVisionTracing) ... ok (95.447s)
2022-08-03T19:19:45.2292131Z test_torchvision_models_regnet_y_400mf (__main__.TestVisionTracing) ... ok (56.991s)
2022-08-03T19:20:31.7528460Z test_torchvision_models_regnet_y_800mf (__main__.TestVisionTracing) ... ok (46.523s)
2022-08-03T19:20:32.2151362Z test_torchvision_models_regnet_y_8gf (__main__.TestVisionTracing) ... ok (41.056s)
2022-08-03T19:22:49.0783090Z test_torchvision_models_resnet101 (__main__.TestVisionTracing) ... ok (96.269s)
2022-08-03T19:26:08.9278863Z test_torchvision_models_resnet152 (__main__.TestVisionTracing) ... ok (199.849s)
2022-08-03T19:26:12.7781004Z test_torchvision_models_resnet18 (__main__.TestVisionTracing) ... ok (3.850s)
2022-08-03T19:26:24.1143151Z test_torchvision_models_resnet34 (__main__.TestVisionTracing) ... ok (11.336s)
2022-08-03T19:26:47.0665742Z test_torchvision_models_resnet50 (__main__.TestVisionTracing) ... ok (22.952s)
2022-08-03T19:28:18.6756152Z test_torchvision_models_resnext101_32x8d (__main__.TestVisionTracing) ... ok (91.609s)
2022-08-03T19:29:51.5175733Z test_torchvision_models_resnext101_64x4d (__main__.TestVisionTracing) ... ok (92.842s)
2022-08-03T19:30:15.0395468Z test_torchvision_models_resnext50_32x4d (__main__.TestVisionTracing) ... ok (23.522s)
2022-08-03T19:30:15.0417455Z test_torchvision_models_segmentation_deeplabv3_mobilenet_v3_large (__main__.TestVisionTracing) ... ok (41.518s)
2022-08-03T19:30:56.5593147Z test_torchvision_models_segmentation_deeplabv3_resnet101 (__main__.TestVisionTracing) ... ok (65.539s)
2022-08-03T19:32:02.0989764Z test_torchvision_models_segmentation_deeplabv3_resnet50 (__main__.TestVisionTracing) ... ok (19.032s)
2022-08-03T19:32:21.1306708Z test_torchvision_models_segmentation_fcn_resnet101 (__main__.TestVisionTracing) ... ok (55.015s)
2022-08-03T19:33:16.1455914Z test_torchvision_models_segmentation_fcn_resnet50 (__main__.TestVisionTracing) ... ok (15.044s)
2022-08-03T19:33:31.1890927Z test_torchvision_models_segmentation_lraspp_mobilenet_v3_large (__main__.TestVisionTracing) ... ok (22.548s)
2022-08-03T19:34:29.0257884Z test_torchvision_models_shufflenet_v2_x0_5 (__main__.TestVisionTracing) ... ok (35.290s)
2022-08-03T19:34:29.0365339Z test_torchvision_models_shufflenet_v2_x1_0 (__main__.TestVisionTracing) ... ok (35.166s)
2022-08-03T19:35:39.0395488Z test_torchvision_models_shufflenet_v2_x1_5 (__main__.TestVisionTracing) ... ok (34.847s)
2022-08-03T19:36:14.5451081Z test_torchvision_models_shufflenet_v2_x2_0 (__main__.TestVisionTracing) ... ok (35.505s)
2022-08-03T19:36:17.8882872Z test_torchvision_models_squeezenet1_0 (__main__.TestVisionTracing) ... ok (3.343s)
2022-08-03T19:36:21.1852522Z test_torchvision_models_squeezenet1_1 (__main__.TestVisionTracing) ... ok (3.297s)
2022-08-03T19:36:23.9077391Z test_torchvision_models_swin_b (__main__.TestVisionTracing) ... ok (5.616s)
2022-08-03T19:36:27.5764829Z test_torchvision_models_swin_s (__main__.TestVisionTracing) ... ok (3.819s)
2022-08-03T19:36:32.4463984Z test_torchvision_models_swin_t (__main__.TestVisionTracing) ... ok (1.826s)
2022-08-03T19:36:34.9411761Z test_torchvision_models_vgg11 (__main__.TestVisionTracing) ... ok (2.495s)
2022-08-03T19:36:38.0211472Z test_torchvision_models_vgg11_bn (__main__.TestVisionTracing) ... ok (3.080s)
2022-08-03T19:36:41.0269459Z test_torchvision_models_vgg13 (__main__.TestVisionTracing) ... ok (3.006s)
2022-08-03T19:36:45.1734879Z test_torchvision_models_vgg13_bn (__main__.TestVisionTracing) ... ok (4.146s)
2022-08-03T19:36:48.3880343Z test_torchvision_models_vgg16 (__main__.TestVisionTracing) ... ok (3.214s)
2022-08-03T19:36:52.6734397Z test_torchvision_models_vgg16_bn (__main__.TestVisionTracing) ... ok (4.285s)
2022-08-03T19:36:56.6082544Z test_torchvision_models_vgg19 (__main__.TestVisionTracing) ... ok (3.935s)
2022-08-03T19:37:02.3325897Z test_torchvision_models_vgg19_bn (__main__.TestVisionTracing) ... ok (5.724s)
2022-08-03T19:37:06.5155769Z test_torchvision_models_video_mc3_18 (__main__.TestVisionTracing) ... ok (4.183s)
2022-08-03T19:38:55.8700574Z test_torchvision_models_video_mvit_v1_b (__main__.TestVisionTracing) ... ok (109.354s)
2022-08-03T19:39:12.5654877Z test_torchvision_models_video_r2plus1d_18 (__main__.TestVisionTracing) ... ok (16.695s)
2022-08-03T19:39:17.3841514Z test_torchvision_models_video_r3d_18 (__main__.TestVisionTracing) ... ok (4.818s)
2022-08-03T19:39:21.9138265Z test_torchvision_models_vit_b_16 (__main__.TestVisionTracing) ... ok (4.529s)
2022-08-03T19:39:25.7134707Z test_torchvision_models_vit_b_32 (__main__.TestVisionTracing) ... ok (3.800s)
2022-08-03T19:39:25.7494094Z test_torchvision_models_vit_h_14 (__main__.TestVisionTracing) ... ok (24.109s)
2022-08-03T19:40:02.6691677Z test_torchvision_models_vit_l_16 (__main__.TestVisionTracing) ... ok (12.846s)
2022-08-03T19:40:15.0948724Z test_torchvision_models_vit_l_32 (__main__.TestVisionTracing) ... ok (12.426s)
2022-08-03T19:40:16.4909689Z test_torchvision_models_wide_resnet101_2 (__main__.TestVisionTracing) ... ok (90.164s)
2022-08-03T19:42:09.2416125Z test_torchvision_models_wide_resnet50_2 (__main__.TestVisionTracing) ... ok (23.982s)
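For anyone who wants to check how much of the job these tests account for, the timings above can be totaled with a short script. This is only a rough sketch: the helper names `parse_durations` and `summarize` are made up for illustration, and the regular expression assumes the unittest verbose line format shown in the log excerpt. The sample input is three lines copied from that excerpt; the full log can be passed in the same way.

```python
import re

# Matches unittest verbose lines such as
# "<timestamp> test_x (__main__.TestVisionTracing) ... ok (0.886s)".
# The class name and the "..." separator are treated as optional.
LINE_RE = re.compile(
    r"^\S+\s+(?P<name>test_\w+)\b.*?\bok\s+\((?P<secs>\d+(?:\.\d+)?)s\)\s*$"
)


def parse_durations(log_text):
    """Return {test_name: seconds} for every passing test line in log_text."""
    durations = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            durations[match["name"]] = float(match["secs"])
    return durations


def summarize(durations, top=5):
    """Return (total_seconds, [(name, seconds), ...]) with the slowest tests first."""
    total = sum(durations.values())
    slowest = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return total, slowest


# Three lines copied from the log excerpt above.
sample = """\
2022-08-03T17:25:10.6477370Z test_torchvision_models_alexnet (__main__.TestVisionTracing) ... ok (0.886s)
2022-08-03T18:22:45.6426801Z test_torchvision_models_efficientnet_b7 (__main__.TestVisionTracing) ... ok (731.599s)
2022-08-03T18:53:00.3677928Z test_torchvision_models_efficientnet_v2_m (__main__.TestVisionTracing) ... ok (620.612s)
"""

total, slowest = summarize(parse_durations(sample), top=2)
print(f"total: {total / 60:.1f} min")
for name, secs in slowest:
    print(f"{secs:9.3f}s  {name}")
```

Sorting by duration rather than by log order makes it easy to see which model families dominate the run.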
By contrast, the same tests take only minutes without dynamo; see https://github.com/pytorch/pytorch/runs/7774078598.
AFAIK, dynamo tests are run with torchdynamo.optimize("eager") (see #80106), so these figures are probably expected. My questions are:
- Is there a way to alleviate the situation, such as running these tests in a different, "lazier" mode to avoid this bottleneck?
- Does the team think these tests bring enough value to justify the waiting time? They run on pull requests, so it's pretty costly to make the whole workflow wait for them. Maybe we can find a middle ground.
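One possible middle ground, sketched below purely as an illustration, is to skip only the slowest models when the suite runs in the dynamo configuration and keep the rest. Everything named here is an assumption: the environment variable `RUNNING_UNDER_DYNAMO`, the `SLOW_UNDER_DYNAMO` set, and the placeholder test body are hypothetical and are not how PyTorch CI or test_fx.py is actually set up. The three model names are simply the slowest entries in the log above.

```python
import os
import unittest

# Hypothetical names for illustration only; not PyTorch's real CI settings.
DYNAMO_ENV_VAR = "RUNNING_UNDER_DYNAMO"
SLOW_UNDER_DYNAMO = {"efficientnet_v2_l", "efficientnet_b7", "efficientnet_v2_m"}


def dynamo_enabled():
    """True when the (assumed) environment flag marks a dynamo test run."""
    return os.environ.get(DYNAMO_ENV_VAR) == "1"


def make_model_test(name):
    """Build one generated test method for the given model name."""
    def test(self):
        # Decide at run time, so the same generated test works in both configs.
        if dynamo_enabled() and name in SLOW_UNDER_DYNAMO:
            self.skipTest(f"{name} is too slow under dynamo")
        # Placeholder body; the real test would build and trace the model.
        self.assertTrue(name)
    return test


class TestVisionTracingSketch(unittest.TestCase):
    pass


# Attach one test per model, mirroring the test_torchvision_models_* naming.
for _name in ["alexnet", "efficientnet_b7", "efficientnet_v2_l"]:
    setattr(
        TestVisionTracingSketch,
        f"test_torchvision_models_{_name}",
        make_model_test(_name),
    )


def run_suite():
    """Run the sketch suite and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestVisionTracingSketch)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With the flag set to "1", `run_suite()` reports the two EfficientNet tests as skipped and still runs alexnet. Without the flag, all three run. The trade-off is reduced dynamo coverage for the largest models, which could be recovered by running the skipped tests in a periodic job instead of on every pull request.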
Thank you for looking into this!
Solutions
TBD