[ONNX] Support large attribute and subgraph for large model #38793

Closed
BowenBao wants to merge 6 commits from BowenBao:onnx_large_attr_subgraph

BowenBao commented May 20, 2020

Previously, large tensor data in attributes and subgraphs was not stored externally, so ONNX could not serialize a model whose total size reached 2GB or more. This PR stores that tensor data externally too, so such models can be exported.
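
For readers unfamiliar with the idea, here is a minimal sketch of what "storing tensor data externally" for attributes and subgraphs could look like. It is not the exporter code from this PR: the toy graph layout, the `externalize` helper, the `SIZE_THRESHOLD` value, and the file naming scheme are all hypothetical and chosen only for illustration.

```python
import os
import tempfile

# Hypothetical cutoff; the exporter's real threshold is not stated in this thread.
SIZE_THRESHOLD = 1024  # bytes


def externalize(graph, out_dir, prefix="g"):
    """Move large tensor payloads out of node attributes into files.

    `graph` is a toy structure: {"nodes": [{"attrs": {name: value}}]}.
    A value may be a tensor dict {"raw": bytes}, a nested graph dict
    (a subgraph attribute), or any other plain value.
    Returns the list of file names written, in visiting order.
    """
    written = []
    for i, node in enumerate(graph["nodes"]):
        for name, value in node["attrs"].items():
            if not isinstance(value, dict):
                continue  # plain attribute such as an int; nothing to do
            key = f"{prefix}_{i}_{name}"
            if "nodes" in value:
                # Subgraph attribute: recurse so nested tensors are handled too.
                written += externalize(value, out_dir, prefix=key)
            elif "raw" in value and len(value["raw"]) > SIZE_THRESHOLD:
                with open(os.path.join(out_dir, key), "wb") as f:
                    f.write(value["raw"])
                # Replace the inline bytes with a reference to the file.
                value["external"] = key
                value["length"] = len(value["raw"])
                del value["raw"]
                written.append(key)
    return written


# Toy model: one large attribute tensor, one small one, and one tensor
# nested inside a subgraph attribute.
inner = {"nodes": [{"attrs": {"value": {"raw": b"\x02" * 2048}}}]}
model = {"nodes": [
    {"attrs": {"value": {"raw": b"\x00" * 4096}, "axis": 1}},
    {"attrs": {"body": inner, "bias": {"raw": b"\x01" * 16}}},
]}

with tempfile.TemporaryDirectory() as out_dir:
    print(externalize(model, out_dir))
```

The point of the recursion is that a size check applied only to top-level tensors would miss data held in node attributes and in nested graphs, which is the gap the description above refers to.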

BowenBao requested a review from apaszke as a code owner May 20, 2020

dr-ci bot commented May 20, 2020

💊 CI failures summary and remediations

As of commit 4b0a1a1 (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_build (1/3)

Step: "Build"

C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C3861: 'runtime_error': identifier not found

MM -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\queue\blobs_queue_db.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\queue\blobs_queue_db.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

X2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_fp16_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_fp16_pack_op.cc 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/quantization/server/fbgemm_fp16_pack_op.cc.obj  
X2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_fp16_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_fp16_pack_op.cc 
C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C2039: 'runtime_error': is not a member of 'std'
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.26.28801\include\string(24): note: see declaration of 'std'
C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C3861: 'runtime_error': identifier not found
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

 -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_pack_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

ION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fully_connected_dnnlowp_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fully_connected_dnnlowp_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

See CircleCI build pytorch_windows_vs2019_py36_cpu_build (2/3)

Step: "Build"

C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C3861: 'runtime_error': identifier not found

X2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fully_connected_fake_lowp_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fully_connected_fake_lowp_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

FINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_fp16_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_fp16_pack_op.cc 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/quantization/server/fbgemm_fp16_pack_op.cc.obj  
FINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fbgemm_fp16_pack_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fbgemm_fp16_pack_op.cc 
C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C2039: 'runtime_error': is not a member of 'std'
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.26.28801\include\string(24): note: see declaration of 'std'
C:\Users\circleci\project\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C3861: 'runtime_error': identifier not found
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

experimental -DNDEBUG -DUSE_FBGEMM -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\sgd\wngrad_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\sgd\wngrad_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

E_AVX2_CPU_DEFINITION /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHsc /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\quantization\server\fully_connected_dnnlowp_op.cc.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\caffe2\quantization\server\fully_connected_dnnlowp_op.cc 
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (3/3)

Step: "Run tests"

May 27 18:35:40 ERROR: test_accurracy (__main__.TrainMnist)
May 27 18:35:37 Core 0 got rendezvous! 
May 27 18:35:37 + python3 /var/lib/jenkins/workspace/xla/test/test_mp_save.py 
May 27 18:35:38 2020-05-27 18:35:38.281900: W tensorflow/compiler/jit/xla_device.cc:398] XLA_GPU and XLA_CPU devices are deprecated and will be removed in subsequent releases. Instead, use either @tf.function(experimental_compile=True) for must-compile semantics, or run with TF_XLA_FLAGS=--tf_xla_auto_jit=2 for auto-clustering best-effort compilation. 
May 27 18:35:38 + python3 /var/lib/jenkins/workspace/xla/test/test_mp_mesh_reduce.py 
May 27 18:35:39 Running MNIST Test 
May 27 18:35:39 + echo 'Running MNIST Test' 
May 27 18:35:39 + python test/test_train_mnist.py --tidy 
May 27 18:35:40  0it [00:00, ?it/s]   0%|          | 0/9912422 [00:00<?, ?it/s]E   0%|          | 8192/9912422 [00:00<07:08, 23098.72it/s] 
May 27 18:35:40  
May 27 18:35:40 ====================================================================== 
May 27 18:35:40 ERROR: test_accurracy (__main__.TrainMnist) 
May 27 18:35:40 ---------------------------------------------------------------------- 
May 27 18:35:40 Traceback (most recent call last): 
May 27 18:35:40   File "test/test_train_mnist.py", line 186, in test_accurracy 
May 27 18:35:40     self.assertGreaterEqual(train_mnist(), FLAGS.target_accuracy) 
May 27 18:35:40   File "test/test_train_mnist.py", line 74, in train_mnist 
May 27 18:35:40     transforms.Normalize((0.1307,), (0.3081,))])) 
May 27 18:35:40   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py", line 71, in __init__ 
May 27 18:35:40     self.download() 
May 27 18:35:40   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py", line 138, in download 
May 27 18:35:40     download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5) 

This comment was automatically generated by Dr. CI.

This comment has been revised 24 times.

ailzhang requested a review from lara-hdr May 21, 2020
ailzhang added the triaged label May 21, 2020
BowenBao force-pushed the BowenBao:onnx_large_attr_subgraph branch from 0680f40 to 78c26fd May 21, 2020

neginraoof left a comment

LGTM, thanks!
About tests: since these tests won't fail even without your changes, is there a way to check external data export or files, maybe in test_utility?
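
A check of this kind could be sketched as below. This is only an illustration: `external_data_files` is a hypothetical helper, not an existing function in the test suite, and it assumes the export writes external tensor files into the same directory as the model file.

```python
import os
import tempfile


def external_data_files(model_path):
    """List files sitting next to `model_path`, excluding the model file itself.

    Assumption: external tensor data is written alongside the model file.
    """
    export_dir = os.path.dirname(os.path.abspath(model_path))
    model_name = os.path.basename(model_path)
    return sorted(n for n in os.listdir(export_dir) if n != model_name)


# Stand-in for a real export: write a fake model file and one fake tensor file.
with tempfile.TemporaryDirectory() as export_dir:
    model_path = os.path.join(export_dir, "model.onnx")
    for name in ("model.onnx", "weight"):
        with open(os.path.join(export_dir, name), "wb") as f:
            f.write(b"\x00")
    # A test would fail here if the export had written no external files.
    assert external_data_files(model_path), "expected external data files"
```

In a real test the stand-in block would be replaced by the actual export call, so the assertion fails when large tensors are not written out as separate files.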

BowenBao commented Jun 2, 2020

@houseroad please take a look, thanks!

BowenBao commented Jun 10, 2020

@houseroad please take a look at this PR.

facebook-github-bot left a comment

@houseroad has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

houseroad left a comment

Looks good.

facebook-github-bot commented Jun 22, 2020

@houseroad merged this pull request in eaa9107.
