Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix possible padding length overflow in DistributedSampler #45329

Closed
wants to merge 4 commits into from

Conversation

agemor
Copy link
Contributor

@agemor agemor commented Sep 25, 2020

Fixes #45324

This fix handles cases for len(dataset) * 2 < num_replica in DistributedSampler. (which previous code resulted in error.)

@codecov
Copy link

codecov bot commented Sep 25, 2020

Codecov Report

Merging #45329 into master will decrease coverage by 0.00%.
The diff coverage is 75.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #45329      +/-   ##
==========================================
- Coverage   68.08%   68.08%   -0.01%     
==========================================
  Files         393      393              
  Lines       50960    50963       +3     
==========================================
+ Hits        34698    34699       +1     
- Misses      16262    16264       +2     
Impacted Files Coverage Δ
torch/utils/data/distributed.py 74.46% <75.00%> (-0.54%) ⬇️
torch/testing/_internal/expecttest.py 77.55% <0.00%> (-1.03%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03dde4c...2d622d2. Read the comment docs.

@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Sep 25, 2020
@rohan-varma rohan-varma self-requested a review September 25, 2020 16:53
Copy link
Member

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix! Could we add a unitttest for this to ensure no future regression? The file distributed_test.py has a DistributedSampler test which you can use as an example.

@dr-ci
Copy link

dr-ci bot commented Sep 26, 2020

💊 CI failures summary and remediations

As of commit 2d622d2 (more details on the Dr. CI page):



🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Sep 26 07:06:51 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 26 07:06:51 At: 
Sep 26 07:06:51   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(94): serialize 
Sep 26 07:06:51   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(146): serialize 
Sep 26 07:06:51  
Sep 26 07:06:51 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 26 07:06:51  
Sep 26 07:06:51 At: 
Sep 26 07:06:51   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(94): serialize 
Sep 26 07:06:51   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(146): serialize 
Sep 26 07:06:51  
Sep 26 07:06:51 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 26 07:06:51  
Sep 26 07:06:51 At: 
Sep 26 07:06:51   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(94): serialize 
Sep 26 07:06:51   /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py(146): serialize 
Sep 26 07:06:51  
Sep 26 07:06:51 ok (1.579s) 
Sep 26 07:06:52   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Sep 26 07:06:52 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Sep 26 07:06:52 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Sep 26 07:06:52 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 

See CircleCI build pytorch_python_doc_build (2/3)

Step: "Doc Build and Push" (full log | diagnosis details | 🔁 rerun)

Sep 26 06:03:55 Makefile:38: recipe for target 'html' failed
Sep 26 06:03:54 copying images... [100%] scripts/activation_images/ReLU6.png 
Sep 26 06:03:54  
Sep 26 06:03:54 copying static files... ... done 
Sep 26 06:03:54 copying extra files... done 
Sep 26 06:03:55 dumping search index in English (code: en)... done 
Sep 26 06:03:55 dumping object inventory... done 
Sep 26 06:03:55 build finished with problems, 4 warnings. 
Sep 26 06:03:55 /var/lib/jenkins/workspace/docs/src/pytorch-sphinx-theme/pytorch_sphinx_theme/search.html:21: RemovedInSphinx30Warning: To modify script_files in the theme is deprecated. Please insert a <script> tag directly in your theme instead. 
Sep 26 06:03:55   {% trans %}Please activate JavaScript to enable the search 
Sep 26 06:03:55 make: *** [html] Error 1 
Sep 26 06:03:55 Makefile:38: recipe for target 'html' failed 

See CircleCI build pytorch_doc_test (3/3)

Step: "Doc test" (full log | diagnosis details | 🔁 rerun)

Sep 26 05:52:07 caused by: Connection refused (os error 111)
Sep 26 05:52:07 +++++ extract_trap_cmd 
Sep 26 05:52:07 +++++ printf '%s\n' '' 
Sep 26 05:52:07 ++++ printf '%s\n' cleanup 
Sep 26 05:52:07 +++ trap -- ' 
Sep 26 05:52:07 cleanup' EXIT 
Sep 26 05:52:07 +++ [[ pytorch-linux-xenial-py3.6-gcc5.4-build != *pytorch-win-* ]] 
Sep 26 05:52:07 +++ which sccache 
Sep 26 05:52:07 +++ sccache --stop-server 
Sep 26 05:52:07 Stopping sccache server... 
Sep 26 05:52:07 error: couldn't connect to server 
Sep 26 05:52:07 caused by: Connection refused (os error 111) 
Sep 26 05:52:07 +++ true 
Sep 26 05:52:07 +++ rm /var/lib/jenkins/sccache_error.log 
Sep 26 05:52:07 +++ [[ pytorch-linux-xenial-py3.6-gcc5.4-build == *rocm* ]] 
Sep 26 05:52:07 +++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Sep 26 05:52:07 +++ SCCACHE_IDLE_TIMEOUT=1200 
Sep 26 05:52:07 +++ RUST_LOG=sccache::server=error 
Sep 26 05:52:07 +++ sccache --start-server 
Sep 26 05:52:07 Starting sccache server... 
Sep 26 05:52:08 +++ sccache --zero-stats 
Sep 26 05:52:08 Compile requests                 0 

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Sep 26 06:18:43 CondaHTTPError: HTTP 000 CONNECTION FAILED for url
Sep 26 06:17:22 ++ [[ pytorch-xla-linux-bionic-py3.6-clang9-test == *pytorch-linux-xenial-cuda10.1-cudnn7-py3* ]] 
Sep 26 06:17:22 ++ [[ pytorch-xla-linux-bionic-py3.6-clang9-test == *pytorch-linux-trusty-py3.6-gcc7* ]] 
Sep 26 06:17:22 ++ [[ pytorch-xla-linux-bionic-py3.6-clang9-test == *pytorch_macos* ]] 
Sep 26 06:17:22 ++ BUILD_TEST_LIBTORCH=0 
Sep 26 06:17:22 ++ [[ pytorch-xla-linux-bionic-py3.6-clang9-test == *pytorch-xla-linux-bionic* ]] 
Sep 26 06:17:22 ++ which conda 
Sep 26 06:17:22 /opt/conda/bin/conda 
Sep 26 06:17:22 ++ conda install -q -y cmake 
Sep 26 06:18:43 Collecting package metadata (current_repodata.json): ...working... failed 
Sep 26 06:18:43  
Sep 26 06:18:43 CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/main/noarch/current_repodata.json> 
Sep 26 06:18:43 Elapsed: - 
Sep 26 06:18:43  
Sep 26 06:18:43 An HTTP error occurred when trying to retrieve this URL. 
Sep 26 06:18:43 HTTP errors are often intermittent, and a simple retry will get you on your way. 
Sep 26 06:18:43  
Sep 26 06:18:43 If your current network has https://www.anaconda.com blocked, please file 
Sep 26 06:18:43 a support request with your network engineering team. 
Sep 26 06:18:43  
Sep 26 06:18:43 'https://repo.anaconda.com/pkgs/main/noarch' 
Sep 26 06:18:43  

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 5 times.

@agemor
Copy link
Contributor Author

agemor commented Sep 26, 2020

Sure 👍 I added the unit test assuring that tiny datasets are adequately padded in DistributedSampler .

Copy link
Member

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@rohan-varma merged this pull request in a699108.

@facebook-github-bot
Copy link
Contributor

@rohan-varma merged this pull request in a699108.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Merged open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DistributedSampler internal asserts if len(dataset) * 2 < number of GPUs
5 participants