@allwu allwu commented May 1, 2020

Stack from ghstack:

In some cases we may need to install a custom allocator statically. To ensure it is correctly installed regardless of static initialization order, we add a priority flag in c10::SetAllocator, and only higher priority allocators can overwrite existing ones.

Differential Revision: D21258581

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!

Enable an oversize arena to reduce memory fragmentation. Memory requests with a large size (configurable with FLAGS_caffe2_oversize_threshold) are fulfilled from a dedicated arena, separate from the existing huge page arena.
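
As a rough illustration of the routing described above, the following sketch picks an arena based on the request size. The names `ArenaKind` and `ChooseArena` are hypothetical and do not come from this PR, and treating a request exactly at the threshold as oversized is an assumption of the sketch.

```cpp
#include <cstddef>

// Hypothetical labels for the two arenas mentioned in the description.
enum class ArenaKind { HugePage, Oversize };

// Assumption: a request at or above the threshold counts as oversized.
// The threshold would come from FLAGS_caffe2_oversize_threshold.
constexpr ArenaKind ChooseArena(std::size_t bytes,
                                std::size_t oversize_threshold) {
  return bytes >= oversize_threshold ? ArenaKind::Oversize
                                     : ArenaKind::HugePage;
}

// Compile-time checks with an illustrative 1 MiB threshold.
static_assert(ChooseArena(4096, std::size_t{1} << 20) == ArenaKind::HugePage,
              "small requests stay in the huge page arena");
static_assert(ChooseArena(std::size_t{8} << 20, std::size_t{1} << 20) ==
                  ArenaKind::Oversize,
              "large requests go to the dedicated arena");
```

Keeping large requests in their own arena means they cannot leave holes between the smaller allocations that share the huge page arena, which is the fragmentation this change targets.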

Two additional parameters are introduced to configure the 2-phase decay of the memory arena:
- caffe2_dirty_decay_ms
- caffe2_muzzy_decay_ms

In the current jemalloc implementation, oversized allocations are purged immediately, regardless of whether they are placed in an arena. We therefore need to extend the decay time indefinitely. Currently the default for caffe2_muzzy_decay_ms is set to -1.
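
To make the meaning of the decay values concrete, here is a small sketch that classifies a decay setting. It assumes the two flags are forwarded to jemalloc's `dirty_decay_ms` and `muzzy_decay_ms` options, for which jemalloc documents 0 as "purge immediately" and -1 as "never purge". `DecayMode` and `ClassifyDecay` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Hypothetical classification of a decay time given in milliseconds.
enum class DecayMode { Never, Immediate, Timed, Invalid };

// Assumption: semantics follow jemalloc's decay options:
//   -1 disables purging, 0 purges at once, positive values decay over time.
constexpr DecayMode ClassifyDecay(std::int64_t decay_ms) {
  return decay_ms == -1 ? DecayMode::Never
       : decay_ms == 0  ? DecayMode::Immediate
       : decay_ms > 0   ? DecayMode::Timed
                        : DecayMode::Invalid;
}

// The default chosen in this PR keeps muzzy pages indefinitely.
static_assert(ClassifyDecay(-1) == DecayMode::Never,
              "-1 means pages are never purged");
```

Under that assumption, a default of -1 for caffe2_muzzy_decay_ms means muzzy pages in the arena are retained rather than returned to the operating system.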

We now enable the arena allocator statically. To ensure it is correctly installed regardless of static initialization order, we add a priority flag in c10::SetAllocator, and only higher priority allocators can overwrite existing ones.
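
The priority rule can be sketched as below. This is a minimal, self-contained illustration and not the c10 implementation: `DeviceKind`, `Allocator`, `Slot`, `Registry`, `SetAllocatorSketch`, and `GetAllocatorSketch` are hypothetical stand-ins, and the handling of equal priorities (the existing allocator is kept) is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for illustration; these are not the c10 types.
enum class DeviceKind : std::size_t { CPU = 0, GPU = 1 };
constexpr std::size_t kNumDeviceKinds = 2;

struct Allocator {
  const char* name;
};

struct Slot {
  Allocator* alloc = nullptr;
  std::uint8_t priority = 0;
};

// Function-local static: the table is constructed on first use, so
// registrations made from other static initializers always find it ready.
inline std::array<Slot, kNumDeviceKinds>& Registry() {
  static std::array<Slot, kNumDeviceKinds> slots{};
  return slots;
}

// Installs `alloc` if the slot is empty or `priority` is strictly higher
// than the priority of the allocator already installed.
// Returns true when the allocator was installed.
inline bool SetAllocatorSketch(DeviceKind kind, Allocator* alloc,
                               std::uint8_t priority = 0) {
  assert(alloc != nullptr);
  Slot& slot = Registry()[static_cast<std::size_t>(kind)];
  if (slot.alloc == nullptr || priority > slot.priority) {
    slot.alloc = alloc;
    slot.priority = priority;
    return true;
  }
  return false;
}

inline Allocator* GetAllocatorSketch(DeviceKind kind) {
  return Registry()[static_cast<std::size_t>(kind)].alloc;
}
```

With this rule, an arena allocator registered at priority 1 ends up installed whether its static initializer runs before or after the default allocator's registration at priority 0.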

Differential Revision: [D21258581](https://our.internmc.facebook.com/intern/diff/D21258581/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D21258581/)!

[ghstack-poisoned]
allwu pushed a commit that referenced this pull request May 1, 2020
ghstack-source-id: 103276877
Pull Request resolved: #37640
@allwu allwu changed the title from "Install HugePagesArena to optimize pytorch prediction performance" to "Add priority flag in c10::Allocator" May 1, 2020
dr-ci bot commented May 1, 2020

💊 Build failures summary and remediations

As of commit 6a72ecf (more details on the Dr. CI page):


  • 6/6 failures possibly* introduced in this PR
    • 1/6 non-CircleCI failure(s)

🕵️ 5 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/5)

Step: "Test" (full log | diagnosis details | 🔁 rerun) <confirmed not flaky by 11 failures>

May 01 04:50:08 caused by: Connection refused (os error 111)
May 01 04:50:08 +++ eval 'extract_trap_cmd ' 
May 01 04:50:08 ++++ extract_trap_cmd 
May 01 04:50:08 ++++ printf '%s\n' '' 
May 01 04:50:08 +++ printf '%s\n' cleanup 
May 01 04:50:08 ++ trap -- ' 
May 01 04:50:08 cleanup' EXIT 
May 01 04:50:08 ++ which sccache 
May 01 04:50:08 ++ sccache --stop-server 
May 01 04:50:08 Stopping sccache server... 
May 01 04:50:08 error: couldn't connect to server 
May 01 04:50:08 caused by: Connection refused (os error 111) 
May 01 04:50:08 ++ true 
May 01 04:50:08 ++ rm /var/lib/jenkins/sccache_error.log 
May 01 04:50:08 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
May 01 04:50:08 ++ SCCACHE_IDLE_TIMEOUT=1200 
May 01 04:50:08 ++ RUST_LOG=sccache::server=error 
May 01 04:50:08 ++ sccache --start-server 
May 01 04:50:08 Starting sccache server... 
May 01 04:50:08 ++ sccache --zero-stats 
May 01 04:50:08 Compile requests                 0 
May 01 04:50:08 Compile requests executed        0 

See CircleCI build pytorch_linux_backward_compatibility_check_test (2/5)

Step: "Test" (full log | diagnosis details | 🔁 rerun) <confirmed not flaky by 11 failures>

May 01 04:50:12 caused by: Connection refused (os error 111)
May 01 04:50:12 +++ eval 'extract_trap_cmd ' 
May 01 04:50:12 ++++ extract_trap_cmd 
May 01 04:50:12 ++++ printf '%s\n' '' 
May 01 04:50:12 +++ printf '%s\n' cleanup 
May 01 04:50:12 ++ trap -- ' 
May 01 04:50:12 cleanup' EXIT 
May 01 04:50:12 ++ which sccache 
May 01 04:50:12 ++ sccache --stop-server 
May 01 04:50:12 Stopping sccache server... 
May 01 04:50:12 error: couldn't connect to server 
May 01 04:50:12 caused by: Connection refused (os error 111) 
May 01 04:50:12 ++ true 
May 01 04:50:12 ++ rm /var/lib/jenkins/sccache_error.log 
May 01 04:50:12 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
May 01 04:50:12 ++ SCCACHE_IDLE_TIMEOUT=1200 
May 01 04:50:12 ++ RUST_LOG=sccache::server=error 
May 01 04:50:12 ++ sccache --start-server 
May 01 04:50:12 Starting sccache server... 
May 01 04:50:12 ++ sccache --zero-stats 
May 01 04:50:12 Compile requests                 0 
May 01 04:50:12 Compile requests executed        0 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test (3/5)

Step: "Test" (full log | diagnosis details | 🔁 rerun) <confirmed not flaky by 11 failures>

May 01 04:50:46 caused by: Connection refused (os error 111)
May 01 04:50:46 +++ eval 'extract_trap_cmd ' 
May 01 04:50:46 ++++ extract_trap_cmd 
May 01 04:50:46 ++++ printf '%s\n' '' 
May 01 04:50:46 +++ printf '%s\n' cleanup 
May 01 04:50:46 ++ trap -- ' 
May 01 04:50:46 cleanup' EXIT 
May 01 04:50:46 ++ which sccache 
May 01 04:50:46 ++ sccache --stop-server 
May 01 04:50:46 Stopping sccache server... 
May 01 04:50:46 error: couldn't connect to server 
May 01 04:50:46 caused by: Connection refused (os error 111) 
May 01 04:50:46 ++ true 
May 01 04:50:46 ++ rm /var/lib/jenkins/sccache_error.log 
May 01 04:50:46 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
May 01 04:50:46 ++ SCCACHE_IDLE_TIMEOUT=1200 
May 01 04:50:46 ++ RUST_LOG=sccache::server=error 
May 01 04:50:46 ++ sccache --start-server 
May 01 04:50:46 Starting sccache server... 
May 01 04:50:46 ++ sccache --zero-stats 
May 01 04:50:46 Compile requests                 0 
May 01 04:50:46 Compile requests executed        0 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_legacy_test (4/5)

Step: "Test" (full log | diagnosis details | 🔁 rerun) <confirmed not flaky by 11 failures>

May 01 04:51:25 caused by: Connection refused (os error 111)
May 01 04:51:25 +++ eval 'extract_trap_cmd ' 
May 01 04:51:25 ++++ extract_trap_cmd 
May 01 04:51:25 ++++ printf '%s\n' '' 
May 01 04:51:25 +++ printf '%s\n' cleanup 
May 01 04:51:25 ++ trap -- ' 
May 01 04:51:25 cleanup' EXIT 
May 01 04:51:25 ++ which sccache 
May 01 04:51:25 ++ sccache --stop-server 
May 01 04:51:25 Stopping sccache server... 
May 01 04:51:25 error: couldn't connect to server 
May 01 04:51:25 caused by: Connection refused (os error 111) 
May 01 04:51:25 ++ true 
May 01 04:51:25 ++ rm /var/lib/jenkins/sccache_error.log 
May 01 04:51:25 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
May 01 04:51:25 ++ SCCACHE_IDLE_TIMEOUT=1200 
May 01 04:51:25 ++ RUST_LOG=sccache::server=error 
May 01 04:51:25 ++ sccache --start-server 
May 01 04:51:25 Starting sccache server... 
May 01 04:51:25 ++ sccache --zero-stats 
May 01 04:51:25 Compile requests                 0 
May 01 04:51:25 Compile requests executed        0 

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (5/5)

Step: "Test" (full log | diagnosis details | 🔁 rerun) <confirmed not flaky by 11 failures>

May 01 05:30:56 caused by: Connection refused (os error 111)
May 01 05:30:56 +++ eval 'extract_trap_cmd ' 
May 01 05:30:56 ++++ extract_trap_cmd 
May 01 05:30:56 ++++ printf '%s\n' '' 
May 01 05:30:56 +++ printf '%s\n' cleanup 
May 01 05:30:56 ++ trap -- ' 
May 01 05:30:56 cleanup' EXIT 
May 01 05:30:56 ++ which sccache 
May 01 05:30:56 ++ sccache --stop-server 
May 01 05:30:56 Stopping sccache server... 
May 01 05:30:56 error: couldn't connect to server 
May 01 05:30:56 caused by: Connection refused (os error 111) 
May 01 05:30:56 ++ true 
May 01 05:30:56 ++ rm /var/lib/jenkins/sccache_error.log 
May 01 05:30:56 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
May 01 05:30:56 ++ SCCACHE_IDLE_TIMEOUT=1200 
May 01 05:30:56 ++ RUST_LOG=sccache::server=error 
May 01 05:30:56 ++ sccache --start-server 
May 01 05:30:56 Starting sccache server... 
May 01 05:30:56 ++ sccache --zero-stats 
May 01 05:30:56 Compile requests                 0 
May 01 05:30:56 Compile requests executed        0 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

This comment has been revised 19 times.

@facebook-github-bot
Contributor

This pull request has been merged in f538cd6.

@facebook-github-bot facebook-github-bot deleted the gh/allwu/1/head branch May 10, 2020 14:16