
[c10d][libuv] add partial read test for libuv backend and fix an error which only happens when partially reading a buffer #116141

Closed
wants to merge 4 commits

Conversation

Contributor

@XilunWu XilunWu commented Dec 19, 2023

Stack from ghstack (oldest at bottom):

Test Plan

  1. build pytorch
  2. execute `TORCH_CPP_LOG_LEVEL=INFO build/bin/TCPStoreTest --gtest_filter=TCPStoreTest.testLibUVPartialRead` from the pytorch root directory.

without the change:
<img width="761" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/1942e3c2-a9c1-4fe4-87e8-7e21f4d8f9aa">

with the change:
<img width="747" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/f3e96a5b-0ed1-49bd-9184-bb8a5ebebc33">
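The bug class this test targets can be illustrated with a short sketch (in Python, not the actual C++ libuv backend — all names here are illustrative): when data arrives in arbitrary chunks, a length-prefixed parser must buffer until a full message is available, rather than assume each read callback delivers a whole message.

```python
import struct

class PartialReadBuffer:
    """Accumulates bytes and yields complete length-prefixed messages.

    Mirrors, in spirit, what a libuv read callback must handle: data may
    arrive in arbitrary chunks, so parsing has to wait until a full
    message has been buffered.
    """
    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk: bytes):
        self._buf.extend(chunk)
        messages = []
        while True:
            if len(self._buf) < 4:           # length prefix incomplete
                break
            (length,) = struct.unpack_from("<I", self._buf, 0)
            if len(self._buf) < 4 + length:  # payload incomplete
                break
            messages.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return messages

# A message split across arbitrary chunk boundaries still parses correctly:
payload = b"hello"
wire = struct.pack("<I", len(payload)) + payload
buf = PartialReadBuffer()
out = []
for i in range(len(wire)):       # feed one byte at a time
    out += buf.feed(wire[i:i + 1])
print(out)  # [b'hello']
```

A parser that instead assumed one read per message would fail exactly in the partial-read case this PR fixes.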

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225

…r which only happens when partially reading a buffer

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Dec 19, 2023

pytorch-bot bot commented Dec 19, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116141

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 2730aea with merge base 5ba87a3:

UNSTABLE: The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

XilunWu added a commit that referenced this pull request Dec 19, 2023
…r which only happens when partially reading a buffer

ghstack-source-id: 75197e797234bad767eb53290ff2b598d8840fc1
Pull Request resolved: #116141
@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Dec 19, 2023
…fix an error which only happens when partially reading a buffer"
XilunWu added a commit that referenced this pull request Dec 19, 2023
…r which only happens when partially reading a buffer

ghstack-source-id: 861efed0a47ca8e8a6dd6a10b9c82009ccd057a5
Pull Request resolved: #116141
…fix an error which only happens when partially reading a buffer"
XilunWu added a commit that referenced this pull request Dec 20, 2023
…r which only happens when partially reading a buffer

ghstack-source-id: c88cb2e6fc431a1f00e1b826614eafd7fc0564c9
Pull Request resolved: #116141
@XilunWu XilunWu added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request labels Dec 20, 2023
@@ -396,6 +396,18 @@ void TCPStore::validate(void) {
buffer.flush();
}

void TCPStore::_splitSet(
Contributor

Can we actually make this into a test by inheriting TCPStore and adding the method to the child class, instead of adding it to TCPStore itself?

Contributor Author

IMO that would require a refactor of TCPStore, because the TCPClient object is a private member of TCPStore and can only be accessed from the TCPStore object.

Contributor

@wconstab wconstab Dec 20, 2023

I think @fduwjj means: can you make a TestTCPStore class in TcpStoreTest.cpp and just implement the extra method on that?

(Agreed, it's better not to put this code into the real TCPStore file.)

Contributor Author

I tried that, but the subclass cannot access the private members of the superclass, so the _splitSet function in the subclass won't be able to use the buffer.

Contributor

OK. You could probably change the base class members to protected, add your test class as a friend class, or just copy some code into the test class.

Or maybe you don't even need a TCPStore at all: just create a raw TCPClient object in your test case, and then write the 'split_set' as a standalone helper function that uses the TCPClient?
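As a rough illustration of that suggestion, here is a Python analogue of such a standalone 'split set' helper: it writes a request to a raw socket in small chunks so the peer is forced down its partial-read path. The helper name and framing here are assumptions for illustration, not the real C++ API.

```python
import socket
import struct
import time

def split_send(sock, data: bytes, chunk_size: int = 1, delay: float = 0.0):
    """Send `data` in small chunks to exercise the peer's partial-read
    handling (illustrative analogue of a standalone split-set helper)."""
    for i in range(0, len(data), chunk_size):
        sock.sendall(data[i:i + chunk_size])
        if delay:
            time.sleep(delay)

# Demo with a socketpair: the receiver still sees the message reassembled.
a, b = socket.socketpair()
msg = struct.pack("<I", 5) + b"value"   # toy length-prefixed request
split_send(a, msg, chunk_size=2)
a.close()
received = b""
while True:
    part = b.recv(4096)
    if not part:
        break
    received += part
b.close()
print(received == msg)  # True
```

The point of the helper is purely to control chunk boundaries on the wire; a correct server must produce the same result regardless of how the bytes are split.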

Contributor

OK. I was thinking about suggesting that you reverse the order: land the next refactor PR first, then land this one. But I think it's worth landing the fix now, and it's not that bad to add the _split_set, as long as you promise to clean it up soon.

Contributor Author

@wconstab @fduwjj for the follow-up, there are several options I can come up with; let me know your opinions:

  1. In the test's TCPStore client thread, do not use any TCPStore, SendBuffer, or TCPClient, because they either cannot be used outside of TCPStore.cpp or encapsulate private members from users. Instead, use the Socket to perform exactly what those classes do. The downside is that this makes the test quite verbose (but we can hide the details in test util files).

  2. Change some members to protected and make the test class a friend class. The downside is that I don't know the consequences of making the underlying networking components protected. Could there be any security issue? Who would be the expert to ask?

  3. Refactor the structure of TCPStore.hpp and TCPStoreBackend.hpp. Currently, TCPClient and SendBuffer can only have an incomplete declaration in TCPStore.hpp, because QueryType is defined in TCPStoreBackend.hpp and TCPStoreBackend.hpp includes TCPStore.hpp. This forbids them from being exposed outside of TCPStore.cpp. The question is, do we want to refactor them? Is there any benefit? A single test may not be a good justification.

Contributor

I think #2 should be fine? What security issue can you think of?

Contributor Author

If another class A inherits from TCPStore, class A or an instance of A can access a TCPStore's buffer and read/write it. I assume the original design decision (making these members private) was to avoid exactly that.

Contributor

Well, what's the problem with that? If it's a child of TCPStore, then it's OK for it to access the buffer?

c10d::test::check(*clientTCPStore, key, "v");
}
});

Contributor

Do you also need serverTCPStore = nullptr; here?

Contributor Author

That would cause an issue: if we reset it here, the server could be destructed before the client completes its execution.
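The lifetime-ordering concern in this thread can be sketched as follows (a Python analogue; `FakeServer` is a stand-in, not the real TCPStore): the server object must stay alive until the client thread has joined, and only then be torn down.

```python
import threading

class FakeServer:
    """Stand-in for a server whose destruction must not race the client."""
    def __init__(self):
        self.alive = True

    def close(self):
        self.alive = False

server = FakeServer()
results = []

def client():
    # If the server were torn down before this runs, this check would fail.
    results.append(server.alive)

t = threading.Thread(target=client)
t.start()
t.join()        # wait for the client to finish first...
server.close()  # ...then tear down the server
print(results)  # [True]
```

Releasing the server before the join would reintroduce exactly the race described above; letting it go out of scope after the join (as the test does via unique_ptr) is equivalent to an explicit reset at that point.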

Contributor

Then maybe add it after the clientThread join?

Contributor Author

When the unique_ptr object serverTCPStore goes out of scope (in this case, at the end of the test), the object is destroyed (same as serverTCPStore = nullptr) and the TCPStore is freed. So adding serverTCPStore = nullptr would be the same as the current code.

Contributor

I guess my point was to set it to null after the join; by the time of the join, all clients should have finished. But since this is a unique_ptr, it should be fine.

…fix an error which only happens when partially reading a buffer"
XilunWu added a commit that referenced this pull request Dec 20, 2023
…r which only happens when partially reading a buffer

ghstack-source-id: 6067f62bfdc8d94411f42758102a612d97dd4994
Pull Request resolved: #116141
Contributor Author

XilunWu commented Dec 20, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

dmenig pushed a commit to dmenig/pytorch that referenced this pull request Dec 21, 2023
…r which only happens when partially reading a buffer (pytorch#116141)

Pull Request resolved: pytorch#116141
Approved by: https://github.com/wconstab
dmenig pushed a commit to dmenig/pytorch that referenced this pull request Dec 21, 2023
…r which only happens when partially reading a buffer (pytorch#116141)
ZhiweiYan-96 pushed a commit to ZhiweiYan-96/pytorch that referenced this pull request Dec 22, 2023
…r which only happens when partially reading a buffer (pytorch#116141)
@facebook-github-bot facebook-github-bot deleted the gh/XilunWu/54/head branch December 24, 2023 15:23
pytorchmergebot pushed a commit that referenced this pull request Jun 25, 2024
We've been facing issues where TCPStore can successfully connect but then fail in the validate() function, due to connection resets from listen backlog queue overflow (when combined with reset enabled) as well as long init times.

This PR does a few things:
* Retry that connect and validate up to the specified timeout.
* Use exponential backoff for the retry logic with jitter instead of a fixed 1s sleep.
* Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu #116141
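The retry schedule described above can be sketched as follows; the function, parameter names, and values are illustrative, not the actual c10d backoff settings.

```python
import random

def backoff_delays(base=0.05, factor=2.0, max_delay=1.0,
                   jitter=0.5, attempts=6, rng=random.random):
    """Exponential backoff with jitter: each retry waits roughly
    `factor` times longer than the last, capped at `max_delay`,
    plus a random fraction (up to `jitter`) to avoid thundering herds."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay * (1.0 + jitter * rng()))
        delay = min(delay * factor, max_delay)
    return delays

# With a deterministic rng the base schedule is easy to inspect:
print(backoff_delays(rng=lambda: 0.0))
# [0.05, 0.1, 0.2, 0.4, 0.8, 1.0]
```

Compared with a fixed 1 s sleep, early retries are cheap and the jitter spreads out reconnect storms when thousands of workers hit the store at once.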

Test plan:

```
python test/distributed/test_store.py -v
./build/bin/BackoffTest
```

Will do internal testing with some large scale jobs to ensure TCPStore works correctly.

At 4k scale: 4x improvement

```
tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py                                                                                                   (pytorch-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    1.98 secs    fish           external
   usr time    0.93 secs   91.00 micros    0.93 secs
   sys time    1.98 secs  954.00 micros    1.97 secs

tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10                                                                                                                                              (pytorch-3.10)
tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py                                                                                                          (torchdrive-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    8.20 secs    fish           external
   usr time    2.15 secs    0.00 micros    2.15 secs
   sys time    2.76 secs  843.00 micros    2.76 secs
```

```py
import time
from multiprocessing import Pool

WORLD_SIZE = 10000

import torch.distributed as dist

def run(rank):
    should_log = rank % (WORLD_SIZE // 10) == 0
    if should_log:
        print(f"started {rank}")
    store = dist.TCPStore(
        host_name="devvm4382.nao0.facebook.com",
        port=29500,
        world_size=WORLD_SIZE,
        is_master=rank == 0,
        use_libuv=True,
    )
    if should_log:
        print(f"init {rank}")
    store.set(f"key{rank}", "1234")
    if should_log:
        print(f"set {rank}")
    del store

def noop(rank):
    pass

print("starting pool")
with Pool(WORLD_SIZE) as pool:
    pool.map(noop, range(WORLD_SIZE), 1)
    print("pool hot")
    start = time.time()
    pool.map(run, range(WORLD_SIZE), 1)
    print("run finished", time.time()-start)
```

```
tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py                                                                                                                                (pytorch-3.10)
starting pool
pool hot
started 0
[W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
started 1000
init 1000
set 1000
started 2000
init 2000
set 2000
started 3000
init 3000
set 3000
started 4000
init 4000
set 4000
started 5000
init 5000
set 5000
started 6000
init 6000
set 6000
started 7000
init 7000
set 7000
started 8000
init 8000
set 8000
started 9000
init 9000
set 9000
init 0
set 0
run finished 0.705092191696167
```

Pull Request resolved: #129261
Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o
Labels: ciflow/periodic, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d)

4 participants