delete has no more data after the key #53886
Conversation
💊 CI failures summary (automatically generated by Dr. CI): as of commit 08e0964, 💚 looks good so far; there are no failures yet.
Force-pushed from 3076c34 to 3fcb5eb.
Codecov Report

@@            Coverage Diff             @@
##           master   #53886      +/-   ##
==========================================
- Coverage   77.30%   77.30%   -0.01%
==========================================
  Files        1888     1888
  Lines      183589   183589
==========================================
- Hits       141925   141917       -8
- Misses      41664    41672       +8
Awesome fix and good catch @t-vi! I am a bit wary about adding the new test case test_delkey_perf, since testing a perf change using multiprocessing can introduce flakiness / false signals in our CI.
Since TCPStore is used internally for storing membership information for nodes in distributed training, deletes are infrequent and the time taken is small compared to the actual training process.
test/distributed/test_c10d.py
Outdated
store = self._create_store()
keys = [str(i) for i in range(10)]
t0 = time.perf_counter()
for k in keys:
    store.set(k, k)
dur_set = time.perf_counter() - t0  # elapsed duration, not a raw timestamp
t0 = time.perf_counter()
for k in keys:
    store.delete_key(k)
dur_delete = time.perf_counter() - t0
print(dur_set, dur_delete)
I would keep this part if you can add an assert comparing the durations and it is consistent. I don't think we need the tests below.
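A duration-comparison assert of the kind suggested here might look like the sketch below. The FakeStore class is a hypothetical in-memory stand-in for TCPStore so the pattern is self-contained, and the 100x factor is an assumed, deliberately generous bound rather than a measured one.

```python
import time

class FakeStore:
    """Hypothetical stand-in for TCPStore; only here to make the sketch runnable."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def delete_key(self, key):
        self._data.pop(key, None)

def set_and_delete_durations(store, keys):
    # Time the set phase and the delete phase separately.
    t0 = time.perf_counter()
    for k in keys:
        store.set(k, k)
    dur_set = time.perf_counter() - t0
    t0 = time.perf_counter()
    for k in keys:
        store.delete_key(k)
    dur_delete = time.perf_counter() - t0
    return dur_set, dur_delete

def check_delete_not_pathological(store, keys, factor=100):
    # A very generous relative bound: a regression like the one fixed here
    # (delete blocking on data that never arrives) would blow far past it,
    # while normal timing jitter should not.
    dur_set, dur_delete = set_and_delete_durations(store, keys)
    assert dur_delete < factor * max(dur_set, 1e-4), (dur_set, dur_delete)
```

The floor of 1e-4 seconds on the denominator keeps the bound meaningful even when the set phase is too fast to measure reliably.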
Oh, sorry, I didn't mean to have that in there. This part doesn't actually check anything useful related to the fix.
looks like this wasn't updated
@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I think the part of the test you are concerned about is exactly the part we want to test, and it only happens when you actually have multiple processes. Note that the timing bound is exceedingly generous, so I am optimistic that it's not too flaky. Of course, if we had a better place to put a basic perf test, that would be better. The main alternative I see is to have no test at all, if we conclude that testing timings across multiple processes is too flaky.
Under the hood, TCPStore is single-threaded: there is a thread on the master that sequentially accepts query types (e.g. SET, GET, DELETE) and handles them, which is why testing the perf under multiprocessing does not add much benefit compared to a single process. You are correct about the timing bound, though; it is pretty generous 🙂
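The sequential handling described in this comment can be modeled in a few lines. The loop below is a toy illustration of "one master thread handles queries strictly in order"; it is not the actual TCPStore implementation.

```python
def handle_queries(requests):
    """Toy model of a single-threaded store master: each query is a
    (query_type, key, value) tuple, handled strictly in arrival order."""
    data, replies = {}, []
    for qtype, key, value in requests:
        if qtype == "SET":
            data[key] = value
            replies.append(None)
        elif qtype == "GET":
            replies.append(data.get(key))
        elif qtype == "DELETE":
            # Reply with whether the key existed before deletion.
            replies.append(data.pop(key, None) is not None)
    return data, replies
```

Because every query goes through this one loop, the throughput seen by a single client is the same as the throughput seen by many, which is the point being made about multiprocess perf tests.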
The bug was in the communication, not the backend.
Ah, good point, my misunderstanding. Code LGTM then! If you remove the extra part under …
The TCPStore delete key implementation inadvertently set "moreData" when sending the key, when it was in fact the last message. Thank you, @PetrochukM, for the reproducing example. Fixes pytorch#53872
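To make the failure mode concrete, here is a sketch of a length-prefixed protocol with a "more data follows" flag. The frame layout below is invented for illustration and is not TCPStore's actual wire format; the point is that the key is the last frame of a delete request, so its flag must be cleared, and setting it leaves the receiver waiting for bytes that never arrive.

```python
import struct

def frame(payload: bytes, more_data: bool) -> bytes:
    # Illustrative layout: 1-byte "more data" flag, 4-byte length, payload.
    return struct.pack(">BI", 1 if more_data else 0, len(payload)) + payload

def encode_delete(key: bytes) -> bytes:
    # The query-type frame legitimately says more data (the key) follows;
    # the key frame, being the last message, must NOT set the flag.
    # Setting it there is the class of bug this PR fixes.
    return frame(b"DELETE", more_data=True) + frame(key, more_data=False)

def decode_frames(buf: bytes):
    frames = []
    while buf:
        flag, length = struct.unpack(">BI", buf[:5])
        frames.append((bool(flag), buf[5:5 + length]))
        buf = buf[5 + length:]
    return frames
```

A receiver driven by these flags stops reading after the first frame whose flag is False, which is why a wrongly set flag manifests as a hang rather than an error.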
Failure on macOS (https://app.circleci.com/pipelines/github/pytorch/pytorch/284617/workflows/812c7265-5c1b-4ecf-becf-fb119cd226b8/jobs/11504465). I think this is intermittent; I didn't see it in the earlier run.
That must be the flakiness. :/ Do we have hope that it is only macOS, or do we think this type of test is just touchy?
Thanks for the quick and great fix! In general I'm pretty unsure about adding a unit test that validates performance, since unit testing is mostly about correctness and we generally have benchmarks for performance. I'll defer the discussion to folks who've owned/worked on Store more, though.
test/distributed/test_c10d.py
Outdated
processes = []
num_processes = world_size
for i in range(num_processes):
    p = mp.Process(target=self._test_delkey_perf_worker, args=(i, addr, port, world_size, messages))
It would be preferable for all distributed tests needing multiprocessing to inherit from MultiProcessTestCase, as that class has had significant work to ensure proper error handling, failure reporting, etc. If we do add this test, could we add a class that inherits from MultiProcessTestCase and runs this?
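For context, the spawn/join/exit-code plumbing that such a shared base class centralizes looks roughly like this minimal stand-in. This is not MultiProcessTestCase's actual API, only a sketch of the pattern it encapsulates.

```python
import multiprocessing as mp

def _worker(rank, world_size, conn):
    # Per-rank test body; a real harness would run the store operations
    # here and report failures through the process exit code.
    try:
        conn.send((rank, "ok"))
    finally:
        conn.close()

def spawn_and_join(world_size):
    """Minimal stand-in for the spawn/join/error-reporting that a shared
    multiprocess test base class provides (NOT the real MultiProcessTestCase)."""
    procs = []
    for rank in range(world_size):
        parent, child = mp.Pipe()
        p = mp.Process(target=_worker, args=(rank, world_size, child))
        p.start()
        procs.append((p, parent))
    results = []
    for p, parent in procs:
        results.append(parent.recv())
        p.join()
        if p.exitcode != 0:
            raise RuntimeError(f"rank process exited with code {p.exitcode}")
    return sorted(results)
```

Centralizing this in one base class is exactly why the review asks new tests to inherit from it instead of hand-rolling the loop, as hand-rolled versions tend to miss failure propagation.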
Yeah, so I thought copying the test from within the case would be reasonably safe but I lack the expertise. I do agree that this isn't the right thing to do here - and it's demonstrated by the present failure.
test/distributed/test_c10d.py
Outdated
store.delete_key(k)
dur_delete = (time.perf_counter() - start) * 1000

if dur_delete > 5 * max(dur_wait, dur_store):
This seems quite prone to flakiness, at the least could we make it a much more generous threshold?
test/distributed/test_c10d.py
Outdated
)
sys.exit(MultiProcessTestCase.TEST_ERROR_EXIT_CODE)

time.sleep(0.1)
Generally we try to avoid time.sleep() in tests, as it has previously been the source of a lot of nondeterminism and flakiness.
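One common replacement (a sketch, not code from this PR) is to block on an explicit completion signal with a timeout instead of sleeping for a guessed interval, so a late peer fails loudly rather than racing nondeterministically:

```python
import threading

def run_with_signal(timeout=5.0):
    done = threading.Event()
    result = {}

    def worker():
        result["value"] = 42  # stands in for the operation the test waits on
        done.set()            # signal completion explicitly instead of
                              # letting the main thread sleep and hope

    t = threading.Thread(target=worker)
    t.start()
    # Fails with a clear message on timeout instead of a flaky race.
    assert done.wait(timeout), "worker did not finish within timeout"
    t.join()
    return result["value"]
```

The same idea applies across processes via a store barrier or a pipe; the key property is that the wait is bounded and event-driven, not a fixed sleep.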
I agree with Rohan's comments. Regarding whether macOS is the only platform where we've seen these flaky tests: there have been multiple different CIs where we have seen multiprocessing cause issues, and we are currently investigating them. Adding to our cpp tests (TCPStoreTest) may also be more stable, but that would involve a bit more work. For now, I think we should just leave out the perf test and get the fix itself checked in.
OK, so based on the feedback, I'll remove the test for now. If we have a place for benchmarks, I'd be glad to add it there in a separate PR.
@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The TCPStore delete key implementation inadvertently set "moreData" when sending the key, when it was in fact the last message.
Thank you, @PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).
Fixes #53872