delete has no more data after the key #53886

t-vi · 2021-03-12T09:17:32Z

The tcpstore delete key implementation inadvertently set "moreData" when sending the key when it was in fact the last message.

Thank you, @PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).

Fixes #53872
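
To illustrate the description above, here is a minimal toy model (not the actual TCPStore code) of why flagging the final message as having more data can stall a request. `CorkingSender`, `send_delete`, the `more_data` parameter and the one-byte command layout are all hypothetical. The sketch assumes a transport that holds bytes back while more data is announced, similar in spirit to the `MSG_MORE` flag of `send` on Linux.

```python
class CorkingSender:
    """Toy transport (hypothetical): holds bytes back while more data is announced."""

    def __init__(self):
        self.pending = b""  # bytes held back, not yet transmitted
        self.wire = []      # packets that were actually transmitted

    def send(self, payload: bytes, more_data: bool = False) -> None:
        self.pending += payload
        if not more_data:
            # Last piece of the request: flush everything as one packet.
            self.wire.append(self.pending)
            self.pending = b""


def send_delete(sender: CorkingSender, key: bytes, key_has_more_data: bool) -> None:
    # Hypothetical request layout: a command byte followed by the key.
    sender.send(b"D", more_data=True)
    sender.send(key, more_data=key_has_more_data)


# Buggy variant: the key is flagged as if more data followed it.
buggy = CorkingSender()
send_delete(buggy, b"key0", key_has_more_data=True)

# Fixed variant: the key is the last message, so nothing is held back.
fixed = CorkingSender()
send_delete(fixed, b"key0", key_has_more_data=False)
```

In this toy model the buggy variant leaves the whole request sitting in `pending`, while the fixed variant transmits it immediately. A real network stack would typically flush held-back data after some delay rather than never, which would show up as slow deletes rather than lost ones.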

facebook-github-bot · 2021-03-12T09:17:39Z

💊 CI failures summary and remediations

As of commit 08e0964 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

codecov · 2021-03-12T13:29:35Z

Codecov Report

Merging #53886 (08e0964) into master (1772e26) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #53886      +/-   ##
==========================================
- Coverage   77.30%   77.30%   -0.01%     
==========================================
  Files        1888     1888              
  Lines      183589   183589              
==========================================
- Hits       141925   141917       -8     
- Misses      41664    41672       +8

H-Huang

Awesome fix and good catch @t-vi! I am a bit wary about adding that new test case test_delkey_perf since testing a perf change using multiprocessing can introduce flakiness / false signals in our CI.

Since TcpStore is used internally for storing membership information for nodes in distributed training, deletes are not frequent and time taken is small when compared to the actual training process.

H-Huang · 2021-03-12T15:35:12Z

test/distributed/test_c10d.py

+        store = self._create_store()
+        keys = [str(i) for i in range(10)]
+        t0 = time.perf_counter()
+        [store.set(k, k) for k in keys]
+        dur_set = time.perf_counter()
+        t0 = time.perf_counter()
+        [store.delete_key(k) for k in keys]
+        dur_delete = time.perf_counter()
+        print(dur_set, dur_delete)


I would keep this part if you are able to add an assert when comparing the duration and it is consistent. I don't think we need the tests below

Oh, sorry, I didn't mean to have that in there. This part doesn't actually check anything useful related to the fix.

looks like this wasn't updated
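
For reference, a minimal sketch of how the timing in the hunk above could be measured and asserted on. It assumes a dict-backed stand-in (`FakeStore`, hypothetical) instead of a real store so that it runs without `torch`. Each duration is taken as the difference of two `time.perf_counter()` readings. The factor and the floor in the bound are illustrative assumptions, not values from the PR.

```python
import time


class FakeStore:
    """Hypothetical in-memory stand-in exposing set/delete_key/num_keys."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def delete_key(self, key):
        # Returns True if the key existed and was removed.
        return self._data.pop(key, None) is not None

    def num_keys(self):
        return len(self._data)


def time_set_and_delete(store, keys):
    """Return (seconds spent setting, seconds spent deleting) for the given keys."""
    t0 = time.perf_counter()
    for k in keys:
        store.set(k, k)
    dur_set = time.perf_counter() - t0

    t0 = time.perf_counter()
    for k in keys:
        store.delete_key(k)
    dur_delete = time.perf_counter() - t0
    return dur_set, dur_delete


store = FakeStore()
dur_set, dur_delete = time_set_and_delete(store, [str(i) for i in range(10)])

# Generous, illustrative bound: deletes should not be orders of magnitude
# slower than sets; the additive floor guards against timer noise.
delete_is_reasonable = dur_delete <= 100 * dur_set + 0.1
```

With a real store the same helper could be reused, although any timing-based assertion remains sensitive to load on the CI machine.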

facebook-github-bot

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

t-vi · 2021-03-12T16:05:47Z

I think the part of the test you are concerned about is the part we want to test, which only happens when you actually have multiple processes. Note that the timing bound is exceedingly generous, so I would be optimistic that it's not too flaky. Of course, if we had a better place to put a basic perf test, that would be better. The main alternative I see is to not have a test if we conclude that testing timings in multi-process is too flaky.

H-Huang · 2021-03-12T16:15:12Z

Under the hood, TCPStore is single threaded. There is a thread on the master which sequentially accepts query types (e.g. SET, GET, DELETE) and handles them, which is why testing the perf under multiprocessing does not add much benefit compared to a single process. You are correct that the timing bound is pretty generous, though 🙂

t-vi · 2021-03-12T16:18:26Z

The bug was in the communication, not the backend.

H-Huang · 2021-03-12T16:43:17Z

Ah, good point, my misunderstanding. Code LGTM then! If you remove the extra part under _test_numkeys_delkeys, I will merge it ASAP.

H-Huang · 2021-03-12T20:03:33Z

Failure on macOS (https://app.circleci.com/pipelines/github/pytorch/pytorch/284617/workflows/812c7265-5c1b-4ecf-becf-fb119cd226b8/jobs/11504465). I think this is intermittent; I didn't see it in the earlier run.

t-vi · 2021-03-12T20:09:20Z

That must be the flakiness. :/ Do we have hope that it is only mac or do we think this type of test is just touchy?

rohan-varma

Thanks for the quick and great fix! In general I'm pretty unsure about adding a unittest that validates performance, as unittesting is mostly about correctness and we generally have benchmarks for performance. I'll defer the discussion to folks who've owned/worked on Store more, though.

rohan-varma · 2021-03-12T21:51:21Z

test/distributed/test_c10d.py

+        processes = []
+        num_processes = world_size
+        for i in range(num_processes):
+            p = mp.Process(target=self._test_delkey_perf_worker, args=(i, addr, port, world_size, messages))


It would be preferable for all distributed tests needing multiprocessing to inherit from MultiProcessTestCase, as that class has had significant work to ensure proper error handling, failure reporting, etc. If we do add this test, could we add a class that inherits from MultiProcessTestCase and runs this?

Yeah, so I thought copying the test from within the case would be reasonably safe but I lack the expertise. I do agree that this isn't the right thing to do here - and it's demonstrated by the present failure.

rohan-varma · 2021-03-12T21:53:40Z

test/distributed/test_c10d.py

+                        store.delete_key(k)
+                    dur_delete = (time.perf_counter() - start) * 1000
+
+                    if dur_delete > 5 * max(dur_wait, dur_store):


This seems quite prone to flakiness, at the least could we make it a much more generous threshold?

rohan-varma · 2021-03-12T21:54:02Z

test/distributed/test_c10d.py

+                        )
+                        sys.exit(MultiProcessTestCase.TEST_ERROR_EXIT_CODE)
+
+                    time.sleep(0.1)


Generally we try to avoid time.sleep() in tests as it has previously been the source of a lot of nondeterminism and flakiness.
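
One common alternative to a fixed `time.sleep()` is to wait on an explicit synchronization primitive. Below is a minimal sketch using `threading.Event`. The worker and all names are illustrative and not taken from the PR. A multi-process test would need a cross-process equivalent such as `multiprocessing.Event`.

```python
import threading

ready = threading.Event()
results = []


def worker():
    results.append("key written")  # stand-in for the work being waited on
    ready.set()                    # signal completion explicitly


t = threading.Thread(target=worker)
t.start()

# Block until the worker signals; the timeout is only a safety net,
# not a delay that is always paid like a fixed sleep.
signalled = ready.wait(timeout=5.0)
t.join()
```

The waiting side proceeds as soon as the event is set, so the test neither races the worker nor waits longer than necessary.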

H-Huang · 2021-03-12T23:54:46Z

I agree with Rohan's comments. Regarding whether mac is the only platform where we've seen these flaky tests: multiprocessing has caused issues on multiple different CIs, and we are currently investigating them. Adding the test to our cpp tests (TCPStoreTest) may also be more stable, but that would involve a bit more work. For now, I think we should just leave out the perf test and get the fix itself checked in.

t-vi · 2021-03-13T09:30:35Z

OK, so based on the feedback, I'll remove the test for now. If we have a place for benchmarks, I'd be glad to add it there in a separate PR.

facebook-github-bot

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-03-15T15:47:39Z

@H-Huang merged this pull request in 8734e88.

t-vi requested review from H-Huang, mingzhe09088, mrshenli, pritamdamania87, rohan-varma and zhaojuanmao as code owners March 12, 2021 09:17

facebook-github-bot added the oncall: distributed and cla signed labels Mar 12, 2021

t-vi force-pushed the tcpstore_del_perf branch from 61b4a9f to c37714d Compare March 12, 2021 09:30

pytorchbot added the open source label Mar 12, 2021

t-vi force-pushed the tcpstore_del_perf branch 2 times, most recently from 3076c34 to 3fcb5eb Compare March 12, 2021 10:09

H-Huang reviewed Mar 12, 2021

facebook-github-bot reviewed Mar 12, 2021

delete has no more data after the key

81870e6

The tcpstore delete key implementation inadvertently set "moreData" when sending the key when it was in fact the last message. Thank you, @PetrochukM, for the reproducing example. Fixes pytorch#53872

t-vi force-pushed the tcpstore_del_perf branch from 3fcb5eb to 81870e6 Compare March 12, 2021 17:23

PetrochukM mentioned this pull request Mar 12, 2021

[v1.8.1] Release Tracker #53572

Closed

rohan-varma reviewed Mar 12, 2021

don't add test for now

08e0964

facebook-github-bot reviewed Mar 14, 2021

H-Huang approved these changes Mar 14, 2021

facebook-github-bot closed this in 8734e88 Mar 15, 2021

facebook-github-bot added the Merged label Mar 15, 2021
