Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

delete has no more data after the key #53886

Closed
wants to merge 2 commits into from

Conversation

t-vi
Copy link
Collaborator

@t-vi t-vi commented Mar 12, 2021

The tcpstore delete key implementation inadvertendly set "moreData" when sending the key when it was in fact the last message.

Thank you, @PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).

Fixes #53872

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 12, 2021

💊 CI failures summary and remediations

As of commit 08e0964 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@t-vi t-vi force-pushed the tcpstore_del_perf branch 2 times, most recently from 3076c34 to 3fcb5eb Compare March 12, 2021 10:09
@codecov
Copy link

codecov bot commented Mar 12, 2021

Codecov Report

Merging #53886 (08e0964) into master (1772e26) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #53886      +/-   ##
==========================================
- Coverage   77.30%   77.30%   -0.01%     
==========================================
  Files        1888     1888              
  Lines      183589   183589              
==========================================
- Hits       141925   141917       -8     
- Misses      41664    41672       +8     

Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome fix and good catch @t-vi! I am a bit wary about adding that new test case test_delkey_perf since testing a perf change using multiprocessing can introduce flakiness / false signals in our CI.

Since TcpStore is used internally for storing membership information for nodes in distributed training, deletes are not frequent and time taken is small when compared to the actual training process.

Comment on lines 343 to 351
store = self._create_store()
keys = [str(i) for i in range(10)]
t0 = time.perf_counter()
[store.set(k, k) for k in keys]
dur_set = time.perf_counter()
t0 = time.perf_counter()
[store.delete_key(k) for k in keys]
dur_delete = time.perf_counter()
print(dur_set, dur_delete)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep this part if you are able to add an assert when comparing the duration and it is consistent. I don't think we need the tests below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I didn't mean to have that in there. This part doesn't actually check anything useful related to the fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this wasn't updated

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@t-vi
Copy link
Collaborator Author

t-vi commented Mar 12, 2021

I think the part of the test you are concerned about is the part we want to test which only happens when you actually have multiple processes. Note that the timing bound is exceedingly generous, so I would be optimistic that it's not too flaky. Of course, if we had a better place to put a basic perf test, it'd be better. The main alternative I'd see is not have a test if we conclude that testing timings in multi-process is too flaky.

@H-Huang
Copy link
Member

H-Huang commented Mar 12, 2021

Under the hood, TCPStore is single threaded. There is a thread on the master which sequentially accepts query types (e.g. SET, GET, DELETE) and handles them, which is why testing the perf under multiprocessing does not have that much added benefits compared to a single process. You are correct about the timing bound it is pretty generous though 🙂

@t-vi
Copy link
Collaborator Author

t-vi commented Mar 12, 2021

The bug was in the communication, not the backend.

@H-Huang
Copy link
Member

H-Huang commented Mar 12, 2021

Ah good point, my misunderstanding. Code looks LGTM to me then! If you remove the extra part under _test_numkeys_delkeys I will merge it ASAP

The tcpstore delete key implementation inadvertendly set "moreData"
when sending the key when it was in fact the last message.

Thank you, @PetrochukM for the reproducing example.

Fixes pytorch#53872
@H-Huang
Copy link
Member

H-Huang commented Mar 12, 2021

failure in mac os (https://app.circleci.com/pipelines/github/pytorch/pytorch/284617/workflows/812c7265-5c1b-4ecf-becf-fb119cd226b8/jobs/11504465). I think this is intermittent, I didn't see this in the earlier run

@t-vi
Copy link
Collaborator Author

t-vi commented Mar 12, 2021

That must be the flakyness. :/ Do we have hope that it is only mac or do we think this type of test is just touchy?

Copy link
Member

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick and great fix! In general I'm pretty unsure about adding a unittest that validates performance, as unittesting is mostly about correctness and we generally have benchmarks for performance. I'll defer the discussion to folks who've owned/worked on Store more, though.

processes = []
num_processes = world_size
for i in range(num_processes):
p = mp.Process(target=self._test_delkey_perf_worker, args=(i, addr, port, world_size, messages))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be preferable for all distributed tests needing multiprocessing to inherit from MultiProcessTestCase, as that class has had significant work to ensure proper error handling, failure reporting, etc. If we do add this test, could we add a class that inherits from MultiProcessTestCase and runs this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so I thought copying the test from within the case would be reasonably safe but I lack the expertise. I do agree that this isn't the right thing to do here - and it's demonstrated by the present failure.

store.delete_key(k)
dur_delete = (time.perf_counter() - start) * 1000

if dur_delete > 5 * max(dur_wait, dur_store):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems quite prone to flakiness, at the least could we make it a much more generous threshold?

)
sys.exit(MultiProcessTestCase.TEST_ERROR_EXIT_CODE)

time.sleep(0.1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we try to avoid time.sleep() in tests as it has previously been the source of a lot of nondeterminism and flakiness.

@H-Huang
Copy link
Member

H-Huang commented Mar 12, 2021

I agree with Rohan's comments. Regarding whether mac is the only tests that we've seen hit these flaky test, there have been multiple different CIs that we have seen multiprocessing cause issues with and we are also currently investigating them. Adding to our cpp tests (TCPStoreTest) may also be more stable, but that would involve a bit more work. For now, I think we should just leave out the perf test, and get the fix itself checked in.

@t-vi
Copy link
Collaborator Author

t-vi commented Mar 13, 2021

OK, so based on the feedback, I'll remove the test for now. If we have a place for benchmark, I'd be glad to add it there in a separate PR.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@H-Huang merged this pull request in 8734e88.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TCPStore.delete is 150x to 600x slower than TCPStore.set
5 participants