
Parallel push to share #3164

Merged
merged 12 commits into trunk from 22-06-28-push-conc on Jun 30, 2022

Conversation

@mitchellwrosen (Member) commented on Jun 28, 2022

Overview

This PR implements a parallel push, similar to the implementation used in #3153, but a little simpler.

  • The main thread pulls chunks of hashes off of a set and assigns them to one-shot workers.
  • Each worker downloads the chunk of entities, inserts them into the database, elaborates the hashes of the subset of those entities that went into temp storage, and appends the resulting set of hashes to the work queue.

Experimentally, it seems to run about 1.8x as fast as the serial implementation on trunk; a rough sketch of the dispatch loop is below. The next step might be to move to a streaming endpoint, which could reuse a lot of this implementation and would also let us get rid of the server-side sqlite write locks.
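
For concreteness, here is a minimal, hedged sketch of that dispatch loop in Haskell, assuming the ki 1.x API. The names parallelPush, processChunk, maxWorkers, and chunkSize are made up for illustration; the real code additionally threads a dedupe set, worker ids, and the failure TMVar discussed below.

{-# LANGUAGE BlockArguments #-}

import Control.Concurrent.STM
import Data.Set (Set)
import qualified Data.Set as Set
import qualified Ki

parallelPush ::
  Int ->                             -- maximum number of concurrent workers
  Int ->                             -- how many hashes each worker takes per chunk
  (Set String -> IO (Set String)) -> -- process one chunk; return new hashes to enqueue
  Set String ->                      -- initial hashes
  IO ()
parallelPush maxWorkers chunkSize processChunk initialHashes = do
  workQueue <- newTVarIO initialHashes
  runningVar <- newTVarIO (0 :: Int)
  Ki.scoped \scope -> do
    let loop = do
          maybeChunk <-
            atomically do
              queue <- readTVar workQueue
              running <- readTVar runningVar
              if Set.null queue
                then if running == 0 then pure Nothing else retry
                else
                  if running >= maxWorkers
                    then retry
                    else do
                      -- Pull a chunk of hashes off the set and claim a worker slot.
                      let (chunk, rest) = Set.splitAt chunkSize queue
                      writeTVar workQueue rest
                      modifyTVar' runningVar (+ 1)
                      pure (Just chunk)
          case maybeChunk of
            Nothing -> pure ()
            Just chunk -> do
              -- One-shot worker: process the chunk, then append the resulting
              -- hashes to the work queue and release the worker slot in one
              -- transaction, so the dispatcher never observes "empty and idle"
              -- while more work is still about to arrive.
              _ <- Ki.fork scope do
                moreHashes <- processChunk chunk
                atomically do
                  modifyTVar' workQueue (Set.union moreHashes)
                  modifyTVar' runningVar (subtract 1)
              loop
    loop
    atomically (Ki.awaitAll scope)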

@mitchellwrosen marked this pull request as ready for review on June 30, 2022 at 17:42

doJob :: [STM UploadDispatcherJob] -> IO Bool
doJob jobs =
atomically (asum jobs) >>= \case
Contributor:

Thanks for trying it this way, I find it clearer 👍🏼

doJob jobs =
atomically (asum jobs) >>= \case
UploadDispatcherReturnFailure -> pure False
UploadDispatcherForkWorkerWhenAvailable hashes -> doJob [forkWorkerMode hashes, checkForFailureMode]
Contributor:

Hrmm, was it intentional to check for failure only if we fail to start a new job?

I'm confused about the semantics of checking for failure: it seems in some spots it bails unconditionally, and in spots like this one it bails only conditionally.

Member Author:

Yeah, this ordering was intentional, but I think it would be correct either way. It's just a question of whether we want to be paranoid and re-check something we just checked (whether the failure TMVar has been filled), or only fall back to checking it if we end up blocked waiting for a chance to fork a worker (because N workers already exist).
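
To make that ordering concrete, here is a small, self-contained illustration (not the PR's actual code) of how an asum of STM alternatives behaves: it is left-biased orElse, so the second alternative is only consulted while the first is blocked in retry. The Job type and the two actions below are placeholders for the real dispatcher jobs.

import Control.Concurrent.STM
import Data.Foldable (asum)

data Job = ForkWorker | Bail
  deriving (Show)

-- Succeeds as soon as a worker slot is free; otherwise retries.
tryForkWorker :: TVar Int -> Int -> STM Job
tryForkWorker runningVar maxWorkers = do
  running <- readTVar runningVar
  check (running < maxWorkers)
  pure ForkWorker

-- Succeeds only once some worker has recorded a failure.
checkForFailure :: TMVar () -> STM Job
checkForFailure failedVar = do
  _ <- readTMVar failedVar
  pure Bail

-- With the slot check listed first, failure is only noticed while we are
-- blocked waiting for a worker slot; listing checkForFailure first would
-- re-check the failure var before every fork instead.
nextJob :: TVar Int -> Int -> TMVar () -> IO Job
nextJob runningVar maxWorkers failedVar =
  atomically (asum [tryForkWorker runningVar maxWorkers, checkForFailure failedVar])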

dedupeVar <- newTVarIO Set.empty
nextWorkerIdVar <- newTVarIO 0
workersVar <- newTVarIO Set.empty
workerFailedVar <- newEmptyTMVarIO
Contributor:

Hrmm, ki propagates exceptions from threads to the parent right?
I wonder if we should just throw an exception on failure rather than needing to remember to manually check a failure var every time before we do something.

I notice lower down you only bail on failure if there isn't more work to do, is there a particular reason for that?

Exiting with an exception would maybe mean we have some entities downloaded that we didn't insert, but honestly that seems like a reasonable sacrifice to make for simplicity and avoiding a foot-gun of forgetting to check for errors and rolling forward in some sort of invalid state.

It's also worth noting that the http client is already throwing exceptions on timeouts, 500s, or bad gateways, so the workerFailedVar is only going to handle a subset of "blessed" failures, which I think just adds some additional confusion; unless there's some other reason for it, I'd say one failure mode is easier to maintain than two.
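
For what it's worth, a hedged sketch of the exception-based alternative being suggested, assuming the ki 1.x API: an exception raised in a forked thread is re-thrown to the thread that opened the scope, so the whole push would abort without any manual failure-var checks. The PushFailed type and the shape of the upload actions are placeholders.

{-# LANGUAGE BlockArguments #-}
{-# LANGUAGE LambdaCase #-}

import Control.Concurrent.STM (atomically)
import Control.Exception (Exception, throwIO)
import qualified Ki

newtype PushFailed = PushFailed String
  deriving (Show)

instance Exception PushFailed

pushAllChunks :: [IO (Either String ())] -> IO ()
pushAllChunks uploads =
  Ki.scoped \scope -> do
    mapM_
      (\upload ->
         Ki.fork scope do
           upload >>= \case
             Left err -> throwIO (PushFailed err) -- propagates to the parent scope
             Right () -> pure ())
      uploads
    atomically (Ki.awaitAll scope)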

Member Author:

Hmm... I'm not sure I agree. I think returning failures from the endpoint as 200s, so we could pattern match on them, was done to make the code easier to understand, so we don't have to know what exception types might be thrown, and where.

I agree it's confusing right now -- maybe we shouldn't have a catchSyncExceptions thing that only catches some subset of exceptions these actions might throw? Couldn't we return these as Lefts too, and pass them around manually? (Or use ExceptT syntactic sugar).

Contributor:

Couldn't we return these as Lefts too, and pass them around manually? (Or use ExceptT syntactic sugar).

If you'd prefer to do it that way then yes, you could catch all of those error cases and pass around the Eithers, but it's definitely a bit annoying. Ideally you'd do all of the "generic" error handling in the servant http client itself, but since each endpoint has its own custom sum type of the errors/responses it can return, you'd need to convert the generic http errors into the endpoint's error type for each and every endpoint (doable, but annoying). Alternatively, you'd have to defer catching the errors until outside of the creation of the http client, which I don't like because it makes it very likely that some implementations will forget to catch certain error conditions.

I'm a big fan of using Exceptions for exceptional cases, as long as the set of "expected" exception types is well-bounded and well-known.

Another option would be to recognize the difficulties introduced by "200 Error" and try going back the other way, where we just send error codes with a good message and have the client propagate those errors.
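
As a concrete illustration of the per-endpoint conversion option mentioned above, here is a hedged sketch; TransportException, UploadError, and callUpload are placeholder names, and in the real code the caught type would be whatever exception the servant client actually throws.

{-# LANGUAGE LambdaCase #-}

import Control.Exception (Exception, try)

-- Stand-in for whatever exception type the underlying http client throws on
-- timeouts, 500s, or bad gateways.
newtype TransportException = TransportException String
  deriving (Show)

instance Exception TransportException

data UploadError
  = UploadErrorTransport TransportException -- generic transport failure, folded in
  | UploadErrorOutOfDate                    -- example of an endpoint-specific failure

data UploadResponse = UploadResponse

-- Wrap one endpoint's call so its callers only ever see that endpoint's own
-- error sum type; this is the per-endpoint conversion described above as
-- doable but annoying.
callUpload :: IO UploadResponse -> IO (Either UploadError UploadResponse)
callUpload doCall =
  try doCall >>= \case
    Left err -> pure (Left (UploadErrorTransport err))
    Right response -> pure (Right response)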

-- Delete the hashes of the entities we just uploaded from the `dedupe` set, because they will never be relevant for any
-- subsequent deduping operations. If we didn't delete from the `dedupe` set, this algorithm would still be
-- correct, it would just use an unbounded amount of memory to remember all the hashes we've uploaded so far.
whenJust maybeYoungestWorkerThatWasAlive \youngestWorkerThatWasAlive -> do
Contributor:

Hrmm, I understand what this is doing; but it's very complex, and even though we have all the necessary information in the client, it kind of seems like an implementation detail of the server.

I suppose this is something else that can go away when we move to streaming right?

Member Author:

Yeah, this can go away when we move to streaming, because the server won't have multiple simultaneous transactions open that all correspond to our push.

@ChrisPenner (Contributor):

I don't love having multiple routes for errors to propagate; I think it's a foot-gun trap waiting to be sprung, but we can revisit that later, I suppose.

@mitchellwrosen (Member Author):

Sure. Wait, it's unclear to me what you mean: what's the trap? (Anyway, yeah, we can revisit it.)

@mitchellwrosen merged commit ed94820 into trunk on Jun 30, 2022
@mitchellwrosen deleted the 22-06-28-push-conc branch on June 30, 2022 at 21:21