Refactor sourceMgr #196

Merged
merged 43 commits into master from sm-channels on Apr 4, 2017

Conversation

@sdboyer
Owner

sdboyer commented Mar 20, 2017

This PR refactors the sourceMgr in several areas:

  • Fixes race conditions/concurrency issues - fixes #170
  • Incorporates context.Context - #159
  • Makes ProjectAnalyzer a method parameter - fixes #195

The changes here also pave the way for persistent caching - #130.

Smaller TODOs:

  • Rename update() to fetch() in our vcs repo impl
@spenczar
Contributor

spenczar left a comment

I think the use of big anonymous closures like this is dangerous. They're tricky to read, because you have to scan all the surrounding code to understand which things are free variables and which are created within. That leads to shadowing bugs like #187 quite quickly.

It also complicates tests and makes it difficult to follow logic while debugging.

I'd really encourage you to consider breaking those anonymous closures out, making them named functions or methods, so it's clearer what their inputs and outputs are. My experience with this was in debugging that error shadowing case, and it was only hard because of the use of closures and callbacks.

Ultimately, that's a style thing, not a correctness thing, so it's your call, of course. But I think correctness (or incorrectness!) would be easier to see.
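
To make the hazard concrete, here is a minimal sketch of that class of shadowing bug (hypothetical fetch helper, inside a function returning error; not code from this PR):

var err error
done := make(chan struct{})
go func() {
  defer close(done)
  // BUG: ':=' declares a new, local err that shadows the outer one,
  // so the failure never leaves the closure.
  result, err := fetch()
  _, _ = result, err
}()
<-done
return err // always nil, even when fetch() failed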

deduce.go Outdated
callMgr:  cm,
ctx:      cm.getLifetimeContext(),
rootxt:   radix.New(),
deducext: pathDeducerTrie(),

spenczar Mar 21, 2017

Contributor

Missing a line for action: make(chan func()) here
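
For reference, a sketch of the initializer with that field filled in (the surrounding constructor is assumed from the excerpt above):

return &deductionCoordinator{
  callMgr:  cm,
  ctx:      cm.getLifetimeContext(),
  rootxt:   radix.New(),
  deducext: pathDeducerTrie(),
  action:   make(chan func()), // the line noted as missing
}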

deduce.go Outdated
for {
	select {
	case <-dc.ctx.Done():
		close(dc.action)

spenczar Mar 21, 2017

Contributor

This risks either a double close or a read of a nil action in the next loop iteration. Do you need to return in this branch?
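
A sketch of that fix - return immediately after closing, so the loop can never touch the closed channel again (the second case is an assumed stand-in for the actor loop's normal branch):

for {
  select {
  case <-dc.ctx.Done():
    close(dc.action)
    // Returning prevents both failure modes: a second Done() receive
    // double-closing the channel, and a receive on the closed channel
    // yielding a nil func.
    return
  case a := <-dc.action:
    a()
  }
}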

sdboyer Mar 21, 2017

Owner

ah yes, totally, thank you

deduce.go Outdated
@@ -690,6 +695,253 @@ func (sm *SourceMgr) deduceFromPath(path string) (deductionFuture, error) {
}, nil
}

type deductionCoordinator struct {
ctx context.Context

spenczar Mar 21, 2017

Contributor

The context package warns against using contexts this way: "Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it."

I think it's good advice. Do you have a reason you need the context to be a struct field?

spenczar Mar 21, 2017

Contributor

I don't really follow most of that discussion, but at the very least, this is unusual, which in the Go community is a strong knock against it :)

It appears that the stored context is only really used in deduceRootPath to check whether the coordinator has already been terminated. The fact that it's done with a stored context (but other times a context is passed in to the function call - why? what's the difference?) will make it harder to see how you go about canceling calls, I think. I could be wrong! There might be more inherent complexity here than I see, so maybe that difficulty is going to be here no matter what.

But I agree with Peter that I've regretted doing things this way every time. Letting the context flow down through method invocations has never confused me in the same way.

sdboyer Mar 21, 2017

Owner

Yeah, so, I realize I actually conflated two things. THAT stored context doesn't necessarily need to be there; it could be grabbed from the callManager on the fly in the actor loop. But that doesn't really change much, as there's still a context on the callManager struct.

The basic issue here comes down to there being two different possible origins for cancellation of any given SourceManager method call:

  1. A cancellation of the context passed directly into the method - the standard Go way of doing it
  2. A cancellation from the long-lived context (e.g., a signal is sent to a dep process, which then needs to tell the sourceMgr to shut down via a context that's injected in the NewSourceManager() call), which must propagate to all open calls

There are different possible approaches to this, but that's the essential reality, and dealing with it entails combining cancelation paths somehow.
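
Concretely, combining the two origins reduces to a helper that watches both Done channels - roughly the shape that the constext library formalizes (a minimal sketch; names are illustrative):

// mergeCancel yields a context that is canceled as soon as either
// parent context is canceled.
func mergeCancel(a, b context.Context) (context.Context, context.CancelFunc) {
  merged, cancel := context.WithCancel(a) // already follows a's cancellation
  go func() {
    select {
    case <-b.Done():
      cancel() // fold b's cancellation into the merged context
    case <-merged.Done():
      // a was canceled, or cancel() was called; stop watching b
    }
  }()
  return merged, cancel
}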

spenczar Mar 21, 2017

Contributor

I think vanilla contexts handle that pretty well with just one tree of contexts; I don't see how you need to combine cancelation paths. I imagine it looking something like this:

func main() {
  ctx, cancel := context.WithCancel(context.Background())
  go cancelOnSignal(cancel) // yknow, set up a loop on a chan os.Signal, blah blah

  sm := NewSourceManager(ctx, [...]) 
  sm.doStuff(ctx)
}

If the root context is canceled because of a signal, that gets propagated down to all the children, which will be all the open calls, like you want.

For the second case, if you want a 'locally' cancelable function, you call context.WithCancel again. The newly generated context will cancel if the old papa context made in main gets canceled, or if the new local cancel method is called. But the local cancel method won't close any of the context's ancestors. This seems to be what you want, right?
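
In code, the locally cancelable case might look like this (continuing the sketch above):

local, cancel := context.WithCancel(ctx) // child of the main ctx
defer cancel() // cancels only this call's context, not ctx or its other children
sm.doStuff(local)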

sdboyer Mar 21, 2017

Owner

The little code block is fine for that set of calls - but in a dep run that includes a solve, the vast majority of SourceManager calls come from the solver, not dep itself. And yes, we could thread a context through that as well (in fact, doing so might be quite useful for establishing running time limits on the algorithm), and that could be subctx'd down into the calls the solver makes to the SourceManager.

But while that's a fine guarantee for an implementation to provide, within gps, we have no idea if the caller has wired the long-lived context together with the ephemeral call contexts. To ensure correctness, we still have to internally combine cancelation.

For the second case, if you want a 'locally' cancelable function, you call context.WithCancel again. The newly generated context will cancel if the old papa context made in main gets canceled, or if the new local cancel method is called. But the local cancel method won't close any of the context's ancestors. This seems to be what you want, right?

Yes, and this is mostly the counterargument that Peter made on reddit. (I also did up what you're describing in some pseudocode).

I'm not saying it's not possible to do that way. I'm saying it's fundamentally equivalent to the constext approach. And, while constext is a new thing to learn, it's also a reasonably well-defined formalism. Given that we have to solve this problem, I'm interested in exploring the formalism, rather than just following the guideline...for now 😄 .

spenczar Mar 21, 2017

Contributor

we could thread a context through that as well

Yep! That's exactly what I'd expect to see naively.

But while that's a fine guarantee for an implementation to provide, within gps, we have no idea if the caller has wired the long-lived context together with the ephemeral call contexts. To ensure correctness, we still have to internally combine cancelation.

I don't mean to be dense, but I'm still not totally getting this. If the calls are coming from the solver, not dep itself, then we seem to be sure of the provenance of the contexts, right? We know they came as children made by the solver off of the original one from dep (or some other implementation using the solver).

Are you concerned that there are multiple entrypoints to this code, and you want to be sure the caller is using the same context for them? If so, I think the right approach is to shrink the entrypoints. Do you have a concrete case here? That might help me understand.

Yes, and this is mostly the counterargument that Peter made on reddit. (I also did up what you're describing in some pseudocode).

Hm, this pseudocode isn't what I was thinking of. There's no need for quitchan there - you'd just call cancel() at the end of the function directly. Each spawned goroutine is checking whether the context is closed before (or while) it does significant work like network calls. If you want to only exit once all spawned goroutines have exited, you keep track of that with sync.WaitGroup.
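
That shape might look like the following sketch (items and work are hypothetical; work is expected to watch ctx.Done() internally):

var wg sync.WaitGroup
for _, item := range items {
  wg.Add(1)
  go func(it string) {
    defer wg.Done()
    if ctx.Err() != nil {
      return // already canceled; skip the expensive work
    }
    work(ctx, it)
  }(item)
}
wg.Wait() // exit only once all spawned goroutines have exited
cancel()  // then release the context's resources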

I agree that both approaches eventually work, of course. No question there! This is more about how easy those approaches are to keep up and maintain. I just think it's simpler to think about the one flow of contexts than this idea of merging two.

sdboyer Mar 21, 2017

Owner

@rogpeppe has also responded on reddit, with a hopefully-clearer explanation :)

Are you concerned that there are multiple entrypoints to this code, and you want to be sure the caller is using the same context for them? If so, I think the right approach is to shrink the entrypoints.

Yes. I don't see how it would be possible to shrink the entrypoints - there's the creation of the sourceManager, and there's calling methods on it - the context for the former cancels ALL running activities, and the context for the latter cancels only the current method.

These contexts are probably from the same tree, but from within the sourceMgr we cannot know whether they are, so we have to take precautions.

There's no need for quitchan there - you'd just call cancel() at the end of the function directly.

Without the quitchan, we don't know to cancel this running call if the long-lived context was canceled. We'd have to make the assumption that the two contexts are part of the same tree.
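
That internal precaution amounts to selecting over both Done channels inside each call - a sketch, with assumed names for the two contexts and the result channel:

select {
case <-callCtx.Done(): // the per-call context passed into the method
  return nil, callCtx.Err()
case <-lifetimeCtx.Done(): // the long-lived context from NewSourceManager()
  return nil, lifetimeCtx.Err()
case res := <-resultChan: // the underlying work finished first
  return res.val, res.err
}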

If you want to only exit once all spawned goroutines have exited, you keep track of that with sync.WaitGroup.

Yep, but this is entirely separate.

I just think it's simpler to think about the one flow of contexts than this idea of merging two.

I agree, if we could make that assumption, it would be the simpler way to do it :)

spenczar Mar 21, 2017

Contributor

Okay! That makes a lot more sense. The root cause here is that you don't trust the caller to responsibly use the right context value.

So, the design tradeoff you're making is that you can be more confident that this code will work for callers who do weird things with the package (and using different contexts would be weird), but you are sacrificing simplicity to gain that confidence.

I don't think I agree that that's the right design decision here since gps isn't ever going to have, like, tons of direct users. It's likely that dep is the only user for quite a while. I think clear documentation on the source manager's constructor (use the same context, or children of the original context, whenever you call its methods) would be sufficient.
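
Such a doc comment might read roughly like this (signature abbreviated and illustrative, not the actual gps constructor):

// NewSourceManager produces a SourceManager whose lifetime is governed
// by ctx: canceling ctx shuts down all of the manager's activity.
// Contexts passed to individual methods must be ctx itself or derived
// from it, so that cancellation flows through a single tree.
func NewSourceManager(ctx context.Context, cachedir string) (*SourceMgr, error)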

But! That's just my feeling. I think we've actually gotten to the core of things now, and we have the tradeoff much more clearly on the table, which is good enough for me. 👍

deducers.go Outdated

// The rest of the work needs its own goroutine, the results of which will
// be re-joined to this call via the return chans.
go func() {

spenczar Mar 21, 2017

Contributor

Consider making these anonymous closures into named functions or methods. That way, it's clear what the inputs and outputs are.

In this case, I think it's something like

func (sc *sourceCoordinator) computeSourceGateway(ctx context.Context, name string, rc srcReturnChans)

and the invocation becomes go sc.computeSourceGateway(ctx, normalizedName, rc).

sdboyer Mar 21, 2017

Owner

Thanks for the review!

Yeah, I've been leaning in that direction; I went heavy pseudo-actor model initially, and am slowly walking it back. The general issue I have with it is distinguishing between things that should be called by the wrapper object, vs those that should not. That's a smaller design consideration, though - like I said, I'm walking it back :)

This one shouldn't be a problem to make into its own method, but the other one (in deductionCoordinator.deduce()) is absolutely not threadsafe unless it's being serialized through that channel.

deducers.go Outdated
sc.srcmut.RLock()
srcGate, has := sc.srcs[normalizedName]
sc.srcmut.RUnlock()
if has {

spenczar Mar 21, 2017

Contributor

This code might be quite a bit simpler if it used sync.Cond to rendezvous the goroutines instead of channels. As it is, there are lots of opportunities for races in this code. For example, here it looks to me like we're unlocking too early - this should be

sc.srcmut.RLock()
srcGate, has := sc.srcs[normalizedName]
if has {
  sc.srcmut.RUnlock()
  doReturn(srcGate, nil)
  return
}
sc.srcmut.RUnlock()

but of course this is all a sad state to be in - we've wasted time running deduceRootPath twice. Instead of guarding the map with a lock, I think we'd be better off guarding the function invocation. sync.Cond is the simplest way to do that.
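
For comparison, a sketch of guarding the invocation rather than the map, in the single-flight style that sync.Cond enables (all names illustrative; sc.cond is assumed to be created with sync.NewCond(&sc.mu)):

sc.mu.Lock()
for sc.inFlight[normalizedName] {
  sc.cond.Wait() // releases sc.mu while blocked, reacquires on wake
}
if srcGate, ok := sc.srcs[normalizedName]; ok {
  sc.mu.Unlock()
  doReturn(srcGate, nil)
  return
}
sc.inFlight[normalizedName] = true
sc.mu.Unlock()

srcGate := setUpSource(normalizedName) // the expensive work runs exactly once

sc.mu.Lock()
sc.srcs[normalizedName] = srcGate
delete(sc.inFlight, normalizedName)
sc.cond.Broadcast() // waiters re-check and find the entry in sc.srcs
sc.mu.Unlock()
doReturn(srcGate, nil)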

sdboyer Mar 21, 2017

Owner

Hmm...well, while this is certainly my first, untested pass at the code, I'm not seeing the opportunity for races here. And I don't see how it would be possible for deduceRootPath to be run multiple times for the same path.

The big reason there's so much staged locking/unlocking here is because we have to keep both the deduceRootPath() AND sourceURL() calls on a non-blocking path, as both may involve I/O, especially the latter.

That said, I'd totally forgotten that sync.Cond even exists. It may well be a much better way of coordinating these fold-in behaviors. I'll have a look!

spenczar Mar 21, 2017

Contributor

I don't see how it would be possible for deduceRootPath to be run multiple times for the same path.

Suppose getSourceGatewayFor is called twice with the same arguments in two different goroutines, G1 and G2.

G1 does the L164 logic - it grabs the srcmut read lock, sees that sc.nameToURL doesn't have normalizedName, so it releases the lock. Now G2 is scheduled, and does exactly the same thing. They're both now waiting right in front of the go call at L182.

G1 starts this goroutine - the one from L182 to L272. It grabs the psrcmut write lock, sees that there is no entry in p.protoSrcs, so it creates one. It continues doing work, including running deduceRootPath, all the way to running doReturn. Again, it grabs the write lock, sends its output on the right channel, and then calls delete(sc.protoSrcs, normalizedName). It releases the lock and exits.

Now G2 is scheduled. It starts at the top of the big goroutine. It grabs the psrcmut lock. It checks if sc.protoSrcs has normalizedName. It doesn't, because G1 deleted its entry. So, G2 continues - it will run deduceRootPath again.

So, there we are. Certainly not a very common scheduling, but definitely possible.

sdboyer Mar 21, 2017

Owner

Ah, sorry, yes, that's possible - but that's the one I'm accounting for with this additional medial check we're commenting on here 😄

In that case, we do run deduceRootPath() twice, but the results will be coming back from a trie - no regex, and definitely no HTTP call. So we're not paying a significant cost in cycles. And there's no unsynchronized data access or duplication of resources, as we return early in the (fixed) version of this check we're commenting on.

(Also, the original code change you suggested is equivalent to what's there - I just hoisted the Unlock() out of the conditional)

Are there other potential bad interleavings that you can see?

spenczar Mar 21, 2017

Contributor

Oh, the trie thing is good - I didn't realize that! But that kind of proves my point here, I think, which is that wowzers this is pretty tricky code. I think sync.Cond could make it simpler. I'll scan for more bad schedulings, but no promises :)

sdboyer Mar 21, 2017

Owner

No argument here! I'd definitely like to see this decomposed into more comprehensible pieces. IMO it's already less complected than it was in the old system (where root deduction and source creation were inextricably bound up in those stacks of futures), but there's still a ways to go.

deducers.go Outdated
// scheduled. Guard against that by checking the main sources map again
// and bailing out if we find an entry.
sc.srcmut.RLock()
srcGate, has := sc.srcs[normalizedName]

spenczar Mar 21, 2017

Contributor

We're checking here for normalizedName, but later we set based on url. Is that correct?

sdboyer Mar 21, 2017

Owner

Ah yes, thank you - I initially wrote this medial check before introducing the nameToURL map, and forgot to update it when I changed the check at the top.

deducers.go Outdated
doReturn(sa, nil)
return
}

spenczar Mar 21, 2017

Contributor

This anonymous closure runs to several pages long in my editor, which makes it tough to keep track of everything. Could it be broken up into smaller function calls that have clear inputs/outputs?

sdboyer Mar 21, 2017

Owner

I'm sure we can get it there :) Per the above response, once I've weaned this off of the closure and toward a method, splitting it up into parts will be easier.

@spenczar
Contributor

spenczar left a comment

Thanks for being receptive to review! 😀

codecov bot commented Mar 22, 2017

Codecov Report

Merging #196 into master will increase coverage by 1.04%.
The diff coverage is 72.82%.

@@            Coverage Diff             @@
##           master     #196      +/-   ##
==========================================
+ Coverage   77.17%   78.22%   +1.04%     
==========================================
  Files          27       28       +1     
  Lines        3987     4018      +31     
==========================================
+ Hits         3077     3143      +66     
+ Misses        692      655      -37     
- Partials      218      220       +2
Impacted Files     Coverage Δ
rootdata.go        92.1%  <ø>      (ø)       ⬆️
hash.go            81.35% <100%>   (ø)       ⬆️
bridge.go          75.91% <100%>   (-0.25%)  ⬇️
solver.go          83.87% <100%>   (+0.09%)  ⬆️
maybe_source.go    63.86% <58.1%>  (-16.14%) ⬇️
vcs_repo.go        54.19% <60.6%>  (+8.04%)  ⬆️
version.go         77.45% <62.85%> (-1.34%)  ⬇️
cmd.go             79.66% <66.66%> (-8.58%)  ⬇️
source_cache.go    67.01% <67.01%> (ø)
vcs_source.go      77.17% <69.56%> (+4.11%)  ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c36f26...01e3ba5. Read the comment docs.

@sdboyer

Owner

sdboyer commented Mar 22, 2017

#202 is quite related to this, but I'm gonna leave it out of this PR to kinda sorta avoid scope creep just a teensy bit.

@sdboyer

Owner

sdboyer commented Mar 29, 2017

Noting this here because I'm not sure if it'll remain an issue or not, so I don't want to make it a separate issue until this is merged. But this PR as currently written has a problematic case. (The problem exists on master, too, which is even less organized in how it deals with the issue.)

Say there's a ProjectRoot, github.com/foo/bar. When attempting to get the source handler for that project, the default deduction and maybeSource path will end up pointing to https://github.com/foo/bar. That's the first scheme it'll try, anyway.

The role of the maybeSource is interesting. It's the way that the root deduction system abstracts over which exact upstream URL (scheme, mostly) might be used to fulfill an import. maybeSource.try() needs to balance two goals: returning as quickly as possible, while only returning if it's reasonably sure the source it's deduced actually exists.

There are two definitions we could use for "a source exists": it exists upstream, or it exists in the cache. The latter, being local, is faster to check, but potentially wrong about the state of the world - even if the source exists in the local cache, the upstream could have moved, been deleted, become inaccessible due to expired credentials, etc.

For now, I think I'm going to just force the remote check to always happen. However, it seems likely that we may need to revisit this, and rely only on the local check initially. If/when we make that shift, we'll need to add a mechanism by which to reach back into the sourceCoordinator and change its various maps to correspond to the new URLs. This could present issues for the current logic in getSourceGatewayFor(), as it's all carefully crafted around the idea that those maps are essentially append-only.
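
A sketch of that decision as it might land in a maybeSource (method and helper names here are illustrative, not gps's actual signatures):

func (m maybeGitSource) try(ctx context.Context, cachedir string) (source, error) {
  if src, ok := lookupLocalClone(cachedir, m.url); ok {
    // The local clone is fast to check but may be stale: the upstream
    // could have moved, been deleted, or have expired credentials.
    // For now, always confirm with the remote before returning.
    if err := src.pingRemote(ctx); err != nil {
      return nil, err
    }
    return src, nil
  }
  // Nothing cached locally; the remote is the only option.
  return cloneFromRemote(ctx, m.url)
}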

@sdboyer sdboyer force-pushed the sm-channels branch from 0759522 to 13bc00d Mar 29, 2017

@sdboyer sdboyer force-pushed the sm-channels branch from 53a6431 to 38bedf4 Apr 1, 2017

sdboyer added some commits Apr 1, 2017

More source-related cleanup and cruft removal.
* Move baseVCSSource into vcs_source.go
* Rename source_test.go to vcs_source_test.go
* Remove sourceExistence type and remaining uses of it
Add sourceGateway tests
This test isn't exhaustive, but it covers the basics.

@sdboyer sdboyer force-pushed the sm-channels branch from 8847ade to fce2662 Apr 3, 2017

sdboyer added some commits Apr 3, 2017

chan instead of real sigs in TestSignalHandling
Using actual signals was far too flaky in tests.

@sdboyer sdboyer force-pushed the sm-channels branch from fce2662 to a3c79f3 Apr 3, 2017

sdboyer added some commits Apr 3, 2017

Rename update() to fetch() in local repo interface
This more accurately reflects how we were actually using the method for
some time, as well as what we'd more recently modified the method to do.

@sdboyer sdboyer force-pushed the sm-channels branch from 56923fc to 0eddf21 Apr 3, 2017

@sdboyer sdboyer force-pushed the sm-channels branch from 5aeb5e8 to 01e3ba5 Apr 4, 2017

@sdboyer

Owner

sdboyer commented Apr 4, 2017

As-is, this is very likely to cause a perceived performance regression. It will end up serializing calls from the solver that previously had been parallelized, as it reverses the change made in 4815cd5.

The fundamental problem is, we can't avoid that regression while maintaining the invariant I described as valuable earlier.

However, this PR has been open for far too long, and diverged far too much as-is. Because it's accomplished its main purpose, I'm going to merge it now, and then we can revise and modify more incrementally.

@sdboyer sdboyer merged commit c7c4309 into master Apr 4, 2017

5 of 6 checks passed

codecov/patch: 72.82% of diff hit (target 77.17%)
ci/circleci: Your tests passed on CircleCI!
codecov/project: 78.22% (+1.04%) compared to 1c36f26
continuous-integration/appveyor/branch: AppVeyor build succeeded
continuous-integration/appveyor/pr: AppVeyor build succeeded
hound: No violations found. Woof!

@sdboyer sdboyer deleted the sm-channels branch Apr 4, 2017

@sdboyer sdboyer changed the title [WIP] Refactor sourceMgr Refactor sourceMgr Apr 16, 2017

kris-nova pushed a commit to kris-nova/dep that referenced this pull request Apr 21, 2017
