
Sync single external service #13483

Merged
merged 57 commits into main from sync_single_external_service Sep 9, 2020

Conversation


@ryanslade ryanslade commented Aug 31, 2020

This PR makes two main changes:

  • We sync each external service independently
  • The syncing process now happens using the workerutil package

Every minute, Syncer.Run enqueues into a job queue any external services that are due to be synced.

These jobs are picked up by up to three sync workers concurrently; each worker syncs one service using the new Syncer.SyncExternalService method.

We rely on database triggers to mark repos as deleted once they no longer exist in any external service. This is based on the state of the external_service_repos table which tracks the relationship between repos and external services.

Triggers also exist that remove rows from external_service_repos when a repo or external service is deleted or soft deleted.

@keegancsmith We'd love your feedback, specifically on the code related to SyncSubset, as we are not 100% sure that this still works as expected.

ryanslade and others added 11 commits August 27, 2020 12:50
Also removes a test case that tests a sync of multiple external services,
as we no longer support that.
Code now compiles but a lot of tests still fail.

Added a cleanup function that needs to be called when we stop
a worker, to unregister its Prometheus metrics so that registration doesn't
panic when we start another worker in a subsequent test.

ryanslade and others added 12 commits September 1, 2020 17:14
If a job is processing when repo-updater dies, that job would never
complete. We would then never requeue the associated external service for
syncing. To be safe, we delete non-locked processing rows on startup.
This will cause a sync to be triggered ASAP for the saved external
service.

On save, we trigger a call to repo-updater which causes us to
enqueue any pending sync jobs. As we've just updated next_sync_at, the job
for the newly saved external service will be queued.
We no longer trigger a full sync but instead enqueue pending
sync jobs.
We now have a trigger that does the same thing
@ryanslade ryanslade marked this pull request as ready for review September 2, 2020 15:38
@ryanslade
Contributor Author

> LGTM. Too much code to review, so I'm not confident I have validated all issues. I think with this sort of change a better approach would be for me to also have a test plan to review. Our old tests worked in a global context, so the switch to per external service may miss things. So I think a few things need to be manually tested:
>
> 1. Two external services returning the same repo, but with different data. The easiest conflict here is on the name (e.g. a different name template). Is this deterministic? Does the name flip-flop as each syncer runs?

Yes, if different data is returned then we'll end up flip-flopping as each service syncs. Unfortunately this isn't something we considered.

Perhaps we can make it eventually consistent on upsert. Currently newDiff sorts the sourced repos deterministically and merges them in that order. Maybe whenever we update a repo we also store the external service ID that was used to source the repo. So we record the last service to "touch" the repo. If the IDs are the same we know it's a simple update: just replace everything. If the IDs are different, the merge operation happens in a deterministic order based on source ID, which is pretty much what is happening here:

func (r *Repo) Less(s *Repo) bool {

> 2. Streaming insertion of new repos

I've tested this and it works as expected.

> 3. SyncSubset still works. This is all that sourcegraph.com uses. I can't remember if non dot-com instances use this; maybe streaming inserts?

In the new implementation all instances will use this since, as you guessed, it's used in the streaming inserter. So given that it's used in all code paths now, and we've included a bunch of new tests as well as done a fair amount of manual testing, I'm fairly confident that it still works.

@ryanslade ryanslade mentioned this pull request Sep 4, 2020
18 tasks
@keegancsmith
Member

I just remembered another case we came across. Not sure how it would apply now:

  1. Sync repo {Name=A ExternalID=1}
  2. A is deleted and a new repo is created: {Name=A ExternalID=2}
  3. Sync repo. In the old code we would detect this (or eventually become consistent) as a brand new repo and delete the old one.

We also have cases we handle like repos swapping names between syncs:

  1. Sync {Name=A ExternalID=1} and {Name=B ExternalID=2}
  2. Funky renames happen
  3. Sync {Name=A ExternalID=2} and {Name=B ExternalID=1}

Throw in different external services to make it more fun. E.g., to make this applicable, imagine that ExternalID=1 and ExternalID=2 are synced by different external services. Because the old code treated this globally, it would pick one winner for a given name, so we would never violate the unique name constraint.

> 1. Two external services returning the same repo, but with different data. The easiest conflict here is on the name (e.g. a different name template). Is this deterministic? Does the name flip-flop as each syncer runs?
>
> Yes, if different data is returned then we'll end up flip-flopping as each service syncs. Unfortunately this isn't something we considered.

This flip-flopping may be fine. For example, you could enforce that name templates are the same for the same codehost. Then if a codehost flip-flops data we could maybe just accept that as the codehost doing something bad. I can't remember if we ever came across something like that; we only specifically cared about name.

> Perhaps we can make it eventually consistent on upsert. Currently newDiff sorts the sourced repos deterministically and merges them in that order. Maybe whenever we update a repo we also store the external service ID that was used to source the repo. So we record the last service to "touch" the repo. If the IDs are the same we know it's a simple update: just replace everything. If the IDs are different, the merge operation happens in a deterministic order based on source ID, which is pretty much what is happening here:
>
> func (r *Repo) Less(s *Repo) bool {

I think for merges, picking a winner (e.g. only min(external_service_id) gets to choose the name) may work. But you get into issues like min(external_service_id) having its access token revoked and never syncing (this seems somewhat likely for the sourcegraph.com use case). I think flip-flopping is acceptable, as long as we don't flip-flop on name.

> 2. Streaming insertion of new repos
>
> I've tested this and it works as expected.

Great

> 3. SyncSubset still works. This is all that sourcegraph.com uses. I can't remember if non dot-com instances use this; maybe streaming inserts?
>
> In the new implementation all instances will use this since, as you guessed, it's used in the streaming inserter. So given that it's used in all code paths now, and we've included a bunch of new tests as well as done a fair amount of manual testing, I'm fairly confident that it still works.

Sounds good.

@asdine
Contributor

asdine commented Sep 7, 2020

@keegancsmith We have added some logic for resolving name conflicts when they happen, this should prevent the flip-flopping when a repo is renamed for any reason.


@keegancsmith keegancsmith left a comment


LGTM; some concerns over the size of the names list, which I commented on inline. One wishes names were not unique and ExternalRepoSpec was the primary key :)

What is the behaviour when syncers run in parallel? I am guessing this conflicting code may run in a strange way. However, I believe that is probably fine to ignore since a future run will converge to the correct values.

In cmd/repo-updater/repos/syncer.go:

	return errors.Wrap(err, "syncer.sync.store.list-repos")
	}
	conflicting = conflicting.Filter(func(r *Repo) bool {
		for _, id := range r.ExternalServiceIDs() {
			if id == externalServiceID {
keegancsmith commented on this snippet:
probably not worth it, but it seems this filter could relatively easily be represented in SQL.

Reply (Contributor):
We wanted to modify ListRepos to allow filtering by external service ID and names, but apparently it is not supported. We decided to keep that change for another PR, even though it feels wrong doing that filtering at runtime.

@ryanslade
Contributor Author

> What is the behaviour when syncers run in parallel? I am guessing this conflicting code may run in a strange way. However, I believe that is probably fine to ignore since a future run will converge to the correct values.

Yes, there's a chance that two syncers could try to modify the same repo in different transactions. One will fail, be requeued, and should succeed on the next sync, so they'll eventually converge.

@ryanslade ryanslade merged commit d00157a into main Sep 9, 2020
@ryanslade ryanslade deleted the sync_single_external_service branch September 9, 2020 12:22
ryanslade added a commit that referenced this pull request Sep 9, 2020
ryanslade added a commit that referenced this pull request Sep 9, 2020