Keep repository set in sync with config #2025

Open
sqs opened this Issue Jan 25, 2019 · 13 comments

@sqs
Member

sqs commented Jan 25, 2019

Tracking issue for repository improvements implemented by @tsenart and @keegancsmith that will ship in Sourcegraph 3.1.

User story:

As a site admin, I configure Sourcegraph with a personal access token for GitHub. I want to share Sourcegraph with my team, but I don't want Sourcegraph to be configured with all of the repositories that my token has access to. Either I update the token, or I limit the scope of synced repositories (e.g., via a search query or regex), and the repositories that are no longer accessible immediately disappear from my Sourcegraph instance.

Open questions:

  • Should we remove the enabled/disabled state for repos? If so, do we (eventually) need to replace it with a "do not search/index" flag?

This tracking issue will likely have a lot of overlap with #914 and #1467, but this issue will describe only what we intend to ship in 3.1.

Work items:

  • Tomás to walk Keegan through existing proposal and code.
  • Experiment with migrating a single code host to use new syncing code while maintaining backwards compatibility.
  • If the above works, migrate one code host at a time in the same way, and partition work per code host.

@sqs sqs added repos roadmap labels Jan 25, 2019

@sqs sqs added this to the 3.1 milestone Jan 25, 2019

@sqs sqs changed the title from Improved repository handling to Improved repository mirroring Jan 28, 2019

@sqs sqs changed the title from Improved repository mirroring to Keep repository set in sync with config Jan 30, 2019

@keegancsmith

Member

keegancsmith commented Jan 30, 2019

Something to keep in mind if we make "repoQuery" the primitive is that GitHub's search API has a rate limit of 20 requests an hour. This will be fine when reacting to config changes, but it may back us into a corner when we want to notice changes in the search results. Additionally, a search request only returns up to 1,000 results. @tsenart's idea of listing everything and doing the filtering ourselves may be necessary, but it also loses some power and is hard to do against public code hosts due to the number of repos.

I am not sure about other code hosts' search APIs, but I don't think they are as limited as GitHub's with respect to rate limits.

@nicksnyder

Member

nicksnyder commented Jan 30, 2019

github's search API has a rate limit of 20 requests an hour
Additionally a search request only returns 1000 results

Are these limitations only on GitHub.com or do they also impact GitHub Enterprise?

@keegancsmith

Member

keegancsmith commented Jan 31, 2019

Some googling didn't turn up any information about GitHub Enterprise; nothing indicated that it would be different for Enterprise vs. cloud. I also don't have any direct experience with GitHub Enterprise (maybe @beyang or @sqs can remember whether we encounter rate limits there).

@sqs

Member Author

sqs commented Jan 31, 2019

GHE does not rate limit by default, but site admins can enable it: https://help.github.com/enterprise/2.16/admin/guides/installation/configuring-rate-limits/.

@nicksnyder

Member

nicksnyder commented Feb 1, 2019

Given FOSDEM and the fact that both Keegan and Tomás are working on 3.0 release blockers, this tracking issue will not get updated with a plan until next week.

@nicksnyder

Member

nicksnyder commented Feb 11, 2019

This is a priority to make progress on, so even if there is nothing to ship in 3.1, there should be a goal to ship some improvements here for 3.2.

@tsenart

Contributor

tsenart commented Feb 12, 2019

This is a priority to make progress on, so even if there is nothing to ship in 3.1, there should be a goal to ship some improvements here for 3.2.

Sounds good to me.

@keegancsmith

Member

keegancsmith commented Feb 13, 2019

FYI we are actively working on this. Our current plan is to take the work @tsenart did and make it shippable just for GitHub code hosts. Then, based on what we learn, we'll create a more detailed plan.

@keegancsmith

Member

keegancsmith commented Feb 13, 2019

Update

We need to be concerned about the external_* columns on the repo table. They
can be unset, or may not be unique as a triple. However, we need them to always
be set and unique. So we need to implement a migration process that is not just
a SQL migration, since we will need to fetch data from the upstream code host
when the fields are not set. We must also identify which external service kind
a row belongs to before attempting to contact a code host. This functionality
probably exists in RepoLookup.

As a path forward, we should investigate making the uniqueness constraint only
apply to non-null columns. This way we can continue on our current path.
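
A minimal sketch of that idea, assuming Postgres partial unique indexes (the index name is hypothetical):

-- Hypothetical: only enforce uniqueness when the external_* triple is set,
-- so legacy rows with NULL values can coexist until they are back-filled.
create unique index repo_external_unique
on repo (external_service_type, external_service_id, external_id)
where external_id is not null
  and external_service_type is not null
  and external_service_id is not null;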

Side-note: We talked about using new columns instead, but those would also
require back-filling the values.

GitHub syncing is working and it can co-exist with the older syncers. It
handles deletes gracefully.

Next

The goal is to merge GitHub support tomorrow so that it doesn't break anything: #2266

  • Test renames.
  • Make migration partial and handle duplicate values.
  • Make it more robust under failures.
  • Tests.
  • TODOs in code.
@keegancsmith

Member

keegancsmith commented Feb 14, 2019

Raw notes from today:

- [ ] still support Kind in API call
- [ ] discuss URN and associating with an external service rather than ID
- [ ] figure out deduplication between sources that yield the same repos
- [ ] are external services soft deleted?
- [ ] need to fix external service being deleted => delete repos
- [ ] to dedup repo metadata conflicts, just pick the newest updated_at timestamp.
- [ ] ensure external_service_ids is the set of sources listing the repo
- [ ] id bigserial

Data model wishlist:
- Associate a repo with 1+ external services.
- Deduplication of repo rows that represent the same upstream repo.
- Make it easier to have more attributes for a repository without updating the table schema (star count, is archived, etc.).
- alias => Prevent conflicts on name in repo table.
- renames => do upserts enforcing uniqueness constraint on external id (see the sketch after this list).
- id bigserial
- enforce that external_service_ids is a set
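
A rough sketch of the renames/upserts item, assuming a unique index exists on the external_* triple (all values below are made up):

-- A rename upstream changes name but keeps the same external_id, so the
-- existing row is updated in place instead of a duplicate being inserted.
insert into repo (name, external_service_type, external_service_id, external_id, metadata)
values ('sourcegraph/enterprise', 'github', 'https://github.com', 'MDEwOlJlcG9zaXRvcnkx', '{}')
on conflict (external_service_type, external_service_id, external_id)
do update set name = excluded.name, metadata = excluded.metadata, updated_at = now();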

blacklist repos

- Directly visit => 
- Search for repo by name (just needs to match substring); Find results in repo => have a toggle just for this
- List repos in admin view
- Visit on codehost and extension communicates

1. sourcegraph/sourcegraph -> sourcegraph/enterprise
2. Created new repo sourcegraph/sourcegraph

1. sourcegraph/sourcegraph id=a
2. sourcegraph/enterprise id=a
3. sourcegraph/enterprise id=a; sourcegraph/sourcegraph id=b

upsert WHERE service_id = 


source_1 creating a row
source_2 creating a row   same

source_1 with different naming pattern
source_2 with default naming pattern    insert same upstream repo, but different names.


- We want name to be unique and canonical. Visiting sourcegraph.com/{name}
  needs to take you to the (only) repository and should map to the upstream
  code host.
- We want a repo to be eventually consistent with the upstream codehost:
  - name
  - metadata
  - existence
  - still configured to sync
- We want to support the OTHER codehost kind. OTHER is a code host without
  metadata (only name and clone URL). Need to handle conflicts with codehosts.
- We want to support multiple configurations for an external service
  kind. Example: Different search queries or different tokens.
         Column          |           Type           |                     Modifiers                     
-------------------------+--------------------------+---------------------------------------------------
 id                      | integer                  | not null default nextval('repo_id_seq'::regclass)
 name                    | citext                   | not null
 external_service_type   # GITHUB
 external_service_id     # https://github.com
 external_id             # ====ASDSDF
 created_at              | timestamp with time zone | not null default now()
 updated_at              | timestamp with time zone | not null default now()
 deleted_at              | timestamp with time zone | 
 metadata                | jsonb                    | not null default '{}'::json

Indexes:
    "repo_pkey" PRIMARY KEY, btree (id)
    "repo_urn" UNIQUE, btree (urn)
    "repo_name_unique" UNIQUE, btree (name)
    "repo_name_trgm" gin (lower(name::text) gin_trgm_ops)
Check constraints:
    "check_external" CHECK (external_id IS NULL AND external_service_type IS NULL AND external_service_id IS NULL OR external_id IS NOT NULL AND external_service_type IS NOT NULL AND external_service_id IS NOT NULL)
    "check_name_nonempty" CHECK (name <> ''::citext)
Referenced by:
    TABLE "discussion_threads_target_repo" CONSTRAINT "discussion_threads_target_repo_repo_id_fkey" FOREIGN KEY (repo_id) REFERENCES repo(id) ON DELETE RESTRICT
    TABLE "global_dep" CONSTRAINT "global_dep_repo_id" FOREIGN KEY (repo_id) REFERENCES repo(id) ON DELETE RESTRICT
    TABLE "pkgs" CONSTRAINT "pkgs_repo_id" FOREIGN KEY (repo_id) REFERENCES repo(id) ON DELETE RESTRICT


unique(external_service_id, external_id)
<1-to-1>
unique(name)

OTHER
@tsenart

Contributor

tsenart commented Feb 15, 2019

Some notes from today's investigation.

Global diff scalability

Currently, loading all of sourcegraph.com's ~3.1M repos from Postgres takes about 10s on my machine. This means that a sync interval of 10s in repo updater wouldn't make sense, but I was also pleasantly surprised by how fast it is.

However, this number will only keep growing without bound for sourcegraph.com in particular. It may not be something we need to address now, but eventually we may need a way to compute a diff between sourced repos (from external services) and stored repos (in the repo table) without having to load all stored repos into memory at once.
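
One possible direction, just a sketch: page through the repo table with keyset pagination and diff each batch against the sourced repos, rather than loading everything at once.

-- Hypothetical batched read; $1 is the highest id seen in the previous batch (start at 0).
select id, name, external_service_type, external_service_id, external_id
from repo
where id > $1
order by id
limit 10000;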

external_xxx columns

The three existing columns external_id, external_service_id and external_service_type suit our needs for deduplicating sourced repos per code host instance. Since we can have multiple external services configured against the same code host instance, and the IDs of repos returned by those instances are only guaranteed to be unique within the scope of a given instance, we need to rely on a unique constraint on (external_id, external_service_id) for repository deduplication. Note that the value of the external_service_id column is the URL of the code host.
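
For example, a query along these lines (hypothetical, untested) would surface rows that violate that uniqueness before we try to add the constraint:

select external_service_id, external_id, count(*)
from repo
where external_id is not null and external_service_id is not null
group by external_service_id, external_id
having count(*) > 1;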

select codehosts.domain, count(*)
from (select split_part(repo.name, '/', 1) as domain from repo) as codehosts
group by codehosts.domain;
                                   
+--------------------------+---------+
| domain                   | count   |
|--------------------------+---------|
| android.googlesource.com | 1       |
| bitbucket.org            | 7       |
| git.nssm.cc              | 1       |
| github.com               | 3141537 |
| gitlab.com               | 927     |
| hg.openjdk.java.net      | 2       |
+--------------------------+---------+

Unfortunately, it seems that many repo rows don't have these columns filled in. This happens in situations where we get rate limited.

Here's the query to show that.

 select codehosts.domain, count(*) 
 from (
    select split_part(repo.name, '/', 1) as domain from repo
    where external_id is null and external_service_id is null
 ) as codehosts 
 group by codehosts.domain;       
                                                                                                                                                                                                                                                                                                                                                   
+--------------------------+---------+
| domain                   | count   |
|--------------------------+---------|
| android.googlesource.com | 1       |
| bitbucket.org            | 7       |
| git.nssm.cc              | 1       |
| github.com               | 749858  |
| hg.openjdk.java.net      | 2       |
+--------------------------+---------+
@nicksnyder

Member

nicksnyder commented Feb 15, 2019

However, this number will only grow unbounded for sourcegraph.com in particular.

Historically, sourcegraph.com has been a special snowflake with respect to repo syncing (i.e. everything is on-demand instead of scheduled). Is that going to change in the future?

@tsenart

Contributor

tsenart commented Feb 17, 2019

Is that going to change in the future?

Nope, good point!
