New borges design #389

jfontan · 2019-03-20T16:12:11Z

Rovers

Also get whether the repository is a fork and which repository is its parent. This can be done by checking for "fork": true in the JSON, then getting the repository through the API and checking source, which is the first parent. This information comes in handy and can be used to find the rooted repo where this repository should be pushed. With it we can accomplish two things:

Do not schedule two repositories from the same rooted repo at the same time, to decrease the chances of a locked siva
If we already know the rooted repo where it's going to be downloaded we can use it as the base to do the fetch and download only the objects needed

It would also be interesting to get size as it could be used to schedule a mix of small and big repositories to decrease the chances of memory starvation.

NOTE: there can be a special discoverer that uses ghtorrent as input.

NOTE: fork is already retrieved by rovers and stored in the database. source needs to be retrieved from the repository information and would require a second call to the API. Maybe using the new GraphQL API can improve this.
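As a minimal sketch of the extraction described above, the following decodes the fork, source and size fields from a GitHub REST repository payload. The struct and function names are hypothetical, the payload in main is made up for illustration, and only the JSON field names come from the public API response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// repoInfo holds the subset of the GitHub repository payload that the
// discoverer would need. The type name is hypothetical.
type repoInfo struct {
	FullName string `json:"full_name"`
	Fork     bool   `json:"fork"`
	Size     int    `json:"size"` // reported by GitHub in kilobytes
	Source   *struct {
		FullName string `json:"full_name"`
		CloneURL string `json:"clone_url"`
	} `json:"source"`
}

// forkEndpoint returns the clone URL found in "source", or an empty
// string when the repository is not a fork or has no source object.
func forkEndpoint(payload []byte) (string, error) {
	var info repoInfo
	if err := json.Unmarshal(payload, &info); err != nil {
		return "", err
	}
	if !info.Fork || info.Source == nil {
		return "", nil
	}
	return info.Source.CloneURL, nil
}

func main() {
	// Illustrative payload, not a real API response.
	payload := []byte(`{"full_name":"someone/go-borges","fork":true,"size":1200,
		"source":{"full_name":"src-d/go-borges","clone_url":"https://github.com/src-d/go-borges.git"}}`)
	endpoint, err := forkEndpoint(payload)
	fmt.Println(endpoint, err)
}
```

The returned value is what would be stored as the fork endpoint of the repository, and size could feed the scheduling mix of small and big repositories.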

Queues

We are using the queues as databases and they become huge. This makes backups and other maintenance more complex than needed. It also makes us rely too much on the information they contain. The rovers queue will be consumed and repositories created in the database instead of creating new jobs. The producer will keep the number of messages in the jobs queue within a given threshold and refill it as needed.
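The refill rule can be sketched as a small function. The low and high watermarks are assumptions, since the proposal only says that the queue is kept within a threshold.

```go
package main

import "fmt"

// refillCount returns how many jobs should be added so that the queue
// goes back up to high once it has dropped below low. It assumes
// low <= high; both watermarks are hypothetical configuration values.
func refillCount(queued, low, high int) int {
	if queued >= low {
		return 0
	}
	return high - queued
}

func main() {
	fmt.Println(refillCount(40, 100, 500))  // prints 460
	fmt.Println(refillCount(250, 100, 500)) // prints 0
}
```

Using two watermarks instead of one avoids sending a tiny batch every time a single job is consumed.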

To make this possible the repositories will have more states and columns.

States:

discovered: the repository was added to the database
pending: repository is in the queue to download
fetching: it's being downloaded
fetched: it was successfully downloaded
error: download had an error

Database:

status_at: time when it changed its state
siva: name of the siva file where it is located
priority: priority for the repository, used in scheduling
fork_endpoint: used to find the siva where it should be stored
error: cause for the error

On error the cause will go into the error column. There are no not_found and auth_req statuses.

Components

The components will change their names to make them more user friendly. Consumer and producer do not mean anything to non-developers.

discoverer: gets the new repositories from rovers or file and fills the database
scheduler: schedules the jobs to be downloaded and sends them to the queue
downloader: downloads or updates repositories

Discoverer

To have the same features as we have now we will have mentions and file discoverers. They will work the same as the current producers but will only create the repositories in the database instead of sending the jobs to the queue.

Scheduler

We had multiple producers (file, mentions, buried, ...) but now all this functionality will be in the scheduler. This will make it feasible to do more complex scheduling if needed.

There are four main types of jobs to schedule:

download: repositories in discovered state, the initial download of a repository
update: repositories in fetched state
retry: repositories in error state
recover: repositories in a transient state that are known not to be downloading (pending, fetching) or that are in the buried queue

Each state could be configured with a ratio to send to the queue. The repositories for each group will be queried taking repository priority into account.
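One way to turn per-type ratios into a batch is sketched here. Integer weights and the rule for distributing the rounding leftover are assumptions, as the proposal does not fix how ratios are expressed.

```go
package main

import "fmt"

type jobType string

const (
	jobDownload jobType = "download"
	jobUpdate   jobType = "update"
	jobRetry    jobType = "retry"
	jobRecover  jobType = "recover"
)

// order fixes the iteration order so the result is deterministic.
var order = []jobType{jobDownload, jobUpdate, jobRetry, jobRecover}

// quota splits a batch of n jobs proportionally to the given weights.
// Slots lost to integer division are handed out one by one, in order,
// to job types that have a non-zero weight.
func quota(n int, weights map[jobType]int) map[jobType]int {
	out := make(map[jobType]int, len(order))
	total := 0
	for _, t := range order {
		total += weights[t]
	}
	if total == 0 || n <= 0 {
		return out
	}
	assigned := 0
	for _, t := range order {
		out[t] = n * weights[t] / total
		assigned += out[t]
	}
	for i := 0; assigned < n; i = (i + 1) % len(order) {
		if t := order[i]; weights[t] > 0 {
			out[t]++
			assigned++
		}
	}
	return out
}

func main() {
	weights := map[jobType]int{jobDownload: 5, jobUpdate: 3, jobRetry: 1, jobRecover: 1}
	fmt.Println(quota(100, weights))
}
```

Each quota would then be filled with a query over the repositories in the matching state, ordered by priority.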

NOTE: As we improve and optimize download methods we could have different queues for big repositories or for repositories that we know beforehand cannot be optimized and will use more memory. We can then process these special repositories in some producers that have fewer workers to minimize the problem of memory starvation.

DISCARDED NOTE: using a queue is interesting as it already provides HA and makes possible to restart the scheduler without stopping the production. Calling directly the scheduler is another option and enables even better scheduling as information like free memory in a worker can be taken into account. Still for now I believe this adds some complexity that may not be worth the effort.

Downloader

Do not get more jobs in case memory consumption is above some threshold.
Use a unified cache for all repositories. We found in gitbase that memory can be controlled more easily using a single cache. There is still the problem of high memory consumption on clone but this has to be tackled separately.
Jobs will have a unique identifier instead of using the repository UUID for better tracking.
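The first point could look like the sketch below. Using the Go heap in use as the measure and a fixed byte limit are assumptions, since the proposal does not say how memory consumption is measured.

```go
package main

import (
	"fmt"
	"runtime"
)

// canTakeJob reports whether the worker may ask the queue for another
// job given the memory in use and the configured limit.
func canTakeJob(usedBytes, limitBytes uint64) bool {
	return usedBytes < limitBytes
}

// heapInUse returns the bytes of heap spans currently in use by this
// process as reported by the Go runtime.
func heapInUse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

func main() {
	const limit = 4 << 30 // hypothetical 4 GiB threshold
	fmt.Println(canTakeJob(heapInUse(), limit))
}
```

A worker would call canTakeJob before reserving each job and sleep for a while when it returns false.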

Optimizations

Do not separate repositories per rooted repo but maintain them in the same siva file: #380

With this change some optimizations can be applied to the clone step.

Make the storer show all references from the rest of the repositories plus the ones from the opened repository. Announcing these references when cloning will make the server send only the objects needed for the update. No push will be needed.
If we already have the location (initial commit from the default branch) in the database we can apply the previous optimization to download new repositories that are a fork from some other repository. Before downloading we check if it is a fork and if that's true we use the same location as the parent.

go-borges siva

The current transactioner has to change to a customizable locker. In our case locks will be implemented with etcd so they can be used in a distributed fashion.

go-siva will use a storer that shows all objects but will mangle references. On writing a reference its name will be changed to add the repository ID. On read it will show all references for all repositories plus some virtual references for the current repository with the correct name (repository ID stripped). This allows download optimizations.

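  // Sketch of the proposed API; error handling is omitted for brevity.
  uuid := "0168e2c7-eedc-7358-0a09-39ba833bdd54"
  // The location is named after the initial commit of the default branch.
  loc, _ := library.Location("49ab543d4930a9c5c6ce5de74d2875cf57ab5d5c")
  repo, _ := loc.Init(uuid)
  r := repo.R()
  // Register the repository as a remote named after its UUID.
  remote, _ := r.CreateRemote(&config.RemoteConfig{
    Name: uuid,
    URLs: []string{"github.com/src-d/go-borges"},
  })
  remote.Fetch(&git.FetchOptions{
    RemoteName: uuid,
    RefSpecs:   []config.RefSpec{FetchRefSpec, FetchHead},
  })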

Rooted Repositories

Single siva file for repositories

Siva files will contain complete repositories instead of single history trees. This is explained in a couple of proposals in borges:

TL;DR: The siva file used to store a repository will be the initial commit of the default branch.

Reference naming

Currently the references for repositories in a rooted repository have a strange nomenclature. The identifier of the repository is added after the original name of the reference:

refs/heads/HEAD/01612921-75cc-2c53-5f7a-ff728f14563e
refs/heads/master/01612921-75cc-2c53-5f7a-ff728f14563e
refs/pull/10/head/01612921-75cc-2c53-5f7a-ff728f14563e
refs/pull/10/merge/01612921-75cc-2c53-5f7a-ff728f14563e
refs/tags/1.2/01612921-75cc-2c53-5f7a-ff728f14563e
refs/tags/v0.5/01612921-75cc-2c53-5f7a-ff728f14563e

And its remote configuration refspec does not match:

[remote "01612921-75cc-2c53-5f7a-ff728f14563e"]
	url = https://github.com/jkk/eidogo
	fetch = +refs/heads/*:refs/remotes/01612921-75cc-2c53-5f7a-ff728f14563e/*

As the repositories are added as remotes in the config file we can use the same system as git, add them to the refs/remotes. Here's the config modified:

[remote "01612921-75cc-2c53-5f7a-ff728f14563e"]
	url = https://github.com/jkk/eidogo
	fetch = +refs/*:refs/remotes/01612921-75cc-2c53-5f7a-ff728f14563e/*
	fetch = +HEAD:refs/remotes/01612921-75cc-2c53-5f7a-ff728f14563e/HEAD

And its references:

refs/remotes/01612921-75cc-2c53-5f7a-ff728f14563e/heads/master
refs/remotes/01612921-75cc-2c53-5f7a-ff728f14563e/HEAD
refs/remotes/01612921-75cc-2c53-5f7a-ff728f14563e/pull/10/merge
refs/remotes/01612921-75cc-2c53-5f7a-ff728f14563e/pull/10/head
refs/remotes/01612921-75cc-2c53-5f7a-ff728f14563e/tags/1.2
refs/remotes/01612921-75cc-2c53-5f7a-ff728f14563e/tags/v0.5

This is much closer to what git does with remotes and also lets us work with these repositories using the git CLI. For example we can unpack a siva file and use git fetch --all to update the repositories in it.
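The mapping implied by the two refspecs above can be sketched with plain string handling. The function names are hypothetical and the code does not use go-git, it only shows how a name is translated in each direction.

```go
package main

import (
	"fmt"
	"strings"
)

// toStored maps a reference name from the original repository to its
// name inside the siva file, following the refspecs
// +refs/*:refs/remotes/<id>/* and +HEAD:refs/remotes/<id>/HEAD.
func toStored(id, name string) string {
	if name == "HEAD" {
		return "refs/remotes/" + id + "/HEAD"
	}
	return "refs/remotes/" + id + "/" + strings.TrimPrefix(name, "refs/")
}

// fromStored reverses toStored. The boolean is false when the stored
// reference belongs to a different repository.
func fromStored(id, stored string) (string, bool) {
	prefix := "refs/remotes/" + id + "/"
	if !strings.HasPrefix(stored, prefix) {
		return "", false
	}
	rest := strings.TrimPrefix(stored, prefix)
	if rest == "HEAD" {
		return "HEAD", true
	}
	return "refs/" + rest, true
}

func main() {
	id := "01612921-75cc-2c53-5f7a-ff728f14563e"
	stored := toStored(id, "refs/heads/master")
	fmt.Println(stored)
	fmt.Println(fromStored(id, stored))
}
```

fromStored is the operation a storer would apply on read to present the current repository with its original reference names.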


jfontan · 2019-03-20T16:12:42Z

Prototype

The prototype is similar to borges pack. It gets a file with a list of URLs and downloads them to siva files. Only a single job and the worst case scenario are implemented, that is, the name of the parent repository is not known beforehand so it has to be calculated.

Create a temporary repository with the remote configured but download only HEAD
Calculate root from the repo
Prepare siva
- If the corresponding siva file does not exist copy the previously downloaded files
- If it already exists add the new remote
Fetch remote in the siva file

It uses go-borges in transactional mode. Crashing the downloading process was tested and the files recovered correctly.

Small repositories, or repos without forks and with only one rooted repo, do not improve much as there's already a fast path for them in the latest version of borges. When the repos have forks or several rooted repos the advantage is bigger.

Problems

go-git transactional.Storer and go-borges siva.Storer don't implement PackfileWriter and the objects are written as loose objects. This function was implemented locally for the prototype.
Using non-transactional storage causes a "feature not supported" error. We still have to check what's not supported.
go-siva / go-billy-siva is quite slow with a big number of files
- each time a file is accessed a new copy of the siva index is retrieved (from memory) and ToSafePaths is applied to it. For the prototype we don't apply ToSafePaths
- go-siva already maintains an updated index that could be used instead of the copy returned by reader.Index. This one should be used to make things faster
- the way to get the files from a directory is using Glob. This checks all the files in the index but that's not needed. The index is kept sorted in memory, so a binary search is enough to find the first one and we can stop when the prefix changes.
generating packfile indexes is quite slow
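The binary search suggested for directory listing can be sketched over a sorted slice of paths. This is not the go-siva index type, a plain []string stands in for it, and listDir is a hypothetical name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// listDir returns every path in the sorted index that lives under dir,
// including paths in nested directories. It finds the first candidate
// with a binary search and stops as soon as the prefix no longer matches.
func listDir(index []string, dir string) []string {
	prefix := strings.TrimSuffix(dir, "/") + "/"
	start := sort.SearchStrings(index, prefix)
	var out []string
	for _, name := range index[start:] {
		if !strings.HasPrefix(name, prefix) {
			break
		}
		out = append(out, name)
	}
	return out
}

func main() {
	// The index must be sorted in byte order for the search to work.
	index := []string{
		"HEAD",
		"config",
		"objects/pack/a.idx",
		"objects/pack/a.pack",
		"refs/heads/master",
		"refs/tags/v1",
	}
	fmt.Println(listDir(index, "refs"))
	fmt.Println(listDir(index, "objects/pack"))
}
```

With n entries in the index and k matches the cost is O(log n + k), whereas matching a glob against every entry is O(n) per call.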

Benchmarks

Tensorflow

Only the main repo.

url: https://github.com/tensorflow/tensorflow
roots: 2
references: 13109

Old:

1166.74user 51.00system 15:44.46elapsed 128%CPU (0avgtext+0avgdata 5896128maxresident)k
8568inputs+3381912outputs (52major+1380164minor)pagefaults 0swaps

New:

downloading github.com/tensorflow/tensorflow
init github.com/tensorflow/tensorflow f41959ccb2d9d4c722fe8fc3351401d53bcf4900
clone: 1m41.816004408s, copy: 212.002587ms, fetch: 2m16.470772185s, commit: 8.353655806s
finished github.com/tensorflow/tensorflow 4m9.539703768s
157.88user 16.50system 4:15.45elapsed 68%CPU (0avgtext+0avgdata 1642792maxresident)k
112inputs+5530488outputs (0major+417796minor)pagefaults 0swaps
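To compare the two runs, the elapsed time and maxresident values printed by time can be turned into ratios. The helper below is only a sketch for that arithmetic and its name is hypothetical. The figures are the ones reported above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// seconds converts an elapsed value such as "15:44.46" or "1:02:03"
// into seconds.
func seconds(elapsed string) (float64, error) {
	total := 0.0
	for _, part := range strings.Split(elapsed, ":") {
		v, err := strconv.ParseFloat(part, 64)
		if err != nil {
			return 0, err
		}
		total = total*60 + v
	}
	return total, nil
}

func main() {
	oldTime, _ := seconds("15:44.46")
	newTime, _ := seconds("4:15.45")
	fmt.Printf("elapsed: %.1fx faster\n", oldTime/newTime)

	// maxresident is reported in kilobytes.
	oldMem, newMem := 5896128.0, 1642792.0
	fmt.Printf("memory: %.1fx lower\n", oldMem/newMem)
}
```

For Tensorflow this gives roughly 3.7 times less elapsed time and 3.6 times less peak memory.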

Sinatra

Small repository

url: https://github.com/sinatra/sinatra
roots: 1
forks: 29

Old:

Size: 9298261
Only time: 5:42

New:

Size: 9951084

207.10user 12.07system 4:05.14elapsed 89%CPU (0avgtext+0avgdata 214776maxresident)k
2160inputs+413648outputs (0major+53398minor)pagefaults 0swaps

Gerrit

references: 98039

Old:

stopped after 1 hour 17 minutes and 523 siva files

New:

downloading github.com/gerrit-review/gerrit
init github.com/gerrit-review/gerrit 23571ab1fa7fedc262d6c21510614353e9d8a4dc
clone: 58.441369269s, copy: 301.315884ms, fetch: 19m58.818910707s, commit: 9.08975328s
finished github.com/gerrit-review/gerrit 21m12.810144736s
1254.15user 30.73system 21:12.93elapsed 100%CPU (0avgtext+0avgdata 1394472maxresident)k
16112inputs+1857072outputs (67major+553372minor)pagefaults 0swaps

NOTE: This repository takes a lot of time writing references (18 minutes of the total are spent writing reference files).

jfontan · 2019-03-22T18:03:10Z

Code for the prototype:

https://github.com/jfontan/borges/tree/new_borges

It uses Go modules, so clone it outside GOPATH. The command takes one argument, a file containing a list of repository URLs. It downloads siva files to the ./sivas directory.

$ go run cmd/main.go repositories.list

jfontan added the proposal proposal for new additions or changes label Mar 20, 2019

jfontan mentioned this issue Mar 22, 2019

Add Init column to repository src-d/core-retrieval#69

Open

jfontan mentioned this issue Jun 19, 2019

[go-borges] Use go-borges to access repositories src-d/gitbase#888

Merged

5 tasks
