
gitcollector

gitcollector collects and stores git repositories.

gitcollector is the source{d} tool to download and update git repositories at large scale. To that end, it uses siva, a custom repository storage file format optimized for saving storage space and keeping repositories up to date.


The project is in a preliminary stable stage and under active development.

Storing repositories using rooted repositories

A rooted repository is a bare Git repository that stores all objects from all repositories that share a common history, that is, they have the same initial commit. It is stored using the Siva file format.

Root Repository explanatory diagram

Rooted repositories have a few particularities that you should know to work with them effectively:

  • They have no HEAD reference.
  • All references are of the following form: {REFERENCE_NAME}/{REMOTE_NAME}. For example, the reference refs/heads/master of the remote foo would be refs/heads/master/foo.
  • Each remote represents a repository that shares the common history of the rooted repository. A remote can have multiple endpoints.
  • A rooted repository is simply a repository with all the objects from all the repositories which share the same root commit.
  • The root commit for a repository is obtained by following the first parent of each commit, starting from HEAD, until a commit with no parents is reached.
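
The naming rule and the root commit lookup listed above can be sketched in a few lines of Go. This is only an illustration, not gitcollector's actual code: the commit type, the in-memory commit map, and the function names rootedRefName and rootCommit are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// rootedRefName builds a reference name in the
// {REFERENCE_NAME}/{REMOTE_NAME} form used by rooted repositories.
func rootedRefName(refName, remote string) string {
	return refName + "/" + remote
}

// commit is a hypothetical, minimal stand-in for a Git commit: only the
// parent hashes matter here, with the first parent at index 0.
type commit struct {
	parents []string
}

// rootCommit follows the first parent of each commit, starting at head,
// until it reaches a commit without parents, and returns that commit's hash.
func rootCommit(commits map[string]commit, head string) (string, error) {
	current := head
	// A valid first-parent chain visits each commit at most once, so the
	// loop bound doubles as a guard against malformed (cyclic) input.
	for steps := 0; steps <= len(commits); steps++ {
		c, ok := commits[current]
		if !ok {
			return "", errors.New("unknown commit: " + current)
		}
		if len(c.parents) == 0 {
			return current, nil
		}
		current = c.parents[0]
	}
	return "", errors.New("cycle detected in commit graph")
}

func main() {
	commits := map[string]commit{
		"c3": {parents: []string{"c2", "b1"}}, // merge commit, first parent is c2
		"c2": {parents: []string{"c1"}},
		"c1": {}, // root reached through first parents
		"b1": {}, // root of the merged branch, never visited
	}

	root, err := rootCommit(commits, "c3")
	if err != nil {
		panic(err)
	}

	fmt.Println(rootedRefName("refs/heads/master", "foo")) // refs/heads/master/foo
	fmt.Println(root)                                      // c1
}
```

Two repositories whose walks end at the same root commit share a common history, so they would be stored in the same rooted repository, each as its own remote.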

Getting started

Plain command

gitcollector is used through its download subcommand, which is currently the only subcommand available:

  gitcollector [OPTIONS] download [download-OPTIONS]

Help Options:
  -h, --help                                     Show this help message

[download command options]
          --library=                             path where download to [$GITCOLLECTOR_LIBRARY]
          --bucket=                              library bucketization level (default: 2) [$GITCOLLECTOR_LIBRARY_BUCKET]
          --tmp=                                 directory to place generated temporal files (default: /tmp) [$GITCOLLECTOR_TMP]
          --workers=                             number of workers, default to GOMAXPROCS [$GITCOLLECTOR_WORKERS]
          --half-cpu                             set the number of workers to half of the set workers [$GITCOLLECTOR_HALF_CPU]
          --no-updates                           don't allow updates on already downloaded repositories [$GITCOLLECTOR_NO_UPDATES]
          --no-forks                             github forked repositories will not be downloaded [$GITCOLLECTOR_NO_FORKS]
          --orgs=                                list of github organization names separated by comma [$GITHUB_ORGANIZATIONS]
          --token=                               github token [$GITHUB_TOKEN]
          --metrics-db=                          uri to a database where metrics will be sent [$GITCOLLECTOR_METRICS_DB_URI]
          --metrics-db-table=                    table name where the metrics will be added (default: gitcollector_metrics) [$GITCOLLECTOR_METRICS_DB_TABLE]
          --metrics-sync-timeout=                timeout in seconds to send metrics (default: 30) [$GITCOLLECTOR_METRICS_SYNC]

    Log Options:
          --log-level=[info|debug|warning|error] Logging level (default: info) [$LOG_LEVEL]
          --log-format=[text|json]               log format, defaults to text on a terminal and json otherwise [$LOG_FORMAT]
          --log-fields=                          default fields for the logger, specified in json [$LOG_FIELDS]
          --log-force-format                     ignore if it is running on a terminal or not [$LOG_FORCE_FORMAT]
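
As a rough illustration of how --workers and --half-cpu interact according to the help text above, the following Go sketch computes an effective worker count. The function name effectiveWorkers is hypothetical, and both rounding down and keeping at least one worker are assumptions of this sketch, not documented behavior of gitcollector.

```go
package main

import (
	"fmt"
	"runtime"
)

// effectiveWorkers returns the number of workers to start.
// workers <= 0 means "not set", in which case GOMAXPROCS is used, as the
// --workers help text states. Rounding down and keeping at least one worker
// when halving are assumptions of this sketch.
func effectiveWorkers(workers int, halfCPU bool) int {
	if workers <= 0 {
		workers = runtime.GOMAXPROCS(0)
	}
	if halfCPU {
		workers /= 2
	}
	if workers < 1 {
		workers = 1
	}
	return workers
}

func main() {
	fmt.Println(effectiveWorkers(8, false)) // 8
	fmt.Println(effectiveWorkers(8, true))  // 4
	fmt.Println(effectiveWorkers(1, true))  // 1
}
```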

Usage example (--library and --orgs are always required):

gitcollector download --library=/path/to/repos/directory --orgs=src-d

To collect repositories from several GitHub organizations:

gitcollector download --library=/path/to/repos/directory --orgs=src-d,bblfsh

Note that all the download command options can also be set through the environment variables shown in brackets in the help output above, for example GITCOLLECTOR_LIBRARY for --library.


Docker

A new gitcollector Docker image is uploaded to Docker Hub on each release. To use it:

docker run --rm --name gitcollector_1 \
-e "GITHUB_ORGANIZATIONS=src-d,bblfsh" \
-e "GITHUB_TOKEN=foo" \
-v /path/to/repos/directory:/library \
<gitcollector-image>

Note that you must mount a local directory into the specific container path shown in -v /path/to/repos/directory:/library. The repositories are downloaded into this directory as rooted repositories in siva file format.


License

GPL v3.0, see LICENSE
