pga: the Public Git Archive tool
Use pga to list and download the repositories included in Public Git Archive.
There are no binary distributions available yet, but we plan to release them soon. In the meantime you'll need to compile this tool.
- install Go 1.11+ (https://golang.org/doc/install) and
- fetch and build:
go get github.com/src-d/datasets/PublicGitArchive/pga
- add the directory containing the built binary to your
PATH environment variable or move the binary to somewhere easier to find.
- to verify the installation went well, simply run
pga -h and you should see some help.
There are three subcommands in pga: list, get, and siva.
When you run
pga list two things will happen.
First, a copy of the latest index for the Public Git Archive will be downloaded and cached locally.
Then pga will list all the URLs for the repositories in the index.
By default only the repository URL is displayed, but you can change that with the -f flag:
-f csv will print CSV rows with all the details,
-f json will do the same for JSON.
The extended information includes the fields:
Note that some fields, including
SIZE, can hold the value
-1 to indicate that the index doesn't have information about them. This ensures compatibility between different index versions.
SIZE represents the sum of the sizes of all the siva files you need to collect to get the complete repository. Because a siva file can hold information about several repositories, when you download more than one repository the total number of bytes to download will be at most the sum of their
SIZE values, and it can be less if they share any siva files.
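The bound above can be illustrated with a small Go sketch. The repo type, the URLs, the siva filenames and the per-file sizes are all made up for the example; the real index may not expose sizes per siva file.

```go
package main

import "fmt"

// repo is a hypothetical, simplified index entry: the siva files a
// repository needs, keyed by filename, with each file's size in bytes.
type repo struct {
	url   string
	sivas map[string]int64
}

// size mirrors the SIZE field: the sum of all siva files the repository needs.
func (r repo) size() int64 {
	var total int64
	for _, s := range r.sivas {
		total += s
	}
	return total
}

// downloadSize counts each distinct siva file once, which is what has to be
// transferred when several repositories are fetched together.
func downloadSize(repos []repo) int64 {
	seen := map[string]bool{}
	var total int64
	for _, r := range repos {
		for name, s := range r.sivas {
			if !seen[name] {
				seen[name] = true
				total += s
			}
		}
	}
	return total
}

func main() {
	// Two made-up repositories that share bb.siva.
	repos := []repo{
		{"github.com/example/a", map[string]int64{"aa.siva": 100, "bb.siva": 50}},
		{"github.com/example/b", map[string]int64{"bb.siva": 50, "cc.siva": 30}},
	}
	var sum int64
	for _, r := range repos {
		sum += r.size()
	}
	fmt.Println("sum of SIZE values:", sum)                // 230
	fmt.Println("bytes to download:", downloadSize(repos)) // 180
}
```

Here the shared file is counted twice in the sum of SIZE values but only needs to be downloaded once.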
You can also add filters to decide which repositories you want to see. For now only two are implemented:
-l java,go will list only repositories that have at least some code in those two languages,
-u regexp will list only the repositories for which the URL matches the given regular expression.
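To get a feel for how a URL filter of this kind behaves, here is a minimal Go sketch. It assumes the pattern uses Go regular expression syntax and is matched unanchored, which this document does not confirm; filterURLs and the URL list are hypothetical and are not part of pga.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// filterURLs keeps the URLs matched by pattern. The match is unanchored,
// so the pattern may match anywhere inside the URL.
func filterURLs(urls []string, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, u := range urls {
		if re.MatchString(u) {
			out = append(out, u)
		}
	}
	return out, nil
}

func main() {
	// Illustrative URLs only; they are not taken from the index.
	urls := []string{
		"https://github.com/src-d/go-git",
		"https://github.com/example/src-d-mirror",
	}
	kept, err := filterURLs(urls, `github\.com/src-d/`)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(kept) // [https://github.com/src-d/go-git]
}
```

Under these assumptions an unescaped dot matches any character, so escaping it as in the sketch keeps the filter strict.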
You can always use any of your favorite tools, such as jq, to decide what repositories to download, and then
pass the resulting list of siva files back to pga get.
Read below how to download repositories given the siva filenames.
Downloading siva files
To download siva files, use the same filters as with pga list: simply replace list with get! You also get a couple of extra flags.
-o path determines under what path the siva files should be stored.
- if the path is a URL with schema
hdfs, HDFS will be used.
- if the path is a URL with schema
-j n sets the maximum number of downloads happening concurrently; it defaults to
Downloading siva files given their names
Simply pass a list of siva filenames through standard input to pga get -i.
For instance, this command lists all of the repositories under github.com/src-d, filters out those with 50 files or fewer,
and downloads the siva files with
pga get to the repositories directory:
pga list -u github.com/src-d/ -f json | jq -r 'select(.fileCount > 50) | .sivaFilenames[]' | pga get -i -o repositories
Note on partial downloads
When you run pga get, the tool will check whether the files already
downloaded match the md5 hash of the files on the server. If that's the case,
the files will not be downloaded again.
This provides a simple way to resume failed downloads. Simply run the tool again.
Extracting files from downloaded siva-s
The following will write the contents of each HEAD revision contained in a siva file to the current working directory:
pga siva dump /path/to/siva
-o /output/path allows setting an output path other than the current working directory.
Listing the commits and references in a downloaded siva file
pga siva list /path/to/siva
The output format is JSON. In the "commits" dictionary, each value is the list of the commit's parents. In the "references" dictionary, each value is the reference's target.
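As a sketch of how this output could be consumed from Go, the snippet below decodes the two dictionaries and finds the commits without parents. It assumes that "commits" and "references" are top-level keys and that hashes and targets are plain strings; the sample data and the helper names are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// sivaListing mirrors the two dictionaries described above.
// Assumption: both are top-level keys and all values are strings.
type sivaListing struct {
	Commits    map[string][]string `json:"commits"`
	References map[string]string   `json:"references"`
}

// parseListing decodes the JSON listing into a sivaListing.
func parseListing(data []byte) (sivaListing, error) {
	var l sivaListing
	err := json.Unmarshal(data, &l)
	return l, err
}

// rootCommits returns, in sorted order, the commits that have no parents.
func rootCommits(l sivaListing) []string {
	var roots []string
	for hash, parents := range l.Commits {
		if len(parents) == 0 {
			roots = append(roots, hash)
		}
	}
	sort.Strings(roots)
	return roots
}

func main() {
	// Made-up sample with short placeholder hashes.
	sample := []byte(`{
	  "commits": {"c1": [], "c2": ["c1"], "c3": ["c2", "c1"]},
	  "references": {"refs/heads/master": "c3"}
	}`)
	l, err := parseListing(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("root commits:", rootCommits(l))
	for name, target := range l.References {
		fmt.Printf("%s -> %s (parents: %v)\n", name, target, l.Commits[target])
	}
}
```

Following the parent lists from a reference's target walks the history reachable from that reference.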
Dumping the raw siva contents (advanced)
It is possible to extract the raw contents of a siva archive with
pga siva unpack /path/to/siva
It is possible to specify a regular expression for matching specific files to be extracted: