
Public Git Archive

Size: 6.0TB

Paper (accepted to MSR'18). Presentation.

This dataset consists of two parts:

  • Siva files with Git repositories.
  • Index file in CSV format.

In addition, there are a number of auxiliary datasets:

Tools

  • pga - explore the dataset, or download its contents easily.
  • pga-create - reproduce PGA dataset generation.
  • borges-indexer - exports a CSV file with metadata from repositories fetched with Borges.
  • pga2uast - extracts Babelfish UASTs from the HEADs of siva files.
  • list_heads - lists files in each HEAD contained in siva.

Listing and downloading

To see the full list of repositories in the dataset, or to download it, you need to install pga: install Go, then run go get github.com/src-d/datasets/PublicGitArchive/pga.

Then to list all of the repositories in the dataset, simply run:

pga list

If you'd rather get a detailed dump of the dataset (not including the file contents), use either pga list -f json or pga list -f csv.
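For example, the two dump formats can be saved to files for offline inspection. This is a minimal sketch, assuming pga is installed and on your PATH; the helper name dump_pga_index and the output file names are hypothetical and not part of pga.

```shell
# Hypothetical helper: save the PGA index in both dump formats.
# Assumes `pga` is on PATH; the file names below are arbitrary choices.
dump_pga_index() {
    out_dir="${1:-.}"          # target directory, defaults to the current one
    mkdir -p "$out_dir"
    pga list -f csv  > "$out_dir/pga-index.csv"
    pga list -f json > "$out_dir/pga-index.json"
}
```

Calling dump_pga_index ./index would write both files into ./index.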

To download the full dataset, execute:

pga get

Or, if you want to download only those repositories containing at least one line of Java code:

pga get -l java

The pga command accepts a -j/--workers argument that specifies the number of download threads to run; it defaults to 10.
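As an illustration, the language filter and the worker count could be combined in a small wrapper. This is only a sketch: it assumes that -l and -j may be passed together in one pga get invocation, and the function name get_java_repos is hypothetical.

```shell
# Hypothetical wrapper: download only Java repositories with a chosen
# number of download threads.
# Assumption: `pga get` accepts -l and -j in the same invocation.
get_java_repos() {
    workers="${1:-10}"         # 10 is the documented default
    pga get -l java -j "$workers"
}
```

For instance, get_java_repos 20 would request 20 download threads instead of the default 10.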

For more information, check the pga documentation, or simply run pga -h.

Reproduction

Refer to pga-create documentation for more details about how PGA is generated.

Blacklist

We understand that some GitHub projects may become private or be deleted over time. Previous dataset snapshots will continue to include such dead code. If you are the author and want to remove your project from all present and future public snapshots, please send a request to datasets@sourced.tech.
