Tool to create the PGA dataset.
The following commands exist:
repack - downloads the latest GHTorrent MySQL dump and repacks it with only the required files (optional step).
discover - extracts the needed information from the GHTorrent MySQL dump on the fly. Requires only 1.5 GB of storage.
select - compiles the list of repositories to clone according to various filters, such as stars or languages.
index - creates the index.
set-forks - adds fork counts.
There are 64-bit binaries for Linux, macOS and Windows on the Releases page.
Build from source
go get -v github.com/src-d/datasets/PublicGitArchive/pga-create
Obtain the list of repositories to clone
The list must be a text file with one URL per line. The paper selects
GitHub repositories with ≥50 stars; the following commands apply the
same filter and write the result to repository_list.txt:
pga-create discover
pga-create select -m 50 > repository_list.txt
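Since the list must contain exactly one URL per line, it can be worth checking the file before handing it to the cloning step. The following is a minimal sketch, not part of pga-create: check_repository_list is a hypothetical helper name, and the test for "looks like a URL" (a scheme followed by :// and no whitespace) is an assumption about the format of the list.

```shell
# Hypothetical helper (not part of pga-create): report every line of a
# repository list that is empty or is not a single URL-like token.
# Returns 0 if all lines pass, 1 if any line is reported.
check_repository_list() {
    awk '
        # Assumed format: scheme://rest, with no spaces or tabs in the line.
        !/^[a-z][a-z0-9+.-]*:\/\/[^ \t]+$/ {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Usage (assuming the file name used above):
#   check_repository_list repository_list.txt
```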
You will need Borges and all its dependencies: RabbitMQ and PostgreSQL. The following commands are an artificially simplified cloning scenario; please refer to the Borges docs for the detailed manual.
In the first terminal execute
borges init
borges producer --source=file --file repository_list.txt
In the second terminal execute
export CONFIG_ROOT_REPOSITORIES_DIR=/path/where/repositories/will/be/stored
borges consumer
To process the downloaded repositories, run the pga-create index command against the database populated in the previous step. This generates a CSV with the information extracted from all those repositories.
The same environment variables as in Borges can be used to configure the database access.
pga-create index --debug --logfile=pga-create-index.log
The options accepted by pga-create index are the following:
  -o, --output=        csv file path with the results (default: data/index.csv)
      --debug          show debug logs
      --logfile=       write logs to file
      --limit=         max number of repositories to process
      --offset=        skip initial n repositories
      --workers=       number of workers to use (defaults to number of CPUs)
      --repos-file=    path to a file with a repository per line, only those will be processed
  -s, --stars=         input path for the file with the numbers of stars per repository (default: data/stars.gz)
  -r, --repositories=  input path for the gzipped file with the repository names and identifiers (default: data/repositories.gz)
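The --offset and --limit options suggest that a large index run could be split into batches. The sketch below only prints the commands rather than running them. print_index_batches is a hypothetical helper, the per-batch output file names are made up for illustration, and the approach assumes that repositories are enumerated in the same order on every run, which the option descriptions do not confirm.

```shell
# Hypothetical helper: print (do not run) one pga-create index command per
# batch. Only -o, --offset and --limit from the option list are used.
# $1 = total number of repositories, $2 = batch size (must be > 0).
print_index_batches() {
    total=$1
    batch=$2
    if [ "$batch" -le 0 ]; then
        return 1
    fi
    offset=0
    while [ "$offset" -lt "$total" ]; do
        echo "pga-create index --offset=$offset --limit=$batch -o data/index_$offset.csv"
        offset=$((offset + batch))
    done
}

# Usage: print_index_batches 250 100
# prints three commands, with offsets 0, 100 and 200.
```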
To set the SIZE field properly, pga-create index relies on the default temporary directories configuration of the core-retrieval dependency, except for the CONFIG_CLEAN_TEMP_DIR environment variable, which must be set to true:
CONFIG_CLEAN_TEMP_DIR=true pga-create index --debug --logfile=pga-create-index.log
NOTE: by default this spawns as many workers as there are CPUs on the machine. Some repositories are considerably large, so this process may consume a very large amount of memory.
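One way to keep memory use in check is to pass an explicit --workers value instead of relying on the default. The sketch below caps the worker count by a memory budget. pick_workers is a hypothetical helper, and the memory needed per worker is a guess that the operator must supply, since no figure is documented here.

```shell
# Hypothetical helper: choose a value for --workers.
# $1 = number of CPUs, $2 = memory budget in GB,
# $3 = assumed memory per worker in GB (an operator estimate, must be > 0).
pick_workers() {
    cpus=$1
    mem_gb=$2
    per_worker_gb=$3
    if [ "$per_worker_gb" -le 0 ]; then
        return 1
    fi
    by_mem=$((mem_gb / per_worker_gb))
    workers=$cpus
    # Never use more workers than the memory budget allows.
    if [ "$by_mem" -lt "$workers" ]; then
        workers=$by_mem
    fi
    # Always keep at least one worker.
    if [ "$workers" -lt 1 ]; then
        workers=1
    fi
    echo "$workers"
}

# Usage, with 16 CPUs, a 32 GB budget and an assumed 4 GB per worker:
#   pga-create index --workers="$(pick_workers 16 32 4)"
```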
After processing with index you will have a result.csv file with all the content you need. The only missing content will be the FORK_COUNT, which you can add with the also included set-forks command.
This will take
result.csv and add the forks to it, resulting in a
result_forks.csv file with the same data you had in the original CSV, only with the forks added.
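Because result_forks.csv is expected to hold the same rows as result.csv with only the fork counts added, a quick sanity check is to compare the number of lines in both files. This is a minimal sketch: same_line_count is a hypothetical helper, and it counts raw lines, so it assumes that no CSV field contains an embedded newline.

```shell
# Hypothetical sanity check: succeed only if both files have the same
# number of lines. Returns 2 if either file cannot be read.
same_line_count() {
    a=$(awk 'END { print NR }' "$1") || return 2
    b=$(awk 'END { print NR }' "$2") || return 2
    [ "$a" -eq "$b" ]
}

# Usage (file names as in the text above):
#   same_line_count result.csv result_forks.csv && echo "row counts match"
```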