mention pga on the docs

Signed-off-by: Francesc Campoy <campoy@golang.org>
campoy committed Mar 3, 2018
1 parent bf76d0c commit 3f3d7ec52bceea7ce89359154941e9f67b4e8cf6
Showing with 62 additions and 15 deletions.
  1. +23 −15 PublicGitArchive/README.md
  2. +8 −0 PublicGitArchive/pga/README.md
  3. +31 −0 PublicGitArchive/web/md/index.md
@@ -8,32 +8,40 @@ This dataset consists of two parts:

## Tools

* [pga](pga) - explore the dataset, or download its contents easily.
* [multitool](multitool) - compile the list of repositories and retrieve an existing dataset.
* [borges-indexer](borges-indexer) - exports a CSV file with metadata from repositories fetched with Borges.

## Download
## Listing and downloading

To download an existing dataset, execute:
To see the full list of repositories in the dataset or download it, you will need to install
[pga](pga).
Simply install Go and then run `go get github.com/src-d/datasets/PublicGitArchive/pga`.

```
multitool get-index -o index.csv
# create siva_file_list.txt from index.csv by running custom query code
cat siva_file_list.txt | multitool get-dataset -o /path/where/repositories/will/be/stored
Then to list all of the repositories in the dataset, simply run:

```bash
pga list
```

One-liner to fetch all the files:
If you'd rather get a detailed dump of the dataset (not including the file contents)
you can choose either `pga list -f json` or `pga list -f csv`.

```
multitool get-index | grep -oP '[a-z0-9]{40}\.siva' | multitool get-dataset -o /path/where/repositories/will/be/stored
To download the full dataset, execute:

```bash
pga get
```

`get-dataset` command has `-j/--workers` argument which specifies the number of downloading threads
to run.
Or if you want to download only those repositories containing at least a line of Java code:

Both `get-index` and `get-dataset` have `-b/--base` argument which specifies the base URL of the datasets.
source{d}'s address is hardcoded to be the default.
```bash
pga get -l java
```

The `pga` command has a `-j/--workers` argument that specifies the number of download threads to run; it defaults to 10.

Example of getting only Java repositories: [examples/java.md](examples/java.md).
For more information, check the [pga documentation](pga), or simply run `pga -h`.
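
The flags mentioned in this section can be combined in a small wrapper. The following is a minimal sketch, not part of `pga` itself: the function name `pga_get_lang`, its defaults, and the `PGA_DRY_RUN` variable are hypothetical, and it assumes that `-l`, `-j` and `-o` can be passed together to `pga get`.

```bash
#!/usr/bin/env bash
# Sketch: download the repositories for one language with a custom worker count.
# -l, -j and -o are the flags shown in these docs; everything else here
# (function name, defaults, PGA_DRY_RUN) is a hypothetical convenience.
set -euo pipefail

pga_get_lang() {
  local lang="$1"                   # language filter passed to -l, e.g. java
  local workers="${2:-10}"          # download threads passed to -j (documented default: 10)
  local dest="${3:-repositories}"   # output directory passed to -o
  local cmd=(pga get -l "$lang" -j "$workers" -o "$dest")

  if [[ "${PGA_DRY_RUN:-0}" == "1" ]]; then
    # Print the command instead of running it.
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Example: show the command that would fetch Java repositories with 20 workers.
PGA_DRY_RUN=1 pga_get_lang java 20
```

Running the dry run first makes it easy to check the command before starting a long download; unset `PGA_DRY_RUN` to actually invoke `pga`.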

## Reproduction

@@ -43,7 +51,7 @@ The list must be a text file with one URL per line. The paper chooses
repositories on GitHub with ≥50 stars, which is equivalent to
the following commands which generate `list.txt`:

```
```bash
multitool discover -s stars.txt -r repos.txt.gz
multitool select -s stars.txt -r repos.txt.gz -m 50 > list.txt
```
@@ -59,3 +59,11 @@ and downloads the siva files with `pga get` to the `repositories` directory.
```bash
pga list -u github.com/src-d/ -f json | jq -r 'select(.fileCount > 500) | .sivaFilenames[]' | pga get -i -o repositories
```

_Note on partial downloads_

When running `pga get` the tool will check whether the files already
downloaded match the md5 hash of the files on the server. If that's the case,
the files will not be downloaded.

This provides a simple way to resume failed downloads. Simply run the tool again.
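
The skip-if-unchanged check described above can be illustrated with a short sketch. This is not `pga`'s actual implementation: `needs_download` and its arguments are hypothetical, and `md5sum` from GNU coreutils is assumed to be available (macOS ships `md5` instead).

```bash
#!/usr/bin/env bash
# Sketch of the "skip the file if its md5 already matches" idea.
# needs_download is a hypothetical helper, not a pga command.
set -euo pipefail

needs_download() {
  local file="$1"           # local path of a previously downloaded file
  local expected_md5="$2"   # hash the server reports for that file

  # A missing file always has to be downloaded.
  [[ -f "$file" ]] || return 0

  local actual_md5
  actual_md5="$(md5sum "$file" | cut -d' ' -f1)"

  # Succeed (download) on mismatch, fail (skip) when the hashes agree.
  [[ "$actual_md5" != "$expected_md5" ]]
}

# Example: a complete file is skipped, a truncated one is fetched again.
tmp="$(mktemp)"
printf 'siva contents' > "$tmp"
good_md5="$(md5sum "$tmp" | cut -d' ' -f1)"

if needs_download "$tmp" "$good_md5"; then echo "download"; else echo "skip"; fi
printf 'siva' > "$tmp"   # simulate a partial download
if needs_download "$tmp" "$good_md5"; then echo "download"; else echo "skip"; fi
rm -f "$tmp"
```

Because a truncated file no longer matches the server's hash, rerunning the download only fetches the files that are missing or incomplete.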
@@ -2,6 +2,37 @@

This dataset contains all the repositories from GitHub with more than 50 stars.

## Downloading the dataset

To see the full list of repositories in the dataset or download it, you will need to install
[pga](pga).
Simply install Go and then run `go get github.com/src-d/datasets/PublicGitArchive/pga`.

Then to list all of the repositories in the dataset, simply run:

```bash
pga list
```

If you'd rather get a detailed dump of the dataset (not including the file contents)
you can choose either `pga list -f json` or `pga list -f csv`.

To download the full dataset, execute:

```bash
pga get
```

Or if you want to download only those repositories containing at least a line of Java code:

```bash
pga get -l java
```

The `pga` command has a `-j/--workers` argument that specifies the number of download threads to run; it defaults to 10.

For more information, check the [pga documentation](pga), or simply run `pga -h`.

## Links

- ### [CSV file](LINK TO CSV FILE HERE)
