
Fill TODOs, improve doc

Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
r0mainK committed Oct 29, 2019
1 parent b89df12 commit d6f083a54e767d87a1a5030dc2adece8a50479f0
@@ -13,7 +13,7 @@ Besides, there is a number of auxiliary datasets:
* [configs.tar.xz](https://drive.google.com/open?id=1_cij4BMrPiKVBVdZyUzg1iOhB3pL6EPR) - raw git config files for each siva.
* [heads.csv.xz](https://drive.google.com/open?id=136vsGWfIwfd0IrAdfphIU6lkMmme4-Pj) - mapping from HEAD UUID to repository name.

Since the second version of PGA, we additionally provide the derived [dataset of UASTs](../PublicGitArchiveUASTs), extracted from the files in the latest revision of each repository.

## Tools

@@ -18,17 +18,17 @@ There are three subcommands in `pga`: `list`, `get`, and `siva`.

### Datasets

Two datasets are exposed through this tool, and both the `list` and `get` commands can be used to explore and retrieve them. You must specify which dataset you want to work with:

- `siva`: The original Public Git Archive dataset, made of Siva files.
- `uast`: The [dataset](../../PublicGitArchiveUASTs) of extracted UASTs from the HEAD revision of each repository, made of Parquet files.

Note that the `siva` _command_ does not work with Parquet files.

### Listing repositories

When you run `pga list`, two things will happen.
First a copy of the latest index for the specified dataset will be downloaded and cached locally.
Then `pga` will list all the URLs for the repositories in the index.

By default only the repository URL is displayed, but you can change that with the `--format` flag:
@@ -40,7 +40,7 @@ The extended information includes the fields:
- `URL`, `SIVA_FILENAMES`, `FILE_COUNT`, `LANGS`, `LANGS_BYTE_COUNT`, `LANGS_LINES_COUNT`, `LANGS_FILES_COUNT`, `COMMITS_COUNT`, `BRANCHES_COUNT`, `FORK_COUNT`, `EMPTY_LINES_COUNT`, `CODE_LINES_COUNT`, `COMMENT_LINES_COUNT`, `LICENSE`, `STARS` and `SIZE` for the original dataset.
- `URL`, `PARQUET_FILENAMES`, `FILE_COUNT`, `SIZE`, `FILE_EXTRACT_RATE`, `BYTE_EXTRACT_RATE`, `LANGS`, `LANGS_FILE_COUNT`, `LANGS_BYTE_COUNT`, `LANGS_FILE_EXTRACT_RATE` and `LANGS_BYTE_EXTRACT_RATE` for the UASTs dataset.

Note that the fields `STARS` and `SIZE` can hold the value `-1` to point out that the index doesn't have information about those for the original dataset. This ensures compatibility between different index versions.

`SIZE` represents the sum of the sizes of all the siva files you need to collect to get the complete repository. Because a siva file can hold information about several repositories, when you download more than one repository the total number of bytes to download will be at most the sum of their `SIZE` values, and it can be less if they share any of the siva files.
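
To illustrate, here is a minimal sketch (with made-up siva file names and sizes) of why the actual download can be smaller than the sum of the `SIZE` values:

```Python
# Hypothetical siva file sizes, in bytes
siva_sizes = {"a.siva": 100, "b.siva": 250, "c.siva": 40}
# Two hypothetical repositories that share b.siva
repo_sivas = {
    "repo1": ["a.siva", "b.siva"],  # SIZE = 350
    "repo2": ["b.siva", "c.siva"],  # SIZE = 290
}

# Upper bound reported by the index: the sum of the per-repository SIZE values
upper_bound = sum(siva_sizes[f] for files in repo_sivas.values() for f in files)
# Actual download: each shared siva file is fetched only once
unique_files = {f for files in repo_sivas.values() for f in files}
actual = sum(siva_sizes[f] for f in unique_files)

print(upper_bound, actual)  # 640 390
```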

@@ -93,7 +93,38 @@ ORDER BY repo
The cool property of ClickHouse is that the following sample queries finish within a minute
on a single machine with 64 vcores and RAID-0 over NVMe SSDs.

```sql
-- Count the number of distinct repositories and files
SELECT COUNT(DISTINCT repo) AS repo_count,
       COUNT(DISTINCT repo, file) AS file_count
FROM uasts;

-- Extract all C# keywords
SELECT *
FROM uasts
WHERE lang = 'csharp'
  AND type = 'Keyword';

-- Extract all comments in files from source{d} repositories
SELECT *
FROM uasts
WHERE repo LIKE 'src-d/%'
  AND type = 'Comment';

-- Extract all identifiers from Go files, excluding vendored files
SELECT *
FROM uasts
WHERE lang = 'go'
  AND file NOT LIKE '%vendor/%'
  AND type = 'Identifier';
```

For more complex queries, e.g. extracting imports, some tricks are required (see the [limitations](#limitations)); however, it can be [done](https://github.com/src-d/ml-mining).


### Origin

@@ -106,10 +137,19 @@ Google Cloud.

As was already mentioned in the [schema section](#schema), the UASTs are flattened in a lossy way.
This includes the aggressive normalization of the data for each programming language. While we did our
best to detect possible problems early on, a few sneaked in when it was too late to fix them. Furthermore, the dataset was not updated after the 21st of September 2019, so any commit merged after that date is not included.

Below is an incomplete list of known issues:

* Duplication:
* Some repositories were processed multiple times, because they appeared in multiple Parquet files. We created the DB by iterating on each Parquet file, and did not check for duplicate rows.
* Errors inherited from Babelfish that are language-specific, e.g. [duplicate go comments](https://github.com/bblfsh/go-driver/issues/56).
* Missing data:
* Due to how we traversed the trees, some useful information was erased, e.g. all Ruby imports were unfortunately discarded. See [here](https://github.com/src-d/uast2clickhouse/issues/11).
* Errors inherited from Babelfish that are language-specific, e.g. some drivers do not keep keywords.
* Wrong ordering: the `left` and the `right` columns are sometimes unreliable, due to how we traversed the UASTs and propagated positional information. The line numbers should always be correct though.
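
For the duplication issue above, a minimal workaround is to deduplicate rows on the fly, e.g. with `SELECT DISTINCT`. The sketch below uses the third-party `clickhouse-driver` Python package (not required by the dataset itself), and the connection parameters are placeholders:

```Python
from clickhouse_driver import Client  # third-party ClickHouse client, assumed installed

# Placeholder connection parameters - adapt them to your ClickHouse instance
client = Client(host="localhost")

# Deduplicate identical rows on the fly instead of relying on the table contents
rows = client.execute(
    "SELECT DISTINCT repo, file "
    "FROM uasts "
    "WHERE lang = 'go' AND type = 'Comment'"
)
print(len(rows))
```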

If you notice something strange or have trouble using the dataset, please speak up [here](https://github.com/src-d/uast2clickhouse/issues) and we will help to find a workaround. The dump will not be updated until Public Git Archive v3 is released in 2020.

### License

@@ -2,28 +2,68 @@ UASTs extracted from Public Git Archive ![size 5TB](https://img.shields.io/badge
=======================================

The [Universal Abstract Syntax Trees](https://doc.bblf.sh/uast/uast-specification-v2.html) (UASTs) extracted
from the latest (HEAD) revision of every Git reference contained in [Public Git Archive](../../PublicGitArchive).
The dataset is distributed as Parquet files, which you can download using the [pga CLI](../PublicGitArchive/pga).
There is also a [ClickHouse DB version](ClickHouse) which is more lightweight and easier to work with.

### Format

The Parquet files were created by running [pga2uast](../PublicGitArchive/pga2uast) on the HEAD commit of each repository in the original dataset. They have 3 columns, one row per file:

- `head` (string): the UUID of the repository of the given file. You can use [this mapping](https://drive.google.com/open?id=136vsGWfIwfd0IrAdfphIU6lkMmme4-Pj)
to obtain the repository names from UUIDs.
- `path` (string): the filepath to the given file, in the repository structure.
- `uast` (variable-length byte array): the UAST of the given file.
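
As a small sketch of how these columns fit together, here is one way to read a single Parquet file with pandas and attach repository names to the `head` UUIDs using the mapping above (the file paths are placeholders, and the mapping CSV is assumed to contain two header-less columns, UUID then repository name):

```Python
import pandas as pd

# Read the three columns of one Parquet file (placeholder path)
df = pd.read_parquet("/path/to/file.parquet", columns=["head", "path", "uast"])

# heads.csv.xz maps HEAD UUIDs to repository names; the column names below are assumed
heads = pd.read_csv("heads.csv.xz", names=["head", "repo"])

# Attach the repository name to each file
df = df.merge(heads, on="head", how="left")
print(df[["repo", "path"]].head())
```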

### Usage

The Parquet files can be read using any library that supports the format; however, if you need to process a large part of the dataset we strongly advise using Spark. The UASTs are stored as byte arrays, so you can use any of the [Babelfish client libraries](https://doc.bblf.sh/using-babelfish/clients.html) to read and manipulate them.

For example, this is how to extract all identifiers from the UASTs in a given Parquet file:

```Python
import bblfsh
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType

# Create the Spark config - tune accordingly
conf = SparkConf().setAll([ ... ])

# Create the Spark session - the master URL may differ depending on your cluster
spark = SparkSession.builder \
    .appName("pga-example") \
    .master("spark://spark-spark-master:7077") \
    .config(conf=conf) \
    .getOrCreate()

# Define the function that extracts the identifiers from a single UAST
def extract_identifiers(uast):
    ctx = bblfsh.decode(uast)  # Decode the byte array and create the context
    identifiers = []
    for node in ctx.filter("//uast:Identifier"):  # Iterate over the identifier nodes
        node = node.load()  # Load the node in memory
        identifiers.append(node["Name"])  # Extract the identifier from the node
    return identifiers

# Create the Spark user-defined function from the above function
extract_identifiers_udf = udf(extract_identifiers, ArrayType(StringType()))

# Apply the pipeline, then trigger execution with `show`
df = spark.read.parquet("/path/to/parquet")
df = df.withColumn("identifier", explode(extract_identifiers_udf(df.uast))) \
    .select("head", "path", "identifier")
df.show()
```

Please note that the [Babelfish Python client library](https://github.com/bblfsh/python-client) needs to be present on the Spark workers for this snippet to function, **not only on the driver.**

### Origin

@@ -32,28 +72,28 @@ Please refer to [this GitHub issue](https://github.com/src-d/ml-backlog/issues/7
the procedure in high detail. It was quite sophisticated because we wanted to cover as much data as we could.
We used 11 "Start-2-L" machines on online.net.

### Limitations

| | # of repos | # of files | # of distinct files | % of duplicates |
|:---------:|:----------:|:----------:|:-------------------:|:---------------:|
| **PGA** | 220,174 | 40,971,787 | 40,829,244 | 0.3 % |
| **UASTs** | 218,023 | 36,162,330 | 35,991,340 | 0.5 % |

As the above table shows, we were not able to process 100% of the HEAD of Public Git Archive. We did not process all the languages because Babelfish currently has drivers for only 9 languages. Furthermore, some files proved to be too large to be processed in a reasonable amount of time. Combined with parsing errors and bugs on Babelfish's side, this resulted in missing ~12% of all parsable files in the HEAD of PGA, which account for ~45% of all the data in bytes. As the table below shows, the distribution of errors across languages is not uniform: for instance, the C++ driver, which handles all C-like languages (C, C++, Metal, Cuda), performed worse than the others, while the Go driver performed much better.

| | # of distinct files processed | % of files processed | # of bytes processed | % of bytes processed |
|:--------------:|:-----------------------------:|:--------------------:|:--------------------:|:--------------------:|
| **All parsable** | 35,991,340 | 88.15 % | 484.7 GB | 65.37 % |
| |
| **Go** | 4,126,578 | 99.88 % | 56.48 GB | 96.12 % |
| **Python** | 2,994,169 | 89.70 % | 22.84 GB | 84.36 % |
| **C++** | 8,726,368 | 80.41 % | 92.85 GB | 63.69 % |
| **C#** | 2,379,754 | 98.99 % | 15.43 GB | 93.12 % |
| **Java** | 6,985,742 | 96.85 % | 42.19 GB | 95.26 % |
| **JavaScript** | 10,466,131 | 80.54 % | 227.68 GB | 50.09 % |
| **Ruby** | 1,143,654 | 96.70 % | 3.42 GB | 91.56 % |
| **PHP** | 2,888,395 | 87.64 % | 15.55 GB | 71.92 % |
| **Shell** | 1,118,453 | 87.54 % | 8.26 GB | 25.97 % |

### License
