# Analyzing the Public Git Archive with source{d} engine

Let's start by downloading repositories into the `/repositories` directory.
In order to do this you will need to install the `pga` tool.

Run the following commands in a terminal on your host machine (they need to be executed outside of this Docker container).
If you need to install Go, follow the instructions at https://golang.org/doc/install.

```bash
$ git clone https://github.com/src-d/datasets
$ cd datasets
$ cd PublicGitArchive/pga
$ go install
```

This will install the `pga` tool in `$GOPATH/bin`. Make sure that directory is on your `PATH`.

Try running `pga -h` and you should see the available commands.

### Downloading all src-d repositories

First, let's list all the repositories whose URL matches `/src-d/`. These should be the repositories under the `src-d` organization.

```bash
$ pga list --url /src-d/
https://github.com/src-d/beanstool
https://github.com/src-d/kmcuda
https://github.com/src-d/hercules
https://github.com/src-d/proteus
https://github.com/src-d/lapjv
https://github.com/src-d/go-kallax
https://github.com/src-d/wmd-relax
https://github.com/src-d/enry
https://github.com/src-d/awesome-machine-learning-on-source-code
https://github.com/src-d/go-git
```

Let's now download them and store them in our `repositories` directory (the one at the root of this guide's repository).

```bash
$ pga get --url /src-d/ -o repositories
 8 / 10 [=======================================================>--------------------------------]  80.00%
```

This shouldn't take long.

### Downloading all Java repositories!

Do you want to use all the repositories containing any Java code?
You can use the `--language` or `-l` flag for this in `list` and `get`.

```bash
$ pga list -l java
...
```

More flags will come soon. Feel free to file an issue asking for something extra at https://github.com/src-d/datasets.
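
If you would rather script these downloads, the `pga` calls shown in this section can be assembled from Python. The following is only a sketch: `pga_command` and `run_pga` are hypothetical helper names, not part of `pga`, and it relies only on the flags used above (`--url`, `-l`, `-o`). It assumes the `pga` binary is on your `PATH`.

```python
import subprocess


def pga_command(action, url=None, language=None, output=None):
    """Build the argument list for a `pga list` or `pga get` call.

    Hypothetical helper: it only knows the flags shown in this guide.
    Combining --url and -l in one call is an assumption; the guide
    only shows them separately.
    """
    if action not in ("list", "get"):
        raise ValueError("action must be 'list' or 'get'")
    cmd = ["pga", action]
    if url is not None:
        cmd += ["--url", url]
    if language is not None:
        cmd += ["-l", language]
    if output is not None:
        if action != "get":
            raise ValueError("an output directory only makes sense with 'get'")
        cmd += ["-o", output]
    return cmd


def run_pga(cmd):
    """Run a pga command and return its standard output as a list of lines."""
    result = subprocess.run(
        cmd, check=True, stdout=subprocess.PIPE, universal_newlines=True
    )
    return result.stdout.splitlines()


# Example usage (needs pga installed and network access, so it is not run here):
#   urls = run_pga(pga_command("list", url="/src-d/"))
#   run_pga(pga_command("get", url="/src-d/", output="repositories"))
```

Keeping the command construction separate from its execution makes it easy to print or log the exact command before running it.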


# Time to analyze

Let's first make sure that there are files inside of `/repositories`.

In [1]:
! find /repositories

/repositories
/repositories/siva
/repositories/siva/latest
/repositories/siva/latest/05
/repositories/siva/latest/05/05ea82f75e9ba7c2158e94dd4a714d359d0cab02.siva
/repositories/siva/latest/33
/repositories/siva/latest/33/338126bc0b8a7b447acf1830030a39c16bc39195.siva
/repositories/siva/latest/5d
/repositories/siva/latest/5d/5d7303c49ac984a9fec60523f2d5297682e16646.siva
/repositories/siva/latest/65
/repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva
/repositories/siva/latest/6b
/repositories/siva/latest/6b/6bc52531e707eb4b9b875c418a84f2e100ff6e73.siva
/repositories/siva/latest/73
/repositories/siva/latest/73/738658b11c94345a8003fa41b5d19f39b09bba7f.siva
/repositories/siva/latest/9e
/repositories/siva/latest/9e/9e7f20d3c0a40a715f993db75adfbf56e268a30a.siva
/repositories/siva/latest/c1
/repositories/siva/latest/c1/c13587212de574c5dadeac9fa483367d53717abe.siva
/repositories/siva/latest/cc
/repositories/siva/latest/cc/cce947b98a050c6d356bc6ba9503025

This should have shown a bunch of `siva` files. These are pieces of the repositories we downloaded encoded in siva format. You can learn more about the format in https://github.com/src-d/siva.
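
In the listing above, each `.siva` file appears to live in a directory named after the first two characters of its own file name. As a quick sanity check, the same inspection can be done from Python with the standard library. This is a minimal sketch: `find_siva_files` and `check_layout` are hypothetical helper names, and the 40-hex-character naming rule is inferred from the listing, not taken from the siva documentation.

```python
import re
from pathlib import Path

# Assumption inferred from the listing: 40 hex characters followed by ".siva".
SIVA_NAME = re.compile(r"^[0-9a-f]{40}\.siva$")


def find_siva_files(root):
    """Return all *.siva files below `root`, sorted by path."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.siva"))


def check_layout(path):
    """Return True if the file name looks like a full hash and the parent
    directory is named after the first two characters of the file name."""
    path = Path(path)
    return bool(SIVA_NAME.match(path.name)) and path.parent.name == path.name[:2]


# Example usage inside the container:
#   for f in find_siva_files("/repositories"):
#       print(f, "ok" if check_layout(f) else "unexpected name")
```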

## Loading the repositories into the engine

We will now start using the source{d} engine through its Python API. The libraries are available in this Docker container, so you don't need to do much to get started.

First we need to import some packages.

In [2]:
from sourced.engine import Engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

Next we will create a Spark session, since the source{d} engine is powered by [Spark](https://spark.apache.org/).

In [3]:
spark = SparkSession.builder\
  .master("local[*]").appName("Examples")\
  .getOrCreate()

Now we create an instance of source{d} engine, asking it to parse all of the `siva` formatted files we downloaded.

In [4]:
engine = Engine(spark, "/repositories/siva/latest/*/", "siva")

_Note_: If you had added repos in the `/repositories` directory by using `git clone` directly, you can use `"standard"` as the format parameter instead of `"siva"`.

And finally, let's list all of the repositories we obtained from those files!

In [5]:
engine.repositories.select('id').distinct().show(10, False)

+--------------------------------------------------------+
|id                                                      |
+--------------------------------------------------------+
|github.com/src-d/awesome-machine-learning-on-source-code|
|github.com/src-d/go-git                                 |
|github.com/src-d/hercules                               |
|github.com/src-d/beanstool                              |
|github.com/src-d/proteus                                |
|github.com/src-d/enry                                   |
|github.com/src-d/lapjv                                  |
|github.com/src-d/wmd-relax                              |
|github.com/src-d/go-kallax                              |
|github.com/src-d/kmcuda                                 |
+--------------------------------------------------------+



This should have listed the 10 repositories we fetched, but don't worry if you get more, as the Public Git Archive datasets will evolve over time.

What information do we have for each repository? Well, there's a lot, but a way to start exploring is to see the schema of the `repositories` table.

In [6]:
engine.repositories.printSchema()

root
 |-- id: string (nullable = false)
 |-- urls: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- is_fork: boolean (nullable = true)
 |-- repository_path: string (nullable = true)



In addition to the fields in the schema, we can also access all the references in each repository by using `references`, or simply get the hash of the `HEAD` reference in each repository.

In [7]:
engine.repositories.references.head_ref.show()

+--------------------+---------------+--------------------+---------+
|       repository_id|           name|                hash|is_remote|
+--------------------+---------------+--------------------+---------+
|github.com/src-d/...|refs/heads/HEAD|8adc7b4e324353ea6...|     true|
|github.com/src-d/...|refs/heads/HEAD|c254447c1e1bd7857...|     true|
|github.com/src-d/...|refs/heads/HEAD|98916b85c6fe08f2b...|     true|
|github.com/src-d/...|refs/heads/HEAD|2a161296e79cc1c98...|     true|
|github.com/src-d/...|refs/heads/HEAD|0db3b4b5536e6dc4d...|     true|
|github.com/src-d/...|refs/heads/HEAD|3fea3cb739570b458...|     true|
|github.com/src-d/...|refs/heads/HEAD|205141d7c3b7f600b...|     true|
|github.com/src-d/...|refs/heads/HEAD|b77b1a244948d1a1d...|     true|
|github.com/src-d/...|refs/heads/HEAD|014493bed229e27d8...|     true|
|github.com/src-d/...|refs/heads/HEAD|8fb9cc2fee1b08597...|     true|
+--------------------+---------------+--------------------+---------+



It is also very simple to choose which fields to display with the `select` method.

In [8]:
engine.repositories.references.head_ref.select('repository_id', 'hash').show(10, False)

+--------------------------------------------------------+----------------------------------------+
|repository_id                                           |hash                                    |
+--------------------------------------------------------+----------------------------------------+
|github.com/src-d/kmcuda                                 |8adc7b4e324353ea691743e446b7f433a31b0937|
|github.com/src-d/wmd-relax                              |c254447c1e1bd7857499e638a29e68ddc2df32b6|
|github.com/src-d/go-git                                 |98916b85c6fe08f2be5a235db43957d493ba37b9|
|github.com/src-d/go-kallax                              |2a161296e79cc1c98a5dc303deecc223abb482e5|
|github.com/src-d/enry                                   |0db3b4b5536e6dc4d9109d42897c00a5d92af0a7|
|github.com/src-d/awesome-machine-learning-on-source-code|3fea3cb739570b45845bfad35e41f85307e95097|
|github.com/src-d/proteus                                |205141d7c3b7f600b063260371ec9e41ed8a3827|


## Exploring a bit more

Let's try to fetch the title (the first heading) of the `README.md` file that the `HEAD` reference points to in each repository.

In [66]:
repos = engine.repositories
head_refs = repos.references.head_ref
tree_entries = head_refs.commits.tree_entries
readmes = tree_entries.filter(tree_entries.path == 'README.md')
contents = readmes.blobs.collect()

In [75]:
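for row in contents:
    print(row.repository_id)
    # Blob content is raw bytes: split it into lines and decode each one.
    lines = [l.decode("utf-8") for l in row.content.splitlines()]
    for (i, line) in enumerate(lines):
        if len(line) == 0:
            continue
        # Heading written as "# Title".
        if line[0] == '#':
            print(line)
            break
        # Underlined heading: the title is the line above the "===" line.
        if line[0] == '=' and i > 0:
            print(lines[i-1])
            break
    print('')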

github.com/src-d/kmcuda
"Yinyang" K-means and K-nn using NVIDIA CUDA

github.com/src-d/wmd-relax
Fast Word Mover's Distance

github.com/src-d/go-git
### Basic example

github.com/src-d/go-kallax
## Contents

github.com/src-d/enry
# enry [![GoDoc](https://godoc.org/gopkg.in/src-d/enry.v1?status.svg)](https://godoc.org/gopkg.in/src-d/enry.v1) [![Build Status](https://travis-ci.org/src-d/enry.svg?branch=master)](https://travis-ci.org/src-d/enry) [![codecov](https://codecov.io/gh/src-d/enry/branch/master/graph/badge.svg)](https://codecov.io/gh/src-d/enry)

github.com/src-d/awesome-machine-learning-on-source-code
# Awesome Machine Learning On Source Code [![Awesome Machine Learning On Source Code](https://awesome.re/badge.svg)](https://github.com/src-d/awesome-machine-learning-on-source-code)

github.com/src-d/proteus
# ![proteus](https://rawgit.com/src-d/proteus/master/proteus.svg)

github.com/src-d/beanstool
beanstool [![Circle CI](https://circleci.com/gh/src-d/beanstool.svg?style=svg)](h
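
The heading search in the loop above can also be written as a small, reusable function that is easy to test without Spark. This is only a sketch: `first_heading` and `readme_title` are hypothetical names, and unlike the loop above it strips the leading `#` characters and returns `None` when no heading is found. Like the original, it only recognizes `=` underlines and does not skip `#` comments inside code blocks.

```python
def first_heading(text):
    """Return the first Markdown heading in `text`, or None if there is none.

    Recognizes "# Title" headings and titles underlined with "=" characters.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # Drop the leading "#" markers and surrounding whitespace.
            return stripped.lstrip("#").strip()
        # An underline only counts if a non-empty line precedes it.
        if i > 0 and lines[i - 1].strip() and set(stripped) == {"="}:
            return lines[i - 1].strip()
    return None


def readme_title(content):
    """Decode raw blob bytes and return the first heading, if any."""
    return first_heading(content.decode("utf-8", errors="replace"))


# Example usage with the rows collected earlier:
#   for row in contents:
#       print(row.repository_id, readme_title(row.content))
```

Decoding with `errors="replace"` avoids an exception if a README is not valid UTF-8.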