In [1]:
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import tabulate
from IPython.display import HTML, display

import common


DATASET = Path('../experiments')

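def d(p):
    # Load the (x, y) columns of a text distribution and wrap them in a
    # Distribution whose three string arguments are left empty.
    x, y = common.load_text_distribution(p)
    return common.Distribution(x, y, '', '', '')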

distributions = {
    'In-degrees': [
        ("Full", d(DATASET / 'inout/full_in.txt')),
        ("Filesystem", d(DATASET / 'inout/dir+cnt_in.txt')),
        ("Commit", d(DATASET / 'inout/rev_in.txt')),
        ("History", d(DATASET / 'inout/rel+rev_in.txt')),
        ("Hosting", d(DATASET / 'inout/ori+snp_in.txt')),
    ],
    'Out-degrees': [
        ("Full", d(DATASET / 'inout/full_out.txt')),
        ("Filesystem", d(DATASET / 'inout/dir+cnt_out.txt')),
        ("Commit", d(DATASET / 'inout/rev_out.txt')),
        ("History", d(DATASET / 'inout/rel+rev_out.txt')),
        ("Hosting", d(DATASET / 'inout/ori+snp_out.txt')),
    ],
    'Connected components': [
        ("Full", d(DATASET / 'connectedcomponents/full/distribution.txt')),
        ("Filesystem", d(DATASET / 'connectedcomponents/dir+cnt/distribution.txt')),
        ("Commit", d(DATASET / 'connectedcomponents/rev/distribution.txt')),
        ("History", d(DATASET / 'connectedcomponents/rel+rev/distribution.txt')),
        ("Hosting", d(DATASET / 'connectedcomponents/ori+snp/distribution.txt')),
    ],
    'Clustering coefficient': [
        ("Full", d(DATASET / 'clusteringcoeff/distribution-full.txt')),
        ("Filesystem", d(DATASET / 'clusteringcoeff/distribution-dircnt.txt')),
        ("Commit", d(DATASET / 'clusteringcoeff/distribution-rev.txt')),
        ("History", d(DATASET / 'clusteringcoeff/distribution-relrev.txt')),
        # ("Hosting", d(DATASET / 'clusteringcoeff/distribution-orisnp.txt')),
    ],
    'Shortest path': [
        ("Filesystem", d(DATASET / 'shortestpath/dir+cnt/distribution.txt')),
        ("Commit", d(DATASET / 'shortestpath/rev/distribution.txt')),
    ]
}

## Graph layer statistics

Statistics of the graph layers and their associated distributions, as reported in the article.

In [2]:
# This cell can take a few minutes to run.
headers = ["Algorithm", "Layer", "Number of objects", "Scaling parameter", "X decades", "Y decades"]
table = []
for algo_name, algo_distributions in distributions.items():
    for name, distribution in algo_distributions:
        row = [
            algo_name,
            name,
            f'{int(np.sum(distribution.y)):,}',
            distribution.fitted_power(),
            np.log10(np.max(distribution.x)),
            np.log10(np.max(distribution.y)),
        ]
        table.append(row)

display(HTML(tabulate.tabulate(table, headers=headers, tablefmt='html')))

Algorithm,Layer,Number of objects,Scaling parameter,X decades,Y decades
In-degrees,Full,19330739526,1.86533,8.47619,10.1453
In-degrees,Filesystem,17050437427,1.86295,8.47619,10.0273
In-degrees,Commit,1976476233,2.20457,5.84003,9.23299
In-degrees,History,1993015770,2.14762,5.84003,9.23155
In-degrees,Hosting,287286329,2.76256,7.03349,8.16881
Out-degrees,Full,19330739526,1.94752,6.01419,9.96291
Out-degrees,Filesystem,17050437427,1.94683,6.01419,9.96169
Out-degrees,Commit,1976476233,5.80822,5.0,9.24394
Out-degrees,History,1993015770,5.80822,5.0,9.24802
Out-degrees,Hosting,287286329,2.20614,4.98671,8.22387


## Data integrity: in and out degrees

This data gives an overview of the graph properties and lets us check whether they are consistent with our expectations, which serves as a data integrity check.

### Node and edge statistics of the studied graph corpus.

These statistics correspond to the dataset at https://annex.softwareheritage.org/public/dataset/graph/2020-12-15/compressed/ (same as Table 1).

**TODO** Confirm that these numbers come from the raw data rather than from a calculation based on the distributions, and provide a script to generate them from the raw data. Fill in the object used by the script below for the automatic integrity check.

|Layer|Node type|Nodes|%|
|:------|:------|------:|---:|
|hosting|origins|147 453 557|0.76%|
||snapshots|139 832 772|0.72%|
|history|releases|16 539 537|0.09%|
||commits|1 976 476 233|10.22%|
|filesystem|directories|7 897 590 134|40.86%|
||contents|9 152 847 293|47.35%|
||Total|19 330 739 526|100%|

|Layer|Edge type|Edges|%|
|:------|:------|------:|---:|
|hosting|origin $\to$ snapshot|776 112 709|0.35%|
||snapshot   $\to$ commit|1 358 538 567|0.61%|
||snapshot   $\to$ release|700 823 546|0.32%|
|history|release    $\to$ commit|16 492 908|0.01%|
||commit     $\to$ commit|2 021 009 703|0.91%|
||commit     $\to$ directory|1 971 187 167|0.89%|
|filesystem|directory  $\to$ directory|64 584 351 336|29.16%|
||directory  $\to$ commit|792 196 260|0.36%|
||directory  $\to$ content|149 267 317 723|67.39%|
||Total|221 488 073 659|100%|

In [1]:
# Compressed dataset 2020-12-15 (same figures as the two tables above)
rawstats={"nodes":
          {
              "origin":147453557,
              "snapshot":139832772,
              "release":16539537,
              "commit":1976476233,
              "directory":7897590134,
              "content":9152847293
          },
          "edges":{
              "origin":{"snapshot":776112709},
              "snapshot":{"commit":1358538567,"release":700823546},
              "release":{"commit":16492908},
              "commit":{"commit":2021009703,"directory":1971187167},
              "directory":{"directory":64584351336,"commit":792196260,"content":149267317723}
          }
         }
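
As a minimal sketch of the automatic check mentioned in the TODO above, the totals and percentage columns of the two tables can be recomputed from a `rawstats`-shaped dictionary and compared with the `Total` rows. The helper names `node_shares` and `edge_shares` are hypothetical (they are not part of `common`), and the dictionary is repeated here only so that the block runs on its own.

```python
# Node and edge counts copied from the rawstats cell above.
rawstats = {
    "nodes": {
        "origin": 147453557, "snapshot": 139832772, "release": 16539537,
        "commit": 1976476233, "directory": 7897590134, "content": 9152847293,
    },
    "edges": {
        "origin": {"snapshot": 776112709},
        "snapshot": {"commit": 1358538567, "release": 700823546},
        "release": {"commit": 16492908},
        "commit": {"commit": 2021009703, "directory": 1971187167},
        "directory": {"directory": 64584351336, "commit": 792196260,
                      "content": 149267317723},
    },
}


def node_shares(stats):
    """Return (total, {node_type: percentage}) for the node counts."""
    total = sum(stats["nodes"].values())
    return total, {t: 100 * n / total for t, n in stats["nodes"].items()}


def edge_shares(stats):
    """Return (total, {(src, dst): percentage}) for the edge counts."""
    flat = {(src, dst): n
            for src, targets in stats["edges"].items()
            for dst, n in targets.items()}
    total = sum(flat.values())
    return total, {pair: 100 * n / total for pair, n in flat.items()}


node_total, node_pct = node_shares(rawstats)
edge_total, edge_pct = edge_shares(rawstats)

print(f"{node_total:,} nodes, {edge_total:,} edges")
for node_type, pct in node_pct.items():
    print(f"{node_type:>10}: {pct:.2f}%")
for (src, dst), pct in edge_pct.items():
    print(f"{src:>10} -> {dst:<10}: {pct:.2f}%")
```

Printing the recomputed totals next to the percentages makes it easy to compare them by eye with the tables; turning the comparison into assertions would be a natural next step once the provenance of the figures is confirmed.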

### Criteria list
Here are a few examples of criteria that can be checked on the table below:

1. The number of nodes computed from a distribution (the sum of its second column) is the same for all distributions starting from the same node type. For instance, `dir_in_*` and `dir_out_*` all have the same number of directory nodes, which has to be equal to the number of directory nodes in the raw Software Heritage dataset (namely 7 897 590 134).
2. The total or average in/out-degree of a given object type is consistent whether each neighbor type is considered independently or all of them are aggregated together (e.g. the average degree of `dir_out_all` is a weighted average of the average degrees of the `dir_out_{cnt,dir,rev}` distributions).
3. The number of objects with a total indegree of 0 should be small for all object types that are supposed to be reachable from the upper layers of the graph.
4. Some specific per-layer indegrees are expected to be relatively small compared to the total number of objects (e.g. most revisions do not have an associated release).
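
Criteria 1 and 2 can be sketched on toy data as follows. A distribution is represented here as a pair `(x, y)` of degree values and node counts, mirroring the two columns of the distribution files; the function names and the neighbor types `a` and `b` are hypothetical and only serve the illustration.

```python
import numpy as np


def node_count(x, y):
    """Number of nodes described by a degree distribution (sum of the counts)."""
    return int(np.sum(y))


def edge_count(x, y):
    """Number of edges: each degree value weighted by the number of nodes having it."""
    return int(np.sum(np.asarray(x) * np.asarray(y)))


def check_aggregate(all_dist, part_dists):
    """Check criteria 1 and 2 for one node type and one direction.

    Returns (same_nodes, same_edges): whether every per-neighbor-type
    distribution covers as many nodes as the aggregated one, and whether
    their edge counts add up to the aggregated edge count.
    """
    n = node_count(*all_dist)
    same_nodes = all(node_count(*part) == n for part in part_dists)
    same_edges = edge_count(*all_dist) == sum(edge_count(*part) for part in part_dists)
    return same_nodes, same_edges


# Toy data: 4 nodes with out-degrees [0, 1, 1, 2] towards type "a"
# and [1, 0, 2, 2] towards type "b", hence [1, 1, 3, 4] overall.
out_a = ([0, 1, 2], [1, 2, 1])
out_b = ([0, 1, 2], [1, 1, 2])
out_all = ([1, 3, 4], [2, 1, 1])

result = check_aggregate(out_all, [out_a, out_b])
print(result)
```

When all distributions cover the same number of nodes, equality of the edge counts also implies that the aggregated average degree is consistent with the per-type averages.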

In [3]:
inout_per_type = [
    'cnt_in_dir',
    'dir_in_all',
    'dir_in_dir',
    'dir_in_rev',
    'dir_out_all',
    'dir_out_cnt',
    'dir_out_dir',
    'dir_out_rev',
    'ori_out_snp',
    'rel_in_snp',
    'rev_in_all',
    'rev_in_dir',
    'rev_in_rel',
    'rev_in_rev',
    'rev_in_snp',
    'rev_out_rev',
    'snp_in_ori',
    'snp_out_all',
    'snp_out_rel',
    'snp_out_rev',
]

headers = ["Node type", "Direction", "Neighbor type", "# Nodes", "# Edges", "Avg degree", "# (Lowest degree)", "# (Second-lowest)"]
table = []
for name in inout_per_type:
    dist = d(DATASET / f'inout/per_type/{name}.txt')
    src, direction, dst = name.split('_')
    row = [
        common.types_verbose[src],
        ("← in " if direction == 'in' else "→ out "),
        common.types_verbose[dst],
        f'{int(np.sum(dist.y)):,}',
        f'{int(np.sum(dist.x * dist.y)):,}',
        np.sum(dist.x * dist.y) / np.sum(dist.y),
        f'{int(dist.y[0]):,} ({int(dist.x[0]):,})',
        f'{int(dist.y[1]):,} ({int(dist.x[1]):,})',
    ]
    table.append(row)

display(HTML(tabulate.tabulate(table, headers=headers, tablefmt='html')))

Node type,Direction,Neighbor type,# Nodes,# Edges,Avg degree,# (Lowest degree),# (Second-lowest)
contents,← in,directories,9152847293,143786784566,15.7095,"5,978,249,005 (1)","1,098,223,970 (2)"
directories,← in,everything,7897590134,65200402547,8.25573,"1,343,830 (0)","6,134,767,929 (1)"
directories,← in,directories,7897590134,63229213027,8.00614,"1,607,262,793 (0)","4,669,554,466 (1)"
directories,← in,revisions,7897590134,1971187167,0.249594,"6,261,880,169 (0)","1,504,272,429 (1)"
directories,→ out,everything,7897590134,207805470722,26.3125,"557,087 (0)","1,713,055,834 (1)"
directories,→ out,contents,7897590134,143786781408,18.2064,"1,787,869,540 (0)","1,421,143,792 (1)"
directories,→ out,directories,7897590134,63229213027,8.00614,"2,753,589,255 (0)","1,734,567,306 (1)"
directories,→ out,revisions,7897590134,789473873,0.0999639,"7,860,017,187 (0)","23,267,141 (1)"
origins,→ out,snapshots,147453557,189314705,1.28389,"22,710,546 (0)","77,244,971 (1)"
releases,← in,snapshots,16539537,700135072,42.331,"427,531 (0)","4,408,973 (1)"


In [None]:
# TODO: small script checking the sums and the other criteria listed above
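
A minimal sketch of such a check, restricted to the directory rows and using figures copied by hand from the table printed above. The tuple layout, the names `edge_gap` and `RAW_DIRECTORIES`, and the choice to report a relative gap instead of requiring exact equality are assumptions of this sketch.

```python
# Figures copied from the per-type table above:
# name -> (# nodes, # edges, # nodes with degree 0)
table = {
    "dir_in_all": (7_897_590_134, 65_200_402_547, 1_343_830),
    "dir_in_dir": (7_897_590_134, 63_229_213_027, 1_607_262_793),
    "dir_in_rev": (7_897_590_134, 1_971_187_167, 6_261_880_169),
    "dir_out_all": (7_897_590_134, 207_805_470_722, 557_087),
    "dir_out_cnt": (7_897_590_134, 143_786_781_408, 1_787_869_540),
    "dir_out_dir": (7_897_590_134, 63_229_213_027, 2_753_589_255),
    "dir_out_rev": (7_897_590_134, 789_473_873, 7_860_017_187),
}
RAW_DIRECTORIES = 7_897_590_134  # rawstats["nodes"]["directory"]

# Criterion 1: every directory distribution must cover all directory nodes.
node_mismatches = [name for name, (nodes, _, _) in table.items()
                   if nodes != RAW_DIRECTORIES]


def edge_gap(total_name, part_names):
    """Relative gap between the aggregated edge count and the per-type sum."""
    total = table[total_name][1]
    parts = sum(table[name][1] for name in part_names)
    return abs(total - parts) / total


# Criterion 2: per-type edge counts should add up to the aggregated count.
in_gap = edge_gap("dir_in_all", ["dir_in_dir", "dir_in_rev"])
out_gap = edge_gap("dir_out_all", ["dir_out_cnt", "dir_out_dir", "dir_out_rev"])

# Criterion 3: share of directories with a total indegree of 0.
orphan_share = table["dir_in_all"][2] / table["dir_in_all"][0]

print("node count mismatches:", node_mismatches)
print(f"relative edge gap (in):  {in_gap:.2e}")
print(f"relative edge gap (out): {out_gap:.2e}")
print(f"directories without ancestors: {orphan_share:.4%}")
```

The same pattern extends to the other node types once their rows are available; reading the figures directly from the distribution files rather than copying them would avoid transcription errors.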

### Discussion 

The data shows a few examples that work fine, and a few nodes without ancestors.

## Data integrity failures encountered during this study

### Error in/out distributions (switch fallthrough bug)


https://forge.softwareheritage.org/rDGRPH6ef89157db57834ad94607f3691e43adaa78a21e


### Data integrity: nodes without ancestors / Compression Pipeline
https://annex.softwareheritage.org/public/dataset/graph/2020-05-20/compressed/ 


### Data integrity: nodes without ancestors / Raw Dataset

By browsing the entire graph, it is possible to list all the nodes without ancestors.  A list of their identifiers and types is available in this replication package ("./experiments/nodesmissingancestor/*").

https://forge.softwareheritage.org/T3660
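
The idea of listing nodes without ancestors can be illustrated on a toy edge list. This is only a sketch: the real graph is far too large for an in-memory Python set, the identifiers below are made up, and treating origins as the only legitimate roots is an assumption of this example.

```python
def nodes_without_ancestors(node_types, edges, root_types=("ori",)):
    """Return (node, type) pairs that no edge points to.

    Node types listed in `root_types` are skipped, since they are expected
    to sit at the top of the graph and have no incoming edge.
    """
    has_parent = {dst for _, dst in edges}
    return sorted(
        (node, kind)
        for node, kind in node_types.items()
        if kind not in root_types and node not in has_parent
    )


# Toy graph: one origin chain, plus a revision that nothing points to.
node_types = {
    "ori1": "ori", "snp1": "snp", "rev1": "rev",
    "rev2": "rev", "dir1": "dir", "cnt1": "cnt",
}
edges = [
    ("ori1", "snp1"),
    ("snp1", "rev1"),
    ("rev1", "dir1"),
    ("rev2", "dir1"),
    ("dir1", "cnt1"),
]

missing = nodes_without_ancestors(node_types, edges)
print(missing)
```

Keeping the node type next to each identifier matches the replication package, which lists both the identifiers and the types of the affected nodes.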

Dump check

rev1000

