# Provenance Graph Building Scorer
## Description

ProvenanceGraphBuildingScorer.py

This script scores system output for the MediFor Provenance Graph Building task. It writes its results as a pipe-separated CSV file and returns an exit status of 0 if scoring succeeds. If the script aborts because of an error, the error is written to standard output and the script exits with a status of 1.

The script reports the following metrics:
* SimNLO - Similarity of node link overlap
* SimNO - Similarity of node overlap
* SimLO - Similarity of link overlap
* Node recall - Recall for the nodes reported by the system

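Where $G$ and $G'$ are the two graphs being compared (the reference graph and the system output graph), $N$ and $N'$ are their node sets, and $L$ and $L'$ are their link sets: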
$$\begin{equation*}
Sim_{NLO}(G, G') = 2\frac{|N \cap N'| + |L \cap L'|}{|N| + |N'| + |L| + |L'|}
\end{equation*}$$

$$\begin{equation*}
Sim_{NO}(G, G') = 2\frac{|N \cap N'|}{|N| + |N'|}
\end{equation*}$$

$$\begin{equation*}
Sim_{LO}(G, G') = 2\frac{|L \cap L'|}{|L| + |L'|}
\end{equation*}$$

$$\begin{equation*}
\text{recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}
\end{equation*}$$

The script also reports various counts, such as the number of correct nodes and links.  See the examples section for sample output.
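
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the scorer's actual implementation: the function name `overlap_metrics` is hypothetical, nodes and links are assumed to be hashable values (links as `(source, target)` tuples), the reference nodes are assumed to be the "relevant" set for recall, and returning NaN for an empty denominator is an assumption about edge-case handling.

```python
def overlap_metrics(ref_nodes, ref_links, sys_nodes, sys_links):
    """Compute overlap metrics for a reference graph and a system graph.

    Hypothetical helper for illustration only; the real scorer may differ.
    """
    ref_nodes, sys_nodes = set(ref_nodes), set(sys_nodes)
    ref_links, sys_links = set(ref_links), set(sys_links)

    correct_nodes = ref_nodes & sys_nodes  # |N intersect N'|
    correct_links = ref_links & sys_links  # |L intersect L'|

    def dice(overlap, total):
        # Assumption: report NaN when both graphs are empty.
        return 2.0 * overlap / total if total else float("nan")

    return {
        "SimNLO": dice(
            len(correct_nodes) + len(correct_links),
            len(ref_nodes) + len(sys_nodes) + len(ref_links) + len(sys_links),
        ),
        "SimNO": dice(len(correct_nodes), len(ref_nodes) + len(sys_nodes)),
        "SimLO": dice(len(correct_links), len(ref_links) + len(sys_links)),
        # Assumption: "relevant" means the reference nodes.
        "NodeRecall": (
            len(correct_nodes) / len(ref_nodes) if ref_nodes else float("nan")
        ),
        "NumCorrectNodes": len(correct_nodes),
        "NumMissingNodes": len(ref_nodes - sys_nodes),
        "NumFalseAlarmNodes": len(sys_nodes - ref_nodes),
    }


# Toy graphs: the system finds two of three reference nodes and one of two links.
metrics = overlap_metrics(
    ref_nodes={"a", "b", "c"},
    ref_links={("a", "b"), ("b", "c")},
    sys_nodes={"a", "b", "d"},
    sys_links={("a", "b"), ("b", "d")},
)
```

For these toy graphs, three items overlap (two nodes and one link) out of ten in total, so `SimNLO` is 2 × 3 / 10 = 0.6.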

## Command-line interface

### Required Arguments

-x, --index-file INDEX_FILE
* Index file (e.g. NC2017_Beta/indexes/NC2017_Dev2-provenancefiltering-index.csv)

-r, --reference-file REFERENCE_FILE
* Reference file (e.g. NC2017_Beta/reference/provenancefiltering/NC2017_Dev2-provenancefiltering-ref.csv)

-n, --node-file NODE_FILE
* Node file (e.g. NC2017_Beta/reference/provenancefiltering/NC2017_Dev2-provenancefiltering-node.csv)

-w, --world-file WORLD_FILE
* World file (e.g. NC2017_Beta/indexes/NC2017_Dev2-provenancefiltering-world.csv)

-R, --reference-dir REFERENCE_DIR
* Reference directory (e.g. NC2017_Beta/). Relative paths to journals in the REFERENCE_FILE are resolved against this directory

-s, --system-output-file SYSTEM_OUTPUT_FILE
* System output file (e.g. `<EXPID>`.csv)

-S, --system-dir SYSTEM_DIR
* System output directory containing the system output JSON files. Paths to system output JSON files in the SYSTEM_OUTPUT_FILE should be relative to this directory

### Optional Arguments

-t, --skip-trial-disparity-check
* By default, the script checks that the trials in INDEX_FILE correspond one-to-one to the trials in the SYSTEM_OUTPUT_FILE. If this option is set, the check is skipped

-d, --direct
* Enables direct path scoring

-o, --output-file OUTPUT_FILE
* By default the scoring script writes results to standard output.  If this option is provided, results instead are written to OUTPUT_FILE
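
When the scorer is driven from another Python program, the options above can be assembled into an argument list and the exit status checked afterwards. The sketch below uses only the flags documented in this section; the helper names `build_scorer_command` and `run_scorer` and the default script path are hypothetical, and Python 3.7 or later is assumed for `subprocess.run(..., capture_output=True)`.

```python
import subprocess


def build_scorer_command(index_file, reference_file, node_file, world_file,
                         reference_dir, system_output_file, system_dir,
                         output_file=None, direct=False,
                         skip_trial_disparity_check=False,
                         script="./ProvenanceGraphBuildingScorer.py"):
    """Build the argument list for the scorer (hypothetical helper)."""
    cmd = [
        script,
        "-x", index_file,
        "-r", reference_file,
        "-n", node_file,
        "-w", world_file,
        "-R", reference_dir,
        "-s", system_output_file,
        "-S", system_dir,
    ]
    if direct:
        cmd.append("-d")
    if skip_trial_disparity_check:
        cmd.append("-t")
    if output_file is not None:
        cmd.extend(["-o", output_file])
    return cmd


def run_scorer(cmd):
    """Run the scorer and return its standard output, raising on failure."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    if completed.returncode != 0:
        # Per the description, errors are written to standard output.
        raise RuntimeError("Scoring failed: " + completed.stdout)
    return completed.stdout


example_cmd = build_scorer_command(
    "index.csv", "ref.csv", "node.csv", "world.csv",
    "reference/", "system_output.csv", "system/", direct=True,
)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.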

## Example Usage

The following example uses the test case files included with the tool:


```bash
./ProvenanceGraphBuildingScorer.py -x "../../data/test_suite/provenanceScorerTests/test_case_2-provenancegraphbuilding-index.csv" \
                                   -r "../../data/test_suite/provenanceScorerTests/test_case_2-provenance-ref.csv" \
                                   -n "../../data/test_suite/provenanceScorerTests/test_case_2-provenance-node.csv" \
                                   -w "../../data/test_suite/provenanceScorerTests/test_case_2-provenancegraphbuilding-world.csv" \
                                   -R "../../data/test_suite/provenanceScorerTests/" \
                                   -s "../../data/test_suite/provenanceScorerTests/test_case_2-system_output_1_index.csv" \
                                   -S "../../data/test_suite/provenanceScorerTests/"
```

```
JournalName|ProvenanceProbeFileID|Direct|ProvenanceOutputFileName|NumSysNodes|NumSysLinks|NumRefNodes|NumRefLinks|NumCorrectNodes|NumMissingNodes|NumFalseAlarmNodes|NumCorrectLinks|NumMissingLinks|NumFalseAlarmLinks|SimNLO|SimNO|SimLO|NodeRecall
c257f3e674aaa597784ae6f5a402c748|9f796693249720d9d4eb8f1c3347c300|False|jsons/test_case_2-system_output_1.json|20|21|18|18|15|3|5|6|12|15|0.545454545455|0.789473684211|0.307692307692|0.833333333333
```
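
The pipe-separated output can be read with Python's standard `csv` module, and the reported similarity values can be cross-checked against the counts in the same row. The sketch below embeds the sample output shown above as a string. The variable names are illustrative, and using `NumRefNodes` as the recall denominator is an assumption that happens to match this sample.

```python
import csv
import io

# Sample scorer output copied from the example above.
SAMPLE_OUTPUT = (
    "JournalName|ProvenanceProbeFileID|Direct|ProvenanceOutputFileName|"
    "NumSysNodes|NumSysLinks|NumRefNodes|NumRefLinks|NumCorrectNodes|"
    "NumMissingNodes|NumFalseAlarmNodes|NumCorrectLinks|NumMissingLinks|"
    "NumFalseAlarmLinks|SimNLO|SimNO|SimLO|NodeRecall\n"
    "c257f3e674aaa597784ae6f5a402c748|9f796693249720d9d4eb8f1c3347c300|False|"
    "jsons/test_case_2-system_output_1.json|20|21|18|18|15|3|5|6|12|15|"
    "0.545454545455|0.789473684211|0.307692307692|0.833333333333\n"
)

# Parse the pipe-separated text into one dictionary per scored probe.
rows = list(csv.DictReader(io.StringIO(SAMPLE_OUTPUT), delimiter="|"))
row = rows[0]

sys_nodes = int(row["NumSysNodes"])
sys_links = int(row["NumSysLinks"])
ref_nodes = int(row["NumRefNodes"])
ref_links = int(row["NumRefLinks"])
correct_nodes = int(row["NumCorrectNodes"])
correct_links = int(row["NumCorrectLinks"])

# Recompute the metrics from the counts using the formulas in the description.
sim_nlo = 2 * (correct_nodes + correct_links) / (
    sys_nodes + ref_nodes + sys_links + ref_links
)
sim_no = 2 * correct_nodes / (sys_nodes + ref_nodes)
sim_lo = 2 * correct_links / (sys_links + ref_links)
node_recall = correct_nodes / ref_nodes  # assumption: relevant = reference nodes

print(sim_nlo, sim_no, sim_lo, node_recall)
```

For this row, SimNLO works out to 2 × (15 + 6) / (20 + 18 + 21 + 18) = 42/77, which agrees with the reported 0.545454545455 to the printed precision.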


## Disclaimer

This software was developed at the National Institute of Standards
and Technology (NIST) by employees of the Federal Government in the
course of their official duties. Pursuant to Title 17 Section 105
of the United States Code, this software is not subject to copyright
protection and is in the public domain. NIST assumes no responsibility
whatsoever for use by other parties of its source code or open source
server, and makes no guarantees, expressed or implied, about its quality,
reliability, or any other characteristic.