# Provenance Filtering Scorer
## Description

ProvenanceFilteringScorer.py

This script performs scoring for the MediFor Provenance Filtering task.  It writes its results as pipe-separated CSV files and returns an exit status of 0 if scoring was successful.  If the script aborts due to an error, the error is written to standard output and the script exits with a status of 1.  The script produces two score files: 'trial_scores.csv', which reports scores at the trial (probe) level, and 'scores.csv', which aggregates metrics over trials.  It also generates a 'node_mapping.csv' file, which lists each individual node for each trial and measure along with its designation as "Correct", "FalseAlarm", or "Missing".

The script reports the following metrics:
* Node recall @ 50 - Recall for the top 50 nodes (by confidence score) reported by the system
* Node recall @ 100 - Recall for the top 100 nodes (by confidence score) reported by the system
* Node recall @ 200 - Recall for the top 200 nodes (by confidence score) reported by the system

Where:
$$\begin{equation*}
\mathrm{recall} = \frac{|\{\mathrm{relevant}\} \cap \{\mathrm{retrieved}\}|}{|\{\mathrm{relevant}\}|}
\end{equation*}$$

The script also reports various counts, such as the number of correct nodes @ 50.  See the Example Usage section for sample output.
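
The recall definition and the per-cutoff counts can be illustrated with a minimal Python sketch. This is not the scorer's own code: the function name `score_at_k`, the `(node_id, confidence)` input format, and the interpretation of the "missing" and "false alarm" counts (reference nodes not retrieved, and retrieved nodes not in the reference) are assumptions made for illustration. It also assumes each node appears at most once in the system output.

```python
def score_at_k(system_nodes, reference_nodes, k):
    """Illustrative recall@k.

    system_nodes: list of (node_id, confidence) pairs (assumed format).
    reference_nodes: iterable of relevant node ids.
    """
    # Rank system nodes by descending confidence and keep the top k.
    ranked = sorted(system_nodes, key=lambda pair: pair[1], reverse=True)
    retrieved = {node_id for node_id, _ in ranked[:k]}
    relevant = set(reference_nodes)

    correct = retrieved & relevant
    return {
        "correct": len(correct),
        "missing": len(relevant - retrieved),
        "false_alarm": len(retrieved - relevant),
        # recall = |relevant ∩ retrieved| / |relevant|
        "recall": len(correct) / len(relevant) if relevant else float("nan"),
    }


system = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.1)]
reference = {"a", "c", "e"}
print(score_at_k(system, reference, 2))
print(score_at_k(system, reference, 3))
```

With a cutoff of 2, only node "a" is both retrieved and relevant, so recall is 1/3; raising the cutoff to 3 also retrieves "c", giving 2/3.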

## Command-line interface

### Required Arguments

-o, --output-dir OUTPUT_DIR
* Output directory, where you want the script to write results

-x, --index-file INDEX_FILE
* Index file (e.g. NC2017_Beta/indexes/NC2017_Dev2-provenancefiltering-index.csv)

-r, --reference-file REFERENCE_FILE
* Reference file (e.g. NC2017_Beta/reference/provenancefiltering/NC2017_Dev2-provenancefiltering-ref.csv)

-n, --node-file NODE_FILE
* Node file (e.g. NC2017_Beta/reference/provenancefiltering/NC2017_Dev2-provenancefiltering-node.csv)

-w, --world-file WORLD_FILE
* World file (e.g. NC2017_Beta/indexes/NC2017_Dev2-provenancefiltering-world.csv)

-R, --reference-dir REFERENCE_DIR
* Reference directory (e.g. NC2017_Beta/).  Relative paths to journals in the REFERENCE_FILE are resolved against this directory

-s, --system-output-file SYSTEM_OUTPUT_FILE
* System output file (e.g. `<EXPID>`.csv)

-S, --system-dir SYSTEM_DIR
* System output directory, where the system output JSON files can be found.  Paths to system output JSON files in the SYSTEM_OUTPUT_FILE should be relative to this directory
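
When the scorer is driven from another Python program, the required options and the documented exit status can be handled along the following lines. This is only a sketch: `build_scorer_command` and `run_scorer` are hypothetical helper names, and the default script path assumes the scorer sits in the current directory. Only the short flags listed above are used.

```python
import subprocess
import sys


def build_scorer_command(output_dir, index_file, reference_file, node_file,
                         world_file, reference_dir, system_output_file,
                         system_dir, script="./ProvenanceFilteringScorer.py"):
    """Assemble the argument list from the required options (hypothetical helper)."""
    return [
        script,
        "-o", output_dir,
        "-x", index_file,
        "-r", reference_file,
        "-n", node_file,
        "-w", world_file,
        "-R", reference_dir,
        "-s", system_output_file,
        "-S", system_dir,
    ]


def run_scorer(command):
    """Run the command and return (exit_status, captured_stdout)."""
    completed = subprocess.run(command, stdout=subprocess.PIPE,
                               universal_newlines=True)
    if completed.returncode != 0:
        # On failure the scorer is documented to write the error to standard output.
        sys.stderr.write("Scoring failed:\n" + completed.stdout)
    return completed.returncode, completed.stdout
```

A caller would treat a returned status of 0 as success and read the CSV files from the output directory afterwards.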

### Optional Arguments

-t, --skip-trial-disparity-check
* By default, the script checks that the trials in INDEX_FILE correspond one-to-one to the trials in the SYSTEM_OUTPUT_FILE.  If this option is set, the check is skipped

-H, --html-report
* Generates an HTML report file that includes tables for trial-level and aggregate scores.  The report is saved as `<OUTPUT_DIR>`/report.html
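
To make the one-to-one trial check concrete, here is a rough Python sketch of what such a comparison could look like. It is not the scorer's implementation: the function name `trial_disparity` is hypothetical, and it assumes trials in both files are identified by a single probe ID (such as the ProvenanceProbeFileID column that appears in the scorer's output).

```python
from collections import Counter


def trial_disparity(index_ids, system_ids):
    """Compare trial ids from the index file with those in the system output.

    Returns (missing, unexpected, duplicated):
      missing    - ids in the index with no system output row
      unexpected - ids in the system output that are not in the index
      duplicated - ids that appear more than once in the system output
    """
    index_counts = Counter(index_ids)
    system_counts = Counter(system_ids)
    missing = sorted(set(index_counts) - set(system_counts))
    unexpected = sorted(set(system_counts) - set(index_counts))
    duplicated = sorted(i for i, n in system_counts.items() if n > 1)
    return missing, unexpected, duplicated


missing, unexpected, duplicated = trial_disparity(
    ["p1", "p2", "p3"], ["p2", "p3", "p3", "p4"])
print(missing, unexpected, duplicated)
```

Under this reading, the correspondence is one-to-one only when all three returned lists are empty.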

## Example Usage

The following example uses the test case files included with the tool:

%%bash
./ProvenanceFilteringScorer.py -o "compcheckfiles/filtering_example/" \
                               -x "../../data/test_suite/provenanceScorerTests/test_case_2-provenancegraphbuilding-index.csv" \
                               -r "../../data/test_suite/provenanceScorerTests/test_case_2-provenance-ref.csv" \
                               -n "../../data/test_suite/provenanceScorerTests/test_case_2-provenance-node.csv" \
                               -w "../../data/test_suite/provenanceScorerTests/test_case_2-provenancegraphbuilding-world.csv" \
                               -R "../../data/test_suite/provenanceScorerTests/" \
                               -s "../../data/test_suite/provenanceScorerTests/test_case_2-filtering-system_output_1_index.csv" \
                               -S "../../data/test_suite/provenanceScorerTests/"
                               
head compcheckfiles/filtering_example/scores.csv \
     compcheckfiles/filtering_example/trial_scores.csv \
     compcheckfiles/filtering_example/node_mapping.csv

==> compcheckfiles/filtering_example/scores.csv <==
MeanNodeRecall@50|MeanNodeRecall@100|MeanNodeRecall@200
0.333333333333|0.666666666667|1.0

==> compcheckfiles/filtering_example/trial_scores.csv <==
JournalName|ProvenanceProbeFileID|ProvenanceOutputFileName|NumSysNodes|NumRefNodes|NumCorrectNodes@50|NumMissingNodes@50|NumFalseAlarmNodes@50|NumCorrectNodes@100|NumMissingNodes@100|NumFalseAlarmNodes@100|NumCorrectNodes@200|NumMissingNodes@200|NumFalseAlarmNodes@200|NodeRecall@50|NodeRecall@100|NodeRecall@200
c257f3e674aaa597784ae6f5a402c748|9f796693249720d9d4eb8f1c3347c300|jsons/test_case_2-filtering-system_output_1.json|200|18|6|12|44|12|6|88|18|0|182|0.333333333333|0.666666666667|1.0

==> compcheckfiles/filtering_example/node_mapping.csv <==
JournalName|ProvenanceProbeFileID|ProvenanceOutputFileName|Measure|WorldFileID|Mapping
c257f3e674aaa597784ae6f5a402c748|9f796693249720d9d4eb8f1c3347c300|jsons/test_case_2-filtering-system_output_1.json|NodeRecall@50|05091cd3d392e1296004ea4f6487cf
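
The pipe-separated output can be read with Python's standard `csv` module by setting the delimiter to `|`. The sketch below parses the header and first row of the sample trial_scores.csv shown above and checks that the counts agree with the recall definition. The helper name `check_trial_row` is hypothetical, and the assumption that correct plus missing nodes equals the number of reference nodes is inferred from the sample values rather than stated by the tool.

```python
import csv
import io
import math

# Header and first row copied from the sample trial_scores.csv output.
SAMPLE = (
    "JournalName|ProvenanceProbeFileID|ProvenanceOutputFileName|"
    "NumSysNodes|NumRefNodes|"
    "NumCorrectNodes@50|NumMissingNodes@50|NumFalseAlarmNodes@50|"
    "NumCorrectNodes@100|NumMissingNodes@100|NumFalseAlarmNodes@100|"
    "NumCorrectNodes@200|NumMissingNodes@200|NumFalseAlarmNodes@200|"
    "NodeRecall@50|NodeRecall@100|NodeRecall@200\n"
    "c257f3e674aaa597784ae6f5a402c748|9f796693249720d9d4eb8f1c3347c300|"
    "jsons/test_case_2-filtering-system_output_1.json|"
    "200|18|6|12|44|12|6|88|18|0|182|"
    "0.333333333333|0.666666666667|1.0\n"
)


def check_trial_row(row, cutoffs=(50, 100, 200)):
    """Return True if the counts in one trial row match its reported recall."""
    n_ref = int(row["NumRefNodes"])
    if n_ref == 0:
        return True  # recall is undefined without reference nodes; skip the check
    for k in cutoffs:
        correct = int(row["NumCorrectNodes@%d" % k])
        missing = int(row["NumMissingNodes@%d" % k])
        recall = float(row["NodeRecall@%d" % k])
        if correct + missing != n_ref:
            return False
        if not math.isclose(recall, correct / n_ref, abs_tol=1e-9):
            return False
    return True


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="|"))
print(all(check_trial_row(r) for r in rows))  # prints True for the sample row
```

For the sample row, 6 of the 18 reference nodes are found in the top 50, which matches the reported NodeRecall@50 of 0.333333333333.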

## Disclaimer

This software was developed at the National Institute of Standards
and Technology (NIST) by employees of the Federal Government in the
course of their official duties. Pursuant to Title 17 Section 105
of the United States Code, this software is not subject to copyright
protection and is in the public domain. NIST assumes no responsibility
whatsoever for use by other parties of its source code or open source
server, and makes no guarantees, expressed or implied, about its quality,
reliability, or any other characteristic.