# Video Spatial-Temporal Localization Scorer

## Table of Contents

0\. [Synopsis](#Synopsis)

1\. [Description](#Description)

2\. [Spatial-Temporal System Masks](#Sysmasks)

3\. [Reference Masks](#Refmasks)

3.1\. [Binarizing the Reference Mask](#Binmasks)

4\. [Generating No-Score Zones](#noscorezone)

5\. [Metrics](#metrics)

5.1 [System Mask Thresholding](#metricsthresholding)

5.2 [Notation](#metricsnotation)

5.3 [Matthews Correlation Coefficient (MCC)](#metricsmcc)

5.4 [Nimble Mask Metric (NMM)](#metricsnmm)

5.5 [Weighted L1 Loss (BWL1 and GWL1)](#metricswl1)

5.6 [Area Under Curve (AUC)](#metricsauc)

5.7 [Equal Error Rate (EER)](#metricseer)

6\. [ROC Curves](#roccurves)

7\. [Options](#options)

7.1 [Task Type Options](#optionstask)

7.2 [Input Options](#optionsinputs)

7.3 [Output Options](#optionsoutputs)

7.4 [Printout Options](#optionsprintout)

7.5 [Scoring Options](#optionsscoring)

7.6 [Performance Evaluation by Query](#optionsquery)

8\. [Examples](#examples)

## 0. Synopsis<a name="Synopsis"></a>

python2 VideoSpatialLocalizationScorer.py<br/>
                      --refDir reference-directory<br/>
                      -r reference-file<br/>
                      -x index-file<br/>
                      -s system-output-file<br/>
                      --outRoot output-directory<br/>
                      [options]

## 1. Description<a name="Description"></a>

The video spatial-temporal localization (VSTL) scorer calculates performance scores that measure the accuracy of the system output masks relative to their corresponding reference masks. This scorer's function is analogous to that of the image localization scorer (under tools/MaskScorer), but is applied to detecting manipulated pixels in videos.

The script produces output files in the form of pipe-separated ("|") CSV files, one containing scores at the trial level (\*-pervideo.csv), another containing an average of the metrics in the first CSV (\*-mask_score.csv), and a third containing a list of manipulations that were evaluated or skipped over (\*-journalResults.csv).

The mask scorer takes the following steps to score a mask:

1. Reads in the reference and system masks.
2. Binarizes the reference mask.
3. Generates the no-score zone.
4. Computes the metrics.

If the script aborts due to an error, the error will be written to standard out, and the script will exit with a status of 1.

## 2. Spatial-Temporal System Masks<a name="Sysmasks"></a>

The system mask is read in as an HDF5 file. Each file contains a set of blocks indexed according to their first frames (with the convention that frame 1 is the starting frame). Each block, representing a contiguous set of manipulated frames, is a three-dimensional integer matrix with entries ranging from 0 to 255, the values of which denote whether or not a particular pixel is manipulated (0 meaning it is, 255 meaning it is not). The dimensions of each block follow this convention: (total_frames,height,width).

Thus each block in the HDF5 system mask denotes a range of frames where the performer's algorithm detects localizable manipulations. The validator will fail the mask if it does not conform to the desired specifications.
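The block layout described above can be illustrated with a minimal sketch. It does not read HDF5; it assumes the blocks have already been loaded into a plain dictionary that maps each block's first frame to a nested list of shape (total_frames, height, width). The helper names `validate_block` and `frames_covered` are hypothetical and are not part of the scorer or the validator.

```python
def validate_block(block):
    """Check one block and return its (total_frames, height, width)."""
    total_frames = len(block)
    height = len(block[0])
    width = len(block[0][0])
    for frame in block:
        if len(frame) != height:
            raise ValueError("inconsistent frame height")
        for row in frame:
            if len(row) != width:
                raise ValueError("inconsistent row width")
            for value in row:
                if not 0 <= value <= 255:
                    raise ValueError("pixel value outside [0, 255]")
    return total_frames, height, width


def frames_covered(blocks):
    """List the 1-based frame numbers covered by at least one block."""
    covered = set()
    for first_frame, block in blocks.items():
        total_frames, _, _ = validate_block(block)
        covered.update(range(first_frame, first_frame + total_frames))
    return sorted(covered)


# Two hypothetical blocks of 2x2 frames: frames 3-4 and frames 10-12.
# 0 marks a manipulated pixel, 255 an unmanipulated one.
blocks = {
    3: [[[0, 255], [255, 255]]] * 2,
    10: [[[255, 255], [0, 0]]] * 3,
}
print(frames_covered(blocks))
```

Frames outside every block (here 1-2 and 5-9) are the ones for which the system reports no localizable manipulation.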

## 3. Reference Masks<a name="Refmasks"></a>

The reference mask is read in as an HDF5 file containing blocks indexed according to the same convention as the system output mask. A fourth dimension may be present that adds layers with additional manipulations for the same block, leading to the following indexing scheme for each block in the reference mask: (total_frames,height,width,layer).

### 3.1. Binarizing the Reference Mask<a name="Binmasks"></a>

In order to score the system output mask against the reference mask, the reference mask needs to be converted to a single-channel mask. The video spatial localization scorer handles this by selecting the set of Bit Planes that correspond to manipulations for the probe and using this as the ground truth positives.

If selective scoring is utilized (via the -qm option), only regions that match the query will be scored against. Regions listed in the journal as explicitly non-matching will be treated as no-score regions, which will be discussed below. Other regions that are not listed in the journal are ignored, whether due to later manipulations recorded in the same journals that do not apply to the probe in question or due to otherwise faulty data.

## 4. Generating No-Score Zones<a name="noscorezone"></a>

The following information is identical to that provided in the image localization scorer under tools/MaskScorer.

No-score zones are automatically generated to grant an amount of flexibility to the performer, to exclude manipulated regions that are not of interest, and to exclude regions that the performer does not deem of interest. Each subset of the no-score zone is denoted by a specified type of no-score zone based on these motivations: the boundary no-score zone, selective no-score zone, and pixel no-score zone respectively.

To ease the demand in identifying the region manipulated, the boundary around each region is eroded and dilated, with the difference serving as the boundary no-score zone.

If selective scoring is utilized (via the -qm option), the mask scorer also generates the selective (distraction) no-score zone by dilating the positive pixels of the irrelevant regions in the reference mask. The dilation kernel size used here need not be the same as the one used to generate the boundary no-score zone and may be adjusted via the --ntdks option.

An additional system-generated no-score zone may be used by the performer. The performer specifies a particular pixel value to treat as a no-score zone via the --nspx option, and all pixels with that value in the system mask serve as the pixel no-score zone.

The three are merged to generate the overall no-score zone.
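The construction of the weights matrix can be sketched on a single frame in pure Python. This is an illustration under stated assumptions, not the scorer's implementation: the reference is given as a 0/1 grid where 1 marks a manipulated pixel (the opposite polarity of the stored masks, chosen for readability), only the 'box' kernel is modelled, pixels outside the frame are ignored during erosion and dilation, and the function names are hypothetical.

```python
def _window(mask, y, x, radius):
    """Yield the mask values inside the box window centred on (y, x)."""
    height, width = len(mask), len(mask[0])
    for yy in range(max(0, y - radius), min(height, y + radius + 1)):
        for xx in range(max(0, x - radius), min(width, x + radius + 1)):
            yield mask[yy][xx]


def dilate(mask, kernel_size):
    """Binary dilation with a kernel_size x kernel_size box kernel."""
    radius = kernel_size // 2
    return [[int(any(_window(mask, y, x, radius)))
             for x in range(len(mask[0]))] for y in range(len(mask))]


def erode(mask, kernel_size):
    """Binary erosion with a kernel_size x kernel_size box kernel."""
    radius = kernel_size // 2
    return [[int(all(_window(mask, y, x, radius)))
             for x in range(len(mask[0]))] for y in range(len(mask))]


def weights(ref, sys_mask, eks, dks, non_target=None, ntdks=15, nspx=-1):
    """Return a 0/1 matrix that is 0 inside the merged no-score zone."""
    eroded = erode(ref, eks)
    dilated = dilate(ref, dks)
    selective = dilate(non_target, ntdks) if non_target else None
    wts = []
    for y in range(len(ref)):
        row = []
        for x in range(len(ref[0])):
            boundary = dilated[y][x] and not eroded[y][x]
            distraction = selective is not None and selective[y][x]
            pixel = nspx != -1 and sys_mask[y][x] == nspx
            row.append(0 if (boundary or distraction or pixel) else 1)
        wts.append(row)
    return wts


# A 7x7 frame with a 3x3 manipulated square in the centre.
ref = [[1 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)]
       for y in range(7)]
sys_mask = [[255] * 7 for _ in range(7)]
sys_mask[0][0] = 100  # hypothetical opt-out pixel value
wts = weights(ref, sys_mask, eks=3, dks=3, nspx=100)
print(sum(map(sum, wts)))  # number of pixels that remain scored
```

With 3x3 kernels, the boundary zone is the 5x5 dilated square minus the single eroded centre pixel, and the opt-out pixel in the corner is excluded as well.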

## 5. Metrics<a name="metrics"></a>

The system computes the following metrics for Spatial-Temporal scoring. Only the MCC is computed for Temporal scoring; it is labeled "TemporalMCC":
 * Matthews Correlation Coefficient (MCC)
 * Nimble Mask Metric (NMM)
 * Weighted L1 Loss: Binary (BWL1) and Grayscale (GWL1)
 * Area Under the ROC Curve (AUC)
 * Equal Error Rate (EER)

The details below concerning the metrics are identical to those of the image localization scorer, and are duplicated here for ease of reference.

### 5.1 System Mask Thresholding<a name="metricsthresholding"></a>

Most of these metrics depend on the binarized values of the grayscale system mask. There are various ways to binarize the system mask, which differ in how they pick the threshold that determines whether to treat a particular grayscale pixel as white or black.

One is to binarize each mask at the threshold that yields the greatest MCC for that mask; the metrics computed this way are called the "optimum" metrics. Another is to specify a single fixed threshold for all masks with the --sbin option; the metrics computed from this fixed threshold are called the "actual" metrics. Specifying a fixed threshold also prompts the scorer to compute the single threshold that yields the maximum average MCC; the metrics computed at this threshold are called the "maximum" metrics.

Averages of the metrics will be recorded across all probes. The average and standard deviation of the optimum thresholds will also be recorded.
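The per-mask "optimum" threshold can be sketched as a sweep over candidate thresholds. The sketch below makes several assumptions that are not stated in this document: masks are flattened to lists, the reference uses 1 for manipulated pixels, a system pixel counts as manipulated when its value is less than or equal to the threshold (consistent with 0 meaning manipulated and with -1 turning the whole mask white), the candidates are -1 plus each distinct value in the mask, and no-score weights are omitted for brevity. The function names are hypothetical.

```python
import math


def confusion(gt, sys_gray, theta):
    """Count (TP, TN, FP, FN) after binarizing sys_gray at theta."""
    tp = tn = fp = fn = 0
    for g, s in zip(gt, sys_gray):
        predicted = s <= theta  # dark pixels are treated as manipulated
        if g and predicted:
            tp += 1
        elif not g and not predicted:
            tn += 1
        elif predicted:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """MCC, taken to be 0 when any sum in the denominator is 0."""
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / math.sqrt(denom)


def optimum_threshold(gt, sys_gray):
    """Return (threshold, MCC) for the threshold with the greatest MCC."""
    candidates = [-1] + sorted(set(sys_gray))
    best = max(candidates, key=lambda t: mcc(*confusion(gt, sys_gray, t)))
    return best, mcc(*confusion(gt, sys_gray, best))


gt = [1, 1, 0, 0, 1, 0]
sys_gray = [10, 40, 200, 150, 90, 255]
print(optimum_threshold(gt, sys_gray))
```

In this toy mask every manipulated pixel is darker than every unmanipulated one, so a threshold of 90 separates them perfectly.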

### 5.2 Notation<a name="metricsnotation"></a>

The following notation is used in discussing the metrics:
 * $gt$ refers to the binarized reference mask discussed in section 3.
 * $sys_{\theta}$ refers to the system output mask binarized by specified threshold $\theta$
   * $sys_{*}$ refers to the unbinarized grayscale system output mask
 * $wts$ is a matrix of the overall no-score region discussed in section 4.
   * The no-score region can be further subdivided into the boundary, distraction, and system-generated no-score zones. The pixel counts corresponding to these regions are denoted $BNS$, $SNS$, and $PNS$ respectively. In cases where regions from different types of no-score zone intersect, the pixels count towards $PNS$, then $SNS$, or $BNS$, in order of decreasing priority.
 * $TP$, $TN$, $FP$, and $FN$ are functions of $gt$, $sys_{\theta}$, and $wts$. All four are measures derived from the confusion matrix. Pixels for which the weights matrix is equal to 0 are not counted in the confusion measures:
   * $TP$ refers to the total number of True Positives, pixels where $gt$ is positive and $sys_{\theta}$ is thresholded positive (manipulated)
   * $TN$ refers to the total number of True Negatives, pixels where $gt$ is negative and $sys_{\theta}$ is thresholded negative (not manipulated)
   * $FP$ refers to the total number of False Positives, pixels where $gt$ is negative but $sys_{\theta}$ is positive.
   * $FN$ refers to the total number of False Negatives, pixels where $gt$ is positive but $sys_{\theta}$ is negative.
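As a minimal sketch of how the four confusion measures interact with the weights, the function below counts them over flattened masks and skips every pixel whose weight is 0. It assumes 1 marks a manipulated pixel in both `gt` and the already binarized system mask; the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(gt, sys_bin, wts):
    """Return (TP, TN, FP, FN), ignoring pixels whose weight is 0."""
    tp = tn = fp = fn = 0
    for g, s, w in zip(gt, sys_bin, wts):
        if w == 0:
            continue  # no-score pixel: not counted in any measure
        if g and s:
            tp += 1
        elif not g and not s:
            tn += 1
        elif s:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


gt = [1, 1, 0, 0, 1, 0]
sys_bin = [1, 0, 0, 1, 1, 1]
wts = [1, 1, 1, 1, 0, 0]  # the last two pixels lie in a no-score zone
print(confusion_counts(gt, sys_bin, wts))
```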

### 5.3 Matthews Correlation Coefficient (MCC)<a name="metricsmcc"></a>
\begin{equation*}
MCC(gt,sys_{\theta}) = \frac{TP*TN - FP*FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{equation*}

An MCC of 1 denotes perfect correlation, an MCC of 0 denotes no correlation at all, and an MCC of -1 denotes perfect anti-correlation. If any of the sums in the denominator equals 0, the MCC is taken to be 0 by convention.
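The formula and the zero-denominator convention translate directly into a short function. This is a sketch that takes the four confusion counts as inputs; the function name is hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the confusion counts."""
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        return 0.0  # convention: any zero sum in the denominator gives 0
    return (tp * tn - fp * fn) / math.sqrt(denom)


print(mcc(3, 3, 0, 0))  # perfect correlation
print(mcc(0, 0, 3, 3))  # perfect anti-correlation
print(mcc(0, 5, 0, 0))  # degenerate case, zero by convention
```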

### 5.4 Nimble Mask Metric (NMM)<a name="metricsnmm"></a>
\begin{equation*}
NMM(gt,sys_{\theta},wts,c)=\max{\left(\frac{TP - FN - FP}{\sum\limits_{px\in gt}wts(px)},c\right)}
\end{equation*}

$\sum_{px \in gt} wts(px)$ refers to the sum of the weights over the pixels in the ground truth that are marked black. $c$ denotes a minimum cutoff value for the scoring to have any meaning; by default, $c=-1$.
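A minimal sketch of the NMM follows. It takes the confusion counts and the weighted count of manipulated reference pixels as inputs. How the scorer treats a reference with no scored manipulated pixels is not described here, so the sketch simply raises an error in that case; the function and parameter names are hypothetical.

```python
def nmm(tp, fn, fp, gt_weight_sum, c=-1.0):
    """Nimble Mask Metric, clipped from below at the cutoff c."""
    if gt_weight_sum <= 0:
        # Assumption: this case is left undefined in the sketch.
        raise ValueError("no scored manipulated pixels in the reference")
    return max((tp - fn - fp) / float(gt_weight_sum), c)


print(nmm(tp=3, fn=1, fp=0, gt_weight_sum=4))
print(nmm(tp=0, fn=4, fp=10, gt_weight_sum=4))  # clipped at c = -1
```

The cutoff keeps a mask with many false positives from producing an arbitrarily large negative score.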

### 5.5 Weighted L1 Loss (BWL1 and GWL1)<a name="metricswl1"></a>

A Weighted L1 of 0 denotes a perfect match, up to variation in the pixels whose weights are 0; 1 denotes perfect mismatch. $(FP+FN)_{wts > 0}$ refers to the total number of $FP$ and $FN$ pixels where the weights are greater than 0.

The Weighted L1 is applied separately to the original grayscale system output and the binarized mask, producing the grayscale Weighted L1 (GWL1) and binarized Weighted L1 (BWL1) metrics respectively. In the case of the original grayscale, the value is summed over the weighted difference in pixel intensity.

\begin{equation*}
BWL1(gt,sys_{\theta},wts)=\frac{(FP+FN)_{wts > 0}}{\sum\limits_{px\in gt}wts(px)}
\end{equation*}
\begin{equation*}
GWL1(gt,sys_{*},wts)=\frac{\sum\limits_{px\in gt} \left|gt(px) - sys_{*}(px)\right|wts(px)}{\sum\limits_{px\in gt}wts(px)}
\end{equation*}
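The two losses can be sketched over flattened masks as follows. The sketch rests on assumptions that the formulas leave open: the weights are 0 or 1, the denominator is the sum of the weights over all pixels (so that a complete mismatch gives 1), the reference and the binarized system mask use 1 for manipulated pixels, and the grayscale system mask has been rescaled to [0, 1] with the same polarity as the reference. The function names are hypothetical.

```python
def bwl1(gt, sys_bin, wts):
    """Binarized Weighted L1: weighted fraction of misclassified pixels."""
    total = float(sum(wts))
    errors = sum(w for g, s, w in zip(gt, sys_bin, wts) if g != s)
    return errors / total


def gwl1(gt, sys_unit, wts):
    """Grayscale Weighted L1: weighted mean absolute intensity difference."""
    total = float(sum(wts))
    diff = sum(abs(g - s) * w for g, s, w in zip(gt, sys_unit, wts))
    return diff / total


gt = [1, 1, 0, 0]
wts = [1, 1, 1, 0]              # the last pixel is in a no-score zone
sys_bin = [1, 0, 0, 1]          # one scored error (a false negative)
sys_unit = [1.0, 0.25, 0.0, 0.5]
print(bwl1(gt, sys_bin, wts))
print(gwl1(gt, sys_unit, wts))
```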

### 5.6 Area Under Curve (AUC)<a name="metricsauc"></a>

Because different thresholds yield different confusion measures, sweeping the threshold traces out the receiver operating characteristic (ROC) curve. The ROC curve relates the true positive rate ($TPR$) to the false positive rate ($FPR$), defined as follows:

\begin{equation*}
TPR = \frac{TP}{TP + FN}
\end{equation*}

\begin{equation*}
FPR = \frac{FP}{FP + TN}
\end{equation*}

The area under the ROC curve (AUC) is a measure for the accuracy of the system.


\begin{equation*}
AUC(gt,sys_{*},wts) = \frac{1}{2}\sum\limits_{\theta_{n} \in \{\theta_{0},\ldots,\theta_{N-1}\}} \left(TPR(gt,sys_{\theta_{n+1}},wts) + TPR(gt,sys_{\theta_{n}},wts)\right)\left(FPR(gt,sys_{\theta_{n+1}},wts) - FPR(gt,sys_{\theta_{n}},wts)\right)
\end{equation*}
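The sum above is the trapezoidal rule applied to consecutive ROC points. A minimal sketch, assuming the ROC curve is already available as a list of (FPR, TPR) pairs, one per threshold; the function name is hypothetical.

```python
def auc(points):
    """Area under a ROC curve given as (FPR, TPR) pairs."""
    pts = sorted(points)  # order the points by increasing FPR
    area = 0.0
    for (fpr0, tpr0), (fpr1, tpr1) in zip(pts, pts[1:]):
        area += 0.5 * (tpr1 + tpr0) * (fpr1 - fpr0)
    return area


roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(roc))
```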

### 5.7 Equal Error Rate (EER)<a name="metricseer"></a>
The EER is the value of the $FPR$ at the threshold where it equals the false negative rate ($FNR$). $FPR$ is given by the formula in section 5.6, and $FNR = 1 - TPR$.
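With a finite set of thresholds there is usually no point where $FPR$ and $FNR$ are exactly equal, so some interpolation is needed. The sketch below interpolates linearly between the two adjacent ROC points where $FPR - FNR$ changes sign. This interpolation rule is an assumption made for illustration and may differ from what the scorer does; the function name is hypothetical.

```python
def eer(points):
    """Equal error rate from (FPR, TPR) pairs, by linear interpolation."""
    pts = sorted(points)
    prev = pts[0]
    for cur in pts[1:]:
        # FPR - FNR = FPR - (1 - TPR) = FPR + TPR - 1
        d0 = prev[0] + prev[1] - 1.0
        d1 = cur[0] + cur[1] - 1.0
        if d0 <= 0.0 <= d1:
            if d1 == d0:
                return prev[0]
            frac = -d0 / (d1 - d0)
            return prev[0] + frac * (cur[0] - prev[0])
        prev = cur
    raise ValueError("FPR and FNR never cross on this curve")


roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(eer(roc))
```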


## 6. ROC Curves<a name="roccurves"></a>

A pixel-weighted average ROC curve and probe-weighted average ROC curve will be recorded and defined across all thresholds relevant to all system masks. Computing the pixel- and probe-weighted average ROC curves is exactly the same as in the image localization scorer.

For a given threshold $\theta$, the true positive rates (TPR) for the two ROC curves are as follows:

\begin{equation*}
TPR_{pixel}(\theta) = \frac{\sum_{probes}TP_{\theta}}{\sum_{probes}\left(TP_{\theta} + FN_{\theta}\right)}
\end{equation*}

\begin{equation*}
TPR_{probe}(\theta) = \frac{1}{|probes|}\sum_{probes}\frac{TP_{\theta}}{TP_{\theta} + FN_{\theta}}
\end{equation*}

The false positive rates (FPR) are defined similarly. The pixel-weighted average and probe-weighted average ROC curves are drawn from the pairs of (FPR,TPR) points. Their AUCs are recorded with the average metrics mentioned at the end of section 5.1.
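The difference between the two averages is easiest to see on a small example. The sketch below assumes that the per-probe counts at one threshold are available as (TP, FN) pairs and that every probe has at least one scored manipulated pixel; the function names are hypothetical.

```python
def tpr_pixel(counts):
    """Pixel-weighted TPR: pool the counts over probes, then divide."""
    total_tp = sum(tp for tp, fn in counts)
    total_pos = sum(tp + fn for tp, fn in counts)
    return total_tp / float(total_pos)


def tpr_probe(counts):
    """Probe-weighted TPR: average the per-probe rates."""
    rates = [tp / float(tp + fn) for tp, fn in counts]
    return sum(rates) / len(rates)


# One probe with a large manipulated region, one with a small region.
counts = [(90, 10), (1, 9)]
print(tpr_pixel(counts))
print(tpr_probe(counts))
```

The pixel-weighted rate is dominated by the probe with the larger manipulated region, while the probe-weighted rate gives both probes equal influence.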

## 7. Options<a name="options"></a>

The command-line options for the mask scorer can be categorized as follows:

### 7.1 Task Type Options<a name="optionstask"></a>

-t --task [manipulation]

  * Specify the task type for evaluation (default = manipulation). As the manipulation task is the only one defined for video spatial localization at the moment, this option need not be selected.

### 7.2 Input Options<a name="optionsinputs"></a>

All CSV files passed to the Mask Scorer must contain headers and must have their columns separated by pipe characters ('|'). Fields and values in the CSV should <i>not</i> be enclosed in quotes ( ' or " ) if possible (e.g. entries 'foo', an empty field, and 'bar', in that order, should look like this in the csv: foo||bar). Additional specifications for the index and system output files can be found in the ValidatorNotebook.html file under the Validator directory.

All of the options here are identical in function to their image localization scorer counterparts.
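As an illustration of the expected CSV layout, the sketch below parses a small pipe-separated sample with Python's standard `csv` module. The column names are those required of the system output file; the row values are made up for the example, and the helper name `read_pipe_csv` is hypothetical.

```python
import csv
import io

# Hypothetical sample content; the second row has two empty fields.
SAMPLE = (
    "ProbeFileID|ConfidenceScore|OutputProbeMaskFileName\n"
    "probe_001|0.87|mask/probe_001.hdf5\n"
    "probe_002||\n"
)


def read_pipe_csv(handle):
    """Read a headered, pipe-separated, unquoted CSV into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_pipe_csv(io.StringIO(SAMPLE))
for row in rows:
    print(row["ProbeFileID"], repr(row["ConfidenceScore"]))
```

Empty fields come back as empty strings, which matches the `foo||bar` convention described above.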

--refDir

  * Specify the path to the dataset (e.g. "/NC2016_Test0601"). Note that this path must be specified as masks will be read relative to this directory.


-r --inRef

  * Specify the reference CSV file within refDir that contains the ground-truth information and metadata about each image.

-x --inIndex

  * Define the index CSV file within refDir. The index file contains the TaskID, ProbeFileID, ProbeFileName, ProbeWidth, and ProbeHeight fields.

--sysDir

  * Specify the system output data path, for example "mysysoutput/" (default = .) 


-s --inSys

  * Specify the CSV file of the system performance results formatted according to NC2016 specification. The file must contain the ProbeFileID, ConfidenceScore, and OutputProbeMaskFileName fields, in that order, and if scoring on the splice task, the ProbeFileID, DonorFileID, ConfidenceScore, OutputProbeMaskFileName, and OutputDonorMaskFileName fields, in that order. The OutputProbeMaskFileNames and OutputDonorMaskFileNames (where relevant) should be directory strings relative to the location of the system performance CSV.

--sbin

  * Evaluate the system output masks as binarized masks with a numeric threshold in the interval [-1,255]. -1 is allowed to give the performer the option to binarize the entire mask to white. Choosing -10 will forego this option. (default = -10)

### 7.3 Output Options<a name="optionsoutputs"></a>

--outRoot

  * Specify the report output path and the file prefix joined by a '/'. For example, specifying "--outRoot test/output" for a submission NIST_001 will output the aggregate score report "output_mask_score.csv" and the per-image report "output_mask_scores_perimage.csv" in the "test" directory.

### 7.4 Printout Options<a name="optionsprintout"></a>

-v verbose

  * Control print output. Select 1 to print all non-error related output and 0 to suppress all print output (bar argument-parsing errors).

### 7.5 Scoring Options<a name="optionsscoring"></a>

--eks

  * Erosion kernel size. (number must be odd; default = 15)
  
--dks

  * Dilation kernel size. (number must be odd; default = 11)

--ntdks

  * Non-target dilation kernel for regions that the user does not want scored. (number must be odd; default = 15)

--nspx

  * Set a pixel value in the system output mask to be a no-score region. The pixel value must be in the interval [0,255]. -1 indicates that no particular pixel value will be chosen to be the no-score zone. (default = -1)

--pppns perProbePixelNoScore

  * Use the pixel values in the OptOutPixel column of the system output to designate no-score zones. This value will override the value set for the global no-score pixel.

-k kernel

  * The shape of the kernel to be used, for both erosion and dilation. Choose from 'box','disc','diamond','gaussian', or 'line'. The default kernel is 'box'.
  
--optOut

  * Evaluate algorithm performance on a select number of trials determined by the performer via values in the ProbeStatus column. The values in the column that are opted out of the evaluation are "NonProcessed" (denoting some kind of system failure on that probe), "OptOutLocalization" (denoting opting out of the localization task only), and "OptOutAll" (denoting opting out of both the detection and localization tasks).
  
--precision

  * The number of digits to round computed scores. Note that rounding is not absolute, but is by significant digits (e.g. a score of 0.003333333333333... will round to 0.0033333 for a precision of 5). (default = 16).
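Rounding to significant digits, as described for --precision, can be sketched with Python's `%g` formatting. This shows the rounding rule only and is not necessarily how the scorer implements it; the function name `round_sig` is hypothetical.

```python
def round_sig(value, digits):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0.0
    # "%.*g" keeps `digits` significant digits regardless of magnitude.
    return float("%.*g" % (digits, value))


print(round_sig(0.003333333333333, 5))
print(round_sig(123456.0, 2))
```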

#### 7.5.1 Temporal Scoring Options

The following options were taken from or inspired by the Video Temporal Localization Scorer and apply primarily to temporal scoring, but have analogous effects in Spatial-Temporal scoring when set.

-c, --collars

  * An integer designating the number of frames to add to each side of the endpoints of the reference intervals, generating a temporal boundary no-score zone. In spatial-temporal localization scoring, every pixel of a collared frame is treated as a boundary no-score zone.

--truncate

  * Truncates any system intervals that go beyond the video reference framecount, setting them to the framecount value and enabling them to be scored.
  
--temporal_gt_only

  * Scores only on frames where there is temporal localization manipulation.

--temporal_scoring_only

  * Generate only the temporal localization metrics, skipping the spatial localization metrics.
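The effect of --collars and --truncate on frame intervals can be sketched as follows. Intervals are assumed to be inclusive (start, end) pairs of 1-based frame numbers. The sketch also assumes that the collar covers every frame within `collar` frames of an interval endpoint, including the endpoint itself, and that a collar of 0 produces no no-score frames. These details are not specified in this document, and the function names are hypothetical.

```python
def collar_frames(ref_intervals, collar, frame_count):
    """Return the sorted frames that fall in the temporal no-score zone."""
    no_score = set()
    if collar <= 0:
        return []
    for start, end in ref_intervals:
        for endpoint in (start, end):
            low = max(1, endpoint - collar)
            high = min(frame_count, endpoint + collar)
            no_score.update(range(low, high + 1))
    return sorted(no_score)


def truncate_intervals(sys_intervals, frame_count):
    """Clamp system intervals so they do not exceed the frame count."""
    return [(min(start, frame_count), min(end, frame_count))
            for start, end in sys_intervals]


print(collar_frames([(10, 20)], collar=2, frame_count=21))
print(truncate_intervals([(15, 30)], frame_count=21))
```

In spatial-temporal scoring, every pixel of the frames returned by `collar_frames` would have a weight of 0.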


### 7.6 Performance Evaluation by Query<a name="optionsquery"></a>

This option allows the user to evaluate their algorithm performance on either subsets or partitions of the data based on the specified queries and query options. The reference and index CSV files contain a list of factors (e.g., ProbePostProcessed|DonorPostProcessed|ManipulationQuality|IsManipulationTypeRemoval|...). Selecting none of the following factors will output a single report table (CSV) over the entire computed dataset.

All of the options here are identical in function to their image localization scorer counterparts.

-q query
 * Evaluate algorithm performance on a partitioned dataset using multiple factor queries, one at a time (e.g. "Collection==['NC2017'] & Purpose==['add','remove']" will average over the rows that fit this criterion for one queried average, but "Collection==['NC2017']" "Purpose==['add','remove']" will average over the first and then the second independently for two queried averages). The option generates N report tables (CSV), one for each query.
   * Syntax: -q "query1" "query2" "query3" ...
   - The syntax is the same as Pandas' query syntax. Please see the detailed query rule in the website: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-query.
   
   Examples:
   
   % -q "Collection==['Nimble-SCI']" => 1 query
   
   % -q "Collection==['Nimble-SCI'] and PostProcessing==['rescale']" => 1 query
   
   % -q "Collection==['Nimble-SCI','Nimble-WEB']" "PostProcessing==['rescale']" "200<ProbeWidth<=3000" => 3 queries

-qp queryPartition
 * Uses one factor query to evaluate algorithm performance on a partitioned dataset through its individual sub-queries (e.g. "Collection==['NC2017'] & Purpose==['add','remove']" will average over "Collection==['NC2017'] & Purpose==['add']" and "Collection==['NC2017'] & Purpose==['remove']" for a total of two queried averages). This option generates a single report table (CSV) that contains M partition results, one result for each query.
   * Syntax: -qp "query"
   
   Examples:
   
   % -qp "Collection==['Nimble-SCI']" => 1 partition
   
   % -qp "Collection==['Nimble-SCI','Nimble-WEB'] & PostProcessing==['rescale']" => 2 partitions
   
   % -qp "Collection==['Nimble-SCI','Nimble-WEB'] & PostProcessing==['rescale','noise']" => 4 partitions
   
-qm queryManipulation
 * Filters the dataset before scoring takes place for some number of queries. It is functionally similar to the -q query option. The option generates M report tables (CSV), one for each query.
   * Syntax: -qm "query1" "query2" "query3" ...
   - Like the -q option, the syntax is the same as Pandas' query syntax.
   
   Examples:
   
   % -qm "Purpose==['remove']" => 1 query
   
   % -qm "Operation==['PasteSplice']" "Operation==['FillContentAwareFill']" => 2 queries
   
   % -qm "Purpose==['remove']" "Purpose==['add']" "Purpose==['splice']" => 3 queries
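The way a -qp query expands into partitions can be sketched with `itertools.product`: each combination of one value per factor becomes its own sub-query. The sketch takes the factors as an already parsed list of (name, values) pairs rather than parsing a query string, and the function name `partition_queries` is hypothetical.

```python
from itertools import product


def partition_queries(factors):
    """Expand (name, values) pairs into one sub-query per combination."""
    names = [name for name, _ in factors]
    value_lists = [values for _, values in factors]
    queries = []
    for combo in product(*value_lists):
        terms = ["%s==[%r]" % (name, value)
                 for name, value in zip(names, combo)]
        queries.append(" & ".join(terms))
    return queries


factors = [
    ("Collection", ["Nimble-SCI", "Nimble-WEB"]),
    ("PostProcessing", ["rescale", "noise"]),
]
for query in partition_queries(factors):
    print(query)
```

Two factors with two values each give four partitions, matching the last -qp example above.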

## 8. Examples<a name="examples"></a>

To set up the tests, run the code below to generate the masks for the sample reference and system output. The masks are generated rather than stored because they can take up at least 0.5 GB each.

```bash
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd )"
TESTDIR=$(realpath $DIR/../../data/test_suite/videoSpatialLocalizationScorerTests)
python2 $TESTDIR/gen_masks_for_ds.py -ds $TESTDIR
python2 $TESTDIR/gen_spatial_mask.py -s $TESTDIR/p-vsltest_1/p-vsltest_1.csv -x $TESTDIR/indexes/MFC18_Dev2-manipulation-video-index.csv
```

The code below runs a test over these masks.

```bash
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd )"
TESTDIR=$(realpath $DIR/../../data/test_suite/videoSpatialLocalizationScorerTests)

python2 $DIR/VideoSpatialLocalizationScorer.py -t manipulation\
                                          --refDir $TESTDIR\
                                          --sysDir $TESTDIR/p-vsltest_1\
                                          -r reference/manipulation-video/MFC18_Dev2-manipulation-video-ref.csv\
                                          -s p-vsltest_1.csv\
                                          -x indexes/MFC18_Dev2-manipulation-video-index.csv\
                                          -oR $DIR/test1/test1\
                                          --truncate
```

Running this code will generate an aggregate report of the computed mask scores titled test1_mask_scores.csv and a per-video score report titled test1_pervideo.csv for the manipulation task, in the tools/VideoSpatialLocalizationScorer/test1 directory.

The user may also select which manipulation regions to score based on other factors in the reference files. Other regions are dilated by a separate kernel and counted as selective no-score zones, in addition to the boundary no-score zones applied to the regions of interest. Multiple pre-filters can be applied independently to the data, producing one set of outputs for each query passed to -qm.

The following code demonstrates selective scoring.

```bash
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd )"
TESTDIR=$(realpath $DIR/../../data/test_suite/videoSpatialLocalizationScorerTests)

python2 $DIR/VideoSpatialLocalizationScorer.py -t manipulation\
                                          --refDir $TESTDIR\
                                          --sysDir $TESTDIR/p-vsltest_1\
                                          -r reference/manipulation-video/MFC18_Dev2-manipulation-video-ref.csv\
                                          -s p-vsltest_1.csv\
                                          -x indexes/MFC18_Dev2-manipulation-video-index.csv\
                                          -oR $DIR/sel1/sel1\
                                          -qm "Operation == 'PasteSampled'"\
                                          --truncate
```

## Disclaimer

This software was developed at the National Institute of Standards
and Technology (NIST) by employees of the Federal Government in the
course of their official duties. Pursuant to Title 17 Section 105
of the United States Code, this software is not subject to copyright
protection and is in the public domain. NIST assumes no responsibility
whatsoever for use by other parties of its source code or open source
server, and makes no guarantees, expressed or implied, about its quality,
reliability, or any other characteristic.