Following are instructions to evaluate predictions and compute metrics for each dataset in the SCROLLS benchmark separately (validation splits only).
If you have predictions for all of the SCROLLS datasets, you can go to Evaluate Benchmark to evaluate all of them at once.
- Setup environment.
- Prediction Format expected by the script below.
Please set:
- `DATASET_NAME` to be one of: `["gov_report", "summ_screen_fd", "qmsum", "narrative_qa", "qasper", "quality", "contract_nli"]`
- `PREDICTIONS_JSON` to be the path to your prediction file (in the format specified in Prediction Format).
- `METRICS_OUTPUT_DIR` to be the path where the results and errors (if any) will be saved.
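Before running the evaluator, you can sanity-check your prediction file. Below is a minimal sketch of creating one, assuming the script expects a JSON object mapping example IDs to prediction strings (the IDs and outputs here are hypothetical placeholders; consult Prediction Format for the exact schema):

```python
import json

# Hypothetical predictions: a mapping from example ID to the model's output string.
predictions = {
    "example_id_1": "The generated summary for the first example.",
    "example_id_2": "The generated summary for the second example.",
}

# Write the file that will be passed as PREDICTIONS_JSON.
with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```

Loading the file back with `json.load` and checking that every value is a string is a quick way to catch formatting mistakes before evaluation.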
Run:
`python dataset_evaluator.py --split validation --dataset_name DATASET_NAME --predictions PREDICTIONS_JSON --metrics_output_dir METRICS_OUTPUT_DIR`