Following are instructions to evaluate predictions and compute metrics for each dataset in the SCROLLS benchmark separately (validation splits only).
If you have predictions for all of the SCROLLS datasets, you can go to Evaluate Benchmark to evaluate all of them at once.
- Setup environment.
- Prediction Format expected by the script below.
Please set:
- `DATASET_NAME` to be one of: `["gov_report", "summ_screen_fd", "qmsum", "narrative_qa", "qasper", "quality", "contract_nli"]`
- `PREDICTIONS_JSON` to be the path to your prediction file (in the format specified in Prediction Format).
- `METRICS_OUTPUT_DIR` to be the path where the results and errors (if any) will be saved.
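Before running the evaluator, you can sanity-check your prediction file. Below is a minimal sketch of creating one, assuming the script expects a JSON object mapping example IDs to prediction strings (the IDs and outputs here are hypothetical placeholders; consult Prediction Format for the exact schema):

```python
import json

# Hypothetical predictions: a mapping from example ID to the model's output string.
predictions = {
    "example_id_1": "The generated summary for the first example.",
    "example_id_2": "The generated summary for the second example.",
}

# Write the file that will be passed as PREDICTIONS_JSON.
with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```

Loading the file back with `json.load` and checking that every value is a string is a quick way to catch formatting mistakes before evaluation.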
Run:
`python dataset_evaluator.py --split validation --dataset_name DATASET_NAME --predictions PREDICTIONS_JSON --metrics_output_dir METRICS_OUTPUT_DIR`