To assess long-text capabilities more comprehensively, we propose Needle-in-a-Haystack PLUS, which shifts the focus from simple fact retrieval to more challenging single-document and multi-document question answering tasks.

zuucan/NeedleInAHaystack-PLUS

NeedleInAHaystack-PLUS

How to evaluate on NeedleInAHaystack-PLUS

Load Data

Our test data can be downloaded from NeedleInAHaystack-PLUS.
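The on-disk format of the downloaded files is not stated here. As a minimal sketch, assuming the records are stored as JSON Lines (one JSON object per line), they could be parsed as follows. The function name `parse_jsonl` and the file name in the usage comment are hypothetical, not part of this repository.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line of JSON Lines text.

    Assumption: the test data is stored as JSON Lines. Adjust this if
    the downloaded files use a different layout.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number}: invalid JSON ({err})") from err


# Hypothetical usage; the file name is a placeholder, not a path from the repository.
# with open("needle_squad.jsonl", encoding="utf-8") as handle:
#     records = list(parse_jsonl(handle))
```

Taking an iterable of lines rather than a path keeps the parser easy to test and lets it stream large files without loading every long context into memory at once.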

Data Format

All data in NeedleInAHaystack-PLUS is standardized to the following format:

Single-document QA

{
    "id": "The unique identifier of the test sample.",
    "context": "The long context of the single-document question answering task.",
    "context_length": "The length of the haystack, which ranges from 1,000 to 128,000 tokens at equal intervals (15 different lengths in total).",
    "depth_percent": "The position of the needle in the haystack.",
    "input": "The question of the single-document question answering task.",
    "dataset": "needle_squad",
    "answers": "A list of all true answers."
}
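Before running an evaluation, it can help to confirm that each loaded record carries the fields listed above. The following is a minimal sketch. The helper name `check_single_doc` is hypothetical, and the sample values are made up for illustration. The value types (for example, whether `context_length` is an integer) are not specified here, so only key presence and the list type of `answers` are checked.

```python
# Field names taken from the single-document format described above.
SINGLE_DOC_KEYS = {
    "id", "context", "context_length", "depth_percent",
    "input", "dataset", "answers",
}


def check_single_doc(record: dict) -> list:
    """Return a list of human-readable problems; empty means the record looks fine."""
    problems = [f"missing key: {key}" for key in sorted(SINGLE_DOC_KEYS - set(record))]
    if "answers" in record and not isinstance(record["answers"], list):
        problems.append("answers should be a list")
    return problems


# Illustrative record with made-up values (not taken from the dataset).
sample = {
    "id": "example-0",
    "context": "... long haystack text with the needle passage inside ...",
    "context_length": 1000,
    "depth_percent": 50,
    "input": "What does the needle passage say?",
    "dataset": "needle_squad",
    "answers": ["an example answer"],
}
print(check_single_doc(sample))
```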

Multi-document QA

{
    "id": "The unique identifier of the test sample.",
    "context": "The long context of the multi-document question answering task.",
    "context_length": "The length of the haystack, which ranges from 1,000 to 128,000 tokens at equal intervals (15 different lengths in total).",
    "depth_percent1": "The position of the first needle in the haystack.",
    "depth_percent2": "The position of the second needle in the haystack.",
    "input": "The question of the multi-document question answering task.",
    "dataset": "needle_hotpotqa",
    "answers": "A list of all true answers."
}
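The two formats differ only in how needle positions are stored: one `depth_percent` field for single-document QA, and `depth_percent1` plus `depth_percent2` for multi-document QA. A small sketch of reading both uniformly, keyed on the `dataset` field, is shown below. The helper name `needle_depths` is hypothetical and the sample values are made up.

```python
def needle_depths(record: dict) -> tuple:
    """Return the needle position(s) of a record as a tuple.

    Multi-document records ("needle_hotpotqa") carry two needles;
    single-document records ("needle_squad") carry one.
    """
    if record["dataset"] == "needle_hotpotqa":
        return (record["depth_percent1"], record["depth_percent2"])
    if record["dataset"] == "needle_squad":
        return (record["depth_percent"],)
    raise ValueError(f"unknown dataset: {record['dataset']!r}")


# Illustrative records with made-up depth values.
single = {"dataset": "needle_squad", "depth_percent": 25}
multi = {"dataset": "needle_hotpotqa", "depth_percent1": 10, "depth_percent2": 80}
print(needle_depths(single), needle_depths(multi))
```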

Results Visualization

The APIs were invoked on the following dates:

  • OpenAI's GPT-4-128K (Run 2024-01-31)
  • Anthropic's Claude 2.1 (Run 2024-02-08)

Single-document QA

[Figure: single-document QA results]

Multi-document QA

[Figure: multi-document QA results]
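Needle-in-a-haystack results are commonly summarized as a grid of scores over context length and needle depth. Assuming that layout, the sketch below averages per-sample scores into grid cells that a plotting library could then render. The function name `score_grid` is hypothetical. The scoring metric is not specified here, so the scores in the example are placeholder numbers, not results from this benchmark.

```python
from collections import defaultdict


def score_grid(results):
    """Average scores per (context_length, depth) cell.

    `results` is an iterable of (context_length, depth, score) triples.
    For multi-document records, `depth` could be the tuple of both
    needle positions; any hashable value works as a cell coordinate.
    """
    buckets = defaultdict(list)
    for context_length, depth, score in results:
        buckets[(context_length, depth)].append(score)
    return {cell: sum(scores) / len(scores) for cell, scores in buckets.items()}


# Placeholder scores for illustration only.
example = [(1000, 0, 1.0), (1000, 0, 0.0), (128000, 50, 1.0)]
print(score_grid(example))
```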

Acknowledgement

NeedleInAHaystack-PLUS is based on datasets proposed by previous researchers, including NeedleInAHaystack, SQuAD, and HotpotQA.

Citation

@misc{zhao2024longagent,
      title={LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration}, 
      author={Jun Zhao and Can Zu and Hao Xu and Yi Lu and Wei He and Yiwen Ding and Tao Gui and Qi Zhang and Xuanjing Huang},
      year={2024},
      eprint={2402.11550},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
