Skip to content

Here lies all the little nifty tools that might help others to make the most out of SQuAD

Notifications You must be signed in to change notification settings

salmedina/SQuAD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

SQUAD Reformatter

The SQuAD dataset was designed for the task of machine reading comprehension. The task consists in answering a question related to a specific passage by selecting which part of the passage answers the question.

In this repo you will find a simple tool which reformats the SQuAD dataset into a more useful way in which it breaks apart the passages into sentences, and adds to the answers the sentence in which it is found.

Usage

python SquadProcessing.py <squad_file> <save_file>

An example of a call from the root folder would be:

python src/SquadProcessing.py data/dev-v1.0.json out/proc-dev-v1.0.json

Data Description

Original format

The dataset was given in 2 different JSON files that have the following structure:

  • title : str
  • paragraphs : []
    • context : str
    • qas : []
      • question : str
      • id : int
      • answers : []
        • text : str
        • answer_start : int

The index described in answer_start is 0 based.

Output format

Processed QA JSON Structure:

  • root : []
    • topic : str
    • paragraphs : []
      • pid : int
      • sentences: []
        • text : str
        • pos : int
        • start_idx : int
      • qas: []
        • qid : int
        • question : str
        • answers:
          • sent_pos : int
          • text : str
          • answer_start : int

The passage id embeds the id for the topic and its corresponding passages in the following way:

  • the 10k's describe the topic
  • the units describe the passage

About

Here lies all the little nifty tools that might help others to make the most out of SQuAD

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages