plot-data

This repo contains data for plot formatting actions in Vega-Lite, which you can see in the viewer. You can find the processed data in the releases.

To see the last five lines of processed data, try jq . plot-data.sample.jsonl

Processed data

The processed data is in ./data; make generates it from the contents of hits.

URL of the processed data: https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/plot-data.jsonl
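For working with the processed file from Python rather than jq, a minimal sketch follows. It assumes only that the file is JSON Lines (one JSON value per line), as the .jsonl extension suggests; the helper names read_jsonl and tail are hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list with one parsed JSON value per non-empty line of `path`."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def tail(records, n=5):
    """Return the last `n` records (all of them if there are fewer)."""
    return records[-n:] if n > 0 else []
```

For example, tail(read_jsonl("data/plot-data.jsonl")) would return the last five records of a local copy of the file.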

Collecting data

  • First, deploy the speaker HITs: python mturk/create_speaker_hit.py --num-hit 10 --num-assignment 5 (optionally with --is-sandbox)
    • This creates hits/timestamp/speaker.HITs.txt and speaker.sample_hit, and deploys the HITs
    • Note that assignment_ids are only available once someone works on the HIT
    • Run make speaker.assignments to check whether the assignments are completed
  • In the Makefile, set the SPEAKER_EXEC variable to correspond to where the server log is located
  • Run make speaker.jsonl to filter and process the data, and make speaker.review to approve and reject HITs
  • Restart the server with the previous speaker data as VegaResources.examplesPath; the server then selects randomly from the specified examples for the listeners
  • Run python mturk/create_listener_hit.py hits/SPEAKER_HIT --num-hit 10 --num-assignment 5 (optionally with --is-sandbox)
    • Wait for these HITs to complete; run make listener.assignments to check on them and make listener.review to approve them
  • Set LISTENER_EXEC as well, and run make speaker.listener.jsonl to process the data
  • Alternatively, wait for both the speaker and listener HITs to complete, and run make visualize
    • It seems necessary to inspect speaker.status before deploying the speaker data to the listeners, to make sure there are no incorrect rejections and no new spam. This manual check prevents the process from being fully automated.

Useful commands

jq -c 'if .q[0]=="accept" then .q[1] else empty end' speaker.raw.jsonl
cat data/query.json | jq -c '.q[1].utterance'
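The first jq filter above can be approximated in Python when jq is not available. This is a sketch under an assumption read off the filter itself: each line is a JSON object whose q field is an array with an action string at index 0 and a payload at index 1. The function names are hypothetical. Unlike jq, it skips records whose q has fewer than two elements instead of emitting null.

```python
import json


def accepted_payloads(lines):
    """Yield q[1] for every JSON line whose q[0] equals "accept"."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        q = record.get("q") or []  # a missing "q" is treated as empty
        if len(q) > 1 and q[0] == "accept":
            yield q[1]


def utterances(payloads):
    """Yield the "utterance" field of each payload that has one."""
    for payload in payloads:
        if isinstance(payload, dict) and "utterance" in payload:
            yield payload["utterance"]
```

With a file handle, list(utterances(accepted_payloads(open("speaker.raw.jsonl")))) would collect the utterances of accepted entries, assuming the payloads carry an utterance field as in the second command.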

Generating splits

Use split_data.py to split data into train/test (no dev since all the Turk data is dev data):

python split_data.py randomWithNoCanon.jsonl randomWithNoCanon_splitIndep  # Split each example separately
python split_data.py -s randomWithNoCanon.jsonl randomWithNoCanon_splitSess  # Split by sessionId == MTurk ID
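The difference between the two modes can be illustrated with a short sketch. This is not the code of split_data.py, whose internals are not shown here; the function names, the test_fraction and seed parameters, and the 80/20 default are assumptions for illustration. Only the sessionId field name comes from the comment above.

```python
import random
from collections import defaultdict


def split_independent(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and cut once, ignoring which session they belong to."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def split_by_session(examples, test_fraction=0.2, seed=0, key="sessionId"):
    """Assign whole sessions to train or test so no session spans both."""
    by_session = defaultdict(list)
    for example in examples:
        by_session[example[key]].append(example)
    sessions = sorted(by_session)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(sessions)
    n_test = int(len(sessions) * test_fraction)
    train = [e for s in sessions[n_test:] for e in by_session[s]]
    test = [e for s in sessions[:n_test] for e in by_session[s]]
    return train, test
```

Splitting by session keeps every example from one worker on the same side of the split, so test results are not inflated by a model having already seen that worker's phrasing during training.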