preprocess-conll05

Scripts for preprocessing the CoNLL-2005 SRL dataset.

Requirements:

  • A copy of the Penn Treebank (WSJ and Brown portions)
  • Python 3
  • The Stanford Parser and Stanford POS Tagger (for the further pre-processing steps below)

Basic CoNLL-2005 pre-processing

These pre-processing steps download the CoNLL-2005 data and gather gold part-of-speech tags and parse information from your copy of the PTB. The output will look like:

The         DT    (S(NP-SBJ-1(NP*  *    -   -      (A1*      
economy     NN    *                *    -   -      *      
's          POS   *)               *    -   -      *      
temperature NN    *)               *    -   -      *)     
will        MD    (VP*             *    -   -      (AM-MOD*)     
be          VB    (VP*             *    -   -      *      
taken       VBN   (VP*             *    01  take   (V*) 
  • Field 1: word form
  • Field 2: gold part-of-speech tag
  • Field 3: gold syntax
  • Field 4: placeholder
  • Field 5: verb sense
  • Field 6: predicate (infinitive form)
  • Field 7+: one column per predicate, containing the labeled argument spans of that predicate.
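
For reference, here is a minimal sketch of how these columns could be read back into per-sentence records. The file handling and dictionary keys are illustrative only, not part of the scripts in this repo:

def read_conll05(path):
    """Yield sentences as lists of token dicts from a CoNLL-2005 style file."""
    sentence = []
    with open(path) as f:
        for line in f:
            if not line.strip():              # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            cols = line.split()
            sentence.append({
                "form": cols[0],              # field 1: word form
                "pos": cols[1],               # field 2: gold part-of-speech tag
                "parse": cols[2],             # field 3: gold syntax
                "sense": cols[4],             # field 5: verb sense
                "predicate": cols[5],         # field 6: predicate (infinitive form)
                "args": cols[6:],             # field 7+: one column per predicate
            })
    if sentence:
        yield sentence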

First, set up paths to existing data:

export WSJ="/your/path/to/wsj/"
export BROWN="/your/path/to/brown"

Download CoNLL-2005 data and scripts:

./bin/basic/get_data.sh

Extract POS and parse info from the gold data:

./bin/basic/extract_train_from_ptb.sh
./bin/basic/extract_dev_from_ptb.sh
./bin/basic/extract_test_from_ptb.sh
./bin/basic/extract_test_from_brown.sh

Format into combined output files:

./bin/basic/make-trainset.sh
./bin/basic/make-devset.sh 
./bin/basic/make-wsj-test.sh
./bin/basic/make-brown-test.sh 

Further pre-processing (e.g. for LISA)

Sometimes it is useful to convert the constituency parses to dependency parses and to provide automatic part-of-speech tags, e.g. if you wish to train a parsing model. BIO format is also a more standard way of representing spans than the default CoNLL-2005 bracket format. This pre-processing converts the constituency parses to Stanford dependencies (v3.5), assigns automatic part-of-speech tags from the Stanford left3words tagger, and converts the SRL spans to BIO format. The output will look like:

conll05 0       0       The         DT      DT      2       det         _       -       -       -       -       O       B-A1
conll05 0       1       economy     NN      NN      4       poss        _       -       -       -       -       O       I-A1
conll05 0       2       's          POS     POS     2       possessive  _       -       -       -       -       O       I-A1
conll05 0       3       temperature NN      NN      7       nsubjpass   _       -       -       -       -       O       I-A1
conll05 0       4       will        MD      MD      7       aux         _       -       -       -       -       O       B-AM-MOD
conll05 0       5       be          VB      VB      7       auxpass     _       -       -       -       -       O       O
conll05 0       6       taken       VBN     VBN     0       root        _       01      take    -       -       O       B-V
  • Field 1: domain placeholder
  • Field 2: sentence id
  • Field 3: token id
  • Field 4: word form
  • Field 5: gold part-of-speech tag
  • Field 6: auto part-of-speech tag
  • Field 7: dependency parse head
  • Field 8: dependency parse label
  • Field 9: placeholder
  • Field 10: verb sense
  • Field 11: predicate (infinitive form)
  • Field 12: placeholder
  • Field 13: placeholder
  • Field 14: NER placeholder
  • Field 15+: one column per predicate, containing the BIO-encoded argument labels of that predicate.
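
As a reference for the mapping, here is a simplified sketch of how the bracketed span labels from the basic format (e.g. (A1*, *, *)) correspond to the BIO tags shown above. It illustrates the encoding only and is not the repo's convert-bio.sh:

def spans_to_bio(column):
    """Convert one predicate's column of bracketed span labels to BIO tags."""
    bio = []
    current = None                               # label of the span we are inside, if any
    for tag in column:
        if tag.startswith("("):
            current = tag.lstrip("(").rstrip("*)")   # e.g. "(A1*" -> "A1"
            bio.append("B-" + current)
        elif current is not None:
            bio.append("I-" + current)
        else:
            bio.append("O")
        if tag.endswith(")"):                    # span closes on this token
            current = None
    return bio

# example: the column from the basic-format sample above
print(spans_to_bio(["(A1*", "*", "*", "*)", "(AM-MOD*)", "*", "(V*)"]))
# ['B-A1', 'I-A1', 'I-A1', 'I-A1', 'B-AM-MOD', 'O', 'B-V']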

First, set up paths to Stanford parser and part-of-speech tagger:

export STANFORD_PARSER="/your/path/to/stanford-parser-full-2017-06-09"
export STANFORD_POS="/your/path/to/stanford-postagger-full-2017-06-09"

The following script will then convert the parses to dependencies, assign automatic part-of-speech tags, and reformat the data. It creates a new file in the $CONLL05 directory with the same name as the input plus the suffix .parse.sdeps.combined. If $CONLL05 is not set, set it to the path of your conll05st-release directory.

./bin/preprocess_conll05_sdeps.sh $CONLL05/train-set.gz
./bin/preprocess_conll05_sdeps.sh $CONLL05/dev-set.gz
./bin/preprocess_conll05_sdeps.sh $CONLL05/test.wsj.gz
./bin/preprocess_conll05_sdeps.sh $CONLL05/test.brown.gz

Now all that remains is to convert the SRL spans to BIO format. The following script will create a new file in the same directory as the input, with the suffix .bio:

./bin/convert-bio.sh $CONLL05/train-set.gz.parse.sdeps.combined
./bin/convert-bio.sh $CONLL05/dev-set.gz.parse.sdeps.combined
./bin/convert-bio.sh $CONLL05/test.wsj.gz.parse.sdeps.combined
./bin/convert-bio.sh $CONLL05/test.brown.gz.parse.sdeps.combined

You may also want to generate a matrix of transition probabilities for performing Viterbi inference at test time. You can use the following to do so:

python3 bin/compute_transition_probs.py --in_file_name $CONLL05/train-set.gz.parse.sdeps.combined.bio > $CONLL05/transition_probs.tsv
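
As an illustration of what such a matrix contains, the sketch below counts bigrams over consecutive BIO tags in one SRL column and normalizes them into probabilities. The column index and output format are assumptions and may differ from what bin/compute_transition_probs.py actually does:

from collections import Counter, defaultdict

def transition_probs(path, prop_field=14):
    """Estimate P(tag | previous tag) from one BIO column of a .bio file."""
    counts = defaultdict(Counter)
    prev = "<START>"
    with open(path) as f:
        for line in f:
            if not line.strip():                 # sentence boundary restarts the chain
                prev = "<START>"
                continue
            cols = line.split()
            if len(cols) <= prop_field:          # skip lines without an SRL column
                continue
            counts[prev][cols[prop_field]] += 1
            prev = cols[prop_field]
    # normalize the counts into conditional probabilities
    return {p: {t: c / sum(row.values()) for t, c in row.items()}
            for p, row in counts.items()}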

Pre-processing for evaluation scripts

To evaluate using the CoNLL eval.pl and srl-eval.pl scripts, you'll need files in a different format to evaluate against. To generate files for parse evaluation (eval.pl), use the following script:

python3 bin/eval/extract_conll_parse_file.py --input_file $CONLL05/dev-set.gz.parse.sdeps.combined --id_field 2 --word_field 3 --pos_field 4 --head_field 6 --label_field 7 > $CONLL05/conll2005-dev-gold-parse.txt
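
Note that the --*_field flags are zero-based column indices into the combined file (the field list in the previous section is one-based). The sketch below only illustrates that mapping by pulling out the same columns; it does not reproduce the exact layout that extract_conll_parse_file.py emits for eval.pl:

def extract_parse_columns(path, fields=(2, 3, 4, 6, 7)):
    """Print id, word, gold POS, head, and label columns, one sentence per block."""
    with open(path) as f:
        for line in f:
            if not line.strip():
                print()                          # preserve sentence boundaries
                continue
            cols = line.split()
            print("\t".join(cols[i] for i in fields))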

For SRL evaluation (srl-eval.pl), use the following:

python3 bin/eval/extract_conll_prop_file.py --input_file $CONLL05/dev-set.gz.parse.sdeps.combined --take_last --word_field 3 --pred_field 10 --first_prop_field 14 > $CONLL05/conll2005-dev-gold-props.txt
