This repository contains an implementation of the parsers described in Arc-swift: A Novel Transition System for Dependency Parsing.
Running the parsers requires TensorFlow 1.0 or above. The data preparation script also requires a working Java 8 installation to run Stanford CoreNLP. Other Python dependencies are listed in requirements.txt and can be installed with pip by running

```
pip install -r requirements.txt
```
Train your own parsers
We provide code for converting the Wall Street Journal section of Penn Treebank into Stanford Dependencies.
To use the code, you first need to obtain the corresponding parse trees from LDC and make the standard train/dev/test split (Sections 02-21 for training, 22 for development, and 23 for testing). Copy the splits to utils/ptb3-wsj and name them ptb3-wsj-train.trees, ptb3-wsj-dev.trees, and ptb3-wsj-test.trees, respectively, then run the following scripts

```
./setup_corenlp.sh 3.3
./convert_splits_to_depparse.sh
```
The first script downloads Stanford CoreNLP v3.3.0 into this directory, which is needed by the second script to convert the Penn Treebank parse trees to dependency parses.
To make the parse trees available for training, it is necessary to keep only projective trees in the training set. To do this, go to utils/ and run

```
python filter_nonproj.py ptb3-wsj/ptb3-wsj-train.conllx ../src/ptb3-wsj-train.conllx
```
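A dependency tree is projective when no two of its arcs cross; filter_nonproj.py removes training trees that violate this. For intuition, a minimal standalone check (an illustration only, not the repo's implementation) might look like:

```python
def is_projective(heads):
    """heads[d-1] is the 1-based head of token d; 0 denotes the root."""
    # Represent each arc by its (left, right) endpoints on the sentence line
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            # Two arcs cross when exactly one endpoint of one arc
            # lies strictly inside the span of the other
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True
```

For example, `is_projective([2, 0, 2])` holds, while a tree with arcs (1, 3) and (2, 4) does not.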
The dev and test sets shouldn't be altered, so we just copy them directly to the src directory for later use.

```
cp ptb3-wsj/ptb3-wsj-dev.conllx ../src
cp ptb3-wsj/ptb3-wsj-test.conllx ../src
```
As a final step in data preparation, we need to create a file that maps all dependency arc types and all part-of-speech (POS) tags to integers. This can be achieved by running the following script under utils/

```
./create_mappings.sh ptb3-wsj/ptb3-wsj-train.conllx > ../src/mappings-ptb.txt
```
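Conceptually, the mappings file just enumerates the POS tags and dependency relation types seen in the training data and assigns each an integer. As a rough sketch (the actual file format is whatever create_mappings.sh emits; the column indices below assume CoNLL-X), collecting the two vocabularies might look like:

```python
def build_mappings(conllx_path):
    """Collect POS tags (CoNLL-X column 5) and dependency relations
    (column 8) from a file and assign each a stable integer id.
    Illustration only; create_mappings.sh defines the real format."""
    pos_tags, deprels = set(), set()
    with open(conllx_path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Skip blank lines between sentences and malformed rows
            if len(fields) >= 8 and fields[0].isdigit():
                pos_tags.add(fields[4])
                deprels.add(fields[7])
    pos_map = {t: i for i, t in enumerate(sorted(pos_tags))}
    rel_map = {r: i for i, r in enumerate(sorted(deprels))}
    return pos_map, rel_map
```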
To train the parser, we also need to create the oracle sequence of transitions for the parser to follow. To do this, go to src/ and run

```
python gen_oracle_seq.py ptb3-wsj-train.conllx train.ASw.seq --transsys ASw --mappings mappings-ptb.txt
```
Here mappings-ptb.txt is the mappings file we just created, ASw stands for arc-swift, and train.ASw.seq is the output file containing the oracle transitions for the training data.
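For context, arc-swift generalizes arc-eager by letting arc transitions reach any stack position in a single step. A rough sketch of the state changes as described in the paper (an illustration only, not the repo's implementation; stack positions are 1-based from the top):

```python
def shift(stack, buf, arcs):
    # Shift: move the buffer front onto the stack
    return stack + [buf[0]], buf[1:], arcs

def left_arc(stack, buf, arcs, k):
    # Left-Arc[k]: make stack[k] a dependent of the buffer front,
    # removing the top k stack items; the buffer is unchanged
    return stack[:-k], buf, arcs + [(buf[0], stack[-k])]

def right_arc(stack, buf, arcs, k):
    # Right-Arc[k]: make the buffer front a dependent of stack[k],
    # removing the k-1 items above it and pushing the buffer front
    remaining = stack[:-(k - 1)] if k > 1 else stack[:]
    return remaining + [buf[0]], buf[1:], arcs + [(stack[-k], buf[0])]
```

With k fixed to 1 these reduce to the familiar arc-eager transitions; larger k folds the intervening reduces into the arc transition itself.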
The processing steps of Universal Dependencies trees and that of Penn Treebank parses is very similar, modulo that conversion to dependency parses is not necessary. The Universal Dependencies v1.3 data used in the paper can be found here.
To train the parsers, you might want to download the pretrained GloVe vectors. For the experiments in the paper, we used the 100-dimensional embeddings trained on Wikipedia and Gigaword (glove.6B.zip). Download and unzip the GloVe files in src; then, to train the arc-swift parser, simply run

```
mkdir ASw_models
python train_parser.py ptb3-wsj-train.conllx --seq_file train.ASw.seq --wordvec_file glove.6B.100d.txt --vocab_file vocab.pickle --feat_file ptb_ASw_feats.pickle --model_dir ASw_models --transsys ASw --mappings_file mappings-ptb.txt
```
Note that if you're using a GPU, you might want to set CUDA_VISIBLE_DEVICES to tell TensorFlow which GPU device to use.
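For example (the device index 0 here is just an assumption about your machine):

```shell
# Make only GPU 0 visible to TensorFlow before launching training
export CUDA_VISIBLE_DEVICES=0
# then run the training command as above, e.g.:
# python train_parser.py ptb3-wsj-train.conllx --seq_file train.ASw.seq ...
```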
vocab.pickle and ptb_ASw_feats.pickle are files that the training code will automatically generate and reuse should you want to train the parser more than once on the same data. For more arguments the training code supports, run

```
python train_parser.py -h
```
To train parsers for other transition systems, simply replace the
--transsys argument with the short name for the transition system you are interested in.
| Short name | Transition system |
|------------|-------------------|
| ASw        | arc-swift         |
To train the parsers on Universal Dependencies English Treebank,
--epoch_multiplier 3 should also be used to reproduce the training settings described in the paper.
To evaluate the trained parsers, run

```
python eval_parser.py ptb3-wsj-dev.conllx --vocab_file vocab.pickle --model_dir ASw_models --transsys ASw --eval_dataset dev --mappings_file mappings-ptb.txt
```
This will generate output files in the model directory with names like ASw_models/dev_eval_beam_1_output_epoch0.txt, which contain the predicted dependency parses.
We also provide a Python implementation of labelled and unlabelled attachment score evaluation. The interface is very similar to that of the official CoNLL script; simply run

```
cut -d$'\t' -f1-6 ptb3-wsj-dev.conllx | paste - ASw_models/dev_eval_beam_1_output_epoch0.txt > system_pred.conllx
python eval.py -g ptb3-wsj-dev.conllx -s system_pred.conllx
```

where -g stands for the gold file and -s for the system prediction. Note that the delimiter in the call to cut is a tab character. By default the script removes punctuation according to the gold Penn Treebank POS tags of the tokens. To generate results compatible with the official CoNLL evaluation script, make sure to use the corresponding option of eval.py.
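For intuition, UAS and LAS reduce to per-token comparisons of predicted (head, relation) pairs against gold; a minimal sketch, omitting the punctuation filtering eval.py performs:

```python
def attachment_scores(gold, pred):
    """gold, pred: lists of (head, deprel) pairs, one per token.
    Returns (UAS, LAS) as fractions; punctuation filtering omitted."""
    assert len(gold) == len(pred)
    # UAS counts correct heads; LAS additionally requires the correct relation
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las
```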
All work contained in this package is licensed under the Apache License, Version 2.0. See the included LICENSE file.