This repository contains my master's thesis project on compositional Discourse Representation Structure (DRS) parsing. The project develops a system built on AM-Parser that handles both non-compositional and compositional information. Our system is more robust when parsing longer and more complex sentences, and performs competitively against well-established baseline systems on phenomena such as scope, coreference, and reentrancies.
What you need to make everything run smoothly:
- Python 3.9
- Java JDK 11
- Gradle 0.8
- Dependencies of UD-Boxer, AM-Parser, and AM-Tools.
This project adapts code from branches of the following four repositories:
- UD-Boxer: preprocessing, converting SBN files in PMB4 and PMB5 to DRGs, and postprocessing.
- AM-Parser: training a compositional parser to parse scopeless and simplified DRGs.
- AM-Tools: preparing training data for AM-Parser.
- SBN-Evaluation: providing a fine-grained evaluation of the results in different experiments.
To use the code, please follow these steps: (1) create a conda virtual environment:
conda create -n drsparsing python=3.9
(2) clone our repository
git clone https://github.com/xiulinyang/compositional_drs_parsing.git
(3) clone the other repositories needed for training and evaluation inside this repository:
cd compositional_drs_parsing
git clone -b unsupervised2020 https://github.com/xiulinyang/am-parser.git
git clone https://github.com/xiulinyang/am-tools.git
git clone https://github.com/xiulinyang/SBN-evaluation-tool.git
git clone https://github.com/yzhangcs/parser
The following repositories may also be useful; we do not modify their code:
- Supar: to train the dependency parser
- vulcan: to visualize AM-tree for error analysis
- SMATCH++: to evaluate the performance of different parsers
- SMATCH_RE: to evaluate the performance of the parsers on reentrancies
The pipeline works as follows:
The preprocessing procedure transforms SBNs into DRGs. Once it completes, you can expect three distinct outputs:
- Penman Notation File
- Location: Stored under each specific file directory.
- Penman Information File
- Location: Also found under each individual file directory.
- Data Split Folder
- Location: Located in the working directory.
- Contents: This folder contains a total of eight files:
- Data Splits: Four files that represent different splits of the data.
- Gold Data: Four files that correspond to the gold standard data for each of the data splits.
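To give a feel for the Penman notation output, here is a minimal sketch of turning such a graph into triples. This is an illustration only, not the project's actual tooling (UD-Boxer handles the real conversion), and the example graph is invented rather than taken from the PMB:

```python
import re

def parse_penman(s):
    """Tiny Penman reader: returns (top variable, list of triples).
    Handles nested nodes `(var / concept :role target)`; a sketch
    only, not a replacement for a full Penman library."""
    tokens = re.findall(r'\(|\)|/|:[^\s()]+|"[^"]*"|[^\s()/:]+', s)
    triples, stack, top = [], [], None
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == '(':
            var = tokens[i + 1]
            if top is None:
                top = var
            if stack and stack[-1][1] is not None:
                # a pending role on the parent points to this new node
                parent, role = stack[-1]
                triples.append((parent, role, var))
                stack[-1] = (parent, None)
            stack.append((var, None))
            i += 2
        elif tok == ')':
            stack.pop()
            i += 1
        elif tok == '/':
            triples.append((stack[-1][0], ':instance', tokens[i + 1]))
            i += 2
        elif tok.startswith(':'):
            if tokens[i + 1] == '(':
                stack[-1] = (stack[-1][0], tok)  # defer until node opens
                i += 1
            else:
                triples.append((stack[-1][0], tok, tokens[i + 1]))
                i += 2
        else:
            i += 1
    return top, triples

# Invented example graph, loosely in the style of a DRG.
top, triples = parse_penman('(b0 / box :member (e1 / entity :EQU now))')
print(top, triples)
```

For real use, a full Penman library with proper error handling is preferable; this sketch only shows the triple structure that the evaluation operates on.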
cd ud_boxer
python sbn_drg_generator.py -s <starting path of PMB> -f <split files of the four splits, in the order train, dev, test, other> -v <4 or 5> -e <name of the directory to store penman info and split>
For more details, please run:
python sbn_drg_generator.py -h
Note that in PMB5, the test-long dataset has not been manually corrected yet and the gold SBN files are not included in the released data. Therefore, when generating the test-long data split, please comment out the last line.
The split data has been generated in the data/data_split folder. If you need the Penman information, node-token alignment, or visualization files for each DRS, please contact me and I will send you a Google Drive link.
Run the following command to generate a .conll file for training a dependency parser to learn scope information:
python scope_converter.py -i data_split/gold4/en_eval.txt (the data split file) -o scope_edge/eval4.conll (the output file) -v 4 (version of PMB)
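For downstream processing, the resulting .conll file can be read sentence by sentence. The sketch below assumes a simplified tab-separated layout with the head index and relation in columns 7 and 8; the exact columns emitted by scope_converter.py may differ:

```python
def read_conll(lines):
    """Group CoNLL lines into sentences of (id, form, head, deprel)
    tuples. Sentences are separated by blank lines. Assumes HEAD and
    DEPREL sit in columns 7-8 (0-indexed 6-7), as in CoNLL-X."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split('\t')
        current.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences

# Invented two-sentence sample in the assumed 8-column layout.
sample = [
    "1\tThe\t_\t_\t_\t_\t2\tdet",
    "2\tcat\t_\t_\t_\t_\t0\troot",
    "",
    "1\tHi\t_\t_\t_\t_\t0\troot",
]
sentences = read_conll(sample)
print(sentences)
```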
To generate training data:
java -cp build/libs/am-tools.jar de.saar.coli.amtools.decomposition.SourceAutomataCLI -t examples/decomposition_input/mini.dm.sdp -d examples/decomposition_input/mini.dm.sdp -o examples/decomposition_input/dm_out/ -dt DMDecompositionToolset -s 2 -f
To generate dev and test data:
java -cp build/libs/am-tools.jar de.saar.coli.amtools.decomposition.SourceAutomataCLI -t examples/decomposition_input/mini.dm.sdp -d examples/decomposition_input/mini.dm.sdp -o examples/decomposition_input/dm_out/ -dt de.saar.coli.amtools.decomposition.formalisms.toolsets.DMDecompositionToolset -s 2 -f
Please see the wiki page for further training instructions.
python -u train.py </path/to/drs_scopeless5.jsonnet> -s <where to save the model> -f --file-friendly-logging -o ' {"trainer" : {"cuda_device" : <your cuda device> } }'
# biaffine
$ python -u -m supar.cmds.dep.biaffine train -b -d 0 -c dep-biaffine-en -p model -f char \
--train ptb/train.conllx \
--dev ptb/dev.conllx \
--test ptb/test.conllx \
--embed glove-6b-100
The dependency approach:
python scope_match.py -i /split/parse/file -a /alignment/file -s /scope/parse/ -o /save/directory/file
The heuristics approach:
python sbn_postprocess.py -i /split/parse/file -o /save/directory/file
The evaluation script should be run inside the SBN-Evaluation repository:
cd 2.evaluation-tool-detail
bash evaluation.sh pred.txt gold.txt
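Conceptually, these metrics score a parse by matching triples between the predicted and gold graphs. The sketch below shows only the precision/recall/F1 arithmetic over already-aligned triples; it is a hypothetical helper, not part of SBN-Evaluation, and real SMATCH additionally searches over variable mappings before counting matches:

```python
def triple_f1(pred, gold):
    """Precision, recall, and F1 over overlapping triples, assuming
    variable names are already aligned between the two graphs."""
    pred, gold = set(pred), set(gold)
    matched = len(pred & gold)
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Invented example: 2 predicted triples, both correct, 1 gold triple missed.
pred = [('b0', ':instance', 'box'), ('b0', ':member', 'e1')]
gold = pred + [('e1', ':instance', 'entity')]
p, r, f = triple_f1(pred, gold)
print(p, r, f)
```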
SMATCH (left) and SMATCH++ (right) scores of all models on the PMB4 (top) and PMB5 (bottom) datasets. The model trained exclusively on gold data is shown in bold; the overall best-performing model is underlined.
If you have any questions, please feel free to reach out to me at xiulin.yang.compling@gmail.com.