Home

Miguel Ballesteros edited this page Jul 7, 2015 · 9 revisions

Deep-Syntactic Parser

"Deep-syntactic" dependency structures, which capture the argumentative, attributive and coordinative relations between the full words of a sentence, have great potential for a number of NLP applications. Their degree of abstraction lies between the output of a syntactic dependency parser (connected trees defined over all words of a sentence, with language-specific grammatical functions) and the output of a semantic parser (forests of trees defined over individual lexemes or phrasal chunks, with abstract semantic role labels that capture the argument structure of predicative elements and drop all attributive and coordinative dependencies). We propose a parser that delivers deep-syntactic structures as output.

You can try an online version of the parser: http://dparse.multisensor.taln.upf.edu/main


This package contains the implementation of the system described in: Miguel Ballesteros, Bernd Bohnet, Simon Mille and Leo Wanner. Deep-Syntactic Parsing. COLING 2014.

USAGE - training and parsing.

 java -jar deepParser.jar -s surfacetreebank -d deeptreebank -st surfaceinput -t 1 

This trains an SVM model and then parses the surface input. It produces the file "dsynt_final_output.conll", which is the output of the system.

USAGE - only parsing.

If you want to parse with an already trained model, use the following command.

 java -jar deepParser.jar -s surfacetreebank -d deeptreebank -st surfaceinput -t 0 

This produces the file "dsynt_final_output.conll", which is the output of the system. Note that you must run the command from the same folder in which you trained the model.
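If you call the parser from a script, the two invocations above differ only in the value of -t. The following is a minimal sketch, not part of the package: the helper names build_parser_command and run_parser are hypothetical, and only the flags shown above (-s, -d, -st, -t) are used.

```python
import subprocess
from typing import List


def build_parser_command(surface_treebank: str, deep_treebank: str,
                         surface_input: str, train: bool,
                         jar: str = "deepParser.jar") -> List[str]:
    """Assemble the deepParser.jar command line shown above.

    train=True  -> "-t 1" (train a model, then parse)
    train=False -> "-t 0" (parse only, reusing an existing model)
    """
    return [
        "java", "-jar", jar,
        "-s", surface_treebank,
        "-d", deep_treebank,
        "-st", surface_input,
        "-t", "1" if train else "0",
    ]


def run_parser(command: List[str], workdir: str = ".") -> None:
    """Run the command in workdir (hypothetical helper).

    For parse-only runs, workdir should be the folder in which the
    model was trained. Requires Java and the jar to be available.
    """
    subprocess.run(command, cwd=workdir, check=True)
```

For example, build_parser_command("surfacetreebank", "deeptreebank", "surfaceinput", train=False) yields the argument list for the parse-only command; pass it to run_parser to execute it.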

USAGE - long version.

This walkthrough assumes that you want to parse plain-text sentences and that you have a SURFACE treebank and a DEEP treebank.

Let test.txt be the corpus that you want to parse, containing plain-text sentences, one sentence per line. The steps are:
1. Tokenize your data. Any available tokenizer will do.

2. Convert the data to one token per line, as required by the CoNLL 2009 data format. You can use the Split utility of the Mate anna parser for that: http://code.google.com/p/mate-tools/downloads/list
java -cp anna-3.3.jar is2.util.Split test.txt > testOneWordPerLine.txt

3. Lemmatize your data. Again, you can use the anna parser for that, with any of the models available at http://code.google.com/p/mate-tools/downloads/list
 java -Xmx2G -cp anna-3.3.jar is2.lemmatizer.Lemmatizer -model model.lemmatizer.model -test testOneWordPerLine.txt -out test_lemma.txt

4. Train a dependency parser, a POS tagger and a morphology tagger, and parse test_lemma.txt with them. We recommend the Mate joint transition-based parser, POS tagger and morphology tagger: http://code.google.com/p/mate-tools/wiki/ParserAndModels Follow the instructions for the Mate joint parser; you will end up running a script like this:
 nohup ./pet-lang-model >log.txt & 

5. The output of the Mate parser only fills the predicted columns, so you need a script that fills all columns before the file can serve as input to the deep-syntactic parser. Something like the following would do, though any script in any programming language works. See the CoNLL 2009 data format for the column layout.
cat outMate.txt | awk 'NF==0{print ""} NF{print $1, $4, $4, $4, $6, $6, $8, $8, $10, $10, $12, $12, $13, $14}' OFS="\t" > outputSurfaceParser.txt

Important note: The deep-syntactic parser takes CoNLL 2009 format files as input, for both the deep-syntax and the surface-syntax versions. It uses the following columns as input/features: (1) FORM, (2) LEMMA, (3) POS, (4) FEAT, (5) HEAD, (6) DEPREL. That is, columns 0, 1, 2, 4, 5, 7 and 9.
This means that the input must have these columns filled and must be correctly formatted.

6. Replace every \n\t with \n, just to be sure that you have the correct CoNLL 2009 data format.

7. Train and parse with the deep-syntactic parser.
java -Xmx16g -jar deepParser.jar -s CorpusSSynt.conll -d CorpusDSynt.conll -st outputSurfaceParser.txt -t 1

(The flag -t 1 trains and parses; with -t 0 the parser only parses.)
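Steps 5 and 6 can also be done in Python instead of awk. The sketch below copies the column selection of the awk one-liner in step 5 (1-based source columns 1, 4, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 13, 14) and then applies the \n\t replacement of step 6. The function names, and the sanity check that expects 14 non-empty tab-separated cells per row, are assumptions made for illustration; they are not part of the parser. Note one small difference from awk: Python's split() treats any whitespace as a separator, not only spaces and tabs.

```python
from typing import Iterable, Iterator, List

# 1-based source columns, in the same order as the awk one-liner in step 5.
SOURCE_COLUMNS = (1, 4, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 13, 14)


def fill_columns(lines: Iterable[str]) -> Iterator[str]:
    """Rewrite each token row so that every output column is filled."""
    for line in lines:
        fields = line.split()
        if not fields:
            yield ""  # blank line = sentence boundary, kept as is
            continue
        # Like awk, use an empty string when a source column is missing.
        yield "\t".join(
            fields[i - 1] if i <= len(fields) else "" for i in SOURCE_COLUMNS
        )


def convert(text: str) -> str:
    """Apply step 5 (fill columns) and step 6 (replace \\n\\t with \\n)."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start another line
    out = "".join(row + "\n" for row in fill_columns(lines))
    return out.replace("\n\t", "\n")


def find_malformed_rows(text: str, n_columns: int = 14) -> List[int]:
    """Return 1-based line numbers of rows with a wrong number of cells
    or with an empty cell (hypothetical sanity check)."""
    bad = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line == "":
            continue
        cells = line.split("\t")
        if len(cells) != n_columns or any(cell == "" for cell in cells):
            bad.append(number)
    return bad
```

A possible use is to read outMate.txt into a string, pass it to convert, write the result to outputSurfaceParser.txt, and run find_malformed_rows on the result before starting the deep-syntactic parser.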


(In the figure, "transducer" refers to the deep-syntactic parser.)

------------

EVALUATION

This script evaluates the deep-syntactic output produced by the parser.
 java -jar evaluation.jar -g deepgold -s deepoutput 

If you have any questions or issues, please do not hesitate to contact Miguel Ballesteros (http://miguelballesteros.com, miguel.ballesteros@upf.edu).


References

Please cite the following paper if you use the deep-syntactic parser.
  • Miguel Ballesteros, Bernd Bohnet, Simon Mille and Leo Wanner. 2014. Deep-Syntactic Parsing. The 24th International Conference on Computational Linguistics (COLING 2014), Dublin, Ireland.

There is a demo paper accepted at NAACL 2015.

  • Joan Soler Company, Miguel Ballesteros, Bernd Bohnet, Simon Mille and Leo Wanner. 2015. Visualizing Deep-Syntactic Parser Output. NAACL 2015, Denver, USA.

Please cite the following paper if you use the Spanish corpus.
  • Simon Mille, Alicia Burga and Leo Wanner. 2013. Ancora-UPF: A Multi-Level Annotation of Spanish. The 2nd International Conference on Dependency Linguistics (DEPLING 2013).

If you have any questions, please do not hesitate to contact the authors. Note that you need an LDC license (for the Penn Treebank) to use the English version of the parser.