Turn the TiGer treebank into a real dependency treebank

wbwseeker/tiger2dep

This is the tiger2dep conversion script,
which converts the TiGer corpus into dependency format.
It also converts the German part of the SMULTRON corpus
(http://www.cl.uzh.ch/research/parallelcorpora/paralleltreebanks/smultron_en.html)
and the 707 EuroParl sentences manually annotated by Sebastian Padó
(http://www.nlpado.de/~sebastian/data/projection/german_goldgold.xml.gz).

The script has two parts: one applies a number of corrections to the source files,
and the other converts the corrected version to dependencies. The dependency conversion can also
introduce empty heads for verb ellipses in the data (only tested for TiGer so far).

REFERENCES:
@inproceedings{SEEKER12.235.L12-1088,
  author = {Wolfgang Seeker and Jonas Kuhn},
  title = {Making Ellipses Explicit in Dependency Conversion for a German Treebank},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)},
  year = {2012},
  month = {May},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and Mehmet U\u{g}ur Do\u{g}an and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {English},
  pages = {3132--3139},
  note = {ACL Anthology Identifier: L12-1088},
  url = {http://www.lrec-conf.org/proceedings/lrec2012/pdf/235_Paper.pdf}
}

@InProceedings{SEEKER14.809,
  author = {Wolfgang Seeker and Jonas Kuhn},
  title = {An Out-of-Domain Test Suite for Dependency Parsing of German},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {May},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {English},
  url = {http://www.lrec-conf.org/proceedings/lrec2014/pdf/809_Paper.pdf}
} 



REQUIREMENTS:
The script relies on BeautifulSoup 4 (included in the archive) to read the TigerXML files.

USAGE:
To produce a corrected version of the treebank (a prerequisite for the conversion):
$ python apply-corrections.py tiger_release_aug07.xml corrections/* > tiger_release_aug07.corrected.xml

To produce the conversion:
$ python tigerxml2conll09.py -i tiger_release_aug07.corrected.xml -m corrections/manual_heads.pl > tiger_release_aug07.corrected.conll09

Run the conversion script with -h to get help on configuration options.

There are bash scripts (convert-tiger.sh, convert-smultron.sh, convert-europarl.sh) that you can run to convert the source files. 
Usage is documented inside the scripts.
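
As an alternative to the bash scripts, the two steps can be chained from Python. The following is a minimal sketch, not part of the repository: the function names build_commands and run_pipeline are hypothetical, the output file names follow the pattern used in the commands above, and it assumes that "python" on the PATH is the interpreter the scripts expect. Because no shell is involved, the corrections/* pattern is expanded with glob.

```python
import glob
import subprocess


def build_commands(source_xml, corrections_dir="corrections", python="python"):
    """Build the two command lines from the USAGE section (hypothetical helper).

    Returns (corrected_xml, conll_path, correction_cmd, conversion_cmd).
    """
    stem = source_xml[:-4] if source_xml.endswith(".xml") else source_xml
    corrected_xml = stem + ".corrected.xml"
    conll_path = stem + ".corrected.conll09"

    # The shell would expand corrections/*; without a shell we expand it here.
    correction_files = sorted(glob.glob(corrections_dir + "/*"))

    correction_cmd = [python, "apply-corrections.py", source_xml] + correction_files
    conversion_cmd = [
        python, "tigerxml2conll09.py",
        "-i", corrected_xml,
        "-m", corrections_dir + "/manual_heads.pl",
    ]
    return corrected_xml, conll_path, correction_cmd, conversion_cmd


def run_pipeline(source_xml, corrections_dir="corrections", python="python"):
    """Run the correction step, then the conversion; return the output path."""
    corrected_xml, conll_path, correction_cmd, conversion_cmd = build_commands(
        source_xml, corrections_dir, python
    )
    # Both scripts write to stdout, so redirect it to the target files.
    with open(corrected_xml, "wb") as out:
        subprocess.run(correction_cmd, stdout=out, check=True)
    with open(conll_path, "wb") as out:
        subprocess.run(conversion_cmd, stdout=out, check=True)
    return conll_path
```

Calling run_pipeline("tiger_release_aug07.xml") from the repository root would then correspond to the two commands shown above. With check=True the conversion is not started if the correction step fails.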


Correcting TiGer needs quite a lot of memory because the whole corpus is held as a DOM
object during error correction. A server with a couple of GB of free memory is recommended.
Once the corrected file has been written, the actual conversion to dependencies does not need much memory.
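
Unlike the DOM-based correction step, the resulting .conll09 file can be processed one sentence at a time. The sketch below is an illustration only and not part of the repository: read_conll09 is a hypothetical helper, it assumes the standard tab-separated CoNLL 2009 column layout (ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, ...) with blank lines between sentences, and the sample sentence and its labels are made up rather than taken from the converter's output. IDs and heads are kept as strings so that no assumption is made about how empty heads are numbered.

```python
def read_conll09(lines):
    """Yield sentences as lists of token dicts, one sentence at a time."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append({
            "id": cols[0],
            "form": cols[1],
            "pos": cols[4],
            "head": cols[8],
            "deprel": cols[10],
        })
    if sentence:
        yield sentence


# Made-up sample in CoNLL 2009 layout; the labels are illustrative only.
sample = [
    "\t".join(["1", "Er", "er", "er", "PPER", "PPER", "_", "_", "2", "2", "SB", "SB", "_", "_"]),
    "\t".join(["2", "schläft", "schlafen", "schlafen", "VVFIN", "VVFIN", "_", "_", "0", "0", "--", "--", "_", "_"]),
    "",
]

sentences = list(read_conll09(sample))
# Tokens whose head is "0" are treated as roots (CoNLL convention, assumed here).
roots = [tok["form"] for tok in sentences[0] if tok["head"] == "0"]
```

To read a converted file, pass an open file object instead of the sample list, for example read_conll09(open(path, encoding="utf-8")), where the encoding is also an assumption.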

LICENSE:
The software is licensed under the GNU General Public License (GPL) version 3,
which can be found at www.gnu.org/licenses or in the attached file LICENSE.
Third-party libraries are subject to their own licenses.


----
05/27/2014 Wolfgang Seeker
