efselab

Efficient Sequence Labeling

efselab is a compiler for sequence labeling tools, aimed at producing accurate and very fast part-of-speech (PoS) taggers and named entity recognizers (NER).

To create a PoS tagger, all you need to do is edit a Python file specifying which feature templates, tag lexicons and/or word clusters to use. efselab then compiles this specification into C code, which in turn is compiled into an executable.

The basic algorithm used is simple: a structured perceptron, decoded with beam search (greedy search when the beam size is 1). To maximize performance, all strings are represented internally as hash sums. During decoding, these can then be efficiently combined into feature hashes.

In this way, even rather complex feature templates using word clusters and external lexicons can generate taggers capable of around a million tokens per second.
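To illustrate the hashing idea, here is a minimal Python sketch (not efselab's actual implementation; hash_string, combine and the mixing constants are all hypothetical): token hashes are computed once, after which feature hashes are derived using only integer arithmetic.

def hash_string(s):
    # 64-bit FNV-1a hash of a UTF-8 string; stands in for the fast C hash.
    h = 0xcbf29ce484222325
    for b in s.encode('utf-8'):
        h = ((h ^ b) * 0x100000001b3) & 0xffffffffffffffff
    return h

def combine(h1, h2):
    # Mix two hashes into one feature hash (a hypothetical mixing step).
    return (h1 * 0x9e3779b97f4a7c15 + h2) & 0xffffffffffffffff

tokens = ['A', 'short', 'sentence', '.']
token_hashes = [hash_string(t) for t in tokens]  # computed once per sentence

# A bigram feature template then needs no string operations at all:
TEMPLATE = hash_string('word[-1]+word[0]')
feature = combine(TEMPLATE, combine(token_hashes[0], token_hashes[1]))
weight_index = feature % (1 << 22)  # index into the weight vector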

Description

A detailed description of the algorithms used, along with evaluations, can be found in the following paper:

Robert Östling. 2018. Part of Speech Tagging: Shallow or Deep Learning? Northern European Journal of Language Technology.

Installing and using efselab

efselab is implemented in Python/C and requires the following software to be installed:

  • Python 3 (tested with version 3.4) and setuptools
  • gcc (tested with version 4.9) and GNU Make
  • Cython (only needed if you want to use the Swedish lemmatizer)

There is no installation step as such; all the software is (somewhat inelegantly) contained in the root directory, where configuration files are also assumed to be placed and executed.

First, you need to build the Python module fasthash, which is used to construct lexicon hash tables. Simply type:

make

Each tagger specification file (build_*.py) also functions as a build script. For a complete list of arguments, run e.g.:

python3 build_udt_en.py --help

Then, to build a tagger, simply run the corresponding configuration file. The following builds a tagger for the English part of the Universal Dependencies treebank, named udt_en, as both an executable file and a Python module:

python3 build_udt_en.py --name udt_en --python

This produces a binary file, udt_en, which contains everything except the model weights. The weights need to be learned in the following way:

./udt_en train data/udt-en-train.tab data/udt-en-dev.tab udt-en.bin

The final weights are written to the file udt-en.bin, and can then be used for tagging:

./udt_en tag data/udt-en-test.tab udt-en.bin evaluate >/dev/null

Note that the evaluate option requires tagged input. To tag an untagged file instead, strip off the tags (here using the cut tool) and pass - as the input file to read from stdin:

cut -f 1 data/udt-en-test.tab | ./udt_en tag - udt-en.bin >udt-en-retag.tab
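The .tab files use a simple tab-separated format: one token per line with its tag in the next column, and a blank line between sentences. A minimal two-column sketch (the exact number of fields depends on the model configuration, so treat this layout as an assumption):

A	DET
short	ADJ
sentence	NOUN
.	PUNCT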

Swedish annotation pipeline

There is a Swedish annotation pipeline, adapted from the Swedish Treebank pipeline (originally using hunpos for POS tagging) created at Uppsala University by Filip Salomonsson. It can do the following:

  • Tokenization (using a Python tokenizer by Filip Salomonsson)
  • POS tagging (using efselab with a SUC + SIC model)
  • Conversion to Universal PoS tags (using efselab trained on Universal Dependencies data plus some postprocessing heuristics by Aaron Smith)
  • Lemmatization (using the lexicon-based lemmatizer in lemmatize.pyx)
  • Named Entity Recognition (using efselab with a SUC + SIC model)
  • Dependency parsing (using MaltParser by Joakim Nivre et al.) into Universal Dependencies format (version 2)

To start using the pipeline, you first need to execute the following convenience script:

scripts/install_swedish_pipeline.sh

Then you should be able to run the pipeline like this:

mkdir output
./swe_pipeline.py -o output --all file.txt

For a more detailed description of the command-line options, run:

./swe_pipeline.py --help

Accuracy

These evaluations are performed with the 17-element PoS tagset from Universal Dependencies version 2. The Swedish model in the leftmost column of the first table uses additional data, all other models use only the training part of the corresponding Universal Dependencies treebank.

Swedish

Test data                     Swedish   Generic
Swedish UD treebank           97.7      96.3
Swedish (LinES) UD treebank   91.9      95.0

Universal Dependencies treebanks (version 2.0)

Treebank Accuracy
Ancient Greek 87.7
Ancient Greek (PROIEL) 96.5
Arabic 94.8
Basque 94.2
Belarusian 90.2
Bulgarian 98.0
Catalan 97.6
Chinese 91.0
Coptic 95.1
Croatian 96.4
Czech 98.6
Czech (CAC) 98.8
Czech (CLTT) 97.5
Danish 96.3
Dutch 92.3
Dutch (LassySmall) 97.6
English 94.8
English (LinES) 95.3
English (ParTUT) 94.7
Estonian 90.3
Finnish 95.8
Finnish (FTB) 93.2
French 96.6
French (ParTUT) 94.1
French (Sequoia) 97.4
Galician 97.4
Galician (TreeGal) 90.4
German 92.8
Gothic 95.7
Greek 96.8
Hebrew 95.5
Hindi 96.5
Hungarian 93.4
Indonesian 92.9
Irish 84.9
Italian 97.7
Japanese 96.2
Korean 94.0
Latin 84.4
Latin (ITTB) 97.3
Latin (PROIEL) 95.8
Latvian 91.1
Lithuanian 79.1
Norwegian (Bokmaal) 97.1
Norwegian (Nynorsk) 96.7
Old Church Slavonic 95.3
Persian 96.6
Polish 97.0
Portuguese 96.6
Portuguese (BR) 96.9
Romanian 97.2
Russian 95.9
Sanskrit 61.0
Slovak 95.2
Slovenian 96.8
Slovenian (SST) 89.0
Spanish 95.9
Spanish (AnCora) 97.9
Swedish 96.3
Swedish (LinES) 95.0
Tamil 86.6
Turkish 94.3
Ukrainian 61.9
Urdu 93.3
Vietnamese 88.3

Performance-related options

The --beam-size argument of the build scripts controls the beam size of the decoder, which is the most important way to balance accuracy and performance. A beam size of 1 is equivalent to a greedy search, which is the fastest option but results in significantly higher error rates than the default beam size (4).
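For example, the English model above could be rebuilt with greedy decoding to trade some accuracy for speed (assuming --beam-size takes the size as its argument):

python3 build_udt_en.py --name udt_en --python --beam-size 1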

Python interface

To build a Python module for your tagger, use the --python argument with the configuration script:

python3 build_udt_en.py --name udt_en --python

After this, the tagger can be used from Python in the following way:

>>> import udt_en
>>> with open('udt-en.bin', 'rb') as f: weights = f.read()
...
>>> udt_en.tag(weights, ['A', 'short', 'sentence', '.'])
('DET', 'ADJ', 'NOUN', 'PUNCT')

The second argument is a tuple or list of tokens, where each token is either a str object or a tuple/list containing the values of the different input fields. A tuple containing a single str object is equivalent to just the str object.

weights is a bytes object, containing the contents of a model file, i.e. a binary vector of floating-point values.
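Following the equivalence above, passing each token as a one-element tuple gives the same result as passing bare str objects:

>>> udt_en.tag(weights, [('A',), ('short',), ('sentence',), ('.',)])
('DET', 'ADJ', 'NOUN', 'PUNCT')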

Distributing taggers

Users with access to (possibly restricted) training material will likely want to distribute the generated C file and the model file. The end user can then compile the C file for their own platform, and start tagging files.

Issues

There are some things to be aware of:

  • Tokens are simply truncated at 4095 bytes, so don't feed the tagger very long strings!
  • Currently only UTF-8 input is supported.

Third-party code and data

This repository includes a few third-party contributions; see the 3rdparty directory for details.

Credits

Thanks to Emil Stenström for testing and feedback.

The Swedish pipeline wrapper script (including the tokenizer) was originally written by Filip Salomonsson (Uppsala), later modified by Robert Östling (Stockholm/Helsinki) and Aaron Smith (Uppsala). Joakim Nivre and Jesper Näsman (Uppsala) contributed to different parts of the Swedish pipeline.