A part-of-speech tagger with support for domain adaptation and external resources.

SoMeWeTa

Introduction

SoMeWeTa (short for Social Media and Web Tagger) is a part-of-speech tagger that supports domain adaptation and that can incorporate external sources of information such as Brown clusters and lexica. It is based on the averaged structured perceptron and uses beam search and an early update strategy. It is possible to train and evaluate the tagger on partially annotated data.

SoMeWeTa achieves state-of-the-art results on the German web and social media texts from the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media. It is therefore particularly well suited for tagging all kinds of written German discourse, for example chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues.

The system is described in greater detail in Proisl (2018).

For tokenization and sentence splitting on these kinds of text, we recommend SoMaJo, a tokenizer and sentence splitter with state-of-the-art performance on German web and social media texts:

somajo-tokenizer --split_sentences <file> | somewe-tagger --tag <model> -

In addition to the German web and social media model, we also provide models trained on German, English and French newspaper texts. For all three languages, SoMeWeTa achieves highly competitive results close to the current state of the art.

Installation

SoMeWeTa can be easily installed using pip:

pip3 install SoMeWeTa

Alternatively, you can download and decompress the latest release or clone the git repository:

git clone https://github.com/tsproisl/SoMeWeTa.git

In the new directory, run the following command:

python3 setup.py install

Usage

You can use the tagger as a standalone program from the command line. General usage information is available via the -h option:

somewe-tagger -h

Tagging a text

SoMeWeTa requires that the input texts are tokenized and split into sentences. Tokenization and sentence splitting have to be consistent with the corpora the tagger model has been trained on. For German texts, we recommend SoMaJo, a tokenizer and sentence splitter with state-of-the-art performance on German web and social media texts. The expected input format is one token per line with an empty line after each sentence.
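
The input format described above can be produced with a few lines of Python. This is only a minimal sketch: the function name `to_tagger_input` and the example sentences are made up for illustration, and the tokenization itself is assumed to have been done already (for example by SoMaJo).

```python
from typing import Iterable, List


def to_tagger_input(sentences: Iterable[List[str]]) -> str:
    """Serialize tokenized sentences: one token per line, empty line after each sentence."""
    parts = []
    for tokens in sentences:
        for token in tokens:
            parts.append(token + "\n")
        parts.append("\n")  # the empty line closes the sentence
    return "".join(parts)


# Hypothetical, already tokenized example sentences
example = [["Das", "ist", "toll", "!"], ["Danke", ":-)"]]
text = to_tagger_input(example)
print(text, end="")
```

Writing `text` to a file should give something that can be passed as `<file>` to the tagging command.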

To tag a file, run the following command:

somewe-tagger --tag <model> <file>

If your machine has multiple cores, you can use the --parallel option to speed up tagging. To tag a file using four cores, use this command:

somewe-tagger --parallel 4 --tag <model> <file>

Using the option -x or --xml, it is possible to tag an XML file. The tagger assumes that each XML tag is on a separate line:

somewe-tagger --xml --tag <model> <file>

Training the tagger

The expected input format for training the tagger is one token-pos pair per line, where token and pos are separated by a tab character, and an empty line after each sentence. To train a model, run the following command:

somewe-tagger --train <model> <file>
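
To make the training format concrete, here is a small sketch that reads it back into Python data structures. It is not SoMeWeTa's own reader; the function name `read_tagged_sentences` and the sample data are assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Tuple


def read_tagged_sentences(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield sentences as lists of (token, pos) pairs.

    Expects one tab-separated token-pos pair per line and an empty
    line after each sentence.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        token, pos = line.split("\t")  # raises ValueError on malformed lines
        sentence.append((token, pos))
    if sentence:  # tolerate a missing final empty line
        yield sentence


# Hypothetical sample in the training format (STTS tags)
sample = "Das\tPDS\nist\tVAFIN\ntoll\tADJD\n\nDanke\tPTKANT\n\n"
sentences = list(read_tagged_sentences(sample.splitlines()))
print(sentences)
```

A reader like this can help to check a corpus for malformed lines before training.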

SoMeWeTa supports domain adaptation. First train a model on the background corpus, then use this model as prior when training the in-domain model:

somewe-tagger --train <model> --prior <background_model> <file>

SoMeWeTa can make use of additional sources of information. You can use the --brown option to provide a file with Brown clusters (the paths file produced by wcluster) and the --lexicon option to provide a lexicon with additional token-level information. The lexicon should consist of lines with tab-separated token-value pairs, e.g.:

welcome	ADJ
welcome	INTJ
welcome	NOUN
welcome	VERB
work	NOUN
work	VERB
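
As the example shows, a token may appear on several lines with different values. The following sketch reads such a lexicon into a dictionary that maps each token to the set of its values. It only illustrates the file format; the function name `load_lexicon` is hypothetical, and nothing here is claimed about how SoMeWeTa stores the lexicon internally.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_lexicon(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect all values listed for each token in a tab-separated lexicon."""
    lexicon = defaultdict(set)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        token, value = line.split("\t")
        lexicon[token].add(value)
    return dict(lexicon)


lexicon_lines = [
    "welcome\tADJ",
    "welcome\tINTJ",
    "welcome\tNOUN",
    "welcome\tVERB",
    "work\tNOUN",
    "work\tVERB",
]
lexicon = load_lexicon(lexicon_lines)
print(sorted(lexicon["welcome"]))
```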

It is also possible to train the tagger on partially annotated data. To do this, assign a pseudo-tag to each unannotated token and tell SoMeWeTa to ignore this pseudo-tag:

somewe-tagger --train <model> --ignore-tag <pseudo-tag> <file>
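
Preparing partially annotated data could look like the following sketch. It assumes that missing annotations are represented as `None` in your own data and that `UNANNOTATED` is a string that does not occur in the tagset; both the representation and the pseudo-tag name are choices made for this example only.

```python
from typing import Iterable, List, Optional, Tuple

PSEUDO_TAG = "UNANNOTATED"  # hypothetical name, assumed not to be a real tag


def to_training_text(
    sentences: Iterable[List[Tuple[str, Optional[str]]]],
    pseudo_tag: str = PSEUDO_TAG,
) -> str:
    """Write token-pos pairs, replacing missing tags with the pseudo-tag."""
    lines = []
    for sentence in sentences:
        for token, tag in sentence:
            lines.append(f"{token}\t{tag if tag is not None else pseudo_tag}")
        lines.append("")  # empty line after each sentence
    return "\n".join(lines) + "\n"


# Hypothetical sentence in which the first token has no annotation
partial = [[("lol", None), ("das", "PDS"), ("stimmt", "VVFIN")]]
training_text = to_training_text(partial)
print(training_text, end="")
```

The same pseudo-tag would then be passed to `--ignore-tag` when training.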

Using the option -x or --xml, it is possible to train the tagger on an XML file. It is assumed that each XML tag is on a separate line:

somewe-tagger --xml --train <model> <file>

Evaluating a model

To evaluate a model, you need an annotated input file in the same format as for training. Then you can run the following command:

somewe-tagger --evaluate <model> <file>

You can also evaluate a model on partially annotated data. Simply assign a pseudo-tag to each unannotated token and tell SoMeWeTa to ignore this pseudo-tag:

somewe-tagger --evaluate <model> --ignore-tag <pseudo-tag> <file>
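
What ignoring a pseudo-tag means for the accuracy figure can be sketched as follows. The sketch assumes that tokens carrying the pseudo-tag are left out of both the numerator and the denominator; this is an interpretation for illustration, not a description of SoMeWeTa's evaluation code, and the function name and the pseudo-tag `X` are hypothetical.

```python
from typing import Sequence


def accuracy_ignoring(gold: Sequence[str], predicted: Sequence[str], ignore_tag: str) -> float:
    """Token accuracy over all positions whose gold tag is not the pseudo-tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = 0
    total = 0
    for gold_tag, predicted_tag in zip(gold, predicted):
        if gold_tag == ignore_tag:
            continue  # unannotated token: cannot be scored
        total += 1
        if gold_tag == predicted_tag:
            correct += 1
    if total == 0:
        raise ValueError("no annotated tokens to evaluate")
    return correct / total


gold = ["NN", "X", "VVFIN", "X", "ADJD"]
predicted = ["NN", "ART", "VVFIN", "NN", "ADV"]
acc = accuracy_ignoring(gold, predicted, ignore_tag="X")
print(f"{acc:.4f}")
```

In this made-up example, three tokens are scored and two of them are tagged correctly.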

Using the option -x or --xml, it is possible to evaluate a model on an XML file. The tagger assumes that each XML tag is on a separate line:

somewe-tagger --xml --evaluate <model> <file>

Performing cross-validation

You can also perform a 10-fold cross-validation on a training corpus:

somewe-tagger --crossvalidate <file>
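
For readers unfamiliar with the procedure, the following sketch shows the general idea of a 10-fold cross-validation: the corpus is divided into ten folds, and each fold serves once as test data while the remaining nine are used for training. The round-robin assignment of sentences to folds is an assumption made for this example; SoMeWeTa may shuffle or partition the data differently.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(sentences: Sequence[T], k: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; every sentence is in exactly one test fold."""
    folds = [list(sentences[i::k]) for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Hypothetical corpus of 25 "sentences"
corpus = [f"s{i}" for i in range(25)]
splits = list(k_fold_splits(corpus, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```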

To perform a cross-validation on partially annotated data, assign a pseudo-tag to each unannotated token and tell SoMeWeTa to ignore this pseudo-tag:

somewe-tagger --crossvalidate --ignore-tag <pseudo-tag> <file>

Using the option -x or --xml, it is possible to perform a cross-validation on an XML file. The tagger assumes that each XML tag is on a separate line:

somewe-tagger --xml --crossvalidate <file>

Model files

| Model | Tagset | Est. accuracy |
|---|---|---|
| German newspaper | STTS (TIGER) | 97.98% |
| German web and social media | STTS_IBK | 91.42% |
| English newspaper | Penn | 97.25% |
| French newspaper | FTB-29 | 97.71% |

German newspaper texts

This model has been trained on the entire TIGER corpus and uses Brown clusters extracted from DECOW14 and coarse wordclasses extracted from Morphy as additional information.

To estimate the accuracy of this model, we performed a 10-fold cross-validation on the TIGER corpus with the same settings, resulting in a mean accuracy plus or minus two standard deviations of 97.98% ±0.32.
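
A figure of the form "mean ± two standard deviations" can be derived from per-fold accuracies as in the sketch below. The fold accuracies are invented numbers, not the actual cross-validation results, and the use of the sample standard deviation (`statistics.stdev`) is an assumption; the population standard deviation would give a slightly smaller interval.

```python
import statistics
from typing import Sequence


def summarize(accuracies: Sequence[float]) -> str:
    """Format per-fold accuracies as mean plus or minus two standard deviations."""
    mean = statistics.mean(accuracies)
    two_sd = 2 * statistics.stdev(accuracies)  # assumption: sample standard deviation
    return f"{mean:.2f}% ±{two_sd:.2f}"


# Invented per-fold accuracies in percent (three folds for brevity)
fold_accuracies = [97.0, 98.0, 99.0]
summary = summarize(fold_accuracies)
print(summary)
```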

Download model (115 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

German web and social media texts

This model uses a variant of the above model as prior and is trained on the entire data from the EmpiriST 2015 shared task, i.e. both the training and the test data, as well as a small amount of additional training data (cf. the data directory of this repository). It uses the same additional sources of information as the prior model.

A variant of this model that only uses the training part of the EmpiriST 2015 data achieves a mean accuracy of 91.42% on the two test sets:

| Corpus | All words | Known words | Unknown words |
|---|---|---|---|
| CMC | 89.08 ±0.25 | 90.95 ±0.27 | 77.41 ±1.14 |
| Web | 93.77 ±0.26 | 95.34 ±0.25 | 83.31 ±0.63 |

As of December 2017, these figures represent the state of the art on the EmpiriST data.

Download model (115 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

English newspaper texts

This model has been trained on all sections of the Wall Street Journal part of the Penn Treebank and uses Brown clusters extracted from ENCOW14 and part-of-speech data extracted from the English DELA dictionary as additional information.

A variant of this model that was trained only on sections 0–18 of the Wall Street Journal achieves the following results on the usual development and test sets:

| Data set | All words | Known words | Unknown words |
|---|---|---|---|
| dev (19–21) | 97.15 ±0.02 | 97.41 ±0.03 | 89.59 ±0.28 |
| test (22–24) | 97.25 ±0.02 | 97.42 ±0.03 | 91.05 ±0.29 |

Download model (38 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

French newspaper texts

This model has been trained on the French Treebank and uses Brown clusters extracted from FRCOW16 and part-of-speech data extracted from the French DELA dictionary as additional information.

The French Treebank is annotated with two different tagsets: a coarse-grained tagset consisting of 15 tags and a more fine-grained tagset consisting of 29 tags. The model has been trained on the more fine-grained tagset. However, we provide a mapping to the smaller tagset (data/mapping_french_29_to_15.json) that can be used to annotate a text with both tagsets:

somewe-tagger --tag <model> --mapping <mapping> <file>
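
Conceptually, such a mapping assigns one coarse-grained tag to every fine-grained tag. The sketch below assumes that the mapping is a flat JSON object from fine-grained to coarse-grained tags; the excerpt in `MAPPING_JSON`, the function name and the example sentence are illustrative and not copied from data/mapping_french_29_to_15.json.

```python
import json
from typing import Dict, List, Tuple

# Illustrative excerpt; the layout of the real mapping file is an assumption
MAPPING_JSON = '{"DET": "D", "NC": "N", "NPP": "N", "VINF": "V", "ADJ": "A"}'


def add_coarse_tags(
    tagged: List[Tuple[str, str]], mapping: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Attach the coarse-grained tag to each (token, fine-grained tag) pair."""
    return [(token, fine, mapping[fine]) for token, fine in tagged]


mapping = json.loads(MAPPING_JSON)
tagged = [("le", "DET"), ("chat", "NC"), ("noir", "ADJ")]  # hypothetical tagger output
both = add_coarse_tags(tagged, mapping)
print(both)
```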

To estimate the accuracy of the model, we performed a 10-fold cross-validation on the French Treebank using the same settings:

| Tagset | Accuracy |
|---|---|
| 29 tags | 97.71 ±0.36 |
| 15 tags (mapped) | 98.21 ±0.30 |

Download model (28 MB) – Note that the model is provided for research purposes only. For further information, please refer to the licenses of the individual resources that were used in the creation of the model.

References

  • Proisl, Thomas (2018): “SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts.” In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: European Language Resources Association (ELRA), 665–670.