General-Purpose Neural Networks for Sentence Boundary Detection
Branch: master
Clone or download

README.md

General-Purpose Neural Networks for Sentence Boundary Detection

In this repository we present general-purpose neural network models for sentence boundary detection. We report on a series of experiments with long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM) and convolutional neural network (CNN) for sentence boundary detection. We show that these neural networks architectures achieve state-of-the-art results both on multi-lingual benchmarks and on a zero-shot scenario.

Introduction

The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging (Manning, 2011), dependency parsing (Yu and Vu, 2017), named entity recognition or machine translation.

Sentence boundary detection is a nontrivial task, because of the ambiguity of the period sign ., which has several functions (Grefenstette and Tapanainen, 1994), e.g.:

  • End of sentence
  • Abbreviation
  • Acronyms and initialism
  • Mathematical numbers

A sentence boundary detection system has to resolve the use of ambiguous punctuation characters to determine if the punctuation character is a true end-of-sentence marker. In this implementation we define ?!:;. as potential end-of sentence markers.

Various approaches have been employed to achieve sentence boundary detection in different languages. Recent research in sentence boundary detection focus on machine learning techniques, such as hidden Markov models (Mikheev, 2002), maximum entropy (Reynar and Ratnaparkhi, 1997), conditional random fields (Tomanek et al., 2007), decision tree (Wong et al., 2014) and neural networks (Palmer and Hearst, 1997). Kiss and Strunk (2006) use an unsupervised sentence detection system called Punkt, which does not depend on any additional resources. The system use collocation information as evidence from unannotated corpora to detect e.g. abbreviations or ordinal numbers.

The sentence boundary detection task can be treated as a classification problem. Our work is similar to the SATZ system, proposed by Palmer and Hearst (1997), which uses a fully-connected feed-forward neural network. The SATZ system disambiguates a punctuation mark given a context of k surrounding words. This is different to our approach, as we use a char-based context window instead of a word-based context window.

In the present work, we train different architectures of neural networks, such as long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM) and convolutional neural network (CNN) and compare the results with OpenNLP. OpenNLP is a state-of-the-art tool and uses a maximum entropy model for sentence boundary detection. To test the robustness of our models, we use the Europarl corpus for German and English and the SETimes corpus for nine different Balkan languages.

Additionally, we use a zero-shot scenario to test our model on unseen abbreviations. We show that our models outperform OpenNLP both for each language and on the zero-shot learning task. Therefore, we conclude that our trained models can be used for building a robust, language-independent state-of-the-art sentence boundary detection system.

Datasets

Similar to Wong et al. (2014) we use the Europarl corpus (Koehn, 2005) for our experiments. The Europarl parallel corpus is extracted from the proceedings of the European Parliament and is originally created for the research of statistical machine translation systems. We only use German and English from Europarl. Wong et al. (2014) do not mention that the Europarl corpus is not fully sentence-segmented. The Europarl corpus has a one-sentence per line data format. Unfortunately, in some cases one or more sentences appear in a line. Thus, we define the Europarl corpus as "quasi"-sentence segmented corpus.

We use the SETimes corpus (Tyers and Alperen, 2010) as a second corpus for our experiments. The SETimes corpus is based on the content published on the SETimes.com news portal and contains parallel texts in ten languages. Aside from English the languages contained in the SETimes corpus fall into several linguistic groups: Turkic (Turkish), Slavic (Bulgarian, Croatian, Macedonian and Serbian), Hellenic (Greek), Romance (Romanian) and Albanic (Albanian). The SETimes corpus is also a "quasi"-sentence segmented corpus. For our experiments we use all the mentioned languages except English, as we use an English corpus from Europarl. We do not use any additional data like abbreviation lists.

For a zero-shot scenario we extracted 80 German abbreviations including their context in a sentence from Wikipedia. These abbreviations do not exist in the German Europarl corpus.

Preprocessing

Both Europarl and SETimes are not tokenized. Text tokenization (or, equivalently, segmentation) is highly non-trivial for many languages (Schütze, 2017). It is problematic even for English as word tokenizers are either manually designed or trained. For our proposed sentence boundary detection system we use a similar idea from Lee et al. (2016). They use a character-based approach without explicit segmentation for neural machine translation. We also use a character-based context window, so no explicit segmentation of input text is necessary.

For both corpora we use the following preprocessing steps: (a) we remove duplicate sentences, (b) we extract only sentences with ends with a potential end-of-sentence marker. For Europarl and SETimes each text for a language is split into train, dev and test sets. The following table shows a detailed summary of the training, development and test sets used for each language.

Language # Train # Dev # Test
German 1,476,653 184,580 184,580
English 1,474,819 184,352 184,351
Bulgarian 148,919 18,615 18,614
Bosnian 97,080 12,135 12,134
Greek 159,000 19,875 19,874
Croatian 143,817 17,977 17,976
Macedonian 144,631 18,079 18,078
Romanian 148,924 18,615 18,615
Albanian 159,323 19,915 19,915
Serbian 158,507 19,813 19,812
Turkish 144,585 18,073 18,072

Download

A script for automatically downloading and extracting the datasets is available and can be used with:

./download_data.sh

Training, development and testdata is located in the data folder.

Model

We use three different architectures of neural networks: long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM) and convolutional neural network (CNN). All three models capture information at the character level. Our models disambiguate potential end-of-sentence markers followed by a whitespace or line break given a context of k surrounding characters. The potential end-of-sentence marker is also included in the context window. The following table shows an example of a sentence and its extracted contexts: left context, middle context and right context. We also include the whitespace or line break after a potential end-of-sentence marker.

Input sentence Left Middle Right
I go to Mr. Pete Tong to Mr . _Pete

LSTM

We use a standard LSTM (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) network with an embedding size of 128. The number of hidden states is 256. We apply dropout with probability of 0.2 after the hidden layer during training. We apply a sigmoid non-linearity before the prediction layer.

Bi-LSTM

Our bidirectional LSTM network uses an embedding size of 128 and 256 hidden states. We apply dropout with a probability of 0.2 after the hidden layer during training, and we apply a sigmoid non-linearity before the prediction layer.

CNN

For the convolutional neural network we use a 1D convolution layer with 6 filters and a stride size of 1 (Waibel et al., 1989). The output of the convolution filter is fed through a global max pooling layer and the pooling output is concatenated to represent the context. We apply one 250-dimensional hidden layer with ReLU non-linearity before the prediction layer. We apply dropout with a probability of 0.2 during training.

Other Hyperparameters

Our proposed character-based model disambiguates a punctuation mark given a context of k surrounding characters. In our experiments we found that a context size of 5 surrounding characters gives the best results. We found that it is very important to include the end-of-sentence marker in the context, as this increases the F1-score of 2%. All models are trained with averaged stochastic gradient descent with a learning rate of 0.001 and mini-batch size of 32. We use Adam for first-order gradient-based optimization. We use binary cross-entropy as loss function. We do not tune hyperparameters for each language. Instead, we tune hyperparameters for one language (English) and use them across languages. The following table shows the number of trainable parameters for each model.

Model # Parameters
LSTM 420,097
Bi-LSTM 814,593
CNN 33,751

Results

We train a maximum of 5 epochs for each model. For the German and English corpus (Europarl) the time per epoch is 54 minutes for the Bi-LSTM model, 35 minutes for the LSTM model and 7 minutes for the CNN model. For each language from the SETimes corpus the time per epoch is 6 minutes for the Bi-LSTM model, 4 minutes for the LSTM model and 50 seconds for the CNN model. Timings are performed on a DGX-1 with a Nvidia P-100.

Development set

The results on the development set for both Europarl and SETimes are shown in the following table. Download link for model and vocab files for each language are included, as well as detailed evaluation results.

Language LSTM Bi-LSTM CNN OpenNLP
German 0.9759 (model, vocab) 0.9760 (model, vocab) 0.9751 (model, vocab) 0.9736
English 0.9864 (model, vocab) 0.9863 (model, vocab) 0.9861 (model, vocab) 0.9843
Bulgarian 0.9928 (model, vocab) 0.9926 (model, vocab) 0.9924 (model, vocab) 0.9900
Bosnian 0.9953 (model, vocab) 0.9958 (model, vocab) 0.9952 (model, vocab) 0.9921
Greek 0.9959 (model, vocab) 0.9964 (model, vocab) 0.9959 (model, vocab) 0.9911
Croatian 0.9947 (model, vocab) 0.9948 (model, vocab) 0.9946 (model, vocab) 0.9917
Macedonian 0.9795 (model, vocab) 0.9799 (model, vocab) 0.9794 (model, vocab) 0.9776
Romanian 0.9906 (model, vocab) 0.9904 (model, vocab) 0.9903 (model, vocab) 0.9888
Albanian 0.9954 (model, vocab) 0.9954 (model, vocab) 0.9945 (model, vocab) 0.9934
Serbian 0.9891 (model, vocab) 0.9890 (model, vocab) 0.9886 (model, vocab) 0.9838
Turkish 0.9860 (model, vocab) 0.9867 (model, vocab) 0.9858 (model, vocab) 0.9830

For each language the best neural network model outperforms OpenNLP. On average, the best neural network model is 0.32% better than OpenNLP. The worst neural network model also outperforms OpenNLP for each language. On average, the worst neural network model is 0.26% better than OpenNLP. In over 60% of the cases the bi-directional LSTM model is the best model. In almost all cases the CNN model performs worse than the LSTM and bi-directional LSTM model, but it still achieves better results than the OpenNLP model. This suggests that the CNN model still needs more hyperparameter tuning.

Test set

The results on the development set for both Europarl and SETimes are shown in the following table. Download link for model and vocab files for each language are included, as well as detailed evaluation results.

Language LSTM Bi-LSTM CNN OpenNLP
German 0.975 (model, vocab) 0.9760 (model, vocab) 0.9751 (model, vocab) 0.9738
English 0.9861 (model, vocab) 0.9860 (model, vocab) 0.9858 (model, vocab) 0.9840
Bulgarian 0.9922 (model, vocab) 0.9923 (model, vocab) 0.9919 (model, vocab) 0.9887
Bosnian 0.9957 (model, vocab) 0.9959 (model, vocab) 0.9953 (model, vocab) 0.9925
Greek 0.9967 (model, vocab) 0.9969 (model, vocab) 0.9963 (model, vocab) 0.9925
Croatian 0.9946 (model, vocab) 0.9948 (model, vocab) 0.9943 (model, vocab) 0.9907
Macedonian 0.9810 (model, vocab) 0.9811 (model, vocab) 0.9794 (model, vocab) 0.9786
Romanian 0.9907 (model, vocab) 0.9906 (model, vocab) 0.9904 (model, vocab) 0.9889
Albanian 0.9953 (model, vocab) 0.9949 (model, vocab) 0.9940 (model, vocab) 0.9934
Serbian 0.9877 (model, vocab) 0.9877 (model, vocab) 0.9870 (model, vocab) 0.9832
Turkish 0.9858 (model, vocab) 0.9854 (model, vocab) 0.9854 (model, vocab) 0.9808

For each language the best neural network model outperforms OpenNLP. On average, the best neural network model is 0.32% better than OpenNLP. The worst neural network model also outperforms OpenNLP for each language. On average, the worst neural network model is 0.25% better than OpenNLP. In half of the cases the bi-directional LSTM model is the best model. In almost all cases the CNN model performs worse than the LSTM and bi-directional LSTM model, but it still achieves better results than the OpenNLP model.

Zero-shot

Model Precision Recall F1-Score
LSTM 0.6046 0.9750 0.7464
Bi-LSTM 0.6341 0.9750 0.7684
CNN 0.57350 0.9750 0.7222
OpenNLP 54.60 96.25 69.68

The table above shows the results for the zero-shot scenario. The bi-directional LSTM model outperforms OpenNLP by a large margin and is 7% better than OpenNLP. The bi-directional LSTM model also outperforms all other neural network models. That suggests that the bi-directional LSTM model generalizes better than LSTM or CNN for unseen abbreviations. The worst neural network model (CNN) still performs 2,5% better than OpenNLP.

Conclusion

In this repository, we propose a general-purpose system for sentence boundary detection using different architectures of neural networks. We use the Europarl and SETimes corpus and compare our proposed models with OpenNLP. We achieve state-of-the-art results.

In a zero-shot scenario, in which no manifestation of the test abbreviations is observed during training, our system is also robust against unseen abbreviations.

The fact that our proposed neural network models perform well on different languages and on a zero-shot scenario leads us to the conclusion that our system is a general-purpose system.

Evaluation

To reproduce this results, the following scripts can be used:

  • benchmark_all.sh - runs evaluation for various neural network models and all languages
  • benchmark_all_opennlp - runs evaluation for OpenNLP for all languages

Implementation

We use Keras and TensorFlow for the implementation of the neural network architectures.

Options

The following commandline options are available:

$ python3 main.py --help
usage: main.py [-h] [--training-file TRAINING_FILE] [--test-file TEST_FILE]
               [--input-file INPUT_FILE] [--epochs EPOCHS]
               [--architecture ARCHITECTURE] [--window-size WINDOW_SIZE]
               [--batch-size BATCH_SIZE] [--dropout DROPOUT]
               [--min-freq MIN_FREQ] [--max-features MAX_FEATURES]
               [--embedding-size EMBEDDING_SIZE] [--kernel-size KERNEL_SIZE]
               [--filters FILTERS] [--pool-size POOL_SIZE]
               [--hidden-dims HIDDEN_DIMS] [--strides STRIDES]
               [--lstm_gru_size LSTM_GRU_SIZE] [--mlp-dense MLP_DENSE]
               [--mlp-dense-units MLP_DENSE_UNITS]
               [--model-filename MODEL_FILENAME]
               [--vocab-filename VOCAB_FILENAME] [--eos-marker EOS_MARKER]
               {train,test,tag,extract}

positional arguments:
  {train,test,tag,extract}

optional arguments:
  -h, --help            show this help message and exit
  --training-file TRAINING_FILE
                        Defines training data set
  --test-file TEST_FILE
                        Defines test data set
  --input-file INPUT_FILE
                        Defines input file to be tagged
  --epochs EPOCHS       Defines number of training epochs
  --architecture ARCHITECTURE
                        Neural network architectures, supported: cnn, lstm,
                        bi-lstm, gru, bi-gru, mlp
  --window-size WINDOW_SIZE
                        Defines number of window size (char-ngram)
  --batch-size BATCH_SIZE
                        Defines number of batch_size
  --dropout DROPOUT     Defines number dropout
  --min-freq MIN_FREQ   Defines the min. freq. a char must appear in data
  --max-features MAX_FEATURES
                        Defines number of features for Embeddings layer
  --embedding-size EMBEDDING_SIZE
                        Defines Embeddings size
  --kernel-size KERNEL_SIZE
                        Defines Kernel size of CNN
  --filters FILTERS     Defines number of filters of CNN
  --pool-size POOL_SIZE
                        Defines pool size of CNN
  --hidden-dims HIDDEN_DIMS
                        Defines number of hidden dims
  --strides STRIDES     Defines numer of strides for CNN
  --lstm_gru_size LSTM_GRU_SIZE
                        Defines size of LSTM/GRU layer
  --mlp-dense MLP_DENSE
                        Defines number of dense layers for mlp
  --mlp-dense-units MLP_DENSE_UNITS
                        Defines number of dense units for mlp
  --model-filename MODEL_FILENAME
                        Defines model filename
  --vocab-filename VOCAB_FILENAME
                        Defines vocab filename
  --eos-marker EOS_MARKER
                        Defines end-of-sentence marker used for tagging

Training

A new model can be trained using the train parameter. The only mandatory argument in training mode is the --training-file parameter. This parameter specifices the training file with sentence-separated entries.

python3 main.py train --training-file <TRAINING_FILE>

Testing

A previously trained model can be evaluated using the test parameter. The only mandatory argument for the testing mode is the --test-file parameter, that specifies the test file with sentence-separated entries.

python3 main.py test --test-file <TEST_FILE>

Tagging

To tag an input text with a previously trained model, the tag parameter must be used in combination with specifying the to be tagged input text via the --input-file parameter.

python3 main.py tag --input-file INPUT_FILE

Evaluation

A evaluation script can be found in the eos-eval folder. The main arguments for the eval.py script are:

$ python3 eval.py --help
usage: eval.py [-h] [-g GOLD] [-s SYSTEM] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -g GOLD, --gold GOLD  Gold standard
  -s SYSTEM, --system SYSTEM
                        System output
  -v, --verbose         Verbose outpu

The system and gold standard file must use </eos> as end-of-sentence marker. Then the evaluations script calculates precision, recall and F1-score. The --verbose parameter gives a detailed output of e.g. false negatives.

Acknowledgments

We would like to thank the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften (LRZ) for giving us access to the NVIDIA DGX-1 supercomputer.

Contact (Bugs, Feedback, Contribution and more)

For questions about deep-eos, contact the current maintainer: Stefan Schweter stefan@schweter.it. If you want to contribute to the project please refer to the Contributing guide!

License

To respect the Free Software Movement and the enormous work of Dr. Richard Stallman this implementation is released under the GNU Affero General Public License in version 3. More information can be found here and in COPYING.