PRPE (Prefix-Root-Postfix-Encoding) Segmenter

This repository contains Python (2 and 3) text segmentation scripts for machine translation.

The main principles

Performs close-to-morphological segmentation
For the present, tuned for English and Latvian languages
Requires minor adaptation for using in other languages: -- small source code changes -- tuning of hyperparameters
to support open-vocabulary (in machine translation), segmented texts should be post-processed (e.g., using BPE)

Usage instructions

PRPE is used in the following way:

Learning phase (files of potential segments produced): learn_prpe.py
Segmentation phase (using produced files of the previous phase): apply_prpe.py
Removing segmentation: unprocess_prpe.py

Phase #1. Learning (learn_prpe.py)

During this phase several 'code' files are produced from the {corpus} representing building blocks for the segmentation:

{prefixes}
{roots}
{suffixes}
{postfixes} (in the lastest version this particular file is redundant for segmentation but is left as legacy and for the purpose to potentially assist in hyperparamter adaptation)
{endings}
vocabulary of most frequent {words} to avoid segmentation

Terminology of word parts in the context of this program

Word "unbelievables" consists of:

prefix: un
root: believ
postfix: ables

where postfix "ables" consists of:

suffix: abl
ending: es

Running the learning script:

prpe6/learn_prpe.py -i {corpus} -p {prefixes} -r {roots} -s {suffixes} -t {postfixes} -u {endings} -w {words} -a {prefix rate} -b {suffix rate} -c {postfix rate} -v {vocabulary size} -l {language}

where:

prefix rate: how many prefixes to collect (greater than 1 means exact number, less means percentage)
suffix rate: how many suffixes to collect (greater than 1 means exact number, less means percentage)
postfix rate: how many postfixes to collect (greater than 1 means exact number, less means percentage) (in the latest version this particular parameter is redundant)
vocabulary size: how many the most frequent words to store to avoid segmentations for them (in order to reduce number of segments)
language: for the present, only 'lv' and 'en' are acceptable; otherwise 'en' scripts will be run

All the rates (prefix, suffix, postfix, vocabulary) should be experimentally tuned. Fitness conditions for roots and endings (not represented in parameters) are hardcoded.

Sample configuration used for Latvian (see produced files in 'codefiles-lv' directory):

prpe6/learn_prpe.py -i corpus.lv -p prefixes.lv -r roots.lv -s suffixes.lv -t postfixes.lv -u endings.lv -w words.lv -a 32 -b 1000 -c 0.1 -v 5000 -l lv

Sample configuration used for English (see produced files in 'codefiles-en' directory):

prpe6/learn_prpe.py -i corpus.en -p prefixes.en -r roots.en -s suffixes.en -t postfixes.en -u endings.en -w words.en -a 32 -b 200 -c 180 -v 5000 -l en

Just for a brief insight; the first 10 English prefixes collected in the learning phase (in file {prefixes}):

re 1
un 2
de 3
in 4
dis 5
mis 6
sub 7
over 8
ac 9
im 10

Phase #2. Segmentation (apply_prpe.py)

During this phase, input text is segmented using 'code' files produced in the learning phase.

Running the segmentation script:

prpe6/apply_prpe.py -i {input text} -o {output text} -p {prefixes} -r {roots} -s {suffixes} -t {postfixes} -u {endings} -w {words} -l {language} -d {segmentation mode} -m {segmentation marker} -n {uppercase marker}

where:

prefixes, roots, suffixes, postfixes, endings: files produced in the learning phase
language: for the present, only 'lv' and 'en' are acceptable
segmentation mode -- to determine segmentation optimization, processing words with uppercase and usage of segmentation markers (see below)
segmentation marker -- a character or sequence of character to mark segments to constitute words (if sequence of digits, then is converted to the character represented by that number, default '9474')
upercase marker -- a character or sequence of character to mark next word as starting with uppercase (if sequence of digits, then is converted to the character represented by that number, default '9553')

Segmentation mode

Segmentation mode is a positive integer number up to 4 digits in the form ABCC, where

A: optimization mode
B: uppercase mode
CC: marking mode

In the next examples the text "Unbelievable salespersons" will be used, "/" -- segmentation marker, "|" -- uppercase marker.

A: Optimization mode Optimization mode is intended to reduce number of segments, thus the length of sequence of segments.

Value 0 -- no optimization: Un/ believ/ able sal/ es/ person/ s

Value 1 -- small optimization: Un/ believ/ able sales/ persons

Value 2 -- full optimization: Unbeliev/ able sales/ person/ s

B: Uppercase mode Uppercase mode converts word starting with uppercase and the rest symbols in lowercase to lowercase with putting uppercase marker before it:

Value 0 -- uppercase processing ON: | unbeliev/ able sales/ person/ s

Value 1 -- uppercase processing OFF: Unbeliev/ able sales/ person/ s

C: Marking mode Although there are modes 0, 1, 2, 3 available, we suggest using mode 3 (marker indicates that the next segment is of the same word)

Examples of valid segmentation modes: 3, 103, 2103, 1003

Sample configuration of segmentation used for English (paramaters for markers omitted):

prpe6/apply_prpe.py -i input.en -o output.en -p prefixes.en -r roots.en -s suffixes.en -t postfixes.en -u endings.en -w words.en -l en -d 2103

3. Removing segmentation (unprocess_prpe.py)

During this operation, a segmented text is coverted back to a normal text:

prpe6/unprocess_prpe.py -i {input text} -o {output text} -d {segmentation mode} -m {segmentation marker} -n {uppercase marker}

For example:

prpe6/unprocess_prpe.py -i input.en -o output.en -d 2103 -m / -n |

will convert text "| un/ believ/ able sales/ persons" back into "Unbelievable salespersons"

4. Summary of the best configurations for NMT discovered by far

English:

Prefix rate a = 32
Suffix rate b = 200
Vocabulary size v = 5000 (or less if NMT system not sensitive to long sequences)
Segmentation mode d = 2003

Latvian:

Prefix rate a = 32
Suffix rate b = 1000
Vocabulary size v = 5000 (or less if NMT system not sensitive to long sequences)
Segmentation mode d = 2003

5. Adaptation to other languages

To make the adaptation, it is unfortunately required that you have some understanding about the language.

A brief activity list for adaptation:

in prpe.py, add the language specific information to the function "add_heuristics" which can eventually cause creation of several language specific function (such as "is_good_root")
run learning phase with looser hyperparameters for code files, i.e., "-a 1000 -b 0.1 -c 0.1"
go through the code files (prefixes, postfixes, suffixes, endings) to determine the number of words parts to be collected for segmentation, and eventually to additionally adjust the language specific source code.

Publications

Jānis Zuters, Gus Strazds, and Kārlis Immers. Semi-automatic Quasi-morphological Word Segmentation for Neural Machine Translation: 13th International Baltic Conference, DB&IS 2018, Trakai, Lithuania, July 1-4, 2018, Proceedings.

Acknowledgements

The research has been supported by the European Regional Development Fund within the research project ”Neural Network Modelling for Inflected Natural Languages” No. 1.1.1.1/16/A/215, and the Faculty of Computing, University of Latvia.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
codefiles-en		codefiles-en
codefiles-lv		codefiles-lv
prpe6		prpe6
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PRPE (Prefix-Root-Postfix-Encoding) Segmenter

The main principles

Usage instructions

Phase #1. Learning (learn_prpe.py)

Phase #2. Segmentation (apply_prpe.py)

Segmentation mode

3. Removing segmentation (unprocess_prpe.py)

4. Summary of the best configurations for NMT discovered by far

5. Adaptation to other languages

Publications

Acknowledgements

About

Releases

Packages

Languages

License

zuters/prpe

Folders and files

Latest commit

History

Repository files navigation

PRPE (Prefix-Root-Postfix-Encoding) Segmenter

The main principles

Usage instructions

Phase #1. Learning (learn_prpe.py)

Phase #2. Segmentation (apply_prpe.py)

Segmentation mode

3. Removing segmentation (unprocess_prpe.py)

4. Summary of the best configurations for NMT discovered by far

5. Adaptation to other languages

Publications

Acknowledgements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages