Universal Segmentations

This repository serves as the main storage of scripts and documentation for the Universal Segmentations project, which aims to convert existing resources for morphological segmentation into a common format and, in the future, to develop new resources.

Currently, we focus on extracting information from existing resources and putting it into a file format specifically designed for the segmentation task.

API

A Python API for working with morphological segmentation is available in src/useg/seg_lex.py. For a simple usage example, see src/import_derinet.py, which converts the segmentation present in DeriNet-2.0-formatted files to our format.

Basic usage

  1. Create a lexicon. The lexicon is the main object you’ll be interacting with.

    from useg import SegLex
    lexicon = SegLex()
  2. Create any number of lexemes using the lexicon.add_lexeme() method, which requires a word form, a lemma and a part-of-speech tag and returns the ID of the created lexeme. This ID is used to refer to the lexeme in other methods.

    The last argument is the (optional) features of the lexeme. Pass a dict with the required items, such as {"gender": "Masc"}. The keys are not yet standardized; this still needs to be discussed. (A lexeme with features appears in the combined sketch after this list.)

    The POS tag can be replaced by an empty string if you don’t have it. The form should always be filled in – it is expected that the lemma is one of the word forms, so if you only have a lemma, specify it both as the lemma and as the form.

    Try to use Universal POS tags for the part of speech, if possible.

    lex_id = lexicon.add_lexeme("counterexamples", "counterexample", "NOUN")
  3. Add segmentation by using the lexicon.add_morpheme() method, or one of the convenience methods lexicon.add_contiguous_morpheme() or lexicon.add_morphemes_from_list() (see below).

    The lexicon.add_morphemes_from_list() method is the simplest one to use, but it requires that the morphs of the word, when concatenated, correspond exactly to the word form. Therefore, it doesn’t work with allomorphs and doesn’t allow you to add further annotation to the morphemes. Simply pass a list of morphs as the argument.

    lexicon.add_morphemes_from_list(lex_id, "test_1", ["count", "er", "example", "s"])

    The lexicon.add_contiguous_morpheme() method allows you to add morphemes one by one, manually resolving allomorphy and annotating the individual morphemes. Pass the start and end indices of the morpheme, as if they were slice indices selecting the morph from the form.

    lexicon.add_contiguous_morpheme(lex_id, "test_2", 0, 5, {"type": "root", "morpheme": "contra"})
    lexicon.add_contiguous_morpheme(lex_id, "test_2", 5, 7, {"type": "suffix", "morpheme": "er"})

    The lexicon.add_morpheme() method is the lowest-level one. It requires you to manually specify which positions in the form the morph corresponds to, using a list of integer indices, therefore allowing you to mark non-contiguous morphemes such as circumfixes.

    lexicon.add_morpheme(lex_id, "test_2", [7, 8, 9, 10, 11, 12, 13], {"type": "root", "morpheme": "example"})
    lexicon.add_morpheme(lex_id, "test_2", [14], {"type": "suffix", "morpheme": "PLURAL"})
  4. When you’re done, you can save the lexicon to a file:

    lexicon.save("output.useg")  # Or e.g. lexicon.save(sys.stdout)
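
Putting the steps together, here is a minimal end-to-end sketch. It is illustrative only: the feature key and value ("number": "Plur") are made up, since feature keys are not standardized yet, and the layer name "demo" is arbitrary, analogous to "test_1" above.

from useg import SegLex

lexicon = SegLex()

# Create a lexeme; the features dict is optional and its keys
# are not yet standardized ("number" is a made-up example).
lex_id = lexicon.add_lexeme("counterexamples", "counterexample", "NOUN",
                            {"number": "Plur"})

# Segment it; "demo" names the annotation layer, analogous to "test_1" above.
lexicon.add_morphemes_from_list(lex_id, "demo", ["count", "er", "example", "s"])

# Write the result out.
lexicon.save("output.useg")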

Loading and using pre-created lexicons

In addition to creating the lexicon from scratch, it is also possible to load an existing resource from a file using the lexicon.load() method, analogously to saving:

lexicon = SegLex()
lexicon.load("input.useg")  # Or e.g. lexicon.load(sys.stdin)

The lexicon now contains lexemes and their segmentation. The lexemes can be enumerated using the lexicon.iter_lexemes() method. When called with no arguments, it iterates over the whole lexicon, yielding the IDs of all lexemes. When passed the string keyword arguments form, lemma or pos, it iterates only over the IDs of lexemes with the specified properties.

The methods lexicon.form(), .lemma() and .pos() can be used to obtain these strings, and lexicon.features() retrieves the features dict. The returned dictionary is shared with the API, so any edits will be visible in future calls.

# Print lemmas and features of all nouns.
for lex_id in lexicon.iter_lexemes(pos="NOUN"):
    print(lexicon.lemma(lex_id), lexicon.features(lex_id))
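
Because lexicon.features() returns the dictionary shared with the API, morphological information can be changed simply by editing it in place. A minimal sketch, assuming a hypothetical "number" feature key:

# Mark all lexemes whose form is "counterexamples" as plural.
# The edit persists, since the returned dict is shared with the lexicon.
for lex_id in lexicon.iter_lexemes(form="counterexamples"):
    lexicon.features(lex_id)["number"] = "Plur"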

Troubleshooting

To make sure that Python sees the API sources and is able to import them, add the src/ directory to PYTHONPATH, either by setting the corresponding environment variable, or by extending sys.path before importing from useg. That is, either:

PYTHONPATH=/path/to/universal-segmentations/src python3 script.py

or:

import os
import sys

# Configure the path to the sources before importing from useg.
sys.path.insert(0, os.path.abspath('../../src'))

from useg import SegLex

TODO

Add documentation.

  • ✓ Document setting the PYTHONPATH

  • ✓ Document importing and creating the lexicon

  • ✓ Document creating new lexemes

  • ❏ Document working with lexemes

    • ✓ Getting form, lemma, POS tag, other morphological info

    • ❏ Deleting lexemes

    • ✓ Changing morphological info

    • ❏ What to do when we need to change the lemma or POS (A: Create a new lexeme instead and delete the old one.)

    • ✓ Getting lexemes by lemma, form, pos etc.

  • ❏ Document basic concepts of morphemes

    • ❏ They are bound to the form of the lexeme

    • ❏ Spans index, where in the form is the morph of the morpheme located

    • ❏ Ideally, each lexeme has its form perfectly subdivided between morphemes with no gaps nor overlap

      • ❏ But you don’t need to fully segment the word if you don’t know all the morphemes

      • ❏ And edge cases (overlaps – morpheme nodes) are supported too

    • ❏ Each lexeme can have multiple alternative segmentations

      • ❏ These are distinguished using annotation layer names (see the sketch after this list)

  • ✓ Document creating new morphemes (contiguous or not)

  • ❏ Document deleting morphemes (not implemented yet)

  • ❏ Document morpheme properties; what are you expected to fill out

    • ❏ And how to query and obtain the properties
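
The alternative-segmentations point above is already expressible with the current API: each segmentation of a lexeme lives on its own named annotation layer, just like the "test_1" and "test_2" layers in the Basic usage section. A minimal sketch with arbitrary layer names and an illustrative word:

from useg import SegLex

lexicon = SegLex()
lex_id = lexicon.add_lexeme("untieable", "untieable", "ADJ")

# A plain analysis on the layer "baseline"...
lexicon.add_morphemes_from_list(lex_id, "baseline", ["un", "tie", "able"])

# ...and an alternative, annotated analysis on the layer "detailed".
lexicon.add_contiguous_morpheme(lex_id, "detailed", 0, 2, {"type": "prefix", "morpheme": "un"})
lexicon.add_contiguous_morpheme(lex_id, "detailed", 2, 5, {"type": "root", "morpheme": "tie"})
lexicon.add_contiguous_morpheme(lex_id, "detailed", 5, 9, {"type": "suffix", "morpheme": "able"})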

File format

Note: TODO – add documentation.
