Wiktfinnish - Inflecting Finnish words
This is a Python module for inflecting Finnish words (verb inflection, comparatives, cases, possessive suffixes, clitics) using Wiktionary-compatible declensions and conjugations.
This Python module is intended for generating inflected forms of Finnish words in Wiktionary. It is most conveniently used with dictionaries extracted using the wiktextract. The intention is that this module can be used to generate the complete set of inflected forms for any Finnish word in Wiktionary - including comparisons, possessive suffixes, clitics, and nominally inflected verb forms.
appropriate), or clone the repository and install from the source:
git clone https://github.com/tatuylonen/wiktfinnish.git cd wiktfinnish python3 setup.py install
This will install the
Note that this software has currently only been tested with Python 3. Back-porting to Python 2.7 should not be difficult; it just hasn't been tested yet. Please report back if you test and make this work with Python 2.
This package includes tests written using the
They can be run using, for example,
nose, which can be installed
pip3 install nose.
To run the tests, just use the following command in the top-level directory:
Generating an inflected word form
The basic way to generate an inflected word form is to use the following code snippet.
import wiktfinnish results = wiktfinnish.inflect(name, args, form)
inflect function returns a list of strings, which are the
alternative forms generated for that word form. The preferred, or
most common form should be first in the list and rare/archaic examples
The name and arguments specify the conjugation/declension for the word. That includes specifying the word to be inflected. See below for details.
The form is a 5-tuple
(verbform, comparison, case, possessive, clitic), that specifies the inflected form to be generated. It is described in detail below.
Specifying the conjugation/declension
In the API, each word to be inflection must be specified by a
conjugation (verbs) or declination (nominals) specification. The
specification is basically the arguments of the
template from Wiktionary, encoded into a Python dictionary. In the API, this template will be calledargs``.
Additionally, the API requies the name of the template to be supplied
as the ``name`` argument. These are readily available in the proper
format if the dictionary has been extracted using ``wiktextract`` (see
Specifying the desired word form
The desired word form is specified by a 5-tuple
(verbform, comparison, case, possessive, clitic), where any unused components
must be empty strings.
Generally, finite verb forms only have the
verbform part specified
and other parts empty. Nouns, pronouns, adjectives, and numerals
always have the
comparison parts empty.
Adjectives (and some verb forms) may also use
possessive specifies a possessive suffix, and they are mostly used
with nouns. The
clitic is specifies any clitics to be attached at
the end of the word, and can be used with any part-of-speech.
Verb form names
The following values are allowed for
verbform, in addition to the
empty string. The list of valid verb form names can be found in
wiktfinnish.VERB_FORMS. (There may still be some changes coming
in how case endings are handled for infinitives.)
pres-1sg pres-2sg pres-3sg pres-1pl pres-2pl pres-3pl pres-1sg-neg pres-2sg-neg pres-3sg-neg pres-1pl-neg pres-2pl-neg pres-3pl-neg pres-pass pres-pass-neg past-1sg past-2sg past-3sg past-1pl past-2pl past-3pl past-1sg-neg past-2sg-neg past-3sg-neg past-1pl-neg past-2pl-neg past-3pl-neg past-pass past-pass-neg cond-1sg cond-2sg cond-3sg cond-1pl cond-2pl cond-3pl cond-1sg-neg cond-2sg-neg cond-3sg-neg cond-1pl-neg cond-2pl-neg cond-3pl-neg cond-pass cond-pass-neg impr-2sg impr-3sg impr-1pl impr-2pl impr-3pl impr-2sg-neg impr-3sg-neg impr-1pl-neg impr-2pl-neg impr-3pl-neg impr-pass impr-pass-neg potn-1sg potn-2sg potn-3sg potn-1pl potn-2pl potn-3pl potn-1sg-neg potn-2sg-neg potn-3sg-neg potn-1pl-neg potn-2pl-neg potn-3pl-neg potn-pass potn-pass-neg pres-part pres-pass-part past-part past-pass-part agnt-part nega-part inf1 inf1-long inf2-ine inf2-pass-ine inf2-ins inf3-ine inf3-ela inf3-ill inf3-ade inf3-abe inf3-ins inf3-pass-ins inf4-nom inf4-par inf5
Adjectives, participles, and some other adverbs accept comparisons.
The normal positive form is marked by the empty string.
indicates comparative, and
sup indicates superlative form. The
list of valid comparison names (including the empty string) can be
Nouns, pronouns, adjectives, numerals, and various verb forms
(especially participles) accept case endings. The following names are
used to specify both case ending and number. The
acc-pl values are only valid for certain pronouns. For all other
parts of speech, one of
gen-pl should be used instead. The list of valid case+number
values can be found in
nom-sg - nominative (singular) acc-sg - accusative gen-sg - genitive ptv-sg - partitive ine-sg - inessive ela-sg - elative ill-sg - illative ade-sg - adessive abl-sg - ablative all-sg - allative ess-sg - essive tra-sg - translative ins-sg - instructive abe-sg - abessive cmt-sg - comitative nom-pl - nominative (plural) acc-pl - etc. gen-pl ptv-pl ine-pl ela-pl ill-pl ade-pl abl-pl all-pl ess-pl tra-pl ins-pl abe-pl cmt-pl
The following values are used for possessive suffixes. The empty
string indicates that no possessive suffix is to be attached. Note
that for the third person, the
3x value is used for both singular
and plural, as the forms are always the same. The list of valid
possessive forms (including the empty string) can be found in
1s - first person singular 2s - second person singular 3x - third person (singular or plural) 1p - first person plural 2p - second person plural
There is a fixed set of clitics that can be attached. In practice,
however, more clitics may be used in spoken language and there are
various other alternations. The following values can be used for
clitics, in addition to the empty string, which signifies no clitic.
The list of valid clitic values (including the empty string) can be
kO kin kAAn pA s kA hAn kOhAn pAhAn pAs kOs kinkO kAAnkO kinkOhAn
Iterating over all possible word forms
Functions are also provided for iterating over all valid 5-tuples indicating word forms. These are useful if one wants to generate all possible forms of a word. The following code snippet iterates over all adjective forms:
import wiktfinnish for verbform, comp, case, poss, clitic in wiktfinnish.all_forms_iter("adj"): print(verbform, comp, case, poss, clitic)
all_forms_iter function takes as a mandatory argument a
part-of-speech (as returned by the
wiktextract module, see below),
including "noun", "adj", "verb", "num", "pron", "adv", etc. It can
also take the following optional keyword arguments (more will likely
be added later) to restrict the forms that are enumerated:
transitive: if True (default), include agent participle (forms that are only valid for verbs with an agent)
no_comp: If set to True, prevents inclusion of comparative forms (for adjectives, adverbs)
no_case: If set to True, prevents inclusion of case forms (other than nominative singular)
no_poss: If set to True, prevents inclusion of possessive suffixes
no_clitic: if set to True, prevents inclusion of clitics.
Fast way of obtaining list of possible forms for a part-of-speech
There is also a cached version of the iterator that returns a sequence containing all valid forms for the given part-of-speech and keyword arguments. It takes the same arguments (including keyword arguments) as the iterator, but instead of returning an iterator returns a list. This function is also much faster and caches its results for maximum performance.
import wiktfinnish lst = wiktfinnish.all_forms_list("verb")
Standard vs. colloquial Finnish
Currently this generates forms according to standard written Finnish. The intention is to generate spoken language / colloquial forms for standard Finnish in the future, as well as possibly some dialectical forms. However, that is not yet implemented.
The official repository of this project is on github.
Please email to ylo at clausal.com if you wish to contribute or have patches or suggestions.
Credit and linking to the project's website and/or citing any future papers on the project would be highly appreciated.