NEWS

This file lists noteworthy changes between releases. For the full list of changes, see git log and then ChangeLog.old.

Significant changes in 20160515

Universal Dependencies for Finnish is the new standard format we now follow:
- POS is now UPOS and classes were changed accordingly (new classes: AUX, PROPN, DET, CONJ, SCONJ, PUNCT, SYM, and VERB, NOUN, ADP, ADV as before)
- other features mostly match the feature field in UD documentation
- the release cycle aims to follow the same six-month cycle as UD
- the automatic tests verify compatibility with UD; 92 % of lemmas, primary POS tags and morphological features are the same as in the Finnish UD corpus, and 75 % are the same as in the Finnish FTB UD corpus
- analyser for reading and writing CoNLL-U format
tokenisation as script and more hacks to token stripping in corner cases
continuous integration with Travis CI, currently only testing basic script programming conventions
added a lot of high coverage words and forms by hand
by popular request, some of the words can now be blacklisted, for when you don't want that guy named Mutta to make your conjunction analyses ambiguous or the odd New Guinean bird to clash with some common verb
the "database" is now only keyed on lemma + homonym number; paradigm is extra information like anything else
a lot of work on morphological segmentation towards statistical machine translation; check the proceedings of the WMT shared tasks 2015 and 2016 to see why
started refactoring some Python code into classes
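
As an illustration of the CoNLL-U item above, here is a minimal sketch of reading and writing one token line. It is not the analyser's own code: the function names and the sample line are made up for this example, and only the ten tab-separated columns and the Name=Value|Name=Value feature syntax come from the CoNLL-U format itself.

```python
# Hypothetical helpers for illustration; not part of the omorfi API.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

# Made-up sample token line with the ten CoNLL-U columns.
SAMPLE = "1\tTalossa\ttalo\tNOUN\t_\tCase=Ine|Number=Sing\t0\troot\t_\t_"


def parse_feats(feats):
    """Turn 'Case=Ine|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conllu_line(line):
    """Parse one token line; return None for blank lines and comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    values = line.split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError("expected 10 tab-separated fields: %r" % line)
    token = dict(zip(CONLLU_FIELDS, values))
    token["feats"] = parse_feats(token["feats"])
    return token


def format_conllu_line(token):
    """Write a token back out, features sorted by name."""
    feats = sorted(token["feats"].items(), key=lambda kv: kv[0].lower())
    feats_str = "|".join("%s=%s" % kv for kv in feats) or "_"
    values = [feats_str if name == "feats" else token[name]
              for name in CONLLU_FIELDS]
    return "\t".join(values)


if __name__ == "__main__":
    print(format_conllu_line(parse_conllu_line(SAMPLE)))
```

Multiword-token ranges and empty nodes use the same ten columns, so they pass through this sketch unchanged as strings.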

allomorphy can be tagged again to distinguish e.g. -iden and -itten when generating
FinnTreeBank-1 format provided by Miikka Silfverberg is available but not built by default since it lacks a test set
lexicalised inflections can have a separate tag, e.g. kännissä can be tagged as a lexical inessive, distinct from the regular inessive
preliminary VISL CG-3 support, with original grammar by Fred Karlsson; convenience bash scripts available for disambiguated parsing
preliminary support for conllu and conllx analysis formats
paradigm categorisation is now verified by regular expressions
lots of paradigm fixes and some added words

speed is up to >20,000 tokens per second from ~500
coverages are up to: europarl (99 %), gutenberg (97 %), JRC Acquis (94 %) and fiwiki (93 %)
moses factored model format supported
segmentation supported
Java API
Python hacks packaged to API and module
Rest of hand-written Xerox legacy data removed; all is script-generated
GitHub migration, since Google Code is EOL'd
file naming for automata changed to include omorfi prefix for all file names in case they are distributed separately.
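
For readers unfamiliar with the Moses factored model format mentioned above: each token carries its factors joined by a vertical bar, and tokens are separated by spaces. The following is a rough sketch under that assumption only. The function name, the choice of factors (surface, lemma, POS) and the sample analyses are hypothetical and do not describe the actual omorfi output.

```python
# Hypothetical helper for illustration; not part of the omorfi API.
def to_factored(tokens, separator="|"):
    """Join per-token factors into one Moses-style factored line.

    tokens is a list of tuples, one tuple of factor strings per token.
    Factors must not contain the separator or whitespace, since both
    are structural in the factored format.
    """
    words = []
    for factors in tokens:
        for factor in factors:
            if separator in factor or any(ch.isspace() for ch in factor):
                raise ValueError("factor not representable: %r" % factor)
        words.append(separator.join(factors))
    return " ".join(words)


if __name__ == "__main__":
    # Made-up analyses: (surface, lemma, POS)
    print(to_factored([("Talossa", "talo", "NOUN"), ("on", "olla", "AUX")]))
```

The sketch rejects factors containing the separator instead of escaping them, to keep the example short; a real pipeline would need some escaping convention.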

The regression tests now also check coverage over popular corpora: Europarl (98 %), FTB 3.1 (97 %), gutenberg (96 %), JRC Acquis (93 %) and fiwiki (90 %)
sti derivation tentatively added
a number of new paradigms and paradigm moves, esp. in old and archaic styles
some new words manually added
apertium formats completely updated
interjection chaining
rest of hand-written lexc removed: everything in db and python code now
more strict building and testing altogether (no more dangling references or missing tags allowed)
morphological segmentation should be usable now
lots of other classifications and attributes added

Default tag format is now FTB3.1. Recall is 90 % and the format is stable and easy to read by humans, which is now the main target for computational morphologies.
The omor tagsets are now permanently unstable and subject to change any day. To use them, python scripts have been provided.
Lots of proper nouns and semantics from Uni Hel projects
speller build support for new voikko versions
New regression tests for stuffs
Most of legacy lexc sources removed; they are now generated from TSV "databases".
The morphological classes now follow 3 main classes with some subclasses that are less morphological
Twol rules and flag diacritics have been eliminated
Lots of support scripts to verify and extend classifications
Lots of new word-forms, inflections and changes to derivations
Some python support scripts for omor formats

Added fi.wiktionary.org as lexical source (many thanks to students of my unix tools course for scripting)
Added first batch of new proper nouns from a project in Univ. Helsinki
Lexc data is now rebuilt from lexical sources as part of standard processing;
- requiring python3
Minor bug fixes to man pages, special boundaries (e.g. in arkki_tehti)

whole new finntreebank tagset for forthcoming finntreebank work
uppercasing is noted in the analysis level
the word boundaries of lexicalised compounds may be available for more cases (depending on the tagset)
whole new lemmatizer tagset is available
some dozens of new words added and fixed
corpus analysis script combined with apertium's preprocessors
causative derivation chain added
abbreviations, adpositions, prefixes and suffixes are no longer POS but subcat analyses

completely new morphology built on traditional lexc-twolc model
easier route to add new lexical data via simple CSV format
lots of new lexical data from Joukahainen project as well as extended from kotus-sanalista semi-automatically and by hand.
titlecasing filter for regular words
š filter for old orthography variants
compounding is a much less haphazard concoction
parts of speech classified and included
pronouns, interjections, numerals, proper nouns
much closer to a real full-fledged morphology
movement from SFST to HFST toolset with lots of new cool toys (SFST support is retained in HFST)
towards full-scale automatic test suite