Various scripts and such used to create dictionary candidates for nob2smj and nob2sma from nob2sme dictionary + corpora.
The scripts are as of now quite tied to Giellatekno’s formats and infrastructure, and would probably need a lot of work to be made generally re-usable.
The main prerequisites are
- CorpusTools
- the sme/nob/sma/smj analysers from $GTHOME/langs
- all sme/nob/sma/smj-related folders from $GTHOME/words/dicts
See http://giellatekno.uit.no/doc/infra/GettingStarted.html for how to set things up – there’s no need to mess up your ~/.bashrc, but you’ll need to set at least the $GTHOME/$GTFREE/$GTCORE (optionally $GTBOUND) variables, and for now the Xerox tools (xfst/lookup etc.) are needed.
Run make
to make the stuff; the first time around you need to also
run make corpus
(just runs convert2xml
on your corpora).
The results will appear in the out/
directory.
Folders with two language codes are intermediate outputs, where
e.g. out/smesmj
has a two-column format where the first word is a
sme input and the second is a smj candidate.
The final files appear in out/nobsmjsme
and out/nobsmasme
, they
include both nob and sme translations for the candidates, annotated
with normalised frequencies for all the three words as well as
number of hits in parallel sentences. The format is, tab-separated:
nob candidate sme fr_nob fr_candidate fr_sme para_hits
(Note that parallel counts can actually be higher than fr_candidate, since they consider dynamic compounds as hits for the concatenated lemmas.)
The files named *_kintel_*
have been merged with words from
nobsmj-kintel. For every candidate we generate (whether from smj
or nob), we also include the Kintel translation of the Bokmål
word. The words are grouped by the Bokmål word, and the Kintel
translations are marked with an initial `+’ symbol. Some times the
Kintel translation was part of our generated candidates, in which
case it’ll have frequency numbers, other times it was not, in
which case only the nob and smj words will be listed for that
line.
The filenames have this format:
PoS_method_inFST_NN_sourcelang
where PoS
is part of speech (V, N, A or nonVNA for “the rest”) and
method
is one of
- decomp
- input is compound analysed, parts are translated with existing dictionaries and glued back together
- precomp
- existing dictionaries are compound analysed to create a dictionary of compound-part-translations; then input is compound analysed, parts are translated using the decompounded dictionaries, and glued back together
- anymalign
- from parallel word alignment (see para/anymalign)
- xfst
- using
$GTHOME/words/dicts/smesmj/scripts/sme2smj-$PoS.fst
- lexc
- using
$GTHOME/words/dicts/smesmj/bin/smesmj.fst
- kintel
- candidates here might come from other files, but there is also a suggestion from Kintel for every Bokmål word, pre-marked with a plus sign (+).
- loan
- input is translated using very simple loan-word regex replacement rules
The sourcelang
is the input for the method (nob or sme), while
inFST
is “ana” or “noana” depending on whether the word had an
analysis with the right PoS in
$GTHOME/langs/${lang}/src/analyser-gt-desc.xfst
.
The numbers (NN
above) indicate the frequency rank; the 00
file
contains the 1000 highest-frequency candidates, the 01
the 1000
next-highest-frequency candidates, and so on. Within each file,
candidates are sorted alphabetically by the reverse of the source
string (so e.g. all words ending in «-miljø» will be near each
other).
Follow para/anymalign/README.org to run the alignment. To get them into the same format as the other files, in this directory do
make anymalign make
You should now have some files in out/nob*sme/*anymalign*
… currently uses some ocaml stuff, TODO
See TODO.org.
In general:
- candidates from _decomp are better than _precomp
- candidates from _sme are better than _nob
- candidates from _multis are better than _singles
- candidates from _ana are better than _noana