Scripts for generating dictionary candidates for nob2smj and nob2sma from the nob2sme dictionary and corpora.

The scripts are as of now quite tied to Giellatekno’s formats and infrastructure, and would probably need a lot of work to be made generally re-usable.

Prerequisites

The main prerequisites are:

CorpusTools
the sme/nob/sma/smj analysers from $GTHOME/langs
all sme/nob/sma/smj-related folders from $GTHOME/words/dicts

See http://giellatekno.uit.no/doc/infra/GettingStarted.html for how to set things up. There’s no need to mess up your ~/.bashrc, but you’ll need to set at least the $GTHOME, $GTFREE and $GTCORE variables (and optionally $GTBOUND). For now, the Xerox tools (xfst, lookup etc.) are also needed.
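Before running anything, it can help to check that the environment is in place. The following is a minimal sketch, not part of this repository: the function name `missing_gt_vars` is hypothetical, and it only checks that the three required variables named above are set, not that they point at valid checkouts.

```python
import os

# Variables this README lists as required; $GTBOUND is optional and not checked.
REQUIRED_VARS = ("GTHOME", "GTFREE", "GTCORE")


def missing_gt_vars(env=None):
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_gt_vars()` with no arguments inspects the current process environment; an empty list means all three variables are set.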

Running the candidate generation

Run make to generate the candidates; the first time around you also need to run make corpus (which just runs convert2xml on your corpora).

Output file format

The results will appear in the out/ directory.

Folders with two language codes are intermediate outputs, where e.g. out/smesmj has a two-column format where the first word is a sme input and the second is a smj candidate.

The final files appear in out/nobsmjsme and out/nobsmasme. They include both nob and sme translations for the candidates, annotated with normalised frequencies for all three words as well as the number of hits in parallel sentences. The format is tab-separated:

nob	candidate	sme	fr_nob	fr_candidate	fr_sme	para_hits

(Note that parallel counts can actually be higher than fr_candidate, since they consider dynamic compounds as hits for the concatenated lemmas.)
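For downstream processing, the seven-column format above could be read along these lines. This is a sketch under assumptions: the names `Candidate` and `parse_candidate_line` are hypothetical, and it assumes the three frequency columns parse as floats and para_hits as an integer, which this README does not state explicitly.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    nob: str
    candidate: str
    sme: str
    fr_nob: float
    fr_candidate: float
    fr_sme: float
    para_hits: int


def parse_candidate_line(line: str) -> Optional[Candidate]:
    """Parse one tab-separated line; return None unless it has 7 fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        return None
    nob, cand, sme = fields[:3]
    # Assumption: frequencies are decimal numbers, para_hits a whole number.
    fr_nob, fr_cand, fr_sme = (float(f) for f in fields[3:6])
    return Candidate(nob, cand, sme, fr_nob, fr_cand, fr_sme, int(fields[6]))
```

Since para_hits may exceed fr_candidate, a reader of these files should not treat that as a sign of corrupt data.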

Kintel files

The files named *_kintel_* have been merged with words from nobsmj-kintel. For every candidate we generate (whether from smj or nob), we also include the Kintel translation of the Bokmål word. The words are grouped by the Bokmål word, and the Kintel translations are marked with an initial “+” symbol. Sometimes the Kintel translation was part of our generated candidates, in which case it’ll have frequency numbers; other times it was not, in which case only the nob and smj words will be listed for that line.
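To separate Kintel suggestions from generated candidates, something like the following sketch might do. It rests on assumptions: `group_kintel` is a hypothetical name, the files are taken to be tab-separated with the Bokmål word first and the candidate second, and since it is not specified here which column carries the plus sign, the sketch accepts it on either of the first two.

```python
from collections import defaultdict


def group_kintel(lines):
    """Group (candidate, is_kintel) pairs by Bokmål word.

    Assumption: the '+' sits at the start of either the first or the
    second column; the README does not say which.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        nob, cand = fields[0], fields[1]
        is_kintel = nob.startswith("+") or cand.startswith("+")
        groups[nob.lstrip("+")].append((cand.lstrip("+"), is_kintel))
    return dict(groups)
```

Lines with only two fields (a Kintel translation that was not among the generated candidates) are handled the same way as full lines, because only the first two columns are read.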

Output filename format

The filenames have this format:

PoS_method_inFST_NN_sourcelang

where PoS is part of speech (V, N, A or nonVNA for “the rest”) and method is one of

decomp: input is compound analysed, parts are translated with existing dictionaries and glued back together
precomp: existing dictionaries are compound analysed to create a dictionary of compound-part-translations; then input is compound analysed, parts are translated using the decompounded dictionaries, and glued back together
anymalign: from parallel word alignment (see para/anymalign)
xfst: using $GTHOME/words/dicts/smesmj/scripts/sme2smj-$PoS.fst
lexc: using $GTHOME/words/dicts/smesmj/bin/smesmj.fst
kintel: candidates here might come from other files, but there is also a suggestion from Kintel for every Bokmål word, pre-marked with a plus sign (+).
loan: input is translated using very simple loan-word regex replacement rules
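The idea behind decomp (translate each compound part, then glue the translations back together) can be sketched as below. This is an illustration only, not the repository's implementation: `translate_compound` is a hypothetical name, the input is assumed to be already split into parts, and producing every combination of part translations is an assumption about how ambiguity might be handled.

```python
from itertools import product


def translate_compound(parts, part_dict):
    """Translate each compound part and glue the results back together.

    Returns every combination of part translations, or [] if there are
    no parts or any part is missing from the dictionary.
    """
    if not parts:
        return []
    options = []
    for part in parts:
        translations = part_dict.get(part)
        if not translations:
            return []  # one untranslatable part means no candidate
        options.append(translations)
    return ["".join(combo) for combo in product(*options)]
```

With a toy dictionary `{"aa": ["xx", "yy"], "bb": ["zz"]}`, the parts `["aa", "bb"]` give the candidates `["xxzz", "yyzz"]`. The precomp method would differ mainly in where `part_dict` comes from, namely from compound-analysing the existing dictionaries.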

The sourcelang is the input for the method (nob or sme), while inFST is “ana” or “noana” depending on whether the word had an analysis with the right PoS in $GTHOME/langs/${lang}/src/analyser-gt-desc.xfst.
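A filename in this format can be taken apart as in the sketch below. The names `OutputName` and `parse_output_name` are hypothetical, and the sketch assumes exactly five underscore-separated parts with no file extension; actual filenames may carry extra components, in which case this strict version raises an error.

```python
from typing import NamedTuple

POS_TAGS = {"V", "N", "A", "nonVNA"}
METHODS = {"decomp", "precomp", "anymalign", "xfst", "lexc", "kintel", "loan"}


class OutputName(NamedTuple):
    pos: str
    method: str
    in_fst: str
    rank_file: int
    sourcelang: str


def parse_output_name(name: str) -> OutputName:
    """Split a name of the form PoS_method_inFST_NN_sourcelang."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated parts: {name!r}")
    pos, method, in_fst, nn, sourcelang = parts
    if pos not in POS_TAGS or method not in METHODS:
        raise ValueError(f"unknown PoS or method in {name!r}")
    if in_fst not in {"ana", "noana"} or sourcelang not in {"nob", "sme"}:
        raise ValueError(f"unknown inFST or sourcelang in {name!r}")
    return OutputName(pos, method, in_fst, int(nn), sourcelang)
```

For example, `parse_output_name("N_decomp_ana_00_sme")` yields a record with PoS N, method decomp, an analysed source word, rank file 0 and source language sme.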

The numbers (NN above) indicate the frequency rank; the 00 file contains the 1000 highest-frequency candidates, the 01 the 1000 next-highest-frequency candidates, and so on. Within each file, candidates are sorted alphabetically by the reverse of the source string (so e.g. all words ending in «-miljø» will be near each other).
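The numbering and ordering just described can be illustrated with a short sketch. The function names are hypothetical, and the sort uses plain Unicode code point order, which is an assumption; the actual files may have been sorted under a locale-specific collation.

```python
def rank_range(nn: int, size: int = 1000):
    """Return the 1-based (first, last) frequency ranks held by file NN."""
    return nn * size + 1, (nn + 1) * size


def sort_by_reversed_source(words):
    """Sort words by their reversed spelling (code point order)."""
    return sorted(words, key=lambda w: w[::-1])
```

Here `rank_range(1)` gives `(1001, 2000)`, and sorting `["arbeidsmiljø", "bil", "bymiljø"]` gives `["bil", "arbeidsmiljø", "bymiljø"]`, so the two «-miljø» words end up next to each other.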

Running the parallel word alignment

Follow para/anymalign/README.org to run the alignment. To get the alignments into the same format as the other files, run the following in this directory:

make anymalign
make

You should now have some files in out/nob*sme/*anymalign*

Running the candidate-respelling

… currently uses some ocaml stuff, TODO

stuff

See TODO.org.

Quality impressions

In general:

candidates from _decomp are better than _precomp
candidates from _sme are better than _nob
candidates from _multis are better than _singles
candidates from _ana are better than _noana
