-
Notifications
You must be signed in to change notification settings - Fork 9
/
settings
70 lines (62 loc) · 3.05 KB
/
settings
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
# Languages for training and testing.
# The set TRAIN_LANG must be a subset of ALL_LANGS, and should exclude the other XXX_LANG's.
# TRAIN_LANG omits TK because there's no data/nativetranscripts/XX/ref_train.
ALL_LANGS=(AR CA DT HG MD SW UR UZ) # Used by stage 1.
TRAIN_LANG=(AR CA DT HG MD SW UR ) # Used by stage 3 and 7.
DEV_LANG=(UZ) # Used by stage 3.
EVAL_LANG=(UZ) # Used by stage 3.
TEST_LANG=(UZ) # Used by stage 3.
LANG_NAME=uzbek # For test or dev. Used by $EXPLOCAL, $evalreffile, $phonelm, and $Gfst. Compatible with $langmap.
# Input test DATA and output EXPeriment results.
DATA=data
DATA_URL=http://isle.illinois.edu/mc/PTgenTest/data-2016-08-24.tgz
EXP=$HOME/Tmp/Exp
# Either 'dev' or 'eval'.
# Use 'dev' to tune hyperparameters, Pscale, Gscale, etc.
# Then use the settings that scored best in stage 15 for 'eval'.
# If you 'eval' without first 'dev', then stage 6 fails: no file "$EXP/mandarin-X/carmel/simple" for untrained phone-2-letter model.
TESTTYPE='dev' # Used by stage 3, and (through $evalreffile) stage 15.
# Transcriptions.
TURKERTEXT=$DATA/batchfiles # Read by stage 1.
LISTDIR=$DATA/lists # Read by stage 3.
langmap=$LISTDIR/langcodes.txt # Read by stage 3.
TRANSDIR=$DATA/nativetranscripts # Read by stage 7.
evalreffile=$TRANSDIR/$LANG_NAME/${TESTTYPE}_text # Read by stage 15. Known-good transcription for scoring by compute-wer.
# Alphabets.
engdict=$DATA/let2phn/eng_dict.txt # Read by stage 1.
engalphabet=$DATA/let2phn/englets.vocab # Read by stage 6 and 11.
phnalphabet=$DATA/phonesets/univ.compact.txt # Read by stage 6, 10, 12, and 15.
phonelm=$DATA/langmodels/$LANG_NAME/bigram.fst # Read by stage 10. (Or ...unigram.fst.)
# Parameters.
rmprefix="http://isle.illinois.edu/mc/" # Used by stage 1, to create file IDs.
gapsymbol='_' # Used by stage 4, via $aligneropt.
nparts=1 # Used by stage 4.
topN=2 # Used by stage 4.
alignswitchpenalty=1 # Used by stage 5, via $alignertofstopt.
delimsymbol='%' # Used by stage 6.
Pstyle=simple # Used by stage 6. One of simple, letctxt or phnletctxt.
nrand=12 # Used by stage 7.
phneps='<eps>' # Used by stage 9.
leteps='-' # Used by stage 9.
disambigdel='#2' # Used by stage 9, 10 and 12.
disambigins='#3' # Used by stage 9, 10 and 12.
Pscale=1 # Used by stage 9.
Gscale=1 # Used by stage 10.
Lscale=1 # Used by stage 11.
Tnumins=10 # Used by stage 12.
Tnumdel=5 # Used by stage 12.
Mscale=1 # Used by stage 14.
prunewt=2 # Used by stage 15.
# Command-line options.
aligneropt="--dist $aligndist --empty $gapsymbol" # Used by stage 4.
alignertofstopt="--switchpenalty $alignswitchpenalty" # Used by stage 5.
carmelinitopt="--$Pstyle" # --startwith=$EXPLOCAL/simple.trained # Used by stage 6. $EXPLOCAL is set by init.sh.
# Flags.
makeGTPLM=1 # Used by stage 14.
#makeTPLM=1 # Used by stage 14.
#decode_for_adapt=1 # Used by stage 14. Omit stage 15. Build PTs for training utterances in the test language, to adapt other ASR systems.
#evaloracle=1 # Used by stage 15.
debug=1 # Used by stage 15.
# Which stages to run, inclusive.
startstage=1
endstage=14