/*Package rbg2p contains utilities for rule-based, manually written grapheme-to-phoneme (g2p) rules.
Each g2p rule set is defined in a .g2p file with the following content:
* specific variables
- used to define constant variables such as character set and phoneme delimiter
* variables
- any variables for use in the context of the actual rules
* sylldef - definitions for dividing transcriptions into syllables
* prefilters - filters applied before the rules
* rules - g2p rules
* filters - transcription filters applied after the rules
* tests - input/output tests
* comments
SPECIFIC VARIABLES
Defines a set of constant variables, such as the character set and the phoneme delimiter. Quotes are required around the value, since a space or the empty string can be used as a value.
<NAME> "<VALUE>"
Available variables (* means required):
CHARACTER_SET* (default: none)
- used to check that each character in the character set has at least one rule
PHONEME_SET (default: none)
- space separated symbol set, used to validate the phonemes in the g2p rules
DEFAULT_PHONEME (default: "_")
- used for input (orthographic) symbols that no rule matches
PHONEME_DELIMITER (default: " ")
- used to concatenate phonemes into a transcription
DOWNCASE_INPUT (default: true)
Examples:
CHARACTER_SET "abcdefghijklmnopqrstuvwxyzåäö"
PHONEME_SET "a au o u i y e eu p t k b d g r s f h j l v w m n S tS"
DEFAULT_PHONEME "_"
PHONEME_DELIMITER " "
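The quoting requirement can be illustrated with a minimal Go sketch. It is not the package's own parser: the regexp and the helper name parseConst are assumptions made for this example only. The quotes are what allow a value consisting of a single space, or the empty string, to be read unambiguously.

```go
package main

import (
	"fmt"
	"regexp"
)

// constRe matches a line of the form: NAME "VALUE".
// The pattern is an assumption for illustration, not the package's own parser.
var constRe = regexp.MustCompile(`^([A-Z_]+)\s+"(.*)"$`)

// parseConst (hypothetical helper) splits a constant definition into name and value.
// ok is false if the line does not have the quoted form.
func parseConst(line string) (name, value string, ok bool) {
	m := constRe.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	for _, line := range []string{
		`PHONEME_DELIMITER " "`,
		`DEFAULT_PHONEME "_"`,
		`PHONEME_DELIMITER`, // missing value: rejected
	} {
		name, value, ok := parseConst(line)
		fmt.Printf("name=%q value=%q ok=%v\n", name, value, ok)
	}
}
```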
VARIABLES
Regexp variables, prefixed by VAR, that can be used in rule contexts and filters, as exemplified below. Variable names must not contain underscores (_).
VAR <NAME> <VALUE>
Examples:
VAR VOWEL [aeyuio]
VAR AFFRICATE (tS|dZ)
VAR VOICELESS [ptksf]
SYLLDEF
A set of variables prefixed by SYLLDEF, used for syllabification (not required).
SYLLDEF <NAME> "<VALUE>"
Currently, only maximum onset (MOP) syllabification can be used.
Variables currently available:
TYPE (default: MOP)
- currently, the only value allowed here is MOP
ONSETS
- a comma separated list of valid syllable onsets (typically consonant clusters)
SYLLABIC
- a space separated list of syllabic phonemes (typically vowels)
STRESS
- a space separated list of stress symbols
DELIMITER
- syllable delimiter symbol
Examples:
SYLLDEF TYPE MOP
SYLLDEF ONSETS "p, b, t, rt, m, n, d, rd, k, g, rn, f, v, C, rs, r, l, s, x, S, h, rl, j, s, p, r, rs p r, s p l, rs p l, s p j, rs p j, s t r, rs rt r, s k r, rs k r, s k v, rs k v, p r, p j, p l, b r, b j, b l, t r, rt r, t v, rt v, d r, rd r, d v, rd v, k r, k l, k v, k n, g r, g l, g n, f r, f l, f j, f n, v r, s p, s t, s k, s v, s l, s m, s n, n j, rs p, rs rt, rs k, rs v, rs rl, rs m, rs rn, rn j, m j"
SYLLDEF SYLLABIC "i: I u0 }: a A: u: U E: {: E { au y: Y e: e 2: 9: 2 9 o: O @ eu"
SYLLDEF STRESS "\" %"
SYLLDEF DELIMITER "."
RULES
Grapheme to phoneme rules written in a format loosely based on phonotactic rules. The rules are ordered, and typically the rule order is of great importance.
<INPUT> -> <OUTPUT>
<INPUT> -> <OUTPUT> / <CONTEXT>
<INPUT> -> (<OUTPUT1>, <OUTPUT2>)
<INPUT> -> (<OUTPUT1>, <OUTPUT2>) / <CONTEXT>
Context:
<LEFT CONTEXT> _ <RIGHT CONTEXT>
<INPUT> is a string of one or more input characters. <OUTPUT> is a string of output phonemes, separated by the phoneme delimiter defined above. For empty output, i.e., when a character should not be pronounced, use the empty set symbol "∅" (U+2205).
<CONTEXT> is the context in which the <INPUT> must occur for the rule to apply. Pre-defined variables (above) can be used in the context specs. # is used for anchoring (it marks the start/end of the input string).
Examples:
a -> ? a / # _
a -> a
e -> e
skt -> (s t, s k t) / _
ck -> k
b -> p / _ VOICELESS
h -> ∅ / # _
PREFILTERS
Regexp replacement filters for the input (orthographic) string. The filters are applied before the g2p rules. Pre-defined variables (above) can be used in the input regexp, surrounded by curly brackets.
PREFILTER "<FROM RE>" -> "<TO STRING>"
PREFILTER "<FROM RE WITH {VARIABLENAME}>" -> "<TO STRING>"
Example:
PREFILTER "п" -> "p" // cyrillic char to latin
FILTERS
Regexp replacement filters for transcriptions. The filters are applied after the g2p rules. Pre-defined variables (above) can be used in the input regexp, surrounded by curly brackets.
FILTER "<FROM RE>" -> "<TO STRING>"
FILTER "<FROM RE WITH {VARIABLENAME}>" -> "<TO STRING>"
Example:
FILTER "^" -> "\" " // place stress first in transcription
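The curly-bracket variable syntax can be sketched in Go as follows. This is only an illustration under assumptions: the helper names expandVars and applyFilter are hypothetical, and the second filter in main is made up for the example. Each {NAME} in the input regexp is replaced by the variable's value before the regexp is compiled with the standard regexp package.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// expandVars (hypothetical helper) replaces each {NAME} in pattern with the
// regexp stored under NAME.
func expandVars(pattern string, vars map[string]string) string {
	for name, value := range vars {
		pattern = strings.ReplaceAll(pattern, "{"+name+"}", value)
	}
	return pattern
}

// applyFilter expands variables in from, compiles the result, and rewrites trans.
// Note that regexp.ReplaceAllString expands $1-style references in to.
func applyFilter(trans, from, to string, vars map[string]string) (string, error) {
	re, err := regexp.Compile(expandVars(from, vars))
	if err != nil {
		return "", err
	}
	return re.ReplaceAllString(trans, to), nil
}

func main() {
	vars := map[string]string{"VOICELESS": "[ptksf]"}

	// FILTER "^" -> "\" " from the example above: place stress first
	stressed, err := applyFilter("h i t", "^", "\" ", vars)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(stressed)

	// hypothetical filter using a variable: drop a final voiceless consonant
	trimmed, err := applyFilter("h i t", " {VOICELESS}$", "", vars)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(trimmed)
}
```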
COMMENTS
Comments are prefixed by // or #
TESTS
Test examples prefixed by TEST:
TEST <INPUT> -> <OUTPUT>
or with variants:
TEST <INPUT> -> (<OUTPUT1>, <OUTPUT2>)
Examples:
TEST hit -> h i t
TEST kex -> (k e k s, C e k s)
---
SEPARATE SYLLABIFICATION RULE FILE
A .syll file for syllabification contains a subset of the items used in a full .g2p file.
Example (for the CMU lexicon):
PHONEME_SET "AA AE AH AX AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH 1 2"
PHONEME_DELIMITER " "
SYLLDEF TYPE MOP
SYLLDEF ONSETS "P, T, K, B, D, G, CH, JH, F, V, T, D, S, Z, S, Z, H, L, M, N, N, R, W, J, P R, T R, B R, G R, S T R, S P R, S K R, P L, T L, B L, G L, S T L, S P L, S K L, S P, S T, S K"
SYLLDEF SYLLABIC "AA AE AH AX AO AW AY EH ER EY IH IY OW OY UH UW"
SYLLDEF STRESS "1 2"
SYLLDEF DELIMITER "$"
SYLLDEF TEST AX P R 1 AA K S AX M AX T -> AX $ P R 1 AA K $ S AX $ M AX T
SYLLDEF TEST W 1 UH D S T R 2 IY M -> W 1 UH D $ S T R 2 IY M
For details on the .g2p file format, check docs for the root folder of this package.
For more examples (used for unit tests), see the test_data folder: https://github.com/stts-se/rbg2p/tree/master/test_data
To test a single g2p file from the command line, use cmd/g2p.
To import and use the rbg2p rule package in another Go program:
package main

import (
	"fmt"
	"log"

	"github.com/stts-se/rbg2p"
)

func main() {
	// TODO: set g2pFile to the path of a .g2p file and orth to the word to transcribe
	g2pFile, orth := "", ""

	// Load rule file
	ruleSet, err := rbg2p.LoadFile(g2pFile)
	if err != nil {
		log.Fatalf("couldn't load %s: %v", g2pFile, err)
	}

	// Test rule set
	// testRes is an instance of rbg2p.TestResult
	// - testRes.Failed() is a quick check for whether there were any errors
	// - testRes.AllErrors() retrieves all errors
	// - testRes.AllMessages() retrieves all errors and warnings
	testRes := ruleSet.Test()
	if testRes.Failed() {
		log.Fatalf("rule set tests failed: %v", testRes.AllErrors())
	}

	// Transcribe an input word
	transes, err := ruleSet.Apply(orth)
	if err != nil {
		log.Fatalf("couldn't transcribe %q: %v", orth, err)
	}
	fmt.Println(transes)
}
*/
package rbg2p