<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="generator" content="Hugo 0.66.0" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,700" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
<link rel="stylesheet" href="css/normalize.css">
<link rel="stylesheet" href="css/skeleton.css">
<link rel="stylesheet" href="css/style.css">
<link rel="alternate" href="index.xml" type="application/rss+xml" title="SpeechResearch">
<title>Explicit Intensity Control for Accented Text-to-Speech</title>
</head>
<body>
<div class="container">
<main role="main">
<article>
<h2 class="title">Explicit Intensity Control for Accented Text-to-Speech</h2>
<br>
<div class="abstract">
<p><strong>Abstract</strong><br>
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). Controlling accent intensity during synthesis is a research direction that has attracted growing attention. Recent work designs a speaker-adversarial loss to disentangle speaker and accent information, and then adjusts the loss weight to control the accent intensity. However, such a control method lacks interpretability, and there is no direct correlation between the controlling factor and natural accent intensity. To this end, this paper proposes a new intuitive and explicit accent intensity control scheme for accented TTS. Specifically, we first extract the posterior probability, called "Goodness of Pronunciation (GoP)", from the L1 speech recognition model to quantify the phoneme accent intensity for accented speech, and then design a FastSpeech2-based TTS model, named Ai-TTS, to take the accent intensity expression into account during speech generation. Experiments show that our method outperforms the baseline model in terms of accent rendering and intensity control. </p>
</div>
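<p>As a rough illustration of the GoP idea described in the abstract, the sketch below averages an L1 recognizer's frame-level posterior for each canonical phoneme over the frames aligned to it. The function name <code>phoneme_gop</code>, the input layout, and the toy numbers are assumptions made for illustration; the exact formulation used by Ai-TTS is not specified on this page.</p>

```python
import numpy as np


def phoneme_gop(posteriors, alignment):
    """Average posterior of each canonical phoneme over its aligned frames.

    posteriors: array of shape (num_frames, num_phonemes); each row is the
        L1 acoustic model's posterior distribution for one frame.
    alignment: list of (phoneme_id, start_frame, end_frame) tuples, with
        end_frame exclusive. Segments are assumed to be non-empty.
    """
    scores = []
    for phoneme_id, start, end in alignment:
        # Posterior assigned to the expected phoneme on each aligned frame.
        frames = posteriors[start:end, phoneme_id]
        scores.append(float(frames.mean()))
    return scores


# Toy example: 4 frames, 3 phoneme classes (made-up numbers).
posteriors = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
])
# Phoneme 0 spans frames 0-1, phoneme 1 spans frames 2-3.
alignment = [(0, 0, 2), (1, 2, 4)]
gop = phoneme_gop(posteriors, alignment)
print(gop)
```

<p>A low score means the L1 model finds the expected phoneme unlikely, which is the intuition behind treating that phoneme as more strongly accented. How such scores map onto the intensity values shown in the samples below is not stated here.</p>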
<br> <br>
<img src="model.png" width="100%">
<!--<h5> Preliminary Experiments for Accent Expression</h5>
To understand how the accent renderer performs, we randomly select 100 utterances from the test set as the test samples and report the 5-scale Mean Opinion Score (MOS) for three systems, including <b>Ground Truth</b> L2 speech, synthesized L2 speech by <b>FastSpeech2</b> <font color='blue'>[1]</font> and our <b>Ai-TTS</b>. For fair comparison, we set $i$ to 1 for all input utterances of Ai-TTS. We invite 20 listeners and report the subjective MOS results in the second column of Table 1. It is observed that our Ai-TTS achieves a MOS of 4.01 $\pm$ 0.022, that is significantly higher than <i>FastSpeech2</i> baseline and very close to the <i>Ground Truth</i>. For objective evaluation, we follow <font color='blue'>[1]</font> and report the moments (including standard deviation ($\sigma$), skewness ($\gamma$) and kurtosis ($\mathcal{K}$)), and average dynamic time warping (DTW) <font color='blue'>[2]</font> ($\varrho$) of the pitch distribution between the synthesized L2-accented speech and the ground truth reference in the third to sixth columns of Table 1. It can be seen that the Ai-TTS system is reported with all values that are closer to those of the Ground Truth than FastSpeech2. <font color='blue'>The subjective and objective evaluations suggest that our Ai-TTS with accent renderer achieves more expressive L2 speech in terms of accent expression.</font>
<br> <br>
<img src="tab1.png" width="100%"> !-->
<h5> Main Results</h5>
<h5 class="section">Utterance-Level Controllability Evaluation</h5>
<!-- (1) -->
<p class="transcript"><em> Unconsciously, our yells and exclamations yielded to this rhythm. <br> (Speaker: TXHC; Accent: Mandarin)</em></p>
<table>
<thead style="border-bottom: 1px solid #E1E1E1;"><tr>
<th>Ground Truth</th>
<th>DAW (Intensity = "strong")</th>
<th>Ai-TTS (Intensity = "strong")</th>
</tr></thead>
<tbody>
<tr>
<td><audio controls="controls"><source src="audios/part1/GT-1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td><audio controls="controls"><source src="audios/part1/DAW-1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td><audio controls="controls"><source src="audios/part1/Ai-TTS-1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
</table>
<!-- (2) -->
<p class="transcript"><em> He made no reply as he waited for Whittemore to continue. <br> (Speaker: NCC; Accent: Mandarin)</em></p>
<table>
<thead style="border-bottom: 1px solid #E1E1E1;"><tr>
<th>Ground Truth</th>
<th>DAW (Intensity = "strong")</th>
<th>Ai-TTS (Intensity = "strong")</th>
</tr></thead>
<tbody>
<tr>
<td><audio controls="controls"><source src="audios/part1/GT-2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td><audio controls="controls"><source src="audios/part1/DAW-2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td><audio controls="controls"><source src="audios/part1/Ai-TTS-2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
</table>
<h5 class="section">Phoneme-Level Controllability Evaluation</h5>
<!-- (5) -->
<p class="transcript"><em>Unconsciously, our yells and exclamations yielded to this rhythm. <br> (Speaker: TXHC; Accent: Mandarin)</em></p>
<h6>Phoneme Sequence: <font color=blue>AH2 N K AA1 N SH AH0 S L IY0 sp AW1 ER0 Y EH1 L Z AE1 N D sp EH2 K S K L AH0 M EY1 SH AH0 N Z sp Y IY1 L D IH0 D T UW1 DH IH1 S R IH1 DH AH0 M.</font></h6>
<table>
<tr> <h6>Sample (1): </h6></tr>
<tr> <font color=blue>AH2 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> K <font color=red>(0.9)</font> AA1 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> SH <font color=red>(0.9)</font> AH0 <font color=red>(0.9)</font> S <font color=red>(0.9)</font> L <font color=red>(0.9)</font> IY0 <font color=red>(0.9)</font> sp AW1 <font color=orange>(0.1)</font> ER0 <font color=orange>(0.1)</font> Y <font color=red>(0.9)</font> EH1 <font color=red>(0.9)</font> L <font color=red>(0.9)</font> Z <font color=red>(0.9)</font> AE1 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> sp EH2 <font color=red>(0.9)</font> K <font color=red>(0.9)</font> S <font color=red>(0.9)</font> K <font color=red>(0.9)</font> L <font color=red>(0.9)</font> AH0 <font color=red>(0.9)</font> M <font color=red>(0.9)</font> EY1 <font color=red>(0.9)</font> SH <font color=red>(0.9)</font> AH0 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> Z <font color=red>(0.9)</font> sp Y <font color=orange>(0.1)</font> IY1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> IH0 <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> T <font color=orange>(0.1)</font> UW1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> R <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font>.</font></tr>
<tr> <td><img src="audios/part2/1.png" width="100%"> </td> <td><audio controls="controls"><source src="audios/part2/1.wav" autoplay/>Your browser does not support the audio element.</audio></td></tr>
</table>
<table>
<tr> <h6>Sample (2): </h6></tr>
<tr> <font color=blue>AH2 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> K <font color=red>(0.9)</font> AA1 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> SH <font color=red>(0.9)</font> AH0 <font color=red>(0.9)</font> S <font color=red>(0.9)</font> L <font color=red>(0.9)</font> IY0 <font color=red>(0.9)</font> sp AW1 <font color=orange>(0.1)</font> ER0 <font color=orange>(0.1)</font> Y <font color=orange>(0.1)</font> EH1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> Z <font color=orange>(0.1)</font> AE1 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> sp EH2 <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font> EY1 <font color=orange>(0.1)</font> SH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> Z <font color=orange>(0.1)</font> sp Y <font color=orange>(0.1)</font> IY1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> IH0 <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> T <font color=orange>(0.1)</font> UW1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> R <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font>.</font></tr>
<tr> <td><img src="audios/part2/2.png" width="100%"> </td> <td><audio controls="controls"><source src="audios/part2/2.wav" autoplay/>Your browser does not support the audio element.</audio></td></tr>
</table>
<table>
<tr> <h6>Sample (3): </h6></tr>
<tr> <font color=blue>AH2 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> AA1 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> SH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> IY0 <font color=orange>(0.1)</font> sp AW1 <font color=orange>(0.1)</font> ER0 <font color=orange>(0.1)</font> Y <font color=orange>(0.1)</font> EH1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> Z <font color=orange>(0.1)</font> AE1 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> sp EH2 <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font> EY1 <font color=orange>(0.1)</font> SH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> Z <font color=orange>(0.1)</font> sp Y <font color=orange>(0.1)</font> IY1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> IH0 <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> T <font color=orange>(0.1)</font> UW1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> R <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font>.</font></tr>
<tr> <td><img src="audios/part2/3.png" width="100%"> </td> <td><audio controls="controls"><source src="audios/part2/3.wav" autoplay/>Your browser does not support the audio element.</audio></td></tr>
</table>
<h5 class="section">References:</h5>
[1] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in International Conference on Learning Representations, 2020.
<br><br>
[2] Meinard Müller, “Dynamic time warping,” Information Retrieval for Music and Motion, pp. 69–84, 2007.
<br><br><br><br>
</article>
</main>
</div>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-139981676-1', 'auto');
ga('send', 'pageview');
</script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
HTML: ["input/TeX","output/HTML-CSS"],
TeX: {
Macros: {
bm: ["\\boldsymbol{#1}", 1],
argmax: ["\\mathop{\\rm arg\\,max}\\limits"],
argmin: ["\\mathop{\\rm arg\\,min}\\limits"]},
extensions: ["AMSmath.js","AMSsymbols.js"],
equationNumbers: { autoNumber: "AMS" } },
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true },
"HTML-CSS": { availableFonts: ["TeX"],
linebreaks: { automatic: true } }
});
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
}
});
</script>
<script type="text/javascript" async
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
</body>
</html>