index.html

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="generator" content="Hugo 0.66.0" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,700" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
<link rel="stylesheet" href="css/normalize.css">
<link rel="stylesheet" href="css/skeleton.css">
<link rel="stylesheet" href="css/style.css">
<link rel="alternate" href="index.xml" type="application/rss+xml" title="SpeechResearch">
<title>Explicit Intensity Control for Accented Text-to-Speech</title>
</head>
<body>

<div class="container">
	<main role="main">
		<article>
      
      <h2 class="title">Explicit Intensity Control for Accented Text-to-Speech</h2>
      
       
      <br>
      <div class="abstract">
        <p><strong>Abstract</strong><br>
          Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). How to control the intensity of accent in the process of TTS is a very interesting research direction, and has attracted more and more attention. Recent work design a speaker-adversarial loss to disentangle the speaker and accent information, and then adjust the loss weight to control the accent intensity. However, such a control method lacks interpretability, and there is no direct correlation between the controlling factor and natural accent intensity. To this end, this paper propose a new intuitive and explicit accent intensity control scheme for accented TTS. Specifically, we first extract the posterior probability, called as "Goodness of Pronunciation (GoP)" from the L1 speech recognition model to quantify the phoneme accent intensity for accented speech, then design a FastSpeech2 based TTS model, named Ai-TTS, to take the accent intensity expression into account during speech generation. Experiments show that the our method outperforms the baseline model in terms of accent rendering and intensity control. </p>
      </div>

   <br> <br>
      <img src="model.png" width="100%">


      <!--<h5> Preliminary Experiments for Accent Expression</h5>
       To understand how the accent renderer performs, we randomly select 100 utterances from the test set as the test samples and report the 5-scale Mean Opinion Score (MOS) for three systems, including <b>Ground Truth</b> L2 speech, synthesized L2 speech by <b>FastSpeech2</b> <font color='blue'>[1]</font> and our <b>Ai-TTS</b>. For fair comparison, we set $i$ to 1 for all input utterances of Ai-TTS. We invite 20 listeners and report the subjective MOS results in the second column of Table 1. It is observed that our Ai-TTS achieves a MOS of 4.01 $\pm$ 0.022, that is significantly higher than <i>FastSpeech2</i> baseline and very close to the <i>Ground Truth</i>. For objective evaluation, we follow <font color='blue'>[1]</font> and report the moments (including standard deviation ($\sigma$), skewness ($\gamma$) and kurtosis ($\mathcal{K}$)), and average dynamic time warping (DTW) <font color='blue'>[2]</font> ($\varrho$) of the pitch distribution between the synthesized L2-accented speech and the ground truth reference in the third to sixth columns of Table 1. It can be seen that the Ai-TTS system is reported with all values that are closer to those of the Ground Truth than FastSpeech2. <font color='blue'>The subjective and objective evaluations suggest that our Ai-TTS with accent renderer achieves more expressive L2 speech in terms of accent expression.</font>
      <br> <br>
      <img src="tab1.png" width="100%"> !-->

      <h5> Main Results</h5>

      <h5 class="section">Controllability Evaluation on Utterance-level</h5>
      <!-- (1) -->
      <p class="transcript"><em> Unconsciously, our yells and exclamations yielded to this rhythm. <br> (Speaker: TXHC; Accent: Mandarin)</em></p>
      <table>
        <thead style="border-bottom: 1px solid #E1E1E1;"><tr>
         <th>Ground Truth</th>
        <th>DAW (Intensity = "strong")</th>
        <th>Ai-TTS (Intensity = "strong")</th>

        </tr></thead>
        <tbody>
          <tr>
            <td><audio controls="controls"><source src="audios/part1/GT-1.wav" autoplay/>Your browser does not support the audio element.</audio></td>

            <td><audio controls="controls"><source src="audios/part1/DAW-1.wav" autoplay/>Your browser does not support the audio element.</audio></td>

            <td><audio controls="controls"><source src="audios/part1/Ai-TTS-1.wav" autoplay/>Your browser does not support the audio element.</audio></td>

          </tr>
      </table>


      <!-- (2) -->
      <p class="transcript"><em> He made no reply as he waited for Whittemore to continue. <br> (Speaker: NCC; Accent: Mandarin)</em></p>
      <table>
        <thead style="border-bottom: 1px solid #E1E1E1;"><tr>
        <th>Ground Truth</th>
        <th>DAW (Intensity = "strong")</th>
        <th>Ai-TTS (Intensity = "strong")</th>
        </tr></thead>
        <tbody>
          <tr>
             <td><audio controls="controls"><source src="audios/part1/GT-2.wav" autoplay/>Your browser does not support the audio element.</audio></td>

            <td><audio controls="controls"><source src="audios/part1/DAW-2.wav" autoplay/>Your browser does not support the audio element.</audio></td>

            <td><audio controls="controls"><source src="audios/part1/Ai-TTS-2.wav" autoplay/>Your browser does not support the audio element.</audio></td>

          </tr>
      </table>


      <h5 class="section">Controllability Evaluation on Phoneme-level</h5>

      <!-- (5) -->
      <p class="transcript"><em>Unconsciously, our yells and exclamations yielded to this rhythm. <br> (Speaker: TXHC; Accent: Mandarin)</em></p>
       <h6>Phoneme Sequence: <font color=blue>AH2 N K AA1 N SH AH0 S L IY0 sp AW1 ER0 Y EH1 L Z AE1 N D sp EH2 K S K L AH0 M EY1 SH AH0 N Z sp Y IY1 L D IH0 D T UW1 DH IH1 S R IH1 DH AH0 M.</font></h6>

      <table>
      <tr>  <h6>Sample (1): </h6></tr>
      <tr> <font color=blue>AH2 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> K <font color=red>(0.9)</font> AA1 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> SH <font color=red>(0.9)</font> AH0 <font color=red>(0.9)</font> S <font color=red>(0.9)</font> L <font color=red>(0.9)</font> IY0 <font color=red>(0.9)</font> sp AW1 <font color=orange>(0.1)</font> ER0 <font color=orange>(0.1)</font> Y <font color=red>(0.9)</font> EH1 <font color=red>(0.9)</font> L <font color=red>(0.9)</font> Z <font color=red>(0.9)</font> AE1 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> sp EH2 <font color=red>(0.9)</font> K <font color=red>(0.9)</font> S <font color=red>(0.9)</font> K <font color=red>(0.9)</font> L <font color=red>(0.9)</font> AH0 <font color=red>(0.9)</font> M <font color=red>(0.9)</font> EY1 <font color=red>(0.9)</font> SH <font color=red>(0.9)</font> AH0 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> Z <font color=red>(0.9)</font> sp Y <font color=orange>(0.1)</font> IY1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> IH0 <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> T <font color=orange>(0.1)</font> UW1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> R <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font>.</font></tr>
      <tr> <td><img src="audios/part2/1.png" width="100%"> </td> <td><audio controls="controls"><source src="audios/part2/1.wav" autoplay/>Your browser does not support the audio element.</audio></td></tr>
      </table>


      <table>
      <tr>  <h6>Sample (2): </h6></tr>
      <tr> <font color=blue>AH2 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> K <font color=red>(0.9)</font> AA1 <font color=red>(0.9)</font> N <font color=red>(0.9)</font> SH <font color=red>(0.9)</font> AH0 <font color=red>(0.9)</font> S <font color=red>(0.9)</font> L <font color=red>(0.9)</font> IY0 <font color=red>(0.9)</font> sp AW1 <font color=orange>(0.1)</font> ER0 <font color=orange>(0.1)</font> Y <font color=orange>(0.1)</font> EH1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> Z <font color=orange>(0.1)</font> AE1 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> sp EH2 <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font> EY1 <font color=orange>(0.1)</font> SH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> Z <font color=orange>(0.1)</font> sp Y <font color=orange>(0.1)</font> IY1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> IH0 <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> T <font color=orange>(0.1)</font> UW1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> R <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font>.</font></tr>
      <tr> <td><img src="audios/part2/2.png" width="100%"> </td> <td><audio controls="controls"><source src="audios/part2/2.wav" autoplay/>Your browser does not support the audio element.</audio></td></tr>
      </table>


      <table>
      <tr>  <h6>Sample (3): </h6></tr>
      <tr> <font color=blue>AH2 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> AA1 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font>SH <font color=orange>(0.1)</font> AH0  <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> IY0 <font color=orange>(0.1)</font> sp AW1 <font color=orange>(0.1)</font> ER0 <font color=orange>(0.1)</font> Y <font color=orange>(0.1)</font> EH1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> Z <font color=orange>(0.1)</font> AE1 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> sp EH2 <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> K <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font> EY1 <font color=orange>(0.1)</font> SH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> N <font color=orange>(0.1)</font> Z <font color=orange>(0.1)</font> sp Y <font color=orange>(0.1)</font> IY1 <font color=orange>(0.1)</font> L <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> IH0 <font color=orange>(0.1)</font> D <font color=orange>(0.1)</font> T <font color=orange>(0.1)</font> UW1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> S <font color=orange>(0.1)</font> R <font color=orange>(0.1)</font> IH1 <font color=orange>(0.1)</font> DH <font color=orange>(0.1)</font> AH0 <font color=orange>(0.1)</font> M <font color=orange>(0.1)</font>.</font></tr>
      <tr> <td><img src="audios/part2/3.png" width="100%"> </td> <td><audio controls="controls"><source src="audios/part2/3.wav" autoplay/>Your browser does not support the audio element.</audio></td></tr>
      </table>


<h5 class="section">References:</h5>
[1] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” in International Conference on Learning Representations, 2020.
<br><br>
[2] Meinard M ̈uller, “Dynamic time warping,” Information retrieval for music and motion, pp. 69–84, 2007.
<br><br><br><br>

		</article>
	</main>


</div>

<script>
	(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
	(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
	m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
	})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
	ga('create', 'UA-139981676-1', 'auto');
	ga('send', 'pageview');
</script>

<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>


<script type="text/x-mathjax-config">
     MathJax.Hub.Config({
         HTML: ["input/TeX","output/HTML-CSS"],
         TeX: {
                Macros: {
                         bm: ["\\boldsymbol{#1}", 1],
                         argmax: ["\\mathop{\\rm arg\\,max}\\limits"],
                         argmin: ["\\mathop{\\rm arg\\,min}\\limits"]},
                extensions: ["AMSmath.js","AMSsymbols.js"],
                equationNumbers: { autoNumber: "AMS" } },
         extensions: ["tex2jax.js"],
         jax: ["input/TeX","output/HTML-CSS"],
         tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
                    displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
                    processEscapes: true },
         "HTML-CSS": { availableFonts: ["TeX"],
                       linebreaks: { automatic: true } }
     });
 </script>

 <script type="text/x-mathjax-config">
     MathJax.Hub.Config({
       tex2jax: {
         skipTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
       }
     });
 </script>

 <script type="text/javascript" async
   src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-MML-AM_CHTML">
 </script>

</body>
</html>