
synthesizeUsingWebSocket producing bad files #956

Closed
luiscastro193 opened this issue Sep 15, 2019 · 10 comments · Fixed by #957
Comments

@luiscastro193

Text to speech using synthesizeUsingWebSocket apparently works, but the output file can't be opened by any player. The files produced using synthesize (without the WebSocket) work fine, but I can't always use the HTTP API because my texts are too long.

This is the code:

"use strict";
require('dotenv').config();
const fs = require('fs');
const path = require('path');
const TextToSpeechV1 = require('ibm-watson/text-to-speech/v1');
const textToSpeech = new TextToSpeechV1({
	iam_apikey: process.env.APIKEY,
	url: process.env.URL
});
const files = process.argv.slice(2);

function toSpeech(text, fileName) {
	let audio = textToSpeech.synthesizeUsingWebSocket({text, voice: 'es-ES_EnriqueV3Voice', accept: 'audio/ogg;codecs=opus'});
	audio.pipe(fs.createWriteStream(`${fileName}.ogg`));
}

for (let file of files)
	toSpeech(fs.readFileSync(file, 'utf8'), path.basename(file, path.extname(file)));

SDK Version
4.4.0

Additional information:

  • OS: Linux Mint 19.2
  • Which version of Node are you using? 10.16.3
@luiscastro193
Author

A simplified version of the code:

"use strict";
require('dotenv').config();
const fs = require('fs');
const TextToSpeechV1 = require('ibm-watson/text-to-speech/v1');
const textToSpeech = new TextToSpeechV1({
	iam_apikey: process.env.APIKEY,
	url: process.env.URL
});

function toSpeech(text, fileName) {
	let audio = textToSpeech.synthesizeUsingWebSocket({text, voice: 'es-ES_EnriqueV3Voice', accept: 'audio/ogg;codecs=opus'});
	audio.pipe(fs.createWriteStream(`${fileName}.ogg`));
}

toSpeech("Esto es una prueba", "test");

@MasterOdin
Contributor

MasterOdin commented Sep 15, 2019

I've seen this too. I think the reason is here:

socket.onmessage = message => {
  const chunk = message.data;
  // some messages are strings - emit those unencoded, but push them to
  // the stream as binary
  const data = typeof chunk === 'string' ? chunk : Buffer.from(chunk);
  /**
   * Emit any messages received over the wire, mainly used for debugging.
   *
   * @event SynthesizeStream#message
   * @param {Object} message - frame object received from service
   * @param {Object} data - a data attribute of the frame that's either a string or a Buffer/TypedArray
   */
  self.emit('message', message, data);
  self.push(Buffer.from(chunk));
};

where chunk is a string only at the very beginning: per https://cloud.ibm.com/docs/services/text-to-speech?topic=text-to-speech-usingWebSocket, the service initially sends a JSON string that tells you the shape of the file:

{
  "binary_streams": [
    {
      "content_type": "audio/ogg;codecs=opus"
    }
  ]
}

which the above implementation (and example in the repo) is appending to the front of the generated *.ogg file.
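The corruption is easy to demonstrate: an Ogg container must begin with the "OggS" capture pattern at byte 0, so any bytes prepended to the stream make the file unplayable. A minimal sketch (the JSON prefix mirrors the descriptor above; the Ogg bytes are a stand-in for real audio data):

```javascript
// Sketch of the corruption: Ogg requires the "OggS" capture pattern at
// byte 0, but the buggy stream writes the JSON descriptor first.
const jsonPrefix = Buffer.from(
  '{"binary_streams":[{"content_type":"audio/ogg;codecs=opus"}]}'
);
const oggPage = Buffer.from('OggS'); // start of a real Ogg page header
const broken = Buffer.concat([jsonPrefix, oggPage]);

console.log(broken.slice(0, 4).toString()); // '{"bi' - not "OggS", so players reject it
```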

Changing the check for a string to be something like:

if (typeof chunk === 'string') {
  try {
    JSON.parse(chunk);
    return; // it parsed as JSON, so it's a metadata frame - drop it
  } catch (e) {
    // not JSON - fall through and push it as binary
  }
}

seems to fix the problem, though this throws away the above JSON object as well as the timings JSON object. Should both of these be handled downstream of the synthesizeUsingWebSocket method or dealt with inside of it, possibly by emitting them on a dedicated event off the stream?

@luiscastro193 For a current workable solution, you'll have to put a filter on the piped binary stream doing something like:

const { Transform } = require('stream')

class Filter extends Transform {
  _transform(chunk, encoding, next) {
    try {
      // metadata frames parse cleanly as JSON - drop them
      JSON.parse(chunk.toString());
      return next();
    } catch (e) {
      // not JSON - binary audio, pass it through unchanged
    }
    next(null, chunk)
  }
}

const transformer = new Filter();

audio.pipe(transformer).pipe(fs.createWriteStream(`${fileName}.ogg`))

@dpopp07
Contributor

dpopp07 commented Sep 16, 2019

@luiscastro193 Thanks for the issue, and @MasterOdin, thanks for the detailed explanation. Unless there is a sensible use case for including that information in the audio file itself, I like the idea of leaving it out of the file and emitting it from a separate event so the user can capture it if they want. That seems better, given that the current approach generates unusable (or difficult-to-use) files.

Is there a reason users might want this info in their files? If not, we can go ahead with this change. Ideally, we would get it in before the major release. I can open a PR unless you've already started working on this @MasterOdin

@MasterOdin
Contributor

MasterOdin commented Sep 16, 2019

I agree that there isn't a sensible reason to embed it in the file itself (pulling the JSON structure back out without corrupting the file is a bit tricky), so we should just emit it on a specifically named event in case the user wants that info.

@dpopp07 I'll make a PR, as this gives me immense satisfaction in finally figuring out (and fixing) why ffplay would fail on audio that was too short. I'll make the PR against the v4 line though, as I don't think it requires anything that was new in the v5 release candidate, and that function remained largely the same.

@dpopp07
Contributor

dpopp07 commented Sep 16, 2019

Since this changes default behavior, my instinct is to be concerned that it has the potential to be breaking. That said, if this purely removes a pain point for users, then I can consider it a fix and a necessary change in behavior (and/or a feature, for the new event).

Sounds good though, I appreciate it!

@dpopp07 dpopp07 added the bug label Sep 16, 2019
@MasterOdin
Contributor

MasterOdin commented Sep 16, 2019

Yeah, it's technically "breaking", but it's breaking in a way that I think makes things more in line with how people would expect it to work (certainly how I was expecting it to work), especially based on prior usage in older versions and on what the examples themselves document. I'd also very much like to use this in the v4 branch immediately, as I'm not yet ready to spend time dealing with the authentication changes for v5 in my work.

@dpopp07
Contributor

dpopp07 commented Sep 16, 2019

Right, I think it is fine. In that case, I will await the PR. Thanks again

@MasterOdin
Contributor

No problem, I've got some dead time in the ramp-up for my IBM internship, and I figure no one can yell at me for working on an IBM project to fill in the day.

@dpopp07
Contributor

dpopp07 commented Sep 16, 2019

Oh, nice! I certainly won't yell at you 😄

@watson-github-bot
Member

🎉 This issue has been resolved in version 4.5.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
