
Decode MP3 from Memory #815

Closed
wants to merge 11 commits

Conversation


@jjedele jjedele commented Feb 27, 2020

No description provided.

Port operator definition and kernel.
Add operator to API.
Make the linters happy.
@jjedele
Author

jjedele commented Feb 27, 2020

The linter check fails, but it does not seem to be related to my changes.

@jjedele
Author

jjedele commented Feb 27, 2020

Open questions:

  • Is the API placed well right now or should this go in experimental?
  • Add test for stereo file

@jjedele
Author

jjedele commented Feb 27, 2020

@yongtang I got around to opening the first version of the PR. Please have a look and tell me your thoughts when you have some time :)

@yongtang
Member

@jjedele Thanks for the PR, overall looks good!

Some thoughts about the API:

  • If we can get the API right and allow future expansions then I think placing it under tfio.audio is fine.
  • While tf.audio.decode_wav has the desired_channels and desired_samples arguments, I am wondering if they are truly needed. If I understand correctly, desired_channels and desired_samples exist in tensorflow's core repo because in graph mode the samples and channels are needed to get the right shape [samples, channels] in DecodeWavShapeFn.
    However, the shape is not truly needed in eager mode with TF 2.0 (the shape function is not called in eager mode). And even in graph mode, it is OK to provide a shape with only unknown dimensions (e.g., [None, None]).
    My concern with desired_channels and desired_samples is that, if they conflict with the intrinsic channels and samples, they add additional error handling that makes things complicated. For example, if the user insists on desired_channels=2 while the audio truly has channels=1, we probably don't have a good way to resolve the conflict?
  • Another thing to consider, from an API point of view: do we want a list of tfio.audio.decode_mp3, tfio.audio.decode_mp4a, tfio.audio.decode_flac, etc., or could we just have a tfio.audio.decode() that decodes any audio clip into a tensor of [samples, channels] shape? Checking whether the file is mp3, mp4, flac, or ogg is actually possible; most of the time we only need to check the first few magic bytes. Would that be something we want to consider? (A usage sketch of both shapes follows below.)
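
For illustration, a rough usage sketch of the two API shapes under discussion (names and signatures are taken from the discussion above, not from a finalized tfio API, and the file is assumed to exist):

import tensorflow as tf
import tensorflow_io as tfio

contents = tf.io.read_file("sample.mp3")

# (a) one function per format
audio = tfio.audio.decode_mp3(contents)   # -> [samples, channels]

# (b) a single generic entry point that sniffs the magic bytes
audio = tfio.audio.decode(contents)       # format inferred from the data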

@jjedele
Author

jjedele commented Feb 28, 2020

@yongtang : Thanks for your input!

My thoughts:

desired_channels, desired_samples

For decode_wav I found this pretty useful because I was working with a non-RNN audio model which would take audio snippets of approx. 1s length, convert them to spectrograms, and then do CNN-based classification. For the CNN (and also for batching) it was nice that the API allowed me to get all the data samples with a deterministic shape.

I currently handle the situations you describe the same way decode_wav does. So e.g. if desired_channels==2 and actual_channels==1, the mono channel from the source is duplicated into both stereo channels of the output. The other cases are pretty standard cropping and padding.

However, you got me thinking whether this really should be a responsibility of the decoding functions or rather of some other part of the API. Instead of desired_samples, one could use tensor slicing or padded_batch, I think (see the sketch below). For desired_channels I have to think a bit more; I haven't worked with stereo models yet.
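
For illustration only, a minimal sketch of that alternative (the helper name is made up; it assumes the decoder returns a [samples, channels] tensor):

import tensorflow as tf

def fix_length(audio, desired_samples):
    # Crop or zero-pad along the time axis, outside the decoder.
    audio = audio[:desired_samples, :]
    pad = tf.maximum(desired_samples - tf.shape(audio)[0], 0)
    return tf.pad(audio, [[0, pad], [0, 0]])

# Or leave lengths ragged and pad at batch time instead:
# dataset = dataset.padded_batch(8, padded_shapes=[None, 2])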

decode() vs decode_format()

Good point. The situation is similar to the image decoding functions, for which they ended up offering both. My first experiences with the general decode_image have been a bit frustrating because it would produce a different shape for GIFs than for other image formats (I didn't know about the expand_animations flag). I currently can't think of similar issues for audio data though.

This is something we would need to implement at the operator level (as opposed to adding it to the Python API) because we need access to the actual data to make the decision, right?

--

I also added 2 questions about technical details into the code above (error reporting in TF and operator naming). For those I would also appreciate your input.

@jjedele
Author

jjedele commented Feb 28, 2020

The failing CI checks are not caused by my changes, right? Looks like generic git problems.

@yongtang
Member

@jjedele Yes, the failing CI is not a concern (likely the GitHub Actions checkout is not able to find the base commit); once you update, it will work I think.

@yongtang
Member

@jjedele Thanks!

decode() vs decode_format()

I think we can certainly offer both. A generic decode() is very useful in building a data pipeline where the input could be a mixture of audio clips with different formats and channel counts.

A decode_format() could be useful as well, as certain format types might need additional information. For example, mp3 seems to always default to float, though with mp4a the user may occasionally want to get the raw non-decoded sample frames (e.g., with ADTS header).

I think both will help users in different use cases, and the underlying implementation could be consolidated so that code is reused.

desired_channels, desired_samples

One thing we tend to favor is to push the optional args to the Python level whenever possible and keep the C++ level ops stable. The reason is that making changes to C++ can be hard for some contributors these days. With more code at the Python level it will be much easier to get more contributors involved in the project.

For example, even if the basic C++ level ops only expose a [None, None] shape, it is quite easy to add a wrapper to get the shape right in Python, e.g., through tf.reshape.

Also, for duplicating the mono channel into stereo channels, tf.broadcast_to could easily be added at the Python level to achieve the same goal. Compared to C++ changes, that Python code is much easier to debug and maintain.
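
For illustration, a minimal sketch of such Python-level shape fixing (these helpers are hypothetical and assume the C++ op returns a [None, None] tensor plus the intrinsic channel count):

import tensorflow as tf

def fix_shape(audio, samples, channels):
    # The C++ op may only advertise [None, None]; restore the static shape here.
    return tf.reshape(audio, [samples, channels])

def fix_channels(audio, channels, desired_channels):
    # Duplicate a mono channel into the desired number of channels
    # (only valid when the source is mono).
    if desired_channels is None or desired_channels == channels:
        return audio
    return tf.broadcast_to(audio, [tf.shape(audio)[0], desired_channels])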

@jjedele
Author

jjedele commented Feb 28, 2020

@yongtang : Yeah, what you're saying makes total sense. So let me summarize the plan:

  1. We simplify the current operator so that it does not modify the shape of the data.
  2. We lift the shaping functionality to Python level, i.e. tensorflow_io.core.python.ops.audio_ops
  3. We introduce a new DecodeAudio operator which looks at the header of the data and dispatches the decoding to the appropriate decoding operator.

Sounds good?

@yongtang
Member

@jjedele The summary looks good! Let me know if you need any help in bazel build/etc 👍

Lift functionality to fix shapes from C++ to Python level.
@jjedele
Author

jjedele commented Feb 29, 2020

Implemented and pushed the part that lifts the shaping functionality to the Python level. Going to look at the generic decode() operator next. I will have a look at the decode_image() source code and try to stay close to that.

@jjedele
Author

jjedele commented Feb 29, 2020

After initial investigation it unfortunately seems like media files are not always easily identifiable by the magic number in the header, e.g. https://stackoverflow.com/questions/11360286/detect-if-a-file-is-an-mp3-file

Hope the situation is better for Ogg, FLAC and MP4a.

@yongtang
Member

@jjedele We could start with having DecodeOp support mp3 initially, and gradually expand to other types in follow-up PRs.

For detecting an mp3 file, we could also try detecting the other formats first and fall back to mp3 as the "last" type. If it is still not an mp3 file, minimp3 will return an error anyway.

Below is the place in AudioIOTensor that checks ogg/flac/wav and falls back to mp3:

TF_RETURN_IF_ERROR(env_->NewRandomAccessFile(input, &file));
char header[8];
StringPiece result;
TF_RETURN_IF_ERROR(file->Read(0, sizeof(header), &result, header));
if (memcmp(header, "RIFF", 4) == 0) {
  return WAVReadableResourceInit(env_, input, resource_);
} else if (memcmp(header, "OggS", 4) == 0) {
  return OggReadableResourceInit(env_, input, resource_);
} else if (memcmp(header, "fLaC", 4) == 0) {
  return FlacReadableResourceInit(env_, input, resource_);
}
Status status = MP3ReadableResourceInit(env_, input, resource_);
if (status.ok()) {
  return status;
}

@jjedele
Author

jjedele commented Feb 29, 2020

@yongtang Thx for pointing me to that code.

For MP4 it also seems to work with the header (https://www.file-recovery.com/mp4-signature-format.htm).

I'm not a big fan of things happening implicitly, but in this case it's probably the best solution. Also, as long as the MP3 files have ID3 metadata (which I guess usually is the case), that can be identified as well.
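
For illustration, a rough Python sketch of that kind of magic-byte check (much simplified; as discussed, bare MP3 streams without an ID3v2 tag cannot be reliably identified this way):

def sniff_format(data: bytes):
    # Very rough container sniffing by leading magic bytes.
    if data[:4] == b"RIFF":
        return "wav"
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"fLaC":
        return "flac"
    if data[4:8] == b"ftyp":
        return "mp4"   # MP4/M4A: an 'ftyp' box at offset 4
    if data[:3] == b"ID3":
        return "mp3"   # ID3v2-tagged MP3
    return None        # bare MP3 frames need a frame-level check (fallback)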

@yongtang
Member

yongtang commented Mar 1, 2020

@jjedele Another option is to add an Attr of format in Decode to forcefully stay with one format. For example, if users just want to decode a list of files with different formats, they could use:

audio = decode(input) # format = None

if they know exactly the format then they could also use:

audio = decode(input, format="mp3")

For many users, honestly, the format itself doesn't matter; they only want an API that decodes an audio file (in whatever format) without having to worry about all kinds of parameters. (We see a similar case with decode_image, which is used heavily.)

Some other users do want fine control over the format. So ideally I think an option of auto-probing the format makes sense.

@jjedele
Author

jjedele commented Mar 2, 2020

@yongtang Unfortunately didn't get to continue on this yet since I'm a bit busy with other things right now.

I think I would stay with the approach of specific decode_format methods plus a general decode method that infers the format automatically. Reason: if you already know the format, calling the right method instead of setting a parameter is not really more difficult. And if we consider having format-specific decoding options, we can clearly separate those into the individual decode methods. With a single method that takes format as a parameter, we would need to throw all these things together in one API and add lots of documentation like "this parameter is only considered if format == mp3", etc. Not a fan of the second solution. The first approach is also consistent with how image decoding is currently implemented in core TF.

@jjedele
Author

jjedele commented Mar 2, 2020

@yongtang I'm also thinking right now about what's the best approach to share the code between decode and decode_mp3. I've seen (https://github.com/tensorflow/io/blob/master/tensorflow_io/core/kernels/audio_kernels.cc#L83) that it is possible to call other kernels' compute methods from a kernel. Does making the decode operator a simple wrapper that looks at the header and then dispatches to the appropriate operator's compute method sound like a good idea to you?

@yongtang
Member

yongtang commented Mar 2, 2020

@jjedele Yes we want to reduce code duplication as much as possible, so reusing code in different parts would be great 👍

@jjedele
Author

jjedele commented Mar 2, 2020

@yongtang This is not so much a question of whether to reuse code, but rather of the best way to do it. I currently see 2 options:

  1. Leave the DecodeMp3Operator as it is currently and implement a DecodeOperator which directly calls DecodeMp3Operator::Compute.
  2. Extract the decoding logic into a new function with a signature similar to void* DecodeMp3(void *data) which would then be called from both operators.

Option 1 seems preferable to me since it's simpler. For 2 I would still need to duplicate the shape logic. I don't know whether option 1 causes any problems if we pass along the TF context objects, etc., though.

@yongtang
Member

yongtang commented Mar 2, 2020

@jjedele I would suggest going with 2, as honestly I don't know if there will be any implications if we go with 1 (with graph nodes, contexts, etc.).

@jjedele
Author

jjedele commented Mar 2, 2020

@yongtang Ok, I will do that! Thanks again for your help.

@lieff

lieff commented Mar 3, 2020

After initial investigation it unfortunately seems like media files are not always easily identifiable by the magic number in the header, e.g. https://stackoverflow.com/questions/11360286/detect-if-a-file-is-an-mp3-file

I can write a helper function for mp3 detection. Basically we need to check several consecutive frames to prove this is really mp3 if id3v2 is absent, plus support damaged files like https://github.com/lieff/minimp3/blob/master/vectors/l3-sin1k0db.bit .

@jjedele
Author

jjedele commented Mar 3, 2020

@lieff Thx for offering! For my use case where we already have the whole data in memory this would be awesome.

I'm a bit unsure yet how we would integrate it for the file-based reader. We would probably need to read a considerably bigger piece than just the file header for the check. But it might be worth it as long as it stays in the kB range.

@yongtang
Member

yongtang commented Mar 3, 2020

Thanks @lieff for offering help!

@jjedele File-based access is possible, as TensorFlow's FileSystem is really a set of callback functions with random-offset Read and GetFileSize. That should be enough for any processing. On a side note, this is also how TensorFlow handles storage schemes like s3 or gcs that are not traditional local files.

Minimp3 already has a callback API (again thanks @lieff for the great work 👍 ), so it is quite easy to wire the two sets of callbacks together.

The following is how the callback is wrapped to use mp3 for processing IOTensor and Dataset:

class MP3Stream {
 public:
  MP3Stream(SizedRandomAccessFile* file, int64 size)
      : file(file), size(size), offset(0) {}
  ~MP3Stream() {}

  static size_t ReadCallback(void* buf, size_t size, void* user_data) {
    MP3Stream* p = static_cast<MP3Stream*>(user_data);
    StringPiece result;
    Status status = p->file->Read(p->offset, size, &result, (char*)buf);
    p->offset += result.size();
    return result.size();
  }

  static int SeekCallback(uint64_t position, void* user_data) {
    MP3Stream* p = static_cast<MP3Stream*>(user_data);
    if (position < 0 || position > p->size) {
      return -1;
    }
    p->offset = position;
    return 0;
  }

  SizedRandomAccessFile* file = nullptr;
  int64 size = 0;
  long offset = 0;
};

In fact, even when the whole file is in memory, the callbacks can be applied as well by simply translating them into direct memory access.

@lieff

lieff commented Mar 3, 2020

We would probably need to read a considerable bigger piece than just the file header for testing.

Yes, ~16kb worst case is needed for 10 consecutive frames. I'll create detect functions and note here when they're ready.

@lieff

lieff commented Mar 4, 2020

Here are the new detect functions: lieff/minimp3@0a2ff3b .
They return zero if detection succeeds, MP3D_E_USER if it fails, or MP3D_E_IOERROR on an IO error.

Refactor to work towards general DecodeAudio operator.
@jjedele
Author

jjedele commented Mar 4, 2020

Thx for doing this so quickly @lieff !

@jjedele
Author

jjedele commented Mar 4, 2020

New ToDo list:

  • Add @lieff 's MP3 identification function. Probably we'll need to update the dependency in the build file?
  • Create a DecodeAudioBaseOp so we do not have to duplicate the output tensor creation logic.
  • Implement generic decode() operator.

@yongtang
Member

yongtang commented Mar 4, 2020

Thanks @lieff for the help!

@jjedele you can update the workspace file in

io/WORKSPACE

Lines 724 to 732 in 7a19d34

http_archive(
    name = "minimp3",
    build_file = "//third_party:minimp3.BUILD",
    sha256 = "53dd89dbf235c3a282b61fec07eb29730deb1a828b0c9ec95b17b9bd4b22cc3d",
    strip_prefix = "minimp3-2b9a0237547ca5f6f98e28a850237cc68f560f7a",
    urls = [
        "https://github.com/lieff/minimp3/archive/2b9a0237547ca5f6f98e28a850237cc68f560f7a.tar.gz",
    ],
)

The strip_prefix and urls fields should be updated with the new git commit, and sha256 is the new sha256 of the .tar.gz file.
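
As an aside, a quick local sketch for computing the new sha256 of the archive (assuming Python; sha256sum on the downloaded .tar.gz works just as well; the URL in the comment is a placeholder):

import hashlib
import urllib.request

def archive_sha256(url):
    # sha256 of the .tar.gz, as expected by the http_archive rule.
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

# archive_sha256("https://github.com/lieff/minimp3/archive/<new-commit>.tar.gz")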

Implement general DecodeAudio operator.
Add lieff's detect_mp3 function.
Author

@jjedele jjedele left a comment

@yongtang Implemented the general audio.decode operator now and wired it together with the MP3 decoding. The code is in a state that seems OK to me, but I'm not an experienced C++ programmer, so I'm happy about any feedback. I would suggest that we finish up this PR and then implement decoding for the other formats in follow-up PRs.

// DecodeAudioBaseOp
DecodeAudioBaseOp::DecodeAudioBaseOp(OpKernelConstruction *context) : OpKernel(context) {}

void DecodeAudioBaseOp::Compute(OpKernelContext *context) {
Author

I'm a bit unhappy with the fact that these methods are in tensorflow::data while everything else is in this nested anonymous namespace. I'm not experienced enough with C++ to know what this is about, so I'm happy about feedback/ideas.


// DecodedAudio
size_t DecodedAudio::data_size() {
return channels * samples_perchannel * sizeof(int16);
Member

I think Google's code style prefers CamelCase method names, so DataSize instead?

Also, not all audio files use int16, so this has to at least take the data type into consideration. But that is a larger discussion.

@@ -30,23 +108,27 @@ class AudioReadableResource : public AudioReadableResourceBase {
mutex_lock l(mu_);
std::unique_ptr<tensorflow::RandomAccessFile> file;
TF_RETURN_IF_ERROR(env_->NewRandomAccessFile(input, &file));
char header[8];
char header_buf[8];
Member

I think header is fine here?

public:
DecodeAudioOp(OpKernelConstruction *context) : DecodeAudioBaseOp(context) {}

std::unique_ptr<DecodedAudio> decode(StringPiece &data, void *config) {
Member

Google's style is to return a Status as the function's return value; any values/pointers that also need to be returned are placed at the end of the parameter list as pointers, so something like:

Status Decode(StringPiece& data, DecodedAudio** audio);

Then you could use:

DecodedAudio* audio;
Status status = Decode(data, &audio);

std::unique_ptr<DecodedAudio> d;
d.reset(audio);

You might also pass a unique_ptr as well I think:

Status Decode(StringPiece& data, std::unique_ptr<DecodedAudio>* audio);

Author

Passing a pointer to a unique pointer sounds pretty hacky; do you think that's a good idea? I'd probably rather go with the first option. Thx for pointing me to the coding style!

DecodeAudioOp(OpKernelConstruction *context) : DecodeAudioBaseOp(context) {}

std::unique_ptr<DecodedAudio> decode(StringPiece &data, void *config) {
auto error = std::unique_ptr<DecodedAudio>(new DecodedAudio(false, 0, 0, 0, nullptr));
Member

I think this does not capture the error situation.

@@ -45,5 +54,40 @@ Status MP4ReadableResourceInit(
Env* env, const string& input,
std::unique_ptr<AudioReadableResourceBase>& resource);

// Container for decoded audio.
class DecodedAudio {
public:
Member

Not sure we need a class here, as it is just a struct with one function that does a

channels * samples_perchannel * sizeof(int16);

maybe we don't need this class after all?

Author

Yes, I think you're right. I think a struct is enough.

const int sampling_rate;
// should first contain all samples of the left channel
// followed by the right channel
const int16 *data;
Member

I think we could avoid allocating the memory here, as we can create the output Tensor which will hold the memory. Then the output Tensor can be used directly to get at the data?

Author

I was thinking about this for a while, but I'm not sure how it would work. The problem is I need a shape to allocate the output tensor, which I get by decoding the MP3, for which in turn I need to have memory.

Member

In the case of the shape, in:

Status Read(const int64 start, const int64 stop,

a callback-style lambda is passed which allows the allocation to be done once the shape is ready. This would be helpful when we only want to call read once.

public:
DecodeAudioOp(OpKernelConstruction *context) : DecodeAudioBaseOp(context) {}

std::unique_ptr<DecodedAudio> decode(StringPiece &data, void *config) {
Member

Also, I don't see the decode here? It seems to only do a classify?


@yongtang
Member

yongtang commented Mar 5, 2020

@jjedele If you take a look at the existing MP3ReadableResource, you probably noticed that if you replace:

file_.reset(new SizedRandomAccessFile(env_, filename, nullptr, 0));

with the buffer's data and size, as in:

file_.reset(new SizedRandomAccessFile(env_, filename, buffer, length));

you pretty much have a memory-backed mp3 decoder in place.

After that, you can check the intrinsic shape ([samples, channels]) and dtype (int16/float32/etc) through:

 Status Spec(TensorShape* shape, DataType* dtype, int32* rate)

Once you have the shape and dtype, you can just read the whole thing into the output Tensor, and the decode_mp3 op is pretty much complete. Have you considered this approach as well?

@jjedele
Author

jjedele commented Mar 5, 2020

@yongtang Thx for your feedback! I actually hadn't considered the approach you mention in your last comment. I just assumed that SizedRandomAccessFile would need to be backed by a real file. Now that you mention it, I looked at the implementation and see that what you say would likely work.

I'm not sure how much I like the approach though. If we start doing this, it means we must from now on always ensure that it works without a real file in the background. And since the actual decoding logic in lieff's library is already super easy to use, I don't think we would actually save that much code by doing this.

@yongtang
Member

yongtang commented Mar 5, 2020

@jjedele The mp3 (and to an extent mp4a) is not very challenging to decode, thanks to @lieff's great libraries. We do have different use cases for audio, though:

  1. Sequential access (AudioIODataset, to be passed to tf.keras, and lazily loaded)
  2. Random access (AudioIOTensor, allows __getitem__, and lazily loaded)
  3. Basic ops that decode memory (non-lazily loaded, as the content is already in memory beforehand)

We would like to come up with some way to avoid duplicating code in many places. Opened issue #839 for further discussion.

@jjedele
Author

jjedele commented Mar 8, 2020

@yongtang Should I continue adding the discussed changes here, or do you think #839 will lead to a bigger redesign of the API that we should do upfront?

@yongtang
Member

yongtang commented Mar 8, 2020

@jjedele We prefer multiple smaller PRs over one big PR. For this PR I think we can keep the focus on decode and decode_mp3; we just need to sync up with the overall picture being discussed in #839.

Also, besides mp3 there are several other types (wav, flac, ogg, mp4) that are more or less ready to be exposed as decode_format. We can add them in follow-up PRs, and if you don't mind, I can help work on some of them.

@yongtang
Member

yongtang commented Mar 8, 2020

@jjedele Also, given the ongoing discussion in #839, maybe we can temporarily place the API in tfio.experimental.audio? Once we get the majority complete, we could batch-move it to tfio.audio.

We plan on having the next tensorflow-io release for TF 2.2. Since TF 2.2 is likely to be released in 4+ weeks (the RC is not out yet), we have roughly a month or more to get everything in place and move from tfio.experimental.audio to tfio.audio by then.

@jjedele
Author

jjedele commented Mar 8, 2020

@yongtang Sounds good. I will move it to experimental.

@jjedele
Author

jjedele commented Mar 18, 2020

@yongtang I was looking at your code a bit, and right now it seems to me that we would introduce a lot of duplication of this whole shaping code if we create separate operators for the different decode operations, without gaining anything at this point. Maybe we should just start with the generic one and then differentiate further if we actually find/need codec-specific parameters.

@yongtang
Member

@jjedele I think what we could do is:

  1. Leave the Python interface alone.
  2. In the C++ kernel, have only one AudioDecodeOp kernel, which takes one additional Attr of encoding to optionally pass an encoding (mp3, mp4a, etc.) so that it can be routed to the right processing code.
  3. In the Python implementation, each decode_format calls the C++ binding io_audio_decode(..., encoding="mp3") (see the sketch below).
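
A minimal sketch of what step 3 could look like at the Python level (the io_audio_decode binding name follows the comment above; it and the import path are assumptions, not the actual implementation):

from tensorflow_io.core.python.ops import core_ops  # assumed location of the generated bindings

def decode_mp3(contents):
    # Thin wrapper over the single C++ kernel, routed by the `encoding` attr.
    return core_ops.io_audio_decode(contents, encoding="mp3")

def decode_flac(contents):
    return core_ops.io_audio_decode(contents, encoding="flac")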

@jjedele
Author

jjedele commented Mar 19, 2020

@yongtang That sounds like a good idea. But maybe that should be another task since I've just seen that in the meantime we already have decode_flac, decode_ogg, etc.

On that note: at this point, it seems to me that decode_mp3 would mostly be a copy/paste of one of the other operators, since we can now reuse the ReadableResource implementations. Given that I currently don't have much time and am blocked until I can resolve #855 , maybe the most efficient thing would be to close this PR and have you add the DecodeMP3Operator from a clean branch like the others. Then we can open a new issue for the general decode operator, and I or somebody else can work on it at some point later.

What do you think?

@yongtang
Member

@jjedele Sure, let me take a look.
