-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decode MP3 from Memory #815
Conversation
Port operator definition and kernel.
Add operator to API.
Add tests.
Make the linters happy.
Linter check fails, but it does not seem to be required to my changes:
|
Open questions:
|
@yongtang Got to open 1st version of the PR. Please have a look and tell me your thoughts when you have some time :) |
@jjedele Thanks for the PR, overall looks good! Some thought about API:
|
@yongtang : Thanks for your input! My thoughts:
|
The failing CI checks are nothing I'm causing, right? Looks like generic git problems. |
@jjedele Yes the failing CI is not a concern (likely GitHub actions checkout is not able to find the base commit), once you update it will work I think. |
@jjedele Thanks!
I think we can certainly offer both. A generic A I think both will help user in different use cases, and the underlying implementation could be consolidated so that code are re-used.
One thing we tend to favor, is to push the optional args into python level whenever possible, and makes C++ level ops stable. The reason is that making changes to C++ could be hard for some contributors in this day and age. With more code in python level it will be much easier to get more contributors involved in the project. For example, even if the basic C++ level ops may only expose Also, for duplicate the mono channel to stereo channels, at the python level |
@yongtang : Yeah, what you're saying makes total sense. So let me summarize the plan:
Sounds good? |
@jjedele The summary looks good! Let me know if you need any help in bazel build/etc 👍 |
Lift functionality to fix shapes from C++ to Python level.
Implemented and pushed the part with lifting the shaping functionality to Python level. Going to look at the generic |
After initial investigation it unfortunately seems like media files are not always easily identifiable by the magic number in the header, e.g. https://stackoverflow.com/questions/11360286/detect-if-a-file-is-an-mp3-file Hope the situation is better for Ogg, Flac and MP4a |
@jjedele We could start with having In case of detecting Below are the place of AudioIOTensor' of checking ogg/flac/wav and fall back to mp3: io/tensorflow_io/core/kernels/audio_kernels.cc Lines 32 to 46 in 7a19d34
|
@yongtang Thx for hinting me at that code. For MP4 it also seems to work with the header (https://www.file-recovery.com/mp4-signature-format.htm). I'm not a big fan of things happening implicit, but in this case it's probably the best solution. Also, as long the MP3 files have ID3 metadata (which I guess usually is the case), that can be identified as well. |
@jjedele Another option is to add an Attr of
if they know exactly the format then they could also use:
For many users, honestly they don't care about the format itself, they only want to have an API to decode an audio file (in different format) and not worry about all kinds of parameters. (We could see a similar case of For some other users, they do want fine control of the format. So ideally I think an option of auto probing the format makes sense. |
@yongtang Unfortunately didn't get to continue on this yet since I'm a bit busy with other things right now. I think I would stay with the approach with specific |
Factor out audio format detection.
@yongtang I'm also thinking right now about what's the best approach to share the code between |
@jjedele Yes we want to reduce code duplication as much as possible, so reusing code in different parts would be great 👍 |
@yongtang This is not so much about reusing code yes or no, but rather about the best way to do it. I currently see 2 options:
Option 1 seems preferable to me since it's simpler. For 2 I would still need to duplicate the shape logic. I don't know if option 1 makes any problems if we pass along the TF Context objects, etc though. |
@jjedele I would suggest go with 2, as honestly I don't know if there will any implications if we go with 1 (with graph node, context, etc). |
@yongtang Ok, I will do that! Thanks again for your help. |
I can write helper function for mp3 detect. Basically we need to check several consecutive frames to prove this is really mp3 if id3v2 is absent + support damaged files like https://github.com/lieff/minimp3/blob/master/vectors/l3-sin1k0db.bit . |
@lieff Thx for offering! For my use case where we already have the whole data in memory this would be awesome. I'm a bit unsure yet how we would integrate it for the file-based reader. We would probably need to read a considerable bigger piece than just the file header for testing. But might be worth it as long it's in kb range. |
Thanks @lieff for offering help! @jjedele File based access is possible, as TensorFlow's Minimp3 already have a callback API (again thanks @lieff for the great work 👍 ) so it is quite easy to wire up the two callbacks. The following is how callback is wrapped to use mp3 for processing IOTensor and Dataset: io/tensorflow_io/core/kernels/audio_mp3_kernels.cc Lines 25 to 51 in 7a19d34
In fact even in case the whole file is in memory, the callback could be applied as well by just translating callback to memory direct access. |
Yes, ~16kb worst case is needed for 10 consecutive frames. I'll create detect functions and note here when they're ready. |
Here new detect functions lieff/minimp3@0a2ff3b . |
Refactor to work towards general DecodeAudio operator.
Thx for doing this so quickly @lieff ! |
New ToDo list:
|
Thanks @lieff for the help! @jjedele you can update the workspace file in Lines 724 to 732 in 7a19d34
The |
Implement general DecodeAudio operator.
Add lieff's detect_mp3 function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yongtang Implemented the general audio.decode
operator now and wired it together with the MP3 decoding. The code is in a state which seems OK to me, but I'm not an experienced C++ programmer, so happy about any feedback. I would suggest that we finish up this PR and then implement decoding for the other formats in follow up PRs.
// DecodeAudioBaseOp | ||
DecodeAudioBaseOp::DecodeAudioBaseOp(OpKernelConstruction *context) : OpKernel(context) {} | ||
|
||
void DecodeAudioBaseOp::Compute(OpKernelContext *context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit unhappy with the fact that these methods are in tensorflow::data
while everything else is in this nested nameless namespace. I'm not experienced enough with C++ to know what this is about, so happy about feedback/ideas.
|
||
// DecodedAudio | ||
size_t DecodedAudio::data_size() { | ||
return channels * samples_perchannel * sizeof(int16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think google's code style prefer method name with CamelCase, so DataSize instead?
Also, not all audio files stay with int16
so this one has to at least take into consideration the data types. But that is a larger discussion.
@@ -30,23 +108,27 @@ class AudioReadableResource : public AudioReadableResourceBase { | |||
mutex_lock l(mu_); | |||
std::unique_ptr<tensorflow::RandomAccessFile> file; | |||
TF_RETURN_IF_ERROR(env_->NewRandomAccessFile(input, &file)); | |||
char header[8]; | |||
char header_buf[8]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think header
is fine here?
public: | ||
DecodeAudioOp(OpKernelConstruction *context) : DecodeAudioBaseOp(context) {} | ||
|
||
std::unique_ptr<DecodedAudio> decode(StringPiece &data, void *config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Google's style is to return Status
in return field of the function, and for any values/pointers that needs to be returned as well, they will be placed at the end of the function with *
, so something like:
Status Decode(StringPiece &data, DecodeAudio** audio);
Then you could use:
DecodeAudio* audio;
Status status = Decode(data, &audio);
std::unique_ptr<DecodedAudio> d;
d.reset(audio);
You might also pass unique_ptr as well I think:
Status Decode(StringPiece &data, std::unique_ptr<DecodeAudio>* audio);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing a pointer to a unique pointer sounds pretty hacky, do you think that's a good idea? Probably I'd rather go with the first option. Thx for pointing me to the coding style!
DecodeAudioOp(OpKernelConstruction *context) : DecodeAudioBaseOp(context) {} | ||
|
||
std::unique_ptr<DecodedAudio> decode(StringPiece &data, void *config) { | ||
auto error = std::unique_ptr<DecodedAudio>(new DecodedAudio(false, 0, 0, 0, nullptr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this does no capture the error situation.
@@ -45,5 +54,40 @@ Status MP4ReadableResourceInit( | |||
Env* env, const string& input, | |||
std::unique_ptr<AudioReadableResourceBase>& resource); | |||
|
|||
// Container for decoded audio. | |||
class DecodedAudio { | |||
public: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we need a class
here, as it is just a struct
with one function that does a
channels * samples_perchannel * sizeof(int16);
maybe we don't need this class after all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you're right. I think a struct is enough.
const int sampling_rate; | ||
// should first contain all samples of the left channel | ||
// followed by the right channel | ||
const int16 *data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could avoid allocate the memory here, as we can create the output Tensor
which will hold the memory. Then the output Tensor can be used directly to get the data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this for a while, but I'm not sure how it would work. The problem is I need a shape to allocate the output tensor, which I get by decoding the MP3, for which in turn I need to have memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of a shape, in:
Status Read(const int64 start, const int64 stop, |
a callback-type of lambda is passed which allows the allocation to be done when shape is ready. This woucl be helpful when we only want to call once to read
.
public: | ||
DecodeAudioOp(OpKernelConstruction *context) : DecodeAudioBaseOp(context) {} | ||
|
||
std::unique_ptr<DecodedAudio> decode(StringPiece &data, void *config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I don't see the decode
here? as it only does a classify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jjedele If you take a look at the exiting
With the buffer's
You pretty much have the process of memory backed mp3 decoder in place. After that, you can check the intrinsic shape (
Once you have the |
@yongtang Thx for your feedback! The approach you mention in your last comment I actually haven't considered. I did just assume that I'm not sure how much I like the approach though. If we start doing this it would mean we must from now on always ensure that it works without a real file in the background. And since the actual decoding logic is super easy to use in lieff's library already, I don't think we would actually save that much code by doing this. |
@jjedele The mp3 (and to an extent mp4a) is not very challenging to decode, thanks to @lieff great libraries. Though we have different use cases for audio:
We would like to come up to some way to not duplicate code in many places. Opened issue #839 for further discussion. |
@jjedele We prefer multiple smaller PRs than one big PR. For this PR I think we can stay with the focus of Also, other than mp3 there are also several other types (wav, Flac, ogg, mp4) that has been more or less ready to be exposed as |
@jjedele Also, given the ongoing discussion in #839, maybe we can temporarily place the API in We plan on having the next tensorflow-io release for TF 2.2. Since TF 2.2 is likely to be released after 4+ weeks (RC is not out yet), we pretty much have 1+ months to get everything in place and move from |
@yongtang Sounds good. I will move it to experimental. |
@yontang I was looking at your code a bit, and right now it seems to me like we would introduce much duplication for this whole shaping code if we create separate operators for the different decode operations without winning anything at this point. Maybe we should just go with the generic one to start with and then differentiate further if we actually find/need codec specific parameters. |
@jjedele I think what we could do is:
|
@yongtang That sounds like a good idea. But maybe that should be another task since I've just seen that in the meantime we already have On that note: At this point, it seems to me like What do you think? |
@jjedele Sure, let me take a look. |
No description provided.