Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose FLAC__stream_encoder_set_do_md5 in the Public API #547

Closed
cnweaver opened this issue Feb 16, 2023 · 9 comments
Closed

Expose FLAC__stream_encoder_set_do_md5 in the Public API #547

cnweaver opened this issue Feb 16, 2023 · 9 comments

Comments

@cnweaver
Copy link

FLAC__stream_encoder_set_do_md5 (and the corresponding FLAC__stream_encoder_get_do_md5) are currently not made available in the public API, apparently on the basis that users generally need the data integrity protection provided by the checksum. However, this is not always the case, such as when FLAC encoded data is stored within some other container format which includes its own integrity checking, and the overhead of computing the MD5 checksums can be considerable. Specifically, this comes up when using libFLAC to compress waveforms recorded by the SPT-3G and planned CMB-S4 scientific experiments (using https://github.com/CMB-S4/spt3g_software), where very large numbers (e.g. millions) of short waveforms (e.g. hundreds to tens of thousands of samples in length) are compressed and stored in a format wherein a CRC is already always applied at a higher level. For this use case, disabling the computation of MD5 checksums during encoding has been observed to produce a 20-30% speed-up in serializing data.

Clearly, most API users should continue to include checksums when encoding; it is entirely reasonable to keep checksums as the default, and documentation associated with FLAC__stream_encoder_set_do_md5 could warn that users turning off checksums must assume the responsibility for data integrity themselves. Exposing this feature can be quite valuable in cases where it is applicable, though, and it seems unlikely to be overly dangerous to other users with appropriate documentation.

@ktmf01
Copy link
Collaborator

ktmf01 commented Feb 16, 2023

Many thanks for reaching out.

The functions you mention are in fact exported by the libraries, however they are not included in the headers. These functions were only ever intended to facilitate testing.

I could honor your request, but I'm not planning a release soon, and even after a release uptake can be pretty slow on many systems. There is however a way to get access to these functions directly with existing libFLAC (even rather old ones) by simply supplying your own prototypes. Just add these to your code.

FLAC_API FLAC__bool FLAC__stream_encoder_set_do_md5(FLAC__StreamEncoder *encoder, FLAC__bool value);
FLAC_API FLAC__bool FLAC__stream_encoder_get_do_md5(const FLAC__StreamEncoder *encoder);

Would this be a acceptable short-term solution for your case?

@cnweaver
Copy link
Author

cnweaver commented Feb 16, 2023

Thanks, I do think that adding the declaration(s) in our own code will work for us in the near-term. Before doing so, though, I had wanted to check that it would be stable and could become officially supported in the future, since it is of course reasonable that you might otherwise remove or change what is supposed to be a private interface. Given that this solution works today, I don't think we would be in any hurry to see it changed in a release, but knowing that it would appear in some future release would be valuable. If it would be helpful, I could prepare a PR to add the public declaration with accompanying documentation.

@ktmf01
Copy link
Collaborator

ktmf01 commented Feb 16, 2023

The interface has existed since 2007, and hasn't changed since, see 00da5ae I see no reason this could change, especially since it mirrors other (public) functions, but I understand your concern. However, I am not sure making these functions public is a good idea.

While I find it fascinating that FLAC is being used for storing all kinds of signals and want to help out here, FLAC is first and foremost an audio codec, tuned to store audio. This project strongly recommends storing an MD5 sum with audio data, and I feel adding this function to a public header (and thus the API documentation) might weaken that message.

Some users like yourself that have already gone the route of providing their own prototypes. I'm not sure whether their implementations would break if I would move the prototypes to the public header, but if it does, that would be another reason to not do this.

What about adding a note to the header and implementation that these functions should be considered part of the API, even while they're not?

@cnweaver
Copy link
Author

I think that if that's the solution with which you're most comfortable, it would cover my concerns, and I would consider my due diligence done. Thanks again.

@ktmf01
Copy link
Collaborator

ktmf01 commented Feb 16, 2023

Good to hear. If there is anything I can help you with otherwise, for example other specific needs for non-audio use of FLAC, I'd be glad to help out.

@ktmf01
Copy link
Collaborator

ktmf01 commented Feb 17, 2023

I did some reading on the project, very impressive. Great to hear that the FLAC project is able to play a small role in the probably very complex project that this is.

I found some mentions of problems with FLAC being limited to 24-bit data. You might have noticed, but this is no longer the case. Since release 1.4.0 last year, FLAC is capable of storing up to 32-bit data.

Also, looking at the code it seems a default compression level is used (I'm not sure which, but it seems like level 1). As FLAC is tuned for audio, it could be worthwhile trying different compression levels or different blocksizes. Usually larger blocksizes result in faster encodes. I recon using compression level 0 and blocksize of 4096 or higher (up from the 1152 that is presumably used now) might speed up encoding by another 20 to 25%.

@cnweaver
Copy link
Author

Thanks for the suggestions; we'll definitely try these things out. (I think SPT-3G is mostly using compression level 5, but I've been largely testing with level 1 for the higher data rates planned in CMB-S4.) 32-bit support is interesting, but I think at present we are working with 24-bit digitizers, so we don't have an immediate need for more when handling raw data.

@ktmf01
Copy link
Collaborator

ktmf01 commented Feb 17, 2023

If you need an even faster encoder at the expense of slightly larges files, I could look into that. This hasn't had much development attention, because FLAC is for plain audio use already extremely fast, and at the fastest levels it is, as you noticed, bottlenecked by the speed of MD5 checksumming. Seeing the datarates you're working with, I understand the need for speed.

Do you perhaps have a sample of the kind of data you're working with, so I can take a look at it? I have a hunch it might benefit from different tuning.

@cnweaver
Copy link
Author

Since the data file is a bit big (~300 MB), I've sent an email with a link to it. We can definitely get by with things as they are (turning off the checksum just seemed like a big, easy win for our use), but we're certainly also interested in better tuning. One point I should mention is that while I'm personally mostly looking at compression, decompression matters a lot too, since data are only compressed once, but get read and decompressed repeatedly during different stages of processing and analysis. Higher compression levels also have their place, since when archiving years of data, disk space costs add up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants