Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote request file extension enhancements #2947

Merged
merged 6 commits into from
Mar 8, 2023
Merged

Remote request file extension enhancements #2947

merged 6 commits into from
Mar 8, 2023

Conversation

toots
Copy link
Member

@toots toots commented Mar 7, 2023

This PR enhances the remote request file extension resolution:

  • Turns the mime to file extension mapping into a setting
  • Add support for Content-Disposition headers to extract a file extension from it.

Overall, it is expected that the Content-Disposition will cover most cases and that the mime to file extension mappings should become less relevant, similar to the way local files are resolved.

@toots toots marked this pull request as ready for review March 7, 2023 22:52
@toots toots requested a review from smimram March 7, 2023 22:53
@vitoyucepi
Copy link
Collaborator

Are content-disposition and content-type headers the right way to determine the correct extension name?
The server can report that the file is application/octet-stream.

I think that the correct way is to get some bytes from the file and resolve mime like it's done in go.
Also, I think it would be great not to use the file extension to define which decoder should be used.

@toots
Copy link
Member Author

toots commented Mar 8, 2023

Historically, we used to completely ignore the file extension and use libmagic to detect the format. However, unstructured formats such as mp3 have proven hard to auto-detect this way. Since we switched to file extension locally, most of the issues we had with detecting formats are gone and, so I would tend to think that this should be similar if we apply the same logic to remote files.

Remote files are usually stored with a file extension that is the same as the local file it's been created from and that file extension will be returned by the Content-Disposition header. Typically, if you grab a download link from dropbox, you would get:

  • Content-Type: application/json
  • Content-Disposition: attachment; filename="Example.mp4"; filename*=UTF-8''Example.mp4

The current decoding detection logic for remote requests is:

  • For file extension:
    • If Content-Disposition is present and we can get a file extension from it, use it.
    • Otherwise, if Content-Type is present and we can get a file extension from it, use it.

Then we are down to the file format detection logic:

  • If there is a file extension that matches a given decoder, use it.
  • If libmagic is compiled and returns a mime that matches a given decoder, use it.

I think that this is the reasonable way. Again, the false negative with auto-detecting mp3 based on first bytes should not be underestimated!

@toots toots requested a review from vitoyucepi March 8, 2023 14:00
@toots toots merged commit d1206cf into main Mar 8, 2023
@toots toots deleted the content-disposition branch March 8, 2023 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants