Provide mimetypes.sniff API as stdlib #85018

corona10 · 2020-06-02T06:42:45Z

BPO	40841
Nosy	@gvanrossum, @taleinat, @tiran, @berkerpeksag, @JimJJewett, @serhiy-storchaka, @YoSTEALTH, @corona10, @tirkarthi
PRs	bpo-40841: Add mimetypes.mimesniff #20720

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/corona10'
closed_at = <Date 2020-10-23.18:10:33.246>
created_at = <Date 2020-06-02.06:42:45.204>
labels = ['type-feature', 'library', '3.10']
title = 'Provide mimetypes.sniff API as stdlib'
updated_at = <Date 2020-10-24.05:29:15.738>
user = 'https://github.com/corona10'

bugs.python.org fields:

activity = <Date 2020-10-24.05:29:15.738>
actor = 'corona10'
assignee = 'corona10'
closed = True
closed_date = <Date 2020-10-23.18:10:33.246>
closer = 'corona10'
components = ['Library (Lib)']
creation = <Date 2020-06-02.06:42:45.204>
creator = 'corona10'
dependencies = []
files = []
hgrepos = []
issue_num = 40841
keywords = ['patch']
message_count = 16.0
messages = ['370591', '370602', '374150', '374387', '374438', '374439', '374467', '374471', '374509', '374511', '374593', '374615', '379460', '379461', '379470', '379520']
nosy_count = 9.0
nosy_names = ['gvanrossum', 'taleinat', 'christian.heimes', 'berker.peksag', 'Jim.Jewett', 'serhiy.storchaka', 'YoSTEALTH', 'corona10', 'xtreak']
pr_nums = ['20720']
priority = 'normal'
resolution = 'rejected'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue40841'
versions = ['Python 3.10']

corona10 · 2020-06-02T06:42:45Z

The current mimetypes.guess_type API guesses file types based on file extensions.

However, there is a more accurate method called sniffing.

Some languages, like Go (https://golang.org/pkg/net/http/#DetectContentType), provide a mimesniff API, and the method is implemented in a standard way published at https://mimesniff.spec.whatwg.org/

I have a sample implementation of this:
https://github.com/corona10/mimesniff/blob/master/mimesniff/mimesniff.py
But the API interface will be changed to fit the mimetypes API.

So I would like to provide mimetypes.sniff API rather than a new stdlib package like mimesniff.
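
To make the difference concrete, here is a minimal sketch contrasting extension-based guessing with content sniffing. The `sniff` function and its small signature table are hypothetical and only for illustration; they are not the proposed `mimetypes.sniff` implementation, and the WHATWG spec defines many more patterns.

```python
import mimetypes

# Hypothetical, deliberately tiny signature table (assumption for illustration).
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff(header: bytes):
    """Return a MIME type guessed from the leading bytes, or None."""
    for magic, mime in _SIGNATURES:
        if header.startswith(magic):
            return mime
    return None


# The name says JPEG, but the content starts with the PNG signature.
by_name, _encoding = mimetypes.guess_type("photo.jpg")
by_content = sniff(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
print(by_name, by_content)
```

The two results disagree for a misnamed file, which is the case where sniffing helps.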

corona10 · 2020-06-02T10:41:15Z

I am pinging some of the core developers who have recently worked on this module.
Sorry if this topic is not interesting to you :(

I would like to hear opinions on providing this API in the stdlib.
There are three things I'd like to highlight in this proposal.

It will provide detection based on a more precise method.
There is a good standard (WHATWG) that defines which formats will be supported.
I am eager to maintain this module as an active core developer.

gvanrossum · 2020-07-23T22:09:33Z

This looks like a useful addition. I hope someone will take up the review!

corona10 · 2020-07-27T15:31:42Z

> This looks like a useful addition. I hope someone will take up the review!

Thank you, Guido!
I also think that this API would be good to add to the standard library and that it would be very useful!

I hope that someone will take an interest in this issue ;)

jimjjewett · 2020-07-27T23:15:18Z

The standard itself says that it only applies to content served over HTTP; if the content is retrieved by FTP or from a file system, then you should trust that. I don't notice that in the code you pointed to.

So maybe filetype is the right answer if the data isn't coming over the network? For WHATWG demonstration code, it is reasonable to assume that, but in Python, at a minimum, you should document the assumption prominently in the docs and docstring.

gvanrossum · 2020-07-27T23:21:49Z

Whether the data was retrieved over a network has nothing to do with it.

There are complementary ways of guessing what data you are working with -- guess based on the filename extension or sniff based on the contents of the file (or downloaded data).

There are a zillion reasons why the filename could be a lie -- e.g. a user could pick the wrong extension, or rename a file, or a tool could save a file using the wrong extension or no extension at all. Then again sometimes the contents of the file might not be enough, e.g.

foo() // bar

is both valid Python and valid JavaScript. :-)

jimjjewett · 2020-07-28T05:56:06Z

There are a zillion reasons a filename could be wrong -- but the standard
says to trust the filesystem. So if it sniffs based on contents, it isn't
quite following the standard. It is probably still a useful tool, but it
won't be the One Right Way, and it isn't even clear that it should replace
current heuristics.

serhiy-storchaka · 2020-07-28T07:07:30Z

I think that both functions for detecting file type, by name and by content, are useful in different circumstances. We already have similar, more specific detection modules: sndhdr and imghdr.

But I am not sure whether it should be a part of the mimetypes module or separate module. Should it use sndhdr and imghdr modules for audio and image types? Should it be a wrapper to the libmagic library (https://linux.die.net/man/3/libmagic) or reimplement it in Python?

If we add the code for detecting the file type based on algorithms used in browsers, should not we add also the code for detecting the text encoding based on other algorithms used in browsers, or it is too much?

corona10 · 2020-07-28T16:35:09Z

> I think that both functions for detecting file type, by name and by content

I think so too; MIME sniffing would not be a replacement for the method based on the file extension. Both APIs should be provided.

> should not we add also the code for detecting the text encoding based on other algorithms used in browsers

I have already added code for text encoding detection based on the WHATWG standard, so if this API lands, yes, text encoding detection will be supported (e.g. utf-16be).
IMHO, there would be use cases, since today Python is used a lot for text data handling (for example crawling and data pre-processing).

There may be a question of whether a standard for browsers is appropriate for a Python stdlib module.
My answer is that the WHATWG standard could be one of the best standards to follow if we decide to provide MIME sniffing.

The standard handles MIME types that are widely used in the real world, not only by browsers but also by HTTP servers and others.

One of the big burdens of maintaining MIME type detection is deciding how many MIME types should be supported.
Luckily, WHATWG can be a strong standard on which to base that decision.
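
As a rough illustration of the byte order mark check that this kind of text encoding detection relies on, here is a minimal sketch. The helper name `sniff_bom` is hypothetical and not taken from the linked implementation; it only covers the UTF-8 and UTF-16 marks.

```python
import codecs

# BOM prefixes for the three encodings this sketch handles.
# Assumption: UTF-32 is out of scope here; note that a UTF-32-LE BOM also
# begins with the UTF-16-LE mark and would be reported as utf-16le.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def sniff_bom(header: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if header.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xfe\xff\x00h\x00i"))  # UTF-16 big-endian text with a BOM
```

Data without a BOM yields `None`, so a caller would still need a fallback for plain ASCII or undeclared encodings.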

gvanrossum · 2020-07-28T17:04:19Z

When the standard says "trust the filename" it is talking to the
application, not to the sniffing library. The library should provide the
tool for applications to follow the standard, but I don't see a reason why
we would have to enforce how applications call the library. Since we agree
there are use cases beyond what the standard has thought of for combining
sniffing the data and guessing based on the filename, we should make that
possible, the standard's exhortations notwithstanding.

Python is not a browser; a browser could be an application written in
Python. Python therefore should provide tools that are useful to implement
a browser.

YoSTEALTH · 2020-07-29T23:13:21Z

The start and end positions of a signature must be accounted for; not all file signatures start at offset 0 or within the first 512 bytes.

Rather than writing all the signatures manually, it might be a good idea to use an already collected resource like https://www.garykessler.net/library/file_sigs.html
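
A minimal sketch of what offset-aware matching could look like. The table and the function name `sniff_at_offsets` are hypothetical; the two entries use well-known offsets (the tar `ustar` magic at byte 257, and the ISO 9660 `CD001` identifier at byte 32769, well past 512 bytes), and the ISO MIME string is a commonly used but unregistered name.

```python
# (offset, magic bytes, MIME type) entries; assumption: illustrative only.
_OFFSET_SIGNATURES = [
    (257, b"ustar", "application/x-tar"),
    (32769, b"CD001", "application/x-iso9660-image"),
]


def sniff_at_offsets(data: bytes):
    """Return a MIME type if a known signature appears at its fixed offset."""
    for offset, magic, mime in _OFFSET_SIGNATURES:
        # Slicing past the end of data yields a shorter (or empty) bytes
        # object, so short inputs simply fail to match.
        if data[offset:offset + len(magic)] == magic:
            return mime
    return None


fake_tar_header = b"\x00" * 257 + b"ustar" + b"\x00" * 250
print(sniff_at_offsets(fake_tar_header))
```

A sniffer that only inspects a fixed-size prefix starting at offset 0 would miss both of these formats.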

corona10 · 2020-07-30T14:25:30Z

https://www.garykessler.net/library/file_sigs.html looks like a good resource for this kind of API.

However, I would like to choose a well-known standard from WHATWG, W3C, etc.

corona10 · 2020-10-23T18:10:33Z

I am closing this issue as rejected!

During the sprint, I heard a lot of opinions from core devs including Guido, Tal, and Christian.

The overall conclusion for me is not to add this at this time.
If the mimetypes module is extracted from the stdlib into a PyPI package, we can discuss adding this feature at that time!

Thank you everyone for the discussion!

gvanrossum · 2020-10-23T18:15:00Z

Dong-hee, I recommend that you turn this into a 3rd party package on PyPI
yourself. That way your effort and code will live on!

taleinat · 2020-10-23T19:48:43Z

> Dong-hee, I recommend that you turn this into a 3rd party package on PyPI yourself.

+1

corona10 · 2020-10-24T05:29:16Z

@gvanrossum, @taleinat

I've already provided MIME sniffing through PyPI ;)
https://pypi.org/project/mimesniff/

The interface is similar to imghdr.what :)

corona10 added the 3.10 (only security fixes), stdlib (Python modules in the Lib dir), and type-feature (a feature request or enhancement) labels Jun 2, 2020

corona10 assigned corona10 Jun 2, 2020

corona10 changed the title ~~Implement mimetypes.sniff~~ Provide mimetypes.sniff API Jun 2, 2020

corona10 changed the title ~~Provide mimetypes.sniff API~~ Provide mimetypes.sniff API as stdlib Jun 2, 2020

corona10 closed this as completed Oct 23, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022
