Provide mimetypes.sniff API as stdlib #85018

Closed
corona10 opened this issue Jun 2, 2020 · 16 comments

Comments

@corona10
Member

corona10 commented Jun 2, 2020

BPO 40841
Nosy @gvanrossum, @taleinat, @tiran, @berkerpeksag, @JimJJewett, @serhiy-storchaka, @YoSTEALTH, @corona10, @tirkarthi
PRs
  • bpo-40841: Add mimetypes.mimesniff #20720
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/corona10'
    closed_at = <Date 2020-10-23.18:10:33.246>
    created_at = <Date 2020-06-02.06:42:45.204>
    labels = ['type-feature', 'library', '3.10']
    title = 'Provide mimetypes.sniff API as stdlib'
    updated_at = <Date 2020-10-24.05:29:15.738>
    user = 'https://github.com/corona10'

    bugs.python.org fields:

    activity = <Date 2020-10-24.05:29:15.738>
    actor = 'corona10'
    assignee = 'corona10'
    closed = True
    closed_date = <Date 2020-10-23.18:10:33.246>
    closer = 'corona10'
    components = ['Library (Lib)']
    creation = <Date 2020-06-02.06:42:45.204>
    creator = 'corona10'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 40841
    keywords = ['patch']
    message_count = 16.0
    messages = ['370591', '370602', '374150', '374387', '374438', '374439', '374467', '374471', '374509', '374511', '374593', '374615', '379460', '379461', '379470', '379520']
    nosy_count = 9.0
    nosy_names = ['gvanrossum', 'taleinat', 'christian.heimes', 'berker.peksag', 'Jim.Jewett', 'serhiy.storchaka', 'YoSTEALTH', 'corona10', 'xtreak']
    pr_nums = ['20720']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue40841'
    versions = ['Python 3.10']

    @corona10
    Member Author

    corona10 commented Jun 2, 2020

    The current mimetypes.guess_type API guesses file types based on file extensions.
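For reference, the extension-based behavior described above can be seen with the stdlib call itself. The file names below are made up for illustration; no file needs to exist, because only the name is inspected.

```python
import mimetypes

# guess_type() looks only at the name, never at the file contents.
by_name = mimetypes.guess_type("photo.png")
print(by_name)  # ('image/png', None)

# The guess follows the extension even when the name misrepresents the data:
# a PDF saved as "report.png" would still be reported as an image.
misnamed = mimetypes.guess_type("report.png")
print(misnamed)

# With no extension there is nothing to go on.
unknown = mimetypes.guess_type("README")
print(unknown)  # (None, None)
```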

    However, there is a more accurate method, which is called sniffing.

    Some languages, such as Go (https://golang.org/pkg/net/http/#DetectContentType), provide a MIME sniffing API, and the method is implemented according to the standard published at https://mimesniff.spec.whatwg.org/

    I have a sample implementation of this:
    https://github.com/corona10/mimesniff/blob/master/mimesniff/mimesniff.py
    The API would be changed to fit the mimetypes module, though.

    So I would like to provide a mimetypes.sniff API rather than a new stdlib package like mimesniff.
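To make the idea concrete, here is a minimal sketch of content sniffing. It is not the code from the linked repository and not the full WHATWG algorithm: the function name `sniff` and the small signature table are assumptions chosen for illustration, and unknown data simply yields `None`.

```python
from typing import Optional

# A few (signature, MIME type) pairs, each matched at offset 0.
# Hand-picked subset for illustration, not the full table from the standard.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff(header: bytes) -> Optional[str]:
    """Return a MIME type guessed from the leading bytes, or None."""
    for signature, mime_type in _SIGNATURES:
        if header.startswith(signature):
            return mime_type
    return None


print(sniff(b"%PDF-1.7\n"))
print(sniff(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(sniff(b"plain text"))
```

Unlike the extension-based guess, the result here depends only on the bytes passed in, so a renamed file is still recognized.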

    @corona10 corona10 added the 3.10, stdlib, and type-feature labels Jun 2, 2020
    @corona10 corona10 changed the title from "Implement mimetypes.sniff" to "Provide mimetypes.sniff API" Jun 2, 2020
    @corona10 corona10 changed the title from "Provide mimetypes.sniff API" to "Provide mimetypes.sniff API as stdlib" Jun 2, 2020
    @corona10
    Member Author

    corona10 commented Jun 2, 2020

    I am pinging some of the core developers who have recently worked on this module.
    Sorry if this topic is not interesting to you :(

    I would like to hear what you think about providing this as a stdlib API.
    There are three points I'd like to make with this proposal.

    1. It provides a more precise way of detecting file types.
    2. There is a good standard (WHATWG) that defines which formats should be supported.
    3. I am eager to maintain this module as an active core developer.

    @gvanrossum
    Member

    This looks like a useful addition. I hope someone will take up the review!

    @corona10
    Member Author

    This looks like a useful addition. I hope someone will take up the review!

    Thank you, Guido!
    I also think that this API would be a good addition to the standard library and that it would be very useful!

    I hope that someone will take an interest in this issue ;)

    @jimjjewett
    Mannequin

    jimjjewett mannequin commented Jul 27, 2020

    The standard itself says that it only applies to content served over http; if the content is retrieved by ftp or from a file system, then you should trust that. I don't notice that in the code you pointed to.

    So maybe filetype is the right answer if the data isn't coming over the network? For WHATWG demonstration code, it is reasonable to assume that, but in Python, at a minimum, you should document the assumption prominently in the docs and docstring.

    @gvanrossum
    Copy link
    Member

    Whether the data was retrieved over a network has nothing to do with it.

    There are complementary ways of guessing what data you are working with -- guess based on the filename extension or sniff based on the contents of the file (or downloaded data).

    There are a zillion reasons why the filename could be a lie -- e.g. a user could pick the wrong extension, or rename a file, or a tool could save a file using the wrong extension or no extension at all. Then again sometimes the contents of the file might not be enough, e.g.

    foo() // bar
    

    is both valid Python and valid JavaScript. :-)
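As a rough illustration of the two approaches being complementary, the sketch below reports both guesses side by side and leaves the decision to the caller. The names `Guess`, `describe`, and `_sniff` are hypothetical, and the content check only knows two signatures.

```python
import mimetypes
from typing import NamedTuple, Optional


class Guess(NamedTuple):
    by_name: Optional[str]      # from the filename extension
    by_content: Optional[str]   # from the leading bytes

    @property
    def conflict(self) -> bool:
        """True when both guesses exist and disagree."""
        both_known = self.by_name is not None and self.by_content is not None
        return both_known and self.by_name != self.by_content


def _sniff(data: bytes) -> Optional[str]:
    # Deliberately tiny: two signatures are enough for the demonstration.
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return None


def describe(filename: str, data: bytes) -> Guess:
    by_name, _encoding = mimetypes.guess_type(filename)
    return Guess(by_name, _sniff(data))


# The filename lies: the contents are a PDF.
lying_name = describe("report.png", b"%PDF-1.7\n")
print(lying_name, lying_name.conflict)

# The contents are not enough: this snippet has no recognizable signature.
ambiguous = describe("script.py", b"foo() // bar\n")
print(ambiguous, ambiguous.conflict)
```

Which guess wins in a conflict is a policy question for the application, so the sketch does not pick one.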

    @jimjjewett
    Mannequin

    jimjjewett mannequin commented Jul 28, 2020

    There are a zillion reasons a filename could be wrong -- but the standard
    says to trust the filesystem. So if it sniffs based on contents, it isn't
    quite following the standard. It is probably still a useful tool, but it
    won't be the One Right Way, and it isn't even clear that it should replace
    current heuristics.


    @serhiy-storchaka
    Member

    I think that both functions for detecting file type, by name and by content, are useful in different circumstances. We have similar, more specific detection modules: sndhdr and imghdr.

    But I am not sure whether it should be a part of the mimetypes module or a separate module. Should it use the sndhdr and imghdr modules for audio and image types? Should it be a wrapper around the libmagic library (https://linux.die.net/man/3/libmagic) or reimplement it in Python?

    If we add code for detecting the file type based on algorithms used in browsers, shouldn't we also add code for detecting the text encoding based on other algorithms used in browsers, or is that too much?

    @corona10
    Member Author

    I think that both functions for detecting file type, by name and by content

    I think so too. MIME sniffing is not meant to replace the method based on the file extension. Both APIs should be provided.

    should not we add also the code for detecting the text encoding based on other algorithms used in browsers

    I have already added code for text encoding detection based on the WHATWG standard, so if this API lands, yes, text encoding detection will be supported (e.g. utf-16be).
    IMHO, there would be use cases, since Python is used a lot today for text data handling (for example crawling and data pre-processing).
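One simple form of text encoding detection is checking for a byte order mark. The sketch below is an assumption about what such a check could look like, not the code from the linked implementation; `sniff_bom` is a hypothetical name and only three BOMs are considered.

```python
import codecs
from typing import Optional

# Byte order marks checked at the very start of the data.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    # Limitation: UTF-32 is not checked, and its little-endian BOM also
    # starts with FF FE, so such data would be reported as utf-16le.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return None


sample = codecs.BOM_UTF16_BE + "hello".encode("utf-16-be")
print(sniff_bom(sample))
print(sniff_bom(b"no marker here"))
```

Data without a BOM yields `None`; guessing its encoding would need heuristics beyond this sketch.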

    One could ask whether a standard written for browsers is appropriate for a Python stdlib module.
    My answer is that the WHATWG standard is one of the best standards to follow if we decide to provide MIME sniffing.

    The standard handles MIME types that are widely used in the real world, not only by browsers but also by HTTP servers and other software.

    One of the big burdens of maintaining MIME type detection is deciding how many MIME types should be supported.
    Luckily, WHATWG can serve as the authoritative standard for making that decision.

    @gvanrossum
    Member

    When the standard says "trust the filename" it is talking to the
    application, not to the sniffing library. The library should provide the
    tool for applications to follow the standard, but I don't see a reason why
    we would have to enforce how applications call the library. Since we agree
    there are use cases beyond what the standard has thought of for combining
    sniffing the data and guessing based on the filename, we should make that
    possible, the standard's exhortations notwithstanding.

    Python is not a browser; a browser could be an application written in
    Python. Python therefore should provide tools that are useful to implement
    a browser.

    @YoSTEALTH
    Mannequin

    YoSTEALTH mannequin commented Jul 29, 2020

    The start and end positions of each signature must be accounted for; not all file signatures start at offset 0 or fall within the first 512 bytes.

    Rather than writing all the signatures manually, it might be a good idea to use an already collected resource like https://www.garykessler.net/library/file_sigs.html
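A sketch of what accounting for the start position could look like: each table entry carries its own offset. The function names and the MIME type strings are illustrative assumptions; the two non-zero offsets are the well-known positions of the tar "ustar" magic (257) and the ISO 9660 "CD001" identifier (32769).

```python
from typing import Optional

# (offset, signature, MIME type)
_SIGNATURES = [
    (0, b"%PDF-", "application/pdf"),
    (257, b"ustar", "application/x-tar"),
    (32769, b"CD001", "application/x-iso9660-image"),
]


def bytes_needed() -> int:
    """How many leading bytes must be read to test every signature."""
    return max(offset + len(sig) for offset, sig, _ in _SIGNATURES)


def sniff_at_offsets(data: bytes) -> Optional[str]:
    for offset, signature, mime_type in _SIGNATURES:
        # A slice past the end of the data is simply shorter, so no match.
        if data[offset:offset + len(signature)] == signature:
            return mime_type
    return None


fake_tar = bytes(257) + b"ustar\x0000"
print(sniff_at_offsets(fake_tar))
print(sniff_at_offsets(fake_tar[:200]))  # truncated before the signature
print(bytes_needed())
```

A reader that only looks at the first 512 bytes would never see the third signature, which is the point raised above.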

    @corona10
    Member Author

    https://www.garykessler.net/library/file_sigs.html looks like a good resource for this kind of API.

    However, I would prefer to follow a well-known standard from WHATWG, W3C, or a similar body.

    @corona10
    Member Author

    I am closing this issue as rejected!

    During the sprint, I heard a lot of opinions from core devs including Guido, Tal, and Christian.

    My overall conclusion is not to add this for now.
    If the mimetypes module is moved out of the stdlib into a PyPI package, we can discuss adding this feature at that time!

    Thank you everyone for the discussion!

    @gvanrossum
    Member

    Dong-hee, I recommend that you turn this into a 3rd party package on PyPI
    yourself. That way your effort and code will live on!

    @taleinat
    Contributor

    Dong-hee, I recommend that you turn this into a 3rd party package on PyPI yourself.

    +1

    @corona10
    Member Author

    @gvanrossum, @taleinat

    I've already published the MIME sniffing code on PyPI ;)
    https://pypi.org/project/mimesniff/

    The interface is similar to imghdr.what :)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022