Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use MIME types instead of magic strings for RPMTAG_FILECLASS #1099

Closed

Conversation

pmatilai
Copy link
Member

@pmatilai pmatilai commented Mar 5, 2020

File magic strings are unreliable and largely unusable for anything but human consumption, MIME types are far more meaningful for classifying file types. Populate RPMTAG_FILECLASS (or rather, CLASSDICT) with MIME type instead, and add types for all files and not just our strange hardcoded list. Remove now redundant cruft.

Fixes: #1096

Note that this is only an RFC as a basis of discussion. I know we want to include MIME data, but I'm not sure we can just replace the cruft that is currently going into RPMTAG_CLASSDICT. OTOH I'd really rather not duplicate all that to another tag set just to preserve cruft that almost nobody uses (AFAIK), such a patch would have very different line-count.

Oh and this on top of #1098 and #1095 which are pre-requisite for doing this.

Thoughts?

libmagic strings are notoriously unreliable as the details from version
to version. We link to libelf anyway so we might as well as get the
info straight from the horse's mouth.

Besides being more reliable, this detaches the coloring business from
the hardcoded rpmfcTokens struct and informative-only FILECLASS contents,
opening the door for other changes in that area.
In addition to "magic" strings, support classifying files by matches
on their MIME types which are far more predictable than the notoriously
volatile magic strings which are intended for human consumption.

This adds an optional %__foo_mime and %__foo_exclude_mime patterns
to file attributes. If the mime-variant is present, magic is ignored
if present (with a warning).

The testcase is a fine example of how the grass not necessarily being
any greener on the other side: the actual output is far more predictable,
but the actual classification is not. Our "script" with an arbitrary
unknown interpreter is considered text/plain by libmagic.

Fixes: rpm-software-management#1097
This is a stupid and simple test, but then with file strings that's
just the kind of thing you want as we don't want testsuite failures
from file string changes.
File magic strings are unreliable and largely unusable for anything
but human consumption, MIME types are far more meaningful for
classifying file types. Populate RPMTAG_FILECLASS (or rather, CLASSDICT)
with MIME type instead, and add types for all files and not just our
strange hardcoded list. Remove now redundant cruft.

Fixes: rpm-software-management#1096
@pmatilai pmatilai added the RFC label Mar 5, 2020
@Conan-Kudo
Copy link
Member

I'm not sure it's a good idea to replace file magic entirely with MIME, as MIME tends to get identifiers a lot less frequently than file magic identifiers. And there's a large world of dependency generators that rely on file magic filters, and that's something I'm not sure I want to break unless we're doing a new major version.

@pmatilai
Copy link
Member Author

pmatilai commented Mar 5, 2020

This doesn't affect dependency generators at all. This only replaces the gunk that goes into RPMTAG_CLASSDICT in packages.

@Conan-Kudo
Copy link
Member

@pmatilai Hmm, I guess I misread this. It seemed like you wanted to transition everything away from file magic to MIME. If this is just scoped to the internal classification of content in RPM, I don't have as much of an objection. It would wind up being a file format change, though, since we're changing what kind of values to expect there.

@pmatilai
Copy link
Member Author

pmatilai commented Mar 5, 2020

This isn't "internal" classification either, because its very public in the headers. Hence the request for discussion.

The thing is, the existing RPMTAG_CLASSDICT content is intolerable goo. So in my view, we have two options:

  1. stop creating that fileclass data, andd new tags and stuff for MIME instead
  2. hijack it for MIME use

The problem with 1) is that we need to carry a lot of baggage for a little-used feature, the problem with 2) is that there are users (rpmlint at least) and if we just hijack it, you don't really know what data to expect there.

@pmatilai
Copy link
Member Author

While looking at this, I rediscovered (oh the things you write and forget) makeClass() in lib/tagexts.c.
Which makes me think the answer is probably 1), ie leave the CLASS tags alone but stop populating the header with them, and add new MIME tags instead.

@pmatilai
Copy link
Member Author

"Plan 1" now implemented for comparison in #1114.

@pmatilai
Copy link
Member Author

Okay I suppose its become clear this is not the way to go, closing.

@pmatilai pmatilai closed this Apr 21, 2020
@pmatilai pmatilai deleted the mimeclass-pr branch June 21, 2021 11:59
@pmatilai pmatilai restored the mimeclass-pr branch April 7, 2022 11:11
@pmatilai pmatilai deleted the mimeclass-pr branch April 7, 2022 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

RFE: Add MIME classification of all files to packages
2 participants