Move magic patterns to JSON file #39

Closed
matthewbauer opened this issue Nov 5, 2015 · 13 comments

Comments

@matthewbauer
Contributor

I'm thinking something like this:

[
    {
        "ext": "jpg",
        "mime": "image/jpeg",
        "patterns": [
            {
                "start": 0,
                "match": "FFD8FF"
            }
        ]
    },
    {
        "ext": "png",
        "mime": "image/png",
        "patterns": [
            {
                "start": 0,
                "match": "89504E47"
            }
        ]
    },
    {
        "ext": "gif",
        "mime": "image/gif",
        "patterns": [
            {
                "start": 0,
                "match": "474946"
            }
        ]
    }
]

Would that be worthwhile? Just trying to get some input.
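
A minimal sketch of how a file like the one above might be consumed, for discussion only. The `signatures` array mirrors the proposed JSON; `matchesPattern` and `detect` are hypothetical names, and requiring every pattern of an entry to match is an assumption, since the proposal does not say how multiple patterns combine.

```javascript
// Mirrors the proposed JSON; in practice this could be loaded with require().
const signatures = [
  {ext: 'jpg', mime: 'image/jpeg', patterns: [{start: 0, match: 'FFD8FF'}]},
  {ext: 'png', mime: 'image/png', patterns: [{start: 0, match: '89504E47'}]},
  {ext: 'gif', mime: 'image/gif', patterns: [{start: 0, match: '474946'}]}
];

// True when the bytes at `pattern.start` equal the hex string `pattern.match`.
function matchesPattern(buf, pattern) {
  const expected = Buffer.from(pattern.match, 'hex');
  const end = pattern.start + expected.length;
  if (buf.length < end) {
    return false;
  }
  return buf.subarray(pattern.start, end).equals(expected);
}

// Walk the entries in array order and return the first match, or null.
// Assumption: an entry matches only when all of its patterns match.
function detect(buf) {
  for (const entry of signatures) {
    if (entry.patterns.every(pattern => matchesPattern(buf, pattern))) {
      return {ext: entry.ext, mime: entry.mime};
    }
  }
  return null;
}
```

Calling `detect(buffer)` with the first bytes of a file returns an `{ext, mime}` object, or `null` when nothing matches. Iterating in array order keeps any order-dependent checks stable.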

@sindresorhus
Owner

Why?

@matthewbauer
Contributor Author

I guess this would just be reinventing libmagic but it would be helpful to have one large JSON file that can be used/interpreted.

@sindresorhus
Owner

Ok, sure if you do a PR. Make sure to preserve the order as a few of the checks are order-dependent.

@thisconnect

I like! Note that the mime-type module has its definitions in a single JSON file; they maintain a separate repo for the db (https://www.npmjs.com/package/mime-db)

How would you use multiple "patterns"? Would you OR them?
That would work for .mp4

file-type/index.js

Lines 128 to 138 in 14a48a3

if (
    (buf[0] === 0x0 && buf[1] === 0x0 && buf[2] === 0x0 && (buf[3] === 0x18 || buf[3] === 0x20) && buf[4] === 0x66 && buf[5] === 0x74 && buf[6] === 0x79 && buf[7] === 0x70) ||
    (buf[0] === 0x33 && buf[1] === 0x67 && buf[2] === 0x70 && buf[3] === 0x35) ||
    (buf[0] === 0x0 && buf[1] === 0x0 && buf[2] === 0x0 && buf[3] === 0x1C && buf[4] === 0x66 && buf[5] === 0x74 && buf[6] === 0x79 && buf[7] === 0x70 && buf[8] === 0x6D && buf[9] === 0x70 && buf[10] === 0x34 && buf[11] === 0x32 && buf[16] === 0x6D && buf[17] === 0x70 && buf[18] === 0x34 && buf[19] === 0x31 && buf[20] === 0x6D && buf[21] === 0x70 && buf[22] === 0x34 && buf[23] === 0x32 && buf[24] === 0x69 && buf[25] === 0x73 && buf[26] === 0x6F && buf[27] === 0x6D) ||
    (buf[0] === 0x0 && buf[1] === 0x0 && buf[2] === 0x0 && buf[3] === 0x1C && buf[4] === 0x66 && buf[5] === 0x74 && buf[6] === 0x79 && buf[7] === 0x70 && buf[8] === 0x69 && buf[9] === 0x73 && buf[10] === 0x6F && buf[11] === 0x6D)
) {
    return {
        ext: 'mp4',
        mime: 'video/mp4'
    };
}

How would you define something like avi, where you have buf 0-3 AND buf 8-10?

if (buf[0] === 0x52 && buf[1] === 0x49 && buf[2] === 0x46 && buf[3] === 0x46 && buf[8] === 0x41 && buf[9] === 0x56 && buf[10] === 0x49) {

@alexanderlperez
Contributor

The current way signatures are tested has an inherent logic to it (no pun intended), so anything that moves the complexity away from the 'schema' of how the tests are structured seems like goldplating the architecture.

However, @thisconnect @matthewbauer, something that keeps to the simplicity of actual JS would be possible. Maybe a reference JSON and a method to compare the stored signatures would be ideal, something like (ES5-ish pseudocode 😅):

var ref = require('./reference.json');

function testSignature(file, ref_entry) {
  // ... get the ref and any associated offset data and test against the file
}

// test the various mp4 container signatures saved in the JSON reference
if (testSignature(file, ref.mp4_3gp) || testSignature(file, ref.mp4_isom) || testSignature(file, ref.mp4_etc) || ...) {
 return ...
}

// test the avi signature stored as multiple parts
if (testSignature(file, ref.avi[0]) && testSignature(file, ref.avi[1])) {
 return ...
}

At a glance, most of the complexity would be stored in the JSON and how it's formatted. Move past that and it's almost like defining a new configuration language, ie. building an interpreter.

Thoughts?

@thisconnect

@alexanderlperez

defining a new configuration language, ie. building an interpreter.

Yes, that is exactly the point. reference.json should contain the patterns and the logic; otherwise there is no point in doing this at all IMO.

A signature should contain multiple patterns that are OR'ed, and each of these has multiple rules that are AND'ed. The script should iterate through the signatures (making sure to preserve the order) and test whether every rule is true for one of the patterns.
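
The OR-of-ANDs evaluation described here can be sketched roughly as follows. The schema (a `patterns` array of rule lists, each rule a `{start, match}` pair with `match` as a hex string) and the names `ruleMatches` and `detect` are hypothetical. The byte values come from the mp4 and avi checks quoted earlier in the thread; the avi mime string is an assumption, since the thread only shows its byte checks.

```javascript
// Hypothetical schema: `patterns` are OR'ed, the rules inside a pattern are AND'ed.
const signatures = [
  {
    ext: 'mp4',
    mime: 'video/mp4',
    patterns: [
      [{start: 0, match: '0000001866747970'}], // 00 00 00 18 'ftyp'
      [{start: 0, match: '0000002066747970'}], // 00 00 00 20 'ftyp'
      [{start: 0, match: '33677035'}]          // '3gp5'
    ]
  },
  {
    ext: 'avi',
    mime: 'video/x-msvideo', // assumption: mime value not shown in the thread
    patterns: [
      // 'RIFF' at bytes 0-3 AND 'AVI' at bytes 8-10
      [{start: 0, match: '52494646'}, {start: 8, match: '415649'}]
    ]
  }
];

// True when the bytes at `rule.start` equal the hex string `rule.match`.
function ruleMatches(buf, rule) {
  const expected = Buffer.from(rule.match, 'hex');
  const end = rule.start + expected.length;
  return buf.length >= end && buf.subarray(rule.start, end).equals(expected);
}

// Signatures are tried in array order; a signature matches when at least one
// of its patterns has every rule satisfied.
function detect(buf) {
  for (const sig of signatures) {
    const hit = sig.patterns.some(rules => rules.every(rule => ruleMatches(buf, rule)));
    if (hit) {
      return {ext: sig.ext, mime: sig.mime};
    }
  }
  return null;
}
```

With this shape the avi case becomes one pattern with two rules, and the mp4 variants become several single-rule patterns. Gaps such as bytes 4-7 of a RIFF header are simply not covered by any rule.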

@billinghamj

I also wonder whether it could be done in more of a hash-map kind of arrangement, rather than an array of possible matches which needs to be looped over.

Also it's worth noting that the current method of having static code is likely to be quite a bit faster than running against an array, due to code optimizations. You might want to generate code from the array ahead of time to mitigate this.
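
One way the "generate code from the array ahead of time" idea could look, as a rough sketch only. The `{start, match}` input follows the format proposed at the top of the thread; `patternToExpression` and `generateSource` are hypothetical names, and whether the generated code is actually faster than looping over the array has not been measured here.

```javascript
// Hypothetical input in the {start, match} style proposed earlier in the thread.
const signatures = [
  {ext: 'jpg', mime: 'image/jpeg', patterns: [{start: 0, match: 'FFD8FF'}]},
  {ext: 'png', mime: 'image/png', patterns: [{start: 0, match: '89504E47'}]},
  {ext: 'gif', mime: 'image/gif', patterns: [{start: 0, match: '474946'}]}
];

// Turn one pattern into an expression such as
// "buf[0] === 0xFF && buf[1] === 0xD8 && buf[2] === 0xFF".
function patternToExpression(pattern) {
  const bytes = pattern.match.match(/../g);
  return bytes
    .map((hex, i) => `buf[${pattern.start + i}] === 0x${hex}`)
    .join(' && ');
}

// Emit the body of a detection function: one `if` per signature, in array order.
// Assumption: all patterns of a signature must match, so they are joined with &&.
function generateSource(sigs) {
  const checks = sigs.map(sig => {
    const condition = sig.patterns.map(patternToExpression).join(' && ');
    const result = JSON.stringify({ext: sig.ext, mime: sig.mime});
    return `if (${condition}) { return ${result}; }`;
  });
  return `${checks.join('\n')}\nreturn null;`;
}

const source = generateSource(signatures);

// For this demo the source is compiled in-process; a build step could write it
// to a file instead so that the published module contains only static checks.
const detect = new Function('buf', source);
```

The generated body has the same shape as the hand-written `buf[n] === 0x..` checks, so the JSON would serve as the editable source and the static code as the shipped artifact.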

@alexanderlperez
Contributor

Totally agreed.

It's easier to argue "breaking out the file data into a JSON file" is better as a separate project.

@Prinzhorn

Prinzhorn commented May 10, 2016

The current way signatures are tested has an inherent logic to it (no pun intended), so anything that moves the complexity away from the 'schema' of how the tests are structured seems like goldplating the architecture.

I agree 100% with what @alexanderlperez said. I don't see any reason to separate the checks. The only reason would be to brag about how everything is nicely separated and so clean (aka "goldplating"). But separation creates complexity.

Or in other words: I came to this repo, opened the index.js and immediately understood what's going on. Its simplicity is beautiful. It's self-documenting. It's trivial to port to other languages. Please don't change it just for the sake of change. drops mic

@tmorehouse

I could see this being more of a tree structure, starting at byte 0 and matching on that (maybe using a hex string and/or ** as a match for any byte). Longer key strings could be searched first in the JSON hash, and if one fails to match, the next longest string is tried. If the key string matches and has children, then recursively go down to the next node until a leaf is found.

Wouldn't be very human readable, but would be relatively fast.
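
A rough sketch of such a tree lookup, under several assumptions: keys are hex strings for consecutive bytes, `**` stands for any single byte, a node with a `next` property has children, and any other node is a leaf. The names `tree`, `keyMatches` and `lookup` are hypothetical, and the avi mime string is an assumption because the thread only shows its byte checks.

```javascript
// Hypothetical tree; a child key continues at the byte where the parent key ended.
const tree = {
  'FFD8FF': {ext: 'jpg', mime: 'image/jpeg'},
  '89504E47': {ext: 'png', mime: 'image/png'},
  '474946': {ext: 'gif', mime: 'image/gif'},
  '000000': {
    next: {
      '1866747970': {ext: 'mp4', mime: 'video/mp4'}, // 18 'ftyp'
      '2066747970': {ext: 'mp4', mime: 'video/mp4'}  // 20 'ftyp'
    }
  },
  '52494646': { // 'RIFF'
    next: {
      // four arbitrary bytes, then 'AVI'; mime value is an assumption
      '********415649': {ext: 'avi', mime: 'video/x-msvideo'}
    }
  }
};

// True when `key` (hex pairs, "**" = wildcard) matches `buf` starting at `offset`.
function keyMatches(buf, offset, key) {
  const length = key.length / 2;
  if (buf.length < offset + length) {
    return false;
  }
  for (let i = 0; i < length; i++) {
    const hex = key.slice(i * 2, i * 2 + 2);
    if (hex !== '**' && buf[offset + i] !== parseInt(hex, 16)) {
      return false;
    }
  }
  return true;
}

// Try the longest keys first; on a match, return the leaf or descend into the
// children, falling back to the next key if no child matches.
function lookup(buf, node, offset = 0) {
  const keys = Object.keys(node).sort((a, b) => b.length - a.length);
  for (const key of keys) {
    if (!keyMatches(buf, offset, key)) {
      continue;
    }
    const child = node[key];
    if (!child.next) {
      return {ext: child.ext, mime: child.mime};
    }
    const found = lookup(buf, child.next, offset + key.length / 2);
    if (found) {
      return found;
    }
  }
  return null;
}
```

Calling `lookup(buffer, tree)` returns the first leaf reached or `null`. Sorting keys on every call keeps the sketch short; a real implementation would presumably pre-sort or index the keys once.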

@dimapaloskin

dimapaloskin commented Nov 16, 2016

Hello, guys.
Some time ago I implemented a module that keeps its signatures in JSON files and also supports or/and conditions.

If anyone is still interested: https://github.com/dimapaloskin/detect-file-type

@mifi
Sponsor Contributor

mifi commented Nov 25, 2016

The problem with this simple declarative mechanism is that not all common files can be detected this way. Some have more dynamic detection mechanisms.
For example mkv/webm.
See #67 #69

The best thing would be to implement support for libmagic's magic files, so we can reuse all their declaration files. See #68

@sindresorhus
Owner

Closing as this would not be feasible for many file formats.


9 participants