Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change proposal: break the fixed sample limit #248

Closed
6 tasks done
Borewit opened this issue Oct 15, 2019 · 5 comments
Closed
6 tasks done

Change proposal: break the fixed sample limit #248

Borewit opened this issue Oct 15, 2019 · 5 comments

Comments

@Borewit
Copy link
Collaborator

Borewit commented Oct 15, 2019

Limitation

Some file types cannot be determined reliable within the current 4k (or any reasonable single buffer limit). For example, an audio file with 600 kB ID3v2 header, may not be necessary an MP3 file. There are other audio formats which may be prefixed with an ID3v2 header.

Proposed solution

Using a stream based solution, we could read information, starting from the beginning required to determine file type. This is an approach I use music-metadata using strtok3 tokenizer, which supports peeking. It picks the most efficient way to jump to next required offset, depending if the underlying data is random accessible, like a file, blob or buffer. Or just read ahead in case of stream. This access abstraction can re-used from readable-web-to-node-stream.

what happens if a file type can be determined by the first 4 bytes?

The first 4 bytes will be read, no more.

what happens if the file is prefixed with ID3v2 header of 4 MB?

It will keep requesting data at the calculated offsets, in sequential manner, until it can determine the file type. But no more then required.

As this is major change, and will introduce a number of specialized function, e.g.:

// Determine file type from a node.js file
const type1 = fileType.fromFile(file);

// Determine file type from a node.js stream
const type2 = fileType.fromStream(stream);

// Determine file type using a buffer as an input (where the buffer represents the entire file)
const type3 = fileType.fromBuffer(buffer)` 

A dedicated module for browser usage (using is effectively an adapter) offers browser specific functions:

// Determine file type from a browser File or Blob
const type1 = browserFileType.fromBlob(blob);

// Determine file type from a browser ReadableStream
const type2 = browserFileType.fromStream(stream);

The browser part adapter is comparable with how I wrapped music-metadata with music-metadata-browser.

I love to hear your feedback to make an assessment if this proposal is worth implementing.

Looking forward to hear your feedback.

Related issues: #210

Implementation status

Development branch: stream-tokenizer

Progress / ToDo

@sindresorhus
Copy link
Owner

This sounds reasonable. I think it should still be possible to pass a partial file as a buffer for the detections that doesn't need the whole file.

@Borewit
Copy link
Collaborator Author

Borewit commented Nov 19, 2019

I think it should still be possible to pass a partial file as a buffer for the detections that doesn't need the whole file.

Fair enough.

@Borewit
Copy link
Collaborator Author

Borewit commented Nov 22, 2019

@sindresorhus can you please open a 'development' branch, e.g. stream-tokenizer, allowing to divide this effort over multiple PR's against that branch?

@sindresorhus
Copy link
Owner

https://github.com/sindresorhus/file-type/tree/stream-tokenizer

@Borewit
Copy link
Collaborator Author

Borewit commented Dec 3, 2019

@sindresorhus, implementation is ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants