Change proposal: break the fixed sample limit #248

Borewit · 2019-10-15T22:16:40Z

Limitation

Some file types cannot be determined reliably within the current 4 kB sample (or within any reasonable single-buffer limit). For example, an audio file with a 600 kB ID3v2 header is not necessarily an MP3 file: other audio formats may also be prefixed with an ID3v2 header.

Proposed solution

Using a stream-based solution, we could read only the information required to determine the file type, starting from the beginning of the file. This is the approach I use in music-metadata, via the strtok3 tokenizer, which supports peeking. It picks the most efficient way to jump to the next required offset, depending on whether the underlying data is randomly accessible (like a file, blob or buffer); in the case of a stream it simply reads ahead. This access abstraction can be reused from readable-web-to-node-stream.
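To make the peeking idea concrete, here is a minimal sketch of such an access abstraction over an in-memory buffer. The class `BufferTokenizerSketch`, its methods, and the `detectSketch` function are hypothetical names chosen for illustration; this is not the strtok3 API, only an approximation of the concept under the assumption that the whole input is randomly accessible.

```javascript
// Hypothetical sketch of a tokenizer over an in-memory Buffer.
// NOT the strtok3 API: all names and signatures here are assumptions.
class BufferTokenizerSketch {
	constructor(buffer) {
		this.buffer = buffer;
		this.position = 0; // Current read offset
		this.highWaterMark = 0; // Highest byte index (exclusive) accessed so far
	}

	// Look at `length` bytes at the current position without advancing.
	peek(length) {
		const end = this.position + length;
		if (end > this.buffer.length) {
			throw new RangeError('End of data');
		}
		this.highWaterMark = Math.max(this.highWaterMark, end);
		return this.buffer.subarray(this.position, end);
	}

	// Read `length` bytes and advance the position.
	read(length) {
		const chunk = this.peek(length);
		this.position += length;
		return chunk;
	}

	// Skip `length` bytes. A random-access source only moves the offset;
	// a stream-backed variant would have to read and discard instead.
	ignore(length) {
		this.position += length;
	}
}

// Decide on the first 4 bytes only (two well-known audio signatures).
function detectSketch(tokenizer) {
	const magic = tokenizer.peek(4).toString('latin1');
	if (magic === 'fLaC') {
		return {ext: 'flac'};
	}
	if (magic === 'OggS') {
		return {ext: 'ogg'};
	}
	return undefined;
}
```

With this shape, a detection that only needs a 4-byte signature touches exactly 4 bytes, which `highWaterMark` makes observable.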

What happens if a file type can be determined by the first 4 bytes?

The first 4 bytes will be read, no more.

What happens if the file is prefixed with an ID3v2 header of 4 MB?

It will keep requesting data at the calculated offsets, sequentially, until it can determine the file type, but no more than required.
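As an illustration of what a "calculated offset" means for the ID3v2 case, the sketch below derives the tag length from the 10-byte ID3v2 header (the size field is a 4-byte syncsafe integer, and a set footer flag adds another 10 bytes). The function names `id3v2TagLength` and `offsetAfterId3v2` are hypothetical, and the code assumes the data is available as a Node.js `Buffer`; it is not the implementation proposed here, only a sketch of the offset calculation.

```javascript
// Sketch: number of bytes occupied by an ID3v2 tag, derived from its header.
// Header layout: "ID3", 2 version bytes, 1 flags byte, 4 syncsafe size bytes.
function id3v2TagLength(header) {
	if (header.length < 10 || header.toString('latin1', 0, 3) !== 'ID3') {
		return 0; // No ID3v2 tag at this position
	}

	// Syncsafe integer: only the lower 7 bits of each byte carry data.
	const size =
		((header[6] & 0x7F) << 21) |
		((header[7] & 0x7F) << 14) |
		((header[8] & 0x7F) << 7) |
		(header[9] & 0x7F);

	// The footer flag (ID3v2.4) adds a 10-byte footer after the tag body.
	const hasFooter = (header[5] & 0x10) !== 0;
	return 10 + size + (hasFooter ? 10 : 0);
}

// Hypothetical helper: offset of the first byte after all leading ID3v2 tags.
function offsetAfterId3v2(buffer) {
	let offset = 0;
	for (;;) {
		const length = id3v2TagLength(buffer.subarray(offset, offset + 10));
		if (length === 0) {
			return offset;
		}
		offset += length;
	}
}
```

Only the 10 header bytes are needed to know where to look next; the tag body itself, however large, never has to be inspected.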

As this is a major change, it will introduce a number of specialized functions, e.g.:

// Determine file type from a node.js file
const type1 = fileType.fromFile(file);

// Determine file type from a node.js stream
const type2 = fileType.fromStream(stream);

// Determine file type using a buffer as an input (where the buffer represents the entire file)
const type3 = fileType.fromBuffer(buffer);

A dedicated module for browser usage (which is effectively an adapter) offers browser-specific functions:

// Determine file type from a browser File or Blob
const type1 = browserFileType.fromBlob(blob);

// Determine file type from a browser ReadableStream
const type2 = browserFileType.fromStream(stream);

The browser adapter is comparable to how I wrapped music-metadata with music-metadata-browser.

I would love to hear your feedback, to help assess whether this proposal is worth implementing.

Related issues: #210

Implementation status

Development branch: stream-tokenizer

Progress / ToDo

PR Initial integration with strtok3 #261:
- Update API
- Prove backward compatibility: being able to determine the file type from a partial file passed as a buffer
PR Use tokenizer #262:
- Utilize tokenizer (migrate checks to use tokenizer)
- Update TypeScript definitions
PR Separate Node.js dependency #265:
- Separate core and Node.js functions by sub-inclusion
PR Change proposal: break the fixed sample limit #248
- Add /browser sub-module


sindresorhus · 2019-11-19T07:27:25Z

This sounds reasonable. I think it should still be possible to pass a partial file as a buffer for the detections that don't need the whole file.

Borewit · 2019-11-19T19:50:00Z

I think it should still be possible to pass a partial file as a buffer for the detections that don't need the whole file.

Fair enough.

Borewit · 2019-11-22T18:48:51Z

@sindresorhus can you please open a 'development' branch, e.g. stream-tokenizer, so that this effort can be divided over multiple PRs against that branch?

sindresorhus · 2019-11-22T19:07:21Z

https://github.com/sindresorhus/file-type/tree/stream-tokenizer

Borewit · 2019-12-03T07:29:58Z

@sindresorhus, implementation is ready for review.

sindresorhus added the enhancement and help wanted labels on Nov 19, 2019

Borewit mentioned this issue Nov 20, 2019

Unit tests rewritten with mocha and chai #257

Closed

This was referenced Nov 22, 2019

Initial integration with strtok3 #261

Merged

Use tokenizer #262

Merged

This was referenced Dec 2, 2019

Separate Node.js dependency #265

Merged

.aac is detected as .mp4 (audio/mpeg) #259

Closed

This was referenced Dec 3, 2019

Improve the MPEG/ADTS frame / ID3 detection algorithm #210

Closed

Use tokenizer browser #268

Merged

This was referenced Jan 1, 2020

Merge tokenizer API to master branch #284

Merged

Stream tokenizer #285

Closed

Borewit closed this as completed Jan 12, 2020

This was referenced Jan 27, 2020

Removed minimumBytes #319

Merged

PNG with long iTXt chunk not detected as PNG image #320

Closed

Borewit mentioned this issue Aug 20, 2021

Concurrent read operation? Borewit/peek-readable#356

Closed

Borewit mentioned this issue Dec 15, 2021

file-type does not work in browsers #505

Closed
