Strip BOM when reading UTF-8 files #21

mathiasbynens · 2014-05-11T19:51:09Z

When using gulp-concat, I noticed that the BOM wasn’t stripped from files. When concatenating files together there would be multiple BOMs in the resulting file, which only leads to increased file size for no good reason. (Of course, people shouldn’t use UTF-8 with BOM in the first place, but that’s a separate issue…)

Let’s do what Grunt does and strip BOM by default. Grunt has a setting to override it but IMHO this is not needed. https://github.com/gruntjs/grunt/blob/057eb1c3a7ffdb26b6e2e1a5eb67b262cdd3a71d/lib/grunt/file.js#L238-L241

Thanks to @phated for the pointer to vinyl-fs on IRC.

mathiasbynens · 2014-05-11T19:52:07Z

The point of this PR is to get the discussion going. I’ll add tests etc. if you agree this is a useful addition.

sindresorhus · 2014-05-11T20:07:25Z

👍

Could also use https://github.com/sindresorhus/strip-bom (it also supports buffers), but no worries if not.

mathiasbynens · 2014-05-11T20:18:45Z

@sindresorhus Should’ve known you had a package for that! Updated the PR.

I’ve now added a test, too.

Note to self: easy way to create a “UTF-8 with BOM” file for testing:

$ echo 'This file is saved as UTF-8 with BOM.' > test/fixtures/bom.txt
$ vim -e -s -c ':set bomb' -c ':wq' test/fixtures/bom.txt

sindresorhus · 2014-05-11T21:01:52Z

@mathiasbynens or just $ uconv --add-signature test.md >! test.md

mathiasbynens · 2014-05-12T05:02:04Z

We should probably only strip any BOM if the input if it’s in UTF-8 – else it becomes impossible for people to use any encodings that rely on the BOM (not that they should…). Or, we could add that preserveBOM setting after all.

sindresorhus · 2014-05-12T08:01:59Z

It definitely should only strip BOM from utf8. As far as I know Node only supports reading in files as buffer, utf8 or ascii, not utf16, so not sure how it would ever be a problem?

No, to preserveBOM though. It should just strip BOM from utf8 as it's moot and preserve others.

mathiasbynens · 2014-05-12T08:13:49Z

As far as I know Node only supports reading in files as buffer, utf8 or ascii, not utf16, so not sure how it would ever be a problem?

If a Big-Endian UTF-16-encoded file starts with U+EFBB followed by U+BF00 (or any other code point in the range from U+BF00 to U+BFFF) then the first three bytes in the buffer would be 0xEF 0xBB 0xBF and strip-bom would remove them, not only causing causing data loss but effectively mangling the rest of the buffer.

Since the code point U+EFBB has not been assigned a symbol this specific example is definitely an edge case. But my point is that I shouldn’t just assume that 0xEF 0xBB 0xBF can be stripped in every possible encoding, as it’s not necessarily a BOM.

sindresorhus · 2014-05-12T08:21:48Z

Is there anything strip-bom can do to protect against that?

mathiasbynens · 2014-05-12T08:23:35Z

strip-bom is intended to be used for UTF-8 encoded files (right?) so it’s fine. If we want to make strip-bom work for any file, we’d need to detect if if it’s a UTF-8-encoded file somehow, and only strip the BOM in that case.

sindresorhus · 2014-05-12T08:23:57Z

Hmm, all the peoples on the internet says to do it like that though...

sindresorhus · 2014-05-12T08:24:40Z

Well, yes, but would be nice if consumers didn't have to make sure it's utf8 themselves.

mathiasbynens · 2014-05-12T08:26:31Z

I just meant that you’ll probably want to create a separate is-utf8 package and then make strip-bom depend on it. It seems like we agree here. 👍

Discussion: gulpjs/vinyl-fs#21

mathiasbynens · 2014-05-12T09:40:32Z

Updated this PR to use the new version of strip-bom, which only affects UTF-8 strings and buffers. 👍

For the test files, I took a valid UTF-16-BE/LE files and prepended 0xEF 0xBB 0xBF (i.e. a UTF-8-encoded BOM) to them using a hex editor. This is what they look like:

$ hexdump -C test/fixtures/bom-utf16be.txt
00000000  ef bb bf 00 00 54 00 68  00 69 00 73 00 20 00 66  |.....T.h.i.s. .f|
00000010  00 69 00 6c 00 65 00 20  00 69 00 73 00 20 00 73  |.i.l.e. .i.s. .s|
00000020  00 61 00 76 00 65 00 64  00 20 00 61 00 73 00 20  |.a.v.e.d. .a.s. |
00000030  00 55 00 54 00 46 00 2d  00 31 00 36 00 2d 00 42  |.U.T.F.-.1.6.-.B|
00000040  00 45 00 2e 00 20 00 49  00 74 00 20 00 63 00 6f  |.E... .I.t. .c.o|
00000050  00 6e 00 74 00 61 00 69  00 6e 00 73 00 20 00 73  |.n.t.a.i.n.s. .s|
00000060  00 6f 00 6d 00 65 00 20  00 67 00 61 00 72 00 62  |.o.m.e. .g.a.r.b|
00000070  00 61 00 67 00 65 00 20  00 61 00 74 00 20 00 74  |.a.g.e. .a.t. .t|
00000080  00 68 00 65 00 20 00 73  00 74 00 61 00 72 00 74  |.h.e. .s.t.a.r.t|
00000090  00 20 00 74 00 68 00 61  00 74 00 20 00 6c 00 6f  |. .t.h.a.t. .l.o|
000000a0  00 6f 00 6b 00 73 00 20  00 6c 00 69 00 6b 00 65  |.o.k.s. .l.i.k.e|
000000b0  00 20 00 61 00 20 00 55  00 54 00 46 00 2d 00 38  |. .a. .U.T.F.-.8|
000000c0  00 2d 00 65 00 6e 00 63  00 6f 00 64 00 65 00 64  |.-.e.n.c.o.d.e.d|
000000d0  00 20 00 42 00 4f 00 4d  00 20 00 28 00 62 00 75  |. .B.O.M. .(.b.u|
000000e0  00 74 00 20 00 69 00 73  00 6e 20 19 00 74 00 29  |.t. .i.s.n ..t.)|
000000f0  00 2e 00 0a                                       |....|
000000f4

$ hexdump -C test/fixtures/bom-utf16le.txt
00000000  ef bb bf 00 54 00 68 00  69 00 73 00 20 00 66 00  |....T.h.i.s. .f.|
00000010  69 00 6c 00 65 00 20 00  69 00 73 00 20 00 73 00  |i.l.e. .i.s. .s.|
00000020  61 00 76 00 65 00 64 00  20 00 61 00 73 00 20 00  |a.v.e.d. .a.s. .|
00000030  55 00 54 00 46 00 2d 00  31 00 36 00 2d 00 4c 00  |U.T.F.-.1.6.-.L.|
00000040  45 00 2e 00 20 00 49 00  74 00 20 00 63 00 6f 00  |E... .I.t. .c.o.|
00000050  6e 00 74 00 61 00 69 00  6e 00 73 00 20 00 73 00  |n.t.a.i.n.s. .s.|
00000060  6f 00 6d 00 65 00 20 00  67 00 61 00 72 00 62 00  |o.m.e. .g.a.r.b.|
00000070  61 00 67 00 65 00 20 00  61 00 74 00 20 00 74 00  |a.g.e. .a.t. .t.|
00000080  68 00 65 00 20 00 73 00  74 00 61 00 72 00 74 00  |h.e. .s.t.a.r.t.|
00000090  20 00 74 00 68 00 61 00  74 00 20 00 6c 00 6f 00  | .t.h.a.t. .l.o.|
000000a0  6f 00 6b 00 73 00 20 00  6c 00 69 00 6b 00 65 00  |o.k.s. .l.i.k.e.|
000000b0  20 00 61 00 20 00 55 00  54 00 46 00 2d 00 38 00  | .a. .U.T.F.-.8.|
000000c0  2d 00 65 00 6e 00 63 00  6f 00 64 00 65 00 64 00  |-.e.n.c.o.d.e.d.|
000000d0  20 00 42 00 4f 00 4d 00  20 00 28 00 62 00 75 00  | .B.O.M. .(.b.u.|
000000e0  74 00 20 00 69 00 73 00  6e 00 19 20 74 00 29 00  |t. .i.s.n.. t.).|
000000f0  2e 00 0a 00                                       |....|
000000f4

yocontra · 2014-05-13T07:05:21Z

This looks great. The only problem is consistency with streaming mode, so this also needs to be applied here https://github.com/wearefractal/vinyl-fs/blob/master/lib/src/streamFile.js#L4 via a streaming BOM remover which may or may not exist already.

sindresorhus · 2014-05-24T17:09:25Z

@mathiasbynens mind updating so we can land this?

strip-bom has stream support: https://github.com/sindresorhus/strip-bom#usage

mathiasbynens · 2014-05-24T19:11:33Z

Done (without a test for streaming mode though).

coveralls · 2014-05-24T19:12:33Z

Coverage increased (+0.04%) when pulling 4c1fc02 on mathiasbynens:patch-1 into abd2fe4 on wearefractal:master.

yocontra · 2014-05-24T22:51:40Z

One small problem with the BOM removal streaming - it's possible that a chunk won't contain the entire BOM, you need to buffer the first 3 bytes then check for the BOM. The first chunk could just be one or two bytes and then it will be marked as "no bom found" and slip through. I can't think of a case where bytes would be emitted one by one but it's possible and I don't think it's safe to make assumptions

yocontra · 2014-05-24T22:55:12Z

Also, we can go with UTF8 support for now but I would like to see support for others afterwards. I can help out with that if I have time.

https://github.com/jedmao/bombom#pre-registered-bom-types

sindresorhus · 2014-05-24T22:58:49Z

@contra i assumed it would never be as low, but as i can't find anywhere mentioning a minimum size i guess i have to do that...

Also, we can go with UTF8 support for now but I would like to see support for others afterwards. I can help out with that if I have time.

There's no reason to strip bom from anything other than UTF8. In UTF16 and others it actually has a use.

yocontra · 2014-05-24T23:00:53Z

@sindresorhus Ah okay, I have no idea what I'm saying then - disregard 🌵

yocontra · 2014-05-24T23:07:49Z

Once the strip-bom module has support for lower watermarks I'll merge this and add the stream test before publishing, then include in gulp 3.7 (releasing later tonight)

gulpjs/vinyl-fs#21

sindresorhus · 2014-05-25T16:09:01Z

@contra done

yocontra · 2014-05-25T18:58:48Z

@sindresorhus Nice job on first-chunk-stream

Strip BOM when reading UTF-8 files

sindresorhus added a commit to sindresorhus/strip-bom that referenced this pull request May 12, 2014

Only strip BOM from Buffer containing UTF8

13d9c91

Discussion: gulpjs/vinyl-fs#21

mathiasbynens changed the title ~~Strip BOM when reading files~~ Strip BOM when reading UTF-8 files May 12, 2014

Strip BOM when reading UTF-8-encoded files

4c1fc02

sindresorhus added a commit to sindresorhus/strip-bom that referenced this pull request May 25, 2014

Ensure the first chunk is at least 3 bytes

8a55585

gulpjs/vinyl-fs#21

yocontra added a commit that referenced this pull request May 25, 2014

Merge pull request #21 from mathiasbynens/patch-1

15c0f9a

Strip BOM when reading UTF-8 files

yocontra merged commit 15c0f9a into gulpjs:master May 25, 2014

amitport mentioned this pull request May 30, 2014

added option to disable strip BOM #22

Closed

yocontra mentioned this pull request Jul 29, 2014

Remove potential Unicode BOM spalger/gulp-jshint#65

Merged

phated mentioned this pull request Aug 12, 2015

UTF-8 BOM stripping Redux #83

Closed

phated pushed a commit that referenced this pull request Dec 5, 2017

Merge pull request #21 from mathiasbynens/patch-1

3b199d4

Strip BOM when reading UTF-8 files

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strip BOM when reading UTF-8 files #21

Strip BOM when reading UTF-8 files #21

mathiasbynens commented May 11, 2014

mathiasbynens commented May 11, 2014

sindresorhus commented May 11, 2014

mathiasbynens commented May 11, 2014

sindresorhus commented May 11, 2014

mathiasbynens commented May 12, 2014

sindresorhus commented May 12, 2014

mathiasbynens commented May 12, 2014

sindresorhus commented May 12, 2014

mathiasbynens commented May 12, 2014

sindresorhus commented May 12, 2014

sindresorhus commented May 12, 2014

mathiasbynens commented May 12, 2014

mathiasbynens commented May 12, 2014

yocontra commented May 13, 2014

sindresorhus commented May 24, 2014

mathiasbynens commented May 24, 2014

coveralls commented May 24, 2014

yocontra commented May 24, 2014

yocontra commented May 24, 2014

sindresorhus commented May 24, 2014

yocontra commented May 24, 2014

yocontra commented May 24, 2014

sindresorhus commented May 25, 2014

yocontra commented May 25, 2014

Strip BOM when reading UTF-8 files #21

Strip BOM when reading UTF-8 files #21

Conversation

mathiasbynens commented May 11, 2014

mathiasbynens commented May 11, 2014

sindresorhus commented May 11, 2014

mathiasbynens commented May 11, 2014

sindresorhus commented May 11, 2014

mathiasbynens commented May 12, 2014

sindresorhus commented May 12, 2014

mathiasbynens commented May 12, 2014

sindresorhus commented May 12, 2014

mathiasbynens commented May 12, 2014

sindresorhus commented May 12, 2014

sindresorhus commented May 12, 2014

mathiasbynens commented May 12, 2014

mathiasbynens commented May 12, 2014

yocontra commented May 13, 2014

sindresorhus commented May 24, 2014

mathiasbynens commented May 24, 2014

coveralls commented May 24, 2014

yocontra commented May 24, 2014

yocontra commented May 24, 2014

sindresorhus commented May 24, 2014

yocontra commented May 24, 2014

yocontra commented May 24, 2014

sindresorhus commented May 25, 2014

yocontra commented May 25, 2014