A space appears at the beginning of the file (Byte order mark) #1922

Closed
v-timofeev opened this issue Oct 26, 2021 · 8 comments · Fixed by #1938
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@v-timofeev

v-timofeev commented Oct 26, 2021

Describe the bug you encountered:

If you use bat on C# source files (.cs, .xaml and others), a space appears at the start of the first line. This is caused by the byte order mark (BOM).
It can probably be reproduced with other files on Windows systems.
https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
Sample file with BOM:
Program.cs.txt

IWpnW3OyGgg

Screenshot_57

What did you expect to happen instead?
If I delete these bytes:
Screenshot_58

bat works correctly :

Screenshot_59

How did you install bat?

GitHub release:
bat-v0.18.3-x86_64-pc-windows-gnu.zip
bat-v0.18.3-x86_64-pc-windows-msvc.zip


bat version and environment

Software version

bat 0.18.3 (b146958)

Operating system

Windows 6.2.9200

Command-line

C:\WINDOWS\system32\bat.exe --diagnostic

Environment variables

SHELL=<not set>
PAGER=<not set>
LESS=<not set>
BAT_PAGER=<not set>
BAT_CACHE_PATH=<not set>
BAT_CONFIG_PATH=<not set>
BAT_OPTS=<not set>
BAT_STYLE=<not set>
BAT_TABS=<not set>
BAT_THEME=<not set>
XDG_CONFIG_HOME=<not set>
XDG_CACHE_HOME=<not set>
COLORTERM=<not set>
NO_COLOR=<not set>
MANPAGER=<not set>

Config file

Could not read contents of 'C:\Users\timoxa\AppData\Roaming\bat\config': The system cannot find the path specified. (os error 3).

Compile time information

  • Profile: release
  • Target triple: x86_64-pc-windows-gnu
  • Family: windows
  • OS: windows
  • Architecture: x86_64
  • Pointer width: 64
  • Endian: little
  • CPU features: fxsr,sse,sse2
  • Host: x86_64-pc-windows-msvc
@v-timofeev v-timofeev added the bug Something isn't working label Oct 26, 2021
@Enselic Enselic added the windows Issue is related to the Windows build of bat label Oct 26, 2021
@sharkdp
Owner

sharkdp commented Oct 26, 2021

Thank you very much for the detailed bug report.

There are a few things to consider here.

  • bat uses a library called content_inspector, which I created some time ago to distinguish "text" files from "binary" files (see https://dev.to/sharkdp/what-is-a-binary-file-2cf5 if you want to know more). content_inspector can also properly detect Byte order marks and correctly classifies your file as "UTF-8-BOM", i.e. UTF-8 encoded text with a Byte order mark.
  • bat uses this classification (UTF-8, UTF-8-BOM, UTF-16LE, UTF-16BE) to choose a proper decoder for the text file. This is why bat (in contrast to cat) can properly show the contents of a UTF-16-encoded file (which is often used on Windows). For UTF-8, we don't do anything, as we assume the terminal to be configured as UTF-8. UTF-32 (BE/LE) is currently not supported, but could easily be.
  • What bat does not do is to strip the BOM from the output. This is (arguably? maybe?) the right thing to do when we are in "plain" mode. And definitely the right thing to do when we are in non-interactive/loop-through mode, i.e. if we are piping the output to a file or to another program. However, when we are showing the contents of a file on an interactive terminal, we should consider stripping the BOM.
  • Since the BOM is not stripped, it will be part of bat's output. What this means is that the terminal (which we assume to be on UTF-8 encoding) has to interpret and display the BOM. The people who designed Unicode are obviously quite smart because they mapped the byte sequence EF BB BF to the Unicode code point U+FEFF, which is a "Zero Width No-Break Space". On my terminal emulator (terminator), this Unicode character does not show up:
    image
    Similarly, calling bat on your file looks good:
    image
    This does not mean that the character is not there. If I select the first line with the mouse and copy the text, the clipboard does actually contain the U+FEFF character, which we can confirm with bat:
    image
    By the way: do not confuse U+FEFF with the byte sequence FE FF, which is actually a UTF-16 (Big Endian) Byte order mark! This is not a coincidence. The Big Endian encoding of U+FEFF is FE FF.
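
To make the byte sequences above concrete, here is a minimal sketch of BOM detection on a raw byte buffer. It is not the code of content_inspector or bat; the names `BomKind`, `detect_bom` and `utf8_bom_bytes` are hypothetical and exist only for this illustration. UTF-32 is left out on purpose, so a UTF-32LE BOM (FF FE 00 00) would be misreported as UTF-16LE by this sketch.

```rust
/// Encodings that this sketch can recognize from a leading byte order mark.
/// (Hypothetical type, not taken from content_inspector.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum BomKind {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the BOM kind and its length in bytes if `bytes` starts with one.
fn detect_bom(bytes: &[u8]) -> Option<(BomKind, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((BomKind::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((BomKind::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((BomKind::Utf16Be, 2))
    } else {
        None
    }
}

/// Encodes U+FEFF as UTF-8 instead of hard-coding the bytes, to show
/// where the sequence EF BB BF comes from.
fn utf8_bom_bytes() -> Vec<u8> {
    let mut buf = [0u8; 4];
    '\u{FEFF}'.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

Calling `detect_bom` on the first bytes of a file such as the attached Program.cs would be expected to report `BomKind::Utf8` with a length of 3, while a BOM-less file yields `None`.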

So I guess for now this is a question of whether or not a particular terminal emulator prints the U+FEFF zero width character with an actual width of zero. We could probably still do better and simply strip BOMs from bat's interactive output.
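
Stripping the BOM from already-decoded text could look roughly like the sketch below. The helper name `strip_bom` is hypothetical and this is not how bat implements it; it only relies on `str::strip_prefix` from the Rust standard library.

```rust
/// Removes a single leading U+FEFF from decoded text.
/// Any U+FEFF that appears later in the text is left untouched.
fn strip_bom(text: &str) -> &str {
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}
```

Only the first character is inspected, so a U+FEFF in the middle of a file (where it acts as a zero width no-break space rather than a BOM) survives.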

To be honest, I have never seen a UTF-8 BOM "in the wild". At least on Linux, every program seems to use the BOM-less version when writing UTF-8. That doesn't make this bug less relevant though, because UTF-16 files should suffer from the same problem.
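
The UTF-16 case can be illustrated with a small standard-library sketch: decoding UTF-16LE bytes that begin with FF FE produces a string whose first character is U+FEFF, so the same stray character would reach the terminal. The function `decode_utf16le` is a hypothetical helper for this example and not the decoder bat uses.

```rust
/// Decodes little-endian UTF-16 bytes into a String.
/// Returns None for an odd byte count or invalid UTF-16.
/// The BOM, if present, is deliberately NOT removed here.
fn decode_utf16le(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```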

@Enselic: IMO, this is not a Windows-specific bug. Files with UTF-8 BOMs might appear on non-Windows systems as well.

Further reading: Unicode standard, https://www.unicode.org/versions/Unicode6.1.0/ch16.pdf page 562

@sharkdp sharkdp removed the windows Issue is related to the Windows build of bat label Oct 26, 2021
@Enselic
Collaborator

Enselic commented Oct 27, 2021

Great analysis! I confess to not having done that deep of an analysis before putting the windows label on 😊 . Turns out it was overhasty, because there is a similar problem on macOS 11.6 with Terminal.app Version 2.11 (440):

Screen Shot 2021-10-27 at 07 54 03

The problem persists even in --plain mode when the BOM is in the first bytes of the output:

Screen Shot 2021-10-27 at 07 55 43

Interestingly, if we bypass the pager, the output is correct:

Screen Shot 2021-10-27 at 07 56 29

Turns out the output is correct even if the BOM is not first in the output, as long as the pager is bypassed:

Screen Shot 2021-10-27 at 07 59 00

This is with the current vanilla less on macOS which is somewhat old:

% less --version
less 487 (POSIX regular expressions)

So what if we try a later version of less? That seems to solve the problem on macOS:

% less --version
less 563 (PCRE regular expressions)

Screen Shot 2021-10-27 at 08 06 38

@v-timofeev What pager and version are you using?

@v-timofeev
Author

@Enselic Sorry for not answering for a long time!

What pager and version are you using?

PAGER=<not set>

I noticed that the behavior depends on the terminal emulator:

For PowerShell 7.1.5:
Screenshot_4

For Cygwin64 Terminal:
Screenshot_5

For VNC (Centos 8)
Screenshot_6
You can see that the first symbol (#) has a broken color scheme:
Screenshot_7

@Enselic
Collaborator

Enselic commented Oct 28, 2021

If no pager is specified, less is used.

  • What is your output for less --version?

  • What is your output in PowerShell with bat --pager=never Program.cs?
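
As a rough illustration of the fallback described above, the sketch below picks a pager from the two environment variables listed in the diagnostic output. The precedence (BAT_PAGER, then PAGER, then less) is an assumption of this example, the function names are hypothetical, and real bat may apply further rules that are not modeled here.

```rust
/// Treats unset and empty values the same way.
fn non_empty(value: Option<&str>) -> Option<&str> {
    value.filter(|s| !s.trim().is_empty())
}

/// Chooses a pager command: BAT_PAGER first, then PAGER, else "less".
/// (Assumed precedence for illustration only.)
fn resolve_pager(bat_pager: Option<&str>, pager: Option<&str>) -> String {
    non_empty(bat_pager)
        .or(non_empty(pager))
        .unwrap_or("less")
        .to_string()
}
```

In a real program the two arguments would come from `std::env::var("BAT_PAGER").ok()` and `std::env::var("PAGER").ok()`; with both unset, as in the report above, the sketch falls back to less.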

@v-timofeev
Author

PowerShell on Windows:
Screenshot_8

PowerShell on CentOS 8 (via SSH):
Screenshot_9

@Enselic
Collaborator

Enselic commented Nov 1, 2021

Does it work if the BOM bytes are the first bytes of the output? Try both bat --plain Program.cs and bat --plain --pager=never Program.cs just to be sure, even though a pager might not even be used in your case.

@v-timofeev
Author

In PowerShell, the behavior is the same:
image
image

In cmder:
image

You can see that the BOM breaks the syntax highlighting on the first line of the file:
with BOM:
image
without BOM:
image

@Enselic
Collaborator

Enselic commented Nov 1, 2021

I suspect the highlighting error is because most syntax regex patterns do not work with a BOM. So even if the terminal displays it properly (i.e. not at all), we still need to strip it whenever we want syntax highlighting to work. Even with --plain, as long as we highlight. Not on loop-through mode though; I agree.

A nice side effect of that is that we also "fix" the cases where the pager and/or terminal in question do not display the BOM properly.
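
The rule proposed here (strip whenever we highlight, but never in loop-through mode) can be sketched as follows. `OutputMode`, `should_strip_bom` and `prepare_first_line` are hypothetical names for this illustration and do not reflect bat's internals or the eventual fix.

```rust
/// How the output is being consumed. (Hypothetical type for this sketch.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutputMode {
    /// Shown on a terminal with highlighting, including --plain.
    Interactive,
    /// Piped to a file or another program; bytes must pass through unchanged.
    LoopThrough,
}

/// The BOM is stripped only when the text is highlighted for display.
fn should_strip_bom(mode: OutputMode) -> bool {
    matches!(mode, OutputMode::Interactive)
}

/// Returns the first line as it should be handed to the highlighter or writer.
fn prepare_first_line(line: &str, mode: OutputMode) -> &str {
    if should_strip_bom(mode) {
        line.strip_prefix('\u{FEFF}').unwrap_or(line)
    } else {
        line
    }
}
```

Keeping the decision in one small function means the loop-through path never alters the input, which matters when the output is redirected to a file that should stay byte-identical.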

I'm setting good-first-issue on this because it shouldn't be very hard to do.

@Enselic Enselic added the good first issue Good for newcomers label Nov 1, 2021
Repository owner deleted a comment from denidenial22 Jan 24, 2022