Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue handling text with escape sequences already embedded #2185

Closed
eliminmax opened this issue May 6, 2022 · 17 comments · Fixed by #2856
Closed

Issue handling text with escape sequences already embedded #2185

eliminmax opened this issue May 6, 2022 · 17 comments · Fixed by #2856
Assignees
Labels
bug Something isn't working

Comments

@eliminmax
Copy link

What steps will reproduce the bug?

  1. Create a file with multiple embedded escape sequences in one line:
# shell one-liner to create a simple demo file for the bug
printf '\033[1;32mgreen,bold \033[39mcolorless,bold \033[35mpurple,bold \033[39mcolorless,bold \033[0mcolorless\n' > bat-bug-test
  1. bat -pp the new file

What happens?

The first escape sequence overrides all following escape sequences before the final '[0m' (clear formatting) escape sequence

What did you expect to happen instead?

The escape sequences would render properly, or not at all

For context, both cat from coreutils version 8.32-4ubuntu2 and busybox cat from busybox 1:1.30.1-6ubuntu3.1 on Pop!_OS 21.04 result in the following:

cat-proper

Meanwhile, bat -pp from bat version 0.20.0 results in the following:

bat-improper

For context, I have a "shell collection" consisting of 17 command-line shells, with different colors for the prompts, depending on the shell, and I have an executable that prints out the different prompts, which can be seen here. I first ran into this issue when running bat -pp on that file.

How did you install bat?

From the GitHub releases .deb file


bat version and environment

Software version

bat 0.20.0 (0655ecf)

Operating system

Linux 5.16.19-76051619-generic

Command-line

bat --diagnostic -pp bat-bug-test 

Environment variables

SHELL=/usr/bin/bash
PAGER=<not set>
LESS=-Q
LANG=en_US.UTF-8
LC_ALL=<not set>
BAT_PAGER=<not set>
BAT_CACHE_PATH=<not set>
BAT_CONFIG_PATH=<not set>
BAT_OPTS=<not set>
BAT_STYLE=<not set>
BAT_TABS=<not set>
BAT_THEME=<not set>
XDG_CONFIG_HOME=/home/eliminmax/.config
XDG_CACHE_HOME=<not set>
COLORTERM=truecolor
NO_COLOR=<not set>
MANPAGER=<not set>

Config file

# This is `bat`s configuration file. Each line either contains a comment or
# a command-line option that you want to pass to `bat` by default. You can
# run `bat --help` to get a list of all possible configuration options.

# Specify desired highlighting theme (e.g. "TwoDark"). Run `bat --list-themes`
# for a list of all available themes
--theme="Visual Studio Dark+"

# Enable this to use italic text on the terminal. This is not supported on all
# terminal emulators (like tmux, by default):
--italic-text=always

# Uncomment the following line to disable automatic paging:
#--paging=never

# Uncomment the following line if you are using less version >= 551 and want to
# enable mouse scrolling support in `bat` when running inside tmux. This might
# disable text selection, unless you press shift.
 # --pager="less --RAW-CONTROL-CHARS --quit-if-one-screen --mouse"

# Syntax mappings: map a certain filename pattern to a language.
#   Example 1: use the C++ syntax for .ino files
#   Example 2: Use ".gitignore"-style highlighting for ".ignore" files
#--map-syntax "*.ino:C++"
#--map-syntax ".ignore:Git Ignore"

# treat all files in a directory called ".bashrc.d" as bash scripts
--map-syntax '**/.bashrc.d/*:Bourne Again Shell (bash)'
# default options
--style plain,header,numbers,changes
--tabs 4
--wrap never

Compile time information

  • Profile: release
  • Target triple: x86_64-unknown-linux-gnu
  • Family: unix
  • OS: linux
  • Architecture: x86_64
  • Pointer width: 64
  • Endian: little
  • CPU features: fxsr,sse,sse2
  • Host: x86_64-unknown-linux-gnu

Less version

> less --version 
less 551 (GNU regular expressions)
Copyright (C) 1984-2019  Mark Nudelman

less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution,
see the file named README in the less distribution.
Home page: http://www.greenwoodsoftware.com/less
@eliminmax eliminmax added the bug Something isn't working label May 6, 2022
@eliminmax
Copy link
Author

Additional context: when replacing the ESC character with ^[, the output of bat -pp bat-bug-test is as follows:

^[[1;32m^[[38;2;220;220;220m^[[32m^[[1;32mgreen,bold ^[[0m^[[39m^[[38;2;220;220;220m^[[1;32mcolorless,bold ^[[0m^[[35m^[[38;2;220;220;220m^[[35m^[[1;32mpurple,bold ^[[0m^[[39m^[[38;2;220;220;220m^[[1;32mcolorless,bold ^[[0m^[[0m^[[38;2;220;220;220mcolorless^[[0m

The coreutils cat results in the following:

^[[1;32mgreen,bold ^[[39mcolorless,bold ^[[35mpurple,bold ^[[39mcolorless,bold ^[[0mcolorless

I understand how odd of an edge case this is, and I'm sure it's difficult to handle existing escape sequences while adding new sequences for the syntax highlighting, so it's no big deal if this is a low-priority or WONTFIX situation.

@keith-hall
Copy link
Collaborator

Thanks for reporting, it's useful to know even about edge cases :) I'm curious how bat prior to #1596 behaves. Maybe you can try v0.18.3 and let us know the outcome? - it could give us an idea where to look.

@eliminmax
Copy link
Author

eliminmax commented May 6, 2022

I just installed v0.18.3 with cargo and tried it out, and got the following:

^[[38;2;220;220;220m^[[1;32mgreen,bold ^[[39mcolorless,bold ^[[35mpurple,bold ^[[39mcolorless,bold ^[[0mcolorless^[[0m

The visual appearance is pixel-for-pixel identical to busybox cat and coreutils cat, but it does have the extra escape sequences at the beginning and end.

@Enselic
Copy link
Collaborator

Enselic commented May 6, 2022

This is a known limitation and is mentioned in the readme: https://github.com/sharkdp/bat#garbled-output

You are advised to turn off syntax highlighting for such input files

@eliminmax
Copy link
Author

Don't know how I missed that. My bad.

@Enselic
Copy link
Collaborator

Enselic commented May 6, 2022

No worries

@eliminmax
Copy link
Author

I just realized that the issue persists even with --wrap never --color never.

The result is as follows:

^[[1;32m^[[32m^[[1;32mgreen,bold ^[[39m^[[1;32mcolorless,bold ^[[35m^[[35m^[[1;32mpurple,bold ^[[39m^[[1;32mcolorless,bold ^[[0mcolorless

It seems like bat is inserting the first escape sequence it encounters after all following escape sequences other that the final (clear formatting) one.

@eliminmax eliminmax reopened this May 9, 2022
@Enselic
Copy link
Collaborator

Enselic commented May 10, 2022

I can confirm this is a regression introduced by 63ad538 (first released in bat v0.19.0). The bug reproduces with that commit. With the commit before, bat has the same output as cat. If you make bat use loop-through mode by piping the bat output through cat, then it also works as it should: bat bat-bug-test | cat.

Thanks a lot for reporting this. My apologies for closing this issue.

Our code for ANSI escape passthrough is very complicated. It was never fully clear to me why we should have special code to handle input with ANSI escape characters, since syntax highlighting does not work with such files anyway. I should do more research before proposing this, but maybe we should simply remove our ANSI escape passthrough code? The benefits would be:

  • We behave more similarly to cat (at least for this case, see details in linked PR below)
  • Performance would improve
  • The code would be significantly simplified

@Enselic
Copy link
Collaborator

Enselic commented May 10, 2022

Prototyped code for the above proposal that pass all regression tests (after changing one of them) and that also fixes the bug this issue is about can be found here: #2189

@eliminmax
Copy link
Author

To be clear, I was the one who closed the issue myself - no need to apologize.

@ImportTaste
Copy link

So with v0.22.1, this is an issue again, which is a shame because this fix/adjustment was useful to me. Any chance of bringing this back as a parameter and doing a v0.22.2 release?

@Enselic
Copy link
Collaborator

Enselic commented Sep 10, 2022

Sorry about that. But my cheat (removing ANSI code completely) didn't fly. So we have to pretend I never did it.

In other words, someone needs to figure out what the problem is with the existing code, and then write as many regression tests as possible, because the ANSI processing code is going cause us quite a bit of trouble I am afraid.

@ImportTaste
Copy link

Sorry about that. But my cheat (removing ANSI code completely) didn't fly. So we have to pretend I never did it.

In other words, someone needs to figure out what the problem is with the existing code, and then write as many regression tests as possible, because the ANSI processing code is going cause us quite a bit of trouble I am afraid.

So a --ansi-preprocessing=never parameter is out of the question?

@Enselic
Copy link
Collaborator

Enselic commented Sep 11, 2022

Aha you mean like that. That sounds reasonable to me, but the code would have to be refactored quite a bit to make that possible, I think. If you want to take a shot at it, I recommend to do it in several small incremental PRs, so that code review and performance regression testing becomes easy to do.

@eth-p
Copy link
Collaborator

eth-p commented Feb 12, 2024

I'm going to try and tackle this issue, since ANSI escape sequences (and the parsing of) is my domain of expertise :)

Our code for ANSI escape passthrough is very complicated. It was never fully clear to me why we should have special code to handle input with ANSI escape characters, since syntax highlighting does not work with such files anyway.

@Enselic, it's meant to support the drop-in-replacement aspect of bat and serves three purposes:

  1. It allows bat to ignore ANSI sequences when calculating the width of a line.
  2. It allows ANSI coloring to persist across wrapped lines, which would otherwise be reset at the end of the line.
  3. The plain text syntax highlighting emits colors at the start of the line, and certain sequences (e.g. \x1B[39m, \x1B[0m) would wipe out the default coloring used by bat, even though the intent of those sequences is to reset the coloring to default.

Removing ANSI processing entirely simplifies the code a bit, but I don't think the performance increase is worth breaking support for that. I did a toy test by stripping out ANSI handling in the printer and ran it against a 100,000 line file, and the result was a difference of 19.3 ms between the two—or about 193 nanoseconds per line.

image

When running it against a non-benchmark example, the difference was negligible enough that variances in clock frequency boosting made bat's master branch benchmark look better.

image

Meanwhile, these are the consequences of stripping support for ANSI parsing:

image
image

@eth-p
Copy link
Collaborator

eth-p commented Feb 12, 2024

Fixed in #2856:

image

It actually should have been fixed in #2544, but I didn't account for SGR sequences that combine text attributes with colors, so it ended up grouping the green color into the bold attribute. I've now fixed it and added a regression test.

@ImportTaste
Copy link

ImportTaste commented Feb 12, 2024

Yep, it's fixed alright.

Really hope this necessitates a new stable version sometime soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants