Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report unicode #7609

Closed
wants to merge 16 commits into from
Closed

Conversation

doerwalter
Copy link

This pull request implements feature request #7608: "Use unicode box drawing characters for horizontal rules in pytest output".

When the option --unicode is specified, Unicode box drawing characters are used from printing horizontal rules (instead of the ASCII characters -, =, _ and ?).

This closes #7608

Copy link
Member

@bluetech bluetech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I'm in favor of using the box characters. But I am against adding a flag for this. We should just switch to the Unicode without a flag. And if for some reason someone is using a non-Unicode terminal, we can either let TerminalWriter handle it (will print ugly Unicode escapes), or have an automatic ASCII fallback if really necessary.

"--unicode",
action="store_true",
default=False,
help="Use unicode characters for horizontal rules.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In prose Unicode should start with a capital letter:

Suggested change
help="Use unicode characters for horizontal rules.",
help="Use Unicode characters for horizontal rules.",

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I update the help message, or remove the option entirely?

**markup: bool
) -> None:
self.write_sep(
"\u2500" if self.config.option.unicode else "-", title, fullwidth, **markup
Copy link
Member

@bluetech bluetech Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escaping is not needed here, I suggest using the character directly (same for the others):

Suggested change
"\u2500" if self.config.option.unicode else "-", title, fullwidth, **markup
"" if self.config.option.unicode else "-", title, fullwidth, **markup

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But when we're using the character directly in the file, we have to make sure, that the file uses the right encoding (i.e. UTF-8). But it seems that we've already got unicode characters in the pytest sourcecode:

>>> import pathlib
>>> max(max(pys.read_text(encoding="utf-8")) for pys in pathlib.Path(".").glob("*/*.py"))
'😊'

However, I find to be visible nearly indistinguishable from - (at least in non-proportional fonts), so the source code might be misleading. Using the escaped character clearly shows whats going on.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And of course be could always use an \N escape, i.e.:

"\N{BOX DRAWINGS LIGHT HORIZONTAL}"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file encoding is UTF-8 so it's not a problem.

However, I find ─ to be visible nearly indistinguishable from -

I'd like to have the symbol because then I can see what it is. Maybe leave the symbol, but add a clarifying comment with its Unicode name?

**markup: bool
) -> None:
self.write_sep(
"\u2049" if self.config.option.unicode else "!", title, fullwidth, **markup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this character "⁉", I think this one can be kept as is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've updated the pull request to use "!" in both the ASCII and Unicode case.

@The-Compiler
Copy link
Member

I'm in favor of using the box characters. But I am against adding a flag for this. We should just switch to the Unicode without a flag. And if for some reason someone is using a non-Unicode terminal, we can either let TerminalWriter handle it (will print ugly Unicode escapes), or have an automatic ASCII fallback if really necessary.

IMHO we should have a clean/proper ASCII fallback. As shown in #7475, there is still a variety of conditions where it's not possible to print Unicode to the terminal.

@doerwalter
Copy link
Author

I'm in favor of using the box characters. But I am against adding a flag for this. We should just switch to the Unicode without a flag. And if for some reason someone is using a non-Unicode terminal, we can either let TerminalWriter handle it (will print ugly Unicode escapes), or have an automatic ASCII fallback if really necessary.

IMHO we should have a clean/proper ASCII fallback. As shown in #7475, there is still a variety of conditions where it's not possible to print Unicode to the terminal.

The simplest solution for that would be to install a codec error handling callback to translate fancy Unicode characters to simple ASCII characters, i.e. something like:

import codecs

replacements = [
	("\u2500", "-"),
	("\u2550", "="),
	("\u2581", "_"),
]

def simplify(exc):
	if not isinstance(exc, UnicodeEncodeError):
		raise TypeError(f"can't handle {exc}")
	object = exc.object[exc.start:exc.end]
	for (u, a) in replacements:
		object = object.replace(u, a)
	object = object.encode("ascii", "backslashreplace").decode("ascii")
	return (object, exc.end)

codecs.register_error("simplify", simplify)

Then we could use "simpify" as the error handling name when encoding:

>>> print("\u2500\u2550\u2581 foo".encode("ascii", "simplify"))
b'-=_ foo'

@The-Compiler
Copy link
Member

The simplest solution for that would be to install a codec error handling callback to translate fancy Unicode characters to simple ASCII characters, i.e. something like: [...]

It might be simpler for TerminalWriter to check whether the terminal has unicode support first (i.e. try to encode the unicode char with sys.stdout.encoding) and if not, explicitly select the ascii characters instead. The error handler might be a bit harder to understand for people who aren't familiar with them. No hard feelings either way, though.

@doerwalter
Copy link
Author

The simplest solution for that would be to install a codec error handling callback to translate fancy Unicode characters to simple ASCII characters, i.e. something like: [...]

It might be simpler for TerminalWriter to check whether the terminal has unicode support first (i.e. try to encode the unicode char with sys.stdout.encoding) and if not, explicitly select the ascii characters instead.

But which unicode character should that be? It might be one that the target encoding can encode, and then we'd use the target encoding, but later might encounter unencodable characters and fail anyway. Or it might be one that we can't encode and then we'de fall back to ASCII although the encoding might be able to encode some non-ASCII characters.

Or would you decide whether to fallback to ASCII on every single call?

The error handler might be a bit harder to understand for people who aren't familiar with them. No hard feelings either way, though.

But it's totally transparent (except that you have to use the error handler name). And when you have a properly configured unicode terminal (which should be almost always these days), it's 0% performance overhead (the error handler doesn't even have to be looked up).

@bluetech
Copy link
Member

bluetech commented Aug 3, 2020

My view is that pytest should not care about non-Unicode terminals except for not crashing. Though we might want to wait until we drop Python 3.5 since IIUC in this version on Windows the stdout encoding is still broken.

Currently the "not crashing" part is done by TerminalWriter catching any UnicodeEncodeErrors and then encoding with unicode_escape instead. This gives an ugly output, but very visibly so. My preference is to stick with that, and see if anyone legitimately complains.

@The-Compiler
Copy link
Member

But it's totally transparent (except that you have to use the error handler name). And when you have a properly configured unicode terminal (which should be almost always these days), it's 0% performance overhead (the error handler doesn't even have to be looked up).

Okay, that got me convinced!

This gives an ugly output, but very visibly so. My preference is to stick with that, and see if anyone legitimately complains.

I will complain then. I do use pytest with Python 3.5 on Windows (and also use it on GitHub Actions with Python 3.7 where there also is not Unicode output, see #7475). I do consider a \u2550\u2550\u2550\u2550\u2550... etc. outputs (in the wrong length) instead of =====... a quite bad UX regression, and judging from #7475 I won't be the only one.

@bluetech
Copy link
Member

bluetech commented Aug 3, 2020

I will complain then. I do use pytest with Python 3.5 on Windows

Maybe we can start discussing removing support for Python 3.5. According to here, it reaches end-of-life at 2020-09-13 which is a few weeks from now.

and also use it on GitHub Actions with Python 3.7 where there also is not Unicode output, see #7475

Did we confirm that GitHub Actions stdout really doesn't support Unicode? It sounds very unlikely to me.

I do consider a \u2550\u2550\u2550\u2550\u2550... etc. outputs (in the wrong length) instead of =====... a quite bad UX regression, and judging from #7475 I won't be the only one.

The issue there was a crash, which I agree we shouldn't do. The question is if anyone would complain if it's not a crash, just ugly output. If it's just broken output on CI systems then maybe people won't even notice.

I don't think there are many developers left with terminals which only support ASCII, but I may be wrong on that.

@Zac-HD
Copy link
Member

Zac-HD commented Aug 3, 2020

I don't think there are many developers left with terminals which only support ASCII, but I may be wrong on that.

This is very common in schools - I know that it would affect me and many of my students - and is part of the general pattern where UX problems are disproportionately bad for new people who don't know how to handle or avoid them 😕

@doerwalter
Copy link
Author

But it's totally transparent (except that you have to use the error handler name). And when you have a properly configured unicode terminal (which should be almost always these days), it's 0% performance overhead (the error handler doesn't even have to be looked up).

Okay, that got me convinced!

OK, then I'll update the pull request to use the error callback and remove the option.

This gives an ugly output, but very visibly so. My preference is to stick with that, and see if anyone legitimately complains.

I will complain then. I do use pytest with Python 3.5 on Windows (and also use it on GitHub Actions with Python 3.7 where there also is not Unicode output, see #7475). I do consider a \u2550\u2550\u2550\u2550\u2550... etc. outputs (in the wrong length) instead of =====... a quite bad UX regression, and judging from #7475 I won't be the only one.

This will be replaced by the proper ASCII characters by the callback. And when you're using e.g. an ISO-8859-1 terminal you will still get the characters from the Latin-1 range (U+0080 - U+00FF). If you only have an ASCII terminal those characters would be replace by escape sequences instead. This is different from the current behaviour which in case of an encoding error switches to escape sequences for all non-ASCII characters, even if the terminal would be able to encode some of them).

…rs instead.

Reencode failing output via a codec error callback to support terminals that are
not unicode capable.

Update tests so that the handle both unicode and ASCII output.
@doerwalter
Copy link
Author

OK, I've updated the patch to remove the --unicode option and to always output the box drawing characters. (And I had to update a lot of tests).

Note that changing the errors attribute of the output stream doesn't work (but it should, I've filed a Python bug report for that: https://bugs.python.org/issue41465), so as a workaround when encoding the output fails TerminalWriter.write() retries with the error callback. I also had to implement that in Testdir.runpytest_inprocess(). This still gives us support for more that just ASCII characters, as can be seen in the updated test test_terminalwriter_not_unicode (as cp1252 can encode ô).

There's a test failing in windows-py38 that I don't understand:

     File "D:\a\pytest\pytest\.tox\py38-unittestextras\lib\site-packages\typing.py", line 1007, in __new__
      self._abc_registry = extra._abc_registry
  AttributeError: type object 'Callable' has no attribute '_abc_registry'

@The-Compiler
Copy link
Member

The _abc_registry issue is unrelated to your changes, see #7616.

@bluetech
Copy link
Member

bluetech commented Aug 5, 2020

@Zac-HD

This is very common in schools - I know that it would affect me and many of my students - and is part of the general pattern where UX problems are disproportionately bad for new people who don't know how to handle or avoid them

What sort of machines or configuration issues did you run into? Maybe we can mitigate them, or give clear guidance about them.

The click python package for example does this: https://github.com/pallets/click/blob/master/src/click/_unicodefun.py


@doerwalter

The simplest solution for that would be to install a codec error handling callback to translate fancy Unicode characters to simple ASCII characters

I think applying the replacements in TerminalWriter is the wrong layer. If say a plugin writes one of these Unicode characters to the terminal, it is wrong of us to silently replace them with ASCII equivalents. What happens currently is that it is replaced with a Unicode escape which is a clear indicator of what was done.

@nicoddemus
Copy link
Member

Hi everyone, sorry for joining late in the discussion.

I definitely believe we need to improve the pytest reporting, it has grown organically over the years and feels a bit dated sometimes.

However, I'm 👎 on this change for the following reasons:

  1. This has the potential of breaking a lot of configurations out there, or produce worse output than what we have now.
  2. Because we unfortunately can't assume full unicode support in all terminals, the code needed to avoid a crash feels brittle and prone to break in the future.
  3. More importantly, the gains are fairly minimal..

In summary, the change in visuals is pretty minimal compared to the effort, possible breakages for users, and increased complexity. I wouldn't really like to make a release and having a ton of breakages based on the minimal improvement that this provides, I don't think it is worth the trade-off.

I would much rather we keep things for simple now and start to think about delegating rich output/unicode handling to a third party library, like rich, rather than start baking (more) things in pytest itself.

@graingert
Copy link
Member

have you seen pytest-sugar? https://pypi.org/project/pytest-sugar/

@nicoddemus
Copy link
Member

nicoddemus commented Aug 5, 2020

have you seen pytest-sugar?

Sorry @graingert who's that addressed to?

@graingert
Copy link
Member

have you seen pytest-sugar?

Sorry @graingert who's that addressed to?

In general for the thread, a plugin that uses non-ascii drawing characters

@doerwalter
Copy link
Author

Hi everyone, sorry for joining late in the discussion.

I definitely believe we need to improve the pytest reporting, it has grown organically over the years and feels a bit dated sometimes.

However, I'm 👎 on this change for the following reasons:

1. This has the potential of breaking a lot of configurations out there, or produce worse output than what we have now.

This doesn't change any pytest configuration, so I'm not sure why the patch should break them.

Or do you mean "system configuration"? For non-unicode capable systems, the code now escapes unencodable characters and it did before (now it uses a codec error callback, before it was using the unicode-ecape codec).

But after the patch, pytest would print non ASCII characters much more often, so indeed the chance of breaking on non-unicode capable systems is higher than with ASCII only output.

2. Because we unfortunately can't assume full unicode support in all terminals, the code needed to avoid a crash feels brittle and prone to break in the future.

IMHO the changes don't make the code more brittle.

But of course the safest approach would be to set the new codec error handler on stdout and stderr. This can be done with the reconfigure() method on io.TextIOWrapper. Then any output that pytest does on stdout or stderr would go through the error hander. Unfortunately this requires Python 3.7. But the code is simple: First define the error handler like the code in the patch does, i.e.:

import codecs

def simplify(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle {}".format(exc))
    replacements = [
        ("\u2500", "-"),
        ("\u2550", "="),
        ("\u2581", "_"),
    ]
    object = exc.object[exc.start : exc.end]
    for (u, a) in replacements:
        object = object.replace(u, a)
    object = object.encode("ascii", "backslashreplace").decode("ascii")
    return (object, exc.end)


codecs.register_error("simplify", simplify)

And then we can do

import sys
sys.stdout.reconfigure(errors="simplify")
sys.stderr.reconfigure(errors="simplify")

Since this requires Python 3.7 we can't do that yet, but without that we're back to the status quo (or at least the version with the codec error callback).

3. More importantly, the gains are fairly minimal..

True.

In summary, the change in visuals is pretty minimal compared to the effort, possible breakages for users, and increased complexity. I wouldn't really like to make a release and having a ton of breakages based on the minimal improvement that this provides, I don't think it is worth the trade-off.

Another approach would be to keep using the ASCII characters but move them into class attributes of the TerminalReporter. This has two benefits:

  1. It makes it easier to replace the rule characters if the user or a pytest plugin wants to do that.

  2. It makes the propose of those characters clearer and so serves as documentation.

If that approach seems reasonable to you, I can update the pull request to implement that.

I would much rather we keep things for simple now and start to think about delegating rich output/unicode handling to a third party library, like rich, rather than start baking (more) things in pytest itself.

OK, that is a completely different level of effort/benefit and probably requires major refactoring of pytests reporting infrastructure.

@doerwalter
Copy link
Author

doerwalter commented Aug 6, 2020

Also note that rich will not solve the problem with non-unicode capable installations for us, as the following shows:

🐚 » export PYTHONIOENCODING=ascii
🐚 » python
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from rich.console import Console
>>> console = Console()
>>> console.print(":smiley: :vampire: :pile_of_poo: :thumbs_up: :raccoon:")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/walter/pyvenvs/pytest/lib/python3.8/site-packages/rich/console.py", line 877, in print
    extend(
  File "/Users/walter/pyvenvs/pytest/lib/python3.8/site-packages/rich/console.py", line 448, in __exit__
    self._exit_buffer()
  File "/Users/walter/pyvenvs/pytest/lib/python3.8/site-packages/rich/console.py", line 426, in _exit_buffer
    self._check_buffer()
  File "/Users/walter/pyvenvs/pytest/lib/python3.8/site-packages/rich/console.py", line 993, in _check_buffer
    self.file.write(text)
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f603' in position 0: ordinal not in range(128)

Which means that rich might benefit too from our new codec error callback. ;)

@nicoddemus
Copy link
Member

Hey @doerwalter,

Or do you mean "system configuration"?

Sorry, yest that's what I've meant.

For non-unicode capable systems, the code now escapes unencodable characters and it did before (now it uses a codec error callback, before it was using the unicode-ecape codec).

Technically yes, as far as our testing can cover, but in our experience we can't cover all possible cases in our suite.

But after the patch, pytest would print non ASCII characters much more often, so indeed the chance of breaking on non-unicode capable systems is higher than with ASCII only output.

I agree, and for me the chance of this breaking users vs the small improvement in the UX is not worth the trade-off.

Users run pytest in many many environments: local terminals, CI, docker containers, embeded systems, etc, I would be surprised that this won't break something.

OK, that is a completely different level of effort/benefit and probably requires major refactoring of pytests reporting infrastructure.

Oh I agree, but I suspect we would have major gains too, including better graphics plus getting rid of a lot of custom code.

Also note that rich will not solve the problem with non-unicode capable installations for us

I'm sure rich contains bugs, but in that case we would defer/help rich fix them, possibly by improving rich to support limited configurations (for example), instead of having our own custom implementation in pytest.

In summary my main point is that we are baking a custom solution, which always has a maintenance cost, for something that has very minimal gains; I don't think it is worth the potential breakage that this brings.


Having said all that, I definitely appreciate the effort you've shown here in implementing this and accommodating reviewers' concerns. 👍

@doerwalter
Copy link
Author

OK, I've updated to pull request again. I've reverted the changes to the default rule characters, and the use of the codec error callback (and the changes to the tests).

I've moved the characters used for printing rules into class attributes of TerminalReporter (and kept the methods used from printing rules). As stated previously, this has two benefits:

  1. It makes it easier to replace the rule characters if the user or a pytest plugin wants to do that.

  2. It makes the propose of those characters clearer and so serves as documentation.

Has the pull request any chance of getting accepted in this form?

@nicoddemus
Copy link
Member

nicoddemus commented Aug 7, 2020

It makes it easier to replace the rule characters if the user or a pytest plugin wants to do that.
It makes the propose of those characters clearer and so serves as documentation.

All great points, thanks!

Has the pull request any chance of getting accepted in this form?

I'm 👍.

While one could argue that TerminalReporter is not really meant to be changed/extended by others, this PR doesn't change the status quo, because plugins/users that deal with TerminalReporter directly (for example pytest-xdist does this) are already working with an internal plugin anyway, so they are aware of the risks.

Again thanks for the patience on dealing with this PR; I know it can be frustrating when there are many contrary opinions when reviewing a PR.

@The-Compiler
Copy link
Member

@nicoddemus IMHO, there are three different approaches to unicode output:

  1. Let's avoid it where possible, as we don't know what it'd break (your stance, if I'm reading this correctly)
  2. Let's do it, but we'll need to have a proper fallback for cases where it's not available (my opinion)
  3. Everything should be UTF-8, if it breaks it's not our fault (my take on @bluetech's view)

I'm definitely tending more towards 1. than towards 3., especially for a test runner. However, we already do have non-ASCII output (the ± comes to mind, from #1441). This turned out to be a problem on Python 2 (#2111) but from what I remember, hasn't really bitten anyone since it was introduced in 2016.

Then again, ± is also part of latin1, so it might be less problematic than the box drawing characters originally part of this PR.

Additionally, this would probably also break pytester usage, where I think people filter for =* some title and such a lot. Thus, I agree it's probably not worth the pain. But at the same time, I wonder how much our output really can be improved if we want to avoid non-ASCII as much as possible.

@doerwalter
Copy link
Author

@nicoddemus IMHO, there are three different approaches to unicode output:

  1. Let's avoid it where possible, as we don't know what it'd break (your stance, if I'm reading this correctly)

This might not be possible 100%, as there might by Unicode characters in the users test file.

  1. Let's do it, but we'll need to have a proper fallback for cases where it's not available (my opinion)

  2. Everything should be UTF-8, if it breaks it's not our fault (my take on @bluetech's view)

But degrading gracefully would IMHO be better than crashing and burning.

I'm definitely tending more towards 1. than towards 3., especially for a test runner. However, we already do have non-ASCII output (the ± comes to mind, from #1441). This turned out to be a problem on Python 2 (#2111) but from what I remember, hasn't really bitten anyone since it was introduced in 2016.

Then again, ± is also part of latin1, so it might be less problematic than the box drawing characters originally part of this PR.

Additionally, this would probably also break pytester usage, where I think people filter for =* some title and such a lot. Thus, I agree it's probably not worth the pain. But at the same time, I wonder how much our output really can be improved if we want to avoid non-ASCII as much as possible.

We can of course always wait until pytest requires at least Python 3.7 and then use io.TextIOWrapper.reconfigure() to reconfigure sys.stdout and sys.stderr to use a codec error callback which should catch all cases.

But the current version of the pull request doesn't introduceany Unicode characters anyway anymore.

@nicoddemus
Copy link
Member

Hey @The-Compiler and @doerwalter,

Let's avoid it where possible, as we don't know what it'd break (your stance, if I'm reading this correctly)

Mostly yes, but an important part for me is that it was not worth the small improvement vs the chance of breakage. But you bring another point about pytester, which would require further changes.

But the current version of the pull request doesn't introduceany Unicode characters anyway anymore.

Indeed, I think the current version allows plugins to extend it and improves the code further, because now we centralize the drawing of box characters in well named functions/variables. 👍

@bluetech
Copy link
Member

Hi @doerwalter,

The intention of this PR has shifted from using Unicode characters for the hrules in pytest to adding new APIs to TerminalReported which make it easier for plugins to customize them.

When we add new public APIs (which can't be changed without a lot of pain), we need to carefully consider the design to make sure it's the best we can come up with.

Since this PR has a lot of comments which discuss something irrelevant to the current state of the PR, would you mind re-submitting this in a new PR where we can have a more targeted discussion? Thanks!

@doerwalter
Copy link
Author

Done: #7647

@nicoddemus nicoddemus closed this Aug 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use unicode box drawing characters for horizontal rules in pytest output
6 participants