difflib: mention other "problematic" characters in documentation #87855

Closed
jugmac00 mannequin opened this issue Apr 1, 2021 · 7 comments
Labels
3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes docs Documentation in the Doc dir

Comments

@jugmac00
Mannequin

jugmac00 mannequin commented Apr 1, 2021

BPO 43689
Nosy @tim-one, @terryjreedy, @jugmac00
PRs
  • bpo-43689: improve documentation for Differ #25132
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-04-01.08:09:11.376>
    labels = ['3.8', '3.9', '3.10', 'docs']
    title = 'difflib: mention other "problematic" characters in documentation'
    updated_at = <Date 2021-04-06.04:15:36.741>
    user = 'https://github.com/jugmac00'

    bugs.python.org fields:

    activity = <Date 2021-04-06.04:15:36.741>
    actor = 'tim.peters'
    assignee = 'docs@python'
    closed = False
    closed_date = None
    closer = None
    components = ['Documentation']
    creation = <Date 2021-04-01.08:09:11.376>
    creator = 'jugmac00'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43689
    keywords = []
    message_count = 7.0
    messages = ['389961', '390113', '390115', '390117', '390119', '390272', '390274']
    nosy_count = 4.0
    nosy_names = ['tim.peters', 'terry.reedy', 'docs@python', 'jugmac00']
    pr_nums = ['25132']
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue43689'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @jugmac00
    Mannequin Author

    jugmac00 mannequin commented Apr 1, 2021

    The documentation currently says this about the "?" output lines:

    "These lines can be confusing if the sequences contain tab characters."

    From first-hand experience :-), I can assure you it is also very confusing with other types of whitespace characters, such as spaces and line breaks.

    I'd like to add the other characters to the documentation.

    @jugmac00 jugmac00 mannequin added 3.7 (EOL) end of life 3.10 only security fixes 3.8 (EOL) end of life 3.9 only security fixes labels Apr 1, 2021
    @jugmac00 jugmac00 mannequin assigned docspython Apr 1, 2021
    @jugmac00 jugmac00 mannequin added the docs Documentation in the Doc dir label Apr 1, 2021
    @terryjreedy
    Member

    The quote is in the following section.
    https://docs.python.org/3/library/difflib.html#difflib.Differ
    I do not really understand the previous line: "Lines beginning with ‘?’ attempt to guide the eye to intraline differences, and were not present in either input sequence." Can you give examples where '?' occurs, with tabs and spaces? (Newlines would not be within a line.)

    @terryjreedy terryjreedy removed 3.7 (EOL) end of life labels Apr 3, 2021
    @tim-one
    Member

    tim-one commented Apr 3, 2021

    Lines beginning with "?" are entirely synthetic: they were not present in either input. So that's what that part means.

    I'm not clear on what else could be materially clearer without greatly bloating the text. For example,

    >>> import difflib
    >>> d = difflib.Differ()
    >>> for L in d.compare(["abcefghijkl\n"], ["a cxefghijkl\n"]):
    ...     print(L, end="")
    - abcefghijkl
    ?  ^
    + a cxefghijkl
    ?  ^ +

    The "?" lines guide the eye to the places that differ: "b" was replaced by a blank, and "x" was inserted. The marks on the "?" lines are intended to point out exactly where changes (substitutions, insertions, deletions) occurred.

    If the second input had a tab instead of a blank, the "+" wouldn't _appear_ to be under the "x" at all. It would instead "look like" a long string of blanks was between "a" and "c" in the first input, and the "+" would appear to be under one of them somewhere near the middle of the empty space.
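A minimal sketch of the tab case described above, using only `difflib.Differ` and `str.expandtabs`. The 8-column tab stop is an assumption about how a viewer renders tabs; real terminals and editors vary, and the variable names are illustrative only.

```python
import difflib

# Same inputs as the example above, but with a tab where the blank was.
old = "abcefghijkl\n"
new = "a\tcxefghijkl\n"

lines = list(difflib.Differ().compare([old], [new]))
plus_line, guide_line = lines[2], lines[3]

# Assumption: the viewer expands tabs to 8-column tab stops.
rendered = plus_line.expandtabs(8)

x_column = rendered.index("x")         # column where "x" is drawn
marker_column = guide_line.index("+")  # column where the "+" marker is drawn

print(rendered, end="")
print(guide_line, end="")
print(f"'x' is drawn in column {x_column}, the '+' marker in column {marker_column}")
```

The marker is positioned by character index, so it counts the tab as a single column, while the rendered line gives the tab several columns. Under the stated assumption the two columns no longer agree.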

    Tough luck. Use tab characters (or any other kind of "goofy" whitespace) in input to visual tools, and you deserve whatever you get :-)

    @terryjreedy
    Member

    After 3+ years of GitHub, I did not remember that black-and-white diffs use lines with change-position markers, and in particular that they (often? always?) start with "?". IDLE also uses color to mark positions (for syntax errors). The following would have been clearer to me, and likely to people who have never seen such lines.

    "Location marker lines beginning with ‘?’ use symbols to guide the eye to intraline differences."

    Tim, you seem to still think that tabs are especially problematical.

    Jürgen, without evidence otherwise, I agree with this. Adding other chars to the sentence would dilute the current focus on tabs. Hence my request for examples to justify doing so. Sorry I was not as clear as I could and should have been.

    @jugmac00
    Mannequin Author

    jugmac00 mannequin commented Apr 3, 2021

    First I need to apologize for not providing more info already when I created the issue.

    Initially, I did not even plan to create an issue, and thought the PR with the context of the current documentation would be sufficient information.

    Thanks for taking your time anyway!

    Also, thanks to Tim for explaining the meaning of the question mark in detail. When I read the documentation, I also had to pause a moment to understand the sentence. But I agree with Tim, it is hard to explain it better without getting much more verbose.

    My initial reason to read (and then to update) the documentation was an output of pytest, which left me puzzled.

    E AssertionError: assert 'ROOT: No tox...ith_no_t0/p\n' == 'ROOT: No tox..._with_no_t0/p'
    E Skipping 136 identical leading characters in diff, use -v to show
    E - ith_no_t0/p
    E + ith_no_t0/p
    E ? +

    Here is the screenshot and some discussion:
    https://twitter.com/jugmac00/status/1377317886419738624

    Using a similar snippet as Tim, here is a minimal example:

    for L in d.compare(["abcdefghijkl"], ["abcdefghijkl\n"]):
        print(L)
    - abcdefghijkl
    + abcdefghijkl

    ?             +

    The output is usually pretty obvious, so I never actually noticed the question mark, except when whitespace characters are involved.
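One possible way to make a difference like the trailing newline above easier to see is to diff the `repr()` of each string, so that the newline shows up as a visible `\n` escape. This is only a sketch of a workaround; nobody in this thread proposed it, and the variable names are illustrative.

```python
import difflib

a = "abcdefghijkl"
b = "abcdefghijkl\n"

# Raw comparison: the only difference is an invisible trailing newline,
# so the "+" marker points just past the last visible character.
raw = list(difflib.Differ().compare([a], [b]))

# Comparing the repr() of each string turns the newline into the two
# visible characters backslash and "n", and the markers sit under them.
visible = list(difflib.Differ().compare([repr(a)], [repr(b)]))

for line in visible:
    print(line.rstrip("\n"))
```

In the `repr()` version the guide line carries two `+` markers, one for each character of the escape sequence.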

    I was then told that pytest uses difflib, and I was kindly pointed to the Python documentation.

    As only the tab character was listed, I thought it would be a good idea to add the other whitespace characters as well.

    After Tim's explanation, I see that tabs could be especially confusing, while all whitespace characters are on a normal level of confusing :-), especially at the end of the diff.

    I certainly won't forget what I learned, but maybe my proposal helps one fellow Python user or another.

    @terryjreedy
    Member

    I have an alternate replacement: "These lines can be confusing if the sequences contain tab characters or other characters that result in the indicator symbols in these lines being mislocated."

    Or leave the current sentence as is.

    Explanation with the details omitted from the above:
    In 3.x, strings are unicode. Even if one uses a fixed pitch font for the ascii subset, a majority of characters will be rendered either in a different fixed pitch or with variable pitch. And on a graphics screen that is not simulating a fixed-pitch text terminal (such as Windows console), the so-called double-wide East Asian characters are not really double wide but more like 1.6 times as wide. The details depend on the OS, the font, and perhaps the font size. One can explore this in the font sample box for the Font tab of the IDLE settings dialog. The problems include chars less than 'one space', down to 0 wide. For general unicode, ^ marking does not work. Syntax error marking has the same problem and there is no general solution.

    Tab is an example of a character that is either displayed as a variable space or a fixed double space ('\t') or larger. If we were to make a change, we should mention, as above, that many non-ASCII characters are just as confusing as tabs.

    In your example above, the caret at least points to the right space. It correctly indicates some difference beyond the visible end - a non-visible whitespace difference.

    @tim-one
    Member

    tim-one commented Apr 6, 2021

    Terry, your suggested replacement statement looks like an improvement to me. Perhaps the longer explanation could be placed in a footnote.

    Note that I'm old ;-) I grew up on plain old ASCII, decades & decades ago, and tabs are in fact the only "characters" I've had a problem with in doctests. But then, e.g., I never in my life used goofy things like ASCII "form feed" characters, or NUL bytes, or ... in text either.

    I don't use Unicode either, except to the extent that Python forces me to when I'm sticking printable ASCII characters inside string quotes ;-)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @slateny slateny closed this as completed May 15, 2022