Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better explain "junk" concept in difflib doc #58540

Closed
patena mannequin opened this issue Mar 16, 2012 · 14 comments
Closed

Better explain "junk" concept in difflib doc #58540

patena mannequin opened this issue Mar 16, 2012 · 14 comments
Labels
docs Documentation in the Doc dir type-bug An unexpected behavior, bug, or error

Comments

@patena
Copy link
Mannequin

patena mannequin commented Mar 16, 2012

BPO 14332
Nosy @tim-one, @akuchling, @terryjreedy, @merwok
Files
  • issue14332.patch: patch for the bug
  • issue14332_2.patch: Patch for 14332 bug with no references
  • 14332.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-03-19.20:45:07.458>
    created_at = <Date 2012-03-16.05:52:05.015>
    labels = ['type-bug', 'docs']
    title = 'Better explain "junk" concept in difflib doc'
    updated_at = <Date 2014-03-19.20:45:07.458>
    user = 'https://bugs.python.org/patena'

    bugs.python.org fields:

    activity = <Date 2014-03-19.20:45:07.458>
    actor = 'akuchling'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2014-03-19.20:45:07.458>
    closer = 'akuchling'
    components = ['Documentation']
    creation = <Date 2012-03-16.05:52:05.015>
    creator = 'patena'
    dependencies = []
    files = ['34451', '34479', '34502']
    hgrepos = []
    issue_num = 14332
    keywords = ['patch']
    message_count = 14.0
    messages = ['155992', '156046', '156062', '156066', '165502', '165506', '165677', '199201', '213789', '213945', '214037', '214060', '214089', '214133']
    nosy_count = 9.0
    nosy_names = ['tim.peters', 'akuchling', 'terry.reedy', 'eric.araujo', 'eli.bendersky', 'docs@python', 'python-dev', 'patena', 'albamagallanes']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue14332'
    versions = ['Python 2.7', 'Python 3.3', 'Python 3.4']

    @patena
    Copy link
    Mannequin Author

    patena mannequin commented Mar 16, 2012

    According to difflib.ndiff help, the optional linejunk argument is "A function that should accept a single string argument, and return true iff the string is junk." Presumably the point is to ignore the junk lines in the comparison. But the function doesn't appear to actually do this - in fact I haven't been able to make the linejunk argument change the output in any way.

    Expected difflib.ndiff behavior with no linejunk argument given:
     >>> test_lines_1 = ['# something\n', 'real data\n']
     >>> test_lines_2 = ['# something else\n', 'real data\n']
     >>> print ''.join(difflib.ndiff(test_lines_1,test_lines_2))
     - # something
     + # something else
     ?            +++++
       real data
    
    Now I'm providing a linejunk function to ignore all lines starting with '#', but the output is still the same:
     >>> print ''.join(difflib.ndiff(test_lines_1, test_lines_2, 
                               linejunk=lambda line: line.startswith('#')))
     - # something
     + # something else
     ?            +++++
       real data
    
    In fact if I make linejunk always return True (or False), nothing changes either:
     >>> print ''.join(difflib.ndiff(test_lines_1, test_lines_2, 
                                     linejunk=lambda line: True))
     - # something
     + # something else
     ?            +++++
       real data

    It certainly looks like an error, although it's possible that I'm just misunderstanding how this should work.

    I'm using Python 2.6.5, on Ubuntu Linux 10.04.

    @patena patena mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 16, 2012
    @merwok
    Copy link
    Member

    merwok commented Mar 16, 2012

    Unfortunately Python 2.6 only gets fixes for security bugs now, not regular bugs. Can you reproduce the problem with 2.7 or 3.2?

    @terryjreedy
    Copy link
    Member

    I reproduced the observed behavior in 3.3.0a.
    However, I am rather sure it is not a bug.
    In any case, linejunk is not ignored. Passing 'lambda x: 1/0' causes ZeroDivisionError, proving that it gets called.

    The body of ndiff(linejunk,charjunk,a,b) is
    return Differ(linejunk, charjunk).compare(a, b)
    Differ only uses the linejunk parameter here
    cruncher = SequenceMatcher(self.linejunk, a, b)

    SequenceMatcher uses the first parameter, isjunk, in the internal .__chain_b method to segregate (not remove) items expected to be common in order to speed up the .find_longest_match method. Read the docstring for that method (and possibly the code) to see how it affects matching. The main intent of the *junk parameters is to speed up matching to find differences, not to mask differences. It does, however, affect output of the .*ratio methods.

    The doc string for ndiff says "The default is None, and is recommended; as of Python 2.3, an adaptive notion of "noise" lines is used that does a good job on its own." That is a good idea.

    That said, I think the doc (and docstrings) should explain the notion of "junk" elements and what 'ignoring' them means. In particular, I think a couple of sentences should be added after "The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk)." The quotes around "junk" indicate that it is being used with a non-standard, module specific meaning. What is it? And what does 'ignore' (used several times later in the doc) mean?

    Tim, I think we may need your help here since 'junk' is your label for your concept and I am not sure I understand well enough to articulate it. (For one thing, given that the "common" heuristic was apparently meant to replace at least the linejunk version version of junk, I do not understand why .get_longest_match treats 'junk' and 'common' items differently, other than that the two concepts are apparently not the same.)

    @terryjreedy terryjreedy added docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels Mar 16, 2012
    @terryjreedy terryjreedy changed the title difflib.ndiff appears to ignore linejunk argument Better explain "junk" concept in difflib doc Mar 16, 2012
    @patena
    Copy link
    Mannequin Author

    patena mannequin commented Mar 16, 2012

    Ah, I see. True, the ndiff docstring doesn't actually explain what junk IS - I was just engaging in wishful thinking and assuming it did the thing I wanted. A better explanation would help.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Jul 15, 2012

    ping

    @terryjreedy
    Copy link
    Member

    I guess I should try to come up with something that is an improvement, even if not perfect.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Jul 17, 2012

    I agree. Any improvement is preferred over just letting this decay in the issue tracker ;-)

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Oct 8, 2013

    Tim, any suggestions?

    @albamagallanes
    Copy link
    Mannequin

    albamagallanes mannequin commented Mar 17, 2014

    I would like to help with this issue. I'm attaching a patch for it.

    @albamagallanes
    Copy link
    Mannequin

    albamagallanes mannequin commented Mar 18, 2014

    I removed the References to 2.x version.

    @akuchling
    Copy link
    Member

    Thanks for your patch! I took it and added some more text describing what junk is, and clarifying that junk affects what's matched but doesn't cause any differences to be ignored.

    @merwok
    Copy link
    Member

    merwok commented Mar 19, 2014

    amk, if you’re satisfied with your patch, I think you can go ahead and commit it.

    @elibendersky
    Copy link
    Mannequin

    elibendersky mannequin commented Mar 19, 2014

    Revised patch LGTM.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 19, 2014

    New changeset 0a69b1e8b7fe by Andrew Kuchling in branch 'default':
    bpo-14332: provide a better explanation of junk in difflib docs
    http://hg.python.org/cpython/rev/0a69b1e8b7fe

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    docs Documentation in the Doc dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants