Better explain "junk" concept in difflib doc #58540

patena · 2012-03-16T05:52:05Z

BPO	14332
Nosy	@tim-one, @akuchling, @terryjreedy, @merwok
Files	issue14332.patch: patch for the bug issue14332_2.patch: Patch for 14332 bug with no references 14332.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2014-03-19.20:45:07.458>
created_at = <Date 2012-03-16.05:52:05.015>
labels = ['type-bug', 'docs']
title = 'Better explain "junk" concept in difflib doc'
updated_at = <Date 2014-03-19.20:45:07.458>
user = 'https://bugs.python.org/patena'

bugs.python.org fields:

activity = <Date 2014-03-19.20:45:07.458>
actor = 'akuchling'
assignee = 'docs@python'
closed = True
closed_date = <Date 2014-03-19.20:45:07.458>
closer = 'akuchling'
components = ['Documentation']
creation = <Date 2012-03-16.05:52:05.015>
creator = 'patena'
dependencies = []
files = ['34451', '34479', '34502']
hgrepos = []
issue_num = 14332
keywords = ['patch']
message_count = 14.0
messages = ['155992', '156046', '156062', '156066', '165502', '165506', '165677', '199201', '213789', '213945', '214037', '214060', '214089', '214133']
nosy_count = 9.0
nosy_names = ['tim.peters', 'akuchling', 'terry.reedy', 'eric.araujo', 'eli.bendersky', 'docs@python', 'python-dev', 'patena', 'albamagallanes']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue14332'
versions = ['Python 2.7', 'Python 3.3', 'Python 3.4']

patena · 2012-03-16T05:52:04Z

According to difflib.ndiff help, the optional linejunk argument is "A function that should accept a single string argument, and return true iff the string is junk." Presumably the point is to ignore the junk lines in the comparison. But the function doesn't appear to actually do this - in fact I haven't been able to make the linejunk argument change the output in any way.

Expected difflib.ndiff behavior with no linejunk argument given:
 >>> test_lines_1 = ['# something\n', 'real data\n']
 >>> test_lines_2 = ['# something else\n', 'real data\n']
 >>> print ''.join(difflib.ndiff(test_lines_1,test_lines_2))
 - # something
 + # something else
 ?            +++++
   real data

Now I'm providing a linejunk function to ignore all lines starting with '#', but the output is still the same:
 >>> print ''.join(difflib.ndiff(test_lines_1, test_lines_2, 
                           linejunk=lambda line: line.startswith('#')))
 - # something
 + # something else
 ?            +++++
   real data

In fact if I make linejunk always return True (or False), nothing changes either:
 >>> print ''.join(difflib.ndiff(test_lines_1, test_lines_2, 
                                 linejunk=lambda line: True))
 - # something
 + # something else
 ?            +++++
   real data

It certainly looks like an error, although it's possible that I'm just misunderstanding how this should work.

I'm using Python 2.6.5, on Ubuntu Linux 10.04.

merwok · 2012-03-16T14:50:13Z

Unfortunately Python 2.6 only gets fixes for security bugs now, not regular bugs. Can you reproduce the problem with 2.7 or 3.2?

terryjreedy · 2012-03-16T18:01:25Z

I reproduced the observed behavior in 3.3.0a.
However, I am rather sure it is not a bug.
In any case, linejunk is not ignored. Passing 'lambda x: 1/0' causes ZeroDivisionError, proving that it gets called.

The body of ndiff(linejunk,charjunk,a,b) is
return Differ(linejunk, charjunk).compare(a, b)
Differ only uses the linejunk parameter here
cruncher = SequenceMatcher(self.linejunk, a, b)

SequenceMatcher uses the first parameter, isjunk, in the internal .__chain_b method to segregate (not remove) items expected to be common in order to speed up the .find_longest_match method. Read the docstring for that method (and possibly the code) to see how it affects matching. The main intent of the *junk parameters is to speed up matching to find differences, not to mask differences. It does, however, affect output of the .*ratio methods.

The doc string for ndiff says "The default is None, and is recommended; as of Python 2.3, an adaptive notion of "noise" lines is used that does a good job on its own." That is a good idea.

That said, I think the doc (and docstrings) should explain the notion of "junk" elements and what 'ignoring' them means. In particular, I think a couple of sentences should be added after "The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk)." The quotes around "junk" indicate that it is being used with a non-standard, module specific meaning. What is it? And what does 'ignore' (used several times later in the doc) mean?

Tim, I think we may need your help here since 'junk' is your label for your concept and I am not sure I understand well enough to articulate it. (For one thing, given that the "common" heuristic was apparently meant to replace at least the linejunk version version of junk, I do not understand why .get_longest_match treats 'junk' and 'common' items differently, other than that the two concepts are apparently not the same.)

patena · 2012-03-16T18:15:00Z

Ah, I see. True, the ndiff docstring doesn't actually explain what junk IS - I was just engaging in wishful thinking and assuming it did the thing I wanted. A better explanation would help.

elibendersky · 2012-07-15T03:56:05Z

ping

terryjreedy · 2012-07-15T05:21:23Z

I guess I should try to come up with something that is an improvement, even if not perfect.

elibendersky · 2012-07-17T03:37:43Z

I agree. Any improvement is preferred over just letting this decay in the issue tracker ;-)

elibendersky · 2013-10-08T13:12:20Z

Tim, any suggestions?

albamagallanes · 2014-03-17T03:34:11Z

I would like to help with this issue. I'm attaching a patch for it.

albamagallanes · 2014-03-18T05:55:31Z

I removed the References to 2.x version.

akuchling · 2014-03-18T23:20:13Z

Thanks for your patch! I took it and added some more text describing what junk is, and clarifying that junk affects what's matched but doesn't cause any differences to be ignored.

merwok · 2014-03-19T06:49:19Z

amk, if you’re satisfied with your patch, I think you can go ahead and commit it.

elibendersky · 2014-03-19T12:46:44Z

Revised patch LGTM.

python-dev · 2014-03-19T20:44:25Z

New changeset 0a69b1e8b7fe by Andrew Kuchling in branch 'default':
bpo-14332: provide a better explanation of junk in difflib docs
http://hg.python.org/cpython/rev/0a69b1e8b7fe

patena mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 16, 2012

terryjreedy added docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels Mar 16, 2012

terryjreedy changed the title ~~difflib.ndiff appears to ignore linejunk argument~~ Better explain "junk" concept in difflib doc Mar 16, 2012

terryjreedy assigned docspython Mar 16, 2012

akuchling closed this as completed Mar 19, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better explain "junk" concept in difflib doc #58540

Better explain "junk" concept in difflib doc #58540

patena mannequin commented Mar 16, 2012

patena mannequin commented Mar 16, 2012

merwok commented Mar 16, 2012

terryjreedy commented Mar 16, 2012

patena mannequin commented Mar 16, 2012

elibendersky mannequin commented Jul 15, 2012

terryjreedy commented Jul 15, 2012

elibendersky mannequin commented Jul 17, 2012

elibendersky mannequin commented Oct 8, 2013

albamagallanes mannequin commented Mar 17, 2014

albamagallanes mannequin commented Mar 18, 2014

akuchling commented Mar 18, 2014

merwok commented Mar 19, 2014

elibendersky mannequin commented Mar 19, 2014

python-dev mannequin commented Mar 19, 2014

Better explain "junk" concept in difflib doc #58540

Better explain "junk" concept in difflib doc #58540

Comments

patena mannequin commented Mar 16, 2012

patena mannequin commented Mar 16, 2012

merwok commented Mar 16, 2012

terryjreedy commented Mar 16, 2012

patena mannequin commented Mar 16, 2012

elibendersky mannequin commented Jul 15, 2012

terryjreedy commented Jul 15, 2012

elibendersky mannequin commented Jul 17, 2012

elibendersky mannequin commented Oct 8, 2013

albamagallanes mannequin commented Mar 17, 2014

albamagallanes mannequin commented Mar 18, 2014

akuchling commented Mar 18, 2014

merwok commented Mar 19, 2014

elibendersky mannequin commented Mar 19, 2014

python-dev mannequin commented Mar 19, 2014