improve csv.Sniffer().sniff() behavior #73591

mepstein · 2017-02-01T00:26:25Z

BPO	29405
Nosy	@berkerpeksag, @zhangyangyu
Files	csv_sniffer.patch csv.py.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2017-02-06.02:53:45.648>
created_at = <Date 2017-02-01.00:26:25.052>
labels = ['3.7', 'type-feature', 'library']
title = 'improve csv.Sniffer().sniff() behavior'
updated_at = <Date 2017-02-06.03:38:39.320>
user = 'https://bugs.python.org/mepstein'

bugs.python.org fields:

activity = <Date 2017-02-06.03:38:39.320>
actor = 'xiang.zhang'
assignee = 'none'
closed = True
closed_date = <Date 2017-02-06.02:53:45.648>
closer = 'xiang.zhang'
components = ['Library (Lib)']
creation = <Date 2017-02-01.00:26:25.052>
creator = 'mepstein'
dependencies = []
files = ['46509', '46512']
hgrepos = []
issue_num = 29405
keywords = ['patch']
message_count = 9.0
messages = ['286570', '286571', '286913', '286923', '287070', '287071', '287072', '287073', '287074']
nosy_count = 4.0
nosy_names = ['python-dev', 'berker.peksag', 'xiang.zhang', 'mepstein']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue29405'
versions = ['Python 3.7']

mepstein · 2017-02-01T00:26:25Z

I'm trying to use csv.Sniffer().sniff(sample_data) to determine the delimiter on a number of input files. Through some trial and error, many "Could not determine delimiter" errors, and analyzing how this routine works/behaves, I settled on sample_data being some number of lines of the input file, particularly 30. This value seems to allow the routine to work more frequently, although not always, particularly on short input files.

I realize the way this routine works is somewhat idiosyncratic, and it won't be so easy to improve it generally, but there's one simple change that occurred to me that would help in some cases. Currently the function _guess_delimiter() in csv.py contains the following lines:

            # build a list of possible delimiters
            modeList = modes.items()
            total = float(chunkLength * iteration)

So total is increased by chunkLength on each iteration. The problem occurs when total becomes greater than the length of sample_data, that is, the iteration would go beyond the end of sample_data. That reading is handled fine, it's truncated at the end of sample_data, but total is needlessly set too high. My suggested change is to add the following two lines after the above:

            if total > len(data):
                total = float(len(data))

mepstein · 2017-02-01T00:35:04Z

FWIW, it might be more concise and more consistent with the existing code to change the one line to:

            total = min(float(chunkLength * iteration), float(len(data)))

zhangyangyu · 2017-02-04T06:11:41Z

Sounds reasonable. IIUC if the sample data gets 11 lines the total could be 20. I also think the second min is redundant. Would you mind review my patch Milt?

mepstein · 2017-02-04T07:33:35Z

That's right, with 11 lines in the sample data, total will become 20 on the second iteration. And that throws off some of the computations done in that function.

Your patch looks good, in that it will achieve what I'm requesting. But :-), your pointing out that other redundant min() made me take a closer look at the code, and led me to produce the attached patch as an alternate suggestion. I think it makes the code a bit more sensible and cleaner. Please review, and go with what you think best.

Thanks.

python-dev · 2017-02-06T02:51:04Z

New changeset 724d1aa7589b by Xiang Zhang in branch 'default':
Issue bpo-29405: Make total calculation in _guess_delimiter more accurate.
https://hg.python.org/cpython/rev/724d1aa7589b

zhangyangyu · 2017-02-06T02:53:45Z

Thanks Milt. I committed with my change not because it's better, but I want to make the change small so others won't get unfamiliar with the new code. :-)

python-dev · 2017-02-06T03:00:23Z

New changeset 6d0a09623155710680ff19f05f279d45c007a304 by Xiang Zhang in branch 'master':
Issue bpo-29405: Make total calculation in _guess_delimiter more accurate.
6d0a096

berkerpeksag · 2017-02-06T03:15:44Z

[...] but I want to make the change small so others won't get unfamiliar with the new code. :-)

Assuming it doesn't cause any behavior changes, I find Milt's patch simple enough and easier to understand than the version uses 'iteration' variable.

zhangyangyu · 2017-02-06T03:38:39Z

I am fine with any version (both are simple and not the hardest part to understand in the logic). :-) I have no opinion on which is better.

mepstein mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Feb 1, 2017

zhangyangyu added 3.7 (EOL) end of life type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Feb 4, 2017

zhangyangyu closed this as completed Feb 6, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve csv.Sniffer().sniff() behavior #73591

improve csv.Sniffer().sniff() behavior #73591

mepstein mannequin commented Feb 1, 2017

mepstein mannequin commented Feb 1, 2017

mepstein mannequin commented Feb 1, 2017

zhangyangyu commented Feb 4, 2017

mepstein mannequin commented Feb 4, 2017

python-dev mannequin commented Feb 6, 2017

zhangyangyu commented Feb 6, 2017

python-dev mannequin commented Feb 6, 2017

berkerpeksag commented Feb 6, 2017

zhangyangyu commented Feb 6, 2017

improve csv.Sniffer().sniff() behavior #73591

improve csv.Sniffer().sniff() behavior #73591

Comments

mepstein mannequin commented Feb 1, 2017

mepstein mannequin commented Feb 1, 2017

mepstein mannequin commented Feb 1, 2017

zhangyangyu commented Feb 4, 2017

mepstein mannequin commented Feb 4, 2017

python-dev mannequin commented Feb 6, 2017

zhangyangyu commented Feb 6, 2017

python-dev mannequin commented Feb 6, 2017

berkerpeksag commented Feb 6, 2017

zhangyangyu commented Feb 6, 2017