Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve csv.Sniffer().sniff() behavior #73591

Closed
mepstein mannequin opened this issue Feb 1, 2017 · 9 comments
Closed

improve csv.Sniffer().sniff() behavior #73591

mepstein mannequin opened this issue Feb 1, 2017 · 9 comments
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@mepstein
Copy link
Mannequin

mepstein mannequin commented Feb 1, 2017

BPO 29405
Nosy @berkerpeksag, @zhangyangyu
Files
  • csv_sniffer.patch
  • csv.py.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-02-06.02:53:45.648>
    created_at = <Date 2017-02-01.00:26:25.052>
    labels = ['3.7', 'type-feature', 'library']
    title = 'improve csv.Sniffer().sniff() behavior'
    updated_at = <Date 2017-02-06.03:38:39.320>
    user = 'https://bugs.python.org/mepstein'

    bugs.python.org fields:

    activity = <Date 2017-02-06.03:38:39.320>
    actor = 'xiang.zhang'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-02-06.02:53:45.648>
    closer = 'xiang.zhang'
    components = ['Library (Lib)']
    creation = <Date 2017-02-01.00:26:25.052>
    creator = 'mepstein'
    dependencies = []
    files = ['46509', '46512']
    hgrepos = []
    issue_num = 29405
    keywords = ['patch']
    message_count = 9.0
    messages = ['286570', '286571', '286913', '286923', '287070', '287071', '287072', '287073', '287074']
    nosy_count = 4.0
    nosy_names = ['python-dev', 'berker.peksag', 'xiang.zhang', 'mepstein']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue29405'
    versions = ['Python 3.7']

    @mepstein
    Copy link
    Mannequin Author

    mepstein mannequin commented Feb 1, 2017

    I'm trying to use csv.Sniffer().sniff(sample_data) to determine the delimiter on a number of input files. Through some trial and error, many "Could not determine delimiter" errors, and analyzing how this routine works/behaves, I settled on sample_data being some number of lines of the input file, particularly 30. This value seems to allow the routine to work more frequently, although not always, particularly on short input files.

    I realize the way this routine works is somewhat idiosyncratic, and it won't be so easy to improve it generally, but there's one simple change that occurred to me that would help in some cases. Currently the function _guess_delimiter() in csv.py contains the following lines:

                # build a list of possible delimiters
                modeList = modes.items()
                total = float(chunkLength * iteration)

    So total is increased by chunkLength on each iteration. The problem occurs when total becomes greater than the length of sample_data, that is, the iteration would go beyond the end of sample_data. That reading is handled fine, it's truncated at the end of sample_data, but total is needlessly set too high. My suggested change is to add the following two lines after the above:

                if total > len(data):
                    total = float(len(data))

    @mepstein mepstein mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Feb 1, 2017
    @mepstein
    Copy link
    Mannequin Author

    mepstein mannequin commented Feb 1, 2017

    FWIW, it might be more concise and more consistent with the existing code to change the one line to:

                total = min(float(chunkLength * iteration), float(len(data)))

    @zhangyangyu
    Copy link
    Member

    Sounds reasonable. IIUC if the sample data gets 11 lines the total could be 20. I also think the second min is redundant. Would you mind review my patch Milt?

    @zhangyangyu zhangyangyu added 3.7 (EOL) end of life type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Feb 4, 2017
    @mepstein
    Copy link
    Mannequin Author

    mepstein mannequin commented Feb 4, 2017

    That's right, with 11 lines in the sample data, total will become 20 on the second iteration. And that throws off some of the computations done in that function.

    Your patch looks good, in that it will achieve what I'm requesting. But :-), your pointing out that other redundant min() made me take a closer look at the code, and led me to produce the attached patch as an alternate suggestion. I think it makes the code a bit more sensible and cleaner. Please review, and go with what you think best.

    Thanks.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 6, 2017

    New changeset 724d1aa7589b by Xiang Zhang in branch 'default':
    Issue bpo-29405: Make total calculation in _guess_delimiter more accurate.
    https://hg.python.org/cpython/rev/724d1aa7589b

    @zhangyangyu
    Copy link
    Member

    Thanks Milt. I committed with my change not because it's better, but I want to make the change small so others won't get unfamiliar with the new code. :-)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 6, 2017

    New changeset 6d0a09623155710680ff19f05f279d45c007a304 by Xiang Zhang in branch 'master':
    Issue bpo-29405: Make total calculation in _guess_delimiter more accurate.
    6d0a096

    @berkerpeksag
    Copy link
    Member

    [...] but I want to make the change small so others won't get unfamiliar with the new code. :-)

    Assuming it doesn't cause any behavior changes, I find Milt's patch simple enough and easier to understand than the version uses 'iteration' variable.

    @zhangyangyu
    Copy link
    Member

    I am fine with any version (both are simple and not the hardest part to understand in the logic). :-) I have no opinion on which is better.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants