Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal to implement comment rows in csv module #42112

Closed
iainhaslam mannequin opened this issue Jun 22, 2005 · 20 comments
Closed

Proposal to implement comment rows in csv module #42112

iainhaslam mannequin opened this issue Jun 22, 2005 · 20 comments
Labels
stdlib Python modules in the Lib dir

Comments

@iainhaslam
Copy link
Mannequin

iainhaslam mannequin commented Jun 22, 2005

BPO 1225769
Nosy @smontanaro, @rhettinger, @amauryfa, @orsenthil, @pitrou, @merwok, @kynan
Files
  • csv.diff: Proposed diffs to implement commented records.
  • csv_comment_doc_test.diff: Test and documentation diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-02-04.02:10:23.012>
    created_at = <Date 2005-06-22.19:48:01.000>
    labels = ['library']
    title = 'Proposal to implement comment rows in csv module'
    updated_at = <Date 2010-08-29.13:40:43.212>
    user = 'https://bugs.python.org/iainhaslam'

    bugs.python.org fields:

    activity = <Date 2010-08-29.13:40:43.212>
    actor = 'kynan'
    assignee = 'andrewmcnamara'
    closed = True
    closed_date = <Date 2010-02-04.02:10:23.012>
    closer = 'andrewmcnamara'
    components = ['Library (Lib)']
    creation = <Date 2005-06-22.19:48:01.000>
    creator = 'iain_haslam'
    dependencies = []
    files = ['6701', '6702']
    hgrepos = []
    issue_num = 1225769
    keywords = ['patch']
    message_count = 20.0
    messages = ['48499', '48500', '48501', '48502', '48503', '48504', '48505', '65434', '65505', '65506', '98415', '98421', '98422', '98426', '98436', '98438', '98447', '98807', '98808', '98812']
    nosy_count = 10.0
    nosy_names = ['skip.montanaro', 'rhettinger', 'amaury.forgeotdarc', 'andrewmcnamara', 'orsenthil', 'pitrou', 'iain_haslam', 'eric.araujo', 'Mario.Fasold', 'kynan']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1225769'
    versions = ['Python 2.5']

    @iainhaslam
    Copy link
    Mannequin Author

    iainhaslam mannequin commented Jun 22, 2005

    Sometimes csv files contain comment rows, for
    temporarily commenting out data or occasionally for
    documentation. The current csv module has no built-in
    ability to skip rows; in order to skip all lines
    beginning with '#', the programmer writes something like:

    csv_reader = csv.reader(fp)
    for row in csv_reader:
        if row[0][0] != '#':    #assuming no blank lines
            print row

    I propose adding a "commentchar" parameter to the csv
    parser, so that the above code could be written (more
    elegantly, in my opinion):

    csv_reader = csv.reader(fp, commentchar='#')
    for row in csv_reader:
        print row

    This requires only relatively minor changes to the
    module, and by defaulting to using no comment
    character, existing code will behave as before. If you
    are interested, the patch (diffs against current cvs)
    required for the second example to run are attached.

    Note that that implementation adds SKIPPED_RECORD as a
    pseudonym for START_RECORD, because setting status to
    START_RECORD after skipping a record would cause a
    blank record to be returned. Altering that behaviour
    would cause more changes and the patch would be harder
    to review. I've also held back on updating tests and
    documentation to reflect this change, pending any
    support for it.

    It shoud be irrelevant, but this has been developed on
    Debian testing against the cvs head of Python.

    @iainhaslam iainhaslam mannequin assigned smontanaro Jun 22, 2005
    @iainhaslam iainhaslam mannequin added the stdlib Python modules in the Lib dir label Jun 22, 2005
    @iainhaslam iainhaslam mannequin assigned smontanaro Jun 22, 2005
    @iainhaslam iainhaslam mannequin added the stdlib Python modules in the Lib dir label Jun 22, 2005
    @smontanaro
    Copy link
    Contributor

    Logged In: YES
    user_id=44345

    I'm not inclined to clutter the C code with further complications. Why
    can't you implement this on an as-needed basis with a file object wrapper,
    a subclass of the csv.reader class, or just continue to use the example in
    your submission?

    @iainhaslam
    Copy link
    Mannequin Author

    iainhaslam mannequin commented Jun 26, 2005

    Logged In: YES
    user_id=1301296

    I'm not inclined to clutter the C code with further
    complications.

    Sorry - I haven't been keeping up with the existing
    complications! Don't forget that one man's clutter is
    another man's functionality. It doesn't actually require
    much of a change to the code, although I was slightly
    suprised to discover that this module was in C in the first
    place...

    Basically, I noticed that the csv module has a bias towards
    Excel-generated csv files, but most of the time I've come
    across csv files, they were hand-edited, and I've often seen
    comment fields as described in the submission.

    My submission was intended in the "batteries included"
    spirit (I do understand that you stop short of the kitchen
    sink), and also seemed in-keeping with the
    'skipinitialspace' option within the existing csv module.

    Why can't you implement this on an as-needed basis
    with a file object wrapper [other options]

    True, I could do any of those things, but it would be
    simpler / clearer not to have to. Of course, if you took
    your argument further, you could cut chunks out of several
    modules; the argument comes down to whether the benefits
    outweigh the additional complexity. I was suprised to
    discover the option wasn't already there, but maybe that's
    just me.

    In any case, if your vote goes from your apparent -0 to -1,
    that's your choice, and you're better placed to make it than
    I am.

    Cheers,
    Iain.

    @smontanaro
    Copy link
    Contributor

    Logged In: YES
    user_id=44345

    Something else just occurred to me. What about writing csv files with
    comments? Also, a tweak to the docs would be in order if this is
    accepted.

    @smontanaro
    Copy link
    Contributor

    Logged In: YES
    user_id=44345

    Iain - There was some positive response to your patch from
    the csv mailing list (most notably from one of the authors of
    the C extension module). Can you provide diffs for the
    module documentation and necessary test cases to go along
    with your patch? Also, addressing the issue that CSV files
    with comments (probably?) won't round-trip would be a good
    thing to note in the docs.

    Skip

    @iainhaslam
    Copy link
    Mannequin Author

    iainhaslam mannequin commented Jun 30, 2005

    Logged In: YES
    user_id=1301296

    Here are the documentation and test diffs.

    I'm glad to hear of the positive feedback. I couldn't find
    the csv mailing list (I assume it's not public), so didn't
    see the discussion of round-tripping, but I agree with the
    implied conclusion that flagging rows as comments would
    complicate the interface too much.

    On a related point, I noticed that the csv documentation is
    filed under "Internet data handling". This seems a little
    odd - I'd suggest moving it to "Miscellaneous Services"
    alongside fileinput and ConfigParser.

    Thanks,
    Iain.

    @smontanaro
    Copy link
    Contributor

    Sorry, I'm coming back to this after a long hiatus... I'm still not
    inclined to make this change to the C source. I think a) comments in CSV
    files are pretty rare and that b) implementing this using a file object
    wrapper would be more flexible.

    #!/usr/bin/env python

    import csv
    import StringIO
    
    class CommentedFile:
        def __init__(self, f, commentstring="#"):
            self.f = f
            self.commentstring = commentstring
    
        def next(self):
            line = self.f.next()
            while line.startswith(self.commentstring):
                line = self.f.next()
            return line
    
        def __iter__(self):
            return self
    
    f = StringIO.StringIO('''\
    "event","performers","start","end","time"
    # Rachel Sage
    "","Rachael Sage","2008-01-03","2008-01-03","8:00pm"
    # Others
    "","Tung-N-GRoeVE","2008-01-16","2008-01-16","9:30pm-2:00am"
    "","Bossa Nova Beatniks","2007-11-11","2007-11-11","11:11pm"
    "","Special Consensus","2006-10-06","2006-10-06",""
    ''')
    
    for row in csv.reader(CommentedFile(f)):
        print row

    The output of the above is as expected:

    ['event', 'performers', 'start', 'end', 'time']
    ['', 'Rachael Sage', '2008-01-03', '2008-01-03', '8:00pm']
    ['', 'Tung-N-GRoeVE', '2008-01-16', '2008-01-16', '9:30pm-2:00am']
    ['', 'Bossa Nova Beatniks', '2007-11-11', '2007-11-11', '11:11pm']
    ['', 'Special Consensus', '2006-10-06', '2006-10-06', '']

    This has the added benefit that comment lines aren't restricted to single
    character comment prefixes. On the downside, comment characters appearing
    at the beginning of a continuation line would trip this up. In practice, I
    don't think it would be a significant limitation. In almost all cases I
    suspect CSV files with embedded comments would be manually created and
    maintained and aren't likely to contain fields with embedded comments.

    Skip

    @smontanaro
    Copy link
    Contributor

    Assigning to Andrew (as the primary C lib author). Andrew, please
    comment ;-).

    @andrewmcnamara
    Copy link
    Mannequin

    andrewmcnamara mannequin commented Apr 15, 2008

    I think it's a reasonable enough request - I've certainly had to
    process CSV files with comments. Iain - appologies for not looking at
    your request before now - 3 years is a pretty poor response time.

    Some thoughts:

    • this should target 2.6, not 2.5 (as 2.5 is in maintenance only)?
    • the check that disallows "space" - I wonder what other things it
      should disallow - the delimiter?
    • should it support multicharacter comments, like the C++ //-style
      comment? I think no.
    • any implications for the py3k port?

    Skip - are you happy making the changes, or should I dust off my
    working copy?

    @rhettinger
    Copy link
    Contributor

    -1 on the change. Comments in CSV files are not a common use case.
    I'm less worried about cluttering the C code and more concerned about
    cluttering the API, making the module harder to learn and remember.

    Also, it is already trivial to implement filtering using str.startswith
    (). This proposed patch is not worth the rare case where it saves a
    line or two code that was already crystal clear.

    @MarioFasold
    Copy link
    Mannequin

    MarioFasold mannequin commented Jan 27, 2010

    Comment lines are a *very* common case in scientific and statistical data. +1 for the change.

    @amauryfa
    Copy link
    Member

    Comment lines in csv data may be common in some areas, but they are not part of any standard, and they are not the only possible extension to csv files (for example: ignore empty lines, or a terminal \ for line continuation...)

    Currently all members of Dialect deal with the format of the records, not with the extraction of records from the file. (lineterminator is used only when writing).

    The "CommentedFile" filter above is a good recipe for all cases, and is easy to use. I recommend closing this issue.

    @merwok
    Copy link
    Member

    merwok commented Jan 27, 2010

    Hello

    Shouldn't the comment char definition belong in a dialect class? The reader would still have to be modified to skip these lines, but having this char in the dialect would not require a change to csv.reader signature.

    Kind regards

    @amauryfa
    Copy link
    Member

    Shouldn't the comment char definition belong in a dialect class?
    The proposed patch does this correctly; then csv.reader automatically accepts the dialect parameters to override the selected dialect.

    I'm still -1 on the feature - not standard enough, and easy to implement outside the csv module.

    @smontanaro
    Copy link
    Contributor

    Amaury> Comment lines in csv data may be common in some areas, but they
    Amaury> are not part of any standard, and they are not the only possible
    Amaury> extension to csv files (for example: ignore empty lines, or a
    Amaury> terminal \ for line continuation...)

    Or different peoples' notion of how to comment strings. Precidely because
    there is no standard way to comment out content in CSV files people are free
    to dream up anything they want, including,
    but not limited to:

    * C- or C++-style comments
    * MoinMoin-style comments, where "##" introduces a comment but a lone
      "#" introduces a meta command to the parser
    

    Trying to accommodate the myriad varieties of way s people might decide to
    comment out content would put an undue burden on the csv module's parser.

    Skip

    @pitrou
    Copy link
    Member

    pitrou commented Jan 27, 2010

    Agreed with Skip, Raymond and Amaury.
    Since the csv module returns you an iterator, it's easy enough to wrap it in another iterator.

    @smontanaro
    Copy link
    Contributor

    Antoine> Since the csv module returns you an iterator, it's easy enough
    Antoine> to wrap it in another iterator.

    I prefer to do this sort of stuff as a pre-processing step, so I generally
    wrap the file object input and use that iterator as the input "file" for the
    csv reader. I'm sure it's largely a matter of taste and what you need to
    do. Certainly wrapping the csv reader's output would be useful to provide
    sensible missing values.

    Skip

    @orsenthil
    Copy link
    Member

    Here is another -1 for this proposed feature. Having a comments in the csv fields and providing a way to deal will complicate matters more than required. Different suggestions of how to accomplish it has been suggested here. As others, I too recommend closing it. (It is assigned to andewmcnamara, so I guess, he would close it).

    @andrewmcnamara
    Copy link
    Mannequin

    andrewmcnamara mannequin commented Feb 4, 2010

    Okay, while I am sympathetic to the points raised by the people asking for this enhancement, I'm persuaded to reject it by the arguments that the potential benefit is outweighed by the increase in complexity (code and documentation).

    While the attached patch from Iain requires relatively minor and innocuous changes to the state machine, it only satisfies a limited subset of users, and the effect of complexity is geometric. In other words, to understand the state machine, you need to understand the relation of each state to every other state, etc.

    As to why the core of the module is implemented in C, the sort of stateful parsing done by the module is one area where python does not excel (Ha!). Stripping comments, on the other hand, CAN be done with reasonable efficiency from python (and considerably more flexibility). Also, it has been my experience that CSV files that use comments typically have other non-standard features as well, requiring additional pre- and post-processing in any case.

    @andrewmcnamara andrewmcnamara mannequin closed this as completed Feb 4, 2010
    @andrewmcnamara andrewmcnamara mannequin closed this as completed Feb 4, 2010
    @andrewmcnamara
    Copy link
    Mannequin

    andrewmcnamara mannequin commented Feb 4, 2010

    Note that there is one case that cannot easily be addressed via pre-processing: where the comment character coincidently appears at the start of a line within a multi-line quoted field. For example:

    # This is a comment
    1, 2, "This is field^M
    #3"

    What this should produce is debatable, but it would be hard to make it produce:

    ["1", "2", "This is field^M#3"]

    without implementing it within the parser.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants