Add named tuple reader to CSV module #46143

Closed
rhettinger opened this issue Jan 13, 2008 · 42 comments
Labels: 3.8 (EOL) end of life; stdlib (Python modules in the Lib dir); type-feature (a feature request or enhancement)

Comments

@rhettinger
Contributor

BPO 1818
Nosy @smontanaro, @warsaw, @rhettinger, @pitrou, @merwok, @cedk, @asvetlov, @dlenski
Files
  • ntreader.diff: Proof-of-concept patch
  • ntreader3.diff: namedtuple reader and writer.
  • ntreader4.diff: Includes revision for rename keyword argument
  • named_tuple_write_header2.patch
  • ntreader4_py3_1.diff: Patch against python 3.1a1
  • ntreader6_py3.diff: updated documentation
  • ntreader6_py27.diff: updated documentation
  • 1818_py35.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/smontanaro'
    closed_at = <Date 2020-12-22.02:45:54.546>
    created_at = <Date 2008-01-13.22:27:14.884>
    labels = ['3.8', 'type-feature', 'library']
    title = 'Add named tuple reader to CSV module'
    updated_at = <Date 2020-12-22.02:45:54.545>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2020-12-22.02:45:54.545>
    actor = 'rhettinger'
    assignee = 'skip.montanaro'
    closed = True
    closed_date = <Date 2020-12-22.02:45:54.546>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2008-01-13.22:27:14.884>
    creator = 'rhettinger'
    dependencies = []
    files = ['9151', '12990', '13009', '13188', '13263', '13274', '13275', '39139']
    hgrepos = []
    issue_num = 1818
    keywords = ['patch']
    message_count = 42.0
    messages = ['59866', '61523', '61532', '81453', '81464', '81518', '81537', '82744', '82745', '82746', '82764', '82765', '82770', '82771', '82778', '82780', '82798', '82799', '82812', '82814', '82819', '83298', '83299', '83310', '83318', '83321', '83332', '83333', '83334', '83340', '102936', '102959', '110598', '111523', '111552', '115348', '235710', '241599', '241601', '242879', '242893', '311182']
    nosy_count = 11.0
    nosy_names = ['skip.montanaro', 'barry', 'rhettinger', 'pitrou', 'eric.araujo', 'ced', 'jdwhitley', 'rrenaud', 'asvetlov', 'dlenski', 'copper-head']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1818'
    versions = ['Python 3.8']

    @rhettinger
    Contributor Author

    Here's a proof-of-concept patch. If approved, will change from
    generator form to match the other readers and will add a test suite.

    The idea corresponds to what is currently done by the dict reader but
    returns a space and time efficient named tuple instead of a dict. Field
    order is preserved and named attribute access is supported.

    A writer is not needed because named tuples can be fed into the
    existing writer just like regular tuples.

    @rhettinger rhettinger added the stdlib Python modules in the Lib dir label Jan 13, 2008
    @rhettinger
    Contributor Author

    Barry, any thoughts on this?

    @smontanaro
    Contributor

    I'd personally be kind of surprised if Barry had any thoughts on this.
    Is there any reason this couldn't be pushed down into the C code and
    replace the normal tuple output completely? In the absence of any
    fieldnames you could just dream some up, like "field001", "field002",
    etc.

    Skip

    @jdwhitley
    Mannequin

    jdwhitley mannequin commented Feb 9, 2009

    An implementation of a namedtuple reader and writer.

    Created a writer for the case where user would like to specify
    desired field names and default values on missing field names.

    e.g.
    mywriter = NamedTupleWriter(f, fieldnames=['f1', 'f2', 'f3'],
                                restval='missing')

    Nt = namedtuple('LessFields', 'f1 f3')
    nt = Nt(f1='one', f3=2)

    mywriter.writerow(nt) # writes one,missing,2
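
To make the `restval` behaviour described above concrete, here is a minimal sketch of such a writer. It is not the code from the attached patch: the class body, the use of `_asdict()`, and the wrapped `csv.writer` are assumptions made only for illustration.

```python
import csv
import io
from collections import namedtuple


class NamedTupleWriter:
    """Hypothetical sketch; not the class from the attached patch."""

    def __init__(self, f, fieldnames, restval='', **kwds):
        self.fieldnames = list(fieldnames)
        self.restval = restval
        self._writer = csv.writer(f, **kwds)

    def writerow(self, record):
        # _asdict() maps field names to values; any requested name the
        # record does not define falls back to restval.
        values = record._asdict()
        return self._writer.writerow(
            [values.get(name, self.restval) for name in self.fieldnames])


buf = io.StringIO()
writer = NamedTupleWriter(buf, fieldnames=['f1', 'f2', 'f3'], restval='missing')
LessFields = namedtuple('LessFields', 'f1 f3')
writer.writerow(LessFields(f1='one', f3=2))
print(buf.getvalue())
```

Looking values up through `_asdict()` rather than `getattr` avoids accidentally picking up tuple methods such as `count` or `index` when a requested field name is missing from the record.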

    Any thoughts on the case where a defined fieldname has a leading
    underscore? Should there be a flag to silently ignore it?

    e.g.
    if self._ignore_underscores:
        fieldname = fieldname.lstrip('_')

    Leading underscores may be present in a CSV file that has not been
    inspected beforehand. Additionally, spaces and other non-alphanumeric
    characters pose a problem that does not affect the DictReader class.

    Cheers,

    @rhettinger
    Contributor Author

    Consider providing a hook to a function that converts non-conforming
    field names (ones with a leading underscore, leading digit, non-letter,
    keyword, or duplicate name).

    class NamedTupleReader:
        def __init__(self, f, fieldnames=None, restkey=None, restval=None,
                     dialect="excel", fieldnamer=None, *args, **kwds):
                     . . .

    I'm going to either post a recipe to do the renaming or provide a static
    method for the same purpose. It might work like this:

      >>> renamer(['abc', 'def', '1', '_hidden', 'abc', 'p', 'abc'])
      ['abc', 'x_def', 'x_1', 'x_hidden', 'x_abc', 'p', 'x1_abc']

    @rhettinger
    Contributor Author

    In r69480, named tuples gained the ability to automatically rename
    invalid fieldnames.
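
As a short illustration of that renaming, the sketch below feeds the same list of problem names used earlier in the thread to `collections.namedtuple` with `rename=True`. The type name `Row` is arbitrary. Invalid, keyword, and duplicate names are replaced by positional names made of an underscore and the field's index.

```python
from collections import namedtuple

header = ['abc', 'def', '1', '_hidden', 'abc', 'p', 'abc']

# rename=True replaces every unusable name with '_<position>'.
Row = namedtuple('Row', header, rename=True)
print(Row._fields)
# ('abc', '_1', '_2', '_3', '_4', 'p', '_6')
```

Without `rename=True` the same call raises `ValueError`, which is why a reader built on namedtuples needs some policy for headers it did not choose.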

    @jdwhitley
    Mannequin

    jdwhitley mannequin commented Feb 10, 2009

    Updated NamedTupleReader to give a rename=False keyword argument.
    rename is passed directly to the namedtuple factory function to enable
    automatic handling of invalid fieldnames.

    Two new tests for the rename keyword.

    Cheers,

    @rrenaud
    Mannequin

    rrenaud mannequin commented Feb 26, 2009

    I am totally new to Python dev. I reinvented a NamedTupleReader
    tonight, only to find out that it was created a year ago. My primary
    motivation is that DictReader reads headers nicely, but DictWriter
    totally sucks at handling them.

    Consider doing some filtering on a csv file, like so.

    sample_data = [
        'title,latitude,longitude',
        'OHO Ofner & Hammecke Reinigungsgesellschaft mbH,48.128265,11.610848',
        'Kitchen Kaboodle,45.544241,-122.715728',
        'Walgreens,28.339727,-81.596367',
        'Gurnigel Pass,46.731944,7.447778'
        ]
    
    def filter_with_dict_reader_writer():
      accepted_rows = []
      for row in csv.DictReader(sample_data):
        if float(row['latitude']) > 0.0 and float(row['longitude']) > 0.0:
          accepted_rows.append(row)
    
      field_names = csv.reader(sample_data).next()
      output_writer = csv.DictWriter(open('accepted_by_dict.csv', 'w'),
                                     field_names)
      output_writer.writerow(dict(zip(field_names, field_names)))
      output_writer.writerows(accepted_rows)

    You have to work so hard to maintain the headers when you write the file
    with DictWriter. I understand this is a limitation of dicts throwing
    away the order information. But namedtuples don't have that problem.

    NamedTupleReader and NamedTupleWriter should be inverses. This means
    that NamedTupleWriter needs to write headers. This should produce
    identical output as the dict writer example, but it's much cleaner.

    def filter_with_named_tuple_reader_writer():
       accepted_rows = []
       for row in csv.NamedTupleReader(sample_data):
         if float(row.latitude) > 0.0 and float(row.longitude) > 0.0:
           accepted_rows.append(row)
    
       output_writer = csv.NamedTupleWriter(
           open('accepted_by_named_tuple.csv', 'w'))
       output_writer.writerows(accepted_rows)

    I patched on top of the existing NamedTupleWriter patch adding support
    for writing headers. I don't know if that's bad style/etiquette, etc.
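
For comparison with the `DictWriter` workaround in the comment above, here is a sketch of the same filter using `DictWriter.writeheader()`, a method that was added to the csv module after this comment was written. The shortened sample data, the function name `filter_rows`, and the in-memory `io.StringIO` buffer are assumptions for illustration only.

```python
import csv
import io

sample_data = [
    'title,latitude,longitude',
    'Kitchen Kaboodle,45.544241,-122.715728',
    'Gurnigel Pass,46.731944,7.447778',
]


def filter_rows(lines, out):
    reader = csv.DictReader(lines)
    # reader.fieldnames reads the header row on first access.
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if float(row['latitude']) > 0.0 and float(row['longitude']) > 0.0:
            writer.writerow(row)


buf = io.StringIO()
filter_rows(sample_data, buf)
print(buf.getvalue())
```

Column order comes from the `fieldnames` list passed to the writer, so it does not depend on the ordering of the row dictionaries.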

    @rrenaud
    Mannequin

    rrenaud mannequin commented Feb 26, 2009

    My previous patch could write the header twice. But I am not sure
    how the writer should handle the fieldnames parameter on the one hand,
    and the namedtuple._fields on the other.

    @rhettinger
    Contributor Author

    The two latest patches (ntreader4.diff and
    named_tuple_write_header.patch) seem like they are going in the right
    direction and are getting close.

    Barry or Skip, is this something you want in your module?

    @rhettinger rhettinger added the type-feature A feature request or enhancement label Feb 26, 2009
    @smontanaro
    Contributor

    Raymond> Barry or Skip, is this something you want in your module?

    Sorry, I haven't really looked at this ticket other than to notice its
    presence. I wrote the DictReader/DictWriter functions way back when, so I'm
    pretty comfortable using them. I haven't felt the need for any other reader
    or writer which manipulates file headers.

    Skip

    @warsaw
    Member

    warsaw commented Feb 26, 2009

    I think it would be useful to have.

    @smontanaro
    Contributor

    Hrm... I replied twice by email. Only one comment appears to have
    survived the long trip. Here's my second reply:

    Rob> NamedTupleReader and NamedTupleWriter should be inverses.  This
    Rob> means that NamedTupleWriter needs to write headers.  This should
    Rob> produce identical output as the dict writer example, but it's much
    Rob> cleaner.
    

    You're assuming that one instance of these classes will read or write an
    entire file. What if you want to append lines to an existing CSV file, or
    use a new reader to pick up reading a file which has already been
    partially processed?

    @smontanaro
    Contributor

    Let me be more explicit. I don't know how it implements it, but I think
    you really need to give the user the option of specifying the field
    names and not reading/writing headers. It can't be implicit as I
    interpreted Rob's earlier comment:

    > NamedTupleReader and NamedTupleWriter should be inverses.
    > This means that NamedTupleWriter needs to write headers.
    

    Skip

    @jdwhitley
    Mannequin

    jdwhitley mannequin commented Feb 26, 2009

    Skip> Let me be more explicit. I don't know how it implements it, but I
    Skip> think you really need to give the user the option of specifying the
    Skip> field names and not reading/writing headers. It can't be implicit as I
    Skip> interpreted Rob's earlier comment:

    rrenaud> NamedTupleReader and NamedTupleWriter should be inverses.
    rrenaud> This means that NamedTupleWriter needs to write headers.
    

    I agree with Skip, we mustn't have a 'wroteheader' flag internal to the
    NamedTupleWriter.

    Currently to write a 'header' row with a csv.writer you could (for
    example) pass a tuple of header names to writerow. NamedTupleWriter
    is no different, you would have a namedtuple of header names instead of
    a tuple of header names.

    I would not like to see another flag added to the initialisation process
    to enable the writing of a header row as the 'first' (or any) row
    written to a file. We could add a function 'writeheader' that would
    write the contents of 'fieldnames' as a row, but I don't like the idea.

    Cheers,

    @rrenaud
    Mannequin

    rrenaud mannequin commented Feb 26, 2009

    I want to make sure I understand. Am I correct in believing that Skip
    thinks writing headers should be optional, while Jervis believes we
    should leave the burden to the NamedTupleWriter client?

    I agree that we should not unconditionally write headers, but I think
    that we should write headers by default, much like we read them by default.

    I believe the implicit header writing is very elegant, and the only
    reason that the DictWriter object doesn't write headers is the impedance
    mismatch between dicts and CSV. namedtuples have the field order
    information, so the impedance mismatch is gone and we should no longer be
    hindered. Implicitly reading but not implicitly writing headers just
    seems wrong.

    It also seems wrong to require the construction of "header" namedtuple
    objects. It's much less natural than dicts holding identity mappings.

    >>> Point._make(Point._fields)
    Point(x='x', y='y')

    That just looks weird and non-obvious to me. That Point instance
    doesn't really fit in my mind as something that should be a Point.
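
One way to sidestep the "header namedtuple" shown above, sketched here with a plain `csv.writer`: because `_fields` is already a tuple of names, it can be written as an ordinary row. The `Point` type and the sample values are assumptions for illustration.

```python
import csv
import io
from collections import namedtuple

Point = namedtuple('Point', 'x y')
points = [Point(1, 2), Point(3, 4)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(Point._fields)   # explicit, optional header row
writer.writerows(points)         # namedtuples are written like tuples
print(buf.getvalue())
```

This keeps header writing an explicit step, and no `Point` instance holding its own field names ever has to be built.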

    @smontanaro
    Contributor

    Rob> I agree that we should not unconditionally write headers, but I
    Rob> think that we should write headers by default, much like we read
    Rob> them by default.

    I don't think you should write them by default. I've worked with lots of
    CSV files which have no headers. I can imagine people wanting to write CSV
    files with multiple headers. It should be optional and explicit.

    Skip

    @smontanaro
    Contributor

    More concretely, I don't think this is so onerous:

        names = ["col1", "col2", "color"]
        writer = csv.DictWriter(open("f.csv", "wb"), fieldnames=names, ...)
        writer.writerow(dict(zip(names, names)))
        ...

    or

        f = open("f.csv", "rb")
        names = csv.reader(f).next()
        reader = csv.DictReader(f, fieldnames=names, ...)
        ...

    Skip

    @rrenaud
    Mannequin

    rrenaud mannequin commented Feb 27, 2009

    I did a search on Google code for the DictReader constructor. I
    analyzed the first 3 pages, the fieldnames parameter was used in 14 of
    27 cases (discounting unittest code built into Python) and was not
    used in 13 of 27 cases. I suppose that means headered csv files are
    sufficiently rare that they shouldn't be created implicitly by
    default. I still don't like the lack of symmetry of supporting
    implicit header reads, but not implicit header writes.


    @rhettinger
    Contributor Author

    > I don't think you should write them by default.
    > I've worked with lots of CSV files which have no headers.

    My experience has been the same as Skip's.

    @smontanaro
    Contributor

    Rob> I still don't like the lack of symmetry of supporting implicit
    Rob> header reads, but not implicit header writes.

    A header is nothing more than a row in the CSV file with special
    interpretation applied by the user. There is nothing implicit about it.
    If you know the first line is a header, use the recipe I posted. If not,
    supply your own fieldnames and treat the first row as data.

    Skip
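
The two cases described in the comment above can be sketched in Python 3 as follows (the recipe posted earlier in the thread uses the Python 2 `.next()` method). The sample text and the column names `a`, `b`, `c` are assumptions for illustration.

```python
import csv
import io

text = "col1,col2,color\n1,2,red\n"

# Case 1: the caller knows the first line is a header, so it is consumed
# explicitly and handed to DictReader as fieldnames.
f = io.StringIO(text)
names = next(csv.reader(f))
with_header = list(csv.DictReader(f, fieldnames=names))

# Case 2: no header is assumed; the caller supplies names and the first
# line is treated as data.
f = io.StringIO(text)
no_header = list(csv.DictReader(f, fieldnames=["a", "b", "c"]))

print(with_header)
print(no_header)
```

In both cases the decision about what the first row means is made by the caller, not by the reader.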

    @jdwhitley
    Mannequin

    jdwhitley mannequin commented Mar 8, 2009

    Added a patch against py3k branch.

    in csv.rst removed reference to reader.next() as a public method.

    @smontanaro
    Contributor

    Jervis> in csv.rst removed reference to reader.next() as a public method.

    Because? I've not seen any discussion in this issue or in any other forums
    (most certainly not on the csv@python.org mailing list) which would suggest
    that csv.reader's next() method should no longer be a public method.

    Skip

    @pitrou
    Member

    pitrou commented Mar 8, 2009

    I don't understand why NamedTupleReader requires the fieldnames array
    rather than the namedtuple class itself. If you could pass it the
    namedtuple class, users could choose whatever namedtuple subclass with
    whatever additional methods or behaviour suits them. It would make
    NamedTupleReader more flexible and more useful.

    @smontanaro
    Contributor

    I don't know how NamedTuple objects work, but in many situations you
    want the content of the CSV file to drive the output. I would think
    you would use a technique similar to my DictReader example to tell
    the NamedTupleReader the fieldnames. For that you need a fieldnames
    argument.

    @smontanaro
    Contributor

    I retract my previous comment. I don't use the DictReader the way it
    operates (fieldnames==None => first row is a header) and forgot about
    that behavior.

    @jdwhitley
    Mannequin

    jdwhitley mannequin commented Mar 8, 2009

    Jervis> in csv.rst removed reference to reader.next() as a public method.

    Skip> Because? I've not seen any discussion in this issue or in any
    Skip> other forums (most certainly not on the csv@python.org mailing
    Skip> list) which would suggest that csv.reader's next() method should
    Skip> no longer be a public method.

    I agree, this should be applied separately.

    @jdwhitley
    Mannequin

    jdwhitley mannequin commented Mar 8, 2009

    Antoine> I don't understand why NamedTupleReader requires the fieldnames
    Antoine> array rather than the namedtuple class itself. If you could pass
    Antoine> it the namedtuple class, users could choose whatever namedtuple
    Antoine> subclass with whatever additional methods or behaviour suits
    Antoine> them. It would make NamedTupleReader more flexible and more
    Antoine> useful.

    The NamedTupleReader does take the namedtuple class as the fieldnames
    argument. It can be a namedtuple, a 'fieldnames' array or None.
    If a namedtuple is used as the fieldnames argument, returned rows are
    created using ._make from this namedtuple. Unless I have read your
    requirements incorrectly, this is the behaviour you describe.
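
The three accepted forms of the fieldnames argument described above (a namedtuple class, a sequence of names, or None) might be sketched like this. The generator `namedtuple_reader` is a hypothetical helper written for illustration; it is not the patch's implementation and not part of the csv module.

```python
import csv
from collections import namedtuple


def namedtuple_reader(iterable, fieldnames=None, rename=False, **kwds):
    """Yield one namedtuple per CSV row (hypothetical helper)."""
    rows = csv.reader(iterable, **kwds)
    if fieldnames is None:
        # No names supplied: treat the first row as the header.
        fieldnames = next(rows, None)
        if fieldnames is None:
            return
    if isinstance(fieldnames, type) and issubclass(fieldnames, tuple) \
            and hasattr(fieldnames, '_fields'):
        row_type = fieldnames            # caller's own namedtuple class
    else:
        row_type = namedtuple('Row', fieldnames, rename=rename)
    for row in rows:
        if row:                          # skip blank lines
            yield row_type._make(row)


class Point(namedtuple('Point', 'x y')):
    def total(self):
        return float(self.x) + float(self.y)


points = list(namedtuple_reader(['1,2', '3,4'], fieldnames=Point))
rows = list(namedtuple_reader(['a,b', '1,2']))
print(points[0].total(), rows[0].a)
```

Passing a class lets callers attach their own methods, as in the `Point.total` example. Note that `_make` raises `TypeError` when a row's length differs from the number of fields, a case this sketch does not handle.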

    Given the confusion, I accept that the documentation needs to be improved.

    The NamedTupleReader and Writer were created to follow as closely as
    possible the behaviour (and signature) of the DictReader and DictWriter,
    with the exception of using namedtuples instead of dicts.

    @pitrou
    Member

    pitrou commented Mar 8, 2009

    Ok, I got misled by the documentation ("The contents of *fieldnames* are
    passed directly to be used as the namedtuple fieldnames"), and your
    implementation is a bit difficult to follow.

    @jdwhitley
    Mannequin

    jdwhitley mannequin commented Mar 9, 2009

    Updated version of docs for 2.7 and 3k.

    @merwok
    Member

    merwok commented Apr 12, 2010

    See also this python-ideas thread: http://mail.python.org/pipermail/python-ideas/2010-April/006991.html

    @smontanaro
    Contributor

    Type conversion is a whole 'nuther kettle of fish. This particular thread is long and complex enough that it shouldn't be made more complex.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 17, 2010

    I suggest that this is closed unless anyone shows an active interest in it.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 25, 2010

    Closing as no response to msg110598.

    @BreamoreBoy BreamoreBoy mannequin closed this as completed Jul 25, 2010
    @rhettinger
    Contributor Author

    Re-opening because we ought to do something along these lines at some point. The DictReader and DictWriter are inadequate for preserving order and they are unnecessarily memory intensive (one dict per record).

    FWIW, the non-conforming field name problem has already been solved by recent improvements to collections.namedtuple using rename=True.

    @rhettinger rhettinger assigned rhettinger and unassigned warsaw Jul 25, 2010
    @rhettinger rhettinger reopened this Jul 25, 2010
    @rhettinger
    Contributor Author

    Unassigning, this needs fresh thought and a fresh patch from someone who can devote a little deep thinking on how to solve this problem cleanly. In the meantime, it is no problem to simply cast the CSV tuples into named tuples.
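
A minimal sketch of that cast, assuming a header-less input and a namedtuple type chosen by the caller (the `Place` type and the two sample rows, borrowed from earlier in the thread, are illustrative):

```python
import csv
from collections import namedtuple

Place = namedtuple('Place', 'title latitude longitude')

lines = [
    'Kitchen Kaboodle,45.544241,-122.715728',
    'Gurnigel Pass,46.731944,7.447778',
]

# csv.reader yields lists of strings; _make turns each into a Place.
places = [Place._make(row) for row in csv.reader(lines)]
print(places[1].title, float(places[1].latitude))
```

Values stay strings, as with every csv reader; type conversion is left to the caller.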

    @rhettinger rhettinger removed their assignment Sep 2, 2010
    @dlenski
    Mannequin

    dlenski mannequin commented Feb 10, 2015

    Here's the class I have been using for reading namedtuples from CSV files:

        from collections import namedtuple
        from itertools import imap
        import csv
    
        class CsvNamedTupleReader(object):
            __slots__ = ('_r', 'row', 'fieldnames')
            def __init__(self, *args, **kwargs):
                self._r = csv.reader(*args, **kwargs)
                self.row = namedtuple("row", self._r.next())
                self.fieldnames = self.row._fields
    
            def __iter__(self):
                #FIXME: how about this? return imap(self.row._make, self._r[:len(self.fieldnames)]
                return imap(self.row._make, self._r)
    
            dialect = property(lambda self: self._r.dialect)
            line_num = property(lambda self: self._r.line_num)

    This class wraps csv.reader since it doesn't seem to be possible to inherit from it. It uses itertools.imap to iterate over the rows output by csv.reader and convert them to the namedtuple class.

    One thing that needs fixing (marked with FIXME above) is what to do in the case of a row which has more fields than the header row. The simplest solution is simply to truncate such a row, but perhaps more options are needed, similar to those offered by DictReader.
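
One possible answer to the FIXME, sketched in Python 3: truncate rows that are longer than the header and pad rows that are shorter with a `restval`, loosely mirroring DictReader's option of the same name. The function `read_namedtuples` and its parameters are assumptions for illustration, not a proposal from the thread.

```python
import csv
from collections import namedtuple


def read_namedtuples(lines, restval=None, **kwargs):
    """Hypothetical helper: header-driven namedtuples with length fix-up."""
    reader = csv.reader(lines, **kwargs)
    header = next(reader, None)
    if header is None:
        return                      # empty input: nothing to yield
    Row = namedtuple('Row', header, rename=True)
    width = len(Row._fields)
    for row in reader:
        # Drop fields beyond the header; pad missing ones with restval.
        fixed = row[:width] + [restval] * (width - len(row))
        yield Row._make(fixed)


rows = list(read_namedtuples(['a,b', '1,2,3', '4']))
print(rows)
```

Silent truncation discards data, so a fuller design would probably let the caller choose between truncating, collecting the extras, and raising an error.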

    @copper-head
    Mannequin

    copper-head mannequin commented Apr 20, 2015

    As my contribution during the sprints at PyCon 2015, I've tweaked Jervis's patch a little and updated the tests/docs to work with Python 3.5.

    My only real change was placing the basic reader object inside a generator expression that filters out empty lines. Being partial to functional programming I find this removes some of the code clutter in __next__(), letting that method focus on turning rows into tuples.

    Hopefully this will rekindle the discussion!
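
The structure described above (a generator expression that filters empty lines, so that `__next__` only converts rows) might look roughly like this. It is a reconstruction for illustration, not the contents of 1818_py35.diff; the class body and its required `fieldnames` parameter are assumptions.

```python
import csv
from collections import namedtuple


class NamedTupleReader:
    """Hypothetical sketch; not the class from the attached patch."""

    def __init__(self, f, fieldnames, **kwds):
        self._row_type = namedtuple('Row', fieldnames)
        # csv.reader yields [] for blank lines; filter them out once here.
        self._rows = (row for row in csv.reader(f, **kwds) if row)

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the underlying generator ends iteration.
        return self._row_type._make(next(self._rows))


records = list(NamedTupleReader(['1,2', '', '3,4'], ['x', 'y']))
print(records)
```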

    @rhettinger
    Contributor Author

    Skip or Barry, do you want to look at this?

    @copper-head
    Mannequin

    copper-head mannequin commented May 11, 2015

    Friendly reminder that this exists.

    I know everyone's busy and this is marked as low-priority, but I'm gonna keep bumping this till we add a solution :)

    @smontanaro
    Contributor

    I looked at this six years ago. I still haven't found a situation where I pined for a NamedTupleReader. That said, I have no objection to committing it if others, more well-versed in current Python code and NamedTuples than I, give it a pass. Note that I added a couple of comments to the csv.py diff, but nobody either updated the code or explained why I was out in the weeds in my comments.

    @rhettinger rhettinger added the 3.8 (EOL) end of life label Jan 29, 2018
    @smontanaro
    Contributor

    FWIW, I relinquished my check-in privileges quite a while ago. This should
    almost certainly no longer be assigned to me.

    S

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022