Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csv needs more quoting rules #67230

Open
samwyse mannequin opened this issue Dec 12, 2014 · 52 comments
Open

csv needs more quoting rules #67230

samwyse mannequin opened this issue Dec 12, 2014 · 52 comments
Assignees
Labels
3.12 bugs and security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@samwyse
Copy link
Mannequin

samwyse mannequin commented Dec 12, 2014

BPO 23041
Nosy @smontanaro, @rhettinger, @bitdancer, @berkerpeksag, @serhiy-storchaka, @yoonghm, @erdnaxeli, @msetina
PRs
  • gh-67230: add quoting rules to csv module #29469
  • Files
  • issue23041.patch: Patch for resolving issue 23041. According to message 232560 and 232563
  • issue23041_test.patch: Test cases added for testing as behaviour proposed in message 232560
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/smontanaro'
    closed_at = None
    created_at = <Date 2014-12-12.16:36:45.576>
    labels = ['easy', 'type-feature', 'library', '3.11']
    title = 'csv needs more quoting rules'
    updated_at = <Date 2022-02-24.00:07:17.586>
    user = 'https://bugs.python.org/samwyse'

    bugs.python.org fields:

    activity = <Date 2022-02-24.00:07:17.586>
    actor = 'samwyse'
    assignee = 'skip.montanaro'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2014-12-12.16:36:45.576>
    creator = 'samwyse'
    dependencies = []
    files = ['37444', '37445']
    hgrepos = []
    issue_num = 23041
    keywords = ['patch', 'easy']
    message_count = 26.0
    messages = ['232560', '232561', '232563', '232596', '232607', '232630', '232676', '232677', '232681', '261141', '261146', '261147', '341460', '358461', '396621', '396641', '396642', '396643', '401607', '401608', '405951', '406013', '406015', '406017', '412463', '413867']
    nosy_count = 11.0
    nosy_names = ['skip.montanaro', 'rhettinger', 'samwyse', 'r.david.murray', 'berker.peksag', 'serhiy.storchaka', 'krypten', 'yoonghm', 'tegdev', 'erdnaxeli', 'msetina']
    pr_nums = ['29469']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue23041'
    versions = ['Python 3.11']

    Linked PRs

    @samwyse
    Copy link
    Mannequin Author

    samwyse mannequin commented Dec 12, 2014

    The csv module currently implements four quoting rules for dialects: QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC and QUOTE_NONE. These rules treat values of None the same as an empty string, i.e. by outputting two consecutive quotes. I propose the addition of two new rules, QUOTE_NOTNULL and QUOTE_STRINGS. The former behaves like QUOTE_ALL while the later behaves like QUOTE_NONNUMERIC, except that in both cases values of None are output as an empty field. Examples follow.

    Current behavior (which will remain unchanged)

    >>> csv.register_dialect('quote_all', quoting=csv.QUOTE_ALL)
    >>> csv.writer(sys.stdout, dialect='quote_all').writerow(['foo', None, 42])
    "foo","","42"
    
    >>> csv.register_dialect('quote_nonnumeric', quoting=csv.QUOTE_NONNUMERIC)
    >>> csv.writer(sys.stdout, dialect='quote_nonnumeric').writerow(['foo', None, 42])
    "foo","",42

    Proposed behavior

    >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
    >>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['foo', None, 42])
    "foo",,"42"
    
    >>> csv.register_dialect('quote_strings', quoting=csv.QUOTE_STRINGS)
    >>> csv.writer(sys.stdout, dialect='quote_strings').writerow(['foo', None, 42])
    "foo",,42

    @samwyse samwyse mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Dec 12, 2014
    @bitdancer
    Copy link
    Member

    As an enhancement, this could be added only to 3.5. The proposal sounds reasonable to me.

    @bitdancer bitdancer added the easy label Dec 12, 2014
    @samwyse
    Copy link
    Mannequin Author

    samwyse mannequin commented Dec 12, 2014

    David: That's not a problem for me.

    Sorry I can't provide real patches, but I'm not in a position to compile (much less test) the C implementation of _csv. I've looked at the code online and below are the changes that I think need to be made. My use cases don't require special handing when reading empty fields, so the only changes I've made are to the code for writers. I did verify that the reader code mostly only checks for QUOTE_NOTNULL when parsing. This means that completely empty fields will continue to load as zero-length strings, not None. I won't stand in the way of anyone wanting to "fix" that for these new rules.

    typedef enum {
    QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, QUOTE_NONE,
    QUOTE_STRINGS, QUOTE_NOTNULL
    } QuoteStyle;

    static StyleDesc quote_styles[] = {
    { QUOTE_MINIMAL, "QUOTE_MINIMAL" },
    { QUOTE_ALL, "QUOTE_ALL" },
    { QUOTE_NONNUMERIC, "QUOTE_NONNUMERIC" },
    { QUOTE_NONE, "QUOTE_NONE" },
    { QUOTE_STRINGS, "QUOTE_STRINGS" },
    { QUOTE_NOTNULL, "QUOTE_NOTNULL" },
    { 0 }
    };

            switch (dialect->quoting) {
            case QUOTE_NONNUMERIC:
                quoted = !PyNumber_Check(field);
                break;
            case QUOTE_ALL:
                quoted = 1;
                break;
            case QUOTE_STRINGS:
                quoted = PyString_Check(field);
                break;
            case QUOTE_NOTNULL:
                quoted = field != Py_None;
                break;
            default:
                quoted = 0;
                break;
            }

    " csv.QUOTE_MINIMAL means only when required, for example, when a\n"
    " field contains either the quotechar or the delimiter\n"
    " csv.QUOTE_ALL means that quotes are always placed around fields.\n"
    " csv.QUOTE_NONNUMERIC means that quotes are always placed around\n"
    " fields which do not parse as integers or floating point\n"
    " numbers.\n"
    " csv.QUOTE_STRINGS means that quotes are always placed around\n"
    " fields which are strings. Note that the Python value None\n"
    " is not a string.\n"
    " csv.QUOTE_NOTNULL means that quotes are only placed around fields\n"
    " that are not the Python value None.\n"

    @rhettinger
    Copy link
    Contributor

    Samwyse, are these suggestions just based on ideas of what could be done or have you encountered real-world CSV data exchanges that couldn't be handled by the CSV module?

    @smontanaro
    Copy link
    Contributor

    It doesn't look like a difficult change, but is it really needed? I guess my reaction is the same as Raymond's. Are there real-world uses where the current set of quoting styles isn't sufficient?

    @krypten
    Copy link
    Mannequin

    krypten mannequin commented Dec 14, 2014

    Used function PyUnicode_Check instead of PyString_Check

    @samwyse
    Copy link
    Mannequin Author

    samwyse mannequin commented Dec 15, 2014

    Yes, it's based on a real-world need. I work for a Fortune 500 company and we have an internal tool that exports CSV files using what I've described as the QUOTE_NOTNULL rules. I need to create similar files for re-importation. Right now, I have to post-process the output of my Python program to get it right. I added in the QUOTE_STRINGS rule for completeness. I think these two new rules would be useful for anyone wanting to create sparse CSV files.

    @smontanaro
    Copy link
    Contributor

    If I understand correctly, your software needs to distinguish between

    # wrote ["foo", "", 42, None] with quote_all in effect
    "foo","","42",""

    and

    # wrote ["foo", None, 42, ""] with quote_nonnull in effect
    "foo",,"42",""

    so you in effect want to transmit some type information through a CSV file?

    @samwyse
    Copy link
    Mannequin Author

    samwyse mannequin commented Dec 15, 2014

    Skip, I don't have any visibility into how the Java program I'm feeding data into works, I'm just trying to replicate the csv files that it exports as accurately as possible. It has several other quirks, but I can replicate all of them using Dialects; this is the only "feature" I can't. The files I'm looking at have quoted strings and numbers, but there aren't any quoted empty strings. I'm using a DictWriter to create similar csv files, where missing keys are treated as values of None, so I'd like those printed without quotes. If we also want to print empty strings without quotes, that wouldn't impact me at all.

    Besides my selfish needs, this could be useful to anyone wanting to reduce the save of csv files that have lots of empty fields, but wants to quote all non-empty values. This may be an empty set, I don't know.

    @smontanaro
    Copy link
    Contributor

    Thanks for the update berker.peksag. I'm still not convinced that the csv module should be modified just so one user (sorry samwyse) can match the input format of someone's Java program. It seems a bit like trying to make the csv module type-sensitive. What happens when someone finds a csv file containing timestamps in a format other than the datetime.datetime object will produce by default? Why is None special as an object where bool(obj) is False?

    I think the better course here is to either:

    • subclass csv.DictWriter, use dictionaries as your element type, and have its writerow method do the application-specific work.

    • define a writerow() function which does something similar (essentially wrapping csv.writerow()).

    If someone else thinks this is something which belongs in Python's csv module, feel free to reopen and assign it to yourself.

    @serhiy-storchaka
    Copy link
    Member

    The csv module is already type-sensitive (with QUOTE_NONNUMERIC). I agree, that we shouldn't modify the csv module just for one user and one program.

    If a standard CVS library in Java (or other popular laguages) differentiates between empty string and null value when read from CSV, it would be a serious argument to support this in Python. Quick search don't find this.

    @berkerpeksag
    Copy link
    Member

    I was thinking adding a more flexible API like:

    ...
    spamwriter = csv.writer(csvfile, quoting_callable=lambda field: field is not None)
    ...
    

    But that would require too much change in the csv module (or at least its implementation wouldn't be trivial).

    I agree that subclassing DictWriter is a much better way to achieve this.

    @tegdev
    Copy link
    Mannequin

    tegdev mannequin commented May 5, 2019

    The correct handling of None values belongs to the csv module.

    There is a use case to migrate a DB2 database to PostgreSQL.
    DB2 has a command line tool "db2 export ..." which produces csv-files.
    A row
    ['Hello', null, 'world']
    is exported to
    "Hello,,"world".

    I would like to read in these exports with python and put it to PostgreSQL.

    But with the csv library I can't read it in correctly. The input is converted to:
    ['Hello', '', 'world']
    It should read as:
    ['Hello', None, 'world']

    It is pretty easy to write a correct CSV reader with ANTLR but it's terribly slow.

    And last but not least: if someone writes a list the reading should the identity.
    Thats not True for the csv libraray.

    Example:

    import csv
    hello_out_lst = ['Hello', None, 'world']
    with open('hello.csv', 'w') as ofh:
        writer = csv.writer(ofh, delimiter=',')
        writer.writerow(hello_out_lst)
    
    with open('hello.csv', 'r') as ifh:
        reader = csv.reader(ifh, delimiter=',')
        for row in reader:
            hello_in_lst = row
    
    is_equal = hello_out_lst == hello_in_lst
    print(f'{hello_out_lst} is equal {hello_in_lst} ?  {is_equal}')

    The result is:
    ['Hello', None, 'world'] is equal ['Hello', '', 'world'] ? False

    @yoonghm
    Copy link
    Mannequin

    yoonghm mannequin commented Dec 16, 2019

    There is a real requirement for csv to handle an empty field vs a empty string """". csv.QUOTE_NOTNULL could be useful.

    @yoonghm yoonghm mannequin added the 3.8 only security fixes label Dec 16, 2019
    @erdnaxeli
    Copy link
    Mannequin

    erdnaxeli mannequin commented Jun 28, 2021

    I have another use case. I want to import data into PostgreSQL using the COPY FROM command. This command can read a CSV input and it needs to distinguish empty string from null values.

    Could we reconsider this issue and the proposed solution?

    @smontanaro
    Copy link
    Contributor

    Okay, I'll reopen this, at least for the discussion of QUOTE_NONNULL. @erdnaxeli please given an example of how PostgreSQL distinguishes between the empty string and None cases. Is it a quoted empty string vs an empty field? If so, modifying @samwyse's original example, is this what you are after?

    >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
    >>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['', None, 42])
    "",,"42"

    ? Can you modify the original two patches to restrict to QUOTE_NONNULL?

    @smontanaro smontanaro reopened this Jun 28, 2021
    @smontanaro
    Copy link
    Contributor

    Missed tweaking a couple settings.

    @smontanaro smontanaro added 3.11 bug and security fixes and removed 3.8 only security fixes labels Jun 28, 2021
    @smontanaro
    Copy link
    Contributor

    Ugh... s/QUOTE_NONNULL/QUOTE_NOTNULL/

    Not, Non, None... Perl would treat them all the same, right? <wink>

    @msetina
    Copy link
    Mannequin

    msetina mannequin commented Sep 10, 2021

    The MS Synapse has a CSV support in its COPY command that would benefit from the proposed csv.QUOTE_NOTNULL as it can be used when preparing data for import. As they reference the same RFC the CSV modul should support, both the proposed modifications would extend the RFC functionality with support for NULL/None values in the data.
    The csv.QUOTE_STRINGS would enable a smaller file size.
    Our case is that we are moving data into Synapse and have to resort to risk prone string replace functionality to provide the MS Synapse COPY a proper support for empty string and NULL

    @msetina msetina mannequin added 3.7 (EOL) end of life 3.8 only security fixes labels Sep 10, 2021
    miss-islington pushed a commit that referenced this issue Apr 12, 2023
    Add two quoting styles for csv dialects.
    They will help to work with certain databases in particular.
    
    Automerge-Triggered-By: GH:merwok
    @merwok
    Copy link
    Member

    merwok commented Apr 12, 2023

    PR merged, thanks folks!

    I realize too late that we should have added a note in whatsnew. Can someone make a quick PR?

    @smontanaro
    Copy link
    Contributor

    PR merged, thanks folks!

    Thanks for your quick work Éric.

    I realize too late that we should have added a note in whatsnew. Can someone make a quick PR?

    Not sure I did it correctly, but here's a simple PR.

    #103491

    @samwyse
    Copy link
    Contributor

    samwyse commented Apr 13, 2023

    Whoop! Whoop! Whoop!
    It only took eight years, but it's finally across the finish line!
    Many thanks to Skip, Éric, Serhiy, and everyone else who helped make this possible.

    @msetina
    Copy link

    msetina commented Apr 14, 2023

    Thanks everyone who made this possible. CSV is a funny place.Now we meditate until official release and distros to pick it up.....

    @serhiy-storchaka
    Copy link
    Member

    Wait, #29469 only changes writer, not reader. There are no tests for reader with new quoting styles, so we cannot be sure that nothing is broken.

    Also, why do we need both QUOTE_STRINGS and QUOTE_NOTNULL? In what cases the difference is meaningful?

    @samwyse, could you please provide external links to specifications of the difference between "" and an empty field?

    @smontanaro
    Copy link
    Contributor

    There are no tests for reader with new quoting styles, so we cannot be sure that nothing is broken.

    It's quite possible I failed to properly describe the changes in whatsnew. It does, indeed, only affect the writer, but as far as I can tell, the writer tests do exercise that code. Here's the switch statement from csv_writerow as noted by llvm-cov:

          899: 1263:        switch (dialect->quoting) {
            -: 1264:        case QUOTE_NONNUMERIC:
            4: 1265:            quoted = !PyNumber_Check(field);
            4: 1266:            break;
            -: 1267:        case QUOTE_ALL:
           10: 1268:            quoted = 1;
           10: 1269:            break;
            -: 1270:        case QUOTE_STRINGS:
            4: 1271:            quoted = PyUnicode_Check(field);
            4: 1272:            break;
            -: 1273:        case QUOTE_NOTNULL:
            4: 1274:            quoted = field != Py_None;
            4: 1275:            break;
            -: 1276:        default:
          877: 1277:            quoted = 0;
          877: 1278:            break;
            -: 1279:        }
    
    

    Perhaps all that needs doing is to fix my hurried whatsnew entry to remove the reader reference.

    @merwok
    Copy link
    Member

    merwok commented Apr 14, 2023

    It does, indeed, only affect the writer

    Oh oops! Does this mean that the csv module can write files with the new quoting styles that it can’t read back?

    @smontanaro
    Copy link
    Contributor

    @merwok It does kinda seem that way. Is half a loaf better than none? I'm not the person who wrote (or asked for) the changes, as I don't actually do anything unusual with CSV files (just the normal stuff...).

    @smontanaro
    Copy link
    Contributor

    Looking at @samwyse's original post above, it seems all he wanted was to modify the writer. Sam, can you comment whether you think these two new data items should (try to) round trip?

    @samwyse
    Copy link
    Contributor

    samwyse commented Apr 14, 2023

    Back when I first submitted this, I was only interested in the writer. Later, people discussed being able to round trip the data, so I added references to the reader. We can drop the those references and open new request to add those.

    As I was coding up the new tests for the writer, it occurred to me that we should use the same input for all quoting roles, so that the differences in the quoting rules are more obvious.

    What is the dropdead date for 3.12?

    @terryjreedy
    Copy link
    Member

    Beta1 is due in early May. However, later betas are for fixing bugs in new features included in beta1.

    I understand that there is a use for writing files to be written elsewhere, but round-tripping would seem better to me.

    @samwyse
    Copy link
    Contributor

    samwyse commented Apr 14, 2023

    Adding two more tests for readers is trivial. As I look at _csv.c, however, I can't really see an easy way to distinguish between «,,» and «,"",». So unless someone more knowledgeable than I thinks that it's simple, I propose that we punt this down the road, making the two new quoting rules illegal for reader objects for the time being. I am updating my branch to do this now.

    @samwyse
    Copy link
    Contributor

    samwyse commented Apr 15, 2023

    I think I have made it, so that using the new quoting rules in reader objects will raise an error. Note that I am not entirely confident in my ability to copy and paste C code.

    @smontanaro
    Copy link
    Contributor

    Thinking more about the problem, I think it's wrong to do anything to the reader. In its entire history, the reader has never done any sort of type inference. That would be exactly what distinguishing an empty field from a quoted empty field would be, a (small) bit of type inference. Once you let the camel's nose under the tent, he's going to want to join the party and will ask for type inference of float, int, complex, and who knows what-all.

    The better course at this point would be to adjust the docs to state that the two new quoting rules have no effect in the reader. My apologies for not thinking of this sooner (like a year or two ago).

    @samwyse
    Copy link
    Contributor

    samwyse commented Apr 16, 2023

    I'm not going to fire up the interpreter to check, but I thought that if a field looked like a number, then it was converted to float, and I further thought that only applied to unquoted fields. But my memory has been awful since the pandemic. Anyway, I like that plan.

    @smontanaro
    Copy link
    Contributor

    Python's csv module always returns strings. The Pandas CSV file read does infer types though.

    % cat names.csv
    first_name,last_name,age
    Eric,Idle,47
    John,Cleese,99
    % python
    Python 3.12.0a7+ (heads/main-dirty:330a942b63, Apr 14 2023, 06:59:05) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import csv
    >>> for row in csv.reader(open("names.csv")):
    ...   print(row)
    ... 
    ['first_name', 'last_name', 'age']
    ['Eric', 'Idle', '47']
    ['John', 'Cleese', '99']
    >>> ^D
    

    @msetina
    Copy link

    msetina commented Apr 16, 2023

    If I can add my pinch of salt. Here we colide data and programming world. In data world None is a value of a type. In Python None is a separate type.
    Using a paradigm from data world Oracle will equate empty string in strings with None. MS would not.
    On the reader side if you have a CSV to read and you want to preserve NULL values normally in non quoted there would be a special character combination (^N or something) to represent a NULL, so reader would be able to parse it.
    Then came Microsoft where someone was not checking with others and used quoted values which are more expensive and thought that it was a good idea to use empty field for None. And for this a reader needs to infere None.
    In Python using non quoted values in CSV is good, but there is an if, when users encounter MS produced CSV and want None values preserved. Maybe MS is not the only one, but they stepped on my toe.

    @smontanaro
    Copy link
    Contributor

    I certainly understand the desire to interoperate with the MS CSV ecosystem, as that has traditionally been the de facto "spec" for what it means to read and write CSV files. Still, I think the topic would be too involved to deal with properly between now and May 1. I propose limiting the change for now to just the writer, fixing any doc bits which suggest there are any changes to the reader, then going through the usual PEP process to investigate making changes to the semantics of the reader.

    aisk pushed a commit to aisk/cpython that referenced this issue Apr 18, 2023
    Add two quoting styles for csv dialects.
    They will help to work with certain databases in particular.
    
    Automerge-Triggered-By: GH:merwok
    @smontanaro
    Copy link
    Contributor

    Would someone like to work on a PEP regarding type conversion in the reader?

    carljm added a commit to carljm/cpython that referenced this issue Apr 20, 2023
    * main: (24 commits)
      pythongh-98040: Move the Single-Phase Init Tests Out of test_imp (pythongh-102561)
      pythongh-83861: Fix datetime.astimezone() method (pythonGH-101545)
      pythongh-102856: Clean some of the PEP 701 tokenizer implementation (python#103634)
      pythongh-102856: Skip test_mismatched_parens in WASI builds (python#103633)
      pythongh-102856: Initial implementation of PEP 701 (python#102855)
      pythongh-103583: Add ref. dependency between multibytecodec modules (python#103589)
      pythongh-83004: Harden msvcrt further (python#103420)
      pythonGH-88342: clarify that `asyncio.as_completed` accepts generators yielding tasks (python#103626)
      pythongh-102778: IDLE - make sys.last_exc available in Shell after traceback (python#103314)
      pythongh-103582: Remove last references to `argparse.REMAINDER` from docs (python#103586)
      pythongh-103583: Always pass multibyte codec structs as const (python#103588)
      pythongh-103617: Fix compiler warning in _iomodule.c (python#103618)
      pythongh-103596: [Enum] do not shadow mixed-in methods/attributes (pythonGH-103600)
      pythonGH-100530: Change the error message for non-class class patterns (pythonGH-103576)
      pythongh-95299: Remove lingering setuptools reference in installer scripts (pythonGH-103613)
      [Doc] Fix a typo in optparse.rst (python#103504)
      pythongh-101100: Fix broken reference `__format__` in `string.rst` (python#103531)
      pythongh-95299: Stop installing setuptools as a part of ensurepip and venv (python#101039)
      pythonGH-103484: Docs: add linkcheck allowed redirects entries for most cases (python#103569)
      pythongh-67230: update whatsnew note for csv changes (python#103598)
      ...
    @hugovk hugovk removed the easy label Sep 8, 2023
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.12 bugs and security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    Status: Done
    Development

    No branches or pull requests