New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
csv needs more quoting rules #67230
Comments
The csv module currently implements four quoting rules for dialects: QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC and QUOTE_NONE. These rules treat values of None the same as an empty string, i.e. by outputting two consecutive quotes. I propose the addition of two new rules, QUOTE_NOTNULL and QUOTE_STRINGS. The former behaves like QUOTE_ALL while the later behaves like QUOTE_NONNUMERIC, except that in both cases values of None are output as an empty field. Examples follow. Current behavior (which will remain unchanged) >>> csv.register_dialect('quote_all', quoting=csv.QUOTE_ALL)
>>> csv.writer(sys.stdout, dialect='quote_all').writerow(['foo', None, 42])
"foo","","42"
>>> csv.register_dialect('quote_nonnumeric', quoting=csv.QUOTE_NONNUMERIC)
>>> csv.writer(sys.stdout, dialect='quote_nonnumeric').writerow(['foo', None, 42])
"foo","",42 Proposed behavior >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
>>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['foo', None, 42])
"foo",,"42"
>>> csv.register_dialect('quote_strings', quoting=csv.QUOTE_STRINGS)
>>> csv.writer(sys.stdout, dialect='quote_strings').writerow(['foo', None, 42])
"foo",,42 |
As an enhancement, this could be added only to 3.5. The proposal sounds reasonable to me. |
David: That's not a problem for me. Sorry I can't provide real patches, but I'm not in a position to compile (much less test) the C implementation of _csv. I've looked at the code online and below are the changes that I think need to be made. My use cases don't require special handing when reading empty fields, so the only changes I've made are to the code for writers. I did verify that the reader code mostly only checks for QUOTE_NOTNULL when parsing. This means that completely empty fields will continue to load as zero-length strings, not None. I won't stand in the way of anyone wanting to "fix" that for these new rules. typedef enum { static StyleDesc quote_styles[] = { switch (dialect->quoting) {
case QUOTE_NONNUMERIC:
quoted = !PyNumber_Check(field);
break;
case QUOTE_ALL:
quoted = 1;
break;
case QUOTE_STRINGS:
quoted = PyString_Check(field);
break;
case QUOTE_NOTNULL:
quoted = field != Py_None;
break;
default:
quoted = 0;
break;
} " csv.QUOTE_MINIMAL means only when required, for example, when a\n" |
Samwyse, are these suggestions just based on ideas of what could be done or have you encountered real-world CSV data exchanges that couldn't be handled by the CSV module? |
It doesn't look like a difficult change, but is it really needed? I guess my reaction is the same as Raymond's. Are there real-world uses where the current set of quoting styles isn't sufficient? |
Used function PyUnicode_Check instead of PyString_Check |
Yes, it's based on a real-world need. I work for a Fortune 500 company and we have an internal tool that exports CSV files using what I've described as the QUOTE_NOTNULL rules. I need to create similar files for re-importation. Right now, I have to post-process the output of my Python program to get it right. I added in the QUOTE_STRINGS rule for completeness. I think these two new rules would be useful for anyone wanting to create sparse CSV files. |
If I understand correctly, your software needs to distinguish between # wrote ["foo", "", 42, None] with quote_all in effect and # wrote ["foo", None, 42, ""] with quote_nonnull in effect so you in effect want to transmit some type information through a CSV file? |
Skip, I don't have any visibility into how the Java program I'm feeding data into works, I'm just trying to replicate the csv files that it exports as accurately as possible. It has several other quirks, but I can replicate all of them using Dialects; this is the only "feature" I can't. The files I'm looking at have quoted strings and numbers, but there aren't any quoted empty strings. I'm using a DictWriter to create similar csv files, where missing keys are treated as values of None, so I'd like those printed without quotes. If we also want to print empty strings without quotes, that wouldn't impact me at all. Besides my selfish needs, this could be useful to anyone wanting to reduce the save of csv files that have lots of empty fields, but wants to quote all non-empty values. This may be an empty set, I don't know. |
Thanks for the update berker.peksag. I'm still not convinced that the csv module should be modified just so one user (sorry samwyse) can match the input format of someone's Java program. It seems a bit like trying to make the csv module type-sensitive. What happens when someone finds a csv file containing timestamps in a format other than the datetime.datetime object will produce by default? Why is None special as an object where bool(obj) is False? I think the better course here is to either:
If someone else thinks this is something which belongs in Python's csv module, feel free to reopen and assign it to yourself. |
The csv module is already type-sensitive (with QUOTE_NONNUMERIC). I agree, that we shouldn't modify the csv module just for one user and one program. If a standard CVS library in Java (or other popular laguages) differentiates between empty string and null value when read from CSV, it would be a serious argument to support this in Python. Quick search don't find this. |
I was thinking adding a more flexible API like:
But that would require too much change in the csv module (or at least its implementation wouldn't be trivial). I agree that subclassing DictWriter is a much better way to achieve this. |
The correct handling of None values belongs to the csv module. There is a use case to migrate a DB2 database to PostgreSQL. I would like to read in these exports with python and put it to PostgreSQL. But with the csv library I can't read it in correctly. The input is converted to: It is pretty easy to write a correct CSV reader with ANTLR but it's terribly slow. And last but not least: if someone writes a list the reading should the identity. Example: import csv
hello_out_lst = ['Hello', None, 'world']
with open('hello.csv', 'w') as ofh:
writer = csv.writer(ofh, delimiter=',')
writer.writerow(hello_out_lst)
with open('hello.csv', 'r') as ifh:
reader = csv.reader(ifh, delimiter=',')
for row in reader:
hello_in_lst = row
is_equal = hello_out_lst == hello_in_lst
print(f'{hello_out_lst} is equal {hello_in_lst} ? {is_equal}') The result is: |
There is a real requirement for csv to handle an empty field vs a empty string """". csv.QUOTE_NOTNULL could be useful. |
I have another use case. I want to import data into PostgreSQL using the COPY FROM command. This command can read a CSV input and it needs to distinguish empty string from null values. Could we reconsider this issue and the proposed solution? |
Okay, I'll reopen this, at least for the discussion of QUOTE_NONNULL. @erdnaxeli please given an example of how PostgreSQL distinguishes between the empty string and None cases. Is it a quoted empty string vs an empty field? If so, modifying @samwyse's original example, is this what you are after? >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
>>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['', None, 42])
"",,"42" ? Can you modify the original two patches to restrict to QUOTE_NONNULL? |
Missed tweaking a couple settings. |
Ugh... s/QUOTE_NONNULL/QUOTE_NOTNULL/ Not, Non, None... Perl would treat them all the same, right? <wink> |
The MS Synapse has a CSV support in its COPY command that would benefit from the proposed csv.QUOTE_NOTNULL as it can be used when preparing data for import. As they reference the same RFC the CSV modul should support, both the proposed modifications would extend the RFC functionality with support for NULL/None values in the data. |
Add two quoting styles for csv dialects. They will help to work with certain databases in particular. Automerge-Triggered-By: GH:merwok
PR merged, thanks folks! I realize too late that we should have added a note in whatsnew. Can someone make a quick PR? |
Thanks for your quick work Éric.
Not sure I did it correctly, but here's a simple PR. |
Whoop! Whoop! Whoop! |
Thanks everyone who made this possible. CSV is a funny place.Now we meditate until official release and distros to pick it up..... |
Wait, #29469 only changes writer, not reader. There are no tests for reader with new quoting styles, so we cannot be sure that nothing is broken. Also, why do we need both QUOTE_STRINGS and QUOTE_NOTNULL? In what cases the difference is meaningful? @samwyse, could you please provide external links to specifications of the difference between |
It's quite possible I failed to properly describe the changes in whatsnew. It does, indeed, only affect the writer, but as far as I can tell, the writer tests do exercise that code. Here's the switch statement from csv_writerow as noted by
Perhaps all that needs doing is to fix my hurried whatsnew entry to remove the |
Oh oops! Does this mean that the csv module can write files with the new quoting styles that it can’t read back? |
@merwok It does kinda seem that way. Is half a loaf better than none? I'm not the person who wrote (or asked for) the changes, as I don't actually do anything unusual with CSV files (just the normal stuff...). |
Looking at @samwyse's original post above, it seems all he wanted was to modify the writer. Sam, can you comment whether you think these two new data items should (try to) round trip? |
Back when I first submitted this, I was only interested in the writer. Later, people discussed being able to round trip the data, so I added references to the reader. We can drop the those references and open new request to add those. As I was coding up the new tests for the writer, it occurred to me that we should use the same input for all quoting roles, so that the differences in the quoting rules are more obvious. What is the dropdead date for 3.12? |
Beta1 is due in early May. However, later betas are for fixing bugs in new features included in beta1. I understand that there is a use for writing files to be written elsewhere, but round-tripping would seem better to me. |
Adding two more tests for readers is trivial. As I look at _csv.c, however, I can't really see an easy way to distinguish between «,,» and «,"",». So unless someone more knowledgeable than I thinks that it's simple, I propose that we punt this down the road, making the two new quoting rules illegal for reader objects for the time being. I am updating my branch to do this now. |
I think I have made it, so that using the new quoting rules in reader objects will raise an error. Note that I am not entirely confident in my ability to copy and paste C code. |
Thinking more about the problem, I think it's wrong to do anything to the reader. In its entire history, the reader has never done any sort of type inference. That would be exactly what distinguishing an empty field from a quoted empty field would be, a (small) bit of type inference. Once you let the camel's nose under the tent, he's going to want to join the party and will ask for type inference of float, int, complex, and who knows what-all. The better course at this point would be to adjust the docs to state that the two new quoting rules have no effect in the reader. My apologies for not thinking of this sooner (like a year or two ago). |
I'm not going to fire up the interpreter to check, but I thought that if a field looked like a number, then it was converted to float, and I further thought that only applied to unquoted fields. But my memory has been awful since the pandemic. Anyway, I like that plan. |
Python's
|
If I can add my pinch of salt. Here we colide data and programming world. In data world None is a value of a type. In Python None is a separate type. |
I certainly understand the desire to interoperate with the MS CSV ecosystem, as that has traditionally been the de facto "spec" for what it means to read and write CSV files. Still, I think the topic would be too involved to deal with properly between now and May 1. I propose limiting the change for now to just the writer, fixing any doc bits which suggest there are any changes to the reader, then going through the usual PEP process to investigate making changes to the semantics of the reader. |
Add two quoting styles for csv dialects. They will help to work with certain databases in particular. Automerge-Triggered-By: GH:merwok
Would someone like to work on a PEP regarding type conversion in the reader? |
* main: (24 commits) pythongh-98040: Move the Single-Phase Init Tests Out of test_imp (pythongh-102561) pythongh-83861: Fix datetime.astimezone() method (pythonGH-101545) pythongh-102856: Clean some of the PEP 701 tokenizer implementation (python#103634) pythongh-102856: Skip test_mismatched_parens in WASI builds (python#103633) pythongh-102856: Initial implementation of PEP 701 (python#102855) pythongh-103583: Add ref. dependency between multibytecodec modules (python#103589) pythongh-83004: Harden msvcrt further (python#103420) pythonGH-88342: clarify that `asyncio.as_completed` accepts generators yielding tasks (python#103626) pythongh-102778: IDLE - make sys.last_exc available in Shell after traceback (python#103314) pythongh-103582: Remove last references to `argparse.REMAINDER` from docs (python#103586) pythongh-103583: Always pass multibyte codec structs as const (python#103588) pythongh-103617: Fix compiler warning in _iomodule.c (python#103618) pythongh-103596: [Enum] do not shadow mixed-in methods/attributes (pythonGH-103600) pythonGH-100530: Change the error message for non-class class patterns (pythonGH-103576) pythongh-95299: Remove lingering setuptools reference in installer scripts (pythonGH-103613) [Doc] Fix a typo in optparse.rst (python#103504) pythongh-101100: Fix broken reference `__format__` in `string.rst` (python#103531) pythongh-95299: Stop installing setuptools as a part of ensurepip and venv (python#101039) pythonGH-103484: Docs: add linkcheck allowed redirects entries for most cases (python#103569) pythongh-67230: update whatsnew note for csv changes (python#103598) ...
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
Linked PRs
The text was updated successfully, but these errors were encountered: