-
-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSV Null Byte Error #71767
Comments
I think this has been asked before, but it has been awhile and I think needs to be re-addressed. Why doesn't the CSV module handle NULL bytes? Let me do some examples, ascii = [chr(n) for n in range(256)] #No errors
print(ascii) #No errors
print(dict(zip(ascii,ascii))) #No errors open('temp','r').writelines(ascii) #No errors Python hsndles every ascii chr, DEL, ESC, RETURN, etc. It displays those characters, uses them as keys or values, etc. But now try, import csv
data = csv.DictReader(ascii)
for line in data: print(line) #NULL Byte Error But we can quick fix this easily, ascii.pop(0)
data = csv.DictReader(ascii)
for line in data: print(line) #No Errors Doesn't it seem odd that we allow the use of every chr as keys, vslues, prints, etc. but then we hold a special exception for using the csv module and yhat special exception is not the ESC or DEL or any other non-printable chr, the exception is for the NULL? |
By "discussed before" I presume you are referring to bpo-1599055. It is true that have been changes since then in Python's handling of null bytes. Perhaps it is indeed time to revisit this. I'll leave that to the experts...this can be closed as a duplicate of bpo-1599055 if I'm wrong about things having possibly changed in the interim. |
Beyond whether or not the csv module can handle NUL bytes, you might figure out if Excel will. Since the CSV format isn't some sort of "standard", its operational definition has always been what Excel will produce or consume. I don't have Excel (not a Windows guy), but I do have Gnumeric and LibreOffice. I constructed a simple CSV file by hand which contains several NUL bytes to see what they would do. Gnumeric pops up a dialog and converts them to spaces (and then, oddly enough, doesn't think the file has been modified). LibreOffice didn't complain while loading the file, but when I saved it, it silently deleted the NULs. I've attached the file should anyone like to experiment with other spreadsheets. |
@ Skip Montanaro, Actually, CSV has nothing to do with excel. Excel is a gui that processes CSV like many other programs. CSV stands for comma seperated values and is basic standard for data.https://en.m.wikipedia.org/wiki/Comma-separated_values As far as I know, many programs handle NULL bytes fine (either as a empty atring or a # or something else). In any case, it really should be irrelevent how some particular programs handle a CSV file. |
Bobby: Skip is one of the authors of the csv module, and has been maintaining it since it was added to the standard library. He knows whereof he speaks: there is no standard for csv (as noted in the article you link), and all csv parsers that want to be interoperable refer back to Microsoft's implementation when dealing with any quirks. That implementation currently is Excel. That said, your are right that others have adopted the format, and there is an argument to be made that we don't have to *limit* ourselves to what Microsoft supports. Although we probably don't want to be emitting stuff that they don't without being explicit about it. |
I am sorry I must have mis-read the history part of that article. |
I wasn't familiar with RFC 4180. (Or, quite possibly, I forgot I used to be familiar with it.) Note that at the bottom of the BNF definition of the file structure is the definition of TEXTDATA: TEXTDATA = %x20-21 / %x23-2B / %x2D-7E That pretty explicitly excludes NUL bytes (%x00 I think). I'd still like someone load my nul.csv file into Excel and report back what it does (or doesn't) do with it. |
I attempted to open the file in excel; raised no errors. In any case (regardless of Microsoft-concerns), I am glad to see a discussion started and possibly some concern that an update might be useful to the community (it would certainly cut down on the number of stack-exchange posts about this very topic). I certainly would put my vote to have csv handle the NULL byte, in the same way as python does natively. Thanks for you time. Oh, and since you are one of the author's, thanks for writing/working on the csv module, VERY USEFUL. :-) |
Thanks. The display you showed looks about like I saw in LibreOffice. If you export it back to another CSV file, does the new file match the original exactly, or does (like LibreOffice) it save a file without NUL bytes? I don't mind having the discussion, but even though we have traditionally treated CSV files as binary files in Python (at least when I was closely involved in the 2.x days), that was mostly so end-of-line sequences weren't corrupted. As others have pointed out, in 2.x Python String objects stored the data as a normal NUL-terminated pointer-to-char for efficiency when interacting with C libraries. C uses NUL as a string terminator, so we couldn't work with embedded NULs. I haven't looked at the 3.x string stuff (I know Unicode is much more intimately involved). If it still maintains that close working relationship with the typical C strings, supporting NUL bytes will be problematic. In cases where the underlying representation isn't quite what I want, I've been able to get away with a file wrapper which suitably mangles the input before passing it up the chain to the csv module. For example, the __next__ method of your file wrapper could delete NULs or replace them with something suitably innocuous, like "\001", or some other non-printable character you are certain won't be in the input. If you want to preserve NULs, reverse the translation during the write(). |
Forgive my frustration, but @Skip I really don't see how the definition of CSV relating to Excel (or Gnumeric or LibreOffice) has any relevance as to whether or not the module (and perhaps Python more generally) supports chr(0x00) as a delimiter. (Neither you nor I get to decide how someone else might write output data...) While the module is called CSV, it's really not just *Comma* Separated Values - rather it's a rough approximation of a database table with an optional header row where rows/records are separated by <record separator> and fields are separated by <field separator>. Sometimes the record separator is chr(0x2c) (e.g. a comma) sometimes it's chr(0x09) (e.g. a tab - or in ASCII parlance "Horizontal Tab/HT") ... or maybe even the actual ASCII "Record Separator" character (e.g. chr(0x1e)) ... or maybe NUL chr(0x00). (1) The module should be 100% agnostic about the separator - the current (3.8.3) error text when trying to use csv.reader(..., delimiter=chr(0x00)) is "TypeError: "delimiter" must be a 1-character string" ... well, chr(0x00) *is* a 1-character string. It's not a 1-character *printable* string... But then again neither is chr(0x1e) (ASCII "RS" Record Separator) .. and csv.reader(..., delimiter=chr(0x1e)) appears to work (I haven't tried actual data yet). (1a) The use of chr(0x00) or '\0' is used quite often in the *NIX world as a convenient record separator that doesn't have escaping problems because by it's very nature it's non-printable. e.g. "find . -iname "*something*" -print0 | xargs -0 <program>" ... As to the difficulty in handling 0x00 characters, I dunno ... it appears that GNU find, xargs, gawk... same with FreeBSD. FreeBSD writes the output for "-print0" like this: https://github.com/freebsd/freebsd/blob/508f3673dec94b03f89b9ce9569390d6d9b86a89/usr.bin/find/function.c#L1383 ... and bsd xargs handles it too. I haven't looked at the CPython source to see what's going on - it might be tricky to modify the code to support this... (but then again, IMHO, this sort of thing should have been a consideration in the first place....) I suppose in many ways, the very existence of this specific issue at all is just one example of what seems to be a larger issue with Python's overall development: It's a great language for *many* things and in many ways. But I've run into so many little fringe "gotchas" where something doesn't work or is limited in some way because, seemingly, functionality is designed around/defined by a practical-example-use-case and not what is or might be *possible* (e.g. the CSV-as-only-a-spreadsheet-interface example -- and I really *don't* mean that as a personal attack @Skip - I am very appreciative of the time and effort you and everyone has poured into the project...) Is it possible to write a NUL (0x00) character to a file? Through a *NIX pipe? You bet. (I got a little rant-y .. sorry... I'm sure there's a _lot_ more going on underneath the covers and there are a lot of factors - not limited to just the csv module - as you mentioned. I just really feel like something is "off". Maybe it's my brain - ha. :)) |
I feel like the CSV module could support using NULL as separator and could send a PR with some test to add this but Daniel could you give more information regarding why you need explicit CSV support for this? The CSV parser is needed to read and write CSV files to can be produced by other programs and has been extended to support the various dialects, most of them produced by spreadsheet programs.
Yes and all of those are supported natively by Python already:
Multilines input can be supported as easily doing: data = [
line.split('\x00')
for line in input()
] Is this not enough? |
Writing to those files is obviously as easy, since like you said "because by it's very nature it's non-printable" and you will probably not find it in your data: with open('file', 'w') as f:
f.write('\x00'.join(data)) It will break if there is a NULL byte in your data, and CSV would quote the element properly instead, but so would "find . -iname "*something*" -print0 | xargs -0 <program>" if one of the file had a NULL byte in their name. I don't think Python is being unreasonable here, especially considering it has the same drawbacks as the other Unix utilities. |
I'm sorry, but why is this issue coming up again after nearly four years? Especially without a patch? (I apologize. I've gotten a bit more grumpy as I've aged.) Let me summarize a bit of history. Back in the early 2000s, Dave Cole at Object Craft in Australia implemented a C-based CSV module for Python. I don't know exactly what version was his initial target, but I have a vague memory that it was Python 1.5-ish. A few years later that was cleaned up and adapted for inclusion into the Python standard library, being added to Python 2.3. I helped with some of the Python stuff and steered it into the core. Much later (Python 3.x - with the great Unicode unification), it acquired Unicode support. At the time of creation, the only "standard" for CSV files was what Excel could read and write. One of the initial requirements as I recall was that CSV files generated by the module had to be able to survive a round trip to Excel and back. As I indicated in at least one of my previous messages to this thread, knowing how Excel handles CSV files with NUL bytes would be (at least) interesting. I still think so. Can anyone test that? I'm not trying to suggest that gracefully handling NUL bytes wouldn't be useful in certain contexts (I also use find with -print0 routinely to preserve filenames with spaces), and if the CSV module was being written from scratch today, perhaps NUL support would be included. I am happy with the current CSV module as it exists today. I still use it routinely. Adding NUL support wouldn't scratch any itch I have. If you want NUL byte support, either as a delimiter or as a character in a cell, I think you're going to have to submit a patch. Otherwise I suggest this be closed rather than languishing for another four years. |
Looks like my comment removed Remi from the nosy list. Restoring that... |
Hi Skip, I'm on a Mac so I tried Numbers, the spreadsheet tool that comes with MacOS: Opening and exporting to CSV again works: ➜ ~ cat -v ~/Documents/nul.csv When exporting to TSV, it makes the correct changes: ➜ ~ cat -v ~/Documents/nul.tsv |
It looks to me that rejecting the null character is just an implementation artifact. 0 is used both as signalling value for "not set" character parameter and as the end-of-line character in the state automata. We can use different values outside of the range of Unicode characters (0-0x10ffff). It will automatically enable support of null characters. |
The NUL check was around for a long time, available to be used (XKCD-1172 style) as a simple check against reading binary files. The change did break tests of distutils. |
*docutils, not distutils |
In Python 3.11 CSV files are allowed to have null characters so the test data is no longer a valid. We update it to not have a valid unicode character to still test this code path correctly. I'm not concerned about the fact that the files with null will get past this test beacause there are other checks on the content of the file that catch if it doesn't have enough or the right fields so this should be a safe change to make to the tests. Relevant Change in Python: python/cpython#71767
In Python 3.11 CSV files are allowed to have null characters so the test data is no longer a valid. We update it to not have a valid unicode character to still test this code path correctly. I'm not concerned about the fact that the files with null will get past this test beacause there are other checks on the content of the file that catch if it doesn't have enough or the right fields so this should be a safe change to make to the tests. Relevant Change in Python: python/cpython#71767
In Python 3.11 CSV files are allowed to have null characters so the test data is no longer a valid. We update it to not have a valid unicode character to still test this code path correctly. I'm not concerned about the fact that the files with null will get past this test beacause there are other checks on the content of the file that catch if it doesn't have enough or the right fields so this should be a safe change to make to the tests. Relevant Change in Python: python/cpython#71767
In Python 3.11 CSV files are allowed to have null characters so the test data is no longer a valid. We update it to not have a valid unicode character to still test this code path correctly. I'm not concerned about the fact that the files with null will get past this test beacause there are other checks on the content of the file that catch if it doesn't have enough or the right fields so this should be a safe change to make to the tests. Relevant Change in Python: python/cpython#71767
In Python 3.11 CSV files are allowed to have null characters so the test data is no longer a valid. We update it to not have a valid unicode character to still test this code path correctly. I'm not concerned about the fact that the files with null will get past this test beacause there are other checks on the content of the file that catch if it doesn't have enough or the right fields so this should be a safe change to make to the tests. Relevant Change in Python: python/cpython#71767
In Python 3.11 CSV files are allowed to have null characters so the test data is no longer a valid. We update it to not have a valid unicode character to still test this code path correctly. I'm not concerned about the fact that the files with null will get past this test beacause there are other checks on the content of the file that catch if it doesn't have enough or the right fields so this should be a safe change to make to the tests. Relevant Change in Python: python/cpython#71767
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: