Writing bytes using CSV module results in b prefixed strings #84939

sidhant007 · 2020-05-25T05:49:18Z

BPO	40762
Nosy	@smontanaro, @terryjreedy, @ericvsmith, @stevendaprano, @serhiy-storchaka, @remilapeyre, @sidhant007
PRs	bpo-40762: Fix writing bytes in csv #20371

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2020-05-30.01:56:53.535>
created_at = <Date 2020-05-25.05:49:18.340>
labels = ['type-feature', 'library', '3.10']
title = 'Writing bytes using CSV module results in b prefixed strings'
updated_at = <Date 2020-05-30.12:32:16.932>
user = 'https://github.com/sidhant007'

bugs.python.org fields:

activity = <Date 2020-05-30.12:32:16.932>
actor = 'eric.smith'
assignee = 'none'
closed = True
closed_date = <Date 2020-05-30.01:56:53.535>
closer = 'terry.reedy'
components = ['Library (Lib)']
creation = <Date 2020-05-25.05:49:18.340>
creator = 'sidhant'
dependencies = []
files = []
hgrepos = []
issue_num = 40762
keywords = ['patch']
message_count = 18.0
messages = ['369848', '369849', '369857', '369866', '369869', '369873', '369877', '369937', '369953', '369958', '370040', '370043', '370058', '370060', '370067', '370068', '370348', '370381']
nosy_count = 7.0
nosy_names = ['skip.montanaro', 'terry.reedy', 'eric.smith', 'steven.daprano', 'serhiy.storchaka', 'remi.lapeyre', 'sidhant']
pr_nums = ['20371']
priority = 'normal'
resolution = 'rejected'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue40762'
versions = ['Python 3.10']

sidhant007 · 2020-05-25T05:49:17Z

The following code

import csv
with open("abc.csv", "w") as f:
   data = [b'\xc2a9', b'\xc2a9']
   w = csv.writer(f)
   w.writerow(data)

writes "b'\xc2a9',b'\xc2a9'" in "abc.csv", i.e the b-prefixed byte string instead of the actual bytes.

Although one can argue that the write is done in text mode and not binary mode, however I believe the more natural expectation by the user will be to have the bytes written actually in the "abc.csv".

Can refer to this pandas-dev/pandas#9712 to see one of the issues this brings in Pandas. From the discussion on this issue, the main reasons of changing this behaviour in Python would be to avoid leaking python's encoding system syntax into a generic data exchange format, i.e CSV.

sidhant007 · 2020-05-25T05:52:45Z

The following code

import csv
with open("abc.csv", "w") as f:
   data = [b'\xc2a9', b'\xc2a9']
   w = csv.writer(f)
   w.writerow(data)

writes "b'\xc2a9',b'\xc2a9'" in "abc.csv", i.e the b-prefixed byte string instead of the actual bytes.

Although one can argue that the write is done in text mode and not binary mode, however I believe the more natural expectation by the user will be to have the bytes written actually in the "abc.csv". Also take note that writing in binary mode is not supported by csv, so the only method to write bytes is this.

Can refer to this pandas-dev/pandas#9712 to see one of the issues this brings in Pandas. From the discussion on this issue, the main reasons of changing this behaviour in Python would be to avoid leaking python's encoding system syntax into a generic data exchange format, i.e CSV.

serhiy-storchaka · 2020-05-25T07:43:58Z

According to the documentation (https://docs.python.org/3/library/csv.html#writer-objects):

"""
A row must be an iterable of strings or numbers for Writer objects and a dictionary mapping fieldnames to strings or numbers (by passing them through str() first) for DictWriter objects.
"""

Byte object is neither string nor number. Passing it through str() gives your what you get.

Run Python with option -b or -bb to catch such kind of errors.

sidhant007 · 2020-05-25T09:15:38Z

Yes, I do recognise that the current doc states that csv only supports strings and numbers.

However the use case here is motivated when the user wants to write bytes and numbers/strings mixed to a CSV file. Currently providing bytes to write to a CSV passes it through str() as you stated, however this results it into being being prefixed with a b. Ex. "b'A'" is written instead of "A".

This kind of output is in most real-life cases not useful for any other program (be it python or non-python) that ends up reading this csv file, since this CSV file consists of a Python-specific syntax.

As an example, if I write character "A" as a byte, i.e b'A' in a csv file, it will end up writing "b'A'", which is not what the user would have wanted in majority of the use-cases.

Another way to change this behaviour could be to redefine how str() method works on bytes in python so that python doesn't leak out this b-prefix notation in generic file systems (ex. CSV), however that is too fundamental of a change and will affect the entire codebase in large.

remilapeyre · 2020-05-25T09:46:24Z

As an example, if I write character "A" as a byte, i.e b'A' in a csv file

But you can't write b'A' in a csv file, what you can't do is write b'a'.decode() or b'a'.decode('latin1') or b'a'.decode('whatever') but the string representation of a byte string is dependant on the character encoding and it's not possible to guess it, which is why bytes and string were separated in Python3 in the first place.

Since csv can't write bytes, it gets a string representation of the by calling str(), as it would with any other objects.

which is not what the user would have wanted in majority of the use-cases

If you try to guess the encoding, you will obligatory fail in some case and the resulting file will be corrupted. The only way here is for your users to fix their program and decode thee byte string using the correct encoding before giving them to csv.

sidhant007 · 2020-05-25T11:57:30Z

Hi Remi,

Currently a code like this:

with open("abc.csv", "w", encoding='utf-8') as f:
    data = [b'\x41']
    w = csv.writer(f)
    w.writerow(data)
with open("abc.csv", "r") as f:
    rows = csv.reader(f)
    for row in rows:
        print(row[0]) # prints b'A'

Is able to write the string "b'A'" in a CSV file. You are correct that the ideal way should indeed be to decode the byte first.

However if a user does not decode the byte then the CSV module calls the str() method on the byte object as you said, but in real-life that b-prefixed string is just not readable by another program in an easy way (they will need to first chop off the b-prefix and single quotes around the string) and has turned out to be a pain point in one of the pandas issue I referred to in my first message.

Also I am not sure if you have taken a look at my PR, but my approach to fix this problem does NOT involve guessing the encoding scheme used, instead we simply use the encoding scheme that the user provided when they open the file object. So if you open it with open("abc.csv", "w", encoding="latin1") then it will try to decode the byte using "latin1". Incase it fails to decode using that, then it will throw a UnicodeDecodeError (So there is no unknowing file corruption, a UnicodeDecode error is thrown when this happens). You can refer to the tests + NEWS.d in the PR to confirm the same.

remilapeyre · 2020-05-25T14:05:32Z

in real-life that b-prefixed string is just not readable by another program in an easy way

If another program opens this CSV file, it will read the string "b'A'" which is what this field actually contains. Everything that is not a number or a string gets converted to a string:

In [1]: import collections, dataclasses, random, secrets, io, csv
...:
...: Point = collections.namedtuple('Point', 'x y')
...:
...: @dataclasses.dataclass
...: class Valar:
...: name: str
...: age: int
...:
...: a = Point(1, 2)
...: b = Valar('Melkor', 2900)
...: c = secrets.token_bytes(4)
...:
...: out = io.StringIO()
...: f = csv.writer(out)
...: f.writerow((a, b, c))
...:
...: out.seek(0)
...: print(out.read())
...:
"Point(x=1, y=2)","Valar(name='Melkor', age=2900)",b'\x95g6\xa2'

Here another would find three fields, all strings: "Point(x=1, y=2)", "Valar(name='Melkor', age=2900)" and "b'\x95g6\xa2'". Would you expect to get actual objects instead of strings when reading the two first fields?

Incase it fails to decode using that, then it will throw a UnicodeDecodeError

I read your PR, but succeeding to decode it does not mean it's correct:

In [4]: b'r\xc3\xa9sum\xc3\xa9'.decode('latin')
Out[4]: 'rÃ©sumÃ©'

It worked, but is it the appropriate encoding? Probably not

In [5]: b'r\xc3\xa9sum\xc3\xa9'.decode('utf8')
Out[5]: 'résumé'

If you want to be able to save bytes, the best way is to use a format that can roundtrip bytes like parquet:

In [18]: df = pd.DataFrame.from_dict({'a': [b'a']})                                                                                                                    

In [19]: df.to_parquet('foo.parquet')                                                                                                                                  

In [20]: type(pd.read_parquet('foo.parquet')['a'][0])                                                                                                                  
Out[20]: bytes

sidhant007 · 2020-05-26T02:44:53Z

Hi Remi,

I understand your concerns with the current approach to resolve this issue.
I would like to propose a new/different change to the way csv.writer works.

I am putting here the diff of how the updated docs (https://docs.python.org/3/library/csv.html#csv.writer) should look for my proposed change:

-.. function:: writer(csvfile, dialect='excel', **fmtparams)
+.. function:: writer(csvfile, encoding=None, dialect='excel', **fmtparams)

    Return a writer object responsible for converting the user's data into delimited
    strings on the given file-like object.  *csvfile* can be any object with a
    :func:`write` method.  If *csvfile* is a file object, it should be opened with
-   ``newline=''`` [1]_.  An optional *dialect*
+   ``newline=''`` [1]_.  An optional *encoding* parameter can be given which is
+   used to define how to decode the bytes encountered before writing them to
+   the csv. After being decoded using this encoding scheme, this resulting string
+   will then be transcoded from this encoding scheme to the encoding scheme specified
+   by the file object and be written into the CSV file. If the decoding or the
+   transcoding fails an error will be thrown. Incase this optional parameter is not
+   provided or is set to None then all the bytes will be stringified with :func:`str`
+   before being written just like all the other non-string data. Another optional *dialect*
    parameter can be given which is used to define a set of parameters specific to a
    particular CSV dialect.  It may be an instance of a subclass of the
    :class:`Dialect` class or one of the strings returned by the

       import csv
       with open('eggs.csv', 'w', newline='') as csvfile:
-          spamwriter = csv.writer(csvfile, delimiter=' ',
+          spamwriter = csv.writer(csvfile, encoding='latin1', delimiter=' ',
                                   quotechar='|', quoting=csv.QUOTE_MINIMAL)
+          spamwriter.writerow([b'\xc2', 'A', 'B'])
           spamwriter.writerow(['Spam'] * 5 + ['Baked Beans'])
           spamwriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])

(This diff can be found here: sidhant007@50d809c)

If another program opens this CSV file, it will read the string "b'A'" which is what this field actually contains. Everything that is not a number or a string gets converted to a string:

In this proposal, I am proposing "bytes" to be treated specially just like strings and numbers are treated by the CSV module, since it also one of the primitive datatypes and more relevant than other use defined custom datatypes (in your example the point and person object) in a lot of use cases.

I read your PR, but succeeding to decode it does not mean it's correct

Now we will be providing the user the option to decode according to what encoding scheme they want to and that will overcome this. If they provide no encoding scheme or set it to None we will simply revert to the current behaviour, i.e the b-prefixed string will be written to the CSV. This will ensure no accidental conversions using incorrect encoding schemes

remilapeyre · 2020-05-26T08:26:41Z

I don't think this would be accepted but I think you could try to propose that on the python-ideas mailing list to get some feedback on your proposal.

serhiy-storchaka · 2020-05-26T08:55:24Z

It would be confusing. There would be the encoding argument for open() to encode strings to bytes and the encoding argument for csv.writer() to decode bytes to strings. The latter will be ignored in all normal cases (for strings and numbers).

stevendaprano · 2020-05-27T02:09:47Z

The csv file object knows the encoding it was opened with, I think?

If so, we could add an enhancement that bytes objects are first decoded to str using the same encoding the file was opened with. That seems like a reasonable new feature to me.

Everything else would continue to be passed through str().

stevendaprano · 2020-05-27T03:01:35Z

On further thought, no, I don't think it would be a reasonable feature.

User opens the CSV file, probably using the default encoding (UTF-8?)
but potentially in anything.

They collect some data as bytes. Those bytes could be from any unknown
encoding. When they try writing those bytes to the CSV file, at best
they get an explicit but confusing exception that the decoding failed,
at worst they get data loss (mojibake).

# Latin-1 to UTF-8 fails
py> b = 'ßæ'.encode('latin-1')
py> b.decode('utf-8')
# raises UnicodeDecodeError: 'utf-8' codec can't decode 
# byte 0xdf in position 0: invalid continuation byte

# UTF-8 to Latin-1 loses data
py> b = 'ßæ'.encode('UTF-8')
py> b.decode('latin-1')
# returns mojibake 'Ã\x9fÃ¦'

Short of outright banning the use of bytes (raise a TypeError), I think
the current behaviour is least-worst.

ericvsmith · 2020-05-27T09:15:13Z

Short of outright banning the use of bytes (raise a TypeError), I think
the current behaviour is least-worst.

I agree.

I'd like to see the TypeError raised for everything that's not a string or number (since the docs say that's what's accepted), but at this point that would be big change. Maybe we could add a parameter to __init__ to turn on that behavior.

smontanaro · 2020-05-27T10:14:51Z

This likely worked in the past because bytes == str in Python 2.x. This is just a corner case people porting from 2 to 3 need to address in their code. Papering over it so people using Pandas don't have to do the right thing is no reason to make changes. Bytes aren't strings any longer. A huge effort was undertaken to clean up this aspect of Python's data model in the move to Python 3. Conflating bytes and strings at this point (no matter which way you try to accomplish that conflation) makes no sense to me.

The current behavior should not be changed.

serhiy-storchaka · 2020-05-27T12:08:39Z

Since Pandas opens an output file it has control on what encoding is used. It is a Pandas' responsibility to decode bytes, or raise an exception, or just ignore the problem if it is pretty uncommon case. Pandas already have a complex code for formatting output data into CSV files, one additional check does not matter.

smontanaro · 2020-05-27T12:11:39Z

I would also that tweaking Python to make this work with no change in Pandas would be a case of the tail wagging the dog. A big tail, but a tail nonetheless.

terryjreedy · 2020-05-30T01:56:53Z

I make 5 core developers who agree that csv should definitely *not* assume that bytes given to it represent encoded text, reverting to the confusion of Python 1 and 2. (And even it if did, it should not assume that the encoding of the given to it and the encoding of the file are the same, or even that all bytes given to it have the same encoding!)

If a user does not like the current default wonky mixed ascii-hex string representation of bytes, the user should explicitly convert bytes to the representation they want. Here are just 3 examples, 2 with possible variations.

>>> b'\xc2a9'.hex()  # One might want to add prefix '0x' or r'\x'.
'c26139'             # Or add a separator.
>>> str(list(b'\xc2a9'))  # One might want to change or strip brackets, 
'[194, 97, 57]'           # change separator, or strip spaces.
>>> b'\xc2a9'.decode('latin-1')
'Âa9'

What is best depends on the expected reader of the output. Pandas users who don't like Pandas' csv output should talk to its authors. They are welcome to ask advice on python-list.

Eric: I agree that adding 'strict=False' might be a good idea (in a new issue).

ericvsmith · 2020-05-30T12:32:17Z

I've created bpo-40825 for adding a "strict" parameter.

sidhant007 mannequin added type-bug An unexpected behavior, bug, or error 3.10 only security fixes stdlib Python modules in the Lib dir labels May 25, 2020

stevendaprano added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels May 27, 2020

ericvsmith added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels May 27, 2020

terryjreedy closed this as completed May 30, 2020

terryjreedy added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels May 30, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Writing bytes using CSV module results in b prefixed strings #84939

Writing bytes using CSV module results in b prefixed strings #84939

sidhant007 mannequin commented May 25, 2020

sidhant007 mannequin commented May 25, 2020

sidhant007 mannequin commented May 25, 2020

serhiy-storchaka commented May 25, 2020

sidhant007 mannequin commented May 25, 2020

remilapeyre mannequin commented May 25, 2020

sidhant007 mannequin commented May 25, 2020

remilapeyre mannequin commented May 25, 2020

sidhant007 mannequin commented May 26, 2020

remilapeyre mannequin commented May 26, 2020

serhiy-storchaka commented May 26, 2020

stevendaprano commented May 27, 2020

stevendaprano commented May 27, 2020

ericvsmith commented May 27, 2020

smontanaro commented May 27, 2020

serhiy-storchaka commented May 27, 2020

smontanaro commented May 27, 2020

terryjreedy commented May 30, 2020

ericvsmith commented May 30, 2020

Writing bytes using CSV module results in b prefixed strings #84939

Writing bytes using CSV module results in b prefixed strings #84939

Comments

sidhant007 mannequin commented May 25, 2020

sidhant007 mannequin commented May 25, 2020

sidhant007 mannequin commented May 25, 2020

serhiy-storchaka commented May 25, 2020

sidhant007 mannequin commented May 25, 2020

remilapeyre mannequin commented May 25, 2020

sidhant007 mannequin commented May 25, 2020

remilapeyre mannequin commented May 25, 2020

sidhant007 mannequin commented May 26, 2020

remilapeyre mannequin commented May 26, 2020

serhiy-storchaka commented May 26, 2020

stevendaprano commented May 27, 2020

stevendaprano commented May 27, 2020

ericvsmith commented May 27, 2020

smontanaro commented May 27, 2020

serhiy-storchaka commented May 27, 2020

smontanaro commented May 27, 2020

terryjreedy commented May 30, 2020

ericvsmith commented May 30, 2020