csvstat: UnicodeError in Python 2 when --csv option enabled #944

binarytooth · 2018-03-18T03:00:04Z

csvstat 1.0.2
Python 2.7.12
Ubuntu 16.04.3 LTS xenial

When I run csvstat on a file containing a header and a single euro sign € (byte 0xA4 in ISO-8859-15), it exits with an error, but only if the --csv option is included. Without the option, csvstat generates output correctly. A copy of the test file, euro-sign-iso-8859-15.txt, is attached to this ticket.

This happens with all characters from 0x80 (128) through 0xFF (255), with the exception of 0x85 (the Next Line character). All of them produce the same error when csvstat is run with the --csv option.

When csvstat is run on the test file without the --csv option, it produces the following correct output:

$ csvstat -e ISO-8859-15 euro-sign-iso-8859-15.txt

"Contents"

Type of data:          Text
Contains null values:  False
Unique values:         1
Longest value:         1 characters
Most common values:    € (1x)

Row count: 1

When I run the same command but with the --csv option enabled it blows up.

$ csvstat -v -e ISO-8859-15 euro-sign-iso-8859-15.txt --csv
column_id,column_name,type,nulls,unique,min,max,sum,mean,median,stdev,len,freq
Traceback (most recent call last):
  File "/usr/local/bin/csvstat", line 9, in <module>
    load_entry_point('csvkit==1.0.2', 'console_scripts', 'csvstat')()
  File "/usr/local/lib/python2.7/dist-packages/csvkit/utilities/csvstat.py", line 335, in launch_new_instance
    utility.run()
  File "/usr/local/lib/python2.7/dist-packages/csvkit/cli.py", line 114, in run
    self.main()
  File "/usr/local/lib/python2.7/dist-packages/csvkit/utilities/csvstat.py", line 166, in main
    self.print_csv(table, column_ids, stats)
  File "/usr/local/lib/python2.7/dist-packages/csvkit/utilities/csvstat.py", line 318, in print_csv
    writer.writerow(output_row)
  File "/usr/local/lib/python2.7/dist-packages/agate/csv_py2.py", line 190, in writerow
    UnicodeWriter.writerow(self, row)
  File "/usr/local/lib/python2.7/dist-packages/agate/csv_py2.py", line 103, in writerow
    self.writer.writerow([six.text_type(s if s is not None else '').encode(self.encoding) for s in row])
  File "/usr/lib/python2.7/codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)

This problem also occurs for files encoded using WINDOWS-1252. This bug prevents the use of many useful characters in CSV files, such as smart quotes, the dagger and double dagger, many accented letters, and the euro sign.
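
For anyone reading the traceback, here is a rough sketch of where the byte 0xe2 seems to come from. It assumes the row has already been encoded to UTF-8 bytes and that something further down the stack then decodes those bytes as ASCII; that is my reading of the traceback, not something confirmed in csvkit or agate. The row below is a guess reconstructed from the header line and the non-CSV output, not real csvstat output.

```python
# -*- coding: utf-8 -*-
# Byte 0xA4 in ISO-8859-15 is the euro sign, U+20AC.
raw = b"\xa4"
euro = raw.decode("iso-8859-15")

# Encoded as UTF-8, the euro sign becomes three bytes starting with 0xE2.
utf8_bytes = euro.encode("utf-8")

# Hypothetical CSV row (an assumption): 13 fields matching the header,
# with the euro sign in the last ("freq") field.
fields = [u"1", u"Contents", u"Text", u"False", u"1",
          u"", u"", u"", u"", u"", u"", u"1", euro]
encoded_row = u",".join(fields).encode("utf-8")

# Decoding the already-encoded bytes as ASCII fails on the first non-ASCII byte.
try:
    encoded_row.decode("ascii")
    failure = None
except UnicodeDecodeError as exc:
    failure = (exc.encoding, exc.start, encoded_row[exc.start:exc.start + 1])

print(failure)
```

With this guessed row, the failing byte and its position line up with the error message above, which is at least consistent with an implicit ASCII decode of UTF-8 bytes.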


jpmckinney · 2018-03-26T20:18:50Z

It's an issue in Python 2 only. I tried to identify the issue, but running the same code in a Python shell doesn't reproduce the error – the error only occurs when run from the command-line.

aborruso · 2018-04-01T09:17:00Z

Same for me.

The error occurs with Python 2.7 when the input contains accented characters.

Try it from the console with this CSV file:

fieldA,fieldB
aa,bb
aa,cc
hello,world
à,b

TomekZet · 2018-05-11T10:03:14Z

I get a similar problem with csvlook, but only when redirecting output to a file:

$ csvlook -
fieldA,fieldB
aa,bb
aa,cc
hello,world
à,b
| fieldA | fieldB |
| ------ | ------ |
| aa     | bb     |
| aa     | cc     |
| hello  | world  |
| à      | b      |

Printing directly to the console works, because I have a UTF-8 locale set:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
[...]
$ csvlook - > test.tmp
fieldA,fieldB
aa,bb
aa,cc
hello,world
à,b
'ascii' codec can't encode character u'\xe0' in position 2: ordinal not in range(128)

It looks as if the unicode output is not being encoded properly.
I am not sure, but the missing encode should probably be added to https://github.com/wireservice/agate/blob/master/agate/table/print_table.py
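
To illustrate what I mean, here is a minimal sketch, not a patch for agate. It first shows that encoding the "à" row of the table as ASCII fails at position 2, as in the message above, and then writes the same row through a writer with an explicit encoding. The helper name wrap_output and the UTF-8 default are my own assumptions, and io.BytesIO stands in for a redirected stdout.

```python
# -*- coding: utf-8 -*-
import codecs
import io

# The "à" row from the csvlook table (u"\xe0" is "à").
row = u"| \xe0      | b      |\n"

# What an ASCII-only output stream does with that row.
try:
    row.encode("ascii")
    ascii_error_position = None
except UnicodeEncodeError as exc:
    ascii_error_position = exc.start


def wrap_output(byte_stream, encoding="utf-8"):
    """Hypothetical helper: return a writer that encodes text explicitly."""
    return codecs.getwriter(encoding)(byte_stream)


buffer = io.BytesIO()          # stands in for stdout redirected to a file
writer = wrap_output(buffer)
writer.write(row)              # encoded as UTF-8 instead of falling back to ASCII
written = buffer.getvalue()

print(ascii_error_position, written)
```

Whether the encoding should come from the locale, from an option, or default to UTF-8 is a separate design decision that I have not looked into.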

The problem is not present in versions <= 1.0.0.
The earliest version in which I can reproduce it is 1.0.1.

jpmckinney · 2023-10-17T19:09:45Z

Closing as Python 2 no longer supported.

jpmckinney added the bug label Mar 26, 2018

jpmckinney changed the title ~~csvstat blows up on all high ascii characters when --csv option is enabled~~ csvstat: UnicodeError in Python 2 when --csv option enabled Mar 26, 2018

jpmckinney added bug: platform and removed bug labels May 20, 2018

jpmckinney added the Low Priority label Nov 21, 2018

jpmckinney closed this as completed Oct 17, 2023

jpmckinney added bug and removed bug: platform labels Oct 17, 2023
