Handle CSV input other than UTF-8 e.g. ANSI #79

alistairhann · 2017-03-01T16:01:36Z

I am using Flask-Excel with the first example in the docs, and the first line fails unless the CSV is in UTF-8. For example, I have a customer with an ANSI file that I want to be able to import. The line is as follows; it calls straight into pyexcel, which is why I am raising this here:

jsonify({"result": request.get_array(field_name='file')})

The trace is as follows:

  File "/code/app/myfile.py", line 99, in upload_file
    if request.method == 'POST':
  File "/usr/local/lib/python2.7/site-packages/pyexcel_webio/__init__.py", line 81, in get_array
    return pe.get_array(**params)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 290, in get_array
    sheet = get_sheet(**keywords)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 68, in get_sheet
    sheet = Sheet(named_content.payload, named_content.name, **sheet_params)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 87, in __init__
    transpose_after=transpose_after
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 130, in init
    Matrix.__init__(self, sheet)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/matrix.py", line 36, in __init__
    self.__width, self.__array = uniform(list(array))
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/sheet.py", line 55, in to_array
    for row_index, row in enumerate(self.row_iterator()):
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/_compact.py", line 43, in next
    return type(self).__next__(self)
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/fileformat/_csv.py", line 42, in __next__
    return next(self.reader).encode('utf-8')
  File "/usr/local/lib/python2.7/codecs.py", line 630, in next
    line = self.readline()
  File "/usr/local/lib/python2.7/codecs.py", line 545, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/local/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 0: invalid continuation byte

Going through the code, _csv.py does support the notion of different encodings (e.g. UTF8Recorder has a reader and supplies it with the encoding). However, unless I am mistaken, the encoding is determined by the Python version in _compact.py (otherwise by the system default encoding), and it is not possible to supply it as a parameter to get_array() that cascades down.

chfw · 2017-03-01T17:16:03Z

Could you please try feeding it with an extra encoding parameter?

jsonify({"result": request.get_array(field_name='file', encoding='ascii')})

This parameter will get a corresponding codec from codecs.
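
As a rough illustration of what "getting a codec from codecs" means, here is a minimal sketch using only the Python 3 standard library. The helper name read_csv_bytes is hypothetical and is not part of pyexcel or pyexcel-io; it only shows how an encoding name is resolved to a codec and then used to decode CSV bytes.

```python
import codecs
import csv
import io


def read_csv_bytes(raw_bytes, encoding="utf-8"):
    """Decode raw CSV bytes with the named codec and return a list of rows.

    Hypothetical helper for illustration; not a pyexcel API.
    """
    # codecs.lookup resolves an encoding name (or alias such as 'latin1')
    # to a CodecInfo object; it raises LookupError for unknown names.
    codec_info = codecs.lookup(encoding)
    # CodecInfo.decode returns a (text, bytes_consumed) tuple.
    text, _consumed = codec_info.decode(raw_bytes)
    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.reader(io.StringIO(text, newline="")))


rows = read_csv_bytes(b"Name\r\nTh\xefs\r\n", encoding="latin1")
print(rows)
```

The same bytes passed with encoding="utf-8" raise UnicodeDecodeError, which is the failure reported in this issue.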

alistairhann · 2017-03-02T10:46:38Z

Thanks for replying!

I tried as you suggested:

raw = request.get_array(field_name='file',encoding='ascii')

but I still get the same error as above (pasted at the bottom).

I traced through the code again, and the parameter would next get to get_sheet(**keywords) in core.py which would in turn pass it to get_sheet_stream(**keywords). That then passes it to the get_source factory and I think that's where the encoding parameter gets dropped (although I'll admit I don't fully understand what's going on there).

  File "/usr/local/lib/python2.7/site-packages/pyexcel_webio/__init__.py", line 81, in get_array
    return pe.get_array(**params)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 290, in get_array
    sheet = get_sheet(**keywords)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 68, in get_sheet
    sheet = Sheet(named_content.payload, named_content.name, **sheet_params)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 87, in __init__
    transpose_after=transpose_after
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 130, in init
    Matrix.__init__(self, sheet)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/matrix.py", line 36, in __init__
    self.__width, self.__array = uniform(list(array))
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/sheet.py", line 55, in to_array
    for row_index, row in enumerate(self.row_iterator()):
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/_compact.py", line 43, in next
    return type(self).__next__(self)
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/fileformat/_csv.py", line 42, in __next__
    return next(self.reader).encode('utf-8')
  File "/usr/local/lib/python2.7/codecs.py", line 630, in next
    line = self.readline()
  File "/usr/local/lib/python2.7/codecs.py", line 545, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/local/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 0: invalid continuation byte
INFO:werkzeug:172.17.0.1 - - [02/Mar/2017 10:24:36] "POST /upload HTTP/1.1" 500 -

chfw · 2017-03-02T11:09:54Z

My bad, I had not spotted the '0xca'. That byte is a well-known cause of UnicodeDecodeError, and the usual suggestion is to use the 'latin1' (ISO-8859-1) encoding.

0xca is 202, which the ascii encoding cannot handle. Here's my experiment:

>>> import pyexcel as p
>>> a='\xca'
>>> with open('test.csv', 'wb') as f:
...     f.write(a)
...
>>> p.get_sheet(file_name='test.csv', encoding='latin1')
test.csv:
+---+
| Ê |
+---+
>>> p.get_sheet(file_name='test.csv')
test.csv:

>>>
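
For readers on Python 3, the same point can be checked without pyexcel. This is a small sketch using only built-in bytes.decode; the helper try_decode is a hypothetical name introduced for illustration.

```python
RAW = b"\xca"  # the single byte 0xca (decimal 202) from the error message


def try_decode(raw, encoding):
    """Return the decoded text, or a short failure note if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.reason holds the codec's own explanation of the failure
        return "failed: " + exc.reason


results = {name: try_decode(RAW, name) for name in ("ascii", "utf-8", "latin1")}
for name, outcome in results.items():
    print(name, "->", outcome)
```

Only latin1 succeeds, because it maps every byte value 0-255 directly to the code point with the same number; ascii stops at 127, and in UTF-8 the byte 0xca is only valid as the start of a two-byte sequence.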

alistairhann · 2017-03-02T16:44:38Z

Thank you!

If I understand your response correctly, you are suggesting that it would decode correctly if I hinted Latin1?

What I still don't understand is that if the encoding parameter defines the Codec why would the error be:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 0: invalid continuation byte

Regardless of what codec I specify for example:

raw = request.get_array(field_name='file',encoding='latin1')
and
raw = request.get_array(field_name='file',encoding='ascii')
and
raw = request.get_array(field_name='file')

Each results in the error referring to the UTF-8 codec, as if the codec isn't set when using request.get_array().

chfw · 2017-03-02T17:30:14Z

That is strange. You can put a break point or put a print statement here: https://github.com/pyexcel/pyexcel-io/blob/master/pyexcel_io/fileformat/_csv.py#L36, in order to see if encoding is passed on or not.

In general, pyexcel's getter functions always pass on keyword options down to pyexcel-io and its third party libraries.

Do you mind sharing a portion of your csv file so that I can reproduce the problem on my end?

alistairhann · 2017-03-02T17:32:47Z

I'm happy to share the file, what's the best way to get it to you?

I can just put it on github if you like.

chfw · 2017-03-02T17:34:45Z

Via GitHub is easiest, as long as you remove the private data.

alistairhann · 2017-03-02T18:14:50Z

In the process of creating a clean version of the file that still breaks, I found that it's one particular entry that breaks it. There's a special character in one of the fields, and that makes the whole thing fail. I don't know exactly how the file was created; most probably by saving a spreadsheet as a CSV on a Mac (Notepad++ identifies it as "ANSI" with Mac line endings).

Part of what has confused me here is why the codec in the error is UTF8 despite the encoding and why it is the byte in position zero (as I assumed that would be the start of the file or string - which I had checked for anomalies).
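
On the "position 0" question, one possible explanation (an assumption, not verified against the reported file) is that the position in a UnicodeDecodeError is relative to the bytes handed to the decoder in that call, not to the start of the file. If a stream reader has already consumed the earlier bytes, the offending byte can sit at offset 0 of the remaining buffer. The Python 3 sketch below shows only that offset behaviour; the sample bytes and the helper error_offset are made up for illustration.

```python
def error_offset(chunk, encoding="utf-8"):
    """Return the offset of the first undecodable byte, or None if all is fine."""
    try:
        chunk.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is an index into `chunk`, i.e. into whatever was decoded
        return exc.start
    return None


# Made-up sample with Mac (\r) line endings and a latin1-encoded character
whole = b"Last Name,First Name\rTest,Th\xefs\r"
bad_index = whole.index(b"\xef")

print(error_offset(whole))              # offset within the full content
print(error_offset(whole[bad_index:]))  # offset within the remaining buffer
```

Decoding the full content reports the offset of the bad byte in the file, while decoding a buffer that begins at the bad byte reports 0.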

Thanks again for your help, and sorry if this is a red herring.

chfw · 2017-03-02T21:12:58Z

I reproduced the problem, and specifying encoding='latin1' fixes it on my side. The actual problem is in pyexcel-webio, which isn't passing on keywords.

This is my session, showing that pyexcel can handle your csv but that 'encoding' was blocked by webio:

>>> import pyexcel as p
>>> s=p.get_sheet(url="https://raw.githubusercontent.com/alistairhann/broken_csv/master/broken.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel-0.5.0-py2.7.egg/pyexcel/core.py", line 68, in get_sheet
    named_content = sources.get_sheet_stream(**keywords)
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel-0.5.0-py2.7.egg/pyexcel/sources/__init__.py", line 15, in get_sheet_stream
    sheets = source.get_data()
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel-0.5.0-py2.7.egg/pyexcel/sources/http.py", line 56, in get_data
    **self.__keywords)
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel_io-0.3.2-py2.7.egg/pyexcel_io/io.py", line 39, in get_data
    data[key] = list(data[key])
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel_io-0.3.2-py2.7.egg/pyexcel_io/sheet.py", line 55, in to_array
    for row_index, row in enumerate(self.row_iterator()):
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel_io-0.3.2-py2.7.egg/pyexcel_io/_compact.py", line 43, in next
    return type(self).__next__(self)
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel_io-0.3.2-py2.7.egg/pyexcel_io/fileformat/_csv.py", line 42, in __next__
    return next(self.reader).encode('utf-8')
  File "/Users/jaska/github/py2env/lib/python2.7/codecs.py", line 618, in next
    line = self.readline()
  File "/Users/jaska/github/py2env/lib/python2.7/codecs.py", line 533, in readline
    data = self.read(readsize, firstline=True)
  File "/Users/jaska/github/py2env/lib/python2.7/codecs.py", line 480, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xef in position 0: invalid continuation byte
>>> s=p.get_sheet(url="https://raw.githubusercontent.com/alistairhann/broken_csv/master/broken.csv", encoding='latin1')
>>> s
csv:
+-----------+------------+---------+-----------------------+-----------+
| Last Name | First Name | Company | Email                 | Job Title |
+-----------+------------+---------+-----------------------+-----------+
| Test      | Thïs       | Cool Co | test.this@example.com | Founder   |
+-----------+------------+---------+-----------------------+-----------+
>>> p.get_array(url="https://raw.githubusercontent.com/alistairhann/broken_csv/master/broken.csv", encoding='latin1')
[[u'Last Name', u'First Name', u'Company', u'Email', u'Job Title'], [u'Test', u'Th\xefs', u'Cool Co', u'test.this@example.com', u'Founder']]
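
The bug pattern described here, a wrapper that accepts keywords but does not pass them on, can be sketched in Python 3 as follows. All three functions are hypothetical stand-ins written for illustration; they are not the actual pyexcel or pyexcel-webio code, and the real fix may look different.

```python
def backend_get_array(content, encoding="utf-8"):
    """Hypothetical stand-in for the library call that decodes the upload."""
    text = content.decode(encoding)
    return [line.split(",") for line in text.splitlines()]


def get_array_dropping_keywords(content, **keywords):
    # Bug pattern: extra keywords are accepted but never forwarded,
    # so the backend always falls back to its default encoding.
    return backend_get_array(content)


def get_array_forwarding_keywords(content, **keywords):
    # Fixed pattern: everything the caller supplied reaches the backend.
    return backend_get_array(content, **keywords)


payload = b"Test,Th\xefs\n"  # latin1-encoded bytes, invalid as UTF-8
print(get_array_forwarding_keywords(payload, encoding="latin1"))
```

Calling get_array_dropping_keywords(payload, encoding="latin1") still raises a UnicodeDecodeError that names the utf-8 codec, which matches the symptom reported earlier in this thread: the error message is the same whatever encoding is requested.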

chfw · 2017-03-02T23:37:54Z

Please try the fix:

pip install https://github.com/pyexcel/pyexcel-webio/archive/master.zip

alistairhann · 2017-03-03T11:21:55Z

I concur about the keywords not being passed on - that was the point I was poorly articulating above. I dare say it would have been clearer if I'd raised the issue in flask-excel or pyexcel-webio.

I tried the fix (I needed to add --upgrade) and it works nicely with both broken.csv and the original file that was causing problems. It throws an exception when the encoding isn't provided and uploads fine with:

encoding='latin1'

Thank you very much for your help with this and for producing the fix.

chfw · 2017-03-04T22:11:26Z

The fix was released in pyexcel-webio version 0.0.11. Thanks for reporting it.

chfw mentioned this issue Mar 2, 2017

web-io is not passing on keywords to pyexcel pyexcel-webwares/pyexcel-webio#4

Closed

chfw closed this as completed Mar 4, 2017
