Handle CSV input other than UTF-8 e.g. ANSI #79

Closed
alistairhann opened this issue Mar 1, 2017 · 12 comments

Comments

@alistairhann

I am using Flask-Excel with the first example in the docs, and the first line fails unless the CSV is in UTF-8. For example, I have a customer with an ANSI file that I want to be able to import. The failing line is below; it just calls straight into pyexcel, which is why I am raising the issue here:

jsonify({"result": request.get_array(field_name='file')})

The trace is as follows:

  File "/code/app/myfile.py", line 99, in upload_file
    if request.method == 'POST':
  File "/usr/local/lib/python2.7/site-packages/pyexcel_webio/__init__.py", line 81, in get_array
    return pe.get_array(**params)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 290, in get_array
    sheet = get_sheet(**keywords)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 68, in get_sheet
    sheet = Sheet(named_content.payload, named_content.name, **sheet_params)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 87, in __init__
    transpose_after=transpose_after
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 130, in init
    Matrix.__init__(self, sheet)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/matrix.py", line 36, in __init__
    self.__width, self.__array = uniform(list(array))
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/sheet.py", line 55, in to_array
    for row_index, row in enumerate(self.row_iterator()):
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/_compact.py", line 43, in next
    return type(self).__next__(self)
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/fileformat/_csv.py", line 42, in __next__
    return next(self.reader).encode('utf-8')
  File "/usr/local/lib/python2.7/codecs.py", line 630, in next
    line = self.readline()
  File "/usr/local/lib/python2.7/codecs.py", line 545, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/local/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 0: invalid continuation byte

Going through the code, _csv.py does support the notion of different encodings (e.g. UTF8Recoder has a reader and supplies the encoding to it). Unless I am mistaken, though, the encoding is determined by the Python version in _compact.py (otherwise by the system default encoding), and it cannot be supplied as a parameter to get_array() that cascades down.

@chfw
Member

chfw commented Mar 1, 2017

Could you please try feeding it with an extra encoding parameter?

jsonify({"result": request.get_array(field_name='file', encoding='ascii')})

This parameter is used to look up the corresponding codec from Python's codecs module.
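As a rough illustration of what "getting a codec from codecs" means, here is a minimal standard-library sketch. It does not use pyexcel at all; the sample bytes and variable names are made up for the example, and how pyexcel-io wires the codec in internally is not shown in this thread.

```python
import codecs
import io

# Made-up sample: 'café,1' encoded as Latin-1 (0xe9 is 'é' in that encoding).
raw = b"caf\xe9,1\r\n"

# Encoding names are aliases; 'latin1' and 'iso-8859-1' resolve to the same codec.
same_codec = codecs.lookup("latin1").name == codecs.lookup("iso-8859-1").name

# codecs.getreader returns a StreamReader class for the named encoding.
# Wrapping a byte stream with it yields decoded text on read().
reader_factory = codecs.getreader("latin1")
stream = reader_factory(io.BytesIO(raw))
text = stream.read()

print(same_codec)
print(repr(text))
```

The point is only that the encoding name selects the decoder; if the name never reaches this layer, a default is used instead.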

@alistairhann
Author

Thanks for replying!

I tried as you suggested:

raw = request.get_array(field_name='file',encoding='ascii')

but I still get the same error as above (pasted at the bottom).

I traced through the code again, and the parameter would next get to get_sheet(**keywords) in core.py which would in turn pass it to get_sheet_stream(**keywords). That then passes it to the get_source factory and I think that's where the encoding parameter gets dropped (although I'll admit I don't fully understand what's going on there).

  File "/usr/local/lib/python2.7/site-packages/pyexcel_webio/__init__.py", line 81, in get_array
    return pe.get_array(**params)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 290, in get_array
    sheet = get_sheet(**keywords)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 68, in get_sheet
    sheet = Sheet(named_content.payload, named_content.name, **sheet_params)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 87, in __init__
    transpose_after=transpose_after
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 130, in init
    Matrix.__init__(self, sheet)
  File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/matrix.py", line 36, in __init__
    self.__width, self.__array = uniform(list(array))
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/sheet.py", line 55, in to_array
    for row_index, row in enumerate(self.row_iterator()):
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/_compact.py", line 43, in next
    return type(self).__next__(self)
  File "/usr/local/lib/python2.7/site-packages/pyexcel_io/fileformat/_csv.py", line 42, in __next__
    return next(self.reader).encode('utf-8')
  File "/usr/local/lib/python2.7/codecs.py", line 630, in next
    line = self.readline()
  File "/usr/local/lib/python2.7/codecs.py", line 545, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/local/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 0: invalid continuation byte
INFO:werkzeug:172.17.0.1 - - [02/Mar/2017 10:24:36] "POST /upload HTTP/1.1" 500 -

@chfw
Member

chfw commented Mar 2, 2017

My bad, I did not spot the '0xca'. That byte is a well-known cause of UnicodeDecodeError, and the usual suggestion is to use the 'latin1' (also known as 'ISO-8859-1') encoding.

0xca is 202, which is outside the 7-bit ASCII range (0 to 127), so the ascii codec cannot decode it. Here's my experiment:

>>> import pyexcel as p
>>> a='\xca'
>>> with open('test.csv', 'wb') as f:
...     f.write(a)
...
>>> p.get_sheet(file_name='test.csv', encoding='latin1')
test.csv:
+---+
| Ê |
+---+
>>> p.get_sheet(file_name='test.csv')
test.csv:

>>>
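The same point can be checked without pyexcel. The sketch below (Python 3, standard library only; the helper name try_decode is made up for this example) decodes the single byte 0xca with three codecs and records which ones fail.

```python
RAW = b"\xca"  # the single byte 0xca (decimal 202)


def try_decode(raw, encoding):
    """Return the decoded text, or the exception class name if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# ascii only covers 0-127; in UTF-8, 0xca can only start a two-byte sequence;
# latin1 maps every byte value to a character, so it always succeeds.
results = {enc: try_decode(RAW, enc) for enc in ("ascii", "utf-8", "latin1")}

for enc, outcome in results.items():
    print(enc, "->", outcome)
```

Because latin1 accepts every byte, a successful decode does not prove the file was really Latin-1; it only guarantees that no exception is raised.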

@alistairhann
Author

Thank you!

If I understand your response correctly, you are suggesting that it would decode correctly if I specified Latin1?

What I still don't understand is that if the encoding parameter defines the Codec why would the error be:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 0: invalid continuation byte

Regardless of what codec I specify, for example:

raw = request.get_array(field_name='file',encoding='latin1')
and
raw = request.get_array(field_name='file',encoding='ascii')
and
raw = request.get_array(field_name='file')

Each results in the error referring to the UTF-8 codec, as if the codec isn't set when using request.get_array().

@chfw
Member

chfw commented Mar 2, 2017

That is strange. You can put a break point or put a print statement here: https://github.com/pyexcel/pyexcel-io/blob/master/pyexcel_io/fileformat/_csv.py#L36, in order to see if encoding is passed on or not.

In general, pyexcel's getter functions always pass keyword options down to pyexcel-io and its third party libraries.

Do you mind sharing a portion of your csv file so that I can reproduce the problem on my end?

@alistairhann
Author

alistairhann commented Mar 2, 2017

I'm happy to share the file, what's the best way to get it to you?

I can just put it on github if you like.

@chfw
Member

chfw commented Mar 2, 2017

Via GitHub is easier, as long as you remove any private data.

@alistairhann
Author

In the process of creating a clean version of the file that still breaks, I found that it's a particular entry that breaks it. There's a special character in one of the fields and that makes the whole thing fail. I don't know exactly how the file was created, most probably by saving a spreadsheet as a CSV on a Mac (Notepad++ identifies it as "ANSI" with Mac line endings).

Part of what has confused me here is why the codec in the error is UTF8 despite the encoding and why it is the byte in position zero (as I assumed that would be the start of the file or string - which I had checked for anomalies).
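One possible reading of the "position 0" puzzle, which is not confirmed anywhere in this thread: the position in a UnicodeDecodeError is relative to the bytes handed to the decoder in that particular call, not to the start of the file. A reader that decodes in chunks or line by line can therefore report position 0 for a byte that sits deep inside the file. The sketch below uses a made-up sample and a made-up helper name (error_offset) to show the effect in Python 3.

```python
def error_offset(chunk, encoding="utf-8"):
    """Return the offset of the first undecodable byte, or None if decoding succeeds."""
    try:
        chunk.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Made-up data: a header line, then a row containing the byte 0xca,
# with Mac-style '\r' line endings.
sample = b"first,second\rabc,\xcatienne\r"

bad_index = sample.index(b"\xca")

# Decoding the whole buffer reports the offset within the whole buffer.
whole = error_offset(sample)

# Decoding a buffer that happens to start at the bad byte reports offset 0.
tail = error_offset(sample[bad_index:])

print(whole, tail)
```

So a reported position of 0 says where the byte fell in the decoder's current buffer, and checking the first byte of the file will not necessarily find it.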

Thanks again for your help, and sorry if this is a red herring.

@chfw
Member

chfw commented Mar 2, 2017

I reproduced it, and specifying encoding='latin1' fixes it when I call pyexcel directly. The actual problem is in pyexcel-webio, which isn't passing the keywords on.

Here is my session showing that pyexcel itself can handle your csv; 'encoding' was being blocked by webio:

>>> import pyexcel as p
>>> s=p.get_sheet(url="https://raw.githubusercontent.com/alistairhann/broken_csv/master/broken.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel-0.5.0-py2.7.egg/pyexcel/core.py", line 68, in get_sheet
    named_content = sources.get_sheet_stream(**keywords)
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel-0.5.0-py2.7.egg/pyexcel/sources/__init__.py", line 15, in get_sheet_stream
    sheets = source.get_data()
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel-0.5.0-py2.7.egg/pyexcel/sources/http.py", line 56, in get_data
    **self.__keywords)
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel_io-0.3.2-py2.7.egg/pyexcel_io/io.py", line 39, in get_data
    data[key] = list(data[key])
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel_io-0.3.2-py2.7.egg/pyexcel_io/sheet.py", line 55, in to_array
    for row_index, row in enumerate(self.row_iterator()):
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel_io-0.3.2-py2.7.egg/pyexcel_io/_compact.py", line 43, in next
    return type(self).__next__(self)
  File "/Users/jaska/github/py2env/lib/python2.7/site-packages/pyexcel_io-0.3.2-py2.7.egg/pyexcel_io/fileformat/_csv.py", line 42, in __next__
    return next(self.reader).encode('utf-8')
  File "/Users/jaska/github/py2env/lib/python2.7/codecs.py", line 618, in next
    line = self.readline()
  File "/Users/jaska/github/py2env/lib/python2.7/codecs.py", line 533, in readline
    data = self.read(readsize, firstline=True)
  File "/Users/jaska/github/py2env/lib/python2.7/codecs.py", line 480, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xef in position 0: invalid continuation byte
>>> s=p.get_sheet(url="https://raw.githubusercontent.com/alistairhann/broken_csv/master/broken.csv", encoding='latin1')
>>> s
csv:
+-----------+------------+---------+-----------------------+-----------+
| Last Name | First Name | Company | Email                 | Job Title |
+-----------+------------+---------+-----------------------+-----------+
| Test      | Thïs       | Cool Co | test.this@example.com | Founder   |
+-----------+------------+---------+-----------------------+-----------+
>>> p.get_array(url="https://raw.githubusercontent.com/alistairhann/broken_csv/master/broken.csv", encoding='latin1')
[[u'Last Name', u'First Name', u'Company', u'Email', u'Job Title'], [u'Test', u'Th\xefs', u'Cool Co', u'test.this@example.com', u'Founder']]
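To make the "keywords not passed on" failure mode concrete, here is a small self-contained sketch in Python 3. None of these functions are pyexcel or pyexcel-webio code; read_rows, get_array_dropping and get_array_forwarding are hypothetical stand-ins for a low-level reader and two versions of a wrapper around it.

```python
import csv
import io


def read_rows(raw_bytes, encoding="utf-8"):
    """Hypothetical low-level reader: decode the bytes, then parse CSV rows."""
    text = raw_bytes.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


def get_array_dropping(raw_bytes, **keywords):
    """Hypothetical wrapper that accepts keywords but never forwards them."""
    return read_rows(raw_bytes)


def get_array_forwarding(raw_bytes, **keywords):
    """Hypothetical wrapper that forwards every keyword to the reader."""
    return read_rows(raw_bytes, **keywords)


# Made-up upload: Latin-1 bytes, with 0xef standing for 'ï'.
upload = b"Last Name,First Name\r\nTest,Th\xefs\r\n"

rows = get_array_forwarding(upload, encoding="latin1")

try:
    get_array_dropping(upload, encoding="latin1")
    dropped_error = None
except UnicodeDecodeError as exc:
    # The caller asked for latin1, but the error names the default codec.
    dropped_error = exc.encoding

print(rows)
print(dropped_error)
```

This mirrors the symptom reported earlier in the thread: whatever encoding the caller names, the error still mentions the UTF-8 codec, because the reader never sees the argument.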

@chfw
Member

chfw commented Mar 2, 2017

Please try the fix:

pip install https://github.com/pyexcel/pyexcel-webio/archive/master.zip

@alistairhann
Author

I concur about the keywords not being passed on - that was the point I was poorly articulating above. I dare say it would have been clearer if I'd raised the issue in flask-excel or pyexcel-webio.

I tried the fix (I needed to add --upgrade) and it works nicely with both broken.csv and the original file that was causing problems. It throws an exception when the encoding isn't provided and uploads fine with:

encoding='latin1'

Thank you very much for your help with this and for producing the fix.
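For uploads where the encoding is not known in advance, one option is to try UTF-8 first and fall back to Latin-1 before handing the text to a CSV parser. This is only a sketch of that idea, not something pyexcel or pyexcel-webio is stated to do in this thread; decode_with_fallback is a made-up helper, and the candidate list is an assumption.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "latin1")):
    """Try each candidate encoding in order; return (text, encoding_used).

    latin1 maps every byte to a character and so never fails, which is why
    it should come last: anything listed after it would never be tried.
    """
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


latin_text, latin_used = decode_with_fallback(b"Th\xefs")
utf8_text, utf8_used = decode_with_fallback("Th\u00efs".encode("utf-8"))

print(latin_used, utf8_used)
```

The chosen encoding could then be passed along explicitly, for example as encoding='latin1' in the request.get_array() call shown above.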

@chfw
Member

chfw commented Mar 4, 2017

The fix was released in pyexcel-webio version 0.0.11. Thanks for reporting it.

@chfw chfw closed this as completed Mar 4, 2017