Handle CSV input other than UTF-8 e.g. ANSI #79
Comments
Could you please try feeding it with an extra encoding parameter?
This parameter will get a corresponding codec from codecs. |
Thanks for replying! I tried as you suggested:
but I still get error as above (pasted at the bottom). I traced through the code again, and the parameter would next get to get_sheet(**keywords) in core.py which would in turn pass it to get_sheet_stream(**keywords). That then passes it to the get_source factory and I think that's where the encoding parameter gets dropped (although I'll admit I don't fully understand what's going on there).
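The suspicion above, that a wrapper somewhere in the chain drops the encoding keyword, can be illustrated with a minimal sketch. Every name here (`read_csv`, `get_array_dropping`, `get_array_forwarding`, `payload`) is hypothetical and stands in for the real pyexcel, pyexcel-io and pyexcel-webio functions, whose actual code is not reproduced. The sketch is written for Python 3, whereas the traceback in this issue comes from Python 2.7.

```python
import csv
import io


def read_csv(data, encoding="utf-8"):
    """Hypothetical low-level reader: decode the bytes, then parse rows."""
    text = data.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


def get_array_dropping(data, **keywords):
    """Hypothetical wrapper that accepts keywords but never forwards them."""
    return read_csv(data)


def get_array_forwarding(data, **keywords):
    """Hypothetical wrapper that passes every keyword down unchanged."""
    return read_csv(data, **keywords)


# Made-up sample row whose first byte is 0xCA.
payload = b"\xcaxample,1\r\n"

# The dropping wrapper fails with the default codec, whatever the caller asked for.
try:
    get_array_dropping(payload, encoding="latin1")
    dropping_error = None
except UnicodeDecodeError as exc:
    dropping_error = exc

# The forwarding wrapper honours the requested codec.
rows = get_array_forwarding(payload, encoding="latin1")
```

If a keyword is lost like this, the error keeps naming the default codec no matter which encoding the caller specifies.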
|
My bad, I had not spotted '0xca'. That byte is a well-known trigger for UnicodeDecodeError, and the usual suggestion is to set encoding to 'latin1' or 'ISO-8859-1'. 0xca is 202, which the ascii codec cannot handle. Here's my experiment:
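The session from the original comment is not preserved here. As a rough illustration of this kind of check (not the maintainer's actual experiment), the sketch below decodes a made-up byte string starting with 0xCA using three codecs. The sample bytes and the helper `try_decode` are assumptions introduced for the example, and the code targets Python 3.

```python
import codecs

# Made-up sample: one CSV row whose first byte is 0xCA, with a Mac line ending.
raw = b"\xcaxample,1\r"

# 'latin1' and 'ISO-8859-1' are two aliases for the same codec.
same_codec = codecs.lookup("latin1").name == codecs.lookup("ISO-8859-1").name


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_ascii = try_decode(raw, "ascii")    # None: 0xCA (202) is above 127
as_utf8 = try_decode(raw, "utf-8")     # None: 0xCA needs a continuation byte after it
as_latin1 = try_decode(raw, "latin1")  # succeeds: Latin-1 maps every byte to a code point
```

Decoding without an error only shows that Latin-1 accepts the bytes. Whether the resulting characters are the ones the file's author intended depends on how the file was actually saved.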
|
Thank you! If I understand your response correctly, you are suggesting that it would encode correctly if I hinted Latin1? What I still don't understand is that if the encoding parameter defines the Codec why would the error be:
Regardless of what codec I specify for example:
Each results in an error referring to the UTF-8 codec, as if the codec isn't set when using request.get_array() |
That is strange. You can set a breakpoint or add a print statement here: https://github.com/pyexcel/pyexcel-io/blob/master/pyexcel_io/fileformat/_csv.py#L36, to see whether encoding is passed on or not. In general, pyexcel's getter functions always pass keyword options down to pyexcel-io and its third-party libraries. Do you mind sharing a portion of your CSV file so I can reproduce the problem on my end? |
I'm happy to share the file, what's the best way to get it to you? I can just put it on github if you like. |
Via GitHub is easiest, as long as you remove any private data. |
In the process of creating a clean version of the file that still breaks, I found that it's one particular entry that breaks it. There's a special character in one of the fields, and that makes the whole thing fail. I don't know exactly how the file was created; most probably by saving a spreadsheet as a CSV on a Mac (Notepad++ identifies it as "ANSI" with Mac line endings). Part of what has confused me here is why the codec in the error is UTF-8 despite the encoding parameter, and why the error is at byte position zero (I assumed that would be the start of the file or string, which I had checked for anomalies). Thanks again for your help, and sorry if this is a red herring. |
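One plausible explanation for "position 0" (an assumption, not something confirmed in this thread) is that the position in a UnicodeDecodeError is relative to the chunk of bytes handed to the decoder, not to the start of the file. The traceback shows the decoding happening inside a line-by-line reader in codecs.py, so a bad byte at the start of a later row could be reported at position 0. The sketch below uses Python 3's incremental decoder and made-up data to show the effect; it does not reproduce pyexcel-io's reader.

```python
import codecs

# Made-up file content, fed piece by piece: a clean header row,
# then a row whose first byte is 0xCA.
chunks = [b"name,city\r", b"\xcaxample,1\r"]

decoder = codecs.getincrementaldecoder("utf-8")()
decoded = []
failure = None
for chunk in chunks:
    try:
        decoded.append(decoder.decode(chunk))
    except UnicodeDecodeError as exc:
        # exc.start is an offset into the chunk being decoded,
        # not into the file as a whole.
        failure = exc
        break
```

Here the first row decodes cleanly, and the error for the second row reports offset 0 even though the bad byte is the eleventh byte of the data overall.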
I got it reproduced but I got it fixed by specifying This is my session to show pyexcel could handle your csv but 'encoding' was blocked by webio:
|
Please try the fix:
|
I concur about the keywords not being passed on - that was the point I was poorly articulating above. I dare say it would have been clearer if I'd raised the issue in flask-excel or pyexcel-webio. I tried the fix (needed to add --upgrade) and that's fixed it nicely both with the broken.csv and the original file that were providing problems. It throws an exception when the encoding isn't provided and uploads fine when:
Thank you very much for your help with this and for producing the fix. |
The fix was released in pyexcel-webio version 0.0.11. Thanks for reporting it. |
I am using Flask-Excel with the first example in the docs, and the first line fails unless the CSV is in UTF-8. For example, I have a customer with an ANSI file and I want to be able to import it. The line is as follows, and it goes straight into pyexcel (hence raising this here):
jsonify({"result": request.get_array(field_name='file')})
The trace is as follows:
File "/code/app/myfile.py", line 99, in upload_file
  if request.method == 'POST':
File "/usr/local/lib/python2.7/site-packages/pyexcel_webio/__init__.py", line 81, in get_array
  return pe.get_array(**params)
File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 290, in get_array
  sheet = get_sheet(**keywords)
File "/usr/local/lib/python2.7/site-packages/pyexcel/core.py", line 68, in get_sheet
  sheet = Sheet(named_content.payload, named_content.name, **sheet_params)
File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 87, in __init__
  transpose_after=transpose_after
File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/sheet.py", line 130, in init
  Matrix.__init__(self, sheet)
File "/usr/local/lib/python2.7/site-packages/pyexcel/sheets/matrix.py", line 36, in __init__
  self.__width, self.__array = uniform(list(array))
File "/usr/local/lib/python2.7/site-packages/pyexcel_io/sheet.py", line 55, in to_array
  for row_index, row in enumerate(self.row_iterator()):
File "/usr/local/lib/python2.7/site-packages/pyexcel_io/_compact.py", line 43, in next
  return type(self).__next__(self)
File "/usr/local/lib/python2.7/site-packages/pyexcel_io/fileformat/_csv.py", line 42, in __next__
  return next(self.reader).encode('utf-8')
File "/usr/local/lib/python2.7/codecs.py", line 630, in next
  line = self.readline()
File "/usr/local/lib/python2.7/codecs.py", line 545, in readline
  data = self.read(readsize, firstline=True)
File "/usr/local/lib/python2.7/codecs.py", line 492, in read
  newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 0: invalid continuation byte
Going through the code, _csv.py does support the notion of different encodings (e.g. UTF8Recoder has a reader and supplies it with the encoding). Unless I am mistaken, though, the encoding is determined by the Python version in _compact.py (or otherwise by the system default encoding), and it is not possible to supply it as a parameter to get_array() that cascades down.