Unicode char as delimiter won't use C engine #14065

Closed
schodge opened this issue Aug 22, 2016 · 7 comments
Labels
IO CSV read_csv, to_csv Unicode Unicode strings
Milestone

Comments

@schodge

schodge commented Aug 22, 2016

I have the following code:

dfEL = pd.read_csv(IN_PATH, delimiter='\\u00A7', encoding='utf-8')

which I've also tried with other ways of writing the delimiter, e.g.:

dfEL = pd.read_csv(IN_PATH, delimiter='§', encoding='utf-8')

These other methods don't work, and generate a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 10: unexpected end of data.

The first method, though, won't use the C regex engine:

dfEL = pd.read_csv(IN_PATH, delimiter='\\u00A7', encoding='utf-8')
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':

Shouldn't this be considered a single character and still use the (I presume faster) C engine?

Sample data - there's a lot of messiness in the rightmost column, which is why an unusual separator was used:

foo§1457431587429§$request_details.bar
foo§1457431587429§$request_details.foo.bar
foo§1457431587429§$request_details.foo.foo.bar
foo§1457431587429§null
foo§1457431587429§null
foo§1457431587429§null
foo§1457431587429§$request_type
foo§1457431587429§$additional_details
foo§1457431587429§$Generic_Params.success_msg_folder+$request_type.action+'_'+$request_type.object
§1459972605829§$Path
§1459972605829§$Name
§1441995198746§$original.original
§1441995198746§null

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None
@TomAugspurger
Contributor

Copy-pastable example (python3)

import pandas as pd
from io import StringIO
s = "a§b\n1§2\n3§4"

pd.read_csv(StringIO(s), sep='§')

@TomAugspurger TomAugspurger added IO CSV read_csv, to_csv Unicode Unicode strings labels Aug 22, 2016
@TomAugspurger TomAugspurger added this to the 0.20.0 milestone Aug 22, 2016
@TomAugspurger
Contributor

By the way, for your first example, I think you want delimiter='\u00A7' (you had an extra backslash).

@schodge
Author

schodge commented Aug 23, 2016

Apologies for not including a cut and paste example.

Actually, I do need the delimiter with two backslashes. In fact, your copy-and-paste example doesn't work for me as written:

import pandas as pd
from io import StringIO
s = "a§b\n1§2\n3§4"

pd.read_csv(StringIO(s), sep='§')

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-43a4e6b0419c> in <module>()
      3 s = "a§b\n1§2\n3§4"
      4 
----> 5 pd.read_csv(StringIO(s), sep='§')
      6 

<cutting>

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 1: unexpected end of data

With the doubled form:

import pandas as pd
from io import StringIO
s = "a§b\n1§2\n3§4"

pd.read_csv(StringIO(s), sep='\\u00A7')

C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:5: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
Out[2]: 
   a  b
0  1  2
1  3  4

And with the single form:

import pandas as pd
from io import StringIO
s = "a§b\n1§2\n3§4"

pd.read_csv(StringIO(s), sep='\u00A7')

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-e789e47e41b2> in <module>()
      3 s = "a§b\n1§2\n3§4"
      4 
----> 5 pd.read_csv(StringIO(s), sep='\u00A7')
      6 

<cutting>

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 1: unexpected end of data

@TomAugspurger
Contributor

TomAugspurger commented Aug 23, 2016

That's odd, I wonder if we escape something improperly... Compare these two:

In [22]: '\u00A7'
Out[22]: '§'

In [24]: '\\u00A7'
Out[24]: '\\u00A7'

When you have the double \, the first backslash escapes the second, so Out[24] is the literal string \u00A7.
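
To make the difference concrete, here is a minimal standard-library sketch (no pandas involved). It assumes, going by the ParserWarning quoted earlier in this thread, that the python engine treats a separator longer than one character as a regular expression. Under that assumption, Python's `re` module reads the six-character text `\u00A7` as an escape for `§`, which would explain why the doubled form parses the data while emitting the fallback warning.

```python
import re

single = '\u00A7'    # one character: the section sign itself
doubled = '\\u00A7'  # six characters: backslash, u, 0, 0, A, 7

print(len(single), len(doubled))  # 1 6

# Used as a regex, the six-character pattern is an escape sequence that
# matches '§' (re has accepted \uXXXX in str patterns since Python 3.3).
print(re.split(doubled, "a§b"))  # ['a', 'b']
```

So the doubled form is not a single character from pandas' point of view; it only behaves like `§` once a regex engine interprets it.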

@schodge
Author

schodge commented Aug 23, 2016

So those work:

'\u00A7'
Out[2]: '§'

'\\u00A7'
Out[3]: '\\u00A7'

Right - I don't think I've seen this behavior with unicode outside of pandas, but I rarely venture into unicode.

@gfyoung
Member

gfyoung commented Aug 29, 2016

The reason for the error is that the data is getting encoded as utf-8 (see here), which "destroys" the delimiter in the data:

>>> data = "a§b\n1§2\n3§4"
>>> data.encode('utf-8')
b'a\xc2\xa7b\n1\xc2\xa72\n3\xc2\xa74'

However, ord('§') == 167, which is the single byte \xa7, so the data gets split on \xa7 alone. The first element of the header then becomes a\xc2, an incomplete UTF-8 sequence, which leads to the error that you're seeing.

Now technically, we should be splitting by \xc2\xa7, but the C engine doesn't support splitting of that kind (we only support single-character splitting for now).
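
The mechanism described here can be reproduced with plain `bytes` operations. This is only an illustration of the explanation above, not the C parser's actual code; the variable names are made up for the example.

```python
data = "a§b\n1§2\n3§4"
encoded = data.encode("utf-8")

# ord('§') == 167 == 0xa7, so a single-byte split uses b'\xa7' only
sep_byte = bytes([ord("§")])

header = encoded.split(b"\n")[0]  # b'a\xc2\xa7b'
fields = header.split(sep_byte)   # [b'a\xc2', b'b']

# The first field ends in a dangling lead byte, so decoding it fails
try:
    fields[0].decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(fields)
print(error_message)
```

The message printed at the end has the same shape as the traceback reported earlier in the thread: byte 0xc2 at position 1, "unexpected end of data".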

In the long run, the solution would be to somehow support multi-char delimiters (tricky since we parse byte by byte with the C engine). In the short term, I think we should check whether the separator would become multi-char when encoded, and if so, raise an error.

Thoughts?

@jorisvandenbossche
Member

Trying to detect such separators and raising an informative message sounds fine IMO
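
A minimal sketch of the kind of check being discussed, assuming UTF-8 as the target encoding. `check_separator` is a hypothetical helper written for illustration; it is not a pandas function and does not reflect how the eventual fix was implemented.

```python
def check_separator(sep, encoding="utf-8"):
    """Hypothetical helper: reject separators that encode to multiple bytes."""
    encoded_length = len(sep.encode(encoding))
    if encoded_length > 1:
        raise ValueError(
            "separator {!r} is {} bytes when encoded as {}; a byte-by-byte "
            "parser can only split on a single byte".format(
                sep, encoded_length, encoding
            )
        )
    return sep


check_separator(",")  # fine: one byte in UTF-8

try:
    check_separator("§")  # two bytes in UTF-8: b'\xc2\xa7'
except ValueError as exc:
    print(exc)
```

Note that the result depends on the encoding: `§` is a single byte in Latin-1, so `check_separator("§", encoding="latin-1")` would pass.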

gfyoung added a commit to forking-repos/pandas that referenced this issue Aug 30, 2016
The system file encoding can cause a separator to be
encoded as more than one character even though it may be
provided as one character.

Multi-char separators are not supported by the C engine,
so we need to catch this case.

Closes gh-14065.
@jreback jreback modified the milestones: 0.19.0, 0.20.0 Aug 31, 2016
This issue was closed.