Unicode char as delimiter won't use C engine #14065

schodge · 2016-08-22T19:27:03Z

I have the following code:

dfEL = pd.read_csv(IN_PATH, delimiter='\\u00A7', encoding='utf-8')

which I've also tried with other ways of writing the delimiter, e.g.:

dfEL = pd.read_csv(IN_PATH, delimiter='§', encoding='utf-8')

These other methods don't work, and generate a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 10: unexpected end of data.

The first method, though, won't use the C regex engine:

dfEL = pd.read_csv(IN_PATH, delimiter='\\u00A7', encoding='utf-8')
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':

Shouldn't this be considered a single character and still use the (I presume faster) C engine?

Sample data - there's a lot of messiness in the rightmost column, which is why an unusual separator was used:

foo§1457431587429§$request_details.bar
foo§1457431587429§$request_details.foo.bar
foo§1457431587429§$request_details.foo.foo.bar
foo§1457431587429§null
foo§1457431587429§null
foo§1457431587429§null
foo§1457431587429§$request_type
foo§1457431587429§$additional_details
foo§1457431587429§$Generic_Params.success_msg_folder+$request_type.action+'_'+$request_type.object
§1459972605829§$Path
§1459972605829§$Name
§1441995198746§$original.original
§1441995198746§null

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

TomAugspurger · 2016-08-22T21:18:24Z

Copy-pastable example (python3)

import pandas as pd
from io import StringIO
s = "a§b\n1§2\n3§4"

pd.read_csv(StringIO(s), sep='§')

TomAugspurger · 2016-08-22T21:20:22Z

By the way, for your first example, I think you want delimiter='\u00A7' (you had an extra backslash).

schodge · 2016-08-23T01:21:17Z

Apologies for not including a cut and paste example.

Actually, I do need the delimiter with two backslashes. In fact, your copy-and-paste example doesn't work for me as written:

import pandas as pd
from io import StringIO
s = "a§b\n1§2\n3§4"

pd.read_csv(StringIO(s), sep='§')

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-43a4e6b0419c> in <module>()
      3 s = "a§b\n1§2\n3§4"
      4 
----> 5 pd.read_csv(StringIO(s), sep='§')
      6 

<cutting>

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 1: unexpected end of data

With the doubled form:

import pandas as pd
from io import StringIO
s = "a§b\n1§2\n3§4"

pd.read_csv(StringIO(s), sep='\\u00A7')

C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:5: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
Out[2]: 
   a  b
0  1  2
1  3  4

And with the single form:

import pandas as pd
from io import StringIO
s = "a§b\n1§2\n3§4"

pd.read_csv(StringIO(s), sep='\u00A7')

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-e789e47e41b2> in <module>()
      3 s = "a§b\n1§2\n3§4"
      4 
----> 5 pd.read_csv(StringIO(s), sep='\u00A7')
      6 

<cutting>

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 1: unexpected end of data

TomAugspurger · 2016-08-23T01:26:03Z

That's odd, I wonder if we escape something improperly... Compare these two:

In [22]: '\u00A7'
Out[22]: '§'

In [24]: '\\u00A7'
Out[24]: '\\u00A7'

When you have the double \, the first backslash escapes the second, so Out[24] is the literal string \u00A7.
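
A quick way to see why the doubled form still parses: the ParserWarning above says separators longer than one character are interpreted as regex, and as a regex the six-character string is an escape that matches the single character. The sketch below uses only the standard `re` module and assumes the python engine treats the separator like an ordinary `re` pattern; it is an illustration, not pandas' own code path.

```python
import re

single = '\u00A7'   # one character: the section sign
double = '\\u00A7'  # six characters: backslash, u, 0, 0, A, 7

print(len(single), len(double))

# Interpreted as a regex, the six-character string is itself an escape
# sequence that matches the single character U+00A7.
print(re.split(double, 'a§b'))
```

So the doubled form is a six-character separator that happens to match one character, which would explain both the fallback to the python engine and the correct result.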

schodge · 2016-08-23T02:12:22Z

So those work:

'\u00A7'
Out[2]: '§'

'\\u00A7'
Out[3]: '\\u00A7'

Right - I don't think I've seen this behavior with unicode outside of pandas, but I rarely venture into unicode.

gfyoung · 2016-08-29T05:03:07Z

The reason for the error is that the data is getting encoded as utf-8 (see here), which "destroys" the delimiter in the data:

>>> data = "a§b\n1§2\n3§4"
>>> data.encode('utf-8')
b'a\xc2\xa7b\n1\xc2\xa72\n3\xc2\xa74'

However, ord('§') == 167, which is the single byte \xa7. The data is therefore split on \xa7 alone, leaving a\xc2 as the first element of the header, and decoding that truncated sequence produces the error you're seeing.
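
The mis-split described above can be reproduced with plain bytes, without pandas. This is a minimal sketch that mimics splitting on the single byte 0xa7; it is not the C parser itself, and the variable names are made up for illustration.

```python
data = "a§b\n1§2\n3§4"
encoded = data.encode('utf-8')

# Take the header row and split it on the single byte 0xa7,
# which is what ord('§') gives.
header = encoded.split(b'\n')[0]
fields = header.split(bytes([ord('§')]))
print(fields)

# The first field ends with a dangling 0xc2 lead byte, so decoding fails.
error = None
try:
    fields[0].decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The dangling 0xc2 at position 1 matches the message in the tracebacks earlier in the thread.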

Now technically, we should be splitting by \xc2\xa7, but the C engine doesn't support splitting of that kind (we only support single character splitting for now).

In the long run, the solution would be to somehow support multi-char delimiters (tricky since we parse byte by byte with the C engine). In the short term, I think we should check the separator to see if it would be multi-char when encoded, and if so, raise an error.

Thoughts?

jorisvandenbossche · 2016-08-29T12:14:14Z

Trying to detect such separators and raising an informative message sounds fine IMO
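
For illustration, the kind of check being discussed might look like the sketch below. The helper name `encoded_sep_is_multibyte` is hypothetical and this is not the code that was merged into pandas; it only shows the idea of comparing the separator's character length with its encoded byte length.

```python
def encoded_sep_is_multibyte(sep, encoding='utf-8'):
    """Return True if a one-character separator needs more than one byte.

    Hypothetical helper, not part of pandas.
    """
    return len(sep) == 1 and len(sep.encode(encoding)) > 1


print(encoded_sep_is_multibyte('§'))             # section sign under UTF-8
print(encoded_sep_is_multibyte(','))             # plain ASCII comma
print(encoded_sep_is_multibyte('§', 'latin-1'))  # one byte in Latin-1
```

A caller could use such a check to warn, raise, or switch engines before handing the data to a byte-by-byte parser.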

The system file encoding can cause a separator to be encoded as more than one character even though it may be provided as one character. Multi-char separators are not supported by the C engine, so we need to catch this case. Closes gh-14065.

TomAugspurger added the "IO CSV" (read_csv, to_csv) and "Unicode" (Unicode strings) labels Aug 22, 2016

TomAugspurger added this to the 0.20.0 milestone Aug 22, 2016

TomAugspurger added Effort Medium labels Aug 22, 2016

gfyoung mentioned this issue Aug 30, 2016

API: Warn or raise for > 1 char encoded sep #14120

Closed

jreback modified the milestones: 0.19.0, 0.20.0 Aug 31, 2016

jreback closed this as completed in 5db52f0 Aug 31, 2016

This issue was closed.
