Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_csv: European numbers do not work with dates #14066

Closed
dacoex opened this issue Aug 22, 2016 · 10 comments
Closed

read_csv: European numbers do not work with dates #14066

dacoex opened this issue Aug 22, 2016 · 10 comments
Labels
Bug IO CSV read_csv, to_csv
Milestone

Comments

@dacoex
Copy link
Contributor

dacoex commented Aug 22, 2016

Code Sample, a copy-pastable example if possible

Using the following reader leads to omission of the dates resulting in no index:

file_, index_col=0,
                   header=4, sep=';', quoting=1, parse_dates=True,
                   encoding='latin-1', dayfirst=True, 
#                   decimal=',', thousands='.'

Expected Output

dataframe with pyhon numeric data and datecol as index

output of pd.show_versions()


pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-31-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Aug 22, 2016

Can you post some example data and the actual read_csv, and the output you're getting? You can use the StringIO module https://docs.python.org/3/library/io.html#io.StringIO to paste the data in. Thanks.

@TomAugspurger TomAugspurger added the IO CSV read_csv, to_csv label Aug 22, 2016
@dacoex
Copy link
Contributor Author

dacoex commented Aug 23, 2016

@TomAugspurger thanks for the response.

See also:
https://github.com/balzer82/DKB-Kontoauszug-Visualizer/blob/master/DKB-Kontoauszug-Visualizer.py#L35

Here goes the example:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd


TESTDATA=StringIO("""
 ; 
 ; 
 ; 
 ; 
 ; 
 ; 
day;transaction value (EUR) anon
04.08.2016;33043,5217026458
03.08.2016;52542,924322369
01.08.2016;2470,86389648009
01.08.2016;124462,590366738
01.08.2016;82283,2571462275
01.08.2016;5508278,98359258
01.08.2016;154264,547309073
01.08.2016;38068,8061252655
01.08.2016;10444,4806476402
01.08.2016;13958,4753522405
01.08.2016;4834682,43989406
29.07.2016;-1247752,81935378
29.07.2016;157975,468518703
    """)

df = pd.read_csv(TESTDATA, index_col=0,
                   header=4, sep=';', quoting=1, parse_dates=True,
                   encoding='latin-1', dayfirst=True, 
                   decimal=',',
                   thousands='.',
                   )

@gfyoung
Copy link
Member

gfyoung commented Aug 24, 2016

A couple of things:

  1. from pandas.compat import StringIO takes care of the Python 2/3 thing FYI.

  2. I believe you have the header parameter wrong. I think it should be header=6.

However, even with that parameter fixed, the reason you're seeing the index_col not being parsed as a date is because your thousands='.' is confusing the parser into parsing the '.' as a numeric value.

That is a bug, so thank you for pointing it out!

The issue actually has nothing to do with European date formats. You can see the bug surfaced here with a much more simplified example:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a\n04.15.2016'
>>> read_csv(StringIO(data), index_col=0, parse_dates=True, thousands='.')
Empty DataFrame
Columns: []
Index: [4152016]  # WRONG
>>>
>>> read_csv(StringIO(data), index_col=0, parse_dates=True)
Empty DataFrame
Columns: []
Index: [2016-04-15 00:00:00]  # RIGHT

Note that this bug does not affect non-index columns:

>>> read_csv(StringIO(data), parse_dates=['a'], thousands='.')
           a
0 2016-04-15

Similar observations can be made with the Python parser.

@dacoex
Copy link
Contributor Author

dacoex commented Aug 24, 2016

@gfyoung thanks for confirmation.

minor clarification:

The issue actually has nothing to do with European date formats.

I was referring to European data as under: Quoting, Compression, and File Format

No idea how to go on from here but looks like the processing priorities need to be changed in the parser.

@gfyoung
Copy link
Member

gfyoung commented Aug 24, 2016

@dacoex : Ah, okay. Good to know that my minimal example is capturing the issue you were seeing!

@jreback jreback added the Bug label Aug 25, 2016
@jreback jreback added this to the 0.19.0 milestone Aug 25, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Aug 26, 2016
When a thousands parameter is specified, if the index column data
contains that thousands value for date purposes (e.g. '.'), do not
interpret those characters as the thousands parameter.

Closes pandas-devgh-14066.
jorisvandenbossche pushed a commit that referenced this issue Aug 27, 2016
When a thousands parameter is specified, if the index column data
contains that thousands value for date purposes (e.g. '.'), do not
interpret those characters as the thousands parameter.

Closes gh-14066.
@dacoex
Copy link
Contributor Author

dacoex commented Aug 27, 2016

Big thanks to @jorisvandenbossche & @gfyoung FOSS is great!

@dacoex
Copy link
Contributor Author

dacoex commented Aug 27, 2016

@jreback
Copy link
Contributor

jreback commented Aug 27, 2016

9d10b76

@jreback
Copy link
Contributor

jreback commented Aug 27, 2016

it's takes a while to actually generate those docs

@dacoex
Copy link
Contributor Author

dacoex commented Sep 13, 2016

I tested with the v0.19 RC and it works with the original data. Thanks again!

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

No branches or pull requests

4 participants