ERR: improved pd.Period date parsing and formatting #13931

Open
smontanaro opened this issue Aug 8, 2016 · 10 comments
Labels
Enhancement · Error Reporting (Incorrect or improved errors from pandas) · Period (Period data type)

Comments

@smontanaro

It seems that pandas.Period has a few bugs w.r.t. date parsing and formatting, or at least poorly documented behavior.

  • The documentation makes no mention of the possible range of dates/years.
  • Formatting of years in strftime using the %Y format seems inconsistent.
  • Parsing of what is commonly assumed to be ISO-8601 dates seems wrong.

Code Sample, a copy-pastable example if possible

This seems reasonable, assuming American-centered times (m/d/y):
>>> pd.Period("1/2/3")
Period('2003-01-02', 'D')

This seems wrong, as the common interpretation of dates using hyphens as separators is (in my experience), ISO-8601 (though, I will grant that not specifying the necessary leading zeroes makes the input suspect):

>>> pd.Period("1-2-3")
Period('2003-01-02', 'D')

Hard to see how either of the next two results is correct. I'm not sure quite what to expect for the first example, but the second example clearly reads like year=3, month=1, day=2 to me. Despite the presence or lack of leading zeroes, I would think when presented with a date containing hyphens as separators, %Y-%m-%d would be assumed. I also think that %Y, %m, and %d should zero-pad their arguments to the correct widths (4, 2, 2, respectively). (The same could be said for other normally zero-padded timestamp fields.)

>>> pd.Period("01-02-0003")
Period('3-01-02', 'D')
>>> pd.Period("0003-01-02")
Period('2-03-01', 'D')
>>> pd.Period("0003-01-02").strftime("%Y")
u'2'
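
For reference, the fixed widths asked for above (4, 2, 2) can be produced without depending on how a given strftime implementation treats small years. This is a minimal sketch using only the standard library; `format_ymd` is a hypothetical helper name, not anything in pandas:

```python
from datetime import date


def format_ymd(d: date) -> str:
    """Format a date as YYYY-MM-DD with fixed field widths (4, 2, 2)."""
    # Explicit width specifiers always pad with zeros, so a year below
    # 1000 still occupies four characters.
    return f"{d.year:04d}-{d.month:02d}-{d.day:02d}"


print(format_ymd(date(3, 1, 2)))     # 0003-01-02
print(format_ymd(date(2003, 1, 2)))  # 2003-01-02
```

This matches what `date.isoformat()` returns for the same values.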

Given that date formats differ so widely, to eliminate ambiguity perhaps the Period constructor should take an (optional keyword) argument which specifies a format string as understood by strptime(). It would clearly have to be extended somewhat to accommodate the quarter notation.
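
To illustrate the idea of an explicit format, here is a rough sketch built on the standard library's `strptime`. The helper name `parse_with_format` is hypothetical and this is not a pandas API; it only shows how a caller-supplied format removes the guesswork about field order:

```python
from datetime import date, datetime


def parse_with_format(text: str, fmt: str) -> date:
    """Parse `text` according to an explicit strptime-style format.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # strptime raises ValueError when the text does not match fmt,
    # so an unexpected field order is rejected rather than guessed.
    return datetime.strptime(text, fmt).date()


d = parse_with_format("0003-01-02", "%Y-%m-%d")
print(d.year, d.month, d.day)  # 3 1 2

d = parse_with_format("01-02-0003", "%m-%d-%Y")
print(d.year, d.month, d.day)  # 3 1 2
```

The resulting date could presumably be handed to the Period constructor together with a frequency. Quarter notation such as '2010Q3' has no strptime directive, so it would still need separate handling.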

output of pd.show_versions()

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.4.63-2.44-desktop
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0

@jreback
Contributor

jreback commented Aug 8, 2016

There is currently not a lot of validation on periods when they are not abbreviations (e.g. '2010Q3'). So sure, there could be some more hints here (maybe even dayfirst, yearfirst would be enough), as we don't support passing format to Timestamp.

This all comes at a cost though. perf has to be monitored (e.g. to take the fast path when this kind of parsing is NOT needed). Further improved docs / examples are always welcome.

PR to start us off?
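
A sketch of what "take the fast path" might look like, assuming the cheap case is a strictly zero-padded YYYY-MM-DD string. The names `parse_date` and `slow_path` are made up for this example and do not correspond to pandas internals:

```python
import re
from datetime import date
from typing import Callable

# Strictly zero-padded YYYY-MM-DD; anything else goes to the slow path.
_ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_date(text: str, slow_path: Callable[[str], date]) -> date:
    """Try a cheap strict-ISO match first, else defer to `slow_path`."""
    m = _ISO_DATE.fullmatch(text)
    if m is not None:
        year, month, day = (int(g) for g in m.groups())
        # date() itself raises ValueError for out-of-range fields.
        return date(year, month, day)
    return slow_path(text)


def reject(text: str) -> date:
    """Stand-in slow path that refuses everything it is given."""
    raise ValueError(f"unrecognized date string: {text!r}")


print(parse_date("0003-01-02", reject))  # 0003-01-02
```

Only inputs that miss the strict pattern pay for the more flexible (and slower) parsing, which is where hints like dayfirst or yearfirst would apply.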

@jreback jreback added the Error Reporting (Incorrect or improved errors from pandas), Period (Period data type), and Difficulty Intermediate labels Aug 8, 2016
@jreback jreback added this to the Next Major Release milestone Aug 8, 2016
@jreback jreback changed the title from "Several pandas.Period date parsing and formatting nits" to "ERR: improved pd.Period date parsing and formatting" Aug 8, 2016
@jreback
Contributor

jreback commented Aug 8, 2016

note that here we recently added some much more flexible ISO8601-like parsing (e.g. accepting arbitrary seps). This is c-code, but I suppose could be done for Periods (prob cython is easier). The trick here is to do this efficiently.

Certainly narrowing down non-valid formats (or even ambiguous ones) is a good thing in any event.

cc @sinhrks
cc @chris-b1

pull-requests are welcome!
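
One way to picture "narrowing down ambiguous formats" is to enumerate every field order under which a three-field numeric string is a valid date and flag the string when the orders disagree. This is only a sketch under simplifying assumptions: it ignores field widths, applies no two-digit-year pivot, and the names `interpretations` and `is_ambiguous` are hypothetical:

```python
import re
from datetime import date
from typing import Dict

_THREE_FIELDS = re.compile(r"(\d+)[-/](\d+)[-/](\d+)")

# Positions of (year, month, day) within the three fields, per order.
_ORDERS = {"ymd": (0, 1, 2), "mdy": (2, 0, 1), "dmy": (2, 1, 0)}


def interpretations(text: str) -> Dict[str, date]:
    """Return every field order under which `text` is a valid date."""
    m = _THREE_FIELDS.fullmatch(text)
    if m is None:
        return {}
    fields = [int(g) for g in m.groups()]
    found = {}
    for name, (yi, mi, di) in _ORDERS.items():
        try:
            found[name] = date(fields[yi], fields[mi], fields[di])
        except (ValueError, OverflowError):
            pass  # this order does not yield a valid date
    return found


def is_ambiguous(text: str) -> bool:
    """True when more than one distinct date fits the string."""
    return len(set(interpretations(text).values())) > 1


print(is_ambiguous("1-2-3"))       # True
print(is_ambiguous("2016-08-31"))  # False
```

A parser could raise, or require a hint, whenever such a check reports more than one reading.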

@smontanaro
Author

I can probably come up with a trivial pull request for %Y formatting, but as I've never worked on the Pandas source before, I have a question. I see a tox.ini file in the top level directory. Am I supposed to pip install tox or conda install ctox into my Anaconda environment? (I tried conda install tox, but that failed.) I ask about the Anaconda environment because those packages are much more up-to-date than what I have in my normal work environment.

@jreback
Contributor

jreback commented Aug 8, 2016

contributing docs are here

@jreback
Contributor

jreback commented Aug 8, 2016

i wouldn't use tox, no real need for this.

just create a separate dev environment as indicated.

@smontanaro
Author

Ugh. This isn't looking as easy as I thought. The problem seems to be with the parsing, not the formatting on output. Consider this:

>>> x = pd.Period("03-01-02")
>>> x.year
2002
>>> x = pd.Period("003-01-02")
>>> x.year
2
>>> x = pd.Period("0003-01-02")
>>> x.year
2

It's not clear why the number of leading zeros in the month would have an effect on how the year is parsed. It seems to me like the year should always be 2 or 2002, and not vary as the number of leading zeros in the month changes.

@jreback
Contributor

jreback commented Aug 8, 2016

yes, you need to step thru this and see what is going on. Ideally, create some tests of the suspect cases first, with assertions that each one either gives a valid result or raises. Then fix.
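
A table-driven sketch of such tests, where each suspect input must either produce a specific date or raise. The expectation that the short-year forms should raise is an assumption made for this example, not something settled in this thread, and `date.fromisoformat` is only a stand-in for whichever parser is under test:

```python
from datetime import date
from typing import Callable, List

# Each case: input string and the expected date, or None if it must raise.
# The None entries are an assumption for illustration.
CASES = [
    ("0003-01-02", date(3, 1, 2)),
    ("2003-01-02", date(2003, 1, 2)),
    ("003-01-02", None),
    ("03-01-02", None),
]


def check(parse: Callable[[str], date]) -> List[str]:
    """Return failure messages; an empty list means every case passed."""
    failures = []
    for text, expected in CASES:
        try:
            got = parse(text)
        except ValueError:
            got = None  # treat a raise as "no result"
        if got != expected:
            failures.append(f"{text!r}: expected {expected!r}, got {got!r}")
    return failures


# Stand-in parser; swap in the function actually being tested.
print(check(date.fromisoformat))
```

Running the same table against the Period parsing path would show exactly which inputs diverge from the intended behavior.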

@sinhrks
Member

sinhrks commented Aug 8, 2016

2nd case is different from dateutil, so maybe it is handled in one of the short paths.

For me, "0003-01-02" should have "3" as the year (not "2"), just as "2003-01-02" should have "2003" as the year.

parser.parse("03-01-02")
# datetime.datetime(2002, 3, 1, 0, 0)

parser.parse("003-01-02")
# datetime.datetime(3, 1, 2, 0, 0)

parser.parse("0003-01-02")
# datetime.datetime(3, 1, 2, 0, 0)

@smontanaro
Author

On Mon, Aug 8, 2016 at 4:02 PM, Sinhrks notifications@github.com wrote:

2nd case is different from dateutil, so maybe it is handled in one of the short paths.

Yes, I noticed that when I was horsing around with
dateutil.parser.parse(). I hadn't gotten around to figuring out where
the pandas code diverged from just using dateutil.

@smarie
Contributor

smarie commented Mar 11, 2022

I just bumped into this old issue today and had a look since I am concerned with fast string formatting in #46116.

Concerning the string formatting issue, it seems fixed on master:

>>> pd.Period("0003-01-02").strftime("%Y")
'0003'

For the parsing issues, they still seem relevant.
Maybe the title should be edited to remove "formatting"
That's all for today's two cents :)

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022