Bizarre day parsing issue #307

Closed
robertzk opened this Issue Mar 17, 2015 · 7 comments

robertzk commented Mar 17, 2015

I have a very spooky dataset. It fails to parse in full on this exact set in this exact order, but parses fine in every subset / subsample I have tried.

dates <- c("{ND}", "{ND}", "{ND}", "2006-11-26", "{ND}", "{ND}", "{ND}",
"2010-06-05", "2014-06-01", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "2006-10-31", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "2010-10-31", "2009-05-01",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", NA, "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "2008-09-09",
"{ND}", "2003-01-01", "{ND}", "{ND}", "2013-02-28", "2011-10-31",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "2010-08-31", "{ND}", "{ND}",
"2011-02-01", "2012-03-31", "2013-06-04", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "2005-12-12", "2006-09-30", NA, "{ND}")

First, notice

> all(is.na(lubridate::ymd(dates)))
[1] TRUE

However, every other variation of samples I have tried parses fine...

variations <- list(dates[-1], dates[-102], sample(dates), sample(dates, 50),
                   dates[-50], dates[c(FALSE, TRUE)],
                   dates[c(TRUE, TRUE, TRUE, TRUE, FALSE)], dates[102:1])
all(sapply(variations, function(v) !all(is.na(lubridate::ymd(v)))))
# [1] TRUE

Note that shuffling the vector or reversing it is sufficient to get ymd to parse the vector correctly.

I understand if you don't want to look into / fix this, but it did cause a production issue. Baffling.

peterhurford commented Mar 17, 2015

👍

I have replicated this on a separate machine. I have also replicated it again using R --no-init-file.

robertzk commented Mar 17, 2015

Looks like it's your "irregular guesser" in .get_train_set: the first prime indices under 100 in the vector I provided all have an "{ND}", and any shuffle or re-ordering will almost always disrupt this.

Oh well, space-time trade-off and all, I guess. No way to fix this without impacting performance; someone was bound to hit this in the wild eventually.

vspinu (Member) commented Mar 17, 2015

You were very unlucky with your very sparse date vector. Lubridate has a training engine that is used to recognize formats automatically. The training is done on a small subset of the original vector, which is generated based on the first 501 primes. See the internal .get_train_set function for how it is done. If you have a better idea, I am all ears.
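To make the failure mode concrete, here is a minimal sketch of how a prime-indexed sample can miss every real date in a sparse vector. The helper `prime_positions` is hypothetical and written only for illustration; it is not lubridate's actual `.get_train_set`, whose details may differ. The toy vector places real dates at positions 4, 8, 9, 18, 27 and 28 (the positions of the first six dates in the report above), none of which is prime.

```r
# Hypothetical helper for illustration only; NOT lubridate's .get_train_set.
# Returns all prime numbers in 2..n, to be used as sampling positions.
prime_positions <- function(n) {
  if (n < 2) return(integer(0))
  candidates <- 2:n
  is_prime <- vapply(
    candidates,
    function(k) k < 4 || all(k %% 2:floor(sqrt(k)) != 0),
    logical(1)
  )
  candidates[is_prime]
}

# Sparse vector: placeholders everywhere except six composite positions.
x <- rep("{ND}", 30)
x[c(4, 8, 9, 18, 27, 28)] <- c("2006-11-26", "2010-06-05", "2014-06-01",
                               "2006-10-31", "2010-10-31", "2009-05-01")

# A sample taken at prime positions sees only placeholders.
train <- x[prime_positions(length(x))]
print(all(train == "{ND}"))

# Reversing the vector moves some dates onto prime positions (3, 13, 23).
rev_train <- rev(x)[prime_positions(length(x))]
print(any(rev_train != "{ND}"))
```

If a format guesser were trained only on `train`, it would have nothing parseable to learn from, which would be consistent with the observation that shuffling or reversing the input makes the problem disappear.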

robertzk commented Mar 17, 2015

No, that's alright, I have the same issues in my packages. Obviously, any approach that does not inspect the full vector will run into edge cases.

robertzk closed this Mar 17, 2015

vspinu (Member) commented Mar 17, 2015

You can use parse_date_time2, which doesn't use any training and is very fast:

parse_date_time2(dates, "Ymd")

Training should also be removed from parse_date_time, at least partially, for example when only one order is supplied, as in your case.
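As a rough illustration of training-free parsing, the sketch below runs the suggested call on a small vector and compares it with a base R parse that uses a fixed format. The short vector `x` is made up for the example, and the lubridate call is guarded so the snippet still runs when the package is not installed. With a fixed format, base R parses each element on its own, so the "{ND}" placeholders become NA without affecting the real dates.

```r
# Small illustrative vector (values borrowed from the report above).
x <- c("{ND}", "2006-11-26", NA, "2010-06-05")

# Base R: a fixed format is applied to every element independently,
# so unparseable placeholders simply yield NA.
base_parsed <- as.Date(x, format = "%Y-%m-%d")
print(base_parsed)

# lubridate: the call suggested above, run only if the package is available.
if (requireNamespace("lubridate", quietly = TRUE)) {
  lub_parsed <- lubridate::parse_date_time2(x, "Ymd")
  print(lub_parsed)
}
```

Because no format guessing is involved in either call, the result should not depend on the order of the elements.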

vspinu (Member) commented Mar 17, 2015

Ok. Closing in favor of #308.

vspinu closed this Mar 17, 2015

robertzk commented Mar 17, 2015

Thanks!
