Bizarre day parsing issue #307

Closed
robertzk opened this issue Mar 17, 2015 · 7 comments · Fixed by robertzk/syberiaMungebits#45

Comments

@robertzk

I have a very spooky dataset. It fails to parse in full on this exact set in this exact order, but parses fine in every subset / subsample I have tried.

dates <- c("{ND}", "{ND}", "{ND}", "2006-11-26", "{ND}", "{ND}", "{ND}",
"2010-06-05", "2014-06-01", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "2006-10-31", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "2010-10-31", "2009-05-01",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", NA, "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "2008-09-09",
"{ND}", "2003-01-01", "{ND}", "{ND}", "2013-02-28", "2011-10-31",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "{ND}", "{ND}", "2010-08-31", "{ND}", "{ND}",
"2011-02-01", "2012-03-31", "2013-06-04", "{ND}", "{ND}", "{ND}",
"{ND}", "{ND}", "2005-12-12", "2006-09-30", NA, "{ND}")

First, notice

> all(is.na(lubridate::ymd(dates)))
[1] TRUE

However, every other variation of samples I have tried parses fine...

variations <- list(dates[-1], dates[-102], sample(dates), sample(dates, 50),
  dates[-50], dates[c(FALSE, TRUE)], dates[c(TRUE, TRUE, TRUE, TRUE, FALSE)],
  dates[102:1])
all(sapply(variations, function(v) !all(is.na(lubridate::ymd(v)))))
# [1] TRUE

Note that shuffling the vector or reversing it is sufficient to get ymd to parse the vector correctly.

I understand if you don't want to look into / fix this, but it did cause a production issue. Baffling.

@peterhurford

👍

I have replicated this on a separate machine. I have also replicated it again using R --no-init-file.

@robertzk
Author

Looks like it's your "irregular guesser" in .get_train_set: the first prime indices under 100 in the vector I provided all have an "{ND}", and any shuffle or re-ordering will almost always disrupt this.

Oh well, space-time trade-off and all, I guess. No way to fix this without impacting performance; someone was bound to hit this in the wild eventually.

@vspinu
Copy link
Member

vspinu commented Mar 17, 2015

You were very unfortunate with your very sparse date vector. Lubridate has a training engine which is used to automatically recognize the formats. The training is done on a small subset of the original vector, which is generated based on the first 501 primes. See the internal .get_train_set function for how it is done. If you have a better idea I am all ears.
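
To illustrate how sampling at fixed positions can miss every date, here is a minimal base R sketch. It is not lubridate's actual .get_train_set code: `primes_upto` is a hypothetical helper, the vector `x` is made up, and sampling at every prime index up to the vector length is an assumption standing in for the real training rule.

```r
# Hypothetical helper: all primes <= n, via a simple sieve.
primes_upto <- function(n) {
  if (n < 2) return(integer(0))
  is_prime <- rep(TRUE, n)
  is_prime[1] <- FALSE
  for (i in seq_len(floor(sqrt(n)))) {
    if (is_prime[i]) {
      # Cross out multiples of i, starting at i^2 (always <= n here).
      is_prime[seq(i * i, n, by = i)] <- FALSE
    }
  }
  which(is_prime)
}

# Made-up sparse vector: the only dates sit at composite positions.
x <- rep("{ND}", 100)
x[c(4, 25, 60, 99)] <- c("2006-11-26", "2010-06-05", "2014-06-01", "2009-05-01")

# Assumed training rule: look only at prime positions.
idx <- primes_upto(length(x))
train <- x[idx]          # training sample from the original order
rev_train <- rev(x)[idx] # same positions after reversing the vector

all(train == "{ND}")     # the sample sees no dates at all
any(rev_train != "{ND}") # after reversing, some dates are sampled
```

In this sketch, reversing moves three of the dates onto prime positions (97, 41 and 2), which is consistent with the observation above that reordering alone is enough to change the outcome.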

@robertzk
Author

No, that's alright, I have the same issues in my packages. Obviously there is no mathematical solution without inspecting the full vector that won't run into edge cases.

@vspinu
Member

vspinu commented Mar 17, 2015

You can use parse_date_time2 which doesn't use any training and is very fast:

parse_date_time2(dates, "Ymd")

Training should also be removed from parse_date_time, at least when only one order is supplied, as in your case.
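
For comparison, a dependency-free way to see what parsing with a fixed format and no guessing looks like is base R's `as.Date` with an explicit `format`. This is a base R analogue rather than lubridate's implementation, and the short vector `x` is invented for illustration.

```r
# Made-up input: mostly placeholders, one missing value, two real dates.
x <- c("{ND}", "{ND}", "2006-11-26", NA, "2010-06-05")

# With an explicit format there is no training step: each element is
# parsed on its own, and anything that does not match becomes NA.
parsed <- as.Date(x, format = "%Y-%m-%d")

is.na(parsed)       # TRUE for the placeholders and the NA input
sum(!is.na(parsed)) # number of elements that parsed as dates
```

Because every element is handled independently, the result does not depend on the order of the vector or on where the real dates happen to sit.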

@vspinu
Member

vspinu commented Mar 17, 2015

Ok. Closing in favor of #308.

@vspinu vspinu closed this as completed Mar 17, 2015
@robertzk
Author

Thanks!
