“ymd” family of functions fail with “Error in gsub,” language locale bug? #181
Comments
I am afraid this is a system-specific problem. I cannot reproduce it on Linux. This is what I have:
> Sys.setlocale("LC_TIME", "ja_JP.utf8")
[1] "ja_JP.utf8"
> format(Sys.time(), format = "%a %Y %b %d %I:%M:%S %p")
[1] "金 2013 4月 26 04:40:43 午後"
> ymd("2010-12-08")
[1] "2010-12-08 UTC"
>
Let's try to isolate the problem. From what I can see, it has to do with the following code in /home/vitoshka/TVC/lubridate/R/guess.r (#408 to #410):
num_exact[] <- gsub("(?<!\\()\\?(?!<)", "", perl = T, # remove ?
gsub("+", "*", fixed = T,
gsub(">", "_e>", num))) # append _e to avoid duplicates
It would be nice if you posted the value of the "num" variable here. You can do that either by placing the |
Thanks so much for the response.
|
You are doing things right. The frame you should enter is 11 or even 10 |
Thanks very much for the guidance. Got it from frame 10. It resulted in quite the explosion of text. I wonder if it's having issues with the AM/PM kanji in there?
Selection: 10
Browse[1]>
|
Yes, it has to do with the encoding and most likely with the fact that your R is English and your locale is Japanese. Now take those funny expressions one at a time and check which one is causing the problem. For example:
funny_exp <- "(((?<I_s>1[0-2]|0?[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\\d))"
gsub("+", "*", funny_exp, fixed = T)
gsub("(?<!\\()\\?(?!<)", "", funny_exp, perl = T)
Once we figure this out, we will be very close to a reproducible example that you can post on Stack Overflow or R-help. This is not really a lubridate issue per se, but it would be good to know what is going on for the future. |
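Checking the expressions one at a time can be scripted rather than done by hand. The following is only a minimal sketch: `check_patterns` is a hypothetical helper, not part of lubridate, and the two patterns are ASCII-only stand-ins for the real `num` vector from the debugger.

```r
# Hypothetical helper (not part of lubridate): run both substitutions on each
# pattern and record either "ok" or the error message that was raised.
check_patterns <- function(patterns) {
  vapply(patterns, function(p) {
    tryCatch({
      gsub("+", "*", p, fixed = TRUE)
      gsub("(?<!\\()\\?(?!<)", "", p, perl = TRUE)
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1))
}

# ASCII-only stand-ins; inside the browser, pass the real `num` instead.
patterns <- c(H = "(?<H>2[0-4]|[01]?\\d)", I = "(?<I>1[0-2]|0?[1-9])")
result <- check_patterns(patterns)
print(result[result != "ok"])  # only the patterns that failed, if any
```

Run against the real `num` on the affected machine, any entry that is not "ok" would point at the expression that triggers the error.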
Great, once I get home tonight I will try playing around with it in more detail. I will post again as soon as I can. For reference, however, even returning my R to Japanese so that it matches the native locale, the error still occurs. |
Bizarre... I went through all the values of "num" in there, and each time the exact same pattern of output came out. No errors or anomalies pop up, however. The pattern is, for example,
or
where the question mark(s) are removed in the second gsub. Besides that, fiddling around with things I cannot seem to come across a reproduction of the error. R is quite widely used in Japan, and I would be surprised if no one in the country was able to use lubridate. Apologies for not being able to get more information together myself, the support is much appreciated. |
The value of num is correct. It's some multibyte glitch in gsub with fixed=T. Try the following:
fmt <- format(as.POSIXct("1970-01-01 02:00:00"), "%a+%A+%b@%B@%p@")
gsub("+", "*", fmt, fixed = T)
It should give
Also try:
gsub("+", "*", sprintf("%s", fmt), fixed = T)
This is how num is constructed, using format and sprintf. |
Seems to be no issue in trying these guys as well:
|
OK, I think I know where it comes from, but I have no clue how to solve it. Does this work for you?
paste(c("午前", "午後"), collapse = "|")
## -> [1] "午前|午後"
If so, does this work as expected?
ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
ampm
## -> [1] "午前" "午後"
paste(ampm, collapse = "|")
## -> [1] "午前|午後"
|
And then of course,
gsub("+", "*", paste(ampm, collapse = "|"), fixed = T)
gsub("+", "*", sprintf("%s", paste(ampm, collapse = "|")), fixed = T)
since that is where the error comes from. |
Eh, it sinks gradually. Also double gsub:
gsub("+", "*", gsub("|", "*", sprintf("%s", paste(ampm, collapse = "+|")), fixed = T), fixed = T)
This is how remote debugging works :) |
This is an extremely useful learning experience... While I have some experience with programming, it has all been fairly small-scale applications and thus was always quite easy to debug. Going through the above suggestions, the functions' responses are as follows:
|
Sorry, I am out of options; this is the most obscure bug I have seen in years. Here is the last attempt:
gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), fixed = T)
Here is isolated code from .build_locale_regs. It is virtually identical to the internal code (by the way, I hope you are using the most recent version of lubridate from GitHub).
ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
p <- unique(ampm)
p <- p[nzchar(p)]
alpha_p <- sprintf("(?<p>%s)(?![[:alpha:]])", paste(p, collapse = "|"))
## NUMERIC FORMATS
num <- c(
d = "(?<d>[012]?[1-9]|3[01]|[12]0)",
H = "(?<H>2[0-4]|[01]?\\d)",
h = "(?<H>2[0-4]|[01]?\\d)",
I = "(?<I>1[0-2]|0?[1-9])",
j = "(?<j>[0-3]?\\d?\\d)",
M = "(?<M>[0-5]?\\d)",
S = "((?<OS_S>[0-5]?\\d\\.\\d+)|(?<S>[0-6]?\\d))",
s = "((?<OS_S>[0-5]?\\d\\.\\d+)|(?<S>[0-6]?\\d))",
U = "(?<U>[0-5]?\\d)",
w = "(?<w>[0-6])", # merge with a, A??
u = "(?<u>[1-7])",
W = "(?<W>[0-5]?\\d)",
## x = "(?<x>\\d{2}/[01]?\\d/[0-3]?\\d)",
## X = "(?<X>[012]?\\d:[0-5]?\\d:[0-6]?\\d)",
Y = "(?<Y>\\d{4})",
y = "((?<Y_y>\\d{4})|(?<y>\\d{2}))",
Oz = "(?<Oz_Oz>[-+]\\d{4})", ## strptime implements only this format (4 digits)
## F = "(?<F>\\d{4)-\\d{2}-\\d{2})",
OO = "(?<OO>[-+]\\d{2}:\\d{2})",
Oo = "(?<Oo>[-+]\\d{2})")
check <- sprintf("((%s\\D+%s\\D+%s\\D*%s)|(%s\\D+%s\\D+%s))",
num[["I"]], num[["M"]], num[["S"]], alpha_p, num[["H"]], num[["M"]], num[["S"]])
check <- gsub("(<[IMSpHS]|<OS)", "\\1_s", check)
gsub("(?<!\\()\\?(?!<)", "", perl = T,
gsub("+", "*", fixed = T,
gsub(">", "_e>", check))) # append _e to avoid duplicates
If this one doesn't give you an error, I am afraid you will have to step through the code and try to isolate the problem yourself. Here is how to do that:
.date_template <- lubridate:::.date_template
lubridate:::.build_locale_regs() ## to get the code
Now copy and paste the body of the function into a new file and execute it as usual. If you don't get the error, that can only mean you are kidding me. Once you get an error, try to cut the irrelevant pieces until you get something manageable. By the way, just to make sure, lubridate:::.build_locale_regs() should give you the original error. |
Thanks so much for all the help. Extremely appreciated. |
@Vitoshka this is probably some detail with Windows locales + character encodings and will probably need a windows box to reproduce. |
In following the above suggestions, it seems as if the "location" of the error has been honed down. As above, assigning
trying this guy finally brought up an error:
I put in the spaces between the <, 8, c, and > since Github's auto-formatting made it disappear otherwise. |
Thanks. Just to make sure: the following does work as expected:
gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), fixed = T), fixed = T)
And this one breaks:
gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), fixed = T)
If so, I can fix it today, and you should probably report an R bug. |
Really! That would be wonderful. Indeed, the first one works as expected, while the latter does break:
I'll be sure to report the bug. Please let me know if there is anything else I should do. |
I have committed a change. Try it out with:
library(devtools)
install_github("lubridate", "vitoshka")
|
Apologies for the delay... for some reason I'm having issues with install_github, getting an error in tools:::.install_packages() saying it "cannot create temporary directory." Read/write permissions are all fine for the temp directory, so I'm digging around the error a bit deeper. Once I get this worked out I'll get back to you ASAP. |
Sorry about the delay. I've got devtools working properly again (had to tweak some environment variables) and I believe the download from github went fine:
However, unfortunately when I load the package and try the function again,
which is a bit different than before, though I'm not sure of what has precisely changed. Going through the same routine as above once again, the results were the same, including with the same final error popping up here:
Assuming the download from github was of the correct files, please let me know if anything looks interesting to you in terms of the error content. I'll be sure to try anything out if need be. Regards! |
Hm, it is getting grimmer and grimmer. I have committed a change that completely avoids standard R regexp (that is, it uses perl or fixed regexp). I hope perl works for you; otherwise there is really no other option than deactivating internationalization in lubridate. Try my master branch again and also try this:
gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), perl = T), fixed = T)
And you really should report this bug to the R people (here: https://bugs.r-project.org/bugzilla3/) |
There is no point submitting a bug to R unless you can create a simple reproducible example. For example, when I run the following code on my Windows machine, I don't get an error (but neither do I get the correct output):
Sys.setlocale("LC_ALL", "Japanese_Japan.932")
ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
ampm
# [1] "??" "??"
gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), fixed = T), fixed = T)
# [1] "??**??"
gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), fixed = T)
# [1] "??**??" |
For reference, I have been able to reproduce it on other machines, though the setup has been identical (Japanese 64-bit Windows 7). With the new version, I get
The other statement yields:
|
It's unlikely that the R maintainers will have a Japanese version of windows available, so it would be helpful to create an example that fails for everyone. It's quite possible that there are other ways to fix the bug by correcting the string encoding but without a reproducible example, there's no way I can explore. |
This is why this bug is a nightmare. The bug is easily reproducible on a Japanese machine, and it is a problem with the regexp parser, because fixed=T works. In your case, though, something else might be going on. The ?? in the output might simply mean that your terminal doesn't know how to display the characters or that the encoding is absent on your machine. I have no clue how Windows deals with this, but on Linux I get a warning when I try to set a missing language locale. |
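Whether a locale was actually applied can be checked before looking at the output at all. This is a rough sketch: `try_locale` is a hypothetical helper, and it relies on `Sys.setlocale()` returning an empty string (with a warning) when the request cannot be honoured. The locale names are only examples and are spelled differently on Windows and Linux.

```r
# Hypothetical helper: try to switch a locale category and report whether it
# worked. Sys.setlocale() returns "" (with a warning) when the request cannot
# be honoured by the operating system.
try_locale <- function(locale, category = "LC_TIME") {
  old <- Sys.getlocale(category)
  on.exit(Sys.setlocale(category, old))  # restore the previous setting
  res <- suppressWarnings(Sys.setlocale(category, locale))
  !identical(res, "")
}

try_locale("ja_JP.utf8")      # depends on the locales installed on the machine
try_locale("no_SUCH.locale")  # a made-up name that should not be available
```

If the locale is available and the output still shows ??, inspecting the bytes with charToRaw(ampm[1]) can help tell a display problem from a string that really contains question marks (byte 3f).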
Fixing the encoding with Encoding(x) <- value, right? Maybe at least ask on R-devel. Someone might recommend a fix without |
Ok - I can reproduce it now - the key is to use the R GUI, not RStudio (I'll report that bug). But the plot thickens:
Sys.setlocale("LC_ALL", "Japanese_Japan.932")
times <- c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")
ampm <- format(as.POSIXct(times), format = "%p")
x <- gsub(">", "*", paste(ampm, collapse = "+>"))
y <- "午前+*午後"
identical(x, y)
# [1] TRUE
gsub("+", "*", x, fixed = T)
# Error in gsub("+", "*", x, fixed = T) :
# invalid multibyte string at '<8c>'
gsub("+", "*", y, fixed = T)
# [1] "午前**午後" |
It seems like a known problem, but it's not obvious what the fix is. |
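One thing worth keeping in mind here is that identical() returning TRUE does not prove that two strings hold the same bytes: R treats strings in different marked encodings as identical if they agree once translated to UTF-8. Whether that is what happens with x and y above is not confirmed in this thread. The sketch below only illustrates the general effect with a Latin-1 example that does not need a Japanese locale.

```r
utf8   <- "caf\u00e9"                      # marked as UTF-8 by the \u escape
latin1 <- iconv(utf8, "UTF-8", "latin1")   # same text, stored as Latin-1 bytes

identical(utf8, latin1)     # TRUE: compared after translation to UTF-8
Encoding(c(utf8, latin1))   # "UTF-8" "latin1"
charToRaw(utf8)             # 63 61 66 c3 a9
charToRaw(latin1)           # 63 61 66 e9
```

Running Encoding(c(x, y)), charToRaw(x) and charToRaw(y) on the affected machine would show whether the two strings differ in the same way.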
R-devel was pretty silent :) Would it help to explicitly convert the string? I see enc2utf8, Encoding and iconv, which are apparently designed for this
|
No, that doesn't help as far as I can tell :( |
Thanks so much for all your efforts guys. |
I am using version 3.0.0 of R on 64-bit Windows 7. It may (or may not) be worth noting that I live in Japan and the system language of my OS is Japanese. I am running R in English, however.
Lubridate has been updated to the most recent version, and
loads the package without any issue. I can for example run something like
with no issue. That said, the critically important ymd family of functions (ymd, dmy, etc.) does not function properly. For ymd, mdy, any of them, with any argument, I get the following error:
I received a response on StackExchange suggesting it might be a bug related to the language locale. For reference, the sessionInfo() is as follows:
locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.3.0 TSA_1.01 tseries_0.10-31 mgcv_1.7-22
[5] locfit_1.5-9.1 leaps_2.9
loaded via a namespace (and not attached):
[1] digest_0.6.3 grid_3.0.0 lattice_0.20-15 Matrix_1.0-12
[5] memoise_0.1 nlme_3.1-109 plyr_1.8 quadprog_1.5-5
[9] stringr_0.6.2 zoo_1.7-9
Any attention would be most appreciated.