“ymd” family of functions fail with “Error in gsub,” language locale bug? #181

ghost · 2013-04-26T14:09:02Z

I am using version 3.0.0 of R on 64-bit Windows 7. It may (or may not) be worth noting that I live in Japan and the system language of my OS is Japanese. I am running R in English, however.

Lubridate has been updated to the most recent version, and

library(lubridate)

loads the package without any issue. I can for example run something like

now()
[1] "2013-04-26 21:07:30 JST"
now() - days(2)
[1] "2013-04-24 21:07:41 JST"

with no issue. That said, the critically important ymd family of functions (ymd, dmy, etc.) does not function properly. For ymd, mdy, any of them, with any argument, I get the following error:

ymd("2010-12-08")
Error in gsub("+", "*", fixed = T, gsub(">", "_e>", num)) :
invalid multibyte string at '<8c>)<28>?![[:alpha:]]))|((?<H_s_e>
2[0-4]|[01]?\d)\D+(?<M_s_e>[0-5]?\d)\D+((?<OS_s_S_e>[0-5]?\d.\d+)|
(?<S_s_e>[0-6]?\d))))'

I received a response on StackExchange suggesting it might be a bug related to the language locale. For reference, the sessionInfo() is as follows:

sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lubridate_1.3.0 TSA_1.01 tseries_0.10-31 mgcv_1.7-22
[5] locfit_1.5-9.1 leaps_2.9

loaded via a namespace (and not attached):
[1] digest_0.6.3 grid_3.0.0 lattice_0.20-15 Matrix_1.0-12
[5] memoise_0.1 nlme_3.1-109 plyr_1.8 quadprog_1.5-5
[9] stringr_0.6.2 zoo_1.7-9

Any attention would be most appreciated.

vspinu · 2013-04-26T14:58:09Z

I am afraid this is a system specific problem. I cannot reproduce it on linux. This is what I have:

> Sys.setlocale("LC_TIME", "ja_JP.utf8")
[1] "ja_JP.utf8"
> format(Sys.time(), format = "%a %Y %b %d %I:%M:%S %p")
[1] "金 2013  4月 26 04:40:43 午後"
> ymd("2010-12-08")
[1] "2010-12-08 UTC"
>

Let's try to isolate the problem.

From what I can see it has to do with the following code in lubridate:::.build_locale_regs()

╭──────── #408 ─ /home/vitoshka/TVC/lubridate/R/guess.r ──
│     num_exact[] <- gsub("(?<!\\()\\?(?!<)", "", perl = T, # remove ?
│                         gsub("+", "*",  fixed = T,  
│                              gsub(">", "_e>", num))) # append _e to avoid duplicates
╰──────── #410 ─

It would be nice if you could post the value of the "num" variable here. You can get it either by placing browser() at that location or by setting options(error=recover) and choosing the appropriate frame once the error has occurred.

ghost · 2013-04-26T16:14:39Z

Thanks so much for the response.
I tried working with the code a bit, though couldn't get as far as the value of variable "num." As the code below shows, I must be missing something on how to reach the arguments of that nested gsub. Apologies for having to be walked through this, I'm still very new to R.

options(error=recover)
ymd("2012-02-02")
Error in gsub("+", "*", fixed = T, gsub(">", "_e>", num)) :
invalid multibyte string at '<8c>)<28>?![[:alpha:]]))|((?<H_s_e>2[0-4]|[01]?\d)\D+(?<M_s_e>[0-5]?\d)\D+((?<OS_s_S_e>[0-5]?\d.\d+)|(?<S_s_e>[0-6]?\d))))'

Enter a frame number, or 0 to exit

1: ymd("2012-02-02")
2: .parse_xxx(..., orders = "ymd", quiet = quiet, tz = tz, locale = locale,
3: as.POSIXct(parse_date_time(dates, orders, quiet = quiet, tz = tz, locale
4: parse_date_time(dates, orders, quiet = quiet, tz = tz, locale = locale,
5: .local_parse(x[to_parse], TRUE)
6: .best_formats(train, orders, locale = locale, .select_formats)
7: unique(guess_formats(x, orders, locale = locale, preproc_wday = TRUE))
8: guess_formats(x, orders, locale = locale, preproc_wday = TRUE)
9: .get_loc_regs(locale)
10: f(...)
11: gsub("(?<!\\()\\?(?!<)", "", perl = T, gsub("+", "*", fixed = T, gsub(">
12: gsub("+", "*", fixed = T, gsub(">", "_e>", num))

Selection: 12
Called from: top level
Browse[1]> num
Error during wrapup: object 'num' not found
Browse[1]> fixed
[1] TRUE
Browse[1]> gsub
function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
{
if (!is.character(x))
x <- as.character(x)
.Internal(gsub(as.character(pattern), as.character(replacement),
x, ignore.case, perl, fixed, useBytes))
}
<bytecode: 0x0c74c7dc>
<environment: namespace:base>
Browse[1]> gsub
function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
{
if (!is.character(x))
x <- as.character(x)
.Internal(gsub(as.character(pattern), as.character(replacement),
x, ignore.case, perl, fixed, useBytes))
}
<bytecode: 0x0c74c7dc>
<environment: namespace:base>
Browse[1]> num
Error during wrapup: object 'num' not found

vspinu · 2013-04-26T19:14:32Z

You are doing things right. The frame you should enter is 11 or even 10
not 12. Thanks.
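
Why frame 12 has no `num`: each entry in the `recover` menu is one function call, and frame 12 is the body of `gsub` itself, which only sees its own arguments (`pattern`, `replacement`, `x`, and so on). The variable `num` belongs to the calling function a few frames up. The sketch below reproduces that situation without lubridate. `build_regs` and `inner` are hypothetical stand-ins, not lubridate functions, and the handler simply searches the live call stack for the first frame that defines `num`.

```r
# Hypothetical stand-ins: `build_regs` owns `num`, the failure happens one call deeper.
build_regs <- function() {
  num <- c(H = "(?<H>2[0-4]|[01]?\\d)")
  inner(num)
}
inner <- function(x) stop("simulated failure")

captured <- NULL
result <- tryCatch(
  withCallingHandlers(
    build_regs(),
    error = function(e) {
      # The stack is still intact here, so every frame can be inspected.
      for (frame in sys.frames()) {
        if (exists("num", envir = frame, inherits = FALSE)) {
          captured <<- get("num", envir = frame)
          break
        }
      }
    }
  ),
  error = function(e) conditionMessage(e)
)

print(captured)  # the value of `num` as seen inside build_regs()
```

In an interactive `recover` session the same idea applies: pick the frame of the function whose body mentions the variable, not the frame of the call that raised the error.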

ghost · 2013-04-26T23:46:07Z

Thanks very much for the guidance. Got it from frame 10. It resulted in quite the explosion of text. I wonder if it's having issues with the AM/PM kanji in there?

Selection: 10
Called from: top level
Browse[1]> num
d
"(?<d>[012]?[1-9]|3[01]|[12]0)"
H
"(?<H>2[0-4]|[01]?\d)"
h
"(?<H>2[0-4]|[01]?\d)"
I
"(?<I>1[0-2]|0?[1-9])"
j
"(?<j>[0-3]?\d?\d)"
M
"(?<M>[0-5]?\d)"
S
"((?<OS_S>[0-5]?\d.\d+)|(?<S>[0-6]?\d))"
s
"((?<OS_S>[0-5]?\d.\d+)|(?<S>[0-6]?\d))"
U
"(?<U>[0-5]?\d)"
w
"(?<w>[0-6])"
u
"(?<u>[1-7])"
W
"(?<W>[0-5]?\d)"
Y
"(?<Y>\d{4})"
y
"((?<Y_y>\d{4})|(?<y>\d{2}))"
Oz
"(?<Oz_Oz>[-+]\d{4})"
OO
"(?<OO>[-+]\d{2}:\d{2})"
Oo
"(?<Oo>[-+]\d{2})"
T
"(((?<I_s>1[0-2]|0?[1-9])\D+(?<M_s_T>[0-5]?\d)\D+((?<OS_s_T_S>[0-5]?\d.\d+)|(?<S_s_T>[0-6]?\d))\D*(?<p_s>午前|午後)(?![[:alpha:]]))|((?<H_s>2[0-4]|[01]?\d)\D+(?<M_s>[0-5]?\d)\D+((?<OS_s_S>[0-5]?\d.\d+)|(?<S_s>[0-6]?\d))))"
R
"(((?<I_s>1[0-2]|0?[1-9])\D+(?<M_s_T>[0-5]?\d)\D*(?<p_s>午前|午後)(?![[:alpha:]]))|((?<H_s>2[0-4]|[01]?\d)\D+(?<M_s>[0-5]?\d)))"
r
"(((?<I_s>1[0-2]|0?[1-9])\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\d))"

Browse[1]>

vspinu · 2013-04-27T00:21:15Z

Yes, it has to do with the encoding and most likely with the fact that your R is English and locale is Japanese.

Now take those funny expressions one at a time and check which one is causing the problem. For example:

funny_exp <- "(((?<I_s>1[0-2]|0?[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\\d))"
gsub("+", "*", funny_exp, fixed = T)
gsub("(?<!\\()\\?(?!<)", "", funny_exp, perl = T)

Once we figure this out we are very close to a reproducible example that you can post on stackoverflow or R-help. This is really not a lubridate issue per se, but it would be good to know what is going on for the future.

ghost · 2013-04-27T01:00:36Z

Great, once I get home tonight I will try playing around with it in more detail. I will post again as soon as I can.

For reference, however, even returning my R to Japanese so that it matches the native locale, the error still occurs.

ghost · 2013-04-27T15:27:20Z

Bizarre... I went through all the values of "num" in there, and each time the exact same pattern of output came out. No errors or anomalies pop up, however. The pattern is, for example,

strange <- "(((?<I_s>1[0-2]|0?[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\\d))"
gsub("+", "*", strange, fixed = T)
[1] "(((?<I_s>1[0-2]|0?[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\\d))"
gsub("(?<!\\()\\?(?!<)", "", strange, perl = T)
[1] "(((?<I_s>1[0-2]|0[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]\\d))"

or

strange <- "(?<H>2[0-4]|[01]?\\d)"
gsub("+", "*", strange, fixed = T)
[1] "(?<H>2[0-4]|[01]?\\d)"
gsub("(?<!\\()\\?(?!<)", "", strange, perl = T)
[1] "(?<H>2[0-4]|[01]\\d)"

where the question mark(s) are removed in the second gsub. Besides that, fiddling around with things I cannot seem to come across a reproduction of the error.

R is quite widely used in Japan, and I would be surprised if no one in the country was able to use lubridate.

Apologies for not being able to get more information together myself, the support is much appreciated.

vspinu · 2013-04-27T16:09:01Z

The value of num is correct. It's some multibyte glitch in gsub with fixed=T. Try the following

fmt <- format(as.POSIXct("1970-01-01 02:00:00"), "%a+%A+%b@%B@%p@")
gsub("+", "*", fmt, fixed = T)

It should give "木*木曜日* 1月@1月@午前@"

Also try:

gsub("+", "*", sprintf("%s", fmt), fixed = T)

This is how num is constructed, using format and sprintf.

ghost · 2013-04-27T23:39:01Z

Seems to be no issue in trying these guys as well:

fmt <- format(as.POSIXct("1970-01-01 02:00:00"), "%a+%A+%b@%B@%p@")
gsub("+", "*", fmt, fixed = T)
[1] "木*木曜日*1@1月@午前@"
gsub("+", "*", sprintf("%s", fmt), fixed = T)
[1] "木*木曜日*1@1月@午前@"

vspinu · 2013-04-28T12:34:47Z

Ok I think I know where it comes from, but I have no clue how to solve it.

Does this work for you?

paste(c("午前", "午後"), collapse = "|")
## -> [1] "午前|午後"

If so, does this work as expected:

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
ampm
## -> [1] "午前" "午後"
paste(ampm, collapse = "|")
## -> [1] "午前|午後"

vspinu · 2013-04-28T12:39:13Z

And then of course,

gsub("+", "*", paste(ampm, collapse = "|"), fixed = T)
gsub("+", "*", sprintf("%s", paste(ampm, collapse = "|")), fixed = T)

since that is where the error comes.

vspinu · 2013-04-28T12:49:14Z

Eh, it sinks gradually. Also double gsub:

gsub("+", "*", gsub("|", "*", sprintf("%s", paste(ampm, collapse = "+|")), fixed = T), fixed = T)

This is how remote debugging works :)

ghost · 2013-04-28T14:54:35Z

This is an extremely useful learning experience... while I have some experience with programming, it has all been fairly small-scale applications and thus was always quite easy to debug. Going through the above suggestions, the functions' responses are as follows:

paste(c("午前", "午後"), collapse = "|")
[1] "午前|午後"
ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
ampm
[1] "午前" "午後"
paste(ampm, collapse = "|")
[1] "午前|午後"
gsub("+", "*", paste(ampm, collapse = "|"), fixed = T)
[1] "午前|午後"
gsub("+", "*", sprintf("%s", paste(ampm, collapse = "|")), fixed = T)
[1] "午前|午後"
gsub("+", "*", gsub("|", "*", sprintf("%s", paste(ampm, collapse = "+|")), fixed = T), fixed = T)
[1] "午前**午後"

vspinu · 2013-04-28T19:26:29Z

Sorry. I am out of options, this is the most obscure bug I have seen in years.

Here is the last attempt:

gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), fixed = T)

Here is an isolated code from .build_locale_regs. It is virtually identical to the internal code (btw, I hope you are using the most recent version of lubridate from github).

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
p <- unique(ampm)
p <- p[nzchar(p)]
alpha_p <- sprintf("(?<p>%s)(?![[:alpha:]])", paste(p, collapse = "|"))

##  NUMERIC FORMATS
num <- c(
  d = "(?<d>[012]?[1-9]|3[01]|[12]0)",
  H = "(?<H>2[0-4]|[01]?\\d)",
  h = "(?<H>2[0-4]|[01]?\\d)",
  I = "(?<I>1[0-2]|0?[1-9])", 
  j = "(?<j>[0-3]?\\d?\\d)", 
  M = "(?<M>[0-5]?\\d)",
  S = "((?<OS_S>[0-5]?\\d\\.\\d+)|(?<S>[0-6]?\\d))", 
  s = "((?<OS_S>[0-5]?\\d\\.\\d+)|(?<S>[0-6]?\\d))",
  U = "(?<U>[0-5]?\\d)", 
  w = "(?<w>[0-6])", # merge with a, A??
  u = "(?<u>[1-7])", 
  W = "(?<W>[0-5]?\\d)", 
  ## x = "(?<x>\\d{2}/[01]?\\d/[0-3]?\\d)", 
  ## X = "(?<X>[012]?\\d:[0-5]?\\d:[0-6]?\\d)", 
  Y = "(?<Y>\\d{4})",
  y = "((?<Y_y>\\d{4})|(?<y>\\d{2}))",
  Oz = "(?<Oz_Oz>[-+]\\d{4})", ## strptime implements only this format (4 digits)
  ## F = "(?<F>\\d{4}-\\d{2}-\\d{2})",
  OO = "(?<OO>[-+]\\d{2}:\\d{2})", 
  Oo = "(?<Oo>[-+]\\d{2})")


check <- sprintf("((%s\\D+%s\\D+%s\\D*%s)|(%s\\D+%s\\D+%s))",
                 num[["I"]], num[["M"]], num[["S"]], alpha_p, num[["H"]], num[["M"]], num[["S"]])

check <- gsub("(<[IMSpHS]|<OS)", "\\1_s", check)

gsub("(?<!\\()\\?(?!<)", "", perl = T, 
     gsub("+", "*",  fixed = T,  
          gsub(">", "_e>", check))) # append _e to avoid duplicates

If this one doesn't give you an error I am afraid you have to step through the code and try to isolate the problem yourself. Here is how to do that:

.date_template <- lubridate:::.date_template
lubridate:::.build_locale_regs() ## to get the code

Now copy paste the body of the function into a new file and execute as usual. If you don't get the error that can only mean you are kidding me. Once you get an error, try to cut the irrelevant pieces till you get something manageable.

BTW, just to make sure,

lubridate:::.build_locale_regs()

should give you the original error.

ghost · 2013-04-29T00:14:45Z

Thanks so much for all the help. Extremely appreciated.
I'll try my best with the debugging and if I make any progress will let you know.

hadley · 2013-04-29T11:39:56Z

@Vitoshka this is probably some detail with Windows locales + character encodings and will probably need a windows box to reproduce.

ghost · 2013-04-29T13:23:27Z

In following the above suggestions, it seems as if the "location" of the error has been honed down.

As above, assigning

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")

trying this guy finally brought up an error:

gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), fixed = T)
Error in gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), :
invalid multibyte string at ' < 8 c > '

I put in the spaces between the <, 8, c, and > since Github's auto-formatting made it disappear otherwise.

vspinu · 2013-04-29T13:33:53Z

Thanks. just to make sure.

The following does work as expected:

    gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), fixed = T), fixed = T)

And this one breaks:

    gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), fixed = T)

If so I can fix it today and you probably should report an R bug.

ghost · 2013-04-29T13:38:52Z

Really! That would be wonderful. Indeed, the first one works as expected, while the latter does break:

gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), fixed = T), fixed = T)
[1] "午前**午後"
gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), fixed = T)
Error in gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), :
invalid multibyte string at '< 8 c >'

I'll be sure to report the bug. Please let me know if there is anything else I should do.

vspinu · 2013-04-29T14:05:48Z

I have committed a change. Try it out with:

library(devtools)
install_github("lubridate", "vitoshka")

ghost · 2013-04-30T06:25:00Z

Apologies for the delay... for some reason I'm having issues with install_github, getting an error in tools:::.install_packages() saying it "cannot create temporary directory." Read/write permissions are all fine for the temp directory, so I'm digging around the error a bit deeper.

Once I get this worked out I'll get back to you ASAP.

ghost · 2013-05-01T07:09:36Z

Sorry about the delay. I've got devtools working properly again (had to tweak some environment variables) and I believe the download from github went fine:

install_github("lubridate","vitoshka")
Installing github repo(s) lubridate/master from vitoshka
Installing lubridate.zip from https://github.com/vitoshka/lubridate/archive/master.zip
Installing lubridate
"C:/.../R-3.0.0/bin/x64/R" --vanilla CMD INSTALL "C:...\lubridate-master" --library="C:/.../R-3.0.0/library" --with-keep.source
installing source package 'lubridate' ... # asterisks below removed by me due to github formatting
R
data
moving datasets to lazyload DB
inst
preparing package for lazy loading
help
installing help indices
building package indices
installing vignettes
testing if installed package can be loaded
arch - i386
arch - x64
DONE (lubridate)

However, unfortunately when I load the package and try the function again,

library(lubridate)
ymd("2012-02-02")
Error in gsub(">", "_e>", num, fixed = TRUE) :
invalid multibyte string at '< 8c >)< 28 >?![[:alpha:]]))|((?< H_s >2[0-4]|[01]?\d)\D+(?< M_s >[0-5]?\d)\D+((?< OS_s_S >[0-5]?\d.\d+)|(?< S_s >[0-6]?\d))))'
Enter a frame number, or 0 to exit
1: ymd("2012-02-02")
2: parse.r#67: .parse_xxx(..., orders = "ymd", quiet = quiet, tz = tz, loca
3: parse.r#551: as.POSIXct(parse_date_time(dates, orders, quiet = quiet, tz
4: parse_date_time(dates, orders, quiet = quiet, tz = tz, locale = locale,
5: parse.r#450: .local_parse(x[to_parse], TRUE)
6: parse.r#427: .best_formats(train, orders, locale = locale, .select_forma
7: guess.r#248: unique(guess_formats(x, orders, locale = locale, preproc_wd
8: guess_formats(x, orders, locale = locale, preproc_wday = TRUE)
9: guess.r#134: .get_loc_regs(locale)
10: f(...)
11: guess.r#409: gsub("(?<!\\()\\?(?!<)", "", perl = TRUE, gsub("+", "*", fi
12: gsub("+", "*", fixed = TRUE, gsub(">", "_e>", num, fixed = TRUE))
13: gsub(">", "_e>", num, fixed = TRUE)

which is a bit different than before, though I'm not sure of what has precisely changed. Going through the same routine as above once again, the results were the same, including with the same final error popping up here:

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
gsub("+", "*", gsub("|", "*", sprintf("%s", paste(ampm, collapse = "+|")), fixed = T), fixed = T)
[1] "午前**午後"
gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), fixed = T)
Error in gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), :
invalid multibyte string at '< 8c >'
Enter a frame number, or 0 to exit
1: gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>")

Assuming the download from github was of the correct files, please let me know if anything looks interesting to you in terms of the error content. I'll be sure to try anything out if need be. Regards!

vspinu · 2013-05-01T10:34:59Z

Hm, it is getting grimmer and grimmer. I have committed a change that completely avoids standard R regexps (that is, it uses only perl or fixed regexps). I hope perl works for you; otherwise there is really no option other than deactivating internationalization in lubridate.

Try my master branch again and also try this:

gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), perl = T), fixed = T)

And you really should report this bug to R people (here https://bugs.r-project.org/bugzilla3/)
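
The shape of the change described above can be sketched as follows: every literal replacement uses `fixed = TRUE`, the one real pattern uses `perl = TRUE`, and R's default regex engine is never reached. This is an illustration under assumptions, not the committed code. The input string is a shortened, hypothetical stand-in for one entry of `num`, and the kanji are written as `\u` escapes so the string is UTF-8 on any platform, which means it does not reproduce the CP932 condition reported in this thread.

```r
# Hypothetical, shortened stand-in for one locale regexp (AM/PM alternatives in UTF-8).
alpha_p <- "(?<p>\u5348\u524d|\u5348\u5f8c)(?![[:alpha:]])"
reg <- sprintf("(?<I>1[0-2]|0?[1-9])\\D+%s", alpha_p)

# Literal replacements: fixed = TRUE, so no regex engine is involved.
step1 <- gsub(">", "_e>", reg, fixed = TRUE)    # append _e to group names
step2 <- gsub("+", "*", step1, fixed = TRUE)    # relax + quantifiers to *

# The only true pattern goes through PCRE: drop "?" unless it follows "(" or precedes "<".
step3 <- gsub("(?<!\\()\\?(?!<)", "", step2, perl = TRUE)

print(step3)
```

Whether this avoids the error on an affected machine depends on how the input strings are encoded there, so it should be checked on that machine rather than assumed.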

hadley · 2013-05-01T11:09:27Z

There is no point submitting a bug to R unless you can create a simple reproducible example. For example, when I run the following code on my windows machine, I don't get an error (but neither do I get the correct output)

Sys.setlocale("LC_ALL", "Japanese_Japan.932")

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
ampm
# [1] "??" "??"

gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), fixed = T), fixed = T)
# [1] "??**??"
gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), fixed = T)
# [1] "??**??"

ghost · 2013-05-01T11:28:25Z

For reference, I have been able to reproduce it on other machines, though the setup has been identical (Japanese 64-bit Windows 7).

With the new version, I get

d <- ymd("2012-02-02")
Error in gsub(">", "_e>", num, fixed = TRUE) :
invalid multibyte string at '< 8c >)< 28 >?![[:alpha:]]))|((?< H_s >2[0-4]|[01]?\d)\D+(?< M_s >[0-5]?\d)\D+((?< OS_s_S >[0-5]?\d.\d+)|(?< S_s >[0-6]?\d))))'
Enter a frame number, or 0 to exit
1: ymd("2012-02-02")
2: parse.r#67: .parse_xxx(..., orders = "ymd", quiet = quiet, tz = tz, loca
3: parse.r#551: as.POSIXct(parse_date_time(dates, orders, quiet = quiet, tz
4: parse_date_time(dates, orders, quiet = quiet, tz = tz, locale = locale,
5: parse.r#450: .local_parse(x[to_parse], TRUE)
6: parse.r#427: .best_formats(train, orders, locale = locale, .select_forma
7: guess.r#248: unique(guess_formats(x, orders, locale = locale, preproc_wd
8: guess_formats(x, orders, locale = locale, preproc_wday = TRUE)
9: guess.r#134: .get_loc_regs(locale)
10: f(...)
11: guess.r#409: gsub("(?<!\\()\\?(?!<)", "", perl = TRUE, gsub("+", "*", fi
12: gsub("+", "*", fixed = TRUE, gsub(">", "_e>", num, fixed = TRUE))
13: gsub(">", "_e>", num, fixed = TRUE)

The other statement yields:

Error in gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), perl = T), :
invalid multibyte string at '< 8c >'
Enter a frame number, or 0 to exit
1: gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), perl = T), f
Selection: 1
Called from: top level
Browse[1]> perl
[1] FALSE

hadley · 2013-05-01T11:38:53Z

It's unlikely that the R maintainers will have a Japanese version of windows available, so it would be helpful to create an example that fails for everyone. It's quite possible that there are other ways to fix the bug by correcting the string encoding but without a reproducible example, there's no way I can explore.

vspinu · 2013-05-01T11:51:21Z

@hadley

This is why this bug is a nightmare. It is easily reproducible on a Japanese machine, and it looks like a problem with the regexp parser, because fixed=T works.

Though in your case it might be something else going on. ?? in the output might simply mean that your terminal doesn't know how to display it or encoding is absent on your machine. I have no clue how windows deals with this but on linux I am getting a warning when I try to set a missing language locale.

vspinu · 2013-05-01T12:12:26Z

Fixing the encoding with Encoding(x) <- value, right?

Maybe at least ask on R-devel. Someone might recommend a fix without even reproducing the problem.

hadley · 2013-05-01T13:55:12Z

Ok - I can reproduce it now - the key is to use the R gui, not RStudio (I'll report that bug).

But the plot thickens:

Sys.setlocale("LC_ALL", "Japanese_Japan.932")

times <- c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")
ampm <- format(as.POSIXct(times), format = "%p")
x <- gsub(">", "*", paste(ampm, collapse = "+>"))

y <- "午前+*午後"
identical(x, y)
# [1] TRUE
gsub("+", "*", x, fixed = T)
# Error in gsub("+", "*", x, fixed = T) : 
#  invalid multibyte string at '<8c>'
gsub("+", "*", y, fixed = T)
# [1] "午前**午後"
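
One reason `identical(x, y)` can be `TRUE` while `gsub` treats the two strings differently: per `?identical`, character strings in different marked encodings count as identical if they agree once translated to UTF-8. So the comparison says nothing about the underlying bytes or the encoding mark. A minimal sketch for looking underneath, where `describe_string` is a hypothetical helper and `validUTF8` assumes R 3.3.0 or later:

```r
# Hypothetical helper: show the encoding mark, raw bytes, and UTF-8 validity of one string.
describe_string <- function(s) {
  list(
    mark       = Encoding(s),    # "unknown", "UTF-8", "latin1" or "bytes"
    bytes      = charToRaw(s),   # the bytes actually stored
    valid_utf8 = validUTF8(s)    # TRUE if those bytes form valid UTF-8
  )
}

# Same text as `y` above, written with \u escapes so it is UTF-8 on any platform.
y <- "\u5348\u524d+*\u5348\u5f8c"
info <- describe_string(y)
str(info)
```

On an affected machine, comparing `describe_string(x)` with `describe_string(y)` would show whether the string built from `format()` differs from the typed literal in its mark or its bytes.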

hadley · 2013-05-01T14:09:58Z

Posted to r-devel at http://r.789695.n4.nabble.com/Windows-format-POSIXct-and-character-encodings-td4665911.html

hadley · 2013-05-07T13:02:46Z

It seems like a known problem, but it's not obvious what the fix is.

vspinu · 2013-05-07T14:08:30Z

R-devel was pretty silent :) Would it help to explicitly convert the string to UTF-8 before processing with grep?

I see enc2utf8, Encoding and iconv, which are apparently designed for this task.

Vitalie
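
The conversion idea can be sketched as below: normalize the input with `enc2utf8` and only then substitute. `gsub_utf8` is a hypothetical wrapper name, and the sample string is UTF-8 already, so this only shows the call shape. The reply that follows in this thread reports that the approach did not help on the affected setup.

```r
# Hypothetical wrapper: convert the input to UTF-8 before substituting.
gsub_utf8 <- function(pattern, replacement, x, ...) {
  gsub(pattern, replacement, enc2utf8(x), ...)
}

out <- gsub_utf8("+", "*", "\u5348\u524d+>\u5348\u5f8c", fixed = TRUE)
print(out)
```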

hadley · 2013-05-07T14:27:37Z

No, that doesn't help as far as I can tell :(

ghost · 2013-05-07T23:04:36Z

Thanks so much for all your efforts guys.
I've seen some Japanese stats bloggers using lubridate before, so I'll try and get in touch to see if there has been any kind of workaround.

vspinu added a commit to vspinu/lubridate that referenced this issue Apr 29, 2013

gsub with fixed=TRUE (fix tidyverse#181)

76bfdb5

vspinu added a commit to vspinu/lubridate that referenced this issue Apr 29, 2013

gsub with fixed=TRUE (fix tidyverse#181)

c39d729

vspinu added a commit to vspinu/lubridate that referenced this issue May 1, 2013

use sub/gsub only with perl=TRUE or fixed=TRUE (bug tidyverse#181)

47aeb5b

vspinu mentioned this issue Jul 18, 2013

Mistake in French locale #194

Closed

vspinu closed this as completed Dec 13, 2014
