Mistake in French locale #194

Closed
briatte opened this Issue Jul 18, 2013 · 7 comments

briatte commented Jul 18, 2013

Hello,

There is a parsing issue in the French locale:

> dmy("3 janvier 2013", locale = "fr_FR")
[1] "2013-01-03 UTC"
> dmy("3 juillet 2013", locale = "fr_FR")
[1] NA
Warning message:
All formats failed to parse. No formats found.

"Juillet" is the French word for July.

If this is not an issue with lubridate itself, please let me know where I should report it.

Collaborator

vspinu commented Jul 18, 2013

Works fine for me under Linux:

dmy("3 janvier 2013", locale = "fr_FR.utf8")
[1] "2013-01-03 UTC"

What do you get on

lubridate:::.build_locale_regs("fr_FR.utf8")$alpha_exact[["b"]]

A very long thread, #181, might be relevant here, but it would be good to know more about your system.

briatte commented Jul 18, 2013

You are right, the problem is on my end:

[1] "((?<b_b_e>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|(?<B_b_e>January|February|March|April|May|June|July|August|September|October|November|December))(?![[:alpha:]])"
Warning message:
In Sys.setlocale("LC_TIME", locale) :
  OS reports request to set locale to "fr_FR.utf8" cannot be honored

I am running Mac OS X, with the whole system set to English.

Session info:

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3       ggplot2_0.9.3.1    grid_3.0.1        
 [6] gtable_0.1.2       labeling_0.2       MASS_7.3-27        munsell_0.4.2      plyr_1.8          
[11] proto_0.3-10       RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3       stringr_0.6.2     
[16] tools_3.0.1       

It'd be useful to clarify why juillet is the only month that does not parse, though. The rest of the months work fine.

Owner

hadley commented Jul 18, 2013

On OS X the locale is called fr_FR:

> lubridate:::.build_locale_regs("fr_FR")$alpha_exact[["b"]]
[1] "((?<b_b_e>jan|fév|mar|avr|mai|jui|jul|aoû|sep|oct|nov|déc)|(?<B_b_e>janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre))(?![[:alpha:]])"

It's not obvious why juillet fails to match that regexp, though.
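
Since the locale name differs between platforms (fr_FR.utf8 on Linux, fr_FR on OS X), one way to avoid the "cannot be honored" warning is to probe a few candidate names first. This is only a sketch: first_available_locale is a hypothetical helper, not part of lubridate, and it assumes Sys.setlocale() returns an empty string when the OS rejects the request.

```r
# Hypothetical helper (not part of lubridate): return the first candidate
# locale name that the OS accepts for LC_TIME, or NA if none is accepted.
first_available_locale <- function(candidates) {
  old <- Sys.getlocale("LC_TIME")
  # Restore the original LC_TIME setting when the function exits.
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  for (loc in candidates) {
    # Sys.setlocale() returns "" (with a warning) when the request fails.
    res <- suppressWarnings(Sys.setlocale("LC_TIME", loc))
    if (nzchar(res)) return(loc)
  }
  NA_character_
}

# Which spelling of the French locale does this system know about?
first_available_locale(c("fr_FR.UTF-8", "fr_FR.utf8", "fr_FR"))
```

The result could then be passed as the locale argument to dmy(). It only tells you that the name is accepted, not that month names will parse correctly.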

briatte commented Jul 18, 2013

Same thing here:

> lubridate:::.build_locale_regs("fr_FR")$alpha_exact[["b"]]
[1] "((?<b_b_e>jan|fév|mar|avr|mai|jui|jul|aoû|sep|oct|nov|déc)|(?<B_b_e>janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre))(?![[:alpha:]])"

And it does work elsewhere:

> regexpr("janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre", "13 juillet 2013")
[1] 4
attr(,"match.length")
[1] 7
> grepl("janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre", "13 juillet 2013")
[1] TRUE
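
The check above only uses the list of full month names. To test the complete pattern printed by .build_locale_regs, with its named groups and trailing lookahead, perl = TRUE is needed. A minimal sketch, assuming the pattern is pasted exactly as shown above (the variable names are illustrative, and the result may depend on the platform and locale):

```r
# Full pattern as printed above; named groups (?<name>...) need perl = TRUE.
pattern <- paste0(
  "((?<b_b_e>jan|fév|mar|avr|mai|jui|jul|aoû|sep|oct|nov|déc)|",
  "(?<B_b_e>janvier|février|mars|avril|mai|juin|juillet|août|",
  "septembre|octobre|novembre|décembre))(?![[:alpha:]])"
)

m <- regexpr(pattern, "3 juillet 2013", perl = TRUE)

# TRUE if the pattern matched anywhere in the string.
m_found <- m[1] != -1
m_found

# Start positions of each capture group (-1 for groups that did not match).
attr(m, "capture.start")
```

If this matches while dmy() still returns NA, the regexp is probably not the culprit, and the failure is more likely in a later step.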
Collaborator

vspinu commented Jul 18, 2013

I meant that juillet is parsed for me on Linux:

dmy("3 juillet 2013", locale = "fr_FR.utf8")
[1] "2013-07-03 UTC"

So it is again a regexp issue on OS X, and it is getting closer to our Japanese friend's problem.

briatte commented Mar 8, 2014

R 3.0.3 has solved this issue:

  • strptime() now checks the locale only when locale-specific formats are used and caches the locale in use: this can halve the time taken on OSes with slow system functions (e.g. OS X).
  • strptime() and the format() methods for classes "POSIXct", "POSIXlt" and "Date" recognize strings with marked encodings: this allows, for example, UTF-8 French month names to be read on (French) Windows.

The mistake it caused in lubridate is gone:

library(lubridate)

clean_date = function(x) {

  # fix bugs in dates
  x = gsub("juillet", "07", x) # fix small bug in French month parser

  # parse to Date format
  x = parse_date_time(x, "%d %m* %Y", locale = "fr_FR.UTF-8")
  x = as.Date(x)

  return(x)
}

unclean_date = function(x) {

  # parse to Date format
  x = parse_date_time(x, "%d %m* %Y", locale = "fr_FR.UTF-8")
  x = as.Date(x)

  return(x)
}

clean_date("15 juillet 2007")
[1] "2007-07-15"

# returns NA on R < 3.0.3
unclean_date("15 juillet 2007")
[1] "2007-07-15"

The "juillet" month does not need fixing anymore.

briatte closed this Mar 8, 2014
