Mistake in French locale #194

Closed
briatte opened this issue Jul 18, 2013 · 7 comments

Comments

briatte commented Jul 18, 2013

Hello,

There is a parsing issue in the French locale:

> dmy("3 janvier 2013", locale = "fr_FR")
[1] "2013-01-03 UTC"
> dmy("3 juillet 2013", locale = "fr_FR")
[1] NA
Warning message:
All formats failed to parse. No formats found.

The month "juillet" is July.

If this is not an issue with lubridate itself, please let me know where I should report it.

Member

vspinu commented Jul 18, 2013

Works fine for me under linux:

dmy("3 janvier 2013", locale = "fr_FR.utf8")
[1] "2013-01-03 UTC"

What do you get on

lubridate:::.build_locale_regs("fr_FR.utf8")$alpha_exact[["b"]]

The very long thread #181 might be relevant here, but it would be good to know more about your system.

Author

briatte commented Jul 18, 2013

You are right, the problem is on my end:

[1] "((?<b_b_e>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|(?<B_b_e>January|February|March|April|May|June|July|August|September|October|November|December))(?![[:alpha:]])"
Warning message:
In Sys.setlocale("LC_TIME", locale) :
  OS reports request to set locale to "fr_FR.utf8" cannot be honored

I am running Mac OS X, with the whole system set to English.

Session info:

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3       ggplot2_0.9.3.1    grid_3.0.1        
 [6] gtable_0.1.2       labeling_0.2       MASS_7.3-27        munsell_0.4.2      plyr_1.8          
[11] proto_0.3-10       RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3       stringr_0.6.2     
[16] tools_3.0.1       

It'd be useful to clarify why juillet is the only month that does not parse, though. The rest of the months work fine.
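
For anyone else hitting the "cannot be honored" warning above, here is a minimal sketch of checking whether a locale name is usable on the current system before passing it to dmy(). The helper name `locale_available` is hypothetical and not part of lubridate. It relies only on base R's Sys.setlocale(), which returns an empty string (with a warning) when the OS rejects the request.

```r
# Hypothetical helper (not part of lubridate):
# TRUE if LC_TIME can be switched to `locale` on this system.
locale_available <- function(locale) {
  old <- Sys.getlocale("LC_TIME")
  # Restore the previous LC_TIME setting when the function exits.
  on.exit(suppressWarnings(Sys.setlocale("LC_TIME", old)), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) when the OS rejects the name.
  nzchar(suppressWarnings(Sys.setlocale("LC_TIME", locale)))
}

# Locale names differ by platform; these spellings all appear in this thread.
candidates <- c("fr_FR.utf8", "fr_FR.UTF-8", "fr_FR")
usable <- candidates[vapply(candidates, locale_available, logical(1))]
print(usable)
```

If `usable` comes back empty, no French locale is installed under any of these names, and the parser can only fall back to the English month names shown above.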

Member

hadley commented Jul 18, 2013

On OS X the locale is called fr_FR:

> lubridate:::.build_locale_regs("fr_FR")$alpha_exact[["b"]]
[1] "((?<b_b_e>jan|fév|mar|avr|mai|jui|jul|aoû|sep|oct|nov|déc)|(?<B_b_e>janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre))(?![[:alpha:]])"

It's not obvious why "juillet" fails to match that regexp, though.

Author

briatte commented Jul 18, 2013

Same thing here:

> lubridate:::.build_locale_regs("fr_FR")$alpha_exact[["b"]]
[1] "((?<b_b_e>jan|fév|mar|avr|mai|jui|jul|aoû|sep|oct|nov|déc)|(?<B_b_e>janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre))(?![[:alpha:]])"

And it does work elsewhere:

> regexpr("janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre", "13 juillet 2013")
[1] 4
attr(,"match.length")
[1] 7
> grepl("janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre", "13 juillet 2013")
[1] TRUE
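
The plain alternation above leaves out the named groups and the trailing lookahead of the generated pattern. Here is a sketch that keeps both, assuming the pattern is applied as a Perl-compatible regexp (the `(?<name>...)` syntax needs `perl = TRUE` in base R). Whether lubridate applies it in exactly this way is an assumption, and the month list is trimmed to ASCII-only names for brevity.

```r
# Trimmed copy of the generated pattern: abbreviated names first, full names
# second, then a lookahead that rejects a match ending in the middle of a word.
pattern <- paste0(
  "((?<b_b_e>jan|mar|avr|mai|jui|jul|sep|oct|nov)",
  "|(?<B_b_e>janvier|mars|avril|mai|juin|juillet|septembre|octobre|novembre))",
  "(?![[:alpha:]])"
)

input <- "13 juillet 2013"
m <- regexpr(pattern, input, perl = TRUE)

# Text of the overall match.
matched <- regmatches(input, m)

# Names of the capture groups that took part in the match
# (the unnamed outer group is dropped).
starts <- attr(m, "capture.start")
group <- colnames(starts)[starts[1, ] > 0]
group <- group[nzchar(group)]

print(matched)
print(group)
```

If this also matches on the affected machine, the regexp itself is unlikely to be the culprit and the failure sits further down, in the actual date parsing.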

Member

vspinu commented Jul 18, 2013

I meant that juillet is parsed for me on linux:

dmy("3 juillet 2013", locale = "fr_FR.utf8")
[1] "2013-07-03 UTC"

So it is again a regexp issue on OS X, and it is getting closer to our Japanese friend's problem.

Member

hadley commented Jul 26, 2013

Interestingly, it also fails in as.Date: http://stackoverflow.com/questions/17856415/mysterious-error-by-parsing-french-dates
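
Since as.Date is affected too, the problem can be isolated from lubridate with base R alone. Here is a sketch, assuming LC_TIME has already been switched to the locale under test (for example with `Sys.setlocale("LC_TIME", "fr_FR")` on OS X). The name `roundtrip_months` is a hypothetical helper. It formats one date per month with `%B` and parses the result back, so any month that fails the round trip is one that strptime cannot read in that locale.

```r
# Hypothetical helper: round-trip every month name through format() and as.Date()
# in the currently active LC_TIME locale.
roundtrip_months <- function() {
  dates <- as.Date(sprintf("2013-%02d-03", 1:12))
  # Write each date with the locale's full month name, e.g. "03 juillet 2013".
  text <- format(dates, "%d %B %Y")
  # Read the same strings back with the same format.
  parsed <- as.Date(text, format = "%d %B %Y")
  data.frame(
    text = text,
    ok = !is.na(parsed) & parsed == dates,
    stringsAsFactors = FALSE
  )
}

result <- roundtrip_months()
print(result)
```

Rows with `ok == FALSE` point at base R (or the OS) rather than at lubridate.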

Author

briatte commented Mar 8, 2014

R 3.0.3 has solved this issue:

  • strptime() now checks the locale only when locale-specific formats are used and caches the locale in use: this can halve the time taken on OSes with slow system functions (e.g. OS X).
  • strptime() and the format() methods for classes "POSIXct", "POSIXlt" and "Date" recognize strings with marked encodings: this allows, for example, UTF-8 French month names to be read on (French) Windows.

The mistake it caused in lubridate is gone:

library(lubridate)

clean_date = function(x) {

  # fix bugs in dates
  x = gsub("juillet", "07", x) # fix small bug in French month parser

  # parse to Date format
  x = parse_date_time(x, "%d %m* %Y", locale = "fr_FR.UTF-8")
  x = as.Date(x)

  return(x)
}

unclean_date = function(x) {

  # parse to Date format
  x = parse_date_time(x, "%d %m* %Y", locale = "fr_FR.UTF-8")
  x = as.Date(x)

  return(x)
}

clean_date("15 juillet 2007")
[1] "2007-07-15"

# returns NA on R < 3.0.3
unclean_date("15 juillet 2007")
[1] "2007-07-15"

The "juillet" month does not need fixing anymore.
