Default locale is C #19

Closed
wch opened this issue Oct 3, 2014 · 19 comments

Comments

@wch
Contributor

wch commented Oct 3, 2014

This is in the eddelbuettel/ubuntu-rstudio image:

> Sys.getlocale(category = "LC_ALL")
[1] "C"

For best interoperability, it should be UTF-8.

Some information about it here:
http://jaredmarkell.com/docker-and-locales/
https://crosbymichael.com/dockerfile-best-practices-take-2.html

@eddelbuettel
Member

Another thing to add to r-base so that it bubbles up.

[ That said, I am a 7-bit snob now and rarely ever set these... But we probably should. ]

@cboettig
Member

cboettig commented Oct 7, 2014

Note: I set the locale to C.UTF-8, as in the second example, rather than en_US.UTF-8, as in the first, and set it only in the Debian base image. (There is a summary of C.UTF-8 vs en_US.UTF-8 here, but I'm happy for input on which locale @wch had in mind.)

@wch
Contributor Author

wch commented Oct 8, 2014

I'm not an expert in this stuff, but I think that en_US.UTF-8 would be better, since it defines proper sorting for non-ASCII characters, while C.UTF-8 does not -- it probably just uses the Unicode value for sorting.

For example, in en_US.UTF-8, all the a's with accents come before b:

> sort(c('A', 'a', 'Ä', 'ä', 'À', 'à', 'b'))
[1] "a" "A" "à" "À" "ä" "Ä" "b"

But it's not true in C.UTF-8:

> sort(c('A', 'a', 'Ä', 'ä', 'À', 'à', 'b'))
[1] "A" "a" "b" "À" "Ä" "à" "ä"

So I think that, despite the provincial-sounding label, en_US actually supports non-English languages better than C.
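The byte-order result can also be reproduced outside R. The following is a minimal shell sketch, not taken from the images discussed here: `sort_bytewise` is a hypothetical helper name, and it uses the plain C locale (available everywhere) on the assumption that C.UTF-8 collates by code point, which for UTF-8 text coincides with byte order.

```shell
# Hypothetical helper: sort the arguments by raw byte value.
# LC_ALL=C forces byte-wise comparison regardless of the host's default locale.
sort_bytewise() {
    printf '%s\n' "$@" | LC_ALL=C sort
}

# Same sample strings as in the R example; join the sorted lines with spaces.
sorted=$(sort_bytewise A a Ä ä À à b | paste -s -d ' ' -)
echo "$sorted"
# prints: A a b À Ä à ä
```

The accented letters land after "b" because their UTF-8 encodings start with byte 0xC3, which is larger than any ASCII byte. Running the same pipeline with LC_ALL=en_US.UTF-8 only gives a different order if that locale has actually been generated on the system.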

@cboettig
Member

cboettig commented Oct 8, 2014

@wch sounds reasonable to me.

For reasons that are not obvious to me, just switching C.UTF-8 to en_US.UTF-8 in this Dockerfile results in an error:

*** update-locale: Error: invalid locale settings:  LANG=en_US.UTF-8
2014/10/07 20:12:37 The command [/bin/sh -c dpkg-reconfigure locales     && locale-gen en_US.UTF-8     && /usr/sbin/update-locale LANG=en_US.UTF-8] returned a non-zero code: 255

No idea why; en_US.UTF-8 is on the list of locales returned by the command...

@wch
Contributor Author

wch commented Oct 8, 2014

This seems to work:

RUN echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen \
   && locale-gen en_US.utf8 \
   && /usr/sbin/update-locale LANG=en_US.UTF-8

The starting state for /etc/locale.gen has en_US.UTF-8 commented out, along with all the other entries. Running dpkg-reconfigure locales interactively and selecting en_US.UTF-8 has the same effect as this set of commands, I think.

Edit: FWIW, I found another Dockerfile that uses a similar strategy: https://registry.hub.docker.com/u/etna/drone-debian/dockerfile/
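As an alternative to appending a new line, the existing commented-out entry could be uncommented in place. This is only a sketch under assumptions: `enable_locale_entry` is a hypothetical helper, the entries are assumed to look like `# en_US.UTF-8 UTF-8`, and the demonstration runs on a scratch file rather than the real /etc/locale.gen.

```shell
# Hypothetical helper: enable a locale by uncommenting its line in a
# locale.gen-style file. $1 = path to the file, $2 = locale name.
enable_locale_entry() {
    file=$1
    name=$2
    # Escape dots so they match literally in the sed pattern.
    pattern=$(printf '%s' "$name" | sed 's/\./\\./g')
    # Strip the leading "# " from the matching line; leave other lines alone.
    sed "s/^# *\(${pattern} \)/\1/" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Demonstrate on a scratch file with two commented-out entries.
scratch=$(mktemp)
printf '%s\n' '# en_GB.UTF-8 UTF-8' '# en_US.UTF-8 UTF-8' > "$scratch"
enable_locale_entry "$scratch" 'en_US.UTF-8'
cat "$scratch"
```

Only the en_US.UTF-8 line loses its comment marker. In a real image the locale would still have to be generated afterwards, as in the RUN command above.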

@eddelbuettel
Member

+1 -- I don't think I have ever seen C.UTF-8 in the wild anywhere. Not that I pay much attention though...

@eddelbuettel
Member

Blech:

root@e5b38b5f638c:/# du -csh /usr/share/locale/
87M     /usr/share/locale/
87M     total
root@e5b38b5f638c:/# 

@wch
Contributor Author

wch commented Oct 9, 2014

Doesn't seem so bad when I do it:

$ docker run --rm -ti eddelbuettel/debian-r-base /bin/bash

root@1dbe56be3aa1:/# du -csh /usr/share/locale
43M /usr/share/locale
43M total

root@1dbe56be3aa1:/# apt-get install -qq -y locales

root@1dbe56be3aa1:/# du -csh /usr/share/locale
47M /usr/share/locale
47M total

root@1dbe56be3aa1:/# echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen \
>    && locale-gen en_US.utf8 \
>    && /usr/sbin/update-locale LANG=en_US.UTF-8
Generating locales (this might take a while)...
  en_US.UTF-8... done
Generation complete.

root@1dbe56be3aa1:/# du -csh /usr/share/locale
47M /usr/share/locale
47M total

@eddelbuettel
Member

I was using the 'drd' image (i.e. daily r-devel), which has more packages and hence more po files. Anyway, on my home system it is 177 MB, so ... that's just a cost of doing business.

I learned something new which may help shrink the image some more.

@cboettig
Member

cboettig commented Oct 9, 2014

Testing: docker run -it rocker/r-base R

> Sys.getlocale(category = "LC_ALL")

[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

@wch Look good?

@cboettig
Member

cboettig commented Oct 9, 2014

For some reason, the rstudio image (and thus hadleyverse) objects to the locale settings.

The container throws a warning on startup:

$ docker run --rm -it rocker/rstudio R
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

R complains as well:

R version 3.1.1 (2014-07-10) -- "Sock it to Me"
...
Type 'q()' to quit R.

During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C" 
2: Setting LC_COLLATE failed, using "C" 
3: Setting LC_TIME failed, using "C" 
4: Setting LC_MESSAGES failed, using "C" 
5: Setting LC_MONETARY failed, using "C" 
6: Setting LC_PAPER failed, using "C" 
7: Setting LC_MEASUREMENT failed, using "C" 

and then defaults to the "C" locale:

> Sys.getlocale(category = "LC_ALL")
[1] "C"

@eddelbuettel
Member

That rings a bell, but I don't quite recall what to do. It should be a generic issue for Debian-based VMs etc., though. Maybe it is as simple as setting it in /etc/bash.bashrc, or profile, or ...
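One way to narrow this down is to check whether the requested locale has actually been generated in the image, since "cannot change locale" warnings usually mean the name is not in the output of `locale -a`. The sketch below is an assumption-laden illustration, not something from the rocker images: `normalize_locale` and `locale_in_list` are hypothetical helpers, and the name list is hard-coded so the example runs anywhere.

```shell
# Hypothetical helper: normalise a locale name so that "en_US.UTF-8"
# and "en_US.utf8" (the spelling printed by `locale -a`) compare equal.
normalize_locale() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-'
}

# Hypothetical helper: succeeds if the locale named in $1 appears in the
# list read from stdin (one name per line, as printed by `locale -a`).
locale_in_list() {
    wanted=$(normalize_locale "$1")
    while IFS= read -r candidate; do
        [ "$(normalize_locale "$candidate")" = "$wanted" ] && return 0
    done
    return 1
}

# Hard-coded sample list standing in for the output of `locale -a`.
if printf '%s\n' C C.UTF-8 POSIX en_US.utf8 | locale_in_list en_US.UTF-8; then
    echo "en_US.UTF-8 is available"
fi
```

Inside a container, the same check would read the real list, for example `locale -a | locale_in_list "$LANG" || echo "locale $LANG has not been generated"`.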

@jangorecki

I get the C locale on a recent r-base image.

docker run -it r-base
Sys.getlocale(category = "LC_ALL")
# [1] "C"

According to the discussion here, I should get en_US.UTF-8, so it looks like this issue needs to be reopened.

@jangorecki

I used this SO answer to solve that issue on Ubuntu 14.04.

RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

Sys.getlocale(category = "LC_ALL")
# [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

I tried the same on official debian's r-base, but it throws a lot of locale warnings during the build and in the R console at run time. So it cannot be applied directly to Debian either.

@cboettig
Member

Really? On r-base I see:


> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
> Sys.getenv("LC_ALL")
[1] "en_US.UTF-8"
> Sys.getlocale(category="LC_ALL")
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
> 

Are you sure you have the latest r-base image? (Not sure what you mean by 'official debian's r-base' or what warnings you're seeing either)

Yes, Ubuntu and Debian set locales differently; both are described in the link above. (And of course the Debian way is also illustrated at the top of the r-base Dockerfile.)

Does anyone else still see the C locale in r-base?

@eddelbuettel
Member

I get the same as Carl:

$ docker run --rm -ti r-base R -e 'sessionInfo()'

R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> 
$

@wch
Contributor Author

wch commented Oct 27, 2015

@cboettig, I get the same result as you.

@jangorecki

Heh, I cannot reproduce it anymore, so it was likely an issue on my side, maybe a name clash with an image I built a while ago.

briandk pushed a commit to briandk/latex-trusty that referenced this issue Dec 29, 2015
By default, the locale is set to just C, which then gets inherited by R
and passed along to pandoc, and that causes all sorts of problems.

- rstudio/rmarkdown#383
- rocker-org/rocker#19
- http://crosbymichael.com/dockerfile-best-practices-take-2.html
@hakanai

hakanai commented Sep 13, 2018

"So I think that, despite the provincial-sounding label, en_US actually supports non-English languages better than C."

Not sure how you jump to that conclusion.

What about languages where, for instance, "ä" or "å" are supposed to sort after "z"?
