utf-8 encoding after yaml.load_file #6

HenricoWitvliet · 2013-05-08T20:36:35Z

I've got a file with UTF-8 characters. yaml.load_file loads the character strings correctly, but Encoding() reports their encoding as unknown. For now I use Encoding(...) <- 'UTF-8' to set the encoding myself.
It would be nice if the returned character strings had the UTF-8 encoding bit set.
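
For readers unfamiliar with the workaround, here is a minimal sketch in base R. It does not call the yaml package; instead it builds the UTF-8 bytes by hand (an assumption made only so the snippet is self-contained) to mimic a string that holds valid UTF-8 but carries no encoding mark.

```r
# Build the UTF-8 bytes for "café" by hand. rawToChar() returns a string with
# no declared encoding, which mimics the situation described in this issue.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))
before <- Encoding(x)   # mark before the workaround

# Declare (not convert) the encoding: the bytes stay the same, only the mark changes.
Encoding(x) <- "UTF-8"
after <- Encoding(x)    # mark after the workaround

# Pure ASCII strings are never marked, so they keep reporting "unknown".
y <- "abc"
Encoding(y) <- "UTF-8"
ascii_mark <- Encoding(y)
```

Note that `Encoding<-` only changes the declared encoding; it is the right tool here because the bytes are already UTF-8.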

RinatMenyashev · 2014-02-13T22:04:21Z

same problem

viking · 2014-02-13T22:26:43Z

The same behavior occurs with R core functions like readLines, at least on Linux. As far as I know, R does not do any kind of encoding detection. If you run example(Encoding), what is your output?

HenricoWitvliet · 2014-02-17T16:30:03Z

Since a YAML file is encoded in Unicode, I would expect strings to be given this encoding. The character string that yaml.load_file returns in my example is UTF-8 encoded. But I haven't tried an example YAML file in UTF-16, so I don't know if setting a bit in every string would be enough.

viking · 2014-02-17T21:46:19Z

Ah, I see. I didn't realize that all YAML documents are unicode, but the YAML specification agrees with you. The specification says that by default, the encoding is UTF-8. For UTF-16, the document must provide a byte-order mark:
http://yaml.org/spec/1.1/#id868742
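
The rule described above (UTF-8 by default, UTF-16 only when a byte-order mark is present) can be sketched in base R. The helper name `guess_yaml_encoding` is hypothetical and is not part of the yaml package or LibYAML; it only inspects the first two bytes of a file.

```r
# Hypothetical helper: guess a YAML stream's encoding from its first two bytes.
guess_yaml_encoding <- function(path) {
  head_bytes <- readBin(path, what = "raw", n = 2L)
  if (identical(head_bytes, as.raw(c(0xff, 0xfe)))) return("UTF-16LE")
  if (identical(head_bytes, as.raw(c(0xfe, 0xff)))) return("UTF-16BE")
  "UTF-8"  # default when no UTF-16 byte-order mark is present
}

# Usage: one file starting with a UTF-16LE byte-order mark, one plain file.
utf16_file <- tempfile(fileext = ".yaml")
writeBin(as.raw(c(0xff, 0xfe, 0x61, 0x00)), utf16_file)
plain_file <- tempfile(fileext = ".yaml")
writeBin(charToRaw("a: 1\n"), plain_file)

enc_utf16 <- guess_yaml_encoding(utf16_file)
enc_plain <- guess_yaml_encoding(plain_file)
unlink(c(utf16_file, plain_file))
```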

It looks like LibYAML has an encoding property:
http://pyyaml.org/wiki/LibYAML#StylisticEventAttributes

I'll add this into the next update.

viking · 2014-02-17T22:03:56Z

As it turns out, R does not support UTF-16 at all in Encoding() as of version 3.0.2.

yihui · 2015-04-28T16:27:51Z

We just ran into the same problem. It would be nice if you could explicitly mark the encoding of character strings as UTF-8. Thanks! (We probably do not need to worry about UTF-16.)

viking · 2015-04-29T21:48:03Z

I had forgotten about this issue, unfortunately. I will take a fresh look at it.

yihui · 2015-04-29T22:03:20Z

Thanks! FWIW, this is our current workaround: rstudio/rmarkdown#421 (Recursively mark the character elements of yaml.load() output as UTF-8)
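
A recursive marking step of this kind could look roughly like the sketch below. This is not the code from rstudio/rmarkdown#421; the function name `mark_utf8_sketch` and the sample data are made up for illustration, and the sample list merely stands in for what yaml.load() might return.

```r
# Hypothetical helper: walk a nested list and declare every character
# vector (and any character names) as UTF-8. Other types are left untouched.
mark_utf8_sketch <- function(x) {
  if (is.character(x)) {
    Encoding(x) <- "UTF-8"
  } else if (is.list(x) && length(x) > 0) {
    x[] <- lapply(x, mark_utf8_sketch)  # x[] keeps names and other attributes
  }
  nms <- names(x)
  if (!is.null(nms)) {
    Encoding(nms) <- "UTF-8"
    names(x) <- nms
  }
  x
}

# Stand-in for parsed YAML: UTF-8 bytes for "café" with no encoding mark.
cafe <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))
parsed <- list(title = cafe, tags = list(cafe, 1L), n = 3)
marked <- mark_utf8_sketch(parsed)
```

The sketch assumes the strings really are UTF-8, which is what this thread reports for yaml.load() output; marking bytes in another encoding as UTF-8 would produce invalid strings.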


ofurkusi · 2016-02-04T13:27:30Z

There seem to be two issues here, one with yaml.load_file and another with yaml.load.

When yaml.load_file calls readLines without explicitly defining the encoding as UTF-8, the contents of a valid UTF-8 encoded YAML file are read into a string with the encoding set to unknown (while in fact being UTF-8). On Windows, R treats the string as latin1 (I guess), so the characters are all garbled when displayed. By adding encoding="UTF-8" as a parameter to readLines, the raw text input is read correctly and marked as UTF-8 before being passed on to yaml.load.

While I suggest setting the encoding="UTF-8" parameter for readLines in yaml.load_file, it does not seem to be enough to fix the problem. Once yaml.load starts processing the text read by readLines, it messes the characters up again by reverting the encoding to unknown.
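
The first half of this (the readLines side) can be illustrated with base R alone. The snippet below is a sketch that writes a small UTF-8 file byte for byte and reads it back with and without the encoding argument; the file contents are made up, and the second half of the problem (what yaml.load does afterwards) is not reproduced here.

```r
# Write "name: café" as raw UTF-8 bytes so the example does not depend on the locale.
path <- tempfile(fileext = ".yaml")
writeBin(c(charToRaw("name: caf"), as.raw(c(0xc3, 0xa9, 0x0a))), path)

plain  <- readLines(path)                      # no encoding declared
marked <- readLines(path, encoding = "UTF-8")  # declared as UTF-8

plain_mark  <- Encoding(plain)
marked_mark <- Encoding(marked)

# The encoding argument only marks the strings; it does not re-encode them,
# so both reads hold exactly the same bytes.
same_bytes <- identical(charToRaw(plain), charToRaw(marked))
unlink(path)
```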

yihui · 2016-06-29T13:39:01Z

We were bitten by this issue again: rstudio/bookdown#142. Is there a chance that you could fix it? The fix should be fairly simple (mark the input and output strings as UTF-8), but I'm just not familiar with C.



shrektan · 2016-10-14T08:56:45Z

We encountered the same issue as well, although it can be solved as @yihui did in https://github.com/rstudio/bookdown/blob/3ed7fc6bd30e2832948d28298dee5cd546339fc8/R/utils.R#L82

We thought it would be nicer if it's fixed in the package yaml.

Thanks.

yihui · 2016-10-19T14:22:19Z

And bitten by this again: rstudio/rmarkdown#841, so yet another patch...

viking · 2016-10-19T14:29:25Z

Unfortunately I have precious little time to work on this project at present. A pull request would be appreciated.

yihui · 2016-10-19T14:32:08Z

@viking Okay, actually that is all I need from you. I'll try to find someone to do the work and submit a pull request. Thanks!

yihui · 2016-10-20T16:32:29Z

@viking Done in #32. Tested on Windows and *nix.

In the long run, if you feel it is difficult for you to maintain this package, you may consider finding a new maintainer. It seems you are in a situation similar to that of the tikzDevice package, which I was highly interested in but whose original authors lacked time. The yaml package is critical to the R Markdown world, and I hope you could consider increasing the bus factor so this important project can be carried forward nicely in the future.

BTW, I found this article very inspiring: "I gave commit rights to someone I didn't know, I could never have guessed what happened next!"

viking · 2016-10-27T14:41:33Z

Thank you.

yihui · 2016-11-03T19:10:23Z

@viking Any chance you could make a CRAN release soon? I hate bugging you like this, but without the CRAN release, we just keep hearing users report this issue. Here again: http://rmarkdown.rstudio.com/r_notebooks.html#comment-2982649887

viking · 2016-11-03T19:59:27Z

I'll get to it soon. Not being funded to do this means that I have other priorities. Please recognize that.

viking · 2016-11-03T20:03:21Z

I don't wish to continue this discussion here. I will let you know when the new version is on CRAN.

yihui · 2016-11-03T20:04:31Z

Yep definitely understood, and much appreciated!

viking · 2016-11-12T19:15:16Z

New version is up on CRAN as of about 10 minutes ago.

Upstream changes: CHANGES IN knitr VERSION 1.15.1 (released by yihui on 23 Nov 2016). BUG FIXES: not really a knitr bug, but knit_params() should be better at dealing with multibyte characters now due to the bug fix in the yaml package vubiostat/r-yaml#6

This was referenced Apr 28, 2015

Encoding of special characters in the YAML header on Windows rstudio/rmarkdown#420

Closed

Encoding and Spanish Characters #10

Closed

yihui added a commit to yihui/knitr that referenced this issue Apr 30, 2015

the same fix as rstudio/rmarkdown#421 to solve the problem of vubiostat/r-yaml#6

929cdc4

wush978 mentioned this issue May 20, 2015

Encoding issues swirldev/swirl#299

Open

yihui added a commit to rstudio/bookdown that referenced this issue Jun 29, 2016

mark the output of yaml.load() as UTF8

3ed7fc6

not sure when this bug can be fixed: vubiostat/r-yaml#6

yihui mentioned this issue Jun 29, 2016

中文问题 rstudio/bookdown#126

Closed

yihui mentioned this issue Oct 20, 2016

Mark character input/output as UTF-8 #32

Merged

viking closed this as completed in #32 Oct 27, 2016

yihui added a commit to yihui/knitr that referenced this issue Nov 17, 2016

no longer need the mark_utf8 hack after yaml 2.1.14 (vubiostat/r-yaml#6)

dd85151
