# tibble multiplication sign is invalid UTF-8 character #216

Closed
opened this issue Jan 24, 2017 · 14 comments

### aalexandersson commented Jan 24, 2017 • edited by krlmlr

The tibble multiplication sign is an invalid UTF-8 character. Here is a typical example output from http://readr.tidyverse.org/reference/read_delim.html :

#> # A tibble: 32 × 11

The multiplication sign character in read_csv outputs such as the one above is extended ASCII, but it should be either plain ASCII or Unicode UTF-8. In UTF-8 encoding, the character is displayed as xD7, but pandoc gives the error message "Cannot decode byte '\xd7': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream".

This is a problem for pandoc on Windows only. I tried pandoc versions 1.13.1 and 1.18. I mentioned the problem on Statalist and wondered if it was a problem with Stata's user-written program "MarkDoc", which is Stata's equivalent to R Markdown. The user-programmer of MarkDoc concluded that read_csv should have avoided the invalid UTF-8 character, and I agree. The Statalist URL is http://www.statalist.org/forums/forum/general-stata-discussion/general/1355554-markdoc-manual-gui?p=1362612#post1362612

What is the rationale for using extended ASCII instead of plain ASCII or UTF-8 for the tibble multiplication sign? Given (1) the compatibility problems with pandoc on Windows and with dependent programs such as Stata's MarkDoc, (2) there being no need for extended ASCII, and (3) an obvious easy fix being available, I assume this issue was simply overlooked.

The problem occurs with R's read_csv(), but in bug tidyverse/readr#547 hadley closed the bug and suggested that this is a tibble problem instead.
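For readers following the byte-level argument, here is a minimal R sketch of how the multiplication sign (U+00D7) is represented in UTF-8 and in Latin-1. It is an illustration only and is not taken from tibble or readr. It assumes that the lone byte `\xd7` in the pandoc message comes from a single-byte (Latin-1 or Windows-1252 style) encoding of the character, which the thread does not confirm.

```r
# The multiplication sign, written with a Unicode escape so the source stays ASCII.
mult <- enc2utf8("\u00d7")

# Its Unicode code point (215 decimal, D7 hexadecimal).
code_point <- utf8ToInt(mult)

# In UTF-8 the character takes two bytes: c3 97.
utf8_bytes <- charToRaw(mult)

# In Latin-1 the same character is the single byte d7.
latin1_bytes <- charToRaw(iconv(mult, from = "UTF-8", to = "latin1"))

# The two-byte sequence is valid UTF-8; the lone byte d7 is not,
# which matches the kind of byte pandoc complains about.
utf8_is_valid   <- validUTF8(rawToChar(utf8_bytes))
latin1_is_valid <- validUTF8(rawToChar(latin1_bytes))
```

Under that assumption, the character itself is fine in UTF-8; the failure would appear only when a file holding the single-byte form is handed to a tool that expects UTF-8.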

### aalexandersson commented Jan 24, 2017

 Fixed typo in last sentence: Changed wickham to hadley.

### krlmlr commented Jan 25, 2017

 Thanks. Could you please post the .Rmd file you use for testing on Windows, just to make sure we're on the same page?

---
title: "Test of tibble in R Markdown"
author: "Anders Alexandersson"
date: "January 25, 2017"
output: html_document
---

knitr::opts_chunk$set(echo = TRUE)


## System Information

This is some system information.

sessionInfo()
rmarkdown::pandoc_version()


## Create tibble output

This creates some tibble output.

read_csv("auto.txt")


## Test tibble output using pandoc

This is a test of tibble using pandoc. How to run pandoc from R? In Stata's Rcall command it is automated. To reproduce the error in R, I copy-paste the above output to Notepad, which defaults to Encoding ANSI. I save the filename as "output.txt". Then from the command prompt where Pandoc is installed, I typed

pandoc Markdown.txt -o Word.docx

I saved the error message as "error_message.png".

Another user reported the same problem on Stack Overflow at
http://stackoverflow.com/questions/26492750/using-imported-utf-8-character-in-knitr-with-r

Here is the error message:
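The manual Notepad and command-prompt steps described above could be scripted from R along the following lines. This is a sketch under stated assumptions, not the reporter's actual workflow: `reencode_to_utf8()` is a hypothetical helper, Notepad's "ANSI" is assumed to mean Windows-1252 (`CP1252`), and the intermediate file name `Markdown_utf8.txt` is made up for illustration. The idea is to convert the saved file to UTF-8 before pandoc reads it.

```r
# Hypothetical helper: re-encode a file from a single-byte code page to UTF-8.
# Assumption: the input was saved by Notepad as "ANSI", taken here to be CP1252.
reencode_to_utf8 <- function(input, output, from = "CP1252") {
  # Read the file as raw bytes so R's locale does not reinterpret them.
  bytes <- readBin(input, what = "raw", n = file.size(input))
  # iconv() accepts a list of raw vectors and returns a character vector.
  utf8 <- iconv(list(bytes), from = from, to = "UTF-8")
  if (is.na(utf8)) stop("could not convert ", input, " from ", from)
  # Write the UTF-8 bytes back out unchanged.
  writeBin(charToRaw(utf8), output)
  invisible(output)
}

# Usage, guarded so nothing happens when the file or pandoc is missing.
if (file.exists("Markdown.txt") && nzchar(Sys.which("pandoc"))) {
  reencode_to_utf8("Markdown.txt", "Markdown_utf8.txt")
  system2("pandoc", c("Markdown_utf8.txt", "-o", "Word.docx"))
}
```

Reading raw bytes keeps the conversion independent of the session locale, which is the part that differs between Windows and other platforms.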

### aalexandersson commented Jan 25, 2017

 I am not allowed to paste HTML output. For you to see my R output, I attach the PDF output. test_tibble.pdf

### thibautjombart commented Apr 5, 2017 • edited

I can confirm this bug. The 'x' is the culprit. Here is a short Rnw with reproducible example:

\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}

This will generate an error when compiling the \texttt{tex}.

<<test>>=
library(tibble)
as_tibble(cars)
@

The error I get on linux, and some colleagues on Mac, is:

\begin{verbatim}
> knit2pdf("test.Rnw")

processing file: test.Rnw
  |......................                                           |  33%
  ordinary text without R code
  |...........................................                      |  67%
label: test
  |.................................................................| 100%
  ordinary text without R code

output file: test.tex

Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet, :
  Running 'texi2dvi' on 'test.tex' failed.
LaTeX errors:
! Package inputenc Error: Unicode char \u8:× not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type H for immediate help.
...
\end{verbatim}

\end{document}

### NikNakk commented Apr 18, 2017

 On Windows 10, R 3.3.3, rmarkdown 1.4, tibble 1.3.0.9000 I am unable to reproduce this with either Rmd or Rnw. However, if I use rmarkdown::render("file", clean = FALSE) and use the non-UTF8 Md file of the two generated, I can get pandoc to produce the error indicated. There doesn't, however, seem to be anything wrong as such with the code in tibble.

### krlmlr commented Apr 19, 2017

 @yihui: Is there a way to determine the expected encoding for console output for a knitr or rmarkdown run? Or do we just assume UTF-8? tibble is printing a multiplication sign which requires Unicode and seems to break knitr documents in some cases.
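One way to approximate the question above is to look at the R session's locale rather than the knitr document's encoding, which is a different thing and may not match. The sketch below assumes that distinction is acceptable for illustration; `times_sign()` and `dim_header()` are hypothetical names, not tibble functions, and this is not the fix that was eventually adopted in the thread (which reverted to a plain x).

```r
# Hypothetical helper: pick the separator based on whether the session
# locale reports UTF-8. l10n_info() is base R; its "UTF-8" element is logical.
times_sign <- function(utf8 = isTRUE(l10n_info()[["UTF-8"]])) {
  if (utf8) "\u00d7" else "x"
}

# Hypothetical helper: build a header line in the style of tibble's output.
dim_header <- function(nrow, ncol, utf8 = isTRUE(l10n_info()[["UTF-8"]])) {
  paste0("# A tibble: ", nrow, " ", times_sign(utf8), " ", ncol)
}

# Forcing the ASCII fallback regardless of locale.
ascii_header <- dim_header(32, 11, utf8 = FALSE)
```

A locale check like this cannot see what encoding a downstream tool such as pandoc or LaTeX will assume, which is part of why an unconditional ASCII x is the simpler choice.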

### thibautjombart commented Apr 19, 2017

 The weird thing is that my system is using utf8, and other non-ascii characters seem to do just fine. In the example provided the encoding is declared when loading the inputenc package in the LaTeX header (\usepackage[utf8]{inputenc}).

### yihui commented Apr 19, 2017

 I received a similar report recently about the multiplication sign: yihui/knitr#1389 but I could not reproduce it on Windows. I guess @thibautjombart's problem is that he didn't tell knitr the encoding was supposed to be UTF-8 (which is the default on *nix but not Windows): knit2pdf("test.Rnw", encoding = "UTF-8"). I'd recommend that you just use the letter x instead of the fancy Unicode character... Character encoding problems on Windows are forever pain.

### krlmlr commented Apr 20, 2017

 @hadley: Okay to revert to plain ASCII x?

### thibautjombart commented Apr 20, 2017 • edited

@yihui nope, my native encoding is utf-8 (I'm on linux). Adding the option hasn't changed the error. I can reproduce the error on the current rocker/verse docker image too:

File toto.Rnw saved
root@0aee4758d237:~# R

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> knitr::knit2pdf("toto.Rnw")

processing file: toto.Rnw
  |......................                                           |  33%
  ordinary text without R code
  |...........................................                      |  67%
label: test
  |.................................................................| 100%
  ordinary text without R code

output file: toto.tex

Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet, :
  Running 'texi2dvi' on 'toto.tex' failed.
LaTeX errors:
! Package inputenc Error: Unicode char \u8:× not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type H for immediate help.
...
>

Also note that this character is used in the print method for tibble objects. I am not using it otherwise.
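Until a tibble release without the Unicode sign is installed, a document-side workaround might be to replace the character in chunk output before it reaches LaTeX. The sketch below is an assumption-laden illustration, not something proposed in the thread: `ascii_times()` is a hypothetical helper, and the hook wrapper is meant to be placed in a setup chunk inside the document so that it wraps the output hook knitr has already chosen for the format.

```r
# Hypothetical helper: swap the Unicode multiplication sign for an ASCII x.
ascii_times <- function(x) gsub("\u00d7", "x", x, fixed = TRUE)

# Intended for a setup chunk inside the .Rnw or .Rmd file.
# knit_hooks$get() and knit_hooks$set() are knitr's hook accessors.
if (requireNamespace("knitr", quietly = TRUE)) {
  default_output <- knitr::knit_hooks$get("output")
  knitr::knit_hooks$set(output = function(x, options) {
    # Clean the printed text, then hand it to the original output hook.
    default_output(ascii_times(x), options)
  })
}
```

This only touches printed chunk output; a multiplication sign typed directly into the document text would still need to be handled on the LaTeX side.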

### thibautjombart commented Apr 20, 2017

For what it's worth, this is what emacs thinks of this character:

position: 1 of 2 (0%), column: 0
character: × (displayed as ×) (codepoint 215, #o327, #xd7)
preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xD7
script: latin
syntax: _ which means: symbol
category: .:Base, c:Chinese, h:Korean, j:Japanese, l:Latin
to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
buffer code: #xC3 #x97
file code: #xC3 #x97 (encoded by coding system utf-8-unix)
display: by this font (glyph code)
xft:-PfEd-DejaVu Sans Mono-normal-normal-normal-*-19-*-*-*-m-0-iso10646-1 (#x99)

Seems like a valid utf8 character to my (naive) eye.

### hadley commented Apr 20, 2017

 @krlmlr yeah, it's not worth the hassle.
closed this in  6693f4c  May 9, 2017
added a commit that referenced this issue May 13, 2017
 Merge tag 'v1.3.0.9003' 
 7503b32 
- The print(), format(), and tbl_sum() methods are now implemented for class "tbl" and not for "tbl_df". This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing tbl_sum() for the subclass.
- New set_tidy_names() and tidy_names(), a simpler version of repair_names() which works unchanged for now (#217).
- Printing now uses x again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- glimpse() now properly displays tibbles with foreign characters in column names (#235).
added a commit that referenced this issue May 17, 2017
 Merge tag 'v1.3.1' 
 8f30072 
- Subsetting zero columns no longer returns wrong number of rows (#241, @echasnovski).

- New set_tidy_names() and tidy_names(), a simpler version of repair_names() which works unchanged for now (#217).
- New rowid_to_column() that adds a rowid column as first column and removes row names (#243, @barnettjacob).
- The all.equal.tbl_df() method has been removed, calling all.equal() now forwards to base::all.equal.data.frame(). To compare tibbles ignoring row and column order, please use dplyr::all_equal() (#247).

- Printing now uses x again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- String values are now quoted when printing if they contain non-printable characters or quotes (#253).
- The print(), format(), and tbl_sum() methods are now implemented for class "tbl" and not for "tbl_df". This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing tbl_sum() for the subclass, which is expected to return a named character vector. The print.tbl_df() method is still implemented for compatibility with downstream packages, but only calls NextMethod().
- Own printing routine, not relying on print.data.frame() anymore. Now providing format.tbl_df() and full support for Unicode characters in names and data, also for glimpse() (#235).

- Improve formatting of error messages (#223).
- Using rlang instead of lazyeval (#225, @lionel-), and rlang functions (#244).
- tribble() now handles values that have a class (#237, @NikNakk).
- Minor efficiency gains by replacing any(is.na()) with anyNA() (#229, @csgillespie).
- The microbenchmark package is now used conditionally (#245).
- pkgdown website.

### github-actions bot commented Dec 13, 2020

 This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.
bot locked and limited conversation to collaborators Dec 13, 2020