tibble multiplication sign is invalid UTF-8 character #216

Closed
aalexandersson opened this Issue Jan 24, 2017 · 13 comments

aalexandersson commented Jan 24, 2017

The tibble multiplication sign is an invalid UTF-8 character. Here is a typical example output from
http://readr.tidyverse.org/reference/read_delim.html :

#> # A tibble: 32 × 11

The multiplication sign character in read_csv output such as the above is extended ASCII, but it should be either plain ASCII or Unicode UTF-8. In UTF-8 encoding, the character is displayed as xD7, but pandoc gives the error message

"Cannot decode byte '\xd7': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream"

This is a problem for pandoc on Windows only. I tried pandoc versions 1.13.1 and 1.18. I mentioned the problem on Statalist and wondered whether it was a problem with Stata's user-written program MarkDoc, which is Stata's equivalent of R Markdown. The author of MarkDoc concluded that read_csv should have avoided the invalid UTF-8 character, and I agree. The Statalist URL is http://www.statalist.org/forums/forum/general-stata-discussion/general/1355554-markdoc-manual-gui?p=1362612#post1362612

What is the rationale for using extended ASCII instead of plain ASCII or UTF-8 for the tibble multiplication sign? Given (1) the compatibility problems with pandoc on Windows and with dependent programs such as Stata's MarkDoc, (2) the lack of any need for extended ASCII, and (3) the availability of an obvious, easy fix, I assume this issue was simply overlooked. The problem occurs with R's read_csv(), but in tidyverse/readr#547 hadley closed the bug and suggested that this is a tibble problem instead.
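The byte-level situation described above can be checked outside R. The following is a minimal Python sketch, used only as a neutral way to inspect bytes; none of it is tibble, readr, or pandoc code. It shows that the multiplication sign (code point U+00D7) is the single byte 0xD7 in Windows-1252 but the two bytes 0xC3 0x97 in UTF-8, so a lone 0xD7 byte is rejected by a strict UTF-8 decoder.

```python
# The multiplication sign as a Unicode string (code point U+00D7).
times = "\u00d7"

utf8_bytes = times.encode("utf-8")    # two bytes in UTF-8
ansi_bytes = times.encode("cp1252")   # one byte in Windows-1252 ("ANSI")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Print hex so the demo is readable on any console encoding.
print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))
print(ansi_bytes.hex(), is_valid_utf8(ansi_bytes))
```

If this is what happens in the reported case, the character itself is fine and the failure comes from bytes written in one encoding being read as another.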

aalexandersson commented Jan 24, 2017

Fixed typo in last sentence: Changed wickham to hadley.

Member

krlmlr commented Jan 25, 2017

Thanks. Could you please post the .Rmd file you use for testing on Windows, just to make sure we're on the same page?

aalexandersson commented Jan 25, 2017

Here is the .Rmd file:

title: "Test of tibble in R Markdown"
author: "Anders Alexandersson"
date: "January 25, 2017"
output: html_document

knitr::opts_chunk$set(echo = TRUE)

System Information

This is some system information.

sessionInfo()
library(readr)
rmarkdown::pandoc_version()

Create tibble output

This creates some tibble output.

read_csv("auto.txt")

Test tibble output using pandoc

This is a test of tibble using pandoc. How to run pandoc from R? In Stata's Rcall command it is automated. To reproduce the error in R, I copy-pasted the above output into Notepad, which defaults to ANSI encoding, and saved the file as "output.txt". Then, from the command prompt in the directory where pandoc is installed, I typed

pandoc Markdown.txt -o Word.docx

I saved the error message as "error_message.png".

Screenshot of error message

The same problem was also reported by another user on Stack Overflow at
http://stackoverflow.com/questions/26492750/using-imported-utf-8-character-in-knitr-with-r

Here is the error message:
(image: error_message.png)
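Because pandoc reads its input as UTF-8, one possible workaround for the Notepad step described above is to re-encode the saved file before running pandoc. This is a minimal Python sketch, assuming the file was saved as Windows-1252 (an assumption about what "ANSI" meant on the reporter's machine). The function name `reencode_to_utf8` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1252") -> None:
    """Decode src using source_encoding and write the same text to dst as UTF-8."""
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


# Self-contained demo with hypothetical file names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    ansi_file = Path(tmp) / "output_ansi.txt"
    utf8_file = Path(tmp) / "output_utf8.txt"

    # Simulate an ANSI save: the tibble header line with a lone 0xD7 byte.
    ansi_file.write_bytes(b"# A tibble: 32 \xd7 11\n")

    reencode_to_utf8(ansi_file, utf8_file)
    converted = utf8_file.read_bytes()
    print(converted)
```

The converted file can then be passed to pandoc in place of the ANSI one.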

aalexandersson commented Jan 25, 2017

I am not allowed to paste HTML output. For you to see my R output, I attach the PDF output.
test_tibble.pdf

thibautjombart commented Apr 5, 2017

I can confirm this bug. The '×' (multiplication sign) is the culprit. Here is a short Rnw file with a reproducible example:

\documentclass{article}
\usepackage[utf8]{inputenc}

\begin{document}

This will generate an error when compiling the \texttt{tex}.

<<test>>=
library(tibble)
as_tibble(cars)
@ 

The error I get on linux, and some colleagues on Mac, is:
\begin{verbatim}
> knit2pdf("test.Rnw")


processing file: test.Rnw
  |......................                                           |  33%
  ordinary text without R code

  |...........................................                      |  67%
label: test
  |.................................................................| 100%
  ordinary text without R code


output file: test.tex

Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  Running 'texi2dvi' on 'test.tex' failed.
LaTeX errors:
! Package inputenc Error: Unicode char \u8:× not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...      
\end{verbatim}

\end{document}

krlmlr added this to To Do in krlmlr Apr 17, 2017

NikNakk commented Apr 18, 2017

On Windows 10 with R 3.3.3, rmarkdown 1.4, and tibble 1.3.0.9000, I am unable to reproduce this with either Rmd or Rnw. However, if I use rmarkdown::render("file", clean = FALSE) and pass pandoc the non-UTF-8 .md file of the two generated, I can get pandoc to produce the error indicated. There doesn't, however, seem to be anything wrong as such with the code in tibble.

Member

krlmlr commented Apr 19, 2017

@yihui: Is there a way to determine the expected encoding for console output for a knitr or rmarkdown run? Or do we just assume UTF-8?

tibble is printing a multiplication sign which requires Unicode and seems to break knitr documents in some cases.
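One conservative pattern for the question above is to emit the Unicode character only when the target encoding is known to be UTF-8 and to fall back to ASCII `x` otherwise. The Python sketch below is purely illustrative and is not how knitr, rmarkdown, or tibble work. The function names `times_sign` and `dim_header` are hypothetical.

```python
import codecs


def times_sign(encoding: str) -> str:
    """Return the multiplication sign for UTF-8 output, else plain ASCII 'x'."""
    try:
        # codecs.lookup normalizes aliases such as "UTF-8", "utf8", "U8".
        is_utf8 = codecs.lookup(encoding).name == "utf-8"
    except LookupError:
        # Unknown encoding name: take the safe ASCII route.
        is_utf8 = False
    return "\u00d7" if is_utf8 else "x"


def dim_header(rows: int, cols: int, encoding: str) -> str:
    """Format a tibble-style dimension header for the given output encoding."""
    return f"# A tibble: {rows} {times_sign(encoding)} {cols}"


# ascii() keeps the demo printable on any console encoding.
print(ascii(dim_header(32, 11, "UTF-8")))
print(ascii(dim_header(32, 11, "cp1252")))
```

The sketch deliberately treats every non-UTF-8 target as ASCII-only, even though encodings such as Windows-1252 can represent the character, because the failures reported here come from a mismatch between the writer's and the reader's encoding.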

thibautjombart commented Apr 19, 2017

The weird thing is that my system is using utf8, and other non-ascii characters seem to do just fine. In the example provided the encoding is declared when loading the inputenc package in the LaTeX header (\usepackage[utf8]{inputenc}).

Contributor

yihui commented Apr 19, 2017

I received a similar report recently about the multiplication sign: yihui/knitr#1389 but I could not reproduce it on Windows.

I guess @thibautjombart's problem is that he didn't tell knitr the encoding was supposed to be UTF-8 (which is the default on *nix but not Windows): knit2pdf("test.Rnw", encoding = "UTF-8").

I'd recommend that you just use the letter x instead of the fancy Unicode character... Character encoding problems on Windows are a forever pain.

Member

krlmlr commented Apr 20, 2017

@hadley: Okay to revert to plain ASCII x?

thibautjombart commented Apr 20, 2017

@yihui nope, my native encoding is utf-8 (I'm on linux). Adding the option hasn't changed the error.
I can reproduce the error on the current rocker/verse docker image too:

File toto.Rnw saved
root@0aee4758d237:~# R

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> knitr::knit2pdf("toto.Rnw")


processing file: toto.Rnw
  |......................                                           |  33%
  ordinary text without R code

  |...........................................                      |  67%
label: test
  |.................................................................| 100%
  ordinary text without R code


output file: toto.tex

Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  Running 'texi2dvi' on 'toto.tex' failed.
LaTeX errors:
! Package inputenc Error: Unicode char \u8:× not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
> 

Also note that this character is used in the print method for tibble objects. I am not using it otherwise.

thibautjombart commented Apr 20, 2017

For what it's worth, this is what emacs thinks of this character:

             position: 1 of 2 (0%), column: 0
            character: × (displayed as ×) (codepoint 215, #o327, #xd7)
    preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xD7
               script: latin
               syntax: _ 	which means: symbol
             category: .:Base, c:Chinese, h:Korean, j:Japanese, l:Latin
             to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
          buffer code: #xC3 #x97
            file code: #xC3 #x97 (encoded by coding system utf-8-unix)
              display: by this font (glyph code)
    xft:-PfEd-DejaVu Sans Mono-normal-normal-normal-*-19-*-*-*-m-0-iso10646-1 (#x99)

Seems like a valid UTF-8 character to my (naive) eye.
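The Emacs readout can be cross-checked independently. This is a small Python sketch using only the standard library, listing the same properties: code point, Unicode name and category, and the bytes of the UTF-8 file encoding.

```python
import unicodedata

ch = "\u00d7"

info = {
    "codepoint_dec": ord(ch),                    # Emacs: codepoint 215
    "codepoint_oct": f"{ord(ch):#o}",            # Emacs: #o327
    "codepoint_hex": f"{ord(ch):#x}",            # Emacs: #xd7
    "name": unicodedata.name(ch),
    "category": unicodedata.category(ch),        # "Sm" = Symbol, math
    "utf8_bytes": [f"{b:#04x}" for b in ch.encode("utf-8")],  # file code
}

for key, value in info.items():
    print(f"{key}: {value}")
```

The file bytes are 0xC3 0x97, which is well-formed UTF-8, so the code point 0xD7 should not be confused with a raw byte 0xD7. The inputenc message "not set up for use with LaTeX" is consistent with this: it is the error raised when a character has been decoded but has no LaTeX definition, rather than when the byte stream is invalid.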

Member

hadley commented Apr 20, 2017

@krlmlr yeah, it's not worth the hassle.

krlmlr moved this from To Do to Done in krlmlr May 9, 2017

krlmlr closed this in 6693f4c May 9, 2017

krlmlr added a commit that referenced this issue May 13, 2017

Merge tag 'v1.3.0.9003'
- The `print()`, `format()`, and `tbl_sum()` methods are now implemented for class `"tbl"` and not for `"tbl_df"`. This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing `tbl_sum()` for the subclass.
- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).
- Printing now uses `x` again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- `glimpse()` now properly displays tibbles with foreign characters in column names (#235).

krlmlr added a commit that referenced this issue May 17, 2017

Merge tag 'v1.3.1'
- Subsetting zero columns no longer returns wrong number of rows (#241, @echasnovski).

- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).
- New `rowid_to_column()` that adds a `rowid` column as first column and removes row names (#243, @barnettjacob).
- The `all.equal.tbl_df()` method has been removed, calling `all.equal()` now forwards to `base::all.equal.data.frame()`. To compare tibbles ignoring row and column order, please use `dplyr::all_equal()` (#247).

- Printing now uses `x` again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- String values are now quoted when printing if they contain non-printable characters or quotes (#253).
- The `print()`, `format()`, and `tbl_sum()` methods are now implemented for class `"tbl"` and not for `"tbl_df"`. This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing `tbl_sum()` for the subclass, which is expected to return a named character vector. The `print.tbl_df()` method is still implemented for compatibility with downstream packages, but only calls `NextMethod()`.
- Own printing routine, not relying on `print.data.frame()` anymore. Now providing `format.tbl_df()` and full support for Unicode characters in names and data, also for `glimpse()` (#235).

- Improve formatting of error messages (#223).
- Using `rlang` instead of `lazyeval` (#225, @lionel-), and `rlang` functions (#244).
- `tribble()` now handles values that have a class (#237, @NikNakk).
- Minor efficiency gains by replacing `any(is.na())` with `anyNA()` (#229, @csgillespie).
- The `microbenchmark` package is now used conditionally (#245).
- `pkgdown` website.