tibble multiplication sign is invalid UTF-8 character #216

Closed
aalexandersson opened this issue Jan 24, 2017 · 14 comments

Comments

@aalexandersson

aalexandersson commented Jan 24, 2017

The tibble multiplication sign is an invalid UTF-8 character. Here is a typical example output from
http://readr.tidyverse.org/reference/read_delim.html :

#> # A tibble: 32 × 11

The multiplication sign character in read_csv outputs such as above is extended ASCII but it should be either in plain ASCII or in Unicode UTF-8. In UTF-8 encoding, the character is displayed as xD7 but pandoc gives the error message

"Cannot decode byte '\xd7': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream"

This is a problem for pandoc on Windows only. I tried pandoc versions 1.13.1 and 1.18. I mentioned the problem on Statalist and wondered if it was a problem with Stata's user-written program "MarkDoc", which is Stata's equivalent of R Markdown. The author of MarkDoc concluded that read_csv should have avoided the invalid UTF-8 character, and I agree. The Statalist URL is http://www.statalist.org/forums/forum/general-stata-discussion/general/1355554-markdoc-manual-gui?p=1362612#post1362612

What is the rationale for using extended ASCII instead of plain ASCII or UTF-8 for the tibble multiplication sign? Given (1) the compatibility problems with pandoc on Windows and with dependent programs such as Stata's MarkDoc, (2) the lack of any need for extended ASCII, and (3) the obvious, easy fix, I assume this issue was simply overlooked. The problem occurs with R's read_csv(), but in tidyverse/readr#547 hadley closed the bug and suggested instead that this is a tibble problem.
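
For readers who want to see what the pandoc message means at the byte level, here is a small Python sketch (Python is used only for illustration; it is not part of tibble, readr, or pandoc). It assumes the header line was stored with the multiplication sign as the single byte 0xD7, which is how Windows code page 1252 writes it, and shows that the same bytes decode cleanly as cp1252 but are rejected by a UTF-8 decoder. The helper name `try_decode` is made up for this example.

```python
# Illustration only: the same bytes read under two different encodings.
# Assumption: the tibble header was saved with "×" as the single byte 0xD7 (cp1252).
raw = b"# A tibble: 32 \xd7 11"


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return f"cannot decode with {encoding}: {err.reason}"


as_cp1252 = try_decode(raw, "cp1252")  # 0xD7 maps to U+00D7 in cp1252
as_utf8 = try_decode(raw, "utf-8")     # a lone 0xD7 is not a complete UTF-8 sequence

print(as_cp1252)
print(as_utf8)
```

So the byte is only "invalid" when a tool that expects UTF-8 reads a file that was written in a single-byte Windows encoding.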

@aalexandersson
Author

Fixed typo in last sentence: Changed wickham to hadley.

@krlmlr
Member

krlmlr commented Jan 25, 2017

Thanks. Could you please post the .Rmd file you use for testing on Windows, just to make sure we're on the same page?

@aalexandersson
Author

Here is the .Rmd file:

title: "Test of tibble in R Markdown"
author: "Anders Alexandersson"
date: "January 25, 2017"
output: html_document

knitr::opts_chunk$set(echo = TRUE)

System Information

This is some system information.

sessionInfo()
library(readr)
rmarkdown::pandoc_version()

Create tibble output

This creates some tibble output.

read_csv("auto.txt")

Test tibble output using pandoc

This is a test of tibble using pandoc. How do I run pandoc from R? In Stata's Rcall command it is automated. To reproduce the error in R, I copy and paste the above output into Notepad, which defaults to ANSI encoding, and save the file as "output.txt". Then, from the command prompt where pandoc is installed, I type

pandoc Markdown.txt -o Word.docx

I saved the error message as "error_message.png".

Screenshot of error message

The same problem was also reported by another user on Stack Overflow at
http://stackoverflow.com/questions/26492750/using-imported-utf-8-character-in-knitr-with-r

Here is the error message:
error_message
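
A possible workaround for the Notepad step, sketched in Python under stated assumptions: pandoc reads its input as UTF-8, so a file saved as "ANSI" can be transcoded before it is passed to pandoc. The function name `to_utf8`, the fallback encoding cp1252, and the output file name are assumptions made for this example, not something taken from the thread.

```python
import tempfile
from pathlib import Path


def to_utf8(src: Path, dst: Path, fallback: str = "cp1252") -> str:
    """Write src to dst as UTF-8 and return the encoding the input was read with.

    The input is tried as UTF-8 first. If that fails it is decoded with
    `fallback` (assumed here to be the Windows "ANSI" code page cp1252).
    Bytes that are undefined in the fallback encoding would still raise.
    """
    data = src.read_bytes()
    try:
        text, used = data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        text, used = data.decode(fallback), fallback
    dst.write_bytes(text.encode("utf-8"))
    return used


# Usage: simulate the Notepad "ANSI" save, then convert it.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "output.txt"
    dst = Path(tmp) / "output-utf8.txt"  # hypothetical name for the converted copy
    src.write_bytes(b"# A tibble: 32 \xd7 11\n")
    used = to_utf8(src, dst)
    converted = dst.read_bytes()

print(used, converted)
```

The converted file contains the two-byte UTF-8 form of the multiplication sign, so it should be safe to hand to pandoc in place of the original.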

@aalexandersson
Author

I am not allowed to paste HTML output. For you to see my R output, I attach the PDF output.
test_tibble.pdf

@thibautjombart

thibautjombart commented Apr 5, 2017

I can confirm this bug. The 'x' is the culprit. Here is a short Rnw file with a reproducible example:

\documentclass{article}
\usepackage[utf8]{inputenc}

\begin{document}

This will generate an error when compiling the \texttt{tex}.

<<test>>=
library(tibble)
as_tibble(cars)
@ 

The error I get on linux, and some colleagues on Mac, is:
\begin{verbatim}
> knit2pdf("test.Rnw")


processing file: test.Rnw
  |......................                                           |  33%
  ordinary text without R code

  |...........................................                      |  67%
label: test
  |.................................................................| 100%
  ordinary text without R code


output file: test.tex

Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  Running 'texi2dvi' on 'test.tex' failed.
LaTeX errors:
! Package inputenc Error: Unicode char \u8:× not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...      
\end{verbatim}

\end{document}

@NikNakk

NikNakk commented Apr 18, 2017

On Windows 10, R 3.3.3, rmarkdown 1.4, tibble 1.3.0.9000 I am unable to reproduce this with either Rmd or Rnw. However, if I use rmarkdown::render("file", clean = FALSE) and use the non-UTF8 Md file of the two generated, I can get pandoc to produce the error indicated. There doesn't, however, seem to be anything wrong as such with the code in tibble.

@krlmlr
Member

krlmlr commented Apr 19, 2017

@yihui: Is there a way to determine the expected encoding for console output for a knitr or rmarkdown run? Or do we just assume UTF-8?

tibble is printing a multiplication sign which requires Unicode and seems to break knitr documents in some cases.
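
To make the question concrete, here is one way such a check could look, sketched in Python rather than R and not taken from tibble's source. It assumes the printing code is somehow told the output encoding; the names `times_sign` and `header` are hypothetical. The sign is used only when that encoding can represent it, with plain ASCII `x` as the fallback.

```python
TIMES = "\u00d7"  # Unicode multiplication sign


def times_sign(encoding: str) -> str:
    """Return the multiplication sign if `encoding` can represent it, else ASCII "x"."""
    try:
        TIMES.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        # Either the character is not representable or the encoding name is unknown.
        return "x"
    return TIMES


def header(rows: int, cols: int, encoding: str) -> str:
    """Build a tibble-style header line for the given output encoding."""
    return f"# A tibble: {rows} {times_sign(encoding)} {cols}"


print(header(32, 11, "utf-8"))
print(header(32, 11, "ascii"))
```

A check like this only covers what the printing side can encode. It cannot tell what a downstream tool such as pandoc or LaTeX will assume about the file it later reads.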

@thibautjombart

The weird thing is that my system is using utf8, and other non-ascii characters seem to do just fine. In the example provided the encoding is declared when loading the inputenc package in the LaTeX header (\usepackage[utf8]{inputenc}).

@yihui

yihui commented Apr 19, 2017

I received a similar report recently about the multiplication sign: yihui/knitr#1389 but I could not reproduce it on Windows.

I guess @thibautjombart's problem is that he didn't tell knitr the encoding was supposed to be UTF-8 (which is the default on *nix but not Windows): knit2pdf("test.Rnw", encoding = "UTF-8").

I'd recommend that you just use the letter x instead of the fancy Unicode character... Character encoding problems on Windows are a never-ending pain.

@krlmlr
Member

krlmlr commented Apr 20, 2017

@hadley: Okay to revert to plain ASCII x?

@thibautjombart

thibautjombart commented Apr 20, 2017

@yihui nope, my native encoding is utf-8 (I'm on linux). Adding the option hasn't changed the error.
I can reproduce the error on the current rocker/verse docker image too:

File toto.Rnw saved
root@0aee4758d237:~# R

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> knitr::knit2pdf("toto.Rnw")


processing file: toto.Rnw
  |......................                                           |  33%
  ordinary text without R code

  |...........................................                      |  67%
label: test
  |.................................................................| 100%
  ordinary text without R code


output file: toto.tex

Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  Running 'texi2dvi' on 'toto.tex' failed.
LaTeX errors:
! Package inputenc Error: Unicode char \u8:× not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
> 

Also note that this character is used in the print method for tibble objects. I am not using it otherwise.

@thibautjombart

For what it's worth, this is what emacs thinks of this character:

             position: 1 of 2 (0%), column: 0
            character: × (displayed as ×) (codepoint 215, #o327, #xd7)
    preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xD7
               script: latin
               syntax: _ 	which means: symbol
             category: .:Base, c:Chinese, h:Korean, j:Japanese, l:Latin
             to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
          buffer code: #xC3 #x97
            file code: #xC3 #x97 (encoded by coding system utf-8-unix)
              display: by this font (glyph code)
    xft:-PfEd-DejaVu Sans Mono-normal-normal-normal-*-19-*-*-*-m-0-iso10646-1 (#x99)

Seems like a valid UTF-8 character to my (naive) eye.
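
The emacs report can be cross-checked independently. The short Python sketch below (illustration only) looks up code point 215 in the Unicode database and encodes it as UTF-8, which should give the same two bytes that emacs lists as the file code.

```python
import unicodedata

ch = chr(215)                           # code point 215 = 0xD7, as reported by emacs
name = unicodedata.name(ch)             # official Unicode character name
utf8_hex = ch.encode("utf-8").hex(" ")  # UTF-8 bytes as space-separated hex (Python 3.8+)
round_trip = ch.encode("utf-8").decode("utf-8") == ch

print(name, utf8_hex, round_trip)
```

This agrees with the `#xC3 #x97` shown above: the character in the generated file is well-formed UTF-8.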

@hadley
Member

hadley commented Apr 20, 2017

@krlmlr yeah, it's not worth the hassle.

@krlmlr krlmlr closed this as completed in 6693f4c May 9, 2017
krlmlr added a commit that referenced this issue May 13, 2017
- The `print()`, `format()`, and `tbl_sum()` methods are now implemented for class `"tbl"` and not for `"tbl_df"`. This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing `tbl_sum()` for the subclass.
- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).
- Printing now uses `x` again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- `glimpse()` now properly displays tibbles with foreign characters in column names (#235).
krlmlr added a commit that referenced this issue May 17, 2017
- Subsetting zero columns no longer returns wrong number of rows (#241, @echasnovski).

- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).
- New `rowid_to_column()` that adds a `rowid` column as first column and removes row names (#243, @barnettjacob).
- The `all.equal.tbl_df()` method has been removed, calling `all.equal()` now forwards to `base::all.equal.data.frame()`. To compare tibbles ignoring row and column order, please use `dplyr::all_equal()` (#247).

- Printing now uses `x` again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- String values are now quoted when printing if they contain non-printable characters or quotes (#253).
- The `print()`, `format()`, and `tbl_sum()` methods are now implemented for class `"tbl"` and not for `"tbl_df"`. This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing `tbl_sum()` for the subclass, which is expected to return a named character vector. The `print.tbl_df()` method is still implemented for compatibility with downstream packages, but only calls `NextMethod()`.
- Own printing routine, not relying on `print.data.frame()` anymore. Now providing `format.tbl_df()` and full support for Unicode characters in names and data, also for `glimpse()` (#235).

- Improve formatting of error messages (#223).
- Using `rlang` instead of `lazyeval` (#225, @lionel-), and `rlang` functions (#244).
- `tribble()` now handles values that have a class (#237, @NikNakk).
- Minor efficiency gains by replacing `any(is.na())` with `anyNA()` (#229, @csgillespie).
- The `microbenchmark` package is now used conditionally (#245).
- `pkgdown` website.
@github-actions
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 13, 2020
6 participants