Having problem with Chinese Characters in Windows environment #329

Open
hetong007 opened this issue Jan 13, 2014 · 41 comments

Comments

@hetong007

Chinese characters are encoded as UTF-8 on Linux/OS X but as GBK on Windows, and slidify currently fails to handle the difference between the two encodings.

You can clone my repo Douban_Folksonomy to reproduce the following result. A properly generated HTML version (built under Ubuntu 12.04) is available here. I am using Windows XP, but the same problem occurs on Windows 7 as well.

Here are the first few lines in my 'index.Rmd' file:

---
title       : 豆瓣网标签的整理和分析
subtitle    : 
author      : 何通
job         : 豆瓣算法组实习生
framework   : io2012        # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js  # {highlight.js, prettify, highlight}
hitheme     : tomorrow      # 
widgets     : [bootstrap]            # {mathjax, quiz, bootstrap}
mode        : selfcontained # {standalone, draft}
--- #ending

## 什么是标签?

---

On Windows, if my 'index.Rmd' file is encoded as UTF-8, the slidify function throws an error, and the message shows unrecognizable Chinese characters.

 > slidify('index.Rmd')


processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code


output file: index.md

Error in substring(string, start, end) : 
  invalid multibyte string at '<90>
<73>ubtitle    : 
author      : 浣曢€<9a>
job         : 璞嗙摚绠楁硶缁勫疄涔犵敓
framework   : io2012        # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js  # {highlight.js, prettify, highlight}
hitheme     : tomorrow      # 
widgets     : [bootstrap]            # {mathjax, quiz, bootstrap}
mode        : selfcontained # {standalone, draft}
--- #ending

## 浠€涔堟槸鏍囩锛<9f>

---

>- 璞嗙摚鐢靛奖涓殑鏍囩
  - ![](pics/what_is_folksonomy2.png)
>- 璞嗙摚闊充箰涓殑鏍囩
  - ![](pics/what_is_folksonomy3.png)
>- 璞嗙摚闃呰涓殑鏍囩
  - ![](pics/what_is_folksonomy4.png)

---
## 浠€涔堟槸鏍囩

>- 鐢ㄦ埛涓诲姩鐢熸垚
>- 瀵规枃瀛楀唴瀹逛笉鍔犻檺鍒
>- 鏄鐗╁搧鏈夌泭鐨勮ˉ鍏呰鏄庝俊鎭
>- 鑻辨枃閲岀О杩欐牱鐨勪笢瑗垮彨鍋<9a>**folksonomy**(folk+taxonomy)锛屽苟涓嶆槸*tag*

---

## 鏍囩鏃犲涓嶅湪

闄や簡璞嗙摚锛屽叾瀹炶繕鏈夊緢澶氬湴鏂瑰嚭鐜颁簡鏍囩锛<9a>

>- 鏂版氮寰崥涓殑鏍囩
  - ![](pics/

The error message clearly shows different characters from the ones in my file, and nobody could read the garbled version.
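
The garbled text in that error output is what UTF-8 bytes look like when they are decoded as GBK. A minimal sketch in R of that effect, assuming the local iconv build recognizes the name "GBK" (the variable names are only illustrative and do not come from slidify):

```r
# "豆瓣" written with \u escapes so this script itself stays pure ASCII.
original <- "\u8c46\u74e3"

# The six bytes that a UTF-8 encoded index.Rmd stores for those two characters.
utf8_bytes <- charToRaw(enc2utf8(original))

# Decode the same bytes as if they were GBK: each pair of bytes becomes one
# unrelated character, which is the kind of text seen in the error message.
garbled <- iconv(rawToChar(utf8_bytes), from = "GBK", to = "UTF-8")

# No bytes were lost: encoding the garbled text back to GBK restores them.
restored <- iconv(garbled, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]

print(utf8_bytes)
print(nchar(garbled))
print(identical(restored, utf8_bytes))
```

Because the bytes survive the round trip, the file on disk is intact; only the encoding assumed while reading it is wrong.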

If I switch the file to GBK, the slidify function works:

> slidify('index.Rmd')
processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code
output file: index.md
[1] "index.html"

But the HTML contains unrecognized characters:

[Image: improper HTML]

Compare with the proper version:

[Image: proper HTML]

@ramnathv
Owner

Can you print out your sessionInfo() so that I can see what versions of packages you are using?

@hetong007
Author

Here it is:

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Chinese_People's Republic of China.936 
[2] LC_CTYPE=Chinese_People's Republic of China.936   
[3] LC_MONETARY=Chinese_People's Republic of China.936
[4] LC_NUMERIC=C                                      
[5] LC_TIME=Chinese_People's Republic of China.936    
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
loaded via a namespace (and not attached):
[1] tools_3.0.2

And the result was generated with slidify 0.3.3

@ramnathv
Owner

It seems to work with the latest version of slidify. I checked online using the slidify playground at http://slidify.github.io/playground. Make sure to remove the line with mode before you paste it to the playground.

You can install the latest version of slidify and slidifyLibraries by running

devtools::install_github(c('slidify', 'slidifyLibraries'), 'ramnathv')

Before you slidify your deck, make sure to delete the libraries folder in your slide deck directory.

@hetong007
Author

I hit the same problem after installing the latest version with your code.

Since Linux/OS X handles Chinese without trouble, I guess it is not surprising that the slidify playground works.

But is the slidify playground running in a Windows environment? I suspect that the way UTF-8 and GBK are handled there is the main problem.

@ramnathv
Owner

You are right. I believe the issue is a combination of Windows + Encoding. Let me see if I can test under Windows and get back on this.

@hetong007
Author

Most Chinese users are suffering from it because Windows is still the most popular OS in China. A lot of users would benefit from fixing this issue :)

@ramnathv
Owner

Can you try this, @hetong007? It runs index.Rmd through knitr directly before passing it on to slidify. This solution has fixed some encoding problems before, and I want to check whether it has any effect on this one.

slidify(knit("index.Rmd", encoding = 'GBK'), knit_deck = FALSE)

@hetong007
Author

I used that code on the GBK file. The result remains exactly the same.

I also tried slidify(knit("index.Rmd", encoding = 'UTF8'), knit_deck = FALSE) on the UTF8 version. Not working either.

@ramnathv
Owner

Okay. Let me try to isolate the problem here. If you run knit2html on your Rmd file, are the characters displayed correctly? Let us first make it work with knitr and then focus on getting slidify to work with it.

@hetong007
Author

knit2html is not working correctly under Windows. I got error messages.

This is what I got from running it on the GBK version:

> knit2html('index.Rmd')


processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code


output file: index.md

Error in sub("#!r_highlight#", highlight, html, fixed = TRUE) : 
  invalid multibyte string at '<9f><<2f>title>

#!r_highlight#

#!mathjax#

<style type="text/css">
body, td {
   font-family: sans-serif;
   background-color: white;
   font-size: 12px;
   margin: 8px;
}

tt, code, pre {
   font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}

h1 { 
   font-size:2.2em; 
}

h2 { 
   font-size:1.8em; 
}

h3 { 
   font-size:1.4em; 
}

h4 { 
   font-size:1.0em; 
}

h5 { 
   font-size:0.9em; 
}

h6 { 
   font-size:0.8em; 
}

a:visited {
   color: rgb(50%, 0%, 50%);
}

pre {   
   margin-top: 0;
   max-width: 95%;
   border: 1px solid #ccc;
   white-space: pre-wrap;
}

pre code {
   display: block; padding: 0.5em;
}

code.r, code.cpp {
   background-color: #F8F8F8;
}

table, td, th {
  border: none;
}

blockquote {
   color:#666666;
   margin:0;
   padding-left: 1em;
   border-left: 0.5em #EEE solid;
}

hr {
   height: 0px;
   border-bottom: none;
   border-top-width: thin;
   border-top-style: dotted;

This is what I got from running it on the UTF8 version:

> knit2html('index.Rmd')


processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code


output file: index.md

Error in substring(u, so, so + ml - 1L) : 
  invalid multibyte string at '<9f><<2f>h2>

<hr/>

<blockquote>
<ul>
<li>璞嗙摚鐢靛奖涓殑鏍囩

<ul>
<li><img src="pics/what_is_folksonomy2.png" alt=""/></li>
</ul></li>
<li>璞嗙摚闊充箰涓殑鏍囩

<ul>
<li><img src="pics/what_is_folksonomy3.png" alt=""/></li>
</ul></li>
<li>璞嗙摚闃呰涓殑鏍囩

<ul>
<li><img src="pics/what_is_folksonomy4.png" alt=""/></li>
</ul></li>
</ul>
</blockquote>

<hr/>

<h2>浠€涔堟槸鏍囩</h2>

<blockquote>
<ul>
<li>鐢ㄦ埛涓诲姩鐢熸垚</li>
<li>瀵规枃瀛楀唴瀹逛笉鍔犻檺鍒<b6></li>
<li>鏄鐗╁搧鏈夌泭鐨勮ˉ鍏呰鏄庝俊鎭<af></li>
<li>鑻辨枃閲岀О杩欐牱鐨勪笢瑗垮彨鍋<9a><strong>folksonomy</strong>(folk+taxonomy)锛屽苟涓嶆槸<em>tag</em></li>
</ul>
</blockquote>

<hr/>

<h2>鏍囩鏃犲涓嶅湪</h2>

<p>闄や簡璞嗙摚锛屽叾瀹炶繕鏈夊緢澶氬湴鏂瑰嚭鐜颁簡鏍囩锛<9a></p>

<blockquote>
<ul>
<li>鏂版氮寰崥涓殑鏍囩

<ul>
<li><img src="pics/folksonomy_is_everywhere1.png" alt=""/></li>
</ul></li>
<li>缁熻涔嬮兘涓殑鏍囩

<ul>
<li><img src="pic

@ramnathv
Owner

You need to explicitly pass the encoding to knit2html using knit2html('index.Rmd', encoding = "GBK").

@hetong007
Author

Sorry, but the result still remains the same :(

@ramnathv
Owner

Okay. Can you save your Rmd file and provide me a link to it? Don't copy paste it as I want to ensure that it is saved with the correct encoding. Since you are having trouble using knit2html as well, @yihui may have some idea as to what might be messing things up. Also print your sessionInfo() so that we know the versions of all packages that were loaded in your R Console.

@hetong007
Author

@yihui is not a Windows user, maybe he chose to ignore those errors before :(

Here is a repo I just created with the Rmd files index-GBK.Rmd and index-UTF8.Rmd. Also, sessionInfo.txt has the result from sessionInfo().

@ramnathv
Owner

Well knitr has lots of Windows users and I have seen @yihui do a lot of encoding related work. If there is an R expert on encoding, my money will be on @yihui :)

@hetong007
Author

Chinese programmers suffer from encoding-related problems every day. Thank you and good luck! :)

@yihui

yihui commented Jan 22, 2014

I think I know what the problem is, but it will take me a while to find out where the character encoding got messed up. The encoding of this page https://github.com/hetong007/temp_files/blob/master/index-GBK.html is not UTF-8, but it contains the declaration <meta charset="utf-8">, which is wrong. In fact, the page contains characters in different encodings: some are UTF-8 and some are GBK. The problem might lie in slidify, slidifyLibraries, whisker, or markdown.
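
One way to locate the mixed-encoding lines described here is to check each line of the generated file for valid UTF-8. A rough sketch in R, assuming R 3.3.0 or later for validUTF8() (newer than the R 3.0.2 used in this thread); find_non_utf8_lines is a hypothetical helper, not part of knitr, markdown, or slidify:

```r
# Return the numbers of the lines in `path` whose bytes are not valid UTF-8.
# Hypothetical helper for illustration only.
find_non_utf8_lines <- function(path) {
  # The default encoding = "unknown" keeps the bytes exactly as stored on disk.
  lines <- readLines(path, warn = FALSE)
  which(!validUTF8(lines))
}

# Demo: build a two-line file whose first line is UTF-8 and second line is GBK.
text <- "\u8c46\u74e3"
utf8_line <- charToRaw(enc2utf8(text))
gbk_line <- iconv(text, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
demo <- tempfile(fileext = ".html")
writeBin(c(utf8_line, as.raw(0x0a), gbk_line, as.raw(0x0a)), demo)

bad_lines <- find_non_utf8_lines(demo)
print(bad_lines)
```

Lines reported by such a check are candidates for text that reached the output without being converted to the declared charset.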

@hetong007 I rarely use Windows myself, but that does not mean I do not care about Windows users :)

@ramnathv
Owner

@yihui, I understand why slidify fails on this file. The <meta charset="utf-8"> comes from the slidifyLibraries template for the io2012 library, and can be fixed by modifying this line in the libraries folder.

The failure of knit2html is possibly explained either by the mixed encoding or by the utf-8 encoding specified in the default template:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

I am thinking @hetong007 needs to convert the entire document to either GBK or UTF-8, and then modify the template if he is using GBK. Does that sound about right, @yihui? Thanks for taking a look at this.
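
Converting a whole file to a single encoding can be done at the byte level, independent of the current locale. A minimal sketch in R, assuming the file is entirely GBK and that the local iconv build recognizes "GBK"; convert_to_utf8 is a hypothetical helper, not a slidify or knitr function:

```r
# Hypothetical helper: rewrite a whole file from one encoding to UTF-8.
convert_to_utf8 <- function(input, output, from = "GBK") {
  # Read the raw bytes so that no locale-dependent decoding happens.
  bytes <- readBin(input, what = "raw", n = file.info(input)$size)
  text <- iconv(rawToChar(bytes), from = from, to = "UTF-8")
  # iconv() returns NA when the bytes are not valid in the `from` encoding.
  if (is.na(text)) stop("input is not valid ", from)
  writeBin(charToRaw(text), output)
  invisible(output)
}

# Demo: write a small GBK file, convert it, and compare with the UTF-8 bytes.
original <- "\u8c46\u74e3\n"
gbk_file <- tempfile(fileext = ".Rmd")
utf8_file <- tempfile(fileext = ".Rmd")
writeBin(iconv(original, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]], gbk_file)
convert_to_utf8(gbk_file, utf8_file)

converted <- readBin(utf8_file, what = "raw", n = file.info(utf8_file)$size)
print(identical(converted, charToRaw(enc2utf8(original))))
```

With the document in UTF-8, the charset=utf-8 declaration in the template is accurate; keeping the document in GBK would instead require changing the template, as noted above.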

@yihui

yihui commented Mar 11, 2014

I'll take a look at @kohske's PR rstudio/markdown#49 and rstudio/markdown#50. The problem should be at least alleviated after the encoding problem is gone in the markdown package, although there are still other places that may have to be fixed.

@ramnathv
Owner

Thanks @yihui. I will look forward to these fixes. I presume that these issues do not exist with rmarkdown, or is encoding handling still going to be tricky?

@kohske

kohske commented Mar 11, 2014

FYI, here is the encoding fix for markdown, slidify, and knitrBootstrap.
I hope someone else will also test this and confirm that it does not break any existing code.

Below are the test script and markdown files:
http://kohske.github.io/sandbox/knit-encode.zip


@kohske

kohske commented Mar 11, 2014

I tested the UTF8 file including GBK characters (below) and slidify works perfectly on Windows!!
https://github.com/hetong007/Douban_Folksonomy/blob/master/index.Rmd

Note that before running slidify, you need to change the locale's code page to 936.

@ramnathv
Owner

Thanks @kohske. This is a really significant contribution as it opens up things for a large group of users. I will run through the tests and merge this weekend. Can you add yourself as a contributor in the DESCRIPTION file?

@hetong007
Author

@kohske Thanks, this solution works perfectly on my Windows XP!

However, the framework of the generated slides is not the same as before, i.e. io2012 is not applied to the generated file. Is this caused by the dev version of slidify, @ramnathv?

@ramnathv
Owner

Are you using RStudio? If yes, what version? If you can paste a screenshot of the output you get, that would be useful for me to figure out what might be going on.

@kohske

kohske commented Mar 12, 2014

@ramnathv Okay, thanks. Note that MBCS-compatible slidify requires MBCS-compatible markdown package.

@hetong007
Author

@kohske After running install_github("kohske/knitrBootstrap@fix/encode", quick=TRUE), I get a warning saying package ‘’ is not available (for R version 3.0.2). The name of the 'missing' package is empty. Is it a tiny bug, or did I just miss something? Thank you.

@kohske

kohske commented Mar 12, 2014

@hetong007 This is due to the DESCRIPTION of knitrBootstrap: "R (> 3.0.0)," (with a trailing comma) should be "R (> 3.0.0)". Please just ignore the warning. Thanks for your test and report!!

@hetong007
Author

@ramnathv I am using the newest RStudio, i.e. 0.98.692, under dev_mode(), and I am generating the HTML file with only the pics folder and the index.Rmd file from the original repository.

The output information is

d> slidify("Douban_Folksonomy-master/index.Rmd", encoding="UTF8")
processing file: index.Rmd
  |.................................................................| 100%
  ordinary text without R code
output file: index.md
Copying files to libraries/frameworks/io2012...
Copying files to libraries/highlighters/highlight.js...
Copying files to libraries/widgets/bootstrap...
Warning messages:
1: In readLines(con, ...) : incomplete final line found on 'index.Rmd'
2: In readLines(con, ...) : incomplete final line found on 'index.Rmd'

And the first page looks like

[Image: io2012 not working]

The second page looks like

[Image: io2012 not working 2]

Compared with this original version, the difference is easy to see.

@kohske

kohske commented Mar 12, 2014

@hetong007 The libraries in the original repository are evidently quite old. Your output matches what the newer version generates under Mac OS X.

@ramnathv
Owner

@kohske is right. I updated the default stylesheets for io2012, adding the bottle green background in the title slide and the blue color for slide titles. You can always modify it, if you prefer a different appearance of the slides.

@hetong007
Author

@ramnathv @kohske Thanks for pointing that out. Then I would say Chinese users (maybe including Japanese and other users as well) will enjoy slidify in Windows! Thanks :)

@ramnathv
Owner

Thanks to @kohske for so diligently plugging away on this. Encoding issues are not the most pleasant ones to be working on, but are so critical. I will try to merge this pull request this weekend, after ensuring that it doesn't break any other features of slidify. @kohske, please add yourself as a contributor in the DESCRIPTION!

@kohske

kohske commented Mar 12, 2014

@ramnathv I did it, thanks.

@ramnathv
Owner

Thanks to @kohske, I just merged in some changes that provide for better encoding support. You can install it from the fix-encode branch.

library(devtools)
install_github("ramnathv/slidify@fix-encode")

Can you install it and test if it solves the encoding issues you had mentioned here?

@hetong007
Author

This fixes everything on my system. Note that I am now using Win 7 instead of Win XP; I hope that doesn't matter.

I created two Rmd files in GB2312 and UTF8 respectively, and ran the following code:

library(devtools)
install_github("ramnathv/slidify@fix-encode")

# setwd(...)

require(slidify)

slidify('index.Rmd', encoding='CP936')
slidify('index-UTF8.Rmd', encoding='UTF8')

The result is great.

Thank you @ramnathv and @kohske

@kohske

kohske commented Jun 16, 2014

Thanks @ramnathv, everything works perfectly with Japanese_Japan.CP932 and UTF8 under Win7.

@suensummit

Thanks for all your efforts! @hetong007 @ramnathv @kohske
This patch works well with Traditional Chinese under Win8 (with encoding UTF8) as well, great job done!

@ramnathv
Owner

All credit should go to @kohske for painstakingly working on fixing encoding related issues.

@yihui

yihui commented Oct 22, 2014

Is the fix-encode branch ready to be merged, then?

@ramnathv
Owner

Yes. I will be merging it this weekend, when I will be working on slidify.
