Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 in/out #118

Merged
merged 14 commits into from
Nov 23, 2020
Merged

UTF-8 in/out #118

merged 14 commits into from
Nov 23, 2020

Conversation

shrektan
Copy link
Contributor

@shrektan shrektan commented Nov 18, 2020

Fixes #103

Hi, first of all, thanks for writing this fantasic package.

As a Chinese user, I often need to read / write Excel sheets contain Chinese letters on Windows, where the encoding issue is a huge headache. According to my test, there're at least two known issues::

  1. When the comment contains non-ASCII letters, the comment will be garbaged
  2. When the Excel table's name contains non-ASCII letters, the name will be garbaged

These two issues are difficult to avoid because by default, Excel will create locale language names for the comments (the author name) and Excel table's names.

And these two issues cause more trouble when I use such an Excel as a template. Resaving the excel object modified by openxlsx will malform the Excel files. I mean, when I open the Excel files that write by openxlsx, Excel will alarm that some contents are illegal.

The root cause of this is that R can't use UTF-8 as the native encoding on Windows as the time of writing (luckily, this may be available in the near future). So, the only reliable cure for this headache for now is to use UTF-8 encoded strings whenever we can.

This PR tries to do the two things:

  1. Ensure all the values of C++ functions that return a character vector be marked as UTF-8 encoded.
  2. Defined a helper function readUTF8() to ensure that we read the xml as UTF-8 encoded.

Despite it looks like lots of lines has been changed, I believe the PR is simple, if you read the commit in sequence.

Thanks!

TODO

  • more manual tests on write/read: non-ASCII range names, sheet names, table names, formula
  • add unit tests
  • add NEWS

Think & Future

We should check all the C++ function's that used in R code and change the argument type from std::string to SEXP. Only so can we convert the strings to UTF-8 encoded later on the Cpp side, thus guarantee our policy UTF-8 in / out in everywhere. The reason is that we can only know the encoding info in SEXP but not std::string. Another possible solution is to see if there's some configs in Rcpp so that it can ensure the auto-SEXP-to-std::string ends up with a UTF-8 encoded string.

Anyway, I think we'd better not make this PR further complicate and leave this task to the future PRs.

@codecov
Copy link

codecov bot commented Nov 18, 2020

Codecov Report

Merging #118 (2b6b5ba) into master (44014fa) will increase coverage by 2.64%.
The diff coverage is 93.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
+ Coverage   59.68%   62.33%   +2.64%     
==========================================
  Files          30       30              
  Lines        8729     8740      +11     
==========================================
+ Hits         5210     5448     +238     
+ Misses       3519     3292     -227     
Impacted Files Coverage Δ
R/writeData.R 77.41% <ø> (+11.17%) ⬆️
R/WorkbookClass.R 53.63% <66.66%> (+2.57%) ⬆️
src/helper_functions.cpp 74.09% <88.88%> (+0.84%) ⬆️
R/loadWorkbook.R 82.24% <92.30%> (+11.62%) ⬆️
R/helperFunctions.R 82.35% <100.00%> (+2.74%) ⬆️
R/readWorkbook.R 86.24% <100.00%> (ø)
R/wrappers.R 49.32% <100.00%> (+0.28%) ⬆️
src/load_workbook.cpp 94.64% <100.00%> (+0.21%) ⬆️
src/write_file.cpp 90.07% <100.00%> (+1.49%) ⬆️
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44014fa...2b6b5ba. Read the comment docs.

Copy link
Owner

@ycphs ycphs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the function would also fit into the helper.R file

Copy link
Owner

@ycphs ycphs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am currently planning to replace all the readLines with an equivalent in cpp. That should boost the performance and the UTF8 encoding can be solved at the same time.

@ycphs
Copy link
Owner

ycphs commented Nov 18, 2020

Thank you for your work. This will improve the package a lot.

@shrektan shrektan marked this pull request as ready for review November 21, 2020 13:59
@shrektan shrektan requested a review from ycphs November 21, 2020 16:04
@ycphs ycphs merged commit 4dfbf55 into ycphs:master Nov 23, 2020
@shrektan shrektan deleted the shrektan/encoding branch November 23, 2020 16:48
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Sep 8, 2021
# development  openxlsx 4.2.4

## Fixes

* `Write.xlsx()` now successfully passes `withFilter`
  ([#151](ycphs/openxlsx#151))
* code clean up PR [#168](ycphs/openxlsx#168)
* removal of unused variables PR
  [#168](ycphs/openxlsx#168)

## New features

* adds `buildWorkbook()` to generate a `Workbook` object from a
  (named) list or a data.frame
  ([#192](ycphs/openxlsx#192),
  [#187](ycphs/openxlsx#187))
  * this is now recommended rather than the `write.xlsx(x, file) ; wb
    <- read.xlsx(file)` functionality before
  * `write.xlsx()` is now a wrapper for `wb <- buildWorkbook(x);
    saveWorkbook(x, file)`
  * parameter checking from `write.xlsx()` >> `buildWorkbook()` are
    now held off until passed to `writeData()`, `writeDataTable()`,
    etc
  * `row.names` is now deprecated for `writeData()` and
    `writeDataTable()`; please use `rowNames` instead
* `read.xlsx()` now checks for the file extension `.xlsx`; previously
  it would throw an error when the file was `.xls` or `.xlm` files

* memory allocation improvements
* global options added for `minWidth` and `maxWidth`

* `write.xlsx()` >> `buildWorkbook()` can now handle `colWidths`
  passed as either a single element or a `list()`

* Added ability to change positioning of summary columns and rows.
  * These can be set with the `summaryCol` and `summaryRow` arguments
    in `pageSetup()`.

* `activeSheet` allows to set and get the active (displayed) sheet of a worbook.

* Adds new global options for workbook formatting
  ([#165](ycphs/openxlsx#165); see
  `?op.openxlsx`)


# openxlsx 4.2.3

## New Features

* Most of functions in openxlsx now support non-ASCII arguments
  better. More specifically, we can use non-ASCII strings as names or
  contents for `createNamedRegion()`
  ([#103](ycphs/openxlsx#103)),
  `writeComment()`, `writeData()`, `writeDataTable()` and
  `writeFormula()`. In addition, openxlsx now reads comments and
  region names that contain non-ASCII strings correctly on
  Windows. Thanks to @shrektan for the PR
  [#118](ycphs/openxlsx#118).

* `setColWidths()` now supports zero-length `cols`, which is
  convinient when `cols` is dynamically provided
  [#128](ycphs/openxlsx#128). Thanks to
  @shrektan for the feature request and the PR.

## Fixes for Check issues

* Fix to pass the tests for link-time optimization type mismatches

* Fix to pass the checks of native code (C/C++) based on static code
  analysis

## Bug Fixes

* Grouping columns after setting widths no longer throws an error
  ([#100](ycphs/openxlsx#100))

* Fix inability to save workbook more than once
  ([#106](ycphs/openxlsx#106))

* Fix `loadWorkbook()` sometimes importing incorrect column attributes

# openxlsx 4.2.2

## New Features

* Added features for `conditionalFormatting` to support also 'contains
  not', 'begins with' and 'ends with'

* Added return value for `saveWorkbook()` the default value for
  `returnValue` is `FALSE`
  ([#71](ycphs/openxlsx#71))

* Added Tests for new parameter of `saveWorkbook()`

## Bug Fixes

* Solved CRAN check errors based on the change disussed in
  [PR#17277](https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17277)

# openxlsx 4.2.0

## New Features

* Added `groupColumns()`, `groupRows()`, `ungroupColumns()`, and
  `ungroupRows()` to group/ugroup columns/rows
  ([#32](ycphs/openxlsx#32))

## Bug Fixes

* Allow xml-sensitve characters in sheetnames
  ([#78](ycphs/openxlsx#78))

## Internal

* Updated roxygen2 to 7.1.1

# openxlsx 4.1.5.1

## Bug Fixes

*  fixed issue [#68](ycphs/openxlsx#68])

# openxlsx 4.1.5

## New Features

*  Add functions to get and set the creator of the xlsx file

*  add function to set the name of the user who last modified the xlsx file

## Bug Fixes

*  Fixed NEWS hyperlink

*  Fixed writing of mixed EST/EDT datetimes

* Added description for `writeFormula()` to use only english function
   names

*  Fixed validateSheet for special characters

## Internal

*  applied the tidyverse-style to the package `styler::style_pkg()`

*  include tests for `cloneWorksheet`

# openxlsx 4.1.4

## New Features

* Added `getCellRefs()` as
   function. [#7](ycphs/openxlsx#7)

*  Added parameter for customizing na.strings

## Bug Fixes

*  Use `zip::zipr()` instead of `zip::zip()`.

* Keep correct visibility option for
   loadWorkbook. [#12](ycphs/openxlsx#12])

* Add space surrounding "wrapText"
   [#17](ycphs/openxlsx#17)

* Corrected Percentage, Accounting, Comma, Currency class on column
   level


*  update to rogygen2 7.0.0

# openxlsx 4.1.3

## New Features

*  Added a `NEWS.md` file to track changes to the package.
*  Added `pkgdown` to create site.

## Bug Fixes

*  Return values for cpp changed to R_NilValue for r-devel tests

*  Added empty lines at the end of files

# openxlsx 4.1.2

*  Changed maintainer

# openxlsx 4.1.1

## New Features

* `sep.names` allows choose other separator than '.' for variable
   names with a blank inside

* Improve handling of non-region names in `getNamedRegions` and add
   related test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Encoding issue
2 participants