Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The 'd' in [red] is interpreted as a date format #388

Closed
nacnudus opened this issue Sep 24, 2017 · 3 comments · Fixed by #559
Closed

The 'd' in [red] is interpreted as a date format #388

nacnudus opened this issue Sep 24, 2017 · 3 comments · Fixed by #559
Labels
bug an unexpected problem or unintended behavior formats 💅

Comments

@nacnudus
Copy link
Contributor

nacnudus commented Sep 24, 2017

readxl/src/ColSpec.h

Lines 145 to 148 in 0d9ad4f

// TO FIX? So far no bug reports due to this.
// Logic below is too simple. For example, it deems this format string a date:
// "$"#,##0_);[Red]\("$"#,##0\)
// because of the `d` in `[Red]`

That was too tempting for somebody, who created this beauty (see MedicareAndLevies!Q:Q ):

How to fix?

// Ideally this can wait until we are using something like

... like the awful hack in tidyxl? 😉

// Adapted from hadley/readxl
bool styles::isDateFormat(std::string formatCode) {
  for (size_t i = 0; i < formatCode.size(); ++i) {
    switch (formatCode[i]) {
      case 'd':
        // Might be as in "[Red]"
        if (i < formatCode.size() - 1) {
          // there's at least one more character
          if (formatCode[i+1] == ']') {
            break;
          } else {
            return true;
          }
        }

Reprex:

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
library(readxl)

source_url <- "https://github.com/tidyverse/readxl/files/1327821/CPS.v17-09-12.xlsx"
filename <- "CPS v17-09-12.xlsx"
sheetname <- "MedicareAndLevies"

download.file(source_url, filename, mode = "wb")

x <- read_xlsx(filename, sheetname, skip = 6)
class(x$MedLevSurRate2)
#> [1] "POSIXct" "POSIXt"

The culpable number formats:

library(tidyxl)
y <- tidy_xlsx(filename, sheetname) # Slow. I wish I could call Jim Hester for advice.

numFmt <- tibble(numFmt = y$formats$local$numFmt,
                 id = seq_along(y$formats$local$numFmt))

y$data[[1]] %>%
  filter(row >= 8, col == 17) %>%
  left_join(numFmt, by = c("local_format_id" = "id")) %>%
  distinct(numeric, numFmt)
#> # A tibble: 3 x 2
#>   numeric                               numFmt
#>     <dbl>                                <chr>
#> 1  0.0100     "#,##0.00_ ;[Red]\\-#,##0.00\\ "
#> 2  0.0125 "#,##0.0000_ ;[Red]\\-#,##0.0000\\ "
#> 3      NA                              General

Kudos to @HughParsonage for discovering this file and daring to import it into R.

@reviewher
Copy link

@nacnudus I hate to be that person, but there are other colors to consider. The valid colors are Black, Green, White, Blue, Magenta, Yellow, Cyan, and Red, and they can appear with lowercase characters. The other cases are fairly obvious: m is in [magenta]General, y is in [Cyan]0.00

https://jsfiddle.net/w7veddw9/1/ will generate a file rainbow.xlsx with the lowercase color names, if you would like to play with it. It fills cells A1-A8 with the named colors, cells B1-B8 with the 0-based numeric index, and sets the format to the appropriate color. If you specifically need an XLSB file, https://jsfiddle.net/w7veddw9/2/ writes the same file in the other format:

Other cases to consider:

  • e is also a type of year! If you want to see that, try setting a cell to format eeee-mm-dd.
  • you can embed literal strings and escape characters in format strings! For example, 0" dollars" and 0\ \d\o\l\l\a\r\s are valid formats that display a number followed by the string.
  • square brackets can be escaped or nested in string literals, e.g. 0" [Red]"

If you really want a fun case, the format "[R"ed"]" with number 12345 is rendered as [R193318]

Putting that all together, the correct date check would look something like:

/* TODO: actually validate that the format is valid */
#define CASEI(c) case c: case c | 0x20
#define CMPLC(i,n) if(x[i+i] | 0x20 == n)
inline bool isDateFormat(std::string x) {
  char escaped = 0, bracket = 0;
  for (size_t i = 0; i < x.size(); ++i) switch (x[i]) {
    CASEI('D'):
    CASEI('E'):
    CASEI('H'):
    CASEI('M'):
    CASEI('S'):
    CASEI('Y'):
      if(!escaped && !bracket) return true;
      break;
    case '"':
      escaped = 1 - escaped; break;
    case '\\': ++i; break;
    case '[': if(!escaped) bracket = 1; break;
    case ']': if(!escaped) bracket = 0; break;
    CASEI('G'):
      if(i + 6 < x.size())
      CMPLC(1,'e')
      CMPLC(2,'n')
      CMPLC(3,'e')
      CMPLC(4,'r')
      CMPLC(5,'a')
      CMPLC(6,'l')
        return false;
  }
  return false;
}
#undef CMPLC
#undef CASEI

@nacnudus
Copy link
Contributor Author

@reviewher That is really helpful, thank you for being 'that person'! I didn't know about all-lowercase names, but to miss Cyan and White was pretty dumb.

nacnudus added a commit to nacnudus/tidyxl that referenced this issue Oct 17, 2017
It was too crude before.  Thanks to @reviewher
tidyverse/readxl#388 for providing info and code.
@nacnudus
Copy link
Contributor Author

It turns out underscores should be treated as escape characters too.

See also ECMA-376-1:2016 page 1785 (actual page 1795) section 18.8.31 "numFmts (Number Formats)"

Skips the width of the next character. This is useful for lining up negative and positive values in different cells of the same column. [Example: The number format _(0.0_);(0.0) aligns the numbers 2.3 and -4.5 in the column even though the negative number is enclosed by parentheses. end example]

    case '\\':
    case '_':
      ++i;
      break;

@jennybc jennybc added the bug an unexpected problem or unintended behavior label Apr 15, 2018
nacnudus added a commit to nacnudus/readxl that referenced this issue Mar 9, 2019
nacnudus added a commit to nacnudus/readxl that referenced this issue Mar 9, 2019
nacnudus added a commit to nacnudus/readxl that referenced this issue Mar 9, 2019
jennybc added a commit that referenced this issue Mar 23, 2022
* Don't misinterpret colors as dates (fixes #388)

* Incorporate some updates

* Add a test; tweak some tests

Add a test and test file based on a format where the confusion can come via currency, not colour.

Link to issues.

We generally incorporate `-xls` or `-xlsx` into test file names, so the format is obvious even if the file is open in Excel.

Added the actual implicit number formats in comments.

Co-authored-by: Jenny Bryan <jenny.f.bryan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior formats 💅
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants