New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't load empty cells; use empirical worksheet dimensions #248

Merged
merged 5 commits into from Feb 6, 2017

Conversation

Projects
None yet
2 participants
@jennybc
Member

jennybc commented Feb 6, 2017

There can be cells that look empty to the human eye, but that are styled, e.g. someone applied a custom number format to the cell at some point. I propose we do not load those anymore. There's nothing to learn from their XML.

If such a cell still falls into the target rectangle, it will simply be NA in the output (ok, won't be totally true until we stop dropping unnamed, empty columns #157).

If such a cell falls outside the target rectangle, this PR prevents the creation of trailing row(s) or column(s) consisting entirely of NA.

In addition to rows/columns of NA, this has been a puzzling source of errors about column names and types not being compatible.

To fully fix this, you also can't trust a worksheet's declared dimensions, because these cells "count". Therefore I propose we always compute dimension ourselves.

I haven't dealt with xls yet. I can leave this open and work on that. Or we could resolve this for xlsx and I'd open an issue to fix it for xls.

This branch builds off #247. I'm assuming it will be merged first.

@jennybc jennybc requested a review from hadley Feb 6, 2017

@hadley

LGTM. Just a few tiny niggles

XlsxCell xcell(cell, i, j);
cells_.push_back(xcell);
// don't load a cell with no child nodes, e.g. it only has style
rapidxml::xml_node<>* first_child = cell->first_node(0);

This comment has been minimized.

@hadley

hadley Feb 6, 2017

Member

I think the check here probably eliminates a check somewhere else.

This comment has been minimized.

@jennybc

jennybc Feb 6, 2017

Member

There's not an obvious one, but I'll keep my eyes peeled.

Styled empty cells were a major source of "cell lacks the t attribute" here, which is why they always lead to numeric NA columns. But it's not clear they are the only source? When I get to column types, I'll try to settle this definitively.

readxl/src/XlsxCell.h

Lines 122 to 124 in a78440d

rapidxml::xml_attribute<>* t = cell_->first_attribute("t");
if (t == NULL || strncmp(t->value(), "n", 5) == 0) {

This comment has been minimized.

@jennybc

jennybc Feb 6, 2017

Member

Similar situation with this check for a cell having a v node. But I'm not sure if these cells are the only source of this either?

readxl/src/XlsxCell.h

Lines 102 to 104 in a78440d

rapidxml::xml_node<>* v = cell_->first_node("v");
if (v == NULL)
return NA_STRING;

## in a trailing empty column
## in some trailing rows
out <- read_excel(test_sheet("style-only-cells.xlsx"))
expect_equal(

This comment has been minimized.

@hadley

hadley Feb 6, 2017

Member

I think code like this is slightly cleaner if you define the tibble outside the test

This comment has been minimized.

@jennybc
test_that("user-supplied column names play nicely with empty columns", {
skip("waiting for dust to settle re: treatment of empty columns")
## do stuff like this:
out <- read_excel(test_sheet("style-only-cells.xlsx"),

This comment has been minimized.

@hadley

hadley Feb 6, 2017

Member

House style is

read_excel(
  test_sheet("style-only-cells.xlsx"),
  col_names = LETTERS[1:4]
)

This comment has been minimized.

@jennybc

@jennybc jennybc merged commit c6e0f8f into tidyverse:master Feb 6, 2017

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details

@jennybc jennybc deleted the jennybc:format-only-cells branch Feb 9, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment