New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load cells at xlsx worksheet ingest; handle skipping and/or blank row… #240

Merged
merged 12 commits into from Jan 31, 2017

Conversation

Projects
None yet
3 participants
@jennybc
Member

jennybc commented Jan 30, 2017

…s; fixes #224, fixes #163

Partially closes #222 (skipping), which will remain open to cover re: xls. Other linked issues are mentioned in NEWS and commit messages.

@@ -142,6 +142,7 @@ inline Rcpp::RObject makeCol(CellType type, int n) {
switch(type) {
case CELL_BLANK:
return R_NilValue;
break;

This comment has been minimized.

@hadley

This comment has been minimized.

@jimhester

jimhester Jan 31, 2017

Member

These breaks aren't actually needed, if the function is going to return first you don't need to break out of the switch.

This comment has been minimized.

@hadley

hadley Jan 31, 2017

Member

Either way it seems like we should consistently use or not use in a given switch.

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

I wondered why there had been no suffering due this missing break. I will leave it because at least it's now all consistent.

@@ -47,6 +47,11 @@ List read_xlsx_(std::string path, int sheet, RObject col_names,
XlsxWorkSheet ws(path, sheet);
if (ws.nrow() == 0 && ws.ncol() == 0) {
Rcpp::List empty(0);
return empty;

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

You could just do return Rcpp::List(0)

(And I assume this gets tibble-ised elsewhere?)

This comment has been minimized.

@jennybc

jennybc Jan 30, 2017

Member

Aha! And YES re: tibble-isation.

@@ -58,7 +63,7 @@ List read_xlsx_(std::string path, int sheet, RObject col_names,
{
sheetHasColumnNames = as<bool>(col_names);
if (sheetHasColumnNames) {
colNames = ws.colNames(nskip);
colNames = ws.colNames(std::max(ws.min_row(), nskip));

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

What's the logic here?

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

Since you use it in a few places, maybe assign to an informatively named variable?

This comment has been minimized.

@jennybc

jennybc Jan 30, 2017

Member

I will see if I can remove or generalize min_row_ now.

I think it might be wise to record info about where data starts (after advancing past nskip) and where the next row starts (assuming we'll need to advance past column names). Because yes I currently do this in at least two different places with the same logic.

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

Maybe call it first_row_? I think that's a bit more descriptive.

And first_data_row_? Or first_data_cell_

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

At the time of worksheet construction, right after dimensions are cached, I now mark the first cell with declared row >= nskip and the first cell with declared row >= to that of the former. Then all subsequent functions (column names, column types, column reading) just consult that.

@@ -72,11 +77,16 @@ List read_xlsx_(std::string path, int sheet, RObject col_names,
Rcpp::stop("`col_names` must be a logical or character vector");
}
// if we know there's no data reading to be done, we should return a
// 0 row tibble HERE with the correct colNames
// TO THINK: case where colTypes were explicitly specified but no data to read

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

I think you could return a data frame with a 0-length vector for each col, but it's not necessary and few people will notice the omission

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

Done.

@@ -16,11 +16,14 @@
class XlsxWorkSheet {
XlsxWorkBook wb_;
// JB QUESTION: seems like a worksheet should know its own name (and/or
// position). This could be used in messaging, such as, "SHEET has no data"

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

Yes, I think that's a good idea

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

Still need to do this.

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

Done.

return min_row_;
}
// JB: this seems to be leftover from previous development?

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

I think you can/should kill it now that we have loadCells()

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

deleted this

if (xcell.col() >= ncol_)
continue;
// go to row nskip or, if impossible, the next row that exists
std::vector<XlsxCell>::const_iterator it = getNextRow(nskip);

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

I think this should get moved out too. Maybe loadCells() should record which is the first non-blank cell that it sees?

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

And maybe also record the first cell after all the skipping?

This comment has been minimized.

@jennybc

jennybc Jan 30, 2017

Member

Agreed and acknowledged in above comment.

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

Deleted, in favour of marking the cells to start reading from at the time of ingest.

int base = it->row();
// advance past column names and any embedded blank rows
if (has_col_names) {
it = getNextRow(base + 1);

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

It seems a bit inefficient to start again from scratch to find the next row. Maybe getNextRow() should take a std::vector<XlsxCell>::const_iterator and a number of rows to advance?

This comment has been minimized.

@jennybc

jennybc Jan 30, 2017

Member

But you don't know how many rows to advance. However, I think this will all go away anyway, once I record these "milestones" exactly once at the start of worksheet ingest.

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

I do this row identification exactly once now and then just work with the iterators.

if (xcell.col() >= ncol_)
continue;
// go to row nskip or, if impossible, the next row that has data
std::vector<XlsxCell>::const_iterator it = getNextRow(nskip);

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

So does nskip include or exclude completely blank rows?

// JB: it feels like I should reveal sheet ?name? here
// currently worksheet doesn't know it's own name, though
// Or perhaps I should say nothing at all?
Rcpp::warning("No <row> in sheet xml. No data read.");

This comment has been minimized.

@hadley

hadley Jan 30, 2017

Member

I think generally loading an empty file should not generate a warning/message.

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

warning deleted

@jennybc

This comment has been minimized.

Member

jennybc commented Jan 31, 2017

@hadley

I think I've addressed all of your comments and have gotten the NEWs/issues sorted out. Once you are satisfied, I think this can be merged. There are still a few "to do's" but they are best addressed as part of the next phase, where I'll work on column names and types for xlsx.

@hadley

A few more minor things, otherwise LGTM.

You might want to modify one of the commits to contain the full list of "fixes".

public:
XlsxWorkSheet(XlsxWorkBook wb, int sheet_i): wb_(wb) {
XlsxWorkSheet(XlsxWorkBook wb, int sheet_i, int nskip):
wb_(wb)

This comment has been minimized.

@hadley

hadley Jan 31, 2017

Member

I think our policy is to indent default constructors. @jimhester what do you do?

This comment has been minimized.

@jimhester

jimhester Jan 31, 2017

Member

I indent them.

It might be worth standardizing on clang-format and setting up a .clang-format file, which would give us standardized formatting across projects.

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

This seems not to be what RStudio wants to do. Does it do so for either of you and I've got some setting wrong?

I'm worried about doing it manually, because I use "reindent lines" so often, I fear my indenting will be worse overall, i.e. not internally consistent.

@jimhester Would RStudio implement a .clang-format file?

void printCells() {
for (rapidxml::xml_node<>* row = sheetData_->first_node("row");
row; row = row->next_sibling("row")) {
std::string sheetName() {

This comment has been minimized.

@hadley

hadley Jan 31, 2017

Member

const ?

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

Is there any reason to not do same for ncol() and nrow(), just above this?

I added for all 3, so speak up if it's a mistake to add for the other two.

## currently we get a 0-row tibble with 2 unnamed columns
## a big improvement but when I rationlize column names, see why these
## empty column names aren't causing the columns to be dropped and fix it
out <- read_excel(test_sheet("skipping.xlsx"), skip = 10)

This comment has been minimized.

@hadley

hadley Jan 31, 2017

Member

To make it less likely you'll forget about this column, you can use skip() to get a message in the tests

This comment has been minimized.

@jennybc

jennybc Jan 31, 2017

Member

Done! Nice idea.

jennybc added some commits Jan 31, 2017

Use skip() to issue note-to-future-self
Expedient place to park a summary of what this entire PR does.

* Improved parsing of sheet geometry for xlsx. (#240, @jennybc).

    - Better handling of leading and embedded blank rows and explicit row skipping. (#224, #194, #178, #156, #101)
    - Worksheets that are completely empty or that contain only column names no longer error, but return a tibble with zero rows. (#222, #144, #65)
    - Location is inferred for cells that do not declare their location (e.g. xlsx written by JMP). (#163, #102)

@jennybc jennybc merged commit 55291b4 into tidyverse:master Jan 31, 2017

4 checks passed

codecov/patch 89.58% of diff hit (target 61.3%)
Details
codecov/project 62.63% (+1.33%) compared to b35e2bc
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@jennybc jennybc deleted the jennybc:refactor-xlsx-ingest branch Mar 25, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment