New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logical cell & col types; refactor cell typing and coercion #277

Merged
merged 18 commits into from Mar 5, 2017

Conversation

Projects
None yet
3 participants
@jennybc
Member

jennybc commented Mar 1, 2017

Main points:

  • Adds new CellType: CELL_LOGICAL. Detects Boolean cells, both static and formula.
  • Adds new ColType: COL_LOGICAL.
  • Adds these conversions:
    • logical to numeric or character
    • numeric or character to logical
    • character to numeric
    • numeric to date
  • Handle cells in error (explicit #N/A or as result of formula). Treat as CELL_BLANK and therefore read as NA.
  • Checking whether cell contents match the values in the na argument is now generally delegated to the type() member function of both XlsCell and XlsxCell. If the reported type is not CELL_BLANK, then the cell has contents you should work with. I have not exploited this to the maximum extent possible yet tbh, but am working on it.

Closes #270 Support a logical column type? (also previously discussed in #73).
Closes #263 Dates in numeric columns.
Closes #62 #NV (Excel function error) translates to string error instead of NA
Closes #106 Force numeric when col_types specifies it
Closes #217 NAs for cells with 'numbers as text' in a numeric column
Closes #266 Dates in Excel's internal format
Rescues xls formula dates, even though no one has complained about that yet 馃.

General style & housekeeping work:

  • Handle everything based on CellType or ColType enums in order.
  • Changed return whatever; break; into return whatever; inside all switch() statements.
  • Uniform messages/warnings across cell/col types and for xls and xlsx.
  • Make everything as similar as possible for xls and xlsx.
return CELL_BLANK;
}
int format = styles->xf[cell_->xf].format;
return isDateTime(format, customDateFormats) ? CELL_DATE : CELL_NUMERIC;

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

This rescues xls formula dates, which were previously coming in as numeric.

return CELL_BLANK;
} else {
return CELL_TEXT;
// error should become NA
return cell_->d == 0 ? CELL_TEXT : CELL_BLANK;

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

Formula cells that are in error come in now as NA instead of the string "error".

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

Why is that implied by cell_->d == 0? Might be worth a comment.

This comment has been minimized.

@jennybc

jennybc Mar 4, 2017

Member

done 699a5c8

break;
// error should become NA
// 2.5.10 Bes p592 explains the codes, if we ever change our minds
return (!strcmp((char *) cell_->str, "bool")) ? CELL_LOGICAL : CELL_BLANK;

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

Ensures explicit #N/A becomes NA instead of the string "error".

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

strncmp() here too. Maybe extract out into helper function?

This comment has been minimized.

@jennybc

jennybc Mar 4, 2017

Member

done cf46fcf (well, not the helper function part (yet?))

}
return CELL_TEXT;

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

Made CELL_TEXT the default cell type vs CELL_NUMERIC. Presumably has almost zero practical significance.

@@ -13,23 +14,6 @@
// 18.3.1.96 v (Cell Value) [p1709]
// 18.18.11 ST_CellType (Cell Type) [p2443]
// Simple parser: does not check that order of numbers and letters is correct

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

Moved this to utils.h because it's not necessarily specific to xlsx.

@@ -54,17 +38,177 @@ class XlsxCell {
return location_.second;
}
std::string asStdString(const std::vector<std::string>& stringTable) const {
CellType type(const StringSet& na,

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

This got moved higher in the class and substantially reworked.

// anything that is not explicitly addressed elsewhere
}
std::string asStdString(const StringSet& na,

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

This was refactored to use and trust type() and handle the new logical cell type.

std::string text_string = xcell->asStdString(na, wb_.stringTable(),
wb_.dateStyles());
bool text_boolean;
if (logicalFromString(text_string, &text_boolean)) {

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

I am converting strings to logical as R does. At least, that is my intent.

{
std::string text_string((char*) xcell->cell()->str);
bool text_boolean;
if (logicalFromString(text_string, &text_boolean)) {

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

I am converting strings to logical as R does. At least, that is my intent.

case CELL_DATE:
REAL(col)[row] = xcell->asDouble(na);

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

See how dates in a numeric column were treated as numeric? I don't do that anymore. They become NA. See #263.

case CELL_DATE:
// print date string here, when it's easier to do so
Rcpp::warning("Expecting numeric in [%i, %i]: got a date",
i + 1, j + 1);
REAL(col)[row] = NA_REAL;

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

Dates become NA in a numeric column, instead of their integer value. See #263.

return std::make_pair(row - 1, col - 1); // zero indexed
}
inline bool logicalFromString(std::string maybe_tf, bool *out) {

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

Have I made this much harder than it needs to be? I borrowed the logic for looking for matches from the way the na argument is handled.

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

I think you probably should initialise true_strings and false_strings once, outside of this function. @jimhester can suggest the best way

This comment has been minimized.

@jimhester

jimhester Mar 2, 2017

Member

I believe R exposes Rf_StringTrue in R_ext/Utils.h, is it possible we could use that rather than rolling our own here.

If that is not possible I would do something like https://github.com/jennybc/readxl/pull/2

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

Yeah looks like Rf_StringTrue is the way to go

This comment has been minimized.

@jennybc
// columns come after cell types)
ColType inline as_ColType(CellType cell) {
return (ColType) cell;
// column types come after cell types)

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

This is mostly me moving things around.

// print date string here, when it's easier to do so
Rcpp::warning("Expecting numeric in [%i, %i]: got the date '%s'",
i + 1, j + 1, xcell->cell()->str);
REAL(col)[row] = NA_REAL;

This comment has been minimized.

@jennybc

jennybc Mar 1, 2017

Member

Dates become NA in numeric columns, instead of their integer value. See #263.

}
}
if (v == NULL || na.contains(v->value())) {

This comment has been minimized.

@jennybc

jennybc Mar 2, 2017

Member

This is now the main check for an empty cell or a cell with contents that match na. If so, the cell's type will be CELL_BLANK. In theory I could remove a lot of checks elsewhere for such a cell and just trust the cell type. But I have not done that systematically.

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

I think that would be a good idea.

This comment has been minimized.

@jennybc

jennybc Mar 4, 2017

Member

Maybe cell type should become part of the Xls[x]Cell class and determined + stored when the cell is loaded. Otherwise, we re-type many cells up to 3 times: during col type guessing, insertion of cells into columns, and coercion from one type to another.

This is the last substantial open question in this PR.

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

Yeah, that seems like a good idea. Maybe do as a next step in a separate PR?

This comment has been minimized.

@jennybc

jennybc Mar 5, 2017

Member

Issue opened #283

}
}
int asInteger(const StringSet& na) const {

This comment has been minimized.

@jennybc

jennybc Mar 2, 2017

Member

New. Needed to make logical cells.

@jennybc jennybc changed the title from [WIP] Add logical cell and col types; general refactor of cell typing and c鈥 to [WIP] Add logical cell & col types; refactor cell typing and coercion Mar 2, 2017

@jennybc jennybc requested a review from hadley Mar 2, 2017

@jennybc jennybc changed the title from [WIP] Add logical cell & col types; refactor cell typing and coercion to Add logical cell & col types; refactor cell typing and coercion Mar 2, 2017

REAL(col)[row] = NA_REAL;
Rcpp::warning("Coercing numeric to date in [%i, %i]",
i + 1, j + 1);
REAL(col)[row] = (xcell->cell()->d - offset_) * 86400;

This comment has been minimized.

@jennybc

jennybc Mar 2, 2017

Member

I think this is the right thing to do (convert numeric to date if column is "date"), but I'm not sure. It was requested (#266). What do you think?

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

That seems consistent with the new behaviour in the other branches (i.e. try harder to be helpful). I think you could drop the warning.

src/utils.h Outdated
return matches;
}
inline bool doubleFromString(std::string in, double& out) {

This comment has been minimized.

@jennybc

jennybc Mar 2, 2017

Member

I am working on a replacement for this that does not assume C++11. Travis/Appveyor have informed me that this apparently does.

This comment has been minimized.

@jennybc

jennybc Mar 2, 2017

Member

This is sorted out now.

jennybc added some commits Mar 2, 2017

int format = styles->xf[cell_->xf].format;
return isDateTime(format, customDateFormats) ? CELL_DATE : CELL_NUMERIC;
} else { // formula evaluates to Boolean, string, or error
if (!strcmp((char *) cell_->str, "bool")) {

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

I think you should use strncmp() here to be safe (in case cell_->str doesn't contain a \0)

This comment has been minimized.

@jennybc

jennybc Mar 4, 2017

Member

done cf46fcf

return CELL_BLANK;
} else {
return CELL_TEXT;
// error should become NA
return cell_->d == 0 ? CELL_TEXT : CELL_BLANK;

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

Why is that implied by cell_->d == 0? Might be worth a comment.

break;
// error should become NA
// 2.5.10 Bes p592 explains the codes, if we ever change our minds
return (!strcmp((char *) cell_->str, "bool")) ? CELL_LOGICAL : CELL_BLANK;

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

strncmp() here too. Maybe extract out into helper function?

for (size_t i = 0; i < types.size(); i++) {
types[i] = COL_BLANK;
}
std::fill(types.begin(), types.end(), COL_BLANK);

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

馃憤

case CELL_DATE:
REAL(col)[row] = xcell->cell()->d;
// print date string here, when/if it's possible to do so
Rcpp::warning("Expecting logical in [%i, %i] got a date",

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

I wonder if it's worth pulling out a cellPosition(int i, int j) now? That would make it easier to implement #230. (But this PR is already quite big, so feel to leave)

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

And actually it would probably be better to have a cellProblem(std::string message, int i, int j) but that definitely doesn't belong here

// print date string here, when/if possible
Rcpp::warning("Expecting numeric in [%i, %i]: got a date",
i + 1, j + 1);
REAL(col)[row] = NA_REAL;

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

Shouldn't this be xcell->cell()->d ?

This comment has been minimized.

@jennybc

jennybc Mar 4, 2017

Member

At the moment, the choice to not coerce a date to numeric is intentional and consistent between xls and xlsx. See #263. Does anyone ever want that number mixed in with other numbers that are not dates? Yet I can see how this seems in conflict with all the other places where we've added coercion.

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

Yeah, it seems like reasonable behaviour but not in keeping with the other places where we do conversion. It might be worth spending a couple of minutes trying to articulate the principle of why this is ok here.

case CELL_NUMERIC: {
SET_VECTOR_ELT(col, row, Rf_ScalarReal(xcell->cell()->d));
case CELL_LOGICAL: {
SET_VECTOR_ELT(col, row, Rf_ScalarLogical((xcell->cell()->d)));

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

Two many parens?

This comment has been minimized.

@jennybc
}
}
if (v == NULL || na.contains(v->value())) {

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

I think that would be a good idea.

@@ -269,7 +362,7 @@ class XlsxWorkSheet {
ncol_ = 0;
// empty sheet case
if (cells_.size() == 0) {
if (cells_.empty()) {

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

This seems like a lot of duplicated code. I think it's probably worth the effort to define a common base class for XlsxCell and XlsCell so we only need one function instead of two. That can happen after this PR though.

This comment has been minimized.

@jennybc

jennybc Mar 5, 2017

Member

Issue opened #284

return std::make_pair(row - 1, col - 1); // zero indexed
}
inline bool logicalFromString(std::string maybe_tf, bool *out) {

This comment has been minimized.

@hadley

hadley Mar 2, 2017

Member

I think you probably should initialise true_strings and false_strings once, outside of this function. @jimhester can suggest the best way

case CELL_DATE:
REAL(col)[row] = xcell->cell()->d;
// print date string here, when/if it's possible to do so
Rcpp::warning("Expecting logical in [%i, %i] got a date",

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

And actually it would probably be better to have a cellProblem(std::string message, int i, int j) but that definitely doesn't belong here

REAL(col)[row] = NA_REAL;
break;
case CELL_LOGICAL:
Rcpp::warning("Coercing boolean to numeric in [%i, %i]",

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

Oh I see. This will become less annoying once we have readr style problems tibble.

// print date string here, when/if possible
Rcpp::warning("Expecting numeric in [%i, %i]: got a date",
i + 1, j + 1);
REAL(col)[row] = NA_REAL;

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

Yeah, it seems like reasonable behaviour but not in keeping with the other places where we do conversion. It might be worth spending a couple of minutes trying to articulate the principle of why this is ok here.

}
}
if (v == NULL || na.contains(v->value())) {

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

Yeah, that seems like a good idea. Maybe do as a next step in a separate PR?

// summary of how Excel cell types have been mapped to our CellType
//
// CELL_BLANK
// inlineStr cell and (string is na or string can't be found)

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

I feel like this might be more informative at the start of the function.

This comment has been minimized.

@jennybc

jennybc Mar 5, 2017

Member

OK done in a9402dc

// inlineStr
rapidxml::xml_node<>* is = cell_->first_node("is");
if (is != NULL) {
return parseString(is, &out_string) ? out_string : "NA";

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

It's a bit odd to see a literal "NA" here

return matches;
}
inline bool doubleFromString(std::string mystring, double& out) {

This comment has been minimized.

@hadley

hadley Mar 4, 2017

Member

It might be a bit more idiomatic to return a std::pair<bool, double> here.

What do you think @jimhester?

This comment has been minimized.

@jennybc

jennybc Mar 5, 2017

Member

It feels like the various solutions I read on SO work this way. I also note it is how the existing parseString() works:

inline bool parseString(const rapidxml::xml_node<>* string, std::string *out) {

This comment has been minimized.

@hadley

hadley Mar 5, 2017

Member

Yeah, it's a totally reasonable pattern, but for some reason it feels off to me here.

@hadley

This comment has been minimized.

Member

hadley commented Mar 5, 2017

I think this is fine to merge if you want to; other stuff can happen later since this is already a gigantic PR.

@jennybc

This comment has been minimized.

Member

jennybc commented Mar 5, 2017

That would be a great mercy (merging and attacking to do's in smaller pieces). Will do that.

@jennybc jennybc merged commit 843724e into tidyverse:master Mar 5, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@jennybc jennybc deleted the jennybc:typing-and-coercion branch Mar 5, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment