New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

xls catch up re: changes to sheet geometry and col specification #271

Merged
merged 30 commits into from Feb 24, 2017

Conversation

Projects
None yet
2 participants
@jennybc
Member

jennybc commented Feb 24, 2017

This became something of a monster PR 馃懝 sorry. What's here, in order of importance:

  1. xls reading was aligned with xlsx.
    • All logic pushed to the C++ side, instead of R. Overall sheet reading function read_xls_ is almost carbon copy of read_xlsx_. Ditto to various degrees for the XlsWorksheet constructor and member functions.
  2. read_excel_() worker created so that sheet standardization, col_names/col_types checks, tibble-ization, and name repair are applied uniformly, but not repeated.
  3. General hygiene. Removed more vestigial code. Remove name inconsistencies that had no significance. Moved stuff between files, renamed files, eliminated files (e.g. xls-specific cell type content moved to new XlsCell.h header, remaining content folded into ColSpec.h, CellType.h deleted).

Closes #258 still checking if this closes some other xls issues

jennybc added some commits Feb 20, 2017

@jennybc jennybc requested a review from hadley Feb 24, 2017

@@ -21,11 +22,13 @@ class XlsWorkBook {
public:
XlsWorkBook(std::string path) {
XlsWorkBook(const std::string& path)

This comment has been minimized.

@jennybc

jennybc Feb 24, 2017

Member

It feels like offset_ and customDateFormats_ should belong to the workbook, as they do on the xlsx side. And should be read right here.

An XlsxWorksheet has its host XlsxWorkbook as a member, which then provides access to this date info. My attempts to do same for xls failed. I don't know if the xls binary format (vs the xlsx XML) makes this impossible/difficult or if it just requires different methods, which I do not know.

}
// [[Rcpp::export]]
List xls_cols(std::string path, int i, CharacterVector col_names,
CharacterVector col_types, std::vector<std::string> na, int nskip = 0) {
List read_xls_(std::string path, int sheet_i, RObject col_names,

This comment has been minimized.

@jennybc

jennybc Feb 24, 2017

Member

This is now virtually identical to read_xlsx_. To the point that it feels like I should do something about that.

This comment has been minimized.

@hadley

hadley Feb 24, 2017

Member

Let's make that another project. Probably the right approach is to make a joint base class for XlsxWorkSheet and XlsWorkSheet etc.

class XlsWorkSheet {
xls::xlsWorkSheet* pWS_;
int nrow_, ncol_;
double offset_;

This comment has been minimized.

@jennybc

jennybc Feb 24, 2017

Member

As commented earlier, it feels like this date-related information (offset_ and customDateFormats_) should belong to the workbook, not the worksheet. But then I don't know how to make it available inside the member functions here, since I was not able to make the workbook a member, as is done on the xlsx side.

This comment has been minimized.

@hadley

hadley Feb 24, 2017

Member

Let's talk about that in person. It's probably somehow related to memory management.

if (nskip > nrow_)
return out;
Rcpp::CharacterVector colNames() {

This comment has been minimized.

@jennybc

jennybc Feb 24, 2017

Member

Almost identical to analogous member of XlsxWorksheet class. Should I do something about that?

return out;
}
std::vector<ColType> colTypes(const StringSet &na, int nskip = 0,
int guess_max = 1000) {
std::vector<ColType> colTypes(const StringSet &na,

This comment has been minimized.

@jennybc

jennybc Feb 24, 2017

Member

Almost identical to analogous member of XlsxWorksheet class. Should I do something about that?

int nominal_ncol = pWS_->rows.lastcol;
int nominal_nrow = pWS_->rows.lastrow;
for (xls::WORD i = 0; i <= nominal_nrow; ++i) {

This comment has been minimized.

@jennybc

jennybc Feb 24, 2017

Member

libxls goes out of its way to define and rigorously use custom types, such as WORD, which is uint16_t. readxl OTOH is all over the place with int, size_t, etc. Should I worry about this?

This comment has been minimized.

@hadley
// secondRow_ = first cell for which declared row > that of firstRow_
// fallback to cells_.end() if the above not possible
// Assumes loaded cells are arranged s.t. row is non-decreasing
void parseGeometry(int skip) {

This comment has been minimized.

@jennybc

jennybc Feb 24, 2017

Member

Almost identical to analogous member of XlsxWorksheet class. Should I do something about that?

@hadley

hadley approved these changes Feb 24, 2017

I didn't read the C++ code closely - I assume if all the tests pass with out crashing you've done the right thing.

I'd say we should merge this and then figure out how to reduce the duplicated code.

}
// [[Rcpp::export]]
List xls_cols(std::string path, int i, CharacterVector col_names,
CharacterVector col_types, std::vector<std::string> na, int nskip = 0) {
List read_xls_(std::string path, int sheet_i, RObject col_names,

This comment has been minimized.

@hadley

hadley Feb 24, 2017

Member

Let's make that another project. Probably the right approach is to make a joint base class for XlsxWorkSheet and XlsWorkSheet etc.

class XlsWorkSheet {
xls::xlsWorkSheet* pWS_;
int nrow_, ncol_;
double offset_;

This comment has been minimized.

@hadley

hadley Feb 24, 2017

Member

Let's talk about that in person. It's probably somehow related to memory management.

}
case CELL_NUMERIC:
case CELL_DATE:
REAL(col)[row] = xcell->cell()->d;

This comment has been minimized.

@hadley

hadley Feb 24, 2017

Member

I think this code would also become much more similar to the xlsx equivalent if we added getDoubleValue() etc methods to the cell class.

int nominal_ncol = pWS_->rows.lastcol;
int nominal_nrow = pWS_->rows.lastrow;
for (xls::WORD i = 0; i <= nominal_nrow; ++i) {

This comment has been minimized.

@hadley
@jennybc

This comment has been minimized.

Member

jennybc commented Feb 24, 2017

Yes all tests are passing everywhere and this PR added quite a few, esp for xls. So I think it is a net positive.

I'll take or create opportunities to get advice on some of the obvious opportunities for improvement, touched on above.

@jennybc jennybc merged commit fa50e03 into tidyverse:master Feb 24, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@jennybc jennybc deleted the jennybc:xls-catch-up branch Feb 24, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment