Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor xlsx col type handling #261

Merged
merged 33 commits into from
Feb 13, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
33 commits
Select commit Hold shift + click to select a range
d8481ab
add new param n for guessing col_type
tklebel Oct 17, 2015
2c8e6c2
Remove all column name processing from colTypes()
Feb 10, 2017
c4b12b0
Move colName and colType comparison out of readCols()
Feb 10, 2017
9143dcb
Stop dropping blank columns; fixes #157
Feb 10, 2017
81643b6
Move tests into more logical places
Feb 10, 2017
3cf7dc9
More specific error when col_names or col_types has wrong length
Feb 10, 2017
bb567be
Rcpp churn
Feb 11, 2017
0c7e90f
If no data, call it blank right here
Feb 11, 2017
4657868
Simplify col_type learning loop
Feb 11, 2017
54213fd
Sketch my grand plans
Feb 11, 2017
4eeddf2
Check and test col_types
Feb 11, 2017
d100d5f
Simplify readCols() loop
Feb 11, 2017
624b02b
Recycle col_types if length 1; fixes #127
Feb 11, 2017
8459ff9
Deprecate `col_types = "blank"`; fixes #260
Feb 11, 2017
6ab2d36
Rationalize joint processing of col_names + col_types; fixes #81
Feb 11, 2017
65cd24b
Ignore docs for xls format
Feb 12, 2017
b5c55ce
Fix/update xlsx_col_types()
Feb 12, 2017
80905cf
Delete benchmarks.cpp, home of the shifty, vestigial parseXml()and co…
Feb 12, 2017
09562b1
Add bullets to NEWS
Feb 12, 2017
4a90dec
README and pkgdown
Feb 12, 2017
0d57735
Groom error messages and tests thereof
Feb 12, 2017
3c10a4b
Wording in NEWS
Feb 12, 2017
bba7655
Stop for unknown type
Feb 12, 2017
3dab676
Move CELL_SKIP; add comment re: enum order
Feb 12, 2017
5f7f908
Recycle length-one col_type with std::fill()
Feb 13, 2017
b200038
Indenting
Feb 13, 2017
158bbc4
Helper function for reconciling col names
Feb 13, 2017
7d63e93
Helper function for recycling col types
Feb 13, 2017
ee18b43
Simplify col_type learning loop; avoid cell copy
Feb 13, 2017
5189042
Simplify and avoid cell copy in readCols() too
Feb 13, 2017
642e23c
It's guess_max, not max_guess
Feb 13, 2017
c45a0a7
Delete comment, test code style
Feb 13, 2017
6bb4ee3
Cleaner with i, j
Feb 13, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
2 changes: 2 additions & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,5 @@
^_pkgdown\.yml$
^docs$
^clippy$
^\[MS-XLS\]\.pdf$
^excelfileformat\.pdf$
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -3,3 +3,5 @@
.RData
~*
clippy
\[MS-XLS\].pdf
excelfileformat.pdf
10 changes: 10 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,15 @@
# readxl 0.1.1.9000

*currently much of this applies only to xlsx, but will be extended to xls*

* A user-specified `col_types` of length one will be replicated to have length equal to the number of columns. (#127, #114, #261 @jennybc)

* Column type `"blank"` has been deprecated in favor of the more descriptive `"skip"`, which also supports the goal to become more consistent with readr. (#260, #193, #261 @jennybc)

* User-supplied `col_names` are processed relative to user-supplied `col_types`, if given. Specifically, `col_names` is considered valid if it has the same length as `col_types`, before *or after* removing skipped columns. (#81, #261 @jennybc)

* Leading or embedded empty columns are no longer dropped, regardless of whether there is a column name. (#157, #261 @jennybc)

* New argument `guess_max` lets user adjust the number of rows used to guess column types, similar to functions in readr. (#223, #257 @tklebel, @jennybc)

* Improved handling of empty cells for xlsx. (#248 @jennybc)
Expand Down
12 changes: 2 additions & 10 deletions R/RcppExports.R
Original file line number Diff line number Diff line change
Expand Up @@ -41,8 +41,8 @@ parse_ref <- function(ref) {
.Call('readxl_parse_ref', PACKAGE = 'readxl', ref)
}

xlsx_col_types <- function(path, sheet = 0L, na = character(), nskip = 0L, guess_max = 1000L) {
.Call('readxl_xlsx_col_types', PACKAGE = 'readxl', path, sheet, na, nskip, guess_max)
xlsx_col_types <- function(path, sheet = 0L, na = character(), nskip = 0L, guess_max = 1000L, sheetHasColumnNames = FALSE) {
.Call('readxl_xlsx_col_types', PACKAGE = 'readxl', path, sheet, na, nskip, guess_max, sheetHasColumnNames)
}

xlsx_col_names <- function(path, sheet = 0L, nskip = 0L) {
Expand All @@ -53,14 +53,6 @@ read_xlsx_ <- function(path, sheet, col_names, col_types, na, nskip = 0L, guess_
.Call('readxl_read_xlsx_', PACKAGE = 'readxl', path, sheet, col_names, col_types, na, nskip, guess_max)
}

parseXml <- function(base, internal) {
invisible(.Call('readxl_parseXml', PACKAGE = 'readxl', base, internal))
}

countRows <- function(base, sheet) {
.Call('readxl_countRows', PACKAGE = 'readxl', base, sheet)
}

zip_xml <- function(zip_path, file_path) {
invisible(.Call('readxl_zip_xml', PACKAGE = 'readxl', zip_path, file_path))
}
Expand Down
42 changes: 36 additions & 6 deletions R/read_excel.R
Original file line number Diff line number Diff line change
Expand Up @@ -7,14 +7,19 @@ NULL
#' @param path Path to the xls/xlsx file
#' @param sheet Sheet to read. Either a string (the name of a sheet), or an
#' integer (the position of the sheet). Defaults to the first sheet.
#' @param col_names `TRUE` to use the first row as column names, `FALSE`
#' to get default names, or a character vector giving a name for each column.
#' @param col_names `TRUE` to use the first row as column names, `FALSE` to get
#' default names, or a character vector giving a name for each column. If user
#' provides `col_types` as a vector, `col_names` can have one entry per
#' column, i.e. have the same length as `col_types`, or one entry per
#' unskipped column.
#' @param col_types Either `NULL` to guess from the spreadsheet or a character
#' vector containing one entry per column from these options: "blank",
#' "numeric", "date" or "text".
#' @param na Character vector of strings to use for missing values. By default
#' vector containing one entry per column from these options: "skip",
#' "numeric", "date" or "text". The content of a cell in a skipped column is
#' never read and that column will not appear in the data frame output.
#' @param na Character vector of strings to use for missing values. By default,
#' readxl treats blank cells as missing data.
#' @param skip Number of rows to skip before reading any data.
#' @param skip Number of rows to skip before reading any data. Leading blank
#' rows are automatically skipped.
#' @param guess_max Maximum number of rows to use for guessing column types.
#' @export
#' @examples
Expand All @@ -32,6 +37,7 @@ read_excel <- function(path, sheet = 1L, col_names = TRUE, col_types = NULL,

path <- check_file(path)
guess_max <- check_guess_max(guess_max)
col_types <- check_col_types(col_types)

switch(excel_format(path),
xls = read_xls(path, sheet, col_names, col_types, na, skip, guess_max),
Expand Down Expand Up @@ -128,6 +134,30 @@ standardise_sheet <- function(sheet, sheet_names) {
}
}

check_col_types <- function(col_types) {
if (is.null(col_types)) {
return(col_types)
}
stopifnot(is.character(col_types), length(col_types) > 0, !anyNA(col_types))

blank <- col_types == "blank"
if (any(blank)) {
message("`col_type = \"blank\"` deprecated. Use \"skip\" instead.")
col_types[blank] <- "skip"
}

accepted_types <- c("skip", "numeric", "date", "text")
ok <- col_types %in% accepted_types
if (any(!ok)) {
info <- paste(
paste0("'", col_types[!ok], "' [", seq_along(col_types)[!ok], "]"),
collapse = ", "
)
stop(paste("Illegal column type:", info), call. = FALSE)
}
col_types
}

## from readr
check_guess_max <- function(guess_max, max_limit = .Machine$integer.max %/% 100) {

Expand Down
4 changes: 3 additions & 1 deletion README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,8 @@ If you are new to the tidyverse conventions for data import, you may want to con
* Loads datetimes into POSIXct columns. Both Windows (1900) and Mac (1904)
date specifications are processed correctly.

* Blank columns are automatically dropped (*but this is changing!*). Blank rows that appear before the data are automatically dropped; embedded blank rows are not.
* Blank rows that appear before the data are automatically dropped; embedded blank rows are not. User can exert more control of this with `skip`.

* Column names and types are determined from the data in the sheet, by default, but user can also supply via `col_names` and `col_types`.

* It returns a tibble, i.e. a data frame with an additional `tbl_df` class. Among other things, this provide nicer printing.
4 changes: 3 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,8 @@ Features

- Loads datetimes into POSIXct columns. Both Windows (1900) and Mac (1904) date specifications are processed correctly.

- Blank columns are automatically dropped (*but this is changing!*). Blank rows that appear before the data are automatically dropped; embedded blank rows are not.
- Blank rows that appear before the data are automatically dropped; embedded blank rows are not. User can exert more control of this with `skip`.

- Column names and types are determined from the data in the sheet, by default, but user can also supply via `col_names` and `col_types`.

- It returns a tibble, i.e. a data frame with an additional `tbl_df` class. Among other things, this provide nicer printing.
3 changes: 2 additions & 1 deletion docs/index.html

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 6 additions & 0 deletions docs/news/index.html

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

25 changes: 16 additions & 9 deletions docs/reference/read_excel.html

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

21 changes: 13 additions & 8 deletions man/read_excel.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

25 changes: 17 additions & 8 deletions src/CellType.h
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,12 @@
#include <libxls/xls.h>
#include "StringSet.h"

// CELL_BLANK can arise only from an individual cell during type guessing
// important that it be the first entry = default type, instead of ...
// CELL_SKIP which can arise only as a user-specified column type
enum CellType {
CELL_BLANK,
CELL_SKIP,
CELL_DATE,
CELL_NUMERIC,
CELL_TEXT
Expand All @@ -26,11 +30,12 @@ inline std::vector<CellType> cellTypes(Rcpp::CharacterVector x) {
types.push_back(CELL_DATE);
} else if (type == "numeric") {
types.push_back(CELL_NUMERIC);
} else if (type == "skip") {
types.push_back(CELL_SKIP);
} else if (type == "text") {
types.push_back(CELL_TEXT);
} else {
Rcpp::warning("Unknown type '%s' at position %i. Using text instead.",
type, i + 1);
Rcpp::stop("Unknown type '%s' at position %i", type, i + 1);
}
}

Expand All @@ -42,6 +47,7 @@ inline std::string cellTypeDesc(CellType type) {
case CELL_BLANK: return "blank";
case CELL_DATE: return "date";
case CELL_NUMERIC: return "numeric";
case CELL_SKIP: return "skip";
case CELL_TEXT: return "text";
}
return "???";
Expand Down Expand Up @@ -153,6 +159,9 @@ inline Rcpp::RObject makeCol(CellType type, int n) {
case CELL_NUMERIC:
return Rcpp::NumericVector(n, NA_REAL);
break;
case CELL_SKIP:
return R_NilValue;
break;
case CELL_TEXT:
return Rcpp::CharacterVector(n, NA_STRING);
break;
Expand All @@ -161,24 +170,24 @@ inline Rcpp::RObject makeCol(CellType type, int n) {
return R_NilValue;
}

// Drop blanks from list of columns
inline Rcpp::List removeBlankColumns(Rcpp::List cols,
Rcpp::CharacterVector names,
std::vector<CellType> types) {
inline Rcpp::List removeSkippedColumns(Rcpp::List cols,
Rcpp::CharacterVector names,
std::vector<CellType> types) {
int p = cols.size();

int p_out = 0;
for (int j = 0; j < p; ++j) {
if (types[j] != CELL_BLANK)
if (types[j] != CELL_SKIP)
p_out++;
}

Rcpp::List out(p_out);
Rcpp::CharacterVector names_out(p_out);
int j_out = 0;
for (int j = 0; j < p; ++j) {
if (types[j] == CELL_BLANK)
if (types[j] == CELL_SKIP) {
continue;
}

out[j_out] = cols[j];
names_out[j_out] = names[j];
Expand Down
49 changes: 49 additions & 0 deletions src/ColSpec.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
#ifndef READXL_COLSPEC_
#define READXL_COLSPEC_

#include <Rcpp.h>
#include "CellType.h"

inline std::vector<CellType> recycleTypes(std::vector<CellType> types,
int ncol) {
if (types.size() == 1) {
types.resize(ncol);
std::fill(types.begin(), types.end(), types[0]);
}
return types;
}

inline Rcpp::CharacterVector reconcileNames(Rcpp::CharacterVector names,
const std::vector<CellType>& types,
int sheet) {
size_t ncol_names = names.size();
size_t ncol_types = types.size();

if (ncol_names == ncol_types) {
return names;
}

size_t ncol_noskip = 0;
for (size_t i = 0; i < types.size(); i++) {
if (types[i] != CELL_SKIP) {
ncol_noskip++;
}
}
if (ncol_names != ncol_noskip) {
Rcpp::stop("Sheet %d has %d columns (%d unskipped), but `col_names` has length %d.",
sheet + 1, ncol_types, ncol_noskip, ncol_names);
}

Rcpp::CharacterVector newNames(ncol_types, "");
size_t j_short = 0;
for (size_t j_long = 0; j_long < ncol_types; ++j_long) {
if (types[j_long] == CELL_SKIP) {
continue;
}
newNames[j_long] = names[j_short];
j_short++;
}
return newNames;
}

#endif