Experiment with converting the whole df to character matrix first #126

chainsawriot · 2023-08-08T10:49:14Z

For research into #113.

Line 107 in dfe3d06

value <- as.character(x[i, j, drop = TRUE])

Would there be a performance gain from converting the whole df to a character matrix first, rather than doing this for every cell?

x_mat <- apply(x, 2, as.character)
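A minimal sketch of the two strategies being compared, using a small made-up data frame (`df`, `per_cell` and `by_col` are illustrative names, not objects from the package). It only shows that converting each column once yields the same strings as converting cell by cell; it makes no claim about timings, which will vary by machine and data.

```r
## Hypothetical toy data; not taken from this issue.
df <- data.frame(a = c(1.5, NA, 3), b = c("x", "y", NA), stringsAsFactors = FALSE)

## Per-cell conversion: one as.character() call for every cell
per_cell <- matrix(NA_character_, nrow = nrow(df), ncol = ncol(df))
for (i in seq_len(nrow(df))) {
    for (j in seq_len(ncol(df))) {
        per_cell[i, j] <- as.character(df[i, j, drop = TRUE])
    }
}

## Up-front conversion: one as.character() call per column
by_col <- lapply(df, as.character)

## Both routes produce the same strings, column by column
identical(per_cell[, 1], by_col$a)
identical(per_cell[, 2], by_col$b)
```

The sketch uses `lapply()` rather than `apply()` so that each column is converted by `as.character()` directly, without first passing through `as.matrix()`.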

chainsawriot · 2023-08-08T14:59:59Z

Current HEAD

system.time(write_ods(nycflights13::flights))
##    user  system elapsed 
## 203.636   4.029 207.797

This is experimental (it won't work for a single-row df, for example, because apply() then returns a plain vector rather than a matrix)

.write_sheet_con <- function(x, con, sheet = "Sheet1", row_names = FALSE, col_names = FALSE, na_as_string = FALSE, padding = FALSE) {
    cmax <- force(if(ncol(x) > 1024) { 16384 } else { 1024 })
    types <- unlist(lapply(x, class))
    types <- ifelse(types %in% c("integer", "numeric"), "float", "string")
    colj <- seq_len(NCOL(x))
    cols <- ncol(x)
    if (row_names) {
        cols <- cols + 1
    }
    rows <- nrow(x)
    if (col_names) {
        rows <- rows + 1
    }
    if (padding) {
        .write_as_utf8(.gen_sheet_tag(sheet = sheet, cols = cmax), con)
    } else {
        .write_as_utf8(.gen_sheet_tag(sheet = sheet, cols = cols), con)
    }
    ## add data
    rownames_x <- rownames(x)
    colnames_x <- colnames(x)
    if (col_names) {
        .write_as_utf8("<table:table-row table:style-name=\"ro1\">", con)
        if (row_names) {
            .cell_out("string", value = "", con = con)
        }
        for (j in colj) {
            .cell_out(type = "string", value = colnames_x[j], con = con)
        }
        if (cols < cmax && padding) {
            .write_as_utf8(stringi::stri_join("<table:table-cell table:number-columns-repeated=\"", as.character(cmax - cols), "\"/>", sep = ""), con)
        }
        .write_as_utf8("</table:table-row>", con)
    }
    x2 <- apply(x, 2, as.character)
    for (i in seq_len(NROW(x2))) {
        ## create a row
        .write_as_utf8("<table:table-row table:style-name=\"ro1\">", con)
        if (row_names) {
            .cell_out(type = "string", value = rownames_x[i], con = con)
        }
        for (j in colj) {
            value <- x2[i, j, drop = TRUE]
            write_empty_cell <- FALSE
            if (is.na(value) && !na_as_string) {
                write_empty_cell <- TRUE
            }
            if (is.na(value) && na_as_string) {
                type <- "string"
                value <- "NA"
            } else {
                type <- types[j]
            }
            .cell_out(type = type, value = value, con = con, write_empty_cell = write_empty_cell)
        }
        if (cols < cmax && padding) {
            .write_as_utf8(stringi::stri_join("<table:table-cell table:number-columns-repeated=\"", as.character(cmax - cols), "\"/>", sep = ""), con)
        }
        .write_as_utf8("</table:table-row>", con)
    }
    if (rows < 2^20 && padding) {
        .write_as_utf8(stringi::stri_join("<table:table-row table:style-name=\"ro1\" table:number-rows-repeated=\"", 2^20 - rows, "\"><table:table-cell table:number-columns-repeated=\"", cmax, "\"/></table:table-row>", sep = ""), con)
    }
    .write_as_utf8("</table:table>", con)
    return(invisible(con))
}

system.time(write_ods(nycflights13::flights))
##    user  system elapsed 
## 126.228   0.827 127.168
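The single-row caveat can be reproduced in isolation. The following is a sketch with a made-up one-row data frame (`one_row`, `m` and `cols_chr` are illustrative names): `apply()` simplifies its result to a vector when each call returns a single value, so matrix-style `x2[i, j]` indexing no longer applies. A list of character vectors, one per column, does not have that problem. Note also that `apply()` coerces a data frame through `as.matrix()`, which is documented to use `format()` on non-character columns of a mixed data frame, so the strings it yields for numeric columns may differ from plain `as.character()` output.

```r
## Hypothetical one-row data frame; not taken from this issue.
one_row <- data.frame(a = 1.5, b = "x", stringsAsFactors = FALSE)

## apply() drops the dimensions here: the result is a named character vector
m <- apply(one_row, 2, as.character)
is.null(dim(m))

## A list keeps one character vector per column, whatever the row count
cols_chr <- lapply(one_row, as.character)
value <- cols_chr[[1]][1]  # row 1 of column 1
```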

pbrohan · 2023-08-09T15:19:57Z

I think this is probably the route we'll want to go down for #113.

I haven't looked into it that heavily, but my thought is to do a similar thing to write_xlsx, which passes a list/df of strings and a list of types to the C++ function for writing.

For very large datasets this presents a problem, as it could double the memory required for writing, so it may be a little better to do it in chunks, but that's by the by.
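A rough sketch of the chunked idea, under stated assumptions: `col_types()`, `chunk_rows()` and `prepare_chunk()` are hypothetical helpers written for illustration, not functions from readODS or writexl, and the actual writing step is left as a comment. Only one chunk of rows is held as strings at a time, alongside a single vector of per-column types.

```r
## Hypothetical helpers; none of these exist in readODS or writexl.

## One type per column: "float" for numeric columns, "string" otherwise
col_types <- function(x) {
    ifelse(vapply(x, is.numeric, logical(1)), "float", "string")
}

## Split row indices 1..n into consecutive groups of at most `size` rows
chunk_rows <- function(n, size) {
    split(seq_len(n), ceiling(seq_len(n) / size))
}

## Convert only the requested rows to a list of character vectors
prepare_chunk <- function(x, rows) {
    lapply(x[rows, , drop = FALSE], as.character)
}

## Hypothetical toy data; not taken from this issue.
df <- data.frame(a = c(1.5, NA, 3, 4, 5), b = letters[1:5], stringsAsFactors = FALSE)
types <- col_types(df)
chunks <- chunk_rows(nrow(df), size = 2)
for (rows in chunks) {
    chunk <- prepare_chunk(df, rows)
    ## a writer would emit the cells of `chunk`, using `types`, at this point
}
```

The chunk size trades peak memory against the number of conversion calls; a suitable value would have to be found by benchmarking.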

TimTaylor mentioned this issue Aug 8, 2023

.escape_xml() performance tweak #127

Closed

chainsawriot added a commit that referenced this issue Aug 9, 2023

Implement #126

ed02746

Take2: Using list instead; it passes all tests

chainsawriot closed this as completed in 9ec437d Aug 11, 2023
