Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment with converting the whole df to character matrix first #126

Closed
chainsawriot opened this issue Aug 8, 2023 · 2 comments
Closed

Comments

@chainsawriot
Copy link
Collaborator

For the research of #113

value <- as.character(x[i, j, drop = TRUE])

Would there be performance gain for converting the whole df to character matrix first, rather than do this for every cell?

x_mat <- apply(x, 2, as.character)
@chainsawriot
Copy link
Collaborator Author

Current HEAD

system.time(write_ods(nycflights13::flights))
##   user  system elapsed 
##203.636   4.029 207.797 

This is experimental (it won't work for single-row df, for example)

.write_sheet_con <- function(x, con, sheet = "Sheet1", row_names = FALSE, col_names = FALSE, na_as_string = FALSE, padding = FALSE) {
    cmax <- force(if(ncol(x) > 1024) { 16384 } else { 1024 })
    types <- unlist(lapply(x, class))
    types <- ifelse(types %in% c("integer", "numeric"), "float", "string")
    colj <- seq_len(NCOL(x))
    cols <- ncol(x)
    if (row_names) {
        cols <- cols + 1
    }
    rows <- nrow(x)
    if (col_names) {
        rows <- rows + 1
    }
    if (padding) {
        .write_as_utf8(.gen_sheet_tag(sheet = sheet, cols = cmax), con)
    } else {
        .write_as_utf8(.gen_sheet_tag(sheet = sheet, cols = cols), con)
    }
    ## add data
    rownames_x <- rownames(x)
    colnames_x <- colnames(x)
    if (col_names) {
        .write_as_utf8("<table:table-row table:style-name=\"ro1\">", con)
        if (row_names) {
            .cell_out("string", value = "", con = con)
        }
        for (j in colj) {
            .cell_out(type = "string", value = colnames_x[j], con = con)
        }
        if (cols < cmax && padding) {
            .write_as_utf8(stringi::stri_join("<table:table-cell table:number-columns-repeated=\"", as.character(cmax - cols), "\"/>", sep = ""), con)
        }
        .write_as_utf8("</table:table-row>", con)
    }
    x2 <- apply(x, 2, as.character)
    for (i in seq_len(NROW(x2))) {
        ## create a row
        .write_as_utf8("<table:table-row table:style-name=\"ro1\">", con)
        if (row_names) {
            .cell_out(type = "string", value = rownames_x[i], con = con)
        }
        for (j in colj) {
            value <- x2[i, j, drop = TRUE]
            write_empty_cell <- FALSE
            if (is.na(value) && !na_as_string) {
                write_empty_cell <- TRUE
            }
            if (is.na(value) && na_as_string) {
                type <- "string"
                value <- "NA"
            } else {
                type <- types[j]
            }
            .cell_out(type = type, value = value, con = con, write_empty_cell = write_empty_cell)
        }
        if (cols < cmax && padding) {
            .write_as_utf8(stringi::stri_join("<table:table-cell table:number-columns-repeated=\"", as.character(cmax - cols), "\"/>", sep = ""), con)
        }
        .write_as_utf8("</table:table-row>", con)
    }
    if (rows < 2^20 && padding) {
        .write_as_utf8(stringi::stri_join("<table:table-row table:style-name=\"ro1\" table:number-rows-repeated=\"", 2^20 - rows, "\"><table:table-cell table:number-columns-repeated=\"", cmax, "\"/></table:table-row>", sep = ""), con)
    }
    .write_as_utf8("</table:table>", con)
    return(invisible(con))
}
system.time(write_ods(nycflights13::flights))
##    user  system elapsed 
## 126.228   0.827 127.168 

@pbrohan
Copy link
Contributor

pbrohan commented Aug 9, 2023

I think this is probably the route we'll want to go down for #113.

I haven't looked into it that heavily, but my thought is to do a similar thing to write_xlsx, which passes a list/df of strings and a list of types to the C++ function for writing.

For very large datasets, this presents a problem, as it possibly doubles the memory required for writing, and so it may be that it's a little better to do it in chunks, but that's by the by.

chainsawriot added a commit that referenced this issue Aug 9, 2023
Take2: Using list instead; it passes all tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants