tweaks to unark for more robust parsing (#19)
- `unark()` will strip out non-compliant characters in table names by default.

- `unark()` gains the optional argument `tablenames`, allowing the user to
   specify the corresponding table names manually, rather than requiring that
   they correspond to the incoming file names.
   closes #18

-  `unark()` gains the argument `encoding`, allowing users to directly set
   the encoding of incoming files.  Previously this could only be set via
   `options(encoding)`, which still works as well.  See the
   `fao.R` example in `inst/examples` for an illustration.

- `unark()` will now attempt to guess which streaming parser to use 
   (e.g. `csv` or `tsv`) based on the file extension pattern, rather than
   defaulting to a `tsv` parser.  (`ark()` still defaults to exporting in
   the more portable `tsv` format).
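
Taken together, a minimal usage sketch of the new behaviour (the SQLite connection, file paths, and table names below are illustrative assumptions, not part of this commit):

```r
library(arkdb)

# Hypothetical input files: the streaming parser is guessed from the
# file extension (csv vs tsv), even behind a compression suffix.
files <- c("data/Trade_Crops_E_All_Data.csv.gz",
           "data/prices-2018.tsv.bz2")

# Assumes the RSQLite package is available; any DBI connection works.
db <- DBI::dbConnect(RSQLite::SQLite(), "example.sqlite")

# Table names default to the sanitized, lowercased file basenames;
# `tablenames` overrides that, and `encoding` is applied to the input.
unark(files, db,
      tablenames = c("trade_crops", "prices_2018"),
      encoding   = "latin1",
      overwrite  = TRUE)

DBI::dbDisconnect(db)
```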
cboettig committed Sep 26, 2018
1 parent e92e1d2 commit 04f353c
Showing 8 changed files with 182 additions and 36 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,5 +1,5 @@
Package: arkdb
Version: 0.0.3
Version: 0.0.3.9000
Title: Archive and Unarchive Databases Using Flat Files
Description: Flat text files provide a robust, compressible, and portable
way to store tables from databases. This package provides convenient
@@ -18,7 +18,7 @@ Encoding: UTF-8
LazyData: true
ByteCompile: true
VignetteBuilder: knitr
RoxygenNote: 6.0.1.9000
RoxygenNote: 6.1.0
Roxygen: list(markdown = TRUE)
Imports:
DBI,
19 changes: 18 additions & 1 deletion NEWS.md
@@ -1,4 +1,21 @@
# arkdb 0.0.3

# arkdb 0.0.4

- `unark()` will strip out non-compliant characters in table names by default.
- `unark()` gains the optional argument `tablenames`, allowing the user to
  specify the corresponding table names manually, rather than requiring that
  they correspond to the incoming file names.
[#18](https://github.com/ropensci/arkdb/issues/18)
- `unark()` gains the argument `encoding`, allowing users to directly set
  the encoding of incoming files. Previously this could only be set via
  `options(encoding)`, which still works as well. See the
  `fao.R` example in `inst/examples` for an illustration.
- `unark()` will now attempt to guess which streaming parser to use
  (e.g. `csv` or `tsv`) based on the file extension pattern, rather than
defaulting to a `tsv` parser. (`ark()` still defaults to exporting in
the more portable `tsv` format).

# arkdb 0.0.3 2018-09-11

* Remove dependency on utils::askYesNo for backward compatibility, [#17](https://github.com/ropensci/arkdb/issues/17)

103 changes: 81 additions & 22 deletions R/unark.R
@@ -9,6 +9,10 @@
#' default is "ask", which will ask for confirmation in an interactive session, and
#' overwrite in a non-interactive script. TRUE will always overwrite, FALSE will
#' always skip such tables.
#' @param encoding encoding to be assumed for input files.
#' @param tablenames vector of table names to be used for the corresponding files.
#' By default, tables are named using the lowercase file basename, with
#' special characters replaced by underscores (for SQL compatibility).
#' @param ... additional arguments to `streamable_table$read` method.
#' @details `unark` will read in files in chunks and
#' write them into a database. This is essential for processing
@@ -46,24 +50,41 @@
#' @export
unark <- function(files,
db_con,
streamable_table = streamable_base_tsv(),
streamable_table = NULL,
lines = 50000L,
overwrite = "ask",
encoding = Sys.getenv("encoding", "UTF-8"),
tablenames = NULL,
...){

assert_files_exist(files)
assert_dbi(db_con)
assert_streamable(streamable_table)

## Guess streamable table
if(is.null(streamable_table)){
streamable_table <- guess_stream(files[[1]])
}

assert_streamable(streamable_table)


if(is.null(tablenames)){
tablenames <- vapply(files, base_name, character(1))
}

db <- normalize_con(db_con)
lapply(files,
unark_file,
db,
streamable_table = streamable_table,
lines = lines,
overwrite = overwrite,
...)

lapply(seq_along(files),
function(i){
unark_file(files[[i]],
db_con = db,
streamable_table = streamable_table,
lines = lines,
overwrite = overwrite,
encoding = encoding,
tablename = tablenames[[i]],
...)
})
invisible(db_con)
}

@@ -77,28 +98,34 @@ normalize_con <- function(db_con){
}
}


#' @importFrom DBI dbWriteTable
#' @importFrom progress progress_bar
unark_file <- function(filename, db_con, streamable_table, lines = 10000L, overwrite, ...){
unark_file <- function(filename,
db_con,
streamable_table,
lines,
overwrite,
encoding,
tablename = base_name(filename),
...){

tbl_name <- base_name(filename)

if(!assert_overwrite_db(db_con, tbl_name, overwrite)){

if(!assert_overwrite_db(db_con, tablename, overwrite)){
return(NULL)
}



con <- compressed_file(filename, "r")
con <- compressed_file(filename, "r", encoding = encoding)
on.exit(close(con))

## Handle case of `col_names != TRUE`?
header <- readLines(con, n = 1L)
## readr method needs UTF-8 encoding for these newlines to be newlines
header <- read_lines(con, n = 1L, encoding = encoding)
if(length(header) == 0){ # empty file, would throw error
return(invisible(db_con))
}
reader <- read_chunked(con, lines)
reader <- read_chunked(con, lines, encoding)

# May throw an error if we need to read more than 'total' chunks?
p <- progress::progress_bar$new("[:spin] chunk :current", total = 100000)
@@ -110,7 +137,7 @@ unark_file <- function(filename, db_con, streamable_table, lines = 10000L, overw
body <- paste0(c(header, d$data), "\n", collapse = "")
p$tick()
chunk <- streamable_table$read(body, ...)
DBI::dbWriteTable(db_con, tbl_name, chunk, append=TRUE)
DBI::dbWriteTable(db_con, tablename, chunk, append=TRUE)

if (d$complete) {
break
@@ -126,16 +153,16 @@ unark_file <- function(filename, db_con, streamable_table, lines = 10000L, overw
# https://github.com/vimc/montagu-r
# /blob/4fe82fd29992635b30e637d5412312b0c5e3e38f/R/util.R#L48-L60

read_chunked <- function(con, n) {
read_chunked <- function(con, n, encoding) {
assert_connection(con)
next_chunk <- readLines(con, n)
next_chunk <- read_lines(con, n, encoding = encoding)
if (length(next_chunk) == 0L) {
warning("connection has already been completely read")
return(function() list(data = character(0), complete = TRUE))
}
function() {
data <- next_chunk
next_chunk <<- readLines(con, n)
next_chunk <<- read_lines(con, n, encoding = encoding)
complete <- length(next_chunk) == 0L
list(data = data, complete = complete)
}
@@ -148,7 +175,10 @@ base_name <- function(filename){
ext_regex <- "(?<!^|[.])[.][^.]+$"
path <- sub(ext_regex, "", path, perl = TRUE)
path <- sub(ext_regex, "", path, perl = TRUE)
sub(ext_regex, "", path, perl = TRUE)
path <- sub(ext_regex, "", path, perl = TRUE)
## Remove characters not permitted in table names
path <- gsub("[^a-zA-Z0-9_]", "_", path, perl = TRUE)
tolower(path)
}

#' @importFrom tools file_ext
@@ -160,3 +190,32 @@ compressed_file <- function(path, ...){
zip = unz(path, ...),
file(path, ...))
}


read_lines <- function(con,
                       n,
                       encoding = "unknown",
                       warn = FALSE){
  readLines(con,
            n = n,
            encoding = encoding,
            warn = warn)
}

guess_stream <- function(x){
ext <- tools::file_ext(x)
## if compressed, chop off that and try again
if(ext %in% c("gz", "bz2", "xz", "zip")){
ext <- tools::file_ext(gsub("\\.([[:alnum:]]+)$", "", x))
}
streamable_table <-
switch(ext,
"csv" = streamable_base_csv(),
"tsv" = streamable_base_tsv(),
stop(paste("Streaming file parser could not be",
"guessed from file extension.",
"Please specify a streamable_table option"))
)
streamable_table
}
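
If the extension is not recognised at all, `unark()` stops with the error above, and the parser has to be named explicitly. A short sketch, assuming an open DBI connection `db` and a hypothetical file `observations.txt`:

```r
# "txt" matches neither branch of the switch, so pass the parser directly
unark("observations.txt", db,
      streamable_table = streamable_base_tsv(),
      overwrite = TRUE)
```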
4 changes: 2 additions & 2 deletions codemeta.json
@@ -10,7 +10,7 @@
"codeRepository": "https://github.com/ropensci/arkdb",
"issueTracker": "https://github.com/ropensci/arkdb/issues",
"license": "https://spdx.org/licenses/MIT",
"version": "0.0.3",
"version": "0.0.3.9000",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
@@ -237,7 +237,7 @@
],
"releaseNotes": "https://github.com/ropensci/arkdb/blob/master/NEWS.md",
"readme": "https://github.com/ropensci/arkdb/blob/master/README.md",
"fileSize": "16.996KB",
"fileSize": "20.527KB",
"contIntegration": [
"https://travis-ci.org/cboettig/arkdb",
"https://codecov.io/github/cboettig/arkdb?branch=master",
59 changes: 59 additions & 0 deletions inst/examples/fao.R
@@ -0,0 +1,59 @@
library(arkdb)
#unzip("~/Desktop/FAOSTAT.zip")
#lapply(x, unzip)

x <- list.files("~/FAOSTAT/", pattern="[.]csv",full.names = TRUE)
dbdir <- rappdirs::user_data_dir("faostat")
#fs::dir_delete(dbdir)
db <- DBI::dbConnect(MonetDBLite::MonetDBLite(), dbdir)


### using the readr parser ###
#options(encoding = "latin2") # Must enforce UTF-8 for readr parsing
unark(x[[1]],
db,
#streamable_table = streamable_readr_csv(), # either works
streamable_table = streamable_base_csv(),
lines = 5e5,
overwrite = TRUE,
encoding = "latin2")



## Inspect
tbls <- DBI::dbListTables(db)
DBI::dbListFields(db, tbls[[1]])
library(tidyverse)
tbl(db, tbls[[1]]) %>% select(Country) %>% distinct() %>% collect() %>% pull(Country)


#############################################################
### Alternative Approach: custom streamable_table method ####
#############################################################

## A slightly modified base read.csv function is used here to standardize column names
read <- function(file, ...) {
tbl <- utils::read.table(textConnection(file), header = TRUE,
sep = ",", quote = "\"", stringsAsFactors = FALSE,
...)
## ADDING THESE LINES to the default method. use lowercase column names
names(tbl) <- tolower(names(tbl))
names(tbl) <- gsub("\\.", "_", names(tbl))
tbl
}
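## Alternatively, a readr-based parser can be used; note that the next
## definition of `read` below overrides the base-R version above.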


read <- function(file, ...) {
readr::read_csv(file = file, ...)
}

write <- function(x, path, omit_header) {
utils::write.table(x, file = path, sep = ",", quote = TRUE,
qmethod = "double", row.names = FALSE, col.names = !omit_header,
append = omit_header)
}
stream <- arkdb::streamable_table(read, write, "csv")



unark(x, db, streamable_table = stream, lines = 5e5, overwrite = TRUE)
7 changes: 4 additions & 3 deletions man/ark.Rd


11 changes: 9 additions & 2 deletions man/unark.Rd


11 changes: 7 additions & 4 deletions vignettes/medium-data.Rmd
@@ -3,16 +3,19 @@ title: "Working with medium-sized data"
author: "Carl Boettiger"
date: "9/22/2018"
output: pdf_document
vignette: >
%\VignetteIndexEntry{working_with_data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

Over the past summer, I have written two small-ish R packages to address challenges I frequently run up against during the course of my research. Both are challenges with what I will refer to as medium-sized data -- not the kind of petabyte scale "big data" which precludes analysis on standard hardware or existing methodology, but large enough that the size alone starts creating problems for certain bits of a typical workflow. More precisely, I will take *medium-sized* to refer to data that is too large to comfortably fit in memory on most laptops (e.g. on the order of several GB), or data that is merely too large to commit to GitHub. By *typical workflow*, I mean easily being able to share all parts of analysis publicly or privately with collaborators (or merely different machines, such as my laptop and cloud server) who should be able to reproduce the results with minimal fuss and configuration.

Over the past summer, I have written two small-ish R packages to address challenges I frequently run up against during the course of my research. Both are challenges with what I will refer to as medium-sized data -- not the kind of petabyte scale "big data" which percludes analysis on standard hardware or existing methodology, but large enough that the size alone starts creating problems for certain bits of a typical workflow. More precisely, I will take *medium-sized* to refer to data that is too large to comfortably fit in memory on most laptops (e.g. on the order of several GB), or data that is merely too large to commit to GitHub. By *typical workflow*, I mean easily being able to share all parts of analysis publicly or privately with collaborators (or merely different machines, such as my laptop and cloud server) who should be able to reproduce the results with minimal fuss and configuration.

For data too large to fit into memory, there's already a well-established solution of using an external database, to store the data. Thanks to `dplyr`'s database backends, many R users can adapt their workflow relatively seemlessly to move from `dplyr` commands that call in-memory data frames to identical or nearly identical commands that call a database. This all works pretty well when your data *is already in a database*, but getting it onto a database, and then moving the data around so that other people/machines can access it is not nearly so straight forward. So far, this part of the problem has recieved relatively little attention.
For data too large to fit into memory, there's already a well-established solution of using an external database, to store the data. Thanks to `dplyr`'s database backends, many R users can adapt their workflow relatively seamlessly to move from `dplyr` commands that call in-memory data frames to identical or nearly identical commands that call a database. This all works pretty well when your data *is already in a database*, but getting it onto a database, and then moving the data around so that other people/machines can access it is not nearly so straight forward. So far, this part of the problem has received relatively little attention.

The reason is that the usual response to this problem is "you're doing it wrong." The standard practice in this context is simply not to move the data at all. A central database server, usually with access controlled by password or other credential, can allow multiple users to all query the same database directly. Thanks to the magical abstractions of SQL queries such as the `DBI` package, the user (aka client) doesn't need to care about the details of where the database is located, or even what particular backend is used. Moving all that data around can be slow and expensive. Arbitrarily large data can be housed in a central/cloud location and provisioned with enough resources to store everything and process complex queries. Consequently, just about every database backend not only provides a mechanism for doing your `SQL` / `dplyr` querying, filtering, joining, etc. on data that cannot fit into memory all at once, but also nearly every such backend provides *server* abilities to do so over a network connection, handling secure logins and so forth. Why would you want to do anything else?

The problem with the usual response is that it is often at odds with our original objectives and typical scientific workflows. Setting up a database server can be non-trivial; by which I mean: difficult to automate in a portable/cross-platform manner when working entirely from R. More importantly, it reflects a use-case more typical of industry context than scientific practice. Individual researchers need to make data avialable to a global community of scientists who can reproduce results years or decades later; not just to a handful of employees who can be granted authenticated access to a central database. Archiving data as static text files is far more *scalable*, more *cost-effective* (storing static files is much cheaper than keeping a database server running), more *future-proof* (rapid evolution in database technology is not always backwards compatible) and simplifies or *avoids most security issues* involved in maintaining a public server. In the scientific context, it almost always makes more sense to move the data after all.
The problem with the usual response is that it is often at odds with our original objectives and typical scientific workflows. Setting up a database server can be non-trivial; by which I mean: difficult to automate in a portable/cross-platform manner when working entirely from R. More importantly, it reflects a use-case more typical of industry context than scientific practice. Individual researchers need to make data available to a global community of scientists who can reproduce results years or decades later; not just to a handful of employees who can be granted authenticated access to a central database. Archiving data as static text files is far more *scalable*, more *cost-effective* (storing static files is much cheaper than keeping a database server running), more *future-proof* (rapid evolution in database technology is not always backwards compatible) and simplifies or *avoids most security issues* involved in maintaining a public server. In the scientific context, it almost always makes more sense to move the data after all.

Scientific data repositories are already built on precisely this model: providing long term storage of files that can be downloaded and analyzed locally. For smaller `.csv` files, this works pretty well.

