tweaks to unark for more robust parsing (#19)
- `unark()` will strip out non-compliant characters in table names by default.

- `unark()` gains the optional argument `tablenames`, allowing the user to
   specify the corresponding table names manually, rather than requiring that
   they correspond to the incoming file names.
   closes #18

-  `unark()` gains the argument `encoding`, allowing users to directly set
   the encoding of incoming files.  Previously this could only be set via
   `options(encoding)`, which still works as well.  See the
   `fao.R` example in `inst/examples` for an illustration.

- `unark()` will now attempt to guess which streaming parser to use 
   (e.g. `csv` or `tsv`) based on the file extension pattern, rather than
   defaulting to a `tsv` parser.  (`ark()` still defaults to exporting in
   the more portable `tsv` format).
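
Taken together, a minimal usage sketch of the new behaviour (the SQLite connection, file paths, and table names below are illustrative assumptions, not part of this commit):

```r
library(arkdb)

# Hypothetical input files: the streaming parser is guessed from the
# file extension (csv vs tsv), even behind a compression suffix.
files <- c("data/Trade_Crops_E_All_Data.csv.gz",
           "data/prices-2018.tsv.bz2")

# Assumes the RSQLite package is available; any DBI connection works.
db <- DBI::dbConnect(RSQLite::SQLite(), "example.sqlite")

# Table names default to the sanitized, lowercased file basenames;
# `tablenames` overrides that, and `encoding` is applied to the input.
unark(files, db,
      tablenames = c("trade_crops", "prices_2018"),
      encoding   = "latin1",
      overwrite  = TRUE)

DBI::dbDisconnect(db)
```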
cboettig committed Sep 26, 2018
1 parent e92e1d2 commit 04f353c
Showing 8 changed files with 182 additions and 36 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,5 +1,5 @@
Package: arkdb
Version: 0.0.3
Version: 0.0.3.9000
Title: Archive and Unarchive Databases Using Flat Files
Description: Flat text files provide a robust, compressible, and portable
way to store tables from databases. This package provides convenient
@@ -18,7 +18,7 @@ Encoding: UTF-8
LazyData: true
ByteCompile: true
VignetteBuilder: knitr
RoxygenNote: 6.0.1.9000
RoxygenNote: 6.1.0
Roxygen: list(markdown = TRUE)
Imports:
DBI,
19 changes: 18 additions & 1 deletion NEWS.md
@@ -1,4 +1,21 @@
# arkdb 0.0.3

# arkdb 0.0.4

- `unark()` will strip out non-compliant characters in table names by default.
- `unark()` gains the optional argument `tablenames`, allowing the user to
  specify the corresponding table names manually, rather than requiring that
  they correspond to the incoming file names.
[#18](https://github.com/ropensci/arkdb/issues/18)
- `unark()` gains the argument `encoding`, allowing users to directly set
  the encoding of incoming files. Previously this could only be set via
  `options(encoding)`, which still works as well. See the
  `fao.R` example in `inst/examples` for an illustration.
- `unark()` will now attempt to guess which streaming parser to use
  (e.g. `csv` or `tsv`) based on the file extension pattern, rather than
defaulting to a `tsv` parser. (`ark()` still defaults to exporting in
the more portable `tsv` format).

# arkdb 0.0.3 2018-09-11

* Remove dependency on utils::askYesNo for backward compatibility, [#17](https://github.com/ropensci/arkdb/issues/17)

103 changes: 81 additions & 22 deletions R/unark.R
@@ -9,6 +9,10 @@
#' default is "ask", which will ask for confirmation in an interactive session, and
#' overwrite in a non-interactive script. TRUE will always overwrite, FALSE will
#' always skip such tables.
#' @param encoding encoding to be assumed for input files.
#' @param tablenames vector of table names to be used for the corresponding files.
#' By default, tables are named using the lowercase file basename, with
#' special characters replaced by underscores (for SQL compatibility).
#' @param ... additional arguments to `streamable_table$read` method.
#' @details `unark` will read in files in chunks and
#' write them into a database. This is essential for processing
@@ -46,24 +50,41 @@
#' @export
unark <- function(files,
db_con,
streamable_table = streamable_base_tsv(),
streamable_table = NULL,
lines = 50000L,
overwrite = "ask",
encoding = Sys.getenv("encoding", "UTF-8"),
tablenames = NULL,
...){

assert_files_exist(files)
assert_dbi(db_con)
assert_streamable(streamable_table)

## Guess streamable table
if(is.null(streamable_table)){
streamable_table <- guess_stream(files[[1]])
}

assert_streamable(streamable_table)


if(is.null(tablenames)){
tablenames <- vapply(files, base_name, character(1))
}

db <- normalize_con(db_con)
lapply(files,
unark_file,
db,
streamable_table = streamable_table,
lines = lines,
overwrite = overwrite,
...)

lapply(seq_along(files),
function(i){
unark_file(files[[i]],
db_con = db,
streamable_table = streamable_table,
lines = lines,
overwrite = overwrite,
encoding = encoding,
tablename = tablenames[[i]],
...)
})
invisible(db_con)
}

@@ -77,28 +98,34 @@ normalize_con <- function(db_con){
}
}


#' @importFrom DBI dbWriteTable
#' @importFrom progress progress_bar
unark_file <- function(filename, db_con, streamable_table, lines = 10000L, overwrite, ...){
unark_file <- function(filename,
db_con,
streamable_table,
lines,
overwrite,
encoding,
tablename = base_name(filename),
...){

tbl_name <- base_name(filename)

if(!assert_overwrite_db(db_con, tbl_name, overwrite)){

if(!assert_overwrite_db(db_con, tablename, overwrite)){
return(NULL)
}



con <- compressed_file(filename, "r")
con <- compressed_file(filename, "r", encoding = encoding)
on.exit(close(con))

## Handle case of `col_names != TRUE`?
header <- readLines(con, n = 1L)
## readr method needs UTF-8 encoding for these newlines to be newlines
header <- read_lines(con, n = 1L, encoding = encoding)
if(length(header) == 0){ # empty file, would throw error
return(invisible(db_con))
}
reader <- read_chunked(con, lines)
reader <- read_chunked(con, lines, encoding)

# May throw an error if we need to read more than 'total' chunks?
p <- progress::progress_bar$new("[:spin] chunk :current", total = 100000)
@@ -110,7 +137,7 @@ unark_file <- function(filename, db_con, streamable_table, lines = 10000L, overw
body <- paste0(c(header, d$data), "\n", collapse = "")
p$tick()
chunk <- streamable_table$read(body, ...)
DBI::dbWriteTable(db_con, tbl_name, chunk, append=TRUE)
DBI::dbWriteTable(db_con, tablename, chunk, append=TRUE)

if (d$complete) {
break
@@ -126,16 +153,16 @@ unark_file <- function(filename, db_con, streamable_table, lines = 10000L, overw
# https://github.com/vimc/montagu-r
# /blob/4fe82fd29992635b30e637d5412312b0c5e3e38f/R/util.R#L48-L60

read_chunked <- function(con, n) {
read_chunked <- function(con, n, encoding) {
assert_connection(con)
next_chunk <- readLines(con, n)
next_chunk <- read_lines(con, n, encoding = encoding)
if (length(next_chunk) == 0L) {
warning("connection has already been completely read")
return(function() list(data = character(0), complete = TRUE))
}
function() {
data <- next_chunk
next_chunk <<- readLines(con, n)
next_chunk <<- read_lines(con, n, encoding = encoding)
complete <- length(next_chunk) == 0L
list(data = data, complete = complete)
}
@@ -148,7 +175,10 @@ base_name <- function(filename){
ext_regex <- "(?<!^|[.])[.][^.]+$"
path <- sub(ext_regex, "", path, perl = TRUE)
path <- sub(ext_regex, "", path, perl = TRUE)
sub(ext_regex, "", path, perl = TRUE)
path <- sub(ext_regex, "", path, perl = TRUE)
## Remove characters not permitted in table names
path <- gsub("[^a-zA-Z0-9_]", "_", path, perl = TRUE)
tolower(path)
}

#' @importFrom tools file_ext
@@ -160,3 +190,32 @@ compressed_file <- function(path, ...){
zip = unz(path, ...),
file(path, ...))
}


read_lines <- function(con,
                       n,
                       encoding = "unknown",
                       warn = FALSE){
  readLines(con,
            n = n,
            encoding = encoding,
            warn = warn)
}

guess_stream <- function(x){
ext <- tools::file_ext(x)
## if compressed, chop off that and try again
if(ext %in% c("gz", "bz2", "xz", "zip")){
ext <- tools::file_ext(gsub("\\.([[:alnum:]]+)$", "", x))
}
streamable_table <-
switch(ext,
"csv" = streamable_base_csv(),
"tsv" = streamable_base_tsv(),
stop(paste("Streaming file parser could not be",
"guessed from file extension.",
"Please specify a streamable_table option"))
)
streamable_table
}
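
If the extension is not recognised at all, `unark()` stops with the error above, and the parser has to be named explicitly. A short sketch, assuming an open DBI connection `db` and a hypothetical file `observations.txt`:

```r
# "txt" matches neither branch of the switch, so pass the parser directly
unark("observations.txt", db,
      streamable_table = streamable_base_tsv(),
      overwrite = TRUE)
```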
4 changes: 2 additions & 2 deletions codemeta.json
@@ -10,7 +10,7 @@
"codeRepository": "https://github.com/ropensci/arkdb",
"issueTracker": "https://github.com/ropensci/arkdb/issues",
"license": "https://spdx.org/licenses/MIT",
"version": "0.0.3",
"version": "0.0.3.9000",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
@@ -237,7 +237,7 @@
],
"releaseNotes": "https://github.com/ropensci/arkdb/blob/master/NEWS.md",
"readme": "https://github.com/ropensci/arkdb/blob/master/README.md",
"fileSize": "16.996KB",
"fileSize": "20.527KB",
"contIntegration": [
"https://travis-ci.org/cboettig/arkdb",
"https://codecov.io/github/cboettig/arkdb?branch=master",
59 changes: 59 additions & 0 deletions inst/examples/fao.R
@@ -0,0 +1,59 @@
library(arkdb)
#unzip("~/Desktop/FAOSTAT.zip")
#lapply(x, unzip)

x <- list.files("~/FAOSTAT/", pattern="[.]csv",full.names = TRUE)
dbdir <- rappdirs::user_data_dir("faostat")
#fs::dir_delete(dbdir)
db <- DBI::dbConnect(MonetDBLite::MonetDBLite(), dbdir)


### using the readr parser ###
#options(encoding = "latin2") # Must enforce UTF-8 for readr parsing
unark(x[[1]],
db,
#streamable_table = streamable_readr_csv(), # either works
streamable_table = streamable_base_csv(),
lines = 5e5,
overwrite = TRUE,
encoding = "latin2")



## Inspect
tbls <- DBI::dbListTables(db)
DBI::dbListFields(db, tbls[[1]])
library(tidyverse)
tbl(db, tbls[[1]]) %>% select(Country) %>% distinct() %>% collect() %>% pull(Country)


#############################################################
### Alternative Approach: custom streamable_table method ####
#############################################################

## A slightly modified base read.csv function is used here to standardize column names
read <- function(file, ...) {
tbl <- utils::read.table(textConnection(file), header = TRUE,
sep = ",", quote = "\"", stringsAsFactors = FALSE,
...)
## ADDING THESE LINES to the default method. use lowercase column names
names(tbl) <- tolower(names(tbl))
names(tbl) <- gsub("\\.", "_", names(tbl))
tbl
}
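## Alternatively, a readr-based parser can be used; note that the next
## definition of `read` below overrides the base-R version above.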


read <- function(file, ...) {
readr::read_csv(file = file, ...)
}

write <- function(x, path, omit_header) {
utils::write.table(x, file = path, sep = ",", quote = TRUE,
qmethod = "double", row.names = FALSE, col.names = !omit_header,
append = omit_header)
}
stream <- arkdb::streamable_table(read, write, "csv")



unark(x, db, streamable_table = stream, lines = 5e5, overwrite = TRUE)
7 changes: 4 additions & 3 deletions man/ark.Rd


11 changes: 9 additions & 2 deletions man/unark.Rd


11 changes: 7 additions & 4 deletions vignettes/medium-data.Rmd
@@ -3,16 +3,19 @@ title: "Working with medium-sized data"
author: "Carl Boettiger"
date: "9/22/2018"
output: pdf_document
vignette: >
%\VignetteIndexEntry{working_with_data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

Over the past summer, I have written two small-ish R packages to address challenges I frequently run up against during the course of my research. Both are challenges with what I will refer to as medium-sized data -- not the kind of petabyte scale "big data" which precludes analysis on standard hardware or existing methodology, but large enough that the size alone starts creating problems for certain bits of a typical workflow. More precisely, I will take *medium-sized* to refer to data that is too large to comfortably fit in memory on most laptops (e.g. on the order of several GB), or data that is merely too large to commit to GitHub. By *typical workflow*, I mean easily being able to share all parts of analysis publicly or privately with collaborators (or merely different machines, such as my laptop and cloud server) who should be able to reproduce the results with minimal fuss and configuration.

Over the past summer, I have written two small-ish R packages to address challenges I frequently run up against during the course of my research. Both are challenges with what I will refer to as medium-sized data -- not the kind of petabyte scale "big data" which percludes analysis on standard hardware or existing methodology, but large enough that the size alone starts creating problems for certain bits of a typical workflow. More precisely, I will take *medium-sized* to refer to data that is too large to comfortably fit in memory on most laptops (e.g. on the order of several GB), or data that is merely too large to commit to GitHub. By *typical workflow*, I mean easily being able to share all parts of analysis publicly or privately with collaborators (or merely different machines, such as my laptop and cloud server) who should be able to reproduce the results with minimal fuss and configuration.

For data too large to fit into memory, there's already a well-established solution of using an external database, to store the data. Thanks to `dplyr`'s database backends, many R users can adapt their workflow relatively seemlessly to move from `dplyr` commands that call in-memory data frames to identical or nearly identical commands that call a database. This all works pretty well when your data *is already in a database*, but getting it onto a database, and then moving the data around so that other people/machines can access it is not nearly so straight forward. So far, this part of the problem has recieved relatively little attention.
For data too large to fit into memory, there's already a well-established solution of using an external database, to store the data. Thanks to `dplyr`'s database backends, many R users can adapt their workflow relatively seamlessly to move from `dplyr` commands that call in-memory data frames to identical or nearly identical commands that call a database. This all works pretty well when your data *is already in a database*, but getting it onto a database, and then moving the data around so that other people/machines can access it is not nearly so straight forward. So far, this part of the problem has received relatively little attention.

The reason is that the usual response to this problem is "you're doing it wrong." The standard practice in this context is simply not to move the data at all. A central database server, usually with access controlled by password or other credential, can allow multiple users to all query the same database directly. Thanks to the magical abstractions of SQL queries such as the `DBI` package, the user (aka client) doesn't need to care about the details of where the database is located, or even what particular backend is used. Moving all that data around can be slow and expensive. Arbitrarily large data can be housed in a central/cloud location and provisioned with enough resources to store everything and process complex queries. Consequently, just about every database backend not only provides a mechanism for doing your `SQL` / `dplyr` querying, filtering, joining, etc. on data that cannot fit into memory all at once, but also nearly every such backend provides *server* abilities to do so over a network connection, handling secure logins and so forth. Why would you want to do anything else?

The problem with the usual response is that it is often at odds with our original objectives and typical scientific workflows. Setting up a database server can be non-trivial; by which I mean: difficult to automate in a portable/cross-platform manner when working entirely from R. More importantly, it reflects a use-case more typical of industry context than scientific practice. Individual researchers need to make data avialable to a global community of scientists who can reproduce results years or decades later; not just to a handful of employees who can be granted authenticated access to a central database. Archiving data as static text files is far more *scalable*, more *cost-effective* (storing static files is much cheaper than keeping a database server running), more *future-proof* (rapid evolution in database technology is not always backwards compatible) and simplifies or *avoids most security issues* involved in maintaining a public server. In the scientific context, it almost always makes more sense to move the data after all.
The problem with the usual response is that it is often at odds with our original objectives and typical scientific workflows. Setting up a database server can be non-trivial; by which I mean: difficult to automate in a portable/cross-platform manner when working entirely from R. More importantly, it reflects a use-case more typical of industry context than scientific practice. Individual researchers need to make data available to a global community of scientists who can reproduce results years or decades later; not just to a handful of employees who can be granted authenticated access to a central database. Archiving data as static text files is far more *scalable*, more *cost-effective* (storing static files is much cheaper than keeping a database server running), more *future-proof* (rapid evolution in database technology is not always backwards compatible) and simplifies or *avoids most security issues* involved in maintaining a public server. In the scientific context, it almost always makes more sense to move the data after all.

Scientific data repositories are already built on precisely this model: providing long term storage of files that can be downloaded and analyzed locally. For smaller `.csv` files, this works pretty well.

