tweaks to unark for more robust parsing (#19)
- `unark()` will strip out non-compliant characters in table names by default.
- `unark()` gains the optional argument `tablenames`, allowing the user to specify the corresponding table names manually, rather than enforcing that they correspond with the incoming file names. Closes #18.
- `unark()` gains the argument `encoding`, allowing users to directly set the encoding of incoming files. Previously this could only be done by setting `options(encoding)`, which still works as well. See the `FAO.R` example in `examples` for an illustration.
- `unark()` will now attempt to guess which streaming parser to use (e.g. `csv` or `tsv`) based on the file extension pattern, rather than defaulting to a `tsv` parser. (`ark()` still defaults to exporting in the more portable `tsv` format.)
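The table-name cleanup and parser guessing described above are stated as behaviour, not implementation. As a rough illustration only, the sketch below shows one way such defaults could be derived from a file name in base R. The helpers `sanitize_tablename()` and `guess_format()` are hypothetical names invented for this example; they are not arkdb functions, and the rules arkdb actually applies may differ.

```r
# Hypothetical helpers, NOT part of arkdb: they only illustrate deriving a
# table name and a parser format from an incoming file name.

# Drop the directory and known extensions (including compression suffixes),
# then replace characters that are awkward in SQL identifiers with "_".
sanitize_tablename <- function(path) {
  name <- basename(path)
  name <- sub("\\.(csv|tsv|txt)(\\.(gz|bz2|xz))?$", "", name, ignore.case = TRUE)
  name <- gsub("[^A-Za-z0-9_]", "_", name)
  # Avoid identifiers that start with a digit
  if (grepl("^[0-9]", name)) name <- paste0("t_", name)
  name
}

# Choose "csv" when the extension says so, otherwise fall back to "tsv"
guess_format <- function(path) {
  if (grepl("\\.csv(\\.(gz|bz2|xz))?$", path, ignore.case = TRUE)) "csv" else "tsv"
}

# Made-up file names for illustration
files <- c("~/FAOSTAT/Production_Crops_E_All_Data.csv",
           "data/my table (v2).tsv.bz2")
vapply(files, sanitize_tablename, character(1), USE.NAMES = FALSE)
vapply(files, guess_format, character(1), USE.NAMES = FALSE)
```

When the derived names are not what is wanted, the `tablenames` argument introduced in this commit lets the caller supply them directly.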
Showing 8 changed files with 182 additions and 36 deletions.
@@ -0,0 +1,59 @@
library(arkdb)
#unzip("~/Desktop/FAOSTAT.zip")
#lapply(x, unzip)

x <- list.files("~/FAOSTAT/", pattern = "[.]csv", full.names = TRUE)
dbdir <- rappdirs::user_data_dir("faostat")
#fs::dir_delete(dbdir)
db <- DBI::dbConnect(MonetDBLite::MonetDBLite(), dbdir)


### using the readr parser ###
#options(encoding = "latin2") # Must enforce UTF-8 for readr parsing
unark(x[[1]],
      db,
      #streamable_table = streamable_readr_csv(), # either works
      streamable_table = streamable_base_csv(),
      lines = 5e5,
      overwrite = TRUE,
      encoding = "latin2")



## Inspect
tbls <- DBI::dbListTables(db)
DBI::dbListFields(db, tbls[[1]])
library(tidyverse)
tbl(db, tbls[[1]]) %>% select(Country) %>% distinct() %>% collect() %>% pull(Country)


#############################################################
### Alternative Approach: custom streamable_table method ####
#############################################################

## A slightly modified base read.csv function is used here to standardize column names
read <- function(file, ...) {
  tbl <- utils::read.table(textConnection(file), header = TRUE,
                           sep = ",", quote = "\"", stringsAsFactors = FALSE,
                           ...)
  ## ADDING THESE LINES to the default method. use lowercase column names
  names(tbl) <- tolower(names(tbl))
  names(tbl) <- gsub("\\.", "_", names(tbl))
  tbl
}


## Note: this second definition replaces the `read` defined above
read <- function(file, ...) {
  readr::read_csv(file = file, ...)
}

write <- function(x, path, omit_header) {
  utils::write.table(x, file = path, sep = ",", quote = TRUE,
                     qmethod = "double", row.names = FALSE, col.names = !omit_header,
                     append = omit_header)
}
stream <- arkdb::streamable_table(read, write, "csv")



unark(x, db, streamable_table = stream, lines = 5e5, overwrite = TRUE)
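To see what the column-name standardization in the base-R `read` function of the example does, that function can be applied on its own to a few lines of CSV text. This is a small sketch, not part of the commit; the sample lines are made up for illustration and are not FAOSTAT records.

```r
# Same base-R reader as in the example: `file` holds lines of CSV text,
# which are handed to read.table() through a text connection.
read <- function(file, ...) {
  tbl <- utils::read.table(textConnection(file), header = TRUE,
                           sep = ",", quote = "\"", stringsAsFactors = FALSE,
                           ...)
  # Lowercase the names, then turn the dots read.table() introduces into "_"
  names(tbl) <- tolower(names(tbl))
  names(tbl) <- gsub("\\.", "_", names(tbl))
  tbl
}

# Made-up sample chunk (not real FAOSTAT data)
chunk <- c('"Country Code","Country","Item Code","Value"',
           '1,"Armenia",15,10.5',
           '2,"Afghanistan",15,20')

out <- read(chunk)
names(out)
```

Because `read.table()` uses `check.names = TRUE` by default, a header such as `Country Code` first becomes `Country.Code`; the two added lines then yield `country_code`, which is easier to use as a database column name.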