tweaks to unark for more robust parsing #19

cboettig · 2018-09-25T22:14:24Z

- `unark()` will strip out non-compliant characters by default.
- `unark()` is also more flexible, allowing the user to specify the corresponding table names manually rather than requiring that they match the incoming csv file names. #18
- Technical tweak: the `readLines` call inside `unark()` will use the encoding directly from `getOption("encoding")`, e.g. allowing the encoding to be set to UTF-8.

This can resolve parsing errors when using the readr parser on certain files. See the `FAO.R` example in `examples` for an illustration.
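
As a rough illustration of the encoding tweak, the base R sketch below reads a Latin-1 file with the encoding taken from an option instead of being hard-coded. The option name `demo.encoding` and the helper `read_with_option()` are hypothetical, chosen so the sketch does not touch the real `encoding` option; this is not the code inside `unark()`.

```r
# Write a small file whose second line ends in a Latin-1 byte (0xE9, e-acute).
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name\ncaf"), as.raw(0xE9), charToRaw("\n")), path)

# Hypothetical helper: read lines and mark them with the encoding named in an option.
# readLines() accepts "unknown", "UTF-8", "latin1" or "bytes" for `encoding`.
read_with_option <- function(path, option = "demo.encoding") {
  enc <- getOption(option, default = "unknown")
  readLines(path, encoding = enc)
}

options(demo.encoding = "latin1")
lines <- read_with_option(path)

# Convert to UTF-8 so that downstream parsers see one consistent encoding.
utf8_lines <- enc2utf8(lines)
```

Note that the `encoding` argument of `readLines()` only marks strings as Latin-1 or UTF-8. Other encodings, such as Latin-2, need a connection opened with `file(path, encoding = ...)` or a call to `iconv()`.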

cc @noamross thanks for reporting these issues and maybe for testing this out too.
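
For the first two points of the description, here is a minimal base R sketch of the idea. It assumes that "non-compliant" means any character other than a letter, digit, or underscore, which the description does not spell out. The helpers `sanitize_table_name()` and `resolve_table_names()` and the file paths are hypothetical and are not taken from the package.

```r
# Hypothetical helper: derive a table name from a file path and strip
# every character that is not a letter, digit, or underscore (an assumption).
sanitize_table_name <- function(path) {
  name <- basename(path)
  # Drop everything from the first dot onward (e.g. ".csv", ".tsv.bz2").
  name <- sub("[.].*$", "", name)
  gsub("[^A-Za-z0-9_]", "", name)
}

# Hypothetical helper: use names supplied by the caller when given,
# otherwise fall back to names derived from the file names.
resolve_table_names <- function(files, tablenames = NULL) {
  if (is.null(tablenames)) {
    return(vapply(files, sanitize_table_name, character(1), USE.NAMES = FALSE))
  }
  stopifnot(length(tablenames) == length(files))
  tablenames
}

files <- c("data/FAO-capture 2018.csv", "data/flights.tsv.bz2")
derived <- resolve_table_names(files)
manual <- resolve_table_names(files, tablenames = c("fao_capture", "flights"))
```

Requiring one supplied name per file keeps the mapping between files and tables explicit.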

cboettig added 6 commits September 25, 2018 15:12

update

53aea82

switch example to latin1

2ca50c2

Though based on `stringi::stri_enc_detect`, the encoding may actually be ISO-8859-2 instead of ISO-8859-1 (latin1)? Though that causes other parsing errors...

fix bug

adfdd90

all is good with latin2.

a9ffd14

stringi's guess was correct, we just needed to use R's short name instead of the official encoding name in `options`.

adds option to pass encoding directly.

1ca09cc

also adds ability for unark to guess csv vs tsv.
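
The commit message does not say how the guess is made. One plausible approach, sketched here in base R, is to look at the file extension after dropping any compression suffix. The helper `guess_delim()` is hypothetical and is not the package's implementation.

```r
# Hypothetical helper: guess the delimiter from the file extension,
# ignoring a trailing compression suffix such as ".bz2", ".gz" or ".xz".
guess_delim <- function(path) {
  name <- sub("[.](bz2|gz|xz)$", "", basename(path))
  ext <- tolower(tools::file_ext(name))
  switch(ext,
    csv = ",",
    tsv = "\t",
    stop("Cannot guess delimiter for file: ", path)
  )
}

tsv_delim <- guess_delim("flights.tsv.bz2")
csv_delim <- guess_delim("FAO.csv")
```

Failing loudly on an unknown extension avoids silently parsing a file with the wrong delimiter.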

cboettig merged commit 04f353c into master Sep 26, 2018

cboettig deleted the patch-tablename branch September 26, 2018 23:08
