Dealing with horrible files #94

Ironholds · 2015-03-19T15:49:41Z

Some humans - some terrible, terrible humans - leave tabs in content-insertable fields. Tabs that are not consistently escaped. Would it be possible to add an option to read_delim that, if set to FALSE, continues current behaviour around one row unexpectedly having an extra field and, if set to TRUE, mashes the extra field into the last "expected" field and issues a warning noting the row number?

hadley · 2015-03-20T14:16:46Z

I think so. Could you outline a few examples? (maybe using commas so it's a bit easier to see)

Ironholds · 2015-03-20T16:19:14Z

Totally!

h1, h2, h3
1, 6, A RUN-ON SENTENCE (sometimes called a "fused sentence")
1, 9, has at least two parts, either one of which can stand by itself
5, 4, (in other words, two independent clauses)

We've got an unquoted text field, which shouldn't be a thing that exists but is, because humans are terrible. As a result, any comma is (potentially) a delimiter; it's impossible for a machine to tell.

In R-core, the response is to create an additional field at the end and then (confusingly) warn you about incomplete lines. In readr, the response is to say "more columns than column names, abort!". It would be nice to have a way of solving for this by saying "I don't care if you've got a comma in you, field, I've been told there are six fields and you're no.7 so I'm going to issue a warning and then fields[6] += fields[7]. In you go now."
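A minimal base-R sketch of the "in you go now" behaviour described above, assuming the expected number of fields is known up front. `merge_extra_fields` is a hypothetical helper written for illustration, not something readr provides, and it ignores quoting entirely.

```r
# Hypothetical helper, not part of readr: split a line on `sep`; if there are
# more than `n` fields, fold the extras back into field `n` and warn.
merge_extra_fields <- function(line, n, sep = ",", row = NA_integer_) {
  fields <- strsplit(line, sep, fixed = TRUE)[[1]]
  if (length(fields) > n) {
    warning(sprintf("row %d: expected %d fields, found %d; merging extras into field %d",
                    row, n, length(fields), n), call. = FALSE)
    # Re-join the surplus pieces with the delimiter they were split on.
    fields <- c(fields[seq_len(n - 1)],
                paste(fields[n:length(fields)], collapse = sep))
  }
  trimws(fields)
}

lines <- c(
  '1, 6, A RUN-ON SENTENCE (sometimes called a "fused sentence")',
  "1, 9, has at least two parts, either one of which can stand by itself",
  "5, 4, (in other words, two independent clauses)"
)

# Rows 2 and 3 each contain a stray comma, so each triggers one warning.
rows <- lapply(seq_along(lines), function(i) {
  merge_extra_fields(lines[i], n = 3, row = i)
})
parsed <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(parsed) <- c("h1", "h2", "h3")
```

Because the surplus pieces are pasted back together with the original delimiter, the text field comes out as it was typed, which only works when the messy field is the last one.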

hadley · 2015-03-20T18:18:27Z

Hmmm, this is the same basic behaviour as tidyr::separate() so maybe have the same option? i.e. extra = c("error", "drop", "merge")

hadley · 2015-03-20T18:18:52Z

Maybe warn (= warning + drop) would be the best default
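To make the proposed modes concrete, here is a rough base-R sketch of how an `extra` argument might treat one already-split row. `handle_extra` and its signature are assumptions for illustration; this is not readr or tidyr code.

```r
# Hypothetical sketch of the proposed `extra` option, applied to one row of
# already-split fields. "warn" is listed first so match.arg() makes it the default.
handle_extra <- function(fields, n,
                         extra = c("warn", "error", "drop", "merge"),
                         sep = ",") {
  extra <- match.arg(extra)
  if (length(fields) <= n) return(fields)
  msg <- sprintf("expected %d fields, found %d", n, length(fields))
  switch(extra,
    error = stop(msg, call. = FALSE),
    warn  = { warning(msg, call. = FALSE); fields[seq_len(n)] },  # warning + drop
    drop  = fields[seq_len(n)],
    merge = c(fields[seq_len(n - 1)],
              paste(fields[n:length(fields)], collapse = sep))
  )
}

fields <- c("5", "4", "(in other words", " two independent clauses)")
dropped <- handle_extra(fields, 3, extra = "drop")
merged  <- handle_extra(fields, 3, extra = "merge")
```

With this input, "drop" keeps only the first three pieces, while "merge" glues the surplus back onto the third.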

Ironholds · 2015-03-20T18:23:03Z

Yep; sounds perfect!

hadley · 2015-09-03T19:42:38Z

Related to #189

hadley · 2015-09-23T16:32:23Z

I think this is ok now - readr will just expand the columns as needed, and you can do the cleanup afterwards. (i.e. it's like extra = "warn" with no option to change the behaviour). The code is already getting pretty complicated, so I don't want to mess with it at the moment.
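A sketch of what "cleanup afterwards" could look like in base R. The data frame below is built by hand to stand in for an expanded result; the column name `X4`, the `NA` in the short row, and the helper `fold_extra_columns` are all assumptions for illustration rather than readr's actual output or API.

```r
# Hand-built stand-in for a result where the reader added a fourth column.
df <- data.frame(
  h1 = c(1, 1, 5),
  h2 = c(6, 9, 4),
  h3 = c('A RUN-ON SENTENCE (sometimes called a "fused sentence")',
         "has at least two parts", "(in other words"),
  X4 = c(NA, "either one of which can stand by itself",
         "two independent clauses)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: paste every non-missing value in the columns after
# `n` back onto column `n`, then keep only the first `n` columns.
fold_extra_columns <- function(df, n, sep = ", ") {
  if (ncol(df) <= n) return(df)
  for (j in (n + 1):ncol(df)) {
    has_extra <- !is.na(df[[j]])
    df[[n]][has_extra] <- paste(df[[n]][has_extra], df[[j]][has_extra], sep = sep)
  }
  df[seq_len(n)]
}

clean <- fold_extra_columns(df, n = 3)
```

The `sep = ", "` is a guess at what the original text looked like; once the file has been split on the delimiter, the exact spacing around it may not be recoverable.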

Ironholds mentioned this issue Mar 21, 2015

Option to ignore unparsable lines? Ironholds/webreadr#1

Closed

msonnabaum mentioned this issue Mar 23, 2015

Option to ignore unparsable lines? #99

Closed

dbuijs mentioned this issue Sep 15, 2015

Horrible CSV: Can readr deal with the full spectrum of horrible files? #262

Closed

hadley closed this as completed Sep 23, 2015

lock bot locked and limited conversation to collaborators Sep 25, 2018
