Dealing with horrible files #94

Closed
Ironholds opened this issue Mar 19, 2015 · 7 comments

Comments

@Ironholds
Contributor

Some humans - some terrible, terrible humans - leave tabs in content-insertable fields, and those tabs are not consistently escaped. Would it be possible to add an option to read_delim that, if set to FALSE, keeps the current behaviour when a row unexpectedly has an extra field and, if set to TRUE, mashes the extra field into the last "expected" field and issues a warning noting the row number?

@hadley
Member

hadley commented Mar 20, 2015

I think so. Could you outline a few examples? (maybe using commas so it's a bit easier to see)

@Ironholds
Contributor Author

Totally!

h1, h2, h3
1, 6, A RUN-ON SENTENCE (sometimes called a "fused sentence")
1, 9, has at least two parts, either one of which can stand by itself
5, 4, (in other words, two independent clauses)

We've got an unquoted text field, which shouldn't be a thing that exists but is, because humans are terrible. As a result, any comma is (potentially) a delimiter; it's impossible for a machine to tell.

In R-core, the response is to create an additional field at the end and then (confusingly) warn you about incomplete lines. In readr, the response is to say "more columns than column names, abort!". It would be nice to have a way of handling this by saying "I don't care if you've got a comma in you, field, I've been told there are six fields and you're no. 7 so I'm going to issue a warning and then fields[6] += fields[7]. In you go now."
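
To make the requested behaviour concrete, here is a minimal sketch in Python rather than R. It is not readr code: the function name split_merging_extra and its parameters are made up for illustration. It relies on the maxsplit argument of str.split, so everything after the last expected delimiter stays in the final field, and it warns with the row number when that happens.

```python
import warnings


def split_merging_extra(line, n_fields, delim=",", row=None):
    """Split `line` into at most `n_fields` fields (hypothetical helper).

    Any delimiter past the (n_fields - 1)-th is left inside the last field,
    and a warning naming the row is issued when that happens.
    """
    n_seen = line.count(delim) + 1
    if n_seen > n_fields:
        warnings.warn(
            f"row {row}: expected {n_fields} fields, found {n_seen}; "
            "extras merged into the last field"
        )
    # maxsplit = n_fields - 1 stops splitting once the expected count is reached
    return [field.strip() for field in line.split(delim, n_fields - 1)]


# The three data rows from the example above, with three expected fields.
lines = [
    '1, 6, A RUN-ON SENTENCE (sometimes called a "fused sentence")',
    "1, 9, has at least two parts, either one of which can stand by itself",
    "5, 4, (in other words, two independent clauses)",
]
parsed = [split_merging_extra(l, 3, row=i) for i, l in enumerate(lines, start=1)]
```

With three expected fields, the second and third rows each trigger a warning and keep their embedded commas in the last field.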

@hadley
Member

hadley commented Mar 20, 2015

Hmmm, this is the same basic behaviour as tidyr::separate() so maybe have the same option? i.e. extra = c("error", "drop", "merge")

@hadley
Member

hadley commented Mar 20, 2015

Maybe warn (= warning + drop) would be the best default
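
The four policies floated here ("error", "drop", "merge", and "warn" as warning + drop) could be sketched as a single dispatch on rows that have already been split. This is a Python illustration under assumed names (handle_extra, n_expected), not the tidyr or readr implementation.

```python
import warnings


def handle_extra(fields, n_expected, extra="warn", delim=",", row=None):
    """Apply an `extra` policy to a row with too many fields (hypothetical helper)."""
    if len(fields) <= n_expected:
        return list(fields)  # nothing extra, nothing to do
    message = f"row {row}: expected {n_expected} fields, found {len(fields)}"
    if extra == "error":
        raise ValueError(message)
    if extra == "merge":
        # Re-join the surplus pieces into the last expected field.
        head = list(fields[: n_expected - 1])
        return head + [delim.join(fields[n_expected - 1:])]
    if extra == "warn":
        warnings.warn(message + "; dropping extras")
        return list(fields[:n_expected])
    if extra == "drop":
        return list(fields[:n_expected])
    raise ValueError(f"unknown extra policy: {extra!r}")


merged = handle_extra(["5", "4", "(in other words", " two independent clauses)"], 3, extra="merge")
```

Note that "merge" can only restore the original text if the delimiter is re-inserted exactly as it was split, which is why the sketch takes delim as a parameter.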

@Ironholds
Contributor Author

Yep; sounds perfect!

@hadley
Member

hadley commented Sep 3, 2015

Related to #189

@hadley
Member

hadley commented Sep 23, 2015

I think this is ok now - readr will just expand the columns as needed, and you can do the cleanup afterwards. (i.e. it's like extra = "warn" with no option to change the behaviour). The code is already getting pretty complicated, so I don't want to mess with it at the moment.
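
As a rough illustration of "expand the columns as needed", the Python sketch below pads every row to the widest row and invents names for the surplus columns. How readr does this internally is not shown in this thread; the helper name expand_to_widest, the None fill value, and the "X4"-style placeholder names are all assumptions made for the example.

```python
def expand_to_widest(rows, col_names, fill=None):
    """Pad rows to the widest row and name any surplus columns (hypothetical helper)."""
    width = max([len(col_names)] + [len(r) for r in rows])
    # Placeholder names for columns beyond the header, numbered by position.
    names = list(col_names) + [f"X{i}" for i in range(len(col_names) + 1, width + 1)]
    padded = [list(r) + [fill] * (width - len(r)) for r in rows]
    return names, padded


names, padded = expand_to_widest(
    [["1", "6", "a"], ["1", "9", "b", "c"]],
    ["h1", "h2", "h3"],
)
```

The cleanup afterwards would then amount to pasting the surplus column back onto the last named one wherever it is not empty.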

hadley closed this as completed Sep 23, 2015
lock bot locked and limited conversation to collaborators Sep 25, 2018