
Inconsistent parsing failure "no trailing characters e3" #645

Closed
cswaters opened this Issue Mar 30, 2017 · 12 comments

@cswaters

cswaters commented Mar 30, 2017

I'm importing a raw csv file using read_csv.

  1. The raw csv file contains a date column; on import it is parsed as col_datetime.
  2. I create a column seas (short for season) using the code below:

df <- mutate(df, seas = ifelse(month(date) < 5, year(date) - 1, year(date)))

  3. I run group_by(df, seas) %>% tally() to check that the seasons were calculated properly. The output looks good, no NA values:
seas n
1999 642
2000 648
2001 644
2002 666
2003 664
2004 664
2005 666
  4. I export the data frame with the new seas column using write_csv.
  5. I import the new csv (the one created in the step above) and get the error:

Warning: 648 parsing failures.

row col expected actual file
1453 seas no trailing characters e3 'df_cleaned.csv'
1454 seas no trailing characters e3 'df_cleaned.csv'
1455 seas no trailing characters e3 'df_cleaned.csv'
1456 seas no trailing characters e3 'df_cleaned.csv'
1457 seas no trailing characters e3 'df_cleaned.csv'

.... .... ...................... ...... ....................
See problems(...) for more details.

  6. Even though the 2000 seas (season) values were calculated properly before the export (see the data frame results in step 3), something triggers a parsing issue on import:

unique(nfl$seas)

    [1] 1995 1996 1997 1998 1999   NA 2001 2002 2003 2004 2005 2006 2007
    [14] 2008 2009 2010 2011 2012 2013 2014 2015 2016

filter(df, is.na(seas)) %>% select(date, seas) produces

A tibble: 648 × 2

date seas
2000-07-29 NA
2000-07-29 NA
2000-07-30 NA
2000-07-30 NA
  7. read.csv works with no issues. (A minimal sketch reproducing this round trip follows this list.)
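A minimal sketch of the round trip above, assuming readr 1.1.x (the version current when this issue was filed) and a made-up data frame standing in for the real data; the counts simply mirror the parsing-failure output in step 5:

library(readr)

# seas ends up as a double (year() and ifelse() return numeric); 648 rows are 2000
df <- data.frame(seas = c(rep(1999, 1452), rep(2000, 648)))

write_csv(df, "df_cleaned.csv")    # 1999 is written as "1999", but 2000 as "2e3"
df2 <- read_csv("df_cleaned.csv")  # the first 1000 rows look like integers, so the
                                   # column is guessed as integer and "2e3" fails
problems(df2)                      # 648 "no trailing characters  e3" failures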
@bpbond


bpbond commented Apr 20, 2017

We are seeing this too: the value 2000 gets converted to 2e3. Not handy. Is write_csv trying to save a single character here for efficiency?

@jimhester


Member

jimhester commented Apr 20, 2017

The issue stems from not being precise with your input and output types.

If you output 2000 as a double it will be written as 2e3:

library(readr)
write_csv(data.frame(a = 2000), NULL)
#> a
#> 2e3

But if you write an integer value it will be output as 2000:

write_csv(data.frame(a = 2000L), NULL)
#> a
#> 2000

Then if you read the value with the default guessing, it will be guessed as a double:

read_csv("a\n2e3\n")
#> # A tibble: 1 × 1
#>       a
#>   <dbl>
#> 1  2000
read_csv("a\n2e3\n", col_types = "?")
#> # A tibble: 1 × 1
#>       a
#>   <dbl>
#> 1  2000

However, if you try to read it as an integer it will fail:

read_csv("a\n2e3\n", col_types = "i")
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 2)
#> Warning: 1 parsing failure.
#> # A tibble: 1 × 5
#>     row   col               expected actual         file
#>   <int> <chr>                  <chr>  <chr>        <chr>
#> 1     1     a no trailing characters     e3 literal data
#> # A tibble: 1 × 1
#>       a
#>   <int>
#> 1    NA

In your case readr is guessing that the column is an integer column because the first 1000 values it sees are all integers; you should explicitly set the column type to double to avoid this:

read_csv("a\n2e3\n", col_types = "d")
#> # A tibble: 1 × 1
#>       a
#>   <dbl>
#> 1  2000

See #526 for the discussion of why we no longer include .0 when writing doubles; keeping the .0 would have avoided this issue.

@rplzzz


rplzzz commented Apr 20, 2017

Just to clarify what Ben was talking about above, there are two things interacting to produce this behavior. The first is that write_csv is rendering 1999 and 2001 as four-digit strings that look like integers, but it is rendering 2000 as "2e3". The second is that read_csv scans a limited number of rows into the table, and if everything it sees up to that point looks like an integer, then it concludes that the column is an integer column. When it sees "2e3" further down, it can't parse that as an integer, so it chokes on it.

Both of these have to happen to cause the error in your case, so you can either convert seas to integer type, assuming the values are all in fact integers (write_csv won't use exponential notation for an integer), or set the guess_max parameter large enough that the guesser sees at least one of the "2e3" values.
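A sketch of both workarounds, reusing the df and df_cleaned.csv names from the report above (the exact guess_max value just needs to reach the first "2e3" row):

library(readr)
library(dplyr)

# Workaround 1: store seas as an integer, so write_csv never uses scientific notation
df %>% mutate(seas = as.integer(seas)) %>% write_csv("df_cleaned.csv")

# Workaround 2: leave the file alone, but let the guesser see past row 1452,
# where the first "2e3" value occurs
df2 <- read_csv("df_cleaned.csv", guess_max = 2000)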

The reply from jimhester came in while I was typing this, and I have to say that I'm not impressed with the complaint of the OP in issue 526. Writing out a double should produce a double literal, and "2000" is not a double literal. If you want integers, then use an integer type.

What's even worse is that now a column of type double with integer-ish values prints out in a way that makes it look like a column of integers, even though the values were correctly typed as doubles. Thus, if you have 1500 or so entries with no decimal part, followed by entries that do have decimals, write_csv will now produce output that causes vanilla read_csv to choke. It seems you have exchanged an easy-to-solve problem (i.e., use integer types if you want the output to look like integers) for a problem that is actually quite tricky (i.e., you have to know how far into the data your first non-integer value occurs in order to read the table correctly). That's a monumentally bad idea.
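A sketch of that failure mode, with made-up values and a made-up file name, under readr 1.1.x's default guessing from the first 1000 rows:

library(readr)

x <- data.frame(val = c(rep(1, 1500), runif(100)))  # 1500 whole-number doubles, then fractions

write_csv(x, "vals.csv")      # the leading 1s are written as "1", with no decimal point
bad  <- read_csv("vals.csv")  # column guessed as integer; the fractional rows fail to parse
problems(bad)

good <- read_csv("vals.csv", guess_max = 1600)  # guesser reaches a fractional value, guesses double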

@jimhester


Member

jimhester commented Apr 20, 2017

The solution here is: if you want integers, convert them to integers when you write out the data.

You should also avoid guessing column types when you know the types of the columns a priori; specifying them explicitly avoids a host of problems (including this one).
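For completeness, a sketch of reading with an a-priori column specification, using the column names from the original report (any other columns would still be guessed via the cols() default):

library(readr)
df2 <- read_csv("df_cleaned.csv",
                col_types = cols(date = col_date(), seas = col_double()))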

@bpbond


bpbond commented Apr 20, 2017

"The issue stems from not being precise with your input and output types."
"if you know the types of the columns a priori"

Unfortunately, in our case, we can't, because we don't.

@jimhester thanks for your thoughts. But the point above still seems really troubling regarding write_csv behavior: (quoting Robert) "a column of type double with integer-ish values prints out in a way that makes it look like a column of integers, even though they were correctly typed as doubles", and this then breaks default read_csv.

Given our particular situation (we don't know the column types ahead of time), it seems like the only solution is to force read_csv to scan the entire file when determining column types. FWIW I agree with Robert (and actually with @hadley's original comment in #526, "Why is that a problem?"): if people want integer output, it's not asking a lot for them to use as.integer.
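A sketch of that read-side option, assuming nothing about the file beyond its name; guess_max only has to be at least the number of data rows:

library(readr)

# count_fields() returns one entry per line, so its length minus the header
# row gives the number of data rows
n_rows <- length(count_fields("df_cleaned.csv", tokenizer_csv())) - 1
df2 <- read_csv("df_cleaned.csv", guess_max = n_rows)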

@cswaters


cswaters commented Apr 21, 2017

Thanks for your answer. I'll make sure to convert to integer before export.

@rplzzz


rplzzz commented Apr 21, 2017

Jim, that's the point. We don't want integers. Our data is floating point, and it's correctly typed as floating point, but when it happens to have integer values (like, say, 0.0) it's being written out as an integer, and that's broken behavior.

As to specifying the column types explicitly, we don't always know them. Sure, there are ways we could figure them out, but that's a lot of hassle to impose on users like us just to save the users in the other camp from having to type "as.integer" from time to time. It seems like a poor tradeoff.

@jimhester


Member

jimhester commented Apr 21, 2017

Sorry, we already changed printing of integer-ish values once and are not going to change it back. If you want the previous behavior use version 1.0.0 or maintain a fork with the change (3526cc5) reverted.

@jimhester jimhester closed this Apr 21, 2017

@hadley


Member

hadley commented Apr 21, 2017

@rplzzz This decision was made not only on philosophy but on pragmatics: far more people had problems with the previous behaviour. Most R users are not programming experts and shouldn't be forced to understand the distinction between integer and numeric. You are clearly a more adept R programmer, so unfortunately the burden is going to fall on you.

@bpbond


bpbond commented Apr 22, 2017

In case anyone finds this thread while googling this unfortunate readr problem, here's our workaround: convert doubles in problematic tables (i.e. ones with a lot of initial integer-ish doubles that break read_csv) into characters before writing.

protect_float <- function(df) {
  # if_else() here is dplyr::if_else(); sapply() picks out the double
  # (numeric but not integer) columns
  floatcols <- names(df)[sapply(df, function(col) {
    is.numeric(col) && !is.integer(col)
  })]
  for (col in floatcols) {
    ## Write entries with very large or very small values in scientific
    ## notation.  Other values will be in decimal notation.
    df[[col]] <- dplyr::if_else(abs(df[[col]]) < 1e-4 | abs(df[[col]]) > 1e6,
                                sprintf("%e", df[[col]]),
                                sprintf("%f", df[[col]]))
  }
  df
}
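Hypothetical usage: because every protected value now carries a decimal point or an exponent, read_csv guesses double from the first row and the round trip no longer depends on guess_max:

library(readr)
write_csv(protect_float(df), "df_cleaned.csv")
df2 <- read_csv("df_cleaned.csv")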
@hadley


Member

hadley commented Apr 23, 2017

That seems like a bad idea to me. Why can't you fix on the reading side?

@rplzzz


rplzzz commented Apr 24, 2017

@hadley,

Expedience, primarily. This was the fastest way to get to a solution while still maintaining acceptable performance. Ultimately, the government is paying me to deliver scientific insight, and the code is just a means to that end.

To that I'd add, there's nothing really wrong on the reading side; read_csv is working as designed. The problem here is on the write side, inasmuch as the stuff that it is writing out is not the same as the stuff in the data structure that is being written. Developing a workaround for the reading side is treating the symptom, but not the disease.

Ultimately, the real right answer is to use a well-designed binary format, as it's the only way to be sure that what you read in is the same as what you originally wrote out. However, sometimes one is stuck with textual formats for legacy reasons, and in those cases we do what we can because we must.

Thanks for your interest. And thanks for the tools you have developed. Notwithstanding this little bump in the road, they have been a huge help in our projects (this one and others).

jimhester added a commit to jimhester/readr that referenced this issue Apr 24, 2017

jimhester added a commit to jimhester/readr that referenced this issue Apr 24, 2017

joelgombin added a commit to datactivist/presidentielle2017 that referenced this issue Apr 25, 2017

jimhester added a commit that referenced this issue Apr 26, 2017

WWolf pushed a commit to WWolf/korea-election that referenced this issue Nov 6, 2017

@lock lock bot locked and limited conversation to collaborators Sep 24, 2018
