
vroom keeps crashing my R session #119

Closed
randomgambit opened this issue May 26, 2019 · 25 comments


@randomgambit commented May 26, 2019

Hi @jimhester ,

I am experiencing some issues with vroom. I am loading a csv.gz file that I was able to load without problems with read_csv and data.table.

The file is not too big (about 800MB), and I load it using:

vroom("data.csv.gz",  # hypothetical placeholder; the real path was not shared
      num_threads = 20, delim = ',', quote = '"',
      col_types = cols(.default = col_character()),
      n_max = 10)

Strangely enough, instead of just loading 10 rows as requested, vroom loads the entire thing. Then running a simple tail() on the tibble will crash the session.

Unfortunately I cannot share the data. Could you please let me know what kind of tests I could run that could be useful to fix that issue?

Thanks!

@emilio-berti commented May 27, 2019

Hi @jimhester and @randomgambit ,

Similar issue here. I am trying to import several CSV files using vroom, and my RStudio session keeps crashing. I tried with a few files of 3 kB each, and vroom sometimes crashes both RStudio and the R session in a bash terminal. An example of my code is below.

library(vroom)

files <- list.files("Folder", recursive = TRUE)

vroom(paste0("Folder/", files))

Hope this will be helpful. Great package!
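As an aside, list.files() can build the full paths itself, which avoids the manual paste0(). A minimal sketch with a made-up temporary folder (the vroom call is left commented, since it needs the package installed):

```r
# full.names = TRUE makes list.files() return usable paths directly.
dir <- tempfile("Folder")
dir.create(dir)
writeLines("x,y\n1,2", file.path(dir, "a.csv"))
writeLines("x,y\n3,4", file.path(dir, "b.csv"))

files <- list.files(dir, recursive = TRUE, full.names = TRUE)
basename(files)  # "a.csv" "b.csv"
# vroom::vroom(files)  # vroom accepts a vector of paths and row-binds them
```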

@jimhester (Member) commented May 28, 2019

I would need at least some more idea of what your datasets look like to have any hope of reproducing this.

If you can't share the data, try randomizing it; likely all I really need is data that looks similar, e.g. the same types of delimiters and newlines.

Strangely enough, instead of just loading 10 rows as requested, vroom loads the entire thing.

I don't see how this is possible...

@randomgambit (Author) commented May 28, 2019

Hi @jimhester, let me try to get back with more details. I can also confirm the recurrent crashes after loading multiple files with vroom and triggering some computations later in the code.

@randomgambit (Author) commented May 28, 2019

@jimhester I think I have an idea what is going on. I think vroom is struggling with parsing errors / empty columns.

Here is an example. The file I am loading has 9 columns, but the first n rows (where n is pretty large) have missing observations for the last 3 columns.

I load my dataset using readr::read_csv() without issues

test_7 <- read_csv(files[4],
         col_names = c('date','time',
                       'type',
                       'family',
                       'gender',
                       'Q','X','Y','Z'),
         col_types = cols(.default = col_character()),
         n_max = 10)

As expected, the tibble contains 10 observations:

# A tibble: 10 x 9
   date       time         type    family gender Q     X     Y     Z    
   <chr>      <chr>        <chr>   <chr>  <chr>  <chr> <chr> <chr> <chr>
 1 2012/04/03 14:00:00.000 ABC/DEF L      3      6     NA    0     0  

Running the equivalent with vroom creates a much larger tibble (9,454 x 9)

test3 <- vroom(files[4], 
              col_names = c('date','time',
                            'type',
                            'family',
                            'gender',
                            'Q','X','Y','Z'),
              n_max = 10,
              col_types = cols(.default = col_character()),
              delim = ',')

with the 11th row, first column looking like:
"2012/04/03,19:00:00.000,ABC/EDF,F,2,1,,0,0\n2012/05/03,19:00:00.000,ABC/DEF,Q,1,1,,0,0

My guess is that vroom gets confused by these missing columns.
What do you think?

Thanks!!

@jimhester (Member) commented May 28, 2019

Is it really "2012/04/03,19:00:00.000,ABC/EDF,F,2,1,,0,0\n2012/05/03,19:00:00.000,ABC/DEF,Q,1,1,,0,0 with a leading double quote and no closing quote? If so, vroom is going to treat the whole column as one quoted field. Try turning off quoting with quote = ""
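The effect of a stray quote can be reproduced with base R's reader, which follows the same CSV quoting rules (a sketch with made-up file contents; vroom's quote argument behaves analogously):

```r
path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "\"1,2", "3,4"), path)

# With quoting on (the default), the unpaired quote on line 2 swallows
# the rest of the file into a single field, leaving one mangled row.
with_quotes <- read.csv(path)
# With quoting off, the quote is just an ordinary character.
no_quotes <- read.csv(path, quote = "")

nrow(with_quotes)  # 1
nrow(no_quotes)    # 2
```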

@randomgambit (Author) commented May 28, 2019

No, the output is longer (it is trimmed in the console). Let me try to paste the entire cell.

@randomgambit (Author) commented May 28, 2019

Funnily enough, running

test3 %>% slice(11) %>% select(1) %>% pull()

returns a super long string (that is properly quoted between ").

Essentially a really long version of:

"2012/04/03,19:00:00.000,ABC/EDF,F,2,1,,0,0\n2012/05/03,19:00:00.000,ABC/DEF,F,1,1,0.12333,8120000,1\n2018/09/03,21:04:31.000"
@jimhester (Member) commented May 28, 2019

I think your file has an unpaired quote that is causing the issue. Try looking at the raw lines.
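A quick way to act on that suggestion is to look for lines containing an odd number of quote characters, since any such line has an unpaired quote (a sketch; the sample file here is made up):

```r
# Returns the indices of raw lines containing an odd number of " characters.
find_odd_quote_lines <- function(path) {
  lines <- readLines(path, warn = FALSE)
  counts <- vapply(gregexpr('"', lines, fixed = TRUE),
                   function(m) sum(m > 0), integer(1))
  which(counts %% 2 == 1)
}

path <- tempfile(fileext = ".csv")
writeLines(c('a,b,c', '1,"ok",3', '4,"broken,6'), path)
find_odd_quote_lines(path)  # 3
```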

@randomgambit (Author) commented May 28, 2019

Actually, the raw data does not contain any " at all!

I opened the file with Sublime Text and I do not see anything weird. Also, recall that read_csv does not complain. The first few rows look like this in Sublime Text (the file has no headers):

2015/09/04,21:00:00.000,ABC/DEF,F,0,1,,0,0
2015/09/04,21:00:00.000,ABC/DEF,F,1,1,,0,0

Then, after a few hundred similar rows, more complete rows look like:

2015/03/02,21:00:00.100,ABC/DEF,F,0,1,225.800,1000000,1
2015/05/02,21:00:00.800,ABC/DEF,F,0,1,226.545,1000000,1

What do you think?

@jimhester (Member) commented May 28, 2019

From just the lines above I don't see how they could be causing the crashes. There must be something else in the file that is causing the indexing to go wrong, but I don't know what it could be.

@randomgambit (Author) commented May 28, 2019

Hmm... I'll try to find out more. But in any case, the parsing in vroom seems to be problematic (as shown by row eleven above). Perhaps there are other parsing arguments I can try in vroom? What about the altrep options? It is so strange that the file is parsed OK with readr while with vroom it is not. Thanks again!

@randomgambit (Author) commented May 28, 2019

Actually, I was able to get more crashes :) by loading a different dataset that has many very long strings. The columns are actually Python dictionaries converted to strings, like

weird_col
"{""<2"":4,""4-200"":30,""200-400"":23,""400-10000"":24,"">10000"":10}"

I tried many combinations of escape_backslash and escape_double, but RStudio crashes as soon as I start a computation on these tibbles.

Please let me know if you think these crazy columns could be the root cause of the crashes. Again, loading the files with read_csv generates no bogus rows.

Thanks for your help!
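For what it's worth, fields like the one above are valid CSV: an embedded quote inside a quoted field is escaped by doubling it. Base R's reader recovers the original string, which suggests the column format itself is not malformed (a sketch with made-up contents):

```r
path <- tempfile(fileext = ".csv")
writeLines(c('weird_col',
             '"{""<2"":4,""4-200"":30,""200-400"":23}"'), path)

# The doubled quotes collapse back into single embedded quotes on read.
df <- read.csv(path, stringsAsFactors = FALSE)
df$weird_col  # {"<2":4,"4-200":30,"200-400":23}
```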

@randomgambit (Author) commented May 29, 2019

@jimhester I was able to get a nice predictable crash when I run the following code

library(tidyverse)
library(vroom)
df <- 1:10000 %>% map_dfr(~ vroom('C:\\Users\\john\\Documents\\file.out',
                                  col_names = c('a','b','c','d'),
                                  col_types = cols(.default = col_character())))

with the attached file (you have to unzip it because I was not able to attach my .out file) which is essentially a repetition of the example above with weird character columns. I hope this helps.

Please let me know

Thanks!
file.zip

@randomgambit (Author) commented May 29, 2019

@jimhester does that work (i.e. crash your session)? :D

@jimhester (Member) commented May 29, 2019

Try using progress = FALSE; it seems like this is due to a race condition in the progress bar.

@randomgambit (Author) commented May 29, 2019

@jimhester Ha! Let me try ASAP with the two big datasets I have. I'll report back to HQ soon. Thanks!

@randomgambit (Author) commented May 29, 2019

@jimhester Here is what I found out. First, progress = FALSE prevents the crashing at load time. The good news is that I found a way to reproduce the parsing bug I told you about.

Simply run

df <- 1:100 %>% map_dfr(~ vroom('/mypath/file.zip', col_names = c('a','b','c','d'),
                                col_types = cols(.default = col_character()),
                                progress = FALSE))

and you will notice that the first rows are correctly parsed:


> df %>% head()
# A tibble: 6 x 4
  a                                                               b                                                               c     d    
  <chr>                                                           <chr>                                                           <chr> <chr>
1 "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? {}    2    
2 "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? {}    2    
3 "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? {}    2    
4 "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? {}    2    
5 "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? {}    2    
6 "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? "{\"<2\":4,\"4-200\":30,\"200-400\":23,\"400-10000\":24,\">100? {}    2    
> 

while the bottom rows are completely wrong (and thus can trigger some crashes when the data is big).

See below:


> df %>% tail()
# A tibble: 6 x 4
  a              b                c                                d          
  <chr>          <chr>            <chr>                            <chr>      
1 "200-400\":23" "400-10000\":24" ">10000\":10},{},2\r\n{\"<2\":4" "4-200\":3"
2 "200-400\":23" "400-10000\":24" ">10000\":10},{\"<2\":4"         "4-200\":3"
3 "200-400\":23" "400-10000\":24" ">10000\":10},{},2\r\n{\"<2\":4" "4-200\":3"
4 "200-400\":23" "400-10000\":24" ">10000\":10},{\"<2\":4"         "4-200\":3"
5 "200-400\":23" "400-10000\":24" ">10000\":10},{},2\r\n{\"<2\":4" "4-200\":3"
6 "200-400\":23" "400-10000\":24" ">10000\":10},{\"<2\":4"         "4-200\":3"

Playing with arguments such as delim = ',',quote = '"' does not help.
What do you think? I am so happy I was able to reproduce that stuff. One step closer to fixing the bug.

Thanks!

@randomgambit (Author) commented May 29, 2019

@jimhester I was able to narrow down the issue further. I think vroom gets confused when unzipping the file.

Indeed, just loading the (unzipped) file looks fine:


df <- vroom('/mypath/file.out', col_names = c('a','b','c','d'))
> df %>% tail()
# A tibble: 6 x 4
  a                                 b                                c         d
  <chr>                             <chr>                            <chr> <dbl>
1 "{\"<2\":4,\"4-200\":30,\"200-40… "{\"<2\":4,\"4-200\":30,\"200-4… {}        2
2 "{\"<2\":4,\"4-200\":30,\"200-40… "{\"<2\":4,\"4-200\":30,\"200-4… {}        2
3 "{\"<2\":4,\"4-200\":30,\"200-40… "{\"<2\":4,\"4-200\":30,\"200-4… {}        2
4 "{\"<2\":4,\"4-200\":30,\"200-40… "{\"<2\":4,\"4-200\":30,\"200-4… {}        2
5 "{\"<2\":4,\"4-200\":30,\"200-40… "{\"<2\":4,\"4-200\":30,\"200-4… {}        2
6 {}                                NA                               {}        2

Instead, loading the .zip creates the parsing error


df <- vroom('/mypath/file.zip', col_names = c('a','b','c','d'))
Observations: 1,457                                                           
Variables: 4
chr [4]: a, b, c, d

Call `spec()` for a copy-pastable column specification
Specify the column types with `col_types` to quiet this message
> df %>% tail()
# A tibble: 6 x 4
  a              b                c                                d          
  <chr>          <chr>            <chr>                            <chr>      
1 "200-400\":23" "400-10000\":24" ">10000\":10},{},2\r\n{\"<2\":4" "4-200\":3"
2 "200-400\":23" "400-10000\":24" ">10000\":10},{\"<2\":4"         "4-200\":3"
3 "200-400\":23" "400-10000\":24" ">10000\":10},{},2\r\n{\"<2\":4" "4-200\":3"
4 "200-400\":23" "400-10000\":24" ">10000\":10},{\"<2\":4"         "4-200\":3"
5 "200-400\":23" "400-10000\":24" ">10000\":10},{},2\r\n{\"<2\":4" "4-200\":3"
6 "200-400\":23" "400-10000\":24" ">10000\":10},{\"<2\":4"         "4-200\":3"

@jimhester closed this in fcb260c May 29, 2019

@randomgambit (Author) commented May 29, 2019

@jimhester Thanks, happy to help! Is this fixed already? 💯

@jimhester (Member) commented May 29, 2019

The issue occurred when reading from a connection (which is what happens internally when you have a compressed file). The file contained quoted fields with the delimiter inside some of the fields, and the size of the connection buffer happened to land within a quoted field. Previously, each time a new buffer was read the code assumed it was not inside a quoted field; now it retains that information from the last buffer, so it works properly.
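The fix described above can be illustrated with a toy chunked parser (a simplified sketch, not vroom's actual code): the "inside a quoted field" flag must survive from one buffer to the next, otherwise a newline inside a quoted field that spans a buffer boundary is miscounted as a record separator.

```r
# Count CSV records while reading the input in small buffers, carrying the
# quote state across buffer boundaries (the essence of the fix).
count_records <- function(text, buffer_size = 4) {
  chars <- strsplit(text, "")[[1]]
  in_quote <- FALSE
  records <- 0
  for (start in seq(1, length(chars), by = buffer_size)) {
    buf <- chars[start:min(start + buffer_size - 1, length(chars))]
    for (ch in buf) {
      if (ch == '"') in_quote <- !in_quote                  # toggle quote state
      else if (ch == '\n' && !in_quote) records <- records + 1
    }
    # in_quote deliberately persists into the next buffer iteration
  }
  records
}

# The first newline is inside a quoted field, so there are 2 records,
# regardless of where the buffer boundaries fall.
count_records('a,"x\ny",b\nc,d,e\n', buffer_size = 3)    # 2
count_records('a,"x\ny",b\nc,d,e\n', buffer_size = 100)  # 2
```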

@randomgambit (Author) commented May 29, 2019

Perfect, great! When will you push the new package to CRAN? Alternatively, can I just download the master zip and install it?

@jimhester (Member) commented May 29, 2019

devtools::install_github("r-lib/vroom"), but you will need a development environment set up on Windows to install the package.

@randomgambit (Author) commented May 29, 2019

I can't do that behind my firewall... sob sob

@jimhester (Member) commented May 29, 2019

Then download the tarball and install it directly; it is the same thing.
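Concretely, installing a downloaded source tarball can look like this (the file name is hypothetical, and a build toolchain such as Rtools is still required on Windows):

```r
# repos = NULL tells install.packages to treat the argument as a local file
# rather than a CRAN package name.
install.packages("vroom-master.tar.gz", repos = NULL, type = "source")
```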

@randomgambit (Author) commented May 29, 2019

Great, thanks. I'll run some tests and let you know.
