read_lines crashes on big gzipped file #309

Closed
dselivanov opened this issue Nov 8, 2015 · 6 comments

@dselivanov

I'm trying to read a big text file (~4.5 GB gzipped, an English Wikipedia dump).
read_lines produces the following error:

Error in read_lines_(ds, locale_ = locale, n_max = n_max, progress = progress) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:137

EDIT: I'm running this on a machine with 250 GB of RAM, so memory is not an issue.

@dselivanov
Author

Also I'm using Rcpp_0.12.1, see RcppCore/Rcpp#302.

@MagicForrest

I can report the same thing. I am trying to read a large text file with read_fwf(): 220 MB gzipped, 2.5 GB uncompressed. It reads fine when uncompressed, but reading the gzipped file fails with the "long vectors not supported" error above. I am using readr 0.2.2, Rcpp 0.12.3, R 3.02.

Let me know if there is any more info or testing I can provide.

@hadley
Member

hadley commented Jun 2, 2016

Deleting all the data.table stuff which is peripheral to this issue. @dselivanov can you please provide a reproducible example?

@dselivanov
Author

library(readr)
# works: 2^15 - 1 lines of 2^16 characters each
txt = rep(paste(rep('a', 2 ^ 16), collapse = ''), 2 ^ 15 - 1)
writeLines(txt, con = gzfile('~/temp/test_read_lines.gz', open = 'w+', compression = 1))
rm(txt)
txt = read_lines("~/temp/test_read_lines.gz")
rm(txt)
# fails: 2^15 lines of 2^16 characters each
txt = rep(paste(rep('a', 2 ^ 16), collapse = ''), 2 ^ 15)
writeLines(txt, con = gzfile('~/temp/test_read_lines.gz', open = 'w+', compression = 1))
rm(txt)
txt = read_lines("~/temp/test_read_lines.gz")
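
The boundary between the two cases above appears to line up with R's limit of 2^31 - 1 elements for non-long vectors. Here is a rough check of the uncompressed sizes, assuming each line contributes its 2^16 characters plus one newline byte, and that the decompressed data ends up in a single vector. That second point is an assumption about readr's internals and is not confirmed in this thread.

```r
# Assumption: uncompressed bytes per line = 2^16 characters + 1 newline.
line_bytes <- 2^16 + 1

# Largest length of a non-long vector in R (2^31 - 1).
limit <- .Machine$integer.max

size_ok   <- line_bytes * (2^15 - 1)  # the case that works
size_fail <- line_bytes * 2^15        # the case that fails

size_ok <= limit    # TRUE
size_fail <= limit  # FALSE
```

The arithmetic is done in doubles, so the products do not overflow R's integer type.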

@hadley
Member

hadley commented Jun 2, 2016

Minimal reprex

tmp <- tempfile(fileext = ".gz")

x <- rep(paste(rep('a', 2 ^ 16), collapse = ''), 2 ^ 15)
writeLines(x, con = gzfile(tmp, open = 'w+', compression = 1))
y <- readr::read_lines(tmp)

hadley added the feature (a feature request or enhancement) and ready labels Jun 2, 2016
@hadley
Member

hadley commented Jun 14, 2016

@jimhester this should be fairly straightforward - will just require using the long vector API.

jimhester added a commit to jimhester/readr that referenced this issue Jun 14, 2016
jimhester removed the ready label Jun 15, 2016
lock bot locked and limited conversation to collaborators Sep 24, 2018