can't load large files on Windows #296
Comments
This might be fixed now in upstream Arrow -- I am going to make a Feather 0.4.0 release which makes feather-format depend on pyarrow. Do you have a reproducible test case I can try out to see if it's fixed?
Great, thanks. Just tried the synthetic example below; it gives the same results as before.

R:

```r
library(feather)
library(magrittr)

out_dir <- # directory to store the files
nrow <- 40000000
rand_col <- rnorm(nrow)
df_small <- data.frame(rand_col)
df_big <- df_small
df_small %>% dim
write_feather(df_small, paste0(out_dir, 'df_small.feather'))  # 3.4 gb
```

Python:

```python
import numpy as np
import feather

out_dir = # directory with the two files
df_small = feather.read_dataframe(out_dir + 'df_small.feather')
```
I can reproduce the issue and I opened https://issues.apache.org/jira/browse/ARROW-1096 to fix this.
I seem to be having this same problem (reading Feather in Python after writing from R), around the same size cutoff, using pyarrow 0.10.0 and the latest feather/pandas. Error below:

Attempts to read some files alternate between the above error and the further error below:
Can you post a minimal reproducible example?
Thanks for getting back to me -- I modified the code used above to produce slightly bigger files, and was able to reproduce the error that way. I should also note that I am using Linux, not Windows. The following R code creates the synthetic data:

There are three different ways I've been able to reproduce this error, which I'll note in comments in this Python code (each method assumes a fresh instance of Python):

I can provide additional information from any of the error messages if needed. As a final note, the two datasets I'm working with that originally produced these messages are 40G and 8G in size. I can load the 8G file as long as I do it first, but attempting to load it after the 40G file fails with an error, and attempting to read the 40G file produces an error every time. Thanks for your time.
Thank you. I'm not sure when I'll be able to take a look, but this will help.
Not sure if it will help, but I noticed that the `feather_metadata()` function in R also fails when attempting to access the larger (40G) file. To reproduce this I've created a 37G synthetic dataset by adding the following to the previously posted R code:

Then, when attempting to get the metadata in R:
Unfortunately I'm declaring bankruptcy on maintaining this codebase any further. The R community is best advised to invest in getting R bindings to Apache Arrow shipped so we can work together to fix these issues in one place.
Thanks for your note, I understand. As it turns out, the problem was on my end -- those errors seem to be the way Python was telling me it was running up against the limits of system memory.
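One way to catch this failure mode up front is to compare the file size against available memory before attempting the read. This is a Linux-only heuristic sketch (the function name, the `headroom` factor, and reading `/proc/meminfo` are all assumptions of this sketch, not anything from the thread):

```python
import os

def fits_in_memory(path, headroom=1.5):
    """Rough Linux-only heuristic: a Feather read materializes roughly
    the whole file in RAM, so compare the file size (with some headroom
    for conversion overhead) against MemAvailable from /proc/meminfo."""
    size = os.path.getsize(path)
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("MemAvailable:"):
                avail_bytes = int(line.split()[1]) * 1024  # value is in kB
                return size * headroom <= avail_bytes
    return False  # couldn't determine available memory; assume it won't fit
```

Guarding a `read_dataframe` call with a check like this turns an opaque mmap/`MemoryError` failure into an explicit "file won't fit" decision.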
I think this should be fixed in the |
I'm currently dumping dataframes to Feather from R and loading them into Python. These files range in size from 1 GB to 10 GB. There is a clear cutoff somewhere between 4.5 and 6.5 GB: below it, reading the files always succeeds; above it, reading always fails in "check_status". I am on Windows Server 2012 R2. Error below.
```
C:\ProgramData\Anaconda3\lib\site-packages\feather\ext.pyx in feather.ext.FeatherReader.__cinit__ (feather/ext.cpp:4460)()
C:\ProgramData\Anaconda3\lib\site-packages\feather\ext.pyx in feather.ext.check_status (feather/ext.cpp:1921)()
FeatherError: IO error: Memory mapping file failed
```
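The cutoff described in this issue (somewhere between 4.5 and 6.5 GB) sits just above 2**32 bytes, so one plausible mechanism -- an inference here, not something stated in the thread, though ARROW-1096 tracks the actual fix -- is a file size or offset being truncated to 32 bits somewhere on the memory-mapping path. A quick sketch of what that truncation does to a length:

```python
import ctypes

# 2**32 bytes is about 4.29 GB -- just below the observed 4.5-6.5 GB cutoff.
print(2**32)  # 4294967296

# A 5 GB length squeezed through an unsigned 32-bit integer wraps around,
# so the mapped region would look like ~0.7 GB instead of 5 GB.
file_size = 5_000_000_000
truncated = ctypes.c_uint32(file_size).value
print(truncated)  # 705032704
```

Files under 4.29 GB survive the truncation unchanged, which would explain why reads below the cutoff always succeed.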