
can't load large files on Windows #296

Closed
j-mark-hou opened this issue May 11, 2017 · 11 comments

Comments

@j-mark-hou commented May 11, 2017

I'm currently dumping dataframes to Feather from R and loading them into Python. These files range in size from 1 GB to 10 GB. There is a clear cutoff somewhere between 4.5 and 6.5 GB: below it, reading the files always succeeds; above it, reading always fails in check_status. I am on Windows Server 2012 R2. Error below.

C:\ProgramData\Anaconda3\lib\site-packages\feather\ext.pyx in feather.ext.FeatherReader.__cinit__ (feather/ext.cpp:4460)()
C:\ProgramData\Anaconda3\lib\site-packages\feather\ext.pyx in feather.ext.check_status (feather/ext.cpp:1921)()
FeatherError: IO error: Memory mapping file failed

@wesm (Owner) commented May 11, 2017

This might be fixed now in upstream Arrow -- I am going to make a Feather 0.4.0 release which makes feather-format depend on pyarrow. Do you have a reproducible test case I can try out to see if it's fixed?

@j-mark-hou (Author) commented May 11, 2017

Great, thanks. I just tried the synthetic example below; it gives the same results as before.

R:

library(feather)
library(tidyverse)

out_dir <- # directory to store the files

nrow <- 40000000
num_cols_small <- 10
num_cols_big <- 20

rand_col <- rnorm(nrow)

df_small <- data.frame(rand_col)
names(df_small) <- c('col0')
for (i in 1:num_cols_small) {
  df_small[paste0('col', i)] <- rand_col
}

df_big <- df_small
for (i in num_cols_small:num_cols_big) {
  df_big[paste0('col', i)] <- rand_col
}

df_small %>% dim
df_big %>% dim

write_feather(df_small, paste0(out_dir, 'df_small.feather'))  # 3.4 GB
write_feather(df_big, paste0(out_dir, 'df_big.feather'))      # 6.5 GB

Python:

import pandas as pd
import feather

out_dir = # directory with the two files

df_small = feather.read_dataframe(out_dir+'df_small.feather')
print(df_small.shape) # (40000000, 11)
df_big = feather.read_dataframe(out_dir+'df_big.feather')
print(df_big.shape) # FeatherError: IO error: Memory mapping file failed

@wesm (Owner) commented Jun 7, 2017

I can reproduce the issue and have opened https://issues.apache.org/jira/browse/ARROW-1096 to fix this.

@alex-addiego commented Sep 7, 2018

I seem to be having this same problem (reading feather in Python after writing from R), around the same size cutoff, using pyarrow 0.10.0 and the latest feather/pandas. Error below:

data_temp = feather.read_dataframe('<input_file>')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../anaconda3/lib/python3.6/site-packages/pyarrow/feather.py", line 213, in read_feather
reader = FeatherReader(source)
File ".../anaconda3/lib/python3.6/site-packages/pyarrow/feather.py", line 46, in __init__
self.open(source)
File "pyarrow/feather.pxi", line 80, in pyarrow.lib.FeatherReader.open
File "pyarrow/io.pxi", line 1050, in pyarrow.lib.get_reader
File "pyarrow/io.pxi", line 633, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 597, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Memory mapping file failed: Cannot allocate memory

Attempts to read some files alternate between the error above and this second error:

data_temp = feather.read_dataframe('<input_file>')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../anaconda3/lib/python3.6/site-packages/pyarrow/feather.py", line 214, in read_feather
return reader.read_pandas(columns=columns, use_threads=use_threads)
File ".../anaconda3/lib/python3.6/site-packages/pyarrow/feather.py", line 74, in read_pandas
use_threads=use_threads)
File "pyarrow/table.pxi", line 1326, in pyarrow.lib.Table.to_pandas
File ".../anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 577, in table_to_blockmanager
blocks = _table_to_blocks(options, block_table, memory_pool, categories)
File ".../anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 742, in _table_to_blocks
categories)
File "pyarrow/table.pxi", line 971, in pyarrow.lib.table_to_blocks
File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError

@wesm (Owner) commented Sep 8, 2018

Can you post a minimal reproducible example?

@alex-addiego commented Sep 11, 2018

Thanks for getting back to me -- I modified the code used above to produce slightly bigger files, and was able to reproduce the error that way. I should also note that I am using Linux, not Windows.

The following R code creates the synthetic data:

library(feather)
library(tidyverse)

out_dir <- # directory to store the files

nrow <- 40000000
num_cols_small <- 10
num_cols_big <- 20
num_cols_huge <- 30
num_cols_huge2 <- 60

rand_col <- rnorm(nrow)

df_small <- data.frame(rand_col)
names(df_small) <- c('col0')
for(i in 1:num_cols_small){
	df_small[paste0('col',i)] <- rand_col
}

df_big <- df_small
for(i in num_cols_small:num_cols_big){
	df_big[paste0('col', i)] <- rand_col
}

df_huge <- df_big
for(i in num_cols_big:num_cols_huge){
	df_huge[paste0('col',i)] <- rand_col
}

df_huge2 <- df_huge
for(i in num_cols_huge:num_cols_huge2){
	df_huge2[paste0('col',i)] <- rand_col
}


write_feather(df_small, paste0(out_dir, 'df_small.feather'))   # 3.3 GB
write_feather(df_big, paste0(out_dir, 'df_big.feather'))       # 6.3 GB
write_feather(df_huge, paste0(out_dir, 'df_huge.feather'))     # 9.3 GB
write_feather(df_huge2, paste0(out_dir, 'df_huge2.feather'))   # 19 GB

There are three different ways I've been able to reproduce this error, noted in comments in the Python code below (each method assumes a fresh Python instance):

import pandas as pd
import feather

out_dir = # file directory

# Method 1: Loading the largest dataset (df_huge2)
# -----------------------------------------------------------
df_huge2 = feather.read_dataframe(out_dir+'df_huge2.feather')
# Running this over and over alternates between two errors:
#       'pyarrow.lib.ArrowMemoryError'
#       'pyarrow.lib.ArrowIOError: Memory mapping file failed: Cannot allocate memory'


# Method 2: Loading 'df_huge' after failing a load of 'df_huge2'
# 'df_huge' loading works in a fresh instance of python, but fails if 'df_huge2' load has been attempted
# -----------------------------------------------------------
df_huge2 = feather.read_dataframe(out_dir+'df_huge2.feather')
df_huge = feather.read_dataframe(out_dir+'df_huge.feather')
#       'pyarrow.lib.ArrowMemoryError'


# Method 3: Attempting to load 'df_huge' twice consecutively
# 'df_huge' loading works in a fresh instance of python, but fails in a repeat call after the first load-in
# -----------------------------------------------------------
df_huge = feather.read_dataframe(out_dir+'df_huge.feather')
df_huge = feather.read_dataframe(out_dir+'df_huge.feather')
#        'pyarrow.lib.ArrowMemoryError'
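One thing that may be worth ruling out for Method 3 (a suggestion, not verified against this bug): rebinding the name does not guarantee the first frame's memory has been returned before the second read is attempted, so explicitly dropping the old reference and forcing a garbage-collection pass could change the behavior. A minimal sketch with a small hypothetical frame:

```python
import gc

import pandas as pd

df = pd.DataFrame({"a": range(1_000)})  # stands in for the first df_huge load
del df                # drop the only reference so the frame is collectable
freed = gc.collect()  # force a collection pass before the next large read
```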

I can provide additional information from any of the error messages if needed.

As a final note, the two datasets I'm working with that originally produced these errors are 40 GB and 8 GB in size. I can load the 8 GB file as long as I load it first; attempting to load it after a failed load of the 40 GB file raises an error, and attempting to read the 40 GB file produces an error every time.

Thanks for your time.

@wesm (Owner) commented Sep 11, 2018

Thank you. I'm not sure when I'll be able to take a look, but this will help.

@alex-addiego commented Sep 12, 2018

Not sure if it will help, but I noticed that the feather_metadata() function in R also fails when attempting to access the larger (40 GB) file.

To reproduce this, I created a 37 GB synthetic dataset by adding the following to the previously posted R code:

num_cols_huge3 <- 120

df_huge3 <- df_huge2
for(i in num_cols_huge2:num_cols_huge3){
	df_huge3[paste0('col',i)] <- rand_col
}

write_feather(df_huge3, paste0(out_dir, 'df_huge3.feather'))

Then, when attempting to get the metadata in R:

library(feather)
library(tidyverse)

out_dir <- # directory to store the files

feather_metadata(paste0(out_dir, 'df_huge3.feather'))
# Error in metadataFeather(path) : IO error: Memory mapping file failed

@wesm (Owner) commented Sep 13, 2018

Unfortunately I'm declaring bankruptcy on maintaining this codebase any further. The R community is best advised to invest in getting R bindings to Apache Arrow shipped so we can work together to fix these issues in one place.

@alex-addiego commented Sep 18, 2018

Thanks for your note, I understand. As it turns out, the problem was on my end: those errors seem to be the way Python was telling me it was running up against the limits of system memory.
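The back-of-envelope check is straightforward: each float64 column costs nrow × 8 bytes, so the frames in the repro above approach typical RAM sizes once fully materialized. A sketch (pure arithmetic, ignoring pandas overhead):

```python
def estimated_bytes(nrow, ncol, itemsize=8):
    # lower bound for an all-float64 frame: raw data only, no pandas overhead
    return nrow * ncol * itemsize

# df_huge2 from the repro above: 40M rows x 61 float64 columns
gib = estimated_bytes(40_000_000, 61) / 2**30
print(round(gib, 1))  # 18.2 -- consistent with the 19G file on disk
```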

@wesm (Owner) commented Apr 10, 2020

I think this should be fixed in the arrow package. We are working toward feather using arrow internally; that should be done in the near future.

@wesm closed this as completed Apr 10, 2020

3 participants