
BUG: close intermediate file descriptor right after it is used in netcdf.py #3429

Closed · wants to merge 5 commits into scipy:master from heoj:master

Conversation

@jinhyokh (Contributor) commented Mar 4, 2014

netcdf opens intermediate file descriptors that are only closed when the main file descriptor is closed. This causes a "too many open files" error when there are too many intermediate descriptors. This patch closes each intermediate file descriptor right after it is used.
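
Roughly, the pattern of the patch looks like this (a minimal sketch with hypothetical names, not the actual diff): map the region for one variable, copy the data out, and release the map immediately instead of keeping one map alive per variable until the file is closed.

```python
import mmap
import numpy as np

def read_variable(fileno, offset, count, dtype, shape):
    # map just far enough to cover this variable's data
    mm = mmap.mmap(fileno, offset + count * np.dtype(dtype).itemsize,
                   access=mmap.ACCESS_READ)
    try:
        # copy() detaches the result from the mapped buffer so the map
        # can be closed right away, at the cost of copying the data;
        # without the copy, a view would still reference the closed map
        data = np.frombuffer(mm, dtype=dtype, count=count,
                             offset=offset).reshape(shape).copy()
    finally:
        mm.close()
    return data
```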

@rgommers (Member) commented Mar 5, 2014

The failures are segfaults on all Python versions. I haven't checked whether it's related, but this is new.

@jinhyokh (Contributor, Author) commented Mar 5, 2014

I guess I fixed the problem. Please review my patch.

@rgommers (Member) commented Mar 6, 2014

Doesn't this now cause a lot of extra data copying? I assume you have some large netcdf files, otherwise you wouldn't see this. Can you take one that does load without this fix and time opening it in read mode?

@jinhyokh (Contributor, Author) commented Mar 6, 2014

Without copy(), segfaults occur. Why does data.flags.writeable become True with copy()?
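
For context on the flags question: numpy marks an array built over a read-only buffer as non-writeable, while copy() allocates fresh memory that the new array owns, so the copy comes back writeable. A minimal standalone illustration (not from the patch):

```python
import numpy as np

buf = bytes(16)                    # an immutable, read-only buffer
a = np.frombuffer(buf, dtype=np.uint8)
print(a.flags.writeable)           # False: backed by a read-only buffer
b = a.copy()
print(b.flags.writeable)           # True: the copy owns its own memory
```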

@rgommers (Member) commented Mar 6, 2014

I mean take one that works with current scipy master, and compare with this patch.

Flags: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flags.html
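
For reference, a quick way to run that comparison, once on current master and once with this patch (the filename here is a hypothetical placeholder for a suitably large file):

```python
import timeit

# time opening a large netcdf file in read (mmap) mode
t = timeit.timeit(
    "netcdf_file('big.nc', 'r').close()",
    setup="from scipy.io.netcdf import netcdf_file",
    number=10)
print('mean open time: %.3f s' % (t / 10))
```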

@jinhyokh (Contributor, Author) commented Mar 6, 2014

Now it sets the writeable flag properly and passes the tests.

@jinhyokh (Contributor, Author) commented Mar 7, 2014

Sorry, I didn't pay attention to the details. Considering the inefficiency of copy(), I realized this is not a good way of solving the problem. It may be better simply to advise against using mmap for netcdf files with many variables, or to pass mmap=False to netcdf_file().

A better idea may be to check the number of potential mmap() calls and, if it is greater than resource.getrlimit(resource.RLIMIT_NOFILE), change self.use_mmap to False. How can I count the number of open file descriptors in Python? (One possible approach is sketched below.)

Or, any better idea?
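
One way to get at the descriptor count is via /proc (a hypothetical helper; /proc/self/fd is Linux-specific, so other platforms would need a fallback):

```python
import os
import resource

def fd_headroom():
    # soft limit on open file descriptors for this process
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # count currently open descriptors (Linux-only: /proc/self/fd)
    open_fds = len(os.listdir('/proc/self/fd'))
    return soft - open_fds
```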

@pv (Member) commented Mar 7, 2014

It should be possible to use a single mmap, initialized when the file is opened, and obtain the variables via something like:

value = self._mm[start_byte_pos:end_byte_pos].view(dtype).reshape(shape)
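
A rough sketch of that single-map idea (illustrative names, not scipy's actual attributes): map the whole file once at open time and hand out per-variable views into the one map.

```python
import mmap
import numpy as np

class MappedNetCDF:
    def __init__(self, path):
        self.fp = open(path, 'rb')
        self._mm = mmap.mmap(self.fp.fileno(), 0, access=mmap.ACCESS_READ)
        # a flat byte array over the map; variable data become views of it
        self._mm_buf = np.frombuffer(self._mm, dtype=np.int8)

    def variable(self, start, nbytes, dtype, shape):
        # a zero-copy view: no extra file descriptors, no data copying
        return self._mm_buf[start:start + nbytes].view(dtype).reshape(shape)

    def close(self):
        # note: outstanding variable views keep the map's buffer exported,
        # so closing while views are still alive raises BufferError
        self._mm_buf = None
        self._mm.close()
        self.fp.close()
```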

@pv (Member) commented Mar 7, 2014

Also: isn't netcdf.py actually pupynere? The upstream is still active, so we should in principle track it: https://bitbucket.org/robertodealmeida/pupynere/commits/all

@rgommers (Member) commented Mar 9, 2014

@pv it was, a very long time ago. The last sync by Roberto (the Pupynere author) was in 5b02684. After that a bunch of fixes went into scipy that I don't think we contributed back. I'll open a separate issue for re-syncing with upstream.

@rgommers (Member) commented Mar 9, 2014

The Travis error is a timeout; looks OK for this patch now.

@coveralls

Coverage Status

Coverage remained the same when pulling 0b79ea0 on heoj:master into c81c978 on scipy:master.

@jinhyokh (Contributor, Author)

I further modified the code to use a buffer, as pv suggested. Memory efficiency improved significantly!

@coveralls

Coverage Status

Coverage remained the same when pulling 09d235e on heoj:master into c81c978 on scipy:master.

@betodealmeida

There's another major improvement in using ALLOCATIONGRANULARITY when available. See:

https://bitbucket.org/robertodealmeida/pupynere/src/0fa832fa4400da818019f7d2883669c1ce08823f/pupynere.py?at=default#cl-93

When ALLOCATIONGRANULARITY is not available, the mmap for each variable is created from the start of the file to the end of the variable, since it's not possible to specify an arbitrary offset; the offset is instead applied later, when creating the NumPy array. For files with many variables this is highly inefficient. The code at Bitbucket checks whether ALLOCATIONGRANULARITY is available (Python >= 2.6) and uses it to page-align the map's offset.
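
A sketch of that alignment trick (illustrative names; mmap offsets must be a multiple of mmap.ALLOCATIONGRANULARITY): round the variable's offset down to the nearest boundary, map only from there, and index past the padding inside the map.

```python
import mmap
import numpy as np

def map_variable(fileno, offset, nbytes, dtype, shape):
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // gran) * gran          # nearest boundary below offset
    pad = offset - aligned                     # bytes of alignment padding
    # map only from the boundary instead of from the start of the file
    mm = mmap.mmap(fileno, pad + nbytes,
                   access=mmap.ACCESS_READ, offset=aligned)
    count = nbytes // np.dtype(dtype).itemsize
    return np.frombuffer(mm, dtype=dtype, count=count,
                         offset=pad).reshape(shape)
```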

pv added a commit that referenced this pull request Apr 27, 2014
BUG: io/netcdf: use only a single mmap in netcdf

Using a separate mmap for each access of each variable can cause running
out of file descriptors on some platforms. Address this by using a
single mmap covering the whole file.
@pv (Member) commented Apr 27, 2014

Merged with minor changes in 58d8117

@pv pv closed this Apr 27, 2014
@pv pv added this to the 0.15.0 milestone Apr 27, 2014