
BUG: close intermediate file descriptor right after it is used in netcdf.py #3429

Closed · wants to merge 5 commits into scipy:master from heoj:master

Conversation

@jinhyokh (Contributor) commented Mar 4, 2014

netcdf opens intermediate file descriptors that are only closed when the main file descriptor is closed. This causes a "too many open files" error when there are too many intermediate descriptors. This patch closes each intermediate file descriptor right after it is used.
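
Roughly, the pattern of the patch looks like this (a minimal sketch with hypothetical names, not the actual diff): map the region for one variable, copy the data out, and release the map immediately instead of keeping one map alive per variable until the file is closed.

```python
import mmap
import numpy as np

def read_variable(fileno, offset, count, dtype, shape):
    # map just far enough to cover this variable's data
    mm = mmap.mmap(fileno, offset + count * np.dtype(dtype).itemsize,
                   access=mmap.ACCESS_READ)
    try:
        # copy() detaches the result from the mapped buffer so the map
        # can be closed right away, at the cost of copying the data;
        # without the copy, a view would still reference the closed map
        data = np.frombuffer(mm, dtype=dtype, count=count,
                             offset=offset).reshape(shape).copy()
    finally:
        mm.close()
    return data
```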

@rgommers (Member) commented Mar 5, 2014

The failures are segfaults on all Python versions. I haven't checked whether it's related, but this is new.

@jinhyokh (Contributor, Author) commented Mar 5, 2014

I guess I fixed the problem. Please review my patch.

@rgommers (Member) commented Mar 6, 2014

Doesn't this now cause a lot of extra data copying? I assume you have some large netcdf files, otherwise you wouldn't see this. Can you take one that does load without this fix and time opening it in read mode?

@jinhyokh (Contributor, Author) commented Mar 6, 2014

Without copy(), segfaults occur. Why does data.flags.writeable become True with copy()?
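
For context on the flags question: numpy marks an array built over a read-only buffer as non-writeable, while copy() allocates fresh memory that the new array owns, so the copy comes back writeable. A minimal standalone illustration (not from the patch):

```python
import numpy as np

buf = bytes(16)                    # an immutable, read-only buffer
a = np.frombuffer(buf, dtype=np.uint8)
print(a.flags.writeable)           # False: backed by a read-only buffer
b = a.copy()
print(b.flags.writeable)           # True: the copy owns its own memory
```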

@rgommers (Member) commented Mar 6, 2014

I mean take one that works with current scipy master, and compare with this patch.

Flags: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flags.html
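
For reference, a quick way to run that comparison, once on current master and once with this patch (the filename here is a hypothetical placeholder for a suitably large file):

```python
import timeit

# time opening a large netcdf file in read (mmap) mode
t = timeit.timeit(
    "netcdf_file('big.nc', 'r').close()",
    setup="from scipy.io.netcdf import netcdf_file",
    number=10)
print('mean open time: %.3f s' % (t / 10))
```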

@jinhyokh (Contributor, Author) commented Mar 6, 2014

Now it sets the writeable flag properly and passes the tests.

@jinhyokh (Contributor, Author) commented Mar 7, 2014

Sorry, I didn't pay attention to the details. Considering the inefficiency of copy(), I realized this is not a good way of solving the problem. It may be better simply to advise against using mmap for netcdf files with many variables, or to pass mmap=False to netcdf_file().

A better idea may be to check the number of potential mmap() calls and, if it is greater than resource.getrlimit(resource.RLIMIT_NOFILE), change self.use_mmap to False. How can I count the number of open file descriptors in Python? (One possible approach is sketched below.)

Or, any better idea?
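
One way to get at the descriptor count is via /proc (a hypothetical helper; /proc/self/fd is Linux-specific, so other platforms would need a fallback):

```python
import os
import resource

def fd_headroom():
    # soft limit on open file descriptors for this process
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # count currently open descriptors (Linux-only: /proc/self/fd)
    open_fds = len(os.listdir('/proc/self/fd'))
    return soft - open_fds
```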

@pv (Member) commented Mar 7, 2014

It should be possible to use a single mmap, initialized when the file is opened, and obtain the variables via something like:

value = self._mm[start_byte_pos:end_byte_pos].view(dtype).reshape(shape)
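
A rough sketch of that single-map idea (illustrative names, not scipy's actual attributes): map the whole file once at open time and hand out per-variable views into the one map.

```python
import mmap
import numpy as np

class MappedNetCDF:
    def __init__(self, path):
        self.fp = open(path, 'rb')
        self._mm = mmap.mmap(self.fp.fileno(), 0, access=mmap.ACCESS_READ)
        # a flat byte array over the map; variable data become views of it
        self._mm_buf = np.frombuffer(self._mm, dtype=np.int8)

    def variable(self, start, nbytes, dtype, shape):
        # a zero-copy view: no extra file descriptors, no data copying
        return self._mm_buf[start:start + nbytes].view(dtype).reshape(shape)

    def close(self):
        # note: outstanding variable views keep the map's buffer exported,
        # so closing while views are still alive raises BufferError
        self._mm_buf = None
        self._mm.close()
        self.fp.close()
```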

@pv (Member) commented Mar 7, 2014

Also: isn't netcdf.py actually pupynere? The upstream is still active, so we should in principle track it: https://bitbucket.org/robertodealmeida/pupynere/commits/all

@rgommers (Member) commented Mar 9, 2014

@pv it was, a very long time ago. The last sync by Roberto (the Pupynere author) was in 5b02684. After that a bunch of fixes went into scipy that I don't think we contributed back. I'll open a separate issue for re-syncing with upstream.

@rgommers (Member) commented Mar 9, 2014

The Travis error is a timeout; looks OK for this patch now.

@coveralls

Coverage Status

Coverage remained the same when pulling 0b79ea0 on heoj:master into c81c978 on scipy:master.

@jinhyokh (Contributor, Author)

I further modified the code to use a buffer, as pv suggested. Memory efficiency improved significantly!

@coveralls

Coverage Status

Coverage remained the same when pulling 09d235e on heoj:master into c81c978 on scipy:master.

@betodealmeida

There's another major improvement in using ALLOCATIONGRANULARITY when available. See:

https://bitbucket.org/robertodealmeida/pupynere/src/0fa832fa4400da818019f7d2883669c1ce08823f/pupynere.py?at=default#cl-93

When ALLOCATIONGRANULARITY is not available, the mmap for each variable is created from the start of the file to the end of the variable, since it's not possible to specify an arbitrary offset; the offset is instead applied later, when creating the NumPy array. For files with many variables this is highly inefficient. The code at Bitbucket checks whether ALLOCATIONGRANULARITY is available (Python >= 2.6) and uses it to page-align the map's offset.
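
A sketch of that alignment trick (illustrative names; mmap offsets must be a multiple of mmap.ALLOCATIONGRANULARITY): round the variable's offset down to the nearest boundary, map only from there, and index past the padding inside the map.

```python
import mmap
import numpy as np

def map_variable(fileno, offset, nbytes, dtype, shape):
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // gran) * gran          # nearest boundary below offset
    pad = offset - aligned                     # bytes of alignment padding
    # map only from the boundary instead of from the start of the file
    mm = mmap.mmap(fileno, pad + nbytes,
                   access=mmap.ACCESS_READ, offset=aligned)
    count = nbytes // np.dtype(dtype).itemsize
    return np.frombuffer(mm, dtype=dtype, count=count,
                         offset=pad).reshape(shape)
```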

pv added a commit that referenced this pull request Apr 27, 2014
BUG: io/netcdf: use only a single mmap in netcdf

Using a separate mmap for each access of each variable can cause running
out of file descriptors on some platforms. Address this by using a
single mmap covering the whole file.
@pv (Member) commented Apr 27, 2014

Merged with minor changes in 58d8117

@pv pv closed this Apr 27, 2014
@pv pv added this to the 0.15.0 milestone Apr 27, 2014