Skip to content

Commit

Permalink
BUG: Fixing memory leaks in read_csv
Browse files Browse the repository at this point in the history
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes pandas-devgh-21353.
  • Loading branch information
zhezherun authored and gfyoung committed Nov 19, 2018
1 parent 0ab8eb2 commit e3ff389
Show file tree
Hide file tree
Showing 2 changed files with 21 additions and 18 deletions.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
Expand Up @@ -1382,6 +1382,7 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
- Bug in :func:`DataFrame.to_string()` that caused representations of :class:`DataFrame` to not take up the whole window (:issue:`22984`)
- Bug in :func:`DataFrame.to_csv` where a single level MultiIndex incorrectly wrote a tuple. Now just the value of the index is written (:issue:`19589`).
- Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
- Bug in :func:`read_csv()` in which memory leaks occurred in the C engine when parsing ``NaN`` values due to insufficient cleanup on completion or error (:issue:`21353`)
- Bug in :func:`read_csv()` in which incorrect error messages were being raised when ``skipfooter`` was passed in along with ``nrows``, ``iterator``, or ``chunksize`` (:issue:`23711`)
- Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
- Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)
Expand Down
38 changes: 20 additions & 18 deletions pandas/_libs/parsers.pyx
Expand Up @@ -1070,18 +1070,6 @@ cdef class TextReader:

conv = self._get_converter(i, name)

# XXX
na_flist = set()
if self.na_filter:
na_list, na_flist = self._get_na_list(i, name)
if na_list is None:
na_filter = 0
else:
na_filter = 1
na_hashset = kset_from_list(na_list)
else:
na_filter = 0

col_dtype = None
if self.dtype is not None:
if isinstance(self.dtype, dict):
Expand All @@ -1106,13 +1094,26 @@ cdef class TextReader:
self.c_encoding)
continue

# Should return as the desired dtype (inferred or specified)
col_res, na_count = self._convert_tokens(
i, start, end, name, na_filter, na_hashset,
na_flist, col_dtype)
# XXX
na_flist = set()
if self.na_filter:
na_list, na_flist = self._get_na_list(i, name)
if na_list is None:
na_filter = 0
else:
na_filter = 1
na_hashset = kset_from_list(na_list)
else:
na_filter = 0

if na_filter:
self._free_na_set(na_hashset)
try:
# Should return as the desired dtype (inferred or specified)
col_res, na_count = self._convert_tokens(
i, start, end, name, na_filter, na_hashset,
na_flist, col_dtype)
finally:
if na_filter:
self._free_na_set(na_hashset)

if upcast_na and na_count > 0:
col_res = _maybe_upcast(col_res)
Expand Down Expand Up @@ -2059,6 +2060,7 @@ cdef kh_str_t* kset_from_list(list values) except NULL:

# None creeps in sometimes, which isn't possible here
if not isinstance(val, bytes):
kh_destroy_str(table)
raise ValueError('Must be all encoded bytes')

k = kh_put_str(table, PyBytes_AsString(val), &ret)
Expand Down

0 comments on commit e3ff389

Please sign in to comment.