MemoryError when using .loc or .ix #4280

BAM-BAM-BAM · 2013-07-17T19:33:22Z

import numpy as np  # np.random.choice is used below
from pandas import *
df = read_csv(open('mydata.csv.gz', 'r'), compression='gzip', index_col=False)
df = df[(df.land != 1)]
print df
# 
# Int64Index: 977579 entries, 0 to 1100398
# Data columns (total 89 columns):
# 

# sample 100,000 rows, only use some of the columns
rows = np.random.choice(df.index.values, 100000)
keep_cols = ['sq_ft', 'zip', 'year', 'bathrooms', 'bedrooms', 'floors']
sampled_df = df.ix[rows, keep_cols]

sampled_df.loc[sampled_df.year.notnull()].year        # works fine
sampled_df.loc[sampled_df.year.notnull(),['year']]    # MemoryError

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
 in ()
      1 #sampled_df.loc[sampled_df['year'].notnull(),['year']]
      2 sampled_df.loc[sampled_df.year.notnull()].year
----> 3 sampled_df.loc[sampled_df.year.notnull(),['year']]

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
    695     def __getitem__(self, key):
    696         if type(key) is tuple:
--> 697             return self._getitem_tuple(key)
    698         else:
    699             return self._getitem_axis(key, axis=0)

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup)
    260         # ugly hack for GH #836
    261         if self._multi_take_opportunity(tup):
--> 262             return self._multi_take(tup)
    263 
    264         # no shortcut needed

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/indexing.pyc in _multi_take(self, tup)
    300             index = self._convert_for_reindex(tup[0], axis=0)
    301             columns = self._convert_for_reindex(tup[1], axis=1)
--> 302             return self.obj.reindex(index=index, columns=columns)
    303         elif isinstance(self.obj, Panel4D):
    304             conv = [self._convert_for_reindex(x, axis=i)

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/frame.pyc in reindex(self, index, columns, method, level, fill_value, limit, copy, takeable)
   2623         if index is not None:
   2624             frame = frame._reindex_index(index, method, copy, level,
-> 2625                                          fill_value, limit, takeable)
   2626 
   2627         return frame

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/frame.pyc in _reindex_index(self, new_index, method, copy, level, fill_value, limit, takeable)
   2703         new_index, indexer = self.index.reindex(new_index, method, level,
   2704                                                 limit=limit, copy_if_needed=True,
-> 2705                                                 takeable=takeable)
   2706         return self._reindex_with_indexers(new_index, indexer, None, None,
   2707                                            copy, fill_value)

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/index.pyc in reindex(self, target, method, level, limit, copy_if_needed, takeable)
    930                         raise ValueError("cannot reindex a non-unique index "
    931                                          "with a method or limit")
--> 932                     indexer, _ = self.get_indexer_non_unique(target)
    933 
    934         return target, indexer

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/index.pyc in get_indexer_non_unique(self, target, **kwargs)
    843             tgt_values = target.values
    844 
--> 845         indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
    846         return Index(indexer), missing
    847 

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_indexer_non_unique (pandas/index.c:5049)()

MemoryError:

Sorry I haven't figured out how to reproduce the error with a toy example.
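One thing that may help with a toy reproduction: np.random.choice samples with replacement by default, so `rows` almost certainly contains repeated labels and `sampled_df` ends up with a non-unique index. That would match the `get_indexer_non_unique` frame at the bottom of the traceback. The sketch below uses synthetic data and current pandas syntax (`.loc` in place of `.ix`, Python 3 `print`); the column values and sizes are assumptions and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: values are arbitrary).
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "year": rng.choice([1990.0, 2005.0, np.nan], size=1000),
    "sq_ft": rng.randint(500, 5000, size=1000),
})

# choice() samples WITH replacement unless replace=False is passed,
# so some row labels are drawn more than once.
rows = rng.choice(df.index.values, 500)
sampled_df = df.loc[rows, ["year", "sq_ft"]]
print(sampled_df.index.is_unique)   # False: duplicate labels are present

# Sampling without replacement keeps every label distinct.
unique_rows = rng.choice(df.index.values, 500, replace=False)
unique_df = df.loc[unique_rows, ["year", "sq_ft"]]
print(unique_df.index.is_unique)    # True
```

If duplicate rows are not actually wanted in the sample, passing replace=False sidesteps the non-unique indexing path entirely.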

jreback · 2013-07-17T20:13:11Z

pretty sure this is fixed in 0.12; can u try on master?

BAM-BAM-BAM · 2013-07-17T20:16:14Z

In [145]: pandas.__version__
Out[145]: '0.12.0rc1'

is rc1 not master?

jreback · 2013-07-17T20:26:15Z

yes

can u provide your dataset (a link via dropbox or something like that)

and specs: python, numpy versions, os (32/64 bit), memory size

thanks

BAM-BAM-BAM · 2013-07-17T20:55:03Z

import sys
print(sys.version)
2.7 (r27:82500, Jan 10 2013, 09:03:02) 
[GCC 4.4.5 20110214 (Red Hat 4.4.5-6)]
print numpy.__version__
1.7.1
!free -g
             total       used       free     shared    buffers     cached
Mem:            39         35          3          0          0         17
-/+ buffers/cache:         17         21
Swap:           11         10          1
!cat /etc/*-release
Scientific Linux release 6.1 (Carbon)
!uname -a
Linux analyticsdev1 2.6.32-220.4.1.el6.x86_64 #1 SMP Mon Jan 23 17:20:44 CST 2012 x86_64 x86_64 x86_64 GNU/Linux

Unfortunately I'm not sure I can provide a dataset, we have contracts with our clients which prevent us from sharing specifics. Will have to check on that.

jreback · 2013-07-17T21:03:38Z

how much memory do you have in GB total?

jtratner · 2013-07-17T21:03:58Z

@BAM-BAM-BAM You are using the commit marked v0.12rc1, which isn't quite master. (if you download the entire pandas repo as a tarball, you should be able to get it on master)

@jreback This might be a general issue with using git describe for version names, because Github now calls out specific "releases"

jreback · 2013-07-17T21:04:12Z

or is that above in GB?

cpcloud · 2013-07-17T22:10:18Z

free -g is in GB

jreback · 2013-07-18T02:42:25Z

@BAM-BAM-BAM

ok...all fixed up...this was a bug in the way memory was being allocated for determining non-unique indexers (which only showed up with a large frame and a large number of locations to index)
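For anyone unfamiliar with what a non-unique indexer is, here is a rough pure-Python sketch of the idea. It is illustrative only and is not the pandas implementation or the change made in the fix; the function name is hypothetical. The point is that with duplicate labels in the index, each requested label can match several positions, so the result can be longer than the request and its size is not known up front.

```python
from collections import defaultdict


def get_indexer_non_unique_sketch(index_labels, target_labels):
    """Hypothetical helper: map each target label to ALL matching positions."""
    # Record every position at which each label occurs in the index.
    positions = defaultdict(list)
    for pos, label in enumerate(index_labels):
        positions[label].append(pos)

    indexer = []   # positions into index_labels, -1 where nothing matched
    missing = []   # positions in target_labels that had no match
    for i, label in enumerate(target_labels):
        hits = positions.get(label)
        if hits:
            indexer.extend(hits)   # one entry per occurrence of the label
        else:
            indexer.append(-1)
            missing.append(i)
    return indexer, missing


index_labels = [10, 20, 10, 30]    # label 10 appears twice
target_labels = [10, 30, 99]       # 99 is not in the index
indexer, missing = get_indexer_non_unique_sketch(index_labels, target_labels)
print(indexer)  # [0, 2, 3, -1]
print(missing)  # [2]
```

Three labels were requested and four positions came back, which is why the output buffer for this kind of lookup has to grow with the number of matches.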

closed by #4283

thanks for the report!

BAM-BAM-BAM · 2013-07-18T03:04:19Z

Great thanks!

jreback · 2013-07-18T03:15:42Z

if u can pull down master and give it a try that would be great

jreback mentioned this issue Jul 18, 2013

BUG: Fixed non-unique indexing memory allocation issue with .ix/.loc (GH4280) #4283

Merged

jreback closed this as completed in #4283 Jul 18, 2013
