MemoryError when using .loc or .ix #4280

Closed
BAM-BAM-BAM opened this issue Jul 17, 2013 · 11 comments · Fixed by #4283
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves

@BAM-BAM-BAM

from pandas import *
df = read_csv(open('mydata.csv.gz', 'r'), compression='gzip', index_col=False)
df = df[(df.land != 1)]
print df
# 
# Int64Index: 977579 entries, 0 to 1100398
# Data columns (total 89 columns):
# 

# sample 100,000 rows, only use some of the columns
rows = np.random.choice(df.index.values, 100000)
keep_cols = ['sq_ft', 'zip', 'year', 'bathrooms', 'bedrooms', 'floors']
sampled_df = df.ix[rows, keep_cols]

sampled_df.loc[sampled_df.year.notnull()].year        # works fine
sampled_df.loc[sampled_df.year.notnull(),['year']]    # MemoryError

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
 in ()
      1 #sampled_df.loc[sampled_df['year'].notnull(),['year']]
      2 sampled_df.loc[sampled_df.year.notnull()].year
----> 3 sampled_df.loc[sampled_df.year.notnull(),['year']]

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
    695     def __getitem__(self, key):
    696         if type(key) is tuple:
--> 697             return self._getitem_tuple(key)
    698         else:
    699             return self._getitem_axis(key, axis=0)

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup)
    260         # ugly hack for GH #836
    261         if self._multi_take_opportunity(tup):
--> 262             return self._multi_take(tup)
    263 
    264         # no shortcut needed

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/indexing.pyc in _multi_take(self, tup)
    300             index = self._convert_for_reindex(tup[0], axis=0)
    301             columns = self._convert_for_reindex(tup[1], axis=1)
--> 302             return self.obj.reindex(index=index, columns=columns)
    303         elif isinstance(self.obj, Panel4D):
    304             conv = [self._convert_for_reindex(x, axis=i)

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/frame.pyc in reindex(self, index, columns, method, level, fill_value, limit, copy, takeable)
   2623         if index is not None:
   2624             frame = frame._reindex_index(index, method, copy, level,
-> 2625                                          fill_value, limit, takeable)
   2626 
   2627         return frame

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/frame.pyc in _reindex_index(self, new_index, method, copy, level, fill_value, limit, takeable)
   2703         new_index, indexer = self.index.reindex(new_index, method, level,
   2704                                                 limit=limit, copy_if_needed=True,
-> 2705                                                 takeable=takeable)
   2706         return self._reindex_with_indexers(new_index, indexer, None, None,
   2707                                            copy, fill_value)

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/index.pyc in reindex(self, target, method, level, limit, copy_if_needed, takeable)
    930                         raise ValueError("cannot reindex a non-unique index "
    931                                          "with a method or limit")
--> 932                     indexer, _ = self.get_indexer_non_unique(target)
    933 
    934         return target, indexer

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/core/index.pyc in get_indexer_non_unique(self, target, **kwargs)
    843             tgt_values = target.values
    844 
--> 845         indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
    846         return Index(indexer), missing
    847 

/home/jprior/Scratch/VENV1/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_indexer_non_unique (pandas/index.c:5049)()

MemoryError: 

Sorry I haven't figured out how to reproduce the error with a toy example.
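A possible starting point for a toy example, offered as a sketch rather than a confirmed reproduction: the traceback above passes through `get_indexer_non_unique`, which means the index of `sampled_df` contains duplicate labels. That is expected, because `np.random.choice` samples with replacement unless `replace=False` is given. The frame below is a hypothetical stand-in for the real data (column names reused from the report, sizes made up), and it uses `.loc` because `.ix` no longer exists in current pandas. On a fixed release the last expression simply works; it is not claimed to trigger the MemoryError.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data set: 1,000 rows, two columns.
rng = np.random.RandomState(0)
year = rng.randint(1900, 2013, size=1000).astype(float)
year[::10] = np.nan  # sprinkle in some missing years
df = pd.DataFrame({"year": year, "sq_ft": rng.rand(1000)})

# np.random.choice samples WITH replacement by default, so `rows`
# can contain the same index label more than once.
rows = rng.choice(df.index.values, 500)
sampled_df = df.loc[rows, ["year", "sq_ft"]]
has_duplicate_labels = not sampled_df.index.is_unique

# Passing replace=False keeps the sampled index unique.
rows_unique = rng.choice(df.index.values, 500, replace=False)
sampled_unique = df.loc[rows_unique, ["year", "sq_ft"]]

# The expression from the report, applied to the small frame.
subset = sampled_df.loc[sampled_df["year"].notnull(), ["year"]]
```

Sampling with `replace=False` keeps the index unique, which avoids the non-unique code path altogether if duplicates were not intended.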

@jreback
Contributor

jreback commented Jul 17, 2013

pretty sure this is fixed in 0.12; can u try on master?

@BAM-BAM-BAM
Author

In [145]: pandas.__version__
Out[145]: '0.12.0rc1'

is rc1 not master?

@jreback
Contributor

jreback commented Jul 17, 2013

yes

can you provide your dataset (a link via Dropbox or something like that)?

and specs: Python and NumPy versions, OS (32/64 bit), memory size

thanks

@BAM-BAM-BAM
Author

import sys
print(sys.version)
2.7 (r27:82500, Jan 10 2013, 09:03:02) 
[GCC 4.4.5 20110214 (Red Hat 4.4.5-6)]
print numpy.__version__
1.7.1
!free -g
             total       used       free     shared    buffers     cached
Mem:            39         35          3          0          0         17
-/+ buffers/cache:         17         21
Swap:           11         10          1
!cat /etc/*-release
Scientific Linux release 6.1 (Carbon)
!uname -a
Linux analyticsdev1 2.6.32-220.4.1.el6.x86_64 #1 SMP Mon Jan 23 17:20:44 CST 2012 x86_64 x86_64 x86_64 GNU/Linux

Unfortunately, I'm not sure I can provide a dataset; we have contracts with our clients that prevent us from sharing specifics. I will have to check on that.

@jreback
Contributor

jreback commented Jul 17, 2013

how much memory do you have in GB total?

@jtratner
Contributor

@BAM-BAM-BAM You are using the commit marked v0.12rc1, which isn't quite master. (If you download the entire pandas repo as a tarball, you should be able to get master.)

@jreback This might be a general issue with using git describe for version names, because GitHub now calls out specific "releases".

@jreback
Contributor

jreback commented Jul 17, 2013

or is that above in GB?

@cpcloud
Member

cpcloud commented Jul 17, 2013

free -g is in GB

@jreback
Contributor

jreback commented Jul 18, 2013

@BAM-BAM-BAM

ok...all fixed up...this was a bug in the way memory was being allocated for determining non-unique indexers (which only showed up with a large frame and a large number of locations to index)

closed by #4283

thanks for the report!
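To illustrate what a "non-unique indexer" is, here is a small sketch using the public `Index.get_indexer_non_unique` method, the same call that appears in the traceback. The labels are made up for illustration, and this does not show the internals changed in #4283. It only shows that each target label can match several rows, so the result can be longer than the list of targets.

```python
import pandas as pd

# An index with repeated labels, and three labels to look up.
index = pd.Index([10, 10, 20, 30, 30, 30])
target = pd.Index([30, 10, 99])

# indexer: positions in `index` matching each target label, one entry
#          per match (-1 where a target label is not found at all).
# missing: positions in `target` of the labels that were not found.
indexer, missing = index.get_indexer_non_unique(target)

n_targets = len(target)
n_results = len(indexer)  # exceeds n_targets because of repeated labels
```

Here three targets produce six indexer entries. With a large frame and a large number of locations to index, the result can be much larger than either input, so how the buffer for it is sized matters.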

@BAM-BAM-BAM
Author

Great thanks!

@jreback
Contributor

jreback commented Jul 18, 2013

if you can pull down master and give it a try, that would be great
