index error when fit_transform a large dataset #26

Closed
eromoe opened this issue Jan 13, 2016 · 2 comments

eromoe commented Jan 13, 2016

X_train_features has shape (30962, 15637) and y_train has shape (30962,).

type(X_train_features) is scipy.sparse.csr.csr_matrix.

Calling fit_transform raises an IndexError:

IndexError                                Traceback (most recent call last)
<ipython-input-44-fa5e3a9ff626> in <module>()
      5 os = OverSampler(ratio=ratio, verbose=verbose)
      6 
----> 7 osx, osy = os.fit_transform(X_train_features, y_train)

C:\Python27\lib\site-packages\unbalanceddataset-0.1-py2.7.egg\unbalanced_dataset\unbalanced_dataset.pyc in fit_transform(self, x, y)
    260 
    261         self.fit(x, y)
--> 262         self.out_x, self.out_y = self.resample()
    263 
    264         return self.out_x, self.out_y

C:\Python27\lib\site-packages\unbalanceddataset-0.1-py2.7.egg\unbalanced_dataset\over_sampling.pyc in resample(self)
     52 
     53         # Start with the majority class
---> 54         overx = self.x[self.y == self.maxc]
     55         overy = self.y[self.y == self.maxc]
     56 

C:\Python27\lib\site-packages\scipy\sparse\csr.pyc in __getitem__(self, key)
    305             row, col = self._index_to_arrays(row, col)
    306 
--> 307         row = asindices(row)
    308         col = asindices(col)
    309         if row.shape != col.shape:

C:\Python27\lib\site-packages\scipy\sparse\csr.pyc in asindices(x)
    224                     x = x.astype(idx_dtype)
    225             except:
--> 226                 raise IndexError('invalid index')
    227             else:
    228                 return x

IndexError: invalid index
Collaborator

fmfn commented Jan 16, 2016

It seems like a problem resulting from using a sparse matrix. To be honest, sparse matrices were completely neglected when writing this. That kind of makes sense given when they are usually needed, and the fact that most methods here (definitely SMOTE) rely on local density estimations to generate new samples.

I'll take a look over the weekend.

Author

eromoe commented Jan 20, 2016

I found that the problem above was caused by y_train being converted to a pandas.Series somewhere in my code. After I fixed that, the sparse matrix itself became the pain, you are right.
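For anyone hitting the same IndexError, here is a minimal sketch of one way to guard against the label type, assuming the labels can be converted with np.asarray (which also accepts a pandas.Series). The helper name select_rows is hypothetical and not part of unbalanced_dataset. It converts the labels to a plain NumPy array and indexes the CSR matrix with integer row positions instead of passing the mask object straight through:

```python
import numpy as np
from scipy import sparse


def select_rows(X, y, label):
    """Return the rows of X whose label equals `label`.

    Hypothetical helper: y may be a list, a NumPy array, or a
    pandas.Series; it is converted to a 1-D NumPy array first.
    """
    y_arr = np.asarray(y).ravel()
    # Integer row positions are accepted by both dense arrays and CSR matrices.
    rows = np.flatnonzero(y_arr == label)
    return X[rows]


# Small illustrative data, not the data from this issue.
X_demo = sparse.csr_matrix(np.arange(12, dtype=float).reshape(4, 3))
y_demo = [0, 1, 0, 0]
majority = select_rows(X_demo, y_demo, 0)
```

The result keeps the sparse type, so the later stacking step still has to handle sparse input.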

concatenate((overx,self.x[self.y == key],self.x[self.y == key][indx]), axis=0)
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\IPython\core\interactiveshell.py", line 3066, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-40-b5ea4e093941>", line 1, in <module>
    concatenate((overx,self.x[self.y == key],self.x[self.y == key][indx]), axis=0)
ValueError: zero-dimensional arrays cannot be concatenated

test code:

import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.zeros((100,200)))
b = sparse.csr_matrix(np.zeros((200,200)))
c = sparse.csr_matrix(np.zeros((300,200)))

np.concatenate((a,b,c), axis=0)

>>>  ValueError: zero-dimensional arrays cannot be concatenated .....

Solution:

import scipy.sparse as sp
sp.vstack((a, b, c))
>>> <600x200 sparse matrix of type '<type 'numpy.float64'>'
        with 0 stored elements in Compressed Sparse Row format>

Maybe add a custom function that detects the type and then chooses between concatenate and vstack?
There may be other problems similar to this one; I will search for an existing package that handles this situation well.
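A rough sketch of what such a type-detecting function could look like. The name stack_rows is hypothetical, and this is only one possible approach, not what the library does. It uses scipy.sparse.issparse to pick between scipy.sparse.vstack and np.concatenate, and converts every block to CSR when any block is sparse:

```python
import numpy as np
from scipy import sparse


def stack_rows(blocks):
    """Stack 2-D blocks row-wise, staying sparse if any block is sparse.

    Hypothetical helper for illustration only.
    """
    blocks = list(blocks)
    if any(sparse.issparse(b) for b in blocks):
        # Convert every block to CSR so mixed dense/sparse input is handled.
        csr_blocks = [sparse.csr_matrix(b) for b in blocks]
        return sparse.vstack(csr_blocks, format="csr")
    return np.concatenate(blocks, axis=0)


# Same shapes as the test code above.
a = sparse.csr_matrix(np.zeros((100, 200)))
b = sparse.csr_matrix(np.zeros((200, 200)))
c = sparse.csr_matrix(np.zeros((300, 200)))
stacked_sparse = stack_rows((a, b, c))

# Dense input falls back to np.concatenate.
stacked_dense = stack_rows((np.zeros((2, 3)), np.ones((4, 3))))
```

Inside resample, a call like this could replace the direct concatenate call, provided the row indexing before it also works on sparse input.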
