
BUG: sklearn.feature_extraction.text.HashingVectorizer.fit_transform raises ValueError when it shouldn't #8941

Closed
BenKaehler opened this issue May 27, 2017 · 3 comments · Fixed by #9147

Comments

@BenKaehler

Description

sklearn.feature_extraction.text.HashingVectorizer.fit_transform raises ValueError: indices and data should have the same size once the input grows past a certain size. If you split the same data into smaller chunks and transform each chunk separately, it runs fine.

Steps/Code to Reproduce

import sklearn
from sklearn.feature_extraction.text import HashingVectorizer
print('scikit-learn version')
print(sklearn.__version__)
vectorizer = HashingVectorizer(
    analyzer='char', non_negative=True,
    n_features=1024, ngram_range=(4, 16))
X = ['A' * 1432] * 203452  # 203452 identical 1432-character documents
print('works')
vectorizer.fit_transform(X[:100000])  # a 100000-document slice transforms fine
print('does not work')
vectorizer.fit_transform(X)  # the full corpus raises ValueError

Expected Results

scikit-learn version
0.18.1
works
does not work

Actual Results

scikit-learn version
0.18.1
works
does not work
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-aae200adab09> in <module>()
     10 vectorizer.fit_transform(X[:100000])
     11 print('does not work')
---> 12 vectorizer.fit_transform(X)

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/sklearn/feature_extraction/text.py in transform(self, X, y)
    485 
    486         analyzer = self.build_analyzer()
--> 487         X = self._get_hasher().transform(analyzer(doc) for doc in X)
    488         if self.binary:
    489             X.data.fill(1)

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/sklearn/feature_extraction/hashing.py in transform(self, raw_X, y)
    147 
    148         X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
--> 149                           shape=(n_samples, self.n_features))
    150         X.sum_duplicates()  # also sorts the indices
    151         if self.non_negative:

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     96             self.data = np.asarray(self.data, dtype=dtype)
     97 
---> 98         self.check_format(full_check=False)
     99 
    100     def getnnz(self, axis=None):

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/scipy/sparse/compressed.py in check_format(self, full_check)
    165         # check index and data arrays
    166         if (len(self.indices) != len(self.data)):
--> 167             raise ValueError("indices and data should have the same size")
    168         if (self.indptr[-1] > len(self.indices)):
    169             raise ValueError("Last value of index pointer should be less than "

ValueError: indices and data should have the same size

Versions

Darwin-16.5.0-x86_64-i386-64bit
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 12:15:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.18.1

and

Linux-2.6.32-504.16.2.el6.x86_64-x86_64-with-centos-6.6-Final
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 11:58:13) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.18.1
@jnothman
Member

So you're producing 13 non-zero cells per row from 18600 tokens or so per document. Summed over all 203452 rows, that's roughly 3.8 billion token occurrences in the intermediate index arrays, which exceeds what a 32-bit index can hold.
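
For context, a back-of-the-envelope check of those numbers, computed from the parameters in the reproduction (a sketch for illustration):

# Each document is 'A' * 1432; a char n-gram of length n occurs
# 1432 - n + 1 times, for n = 4..16.
tokens_per_doc = sum(1432 - n + 1 for n in range(4, 17))  # 18499 per document
total_tokens = tokens_per_doc * 203452                    # 3763658548 overall
print(total_tokens > 2**31 - 1)                           # True: past the int32 limit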

Yes, chunking the data would be a sufficient fix, assuming no single row produces the problem on its own (a sketch of this workaround follows below). Alternatively, we could consider using scipy.sparse's csr_sum_duplicates directly.

PR welcome
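
Until a fix lands, a chunked workaround might look like the following sketch (transform_in_chunks and chunk_size are illustrative names, and the approach assumes no single chunk overflows on its own):

import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    analyzer='char', non_negative=True,
    n_features=1024, ngram_range=(4, 16))

def transform_in_chunks(docs, chunk_size=100000):
    # Transform each chunk separately so the intermediate index
    # arrays stay below the 32-bit limit, then stack the results.
    chunks = [vectorizer.transform(docs[i:i + chunk_size])
              for i in range(0, len(docs), chunk_size)]
    return sp.vstack(chunks, format='csr')

X = ['A' * 1432] * 203452
result = transform_in_chunks(X)  # shape (203452, 1024)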

@BenKaehler
Author

Thanks @jnothman! How would calling csr_sum_duplicates directly help? We have to build the csr_matrix at some point, so wouldn't we run into the same problem whenever we create it?

@jnothman
Member

jnothman commented May 28, 2017 via email
