
BUG: sklearn.feature_extraction.text.HashingVectorizer.fit_transform raises ValueError when it shouldn't #8941

Closed
BenKaehler opened this issue May 27, 2017 · 3 comments · Fixed by #9147

Comments

@BenKaehler

Description

sklearn.feature_extraction.text.HashingVectorizer.fit_transform raises ValueError: indices and data should have the same size once the input grows past a certain size. If you split the same data into smaller chunks and transform each chunk separately, it runs fine.

Steps/Code to Reproduce

import sklearn
from sklearn.feature_extraction.text import HashingVectorizer
print('scikit-learn version')
print(sklearn.__version__)
vectorizer = HashingVectorizer(
    analyzer='char', non_negative=True,
    n_features=1024, ngram_range=(4, 16))
X = ['A' * 1432] * 203452  # 203452 identical 1432-character documents
print('works')
vectorizer.fit_transform(X[:100000])  # a 100000-document slice transforms fine
print('does not work')
vectorizer.fit_transform(X)  # the full corpus raises ValueError

Expected Results

scikit-learn version
0.18.1
works
does not work

Actual Results

scikit-learn version
0.18.1
works
does not work
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-aae200adab09> in <module>()
     10 vectorizer.fit_transform(X[:100000])
     11 print('does not work')
---> 12 vectorizer.fit_transform(X)

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/sklearn/feature_extraction/text.py in transform(self, X, y)
    485 
    486         analyzer = self.build_analyzer()
--> 487         X = self._get_hasher().transform(analyzer(doc) for doc in X)
    488         if self.binary:
    489             X.data.fill(1)

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/sklearn/feature_extraction/hashing.py in transform(self, raw_X, y)
    147 
    148         X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
--> 149                           shape=(n_samples, self.n_features))
    150         X.sum_duplicates()  # also sorts the indices
    151         if self.non_negative:

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     96             self.data = np.asarray(self.data, dtype=dtype)
     97 
---> 98         self.check_format(full_check=False)
     99 
    100     def getnnz(self, axis=None):

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/scipy/sparse/compressed.py in check_format(self, full_check)
    165         # check index and data arrays
    166         if (len(self.indices) != len(self.data)):
--> 167             raise ValueError("indices and data should have the same size")
    168         if (self.indptr[-1] > len(self.indices)):
    169             raise ValueError("Last value of index pointer should be less than "

ValueError: indices and data should have the same size

Versions

Darwin-16.5.0-x86_64-i386-64bit
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 12:15:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.18.1

and

Linux-2.6.32-504.16.2.el6.x86_64-x86_64-with-centos-6.6-Final
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 11:58:13) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.18.1
@jnothman
Member

So you're producing 13 non-zero cells per row from 18600 tokens or so per document. Summed over all 203452 rows, that's roughly 3.8 billion token occurrences in the intermediate index arrays, which exceeds what a 32-bit index can hold.
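
For context, a back-of-the-envelope check of those numbers, computed from the parameters in the reproduction (a sketch for illustration):

# Each document is 'A' * 1432; a char n-gram of length n occurs
# 1432 - n + 1 times, for n = 4..16.
tokens_per_doc = sum(1432 - n + 1 for n in range(4, 17))  # 18499 per document
total_tokens = tokens_per_doc * 203452                    # 3763658548 overall
print(total_tokens > 2**31 - 1)                           # True: past the int32 limit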

Yes, chunking the data would be a sufficient fix, assuming no single row produces the problem on its own (a sketch of this workaround follows below). Alternatively, we could consider using scipy.sparse's csr_sum_duplicates directly.

PR welcome
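
Until a fix lands, a chunked workaround might look like the following sketch (transform_in_chunks and chunk_size are illustrative names, and the approach assumes no single chunk overflows on its own):

import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    analyzer='char', non_negative=True,
    n_features=1024, ngram_range=(4, 16))

def transform_in_chunks(docs, chunk_size=100000):
    # Transform each chunk separately so the intermediate index
    # arrays stay below the 32-bit limit, then stack the results.
    chunks = [vectorizer.transform(docs[i:i + chunk_size])
              for i in range(0, len(docs), chunk_size)]
    return sp.vstack(chunks, format='csr')

X = ['A' * 1432] * 203452
result = transform_in_chunks(X)  # shape (203452, 1024)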

@BenKaehler
Author

Thanks @jnothman! How would calling csr_sum_duplicates directly help? We have to build the csr_matrix at some point, so wouldn't we run into the same problem whenever we create it?

@jnothman
Member

jnothman commented May 28, 2017 via email
