BUG: sklearn.feature_extraction.text.HashingVectorizer.fit_transform raises ValueError when it shouldn't #8941
Comments
So you're producing 13 non-zero cells per row from 18600 tokens or so. This will exceed a 32-bit index. Yes, chunking the data would be a sufficient fix, assuming no single row produces this problem. Or else, we could consider using scipy.sparse's csr_sum_duplicates directly. PR welcome.
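A minimal sketch of the chunking workaround, assuming the documents fit in a list and that each chunk on its own stays below the index limit. The helper name `hashing_transform_in_chunks` and the `chunk_size` parameter are hypothetical and not part of scikit-learn; the toy documents are for illustration only.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer


def hashing_transform_in_chunks(docs, vectorizer, chunk_size=1000):
    """Hypothetical helper: transform docs chunk by chunk, then stack.

    Assumes docs is a non-empty list of strings; vstack raises on an
    empty list of blocks.
    """
    parts = [
        vectorizer.transform(docs[start:start + chunk_size])
        for start in range(0, len(docs), chunk_size)
    ]
    # Stack the partial CSR matrices row-wise into one CSR matrix.
    return vstack(parts, format="csr")


# Toy data for illustration only.
docs = ["the quick brown fox", "jumps over", "the lazy dog"]
vectorizer = HashingVectorizer(n_features=2 ** 10)

X_chunked = hashing_transform_in_chunks(docs, vectorizer, chunk_size=2)
X_full = vectorizer.transform(docs)
```

Because HashingVectorizer is stateless, transforming chunks separately should give the same rows as transforming everything at once; the chunk size is a tuning choice, not a value taken from this thread.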
Thanks @jnothman! How would calling csr_sum_duplicates directly help? We have to return the csr_matrix at some point regardless. Wouldn't we run into the same problem regardless of when we create the csr_matrix?
I mean you could call csr_sum_duplicates on a partially filled buffer
rather than the full matrix, to avoid creating a bunch of partial matrices
and then stacking. But this is almost certainly excessive premature
optimisation
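To illustrate what summing duplicates does to a partially filled buffer, here is a small sketch. It uses the public `csr_matrix.sum_duplicates()` method rather than the internal `csr_sum_duplicates` routine named above, and the buffer contents are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small "partial buffer": row 0 contains column 1 twice (a duplicate entry).
data = np.array([1.0, 2.0, 3.0, 4.0])
indices = np.array([1, 1, 3, 0], dtype=np.int32)
indptr = np.array([0, 3, 4], dtype=np.int32)

partial = csr_matrix((data, indices, indptr), shape=(2, 4))
nnz_before = partial.nnz

# Merge duplicate (row, column) entries in place by summing their values.
partial.sum_duplicates()
nnz_after = partial.nnz
```

Compacting each buffer this way reduces the number of stored entries before the buffers are combined, which is the motivation for the suggestion.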
On 28 May 2017 6:12 am, "BenKaehler" wrote:
Thanks @jnothman! How would calling
csr_sum_duplicates directly help? We have to return the csr_matrix at some
point regardless. Wouldn't we run into the same problem regardless of when
we create the csr_matrix?
Description
sklearn.feature_extraction.text.HashingVectorizer.fit_transform raises ValueError: indices and data should have the same size for data of a certain length. If you chunk the same data it runs fine.
Steps/Code to Reproduce
Expected Results
Actual Results
Versions
and