
ValueError: Buffer dtype mismatch, expected 'DOUBLE' but got 'long long' #1709

Closed
pemistahl opened this issue Feb 24, 2013 · 12 comments · Fixed by #2271

@pemistahl

In scikit-learn 0.13.0, I'm trying to use the class sklearn.preprocessing.StandardScaler to scale my data for being used in an SVM classifier of class sklearn.svm.LinearSVC. The essential parts of my code are the following:

vectorizer = CountVectorizer(...)

X_train = vectorizer.fit_transform(my_training_data_here)
y_train = np.array(my_labels_here)
X_test = vectorizer.transform(my_test_data_here)

scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X=X_train)
X_test_scaled = scaler.transform(X=X_test)

linear_svm_classifier = LinearSVC()
linear_svm_classifier.fit(X=X_train_scaled, y=y_train)
predictions = linear_svm_classifier.predict(X=X_test_scaled)

Unfortunately, an exception is raised by the line X_train_scaled = scaler.fit_transform(X=X_train). This is the relevant part of the stacktrace:

/[...]/sklearn/utils/validation.py:230: UserWarning: StandardScaler assumes floating point values as input, got int64
"got %s" % (estimator, X.dtype))
Traceback (most recent call last):
[...]
X_train_scaled = scaler.fit_transform(X=X_train)
  File "/[...]/sklearn/base.py", line 361, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/[...]/sklearn/preprocessing.py", line 302, in fit
    var = mean_variance_axis0(X)[1]
  File "sparsefuncs.pyx", line 272, in sklearn.utils.sparsefuncs.mean_variance_axis0 (sklearn/utils/sparsefuncs.c:3551)
  File "sparsefuncs.pyx", line 41, in sklearn.utils.sparsefuncs.csr_mean_variance_axis0 (sklearn/utils/sparsefuncs.c:1416)
ValueError: Buffer dtype mismatch, expected 'DOUBLE' but got 'long long'

Do I have to change the dtype myself? If so, how do I do that? Thank you!

@amueller
Member

X_train_scaled = scaler.fit_transform(X=X_train.astype(np.float64))

should do it.
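Expanded into a minimal self-contained sketch of that workaround (the small integer CSR matrix below is a hypothetical stand-in for the CountVectorizer output, not data from the original report):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for an integer-valued CountVectorizer result.
X_train = sp.csr_matrix(np.array([[0, 1, 2], [3, 0, 1]], dtype=np.int64))

# Casting to float before scaling avoids the buffer dtype mismatch.
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
print(X_train_scaled.dtype)  # float64
```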

Flagging this as a bug as first raising a warning and then crashing is weird. Also, I think this should do the conversion internally.

@pemistahl
Author

Thanks, @amueller, that did the trick. I was thinking as well that this conversion should be done internally, since users cannot know about it; that's why I asked. But at least you now know that you might fix this in a later release. :)

@amueller
Member

ping @GaelVaroquaux should we convert? I would prefer to warn and convert - I guess this is too much magic for you?
The alternative here would be to raise an error instead of a warning. Currently it warns, then raises a hard-to-read error.
Any opinions from the other sprinters? @ogrisel @mblondel @arjoly ?

@amueller
Member

For dense arrays it currently rounds, btw.

@vene
Member

vene commented Jul 27, 2013

This looks like a useful pipeline scenario. If we don't convert automatically, there is no way to make this work in a pipeline without writing a converting transformer, which is awkward.
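For illustration, a hedged sketch of such a converting transformer (the `FloatCaster` name and the pipeline layout are made up for this example; they are not part of scikit-learn):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


class FloatCaster(BaseEstimator, TransformerMixin):
    """Cast the input matrix to float64; works for dense and sparse input."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X.astype(np.float64)


# Without the explicit cast step, StandardScaler would reject/mangle int input.
pipeline = Pipeline([
    ("to_float", FloatCaster()),
    ("scaler", StandardScaler(with_mean=False)),
    ("svm", LinearSVC()),
])
```

Having to write this wrapper just to change a dtype is exactly the awkwardness described above.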

@GaelVaroquaux
Member

ping @GaelVaroquaux should we convert? I would prefer to warn and convert - I guess this is too much magic for you?

Convert is fine: this is what we have been doing silently (and what numpy does silently).

@amueller
Member

Should I then also change the behavior for dense? That did not convert until now: it warned and rounded, which is not very helpful in a pipeline, as @vene pointed out.

@GaelVaroquaux
Member

"long long" is an integer, right? I think that converting to float implicitly (and mentioning the change of behavior in the release notes) is good.

@amueller
Member

ok, I'll do it.

@GaelVaroquaux
Member

Thanks

@ogrisel
Member

ogrisel commented Jul 27, 2013

+1 as well.

@amueller
Member

Closed by #2271. Thanks for the report, @pemistahl.
