
ValueError: Buffer dtype mismatch, expected 'DOUBLE' but got 'long long' #1709

Closed
pemistahl opened this issue Feb 24, 2013 · 12 comments · Fixed by #2271

@pemistahl

In scikit-learn 0.13.0, I'm trying to use the class sklearn.preprocessing.StandardScaler to scale my data for being used in an SVM classifier of class sklearn.svm.LinearSVC. The essential parts of my code are the following:

vectorizer = CountVectorizer(...)

X_train = vectorizer.fit_transform(my_training_data_here)
y_train = np.array(my_labels_here)
X_test = vectorizer.transform(my_test_data_here)

scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X=X_train)
X_test_scaled = scaler.transform(X=X_test)

linear_svm_classifier = LinearSVC()
linear_svm_classifier.fit(X=X_train_scaled, y=y_train)
predictions = linear_svm_classifier.predict(X=X_test_scaled)

Unfortunately, an exception is raised by the line X_train_scaled = scaler.fit_transform(X=X_train). This is the relevant part of the stacktrace:

/[...]/sklearn/utils/validation.py:230: UserWarning: StandardScaler assumes floating point values as input, got int64
"got %s" % (estimator, X.dtype))
Traceback (most recent call last):
[...]
X_train_scaled = scaler.fit_transform(X=X_train)
  File "/[...]/sklearn/base.py", line 361, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/[...]/sklearn/preprocessing.py", line 302, in fit
    var = mean_variance_axis0(X)[1]
  File "sparsefuncs.pyx", line 272, in sklearn.utils.sparsefuncs.mean_variance_axis0 (sklearn/utils/sparsefuncs.c:3551)
  File "sparsefuncs.pyx", line 41, in sklearn.utils.sparsefuncs.csr_mean_variance_axis0 (sklearn/utils/sparsefuncs.c:1416)
ValueError: Buffer dtype mismatch, expected 'DOUBLE' but got 'long long'

Do I have to change the dtype myself? If so, how do I do that? Thank you!

@amueller
Member

X_train_scaled = scaler.fit_transform(X=X_train.astype(np.float64))

should do it.
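Expanded into a minimal self-contained sketch of that workaround (the small integer CSR matrix below is a hypothetical stand-in for the CountVectorizer output, not data from the original report):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for an integer-valued CountVectorizer result.
X_train = sp.csr_matrix(np.array([[0, 1, 2], [3, 0, 1]], dtype=np.int64))

# Casting to float before scaling avoids the buffer dtype mismatch.
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
print(X_train_scaled.dtype)  # float64
```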

Flagging this as a bug as first raising a warning and then crashing is weird. Also, I think this should do the conversion internally.

@pemistahl
Author

Thanks, @amueller, that did the trick. I was thinking as well that this conversion should be done internally, since users cannot know about it; that's why I asked. But at least you now know that you might fix this in a later release. :)

@amueller
Member

ping @GaelVaroquaux should we convert? I would prefer to warn and convert - I guess this is too much magic for you?
The alternative here would be to raise an error instead of a warning. Currently it warns, then raises a hard-to-read error.
Any opinions from the other sprinters? @ogrisel @mblondel @arjoly ?

@amueller
Member

For dense arrays it currently rounds, btw.

@vene
Member

vene commented Jul 27, 2013

This looks like a useful pipeline scenario. If we don't convert automatically, there is no way to make this work in a pipeline without writing a converting transformer, which is awkward.
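For illustration, a hedged sketch of such a converting transformer (the `FloatCaster` name and the pipeline layout are made up for this example; they are not part of scikit-learn):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


class FloatCaster(BaseEstimator, TransformerMixin):
    """Cast the input matrix to float64; works for dense and sparse input."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X.astype(np.float64)


# Without the explicit cast step, StandardScaler would reject/mangle int input.
pipeline = Pipeline([
    ("to_float", FloatCaster()),
    ("scaler", StandardScaler(with_mean=False)),
    ("svm", LinearSVC()),
])
```

Having to write this wrapper just to change a dtype is exactly the awkwardness described above.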

@GaelVaroquaux
Member

ping @GaelVaroquaux should we convert? I would prefer to warn and convert - I guess this is too much magic for you?

Convert is fine: this is what we have been doing silently (and what numpy does silently).

@amueller
Member

Should I then also change the behavior for dense? That did not convert until now: it warned and rounded, which is not very helpful in a pipeline, as @vene pointed out.

@GaelVaroquaux
Member

"long long" is an integer, right? I think that converting to float implicitly (and mentioning the change of behavior in the release notes) is good.

@amueller
Member

ok, I'll do it.

@GaelVaroquaux
Member

Thanks

@ogrisel
Member

ogrisel commented Jul 27, 2013

+1 as well.

@amueller
Member

Closed by #2271. Thanks for the report, @pemistahl.
