
[MRG + 2] FIX dtypes to conform to the stricter type cast rules of numpy 1.10 #5398

Merged Oct 15, 2015 (5 commits)

Conversation

raghavrv
Member

Fixes #5397

Explicitly specifies the dtypes to avoid failures under numpy 1.10...

@amueller

X_test = _check_copy_and_writeable(X_test, copy).astype(np.float64,
                                                        copy=False)
y_test = _check_copy_and_writeable(y_test, copy).astype(np.float64,
                                                        copy=False)
Member Author

Did I do this correctly?? (@MechCoder(?)) Will this work properly when copy is False?

Member

I think you need as_float_array here too.

Contributor

FYI .astype(np.float64, copy=False) is not supported on old numpy. We have a backport in sklearn.utils.fixes.

@GaelVaroquaux's suggestion to use as_float_array is better to avoid a systematic upcast of 32 bit float to 64 bit float.
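To illustrate the trade-off being discussed, here is a minimal numpy-only sketch. `as_float_array_sketch` is a hypothetical stand-in for the dtype logic of `sklearn.utils.as_float_array`, not the real implementation:

```python
import numpy as np

def as_float_array_sketch(X, copy=True):
    # Hypothetical stand-in for sklearn.utils.as_float_array's dtype logic:
    # leave float32/float64 input alone, upcast everything else to float64.
    if X.dtype in (np.float32, np.float64):
        return X.copy() if copy else X
    return X.astype(np.float64)

X32 = np.ones((2, 2), dtype=np.float32)
X_int = np.arange(4).reshape(2, 2)

# astype(np.float64) always yields float64, even for float32 input ...
assert X32.astype(np.float64).dtype == np.float64
# ... whereas the as_float_array approach keeps float32 as float32
assert as_float_array_sketch(X32, copy=False).dtype == np.float32
assert as_float_array_sketch(X_int).dtype == np.float64
```

This is why a blanket `.astype(np.float64)` systematically upcasts 32-bit float data, doubling memory, while `as_float_array` does not.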

Member

Sorry, that was me in the previous comment.

Member Author

that's awesome... are we having a bot?? :D

Member Author

and okay :)

Member

But the input checks on dtypes should probably be done outside, in the public function or in the fit method.

We do the _check_copy_and_writeable check in this function for specific constraints caused by the use of multiprocessing with memmap data.

Member

Is the error happening in a test that calls _lars_path_residues directly? If so, it's better to change the test to pass float data only.

@lesteve
Member

lesteve commented Oct 14, 2015

Maybe it would be worth adding an additional distribution with Python 3.5 and latest numpy + scipy versions to .travis.yml ? (as noted in #5397)

You can cherry-pick or take inspiration from this commit.

beta,
(1. - error_vect) * self.learning_rate)
sample_weight *= as_float_array(np.power(
beta, (1. - error_vect) * self.learning_rate), copy=False)
Member

Not sure which fix is preferable, but I had a slightly different fix for the AdaBoostRegressor failure, see this

Member

that actually seems better

Member Author

Okay! I'll take @lesteve's fix here...!!

@raghavrv
Member Author

2 questions -

  1. Do we need a DataConversionWarning at any of these places?
  2. One of the Binarizer tests fails as copy is set to False but the typecasting (from int to float) can't be done in place.

I've explicitly passed float64 in the test as a workaround... is that correct? Or should binarizer raise an error (or warning)?
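The Binarizer situation can be reproduced with plain numpy. `binarize_sketch` below is a hypothetical illustration of the in-place question, not sklearn's actual `binarize`: an int-to-float conversion necessarily allocates a new array, so the no-copy promise cannot be kept for integer input.

```python
import numpy as np

def binarize_sketch(X, threshold=0.0):
    # Hypothetical illustration (not sklearn's binarize):
    # np.asarray copies only when a dtype conversion is needed
    X_float = np.asarray(X, dtype=np.float64)
    X_float[:] = X_float > threshold  # threshold in place on the float array
    return X_float

X_float = np.array([[1.0, 0.0, 5.0], [2.0, 3.0, -1.0]])
X_int = np.array([[1, 0, 5], [2, 3, -1]])

# float input: no conversion, so the work really happens in place
assert binarize_sketch(X_float) is X_float

# int input: the int->float conversion forces a copy, so the original
# stays untouched -- hence passing float64 data in the test
out = binarize_sketch(X_int)
assert out is not X_int
assert X_int[0, 0] == 1
assert out.tolist() == [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
```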

@@ -563,8 +565,7 @@ def inverse_transform(self, X, copy=True):
"""
check_is_fitted(self, 'mixing_')

if copy:
Member

Why is that not check_array here?

Member

and why would we want to copy? fast_dot doesn't change inplace, right? That wouldn't make any sense.

Member

Was there an error here?

Member Author

Yes, X += self.mean_ made numpy 1.10 raise an error since X is not explicitly float but mean_ was... (is this fix okay, or should I rather do check_array(X, dtype=FLOAT_DTYPES)?)

Member

so even if self.mixing_ is float, fast_dot(X, self.mixing_.T) is not? That is somewhat surprising to me.

Member

fast_dot does casting according to my quick check on numpy 1.10

Member Author

Oh! So I must've done this pre-emptively I think... I'll check against master once to confirm!

Member Author

Have replaced with check_array...! And yes, I did this pre-emptively since there was a failure w.r.t. FastICA here -

======================================================================
ERROR: sklearn.tests.test_common.test_non_meta_estimators('FastICA', <class 'sklearn.decomposition.fastica_.FastICA'>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/scikit-learn/sklearn/utils/testing.py", line 317, in wrapper
    return fn(*args, **kwargs)
  File "/scikit-learn/sklearn/utils/estimator_checks.py", line 669, in check_estimators_dtypes
    estimator.fit(X_train, y)
  File "/scikit-learn/sklearn/decomposition/fastica_.py", line 522, in fit
    self._fit(X, compute_sources=False)
  File "/scikit-learn/sklearn/decomposition/fastica_.py", line 478, in _fit
    compute_sources=compute_sources, return_n_iter=True)
  File "/scikit-learn/sklearn/decomposition/fastica_.py", line 300, in fastica
    X -= X_mean[:, np.newaxis]
TypeError: Cannot cast ufunc subtract output from dtype('float64') to dtype('int64') with casting rule 'same_kind'
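The traceback above reproduces outside sklearn in a couple of lines (assuming numpy >= 1.10, where in-place ufuncs use the default `casting='same_kind'` rule):

```python
import numpy as np

X = np.arange(6, dtype=np.int64).reshape(2, 3)
X_mean = X.mean(axis=1)  # float64

# int64 -= float64 cannot be cast back under the 'same_kind' rule,
# so numpy >= 1.10 raises a TypeError here
try:
    X -= X_mean[:, np.newaxis]
    raised = False
except TypeError:
    raised = True
assert raised

# The fix: make X float before the in-place subtraction
X_f = X.astype(np.float64)
X_f -= X_mean[:, np.newaxis]
assert X_f.tolist() == [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
```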

Member

ok

Member

Should we set copy=copy and self.whiten? There is no need to trigger a copy when self.whiten is False.

@amueller
Member

If there needs to be a type conversion, then copy=False shouldn't do anything. Why does it fail?

@raghavrv
Member Author

Not if the initial datatype is a lower type like int and we want it converted to float, right? So that overrides the copy=False (ref: this comment in a Stack Overflow answer)
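That behaviour is easy to verify directly: `copy=False` is only a hint, and a required dtype conversion forces a fresh allocation anyway.

```python
import numpy as np

X = np.array([[1, 0, 5], [2, 3, -1]])  # int dtype

# dtype conversion required -> copy=False is silently overridden
X64 = X.astype(np.float64, copy=False)
assert X64 is not X
assert X64.dtype == np.float64

# no conversion required -> copy=False really avoids the copy
same = X.astype(X.dtype, copy=False)
assert same is X
```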

@amueller
Member

Exactly, that overrides copy=False, i.e. copy=False has no effect, as I said. Is the test to see whether copy=False works? I only see "is not" tests, not "is" tests. Which test are we talking about?

@raghavrv
Member Author

Oh :P Anyway I had meant for this commit, which fixes the previously failing check that tests whether X is X_bin when copy is set to False... Is that commit the correct thing to do?

@amueller
Member

Ah, ok. Yeah the fix is fine.

@raghavrv
Member Author

@amueller One more question: do we need a DataConversionWarning or some kind of warning to notify the user that the conversion is taking place?

- DISTRIB="conda" PYTHON_VERSION="3.4" INSTALL_MKL="true"
NUMPY_VERSION="1.8.1" SCIPY_VERSION="0.14.0"
- DISTRIB="conda" PYTHON_VERSION="3.5" INSTALL_MKL="true"
NUMPY_VERSION="1.10.1" SCIPY_VERSION="0.16.0"
Member Author

@lesteve @ogrisel Is this better than * since we know which version we are testing against?

@raghavrv
Member Author

This failure is from Python 3.5 rather than from the new versions of numpy / scipy, I think --

======================================================================
ERROR: sklearn.preprocessing.tests.test_imputation.test_imputation_mean_median
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/preprocessing/tests/test_imputation.py", line 162, in test_imputation_mean_median
    true_statistics[j] = true_value_fun(z, v, p)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/preprocessing/tests/test_imputation.py", line 139, in <lambda>
    ("median", "NaN", lambda z, v, p: np.median(np.hstack((z, v)))),
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/site-packages/numpy/lib/function_base.py", line 3084, in median
    overwrite_input=overwrite_input)
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/site-packages/numpy/lib/function_base.py", line 2997, in _ureduce
    r = func(a, **kwargs)
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/site-packages/numpy/lib/function_base.py", line 3138, in _median
    n = np.isnan(part[..., -1])
IndexError: index -1 is out of bounds for axis 0 with size 0
----------------------------------------------------------------------

@amueller
Member

It looks to me like some numerics changed, which is more likely to be a numpy/scipy issue. Also, AppVeyor already runs 3.5.

@amueller
Member

see numpy/numpy#6462

@raghavrv
Member Author

Okay!! So this waits until that gets fixed?

@amueller
Member

we could put in a temporary fix? Why is there the median of an empty array in the test?



def safe_mean(arr, *args, **kwargs):
# np.mean([]) raises a RuntimeWarning for numpy >= 1.10.1
Member

and before, I think. But the statement is not wrong, I guess ^^

@amueller amueller changed the title [MRG] FIX dtypes to conform to the stricter type cast rules of numpy 1.10 [MRG + 1] FIX dtypes to conform to the stricter type cast rules of numpy 1.10 Oct 15, 2015

def safe_mean(arr, *args, **kwargs):
# np.mean([]) raises a RuntimeWarning for numpy >= 1.10.1
length = arr.size if hasattr(arr, 'size') else len(arr)
Member

the line is missing here now.

Member Author

lol sorry :P done :)
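For reference, a plausible reconstruction of the full helper. The guard value `np.nan` for empty input is my assumption; the actual line restored in the PR may differ:

```python
import numpy as np

def safe_mean(arr, *args, **kwargs):
    # np.mean([]) raises a RuntimeWarning for numpy >= 1.10.1,
    # so guard against empty input before delegating to np.mean
    length = arr.size if hasattr(arr, 'size') else len(arr)
    if length == 0:
        return np.nan  # assumed sentinel for empty input
    return np.mean(arr, *args, **kwargs)

assert np.isnan(safe_mean(np.array([])))
assert safe_mean(np.array([1.0, 3.0])) == 2.0
```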

@amueller
Member

LGTM once the missing line is added back in and the tests pass.

@MechCoder
Member

why do we need copy here at all?

Right, we don't, let us move that to another PR.

Just my one comment about the binarizer remains, i.e (https://github.com/scikit-learn/scikit-learn/pull/5398/files#diff-5ebddebc20987b6125fffc893f5abc4cR1336) if this is necessary.

@@ -1243,10 +1244,20 @@ def test_binarizer():
assert_equal(np.sum(X_bin == 0), 2)
assert_equal(np.sum(X_bin == 1), 4)

# dtype of X is int* and binarizer requires X to be of type
# float, hence in-place computation (without a copy) will fail
Member

This first test doesn't really make sense, right? was that your comment @MechCoder ?

@raghavrv
Member Author

@amueller no, the question was: when some dtype conversions can happen in place, why is binarize unable to do so...?
It gets changed in check_array itself! I am looking into it :)

@@ -1333,7 +1333,8 @@ def binarize(X, threshold=0.0, copy=True):
using the ``Transformer`` API (e.g. as part of a preprocessing
:class:`sklearn.pipeline.Pipeline`)
"""
X = check_array(X, accept_sparse=['csr', 'csc'], copy=copy)
X = check_array(X, accept_sparse=['csr', 'csc'], copy=copy,
Member

so why this change here?

Member Author

This raised no errors... This was done just to fix the input dtype to float (pre-emptively)... Is this unnecessary?

Member

yup, seems so, please remove it :)

Member

And I assume you haven't made any such changes pre-emptively elsewhere :P

Member Author

lol :P wait I'll confirm in a minute!

@MechCoder
Member

The in-place modification produces an error only when you do something like +=, -=, etc.

@raghavrv
Member Author

>>> X = np.array([[1, 0, 5], [2, 3, -1]])
>>> print(X is check_array(X, copy=False, dtype=np.float64))
False
>>> print(X is np.array(X, dtype=np.float64, copy=False)) # This is done by check_array
False
>>> X[X>3]=3.0
>>> print(X is X)
True

@MechCoder
Member

Exactly, in the first two you are explicitly casting the dtype of X, so it triggers a copy. In the third, the dtype of X remains the same, and 3.0 is cast to an int, i.e. the dtype of X

@raghavrv
Member Author

Okay, so your question is the same as Andy's, asking why I made that change in binarize? (I assumed you asked why binarize was copying it after that change :p)

I'll revert that in a bit :)

And I think here too - https://github.com/scikit-learn/scikit-learn/pull/5398/files#r42003035 ;)

@MechCoder
Member

LGTM as well

@MechCoder MechCoder changed the title [MRG + 1] FIX dtypes to conform to the stricter type cast rules of numpy 1.10 [MRG + 2] FIX dtypes to conform to the stricter type cast rules of numpy 1.10 Oct 15, 2015
@raghavrv
Member Author

Thanks for the reviews :)

amueller added a commit that referenced this pull request Oct 15, 2015
[MRG + 2] FIX dtypes to conform to the stricter type cast rules of numpy 1.10
@amueller amueller merged commit 1b9e791 into scikit-learn:master Oct 15, 2015
@MechCoder
Member

Thanks! 🍷 🍷 🍷

@amueller
Member

getting more classy here. Though if you drink three glasses at once, maybe not that classy any more?

@MechCoder
Member

I like it when people underestimate my abilities ;)
