
[MRG + 1] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double #9067

Merged
merged 8 commits into scikit-learn:master from raghavrv:randomized_svd_dtype on Jun 21, 2017

Conversation

@raghavrv (Member) commented Jun 8, 2017

Refs #8769.

This will ensure float32 is not upcast to float64 in `randomized_svd_low_rank`.

cc: @ogrisel @massich @GaelVaroquaux

@massich (Contributor) commented Jun 8, 2017

LGTM.

@raghavrv should we do something similar here
assert_almost_equal(lr_32.coef_, lr_64.coef_, decimal=5) ?

@@ -195,6 +195,7 @@ def randomized_range_finder(A, size, n_iter,

# Generating normal random vectors with shape: (A.shape[1], size)
Q = random_state.normal(size=(A.shape[1], size))
Q = Q.astype(A.dtype, copy=False)
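For context, `RandomState.normal` always returns float64, so without the added cast the projection matrix would silently upcast a float32 `A`. A minimal sketch of the fix's effect (the arrays here are illustrative, not the library code):

```python
import numpy as np

# A float32 input matrix, as in the upcasting bug report.
A = np.ones((5, 3), dtype=np.float32)
rng = np.random.RandomState(42)

Q = rng.normal(size=(A.shape[1], 2))  # normal() always returns float64
assert Q.dtype == np.float64

# Align the projection matrix with A's dtype; copy=False avoids a
# copy when the dtypes already match.
Q = Q.astype(A.dtype, copy=False)
assert Q.dtype == np.float32

# For a float64 input the cast is a no-op: the same array is returned.
A64 = A.astype(np.float64)
Q64 = rng.normal(size=(A64.shape[1], 2))
assert Q64.astype(A64.dtype, copy=False) is Q64
```

The `copy=False` flag is what keeps the float64 path free: NumPy only allocates a new array when an actual conversion is needed.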

@GaelVaroquaux (Member) commented Jun 8, 2017

Is there a risk that A is an integer array, and that this breaks something?
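The concern is real: if `A` had an integer dtype, `astype(A.dtype)` would truncate the Gaussian samples to integers, degenerating the projection. A small pure-NumPy illustration of what a naive cast would do (not the library code):

```python
import numpy as np

rng = np.random.RandomState(0)
Q = rng.normal(size=(4, 3))   # standard normal samples, float64

# Casting to an integer dtype truncates toward zero, so most
# samples in (-1, 1) collapse to 0 and the "random" range-finder
# directions become degenerate.
Q_int = Q.astype(np.int64)
assert Q_int.dtype == np.int64
assert np.any(Q_int == 0)
```

Hence the eventual resolution below: integer inputs are not cast to their own dtype but upcast to float64 instead.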

@raghavrv (Author, Member) commented Jun 9, 2017

Thanks! I'll special-case them and avoid astype-ing for int dtypes...

@GaelVaroquaux (Member) commented Jun 9, 2017

Can you expand this PR to add a test that PCA, with the different solvers, doesn't upcast float32 to float64?

@@ -99,38 +99,50 @@ def test_logsumexp():
assert_array_almost_equal(np.exp(logsumexp(logX, axis=1)), X.sum(axis=1))


-def test_randomized_svd_low_rank():
+def check_randomized_svd_low_rank(dtype):

@TomDLT (Member) commented Jun 9, 2017

Nitpick: I prefer to keep the word test instead of check.
It is a unit test, not an input check inside the code.

@raghavrv (Author, Member) commented Jun 9, 2017

It was renamed from test to check so it doesn't run as a separate test by itself. There is a test function below which will use this function to test different dtypes.

@TomDLT (Member) commented Jun 9, 2017

It is not run when there is an undefined parameter, is it?

@raghavrv (Author, Member) commented Jun 9, 2017

Oh I didn't know that... Thanks!

@raghavrv (Author, Member) commented Jun 9, 2017

Hmm, doesn't seem so. Both pytest and nose raise an error if I rename it back to test. Basically it seems to run whenever the function name has test in it.

@ogrisel (Member) commented Jun 9, 2017

It's a convention in the project to prefix with check_ the helper functions that are called by test_ generators (with the yield syntax).

@TomDLT (Member) commented Jun 9, 2017

OK then.

# If the input dtype is float, then the output dtype is float of the
# same bit size (f32 is not upcast to f64)
# But if the input dtype is int, the output dtype is float32/float64
# depending on the platform

@raghavrv (Author, Member) commented Jun 9, 2017

@GaelVaroquaux @ogrisel does this seem correct?

@ogrisel (Member) commented Jun 9, 2017

No, we should always convert to double precision floating point data.

@raghavrv changed the title from "[MRG] ENH Ensure randomized_svd_low_rank doesn't upcast float to double" to "[MRG] ENH Ensure PCA and randomized_svd_low_rank doesn't upcast float to double" on Jun 9, 2017
@@ -362,7 +362,7 @@ def _fit(self, X):
raise TypeError('PCA does not support sparse input. See '
'TruncatedSVD for a possible alternative.')

-X = check_array(X, dtype=[np.float64], ensure_2d=True,
+X = check_array(X, dtype=[np.float32, np.float64], ensure_2d=True,

@TomDLT (Member) commented Jun 9, 2017

We need to be consistent between estimators: the other PR uses [np.float64, np.float32], not [np.float32, np.float64]. The order matters since, when the dtype is not in the list, the input is cast to the first one in the list.
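The rule being described can be sketched as follows. The helper `cast_like_check_array` is a hypothetical stand-in that mimics `check_array`'s documented dtype-list handling, not the real scikit-learn function:

```python
import numpy as np

def cast_like_check_array(X, dtypes):
    """Hypothetical mimic of check_array's dtype-list rule: keep X's
    dtype if it appears in the list, otherwise cast to the first
    (preferred) dtype in the list."""
    X = np.asarray(X)
    if X.dtype in dtypes:
        return X
    return X.astype(dtypes[0])

preferred = [np.float64, np.float32]

# Both float dtypes are in the list, so they are preserved...
assert cast_like_check_array(np.ones(3, np.float32), preferred).dtype == np.float32
assert cast_like_check_array(np.ones(3, np.float64), preferred).dtype == np.float64
# ...but int is not, so it is cast to the first entry: float64.
assert cast_like_check_array(np.ones(3, np.int64), preferred).dtype == np.float64
```

With the reversed order [np.float32, np.float64], the same int input would land on float32 instead, which is why the ordering must be consistent across estimators.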

@raghavrv (Author, Member) commented Jun 9, 2017

OK, so int gets cast to f64, right? I think that's what @GaelVaroquaux wanted too... I'll push a commit. Thanks!

@raghavrv (Member, Author) commented Jun 9, 2017

@raghavrv should we do something similar here

Sorry I didn't respond here. As discussed IRL, yeah, this seems cleaner than astype + comparison.

@ogrisel changed the title from "[MRG] ENH Ensure PCA and randomized_svd_low_rank doesn't upcast float to double" to "[MRG] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double" on Jun 9, 2017
@ogrisel (Member) approved these changes Jun 9, 2017

LGTM once the following comments are addressed.



def test_pca_dtype_preservation():
    for svd_solver in solver_list:

@ogrisel (Member) commented Jun 9, 2017

Please use a test generator (with the yield syntax) to generate the checks for different solver names so that we get the solver name in the test failure report.
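A minimal sketch of the nose-style yield pattern being requested (the solver names and the check body here are illustrative, not the PR's actual test):

```python
# nose-style test generator: each yielded (func, arg) pair becomes its
# own test case, so the solver name appears in the failure report.
solver_list = ['full', 'arpack', 'randomized', 'auto']  # illustrative

ran = []

def check_pca_dtype_preservation(svd_solver):
    # A real check would fit PCA(svd_solver=svd_solver) on float32 and
    # float64 data and compare the resulting components_ dtypes.
    ran.append(svd_solver)

def test_pca_dtype_preservation():
    for svd_solver in solver_list:
        yield check_pca_dtype_preservation, svd_solver

# nose drives the generator itself; done manually here for illustration.
for check, solver in test_pca_dtype_preservation():
    check(solver)

assert ran == solver_list
```

Because each yielded pair is reported separately, a failure for, say, 'arpack' is immediately distinguishable from one for 'randomized'.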

assert_array_almost_equal(pca_64.components_, pca_32.components_,
decimal=5)

# But all int types should be upcast to float64

@ogrisel (Member) commented Jun 9, 2017

I would put the integer case in a separate test (to keep the code focused and easier to follow).

assert pca_64.components_.dtype == np.float64
assert pca_32.components_.dtype == np.float32
assert_array_almost_equal(pca_64.components_, pca_32.components_,
decimal=5)

@ogrisel (Member) commented Jun 9, 2017

Please also check the dtype preservation for:

assert pca_64.transform(X_64).dtype == np.float64
assert pca_32.transform(X_32).dtype == np.float32

This is a straightforward consequence of the above but better be explicit :)


@@ -362,7 +362,7 @@ def _fit(self, X):
raise TypeError('PCA does not support sparse input. See '
'TruncatedSVD for a possible alternative.')

-X = check_array(X, dtype=[np.float64], ensure_2d=True,
+X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,

@ogrisel (Member) commented Jun 9, 2017

@raghavrv this line makes it such that if X is integer-based, it will be converted to np.float64 (the first element of the list) on all platforms.

This answers your question at: https://github.com/scikit-learn/scikit-learn/pull/9067/files#r121114986

@amueller (Member) commented Jun 19, 2017

Should we change this behavior of check_array?

@raghavrv (Author, Member) commented Jun 19, 2017

I think this is helpful. It converts to the dtype as per the given order.

@raghavrv (Member, Author) commented Jun 12, 2017

@ogrisel Thanks for the review. Have addressed your comments.

@raghavrv (Member, Author) commented Jun 16, 2017

@ogrisel The AppVeyor build failed on some estimator checks. I've updated the checks so that only 4 decimal places are compared for float32 data. Could you verify the last commit and see if it is correct?

@raghavrv (Member, Author) commented Jun 16, 2017

Ah! There was a change on master which actually addressed the failures. I think the CIs should pass after the last commit.

@raghavrv (Member, Author) commented Jun 16, 2017

Yeah. @ogrisel @GaelVaroquaux, final review and merge?

@raghavrv changed the title from "[MRG] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double" to "[MRG + 1] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double" on Jun 16, 2017
@jnothman jnothman added this to the 0.19 milestone Jun 18, 2017
@@ -1142,7 +1142,6 @@ def check_classifiers_train(name, classifier_orig):
if hasattr(classifier, "predict_log_proba"):
# predict_log_proba is a transformation of predict_proba
y_log_prob = classifier.predict_log_proba(X)
assert_allclose(y_log_prob, np.log(y_prob), 8, atol=1e-9)

@GaelVaroquaux (Member) commented Jun 20, 2017

How did this get into this PR? It seems unrelated.

@GaelVaroquaux (Member) commented Jun 20, 2017

LGTM once my comment above is addressed.

@raghavrv (Member, Author) commented Jun 21, 2017

Thanks @GaelVaroquaux. Have fixed it.

@GaelVaroquaux (Member) commented Jun 21, 2017

OK, good. Merging! Thanks.

@GaelVaroquaux GaelVaroquaux merged commit 1c1566e into scikit-learn:master Jun 21, 2017
5 checks passed:
ci/circleci: Your tests passed on CircleCI!
codecov/patch: 100% of diff hit (target 96.29%)
codecov/project: 96.31% (+0.01%) compared to 7238b46
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@raghavrv raghavrv deleted the raghavrv:randomized_svd_dtype branch Jun 23, 2017
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double (scikit-learn#9067)

* ENH Ensure randomized_svd_low_rank doesn't upcast float to double
* ENH ensure PCA does not upcast f32 to f64; (int is upcast to f32)
* ENH ensure that when input is of type int, the output is float32/64
* ENH prefer float64 over float32; Use float64 for int inputs
* Make sure int types are upcast to float64; Address Olivier's comments
* FIX check only for 4 decimals when dtype is float32
* Fix spurious line removal
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
7 participants