
[MRG + 1] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double #9067

Merged
merged 8 commits into scikit-learn:master from raghavrv:randomized_svd_dtype on Jun 21, 2017

Conversation

raghavrv
Member

@raghavrv raghavrv commented Jun 8, 2017

Refer #8769

This will ensure float32 is not upcast to float64 in `randomized_svd`.

cc: @ogrisel @massich @GaelVaroquaux

@massich
Contributor

massich commented Jun 8, 2017

LGTM.

@raghavrv should we do something similar here
assert_almost_equal(lr_32.coef_, lr_64.coef_, decimal=5)?

@@ -195,6 +195,7 @@ def randomized_range_finder(A, size, n_iter,

# Generating normal random vectors with shape: (A.shape[1], size)
Q = random_state.normal(size=(A.shape[1], size))
Q = Q.astype(A.dtype, copy=False)
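For context, NumPy's normal generator always returns float64, which is why the explicit cast is needed. A minimal standalone sketch of the effect (illustration only, not the scikit-learn code itself):

import numpy as np

rng = np.random.RandomState(42)
A = np.ones((10, 4), dtype=np.float32)

Q = rng.normal(size=(A.shape[1], 3))  # normal() always returns float64
print(Q.dtype)                        # float64

Q = Q.astype(A.dtype, copy=False)     # cast down to match the input dtype
print(Q.dtype)                        # float32, so products with A stay float32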
Member

Is there a risk that A is an integer, and that this breaks something?

Member Author

Thanks! I'll special case them and avoid astype-ing for int dtypes...
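To illustrate the concern (a hypothetical snippet, not from the PR): casting Gaussian samples to an integer dtype truncates them toward zero, which would degenerate the random projection:

import numpy as np

rng = np.random.RandomState(0)
Q = rng.normal(size=(4, 3))
print(Q.astype(np.int64))  # mostly 0, 1 and -1: the random range finder would collapse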

@GaelVaroquaux
Member

Can you expand this PR to add a test that PCA, with the different solvers, doesn't upcast float32 to float64?

@@ -99,38 +99,50 @@ def test_logsumexp():
assert_array_almost_equal(np.exp(logsumexp(logX, axis=1)), X.sum(axis=1))


def test_randomized_svd_low_rank():
def check_randomized_svd_low_rank(dtype):
Member

nitpick: I prefer to keep the word test instead of check.
It is a unit test, not an input check inside the code.

Member Author

It was renamed from test to check so it doesn't run as a separate test by itself. There is a test function below which will use this function to test different dtypes.

Member

@TomDLT TomDLT Jun 9, 2017

It is not run when it has an undefined parameter, is it?

Member Author

@raghavrv raghavrv Jun 9, 2017

Oh I didn't know that... Thanks!

Member Author

Hmm, doesn't seem so. Both pytest and nose raise an error if I rename it back to test. Basically it seems to run whenever the function name has test in it.

Member

@ogrisel ogrisel Jun 9, 2017

It's a convention in the project to prefix with check_ the helper functions that are called by test_ generators (with the yield syntax).
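The pattern looks roughly like this (a sketch with hypothetical names; under nose, yield-based test generators run the check once per parameter):

import numpy as np

def check_randomized_svd_low_rank(dtype):
    # dtype-specific assertions go here
    assert dtype in (np.float32, np.float64)

def test_randomized_svd_low_rank_all_dtypes():
    # nose-style generator: yields one check call per dtype
    for dtype in (np.float32, np.float64):
        yield check_randomized_svd_low_rank, dtype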

Member

ok then

# If the input dtype is float, then the output dtype is float of the
# same bit size (f32 is not upcast to f64)
# But if the input dtype is int, the output dtype is float32/float64
# depending on the platform
Member Author

@GaelVaroquaux @ogrisel does this seem correct?

Member

@ogrisel ogrisel Jun 9, 2017

No, we should always convert to double precision floating point data.

@raghavrv raghavrv changed the title [MRG] ENH Ensure randomized_svd_low_rank doesn't upcast float to double [MRG] ENH Ensure PCA and randomized_svd_low_rank doesn't upcast float to double Jun 9, 2017
@@ -362,7 +362,7 @@ def _fit(self, X):
raise TypeError('PCA does not support sparse input. See '
'TruncatedSVD for a possible alternative.')

X = check_array(X, dtype=[np.float64], ensure_2d=True,
X = check_array(X, dtype=[np.float32, np.float64], ensure_2d=True,
Member

We need to be consistent between estimators:

The other PR uses [np.float64, np.float32] and not [np.float32, np.float64]. This is important since, when the dtype is not in the list, the input is cast to the first one in the list.
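A quick illustration of that ordering behavior (a sketch; check_array converts inputs whose dtype is not in the list to the first entry):

import numpy as np
from sklearn.utils import check_array

X_int = np.arange(6, dtype=np.int64).reshape(3, 2)

# dtype already in the list: preserved
print(check_array(X_int.astype(np.float32), dtype=[np.float64, np.float32]).dtype)  # float32

# dtype not in the list: cast to the first entry
print(check_array(X_int, dtype=[np.float64, np.float32]).dtype)  # float64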

Member Author

Ok, so int gets cast to f64, right? I think that's what @GaelVaroquaux wanted too... I'll push a commit. Thx

@raghavrv
Member Author

raghavrv commented Jun 9, 2017

@raghavrv should we do something similar here

Sorry I didn't respond here. As discussed IRL, yeah, this seems cleaner than astype + comparison.

@ogrisel ogrisel changed the title [MRG] ENH Ensure PCA and randomized_svd_low_rank doesn't upcast float to double [MRG] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double Jun 9, 2017
Member

@ogrisel ogrisel left a comment

LGTM once the following comments are addressed.



def test_pca_dtype_preservation():
for svd_solver in solver_list:
Member

Please use a test generator (with the yield syntax) to generate the checks for different solver names so that we get the solver name in the test failure report.

assert_array_almost_equal(pca_64.components_, pca_32.components_,
decimal=5)

# But all int types should be upcast to float64
Member

I would put the integer case in a separate test (to keep the code focused and easier to follow).
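For instance, a hypothetical sketch of such a separate integer test (names and shapes assumed, not the exact test from the PR):

import numpy as np
from sklearn.decomposition import PCA

def check_pca_int_dtype_upcast_to_double(svd_solver):
    X_i64 = np.random.RandomState(0).randint(0, 1000, (30, 5)).astype(np.int64)
    pca = PCA(n_components=3, svd_solver=svd_solver, random_state=0).fit(X_i64)
    # integer inputs are upcast to float64, never float32
    assert pca.components_.dtype == np.float64
    assert pca.transform(X_i64).dtype == np.float64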

assert pca_64.components_.dtype == np.float64
assert pca_32.components_.dtype == np.float32
assert_array_almost_equal(pca_64.components_, pca_32.components_,
decimal=5)
Member

Please also check the dtype preservation for:

assert pca_64.transform(X_64).dtype == np.float64
assert pca_32.transform(X_32).dtype == np.float32

This is a straightforward consequence of the above but better be explicit :)
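Putting these review comments together, the float check might look roughly like this (a sketch with assumed shapes and names):

import numpy as np
from sklearn.decomposition import PCA

def check_pca_dtype_preservation(svd_solver):
    X_64 = np.random.RandomState(0).rand(30, 5).astype(np.float64)
    X_32 = X_64.astype(np.float32)

    pca_64 = PCA(n_components=3, svd_solver=svd_solver, random_state=0).fit(X_64)
    pca_32 = PCA(n_components=3, svd_solver=svd_solver, random_state=0).fit(X_32)

    # fitted attributes keep the input dtype...
    assert pca_64.components_.dtype == np.float64
    assert pca_32.components_.dtype == np.float32

    # ...and so do the transformed outputs
    assert pca_64.transform(X_64).dtype == np.float64
    assert pca_32.transform(X_32).dtype == np.float32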


@@ -362,7 +362,7 @@ def _fit(self, X):
raise TypeError('PCA does not support sparse input. See '
'TruncatedSVD for a possible alternative.')

X = check_array(X, dtype=[np.float64], ensure_2d=True,
X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,
Member

@raghavrv this line makes it such that if X is integer based, it would be converted to np.float64 (the first element of the list) on all the platforms.

This answers your question at: https://github.com/scikit-learn/scikit-learn/pull/9067/files#r121114986

Member

Should we change this behavior of check_array?

Member Author

I think this is helpful. It converts to the dtype as per the given order.

@raghavrv
Member Author

@ogrisel Thanks for the review. Have addressed your comments.

@raghavrv
Member Author

@ogrisel The AppVeyor build failed on some estimator checks. I've updated the checks so that only 4 decimal places are checked for float32 data. Could you verify the last commit and see if it is correct?

@raghavrv
Member Author

Ah! There was a change to master that actually addressed the failures. I think the CIs should pass after the last commit.

@raghavrv
Member Author

Yeah. @ogrisel @GaelVaroquaux final review and merge?

@raghavrv raghavrv changed the title [MRG] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double [MRG + 1] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double Jun 16, 2017
@jnothman jnothman added this to the 0.19 milestone Jun 18, 2017
@@ -1142,7 +1142,6 @@ def check_classifiers_train(name, classifier_orig):
if hasattr(classifier, "predict_log_proba"):
# predict_log_proba is a transformation of predict_proba
y_log_prob = classifier.predict_log_proba(X)
assert_allclose(y_log_prob, np.log(y_prob), 8, atol=1e-9)
Member

How did this get into this PR? It seems unrelated.

@GaelVaroquaux
Member

LGTM once my comment above is addressed.

@raghavrv
Member Author

Thanks @GaelVaroquaux. Have fixed it.

@GaelVaroquaux
Member

OK, good. Merging! Thanks

@GaelVaroquaux GaelVaroquaux merged commit 1c1566e into scikit-learn:master Jun 21, 2017
@raghavrv raghavrv deleted the randomized_svd_dtype branch June 23, 2017 13:24
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double (scikit-learn#9067)

* ENH Ensure randomized_svd_low_rank doesn't upcast float to double

* ENH ensure PCA does not upcast f32 to f64; (int is upcast to f32)

* ENH ensure that when input is of type int, the output is float32/64

* ENH prefer float64 over float32; Use float64 for int inputs

* Make sure int types are upcast to float64; Address Olivier's comments

* FIX check only for 4 decimals when dtype is float32

* Fix spurious line removal
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017