[MRG + 1] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double #9067
Conversation
sklearn/utils/extmath.py
Outdated
@@ -195,6 +195,7 @@ def randomized_range_finder(A, size, n_iter,

     # Generating normal random vectors with shape: (A.shape[1], size)
     Q = random_state.normal(size=(A.shape[1], size))
+    Q = Q.astype(A.dtype, copy=False)
Is there a risk that A be an integer, and that it breaks something?
Thanks! I'll special-case them and avoid `astype`-ing for `int` dtypes...
Can you expand this PR to add a test to PCA that, with the different solvers, it doesn't upcast float32 to float64. |
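For context, the integer concern can be reproduced in isolation. This is a standalone NumPy sketch (not the PR's code) showing why blindly casting the Gaussian test vectors to the input's dtype is fine for floats but breaks for ints:

```python
import numpy as np

rng = np.random.RandomState(0)

# Float input: casting the Gaussian test vectors to A's dtype keeps the
# input precision (float32 is not upcast to float64).
A32 = rng.rand(5, 4).astype(np.float32)
Q = rng.normal(size=(A32.shape[1], 3)).astype(A32.dtype, copy=False)
print(Q.dtype)  # float32

# Integer input illustrates the concern above: casting standard-normal
# draws to an int dtype truncates them to values near -1/0/1, destroying
# the random projection, so int dtypes need special handling.
A_int = rng.randint(0, 10, size=(5, 4))
Q_bad = rng.normal(size=(A_int.shape[1], 3)).astype(A_int.dtype, copy=False)
print(Q_bad.dtype.kind)  # 'i' -- integer, not what the range finder needs
```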
@@ -99,38 +99,50 @@ def test_logsumexp():
     assert_array_almost_equal(np.exp(logsumexp(logX, axis=1)), X.sum(axis=1))


-def test_randomized_svd_low_rank():
+def check_randomized_svd_low_rank(dtype):
nitpick: I prefer to keep the word `test` instead of `check`. It is a unit test, not an input check inside the code.
It was renamed from `test` to `check` so it doesn't run as a separate test by itself. There is a test function below which uses this function to test different dtypes.
It is not run when it takes an undefined parameter, is it?
Oh I didn't know that... Thanks!
Hmm, doesn't seem so. Both pytest and nose raise an error if I rename it back to `test`. Basically it seems to run whenever the function name has `test` in it.
It's a convention in the project to prefix with `check_` the helper functions that are called by `test_` generators (with the `yield` syntax).
ok then
sklearn/utils/tests/test_extmath.py
Outdated
# If the input dtype is float, then the output dtype is float of the
# same bit size (f32 is not upcast to f64)
# But if the input dtype is int, the output dtype is float32/float64
# depending on the platform
@GaelVaroquaux @ogrisel does this seem correct?
No, we should always convert to double-precision floating point data.
sklearn/decomposition/pca.py
Outdated
@@ -362,7 +362,7 @@ def _fit(self, X):
         raise TypeError('PCA does not support sparse input. See '
                         'TruncatedSVD for a possible alternative.')

-        X = check_array(X, dtype=[np.float64], ensure_2d=True,
+        X = check_array(X, dtype=[np.float32, np.float64], ensure_2d=True,
We need to be consistent between estimators: the other PR uses `[np.float64, np.float32]` and not `[np.float32, np.float64]`. The ordering matters because when the dtype is not in the list, the input is cast to the first one in the list.
Ok, so int gets cast to f64, right? I think that's what @GaelVaroquaux wanted too... I'll push a commit. Thx.
Sorry I didn't respond here. As discussed IRL, yeah, this seems cleaner than astype + comparison.
LGTM once the following comments are addressed.
def test_pca_dtype_preservation():
    for svd_solver in solver_list:
Please use a test generator (with the `yield` syntax) to generate the checks for different solver names, so that we get the solver name in the test failure report.
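The nose-style generator pattern being requested looks roughly like this (a hypothetical standalone sketch; `check_pca_dtype` and its placeholder body are illustrative, not the PR's code):

```python
# Helpers prefixed with check_ are not collected as tests on their own;
# the test_ generator yields (callable, args) pairs so that each solver
# shows up as a separately named case in the failure report.
solver_list = ['auto', 'full', 'arpack', 'randomized']


def check_pca_dtype(svd_solver):
    # Placeholder body: the real check would fit PCA(svd_solver=...) on
    # float32/float64 data and compare the component dtypes.
    assert svd_solver in solver_list


def test_pca_dtype_preservation():
    for svd_solver in solver_list:
        yield check_pca_dtype, svd_solver


# Under nose, each yielded pair becomes its own test case; here we just
# consume the generator by hand to show the mechanics.
for func, arg in test_pca_dtype_preservation():
    func(arg)
```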
assert_array_almost_equal(pca_64.components_, pca_32.components_,
                          decimal=5)

# But all int types should be upcast to float64
I would put the integer case in a separate test (to keep the code focused and easier to follow).
assert pca_64.components_.dtype == np.float64
assert pca_32.components_.dtype == np.float32
assert_array_almost_equal(pca_64.components_, pca_32.components_,
                          decimal=5)
Please also check the dtype preservation for:
assert pca_64.transform(X_64).dtype == np.float64
assert pca_32.transform(X_32).dtype == np.float32
This is a straightforward consequence of the above, but better to be explicit :)
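Why `transform` inherits the fitted precision can be seen in a pure-NumPy toy (an SVD-based projection standing in for PCA, not sklearn's implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
X_64 = rng.rand(8, 4)
X_32 = X_64.astype(np.float32)


def fit_components(X, n_components=2):
    # LAPACK dispatches on dtype, so the SVD factors (and hence the
    # fitted components) inherit the input precision.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components]


V_64 = fit_components(X_64)
V_32 = fit_components(X_32)
print(V_64.dtype, V_32.dtype)  # float64 float32

# transform() is just a matmul against the components, so it preserves
# the input dtype as well -- the property the reviewer asks to assert.
T_32 = (X_32 - X_32.mean(axis=0)) @ V_32.T
print(T_32.dtype)  # float32
```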
@@ -362,7 +362,7 @@ def _fit(self, X):
         raise TypeError('PCA does not support sparse input. See '
                         'TruncatedSVD for a possible alternative.')

-        X = check_array(X, dtype=[np.float64], ensure_2d=True,
+        X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,
@raghavrv this line makes it such that if `X` is integer-based, it gets converted to `np.float64` (the first element of the list) on all platforms.
This answers your question at: https://github.com/scikit-learn/scikit-learn/pull/9067/files#r121114986
Should we change this behavior of `check_array`?
I think this is helpful. It converts to the dtype as per the given order.
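The dtype-list semantics under discussion can be modeled in a few lines (a toy model of the behavior, not sklearn's actual `check_array` implementation):

```python
import numpy as np


def cast_like_dtype_list(X, dtype_list):
    """Toy model of check_array's dtype-list semantics: keep X's dtype
    if it is in the list, otherwise cast to the FIRST entry -- which is
    why the order of the list matters."""
    X = np.asarray(X)
    if X.dtype in [np.dtype(d) for d in dtype_list]:
        return X
    return X.astype(dtype_list[0])


X_int = np.arange(6).reshape(3, 2)        # int dtype, not in the list
X_32 = np.ones((3, 2), dtype=np.float32)  # already in the list

print(cast_like_dtype_list(X_int, [np.float64, np.float32]).dtype)  # float64
print(cast_like_dtype_list(X_32, [np.float64, np.float32]).dtype)   # float32
# With the reversed order, int input would land on float32 instead:
print(cast_like_dtype_list(X_int, [np.float32, np.float64]).dtype)  # float32
```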
@ogrisel Thanks for the review. Have addressed your comments.
@ogrisel The AppVeyor build failed on some estimator checks. I've updated the checks so that only 4 decimal places are checked for float32 data. Could you verify the last commit and see if it is correct?
Ah! There was a change to master which actually addressed the failures. I think the CIs should pass after the last commit.
Yeah. @ogrisel @GaelVaroquaux final review and merge?
sklearn/utils/estimator_checks.py
Outdated
@@ -1142,7 +1142,6 @@ def check_classifiers_train(name, classifier_orig):
         if hasattr(classifier, "predict_log_proba"):
             # predict_log_proba is a transformation of predict_proba
             y_log_prob = classifier.predict_log_proba(X)
-            assert_allclose(y_log_prob, np.log(y_prob), 8, atol=1e-9)
How did this get in this PR? It seems unrelated
LGTM once my comment above is addressed.
Thanks @GaelVaroquaux. Have fixed it.
OK, good. Merging! Thanks
…at to double (scikit-learn#9067)

* ENH Ensure randomized_svd_low_rank doesn't upcast float to double
* ENH ensure PCA does not upcase f32 to f64; (int is upcast to f32)
* ENH ensure that when input is of type int, the output is float32/64
* ENH prefer float64 over float32; Use float64 for int inputs
* Make sure int types are upcasted to float64; Address Olivier's comments
* FIX check only for 4 decimals when dtype is float32
* Fix spurious line removal
Refers to #8769.
This will ensure float32 is not upcast to float64 in `randomized_
cc: @ogrisel @massich @GaelVaroquaux