[MRG + 1] ENH Ensure PCA and randomized_svd_low_rank don't upcast float to double #9067
Conversation
sklearn/utils/extmath.py
Outdated
@@ -195,6 +195,7 @@ def randomized_range_finder(A, size, n_iter,

     # Generating normal random vectors with shape: (A.shape[1], size)
     Q = random_state.normal(size=(A.shape[1], size))
+    Q = Q.astype(A.dtype, copy=False)
Is there a risk that A be an integer, and that it breaks something?
Thanks! I'll special-case them and avoid `astype`-ing for `int` dtypes...
Can you expand this PR to add a test to PCA that, with the different solvers, it doesn't upcast float32 to float64. |
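For context, the integer concern can be reproduced in isolation. This is a standalone NumPy sketch (not the PR's code) showing why blindly casting the Gaussian test vectors to the input's dtype is fine for floats but breaks for ints:

```python
import numpy as np

rng = np.random.RandomState(0)

# Float input: casting the Gaussian test vectors to A's dtype keeps the
# input precision (float32 is not upcast to float64).
A32 = rng.rand(5, 4).astype(np.float32)
Q = rng.normal(size=(A32.shape[1], 3)).astype(A32.dtype, copy=False)
print(Q.dtype)  # float32

# Integer input illustrates the concern above: casting standard-normal
# draws to an int dtype truncates them to values near -1/0/1, destroying
# the random projection, so int dtypes need special handling.
A_int = rng.randint(0, 10, size=(5, 4))
Q_bad = rng.normal(size=(A_int.shape[1], 3)).astype(A_int.dtype, copy=False)
print(Q_bad.dtype.kind)  # 'i' -- integer, not what the range finder needs
```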
@@ -99,38 +99,50 @@ def test_logsumexp():
     assert_array_almost_equal(np.exp(logsumexp(logX, axis=1)), X.sum(axis=1))


-def test_randomized_svd_low_rank():
+def check_randomized_svd_low_rank(dtype):
nitpick: I prefer to keep the word `test` instead of `check`. It is a unit test, not an input check inside the code.
It was renamed from `test` to `check` so it doesn't run as a separate test by itself. There is a test function below which uses this function to test different dtypes.
It is not run when it takes an undefined parameter, is it?
Oh I didn't know that... Thanks!
Hmm, doesn't seem so. Both pytest and nose raise an error if I rename it back to `test`. Basically it seems to run whenever the function name has `test` in it.
It's a convention in the project to prefix with `check_` the helper functions that are called by `test_` generators (with the `yield` syntax).
ok then
sklearn/utils/tests/test_extmath.py
Outdated
# If the input dtype is float, then the output dtype is float of the
# same bit size (f32 is not upcast to f64)
# But if the input dtype is int, the output dtype is float32/float64
# depending on the platform
@GaelVaroquaux @ogrisel does this seem correct?
No, we should always convert to double-precision floating point data.
sklearn/decomposition/pca.py
Outdated
@@ -362,7 +362,7 @@ def _fit(self, X):
         raise TypeError('PCA does not support sparse input. See '
                         'TruncatedSVD for a possible alternative.')

-        X = check_array(X, dtype=[np.float64], ensure_2d=True,
+        X = check_array(X, dtype=[np.float32, np.float64], ensure_2d=True,
We need to be consistent between estimators: the other PR uses `[np.float64, np.float32]` and not `[np.float32, np.float64]`. The ordering matters because when the dtype is not in the list, the input is cast to the first one in the list.
Ok, so int gets cast to f64, right? I think that's what @GaelVaroquaux wanted too... I'll push a commit. Thx.
Sorry I didn't respond here. As discussed IRL, yeah, this seems cleaner than astype + comparison.
LGTM once the following comments are addressed.
def test_pca_dtype_preservation():
    for svd_solver in solver_list:
Please use a test generator (with the `yield` syntax) to generate the checks for different solver names, so that we get the solver name in the test failure report.
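The nose-style generator pattern being requested looks roughly like this (a hypothetical standalone sketch; `check_pca_dtype` and its placeholder body are illustrative, not the PR's code):

```python
# Helpers prefixed with check_ are not collected as tests on their own;
# the test_ generator yields (callable, args) pairs so that each solver
# shows up as a separately named case in the failure report.
solver_list = ['auto', 'full', 'arpack', 'randomized']


def check_pca_dtype(svd_solver):
    # Placeholder body: the real check would fit PCA(svd_solver=...) on
    # float32/float64 data and compare the component dtypes.
    assert svd_solver in solver_list


def test_pca_dtype_preservation():
    for svd_solver in solver_list:
        yield check_pca_dtype, svd_solver


# Under nose, each yielded pair becomes its own test case; here we just
# consume the generator by hand to show the mechanics.
for func, arg in test_pca_dtype_preservation():
    func(arg)
```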
assert_array_almost_equal(pca_64.components_, pca_32.components_,
                          decimal=5)

# But all int types should be upcast to float64
I would put the integer case in a separate test (to keep the code focused and easier to follow).
assert pca_64.components_.dtype == np.float64
assert pca_32.components_.dtype == np.float32
assert_array_almost_equal(pca_64.components_, pca_32.components_,
                          decimal=5)
Please also check the dtype preservation for:
assert pca_64.transform(X_64).dtype == np.float64
assert pca_32.transform(X_32).dtype == np.float32
This is a straightforward consequence of the above, but better to be explicit :)
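Why `transform` inherits the fitted precision can be seen in a pure-NumPy toy (an SVD-based projection standing in for PCA, not sklearn's implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
X_64 = rng.rand(8, 4)
X_32 = X_64.astype(np.float32)


def fit_components(X, n_components=2):
    # LAPACK dispatches on dtype, so the SVD factors (and hence the
    # fitted components) inherit the input precision.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components]


V_64 = fit_components(X_64)
V_32 = fit_components(X_32)
print(V_64.dtype, V_32.dtype)  # float64 float32

# transform() is just a matmul against the components, so it preserves
# the input dtype as well -- the property the reviewer asks to assert.
T_32 = (X_32 - X_32.mean(axis=0)) @ V_32.T
print(T_32.dtype)  # float32
```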
@@ -362,7 +362,7 @@ def _fit(self, X):
         raise TypeError('PCA does not support sparse input. See '
                         'TruncatedSVD for a possible alternative.')

-        X = check_array(X, dtype=[np.float64], ensure_2d=True,
+        X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,
@raghavrv this line makes it such that if `X` is integer-based, it gets converted to `np.float64` (the first element of the list) on all platforms.
This answers your question at: https://github.com/scikit-learn/scikit-learn/pull/9067/files#r121114986
Should we change this behavior of `check_array`?
I think this is helpful. It converts to the dtype as per the given order.
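The dtype-list semantics under discussion can be modeled in a few lines (a toy model of the behavior, not sklearn's actual `check_array` implementation):

```python
import numpy as np


def cast_like_dtype_list(X, dtype_list):
    """Toy model of check_array's dtype-list semantics: keep X's dtype
    if it is in the list, otherwise cast to the FIRST entry -- which is
    why the order of the list matters."""
    X = np.asarray(X)
    if X.dtype in [np.dtype(d) for d in dtype_list]:
        return X
    return X.astype(dtype_list[0])


X_int = np.arange(6).reshape(3, 2)        # int dtype, not in the list
X_32 = np.ones((3, 2), dtype=np.float32)  # already in the list

print(cast_like_dtype_list(X_int, [np.float64, np.float32]).dtype)  # float64
print(cast_like_dtype_list(X_32, [np.float64, np.float32]).dtype)   # float32
# With the reversed order, int input would land on float32 instead:
print(cast_like_dtype_list(X_int, [np.float32, np.float64]).dtype)  # float32
```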
@ogrisel Thanks for the review. Have addressed your comments.
@ogrisel The AppVeyor build failed on some estimator checks. I've updated the checks so that only 4 decimal places are checked for float32 data. Could you verify the last commit and see if it is correct?
Ah! There was a change to master which actually addressed the failures. I think the CIs should pass after the last commit.
Yeah. @ogrisel @GaelVaroquaux final review and merge?
sklearn/utils/estimator_checks.py
Outdated
@@ -1142,7 +1142,6 @@ def check_classifiers_train(name, classifier_orig):
         if hasattr(classifier, "predict_log_proba"):
             # predict_log_proba is a transformation of predict_proba
             y_log_prob = classifier.predict_log_proba(X)
-            assert_allclose(y_log_prob, np.log(y_prob), 8, atol=1e-9)
How did this get in this PR? It seems unrelated
LGTM once my comment above is addressed.
Thanks @GaelVaroquaux. Have fixed it.
OK, good. Merging! Thanks
…at to double (scikit-learn#9067)

* ENH Ensure randomized_svd_low_rank doesn't upcast float to double
* ENH ensure PCA does not upcase f32 to f64; (int is upcast to f32)
* ENH ensure that when input is of type int, the output is float32/64
* ENH prefer float64 over float32; Use float64 for int inputs
* Make sure int types are upcasted to float64; Address Olivier's comments
* FIX check only for 4 decimals when dtype is float32
* Fix spurious line removal
Refers to #8769.
This will ensure float32 is not upcast to float64 in `randomized_
cc: @ogrisel @massich @GaelVaroquaux