Preserving dtype for float32 / float64 in transformers #11000

Open
23 of 28 tasks
glemaitre opened this issue Apr 20, 2018 · 25 comments
Labels
float32 (issues related to support for 32bit data) · Hard (hard level of difficulty) · Meta-issue (general issue associated to an identified list of tasks) · Sprint

Comments

@glemaitre
Member

glemaitre commented Apr 20, 2018

This is the issue we want to tackle during the Man AHL Hackathon.

We would like transformers not to convert float32 to float64 whenever possible. The transformers that are currently failing are:

We could consider extending this to integers whenever possible and applicable.

Also, the following transformers are not included in the common tests. We should write specific tests for them (see the sketch after the list below):

# some strange ones
DONT_TEST = ['SparseCoder', 'DictVectorizer',
             'TfidfTransformer',
             'TfidfVectorizer',  # check #10443
             'IsotonicRegression',
             'CategoricalEncoder',
             'FeatureHasher',
             'TruncatedSVD', 'PolynomialFeatures',
             'GaussianRandomProjectionHash', 'HashingVectorizer',
             'CountVectorizer']
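
As a hedged illustration of what such a specific test could look like, here is a sketch (not an agreed-upon test) that exercises one of the listed transformers on sparse float32 input; the assertion is expected to fail on versions that do not yet preserve the dtype.

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# Feed a sparse float32 count matrix and check the output dtype.
counts = sp.random(20, 5, density=0.3, format='csr',
                   random_state=0).astype(np.float32)
X_tfidf = TfidfTransformer().fit_transform(counts)
assert X_tfidf.dtype == np.float32, X_tfidf.dtype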

We could also check classifiers, regressors or clusterers (see #8769 for more context).

Below is the code executed to find the failures.

# Let's check the 32 - 64 bits type conservation.
if isinstance(X, np.ndarray):
    for dtype in [np.float32, np.float64]:
        X_cast = X.astype(dtype)

        transformer = clone(transformer_orig)
        set_random_state(transformer)

        if hasattr(transformer, 'fit_transform'):
            X_trans = transformer.fit_transform(X_cast, y_)
        else:
            # fall back to fit followed by transform
            transformer.fit(X_cast, y_)
            X_trans = transformer.transform(X_cast)

        # FIXME: should we also check that the dtype of some fitted
        # attributes is the same as the input dtype?
        assert X_trans.dtype == X_cast.dtype, \
            'transform dtype: {} - original dtype: {}'.format(
                X_trans.dtype, X_cast.dtype)

Tips to run the test for a specific transformer:

  • Choose a transformer, for instance FastICA
  • If this class does not already have a method named _more_tags: add the following code snippet at the bottom of the class definition:
    def _more_tags(self):
        return {"preserves_dtype": [np.float64, np.float32]}
  • Run the common tests for this specific class:
pytest sklearn/tests/test_common.py -k "FastICA and check_transformer_preserve_dtypes" -v
  • It should fail: read the error message and try to understand why the fit_transform method (if it exists) or the transform method returns a float64 data array when it is passed a float32 input array.

It might be helpful to use a debugger, for instance by adding the line:

import pdb; pdb.set_trace()

at the beginning of the fit_transform method and then re-running pytest with:

pytest sklearn/tests/test_common.py -k "FastICA and check_transformer_preserve_dtypes" --pdb

Then use the l (list), n (next), s (step into a function call), p some_array_variable.dtype (p stands for print) and c (continue) commands to interactively debug the execution of this fit_transform call.

Ping @rth: feel free to edit this thread.

@rth rth changed the title [Man AHL Hackathon] Preserving dtype with float / double in transformer [Man AHL Hackathon] Preserving dtype for float32 / float64 in transformers Apr 20, 2018
@jnothman
Member

jnothman commented Apr 21, 2018 via email

@rth
Member

rth commented Apr 21, 2018

Thanks for your comment and for mentioning this issue @jnothman! The question is whether the gains from supporting 32bit arrays outweigh the risk of running into new numerical issues.

On a different topic, the above test code can be rewritten in a self-contained way as:

import numpy as np

from sklearn.base import clone
from sklearn.utils.testing import set_random_state

def check_transformer_dtypes_casting(transformer, X, y):
    """Check that a transformer preserves 64bit / 32bit dtypes.

    Parameters
    ----------
    transformer : estimator
        A transformer instance.

    X : array, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples
        and n_features is the number of features.

    y : array, shape (n_samples,)
        Target vector relative to X.
    """
    for dtype_in, dtype_out in [(np.float32, np.float32),
                                (np.float64, np.float64),
                                (int, np.float64)]:
        X_cast = X.copy().astype(dtype_in)

        transformer = clone(transformer)
        set_random_state(transformer)

        if hasattr(transformer, 'fit_transform'):
            X_trans = transformer.fit_transform(X_cast, y)
        else:
            # fall back to fit followed by transform
            transformer.fit(X_cast, y)
            X_trans = transformer.transform(X_cast)

        # FIXME: should we also check that the dtype of some fitted
        # attributes is the same as the input dtype?
        assert X_trans.dtype == dtype_out, \
            ('transform dtype: {} - original dtype: {}'
             .format(X_trans.dtype, X_cast.dtype))

For instance, we can test LocallyLinearEmbedding as follows:

from sklearn.manifold import LocallyLinearEmbedding

X = np.random.RandomState(0).randn(1000, 100)

estimator = LocallyLinearEmbedding()

check_transformer_dtypes_casting(estimator, X, None)

which in this case will produce

AssertionError: transform dtype: float64 - original dtype: float32

@glemaitre
Member Author

@jnothman noted that we should be careful. The motivation is the same as in #8769 mentioned by @GaelVaroquaux. For transformers, I think there is an extra incentive: users could use other libraries (tensorflow, etc.) in which the default type is float32. Using scikit-learn transformers would then trigger extra conversions that should not be required. It is a vision similar to letting NaN pass through.

@rth rth changed the title [Man AHL Hackathon] Preserving dtype for float32 / float64 in transformers Preserving dtype for float32 / float64 in transformers Apr 27, 2018
@GaelVaroquaux GaelVaroquaux added this to To do in Sprint Paris 2019 via automation Feb 25, 2019
@LalliAcqua
Contributor

Checking Isomap for this issue #11000

@ksslng
Contributor

ksslng commented Jan 29, 2020

I'm working on GenericUnivariateSelect as well as implementing a generic test for all estimators to preserve dtypes.

@Henley13
Contributor

Should we wait for @ksslng to merge his PR with a generic test before working on a specific transformer from the above list? Or focus on the "strange ones" instead?

@rth
Member

rth commented Jan 30, 2020

@Henley13 It would be ideal to have that common test first, but you can certainly start investigating an estimator from the list in the meantime.

It should be noted that some of the above estimators may never preserve dtype, either because they rely heavily on linear algebra and we are not able to guarantee that results are numerically consistent in 32bit, or because they are not used frequently enough to deserve this effort. So starting with frequently used and relatively simple estimators would be best.

Another related point is that you could check that some of the supervised/unsupervised models are able to work in 32bit (again focusing on the most popular ones). There are a number of PRs linked above that have done that, but many models probably still need work. So you could look into, say, linear models, checking that they are able to work without converting to 64bit when given 32bit input, and what would be necessary to fix them.
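
As a rough sketch of the kind of manual check this implies (an ad-hoc script, not a common test; Ridge is only one example of a popular estimator):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Fit a popular estimator on float32 data and inspect the dtype of its
# fitted attributes and predictions: a silent conversion to 64bit would
# show up here as float64.
X, y = make_regression(n_samples=200, n_features=10, random_state=0)
X32, y32 = X.astype(np.float32), y.astype(np.float32)

model = Ridge().fit(X32, y32)
print(model.coef_.dtype)         # float32 if no silent conversion happened
print(model.predict(X32).dtype)  # ideally float32 as well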

@jeremiedbb
Member

When making a transformer preserve float32, it's important to check that it does not introduce a performance regression.

Currently, transformers that don't preserve float32 convert to float64 as a first step and work with float64 data all the way. We need to avoid keeping float32 at the beginning only to end up doing a lot of copies internally in a tight loop because, for instance, we are adding a float64 array to a float32 array.

PRs enabling float32 dtype preservation should come with a benchmark comparing performance against main on float32 data.
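
A minimal NumPy illustration of the copy/upcast pitfall described above (not scikit-learn code):

import numpy as np

# Mixing a float64 array (e.g. a fitted attribute left in 64bit) into a
# float32 computation silently upcasts and copies the result.
a32 = np.ones(5, dtype=np.float32)
b64 = np.ones(5, dtype=np.float64)

print((a32 + b64).dtype)  # float64 -- the float32 input was upcast
print((a32 + a32).dtype)  # float32 -- stays in 32bit when dtypes agree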

@svenstehle
Contributor

svenstehle commented Sep 2, 2022

Hi, I checked GaussianRandomProjection.
In the DONT_TEST list this is referenced as GaussianRandomProjectionHash - maybe a typo?

It seems GaussianRandomProjection was already fixed in #22114.
The tests introduced in this PR also cover GaussianRandomProjection, since it is part of the tested estimators in all_RandomProjection. Running the tests test_random_projection_dtype_match and test_random_projection_numerical_consistency locally also passed without any errors.

_more_tags already exists in the base class BaseRandomProjection.

I think we can tick off GaussianRandomProjection. I don't know if the recommended benchmarking was done though.

@jeremiedbb
Member

jeremiedbb commented Sep 2, 2022

@svenstehle Thanks for looking into that. Would you like to open a PR? EDIT: there's actually nothing to do besides removing it from the checklist. Thanks

@svenstehle
Contributor

svenstehle commented Sep 2, 2022

Working on GenericUnivariateSelect.
Correction: GenericUnivariateSelect was already implemented in #22370.

sklearn/tests/test_common.py::test_estimators[GenericUnivariateSelect()-check_transformer_preserve_dtypes] PASSED  

@svenstehle
Contributor

Birch was fixed in #22968

sklearn/tests/test_common.py::test_estimators[Birch()-check_transformer_preserve_dtypes] PASSED 

@IvanLauLinTiong
Contributor

Working on FeatureAgglomeration.

@svenstehle
Contributor

svenstehle commented Sep 2, 2022

BernoulliRBM was fixed in #16352

@jeremiedbb discovered that we were missing the necessary update to _more_tags. Added this in #24318

sklearn/tests/test_common.py::test_estimators[BernoulliRBM()-check_transformer_preserve_dtypes] PASSED 

MiniBatchDictionaryLearning seems to have been fixed with PRs #22002 and #22428, but @jeremiedbb please double-check that this is correct.

sklearn/tests/test_common.py::test_estimators[MiniBatchDictionaryLearning()-check_transformer_preserve_dtypes] PASSED

@svenstehle
Contributor

Continuing work of @thibsej on FactorAnalysis in #13303

@MisaOgura
Contributor

Looking at LocallyLinearEmbedding

@betatim
Member

betatim commented Sep 2, 2022

I think we can tick off SkewedChi2Sampler in the list, as pytest sklearn/tests/test_common.py -k "SkewedChi2Sampler and check_transformer_preserve_dtypes" -v says it passes the test.

@jeremiedbb
Member

jeremiedbb commented Sep 2, 2022

@betatim not yet: this test only runs on float64 by default. To trigger the test on both float64 and float32, you need to set the "preserves_dtype": [np.float64, np.float32] estimator tag.
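
Concretely, mirroring the snippet from the issue description, something like the following would have to be added to the estimator class (a sketch; the exact placement depends on the class definition):

def _more_tags(self):
    # Declare that the transformer is expected to preserve both dtypes,
    # so the common test runs on float32 as well as float64.
    return {"preserves_dtype": [np.float64, np.float32]}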

@rprkh
Contributor

rprkh commented Sep 4, 2022

Working on SkewedChi2Sampler

@OmarManzoor
Contributor

@jeremiedbb, @glemaitre
Is there any part of this issue still remaining that can be worked on, or are all of them taken or complete?

@glemaitre
Member Author

I don't see any PR for Isomap.

@OmarManzoor
Contributor

change in cdef class BinaryTree necessary to fix float32/float64 conversion in Isomap #15474

I see. It is mentioned above along with the issue #15474. Does that issue need to be completed first in order to work on Isomap within the current one?

@glemaitre
Member Author

Does that issue need to be completed first in order to work on Isomap within the current one?

It seems so.

@ogrisel
Member

ogrisel commented Sep 19, 2022

I see. It is mentioned above along with the issue #15474. Does that issue need to be completed first in order to work on Isomap within the current one?

No, it is the opposite: we need to make Isomap preserve float32 and then update the tests accordingly.
