Preserving dtype for float32 / float64 in transformers #11000

Open · glemaitre opened this issue Apr 20, 2018 · 7 comments

glemaitre (Contributor) commented Apr 20, 2018

This is the issue which we want to tackle during the Man AHL Hackathon.

We would like transformers not to convert float32 to float64 whenever possible (a short sketch of what this means in practice is given just below the list). The transformers which currently fail this check are:

  • BernoulliRBM
  • Birch
  • CCA - inherits from PLS, so also not applicable
  • DictionaryLearning
  • FactorAnalysis (#13303)
  • FastICA
  • FeatureAgglomeration
  • GaussianRandomProjection
  • GenericUnivariateSelect
  • Isomap
  • LatentDirichletAllocation (#13275)
  • LinearDiscriminantAnalysis (#13273)
  • LocallyLinearEmbedding
  • LogisticRegression (#13243)
  • MiniBatchDictionaryLearning
  • MiniBatchSparsePCA
  • NMF
  • PLSCanonical - not applicable as both X and y are used
  • PLSRegression - not applicable as both X and y are used
  • PLSSVD - not applicable as both X and y are used
  • RBFSampler
  • RidgeRegression (#13302)
  • SGDClassifier/SGDRegressor (#9084)(#13346)
  • SkewedChi2Sampler
  • SparsePCA
  • SparseRandomProjection

We could consider extending this to integer inputs whenever possible and applicable.
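
As a quick illustration of the property we want (a minimal sketch; StandardScaler is picked here only because it is an example of a transformer that already preserves dtype), a float32 input should give a float32 output:

import numpy as np

from sklearn.preprocessing import StandardScaler

# A float32 input: a dtype-preserving transformer should return float32 output.
X32 = np.random.RandomState(0).randn(100, 5).astype(np.float32)
X_trans = StandardScaler().fit_transform(X32)
print(X_trans.dtype)  # expected: float32; the transformers listed above return float64 instead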

Also, the following transformers are not included in the common tests, so we should write a specific test for them (a sketch of such a test is given right after the list):

# some strange ones
DONT_TEST = ['SparseCoder', 'DictVectorizer',
             'TfidfTransformer',
             'TfidfVectorizer',  # see #10443
             'IsotonicRegression',
             'CategoricalEncoder',
             'FeatureHasher',
             'TruncatedSVD', 'PolynomialFeatures',
             'GaussianRandomProjectionHash', 'HashingVectorizer',
             'CountVectorizer']
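
As an illustration, here is a minimal sketch of what a specific test could look like for one of these, using TfidfTransformer with a sparse float32 input (the choice of estimator and the toy data are only for illustration; the assert may well fail, which is exactly what such a test is meant to surface):

import numpy as np
import scipy.sparse as sp

from sklearn.feature_extraction.text import TfidfTransformer

# A small sparse "term count" matrix cast to float32.
counts = sp.random(20, 10, density=0.3, format='csr',
                   random_state=0).astype(np.float32)

X_tfidf = TfidfTransformer().fit_transform(counts)
# The specific test would assert that the float32 dtype is preserved.
assert X_tfidf.dtype == np.float32, X_tfidf.dtype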

We could also check classifiers, regressors, or clusterers (see #8769 for more context).

Below is the code executed to find the failures.

# Let's check the 32 - 64 bits dtype conservation.
if isinstance(X, np.ndarray):
    for dtype in [np.float32, np.float64]:
        X_cast = X.astype(dtype)

        transformer = clone(transformer_orig)
        set_random_state(transformer)

        if hasattr(transformer, 'fit_transform'):
            X_trans = transformer.fit_transform(X_cast, y_)
        else:
            transformer.fit(X_cast, y_)
            X_trans = transformer.transform(X_cast)

        # FIXME: should we check that the dtype of some fitted attributes
        # is the same as the input dtype?
        assert X_trans.dtype == X_cast.dtype, \
            'transform dtype: {} - original dtype: {}'.format(
                X_trans.dtype, X_cast.dtype)

ping @rth feel free to edit this thread.

rth changed the title from "[Man AHL Hackathon] Preserving dtype with float / double in transformer" to "[Man AHL Hackathon] Preserving dtype for float32 / float64 in transformers" on Apr 20, 2018

jnothman (Member) commented Apr 21, 2018

rth (Member) commented Apr 21, 2018

Thanks for your comment and for mentioning this issue @jnothman! The question is whether the gains from supporting 32bit arrays outweigh the risk of running into new numerical issues.

On a different topic, the above test code can be rewritten in a self-contained way as follows:

import numpy as np

from sklearn.base import clone
from sklearn.utils.testing import set_random_state

def check_transformer_dtypes_casting(transformer, X, y):
    """Check that a transformer preserves 64bit / 32bit
    dtypes

    Parameters
    ----------
    transformer : estimator
      a transformer instance

    X : array, shape (n_samples, n_features)
      Training vector, where n_samples is the number of samples
      and n_features is the number of features.
    y : array, shape (n_samples)
      Target vector relative to X.
    """
    for dtype_in, dtype_out in [(np.float32, np.float32),
                                (np.float64, np.float64),
                                (np.int64, np.float64)]:
        X_cast = X.copy().astype(dtype_in)

        transformer = clone(transformer)
        set_random_state(transformer)

        if hasattr(transformer, 'fit_transform'):
            X_trans = transformer.fit_transform(X_cast, y)
        else:
            transformer.fit(X_cast, y)
            X_trans = transformer.transform(X_cast)

        # FIXME: should we check that the dtype of some fitted attributes
        # is the same as the input dtype?
        assert X_trans.dtype == dtype_out, \
            ('transform dtype: {} - original dtype: {}'
             .format(X_trans.dtype, X_cast.dtype))

For instance, we can test LocallyLinearEmbedding as follows:

from sklearn.manifold import LocallyLinearEmbedding

X = np.random.RandomState(0).randn(1000, 100)

estimator = LocallyLinearEmbedding()

check_transformer_dtypes_casting(estimator, X, None)

which in this case will produce

AssertionError: transform dtype: float64 - original dtype: float32

glemaitre (Contributor, Author) commented Apr 21, 2018

@jnothman noted that we should be careful. The motivation is the same as in #8769, mentioned by @GaelVaroquaux. For transformers, I think there is an extra incentive: users may be working with other libraries (TensorFlow, etc.) in which the default dtype is float32. Using the scikit-learn transformers would then trigger extra conversions that would not otherwise be required. It is a vision somewhat similar to letting NaN pass through.

rth changed the title from "[Man AHL Hackathon] Preserving dtype for float32 / float64 in transformers" to "Preserving dtype for float32 / float64 in transformers" on Apr 27, 2018
GaelVaroquaux added this to "To do" in Sprint Paris 2019 via automation on Feb 25, 2019

LalliAcqua (Contributor) commented Nov 2, 2019

Checking Isomap for this issue #11000

ksslng (Contributor) commented Jan 29, 2020

I'm working on GenericUnivariateSelect, as well as implementing a generic test checking that estimators preserve dtypes.
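
(For context, a rough sketch of what such a generic check could look like, assuming all_estimators is used to enumerate transformers and skipping anything that cannot be instantiated or fitted with default parameters on a plain dense array:)

import numpy as np

from sklearn.utils import all_estimators

X32 = np.random.RandomState(0).randn(50, 10).astype(np.float32)

for name, Transformer in all_estimators(type_filter='transformer'):
    try:
        transformer = Transformer()  # default parameters only
        X_trans = transformer.fit_transform(X32)
    except Exception:
        continue  # skip transformers that need special inputs or parameters
    if hasattr(X_trans, 'dtype') and X_trans.dtype != np.float32:
        print('{}: float32 input upcast to {}'.format(name, X_trans.dtype))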

Henley13 (Contributor) commented Jan 30, 2020

Should we wait for @ksslng to merge his PR with a generic test before working on a specific transformer from the above list? Or focus on the "strange ones" instead?

rth (Member) commented Jan 30, 2020

@Henley13 It would have been ideal to have that common test, but you can certainly start investigating an estimator from the list in the meantime.

It should be noted that some of the above estimators may never preserve dtype, either because they rely heavily on linear algebra and we cannot guarantee that the results are consistent in 32bit, or because they are not used frequently enough to justify the effort. So starting with frequently used and relatively simple estimators would be best.

Another related point is that you could check that some of the supervised/unsupervised models are able to work in 32bit (again focusing on the most popular ones). There are a number of PRs linked above that have done that, but many models probably still need work. So you could look into, say, linear models, checking that they are able to work without converting to 64bit when given 32bit input, and what would be necessary to fix it (a sketch of the usual fix pattern is given below).
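
(For anyone picking this up: the usual fix is to pass a dtype list to check_array so that float32 input is kept while anything else is converted to float64. A minimal sketch on a hypothetical toy transformer, not an actual scikit-learn estimator:)

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array


class CenteringTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer used only to show the dtype pattern."""

    def fit(self, X, y=None):
        # dtype=[np.float64, np.float32] keeps float32 input as float32;
        # other dtypes (e.g. int) are converted to float64.
        X = check_array(X, dtype=[np.float64, np.float32])
        self.mean_ = X.mean(axis=0)  # fitted attribute inherits X's dtype
        return self

    def transform(self, X):
        X = check_array(X, dtype=[np.float64, np.float32])
        return X - self.mean_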
