Preserving dtype for float32 / float64 in transformers #11000

Open
glemaitre opened this Issue Apr 20, 2018 · 3 comments

glemaitre commented Apr 20, 2018

This is the issue which we want to tackle during the Man AHL Hackathon.

We would like transformers not to convert float32 input to float64 whenever possible. The transformers which currently fail (see the minimal sketch after this list for an illustration) are:

  • BernoulliRBM
  • Birch
  • CCA - inherits from PLS, so also not applicable
  • DictionaryLearning
  • FactorAnalysis (#13303)
  • FastICA
  • FeatureAgglomeration
  • GaussianRandomProjection
  • GenericUnivariateSelect
  • Isomap
  • LatentDirichletAllocation (#13275)
  • LinearDiscriminantAnalysis (#13273)
  • LocallyLinearEmbedding
  • LogisticRegression (#13243)
  • MiniBatchDictionaryLearning
  • MiniBatchSparsePCA
  • NMF
  • PLSCanonical - not applicable as both X and y are used
  • PLSRegression - not applicable as both X and y are used
  • PLSSVD - not applicable as both X and y are used
  • RBFSampler
  • RidgeRegression (#13302)
  • SGDClassifier/SGDRegressor (#9084)(#13346)
  • SkewedChi2Sampler
  • SparsePCA
  • SparseRandomProjection

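To illustrate the problem, here is a minimal sketch using GaussianRandomProjection from the list above (the exact behaviour depends on the scikit-learn version; at the time of this issue the output is upcast to float64):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X_32 = np.random.RandomState(0).randn(100, 50).astype(np.float32)
transformer = GaussianRandomProjection(n_components=10, random_state=0)
X_proj = transformer.fit_transform(X_32)

print(X_32.dtype)    # float32
print(X_proj.dtype)  # float64 here -- the upcast we would like to avoid
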
We could consider extending this to integer dtypes whenever possible and applicable.

Also, the following transformers are not included in the common tests, so we should write specific tests for them (a possible shape for such a test is sketched after the code block below):

# some strange ones
DONT_TEST = ['SparseCoder', 'DictVectorizer',
             'TfidfTransformer',
             'TfidfVectorizer',  # check #10443
             'IsotonicRegression',
             'CategoricalEncoder',
             'FeatureHasher',
             'TruncatedSVD', 'PolynomialFeatures',
             'GaussianRandomProjectionHash', 'HashingVectorizer',
             'CountVectorizer']
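
As a starting point, a dedicated test for one of these could look like the following sketch (TruncatedSVD is used purely as an example, and whether float32 should be preserved for a given estimator is exactly what needs to be decided):

import numpy as np
from sklearn.decomposition import TruncatedSVD

def test_truncated_svd_preserves_float32():
    # hypothetical dedicated test; adapt the expected dtype per estimator
    X = np.random.RandomState(0).rand(50, 20).astype(np.float32)
    X_trans = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)
    assert X_trans.dtype == np.float32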

We could also check classifiers, regressors, or clusterers (see #8769 for more context).

Below is the code executed to find the failures.

# Let's check the 32 / 64 bits dtype conservation.
if isinstance(X, np.ndarray):
    for dtype in [np.float32, np.float64]:
        X_cast = X.astype(dtype)

        transformer = clone(transformer_orig)
        set_random_state(transformer)

        if hasattr(transformer, 'fit_transform'):
            X_trans = transformer.fit_transform(X_cast, y_)
        else:
            # fall back to separate fit and transform calls
            transformer.fit(X_cast, y_)
            X_trans = transformer.transform(X_cast)

        # FIXME: should we also check that the dtypes of some fitted
        # attributes match the input dtype?
        assert X_trans.dtype == X_cast.dtype, \
            'transform dtype: {} - original dtype: {}'.format(
                X_trans.dtype, X_cast.dtype)

ping @rth feel free to edit this thread.

@glemaitre glemaitre added the Sprint label Apr 20, 2018

@rth rth changed the title [Man AHL Hackathon] Preserving dtype with float / double in transformer [Man AHL Hackathon] Preserving dtype for float32 / float64 in transformers Apr 20, 2018

jnothman commented Apr 21, 2018 (comment minimized)

rth commented Apr 21, 2018

Thanks for your comment and for mentioning this issue @jnothman! The question is whether the gains from supporting 32-bit arrays outweigh the risk of running into new numerical issues.

On a different topic, the above test code can be rewritten in a self-contained way as follows,

import numpy as np

from sklearn.base import clone
from sklearn.utils.testing import set_random_state

def check_transformer_dtypes_casting(transformer, X, y):
    """Check that a transformer preserves 64bit / 32bit
    dtypes

    Parameters
    ----------
    transformer : estimator
      a transformer instance

    X : array, shape (n_samples, n_features)
      Training vector, where n_samples is the number of samples
      and n_features is the number of features.
    y : array, shape (n_samples)
      Target vector relative to X.
    """
    for dtype_in, dtype_out in [(np.float32, np.float32),
                                (np.float64, np.float64),
                                (int, np.float64)]:
        X_cast = X.copy().astype(dtype_in)

        transformer = clone(transformer)
        set_random_state(transformer)

        if hasattr(transformer, 'fit_transform'):
            X_trans = transformer.fit_transform(X_cast, y)
        else:
            # fall back to separate fit and transform calls
            transformer.fit(X_cast, y)
            X_trans = transformer.transform(X_cast)

        # FIXME: should we also check that the dtypes of some fitted
        # attributes match the input dtype?
        assert X_trans.dtype == dtype_out, \
            ('transform dtype: {} - original dtype: {}'
             .format(X_trans.dtype, X_cast.dtype))

For instance, we can test the LocallyLinearEmbedding as follows,

from sklearn.manifold import LocallyLinearEmbedding

X = np.random.RandomState(0).randn(1000, 100)

estimator = LocallyLinearEmbedding()

check_transformer_dtypes_casting(estimator, X, None)

which in this case will produce

AssertionError: transform dtype: float64 - original dtype: float32

glemaitre commented Apr 21, 2018

@jnothman noted that we should be careful. The motivation is the same as in #8769, mentioned by @GaelVaroquaux. For transformers, I think there is an extra incentive: users may be working with other libraries (TensorFlow, etc.) whose default dtype is float32. Passing their data through scikit-learn transformers would then trigger extra conversions that would not otherwise be required. It is a similar vision to letting NaN pass through.
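
For reference, the usual scikit-learn idiom for avoiding this upcast inside an estimator (a general pattern, not something prescribed in this thread) is to pass a preference-ordered list of accepted dtypes to check_array during input validation:

import numpy as np
from sklearn.utils import check_array

X_32 = np.random.RandomState(0).randn(10, 3).astype(np.float32)

# float32 and float64 inputs are kept as-is; anything else (e.g. int)
# is converted to the first entry of the list, i.e. float64
X_validated = check_array(X_32, dtype=[np.float64, np.float32])
print(X_validated.dtype)  # float32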
