
Potential error caused by different column order #7242

Open
SauceCat opened this issue Aug 25, 2016 · 37 comments · May be fixed by #11607

Comments

@SauceCat commented Aug 25, 2016

Description

Sometimes it is convenient to first build a model on a recent dataset, save it as a .pkl file, and then apply the model to a new dataset. However, in our last project, my friends and I found that the results turned quite weird after applying the .pkl file to the new dataset. We had implemented a binary classifier, and the predicted probability distribution turned from unimodal to bimodal. Finally, we found out the problem was that the column order of the new dataset was different from the old one, so the predictions were totally wrong.
I have checked the source code and discovered that the fit function of sklearn doesn't save the column names during model training. Thus there is no way to check whether the columns are consistent at prediction time. We thought it would be better if the column names could be saved during training and then used to validate the columns during prediction.

Steps/Code to Reproduce

# For simplicity, consider a very simple case
from sklearn.datasets import load_iris
import pandas as pd

# Make a DataFrame
iris = load_iris()
X, y = iris.data[:-1, :], iris.target[:-1]
iris_pd = pd.DataFrame(X)
iris_pd.columns = iris.feature_names
iris_pd['target'] = y

from sklearn.cross_validation import train_test_split
train, test = train_test_split(iris_pd, test_size=0.3)

feature_columns_train = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
feature_columns_test = ['sepal length (cm)', 'sepal width (cm)', 'petal width (cm)', 'petal length (cm)']

from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
lg.fit(train[feature_columns_train], train['target'])

prob1 = lg.predict_proba(test[feature_columns_train])
prob2 = lg.predict_proba(test[feature_columns_test])

Expected Results

Because feature_columns_test is ordered differently from feature_columns_train, it is not surprising that prob1 is totally different from prob2; prob1 is the correct result.

prob1[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Actual Results

prob2[:5] =
array([[  4.36321678e-01,   2.25057553e-04,   5.63453265e-01],
       [  4.92513658e-01,   1.76391882e-05,   5.07468703e-01],
       [  9.92946715e-01,   7.05167151e-03,   1.61346947e-06],
       [  9.83726756e-01,   1.62387090e-02,   3.45348884e-05],
       [  5.01392274e-01,   5.37144591e-04,   4.98070581e-01]])

Versions

Linux-2.6.32-642.1.1.el6.x86_64-x86_64-with-redhat-6.7-Santiago
('Python', '2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Dec  6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.1')
('SciPy', '0.16.0')
('Scikit-Learn', '0.17')

A possible solution

I also implemented a very simple solution. Hope this helps. :)

class SafeLogisticRegression(LogisticRegression):
    def fit(self, X, y, sample_weight=None):
        # Remember the column order seen at fit time
        self.columns = X.columns
        LogisticRegression.fit(self, X, y, sample_weight=sample_weight)
        return self

    def predict_proba(self, X):
        new_columns = list(X.columns)
        old_columns = list(self.columns)
        if new_columns != old_columns:
            if len(new_columns) == len(old_columns):
                try:
                    # Same names, different order: restore the fit-time order
                    X = X[old_columns]
                    print "The order of columns has changed. Fixed."
                except KeyError:
                    raise ValueError('The columns have changed. Please check.')
            else:
                raise ValueError('The number of columns has changed.')
        return LogisticRegression.predict_proba(self, X)

Then apply this new class:

slg = SafeLogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
slg.fit(train[feature_columns_train], train['target']) 

Test one: if the column order is changed

prob1 = slg.predict_proba(test[feature_columns_train])
prob2 = slg.predict_proba(test[feature_columns_test])

#The order of columns has changed. Fixed.

Result for test one:

prob1[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

prob2[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Test two: if the column names are different

Simulate by changing one of the column names

prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-84cea68536fe> in <module>()
----> 1 prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     12                     print "The order of columns has changed. Fixed."
     13                 except:
---> 14                     raise ValueError('The columns have changed. Please check.')
     15             else:
     16                 raise ValueError('The number of columns has changed.')

ValueError: The columns have changed. Please check.

Test three: if the number of columns changes

Simulate by dropping one column

prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-47c63ae1ac22> in <module>()
----> 1 prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     14                     raise ValueError('The columns have changed. Please check.')
     15             else:
---> 16                 raise ValueError('The number of columns has changed.')
     17         return LogisticRegression.predict_proba(self, X)

ValueError: The number of columns has changed.
@amueller (Member) commented Aug 25, 2016

Scikit-learn currently has no concept of column labels. I agree this would be a nice-to-have feature, but don't hold your breath for it. Storing the column names at all would be an important first step, which I hope to achieve soonish. This will not be available in 0.18, and the consistency check probably not in 0.19 either. Scikit-learn operates on numpy arrays, which don't (currently) have column names. This might change in the future, and we might add better integration with pandas.

@amueller added this to the 1.0 milestone Aug 25, 2016
@SauceCat (Author) commented Aug 26, 2016

Thanks for replying! Looking forward to the new version! 💯

@pjcpjc commented Aug 7, 2017

Thanks for the edifying conversation guys!

@jnothman (Member) commented Aug 8, 2017

I think a wrapper to ensure column order is maintained between fit and predict/transform would be a good idea.
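
A minimal sketch of what such a wrapper could look like (ColumnOrderWrapper is a hypothetical name for illustration, not an existing sklearn class; it assumes DataFrame input):

from sklearn.base import BaseEstimator

class ColumnOrderWrapper(BaseEstimator):
    """Remember the fit-time column order and re-apply it by name later."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        self.columns_ = list(X.columns)  # assumes X is a DataFrame
        self.estimator.fit(X[self.columns_], y, **fit_params)
        return self

    def predict(self, X):
        # Selecting by name raises KeyError if a fitted column is missing
        return self.estimator.predict(X[self.columns_])

    def predict_proba(self, X):
        return self.estimator.predict_proba(X[self.columns_])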

@pjcpjc commented Aug 8, 2017

I would guess the way to do it would be to respect column names if they are present for both fit and predict, and to respect column ordering if they are missing in either (or both). Feature selection would also need to abide by this. I'm sure it's not as easy as it might first appear.


@pjcpjc commented Aug 11, 2017

In my opinion this is an sklearn issue, not a pandas issue. As I said, I'm not complaining, and I'm not expecting it to be terribly easy.

@amueller (Member) commented Aug 11, 2017

I would like to solve this inside scikit-learn, but I don't think there is consensus to do that (yet).
I would argue that we should protect our users against silently wrong results; other voices might say that users shouldn't pass dataframes?

@pjcpjc commented Aug 11, 2017

other voices might say that users shouldn't pass dataframes?

You could print a warning that says "Beware when using Dataframes. sklearn uses column ordering not column labels. Downcast to numpy arrays to avoid this warning".

That would have saved my bacon here.

Pandas prints out a warning with the view vs. copy "gotcha", and that was enough to motivate me to study and pre-empt any problem. This issue seems similar. http://bit.ly/2qR1fDH , http://bit.ly/2pRbR7Q

@amueller (Member) commented Aug 11, 2017

@pjcpjc A warning is an option, but this would be very, very loud: many people pass dataframes.
I guess it depends a bit on whether we want them to pass dataframes or not. One problem is that if we warn, that basically tells users to use .values everywhere to convert to numpy arrays so they don't get a warning. But then this error becomes silent again, so we've gained nothing. We've just made users change their code so that we can never hope to catch the error.

Also, I think having the estimator know the feature names is helpful for model inspection.

@pjcpjc commented Aug 11, 2017

I'd prefer for sklearn to respect column names if they are present for both fit and predict, and to respect column ordering if they are missing in either (or both). But I also think a warning would be better than how it works right now, where it's easy to get a false sense of security and end up with buggy code, which was my experience with this issue.

You can see my bug fix here. The original version didn't have cell 18. A warning would have flagged that cell 18 was needed. The additional code changes to pass numpy arrays would be minor, if needed.

Just one opinion.

@amueller (Member) commented Aug 11, 2017

I'm not sure you got the problem with the warning. If there's always a warning, people will change their code so they don't get one, which means passing in df.values, i.e. a numpy array without column names. Then later they change their code and the columns mismatch again, but they get no warning because they only passed a numpy array. So in the end, adding the warning didn't save them at all; it just created more work.

I totally agree that we should do better than we're doing now. I just think a warning whenever a dataframe is passed will not help.

@pjcpjc commented Aug 11, 2017

I think I understand your point. My point is that if the warning links to a "see here for details" documentation page, and that page takes great pains to point out that you need to manage the column positions outside of sklearn, then my guess is it will save quite a few people.

I also suspect that the people who ignore the warning, and/or don't take it seriously and just do the simplest possible thing to make it go away, are probably un-saveable in any event. They will find some other way to muck up their data science practice.

That said, agree to disagree. It would have helped me, but it certainly is possible I'm too unusual to be of guidance.

@amueller (Member) commented Aug 11, 2017

Ok, I see where you're coming from. It might be a good way to make people aware that this issue exists at all. I still think it's a bit too loud and unspecific, though it might indeed save some people. I hope for a near-ish-term better solution ;)

@jnothman (Member) commented Aug 13, 2017

I think warning for all pandas DataFrame input is overkill. Lots of people are loading DataFrames from static sources (as we recommend) and not reordering columns.

The options, as far as I'm concerned, are:

  • add some documentation on Scikit-learn with Pandas and pitfalls therein.
  • add a common test which ensures column name equality, otherwise raising a warning (and perhaps in a later version an error) when mismatched.

Obviously the latter is a substantial effort, and adds O(n_features) space to the pickle representation of an estimator unless we use a hash, but ... dataframes are important to support well.
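
To make the hashing idea concrete, here is a sketch of a constant-size digest of the fit-time column names (column_hash is a hypothetical helper for illustration, not sklearn code):

import hashlib

def column_hash(columns):
    """Order-sensitive digest of the column names; a constant-size
    alternative to pickling the full list on the estimator."""
    joined = '\0'.join(str(c) for c in columns).encode('utf-8')
    return hashlib.sha256(joined).hexdigest()

# At fit time:     self._column_hash = column_hash(X.columns)
# At predict time: if column_hash(X.columns) != self._column_hash: raise ValueError(...)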

@jnothman added the API and Bug labels Aug 13, 2017
@jnothman (Member) commented Aug 13, 2017

If we do this, I'd suggest we extend check_array, at least to do the checking, and perhaps even to help store the estimator attribute. In terms of syntax, this could be as magical as check_X_y(X, y, ..., estimator=self) in fit and check_array(X, ..., estimator=self) in predict/transform, but maybe we'd prefer check_X_y(X, y, ..., store_column_hash=self) and check_array(X, ..., column_hash=self._column_hash). (We could make it even more magical with decorators or a metaclass, but we don't do that 'round here.)

There will be cases where this is hard to do because check_array is called by a function called by the estimator, rather than directly within it. We'd also need to do the same for warm_start and partial_fit.

In short, this is possible, but it'll be a big project and will touch everything.
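
As a rough sketch of what the estimator=self convention could do internally (check_array_with_signature and _column_signature are hypothetical names for illustration, wrapping the real check_array):

from sklearn.utils import check_array

def check_array_with_signature(X, estimator, fitting=False, **kwargs):
    """Store the column names on the estimator at fit time; verify them later."""
    columns = getattr(X, 'columns', None)
    if fitting:
        estimator._column_signature = None if columns is None else list(columns)
    else:
        expected = getattr(estimator, '_column_signature', None)
        if expected is not None and columns is not None and list(columns) != expected:
            raise ValueError('Column mismatch: expected %r, got %r'
                             % (expected, list(columns)))
    return check_array(X, **kwargs)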

@amueller (Member) commented Aug 13, 2017

I would rather not store hashes, because this information is very valuable for model inspection.
I'm not sure I like passing self to check_array; I guess the only alternative would be to return column_names and then store it, and pass column_names=self._column_names in predict/transform.
(If we do this, we probably want to really look into get_feature_names and pipelines' processing of the same, because it'd be AWESOME!)


@jnothman added this to Design/analysis stage in API and interoperability Aug 14, 2017
@thebeancounter commented Jan 4, 2018

So, for now, if I wish to fit and then transform on another df, what is the proper way to ensure the correct column order?
How can I guarantee that a pandas DataFrame's columns are in a specific order?
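
Until something is built in, the usual workaround is to keep the fit-time column list yourself and re-apply it by name before every predict/transform. A sketch, reusing the frames from the original report:

# Remember the fit-time order once...
train_columns = list(train[feature_columns_train].columns)

# ...then enforce it by name on any later frame before predicting:
prob = slg.predict_proba(test[train_columns])
# test.reindex(columns=train_columns) also works, but silently inserts
# all-NaN columns for missing names, so explicit indexing is safer.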


@jnothman (Member) commented Jan 11, 2018

Currently, estimators have ad-hoc checks that n_features matches from fit to predict.

Maybe we should implement in BaseEstimator (for use after check_array):

import warnings
import pandas as pd

# Meant to live on BaseEstimator, where the double-underscore attributes
# name-mangle to _BaseEstimator__n_features / _BaseEstimator__columns.

def _fit_consistency(self, X):
    self.__n_features = X.shape[1]
    if isinstance(X, pd.DataFrame):
        self.__columns = X.columns

def _check_consistency(self, X):
    if self.__n_features != X.shape[1]:
        raise ValueError('n_features has changed since fit')
    if isinstance(X, pd.DataFrame):
        if not hasattr(self, '_BaseEstimator__columns'):
            warnings.warn('fitted without column names, got a DataFrame')
        elif not self.__columns.equals(X.columns):
            raise ValueError('columns have changed since fit')
    elif hasattr(self, '_BaseEstimator__columns'):
        warnings.warn('fitted with column names, got a plain array')
@kienerj commented Feb 1, 2018

I just wanted to add that this feature is extremely important out in the real world, where the model-building process and the prediction process (exposed to users) are often completely separate.

The prediction process should not need to know about feature selection, feature order, and so forth. It should just take all the features for the specific problem, and the estimator knows which ones it needs and in what order. This is how it works in other tools; it's extremely convenient and is, in fact, why I prefer such a tool over sklearn. Sklearn, however, offers more estimator types and better performance (in speed and prediction quality, even for the same model types), so it would be great to have this here too.

I would imagine this happens either through pandas column names or, if you pass in numpy arrays, through an additional optional column_names parameter on fit and predict (though having to create that list isn't great). If no column names are present, it works as it does now.

@jnothman (Member) commented Feb 1, 2018

I don't think your suggestion is going to happen.

Well, here are the options (not all mutually exclusive):

  1. Deprecate implicit conversion of DataFrames to NumPy arrays; in the future, the user would have to call .values explicitly
  2. Document it: #10453 is up for grabs and your writing on the subject is fluent, @kienerj!
  3. Call df.sort_index(axis=1) in the future (with a warning for a deprecation period, as it is not backwards compatible), with the risk that this (a) increases runtime/memory; (b) does not identify changes in column names where the number of columns is preserved
  4. Put some PandasProtect wrapper or mixin in sklearn-pandas and tell people to use it if they want to be safe
  5. Invent a PandasCheck transformer (either here or in sklearn-pandas) and tell Pandas users to always use pipelines and put this at the beginning
  6. Store column names in fit and raise a warning (or perhaps an error) in transform etc. if they don't match

Number 5 is by far the easiest, although it requires users to be aware, so it's coupled with 2. It looks like this (with documentation and an awareness campaign):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class PandasCheck(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # None when fitted on a plain array rather than a DataFrame
        self.columns_ = getattr(X, 'columns', None)
        return self

    def transform(self, X):
        check_is_fitted(self, 'columns_')
        if self.columns_ is not None and not np.all(self.columns_ == getattr(X, 'columns', None)):
            raise ValueError('Mismatched columns in transform. Expected %r, got %r'
                             % (self.columns_, getattr(X, 'columns', None)))
        return X
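
Used as suggested, at the head of a pipeline, it would look roughly like this (reusing the frames from the original report):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('check', PandasCheck()),   # fails fast on any column mismatch
    ('clf', LogisticRegression()),
])
pipe.fit(train[feature_columns_train], train['target'])
pipe.predict_proba(test[feature_columns_test])  # raises ValueError: mismatched columns
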
@jnothman (Member) commented Feb 1, 2018

I suppose, similarly, we could just add an option to Pipeline called "pandas_check" and avoid having the extra estimator...

@qdeffense (Contributor) commented Jan 8, 2019

I've found these functions useful:

import collections.abc
from typing import List

import pandas as pd
from sklearn.utils import check_array

def check_array_pandas(X: pd.DataFrame,
                       feature_names: List,
                       *args, **kwargs):
    ''' A wrapper for sklearn's check_array.

    Used to check and enforce column names if pandas DataFrames are used.

    Parameters
    ----------
    X : pd.DataFrame
        The input you want to check.
    feature_names : list of column names
        The column names you'd like to check and whose ordering you want
        to enforce. Pass None to skip the column check entirely.
    *args
        Positional arguments for check_array.
    **kwargs
        Keyword arguments for check_array.

    Returns
    -------
    X : array-like
        With columns ordered like feature_names and all other columns
        removed, then put through check_array.

    Raises
    ------
    KeyError
        If any of feature_names aren't in X.columns.
    TypeError
        If X is not a pd.DataFrame while feature_names is given, or if
        feature_names is not iterable.
    '''
    if feature_names is not None:
        if isinstance(X, pd.DataFrame):
            if isinstance(feature_names, collections.abc.Iterable):
                X = X[feature_names]
            else:
                raise TypeError('feature_names should be iterable.')
        else:
            raise TypeError('X should be a DataFrame for safety purposes.')

    return check_array(X, *args, **kwargs)

Do this in fit:

self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None

and this in __init__:

self.feature_names_ = None

Then just replace all calls to check_array with check_array_pandas.

Wouldn't that work?
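
Wired into an estimator, the suggestion would look roughly like this (SafeEstimator is a hypothetical sketch; a real implementation would follow sklearn's fit-attribute conventions more carefully):

import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression

class SafeEstimator(BaseEstimator):
    def __init__(self):
        self.feature_names_ = None

    def fit(self, X, y):
        # Remember the fit-time column names (None for plain arrays)
        self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None
        self.model_ = LogisticRegression()
        self.model_.fit(check_array_pandas(X, self.feature_names_), y)
        return self

    def predict(self, X):
        # Reorders (and type-checks) X against the fit-time names
        return self.model_.predict(check_array_pandas(X, self.feature_names_))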

@ryanpeach commented Jan 8, 2019

Please see the test cases in #12936

@qdeffense (Contributor) commented Jan 8, 2019

@jnothman what do you think?

@jnothman (Member) commented Jan 8, 2019

  • I suspect we should raise an error when the order is different from in fit, rather than reordering, in part to alert users who thought their code was working in previous scikit-learn versions but actually were mixing their features. We can always change it to reordering later.
  • feature_names_ is ambiguous. It needs a better name.
  • I am a little hesitant about putting self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None in each estimator. I would rather this be abstracted away into a function or method, e.g. set_X_signature(self, X), with check_array supporting X_signature to check integrity.
@jnothman (Member) commented Jan 8, 2019

We could potentially make setting (not just checking) the signature a job for check_array, e.g. check_array(X, set_signature=self)

Getting someone to implement these changes (preferably someone who knows how to edit files en masse) is the bigger challenge. In practice, where an estimator delegates checking of X to some other function, it's not going to be trivial to implement everywhere.

@ryanpeach commented Jan 9, 2019

It's difficult because not much in the codebase calls super() on BaseEstimator.

@ryanpeach commented Jan 9, 2019

I disagree on raising an error. DataFrames are philosophically different from arrays. People are likely passing dataframes in the wrong order at the moment, but they aren't wrong to do so; they just aren't aware that the sklearn package hasn't handled their edge case. Pandas DataFrame column ordering has always existed, but it's just not talked about, because anyone using pandas would never iloc a column when indexing: you index columns by name.

Maybe a warning as we transition, but I wouldn't recommend an error.


@ryanpeach commented Jan 11, 2019

I absolutely agree that a warning is a must. I agree with your reply 100%.

A compromise may be to issue an update that creates a warning ASAP, by collecting column names where appropriate and creating the architecture of the solution. Then in the next breaking update, implement automatic reordering.

@ryanpeach commented Jan 11, 2019

As for extra columns: in my own code I have implemented a lot of what we are talking about in the context of cross-validated feature selectors, such as RFECV. By accepting dataframes of any size, you allow these learners to always be correct. Users can give them just the selected columns at predict time, the full data they fit on, or any arbitrary dataframe. The goal in my mind is to enable agnostic use of estimators, because these estimators are going to be pickled and long forgotten in a repo somewhere. I don't want to use them mistakenly when I dig them out of the filing cabinet.
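
A sketch of that by-name selection at predict time (select_fitted_columns is a hypothetical helper; it assumes the estimator stored feature_names_ during fit):

def select_fitted_columns(estimator, X):
    """Pick exactly the fitted columns, in order, out of any wider frame."""
    return X[estimator.feature_names_]

# Works whether X holds just the selected features, the full training
# frame, or an arbitrary superset of columns:
# proba = estimator.predict_proba(select_fitted_columns(estimator, big_df))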


@SreejeshPM commented Jan 21, 2019

Is there any workaround for this problem, other than reordering the columns?

@aishwariya96 commented May 6, 2019

Hello,
I ran into the same issue and worked around it with:

df2 = df2[df1.columns]

where df2 is the DataFrame whose columns need to be ordered like df1's.
