Potential error caused by different column order #7242

Open
SauceCat opened this issue Aug 25, 2016 · 45 comments

Comments

@SauceCat commented Aug 25, 2016

Description

Sometimes it is convenient to first build a model on one dataset, save it as a .pkl file, and later apply the model to a new dataset. However, in a recent project, my friends and I found that the results turned quite weird after applying the .pkl file to the new dataset. We had implemented a binary classifier, and the predicted probability distribution changed from unimodal to bimodal. Eventually we discovered that the column order of the new dataset differed from that of the old one, so the predictions were completely wrong.
I checked the source code and found that sklearn's fit function does not save the column names during model training, so there is no way to check whether the columns are consistent at prediction time. We think it would be better if the column names were saved during training and then used to validate the columns passed to predict.

Steps/Code to Reproduce

#for simplification, consider a very simple case
from sklearn.datasets import load_iris
import pandas as pd

#make a dataframe
iris = load_iris()
X, y = iris.data[:-1,:], iris.target[:-1]
iris_pd = pd.DataFrame(X)
iris_pd.columns = iris.feature_names
iris_pd['target'] = y

from sklearn.cross_validation import train_test_split
train, test = train_test_split(iris_pd, test_size= 0.3)

feature_columns_train = ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
feature_columns_test = ['sepal length (cm)','sepal width (cm)','petal width (cm)','petal length (cm)']

from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
lg.fit(train[feature_columns_train], train['target'])

prob1 = lg.predict_proba(test[feature_columns_train])
prob2 = lg.predict_proba(test[feature_columns_test])

Expected Results

Because feature_columns_test is ordered differently from feature_columns_train, it is not surprising that prob1 is totally different from prob2; prob1 is the correct result.

prob1[:5] = 
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
         [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
         [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
         [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
         [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Actual Results

prob2[:5] = 
array([[  4.36321678e-01,   2.25057553e-04,   5.63453265e-01],
         [  4.92513658e-01,   1.76391882e-05,   5.07468703e-01],
         [  9.92946715e-01,   7.05167151e-03,   1.61346947e-06],
         [  9.83726756e-01,   1.62387090e-02,   3.45348884e-05],
         [  5.01392274e-01,   5.37144591e-04,   4.98070581e-01]])

Versions

Linux-2.6.32-642.1.1.el6.x86_64-x86_64-with-redhat-6.7-Santiago
('Python', '2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Dec  6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.1')
('SciPy', '0.16.0')
('Scikit-Learn', '0.17')

A possible solution

I also implemented a very simple solution. Hope this helps. :)

class SafeLogisticRegression(LogisticRegression):
    def fit(self, X, y, sample_weight=None):
        self.columns = X.columns
        LogisticRegression.fit(self, X, y, sample_weight=sample_weight)
    def predict_proba(self, X):
        new_columns = list(X.columns)
        old_columns = list(self.columns)
        if new_columns != old_columns:
            if len(new_columns) == len(old_columns):
                try:
                    X = X[old_columns]
                    print "The order of columns has changed. Fixed."
                except:
                    raise ValueError('The columns has changed. Please check.')
            else:
                raise ValueError('The number of columns has changed.')
        return LogisticRegression.predict_proba(self, X)

Then apply this new class:

slg = SafeLogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
slg.fit(train[feature_columns_train], train['target']) 

Test one: if the column order is changed

prob1 = slg.predict_proba(test[feature_columns_train])
prob2 = slg.predict_proba(test[feature_columns_test])

#The order of columns has changed. Fixed.

Result for test one:

prob1[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

prob2[:5] =
array([[  3.89507414e-04,   3.20099743e-01,   6.79510750e-01],
       [  4.63256526e-04,   4.65385156e-01,   5.34151587e-01],
       [  8.79704318e-01,   1.20295572e-01,   1.10268420e-07],
       [  7.80611983e-01,   2.19385827e-01,   2.19046022e-06],
       [  2.78898454e-02,   7.77243988e-01,   1.94866167e-01]])

Test two: if the column names are different

Simulate by changing one of the column names

prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-84cea68536fe> in <module>()
----> 1 prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     12                     print "The order of columns has changed. Fixed."
     13                 except:
---> 14                     raise ValueError('The columns has changed. Please check.')
     15             else:
     16                 raise ValueError('The number of columns has changed.')

ValueError: The columns has changed. Please check.

Test three: if the number of columns changes

Simulate by dropping one column

prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-47c63ae1ac22> in <module>()
----> 1 prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))

<ipython-input-21-c3000b030a21> in predict_proba(self, X)
     14                     raise ValueError('The columns has changed. Please check.')
     15             else:
---> 16                 raise ValueError('The number of columns has changed.')
     17         return LogisticRegression.predict_proba(self, X)

ValueError: The number of columns has changed.
@amueller (Member) commented Aug 25, 2016

Scikit-learn currently has no concept of column labels. I agree this would be a nice-to-have feature, but don't hold your breath for it. Storing the column names at all would be an important first step, which I hope to achieve soonish. This will not be available in 0.18, and the consistency check probably not in 0.19 either. Scikit-learn operates on numpy arrays, which don't (currently) have column names. This might change in the future, and we might add better integration with pandas.

@amueller added this to the 1.0 milestone Aug 25, 2016
@SauceCat (Author) commented Aug 26, 2016

Thanks for replying! Looking forward to the new version! 💯

@pjcpjc commented Aug 7, 2017

Thanks for the edifying conversation guys!

@jnothman (Member) commented Aug 8, 2017

I think a wrapper to ensure column order is maintained between fit and predict/transform would be a good idea.
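For reference, a minimal sketch of what such a wrapper could look like (illustration only, not part of scikit-learn; the class name and details are made up): it records the DataFrame's column names at fit time and restores that order before predict/transform.

import pandas as pd
from sklearn.base import BaseEstimator

class ColumnOrderWrapper(BaseEstimator):
    """Hypothetical meta-estimator: remembers the fit-time column order
    of a DataFrame and re-applies it before predict/transform."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        # remember column names only when a DataFrame is given
        self.columns_ = list(X.columns) if isinstance(X, pd.DataFrame) else None
        self.estimator.fit(X, y, **fit_params)
        return self

    def _reorder(self, X):
        if self.columns_ is not None and isinstance(X, pd.DataFrame):
            if set(X.columns) != set(self.columns_):
                raise ValueError("Columns at predict time differ from fit time.")
            return X[self.columns_]  # restore the fit-time order
        return X

    def predict(self, X):
        return self.estimator.predict(self._reorder(X))

    def predict_proba(self, X):
        return self.estimator.predict_proba(self._reorder(X))

    def transform(self, X):
        return self.estimator.transform(self._reorder(X))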

@pjcpjc commented Aug 8, 2017

I would guess the way to do it would be to respect column names if they are present for both fit and predict, and to respect column ordering if they are missing in either (or both). Feature selection would also need to abide by this. I'm sure it's not as easy as it might first appear.

@jnothman (Member) commented Aug 8, 2017

@pjcpjc commented Aug 11, 2017

In my opinion this is an sklearn issue, not a pandas issue. As I said, not complaining, and not expecting it to be terribly easy.

@amueller (Member) commented Aug 11, 2017

I would like to solve this inside scikit-learn, but I don't think there is consensus to do that (yet).
I would argue that we should protect our users against silent wrong results; other voices might say that users shouldn't pass dataframes?

@pjcpjc commented Aug 11, 2017

other voices might say that users shouldn't pass dataframes?

You could print a warning that says "Beware when using DataFrames. sklearn uses column ordering, not column labels. Downcast to numpy arrays to avoid this warning".

That would have saved my bacon here.

Pandas prints out a warning with the view vs. copy "gotcha", and that was enough to motivate me to study and pre-empt any problem. This issue seems similar. http://bit.ly/2qR1fDH , http://bit.ly/2pRbR7Q

@amueller (Member) commented Aug 11, 2017

@pjcpjc A warning is an option, but this would be very, very loud; many people are passing dataframes.
I guess it depends a bit on whether we want them to pass dataframes or not. One problem is that if we warn, that basically tells users to use .values everywhere to convert to numpy arrays, so they don't get a warning. But then this error becomes silent again, so we gained nothing. We just made users change their code in a way that means we can never hope to catch the error.

Also, I think having the estimator know the feature names is helpful for model inspection.

@pjcpjc commented Aug 11, 2017

I'd prefer for sklearn to respect column names if they are present for both fit and predict, and to respect column ordering if they are missing in either (or both). But I also think a warning would be better than how it works right now, where it's easy to get a false sense of security and end up with buggy code, which was my experience with this issue.

You can see my bug fix here. The original version didn't have cell 18. A warning would have flagged me that cell 18 was needed. The additional code changes to pass numpy arrays would be minor, if needed.

Just one opinion.

@amueller (Member) commented Aug 11, 2017

I'm not sure you got the problem with the warning. If there's always a warning, people will change their code so they don't get a warning, which means passing in df.values, i.e. a numpy array without column names. Then later they change their code and the columns mismatch again, but they get no warning because they only passed a numpy array. So in the end, adding the warning didn't save them at all, it just created more work.

I totally agree that we should do better than we're doing now, I just think a warning whenever a dataframe is passed will not help.

@pjcpjc commented Aug 11, 2017

I think I understand your point. My point is that if the warning links to a "see here for details" documentation page, and that page takes great pains to point out that you need to manage the column positions outside of sklearn, then my guess is it will save quite a few people.

I also suspect that the people who ignore the warning, and/or don't take it seriously and just do the simplest possible thing to make the warning go away, are probably un-saveable in any event. They will find some other way to muck up their data science practice.

That said - agree to disagree. It would have helped me - but it certainly is possible I'm too unusual to be of guidance.

@amueller (Member) commented Aug 11, 2017

Ok, I see where you're coming from. It might be a good way to make people aware that this issue potentially exists at all. I still think it's a bit too loud and unspecific, though it might indeed save some people. I hope for a near-ish term better solution ;)

@jnothman (Member) commented Aug 13, 2017

I think warning for all pandas DataFrame input is overkill. Lots of people are loading DataFrames from static sources (as we recommend) and not reordering columns.

The options, as far as I'm concerned, are:

  • add some documentation on Scikit-learn with Pandas and pitfalls therein.
  • add a common test which ensures column name equality, otherwise raising a warning (and perhaps in a later version an error) when mismatched.

Obviously the latter is a substantial effort, and adds O(n_features) space to the pickle representation of an estimator unless we use a hash, but ... dataframes are important to support well.
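To make the hashing idea concrete (purely a sketch, not an existing scikit-learn helper; the names here are made up): a fit-time hash of the column names costs constant space in the pickle and still detects renamed or reordered columns at predict time.

import hashlib
import pandas as pd

def column_hash(X):
    """Hash of the column names (hypothetical helper, constant-size footprint)."""
    if not isinstance(X, pd.DataFrame):
        return None
    joined = "\x00".join(map(str, X.columns))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# At fit time the estimator would store the hash...
#     self._column_hash = column_hash(X)
# ...and at predict/transform time compare it:
#     if column_hash(X) != self._column_hash:
#         raise ValueError("Column names/order differ from those seen in fit.")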

@jnothman (Member) commented Aug 13, 2017

If we do this, I'd suggest we extend check_array, at least to do the checking, and perhaps even to help store the estimator attribute. In terms of syntax, this could be as magical as check_X_y(X, y, ..., estimator=self) in fit and check_array(X, ..., estimator=self) in predict/transform, but maybe we'd prefer check_X_y(X, y, ..., store_column_hash=self) and check_array(X, ..., column_hash=self._column_hash). (We could make it even more magical with decorators or a metaclass, but we don't do that 'round here.)

There will be cases where this is hard to do because check_array is called by a function called by the estimator, rather than directly within it. We'd also need to do the same for warm_start and partial_fit.

In short, this is possible, but it'll be a big project and will touch everything.

@amueller (Member) commented Aug 13, 2017

I would rather not store hashes, because this information is very valuable for model inspection.
I'm not sure I like passing self to check_array, I guess the only alternative would be to return column_names and then store it, and pass column_names=self._column_names in predict / transform.
(if we do this, we probably want to really look into get_feature_names and the pipeline processing of the same, because it'd be AWESOME!)

@jnothman (Member) commented Aug 13, 2017

@jnothman added this to Design/analysis stage in API and interoperability Aug 14, 2017
@thebeancounter commented Jan 4, 2018

So, for now, if I fit on one DataFrame and then want to transform another, what is the proper way to ensure the proper column order?
How can I guarantee that a pandas DataFrame's columns are in a specific order?

@jnothman (Member) commented Jan 5, 2018

@jnothman (Member) commented Jan 11, 2018

Currently, estimators have ad-hoc checks that n_features match from fit to predict.

Maybe we should implement something like this in BaseEstimator (for use after check_array):

def _fit_consistency(self, X):
    self.__n_features = X.shape[1]
    if isinstance(X, pd.DataFrame):
        self.__columns = X.columns

def _check_consistency(self, X):
    if self.__n_features != X.shape[1]: raise ...
    if isinstance(X, pd.DataFrame):
        # note the name mangling: self.__columns is stored as _BaseEstimator__columns
        if not hasattr(self, '_BaseEstimator__columns'): warn ...
        elif not self.__columns.equals(X.columns): raise ??
    elif hasattr(self, '_BaseEstimator__columns'): warn ...

@kienerj commented Feb 1, 2018

I just wanted to add that this feature is extremely important out in the real world, where the model-building process and the prediction process (exposed to users) are often completely separate.

The prediction process should not need to know about feature selection, feature order and so forth. It should just take all the features for the specific problem, and the estimator should know which ones it needs and in what order. This is how it works in other tools; it's extremely convenient and, in fact, the reason I prefer such a tool over sklearn. Sklearn, however, offers more estimator types and better performance (both speed and prediction quality, even for the same model types), so it would be great to have this here too.

I would imagine this happening either through pandas column names or, if you pass in numpy arrays, through an additional optional column_names parameter on fit and predict (not that it's great to have to create this list). If no column names are present, it works like it does now.
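As a rough sketch of that idea (the column_names parameter does not exist in scikit-learn; this helper and its names are hypothetical): with plain numpy arrays, explicit name lists would be enough to restore the fit-time order.

def reorder_columns(X, column_names, fitted_names):
    # Reorder the columns of a 2-D numpy array so they follow fitted_names
    # (hypothetical helper illustrating an optional column_names parameter).
    if sorted(column_names) != sorted(fitted_names):
        raise ValueError("column_names do not match the names seen during fit")
    order = [column_names.index(name) for name in fitted_names]
    return X[:, order]

# e.g. inside predict: X = reorder_columns(X, column_names, self._fitted_names)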

@qdeffense (Contributor) commented Jan 8, 2019

@jnothman what do you think?

@jnothman (Member) commented Jan 8, 2019

  • I suspect we should raise an error when the order is different from that in fit, rather than reordering, in part to alert users who thought their code was working in previous scikit-learn versions but were actually mixing up their features. We can always change it to reordering later.
  • feature_names_ is ambiguous. It needs a better name.
  • I am a little hesitant about putting self.feature_names_ = list(X.columns) if isinstance(X, pd.DataFrame) else None in each estimator. I would rather this be abstracted away into a function or method, e.g. set_X_signature(self, X), with check_array supporting X_signature to check integrity.

@jnothman (Member) commented Jan 8, 2019

We could potentially make setting (not just checking) the signature a job for check_array, e.g. check_array(X, set_signature=self)

Getting someone to implement these changes (preferably someone who knows how to edit files en masse) is the bigger challenge. In practice, where an estimator delegates checking of X to some other function, it's not going to be trivial to implement everywhere.

@ryanpeach commented Jan 9, 2019

It's difficult because estimators in the codebase don't consistently call into BaseEstimator via super().

@ryanpeach commented Jan 9, 2019

I disagree on raising an error. DataFrames are philosophically different from arrays. People are likely passing DataFrames in the wrong order at the moment, but they aren't wrong to do so; they just aren't aware that the sklearn package hasn't handled their edge case. Pandas DataFrame column ordering has always existed, but it's just not talked about, because anyone using pandas would never iloc a column when indexing... you index columns by name.

Maybe a warning as we transition, but I wouldn't recommend an error.

@jnothman (Member) commented Jan 9, 2019

@ryanpeach commented Jan 11, 2019

I absolutely agree that a warning is a must. I agree with your reply 100%.

A compromise may be to issue an update that creates a warning ASAP, by collecting column names where appropriate and creating the architecture of the solution. Then in the next breaking update, implement automatic reordering.

@ryanpeach commented Jan 11, 2019

As for extra columns, in my own code I have implemented a lot of what we are talking about, in the context of cross-validated feature selectors such as RFECV. By accepting dataframes of any size, you allow these learners to always be correct: at predict time the user can give them just the selected columns, the data they were fit on, or any arbitrary dataframe. The goal in my mind is to enable agnostic use of estimators, because these estimators are going to be pickled and long forgotten in a repo somewhere; I don't want to use them mistakenly when I dig them out of the filing cabinet.

@jnothman (Member) commented Jan 13, 2019

@SreejeshPM commented Jan 21, 2019

Is there any workaround for this problem, other than reordering the columns?

@aishwariya96 commented May 6, 2019

Hello,
I ran into the same issue and tried this:

df2 = df2[df1.columns]

where df2 is the DataFrame whose columns need to be reordered to match df1.
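A slightly more defensive variant of the same workaround (a sketch; train_columns stands for whatever column list was saved at fit time, e.g. train_columns = list(df1.columns)):

# train_columns = list(df1.columns), saved when the model was fit
missing = set(train_columns) - set(df2.columns)
if missing:
    raise ValueError("Columns missing at predict time: %s" % sorted(missing))
df2 = df2[train_columns]  # restores the fit-time order before calling predict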

@amueller (Member) commented May 27, 2020

Hm, we still don't have a SLEP for this, right? I think we should try doing one. @NicolasHug did you want to do that?
This is related to feature names, but actually the current feature name ideas might not solve this issue.
@adrinjalali do you have thoughts on whether we want to do this before doing the feature names?

@thomasjpfan (Member) commented May 27, 2020

I think the options are mostly listed in #7242 (comment)

@amueller I know you do not like this, but I tend to prefer PandasCheck,
which also enables #12627 (comment). This approach only does the check at the beginning. The major argument against storing is "pandas users need to use a pipeline to have column name consistency".

The alternative is to store the names everywhere. I can't get over the fact that to have this consistency, users would need to pay a price in memory. Do we want users to buy more RAM to get column name consistency?

@kienerj commented May 27, 2020

I'm with @ryanpeach

By accepting dataframes of any size, you allow these learners to always be correct.

and with my initial comment on this issue. This is from a naive end-user point of view, not knowing the fundamental architecture of sklearn and just being concerned with making it easy to get correct predictions.

Do we want users to buy more ram to get column name consistency?

How much RAM does storing even thousands of column names really use up? Even if you go all-in and have thousands of very long column names, it adds up to a couple hundred MB. As long as it is optional, I don't see the issue given the huge convenience gain; getting 64 GB of RAM doesn't cost all that much compared to developer time or, even worse, wrong results.

@amueller (Member) commented May 27, 2020

I want 6). I think 1, 2 and 3 don't solve the issue. 4 and 5 solve the issue only if the user is aware of it and actively takes steps to prevent a problem from occurring; they sort of address it, but not the core issue of user expectations and silent errors. Also, they are really verbose.

I guess in that I disagree with @ryanpeach in #7242 (comment) but his solution is not even on the list ;)

However, if the first step is a ColumnTransformer, then we actually do the "right" thing in that we rearrange the columns - which makes me think: is inserting ColumnTransformer(remainder='passthrough') at the beginning of a pipeline actually implementing the feature @ryanpeach wants?
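For what it's worth, a sketch of that idea: a ColumnTransformer that selects the training columns by name at the start of a pipeline makes the rest of the pipeline insensitive to the incoming column order, since string-specified columns are looked up by name in a DataFrame. feature_columns below is just the list from the original report.

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

feature_columns = ['sepal length (cm)', 'sepal width (cm)',
                   'petal length (cm)', 'petal width (cm)']

pipe = Pipeline([
    # columns given as strings are selected by name from a DataFrame,
    # so a reordered frame at predict time is pulled back into this order
    ('select', ColumnTransformer([('features', 'passthrough', feature_columns)])),
    ('clf', LogisticRegression()),
])
# pipe.fit(train[feature_columns], train['target'])
# pipe.predict_proba(test[shuffled_columns])  # same output as with the fit-time order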

@adrinjalali (Member) commented May 27, 2020

@adrinjalali do you have thoughts on whether we want to do this before doing the feature names?

I think this is independent of output_feature_names_, and I'd be in favor of the same kind of validation we have in ColumnTransformer everywhere. I think implementing input_feature_names_ only if the given input has feature names kind of makes sense, and it can be implemented even if it's not propagated through the pipeline.

This should also be really easy to implement now that we have @NicolasHug 's work on _validate_data merged, and I don't think we need a SLEP for it, for the same reason that we didn't need a SLEP to implement the validation in ColumnTransformer.

@amueller (Member) commented May 28, 2020

@adrinjalali Cool, I totally agree (and encouraged @NicolasHug to draft a PR) ;)
Would love to hear @jnothman's thoughts as well

@amueller (Member) commented Jun 3, 2020

Also see #17407

@adrinjalali (Member) commented Aug 22, 2021

Now that we validate column names, this is partially fixed if the user uses a ColumnTransformer, right?

Removing the milestone.
