Sequential Feature Selection for categorical features without one-hot encoding #502

Closed
polishtits opened this issue Feb 10, 2019 · 19 comments

Comments

@polishtits

Dr. Raschka,
Thank you for all the wonderful work! Truly amazing library!

I have a question regarding SFS and categorical features. Since such a feature expands into multiple columns after one-hot encoding, it makes intuitive sense to me that these encoded columns should always be selected together. SFS should not return, say, only 2 of the 3 encoded columns.
Would you please let me know what your opinion is on this matter?

@rasbt
Owner

rasbt commented Feb 10, 2019

I have a question regarding SFS and categorical features. Since such a feature expands into multiple columns after one-hot encoding, it makes intuitive sense to me that these encoded columns should always be selected together.

Yes, I agree with you. The problem is that this is more of a technical limitation (i.e., how can you encode the information of which columns belong together?).

Maybe the most general solution would be to perform the transformation from categorical to one-hot encoded features after the selection step. I.e., your classifier itself could be a scikit-learn pipeline where the first element is a ColumnTransformer that expands the categorical feature into a one-hot encoded feature set.

I think this should work, and maybe we could add an example to the documentation.
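For illustration, such a setup might look roughly like the sketch below (illustrative only; the column name 'categorical' and the choice of classifier are placeholders, and, as later comments in this thread show, SFS's internal conversion to NumPy arrays still gets in the way of selecting columns by name):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# The "classifier" handed to SFS is itself a pipeline whose first step
# expands the hypothetical 'categorical' column into one-hot columns.
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['categorical'])],
    remainder='passthrough')
pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

sfs = SFS(pipe, k_features=3, forward=True, scoring='accuracy', cv=5)
# sfs = sfs.fit(X_df, y)  # X_df would be a DataFrame containing a 'categorical' column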

@polishtits
Author

Thank you for your quick response Dr. Raschka.

One casual way to work around this issue, in case we really want to consider one-hot encoded features during SFS, is to simply label the columns that are generated after the transformation. I guess this process can be automated, since these columns correspond to the unique values within a categorical feature.

@rasbt
Owner

rasbt commented Feb 13, 2019

Yes, that would be one solution. However, the results would be different.

E.g., let's assume we have three features, A, B, and C, where A is a categorical feature with 3 possible values (0, 1, 2). Let's call the corresponding one-hot feature columns A_0, A_1, A_2.

If I select, say, 2 features on the original DataFrame, it could select A, B or A, C and so forth. On the one-hot encoded DataFrame the selection can be different, e.g., the outcome could be A_0, A_1.

So, instead of transforming A into a one-hot representation and doing the feature selection on it, a pipeline could be used. For example, the classifier for the feature selector could be a pipeline with the elements [onehot -> classifier]. This way, the feature selector would still consider the features as single units but would perform the one-hot encoding only temporarily. E.g., for one round of backward elimination (a code sketch of this loop follows the list):

  • For each feature to eliminate in {A, B, C}:
    • compute the performance for subset {A, B}; here, {A, B} is passed to the pipeline [onehot -> classifier] such that the temporary feature set is A_0, A_1, A_2, B
    • compute the performance for subset {B, C}; here, {B, C} is passed to the pipeline [onehot -> classifier] such that the temporary feature set is B, C
    • compute the performance for subset {A, C}; here, {A, C} is passed to the pipeline [onehot -> classifier] such that the temporary feature set is A_0, A_1, A_2, C
  • pick the best feature set from {A, B}, {A, C}, {B, C}
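A minimal sketch of this evaluation loop (illustrative only; the helper name and the use of cross_val_score are assumptions, and pipe stands for an [onehot -> classifier] pipeline that encodes whichever categorical columns it receives):

from itertools import combinations

from sklearn.model_selection import cross_val_score


def best_subset_after_one_removal(pipe, X_df, y, current_features, cv=5):
    # Evaluate every subset that drops exactly one of the current features.
    scored = []
    for subset in combinations(current_features, len(current_features) - 1):
        # Only the selected columns are passed to the pipeline; the pipeline
        # one-hot encodes any categorical columns among them on the fly,
        # so a feature like A is always kept or dropped as a whole.
        score = cross_val_score(pipe, X_df[list(subset)], y, cv=cv).mean()
        scored.append((score, subset))
    return max(scored)  # (best score, best feature subset)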

@polishtits
Author

I totally agree. My approach is way sloppier than yours.
Thank you for your suggestion!

@rasbt
Owner

rasbt commented Feb 13, 2019

I actually tried that the other day with a scikit-learn Pipeline, OneHotEncoder, and ColumnTransformer. The technical limitation with using NumPy arrays is that we can't rely on column indices because they may refer to different features when we extend / shrink the subsets. One solution would be to use column names via pandas DataFrames. However, here the limitation is that while the SFS currently accepts DataFrames, it internally converts them to numpy arrays -- so the column name advantage is lost.
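For example (a tiny illustration of the index problem, not mlxtend code):

import numpy as np

X = np.array([[1., 10., 100.],
              [2., 20., 200.]])   # columns 0, 1, 2 correspond to features A, B, C

X_subset = X[:, [0, 2]]           # an SFS-style subset keeping A and C
# In X_subset, feature C now lives at column index 1, so a transformer that
# was configured to one-hot encode "column 2" would point at the wrong
# (or a nonexistent) feature once the subset changes.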

However, with #506, this could potentially be addressed! :)

@rasbt
Owner

rasbt commented Feb 16, 2019

So, what I had in mind the other day was something like this

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


class GetDummies():
    """Minimal transformer that one-hot encodes the given columns via
    pd.get_dummies so it can serve as the first step of a pipeline."""

    def __init__(self, columns):
        self.columns = columns

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        # Expand the listed categorical columns into dummy (one-hot) columns
        return pd.get_dummies(pd.DataFrame(X), columns=self.columns)

    def fit(self, X, y=None):
        # Stateless; nothing to fit
        return self



iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)


y_series = pd.Series(y)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'categorical'])

X_df['categorical'] = y.astype(float)



######

from sklearn.pipeline import make_pipeline

get_dummies = GetDummies(['categorical'])
pipe = make_pipeline(get_dummies, knn)

sfs1 = SFS(pipe, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X_df, y_series)

Currently, this doesn't work, unfortunately, since SFS passes around NumPy arrays (and only keeps track of the pandas feature column names internally).

@polishtits
Author

polishtits commented Feb 16, 2019

Yes, I agree. Currently, SFS converts the whole feature matrix X, if it is a DataFrame object, into a NumPy array and treats the columns independently.
And I genuinely was not aware of pd.get_dummies. Quite neat indeed!
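For reference, a tiny illustrative example of what pd.get_dummies does:

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3.5, 1.2, 0.7]})
# Expands column 'A' into indicator columns A_0, A_1, A_2; 'B' is left untouched.
print(pd.get_dummies(df, columns=['A']))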

@rasbt changed the title from "Question: Sequential Feature Selection for Categorical features" to "Sequential Feature Selection for categorical features without one-hot encoding" on Feb 16, 2019
@mckennapep

Hi Dr. Raschka,

I am using feature selection on a dataset with categorical variables, and I came across this thread. I saw that you said this issue may be addressed with #506. Is this working now? The sample code you wrote above would be exactly what I need if I could refer to the features by column name. Have you figured out a way to work around it?

@rasbt
Owner

rasbt commented Dec 7, 2020

Hi there.

Unfortunately, I don't think #506 fully addressed this issue, so this is not supported yet.

@mengyujackson121

This limitation is really difficult to find out about: I had created a Pipeline with one-hot encoding and tried to use SequentialFeatureSelector on the whole pipeline, and the error messages were very unhelpful.

In the meantime, before this is fully supported, could the error messages be improved at all?

@rasbt
Owner

rasbt commented Apr 7, 2021

I would like to add more descriptive error messages. It's just hard to come up with a good rule here to catch errors related to the above-mentioned one-hot encoding approach.

Actually, thinking about this again, I believe the previously proposed pipeline approach works if we tweak it a little bit. The solution I proposed above fails because the column designated for one-hot encoding might not be present in a given subset due to feature selection; the transformer then attempts to transform a missing column and crashes.

I believe this can easily be fixed by (1) checking which columns are actually candidates for one-hot encoding via a set intersection, and then (2) encoding only those columns that are present in the current iteration.

I believe the following should work:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


class GetDummies():
    """Like the version above, but only encodes the categorical columns
    that are actually present in the current feature subset."""

    def __init__(self, columns):
        self.columns = set(columns)

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        df = pd.DataFrame(X)
        # Only encode the categorical columns present in this subset
        intersect = self.columns.intersection(df.columns)
        # Pass a list (not a set) so pandas can index the columns reliably
        return pd.get_dummies(df, columns=list(intersect))

    def fit(self, X, y=None):
        # Stateless; nothing to fit
        return self



iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)


y_series = pd.Series(y)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'categorical'])

X_df['categorical'] = y.astype(float)



######

from sklearn.pipeline import make_pipeline

get_dummies = GetDummies(['categorical'])
pipe = make_pipeline(get_dummies, knn)

sfs1 = SFS(pipe, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X_df, y_series)

Please let me know if this solves your use case. If yes, I am happy to add it to the documentation.

@twbrandon7

twbrandon7 commented Apr 12, 2021

I would like to add more descriptive comments. It's just hard to come up with a good rule here to catch errors related to the above mentioned one-hot encoding approach.
......

Hi, I have tried the code above; however, it doesn't work in my environment. The versions of the packages I installed are as follows:

  • scikit-learn 0.24.1
  • mlxtend 0.18.0

I found that df.columns in the transform() method doesn't contain the original column names. Instead, df.columns is a RangeIndex object, so the dummy columns are not generated as expected.

class GetDummies():

    # ......

    def transform(self, X, y=None):
        df = pd.DataFrame(X)

        # type(df.columns) == pandas.core.indexes.range.RangeIndex
        intersect = self.columns.intersection(df.columns)

        return pd.get_dummies(pd.DataFrame(X), columns=intersect)

    # ......

In my case, the nominal attributes are of type string. Based on the scikit-learn tutorial, I rewrote the code to one-hot encode the nominal attributes.

import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector
from sklearn.neighbors import KNeighborsClassifier  # needed for the usage example below

class DfConverter():
    def __init__(self):
        super().__init__()

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        df = pd.DataFrame(X)

        # automatically determine the data type for each columns
        df = df.convert_dtypes()        

        return df

    def fit(self, X, y=None):
        return self

    
def get_pipeline(model_provider):
    df_converter = DfConverter()

    categorical_transformer = OneHotEncoder(handle_unknown='ignore')
    preprocessor = ColumnTransformer(transformers=[
        # one-hot encode the nominal (string) attributes
        ('dynamic_cat', categorical_transformer, make_column_selector(dtype_include="string"))
    ], remainder="passthrough")

    clf = Pipeline(steps=[
        ('df_converter', df_converter),
        ('preprocessor', preprocessor),
        ('classifier', model_provider())
    ])
    
    return clf

# usage

## loading data
df_train = pd.read_csv("./data/train.csv")
df_test = pd.read_csv("./data/test.csv")

selected_features = [ ...... ]
x_train = df_train[selected_features]
x_test = df_test[selected_features]
y_train = df_train["label"]
y_test = df_test["label"]

## training
knn = get_pipeline(lambda: KNeighborsClassifier(n_jobs=-1))
sfs1 = SequentialFeatureSelector(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=4)
sfs1 = sfs1.fit(x_train, y_train)

Hope this helps. 😀

@rasbt
Owner

rasbt commented Apr 12, 2021

Thanks for sharing! This would be another nice addition for the tutorials.

@NimaSarajpoor
Contributor

NimaSarajpoor commented Aug 18, 2022

@rasbt
So, I have been working on a dataset that has 190 features, and each feature itself is a time series (like 3D tabular data, where rows are samples, columns are features, and the depth is time). So, I converted each time series into 12 features (for some reasons, I had to do feature engineering). Now, I have $190 \times 12$ features. I am thinking of doing feature selection, but as group feature selection.

Can't we add a features_group parameter similar to what we did in the Exhaustive Feature Selector? It should also solve the one-hot encoding problem mentioned at the top of this issue.

So, for instance, if my groups are [[0, 1, 2], [3], [4, 5]], then my high_level_indices is [0, 1, 2]. So, I can iterate through high-level indices only. Does that work?
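A rough sketch of that idea (hypothetical names, not the mlxtend implementation; assumes features_group holds lists of column indices and X is a NumPy array):

from itertools import combinations

from sklearn.base import clone
from sklearn.model_selection import cross_val_score


def score_group_subsets(estimator, X, y, features_group, n_groups, cv=5):
    # Exhaustively score subsets of feature *groups*, expanding each chosen
    # group back into its member columns so a group is kept or dropped whole.
    results = {}
    high_level_indices = range(len(features_group))  # e.g., 0, 1, 2
    for combo in combinations(high_level_indices, n_groups):
        cols = [c for g in combo for c in features_group[g]]
        score = cross_val_score(clone(estimator), X[:, cols], y, cv=cv).mean()
        results[combo] = score
    return results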


We might say that if this parameter is provided, then non-float types are not supported (?!). So, the user should take care of the preprocessing in this case.


UPDATE:
I checked out a recent PR, and it seems @rasbt thinks this task might be doable (see PR #957).

@rasbt
Owner

rasbt commented Aug 18, 2022

Hey @NimaSarajpoor, you are right, the sequential feature selector could (/should) eventually have feature-group support similar to the exhaustive feature selector. It's something I was hoping to tackle some time this year when time permits (I was holding off on a new release until I get to this, because it would be nice to have a release that rolls this feature out for all three: sequential feature selector, exhaustive feature selector, and feature importance permutations). The code base of the sequential feature selector is a bit more complicated, though. Personally, I am also traveling for the next two weeks and will likely be on mobile only. If someone in this thread is interested in tackling this, that would be awesome, of course :)

@NimaSarajpoor
Contributor

NimaSarajpoor commented Aug 18, 2022

Cool. I can definitely work on this. I might be a little bit slow due to my current workload but I hope I can get it done in a reasonable time.

@rasbt
Owner

rasbt commented Aug 18, 2022

Thanks and no worries about the timeline at all! Currently so many things to catch up with 😅

@NimaSarajpoor
Contributor

@rasbt
You may want to close this.

@rasbt
Owner

rasbt commented Sep 15, 2022

Good call

@rasbt rasbt closed this as completed Sep 15, 2022