Sequential Feature Selection for categorical features without one-hot encoding #502

Closed
polishtits opened this issue Feb 10, 2019 · 19 comments

Comments

@polishtits

Dr. Raschka,
Thank you for all the wonderful work! Truly amazing library!

I have a question regarding SFS and categorical features. Since such a feature expands into multiple columns after one-hot encoding, it makes intuitive sense to me that these encoded columns should always be selected together. SFS should not return, say, only 2 of the 3 encoded columns.
Would you please let me know what your opinion is on this matter?

@rasbt
Owner

rasbt commented Feb 10, 2019

I have a question regarding SFS and categorical features. Since such a feature expands into multiple columns after one-hot encoding, it makes intuitive sense to me that these encoded columns should always be selected together.

Yes, I agree with you. The problem is that this is more of a technical limitation (i.e., how can you encode the information of which columns belong together?).

Maybe the most general solution would be to perform the transformation from categorical to one-hot encoded features after the selection step. I.e., your classifier itself could be a scikit-learn pipeline where the first element is a ColumnTransformer that expands the categorical feature into a one-hot encoded feature set.

I think this should work, and maybe we could add an example to the documentation.
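For illustration, such a setup might look roughly like the sketch below (illustrative only; the column name 'categorical' and the choice of classifier are placeholders, and, as later comments in this thread show, SFS's internal conversion to NumPy arrays still gets in the way of selecting columns by name):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# The "classifier" handed to SFS is itself a pipeline whose first step
# expands the hypothetical 'categorical' column into one-hot columns.
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['categorical'])],
    remainder='passthrough')
pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

sfs = SFS(pipe, k_features=3, forward=True, scoring='accuracy', cv=5)
# sfs = sfs.fit(X_df, y)  # X_df would be a DataFrame containing a 'categorical' column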

@polishtits
Author

Thank you for your quick response Dr. Raschka.

One casual way to work around this issue, in case we really want to consider one-hot encoded features during SFS, is to simply label the columns that are generated after the transformation. I guess this process can be automated, since these columns correspond to the unique values within a categorical feature.

@rasbt
Owner

rasbt commented Feb 13, 2019

Yes, that would be one solution. However, the results would be different.

E.g., let's assume we have three features, A, B, and C, where A is a categorical feature with 3 possible values (0, 1, 2). Let's call the corresponding one-hot feature columns A_0, A_1, A_2.

If I select, say, 2 features on the original DataFrame, it could select A, B or A, C and so forth. On the one-hot encoded DataFrame the selection can be different, e.g., the outcome could be A_0, A_1.

So, instead of transforming A into a one-hot representation and doing the feature selection on it, a pipeline could be used. For example, the classifier for the feature selector could be a pipeline with the elements [onehot -> classifier]. This way, the feature selector would still consider the features as single units but would perform the one-hot encoding only temporarily. E.g., for one round of backward elimination (a code sketch of this loop follows the list):

  • For each feature to eliminate in {A, B, C}:
    • compute the performance for subset {A, B}; here, {A, B} is passed to the pipeline [onehot -> classifier] such that the temporary feature set is A_0, A_1, A_2, B
    • compute the performance for subset {B, C}; here, {B, C} is passed to the pipeline [onehot -> classifier] such that the temporary feature set is B, C
    • compute the performance for subset {A, C}; here, {A, C} is passed to the pipeline [onehot -> classifier] such that the temporary feature set is A_0, A_1, A_2, C
  • pick the best feature set from {A, B}, {A, C}, {B, C}
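A minimal sketch of this evaluation loop (illustrative only; the helper name and the use of cross_val_score are assumptions, and pipe stands for an [onehot -> classifier] pipeline that encodes whichever categorical columns it receives):

from itertools import combinations

from sklearn.model_selection import cross_val_score


def best_subset_after_one_removal(pipe, X_df, y, current_features, cv=5):
    # Evaluate every subset that drops exactly one of the current features.
    scored = []
    for subset in combinations(current_features, len(current_features) - 1):
        # Only the selected columns are passed to the pipeline; the pipeline
        # one-hot encodes any categorical columns among them on the fly,
        # so a feature like A is always kept or dropped as a whole.
        score = cross_val_score(pipe, X_df[list(subset)], y, cv=cv).mean()
        scored.append((score, subset))
    return max(scored)  # (best score, best feature subset)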

@polishtits
Author

I totally agree. My approach is way sloppier than yours.
Thank you for your suggestion!

@rasbt
Owner

rasbt commented Feb 13, 2019

I actually tried that the other day with a scikit-learn Pipeline, OneHotEncoder, and ColumnTransformer. The technical limitation with using NumPy arrays is that we can't rely on column indices because they may refer to different features when we extend / shrink the subsets. One solution would be to use column names via pandas DataFrames. However, here the limitation is that while the SFS currently accepts DataFrames, it internally converts them to numpy arrays -- so the column name advantage is lost.
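For example (a tiny illustration of the index problem, not mlxtend code):

import numpy as np

X = np.array([[1., 10., 100.],
              [2., 20., 200.]])   # columns 0, 1, 2 correspond to features A, B, C

X_subset = X[:, [0, 2]]           # an SFS-style subset keeping A and C
# In X_subset, feature C now lives at column index 1, so a transformer that
# was configured to one-hot encode "column 2" would point at the wrong
# (or a nonexistent) feature once the subset changes.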

However, with #506, this could potentially be addressed! :)

@rasbt
Owner

rasbt commented Feb 16, 2019

So, what I had in mind the other day was something like this

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


class GetDummies():
    """Minimal transformer that one-hot encodes the given columns via
    pd.get_dummies so it can serve as the first step of a pipeline."""

    def __init__(self, columns):
        self.columns = columns

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        # Expand the listed categorical columns into dummy (one-hot) columns
        return pd.get_dummies(pd.DataFrame(X), columns=self.columns)

    def fit(self, X, y=None):
        # Stateless; nothing to fit
        return self



iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)


y_series = pd.Series(y)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'categorical'])

X_df['categorical'] = y.astype(float)



######

from sklearn.pipeline import make_pipeline

get_dummies = GetDummies(['categorical'])
pipe = make_pipeline(get_dummies, knn)

sfs1 = SFS(pipe, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X_df, y_series)

Currently, this doesn't work, unfortunately, since SFS passes around NumPy arrays (and only keeps track of the pandas feature column names internally).

@polishtits
Author

polishtits commented Feb 16, 2019

Yes, I agree. Currently, SFS converts the whole feature matrix X, if it is a DataFrame object, into a NumPy array and treats the columns independently.
And I genuinely was not aware of pd.get_dummies. Quite neat indeed!
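For reference, a tiny illustrative example of what pd.get_dummies does:

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3.5, 1.2, 0.7]})
# Expands column 'A' into indicator columns A_0, A_1, A_2; 'B' is left untouched.
print(pd.get_dummies(df, columns=['A']))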

@rasbt changed the title from "Question: Sequential Feature Selection for Categorical features" to "Sequential Feature Selection for categorical features without one-hot encoding" on Feb 16, 2019
@mckennapep

Hi Dr. Raschka,

I am using feature selection on a dataset with categorical variables, and I came across this thread. I saw that you said this issue may be addressed with #506. Is this working now? The sample code you wrote above would be exactly what I need if I could refer to the features by column name. Have you figured out a way to work around it?

@rasbt
Owner

rasbt commented Dec 7, 2020

Hi there.

Unfortunately, I don't think #506 fully addressed this issue, so this is not supported yet.

@mengyujackson121

This limitation is really difficult to find out about: I had created a Pipeline with one-hot encoding and tried to use SequentialFeatureSelector on the whole pipeline, and the error messages were very unhelpful.

In the meantime, before this is fully supported, could the error messages be improved at all?

@rasbt
Owner

rasbt commented Apr 7, 2021

I would like to add more descriptive error messages. It's just hard to come up with a good rule here to catch errors related to the above-mentioned one-hot encoding approach.

Actually, thinking about this again, I believe the previously proposed pipeline approach works if we tweak it a little bit. The solution I proposed above fails because the column designated for one-hot encoding might not be present in a given subset due to feature selection; the transformer then attempts to transform a missing column and crashes.

I believe this can easily be fixed by (1) checking which columns are actually candidates for one-hot encoding via a set intersection, and then (2) encoding only those columns that are present in the current iteration.

I believe the following should work:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


class GetDummies():
    """Like the version above, but only encodes the categorical columns
    that are actually present in the current feature subset."""

    def __init__(self, columns):
        self.columns = set(columns)

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        df = pd.DataFrame(X)
        # Only encode the categorical columns present in this subset
        intersect = self.columns.intersection(df.columns)
        # Pass a list (not a set) so pandas can index the columns reliably
        return pd.get_dummies(df, columns=list(intersect))

    def fit(self, X, y=None):
        # Stateless; nothing to fit
        return self



iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)


y_series = pd.Series(y)
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
                                'sepal width', 'categorical'])

X_df['categorical'] = y.astype(float)



######

from sklearn.pipeline import make_pipeline

get_dummies = GetDummies(['categorical'])
pipe = make_pipeline(get_dummies, knn)

sfs1 = SFS(pipe, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X_df, y_series)

Please let me know if this solves your use case. If yes, I am happy to add it to the documentation.

@twbrandon7

twbrandon7 commented Apr 12, 2021

I would like to add more descriptive comments. It's just hard to come up with a good rule here to catch errors related to the above mentioned one-hot encoding approach.
......

Hi, I have tried the code above; however, it doesn't work in my environment. The versions of the packages I installed are as follows:

  • scikit-learn 0.24.1
  • mlxtend 0.18.0

I found that df.columns in the transform() method doesn't contain the original column names. Instead, df.columns is a RangeIndex object, so the dummy columns are not generated as expected.

class GetDummies():

    # ......

    def transform(self, X, y=None):
        df = pd.DataFrame(X)

        # type(df.columns) == pandas.core.indexes.range.RangeIndex
        intersect = self.columns.intersection(df.columns)

        return pd.get_dummies(pd.DataFrame(X), columns=intersect)

    # ......

In my case, the nominal attributes are of type string. Based on the scikit-learn tutorial, I rewrote the code to one-hot encode the nominal attributes.

import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector
from sklearn.neighbors import KNeighborsClassifier  # needed for the usage example below

class DfConverter():
    def __init__(self):
        super().__init__()

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        df = pd.DataFrame(X)

        # automatically determine the data type for each columns
        df = df.convert_dtypes()        

        return df

    def fit(self, X, y=None):
        return self

    
def get_pipeline(model_provider):
    df_converter = DfConverter()

    categorical_transformer = OneHotEncoder(handle_unknown='ignore')
    preprocessor = ColumnTransformer(transformers=[
        # one-hot encode the nominal (string) attributes
        ('dynamic_cat', categorical_transformer, make_column_selector(dtype_include="string"))
    ], remainder="passthrough")

    clf = Pipeline(steps=[
        ('df_converter', df_converter),
        ('preprocessor', preprocessor),
        ('classifier', model_provider())
    ])
    
    return clf

# usage

## loading data
df_train = pd.read_csv("./data/train.csv")
df_test = pd.read_csv("./data/test.csv")

selected_features = [ ...... ]
x_train = df_train[selected_features]
x_test = df_test[selected_features]
y_train = df_train["label"]
y_test = df_test["label"]

## training
knn = get_pipeline(lambda: KNeighborsClassifier(n_jobs=-1))
sfs1 = SequentialFeatureSelector(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=4)
sfs1 = sfs1.fit(x_train, y_train)

Hope this helps. 😀

@rasbt
Owner

rasbt commented Apr 12, 2021

Thanks for sharing! This would be another nice addition for the tutorials.

@NimaSarajpoor
Contributor

NimaSarajpoor commented Aug 18, 2022

@rasbt
So, I have been working on a dataset that has 190 features, and each feature itself is a time series (like 3D tabular data, where rows are samples, columns are features, and the depth is time). So, I converted each time series into 12 features (for some reasons, I had to do feature engineering). Now, I have $190 \times 12$ features. I am thinking of doing feature selection, but as group feature selection.

Can't we add a features_group parameter similar to what we did in the Exhaustive Feature Selector? It should also solve the one-hot encoding problem mentioned at the top of this issue.

So, for instance, if my groups are [[0, 1, 2], [3], [4, 5]], then my high_level_indices is [0, 1, 2]. So, I can iterate through high-level indices only. Does that work?
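A rough sketch of that idea (hypothetical names, not the mlxtend implementation; assumes features_group holds lists of column indices and X is a NumPy array):

from itertools import combinations

from sklearn.base import clone
from sklearn.model_selection import cross_val_score


def score_group_subsets(estimator, X, y, features_group, n_groups, cv=5):
    # Exhaustively score subsets of feature *groups*, expanding each chosen
    # group back into its member columns so a group is kept or dropped whole.
    results = {}
    high_level_indices = range(len(features_group))  # e.g., 0, 1, 2
    for combo in combinations(high_level_indices, n_groups):
        cols = [c for g in combo for c in features_group[g]]
        score = cross_val_score(clone(estimator), X[:, cols], y, cv=cv).mean()
        results[combo] = score
    return results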


We might say that if this parameter is provided, then non-float types are not supported (?!). So, the user should take care of the preprocessing in this case.


UPDATE:
I checked out a recent PR, and it seems @rasbt thinks this task might be doable (see PR #957).

@rasbt
Owner

rasbt commented Aug 18, 2022

Hey @NimaSarajpoor, you are right, the sequential feature selector could (/should) eventually have feature-group support similar to the exhaustive feature selector. It's something I was hoping to tackle some time this year when time permits (I was holding off on a new release until I get to this, because it would be nice to have a release that rolls this feature out for all three: sequential feature selector, exhaustive feature selector, and feature importance permutations). The code base of the sequential feature selector is a bit more complicated, though. Personally, I am also traveling for the next two weeks and will likely be on mobile only. If someone in this thread is interested in tackling this, that would be awesome, of course :)

@NimaSarajpoor
Contributor

NimaSarajpoor commented Aug 18, 2022

Cool. I can definitely work on this. I might be a little bit slow due to my current workload but I hope I can get it done in a reasonable time.

@rasbt
Owner

rasbt commented Aug 18, 2022

Thanks and no worries about the timeline at all! Currently so many things to catch up with 😅

@NimaSarajpoor
Contributor

@rasbt
You may want to close this.

@rasbt
Owner

rasbt commented Sep 15, 2022

Good call

@rasbt rasbt closed this as completed Sep 15, 2022