
Leshy works wrong with categorical features #19

Closed
Tialo opened this issue May 4, 2023 · 2 comments · Fixed by #20

Comments

@Tialo (Contributor) commented May 4, 2023

Hello, when I was using Leshy with a CatBoost estimator and a dataset that has categorical features, I noticed that all features of my dataset were considered categorical and passed to the cat_features parameter of CatBoost. This is caused by this line

If you have X = np.array([[1, 2, 'a'], [3, 4, 'b']]) and you pass it to pd.DataFrame, the dtype of every column becomes object. I propose using the original method for creating shadow features, which keeps the original dtypes of the columns:

import numpy as np
import pandas as pd


def _create_shadow(x_train):
    """
    Take all X variables, creating copies and randomly shuffling them
    :param x_train: the dataframe to create shadow features on
    :return: dataframe 2x width and the names of the shadows for removing later
    """
    x_shadow = x_train.copy()
    for c in x_shadow.columns:
        np.random.shuffle(x_shadow[c].values)
    # rename the shadow
    shadow_names = ["ShadowVar" + str(i + 1) for i in range(x_train.shape[1])]
    x_shadow.columns = shadow_names
    # Combine to make one new dataframe
    new_x = pd.concat([x_train, x_shadow], axis=1)
    return new_x, shadow_names
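A quick check of the behavior described above (a minimal sketch, assuming only numpy and pandas; column names are made up):

```python
import numpy as np
import pandas as pd

# Going through np.array first loses the per-column dtypes:
df_via_numpy = pd.DataFrame(np.array([[1, 2, 'a'], [3, 4, 'b']]))
print(df_via_numpy.dtypes)  # every column is object

# Building the DataFrame directly keeps them:
df = pd.DataFrame({'num1': [1, 3], 'num2': [2, 4], 'cat1': ['a', 'b']})
print(df.dtypes)  # int64, int64, object

# The copy-and-shuffle approach of _create_shadow also preserves them:
x_shadow = df.copy()
for c in x_shadow.columns:
    np.random.shuffle(x_shadow[c].values)
print(x_shadow.dtypes)  # int64, int64, object
```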
@ThomasBury (Owner)

Hi @Tialo,

NumPy does not have built-in support for heterogeneous data types, so it is not well suited for data with mixed types or non-numeric columns. It does provide a structured array data type, which can store arrays with different field types, but that still has limitations compared to pandas data frames. This is why CatBoost relies on lists, pd.DataFrame, or Pool.

using:

X = np.array([[1, 2, 'a'], [3, 4, 'b']])
X

returns:

array([['1', '2', 'a'],
       ['3', '4', 'b']], dtype='<U11')

which means that all the entries are cast to Unicode strings (dtype '<U11'), not kept as separate numeric and string columns.

If you want to use NumPy arrays with non-numerical columns rather than a pd.DataFrame, use structured arrays:

x = np.array([(1, 2, 'a'), (3, 4, 'b')],
             dtype=[('num1', 'i4'), ('num2', 'i4'), ('cat1', 'U10')])

returns

array([(1, 2, 'a'), (3, 4, 'b')],
      dtype=[('num1', '<i4'), ('num2', '<i4'), ('cat1', '<U10')])

and the conversion to a pd.DataFrame works as expected

x = pd.DataFrame(x)
x.dtypes

which returns

num1     int32
num2     int32
cat1    object
dtype: object

When working with heterogeneous data, prefer pandas over NumPy.

I hope it helps.

NB: I'm not sure I understood the original method you're referring to. Isn't it the same as the current definition of _create_shadow?

@Tialo (Contributor, Author) commented May 5, 2023

I tried to fit Leshy with a dataframe that has both numerical and categorical features. In your fit method this dataframe gets transformed into one where every feature is encoded as categorical. I will show you why this happens.

In this method you pass the dataframe into the np.nan_to_num function, which returns a numpy ndarray. As you said, since the original dataframe contains at least one categorical value, numpy will encode every value as a Unicode string, so X.dtype will be object.

def _fit(self, X_raw, y, sample_weight=None):

The function mentioned above:

X = np.nan_to_num(X)
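The dtype loss can be reproduced in isolation (a minimal sketch with made-up column names, independent of Leshy):

```python
import numpy as np
import pandas as pd

# A dataframe with one numeric and one categorical column:
df = pd.DataFrame({'num': [1.0, np.nan], 'cat': ['a', 'b']})

# np.nan_to_num converts the dataframe to a plain ndarray;
# the mixed columns collapse into a single object dtype.
X = np.nan_to_num(df)
print(type(X), X.dtype)
```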

Also, since X will always be a numpy ndarray (np.nan_to_num returns an ndarray), this if is always False.

if not isinstance(X, np.ndarray):

Then X is passed to this method:

cur_imp = self._add_shadows_get_imps(X, y, sample_weight, dec_reg)

where you take some columns, concatenate them with the shadow features, and pass the result to this function:

imp = _get_shap_imp(

model, X_tt, y_tt, w_tt = _split_fit_estimator(

Finally, you build a pandas DataFrame from a numpy ndarray whose dtype is object, so X.dtypes will be object for every column.

And after that, your function will encode all columns, because all of them are object:

def get_pandas_cat_codes(X):

After encoding, each column of the dataframe will be int64, i.e. ordinal-encoded features.
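The effect can be sketched without Leshy (a hypothetical reproduction; the exact implementation of get_pandas_cat_codes may differ):

```python
import pandas as pd

# An all-object frame, as produced by the np.nan_to_num round trip:
df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']],
                  columns=['num1', 'num2', 'cat1']).astype(object)

# Ordinal-encoding every object column replaces the numeric values
# with category codes, destroying the numeric columns too:
encoded = df.apply(lambda col: col.astype('category').cat.codes)
print(encoded)  # num1 becomes [0, 1] instead of [1, 3]
```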

model = estimator.fit(

And this leads to poor performance of CatBoost: even though it can handle categorical features itself, it is a bad idea to ordinal-encode the numerical features as well. I hope I have explained the problem; if not, feel free to ask anything!

Upd. Sorry for closing and reopening the issue, I misclicked.

@Tialo closed this as not planned on May 5, 2023
@Tialo reopened this on May 5, 2023