
Leshy works wrong with categorical features #19

Closed
Tialo opened this issue May 4, 2023 · 2 comments · Fixed by #20

Comments

@Tialo (Contributor) commented May 4, 2023

Hello, when I was using Leshy with a CatBoost estimator and a dataset that has categorical features, I noticed that all features of my dataset were considered categorical and passed to the cat_features parameter of CatBoost. This is caused by this line

If you have X = np.array([[1, 2, 'a'], [3, 4, 'b']]) and you pass it to pd.DataFrame, the dtype of every column becomes object. I propose using the original method for creating shadow features, which keeps the original dtypes of the columns:

import numpy as np
import pandas as pd


def _create_shadow(x_train):
    """
    Take all X variables, creating copies and randomly shuffling them
    :param x_train: the dataframe to create shadow features on
    :return: dataframe 2x width and the names of the shadows for removing later
    """
    x_shadow = x_train.copy()
    for c in x_shadow.columns:
        np.random.shuffle(x_shadow[c].values)
    # rename the shadow
    shadow_names = ["ShadowVar" + str(i + 1) for i in range(x_train.shape[1])]
    x_shadow.columns = shadow_names
    # Combine to make one new dataframe
    new_x = pd.concat([x_train, x_shadow], axis=1)
    return new_x, shadow_names
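A quick check of the behavior described above (a minimal sketch, assuming only numpy and pandas; column names are made up):

```python
import numpy as np
import pandas as pd

# Going through np.array first loses the per-column dtypes:
df_via_numpy = pd.DataFrame(np.array([[1, 2, 'a'], [3, 4, 'b']]))
print(df_via_numpy.dtypes)  # every column is object

# Building the DataFrame directly keeps them:
df = pd.DataFrame({'num1': [1, 3], 'num2': [2, 4], 'cat1': ['a', 'b']})
print(df.dtypes)  # int64, int64, object

# The copy-and-shuffle approach of _create_shadow also preserves them:
x_shadow = df.copy()
for c in x_shadow.columns:
    np.random.shuffle(x_shadow[c].values)
print(x_shadow.dtypes)  # int64, int64, object
```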
@ThomasBury (Owner)

Hi @Tialo,

NumPy does not have built-in support for heterogeneous data types, so it is not well suited for data with mixed types or non-numeric columns. It does provide a structured array data type, which can store arrays with different field types, but that still has limitations compared to pandas data frames. This is why CatBoost relies on lists, pd.DataFrame, or Pool.

using:

X = np.array([[1, 2, 'a'], [3, 4, 'b']])
X

returns:

array([['1', '2', 'a'],
       ['3', '4', 'b']], dtype='<U11')

which means that all the entries are cast to Unicode strings (dtype '<U11'), not kept as separate numeric and string columns.

If you want to use NumPy arrays with non-numerical columns rather than a pd.DataFrame, use structured arrays:

x = np.array([(1, 2, 'a'), (3, 4, 'b')],
             dtype=[('num1', 'i4'), ('num2', 'i4'), ('cat1', 'U10')])

returns

array([(1, 2, 'a'), (3, 4, 'b')],
      dtype=[('num1', '<i4'), ('num2', '<i4'), ('cat1', '<U10')])

and the conversion to a pd.DataFrame works as expected

x = pd.DataFrame(x)
x.dtypes

which returns

num1     int32
num2     int32
cat1    object
dtype: object

When working with heterogeneous data, prefer pandas over NumPy.

I hope it helps.

NB: I'm not sure I understood the original method you're referring to. Isn't it the same as the current definition of _create_shadow?

@Tialo (Contributor, Author) commented May 5, 2023

I tried to fit Leshy with a dataframe that has both numerical and categorical features. In your fit method this dataframe gets transformed into one where every feature is encoded as categorical. I will show you why this happens.

In this method you pass the dataframe into the np.nan_to_num function, which returns a numpy ndarray. As you said, since the original dataframe contains at least one categorical value, numpy will encode every value as a Unicode string, so X.dtype will be object.

def _fit(self, X_raw, y, sample_weight=None):

The function mentioned above:

X = np.nan_to_num(X)
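The dtype loss can be reproduced in isolation (a minimal sketch with made-up column names, independent of Leshy):

```python
import numpy as np
import pandas as pd

# A dataframe with one numeric and one categorical column:
df = pd.DataFrame({'num': [1.0, np.nan], 'cat': ['a', 'b']})

# np.nan_to_num converts the dataframe to a plain ndarray;
# the mixed columns collapse into a single object dtype.
X = np.nan_to_num(df)
print(type(X), X.dtype)
```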

Also, since X will always be a numpy ndarray (np.nan_to_num returns an ndarray), this if is always False.

if not isinstance(X, np.ndarray):

Then X is passed to this method:

cur_imp = self._add_shadows_get_imps(X, y, sample_weight, dec_reg)

where you take some columns, concatenate them with the shadow features, and pass the result to this function:

imp = _get_shap_imp(

model, X_tt, y_tt, w_tt = _split_fit_estimator(

Finally, you build a pandas DataFrame from a numpy ndarray whose dtype is object, so X.dtypes will be object for every column.

And after that, your function will encode all columns, because all of them are object:

def get_pandas_cat_codes(X):

After encoding, each column of the dataframe will be int64, i.e. ordinal-encoded features.
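The effect can be sketched without Leshy (a hypothetical reproduction; the exact implementation of get_pandas_cat_codes may differ):

```python
import pandas as pd

# An all-object frame, as produced by the np.nan_to_num round trip:
df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']],
                  columns=['num1', 'num2', 'cat1']).astype(object)

# Ordinal-encoding every object column replaces the numeric values
# with category codes, destroying the numeric columns too:
encoded = df.apply(lambda col: col.astype('category').cat.codes)
print(encoded)  # num1 becomes [0, 1] instead of [1, 3]
```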

model = estimator.fit(

And this leads to poor performance of CatBoost: even though it can handle categorical features itself, it is a bad idea to ordinal-encode the numerical features as well. I hope I have explained the problem; if not, feel free to ask anything!

Upd. Sorry for closing and reopening the issue, I misclicked.

@Tialo closed this as not planned on May 5, 2023
@Tialo reopened this on May 5, 2023