Leshy works wrong with categorical features #19
Comments
Hi @Tialo, NumPy arrays are homogeneous: a plain ndarray has a single dtype, so it is not well suited to mixed-type or non-numeric data. NumPy does provide structured arrays, which store a separate dtype per field, but they are more limited than pandas DataFrames. This is why CatBoost relies on lists, pd.DataFrame or Pool. For example, with
X = np.array([[1, 2, 'a'], [3, 4, 'b']])
every entry of X is coerced to a unicode string, so pd.DataFrame(X) contains only object columns. If you want to use numpy arrays with non-numerical columns rather than pd.DataFrame, use structured arrays:
x = np.array([(1, 2, 'a'), (3, 4, 'b')],
             dtype=[('num1', 'i4'), ('num2', 'i4'), ('cat1', 'U10')])
This keeps one dtype per field, and the conversion to a pd.DataFrame works as expected:
x = pd.DataFrame(x)
x.dtypes
which reports int32 for num1 and num2 and a string (object) dtype for cat1.
When working with heterogeneous data, prefer pandas over numpy. I hope it helps. NB: I'm not sure I understood the original method you're referring to. Isn't it the same as the current definition of _create_shadow?
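The difference between the two array types can be checked with a short, self-contained sketch. It assumes only that NumPy and pandas are installed; the variable names (`plain`, `structured`, and so on) are illustrative and do not come from the arfs code base.

```python
import numpy as np
import pandas as pd

# A plain ndarray has exactly one dtype, so mixing ints and strings
# coerces every entry to a unicode string.
plain = np.array([[1, 2, 'a'], [3, 4, 'b']])
plain_df = pd.DataFrame(plain)

# A structured array stores one dtype per named field.
structured = np.array(
    [(1, 2, 'a'), (3, 4, 'b')],
    dtype=[('num1', 'i4'), ('num2', 'i4'), ('cat1', 'U10')],
)
structured_df = pd.DataFrame(structured)

print(plain.dtype.kind)      # 'U': the whole array is unicode strings
print(plain_df.dtypes)       # no numeric column survives
print(structured_df.dtypes)  # num1 and num2 stay int32
```

The numeric columns only survive the conversion when the per-field dtypes are declared up front, which is what a DataFrame does by default.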
I tried to fit Leshy with a dataframe that has numerical and categorical features. In your fit method this dataframe is transformed into a dataframe where every feature is encoded as categorical. I will show you why this happens. In this method you pass the dataframe into
Mentioned function.
Also, if X is always a numpy ndarray (which is true because
Then X is passed to this method.
There you take some columns and concatenate them with the shadow features, which are then passed to this function.
arfs/src/arfs/feature_selection/allrelevant.py Line 1018 in ca71fb2
Finally, you build a pandas DataFrame from a numpy ndarray whose dtype is 'object', then
After that, your function encodes all columns because all of them are object. Line 117 in ca71fb2
After encoding, every column of the dataframe is int64, i.e. an ordinal-encoded feature.
This leads to poor CatBoost performance: although CatBoost can encode categorical features itself, it is a bad idea to encode the numerical features as well. I hope this explains the problem; if not, feel free to ask anything! Upd. Sorry for closing and reopening the issue, I misclicked.
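The chain described in this comment can be reproduced in isolation. The sketch below is not the arfs code: `non_numeric_columns` is a hypothetical stand-in for any dtype-based categorical detector, and the column names are made up for illustration.

```python
import pandas as pd

original = pd.DataFrame(
    {'num1': [1, 3], 'num2': [0.5, 1.5], 'cat1': ['a', 'b']}
)

# Converting a mixed-type frame to an ndarray forces one dtype (object)
# for the whole matrix.
as_array = original.to_numpy()

# Rebuilding a DataFrame from that object array does not restore the
# numeric dtypes.
rebuilt = pd.DataFrame(as_array, columns=original.columns)

def non_numeric_columns(df):
    """Hypothetical detector: treat every non-numeric column as categorical."""
    return [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

print(non_numeric_columns(original))  # ['cat1']
print(non_numeric_columns(rebuilt))   # ['num1', 'num2', 'cat1']
```

After the round trip, a detector of this kind flags every column, so the numerical features would be ordinal-encoded and handed to the estimator as categorical.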
Hello, when I was using Leshy with a CatBoost estimator and a dataset that has categorical features, I noticed that all features of my dataset are treated as categorical and passed to the cat_features parameter of CatBoost. This is caused by this line
arfs/src/arfs/feature_selection/allrelevant.py
Line 946 in ca71fb2
If you have
X = np.array([[1, 2, 'a'], [3, 4, 'b']])
and you pass it to pd.DataFrame, the dtype of every column becomes object. I propose using the original method for creating shadow features. It keeps the original dtypes of the columns.
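The "original method" is not shown in this thread, so the following is only a sketch of one way to build shadow features while keeping column dtypes, under the assumption that X is a pandas DataFrame. The function name `make_shadow_features` and the `shadow_` prefix are hypothetical, not part of arfs.

```python
import numpy as np
import pandas as pd

def make_shadow_features(X: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Append a row-permuted copy of every column, column by column."""
    rng = np.random.default_rng(seed)
    shadows = {}
    for col in X.columns:
        # Positional indexing on the Series keeps its dtype (including category).
        permuted = X[col].iloc[rng.permutation(len(X))]
        # Realign the index so concat does not undo the shuffle.
        permuted.index = X.index
        shadows[f"shadow_{col}"] = permuted
    return pd.concat([X, pd.DataFrame(shadows)], axis=1)

X = pd.DataFrame({
    'num1': [1, 2, 3, 4],
    'num2': [0.1, 0.2, 0.3, 0.4],
    'cat1': pd.Categorical(['a', 'b', 'a', 'c']),
})
X_with_shadows = make_shadow_features(X)
print(X_with_shadows.dtypes)
```

Because each column is permuted as a Series and never passes through a single ndarray, the shadow columns carry the same dtypes as their originals, and a dtype-based categorical detector would still flag only `cat1` and `shadow_cat1`.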