Provide factory functions for selecting columns in ColumnTransformer #12303
Comments
Also see #11301
Hi @amueller, I can have a go at this over the weekend.
Hi @GauravAhlawat, you can try, but this might be a bit tricky (i.e. @jnothman, @jorisvandenbossche and I didn't immediately see what the right solution is, IIRC).
I like the idea from #11301 (comment). I see this feature only used by DataFrames, thus using
preprocess = make_column_transformer(
    (make_select_dtypes([float, int]), StandardScaler()),
    (make_select_dtypes([string, object]), OneHotEncoder())
)
The
Would that work for all kinds of floats and ints?
The
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.array([1, 2, 3], dtype=np.int8),
                   "b": np.array([4, 5, 6], dtype=int),
                   "c": np.array([7, 8, 9], dtype=np.float64),
                   "d": np.array(["hello", "world", "world"])})
# Select all integers
df.select_dtypes(include=[np.integer])
# Select only the "b" column
df.select_dtypes(include=[int])
# Select strings (stored as object dtype)
df.select_dtypes(include=[object])
# Select floats
df.select_dtypes(include=[float])
# Select all numbers
df.select_dtypes(include=[np.number])
So it would not be possible to select on the
preprocess = make_column_transformer(
    (make_select_dtypes([np.number]), StandardScaler()),
    (make_select_dtypes([object]), OneHotEncoder())
)
A user would need to know that "object" means "string" in a pandas DataFrame. This raises the issue that other types, such as "list", are also considered objects in a pandas DataFrame.
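For illustration, here is a minimal sketch of what the `make_select_dtypes` factory discussed above could look like. The function does not exist in scikit-learn under this name; its signature and behavior here are assumptions, and it simply delegates to `DataFrame.select_dtypes`. The string column is created with an explicit `object` dtype so the example does not depend on how pandas infers string columns.

```python
import numpy as np
import pandas as pd


def make_select_dtypes(include=None, exclude=None):
    """Hypothetical factory: return a callable that picks column names by dtype."""
    def selector(df):
        # Let pandas do the dtype matching, then return the matching column names
        return df.select_dtypes(include=include, exclude=exclude).columns.tolist()
    return selector


df = pd.DataFrame({
    "a": np.array([1, 2, 3], dtype=np.int8),
    "b": np.array([4, 5, 6], dtype=np.int64),
    "c": np.array([7.0, 8.0, 9.0], dtype=np.float64),
    "d": pd.Series(["hello", "world", "world"], dtype=object),
})

numeric_cols = make_select_dtypes([np.number])(df)
object_cols = make_select_dtypes([object])(df)
print(numeric_cols, object_cols)
```

Returning a callable keeps the selection lazy: the columns are resolved only when a DataFrame is actually passed in. As far as I know, `ColumnTransformer` accepts a callable as a column specifier, so a selector of this shape could probably be used there directly, but that wiring is not shown or tested here.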
It's possible that a string is not an object, too, right? If someone created a numpy string array and converted it to a pandas dataframe? Or is that cast to object?
We should also allow filtering on column name.
That is always cast to object (pandas does not support numpy's fixed-width string dtypes). So a string will always be object dtype, but not every object dtype will be strings.
You mean like pattern matching on the names? (e.g. all columns starting with a certain string)
Yes, I mean pattern matching on the names.
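A rough sketch of name-based selection, assuming a regular expression is an acceptable way to express the pattern. `make_select_names` is a hypothetical helper invented for this example, not an existing function.

```python
import re

import pandas as pd


def make_select_names(pattern):
    """Hypothetical factory: return a callable that picks columns whose name matches a regex."""
    regex = re.compile(pattern)

    def selector(df):
        # str() guards against non-string column labels such as integers
        return [col for col in df.columns if regex.search(str(col))]
    return selector


df = pd.DataFrame({
    "age_years": [31, 45],
    "age_group": ["30-39", "40-49"],
    "city": ["Oslo", "Lyon"],
})

# All columns whose name starts with "age_"
age_cols = make_select_names(r"^age_")(df)
print(age_cols)
```

A dtype filter and a name filter could be combined by intersecting the two results, though whether they should live in one factory or two is a design question left open here.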
Follow up on #11190:
It would be great to provide built-in ways to select different types of columns, particularly categorical and continuous ones.
It's a bit tricky how to detect that, but we could add some factory functions, for example with an argument on how to interpret integers.
Btw, I'm also working on a preprocessing tool that tries to do this very automagically but I haven't had time to finish / publish it.
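One possible shape for such a factory, sketched under assumptions: `make_select_kind` and its `integers` argument are hypothetical names, and the rule used (floats are continuous, integers follow the `integers` argument, everything else is categorical) is a deliberate simplification rather than a proposed final heuristic.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def make_select_kind(kind, integers="continuous"):
    """Hypothetical factory: select 'categorical' or 'continuous' columns.

    `integers` says whether integer columns count as 'continuous' or 'categorical'.
    """
    valid = ("categorical", "continuous")
    if kind not in valid or integers not in valid:
        raise ValueError("kind and integers must be 'categorical' or 'continuous'")

    def selector(df):
        selected = []
        for name, dtype in df.dtypes.items():
            if is_float_dtype(dtype):
                col_kind = "continuous"
            elif is_integer_dtype(dtype):
                col_kind = integers  # caller decides how integers are interpreted
            else:
                col_kind = "categorical"  # simplification: strings, bools, dates, ...
            if col_kind == kind:
                selected.append(name)
        return selected
    return selector


df = pd.DataFrame({"price": [1.5, 2.0], "rooms": [3, 4], "city": ["Oslo", "Lyon"]})

continuous_cols = make_select_kind("continuous")(df)
categorical_cols = make_select_kind("categorical", integers="categorical")(df)
print(continuous_cols, categorical_cols)
```

Making the integer interpretation an explicit argument avoids guessing from the data, for example from the number of unique values, which is the part that is hard to detect automatically.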