Provide factory functions for selecting columns in ColumnTransformer #12303

Closed
amueller opened this issue Oct 5, 2018 · 10 comments · Fixed by #12371
Comments

@amueller
Member

amueller commented Oct 5, 2018

Follow up on #11190:
It would be great to provide built-in ways to select different types of columns, particularly categorical and continuous ones.
Detecting these is a bit tricky, but we could add some factory functions, for example with an argument controlling how integers are interpreted.
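To make the "argument on how to interpret integers" idea concrete, here is a minimal sketch. The name `make_continuous_selector` and its `integers_are_continuous` flag are hypothetical, invented for illustration; only `DataFrame.select_dtypes` and the NumPy abstract dtypes are real APIs.

```python
import numpy as np
import pandas as pd


def make_continuous_selector(integers_are_continuous=True):
    """Hypothetical factory; the flag decides how integer columns are treated."""

    def select(df):
        # Floats always count as continuous; integers only when requested.
        kinds = [np.floating]
        if integers_are_continuous:
            kinds.append(np.integer)
        return df.select_dtypes(include=kinds).columns.tolist()

    return select


df = pd.DataFrame({
    "n_rooms": np.array([2, 3, 4], dtype=np.int64),
    "price": np.array([1.5, 2.25, 3.0], dtype=np.float64),
})

with_ints = make_continuous_selector(integers_are_continuous=True)(df)
without_ints = make_continuous_selector(integers_are_continuous=False)(df)
print(with_ints)
print(without_ints)
```

With the flag off, integer columns such as `n_rooms` are left for whatever handles categorical data.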

Btw, I'm also working on a preprocessing tool that tries to do this very automagically but I haven't had time to finish / publish it.

@amueller
Member Author

amueller commented Oct 5, 2018

Also see #11301

@GauravAhlawat
Contributor

Hi @amueller, I can have a go at this over the weekend

@amueller
Member Author

amueller commented Oct 5, 2018

Hi @GauravAhlawat, you can try, but this might be a bit tricky (i.e. @jnothman, @jorisvandenbossche and I didn't immediately see what the right solution is, IIRC).

Either way, check out what @partmor did in #11301.

@thomasjpfan
Member

I like the idea from #11301 (comment). I see this feature being used only with DataFrames, so using pd.DataFrame.select_dtypes to select the columns would work. As suggested in #11190 (comment), the API would look like:

preprocess = make_column_transformer(
    (make_select_dtypes([float, int]), StandardScaler()),
    (make_select_dtypes([string, object]), OneHotEncoder())
)

The make_select_dtypes factory function would signal to the user that it uses pd.DataFrame.select_dtypes to select columns. Using make_* is a scikit-learn convention for factory functions, such as make_scorer.
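A minimal sketch of what such a factory could look like, assuming it simply wraps `DataFrame.select_dtypes` and returns a callable. `make_select_dtypes` is the proposed name from this comment, not an existing scikit-learn function, and the body below is only one possible reading of the proposal.

```python
import numpy as np
import pandas as pd


def make_select_dtypes(include=None, exclude=None):
    """Hypothetical factory: return a callable that picks columns by dtype."""

    def select(df):
        # Delegate the dtype matching to pandas and return the column names.
        return df.select_dtypes(include=include, exclude=exclude).columns.tolist()

    return select


df = pd.DataFrame({
    "age": np.array([31, 45, 27], dtype=np.int64),
    "income": np.array([40.5, 52.0, 38.25], dtype=np.float64),
    "city": pd.Series(["Paris", "Oslo", "Paris"], dtype=object),
})

numeric_columns = make_select_dtypes(include=[np.number])(df)
object_columns = make_select_dtypes(include=[object])(df)
print(numeric_columns)
print(object_columns)
```

Returning a callable fits how ColumnTransformer treats column specifiers: a callable is passed the input data and returns the columns to use. Later scikit-learn releases provide `sklearn.compose.make_column_selector`, which takes `dtype_include` and `dtype_exclude` arguments along these lines.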

@amueller
Member Author

amueller commented Oct 7, 2018

Would that work for all kinds of floats and ints?

@thomasjpfan
Member

pd.DataFrame.select_dtypes uses the NumPy dtype hierarchy to select dtypes:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.array([1, 2, 3], dtype=np.int8),
                   "b": np.array([4, 5, 6], dtype=int),
                   "c": np.array([7, 8, 9], dtype=np.float64),
                   "d": np.array(["hello", "world", "world"])})

# Select all integers
df.select_dtypes(include=[np.integer])

# Select only the "b" column
df.select_dtypes(include=[int])

# Select strings (held as object dtype; np.object was removed from NumPy)
df.select_dtypes(include=[object])

# Select floats (np.float was an alias of the builtin float and was removed)
df.select_dtypes(include=[float])

# Select all numbers
df.select_dtypes(include=[np.number])

So it would not be possible to select on the string dtype. Thus the API would look like this:

preprocess = make_column_transformer(
    (make_select_dtypes([np.number]), StandardScaler()),
    (make_select_dtypes([object]), OneHotEncoder())
)

A user would need to know that "object" means "string" in a pandas DataFrame. This raises the issue that other types, such as "list", are also considered objects in a pandas DataFrame.
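The ambiguity can be seen in a small sketch: selecting on `object` picks up a column of lists alongside a column of strings. One possible way to tell them apart, offered here as an assumption rather than anything proposed in this thread, is to inspect the values with `pandas.api.types.infer_dtype`. The column names are made up for illustration.

```python
import pandas as pd
from pandas.api.types import infer_dtype

df = pd.DataFrame({
    # Explicit object dtype so the strings are stored as Python objects.
    "name": pd.Series(["ann", "bob", "cy"], dtype=object),
    # A column of lists is also stored with object dtype.
    "tags": [["a"], ["b", "c"], []],
})

# Both columns match, because the dtype alone cannot distinguish them.
object_columns = df.select_dtypes(include=[object]).columns.tolist()

# Inspecting the values narrows the selection to real string columns.
string_columns = [
    col for col in object_columns
    if infer_dtype(df[col], skipna=True) == "string"
]
print(object_columns)
print(string_columns)
```

Value inspection costs a pass over the data, which is the trade-off against a pure dtype check.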

@amueller
Member Author

amueller commented Oct 8, 2018

It's possible that a string is not an object, too, right? If someone created a numpy string array and converted it to a pandas dataframe? Or is that cast to object?

@jnothman
Member

jnothman commented Oct 8, 2018 via email

@jorisvandenbossche
Member

> It's possible that a string is not an object, too, right? If someone created a numpy string array and converted it to a pandas dataframe? Or is that cast to object?

That is always cast to object (pandas does not support NumPy's fixed-width string dtypes).

So a string will always be object dtype, but not every object dtype will be strings.

> We should also allow filtering on column name.

You mean like pattern matching on the names? (eg all columns starting with a certain string)
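If pattern matching on names is what is meant, a selector could be sketched as below. `make_select_by_name` is a hypothetical helper named for this example only, and using a regular expression (rather than, say, glob patterns) is an assumption.

```python
import re

import pandas as pd


def make_select_by_name(pattern):
    """Hypothetical factory: return a callable that picks columns by name."""
    regex = re.compile(pattern)

    def select(df):
        # Only string labels can be matched against the pattern.
        return [
            col for col in df.columns
            if isinstance(col, str) and regex.search(col)
        ]

    return select


df = pd.DataFrame({
    "addr_city": ["Oslo"],
    "addr_zip": ["0150"],
    "age": [41],
})

address_columns = make_select_by_name(r"^addr_")(df)
print(address_columns)
```

Anchoring the pattern with `^` gives the "all columns starting with a certain string" behaviour; an unanchored pattern would match anywhere in the name.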

@jnothman
Member

jnothman commented Oct 8, 2018 via email


5 participants