Expanded ColumnTransformer functionality -- transforming subsets of data #28130

Open
rebeccaherman1 opened this issue Jan 15, 2024 · 8 comments

Comments

@rebeccaherman1

rebeccaherman1 commented Jan 15, 2024

Describe the workflow you want to enable (edited)

The ability to (inverse_)transform data corresponding to a subset of the ColumnTransformer's component transformers.

Describe your proposed solution

Data with fewer columns could be passed in together with a new keyword argument that identifies the relevant component transformers by name.

Describe alternatives you've considered, if relevant

A function that subsets a ColumnTransformer object, including adjusting the column indices.

Additional context

In artificial intelligence applications, a researcher may want to transform an entire dataset with column groups for learning, and then later transform new data corresponding only to interventions or predictions, using the same fitted transformations. This requires being able to either subset a ColumnTransformer or pass in only part of the data.

See #27957 for more discussion.

@glemaitre
Member

We have a PR candidate here: #11639

@thomasjpfan This may be a use case you were looking for.
The inverse_transform is also useful in this context: #22574
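
To illustrate why inverting a single block matters, here is a minimal sketch of one possible workaround: reach into the fitted ColumnTransformer through its `named_transformers_` attribute and call `inverse_transform` on just one sub-transformer. The transformer names (`"features"`, `"targets"`) and the synthetic data are assumptions made for illustration only; they do not come from this issue.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(30, 4))

# Columns 0-1 play the role of features, columns 2-3 of targets
# (illustrative names, not taken from the issue).
ct = ColumnTransformer([
    ("features", StandardScaler(), [0, 1]),
    ("targets", StandardScaler(), [2, 3]),
]).fit(data)

# Stand-in for model predictions expressed in the scaled target space.
scaled_predictions = ct.transform(data)[:, 2:]

# Fetch the fitted sub-transformer by name and invert only that block.
target_scaler = ct.named_transformers_["targets"]
predictions = target_scaler.inverse_transform(scaled_predictions)
```

This works for one block at a time; handling several blocks at once requires tracking the column offsets by hand, which is the bookkeeping this issue asks the library to take over.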

@glemaitre
Member

This issue is thus a duplicate of #11463

@glemaitre removed the "Needs Triage" label on Jan 15, 2024
@rebeccaherman1
Author

@glemaitre Sorry, I didn't notice that other issue. However, this is not a duplicate, because I am requesting one additional feature: the ability to pass in a subset of the data.

@rebeccaherman1 changed the title from "Expanded ColumnTransformer functionality" to "Expanded ColumnTransformer functionality -- transforming subsets of data" on Jan 15, 2024
@rebeccaherman1
Author

I have now edited the issue above to focus just on the unique part of this request. Perhaps the Triage label could be re-added?

@GaelVaroquaux
Member

Hi @rebeccaherman1: what do you mean by "subset of the data"? How is it defined?

@rebeccaherman1
Author

@GaelVaroquaux I mean the columns associated with one or more of the transformers within the ColumnTransformer.

For example, let's say my data is M x 10, and I define the ColumnTransformer using

transformers = [('X', StandardScaler(total_variance=True), [0,1,2]),
                ('Y', StandardScaler(total_variance=True), [3,4,5,6,7]),
                ('Z', StandardScaler(total_variance=True), [8,9])]

Then, maybe later, I'd like to pass in data corresponding to just 'X' (original columns [0,1,2]) and 'Z' (original columns [8,9]), without 'Y'. I would then hope to pass in an M' x 5 matrix and identify either the original columns that the new data columns correspond to ([0,1,2,8,9]) or the transformers they correspond to (['X', 'Z']).
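
The request above can be approximated today with a small helper. What follows is a sketch under stated assumptions, not an existing scikit-learn feature: `transform_subset` is a hypothetical function, a plain `StandardScaler()` stands in for the transformers in the example, and the narrower matrix is assumed to hold its columns in the same order as the requested transformer names.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_full = rng.normal(size=(20, 10))

ct = ColumnTransformer([
    ("X", StandardScaler(), [0, 1, 2]),
    ("Y", StandardScaler(), [3, 4, 5, 6, 7]),
    ("Z", StandardScaler(), [8, 9]),
])
ct.fit(X_full)


def transform_subset(ct, data, names):
    """Hypothetical helper: apply only the named fitted transformers.

    `data` must contain the columns of the named blocks, concatenated
    in the order given by `names`.
    """
    # Column lists as originally specified by the user, keyed by name.
    columns = {name: cols for name, _, cols in ct.transformers}
    blocks, start = [], 0
    for name in names:
        width = len(columns[name])
        block = data[:, start:start + width]
        # named_transformers_ holds the fitted sub-transformers.
        blocks.append(ct.named_transformers_[name].transform(block))
        start += width
    return np.hstack(blocks)


# New data with only the 'X' and 'Z' columns: shape (M', 5).
X_new = rng.normal(size=(5, 5))
X_new_t = transform_subset(ct, X_new, ["X", "Z"])
```

The helper assumes dense output and list-of-integer column specifications; sparse outputs, callable or string column selectors, and a `remainder` transformer would each need extra handling.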

@thomasjpfan
Member

If we want to support this feature, I would go with "slicing a column transformer":

transformers = [('X', StandardScaler(total_variance=True), [0,1,2]),
                ('Y', StandardScaler(total_variance=True), [3,4,5,6,7]),
                ('Z', StandardScaler(total_variance=True), [8,9])]
ct = ColumnTransformer(transformers)
ct.fit(X)
ct_sliced = ct[["X", "Z"]]

ct_sliced.transform(X_subset)

We already have some precedent for this with Pipeline slicing.
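
For readers unfamiliar with the Pipeline slicing mentioned here, this short sketch shows the existing behavior: indexing a fitted `Pipeline` with a slice returns a new `Pipeline` that reuses the same fitted steps, and indexing by name returns a single step. The step names and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
]).fit(X)

# A slice yields a new Pipeline holding only the selected steps.
head = pipe[:1]

# The sliced pipeline shares the already-fitted estimator objects,
# so it can transform data without refitting.
scaled = head.transform(X)

# Indexing by name returns the fitted step itself.
scaler = pipe["scale"]
```

A sliced ColumnTransformer, as proposed, would additionally have to remap column indices, since the narrower input no longer matches the positions seen during `fit`.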

@rebeccaherman1
Author

@thomasjpfan This would be wonderful.

I tried to read through the code for ColumnTransformer, but it is too complicated for me to understand well enough to modify in a reasonable amount of time.

Would someone who is more familiar with the code be willing to implement this, or to discuss in more detail how it would best be implemented?
