This issue was moved to a discussion.

Feature request to use intermediate column transformer outputs #28877

Closed
paranjapeved15 opened this issue Apr 23, 2024 · 0 comments

paranjapeved15 commented Apr 23, 2024

Describe the workflow you want to enable

I am trying to do the following:

import pandas
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.base import TransformerMixin, BaseEstimator

# Input Data
df = pandas.DataFrame([["car",0.1,0.0],["car",0.2,0.0],["suv",0.0,0.2]],columns=['vehicleType','features_car','features_suv'])

# Custom Transformer
class GetScore(BaseEstimator, TransformerMixin):  # type: ignore
    """Compute a per-row dot product of the one-hot columns with the feature columns."""

    def __init__(self):
        """Initialize the transformer; it takes no parameters."""
        pass

    def dot_product(self, x) -> float:
        """Return x[0] * x[2] + x[1] * x[3] for a single row."""
        return x[0]*x[2] + x[1] * x[3]


    def fit(self, X, y=None):  # type: ignore
        """Fit the transformer. Nothing is learned, so return self."""
        return self

    def transform(self, X: pandas.DataFrame):
        """Transform the given data."""
        if type(X) == pandas.DataFrame:
            x = X.apply(lambda x: self.dot_product(x), axis=1)
            return x.values.reshape((-1, 1))
        # elif type(X) == numpy.ndarray:
        #     vector_func = numpy.vectorize(self.dot_product)
        #     x = vector_func(X)
        #     return x.reshape((-1, 1))

    def get_feature_names_out(self) -> None:
        """Return feature names. Required for onnx conversion."""
        pass

onehot = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(categories=[["car", "suv"]], sparse_output=False), ['vehicleType']),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)

get_score = ColumnTransformer(
    transformers=[
        ("getScore", GetScore(), [0, 1, 2, 3])
    ],
    remainder='passthrough'
)

pipeline = make_pipeline([("onehot", onehot),
                          ("get_score", get_score)])

preprocesses_df = pipeline.fit(df)

print(preprocesses_df)

I basically want to get the one-hot encoded columns from onehot and then pass them into GetScore to calculate the dot product with the ['features_car','features_suv'] input features from df.
The above code is throwing an error:

TypeError: Last step of Pipeline should implement fit or be the string 'passthrough'. '

Is there any easier way to do what I am trying to do?

Describe your proposed solution

Some way to feed the intermediate features from a previous transformer into the next one within a single ColumnTransformer.

paranjapeved15 added the Needs Triage and New Feature labels Apr 23, 2024
ogrisel added the Question label and removed the Needs Triage label Apr 30, 2024
scikit-learn locked and limited conversation to collaborators Apr 30, 2024
ogrisel converted this issue into discussion #28917 Apr 30, 2024
