Feature request to use intermediate column transformer outputs #28877

paranjapeved15 · 2024-04-23T22:47:25Z

Describe the workflow you want to enable

I am trying to do the following:

import pandas
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.base import TransformerMixin, BaseEstimator

# Input Data
df = pandas.DataFrame([["car",0.1,0.0],["car",0.2,0.0],["suv",0.0,0.2]],columns=['vehicleType','features_car','features_suv'])

# Custom Transformer
class GetScore(BaseEstimator, TransformerMixin):  # type: ignore
    """Compute a per-row dot product of the one-hot columns and the feature columns."""

    def __init__(self):
        """Initialize the transformer (it takes no parameters)."""
        pass

    def dot_product(self, x) -> float:
        """Return x[0]*x[2] + x[1]*x[3] for a single row."""
        return x[0]*x[2] + x[1] * x[3]


    def fit(self, X, y=None):  # type: ignore
        """Fit the transformer (stateless, so there is nothing to learn)."""
        return self

    def transform(self, X: pandas.DataFrame):
        """Transform the given data."""
        if type(X) == pandas.DataFrame:
            x = X.apply(lambda x: self.dot_product(x), axis=1)
            return x.values.reshape((-1, 1))
        # elif type(X) == numpy.ndarray:
        #     vector_func = numpy.vectorize(self.dot_product)
        #     x = vector_func(X)
        #     return x.reshape((-1, 1))

    def get_feature_names_out(self) -> None:
        """Return feature names. Required for onnx conversion."""
        pass

onehot = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(categories=[["car", "suv"]], sparse_output=False), ['vehicleType']),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)

get_score = ColumnTransformer(
    transformers=[
        ("getScore", GetScore(), [0, 1, 2, 3]),
    ],
    remainder='passthrough',
)

pipeline = make_pipeline([("onehot", onehot),
                          ("get_score", get_score)])

preprocesses_df = pipeline.fit(df)

print(preprocesses_df)

I basically want to take the one-hot encoded columns produced by onehot and pass them into GetScore, which computes their dot product with the ['features_car','features_suv'] input features from df.
The code above raises an error:

TypeError: Last step of Pipeline should implement fit or be the string 'passthrough'. '

Is there any easier way to do what I am trying to do?

Describe your proposed solution

Some way to feed the intermediate features produced by one transformer into the next one within a single ColumnTransformer.

paranjapeved15 added the Needs Triage and New Feature labels Apr 23, 2024

ogrisel added the Question label and removed the Needs Triage label Apr 30, 2024

scikit-learn locked and limited conversation to collaborators Apr 30, 2024

ogrisel converted this issue into discussion #28917 Apr 30, 2024
