Output of OneHotEncoder fed into CCA throws ValueError: The output of the '<pipeline_name>' transformer should be 2D (scipy matrix, array, or pandas DataFrame). #16600

Open
izabella197 opened this issue Feb 29, 2020 · 3 comments

@izabella197

Describe the bug

Output of OneHotEncoder fed into CCA throws ValueError: The output of the '<pipeline_name>' transformer should be 2D (scipy matrix, array, or pandas DataFrame)

Notes:
Note that the problem does not occur outside a pipeline, i.e. when I call fit_transform on the OneHotEncoder and then pass its output to CCA.fit_transform. It occurs only when the entire ColumnTransformer pipeline is used.

Steps/Code to Reproduce

  1. Create a ColumnTransformer in which the categorical input is one-hot encoded with OneHotEncoder and the encoder's output is fed to CCA.
  2. Run fit_transform on the data using the pipeline.
  3. This should return a NumPy array with the number of features specified in CCA.

Sample code to reproduce the problem:
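from sklearn.compose import ColumnTransformer
from sklearn.cross_decomposition import CCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = [<some categorical column names from pandas dataframe>]
numerical = [<some numerical column names from pandas dataframe>]

ohe_pipe = ColumnTransformer(
    [
        ("one_hot_encode", Pipeline([
            ("one_hot_encode", OneHotEncoder(handle_unknown="ignore", sparse=False)),
            ("cca", CCA(n_components=300)),
        ]), categorical),
        ("scale", StandardScaler(), numerical),
    ],
    remainder="drop")

# y_test is passed through to the CCA step inside the pipeline
X_test = ohe_pipe.fit_transform(X_test, y_test)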

Expected Results

Should produce a NumPy array such that X_test.shape returns

>>>(<num examples>, 300)

Actual Results

D:\Users\<USERNAME>\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
    536 
    537         self._update_fitted_transformers(transformers)
--> 538         self._validate_output(Xs)
    539 
    540         return self._hstack(list(Xs))

D:\Users\<USERNAME>\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _validate_output(self, result)
    400                 raise ValueError(
    401                     "The output of the '{0}' transformer should be 2D (scipy "
--> 402                     "matrix, array, or pandas DataFrame).".format(name))
    403 
    404     def _validate_features(self, n_features, feature_names):

ValueError: The output of the 'one_hot_encode' transformer should be 2D (scipy matrix, array, or pandas DataFrame).

Versions

System:
    python: 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]
executable: D:\Users\<USERNAME>\Anaconda3\python.exe
   machine: Windows-10-10.0.17763-SP0

Python dependencies:
       pip: 20.0.2
setuptools: 45.2.0.post20200210
   sklearn: 0.22
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 0.25.1
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True

(These versions are used instead of the newest ones because of a memmapping problem experienced on Windows.)

@jnothman
Member

jnothman commented Feb 29, 2020 via email

@jnothman
Member

jnothman commented Feb 29, 2020 via email

@izabella197
Author

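import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MakeNumpy(BaseEstimator, TransformerMixin):
    """Unpack the x scores when the preceding CCA step returns a tuple."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # CCA returns (x_scores, y_scores) when y is passed and a plain
        # array otherwise, so only unpack when X really is a tuple.
        if isinstance(X, tuple):
            X = X[0]
        return np.asarray(X)

onehotencode_ccapca = ColumnTransformer(
    [
        ("one_hot_encode", Pipeline([
            ("one_hot_encode", OneHotEncoder(categories=categories, handle_unknown="ignore", sparse=False, dtype=np.int8)),
            ("cca", CCA(n_components=300)),
            ("MakeNumpy", MakeNumpy()),
        ]), categorical),
        ("scale", StandardScaler(), numerical),
    ],
    remainder="drop")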

So this appears to work, but what CCA returns is a tuple.
