feature names - NamedArray #14315

Open · wants to merge 19 commits into base: master
Conversation

@adrinjalali (Member) commented Jul 12, 2019

This is based on #13307 by @amueller, related SLEP: scikit-learn/enhancement_proposals#18

This PR explores the idea of passing around the column names with the X object. It is the (b) option in scikit-learn/enhancement_proposals#18 (comment)

The idea is to pass the column names along with the object; for that we can use either an xarray.DataArray or a sklearn.NamedArray. This PR uses xarray to explore the idea, and the focus is to see what it would look like if we do it.
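For concreteness, here is a minimal sketch (for illustration only, not necessarily this PR's exact construction) of what wrapping X in an xarray.DataArray with named dimensions could look like; the samples/features dimension names match the repr shown further down in this thread:

import xarray as xr
from sklearn.datasets import load_iris

iris = load_iris()
# Wrap the data so the feature names travel with it as a coordinate
# on the "features" dimension.
X = xr.DataArray(iris.data,
                 dims=("samples", "features"),
                 coords={"features": iris.feature_names})
print(X.coords["features"].values)
# ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)' 'petal width (cm)']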

There are certain aspects we need to consider when deciding between xarray and an internal NamedArray:

  • xarray depends on pandas! I haven't checked/asked if it'd be easy/doable to remove that hard dependency, but for now it's there.
  • On the other hand, NamedArray requires careful discussion on whether we really wanna have such an object, and what it should include as features.

At the moment, the code behaves as follows:

input:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.compose import make_column_transformer
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
import pandas as pd

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
pipe = make_pipeline(StandardScaler(), PCA(), SelectKBest(k=2), LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].feature_names_in_

output:

array(['pca0', 'pca1'], dtype='<U4')

input:

pipe = make_pipeline(make_column_transformer((make_pipeline(StandardScaler(),
                                                            PCA()), X.columns),
                                             (StandardScaler(), X.columns[:2])),
                     SelectKBest(k=2), LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].feature_names_in_

output:

array(['pipeline__pca0', 'standardscaler__sepal length (cm)'],
      dtype='<U33')

input:

pipe = make_pipeline(make_column_transformer((PCA(), X.columns),
                                             (StandardScaler(), X.columns[:2])),
                     SelectKBest(k=2), LogisticRegression())
pipe.fit(X, iris.target)
pipe[-1].feature_names_in_

output:

array(['pipeline__pca0', 'standardscaler__sepal length (cm)'],
      dtype='<U33')

EDIT: an update/summary is written under #14315 (comment)

@amueller (Member) commented Jul 13, 2019

Awesome, thanks for working on this! What does the repr of the output of transform look like, say for the PCA or standard scaler?

@amueller (Member) commented Jul 13, 2019

Looks to me like pandas is pretty deeply baked into xarray.

@amueller (Member) commented Jul 13, 2019

wanna add xarray to the CI and see what happens? ;)

@amueller (Member) commented Jul 13, 2019

What are the requirements for a DataArray as input for column names to work correctly in your proposal? call the [1] dimension columns?

@jorisvandenbossche (Member) commented Jul 14, 2019

Looking now at the PR, I see that you do the wrapping of the output into an xarray DataArray only at the level of the Pipeline (in order to pass it to the next step).
That's of course much less invasive than what I first thought the idea was (namely that the actual fit method of the estimator would do this wrapping; that would of course also be quite a breaking change in behaviour, if fit methods started returning a DataArray instead of a numpy array). So my description of this idea in scikit-learn/enhancement_proposals#18 (comment) is not fully correct.

@amueller (Member) commented Jul 14, 2019

I was actually also thinking of returning them in the estimators, not the pipeline.
If we do it in the pipeline, I think we'll have tons of inconsistent behavior. I haven't looked at the PR yet but I'm not sure I like this option.

In terms of backward incompatibility: this is something we'll have to swallow at some point if we want to return something with more metadata, and I think it makes sense.
If we implement code paths for pandas to not convert dataframes, that will also break the API. We could have a global option or an experimental activation thing or something like that.

@adrinjalali (Member, Author) commented Jul 15, 2019

What are the requirements for a DataArray as input for column names to work correctly in your proposal? call the [1] dimension columns?

I've called it features, and check for it in _feature_names() in validation.py.

I didn't try to fix all the estimators. These two need to be done:

  • check the input data at the beginning of fit
  • potentially wrap the data in a DataArray at the end of transform or keep the data as a DataArray during transform

The first one I'd like to do once we have #13603 in, because then it'd be really easy.

The second one, we need to discuss how we'd like the API to behave.

To start, my proposal is to allow the user to ask for the data with the metadata, or just a numpy array. By default, it's a numpy array, and the pipeline and other meta estimators can activate the metadata output if they need it.

@adrinjalali (Member, Author) commented Jul 15, 2019

Awesome, thanks for working on this! What does the repr of the output of transform look like, say for the PCA or standard scaler?

Drafting the API I mentioned, here's how it would look:

t = PCA(n_components=3).fit(X)
t._return_dataarray_ = True
t.transform(X)

<xarray.DataArray (samples: 150, features: 3)>
array([[-2.684126,  0.319397, -0.027915],
       [-2.714142, -0.177001, -0.210464],
       [-2.888991, -0.144949,  0.0179  ],
       ...,
       [ 1.764346,  0.078859,  0.130482],
       [ 1.900942,  0.116628,  0.723252],
       [ 1.390189, -0.282661,  0.36291 ]])
Coordinates:
  * samples   (samples) int64 0 1 2 3 4 5 6 7 ... 143 144 145 146 147 148 149
  * features  (features) <U4 'pca0' 'pca1' 'pca2'

It would return a usual numpy array if we don't set the flag, and the pipeline would set the flag (WIP).

EDIT: I'm thinking of having that flag as an __init__ parameter to all estimators, but it doesn't have to be there at the beginning.

@amueller (Member) commented Jul 16, 2019

I would rather have a global flag that allows returning the metadata whenever we can, which is probably whenever the input was a pandas dataframe or a DataArray, and in some featurizers; I'd mark that as an experimental feature.
Why would people want their metadata lost if they provided it in the input?

@adrinjalali (Member, Author) commented Jul 17, 2019

I did try subclassing an ndarray; it is indeed trickier and dirtier than I thought.

So for now, I've just added a simple container (NamedArray), and since it doesn't care what the data is, it supports sparse as well:

import pandas as pd
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
column_trans = ColumnTransformer(
    [('categories', OneHotEncoder(dtype='int'),['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')
column_trans.fit(X)
column_trans.feature_names_out_

['categories__city_London',
 'categories__city_Paris',
 'categories__city_Sallisaw',
 'title_bow__bow',
 'title_bow__feast',
 'title_bow__grapes',
 'title_bow__his',
 'title_bow__how',
 'title_bow__last',
 'title_bow__learned',
 'title_bow__moveable',
 'title_bow__of',
 'title_bow__the',
 'title_bow__trick',
 'title_bow__watson',
 'title_bow__wrath']

t = CountVectorizer()
t._return_dataarray_ = True
t.fit(X).transform(X)

<4x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>
feature names: ['city', 'expert_rating', 'title', 'user_rating']

@thomasjpfan (Member) commented Jul 17, 2019

I did try subclassing an ndarray; it is indeed trickier and dirtier than I thought.

I agree with not subclassing ndarray. We can duck type this: https://www.numpy.org/devdocs/user/basics.dispatch.html

With a custom NamedArray, we can have "namedarray in -> namedarray out" for every estimator. make_named_array(pd_df) is used to wrap a dataframe and pass it into our estimators. make_named_array(np_array, features=[...]) is used to wrap numpy arrays. We would still be backward compatible, since only NamedArrays would have their metadata preserved.
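A minimal sketch of what such a duck-typed container and the make_named_array helper could look like (illustrative only, not this PR's actual implementation):

import numpy as np
import pandas as pd

class NamedArray:
    """Thin container holding the data plus its feature names."""
    def __init__(self, data, features=None):
        self.data = data
        self.features = list(features) if features is not None else None

    def __array__(self, dtype=None):
        # Lets np.asarray(named) and most NumPy functions see the raw data.
        return np.asarray(self.data, dtype=dtype)

def make_named_array(X, features=None):
    # Wrap a DataFrame (taking its columns as feature names) or a numpy array.
    if isinstance(X, pd.DataFrame):
        return NamedArray(X.to_numpy(), features=X.columns)
    return NamedArray(np.asarray(X), features=features)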

@amueller (Member) commented Jul 17, 2019

@adrinjalali what were the issues with subclassing?

@thomasjpfan surely you didn't mean named_named_array?

Also, I just realized following this discussion:
scikit-learn/enhancement_proposals#18 (comment)

we can actually make it "just work" for many cases and we don't need to implement get_feature_names in those cases!!!

stupid github not supporting gifs

@amueller (Member) commented Jul 17, 2019

@thomasjpfan I think I would first start with just having columns

@thomasjpfan (Member) commented Jul 17, 2019

With only columns each transformer would have to do some name mangling with the original column names. In my API proposal, _orig_columns and _columns_to_orig are used together to form columns. (I updated them to be private)

@adrinjalali (Member, Author) commented Jul 17, 2019

@thomasjpfan

With only columns each transformer would have to do some name mangling with the original column names. In my API proposal, _orig_columns and _columns_to_orig are used together to form columns. (I updated them to be private)

somehow I can't find your proposal

@thomasjpfan (Member) commented Jul 17, 2019

@adrinjalali I removed it. It was not a good idea and the discussion is better left for scikit-learn/enhancement_proposals#17.

@shoyer (Contributor) commented Jul 17, 2019

  • xarray depends on pandas! I haven't checked/asked if it'd be easy/doable to remove that hard dependency, but for now it's there.

This is likely doable with some effort. You would lose label-based indexing/alignment, support for datetime dtypes and groupby operations. With more effort, most of these could be replaced by less efficient pure NumPy/Python implementations. Fundamentally, xarray uses pandas for two things: (1) low-level operations that rely upon hash-tables and (2) some fast low-level loops that aren't implemented by NumPy (mostly for non-numeric dtypes).

@adrinjalali (Member, Author) commented Jul 22, 2019

Almost all the tests now pass, and almost all the remaining failing ones are related to me removing get_feature_names(). We can have the function with the old behavior and deprecate it of course in the final proposal.

This is also not super clean, but I think it's at a good enough state to be considered as a proposal, so we can see how it could look.

@amueller (Member) commented Jul 22, 2019

I'll try to check on it today or tomorrow.

@jnothman (Member) left a comment

should we enforce that names are unique?

@adrinjalali (Member, Author) commented Jul 23, 2019

should we enforce that names are unique?

Sounds reasonable to me, but that needs a much better way of feature name generation in some of the transformers. For instance, having two separate PCAs on two different sets of columns, then unioning them, would result in duplicate names right now.
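For concreteness, a sketch of the scenario just described (the feature name attributes come from this PR, so the snippet only sets up the transformer and the collision is described in the comments):

from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA

# Two PCAs on two disjoint sets of the iris columns, concatenated by the
# ColumnTransformer.
ct = make_column_transformer(
    (PCA(n_components=1), ["sepal length (cm)", "sepal width (cm)"]),
    (PCA(n_components=1), ["petal length (cm)", "petal width (cm)"]))
# Each PCA would generate 'pca0' for its single output, so without name
# mangling (e.g. a per-transformer prefix added by the ColumnTransformer)
# the two output names would collide.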

@amueller (Member) commented Jul 23, 2019

@adrinjalali I would have FeatureUnion and ColumnTransformer do name mangling. I think these are the only places where we need to do it, but that should solve most issues.

@amueller (Member) commented Jul 23, 2019

Sorry I find this a bit hard to follow as there are changes in so many files (now I know how y'all feel about my PRs lol).

So this is not using the xarray DataArray idea now, but a NamedArray?

cv_coefs = np.concatenate([cv_pipeline.named_steps["classifier"].coef_
                           for cv_pipeline in cv_results["estimator"]])
fig, ax = plt.subplots()
ax.barh(pipeline.named_steps["classifier"].feature_names_in_,

@amueller (Member) commented Jul 23, 2019

This is a bit confusing since this comes from the fit way above and not from the cross_validate.

@adrinjalali (Member, Author) commented Jul 24, 2019

I haven't really fixed the documentation part yet

It only attaches the features names to the data if `return_dataarray`
is set to `True`."""
if self._return_dataarray_:

@amueller (Member) commented Jul 23, 2019

I don't understand this mechanism. How does this work currently? this is only set in pipeline? Why?
To not show the dataarray (or rather NamedArray) to users?
And doesn't that need to be set in FeatureUnion and ColumnTransformer as well?
And how does it propagate through meta-estimators?

@jnothman (Member) commented Jul 23, 2019

@adrinjalali changed the title from "feature names - xarray" to "feature names - NamedArray" on Jul 24, 2019

@adrinjalali (Member, Author) commented Jul 24, 2019

Here's my effort to summarize what has happened here.

I started by passing the column names around using an xarray.DataArray, to see what it would look like in terms of the changes required in the code, etc. Then, for the following reasons:
(1) xarray has a hard dependency on pandas,
(2) it doesn't support sparse matrices, and when it does, it'll be pydata.sparse which is COO only for now,

it seemed sensible to have a lightweight implementation of a NamedArray. For now, the implementation accepts the data and the feature names, and as long as the data is a numpy array it works without an issue in operations with other numpy objects (it implements __array__). However, things get a bit tricky once the data is a sparse matrix, in which case the user should get the data out of the NamedArray and work with that (for instance, scipy.sparse.issparse() does an isinstance() check which clearly fails on a NamedArray, unless we hook into that).
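To make the issparse point concrete, here is a small self-contained illustration (the container below is a stand-in, not the actual class in this PR):

import numpy as np
from scipy import sparse

class _Named:
    # stand-in container: just holds the data and the feature names
    def __init__(self, data, features):
        self.data, self.features = data, features

S = sparse.csr_matrix(np.eye(3))
wrapped = _Named(S, ["a", "b", "c"])

print(sparse.issparse(S))        # True
print(sparse.issparse(wrapped))  # False: issparse is an isinstance() check,
                                 # so callers have to unwrap wrapped.data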

On the estimator API side, right now:

  • the estimators return the usual numpy array or scipy sparse matrix by default
  • they return a NamedArray if a flag is set by the user; this is done in Pipeline, but still has to be done in other places such as ColumnTransformer and meta-estimators.
  • feature_names_in_ is filled with the feature names of the data if a pandas.DataFrame or a NamedArray is passed.
  • feature_names_out_ is populated regardless of whether feature names were passed in. It is x0, x1, ... by default, and derived from the actual column names if present.

Now, these are the questions I have:

  • What is left to be able to answer the question: is this what we'd like to develop further (i.e. some sort of an internal NamedArray + having the transformers handle the feature names)?

If the answer is yes to the above question:

  • should it be a global [experimental] flag telling the estimators to return a NamedArray if a pandas DataFrame or a NamedArray is given, or should it be a per-estimator flag, as it is now?

  • NamedArray should probably be in its own PR, shouldn't it?

  • How should NamedArray behave if the data is sparse (or not a numpy array in general)?

  • Should we enforce unique column names?
    ...

@amueller (Member) commented Jul 24, 2019

Maybe we should go back to trying to write a SLEP?
You made many decisions in this PR lol.

Why do you introduce a flag? Why would we not always return a NamedArray?

@adrinjalali (Member, Author) commented Oct 25, 2019

I'm gonna try to be coherent in this response, but you're touching too many things in your post @thomasjpfan :p.

Categories for categorical features - (Internally, we may need to have a homogeneous array for floats, and int array for categories where the category metadata can be used to get the category name)

This has not been in the scope of this design. I'm happy for it to become that, but one reason we've been able to achieve a consensus on the design is that it's really lightweight. Adding such an API would substantially increase the complexity of the implementation, and I worry about the consensus-seeking around it. Personally, I wouldn't mind. It may become a separate package at some point.

Sample weights - (Internally, would be stored in a separate 1-D array)
Group indices

And other sample props, and not necessarily 1-D. But again, this has not been a part of the discussion here. I would personally hope we could use it for that purpose, but @jnothman has raised some concerns related to routing the props and which object needs which prop and so on.

print(X_named["embarked"])

More like print(X_named.select("embarked")). Most definitely we don't want to override __getitem__.

print(X_named.categories)

As I said, we haven't thought of this as a part of this object yet, so I don't know.

# everything as numpy array (no metadata)
print(np.asarray(X_named))

# if a user wants to use `X_trans` as a numpy array either:
np.asarray(X_trans)
X_trans.to_numpy()

We have done our best to not need that. The idea is that the user should be able to use the instance in a workflow as they did before, which means they shouldn't need to call np.asarray on it. The current implementation supports this.

I agree with the rest of your statements.

@adrinjalali (Member, Author) commented Oct 25, 2019

Oh, and the answer to your question, for now it's not even in the scope of the design.

@thomasjpfan (Member) commented Oct 25, 2019

Oh, and the answer to your question, for now it's not even in the scope of the design.

We can have a lightweight version of this design now, but I want to see what the end goal of this design will be. Specifically, I do not want us to make a design decision that will constrain our goals in the future.

If the goal is to "add feature names to single numpy array" and we plan to do nothing else, I agree with the design of this PR.

@thomasjpfan (Member) commented Oct 25, 2019

Specifically, we considered using pandas dataframes to define categories, which can be used by our tree-based algorithms, but this was turned down because of the copying issue. This motivated the creation of the NamedArray. If NamedArray does not resolve this issue, then we would still not have a good resolution for categorical features.

(We cannot use xarray, because when converting a pandas dataframe to an xarray object, it loses the categorical dtype.)
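A quick illustration of the parenthetical point, using pandas' built-in to_xarray converter:

import pandas as pd

df = pd.DataFrame({"city": pd.Categorical(["London", "Paris", "Paris"])})
print(df["city"].dtype)              # category
print(df.to_xarray()["city"].dtype)  # object: the categorical dtype is lost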

@adrinjalali (Member, Author) commented Oct 28, 2019

Other than the fact that xarray depends on pandas, it also aligns rows and columns when you apply an operation between two xarray instances, which is not something we'd like to include.
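A small example of that alignment behaviour (coordinate and dimension names here are arbitrary, for illustration only):

import numpy as np
import xarray as xr

a = xr.DataArray(np.ones((2, 2)), dims=("samples", "features"),
                 coords={"features": ["f1", "f2"]})
b = xr.DataArray(np.ones((2, 2)), dims=("samples", "features"),
                 coords={"features": ["f2", "f3"]})

# Arithmetic aligns on the 'features' coordinate (inner join by default), so
# only the overlapping column survives instead of a positional element-wise op.
print((a + b).coords["features"].values)  # ['f2']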

I'm happy to hear your thoughts on an alternative design. What do you have in mind?

@hermidalc (Contributor) commented Oct 28, 2019

Sample weights - (Internally, would be stored in a separate 1-D array)
Group indices

And other sample props, and not necessarily 1-D. But again, this has not been a part of the discussion here. I would personally hope we could use it for that purpose, but @jnothman has raised some concerns related to routing the props and which object needing which prop and so on.

I agree with @thomasjpfan... if we could make a NamedArray that holds the X numpy data plus 2D sample metadata/properties and 2D feature metadata/properties as pandas DataFrames, that would be ideal. Then when you subset this NamedArray, it automatically subsets the embedded structures appropriately. This is how things are typically done in R with complex data objects: there are underlying dataframe slots, but the object acts like an array, so you can simply do e.g. X[1:6,] and it also subsets the metadata dataframes appropriately.

I think this would make the property routing issue totally moot? Since we wouldn't have such metadata being passed via fit or transform method params, it would be in the NamedArray.

@hermidalc (Contributor) commented Oct 28, 2019

Also, I don't immediately see what the problem is with xarray. The values slot is a raw numpy array and fast; only the attrs/metadata slots are pandas. I think it is worth evaluating the major advantages xarray gives versus the probably minimal performance hit, since only the metadata is pandas, it is typically small compared to values, and only subsetting (normally no computation) would occur on it. I think that is better than rolling your own and supporting so much less. xarray is very much like what I described as being supported and used in R (in my field the ExpressionSet and SummarizedExperiment data objects are the foundation of many libraries, and they are conceptually like xarray).

@adrinjalali (Member, Author) commented Oct 28, 2019

I think this would make the property routing issue totally moot? Since we wouldn't have such metadata being passed via fit or transform method params, it would be in the NamedArray.

No, it doesn't. We still have the issue that the user may not want all functions/estimators receiving a certain piece of metadata to actually use it, and that there may be metadata with the same name (say, sample weight) which is different for different functions. @jnothman's latest suggestion on the SLEP may solve this issue though.

xarray depends on pandas; that by itself means we cannot depend on it (we don't want a hard dependency on pandas). Also, xarray uses indices substantially (which is why it depends on pandas), and we don't want them. We did explore using xarray before coming to the conclusion that we need our own array; it was not suitable.

@hermidalc (Contributor) commented Oct 28, 2019

No, it doesn't. We still have the issue that the user may not want all functions/estimators receiving a certain piece of metadata to actually use it, and that there may be metadata with the same name (say, sample weight) which is different for different functions. @jnothman's latest suggestion on the SLEP may solve this issue though.

Could you point me to his latest suggestion? My apologies; regarding the issues of sample and feature properties, there are so many different open issues and SLEPs.

@adrinjalali (Member, Author) commented Oct 28, 2019

@hermidalc (Contributor) commented Oct 28, 2019

And apologies, too, if I appeared to question your choices; that was totally not my intention, and I'm just happy that these issues are being discussed.

I just don't want the scikit-learn core team to go through all the effort of developing new functionality that then doesn't solve the basic needs of, e.g., life sciences researchers who use ML. For us, we need 2D sample metadata and 2D feature metadata dataframes that are in some way transparently attached to or associated with the X numpy array, such that everything gets passed around and subset the right way. As long as you keep these needs in your plan, I'm happy.

@adrinjalali (Member, Author) commented Oct 28, 2019

That's a fair point, and good to know.

@hermidalc (Contributor) commented Oct 28, 2019

No, it doesn't. We still have the issue that the user may not want all functions/estimators receiving a certain piece of metadata to actually use it, and that there may be metadata with the same name (say, sample weight) which is different for different functions. @jnothman's latest suggestion on the SLEP may solve this issue though.

I think this matters more for compatibility with the scikit-learn API itself, like groups and sample_weight. From the user side, when we write our own custom extensions, if I do not need to specify new named parameters in fit and transform to hold specific 2D sample or feature metadata, and they are just attached to X, then this is a great solution and there is no need to worry about routing on the user side. In my custom classes I only access the metadata columns I need, and there are no potential namespace clashes with existing scikit-learn named parameters being passed to fit or transform or predict, for example.

@amueller (Member) commented Nov 12, 2019

If we want to use the new data structure to solve the categorical/continuous issue that @thomasjpfan points to, we're basically reimplementing the chunks that are currently used in pandas, right? The main reason we're not going with pandas is the fact that they want to move away from this model because it's too hard to maintain, from what I understand.
So before we consider implementing this, we might talk to @jorisvandenbossche a bit more first ;)

@amueller (Member) commented Nov 15, 2019

@adrinjalali can you please elaborate on what you wanted to write in the SLEP re categorical data? @thomasjpfan had the same question above, I think.

My assumption was that we would have homogeneous data, so categorical values would indeed be stored as floats.

@amueller (Member) commented Nov 15, 2019

Can someone give an example where the output of an estimator would contain both continuous and categorical columns? I guess that would be involving a ColumnTransformer, but right now we have nothing that produces categorical columns (that are not integers), right?

@thomasjpfan (Member) commented Nov 15, 2019

ColumnTransformer can have remainder='passthrough', which passes through the categorical features. For example:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

X = pd.DataFrame(
     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
      'title': ["His Last Bow", "How Watson Learned the Trick",
                "A Moveable Feast", "The Grapes of Wrath"],
      'expert_rating': [5, 3, 4, 5],
      'user_rating': [4, 5, 4, 3]})


column_trans = ColumnTransformer(
     [('title_bow', CountVectorizer(), 'title')],
     remainder='passthrough')

column_trans.fit_transform(X)

As an aside, I think using homogeneous data works for most use cases. Even for strings, we can include the categories in the metadata and encode them as floats.

Edit: Fix example to actually pass through the "city" category.
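As a sketch of the aside above (encoding strings as float codes while keeping the categories as metadata; the variable names are illustrative):

import pandas as pd

city = pd.Categorical(["London", "London", "Paris", "Sallisaw"])
codes = city.codes.astype(float)    # homogeneous float column: [0., 0., 1., 2.]
categories = list(city.categories)  # metadata: ['London', 'Paris', 'Sallisaw']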

@NicolasHug (Member) commented Nov 15, 2019

@adrinjalali can you please elaborate on what you wanted to write in the SLEP re categorical data?

Regardless of the content of the SLEP, I really think we need a summary of the current state of this matter. With all the discussions happening across the SLEPs, issues and PRs, it's very hard for people who haven't been following since the beginning to understand what's going on.
(I'd like to help but can't because of this.)

@amueller (Member) commented Nov 15, 2019

@NicolasHug Isn't the summary the new slep?
@thomasjpfan that's the only one I could come up with as well.
We could have ColumnTransformer optionally return a dataframe in that case....

@adrinjalali (Member, Author) commented Nov 16, 2019

I'll write up a new SLEP with the summary and all. I just wanted to talk to Thomas before doing that, and we're doing that in the coming week.

Regarding an estimator with categorical and numerical data as output, some resamplers would be another example.

And regarding the question of whether we want to have homogeneous data or not, the answer is yes as far as I know, or at least that has been the case. I haven't confirmed it since I need to know more about what we want to include in this new object.

@adrinjalali mentioned this pull request on Dec 27, 2019
Linked issues: None yet

8 participants