
coding categorical response variables for use with scikit-learn #77

Open
pkch opened this issue Nov 11, 2015 · 10 comments

@pkch

pkch commented Nov 11, 2015

scikit-learn expects the response variable to be a 1d array. For example,

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X, y)  # y is expected to be a 1d array of string or numeric labels

However, if y is an array of strings, patsy will convert it to dummy variables, which scikit-learn will not accept as a valid response y.

Would it perhaps be useful to be able to tell patsy that a given (string-type) variable should remain a string and/or be converted to numeric labels?
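
To illustrate the mismatch, a minimal sketch (the species/weight columns are made up here):

import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog"],  # string response
    "weight": [4.0, 20.0, 5.0, 18.0],
})

# patsy expands the string response into dummy columns
# (something like species[cat], species[dog])...
y, X = dmatrices("species ~ weight", df)
print(y.design_info.column_names)

# ...whereas scikit-learn wants the original 1d labels, i.e.
# LogisticRegression().fit(X, df["species"]) is the shape being asked for.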

@datnamer

+1

@njsmith
Member

njsmith commented Nov 12, 2015

Yeah, I think statsmodels has similar trouble with their categorical models. Only question is what the interface to this should look like. What do you want to be able to request from patsy, and what kind of output should patsy support? Array of strings? Array of ints corresponding to the categories? (Annoyingly we still don't have a standard way to represent categorical data in numpy, so I guess we'll have to do something ad hoc...) When you process a formula, do you know ahead of time whether you want the response variable to be categorical, or do there exist models that want to do one thing if y is categorical and a different-but-equally-valid thing if y is numerical?

R's way of handling this is that whatever is on the left-hand side of formulas gets treated as R code rather than formula code, so in x + y ~ z + w, the first + does addition and the second + concatenates columns, which is rather confusing. And they have the luxury of having a single standard way to represent categorical data, so it's reasonable for the formula system to just get out of the way and let the end-user and the underlying model talk to each other directly. Not very helpful for us, unfortunately.

CC @josef-pkt @amueller

@josef-pkt

My guess is that for statsmodels it would be helpful to have a keyword in dmatrices to turn off categorical "treatment" on the left-hand side of ~. We would need a way to adjust the treatment of categoricals without directly manipulating the formula string.
This is currently model specific, and we are still missing some models. For ordered Logit it would be nice to keep pandas' ordered Categorical, but the ordered/ordinal model is still just a basic prototype without formula support yet.

We would still need patsy to extract the arrays from the data (DataFrame or dictionary), because we don't have string parsing.

For GLM-Binomial I used the left-hand + for standard formula concatenation: success + fail ~ ... for Binomial counts. My guess is that left-hand formulas will also be useful for multivariate models directly.
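
A sketch of that usage, with hypothetical success/fail count columns (on the left of ~, patsy's + concatenates the columns into the two-column count response that Binomial GLMs accept):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "success": [10, 4, 7, 12],   # successes per group
    "fail": [2, 8, 5, 1],        # failures per group
    "x": [0.5, 1.2, 0.8, 0.3],
})

# The left-hand + yields a (success, failure) pair per row:
res = smf.glm("success + fail ~ x", data=df,
              family=sm.families.Binomial()).fit()
print(res.params)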

@amueller

Just FYI: for classification, scikit-learn accepts anything that is not a float; it'll run np.unique on it.
If you give models 2d data, they assume it is multi-label or multi-output data.
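
For example (a toy sketch, data made up here):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.2], [1.0], [1.3]])
y = np.array(["cat", "cat", "dog", "dog"])  # non-float labels are accepted as-is

lr = LogisticRegression().fit(X, y)
print(lr.predict([[0.1], [1.2]]))  # predictions come back as strings: ['cat' 'dog']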

@pkch
Author

pkch commented Jan 24, 2016

@amueller correct me if I'm wrong, but when sklearn does np.unique on the 1D non-float data, it becomes impossible to use the trained learner to predict on new data (since the conversion map is irretrievably lost). In other words, if I later call sklearn's predict_proba function, I will have no way of knowing which probability refers to which class (it's just a 2D array of numbers, with no labels).

@amueller

@pkch wrong, because unique actually returns the unique values, which are stored in the classes_ attribute.
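
For illustration, a self-contained sketch of how the mapping is preserved:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.2], [1.0], [1.3]])
y = np.array(["cat", "cat", "dog", "dog"])
lr = LogisticRegression().fit(X, y)

# np.unique sorts the labels, and the result is kept on the estimator:
print(lr.classes_)                # ['cat' 'dog']
# predict_proba columns follow the order of classes_:
print(lr.predict_proba([[1.2]]))  # column 0 -> P('cat'), column 1 -> P('dog')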

@pkch
Author

pkch commented Jan 24, 2016

@amueller Ah, thanks. The classes_ attribute isn't mentioned in the LogisticRegression docs; not sure it's worth submitting a PR for such a small issue, though.

I guess this information is not only preserved in classes_, it can also be deduced by rerunning LabelBinarizer.transform() on the set of original labels. This matters because without the guarantee that LabelBinarizer is fully deterministic, it would be impossible to write custom scoring functions that require probabilities. For example, the built-in log_loss metric starts by calling LabelBinarizer.transform() without being able to see the classes_ attribute and without access to the estimator's LabelBinarizer instance.
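
A quick sketch of that determinism (the class order comes from sorting, not from order of appearance):

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# Classes are sorted regardless of the order they appear in;
# with two classes the output is a single 0/1 column:
print(lb.fit_transform(["dog", "cat", "dog"]))  # [[1], [0], [1]]
print(lb.classes_)                              # ['cat' 'dog']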

@amueller

PR very welcome, I'm surprised it's not there. Small doc fixes are very valuable.
We want to make predictors and scoring metrics as easy to use as possible, but that does have the drawback you mention. More and more metrics have a labels parameter, which allows you to pass estimator.classes_.
It doesn't seem to be present in log_loss at the moment, but it would be a welcome addition.
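
The pattern being described would look roughly like this (a sketch; the labels argument on log_loss is the addition under discussion here):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X = np.array([[0.0], [0.2], [1.0], [1.3]])
y = np.array(["cat", "cat", "dog", "dog"])
lr = LogisticRegression().fit(X, y)

proba = lr.predict_proba(X)
# Passing classes_ pins the label order, so the metric does not have to
# re-infer it from whatever subset of labels happens to appear in y:
print(log_loss(y, proba, labels=lr.classes_))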

@pkch
Author

pkch commented Jan 24, 2016

@amueller Ah, I didn't think it was that terrible to depend on the stability of LabelBinarizer, but I guess it's not ideal. Did you mean labels as a required argument, or as an optional one with the current behavior as the default?

If labels is added to log_loss, it will make sense to add it to the API for the user-defined function score_func accepted by make_scorer (requires a modest code change in make_scorer).

Also, what about the default scorer of those estimators that use log_loss? Where is the code that needs to be changed to make them use the new labels argument? If it's not done, then GridSearchCV and cross_val_score (which use the estimator's scorer by default) will still use the old behavior.
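
A sketch of that wiring, assuming log_loss grows the labels argument (make_scorer forwards extra keyword arguments to the score function; the exact scorer keywords have varied across scikit-learn versions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, make_scorer

X = np.array([[0.0], [0.2], [1.0], [1.3], [0.1], [1.1]])
y = np.array(["cat", "cat", "dog", "dog", "cat", "dog"])

classes = np.unique(y)  # fix the full label set up front

# make_scorer forwards labels= to log_loss on every call, so a fold that
# is missing a class would still be scored against the full label set:
scorer = make_scorer(log_loss, greater_is_better=False,
                     needs_proba=True, labels=classes)

lr = LogisticRegression().fit(X, y)
print(scorer(lr, X, y))  # negated log loss (lower loss -> higher score)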

@amueller

Optional.
Well, it's not only about stability; it's that different subsets of the data (as happens in cross-validation) can have different label sets.
