Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels #5874

Merged
merged 12 commits into from Nov 9, 2016

Conversation

Projects
None yet
6 participants
@raghavrv
Copy link
Member

raghavrv commented Nov 18, 2015

Fixes #5868

In master -

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

>>> X = np.arange(12)[:, np.newaxis]
>>> y = ['a',] * 4 + ['b',] * 4 + ['c',] * 4

>>> LogisticRegressionCV(solver='lbfgs', multi_class='multinomial').fit(X, y).predict(X)
ValueError: could not convert string to float: 'a'

In this branch -

>>> LogisticRegressionCV(solver='lbfgs', multi_class='multinomial').fit(X, y).predict(X)
array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'], dtype='<U1')

@agramfort @GaelVaroquaux

@raghavrv raghavrv force-pushed the raghavrv:string_multinomial branch from 7a54392 to 41777f3 Nov 18, 2015

@@ -1525,9 +1525,6 @@ def fit(self, X, y, sample_weight=None):
cv = check_cv(self.cv, y, classifier=True)
folds = list(cv.split(X, y))

self._enc = LabelEncoder()
self._enc.fit(y)

This comment has been minimized.

Copy link
@MechCoder

MechCoder Nov 18, 2015

Member

Surely self._enc must be used somewhere?

This comment has been minimized.

Copy link
@raghavrv

raghavrv Nov 18, 2015

Author Member

It was added here - 67585f6, but not used anywhere... I am unable to understand why though...

@raghavrv raghavrv changed the title FIX LabelEncoder to correctly handle string labels [WIP] FIX LabelEncoder to correctly handle string labels Nov 21, 2015

@MechCoder

This comment has been minimized.

Copy link
Member

MechCoder commented Dec 4, 2015

Instead of this can we encode y once in the fit time of LogisticRegressionCV and pass the encoded y to every split.

If you did this you would need to reconstruct the class_weights if they are in a dict format, such that they use the encoded class labels. Other than that I don't foresee any problem.

Also you can remove the check_classification_targets to before the type of y is checked.

@MechCoder

This comment has been minimized.

Copy link
Member

MechCoder commented Dec 4, 2015

Would also be a good time to clean up some parts of the code, especially the encoding logic, dtype checks etc

@amueller

This comment has been minimized.

Copy link
Member

amueller commented Sep 12, 2016

hm this is still a bug, right?

@amueller amueller added the Bug label Sep 14, 2016

@amueller amueller added this to the 0.19 milestone Sep 14, 2016

@@ -898,7 +897,8 @@ def _log_reg_scoring_path(X, y, train, test, pos_class=None, Cs=10,
y_test[~mask] = -1.

# To deal with object dtypes, we need to convert into an array of floats.
y_test = check_array(y_test, dtype=np.float64, ensure_2d=False)
y_test = check_array(LabelEncoder().fit_transform(y_test),

This comment has been minimized.

Copy link
@jnothman

jnothman Sep 17, 2016

Member

Surely y_test needs to use the same encoder as was used for training.

@raghavrv raghavrv force-pushed the raghavrv:string_multinomial branch 2 times, most recently from 3e85d50 to c61373d Sep 27, 2016

@raghavrv

This comment has been minimized.

Copy link
Member Author

raghavrv commented Sep 27, 2016

@MechCoder @jnothman @amueller Could you take a look at this now?

@raghavrv raghavrv force-pushed the raghavrv:string_multinomial branch 2 times, most recently from 1a7889b to 9097471 Sep 27, 2016

@raghavrv raghavrv changed the title [WIP] FIX LabelEncoder to correctly handle string labels [MRG] FIX LogisticRegression(CV) to correctly handle string labels Sep 27, 2016

@raghavrv raghavrv changed the title [MRG] FIX LogisticRegression(CV) to correctly handle string labels [MRG] FIX LogisticRegressionCV to correctly handle string labels Sep 27, 2016

@raghavrv raghavrv force-pushed the raghavrv:string_multinomial branch 2 times, most recently from c3d207e to 0b7129a Sep 27, 2016

@TomDLT

This comment has been minimized.

Copy link
Member

TomDLT commented Sep 28, 2016

What about just adding label encoding after _check_solver_option in _log_reg_scoring_path ?

y = LabelEncoder().fit_transform(y)
y = np.array(y) - 1
# Test for string labels
lr = LogisticRegression(solver='lbfgs', multi_class='multinomial')
lr_cv = LogisticRegression(solver='lbfgs', multi_class='multinomial')

This comment has been minimized.

Copy link
@TomDLT

TomDLT Sep 28, 2016

Member

you probably wanted to use LogisticRegressionCV here

check_consistent_length(X, y)
if y.dtype.kind in ('S', 'U'):
# Encode for string labels
self._label_encoder = LabelEncoder().fit(y)

This comment has been minimized.

Copy link
@jnothman

jnothman Oct 6, 2016

Member

Why can't you do encoding in all cases and assume that LabelEncoder handles efficiency issues?

This comment has been minimized.

Copy link
@raghavrv

raghavrv Oct 12, 2016

Author Member

thanks for the comment I did so in the recent commit...

for new_key, old_key in zip(new_keys, old_keys):
self.class_weight[new_key] = self.class_weight[old_key]
else:
self.classes_enc_ = self.classes_

This comment has been minimized.

Copy link
@jnothman

jnothman Oct 6, 2016

Member

Does this work when classes are [1,2,3], not [0,1,2]?

This comment has been minimized.

Copy link
@raghavrv

raghavrv Oct 13, 2016

Author Member

I'm encoding for all cases now like you suggested. And there is a test for this usecase.

@raghavrv raghavrv force-pushed the raghavrv:string_multinomial branch from 0b7129a to 14699b7 Oct 12, 2016

@raghavrv

This comment has been minimized.

Copy link
Member Author

raghavrv commented Oct 12, 2016

Arghh. This is tougher to solve than I thought. After encoding the string labels at the fit of LogisticRegressionCV, the logistic_regression_path helper again encodes the data and this causes a test failure I think... I'll have a deeper look at it soon... The scores and coef dict require the unencoded class label as keys... ah...

@raghavrv raghavrv force-pushed the raghavrv:string_multinomial branch from 14699b7 to 934493c Oct 13, 2016

@raghavrv

This comment has been minimized.

Copy link
Member Author

raghavrv commented Oct 13, 2016

Now it should pass all the tests... Gentle ping @TomDLT @MechCoder for reviews too...

@TomDLT

This comment has been minimized.

Copy link
Member

TomDLT commented Oct 21, 2016

Your variables names are quite confusing:
enc_labels, iter_labels, enc_lbl
cls_labels, iter_classes, cls_lbl

what about:
encoded_labels, iter_encoded_labels, encoded_label
classes_labels, iter_classes_labels, classes_label
or
enc_labels, iter_enc_labels, enc_label
cls_labels, iter_cls_labels, cls_label

or even remove iter_labels and iter_classes and use enc_labels and cls_labels directly

@TomDLT

This comment has been minimized.

Copy link
Member

TomDLT commented Oct 21, 2016

Otherwise LGTM

@raghavrv raghavrv force-pushed the raghavrv:string_multinomial branch from b0de8f3 to 4961d97 Nov 9, 2016

@raghavrv raghavrv changed the title [MRG + 1] FIX LogisticRegressionCV to correctly handle string labels [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels Nov 9, 2016

@raghavrv

This comment has been minimized.

Copy link
Member Author

raghavrv commented Nov 9, 2016

Merging once CIs pass...

@raghavrv raghavrv merged commit 31ee1a8 into scikit-learn:master Nov 9, 2016

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@raghavrv

This comment has been minimized.

Copy link
Member Author

raghavrv commented Nov 9, 2016

Thanks @jnothman @amueller @TomDLT @MechCoder for the reviews!

@amueller

This comment has been minimized.

Copy link
Member

amueller commented Nov 9, 2016

thanks for the pr :)

@raghavrv raghavrv deleted the raghavrv:string_multinomial branch Nov 9, 2016

amueller added a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016

[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (s…
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry
@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Nov 20, 2016

We seem to be getting test failures from this on PRs, such as at https://ci.appveyor.com/project/sklearn-ci/scikit-learn/build/1.0.10296/job/qol1vrsk30ycsopl

@raghavrv

This comment has been minimized.

Copy link
Member Author

raghavrv commented Nov 20, 2016

Which PR is this? I saw a similar error but because of an incorrect updation of master. (Which got resolved after fixing that...)

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Nov 20, 2016

#7838 The commit history
looks clean.

On 21 November 2016 at 09:33, Raghav RV notifications@github.com wrote:

Which PR is this? I saw a similar error but because of an incorrect
updation of master. (Which got resolved after fixing that...)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#5874 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz68dXJJvBChav9vNhG-Dby-LREw8Oks5rAMrAgaJpZM4Gkt-Q
.

@raghavrv

This comment has been minimized.

Copy link
Member Author

raghavrv commented Nov 20, 2016

Okay. Thanks for the notice. I'll look into it tomorrow...

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (s…
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (s…
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017

Merge tag '0.18.1' into releases
* tag '0.18.1': (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017

Merge branch 'releases' into dfsg
* releases: (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...

 Conflicts:  removed
	sklearn/externals/joblib/__init__.py
	sklearn/externals/joblib/_parallel_backends.py
	sklearn/externals/joblib/testing.py

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017

Merge branch 'dfsg' into debian
* dfsg: (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (s…
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (s…
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (s…
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry

MLopez-Ibanez pushed a commit to MLopez-Ibanez/scikit-learn that referenced this pull request Feb 9, 2019

[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (s…
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.