
Warm start bug when fitting a LogisticRegression model on binary outcomes with `multi_class='multinomial'`. #10836

Closed
rwolst opened this issue Mar 19, 2018 · 3 comments

@rwolst (Contributor) commented Mar 19, 2018

Description

Bug when fitting a LogisticRegression model on binary outcomes with multi_class='multinomial' and warm_start=True. It is similar to #9889: a 1D coef object is used on binary outcomes even when multi_class='multinomial', rather than a 2D coef object.

Steps/Code to Reproduce

from sklearn.linear_model import LogisticRegression
import sklearn.metrics
import numpy as np

# Set up a logistic regression object
lr = LogisticRegression(C=1000000, multi_class='multinomial',
                        solver='sag', tol=0.0001, warm_start=True,
                        verbose=0)

# Set independent variable values
Z = np.array([
[ 0.        ,  0.        ],
[ 1.33448632,  0.        ],
[ 1.48790105, -0.33289528],
[-0.47953866, -0.61499779],
[ 1.55548163,  1.14414766],
[-0.31476657, -1.29024053],
[-1.40220786, -0.26316645],
[ 2.227822  , -0.75403668],
[-0.78170885, -1.66963585],
[ 2.24057471, -0.74555021],
[-1.74809665,  2.25340192],
[-1.74958841,  2.2566389 ],
[ 2.25984734, -1.75106702],
[ 0.50598996, -0.77338402],
[ 1.21968303,  0.57530831],
[ 1.65370219, -0.36647173],
[ 0.66569897,  1.77740068],
[-0.37088553, -0.92379819],
[-1.17757946, -0.25393047],
[-1.624227  ,  0.71525192]])

# Set dependent variable values
Y = np.array([1, 0, 0, 1, 0, 0, 0, 0,
              0, 0, 1, 1, 1, 0, 0, 1,
              0, 0, 1, 1], dtype=np.int32)

# First fit model normally
lr.fit(Z, Y)

p = lr.predict_proba(Z)
print(sklearn.metrics.log_loss(Y, p)) # ...

print(lr.intercept_)
print(lr.coef_)

# Now fit model after a warm start
lr.fit(Z, Y)

p = lr.predict_proba(Z)
print(sklearn.metrics.log_loss(Y, p)) # ...

print(lr.intercept_)
print(lr.coef_)

Expected Results

The predictions should be identical, as the model converged the first time it was fitted.

Actual Results

The predictions are different; in fact, the more times you re-run the fit, the worse they get (which is the only reason I was able to catch the bug). It is caused by this line: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L678.

 w0[:, :coef.shape[1]] = coef

As coef is (1, n_features) but w0 is (2, n_features), the coef values are broadcast into both rows of w0, leaving the two class-weight rows identical. This causes some sort of degeneracy when training, resulting in worse performance. Note that if w0 had instead been initialised with random values, this bug would have been very hard to catch: the model would still converge each time, just not as fast as one would hope when warm starting.
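The broadcasting can be seen in a standalone NumPy snippet (the shapes are illustrative, not taken from the reproduction above):

```python
import numpy as np

n_features = 3
coef = np.array([[0.5, -1.0, 2.0]])   # shape (1, n_features), as stored for binary outcomes
w0 = np.zeros((2, n_features))        # two rows, one per class, for the multinomial solver

# The assignment from logistic.py: the single coef row is broadcast
# into *both* rows of w0, so the two class-weight rows come out identical.
w0[:, :coef.shape[1]] = coef
print(w0)

assert np.array_equal(w0[0], w0[1])   # degenerate multinomial warm start
```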

Further Information

The fix, I believe, is very easy: just swap the previous line for

 if n_classes == 1:
     w0[0, :coef.shape[1]] = -coef  # Be careful to get these the right way around
     w0[1, :coef.shape[1]] = coef
 else:
     w0[:, :coef.shape[1]] = coef
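As a sanity check (a sketch with made-up numbers, not scikit-learn internals): initialising the two rows antisymmetrically as (-coef, coef) gives a two-row model whose softmax probabilities coincide with those of a plain binary logistic model with coefficients 2*coef, i.e. a meaningful starting point, whereas the broadcast copy makes the two rows identical so every logit difference is zero:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

coef = np.array([0.7, -1.2])   # hypothetical binary coefficient row
x = np.array([0.5, 2.0])       # hypothetical sample

# Fixed warm start: antisymmetric rows (-coef, coef) instead of two copies.
w0 = np.vstack([-coef, coef])
p1 = softmax(w0 @ x)[1]                          # P(y=1) under the two-row model
p1_binary = 1.0 / (1.0 + np.exp(-2 * coef @ x))  # binary logistic with 2*coef
assert np.isclose(p1, p1_binary)

# Buggy broadcast copy: identical rows give zero logit difference, so the
# model starts from the degenerate "predict 0.5 everywhere" state.
w0_bug = np.vstack([coef, coef])
assert softmax(w0_bug @ x)[1] == 0.5
```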

Versions

Linux-4.13.0-37-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
NumPy 1.14.2
SciPy 1.0.0
Scikit-Learn 0.20.dev0 (built from latest master)

@jnothman (Member) commented Mar 19, 2018

Thanks for the report. At a glance, that looks very plausible. Test and patch welcome!

@rwolst (Contributor, Author) commented Mar 20, 2018

I'm happy to do this, although I would be interested in opinions on the test. I could do either:

  1. Test the symptom of the bug above, i.e. that the model does not stay converged when warm starting.
  2. Test that the initial w0 used in logistic_regression_path is the same as the previous w0 after the function has been run, i.e. that warm starting is happening as expected.

The pros of (1) are that it's quick and easy; however, as mentioned previously, it doesn't really get to the essence of what is causing the bug. The test only fails because w0 is initialised with exactly identical rows. If that were not the case but the rows still weren't warm started correctly (e.g. just randomly initialised), the model would still converge (just more slowly than one would hope with a good warm start) and the test would unfortunately pass.

The pros of (2) are that it would correctly test that the warm starting occurred, but the con is that I don't know how I would do it, as w0 is not available outside of the logistic_regression_path function.
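A minimal sketch of option (1), with made-up data (the reproduction above additionally passes multi_class='multinomial', which newer scikit-learn releases deprecate, so it is omitted here): fit once to convergence, then check that a warm-started refit leaves the coefficients essentially unchanged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical test data
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = (X[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)

lr = LogisticRegression(solver='sag', warm_start=True, max_iter=1000,
                        tol=1e-6, random_state=0)
lr.fit(X, y)
coef_first = lr.coef_.copy()

# A warm-started refit of an already-converged model should not move
# the solution; with the bug, repeated refits drift away from it.
lr.fit(X, y)
assert np.allclose(coef_first, lr.coef_, atol=1e-3)
```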

@lesteve (Member) commented Mar 20, 2018

Go for the simplest test first, open a PR and see where that leads you!
