
Inconsistent results for logistic regression across multiple computers #24615

Closed · NickBrecht opened this issue Oct 10, 2022 · 10 comments

NickBrecht commented Oct 10, 2022

Describe the bug

Hey all -

I'm teaching students logistic regression and noticed that different computers can produce slightly different intercepts/coefficients. At first I suspected environment differences, but the discrepancies persist even after holding the package and interpreter versions constant.

I created fresh, identical environments on computer 1 (Windows 10, Intel 8th-gen CPU), computer 2 (Windows 10, Intel 11th-gen CPU), and a coworker's M1 MacBook; all three produce different results. I thought minute differences between NumPy's OpenBLAS and MKL builds could be the culprit, but swapping them yielded the same varying results. I've tried not splitting the data, different random states, different C values, different solvers...

For the purposes of reproducibility, I ran this exact code and uploaded the data to Google Sheets:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

bank_df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSGfjG4mq1_4HS4iwRN7EZK6YHzPDi8HpB_giY7kiqbDZRsRNjbfhuQ2J6xkHGk1YVYN9H0TxOf2tgw/pub?gid=1909291157&single=true&output=csv')
bank_df.drop(columns=['ID', 'ZIP_Code'], inplace=True)

# recode the Education column as a named categorical
bank_df['Education'] = bank_df['Education'].astype('category')
new_categories = {1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'}
bank_df['Education'] = bank_df['Education'].cat.rename_categories(new_categories)
bank_df = pd.get_dummies(bank_df, prefix_sep='_', drop_first=True)

y = bank_df['Personal_Loan']
X = bank_df.drop(columns=['Personal_Loan'])

# partition data
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)

# fit a logistic regression (set penalty='l2' and C=1e42 to effectively disable regularization)
logit_reg = LogisticRegression(penalty="l2", C=1e42, solver='liblinear', random_state=0)
logit_reg.fit(train_X, train_y)
```
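
For reference, a minimal way to print the intercept and per-feature coefficients reported in the table below (assuming the fit above completed):

```python
# Label each coefficient with its feature name for easy cross-machine comparison
print("Intercept:", logit_reg.intercept_[0])
print(pd.Series(logit_reg.coef_[0], index=train_X.columns))
```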

All versions of all other packages are held constant. I received these results across the three computers:

| Parameter | Computer 1 (sklearn 1.1.2) | Computer 2 (sklearn 1.1.2) | M1 MacBook (sklearn 1.1.1) |
|---|---|---|---|
| Intercept | -12.4919578542609 | -12.6123809362746 | -12.4934360611768 |
| Age coef | -0.037738 | -0.032577 | -0.037685 |
| Experience coef | 0.039255 | 0.034171 | 0.039202 |
| Income coef | 0.058843 | 0.058837 | 0.058844 |
| Family coef | 0.612243 | 0.613192 | 0.612251 |

I have tried many other variations; all produce slightly different results not captured in the table above. Multicollinearity is present in the data, but I would still expect consistent results.

I understand the results are effectively the same and this has little real-world impact. However, when students compare results, 12.4 vs. 12.6 reads as a decently large difference. Some of these differences seem too large to be attributed to floating-point computation alone, especially given the size of the data. I assume I'm missing something, but I'm reaching the end of my rope on troubleshooting. Thoughts / expected behavior?

Steps/Code to Reproduce

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

bank_df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSGfjG4mq1_4HS4iwRN7EZK6YHzPDi8HpB_giY7kiqbDZRsRNjbfhuQ2J6xkHGk1YVYN9H0TxOf2tgw/pub?gid=1909291157&single=true&output=csv')
bank_df.drop(columns=['ID', 'ZIP_Code'], inplace=True)

# recode the Education column as a named categorical
bank_df['Education'] = bank_df['Education'].astype('category')
new_categories = {1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'}
bank_df['Education'] = bank_df['Education'].cat.rename_categories(new_categories)
bank_df = pd.get_dummies(bank_df, prefix_sep='_', drop_first=True)

y = bank_df['Personal_Loan']
X = bank_df.drop(columns=['Personal_Loan'])

# partition data
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)

# fit a logistic regression (set penalty='l2' and C=1e42 to effectively disable regularization)
logit_reg = LogisticRegression(penalty="l2", C=1e42, solver='liblinear')
logit_reg.fit(train_X, train_y)
```

Expected Results

| Parameter | Computer 1 (sklearn 1.1.2) | Computer 2 (sklearn 1.1.2) | M1 MacBook (sklearn 1.1.1) |
|---|---|---|---|
| Intercept | -12.4919578542609 | -12.6123809362746 | -12.4934360611768 |
| Age coef | -0.037738 | -0.032577 | -0.037685 |
| Experience coef | 0.039255 | 0.034171 | 0.039202 |
| Income coef | 0.058843 | 0.058837 | 0.058844 |
| Family coef | 0.612243 | 0.613192 | 0.612251 |

Actual Results

n/a

Versions

```
System:
    python: 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\Nick\.conda\envs\sktest\python.exe
   machine: Windows-10-10.0.19044-SP0

Python dependencies:
      sklearn: 1.1.2
          pip: 22.2.2
   setuptools: 63.4.1
        numpy: 1.23.1
        scipy: 1.9.1
       Cython: None
       pandas: 1.4.4
   matplotlib: None
       joblib: 1.1.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: C:\Users\Nick\.conda\envs\sktest\Lib\site-packages\sklearn\.libs\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 12

       filepath: C:\Users\Nick\.conda\envs\sktest\Library\bin\mkl_rt.1.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 6
threading_layer: intel
```
NickBrecht added the Bug and Needs Triage labels on Oct 10, 2022
@glemaitre (Member)

Could you set the random_state? liblinear has some randomness in its algorithm.

@NickBrecht (Author)

@glemaitre Of course - I should've included that in the code example.

Setting random_state=42 and removing the train_test_split to pass in the entire dataset, I get these results:

| Parameter | Computer 1 | Computer 2 |
|---|---|---|
| Intercept | -11.494120788531475 | -12.301872604999168 |
| Age | -0.064421 | -0.035942 |
| Experience | 0.072901 | 0.045021 |
| Income | 0.059754 | 0.060148 |
| Family | 0.611814 | 0.617678 |
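
For clarity, a sketch of the configuration just described (full data, fixed liblinear seed):

```python
# No train/test split: fit on all rows with random_state fixed
logit_reg = LogisticRegression(penalty="l2", C=1e42, solver='liblinear', random_state=42)
logit_reg.fit(X, y)
```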

@lorentzenchr (Member)

Could you try with a smaller tol, 1e-6 or smaller?
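
For example (the tol value and the max_iter bump are illustrative; LogisticRegression's default tol is 1e-4):

```python
# Tighter convergence tolerance; raise max_iter so the solver can actually reach it
logit_reg = LogisticRegression(penalty="l2", C=1e42, solver='liblinear',
                               random_state=0, tol=1e-8, max_iter=10_000)
logit_reg.fit(train_X, train_y)
```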

glemaitre added the Needs Investigation label and removed the Needs Triage label on Oct 18, 2022
@glemaitre (Member)

I decreased the tol to 1e-8 and I still see non-deterministic results between a Linux machine and a macOS machine.

The number of iterations is also not the same, and even when it is, the results differ.

When the regularization increases, this behaviour disappears.

Note that this behaviour is also present with the "lbfgs" solver.
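
For instance (C=1.0 here is illustrative), with non-negligible regularization the objective is strongly convex, so solvers on different machines should agree to many digits:

```python
# Moderate L2 penalty: the optimum is well-conditioned and reproducible
logit_reg = LogisticRegression(penalty="l2", C=1.0, solver='liblinear', random_state=0)
logit_reg.fit(train_X, train_y)
```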

@lorentzenchr (Member)

What happens with C=np.inf instead of C=1e42?

@glemaitre (Member)

> What happens with C=np.inf instead of C=1e42?

It doesn't converge :)

With lbfgs, since verbose is enabled, you can see small numerical differences (in terms of objective and gradient) at each iteration that lead the coefficients to diverge after a while.
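
A sketch of that check, assuming scikit-learn 1.1 (where penalty="none" is still the accepted way to request an unpenalized fit with lbfgs; liblinear does not support it):

```python
# Verbose lbfgs prints per-iteration progress, which makes it possible to
# see where runs on different machines start to drift numerically
logit_reg = LogisticRegression(penalty="none", solver='lbfgs',
                               max_iter=10_000, verbose=1)
logit_reg.fit(train_X, train_y)
```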


MujassimJamal commented Dec 19, 2022

@NickBrecht, since liblinear approximates the weights that minimize the cost function, different results are possible due to small round-off errors at the time of convergence. It also depends on:

- the computational environment,
- the number of iterations,
- and the learning rate.

In TensorFlow, for example, we can pass logits directly to the cost function to avoid this type of round-off error (see the snippet below).
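
For context, the TensorFlow pattern referred to here is the from_logits option, which fuses the sigmoid into the loss computation for numerical stability (this snippet is only an illustration of that option):

```python
import tensorflow as tf

# The loss consumes raw logits (pre-sigmoid scores), avoiding the
# round-off incurred by applying sigmoid and then cross-entropy separately
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
labels = tf.constant([[1.0], [0.0]])
logits = tf.constant([[2.5], [-1.0]])
print(loss_fn(labels, logits).numpy())
```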

@NickBrecht (Author)

@MujassimJamal thanks for the input. I'm still quite skeptical that differences in approximated weights and rounding could yield the nontrivial differences in the numbers above. If we saw differences in the hundred-thousandths, sure, but a difference of 0.1 to 0.2+ seems suspect. Do you feel otherwise?

Iterations and learning rate were accounted for in many, many tests (including @glemaitre's, seemingly). I attempted to account for computational environments, as stated in my original post. I'm aware that differences in CPU architecture could yield some infinitesimally small differences, but the conda, Python, BLAS, MKL, etc. versions are all the same.

@lorentzenchr (Member)

I can't test this myself right now. Let's exclude collinearity and perfect separability (see the sketch after this list):

  1. @NickBrecht mentioned collinearity. What's the smallest eigenvalue of X? And while the weights/coefficients differ across computers, do the predictions (predict_proba on the training set) also differ?
  2. It seems unlikely, and I don't know how to test this systematically, but are there feature combinations (subspaces of X) in which y is always the same class label (either always 0 or always 1)?
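
A sketch of both checks, reusing the variables from the reproducer above:

```python
import numpy as np

# 1. Conditioning: a near-zero smallest singular value of X signals collinearity
singular_values = np.linalg.svd(X.to_numpy(dtype=float), compute_uv=False)
print("smallest singular value:", singular_values.min())

# 2. Compare predictions rather than coefficients: on a flat optimum,
#    different coefficient vectors can produce nearly identical probabilities
proba = logit_reg.predict_proba(train_X)[:, 1]
np.save("proba_this_machine.npy", proba)  # diff this file across machines
```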

@lorentzenchr (Member)

Maybe the problem/target is separable; see also #18264 (a quick check is sketched below).
I'll close this due to inactivity and the lack of a minimal reproducer (I won't download arbitrary cloud files).
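
For completeness, a quick heuristic separability check under the same setup: fit a very weakly regularized linear SVM and see whether it classifies the training data perfectly.

```python
from sklearn.svm import LinearSVC

# 100% training accuracy with an (almost) hard-margin linear SVM suggests the
# classes are linearly separable, in which case the unregularized logistic
# MLE does not exist and the coefficients diverge (cf. #18264)
svm = LinearSVC(C=1e6, max_iter=100_000)
svm.fit(train_X, train_y)
print("training accuracy:", svm.score(train_X, train_y))
```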
