[BUG] statsmodel.formula.api -- order of terms impacts fit_regularized results #7838

Open
snacamavtyn opened this issue Nov 3, 2021 · 4 comments

snacamavtyn commented Nov 3, 2021

Describe the bug

The example code shows that the fitted params differ depending solely on the order in which the polynomial terms are declared in the formula. The problem affects only fit_regularized(), not fit().

Code Sample, a copy-pastable example if possible

import statsmodels.formula.api as smf
import statsmodels.api as sm
import pandas as pd
import random
#print(sm.show_versions())
data = []
for x1 in range(1, 100) :
    for x2 in range(1, 100) :
        x1 = random.random()
        x2 = random.random()
        data.append((x1, x2, 3*x1*x1 + 2*x1*x2 - 1*x1))
df = pd.DataFrame(data, columns=['X1', 'X2', 'Y'])
res0 = smf.ols(formula = 'Y ~ I(X1*X2) + I(X1**2) + X1 - 1 - X2', data=df).fit_regularized()
print(f"\tfit_regularized#1\n{res0.params}")
res1 = smf.ols(formula = 'Y ~ X1 + I(X1*X2) + I(X1**2) - 1 - X2', data=df).fit()
print(f"\tfit_regularized#2\n{res1.params}")
res2 = smf.ols(formula = 'Y ~ I(X1*X2) + I(X1**2) + X1 - 1 - X2', data=df).fit()
print(f"\tfit#1\n{res2.params}")
res3 = smf.ols(formula = 'Y ~ X1 + I(X1*X2) + I(X1**2) - 1 - X2', data=df).fit()
print(f"\tfit#2\n{res3.params}")

Output of the script

	fit_regularized#1
I(X1 * X2)    1.999803
I(X1 ** 2)    2.963199
X1           -0.972459
dtype: float64
	fit_regularized#2
X1           -1.0
I(X1 * X2)    2.0
I(X1 ** 2)    3.0
dtype: float64
	fit#1
I(X1 * X2)    2.0
I(X1 ** 2)    3.0
X1           -1.0
dtype: float64
	fit#2
X1           -1.0
I(X1 * X2)    2.0
I(X1 ** 2)    3.0
dtype: float64

Output of import statsmodels.api as sm; sm.show_versions()

INSTALLED VERSIONS
------------------
Python: 3.9.7.final.0
OS: Darwin 20.6.0 Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64 x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

statsmodels
===========

Installed: 0.13.0 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/statsmodels)

Required Dependencies
=====================

cython: 0.29.13 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/Cython)
numpy: 1.20.1 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/numpy)
scipy: 1.6.1 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/scipy)
pandas: 1.2.3 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas)
dateutil: 2.8.1 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/dateutil)
patsy: 0.5.2 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/patsy)

Optional Dependencies
=====================

matplotlib: 3.3.4 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/matplotlib)
    backend: MacOSX 
cvxopt: Not installed
joblib: 1.0.1 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/joblib)

Developer Tools
================

IPython: 7.21.0 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/IPython)
    jinja2: 3.0.1 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/jinja2)
sphinx: Not installed
    pygments: 2.8.0 (/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pygments)
pytest: Not installed
virtualenv: Not installed

Problem is reproducible under Linux as well


INSTALLED VERSIONS
------------------
Python: 3.6.8.final.0
OS: Linux 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019 x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None

Statsmodels
===========

Installed: 0.10.1 (/org/tools/pypi/3.6.8/1/el-7-x86_64/lib/python3.6/site-packages/statsmodels)

Required Dependencies
=====================

cython: 0.29.13 (/org/tools/pypi/3.6.8/1/el-7-x86_64/lib/python3.6/site-packages/Cython)
numpy: 1.17.3 (/org/tools/pypi/3.6.8/1/el-7-x86_64/lib/python3.6/site-packages/numpy)
scipy: 1.3.1 (/org/tools/pypi/3.6.8/1/el-7-x86_64/lib/python3.6/site-packages/scipy)
pandas: 0.25.2 (/org/tools/pypi/3.6.8/1/el-7-x86_64/lib/python3.6/site-packages/pandas)
dateutil: 2.8.0 (/org/tools/pypi/3.6.8/1/el-7-x86_64/lib/python3.6/site-packages/dateutil)
patsy: 0.5.1 (/org/tools/pypi/3.6.8/1/el-7-x86_64/lib/python3.6/site-packages/patsy)
@snacamavtyn snacamavtyn changed the title statsmodel.formula.api -- order of terms impacts fit_regularized results [BUG] statsmodel.formula.api -- order of terms impacts fit_regularized results Nov 3, 2021
@bashtage
Member

bashtage commented Nov 5, 2021

Your fit_regularized#2 is just calling fit(). Can you update it with a call to fit_regularized()?

@bashtage
Member

bashtage commented Nov 5, 2021

FWIW I also see the same when I call fit_regularized on different specifications.

@josef-pkt
Member

We need convergence information, e.g. by increasing maxiter:

res0 = smf.ols(formula = 'Y ~ I(X1*X2) + I(X1**2) + X1 - 1 - X2', data=df).fit_regularized(maxiter=200)
print(f"\tfit_regularized#1\n{res0.params}")
res1 = smf.ols(formula = 'Y ~ X1 + I(X1*X2) + I(X1**2) - 1 - X2', data=df).fit_regularized(maxiter=200)
print(f"\tfit_regularized#2\n{res1.params}")
res2 = smf.ols(formula = 'Y ~ I(X1*X2) + I(X1**2) + X1 - 1 - X2', data=df).fit()
print(f"\tfit#1\n{res2.params}")
res3 = smf.ols(formula = 'Y ~ X1 + I(X1*X2) + I(X1**2) - 1 - X2', data=df).fit()
print(f"\tfit#2\n{res3.params}")

	fit_regularized#1
I(X1 * X2)    2.000000
I(X1 ** 2)    2.999998
X1           -0.999999
dtype: float64
	fit_regularized#2
X1           -0.999993
I(X1 * X2)    2.000000
I(X1 ** 2)    2.999991
dtype: float64
	fit#1
I(X1 * X2)    2.0
I(X1 ** 2)    3.0
X1           -1.0
dtype: float64
	fit#2
X1           -1.0
I(X1 * X2)    2.0
I(X1 ** 2)    3.0
dtype: float64
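
A minimal pure-Python sketch of why the term order can matter before convergence. It assumes that fit_regularized with its default settings behaves like cyclic coordinate descent on an unpenalized least-squares problem; that is an assumption about the solver, not something confirmed in this thread. The function coordinate_descent_ols, the seed, the sample size, and the sweep counts are all illustrative choices and not statsmodels API.

```python
import random


def coordinate_descent_ols(columns, y, order, n_sweeps):
    """Cyclic coordinate descent for unpenalized least squares.

    columns: dict mapping term name -> list of column values
    order:   term names in the order the coordinates are visited
    """
    beta = {name: 0.0 for name in order}
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    sq_norm = {name: sum(v * v for v in columns[name]) for name in order}
    for _ in range(n_sweeps):
        for name in order:
            col = columns[name]
            # Exact minimizer along this coordinate with the others held fixed.
            step = sum(c * r for c, r in zip(col, resid)) / sq_norm[name]
            beta[name] += step
            resid = [r - step * c for r, c in zip(resid, col)]
    return beta


random.seed(0)  # fixed seed so both orderings see the same data
x1 = [random.random() for _ in range(500)]
x2 = [random.random() for _ in range(500)]
columns = {
    "X1": x1,
    "I(X1 * X2)": [a * b for a, b in zip(x1, x2)],
    "I(X1 ** 2)": [a * a for a in x1],
}
# Same noise-free response as in the report: 3*x1^2 + 2*x1*x2 - x1
y = [3 * a * a + 2 * a * b - a for a, b in zip(x1, x2)]

order_1 = ["I(X1 * X2)", "I(X1 ** 2)", "X1"]
order_2 = ["X1", "I(X1 * X2)", "I(X1 ** 2)"]


def max_gap(n_sweeps):
    """Largest per-term difference between the two visiting orders."""
    b1 = coordinate_descent_ols(columns, y, order_1, n_sweeps)
    b2 = coordinate_descent_ols(columns, y, order_2, n_sweeps)
    return max(abs(b1[name] - b2[name]) for name in order_1)


gap_short = max_gap(50)   # few sweeps: iterates still depend on the order
gap_long = max_gap(400)   # many sweeps: both orders approach the same solution
print(f"max gap after 50 sweeps:  {gap_short:.3e}")
print(f"max gap after 400 sweeps: {gap_long:.3e}")
```

X1 and X1**2 are strongly correlated on [0, 1], so coordinate-wise updates shrink the error slowly. Under the stated assumption, two visiting orders can still disagree after a small number of sweeps, and the gap should close as the number of sweeps grows.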

@snacamavtyn
Author

Ah! Fixed the typo in the test case

import statsmodels.formula.api as smf
import statsmodels.api as sm
import pandas as pd
import random
data = []
for x1 in range(1, 200) :
    for x2 in range(1, 200) :
        x1 = random.random()
        x2 = random.random()
        data.append((x1, x2, 3*x1*x1 + 2*x1*x2 - 1*x1))
df = pd.DataFrame(data, columns=['X1', 'X2', 'Y'])
res0 = smf.ols(formula = 'Y ~ I(X1*X2) + I(X1**2) + X1 - 1 - X2', data=df).fit_regularized()
print(f"\tfit_regularized#1\n{res0.params}")
res1 = smf.ols(formula = 'Y ~ X1 + I(X1*X2) + I(X1**2) - 1 - X2', data=df).fit_regularized()
print(f"\tfit_regularized#2\n{res1.params}")
res2 = smf.ols(formula = 'Y ~ I(X1*X2) + I(X1**2) + X1 - 1 - X2', data=df).fit()
print(f"\tfit#1\n{res2.params}")
res3 = smf.ols(formula = 'Y ~ X1 + I(X1*X2) + I(X1**2) - 1 - X2', data=df).fit()
print(f"\tfit#2\n{res3.params}")

Output

	fit_regularized#1
I(X1 * X2)    2.000152
I(X1 ** 2)    2.964253
X1           -0.973314
dtype: float64
	fit_regularized#2
X1           -0.853654
I(X1 * X2)    2.000769
I(X1 ** 2)    2.816278
dtype: float64
	fit#1
I(X1 * X2)    2.0
I(X1 ** 2)    3.0
X1           -1.0
dtype: float64
	fit#2
X1           -1.0
I(X1 * X2)    2.0
I(X1 ** 2)    3.0
dtype: float64
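
Because the two formulas list the terms in different orders, the params have to be compared by term name rather than by position. A small sketch of that comparison, using the two fit_regularized results printed above; the helper name param_gaps is hypothetical.

```python
# Values copied from the fit_regularized#1 and fit_regularized#2 output above.
run_1 = {"I(X1 * X2)": 2.000152, "I(X1 ** 2)": 2.964253, "X1": -0.973314}
run_2 = {"X1": -0.853654, "I(X1 * X2)": 2.000769, "I(X1 ** 2)": 2.816278}


def param_gaps(a, b):
    """Absolute difference per term, matching entries by name, not position."""
    if a.keys() != b.keys():
        raise ValueError("the two fits do not contain the same terms")
    return {name: abs(a[name] - b[name]) for name in a}


gaps = param_gaps(run_1, run_2)
worst = max(gaps, key=gaps.get)  # term with the largest disagreement
for name, gap in gaps.items():
    print(f"{name:<12} {gap:.6f}")
print(f"largest gap: {worst}")
```

With statsmodels results, the same comparison could be done by aligning the two params Series on their index before subtracting.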
