Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: pinv threshold, linear regression for almost singular #2628

Open
josef-pkt opened this issue Sep 29, 2015 · 0 comments
Open

REF: pinv threshold, linear regression for almost singular #2628

josef-pkt opened this issue Sep 29, 2015 · 0 comments

Comments

@josef-pkt
Copy link
Member

based on a stackoverflow question

because of numerical precision problems the pinv singular treatment doesn't kick in early enough for essentially singular design matrices.

The below script return huge non-singular numbers for about a third of the runs.
Note: we are printing the warning in summary()

for example

>>> np.linalg.pinv(mod.exog).dot(mod.endog)
array([ 2.5      ,  3.59375  ,  4.296875 ,  2.9140625,  3.8046875])
>>> np.linalg.pinv(mod.exog, rcond=1.5e-15).dot(mod.endog)
array([ 1.6,  2.6,  3.6,  3.4,  4.4])

qr produces similar ill determined results

possible solution:

  • change threshold for pinv
  • provide more methods to handle/detect near singular (see other issues, including penalized and automatic dropping of collinear variables)
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 16 00:44:55 2015

Author: http://stackoverflow.com/questions/32599159/robustness-issue-of-statsmodel-linear-regression-ols-python


"""
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

seed = np.random.randint(100000)
#seed = 62802  #df_model=4, bad numbers
#seed = 46269  #df_model=4, bad numbers38680
print(seed)
np.random.seed(seed)


nbData = 600
rand1 = np.random.uniform(size=nbData)
rand2 = np.random.uniform(size=nbData)
a = 1 * (rand1 <= (1.0/3.0))
b = 1 * (((1.0/3.0)< rand1) & (rand1< (4/5.0)))
c = 1-b-a
d = 1 * (rand2 <= (3.0/5.0))
e = 1-d
weigths = [1,2,3,1,2]
y = a+2*b+3*c+4*d+5*e
df = pd.DataFrame({'y':y, 'a':a, 'b':b, 'c':c, 'd':d, 'e':e})

mod = ols(formula='y ~ a + b + c + d + e - 1', data=df)
res = mod.fit()
print(res.summary())

# try changing pinv threshold
print(np.linalg.pinv(mod.exog).dot(mod.endog))
print(np.linalg.pinv(mod.exog, rcond=1.5e-15).dot(mod.endog))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant