
iolib not found #66

Closed
agramfort opened this issue Sep 1, 2011 · 15 comments

@agramfort

/Users/alex/local/lib/python2.7/site-packages/scikits/statsmodels/genmod/generalized_linear_model.py in summary(self, yname, xname, title, returns)
659 """
660 import time as Time
--> 661 from iolib import SimpleTable
662 from stattools import jarque_bera, omni_normtest, durbin_watson
663

ImportError: No module named iolib

It looks like a relative-import problem.

@josef-pkt
Member

The short answer: the import path needs to be adjusted, and I think Skipper already did this in the pandas-integration branch.

The longer answer: Summary() for the other models is what Vincent was working on at the end in his branch. I haven't had time to look at it (it is still on Launchpad and needs a manual merge), and I think there are no tests for any summary methods in the test suite.

I still need to check the status of summary for GLM and RLM.

@josef-pkt
Member

from scikits.statsmodels.iolib import SimpleTable
from scikits.statsmodels.stats.stattools import jarque_bera, omni_normtest, durbin_watson

but the version in 0.3 looks unfinished and has two extra '=='

@agramfort
Author

Thanks for the feedback. I was looking for a way to get p-values on the regression coefficients out of a logistic regression. If you have a simple solution for an urgent need, that would be great.

@josef-pkt
Member

results.pvalues? For example, binom_results.tvalues and binom_results.pvalues, using examples/example_glm.py.

The parameter summary table works for GLM after fixing the import path.

It should also be available using Logit in the discrete module.
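Under the hood, those attributes are two-sided Wald tests on the coefficients. A minimal numpy/scipy sketch of the computation (the `params`, `bse`, and `df_resid` values here are made-up stand-ins for a fitted model's output, not statsmodels internals):

```python
import numpy as np
from scipy import stats

# made-up stand-ins for a fitted model's coefficients and standard errors
params = np.array([2.5, -0.3, 0.8])
bse = np.array([0.5, 0.4, 0.2])
df_resid = 20  # illustrative residual degrees of freedom

tvalues = params / bse                               # Wald statistics
pvalues = stats.t.sf(np.abs(tvalues), df_resid) * 2  # two-sided p-values

print(tvalues)
print(pvalues)
```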

@ghost assigned josef-pkt Sep 5, 2011
@josef-pkt
Member

I'm working on it here, but it will still take some time:
https://github.com/josef-pkt/statsmodels/commits/summary-refactoring

@josef-pkt
Member

While comparing summary methods with R, I saw that the p-values in R are based on the normal distribution, while the p-values in statsmodels GLM are based on the t distribution; the t-values are identical.

Example from the R help file.

Values in R:

>>> stats.norm.sf(np.abs(res.tvalues))*2
array([  5.42677102e-71,   1.00000000e+00,   1.00000000e+00,
         2.46471164e-02,   1.28486515e-01])

Values in statsmodels:

>>> stats.t.sf(np.abs(res.tvalues), res.df_resid)*2
array([  5.83392507e-05,   1.00000000e+00,   1.00000000e+00,
         8.79477411e-02,   2.03120034e-01])

Very small sample: 9 observations, 5 regressors including the constant, so the difference between t and norm is pretty large.

>>> res.df_resid
4
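The size of that t-versus-normal gap for a fixed t-value can be checked directly with scipy (the t-statistic and df values here are illustrative, not taken from the example above):

```python
from scipy import stats

tval = 2.5  # illustrative t-statistic
for df in [4, 30, 1000]:
    p_t = stats.t.sf(tval, df) * 2      # t-based two-sided p-value
    p_norm = stats.norm.sf(tval) * 2    # normal-based two-sided p-value
    print(df, round(p_t, 4), round(p_norm, 4))
```

With df = 4 the two p-values differ by a factor of about five; by df = 1000 they agree to three decimals.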

@agramfort
Author

Maybe a naive question, but it seems I get crazy p-values with:

import numpy as np
import scikits.statsmodels.api as sm
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
X = X[y != 2]
y = y[y != 2]
X = sm.add_constant(X)
results = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print results.pvalues

In [42]: run test_logreg_pvalues.py
[ 0.9997972 0.99971177 0.99951518 0.99963353 0.99996525]

What am I missing? Thanks.

@josef-pkt
Member

I don't seem to have scikits.learn available right now on my computer.
Looking at the csv files, the effect of the features looks big. If you check the means between groups 0 and 1, they all look pretty different. So I would expect the p-values to be small, but 0.999 is very large, much larger than I would expect.

Just one guess: statsmodels' binomial has a problem if there is perfect prediction. (I will have to look up the details for this case.) Are there any observations that are misclassified? Or what is the fraction of misclassified observations?

I don't have any other ideas until I look at the data.
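The fraction of misclassified observations can be checked directly from the fitted probabilities. A minimal sketch (the arrays are illustrative; `fitted` stands in for something like `results.fittedvalues`):

```python
import numpy as np

# illustrative fitted probabilities and observed 0/1 labels
fitted = np.array([0.01, 0.20, 0.95, 0.99])
y = np.array([0, 0, 1, 1])

# threshold at 0.5 and compare against the observed labels
misclassified = (fitted > 0.5).astype(int) != y
print(misclassified.sum(), misclassified.mean())  # 0 0.0
```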

@agramfort
Author

you can access the data here:

http://mldata.org/repository/data/viewslug/iris/

@josef-pkt
Member

>>> np.max(np.abs(results.fittedvalues - y))
3.3864987480924924e-09

Looks like a perfect fit.

I also tried discrete.Logit, but the numbers there don't seem to make sense.

I just had a very fast look, so it's still possible that something else is going on.

In the complete-separation case the likelihood function has problems: it is not finite, or has the wrong curvature.
We discussed this but haven't done anything about it. If I remember correctly, except for warning the user (and stopping the maximization) there is not much we can do.

@agramfort
Author

Thanks for taking a look. A warning would definitely be helpful. Out of curiosity, how does R behave in this degenerate case? They might have a trick.

@josef-pkt
Member

I never looked at this case in R.
Bruce provided links to the SAS documentation: they stop the maximization after detecting the problem, without waiting for it to go off to infinity or hit maxiter for the optimization, and they warn the user.
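A SAS-style check along those lines could compare the fitted probabilities against the observed outcomes and warn before reporting results. A sketch (the function name and tolerance are my own assumptions, not statsmodels API):

```python
import warnings
import numpy as np

def check_perfect_prediction(fitted, y, tol=1e-8):
    # warn if the fitted probabilities reproduce the outcomes exactly,
    # which signals complete separation
    if np.max(np.abs(fitted - y)) < tol:
        warnings.warn("Perfect separation detected; "
                      "maximum likelihood results are not reliable.")
        return True
    return False

y = np.array([0, 0, 1, 1])
perfect = np.array([1e-10, 2e-10, 1 - 1e-10, 1 - 1e-10])
mixed = np.array([0.2, 0.3, 0.7, 0.8])

print(check_perfect_prediction(perfect, y))  # True (with a warning)
print(check_perfect_prediction(mixed, y))    # False
```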

@josef-pkt
Member

The p-values look more reasonable after misclassifying some observations:

pvalues with y[:5] = 1
[ 0.80703843  0.39039945  0.18388126  0.91594413  0.86159542]
pvalues with y[:10] = 1
[ 0.32384945  0.71572876  0.07024508  0.95266349  0.47546027]

@agramfort
Author

Better indeed. It would be great to reproduce the SAS behavior.

@josef-pkt
Member

Summary refactoring has been merged; see #76 and related tickets.

For perfect separation, see old ticket #39.
I added a warning text to the summary method of Logit and Probit.

A full resolution with a warning or exception needs more refactoring; see the perfect-prediction branch.
