Glm weights rebased #2835
Conversation
new test failure in (edit: I moved the details that I had added here to #2834). The freq_weights related unit tests pass.
As green as it gets on TravisCI. @thequackdaddy The main next task is to clean up the unit tests: don't use the internet, and try to speed them up (they seem to add minutes to the test runs).
Sounds good. Should I add the R insurance dataset to our datasets? It seems straightforward enough to do that per the documentation. Then I can use the same data and just get rid of the
I like that dataset for this only because it's quite clear.
I'm not sure about the insurance dataset. From the description, it's from an insurance company and was used in an article in a conference book that doesn't have a copy on the internet. We can check if we can reuse another dataset, or we could also include it in the test directory given that it is a small file.
The existing exposure unit tests use the cpunish data with fake exposure numbers.
Ahhhh, I see. I can re-do the tests on my PR with cpunish and fake weights instead. Once I do that, can you incorporate them into your project? My git-speak may be wrong, but then you could merge my latest and greatest. Does that plan work? Happy to help, and somewhat curious about the best practice for a workflow at this point. You clearly are more knowledgeable about this new-fangled git do-hickey.
The standard workflow in this case is that you checkout my branch, make your changes, push, and open a PR against my branch. Then I'll merge it into my branch. You can create a branch and checkout either by adding my fork as a remote.

cpunish with fake offset is used in test_glm.py TestGlmPoissonOffset. If we need it in many test classes, it might be worth it to load it only once into the module namespace. The unrepeated dataset is very small. (I have a day trip to the US tomorrow and will be mostly offline.)

Related aside: I was searching and browsing the internet to see whether there are public domain claims datasets around, but it wasn't successful. (Our trend is to use whatever data we can find for examples, and to rely now more on rdatasets for documentation examples.)
Josef, OK, I think this should work. I've edited the tests to use cpunish. Let me know what you find. I did a PR against your branch on your account.
I merged the test changes, and TravisCI is testing them here.

In general: if you have time, then you could work on adding
The first test run is green in only 9 minutes, so this is fast. |
I will work on the docstrings later. I presume I should make a new PR against your PR once it is ready, no? I think the docstrings only need to go into the GLM code base. I didn't realize until recently that some other classes inherit from GLM and thus will have freq_weights available. Any advice on finding them, or should I ignore that wrinkle? I'm less familiar with documentation standards (sphinx, I think?) than with git.
We use the numpydoc standard for the docstring format. But most of it can be done by following the example of the current docstrings. I made an inline comment about where to add the new parameter in your original PR. Yes, again a PR against my branch; once this is merged, it will be possible to branch from master again. AFAIR, GLM doesn't have subclasses. At least not yet. It's all in one class.
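The numpydoc format mentioned above can be sketched for the new parameter like this. This is a hedged illustration only: the exact wording and the stand-in function name are assumptions, not the final docstring from this PR.

```python
# Hedged sketch of a numpydoc-format "Parameters" entry for the new
# freq_weights argument; wording is illustrative, not from the PR.
def example_docstring():
    """Stand-in function carrying the sample docstring.

    Parameters
    ----------
    freq_weights : array-like, optional
        1d array of frequency weights. If None, a vector of ones with
        the same length as ``endog`` is used, reproducing the
        unweighted behavior.
    """
    return None

# The section header and parameter name live in the docstring text.
assert "Parameters" in example_docstring.__doc__
assert "freq_weights" in example_docstring.__doc__
```

Following the existing docstrings in the file, as suggested above, keeps the indentation and section ordering consistent with the rest of the codebase.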
I was experimenting with this and realized that this breaks when you have missing values. I need to investigate this more. Reference #805
@thequackdaddy I think we now have the generic setup for missing handling of extra arrays.
FYI: You cannot call
@josef-pkt Hey, I wanted to check in on this. Anything you need me to do to get this merged in? Also, I had those 2 PRs that I made into your branch: https://github.com/josef-pkt/statsmodels/pulls I'd like to get this merged before I start working on Tweedie. Thanks again for all your help with this.
@thequackdaddy I wasn't sure you were finished with your changes. I will look at it, and hopefully merge today.
This is about ready to be merged, after another rebase.
Needs a warning about unverified robust standard errors: todo (in follow-up).
rebased, squashed the docstring commits, and edited the last commit to improve unit tests. |
This didn't trigger the Travis test run. The previous two merge commits before the interactive rebase were green on all machines. The network graph looks clean for this.
I will check some things again.
-    def __init__(self, endog, exog, family=None, offset=None, exposure=None,
-                 missing='none', **kwargs):
+    def __init__(self, endog, exog, family=None, freq_weights=None,
+                 offset=None, exposure=None, missing='none', **kwargs):
freq_weights should go after offset and exposure.
Checking resid still seems to be an open issue; the definition is unclear.
I was unsure what to do with these. Should I apply the freq_weights to this as well? I'm not as familiar with Anscombe residuals as with deviance/Pearson. Just multiply freq_weights by this result?
@thequackdaddy thanks for the explanations. General FYI: some of my comments when I'm reading through related code might not be related to the changes in a PR. They are just things that I discover whenever I read through code. I'm also partially reading in an editor where I don't see the change sets as on GitHub.
@josef-pkt I understand these are general comments. Just providing what I know and some of my (possibly flawed) logic. Let me know which of these issues (if any) you want me to tackle. I can make PRs against your PR as we've done in the past.
-        return self.deviance - self.df_resid*np.log(self.nobs)
+        return (self.deviance -
+                (self._freq_weights.sum() - self.df_model - 1) *
+                np.log(self._freq_weights.sum()))
Duplicate calculation of the sum. I think we should add wnobs for this.
    return wls_model.fit().fittedvalues
     @cache_readonly
     def deviance(self):
-        return self.family.deviance(self._endog, self.mu)
+        return self.family.deviance(self._endog, self.mu, self._freq_weights)
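Passing freq_weights into the family's deviance amounts to weighting each observation's deviance contribution. A minimal standalone sketch for the Poisson family (plain numpy, not the actual statsmodels Family code; the function name and signature are assumptions for illustration):

```python
import numpy as np

# Hedged sketch of a frequency-weighted Poisson deviance: each
# observation's contribution is multiplied by its weight, so a weight
# of k is equivalent to repeating that observation k times.
def poisson_deviance(endog, mu, freq_weights=None):
    if freq_weights is None:
        freq_weights = np.ones_like(mu)
    # y * log(y / mu) with the convention 0 * log(0) == 0
    ratio = np.where(endog == 0, 1.0, endog / mu)
    resid_dev = endog * np.log(ratio) - (endog - mu)
    return 2.0 * np.sum(freq_weights * resid_dev)

y = np.array([1.0, 2.0, 0.0])
mu = np.array([1.5, 1.5, 0.5])
# Duplicating the first observation equals giving it freq_weight 2.
dup = poisson_deviance(np.r_[y, y[0]], np.r_[mu, mu[0]])
wtd = poisson_deviance(y, mu, freq_weights=np.array([2.0, 1.0, 1.0]))
assert abs(dup - wtd) < 1e-12
```

The repetition-equivalence check at the end is the defining property of frequency weights, which is also why the cpunish tests with fake weights can be validated against expanded data.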
In general, use keywords for keyword arguments.
This PR breaks backwards compatibility if keyword arguments are used as positional arguments.
I didn't see much else, but only quickly skimmed.

resid are a bit of a mess: for scale and analysis of deviance we need weighted, for outlier or residual diagnostics we need unweighted, I guess. However, we cannot add an argument to make this optional, because residuals are cached attributes.

I'm not sure whether the families should have freq_weights or scale first in the argument list. It's mostly used internally, and with keyword usage the order doesn't matter.
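The backwards-compatibility concern can be illustrated with toy signatures (hypothetical functions, not the actual statsmodels code): inserting a new keyword in the middle of an existing signature silently changes the meaning of purely positional calls.

```python
# Hedged illustration only; these toy functions mimic the shape of the
# old and new __init__ signatures, they are not the statsmodels code.
def old_init(endog, exog, family=None, offset=None, exposure=None):
    return {"offset": offset, "exposure": exposure}

def new_init(endog, exog, family=None, freq_weights=None,
             offset=None, exposure=None):
    return {"freq_weights": freq_weights, "offset": offset,
            "exposure": exposure}

# A caller written positionally against the old signature...
res = new_init([1], [[1]], None, [0.5])
# ...now sets freq_weights instead of offset, with no error raised.
assert res["freq_weights"] == [0.5]
assert res["offset"] is None

# Keyword usage is immune to the reordering, hence the review advice.
res_kw = new_init([1], [[1]], offset=[0.5])
assert res_kw["offset"] == [0.5] and res_kw["freq_weights"] is None
```

This is also why placing freq_weights after offset and exposure, as suggested in the inline comment above the diff, reduces the breakage for existing positional callers.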
Merged, thanks @thequackdaddy. I'm just running some examples in Stata, and everything I spot-checked agreed at quite high precision in the cpunish Poisson example. I will add a follow-up issue and most likely some tests against Stata.
I thank you more.
I opened a follow-up issue: #2849
This is a squashed and rebased version of #2805, plus