
Glm weights rebased #2835

Merged · 11 commits · Mar 9, 2016
Conversation

josef-pkt (Member)

This is a squashed and rebased version of #2805, plus:

  • fixes the GLM fit_gradient start bug
  • adds a freq_weights correction in the sandwich code (experimental, but we will likely stick with it; see the sketch below)
  • unit tests for GLM fit_gradient and for cov_type=HC0 with GLM freq_weights
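(For reference, a minimal numpy sketch of the idea behind that sandwich correction in the linear-model case, treating each row i as freq_weights[i] identical observations. The function name and the OLS-style simplification are mine, not the statsmodels implementation.)

    import numpy as np

    def hc0_with_freq_weights(X, resid, freq_weights):
        # bread: inverse of X'WX with W = diag(freq_weights)
        bread = np.linalg.inv(X.T @ (freq_weights[:, None] * X))
        # meat: outer products of the scores; a row repeated w times
        # contributes w * resid_i**2 * x_i x_i' (weight w, not w**2)
        meat = X.T @ ((freq_weights * resid**2)[:, None] * X)
        return bread @ meat @ bread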

josef-pkt (Member Author)

The new test failure in test_gradient_irls looks like a precision issue in the agreement between IRLS and Newton, or the test tolerance is too small for the convergence-criterion tolerance. The assert inside the loop doesn't tell which test case fails.

(edit: I moved the details that I had added here to #2834)

The freq_weights related unit tests pass.

josef-pkt (Member Author)

As green as it gets on TravisCI.

@thequackdaddy this now includes extra tests and fixes (fit_gradient, cov_type='HC', and max_start_irls, with a detour into convergence problems); it didn't affect the freq_weights handling in GLM itself.

The main next task is to clean up the unit tests: don't use the internet, and try to speed them up (they seem to add minutes to the test runs).

thequackdaddy (Member)

Sounds good. Should I add the R Insurance dataset to our datasets? It seems straightforward enough per the documentation. Then I can use the same data and just get rid of the get_rdataset call. I presume the Insurance dataset is public domain, since R distributes it. I can add the dataset in a new PR, and then we can change the unit tests to use it instead.

I like that dataset for this only because it's quite clear that Holders represents frequency weights. Alternatively, I guess we could randomly assign frequency weights and use another dataset.

josef-pkt (Member Author)

I'm not sure about the Insurance dataset. From the description, it's from an insurance company and was used in an article in a conference book that doesn't have a copy on the internet.
On the other hand, it looks like quite a classic dataset.
https://vincentarelbundock.github.io/Rdatasets/doc/MASS/Insurance.html

We can check whether we can reuse another dataset, or we could include it in the test directory, given that it is a small file.
To make the replicated dataset smaller and speed up the tests, we could divide Holders by, for example, 10, with a minimum of 1 or 2. (In my experiments I divided by 2.)
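(A sketch of how that could look. get_rdataset and GLM's freq_weights argument are the real APIs under discussion; the model specification and the division by 10 are only the experiment described above.)

    import numpy as np
    import patsy
    import statsmodels.api as sm

    data = sm.datasets.get_rdataset("Insurance", "MASS").data
    # divide Holders by 10 with a minimum of 1 to shrink the implied sample
    freq = np.maximum(np.asarray(data["Holders"]) // 10, 1)

    # illustrative model specification for the Insurance data
    y, X = patsy.dmatrices("Claims ~ District + Group + Age", data=data,
                           return_type="dataframe")
    res = sm.GLM(y, X, family=sm.families.Poisson(), freq_weights=freq).fit()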

josef-pkt (Member Author)

The existing exposure unit tests use the cpunish data with fake exposure numbers.

thequackdaddy (Member)

Ahhhh, I see. I can redo the tests on my PR with cpunish and fake weights instead. Once I do that, can you incorporate them into your branch? My git-speak may be wrong, but you could then merge my latest and greatest test_glm.py into your already rebased PR?

Does that plan work? Happy to help, and somewhat curious about the best-practice workflow at this point. You clearly are more knowledgeable about this new-fangled git doohickey.

josef-pkt (Member Author)

The standard workflow in this case is that you check out my branch, make your changes, push, and open a PR against my branch. Then I'll merge it into my branch.

You can create and check out the branch either by adding my fork josef-pkt/statsmodels as a remote, or by checking out the PR branch from the main repo. I don't remember how to do either and have to search the internet each time I do this or set it up.
(GitHub adds the PR branches to the main statsmodels repo. In my setup I automatically download all PR branches, which makes it very convenient to check out PRs.)

cpunish with a fake offset is used in test_glm.py in TestGlmPoissonOffset. If we need it in many test classes, it might be worth loading it only once into the module namespace. The unrepeated dataset is very small.
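(A sketch of the module-level load; the fake weights and the model specification are illustrative.)

    import numpy as np
    import statsmodels.api as sm

    # loaded once into the test module's namespace, reused by all test classes
    cpunish_data = sm.datasets.cpunish.load_pandas().data

    # fake frequency weights, one per observation (cpunish has 17 rows)
    fweights = np.array([1, 2, 1, 3, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1])

    res = sm.GLM(cpunish_data["EXECUTIONS"],
                 sm.add_constant(cpunish_data[["INCOME", "PERPOVERTY"]]),
                 family=sm.families.Poisson(),
                 freq_weights=fweights).fit()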

(I have a day trip to the US tomorrow and will be mostly offline.)

Related aside: I searched and browsed the internet to see whether there are public domain claims datasets around, but without success.
One book has online data, http://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/data.html, which is AFAICS also in the R insuranceData package; I don't have the book.
There is a Springer book with example code in R on a blog, http://www.r-bloggers.com/r-code-for-chapter-1-of-non-life-insurance-pricing-with-glm/, but I can't find the book website for the data.

(Our trend is to use whatever data we can find for examples, and to rely more now on Rdatasets for documentation examples.)

thequackdaddy (Member)

Josef,

OK, I think this should work. I've edited the tests to use cpunish. Let me know what you find. I made a PR against your branch on your account.

josef-pkt (Member Author)

I merged the test changes, and TravisCI is testing them here.

In general: after a merge you can delete your branch and work on a new checkout for the next feature. Continuing to work on a merged branch would mix old, already merged commits with new commits.

If you have time, you could work on adding freq_weights to the docstrings.

josef-pkt (Member Author)

The first test run is green in only 9 minutes, so this is fast.

thequackdaddy (Member)

I will work on the docstrings later. I presume I should make a new PR against your PR once it's ready, no?

I think the docstrings only need to go into the GLM code base. I didn't realize until recently that some other classes inherit from GLM and thus will have freq_weights available. Any advice on finding them, or should I ignore that wrinkle? I'm less familiar with documentation standards (Sphinx, I think?) than with git.

josef-pkt (Member Author)

We use the numpy doc standard for the docstring format:
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt

But most of it can be done by following the example of the current docstrings. I made an inline comment in your original PR about where to add the new parameter.
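For instance, the new parameter could get an entry like this in the numpydoc Parameters section (the wording is only a suggestion):

    Parameters
    ----------
    freq_weights : array_like, optional
        1d array of frequency weights. Each observation is treated as if it
        occurred ``freq_weights[i]`` times; if None, all weights are set to 1.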

Yes, again a PR against my branch until this is merged; then it's possible to branch from master again.

AFAIR, GLM doesn't have subclasses, at least not yet; it's all in one class.
I might have to watch out before I merge other PRs that extend or subclass GLM, or we need to warn that freq_weights is not supported for those.

thequackdaddy (Member)

I was experimenting with this and realized that it breaks when there are missing values. I need to investigate this more.

Reference #805

josef-pkt (Member Author)

@thequackdaddy I think we now have the generic setup for missing-value handling of extra arrays.
Try to include freq_weights in the super __init__ call, similar to offset and exposure, as sketched below.
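(A rough sketch, assuming the generic missing handling picks up extra arrays passed as keywords, as it does for offset and exposure.)

    # sketch of GLM.__init__ forwarding freq_weights through the super call
    def __init__(self, endog, exog, family=None, offset=None, exposure=None,
                 freq_weights=None, missing='none', **kwargs):
        super(GLM, self).__init__(endog, exog, missing=missing, offset=offset,
                                  exposure=exposure, freq_weights=freq_weights,
                                  **kwargs)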

josef-pkt (Member Author)

FYI: You cannot call dmatrices yourself if there are nans, because patsy removes nan rows automatically but cannot remove the corresponding rows from the auxiliary/keyword arrays. So in that case the user has to handle all the nans (or just use pandas dropna).
AFAIR, we avoid missing-value handling by patsy in the formula interface and handle it inside statsmodels for all arrays.
Note: the default for missing in the formula interface is to drop rows; the array/DataFrame interface currently does not check for nans.
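(Concretely, the safe pattern when calling dmatrices yourself, with hypothetical column names:)

    import patsy

    # df: a pandas DataFrame with columns y, x1, and weights w
    df = df.dropna(subset=["y", "x1", "w"])  # align all arrays before patsy sees them
    y, X = patsy.dmatrices("y ~ x1", data=df, return_type="dataframe")
    w = df["w"]  # now row-aligned with y and X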

thequackdaddy (Member)

@josef-pkt Hey, I wanted to check in on this. Anything you need me to do to get this merged?

Also, I have those 2 PRs that I made against your branch:

https://github.com/josef-pkt/statsmodels/pulls

I'd like to get this merged before I start working on Tweedie.

Thanks again for all your help with this.

josef-pkt (Member Author)

@thequackdaddy I wasn't sure you were finished with your changes. I will look at it and hopefully merge it today.

josef-pkt (Member Author)

This is about ready to be merged, after another rebase.

It needs a warning about the unverified robust standard errors.

Todo (in follow-up):

  • optimize for the case where weights is None
  • check non-HC robust cov_params and adjust the small-sample correction
  • check reusability for analytic weights as in Stata and R; unit tests against R or Stata for aweights/var_weights

josef-pkt (Member Author)

Rebased, squashed the docstring commits, and edited the last commit to improve the unit tests.

josef-pkt (Member Author)

This didn't trigger the Travis test run. The previous two merge commits before the interactive rebase were green on all machines. The network graph looks clean for this.

josef-pkt (Member Author)

I will check some things again.
For example, I just saw that freq_weights comes before offset and exposure in the __init__ arguments, but I think it should come after, for consistency and some additional backwards compatibility.

    -def __init__(self, endog, exog, family=None, offset=None, exposure=None,
    -             missing='none', **kwargs):
    +def __init__(self, endog, exog, family=None, freq_weights=None,
    +             offset=None, exposure=None, missing='none', **kwargs):
josef-pkt (Member Author) commented inline:

freq_weights should go after offset and exposure.
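That is, keep the existing keyword order and append the new argument (sketch):

    def __init__(self, endog, exog, family=None, offset=None, exposure=None,
                 freq_weights=None, missing='none', **kwargs):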

josef-pkt (Member Author)

Checking resid still seems to be an open issue; the definition is unclear:

     def resid_anscombe(self):
    +    # TODO: Data weights?

thequackdaddy (Member)

I was unsure what to do with these. Should I apply the freq_weights to this as well? I'm not as familiar with Anscombe residuals as with deviance/Pearson residuals.

Just multiply this result by the freq_weights?


josef-pkt (Member Author)

@thequackdaddy thanks for the explanations.

General FYI: some of my comments when I'm reading through related code might not be related to the changes in a PR; they are just things I discover whenever I read through code. I'm also partially reading in an editor, where I don't see the change sets as on GitHub.
Some of the code has never been carefully reviewed in detail, and some code might change its meaning or relevance with ongoing development. Even core code needs improvement and adjustment over time, even if it has full test coverage for the original use cases.

thequackdaddy (Member)

@josef-pkt I understand these are general comments. Just providing what I know and some of my (possibly flawed) logic.

Let me know which of these issues (if any) you want me to tackle. I can make PRs against your PR as we've done in the past.

    -return self.deviance - self.df_resid * np.log(self.nobs)
    +return (self.deviance -
    +        (self._freq_weights.sum() - self.df_model - 1) *
    +        np.log(self._freq_weights.sum()))
josef-pkt (Member Author) commented inline:

Duplicate calculation of the sum. I think we should add a wnobs attribute for this.
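(Sketch of the idea: wnobs as a weighted-nobs attribute, computed once. The name follows the comment above, but treat it as proposed, not final.)

    # computed once during model setup:
    #     self.wnobs = self._freq_weights.sum()
    # bic then avoids summing the weights twice:
    return (self.deviance -
            (self.wnobs - self.df_model - 1) * np.log(self.wnobs))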

(thequackdaddy mentioned this pull request on Mar 9, 2016.)
         return wls_model.fit().fittedvalues

     @cache_readonly
     def deviance(self):
    -    return self.family.deviance(self._endog, self.mu)
    +    return self.family.deviance(self._endog, self.mu, self._freq_weights)
josef-pkt (Member Author) commented inline:

In general, use keywords for keyword arguments.

This PR breaks backwards compatibility if these keyword arguments were being passed as positional arguments.
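Concretely (hypothetical signatures, assuming scale previously followed mu in the family methods):

    # old: family.deviance(endog, mu, scale=1.)
    # new: family.deviance(endog, mu, freq_weights=1., scale=1.)
    family.deviance(endog, mu, 2.0)        # third positional used to mean scale,
                                           # now it is silently freq_weights
    family.deviance(endog, mu, scale=2.0)  # keyword usage keeps working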

josef-pkt (Member Author)

I didn't see much else, but I only quickly skimmed the families.

The resid attributes are a bit of a mess: for scale and analysis of deviance we need them weighted; for outlier or residual diagnostics we need them unweighted, I guess. However, we cannot add an argument to make this optional, because the residuals are cached attributes.

I'm not sure whether the families should have freq_weights or scale first in the argument list. It's mostly used internally, and with keyword usage the order doesn't matter.

josef-pkt added a commit referencing this pull request on Mar 9, 2016.
josef-pkt merged commit 6903819 into statsmodels:master on Mar 9, 2016.
josef-pkt (Member Author)

Merged. Thanks @thequackdaddy!

I'm just running some examples in Stata, and everything I spot-checked in the cpunish Poisson example agreed at quite high precision. I will open a follow-up issue and most likely add some tests against Stata.

thequackdaddy (Member)

I thank you more.

josef-pkt (Member Author)

I opened a follow-up issue: #2849.
Main point: I think we are going to change the definition of the cached resid_xxx attributes to be unweighted, i.e. observation-specific. I added a warning to the docstring that this isn't settled yet.

josef-pkt deleted the glm_weights branch on Apr 26, 2016.