
Glm weights rebased #2835

Merged · 11 commits · Mar 9, 2016
Conversation

josef-pkt (Member)

This is a squashed and rebased version of #2805, plus:

  • fixes the GLM fit_gradient start bug
  • adds a freq_weights correction in the sandwich code (experimental, but we will likely stick with it; see the sketch below)
  • unit tests for GLM fit_gradient and for cov_type=HC0 with GLM freq_weights
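(For reference, a minimal numpy sketch of the idea behind that sandwich correction in the linear-model case, treating each row i as freq_weights[i] identical observations. The function name and the OLS-style simplification are mine, not the statsmodels implementation.)

    import numpy as np

    def hc0_with_freq_weights(X, resid, freq_weights):
        # bread: inverse of X'WX with W = diag(freq_weights)
        bread = np.linalg.inv(X.T @ (freq_weights[:, None] * X))
        # meat: outer products of the scores; a row repeated w times
        # contributes w * resid_i**2 * x_i x_i' (weight w, not w**2)
        meat = X.T @ ((freq_weights * resid**2)[:, None] * X)
        return bread @ meat @ bread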

josef-pkt (Member Author)

The new test failure in test_gradient_irls looks like a precision issue in the agreement between IRLS and Newton, or the test tolerance is too small for the convergence-criterion tolerance. The assert inside the loop doesn't tell which test case fails.

(edit: I moved the details that I had added here to #2834)

The freq_weights related unit tests pass.

josef-pkt (Member Author)

As green as it gets on TravisCI.

@thequackdaddy this now includes extra tests and fixes (fit_gradient, cov_type='HC', and max_start_irls, with a detour into convergence problems); it didn't affect the freq_weights handling in GLM itself.

The main next task is to clean up the unit tests: don't use the internet, and try to speed them up (they seem to add minutes to the test runs).

thequackdaddy (Member)

Sounds good. Should I add the R Insurance dataset to our datasets? It seems straightforward enough per the documentation. Then I can use the same data and just get rid of the get_rdataset call. I presume the Insurance dataset is public domain, since R distributes it. I can add the dataset in a new PR, and then we can change the unit tests to use it instead.

I like that dataset for this only because it's quite clear that Holders represents frequency weights. Alternatively, I guess we could randomly assign frequency weights and use another dataset.

josef-pkt (Member Author)

I'm not sure about the Insurance dataset. From the description, it's from an insurance company and was used in an article in a conference book that doesn't have a copy on the internet.
On the other hand, it looks like quite a classic dataset.
https://vincentarelbundock.github.io/Rdatasets/doc/MASS/Insurance.html

We can check whether we can reuse another dataset, or we could include it in the test directory, given that it is a small file.
To make the replicated dataset smaller and speed up the tests, we could divide Holders by, for example, 10, with a minimum of 1 or 2. (In my experiments I divided by 2.)
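(A sketch of how that could look. get_rdataset and GLM's freq_weights argument are the real APIs under discussion; the model specification and the division by 10 are only the experiment described above.)

    import numpy as np
    import patsy
    import statsmodels.api as sm

    data = sm.datasets.get_rdataset("Insurance", "MASS").data
    # divide Holders by 10 with a minimum of 1 to shrink the implied sample
    freq = np.maximum(np.asarray(data["Holders"]) // 10, 1)

    # illustrative model specification for the Insurance data
    y, X = patsy.dmatrices("Claims ~ District + Group + Age", data=data,
                           return_type="dataframe")
    res = sm.GLM(y, X, family=sm.families.Poisson(), freq_weights=freq).fit()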

josef-pkt (Member Author)

The existing exposure unit tests use the cpunish data with fake exposure numbers.

thequackdaddy (Member)

Ahhhh, I see. I can redo the tests on my PR with cpunish and fake weights instead. Once I do that, can you incorporate them into your branch? My git-speak may be wrong, but you could then merge my latest and greatest test_glm.py into your already rebased PR?

Does that plan work? Happy to help, and somewhat curious about the best-practice workflow at this point. You clearly are more knowledgeable about this new-fangled git doohickey.

josef-pkt (Member Author)

The standard workflow in this case is that you check out my branch, make your changes, push, and open a PR against my branch. Then I'll merge it into my branch.

You can create and check out the branch either by adding my fork josef-pkt/statsmodels as a remote, or by checking out the PR branch from the main repo. I don't remember how to do either and have to search the internet each time I do this or set it up.
(GitHub adds the PR branches to the main statsmodels repo. In my setup I automatically download all PR branches, which makes it very convenient to check out PRs.)

cpunish with a fake offset is used in test_glm.py in TestGlmPoissonOffset. If we need it in many test classes, it might be worth loading it only once into the module namespace. The unrepeated dataset is very small.
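(A sketch of the module-level load; the fake weights and the model specification are illustrative.)

    import numpy as np
    import statsmodels.api as sm

    # loaded once into the test module's namespace, reused by all test classes
    cpunish_data = sm.datasets.cpunish.load_pandas().data

    # fake frequency weights, one per observation (cpunish has 17 rows)
    fweights = np.array([1, 2, 1, 3, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1])

    res = sm.GLM(cpunish_data["EXECUTIONS"],
                 sm.add_constant(cpunish_data[["INCOME", "PERPOVERTY"]]),
                 family=sm.families.Poisson(),
                 freq_weights=fweights).fit()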

(I have a day trip to the US tomorrow and will be mostly offline.)

Related aside: I searched and browsed the internet to see whether there are public domain claims datasets around, but without success.
One book has online data, http://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/data.html, which is AFAICS also in the R insuranceData package; I don't have the book.
There is a Springer book with example code in R on a blog, http://www.r-bloggers.com/r-code-for-chapter-1-of-non-life-insurance-pricing-with-glm/, but I can't find the book website for the data.

(Our trend is to use whatever data we can find for examples, and to rely more now on Rdatasets for documentation examples.)

thequackdaddy (Member)

Josef,

OK, I think this should work. I've edited the tests to use cpunish. Let me know what you find. I made a PR against your branch on your account.

josef-pkt (Member Author)

I merged the test changes, and TravisCI is testing them here.

In general: after a merge you can delete your branch and work on a new checkout for the next feature. Continuing to work on a merged branch would mix old, already merged commits with new commits.

If you have time, you could work on adding freq_weights to the docstrings.

josef-pkt (Member Author)

The first test run is green in only 9 minutes, so this is fast.

thequackdaddy (Member)

I will work on the docstrings later. I presume I should make a new PR against your PR once it's ready, no?

I think the docstrings only need to go into the GLM code base. I didn't realize until recently that some other classes inherit from GLM and thus will have freq_weights available. Any advice on finding them, or should I ignore that wrinkle? I'm less familiar with documentation standards (Sphinx, I think?) than with git.

josef-pkt (Member Author)

We use the numpy doc standard for the docstring format:
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt

But most of it can be done by following the example of the current docstrings. I made an inline comment in your original PR about where to add the new parameter.
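For instance, the new parameter could get an entry like this in the numpydoc Parameters section (the wording is only a suggestion):

    Parameters
    ----------
    freq_weights : array_like, optional
        1d array of frequency weights. Each observation is treated as if it
        occurred ``freq_weights[i]`` times; if None, all weights are set to 1.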

Yes, again a PR against my branch until this is merged; then it's possible to branch from master again.

AFAIR, GLM doesn't have subclasses, at least not yet; it's all in one class.
I might have to watch out before I merge other PRs that extend or subclass GLM, or we need to warn that freq_weights is not supported for those.

thequackdaddy (Member)

I was experimenting with this and realized that it breaks when there are missing values. I need to investigate this more.

Reference #805

josef-pkt (Member Author)

@thequackdaddy I think we now have the generic setup for missing-value handling of extra arrays.
Try to include freq_weights in the super __init__ call, similar to offset and exposure, as sketched below.
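(A rough sketch, assuming the generic missing handling picks up extra arrays passed as keywords, as it does for offset and exposure.)

    # sketch of GLM.__init__ forwarding freq_weights through the super call
    def __init__(self, endog, exog, family=None, offset=None, exposure=None,
                 freq_weights=None, missing='none', **kwargs):
        super(GLM, self).__init__(endog, exog, missing=missing, offset=offset,
                                  exposure=exposure, freq_weights=freq_weights,
                                  **kwargs)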

josef-pkt (Member Author)

FYI: You cannot call dmatrices yourself if there are nans, because patsy removes nan rows automatically but cannot remove the corresponding rows from the auxiliary/keyword arrays. So in that case the user has to handle all the nans (or just use pandas dropna).
AFAIR, we avoid missing-value handling by patsy in the formula interface and handle it inside statsmodels for all arrays.
Note: the default for missing in the formula interface is to drop rows; the array/DataFrame interface currently does not check for nans.
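(Concretely, the safe pattern when calling dmatrices yourself, with hypothetical column names:)

    import patsy

    # df: a pandas DataFrame with columns y, x1, and weights w
    df = df.dropna(subset=["y", "x1", "w"])  # align all arrays before patsy sees them
    y, X = patsy.dmatrices("y ~ x1", data=df, return_type="dataframe")
    w = df["w"]  # now row-aligned with y and X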

thequackdaddy (Member)

@josef-pkt Hey, I wanted to check in on this. Anything you need me to do to get this merged?

Also, I have those 2 PRs that I made against your branch:

https://github.com/josef-pkt/statsmodels/pulls

I'd like to get this merged before I start working on Tweedie.

Thanks again for all your help with this.

josef-pkt (Member Author)

@thequackdaddy I wasn't sure you were finished with your changes. I will look at it and hopefully merge it today.

josef-pkt (Member Author)

This is about ready to be merged, after another rebase.

It needs a warning about the unverified robust standard errors.

Todo (in follow-up):

  • optimize for the case where weights is None
  • check non-HC robust cov_params and adjust the small-sample correction
  • check reusability for analytic weights as in Stata and R; unit tests against R or Stata for aweights/var_weights

josef-pkt (Member Author)

Rebased, squashed the docstring commits, and edited the last commit to improve the unit tests.

josef-pkt (Member Author)

This didn't trigger the Travis test run. The previous two merge commits before the interactive rebase were green on all machines. The network graph looks clean for this.

josef-pkt (Member Author)

I will check some things again.
For example, I just saw that freq_weights comes before offset and exposure in the __init__ arguments, but I think it should come after, for consistency and some additional backwards compatibility.

    -def __init__(self, endog, exog, family=None, offset=None, exposure=None,
    -             missing='none', **kwargs):
    +def __init__(self, endog, exog, family=None, freq_weights=None,
    +             offset=None, exposure=None, missing='none', **kwargs):
josef-pkt (Member Author) commented inline:

freq_weights should go after offset and exposure.
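That is, keep the existing keyword order and append the new argument (sketch):

    def __init__(self, endog, exog, family=None, offset=None, exposure=None,
                 freq_weights=None, missing='none', **kwargs):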

josef-pkt (Member Author)

Checking resid still seems to be an open issue; the definition is unclear:

     def resid_anscombe(self):
    +    # TODO: Data weights?

thequackdaddy (Member)

I was unsure what to do with these. Should I apply the freq_weights to this as well? I'm not as familiar with Anscombe residuals as with deviance/Pearson residuals.

Just multiply this result by the freq_weights?


josef-pkt (Member Author)

@thequackdaddy thanks for the explanations.

General FYI: some of my comments when I'm reading through related code might not be related to the changes in a PR; they are just things I discover whenever I read through code. I'm also partially reading in an editor, where I don't see the change sets as on GitHub.
Some of the code has never been carefully reviewed in detail, and some code might change its meaning or relevance with ongoing development. Even core code needs improvement and adjustment over time, even if it has full test coverage for the original use cases.

thequackdaddy (Member)

@josef-pkt I understand these are general comments. Just providing what I know and some of my (possibly flawed) logic.

Let me know which of these issues (if any) you want me to tackle. I can make PRs against your PR as we've done in the past.

    -return self.deviance - self.df_resid * np.log(self.nobs)
    +return (self.deviance -
    +        (self._freq_weights.sum() - self.df_model - 1) *
    +        np.log(self._freq_weights.sum()))
josef-pkt (Member Author) commented inline:

Duplicate calculation of the sum. I think we should add a wnobs attribute for this.
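(Sketch of the idea: wnobs as a weighted-nobs attribute, computed once. The name follows the comment above, but treat it as proposed, not final.)

    # computed once during model setup:
    #     self.wnobs = self._freq_weights.sum()
    # bic then avoids summing the weights twice:
    return (self.deviance -
            (self.wnobs - self.df_model - 1) * np.log(self.wnobs))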

(thequackdaddy mentioned this pull request on Mar 9, 2016.)
         return wls_model.fit().fittedvalues

     @cache_readonly
     def deviance(self):
    -    return self.family.deviance(self._endog, self.mu)
    +    return self.family.deviance(self._endog, self.mu, self._freq_weights)
josef-pkt (Member Author) commented inline:

In general, use keywords for keyword arguments.

This PR breaks backwards compatibility if these keyword arguments were being passed as positional arguments.
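Concretely (hypothetical signatures, assuming scale previously followed mu in the family methods):

    # old: family.deviance(endog, mu, scale=1.)
    # new: family.deviance(endog, mu, freq_weights=1., scale=1.)
    family.deviance(endog, mu, 2.0)        # third positional used to mean scale,
                                           # now it is silently freq_weights
    family.deviance(endog, mu, scale=2.0)  # keyword usage keeps working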

josef-pkt (Member Author)

I didn't see much else, but I only quickly skimmed the families.

The resid attributes are a bit of a mess: for scale and analysis of deviance we need them weighted; for outlier or residual diagnostics we need them unweighted, I guess. However, we cannot add an argument to make this optional, because the residuals are cached attributes.

I'm not sure whether the families should have freq_weights or scale first in the argument list. It's mostly used internally, and with keyword usage the order doesn't matter.

josef-pkt added a commit referencing this pull request on Mar 9, 2016.
josef-pkt merged commit 6903819 into statsmodels:master on Mar 9, 2016.
josef-pkt (Member Author)

Merged. Thanks @thequackdaddy!

I'm just running some examples in Stata, and everything I spot-checked in the cpunish Poisson example agreed at quite high precision. I will open a follow-up issue and most likely add some tests against Stata.

thequackdaddy (Member)

I thank you more.

josef-pkt (Member Author)

I opened a follow-up issue: #2849.
Main point: I think we are going to change the definition of the cached resid_xxx attributes to be unweighted, i.e. observation-specific. I added a warning to the docstring that this isn't settled yet.

josef-pkt deleted the glm_weights branch on Apr 26, 2016.