Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed duplicate svd in RegressionModel #1092

Merged
merged 2 commits into from Oct 24, 2013

Conversation

Projects
None yet
4 participants
@guyrt
Copy link
Contributor

commented Sep 24, 2013

Closes #1081

Removes several duplicate calls to svd or eigenvalue routines. This should cause speedups.

Also removed a related TODO in the RegressionResults

@coveralls

This comment has been minimized.

Copy link

commented Sep 24, 2013

Coverage Status

Coverage remained the same when pulling 12a4694 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Sep 24, 2013

Thanks, that's a good catch

This should also go into 0.5.1

@coveralls

This comment has been minimized.

Copy link

commented Sep 26, 2013

Coverage Status

Coverage remained the same when pulling 602295e on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Sep 26, 2013

made comments in commit instead of PR (by accident)

Looks good from what I have seen so far.
I don't understand the adjustments to the df in the unit tests.

@guyrt

This comment has been minimized.

Copy link
Contributor Author

commented Sep 26, 2013

I've addressed all the comments from @josef-pkt in the previous commit.

The major update is that df_xxx are now properties. If for some reason you want to update them between init and fit, the behavior hasn't changed.

I wasn't 100% satisfied with the way i was calculating rank in the QR case: I would rather still use a rank helper function for consistency. What I've done is pass a smaller matrix of equal rank to the rank() function so we still get a time savings.

[EDIT: this paragraph is incorrect. Please ignore. ] Last thing to check: I'm currently using exog in QR and wexog in pinv. That's the way it was when I started this PR, but it isn't clear to me why we would want to be solving with different matrices in those two cases. I think we meant wexog in the QR code path but I want a +1 before I change that.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Sep 26, 2013

just last point exog is a local alias for self.wexog, when I checked this (confusing naming)

@guyrt

This comment has been minimized.

Copy link
Contributor Author

commented Sep 26, 2013

yeah I noticed that. The last commit fixes it.

@coveralls

This comment has been minimized.

Copy link

commented Sep 26, 2013

Coverage Status

Coverage remained the same when pulling daae4a8 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@guyrt

This comment has been minimized.

Copy link
Contributor Author

commented Sep 28, 2013

I've rebased this to one commit for clarity.

@coveralls

This comment has been minimized.

Copy link

commented Sep 28, 2013

Coverage Status

Coverage remained the same when pulling 9e78d52 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@guyrt

This comment has been minimized.

Copy link
Contributor Author

commented Oct 13, 2013

@josef-pkt Unless you see anything you want to change, this branch is ready.

if self._df_model is None:
self._df_model = float(self.rank - self.k_constant)
if self._df_resid is None:
self.df_resid = self.nobs - self.rank

This comment has been minimized.

Copy link
@josef-pkt

josef-pkt Oct 22, 2013

Member

Ok, I think I'm starting to see how this works.
If fit is called after __init__ then this sets df_model and df_resid and the extra calculations in the properties are never done.
sound ok

def _get_df_resid(self):
if self._df_resid is None:
self.rank = rank(self.exog)
self._df_resid = self.nobs - self.rank

This comment has been minimized.

Copy link
@josef-pkt

josef-pkt Oct 22, 2013

Member

this could use df_model, so we don't need to calculate rank twice if self._df_resid is not yet set.

This could be a case for the resettable cache decorator, but I don't think it works nicely with the check in fit() (would need to check _cache). I don't have any experience with the resettable cache.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Oct 22, 2013

I liked it better when it had the original separate commits.
I doubt we want to backport this to 0.5.1, there are now changes that might change the behavior too much for a bugfix release.

moving milestone to 0.6

@@ -345,6 +345,29 @@ def isestimable(C, D):
return True


def extendedpinv(X, rcond=1e-15):

This comment has been minimized.

Copy link
@josef-pkt

josef-pkt Oct 22, 2013

Member

maybe a different name with pinv in front so it's closer to a simple replacement in calculations, pinvext pinvs ... ?

This comment has been minimized.

Copy link
@guyrt

guyrt Oct 23, 2013

Author Contributor

pinv_extended

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Oct 22, 2013

Looks good and should give us a nice speedup. Ready to merge except for a two suggested changes.

We will be able to use the new pinv function also in some other places where we can use the additional information about singular values.

@guyrt Thank you for this.
Sorry about the delay. I would prefer if we don't squash commits if they are separate logical units. It took me a bit of time to figure out what the new properties were doing.

@guyrt

This comment has been minimized.

Copy link
Contributor Author

commented Oct 23, 2013

Sorry about the commit squashing. It's the way they do things over in pandas.

I've made the two changes. Rank computation in the two df computations now check whether rank is already defined.

I'm not familiar with statsmodels' caching system. Isn't there a ticket about improving it out there somewhere? It's probably worth taking a look at some point.

@coveralls

This comment has been minimized.

Copy link

commented Oct 23, 2013

Coverage Status

Coverage remained the same when pulling cdc6830 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Oct 23, 2013

about the cache decorator:
It has been working very well for many years and we never made a change to the original version by Pierre GM.
The only open issue that I could find is #886. There are some problem how cache-decorated properties are handled by sphinx. It's not obvious in the generated docs that these use attribute access and are not methods. There were also some problems with overriding them in subclasses, but nothing serious or clear enough for an issue (most likely just the way python behaves).
The cache itself is a dictionary attached as _cache to the instance.

Our results classes are full of @cache_readonly decorated methods to calculate and store attributes on demand (lazily), for example the set of methods in the RegressionResults: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L892

But we use it so far only for simple cases when an attribute is stored in a method. In your case, the caching is across methods, and your _df_model check is more explicit and easier to understand.

@jseabold

This comment has been minimized.

Copy link
Member

commented Oct 23, 2013

Should we use the getter and setter decorators for properties since we have 2.6 as minimum Python? I prefer them, though it's not a huge deal.

Removed one more rank issue
Changed name of extendedpinv to pinv_extended

Updated properties using decorators
@guyrt

This comment has been minimized.

Copy link
Contributor Author

commented Oct 23, 2013

@jseabold Agreed. Fixed.

@coveralls

This comment has been minimized.

Copy link

commented Oct 23, 2013

Coverage Status

Coverage increased (+0.008%) to 74.286% when pulling 162e0e3 on guyrt:streamline-linearalgebra into 84e7607 on statsmodels:master.

@jseabold

This comment has been minimized.

Copy link
Member

commented Oct 23, 2013

Looks good to me. Merge?

@guyrt

This comment has been minimized.

Copy link
Contributor Author

commented Oct 23, 2013

+1 on my end

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Oct 23, 2013

yes, good to merge

jseabold added a commit that referenced this pull request Oct 24, 2013

Merge pull request #1092 from guyrt/streamline-linearalgebra
ENH: Avoid duplicate svd in RegressionModel

@jseabold jseabold merged commit 1bc6576 into statsmodels:master Oct 24, 2013

1 check passed

default The Travis CI build passed
Details
@jseabold

This comment has been minimized.

Copy link
Member

commented Oct 24, 2013

For the future, we need to remember to request an update to the changes file in PRs like this to make release time easier on ourselves. @guyrt if you find the time/desire, could you send a PR to update the docs/source/release/version0.6.rst with a blurb about this change? You can look in version0.5.rst to get an idea of what we're after.

I'm imagining this can go under the Other important new features header.

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014

Merge pull request statsmodels#1092 from guyrt/streamline-linearalgebra
ENH: Avoid duplicate svd in RegressionModel

@josef-pkt josef-pkt referenced this pull request Jun 4, 2017

Closed

linear_model code paths #3740

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.