Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed duplicate svd in RegressionModel #1092

Merged
merged 2 commits into from
Oct 24, 2013

Conversation

guyrt
Copy link
Contributor

@guyrt guyrt commented Sep 24, 2013

Closes #1081

Removes several duplicate calls to svd or eigenvalue routines. This should cause speedups.

Also removed a related TODO in the RegressionResults

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 12a4694 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@josef-pkt
Copy link
Member

Thanks, that's a good catch

This should also go into 0.5.1

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 602295e on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@josef-pkt
Copy link
Member

made comments in commit instead of PR (by accident)

Looks good from what I have seen so far.
I don't understand the adjustments to the df in the unit tests.

@guyrt
Copy link
Contributor Author

guyrt commented Sep 26, 2013

I've addressed all the comments from @josef-pkt in the previous commit.

The major update is that df_xxx are now properties. If for some reason you want to update them between init and fit, the behavior hasn't changed.

I wasn't 100% satisfied with the way i was calculating rank in the QR case: I would rather still use a rank helper function for consistency. What I've done is pass a smaller matrix of equal rank to the rank() function so we still get a time savings.

[EDIT: this paragraph is incorrect. Please ignore. ] Last thing to check: I'm currently using exog in QR and wexog in pinv. That's the way it was when I started this PR, but it isn't clear to me why we would want to be solving with different matrices in those two cases. I think we meant wexog in the QR code path but I want a +1 before I change that.

@josef-pkt
Copy link
Member

just last point exog is a local alias for self.wexog, when I checked this (confusing naming)

@guyrt
Copy link
Contributor Author

guyrt commented Sep 26, 2013

yeah I noticed that. The last commit fixes it.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling daae4a8 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@guyrt
Copy link
Contributor Author

guyrt commented Sep 28, 2013

I've rebased this to one commit for clarity.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 9e78d52 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@guyrt
Copy link
Contributor Author

guyrt commented Oct 13, 2013

@josef-pkt Unless you see anything you want to change, this branch is ready.

if self._df_model is None:
self._df_model = float(self.rank - self.k_constant)
if self._df_resid is None:
self.df_resid = self.nobs - self.rank
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I'm starting to see how this works.
If fit is called after __init__ then this sets df_model and df_resid and the extra calculations in the properties are never done.
sound ok

@josef-pkt
Copy link
Member

I liked it better when it had the original separate commits.
I doubt we want to backport this to 0.5.1, there are now changes that might change the behavior too much for a bugfix release.

moving milestone to 0.6

@@ -345,6 +345,29 @@ def isestimable(C, D):
return True


def extendedpinv(X, rcond=1e-15):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a different name with pinv in front so it's closer to a simple replacement in calculations, pinvext pinvs ... ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pinv_extended

@josef-pkt
Copy link
Member

Looks good and should give us a nice speedup. Ready to merge except for a two suggested changes.

We will be able to use the new pinv function also in some other places where we can use the additional information about singular values.

@guyrt Thank you for this.
Sorry about the delay. I would prefer if we don't squash commits if they are separate logical units. It took me a bit of time to figure out what the new properties were doing.

@guyrt
Copy link
Contributor Author

guyrt commented Oct 23, 2013

Sorry about the commit squashing. It's the way they do things over in pandas.

I've made the two changes. Rank computation in the two df computations now check whether rank is already defined.

I'm not familiar with statsmodels' caching system. Isn't there a ticket about improving it out there somewhere? It's probably worth taking a look at some point.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling cdc6830 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@josef-pkt
Copy link
Member

about the cache decorator:
It has been working very well for many years and we never made a change to the original version by Pierre GM.
The only open issue that I could find is #886. There are some problem how cache-decorated properties are handled by sphinx. It's not obvious in the generated docs that these use attribute access and are not methods. There were also some problems with overriding them in subclasses, but nothing serious or clear enough for an issue (most likely just the way python behaves).
The cache itself is a dictionary attached as _cache to the instance.

Our results classes are full of @cache_readonly decorated methods to calculate and store attributes on demand (lazily), for example the set of methods in the RegressionResults: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L892

But we use it so far only for simple cases when an attribute is stored in a method. In your case, the caching is across methods, and your _df_model check is more explicit and easier to understand.

@jseabold
Copy link
Member

Should we use the getter and setter decorators for properties since we have 2.6 as minimum Python? I prefer them, though it's not a huge deal.

Changed name of extendedpinv to pinv_extended

Updated properties using decorators
@guyrt
Copy link
Contributor Author

guyrt commented Oct 23, 2013

@jseabold Agreed. Fixed.

@coveralls
Copy link

coveralls commented Oct 23, 2013

Coverage Status

Coverage increased (+0.008%) to 74.286% when pulling 162e0e3 on guyrt:streamline-linearalgebra into 84e7607 on statsmodels:master.

@jseabold
Copy link
Member

Looks good to me. Merge?

@guyrt
Copy link
Contributor Author

guyrt commented Oct 23, 2013

+1 on my end

@josef-pkt
Copy link
Member

yes, good to merge

jseabold added a commit that referenced this pull request Oct 24, 2013
ENH: Avoid duplicate svd in RegressionModel
@jseabold jseabold merged commit 1bc6576 into statsmodels:master Oct 24, 2013
@jseabold
Copy link
Member

For the future, we need to remember to request an update to the changes file in PRs like this to make release time easier on ourselves. @guyrt if you find the time/desire, could you send a PR to update the docs/source/release/version0.6.rst with a blurb about this change? You can look in version0.5.rst to get an idea of what we're after.

I'm imagining this can go under the Other important new features header.

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014
ENH: Avoid duplicate svd in RegressionModel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

streamline linear algebra for linear model
4 participants