Fixed duplicate svd in RegressionModel #1092

Merged
merged 2 commits into from Oct 24, 2013

Projects

None yet

4 participants

@guyrt
Contributor
guyrt commented Sep 24, 2013

Closes #1081

Removes several duplicate calls to svd or eigenvalue routines. This should cause speedups.

Also removed a related TODO in the RegressionResults

@coveralls

Coverage Status

Coverage remained the same when pulling 12a4694 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@josef-pkt
Member

Thanks, that's a good catch

This should also go into 0.5.1

@coveralls

Coverage Status

Coverage remained the same when pulling 602295e on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@josef-pkt
Member

made comments in commit instead of PR (by accident)

Looks good from what I have seen so far.
I don't understand the adjustments to the df in the unit tests.

@guyrt
Contributor
guyrt commented Sep 26, 2013

I've addressed all the comments from @josef-pkt in the previous commit.

The major update is that df_xxx are now properties. If for some reason you want to update them between init and fit, the behavior hasn't changed.

I wasn't 100% satisfied with the way i was calculating rank in the QR case: I would rather still use a rank helper function for consistency. What I've done is pass a smaller matrix of equal rank to the rank() function so we still get a time savings.

[EDIT: this paragraph is incorrect. Please ignore. ] Last thing to check: I'm currently using exog in QR and wexog in pinv. That's the way it was when I started this PR, but it isn't clear to me why we would want to be solving with different matrices in those two cases. I think we meant wexog in the QR code path but I want a +1 before I change that.

@josef-pkt
Member

just last point exog is a local alias for self.wexog, when I checked this (confusing naming)

@guyrt
Contributor
guyrt commented Sep 26, 2013

yeah I noticed that. The last commit fixes it.

@coveralls

Coverage Status

Coverage remained the same when pulling daae4a8 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@guyrt
Contributor
guyrt commented Sep 28, 2013

I've rebased this to one commit for clarity.

@coveralls

Coverage Status

Coverage remained the same when pulling 9e78d52 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@guyrt
Contributor
guyrt commented Oct 13, 2013

@josef-pkt Unless you see anything you want to change, this branch is ready.

@josef-pkt josef-pkt commented on the diff Oct 22, 2013
statsmodels/regression/linear_model.py
beta = np.linalg.solve(R, effects)
- # no upper triangular solve routine in numpy/scipy?
+ if self._df_model is None:
+ self._df_model = float(self.rank - self.k_constant)
+ if self._df_resid is None:
+ self.df_resid = self.nobs - self.rank
@josef-pkt
josef-pkt Oct 22, 2013 Member

Ok, I think I'm starting to see how this works.
If fit is called after __init__ then this sets df_model and df_resid and the extra calculations in the properties are never done.
sound ok

@josef-pkt josef-pkt commented on an outdated diff Oct 22, 2013
statsmodels/regression/linear_model.py
+
+ def _get_df_model(self):
+ if self._df_model is None:
+ self.rank = rank(self.exog)
+ self._df_model = float(self.rank - self.k_constant)
+ return self._df_model
+
+ def _set_df_model(self, value):
+ self._df_model = value
+
+ df_model = property(_get_df_model, _set_df_model)
+
+ def _get_df_resid(self):
+ if self._df_resid is None:
+ self.rank = rank(self.exog)
+ self._df_resid = self.nobs - self.rank
@josef-pkt
josef-pkt Oct 22, 2013 Member

this could use df_model, so we don't need to calculate rank twice if self._df_resid is not yet set.

This could be a case for the resettable cache decorator, but I don't think it works nicely with the check in fit() (would need to check _cache). I don't have any experience with the resettable cache.

@josef-pkt
Member

I liked it better when it had the original separate commits.
I doubt we want to backport this to 0.5.1, there are now changes that might change the behavior too much for a bugfix release.

moving milestone to 0.6

@josef-pkt josef-pkt and 1 other commented on an outdated diff Oct 22, 2013
statsmodels/tools/tools.py
@@ -345,6 +345,29 @@ def isestimable(C, D):
return True
+def extendedpinv(X, rcond=1e-15):
@josef-pkt
josef-pkt Oct 22, 2013 Member

maybe a different name with pinv in front so it's closer to a simple replacement in calculations, pinvext pinvs ... ?

@guyrt
guyrt Oct 23, 2013 Contributor

pinv_extended

@josef-pkt
Member

Looks good and should give us a nice speedup. Ready to merge except for a two suggested changes.

We will be able to use the new pinv function also in some other places where we can use the additional information about singular values.

@guyrt Thank you for this.
Sorry about the delay. I would prefer if we don't squash commits if they are separate logical units. It took me a bit of time to figure out what the new properties were doing.

@guyrt
Contributor
guyrt commented Oct 23, 2013

Sorry about the commit squashing. It's the way they do things over in pandas.

I've made the two changes. Rank computation in the two df computations now check whether rank is already defined.

I'm not familiar with statsmodels' caching system. Isn't there a ticket about improving it out there somewhere? It's probably worth taking a look at some point.

@coveralls

Coverage Status

Coverage remained the same when pulling cdc6830 on guyrt:streamline-linearalgebra into 16365ed on statsmodels:master.

@josef-pkt
Member

about the cache decorator:
It has been working very well for many years and we never made a change to the original version by Pierre GM.
The only open issue that I could find is #886. There are some problem how cache-decorated properties are handled by sphinx. It's not obvious in the generated docs that these use attribute access and are not methods. There were also some problems with overriding them in subclasses, but nothing serious or clear enough for an issue (most likely just the way python behaves).
The cache itself is a dictionary attached as _cache to the instance.

Our results classes are full of @cache_readonly decorated methods to calculate and store attributes on demand (lazily), for example the set of methods in the RegressionResults: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L892

But we use it so far only for simple cases when an attribute is stored in a method. In your case, the caching is across methods, and your _df_model check is more explicit and easier to understand.

@jseabold
Member

Should we use the getter and setter decorators for properties since we have 2.6 as minimum Python? I prefer them, though it's not a huge deal.

@guyrt guyrt Removed one more rank issue
Changed name of extendedpinv to pinv_extended

Updated properties using decorators
162e0e3
@guyrt
Contributor
guyrt commented Oct 23, 2013

@jseabold Agreed. Fixed.

@coveralls

Coverage Status

Coverage remained the same when pulling 162e0e3 on guyrt:streamline-linearalgebra into 84e7607 on statsmodels:master.

@jseabold
Member

Looks good to me. Merge?

@guyrt
Contributor
guyrt commented Oct 23, 2013

+1 on my end

@josef-pkt
Member

yes, good to merge

@jseabold jseabold merged commit 1bc6576 into statsmodels:master Oct 24, 2013

1 check passed

default The Travis CI build passed
Details
@jseabold
Member

For the future, we need to remember to request an update to the changes file in PRs like this to make release time easier on ourselves. @guyrt if you find the time/desire, could you send a PR to update the docs/source/release/version0.6.rst with a blurb about this change? You can look in version0.5.rst to get an idea of what we're after.

I'm imagining this can go under the Other important new features header.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment