WIP: ENH:LTS, MTL least trimmed squares and ETS and MM-E #452

Open · wants to merge 21 commits
Owner

josef-pkt commented Sep 6, 2012

Extensions to robust estimation:

- least trimmed squares (LTS; the criterion is sketched below)
- efficient least trimmed squares
- MM-estimation
- maximum trimmed likelihood

The basic models work and look good in Monte Carlo experiments and in some examples from papers.

Known issues:

- no unit tests yet
- incomplete integration with RLM
- doesn't work with dummy variables yet: if the exog of a subset is singular, it doesn't draw extra starting cases, and it breaks (raises an exception) in the Poisson case
- no large-sample split for FastLTS yet
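
A minimal sketch of the LTS criterion mentioned above (illustrative only; `lts_objective` is a hypothetical helper, not code from this PR):

```python
import numpy as np

def lts_objective(params, endog, exog, k_accept):
    # least trimmed squares criterion: the sum of the k_accept smallest
    # squared residuals; LTS minimizes this over params (and hence over
    # which nobs - k_accept observations get trimmed)
    resid = endog - np.dot(exog, params)
    return np.sort(resid**2)[:k_accept].sum()
```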

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+        self.all_dict = {} #store tried outliers
+        self.temp = Holder()
+        self.temp.n_est_calls = 0
+        self.temp.n_refine_steps = []
+
+        self.target_attr = 'ssr'
+        self.fit_options = {}
+
+    def _refine_step(self, iin, k_accept):
+        endog, exog = self.endog, self.exog
+        nobs = self.nobs
+
+        res_trimmed = self.est_model(endog[iin], exog[iin]).fit()
+        self.temp.n_est_calls += 1
+        #print np.nonzero(~iin)[0] + 1, res_t_ols.params, res_t_ols.ssr
+        r = endog - res_trimmed.predict(exog)
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

Why not r = self.est_model.resid?

@josef-pkt

josef-pkt Sep 7, 2012

Owner

Because I'm working here with res_trimmed: res_trimmed.resid (from the self.est_model fit) doesn't contain all observations, only the inliers.

So I need to calculate the errors for the full dataset given by endog, exog, not the ones from the "trimmed" dataset endog[iin], exog[iin].

(Aside: in the initial estimate with random starting values, iin contains only k_vars + 1 observations, not really all inliers.)
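
A small self-contained illustration of that distinction (made-up data, mirroring what `_refine_step` does under the assumption that `iin` is a boolean inlier mask):

```python
import numpy as np
from statsmodels.regression.linear_model import OLS

# made-up data: 20 observations, constant plus one regressor
rng = np.random.RandomState(0)
exog = np.column_stack([np.ones(20), rng.randn(20)])
endog = np.dot(exog, [1.0, 2.0]) + rng.randn(20)

iin = np.ones(20, bool)
iin[[3, 7]] = False                      # pretend two observations are trimmed

res_trimmed = OLS(endog[iin], exog[iin]).fit()
print(res_trimmed.resid.shape)           # (18,) -> residuals for the inliers only
r = endog - res_trimmed.predict(exog)
print(r.shape)                           # (20,) -> residuals for the full sample,
                                         # which the refinement step needs to rank
```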

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+    def _refine_step(self, iin, k_accept):
+        endog, exog = self.endog, self.exog
+        nobs = self.nobs
+
+        res_trimmed = self.est_model(endog[iin], exog[iin]).fit()
+        self.temp.n_est_calls += 1
+        #print np.nonzero(~iin)[0] + 1, res_t_ols.params, res_t_ols.ssr
+        r = endog - res_trimmed.predict(exog)
+        #ii2 = np.argsort(np.argsort(np.abs(r))) < k_accept
+        #partial sort would be enough: need only the smallest k_outlier
+        #values
+        #TODO: another version: use resid_se and outlier test to determine
+        #k_outliers
+        idx3 = np.argsort(np.abs(r))[k_accept:] #idx of outliers
+
+        ii2 = np.ones(nobs, bool)
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

In my implementation of Fast-MCD, I called "iix" the "support". Although "support" already means something in some other contexts (e.g. SVM), I found that name to be quite good (and maybe more explicit than "iix").

@josef-pkt

josef-pkt Sep 7, 2012

Owner

I agree, the naming here is not good yet, especially for the argument. In general I like to keep the distinction of _idx for integer indices and _mask for boolean masks in the name.

I changed at some point whether iin refers to inliers or outliers. (So far I have only used "support" in the case of distributions.)

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+        self.temp.n_est_calls += 1
+        #print np.nonzero(~iin)[0] + 1, res_t_ols.params, res_t_ols.ssr
+        r = endog - res_trimmed.predict(exog)
+        #ii2 = np.argsort(np.argsort(np.abs(r))) < k_accept
+        #partial sort would be enough: need only the smallest k_outlier
+        #values
+        #TODO: another version: use resid_se and outlier test to determine
+        #k_outliers
+        idx3 = np.argsort(np.abs(r))[k_accept:] #idx of outliers
+
+        ii2 = np.ones(nobs, bool)
+        ii2[idx3] = False
+        ssr_new = np.dot(r*r, ii2)
+        return res_trimmed, ii2, ssr_new
+
+    def refine(self, iin, k_accept, max_nrefine=2):
+ def refine(self, iin, k_accept, max_nrefine=2):
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

Why not call this method c_step? I think that only users who have read Rousseeuw and Van Driessen's paper would browse this code, and they would probably look for the c_step method as a starting point to understand it. At least, this is what I did ;)

@josef-pkt

josef-pkt Sep 7, 2012

Owner

Because in a few months I won't remember what the c in `c_step` stands for. I could call it concentrate.

(I just liked refine better because I hadn't figured out what concentration means.)
And the code needs docstrings.

@josef-pkt

josef-pkt Sep 7, 2012

Owner

Just to add: I read the code more often without the paper at hand, so, in general, I prefer terminology that I can understand with only a rough idea of the algorithm over the terminology and (short-hand) naming conventions of a paper or textbook.

What are all the psi's and phi's in RLM? I always have to check.

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+
+        The search itself, for fixed tuning parameters, does not increase much
+        with nobs and k_trimmed. This does not include the increase in time for
+        an estimation as nobs increases.
+
+        n_keep : If n_keep is too low, for example 10 as in Rousseeuw and Van Driessen,
+            then the random search does not find the best solution for one of the
+            test datasets (aircraft).
+
+        k_start : some papers recommend k_start=k_vars for an exactly identified
+            estimation problem, but that didn't work so well in my preliminary
+            tests for OLS. (exactly identified cases might not work correctly
+            for all models.)
+
+        Currently, there is no check for a singular design matrix, exog. This
+        could be the case if there are categorical variables.
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

... or if n_obs is not large enough relative to n_vars.

@josef-pkt

josef-pkt Sep 7, 2012

Owner

I don't understand this.

@VirgileFritsch

VirgileFritsch Sep 7, 2012

Contributor

If the number of observations becomes close to the number of features, we may get a singular covariance matrix np.dot(X.T, X).
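
For the start subsets this can indeed happen with categorical regressors, which ties in with the dummy-variable limitation noted in the PR description. A made-up illustration (not code from the PR):

```python
import numpy as np

rng = np.random.RandomState(0)
nobs, k_vars = 50, 3
dummy = (np.arange(nobs) < 5).astype(float)        # rare category: only 5 ones
exog = np.column_stack([np.ones(nobs), rng.randn(nobs), dummy])

# a random start of k_vars + 1 observations will often miss all five "ones",
# leaving a zero column in the subset and a singular X'X
idx = rng.permutation(nobs)[:k_vars + 1]
sub = exog[idx]
print(np.linalg.matrix_rank(np.dot(sub.T, sub)))   # frequently 2, i.e. rank deficient
```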

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+        iterator = subsample(nobs, k_start, max_nrep=max_nstarts)
+
+        for ii in iterator:
+            iin = ii.copy() #TODO: do I still need a copy
+            res = self.refine(iin, k_accept, max_nrefine=max_nrefine_st1)
+            res_trimmed, ii2, ssr_new, converged = res
+            #if res_trimmed.ssr < ssr_keep[n_keep-1]:
+            #    best_stage1.append((res_trimmed.ssr, ii2))
+            if ssr_new < ssr_keep[n_keep-1]:
+                best_stage1.append((ssr_new, ii2))
+                #update minkeep, shouldn't grow longer than n_keep
+                #we don't drop extra indices in best_stage1
+                ssr_keep.append(ssr_new)
+                ssr_keep.sort() #inplace python sort
+                del ssr_keep[n_keep:] #remove extra
+
+
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

I would rather do:

iin_keep = np.zeros((n_keep, n_obs), bool)  # best_stage1 --> iin_keep
ssr_keep = np.inf * np.ones(n_keep)
...
for ii in iterator:
    ...
    if ssr_new < ssr_keep[-1]:
        # replace the value that gets popped out (the largest among the
        # previously kept n_keep values)
        ssr_keep[-1] = ssr_new
        iin_keep[-1] = ii2
        # sort ssr_keep and iin_keep together
        sort_aux = np.argsort(ssr_keep)
        ssr_keep = ssr_keep[sort_aux]
        iin_keep = iin_keep[sort_aux]

and then use all the pairs in (ssr_keep, iin_keep) for stage 2.

@josef-pkt

josef-pkt Sep 7, 2012

Owner

looks better

more numpy, but it's not obvious to me that numpy will be faster than python lists in this case

@VirgileFritsch

VirgileFritsch Sep 7, 2012

Contributor

I do not know if this would be faster, but it would save a lot of appends/deletions and make the code clearer (to me at least ;).
Also, you would end up with iin_keep containing only the solutions to be pooled at stage 2, which avoids the ssr > ssr_keep[n_keep-1] test.

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+                #we don't drop extra indices in best_stage1
+                ssr_keep.append(ssr_new)
+                ssr_keep.sort() #inplace python sort
+                del ssr_keep[n_keep:] #remove extra
+
+        #stage 2 : refine best_stage1 to convergence
+        ssr_best = np.inf
+        for (ssr, start_mask) in best_stage1:
+            if ssr > ssr_keep[n_keep-1]: continue
+            res = self.refine(start_mask, k_accept, max_nrefine=max_nrefine_st2)
+            res_trimmed, ii2, ssr_new, converged = res
+            if not converged:
+                #warning ?
+                print "refine step did not converge, max_nrefine limit reached"
+
+            ssr_current = getattr(res_trimmed, self.target_attr) #res_trimmed.ssr
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

Is self.target_attr useful at this point of the development? Should we keep that for a later version?

@josef-pkt

josef-pkt Sep 7, 2012

Owner

I added it to be able to use it for maximum trimmed likelihood as a subclass: self.target_attr = 'nllf'.
(I didn't have it in my first version, and I didn't adjust the names to reflect that the target could be something other than ssr.)

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+            res_trimmed, ii2, ssr_new, converged = res
+            if not converged:
+                #warning ?
+                print "refine step did not converge, max_nrefine limit reached"
+
+            ssr_current = getattr(res_trimmed, self.target_attr) #res_trimmed.ssr
+            if ssr_current < ssr_best:
+                ssr_best = ssr_current
+                res_best = (res_trimmed, ii2)
+
+        self.temp.best_stage1 = best_stage1
+        self.temp.ssr_keep = ssr_keep
+
+        return res_best
+
+    def fit_exact(self, k_trimmed=None):
+ def fit_exact(self, k_trimmed=None):
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

TODO: documentation
(+ maybe move fit before fit_exact and fit_random so that a new developer can understand the purpose of these three methods just by reading the code straight through).

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+
+    def __init__(self, endog, exog, est_model=OLS):
+        self.endog = endog
+        self.exog = exog
+        self.est_model = est_model
+        self.nobs, self.k_vars = self.exog.shape
+        #TODO: all_dict might not be useful anymore with new algorithm
+        self.all_dict = {} #store tried outliers
+        self.temp = Holder()
+        self.temp.n_est_calls = 0
+        self.temp.n_refine_steps = []
+
+        self.target_attr = 'ssr'
+        self.fit_options = {}
+
+    def _refine_step(self, iin, k_accept):
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

Here comes the biggest part of my review:
If I am not mistaken, the random starts in stage 1 should be done in limited subsamples. That is, you must first split your dataset into subsets and only THEN perform the c_step with random initializations. In that case, iin should be a "mask of a mask", i.e. a selection of n2 observations within a predefined subset of size n1.
This raises a programming challenge because, depending on the (sub)set you run the c_step on, the size of ii2 will vary, and it becomes tricky to pool the results at stage 2 (hence my "mask of a mask" suggestion above).
I am not sure I am being clear enough? Could you confirm, please?

@josef-pkt

josef-pkt Sep 7, 2012

Owner

For the large-sample case, when we split the data into subsamples, I thought we could just run the entire algorithm on the different subsets, that is, split the data and call LTS for each subsample, and then merge and refine/concentrate again. I haven't tried to implement it yet.
I don't expect that we gain much from working on all subsamples at the same time. The "mask of a mask" would not live inside LTS; instead, the method that does the split has to be able to merge the indices of the subsample LTS runs correctly.
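
A rough sketch of that split-and-merge idea (illustrative only; `lts_class` stands in for the estimator in least_trimmed_squares.py, and the return value of fit_random is assumed from the diff above):

```python
import numpy as np

def lts_large_sample(endog, exog, lts_class, n_subsamples=5, seed=0):
    # Split the data, run the LTS random search on each subsample, and map
    # the subsample inlier masks back to full-sample indices.  ``lts_class``
    # is assumed to expose fit_random() returning (res_trimmed, inlier_mask),
    # as in the diff above; names and defaults here are assumptions.
    nobs = endog.shape[0]
    rng = np.random.RandomState(seed)
    candidates = []
    for chunk in np.array_split(rng.permutation(nobs), n_subsamples):
        _, ii_local = lts_class(endog[chunk], exog[chunk]).fit_random()
        ii_global = np.zeros(nobs, bool)
        ii_global[chunk[ii_local]] = True   # subsample mask -> full-sample mask
        candidates.append(ii_global)
    # a final stage would run concentration/refine steps on the full sample,
    # starting from these merged candidate masks (not sketched here)
    return candidates
```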

@VirgileFritsch

VirgileFritsch Sep 7, 2012

Contributor

We agree. I started to look at the "large sample" implementation today, and I think one limitation of the current code is that the data LTS is applied to are only referenced as self.endog/self.exog, which makes it difficult to reuse the existing "fit_*" methods.
I propose to continue my investigation if you are OK with me doing so.

@josef-pkt

josef-pkt Sep 7, 2012

Owner

Yes, I think the new extensions still need adjustments to the code.
One possibility is to create a new LTS instance for each subsample, so each has its own endog and exog. (I don't expect much overhead in doing so.) But in that case I don't think we already have the correct stopping criterion to get a first-stage set of candidates out of it.

I appreciate your investigation and comments. And if you want to work on extensions like the large-sample case, then it would be faster to get this finished up.

@VirgileFritsch

VirgileFritsch Sep 12, 2012

Contributor

I pushed a first try here:

VirgileFritsch/statsmodels@1a1a14c

As said in the commit description, the example I provided does not really show that there is an interest in having a large-sample version yet. I will keep trying, because 1. my code may not be optimized for speed, and 2. maybe I was just not able to construct the "good" case where "LTS large sample" beats "simple LTS".

@josef-pkt

josef-pkt Sep 12, 2012

Owner

Thanks Virgile,
I looked briefly through it and don't see anything to "complain" about. It might take a few days until I have time to look at it more carefully and to play with the tuning parameters for the large-sample case.

@VirgileFritsch

VirgileFritsch Sep 12, 2012

Contributor

No problem, take your time! :)

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+            maximum number of concentration or refinement steps in the second
+            stage. See Notes.
+
+        Returns
+        -------
+        res_trimmed : results instance
+            The best (lowest ssr) instance of the estimation results class
+        ii2 : ndarray
+            index of inliers, the observations used in the best regression.
+
+        Notes
+        -----
+        The heuristic for the random search follows approximately the algorithm
+        of Rousseeuw and Van Driessen. In the first stage, a large number of
+        random starts are evaluated. The best of those are used in a second
+        stage until convergence.
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

See my comment at line 158.
I do think the actual way random starts are used does not correspond to Rousseeuw and Van Driessen's two-stage algorithm proposal (Fast-LTS).
Although a lot of random starts can be done without sticking to their algorithm, there is no proof (AFAIK) that the number of iterations should be 2 at stage 1 and then "up to convergence" at stage 2, and I guess not using subsets at stage 1 reduces the probability of finding a good candidate that generalizes well to stage 2.

@josef-pkt

josef-pkt Sep 6, 2012

Owner

Just on this one (for the rest I need less distraction from the kids):

The way I interpreted R and VD, I implemented the "small sample" version. They split the sample only for the case when nobs is large, and run the process on 5 to 6 subsamples. I haven't implemented that part yet.

The number of iterations (c-steps) in the 1st stage is only a guideline, AFAICS. In my examples, I often used only 1 refinement step. (There might be a small argument about the definition, because I check after the estimate, and I'm not sure whether 1 c-step doesn't actually mean almost two in my case. I didn't check very carefully how many steps or half-steps are in my iteration.)
Trying out different versions, it didn't seem to me that many refinement/c-steps are necessary in the first stage. Having more random starts often looked more important. But I keep more for the second stage than what R and VD recommend.

It looked to me a bit like guesswork and extrapolating from examples what a good combination of the 3 "tuning" parameters is: the number of random starts, the number of c-steps in the first stage, and how many of the first-stage results to carry into the second stage until convergence.

@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

My bad, I did not realize that you were in the "small sample" case. So the implementation is fine with me.
(Note that the "large sample" version, with the data splitting, is what puts the "Fast" in the algorithm's name. I guess we should therefore not expect the algorithm to be much faster than it currently is.)
I still do not understand why a larger n_keep would yield better results when going from stage 1 to stage 2... seems strange.

Anyway, I will think about it tomorrow, time to sleep now (good luck with the kids ;)

@josef-pkt

josef-pkt Sep 7, 2012

Owner

The main reason for "fast", in my interpretation, is to stop early in the first stage and not go to convergence with candidates that look too bad. The splitting in the large-sample case also spreads the random sampling better over the dataset, IIRC.

IIRC, some of the example datasets had a small "basin of attraction" and many "reasonably good" estimates, so keeping only a few first-stage results for the second stage might still not include the optimal set, just sets that are close to it. That was my rationale for increasing n_keep. Also, increasing n_keep and reducing the number of c-steps in the first stage looked more efficient to me, in terms of the amount of computation needed to get the right solution with high probability.
I would have to go back to check, but at some point I thought that LTS in R was still "luckier" at finding the right solution than my implementation. But I didn't see how to get more lucky.
Much of my "heuristic" came just from running several examples, not from a systematic search.

@VirgileFritsch

VirgileFritsch Sep 7, 2012

Contributor

"But when n grows the computation time increases, due to the LS regression on h-subsets and the n residuals that need to be calculated each time. TO avoid doing all the computations in the entire data set, we will consider a special structure." (R & VD)
So the splitting part is mainly there to speed-up things, right?

They seem to stick to a splitting scheme with runs in sets respectively of size 300, 1500 and n_obs. I wonder if this is not a bit of over-fitting...

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+        ave_refine_st2 is the average number of refinement steps needed until
+        convergence and depends on the problem. In the test cases, it is very
+        low for n_keep that are not large relative to max_nstarts (30%).
+
+        TODO: renaming
+            k_start -> nobs_start ?
+            max_nstarts -> n_repl_starts ?
+            max is not correct, it's always exactly max_nstarts
+
+        The search itself, for fixed tuning parameters, does not increase much
+        with nobs and k_trimmed. This does not include the increase in time for
+        an estimation as nobs increases.
+
+        n_keep : If n_keep is too low, for example 10 as in Rousseeuw and Van Driessen,
+            then the random search does not find the best solution for one of the
+            test datasets (aircraft).
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

Again, I think this is related to the way the first stage is performed: your aircraft example clearly shows that the best solutions at stage 1 do not generalize well at stage 2. A larger n_keep should not improve the results; otherwise this would mean that the best solution on the whole dataset has very little connection to the best solutions at stage 1, and we should then question the use of such a two-stage algorithm.
What do you think?

@josef-pkt

josef-pkt Sep 7, 2012

Owner

I looked at the Hawkins paper again: for their pairwise-swapping-of-observations algorithm, 16% of the starts go to the best and 34% go to the third-best solution. Not as bad as I remembered, but my impression was that the set of converged solutions is larger with the R&VD LTS algorithm.

Some back-of-the-envelope calculations:
"So 18 random starting subsets will give a 95% chance of having one lead to the global optimum" is the quote for the aircraft data. 7 convergence sets cover almost 90% of the starting values, which is about an 18% chance of converging to the best. To have a 99% probability of converging to the best solution, we would need roughly 23 starting values.
So if there is a good chance of getting many reasonably good estimates in the first stage, the kept set might not include the best candidate for the second stage if n_keep is too small. Also, using additional c-steps in the first stage might not help a lot in this case; it just weeds out the awful cases, not the pretty-good-but-not-best ones.
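
A quick check of those numbers (illustrative only; p_best is the rough per-start probability implied above):

```python
import numpy as np

p_best = 0.18                                   # rough per-start chance of reaching the optimum
p_rvd = 1 - 0.05 ** (1.0 / 18)                  # per-start chance implied by the R&VD quote, ~0.15
n_99 = np.log(1 - 0.99) / np.log(1 - p_best)    # starts needed for a 99% chance, ~23.2
print(p_rvd, n_99)
```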

Side remark: besides the Hawkins and other traditional datasets, I would have liked to have a collection of data-generating models and test cases that have been used in different papers. I doubt the same settings will be very good for all test cases.

@VirgileFritsch

VirgileFritsch Sep 7, 2012

Contributor

Fine. Thanks for the explanation!

Regarding data generation, I also doubt that a module covering the entire range of possible contaminated data could be made. Yet it is especially important for LTS, since the complex structure of the algorithm makes it sensitive to specific data settings.
I coded a data-generation module for the MCD test cases, but I doubt it would be useful for LTS.

@VirgileFritsch VirgileFritsch commented on the diff Sep 6, 2012

statsmodels/robust/least_trimmed_squares.py
+        self.est_model = est_model
+        self.nobs, self.k_vars = self.exog.shape
+        #TODO: all_dict might not be useful anymore with new algorithm
+        self.all_dict = {} #store tried outliers
+        self.temp = Holder()
+        self.temp.n_est_calls = 0
+        self.temp.n_refine_steps = []
+
+        self.target_attr = 'ssr'
+        self.fit_options = {}
+
+    def _refine_step(self, iin, k_accept):
+        endog, exog = self.endog, self.exog
+        nobs = self.nobs
+
+        res_trimmed = self.est_model(endog[iin], exog[iin]).fit()
@VirgileFritsch

VirgileFritsch Sep 6, 2012

Contributor

I did not find the name res_trimmed appropriate (because I do not understand what is "trimmed" about that statement).

@josef-pkt

josef-pkt Sep 7, 2012

Owner

It is the result of the estimation with trimmed observations: run OLS on the sample of "inlier" candidates (iin), dropping/trimming the outliers.

I think the main contrast was with WLS with weights=0 for outliers, as we get with RLM. In some parts it wasn't clear to me whether it's better to work with the reduced dataset (OLS) or with the full dataset and weights=0 (WLS).
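
A small check of that equivalence on made-up data (not code from the PR): OLS on the inliers and WLS with 0/1 weights give the same parameter estimates, while auxiliary quantities such as df_resid differ.

```python
import numpy as np
from statsmodels.regression.linear_model import OLS, WLS

rng = np.random.RandomState(0)
exog = np.column_stack([np.ones(30), rng.randn(30)])
endog = np.dot(exog, [1.0, 2.0]) + rng.randn(30)
iin = np.ones(30, bool)
iin[[2, 11, 20]] = False                    # pretend these are the trimmed outliers

res_ols = OLS(endog[iin], exog[iin]).fit()
res_wls = WLS(endog, exog, weights=iin.astype(float)).fit()
print(np.allclose(res_ols.params, res_wls.params))   # True: same estimates
print(res_ols.df_resid, res_wls.df_resid)             # differ (25 vs 28)
```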

Contributor

VirgileFritsch commented Sep 6, 2012

I had a look at the LTS object and the associated algorithm. I did not look at the examples or the advanced uses of LTS yet (i.e. efficient lts, inclusion with MM-estimators, MTL, ...).
Impressive amount of work! Welldone!

@josef-pkt josef-pkt commented on the diff Sep 7, 2012

statsmodels/robust/least_trimmed_squares.py
+        '''
+
+        endog, exog = self.endog, self.exog
+        #nobs = self.nobs
+        #all_dict = self.all_dict
+
+        for ib in range(max_nrefine):
+            #print tuple(np.nonzero(iin)[0])
+            res_trimmed, ii2, ssr_new = self._refine_step(iin, k_accept)
+            #print ib, tuple(np.nonzero(ii2)[0])
+            if (ii2 == iin).all():
+                converged = True
+                break
+            else:
+                iin = ii2.copy()
+        #for debugging
@josef-pkt

josef-pkt Sep 7, 2012

Owner

make optional

Owner

josef-pkt commented Sep 7, 2012

Thanks Virgile, and thanks for the review

@josef-pkt josef-pkt removed the PR label Mar 5, 2016

@josef-pkt josef-pkt added to pulls in comp-robust Feb 26, 2017

@josef-pkt josef-pkt added to others in josef-pulls Feb 27, 2017
