
ENH: Design of Experiment with Low Discrepancy Sequences #4103

Closed — tupui opened this issue Nov 4, 2017 · 24 comments
@tupui (Contributor) commented Nov 4, 2017

Follow-up on a discussion about implementing some design of experiments.

There is an open PR which proposes to add a Halton sequence, and here is my code proposal: Gist. As opposed to the PR, I propose to have a separate function that generates prime numbers and a distinct function that creates the building block of Halton (the Van der Corput sequence). This way, these functions can be reused.

In low-dimensional cases (fewer than ~10 dimensions), this method is good. Otherwise, we might want to go with Sobol' sequences and optimized LHS (the current state of the art).

LHS is nice, but we cannot easily increase the sample size (still an open research question), as opposed to low discrepancy sequences, where you simply take the next points of the sequence. Another option is to augment the design by optimizing a metric such as the discrepancy (here is another gist for its computation).
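
For illustration, a minimal sketch of the proposed separation; the names (`primes_from_2_to`, `n_primes`, `van_der_corput`, `halton`) are placeholders and not necessarily those of the gist:

```python
# Sketch of the proposed split: prime helpers, Van der Corput, then Halton.
import numpy as np

def primes_from_2_to(n):
    """All primes below n, via a simple sieve of Eratosthenes."""
    sieve = np.ones(n, dtype=bool)
    sieve[:2] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = False
    return np.nonzero(sieve)[0]

def n_primes(k):
    """The first k primes, growing the sieve bound until enough are found."""
    bound = 30
    primes = primes_from_2_to(bound)
    while len(primes) < k:
        bound *= 2
        primes = primes_from_2_to(bound)
    return primes[:k]

def van_der_corput(n_sample, base=2, start_index=0):
    """First n_sample Van der Corput points in the given base."""
    sequence = []
    for i in range(start_index, start_index + n_sample):
        quotient, point, denom = i, 0.0, 1.0
        while quotient > 0:
            quotient, remainder = divmod(quotient, base)
            denom *= base
            point += remainder / denom
        sequence.append(point)
    return np.array(sequence)

def halton(dim, n_sample, start_index=0):
    """Halton sequence: one Van der Corput sequence per dimension,
    each using a distinct prime as its base."""
    bases = n_primes(dim)
    return np.stack([van_der_corput(n_sample, int(b), start_index)
                     for b in bases], axis=-1)
```

Since each index is computed independently of the previous ones, extending a design is just a matter of continuing from start_index + n_sample.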

@josef-pkt (Member) commented

a suggestion: put the gist functions in a new module statsmodels.tools.sequences

Halton looks fine going through van der Corput, but I have no idea if there are numerical advantages to either version.

a few proposed changes

in van_der_corput, use a different loop variable in the while i > 0 loop. Even if it works this way, not modifying the outer loop variable is an easier-to-understand design.

I think we can hardcode the first few prime numbers as in the old halton. They won't change, and it is not necessary to recompute them each time. Also, outsourcing the find-n-prime-numbers loop from the halton function to a helper function would be useful, i.e. separating the prime number code from the sequence code.

If I read van_der_corput correctly, then we can compute the sequence value for any i separately, which would mean that we don't have to compute and throw away numbers if we want to skip numbers in the Halton sequence, as proposed by Kocis and Whiten. (I didn't go through the details of their paper.)

In terms of organizing the code and its location:
I know internal use cases for halton in at least discrete and genmod, and most likely for all models for numerical or Monte Carlo integration, so tools would be an appropriate location. However, I don't know about the other sequences and methods that you describe. If those are mainly for the DOE use cases, then they could also go into a new doe or experimental subfolder.

@tupui (Contributor, Author) commented Nov 4, 2017

@josef-pkt I have created a PR and tried to address your comments.

  1. I modified the van_der_corput loop variable to quotient,
  2. the primes up to 1000 are now hardcoded, so there is no computation.

I guess we could compute the van_der_corput sequence for any i, but I do not quite see the use case; I can make it work if you have a user story for this. The computation is so cheap that I do not see how this is even an issue. Or maybe it is to improve the DoE by lowering the discrepancy.

Regarding Sobol' sequences or LHS, their use cases are the same as Halton's. If there is no specific use for DoE at the moment, I guess tools is, as you said, a good location.

@josef-pkt (Member) commented

for computing only for selected i:
The simplest method to break the high correlation for some pairs of prime numbers in Kocis and Whiten was to skip Halton numbers. But I worried that it would be expensive if we had to compute the intermediate skipped numbers. I didn't try to understand their other methods because they sounded a lot more complex to me.
It won't be relevant for low-dimensional Halton sequences.

Kocis, L. and Whiten, W.J., 1997. Computational investigations of low-discrepancy sequences. ACM Transactions on Mathematical Software (TOMS), 23(2), pp.266-294.
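
For illustration, with a per-index construction like the sketch above, this skipping ("leaping") variant is cheap because the discarded points are never computed (vdc_at is a hypothetical helper, and the leap value here is arbitrary):

```python
import numpy as np

def vdc_at(index, base):
    """Van der Corput value at a single index."""
    point, denom = 0.0, 1.0
    while index > 0:
        index, remainder = divmod(index, base)
        denom *= base
        point += remainder / denom
    return point

# Leaped Halton: keep only every leap-th point of the underlying sequence.
bases, leap = [2, 3, 5], 7
indices = np.arange(1, 51) * leap
sample = np.array([[vdc_at(int(i), b) for b in bases] for i in indices])
```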

@josef-pkt (Member) commented

aside: playing with the original halton version
http://jpktd.blogspot.ca/2013/06/quasi-random-numbers-with-halton.html

@tupui (Contributor, Author) commented Nov 6, 2017

@josef-pkt I got a clean LHS version; where should I put it? tools.sequences does not fit, as this is not a sequence. We could rename it tools.doe or, as you suggested, create a doe folder.
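
For context, a basic (non-optimized) LHS fits in a few lines; this is only a sketch, not the PR's code:

```python
import numpy as np

def latin_hypercube(dim, n_sample, rng=None):
    """Basic LHS on [0, 1)^dim: each dimension is cut into n_sample strata,
    one point is drawn per stratum, and strata are shuffled per dimension."""
    rng = np.random.default_rng(rng)
    u = rng.random((n_sample, dim))  # jitter within each stratum
    strata = np.column_stack([rng.permutation(n_sample) for _ in range(dim)])
    return (strata + u) / n_sample
```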

@josef-pkt (Member) commented

Start the new statsmodels.doe folder (I hope the acronym is common and clear enough).
The __init__.py should be empty except for the pytester.
We add an api.py as the import path for public functions.
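
A sketch of what that api.py could look like (module and function names are purely hypothetical):

```python
# statsmodels/doe/api.py -- public import surface for the new subpackage
from statsmodels.doe.sequences import halton, van_der_corput
from statsmodels.doe.lhs import latin_hypercube
```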

@tupui (Contributor, Author) commented Nov 6, 2017

@josef-pkt I will do the PR. If you prefer, I can include the little fix on the bounds (#4106) so the PR is not too slim.

@josef-pkt (Member) commented

It's better to merge the fix separately right away.
The new doe might not go through review and merging as fast as halton did, i.e. most likely doe will be for 0.10.

@tupui (Contributor, Author) commented Nov 7, 2017

@josef-pkt I have opened the corresponding PR.

@josef-pkt (Member) commented

One reference for using Latin hypercube sampling for mixed discrete choice models, similar to Halton sequences
(I didn't look at it; Kenneth Train is one of the authors):

Hess, Stephane, Kenneth E. Train, and John W. Polak. 2006. “On the Use of a Modified Latin Hypercube Sampling (MLHS) Method in the Estimation of a Mixed Logit Model for Vehicle Choice.” Transportation Research Part B: Methodological 40 (2):147–63. https://doi.org/10.1016/j.trb.2004.10.005.

@tupui (Contributor, Author) commented Dec 23, 2017

@josef-pkt Thanks for the reference, I did not know this one. Do you see anything more I can do? Some docs? Some other functionality?

@josef-pkt (Member) commented

I wanted to reply in your PR

Essentially this is your area for now. I browsed a bit in the literature but have only superficial ideas about this. It would be good if you could work on whatever you think is useful in this area. I have many different topics that I need to keep track of or catch up with, so I cannot keep up, but I will come back to it every once in a while.

For my context:
I had thought of LHS as a method to create a design matrix for actual experiments, but it is also just another method for numerical integration:
https://www.linkedin.com/pulse/20140708131747-483951-the-pros-and-cons-of-latin-hypercube-sampling
http://www.lumina.com/blog/latin-hypercube-vs.-monte-carlo-sampling
In discrete choice models like mixed multinomial logit or probit models, the likelihood function needs to be numerically integrated, using Gaussian quadrature, Halton sequences, Monte Carlo, or, based on this reference, modified LHS.
So for this, LHS would be similar to Halton sequences: a numerical integration tool and not literally for DOE.
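
As a toy illustration of that integration use (assuming a halton helper like the sketch above; the true value of this integral is 0.5):

```python
import numpy as np
from scipy import stats

# Approximate E[f(Z)] for Z ~ N(0, I_2) with quasi-random draws:
# uniform Halton points are pushed through the normal ppf (inverse cdf).
f = lambda z: np.exp(-0.5 * (z ** 2).sum(axis=-1))

u = halton(dim=2, n_sample=1024, start_index=1)  # skip index 0: ppf(0) = -inf
z = stats.norm.ppf(u)                            # map to N(0, 1) draws
qmc_estimate = f(z).mean()                       # ~0.5 for this integrand
```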

@josef-pkt (Member) commented

Stratified LHS sounds interesting based on the abstracts; found while doing a semi-random search for related articles (without knowing much about it, the citation counts are pretty good).

Minasny, Budiman, and Alex B. McBratney. 2006. “A Conditioned Latin Hypercube Method for Sampling in the Presence of Ancillary Information.” Computers & Geosciences 32 (9):1378–88. https://doi.org/10.1016/j.cageo.2005.12.009.

Shields, Michael D., and Jiaxin Zhang. 2016. “The Generalization of Latin Hypercube Sampling.” Reliability Engineering & System Safety 148 (Supplement C):96–108. https://doi.org/10.1016/j.ress.2015.12.002.

@josef-pkt (Member) commented

One comment on Halton versus LHS for simulated MLE

in footnote 12
"""
In high dimensions, when Halton draws tend to be highly correlated over dimensions, Bhat (2003) has investigated the use of scrambled Halton draws, and Hess et al. (2006) propose modified Latin hypercube sampling procedures. The dimension of integration in our model is not sufficiently high to require these procedures
"""
Train, Kenneth E., and Clifford Winston. 2007. “Vehicle Choice Behavior and the Declining Market Share of U.S. Automakers.” International Economic Review 48 (4):1469–96. https://doi.org/10.1111/j.1468-2354.2007.00471.x.

@tupui (Contributor, Author) commented Dec 24, 2017

Stratified LHS is indeed promising, and I definitely plan to implement it when given some time. As for the number of dimensions, a scrambled version of Halton/Sobol' would also be interesting to add. With these and the already-coded methods, it would be quite versatile IMO.

Regarding the current PR, I will try to propose more documentation as well as examples.

@josef-pkt (Member) commented

One more possible enhancement idea:
For adaptive designs/simulations it would be good to be able to add points.
I saw that the R package lhs has an augmentLHS function to update an existing hypercube with more points: https://cran.r-project.org/web/packages/lhs/lhs.pdf
Again, I have no idea what the properties of the augmentation process are, but for some applications it is useful to be able to do a small-sample Monte Carlo or Halton sequence first and then continue if the small sample doesn't look precise enough.

@josef-pkt (Member) commented

@tupui A related but different question (just an idea right now):
Have you seen anything about deviating from a uniformly distributed design to put more points in "interesting" local neighborhoods? An example would be if there is larger local variation or variance in some part of the parameter or design space.

An application where this is useful: a standard example for spline fitting is the motorcycle dataset. It can be fit very well with splines that have knots at every few (5th or 10th) sorted observations. However, I eventually realized that these splines work so well because there are more observations in the part that has high variation, so we get more knots in the high-variation part. In general this would not work so well, because with knots uniform in the design space we don't get more knots in the high-variation part, and we would either oversmooth the high-variation part or undersmooth the low-variation part, without switching to adaptive splines and bandwidth choice.

What could work is a nonlinear distortion function that depends on the local curvature or variance and brings points closer together or farther apart. That would be a nonlinear transformation similar to using the ppf (inverse cdf) when transforming uniform sample points to correspond to a non-uniform distribution.
One possibility for the hypercube is to make the grid/bin widths irregular, which would "distort" entire columns or rows instead of a local region.
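
For illustration, the ppf idea in one dimension (the Beta(2, 5) distortion here is arbitrary; in practice it would be driven by an estimate of local variance or curvature):

```python
import numpy as np
from scipy import stats

# Distort a regular 1-D design so points concentrate near the low end,
# mimicking "more knots where the data vary more".
uniform_knots = (np.arange(1, 21) - 0.5) / 20      # midpoints of 20 bins
distorted_knots = stats.beta.ppf(uniform_knots, a=2, b=5)
```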

@josef-pkt (Member) commented

Here is an example, but not variation-based. Based on a brief look, it seems to be just the analogue of the ppf transformation.
Dette, Holger, and Andrey Pepelyshev. 2010. “Generalized Latin Hypercube Design for Computer Experiments.” Technometrics 52 (4):421–29.

@tupui (Contributor, Author) commented Dec 25, 2017

@josef-pkt Indeed, being able to continue a DoE can be handy. For Halton this is already a feature, as we added the start_index. Other than continuing sequences, one possibility is to add the new points by optimizing the discrepancy; I can add this easily. A complementary approach would be to limit the hypercube so that new points are added to a particular zone within the bounds. The only issue is that this can become computationally expensive as the dimension increases. If we just want to increase the design without external information, I am afraid there won't be many options, as opposed to, for example, Bayesian optimization, where expected improvement based on the surrogate's variance can be used.
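
A brute-force sketch of the discrepancy-based augmentation mentioned above, using the centered L2 discrepancy (Hickernell's formula, in its usual squared form; function names are illustrative):

```python
import numpy as np

def centered_discrepancy(sample):
    """Centered L2 discrepancy (squared) of a sample in [0, 1]^dim."""
    n, dim = sample.shape
    dist = np.abs(sample - 0.5)
    term1 = (13.0 / 12.0) ** dim
    term2 = (2.0 / n) * np.prod(1 + 0.5 * dist - 0.5 * dist ** 2, axis=1).sum()
    prod = np.ones((n, n))
    for k in range(dim):
        x, d = sample[:, k], dist[:, k]
        prod *= (1 + 0.5 * d[:, None] + 0.5 * d[None, :]
                 - 0.5 * np.abs(x[:, None] - x[None, :]))
    return term1 - term2 + prod.sum() / n ** 2

def augment_by_discrepancy(sample, n_candidates=1000, rng=None):
    """Append the random candidate point that minimizes the discrepancy
    of the augmented design (brute-force candidate search)."""
    rng = np.random.default_rng(rng)
    candidates = rng.random((n_candidates, sample.shape[1]))
    scores = [centered_discrepancy(np.vstack([sample, c[None, :]]))
              for c in candidates]
    return np.vstack([sample, candidates[np.argmin(scores)]])
```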

Apart from importance sampling (so using the ppf to do the inverse transformation), there is also the possibility of quadrature-based sampling, but this constrains the sampling a lot. Apart from that, I am not aware of anything else. From what I know, we use the ppf all the time, as it is super convenient.

@josef-pkt (Member) commented

parking a reference (found while shutting down).
I didn't read it, but it sounds like an alternative or complement to the second-moment measures of FANOVA/Sobol' indices
(it uses the Kolmogorov-Smirnov distance on conditional CDFs, if I guess correctly based on a very brief look).

Pianosi, Francesca, and Thorsten Wagener. 2015. “A Simple and Efficient Method for Global Sensitivity Analysis Based on Cumulative Distribution Functions.” Environmental Modelling & Software 67 (Supplement C):1–11. https://doi.org/10.1016/j.envsoft.2015.01.004.

@josef-pkt (Member) commented

another way of creating quasi-random numbers:
http://extremelearning.com.au/unreasonable-effectiveness-of-quasirandom-sequences/
(link by Robert Kern in a scipy-dev discussion on Poisson disk sampling)
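
For reference, the construction from the post fits in a few lines (a sketch; the 0.5 offset and the fixed-point iteration for the generalized golden ratio follow the post):

```python
import numpy as np

def r_sequence(dim, n_sample):
    """R_d additive-recurrence sequence: alpha comes from phi_d, the positive
    root of x**(dim + 1) = x + 1, and point n is (0.5 + n * alpha) mod 1."""
    phi = 2.0
    for _ in range(50):                  # fixed-point iteration converges fast
        phi = (1 + phi) ** (1.0 / (dim + 1))
    alpha = (1.0 / phi) ** np.arange(1, dim + 1)
    n = np.arange(1, n_sample + 1)[:, None]
    return (0.5 + n * alpha) % 1.0
```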

@tupui (Contributor, Author) commented Oct 19, 2018

This is nice. As detailed in the post, it works quite well in 2D, but if you go into higher dimensions it's really bad; someone else noticed that in the comments. I have made some 2D sub-projections of it myself:

[figure: 2D sub-projections of the sequence in 4 dimensions]

So, I am not sure about adding this one. Both optimized LHS and scrambled Sobol' sequences are way more efficient.

@tupui (Contributor, Author) commented Oct 19, 2018

By the way, regarding your comment on the sensitivity indices based on CDFs: I have coded these indices in my tool Batman (https://gitlab.com/cerfacs/batman). If you're interested, I can make a PR.

@tupui (Contributor, Author) commented Jul 25, 2021

Closing, as all of this is in SciPy now and further development should happen there, unless we want something really particular.

As for Poisson disk sampling, there is an open PR on SciPy too.

@tupui closed this as completed Jul 25, 2021
@bashtage added this to the 0.13 milestone Sep 9, 2021