
ENH: Design of Experiment with Low Discrepancy Sequences #4103

Closed — tupui opened this issue Nov 4, 2017 · 24 comments
@tupui (Contributor) commented Nov 4, 2017

Follow-up on a discussion about implementing some design of experiments.

There is an open PR which proposes to add a Halton sequence, and here is my code proposal: Gist. As opposed to the PR, I propose to have a separate function that generates prime numbers and a distinct function that creates the building block of Halton (the Van der Corput sequence). This way, these functions can be reused.

In low-dimensional cases (fewer than ~10 dimensions), this method is good. Otherwise, we might want to go with Sobol' sequences and optimized LHS (the current state of the art).

LHS is nice, but we cannot easily increase the sample size (still an open research question), as opposed to low discrepancy sequences, where you simply take the next points of the sequence. Another option is to augment the design by optimizing a metric such as the discrepancy (here is another gist for its computation).
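
For illustration, a minimal sketch of the proposed separation; the names (`primes_from_2_to`, `n_primes`, `van_der_corput`, `halton`) are placeholders and not necessarily those of the gist:

```python
# Sketch of the proposed split: prime helpers, Van der Corput, then Halton.
import numpy as np

def primes_from_2_to(n):
    """All primes below n, via a simple sieve of Eratosthenes."""
    sieve = np.ones(n, dtype=bool)
    sieve[:2] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = False
    return np.nonzero(sieve)[0]

def n_primes(k):
    """The first k primes, growing the sieve bound until enough are found."""
    bound = 30
    primes = primes_from_2_to(bound)
    while len(primes) < k:
        bound *= 2
        primes = primes_from_2_to(bound)
    return primes[:k]

def van_der_corput(n_sample, base=2, start_index=0):
    """First n_sample Van der Corput points in the given base."""
    sequence = []
    for i in range(start_index, start_index + n_sample):
        quotient, point, denom = i, 0.0, 1.0
        while quotient > 0:
            quotient, remainder = divmod(quotient, base)
            denom *= base
            point += remainder / denom
        sequence.append(point)
    return np.array(sequence)

def halton(dim, n_sample, start_index=0):
    """Halton sequence: one Van der Corput sequence per dimension,
    each using a distinct prime as its base."""
    bases = n_primes(dim)
    return np.stack([van_der_corput(n_sample, int(b), start_index)
                     for b in bases], axis=-1)
```

Since each index is computed independently of the previous ones, extending a design is just a matter of continuing from start_index + n_sample.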

@josef-pkt (Member) commented

a suggestion: put the gist functions in a new module statsmodels.tools.sequences

Halton looks fine going through van der Corput, but I have no idea if there are numerical advantages to either version.

a few proposed changes

in van_der_corput, use a different loop variable in the while i > 0 loop. Even if it works this way, not modifying the outer loop variable is an easier-to-understand design.

I think we can hardcode the first few prime numbers as in the old halton. They won't change, and it is not necessary to recompute them each time. Also, outsourcing the find-n-prime-numbers loop from the halton function to a helper function would be useful, i.e. separating the prime number code from the sequence code.

If I read van_der_corput correctly, then we can compute the sequence value for any i separately, which would mean that we don't have to compute and throw away numbers if we want to skip numbers in the Halton sequence, as proposed by Kocis and Whiten. (I didn't go through the details of their paper.)

In terms of organizing the code and its location:
I know internal use cases for halton in at least discrete and genmod, and most likely for all models for numerical or Monte Carlo integration, so tools would be an appropriate location. However, I don't know about the other sequences and methods that you describe. If those are mainly for the DOE use cases, then they could also go into a new doe or experimental subfolder.

@tupui (Contributor, Author) commented Nov 4, 2017

@josef-pkt I have created a PR and tried to address your comments.

  1. I modified the van_der_corput loop variable to quotient,
  2. the primes up to 1000 are now hardcoded, so there is no computation.

I guess we could compute the van_der_corput sequence for any i, but I do not quite see the use case; I can make it work if you have a user story for this. The computation is so cheap that I do not see how this is even an issue. Or maybe it is to improve the DoE by lowering the discrepancy.

Regarding Sobol' sequences or LHS, their use cases are the same as Halton's. If there is no specific use for DoE at the moment, I guess tools is, as you said, a good location.

@josef-pkt (Member) commented

for computing only for selected i:
The simplest method to break the high correlation for some pairs of prime numbers in Kocis and Whiten was to skip Halton numbers. But I worried that it would be expensive if we had to compute the intermediate skipped numbers. I didn't try to understand their other methods because they sounded a lot more complex to me.
It won't be relevant for low-dimensional Halton sequences.

Kocis, L. and Whiten, W.J., 1997. Computational investigations of low-discrepancy sequences. ACM Transactions on Mathematical Software (TOMS), 23(2), pp.266-294.
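
For illustration, with a per-index construction like the sketch above, this skipping ("leaping") variant is cheap because the discarded points are never computed (vdc_at is a hypothetical helper, and the leap value here is arbitrary):

```python
import numpy as np

def vdc_at(index, base):
    """Van der Corput value at a single index."""
    point, denom = 0.0, 1.0
    while index > 0:
        index, remainder = divmod(index, base)
        denom *= base
        point += remainder / denom
    return point

# Leaped Halton: keep only every leap-th point of the underlying sequence.
bases, leap = [2, 3, 5], 7
indices = np.arange(1, 51) * leap
sample = np.array([[vdc_at(int(i), b) for b in bases] for i in indices])
```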

@josef-pkt (Member) commented

aside: playing with the original halton version
http://jpktd.blogspot.ca/2013/06/quasi-random-numbers-with-halton.html

@tupui (Contributor, Author) commented Nov 6, 2017

@josef-pkt I got a clean LHS version; where should I put it? tools.sequences does not fit, as this is not a sequence. We could rename it tools.doe or, as you suggested, create a doe folder.
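
For context, a basic (non-optimized) LHS fits in a few lines; this is only a sketch, not the PR's code:

```python
import numpy as np

def latin_hypercube(dim, n_sample, rng=None):
    """Basic LHS on [0, 1)^dim: each dimension is cut into n_sample strata,
    one point is drawn per stratum, and strata are shuffled per dimension."""
    rng = np.random.default_rng(rng)
    u = rng.random((n_sample, dim))  # jitter within each stratum
    strata = np.column_stack([rng.permutation(n_sample) for _ in range(dim)])
    return (strata + u) / n_sample
```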

@josef-pkt (Member) commented

Start the new statsmodels.doe folder (I hope the acronym is common and clear enough).
The __init__.py should be empty except for the pytester.
We add an api.py as the import path for public functions.
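
A sketch of what that api.py could look like (module and function names are purely hypothetical):

```python
# statsmodels/doe/api.py -- public import surface for the new subpackage
from statsmodels.doe.sequences import halton, van_der_corput
from statsmodels.doe.lhs import latin_hypercube
```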

@tupui (Contributor, Author) commented Nov 6, 2017

@josef-pkt I will do the PR. If you prefer, I can include the little fix on the bounds (#4106) so the PR is not too slim.

@josef-pkt (Member) commented

It's better to merge the fix separately right away.
The new doe might not go through review and merging as fast as halton did, i.e. most likely doe will be for 0.10.

@tupui (Contributor, Author) commented Nov 7, 2017

@josef-pkt I have opened the corresponding PR.

@josef-pkt (Member) commented

One reference for using Latin hypercube sampling for mixed discrete choice models, similar to Halton sequences
(I didn't look at it; Kenneth Train is one of the authors):

Hess, Stephane, Kenneth E. Train, and John W. Polak. 2006. “On the Use of a Modified Latin Hypercube Sampling (MLHS) Method in the Estimation of a Mixed Logit Model for Vehicle Choice.” Transportation Research Part B: Methodological 40 (2):147–63. https://doi.org/10.1016/j.trb.2004.10.005.

@tupui (Contributor, Author) commented Dec 23, 2017

@josef-pkt Thanks for the reference, I did not know this one. Do you see anything more I can do? Some docs? Some other functionality?

@josef-pkt (Member) commented

I wanted to reply in your PR

Essentially this is your area for now. I browsed a bit in the literature but have only superficial ideas about this. It would be good if you could work on whatever you think is useful in this area. I have many different topics that I need to keep track of or catch up with, so I cannot keep up, but I will come back to it every once in a while.

For my context:
I had thought of LHS as a method to create a design matrix for actual experiments, but it is also just another method for numerical integration:
https://www.linkedin.com/pulse/20140708131747-483951-the-pros-and-cons-of-latin-hypercube-sampling
http://www.lumina.com/blog/latin-hypercube-vs.-monte-carlo-sampling
In discrete choice models like mixed multinomial logit or probit models, the likelihood function needs to be numerically integrated, using Gaussian quadrature, Halton sequences, Monte Carlo, or, based on this reference, modified LHS.
So for this, LHS would be similar to Halton sequences: a numerical integration tool and not literally for DOE.
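
As a toy illustration of that integration use (assuming a halton helper like the sketch above; the true value of this integral is 0.5):

```python
import numpy as np
from scipy import stats

# Approximate E[f(Z)] for Z ~ N(0, I_2) with quasi-random draws:
# uniform Halton points are pushed through the normal ppf (inverse cdf).
f = lambda z: np.exp(-0.5 * (z ** 2).sum(axis=-1))

u = halton(dim=2, n_sample=1024, start_index=1)  # skip index 0: ppf(0) = -inf
z = stats.norm.ppf(u)                            # map to N(0, 1) draws
qmc_estimate = f(z).mean()                       # ~0.5 for this integrand
```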

@josef-pkt (Member) commented

Stratified LHS sounds interesting based on the abstracts; found while doing a semi-random search for related articles (without knowing much about it, the citation counts are pretty good).

Minasny, Budiman, and Alex B. McBratney. 2006. “A Conditioned Latin Hypercube Method for Sampling in the Presence of Ancillary Information.” Computers & Geosciences 32 (9):1378–88. https://doi.org/10.1016/j.cageo.2005.12.009.

Shields, Michael D., and Jiaxin Zhang. 2016. “The Generalization of Latin Hypercube Sampling.” Reliability Engineering & System Safety 148 (Supplement C):96–108. https://doi.org/10.1016/j.ress.2015.12.002.

@josef-pkt (Member) commented

One comment on Halton versus LHS for simulated MLE

in footnote 12
"""
In high dimensions, when Halton draws tend to be highly correlated over dimensions, Bhat (2003) has investigated the use of scrambled Halton draws, and Hess et al. (2006) propose modified Latin hypercube sampling procedures. The dimension of integration in our model is not sufficiently high to require these procedures
"""
Train, Kenneth E., and Clifford Winston. 2007. “Vehicle Choice Behavior and the Declining Market Share of U.S. Automakers.” International Economic Review 48 (4):1469–96. https://doi.org/10.1111/j.1468-2354.2007.00471.x.

@tupui (Contributor, Author) commented Dec 24, 2017

Stratified LHS is indeed promising, and I definitely plan to implement it when given some time. As for the number of dimensions, a scrambled version of Halton/Sobol' would also be interesting to add. With these and the already-coded methods, it would be quite versatile IMO.

Regarding the current PR, I will try to propose more documentation as well as examples.

@josef-pkt (Member) commented

One more possible enhancement idea:
For adaptive designs/simulations it would be good to be able to add points.
I saw that the R package lhs has an augmentLHS function to update an existing hypercube with more points: https://cran.r-project.org/web/packages/lhs/lhs.pdf
Again, I have no idea what the properties of the augmentation process are, but for some applications it is useful to be able to do a small-sample Monte Carlo or Halton sequence first and then continue if the small sample doesn't look precise enough.

@josef-pkt (Member) commented

@tupui A related but different question (just an idea right now):
Have you seen anything about deviating from a uniformly distributed design to put more points in "interesting" local neighborhoods? An example would be if there is larger local variation or variance in some part of the parameter or design space.

An application where this is useful: a standard example for spline fitting is the motorcycle dataset. It can be fit very well with splines that have knots at every few (5th or 10th) sorted observations. However, I eventually realized that these splines work so well because there are more observations in the part that has high variation, so we get more knots in the high-variation part. In general this would not work so well, because with knots uniform in the design space we don't get more knots in the high-variation part, and we would either oversmooth the high-variation part or undersmooth the low-variation part, without switching to adaptive splines and bandwidth choice.

What could work is a nonlinear distortion function that depends on the local curvature or variance and brings points closer together or farther apart. That would be a nonlinear transformation similar to using the ppf (inverse cdf) when transforming uniform sample points to correspond to a non-uniform distribution.
One possibility for the hypercube is to make the grid/bin widths irregular, which would "distort" entire columns or rows instead of a local region.
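
For illustration, the ppf idea in one dimension (the Beta(2, 5) distortion here is arbitrary; in practice it would be driven by an estimate of local variance or curvature):

```python
import numpy as np
from scipy import stats

# Distort a regular 1-D design so points concentrate near the low end,
# mimicking "more knots where the data vary more".
uniform_knots = (np.arange(1, 21) - 0.5) / 20      # midpoints of 20 bins
distorted_knots = stats.beta.ppf(uniform_knots, a=2, b=5)
```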

@josef-pkt (Member) commented

Here is an example, but not variation-based. Based on a brief look, it seems to be just the analogue of the ppf transformation.
Dette, Holger, and Andrey Pepelyshev. 2010. “Generalized Latin Hypercube Design for Computer Experiments.” Technometrics 52 (4):421–29.

@tupui (Contributor, Author) commented Dec 25, 2017

@josef-pkt Indeed, being able to continue a DoE can be handy. For Halton this is already a feature, as we added the start_index. Other than continuing sequences, one possibility is to add the new points by optimizing the discrepancy; I can add this easily. A complementary approach would be to limit the hypercube so that new points are added to a particular zone within the bounds. The only issue is that this can become computationally expensive as the dimension increases. If we just want to increase the design without external information, I am afraid there won't be many options, as opposed to, for example, Bayesian optimization, where expected improvement based on the surrogate's variance can be used.
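
A brute-force sketch of the discrepancy-based augmentation mentioned above, using the centered L2 discrepancy (Hickernell's formula, in its usual squared form; function names are illustrative):

```python
import numpy as np

def centered_discrepancy(sample):
    """Centered L2 discrepancy (squared) of a sample in [0, 1]^dim."""
    n, dim = sample.shape
    dist = np.abs(sample - 0.5)
    term1 = (13.0 / 12.0) ** dim
    term2 = (2.0 / n) * np.prod(1 + 0.5 * dist - 0.5 * dist ** 2, axis=1).sum()
    prod = np.ones((n, n))
    for k in range(dim):
        x, d = sample[:, k], dist[:, k]
        prod *= (1 + 0.5 * d[:, None] + 0.5 * d[None, :]
                 - 0.5 * np.abs(x[:, None] - x[None, :]))
    return term1 - term2 + prod.sum() / n ** 2

def augment_by_discrepancy(sample, n_candidates=1000, rng=None):
    """Append the random candidate point that minimizes the discrepancy
    of the augmented design (brute-force candidate search)."""
    rng = np.random.default_rng(rng)
    candidates = rng.random((n_candidates, sample.shape[1]))
    scores = [centered_discrepancy(np.vstack([sample, c[None, :]]))
              for c in candidates]
    return np.vstack([sample, candidates[np.argmin(scores)]])
```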

Apart from importance sampling (so using the ppf to do the inverse transformation), there is also the possibility of quadrature-based sampling, but this constrains the sampling a lot. Apart from that, I am not aware of anything else. From what I know, we use the ppf all the time, as it is super convenient.

@josef-pkt (Member) commented

parking a reference (found while shutting down).
I didn't read it, but it sounds like an alternative or complement to the second-moment measures of FANOVA/Sobol' indices
(it uses the Kolmogorov-Smirnov distance on conditional CDFs, if I guess correctly based on a very brief look).

Pianosi, Francesca, and Thorsten Wagener. 2015. “A Simple and Efficient Method for Global Sensitivity Analysis Based on Cumulative Distribution Functions.” Environmental Modelling & Software 67 (Supplement C):1–11. https://doi.org/10.1016/j.envsoft.2015.01.004.

@josef-pkt (Member) commented

another way of creating quasi-random numbers:
http://extremelearning.com.au/unreasonable-effectiveness-of-quasirandom-sequences/
(link by Robert Kern in a scipy-dev discussion on Poisson disk sampling)
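
For reference, the construction from the post fits in a few lines (a sketch; the 0.5 offset and the fixed-point iteration for the generalized golden ratio follow the post):

```python
import numpy as np

def r_sequence(dim, n_sample):
    """R_d additive-recurrence sequence: alpha comes from phi_d, the positive
    root of x**(dim + 1) = x + 1, and point n is (0.5 + n * alpha) mod 1."""
    phi = 2.0
    for _ in range(50):                  # fixed-point iteration converges fast
        phi = (1 + phi) ** (1.0 / (dim + 1))
    alpha = (1.0 / phi) ** np.arange(1, dim + 1)
    n = np.arange(1, n_sample + 1)[:, None]
    return (0.5 + n * alpha) % 1.0
```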

@tupui (Contributor, Author) commented Oct 19, 2018

This is nice. As detailed in the post, it works quite well in 2D, but if you go into higher dimensions it's really bad; someone else noticed that in the comments. I have made some 2D sub-projections of it myself:

[figure: 2D sub-projections of the sequence in 4 dimensions]

So, I am not sure about adding this one. Both optimized LHS and scrambled Sobol' sequences are way more efficient.

@tupui (Contributor, Author) commented Oct 19, 2018

By the way, regarding your comment on the sensitivity indices based on CDFs: I have coded these indices in my tool Batman (https://gitlab.com/cerfacs/batman). If you're interested, I can make a PR.

@tupui (Contributor, Author) commented Jul 25, 2021

Closing, as all of this is in SciPy now and further development should happen there, unless we want something really particular.

As for Poisson disk sampling, there is an open PR on SciPy too.

@tupui closed this as completed Jul 25, 2021
@bashtage added this to the 0.13 milestone Sep 9, 2021