
Random Thoughts on Probabilistic Modeling


Things I've learned (often the hard way) from modeling probabilistic processes on paper or in R/Python.

  1. Beware of implicit conditioning in your theoretical work.
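
    For instance, a quantity derived unconditionally won't match a simulation (or derivation) that quietly conditions on part of the sample space. A minimal numpy sketch of the gap, using a Poisson variable purely as a placeholder:

     import numpy as np

     rng = np.random.default_rng(1)
     x = rng.poisson(0.5, size=100_000)

     print(x.mean())           # unconditional mean, near 0.5
     print(x[x > 0].mean())    # mean implicitly conditioned on X > 0, noticeably larger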

  2. If something doesn't match up in later results, remember there's a non-zero probability that something in earlier results is fundamentally wrong and is causing the present difficulties.

  3. Beware of ways in which your numerical simulations make explicit use of theory. This is a problem. Anytime this is necessary for testing code (or maybe even for expressing the problem), note it with some sort of grep-able string like PROBD. Theory could match up with simulation results superficially because of this dependence.
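
    A hypothetical sketch of flagging such a spot (the parameters and quantity here are made up):

     n, p = 100, 0.25
     # PROBD: this expectation comes straight from the theory being tested,
     # so any agreement between simulation and theory is partly circular here
     expected_count = n * p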

  4. Beware of vectorization/recycling (R) or broadcasting (numpy/Python) of variables. Silent recycling or broadcasting can hurt you, especially when coding up double summations. R's Vectorize() is handy.
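
    A small numpy sketch of how a double summation can silently go wrong when elementwise multiplication pairs entries up instead of forming the full grid (the vectors are placeholders):

     import numpy as np

     x = np.array([1.0, 2.0, 3.0])
     y = np.array([10.0, 20.0, 30.0])

     # wrong: pairs x[i] with y[i] only (the "diagonal" terms), and with
     # equal lengths it runs silently; gives 140
     diag_only = (x * y).sum()

     # right: broadcast to the full outer product for sum_i sum_j x_i * y_j;
     # gives (1 + 2 + 3) * (10 + 20 + 30) = 360
     full_sum = (x[:, None] * y[None, :]).sum()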

  5. Don't forget to always go back and reason through the original (true) process each time something doesn't work.

  6. Don't forget about correlation in real processes. For example, here is label assignment done (very) incorrectly:

     # beware, bad code below!
     red_balls = [ball for ball in balls if rand.choice(('red', 'blue')) == 'red']
     blue_balls = [ball for ball in balls if rand.choice(('red', 'blue')) == 'blue']
    

    What's wrong? In the real process, we're labeling all balls as red or blue. These list comprehensions can lose balls, since there's no guarantee that total = blue + red (in other words, in the real labeling process the total number of red balls is perfectly correlated with the number of blue balls)! Use assert() statements liberally to make sure cases aren't lost in code. The correct approach (with, say, numpy's random.choice) is of course:

     labels = rand.choice(('blue', 'red'), size=ntotal, replace=True)
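     # sketch of the "assert liberally" advice above (assuming numpy's choice,
     # so labels is an array): check that no balls were lost in labeling
     assert (labels == 'red').sum() + (labels == 'blue').sum() == ntotal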
    
  7. A basic unit test of any PMF is that it sums to one across its domain (and of any PDF, that it integrates to one).
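
    For instance, a minimal sketch of such a check using scipy.stats' binomial PMF (the distribution and parameters are just placeholders):

     import numpy as np
     from scipy import stats

     n, p = 20, 0.3
     support = np.arange(n + 1)    # the PMF's full domain
     assert np.isclose(stats.binom.pmf(support, n, p).sum(), 1.0)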

  8. Beware of high-variance tests. For example, if I suspect an algorithm is biased, testing whether the sample average converges to the expected value is better than repeating 1000 draws and counting whether zero appears more often than one. The latter event can flip on as little as one extra success relative to the mean, so those 1000 draws effectively average out to a single coin flip, a high-variance process.
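
    A minimal sketch of the contrast, using fair 0/1 draws as a stand-in for the algorithm under test (the sample size, seed, and tolerance are arbitrary):

     import numpy as np

     rng = np.random.default_rng(0)
     draws = rng.integers(0, 2, size=1000)

     # low-variance check: the sample mean should sit near the expected 0.5
     assert abs(draws.mean() - 0.5) < 0.05   # tolerance is a judgment call

     # high-variance check: did 0 beat 1? this collapses 1000 draws into a
     # single near coin flip, so it says almost nothing about bias
     zero_beat_one = (draws == 0).sum() > (draws == 1).sum()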
