# zaxtax/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers forked from CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

• 7 commits
• 3 files changed
• 3 contributors
Commits on Jan 13, 2014
• mdiephuis: fixed two minor grammar errors and a python typo b33f38d
• mdiephuis: Added two missing words 8e4a64c
Commits on Jan 14, 2014
• CamDavidsonPilon: Merge pull request #168 from mdiephuis/master (fixed two minor grammar errors and a python typo) edc6f8e
• mdiephuis: Typos and one edit for style. 6525dff
• CamDavidsonPilon: Merge pull request #170 from mdiephuis/master (Typos and one edit for style.) 55f8df4
Commits on Jan 24, 2014
• zaxtax: Adding sentence on Fisher information 22933e3
• zaxtax: Merge branch 'master' of git://github.com/CamDavidsonPilon/Probabilis… …tic-Programming-and-Bayesian-Methods-for-Hackers 0dcbc01
4 Chapter4_TheGreatestTheoremNeverTold/LawOfLargeNumbers.ipynb
@@ -112,7 +112,7 @@
  "source": [
  "### Decision, decisions...\n",
  "\n",
- "The choice, either *objective* or *subjective* mostly depend on the problem being solved, but there are a few cases where one is preferred over the other. In instances of scientific research, the choice of an objective prior is obvious. This eliminates any biases in the results, and two researchers who might have differing prior opinions would feel an objective prior is fair. Consider a more extreme situation:\n",
+ "The choice, either *objective* or *subjective* mostly depends on the problem being solved, but there are a few cases where one is preferred over the other. In instances of scientific research, the choice of an objective prior is obvious. This eliminates any biases in the results, and two researchers who might have differing prior opinions would feel an objective prior is fair. Consider a more extreme situation:\n",
  "\n",
  "> A tobacco company publishes a report with a Bayesian methodology that retreated 60 years of medical research on tobacco use. Would you believe the results? Unlikely. The researchers probably chose a subjective prior that too strongly biased results in their favor.\n",
  "\n",
@@ -145,7 +145,7 @@
  "source": [
  "### Empirical Bayes\n",
  "\n",
- "While not a true Bayesian method, *empirical Bayes* is a trick that combines frequentist and Bayesian inference. As mentioned previously, for (almost) every inference problem there is a Bayesian method and a frequentist method. The significant difference between the two is that Bayesian methods have a prior distribution, with hyperparameters $\\alpha$, while empirical methods do not have any notion of a prior. Empirical Bayes combines the two methods by using frequentist methods to select $\\alpha$, and then proceeding with Bayesian methods on the original problem. \n",
+ "While not a true Bayesian method, *empirical Bayes* is a trick that combines frequentist and Bayesian inference. As mentioned previously, for (almost) every inference problem there is a Bayesian method and a frequentist method. The significant difference between the two is that Bayesian methods have a prior distribution, with hyperparameters $\\alpha$, while empirical methods do not have any notion of a prior. Empirical Bayes combines the two methods by using frequentist methods to select $\\alpha$, and then proceeds with Bayesian methods on the original problem. \n",
  "\n",
  "A very simple example follows: suppose we wish to estimate the parameter $\\mu$ of a Normal distribution, with $\\sigma = 5$. Since $\\mu$ could range over the whole real line, we can use a Normal distribution as a prior for $\\mu$. How to select the prior's hyperparameters, denoted ($\\mu_p, \\sigma_p^2$)? The $\\sigma_p^2$ parameter can be chosen to reflect the uncertainty we have. For $\\mu_p$, we have two options:\n",
  "\n",
@@ -265,7 +265,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "One thing to notice is that the symmetry of these matrices. The Wishart distribution can be a little troubling to deal with, but we will use it in an example later."
+ "One thing to notice is the symmetry of these matrices. The Wishart distribution can be a little troubling to deal with, but we will use it in an example later."
  ]
  },
  {
@@ -536,7 +536,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Note that we don't real care how accurate we become about inference of the hidden probabilities — for this problem we are more interested in choosing the best bandit (or more accurately, becoming *more confident* in choosing the best bandit). For this reason, the distribution of the red bandit is very wide (representing ignorance about what that hidden probability might be) but we are reasonably confident that it is not the best, so the algorithm chooses to ignore it.\n",
+ "Note that we don't really care how accurate we become about the inference of the hidden probabilities — for this problem we are more interested in choosing the best bandit (or more accurately, becoming *more confident* in choosing the best bandit). For this reason, the distribution of the red bandit is very wide (representing ignorance about what that hidden probability might be) but we are reasonably confident that it is not the best, so the algorithm chooses to ignore it.\n",
  "\n",
  "From the above, we can see that after 1000 pulls, the majority of the \"blue\" function leads the pack, hence we will almost always choose this arm. This is good, as this arm is indeed the best.\n",
  "\n",
@@ -865,7 +865,7 @@
  "\n",
  "- If interested in the *minimum* probability (eg: where prizes are a bad thing), simply choose $B = \\text{argmin} \\; X_b$ and proceed.\n",
  "\n",
- "- Adding learning rates: Suppose the underlying environment may change over time. Technically the standard Bayesian Bandit algorithm would self-update itself (awesome) by noting that what it thought was the best is starting to fail more often, we can motivate the algorithm to learn changing environments quicker. We simply need to add a *rate* term upon updating:\n",
+ "- Adding learning rates: Suppose the underlying environment may change over time. Technically the standard Bayesian Bandit algorithm would self-update itself (awesome) by noting that what it thought was the best is starting to fail more often. We can motivate the algorithm to learn changing environments quicker by simply adding a *rate* term upon updating:\n",
  "\n",
  " self.wins[ choice ] = rate*self.wins[ choice ] + result\n",
  " self.trials[ choice ] = rate*self.trials[ choice ] + 1\n",
@@ -874,7 +874,7 @@
  "\n",
  "- Hierarchical algorithms: We can setup a Bayesian Bandit algorithm on top of smaller bandit algorithms. Suppose we have $N$ Bayesian Bandit models, each varying in some behavior (for example different rate parameters, representing varying sensitivity to changing environments). On top of these $N$ models is another Bayesian Bandit learner that will select a sub-Bayesian Bandit. This chosen Bayesian Bandit will then make an internal choice as to which machine to pull. The super-Bayesian Bandit updates itself depending on whether the sub-Bayesian Bandit was correct or not. \n",
  "\n",
- "- Extending the rewards, denoted $y_a$ for bandit $a$, to random variables from a distribution $f_{y_a}(y)$ is straightforward. More generally, this problem can be rephrased as \"Find the bandit with the largest expected value\", as playing the bandit with the largest expected value is optimal. In the case above, $f_{y_a}$ was Bernoulli with probability $p_a$, hence the expected value for an bandit is equal to $p_a$, which is why it looks like we are aiming to maximize the probability of winning. If $f$ is not Bernoulli, and it is non-negative, which can be accomplished apriori by shifting the distribution (we assume we know $f$), then the algorithm behaves as before:\n",
+ "- Extending the rewards, denoted $y_a$ for bandit $a$, to random variables from a distribution $f_{y_a}(y)$ is straightforward. More generally, this problem can be rephrased as \"Find the bandit with the largest expected value\", as playing the bandit with the largest expected value is optimal. In the case above, $f_{y_a}$ was Bernoulli with probability $p_a$, hence the expected value for a bandit is equal to $p_a$, which is why it looks like we are aiming to maximize the probability of winning. If $f$ is not Bernoulli, and it is non-negative, which can be accomplished apriori by shifting the distribution (we assume we know $f$), then the algorithm behaves as before:\n",
  "\n",
  " For each round, \n",
  " \n",
@@ -887,7 +887,7 @@
  "\n",
  "- There has been some interest in extending the Bayesian Bandit algorithm to commenting systems. Recall in Chapter 4, we developed a ranking algorithm based on the Bayesian lower-bound of the proportion of upvotes to total votes. One problem with this approach is that it will bias the top rankings towards older comments, since older comments naturally have more votes (and hence the lower-bound is tighter to the true proportion). This creates a positive feedback cycle where older comments gain more votes, hence are displayed more often, hence gain more votes, etc. This pushes any new, potentially better comments, towards the bottom. J. Neufeld proposes a system to remedy this that uses a Bayesian Bandit solution.\n",
  "\n",
- "His proposal is to consider each comment as a Bandit, with a the number of pulls equal to the number of votes cast, and number of rewards as the number of upvotes, hence creating a $\\text{Beta}(1+U,1+D)$ posterior. As visitors visit the page, samples are drawn from each bandit/comment, but instead of displaying the comment with the $\\max$ sample, the comments are ranked according the the ranking of their respective samples. From J. Neufeld's blog [7]:\n",
+ "His proposal is to consider each comment as a Bandit, with the number of pulls equal to the number of votes cast, and number of rewards as the number of upvotes, hence creating a $\\text{Beta}(1+U,1+D)$ posterior. As visitors visit the page, samples are drawn from each bandit/comment, but instead of displaying the comment with the $\\max$ sample, the comments are ranked according to the ranking of their respective samples. From J. Neufeld's blog [7]:\n",
  "\n",
  " > [The] resulting ranking algorithm is quite straightforward, each new time the comments page is loaded, the score for each comment is sampled from a $\\text{Beta}(1+U,1+D)$, comments are then ranked by this score in descending order... This randomization has a unique benefit in that even untouched comments $(U=1,D=0)$ have some chance of being seen even in threads with 5000+ comments (something that is not happening now), but, at the same time, the user is not likely to be inundated with rating these new comments. "
  ]
@@ -998,7 +998,7 @@
  "\n",
  "The *expected daily return* of a stock is denoted $\\mu = E[ r_t ]$. Obviously, stocks with high expected returns are desirable. Unfortunately, stock returns are so filled with noise that it is very hard to estimate this parameter. Furthermore, the parameter might change over time (consider the rises and falls of AAPL stock), hence it is unwise to use a large historical dataset. \n",
  "\n",
- "Historically, the expected return has been estimated by using the sample mean. This is a bad idea. As mentioned, the sample mean of a small dataset size has enormous potential to be very wrong (again, see Chapter 4 for full details). Thus Bayesian inference is the correct procedure here, since we are able to see our uncertainty along with probable values.\n",
+ "Historically, the expected return has been estimated by using the sample mean. This is a bad idea. As mentioned, the sample mean of a small sized dataset has enormous potential to be very wrong (again, see Chapter 4 for full details). Thus Bayesian inference is the correct procedure here, since we are able to see our uncertainty along with probable values.\n",
  "\n",
  "For this exercise, we will be examining the daily returns of the AAPL, GOOG, MSFT and AMZN. Before we pull in the data, suppose we ask our a stock fund manager (an expert in finance, but see [10] ), \n",
  "\n",
@@ -1491,7 +1491,9 @@
  "Jeffreys Priors are defined as:\n",
  "\n",
  "$$p_J(\\theta) \\propto \\mathbf{I}(\\theta)^\\frac{1}{2}$$\n",
- "$$\\mathbf{I}(\\theta) = - \\mathbb{E}\\Big[\\frac{d^2 \\text{ log } p(X|\\theta)}{}\\Big]$$"
+ "$$\\mathbf{I}(\\theta) = - \\mathbb{E}\\bigg[\\frac{d^2 \\text{ log } p(X|\\theta)}{d\\theta^2}\\bigg]$$\n",
+ "\n",
+ "$\\mathbf{I}$ being the *Fisher information*"
  ]
  },
  {
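As a sanity check on the corrected Fisher information formula in the last hunk, here is a minimal sketch that evaluates $\mathbf{I}(\theta) = -\mathbb{E}[d^2 \log p(X|\theta) / d\theta^2]$ numerically. It assumes a single Bernoulli observation, which is an illustrative choice and not something the diff specifies. The function names (`bernoulli_log_pmf`, `second_derivative`, `fisher_information`, `jeffreys_unnormalized`) are hypothetical helpers, not part of the notebook. For this model the closed form is $1/(\theta(1-\theta))$, so the unnormalized Jeffreys prior is proportional to $\theta^{-1/2}(1-\theta)^{-1/2}$.

```python
import math


def bernoulli_log_pmf(x, theta):
    """log p(x | theta) for one Bernoulli observation x in {0, 1}."""
    return x * math.log(theta) + (1 - x) * math.log(1 - theta)


def second_derivative(f, theta, h=1e-4):
    """Central finite-difference estimate of d^2 f / d theta^2."""
    return (f(theta + h) - 2.0 * f(theta) + f(theta - h)) / h ** 2


def fisher_information(theta):
    """I(theta) = -E[d^2 log p(X | theta) / d theta^2], with X ~ Bernoulli(theta).

    The expectation is an exact sum over the two outcomes; only the
    second derivative is approximated numerically.
    """
    total = 0.0
    for x, prob in ((0, 1.0 - theta), (1, theta)):
        curvature = second_derivative(lambda t: bernoulli_log_pmf(x, t), theta)
        total += prob * curvature
    return -total


def jeffreys_unnormalized(theta):
    """Unnormalized Jeffreys prior: p_J(theta) proportional to I(theta)^(1/2)."""
    return math.sqrt(fisher_information(theta))


if __name__ == "__main__":
    for theta in (0.1, 0.3, 0.5):
        closed_form = 1.0 / (theta * (1.0 - theta))
        print(theta, fisher_information(theta), closed_form)
```

The finite-difference step `h` is an assumption chosen for values of `theta` well inside (0, 1); near the boundaries the log terms blow up and a smaller step or an analytic derivative would be needed.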