fixing typos and adding EM to the applications
s-baumann committed Jan 19, 2019
1 parent 0385719 commit ec8da23
Showing 2 changed files with 133 additions and 11 deletions.
12 changes: 8 additions & 4 deletions docs/src/3_UsingAdvice.md
@@ -78,13 +78,17 @@ fp = fixed_point(simple_vector_function, fp; Algorithm = Anderson)

## 3.3 Graceful error handling

Hopefully **FixedPointAcceleration** is well tested enough that most kinds of errors will be rare. **FixedPointAcceleration** also offers an option (ReplaceInvalids) to ensure that no acceleration algorithm generates guess vectors that contain NaNs, Missings or Infs. This option can be set to ReplaceVector, which will replace an extrapolated vector containing NaNs, Missings or Infs by the vector output in the previous iterate. If it is set to ReplaceElement then it will replace the individual elements that are NaNs, Missings or Infs by the corresponding elements in the output of the previous iterate.
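For example, a minimal sketch of passing this option, reusing the fixed_point call pattern from section 3.1 (the option values are written exactly as described above; check the package documentation for the precise spelling):
```
# A sketch only: ReplaceVector (or ReplaceElement) is assumed to be usable directly as
# the keyword value, as described in the paragraph above.
fp = fixed_point(simple_vector_function, fp; Algorithm = Anderson, ReplaceInvalids = ReplaceVector)
```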

Errors are likely, however, in cases where input functions have a restricted domain. For example, this may include functions that require the input vector to have a particular shape (i.e. concavity) or functions where the input vector must be strictly positive. For a simple example, consider the vectorised function we introduced in section 3.1. Now rather than

$x^\prime[1] = \frac{\sqrt{\vert x[1] + x[2] \vert}}{2}$

we have

$x^\prime[1] = \frac{\sqrt{ x[1] + x[2] }}{2}$

where the output $$x$$ has a prime and the input has no prime symbol. $x^\prime[1]$ here is no longer real valued if the sum of the elements of the previous iterate is negative. This is what occurs in the 5th iterate of the Anderson method applied to this problem.

The fixed_point function handles these situations gracefully by saving all previous results as well as the proposed new vector that led to the error. In the event of such an error the FailedEvaluation_ member of the returned FixedPointResults struct will describe the issue.
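As a minimal sketch of how such a failure arises and can be inspected (the second output element and the starting guess below are illustrative rather than taken from this section):
```
# A hedged sketch. Only the first output element matches the expression discussed above;
# the second element and the starting guess are illustrative.
function simple_vector_function_no_abs(x)
    return [0.5*sqrt(x[1] + x[2]), 1.5*x[1] + 0.5*x[2]]
end
# Without the absolute value, sqrt is not real valued whenever x[1] + x[2] < 0, so a
# proposed iterate with a negative sum cannot be evaluated. As described above, the
# returned FixedPointResults can then be inspected through its FailedEvaluation_ member.
fp = fixed_point(simple_vector_function_no_abs, [0.3, 900.0]; Algorithm = Anderson)
fp.FailedEvaluation_
```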

@@ -123,7 +127,7 @@ More generally similar problems exist for the other acceleration methods. When t
then the fixed point method receives information about what direction to go in but no information about how far to go.
This is a complication that is common to all of these acceleration methods in this package.
In these cases it may be possible to change the function to make it converge by varying increments while retaining the
same set of fixed points. This is shown in the perceptron example in section 4.2. In other cases where it is not possible
to modify the function, it is advisable to use the simple method.

[^5]: When these complications arise the ReplaceInvalids method can be used to revert to a simple iterate or to change individual elements to the corresponding values in a simple iterate. This is as described in section 3.3.
132 changes: 125 additions & 7 deletions docs/src/4_Applications.md
Expand Up @@ -94,11 +94,11 @@ This training algorithm can be rewritten as fixed point problem. We can write a
data updating these weights and then returns the updated weight vector. If the perceptron classifies every observation correctly then
the weights will not update and we are at a fixed point.[^7]

Most acceleration algorithms perform poorly in accelerating the convergence of this perceptron training algorithm. This is due to the perceptron often converging by a fixed increment. This occurs because multiple iterates can result in the same observations being misclassified and hence the same change in the weights. As a result we will use the simple method which is guaranteed to be convergent for this problem (Novikoff, 1963).

First we generate a dataset:
```
# Generating linearly separable data
using Distributions
using FixedPointAcceleration
using Random
@@ -116,9 +116,27 @@ plot!(data2.x1, data2.x2,seriestype=:scatter)

Now we write a function that will take a set of weights, update them and return the updated weights.
```
# The standard perceptron updating function. The update for a misclassified observation does
# not depend on how badly it is misclassified, so repeatedly misclassifying the same
# observations gives the same change in the weights each iterate.
function IteratePerceptronWeights(w, LearningRate = 1)
    for i in 1:length(data[:y])
        target = data[i,:y]
        score = w[1] + (w[2]*data[i,:x1]) + (w[3]*data[i,:x2])
        ypred = 2*((score > 0)-0.5)
        if abs(target-ypred) > 1e-10
            update = LearningRate * 0.5*(target-ypred)
            w[1] = w[1] + update
            w[2] = w[2] + update*data[i,:x1]
            w[3] = w[3] + update*data[i,:x2]
        end
    end
    return(w)
end
InitialGuess = [1.0, -2.0, 0.5]
FP = fixed_point(IteratePerceptronWeights, InitialGuess; Algorithm = Simple, PrintReports = true)
```

Only the simple method is convergent here and it is relatively slow, taking 1121 iterations. We can still get a benefit from accelerators, however, if we can modify the training algorithm to give training increments that change depending on distance from the fixed point. This can be done by updating the weights by an amount proportional to a concave function of the norm of $$w_0 + \sum_{i=1}^N w_i x_{i,j}$$. Note that the instances in which the weights are not updated stay the same and hence the modified training function will result in the same set of fixed points as the basic function. This is done in the code block further below, where the MPE method is used (an illustrative sketch of such a modification is given first). It can be seen that there is a substantial increase in speed, with only 54 iterations required by the MPE method.
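As an illustrative sketch of such a modification (this is not the code used below; here the increment is simply scaled by the square root of the absolute score, a concave function of the norm above):
```
# A sketch only, not the training function used in these docs. A misclassified point far
# from the separation line (large absolute score) moves the weights a lot; a point close
# to it moves them a little, so increments diminish as we approach the fixed point.
function IteratePerceptronWeightsScaled(w, LearningRate = 1)
    for i in 1:length(data[:y])
        target = data[i,:y]
        score = w[1] + (w[2]*data[i,:x1]) + (w[3]*data[i,:x2])
        ypred = 2*((score > 0)-0.5)
        if abs(target-ypred) > 1e-10
            update = LearningRate * sqrt(abs(score)) * 0.5*(target-ypred)
            w[1] = w[1] + update
            w[2] = w[2] + update*data[i,:x1]
            w[3] = w[3] + update*data[i,:x2]
        end
    end
    return(w)
end
```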

```
function IteratePerceptronWeights(w, LearningRate = 1)
    for i in 1:length(data[:y])
        target = data[i,:y]
@@ -148,7 +166,107 @@ plot!(x1,x2_on_sep_line, label ="SeperationLine")

[^7]: Note that for perceptrons there are always uncountably many such fixed points where the perceptron correctly classifies the entire training set and will not further update. On the other hand it is possible that the data is not linearly separable in which case there may be no fixed point and the weights will continue to update forever.

## 4.3 Expectation Maximisation

Suppose we have a set of data that has come from two different multivariate normal distributions. There is a probability $$\tau$$ that a datapoint comes from the first distribution.
```
# Generating data from two two-dimensional Gaussian distributions
using Distributions
using FixedPointAcceleration
using LinearAlgebra # Symmetric (used below) lives in LinearAlgebra
using Plots         # for the scatter plots at the end of this block
using Random
using DataFrames
true_tau = 0.6
nobs_1 = 400
nobs_2 = convert(Int, round(nobs_1 * ((1-true_tau)/true_tau)))
Random.seed!(1234)
mu_1 = [0.0,8.0]
cov_1 = [2.0,0.5,2.0]
covar_1 = Symmetric([cov_1[1] cov_1[2]; cov_1[2] cov_1[3]])
md_1 = MultivariateNormal(mu_1,covar_1)
mu_2 = [-4.0,10.0]
cov_2 = [2.0,-0.75,12.0]
covar_2 = Symmetric([cov_2[1] cov_2[2]; cov_2[2] cov_2[3]])
md_2 = MultivariateNormal(mu_2,covar_2)
rands_from_1 = transpose(rand(md_1, nobs_1))
rands_from_2 = transpose(rand(md_2, nobs_2))
data1 = DataFrame([rands_from_1[:,1], rands_from_1[:,2]], [:x1, :x2])
data2 = DataFrame([rands_from_2[:,1], rands_from_2[:,2]], [:x1, :x2])
dd = vcat(data1,data2)
# Plotting it:
plot(data1.x1, data1.x2,seriestype=:scatter)
plot!(data2.x1, data2.x2,seriestype=:scatter)
```
Now we want to estimate the parameter $$\tau$$, the means (represented above by mu_1 and mu_2) and the covariance matrices (represented above by cov_1, cov_2) using only the realised datapoints in the DataFrame called dd. We will refer to these parameters as $$\theta$$.

If we knew from which distribution each datapoint came, the above task would be considerably easier. We could separate the data and for each group use [standard techniques](https://en.wikipedia.org/wiki/Estimation_of_covariance_matrices) to find the expected mean and covariance matrix. We do not know from which distribution each datapoint came, however. We can nonetheless use a guess for $$\theta$$ to estimate the probability of each datapoint coming from each distribution (call this vector of estimates $$Z$$). Then we could choose maximum likelihood estimates of $$\theta$$ using our estimates of $$Z$$ in the likelihood expression. This is the EM approach. Note that it lends itself well to fixed point acceleration: we can write a function that, given $$\theta$$, creates estimated probabilities of the source distribution for each datapoint ($$Z$$) and uses these in a maximum likelihood expression to improve the estimate of $$\theta$$.
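Concretely, given a guess of $$\theta$$, the estimated probability that datapoint $$j$$ came from the first distribution (the quantity computed by the z_estimate_given_theta function below) is

$$Z_j = \frac{\tau \, f_1(x_j)}{\tau \, f_1(x_j) + (1-\tau) \, f_2(x_j)}$$

where $$f_1$$ and $$f_2$$ are the two multivariate normal densities implied by the current guess of $$\theta$$.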

In this multivariate Gaussian case there are simple expressions for the parameters that maximise the likelihood. These are recounted in the [Wikipedia article on expectation maximisation](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) and are used below:
```
function z_estimate_given_theta(x::Array{Float64,1}, md_1::MultivariateNormal, md_2::MultivariateNormal, tau::Float64)
    pdf_1 = pdf(md_1, x)
    pdf_2 = pdf(md_2, x)
    return tau*pdf_1 / (tau*pdf_1 + (1-tau)*pdf_2)
end
function update_tau(Z::Array{Float64,1})
    return mean(Z)
end
function update_mu(dd::DataFrame, Z::Array{Float64,1})
    X = convert(Array{Float64,2}, dd[[:x1, :x2]])
    sum_Z = sum(Z)
    updated_mu = (transpose(Z) * X) ./sum_Z
    return vec(updated_mu)
end
function update_cov(dd::DataFrame, updated_mu::Array{Float64,1}, Z::Array{Float64,1})
    X_minus_mu = convert(Array{Float64,2}, dd[[:x1, :x2]]) .- transpose(updated_mu)
    sum_Z = sum(Z)
    updated_cov = (transpose(Z .* X_minus_mu) * X_minus_mu) ./sum_Z
    return [updated_cov[1,1], updated_cov[1,2], updated_cov[2,2]]
end
function update_theta(theta::Array{Float64,1}, dd::DataFrame)
    # We will use the convention that theta's 11 entries are (mu_1, cov_1, mu_2, cov_2, tau). First unpacking theta:
    mu_1 = theta[[1,2]]
    cov_1 = theta[[3,4,5]]
    covar_1 = Symmetric([cov_1[1] cov_1[2]; cov_1[2] cov_1[3]])
    md_1 = MultivariateNormal(mu_1,covar_1)
    mu_2 = theta[[6,7]]
    cov_2 = theta[[8,9,10]]
    covar_2 = Symmetric([cov_2[1] cov_2[2]; cov_2[2] cov_2[3]])
    md_2 = MultivariateNormal(mu_2,covar_2)
    tau = theta[11]
    # Getting Z
    Z = Array{Float64,1}(undef,size(dd)[1])
    for i in 1:size(dd)[1]
        Z[i] = z_estimate_given_theta([dd[i,:x1], dd[i,:x2]], md_1, md_2, tau)
    end
    # Updating Tau
    updated_tau = update_tau(Z)
    # Updating mu1
    updated_mu_1 = update_mu(dd,Z)
    updated_mu_2 = update_mu(dd, 1 .- Z)
    # Updating Cov
    updated_cov_1 = update_cov(dd, updated_mu_1, Z)
    updated_cov_2 = update_cov(dd, updated_mu_2, 1 .- Z)
    # Returning theta
    updated_theta = vcat(updated_mu_1, updated_cov_1, updated_mu_2, updated_cov_2, updated_tau)
    return updated_theta
end
```
Now we can choose an initial guess by eyeballing the plotted data. We can then pass it to the fixed_point function to get maximum likelihood estimates of these distributional parameters as well as $$\tau$$:

```
InitialGuess = [0.5, 7.5, 2.0, 0.0, 2.0, -5.0, 7.5, 2.0, 0.0, 10.0, 0.5]
fp_anderson = fixed_point(x -> update_theta(x,dd), InitialGuess; Algorithm = Anderson, PrintReports = true)
fp_simple = fixed_point(x -> update_theta(x,dd), InitialGuess; Algorithm = Simple, PrintReports = true)
```
We can see that the Anderson method takes only 15 iterations while the simple method takes 80. By checking the generated fixed point against the data generation process it can also be verified that the fixed point we find provides quite good estimates.
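A hedged sketch of that check (the FixedPoint_ field name is an assumption here rather than something stated in this section):
```
# A sketch; FixedPoint_ is assumed to hold the converged vector, in line with the
# FixedPointResults naming used elsewhere (e.g. FailedEvaluation_).
theta_hat = fp_anderson.FixedPoint_
est_tau  = theta_hat[11]     # compare with true_tau = 0.6
est_mu_1 = theta_hat[1:2]    # compare with mu_1 = [0.0, 8.0]
est_mu_2 = theta_hat[6:7]    # compare with mu_2 = [-4.0, 10.0]
```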

## 4.4 A consumption smoothing problem

Consider an infinitely lived consumer that has a budget of $B_t$ at time $t$ and a periodic income of $1$. She has a periodic utility function given by $\epsilon_t x_t^\delta$, where $x_t$ is spending in period $t$ and $\epsilon_t$ is the shock in period $t$ drawn from some stationary nonnegative shock process with pdf $f(\epsilon)$ defined on the interval $[y,z]$. The problem for the consumer in period $t$ is to maximise their value function:

@@ -211,4 +329,4 @@ fp_anderson = fixed_point(OneIterateBudgetValues, InitialGuess; Algorithm = Ande
fp_simple = fixed_point(OneIterateBudgetValues, InitialGuess; Algorithm = Simple, PrintReports = true)
```

This takes 22 iterates with the Anderson algorithm, which is drastically better than the 459 iterates it takes with the simple method.
