---
title: "ridge_em"
author: "Matthew Stephens"
date: "2020-05-29"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---
## Introduction

Here I am going to experiment with EM algorithms for estimating the parameters
of ridge regression under different parameterizations.
Initial derivations of the EM updates are [here](EM_Ridge.pdf).
I initially implemented parameterizations 1, 2, and 5 from that document.
A further derivation for another parameterization is [here](EM_Ridge_02.pdf).
## Simple parameterization

$$y \sim N(Xb, s^2 I)$$
$$b \sim N(0, s_b^2 I)$$
```{r}
ridge_em1 = function(y, X, s2, sb2, niter = 10){
  XtX = t(X) %*% X
  Xty = t(X) %*% y
  yty = t(y) %*% y
  n = length(y)
  p = ncol(X)
  loglik = rep(0, niter)
  for(i in 1:niter){
    V = chol2inv(chol(XtX + diag(s2/sb2, p)))
    SigmaY = sb2 * (X %*% t(X)) + diag(s2, n)
    loglik[i] = mvtnorm::dmvnorm(as.vector(y), sigma = SigmaY, log = TRUE)
    Sigma1 = s2 * V                # posterior variance of b
    mu1 = as.vector(V %*% Xty)     # posterior mean of b
    s2 = as.vector((yty + sum(diag(XtX %*% (mu1 %*% t(mu1) + Sigma1))) - 2*sum(Xty*mu1))/n)
    sb2 = mean(mu1^2 + diag(Sigma1))
  }
  return(list(s2 = s2, sb2 = sb2, loglik = loglik, postmean = mu1))
}
```
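As a quick sanity check (this snippet is my own addition, with an arbitrary simulated example), after convergence the returned `postmean` should approximately equal the closed-form ridge estimator $(X^TX + (s^2/s_b^2)I)^{-1}X^Ty$ evaluated at the estimated variances:
```{r}
# Sanity check (an added sketch, not part of the original derivations):
# compare postmean to the closed-form ridge estimator at the estimated
# (s2, sb2). Agreement is approximate because the stored posterior mean
# uses the hyperparameters from the penultimate update.
set.seed(1)
n_chk = 50
X_chk = matrix(rnorm(n_chk*n_chk), ncol = n_chk)
y_chk = X_chk %*% rnorm(n_chk) + rnorm(n_chk)
fit = ridge_em1(y_chk, X_chk, 1, 1, 200)
bhat = solve(t(X_chk) %*% X_chk + diag(fit$s2/fit$sb2, n_chk),
             t(X_chk) %*% y_chk)
max(abs(fit$postmean - bhat))  # should be near 0
```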
## Scaled parameterization

In this parameterization I take the $s_b$ out of the prior and put it in the mean of $y$:
$$y \sim N(s_b Xb, s^2 I)$$
$$b \sim N(0, I).$$
```{r}
ridge_em2 = function(y, X, s2, sb2, niter = 10){
  XtX = t(X) %*% X
  Xty = t(X) %*% y
  yty = t(y) %*% y
  n = length(y)
  p = ncol(X)
  loglik = rep(0, niter)
  for(i in 1:niter){
    V = chol2inv(chol(XtX + diag(s2/sb2, p)))
    SigmaY = sb2 * (X %*% t(X)) + diag(s2, n)
    loglik[i] = mvtnorm::dmvnorm(as.vector(y), sigma = SigmaY, log = TRUE)
    Sigma1 = (s2/sb2) * V                              # posterior variance of b
    mu1 = (sqrt(sb2)/s2) * as.vector(Sigma1 %*% Xty)   # posterior mean of b
    sb2 = (sum(mu1*Xty)/sum(diag(XtX %*% (mu1 %*% t(mu1) + Sigma1))))^2
    s2 = as.vector((yty + sb2*sum(diag(XtX %*% (mu1 %*% t(mu1) + Sigma1))) - 2*sqrt(sb2)*sum(Xty*mu1))/n)
  }
  return(list(s2 = s2, sb2 = sb2, loglik = loglik, postmean = mu1*sqrt(sb2)))
}
```
## A hybrid/redundant parameterization

Motivated by initial observations that 1 and 2 can converge
well in different settings, I implemented a hybrid of the two:
$$y \sim N(s_b Xb, s^2 I)$$
$$b \sim N(0, \lambda^2 I).$$
Note that there is a redundancy/non-identifiability here: the likelihood depends
only on the product $s_b^2 \lambda^2$. The hope is to get the
best of both worlds...
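To see the redundancy explicitly, integrate $b$ out: the marginal distribution of $y$ is
$$y \sim N(0,\; s_b^2 \lambda^2 XX^T + s^2 I),$$
which involves $s_b^2$ and $\lambda^2$ only through their product.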
```{r}
ridge_em3 = function(y, X, s2, sb2, l2, niter = 10){
  XtX = t(X) %*% X
  Xty = t(X) %*% y
  yty = t(y) %*% y
  n = length(y)
  p = ncol(X)
  loglik = rep(0, niter)
  for(i in 1:niter){
    V = chol2inv(chol(XtX + diag(s2/(sb2*l2), p)))
    SigmaY = l2*sb2 * (X %*% t(X)) + diag(s2, n)
    loglik[i] = mvtnorm::dmvnorm(as.vector(y), sigma = SigmaY, log = TRUE)
    Sigma1 = (s2/sb2) * V                          # posterior variance of b
    mu1 = (1/sqrt(sb2)) * as.vector(V %*% Xty)     # posterior mean of b
    sb2 = (sum(mu1*Xty)/sum(diag(XtX %*% (mu1 %*% t(mu1) + Sigma1))))^2
    s2 = as.vector((yty + sb2*sum(diag(XtX %*% (mu1 %*% t(mu1) + Sigma1))) - 2*sqrt(sb2)*sum(Xty*mu1))/n)
    l2 = mean(mu1^2 + diag(Sigma1))
  }
  return(list(s2 = s2, sb2 = sb2, l2 = l2, loglik = loglik, postmean = mu1*sqrt(sb2)))
}
```
## Avoiding large 2nd moment computation

The previous parameterizations require the full second moment of $b$,
which is a $p \times p$ matrix. This can be expensive to compute if $p$ is big.
The following parameterization avoids this:
$$y \sim N(sXb, s^2 I)$$
$$b \sim N(0, s_b^2 I).$$
(Note that for simplicity I still compute the $p \times p$ matrix, as for now this
is the easiest way to implement the ridge regression.)
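To spell out why the $s$ update needs only inner products of $n$-vectors: writing $\bar{b}$ for the posterior mean of $b$ and $\hat{y} = X\bar{b}$, the expected complete-data log-likelihood, up to terms not involving $s$, is
$$-n \log s - \frac{y^Ty}{2s^2} + \frac{y^T\hat{y}}{s},$$
because the $s^2\,E(b^TX^TX b)$ term cancels with the $2s^2$ denominator. Setting the derivative to zero gives the quadratic $n s^2 + (y^T\hat{y})\,s - y^Ty = 0$, whose positive root
$$s = \frac{\sqrt{(y^T\hat{y})^2 + 4n\,y^Ty} - y^T\hat{y}}{2n}$$
is what the code below computes (and then squares to get $s^2$).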
```{r}
dot = function(x, y){ sum(x*y) }

ridge_em4 = function(y, X, s2, sb2, niter = 10){
  XtX = t(X) %*% X
  Xty = t(X) %*% y
  yty = t(y) %*% y
  n = length(y)
  p = ncol(X)
  loglik = rep(0, niter)
  for(i in 1:niter){
    SigmaY = s2*sb2 * (X %*% t(X)) + diag(s2, n)
    loglik[i] = mvtnorm::dmvnorm(as.vector(y), sigma = SigmaY, log = TRUE)
    Sigma1 = chol2inv(chol(XtX + diag(1/sb2, p)))     # posterior variance of b
    mu1 = (1/sqrt(s2)) * as.vector(Sigma1 %*% Xty)    # posterior mean of b
    sb2 = mean(mu1^2 + diag(Sigma1))
    yhat = X %*% mu1
    # s is the positive root of the quadratic n*s^2 + (y'yhat)*s - y'y = 0
    s2 = drop((0.5/n) * (sqrt(dot(y,yhat)^2 + 4*n*yty) - dot(y,yhat)))^2
  }
  return(list(s2 = s2, sb2 = sb2, loglik = loglik, postmean = mu1*sqrt(s2)))
}
```
## Another redundant parameterization

Here I consider
$$y \sim N(s_b Xb, s^2 I)$$
where
$$b \sim N(0, s^2 \lambda^2 I).$$
This is like the redundant parameterization above,
except that the prior on $b$ is scaled by the
residual variance ($s^2$).
This is motivated by the result in the Bayesian lasso paper (Park and Casella, 2008)
that scaling the prior this way makes the joint posterior on $(b, s^2)$ unimodal.
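Integrating $b$ out again makes the difference clear: the marginal distribution is now
$$y \sim N\left(0,\; s^2\,(\lambda^2 s_b^2\, XX^T + I)\right),$$
so $s^2$ scales the entire covariance while $\lambda^2 s_b^2$ plays the role of a signal-to-noise ratio.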
```{r}
ridge_em5 = function(y, X, s2, sb2, l2, niter = 10){
  XtX = t(X) %*% X
  Xty = t(X) %*% y
  yty = t(y) %*% y
  n = length(y)
  p = ncol(X)
  loglik = rep(0, niter)
  for(i in 1:niter){
    SigmaY = l2*s2*sb2 * (X %*% t(X)) + diag(s2, n)
    loglik[i] = mvtnorm::dmvnorm(as.vector(y), sigma = SigmaY, log = TRUE)
    Sigma1 = chol2inv(chol((sb2/s2) * XtX + diag(1/(s2*l2), p)))   # posterior variance of b
    mu1 = (sqrt(sb2)/s2) * as.vector(Sigma1 %*% Xty)               # posterior mean of b
    sb2 = (sum(mu1*Xty)/sum(diag(XtX %*% (mu1 %*% t(mu1) + Sigma1))))^2   # same as em3
    # as in em3, but adds sum((mu1^2 + diag(Sigma1))/l2) to the numerator and p to the denominator
    s2 = as.vector((sum((mu1^2+diag(Sigma1))/l2) + yty + sb2*sum(diag(XtX %*% (mu1 %*% t(mu1) + Sigma1))) - 2*sqrt(sb2)*sum(Xty*mu1))/(n+p))
    l2 = mean(mu1^2 + diag(Sigma1))/s2   # as in em3, but divided by s2
  }
  return(list(s2 = s2, sb2 = sb2, l2 = l2, loglik = loglik, postmean = mu1*sqrt(sb2)))
}
```
## Simple Simulations

This is a simple simulation whose design matrix has independent entries.

### High signal

This simulation has high signal:
```{r}
set.seed(100)
sd = 1
n = 100
p = n
X = matrix(rnorm(n*p), ncol = p)
btrue = rnorm(p)
y = X %*% btrue + sd*rnorm(n)
plot(X %*% btrue, y)
```
```{r}
y.em1 = ridge_em1(y,X,1,1,100)
y.em2 = ridge_em2(y,X,1,1,100)
y.em3 = ridge_em3(y,X,1,1,1,100)
y.em4 = ridge_em4(y,X,1,1,100)
y.em5 = ridge_em5(y,X,1,1,1,100)
plot_loglik = function(res){
  maxloglik = max(res[[1]]$loglik)
  minloglik = min(res[[1]]$loglik)
  maxlen = length(res[[1]]$loglik)
  for(i in 2:length(res)){
    maxloglik = max(c(maxloglik, res[[i]]$loglik))
    minloglik = min(c(minloglik, res[[i]]$loglik))
    maxlen = max(maxlen, length(res[[i]]$loglik))
  }
  plot(res[[1]]$loglik, type = "n", ylim = c(minloglik, maxloglik), xlim = c(0, maxlen),
       ylab = "log-likelihood", xlab = "iteration")
  for(i in 1:length(res)){
    lines(res[[i]]$loglik, col = i, lwd = 2)
  }
}
res = list(y.em1,y.em2,y.em3,y.em4,y.em5)
plot_loglik(res)
```
### No signal

This simulation has no signal ($b=0$):
```{r}
btrue = rep(0,n)
y = X %*% btrue + sd*rnorm(n)
y.em1 = ridge_em1(y,X,1,1,100)
y.em2 = ridge_em2(y,X,1,1,100)
y.em3 = ridge_em3(y,X,1,1,1,100)
y.em4 = ridge_em4(y,X,1,1,100)
y.em5 = ridge_em5(y,X,1,1,1,100)
plot_loglik(list(y.em1,y.em2,y.em3,y.em4,y.em5))
```
## Trendfiltering Simulations

This is a more challenging example, in that the design matrix is correlated:
its columns are shifted linear ramps, so $Xb$ is piecewise linear with
changes in slope where $b$ is non-zero.

### High Signal
```{r}
set.seed(100)
sd = 1
n = 100
p = n
X = matrix(0,nrow=n,ncol=n)
for(i in 1:n){
  X[i:n,i] = 1:(n-i+1)
}
btrue = rep(0,n)
btrue[40] = 8
btrue[41] = -8
y = X %*% btrue + sd*rnorm(n)
plot(y)
lines(X %*% btrue)
y.em1 = ridge_em1(y,X,1,1,100)
lines(X %*% y.em1$postmean,col=1,lwd=2)
y.em2 = ridge_em2(y,X,1,1,100)
lines(X %*% y.em2$postmean,col=2,lwd=2)
y.em3 = ridge_em3(y,X,1,1,1,100)
lines(X %*% y.em3$postmean,col=3,lwd=2)
y.em4 = ridge_em4(y,X,1,1,100)
lines(X %*% y.em4$postmean,col=4,lwd=2)
y.em5 = ridge_em5(y,X,1,1,1,100)
lines(X %*% y.em5$postmean,col=5,lwd=2)
```
Look at the likelihoods:
```{r}
plot_loglik(list(y.em1,y.em2,y.em3,y.em4,y.em5))
```
Run the second one (em2) longer, and compare the implied prior and residual variances across parameterizations:
```{r}
y.em2 = ridge_em2(y,X,1,1,1000)
plot_loglik(list(y.em1,y.em2,y.em3,y.em4,y.em5))
y.em1$sb2
y.em2$sb2
y.em3$sb2 * y.em3$l2
y.em4$sb2 * y.em4$s2
y.em5$sb2 * y.em5$l2 * y.em5$s2
y.em1$s2
y.em2$s2
y.em3$s2
y.em4$s2
y.em5$s2
```
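Note that in every parameterization the marginal covariance of $y$ has the form
$$(\text{implied prior variance of effects})\, XX^T + (\text{residual variance})\, I,$$
so the implied prior variances printed above ($s_b^2$, $s_b^2\lambda^2$, $s_b^2 s^2$, $s_b^2\lambda^2 s^2$) and the residual variances should approximately agree across parameterizations if all runs have converged to the same optimum.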
## Different initializations

Try starting $s^2$ in the wrong place:
```{r}
y.em1 = ridge_em1(y,X,10,1,100)
y.em2 = ridge_em2(y,X,10,1,100)
y.em3 = ridge_em3(y,X,10,1,1,100)
y.em4 = ridge_em4(y,X,10,1,100)
y.em5 = ridge_em5(y,X,10,1,1,100)
plot_loglik(list(y.em1,y.em2,y.em3,y.em4,y.em5))
```
Try starting $s_b^2$ in the wrong place:
```{r}
y.em1 = ridge_em1(y,X,1,10,100)
y.em2 = ridge_em2(y,X,1,10,100)
y.em3 = ridge_em3(y,X,1,10,10,100)
y.em4 = ridge_em4(y,X,1,10,100)
y.em5 = ridge_em5(y,X,1,10,10,100)
plot_loglik(list(y.em1,y.em2,y.em3,y.em4,y.em5))
```
Try starting both in the wrong place. Interestingly, in this example em4 seems to converge
to a local optimum:
```{r}
y.em1 = ridge_em1(y,X,.1,10,100)
y.em2 = ridge_em2(y,X,.1,10,100)
y.em3 = ridge_em3(y,X,.1,10,10,100)
y.em4 = ridge_em4(y,X,.1,10,100)
y.em5 = ridge_em5(y,X,.1,10,10,100)
plot_loglik(list(y.em1,y.em2,y.em3,y.em4,y.em5))
y.em4$s2
y.em1$s2
```
### No signal case

Try the no-signal case -- the convergence issues are reversed!
```{r}
sd = 1
n = 100
p = n
X = matrix(0,nrow=n,ncol=n)
for(i in 1:n){
  X[i:n,i] = 1:(n-i+1)
}
btrue = rep(0,n)
y = X %*% btrue + sd*rnorm(n)
plot(y)
lines(X %*% btrue)
y.em1 = ridge_em1(y,X,1,1,100)
lines(X %*% y.em1$postmean,col=1,lwd=2)
y.em2 = ridge_em2(y,X,1,1,100)
lines(X %*% y.em2$postmean,col=2,lwd=2)
y.em3 = ridge_em3(y,X,1,1,1,100)
lines(X %*% y.em3$postmean,col=3,lwd=2)
y.em4 = ridge_em4(y,X,1,1,100)
lines(X %*% y.em4$postmean,col=4,lwd=2)
y.em5 = ridge_em5(y,X,1,1,1,100)
lines(X %*% y.em5$postmean,col=5,lwd=2)
```
EM2, EM3, and EM5 converge faster here:
```{r}
plot_loglik(list(y.em1,y.em2,y.em3,y.em4,y.em5))
```
Try starting the expanded algorithms (em3 and em5) from a very large $\lambda^2$... they still seem to work.
```{r}
y.em3b = ridge_em3(y,X,1,1,100,100)
y.em5b = ridge_em5(y,X,1,1,100,100)
plot_loglik(list(y.em1,y.em2,y.em3b,y.em4,y.em5b))
```
### Possible next steps

It might be interesting to combine the expanded idea with algorithm em4 (a hypothetical
sketch follows below).
It might also be interesting to add another redundant parameter multiplying the residual
variance in the second redundant parameterization, so that some of the residual variance
is tied to the prior variance and some is not.
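For concreteness, here is a minimal sketch of the first idea: an expanded version of em4 with model $y \sim N(s\lambda Xb, s^2 I)$, $b \sim N(0, s_b^2 I)$. The function `ridge_em6` and its updates are my own derivation, not taken from the linked notes, so treat it as an untested sketch. The E-step posterior of $b$ has precision $\lambda^2 X^TX + s_b^{-2}I$, and the M-step for $(s, \lambda)$ is available jointly in closed form; note, however, that the $\lambda$ update reintroduces the trace term $E(b^TX^TXb)$ that em4 was designed to avoid.
```{r}
# Hypothetical sketch (my own derivation): expanded em4 with
#   y ~ N(s*lambda*Xb, s^2 I),  b ~ N(0, sb2 I).
# The likelihood depends on (l2, sb2) only through l2*sb2.
ridge_em6 = function(y, X, s2, sb2, l2, niter = 10){
  XtX = t(X) %*% X
  Xty = t(X) %*% y
  yty = sum(y^2)
  n = length(y)
  p = ncol(X)
  loglik = rep(0, niter)
  for(i in 1:niter){
    SigmaY = l2*s2*sb2 * (X %*% t(X)) + diag(s2, n)
    loglik[i] = mvtnorm::dmvnorm(as.vector(y), sigma = SigmaY, log = TRUE)
    # E-step: posterior precision of b is l2*XtX + (1/sb2) I
    Sigma1 = chol2inv(chol(l2*XtX + diag(1/sb2, p)))   # posterior variance of b
    mu1 = sqrt(l2/s2) * as.vector(Sigma1 %*% Xty)      # posterior mean of b
    # M-step: sb2 as in em1; (s2, l2) maximized jointly in closed form.
    # Setting dQ/d(lambda) = 0 and dQ/ds = 0 gives
    #   s^2 = (y'y - d^2/Tr)/n and lambda^2 = d^2/(s^2 Tr^2),
    # where d = y'X E(b) and Tr = E(b'X'Xb).
    sb2 = mean(mu1^2 + diag(Sigma1))
    Tr = sum(diag(XtX %*% (mu1 %*% t(mu1) + Sigma1)))
    d = sum(Xty * mu1)
    s2 = (yty - d^2/Tr)/n
    l2 = d^2/(s2 * Tr^2)
  }
  return(list(s2 = s2, sb2 = sb2, l2 = l2, loglik = loglik,
              postmean = sqrt(s2*l2)*mu1))
}
```
This could be run alongside the others, e.g. `ridge_em6(y, X, 1, 1, 1, 100)`, and its log-likelihood path compared with `plot_loglik`.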