# Group Lasso
Reference: Yuan, Ming, and Yi Lin. "Model selection and estimation in regression with grouped variables." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.1 (2006): 49-67.

The objective function is $$min_{\beta} \frac12 \| y-\Sigma_{j=1}^JX_j\beta_j \|_2^2 + \lambda \| \beta_j \|_{K_j},$$
where $\| \eta \|_{K} = (\eta^T K \eta)^{1/2}$ for a vector $\eta \in R^d$ and a SPD matrix $K \in R^{d\times d}$,
$\beta_j$ is the coefficient vector of size $p_j$ and $p_j$ is the number of entries in $j$-th group.

Let $K_j=p_jI_{p_j}$ then $\| \beta_j \|_{K_j} = \Sigma_{j=1}^J \sqrt{p_j} \| \beta_j \|_2$,
**assume each $X_j$ is orthonormal as $X_j^T X_j = I_{p_j}$** for $j=1,...,J$,
according to KKT condition the solution of $\beta$  is
$$ -X_j^T (y-\Sigma_{j=1}^J X_j \beta_j) + \lambda \sqrt{p_j} \frac{\beta_j}{\|\beta_j\|_2} = 0, \text{ if } \beta_j \neq 0$$
otherwise, $\beta_j = 0$, zero lies in the subgradient of L2 norm of $\beta_j$ as $\| \frac{X_j^T \left(y - X\beta\right)} {\lambda \sqrt{p_j}} \|_2 \leq 1$, then
$$ \| X_j^T \left(y - X\beta_{-j}\right)\|_2 \leq \lambda \sqrt{p_j}, \text{ if } \beta_j = 0 $$

Denote $\beta_{-j}=( \beta_1^T, ..., \beta_{j-1}^T, \mathbf{0}, \beta_{j+1}^T, \beta_{J}^T)^T,
r_j = y - X\beta_{-j},
s_j = X_j^T r_j$,
recall $X_j^T X_j = I_{p_j}$,
when $\beta_j \neq 0$,
$$-s_j + X_j^T X_j\beta_j + \lambda \sqrt{p_j} \frac{\beta_j}{\|\beta_j\|_2} = 0$$
$$s_j = (1 + \frac{ \lambda \sqrt{p_j} } { \|\beta_j\|_2 }) \beta_j$$
$s_j$ and $\beta_j$ are parallel, thus $\frac{\beta_j}{\|\beta_j\|_2} = \frac{s_j}{\|s_j\|_2}$ and 
$\beta_j = (1 - \frac{ \lambda \sqrt{p_j} } { \|s_j\|_2 }) s_j$.

Finally, combine solution of $\beta_j$ in two cases,
$$\beta_j = (1 - \frac{ \lambda \sqrt{p_j} } { \|s_j\|_2 })_+ s_j$$

The algorithm is updated iteratively for $j=1,...,J$.

Simulation setting:
$n=100, p=20$, $X$ is centered and scaled to variance 1 in each column, and X in each group holds orthonormality

In [1]:
n = 100
p = 20
set.seed(0)
# beta = rnorm(p, mean = 0, sd = 1)
# print(mean(beta))
# print(sd(beta))

X = matrix( rnorm(n*p, mean = 0, sd = 1), n, p )
# X = X - colMeans(X) # centering
# Xsd <- sqrt(colMeans(X^2))
# X = t( t(X)/Xsd ) /sqrt(p)
X = scale(X, center=T, scale=T)

## Note that if orthonormalize the whole matrix X, the update only happens in the first iteration, r_j would be updated since beta changes
## but s_j=X_j^T(y-X\beta_{-j}) remains the same since X_1^T X = [I, 0 , 0, 0] as an example then \beta_j remains unchanged as well
# X.svd = svd(X) 
# X = X.svd$u
# # t(X) %*% X

In [2]:
group_number = 4
card_per_group = p / group_number
beta_true = c(0.15,-0.33,0.25,-0.25,0.05, 0,0,0,0.5,0.2, -0.25, 0.12,-0.125,0,0, 0,0,0,0,0)
cat('beta_true:', beta_true, '\n')
group_id = rep(1:group_number, each=card_per_group)
cat('group_id:', group_id, '\n')
J = group_number
group_list = list()
for(j in 1:J){
  group_list[[j]] = c(1:p)[group_id==j]
}
group_list

beta_true: 0.15 -0.33 0.25 -0.25 0.05 0 0 0 0.5 0.2 -0.25 0.12 -0.125 0 0 0 0 0 0 0 
group_id: 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 


In [3]:
for(j in 1:J){
  X_j = X[, group_list[[j]] ]
  X_j.svd = svd(X_j)
  X[, group_list[[j]] ] = X_j.svd$u
}
# round(t(X) %*% X, 2)

In [4]:
e = rnorm(n, 0, 0.1)
y = X %*% beta_true + e

In [5]:
# hyperparameters
lambda = 0.1
max_iter = 100
threshold = 1e-5

beta_hat = rep(10, p)

iter = 0

while(iter < max_iter){
  beta_old = beta_hat
  for(j in 1:J){ # for each group
    r_j = y - X[, -group_list[[j]] ] %*% beta_hat[ -group_list[[j]] ] # r_j = y - \Sigma_{k \neq j} X \beta_{-j}
#     cat(iter, j, r_j, '\n')
#     cat('L2 norm of r_j', sqrt(sum(r_j^2)), '\n')
    X_j = X[, group_list[[j]] ]
#     print(round(t(X_j) %*% X,2  ))
    
    s_j = t(X_j) %*% r_j # s_j = X_j^T ( y - \Sigma_{k \neq j} X \beta_{-j} )
#     print(s_j)
    p_j = card_per_group
    s_2_norm = sqrt( sum(s_j^2) )
#     cat('L2 norm of s_j',s_2_norm, '\n')

    if(s_2_norm <= lambda * sqrt(p_j)){
      beta_j = 0
    }
    else{
      beta_j = (1 - lambda*sqrt(p_j)/s_2_norm) * s_j
    }
    beta_hat[ group_list[[j]] ] = beta_j
#     print(beta_j)
#     cat('L2 norm of beta_j', sqrt(sum(beta_j^2)), '\n')
    
  }
  iter = iter + 1
  change = max( abs( beta_hat - beta_old ) )
  cat('iter', iter, change, '\n')
  if(change < threshold){
    break
  }
}
round(beta_hat,3)

iter 1 16.69633 
iter 2 6.479991 
iter 3 2.08435 
iter 4 0.8267275 
iter 5 0.2383172 
iter 6 0.04213397 
iter 7 0.0140898 
iter 8 0.003104337 
iter 9 0.0006537933 
iter 10 0.0001368071 
iter 11 2.868036e-05 
iter 12 6.026674e-06 


# Sparse group lasso
Reference: Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "A note on the group lasso and a sparse group lasso." arXiv preprint arXiv:1001.0736 (2010).

Intuition: sparse group lasso yield sparsity at both the group and individual feature levels.

The objective function is
$$min_{\beta} \frac12 \| y-\Sigma_{l=1}^L X^{(l)}\beta^{(l)} \|_2^2 + \lambda_1 \Sigma_{l=1}^L \|\beta^{(l)}\|_2 + \lambda_2 \|\beta^{(l)}\|_1,$$

the subgradient equations are
$$ -X^{(l)T} (y-\Sigma_{l=1}^L X^{(l)}\beta^{(l)}) + \lambda_1 s_l + \lambda_2 t_l = 0. $$
This is separable for groups so that block coordinate descent can be used, each time update one group only, with other group coefficients fixed.


For each group $l$ whose element $j=1,...,p_j$,
$X^{(l)}=(Z_1,...,Z_{p_j}), \beta^{(l)} = \theta = (\theta_1,...,\theta_{p_j})^T$, let $r = y - \Sigma_{i\neq l} X^{(i)}\beta^{(i)}$,

the subgradient equations are
$$ -Z_j^T(r - \Sigma_k Z_k\theta_k) + \lambda_1 s_j + \lambda_2 t_j = 0 $$

Let $a=X^{(l)T} r$,

$$\theta=0 \iff a_j=\lambda_1 s_j + \lambda_2 t_j \text{ with } \|s\|\leq1, t_j\in[-1,1]$$

$$\iff min_t J(t) = \Sigma_{j=1}^{p_j}s_j^2 = \frac{1}{\lambda_1^2} \Sigma_{j=1}^{p_j} (a_j-\lambda_2 t_j)^2, \text{ s.t. }t_j\in[-1,1],$$
whose solution is $\hat t_j=\frac{a_j}{\lambda_2}$ if $|\frac{a_j}{\lambda_2}|\leq1$, otherwise $\hat t_j=sign(\frac{a_j}{\lambda_2})$.

If $\hat t_j$ is a feasible solution with $J(\hat t_j)\leq1$, then $\theta=0$.

If $J(\hat t_j)>1$, optimize each nonzero entries in $\theta$ though
$$ min_{\theta_j} \frac12 \| r - \Sigma_{j=1}^{p_j} Z_j \theta_j \|_2^2 + \lambda_1 \| \theta \|_2 + \lambda_2 \Sigma_{j=1}^{p_j} |\theta_j|$$
by one-dimensional search with **optimize** function in the R package

Algorithm: 
1. outer loop: block coordinate decent over group $l=1,...,L$,
2. inner loop: coordinate decent over (nonzero) entries $\theta_j$ for $j=1,...,p_j$

Simulation setting:
$n=100, p=20$, $X$ is centered and scaled to variance $\frac1p$ in each column

In [6]:
n = 100
p = 20
set.seed(0)

X = matrix( rnorm(n*p, mean = 0, sd = 1), n, p )
X = X - colMeans(X) # centering
Xsd <- sqrt(colMeans(X^2))
X = t( t(X)/Xsd ) /sqrt(p)

e = rnorm(n, 0, 0.1)
y = X %*% beta_true + e

In [7]:
group_number = 4
card_per_group = p / group_number
beta_true = c(0.15,-0.33,0.25,-0.25,0.05, 0,0,0,0.5,0.2, -0.25, 0.12,-0.125,0,0, 0,0,0,0,0)
# cat('beta:', beta, '\n')
group_id = rep(1:group_number, each=card_per_group)
# cat('group_id:', group_id, '\n')

In [8]:
L = group_number
group_list = list()
for(l in 1:L){
  group_list[[l]] = c(1:p)[group_id==l]
}
# group_list

In [9]:
# hyperparameters
lambda.1 = 0.1
lambda.2 = 0.2
max_iter = 100
threshold = 1e-5

beta_hat = rep(0, p)

iter = 0

while(iter < max_iter){
  beta_old = beta_hat
  for(l in 1:L){ # for each group
    r_l = y - X[, -group_list[[l]] ] %*% beta_hat[ -group_list[[l]] ] # r_l = y - \Sigma_{k \neq j} X \beta_{-j}
#     cat(iter, l, r_l, '\n')
    X_l = X[, group_list[[l]] ]
    a = t(X_l) %*% r_l

#     a_l2 = a / lambda.2
# #     print(a_l2)
#     t = a_l2
#     t[t>1] = 1
#     t[t<(-1)] = -1
# #     print(t)
#     a_l2t_2 = (a - lambda.2*t)^2 # (a_j - \lambda_2 t_j)^2
# #     print(a_l2t_2)
#     J = sum(a_l2t_2) / (lambda.1**2)
#     print(J)
#     if(J <= 1){
#       theta = 0
#     }

    p_l = card_per_group
    theta = rep(0, p_l)
    for(j in 1:p_l){
      if(abs(a[j]) <= lambda.2){
        theta[j] = 0
      }
      else{
        Z_j = X_l[, -j] # Z_{-j}
        theta_j = theta[-j] # \theta_{-j}
        f <- function(x) { 1/2* sum(( r_l - Z_j%*%theta_j - X_l[,j]*x )^2) +
                          lambda.1 * sqrt(sum(theta_j**2) + x**2) +
                          lambda.2 * (sum(abs(theta_j)) + abs(x))  }
        theta[j] = optimize(f, c(-10,10))$minimum
      }
    }
    print(theta)
    beta_hat[ group_list[[l]] ] = theta
  
  }



  iter = iter + 1
  change = max( abs( beta_hat - beta_old ) )
  cat('iter', iter, change, '\n')
  if(change < threshold){
    break
  }
}
round(beta_hat,3)

[1]  2.719209e-02 -3.893512e-01  3.046806e-01 -1.735826e-01  1.346569e-06
[1] -0.01562881 -0.12591371  0.00000000  0.45096282  0.08911321
[1] -0.18672487  0.06802142 -0.08516696  0.04588338  0.00000000
[1]  0.00000e+00  0.00000e+00  0.00000e+00 -8.22812e-06  0.00000e+00
iter 1 0.4509628 
[1]  0.03061995 -0.35211492  0.25070759 -0.18414058  0.00000000
[1] -5.228342e-05 -7.127099e-02  0.000000e+00  4.413935e-01  6.923042e-02
[1] -0.19879879  0.07278474 -0.09648731  0.05922917  0.00000000
[1]  0.000000e+00  0.000000e+00  0.000000e+00 -3.698008e-05 -7.812010e-06
iter 2 0.05464272 
[1]  0.02954488 -0.34953557  0.24797525 -0.18503279  0.00000000
[1] -6.886815e-06 -6.584137e-02  0.000000e+00  4.393237e-01  6.845442e-02
[1] -0.19987972  0.07360516 -0.09757684  0.06039204  0.00000000
[1]  0.000000e+00  0.000000e+00  0.000000e+00  2.452883e-07 -3.605729e-05
iter 3 0.005429615 
[1]  0.02874492 -0.34914443  0.24763086 -0.18495681  0.00000000
[1] -1.001736e-05 -6.522575e-02  0.000000e+00  4.390715e

In [10]:
# library(SGL)
# SGL(data = list(x=X, y=y), index = group_id)
# SGL(data = list(x=X, y=y), index = group_id, lambdas = (lambda.1+lambda.2), alpha=lambda.2/(lambda.1+lambda.2))

Conclusion:
1. Group lasso put assumption on groupwise sparsity and successfully reveal all zeroes in the last group, but for nonzero groups like group 2 and group 3 in the simulation setting, elements are shrinkaged to be closer, are all nonzero in these groups.
2. Sparse group lasso handles both groupwise sparsity and the sparsity within each group, which performs better especially for the case when both zero and nonzero elements exist within group, just like group 2 and group 3.