---
title: "tree_pca_02"
author: "Matthew Stephens"
date: "2020-07-29"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---
## Introduction
This follows up on a [previous analysis](tree_pca.html), looking at how
PCA behaves on some tree-structured covariance matrices.
## Tree-structured covariance - 4 tips
I look at the simple 4-tip tree where the root first splits into 2 branches,
and then each branch splits into two tips.
This function produces the covariance matrix of this tree, with
branch lengths given by `b`.
The first two elements of `b` give the lengths of the left and right
branches of the first split, and the remaining four elements give the
tip branch lengths. (There are 6 branches in all.)
```{r}
Sigma4_fun = function(b){
  Sigma4 = matrix(0, nrow=4, ncol=4)
  Sigma4[1:2,1:2] = Sigma4[1:2,1:2] + b[1] # left branch of first split
  Sigma4[3:4,3:4] = Sigma4[3:4,3:4] + b[2] # right branch of first split
  for(i in 1:4)
    Sigma4[i,i] = Sigma4[i,i] + b[2+i]     # tip branches
  Sigma4
}
```
### Even trees
Here we assume all the branch lengths are the same. We see that
the eigenvectors (PCs) reproduce the tree structure.
```{r}
Sigma4_even = Sigma4_fun(rep(1,6))
Sigma4_even
eigen(Sigma4_even)
```
However, note that the repeated eigenvalues (3,3) and (1,1)
mean that there is ambiguity in these eigenvectors: any
linear combination of the last two eigenvectors (those with
eigenvalue 1) is also an eigenvector. For example:
```{r}
v = c(-1,1,-1,1)
Sigma4_even %*% v # equals 1*v, so v is an eigenvector with eigenvalue 1
```
This means that if the truth is a tree like this,
the representation learned could be quite prior-dependent.
It makes it more attractive to put
assumptions like sparsity, or even a hierarchical structure,
on the loadings. (One long-term question is whether a sparsity
assumption is enough, or whether we really need to build in
a hierarchical assumption to recover trees.)
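To see this ambiguity concretely, here is a minimal sketch (my addition, not part of the original analysis): rotating the two eigenvalue-3 eigenvectors by an arbitrary angle gives an equally valid pair of eigenvectors, so nothing in the covariance itself singles out the tree-like basis.
```{r}
e = eigen(Sigma4_even)
theta = pi/6  # an arbitrary rotation angle
R = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
V = e$vectors[, 1:2] %*% R  # rotate within the eigenvalue-3 eigenspace
Sigma4_even %*% V - 3*V     # numerically zero: columns of V are still eigenvectors
```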
### The effects of centering
We can also center the covariance matrix before eigen-decomposition,
which is analogous to including a constant factor. This way each
eigenvector represents a "split" of the samples in the tree (rather
than the top two branches being put in two different eigenvectors).
```{r}
Sigma4_even_centered = Sigma4_even-mean(Sigma4_even)
eigen(Sigma4_even_centered)
```
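A small check I added on why centering produces splits here: the even matrix has constant row sums, so subtracting the overall mean makes every row sum to zero. The constant vector is then an eigenvector with eigenvalue 0, and the remaining eigenvectors, being orthogonal to it, are contrasts, i.e. splits.
```{r}
# Every row of the centered even matrix sums to zero, so the constant
# vector is a 0-eigenvector (this relies on Sigma4_even having constant row sums).
Sigma4_even_centered %*% rep(1, 4)
```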
### Balanced trees
By a (locally-)balanced tree I mean one where the two branches at every split
have equal length (although different splits may have different lengths). In this
case we see results similar to the even tree
(but this is not true for the locally-balanced 8-tip tree!)
```{r}
Sigma4_balanced = Sigma4_fun(c(1,1,2,2,3,3))
Sigma4_balanced
eigen(Sigma4_balanced)
eigen(Sigma4_balanced-mean(Sigma4_balanced))
```
### Un-balanced trees
With arbitrary branch lengths we do not see the same clean, exact results.
For example, in the first eigenvector the loadings are not equal across all
the samples that split the same way.
However, the general patterns are somewhat similar.
```{r}
Sigma4_unbalanced = Sigma4_fun(c(1,2,3,4,5,6))
Sigma4_unbalanced
eigen(Sigma4_unbalanced)
eigen(Sigma4_unbalanced-mean(Sigma4_unbalanced))
```
### Un-balanced trees: correlation matrix
Part of the issue is that with arbitrary branch lengths the different
tips do not have the same variance. Taking the correlation matrix
instead of the covariance matrix seems to make the results for irregular trees
closer to those for the "regular" trees.
```{r}
eigen(cov2cor(Sigma4_unbalanced)-mean(cov2cor(Sigma4_unbalanced)))
```
## Tree-structured covariance - 8 tips
Here I create an 8-tip tree by joining two 4-tip trees at a top split
whose two branch lengths are given by `b_top`.
```{r}
Sigma8_fun = function(b_top, b_left, b_right){
  Sigma8 = matrix(0, nrow=8, ncol=8)
  Sigma8[1:4,1:4] = b_top[1] + Sigma4_fun(b_left)  # left subtree
  Sigma8[5:8,5:8] = b_top[2] + Sigma4_fun(b_right) # right subtree
  Sigma8
}
```
### Balanced trees
Note that for the 8-tip case, the (locally-)balanced trees don't give results as clean
as the 4-tip case. In particular, the first eigenvector varies
among samples that split the same way. However, if you take the strategy of splitting
each eigenvector at its largest gap, then I think every eigenvector splits the tree according to
some branch (a sketch implementing this idea appears below).
```{r}
Sigma8_balanced = Sigma8_fun(c(1,1),c(2,2,3,3,4,4),c(5,5,6,6,7,7))
Sigma8_balanced
eigen(Sigma8_balanced)
eigen(Sigma8_balanced-mean(Sigma8_balanced))
```
Using the correlation matrix instead helps make the results cleaner and more "tree-like":
```{r}
eigen(cov2cor(Sigma8_balanced)-mean(cov2cor(Sigma8_balanced)))
```
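To make the largest-gap strategy concrete, here is a small helper (my sketch, not part of the original analysis): sort the loadings of an eigenvector, find the biggest jump, and partition the samples on either side of it.
```{r}
# Split the sample indices at the largest gap in the sorted loadings.
split_at_largest_gap = function(v){
  ord = order(v)
  cut = which.max(diff(v[ord])) # position of the largest gap
  split(ord, rep(1:2, c(cut, length(v) - cut)))
}
# Apply it to each eigenvector of the centered balanced tree:
e_bal = eigen(Sigma8_balanced - mean(Sigma8_balanced))
lapply(seq_len(ncol(e_bal$vectors)), function(j) split_at_largest_gap(e_bal$vectors[, j]))
```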
### Add an outgroup
Here I add a 9th "outgroup" population to an even 8-tip tree to see what happens.
I try the eigen-decomposition of the correlation matrix
both with and without centering. In both cases
the first eigenvector puts no weight on the 9th population. This might not
get in the way of partition-based approaches, because in both cases the first
eigenvector partitions the samples
in a way consistent with the unrooted tree. However, it may complicate
taking account of admixed populations: how does one distinguish
an admixed population from an outgroup?
```{r}
Sigma8_even = Sigma8_fun(c(1,1),rep(1,6),rep(1,6))
Sigma9 = rbind(cbind(1 + Sigma8_even, rep(0,8)), rep(0,9)) # outgroup is uncorrelated with tips 1-8
Sigma9[9,9] = 1 # outgroup branch length
Sigma9
Sigma9_cor = cov2cor(Sigma9)
Sigma9_cor
```
```{r}
eigen(Sigma9_cor)
eigen(Sigma9_cor-mean(Sigma9_cor))
```
## Try admixture
Here I add an admixed population intended to represent $X = qX_1 + (1-q)X_2$,
from which one obtains $cov(X,X_k) = q \, cov(X_1,X_k) + (1-q) \, cov(X_2,X_k)$.
```{r}
q = 0.5
Sigma8_even = Sigma8_fun(c(1,1),rep(1,6),rep(1,6))
Sigma9 = rbind(Sigma8_even, q*Sigma8_even[1,]+(1-q)*Sigma8_even[2,]) # cov of admixed pop with pops 1-8
Sigma9 = cbind(Sigma9, c(q*Sigma8_even[1,]+(1-q)*Sigma8_even[2,], 0))
Sigma9[9,9] = q^2*Sigma8_even[1,1]+(1-q)^2*Sigma8_even[1,1] + 2*q*(1-q)*Sigma8_even[1,2] # var(qX_1+(1-q)X_2); here Sigma8_even[1,1] = Sigma8_even[2,2]
Sigma9
cov2cor(Sigma9)
eigen(Sigma9)
eigen(cov2cor(Sigma9)-mean(cov2cor(Sigma9)))
```
Both the correlation and the covariance decompose in an intuitive way, with the admixed
population loaded as approximately the average of populations 1 and 2 (exactly so in the case
of the covariance).
Also try a different q:
```{r}
q = 0.25
Sigma9 = rbind(Sigma8_even, q*Sigma8_even[1,]+(1-q)*Sigma8_even[2,])
Sigma9 = cbind(Sigma9, c(q*Sigma8_even[1,]+(1-q)*Sigma8_even[2,], 0))
Sigma9[9,9] = q^2*Sigma8_even[1,1]+(1-q)^2*Sigma8_even[1,1] + 2*q*(1-q)*Sigma8_even[1,2]
eigen(Sigma9)
```
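A quick check I added: by construction (and using the fact that `Sigma8_even[1,1] = Sigma8_even[2,2]` here), row 9 of `Sigma9` is exactly $q$ times row 1 plus $(1-q)$ times row 2, including the diagonal entry. So for any eigenvector $v$ with eigenvalue $\lambda \neq 0$, the eigen-equation gives $\lambda v_9 = q \lambda v_1 + (1-q) \lambda v_2$, forcing $v_9 = q v_1 + (1-q) v_2$ exactly. This is why the covariance loadings combine exactly rather than approximately.
```{r}
# For the eigenvectors with nonzero eigenvalue (the 9th is zero, since
# row 9 is linearly dependent), v[9] - (q*v[1] + (1-q)*v[2]) is numerically zero.
e9 = eigen(Sigma9)
e9$vectors[9, 1:8] - (q*e9$vectors[1, 1:8] + (1-q)*e9$vectors[2, 1:8])
```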
## Graph Laplacian
The graph Laplacian for a graph with edge weights (here, distances) $w_{ij}$ is $L=D-W$, where $W = (w_{ij})$ and $D$ is the diagonal
matrix with $d_{ii} = \sum_j w_{ij}$. Cutting a graph into two can be achieved by taking the top eigenvector
of the graph Laplacian, as [here](https://www-liebertpub-com.proxy.uchicago.edu/doi/10.1089/cmb.2009.0028)
for example.
Here are functions to compute the pairwise distance matrices for the 4-tip and 8-tip trees above.
```{r}
D4_fun = function(b){
  D4 = matrix(0, nrow=4, ncol=4)
  D4[1:2,3:4] = b[1] + b[2] # top split
  D4[1,-1] = D4[1,-1] + b[3] # tip branches
  D4[2,-2] = D4[2,-2] + b[4]
  D4[3,-3] = D4[3,-3] + b[5]
  D4[4,-4] = D4[4,-4] + b[6]
  D4 + t(D4)
}
D8_fun = function(b_top, b_left, b_right){
  D8 = matrix(0, nrow=8, ncol=8)
  D8[1:4,-(1:4)] = b_top[1]
  D8[5:8,-(5:8)] = b_top[2] # top split
  D8[1:2,-(1:2)] = D8[1:2,-(1:2)] + b_left[1] # internal branches, left subtree
  D8[3:4,-(3:4)] = D8[3:4,-(3:4)] + b_left[2]
  D8[1,-1] = D8[1,-1] + b_left[3] # tip branches, left subtree
  D8[2,-2] = D8[2,-2] + b_left[4]
  D8[3,-3] = D8[3,-3] + b_left[5]
  D8[4,-4] = D8[4,-4] + b_left[6]
  D8[5:6,-(5:6)] = D8[5:6,-(5:6)] + b_right[1] # internal branches, right subtree
  D8[7:8,-(7:8)] = D8[7:8,-(7:8)] + b_right[2]
  D8[5,-5] = D8[5,-5] + b_right[3] # tip branches, right subtree
  D8[6,-6] = D8[6,-6] + b_right[4]
  D8[7,-7] = D8[7,-7] + b_right[5]
  D8[8,-8] = D8[8,-8] + b_right[6]
  D8 + t(D8)
}
```
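As a warm-up (my addition), the same recipe on the even 4-tip tree: with the pairwise distances used as weights, the top eigenvector of $L = D - W$ recovers the top split.
```{r}
D4_even = D4_fun(rep(1, 6))
L4 = diag(rowSums(D4_even)) - D4_even # graph Laplacian with distances as weights
eigen(L4)
```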
Interestingly, for the 8-tip tree with even branches, not even the top eigenvector is unique -- there are three equal top eigenvalues. (However, each of these top eigenvectors would produce a "valid" split
of the tree if you applied the strategy of splitting it at its largest gap. The second and third eigenvectors
would split 2 vs 6 rather than 4 vs 4. Would all linear combinations of these eigenvectors produce a valid split? Not sure... an empirical check follows the next chunk.)
```{r}
D8_even = D8_fun(c(1,1),rep(1,6),rep(1,6))
D8_even
L_even = diag(rowSums(D8_even)) - D8_even
L_even
eigen(L_even)
```
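To probe that "not sure" question empirically (my addition, using the gap-split helper from earlier): draw random vectors from the three-dimensional top eigenspace and check how often the largest-gap split matches one of the branch-induced splits of the even tree.
```{r}
# One side of each branch-induced split of the even 8-tip tree:
# the 8 tips, the 4 internal pairs, and the top split.
valid_sides = c(as.list(1:8), list(1:2, 3:4, 5:6, 7:8, 1:4))
matches_branch = function(s)
  any(sapply(valid_sides, function(v) setequal(s, v) || setequal(s, setdiff(1:8, v))))
e_lap = eigen(L_even)
set.seed(1)
mean(replicate(1000, {
  w = e_lap$vectors[, 1:3] %*% rnorm(3) # random vector in the degenerate top eigenspace
  matches_branch(split_at_largest_gap(w)[[1]])
}))
```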
### Balanced tree
```{r}
D8_balanced = D8_fun(c(1,1),c(2,2,3,3,4,4),c(5,5,6,6,7,7))
D8_balanced
eigen(diag(rowSums(D8_balanced))-D8_balanced)
```
### Outgroup
Add an outgroup to the even tree. The top eigenvector splits 5 vs 4.
```{r}
D9 = rbind(cbind(D8_even, rep(5,8)),rep(5,9))
D9[9,9] = 0
L9 = diag(rowSums(D9)) - D9
eigen(L9)
```
Notice here that, unlike in the correlation case, the length of the outgroup branch matters. If we make it long
enough then the first eigenvector splits 8 vs 1.
```{r}
D9 = rbind(cbind(D8_even, rep(50,8)),rep(50,9))
D9[9,9] = 0
L9 = diag(rowSums(D9)) - D9
eigen(L9)
```
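Applying the gap-split helper from earlier (my addition): with the long outgroup branch, the largest gap in the top eigenvector should now isolate population 9.
```{r}
split_at_largest_gap(eigen(L9)$vectors[, 1]) # expect an 8 vs 1 split
```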
## Relationship between distance and covariance
This section is a bit of a diversion and may be ignored...
I just wanted to remind myself about the relationship between the pairwise
distances and covariances.
Basically, if $D$ is a distance matrix and $L$ is the centering
matrix (so premultiplying by $L$ subtracts the column means), then
$W=-LDL'$ is symmetric positive semi-definite and closely related to the covariance matrix. See Petkova et al. for more discussion.
```{r}
D4_even = D4_fun(rep(1,6))
D4_even
L = diag(4) - matrix(1/4, nrow=4, ncol=4) # centering matrix: I - 11'/4
W = -L %*% D4_even %*% t(L)
W
0.5*W - Sigma4_even # a constant matrix here: 0.5*W matches Sigma4_even up to centering
eigen(W)
eigen(W)
```
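Finally, a check I added to make "closely related" precise: for a tree covariance, $d(i,j) = \Sigma_{ii} + \Sigma_{jj} - 2\Sigma_{ij}$, and since $L\mathbf{1} = 0$ the centering kills the $\Sigma_{ii}$ and $\Sigma_{jj}$ terms, giving $\frac{1}{2}W = L\Sigma L'$, the double-centered covariance.
```{r}
# 0.5*W equals the double-centered covariance exactly (numerically zero):
0.5*W - L %*% Sigma4_even %*% t(L)
```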