---
title: "Home"
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: false
---
## Featured

[Note.](flashier_features.html) New features included in the `flashier` implementation of EBMF.

[Investigation 21.](flashier_bench.html) Benchmarking `flashier`.

[Investigation 27.](lowrank.html) Using low-rank approximations to the data as input to `flashier`.
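As a rough sketch of the idea behind Investigation 27 (not the analysis itself): `flashier` can accept a low-rank representation of the data, such as the output of `svd()`, in place of the full data matrix. The toy matrix and rank below are made up for illustration.

```r
# Sketch only: assumes the flashier package is installed.
library(flashier)

set.seed(1)
Y <- matrix(rnorm(200 * 50), nrow = 200)  # toy data matrix

# Truncated rank-10 SVD used as a stand-in for the full matrix.
Y_svd <- svd(Y, nu = 10, nv = 10)
Y_svd$d <- Y_svd$d[1:10]

# flash() can take the svd-style list (u, d, v) directly as input.
fit <- flash(Y_svd, greedy_Kmax = 5)
```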
## In Progress

Investigations 8 and 9 implement parallel backfitting updates.

* [Investigation 8.](parallel.html) Parallelizing the backfitting algorithm shows promise.
* [Investigation 9.](parallel2.html) An additional trick is needed to parallelize the backfitting updates performed in this [MASH v FLASH GTEx analysis](https://willwerscheid.github.io/MASHvFLASH/MASHvFLASHgtex3.html).

[Investigation 10.](squarem.html) SQUAREM does poorly on FLASH backfits. DAAREM (a more recent algorithm by one of the authors of SQUAREM) does better, but offers smaller performance gains than parallelization.

[Investigation 11.](random.html) The order in which factor/loading pairs are updated during backfitting makes some difference, but not much.

[Investigation 12.](arbitraryV.html) To fit a FLASH model with an arbitrary error covariance matrix, I follow up on a [suggestion](https://github.com/stephenslab/flashr/issues/17) by Matthew Stephens.

Investigations 14, 16, and 17 illustrate three approaches to factorizing the GTEx donation matrix. The first is the most naive and is primarily intended as an illustration of how to do nonnegative matrix factorization using FLASH. The second and third are more sophisticated approaches that model the entries as count or binary data.

* [Investigation 14.](nonnegative.html) An example of how to use nonnegative ASH priors to obtain a nonnegative matrix factorization.
* [Investigation 16.](count_data.html) Instead of directly fitting FLASH, I fit count data via a Gaussian approximation to the Poisson log likelihood...
* [Investigation 17.](binary_data.html) ...then I fit binary data via an approximation to the binomial log likelihood.
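For readers who want to try the nonnegative approach, here is a minimal sketch using the current `flashier`/`ebnm` interface; Investigation 14 itself used `flashr` with nonnegative ASH priors, so the exact calls there differ. The toy matrix is made up.

```r
# Sketch only: assumes the flashier and ebnm packages are installed.
library(flashier)
library(ebnm)

set.seed(1)
Y <- matrix(rpois(100 * 20, lambda = 2), nrow = 100)  # toy count matrix

# Point-exponential priors constrain loadings and factors to be
# nonnegative, which yields a nonnegative matrix factorization.
fit <- flash(Y, greedy_Kmax = 3, ebnm_fn = ebnm_point_exponential)
```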
Note 3 and Investigation 18 explore stochastic approaches to fitting FLASH objects to very large datasets.

* [Note 3.](large_p.html) An idea for how to fit FLASH models when $n$ is manageable and $p$ is very large.
* [Investigation 18.](minibatch.html) I implement the idea described in Note 3 and test it out on data from the GTEx project.

Investigations 19a-b and 20 try FLASH out on large single-cell RNA-seq datasets.

* [Investigation 19a.](trachea.html) An analysis of the smaller "droplet" dataset from [Montoro et al.](https://www.nature.com/articles/s41586-018-0393-7)
* [Investigation 19b.](trachea2.html) I redo my analysis of the "droplet" dataset, but this time I follow the authors' preprocessing steps. The results are, I think, of much lower quality.
* [Investigation 20.](pulseseq.html) An analysis of the larger "pulse-seq" dataset from Montoro et al.

[Investigation 22.](brain.html) A `flashier` analysis of the GTEx brain subtensor.

Investigations 24-26 explore approaches to count data.

* [Investigation 24.](count_shrinkage.html) I propose a new approach to factorizing count data that uses adaptive shrinkage to estimate the rate matrix.
* [Investigation 25.](count_preproc_r1.html) I compare three different data transformations, three approaches to handling the heteroskedasticity of the `log1p` transformation, and two approaches to dealing with row- and column-specific scaling.
* [Investigation 26.](trachea3.html) I compare FLASH fits of the "droplet" dataset from Montoro et al. using three different data transformations.
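To see where the heteroskedasticity in Investigation 25 comes from: if an entry is modeled as $Y \sim \text{Poisson}(\lambda)$, a first-order delta-method calculation (a standard approximation, not necessarily the one used in the investigation) gives

$$\text{Var}\left(\log(1 + Y)\right) \approx \left(\frac{1}{1 + \lambda}\right)^2 \text{Var}(Y) = \frac{\lambda}{(1 + \lambda)^2},$$

so the variance of the transformed entries depends on the rate $\lambda$ and cannot be treated as constant across the matrix.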
## Still Relevant

Notes 1 and 2 and Investigation 4 describe a way to compute the FLASH objective directly (rather than using the indirect method implemented in `flashr`).

* [Note 1.](obj_notes.html) Notes on computing the FLASH objective function. I derive an explicit expression for the KL divergence between prior and posterior.
* [Note 2.](flash_em.html) An alternative algorithm for optimizing the FLASH objective, using the explicit expression derived in the previous note.
* [Investigation 4.](alt_alg.html) The alternative algorithm agrees with FLASH with respect to both the objective and the fit obtained.
## Archived

The bug causing the problem described in Investigations 1-3 was fixed in version 0.1-13 of the `ebnm` package.

* [Investigation 1.](objective.html) The FLASH objective function can behave very erratically.
* [Investigation 2.](objective2.html) The problem only occurs when using `ebnm_pn`, not `ebnm_ash`.
* [Investigation 3.](objective3.html) The objective can continue to get worse as loadings are repeatedly updated. Nonetheless, convergence takes place (from above!).

Investigations 6 and 7 deal with warmstarts, which were implemented in version 0.5-14 of `flashr`.

* [Investigation 6.](warmstart.html) Poor `optim` results can produce large decreases in the objective function. We should use warmstarts when `ebnm_fn = ebnm_pn`.
* [Investigation 7.](warmstart2.html) The advantages of warmstarts are not nearly as compelling when `ebnm_fn = ebnm_ash`.

Since the newest versions of `flashr` use a home-grown initialization function, Investigations 5a-b and 13 are no longer relevant.

* [Investigation 5a.](init_fn.html) An argument for changing the default `init_fn` to `udv_si_svd` when there is missing data and to `udv_svd` otherwise, based on an analysis of GTEx data.
* [Investigation 5b.](init_fn2.html) More evidence supporting the recommendations in Investigation 5a.
* [Investigation 13.](init_fn3.html) A counterargument: the results in Investigations 5a-b probably depend on the fact that $n$ is small ($n = 44$). For large $n$, setting `init_fn` to `udv_si` is best.

The changes tested in Investigation 15 were implemented in version 0.6-2 of `flashr`.

* [Investigation 15.](scalar_tau.html) Tests an implementation of changes to the way `tau` is stored, as discussed [here](https://github.com/stephenslab/flashr/issues/83).

The changes tested in Investigation 23 were implemented in version 2.2-29 of `ashr`.

* [Investigation 23.](truncnorm.html) I benchmark the rewritten `my_etruncnorm` and `my_vtruncnorm` functions in the `ashr` package against their counterparts in the `truncnorm` package.

[Note 4.](matrix_ops.html) An early set of notes that identified key ways to reduce the memory footprint of `flashr`. The good ideas were implemented in `flashier`. (Not all of the ideas were good.)