-
Notifications
You must be signed in to change notification settings - Fork 2
/
simu.Rmd
191 lines (170 loc) · 6.56 KB
/
simu.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
---
title: "Simulation"
author: "Zheng Li"
date: "2022-02-06"
output:
workflowr::wflow_html:
toc: true
editor_options:
chunk_output_type: console
---
[simulation]: https://github.com/zhengli09/BASS-Analysis/tree/master/code/
[params]: https://github.com/zhengli09/BASS-Analysis/tree/master/data/params.txt
[visualization]: https://github.com/zhengli09/BASS-Analysis/tree/master/code/viz.R
# Introduction
This page provides code to reproduce all the simulation results for BASS.
Hopefully, this can also serve as a resource for building up your simulation
studies and evaluate your statistical methods.
# Simulation study
## Generate simulated data
The simulation study relies on a few R packages and python softwares
(e.g. SpaGCN and FICT), which need to be installed beforehand. The specific
packages and all necessary functions to conduct the simulation study can be
found at [simulation][].
```{r}
source("code/simu_utils.R")
# data for inferring parameters for splatter
load("data/starmap_mpfc.RData") # starmap_cnts, starmap_info
cnts <- starmap_cnts[["20180417_BZ5_control"]]
info <- starmap_info[["20180417_BZ5_control"]]
xy <- as.matrix(info[, c("x", "y")])
starmap <- list(cnts = cnts, info = info)
# we perform 50 simulation replicates for each setting
# The random seeds used for each replicate are generated as below
NREP <- 50
set.seed(0)
seeds <- sample(20201230, NREP)
```
We generate the simulated data under the baseline simulation setting. Refer to
the [params][] file for a complete list of simulation settings.
```{r}
# baseline settings for all four scenarios for the
# single tissue section analysis (L = 1)
scenarios <- 1:4 # simulation scenarios
C <- 4 # number of cell types
R <- 4 # number of spatial domains
J <- 200 # number of genes
de_prop <- 0.2 # proportion of DE genes for each cell type
de_facLoc <- 1.1 # DE gene strength
L <- 1 # number of tissue sections
# illustrate the data at the first replicate
rep <- 1
sim_dats <- lapply(scenarios, function(scenario){
simu(starmap = starmap, scenario = scenario, C = C, J = J, L = L,
batch_facLoc = 0, de_prop = de_prop, de_facLoc = de_facLoc,
de_facScale = 0.4, sim_seed = seeds[rep], debug = FALSE)
})
```
```{r}
# The simulated data include a list of L gene expression count matrices,
# a list of L true cell type label vectors for evaluation purpose, and
# the random seed used for generating the data
scenario <- 3
sim_dat <- sim_dats[[scenario]]
# gene expression count matrix of the first tissue section under scenario 3
sim_dat[[1]][[1]][1:5, 1:5]
```
## Visualize of the simulated data
We visualize the cell type distributions under scenarios I-IV. You can refer to
[visualization][] for some useful plotting functions or you can write your own
code for plotting.
```{r}
library(cowplot)
source("code/viz.R")
p <- lapply(scenarios, function(scenario){
plotClusters(xy, labels = sim_dats[[scenario]][[2]][[1]],
title = paste("Scenario", c("I", "II", "III", "IV")[scenario]))
})
legend <- get_legend(p[[1]] +
theme(legend.position = "bottom") +
guides(color = guide_legend(title = "Cell type")))
p <- plot_grid(plotlist = p, ncol = 2)
plot_grid(p, legend, ncol = 1, rel_heights = c(1, 0.1))
```
## Running all methods
We have wrapped up code for running different methods into different functions.
Those functions take in simulated data and output adjusted random index (ARI)
measuring the estimate cell type or spatial domain labels with the underlying
truth. Functions for running different methods can be found at [simulation][].
We take the simulated data from scenario 3 as an example to illustrate running different methods.
```{r}
scenario <- 3
```
### Run BASS
```{r}
BASS_out <- run_BASS(sim_dats[[scenario]], xy, "ACCUR_EST", beta = 1, C, R,
init_method = "kmeans", cov_struc = "EEE")
BASS_out
```
### Run HMRF
Please note that running HMRF requires you to specify a path for python and will
generate intermediate and final results into a specified folder. Please modify
those directories accordingly in the `run_HMRF` function if you want to use the
function for your own analysis.
```{r message=FALSE, warning=FALSE}
# Run HMRF
# Note that case and rep parameters below are used to generate a unique
# temporary directories for running HMRF.
HMRF_out <- run_HMRF(sim_dat, xy, ztrue = info$z, R, case = 3,
rep, usePCs = F, dosearchSEgenes = T)
# ARI for spatial domain detection across different
# spatial interaction parameter betas.
HMRF_out
```
### Run BayesSpace
Note that BayesSpace cannot identify any neighbors because it was developed for
analyzing ordered lattice structure of spots from ST and 10x Visium platforms.
```{r}
# Run BayesSpace
BayesSpace_out <- run_BayesSpace(sim_dat, xy, ztrue = info$z, R)
# ARI for spatial domain detection
BayesSpace_out
```
### Run SpaGCN
Because SpaGCN was originally developed with Python, we wrapped up the code into
a function and have it imported into R such that we can conveniently analyze the
simulated data in R. The python function can be found at [simulation][].
```{r}
# Run SpaGCN
source_python("code/run_SpaGCN.py")
SpaGCN_out <- run_SpaGCN(sim_dat, xy, info$z, R)
# ARI for spatial domain detection
SpaGCN_out
```
### Run Seurat
Because Seurat uses a resolution parameter to indirectly determine the number of
clusters, we run Seurat on a range of resolution parameters and identify the
first value that resulted in the desired number of cell types.
```{r}
# Run Seurat
resolutions <- seq(0.3, 1.5, by = 0.1)
seu_out <- seu_cluster(sim_dat, resolutions)
# ARI for cell type clustering and the number of identified clusters
seu_out
```
### Run SC3
Note that when the number of cells exceeds 5,000, SC3 randomly selects 5,000
cells for clustering, trains a support vector machine with the cluster labels
obtained for the 5,000 cells, and then predict the cluster labels of the
remaining cells.
```{r}
# Run SC3
# Note SC3 algorithm is slow
sc3_out <- sc3_cluster(sim_dat, C)
# ARI for cell type clustering
sc3_out
```
### Run FICT
Please install the software according to the [FICT github](https://github.com/haotianteng/FICT) and modify the paths to the FICT
software and a temporary folder accordingly in the `fict_cluster` function.
```{r}
# Note that case and rep parameters below are used to generate a unique
# temporary directories for running FICT.
fict_out <- fict_cluster(sim_dat, xy, C, case = 3, rep)
# ARI for cell type clustering
fict_out
```
## Multi-sample simulation
# Find all the simulation results
Put the link to all the simulation results here later and plot the two main
figures on simulation analysis.