/
liger_object.Rmd
326 lines (236 loc) · 12.6 KB
/
liger_object.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
---
title: "Interact with A Liger Object"
author: "Yichen Wang"
date: "2023-11-06"
output:
html_document:
toc: 3
toc_float:
collapsed: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hold")
```
## Structure
Starting from rliger 2.0.0, we introduced a newly designed structure for the main data container object class. The figure below brings a overall idea about the design.
![liger_object_structure](img/liger_class_structure.png)\
On the left hand side of the figure, we briefly illustrate how the old structure was like. It has slots for data at each processing stage, and each slot is a named list object containing matrices.
In the new version, on the right side, we first introduce another new class, [ligerDataset](../reference/ligerDataset-class.html), to serve as a container for all matrices belonging to the same specific dataset. In this way, we have easy and safe control on data matching thanks to a number of object oriented accessor methods and validity checks. As for the [liger](../reference/liger-class.html) class, there are some main differences:
- Feature (gene expression) matrices are now wrapped in ligerDataset objects, and all put in the `datasets(obj)` slot. Click here for how to [access the raw counts, normalized counts, scaled data](#access-feature-matrices).
- Cell metadata including dataset belonging, study design conditions, quality control (QC) metrics are now stored in slot `cellMeta(obj)`. Meanwhile, we moved cluster labeling result also to this slot, in order to have multiple cluster variables existing at the same time. Additionally, by introducing `S4Vectors::DataFrame` class for the metadata for flexibility and tidy display. Click here for how to [access cell metadata](#access-cell-metadata).
- Dimensionality reductions are now expanded to a list of low-dimensional representation (e.g. UMAP, t-SNE) in slot `dimReds(obj)`. Click here for how to [access dimensionality reductions](#access-dimensionality-reductions).
- Variable features identified are now stored in `varFeatures(obj)`, with no structural change.
- A new slot `@uns` is added for storing miscellaneous unstructured information, including default setting for cluster labeling and dimension reduction, for faster visualization calls.
- We also added the feature to record the commands applied to the liger object and allow retrieving the records with `commands(obj)`. See more about liger [command recording and retrieving](#check-the-records-of-run-commands).
We demonstrate examples below with example dataset "[pbmc](../reference/pbmc.html)", which is a minimal subset from the study conducted in Hyun Min Kang and et. al., Nature Biotechnology, 2018. The data is a ready to use (new) liger object with only raw counts data. We quickly process it here so that we can show how to retrieve all kinds of data in sections below.
```{R, message=FALSE, warning=FALSE, results="hide"}
library(rliger)
data("pbmc")
pbmc <- pbmc %>%
normalize() %>%
selectGenes() %>%
scaleNotCenter() %>%
runIntegration() %>%
quantileNorm() %>%
runCluster() %>%
runUMAP()
```
## Access a dataset
As introduced above, the dataset-specific information is contained in a [ligerDataset](../reference/ligerDataset-class.html) object.
- To get the names of all datasets
```{R}
names(pbmc)
```
- To get the number of all datasets
```{R}
length(pbmc)
```
- To list out number of cells in each dataset
```{R}
lengths(pbmc)
```
- To access the [ligerDataset](../reference/ligerDataset-class.html) object for a specific dataset
```{R}
ctrlLD <- dataset(pbmc, dataset = "ctrl")
# Alternatively, using numeric index
ctrlLD <- dataset(pbmc, 1)
```
In any other *rliger* functions where the argument `useDatasets` is exposed, users can always use the exact character name(s) or the numeric index to specify the datasets to be involved in the analysis. Moreoever, a logical vector of index is also allowed and could ease the usage in some cases.
```{R, eval = FALSE}
# Not run, just for example, assuming we've got the clustering for such an object
names(ligerObj)
## [1] female-1 female-2 male-3 male-4 female-5 ......
femaleIdx <- startsWith(names(ligerObj), "fe")
runMarkerDEG(ligerObj, conditionBy = "dataset", splitBy = "leiden_cluster",
useDatasets = femaleIdx)
```
In the example above, the `runMarkerDEG()` funcion is parametered for detecting dataset specific markers within each cluster, and only within the female samples. For example, cells from condition "female-1 and cluster 1" will be tested against cells belonging to condition "cluster 1 and all other female datasets". Can be use
- To access multiple datasets, returned in a list
```{R}
ldList <- datasets(pbmc)
```
## Access feature matrices
We have three main generics for accessing feature matrices, namingly `rawData()`, `normData()` and `scaleData()`. For scaled unshared features, used for UINMF, we also have `scaleUnsharedData()`. Additionally, we provide `rawPeak()` and `normPeak()` for accessing the peak counts in a ATACseq dataset. The logistics of the accessor to all these feature matrices are the same, so we only present the case for raw counts.
- To get a list of the raw counts from all datasets:
```{R}
rawList <- rawData(pbmc)
class(rawList)
length(rawList)
# Alternatively
rawList <- getMatrix(pbmc, "rawData", returnList = TRUE)
```
- To get the raw counts from a specific dataset:
```{R}
stimRaw <- rawData(pbmc, "stim")
class(stimRaw)
dim(stimRaw)
# Alternatively, get the `ligerDataset` object of the dataset first
# and then fetch from there
stim <- dataset(pbmc, "stim")
stimRaw <- rawData(stim)
# Alternatively
stimRaw <- getMatrix(pbmc, "rawData", dataset = "stim")
```
- To replace the raw counts with a new matrix object:
```{R}
ctrlRaw <- rawData(pbmc, "ctrl")
# Assume we do some operation on it here
rawData(pbmc, "ctrl") <- ctrlRaw
```
In the new version, strict validity checks have been put upon modification in object content. Replacement with unmatching feature names or barcodes will be rejected. In the case where there is a need to replace the dataset with a different set of barcodes or features, we suggest recreating a new [ligerDataset](../reference/ligerDataset-class.html) object with the new raw counts (or other feature matrix), and then replace the whole dataset with it.
```{R}
ctrlRaw <- rawData(pbmc, "ctrl")
ctrlRawSubset <- ctrlRaw[1:200, 1:200]
## Not Run, will raise error
# rawData(pbmc, "ctrl") <- ctrlRawSubset
```
```{R, eval=FALSE}
ctrlNew <- createLigerDataset(rawData = ctrlRawSubset)
dataset(pbmc, "ctrl") <- ctrlNew
dim(pbmc)
## [1] NA 500
```
## Access cell metadata
As previously descibed at the top of this page, cell metadata including dataset origin, study metadata, QC metrics and cluster labeling are all stored in `cellMeta(obj)`.
- To have a look at the full metadata table:
```{R}
cellMeta(pbmc)
```
- To retrieve one of the variables:
```{R}
nUMI <- cellMeta(pbmc, "nUMI")
class(nUMI)
length(nUMI)
# Alternatively
nUMI <- pbmc$nUMI
nUMI <- pbmc[["nUMI"]]
```
- To retrieve multiple variables:
```{R}
cellMeta(pbmc, c("nUMI", "nGene"))
```
- To get a variable for only a subset of cells (e.g. from a specific dataset), use the argument `useDatasets`, or alternatively `cellIdx` that subscribes cells explicitly:
```{R}
nUMI <- cellMeta(pbmc, "nUMI", useDatasets = "ctrl")
length(nUMI)
```
- Add or replace a variable. If the new variable being added has a matching size (`length()` or `nrow()`) with the number of all cells (`ncol(pbmc)`):
```{R}
foo <- seq_len(ncol(pbmc))
pbmc$foo <- foo
# Alternatively
cellMeta(pbmc, "foo") <- foo
```
If the new variable is only for a subset of cells (e.g. the original clustering derived from an individual dataset).
```{R}
ctrlBar <- seq_len(ncol(dataset(pbmc, "ctrl")))
cellMeta(pbmc, "ctrl_bar", useDataset = "ctrl") <- ctrlBar
```
## Access dimensionality reductions
- To get a list of all dimensionality reductions:
```{R}
allDimReds <- dimReds(pbmc)
class(allDimReds)
length(allDimReds)
```
- To get a specific dimensionality reduction:
```{R}
umap <- dimRed(pbmc, "UMAP")
class(umap)
dim(umap)
```
- To get a specific dimensionality reduction for a specific dataset, again, use the argument `useDatasets`.
```{R}
ctrlUMAP <- dimRed(pbmc, "UMAP", useDatasets = "ctrl")
dim(ctrlUMAP)
```
- Setting an existing dimensionality reduction as the default for visualization
```{R}
defaultDimRed(pbmc) <- "UMAP"
```
Every time when `runUMAP()` or `runTSNE()` is called, the new result will be set as default. When default dimensionality reduction is set, any plotting function that shall work with it will use it by default without the need to specify it explicitly. And the dimensionality reduction accessor function also returns the default one if no specific one is requested.
```{R}
umap <- dimRed(pbmc)
```
- Add a new matrix into the object
```{R}
dimRed(pbmc, "newUMAP") <- umap
```
Cell identifiers on `rownames(value)` will be checked for matching is present. The check is aware of that a dataset name prefix is added to the object cell IDs.
- Adding a dimensionality reduction matrix for only one certain dataset
```{R}
ctrlUMAP <- dimRed(pbmc, "UMAP", useDatasets = "ctrl")
dimRed(pbmc, "ctrlUMAP", useDatasets = "ctrl") <- ctrlUMAP
```
## Access factorization result
We suggest using `getMatrix()` for all matrices involved in the factorization, including:
- $H$, $V$ matrices produced for each dataset, involved in all iNMF variant algorithms
- $W$ matrix shared for all datasets, involved in all iNMF variant algorithms and NMF dimension reduction
- $H.norm$, the aligned factor loading matrix, produced downstream of iNMF integration by `quantileNorm()`
- $A$ and $B$ matrices produced as intermediate information for each dataset during online iNMF interations
- $U$ matrices produced for each dataset, involved in UINMF
```{R}
HList <- getMatrix(pbmc, "H")
lapply(HList, dim)
```
## Subsetting the data
- A [liger](../reference/liger-class.html) object can be subset by both cells and genes.
For cell level subsetting, any indexing method among barcode names, numeric or logical index can do the job. Cells are indexed by `rownames(cellMeta(object))`, which is a concatenation of the barcodes from each dataset, and datasets are ordered as `names(object)` shows.
```{R}
pbmcSmall <- pbmc[, 1:100]
pbmcCluster1 <- pbmc[, pbmc$leiden_cluster == 1]
```
For gene level subsetting, we only allow using gene names, because it is assumed that different datasets can have different set of genes. And only genes shared by all datasets can be used.
```{R}
pbmcVarOnly <- pbmc[varFeatures(pbmc),]
ctrlUnsharedGenes <- c("P2RY1", "GFI1B", "HDGFRP2", "TUBGCP6", "CELA1")
# Not run, will raise error
# pbmc[ctrl.unshared.genes,]
```
- A [ligerDataset object](../reference/liger-class.html) can be subset by both cells and genes.
Cell level subsetting works in the exactly same way as a [liger object](../reference/liger-class).
```{R}
ctrlLD <- dataset(pbmc, "ctrl")
ctrlLDSmall <- ctrlLD[, 1:100]
```
Gene level subsetting on a [ligerDataset object](../reference/liger-class.html) can achieved with any type of index.
```{R}
ctrlLDsmall <- ctrlLD[1:100, ]
ctrlLDsmall <- ctrlLD[1:100, 1:100]
```
Note that, `scaleData(ctrlLD)` and `scaleUnsharedData(ctrlLD)` comes with only variable genes identified upstream. Subsetting genes on a [ligerDataset object](../reference/liger-class.html) is based on its raw input data. Therefore, we only take the user specification available in scaled data into the subset of scaled data.
## Check the records of run commands
We implemented a analysis tracking feature in order to keep a record of what functions are called and what parameters are used.
- To show a list of function names applied to the [liger object](../reference/liger-class.html) in time order
```{R}
commands(pbmc)
```
A unique suffix is added to each function name to keep track of calls of the same function with different parameters.
- Detailed a function call information can be retrieved with partial matching.
```{R}
commands(pbmc, "runINMF")
```
A function can be applied to an object several times with parameter tweaks. For example, different `lambda` for iNMF integration. If `runINMF()` is called several times, calling `commands(pbmc, "runINMF")` returns a list of records of all such calls, as all record names starting with `"runINMF"` are matched. So listing names first and using the unique record name will be required for getting the information of one specific call among all of such. For another example, given that `runCluster()` and `runUMAP()` are also in the record, the following result would be returned if we do matching with only `"run"`
```{R}
commands(pbmc, "run")
```