vignettes/articles/liger_object.Rmd

---
title: "Interact with A Liger Object"
author: "Yichen Wang"
date: "2023-11-06"
output:
  html_document: 
    toc: 3
    toc_float:
      collapsed: false
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hold")
```

## Structure

Starting from rliger 2.0.0, we introduced a newly designed structure for the main data container object class. The figure below brings a overall idea about the design.

![liger_object_structure](img/liger_class_structure.png)\

On the left hand side of the figure, we briefly illustrate how the old structure was like. It has slots for data at each processing stage, and each slot is a named list object containing matrices. 

In the new version, on the right side, we first introduce another new class, [ligerDataset](../reference/ligerDataset-class.html), to serve as a container for all matrices belonging to the same specific dataset. In this way, we have easy and safe control on data matching thanks to a number of object oriented accessor methods and validity checks. As for the [liger](../reference/liger-class.html) class, there are some main differences:

- Feature (gene expression) matrices are now wrapped in ligerDataset objects, and all put in the `datasets(obj)` slot. Click here for how to [access the raw counts, normalized counts, scaled data](#access-feature-matrices).
- Cell metadata including dataset belonging, study design conditions, quality control (QC) metrics are now stored in slot `cellMeta(obj)`. Meanwhile, we moved cluster labeling result also to this slot, in order to have multiple cluster variables existing at the same time. Additionally, by introducing `S4Vectors::DataFrame` class for the metadata for flexibility and tidy display. Click here for how to [access cell metadata](#access-cell-metadata).
- Dimensionality reductions are now expanded to a list of low-dimensional representation (e.g. UMAP, t-SNE) in slot `dimReds(obj)`. Click here for how to [access dimensionality reductions](#access-dimensionality-reductions).
- Variable features identified are now stored in `varFeatures(obj)`, with no structural change.
- A new slot `@uns` is added for storing miscellaneous unstructured information, including default setting for cluster labeling and dimension reduction, for faster visualization calls.
- We also added the feature to record the commands applied to the liger object and allow retrieving the records with `commands(obj)`. See more about liger [command recording and retrieving](#check-the-records-of-run-commands).

We demonstrate examples below with example dataset "[pbmc](../reference/pbmc.html)", which is a minimal subset from the study conducted in Hyun Min Kang and et. al., Nature Biotechnology, 2018. The data is a ready to use (new) liger object with only raw counts data. We quickly process it here so that we can show how to retrieve all kinds of data in sections below.

```{R, message=FALSE, warning=FALSE, results="hide"}
library(rliger)
data("pbmc")
pbmc <- pbmc %>%
    normalize() %>%
    selectGenes() %>%
    scaleNotCenter() %>%
    runIntegration() %>%
    quantileNorm() %>%
    runCluster() %>%
    runUMAP()
```

## Access a dataset

As introduced above, the dataset-specific information is contained in a [ligerDataset](../reference/ligerDataset-class.html) object. 

- To get the names of all datasets

```{R}
names(pbmc)
```

- To get the number of all datasets

```{R}
length(pbmc)
```

- To list out number of cells in each dataset

```{R}
lengths(pbmc)
```

- To access the [ligerDataset](../reference/ligerDataset-class.html) object for a specific dataset

```{R}
ctrlLD <- dataset(pbmc, dataset = "ctrl")
# Alternatively, using numeric index
ctrlLD <- dataset(pbmc, 1)
```

In any other *rliger* functions where the argument `useDatasets` is exposed, users can always use the exact character name(s) or the numeric index to specify the datasets to be involved in the analysis. Moreoever, a logical vector of index is also allowed and could ease the usage in some cases.

```{R, eval = FALSE}
# Not run, just for example, assuming we've got the clustering for such an object
names(ligerObj)
## [1] female-1  female-2  male-3  male-4  female-5 ......
femaleIdx <- startsWith(names(ligerObj), "fe")
runMarkerDEG(ligerObj, conditionBy = "dataset", splitBy = "leiden_cluster", 
             useDatasets = femaleIdx)
```

In the example above, the `runMarkerDEG()` funcion is parametered for detecting dataset specific markers within each cluster, and only within the female samples. For example, cells from condition "female-1 and cluster 1" will be tested against cells belonging to condition "cluster 1 and all other female datasets". Can be use

- To access multiple datasets, returned in a list

```{R}
ldList <- datasets(pbmc)
```

## Access feature matrices

We have three main generics for accessing feature matrices, namingly `rawData()`, `normData()` and `scaleData()`. For scaled unshared features, used for UINMF, we also have `scaleUnsharedData()`. Additionally, we provide `rawPeak()` and `normPeak()` for accessing the peak counts in a ATACseq dataset. The logistics of the accessor to all these feature matrices are the same, so we only present the case for raw counts.

- To get a list of the raw counts from all datasets:

```{R}
rawList <- rawData(pbmc)
class(rawList)
length(rawList)
# Alternatively
rawList <- getMatrix(pbmc, "rawData", returnList = TRUE)
```

- To get the raw counts from a specific dataset:

```{R}
stimRaw <- rawData(pbmc, "stim")
class(stimRaw)
dim(stimRaw)
# Alternatively, get the `ligerDataset` object of the dataset first
# and then fetch from there
stim <- dataset(pbmc, "stim")
stimRaw <- rawData(stim)
# Alternatively
stimRaw <- getMatrix(pbmc, "rawData", dataset = "stim")
```

- To replace the raw counts with a new matrix object:

```{R}
ctrlRaw <- rawData(pbmc, "ctrl")
# Assume we do some operation on it here
rawData(pbmc, "ctrl") <- ctrlRaw
```

In the new version, strict validity checks have been put upon modification in object content. Replacement with unmatching feature names or barcodes will be rejected. In the case where there is a need to replace the dataset with a different set of barcodes or features, we suggest recreating a new [ligerDataset](../reference/ligerDataset-class.html) object with the new raw counts (or other feature matrix), and then replace the whole dataset with it.

```{R}
ctrlRaw <- rawData(pbmc, "ctrl")
ctrlRawSubset <- ctrlRaw[1:200, 1:200]
## Not Run, will raise error
# rawData(pbmc, "ctrl") <- ctrlRawSubset
```

```{R, eval=FALSE}
ctrlNew <- createLigerDataset(rawData = ctrlRawSubset)
dataset(pbmc, "ctrl") <- ctrlNew
dim(pbmc)
## [1] NA 500
```

## Access cell metadata

As previously descibed at the top of this page, cell metadata including dataset origin, study metadata, QC metrics and cluster labeling are all stored in `cellMeta(obj)`.

- To have a look at the full metadata table:

```{R}
cellMeta(pbmc)
```

- To retrieve one of the variables:

```{R}
nUMI <- cellMeta(pbmc, "nUMI")
class(nUMI)
length(nUMI)
# Alternatively
nUMI <- pbmc$nUMI
nUMI <- pbmc[["nUMI"]]
```

- To retrieve multiple variables:

```{R}
cellMeta(pbmc, c("nUMI", "nGene"))
```

- To get a variable for only a subset of cells (e.g. from a specific dataset), use the argument `useDatasets`, or alternatively `cellIdx` that subscribes cells explicitly:

```{R}
nUMI <- cellMeta(pbmc, "nUMI", useDatasets = "ctrl")
length(nUMI)
```

- Add or replace a variable. If the new variable being added has a matching size (`length()` or `nrow()`) with the number of all cells (`ncol(pbmc)`):

```{R}
foo <- seq_len(ncol(pbmc))
pbmc$foo <- foo
# Alternatively
cellMeta(pbmc, "foo") <- foo
```

If the new variable is only for a subset of cells (e.g. the original clustering derived from an individual dataset).

```{R}
ctrlBar <- seq_len(ncol(dataset(pbmc, "ctrl")))
cellMeta(pbmc, "ctrl_bar", useDataset = "ctrl") <- ctrlBar
```

## Access dimensionality reductions

- To get a list of all dimensionality reductions:

```{R}
allDimReds <- dimReds(pbmc)
class(allDimReds)
length(allDimReds)
```

- To get a specific dimensionality reduction:

```{R}
umap <- dimRed(pbmc, "UMAP")
class(umap)
dim(umap)
```

- To get a specific dimensionality reduction for a specific dataset, again, use the argument `useDatasets`.

```{R}
ctrlUMAP <- dimRed(pbmc, "UMAP", useDatasets = "ctrl")
dim(ctrlUMAP)
```

- Setting an existing dimensionality reduction as the default for visualization

```{R}
defaultDimRed(pbmc) <- "UMAP"
```

Every time when `runUMAP()` or `runTSNE()` is called, the new result will be set as default. When default dimensionality reduction is set, any plotting function that shall work with it will use it by default without the need to specify it explicitly. And the dimensionality reduction accessor function also returns the default one if no specific one is requested.

```{R}
umap <- dimRed(pbmc)
```

- Add a new matrix into the object

```{R}
dimRed(pbmc, "newUMAP") <- umap
```

Cell identifiers on `rownames(value)` will be checked for matching is present. The check is aware of that a dataset name prefix is added to the object cell IDs.

- Adding a dimensionality reduction matrix for only one certain dataset

```{R}
ctrlUMAP <- dimRed(pbmc, "UMAP", useDatasets = "ctrl")
dimRed(pbmc, "ctrlUMAP", useDatasets = "ctrl") <- ctrlUMAP
```

## Access factorization result

We suggest using `getMatrix()` for all matrices involved in the factorization, including:

- $H$, $V$ matrices produced for each dataset, involved in all iNMF variant algorithms
- $W$ matrix shared for all datasets, involved in all iNMF variant algorithms and NMF dimension reduction
- $H.norm$, the aligned factor loading matrix, produced downstream of iNMF integration by `quantileNorm()`
- $A$ and $B$ matrices produced as intermediate information for each dataset during online iNMF interations 
- $U$ matrices produced for each dataset, involved in UINMF

```{R}
HList <- getMatrix(pbmc, "H")
lapply(HList, dim)
```

## Subsetting the data

- A [liger](../reference/liger-class.html) object can be subset by both cells and genes. 

For cell level subsetting, any indexing method among barcode names, numeric or logical index can do the job. Cells are indexed by `rownames(cellMeta(object))`, which is a concatenation of the barcodes from each dataset, and datasets are ordered as `names(object)` shows.

```{R}
pbmcSmall <- pbmc[, 1:100]
pbmcCluster1 <- pbmc[, pbmc$leiden_cluster == 1]
```

For gene level subsetting, we only allow using gene names, because it is assumed that different datasets can have different set of genes. And only genes shared by all datasets can be used.

```{R}
pbmcVarOnly <- pbmc[varFeatures(pbmc),]
ctrlUnsharedGenes <- c("P2RY1", "GFI1B", "HDGFRP2", "TUBGCP6", "CELA1")
# Not run, will raise error
# pbmc[ctrl.unshared.genes,]
```

- A [ligerDataset object](../reference/liger-class.html) can be subset by both cells and genes. 

Cell level subsetting works in the exactly same way as a [liger object](../reference/liger-class).

```{R}
ctrlLD <- dataset(pbmc, "ctrl")
ctrlLDSmall <- ctrlLD[, 1:100]
```

Gene level subsetting on a [ligerDataset object](../reference/liger-class.html) can achieved with any type of index.

```{R}
ctrlLDsmall <- ctrlLD[1:100, ]
ctrlLDsmall <- ctrlLD[1:100, 1:100]
```

Note that, `scaleData(ctrlLD)` and `scaleUnsharedData(ctrlLD)` comes with only variable genes identified upstream. Subsetting genes on a [ligerDataset object](../reference/liger-class.html) is based on its raw input data. Therefore, we only take the user specification available in scaled data into the subset of scaled data. 

## Check the records of run commands

We implemented a analysis tracking feature in order to keep a record of what functions are called and what parameters are used.

- To show a list of function names applied to the [liger object](../reference/liger-class.html) in time order

```{R}
commands(pbmc)
```

A unique suffix is added to each function name to keep track of calls of the same function with different parameters. 

- Detailed a function call information can be retrieved with partial matching.

```{R}
commands(pbmc, "runINMF")
```

A function can be applied to an object several times with parameter tweaks. For example, different `lambda` for iNMF integration. If `runINMF()` is called several times, calling `commands(pbmc, "runINMF")` returns a list of records of all such calls, as all record names starting with `"runINMF"` are matched. So listing names first and using the unique record name will be required for getting the information of one specific call among all of such. For another example, given that `runCluster()` and `runUMAP()` are also in the record, the following result would be returned if we do matching with only `"run"`

```{R}
commands(pbmc, "run")
```