/
register-data.Rmd
163 lines (128 loc) · 6.86 KB
/
register-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
---
title: "Registering data"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Registering data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
This article will show users how to register data using different types of input, as illustrated below.
```{r reg-data, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics("figures/registration_opt_diagram.png")
```
## Using data frame input
### Loading sample data
`greatR` provides an example of data frame containing two different species *A. thaliana* and *B. rapa* with two and three different replicates, respectively. This data frame can be read as follows:
```{r load-greatR, message=FALSE}
# Load the package
library(greatR)
library(data.table)
```
```{r load-brapa-data, message=FALSE, warning=FALSE}
# Load a data frame from the sample data
b_rapa_data <- system.file("extdata/brapa_arabidopsis_data.csv", package = "greatR") |>
data.table::fread()
```
Note that the data has all of five columns required by the package:
```{r brapa-data-kable-clean, eval=FALSE}
b_rapa_data |>
knitr::kable()
```
```{r brapa-data-kable, echo=FALSE}
b_rapa_data[, .SD[1:6], by = accession][, .(gene_id, accession, timepoint, expression_value, replicate)] |>
knitr::kable()
```
### Registering the data
To align gene expression time-course between *Arabidopsis* Col-0 and *B. rapa* Ro18, we can use the function `register()`. When using the default `use_optimisation = TRUE`, `greatR` will find the best stretch and shift parameters through optimisation. For more details on the other function arguments, go to `register()`.
```{r register-data-raw, message=FALSE, warning=FALSE, eval=FALSE}
registration_results <- register(
b_rapa_data,
reference = "Ro18",
query = "Col0",
scaling_method = "z-score"
)
#> ── Validating input data ────────────────────────────────────────────────────────
#> ℹ Will process 10 genes.
#> ℹ Using estimated standard deviation, as no `exp_sd` was provided.
#> ℹ Using `scaling_method` = "z-score".
#>
#> ── Starting registration with optimisation ──────────────────────────────────────
#> ℹ Using L-BFGS-B optimisation method.
#> ℹ Using computed stretches and shifts search space limits.
#> ℹ Using `overlapping_percent` = 50% as a registration criterion.
#> ✔ Optimising registration parameters for genes (10/10) [2s]
```
When dealing with thousands of genes, users may speed up the registration process by using the argument `num_cores` to run the registration using parallel processing.
```{r register-data-raw-detect-cores, message=FALSE, warning=FALSE, eval=FALSE}
parallel::detectCores()
#> 8
```
```{r register-data-raw-parallel, message=FALSE, warning=FALSE, eval=FALSE}
registration_results <- register(
b_rapa_data,
reference = "Ro18",
query = "Col0",
scaling_method = "z-score",
num_cores = 8
)
```
### Registration results
The function `register()` returns a list with S3 class `res_greatR` containing three different objects:
- `data` is a data frame containing the expression data and an additional `timepoint_reg` column which is a result of registered time points by applying the registration parameters to the query data.
- `model_comparison` is a data frame containing (a) the optimal stretch and shift for each `gene_id` and (b) the difference between Bayesian Information Criterion for the separate model and for the combined model (`BIC_diff`) after applying optimal registration parameters for each gene. If the value of `BIC_diff < 0`, then expression dynamics between reference and query data can be registered (`registered = TRUE`). (Default S3 print).
- `fun_args` is a list of arguments used when calling the function (`reference`, `query`, `scaling_method`, ...).
To check whether a gene is registered or not, we can get the summary results by accessing the `model_comparison` table from the registration result.
```{r register-data, message=FALSE, warning=FALSE, include=FALSE}
# Load a data frame from the sample data
registration_results <- system.file("extdata/brapa_arabidopsis_registration.rds", package = "greatR") |>
readRDS()
```
```{r get-model-summary-data, warning=FALSE}
registration_results$model_comparison |>
knitr::kable()
```
From the sample data above, we can see that for nine out of ten genes, `registered = TRUE`, meaning that reference and query data between those nine genes can be aligned or registered. These data frame outputs can further be summarised and visualised; see the documentation on the [processing registration results](https://ruthkr.github.io/greatR/articles/process-results.html) article.
## Using other inputs
### Loading sample data
As noted in the [input data requirements](https://ruthkr.github.io/greatR/articles/data-requirement.html) article, `register()` also accepts a list of data frames or a list of reference and query vectors as `input`:
```{r load-input-vector, message=FALSE, warning=FALSE}
# Define expression value vectors
ref_expressions <- c(1.9, 3.1, 7.8, 31.6, 33.7, 31.5, 131.4, 107.5, 116.7, 112.5, 109.7, 57.4, 50.9)
query_expressions <- c(14, 12.1, 15.9, 47, 30.9, 50.5, 80.1, 67.4, 72.9, 61.7)
list_vector <- list(
reference = ref_expressions,
query = query_expressions
)
```
### Registering the data
As previously shown, to register the input list, users can run the function `register()`:
```{r register-data-list-vectors, message=FALSE, warning=FALSE}
registration_results_vectors <- register(
list_vector,
reference = "Ref",
query = "Query",
scaling_method = "z-score"
)
#> ── Validating input data ───────────────────────────────────────────────────────
#> ℹ Will process 1 gene.
#> ℹ Using estimated standard deviation, as no `exp_sd` was provided.
#> ℹ Using `scaling_method` = "z-score".
#>
#> ── Starting registration with optimisation ─────────────────────────────────────
#> ℹ Using L-BFGS-B optimisation method.
#> ℹ Using computed stretches and shifts search space limits.
#> ℹ Using `overlapping_percent` = 50% as a registration criterion.
#> ✔ Optimising registration parameters for genes (1/1) [170ms]
```
### Registration results
The registration result object will have the same structure as when using a data frame as an input. Since no ID is explicitly defined in the input vector list, a unique `gene_id` will be automatically generated for the reference and query pair.
```{r get-model-summary-data-vectors, warning=FALSE}
registration_results_vectors$model_comparison |>
knitr::kable()
```