-
Notifications
You must be signed in to change notification settings - Fork 0
/
regionalpcs.Rmd
346 lines (257 loc) · 11.7 KB
/
regionalpcs.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
---
title: "regionalpcs"
author: "Tiffany Eulalio"
output: BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{regionalpcs}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
echo = TRUE
)
```
# regionalpcs
Navigating the complexity of DNA methylation data poses substantial challenges
due to its intricate biological patterns. The `regionalpcs` package is
conceived to address the substantial need for enhanced summarization and
interpretation at a regional level. Traditional methodologies, while
foundational, may not fully encapsulate the biological nuances of methylation
patterns, thereby potentially yielding interpretations that may be suboptimal
or veer towards inaccuracies. This package introduces and utilizes regional
principal components (rPCs), designed to adeptly capture biologically relevant
signals embedded within DNA methylation data. Through the implementation of
rPCs, researchers can gain new insights into complex interactions and
effects in methylation data that might otherwise be missed.
# Installation
The `regionalpcs` package can be easily installed from Bioconductor, providing
you with the latest stable version suitable for general use. Alternatively,
for those interested in exploring or utilizing the latest features and
developments, the GitHub version can be installed directly.
### Bioconductor Installation
Install `regionalpcs` from Bioconductor using the command below. Ensure
that your R version is compatible with the package version available on
Bioconductor for smooth installation and functionality.
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("regionalpcs")
```
### Development Version Installation
To access the development version of `regionalpcs` directly from GitHub,
which might include new features or enhancements not yet available in the
Bioconductor version, use the following commands. Note that the development
version might be less stable than the officially released version.
```{r eval=FALSE}
# install devtools package if not already installed
if (!requireNamespace("devtools", quietly=TRUE))
install.packages("devtools")
# download and install the regionalpcs package
devtools::install_github("tyeulalio/regionalpcs")
```
# `regionalpcs` R Package Tutorial
## Loading Required Packages
This tutorial depends on several Bioconductor packages. These packages should
be loaded at the beginning of the analysis.
```{r load-packages, message=FALSE, warning=FALSE}
library(regionalpcs)
library(GenomicRanges)
library(tidyr)
library(tibble)
library(dplyr)
library(minfiData)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
```
Let’s proceed to load the packages, briefly understanding their roles in
this tutorial.
- `regionalpcs`: Primary package for summarizing and interpreting DNA
methylation data at a regional level.
- `GenomicRanges`: Facilitates representation and manipulation of genomic
intervals and variables defined along a genome.
- `tidyr`, `tibble`, `dplyr`: Assist in data tidying, representation,
and manipulation.
- `minfiData`: Provides example Illumina 450k data, aiding in the
demonstration of `regionalpcs` functionalities.
- `TxDb.Hsapiens.UCSC.hg19.knownGene`: Accommodates transcriptome data,
useful for annotating results.
Once the packages are loaded, we can start using the functions provided by each
package.
## Load the Dataset
### Loading Minfi Sample Dataset
In this tutorial, we’ll utilize a sample dataset, `MsetEx.sub`, which is a
subset derived from Illumina's Human Methylation 450k dataset, specifically
preprocessed to contain 600 CpGs across 6 samples. The dataset is stored in a
`MethylSet` object, which is commonly used to represent methylation data.
The methylation beta values, denoting the proportion of methylated cytosines
at a particular CpG site, will be extracted from this dataset for our
subsequent analyses.
```{r}
# Load the MethylSet data
data(MsetEx.sub)
# Display the first few rows of the dataset for a preliminary view
head(MsetEx.sub)
# Extract methylation M-values from the MethylSet
# M-values are logit-transformed beta-values and are often used in differential
# methylation analysis for improved statistical performance.
mvals <- getM(MsetEx.sub)
# Display the extracted methylation beta values
head(mvals)
```
Note that `MsetEx.sub` provides a manageable slice of data that we can
utilize to illustrate the capabilities of `regionalpcs` without requiring
extensive computational resources.
Now that we have our dataset loaded and methylation values extracted,
let’s proceed with demonstrating the core functionalities of `regionalpcs`.
## Obtaining Methylation Array Probe Positions
### Load Illumina 450k Array Probe Positions
In this step, we’ll obtain the genomic coordinates of the CpG sites in our
methylation dataset using the 450k array probe annotations using the `minfi`
package.
```{r genomic-positions}
# Map the methylation data to genomic coordinates using the mapToGenome
# function. This creates a GenomicMethylSet (gset) which includes genomic
# position information for each probe.
gset <- mapToGenome(MsetEx.sub)
# Display the GenomicMethylSet object to observe the structure and initial
# entries.
gset
# Convert the genomic position data into a GRanges object, enabling genomic
# range operations in subsequent analyses.
# The GRanges object (cpg_gr) provides a versatile structure for handling
# genomic coordinates in R/Bioconductor.
cpg_gr <- granges(gset)
# Display the GRanges object for a preliminary view of the genomic coordinates.
cpg_gr
```
## Summarizing Gene Region Types
### Introduction
Gene regions, which include functional segments such as promoters, gene bodies,
and intergenic regions, play pivotal roles in gene expression and regulation.
Summarizing methylation patterns across these regions can provide insights
into potential gene regulatory mechanisms and associations with phenotypes
or disease states. Herein, we will delve into how to succinctly summarize
methylation data at these crucial genomic segments using the `regionalpcs`
package.
### Load Gene Region Annotations
First, let's load the gene region annotations. Make sure to align the genomic
builds of your annotations and methylation data.
```{r}
# Obtain promoter regions
# The TxDb object 'txdb' facilitates the retrieval of transcript-based
# genomic annotations.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# Extracting promoter regions with a defined upstream and downstream window.
# This GRanges object 'promoters_gr' will be utilized to map and summarize
# methylation data in promoter regions.
promoters_gr <- suppressMessages(promoters(genes(txdb),
upstream=1000,
downstream=0))
# Display the GRanges object containing the genomic coordinates of promoter
# regions.
promoters_gr
```
### Create a Region Map
Creating a region map, which systematically assigns CpGs to specific gene
regions, stands as a crucial precursor to gene-region summarization using the
`regionalpcs` package. This mapping elucidates the physical positioning of
CpGs within particular gene regions, facilitating our upcoming endeavors to
comprehend how methylation varies across distinct genomic segments. We'll use
the `create_region_map` function from the `regionalpcs` package. This function
takes two genomic ranges objects, `cpg_gr` contains CpG positions and `genes_gr`
contains gene region positions. Make sure both positions are aligned to the
same genome build (e.g. GrCH37, CrCH38).
```{r}
# get the region map using the regionalpcs function
region_map <- regionalpcs::create_region_map(cpg_gr=cpg_gr,
genes_gr=promoters_gr)
# Display the initial few rows of the region map.
head(region_map)
```
**Note:** The second column of `region_map` must contain values matching the
rownames of your methylation dataframe.
### Summarizing Gene Regions with Regional Principal Components
In this final section, we'll summarize gene regions using Principal
Components (PCs) to capture the maximum variation. We'll utilize the
`compute_regional_pcs` function from the `regionalpcs` package for this.
#### Compute Regional PCs
Let's calculate the regional PCs using our gene regions for
demonstration purposes.
```{r compute-regional-pcs}
# Display head of region map
head(region_map)
dim(region_map)
# Compute regional PCs
res <- compute_regional_pcs(meth=mvals, region_map=region_map, pc_method="gd")
```
#### Inspecting the Output
The function returns a list containing multiple elements. Let's first look at
what these elements are.
```{r inspect-output}
# Inspect the output list elements
names(res)
```
#### Extracting and Viewing Regional PCs
The first element (`res$regional_pcs`) is a data frame containing the
calculated regional PCs.
```{r extract-regional-pcs}
# Extract regional PCs
regional_pcs <- res$regional_pcs
head(regional_pcs)[1:4]
```
#### Understanding the Results
The output is a data frame with regional PCs for each region as rows and our
samples as columns. This is our new representation of methylation values, now
on a gene regional PC scale. We can feed these into downstream analyses as is.
The number of regional PCs representing each gene region was determined by the
Gavish-Donoho method. This method allows us to identify PCs that capture actual
signal in our data and not the noise that is inherent in any dataset. To explore
alternative methods, we can change the `pc_method` parameter.
```{r count-pcs-regions}
# Count the number of unique gene regions and PCs
regions <- data.frame(gene_pc = rownames(regional_pcs)) |>
separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(regions)
# number of genes that have been summarized
length(unique(regions$gene))
# how many of each PC did we get
table(regions$pc)
```
We have summarized each of our genes using just one PC. The number of PCs
depends on three main factors: the number of samples, the number of CpGs in
the gene region, and the noise in the methylation data.
By default, the `compute_regional_pcs` function uses the Gavish-Donoho method.
However, we can also use the Marcenko-Pasteur method by setting the `pc_method`
parameter:
```{r marcenko-pasteur-method}
# Using Marcenko-Pasteur method
mp_res <- compute_regional_pcs(mvals, region_map, pc_method = "mp")
# select the regional pcs
mp_regional_pcs <- mp_res$regional_pcs
# separate the genes from the pc numbers
mp_regions <- data.frame(gene_pc = rownames(mp_regional_pcs)) |>
separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(mp_regions)
# number of genes that have been summarized
length(unique(mp_regions$gene))
# how many of each PC did we get
table(mp_regions$pc)
```
The Marcenko-Pasteur and the Gavish-Donoho methods are both based on Random
Matrix Theory, and they aim to identify the number of significant PCs that
capture the true signal in the data and not just the noise. However, these
methods differ in how they select the number of significant PCs. The
Marcenko-Pasteur method typically selects more PCs to represent a gene region
compared to the Gavish-Donoho method. This may be due to the different ways in
which the two methods estimate the noise level in the data.
Ultimately, the choice between the two methods depends on the specific needs
and goals of the analysis. The Gavish-Donoho method tends to provide more
conservative results, while the Marcenko-Pasteur method may capture more of
the underlying signal in the data. Researchers should carefully consider
their objectives and the characteristics of their data when selecting a method.
# Session Information
```{r sessionInfo}
sessionInfo()
```