/
index.Rmd
178 lines (134 loc) · 7.51 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
---
slug: "rsnps-update"
title: "rsnps 0.5.0: New ncbi_snp_query() Features"
package_version: 0.5.0
author:
- Sina Rüeger
- Julia Gustavsen
date: 2022-03-29
tags:
- packages
- R
- rsnps
- dbsnp
- open-data
- tech notes
description: "`ncbi_snp_query()` now returns all reported variant allele frequencies in dbSNP."
twitterImg: blog/2022/03/29/rsnps-update/plot-maf-1.png
twitterAlt: "Allele frequencies in dbSNP for various studies/populations."
output:
html_document:
keep_md: true
---
```{r setup, include=FALSE}
# Options to have images saved in the post folder
# And to disable symbols before output
knitr::opts_chunk$set(fig.path = "", comment = "")
# knitr hook to make images output use Hugo options
knitr::knit_hooks$set(
plot = function(x, options) {
hugoopts <- options$hugoopts
paste0(
"{{<figure src=",
'"', x, '" ',
if (!is.null(hugoopts)) {
glue::glue_collapse(
glue::glue('{names(hugoopts)}="{hugoopts}"'),
sep = " "
)
},
">}}\n"
)
}
)
# knitr hook to use Hugo highlighting options
knitr::knit_hooks$set(
source = function(x, options) {
hlopts <- options$hlopts
paste0(
"```r ",
if (!is.null(hlopts)) {
paste0("{",
glue::glue_collapse(
glue::glue('{names(hlopts)}={hlopts}'),
sep = ","
), "}"
)
},
"\n", glue::glue_collapse(x, sep = "\n"), "\n```\n"
)
}
)
```
## TL;DR
`rsnps` is a package that enables the retrieval of single nucleotide polymorphism (SNP) data from the [NCBI's](https://www.ncbi.nlm.nih.gov/) [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) database and [openSNP](https://opensnp.org/) by providing wrappers for the APIs. Single nucleotide polymorphisms represent differences at one specific position in a detected biological sequence compared to the reference.
`ncbi_snp_query()` now returns all reported variant allele frequencies in dbSNP in column `maf_population` in form of a `tibble`. Previously (version <= 0.4.0), it reported only the allele frequency from [gnomAD](https://gnomad.broadinstitute.org/) in column `maf` as a `data.frame`.
## Changes
The [NEWS](https://github.com/ropensci/rsnps/blob/master/NEWS.md) file will tell you about the changes of rsnps 0.4.0 to 0.5.0, but there is one change that we think will be particularly relevant to users which we will highlight in this blog post.
`ncbi_snp_query()` is the function that pulls data from [NCBI's](https://www.ncbi.nlm.nih.gov/) [dbSNP](https://www.ncbi.nlm.nih.gov/snp/), a database of single-nucleotide polymorphisms (SNP). This database lets a user query for a SNP of interest and returns a plethora of information, among them genomic position, associated gene, clinical significance, and - relevant for this blogpost - the allele frequency. The allele frequency varies typically (on average) between different populations, sometimes just a little (e.g. [rs562556](https://www.ncbi.nlm.nih.gov/snp/rs562556#frequency_tab)), sometimes a lot (e.g. [rs11677783](https://www.ncbi.nlm.nih.gov/snp/rs11677783#frequency_tab)). This is why [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) collects allele frequency estimates from different studies and populations.
Until version 0.4.0 `ncbi_snp_query()` reported the allele frequency estimated from [gnomAD](https://gnomad.broadinstitute.org/). For example, for SNP `rs420358` the `ncbi_snp_query()` output used to look like this:
```r
ncbi_snp_query("rs420358")
#> query chromosome bp class rsid gene
#> 2 rs420358 1 40341239 snv rs420358
#> alleles ancestral_allele variation_allele seqname
#> 2 A,C,G,T A C,G,T NC_000001.11
#> assembly ref_seq minor maf
#> 2 GRCh38.p12 A C 0.8614
```
We have now changed two things:
1. `ncbi_snp_query()` output now includes a new `maf_population` column which contains all reported allele frequencies, not only from one study or population (for backwards compatibility, the `maf` column stays the same).
2. To do that, `ncbi_snp_query()` returns now a [tibble](https://tibble.tidyverse.org/) (and not a data frame).
## Examples
Let's have a look at the output of the two SNPs mentioned above. You can see that a) `maf_population` is a new list-column (`<list>`), and b) `dat` is a tibble.
```{r setupexample, include=FALSE}
library(ggplot2)
theme_set(theme_bw())
library(patchwork)
library(dplyr)
library(tidyr)
library(forcats)
```
```{r query}
library(rsnps)
(dat <- ncbi_snp_query(c("rs11677783", "rs562556")))
```
### One row per study ("long data")
Here is a visual to show how much the allele frequencies vary per SNP.
First, we **lengthen** the data frame, so that each SNP and population/study are on a **separate line**.
```{r maf}
(dat_maf <- dat %>%
select(query, maf_population) %>%
unnest(cols = c(maf_population)))
```
Then we display it with the allele frequency on the x-axis and the study along the y-axis.
```{r plot-maf, hugoopts=list(alt="Two figures displaying the allele frequency for two genetic variants (rs11677783, rs562556). Each figure is a dot plot with study along the vertical, y-axis and MAF along the horizontal, x-axis. The studies are arranged in ascending order of MAF", caption="Allele frequencies in dbSNP for rs11677783, rs562556 and various studies/populations.", width=800), fig.width = 10, fig.asp = 0.6}
p1 <- ggplot(data = dat_maf %>% filter(query == "rs11677783") %>% mutate(study = forcats::fct_reorder(study, MAF ))) +
geom_vline(xintercept = c(0, 0.5, 1), linetype = 3, color ="gray") +
geom_point(aes(MAF, study)) +
labs(title = "Allele frequency", subtitle = "rs11677783")
p2 <- ggplot(data = dat_maf %>% filter(query == "rs562556") %>% mutate(study = forcats::fct_reorder(study, MAF ))) +
geom_vline(xintercept = c(0, 0.5, 1), linetype = 3, color ="gray") +
geom_point(aes(MAF, study)) +
labs(title = "Allele frequency", subtitle = "rs562556")
p1 + p2
```
### One study as column
We can decide to turn the tibble into a data frame again and pick a **specific study**. For example, here we use the `map()` function from the [purrr](https://purrr.tidyverse.org/) package to pull out the study which matches "KOREAN" from the `maf_population` list. Note that our new column, `maf_korean` is a `dbl`, and not a `list`:
```{r mafkorean}
(dat_korean <- dat %>%
mutate(maf_korean = purrr::map(maf_population, ~..1$MAF[..1$study=="KOREAN"])) %>%
unnest(cols = c(maf_korean)))
```
### One column per study ("wide data")
Lastly, we can decide to pivot the study allele frequencies, so that **each study** has its **own column**:
```{r mafcolumn}
dat_maf <- dat %>%
select(query, maf_population) %>%
unnest(cols = c(maf_population)) %>%
select(query, study, MAF) %>%
pivot_wider(values_from = "MAF", names_from = "study", values_fn = min, names_prefix = "maf_") ## if duplicate, picking the minimum
(dat_wide <- left_join(dat, dat_maf, by = "query"))
```
## Conclusion
We have highlighted important changes in `ncbi_snp_query()` whereby this function now returns a tibble, the column `maf_population` now returns all allele frequencies available in the NCBI database, and finally `maf_population` is now a list column instead of a regular column. These improvements which provide additional information and functionality to this package are available in the most recent version of rnsps available on [CRAN](https://cran.r-project.org/web/packages/rsnps/index.html) and on [GitHub](https://github.com/ropensci/rsnps).