/
2018-12-05-1000genomes-map.Rmd
241 lines (180 loc) · 7.33 KB
/
2018-12-05-1000genomes-map.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
---
title: Create a map of the 1000 Genomes project reference populations
description: Using the leaflet R-package
date: '2018-12-05'
tags:
- data visualisation
- R
- maps
slug: 1kgmap
draft: false
url_source: https://github.com/sinarueeger/webpage/blob/master/blogdown/content/post/2018-12-05-1000genomes-map.Rmd
url_code: https://github.com/sinarueeger/webpage/blob/master/blogdown/content/post/2018-12-05-1000genomes-map.R
header:
caption: ''
image: ''
output:
blogdown::html_page:
toc: true
---
```{r eval=FALSE, message=FALSE, warning=FALSE, include=FALSE}
knitr::purl(input = "content/post/2018-12-05-1000genomes-map.Rmd",
output = "content/post/2018-12-05-1000genomes-map.R")
```
This post provides the R-Code to map the 26 populations of the [1000 Genomes project](http://www.internationalgenome.org/).
## Goal
Create a map similar to the one^[This is a png and cannot be altered.] on the front page of http://www.internationalgenome.org/ in a reproducible manner.
![_Version on internationalgenome.org_](/post/2018-12-05-1000genomes-map/1000genomes-map.png)
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
cache = TRUE
)
```
## Get started
Packages needed:
```{r}
## accessed via ::
# library(mapview)
# library(readxl)
# library(readr)
# library(purrr)
# library(tidyr)
# library(forcats)
library(leaflet)
library(dplyr)
library(ggmap) ## for geocode, devtools::install_github("dkahle/ggmap")
```
`ggmap` requires a google map api key^[Seen here: https://stackoverflow.com/questions/36175529/getting-over-query-limit-after-one-request-with-geocode.]:
1. get one here: https://developers.google.com/maps/documentation/geocoding/get-api-key
1. then run `register_google(key = "my_api_key")`
## Data
- The population counts and labels are from ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/ ([download xlsx file](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.xlsx)).
- The super population labels are from [here](http://www.internationalgenome.org/faq/which-populations-are-part-your-study/) (pasted into a [csv](/post/2018-12-05-1000genomes-map/sample_info_superpop.csv), then the location was inferred).
Download the population counts and labels first:
```{r}
url <-
"ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_sample_info.xlsx"
url.bitly <- "http://bit.ly/2MQTr02"
download.file(url, "20130606_sample_info.xlsx", mode = "wb")
```
Import file into R:
```{r}
df <-
readxl::read_excel("20130606_sample_info.xlsx", sheet = "Sample Info")
# >> Sample Info
```
Some data wrangling:
```{r}
## count number of individuals by population
## rename population > POP
n.pop <-
df %>%
group_by(Population) %>%
summarise(n = n()) %>%
rename(POP = Population)
## import super population names and details to the location of populations
## copied from here:
url.spop <-
"http://www.internationalgenome.org/faq/which-populations-are-part-your-study/"
## added location manually (!) - found this the only option to prevent overlapping locations.
## Also, description involves a mix of location and origin.
## rename superpopulation > SPOP
n.spop <-
readr::read_csv("../../static/post/2018-12-05-1000genomes-map/sample_info_superpop.csv") %>%
rename(POP = `Population Code`, SPOP = `Super Population Code`)
## join the two information
n.1kg <- left_join(n.pop, n.spop, by = c("POP" = "POP"))
```
## Add geographical coordinates
Parts of the code below is from a map created by [Daniela Vazquez](https://github.com/d4tagirl) for R-Ladies: https://github.com/rladies/Map-RLadies-Growing.
This is the part where we annotate the dataframe `n.1kg` with where the individuals live (not their ancestry).
Repeat this until there are no `warnings()` about `QUERY LIMITS` (the `while` loop takes care of this).
We will use the `ggmap` package, which accesses the google maps api.
A workaround is to set `source = "dsk"` (works for a limited number of queries)^[See https://stackoverflow.com/questions/36175529/getting-over-query-limit-after-one-request-with-geocode.].
```{r}
n.1kg <- n.1kg %>%
mutate(purrr::map(.$location, geocode, source = "dsk")) %>%
tidyr::unnest()
## running into the inevitable QUERY LIMITS problems, lets use the approach from https://github.com/rladies/Map-RLadies-Growing
n.1kg.withloc <- n.1kg %>%
filter(!is.na(lon))
while(nrow(n.1kg.withloc) != nrow(n.1kg))
{
# repeat this until there are no warnings() about QUERY LIMITS
temp <- n.1kg %>%
select(-lon, -lat) %>%
anti_join(n.1kg.withloc %>% select(-lon, -lat)) %>%
mutate(longlat = purrr::map(.$location, geocode, source = "dsk")) %>%
tidyr::unnest() %>%
filter(!is.na(lon))
n.1kg.withloc <- n.1kg.withloc %>%
bind_rows(temp) %>%
distinct()
}
n.1kg <- n.1kg.withloc
## glue POP and `Population Description` together
n.1kg <-
n.1kg %>% mutate(pop.desc = paste0(POP, " : ", `Population Description`, " (", SPOP, ")"))
## given that only a number of geolocation are possible with the google API, this
## should probably stored out
## readr::write_csv(n.1kg, path = "1kg_sample_info_location.csv")
```
## Create leaflet
Map locations a world map with leaflet
```{r}
## if you have stroed the data in the previous chunk:
## readr::read_csv("1kg_sample_info_location.csv")
```
Define shiny icons:
```{r}
icons <- awesomeIcons(
icon = 'user', #people',
iconColor = 'black',
library = 'fa', #ion
markerColor = as.character(forcats::fct_recode(as.factor(n.1kg$SPOP),
red = "EUR",
blue = "AFR",
green = "AMR",
gray = "EAS",
orange = "SAS"))
## ok, thats not too pretty, but turns out, hex colors won't work
)
## we need to create a vector that maps cols to SPOP from the markerColor argument above
cols <- c("#E50102", "#00A9DD", "#57BA1F", "#575757", "#FD8E00")
SPOP <- c("EUR", "AFR", "AMR", "EAS", "SAS")
## separate icon that will display the information
## ------------------------------------------------
icon.info <- awesomeIcons(
icon = 'info', #people',
iconColor = 'white',
library = 'fa', #ion
markerColor = "white"
)
```
Create map:
```{r}
m <- leaflet(data = n.1kg) %>%
addTiles() %>% # Add default OpenStreetMap map tiles
addAwesomeMarkers(lat=~lat, lng=~lon, label = ~htmltools::htmlEscape(pop.desc), icon = icons) %>%
addAwesomeMarkers(lat=-45, lng=-107, popup = glue::glue("Source: https://github.com/sinarueeger/map-1000genomes"), icon = icon.info) %>% ## this bit has potential to be displayed as a href.
#glue::glue("Source: {url.bitly} + {url.spop} (manual tidying)"), icon = icon.info) %>%
addLegend("bottomright",
colors =cols,
labels= SPOP,
opacity = 1)
m # Print the map
```
## Save the map
```{r, eval = FALSE}
## save to png
## ------------
mapview::mapshot(m, file = "map-1000genomes-populations.png")
## save to hmtl
## -------------
htmlwidgets::saveWidget(m, file="map-1000genomes-populations.html")
```
## Reason for deviation from the original
I mapped the populations according to the current location but coloured them according to ancestry.