/
api.Rmd
363 lines (289 loc) · 11.7 KB
/
api.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
---
title: "Accessing the RESTful API"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{restapi}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(magrittr)
library(ruODK)
```
# Scope
This vignette provides a walk-through of the "getting data out"
functions of the RESTful API endpoints which list and view details.
`ruODK` users would mix and match parts of the demonstrated workflows to build
their own data pipelines, e.g.:
* to build a quick analysis from all data, freshly downloaded from a smaller
project, or
* to build an interactive ETL pipeline to selectively download only new submissions
for further processing and upload into downstream data warehouses.
## Three ways to happiness
ODK Central offers no less than three different ways to access data:
* viewing ODK Central data in MS PowerBI, MS Excel, Tableau, or `ruODK`
through the OData service endpoints, or
* downloading all submissions including attachments as one (possibly gigantic)
zip archive either through the "Export Submissions" button in the ODK Central
form submissions page or through `ruODK`,
* viewing ODK Central data through `ruODK`'s RESTful API functions.
While the `vignette("odata", package="ruODK")`
(online [here](https://dbca-wa.github.io/ruODK/articles/odata.html)) illustrates
the first option, this vignette demonstrates the remaining two.
Not implemented (yet) are the "managing ODK Central" functions which create,
update, and delete projects, forms, users, roles, and permissions.
We haven't yet found a strong use case to automate those functions -
ODK Central (driven by humans) does those jobs beautifully on an expected scale.
# Setup ruODK
See [`vignette("Setup", package = "ruODK")`](https://dbca-wa.github.io/ruODK/articles/setup.html)
for detailed options to configure `ruODK`.
# Projects
List projects. We see the project ID, a name, the number of forms and app users,
dates of last form submissions plus project management timestamps (created,
updated).
The important bit here is the project ID.
```{r project_list}
p <- ruODK::project_list()
p %>% knitr::kable(.)
```
Inspect a project using its ID. We receive a tibble with exactly one row,
all the columns of `ruODK::project_list` plus a column `verbs`, which contains all
available API actions for a project.
```{r project_details}
pd <- ruODK::project_detail(1)
pd %>% dplyr::select(-"verbs") %>% knitr::kable(.)
pd$verbs[[1]] %>% unlist(.)
```
Nothing apart from the verbs is new compared to the data returned by
`ruODK::project_list`.
To learn more about the functionality behind the verbs, refer to the interactive
[ODK Central API documentation](https://odkcentral.docs.apiary.io/#reference/project-management).
To retrieve data from ODK Central, the functions shown below will suffice.
# Forms
## List forms for a project
To download form submissions, we need to know project ID and form ID.
There are several ways of retrieving the form ID:
* Browsing forms in the ODK Central's project overviews,
* Stealing the form ID from the OData service endpoint URL as shown on
ODK Central's form submission page,
* Listing form metadata for a given project ID with `ruODK::form_list()`.
```{r form_list}
forms <- ruODK::form_list(1)
forms %>% knitr::kable(.)
```
Further to the metadata shown here, a column `xml` contains the entire XForms
definition (originally XML) as nested list.
If the original XML is needed rather than the R equivalent (nested list),
we can use `ruODK::form_xml` with parameter `parse=FALSE`:
```{r form_xml}
form_xml <- ruODK::form_xml(1, forms$fid[[1]], parse=FALSE)
form_xml
```
## Show details of one form
The details of a form are exactly the same as the output of `ruODK::form_list()`.
```{r form_detail}
form <- ruODK::form_detail(1, forms$fid[[1]])
form %>% knitr::kable(.)
```
## Inspect form schema
The nested list of lists returned by `form_schema` is the cleanest representation
of the Xform definition available in R.
See the [ODK Central API docs](https://odkcentral.docs.apiary.io/#reference/forms-and-submissions/'-individual-form/retrieving-form-schema-json) and the examples of `??ruODK::form_schema()` for more detail.
```{r form_schema}
fs <- form_schema(1, forms$fid[[1]])
# listviewer::jsonedit(fs)
```
Namely, the form schema tells us about form field data types (dates, geopoints,
binary attachments) and repeating groups. We need this knowledge to parse the
submission data into tidy tibbles later.
# Submissions
We are getting closer to the actual data! This section shows two of the options
for data access: dump all submissions, or extract a subset.
## Get all submissions for one form
Smaller datasets lend themselves to be exported in one go.
ODK Central offers one giant zip file containing all submissions, any
repeating groups, and any attachments both on the form submission page, and as
API endpoint which is provided as `ruODK::submission_export()`.
The default behaviour of `ruODK::submission_export()` is to write the zip file
to the project root (`here::here()`), and to overwrite existing previous downloads.
See `?ruODK::submission_export()` for alternative download and retention options.
In the following chuck, we illustrate common tasks:
* Download the zip file.
* Unpack the zip file.
* Join repeating form group data `data_taxon` to main data `data_quadrat` to
annotate `data_taxon` with data from `data_quadrat`.
* Sanitise the column names.
* Prepend all attachment filenames (e.g. `data_quadrat$location_quadrat_photo`,
`data_taxon$photo_in_situ`) with `media/`.
```{r submission_export}
t <- fs::dir_create("api")
# Predict filenames (with knowledge of form)
fid <- Sys.getenv("ODKC_TEST_FID")
fid_csv <- fs::path(t, glue::glue("{fid}.csv"))
fid_csv_tae <- fs::path(t, glue::glue("{fid}-taxon_encounter.csv"))
# Download the zip file
se <- ruODK::submission_export(
Sys.getenv("ODKC_TEST_PID"),
Sys.getenv("ODKC_TEST_FID"),
local_dir = t,
overwrite = FALSE,
verbose = TRUE,
url = Sys.getenv("ODKC_TEST_URL"),
un = Sys.getenv("ODKC_TEST_UN"),
pw = Sys.getenv("ODKC_TEST_PW")
)
# Unpack the zip file
f <- unzip(se, exdir = t)
fs::dir_ls(t)
# Read the CSV files
data_quadrat_raw <- fid_csv %>% readr::read_csv()
data_taxon_raw <- fid_csv_tae %>% readr::read_csv()
# Sanitize column names
add_dettol <- . %>% stringr::str_replace_all(., "-", "_")
names(data_quadrat_raw) <- names(data_quadrat_raw) %>% add_dettol
names(data_taxon_raw) <- names(data_taxon_raw) %>% add_dettol
# Prepend attachments with media/ to turn into relative file paths
data_quadrat <- data_quadrat_raw %>%
dplyr::mutate(
location_quadrat_photo = glue::glue("media/{location_quadrat_photo}"),
habitat_morphological_type_photo = glue::glue("media/{habitat_morphological_type_photo}"),
perimeter_mudmap_photo = glue::glue("media/{perimeter_mudmap_photo}")
)
DT::datatable(data_quadrat)
# Further: create map with popups, see vignette "odata"
data_taxon <- data_taxon_raw %>%
dplyr::mutate(
photo_in_situ = glue::glue("media/{photo_in_situ}")
) %>%
dplyr::left_join(
data_quadrat,
by = c("PARENT_KEY" = "meta_instanceID")
)
DT::datatable(data_taxon)
# Further: create map with popups, see vignette "odata"
```
In future versions of `ruODK`, we'll introspect the form schema to determine
the names of repeating groups, the form field names of attachments (data type
"binary" in form schema), and thus could automate the process shown above.
## List submissions for one form
Not always is it appropriate to download all submissions and all attachments
at once.
If forms feed into downstream data warehouses, the typical ETL workflow is to
* List all submissions from ODK Central
* Select the subset of new submissions to download, e.g.
* Submissions younger than the oldest submission date in the data warehouse.
* Submissions whose `instance_id` is not already present in the data warehouse.
* Download only the selected submissions.
* Download attachments of only the selected submissions.
```{r submission_list}
sl <- ruODK::submission_list(
Sys.getenv("ODKC_TEST_PID"),
Sys.getenv("ODKC_TEST_FID"),
url = Sys.getenv("ODKC_TEST_URL"),
un = Sys.getenv("ODKC_TEST_UN"),
pw = Sys.getenv("ODKC_TEST_PW")
)
sl %>% knitr::kable(.)
```
The list of submissions critically contains each submission's unique ID in
`instance_id`. If the submissions shall be downloaded and uploaded into another
data warehouse, the `instance_id` can be used to determine whether a record
already exists in the downstream warehouse or not.
This workflow is preferable where the majority of submissions is already
imported into another downstream data warehouse, and we only want to add new
submissions, as in submissions which are not already imported into the data
warehouse.
Furthermore, the `instance_id`s can now be used to retrieve the actual
submissions.
## Get submission data
In order to import each submission, we need to retrieve the data by
`instance_id`.
```{r submission_detail}
# One submission
sub <- ruODK::submission_detail(
Sys.getenv("ODKC_TEST_PID"),
Sys.getenv("ODKC_TEST_FID"),
sl$instance_id[[1]],
url = Sys.getenv("ODKC_TEST_URL"),
un = Sys.getenv("ODKC_TEST_UN"),
pw = Sys.getenv("ODKC_TEST_PW")
)
# Multiple submissions
submissions <- tibble::tibble(
pid = Sys.getenv("ODKC_TEST_PID"),
fid = Sys.getenv("ODKC_TEST_FID"),
iid = sl$instance_id, # this is a vector of multiple instance_ids
url = Sys.getenv("ODKC_TEST_URL"),
un = Sys.getenv("ODKC_TEST_UN"),
pw = Sys.getenv("ODKC_TEST_PW")
) %>%
purrr::pmap(ruODK::submission_detail)
```
## Parse submissions
The data in `sub` is one row of the bulk downloaded submissions in `data_quadrat`.
The data in `submissions` represents all (or let's pretend, the selected)
submissions in `data_quadrat`.
The field `xml` contains the actual submission data including repeating groups.
The structure is different to the output of `ruODK::odata_submission_get`,
therefore `ruODK::odata_submission_parse` does not work for those, as
here we might have repeating groups included in a submission.
```{r}
head(submissions)
```
# Attachments
## List all attachments for one form
```{r attachment_list}
al <- attachment_list(
Sys.getenv("ODKC_TEST_PID"),
Sys.getenv("ODKC_TEST_FID"),
sl$instance_id[[1]],
url = Sys.getenv("ODKC_TEST_URL"),
un = Sys.getenv("ODKC_TEST_UN"),
pw = Sys.getenv("ODKC_TEST_PW")
)
al
attachments <- tibble::tibble(
pid = Sys.getenv("ODKC_TEST_PID"),
fid = Sys.getenv("ODKC_TEST_FID"),
iid = sl$instance_id, # this is a vector of multiple instance_ids
url = Sys.getenv("ODKC_TEST_URL"),
un = Sys.getenv("ODKC_TEST_UN"),
pw = Sys.getenv("ODKC_TEST_PW")
) %>%
purrr::pmap(ruODK::attachment_list)
attachments
```
## Get attachments
While we're working on parsing the submissions into something resembling
`data_quadrat`, we'll demonstrate the use of `ruODK::attachment_get` using
`data_quadrat_raw`, which still contains the plain filenames without prefixed
local folders.
```{r attachment_get}
data_quadrat_att <- data_quadrat_raw %>%
dplyr::mutate(
location_quadrat_photo = ruODK::attachment_get(
Sys.getenv("ODKC_TEST_PID"),
Sys.getenv("ODKC_TEST_FID"),
meta_instanceID,
location_quadrat_photo,
local_dir = t,
url = Sys.getenv("ODKC_TEST_URL"),
un = Sys.getenv("ODKC_TEST_UN"),
pw = Sys.getenv("ODKC_TEST_PW")
)
# same for all other attachments
# requires knowledge or introspection of form schema
)
# We have file paths...
data_quadrat_att$location_quadrat_photo
# ...and the corresponding files
fs::dir_ls(t, recurse = TRUE)
```
From here, we can use the separately created data tibbles and locally downloaded
attachments to build visualisations as shown in the vignette "odata".