/
workflow.Rmd
420 lines (363 loc) · 17.2 KB
/
workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
---
title: "Example deposits workflow"
author:
- "Mark Padgham"
date: "`r Sys.Date()`"
vignette: >
%\VignetteIndexEntry{3. Example deposits workflow}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set (
collapse = TRUE,
warning = TRUE,
message = TRUE,
width = 120,
comment = "#>",
fig.retina = 2,
fig.path = "README-"
)
```
This vignette provides an example of a complete deposits workflow, from
initiation of a deposits client through to data publication.
## Initial Metadata
As described in [the metadata
vignette](https://docs.ropensci.org/deposits/articles/metadata.html), deposits
start with metadata describing general aspects of the data being deposited,
such as a title, description, identification of creators, and any other aspects
specified in [the deposits metadata JSON
schema](https://github.com/ropenscilabs/deposits/blob/main/inst/extdata/dc/schema.json).
This workflow will use the same "beaver" datasets as [the metadata
vignette](https://docs.ropensci.org/deposits/articles/metadata.html), from [R's
"datasets"
package](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/beavers.html).
That vignette demonstrated how to use the error messages triggered by
incorrectly specified metadata to work towards the following, schema-compliant
specification:
```{r}
metadata <- list (
creator = list (list (name = "P.S. Reynolds")),
created = "1994-01-01T00:00:00",
title = "Time-series analyses of beaver body temperatures.",
description = "Original source of 'beaver' data, in Chapter 11 of Lange, N., Ryan, L., Billard, L., Brillinger, D., Conquest, L. and Greenhouse, J. eds (1994) Case Studies in Biometry.",
publisher = "John Wiley and Sons",
isPartOf = list (list (
identifier = "ark:/13960/t0mt2n370",
relation = "isPartOf"
))
)
```
The "isPartOf" element is a key part of "deposits" metadata, enabling
individual data sources to be systematically related to other resources, in
this case to the book first describing these data which has an associated "ark"
identifier. These fields are intended to help linking data depositions with
other project outcomes, such as publications, other data sets, or general
project descriptions. Fields for these purposes include "hasPart",
"hasVersion", "isPartOf", "isReferencedBy", "isReplacedBy", "isRequiredBy", and
"isVersionOf", with details of all fields given in the [deposits JSON
schema](https://github.com/ropenscilabs/deposits/blob/main/inst/extdata/dc/schema.json).
Those metadata can then be used to initiate a deposits client with [the `new()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-depositsClient-new),
demonstrated here with `sandbox = TRUE` to use the Zenodo "sandbox" environment.
```{r client-init, eval = FALSE}
cli <- depositsClient$new (
service = "zenodo",
sandbox = TRUE,
metadata = metadata
)
print (cli)
#> <deposits client>
#> deposits service : zenodo
#> sandbox: TRUE
#> url_base : https://sandbox.zenodo.org/api/
#> Current deposits : <none>
#>
#> hostdata : <none>
#> metadata : 6 terms (see 'metadata' element for details)
```
The metadata can be edited and extended as desired. The metadata recorded in
the deposits client can be updated after each edit with [the
`deposit_fill_metadata()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-fill-metadata-):
```{r fill-meta, eval = FALSE}
cli$deposit_fill_metadata (metadata)
```
While it is always possible to edit deposits metadata directly by passing
values to [the `deposit_fill_metadata()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-fill-metadata-),
the recommended procedure is to generate a "frictionless" metadata file, as
described in [the vignette of the same
name](https://docs.ropensci.org/deposits/articles/frictionless.html), and to
edit the metadata directly in that file. This procedure is demonstrated in the
following section. A frictionless metadata file can only be initially generated
in response to an actual data resource, and thus the next section begins by
generating some example data.
## Preparing data sources
The ["beaver"
data](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/beavers.html)
actually comprises two datasets, "beaver1" and "beaver2", each of which is a
time series of body temperature measurements from an individual beaver. For
data sources to be uploaded by deposits, they must first exist on a local
computer, meaning in this case that copies of these beaver datasets must first
be written to local files.
The deposits package presumes that a single deposits lives within a dedicated
local directory which includes all associated files. Let's start by making a
temporary directory and storing the "beaver" data there:
```{r make-tempdir}
beaver_dir <- file.path (tempdir (), "beaver")
if (!dir.exists (beaver_dir)) {
dir.create (beaver_dir)
}
bv1 <- file.path (beaver_dir, "beaver1.csv")
write.csv (datasets::beaver1, bv1, row.names = FALSE)
bv2 <- file.path (beaver_dir, "beaver2.csv")
write.csv (datasets::beaver1, bv2, row.names = FALSE)
```
We can then connect the deposits client with that local directory with the
[`deposit_add_resource()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-add-resource-):
```{r add_resource, eval = FALSE}
cli$deposit_add_resource (beaver_dir)
```
Printing the client, by typing `print(cli)`, or simply `cli`, then reveals that
it has been connected with the local directory holding those data:
```{r add-resource-print, eval = FALSE}
print (cli)
#> <deposits client>
#> deposits service : zenodo
#> sandbox: TRUE
#> url_base : https://sandbox.zenodo.org/api/
#> Current deposits : <none>
#>
#> hostdata : <none>
#> metadata : 6 terms (see 'metadata' element for details)
#> local_path : /tmp/RtmpPru5st/beaver
#> resources : 2 local, 0 remote
```
### Frictionless metadata
Calling the [`deposit_add_resource()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-add-resource-):
the first time also writes a ["frictionless" metadata
file](https://docs.ropensci.org/deposits/articles/frictionless.html#local-generation)
to the `local_path` directory:
```{r fake-dp, echo = FALSE}
write.csv (beaver1, file.path (beaver_dir, "datapackage.json"), row.names = FALSE)
```
```{r}
list.files (beaver_dir)
```
The additional "datapackage.json" file is initially generated by [the
"frictionless" R package](https://docs.ropensci.org/frictionless), which
automatically fills out details of each "resource", or local file, in a
"resources" section. The deposits package then inserts the metadata specified
above into an additional "metadata" section of the file. The following code
illustrates a portion of the contents of that file:
```{r dp-contents, eval = FALSE}
dp <- list.files (beaver_dir, pattern = "datapackage", full.names = TRUE)
readLines (dp) [1:25]
#> [1] "{"
#> [2] " \"profile\": \"tabular-data-package\","
#> [3] " \"metadata\": {"
#> [4] " \"created\": \"1994-01-01T00:00:00\","
#> [5] " \"creator\": ["
#> [6] " {"
#> [7] " \"name\": \"P.S. Reynolds\""
#> [8] " }"
#> [9] " ],"
#> [10] " \"description\": \"Original source of 'beaver' data, in Chapter 11 of Lange, N., Ryan, L., Billard, L., Brillinger, D., Conquest, L. and Greenhouse, J. eds (1994) Case Studies in Biometry.\","
#> [11] " \"isPartOf\": ["
#> [12] " {"
#> [13] " \"identifier\": \"ark:/13960/t0mt2n370\","
#> [14] " \"relation\": \"isPartOf\""
#> [15] " }"
#> [16] " ],"
#> [17] " \"publisher\": \"John Wiley and Sons\","
#> [18] " \"title\": \"Time-series analyses of beaver body temperatures.\""
#> [19] " },"
#> [20] " \"resources\": ["
#> [21] " {"
#> [22] " \"name\": \"beaver1\","
#> [23] " \"path\": \"beaver1.csv\","
#> [24] " \"profile\": \"tabular-data-resource\","
#> [25] " \"format\": \"csv\","
```
Once a frictionless "datapackage.json" file has been generated, any subsequent
editing of metadata should be done by directly editing that file. Editing
should also generally involve extending the automatically-inserted "resource"
metadata describing the structure of the actual files, as described in the
documentation for [the frictionless R
package](https://docs.ropensci.org/frictionless).
Edited and updated versions of metadata can then be loaded into a deposits
client by passing the path to the directory as the `path` argument to [the
`deposit_update()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-update-).
```{r cli-update-meta, eval = FALSE}
cli$deposit_update (beaver_dir)
```
Instead of `beaver_dir`, the full path to the local "datapackage.json" file can
also be passed. While the same effect can be achieved by calling [the
`deposit_fill_metadata()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-fill-metadata-)
for deposits which have not been initiated on the remote service, the
[`deposit_update()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-update-)
has additional effects after that point, and is the recommended method once a
"datapackage.json" file has been generated. This is demonstrated in the
subsequent section.
The following code demonstrates modification and updating of
metadata by first modifying the "title", and then showing that those changes
are reflected in the client itself:
```{r metadata-change, eval = FALSE}
m <- readLines (dp)
i <- grep ("\"title\"", m)
m [i] <- gsub ("Time", "Modified time", m [i])
writeLines (m, dp)
cli$metadata$title # original title
#> [1] "Time-series analyses of beaver body temperatures."
cli$deposit_update (beaver_dir)
cli$metadata$title
#> [1] "Modified time-series analyses of beaver body temperatures."
```
In short, metadata editing with deposits is generally done by editing a local
"datapackage.json" file, after which a deposits client can then be updated with
[the `deposit_update()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-update-).
## Initiating a remote deposit
The metadata held within a deposits client can be used to initiate a remote
deposit on the specified service with [the `deposit_new()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-new-):
```{r dep-new, eval = FALSE}
cli$deposit_new ()
#> ID of new deposit : 1065666
print (cli)
#> <deposits client>
#> deposits service : zenodo
#> sandbox: TRUE
#> url_base : https://sandbox.zenodo.org/api/
#> Current deposits : <none>
#>
#> url_service : https://sandbox.zenodo.org/deposit/1065666
#> deposit id : 1065666
#> hostdata : list with 14 elements
#> metadata : 7 terms (see 'metadata' element for details)
#> local_path : /tmp/RtmpMd4uB8/beaver
#> resources : 2 local, 0 remote
```
The client now contains additional "hostdata" elements, containing all data
recorded by Zenodo for that deposit. The default print method for the client
now also lists additional information including a URL for the new deposit, and
a unique identifier. In most R environments, the URL can be directly clicked to
view the deposit online. All new deposits are private, and can only be viewed
after first logging in to the service.
Metadata can still be edited and updated within a client through modifying the
"datapackage.json" file. The metadata held on Zenodo can then be updated by
calling [the `deposit_update()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-update-).
## Uploading files to a remote deposit
The main purpose of the deposits package, and of online deposition services, is
to deposit data. This is done with [the `deposit_upload_file()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-upload-file-).
The main parameter, `path`, can also be either a single file or an entire
directory. If `path` specifies a directory, all files contained within that
directory are uploaded.
```{r upload-files, eval = FALSE}
cli$deposit_upload_file (beaver_dir)
#> frictionless metadata file has been generated as '/tmp/RtmpCPOaqC/beaver/beaver1.csv'
cli$hostdata$files [, 1:3]
```
```{r hostfiles, echo = FALSE}
hf <- data.frame (
checksum = c (
"c8e7ff1e2e4323198b4be5227ff63864",
"c8e7ff1e2e4323198b4be5227ff63864",
"4fd4b5167c28a874170ab611daf824e7"
),
filename = c ("beaver1.csv", "beaver2.csv", "datapackage.json"),
filesize = c (1909, 1909, 1225)
)
print (hf)
```
The "hostdata" of the client now indicate that the three files have been
successfully uploaded.
### File compression
The [`deposit_upload_file()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-upload-file-)
includes a `compress` parameter which defaults to `"no"` for no compression,
but can also be set to `"tar"` or `"zip"` to compress files prior to uploading.
Compression is generally recommended for large files, both to ease uploading
and downloading, and to reduce storage sizes on the host services. The
frictionless "datapackage.json" file is always stored in uncompressed format,
to enable direct inspection via the online platforms. The following code
demonstrates the effects of file compression:
```{r upload-files-compressed, eval = FALSE}
cli$deposit_upload_file (beaver_dir, compress = "tar")
#> frictionless metadata file has been generated as '/tmp/RtmpCPOaqC/beaver/beaver1.csv'
cli$hostdata$files [, 1:3]
```
```{r hostfiles-compressed, echo = FALSE}
hf <- data.frame (
checksum = c (
"03dd72dacab515750494745e17e4f37c",
"713ce15cb9d3c2b2b6ba8d541c0934a5",
"4fd4b5167c28a874170ab611daf824e7"
),
filename = c ("beaver1.tar.gz", "beaver2.tar.gz", "datapackage.json"),
filesize = c (3584, 3584, 1225)
)
print (hf)
```
The frictionless "datapackage.json" files are never compressed, ensuring that
their contents can always be viewed on the web interfaces of the deposits
services. (The increase in sizes of the uploaded files there demonstrates that
compression often offers little advantage for small files. The advantages for
large files can nevertheless be considerable, and compression is generally
recommended.)
To change compression, or to compress a file that was previously uploaded in
uncompressed form, the file first needs to be removed from the deposits service
with [the `deposit_delete_file()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-delete-file-),
and then re-uploaded with either
[`deposit_upload_file()`](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-upload-file-)
or
[`deposit_update()`](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-update-).
### Editing and updating files
The [`deposit_update()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-update-)
will automatically update any files held on a remote deposits service if they
have been locally modified. If the local "datapackage.json" file has been
modified, any changes in the "metadata" section will be brought into the local
deposits client, and also translated to service-specific metadata, posted to
the service, and returned in updated "hostdata" of the client. The remote
version of that file will also be updated.
In short, [the `deposit_upload_file()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-upload-file-)
is only needed to initially upload files (or directories). Once files exist on
the remote deposits service, the [`deposit_update()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-update-)
can be used to automatically upload any modified files to the service.
## Publishing a deposit
The final steps of publishing a deposit, potentially along with an embargo
date, are described in [the main
vignette](https://docs.ropensci.org/deposits/articles/deposits.html), but
copied here for completeness.
Once all metadata and data have been satisfactorily edited, updated, and
uploaded, a deposit can be made publicly visible and permanently associated a
Digital Object Identifier (DOI) by publishing it. Prior to publishing, it is
often desired to apply an "embargo" to the deposit, in the form of a date after
which the deposit will become publicly visible. The two steps to publication
are thus generally:
```{r publish, eval = FALSE}
cli$deposit_embargo (embargo_date = "2030-03-30")
cli$deposit_publish ()
```
Calling [the `deposit_publish()`
method](https://docs.ropensci.org/deposits/reference/depositsClient.html#method-deposit-publish-)
is irreversible, and can never be undone. The published deposit will be
permanently associated with the account of the user who published it, as
identified by [the API token used to initiate the deposits
client](https://docs.ropensci.org/deposits/articles/install-setup.html#setup-api-tokens).
Publication will also change many items of the client's "hostdata", notably
involving a change of status or visibility from "private" to "public". Once a
deposit has been published, the associated DOI, or equivalent the URL given in
the deposits client, may be shared as a permanent link to the deposit.