---
title: "diseasystore: quick start guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{diseasystore: quick start guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
library(diseasystore)
```
```{r hidden_options, include = FALSE}
if (rlang::is_installed("withr")) {
  withr::local_options("tibble.print_min" = 5)
  withr::local_options("tibble.print_max" = 5)
  withr::local_options("diseasystore.verbose" = FALSE)
  withr::local_options("diseasystore.DiseasystoreGoogleCovid19.n_max" = 1000)
} else {
  opts <- options("tibble.print_min" = 5, "tibble.print_max" = 5, "diseasystore.verbose" = FALSE,
                  "diseasystore.DiseasystoreGoogleCovid19.n_max" = 1000)
}
# We have a "hard" dependency on RSQLite to render parts of this vignette
suggests_available <- rlang::is_installed("RSQLite")
not_on_cran <- interactive() || as.logical(Sys.getenv("NOT_CRAN", unset = "false"))
```
# Available diseasystores
To see the available `diseasystores` on your system, you can use the `available_diseasystores()` function.
```{r available_diseasystores}
available_diseasystores()
```
This function looks for `diseasystores` on the current search path.
By default, this will show the `diseasystores` bundled with the base package.
If you have [extended](extending-diseasystore.html) `diseasystore` with either your own `diseasystores` or those from an
external package, attaching that package to your search path makes them show up as available.
Note: `diseasystores` are found if they are defined within packages named `diseasystore*` and are of the class
`DiseasystoreBase`.
Each of these `diseasystores` may have their own vignette that further details their content, use and/or tips and tricks.
This is for example the case with [`DiseasystoreGoogleCovid19`](diseasystore-google-covid-19.html).

# Using a diseasystore

To use a `diseasystore` we need to first do some configuration.
The `diseasystores` are designed to work with data bases to store the computed features in.
Each `diseasystore` may require individual configuration as listed in its documentation or accompanying vignette.
For this quick start, we will configure a `DiseasystoreGoogleCovid19` to use a local `SQLite` data base.
Ideally, we want to use a faster, more capable data base to store the features in.
The `diseasystores` use `SCDB` in the back end and can use any data base back end supported by `SCDB`.
```{r google_setup_hidden, include = FALSE, eval = suggests_available && not_on_cran}
# The files we need are stored remotely in Google's API
google_files <- c("by-age.csv", "demographics.csv", "index.csv", "weather.csv")
remote_conn <- diseasyoption("remote_conn", "DiseasystoreGoogleCovid19")
# In practice, it is best to make a local copy of the data which is stored in the "vignette_data" folder
# This folder can either be in the package folder (preferred, please create the folder) or in the tempdir()
local_conn <- purrr::detect("vignette_data", checkmate::test_directory_exists, .default = tempdir())
# Then we download the first n rows of each data set of interest
try({
  purrr::discard(google_files, ~ checkmate::test_file_exists(file.path(local_conn, .))) |>
    purrr::walk(\(file) {
      paste0(remote_conn, file) |>
        readr::read_csv(n_max = 1000, show_col_types = FALSE, progress = FALSE) |>
        readr::write_csv(file.path(local_conn, file))
    })
})
# Check that the files are available after attempting to download
if (purrr::some(google_files, ~ !checkmate::test_file_exists(file.path(local_conn, .)))) {
  data_available <- FALSE
} else {
  data_available <- TRUE
}
ds <- DiseasystoreGoogleCovid19$new(target_conn = DBI::dbConnect(RSQLite::SQLite()),
                                    source_conn = local_conn,
                                    start_date = as.Date("2020-03-01"),
                                    end_date = as.Date("2020-03-15"))
```
```{r google_setup, eval = FALSE}
ds <- DiseasystoreGoogleCovid19$new(
  target_conn = DBI::dbConnect(RSQLite::SQLite()),
  start_date = as.Date("2020-03-01"),
  end_date = as.Date("2020-03-15")
)
```
When we create our new `diseasystore` instance, we also supply `start_date` and `end_date` arguments.
These are not strictly required, but make getting features for this time interval simpler.
Once configured, we can query the available features in the `diseasystore`:
```{r google_available_features, eval = not_on_cran && suggests_available && data_available}
ds$available_features
```
These features can be retrieved individually
(using the `start_date` and `end_date` we specified during creation of `ds`):
```{r google_get_feature_example_1, eval = not_on_cran && suggests_available && data_available}
ds$get_feature("n_hospital")
```
Notice that features have associated "key_*" and "valid_from/until" columns.
These are used for one of the primary selling points of `diseasystore`, namely [automatic aggregation](#automatic-aggregation).
To get features for other time intervals, we can manually supply `start_date` and/or `end_date`:
```{r google_get_feature_example_2, eval = not_on_cran && suggests_available && data_available}
ds$get_feature("n_hospital",
               start_date = as.Date("2020-03-01"),
               end_date = as.Date("2020-03-02"))
```
# Dynamically expanded
The `diseasystore` automatically expands the computed features.
Say the feature "n_hospital" has been computed between 2020-03-01 and 2020-03-15. In this case, the call
`$get_feature("n_hospital", start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-20"))` only needs to compute
the feature between 2020-03-16 and 2020-03-20.
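
The interval arithmetic behind this expansion can be sketched in base R.
The helper `missing_dates()` below is hypothetical and not part of `diseasystore`; it only illustrates, under the assumption of daily resolution, which dates remain to be computed when a request extends beyond an already computed interval.

```r
# Hypothetical helper (not part of diseasystore): which requested days are not yet computed?
missing_dates <- function(computed_start, computed_end, requested_start, requested_end) {
  computed  <- seq(computed_start, computed_end, by = "day")
  requested <- seq(requested_start, requested_end, by = "day")
  requested[!requested %in% computed]  # keep only the days outside the computed interval
}

to_compute <- missing_dates(
  computed_start  = as.Date("2020-03-01"),
  computed_end    = as.Date("2020-03-15"),
  requested_start = as.Date("2020-03-01"),
  requested_end   = as.Date("2020-03-20")
)

# First and last day that still need to be computed
range(to_compute)
```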

# Time versioned

Through using `{SCDB}` as the back end, the features are stored even as new data becomes available.
This way, we get a time-versioned record of the features provided by `diseasystore`.
Which version of the features is computed is controlled through the `slice_ts` argument.
By default, `diseasystores` use today's date for this argument.
The dynamic expansion of the features described above applies only within a given `slice_ts`.
That is, if a feature has been computed for a time interval on one `slice_ts`, `diseasystore` will recompute the feature
for any other `slice_ts`.
This way, feature computation can be run as part of continuous integration
(requesting features will preserve a history of computed features).
Furthermore, post-hoc analysis can be performed by computing features as they would have looked on previous dates.

# Automatic aggregation

The real strength of `diseasystore` comes from its built-in automatic aggregation.
We saw above that the features come with additional associated "key_*" and "valid_from/until" columns.
This additional information is used to do automatic aggregation through the `$key_join_features()` method
(see [extending-diseasystore](extending-diseasystore.html) for more details).
To use this method, you need to provide the `observable` that you want to aggregate and the `stratification` you want
to apply to the aggregation.
Let's start with a simple example where we request no stratification (`NULL`):
```{r google_key_join_features_example_1, eval = not_on_cran && suggests_available && data_available}
ds$key_join_features(observable = "n_hospital",
                     stratification = NULL)
```
This gives us the same feature information as `ds$get_feature("n_hospital")` but simplified to give the
observable per day (in this case, the number of people hospitalised).
To specify a level of `stratification`, we need to supply a list of `quosures`
(see `help("topic-quosure", package = "rlang")`).
```{r google_key_join_features_example_2, eval = not_on_cran && suggests_available && data_available}
ds$key_join_features(observable = "n_hospital",
                     stratification = rlang::quos(country_id))
```
The `stratification` argument is very flexible, so we can supply any valid R expression:
```{r google_key_join_features_example_3, eval = not_on_cran && suggests_available && data_available}
ds$key_join_features(observable = "n_hospital",
                     stratification = rlang::quos(country_id,
                                                  old = age_group == "90+"))
```
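
To build intuition for how "valid_from/until" columns can drive a per-day aggregation, here is a toy sketch in base R.
The data are made up, and the half-open interval convention (valid from and including `valid_from`, up to but excluding `valid_until`) is an assumption made for illustration; this is not how `$key_join_features()` is implemented.

```r
# Toy data (made up for illustration): one row per admission, valid on [valid_from, valid_until)
admissions <- data.frame(
  key_admission = 1:3,
  valid_from    = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02")),
  valid_until   = as.Date(c("2020-03-03", "2020-03-02", "2020-03-04"))
)

days <- seq(as.Date("2020-03-01"), as.Date("2020-03-03"), by = "day")

# Count the rows whose validity interval covers each day
n_hospital <- vapply(
  seq_along(days),
  function(i) sum(admissions$valid_from <= days[i] & days[i] < admissions$valid_until),
  integer(1)
)

data.frame(date = days, n_hospital = n_hospital)
```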
# Dropping computed features
Sometimes, it is necessary to clear the computed features from the data base.
For this purpose, we provide the `drop_diseasystore()` function.
By default, this deletes all stored features in the default `diseasystore` schema.
It also takes a `pattern` argument to match tables by and a `schema` argument to specify the schema to delete from[^1].
```{r drop_diseasystore_example_1, eval = not_on_cran && suggests_available && data_available}
SCDB::get_tables(ds$target_conn)
```
```{r drop_diseasystore_example_2, eval = not_on_cran && suggests_available && data_available}
drop_diseasystore(conn = ds$target_conn)
SCDB::get_tables(ds$target_conn)
```
# diseasystore options
`diseasystores` have a number of options available to make configuration easier.
These options all start with "diseasystore.".
```{r diseasyoption_list}
options()[purrr::keep(names(options()), ~ startsWith(., "diseasystore"))]
```
Notice that several options are set as empty strings (""). These are treated as `NULL` by `diseasystore`[^2].
Importantly, the options are _scoped_.
Consider the above options for "source_conn":
Looking at the list of options we find "diseasystore.source_conn" and "diseasystore.DiseasystoreGoogleCovid19.source_conn".
The former is a general setting while the latter is a specific setting for `DiseasystoreGoogleCovid19`.
The general setting is used as a fallback if no specific setting is found.
This allows you to set a general configuration to use and to override it for specific cases.
To get the option related to a scope, we can use the `diseasyoption()` function.
```{r diseasyoption_example_1}
diseasyoption("source_conn", class = "DiseasystoreGoogleCovid19")
```
As we saw in the options, a `source_conn` option was defined specifically for `DiseasystoreGoogleCovid19`.
If we try the same for the hypothetical `DiseasystoreDiseaseY`, we see that no value is defined as we have not yet
configured the fallback value.
```{r diseasyoption_example_2}
diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
```
If we change our general setting for `source_conn` and retry, we see that we get the fallback value.
```{r diseasyoption_example_3}
options("diseasystore.source_conn" = file.path("local", "path"))
diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
```
Finally, we can use the `.default` argument as a final fallback value in case no option is set for either the general or
the specific case.
```{r diseasyoption_example_4}
diseasyoption("non_existent", class = "DiseasystoreDiseaseY", .default = "final fallback")
```
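
The lookup order described in this section (specific setting, then general setting, then `.default`) can be sketched in base R.
The function `lookup_scoped_option()` and the "demo" option prefix below are hypothetical and chosen so that no real `diseasystore` options are touched; this is a minimal sketch of the idea, not the implementation of `diseasyoption()`.

```r
# Hypothetical helper mimicking the lookup order described above; not the package's implementation.
# The "demo" prefix is made up so that no real diseasystore options are modified.
lookup_scoped_option <- function(option, class = NULL, .default = NULL, prefix = "demo") {
  candidates <- c(
    if (!is.null(class)) paste(prefix, class, option, sep = "."),  # specific setting first
    paste(prefix, option, sep = ".")                               # then the general setting
  )
  for (name in candidates) {
    value <- getOption(name)
    # Unset options and empty strings both count as "no value"
    if (!is.null(value) && !identical(value, "")) return(value)
  }
  .default
}

old <- options(
  "demo.source_conn"          = "general/path",
  "demo.DiseaseX.source_conn" = "specific/path",
  "demo.DiseaseY.source_conn" = ""
)

specific <- lookup_scoped_option("source_conn", class = "DiseaseX")
general  <- lookup_scoped_option("source_conn", class = "DiseaseY")
fallback <- lookup_scoped_option("non_existent", class = "DiseaseY", .default = "final fallback")

options(old)  # restore the previous option state
```

Here `specific` picks up the class-level value, `general` falls back to the general value because the class-level option is an empty string, and `fallback` ends at `.default` because neither option exists.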
[^1]: If using `SQLite` as the back end, it will instead prepend the schema specification to the pattern before matching (e.g. "ds\\..*").
[^2]: R's `options()` does not allow setting an option to `NULL`. By setting options as empty strings, the user can see the available options to set.
```{r cleanup, include = FALSE}
if (exists("ds")) rm(ds)
gc()
if (!rlang::is_installed("withr")) {
  options(opts)
}
```