vignettes/diseasystore.Rmd

---
title: "diseasystore: quick start guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{diseasystore: quick start guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(diseasystore)
```

```{r hidden_options, include = FALSE}
if (rlang::is_installed("withr")) {
  withr::local_options("tibble.print_min" = 5)
  withr::local_options("tibble.print_max" = 5)
  withr::local_options("diseasystore.verbose" = FALSE)
  withr::local_options("diseasystore.DiseasystoreGoogleCovid19.n_max" = 1000)
} else {
  opts <- options("tibble.print_min" = 5, "tibble.print_max" = 5, "diseasystore.verbose" = FALSE,
                  "diseasystore.DiseasystoreGoogleCovid19.n_max" = 1000)
}

# We have a "hard" dependency for RSQLite to render parts of this vignette
suggests_available <- rlang::is_installed("RSQLite")
not_on_cran <- interactive() || as.logical(Sys.getenv("NOT_CRAN", unset = "false"))
```

# Available diseasystores
To see the available `diseasystores` on your system, you can use the `available_diseasystores()` function.
```{r available_diseasystores}
available_diseasystores()
```
This function looks for `diseasystores` on the current search path.
By default, this will show the `diseasystores` bundled with the base package.
If you have [extended](extending-diseasystore.html) `diseasystore` with either your own `diseasystores` or those from an
external package, then attaching that package to your search path will make them show up as available.

Note: `diseasystores` are found if they are defined within packages named `diseasystore*` and are of the class
`DiseasystoreBase`.

Each of these `diseasystores` may have its own vignette that further details its content, use and/or tips and tricks.
This is for example the case with [`DiseasystoreGoogleCovid19`](diseasystore-google-covid-19.html).

# Using a diseasystore
To use a `diseasystore` we need to first do some configuration.
The `diseasystores` are designed to work with data bases to store the computed features in.
Each `diseasystore` may require individual configuration as listed in its documentation or accompanying vignette.

For this quick start, we will configure a `DiseasystoreGoogleCovid19` to use a local `SQLite` data base.
In practice, you would ideally use a faster, more capable data base to store the features in.
The `diseasystores` use `SCDB` in the back end and can use any data base back end supported by `SCDB`.

```{r google_setup_hidden, include = FALSE, eval = suggests_available && not_on_cran}
# The files we need are stored remotely in Google's API
google_files <- c("by-age.csv", "demographics.csv", "index.csv", "weather.csv")
remote_conn <- diseasyoption("remote_conn", "DiseasystoreGoogleCovid19")

# In practice, it is best to make a local copy of the data which is stored in the "vignette_data" folder
# This folder can either be in the package folder (preferred, please create the folder) or in the tempdir()
local_conn <- purrr::detect("vignette_data", checkmate::test_directory_exists, .default = tempdir())

# Then we download the first n rows of each data set of interest
try({
  purrr::discard(google_files, ~ checkmate::test_file_exists(file.path(local_conn, .))) |>
    purrr::walk(\(file) {
      paste0(remote_conn, file) |>
        readr::read_csv(n_max = 1000, show_col_types = FALSE, progress = FALSE) |>
        readr::write_csv(file.path(local_conn, file))
    })
})

# Check that the files are available after attempting to download
data_available <- purrr::every(google_files, ~ checkmate::test_file_exists(file.path(local_conn, .)))

ds <- DiseasystoreGoogleCovid19$new(target_conn = DBI::dbConnect(RSQLite::SQLite()),
                                    source_conn = local_conn,
                                    start_date = as.Date("2020-03-01"),
                                    end_date = as.Date("2020-03-15"))
```

```{r google_setup, eval = FALSE}
ds <- DiseasystoreGoogleCovid19$new(
  target_conn = DBI::dbConnect(RSQLite::SQLite()),
  start_date = as.Date("2020-03-01"),
  end_date = as.Date("2020-03-15")
)
```
When we create our new `diseasystore` instance, we also supply `start_date` and `end_date` arguments.
These are not strictly required, but make getting features for this time interval simpler.

Once configured, we can query the available features in the `diseasystore`:
```{r google_available_features, eval = not_on_cran && suggests_available && data_available}
ds$available_features
```

These features can be retrieved individually
(using the `start_date` and `end_date` we specified during creation of `ds`):
```{r google_get_feature_example_1, eval = not_on_cran && suggests_available && data_available}
ds$get_feature("n_hospital")
```
Notice that features have associated "key_*" and "valid_from/until" columns.
These are used for one of the primary selling points of `diseasystore`, namely [automatic aggregation](#automatic-aggregation).

To get features for other time intervals, we can manually supply `start_date` and/or `end_date`:
```{r google_get_feature_example_2, eval = not_on_cran && suggests_available && data_available}
ds$get_feature("n_hospital",
               start_date = as.Date("2020-03-01"),
               end_date = as.Date("2020-03-02"))
```

# Dynamically expanded
The `diseasystore` automatically expands the computed features.

Say the feature "n_hospital" has been computed between 2020-03-01 and 2020-03-15. In this case, the call
`$get_feature("n_hospital", start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-20"))` only needs to compute
the feature between 2020-03-16 and 2020-03-20.
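
The principle can be sketched in base R as a set difference between the requested dates and the dates already computed.
This is only an illustration of the idea, not how `diseasystore` implements it; the helper `missing_dates()` is hypothetical.

```r
# Hypothetical helper: which dates in the requested interval still need computing?
missing_dates <- function(start_date, end_date, computed) {
  requested <- seq(start_date, end_date, by = "day")
  requested[!(requested %in% computed)]
}

# Dates covered by an earlier request
computed <- seq(as.Date("2020-03-01"), as.Date("2020-03-15"), by = "day")

# Only the days after 2020-03-15 remain to be computed
to_compute <- missing_dates(as.Date("2020-03-01"), as.Date("2020-03-20"), computed)
range(to_compute)
```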

# Time versioned
Through using `{SCDB}` as the back end, the features are stored even as new data becomes available.
This way, we get a time-versioned record of the features provided by `diseasystore`.

Which version of the features is computed is controlled through the `slice_ts` argument.
By default, `diseasystores` use today's date for this argument.

The dynamic expansion of the features described above applies only within a given `slice_ts`.
That is, if a feature has been computed for a time interval on one `slice_ts`, `diseasystore` will recompute the feature
for any other `slice_ts`.

This way, feature computation can be integrated into continuous integration
(requesting features will preserve a history of computed features).
Furthermore, post-hoc analysis can be performed by computing features as they would have looked on previous dates.
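
Conceptually, each stored record carries the time span during which it was the current version, and a `slice_ts`
selects the records that were current at that time.
The toy example below mimics this in base R.
The column names `from_ts` and `until_ts` are assumed to follow the `SCDB` convention; the data and the helper `as_of()`
are made up for illustration.

```r
# Toy history: the count for 2020-03-01 was revised on 2020-03-10
history <- data.frame(
  date       = as.Date(c("2020-03-01", "2020-03-01")),
  n_hospital = c(10, 12),
  from_ts    = as.Date(c("2020-03-02", "2020-03-10")),
  until_ts   = as.Date(c("2020-03-10", NA))  # NA marks the currently valid record
)

# Hypothetical helper: records that were current at a given slice_ts
as_of <- function(data, slice_ts) {
  current <- data$from_ts <= slice_ts & (is.na(data$until_ts) | slice_ts < data$until_ts)
  data[current, , drop = FALSE]
}

before_revision <- as_of(history, as.Date("2020-03-05"))  # the record as first reported
after_revision  <- as_of(history, as.Date("2020-03-15"))  # the record after the revision
```

Both versions remain retrievable, which is what makes post-hoc analysis on earlier dates possible.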

# Automatic aggregation
The real strength of `diseasystore` comes from its built-in automatic aggregation.

We saw above that the features come with additional associated "key_*" and "valid_from/until" columns.

This additional information is used to do automatic aggregation through the `$key_join_features()` method
(see [extending-diseasystore](extending-diseasystore.html) for more details).

To use this method, you need to provide the `observable` that you want to aggregate and the `stratification` you want
to apply to the aggregation.

Let's start with a simple example where we request no stratification (`NULL`):
```{r google_key_join_features_example_1, eval = not_on_cran && suggests_available && data_available}
ds$key_join_features(observable = "n_hospital",
                     stratification = NULL)
```

This gives us the same feature information as `ds$get_feature("n_hospital")` but simplified to give the
observable per day (in this case, the number of people hospitalised).

To specify a level of `stratification`, we need to supply a list of `quosures`
(see `help("topic-quosure", package = "rlang")`).
```{r google_key_join_features_example_2, eval = not_on_cran && suggests_available && data_available}
ds$key_join_features(observable = "n_hospital",
                     stratification = rlang::quos(country_id))
```

The `stratification` argument is very flexible, so we can supply any valid R expression:
```{r google_key_join_features_example_3, eval = not_on_cran && suggests_available && data_available}
ds$key_join_features(observable = "n_hospital",
                     stratification = rlang::quos(country_id,
                                                  old = age_group == "90+"))
```
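
To give a feel for what such an aggregation involves, the base R sketch below expands toy records to the days they are
valid on and sums the observable per day and stratum.
It is a conceptual illustration only, not the method used by `$key_join_features()`: the data are made up, and
`valid_until` is assumed to be exclusive in this toy example.

```r
# Toy feature data: one row per key and validity interval
toy_feature <- data.frame(
  key_location = c("DK", "DK", "SE"),
  age_group    = c("0-89", "90+", "90+"),
  valid_from   = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02")),
  valid_until  = as.Date(c("2020-03-03", "2020-03-02", "2020-03-03")),
  n_hospital   = c(5, 2, 1)
)

# Expand each record to the days it is valid on and evaluate the stratification
daily <- do.call(rbind, lapply(seq_len(nrow(toy_feature)), function(i) {
  record <- toy_feature[i, ]
  data.frame(
    date       = seq(record$valid_from, record$valid_until - 1, by = "day"),
    old        = record$age_group == "90+",
    n_hospital = record$n_hospital
  )
}))

# Sum the observable per day and stratum
result <- aggregate(n_hospital ~ date + old, data = daily, FUN = sum)
result
```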

# Dropping computed features
Sometimes, it is necessary to clear the computed features from the data base.
For this purpose, we provide the `drop_diseasystore()` function.

By default, this deletes all stored features in the default `diseasystore` schema.
The function also accepts a `pattern` argument to match tables by and a `schema` argument to specify the schema to delete from[^1].

```{r drop_diseasystore_example_1, eval = not_on_cran && suggests_available && data_available}
SCDB::get_tables(ds$target_conn)
```

```{r drop_diseasystore_example_2, eval = not_on_cran && suggests_available && data_available}
drop_diseasystore(conn = ds$target_conn)

SCDB::get_tables(ds$target_conn)
```

# diseasystore options
`diseasystores` have a number of options available to make configuration easier.
These options all start with "diseasystore.".

```{r diseasyoption_list}
options()[purrr::keep(names(options()), ~ startsWith(., "diseasystore"))]
```
Notice that several options are set as empty strings (""). These are treated as `NULL` by `diseasystore`[^2].

Importantly, the options are _scoped_.
Consider the above options for "source_conn".
Looking at the list of options, we find "diseasystore.source_conn" and "diseasystore.DiseasystoreGoogleCovid19.source_conn".
The former is a general setting while the latter is a specific setting for `DiseasystoreGoogleCovid19`.
The general setting is used as a fallback if no specific setting is found.

This allows you to set a general configuration to use and to overwrite it for specific cases.

To get the option related to a scope, we can use the `diseasyoption()` function.
```{r diseasyoption_example_1}
diseasyoption("source_conn", class = "DiseasystoreGoogleCovid19")
```
As we saw in the options, a `source_conn` option was defined specifically for `DiseasystoreGoogleCovid19`.

If we try the same for the hypothetical `DiseasystoreDiseaseY`, we see that no value is defined as we have not yet
configured the fallback value.
```{r diseasyoption_example_2}
diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
```

If we change our general setting for `source_conn` and retry, we see that we get the fallback value.
```{r diseasyoption_example_3}
options("diseasystore.source_conn" = file.path("local", "path"))
diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
```

Finally, we can use the `.default` argument as a final fallback value in case no option is set for either the general or
the specific case.
```{r diseasyoption_example_4}
diseasyoption("non_existent", class = "DiseasystoreDiseaseY", .default = "final fallback")
```
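
The lookup order described in this section can be summarised with a small base R sketch.
The function `lookup_scoped_option()` and the option name "toy_option" are hypothetical and only mirror the behaviour
described above (specific setting, then general setting, then `.default`, with empty strings treated as unset);
this is not the implementation of `diseasyoption()`.

```r
# Hypothetical re-implementation of a scoped lookup, for illustration only
lookup_scoped_option <- function(option, class = NULL, .default = NULL) {
  candidates <- c(
    if (!is.null(class)) paste("diseasystore", class, option, sep = "."),  # specific setting
    paste("diseasystore", option, sep = ".")                               # general setting
  )
  for (name in candidates) {
    value <- getOption(name)
    # Unset options and empty strings both count as "not configured"
    if (!is.null(value) && !identical(value, "")) return(value)
  }
  .default
}

# Only the general setting is configured, so it is used as the fallback
options("diseasystore.toy_option" = "general value")
general <- lookup_scoped_option("toy_option", class = "DiseasystoreDiseaseY")

# A specific setting takes precedence over the general one
options("diseasystore.DiseasystoreDiseaseY.toy_option" = "specific value")
specific <- lookup_scoped_option("toy_option", class = "DiseasystoreDiseaseY")

# Nothing is configured for this option, so .default is returned
fallback <- lookup_scoped_option("non_existent", class = "DiseasystoreDiseaseY", .default = "final fallback")
```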


[^1]: If using `SQLite` as the back end, it will instead prepend the schema specification to the pattern before matching (e.g. "ds\\..*").
[^2]: R's `options()` does not allow setting an option to `NULL`. By setting options as empty strings, the user can see the available options to set.

```{r cleanup, include = FALSE}
if (exists("ds")) rm(ds)
gc()
if (!rlang::is_installed("withr")) {
  options(opts)
}
```