/
nodbi-overview.Rmd
243 lines (173 loc) · 8.07 KB
/
nodbi-overview.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
title: "nodbi - package overview"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{nodbi - package overview}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Summary
R package `nodbi` provides a single interface for several NoSQL databases and SQL databases with JSON functionality, with the same function parameters and return values across all of:
- MongoDB
- SQLite
- PostgreSQL
- DuckDB
- Elasticsearch
- CouchDB
Package `nodbi` has been designed to use any specific SQL functions a database may have for `JSON` and has added functionality tested for performance to enable switching databases without changing user code.
```{r setup}
library(nodbi)
```
## Functionality
### Connect
First, a connection to a database is opened. In the example, no additional parameters are used such as database file or server; see the help page for the respective database.
"Container" is used as term to indicate where conceptually the database holds the data (e.g. a collection in MongoDB, a table in DuckDB). The `key` parameter of `nodbi` functions holds the name of the relevant container.
```{r}
# name of container
key <- "my_container"
# nodbi can connect any of these databases
if (FALSE) {
src <- src_duckdb()
src <- src_mongo(collection = key)
src <- src_sqlite()
src <- src_postgres()
src <- src_elastic()
src <- src_couchdb(
user = Sys.getenv("COUCHDB_TEST_USER"),
pwd = Sys.getenv("COUCHDB_TEST_PWD")
)
}
# this example is run with
src <- src_sqlite()
# note additional parameters can be specified,
# for example for local or remote MongoDb:
help("src_mongo")
```
### docdb_create
Create a container if it does not yet exist and fill with `value`. The return value is the number of created documents. "Documents" refers to the rows in a data frame such as `mtcars`, or the number of `NDJSON` lines, or the number of list items, or the number of objects in an `JSON` array.
The parameter `value` in any `nodbi` function can take a data frame, a list, a JSON string, or a file name or URL pointing to `NDJSON`.
```{r}
# check if container already exists
docdb_exists(src = src, key = key)
# load data from a data frame with row names into
# the container specified in "key" parameter
docdb_create(src = src, key = key, value = mtcars)
# load additionally 98 NDJSON records
docdb_create(src, key, "https://httpbin.org/stream/98")
# load additionally mapdata as list
docdb_create(src, key, jsonlite::fromJSON(mapdata, simplifyVector = FALSE))
# show JSON structure of contacts
jsonlite::minify(contacts)
# load additionally contacts JSON data
docdb_create(src, key, contacts)
```
Check and list any other containers exist in the database:
```{r}
docdb_list(src = src)
```
### Identifiers
The unique document identifier is its `_id`, corresponding to a primary index with a constraint to be unique in SQL databases.
The `_id`'s of an input `value` are either the row names of a data frame (such as `mtcars`) or top-level elements with the name `_id` such as in `contacts` shown just above.
Thus, expect a warning when trying to create documents with `_id`'s that already exist in the container.
The return value can be `0` when no documents could newly be created, or the number of the subset of documents in `value` that did not yet exist and could newly be created.
```{r}
# zero new documents created
docdb_create(src, key, value = mtcars)
```
For updating existing documents, see below function `docdb_update()`.
### docdb_get
All documents in a container can now be retrieved with `docdb_get()`.
```{r}
# load library for more
# readable print output
library(tibble)
# get all documents, irrespective of schema
as_tibble(docdb_get(src, key))
# get just 2 documents using limit and note that
# only fields for these documents are returned
as_tibble(docdb_get(src, key, limit = 2L))
```
### docdb_query
One of the most powerful functions of `nodbi` is `docdb_query()` because it permits to combine a query to select documents and to filter for fields of interest.
```{r}
# query for some documents
docdb_query(src = src, key = key, query = '{"mpg": {"$gte": 30}}')
```
Both parameters `query` (obligatory) and `fields` (optional) use, across all databases, MongoDB syntax such as documented for [queries](https://www.mongodb.com/docs/manual/crud/) and [fields](https://www.mongodb.com/docs/manual/tutorial/project-fields-from-query-results/).
```{r}
# query some fields from some documents; 'query' is a mandatory
# parameter and is used here in its position in the signature
docdb_query(src, key, '{"mpg": {"$gte": 30}}', fields = '{"wt": 1, "mpg": 1}')
```
Unless `fields` specifies `"_id": 0`, the `_id` field is always included in the output of `docdb_query()`.
```{r}
# query some fields from some documents, limit return to one document
docdb_query(src, key, '{"mpg": {"$gte": 30}}', fields = '{"_id": 0, "mpg": 1}', limit = 1L)
```
Queries can be more complex such as in this example, showing a dot notation of a sub-field and an example operator (regular expression).
```{r}
# query some subitem fields from some documents
str(docdb_query(
src, key,
query = '{"$or": [{"age": {"$gt": 21}},
{"friends.name": {"$regex": "^B[a-z]{3,9}.*"}}]}',
fields = '{"age": 1, "friends.name": 1}'
))
```
Queries work across documents of different structure such as here.
```{r}
# query with results across documents
docdb_query(
src, key,
query = '{"$or": [{"age": {"$gt": 21}}, {"mpg": {"$gte": 30}}]}',
fields = '{"name": 1, "disp": 1}'
)
```
### Field names
The `JSON` data handled by package `nodbi` may have a large number of field included nested fields in objects (see for example `name` within array `friends` above). Thus, an argument is provided for `docdb_query()` so that the function returns only the comprehensive list of all field names in documents selected with a query (or in all documents in the container if `query = "{}"` is specified).
```{r}
docdb_query(src, key, query = '{"_id": {"$regex": "^[0-9]"}}', listfields = TRUE)
```
The dot notation is a path from a root field to the nested field, and this notation can be used in `query` and `fields` parameters of `docdb_query()`.
### docdb_update
Queries can also be used for updating (patching) selected documents with a new `value`. The return value of `docdb_update()` corresponds to the number of documents that were updated.
This is another powerful function because `value` can come from a data frame, a list, a JSON string, or a file name or URL pointing to `NDJSON`, and if `value` includes row names or `_id`'s, these are used to identify the documents to be updated.
```{r}
# number of documents corresponding to query
nrow(docdb_query(src, key, query = '{"carb": 3}'))
# update all documents using JSON, replacing the previously existing values
docdb_update(src, key, value = '{"vs": 9, "xy": [1, 2]}', query = '{"carb": 3}')
# update with value that includes _id's
docdb_update(src, key, value = '{"_id": "Merc 450SLC", "xy": 33}', query = "{}")
# show updated values
docdb_query(src, key, query = '{"carb": 3}', fields = '{"xy": 1}')
```
### docdb_delete
Documents and containers can be deleted with `docdb_delete()`. Its return value corresponds to the success of the delete operation.
```{r}
# number of documents corresponding to query
nrow(docdb_query(src, key, query = '{"age": {"$lte": 23}}'))
# to delete selected documents, specify a query parameter
docdb_delete(src, key, query = '{"age": {"$lte": 23}}')
# this deletes the complete container from database
docdb_delete(src, key)
# check if still exists
docdb_exists(src, key)
```
### Disconnect and shutdown
Package `nodbi` includes an automatic mechanism for shutting down, at the time of `quit()` or session restart, those databases that require it (SQLite, DuckDB, PostgreSQL).
Nevertheless, it is good practice to manually disconnect and shut down connections as specific to the database, for example for SQLite:
```{r}
src
# shutdown
DBI::dbDisconnect(src$con, shutdown = TRUE)
rm(src)
```