/
2018-01-25-nodbi.Rmd
180 lines (117 loc) · 6.76 KB
/
2018-01-25-nodbi.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
slug: nodbi
title: 'nodbi: the NoSQL Database Connector'
date: '2018-01-25'
author: Scott Chamberlain
topicid: ~
tags:
- databases
- couchdb
- mongo
- elasticsearch
- redis
- etcd
- nodbi
---
```{r echo=FALSE}
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
warning = FALSE,
message = FALSE
)
```
## DBI
What is DBI? [DBI][] is an R package. It defines an interface to relational database management systems (R/DBMS) that other R packages build upon to interact with a specific relational database, such as [SQLite][] or [PostgreSQL][].
## NoSQL
[NoSQL databases](http://nosql-database.org/) are a very broad class of database that can include document databases such as CouchDB and MongoDB, key-value stores such as Redis, and more. They are generally not row-column relational stores though, though can include that. NoSQL is often thought of now as "not only SQL".
You can imagine how it is relatively straight-forward to create a common interace to row-column oriented databases, and DBI is great for that.
However, a common interface to NoSQL datbases is a bit harder to wrap your head around for various reasons. One of the most obvious of which is that they don't share a fundamental data structure like R/DMBS do.
## A sort of DBI for NoSQL
[nodbi][] (the `no` for "N" and "O" in "NoSQL") was started nearly 3 years ago during the [2015 rOpenSci Unconference](https://github.com/ropensci/unconf15) on [26 March](https://github.com/ropensci/nodbi/commits/29340320b28f13e942ae177886d3d6b87366c4c2) together with [Rich FitzJohn](https://github.com/richfitz) and [Jeroen Ooms](https://github.com/jeroen). We've worked on it here and there, but it hasn't seen any commits in nearly a year.
Given the success of [dplyr][] it goes to reason that something that can abstract away the details for NoSQL databases - for a subset of the most common operations - could prove to be very useful.
nodbi takes a different approach than DBI (thus the "sort of" in this section title) in that instead of package X for database Y importing DBI and extending its methods, nodbi provides the connection objects to various databases, and a set of common methods for interacting with those databases. Whereas DBI doesn't import or depend in any way on packages for individual databases, nodbi has packages for individual databases in Suggests. That is to say, nodbi can be installed and loaded without any specific database package, but to use any one database (e.g., Mongo) you need to install the package for that database (i.e., [mongolite][]). Because database packages in nodbi are in Suggests, you can pick and choose the packages you want to have installed - and forget about the others.
Check out the package [on GitHub][nodbi].
## data.frame's
Thus far, we've built nodbi around data.frame's. That is, we're focusing on the data.frame use case as it's very common that R users are dealing solely with data.frame's in their analysis pipelines.
Let us know if you have a strong need for lists or vectors, etc. in nodbi.
## supported databases
Connection functions to each database follow the pattern `src_` following the pattern used by `dplyr`
- MongoDB: `src_mongo`
- Redis: `src_redis`
- CouchDB: `src_couchdb`
- Elasticsvearch: `src_elastic`
- Etcd: `src_etcd`
The details of connecting to different databases vary as you'd imagine, but what's returned from `src_` functions can be passed to any of the below methods (for the most part).
## supported methods
- create (`docdb_create`): Create an object
- get (`docdb_get`): Get an object
- delete (`docdb_delete`): Delete an object
- update (`docdb_update`): Update an object. CouchDB only for now
Operations in nodbi follow a key-value pattern. That is, an object (read: data.frame) is stored in your choice of database with a key (read: character string). Not every operation above needs both the key and value - for example, deleting an object only requires the key.
Internally, a key maps differently to Redis vs. Mongo vs. CouchDB, etc. But the point is that we're abstracting away the details so that the user can focus on getting things done faster.
## Setup
You'll need to start up the database you want to work with if it's not started already. Let's work with Mongo for this blog post.
Check out the [mongolite book](https://jeroen.github.io/mongolite/) for details on installing MongoDB
Start up mongo on your cli (i.e., shell or terminal)
```
mongod # or sudo mongod
```
Install database packages
```{r eval=FALSE}
install.packages(c("sofa", "etseed", "elastic", "mongolite", "redux"))
```
Install nodbi
```{r}
devtools::install_github("ropensci/nodbi")
```
```{r}
library(nodbi)
```
## Examples
Connect to Mongo with `src_mongo`
```{r}
(src <- src_mongo(verbose = FALSE))
```
With the connection object you can access the name of the Mongo database
```{r}
src$db
```
And the connection `mongolite` connection object
```{r}
src$con
```
Insert a data.frame (the returned object is from `mongolite`) using `docdb_create`
```{r}
library("ggplot2")
x <- docdb_create(src = src, key = "diamonds", value = diamonds)
x
```
Get the data.frame back from Mongo using `docdb_get`
```{r}
head(docdb_get(src, "diamonds"))
```
Delete the object using `docdb_delete`
```{r}
docdb_delete(src, "diamonds")
```
The same pattern as above can be used for any of the other database options, except the connecting with `src_*` differs among databases.
<br><br>
## similar art
The one that comes closest is [storr][] by [Rich FitzJohn](https://github.com/richfitz) - which shares Redis in common with nodbi as a backend. storr seems to be more focused on the caching use case - whereas, nodbi focuses on user X has a specific database they keep their data in and they'd like to not worry about the details of working with that database. Perhaps the distinction isn't really there, I'm not sure.
## Todo
- We plan to add support for more databases, let us know what you want supported
- We may add more operations in addition to create/get/delete/update. What things do you want nodbi to do?
- [Fetching partial data.frame's](https://github.com/ropensci/nodbi/issues/10)
- Speed ups of course, in due time
- Eventually we'll push a version to CRAN - we're hoping to get more users kicking the tires before that.
## Feedback!
We'd love to get some eyes on this; to sort out problems that will no doubt arise from real world scenarios; to flesh out new use cases we hadn't thought of, etc.
Open an issue: <https://github.com/ropensci/nodbi/issues/new>.
[DBI]: https://cran.r-project.org/web/packages/DBI/
[nodbi]: https://github.com/ropensci/nodbi
[SQLite]: https://sqlite.org/
[PostgreSQL]: https://www.postgresql.org/
[dplyr]: https://cran.r-project.org/web/packages/dplyr/
[mongolite]: https://cran.r-project.org/web/packages/mongolite/
[storr]: https://github.com/richfitz/storr