/
DBI-arrow.Rmd
197 lines (144 loc) · 6.63 KB
/
DBI-arrow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
---
title: "Using DBI with Arrow"
author: "Kirill Müller"
date: "25/12/2023"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using DBI with Arrow}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(error = Sys.getenv("IN_PKGDOWN") != "true" || (getRversion() < "3.5"))
knit_print.data.frame <- function(x, ...) {
print(head(x, 6))
if (nrow(x) > 6) {
cat("Showing 6 out of", nrow(x), "rows.\n")
}
invisible(x)
}
registerS3method("knit_print", "data.frame", "knit_print.data.frame")
```
## Who this tutorial is for
This tutorial is for you if you want to leverage [Apache Arrow](https://arrow.apache.org/) for accessing and manipulating data on databases.
See `vignette("DBI", package = "DBI")` and `vignette("DBI", package = "DBI-advanced")` for tutorials on accessing data using R's data frames instead of Arrow's structures.
## Rationale
Apache Arrow is
> a cross-language development platform for in-memory analytics,
suitable for large and huge data, with support for out-of-memory operation.
Arrow is also a data exchange format, the data types covered by Arrow are a superset of the data types supported by SQL databases.
DBI 1.2.0 introduced support for Arrow as a format for exchanging data between R and databases.
The aim is to:
- accelerate data retrieval and loading, by using fewer costly data conversions
- better support reading and summarizing data from a database that is larger than memory
- provide better type fidelity with workflows centered around Arrow
This allows existing code to be used with Arrow, and it allows new code to be written that is more efficient and more flexible than code that uses R's data frames.
The interface is built around the {nanoarrow} R package, with `nanoarrow::as_nanoarrow_array` and `nanoarrow::as_nanoarrow_array_stream` as fundamental data structures.
## New classes and generics
DBI 1.2.0 introduces new classes and generics for working with Arrow data:
- `dbReadTableArrow()`
- `dbWriteTableArrow()`
- `dbCreateTableArrow()`
- `dbAppendTableArrow()`
- `dbGetQueryArrow()`
- `dbSendQueryArrow()`
- `dbBindArrow()`
- `dbFetchArrow()`
- `dbFetchArrowChunk()`
- `DBIResultArrow`
- `DBIResultArrowDefault`
Compatibility is important for DBI, and implementing new generics and classes greatly reduces the risk of breaking existing code.
The DBI package comes with a fully functional fallback implementation for all existing DBI backends.
The fallback is not improving performance, but it allows existing code to be used with Arrow before switching to a backend with native Arrow support.
Backends with native support, like the [adbi](https://adbi.r-dbi.org/) package, implement the new generics and classes for direct support and improved performance.
In the remainder of this tutorial, we will demonstrate the new generics and classes using the RSQLite package.
SQLite is an in-memory database, this code does not need a database server to be installed and running.
## Prepare
We start by setting up a database connection and creating a table with some data, using the original `dbWriteTable()` method.
```{r}
library(DBI)
con <- dbConnect(RSQLite::SQLite())
data <- data.frame(
a = 1:3,
b = 4.5,
c = "five"
)
dbWriteTable(con, "tbl", data)
```
## Read all rows from a table
The `dbReadTableArrow()` method reads all rows from a table into an Arrow stream, similarly to `dbReadTable()`.
Arrow objects implement the `as.data.frame()` method, so we can convert the stream to a data frame.
```{r}
dbReadTableArrow(con, "tbl")
stream <- dbReadTableArrow(con, "tbl")
stream
as.data.frame(stream)
```
## Run queries
The `dbGetQueryArrow()` method runs a query and returns the result as an Arrow stream.
This stream can be turned into an `arrow::RecordBatchReader` object and processed further, without bringing it into R.
```{r}
stream <- dbGetQueryArrow(con, "SELECT COUNT(*) AS n FROM tbl WHERE a < 3")
stream
path <- tempfile(fileext = ".parquet")
arrow::write_parquet(arrow::as_record_batch_reader(stream), path)
arrow::read_parquet(path)
```
## Prepared queries
The `dbGetQueryArrow()` method supports prepared queries, using the `params` argument which accepts a data frame or a list.
```{r}
params <- data.frame(a = 3L)
stream <- dbGetQueryArrow(con, "SELECT $a AS batch, * FROM tbl WHERE a < $a", params = params)
as.data.frame(stream)
params <- data.frame(a = c(2L, 4L))
# Equivalent to dbBind()
stream <- dbGetQueryArrow(con, "SELECT $a AS batch, * FROM tbl WHERE a < $a", params = params)
as.data.frame(stream)
```
## Manual flow
For the manual flow, use `dbSendQueryArrow()` to send a query to the database, and `dbFetchArrow()` to fetch the result.
This also allows using the new `dbBindArrow()` method to bind data in Arrow format to a prepared query.
Result objects must be cleared with `dbClearResult()`.
```{r}
rs <- dbSendQueryArrow(con, "SELECT $a AS batch, * FROM tbl WHERE a < $a")
in_arrow <- nanoarrow::as_nanoarrow_array(data.frame(a = 1L))
dbBindArrow(rs, in_arrow)
as.data.frame(dbFetchArrow(rs))
in_arrow <- nanoarrow::as_nanoarrow_array(data.frame(a = 2L))
dbBindArrow(rs, in_arrow)
as.data.frame(dbFetchArrow(rs))
in_arrow <- nanoarrow::as_nanoarrow_array(data.frame(a = 3L))
dbBindArrow(rs, in_arrow)
as.data.frame(dbFetchArrow(rs))
in_arrow <- nanoarrow::as_nanoarrow_array(data.frame(a = 1:4L))
dbBindArrow(rs, in_arrow)
as.data.frame(dbFetchArrow(rs))
dbClearResult(rs)
```
## Writing data
Streams returned by `dbGetQueryArrow()` and `dbReadTableArrow()` can be written to a table using `dbWriteTableArrow()`.
```{r}
stream <- dbGetQueryArrow(con, "SELECT * FROM tbl WHERE a < 3")
dbWriteTableArrow(con, "tbl_new", stream)
dbReadTable(con, "tbl_new")
```
## Appending data
For more control over the writing process, use `dbCreateTableArrow()` and `dbAppendTableArrow()`.
```{r}
stream <- dbGetQueryArrow(con, "SELECT * FROM tbl WHERE a < 3")
dbCreateTableArrow(con, "tbl_split", stream)
dbAppendTableArrow(con, "tbl_split", stream)
stream <- dbGetQueryArrow(con, "SELECT * FROM tbl WHERE a >= 3")
dbAppendTableArrow(con, "tbl_split", stream)
dbReadTable(con, "tbl_split")
```
## Conclusion
Do not forget to disconnect from the database when done.
```{r}
dbDisconnect(con)
```
That concludes the major features of DBI's new Arrow interface.
For more details on the library functions covered in this tutorial see the DBI specification at `vignette("spec", package = "DBI")`.
See the [adbi](https://adbi.r-dbi.org/) package for a backend with native Arrow support, and [nanoarrow](https://github.com/apache/arrow-nanoarrow) and [arrow](https://arrow.apache.org/docs/r/) for packages to work with the Arrow format.