---
title: "Using DBI with Arrow"
author: "Kirill Müller"
date: "29/09/2022"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using DBI with Arrow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(error = Sys.getenv("IN_PKGDOWN") != "true" || (getRversion() < "3.5"))
knit_print.data.frame <- function(x, ...) {
  print(head(x, 6))
  if (nrow(x) > 6) {
    cat("Showing 6 out of", nrow(x), "rows.\n")
  }
  invisible(x)
}
registerS3method("knit_print", "data.frame", "knit_print.data.frame")
```
## Who this tutorial is for
This tutorial is for you if you want to use [Apache Arrow](https://arrow.apache.org/) to access and manipulate data in databases.
See `vignette("DBI", package = "DBI")` and `vignette("DBI-advanced", package = "DBI")` for tutorials on accessing data using R's data frames instead of Arrow's structures.
## Rationale
Apache Arrow is

> a cross-language development platform for in-memory analytics.

Arrow support in DBI is motivated by the following points:

- Arrow is suitable for large and very large data, including data that does not fit into memory.
- Arrow is a data exchange format with good support for the data types used in SQL databases.
- New extension points allow backends (currently DuckDB and adbc) to make use of the data exchange format.
- Data retrieval and loading are faster because serialization is avoided in some cases.
- Reading and summarizing data from a database that is larger than memory is better supported.
- Workflows centered around Arrow get better type fidelity.
- The fundamental data structure is `arrow::RecordBatchReader`.
## New classes and generics
- Zero chance of interfering with existing DBI backends
- Fully functional fallback implementation for all existing DBI backends
- Requires {arrow} R package
- New generics:
    - `dbReadTableArrow()`
    - `dbCreateTableArrow()`
    - `dbAppendTableArrow()`
    - `dbGetQueryArrow()`
    - `dbSendQueryArrow()`
    - `dbFetchArrow()`
    - `dbFetchArrowChunk()`
    - `dbWriteTableArrow()`
- New classes:
    - `DBIResultArrow`
    - `DBIResultArrowDefault`
## Prepare
```{r}
library(DBI)
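# Connect to a temporary SQLite database
con <- dbConnect(RSQLite::SQLite())
# Scalar values for `b` and `c` are recycled to three rows
data <- data.frame(
  a = 1:3,
  b = 4.5,
  c = "five"
)
dbWriteTable(con, "tbl", data)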
```
## Read all rows from a table
```{r}
dbReadTableArrow(con, "tbl")
as.data.frame(dbReadTableArrow(con, "tbl"))
```
## Run queries
```{r}
stream <- dbGetQueryArrow(con, "SELECT COUNT(*) FROM tbl WHERE a < 3")
stream
as.data.frame(stream)
```
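The query can also be run in separate steps with the lower-level generics listed earlier. The following is a minimal sketch, assuming a DBI version that exports `dbSendQueryArrow()` and `dbFetchArrow()` and that the {RSQLite} and {nanoarrow} packages are installed. The names `con_demo`, `tbl_demo`, `res` and `fetched` are hypothetical and chosen so that they do not clash with the objects used in this tutorial.

```r
library(DBI)

# Hypothetical demo connection and table, separate from `con` and `tbl`
con_demo <- dbConnect(RSQLite::SQLite())
dbWriteTable(con_demo, "tbl_demo", data.frame(a = 1:3, b = 4.5, c = "five"))

# Send the query without fetching; the result object must be cleared later
res <- dbSendQueryArrow(con_demo, "SELECT a, b FROM tbl_demo WHERE a >= 2")

# Fetch the result as an Arrow stream and materialize it as a data frame
fetched <- as.data.frame(dbFetchArrow(res))

# Release the result set first, then the connection
dbClearResult(res)
dbDisconnect(con_demo)

fetched
```

Splitting the call this way is mainly useful when the result should stay open for a while; for one-off queries, `dbGetQueryArrow()` is shorter.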
## Process data piecemeal
```{r}
stream <- dbGetQueryArrow(con, "SELECT * FROM tbl WHERE a < 3")
stream
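# Each call returns the next chunk of the result
stream$get_next()
# Once the stream is exhausted, get_next() returns NULL
stream$get_next()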
```
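For results with many chunks, the same pattern is typically wrapped in a loop. The sketch below assumes that the stream returned by `dbGetQueryArrow()` is a nanoarrow array stream whose `get_next()` method returns `NULL` at the end, as in the chunk above. The names `con_demo`, `tbl_demo`, `stream_demo` and `rows_seen` are hypothetical.

```r
library(DBI)

# Hypothetical demo connection and table, separate from `con` and `tbl`
con_demo <- dbConnect(RSQLite::SQLite())
dbWriteTable(con_demo, "tbl_demo", data.frame(a = 1:3, b = 4.5, c = "five"))

stream_demo <- dbGetQueryArrow(con_demo, "SELECT * FROM tbl_demo WHERE a < 3")

# Pull chunks until the stream is exhausted, keeping only a running row count
rows_seen <- 0L
repeat {
  chunk <- stream_demo$get_next()
  if (is.null(chunk)) break
  # Convert just this chunk; earlier chunks are no longer held in memory
  rows_seen <- rows_seen + nrow(as.data.frame(chunk))
}

dbDisconnect(con_demo)
rows_seen
```

Only one chunk is converted to a data frame at a time, which is the point of piecemeal processing when the full result would not fit into memory.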
## Prepared queries
```{r}
in_arrow <- nanoarrow::as_nanoarrow_array(data.frame(a = 1:4))
stream <- dbGetQueryArrow(
  con,
  "SELECT $a AS batch, * FROM tbl WHERE a < $a",
  params = in_arrow
)
as.data.frame(stream)
```
## Writing data
```{r}
stream <- dbGetQueryArrow(con, "SELECT * FROM tbl WHERE a < 3")
dbWriteTableArrow(con, "tbl_new", stream)
dbReadTable(con, "tbl_new")
```
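Data that lives in R rather than in the database can be written the same way after converting it to an Arrow stream. This is a sketch under the assumption that `dbWriteTableArrow()` accepts a stream created with `nanoarrow::as_nanoarrow_array_stream()`; the names `con_demo`, `local_data`, `tbl_local` and `written` are hypothetical.

```r
library(DBI)

# Hypothetical demo connection, separate from `con`
con_demo <- dbConnect(RSQLite::SQLite())

# A local data frame, converted to an Arrow stream before writing
local_data <- data.frame(id = 1:3, label = c("x", "y", "z"))
dbWriteTableArrow(
  con_demo,
  "tbl_local",
  nanoarrow::as_nanoarrow_array_stream(local_data)
)

# Read the table back through the data frame interface to check the result
written <- dbReadTable(con_demo, "tbl_local")
dbDisconnect(con_demo)

written
```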
## Appending data
```{r}
stream <- dbGetQueryArrow(con, "SELECT * FROM tbl WHERE a < 3")
# Create the table from the schema of the stream, then append its rows
dbCreateTableArrow(con, "tbl_split", stream)
dbAppendTableArrow(con, "tbl_split", stream)
stream <- dbGetQueryArrow(con, "SELECT * FROM tbl WHERE a >= 3")
dbAppendTableArrow(con, "tbl_split", stream)
dbReadTable(con, "tbl_split")
```
As usual, do not forget to disconnect from the database when done.
```{r}
dbDisconnect(con)
```
## Conclusion
This concludes the overview of the Arrow features of DBI.
For more details on the functions covered in this tutorial, see the DBI specification at `vignette("spec", package = "DBI")`.
See the arrow package for further processing of the retrieved data.