---
title: "A Journey through Arrow in R"
subtitle: "rOpenSci Community Call - 28th June 2023"
author: "Nic Crane, Jonathan Keane, and Steph Hazlitt"
execute:
  echo: true
format:
  revealjs:
    theme: white
---
# A Journey through Arrow in R
![](https://github.com/djnavarro/arrow-user2022/blob/main/img/arrow-hex-dark.png?raw=true){fig-align="center"}
## Arrow
> Apache Arrow is a multi-language toolbox for accelerated data interchange and processing.
![](https://github.com/djnavarro/arrow-user2022/blob/main/img/arrow-hex-dark.png?raw=true){fig-align="center"}
## What problems can you solve with Arrow?
- Work with larger-than-memory data and multifile datasets in R without needing to set up a cluster or database
- Pass data between R and Python, or Arrow and DuckDB without needing to copy data
- Do all this while using dplyr syntax and familiar R functions
- ...and more!
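
As a rough illustration of the points above, here is a minimal sketch that writes a small multifile dataset and queries it with dplyr verbs. The built-in `mtcars` data, the temporary path `ds_path`, and the choice of `cyl` as the partitioning column are stand-ins chosen for this example, not part of the talk's demo.

```r
library(arrow)
library(dplyr)

# Write a small built-in data frame as a multifile Parquet dataset,
# one directory per value of `cyl` (a stand-in for a much larger dataset)
ds_path <- tempfile()
write_dataset(mtcars, ds_path, partitioning = "cyl")

# Open the dataset: this reads metadata only, not the data itself
ds <- open_dataset(ds_path)

# Familiar dplyr verbs build up a query; collect() runs it
# and returns the result as a tibble
result <- ds |>
  filter(mpg > 20) |>
  group_by(cyl) |>
  summarise(mean_hp = mean(hp), n = n()) |>
  arrange(cyl) |>
  collect()
```

The same pipeline would apply unchanged to a dataset too large to fit in memory, since only the summarised result is returned to R.
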
## Arrow libraries
[![Image from "Larger-Than-Memory Data Workflows with Apache Arrow" by Danielle Navarro is licensed under CC BY-ND 4.0](https://github.com/djnavarro/arrow-user2022/blob/main/img/arrow-libraries-structure.png?raw=true)](https://arrow-user2022.netlify.app/)
## The arrow R package
[![Image from "Larger-Than-Memory Data Workflows with Apache Arrow" by Danielle Navarro is licensed under CC BY-ND 4.0](https://github.com/djnavarro/arrow-user2022/blob/main/img/dplyr-backend.png?raw=true)](https://arrow-user2022.netlify.app/)
## Demo 1
Now over to Jon for a demo!
## Arrow Datasets
* Similar to database connections
* Can consist of multiple files
* Data not pulled into R memory until you call `collect()`
```{r, echo = FALSE}
library(arrow)
library(dplyr)
```
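```{r, echo = FALSE}
# Write mtcars to a temporary directory as a CSV dataset
td <- tempfile()
dir.create(td)
write_dataset(mtcars, td, format = "csv")
# Open it as an Arrow Dataset; printing shows the schema without loading the data
open_dataset(td, format = "csv")
```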
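
To make the role of `collect()` concrete, the sketch below reuses the same temporary CSV dataset and builds a query before running it. The object names `ds`, `query`, and `four_cyl` are illustrative only.

```r
library(arrow)
library(dplyr)

# Same setup as the chunk above: write mtcars to a temporary directory as CSV
td <- tempfile()
dir.create(td)
write_dataset(mtcars, td, format = "csv")
ds <- open_dataset(td, format = "csv")

# Building a query is lazy: no rows are read into R yet
query <- ds |>
  filter(cyl == 4) |>
  select(mpg, cyl, hp)

# The query is its own kind of object, not a data frame
class(query)

# collect() executes the query and returns the result as a tibble
four_cyl <- collect(query)
```
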
## Interoperability
* Zero-copy interoperability
* Pass data between R and Python
* Pass data between Arrow and DuckDB
## When might we want to use this?
* Data from multiple sources (e.g. a Parquet file or Arrow Dataset and a DuckDB table)
* Functionality available in one engine but not implemented in the other
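
One way to hand data from Arrow to DuckDB and back is with `to_duckdb()` and `to_arrow()` from the arrow package. This is a minimal sketch, assuming the duckdb and dbplyr packages are installed alongside arrow; the grouped `mutate()` is only an example of a step one might prefer to run in DuckDB, and the `mean_mpg` column name is made up for illustration.

```r
library(arrow)
library(dplyr)
# Assumption: the duckdb and dbplyr packages are installed;
# to_duckdb() relies on them at run time

# Illustrative dataset: mtcars written to a temporary Parquet dataset
td <- tempfile()
write_dataset(mtcars, td)
ds <- open_dataset(td)

result <- ds |>
  to_duckdb() |>                                  # expose the Arrow data to DuckDB
  group_by(cyl) |>
  mutate(mean_mpg = mean(mpg, na.rm = TRUE)) |>   # evaluated by DuckDB
  ungroup() |>
  to_arrow() |>                                   # hand the result back to Arrow
  filter(mpg > mean_mpg) |>                       # evaluated by Arrow
  collect()
```

Each engine handles the steps placed on its side of the handoff, so the pipeline can mix functionality from both.
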
## Demo 2
Now over to Jon for a demo!
# Resources
## Docs
[![https://arrow.apache.org/docs/r/](images/docs.png)](https://arrow.apache.org/docs/r/)
## Cookbook
[![https://arrow.apache.org/cookbook/r/](images/cookbook.png)](https://arrow.apache.org/cookbook/r/)
## Cheatsheet
[![https://github.com/apache/arrow/blob/main/r/cheatsheet/arrow-cheatsheet.pdf](https://arrow.apache.org/img/20220427-arrow-r-cheatsheet-thumbnail.png)](https://github.com/apache/arrow/blob/main/r/cheatsheet/arrow-cheatsheet.pdf)
```{=html}
<!--
The Arrow for R cheatsheet is intended to be an easy-to-scan introduction to the Arrow R package and Arrow data structures, with getting-started sections on some of the package’s main functionality. The cheatsheet includes introductory snippets on using Arrow to read and work with larger-than-memory multi-file datasets, sending and receiving data with Flight, reading data from cloud storage without downloading the data first, and more. The Arrow for R cheatsheet also directs users to the full Arrow for R package documentation and articles and the Arrow Cookbook, both full of code examples and recipes to help users build their Arrow-based data workflows. Finally, the cheatsheet debuts one of the first uses of the hot-off-the-presses Arrow hex sticker, recently made available as part of the Apache Arrow visual identity guidance.
-->
```
## UseR! 2022 Tutorial
[![https://arrow-user2022.netlify.app/](images/usertutorial.png)](https://arrow-user2022.netlify.app/)
<!-- Also mention mini taxi dataset -->
## Awesome Arrow
[![https://github.com/thisisnic/awesome-arrow-r](images/awesomearrow.png)](https://github.com/thisisnic/awesome-arrow-r)
# Get Involved!
## Open an issue
[![https://github.com/apache/arrow/issues/](images/issues.png)](https://github.com/apache/arrow/issues/)
## Make a PR!
- docs
- cookbook
- code
# A Journey through Arrow in R
![](https://github.com/djnavarro/arrow-user2022/blob/main/img/arrow-hex-dark.png?raw=true){fig-align="center"}