-
Notifications
You must be signed in to change notification settings - Fork 2
/
RTD_example.Rmd
108 lines (80 loc) · 2.9 KB
/
RTD_example.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
---
title: "Access Arm Treasure Data from R with RPresto and RTD"
output:
html_document:
df_print: paged
---
This is an example notebook to connect Arm Treasure Data (TD) from R. If you want to know how to access from R with RJDBC or RPresto, you can see [How to connect Arm Treasure Data from R](https://rpubs.com/chezou/TD-from-R).
While RPresto enables you to query TD easily, it can't handle some TD execution. In this notebook, I will explain how to query with RPresto and how to execute basic TD operation.
If you want to use [embulk](https://www.embulk.org/) and [embulk-output-td](https://github.com/treasure-data/embulk-output-td) plugin, ensure PATH for `embulk` command to be set appropriately.
You also have to set your "TD_API_KEY" is set as an environement variable.
Let's install RPresto and RTD packages.
```{r eval=FALSE}
install.packages(c("RPresto", "RTD", "dplyr", "dbplyr", "ggplot2"))
```
We will use nycflights13 package for analysis. Copy flights data to TD.
```{r eval=FALSE}
install.packages("nycflights13")
```
```{r}
library(RTD)
client <- Td(apikey = Sys.getenv("TD_API_KEY"), endpoint="api.treasuredata.com")
td_upload(client, "test", "flights", nycflights13::flights, overwrite=T)
```
If you want to use embulk for `td_upload`, you can set `mode="embulk"` like:
```{r eval=FALSE}
td_upload(client, "test", "flights", nycflights13::flights, mode="embulk", overwrite=T)
```
Then, connect the table with RPresto.
```{r}
library(RPresto)
library(dplyr)
db <- src_presto(
host="https://api-presto.treasuredata.com",
port=443,
user=Sys.getenv("TD_API_KEY"),
schema='test',
catalog='td-presto'
)
flights_tbl <- tbl(db, 'flights')
```
Execute simple dplyr operation.
```{r}
# filter by departure delay
flights_tbl %>% filter(dep_delay == 2)
```
["Introduction to dplyr"](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) introduces advanced dplyr examples. Let's try grouped operations example from the tutorial and plot the result.
```{r}
delay <- flights_tbl %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(count > 20, dist < 2000) %>%
collect
# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
geom_point(aes(size = count), alpha = 1/2) +
geom_smooth() +
scale_size_area()
```
You can write SQL using `DBI` package.
```{r}
library(DBI)
con <- dbConnect(
RPresto::Presto(),
host="https://api-presto.treasuredata.com",
port=443,
user=Sys.getenv("TD_API_KEY"),
schema='test',
catalog='td-presto'
)
flights_preview <- dbGetQuery(con, 'SELECT year, month, day, dep_time, dep_delay, carrier, flight from flights limit 10')
flights_preview
```
You can execute basic TD operation like delete table.
```{r}
delete_table(client, "test", "flights")
```
## References
- [RPresto](https://github.com/prestodb/RPresto)
- [RTD](https://github.com/treasure-data/RTD)