---
title: "Doing big data with Spark Pt. 1"
output:
  html_document:
    theme: lumen
    toc: yes
    toc_float: yes
  html_notebook:
    theme: lumen
---
# Getting Started
# Links
Slides:

* https://github.com/rikturr/doing-big-data-with-spark/blob/master/slides.pdf

Some good references:

* http://spark.rstudio.com/
* https://spark.apache.org/
# Install sparklyr + Spark
```{r, eval = F}
# run this once
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.2.0")
```
Also make sure these packages are installed, since later notebooks use them:
```{r, eval = F}
install.packages('dplyr')
install.packages('pryr')
install.packages('ggplot2')
install.packages('dbplot')
```
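If some of these packages are already installed, reinstalling them every time is unnecessary. A minimal sketch, using only base R, of installing just the missing ones; `missing_packages()` and `install_missing()` are hypothetical helper names introduced here, not part of the workshop code:

```{r, eval = F}
# Packages the later notebooks rely on (same list as above)
needed <- c("dplyr", "pryr", "ggplot2", "dbplot")

# Return the subset of `pkgs` that is not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Install only what is missing (hypothetical helper for this sketch)
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}
```

Calling `install_missing(needed)` once should then be enough; it does nothing when everything is already present.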
# Initialize Spark
You can also connect from the Connections tab in RStudio. Spark requires Java, so install it first if you don't already have it.
Run this at the top of every script or notebook.
```{r}
library(sparklyr)
sc <- spark_connect(master = "local")
```
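If the connection fails, a missing Java installation is a common first suspect. The sketch below does a rough check for a `java` executable on the `PATH` and, when a connection named `sc` exists, confirms that it is open. `java_available()` is a hypothetical helper name; on some systems a `java` stub can exist without a working Java install, so treat the check as a hint only.

```{r, eval = F}
# Rough check: is there a `java` executable on the PATH?
# Sys.which() returns "" when nothing is found.
java_available <- function() {
  nzchar(unname(Sys.which("java")))
}

if (!java_available()) {
  message("No java executable found on PATH; Spark needs Java to run")
}

# If `sc` was created as above, confirm the connection and report the version
if (exists("sc")) {
  library(sparklyr)
  if (spark_connection_is_open(sc)) {
    message("Connected to Spark ", as.character(spark_version(sc)))
  }
}
```

When a script is finished, `spark_disconnect(sc)` closes the connection.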
The `master` argument will be different if you are running against a real Spark cluster; see http://spark.rstudio.com/examples/. You may also need to allocate more memory for memory-intensive work:
```{r, eval=F}
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "4G"
config$`sparklyr.shell.executor-memory` <- "4G"
sc <- spark_connect(master = "local", config = config)
```
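One way to keep the same notebook usable both locally and against a cluster is to read the `master` value from the environment. This is only a sketch: `SPARK_MASTER` is a hypothetical environment variable name chosen for this example, and the right `master` string depends on how your cluster is set up.

```{r, eval = F}
library(sparklyr)

# SPARK_MASTER is a hypothetical variable used only in this sketch;
# fall back to a local Spark instance when it is not set
master <- Sys.getenv("SPARK_MASTER", unset = "local")

config <- spark_config()

# In local mode everything runs inside the driver process,
# so driver memory is the setting that matters there
if (startsWith(master, "local")) {
  config$`sparklyr.shell.driver-memory` <- "4G"
}

sc <- spark_connect(master = master, config = config)
```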
# Your best friend
The `sparklyr` docs are pretty good:
```{r}
?spark_connect
```
# Put it all together
To make life easier, we have a `spark_init.R` file so we don't have to copy/paste the setup code over and over.
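
The contents of `spark_init.R` are not shown here, so the commented lines below are only an assumption about what such a file might hold, based on the chunks above. The one real step is the `source()` call, which runs the file in the current session:

```{r, eval = F}
# --- hypothetical contents of spark_init.R (an assumption, not the actual file) ---
# library(sparklyr)
# config <- spark_config()
# config$`sparklyr.shell.driver-memory` <- "4G"
# sc <- spark_connect(master = "local", config = config)

# --- at the top of a script or notebook ---
# Runs spark_init.R; if it defines `sc` as sketched above,
# the connection is then available in this session
source("spark_init.R")
```

`source()` resolves the path relative to the working directory, so this assumes the script or notebook runs from the folder that contains `spark_init.R`.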