1_getting_started.Rmd

---
title: "Doing big data with Spark Pt. 1"
output:
  html_document:
    theme: lumen
    toc: yes
    toc_float: yes
  html_notebook:
    theme: lumen
---

# Getting Started

# Links

Slides

* https://github.com/rikturr/doing-big-data-with-spark/blob/master/slides.pdf

Some good references

* http://spark.rstudio.com/
* https://spark.apache.org/

# Install sparklyr + Spark

```{r, eval = F}
# run this once
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.2.0")
```

Also, make sure you have these packages installed, as later notebooks will use them

```{r, eval = F}
install.packages('dplyr')
install.packages('pryr')
install.packages('ggplot2')
install.packages('dbplot')
```

# Initialize Spark

Can also do this from the Connections tab in RStudio. You may need to install Java if you don't already have it.

Do this at the top of every script/notebook.

```{r}
library(sparklyr)
sc <- spark_connect(master = "local")
```

The `master` argument will be different if you are running against a real Spark cluster: http://spark.rstudio.com/examples/. Also, you may need to allocate more memory for doing intense things:

```{r, eval=F}
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "4G"
config$`sparklyr.shell.executor-memory` <- "4G"
sc <- spark_connect(master = "local", config = config)
```

# Your best friend

The docs are pretty good for `sparklyr`

```{r}
?spark_connect
```

# Put it all together

To make life easier, we have a `spark_init.R` file so we don't have to copy/paste it over and over.