add slides with intro and modeling
javierluraschi committed Jul 31, 2018
1 parent 5bf6b1a commit 4b8f6b4
Showing 50 changed files with 11,674 additions and 822 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -14,3 +14,5 @@ the-r-in-spark.log
checkpoints/
source/
destination/
input/
output/
8 changes: 5 additions & 3 deletions 01-intro.Rmd
@@ -43,11 +43,13 @@ While Hadoop with Hive was a powerful tool, it was still working over a distribu
meaning that Spark is a tool designed to support:

- **Data Processing**: Data processing is the collection and manipulation of items of data to produce meaningful information [@data-processing].
- **Large-Scale**: What _large_ means is hard to quantify, but one can interpret this as cluster-scale instead, which represents a set of connected computers that work together.
- **General**: Spark optimizes and executes parallel generic code, as in, there is no restriction as to what type of code one can write in Spark.
- **General**: Spark optimizes and executes parallel generic code, as in, there are no restrictions as to what type of code one can write in Spark.
- **Large-Scale**: One can interpret this as cluster-scale, as in, a set of connected computers working together to accomplish specific goals.
- **Fast**: Spark is much faster than its predecessor by making efficient use of memory to speed data access while running algorithms at scale.

Spark is good at tackling large-scale data processing problems, usually known as **big data** ([data sets that are more voluminous and complex than traditional ones](https://en.wikipedia.org/wiki/big_data)), but it is also good at tackling large-scale computation problems, known as **big compute** ([tools and approaches using a large amount of CPU and memory resources in a coordinated way](https://www.nimbix.net/glossary/big-compute/)). There is a third problem space where neither data nor compute are necessarily large-scale, and yet there are significant benefits from using the same tools.
_Data_ _processing_ means you can use Spark to process data, which is not surprising, but also not very descriptive. _General_ is required to allow you to run any data analysis, but is not interesting on its own. Describing Spark as _large-scale_ is relevant since it implies that a good use case for Spark is tackling problems that can be solved with multiple machines; for instance, when data does not fit on a single disk drive or into memory, Spark is a good candidate to consider. However, Spark is also _fast_, meaning it is worth considering for problems that are not large-scale but where using multiple processors could speed up computation; for instance, sorting large datasets that fit in memory or training CPU-intensive models could also benefit from running in Spark.

Therefore, Spark is good at tackling large-scale data processing problems, usually known as **big data** ([data sets that are more voluminous and complex than traditional ones](https://en.wikipedia.org/wiki/big_data)), but it is also good at tackling large-scale computation problems, known as **big compute** ([tools and approaches using a large amount of CPU and memory resources in a coordinated way](https://www.nimbix.net/glossary/big-compute/)). There is a third problem space where neither data nor compute are necessarily large-scale, and yet there are significant benefits from using the same tools.

Big data and big compute problems are usually easy to spot: if the data does not fit into a single machine, you might have a big data problem; if the data fits into a single machine but a process over the data takes days, weeks, or months to compute, you might have a big compute problem.

176 changes: 155 additions & 21 deletions 04-modeling.Rmd
@@ -1,35 +1,87 @@
# Modeling {#modeling}

While **this chapter has not been written**, a few resources and basic examples were made available to help out until this chapter is written.
While **this chapter has not been written**, a few resources and basic examples were made available to help out until this chapter is completed.

```{r include=FALSE}
library(dplyr)
library(ggplot2)
```

## Overview

MLlib is Apache Spark's scalable machine learning library and is available through `sparklyr`, mostly via functions prefixed with `ml_`. The following table describes some of the supported modeling algorithms, followed by a minimal usage sketch:

Algorithm | Function
----------|---------
Accelerated Failure Time Survival Regression | ml_aft_survival_regression
Alternating Least Squares Factorization | ml_als
Correlation Matrix | ml_corr
Decision Trees | ml_decision_tree
Generalized Linear Regression | ml_generalized_linear_regression
Gradient-Boosted Trees | ml_gradient_boosted_trees
Isotonic Regression | ml_isotonic_regression
K-Means Clustering | ml_kmeans
Latent Dirichlet Allocation | ml_lda
Linear Regression | ml_linear_regression
Linear Support Vector Machines | ml_linear_svc
Logistic Regression | ml_logistic_regression
Multilayer Perceptron | ml_multilayer_perceptron
Naive-Bayes | ml_naive_bayes
One vs Rest | ml_one_vs_rest
Principal Components Analysis | ml_pca
Random Forests | ml_random_forest
Survival Regression | ml_survival_regression
Accelerated Failure Time Survival Regression | ml_aft_survival_regression()
Alternating Least Squares Factorization | ml_als()
Bisecting K-Means Clustering | ml_bisecting_kmeans()
Chi-square Hypothesis Testing | ml_chisquare_test()
Correlation Matrix | ml_corr()
Decision Trees | ml_decision_tree()
Frequent Pattern Mining | ml_fpgrowth()
Gaussian Mixture Clustering | ml_gaussian_mixture()
Generalized Linear Regression | ml_generalized_linear_regression()
Gradient-Boosted Trees | ml_gradient_boosted_trees()
Isotonic Regression | ml_isotonic_regression()
K-Means Clustering | ml_kmeans()
Latent Dirichlet Allocation | ml_lda()
Linear Regression | ml_linear_regression()
Linear Support Vector Machines | ml_linear_svc()
Logistic Regression | ml_logistic_regression()
Multilayer Perceptron | ml_multilayer_perceptron()
Naive-Bayes | ml_naive_bayes()
One vs Rest | ml_one_vs_rest()
Principal Components Analysis | ml_pca()
Random Forests | ml_random_forest()
Survival Regression | ml_survival_regression()
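
Most of these functions take a Spark data frame and an R formula. As a minimal sketch, assuming a local connection and using the built-in `iris` dataset purely for illustration:

```{r eval=FALSE}
library(sparklyr)

# Connect to Spark in local mode and copy a small dataset
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# Fit a multiclass logistic regression predicting the species from petal size
iris_tbl %>%
  ml_logistic_regression(Species ~ Petal_Length + Petal_Width)
```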

To complement those algorithms, you will often also want to consider using the following feature transformers:

Transformer | Function
------------|---------
Binarizer | ft_binarizer()
Bucketizer | ft_bucketizer()
Chi-Squared Feature Selector | ft_chisq_selector()
Vocabulary from Document Collections | ft_count_vectorizer()
Discrete Cosine Transform | ft_discrete_cosine_transform()
Transformation using dplyr | ft_dplyr_transformer()
Hadamard Product | ft_elementwise_product()
Feature Hasher | ft_feature_hasher()
Term Frequencies using Hashing | ft_hashing_tf()
Inverse Document Frequency | ft_idf()
Imputation for Missing Values | ft_imputer()
Index to String | ft_index_to_string()
Feature Interaction Transform | ft_interaction()
Rescale to [-1, 1] Range | ft_max_abs_scaler()
Rescale to [min, max] Range | ft_min_max_scaler()
Locality Sensitive Hashing | ft_minhash_lsh()
Converts to n-grams | ft_ngram()
Normalize using the given P-Norm | ft_normalizer()
One-Hot Encoding | ft_one_hot_encoder()
Feature Expansion in Polynomial Space | ft_polynomial_expansion()
Maps to Binned Categorical Features | ft_quantile_discretizer()
SQL Transformation | ft_sql_transformer()
Standardizes Features using Corrected STD | ft_standard_scaler()
Filters out Stop Words | ft_stop_words_remover()
Map to Label Indices | ft_string_indexer()
Splits by White Spaces | ft_tokenizer()
Combine Vectors to Row Vector | ft_vector_assembler()
Indexing Categorical Feature | ft_vector_indexer()
Subarray of the Original Feature | ft_vector_slicer()
Transform Word into Code | ft_word2vec()
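
Feature transformers also take a Spark data frame and return a transformed one, so they chain naturally with `%>%`. The sketch below reuses `iris_tbl` from the previous example; the output column names, threshold, and bucket splits are arbitrary values chosen for illustration:

```{r eval=FALSE}
iris_tbl %>%
  # Add a 0/1 column flagging petals longer than 4 units
  ft_binarizer("Petal_Length", "long_petal", threshold = 4) %>%
  # Bucket petal width into the ranges defined by the splits
  ft_bucketizer("Petal_Width", "width_bucket", splits = c(0, 1, 2, 3)) %>%
  select(Petal_Length, long_petal, Petal_Width, width_bucket)
```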

## Supervised

Examples and resources are available at [spark.rstudio.com/mlib](http://spark.rstudio.com/mlib/); see also the sketch below.
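
As a starting point, the following sketch fits and evaluates a simple supervised model; the dataset, split proportions, and seed are arbitrary choices and a local connection is assumed:

```{r eval=FALSE}
library(sparklyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Split the data into training and test sets
partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1099)

# Fit a linear regression over the training set
fit <- partitions$training %>%
  ml_linear_regression(mpg ~ wt + cyl)

# Score the held-out test set and compute the root mean squared error
fit %>%
  ml_predict(partitions$test) %>%
  ml_regression_evaluator(label_col = "mpg", metric_name = "rmse")
```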

## Unsupervised

### K-Means Clustering

Here is an example to get you started with K-Means:

```{r}
```{r eval=FALSE}
library(sparklyr)
# Connect to Spark in local mode
@@ -43,7 +95,89 @@ iris_tbl %>%
ml_kmeans(centers = 3, Species ~ Petal_Width + Petal_Length)
```

More examples and resources are available at [spark.rstudio.com/mlib](http://spark.rstudio.com/mlib/).
### Gaussian Mixture Clustering

Alternatively, we can also cluster using [Gaussian Mixture Models](https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model) (GMMs).

```{r eval=FALSE, echo=FALSE}
devtools::install_github("hadley/fueleconomy")
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3.0")
vehicles_tbl <- copy_to(sc, fueleconomy::vehicles, overwrite = TRUE)
predictions <- vehicles_tbl %>%
ml_gaussian_mixture(~ hwy + cty, k = 3) %>%
ml_predict() %>%
collect()
saveRDS(predictions, "data/03-gaussian-mixture-prediction.rds")
```
```{r eval=FALSE}
predictions <- copy_to(sc, fueleconomy::vehicles) %>%
ml_gaussian_mixture(~ hwy + cty, k = 3) %>%
ml_predict() %>% collect()
predictions %>%
ggplot(aes(hwy, cty)) +
geom_point(aes(hwy, cty, col = factor(prediction)), size = 2, alpha = 0.4) +
scale_color_discrete(name = "", labels = paste("Cluster", 1:3)) +
labs(x = "Highway", y = "City") + theme_light()
```
```{r echo=FALSE, fig.cap="Fuel economy data for 1984-2015 from the US EPA"}
predictions <- readRDS("data/03-gaussian-mixture-prediction.rds")
predictions %>%
ggplot(aes(hwy, cty)) +
geom_point(aes(hwy, cty, col = factor(prediction)), size = 2, alpha = 0.4) +
scale_color_discrete(name = "", labels = paste("Cluster", 1:3)) +
labs(x = "Highway", y = "City") +
theme_light()
```

## Broom

You can turn your `sparklyr` models into data frames using the `broom` package:

```{r eval=FALSE, echo=FALSE}
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3.0")
cars_tbl <- spark_read_csv(sc, "cars", "input/")
model <- cars_tbl %>% ml_linear_regression(mpg ~ wt + cyl)
tidy_model <- list(
broom::tidy(model),
broom::glance(model),
collect(broom::augment(model, cars_tbl))
)
saveRDS(tidy_model, "data/03-broom-examples.rds")
```
```{r echo=FALSE}
tidy_model <- readRDS("data/03-broom-examples.rds")
```
```{r eval=FALSE}
model <- cars_tbl %>%
ml_linear_regression(mpg ~ wt + cyl)
# Turn a model object into a data frame
broom::tidy(model)
```
```{r echo=FALSE}
tidy_model[[1]]
```
```{r eval=FALSE}
# Construct a single row summary
broom::glance(model)
```
```{r echo=FALSE}
tidy_model[[2]]
```
```{r eval=FALSE}
# Augments each observation in the dataset with the model
broom::augment(model, cars_tbl)
```
```{r echo=FALSE, rows.print=3}
tidy_model[[3]]
```

## Pipelines

5 changes: 5 additions & 0 deletions 08-extensions.Rmd
@@ -21,3 +21,8 @@ See [spark.rstudio.com/guides/h2o](http://spark.rstudio.com/guides/h2o/).
See [spark.rstudio.com/extensions](http://spark.rstudio.com/extensions/).

### RStudio Projects

You can easily create a `sparklyr` extension from RStudio. This feature requires RStudio 1.1 or newer and the `sparklyr` package to be installed. Then, from the `File` menu, choose `New Project...` and select `R Package using sparklyr`:

![](images/08-extensions-rstudio-project.png)

2 changes: 1 addition & 1 deletion 10-streaming.Rmd
@@ -53,7 +53,7 @@ library(sparklyr)
readRDS("data/10-streaming-overview.rds") %>% stream_render(stats = .)
```

Notice that the rows-per-second in the destination stream are higher than the rows-per-second in the source stream; this is expected and desirable since Spark measures incoming rates from the source, but actual row processing times in the destination stream.
Notice that the rows-per-second in the destination stream are higher than the rows-per-second in the source stream; this is expected and desirable since Spark measures the incoming rate at the source, but the actual rate at which rows are processed at the destination. For example, if 10 rows-per-second are written to the `source/` path, the incoming rate is 10 RPS. However, if it takes Spark only 0.01 seconds to write all those 10 rows, the output rate is 1,000 RPS.

Use `stream_stop()` to properly stop processing data from this stream:

73 changes: 42 additions & 31 deletions book.bib
@@ -15,40 +15,40 @@ @book{data-revolution
}

@inproceedings{google-file-system,
author = {Ghemawat, Sanjay and Gobioff, Howard and Leung, Shun-Tak},
title = {The Google File System},
booktitle = {Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles},
series = {SOSP '03},
year = {2003},
isbn = {1-58113-757-5},
location = {Bolton Landing, NY, USA},
pages = {29--43},
numpages = {15},
url = {http://doi.acm.org/10.1145/945445.945450},
doi = {10.1145/945445.945450},
acmid = {945450},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {clustered storage, data storage, fault tolerance, scalability}
author = {Ghemawat, Sanjay and Gobioff, Howard and Leung, Shun-Tak},
title = {The Google File System},
booktitle = {Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles},
series = {SOSP '03},
year = {2003},
isbn = {1-58113-757-5},
location = {Bolton Landing, NY, USA},
pages = {29--43},
numpages = {15},
url = {http://doi.acm.org/10.1145/945445.945450},
doi = {10.1145/945445.945450},
acmid = {945450},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {clustered storage, data storage, fault tolerance, scalability}
}

@article{google-map-reduce,
author = {Dean, Jeffrey and Ghemawat, Sanjay},
title = {MapReduce: Simplified Data Processing on Large Clusters},
journal = {Commun. ACM},
issue_date = {January 2008},
volume = {51},
number = {1},
month = jan,
year = {2008},
issn = {0001-0782},
pages = {107--113},
numpages = {7},
url = {http://doi.acm.org/10.1145/1327452.1327492},
doi = {10.1145/1327452.1327492},
acmid = {1327492},
publisher = {ACM},
address = {New York, NY, USA},
author = {Dean, Jeffrey and Ghemawat, Sanjay},
title = {MapReduce: Simplified Data Processing on Large Clusters},
journal = {Commun. ACM},
issue_date = {January 2008},
volume = {51},
number = {1},
month = jan,
year = {2008},
issn = {0001-0782},
pages = {107--113},
numpages = {7},
url = {http://doi.acm.org/10.1145/1327452.1327492},
doi = {10.1145/1327452.1327492},
acmid = {1327492},
publisher = {ACM},
address = {New York, NY, USA},
}

@book{data-processing,
@@ -59,3 +59,14 @@ @book{data-processing
ISBN = {1844801004},
URL = {https://www.amazon.com/Data-Processing-Information-Technology-French/dp/1844801004}
}

@techreport{Zaharia:EECS-2010-53,
Author = {Zaharia, Matei and Chowdhury, N. M. Mosharaf and Franklin, Michael and Shenker, Scott and Stoica, Ion},
Title = {Spark: Cluster Computing with Working Sets},
Institution = {EECS Department, University of California, Berkeley},
Year = {2010},
Month = {May},
URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-53.html},
Number = {UCB/EECS-2010-53},
Abstract = {MapReduce and its variants have been highly successful in implementing large-scale data intensive applications on clusters of unreliable machines. However, most of these systems are built around an acyclic data flow programming model that is not suitable for other popular applications. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis environments. We propose a new framework called Spark that supports these applications while maintaining the scalability and fault-tolerance properties of MapReduce. To achieve these goals, Spark introduces a data abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.}
}
Binary file added data/03-broom-examples.rds
Binary file not shown.
Binary file added data/03-gaussian-mixture-prediction.rds
Binary file not shown.
