add slides with intro and modeling
javierluraschi committed Jul 31, 2018
1 parent 5bf6b1a commit 4b8f6b4
Showing 50 changed files with 11,674 additions and 822 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -14,3 +14,5 @@ the-r-in-spark.log
checkpoints/
source/
destination/
input/
output/
8 changes: 5 additions & 3 deletions 01-intro.Rmd
@@ -43,11 +43,13 @@ While Hadoop with Hive was a powerful tool, it was still working over a distribu
meaning that Spark is a tool designed to support:

- **Data Processing**: Data processing is the collection and manipulation of items of data to produce meaningful information [@data-processing].
- **Large-Scale**: What _large_ means is hard to quantify, but one can interpret this as cluster-scale instead, which represents a set of connected computers that work together.
- **General**: Spark optimizes and executes parallel generic code, as in, there is no restriction as to what type of code one can write in Spark.
- **General**: Spark optimizes and executes parallel generic code, as in, there are no restrictions as to what type of code one can write in Spark.
- **Large-Scale**: One can interpret this as cluster-scale, as in, a set of connected computers working together to accomplish specific goals.
- **Fast**: Spark is much faster than its predecessor by making efficient use of memory to speed data access while running algorithms at scale.

Spark is good at tackling large-scale data processing problems, usually known as **big data** ([data sets that are more voluminous and complex than traditional ones](https://en.wikipedia.org/wiki/big_data)), but it is also good at tackling large-scale computation problems, known as **big compute** ([tools and approaches using a large amount of CPU and memory resources in a coordinated way](https://www.nimbix.net/glossary/big-compute/)). There is a third problem space where neither data nor compute are necessarily large-scale, and yet there are significant benefits from using the same tools.
_Data_ _processing_ means you can use Spark to process data, which is not surprising, but also not very descriptive. _General_ is required to allow you to run any data analysis, but is not interesting on its own. Describing Spark as _large-scale_ is relevant since it implies that a good use case for Spark is tackling problems that can be solved with multiple machines; for instance, when data does not fit on a single disk drive or into memory, Spark is a good candidate to consider. However, Spark is also _fast_, meaning it is worth considering for problems that are not large-scale but where using multiple processors could speed up computation; for instance, sorting large datasets that fit in memory or training CPU-intensive models could also benefit from running in Spark.

Therefore, Spark is good at tackling large-scale data processing problems, usually known as **big data** ([data sets that are more voluminous and complex than traditional ones](https://en.wikipedia.org/wiki/big_data)), but it is also good at tackling large-scale computation problems, known as **big compute** ([tools and approaches using a large amount of CPU and memory resources in a coordinated way](https://www.nimbix.net/glossary/big-compute/)). There is a third problem space where neither data nor compute are necessarily large-scale, and yet there are significant benefits from using the same tools.

Big data and big compute problems are usually easy to spot: if the data does not fit into a single machine, you might have a big data problem; if the data fits into a single machine but a process over the data takes days, weeks, or months to compute, you might have a big compute problem.

176 changes: 155 additions & 21 deletions 04-modeling.Rmd
@@ -1,35 +1,87 @@
# Modeling {#modeling}

While **this chapter has not been written**, a few resources and basic examples were made available to help out until this chapter is written.
While **this chapter has not been written**, a few resources and basic examples were made available to help out until this chapter is completed.

```{r include=FALSE}
library(dplyr)
library(ggplot2)
```

## Overview

MLlib is Apache Spark's scalable machine learning library and is available through `sparklyr`, mostly via functions prefixed with `ml_`. The following table describes some of the supported modeling algorithms, followed by a minimal usage sketch:

Algorithm | Function
----------|---------
Accelerated Failure Time Survival Regression | ml_aft_survival_regression
Alternating Least Squares Factorization | ml_als
Correlation Matrix | ml_corr
Decision Trees | ml_decision_tree
Generalized Linear Regression | ml_generalized_linear_regression
Gradient-Boosted Trees | ml_gradient_boosted_trees
Isotonic Regression | ml_isotonic_regression
K-Means Clustering | ml_kmeans
Latent Dirichlet Allocation | ml_lda
Linear Regression | ml_linear_regression
Linear Support Vector Machines | ml_linear_svc
Logistic Regression | ml_logistic_regression
Multilayer Perceptron | ml_multilayer_perceptron
Naive-Bayes | ml_naive_bayes
One vs Rest | ml_one_vs_rest
Principal Components Analysis | ml_pca
Random Forests | ml_random_forest
Survival Regression | ml_survival_regression
Accelerated Failure Time Survival Regression | ml_aft_survival_regression()
Alternating Least Squares Factorization | ml_als()
Bisecting K-Means Clustering | ml_bisecting_kmeans()
Chi-square Hypothesis Testing | ml_chisquare_test()
Correlation Matrix | ml_corr()
Decision Trees | ml_decision_tree()
Frequent Pattern Mining | ml_fpgrowth()
Gaussian Mixture Clustering | ml_gaussian_mixture()
Generalized Linear Regression | ml_generalized_linear_regression()
Gradient-Boosted Trees | ml_gradient_boosted_trees()
Isotonic Regression | ml_isotonic_regression()
K-Means Clustering | ml_kmeans()
Latent Dirichlet Allocation | ml_lda()
Linear Regression | ml_linear_regression()
Linear Support Vector Machines | ml_linear_svc()
Logistic Regression | ml_logistic_regression()
Multilayer Perceptron | ml_multilayer_perceptron()
Naive-Bayes | ml_naive_bayes()
One vs Rest | ml_one_vs_rest()
Principal Components Analysis | ml_pca()
Random Forests | ml_random_forest()
Survival Regression | ml_survival_regression()
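
Most of these functions take a Spark data frame and an R formula. As a minimal sketch, assuming a local connection and using the built-in `iris` dataset purely for illustration:

```{r eval=FALSE}
library(sparklyr)

# Connect to Spark in local mode and copy a small dataset
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# Fit a multiclass logistic regression predicting the species from petal size
iris_tbl %>%
  ml_logistic_regression(Species ~ Petal_Length + Petal_Width)
```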

To complement those algorithms, you will often also want to consider using the following feature transformers:

Transformer | Function
------------|---------
Binarizer | ft_binarizer()
Bucketizer | ft_bucketizer()
Chi-Squared Feature Selector | ft_chisq_selector()
Vocabulary from Document Collections | ft_count_vectorizer()
Discrete Cosine Transform | ft_discrete_cosine_transform()
Transformation using dplyr | ft_dplyr_transformer()
Hadamard Product | ft_elementwise_product()
Feature Hasher | ft_feature_hasher()
Term Frequencies using Hashing | ft_hashing_tf()
Inverse Document Frequency | ft_idf()
Imputation for Missing Values | ft_imputer()
Index to String | ft_index_to_string()
Feature Interaction Transform | ft_interaction()
Rescale to [-1, 1] Range | ft_max_abs_scaler()
Rescale to [min, max] Range | ft_min_max_scaler()
Locality Sensitive Hashing | ft_minhash_lsh()
Converts to n-grams | ft_ngram()
Normalize using the given P-Norm | ft_normalizer()
One-Hot Encoding | ft_one_hot_encoder()
Feature Expansion in Polynomial Space | ft_polynomial_expansion()
Maps to Binned Categorical Features | ft_quantile_discretizer()
SQL Transformation | ft_sql_transformer()
Standardizes Features using Corrected STD | ft_standard_scaler()
Filters out Stop Words | ft_stop_words_remover()
Map to Label Indices | ft_string_indexer()
Splits by White Spaces | ft_tokenizer()
Combine Vectors to Row Vector | ft_vector_assembler()
Indexing Categorical Feature | ft_vector_indexer()
Subarray of the Original Feature | ft_vector_slicer()
Transform Word into Code | ft_word2vec()
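
Feature transformers also take a Spark data frame and return a transformed one, so they chain naturally with `%>%`. The sketch below reuses `iris_tbl` from the previous example; the output column names, threshold, and bucket splits are arbitrary values chosen for illustration:

```{r eval=FALSE}
iris_tbl %>%
  # Add a 0/1 column flagging petals longer than 4 units
  ft_binarizer("Petal_Length", "long_petal", threshold = 4) %>%
  # Bucket petal width into the ranges defined by the splits
  ft_bucketizer("Petal_Width", "width_bucket", splits = c(0, 1, 2, 3)) %>%
  select(Petal_Length, long_petal, Petal_Width, width_bucket)
```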

## Supervised

Examples and resources are available at [spark.rstudio.com/mlib](http://spark.rstudio.com/mlib/); see also the sketch below.
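
As a starting point, the following sketch fits and evaluates a simple supervised model; the dataset, split proportions, and seed are arbitrary choices and a local connection is assumed:

```{r eval=FALSE}
library(sparklyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Split the data into training and test sets
partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1099)

# Fit a linear regression over the training set
fit <- partitions$training %>%
  ml_linear_regression(mpg ~ wt + cyl)

# Score the held-out test set and compute the root mean squared error
fit %>%
  ml_predict(partitions$test) %>%
  ml_regression_evaluator(label_col = "mpg", metric_name = "rmse")
```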

## Unsupervised

### K-Means Clustering

Here is an example to get you started with K-Means:

```{r}
```{r eval=FALSE}
library(sparklyr)
# Connect to Spark in local mode
@@ -43,7 +95,89 @@ iris_tbl %>%
ml_kmeans(centers = 3, Species ~ Petal_Width + Petal_Length)
```

More examples and resources are available at [spark.rstudio.com/mlib](http://spark.rstudio.com/mlib/).
### Gaussian Mixture Clustering

Alternatively, we can also cluster using [Gaussian Mixture Models](https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model) (GMMs).

```{r eval=FALSE, echo=FALSE}
devtools::install_github("hadley/fueleconomy")
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3.0")
vehicles_tbl <- copy_to(sc, fueleconomy::vehicles, overwrite = TRUE)
predictions <- vehicles_tbl %>%
ml_gaussian_mixture(~ hwy + cty, k = 3) %>%
ml_predict() %>%
collect()
saveRDS(predictions, "data/03-gaussian-mixture-prediction.rds")
```
```{r eval=FALSE}
predictions <- copy_to(sc, fueleconomy::vehicles) %>%
ml_gaussian_mixture(~ hwy + cty, k = 3) %>%
ml_predict() %>% collect()
predictions %>%
ggplot(aes(hwy, cty)) +
geom_point(aes(hwy, cty, col = factor(prediction)), size = 2, alpha = 0.4) +
scale_color_discrete(name = "", labels = paste("Cluster", 1:3)) +
labs(x = "Highway", y = "City") + theme_light()
```
```{r echo=FALSE, fig.cap="Fuel economy data for 1984-2015 from the US EPA"}
predictions <- readRDS("data/03-gaussian-mixture-prediction.rds")
predictions %>%
ggplot(aes(hwy, cty)) +
geom_point(aes(hwy, cty, col = factor(prediction)), size = 2, alpha = 0.4) +
scale_color_discrete(name = "", labels = paste("Cluster", 1:3)) +
labs(x = "Highway", y = "City") +
theme_light()
```

## Broom

You can turn your `sparklyr` models into data frames using the `broom` package:

```{r eval=FALSE, echo=FALSE}
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.3.0")
cars_tbl <- spark_read_csv(sc, "cars", "input/")
model <- cars_tbl %>% ml_linear_regression(mpg ~ wt + cyl)
tidy_model <- list(
broom::tidy(model),
broom::glance(model),
collect(broom::augment(model, cars_tbl))
)
saveRDS(tidy_model, "data/03-broom-examples.rds")
```
```{r echo=FALSE}
tidy_model <- readRDS("data/03-broom-examples.rds")
```
```{r eval=FALSE}
model <- cars_tbl %>%
ml_linear_regression(mpg ~ wt + cyl)
# Turn a model object into a data frame
broom::tidy(model)
```
```{r echo=FALSE}
tidy_model[[1]]
```
```{r eval=FALSE}
# Construct a single row summary
broom::glance(model)
```
```{r echo=FALSE}
tidy_model[[2]]
```
```{r eval=FALSE}
# Augments each observation in the dataset with the model
broom::augment(model, cars_tbl)
```
```{r echo=FALSE, rows.print=3}
tidy_model[[3]]
```

## Pipelines

5 changes: 5 additions & 0 deletions 08-extensions.Rmd
@@ -21,3 +21,8 @@ See [spark.rstudio.com/guides/h2o](http://spark.rstudio.com/guides/h2o/).
See [spark.rstudio.com/extensions](http://spark.rstudio.com/extensions/).

### RStudio Projects

You can easily create a `sparklyr` extension from RStudio. This feature requires RStudio 1.1 or newer and the `sparklyr` package to be installed. Then, from the `File` menu, choose `New Project...` and select `R Package using sparklyr`:

![](images/08-extensions-rstudio-project.png)

2 changes: 1 addition & 1 deletion 10-streaming.Rmd
@@ -53,7 +53,7 @@ library(sparklyr)
readRDS("data/10-streaming-overview.rds") %>% stream_render(stats = .)
```

Notice that the rows-per-second in the destination stream are higher than the rows-per-second in the source stream; this is expected and desirable since Spark measures incoming rates from the source, but actual row processing times in the destination stream.
Notice that the rows-per-second in the destination stream are higher than the rows-per-second in the source stream; this is expected and desirable since Spark measures the incoming rate at the source, but the actual rate at which rows are processed at the destination. For example, if 10 rows-per-second are written to the `source/` path, the incoming rate is 10 RPS. However, if it takes Spark only 0.01 seconds to write all those 10 rows, the output rate is 1,000 RPS.

Use `stream_stop()` to properly stop processing data from this stream:

73 changes: 42 additions & 31 deletions book.bib
@@ -15,40 +15,40 @@ @book{data-revolution
}

@inproceedings{google-file-system,
author = {Ghemawat, Sanjay and Gobioff, Howard and Leung, Shun-Tak},
title = {The Google File System},
booktitle = {Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles},
series = {SOSP '03},
year = {2003},
isbn = {1-58113-757-5},
location = {Bolton Landing, NY, USA},
pages = {29--43},
numpages = {15},
url = {http://doi.acm.org/10.1145/945445.945450},
doi = {10.1145/945445.945450},
acmid = {945450},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {clustered storage, data storage, fault tolerance, scalability}
author = {Ghemawat, Sanjay and Gobioff, Howard and Leung, Shun-Tak},
title = {The Google File System},
booktitle = {Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles},
series = {SOSP '03},
year = {2003},
isbn = {1-58113-757-5},
location = {Bolton Landing, NY, USA},
pages = {29--43},
numpages = {15},
url = {http://doi.acm.org/10.1145/945445.945450},
doi = {10.1145/945445.945450},
acmid = {945450},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {clustered storage, data storage, fault tolerance, scalability}
}

@article{google-map-reduce,
author = {Dean, Jeffrey and Ghemawat, Sanjay},
title = {MapReduce: Simplified Data Processing on Large Clusters},
journal = {Commun. ACM},
issue_date = {January 2008},
volume = {51},
number = {1},
month = jan,
year = {2008},
issn = {0001-0782},
pages = {107--113},
numpages = {7},
url = {http://doi.acm.org/10.1145/1327452.1327492},
doi = {10.1145/1327452.1327492},
acmid = {1327492},
publisher = {ACM},
address = {New York, NY, USA},
author = {Dean, Jeffrey and Ghemawat, Sanjay},
title = {MapReduce: Simplified Data Processing on Large Clusters},
journal = {Commun. ACM},
issue_date = {January 2008},
volume = {51},
number = {1},
month = jan,
year = {2008},
issn = {0001-0782},
pages = {107--113},
numpages = {7},
url = {http://doi.acm.org/10.1145/1327452.1327492},
doi = {10.1145/1327452.1327492},
acmid = {1327492},
publisher = {ACM},
address = {New York, NY, USA},
}

@book{data-processing,
@@ -59,3 +59,14 @@ @book{data-processing
ISBN = {1844801004},
URL = {https://www.amazon.com/Data-Processing-Information-Technology-French/dp/1844801004}
}

@techreport{Zaharia:EECS-2010-53,
Author = {Zaharia, Matei and Chowdhury, N. M. Mosharaf and Franklin, Michael and Shenker, Scott and Stoica, Ion},
Title = {Spark: Cluster Computing with Working Sets},
Institution = {EECS Department, University of California, Berkeley},
Year = {2010},
Month = {May},
URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-53.html},
Number = {UCB/EECS-2010-53},
Abstract = {MapReduce and its variants have been highly successful in implementing large-scale data intensive applications on clusters of unreliable machines. However, most of these systems are built around an acyclic data flow programming model that is not suitable for other popular applications. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis environments. We propose a new framework called Spark that supports these applications while maintaining the scalability and fault-tolerance properties of MapReduce. To achieve these goals, Spark introduces a data abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.}
}
Binary file added data/03-broom-examples.rds
Binary file not shown.
Binary file added data/03-gaussian-mixture-prediction.rds
Binary file not shown.
