# Spark in Python

### Goals
* Learn how to create a Spark Context in Python through pyspark
* Learn how to create and use RDDs in Python
* Learn how to use MLlib




Basic Transformations and Actions
===========================

Common RDD Constructors
-----------------------

Expression                               |Meaning
----------                               |-------
`sc.parallelize(iterable)`               |Create RDD of elements of some iterable
`sc.textFile(path)`                      |Create RDD of lines from file

Common Transformations
----------------------

Expression                               |Meaning
----------                               |-------
`filter(boolean function)`               |Keep only the elements for which the function returns True
`map(some function)`                     |Apply some function to each element
`flatMap(some function)`                 |Apply some function that returns an iterable to each element and flatten the results
`sample(withReplacement, fraction)`      |Sample roughly the given fraction of the data, with or without replacement
`distinct()`                             |Remove duplicates in RDD
`sortBy(key function, ascending=True)`   |Sort elements by the key returned by the function, in the designated order
`randomSplit([weight1, weight2], seed)`  |Split the data into two RDDs according to the weights array
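
The difference between `map` and `flatMap` is easiest to see on a tiny input. The following is a plain-Python sketch of the semantics only (no Spark involved); the `lines` list is made-up illustration data.

```python
from itertools import chain

lines = ["a b", "c d e"]

# map: exactly one output element per input element (here, one list per line)
mapped = [line.split() for line in lines]

# flatMap: each input yields an iterable, and all results are concatenated
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
```

With an RDD, the corresponding calls would be `rdd.map(lambda line: line.split())` and `rdd.flatMap(lambda line: line.split())`.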

Common Actions
--------------

Expression                             |Meaning
----------                             |-------
`collect()`                            |Convert RDD to in-memory list 
`take(n)`                              |First n elements of RDD 
`top(n)`                               |Largest n elements of RDD, in descending order
`takeSample(withReplacement, n)`       |Create sample of n elements, with or without replacement
`sum()`                                |Find element sum (assumes numeric elements)
`mean()`                               |Find element mean (assumes numeric elements)
`stdev()`                              |Find element standard deviation (assumes numeric elements)
`takeOrdered(n, function)`             |Smallest n elements, ordered by the value returned by the function
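
`top` and `takeOrdered` are easy to mix up, so here is a plain-Python sketch of what each returns, using `heapq` as a stand-in for the RDD actions. The `data` list is made up for illustration.

```python
import heapq

data = [5, 1, 9, 3, 7]

# Like rdd.top(3): the three largest elements, largest first
top_3 = heapq.nlargest(3, data)

# Like rdd.takeOrdered(3): the three smallest elements, smallest first
ordered_3 = heapq.nsmallest(3, data)

# Like rdd.takeOrdered(3, lambda x: -x): ordering by a key function
ordered_desc = heapq.nsmallest(3, data, key=lambda x: -x)

print(top_3, ordered_3, ordered_desc)
```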

### Example 1: Find all prime numbers between 1 and 100

*First, let's import pyspark.*

In [None]:
import pyspark as ps
import math

In [None]:
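def check_prime(n):
    """
    Checks if a number is prime
    """
    if n < 2:
        return False
    if n == 2:
        # 2 is the only even prime
        return True
    if n % 2 == 0:
        return False
    # Only odd divisors up to sqrt(n) need to be tested
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True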

*Initialize a Spark Context using pyspark.*

In [None]:
sc = ps.SparkContext('local[4]')

*Construct an RDD.*

In [None]:
# parallelize creates an RDD from an iterable
numbers_rdd = sc.parallelize(range(2, 101))

*Use a transformation to filter for primes.*

In [None]:
primes_rdd = numbers_rdd.filter(check_prime)

*Use an action to show the primes we filtered for.*

In [None]:
primes_rdd.collect()[:10]

*In practice, we can chain all of these steps in a single cell.*

In [None]:
sc.parallelize(range(2, 101)).filter(check_prime) \
                             .collect()[:10]
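
One way to sanity-check the Spark output is to compute the same list locally. The sketch below uses a Sieve of Eratosthenes in plain Python; the helper name `primes_up_to` is introduced here for illustration only.

```python
def primes_up_to(limit):
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]  # 0 and 1 are not prime
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            # Cross out every multiple of i, starting at i * i
            for multiple in range(i * i, limit + 1, i):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]

expected_first_10 = primes_up_to(100)[:10]
print(expected_first_10)
```

The first ten entries should start with 2 and end with 29, which gives a reference to compare the collected RDD against.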

# Transformations on (Key, Value) RDDs

Common Pair RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`groupByKey()`                           |Collapse a key value RDD by key, keeping the values for each key in an iterable
`reduceByKey(some function)`             |Collapse a key value RDD by key, combining the values for each key with some function
`mapValues(some function)`               |Apply some function to the values of a key value RDD, leaving the keys unchanged
`flatMapValues(some function)`           |Apply some function that returns an iterable to each value, and emit one key value pair per returned element
`keys()`                                 |Returns the keys of a key value RDD
`values()`                               |Returns the values of a key value RDD
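
To make the difference between `groupByKey` and `reduceByKey` concrete, here is a plain-Python sketch of what each one computes. It illustrates the semantics only and says nothing about how Spark distributes the work; the `pairs` list is made-up data.

```python
from collections import defaultdict
from functools import reduce

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# groupByKey: collect every value for a key into one iterable
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# reduceByKey(f): fold the values for each key pairwise with f
add = lambda v1, v2: v1 + v2
reduced = {key: reduce(add, values) for key, values in grouped.items()}

print(dict(grouped))
print(reduced)
```

In Spark, `reduceByKey` combines values within each partition before shuffling, so it usually moves less data across the cluster than `groupByKey` followed by a sum.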

### Example 2: Using the sales data below, let's find the total dollars sold for each product.

In [None]:
%%writefile sales.txt
#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
106    11/19/2014     202     CA     331        330.00
103    11/17/2014     101     WA     373        750.00
105    11/19/2014     202     CA     321        200.00
107    11/20/2014     700     OR     329        400.00

*We will construct an RDD using textFile.*

In [None]:
sales_rdd = sc.textFile('sales.txt')

*If we look at the first two lines of the file, we can see that the header row gets imported...*

In [None]:
sales_rdd.map(lambda x: x.split()) \
         .take(2)

*...so let's get rid of it!*

In [None]:
sales_rdd.map(lambda x: x.split()) \
         .filter(lambda x: not x[0].startswith('#')) \
         .take(2)

*We can find the totals by grouping, then summing the values.*

In [None]:
sales_rdd.map(lambda x: x.split()) \
         .filter(lambda x: not x[0].startswith('#')) \
         .map(lambda x: (x[4], float(x[5]))) \
         .groupByKey() \
         .map(lambda kv: (kv[0], sum(kv[1]))) \
         .collect()

*Or we can find them by simply reducing the values by their keys.*

In [None]:
sales_rdd.map(lambda x: x.split()) \
         .filter(lambda x: not x[0].startswith('#')) \
         .map(lambda x: (x[4], float(x[5]))) \
         .reduceByKey(lambda v1, v2: v1 + v2) \
         .collect()
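
As a cross-check, the same totals can be computed without Spark. This is a plain-Python sketch over the contents of `sales.txt` shown above; the variable names are illustrative.

```python
from collections import defaultdict

sales_text = """\
#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
106    11/19/2014     202     CA     331        330.00
103    11/17/2014     101     WA     373        750.00
105    11/19/2014     202     CA     321        200.00
107    11/20/2014     700     OR     329        400.00
"""

totals = defaultdict(float)
for line in sales_text.splitlines():
    fields = line.split()
    if fields[0].startswith('#'):
        continue  # skip the header row
    # fields[4] is the product ID, fields[5] is the amount
    totals[fields[4]] += float(fields[5])

print(dict(totals))
```

The Spark result should contain the same four (product, total) pairs, although the order returned by `collect()` may differ.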

# Transformations on Multiple RDDs 

Common Multiple RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`union(another rdd)`                     |Append the elements of another RDD to the current RDD
`join(another rdd)`                      |Inner join: keep only the keys present in both RDDs
`leftOuterJoin(another rdd)`             |Keep every key of the current RDD; keys missing from the other RDD are paired with None
`rightOuterJoin(another rdd)`            |Keep every key of the other RDD; keys missing from the current RDD are paired with None
`zip(another rdd)`                       |Pair up the elements of two RDDs by position to form a key value RDD
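
The join variants differ only in which keys survive. The following plain-Python sketch mimics the output shape of `join`, `leftOuterJoin`, and `zip` on made-up data; it is an illustration of the semantics, not of Spark's implementation.

```python
left = [("a", 1), ("b", 2)]
right = [("a", "x"), ("c", "y")]

# Index the right-hand side by key
right_lookup = {}
for key, value in right:
    right_lookup.setdefault(key, []).append(value)

# join: keep only keys present on both sides
inner = [(k, (v, w)) for k, v in left for w in right_lookup.get(k, [])]

# leftOuterJoin: keep every left key; missing matches become None
left_outer = [(k, (v, w)) for k, v in left for w in right_lookup.get(k, [None])]

# zip: pair elements by position, not by key
zipped = list(zip([1, 2, 3], ["x", "y", "z"]))

print(inner)
print(left_outer)
print(zipped)
```

Note that Spark's `zip` expects both RDDs to have the same number of partitions and the same number of elements in each partition.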

### Example 3: Use the customer data below with the sales data to find the average amount sold per customer in each store.

In [None]:
%%writefile customers.txt
#Store   Customers
100      50
700      14
203      25
202      30
101      10
202      40
700      20

In [None]:
customer_rdd = sc.textFile('customers.txt')

*First, let's calculate the total customers for each store.*

In [None]:
total_cust_rdd = customer_rdd.map(lambda x: x.split()) \
                             .filter(lambda x: not x[0].startswith('#')) \
                             .map(lambda x: (x[0], float(x[1]))) \
                             .reduceByKey(lambda v1, v2: v1 + v2)

*Next, let's calculate the total amount purchased for each store.*

In [None]:
total_sales_rdd = sales_rdd.map(lambda x: x.split()) \
                           .filter(lambda x: not x[0].startswith('#')) \
                           .map(lambda x: (x[2], float(x[5]))) \
                           .reduceByKey(lambda v1, v2: v1 + v2)

*Now we can join the two tables and take the average.*

In [None]:
total_sales_rdd.join(total_cust_rdd) \
               .mapValues(lambda v: v[0] / v[1]) \
               .collect()
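
For reference, here is a plain-Python sketch of the same reduce, join, and divide steps. The `(store, value)` pairs are copied from `sales.txt` and `customers.txt`; the helper `sum_by_key` is a name introduced here for illustration.

```python
from collections import defaultdict

# (store, amount) pairs taken from sales.txt
sales = [("100", 300.0), ("700", 450.0), ("203", 200.0), ("202", 330.0),
         ("101", 750.0), ("202", 200.0), ("700", 400.0)]
# (store, customers) pairs taken from customers.txt
customers = [("100", 50.0), ("700", 14.0), ("203", 25.0), ("202", 30.0),
             ("101", 10.0), ("202", 40.0), ("700", 20.0)]

def sum_by_key(pairs):
    """Plain-Python stand-in for reduceByKey with addition."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals

total_sales = sum_by_key(sales)
total_cust = sum_by_key(customers)

# Inner join on store, then divide total sales by total customers
sales_per_customer = {store: total_sales[store] / total_cust[store]
                      for store in total_sales if store in total_cust}

print(sales_per_customer)
```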

## Pop Quiz: Which state has the highest sales per customer?

### Caching RDDs

You can cache RDDs that you expect to reuse often to speed up your applications. Do so by calling `.persist()` on the RDD.

In [None]:
cached_rdd = customer_rdd.map(lambda x: x.split()) \
                         .filter(lambda x: not x[0].startswith('#')) \
                         .persist()

# Machine Learning in Spark with MLlib

Spark has implementations of most of the common machine learning algorithms, such as:

* Statistical Tests
* Classification and regression
    * Linear models (SVMs, logistic regression, linear regression)
    * Naive Bayes
    * Decision Trees
    * Ensembles of Trees (Random Forests and Gradient-Boosted Trees)
* Collaborative filtering
    * Alternating Least Squares (ALS or set to non-negative for NMF)
* Clustering
    * K-means
* Dimensionality reduction
    * Singular Value Decomposition (SVD)
    * Principal Component Analysis (PCA)

For a full list of implementations, please reference the documentation at http://spark.apache.org/docs/latest/mllib-guide.html.

## Vectors and Matrices

You can create dense or sparse Vectors and Matrices in Spark, but they are local types and not RDDs. As a result, if you create an RDD of matrices, it is essentially an array of matrices. In most cases, you will only be using Vectors.

```python
import numpy as np
from pyspark.mllib.linalg import Matrices, Vectors

Vectors.dense([1, 2, 4]) # Creates [1, 2, 4]
Vectors.sparse(3, [0, 2], [1, 4]) # Creates [1, 0, 4]
Matrices.dense(2, 2, np.array([1, 2, 3, 4])) # Values are column-major: creates [[1, 3], [2, 4]]
Matrices.sparse(2, 2, [0, 1, 2], [0, 1], [1, 1]) # Creates [[1, 0], [0, 1]]
```
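
The matrix constructors take their values in column-major order, and the sparse form uses compressed-sparse-column (CSC) arrays: column pointers, row indices, and values. The plain-Python sketch below expands both layouts into rows; `dense_rows` and `csc_rows` are helper names introduced here for illustration and are not part of Spark.

```python
def dense_rows(num_rows, num_cols, values):
    """Lay out a flat, column-major list of values as a list of rows."""
    return [[values[col * num_rows + row] for col in range(num_cols)]
            for row in range(num_rows)]

def csc_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand compressed-sparse-column arrays into a list of rows."""
    rows = [[0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        # Entries col_ptrs[col] .. col_ptrs[col + 1] - 1 belong to this column
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            rows[row_indices[k]][col] = values[k]
    return rows

dense = dense_rows(2, 2, [1, 2, 3, 4])
sparse = csc_rows(2, 2, [0, 1, 2], [0, 1], [1, 1])

print(dense)
print(sparse)
```

Because the values fill columns first, `[1, 2, 3, 4]` in a 2 x 2 matrix puts 1 and 2 in the first column and 3 and 4 in the second.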


## Supervised Models and LabeledPoint

Supervised models in Spark require an RDD of LabeledPoints, where each observation is a label paired with a feature vector. LabeledPoints can be instantiated by:

```python
from pyspark.mllib.regression import LabeledPoint

# Assumes the label is at index 0 and the features are the remaining entries;
# adjust the indices to match your data
some_data_rdd.map(lambda x: LabeledPoint(x[0], x[1:]))
```

To extract the labels and features, use map on each point.

```python
labels = some_labelpoints_rdd.map(lambda x: x.label)
features = some_labelpoints_rdd.map(lambda x: x.features)
```

## StandardScaler

Several models in Spark are sensitive to feature scaling, so we will often need StandardScaler to standardize the features. StandardScaler requires an RDD of feature vectors, and can be used as follows:

```python
from pyspark.mllib.feature import StandardScaler

scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaler.transform(features)
```
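
With `withMean=True` and `withStd=True`, each feature column is centered and then divided by its standard deviation. The sketch below shows that arithmetic for a single column in plain Python. It assumes the sample standard deviation (dividing by n - 1), which is my understanding of what Spark uses; treat that detail as an assumption. The function name `standardize` is illustrative.

```python
import math

def standardize(column):
    """Center a list of numbers and scale it to unit sample standard deviation."""
    n = len(column)
    mu = sum(column) / float(n)
    # Assumption: unbiased sample variance (divide by n - 1)
    variance = sum((x - mu) ** 2 for x in column) / (n - 1)
    sigma = math.sqrt(variance)
    return [(x - mu) / sigma for x in column]

scaled = standardize([2.0, 4.0, 6.0])
print(scaled)
```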

## Train Test Split

There is no dedicated train/test split function in Spark, but there are RDD transformations that specialize in random sampling and splitting, such as randomSplit.
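
Conceptually, a random split assigns each element to one of the output sets with probability proportional to its weight. The plain-Python sketch below illustrates that idea. It is not Spark's implementation, and `random_split` is a hypothetical helper name.

```python
import random

def random_split(items, weights, seed):
    """Assign each item independently to one split, with probability
    proportional to that split's weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(item)
                break
        else:
            # Guard against floating-point round-off at the upper boundary
            splits[-1].append(item)
    return splits

train_part, test_part = random_split(range(1000), [0.7, 0.3], seed=0)
print(len(train_part), len(test_part))
```

Because each element is assigned at random, the split sizes only approximately match the weights. Fixing the seed makes the split reproducible.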

## Model Evaluation

Spark has an evaluation package that allows you to calculate prediction errors.  Some options available are:

* BinaryClassificationMetrics
* MulticlassMetrics
* RegressionMetrics
* RankingMetrics
* More at https://spark.apache.org/docs/1.5.2/mllib-evaluation-metrics.html

Each of the available metrics classes offers a selection of suitable metrics, such as mean squared error for regression and precision for classification. Each class is constructed from an RDD of tuples that pair a prediction (or score) with its label; for RegressionMetrics the order is (prediction, observation). For example:

```python
from pyspark.mllib.evaluation import RegressionMetrics

metrics = RegressionMetrics(valuesAndPreds)
print(metrics.meanSquaredError)
```
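
Mean squared error itself is simple to compute by hand, which is useful for checking a metrics object on a small sample. The following is a plain-Python sketch; the function name and the four example pairs are made up for illustration.

```python
def mean_squared_error(pairs):
    """Average of squared differences over (prediction, observation) pairs."""
    pairs = list(pairs)
    return sum((p - o) ** 2 for p, o in pairs) / float(len(pairs))

mse = mean_squared_error([(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)])
print(mse)
```

The squared differences here are 0.25, 0.25, 0.0, and 1.0, so the result is their average.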

### Example 4: Using the cars data, let's do a regression to predict MPG.

*First, let's import all the relevant packages.*

In [None]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionModel, LinearRegressionWithSGD
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.evaluation import RegressionMetrics

*Let's create the data RDD.*

In [None]:
cars_rdd = sc.textFile('../data/cars_scrubbed.csv')

*Now let's transform our data into an RDD of LabeledPoints.*

In [None]:
cars_labeledpoints = cars_rdd.map(lambda x: x.split(',')) \
                             .filter(lambda x: not x[1].startswith('m')) \
                             .map(lambda x: [float(y) for y in x]) \
                             .map(lambda x: LabeledPoint(x[1], x[2:]))

*We need to scale our data. This requires us to extract the labels and features, scale the features, and recreate our LabeledPoint RDD.*

In [None]:
labels = cars_labeledpoints.map(lambda x: x.label)
features = cars_labeledpoints.map(lambda x: x.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
cars_scaled = labels.zip(scaler.transform(features))
cars_scaled = cars_scaled.map(lambda yx: LabeledPoint(yx[0], yx[1]))

*Let's create a train test split.*

In [None]:
training, test = cars_scaled.randomSplit([0.7, 0.3], seed=0)

*Now we can train a Linear Regression Model.*

In [None]:
model = LinearRegressionWithSGD.train(training, iterations=2000, step=0.1, intercept=True, regType='l2')

*Let's verify that our model's predictions look reasonable.*

In [None]:
training.map(lambda x: (x.label, model.predict(x.features))).take(10)

*I want to see mean squared error for both test and training, so let's build a wrapper function.*

In [None]:
def get_mse(rdd):
    # RegressionMetrics expects (prediction, observation) tuples
    predsAndLabels = rdd.map(lambda x: (float(model.predict(x.features)), x.label))
    metrics = RegressionMetrics(predsAndLabels)
    return metrics.meanSquaredError

*Training Error*

In [None]:
get_mse(training)

*Testing Error*

In [None]:
get_mse(test)

### Pop Quiz:  Build a Random Forest model to predict the origin of the car.