# PySpark Tutorial

By: Julien Naegeli 


Spark is a powerful open-source cluster computing framework that makes complex analytics straightforward to express. Apache Spark is an in-memory data processing engine with well-documented, frequently updated APIs for running streaming, machine learning, and SQL workloads on large datasets. PySpark lets us use this Big Data processing framework from Python. Spark is a common choice for Big Data processing largely because of its speed: it keeps intermediate data in memory where possible, whereas Hadoop's MapReduce, a popular alternative, persists data to disk after every map or reduce job. For the purposes of this tutorial, we will focus on how to use it.

This tutorial will teach the basic uses of PySpark, covering the basic data structures, statistics applications, and its Machine Learning Library.

## Installation
1. Download Spark from http://spark.apache.org/downloads.html
2. Unzip the download to this location: ~/spark-2.0.1/
3. Follow these instructions in the command line:
    
```
cd ~/spark-2.0.1/

brew install sbt

sbt assembly
```

## Configuration

1. Create a PySpark Profile for jupyter notebook:
```
ipython profile create pyspark
```
Follow the instructions listed here for setup if you encounter any issues: http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/

## Introduction

Resilient Distributed Datasets (RDDs) are central to Spark's main functionality. An RDD is a distributed collection of items that Spark can process in parallel across a cluster, as the name implies. RDDs let you keep data in memory and write it to disk only when required, which plays a big role in the speed of Spark's execution engine.

There are two ways to create an RDD:

1. Parallelize an existing collection of data into an RDD. 

2. Reference an external dataset from storage.

Below you will find an example of both:

In [1]:
from pyspark import SparkContext

# Initialize Spark Context
sc = SparkContext()

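# 1. Parallelize an existing collection
nums = [1,2,3,4,5,6,7,8,9,10]
dist_nums = sc.parallelize(nums)

# 2. Reference an external dataset (one RDD element per line of the file)
hamlet = sc.textFile("hamlet.txt")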

There are a number of ways to retrieve the contents of an RDD so that they can be printed.

    1. Retrieve a specified number of elements with RDD.take(n)
    2. Retrieve the entire RDD with RDD.collect() (only sensible when the data fits in the driver's memory)

In [2]:
print hamlet.take(10)
print dist_nums.collect()

[u'', u'1604', u'', u'', u'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK', u'', u'', u'by William Shakespeare', u'', u'']
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


After printing the first 10 elements of the hamlet RDD, we see a number of empty strings, which represent empty lines. We can easily filter them out with ```RDD.filter()```:

In [3]:
hamlet = hamlet.filter(lambda x: len(x) > 0)
hamlet.take(5)

[u'1604',
 u'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK',
 u'by William Shakespeare',
 u'Dramatis Personae',
 u'  Claudius, King of Denmark.']

RDDs have operations called actions and transformations.

    1. Actions - Operations that return values.
    2. Transformations - Operations that create new RDDs.

```filter``` was an example of a transformation, as it returned a new RDD with a filtered result.

```take``` and ```collect``` are examples of actions, as they returned the values of the RDDs.

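One of the most powerful yet simple RDD actions is ```RDD.reduce(f)```.

```RDD.reduce(f)``` aggregates the elements of the RDD using a function ```f``` that takes two arguments and returns one. Because Spark combines partial results from different partitions, ```f``` should be commutative and associative. Here are some examples: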

In [250]:
total = dist_nums.reduce(lambda x, y: x + y)
prod  = dist_nums.reduce(lambda x, y: x * y)

print "sum: ", total
print "product: ", prod

sum:  55
product:  3628800
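
To see why the function passed to ```reduce``` should be commutative and associative, here is a minimal plain-Python sketch that needs no Spark installation. It uses ```functools.reduce``` as a stand-in for ```RDD.reduce```, and the "two partitions" are simply two hand-made halves of the list, which is an assumption for illustration rather than how Spark actually splits data.

```python
from functools import reduce  # stand-in for RDD.reduce on a local list

nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Addition and multiplication give the same answer however the
# elements are grouped, so they are safe to use with reduce.
total = reduce(lambda x, y: x + y, nums)
prod = reduce(lambda x, y: x * y, nums)

# Subtraction is not associative: reducing two halves separately and
# then combining them gives a different answer than one left-to-right pass.
subtract = lambda x, y: x - y
left_to_right = reduce(subtract, nums)
split_in_two = subtract(reduce(subtract, nums[:5]), reduce(subtract, nums[5:]))

print("sum: {}".format(total))
print("product: {}".format(prod))
print("subtraction, one pass: {}".format(left_to_right))
print("subtraction, two halves: {}".format(split_in_two))
```

The sum and product match the Spark output above, while the two subtraction results disagree, which is the kind of surprise a non-associative function can cause on a partitioned dataset.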


One of the most powerful RDD transformations is ```RDD.map(f)```.

```RDD.map(f)``` returns a new RDD with each element, ```x```, passed through the function ```f```, as ```f(x)```.

The best way to understand is by example:

In [5]:
doubled = dist_nums.map(lambda x: 2*x)
isEven  = dist_nums.map(lambda x: x % 2 == 0)

print "RDD with values doubled: ", doubled.collect()
print "RDD of if even: ", isEven.collect()

RDD with values doubled:  [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
RDD of if even:  [False, True, False, True, False, True, False, True, False, True]


Putting ```map``` and ```reduce``` together, we have a powerful aggregation tool.

Looking at our hamlet text file, let's calculate its total size as the number of characters across all non-empty lines:

In [6]:
total_size = hamlet.map(lambda line: len(line)).reduce(lambda x, y: x + y)

print "total size: ", total_size

total size:  187271


Here, we mapped the lines to their respective lengths, and then reduced by summing up those lengths.  As you can see, the syntax is simple while the process is powerful.
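
The same two steps can be traced on a local list. This is a small sketch in plain Python, using the first three non-empty lines shown earlier as sample data; the built-in ```map``` and ```functools.reduce``` stand in for the RDD methods.

```python
from functools import reduce

# First three non-empty lines of the play, as printed earlier.
lines = [
    "1604",
    "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK",
    "by William Shakespeare",
]

# Step 1 (map): replace every line with its length.
lengths = list(map(len, lines))

# Step 2 (reduce): add the lengths together pairwise.
total_size = reduce(lambda x, y: x + y, lengths)

print("lengths: {}".format(lengths))
print("total size: {}".format(total_size))
```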

Let's look into some additional useful transformations:

1. ```RDD.reduceByKey(f)``` operates on a dataset of key-value pairs, and combines the values of each key using the specified function ```f``` to return a new dataset of key-value pairs.

2. ```RDD.groupByKey()``` operates on a dataset of key-value pairs, and groups by each key, returning a dataset of keys mapped to a collection of their respective grouped values. It does not take a reducing function.

3. ```RDD.sortByKey()```, as the name suggests, orders the RDD by key. You can specify the order with the (boolean) ```ascending``` parameter.

4. ```RDD.sortBy(f)``` orders the RDD by the value that the function ```f``` returns for each element. You can specify the order with the (boolean) ```ascending``` parameter.
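
The four transformations above can be sketched on a local list of pairs. The helper functions ```reduce_by_key``` and ```group_by_key``` are hypothetical names written for this illustration, not Spark APIs, and the sample pairs are made up; the sketch only mirrors the results the RDD methods produce, not how Spark distributes the work.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 10), ("b", 5)]  # made-up data

def reduce_by_key(pairs, f):
    """Combine the values of each key with f, like RDD.reduceByKey(f)."""
    out = {}
    for key, value in pairs:
        out[key] = f(out[key], value) if key in out else value
    return list(out.items())

def group_by_key(pairs):
    """Collect the values of each key into a list, like RDD.groupByKey()."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

summed = reduce_by_key(pairs, lambda x, y: x + y)
grouped = group_by_key(pairs)

# sortByKey() orders by the first element of each pair.
by_key = sorted(summed, key=lambda kv: kv[0])

# sortBy(f, ascending=False) orders by whatever f returns, largest first.
by_value_desc = sorted(summed, key=lambda kv: kv[1], reverse=True)

print(by_key)
print(by_value_desc)
print(grouped["a"])
```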

#### Using these transformations, let's figure out which character speaks most often!

Here's an example of a line spoken by Francisco:
    
    "  Fran. You come most carefully upon your hour."
    
Looking carefully, we see that each spoken line starts with two spaces ```"  "```, then the abbreviated character name and a period, followed by the line itself.

In [109]:
import re

# First filter to keep only the lines with actual speech!
# (raw string, so \w and \. reach the regex engine unchanged)
filtered = hamlet.filter(lambda line: re.match(r"(^  \w{3,6}\. )", line))
filtered.take(5)

[u"  Ber. Who's there.?",
 u'  Fran. Nay, answer me. Stand and unfold yourself.',
 u'  Ber. Long live the King!',
 u'  Fran. Bernardo?',
 u'  Ber. He.']

In [108]:
# Now, let's map the character name to 1 (so we can add up each occurrence):
paired = filtered.map(lambda line: (re.match(r"(^  \w{3,6}\.)",line).group(0),1))
paired.take(5)

[(u'  Ber.', 1),
 (u'  Fran.', 1),
 (u'  Ber.', 1),
 (u'  Fran.', 1),
 (u'  Ber.', 1)]
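
The pattern can be checked on a few lines without Spark. The sketch below uses sample lines taken from the outputs above; unlike the cell above, it captures only the name inside a group (an optional variation, not the tutorial's approach), so the key is ```Fran``` rather than ```  Fran.```.

```python
import re

# Two leading spaces, a 3-6 character name, a period, then a space.
SPEAKER = re.compile(r"^  (\w{3,6})\. ")

lines = [
    "  Fran. You come most carefully upon your hour.",
    "  Ber. Who's there.?",
    "Dramatis Personae",
    "  Claudius, King of Denmark.",
]

pairs = []
for line in lines:
    match = SPEAKER.match(line)
    if match:  # non-speech lines return None and are skipped
        pairs.append((match.group(1), 1))

print(pairs)
```

The cast-list entry for Claudius is rejected because no period follows a 3-6 character word at the start of the line.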

In [134]:
# Using reduceByKey, let's determine the number of times each character has spoken:
reduced_1 = paired.reduceByKey(lambda x, y: x + y)

# We now expect each element to have this structure: (Character, # of lines)
ordered_1 = reduced_1.sortBy(lambda x: x[1], ascending=False)

# In another fashion: countByKey() is an action that returns a local dict
# of counts, which we then sort on the driver

reduced_2 = sorted(paired.countByKey().items(), key=lambda x: x[1], reverse=True)

print ordered_1.take(5)
print reduced_2[:5]

[(u'  Ham.', 358), (u'  Hor.', 108), (u'  King.', 102), (u'  Pol.', 86), (u'  Queen.', 69)]
[(u'  Ham.', 358), (u'  Hor.', 108), (u'  King.', 102), (u'  Pol.', 86), (u'  Queen.', 69)]


In the above output, we see that Hamlet speaks most often. RDDs made retrieving this information simple, and the same code scales to enormous amounts of data.

## Additional Important Spark Basics

In [137]:
# Indexing an RDD (ordered_1 is the sorted RDD from the previous cell)
indexed = ordered_1.zipWithIndex()

# Partitioning an RDD
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]
dist_nums = sc.parallelize(nums,3) # 3 Partitions 

# mapPartitions() allows us to map separately on each partition
def f(iterator):
    yield sum(iterator)
    
print indexed.take(5)
print dist_nums.mapPartitions(f).collect()

[((u'  Ham.', 358), 0), ((u'  Hor.', 108), 1), ((u'  King.', 102), 2), ((u'  Pol.', 86), 3), ((u'  Queen.', 69), 4)]
[6, 15, 24]


In the above output, you can see the ease with which we can index each element within an RDD.

We can also take advantage of Spark's distributed nature by applying a function to each partition as a whole. Above, we summed the values in each of the three partitions.
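
The per-partition sums can be reproduced locally. In this sketch, ```split_into_partitions``` is a hypothetical helper that cuts a list into contiguous, roughly equal slices; it is an assumption about how the nine numbers end up grouped, chosen because it matches the output above, not a description of Spark's internals.

```python
def split_into_partitions(items, num_partitions):
    """Cut items into contiguous, roughly equal slices (illustrative only)."""
    n = len(items)
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

def sum_partition(iterator):
    # Same shape as the function passed to mapPartitions: it receives an
    # iterator over one partition and yields that partition's results.
    yield sum(iterator)

nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]
partitions = split_into_partitions(nums, 3)

results = []
for part in partitions:
    results.extend(sum_partition(iter(part)))

print(partitions)
print(results)
```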

## Apache Spark Machine Learning Library

Now that we've learned the basic uses of Spark and its data structures, let's get into some more complex tasks. Many of the models that we have applied in class are readily available within Spark's Machine Learning Library. In the cells below, we will learn how to use some basic models.

In order to do this we will look at a more interesting dataset. The dataset we will be using has just one predictor and one response variable. We will be looking at the relationship between a person's weight and their systolic blood pressure.

To prepare for modeling, we must understand a few new data types that are introduced with this library:

1. Local Vector
2. Labeled Point
3. Local Matrix

We will show the use of vectors and matrices in the cell below to better understand them:

In [116]:
from pyspark.mllib.linalg import Vectors, Matrix, Matrices

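# Local Vectors
dense_v  = [1.0, 2.0, 3.0]                        # a plain list is accepted as a dense vector
sparse_v = Vectors.sparse(3, [1, 2], [3.0, 4.0])  # size, indices, values

# Local Matrices
dense_m  = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])  # values in column-major order
# numRows, numCols, column pointers, row indices, values (CSC format);
# row indices must lie in 0..2 and the last column pointer equals the number of values
sparse_m = Matrices.sparse(3, 2, [0, 1, 3], [0, 1, 2], [2, 3, 4])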
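
To make the sparse and column-major layouts concrete, here is a plain-Python sketch that expands them into ordinary lists. The three helper functions are hypothetical names written for this illustration, and the sparse matrix values are illustrative ones chosen for the example.

```python
def sparse_vector_to_dense(size, indices, values):
    """Expand (size, indices, values) into a full list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense

def column_major_to_rows(num_rows, num_cols, values):
    """Lay out a flat column-major list of values as a list of rows."""
    return [[float(values[c * num_rows + r]) for c in range(num_cols)]
            for r in range(num_rows)]

def csc_to_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand compressed sparse column (CSC) arrays into a list of rows."""
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        # Entries col_ptrs[col] .. col_ptrs[col + 1] - 1 belong to this column.
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            rows[row_indices[k]][col] = float(values[k])
    return rows

dense_from_sparse = sparse_vector_to_dense(3, [1, 2], [3.0, 4.0])
dense_rows = column_major_to_rows(3, 2, [1, 2, 3, 4, 5, 6])
sparse_rows = csc_to_rows(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])

print(dense_from_sparse)
print(dense_rows)
print(sparse_rows)
```

Note that the flat list ```[1, 2, 3, 4, 5, 6]``` fills the first column before the second, so the first row of the 3x2 matrix is ```[1.0, 4.0]```.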

In this specific example, weight is our only predictor and systolic blood pressure is our response. Let's start by reading in and structuring the data. 

In [81]:
data = sc.textFile("weightVsBP.csv") # Read in data

data = data.map(lambda line: line.split(",")) # Split each row into its fields

header = data.first()
sbp = data.filter(lambda line: line != header) # Take out the header from the csv

print sbp.collect() # Look at our data

[[u'165', u'130'], [u'167', u'133'], [u'180', u'150'], [u'155', u'128'], [u'212', u'151'], [u'175', u'146'], [u'190', u'150'], [u'210', u'140'], [u'200', u'148'], [u'149', u'125'], [u'158', u'133'], [u'169', u'135'], [u'170', u'150'], [u'172', u'153'], [u'159', u'128'], [u'168', u'132'], [u'174', u'149'], [u'183', u'158'], [u'215', u'150'], [u'195', u'163'], [u'180', u'156'], [u'143', u'124'], [u'240', u'170'], [u'235', u'165'], [u'192', u'160'], [u'187', u'159']]


A ```LabeledPoint``` is a combination of a label/response and a local vector which can be either dense or sparse. The label/response will be the value of interest that the model will look to predict, whereas the values within the vectors are the predictors. LabeledPoints are heavily used by the learning algorithms within Spark's Machine Learning Library (MLlib).

Now that we know what our data looks like, we will create our LabeledPoints.

In [113]:
from pyspark.mllib.regression import LabeledPoint

# Creating just one LabeledPoint as follows
practice_LP = LabeledPoint(0,[1,2,3,4])   # dense
practice_LP_2 = LabeledPoint(4, sparse_v) # sparse

# The predictors can be extracted as follows
predictors  = practice_LP.features
response    = practice_LP.label
print "Predictors: ", predictors
print "Response: ", response

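# Build a LabeledPoint for every row (use sbp, which excludes the header row).
# As written, column 0 (weight) becomes the label and column 1 (blood pressure)
# the single feature; swap the indices to predict blood pressure from weight.
lps = sbp.map(lambda row: LabeledPoint(float(row[0]), [float(row[1])]))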

print lps.take(5)

Predictors:  [1.0,2.0,3.0,4.0]
Response:  0.0
[LabeledPoint(1.0, [10.0]), LabeledPoint(2.0, [20.0]), LabeledPoint(5.0, [33.0])]


We now have an RDD of LabeledPoints and everything we need to start making some simple predictions through Spark's MLlib. Let's start by using a Linear Regression Model.

In [106]:
from pyspark.mllib.regression import LinearRegressionWithSGD

# Create the model
model = LinearRegressionWithSGD.train(lps, iterations=100, step=0.0001)

# Predict
predictions = lps.map(lambda p: (p.label, float(model.predict(p.features))))
print "(Actual, Prediction)", predictions.take(5), "\n"

# Calculate the MSE
squared = predictions.map(lambda vp: (vp[0] - vp[1])**2)
summed  = squared.reduce(lambda x, y: x + y)
MSE = summed / predictions.count()

print "Mean Squared Error: ", MSE

 (Actual, Prediction) [(165.0, 163.0348896074737), (167.0, 166.79723321380004), (180.0, 188.11718031631582), (155.0, 160.5266605365895), (212.0, 189.37129485175794)] 

Mean Squared Error:  241.765928817


As you can see, we have created a Linear Regression model with Spark in just a few lines. LabeledPoints simplify the process immensely, especially when there is a large number of predictors.

### Evaluation Metrics

In the above linear regression model, we calculated the Mean Squared Error by hand using ```map``` and ```reduce```. Spark also provides evaluation metrics classes that calculate these kinds of metrics for us.

The following metrics are available to be calculated using the Linear Regression Model we created above:

1. Mean Squared Error (MSE)
2. Root Mean Squared Error (RMSE)
3. Mean Absolute Error (MAE)
4. Coefficient of Determination (R2)
5. Explained Variance

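These are computed below using ```RegressionMetrics```. The input to the metrics object is an RDD of pairs, such as the variable ```predictions``` we made above. Note that MLlib documents the expected order as (prediction, observation), while ```predictions``` holds (label, prediction). MSE, RMSE, and MAE are symmetric and come out the same either way, but R2 and explained variance depend on the order, so swap the elements of each pair if you need those two values to follow the documented convention.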

In [117]:
from pyspark.mllib.evaluation import RegressionMetrics

# Create metrics object
metrics = RegressionMetrics(predictions)

print "MSE:", metrics.meanSquaredError
print "RMSE:", metrics.rootMeanSquaredError
print "R-squared:", metrics.r2
print "MAE:", metrics.meanAbsoluteError
print "Explained variance:", metrics.explainedVariance

MSE: 241.765928817
RMSE: 15.5488240332
R-squared: 0.112284688151
MAE: 12.4803510021
Explained variance: 588.974530086
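
For reference, the first four metrics can be written out by hand. This sketch uses the standard textbook definitions on made-up (label, prediction) pairs; it is not claimed to reproduce MLlib's implementation, and explained variance is left out because its definition varies between libraries.

```python
import math

def regression_metrics(pairs):
    """Compute MSE, RMSE, MAE and R2 from (label, prediction) pairs."""
    n = float(len(pairs))
    labels = [label for label, _ in pairs]
    mean_label = sum(labels) / n

    squared_errors = [(label - pred) ** 2 for label, pred in pairs]
    mse = sum(squared_errors) / n
    mae = sum(abs(label - pred) for label, pred in pairs) / n

    # R2 compares the model's squared error with that of always
    # predicting the mean label.
    ss_tot = sum((label - mean_label) ** 2 for label in labels)
    r2 = 1.0 - sum(squared_errors) / ss_tot

    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

pairs = [(3.0, 2.5), (5.0, 5.0), (7.0, 8.0)]  # made-up (label, prediction) data
result = regression_metrics(pairs)

for name in ("mse", "rmse", "mae", "r2"):
    print("{}: {:.4f}".format(name, result[name]))
```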


### Calculating Statistics on RDDs with MLlib

Spark's Machine Learning Library also provides an API for calculating statistics with RDDs. Let's take a look at the possible options we have.

#### Column Statistics:

```.colStats()``` provides various statistics for the columns of a matrix. This matrix can be represented by an RDD. Let's revisit our dataset with systolic blood pressure and calculate some interesting statistics.

In [85]:
from pyspark.mllib.stat import Statistics

matrix = sbp # same data that was used above

summary = Statistics.colStats(matrix)

print "Maximum: ", summary.max()
print "Minimum: ", summary.min()
print "Mean: ", summary.mean()
print "Variance: ", summary.variance()
print "Amount of Nonzeroes: ", summary.numNonzeros()

Maximum:  [ 240.  170.]
Minimum:  [ 143.  124.]
Mean:  [ 182.42307692  145.61538462]
Variance:  [ 612.49384615  180.08615385]
Amount of Nonzeroes:  [ 26.  26.]


#### Calculating Correlation:

Here we will look at how we can calculate the correlation between two sequences of data.

```.corr()``` is the provided function to calculate correlation.

In [115]:
weight = sbp.map(lambda x: x[0])
blood_pressure = sbp.map(lambda x: x[1])

print "Correlation: ", Statistics.corr(weight, blood_pressure, method="pearson"), "\n"

# If we had an RDD of vectors, .corr() would calculate the correlation between each pair of columns.

mat = sc.parallelize(
    [[1.0, 10.0, 100.0], 
     [2.0, 20.0, 200.0], 
     [5.0, 33.0, 366.0]])

print "Correlation Matrix:" 
print Statistics.corr(mat, method="pearson")

Correlation:  0.773490300531 

Correlation Matrix:
[[ 1.          0.97888347  0.99038957]
 [ 0.97888347  1.          0.99774832]
 [ 0.99038957  0.99774832  1.        ]]
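
The Pearson coefficient itself is short enough to write by hand. The sketch below applies the usual formula, $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$, to the first two columns of ```mat``` from the cell above; ```pearson``` is a hypothetical helper, not an MLlib function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First and second columns of the 3x3 example matrix.
col_a = [1.0, 2.0, 5.0]
col_b = [10.0, 20.0, 33.0]

r = pearson(col_a, col_b)
print("Pearson r: {:.6f}".format(r))
```

The result should agree with the off-diagonal entry in the first row and second column of the correlation matrix printed above (about 0.9789).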


## Classification - Real World Example

For this example, we will be working with a dataset that contains every single shot Kobe Bryant (Los Angeles Lakers Basketball Player) took throughout his career. Each shot has a number of attributes including shot type, distance, location, time remaining, opponent, and more. We will use this data to predict whether or not his shots went in! Now, let's put together the concepts we have learned so far and make some predictions.

This is clearly a classification problem, and for this reason we will use Spark's Logistic Regression Model. Since this is a real-world dataset, however, the data requires a bit of cleaning and structuring to prepare it for the model. Below, I have simply converted each categorical string variable to a numeric category index.

In [255]:
k_data = sc.textFile("kobe-shots.csv") # Read in data
k_data = k_data.map(lambda line: line.split(",")) # Split each row into its fields

header = k_data.first()
k_data = k_data.filter(lambda line: line != header) # Take out the header from the csv
k_data = k_data.map(lambda row: row[:7])

# Distinct values in each attribute

distinct_shot_type = k_data.map(lambda x: x[0]).distinct()
distinct_area = k_data.map(lambda x: x[1]).distinct()
distinct_distance = k_data.map(lambda x: x[2]).distinct()
distinct_season = k_data.map(lambda x: x[4]).distinct()

# Create a dictionary of attributes mapped to their category number

shot_type_map = dict(distinct_shot_type.zipWithIndex().collect()) # Remember zipWithIndex?
area_map = dict(distinct_area.zipWithIndex().collect())
distance_map = dict(distinct_distance.zipWithIndex().collect())
season_map = dict(distinct_season.zipWithIndex().collect())

kobe = k_data.map(lambda row: [shot_type_map[row[0]], area_map[row[1]], distance_map[row[2]], 
                                 row[3], season_map[row[4]], row[5], row[6]])

print "Row from pre-cleaned dataset: ", k_data.take(1)
print "Row from new dataset: ", kobe.take(1)

Row from pre-cleaned dataset:  [[u'Jump Shot', u'Left Side(L)', u'8-16 ft.', u'10', u'2000-01', u'0', u'0']]
Row from new dataset:  [[43, 2, 0, u'10', 6, u'0', u'0']]
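
The encoding step amounts to building a lookup table per column. Here is a plain-Python sketch of the idea; ```build_index``` is a hypothetical helper, and apart from the first row (taken from the output above) the sample rows are made up. The integers Spark assigns will generally differ, since the order of ```distinct()``` results is not guaranteed.

```python
def build_index(values):
    """Map each distinct value to an integer, in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index

rows = [
    ["Jump Shot", "Left Side(L)", "8-16 ft."],        # from the output above
    ["Layup Shot", "Center(C)", "Less Than 8 ft."],   # made-up row
    ["Jump Shot", "Center(C)", "8-16 ft."],           # made-up row
]

# One lookup table per categorical column.
shot_type_map = build_index(row[0] for row in rows)
area_map = build_index(row[1] for row in rows)
distance_map = build_index(row[2] for row in rows)

# Replace every string with its category number.
encoded = [[shot_type_map[r[0]], area_map[r[1]], distance_map[r[2]]]
           for r in rows]

print(shot_type_map)
print(encoded)
```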


As you can see, all of the categorical variables are indexed and ready to be given to the model. As we did with our Linear Regression Model, we will create our RDD of LabeledPoints, train the model, and predict.

In [256]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

labeled_data = kobe.map(lambda row: LabeledPoint(row[6],row[:6]))

model = LogisticRegressionWithLBFGS.train(labeled_data)

predictions = labeled_data.map(lambda p: (float(p.label), float(model.predict(p.features))))

Now that we've made our predictions, let's calculate our metrics. Let's put the skills that we've learned thus far to use:

1. Calculate the training error using RDD actions and transformations.
2. Calculate metrics using MLlib's Evaluation Metrics

In [249]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

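# Fraction of rows where the predicted class differs from the actual label
trainingError = predictions.filter(lambda vp: vp[0] != vp[1]).count() / float(labeled_data.count())

# Note: MLlib documents this input as (score, label) pairs; swap the tuple
# elements to follow that convention. The call below keeps the
# (label, prediction) order of our predictions RDD.
metrics = BinaryClassificationMetrics(predictions) # Metrics Object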

print "Training Error: ", trainingError
print "Area under precision-recall curve: ", metrics.areaUnderPR
print "Area under ROC curve: ", metrics.areaUnderROC

Training Error:  0.375218897148
Area under precision-recall curve:  0.557574019457
Area under ROC curve:  0.626559120217
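
The training error is simply the share of misclassified rows. A minimal sketch on made-up (label, prediction) pairs, using a list comprehension in place of ```RDD.filter``` and ```len``` in place of ```count```:

```python
# Made-up (label, prediction) pairs for a binary classifier.
pairs = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]

# Keep only the pairs where the prediction disagrees with the label.
wrong = [p for p in pairs if p[0] != p[1]]

# float() guards against integer division on Python 2.
training_error = len(wrong) / float(len(pairs))
accuracy = 1.0 - training_error

print("Training error: {}".format(training_error))
print("Accuracy: {}".format(accuracy))
```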


This real-world example gives a good idea of how Spark can make complex tasks rather simple to accomplish. Spark's RDD actions, RDD transformations, LabeledPoints, models, and evaluation metrics all came together in it.

## Conclusion

Now that we have seen the various data structures, data types, applications, and uses of PySpark, you should feel confident applying these skills to more complicated tasks! The Linear and Logistic Regression Models in this notebook were kept simple to help in understanding Spark's applications as a whole, and more complicated ones shouldn't be much harder to implement. With the basics down, you are prepared to start processing massive amounts of data and extracting meaningful information from it using Spark and Spark MLlib!