In [1]:
from pyspark import SparkContext
sc = SparkContext("local", "count app")


### Example - Time between earthquakes

Suppose that earthquakes of a certain magnitude in a specific region can be modeled as a Poisson process with a mean of $\lambda = 4.5$ earthquakes per day.  Let $X$ be the time (in days) until the third earthquake.  It can be shown that $X$ has a Gamma distribution with shape $\alpha = 3$ (number of events) and scale $\beta = \frac{1}{\lambda}=\frac{1}{4.5}$ (average time between consecutive earthquakes), so the average time until the third earthquake is $\alpha\beta = \frac{3}{4.5}$ days.  We can use Python's `random.gammavariate` to simulate the distribution.
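
As a rough sanity check on the Gamma claim, the sketch below builds the time until the third earthquake by summing three exponential waiting times and compares its sample mean with direct `gammavariate` draws. The seed, the sample size `n`, and the variable names are assumptions chosen for illustration only.

```python
import random

random.seed(0)   # fixed seed (an assumption, only for reproducibility)
rate = 4.5       # lambda: earthquakes per day
n = 100_000      # illustrative sample size

# Waiting time until the 3rd event = sum of 3 exponential gaps with rate lambda
sum_of_gaps = [sum(random.expovariate(rate) for _ in range(3)) for _ in range(n)]

# Direct draws from a Gamma with shape 3 and scale 1/lambda
gamma_draws = [random.gammavariate(3, 1 / rate) for _ in range(n)]

mean_gaps = sum(sum_of_gaps) / n
mean_gamma = sum(gamma_draws) / n

# Both sample means should land near the theoretical mean alpha * beta = 3 / 4.5
print(mean_gaps, mean_gamma, 3 / rate)
```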

In [2]:
from random import gammavariate
?gammavariate

In [3]:
from composable.sequence import head
N = 1000000
time_between_3_quakes = [gammavariate(3,1/4.5) for i in range(N)]
time_between_3_quakes >> head(5)

[0.926665593480046,
 0.7737326767690611,
 0.7445581120773845,
 0.6615235213640299,
 0.8333565811335074]

## Three `for` loop patterns

Almost all `for` loops reinvent one of the following patterns.

1. **Map**ping a function/transformation onto each value.
2. **Filter**ing the values by some boolean condition.
3. **Reduc**ing the values to one or more statistics.
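
Before the earthquake examples, here is a minimal sketch of the three patterns using Python's built-in `map` and `filter` together with `functools.reduce`. The list `times` holds made-up values and is not the simulated data.

```python
from functools import reduce

times = [0.5, 1.25, 0.75, 2.0]  # made-up values, in days

hours = list(map(lambda t: t * 24, times))       # map: transform each value
short = list(filter(lambda t: t < 1, times))     # filter: keep values that pass a test
longest = reduce(lambda m, t: max(m, t), times)  # reduce: collapse to one statistic

print(hours, short, longest)
```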

### Map example - Convert the times from days to hours.

In [4]:
# Loop solution
time_in_hours = []
for t in time_between_3_quakes:
    time_in_hours.append(t*24)
time_in_hours >> head(5)

[22.239974243521104,
 18.569584242457466,
 17.86939468985723,
 15.876564512736717,
 20.00055794720418]

In [5]:
# Comprehension solution
time_in_hours = [t*24 for t in time_between_3_quakes]
time_in_hours >> head(5)

[22.239974243521104,
 18.569584242457466,
 17.86939468985723,
 15.876564512736717,
 20.00055794720418]

In [6]:
# With pipeable functions
from composable.strict import map

(time_between_3_quakes
 >> map(lambda t: t*24)
 >> head(5)
)

[22.239974243521104,
 18.569584242457466,
 17.86939468985723,
 15.876564512736717,
 20.00055794720418]

### Filter example - Keep only the values less than 1 day.

In [7]:
# loop solution
less_than_1_day = []
for t in time_between_3_quakes:
    if t < 1:
        less_than_1_day.append(t)
less_than_1_day >> head(5)

[0.926665593480046,
 0.7737326767690611,
 0.7445581120773845,
 0.6615235213640299,
 0.8333565811335074]

In [8]:
# comprehension solution
less_than_1_day = [t for t in time_between_3_quakes if t < 1]
less_than_1_day >> head(5)

[0.926665593480046,
 0.7737326767690611,
 0.7445581120773845,
 0.6615235213640299,
 0.8333565811335074]

In [9]:
# pipeable functions
from composable.strict import filter

(time_between_3_quakes
 >> filter(lambda t: t < 1)
 >> head(5)
)

[0.926665593480046,
 0.7737326767690611,
 0.7445581120773845,
 0.6615235213640299,
 0.8333565811335074]

### Reduce Example - Accumulating the maximum

In [10]:
## Loop solution
max_time = 0 # safe since Gamma is non-negative
for t in time_between_3_quakes:
    max_time = max(max_time, t) # update step
max_time

4.629679952322248

In [11]:
# Functional solution
from functools import reduce

reduce(lambda m, t: max(m, t), time_between_3_quakes, 0)

4.629679952322248

In [12]:
# Pipeable solution
from composable import pipeable

update_max = lambda m, t: max(m, t)

@pipeable
def my_reduce(func, xs, init = None):
    if init is None:
        return reduce(func, xs) # Uses first value as init
    else:
        return reduce(func, xs, init)

In [13]:
# with init = 0
(time_between_3_quakes
 >> my_reduce(update_max, init = 0)
)

4.629679952322248

In [14]:
# with init = first value
(time_between_3_quakes
 >> my_reduce(update_max)
)

4.629679952322248

## So Iverson, why are you such a piping fanboi!?!??

Legos

In [15]:
# Find the number of times less than 1 hour
(time_between_3_quakes
 >> map(lambda t: 24*t)
 >> filter(lambda t: t < 1)
 >> my_reduce(lambda cnt, t: cnt + 1, init = 0)
)

952
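
To see why the pieces snap together, here is a minimal sketch of how a `>>` pipeline can work in plain Python. The `Pipe` class and the helpers `pmap`, `pfilter`, and `preduce` are hypothetical stand-ins written for this illustration; they are not the `composable` API, which may be implemented differently.

```python
from functools import reduce

class Pipe:
    """Hypothetical stand-in for a pipeable function (not the composable API)."""
    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python calls this for `data >> pipe` when `data` has no `>>` of its own
        return self.func(data)

def pmap(f):
    return Pipe(lambda xs: [f(x) for x in xs])

def pfilter(pred):
    return Pipe(lambda xs: [x for x in xs if pred(x)])

def preduce(f, init):
    return Pipe(lambda xs: reduce(f, xs, init))

days = [0.01, 0.5, 0.03, 1.2]  # made-up times, in days

# Each stage takes the previous stage's output, so stages stack like bricks
count = (days
         >> pmap(lambda t: 24 * t)
         >> pfilter(lambda t: t < 1)
         >> preduce(lambda cnt, t: cnt + 1, 0))
print(count)
```

Because every stage consumes a list and returns a value, adding another step is just one more `>>` line.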



### <font color="red"> Task 6.1.1 </font>

Explain

> You can stack them up and keep building something bigger

## So ... about those loops ...

<img src="./img/no_more_for_loops.png"/>

### Loops don't work well on multiple or multi-core machines

<img src="./img/loop_problems.png">

### What about functions?

* Using [lambda calculus](https://en.wikipedia.org/wiki/Lambda_calculus) we can show that a pure functional program that terminates gives the same result regardless of the order in which its expressions are evaluated.
* This explains why `pyspark` uses functional idioms like `map`, `filter`, and `reduce`.
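
One caveat for `reduce`: the result is order-independent only when the combining function is associative (Spark's `RDD.reduce` also asks for it to be commutative). The sketch below imitates a partitioned reduce in plain Python; the three-way split and the helper `partitioned_reduce` are assumptions made for illustration, not how Spark is implemented.

```python
from functools import reduce
from operator import add

ones = [1] * 10                               # ten items to count
partitions = [ones[:4], ones[4:7], ones[7:]]  # hypothetical split across 3 workers

def partitioned_reduce(func, parts):
    """Reduce each partition separately, then reduce the partial results."""
    partials = [reduce(func, part) for part in parts]
    return reduce(func, partials)

# Associative and commutative: any grouping or ordering gives the same count
good = partitioned_reduce(add, partitions)
good_reversed = partitioned_reduce(add, partitions[::-1])

# Not associative: treats the second argument as "one more item",
# so partial counts from other partitions are each counted as 1
bad = partitioned_reduce(lambda cnt, t: cnt + 1, partitions)

print(good, good_reversed, bad)
```

The counting lambda `lambda cnt, t: cnt + 1` is fine in a sequential reduce with `init = 0`, but in a partitioned setting it is safer to map each value to `1` and combine with addition.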

##  The `pyspark.RDD` 

> A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned  collection of elements that can be operated on in parallel.

In [16]:
times_RDD = sc.parallelize(time_between_3_quakes)

In [18]:
# Find the number of times less than 1 hour
(times_RDD
 .map(lambda t: 24*t)
 .filter(lambda t: t < 1)
 .map(lambda t: 1)
 .cache()
 .reduce(lambda a, b: a + b)  # sum the 1s; RDD.reduce needs an associative, commutative function
)

952

<font color="red"><h2> Task 6.1.2 </h2></font>

Use our three functional idioms to compute the average time (in seconds) of all times greater than 1 hour, in two ways.

1. In Python using the various `pipeable` functions presented earlier.
2. Using `pyspark` RDDs.

**Hint 1.** Computing the mean requires that we (A) compute both the total and count, then (B) divide.

**Hint 2.** Allow yourself two passes through the data for both 1. and 2.

In [25]:
# pipeable function solution
from operator import add

filtered_time_seconds  = (time_between_3_quakes
 >> filter(lambda t: 24*t > 1)
 >> map(lambda t: t*24*3600)
)

total = filtered_time_seconds >> my_reduce(add)
count = filtered_time_seconds >> my_reduce(lambda x,y: x+1, init = 0)
total / count

57649.43885130581

In [28]:
# pyspark RDD solution

filtered_Rdd = (times_RDD
 .filter(lambda t: t*24 > 1)
 .map(lambda t: t*24*3600)
)

total_rdd = filtered_Rdd.reduce(add)
count_rdd = filtered_Rdd.map(lambda x:1).reduce(add)
total_rdd / count_rdd

57649.43885130581
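
Both solutions above take two passes, one for the total and one for the count. As an alternative, the sketch below gets both in a single pass by reducing with a `(total, count)` tuple as the accumulator. The list `seconds` holds made-up values, and `step` is a name chosen for this illustration.

```python
from functools import reduce

seconds = [4000.0, 5000.0, 9000.0]  # made-up times, in seconds

def step(acc, x):
    """Fold one value into the running (total, count) pair."""
    total, count = acc
    return (total + x, count + 1)

# The initial accumulator (0.0, 0) means "nothing summed, nothing counted yet"
total, count = reduce(step, seconds, (0.0, 0))
mean = total / count
print(mean)
```

For an RDD, a similar idea would likely need `RDD.aggregate`, which takes a separate function for merging the per-partition `(total, count)` pairs.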

<font color="red"><h2> Task 6.1.3 </h2></font>

The variance of a random variable is the average squared difference between $X$ and its mean.  Compute the variance of the times in three ways:

1. In Python using a `for` loop.
2. In Python using the various `pipeable` functions presented earlier.
3. Using `pyspark` RDDs.

**Hint 1.** It can be shown that the mean of our distribution is $\alpha\beta = \frac{3}{4.5}$.

**Hint 2.** Subtract, then square, then average.

**Hint 3.** In this case, we can show $V(X) = \alpha\beta^2 = \frac{3}{4.5^2}$.  Use this to check your approximation.

In [30]:
from statistics import variance

variance(time_between_3_quakes)

0.14825316376314696
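
`statistics.variance` computes the sample variance (it divides by $n - 1$ and uses the sample mean), so it is a check rather than a `for` loop solution. A minimal sketch of the loop version, following the hints and using the known mean $\alpha\beta$, might look like the following. It draws its own smaller sample with a fixed seed so that it runs on its own; the seed, sample size, and variable names are assumptions.

```python
import random

random.seed(1)                  # fixed seed (an assumption, only for reproducibility)
alpha, rate = 3, 4.5
mu = alpha / rate               # known mean: alpha * beta with beta = 1 / rate
sample = [random.gammavariate(alpha, 1 / rate) for _ in range(100_000)]

# Subtract, then square, then average
total_sq_dev = 0.0
for t in sample:
    total_sq_dev += (t - mu) ** 2
var_loop = total_sq_dev / len(sample)

theory = alpha / rate ** 2      # alpha * beta^2 from Hint 3
print(var_loop, theory)         # the two values should be close
```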

In [31]:
# pipeable function solution
mu = 3/4.5

standardized = (time_between_3_quakes
               >> map(lambda t: (t-mu)**2)
               )

total = standardized >> my_reduce(add)
count = standardized >> my_reduce(lambda x,y: x+1, init = 0)
total / count

0.14825301661559392

In [33]:
# pyspark RDD solution
standardizedRDD = (times_RDD.map(lambda t: (t-mu)**2)
               )

total_rdd = standardizedRDD.reduce(add)
count_rdd = standardizedRDD.map(lambda x:1).reduce(add)
total_rdd / count_rdd

0.14825301661559392