In [1]:
import numpy as np

# Spark Tutorial - First Encounter

First, make sure you have followed the installation instructions in the `README` file.
In this first Spark tutorial, written in Python 3, we run a simple example to get familiar with PySpark.
The basic idea is to create a Resilient Distributed Dataset (RDD) and apply operations to it. Operations are either _transformations_, which define a new RDD from an existing one, or _actions_, which compute a result and return it to the driver program. Spark distributes the data and parallelizes the operations for you.

## Squaring a collection of elements

For this first example we simply square all elements of a collection and return the first ten entries. We then compare Spark's performance with more classic ways of doing the same operation (NumPy and a list comprehension).

In [2]:
# size of the collection, large enough to see an effect (if any)
n_samples = 1000000

In [3]:
# Function that takes an element and returns it squared
def square_it(x):
    return x**2

In [4]:
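# create a Resilient Distributed Dataset from a local range
# (`sc` is the SparkContext, assumed to be predefined by the PySpark shell or notebook kernel)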
rdd = sc.parallelize(range(n_samples))

In [5]:
# applying the square function to the rdd
rdd_square = rdd.map(square_it)

In [6]:
# Taking the first ten results (just checking)
rdd_square.take(10)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Let's now look at how long the different operations took:


In [7]:
%timeit rdd.map(square_it)

2.91 µs ± 59.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [8]:
%timeit rdd_square.take(10)

37 ms ± 6.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


It would be easy to jump to the conclusion that Spark performs the square operation very quickly (I did the first time :/ ) ... and it would be wrong!  
According to the [programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations):  
__"All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program."__

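So when we write `rdd.map(square_it)`, no computation is actually performed: the 2.91 µs measured above is only the time needed to record the transformation. The squaring itself happens when an action such as `take(10)` is called.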
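
The same behaviour can be reproduced without Spark, which may make it easier to see. The sketch below is only an analogy in plain Python, not Spark code: the built-in `map` plays the role of a lazy transformation and `itertools.islice` stands in for an action like `take(10)`. The names `counting_square` and `calls` are hypothetical helpers introduced here to count how often the function really runs.

```python
from itertools import islice

n_samples = 1_000_000
calls = {"count": 0}  # how many times the function has actually been executed

def counting_square(x):
    """Square x and record that a real computation happened."""
    calls["count"] += 1
    return x ** 2

# Building the pipeline returns immediately: the built-in map() is lazy too.
lazy_squares = map(counting_square, range(n_samples))
calls_after_build = calls["count"]  # nothing has been computed yet

# Asking for results triggers the work, and only as much of it as is needed.
first_ten = list(islice(lazy_squares, 10))
calls_after_take = calls["count"]  # only the requested elements were squared

print(calls_after_build, calls_after_take, first_ten)
```

Timing only the line that builds `lazy_squares` would say nothing about the cost of squaring, which is the same trap as timing `rdd.map(square_it)` on its own.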

Let's compare with more classic ways of doing the same operation, starting with NumPy:

In [9]:
# Square the same collection with NumPy (vectorized)
x = np.arange(n_samples)
tmp_np = x**2
tmp_np[:10]

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [10]:
%timeit x**2

3.21 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%timeit tmp_np[:10]

237 ns ± 9.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


We can also perform the same operation using a list comprehension:

In [12]:
tmp_list = [i**2 for i in range(n_samples)]
tmp_list[:10]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [13]:
%timeit [i**2 for i in range(n_samples)]

388 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
%timeit tmp_list[:10]

222 ns ± 6.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


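Taking the timings at face value, squaring the collection and returning the first ten entries takes about 3.2 ms with NumPy, 37 ms with Spark and 388 ms with a list comprehension, so NumPy appears faster than Spark, which appears faster than the list comprehension. The comparison is not like-for-like, though: `take(10)` only evaluates as many partitions as it needs to return ten elements, so Spark does not necessarily square the full collection here, whereas NumPy and the list comprehension square all one million elements.
This is also a very simple example in which we haven't exploited the full power of Spark.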
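
When comparing timings like these, it helps to check that each measurement covers the same amount of work. The sketch below uses only the standard library and is not part of the original benchmark: `best_time`, `first_ten_only` and `all_then_first_ten` are hypothetical helpers, and `n_samples` is reduced so the sketch runs quickly. It times a lazy pipeline that stops after ten elements against an eager one that squares everything first.

```python
import timeit
from itertools import islice

n_samples = 100_000  # smaller than in the tutorial so the sketch runs quickly

def square_it(x):
    return x ** 2

def best_time(fn, repeat=3):
    """Best wall-clock time, in seconds, of a single call to fn."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))

def first_ten_only():
    # Lazy pipeline: stops as soon as ten elements have been produced
    return list(islice(map(square_it, range(n_samples)), 10))

def all_then_first_ten():
    # Eager pipeline: squares every element, then keeps the first ten
    return [square_it(i) for i in range(n_samples)][:10]

t_partial = best_time(first_ten_only)
t_full = best_time(all_then_first_ten)
print(f"first ten only: {t_partial:.6f} s, full collection: {t_full:.6f} s")

# With a live SparkContext (assumption, not run here), a full pass over the
# RDD could be timed the same way with an action that touches every element:
# best_time(lambda: rdd.map(square_it).count())
```

Both functions return the same ten values, but only the second one pays for squaring the whole collection, so their timings answer different questions.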