<h1>RDD Mechanics</h1>
<p>The first example did a lot at once, so I'd like to proceed more slowly to get a sense of what's going on.  Here, I'll simplify the code so we can see what's happening at a more granular level.  Let's begin by creating a small dataset manually and turning it into an RDD.
</p>

In [1]:
data = ["Spark is great for big data.", 
        "Big data cannot fit on one computer.",
        "Spark can be installed on a cluster of computers."
       ]

rdd = sc.parallelize(data)

rdd.take(1)

['Spark is great for big data.']

<h2>What is an RDD?</h2>
<p>The most important thing to understand about Spark programming is what an RDD is.  An RDD is best thought of not as a materialized collection of data or variables but as a map from a set of inputs to a manipulated state of those inputs: a recorded recipe of transformations.  Spark employs what is called 'lazy evaluation', which means that it computes only as much as it needs to satisfy the current request.  To illustrate this point, I will define a chain of RDD transformations without collecting, which should return almost immediately.  However, when I try to acquire a sample, the computation actually gets kicked off.</p>

In [2]:
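from functools import wraps
from time import time

def timeit(method):
    """Decorator that prints how long each call to `method` takes."""

    @wraps(method)
    def timed(*args, **kw):
        ts = time()
        result = method(*args, **kw)
        te = time()

        # Single-argument print() behaves the same on Python 2 and 3.
        print("{} {}".format(method.__name__, te - ts))
        return result

    return timed

@timeit
def col_rdd(rdd):
    # collect() is an action: it forces the recorded transformations to run.
    print(rdd.collect())

@timeit
def sort_rdd(rdd):
    # Sort by each element's first character, then pull one element.
    # take() is an action, so the work happens inside this call.
    rdd.sortBy(lambda x: x[0]).take(1)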

In [3]:
from time import time

t0 = time()
# Transformations only: these lines record what to do, nothing is computed yet.
rdd_fm = rdd.flatMap(lambda text: text.split())
rdd_tup = rdd_fm.map(lambda word: (word.strip(".,-;?").lower(), 1))
print("Time for completion, step 1: {}".format(time() - t0))

Time for completion, step 1: 0.000345945358276


In [4]:
col_rdd(rdd_fm)
sort_rdd(rdd_fm)

['Spark', 'is', 'great', 'for', 'big', 'data.', 'Big', 'data', 'cannot', 'fit', 'on', 'one', 'computer.', 'Spark', 'can', 'be', 'installed', 'on', 'a', 'cluster', 'of', 'computers.']
col_rdd 0.159617900848
sort_rdd 0.529456853867
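
<p>The same pattern can be mimicked without a cluster.  The sketch below is a toy model in plain Python, not how Spark is implemented: <code>LazySeq</code> and <code>shout</code> are hypothetical names invented for illustration.  It records each <code>map</code> step and only does the work when <code>collect</code> is called, which is roughly why the definitions above return instantly while <code>collect()</code> and <code>take()</code> do not.</p>

```python
class LazySeq(object):
    """Toy stand-in for an RDD (hypothetical): stores a recipe, not results."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, fn):
        # "Transformation": only record the function; nothing runs yet.
        return LazySeq(self.source, self.steps + (fn,))

    def collect(self):
        # "Action": replay every recorded step over the source data.
        out = []
        for item in self.source:
            for fn in self.steps:
                item = fn(item)
            out.append(item)
        return out


calls = []

def shout(word):
    calls.append(word)  # side effect so we can observe when work happens
    return word.upper()

lazy = LazySeq(["spark", "is", "lazy"]).map(shout)
calls_before = len(calls)  # map() only recorded the step
result = lazy.collect()    # the work happens here
calls_after = len(calls)

print(calls_before, calls_after, result)
```

<p>Counting calls rather than timing them keeps the example deterministic: no call to <code>shout</code> is made until <code>collect</code> runs.</p>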


<h2>Caching progress</h2>
<p>Here, we can see that defining the RDD happens almost instantly (about 0.0003 seconds), whereas simply printing out the objects created in that step takes much longer (about 0.16 seconds for the collect).  Again, this is because defining an RDD only creates a map to a particular state; the work is deferred until a result is requested.  This is exacerbated when you perform multiple operations on the same RDD, because not only do the new operations need to be performed, but the old ones do as well.<br/><br/>
In order to avoid this behavior, you can tell Spark to cache results at various points along the way.  Caching will store the values of an RDD in memory (if you have enough) so that you can avoid unneeded calculations.<br/><br/>
The other side of the coin is that the execution graphs Spark builds are efficient, so unless you need to refer back to intermediate data, you are generally well served by simply executing everything at once.
</p>
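
<p>To make the recomputation cost concrete, here is a small plain-Python sketch.  It is an analogy under stated assumptions, not Spark's API: the <code>Lazy</code> class and the <code>expensive</code> function are hypothetical.  Without caching, every <code>collect</code> replays the function over the source; with caching, the first <code>collect</code> stores the result and later ones reuse it.  As in Spark, <code>cache()</code> here is only a request, and nothing is stored until the first action runs.</p>

```python
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1  # count how often the "transformation" really runs
    return x * x


class Lazy(object):
    """Toy lazy sequence with optional caching (hypothetical, not Spark)."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn
        self.keep = False   # has cache() been requested?
        self.stored = None  # filled in by the first collect() if keep is set

    def cache(self):
        self.keep = True    # only a request; nothing is computed here
        return self

    def collect(self):
        if self.stored is not None:
            return list(self.stored)
        out = [self.fn(x) for x in self.source]
        if self.keep:
            self.stored = out
        return list(out)


uncached = Lazy([1, 2, 3], expensive)
uncached.collect()
uncached.collect()
uncached_calls = calls["n"]  # both collects recomputed everything

calls["n"] = 0
cached = Lazy([1, 2, 3], expensive).cache()
first = cached.collect()
second = cached.collect()
cached_calls = calls["n"]    # only the first collect did the work

print(uncached_calls, cached_calls)
```

<p>The benefit only shows up from the second action onward, so caching an RDD that is used once buys nothing.</p>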

In [5]:
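# Ask Spark to keep rdd_tup in memory once it has been computed.
rdd_tup.cache()
# Sum the 1s per word, order by count (largest first), and fetch the top 15.
terms = rdd_tup.reduceByKey(lambda a, b: a+b)\
            .sortBy(lambda x: x[1], ascending=False)\
            .take(15)

print(terms)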

[('data', 2), ('on', 2), ('big', 2), ('spark', 2), ('a', 1), ('be', 1), ('cluster', 1), ('can', 1), ('of', 1), ('great', 1), ('fit', 1), ('is', 1), ('for', 1), ('computers', 1), ('cannot', 1)]
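
<p>For anyone who wants to check this result without a cluster, the pipeline can be approximated in plain Python.  This is a rough single-machine equivalent, not what Spark does internally; each step is labeled with the RDD operation it stands in for, and the variable names are mine.</p>

```python
from collections import Counter

data = ["Spark is great for big data.",
        "Big data cannot fit on one computer.",
        "Spark can be installed on a cluster of computers."]

# flatMap: split every sentence and flatten into one list of words.
words = [w for text in data for w in text.split()]

# map: normalize each word and pair it with a count of 1.
pairs = [(w.strip(".,-;?").lower(), 1) for w in words]

# reduceByKey: add up the counts for each distinct word.
counts = Counter()
for word, n in pairs:
    counts[word] += n

# sortBy(..., ascending=False) followed by take(15).
terms = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:15]

print(terms)
```

<p>The counts should match the Spark output above, although words with equal counts may come out in a different order.</p>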
