In [None]:
"""
Resilient Distributed Datasets, or RDDs, were the primary API in Spark version one, and they're still 
available in Spark version two. Almost all the DataFrame code we've been running 
compiles down to RDDs, so it makes sense for us to have a basic understanding of what an RDD is. 

An RDD is an immutable partitioned collection of records that can be worked on in parallel. 


Remember that with a DataFrame, each record is a structured row containing fields with a known 
schema. With an RDD, the records are just Java, Scala or Python objects, so you have complete 
control over them. 


Although this has several advantages, there are a couple of challenges. Spark does not understand the 
inner structure of your records as it does with DataFrames. This means that the optimizations you 
would have got automatically with DataFrames now have to be recreated by hand. 


The RDD APIs are available in Python as well as Scala and Java. You can get good performance running 
RDDs in Scala and Java. However, running Python RDDs is like running Python user-defined functions row by row. 


We need to serialize the data to the Python process, work on it in Python and then serialize it back 
to the Java Virtual Machine. For this reason, it's recommended to stick with the high-level APIs in Spark 
and only use RDDs when absolutely necessary. 


If you're curious about the performance difference between RDDs and DataFrames, take a look at 
this blog post. 

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

When the Databricks team first introduced DataFrames back in 2015, they ran a couple of experiments 
comparing the performance of a Scala and Python group-by aggregation on 10 million integer pairs on a 
single machine. With DataFrames, both Scala and Python operations are compiled into JVM bytecode, 
so there's little difference between them. Python and Scala DataFrames had two times better performance than 
Scala RDDs. But wait for it: the same DataFrames had a whopping five times better performance than Python RDDs. 


So when would you use RDDs? 

    One case is when you need fine-grained control over the physical distribution and partitioning of data. 

    Another is when you have to maintain a legacy codebase written with RDDs
    in Spark version one. 

RDDs are a low-level API that is powerful but lacks many of the optimizations that we have seen 
with DataFrames. With RDDs, you also do not have access to the built-in functions that DataFrames 
provide. This means that you must define each filter, map and aggregation as a function. 
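
As a rough illustration of what defining each step as a function looks like, the sketch below writes a 
filter predicate, two map functions and an aggregation function as ordinary Python functions. The sample 
rows and the function names are hypothetical, and Python built-ins stand in for the RDD methods so the 
logic can be checked without a cluster.

```python
from functools import reduce

# Hypothetical sample rows standing in for the lines of a CSV file.
lines = [
    "1,Alpha,1 First St",
    "7,Bravo,7 Second St",
    "10,Charlie,10 Third St",
]

def is_district(line, wanted="7"):
    # Filter predicate: keep rows whose first field equals `wanted`.
    return line.split(",")[0] == wanted

def name_of(line):
    # Map function: extract the second field of a row.
    return line.split(",")[1]

def district_number(line):
    # Map function: turn the first field into an integer.
    return int(line.split(",")[0])

def larger(a, b):
    # Aggregation function: combine two values into one (the maximum).
    return a if a >= b else b

# With an RDD named rdd, the equivalent calls would be
#   rdd.filter(is_district).map(name_of)
#   rdd.map(district_number).reduce(larger)
district_names = list(map(name_of, filter(is_district, lines)))
highest_district = reduce(larger, map(district_number, lines))
```

The same named functions can be passed directly to the RDD methods, which tends to be easier to test than 
long inline lambdas.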

Transformations, actions and lazy evaluation work the same way as they do with DataFrames. If you're 
ever uncertain whether a given function is a transformation or an action in RDD land, look at its return 
type. Transformations return RDDs, whereas actions return some other data type. 

Let's take a look at some transformations and actions. 

    With map, you define a function and then apply it record by record. 
    
    flatMap returns a new RDD by first applying a function to all of the elements of the RDD and 
    then flattening the result.
    
    filter returns a new RDD containing only the elements that satisfy a condition. 

    reduce is an action: it repeatedly combines pairs of elements until a single result remains. 
    For example, let's say you have a set of numbers. You can reduce this to its sum by providing a function 
    that takes two values as input and reduces them to one. The function should be commutative and 
    associative, because Spark applies it within and across partitions in no guaranteed order. 
    
    count is also an action. It works the same way that we've seen with DataFrames and returns the number 
    of records in an RDD.
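
To make these semantics concrete, here is a plain-Python analogue of each operation on a small list. 
This is only a sketch of what each call computes, not Spark itself. The RDD forms in the comments 
assume sc is an active SparkContext, and the sample values are made up.

```python
from functools import reduce
from itertools import chain

numbers = [1, 2, 3, 4, 5]
sentences = ["hello world", "resilient distributed datasets"]

# map: one output record per input record.
# RDD form: sc.parallelize(numbers).map(lambda x: x * 2)
doubled = [x * 2 for x in numbers]

# flatMap: the function returns a sequence per record; results are flattened one level.
# RDD form: sc.parallelize(sentences).flatMap(lambda s: s.split(" "))
words = list(chain.from_iterable(s.split(" ") for s in sentences))

# filter: keep only the records for which the predicate is true.
# RDD form: sc.parallelize(numbers).filter(lambda x: x % 2 == 0)
evens = [x for x in numbers if x % 2 == 0]

# reduce (an action): combine records pairwise into a single value.
# RDD form: sc.parallelize(numbers).reduce(lambda a, b: a + b)
total = reduce(lambda a, b: a + b, numbers)

# count (an action): number of records.
# RDD form: sc.parallelize(numbers).count()
n_records = len(numbers)
```

In Spark the first three would return new RDDs and do no work until an action such as reduce, count 
or collect is called.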
"""

In [2]:
from pyspark import SparkContext

In [3]:
sc = SparkContext.getOrCreate()

In [4]:
sc

In [9]:
# Read the CSV file as an RDD of strings, one record per line.
ps_rdd = sc.textFile('./data/police-station.csv')

In [None]:
ps_header = ps_rdd.first()

In [16]:
ps_rest = ps_rdd.filter(lambda line: line != ps_header)

In [None]:
ps_rest.first()

In [None]:
ps_rest.map(lambda line: line.split(',')).collect()

In [None]:
ps_rest.map(lambda line: line.split(',')).count()
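
Splitting on ',' is fine as long as no field contains a comma. If some fields are quoted and contain 
commas (an assumption, not something checked against this file), a small helper built on Python's csv 
module keeps such fields intact. The name parse_line and the sample row are hypothetical.

```python
import csv

def parse_line(line):
    # Parse a single CSV line; commas inside quoted fields stay within the field.
    return next(csv.reader([line]))

# Hypothetical row with a quoted field that contains a comma.
sample = '7,"Example Station, North",123 Main St'

naive = sample.split(",")    # splits inside the quoted field
parsed = parse_line(sample)  # respects the quotes
```

With this helper, ps_rest.map(parse_line) could replace ps_rest.map(lambda line: line.split(',')).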

In [None]:
# Split each line once, keep rows whose first field is '7',
# then select fields 0, 1, 2 and 5.
(ps_rest.map(lambda line: line.split(','))
        .filter(lambda fields: fields[0] == '7')
        .map(lambda fields: (fields[0], fields[1], fields[2], fields[5]))
        .collect())

In [None]:
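# Split each line once, keep rows whose first field is '10' or '11'
# (fields are strings, so compare against strings), then select fields 1, 2 and 5.
(ps_rest.map(lambda line: line.split(','))
        .filter(lambda fields: fields[0] in ('10', '11'))
        .map(lambda fields: (fields[1], fields[2], fields[5]))
        .collect())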