# RDD basics

This notebook introduces the essential Spark operations for working with data. The data is first read into a distributed dataset; information is then extracted by defining a chain of one or more **transformation** functions that process the data, followed by an **action** function that returns the result.

## Data containers

Spark has two main types of data containers (formally, these are APIs).

(1) An **RDD**, or Resilient Distributed Dataset, is an immutable distributed collection of the elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel through a low-level API offering *transformations* and *actions*. Because RDDs are immutable, every transformation can be seen as an operation that generates a new RDD, and every action as an operation that generates a result.

(2) A **DataFrame** is also an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. Designed to make processing large data sets easier, a DataFrame lets developers impose a structure on a distributed collection of data, which allows a higher level of abstraction.

## Creating an RDD

You can create an RDD from an in-memory collection using **parallelize(collection)** on the SparkContext (usually abbreviated as `sc`). We can use **collect()** to retrieve a Python list with all elements from an RDD.

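In [1]:
import os
import sys

# Root of the Spark installation (not its bin directory). The raw string keeps
# backslash sequences such as "\b" from being read as escape characters.
spark_path = r"C:\spark-2.0.1.-bin-hadoop2.7\spark-2.0.1-bin-hadoop2.7"

os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

# Make the pyspark and py4j sources bundled with Spark importable.
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
# The py4j file name must match the zip actually present in python/lib.
sys.path.append(spark_path + "/python/lib/py4j-0.9-src.zip")

from pyspark import SparkContext
from pyspark import SparkConf

# Create the SparkContext used as `sc` below; running on local[*] is an assumption.
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]").setAppName("RDD basics"))

nameages = sc.parallelize([('Peter', 3), ('Mike', 2), ('John', 5)])
nameages.collect()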

## Reading data into an RDD

Alternatively, an RDD can be read from files. For this example, we first download a CSV file that records how often babies were given each name. Python's `urllib` can be used to fetch a URL and store the downloaded file on disk.

In [None]:
if not os.path.exists("../data/babynames.csv"):
    import urllib.request
    # Make sure the target directory exists before downloading into it.
    os.makedirs("../data", exist_ok=True)
    urllib.request.urlretrieve("https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv?accessType=DOWNLOAD",
                               "../data/babynames.csv")

There are a few options for reading text files in Spark. The first is the `textFile()` method of the SparkContext (abbreviated as `sc`).

We can view a sample from the RDD with the **take(n)** action, which returns the first n elements. `textFile()` simply turns every line in the file into a string element.

In [None]:
babyrddprimitive = sc.textFile("../data/babynames.csv")
print(babyrddprimitive.take(5))

We can use a transformation to remove the first line. The action **first()** returns the first element of the RDD, which here is the header line. The transformation **filter(condition)** evaluates the condition for every element and keeps only the elements for which it is true. For the *condition*, we pass a function that accepts a single element as its parameter and returns a boolean. This can be a regular Python function defined with *def*, but we will often use a **lambda function**, a short way to write an anonymous function in which the part before the colon is the parameter and the part after the colon is the result.

In [None]:
firstline = babyrddprimitive.first()
babyrddnofirstline = babyrddprimitive.filter(lambda x: x != firstline)
print(babyrddnofirstline.take(5))
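
The two ways of writing the condition are interchangeable. A minimal sketch, using Python's built-in `filter` on a small list rather than an RDD so that it runs without Spark; the sample lines and the name `is_not_header` are illustrative assumptions, not data read from the file:

```python
# Hypothetical sample lines standing in for the elements of an RDD.
lines = ["Year,First Name,County,Sex,Count",
         "2013,GAVIN,ST LAWRENCE,M,9",
         "2013,LEVI,ST LAWRENCE,M,9"]

header = lines[0]

# Named function: takes one element and returns a boolean.
def is_not_header(line):
    return line != header

kept_def = list(filter(is_not_header, lines))

# Equivalent lambda: parameter before the colon, result after it.
kept_lambda = list(filter(lambda line: line != header, lines))

print(kept_def == kept_lambda)
```

Either form could be passed to the RDD's `filter` in the same way.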

Then, to transform every element into a list of column values, we can use Python's **split** method with "," as the separator. The **map(f)** transformation calls the function **f(element)** on every element and stores the values returned by those calls in a new RDD. Since split() returns a list, every element of the resulting RDD is a list, so the RDD resembles a two-dimensional list.

In [None]:
babyrdd = babyrddnofirstline.map(lambda x: x.split(','))
print(babyrdd.take(5))
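
Splitting on every comma is only safe when no field contains a comma itself. A small sketch with a hypothetical line (not taken from the file) shows the difference between `str.split` and Python's `csv` module, which understands quoted fields:

```python
import csv

# Hypothetical line in which one quoted field contains a comma.
line = '2013,"SMITH, JR",KINGS,M,5'

naive = line.split(',')            # also splits inside the quotes
parsed = next(csv.reader([line]))  # treats the quoted field as one value

print(len(naive), naive)
print(len(parsed), parsed)
```

If the data might contain quoted commas, the function passed to `map` could parse each line with `csv.reader` instead; whether this particular file needs that has not been checked here.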

## Reading a DataFrame

In [None]:
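from pyspark.sql import SparkSession

# `spark` is predefined in the pyspark shell; here we create it explicitly.
# getOrCreate() reuses a SparkContext that is already running.
spark = SparkSession.builder.getOrCreate()

babydf = spark.read.csv("../data/babynames.csv", header=True)
print(babydf.take(5))
print(babydf.first())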

DataFrames organize the data in Rows, in which every value is named after its column. These Rows can be used much like Python dictionaries. By themselves, DataFrames offer a more limited set of operations. To use the full capabilities of RDDs, however, a DataFrame has an `.rdd` property that lets you use it as an RDD.

We can select only the male or female names by using a filter. In the lambda expression, every element x is a Row of the RDD, and a value in a Row can be addressed by its column name. The filter produces a new RDD of Rows, and the map then transforms every Row by returning only its first name. The final result is thus an RDD of strings.

In [None]:
print(babydf.rdd.filter(lambda x: x["Sex"] == 'F').map(lambda x: x["First Name"]).take(5))
print(babydf.rdd.filter(lambda x: x["Sex"] == 'M').map(lambda x: x["First Name"]).take(5))
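
Because the lambdas only look up values by column name, they can be tried out on plain Python dictionaries before being run on the cluster. In this sketch the rows are hypothetical stand-ins for Row objects, and the built-in `filter` and `map` stand in for the RDD methods:

```python
# Hypothetical rows; a Spark Row supports the same x["column"] lookup.
rows = [{"First Name": "EMMA", "Sex": "F"},
        {"First Name": "LIAM", "Sex": "M"},
        {"First Name": "OLIVIA", "Sex": "F"}]

# Same chain as in the cell above: keep the matching rows,
# then reduce each remaining row to a single value.
female_rows = filter(lambda x: x["Sex"] == 'F', rows)
female_names = list(map(lambda x: x["First Name"], female_rows))

print(female_names)
```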

## RDD Operations

All transformations in Spark are _lazy_, in that they do not compute their results right away. Instead, they just remember the transformations that are defined and only compute them when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, Spark can realize that a dataset created through map will be used in a reduce, and return only the result of the reduce to the driver rather than the larger mapped dataset.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the `persist()` (or `cache()`) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
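
Deferred computation, and recomputation unless the results are kept, can both be illustrated without Spark. The sketch below is an analogy in plain Python, not Spark's implementation: Python's lazy `map` plays the role of a transformation, `sum` plays the role of an action, and storing the results in a list plays the role of persisting. The names `square`, `build_pipeline` and `call_log` are hypothetical helpers.

```python
call_log = []  # records every time the 'transformation' function really runs

def square(x):
    call_log.append(x)
    return x * x

data = [1, 2, 3]

def build_pipeline():
    # Defining the pipeline does not call square() yet.
    return map(square, data)

lazy = build_pipeline()
calls_after_define = len(call_log)   # 0: nothing computed so far

total = sum(lazy)                    # the 'action' forces the computation
calls_after_action = len(call_log)   # 3

total_again = sum(build_pipeline())  # a second action recomputes everything
calls_after_second = len(call_log)   # 6

cached = list(build_pipeline())      # keep the results: 3 more calls, 9 in total
total_cached = sum(cached)
largest = max(cached)
calls_after_cached = len(call_log)   # still 9: no recomputation

print(calls_after_define, calls_after_action, calls_after_second, calls_after_cached)
```

The counts show the trade-off described above: without keeping the results, every action pays for the full chain of transformations again.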