<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Summary" data-toc-modified-id="Summary-1">Summary</a></span></li><li><span><a href="#Terminology" data-toc-modified-id="Terminology-2">Terminology</a></span><ul class="toc-item"><li><span><a href="#Resilient-Distributed-Dataset-(RDD)" data-toc-modified-id="Resilient-Distributed-Dataset-(RDD)-2.1">Resilient Distributed Dataset (RDD)</a></span></li><li><span><a href="#User-defined-functions-(UDF)" data-toc-modified-id="User-defined-functions-(UDF)-2.2">User-defined functions (UDF)</a></span></li></ul></li><li><span><a href="#Example:-Creating-an-RDD-from-a-Python-list" data-toc-modified-id="Example:-Creating-an-RDD-from-a-Python-list-3">Example: Creating an RDD from a Python list</a></span></li><li><span><a href="#Manipulating-data-with-map,-filter-and-reduce" data-toc-modified-id="Manipulating-data-with-map,-filter-and-reduce-4">Manipulating data with <code>map</code>, <code>filter</code> and <code>reduce</code></a></span><ul class="toc-item"><li><span><a href="#map" data-toc-modified-id="map-4.1"><code>map</code></a></span></li><li><span><a href="#filter" data-toc-modified-id="filter-4.2"><code>filter</code></a></span></li></ul></li><li><span><a href="#reduce" data-toc-modified-id="reduce-5"><code>reduce</code></a></span><ul class="toc-item"><li><span><a href="#Use-Commutative-and-associate-functions-(for-distributed-computing)" data-toc-modified-id="Use-Commutative-and-associate-functions-(for-distributed-computing)-5.1">Use <em>Commutative</em> and <em>associate</em> functions (for distributed computing)</a></span></li></ul></li><li><span><a href="#User-defined-Functions" data-toc-modified-id="User-defined-Functions-6">User-defined Functions</a></span><ul class="toc-item"><li><span><a href="#Using-UDF-to-create-a-fractions-function" data-toc-modified-id="Using-UDF-to-create-a-fractions-function-6.1">Using UDF to create a <code>fractions</code> function</a></span></li><li><span><a 
href="#Using-typed-Python-functions" data-toc-modified-id="Using-typed-Python-functions-6.2">Using typed Python functions</a></span></li><li><span><a href="#Promoting-Python-functions-to-udf" data-toc-modified-id="Promoting-Python-functions-to-udf-6.3">Promoting Python functions to <code>udf</code></a></span><ul class="toc-item"><li><span><a href="#Option-1:-Creating-a-UDF-explicitily-with-udf()-and-apply-it-to-dataframe" data-toc-modified-id="Option-1:-Creating-a-UDF-explicitily-with-udf()-and-apply-it-to-dataframe-6.3.1">Option 1: Creating a UDF explicitily with <code>udf()</code> and apply it to dataframe</a></span></li><li><span><a href="#Option-2:-Creating-a-UDF-directly-using-udf()-decorator" data-toc-modified-id="Option-2:-Creating-a-UDF-directly-using-udf()-decorator-6.3.2">Option 2: Creating a UDF directly using <code>udf()</code> decorator</a></span></li></ul></li></ul></li></ul></div>

# RDD and user-defined functions


## Summary

- The resilient distributed dataset (RDD) offers more flexibility than the record-and-column approach of the data frame.
- The RDD is the lowest-level and most flexible way to run Python code in the distributed Spark environment. An RDD imposes no structure on your data, so you must manage type information in your program and code defensively against potential exceptions.
- The API for data processing on the RDD is heavily inspired by the MapReduce framework. You use higher-order functions such as `map()`, `filter()` and `reduce()` on the objects of the RDD.
- The data frame’s most basic Python code promotion functionality, called the (PySpark) UDF, emulates the "map" part of the RDD. You use it as a scalar function, taking Column objects as parameters and returning a single Column.


## Terminology

### Resilient Distributed Dataset (RDD)

- Bag of elements, independent, no schema
- Flexible with what you want to do but no safeguards

### User-defined functions (UDF)

- Simple way to promote Python functions to be used on a data frame.

When an RDD is a good fit:

1. You have an unordered collection of Python objects that can be pickled.
2. You have unordered key-value pairs, e.g. a Python dict.
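
PySpark moves RDD elements between the driver and the workers by pickling them, so "can be pickled" is the practical limit on what an RDD can hold. A minimal sketch, using only the standard library, of how one might check elements before parallelizing them. `is_picklable` is a hypothetical helper, not part of PySpark:

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if obj survives a pickle round trip (hypothetical helper)."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

# Every element of the example collection can be pickled.
print(all(is_picklable(elem) for elem in collection))  # True

# A generator cannot be pickled, so it is a poor candidate for an RDD element.
print(is_picklable(i for i in range(3)))  # False
```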


## Example: Creating an RDD from a Python list

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

collection = [1, "two", 3.0, ("four", 4), {"five": 5}]

sc = spark.sparkContext

collection_rdd = sc.parallelize(collection)

print(collection_rdd)

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:262


## Manipulating data with `map`, `filter` and `reduce`

- Each takes a function as its only parameter, i.e. they are _higher-order functions_.

### `map`

- Applies one function to every element of the RDD.
- Be careful with element types that the function you are mapping does not support.

In [23]:
# Map a simple function over each element of an RDD.
# map() is lazy, so the error only surfaces on collect(); it is raised
# because not all of the elements support `+ 1`.

from py4j.protocol import Py4JJavaError


def add_one(value):
    return value + 1


collection_rdd = collection_rdd.map(add_one)

try:
    print(collection_rdd.collect())
except Py4JJavaError:
    pass  # swallowed on purpose; the relevant part of the trace is listed below

# Stack trace galore! The important bit: you'll get one of the following.
# TypeError: can only concatenate str (not "int") to str
# TypeError: unsupported operand type(s) for +: 'dict' and 'int'
# TypeError: can only concatenate tuple (not "int") to tuple

In [37]:
# Safer option with a try/except inside the function
def safer_add_one(value):
    try:
        return value + 1
    except TypeError:
        return value
    
# reset rdd
collection_rdd = sc.parallelize(collection)
print("Before: ", collection)

# run safe adding method
collection_rdd = collection_rdd.map(safer_add_one)
print("After : ", collection_rdd.collect())

Before:  [1, 'two', 3.0, ('four', 4), {'five': 5}]
After :  [2, 'two', 4.0, ('four', 4), {'five': 5}]


![](notes/img/rdd_error.png)

### `filter`

- Keeps an element if the function returns a truthy value for it, otherwise drops it.
- As a reminder, `False`, `None`, numeric zero, and empty sequences and collections (str, list, tuple, dict, set, range) are falsy.
  - ref: https://docs.python.org/3/library/stdtypes.html#truth-value-testing
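
The truthiness rules can be checked without Spark. A small sketch using the built-in `bool` and `filter` as stand-ins for the RDD method; the assumption is that a predicate's result is tested for truthiness in the same way on an RDD:

```python
falsy_values = [False, None, 0, 0.0, "", [], (), {}, set(), range(0)]

# bool() applies Python's truth-value test to each value.
print([bool(v) for v in falsy_values])  # every entry is False

mixed = [0, 1, "", "two", [], [3], None, {"four": 4}]

# A predicate that returns the element itself keeps only the truthy ones.
kept = list(filter(lambda x: x, mixed))
print(kept)  # [1, 'two', [3], {'four': 4}]
```

In practice a predicate that returns an explicit boolean, such as an `isinstance` check, is easier to read than one that relies on truthiness.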

In [48]:
# Filter the RDD with a lambda function to keep only ints and floats

collection_rdd = sc.parallelize(collection)


collection_rdd = collection_rdd.filter(lambda x: isinstance(x, (float, int)))
print(collection_rdd.collect())


# Alternative: Creating a separate function

collection_rdd = sc.parallelize(collection)

def is_string(elem):
    return isinstance(elem, str)

collection_rdd = collection_rdd.filter(is_string)
print(collection_rdd.collect())

[1, 3.0]
['two']
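
Filtering by type before mapping is an alternative to wrapping the mapped function in `try`/`except`. A sketch with the built-in `filter` and `map` standing in for the RDD methods; `is_number` is a hypothetical helper introduced for illustration:

```python
collection = [1, "two", 3.0, ("four", 4), {"five": 5}]


def is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def add_one(value):
    return value + 1


# Filter first, then map: add_one only ever sees values that support "+ 1".
result = list(map(add_one, filter(is_number, collection)))
print(result)  # [2, 4.0]
```

Unlike the `try`/`except` version, this drops the unsupported elements instead of passing them through unchanged, so which one fits depends on whether those elements are still needed downstream.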


## `reduce`

- Used for summarization (similar to `groupby` and `agg` on a data frame).
- The function takes two elements and returns one. When the RDD has more than two elements, `reduce` applies the function to the first two, then to that result and the third element, and so forth.

![](notes/img/reduce.png)
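
The pairwise application can be traced on a single machine with `functools.reduce`. This is only a sketch of the idea: on an RDD the same folding happens inside each partition before the partial results are combined. `traced_add` is a hypothetical helper that records each call:

```python
from functools import reduce

steps = []


def traced_add(x, y):
    """Add two values and record the call so the order is visible."""
    steps.append((x, y))
    return x + y


total = reduce(traced_add, range(5))

print(steps)  # [(0, 1), (1, 2), (3, 3), (6, 4)]
print(total)  # 10
```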

In [53]:
# Add list of numbers through reduce

from operator import add

collection_rdd = sc.parallelize(range(10))
print(collection_rdd.reduce(add))

45


### Use _Commutative_ and _associative_ functions (for distributed computing)

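Only give `reduce` _commutative_ and _associative_ functions. Spark reduces each partition separately and then combines the partial results, so neither the order nor the grouping of the arguments is guaranteed.

- Commutative function: a function for which the order of arguments doesn't matter, i.e. `f(a, b) == f(b, a)`.
- Associative function: a function for which the grouping of arguments doesn't matter, i.e. `f(f(a, b), c) == f(a, f(b, c))`.
  - `subtract` is neither: in general `a - b != b - a` and `(a - b) - c != a - (b - c)`.
- `add`, `multiply`, `min` and `max` are both associative and commutative.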
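
A single-machine sketch of why this matters: the data is split into chunks that play the role of partitions, each chunk is reduced on its own, and the partial results are reduced again. `partitioned_reduce` is a hypothetical helper, not a PySpark API, and it keeps the partials in a fixed order, which a real cluster does not promise:

```python
from functools import reduce
from operator import add, sub


def partitioned_reduce(func, values, num_partitions):
    """Reduce each chunk on its own, then reduce the partial results."""
    values = list(values)
    size = -(-len(values) // num_partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)


numbers = range(10)

# add is associative and commutative: the partitioning does not matter.
print([partitioned_reduce(add, numbers, n) for n in (1, 2, 5)])  # [45, 45, 45]

# sub is neither: the answer changes with the number of partitions.
print([partitioned_reduce(sub, numbers, n) for n in (1, 2, 5)])  # [-45, 15, 3]
```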

## User-defined Functions

- UDFs allow you to implement custom functions on PySpark data frame columns

### Using UDF to create a `fractions` function

In [57]:
import pyspark.sql.functions as F
import pyspark.sql.types as T

fractions = [[x, y] for x in range(100) for y in range(1, 100)]
frac_df = spark.createDataFrame(fractions, ["numerator", "denominator"])

frac_df = frac_df.select(
    F.array(F.col("numerator"), F.col("denominator")).alias("fraction"),
)

frac_df.show(5, False)

+--------+
|fraction|
+--------+
|[0, 1]  |
|[0, 2]  |
|[0, 3]  |
|[0, 4]  |
|[0, 5]  |
+--------+
only showing top 5 rows



### Using typed Python functions

This section will create a function to reduce a fraction and one to transform a fraction into a floating-point number.

In [59]:
from fractions import Fraction
from typing import Tuple, Optional

Frac = Tuple[int, int]

def py_reduce_fraction(frac: Frac) -> Optional[Frac]:
    """Reduce a fraction represented as a 2-tuple of integers"""
    num, denom = frac
    if denom:
        answer = Fraction(num, denom)
        return answer.numerator, answer.denominator
    return None

assert py_reduce_fraction((3, 6)) == (1, 2)
assert py_reduce_fraction((1, 0)) is None

In [60]:
def py_fraction_to_float(frac: Frac) -> Optional[float]:
    """Transforms a fraction represented as a 2-tuple of integers into a float"""
    num, denom = frac
    if denom:
        return num / denom
    return None

assert py_fraction_to_float((2, 8)) == 0.25
assert py_fraction_to_float((10, 0)) is None

### Promoting Python functions to `udf`

`udf()` takes two parameters.

1. The function you want to promote.
2. Optionally, the return type of the generated UDF. When omitted, it defaults to `StringType`.

#### Option 1: Creating a UDF explicitly with `udf()` and applying it to a dataframe

In [63]:
SparkFrac = T.ArrayType(T.LongType())

# Promote the Python function to a UDF, passing SparkFrac as the return type
reduce_fraction = F.udf(py_reduce_fraction, SparkFrac)

# apply to existing dataframe
frac_df = frac_df.withColumn(
    "reduced_fraction", reduce_fraction(F.col("fraction"))
)

frac_df.show(5, False)

+--------+----------------+
|fraction|reduced_fraction|
+--------+----------------+
|[0, 1]  |[0, 1]          |
|[0, 2]  |[0, 1]          |
|[0, 3]  |[0, 1]          |
|[0, 4]  |[0, 1]          |
|[0, 5]  |[0, 1]          |
+--------+----------------+
only showing top 5 rows



#### Option 2: Creating a UDF directly using `udf()` decorator

In [67]:
@F.udf(T.DoubleType())
def fraction_to_float(frac: Frac) -> Optional[float]:
    num, denom = frac
    if denom:
        return num / denom
    return None


frac_df = frac_df.withColumn(
    "fraction_float", fraction_to_float(F.col("reduced_fraction"))
)

frac_df.select("reduced_fraction", "fraction_float").distinct().show(5, False)

# The undecorated Python function is still reachable through the `func` attribute
assert fraction_to_float.func((1, 2)) == 0.5

+----------------+-------------------+
|reduced_fraction|fraction_float     |
+----------------+-------------------+
|[3, 50]         |0.06               |
|[3, 67]         |0.04477611940298507|
|[7, 76]         |0.09210526315789473|
|[9, 23]         |0.391304347826087  |
|[9, 25]         |0.36               |
+----------------+-------------------+
only showing top 5 rows
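
The two options differ only in how the function is handed over. The sketch below imitates that pattern in plain Python so it runs without Spark. `fake_udf` and `_wrap` are hypothetical stand-ins, the return type is a plain string rather than a Spark type, and PySpark's internals may differ in detail:

```python
import functools
from typing import Callable, Optional, Tuple

Frac = Tuple[int, int]


def _wrap(func: Callable, return_type: str) -> Callable:
    """Wrap func and remember the original, as a promoted function would."""

    @functools.wraps(func)
    def wrapper(*args):
        return func(*args)

    wrapper.func = func  # handle on the undecorated Python function
    wrapper.return_type = return_type
    return wrapper


def fake_udf(f=None, return_type: str = "string"):
    """Hypothetical stand-in for udf(); it never touches Spark."""
    if f is None or isinstance(f, str):
        # Decorator usage: the first argument is really the return type.
        chosen = f if isinstance(f, str) else return_type
        return functools.partial(_wrap, return_type=chosen)
    # Explicit usage: the first argument is the function to promote.
    return _wrap(f, return_type)


def py_fraction_to_float(frac: Frac) -> Optional[float]:
    num, denom = frac
    return num / denom if denom else None


# Shape of Option 1: pass the function and the return type explicitly.
explicit = fake_udf(py_fraction_to_float, "double")


# Shape of Option 2: pass only the return type and decorate the function.
@fake_udf("double")
def decorated(frac: Frac) -> Optional[float]:
    num, denom = frac
    return num / denom if denom else None


print(explicit.return_type, decorated.return_type)  # double double
print(explicit.func((2, 8)), decorated.func((1, 0)))  # 0.25 None
```

Keeping the original function on a `func` attribute is what makes it possible to unit test the Python logic without building a data frame.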

