# User Defined Functions

Recall that the `pandas` `Series.apply` and `Series.map` methods allow us to apply a function to each individual data value in a column.  We would like to do the same in `pyspark`, but because the underlying code runs in Scala, we need some extra machinery to apply/map a Python function to the data. Our two solutions are:

- Define a Python `udf` column function that will run Python code on each data value.
- Define a `pandas_udf` that is faster and uses existing pandas methods to perform [vectorized operations](https://en.wikipedia.org/wiki/Array_programming).

### User defined Python functions

In this notebook, we will start by looking at defining pure Python user defined functions.

## Data sets

We will be using two of the data sets provided by the Museum of Modern Art (MoMA) in this lecture.  Make sure that you have downloaded each repository.  [Download Instructions](./get_MOMA_data.ipynb)

In [1]:
from pyspark.sql import SparkSession
from more_pyspark import get_spark_types, to_pandas

spark = SparkSession.builder.appName('Ops').getOrCreate()



In [2]:
from MoMA_schema import artwork_schema

artists = spark.read.csv("./data/Artists.csv", inferSchema=True, header=True)
artwork = spark.read.csv("./data/Artworks.csv", schema=artwork_schema, header=True)

In [3]:
artists.take(2) >> to_pandas

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,


In [4]:
artists.printSchema()

root
 |-- ConstituentID: integer (nullable = true)
 |-- DisplayName: string (nullable = true)
 |-- ArtistBio: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- BeginDate: integer (nullable = true)
 |-- EndDate: integer (nullable = true)
 |-- Wiki QID: string (nullable = true)
 |-- ULAN: integer (nullable = true)



In [5]:
artwork.take(2) >> to_pandas



Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,


In [6]:
artwork.printSchema()

root
 |-- Title: string (nullable = true)
 |-- Artist: string (nullable = true)
 |-- ConstituentID: string (nullable = true)
 |-- ArtistBio: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- BeginDate: string (nullable = true)
 |-- EndDate: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Medium: string (nullable = true)
 |-- Dimensions: string (nullable = true)
 |-- CreditLine: string (nullable = true)
 |-- AccessionNumber: string (nullable = true)
 |-- Classification: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- DateAcquired: string (nullable = true)
 |-- Cataloged: string (nullable = true)
 |-- ObjectID: string (nullable = true)
 |-- URL: string (nullable = true)
 |-- ThumbnailURL: string (nullable = true)
 |-- Circumference (cm): double (nullable = true)
 |-- Depth (cm): double (nullable = true)
 |-- Diameter (cm): double (nullable = true)
 |-- Height (cm): double (nullable = tr

## Creating and applying a `udf` in `pyspark`

* **udf**: <b>U</b>ser <b>D</b>efined <b>F</b>unction
* Use `pyspark.sql.functions.udf(func, pyspark_type)` to define the function
* Use the `udf` inside `withColumn` to make/change columns

### Example 1 -  Compute the Century of Birth

In [18]:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, col

century = udf(lambda year: (int(year)//100)*100,
              IntegerType())

In [13]:
century(artists.BeginDate) # lazy column expression

Column<'<lambda>(BeginDate)'>

In [14]:
(artists
.select('BeginDate')
.withColumn('Century of Birth', century(artists.BeginDate))
.take(3)
) >> to_pandas

                                                                                

Unnamed: 0,BeginDate,Century of Birth
0,1930,1900
1,1936,1900
2,1941,1900
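
Because a `udf` wraps an ordinary Python callable, the century logic can be checked in plain Python before handing it to Spark. The sketch below is only an illustration: the name `century_of` and the sample years are assumptions, not part of the notebook.

```python
# Plain-Python version of the century logic, checked without Spark.
# `century_of` is a hypothetical name used only for this illustration.
def century_of(year):
    """Round a year down to the first year of its century (1930 -> 1900)."""
    return (int(year) // 100) * 100

# Made-up sample values; the first three match the BeginDate rows shown above.
sample_years = [1930, 1936, 1941, 1899, 2000]
centuries = [century_of(y) for y in sample_years]
print(centuries)
```

Note that `int(None)` raises a `TypeError`, so this version, like the lambda above, assumes the column has no missing values.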


### Example 2 - Indicator column using a Conditional expression

In [15]:
is_american = udf(lambda n: 1 if n == 'American' else 0,
                 IntegerType())

(artists
.select(artists.Nationality)
.withColumn('American', is_american(artists.Nationality))
.take(3)
) >> to_pandas

Unnamed: 0,Nationality,American
0,American,1
1,Spanish,0
2,American,1
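
One design choice worth noticing: with the lambda above, a missing nationality (`None`) compares unequal to `'American'` and is therefore coded as `0`. If missing values should stay missing, a variant along the following lines could be used instead. This is a sketch in plain Python; `american_indicator` is a hypothetical name.

```python
# Indicator that keeps missing values missing instead of coding them as 0.
# `american_indicator` is a hypothetical helper for illustration only.
def american_indicator(nationality):
    if nationality is None:
        return None  # unknown nationality stays unknown
    return 1 if nationality == 'American' else 0

# The original lambda, for comparison.
is_american_py = lambda n: 1 if n == 'American' else 0

examples = ['American', 'Spanish', None]
print([is_american_py(n) for n in examples])      # missing value becomes 0
print([american_indicator(n) for n in examples])  # missing value stays None
```

Which behavior is appropriate depends on how the indicator will be used downstream.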


### Example 3 - Applying the $\log + 1$ transformation 

In statistics, it is common to apply a $\log$ transformation to correct right skew.  Since $\log(0)$ is undefined (it diverges to $-\infty$), it is customary to add one before applying the $\log$ to data that might contain zeros.  The combined transformation $y=\log(x+1)$ is known as the "log plus 1" transform.
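
As a quick check of the formula in plain Python, the standard library also provides `math.log1p`, which computes $\log(x+1)$ directly and stays accurate for very small $x$. The sample heights below are taken from the first rows of the artwork data; the variable names are illustrative only.

```python
from math import e, log, log1p

# Heights (cm) from the first artwork rows, plus a zero to show log(0 + 1) = 0.
heights = [0.0, 48.6, 40.6401]

explicit = [log(h + 1, e) for h in heights]  # log(x + 1) written out
builtin = [log1p(h) for h in heights]        # same transform via math.log1p

# For tiny x, x + 1 rounds to 1.0 in floating point, so the explicit form loses x.
tiny = 1e-18
print(log(tiny + 1, e))
print(log1p(tiny))
```

The notebook continues with the explicit `log(num + 1, e)` form; `log1p` is mentioned here only as an alternative.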

#### Attempt 1

In [16]:
from math import log, e
from pyspark.sql.types import DoubleType

lnp1 = lambda num: log(num + 1, e)

lnp1_spark = udf(lnp1, DoubleType())

In [19]:
(artwork
 .select('Height (cm)')
 .withColumn('ln(Height + 1)', lnp1_spark(col('Height (cm)')))
 .take(3)
)

22/11/03 10:38:03 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/tmp/ipykernel_1327/3049692649.py", line 4, in <lambda>
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:86)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:68)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.ca

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/tmp/ipykernel_1327/3049692649.py", line 4, in <lambda>
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'


#### What went wrong?

The message `unsupported operand type(s) for +: 'NoneType' and 'int'` means that we tried to add `None` and `1`, which is undefined.  This problem is so common that the solution has its own name: the [null object pattern](https://en.wikipedia.org/wiki/Null_object_pattern).

In [20]:
lnp1(None)

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

#### What's the solution?

Check for missing values.

In [21]:
lnp1 = lambda num: None if num is None else log(num + 1, e)

lnp1_spark_V2 = udf(lnp1, DoubleType())

In [22]:
(artwork
 .select('Height (cm)')
 .withColumn('ln(Height + 1)', lnp1_spark_V2(col('Height (cm)')))
 .take(3)
) >> to_pandas

Unnamed: 0,Height (cm),ln(Height + 1)
0,48.6,3.903991
1,40.6401,3.729064
2,34.3,3.563883


#### What's a fancier solution?

Think about the type of our most recent function.  We know that a `float` becomes a `float`, but `None` remains `None`.  This type is known as the [Maybe monad](https://en.wikipedia.org/wiki/Monad_(functional_programming)#An_example:_Maybe), a type that you will reinvent 9,000 times in your career.  

Let's create a more general solution by writing a wrapper function that decorates another function and handles the null object issue.

In [23]:
def maybe_apply(func):
    ''' Decorates a function to account for missing input (e.g. None).
    
    args:
        func - A unary function
        
    returns:
        a unary function that accepts an input (say x) and returns 
            - None when x is None
            - func(x) otherwise
    '''
    return lambda x: None if x is None else func(x)

In [24]:
from pyspark.sql.functions import udf, col

ln1p = lambda num: log(num + 1, e)

spark_log1p_V3 = udf(maybe_apply(ln1p), DoubleType())

(artwork
 .select('Height (cm)')
 .withColumn('log (height + 1)', spark_log1p_V3(col('Height (cm)')))
 .take(2)
) >> to_pandas

Unnamed: 0,Height (cm),log (height + 1)
0,48.6,3.903991
1,40.6401,3.729064
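
Since `maybe_apply` takes a function and returns a function, it can also be used with Python's `@` decorator syntax. The sketch below is a variant, not the notebook's definition: it adds `functools.wraps` so the wrapped function keeps its own name and docstring, and it uses `math.log1p` for the transform.

```python
from functools import wraps
from math import log1p

def maybe_apply(func):
    """Return a version of the unary func that passes None through unchanged."""
    @wraps(func)  # copy func's name and docstring onto the wrapper
    def wrapper(x):
        return None if x is None else func(x)
    return wrapper

@maybe_apply
def ln1p(num):
    """Natural log of (num + 1)."""
    return log1p(num)

print(ln1p(48.6))      # a float for real input
print(ln1p(None))      # None passes straight through
print(ln1p.__name__)   # the original function name is preserved
```

The decorated `ln1p` should be usable in `udf(ln1p, DoubleType())` in the same way as `maybe_apply(ln1p)` in the cell above.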


## <font color="red"> Exercise 6.6.1 </font>

Solve each of the following tasks by creating and applying a Python `udf`.

1. Use the string `replace` method to remove parentheses from the `artwork.Nationality` column.
2. Convert the weight column to pounds.

In both cases, you should check for missing values and proceed accordingly.

In [117]:
# Hint 1 - the string replace method doesn't accept regular expressions, so calls must be chained

'(Austrian)'.replace('(', '').replace(')', '')

'Austrian'

In [118]:
# Hint 2 - use re.sub if you want to use regex
from re import sub

sub('[()]', '', '(Austrian)')

'Austrian'

In [29]:
# Your code here
from re import sub
from pyspark.sql.types import StringType


replace_r = lambda string: sub('[()]','',string)

replace = udf(maybe_apply(replace_r), StringType())

(artwork
 .select('Nationality')
 .withColumn('Nationality', replace(col('Nationality')))
 .take(2)
) >> to_pandas

Unnamed: 0,Nationality
0,Austrian
1,French
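
For the second task, one possible approach is sketched below in plain Python so the conversion can be checked before it is wrapped in a `udf`. The names `KG_PER_LB`, `kg_to_lb`, and `safe_kg_to_lb` are assumptions made for this illustration; `maybe_apply` is repeated so the block runs on its own.

```python
# One pound is defined as exactly 0.45359237 kilograms.
KG_PER_LB = 0.45359237

def maybe_apply(func):
    """Decorate a unary function so that None input yields None."""
    return lambda x: None if x is None else func(x)

# Hypothetical helper names, used only in this sketch.
kg_to_lb = lambda kg: kg / KG_PER_LB
safe_kg_to_lb = maybe_apply(kg_to_lb)

print(safe_kg_to_lb(1.0))   # roughly 2.2 pounds
print(safe_kg_to_lb(None))  # missing weights stay missing
```

Under these assumptions, `udf(safe_kg_to_lb, DoubleType())` could then be applied to `col('Weight (kg)')` inside `withColumn`, following the same pattern as the `Nationality` solution above.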


