# Overview

Apache Spark 2.3 introduced a new feature called Pandas UDFs. More details can be found [here](https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html).

UDF stands for User Defined Function. As the name suggests, the user writes the definition of a Python function, and Spark applies it to the data. In a Pandas UDF, that function receives and returns pandas objects.

So why the distinction, and what's the big deal? pandas tries to be fast and defines its built-in operations in terms of vectors. This is why you'll sometimes see the term vectorized UDF popping up around numerical libraries like pandas. The distinction is that a typical UDF operates on one row at a time (iteratively), whereas its vectorized cousin operates on a whole batch of values at once. Row-at-a-time UDFs can therefore be very slow. To get around this, users would define their UDFs in a language faster than the one they worked in and then call them from their high-level framework. For example, Spark users might choose Java or Scala. While this brings a speedup, it adds complexity for users who are not multilingual.
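To make the row-at-a-time versus vectorized distinction concrete, here is a minimal sketch in plain pandas (no Spark involved). The function names `add_one_row_at_a_time` and `add_one_vectorized` are hypothetical and exist only for illustration. Both return the same values; they differ only in how the work is dispatched.

```python
import pandas

values = pandas.Series([1, 2, 3, 4, 5])

def add_one_row_at_a_time(series: pandas.Series) -> pandas.Series:
    # The Python lambda is invoked once for every element.
    return series.apply(lambda x: x + 1)

def add_one_vectorized(series: pandas.Series) -> pandas.Series:
    # A single expression operates on the whole column at once.
    return series + 1

row_result = add_one_row_at_a_time(values)
vectorized_result = add_one_vectorized(values)
```

On a five-element series the difference is not noticeable, but the per-element Python call in the first version is the overhead that grows with the number of rows.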

This is what makes Apache Spark's Pandas UDFs so exciting. Spark has integrated with the [Apache Arrow](https://arrow.apache.org/) project to provide the best of both worlds. Apache Arrow provides a standardized, cross-language columnar memory format and several useful high-performance libraries. By leveraging Arrow, Spark is able to offer low-overhead, high-performance UDFs defined entirely in Python.

Conveniently, many Koalas APIs use Pandas UDFs under the hood. With the rollout of Apache Spark 3.0, even more Pandas UDF types have been introduced and put to use in the Koalas API. As a result, Koalas 1.0.0 sees a 20-25% boost in performance when running against Apache Spark 3.0.

<center><img src="images/koalas_udf_speed.png" width="400px"></center>

A very good blog article on this Koalas 1.0.0 release can be found [here](https://databricks.com/blog/2020/06/24/introducing-koalas-1-0.html).

# 1. Get Set Up
## 1.1. Create spark context

In [1]:
# Load a helper module
import importlib.util
spec = importlib.util.spec_from_file_location("spark_helper", "../../../Utilities/spark_helper.py")
spark_helper = importlib.util.module_from_spec(spec)
spec.loader.exec_module(spark_helper)

In [2]:
spark_app_name = "spark-jupyter-win"
docker_image = "tschneider/pyspark:v6-beta"
k8_master_ip = "15.4.7.11"
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
/opt/spark

Running findspark.init() function
['/opt/spark/python', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6/site-packages', '/usr/lib/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages/IPython/extensions', '/root/.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Determining IP Of Server
The ip was detected as: 15.4.12.12

Configuring URL for kubernetes master
k8s://https://15.4.7.11:6443

Creating Spark Session

Done!


In [9]:
! kubectl -n spark get pod

NAME                                        READY     STATUS    RESTARTS   AGE
spark-jupyter-win-9a5ad77d95b8a2c5-exec-1   1/1       Running   0          1m
spark-jupyter-win-9a5ad77d95b8a2c5-exec-2   1/1       Running   0          1m
spark-jupyter-win-9a5ad77d95b8a2c5-exec-3   1/1       Running   0          1m


# 2. Types of Pandas UDFs
As of Spark 2.3, there are two types of Pandas UDFs: Scalar and GroupedMap.

We will explore each individually.

## 2.1. Scalar Pandas UDFs (Vectorized UDFs)
Scalar Pandas UDFs allow us to vectorize scalar functions we define in python.

Recall from mathematics that a vector has a magnitude and a direction, while a scalar has only a magnitude. Multiplying a vector by a scalar is a linear operation that scales every component by the same factor. For example, multiplying the vector $[4, 6]$ by the scalar $5$ results in another vector, $[20, 30]$, with the same dimensions.
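The scalar multiplication can be checked directly in pandas. This is only a small illustration; the variable names `vector`, `scalar`, and `scaled` are not part of the notebook.

```python
import pandas

vector = pandas.Series([4, 6])
scalar = 5

# Element-wise product: [4 * 5, 6 * 5]
scaled = vector * scalar
print(scaled.tolist())  # [20, 30]
```

The result has the same length as the input, which is the property a Scalar Pandas UDF has to preserve.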

Within the pandas/Spark framework, vectors are represented as a series, while scalars are represented by built-in types such as int, float, double, etc. A scalar function is thus one that accepts a scalar as a parameter and returns a scalar. Its vectorized form accepts a series and returns a series of the same length.

Below we will see examples of Pandas UDFs that apply scalar operations to our data (vectors). The cool thing is that the Spark framework takes care of vectorizing the functions for us! We do not have to worry about it, and our code stays clean and simple.
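As a rough sketch of the shape such a UDF takes: the function itself is ordinary pandas code, so it can be checked locally before Spark is involved. The names `times_five` and `add_times_five_column` are hypothetical. The Spark wrapper uses the Spark 2.3 style `PandasUDFType.SCALAR` declaration (Spark 3.0 prefers type hints and emits a warning for this form), and it assumes the input column holds long values.

```python
import pandas

def times_five(col: pandas.Series) -> pandas.Series:
    # Receives a batch of values as a Series; returns a Series of equal length.
    return col * 5

def add_times_five_column(spark_dataframe, column_name="C"):
    # pyspark is imported lazily so times_five can be tested without Spark.
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Assumption: the column is LongType, so the UDF returns "long".
    times_five_udf = pandas_udf(times_five, "long", PandasUDFType.SCALAR)
    return spark_dataframe.withColumn(
        column_name + "_times_5", times_five_udf(spark_dataframe[column_name])
    )

# Local check with plain pandas; no Spark session is required.
local_result = times_five(pandas.Series([333, 444, 555, 222, 333]))
```

With a running session, the wrapper could be applied to a Spark DataFrame that has a long column named `C`, for example `add_times_five_column(some_spark_dataframe).show()`.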

In [10]:
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

In [11]:
from databricks import koalas

koalas_dataframe = koalas.DataFrame({
    "A" : [1,2,3,4,5],
    "B" : ["a", "b", "c", "b", "a"],
    "C" : [333, 444, 555, 222, 333]
})

koalas_dataframe

Unnamed: 0,A,B,C
0,1,a,333
1,2,b,444
2,3,c,555
3,4,b,222
4,5,a,333


In [12]:
import numpy
import pandas
from pyspark.sql.functions import pandas_udf, PandasUDFType, lit
from pyspark.sql.types import *

In [13]:
koalas_dataframe.to_spark().schema

StructType(List(StructField(A,LongType,false),StructField(B,StringType,false),StructField(C,LongType,false)))

In [14]:
def square(x):
    return x ** 2

In [15]:
koalas_dataframe["C"].apply(square)



0    110889
1    197136
2    308025
3     49284
4    110889
Name: C, dtype: int64

In [18]:
def foobar(df):
    df["E"] = "foobar"
    df["F"] = 5
    df["G"] = "barfoo"
    return df

koalas_dataframe.groupby("A").apply(foobar)

Unnamed: 0,A,B,C,E,F,G
0,1,a,333,foobar,5,barfoo
1,2,b,444,foobar,5,barfoo
2,3,c,555,foobar,5,barfoo
3,4,b,222,foobar,5,barfoo
4,5,a,333,foobar,5,barfoo


In [None]:
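def my_scalar_func(col: pandas.Series) -> pandas.Series:
    # A scalar Pandas UDF receives a batch of values as a Series
    # and must return a Series of the same length.
    return col * 5

# Column "C" is LongType in the Spark schema, so declare a "long" return type.
my_udf = pandas_udf(my_scalar_func, "long", PandasUDFType.SCALAR)

# A UDF is applied to Spark columns, not to a pandas Series wrapped in lit().
spark_dataframe = koalas_dataframe.to_spark()
spark_dataframe.select(my_udf(spark_dataframe["C"]).alias("C_times_5")).show()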

In [None]:
spark_session.sparkContext.stop()