<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Summary" data-toc-modified-id="Summary-1">Summary</a></span></li><li><span><a href="#Requirements" data-toc-modified-id="Requirements-2">Requirements</a></span></li><li><span><a href="#Column-transformation-using-Series-UDF" data-toc-modified-id="Column-transformation-using-Series-UDF-3">Column transformation using Series UDF</a></span><ul class="toc-item"><li><span><a href="#Types-of-Series-UDF" data-toc-modified-id="Types-of-Series-UDF-3.1">Types of Series UDF</a></span></li><li><span><a href="#Dataset---Google-BigQuery" data-toc-modified-id="Dataset---Google-BigQuery-3.2">Dataset - Google BigQuery</a></span><ul class="toc-item"><li><span><a href="#Steps-to-data-connection" data-toc-modified-id="Steps-to-data-connection-3.2.1">Steps to data connection</a></span></li><li><span><a href="#Installation-+-Configuration" data-toc-modified-id="Installation-+-Configuration-3.2.2">Installation + Configuration</a></span></li></ul></li><li><span><a href="#Errata" data-toc-modified-id="Errata-3.3">Errata</a></span></li><li><span><a href="#Read-data-locally" data-toc-modified-id="Read-data-locally-3.4">Read data locally</a></span></li></ul></li><li><span><a href="#Series-to-Series-UDF" data-toc-modified-id="Series-to-Series-UDF-4">Series to Series UDF</a></span><ul class="toc-item"><li><span><a href="#Converting-Fahrenheit-to-Celsius-with-a-S-to-S-UDF" data-toc-modified-id="Converting-Fahrenheit-to-Celsius-with-a-S-to-S-UDF-4.1">Converting Fahrenheit to Celsius with a S-to-S UDF</a></span><ul class="toc-item"><li><span><a href="#Errata" data-toc-modified-id="Errata-4.1.1">Errata</a></span></li></ul></li></ul></li><li><span><a href="#Iterator-of-Series-UDF" data-toc-modified-id="Iterator-of-Series-UDF-5">Iterator of Series UDF</a></span></li></ul></div>

# Using Pandas UDF

## Summary
- Pandas UDFs let you take code that works on Pandas data frames and scale it to Spark data frames. PyArrow handles efficient serialization between the two data structures.
- Pandas UDFs fall into two main families, depending on how much control we need over the batches. Series and Iterator of Series UDFs (along with Iterator of data frame/`mapInPandas`) batch efficiently, but the user has no control over the batch composition.
- If you need control over the content of each batch, you can use grouped data UDFs with the split-apply-combine programming pattern. PySpark provides access to the values inside each batch of a `GroupedData` object either as Series (group aggregate UDF) or as a data frame (group map UDF).
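
As a rough illustration of the split-apply-combine pattern mentioned above, here is a minimal sketch in plain Pandas, with no Spark involved. The data frame, the column names `stn` and `temp`, and the helper `center` are made up for illustration. In PySpark, the same two shapes would be expressed as grouped data UDFs.

```python
import pandas as pd

# Hypothetical sample data: two stations with two temperature readings each.
df = pd.DataFrame(
    {
        "stn": ["A", "A", "B", "B"],
        "temp": [32.0, 50.0, 212.0, 122.0],
    }
)

# Group aggregate flavor: each group's column is seen as a Series
# and reduced to a single value per group.
mean_temp = df.groupby("stn")["temp"].mean()


# Group map flavor: each group is seen as a data frame
# and a data frame is returned for that group.
def center(group: pd.DataFrame) -> pd.DataFrame:
    return group.assign(temp_centered=group["temp"] - group["temp"].mean())


# Split by station, apply `center` to each piece, combine the results.
centered = pd.concat([center(group) for _, group in df.groupby("stn")])
```

The point of the sketch is only the shape of the inputs and outputs: Series in and scalar out for the aggregate case, data frame in and data frame out for the map case.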

## Requirements

This chapter will use:
1. pandas
2. scikit-learn
3. PyArrow

The chapter assumes you are using PySpark 3.0 or above.

## Column transformation using Series UDF

### Types of Series UDF

__Series to Series__ 

- Takes `Column` objects, converts them to Pandas `Series`, and returns a `Series` object that gets promoted back to a `Column` object.

__Iterator of Series to Iterator of Series__ 

- The `Column` is batched, then fed to the function as an Iterator object.
- Takes a single `Column`, returns a single `Column`.
- Useful when you need to initialize expensive state once and reuse it across batches.

__Iterator of multiple Series to Iterator of Series__

- Takes multiple `Column` objects as input but preserves the iterator pattern.
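
The three shapes above can be sketched as plain Python functions with type hints. This is a minimal sketch run locally on Pandas objects; the function names (`add_one`, `add_one_batched`, `add_columns`) are hypothetical. In PySpark, each function would additionally be wrapped with `pandas_udf` and a declared return type.

```python
from typing import Iterator, Tuple

import pandas as pd


# Series to Series: one Series in, one Series of the same length out.
def add_one(values: pd.Series) -> pd.Series:
    return values + 1


# Iterator of Series to Iterator of Series: loop over the batches
# and yield one result per batch.
def add_one_batched(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        yield batch + 1


# Iterator of multiple Series to Iterator of Series: each element of the
# iterator is a tuple holding one Series per input column.
def add_columns(
    batches: Iterator[Tuple[pd.Series, pd.Series]]
) -> Iterator[pd.Series]:
    for left, right in batches:
        yield left + right


# Local stand-ins for the batches PySpark would feed to each function.
one = add_one(pd.Series([1.0, 2.0]))
batched = list(add_one_batched(iter([pd.Series([1.0]), pd.Series([2.0, 3.0])])))
summed = list(
    add_columns(iter([(pd.Series([1.0, 2.0]), pd.Series([10.0, 20.0]))]))
)
```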

### Dataset - Google BigQuery

We will use the National Oceanic and Atmospheric Administration (NOAA) Global Surface Summary of the Day (GSOD) dataset.

#### Steps to data connection

1. Install and configure the connector (if necessary), following the vendor’s documentation.
2. Customize the SparkReader object to account for the new data source type.
3. Read the data, authenticating as needed.

#### Installation + Configuration

After setting up a Google Cloud Platform account, initialize PySpark with the BigQuery connector enabled.


### Errata 
The code below doesn't work because of compatibility issues between PyArrow and Java 11. I skipped this part and downloaded the dataset from the author's GitHub instead.

Reference:
- https://stackoverflow.com/questions/62109276/errorjava-lang-unsupportedoperationexception-for-pyspark-pandas-udf-documenta
- https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/200
- https://stackoverflow.com/questions/64960642/rewrite-udf-to-pandas-udf-with-arraytype-column

In [None]:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set(
    "spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true"
)
conf.set("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
conf.set(
    "spark.jars.packages",
    "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.19.1",
)

# spark = (
#     SparkSession.builder
#     .config(
#         "spark.jars.packages",
#         "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.19.1",
#     )
#     .config(
#         "spark.driver.extraJavaOptions",
#         "-Dio.netty.tryReflectionSetAccessible=true"
#     )
#     .config(
#         "spark.executor.extraJavaOptions",
#         "-Dio.netty.tryReflectionSetAccessible=true"
#     )
#     .getOrCreate()
# )

spark = SparkSession.builder.config(conf=conf).getOrCreate()

After initializing, read the `gsod` tables for 2010 to 2020.

In [None]:
from functools import reduce
import pyspark.sql.functions as F


def read_df_from_bq(year):
    return (
        spark.read.format("bigquery").option(
            "table", f"bigquery-public-data.noaa_gsod.gsod{year}"
        )
        .option("credentialsFile", "/Users/taichinakatani/dotfiles/keys/bq-key.json")
        .option("parentProject", "still-vim-244001")
        .load()
    )


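# gsod2020 has an additional date column that the previous years lack, so
# unionByName(allowMissingColumns=True) fills the missing values with null.
# Note: allowMissingColumns requires Spark 3.1 or above.
gsod = (
    reduce(
        lambda x, y: x.unionByName(y, allowMissingColumns=True),
        # range(2010, 2021) covers 2010 through 2020 inclusive
        [read_df_from_bq(year) for year in range(2010, 2021)],
    )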
    .dropna(subset=["year", "mo", "da", "temp"])
    .where(F.col("temp") != 9999.9)
    .drop("date")
)

In [None]:
gsod.select(F.col('year')).show(5)

### Read data locally

In [3]:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Read from local parquet instead
gsod = spark.read.load("data/gsod_noaa/gsod2018.parquet")

## Series to Series UDF

- Python UDFs work on one record at a time, while Scalar (Series to Series) UDFs work on one _Series_ at a time and are written in Pandas code.
- Pandas has simpler data types than PySpark, so you need to be careful to align the types. `pandas_udf` helps with this by letting you declare the return type.

![](notes/img/pd.png)
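
One concrete case where the types can drift apart is shown in the sketch below, which runs on the Pandas side only. It demonstrates that an integer Series containing a missing value is promoted to `float64`. Whether this matters for a given UDF depends on the return type you declare, so treat it as an illustration of the kind of mismatch to watch for rather than a complete list.

```python
import pandas as pd

# An integer column with a missing value: NumPy-backed int64 cannot hold
# a missing value, so Pandas promotes the Series to float64.
with_missing = pd.Series([1, None, 3])
print(with_missing.dtype)  # float64

# One option: opt in to the nullable integer dtype to keep integer semantics.
as_nullable_int = with_missing.astype("Int64")
print(as_nullable_int.dtype)  # Int64
```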

### Converting Fahrenheit to Celsius with a S-to-S UDF

#### Errata

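Using the `pandas_udf` decorator kills the kernel for some reason, so it is commented out below. Without the decorator, `f_to_c` is a plain Python function: calling it on `F.col("temp")` builds an ordinary `Column` expression, so the cell runs but no Pandas UDF is involved.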

In [5]:
import pandas as pd
import pyspark.sql.types as T
import pyspark.sql.functions as F

# Note the "pandas_udf" decorator syntax and how the function returns a pd.Series.
# The decorator is commented out because it kills the kernel (see Errata above).
# @F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
    """Transforms Fahrenheit to Celsius."""
    return (degrees - 32) * 5 / 9

In [6]:
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)

+----+------------------+
|temp|            temp_c|
+----+------------------+
|37.2|2.8888888888888906|
|71.6|21.999999999999996|
|53.5|11.944444444444445|
|24.7|-4.055555555555555|
|70.4|21.333333333333336|
+----+------------------+
only showing top 5 rows
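
Because the body of a Series to Series UDF is ordinary Pandas code, it can be checked on a small Pandas `Series` without starting Spark. The sketch below repeats the undecorated `f_to_c` and runs it on three hand-picked temperatures. The sample values are chosen for illustration and are not taken from the dataset.

```python
import pandas as pd


def f_to_c(degrees: pd.Series) -> pd.Series:
    """Transforms Fahrenheit to Celsius."""
    return (degrees - 32) * 5 / 9


# Freezing point, boiling point, and the value where both scales agree.
sample = pd.Series([32.0, 212.0, -40.0])
converted = f_to_c(sample)
print(converted.tolist())  # [0.0, 100.0, -40.0]
```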



## Iterator of Series UDF

- The signature goes from `(pd.Series) → pd.Series` to `(Iterator[pd.Series]) → Iterator[pd.Series]`.
- Since we are working with an Iterator of Series, we explicitly iterate over each batch one by one. PySpark takes care of distributing the work for us.
- The function uses `yield` rather than `return`, so it returns an iterator.

In [None]:
from time import sleep
from typing import Iterator


@F.pandas_udf(T.DoubleType())
def f_to_c2(degrees: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Transforms Fahrenheit to Celsius."""
    sleep(5)  # stands in for expensive setup: runs once per iterator, not once per batch
    for batch in degrees:
        yield (batch - 32) * 5 / 9


gsod.select(
    "temp", f_to_c2(F.col("temp")).alias("temp_c")
).distinct().show(5)

# +-----+-------------------+
# | temp|             temp_c|
# +-----+-------------------+
# | 37.2| 2.8888888888888906|
# | 85.9| 29.944444444444443|
# | 53.5| 11.944444444444445|
# | 71.6| 21.999999999999996|
# |-27.6|-33.111111111111114|
# +-----+-------------------+
# only showing top 5 rows
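
To see why the iterator form suits expensive setup, the same pattern can be run locally on a hand-built iterator of Pandas `Series`. This is a minimal sketch: the `setup_calls` list is a hypothetical stand-in for costly initialization (the role played by `sleep(5)` above), and the two batches are made up for illustration.

```python
from typing import Iterator, List

import pandas as pd

# Records each time the (pretend) expensive setup runs.
setup_calls: List[str] = []


def f_to_c_batches(degrees: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Transforms Fahrenheit to Celsius, batch by batch."""
    setup_calls.append("initialized")  # runs once, before the first batch
    for batch in degrees:
        yield (batch - 32) * 5 / 9


# Two hand-built batches standing in for the ones PySpark would provide.
batches = iter([pd.Series([50.0, 14.0]), pd.Series([32.0])])
results = [result.tolist() for result in f_to_c_batches(batches)]

print(results)           # [[10.0, -10.0], [0.0]]
print(len(setup_calls))  # 1
```

The setup line runs a single time even though two batches are processed, which is the property that makes this form attractive when initialization is costly.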