# 3. Functions in Apache Spark

## Aggregations

### "summary" and "describe"
![Summary and Describe](./images/Summary_Describe.png)


### format_number()
![format_number](./images/format_number.png)



### groupBy
Use the DataFrame **`groupBy`** method to create a grouped data object. 

<img src="https://files.training.databricks.com/images/aspwd/aggregation_groupby.png" width="60%" />

In [0]:
df.groupBy("event_name")

In [0]:
df.groupBy("geo.state", "geo.city")


#### Grouped data methods
Various aggregation methods are available on the <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html" target="_blank">GroupedData</a> object.


| Method | Description |
| --- | --- |
| agg | Compute aggregates by specifying a series of aggregate columns |
| avg | Compute the mean value of each numeric column for each group |
| count | Count the number of rows for each group |
| max | Compute the max value of each numeric column for each group |
| mean | Compute the average value of each numeric column for each group |
| min | Compute the min value of each numeric column for each group |
| pivot | Pivots a column of the current DataFrame and performs the specified aggregation |
| sum | Compute the sum of each numeric column for each group |

Here, we're getting the average purchase revenue for each state.

In [0]:
avg_state_purchases_df = df.groupBy("geo.state").avg("ecommerce.purchase_revenue_in_usd")
display(avg_state_purchases_df)

And here we compute the total quantity and the sum of the purchase revenue for each combination of state and city.

In [0]:
city_purchase_quantities_df = df.groupBy("geo.state", "geo.city").sum("ecommerce.total_item_quantity", "ecommerce.purchase_revenue_in_usd")
display(city_purchase_quantities_df)
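
To make the semantics concrete, here is a minimal plain-Python sketch (not Spark code) of what a grouped average computes. The sample rows and the `group_avg` helper are hypothetical; the sketch assumes that null values are skipped when averaging, as Spark's `avg` does.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (state, purchase revenue) pairs standing in for DataFrame rows.
rows = [("CA", 100.0), ("NY", 50.0), ("CA", 300.0), ("NY", 150.0), ("TX", 80.0), ("TX", None)]

def group_avg(pairs):
    """Group values by key and average each group, skipping None values."""
    groups = defaultdict(list)
    for key, value in pairs:
        bucket = groups[key]          # make sure every key gets a group
        if value is not None:         # nulls do not contribute to the average
            bucket.append(value)
    # A group with no non-null values yields None rather than raising.
    return {key: (mean(values) if values else None) for key, values in groups.items()}

avg_by_state = group_avg(rows)
print(avg_by_state)
```

The dictionary keys play the role of the grouping column, and each value corresponds to one row of the aggregated result.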


### Built-In Functions
In addition to DataFrame and Column transformation methods, there are a ton of helpful functions in Spark's built-in <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-functions-builtin.html" target="_blank">SQL functions</a> module.


#### Aggregate Functions **`agg`**

Here are some of the built-in functions available for aggregation.

| Method | Description |
| --- | --- |
| approx_count_distinct | Returns the approximate number of distinct items in a group |
| avg | Returns the average of the values in a group |
| collect_list | Returns a list of objects with duplicates |
| corr | Returns the Pearson Correlation Coefficient for two columns |
| max | Returns the maximum value of the expression in a group |
| mean | Returns the average of the values in a group (alias for avg) |
| stddev_samp | Returns the sample standard deviation of the expression in a group |
| sumDistinct | Returns the sum of distinct values in the expression (deprecated since Spark 3.2 in favor of sum_distinct) |
| var_pop | Returns the population variance of the values in a group |

Use the grouped data method <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.agg.html#pyspark.sql.GroupedData.agg" target="_blank">**`agg`**</a> to apply built-in aggregate functions.

This allows you to apply other transformations on the resulting columns, such as <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html" target="_blank">**`alias`**</a>.

In [0]:
from pyspark.sql.functions import sum  # note: this shadows Python's built-in sum in this scope

state_purchases_df = df.groupBy("geo.state").agg(sum("ecommerce.total_item_quantity").alias("total_purchases"))
display(state_purchases_df)


You can also apply multiple aggregate functions to grouped data in a single **`agg`** call.

In [0]:
from pyspark.sql.functions import avg, approx_count_distinct

state_aggregates_df = (df
                       .groupBy("geo.state")
                       .agg(avg("ecommerce.total_item_quantity").alias("avg_quantity"),
                            approx_count_distinct("user_id").alias("distinct_users"))
                      )

display(state_aggregates_df)


#### Math Functions
Here are some of the built-in functions for math operations.

| Method | Description |
| --- | --- |
| ceil | Computes the ceiling of the given column. |
| cos | Computes the cosine of the given value. |
| log | Computes the natural logarithm of the given value. |
| round | Returns the value of the column rounded to 0 decimal places using the HALF_UP rounding mode. |
| sqrt | Computes the square root of the specified float value. |

In [0]:
from pyspark.sql.functions import cos, sqrt

display(spark.range(10)  # Create a DataFrame with a single column called "id" with a range of integer values
        .withColumn("sqrt", sqrt("id"))
        .withColumn("cos", cos("id"))
       )
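
The HALF_UP rounding mode listed for `round` in the table above is worth checking by hand, because Python's own built-in `round` resolves ties differently (to the nearest even digit). The following plain-Python sketch uses the standard `decimal` module to contrast the two modes. The helper name `round_with_mode` is hypothetical, and this illustrates the rounding rules only; it is not Spark code.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_with_mode(value: str, places: int, mode: str) -> Decimal:
    """Round a decimal string to `places` digits using a decimal rounding mode."""
    quantum = Decimal(10) ** -places  # 1 for places=0, 0.01 for places=2
    return Decimal(value).quantize(quantum, rounding=mode)

ties = ("0.5", "1.5", "2.5")
half_up = [round_with_mode(v, 0, ROUND_HALF_UP) for v in ties]
half_even = [round_with_mode(v, 0, ROUND_HALF_EVEN) for v in ties]

print(half_up)    # ties are rounded away from zero
print(half_even)  # ties are rounded to the nearest even digit
print(round(2.5)) # Python's built-in round also rounds ties to even
```

If you need the round-half-even behaviour in Spark, the `bround` function in `pyspark.sql.functions` is the one to look at.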

## DateTimes

### Unix time
- Unix time describes a point in time as the number of seconds that have elapsed since the Unix epoch, which is 00:00:00 UTC on 1 January 1970. The example below divides by `1e6` before casting, which assumes the `unixtime` column stores microseconds since the epoch.
  - ```
    df2 = df1.withColumn("ts", (col("unixtime") / 1e6).cast("timestamp"))
    ```

![Unix Timestamp](./images/Unix_Timestamp.png)
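
As a quick sanity check outside Spark, the same conversion can be sketched with Python's standard library. This assumes, like the Spark snippet above, that the raw value is a count of microseconds since the epoch; the helper name `micros_to_utc` is hypothetical.

```python
from datetime import datetime, timezone

def micros_to_utc(micros: int) -> datetime:
    """Convert microseconds since the Unix epoch to a timezone-aware UTC datetime."""
    # Split into whole seconds and leftover microseconds to avoid float rounding.
    seconds, remainder = divmod(micros, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=remainder)

epoch = micros_to_utc(0)
one_day_later = micros_to_utc(86_400_500_000)  # one day plus half a second

print(epoch.isoformat())
print(one_day_later.isoformat())
```

Using integer division rather than `/ 1e6` keeps the microsecond part exact, which matters when you compare converted values against the original integers.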



### Built-In Functions: Date Time Functions
Here are a few built-in functions to manipulate dates and times in Spark.

| Method | Description |
| --- | --- |
| **`add_months`** | Returns the date that is numMonths after startDate |
| **`current_timestamp`** | Returns the current timestamp at the start of query evaluation as a timestamp column |
| **`date_format`** | Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument. |
| **`dayofweek`** | Extracts the day of the week as an integer from a given date/timestamp/string (1 = Sunday, ..., 7 = Saturday) |
| **`from_unixtime`** | Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format |
| **`minute`** | Extracts the minutes as an integer from a given date/timestamp/string. |
| **`unix_timestamp`** | Converts time string with given pattern to Unix timestamp (in seconds) |

### Datetime Patterns for Formatting and Parsing

Spark uses <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html" target="_blank">pattern letters for date and timestamp parsing and formatting</a>. A subset of these patterns is shown below.

| Symbol | Meaning         | Presentation | Examples               |
| ------ | --------------- | ------------ | ---------------------- |
| G      | era             | text         | AD; Anno Domini        |
| y      | year            | year         | 2020; 20               |
| D      | day-of-year     | number(3)    | 189                    |
| M/L    | month-of-year   | month        | 7; 07; Jul; July       |
| d      | day-of-month    | number(3)    | 28                     |
| Q/q    | quarter-of-year | number/text  | 3; 03; Q3; 3rd quarter |
| E      | day-of-week     | text         | Tue; Tuesday           |

<img src="https://files.training.databricks.com/images/icon_warn_32.png" alt="Warning"> Spark's handling of dates and timestamps changed in version 3.0, and the patterns used for parsing and formatting these values changed as well. For a discussion of these changes, please reference <a href="https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html" target="_blank">this Databricks blog post</a>.


#### **`cast()`**

Casts a column to a different data type, specified using a string representation or a DataType.

In [0]:
from pyspark.sql.functions import col

timestamp_df = df.withColumn("col_timestamp", (col("col_timestamp") / 1e6).cast("timestamp"))
display(timestamp_df)


#### **`date_format()`**
Converts a date/timestamp/string to a string formatted with the given date time pattern.

In [0]:
from pyspark.sql.functions import date_format

formatted_df = (timestamp_df
                .withColumn("date string", date_format("col_timestamp", "MMMM dd, yyyy"))
                .withColumn("time string", date_format("col_timestamp", "HH:mm:ss.SSSSSS"))
               )
display(formatted_df)


#### **`year`**
Extracts the year as an integer from a given date/timestamp/string.

##### Similar methods: **`month`**, **`dayofweek`**, **`minute`**, **`second`**, etc.

In [0]:
from pyspark.sql.functions import year, month, dayofweek, minute, second

datetime_df = (timestamp_df
               .withColumn("year", year(col("col_timestamp")))
               .withColumn("month", month(col("col_timestamp")))
               .withColumn("dayofweek", dayofweek(col("col_timestamp")))
               .withColumn("minute", minute(col("col_timestamp")))
               .withColumn("second", second(col("col_timestamp")))
              )
display(datetime_df)
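
One thing that is easy to trip over: Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), whereas Python's `date.isoweekday()` runs from 1 (Monday) to 7 (Sunday). The plain-Python sketch below shows the mapping between the two conventions; the helper name `spark_style_dayofweek` is hypothetical and the function is only meant for checking values by hand.

```python
from datetime import date

def spark_style_dayofweek(d: date) -> int:
    """Return 1 for Sunday through 7 for Saturday, matching Spark's numbering."""
    # isoweekday() returns 1 (Monday) through 7 (Sunday); shift so Sunday becomes 1.
    return d.isoweekday() % 7 + 1

thursday = date(1970, 1, 1)   # the Unix epoch fell on a Thursday
saturday = date(1970, 1, 3)
sunday = date(1970, 1, 4)

print(spark_style_dayofweek(thursday), spark_style_dayofweek(saturday), spark_style_dayofweek(sunday))
```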


#### **`to_date`**
Converts the column to DateType using the default casting rules.

In [0]:
from pyspark.sql.functions import to_date

date_df = timestamp_df.withColumn("date", to_date(col("col_timestamp")))
display(date_df)



#### **`date_add`**
Returns the date that is the given number of days after the start date.

In [0]:
from pyspark.sql.functions import date_add

plus_2_df = timestamp_df.withColumn("plus_two_days", date_add(col("col_timestamp"), 2))
display(plus_2_df)

## Complex Types


### String Functions
Here are some of the built-in functions available for manipulating strings.

| Method | Description |
| --- | --- |
| translate | Translate any character in the src by a character in replaceString |
| regexp_replace | Replace all substrings of the specified string value that match regexp with rep |
| regexp_extract | Extract a specific group matched by a Java regex, from the specified string column |
| ltrim | Removes the leading space characters from the specified string column |
| lower | Converts a string column to lowercase |
| split | Splits str around matches of the given pattern |


For example, suppose we need to parse the **`email`** column. We can use the **`split`** function to separate the handle from the domain.

In [0]:
from pyspark.sql.functions import split
display(df.select(split(df.email, '@', 0).alias('email_handle')))
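
Note that `split` returns an array column containing every piece, so the result above holds both the handle and the domain. The plain-Python sketch below mirrors that behaviour with `re.split`, since Spark's `split` also treats its second argument as a regular expression. The helper name `split_email` is hypothetical.

```python
import re
from typing import List

def split_email(email: str) -> List[str]:
    """Split an email address around '@', returning all pieces as a list."""
    return re.split("@", email)

parts = split_email("annagray@kaufman.com")
handle, domain = parts[0], parts[1]

print(parts)
print(handle, domain)
```

In Spark, a single piece can then be picked out of the array with a collection function such as `element_at`, which is described in the next section and counts from 1 rather than 0.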


### Collection Functions

Here are some of the built-in functions available for working with arrays.

| Method | Description |
| --- | --- |
| array_contains | Returns null if the array is null, true if the array contains value, and false otherwise. |
| element_at | Returns element of array at given index. Array elements are numbered starting with **1**. |
| explode | Creates a new row for each element in the given array or map column. |
| collect_set | Returns a set of objects with duplicate elements eliminated. |

In [0]:
from pyspark.sql.functions import array_contains, col, element_at

mattress_df = (details_df
               .filter(array_contains(col("details"), "Mattress"))
               .withColumn("size", element_at(col("details"), 2)))
display(mattress_df)


### Aggregate Functions

Here are some of the built-in aggregate functions available for creating arrays, typically from GroupedData.

| Method | Description |
| --- | --- |
| collect_list | Returns an array consisting of all values within the group. |
| collect_set | Returns an array consisting of all unique values within the group. |


Let's say that we want to see the sizes of mattresses ordered by each email address. For this, we can use the **`collect_set`** function.

In [0]:
from pyspark.sql.functions import collect_set

size_df = mattress_df.groupBy("email").agg(collect_set("size").alias("size options"))

display(size_df)
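
The difference between `collect_list` and `collect_set` is easiest to see on a tiny example. The following plain-Python sketch (not Spark code) groups hypothetical `(email, size)` pairs both ways; the sample data is made up for illustration.

```python
from collections import defaultdict

# Hypothetical orders: the first customer ordered a Queen mattress twice.
orders = [
    ("a@example.com", "Queen"),
    ("a@example.com", "Queen"),
    ("a@example.com", "King"),
    ("b@example.com", "Twin"),
]

collected_list = defaultdict(list)  # keeps duplicates, like collect_list
collected_set = defaultdict(set)    # drops duplicates, like collect_set

for email, size in orders:
    collected_list[email].append(size)
    collected_set[email].add(size)

print(dict(collected_list))
print({email: sorted(sizes) for email, sizes in collected_set.items()})
```

The sets are sorted only to make the printed output stable; as with a Python set, you should not rely on the order of elements that `collect_set` returns.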


### Union and unionByName
<img src="https://files.training.databricks.com/images/icon_warn_32.png" alt="Warning"> The DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.union.html" target="_blank">**`union`**</a> method resolves columns by position, as in standard SQL. You should use it only if the two DataFrames have exactly the same schema, including the column order. In contrast, the DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.unionByName.html" target="_blank">**`unionByName`**</a> method resolves columns by name. Both methods behave like UNION ALL in SQL: neither one removes duplicates.

Below is a check to see whether the two DataFrames have matching schemas, in which case **`union`** would be appropriate.

In [0]:
mattress_df.schema == size_df.schema

If we do get the two schemas to match with a simple select statement, then we can use a union.

In [0]:
union_count = mattress_df.select("email").union(size_df.select("email")).count()

mattress_count = mattress_df.count()
size_count = size_df.count()

mattress_count + size_count == union_count
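
To see why column order matters, here is a small plain-Python sketch of the two resolution strategies. It is not Spark code: the helpers `union_by_position` and `union_by_name` are hypothetical stand-ins that model a DataFrame as a list of column names plus a list of row tuples.

```python
def union_by_position(cols_a, rows_a, cols_b, rows_b):
    """Append rows_b as-is; columns are matched purely by position."""
    if len(cols_a) != len(cols_b):
        raise ValueError("column counts differ")
    return list(cols_a), list(rows_a) + list(rows_b)

def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Reorder each row of b to follow the column order of a before appending."""
    if sorted(cols_a) != sorted(cols_b):
        raise ValueError("column names differ")
    positions = [cols_b.index(name) for name in cols_a]
    realigned = [tuple(row[i] for i in positions) for row in rows_b]
    return list(cols_a), list(rows_a) + realigned

# Same columns, different order (hypothetical data).
cols_a, rows_a = ["email", "size"], [("a@example.com", "Queen")]
cols_b, rows_b = ["size", "email"], [("Twin", "b@example.com")]

_, by_position = union_by_position(cols_a, rows_a, cols_b, rows_b)
_, by_name = union_by_name(cols_a, rows_a, cols_b, rows_b)

print(by_position)  # second row has size and email swapped
print(by_name)      # second row is aligned with the first frame's columns
```

The positional version silently puts a mattress size in the `email` column, which is exactly the kind of mistake the schema check above is meant to catch.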

## Additional Functions

### Non-aggregate and Miscellaneous Functions
Here are a few additional non-aggregate and miscellaneous built-in functions.

| Method | Description |
| --- | --- |
| col / column | Returns a Column based on the given column name. |
| lit | Creates a Column of literal value |
| isnull | Return true iff the column is null |
| rand | Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0) |


#### **`col()`**

We can reference a particular column by name using the **`col`** function.

In [0]:
from pyspark.sql.functions import col

gmail_accounts = sales_df.filter(col("email").endswith("gmail.com"))

display(gmail_accounts)


#### **`lit`** 
Used to create a column out of a value, which is useful for appending columns.

In [0]:
from pyspark.sql.functions import lit

display(gmail_accounts.select("email", lit(True).alias("gmail user")))



### DataFrame Na Functions
<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameNaFunctions.html#pyspark.sql.DataFrameNaFunctions" target="_blank">DataFrameNaFunctions</a> is a DataFrame submodule with methods for handling null values. Obtain an instance of DataFrameNaFunctions by accessing the **`na`** attribute of a DataFrame.

| Method | Description |
| --- | --- |
| drop | Returns a new DataFrame omitting rows with any, all, or a specified number of null values, considering an optional subset of columns |
| fill | Replace null values with the specified value for an optional subset of columns |
| replace | Returns a new DataFrame replacing a value with another value, considering an optional subset of columns |

In [0]:
print(sales_df.count())
print(sales_df.na.drop().count())


We can fill in the missing coupon codes with **`na.fill`**.

In [0]:
display(sales_exploded_df.select("items.coupon").na.fill("NO COUPON"))
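
The behaviour of `drop` and `fill` can be sketched in plain Python by treating each row as a dictionary and null as `None`. This is an illustration under those assumptions, not Spark code; the sample rows and the helper names `drop_any_null` and `fill_nulls` are hypothetical.

```python
rows = [
    {"email": "a@example.com", "coupon": "NEWBED10"},
    {"email": "b@example.com", "coupon": None},
    {"email": None, "coupon": None},
]

def drop_any_null(records):
    """Keep only rows with no None values (the default 'any' behaviour of drop)."""
    return [r for r in records if all(v is not None for v in r.values())]

def fill_nulls(records, value, subset):
    """Replace None with `value`, but only in the columns listed in `subset`."""
    return [
        {k: (value if v is None and k in subset else v) for k, v in r.items()}
        for r in records
    ]

dropped = drop_any_null(rows)
filled = fill_nulls(rows, "NO COUPON", subset=["coupon"])

print(len(rows), len(dropped))
print(filled)
```

Note that filling leaves the row count unchanged, while dropping can shrink it considerably when many columns are sparsely populated.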



### Joining DataFrames
The DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html?highlight=join#pyspark.sql.DataFrame.join" target="_blank">**`join`**</a> method joins two DataFrames based on a given join expression. 

Several different types of joins are supported:

Inner join based on equal values of a shared column called "name" (i.e., an equi join)<br/>
**`df1.join(df2, "name")`**

Inner join based on equal values of the shared columns called "name" and "age"<br/>
**`df1.join(df2, ["name", "age"])`**

Full outer join based on equal values of a shared column called "name"<br/>
**`df1.join(df2, "name", "outer")`**

Left outer join based on an explicit column expression<br/>
**`df1.join(df2, df1["customer_name"] == df2["account_name"], "left_outer")`**

In [0]:
joined_df = gmail_accounts.join(other=users_df, on="email", how="inner")
display(joined_df)
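
The difference between an inner join and a left outer join can be sketched in plain Python with lists of dictionaries. This is only an illustration of the row-matching logic, not how Spark executes joins; the helper `join_on` and the sample data are hypothetical, and the sketch assumes join keys are never null.

```python
from collections import defaultdict

def join_on(left, right, key, how="inner"):
    """Equi-join two lists of dict rows on `key`; supports 'inner' and 'left_outer'."""
    right_by_key = defaultdict(list)
    for row in right:
        right_by_key[row[key]].append(row)
    # Columns contributed by the right side, excluding the shared key column.
    right_cols = [c for c in (right[0] if right else {}) if c != key]

    result = []
    for row in left:
        matches = right_by_key.get(row[key], [])
        for match in matches:
            result.append({**row, **{c: match[c] for c in right_cols}})
        if not matches and how == "left_outer":
            # Unmatched left rows are kept, with None for the right-side columns.
            result.append({**row, **{c: None for c in right_cols}})
    return result

accounts = [
    {"email": "a@gmail.com", "city": "Austin"},
    {"email": "b@gmail.com", "city": "Boston"},
]
users = [{"email": "a@gmail.com", "user_id": "U1"}]

inner = join_on(accounts, users, "email")
left_outer = join_on(accounts, users, "email", how="left_outer")

print(inner)
print(left_outer)
```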



## User-Defined Functions



### User-Defined Function (UDF)
A custom column transformation function

- Can’t be optimized by the Catalyst Optimizer
- The function is serialized and sent to executors
- Row data is deserialized from Spark's native binary format to pass to the UDF, and the results are serialized back into Spark's native format
- For Python UDFs, there is additional interprocess communication overhead between the executor and a Python interpreter running on each worker node


### Define a function

Define a function (on the driver) to get the first letter of a string from the **`email`** field.

In [0]:
def first_letter_function(email):
    return email[0]

first_letter_function("annagray@kaufman.com")


### Create and apply UDF
Register the function as a UDF. This serializes the function and sends it to executors to be able to transform DataFrame records.

In [0]:
from pyspark.sql.functions import udf

first_letter_udf = udf(first_letter_function)


Apply the UDF on the **`email`** column.

In [0]:
from pyspark.sql.functions import col

display(sales_df.select(first_letter_udf(col("email"))))
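
One caveat with the function above: a Python UDF receives SQL nulls as `None`, and `email[0]` raises a `TypeError` on `None` and an `IndexError` on an empty string. If the `email` column can contain either, a defensive variant along these lines may be safer. The name `first_letter_safe` is hypothetical, and returning `None` for missing input is an assumption about the desired behaviour.

```python
from typing import Optional

def first_letter_safe(email: Optional[str]) -> Optional[str]:
    """Return the first character of email, or None for null or empty input."""
    if not email:  # covers both None and the empty string
        return None
    return email[0]

print(first_letter_safe("annagray@kaufman.com"))
print(first_letter_safe(None))
print(first_letter_safe(""))
```

It can be wrapped with `udf(...)` in the same way as `first_letter_function`; a `None` return value comes back as a null in the resulting column.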


### Register UDF to use in SQL
Register the UDF using **`spark.udf.register`** to also make it available for use in the SQL namespace.

In [0]:
sales_df.createOrReplaceTempView("sales")

first_letter_udf = spark.udf.register("sql_udf", first_letter_function)

In [0]:
# You can still apply the UDF from Python
display(sales_df.select(first_letter_udf(col("email"))))

In [0]:
%sql
-- You can now also apply the UDF from SQL
SELECT sql_udf(email) AS first_letter FROM sales


### Use Decorator Syntax (Python Only)

Alternatively, you can define and register a UDF using <a href="https://realpython.com/primer-on-python-decorators/" target="_blank">Python decorator syntax</a>. The **`@udf`** decorator parameter is the Column datatype the function returns.

You will no longer be able to call the local Python function (i.e., **`first_letter_udf("annagray@kaufman.com")`** will not work).

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> This example also uses <a href="https://docs.python.org/3/library/typing.html" target="_blank">Python type hints</a>, which were introduced in Python 3.5. Type hints are not required for this example, but instead serve as "documentation" to help developers use the function correctly. They are used in this example to emphasize that the UDF processes one record at a time, taking a single **`str`** argument and returning a **`str`** value.

In [0]:
# Our input/output is a string
@udf("string")
def first_letter_udf(email: str) -> str:
    return email[0]


And let's use our decorator UDF here.

In [0]:
from pyspark.sql.functions import col

sales_df = spark.read.format("delta").load(DA.paths.sales)
display(sales_df.select(first_letter_udf(col("email"))))


### Pandas/Vectorized UDFs

Pandas UDFs are available in Python to improve the efficiency of UDFs. Pandas UDFs utilize Apache Arrow to speed up computation.

* <a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" target="_blank">Blog post</a>
* <a href="https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html?highlight=arrow" target="_blank">Documentation</a>

<img src="https://databricks.com/wp-content/uploads/2017/10/image1-4.png" alt="Benchmark" width ="500" height="1500">

The user-defined functions are executed using: 
* <a href="https://arrow.apache.org/" target="_blank">Apache Arrow</a>, an in-memory columnar data format that is used in Spark to efficiently transfer data between JVM and Python processes with near-zero (de)serialization cost
* Pandas inside the function, to work with Pandas instances and APIs

<img src="https://files.training.databricks.com/images/icon_warn_32.png" alt="Warning"> As of Spark 3.0, you should **always** define your Pandas UDF using Python type hints.

In [0]:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# We have a string input/output
@pandas_udf("string")
def vectorized_udf(email: pd.Series) -> pd.Series:
    return email.str[0]

# Alternatively
# def vectorized_udf(email: pd.Series) -> pd.Series:
#     return email.str[0]
# vectorized_udf = pandas_udf(vectorized_udf, "string")

In [0]:
display(sales_df.select(vectorized_udf(col("email"))))


We can also register these Pandas UDFs to the SQL namespace.

In [0]:
spark.udf.register("sql_vectorized_udf", vectorized_udf)

In [0]:
%sql
-- Use the Pandas UDF from SQL
SELECT sql_vectorized_udf(email) AS firstLetter FROM sales