## User-Defined Functions (UDFs) in Apache Spark with Python

User-Defined Functions (UDFs) in Apache Spark empower you to extend Spark's functionality by incorporating custom logic written in Python (or other languages). This lesson explores UDFs, highlighting their benefits and drawbacks, and comparing standard UDFs with Pandas UDFs for performance.


**What are UDFs?**

UDFs are functions you define and register with the Spark session. They can then be applied to columns within Spark DataFrames, enabling custom data transformations and calculations that are not built into Spark's core functionality.  This allows you to leverage your existing Python code or create specialized functions tailored to your specific data processing needs.


**Creating and Using Standard UDFs**

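One way to create a UDF in Spark is the `spark.udf.register()` method. It registers a Python function under a name so that it can be called from Spark SQL, and it returns a UDF object that can be used in DataFrame operations. If no return type is given, Spark assumes the function returns a string, so it is worth stating the return type explicitly.

The code below defines a simple UDF `age_in_five_years`, registers it with Spark, and applies it to the `age` column to create a new column, `age_in_five_years`.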


In [0]:
from pyspark.sql.functions import col

# Sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Define a UDF to calculate the age five years from now
def age_in_five_years(age):
    return age + 5

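# Register the UDF with an explicit return type; without one, Spark
# defaults to StringType and the new column would be typed as a string
age_in_five_years_udf = spark.udf.register("age_in_five_years", age_in_five_years, "int")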

# Apply the UDF to the DataFrame
df = df.withColumn("age_in_five_years", age_in_five_years_udf(col("age")))

# Show the result
df.show()

**Pandas UDFs:  A Performance Boost**

Pandas UDFs (also called vectorized UDFs) often offer significant performance advantages over standard UDFs, especially on large datasets. Spark uses Apache Arrow to hand data to Python in columnar batches, and the function operates on a whole batch at a time with pandas' vectorized operations. This cuts both the serialization cost and the per-row function-call overhead.

In the code below, the `@pandas_udf` decorator marks the function as a Pandas UDF. The function now receives a pandas Series (`age_series`) rather than a single value, which allows vectorized computation. The `"int"` argument specifies the return type.

In [0]:
from pyspark.sql.functions import col, pandas_udf
import pandas as pd

# Sample DataFrame (same as before)
data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Define a Pandas UDF to calculate age in five years (vectorized)
@pandas_udf("int")
def age_in_five_years_pandas(age_series: pd.Series) -> pd.Series:
    return age_series + 5

# Apply the Pandas UDF
df = df.withColumn("age_in_five_years", age_in_five_years_pandas(col("age")))

# Show the result (display() exists only in Databricks notebooks; show() works everywhere)
df.show()
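
Because the body of a Pandas UDF is ordinary pandas code, its logic can be checked on a plain `pd.Series` before Spark is involved. The sketch below assumes the logic is kept in an undecorated helper (the name `add_five_years` is hypothetical) that a Pandas UDF could later wrap.

```python
import pandas as pd

def add_five_years(age_series: pd.Series) -> pd.Series:
    # One vectorized addition over the whole batch; no Python-level loop over rows.
    return age_series + 5

# A small batch standing in for the values Spark would hand to the UDF.
batch = pd.Series([25, 30, 28])
result = add_five_years(batch)
print(result.tolist())  # [30, 35, 33]
```

Keeping the logic in a plain function like this makes it easy to unit test; in a Spark job it could then be wrapped, for example with `pandas_udf(add_five_years, "int")`.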

**Understanding Python UDF Execution in Spark**

When a Python UDF runs inside a Spark job, the work is split between the JVM-based executors and separate Python worker processes. The steps are:

1. **Data Partitioning:** The Spark DataFrame is divided into partitions, which are distributed across the executors on the cluster's worker nodes. Each partition is processed by its own task, which is what parallelizes the work.

2. **UDF Serialization:** Your Python function, together with any variables it references, is serialized (converted into a byte stream) on the driver and shipped to the executors. PySpark uses `cloudpickle`, an extension of Python's `pickle`, for this. The cost is usually small, but it grows when the function captures large objects such as a trained model.

3. **Data Transfer:** Spark's executors run in the JVM, so each executor passes the rows of a partition to a Python worker process on the same machine over a local socket. For a standard UDF, the rows are serialized with pickle in small batches by default and deserialized again in Python. This conversion is often a bottleneck, especially for large datasets or complex data structures.

4. **Python Worker Execution:** A Python worker process is launched on the executor (or reused from a pool). It deserializes the UDF, calls it once for every row of its partition, and sends the results back to the JVM. The per-row call overhead of the Python interpreter makes this stage relatively slow.

5. **Pandas UDF Optimization:** With a Pandas UDF, the data is not processed row by row. Spark transfers it in Apache Arrow columnar batches (up to 10,000 rows per batch by default, controlled by `spark.sql.execution.arrow.maxRecordsPerBatch`), and the function is called once per batch with pandas Series as input. The vectorized pandas operations inside the UDF run in memory on the whole batch, which is usually much faster than calling a function once per row.

6. **Returning Results:** The UDF's output is sent back from the Python worker to the JVM executor and becomes a column of the resulting DataFrame, which stays partitioned across the cluster. Nothing is sent to the driver until an action such as `collect()`, `show()`, or `toPandas()` asks for it. Collecting a very large result to the driver can itself become a bottleneck.

7. **Garbage Collection:** The JVM's garbage collector reclaims unused memory on the executors, and each Python worker manages its own memory separately. Objects created while converting data between the two runtimes add to the load on both, which can affect latency.
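
The difference between steps 4 and 5 can be sketched without a cluster. The simulation below is an illustration only, not Spark's implementation: it treats a pandas Series as one partition and counts how often the Python function is invoked when it is called per row versus per batch. The batch size of 10,000 mirrors the documented default of `spark.sql.execution.arrow.maxRecordsPerBatch`; the names `row_udf` and `batch_udf` are hypothetical.

```python
import pandas as pd

partition = pd.Series(range(25_000))  # stand-in for one partition's "age" column
BATCH_SIZE = 10_000                   # assumed batch size (Spark's Arrow default)

calls = {"row": 0, "batch": 0}

def row_udf(age):
    # Standard UDF style: one Python call per row.
    calls["row"] += 1
    return age + 5

def batch_udf(ages: pd.Series) -> pd.Series:
    # Pandas UDF style: one Python call per batch, vectorized inside.
    calls["batch"] += 1
    return ages + 5

# Row-at-a-time processing of the whole partition.
row_result = pd.Series([row_udf(age) for age in partition])

# Batched processing: slice the partition into chunks of BATCH_SIZE rows.
batch_result = pd.concat(
    [
        batch_udf(partition.iloc[start:start + BATCH_SIZE])
        for start in range(0, len(partition), BATCH_SIZE)
    ]
)

print(calls)
```

Both approaches produce the same values, but the batched version crosses into Python three times instead of 25,000 times, which is where most of the saving comes from.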


**Why this matters:**

Understanding these steps highlights potential performance bottlenecks:

* **Serialization/Deserialization:** The cost grows with the size of the objects the UDF captures and with the volume of data converted between JVM and Python formats.
* **Data Transfer:** Every value the UDF reads or returns crosses the boundary between the JVM executor and a Python worker process. The two run on the same machine, but the copying and format conversion still take time.
* **Python Execution Speed:** Python code called once per row is generally much slower than Spark's built-in functions, which run inside the JVM. Pandas UDFs narrow the gap because Arrow makes the transfer cheaper and the function is called once per batch rather than once per row.

**Optimizations:**

* **Pandas UDFs:** Prefer Pandas UDFs over row-at-a-time UDFs where possible, especially with large datasets and computationally intensive operations. Data moves in Arrow batches rather than row by row, and each batch is processed with vectorized pandas operations.
* **Minimize Data Transfer:** Pass only the columns the UDF needs, and prefer simple types over nested structures, to reduce the volume of data that has to be serialized.
* **Efficient UDFs:** Keep the work done inside the UDF small and avoid repeating setup work on every call.
* **Broadcast Variables:** For read-only data needed by the UDF, such as a lookup table or a trained model, consider broadcast variables so the data is shipped to each executor once rather than with every task.




**Code Overview: Combining Machine Learning and Distributed Data Processing**

This Python script demonstrates a common workflow in data science: combining traditional machine learning algorithms (from scikit-learn) with the scalability of distributed data processing (using Apache Spark). The goal is to train a classification model with scikit-learn, use that model to make predictions on data held in a Spark DataFrame, and calculate common classification metrics.

**Detailed Breakdown:**

1.  **Sample Data Generation (using NumPy and Pandas)**:
    *   **Purpose**: To create a synthetic dataset for training and testing our model.
    *   **Implementation**:
        *   `np.random.seed(42)`: Sets a random seed for NumPy's random number generator. This ensures reproducibility, meaning that each time the code is run the same random data will be generated. This is important for consistent results and debugging.
        *   `num_samples = 1000`: Defines the size of the dataset (number of data points).
        *   `feature_1 = np.random.rand(num_samples) * 10` and `feature_2 = np.random.rand(num_samples) * 10`: Generates two numerical features, with random values uniformly distributed between 0 and 10. These features will be the input for our model.
        *   `target = np.where(feature_1 + feature_2 > 10, "Class_A", "Class_B")`: Creates a categorical target label based on a simple rule. If the sum of `feature_1` and `feature_2` is greater than 10, the target is assigned "Class_A", otherwise it’s "Class_B". This provides a binary classification problem, where our model tries to predict whether a data point is "Class_A" or "Class_B" based on its two features.
        *   `data = pd.DataFrame(...)`: Creates a Pandas DataFrame to store our synthetic data, which is a tabular data structure designed for data analysis. Each column is a feature or the target, and each row is a data point.

2.  **Scikit-learn Model Training (using RandomForestClassifier)**:
    *   **Purpose**: To train a classification model on a portion of the generated data. This model will later be used for inference.
    *   **Implementation**:
        *   `X = data[['feature_1', 'feature_2']]` and `y = data['target']`: Separates the features (`X`) and target variable (`y`) from the Pandas DataFrame.
        *   `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)`: Splits the data into training and testing sets. The training set (80% of the data) is used to train the model, while the testing set (20%) is held out to evaluate it. Fixing `random_state` makes the split reproducible.
        *   `model = RandomForestClassifier(n_estimators=100, random_state=42)`: Initializes a Random Forest classifier, a popular ensemble learning method. The parameter `n_estimators=100` means the model will use 100 decision trees.
        *   `model.fit(X_train, y_train)`: Trains the model on the training features and target labels, so that it learns patterns that predict the target.
        *   `predictions = model.predict(X_test)`: Makes predictions on the test data.
        *   `accuracy = accuracy_score(y_test, predictions)`: Calculates the fraction of test-set predictions that match the actual labels. The value is printed as a quick check of model quality.

3.  **Apache Spark Setup and Data Preparation**:
    *   **Purpose**: To create a Spark environment and bring the data into a distributed, scalable framework.
    *   **Implementation**:
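        *   `schema = StructType(...)`: Defines the schema of the Spark DataFrame, that is, the name and type of each column. Stating it explicitly means Spark does not have to infer the types from the data. Note that `FloatType` is a 32-bit float, so the 64-bit values generated by NumPy lose some precision; `DoubleType` would preserve them.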
        *   `spark_df = spark.createDataFrame(data, schema=schema)`: Creates a Spark DataFrame from the Pandas DataFrame, using the specified schema. The data is now stored in the Spark cluster, and can be used for distributed processing.
        *   `spark_df.show(5)`: Displays the first five rows of the Spark DataFrame for inspection.

4.  **Pandas UDF for Scalable Inference**:
    *   **Purpose**: To apply the trained scikit-learn model to the entire Spark DataFrame using a distributed approach.
    *   **Implementation**:
        *   `@pandas_udf("string")`: Decorates the `predict_with_model` function as a Pandas UDF whose return type is string. Spark then calls the function with batches of rows, passed as pandas Series, instead of one row at a time.
        *   `def predict_with_model(feature_1: pd.Series, feature_2: pd.Series) -> pd.Series:`: Defines the function. It takes pandas Series as inputs and must return a pandas Series of the same length.
            *   `input_data = pd.DataFrame({'feature_1': feature_1, 'feature_2': feature_2})`: Combines the two input Series into a pandas DataFrame whose column names match those used in training.
            *   `predictions = model.predict(input_data)`: Uses the scikit-learn `model` to predict a whole batch in one call, which is much faster than calling the model once per row. The `model` object is captured from the driver and serialized along with the function.
            *   `return pd.Series(predictions)`: Returns the results as a pandas Series.
        *   `spark_df = spark_df.withColumn("predicted_target", predict_with_model(col("feature_1"), col("feature_2")))`: Adds a new column named `predicted_target` to the Spark DataFrame by applying the Pandas UDF to batches of rows. The operation is distributed across the Spark cluster, which allows predictions to scale.

5.  **Classification Metrics on the Spark Predictions**:
    *   **Purpose**: To evaluate the predictions stored in the Spark DataFrame using scikit-learn's classification metrics. Because `spark_df` includes the rows the model was trained on, these metrics will look more optimistic than the held-out accuracy from step 2.
    *   **Implementation**:
        *   `predictions_pd = spark_df.select("target", "predicted_target").toPandas()`: Collects the target and predicted target columns to the driver as a pandas DataFrame. This is fine for 1,000 rows, but it does not scale to results that exceed the driver's memory.
        *   `class_report = classification_report(predictions_pd['target'], predictions_pd['predicted_target'])`: Computes the classification report from the target and predicted target columns. The report contains precision, recall, F1-score, and support for each class.
        *   `conf_matrix = confusion_matrix(predictions_pd['target'], predictions_pd['predicted_target'])`: Computes the confusion matrix, which counts, for each actual class, how many rows were predicted as each class. For a binary problem these counts are the true positives, true negatives, false positives, and false negatives.
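
How the per-class numbers in a classification report follow from the confusion matrix can be checked by hand. The sketch below uses eight made-up labels (they are not output from the model described here) and compares the hand-computed precision and recall for `Class_A` with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, for illustration only.
y_true = ["Class_A", "Class_A", "Class_A", "Class_A", "Class_B", "Class_B", "Class_B", "Class_B"]
y_pred = ["Class_A", "Class_A", "Class_A", "Class_B", "Class_A", "Class_A", "Class_B", "Class_B"]

labels = ["Class_A", "Class_B"]
# Rows are actual classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Treat Class_A as the positive class.
tp = cm[0, 0]  # actual A, predicted A
fn = cm[0, 1]  # actual A, predicted B
fp = cm[1, 0]  # actual B, predicted A

precision_a = tp / (tp + fp)  # of everything predicted A, how much really was A
recall_a = tp / (tp + fn)     # of everything that really was A, how much was found

print(cm)
print(precision_a, precision_score(y_true, y_pred, pos_label="Class_A"))
print(recall_a, recall_score(y_true, y_pred, pos_label="Class_A"))
```

The hand-computed values agree with `precision_score` and `recall_score`, which is a useful sanity check when reading a report for the first time.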

**Key Concepts for First-Year Graduate Students:**

*   **Distributed Data Processing:** Spark allows you to process large datasets across a cluster of computers, offering significant speedups over processing data on a single machine.
*   **Pandas UDFs:** Enable you to leverage the power of Pandas and scikit-learn within Spark pipelines, combining the benefits of both libraries and enabling scalable model inference.
*   **Ensemble Learning:** Random Forests are a type of ensemble learning method that combines the predictions of multiple decision trees to create a robust predictive model.
*   **Classification Metrics:** Understanding the metrics like accuracy, recall, precision, and the confusion matrix is crucial for evaluating the performance of classification models.
*   **Schema Definition:** Defining a schema when creating a Spark DataFrame makes the column names and types explicit, which documents the structure of the data and avoids surprises from type inference.

**In summary,** this code shows how to integrate scikit-learn, a machine learning library, with Apache Spark, a system for large-scale data processing. It generates data, trains a classification model, applies the model to a Spark DataFrame in a distributed way, and measures the quality of the predictions with several metrics. Wrapping the model in a Pandas UDF is what lets single-machine library code run at Spark's scale.


In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import FloatType, StringType, StructType, StructField

# 1. Generate a Sample Dataset
# ----------------------------
np.random.seed(42)  # for reproducibility
num_samples = 1000

feature_1 = np.random.rand(num_samples) * 10
feature_2 = np.random.rand(num_samples) * 10
target = np.where(feature_1 + feature_2 > 10, "Class_A", "Class_B")
data = pd.DataFrame({'feature_1': feature_1, 'feature_2': feature_2, 'target': target})

# 2. Build an SKLearn Classification Model
# ---------------------------------------
X = data[['feature_1', 'feature_2']]
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"SKLearn Model Accuracy: {accuracy}")

# 3. Create a Spark DataFrame
# --------------------------

# Define the schema for the Spark DataFrame
schema = StructType([
    StructField("feature_1", FloatType(), True),
    StructField("feature_2", FloatType(), True),
    StructField("target", StringType(), True)
])

# Create the Spark DataFrame from the pandas one
spark_df = spark.createDataFrame(data, schema=schema)

# Print the first few rows of the Spark DataFrame
print("\nSpark DataFrame:")
spark_df.show(5)

# 4. Pandas UDF for Inference
# ------------------------------------

# "string" is the return type; the inputs arrive as pandas Series, one per column passed in
@pandas_udf("string")
def predict_with_model(feature_1: pd.Series, feature_2: pd.Series) -> pd.Series:
    # Create a pandas DataFrame for the input data
    input_data = pd.DataFrame({'feature_1': feature_1, 'feature_2': feature_2})

    # Get predictions from the sklearn model for the whole batch
    predictions = model.predict(input_data)
    # Return one prediction per input row
    return pd.Series(predictions)

# Apply the Pandas UDF to add the trained model's predictions as a new column
spark_df = spark_df.withColumn("predicted_target", predict_with_model(col("feature_1"), col("feature_2")))

print("\nSpark DataFrame with Predictions:")
spark_df.show(5)

# 5. Calculate Classification Statistics
# ---------------------------------------

# Convert Spark DataFrame to Pandas DataFrame for metrics calculation
predictions_pd = spark_df.select("target", "predicted_target").toPandas()

# Get classification report and print it.
class_report = classification_report(predictions_pd['target'], predictions_pd['predicted_target'])
print("\nClassification Report:\n", class_report)

# Get confusion matrix and print it.
conf_matrix = confusion_matrix(predictions_pd['target'], predictions_pd['predicted_target'])
print("\nConfusion Matrix:\n", conf_matrix)