In [1]:
import pandas as pd

from faker import Faker
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    broadcast, spark_partition_id, rand
)

In [2]:
spark = (
    SparkSession.builder
        .appName("chap4")
        .config("spark.driver.memory", "2g")
        .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/17 23:20:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Batch Processing

Generate some fake data:

In [3]:
fake = Faker()
Faker.seed(0)

In [4]:
data = [
    (fake.unique.name(), fake.random_int(18, 25), fake.job())
    for _ in range(1000)
]

In [5]:
df1 = spark.createDataFrame(data, ["name", "age", "job"])
df1.show()

+-------------------+---+--------------------+
|               name|age|                 job|
+-------------------+---+--------------------+
|       Norma Fisher| 22|Sales promotion a...|
|Dr. Ronald Faulkner| 23|Television produc...|
|     Colleen Taylor| 20|      Chief of Staff|
|  Danielle Browning| 20|Insurance claims ...|
| Benjamin Jefferson| 23|Public house manager|
|    Heather Stewart| 21|                 Sub|
|         Sean Green| 18|Chief Financial O...|
|   Jennifer Summers| 18|  Veterinary surgeon|
|    Sean Sanchez MD| 19|Engineer, aeronau...|
|       Connie Pratt| 20|Speech and langua...|
|       Bobby Flores| 25|Clinical embryolo...|
|     Eddie Martinez| 23|Sound technician,...|
|       Robert Payne| 22|Producer, televis...|
|     Robert Stewart| 21|Horticultural the...|
|    Roberto Johnson| 22|     Publishing copy|
|   Michael Anderson| 20|     Arboriculturist|
|  Stephanie Leblanc| 24|Scientist, water ...|
|    Robert Atkinson| 24|        TEFL teacher|
| Johnathan D

                                                                                

Write the data as a table partitioned by a column (here, `age`):

In [6]:
db = "fake_data"

In [7]:
# Remove directory associated with the table (if it already exists)
%rm -rf spark-warehouse/"$db"

In [8]:
df1.write.saveAsTable(db, partitionBy="age")

                                                                                

## Data Skew

Data skew occurs when data is unevenly distributed across partitions, so a few oversized partitions take much longer to process than the rest and slow down the whole job. In most cases, Spark's Adaptive Query Execution (AQE) handles skewed partitions well enough on its own. Sometimes, however, we need to fix the skew manually. Here are some ways to do it:
- Tuning the shuffle partition count: Configure the number of partitions Spark uses when shuffling data for joins or aggregations (i.e., the `spark.sql.shuffle.partitions` option). See the [Working With Partitions](http://localhost:8888/notebooks/chap3/Apache%20Spark%20deep%20dive.ipynb#Working-With-Partitions) section in the Chapter 3 notebook for more information.
- Broadcast join: Send a full copy of the smaller dataset to every node, then join it locally against that node's portion of the larger dataset, which avoids shuffling the larger dataset. This is suitable only when the smaller DataFrame is small-to-medium-sized, since each node must hold it in memory.
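
The shuffle-partition option, along with AQE itself, is controlled through session configuration. The fragment below is a minimal sketch, assuming the `spark` session created at the top of this notebook and a Spark 3.x runtime. The value `8` is illustrative only, not a recommendation.

```python
# Assumes `spark` is the SparkSession created earlier in this notebook.
# The values below are illustrative; tune them for your own data and cluster.

# Number of partitions used when shuffling for joins and aggregations
# (Spark's default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Adaptive Query Execution and its skew-join handling (Spark 3.x settings).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Read a setting back to confirm what the session is using.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```

These are runtime SQL settings, so they can be changed between queries without restarting the session.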

In [9]:
pd_df = pd.DataFrame({
    "name": fake.random_sample([tup[0] for tup in data], 5),
    "catchPhrase": [fake.unique.catch_phrase() for _ in range(5)]
})
df2 = spark.createDataFrame(pd_df)

In [10]:
# Hint Spark to broadcast the small DataFrame (df2) to every executor
df1.join(broadcast(df2), "name").show()

+----------------+---+--------------------+--------------------+
|            name|age|                 job|         catchPhrase|
+----------------+---+--------------------+--------------------+
|     Daniel Cruz| 19|Nurse, mental health|Public-key mobile...|
|Reginald Garrett| 24|     Arboriculturist|Visionary systema...|
|  Victoria Reese| 25|Health and safety...|Multi-layered hyb...|
| Brent Willis MD| 23|           Ecologist|Fundamental inter...|
| Dustin Mcdowell| 22|Senior tax profes...|Multi-lateral zer...|
+----------------+---+--------------------+--------------------+
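
Conceptually, a broadcast join gives every worker its own copy of the small table, which then acts as an in-memory lookup. Each partition of the large table is joined locally, so none of its rows move between partitions. The following plain-Python sketch illustrates that idea only and is not Spark's implementation. The function name `broadcast_hash_join` and the toy rows are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (name, age): a toy stand-in for rows of the large side


def broadcast_hash_join(
    large_partitions: Iterable[List[Row]],
    small_table: Dict[str, str],
) -> List[List[Tuple[str, int, str]]]:
    """Inner-join each partition of the large side against a shared lookup.

    `small_table` plays the role of the broadcast copy: every partition
    reads the same dict, so no rows of the large side change partition.
    """
    joined = []
    for partition in large_partitions:
        joined.append([
            (name, age, small_table[name])
            for name, age in partition
            if name in small_table  # inner join: drop unmatched rows
        ])
    return joined


# Toy data (hypothetical): two partitions of the large side, one small table.
partitions = [
    [("Ann", 22), ("Bob", 19)],
    [("Cid", 25), ("Ann", 22)],
]
catch_phrases = {"Ann": "phrase A", "Cid": "phrase C"}

result = broadcast_hash_join(partitions, catch_phrases)
print(result)
```

The output keeps the partition layout of the large side, which is why the approach sidesteps skew in the large table's join key. The cost is holding one copy of the small table per worker.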



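- Salting (an idea borrowed from cryptography): Add a random value (the salt) to each record, so that records can be partitioned on the salt, or on a key combined with the salt, and spread more evenly. This is useful if we are unsure which column to repartition by.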

In [11]:
# Add a salt column holding a pseudo-random integer in [0, 10); the seed (0) keeps it reproducible
df1.withColumn("salt", (rand(0) * 10).cast("int")).show()

+-------------------+---+--------------------+----+
|               name|age|                 job|salt|
+-------------------+---+--------------------+----+
|       Norma Fisher| 22|Sales promotion a...|   7|
|Dr. Ronald Faulkner| 23|Television produc...|   5|
|     Colleen Taylor| 20|      Chief of Staff|   0|
|  Danielle Browning| 20|Insurance claims ...|   3|
| Benjamin Jefferson| 23|Public house manager|   7|
|    Heather Stewart| 21|                 Sub|   2|
|         Sean Green| 18|Chief Financial O...|   2|
|   Jennifer Summers| 18|  Veterinary surgeon|   5|
|    Sean Sanchez MD| 19|Engineer, aeronau...|   7|
|       Connie Pratt| 20|Speech and langua...|   0|
|       Bobby Flores| 25|Clinical embryolo...|   2|
|     Eddie Martinez| 23|Sound technician,...|   6|
|       Robert Payne| 22|Producer, televis...|   4|
|     Robert Stewart| 21|Horticultural the...|   5|
|    Roberto Johnson| 22|     Publishing copy|   3|
|   Michael Anderson| 20|     Arboriculturist|   2|
|  Stephanie
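
To see why a salt helps, the following Spark-free simulation assigns rows to partitions by hashing a key, first without and then with a salt. It uses `zlib.crc32` as a stand-in for a hash partitioner, which is an assumption made for illustration and not how Spark hashes rows. The keys, row counts, and helper name `partition_sizes` are hypothetical.

```python
import random
import zlib
from collections import Counter
from typing import Iterable, List


def partition_sizes(keys: Iterable[str], num_partitions: int) -> List[int]:
    """Count rows per partition under hash partitioning on `keys`."""
    counts = Counter(zlib.crc32(k.encode()) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


random.seed(0)
NUM_PARTITIONS = 8
NUM_SALTS = 10

# Skewed toy data: one hot key holds 900 of the 1,000 rows.
keys = ["hot"] * 900 + [f"key{i}" for i in range(100)]

# Without a salt, every "hot" row hashes to the same partition.
plain = partition_sizes(keys, NUM_PARTITIONS)

# With a salt, the partitioning key becomes "<key>_<salt>", so the hot key
# is split into up to NUM_SALTS sub-keys that hash independently.
salted_keys = [f"{k}_{random.randrange(NUM_SALTS)}" for k in keys]
salted = partition_sizes(salted_keys, NUM_PARTITIONS)

print("largest partition without salt:", max(plain))
print("largest partition with salt:   ", max(salted))
```

In Spark, the same effect comes from including the `salt` column in the repartitioning or join key, for example `df.repartition("age", "salt")`. For a salted join, the other side must carry matching salt values as well.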

---

**Further reading**:

- [Deep Dive into Handling Apache Spark Data Skew](https://chengzhizhao.com/deep-dive-into-handling-apache-spark-data-skew/) (Zhao, 2022)