In [1]:
from faker import Faker
from pyspark.sql import SparkSession

In [2]:
spark = (
    SparkSession.builder
        .appName("chap4")
        .config("spark.driver.memory", "2g")
        .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/16 23:45:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Batch Processing

Generate some fake data:

In [3]:
fake = Faker()
Faker.seed(0)

In [4]:
def generate_data(nrow):
    """Return nrow fake (name, age, job) tuples; names are unique."""
    return [
        (fake.unique.name(), fake.random_int(18, 25), fake.job())
        for _ in range(nrow)
    ]

In [5]:
df1 = spark.createDataFrame(generate_data(50),
                            ["name", "age", "job"])
df1.show()

+-------------------+---+--------------------+
|               name|age|                 job|
+-------------------+---+--------------------+
|       Norma Fisher| 20|Sales promotion a...|
|Dr. Ronald Faulkner| 20|Television produc...|
|     Colleen Taylor| 19|      Chief of Staff|
|  Danielle Browning| 19|Insurance claims ...|
| Benjamin Jefferson| 20|Public house manager|
|    Heather Stewart| 19|                 Sub|
|         Sean Green| 18|Chief Financial O...|
|   Jennifer Summers| 18|  Veterinary surgeon|
|    Sean Sanchez MD| 18|Engineer, aeronau...|
|       Connie Pratt| 19|Speech and langua...|
|       Bobby Flores| 21|Clinical embryolo...|
|     Eddie Martinez| 20|Sound technician,...|
|       Robert Payne| 20|Producer, televis...|
|     Robert Stewart| 19|Horticultural the...|
|    Roberto Johnson| 20|     Publishing copy|
|   Michael Anderson| 19|     Arboriculturist|
|  Stephanie Leblanc| 21|Scientist, water ...|
|    Robert Atkinson| 21|        TEFL teacher|
| Johnathan D

Partition the data by a column:

In [6]:
db = "fake_data"

In [7]:
# Remove directory associated with the table (if it already exists)
%rm -rf spark-warehouse/"$db"

In [8]:
df1.write.saveAsTable(db, partitionBy="age")
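
To see what `partitionBy="age"` did on disk, one option is to list the table directory. The sketch below assumes the table landed in Spark's default local warehouse at `spark-warehouse/fake_data` (the same path the cleanup cell removes) and that partitions use the usual Hive-style `age=<value>` directory names. `list_partitions` is a hypothetical helper written for this illustration, not part of PySpark.

```python
from pathlib import Path


def list_partitions(table_dir):
    """Map each partition value to the number of data files it holds.

    Assumes a Hive-style layout: one "<column>=<value>" directory per partition.
    Returns an empty dict if the table directory does not exist.
    """
    root = Path(table_dir)
    if not root.is_dir():
        return {}

    counts = {}
    for sub in sorted(root.iterdir()):
        # Partition directories look like "age=18"; skip anything else.
        if sub.is_dir() and "=" in sub.name:
            _, value = sub.name.split("=", 1)
            # Ignore hidden checksum files and markers such as _SUCCESS.
            data_files = [
                f for f in sub.iterdir()
                if f.is_file() and not f.name.startswith((".", "_"))
            ]
            counts[value] = len(data_files)
    return counts


# Assumed location of the table written above.
print(list_partitions("spark-warehouse/fake_data"))
```

Each key in the result corresponds to one distinct value of the partition column, so a query that filters on `age` only needs to read the matching directories.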

                                                                                

## Data Skew

Data skew occurs when data is unevenly distributed across partitions, so a few tasks process far more rows than the rest and the whole stage waits on them. In most cases, Spark's Adaptive Query Execution (AQE) rebalances the data well enough on its own. Sometimes, however, the skew has to be fixed manually.
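
One common manual fix is key salting: appending a small random suffix to a hot key so its rows hash to several partitions instead of one. The sketch below illustrates the idea in plain Python, without Spark. `partition_of` is a simplified stand-in for a hash partitioner (it uses CRC32, not Spark's actual hash function), and the key names, `NUM_PARTITIONS`, and `NUM_SALTS` are all made up for the example.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical number of shuffle partitions
NUM_SALTS = 8       # hypothetical number of salt values per key


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Assign a key to a partition with a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions=NUM_PARTITIONS):
    """Count how many rows end up in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def add_salt(keys, num_salts=NUM_SALTS, seed=0):
    """Append a random salt suffix so one key becomes up to num_salts keys."""
    rng = random.Random(seed)
    return [f"{k}_{rng.randrange(num_salts)}" for k in keys]


# Skewed input: 900 of 1000 rows share a single hot key.
keys = ["hot"] * 900 + [f"key{i}" for i in range(100)]

before = partition_sizes(keys)
after = partition_sizes(add_salt(keys))

print("rows per partition before salting:", before)
print("rows per partition after salting: ", after)
```

Salting has a cost: any aggregation or join on the original key needs an extra step to strip the salt and combine the partial results, so it is usually worth it only when the skew is severe.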

---

**Further reading**:

- [Deep Dive into Handling Apache Spark Data Skew](https://chengzhizhao.com/deep-dive-into-handling-apache-spark-data-skew/) (Zhao, 2022)