<a href="https://colab.research.google.com/github/tomassalcedas/dataeng/blob/main/spark/examples/06-write_partitioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

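# Write
- .write - returns the DataFrameWriter used to save a DataFrame
- .format - output format (parquet, csv, json)
- .option / .options - format-specific settings such as delimiter or header
- spark.sql.sources.partitionOverwriteMode - set to dynamic so that an overwrite replaces only the partitions present in the DataFrame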

# Write Mode
- overwrite - Replaces any data that already exists at the target path. The Scala/Java equivalent is SaveMode.Overwrite.
- append - Adds the new data to whatever already exists at the target path. The Scala/Java equivalent is SaveMode.Append.
- ignore - Skips the write, without raising an error, when data already exists at the target path. The Scala/Java equivalent is SaveMode.Ignore.
- errorifexists or error - The default mode. Raises an error when data already exists at the target path. The Scala/Java equivalent is SaveMode.ErrorIfExists.
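
In PySpark the mode is passed as a string to `.mode(...)`. The decision each mode makes can be sketched in plain Python. `plan_write` and the labels it returns are hypothetical names used only for illustration; this models the behaviour listed above and is not Spark's own code.

```python
def plan_write(mode: str, target_exists: bool) -> str:
    """Return the action a save mode would take (illustrative model only)."""
    mode = mode.lower()
    if mode not in {"overwrite", "append", "ignore", "error", "errorifexists"}:
        raise ValueError(f"unknown save mode: {mode}")
    if not target_exists:
        return "write"       # every mode writes when nothing is there yet
    if mode == "overwrite":
        return "replace"     # existing data is removed, new data is written
    if mode == "append":
        return "add"         # new files are added next to the old ones
    if mode == "ignore":
        return "skip"        # silently do nothing
    # error / errorifexists: refuse to touch existing data
    raise FileExistsError("target already exists")


for m in ("overwrite", "append", "ignore"):
    print(m, "->", plan_write(m, target_exists=True))
```

The modes only differ when the target already holds data, which is why the first run of any cell in this notebook behaves the same regardless of the mode chosen.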

# Partitioning
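Partitioning organizes the data into multiple chunks based on the values of one or more columns.
Each partition is stored in its own sub-folder.
Partitioning can improve read performance in Spark, because queries that filter on the partition column only need to read the matching sub-folders.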
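
The sub-folder layout can be imitated with the Python standard library alone. The sketch below is not Spark code: `write_partitioned`, the sample rows, and the `part-0.csv` file name are assumptions made up for illustration. It only shows the `column=value` folder convention and why a reader interested in one value can skip the other folders.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

rows = [
    {"name": "Ana", "date_part": "20240501"},
    {"name": "Bo", "date_part": "20240502"},
    {"name": "Cy", "date_part": "20240501"},
]


def write_partitioned(rows, base: Path, key: str) -> None:
    """Write rows into base/<key>=<value>/part-0.csv, one folder per value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for value, group in groups.items():
        folder = base / f"{key}={value}"
        folder.mkdir(parents=True, exist_ok=True)
        # The partition value lives in the folder name, not inside the file.
        fields = [f for f in group[0] if f != key]
        with open(folder / "part-0.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


base = Path(tempfile.mkdtemp())
write_partitioned(rows, base, "date_part")
folders = sorted(p.name for p in base.iterdir())
print(folders)

# Reading a single partition touches only one folder.
with open(base / "date_part=20240501" / "part-0.csv", newline="") as fh:
    names = [r["name"] for r in csv.DictReader(fh)]
print(names)
```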

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Setting up PySpark

In [2]:
%pip install pyspark



In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('Spark Course').getOrCreate()

# Preparing data

In [4]:
!pip install faker

Collecting faker
  Downloading faker-37.4.0-py3-none-any.whl.metadata (15 kB)
Downloading faker-37.4.0-py3-none-any.whl (1.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 25.2 MB/s eta 0:00:00
Installing collected packages: faker
Successfully installed faker-37.4.0


In [5]:
from faker import Faker
from datetime import datetime

fake = Faker()

users = []
for _ in range(50):
    user = {
        'date': fake.date_time_between_dates(datetime(2024, 5, 1), datetime(2024, 5, 5)),
        'name': fake.name(),
        'address': fake.address(),
        'email': fake.email(),
        'dob': fake.date_of_birth(),
        'phone': fake.phone_number()
    }
    users.append(user)

df = spark.createDataFrame(users)

df.show(10, False)


+---------------------------------------------------------+--------------------------+----------+------------------------------+------------------+---------------------+
|address                                                  |date                      |dob       |email                         |name              |phone                |
+---------------------------------------------------------+--------------------------+----------+------------------------------+------------------+---------------------+
|3576 Green Turnpike Apt. 973\nJonathanfort, OH 20172     |2024-05-02 01:52:33.343982|1971-01-29|kennethporter@example.net     |Kenneth Clarke    |(454)916-7184        |
|00743 Pamela Crossing Apt. 423\nBenjaminport, OR 82666   |2024-05-02 11:30:22.134937|1954-01-03|michaelcrawford@example.com   |Brooke Scott      |001-670-596-2391x3422|
|7622 Sean Parks\nWest Sean, AZ 84874                     |2024-05-01 03:44:46.207469|1913-03-17|michael77@example.com         |Jennifer Baker    |+1-

# Writing as PARQUET



In [11]:
# Writing as PARQUET with no partitions

path = "/content/write_partitioning/parquet_no_partitions"

df.write.mode("overwrite").format("parquet").save(path)

!ls /content/write_partitioning/parquet_no_partitions

spark.read.format("parquet").load(path).count()

part-00000-ae7de57c-8b92-4302-8f76-a7ece9a166ea-c000.snappy.parquet  _SUCCESS


50

In [16]:
# Writing as PARQUET with partitions
from pyspark.sql.functions import col, date_format

path = "/content/write_partitioning/parquet_with_partitions"

# Creating partition column
df = df.withColumn("date_part", date_format(col("date"), "yyyyMMdd"))

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") # enable dynamic partition overwrite - only overwrites partitions that are coming in the dataframe

(df
 #.where("date_part = '20240503'")
 .write
 .mode("overwrite")                                               # with dynamic mode set above, replaces only the partitions present in df
 .partitionBy("date_part")                                        # partition the data by column - creates sub-folders for each partition
 .format("parquet")                                               # format of output
 .save(path))                                                     # path

!ls /content/write_partitioning/parquet_with_partitions

spark.read.format("parquet").load(path).count()

'date_part=20240501'  'date_part=20240503'
'date_part=20240502'  'date_part=20240504'


50
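
The commented-out `.where(...)` filter in the cell above points at the case where `partitionOverwriteMode` matters: writing a DataFrame that holds only some of the partitions. The difference between the default static behaviour and dynamic overwrite can be modelled with plain dictionaries. `overwrite_partitions` and the sample values are hypothetical; this is a sketch of the semantics, not Spark code.

```python
def overwrite_partitions(existing: dict, incoming: dict, mode: str) -> dict:
    """Model an overwrite on a partitioned path.

    mode="static":  every existing partition is dropped before writing.
    mode="dynamic": only partitions present in `incoming` are replaced.
    """
    if mode == "static":
        return dict(incoming)
    if mode == "dynamic":
        return {**existing, **incoming}
    raise ValueError(f"unknown mode: {mode}")


existing = {"20240501": ["a"], "20240502": ["b"], "20240503": ["c"]}
incoming = {"20240503": ["c2"]}  # e.g. df filtered to a single date_part

static_result = overwrite_partitions(existing, incoming, "static")
dynamic_result = overwrite_partitions(existing, incoming, "dynamic")
print(sorted(static_result))   # partitions left after a static overwrite
print(sorted(dynamic_result))  # partitions left after a dynamic overwrite
```

In the static case the untouched dates are lost; in the dynamic case they survive and only `20240503` gets the new rows.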

In [17]:
# Checking single partition
spark.read.parquet("/content/write_partitioning/parquet_with_partitions/date_part=20240502").show(50)

+--------------------+--------------------+----------+--------------------+--------------------+--------------------+
|             address|                date|       dob|               email|                name|               phone|
+--------------------+--------------------+----------+--------------------+--------------------+--------------------+
|3576 Green Turnpi...|2024-05-02 01:52:...|1971-01-29|kennethporter@exa...|      Kenneth Clarke|       (454)916-7184|
|00743 Pamela Cros...|2024-05-02 11:30:...|1954-01-03|michaelcrawford@e...|        Brooke Scott|001-670-596-2391x...|
|3571 Beard Crest ...|2024-05-02 14:00:...|1958-08-02|frazierwayne@exam...|       Veronica Ruiz|        926-265-0650|
|40413 Jordan Keys...|2024-05-02 13:44:...|1997-01-23|christopherbennet...|        Briana Davis|       (468)358-4804|
|68657 Randy Canyo...|2024-05-02 11:45:...|2010-02-20|kathryn61@example...|Brittany Hopkins DVM| +1-371-700-3711x021|
|Unit 3853 Box 477...|2024-05-02 22:53:...|1954-05-16| n
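
Note that the output above has no `date_part` column: the value is stored in the folder name rather than in the files, so reading one partition folder directly leaves it out. Spark's Parquet reader accepts a `basePath` option for keeping partition columns when loading a sub-folder. As a plain-Python sketch of where the value lives, the hypothetical helper `partition_values` below recovers it from the path.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Collect key=value folder names found along a path (values kept as strings)."""
    values = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


p = "/content/write_partitioning/parquet_with_partitions/date_part=20240502"
print(partition_values(p))
```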

# Writing as CSV

https://spark.apache.org/docs/3.5.1/sql-data-sources-csv.html

In [18]:
df.count()

50

In [19]:
path = "/content/write_partitioning/csv_no_partitioning/"

# write as csv
(df
  .write
  .format("csv")
  .mode("overwrite")
  .option("delimiter", "|")
  .option("header", True)
  .save(path))

# listing files in the folder
!ls /content/write_partitioning/csv_no_partitioning/

# read as csv
(spark
  .read
  .options(sep="|", multiLine=True, header=True)
  .csv(path)
  .count())

part-00000-bb29f782-bf2e-453a-80b5-d53b6bf5dc66-c000.csv  _SUCCESS


50
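
The `address` values contain line breaks, which is why the read above passes `multiLine=True`. The underlying problem can be reproduced with Python's own `csv` module (an illustration of the general issue, not of Spark's CSV reader): one record occupies several physical lines, so counting lines overstates the number of rows.

```python
import csv
import io

record = {
    "name": "Kenneth Clarke",
    "address": "3576 Green Turnpike Apt. 973\nJonathanfort, OH 20172",
}

# Write a header and one record using "|" as the delimiter.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "address"], delimiter="|")
writer.writeheader()
writer.writerow(record)
text = buf.getvalue()

# Naive view: split the file on line breaks.
physical_lines = text.splitlines()

# CSV-aware view: the quoted field is allowed to span lines.
logical_rows = list(csv.DictReader(io.StringIO(text, newline=""), delimiter="|"))

print(len(physical_lines))  # header + a record split over two lines
print(len(logical_rows))    # a single record
```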

# Writing as JSON

https://spark.apache.org/docs/3.5.1/sql-data-sources-json.html

In [20]:
path = "/content/write_partitioning/json_no_partitioning/"

# write as json
(df
.write
.mode("overwrite")
.format("json")
.save(path))

# listing files in the folder
!ls /content/write_partitioning/json_no_partitioning/

# read as json
(spark
  .read
  .json(path)
  .count())

part-00000-630122bd-29a5-43ff-8344-9b2c842a3e67-c000.json  _SUCCESS


50

In [21]:
# reading json as text
spark.read.text(path).show(10, False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"address":"3576 Green Turnpike Apt. 973\nJonathanfort, OH 20172","date":"2024-05-02T01:52:33.343Z","dob":"1971-01-29","email":"kennethporter@example.net","name":"Kenneth Clarke","phone":"(454)916-7184","date_part":"20240502"}          |
|{"address":"00743 Pamela Crossing Apt. 423\

In [22]:
# reading json as a DataFrame
spark.read.json(path).show(10, False)

+---------------------------------------------------------+------------------------+---------+----------+------------------------------+------------------+---------------------+
|address                                                  |date                    |date_part|dob       |email                         |name              |phone                |
+---------------------------------------------------------+------------------------+---------+----------+------------------------------+------------------+---------------------+
|3576 Green Turnpike Apt. 973\nJonathanfort, OH 20172     |2024-05-02T01:52:33.343Z|20240502 |1971-01-29|kennethporter@example.net     |Kenneth Clarke    |(454)916-7184        |
|00743 Pamela Crossing Apt. 423\nBenjaminport, OR 82666   |2024-05-02T11:30:22.134Z|20240502 |1954-01-03|michaelcrawford@example.com   |Brooke Scott      |001-670-596-2391x3422|
|7622 Sean Parks\nWest Sean, AZ 84874                     |2024-05-01T03:44:46.207Z|20240501 |1913-03-17|micha
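
The text view earlier shows one JSON object per line, with the line break inside `address` escaped as `\n`. That is why the JSON read needs no multi-line option here, unlike the CSV read. The same layout can be reproduced with the standard `json` module; the two sample rows are copied from the output above and the sketch is not Spark code.

```python
import json

rows = [
    {"name": "Kenneth Clarke",
     "address": "3576 Green Turnpike Apt. 973\nJonathanfort, OH 20172"},
    {"name": "Brooke Scott",
     "address": "00743 Pamela Crossing Apt. 423\nBenjaminport, OR 82666"},
]

# One JSON object per line; json.dumps escapes the embedded newline as \n.
text = "\n".join(json.dumps(row) for row in rows) + "\n"

lines = text.splitlines()
parsed = [json.loads(line) for line in lines]

print(len(lines))      # one physical line per record
print(parsed == rows)  # the round trip restores the real line breaks
```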

In [23]:
# partition json data + saveAsTable

# Creating partition column
df = df.withColumn("date_part", date_format(col("date"), "yyyyMMdd"))

# write as json
(df.write
  .partitionBy("date_part")
  .mode("overwrite")
  .format("json")
  .saveAsTable("tbl_json_part"))

# read the table back
spark.table("tbl_json_part").count()

# list the partitions of the table
spark.sql("show partitions tbl_json_part").show()

+------------------+
|         partition|
+------------------+
|date_part=20240501|
|date_part=20240502|
|date_part=20240503|
|date_part=20240504|
+------------------+



# Append Mode

In [24]:
# Writing as PARQUET with APPEND

path = "/content/write_partitioning/parquet_append"

df.write.mode("append").format("parquet").save(path)

!ls /content/write_partitioning/parquet_append

spark.read.format("parquet").load(path).count()

part-00000-4d1869fa-8198-472d-a2d3-398c5a20c0d0-c000.snappy.parquet  _SUCCESS


50
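
The count is 50 because this was the first write to the path. Running the same append again adds another part file next to the existing one, so the count grows. The sketch below models that with the standard library only; `append_rows`, `count_rows`, and the `.txt` part files are assumptions for illustration and do not reflect how Spark names or stores its files.

```python
import tempfile
import uuid
from pathlib import Path


def append_rows(folder: Path, rows: list) -> None:
    """Add one new, uniquely named part file; existing files are untouched."""
    folder.mkdir(parents=True, exist_ok=True)
    part = folder / f"part-{uuid.uuid4()}.txt"
    part.write_text("\n".join(rows) + "\n")


def count_rows(folder: Path) -> int:
    """Count the lines across all part files in the folder."""
    return sum(len(p.read_text().splitlines()) for p in folder.glob("part-*"))


folder = Path(tempfile.mkdtemp()) / "append_demo"
batch = [f"row-{i}" for i in range(50)]

append_rows(folder, batch)
first = count_rows(folder)

append_rows(folder, batch)  # re-running the same write
second = count_rows(folder)

print(first, second)
```

Append is therefore not idempotent: re-running a job duplicates its rows unless the target is cleaned up or the write is made partition-aware with overwrite.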