### Writing a PySpark DataFrame to a JSON file
- Writing DataFrame to JSON file<br>
- Options while writing JSON files<br>
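      _path_: The output location, passed as the argument to `json()`. Spark creates a directory at this path that holds one or more part files.<br>
      _mode_: Specifies the behavior when the target location already exists.<br>
      _dateFormat_: Specifies the pattern used to write date columns (default `yyyy-MM-dd`). Timestamp columns use the separate _timestampFormat_ option.<br>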
- Saving modes (passed to `mode()`)<br>
> Append: Adds the new data to whatever already exists in the target location. If the target location does not exist, it is created.<br>
> Overwrite: Replaces the data in the target location if it already exists. If the target location does not exist, it is created.<br>
> Ignore: Skips the write and leaves the existing data untouched if the target location already exists. If the target location does not exist, the data is written as usual.<br>
> Error or ErrorIfExists: Throws an error and fails the operation if the target location already exists. This is the default behavior if no saving mode is specified.



In [0]:
file_path = "/Volumes/workspace/training/test_data/json"

In [0]:

from pyspark.sql.types import StructType, StructField, StringType, LongType, DateType
from datetime import date

# Sample schema
schema = StructType([
    StructField("emp_id", LongType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("hire_date", DateType(), True)
])

# Sample data
data = [
    (1, "Mark", "Jack", date(2017, 9, 14)),
    (2, "Scott", "Sam", date(2022, 1, 20)),
    (3, "Katherine", "Angel", date(2016, 1, 18))
]

df = spark.createDataFrame(data, schema)
df.show()


In [0]:

# Write the DataFrame as JSON, replacing any output already at this path
df.write \
    .mode("overwrite") \
    .option("dateFormat", "yyyy-MM-dd") \
    .json(f"{file_path}/output")


In [0]:

# Updated sample data: the third row has a new first name and an empty last name
data = [
    (1, "Mark", "Jack", date(2017, 9, 14)),
    (2, "Scott", "Sam", date(2022, 1, 20)),
    (3, "Kite", "", date(2016, 1, 18))
]

df = spark.createDataFrame(data, schema)

In [0]:

# Write the updated DataFrame to the same path; "overwrite" replaces the earlier output
df.write \
    .mode("overwrite") \
    .option("dateFormat", "yyyy-MM-dd") \
    .json(f"{file_path}/output")


In [0]:
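# Spark writes JSON Lines (one object per line), which the default reader expects.
# multiLine is meant for files holding a single JSON document spread over several lines, so it is not set here.
new_df = spark.read \
    .json(f"{file_path}/output")

display(new_df)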