<a href="https://colab.research.google.com/github/rahulrajpr/prepare-anytime/blob/main/spark/functions/20_spark_sql_dataframe_writer_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Spark DataFrame Writer Methods**
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.html

In [None]:
# Install Java and PySpark

import warnings
warnings.filterwarnings('ignore')

!apt-get update -qq
!apt-get install -y openjdk-11-jdk-headless -qq > /dev/null
!pip install pyspark -q

# Set Java home
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

import pyspark
print(pyspark.__version__)

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
3.5.1


In [None]:
# Download the PostgreSQL JDBC driver

!mkdir -p ~/jars
!wget -P ~/jars https://jdbc.postgresql.org/download/postgresql-42.6.0.jar

--2025-11-04 06:37:06--  https://jdbc.postgresql.org/download/postgresql-42.6.0.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1081604 (1.0M) [application/java-archive]
Saving to: ‘/root/jars/postgresql-42.6.0.jar’


2025-11-04 06:37:06 (11.8 MB/s) - ‘/root/jars/postgresql-42.6.0.jar’ saved [1081604/1081604]



In [None]:
from pyspark.sql import SparkSession

spark = SparkSession\
            .builder\
            .appName('spark-dataframe')\
            .config("spark.jars", "/root/jars/postgresql-42.6.0.jar")\
            .getOrCreate()

#### Spark DataFrame Writer vs DataFrameWriterV2
---

#### 1. Core Architectural Foundation

#### DataFrameWriter (V1)
- Built on original **DataSource V1 API**
- **Monolithic architecture**: Single, fixed execution path
- Tightly coupled with Spark's legacy SQL engine
- Designed for traditional HDFS and relational databases

#### DataFrameWriterV2
- Built on modern **DataSource V2 API**
- **Pluggable architecture**: Modular, composable components
- Framework for extensible connector development
- Designed for cloud-native and modern data formats

---

#### 2. Fundamental Design Philosophy

#### V1 Approach: "One Size Fits All"
- Fixed write execution pattern for all data sources
- Limited customization points for connector developers
- Batch and streaming treated as separate concerns
- File-system oriented commit protocols

#### V2 Approach: "Extensible Framework"
- Customizable write execution per data source
- Rich interfaces for connector-specific optimizations
- Unified batch and streaming treatment
- Transaction-aware commit protocols

---

#### 3. Critical Technical Limitations in V1

#### Reliability Issues
- Non-atomic commits on cloud object stores
- No transaction boundaries for partial failures
- Corruption risks during job failures
- Limited recovery mechanisms

#### Extensibility Constraints
- Black box execution model
- Difficult to implement custom data sources
- Limited push-down capability for operations
- Rigid interface contracts

#### Operational Limitations
- Coarse-grained overwrite behavior
- Poor integration with catalog systems
- Limited schema evolution support
- Basic data distribution controls

---

#### 4. V2 Architectural Solutions

#### Transaction Management
- Pluggable commit protocols
- ACID transaction support
- Atomic operation guarantees
- Recovery and rollback capabilities

#### Extensibility Framework
- Clean interfaces for custom implementations
- Operation push-down framework
- Customizable write optimization
- Unified batch and streaming APIs

#### Data Management
- Fine-grained data distribution controls
- Advanced partitioning strategies
- Integrated catalog management
- Native schema evolution

---

#### 5. Key Differentiators

#### Execution Model
- **V1**: Fixed pipeline, runtime optimizations only
- **V2**: Customizable pipeline, both planning and runtime optimizations

#### Connector Development
- **V1**: Complex, requires deep Spark internals knowledge
- **V2**: Structured, well-defined interfaces and contracts

#### Cloud Compatibility
- **V1**: Adapted to cloud storage with limitations
- **V2**: Designed for cloud-native operation from inception

#### Data Ecosystem Integration
- **V1**: Basic table format support
- **V2**: Native integration with modern table formats (Iceberg, Delta, Hudi)

---

#### 6. Evolution Context

#### V1 Represents Spark's Origins
- Born from academic and early internet scale
- HDFS and traditional database focus
- Batch processing primacy
- Single data center deployment model

#### V2 Represents Spark's Maturity
- Cloud-native and hybrid cloud reality
- Streaming and batch unification
- Global scale deployment requirements
- Diverse data ecosystem integration

---

#### 7. Practical Implications

#### For Data Engineers:
- **V1**: Sufficient for basic ETL and analytics
- **V2**: Necessary for production-grade, reliable pipelines

#### For Platform Developers:
- **V1**: Maintenance and compatibility focus
- **V2**: Innovation and ecosystem expansion

#### For Organizations:
- **V1**: Legacy pipeline maintenance
- **V2**: Future-proof data platform foundation

---

#### 8. Strategic Direction

#### V1 Status
- Stable and still the API behind `df.write`
- Not deprecated as of Spark 3.5
- Receives fixes and compatibility work
- Most new table-oriented features target V2

#### V2 Status
- Active development focus
- New feature delivery
- Ecosystem expansion
- Performance optimization priority

---

#### Conclusion

The transition from DataFrameWriter to DataFrameWriterV2 represents Spark's evolution from a monolithic data processing engine to a modular, extensible data platform framework. While V1 addressed the initial scale challenges of big data, V2 addresses the reliability, extensibility, and operational requirements of modern data platforms in cloud-native environments.

V2 isn't merely an API version increment—it's a fundamental architectural shift that enables Spark to remain relevant in the evolving data ecosystem while maintaining backward compatibility for existing workloads.
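
To make the API difference concrete, here is a minimal sketch of writing through `DataFrameWriterV2`, which is reached via `df.writeTo(...)` rather than `df.write`. The helper names and the table name in the usage comment are hypothetical, and whether `create()` succeeds depends on the catalog configured for the session.

```python
def create_partitioned_table(df, table_name, partition_col, provider="parquet"):
    """Create a new table from `df` via DataFrameWriterV2 (hypothetical helper).

    `df` is expected to be a pyspark.sql.DataFrame and `partition_col` a
    pyspark.sql.Column. create() fails if the table already exists.
    """
    writer = (
        df.writeTo(table_name)            # returns a DataFrameWriterV2
          .using(provider)                # table provider / file format
          .partitionedBy(partition_col)   # partition column or transform
    )
    writer.create()


def append_to_table(df, table_name):
    """Append rows to an existing table; fails if the table does not exist."""
    df.writeTo(table_name).append()


# Usage in this notebook (assumes the `dataframe` loaded from the sample CSV):
#   from pyspark.sql.functions import col
#   create_partitioned_table(dataframe, "demo_table_v2", col("Pclass"))
#   append_to_table(dataframe, "demo_table_v2")
```

Note that V2 has no `mode(...)` call: the intent (`create`, `append`, `overwrite`, and so on) is chosen by the terminal method instead.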

In [None]:
csv_file_path = 'https://raw.githubusercontent.com/rahulrajpr/prepare-anytime/refs/heads/main/sample-files/csv/sample.csv'
!wget {csv_file_path}
csv_local_path = '/content/sample.csv'

--2025-11-04 06:37:16--  https://raw.githubusercontent.com/rahulrajpr/prepare-anytime/refs/heads/main/sample-files/csv/sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘sample.csv’


2025-11-04 06:37:16 (5.08 MB/s) - ‘sample.csv’ saved [60302/60302]



In [None]:
dataframe = spark.read\
                 .option('header','true')\
                 .option('inferSchema','true')\
                 .csv(csv_local_path)
dataframe.show(truncate = False)

+-----------+--------+------+-------------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|Name                                                   |Sex   |Age |SibSp|Parch|Ticket          |Fare   |Cabin|Embarked|
+-----------+--------+------+-------------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
|1          |0       |3     |Braund, Mr. Owen Harris                                |male  |22.0|1    |0    |A/5 21171       |7.25   |NULL |S       |
|2          |1       |1     |Cumings, Mrs. John Bradley (Florence Briggs Thayer)    |female|38.0|1    |0    |PC 17599        |71.2833|C85  |C       |
|3          |1       |3     |Heikkinen, Miss. Laina                                 |female|26.0|0    |0    |STON/O2. 3101282|7.925  |NULL |S       |
|4          |1       |1     |Futrelle, Mrs. Jacques Heath (Lily May Peel)           |female|35.0|1  

In [None]:
# csv

!mkdir -p output

csv_write_path = 'output/csv/dataframe_csv_overwrite.csv'

dataframe.write\
         .option('header','true')\
         .option('delimiter',',')\
         .option('dateFormat','yyyy-MM-dd')\
         .option('nullValue','NULL')\
         .option('compression','none')\
         .csv(csv_write_path, mode = 'overwrite')

# -- with mode as a method

dataframe.write\
         .option('header','true')\
         .option('delimiter',',')\
         .option('dateFormat','yyyy-MM-dd')\
         .option('nullValue','NULL')\
         .option('compression','none')\
         .mode('overwrite')\
         .csv(csv_write_path)

# -- with mode as a method  (append)

csv_write_path = 'output/csv/dataframe_csv_append.csv'

dataframe.write\
         .option('header','true')\
         .option('delimiter',',')\
         .option('dateFormat','yyyy-MM-dd')\
         .option('nullValue','NULL')\
         .option('compression','none')\
         .mode('append')\
         .csv(csv_write_path)

# -- partitioned output: one sub-directory per combination of partition values

csv_write_path = 'output/csv/dataframe_csv_partitions.csv'

dataframe.write\
         .option('header','true')\
         .option('delimiter',',')\
         .option('dateFormat','yyyy-MM-dd')\
         .option('nullValue','NULL')\
         .option('compression','none')\
         .mode('overwrite')\
         .partitionBy('PClass','survived')\
         .csv(csv_write_path)
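
A partitioned write produces nested `column=value` directories under the target path. The helper below is a small sketch for inspecting that layout with the standard library only; the function name `list_partition_dirs` is made up for this example.

```python
from pathlib import Path


def list_partition_dirs(root: str) -> list[str]:
    """Return relative paths of the leaf 'col=value' directories under root."""
    base = Path(root)
    found = []
    for path in sorted(base.rglob("*")):
        if not path.is_dir() or "=" not in path.name:
            continue
        # Keep only leaf partition directories (no nested 'col=value' children).
        has_child_partition = any(
            child.is_dir() and "=" in child.name for child in path.iterdir()
        )
        if not has_child_partition:
            found.append(path.relative_to(base).as_posix())
    return found


# Usage in this notebook, after the partitioned write above:
#   list_partition_dirs('output/csv/dataframe_csv_partitions.csv')
```

Each returned entry corresponds to one combination of partition values; the partition columns themselves are encoded in the directory names rather than stored in the data files.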

##### PySpark Write Modes

##### Available Write Modes
- **`overwrite`** - Completely replaces existing data
- **`append`** - Adds new data to existing data
- **`ignore`** - No operation if target exists
- **`error` / `errorifexists`** - Throws error if target exists (default)

---

##### Detailed Comparison

| Mode | Behavior | When Target Exists | When Target Doesn't Exist | Common Use Cases | Risk Level |
|------|----------|-------------------|--------------------------|------------------|------------|
| **`overwrite`** | Replaces entire dataset | Deletes all existing data, writes new data | Creates new data | Full refreshes, schema changes, complete replacements | High |
| **`append`** | Adds to existing data | New data added to existing data | Creates new data | Incremental loads, event streams, daily batches | Medium |
| **`ignore`** | No operation if exists | Silent skip, no action taken | Creates new data | Safe initialization, idempotent pipelines | Low |
| **`errorifexists`** | Fails if target exists | Throws AnalysisException | Creates new data | Safety default, preventing accidents | Low |

---

##### Key Characteristics Summary

##### Data Safety
- **Safest**: `errorifexists`, `ignore`
- **Moderate**: `append`
- **Riskiest**: `overwrite`

##### Performance Impact
- **Fastest**: `ignore` and `errorifexists` when the target exists (no data is written)
- **Proportional to data size**: `append` and `overwrite`
- **Extra cost**: `overwrite` also has to delete the existing data first

##### Common Patterns
- **Development**: `overwrite` for testing
- **Production**: `append` for incremental, `errorifexists` for safety
- **Initialization**: `ignore` for first-time setup

In [None]:
# writer - json

json_write_path = 'output/json/dataframe_json_overwrite.json'

# multiLine is a read-side option; the JSON writer emits one object per line
dataframe.write\
         .mode('overwrite')\
         .json(json_write_path)

##--

json_write_path = 'output/json/dataframe_json_append.json'

dataframe.write\
         .mode('append')\
         .json(json_write_path)

##--

json_write_path = 'output/json/dataframe_json_partitioned.json'

dataframe.write\
         .mode('overwrite')\
         .partitionBy('PClass','survived')\
         .json(json_write_path)

In [None]:
# writer - parquet

parquet_write_path = 'output/parquet/dataframe_parquet.parquet'

# mergeSchema is a read-side option, so it is not set here.
# Bloom filters are enabled with the key 'parquet.bloom.filter.enabled'.
dataframe.write\
         .option('compression','gzip')\
         .option('parquet.bloom.filter.enabled','true')\
         .partitionBy('PClass','survived')\
         .mode('overwrite')\
         .parquet(parquet_write_path)


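Note: On the V1 writer, `bucketBy` and `sortBy` only work with `saveAsTable`; path-based writes such as `.parquet(path)` reject them. `DataFrameWriterV2` exposes neither method in Spark 3.5 and instead describes layout through `partitionedBy`.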


In [None]:
# writer - format

format_write_path = 'output/format/dataframe_format_partitioned.csv'

dataframe.write\
         .format('csv')\
         .option('header','true')\
         .mode('overwrite')\
         .partitionBy('PClass','survived')\
         .option('compression','gzip')\
         .save(format_write_path)

In [None]:
# Set up a local PostgreSQL server to test the JDBC writer

!apt-get update
!apt-get install -y postgresql postgresql-contrib
!service postgresql start
!clear

Hit:1 https://cli.github.com/packages stable InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:5 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:6 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:7 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Hit:8 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:9 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:10 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists... Done
W: Skipping acq

In [None]:
database = 'magic_database'
schema = 'magic_schema'

user = 'rahul'
password = 'rahul_password'

!sudo -u postgres psql -c "CREATE USER {user} WITH PASSWORD '{password}';"
!sudo -u postgres psql -c "CREATE DATABASE {database} OWNER {user};"
!sudo -u postgres psql -d {database} -c "CREATE SCHEMA {schema} AUTHORIZATION {user};"

CREATE ROLE
CREATE DATABASE
CREATE SCHEMA


In [None]:
# writer - jdbc

url =  f"jdbc:postgresql://localhost:5432/{database}"
table = f'{schema}.magic_table'

properties = {'user': user, 'password': password, 'driver': 'org.postgresql.Driver'}

# numPartitions caps the number of concurrent JDBC connections used for the write;
# partitionColumn applies to JDBC reads only, so it is not set here.
dataframe.repartition(10)\
         .write\
         .mode('overwrite')\
         .option('numPartitions',8)\
         .jdbc(url = url,
               table = table,
               properties = properties)

#--

table = f'{schema}.magic_table_append'

dataframe.repartition(10)\
         .write\
         .mode('append')\
         .option('numPartitions',8)\
         .jdbc(url = url,
               table = table,
               properties = properties)
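
Outside a throwaway notebook, the credentials in `properties` would normally not be hard-coded. One possible approach, sketched below, reads them from environment variables. The helper name and the variable names (`PG_USER`, `PG_PASSWORD`) are assumptions made for this example, not a Spark or PostgreSQL convention.

```python
import os


def jdbc_properties_from_env(prefix: str = "PG") -> dict:
    """Build the `properties` dict for DataFrameWriter.jdbc from env variables.

    Reads <prefix>_USER and <prefix>_PASSWORD (hypothetical names) and
    fails early with a readable message if either one is missing.
    """
    try:
        user = os.environ[f"{prefix}_USER"]
        password = os.environ[f"{prefix}_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable {missing}") from None
    return {"user": user, "password": password, "driver": "org.postgresql.Driver"}


# Usage in this notebook:
#   properties = jdbc_properties_from_env()
#   dataframe.write.mode('append').jdbc(url=url, table=table, properties=properties)
```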

In [None]:
spark.read.jdbc(url = url, table = table, properties=properties).show(truncate = False)

+-----------+--------+------+----------------------------------------------------+------+----+-----+-----+----------------+--------+-------+--------+
|PassengerId|Survived|Pclass|Name                                                |Sex   |Age |SibSp|Parch|Ticket          |Fare    |Cabin  |Embarked|
+-----------+--------+------+----------------------------------------------------+------+----+-----+-----+----------------+--------+-------+--------+
|119        |0       |1     |Baxter, Mr. Quigg Edmond                            |male  |24.0|0    |1    |PC 17558        |247.5208|B58 B60|C       |
|240        |0       |2     |Hunt, Mr. George Henry                              |male  |33.0|0    |0    |SCO/W 1585      |12.275  |NULL   |S       |
|138        |0       |1     |Futrelle, Mr. Jacques Heath                         |male  |37.0|1    |0    |113803          |53.1    |C123   |S       |
|730        |0       |3     |Ilmakangas, Miss. Pieta Sofia                       |female|25.0|1    |

In [None]:
!service postgresql stop

 * Stopping PostgreSQL 14 database server
   ...done.


In [None]:
# saveAsTable

# bucketBy and sortBy are only supported together with saveAsTable
dataframe.write \
    .mode('overwrite') \
    .partitionBy('PClass') \
    .bucketBy(2, 'survived') \
    .sortBy('PassengerId') \
    .saveAsTable('demo_table')

In [None]:
spark.sql('''select * from demo_table''').printSchema()

table_schema =  spark.sql('''select * from demo_table''').schema

spark.sql('''select * from demo_table''').show(truncate = False)

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)
 |-- Pclass: integer (nullable = true)

+-----------+--------+---------------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|PassengerId|Survived|Name                                                     |Sex   |Age |SibSp|Parch|Ticket          |Fare   |Cabin|Embarked|Pclass|
+-----------+--------+---------------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|1          |0       |Braund, Mr. Owen Harris                                  |m

##### `partitionBy` vs `bucketBy` vs `sortBy`
---
##### partitionBy - **Physical Separation**
- **What**: Creates **actual folders/directories** on storage
- **Visual**: You can SEE the separation in file explorer
- **Structure**: `country=US/`, `country=UK/` as separate folders
- **Analogy**: Library with separate rooms for each genre
---
##### bucketBy - **Logical Separation**  
- **What**: Splits data into a **fixed number of buckets** using hashing; each bucket is stored as one or more files
- **Visual**: All files in same folder, separation is INTERNAL
- **Structure**: 10 buckets, where the hash of user_id determines the bucket
- **Analogy**: Single room with numbered shelves (calculate which shelf)
---
##### sortBy - **Internal Ordering**
- **What**: **Sorts data rows** within each file
- **Visual**: No structural change, just internal row order
- **Structure**: Records ordered by timestamp within each file
- **Analogy**: Books arranged alphabetically on each shelf
---
##### Availability Matrix

| Method | CSV | JSON | Parquet/ORC | JDBC | saveAsTable |
|--------|-----|------|-------------|------|-------------|
| **partitionBy** | ✅ | ✅ | ✅ | ❌ | ✅ |
| **bucketBy** | ❌ | ❌ | ❌ | ❌ | ✅ |
| **sortBy** | ❌ | ❌ | ❌ | ❌ | ✅ |
---
##### Detailed Comparison
| Aspect | partitionBy | bucketBy | sortBy |
|--------|-------------|----------|---------|
| **Separation Level** | Directory | File | Row |
| **Visibility** | Visible in file system | Hidden, internal | Hidden, internal |
| **Data Access** | Direct folder navigation | Hash calculation | Sequential scanning |
| **Optimal Use Case** | Low cardinality (≤1000 values) | High cardinality (millions) | Ordered access patterns |
| **Performance Benefit** | Partition pruning | Join optimization | Range query optimization |
| **File Impact** | Multiple folders with files | Files grouped into a fixed number of buckets | Same files, sorted internally |
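
The "hash calculation" behind bucketing can be illustrated without Spark. The sketch below assigns keys to buckets by hashing and taking the remainder. `zlib.crc32` is a stand-in chosen for determinism; Spark uses its own Murmur3-based hash, so real bucket ids will differ from what this code prints.

```python
import zlib


def bucket_for(key, num_buckets: int) -> int:
    """Map a key to a bucket id in [0, num_buckets) by hashing.

    zlib.crc32 is a stand-in for illustration only; it is not the hash
    function Spark uses for bucketBy.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % num_buckets


# Hypothetical user ids spread over 2 buckets.
user_ids = [101, 102, 103, 104, 105, 106]
assignment = {uid: bucket_for(uid, 2) for uid in user_ids}
print(assignment)
```

Because the same key always lands in the same bucket, two tables bucketed the same way on the join key can be joined bucket by bucket, which is the source of the join optimization listed in the table.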
---
##### When to Use Each Method?
---
##### ✅ Use partitionBy when:
- You have clear categories (country, year, month)
- You frequently filter by these categories
- You need to manage data lifecycle (drop old partitions)
- **Works with**: Files (CSV, JSON, Parquet) + Tables
---
##### ✅ Use bucketBy when:
- You have high-cardinality columns (user_id, product_id)
- Tables are frequently joined on these columns
- You need even data distribution
- **Works with**: saveAsTable ONLY
---
##### ✅ Use sortBy when:
- You perform range queries (BETWEEN, >, <)
- Data has natural ordering (timestamps, sequences)
- You need better compression
- **Works with**: saveAsTable ONLY, and only together with bucketBy
---
##### Critical Limitations
---
##### File Formats (CSV, JSON, Parquet, ORC):
- **Only partitionBy** available for physical organization
- **No bucketing** - cannot optimize joins at write time
- **No sortBy** - pre-sort the DataFrame (for example with `sortWithinPartitions`) before writing
---
##### JDBC Writes:
- **No partitioning** - database handles table partitioning
- **No bucketing** - database handles indexing
- **No sortBy** - database handles query optimization
---
##### saveAsTable (Hive/Spark Tables):
- **Full feature set** available
- **Only method** for bucketing and sorting during write
- **Requires** metastore integration
---
##### Best Practices
---
##### For Maximum Performance (saveAsTable only):
Use the **three-layer optimization**:
1. **partitionBy** for coarse-grained physical separation
2. **bucketBy** for join optimization and even distribution  
3. **sortBy** for scan efficiency and compression
---
##### Cardinality Guidelines:
- **partitionBy**: 10-1000 distinct values ideal
- **bucketBy**: Millions of distinct values handled well
- **sortBy**: No cardinality limits
---
##### File Management:
- **partitionBy**: Risk of too many small files
- **bucketBy**: Fixed bucket count; each write task can still emit one file per bucket
- **sortBy**: No impact on file count
---
##### Summary
✅ **Supported**:
- partitionBy: Universal (files + tables)
- bucketBy: saveAsTable exclusive
- sortBy: saveAsTable exclusive

❌ **Not Supported**:
- bucketBy/sortBy with file formats (CSV, JSON, Parquet)
- Any organization methods with JDBC writer
- partitionBy with JDBC (database handles partitioning)
---
**Key Insight**: partitionBy organizes your storage, bucketBy organizes your data relationships, sortBy organizes your data access patterns.

In [None]:
table_schema

StructType([StructField('PassengerId', IntegerType(), True), StructField('Survived', IntegerType(), True), StructField('Name', StringType(), True), StructField('Sex', StringType(), True), StructField('Age', DoubleType(), True), StructField('SibSp', IntegerType(), True), StructField('Parch', IntegerType(), True), StructField('Ticket', StringType(), True), StructField('Fare', DoubleType(), True), StructField('Cabin', StringType(), True), StructField('Embarked', StringType(), True), StructField('Pclass', IntegerType(), True)])

DataFrame[PassengerId: int, Survived: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string, Pclass: int]

In [None]:
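# insertInto

# insertInto matches columns by position, not by name, so first cast and
# reorder the DataFrame to the table schema (DataFrame.to, Spark 3.4+).
compatible_df = dataframe.to(table_schema)

# Appends to the existing table by default
compatible_df.write\
             .insertInto('demo_table')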
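
Because `insertInto` resolves columns by position, a DataFrame whose columns are in a different order than the table can silently land values in the wrong columns. The sketch below shows one way to guard against that by selecting columns in the table's order first. The helper name `insert_by_table_order` is hypothetical, and it only reorders columns; it does not cast types the way `DataFrame.to` does.

```python
def insert_by_table_order(spark, df, table_name, overwrite=False):
    """Reorder `df` columns to the table's column order, then insertInto.

    `spark` is the active SparkSession and `df` a pyspark.sql.DataFrame.
    Column names are compared case-insensitively, matching Spark's default.
    """
    target_columns = spark.table(table_name).columns
    available = {name.lower() for name in df.columns}
    missing = [c for c in target_columns if c.lower() not in available]
    if missing:
        raise ValueError(f"DataFrame lacks columns required by {table_name}: {missing}")
    # overwrite=False appends; overwrite=True replaces the existing data.
    df.select(*target_columns).write.insertInto(table_name, overwrite=overwrite)


# Usage in this notebook:
#   insert_by_table_order(spark, dataframe, 'demo_table')
```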

##### Spark Warehouse Table vs Spark Temp View - Comparison

| Feature | Spark Warehouse Table (Managed) | Spark Temp View |
|---------|---------------------------------|-----------------|
| **Persistence & Storage** | Yes. Data is physically stored in the Spark SQL warehouse directory (e.g., `spark-warehouse/`). | No. It is purely a logical abstraction or a "view" over existing data. It holds no data itself. |
| **Lifetime** | Permanent. Persists until explicitly dropped by a `DROP TABLE` command. Survives Spark application restarts. | Session-scoped. Automatically disappears when the SparkSession ends. For GLOBAL TEMP VIEW, it is tied to the Spark application. |
| **Metadata Management** | Cataloged. Its schema and location are stored in a metastore (Spark's built-in metastore or Hive). | Not persisted in a metastore. Its definition lives only in the SparkSession that created it. |
| **Impact of DROP** | Deletes both the metadata AND the underlying data files. | Only deletes the view definition. The underlying data source is completely unaffected. |
| **Underlying Data Source** | The data is the table itself. The table "owns" the data. | Can be built on top of an existing table, a file (CSV, Parquet), or the result of a DataFrame transformation. |
| **Primary Use Case** | For data that needs to be stored, managed, and shared long-term across multiple jobs and users. The "single source of truth." | For ad-hoc, session-specific data manipulation. Great for breaking down complex queries, providing a friendly name to a complex DataFrame, or during exploratory data analysis. |

---

##### When to Use Which?

##### Use a Warehouse Table when...
- You need to persist the results of your processing
- The data is a shared dimension or fact table for other jobs/users
- You are building a curated dataset or a data mart
- The data's lifecycle should be managed by Spark

##### Use a Temp View when...
- You are performing exploratory data analysis in a notebook
- You need to simplify a complex SQL query by breaking it into parts
- You are working with DataFrames in Python/Scala/Java and want to run SQL on them
- The data is temporary and only relevant for the duration of your current session or script

---

##### Key Analogy

**Warehouse Table** = The physical house (permanent structure)

**Temp View** = The blueprints or brochure (temporary description)
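
The contrast can be sketched in a few lines of PySpark. The helper names, and the view and table names in the usage comment, are hypothetical; the calls themselves (`createOrReplaceTempView`, `saveAsTable`, `spark.catalog.dropTempView`) are standard API.

```python
def publish(df, view_name, table_name):
    """Expose `df` both as a session-scoped temp view and as a managed table."""
    # Temp view: only a name for the query plan; nothing is written to storage.
    df.createOrReplaceTempView(view_name)
    # Managed table: data files are written under the warehouse directory.
    df.write.mode("overwrite").saveAsTable(table_name)


def cleanup(spark, view_name, table_name):
    """Drop both objects; only the table drop removes data files."""
    spark.catalog.dropTempView(view_name)
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")


# Usage in this notebook (before spark.stop()):
#   publish(dataframe, 'passengers_view', 'passengers_table')
#   spark.sql('select count(*) from passengers_view').show()
#   cleanup(spark, 'passengers_view', 'passengers_table')
```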

In [None]:
spark.stop()