# PySpark File Format Guide: Reading and Writing Data

This notebook demonstrates how to efficiently read and write data in various file formats using PySpark:

1. CSV
2. JSON
3. Parquet
4. Avro

For each format, we'll cover:
- Basic reading and writing operations
- Common options and parameters
- Performance considerations
- Best practices

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, rand, monotonically_increasing_id
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    DoubleType, BooleanType, DateType, TimestampType
)
import time
import os

# Initialize SparkSession
# Note: the spark-avro artifact below must match your Scala (2.12) and Spark (3.3.0) versions
spark = SparkSession.builder \
    .appName("File Format Guide") \
    .config("spark.sql.avro.compression.codec", "snappy") \
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.0") \
    .getOrCreate()

print("SparkSession initialized successfully!")

SparkSession initialized successfully!



## Creating Sample Data

First, let's create a sample dataset to work with throughout this guide.

In [2]:
# Create a simple dataset with different data types
data = [
    (1, "John Doe", 35, "New York", 72000.50, True, "2020-01-15"),
    (2, "Jane Smith", 28, "San Francisco", 86000.00, False, "2019-06-22"),
    (3, "Robert Brown", 42, "Chicago", 92000.75, True, "2021-03-08"),
    (4, "Maria Garcia", 31, "Los Angeles", 67500.25, True, "2018-11-30"),
    (5, "James Wilson", 45, "Seattle", 115000.00, False, "2022-02-12")
]

# Define schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("is_manager", BooleanType(), True),
    StructField("hire_date", StringType(), True)
])

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Create data directory if it doesn't exist
os.makedirs("/tmp/spark_data", exist_ok=True)

# Show the DataFrame
print("Sample DataFrame:")
df.show()

Sample DataFrame:


                                                                                

+---+------------+---+-------------+--------+----------+----------+
| id|        name|age|         city|  salary|is_manager| hire_date|
+---+------------+---+-------------+--------+----------+----------+
|  1|    John Doe| 35|     New York| 72000.5|      true|2020-01-15|
|  2|  Jane Smith| 28|San Francisco| 86000.0|     false|2019-06-22|
|  3|Robert Brown| 42|      Chicago|92000.75|      true|2021-03-08|
|  4|Maria Garcia| 31|  Los Angeles|67500.25|      true|2018-11-30|
|  5|James Wilson| 45|      Seattle|115000.0|     false|2022-02-12|
+---+------------+---+-------------+--------+----------+----------+



## 1. CSV Files

CSV (Comma-Separated Values) is a widely-used format for tabular data.

### Writing CSV Files

In [3]:
# Basic CSV write
df.write \
    .mode("overwrite") \
    .csv("/tmp/spark_data/basic.csv")

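# CSV write with options
# (dateFormat only affects DateType columns; hire_date is a string here, so it is written unchanged)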
df.write \
    .mode("overwrite") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("quote", "\"") \
    .option("dateFormat", "yyyy-MM-dd") \
    .option("nullValue", "NULL") \
    .csv("/tmp/spark_data/formatted.csv")

# CSV with partition (data is stored in subdirectories by city)
df.write \
    .mode("overwrite") \
    .partitionBy("city") \
    .option("header", "true") \
    .csv("/tmp/spark_data/partitioned.csv")

print("CSV files written successfully.")

CSV files written successfully.
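
Note that each path above names a directory, not a single file: Spark writes one `part-*` file per partition, plus marker files such as `_SUCCESS`. As a minimal sketch for a local filesystem, a helper like the one below can list what was actually written. The name `list_part_files` is hypothetical and not part of PySpark.

```python
import glob
import os


def list_part_files(output_dir):
    """Return the data files Spark wrote under output_dir, sorted by path.

    The recursive pattern also finds files inside partition
    subdirectories (for example city=Chicago/part-...). Hidden checksum
    files such as .part-....crc and the _SUCCESS marker do not match.
    """
    pattern = os.path.join(output_dir, "**", "part-*")
    return sorted(p for p in glob.glob(pattern, recursive=True) if os.path.isfile(p))
```

For example, `list_part_files("/tmp/spark_data/partitioned.csv")` should return one or more files under each `city=...` subdirectory.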


### Reading CSV Files

In [4]:
# Basic CSV read
df_csv_basic = spark.read.csv("/tmp/spark_data/basic.csv")

print("Basic CSV Read (note how column names are generic and types are strings):")
df_csv_basic.printSchema()
df_csv_basic.show(3)

# Read with options and schema inference
df_csv_with_header = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .csv("/tmp/spark_data/formatted.csv")

print("\nCSV Read with Header and Schema Inference:")
df_csv_with_header.printSchema()
df_csv_with_header.show(3)

# Read with explicit schema
df_csv_with_schema = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv("/tmp/spark_data/formatted.csv")

print("\nCSV Read with Explicit Schema:")
df_csv_with_schema.printSchema()
df_csv_with_schema.show(3)

Basic CSV Read (note how column names are generic and types are strings):
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)

+---+------------+---+-------------+--------+-----+----------+
|_c0|         _c1|_c2|          _c3|     _c4|  _c5|       _c6|
+---+------------+---+-------------+--------+-----+----------+
|  4|Maria Garcia| 31|  Los Angeles|67500.25| true|2018-11-30|
|  2|  Jane Smith| 28|San Francisco| 86000.0|false|2019-06-22|
|  5|James Wilson| 45|      Seattle|115000.0|false|2022-02-12|
+---+------------+---+-------------+--------+-----+----------+
only showing top 3 rows


CSV Read with Header and Schema Inference:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: doub

### CSV Best Practices

1. **Always specify a schema for production workloads** - Schema inference (`inferSchema`) costs an extra pass over the data and may guess the wrong types.
2. **Use `header=true` when possible** - The column names travel with the data, which makes files self-documenting.
3. **Set an appropriate `nullValue` option** - Define how NULL values are represented in your CSV, and use the same value for writing and reading.
4. **Be explicit with date formats** - Set `dateFormat` so that `DateType` columns parse correctly.
5. **For large datasets, consider**:
   - Setting compression (`compression='gzip'`), keeping in mind that a gzip-compressed file cannot be split across tasks
   - Proper partitioning (`partitionBy`)
   - Specifying escape characters for special data
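
The practices above can be sketched as a small reader function. This is one possible arrangement, not the only one: the names `EMPLOYEE_DDL`, `CSV_READ_OPTIONS`, and `read_employees_csv` are hypothetical, and the function assumes a live `spark` session is passed in. Unlike the sample schema earlier, it declares `hire_date` as `DATE` so that `dateFormat` actually takes effect.

```python
# Schema written as a DDL string; DataFrameReader.schema accepts this form,
# so no pyspark.sql.types imports are needed here.
EMPLOYEE_DDL = (
    "id INT, name STRING, age INT, city STRING, "
    "salary DOUBLE, is_manager BOOLEAN, hire_date DATE"
)

# Read options matching how formatted.csv was written above.
CSV_READ_OPTIONS = {
    "header": "true",            # first line holds column names
    "nullValue": "NULL",         # same marker used when writing
    "dateFormat": "yyyy-MM-dd",  # applies to the DATE column
    "mode": "FAILFAST",          # raise on malformed rows instead of nulling them
}


def read_employees_csv(spark, path):
    """Read a CSV directory with an explicit schema instead of inferSchema."""
    return spark.read.schema(EMPLOYEE_DDL).options(**CSV_READ_OPTIONS).csv(path)
```

In this notebook it could be called as `read_employees_csv(spark, "/tmp/spark_data/formatted.csv")`. `FAILFAST` is a deliberate choice for pipelines where silently nulled fields would be worse than a failed job. The default `PERMISSIVE` mode may suit exploratory work better.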

## 2. JSON Files

JSON (JavaScript Object Notation) is excellent for semi-structured data.

### Writing JSON Files

In [5]:
# Basic JSON write
df.write \
    .mode("overwrite") \
    .json("/tmp/spark_data/basic.json")

# JSON write with options
df.write \
    .mode("overwrite") \
    .option("compression", "gzip") \
    .option("dateFormat", "yyyy-MM-dd") \
    .json("/tmp/spark_data/compressed.json")

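# JSON write with the "pretty" option. This option is documented for to_json(),
# not for the JSON file writer, so check the output before relying on it.
# Pretty-printed records span several lines and need multiLine=true to be read back.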
df.write \
    .mode("overwrite") \
    .option("pretty", "true") \
    .json("/tmp/spark_data/pretty.json")

print("JSON files written successfully.")

JSON files written successfully.


### Reading JSON Files

In [6]:
# Basic JSON read
df_json_basic = spark.read.json("/tmp/spark_data/basic.json")

print("Basic JSON Read (with schema inference):")
df_json_basic.printSchema()
df_json_basic.show(3)

# Read with explicit schema
df_json_with_schema = spark.read \
    .schema(schema) \
    .json("/tmp/spark_data/basic.json")

print("\nJSON Read with Explicit Schema:")
df_json_with_schema.printSchema()
df_json_with_schema.show(3)

# Reading multi-line JSON (where each JSON object may span multiple lines)
multi_line_json = """[
    {
        "id": 1,
        "name": "John Doe",
        "age": 35,
        "city": "New York",
        "salary": 72000.50,
        "is_manager": true,
        "hire_date": "2020-01-15"
    },
    {
        "id": 2,
        "name": "Jane Smith",
        "age": 28,
        "city": "San Francisco",
        "salary": 86000.00,
        "is_manager": false,
        "hire_date": "2019-06-22"
    }
]"""

# Write multi-line JSON to a file
with open("/tmp/spark_data/multiline.json", "w") as f:
    f.write(multi_line_json)

# Read multi-line JSON
df_multiline_json = spark.read \
    .option("multiline", "true") \
    .json("/tmp/spark_data/multiline.json")

print("\nMulti-line JSON Read:")
df_multiline_json.show()

Basic JSON Read (with schema inference):
root
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- is_manager: boolean (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = true)

+---+-------------+----------+---+----------+------------+--------+
|age|         city| hire_date| id|is_manager|        name|  salary|
+---+-------------+----------+---+----------+------------+--------+
| 28|San Francisco|2019-06-22|  2|     false|  Jane Smith| 86000.0|
| 31|  Los Angeles|2018-11-30|  4|      true|Maria Garcia|67500.25|
| 45|      Seattle|2022-02-12|  5|     false|James Wilson|115000.0|
+---+-------------+----------+---+----------+------------+--------+
only showing top 3 rows


JSON Read with Explicit Schema:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary:

### JSON Best Practices

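1. **Use `multiLine=true` only when needed** - It is required for files where a JSON object spans multiple lines, but such files cannot be split, so each file is read by a single task.
2. **Prefer one record per line** - The JSON Lines layout lets Spark split files and read them in parallel.
3. **Use explicit schemas in production** - Inference reads the data an extra time and, as shown above, widens integers to `long` and sorts columns alphabetically.
4. **Consider compression** - JSON is verbose, so compression helps with storage.
5. **Be careful with complex nested structures** - These can be processed, but deep nesting slows parsing and complicates queries.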
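
To illustrate the "one record per line" advice, here is a minimal sketch that converts a multi-line JSON array (like the one written to `multiline.json` above) into JSON Lines using only the standard library. The function name `json_array_to_json_lines` is hypothetical. It loads the whole array into memory, so it only suits files that fit comfortably in RAM.

```python
import json


def json_array_to_json_lines(array_text):
    """Convert the text of a JSON array of objects into JSON Lines."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # One compact JSON document per line, each terminated by a newline.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


sample = """[
    {"id": 1, "name": "John Doe"},
    {"id": 2, "name": "Jane Smith"}
]"""
print(json_array_to_json_lines(sample), end="")
# {"id":1,"name":"John Doe"}
# {"id":2,"name":"Jane Smith"}
```

A file in this layout can be read with a plain `spark.read.json(path)`, without the `multiLine` option.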

## 3. Parquet Files

Parquet is a columnar format optimized for analytics workloads, offering efficient storage and querying.

### Writing Parquet Files

In [7]:
# Basic Parquet write
df.write \
    .mode("overwrite") \
    .parquet("/tmp/spark_data/basic.parquet")

# Parquet with compression options
df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("/tmp/spark_data/compressed.parquet")

# Parquet with partitioning
df.write \
    .mode("overwrite") \
    .partitionBy("city", "is_manager") \
    .parquet("/tmp/spark_data/partitioned.parquet")

print("Parquet files written successfully.")

Parquet files written successfully.


### Reading Parquet Files

In [13]:
# Basic Parquet read
df_parquet_basic = spark.read.parquet("/tmp/spark_data/basic.parquet")

print("Basic Parquet Read (schemas are preserved):")
df_parquet_basic.printSchema()
df_parquet_basic.show(3)

# Read with column projection (reading only specific columns)
df_parquet_select = spark.read.parquet("/tmp/spark_data/basic.parquet").select("id", "name", "city")

print("\nParquet Read with Column Projection:")
df_parquet_select.show(3)

# Read with partition discovery and filtering
df_parquet_filtered = spark.read.parquet("/tmp/spark_data/partitioned.parquet") \
    .filter(col("city") == "New York")

print("\nParquet Read with Partition Filtering:")
df_parquet_filtered.explain()  # Look for PushedFilters and PartitionFilters in the plan
df_parquet_filtered.show()

Basic Parquet Read (schemas are preserved):
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- is_manager: boolean (nullable = true)
 |-- hire_date: string (nullable = true)

+---+------------+---+-------------+--------+----------+----------+
| id|        name|age|         city|  salary|is_manager| hire_date|
+---+------------+---+-------------+--------+----------+----------+
|  4|Maria Garcia| 31|  Los Angeles|67500.25|      true|2018-11-30|
|  2|  Jane Smith| 28|San Francisco| 86000.0|     false|2019-06-22|
|  5|James Wilson| 45|      Seattle|115000.0|     false|2022-02-12|
+---+------------+---+-------------+--------+----------+----------+
only showing top 3 rows


Parquet Read with Column Projection:
+---+------------+-------------+
| id|        name|         city|
+---+------------+-------------+
|  4|Maria Garcia|  Los Angeles|
|  2|  Jane Sm

### Parquet Best Practices

1. **Use Parquet for analytics workloads** - It's optimized for query performance.
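2. **Choose appropriate partitioning** - Partition on low-cardinality columns used for filtering; partitioning on many distinct values produces a large number of tiny files.
3. **Rely on predicate pushdown** - It is enabled by default (`spark.sql.parquet.filterPushdown`), so filters on Parquet columns are applied while reading files rather than after loading the data.
4. **Choose Snappy compression** - Good balance of compression ratio and speed, and Spark's default codec for Parquet.
5. **Consider file size** - Aim for Parquet files in the 256MB-1GB range for best performance.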
6. **Design for column projection** - Parquet shines when you only need to read a subset of columns.
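
To see why partition filters are cheap, it helps to look at the directory layout that `partitionBy` produces: each partition value becomes a `key=value` directory, so a filter on a partition column can skip whole directories. The sketch below is a simplified pure-Python model of that idea, not Spark's actual implementation (which also infers value types and handles escaped characters). The function names and the short `part-*.parquet` file names are hypothetical.

```python
def parse_partition_values(file_path):
    """Extract key=value directory segments from a Hive-style partitioned path."""
    values = {}
    for segment in file_path.split("/")[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def prune_paths(file_paths, **wanted):
    """Keep only files whose partition directories match every wanted value."""
    return [
        p for p in file_paths
        if all(parse_partition_values(p).get(k) == v for k, v in wanted.items())
    ]


paths = [
    "/tmp/spark_data/partitioned.parquet/city=Chicago/is_manager=true/part-0.parquet",
    "/tmp/spark_data/partitioned.parquet/city=Seattle/is_manager=false/part-1.parquet",
]
print(prune_paths(paths, city="Seattle"))
# ['/tmp/spark_data/partitioned.parquet/city=Seattle/is_manager=false/part-1.parquet']
```

In a real query plan, the equivalent step shows up under `PartitionFilters` in the output of `explain()`.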

## 4. Performance Comparison

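Let's create a larger dataset and compare the formats for both writing and reading performance. Keep in mind that at only 80 rows, the timings below mostly reflect Spark's job scheduling overhead rather than real differences between formats, so treat them as a demonstration of the method.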

In [14]:
# Create a larger dataset
large_df = df
for i in range(4):  # Each union doubles the row count: 5 * 2^4 = 80 rows
    large_df = large_df.union(large_df)

# Add some randomness
large_df = large_df.withColumn("id", monotonically_increasing_id())
large_df = large_df.withColumn("salary", col("salary") * (rand() + 0.5))

print(f"Created performance test dataset with {large_df.count()} rows")



Created performance test dataset with 80 rows
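
For a benchmark that says more about the formats themselves, a much larger dataset is needed. One option, sketched below under the assumption that a live `spark` session is passed in, is to generate rows with `spark.range` instead of repeated self-unions, since each self-union also doubles the number of partitions. The names `SYNTHETIC_COLUMNS` and `make_synthetic_df`, and the generated values, are hypothetical.

```python
# Spark SQL expressions that derive employee-like columns from the id column
# produced by spark.range().
SYNTHETIC_COLUMNS = [
    "id",
    "concat('user_', CAST(id AS STRING)) AS name",
    "CAST(20 + id % 40 AS INT) AS age",
    "element_at(array('New York', 'Chicago', 'Seattle'), CAST(id % 3 AS INT) + 1) AS city",
    "50000 + rand() * 50000 AS salary",
    "id % 2 = 0 AS is_manager",
]


def make_synthetic_df(spark, n_rows=1_000_000):
    """Build n_rows of synthetic data without repeated self-unions."""
    return spark.range(n_rows).selectExpr(*SYNTHETIC_COLUMNS)
```

For example, `large_df = make_synthetic_df(spark, 5_000_000)` could stand in for the union loop above. The row count is an arbitrary choice and should be sized to the machine running the notebook.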


                                                                                

In [15]:
# Function to measure write performance
def measure_write_performance(df, format_name, options=None):
    path = f"/tmp/spark_data/perf_{format_name}"
    writer = df.write.mode("overwrite")

    # Add options (a None default avoids sharing one mutable dict across calls)
    for k, v in (options or {}).items():
        writer = writer.option(k, v)

    # Measure write time with a monotonic, high-resolution clock
    start_time = time.perf_counter()

    # format(...).save(...) works for csv, json, parquet, and avro alike
    writer.format(format_name).save(path)

    write_time = time.perf_counter() - start_time

    return path, write_time

# Function to measure read performance
def measure_read_performance(format_name, path, options=None):
    reader = spark.read

    # Add options
    for k, v in (options or {}).items():
        reader = reader.option(k, v)

    # Measure read time (includes schema inference for CSV and JSON)
    start_time = time.perf_counter()

    df_read = reader.format(format_name).load(path)

    # count() forces execution, but Parquet can answer it without decoding
    # every column, so this understates the cost of a full scan
    df_read.count()
    read_time = time.perf_counter() - start_time

    return read_time

In [16]:
# Run performance comparison
results = []

# Test CSV
csv_path, csv_write_time = measure_write_performance(
    large_df, "csv", {"header": "true"}
)
csv_read_time = measure_read_performance(
    "csv", csv_path, {"header": "true", "inferSchema": "true"}
)
results.append(("CSV", csv_write_time, csv_read_time))

# Test JSON
json_path, json_write_time = measure_write_performance(large_df, "json")
json_read_time = measure_read_performance("json", json_path)
results.append(("JSON", json_write_time, json_read_time))

# Test Parquet
parquet_path, parquet_write_time = measure_write_performance(
    large_df, "parquet", {"compression": "snappy"}
)
parquet_read_time = measure_read_performance("parquet", parquet_path)
results.append(("Parquet", parquet_write_time, parquet_read_time))

# Display results
print("Performance Comparison:")
print("Format\tWrite Time (s)\tRead Time (s)")
print("------\t--------------\t-------------")
for format_name, write_time, read_time in results:
    print(f"{format_name}\t{write_time:.2f}\t\t{read_time:.2f}")

                                                                                

Performance Comparison:
Format	Write Time (s)	Read Time (s)
------	--------------	-------------
CSV	1.75		0.32
JSON	1.53		0.18
Parquet	1.64		0.09
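
Single-run timings like these are noisy, especially on a warm versus cold JVM. A minimal sketch of a more stable approach is to run each action a few times after a warm-up and report the median. The helper name `time_repeated` and its default repeat counts are hypothetical choices.

```python
import statistics
import time


def time_repeated(action, repeats=5, warmup=1):
    """Return the median wall-clock seconds of action() over several runs."""
    for _ in range(warmup):
        action()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

In this notebook it could wrap a read, for example `time_repeated(lambda: spark.read.parquet(parquet_path).count())`. The median is used rather than the mean so that one slow outlier run does not dominate the result.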


## 5. File Size Comparison

In [17]:
def get_directory_size(path):
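    # A helper function to calculate directory size.
    # This is a simplified version and might not work in all environments.
    # Note: du reports allocated disk blocks, not logical bytes, so a directory
    # holding many tiny part files looks about the same size in every format.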
    import subprocess
    try:
        output = subprocess.check_output(['du', '-sh', path]).decode('utf-8')
        size = output.split()[0]
        return size
    except Exception as e:
        return f"Error: {e}"

# Measure directory sizes
print("File Size Comparison:")
print("Format\tSize")
print("------\t----")
print(f"CSV\t{get_directory_size(csv_path)}")
print(f"JSON\t{get_directory_size(json_path)}")
print(f"Parquet\t{get_directory_size(parquet_path)}")
try:
    print(f"Avro\t{get_directory_size(avro_path)}")
except NameError:
    print("Avro\tNot tested")

File Size Comparison:
Format	Size
------	----
CSV	672K
JSON	668K
Parquet	672K
Avro	Not tested
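
The near-identical sizes above are likely an artifact of `du` counting allocated disk blocks across many small part files. As a hedged alternative for local filesystems, the sketch below sums the logical byte sizes of the data files instead. The name `logical_size_bytes` is hypothetical, and the rule of skipping names that start with `.` or `_` is an assumption based on Spark's usual `_SUCCESS` and `.crc` side files.

```python
import os


def logical_size_bytes(directory):
    """Sum the byte sizes of data files under directory.

    Skips files whose names start with "." or "_" (checksum and marker
    files) so the total reflects data only.
    """
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.startswith((".", "_")):
                continue
            total += os.path.getsize(os.path.join(root, name))
    return total
```

For example, `logical_size_bytes(parquet_path)` returns a byte count that can be compared directly across formats.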


## 6. Format Selection Guide

### When to use each format:

**CSV**
- ✅ Human-readable, universal compatibility
- ✅ Easy to edit manually or with spreadsheet software
- ❌ Poor performance for large data
- ❌ Inefficient storage (no compression by default)
- ❌ No schema preservation
- **Best for**: Small datasets, data exchange with non-Spark systems, human-editable files

**JSON**
- ✅ Supports nested structures
- ✅ Good compatibility with web services
- ✅ Human-readable
- ❌ Verbose format, larger files
- ❌ Slower than binary formats
- **Best for**: API integrations, semi-structured data, moderate-sized datasets

**Parquet**
- ✅ Best query performance
- ✅ Column pruning and predicate pushdown
- ✅ Efficient compression
- ✅ Schema preservation
- ❌ Not human-readable
- ❌ Less universal than CSV/JSON
- **Best for**: Analytics workloads, large datasets, frequent querying, column-oriented access patterns

**Avro**
- ✅ Schema evolution support
- ✅ Good for record/row-based access
- ✅ Rich data type support
- ✅ Compact binary format
- ❌ Not as efficient as Parquet for analytics
- ❌ Not human-readable
- **Best for**: Data with evolving schemas, streaming data integration, row-oriented access patterns
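
The session above loads the `spark-avro` package, but the notebook never writes an Avro file. As a minimal sketch of what that could look like, assuming the package resolved correctly and that a live `spark` session and DataFrame are passed in: the names `AVRO_WRITE_OPTIONS` and `write_and_read_avro`, the output path, and the `recordName`/`recordNamespace` values are all hypothetical.

```python
# Write options for the Avro data source; the record name and namespace
# values are placeholders for illustration.
AVRO_WRITE_OPTIONS = {
    "compression": "snappy",
    "recordName": "Employee",
    "recordNamespace": "com.example.hr",
}


def write_and_read_avro(spark, df, path="/tmp/spark_data/basic.avro"):
    """Write df as Avro, then read it back to confirm the round trip."""
    # Avro has no dedicated writer method, so it goes through format("avro").
    df.write.mode("overwrite").format("avro").options(**AVRO_WRITE_OPTIONS).save(path)
    return spark.read.format("avro").load(path)
```

Calling `write_and_read_avro(spark, df).printSchema()` should show the same column names and types as the original DataFrame, since Avro files carry their schema.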

## 7. Summary of Key Options and Parameters

### Common Parameters for All Formats
- `mode`: `overwrite`, `append`, `ignore`, `error`/`errorifexists` (default)
- `partitionBy`: Saves data in partitioned directory structure

### CSV Options
- `header`: `true` to write column names as the first line, or to treat the first line as column names when reading
- `delimiter`: Character separator (default `,`)
- `quote`: Character for quoting (default `"`)
- `escape`: Character for escaping (default `\`)
- `nullValue`: String representation for null values
- `inferSchema`: Automatically detect column types (read only; requires an extra pass over the data)
- `dateFormat`: Format string for parsing and writing date columns

### JSON Options
- `multiLine`: `true` for multi-line JSON records
- `dateFormat`: Format string for date parsing
- `compression`: Compression codec (e.g., `gzip`, `snappy`)
- `primitivesAsString`: Convert primitives to strings
- `allowUnquotedFieldNames`: Allow unquoted field names

### Parquet Options
- `compression`: Compression codec (e.g., `snappy`, `gzip`, `none`)
- `mergeSchema`: Reconcile schemas when reading multiple files
- `partitionOverwriteMode`: `static` or `dynamic` partition overwrite

### Avro Options
- `avroSchema`: User-provided schema
- `compression`: Compression codec (e.g., `snappy`, `deflate`)
- `recordName`: Record name in the Avro schema
- `recordNamespace`: Namespace in the Avro schema