# Data Lake File Formats

Understanding file formats is crucial for building efficient data lakes. The choice of format impacts:
- **Query performance** - How fast can you read specific columns or rows?
- **Storage costs** - How well does the format compress?
- **Schema evolution** - Can you add/remove fields without breaking pipelines?
- **Interoperability** - Which tools and engines support the format?

This notebook compares the most common data lake file formats and provides practical guidance for selection.

## 1. File Format Comparison

| Format | Type | Schema | Compression | Best For |
|--------|------|--------|-------------|----------|
| **Parquet** | Columnar | Embedded | Snappy, Gzip, Zstd, LZ4 | Analytics, OLAP queries |
| **ORC** | Columnar | Embedded | Zlib, Snappy, LZO, Zstd | Hive ecosystem, analytics |
| **Avro** | Row-based | Embedded (JSON) | Snappy, Deflate, Bzip2 | Streaming, Kafka, schema evolution |
| **JSON** | Row-based | Self-describing | Gzip, external | APIs, logs, flexibility |
| **CSV** | Row-based | None | Gzip, external | Simple interchange, legacy systems |

### Detailed Format Breakdown

#### Parquet
- **Origin**: Apache project, created by Twitter and Cloudera
- **Structure**: Columnar with row groups, column chunks, and pages
- **Strengths**: Excellent compression ratios, predicate pushdown, column pruning
- **Ecosystem**: Spark, Pandas, Dask, Presto, Athena, BigQuery
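
To make predicate pushdown concrete, here is a minimal pure-Python sketch of how per-row-group min/max statistics let a reader skip data. The `RowGroup` class and the helper functions are hypothetical stand-ins for illustration, not Parquet's actual on-disk structures or any library's API.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical, simplified row group: one column plus min/max statistics."""
    values: list
    min_value: float
    max_value: float


def make_row_groups(values, rows_per_group):
    """Split a column into row groups and record min/max for each one."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove no row can match, so skip the group
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Sorted sample data, so each row group covers a narrow value range
amounts = [12, 18, 25, 40, 55, 61, 300, 420, 480]
groups = make_row_groups(amounts, rows_per_group=3)
matches, groups_read = scan_greater_than(groups, threshold=100)
print(f"matches={matches}, row groups read={groups_read} of {len(groups)}")
```

Skipping works best when the data is sorted or clustered on the filtered column; with randomly ordered values most row groups span the full range and few can be skipped.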

#### ORC (Optimized Row Columnar)
- **Origin**: Apache Hive project
- **Structure**: Columnar with stripes, row indexes, and bloom filters
- **Strengths**: ACID transactional tables in Hive, built-in lightweight indexes, tight Hive integration
- **Ecosystem**: Hive, Presto, Spark (with some limitations)

#### Avro
- **Origin**: Apache project, developed for Hadoop
- **Structure**: Row-based with schema stored in header
- **Strengths**: Schema evolution, compact binary format, RPC support
- **Ecosystem**: Kafka, Spark, Flink, schema registries
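
The idea behind schema evolution can be sketched without an Avro library: a reader with a newer schema fills in defaults for fields that older records lack and ignores fields it does not know about. The schema dictionary and the `resolve` helper below are hypothetical simplifications, not the Avro API.

```python
# Hypothetical reader schema as a plain dict. Real Avro schemas are JSON
# documents resolved by an Avro library, which this sketch does not use.
READER_SCHEMA_V2 = {
    "fields": [
        {"name": "id"},
        {"name": "name"},
        {"name": "country", "default": "unknown"},  # field added in v2
    ]
}


def resolve(record, reader_schema):
    """Project a record onto the reader schema, filling defaults for missing fields."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# A record written before 'country' existed, with a field the reader dropped
old_record = {"id": 1, "name": "Alice", "legacy_flag": True}
new_record = resolve(old_record, READER_SCHEMA_V2)
print(new_record)
```

Adding a field stays backward compatible only when the new field has a default; without one, old records cannot be resolved and the sketch raises an error.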

#### JSON (JavaScript Object Notation)
- **Structure**: Human-readable, self-describing, nested support
- **Strengths**: Universal support, flexible schema, debugging-friendly
- **Weaknesses**: Verbose, slow parsing, no native compression

#### CSV (Comma-Separated Values)
- **Structure**: Plain text, delimiter-separated
- **Strengths**: Universal compatibility, human-readable
- **Weaknesses**: No schema, no types, poor compression, escaping issues
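
The "no types" weakness of CSV is easy to demonstrate with the standard library alone. The sketch below round-trips one illustrative record through CSV and through JSON and compares what comes back.

```python
import csv
import io
import json

rows = [{"id": 1, "name": "Alice", "is_online": True}]

# CSV round trip: every value comes back as a string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "is_online"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
csv_rows = list(csv.DictReader(buffer))

# JSON round trip: numbers and booleans keep their types
json_rows = json.loads(json.dumps(rows))

print("CSV: ", csv_rows[0])
print("JSON:", json_rows[0])
```

With CSV, the consumer has to re-apply types on every read, which is one reason raw CSV is usually converted to a typed format early in a pipeline.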

## 2. Column vs Row-Based Storage

Row-based formats store all fields of a record together; columnar formats store all values of a column together. Most of the trade-offs below follow from that difference.

### Row-Based Storage (CSV, JSON, Avro)

```
Row 1: [id=1, name="Alice", age=30, salary=75000]
Row 2: [id=2, name="Bob",   age=25, salary=65000]
Row 3: [id=3, name="Carol", age=35, salary=85000]
```

**Advantages:**
- ✅ Fast writes (append entire rows)
- ✅ Efficient for `SELECT *` queries
- ✅ Better for OLTP workloads
- ✅ Natural for streaming data

**Disadvantages:**
- ❌ Must read entire row even for single column
- ❌ Weaker compression (values of different types are interleaved)
- ❌ Slow aggregations across large datasets

---

### Columnar Storage (Parquet, ORC)

```
Column 'id':     [1, 2, 3]
Column 'name':   ["Alice", "Bob", "Carol"]
Column 'age':    [30, 25, 35]
Column 'salary': [75000, 65000, 85000]
```

**Advantages:**
- ✅ Read only needed columns (column pruning)
- ✅ Excellent compression (similar values together)
- ✅ Fast aggregations (SUM, AVG, COUNT)
- ✅ Predicate pushdown support

**Disadvantages:**
- ❌ Slower writes (rows must be buffered and reorganized by column)
- ❌ Inefficient for `SELECT *` on few rows
- ❌ More complex file structure
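
The two layouts can be mimicked with plain Python structures. This is only an in-memory analogy, using the sample records from above: the "fields touched" counters stand in for the values a file reader would have to deserialize to answer `SUM(salary)`.

```python
rows = [
    {"id": 1, "name": "Alice", "age": 30, "salary": 75000},
    {"id": 2, "name": "Bob", "age": 25, "salary": 65000},
    {"id": 3, "name": "Carol", "age": 35, "salary": 85000},
]

# Pivot the row layout into a columnar layout: one list per column
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Row layout: every full record is visited to reach one field
row_total = sum(row["salary"] for row in rows)
fields_touched_rows = sum(len(row) for row in rows)

# Columnar layout: only the 'salary' column is visited (column pruning)
column_total = sum(columns["salary"])
fields_touched_columns = len(columns["salary"])

print(f"row layout:    total={row_total}, fields touched={fields_touched_rows}")
print(f"column layout: total={column_total}, fields touched={fields_touched_columns}")
```

Both layouts give the same answer; the gap in touched fields grows with the number of columns the query does not need.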

---

### When to Use Each

| Use Case | Recommended Format |
|----------|--------------------|
| Analytics/BI queries | Columnar (Parquet/ORC) |
| Data warehouse tables | Columnar (Parquet/ORC) |
| Streaming ingestion | Row-based (Avro/JSON) |
| Log processing | Row-based → Columnar |
| Machine learning features | Columnar (Parquet) |
| API responses | JSON |
| Data exchange | CSV/JSON (compatibility) |

## 3. Compression Codecs

Compression reduces storage costs and can improve read performance by reducing I/O.

### Codec Comparison

| Codec | Compression Ratio | Speed | CPU Usage | Best For |
|-------|-------------------|-------|-----------|----------|
| **Snappy** | Medium | Very Fast | Low | Hot data, interactive queries |
| **LZ4** | Medium | Very Fast | Low | Similar to Snappy, slightly faster |
| **Gzip** | High | Slow | High | Cold storage, archival |
| **Zstd** | High | Fast | Medium | Best balance of ratio vs speed |
| **Brotli** | Very High | Very Slow | Very High | Web assets, rarely for data lakes |

### Compression Levels

Most codecs support compression levels:

```
Zstd Levels:
  Level 1:  Fast compression, lower ratio
  Level 3:  Default, good balance
  Level 9:  Slower, better ratio
  Level 19: Highest standard level (20-22 require ultra mode)

Gzip Levels:
  Level 1:  Fastest
  Level 6:  Default
  Level 9:  Best compression
```
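
The effect of the level setting can be measured directly. The sketch below uses Gzip only because it ships with Python's standard library; the payload is made-up, highly repetitive data, so the ratios it prints are illustrative and will differ on real datasets.

```python
import gzip

# Repetitive sample payload, loosely resembling a CSV export (illustrative data)
line = b"2024-01-01,Electronics,199.99,true\n"
raw = line * 5000

sizes = {}
for level in (1, 6, 9):
    compressed = gzip.compress(raw, compresslevel=level)
    assert gzip.decompress(compressed) == raw  # compression is lossless
    sizes[level] = len(compressed)

for level, size in sizes.items():
    print(f"level {level}: {size:,} bytes ({size / len(raw):.2%} of raw)")
```

To choose a level for a real workload, run the same comparison on a representative sample and time the compression step as well, since higher levels trade CPU time for size.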

### Recommendations by Use Case

| Scenario | Recommended Codec |
|----------|-------------------|
| Interactive queries | Snappy or LZ4 |
| Batch processing | Zstd (level 3) |
| Long-term archival | Gzip or Zstd (high level) |
| Streaming pipelines | Snappy (low latency) |
| Storage-constrained | Zstd (level 9+) |

## 4. Python: Reading and Writing Parquet

Parquet is the de facto standard for data lakes. Let's explore how to work with it in Python.

In [None]:
# Install required packages (uncomment if needed)
# !pip install pandas pyarrow fastparquet

In [None]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import os
from datetime import datetime, timedelta
import numpy as np

# Create sample data
np.random.seed(42)
n_rows = 10000

df = pd.DataFrame({
    'transaction_id': range(1, n_rows + 1),
    'customer_id': np.random.randint(1, 1000, n_rows),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books'], n_rows),
    'amount': np.round(np.random.uniform(10, 500, n_rows), 2),
    'transaction_date': [datetime(2024, 1, 1) + timedelta(days=int(x)) for x in np.random.randint(0, 365, n_rows)],
    'is_online': np.random.choice([True, False], n_rows)
})

print(f"DataFrame shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
df.head()

### 4.1 Writing Parquet with Pandas

In [None]:
# Simple write with pandas (uses pyarrow by default)
output_dir = 'sample_data'
os.makedirs(output_dir, exist_ok=True)

# Basic write
df.to_parquet(f'{output_dir}/transactions_basic.parquet')

# Write with explicit compression codecs (snappy is already the pandas default)
df.to_parquet(f'{output_dir}/transactions_snappy.parquet', compression='snappy')
df.to_parquet(f'{output_dir}/transactions_gzip.parquet', compression='gzip')
df.to_parquet(f'{output_dir}/transactions_zstd.parquet', compression='zstd')

# Compare file sizes
for filename in os.listdir(output_dir):
    if filename.endswith('.parquet'):
        size = os.path.getsize(f'{output_dir}/{filename}')
        print(f"{filename}: {size:,} bytes ({size/1024:.1f} KB)")

### 4.2 Writing Parquet with PyArrow (More Control)

In [None]:
# Convert pandas DataFrame to PyArrow Table
table = pa.Table.from_pandas(df)

# Write with detailed options
pq.write_table(
    table,
    f'{output_dir}/transactions_advanced.parquet',
    compression='zstd',
    compression_level=3,  # Zstd compression level
    row_group_size=5000,  # Rows per row group
    use_dictionary=True,  # Dictionary encoding for strings
    write_statistics=True  # Column statistics for predicate pushdown
)

print("Advanced parquet file written successfully!")

# View schema
print(f"\nSchema:\n{table.schema}")

### 4.3 Partitioned Parquet (Common in Data Lakes)

In [None]:
# Add partition columns
df['year'] = df['transaction_date'].dt.year
df['month'] = df['transaction_date'].dt.month

# Write partitioned dataset
table = pa.Table.from_pandas(df)

pq.write_to_dataset(
    table,
    root_path=f'{output_dir}/transactions_partitioned',
    partition_cols=['year', 'month'],
    compression='snappy'
)

# Show partition structure
print("Partition structure created:")
for root, dirs, files in os.walk(f'{output_dir}/transactions_partitioned'):
    level = root.replace(f'{output_dir}/transactions_partitioned', '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = ' ' * 2 * (level + 1)
    for file in files[:2]:  # Show first 2 files per directory
        print(f"{subindent}{file}")

### 4.4 Reading Parquet Files

In [None]:
# Simple read with pandas
df_read = pd.read_parquet(f'{output_dir}/transactions_basic.parquet')
print(f"Read {len(df_read):,} rows")

# Read specific columns only (column pruning)
df_subset = pd.read_parquet(
    f'{output_dir}/transactions_basic.parquet',
    columns=['customer_id', 'amount', 'product_category']
)
print(f"\nSubset columns: {df_subset.columns.tolist()}")

# Read with filters (predicate pushdown with pyarrow)
df_filtered = pd.read_parquet(
    f'{output_dir}/transactions_basic.parquet',
    filters=[('product_category', '==', 'Electronics')]
)
print(f"\nFiltered rows (Electronics only): {len(df_filtered):,}")

In [None]:
# Read partitioned dataset
df_partitioned = pd.read_parquet(f'{output_dir}/transactions_partitioned')
print(f"Read partitioned dataset: {len(df_partitioned):,} rows")

# Read specific partition only
df_jan = pd.read_parquet(
    f'{output_dir}/transactions_partitioned',
    filters=[('year', '==', 2024), ('month', '==', 1)]
)
print(f"January 2024 only: {len(df_jan):,} rows")

### 4.5 Inspecting Parquet Metadata

In [None]:
# Read parquet file metadata (without loading data)
parquet_file = pq.ParquetFile(f'{output_dir}/transactions_advanced.parquet')

print("=== Parquet File Metadata ===")
print(f"Number of row groups: {parquet_file.metadata.num_row_groups}")
print(f"Number of columns: {parquet_file.metadata.num_columns}")
print(f"Number of rows: {parquet_file.metadata.num_rows:,}")
print(f"Created by: {parquet_file.metadata.created_by}")

print("\n=== Schema ===")
print(parquet_file.schema_arrow)

print("\n=== Row Group 0 Info ===")
row_group = parquet_file.metadata.row_group(0)
print(f"Rows: {row_group.num_rows:,}")
print(f"Total byte size: {row_group.total_byte_size:,}")

In [None]:
# Column-level statistics (useful for query optimization)
print("=== Column Statistics ===")
for i in range(parquet_file.metadata.num_columns):
    col = parquet_file.metadata.row_group(0).column(i)
    print(f"\nColumn: {col.path_in_schema}")
    print(f"  Compression: {col.compression}")
    print(f"  Encodings: {col.encodings}")
    print(f"  Total compressed size: {col.total_compressed_size:,} bytes")
    if col.statistics:
        print(f"  Null count: {col.statistics.null_count if col.statistics.has_null_count else 'N/A'}")
        print(f"  Distinct count: {col.statistics.distinct_count if col.statistics.has_distinct_count else 'N/A'}")

In [None]:
# Cleanup sample files
import shutil
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print("Sample data cleaned up.")

## 5. Format Selection Guide

Use this decision tree to select the right file format:

```
┌─────────────────────────────────────────────────────────────────┐
│                    FORMAT SELECTION GUIDE                       │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │  What is your primary use?    │
              └───────────────────────────────┘
                    │                │
            Analytics/BI      Streaming/Events
                    │                │
                    ▼                ▼
              ┌─────────┐      ┌─────────┐
              │ PARQUET │      │  AVRO   │
              └─────────┘      └─────────┘
                    │                │
                    ▼                ▼
              Best for:        Best for:
              • Data lakes     • Kafka events
              • Spark/Presto   • Schema registry
              • Athena/BQ      • Change data capture
```

### Quick Reference Matrix

| Requirement | Best Format | Compression |
|-------------|-------------|-------------|
| Data warehouse/lake storage | Parquet | Zstd |
| Hive-based analytics | ORC | Zlib |
| Kafka streaming | Avro | Snappy |
| API data exchange | JSON | Gzip |
| Legacy system integration | CSV | Gzip |
| Schema evolution needed | Avro | Snappy |
| Maximum query performance | Parquet | Snappy |
| Minimum storage cost | Parquet | Zstd (high) |
| Human debugging | JSON | None |

### Migration Path

Common pattern in data lakes:

```
Raw Zone (Landing)     →    Curated Zone (Processed)    →    Consumption Zone
─────────────────────       ───────────────────────         ──────────────────
JSON/CSV/Avro               Parquet (partitioned)            Parquet (optimized)
As-is from source           Cleaned, typed                   Aggregated, denormalized
```

## 6. Key Takeaways

### Format Selection

| Format | Choose When |
|--------|-------------|
| **Parquet** | Default for data lakes, analytics, and ML pipelines |
| **ORC** | Hive-centric environments with ACID requirements |
| **Avro** | Streaming with schema evolution (Kafka) |
| **JSON** | APIs, logs, or when human readability matters |
| **CSV** | Legacy systems or simple data exchange |

### Storage Layout

- **Columnar** (Parquet/ORC): Analytics queries, aggregations, column-specific access
- **Row-based** (Avro/JSON/CSV): Streaming, full-row access, simple writes

### Compression Strategy

- **Snappy/LZ4**: Interactive queries, low latency
- **Zstd**: Best overall balance (use level 3 as default)
- **Gzip**: Cold storage, archival data

### Best Practices

1. **Partition wisely** - Use date/time or other low-cardinality columns that queries filter on; high-cardinality keys lead to over-partitioning
2. **Right-size files** - Target 128 MB to 1 GB per file for optimal parallelism
3. **Enable statistics** - Helps query engines with predicate pushdown
4. **Use dictionary encoding** - Excellent for low-cardinality string columns
5. **Consider schema evolution** - Avro for frequent changes, Parquet for stability
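
Dictionary encoding (item 4) replaces each repeated value with a small integer index into a list of distinct values. The functions below are a minimal illustration of the idea, not the encoder that Parquet libraries actually use.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), where each index points into dictionary."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)  # first time this value is seen
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


categories = ["Electronics", "Clothing", "Electronics", "Food", "Clothing", "Electronics"]
dictionary, indices = dictionary_encode(categories)
print(dictionary)
print(indices)
```

The saving comes from storing each distinct string once; when nearly every value is unique, the dictionary is as large as the column and the encoding no longer helps.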

### Common Pitfalls

- ❌ Using CSV for large-scale analytics (slow, no types)
- ❌ Over-partitioning (too many small files)
- ❌ Ignoring compression (wastes storage and I/O)
- ❌ Using JSON for large datasets (verbose, slow)
- ❌ Not considering query patterns when choosing format
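
A quick back-of-the-envelope calculation shows how over-partitioning produces small files. The table size, the partition counts, and the `average_file_size_mb` helper are hypothetical, and the estimate assumes data is spread evenly across partitions.

```python
def average_file_size_mb(total_size_gb, partition_count, files_per_partition=1):
    """Estimate the average file size, assuming evenly distributed data."""
    total_files = partition_count * files_per_partition
    return total_size_gb * 1024 / total_files


# Hypothetical 100 GB table partitioned two different ways
by_day = average_file_size_mb(100, partition_count=365)
by_day_and_customer = average_file_size_mb(100, partition_count=365 * 1000)

print(f"by day:              {by_day:.0f} MB per file")
print(f"by day and customer: {by_day_and_customer:.2f} MB per file")
```

Partitioning by day alone keeps files within the 128 MB to 1 GB target, while adding a 1,000-value customer key pushes the average file well under 1 MB.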