# Understanding Columnar Storage Formats: A Practical Guide

This notebook explores columnar storage formats, focusing on their advantages and practical applications in data engineering. Through hands-on examples using Polars and Apache Arrow, we'll show why columnar formats like Parquet are so widely used for analytical workloads.

**Author:** Data Engineering Team  
**Last Modified:** September 15, 2025

## Prerequisites
- Basic understanding of row-oriented formats (CSV, JSON)
- Familiarity with DataFrame libraries (Pandas or Polars)
- Basic Python programming knowledge

## Table of Contents
1. [Setup and Data Generation](#setup)
   - Library imports
   - Sample dataset creation
2. [CSV vs Parquet Format Comparison](#comparison)
   - File size analysis
   - Read performance
   - Memory usage patterns
3. [Column Pruning Performance](#pruning)
   - Column selection efficiency
   - I/O optimization demonstration
4. [Schema Evolution](#schema)
   - Adding/removing columns
   - Data type modifications

## Setup and Configuration

First, let's import the necessary libraries and set up our configuration. We'll be using:
- **Polars**: For high-performance data manipulation
- **pyarrow**: For Apache Arrow functionality and Parquet support
- **time**: For performance measurements
- **os**: For file operations
- **numpy**: For numerical operations and random data generation
- **datetime**: For building the date columns
- **psutil**: For process memory measurements (imported later, in the read-performance cell)

In [None]:
# Import required libraries
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import os
import time
from datetime import datetime, timedelta

# Configuration
SEED = 42
np.random.seed(SEED)

# File paths
DATA_DIR = "data"
os.makedirs(DATA_DIR, exist_ok=True)
CSV_PATH = os.path.join(DATA_DIR, "sample_data.csv")
PARQUET_PATH = os.path.join(DATA_DIR, "sample_data.parquet")

# Dataset parameters
N_ROWS = 100_000
N_COLS = 50

## Understanding Row vs Column-Oriented Storage

Before diving into the implementation, let's understand the fundamental difference between row and column-oriented storage using a simple analogy:

### The Phone Book Analogy 📱

Imagine you have two different versions of a phone book:

1. **Traditional Phone Book (Row-oriented)**
   - Each entry contains: (Name, Address, Phone Number)
   - Organized by complete records
   - Great for looking up all information about one person
   - Not efficient for finding "all phone numbers" or "all addresses"

2. **Specialized Index (Column-oriented)**
   - Separate lists for Names, Addresses, and Phone Numbers
   - Each type of data stored together
   - Perfect for questions like "list all phone numbers"
   - Better compression (similar data stored together)

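This is essentially how row-oriented formats (like CSV) and columnar formats (like Parquet) differ in storing data: CSV writes one complete record per line, while Parquet stores the values of each column together. Let's see this in practice!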
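
To make the analogy concrete, here is a minimal sketch of the same three records held in both layouts, using only plain Python containers. The names and phone numbers are made up for illustration, and real formats store bytes on disk rather than Python objects.

```python
# Three records from the phone book analogy (values are made up for illustration).
rows = [
    {"name": "Ada", "address": "1 Main St", "phone": "555-0100"},
    {"name": "Grace", "address": "2 Oak Ave", "phone": "555-0101"},
    {"name": "Linus", "address": "3 Pine Rd", "phone": "555-0102"},
]

# Row-oriented layout: one complete record after another.
row_layout = [tuple(record.values()) for record in rows]

# Column-oriented layout: one contiguous list per field.
column_layout = {field: [record[field] for record in rows] for field in rows[0]}

# "List all phone numbers" touches a single list in the columnar layout...
all_phones_columnar = column_layout["phone"]

# ...but has to visit every record in the row layout.
all_phones_rowwise = [record[2] for record in row_layout]

print(row_layout[0])
print(all_phones_columnar)
```

Both layouts hold the same information; what differs is which questions can be answered by touching a single contiguous piece of data.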

## Data Generation

Let's create a sample dataset with several data types to demonstrate the benefits of columnar storage. Our dataset will have:
- Integer columns
- Float columns
- Categorical columns (short string labels)
- DateTime columns

This variety will help us see how columnar storage handles different kinds of data.

In [None]:
def generate_sample_data(n_rows: int, n_cols: int) -> pl.DataFrame:
    """
    Generate a sample DataFrame with various data types.
    
    Args:
        n_rows: Number of rows to generate
        n_cols: Target number of columns, split evenly across 4 types and
            rounded down to a multiple of 4 (e.g. 50 yields 48 columns)
        
    Returns:
        pl.DataFrame: Generated sample data
    """
    # Calculate number of columns per type
    n_per_type = n_cols // 4  # 4 column types; any remainder is dropped
    
    # Generate numeric columns (integers)
    int_cols = {
        f"int_col_{i}": np.random.randint(0, 1000000, n_rows)
        for i in range(n_per_type)
    }
    
    # Generate float columns
    float_cols = {
        f"float_col_{i}": np.random.normal(0, 1, n_rows)
        for i in range(n_per_type)
    }
    
    # Generate categorical columns
    categories = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
    cat_cols = {
        f"cat_col_{i}": np.random.choice(categories, n_rows)
        for i in range(n_per_type)
    }
    
    # Generate datetime columns
    base_date = datetime(2023, 1, 1)
    date_cols = {
        f"date_col_{i}": [
            base_date + timedelta(days=np.random.randint(0, 365))
            for _ in range(n_rows)
        ]
        for i in range(n_per_type)
    }
    
    # Combine all columns
    data = {**int_cols, **float_cols, **cat_cols, **date_cols}
    
    # Create Polars DataFrame
    df = pl.DataFrame(data)
    return df

# Generate the sample dataset
print("Generating sample dataset...")
df = generate_sample_data(N_ROWS, N_COLS)
print(f"Generated dataset shape: {df.shape}")
print("\nDataset preview:")
df.head()

## CSV vs Parquet Format Comparison

Now that we have our sample dataset, let's compare how it behaves when stored in CSV (row-oriented) versus Parquet (columnar) format. We'll examine:

1. File sizes on disk
2. Read performance
3. Memory usage

We'll write our dataset to both formats and then analyze the differences.

In [None]:
# Helper function for file size formatting
def format_size(size_in_bytes):
    """Convert a size in bytes to a human-readable string."""
    for unit in ['B', 'KB', 'MB', 'GB']:
        # abs() keeps negative values (e.g. memory deltas) in the right unit
        if abs(size_in_bytes) < 1024:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1024
    # After four divisions by 1024 the value is expressed in TB
    return f"{size_in_bytes:.2f} TB"

# Write to CSV and Parquet
print("Writing files...")
df.write_csv(CSV_PATH)
df.write_parquet(PARQUET_PATH)

# Compare file sizes
csv_size = os.path.getsize(CSV_PATH)
parquet_size = os.path.getsize(PARQUET_PATH)

print("\nFile size comparison:")
print(f"CSV size: {format_size(csv_size)}")
print(f"Parquet size: {format_size(parquet_size)}")
print(f"Size ratio (CSV / Parquet): {csv_size / parquet_size:.2f}x")
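
Part of the size difference comes from column-level encodings, which work well because each column holds values of a single kind. The sketch below illustrates two such ideas, dictionary encoding and run-length encoding, in plain Python. It is a toy under simplifying assumptions: the helper names are hypothetical, and Parquet's actual encodings operate on binary pages and are considerably more involved.

```python
from itertools import groupby

def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]

# A small categorical column, similar in spirit to cat_col_* above.
column = ["A", "A", "A", "B", "B", "A", "C", "C", "C", "C"]

dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(dictionary)  # ['A', 'B', 'C']
print(codes)       # [0, 0, 0, 1, 1, 0, 2, 2, 2, 2]
print(runs)        # [(0, 3), (1, 2), (0, 1), (2, 4)]
```

Ten strings become a three-entry dictionary plus four runs. In a row-oriented layout the category values are interleaved with other fields, so runs like these never appear next to each other.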

### Read Performance Comparison

Let's compare how fast we can read data from both formats. We'll measure:
1. Time to read the entire dataset
2. Change in process memory while reading

Reading specific columns (column pruning) is covered in the next section.
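
A single timed run can be noisy because of OS file caching and background activity, so it may help to repeat each read a few times and look at the minimum or median. The helper below is a hypothetical sketch of that idea (`time_repeated` is not part of the notebook); the workload is a stand-in so the snippet runs on its own.

```python
import statistics
import time

def time_repeated(func, repeats=5):
    """Call func several times; return (last_result, durations_in_seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        durations.append(time.perf_counter() - start)
    return result, durations

# Stand-in workload; in this notebook it could be something like
# lambda: pl.read_parquet(PARQUET_PATH)
result, durations = time_repeated(lambda: sum(range(100_000)), repeats=5)

print(f"min:    {min(durations):.6f} s")
print(f"median: {statistics.median(durations):.6f} s")
```

The first run often includes one-off costs such as a cold file cache, which is why the minimum and the median can tell different stories.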

In [None]:
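import functools
import gc

import psutil  # third-party package: pip install psutil

def measure_read_time(func):
    """Decorator that prints wall-clock time and the change in process RSS."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Collect garbage first so leftovers from earlier cells skew the numbers less
        gc.collect()
        process = psutil.Process()
        
        # Resident set size (RSS) before the call
        start_mem = process.memory_info().rss
        
        # perf_counter() is a monotonic, high-resolution clock intended for timing
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        
        # RSS after the call; the delta is only a rough indicator and can be
        # zero or negative when the allocator reuses memory it already holds
        end_mem = process.memory_info().rss
        
        print(f"Time taken: {(end_time - start_time):.4f} seconds")
        print(f"RSS change: {format_size(end_mem - start_mem)}")
        return result
    return wrapper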

# Read complete files
print("Reading complete CSV file:")
@measure_read_time
def read_csv():
    return pl.read_csv(CSV_PATH)

csv_df = read_csv()

print("\nReading complete Parquet file:")
@measure_read_time
def read_parquet():
    return pl.read_parquet(PARQUET_PATH)

parquet_df = read_parquet()
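
Reading a CSV also means rebuilding types from text, whereas a Parquet file stores each column's type alongside the data. The sketch below shows the underlying issue with Python's standard `csv` module and a single made-up record (the field names are hypothetical). Readers such as `pl.read_csv` add type inference on top, but they are still working from strings.

```python
import csv
import io
from datetime import date

# One typed record (values are made up for illustration).
record = {"trip_id": 17, "fare": 12.5, "pickup_date": date(2023, 1, 1)}

# Write it as CSV into an in-memory buffer, then read it back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)

buffer.seek(0)
round_tripped = next(csv.DictReader(buffer))

print(round_tripped)
# Every value is now a string; the reader has to infer or be told the types.
print({name: type(value).__name__ for name, value in round_tripped.items()})
```

Comparing `csv_df.schema` with `parquet_df.schema` after the cell above is a quick way to see whether any column types changed on the CSV round trip.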

## Column Pruning Performance

One of the key advantages of columnar storage is its ability to read only the columns needed for a specific query. This is called "column pruning" and it can significantly improve performance for queries that only need a subset of columns.
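
The mechanism can be sketched with a toy file layout: each column is written as its own block, and a footer records where every block starts. A reader loads the footer first and then seeks directly to the blocks it needs. This is only loosely modeled on the idea of footer metadata in Parquet; the format, function names, and JSON encoding below are assumptions made for illustration.

```python
import io
import json
import struct

def write_toy_columnar(columns: dict) -> bytes:
    """Serialize each column as its own JSON block, followed by a footer index."""
    body = io.BytesIO()
    index = {}
    for name, values in columns.items():
        block = json.dumps(values).encode("utf-8")
        index[name] = {"offset": body.tell(), "length": len(block)}
        body.write(block)
    footer = json.dumps(index).encode("utf-8")
    body.write(footer)
    body.write(struct.pack("<I", len(footer)))  # last 4 bytes: footer length
    return body.getvalue()

def read_columns(data: bytes, wanted: list):
    """Read only the requested columns; return (columns, bytes_touched)."""
    stream = io.BytesIO(data)
    stream.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", stream.read(4))
    stream.seek(-4 - footer_len, io.SEEK_END)
    index = json.loads(stream.read(footer_len))
    bytes_touched = 4 + footer_len
    result = {}
    for name in wanted:
        entry = index[name]
        stream.seek(entry["offset"])  # jump straight to this column's block
        result[name] = json.loads(stream.read(entry["length"]))
        bytes_touched += entry["length"]
    return result, bytes_touched

table = {
    "id": list(range(1000)),
    "score": [i * 0.5 for i in range(1000)],
    "label": ["A", "B", "C", "D"] * 250,
}
data = write_toy_columnar(table)
subset, touched = read_columns(data, ["label"])
print(f"file size: {len(data)} bytes, bytes read for 'label': {touched}")
```

A CSV reader has no such index: the values of one column are spread across every line, so the whole file still has to be scanned even when most fields are discarded.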

Let's demonstrate this by reading different numbers of columns from both formats:

In [None]:
# Test column pruning with different numbers of columns
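# generate_sample_data rounds N_COLS down to a multiple of 4, so use the
# actual column count for the "all columns" case instead of N_COLS
column_counts = [1, 5, 10, 25, len(df.columns)]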

print("Testing column pruning performance:")
print("-" * 50)

for n_cols in column_counts:
    # Get a subset of columns
    columns = df.columns[:n_cols]
    
    print(f"\nReading {n_cols} columns:")
    
    print("\nFrom CSV:")
    @measure_read_time
    def read_csv_columns():
        return pl.read_csv(CSV_PATH, columns=columns)
    
    csv_subset = read_csv_columns()
    
    print("\nFrom Parquet:")
    @measure_read_time
    def read_parquet_columns():
        return pl.read_parquet(PARQUET_PATH, columns=columns)
    
    parquet_subset = read_parquet_columns()

## Schema Evolution

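Parquet files are self-describing: each file stores its own schema alongside the data. This makes it practical for a dataset's schema to evolve over time, because files written at different times can carry different columns. Typical changes include:
1. Adding new columns
2. Removing existing columns
3. Changing data types (with some restrictions)

The cells below write new files with the changed schema rather than modifying the original file: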

In [None]:
# 1. Adding a new column
df_new = df.with_columns([
    pl.lit("new_column").alias("extra_col")
])

# Write to new Parquet file
new_parquet_path = os.path.join(DATA_DIR, "evolved_schema.parquet")
df_new.write_parquet(new_parquet_path)

# Read and compare schemas (read_parquet_schema reads only the file metadata)
original_schema = pl.read_parquet_schema(PARQUET_PATH)
new_schema = pl.read_parquet_schema(new_parquet_path)

print("Original schema:")
print(original_schema)
print("\nNew schema (with added column):")
print(new_schema)

# 2. Reading specific columns (column pruning still works)
print("\nReading only original columns from new file:")
@measure_read_time
def read_original_columns():
    return pl.read_parquet(new_parquet_path, columns=df.columns)

original_cols_df = read_original_columns()

# 3. Type conversion: store a Float64 copy of an integer column under a new name
df_converted = df_new.with_columns([
    pl.col("int_col_0").cast(pl.Float64).alias("int_col_0_float")
])

converted_parquet_path = os.path.join(DATA_DIR, "converted_schema.parquet")
df_converted.write_parquet(converted_parquet_path)

# Show schema changes
print("\nSchema after type conversion:")
print(pl.read_parquet(converted_parquet_path).schema)

## Conclusion

Through this practical demonstration, we've seen the key advantages of columnar storage formats like Parquet:

1. **Smaller File Size:** Parquet files are typically much smaller than equivalent CSV files due to efficient encoding and compression.
2. **Better Query Performance:** Column pruning allows us to read only the data we need, resulting in faster query execution.
3. **Schema Flexibility:** Each Parquet file carries its own schema, so new files can add or cast columns while older files remain readable, which suits evolving datasets.
4. **Type Safety:** Parquet preserves data types and provides a self-describing schema.

For analytical workloads where you frequently:
- Query specific columns
- Need efficient storage
- Work with large datasets
- Require schema evolution support

columnar storage formats like Parquet are usually a better choice than row-based formats like CSV.