# CSV to Columnar Format Conversion

This notebook shows how to convert CSV files to Apache Parquet and Feather, using vroom-csv to parse the CSV and PyArrow to write the output.
Both are columnar formats, with different strengths:

- **Parquet**: Highly compressed, optimized for storage and analytics
- **Feather**: Fast read/write, ideal for intermediate files and interprocess communication
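
To make "columnar" concrete, here is a minimal sketch using only the standard library. The sample data and variable names are illustrative assumptions, not part of the notebook's dataset. It turns row-oriented CSV records into one list per field, which is the layout that Parquet and Feather store on disk.

```python
import csv
import io

# Hypothetical sample data, used only for this illustration.
sample = "id,name,age\n1,Alice,30\n2,Bob,25\n3,Charlie,35\n"

# Row-oriented view: one dict per record, which is how CSV lays data out.
rows = list(csv.DictReader(io.StringIO(sample)))

# Column-oriented view: one list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregating a single field only touches that field's list,
# not every field of every record.
ages = [int(value) for value in columns["age"]]
print(sum(ages) / len(ages))
```

Storing similar values next to each other is also what lets columnar formats compress well and skip columns a query does not need.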

In [None]:
import vroom_csv
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as feather
import tempfile
import os

# Create sample CSV data
csv_content = """id,name,age,salary,department,active
1,Alice,30,75000.50,Engineering,true
2,Bob,25,82000.00,Marketing,false
3,Charlie,35,68000.25,Engineering,true
4,Diana,28,91000.75,Sales,true
5,Eve,32,85000.00,Marketing,false
6,Frank,29,77500.25,Engineering,true
7,Grace,31,88000.00,Sales,true
8,Henry,27,72000.50,Marketing,false
"""

temp_dir = tempfile.mkdtemp()
csv_path = os.path.join(temp_dir, "employees.csv")

with open(csv_path, "w") as f:
    f.write(csv_content)

print(f"Created CSV at: {csv_path}")

## Read CSV with vroom-csv

First, let's read the CSV file using vroom-csv's high-performance parser with automatic type inference.

In [None]:
# Read CSV with type inference
table = vroom_csv.read_csv(csv_path)

# Convert to PyArrow Table (zero-copy)
arrow_table = pa.table(table)

print("Schema (inferred types):")
print(arrow_table.schema)
print(f"\nRows: {arrow_table.num_rows}")
print(f"Columns: {arrow_table.num_columns}")
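
For intuition about what type inference involves, the following is a simplified sketch in plain Python. It is not vroom-csv's actual algorithm; `infer_type` is a hypothetical helper, and the candidate types and their order are assumptions chosen for illustration. The idea is to try the narrowest type first and fall back to string.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of bool, int64, double, or string that fits every value."""
    if not values:
        return "string"

    def all_parse(parser):
        # True only if the parser accepts every value in the column.
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all(value.lower() in ("true", "false") for value in values):
        return "bool"
    if all_parse(int):
        return "int64"
    if all_parse(float):
        return "double"
    return "string"


# Hypothetical sample in the same shape as the notebook's CSV.
sample = "id,name,salary,active\n1,Alice,75000.50,true\n2,Bob,82000.00,false\n"
parsed = list(csv.reader(io.StringIO(sample)))
header, records = parsed[0], parsed[1:]

inferred = {
    name: infer_type([record[i] for record in records])
    for i, name in enumerate(header)
}
print(inferred)
```

A production parser also has to handle nulls, dates, and quoting, and usually samples rows rather than scanning the whole file, so treat this only as a mental model when reading the schema printed above.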

## Convert to Parquet

Parquet is ideal for:
- Long-term storage
- Analytics workloads (e.g., Spark, BigQuery)
- Efficient column pruning and predicate pushdown

In [None]:
parquet_path = os.path.join(temp_dir, "employees.parquet")

# Write to Parquet with Snappy compression (default)
pq.write_table(arrow_table, parquet_path, compression='snappy')

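# Compare file sizes. On a file this small, Parquet's footer and per-column
# metadata outweigh any compression savings, so expect a ratio below 1.0.
csv_size = os.path.getsize(csv_path)
parquet_size = os.path.getsize(parquet_path)

print(f"CSV size: {csv_size:,} bytes")
print(f"Parquet size: {parquet_size:,} bytes")
print(f"CSV / Parquet size ratio: {csv_size / parquet_size:.2f}x")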

In [None]:
# Read back the Parquet file to verify
parquet_table = pq.read_table(parquet_path)

print("Data from Parquet:")
print(parquet_table.to_pandas())

### Parquet Compression Options

Parquet supports several compression codecs with different tradeoffs:

In [None]:
compression_options = ['none', 'snappy', 'gzip', 'zstd', 'lz4']

print("Compression comparison:")
print("-" * 50)

for codec in compression_options:
    path = os.path.join(temp_dir, f"employees_{codec}.parquet")
    pq.write_table(arrow_table, path, compression=codec)
    size = os.path.getsize(path)
    # The ratio is relative to the CSV size; on this 8-row file it mostly reflects metadata overhead
    print(f"{codec:12} : {size:6,} bytes ({csv_size / size:5.2f}x vs. CSV)")

## Convert to Feather

Feather (Arrow IPC format) is ideal for:
- Fast temporary files
- Interprocess communication
- Intermediate results in data pipelines

In [None]:
feather_path = os.path.join(temp_dir, "employees.feather")

# Write to Feather (V2, the Arrow IPC file format).
# By default PyArrow compresses V2 files with LZ4 when that codec is available.
feather.write_feather(arrow_table, feather_path)

feather_size = os.path.getsize(feather_path)
print(f"Feather size: {feather_size:,} bytes")
print("\nNote: Feather prioritizes read/write speed over compression ratio")

In [None]:
# Read back the Feather file
feather_table = feather.read_table(feather_path)

print("Data from Feather:")
print(feather_table.to_pandas())

## Working with Large Files

For large CSV files, vroom-csv's SIMD-accelerated parser reduces parse time, and columnar output reduces file size. The cells below generate a 100,000-row file and measure both on your machine.

In [None]:
import time
import random

# Generate a larger test file
large_csv_path = os.path.join(temp_dir, "large_data.csv")
n_rows = 100000

with open(large_csv_path, "w") as f:
    f.write("id,value1,value2,value3,category\n")
    for i in range(n_rows):
        f.write(f"{i},{random.random():.4f},{random.randint(0, 1000)},{random.random() * 100:.2f},cat_{i % 10}\n")

print(f"Generated {n_rows:,} rows")
print(f"CSV size: {os.path.getsize(large_csv_path):,} bytes")

In [None]:
# Time CSV parsing and conversion
# perf_counter() is a monotonic, high-resolution clock intended for measuring intervals
start = time.perf_counter()
table = vroom_csv.read_csv(large_csv_path)
arrow_table = pa.table(table)
parse_time = time.perf_counter() - start

print(f"CSV parse time: {parse_time:.3f}s")

# Time Parquet write
parquet_path = os.path.join(temp_dir, "large_data.parquet")
start = time.perf_counter()
pq.write_table(arrow_table, parquet_path, compression='snappy')
write_time = time.perf_counter() - start

print(f"Parquet write time: {write_time:.3f}s")
print(f"Total CSV -> Parquet: {parse_time + write_time:.3f}s")

# Compare sizes
csv_size = os.path.getsize(large_csv_path)
parquet_size = os.path.getsize(parquet_path)
print(f"\nCompression: {csv_size:,} -> {parquet_size:,} bytes ({csv_size / parquet_size:.2f}x)")

## Using Polars for Conversion

Polars can ingest Arrow data through `pl.from_arrow` and write Parquet or Feather (Arrow IPC) files directly, so it can stand in for PyArrow in the output step.

In [None]:
import polars as pl

# Read with vroom-csv, convert with Polars
table = vroom_csv.read_csv(csv_path)
df = pl.from_arrow(table)

# Write to Parquet with Polars
polars_parquet_path = os.path.join(temp_dir, "polars_output.parquet")
df.write_parquet(polars_parquet_path, compression='zstd')

print(f"Written {os.path.getsize(polars_parquet_path):,} bytes via Polars")

## Cleanup

In [None]:
import shutil
shutil.rmtree(temp_dir)
print("Cleaned up temporary files.")
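
As an alternative to calling `shutil.rmtree` by hand, `tempfile.TemporaryDirectory` removes the directory automatically, even if a cell raises partway through. This is a minimal sketch with illustrative file names; it suits a script or a single cell better than a multi-cell notebook, because the directory only lives inside the `with` block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch_dir:
    scratch_csv = os.path.join(scratch_dir, "employees.csv")
    with open(scratch_csv, "w", encoding="utf-8") as f:
        f.write("id,name\n1,Alice\n")
    # The file exists for as long as the with-block is active.
    existed_inside = os.path.exists(scratch_csv)

# Leaving the block deletes the directory and everything in it.
removed_after = not os.path.exists(scratch_dir)
print(existed_inside, removed_after)
```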