
1 4 1 #64

Merged
rchillyard merged 6 commits into master from 1_4_1 on Mar 28, 2026

Conversation

@rchillyard (Owner)

Deferred items for Parquet

- Rename Analysis case class → ColumnStatistics (the result, not the actor)
- Introduce sealed Analyzer trait with CsvAnalyzer and ParquetAnalyzer
- Extract CSV analysis logic into CsvAnalyzer.analyze()
- Add Analysis factory methods:
  * Analysis(table: RawTable) — backward-compatible primary path
  * Analysis.forCsv(path: Path) — parse and analyze CSV files
  * Analysis.forParquet[Row](path: Path) — analyze Parquet (impl TBD)
- Preserve backward compatibility via @deprecated alias
- Enables schema-driven analysis for Parquet without row materialization
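The refactor described above might be sketched roughly as follows. The names `ColumnStatistics`, `Analyzer`, `CsvAnalyzer`, and the `Analysis` factory follow the bullet points, but the row representation (`Seq[Map[String, Option[String]]]`) and the statistics fields are illustrative placeholders, not the library's actual types:

```scala
// Hypothetical sketch of the Analyzer hierarchy; real signatures may differ.
case class ColumnStatistics(rowCount: Long, nullCount: Long)

sealed trait Analyzer {
  def analyze(): Map[String, ColumnStatistics]
}

// CSV analysis works on fully materialized rows.
case class CsvAnalyzer(rows: Seq[Map[String, Option[String]]]) extends Analyzer {
  def analyze(): Map[String, ColumnStatistics] = {
    val columns = rows.flatMap(_.keys).distinct
    columns.map { c =>
      val values = rows.map(_.getOrElse(c, None))
      c -> ColumnStatistics(rows.size.toLong, values.count(_.isEmpty).toLong)
    }.toMap
  }
}

object Analysis {
  // Backward-compatible primary path: analyze an in-memory table directly.
  def apply(rows: Seq[Map[String, Option[String]]]): Map[String, ColumnStatistics] =
    CsvAnalyzer(rows).analyze()
}
```

A `ParquetAnalyzer` case of the sealed trait would analyze schema metadata without materializing rows, which is the point of the split.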
- Refactor core Analysis into Analyzer trait hierarchy (CsvAnalyzer, sealed trait)
- Introduce ColumnStatisticsProvider trait for pluggable column statistics
- Move ParquetAnalyzer to parquet module; zero Parquet imports in core
- Add ParquetColumnStatisticsProvider for efficient single-column analysis
- Provide implicit analyzer factory and provider via parquet package object
- Analysis.forParquet[Row](path) and Column.statisticsFrom(path, col) now work with implicits
- Lazy statistics: schema analysis is fast (metadata only); row stats computed on demand
- Add comprehensive unit tests for schema analysis and on-demand statistics
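The pluggable-provider pattern described above can be illustrated with a minimal sketch. `ColumnStatisticsProvider` and the implicit-resolution style of `Column.statisticsFrom` come from the bullets; the `Statistics` payload and the in-memory source type are stand-ins for whatever the library actually uses:

```scala
// Hedged sketch: a type-class-style provider resolved implicitly, so core
// code needs no Parquet imports; the parquet package object would supply
// its own instance for Parquet paths.
case class Statistics(min: Double, max: Double, count: Long)

trait ColumnStatisticsProvider[S] {
  def statisticsFor(source: S, column: String): Option[Statistics]
}

object Column {
  // Whichever provider is in implicit scope handles the source type.
  def statisticsFrom[S](source: S, column: String)(
      implicit p: ColumnStatisticsProvider[S]): Option[Statistics] =
    p.statisticsFor(source, column)
}

// Example provider over an in-memory table standing in for a real backend.
implicit val inMemoryProvider: ColumnStatisticsProvider[Map[String, Seq[Double]]] =
  (source, column) =>
    source.get(column).filter(_.nonEmpty).map { xs =>
      Statistics(xs.min, xs.max, xs.size.toLong)
    }
```

The design keeps the core module Parquet-agnostic: swapping backends is just a matter of importing a different implicit instance.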
- Add MaybeStatistics sealed trait with EagerStatistics and LazyStatistics cases
- EagerStatistics: computed upfront (from Parquet metadata or CSV row scan)
- LazyStatistics: deferred as thunk () => Option[Statistics] for on-demand evaluation
- Update Column to use Option[MaybeStatistics] instead of Option[Statistics]
- ParquetColumnStatisticsProvider supports metadata-first with lazy fallback
  * useMetadataOnly=true (default): return None if metadata unavailable
  * useMetadataOnly=false: return LazyStatistics thunk for row scan
- extractMetadataStatistics stubbed for future Parquet metadata extraction
- CSV analysis continues to use eager statistics (no change to behavior)
- Add comprehensive tests covering eager, lazy, and metadata-only modes
- Enables fast schema analysis with deferred statistics computation
- Adjust Cats and ZIO tests to accommodate the new statistics types
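The eager/lazy split above can be sketched in a few lines. The `MaybeStatistics`, `EagerStatistics`, and `LazyStatistics` names come from the bullets; the single-field `Statistics` payload is a placeholder:

```scala
// Minimal sketch of the MaybeStatistics hierarchy described above.
case class Statistics(count: Long)

sealed trait MaybeStatistics {
  def get: Option[Statistics]
}

// Computed upfront, e.g. from Parquet footer metadata or a CSV row scan.
case class EagerStatistics(statistics: Statistics) extends MaybeStatistics {
  def get: Option[Statistics] = Some(statistics)
}

// Deferred as a thunk; the expensive row scan runs only when get is first
// called, and the lazy val memoizes the result thereafter.
case class LazyStatistics(thunk: () => Option[Statistics]) extends MaybeStatistics {
  lazy val get: Option[Statistics] = thunk()
}
```

With `Column` holding `Option[MaybeStatistics]`, schema-only analysis never forces the thunk, which is what makes it fast.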
…et APIs

- Add parseParquetDataset() and parseDataset() factory methods for multi-part datasets
- Add Analysis.forParquetDataset[Row](path) for schema analysis on directories
- Implement ParquetDatasetAnalyzer for polymorphic dataset handling
- Handle both single files and dataset directories transparently:
  * Read schema from _metadata if present, else from first part-*.parquet file
  * Sum row counts across all part files for accurate dataset statistics
- Add SingleFileAnalyzerFactory and DatasetAnalyzerFactory marker traits in core
- Parquet module implements factories; core remains parquet-agnostic
- Add comprehensive tests for dataset parsing, analysis, and error cases
- Validate paths: parseParquet rejects directories, parseDataset rejects files
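The file-vs-directory dispatch above might look like the following sketch. A real implementation would read schemas via parquet-hadoop; here `schemaSourceFor` (a hypothetical helper) only decides which file the schema would be read from, and `requireFile` mirrors the validation bullet:

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Sketch of the dataset-vs-file dispatch: prefer _metadata if present,
// else fall back to the first part-*.parquet file in the directory.
def schemaSourceFor(path: Path): Path =
  if (Files.isDirectory(path)) {
    val entries =
      Files.list(path).iterator().asScala.toList.sortBy(_.getFileName.toString)
    entries.find(_.getFileName.toString == "_metadata")
      .orElse(entries.find(_.getFileName.toString.matches("part-.*\\.parquet")))
      .getOrElse(throw new IllegalArgumentException(s"no parquet parts in $path"))
  } else path

// Validation mirroring the bullet: parseParquet rejects directories
// (and parseDataset would symmetrically reject single files).
def requireFile(path: Path): Path = {
  require(!Files.isDirectory(path), s"$path is a directory; use parseDataset")
  path
}
```

Summing row counts across part files would then be a fold over the same directory listing.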
- Extend CsvRenderers and CsvGenerators traits from arity 13 to 19
- Add bare-type and Option instances for Float, Short, Byte, Instant, Temporal, and Option[Long]
- Wire YellowTaxiTrip companion with renderer19/generator19
- Demonstrate full Parquet→CSV pipeline: ParquetParser → Table[YellowTaxiTrip] → CsvTableFileRenderer
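The real pipeline relies on the library's `renderer19`/`generator19` instances for the 19-field `YellowTaxiTrip`. As a standalone stand-in, this sketch renders any case class row via `productIterator`, flattening `Option` fields the way a CSV renderer typically would (the `Trip` class is a hypothetical three-field stand-in, not the actual trip schema):

```scala
// Simplified CSV row rendering: None becomes an empty cell, Some is unwrapped.
def renderCsvRow(row: Product): String =
  row.productIterator.map {
    case Some(v) => v.toString
    case None    => ""
    case v       => v.toString
  }.mkString(",")

// Hypothetical miniature of a trip record for illustration only.
case class Trip(vendorId: Int, distance: Double, tip: Option[Double])
```

The library's typeclass-derived renderers do the same per-field work, but with per-type instances (hence the arity-19 extension) rather than reflection over `Product`.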
Remove Codacy complaints
@rchillyard rchillyard merged commit 49f74b3 into master Mar 28, 2026
2 checks passed
