Merged
- Rename Analysis case class → ColumnStatistics (the result, not the actor)
- Introduce sealed Analyzer trait with CsvAnalyzer and ParquetAnalyzer
- Extract CSV analysis logic into CsvAnalyzer.analyze()
- Add Analysis factory methods:
  * Analysis(table: RawTable): backward-compatible primary path
  * Analysis.forCsv(path: Path): parse and analyze CSV files
  * Analysis.forParquet[Row](path: Path): analyze Parquet (impl TBD)
- Preserve backward compatibility via @deprecated alias
- Enables schema-driven analysis for Parquet without row materialization
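The split between the actor (Analyzer) and the result (ColumnStatistics) described above might look roughly like the following. This is a minimal sketch, not the code from this PR: the fields of `ColumnStatistics`, the shape of `RawTable`, the type parameter on `Analyzer`, and the numeric/text classification are all assumptions made for illustration, and `ParquetAnalyzer` is omitted because its implementation is still TBD.

```scala
object AnalyzerSketch {
  // Hypothetical stand-in for the library's RawTable: a header plus rows of cells.
  final case class RawTable(header: Seq[String], rows: Seq[Seq[String]])

  // Hypothetical, simplified result type; the real ColumnStatistics fields are not shown in the PR.
  final case class ColumnStatistics(rows: Int, columns: Map[String, String])

  // Sealed, so every analyzer is declared in this file and matches can be checked exhaustively.
  sealed trait Analyzer[-Source] {
    def analyze(source: Source): ColumnStatistics
  }

  // CSV analysis lives in its own analyzer rather than in the result class.
  object CsvAnalyzer extends Analyzer[RawTable] {
    def analyze(table: RawTable): ColumnStatistics = {
      val kinds = table.header.zipWithIndex.map { case (name, i) =>
        // A missing cell is treated as an empty string (an assumption of this sketch).
        val cells = table.rows.map(_.lift(i).getOrElse(""))
        val kind =
          if (cells.nonEmpty && cells.forall(_.toDoubleOption.isDefined)) "numeric"
          else "text"
        name -> kind
      }
      ColumnStatistics(table.rows.size, kinds.toMap)
    }
  }

  // Factory kept under the old name so existing call sites keep compiling.
  object Analysis {
    def apply(table: RawTable): ColumnStatistics = CsvAnalyzer.analyze(table)
  }
}
```

With this shape, `AnalyzerSketch.Analysis(table)` continues to work as the primary path while delegating to `CsvAnalyzer.analyze`.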
- Refactor core Analysis into Analyzer trait hierarchy (CsvAnalyzer, sealed trait)
- Introduce ColumnStatisticsProvider trait for pluggable column statistics
- Move ParquetAnalyzer to parquet module; zero Parquet imports in core
- Add ParquetColumnStatisticsProvider for efficient single-column analysis
- Provide implicit analyzer factory and provider via parquet package object
- Analysis.forParquet[Row](path) and Column.statisticsFrom(path, col) now work with implicits
- Lazy statistics: schema analysis is fast (metadata only); row stats computed on demand
- Add comprehensive unit tests for schema analysis and on-demand statistics
- Add MaybeStatistics sealed trait with EagerStatistics and LazyStatistics cases
- EagerStatistics: computed upfront (from Parquet metadata or CSV row scan)
- LazyStatistics: deferred as thunk () => Option[Statistics] for on-demand evaluation
- Update Column to use Option[MaybeStatistics] instead of Option[Statistics]
- ParquetColumnStatisticsProvider supports metadata-first with lazy fallback
  * useMetadataOnly=true (default): return None if metadata unavailable
  * useMetadataOnly=false: return LazyStatistics thunk for row scan
- extractMetadataStatistics stubbed for future Parquet metadata extraction
- CSV analysis continues to use eager statistics (no change to behavior)
- Add comprehensive tests covering eager, lazy, and metadata-only modes
- Enables fast schema analysis with deferred statistics computation
- Adjustments to Cats and ZIO tests to accommodate the new statistics types.
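The eager/lazy split and the metadata-first fallback described above can be sketched as follows. This is an illustration under stated assumptions rather than the PR's code: the fields of `Statistics`, the `force` accessor, and the `statisticsFor` helper with its parameters are hypothetical names introduced here; only `MaybeStatistics`, `EagerStatistics`, `LazyStatistics`, the thunk type, and the `useMetadataOnly` flag come from the commit message.

```scala
object MaybeStatisticsSketch {
  // Hypothetical, simplified statistics payload.
  final case class Statistics(min: Double, max: Double, mean: Double)

  sealed trait MaybeStatistics {
    // Hypothetical accessor: returns the statistics, computing them if necessary.
    def force: Option[Statistics]
  }

  // Already computed, e.g. from Parquet metadata or a CSV row scan.
  final case class EagerStatistics(statistics: Statistics) extends MaybeStatistics {
    def force: Option[Statistics] = Some(statistics)
  }

  // Deferred: the thunk runs at most once, on first access, because force is a lazy val.
  final case class LazyStatistics(thunk: () => Option[Statistics]) extends MaybeStatistics {
    lazy val force: Option[Statistics] = thunk()
  }

  // Metadata-first lookup with an optional lazy fallback to a row scan.
  def statisticsFor(
      metadata: Option[Statistics],
      scan: () => Option[Statistics],
      useMetadataOnly: Boolean = true
  ): Option[MaybeStatistics] =
    metadata match {
      case Some(s)                 => Some(EagerStatistics(s))
      case None if useMetadataOnly => None
      case None                    => Some(LazyStatistics(scan))
    }
}
```

Because the row scan is wrapped in a thunk, building a `Column` with `Option[MaybeStatistics]` stays cheap; the scan cost is paid only by callers that actually ask for the values.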
…et APIs
- Add parseParquetDataset() and parseDataset() factory methods for multi-part datasets
- Add Analysis.forParquetDataset[Row](path) for schema analysis on directories
- Implement ParquetDatasetAnalyzer for polymorphic dataset handling
- Handle both single files and dataset directories transparently:
  * Read schema from _metadata if present, else from first part-*.parquet file
  * Sum row counts across all part files for accurate dataset statistics
- Add SingleFileAnalyzerFactory and DatasetAnalyzerFactory marker traits in core
- Parquet module implements factories; core remains parquet-agnostic
- Add comprehensive tests for dataset parsing, analysis, and error cases
- Validate paths: parseParquet rejects directories, parseDataset rejects files
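The schema-source rule and the row-count summation described above can be sketched with plain `java.nio.file` calls, leaving the Parquet reading itself out. The names `DatasetLayoutSketch`, `partFiles`, `schemaSource`, and `totalRows` are hypothetical, as is the `rowCount` function parameter that stands in for reading a file's footer. Sorting part files by name to pick the "first" one is also an assumption of this sketch. It targets Scala 2.13 or later.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object DatasetLayoutSketch {
  // All part-*.parquet files directly inside dir, sorted by file name.
  def partFiles(dir: Path): List[Path] = {
    val stream = Files.list(dir)
    try {
      stream.iterator().asScala
        .filter { p =>
          val name = p.getFileName.toString
          name.startsWith("part-") && name.endsWith(".parquet")
        }
        .toList
        .sortBy(_.getFileName.toString)
    } finally stream.close() // Files.list holds a directory handle until closed
  }

  // Which file to read the schema from: the file itself, _metadata, or the first part file.
  def schemaSource(path: Path): Either[String, Path] =
    if (!Files.exists(path)) Left(s"path does not exist: $path")
    else if (Files.isRegularFile(path)) Right(path)
    else {
      val metadata = path.resolve("_metadata")
      if (Files.isRegularFile(metadata)) Right(metadata)
      else partFiles(path).headOption.toRight(s"no part-*.parquet files in $path")
    }

  // Total rows: one file's count, or the sum over every part file of a dataset directory.
  def totalRows(path: Path, rowCount: Path => Long): Long =
    if (Files.isDirectory(path)) partFiles(path).map(rowCount).sum
    else rowCount(path)
}
```

Returning `Either` keeps the "no part files" and "missing path" cases explicit, which matches the commit's emphasis on testing error cases, though the real API may report errors differently.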
- Extend CsvRenderers and CsvGenerators traits from arity 13 to 19
- Add bare-type and Option instances for Float, Short, Byte, Instant, Temporal, and Option[Long]
- Wire YellowTaxiTrip companion with renderer19/generator19
- Demonstrate full Parquet→CSV pipeline: ParquetParser → Table[YellowTaxiTrip] → CsvTableFileRenderer
Remove Codacy complaints
Deferred items for Parquet