diff --git a/why-parquet/index.qmd b/why-parquet/index.qmd
new file mode 100644
index 0000000..3dd0d78
--- /dev/null
+++ b/why-parquet/index.qmd
@@ -0,0 +1,235 @@
---
title: "Why Parquet"
description: |
  The file format used to store data directly impacts how the data can be
  used later. Parquet is a format that has very good compressed file sizes,
  is very fast to process and analyse, and is well integrated into the R
  and Python ecosystems.
date: "2025-03-10"
categories:
- database
- organise
---

## Context and problem statement

The core aim of Seedcase is to structure and organise data so that it is
modern, FAIR, and standardised, and can more easily be used and shared
later on. How we store the data directly affects how it can be used.
There is a large variety of file formats, so we need to decide on the one
that best fits our needs. So the question is:

*What file format should we use for storing data organised and
structured by Seedcase software?*

## Decision drivers

::: content-hidden
List some reasons for why we need to make this decision and what things
have arisen that impact work.
:::

- The format will mainly be used for storing data of various sizes.
- We don't do "transactional" data processing or data entry, so we don't
  need to worry as much about [ACID](https://en.wikipedia.org/wiki/ACID)
  compliance, nor about row-level write speeds.
- The format should store the schema within itself and be able to handle
  [schema evolution](https://en.wikipedia.org/wiki/Schema_evolution).
- It should be suitable for both local, single-node processing and
  remote, distributed processing.
- It should have good compressed file sizes.
- It should be relatively simple to use and understand for those doing
  data analysis, and integrate easily with software like R and Python.

## Considered options

Based on our needs, we considered the following options:

- [Parquet](https://parquet.apache.org/)
- [Avro](https://avro.apache.org/)
- [SQL (e.g. SQLite)](https://www.sqlite.org/)

Commonly used file formats that we don't consider are:

- CSV, JSON, and other text-based formats are some of the most commonly
  used file formats for data. However, their biggest disadvantage is that
  they don't store the data schema, nor do they compress well or perform
  well for eventual analysis, so we don't consider them.
- [ORC](https://orc.apache.org) is a format similar to Parquet, but it is
  used mostly within distributed computing and Big Data environments like
  [Hadoop](https://hadoop.apache.org/) or [Hive](https://hive.apache.org/)
  systems. We don't use these technologies, so we are not considering ORC.
- [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) is a
  common distributed file system used in Big Data environments, rather
  than a file format. We don't consider it because both Parquet and Avro
  are natively supported within HDFS systems, and because Hadoop is not a
  common use case for the problems we aim to solve.
- Non-embedded SQL engines (like [Postgres](https://www.postgresql.org/)
  or [MySQL](https://www.mysql.com/)), because they require a server to
  use and aren't stored as a single file on a local filesystem.
- Excel or similar spreadsheet programs, as they aren't technically open
  source, nor do they store any schema information.

### Parquet

[Parquet](https://parquet.apache.org) is a column-oriented binary file
format: data is stored column by column rather than row by row, as it is
in spreadsheet-style formats such as CSV. The format is optimized for
compression and storage as well as for very large-scale data processing
and analysis. It is designed to handle complex data structures and is
natively supported by many big data processing frameworks like
[Spark](https://spark.apache.org/) or [Hadoop](https://hadoop.apache.org/).

:::: columns
::: column
#### Benefits

- Designed for batch data processing, which is how most research data
  analysis is performed.
- Stores the schema within the format itself, and can also handle
  schema evolution.
- Of the options considered, it has the best compressed file sizes
  because it stores data by column, not by row, so similar values sit
  next to each other and compress very well.
- Integrates very easily into both the R and Python ecosystems for data
  analysis via popular packages like [arrow](https://arrow.apache.org/)
  (see the sketch below).
- Is natively supported by the powerful in-process analytical SQL engine
  [DuckDB](https://duckdb.org/), which is a great tool for data analysis.
- Has plugins for most common data processing tools, including SQLite.
- Can handle complex, nested data structures.
- Designed to handle datasets with many columns relative to the number
  of rows (though still lots of rows). Most research data falls under
  this category, for instance -omic type data, where there are often
  many hundreds of columns and maybe a few thousand rows.
:::

::: column
#### Drawbacks

- Not particularly well designed for inserting new rows into the
  dataset. This isn't strictly an issue for our purposes, since we
  don't do "transactional" data processing or entry.
- Is a binary format, so it is not human-readable. This is a drawback
  for people who want to quickly look at their data.
- Not very good at row-level scans or lookups, though this isn't a
  common use case in research data analysis.
:::
::::
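To make this concrete, below is a minimal sketch of a typical Parquet
workflow in Python. It assumes the pandas, pyarrow, and duckdb packages
are installed, and the file name and column names are made up for the
example; the same workflow is available in R through the arrow and
duckdb packages.

```python
import duckdb
import pandas as pd

# A small example table with a few rows and several columns, as is
# typical for research data.
data = pd.DataFrame({
    "participant_id": ["p01", "p02", "p03"],
    "age": [34, 56, 29],
    "glucose": [5.4, 6.1, 4.9],
})

# Write to Parquet; the schema is stored inside the file itself.
data.to_parquet("data.parquet")

# Read back only the columns needed for an analysis. Because data is
# stored column by column, the unused columns are never read.
subset = pd.read_parquet("data.parquet", columns=["participant_id", "glucose"])

# Query the file directly with DuckDB, without importing it first.
mean_glucose = duckdb.sql(
    "SELECT AVG(glucose) AS mean_glucose FROM 'data.parquet'"
).df()
print(mean_glucose)
```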
### Avro

[Avro](https://avro.apache.org/) is a row-oriented binary file format
that was designed as a compact and fast format for serializing data
within the Hadoop system. It is also designed as a data exchange format
for moving data between systems more effectively.

:::: columns
::: column
#### Benefits

- Can handle complex, nested data structures.
- Has the schema stored within the format itself (see the sketch below).
- Very good file size compression.
- Better at writing new rows to the dataset than Parquet.
- Has better schema evolution features than Parquet.
- For individual row lookups, Avro is faster than Parquet since it
  stores data by row, not by column.
:::

::: column
#### Drawbacks

- Not as fast as Parquet at reading data.
- Doesn't have as good compressed file sizes as Parquet because it
  stores data by row, not by column.
- Isn't as well integrated into common research analysis tools like R
  and Python.
:::
::::
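For comparison, here is a minimal sketch of writing and reading the same
kind of records with Avro in Python. It uses the third-party fastavro
package (not covered in this post) and made-up field names; note that the
schema is declared explicitly and stored in the file, and that records
are written row by row rather than column by column.

```python
from fastavro import parse_schema, reader, writer

# The schema is declared up front and stored inside the Avro file.
schema = parse_schema({
    "name": "Measurement",
    "type": "record",
    "fields": [
        {"name": "participant_id", "type": "string"},
        {"name": "glucose", "type": "double"},
    ],
})

# Records are written row by row, which suits appending and data
# exchange, but compresses less well than column-wise storage.
records = [
    {"participant_id": "p01", "glucose": 5.4},
    {"participant_id": "p02", "glucose": 6.1},
]

with open("data.avro", "wb") as out:
    writer(out, schema, records)

# Reading back iterates over the rows one at a time.
with open("data.avro", "rb") as infile:
    for record in reader(infile):
        print(record)
```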
### SQL (e.g. SQLite)

SQL is the standard language for relational databases used throughout
much of the internet, and [SQLite](https://sqlite.org) is a popular
embedded SQL database used in many applications. SQLite is file-based,
meaning the entire database is contained in a single file.

:::: columns
::: column
#### Benefits

- Is a well-established, classic relational, embedded SQL database
  format.
- Very fast at reading and writing data (though not as fast as Parquet
  or Avro for reading).
- Can handle unstructured data.
- Can integrate with a wide range of software tools.
- Great for row-level scans, insertions, and lookups. However, this
  isn't a common use case for our purposes.
:::

::: column
#### Drawbacks

- Is a relational database, which requires some knowledge of SQL to
  use.
- Is row-oriented, so it isn't as good a format for processing or
  analysing data compared to Parquet.
- Doesn't have as good compressed file sizes as Parquet or Avro.
- Doesn't integrate well with the [Frictionless Data
  Package](/why-frictionless-data/index.qmd) standard.
- While it can integrate with R and Python, it requires some additional
  knowledge and code compared to using a file format like Parquet.
:::
::::

## Decision outcome

We've decided to use the Parquet file format for storing our data, as it
best fits our needs: mainly storing and processing large amounts of data
for research purposes.

### Consequences

- Since Parquet is not a text-based format, it is not human-readable.
  This means that people who want to have a quick look at their data
  won't be able to do so directly. A consequence is that Parquet will
  not be ideal for some use cases, so people with smaller datasets may
  not use our solutions.

## Resources used for this post

::: content-hidden
List the resources used to write this post
:::

- [Big Data File Formats
  Explained](https://towardsdatascience.com/big-data-file-formats-explained-dfaabe9e8b33/)
- [The Architect’s Guide to Data and File
  Formats](https://thenewstack.io/the-architects-guide-to-data-and-file-formats/)
- [Big Data File Formats: A Comprehensive
  Guide](https://risingwave.com/blog/big-data-file-formats-a-comprehensive-guide/)
- [What are the pros and cons of the Apache Parquet format compared to
  other
  formats?](https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-the-apache-parquet-format-compared-to-other-format)
- [Performance comparison of different file formats and storage engines
  in the Hadoop
  ecosystem](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines)
- [Should you use
  Parquet?](https://blog.matthewrathbone.com/2019/12/20/parquet-or-bust.html)
- [CSV vs Parquet vs Avro: Choosing the Right Tool for the Right
  Job](https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8)
- [Avro vs.
  Parquet](https://www.snowflake.com/trending/avro-vs-parquet)