From 3e00de751997876792746ade9c3a5742f234d2b9 Mon Sep 17 00:00:00 2001
From: "Luke W. Johnston"
Date: Thu, 6 Mar 2025 10:28:22 +0100
Subject: [PATCH 1/4] docs: :construction: start post on why parquet

---
 why-parquet/index.qmd | 125 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 125 insertions(+)
 create mode 100644 why-parquet/index.qmd

diff --git a/why-parquet/index.qmd b/why-parquet/index.qmd
new file mode 100644
index 0000000..56cab74
--- /dev/null
+++ b/why-parquet/index.qmd
@@ -0,0 +1,125 @@
+---
+title: "Why Parquet"
+description: "The format for storing data directly impacts how it is used later. "
+date: "2025-03-03"
+categories:
+- database
+- organise
+---
+
+## Context and problem statement
+
+The way
+
+Storing for the data that will be used, shared etc
+
+## Decision drivers
+
+::: content-hidden
+List some reasons for why we need to make this decision and what things
+have arisen that impact work.
+:::
+
+- Mainly used for storing various sizes of data.
+- We don't do "transactional" data processing or data entry, so we don't need to worry about [ACID](https://en.wikipedia.org/wiki/ACID) compliance as much nor do we need to worry about row-level write speeds.
+- Stores the schema within the format itself, and can also handle [schema evolution]()
+- Suitable for both local, single-node processing and remote, distributed processing.
+- Has good compressed file sizes.
+- Is also relatively simple to use and understand for those doing data analysis, which means it can integrate easily with software like R and Python.
+
+## Considered options
+
+- [Parquet](https://parquet.apache.org/)
+- [Avro](https://avro.apache.org/)
+- [SQL (e.g. SQLite)](https://www.sqlite.org/)
+
+Not considered:
+
+- CSV, JSON, and other text-based formats are some of the most commonly used file formats for data. However, their biggest disadvantage is that they don't store the data schema nor are they good for file compression and eventual analysis. So we don't consider them.
+- [ORC](https://orc.apache.org) is a format that is similar to Parquet, but is used mostly within distributed computing and Big Data environments like in [Hadoop](https://hadoop.apache.org/) or [Hive](https://hive.apache.org/) systems. We don't use these technologies, so we are not considering ORC.
+- [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), which is a common file system used in Big Data environments. We don't consider it because both Parquet and Avro are natively supported with HDFS systems. We also don't consider it because Hadoop is not a common use case for our problems we aim to solution.
+
+### Parquet
+
+::: {.columns}
+::: {.column}
+#### Benefits
+
+- Designed for batch processing, which is how most research data analysis is performed.
+- Stores the schema within the format itself, and can also handle schema evolution.
+- Of the options considered, it has the best compressed file sizes because it stores data by column, not by row. What this means is that each line in the file format is a column, not a row unlike formats like CSV where each line in the file is a row.
+- Integrates very easily into both R and Python ecosystems for data analysis via popular packages like [arrow](https://arrow.apache.org/).
+- Is natively supported by the incredibly powerful in-memory SQL engine [DuckDB](https://duckdb.org/), which is a great tool for data analysis.
+- Has plugins for most common data processing tools, including SQLite.
+- Can handle unstructured data.
+- Designed to handle the situation of many columns relative to rows (though still lots of rows). Most research data falls under this category, for instance -omic type data, where there are often many hundreds of columns and maybe a few thousand rows.
+:::
+::: {.column}
+#### Drawbacks
+
+- Not particularly well designed for inserting new rows into the dataset. This isn't strictly an issue for our purposes, since we don't do "transactional" data processing.
+- Is in a binary format, so is not human-readable. This is a drawback for people who want to quickly look at their data.
+- Not very good at row-level scans or lookups, though this isn't a common use case in research data analysis.
+:::
+
+### Avro
+
+::: {.columns}
+::: {.column}
+#### Benefits
+
+- Can handle unstructured data.
+- Has the schema stored within the format itself.
+- Very good file size compression.
+- Better at writing new rows to the dataset than Parquet.
+- Has better schema evolution features than Parquet.
+- For individual row lookups, Avro is faster than Parquet since it stores data by row, not column.
+:::
+::: {.column}
+#### Drawbacks
+
+- Not as fast as Parquet at reading data.
+- Doesn't have as good compressed file sizes as Parquet because it stores data by row, not column.
+- Isn't as well integrated into common research analysis tools like R and Python.
+:::
+
+### SQL (e.g. SQLite)
+
+::: {.columns}
+::: {.column}
+#### Benefits
+
+- Is a well established, classic relational, embedded SQL database format.
+- Very fast at reading and writing data (though not as fast as Parquet or Avro for reading).
+- Can handle unstructured data.
+:::
+::: {.column}
+#### Drawbacks
+
+- Item 1
+:::
+:::
+
+## Decision outcome
+
+
+
+### Consequences
+
+- Since Parquet is not a text-based format, it is not human-readable. This means that for people wanting to have a quick look at their data won't be able to do that.
+A consequence will be that some use cases will not be ideal for Parquet, so people who have smaller datasets may not use our solutions.
+
+## Resources used for this post
+
+::: content-hidden
+List the resources used to write this post
+:::
+
+- https://towardsdatascience.com/big-data-file-formats-explained-dfaabe9e8b33/
+- https://thenewstack.io/the-architects-guide-to-data-and-file-formats/
+- https://risingwave.com/blog/big-data-file-formats-a-comprehensive-guide/
+- https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-the-apache-parquet-format-compared-to-other-format
+- https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
+- https://blog.matthewrathbone.com/2019/12/20/parquet-or-bust.html
+- https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8
+- https://www.snowflake.com/trending/avro-vs-parquet
" -date: "2025-03-03" +description: | + The format of the file for storing data directly impacts how it is used + later. Parquet is a format that has very good compressed file sizes, is very + fast to process and analyse, and is well integrated into the R and Python + ecosystems. +date: "2025-03-10" categories: - database - organise @@ -9,9 +13,14 @@ categories: ## Context and problem statement -The way +The core aim of Seedcase is to structure and organise data to be modern, +FAIR, and standardised so that it can be more easily used and shared +later on. How we store the data directly affects how it can be used +later on. There are a large variety of file formats, so we need to +decide on the best one for our needs. So the question is: -Storing for the data that will be used, shared etc +*What file format should we use for storing data organised and +structured by Seedcase software?* ## Decision drivers @@ -20,94 +29,183 @@ List some reasons for why we need to make this decision and what things have arisen that impact work. ::: -- Mainly used for storing various sizes of data. -- We don't do "transactional" data processing or data entry, so we don't need to worry about [ACID](https://en.wikipedia.org/wiki/ACID) compliance as much nor do we need to worry about row-level write speeds. -- Stores the schema within the format itself, and can also handle [schema evolution]() -- Suitable for both local, single-node processing and remote, distributed processing. -- Has good compressed file sizes. -- Is also relatively simple to use and understand for those doing data analysis, which means it can integrate easily with software like R and Python. +- Mainly used for storing various sizes of data. +- We don't do "transactional" data processing or data entry, so we + don't need to worry about [ACID](https://en.wikipedia.org/wiki/ACID) + compliance as much nor do we need to worry about row-level write + speeds. +- Stores the schema within the format itself, and can also handle + [schema evolution](https://en.wikipedia.org/wiki/Schema_evolution). +- Suitable for both local, single-node processing and remote, + distributed processing. +- Has good compressed file sizes. +- Is also relatively simple to use and understand for those doing data + analysis, which means it can integrate easily with software like R + and Python. ## Considered options -- [Parquet](https://parquet.apache.org/) -- [Avro](https://avro.apache.org/) -- [SQL (e.g. SQLite)](https://www.sqlite.org/) - -Not considered: - -- CSV, JSON, and other text-based formats are some of the most commonly used file formats for data. However, their biggest disadvantage is that they don't store the data schema nor are they good for file compression and eventual analysis. So we don't consider them. -- [ORC](https://orc.apache.org) is a format that is similar to Parquet, but is used mostly within distributed computing and Big Data environments like in [Hadoop](https://hadoop.apache.org/) or [Hive](https://hive.apache.org/) systems. We don't use these technologies, so we are not considering ORC. -- [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), which is a common file system used in Big Data environments. We don't consider it because both Parquet and Avro are natively supported with HDFS systems. We also don't consider it because Hadoop is not a common use case for our problems we aim to solution. 
 
 ## Considered options
 
-- [Parquet](https://parquet.apache.org/)
-- [Avro](https://avro.apache.org/)
-- [SQL (e.g. SQLite)](https://www.sqlite.org/)
-
-Not considered:
-
-- CSV, JSON, and other text-based formats are some of the most commonly used file formats for data. However, their biggest disadvantage is that they don't store the data schema nor are they good for file compression and eventual analysis. So we don't consider them.
-- [ORC](https://orc.apache.org) is a format that is similar to Parquet, but is used mostly within distributed computing and Big Data environments like in [Hadoop](https://hadoop.apache.org/) or [Hive](https://hive.apache.org/) systems. We don't use these technologies, so we are not considering ORC.
-- [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), which is a common file system used in Big Data environments. We don't consider it because both Parquet and Avro are natively supported with HDFS systems. We also don't consider it because Hadoop is not a common use case for our problems we aim to solution.
+Based on our needs, we considered the following options:
+
+- [Parquet](https://parquet.apache.org/)
+- [Avro](https://avro.apache.org/)
+- [SQL (e.g. SQLite)](https://www.sqlite.org/)
+
+Commonly used file formats that we don't consider are:
+
+- CSV, JSON, and other text-based formats are some of the most
+  commonly used file formats for data. However, their biggest
+  disadvantage is that they don't store the data schema nor are they
+  good for file compression and eventual analysis. So we don't
+  consider them.
+- [ORC](https://orc.apache.org) is a format that is similar to
+  Parquet, but is used mostly within distributed computing and Big
+  Data environments like in [Hadoop](https://hadoop.apache.org/) or
+  [Hive](https://hive.apache.org/) systems. We don't use these
+  technologies, so we are not considering ORC.
+- [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html),
+  which is a common file system used in Big Data environments. We
+  don't consider it because both Parquet and Avro are natively
+  supported within HDFS systems. We also don't consider it because
+  Hadoop is not a common use case for our problems we aim to solve.
+- Non-embedded SQL engines (like
+  [Postgres](https://www.postgresql.org/) or
+  [MySQL](https://www.mysql.com/)) because they require a server to
+  use and aren't stored as a file on a local filesystem.
+- Excel or similar spreadsheet programs, as they aren't technically
+  open source nor do
 
 ### Parquet
 
-::: {.columns}
-::: {.column}
+[Parquet](https://parquet.apache.org) is a column-oriented binary file
+format, where each line in the file represents a "column" of data rather
+than each line being a "row" of a data as is seen with spreadsheet
+structured data formats. This Parquet file format is optimized for
+compression and storage as well as very large-scale data processing and
+analysis. It is designed to handle complex data structures and is
+natively supported by many big data processing frameworks like
+[Spark](https://spark.apache.org/) or
+[Hadoop](https://hadoop.apache.org/).
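+
+As a rough sketch of what working with this format looks like in
+practice (assuming the `pyarrow` Python package is installed; the file
+name and columns below are made up for illustration):
+
+``` python
+# A minimal sketch: write a small table to a Parquet file, then read
+# back just one column. Reading a subset of columns is cheap because
+# the data is laid out column by column. The file and column names
+# are stand-ins, not part of any Seedcase tool.
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+table = pa.table({
+    "participant_id": [1, 2, 3],
+    "glucose": [5.1, 6.2, 5.8],
+})
+pq.write_table(table, "example.parquet")
+
+# Only the requested column is read from disk.
+glucose = pq.read_table("example.parquet", columns=["glucose"])
+print(glucose)
+```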
+
+:::::::::::: columns
+::: column
 #### Benefits
 
-- Designed for batch processing, which is how most research data analysis is performed.
-- Stores the schema within the format itself, and can also handle schema evolution.
-- Of the options considered, it has the best compressed file sizes because it stores data by column, not by row. What this means is that each line in the file format is a column, not a row unlike formats like CSV where each line in the file is a row.
-- Integrates very easily into both R and Python ecosystems for data analysis via popular packages like [arrow](https://arrow.apache.org/).
-- Is natively supported by the incredibly powerful in-memory SQL engine [DuckDB](https://duckdb.org/), which is a great tool for data analysis.
-- Has plugins for most common data processing tools, including SQLite.
-- Can handle unstructured data.
-- Designed to handle the situation of many columns relative to rows (though still lots of rows). Most research data falls under this category, for instance -omic type data, where there are often many hundreds of columns and maybe a few thousand rows.
+- Designed for batch data processing, which is how most research data
+  analysis is performed.
+- Stores the schema within the format itself, and can also handle
+  schema evolution.
+- Of the options considered, it has the best compressed file sizes
+  because it stores data by column, not by row. What this means is
+  that each line in the file format is a column, not a row unlike
+  formats like CSV where each line in the file is a row.
+- Integrates very easily into both R and Python ecosystems for data
+  analysis via popular packages like
+  [arrow](https://arrow.apache.org/).
+- Is natively supported by the incredibly powerful in-memory SQL
+  engine [DuckDB](https://duckdb.org/), which is a great tool for data
+  analysis.
+- Has plugins for most common data processing tools, including SQLite.
+- Can handle unstructured data.
+- Designed to handle the situation of many columns relative to rows
+  (though still lots of rows). Most research data falls under this
+  category, for instance -omic type data, where there are often many
+  hundreds of columns and maybe a few thousand rows.
 :::
+
+::: column
 #### Drawbacks
 
-- Not particularly well designed for inserting new rows into the dataset. This isn't strictly an issue for our purposes, since we don't do "transactional" data processing.
-- Is in a binary format, so is not human-readable. This is a drawback for people who want to quickly look at their data.
-- Not very good at row-level scans or lookups, though this isn't a common use case in research data analysis.
+- Not particularly well designed for inserting new rows into the
+  dataset. This isn't strictly an issue for our purposes, since we
+  don't do "transactional" data processing or entry.
+- Is in a binary format, so is not human-readable. This is a drawback
+  for people who want to quickly look at their data.
+- Not very good at row-level scans or lookups, though this isn't a
+  common use case in research data analysis.
 :::
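+
+To make the DuckDB point above concrete, here is a rough sketch
+(assuming the `duckdb` Python package is installed; the file path and
+column carry over from the earlier made-up example):
+
+``` python
+# A minimal sketch: run SQL directly against a Parquet file with
+# DuckDB, without importing it into a database first. The file and
+# column names are stand-ins for illustration only.
+import duckdb
+
+result = duckdb.sql(
+    "SELECT AVG(glucose) AS mean_glucose FROM 'example.parquet'"
+)
+print(result)
+```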
 
 ### Avro
 
-::: {.columns}
-::: {.column}
+[Avro](https://avro.apache.org/) is a row-oriented binary file format
+that was designed to be a compact and fast format for serializing data
+within the Hadoop system. It is also designed to be a data exchange
+format to more effectively move data between systems.
+
+::::::::: columns
+::: column
 #### Benefits
 
-- Can handle unstructured data.
-- Has the schema stored within the format itself.
-- Very good file size compression.
-- Better at writing new rows to the dataset than Parquet.
-- Has better schema evolution features than Parquet.
-- For individual row lookups, Avro is faster than Parquet since it stores data by row, not column.
+- Can handle unstructured data.
+- Has the schema stored within the format itself.
+- Very good file size compression.
+- Better at writing new rows to the dataset than Parquet.
+- Has better schema evolution features than Parquet.
+- For individual row lookups, Avro is faster than Parquet since it
+  stores data by row, not column.
 :::
+
+::: column
 #### Drawbacks
 
-- Not as fast as Parquet at reading data.
-- Doesn't have as good compressed file sizes as Parquet because it stores data by row, not column.
-- Isn't as well integrated into common research analysis tools like R and Python.
+- Not as fast as Parquet at reading data.
+- Doesn't have as good compressed file sizes as Parquet because it
+  stores data by row, not column.
+- Isn't as well integrated into common research analysis tools like R
+  and Python.
 :::
 
 ### SQL (e.g. SQLite)
 
-::: {.columns}
-::: {.column}
+SQL is a standard language for querying and managing relational
+databases, and [SQLite](https://sqlite.org) is a popular embedded SQL
+database format that is used in many applications. SQLite is a
+file-based database format, which means that it is a single file that
+contains the entire database.
+
+::::: columns
+::: column
 #### Benefits
 
-- Is a well established, classic relational, embedded SQL database format.
-- Very fast at reading and writing data (though not as fast as Parquet or Avro for reading).
-- Can handle unstructured data.
+- Is a well-established, classic relational, embedded SQL database
+  format.
+- Very fast at reading and writing data (though not as fast as Parquet
+  or Avro for reading).
+- Can handle unstructured data.
+- Can integrate with a wide range of software tools.
+- Great for row-level scans, insertions, and lookups. However, this
+  isn't a common use case for our purposes.
 :::
+
+::: column
 #### Drawbacks
 
-- Item 1
+- Is a relational database, which requires some knowledge of SQL to
+  use.
+- It is row-oriented, so isn't as good a format for processing or
+  analysing data compared to Parquet.
+- Doesn't have as good compressed file sizes as Parquet or Avro.
+- Doesn't integrate well with the [Frictionless Data
+  Package](/why-frictionless-data/index.qmd) standard.
+- While it can integrate with R and Python, it does require some
+  additional knowledge and code compared to using a file format like
+  Parquet.
 :::
-:::
+:::::
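+
+For comparison with the Parquet sketches above, here is a minimal
+sketch of the same kind of task using Python's built-in `sqlite3`
+module (the file and table names are stand-ins):
+
+``` python
+# A minimal sketch: store and query rows with SQLite through the
+# Python standard library. Note the extra setup (schema definition,
+# inserts, commit) compared to reading a Parquet file directly.
+import sqlite3
+
+con = sqlite3.connect("example.sqlite")
+con.execute(
+    "CREATE TABLE IF NOT EXISTS measurements (id INTEGER, glucose REAL)"
+)
+con.executemany(
+    "INSERT INTO measurements VALUES (?, ?)",
+    [(1, 5.1), (2, 6.2), (3, 5.8)],
+)
+con.commit()
+print(con.execute("SELECT AVG(glucose) FROM measurements").fetchone())
+con.close()
+```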
 
 ## Decision outcome
 
-
+We've decided to use the Parquet file format for storing our data as it
+is the best file format for our needs, which are mainly for storing and
+processing large amounts of data for research purposes.
 
 ### Consequences
 
-- Since Parquet is not a text-based format, it is not human-readable. This means that for people wanting to have a quick look at their data won't be able to do that.
-A consequence will be that some use cases will not be ideal for Parquet, so people who have smaller datasets may not use our solutions.
+- Since Parquet is not a text-based format, it is not human-readable.
+  This means that for people wanting to have a quick look at their
+  data won't be able to do that. A consequence will be that some use
+  cases will not be ideal for Parquet, so people who have smaller
+  datasets may not use our solutions.
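+
+One way to soften this drawback is to document a one-liner for taking
+that quick look. A rough sketch (assuming the `pyarrow` package is
+installed; the file name is a stand-in):
+
+``` python
+# A minimal sketch: print the first few rows of a Parquet file, as a
+# stand-in for opening it in a text editor or spreadsheet program.
+import pyarrow.parquet as pq
+
+table = pq.read_table("example.parquet")
+print(table.slice(0, 5))  # shows the schema plus the first five rows
+```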
 
 ## Resources used for this post
 
 ::: content-hidden
 List the resources used to write this post
 :::
 
-- https://towardsdatascience.com/big-data-file-formats-explained-dfaabe9e8b33/
-- https://thenewstack.io/the-architects-guide-to-data-and-file-formats/
-- https://risingwave.com/blog/big-data-file-formats-a-comprehensive-guide/
-- https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-the-apache-parquet-format-compared-to-other-format
-- https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
-- https://blog.matthewrathbone.com/2019/12/20/parquet-or-bust.html
-- https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8
-- https://www.snowflake.com/trending/avro-vs-parquet
+- [Big Data File Formats
+  Explained](https://towardsdatascience.com/big-data-file-formats-explained-dfaabe9e8b33/)
+- [The Architect’s Guide to Data and File
+  Formats](https://thenewstack.io/the-architects-guide-to-data-and-file-formats/)
+- [Big Data File Formats: A Comprehensive
+  Guide](https://risingwave.com/blog/big-data-file-formats-a-comprehensive-guide/)
+- [What are the pros and cons of the Apache Parquet format compared to
+  other
+  formats?](https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-the-apache-parquet-format-compared-to-other-format)
+- [Performance comparison of different file formats and storage
+  engines in the Hadoop
+  ecosystem](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines)
+- [Should you use
+  Parquet?](https://blog.matthewrathbone.com/2019/12/20/parquet-or-bust.html)
+- [CSV vs Parquet vs Avro: Choosing the Right Tool for the Right
+  Job](https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8)
+- [Avro vs.
+  Parquet](https://www.snowflake.com/trending/avro-vs-parquet)
+:::::::::
+::::::::::::

From 93d2bf126bbfd7a2037df26a1eec426a1bde87f2 Mon Sep 17 00:00:00 2001
From: "Luke W. Johnston"
Date: Thu, 13 Mar 2025 09:41:23 +0100
Subject: [PATCH 3/4] docs: :pencil2: forgot to finish sentence

---
 why-parquet/index.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/why-parquet/index.qmd b/why-parquet/index.qmd
index 9e682c4..9465a96 100644
--- a/why-parquet/index.qmd
+++ b/why-parquet/index.qmd
@@ -73,7 +73,7 @@ Commonly used file formats that we don't consider are:
   [MySQL](https://www.mysql.com/)) because they require a server to
   use and aren't stored as a file on a local filesystem.
 - Excel or similar spreadsheet programs, as they aren't technically
-  open source nor do
+  open source nor do they store any schema information.
 
 ### Parquet

From 2a043a922156f7ab66703cd4057c94bb9641c328 Mon Sep 17 00:00:00 2001
From: "Luke W. Johnston"
Date: Thu, 13 Mar 2025 10:27:47 +0100
Subject: [PATCH 4/4] docs: :pencil2: proofreading edits

Co-authored-by: martonvago <57952344+martonvago@users.noreply.github.com>
---
 why-parquet/index.qmd | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/why-parquet/index.qmd b/why-parquet/index.qmd
index 9465a96..3dd0d78 100644
--- a/why-parquet/index.qmd
+++ b/why-parquet/index.qmd
@@ -16,7 +16,7 @@ categories:
 The core aim of Seedcase is to structure and organise data to be modern,
 FAIR, and standardised so that it can be more easily used and shared
 later on. How we store the data directly affects how it can be used
-later on. There are a large variety of file formats, so we need to
+later on. There is a large variety of file formats, so we need to
 decide on the best one for our needs. So the question is:
 
 *What file format should we use for storing data organised and
@@ -67,7 +67,7 @@ Commonly used file formats that we don't consider are:
   which is a common file system used in Big Data environments. We
   don't consider it because both Parquet and Avro are natively
   supported within HDFS systems. We also don't consider it because
-  Hadoop is not a common use case for our problems we aim to solve.
+  Hadoop is not a common use case for the problems we aim to solve.
 - Non-embedded SQL engines (like
   [Postgres](https://www.postgresql.org/) or
   [MySQL](https://www.mysql.com/)) because they require a server to
@@ -79,7 +79,7 @@ Commonly used file formats that we don't consider are:
 [Parquet](https://parquet.apache.org) is a column-oriented binary file
 format, where each line in the file represents a "column" of data rather
-than each line being a "row" of a data as is seen with spreadsheet
+than each line being a "row" of data as is seen with spreadsheet
 structured data formats. This Parquet file format is optimized for
 compression and storage as well as very large-scale data processing and
 analysis. It is designed to handle complex data structures and is
@@ -97,7 +97,7 @@ natively supported by many big data processing frameworks like
   schema evolution.
 - Of the options considered, it has the best compressed file sizes
   because it stores data by column, not by row. What this means is
-  that each line in the file format is a column, not a row unlike
+  that each line in the file format is a column, not a row, unlike
   formats like CSV where each line in the file is a row.
 - Integrates very easily into both R and Python ecosystems for data
   analysis via popular packages like
@@ -196,13 +196,13 @@ contains the entire database.
 ## Decision outcome
 
 We've decided to use the Parquet file format for storing our data as it
-is the best file format for our needs, which are mainly for storing and
+is the best file format for our needs, which are mainly storing and
 processing large amounts of data for research purposes.
 
 ### Consequences
 
 - Since Parquet is not a text-based format, it is not human-readable.
-  This means that for people wanting to have a quick look at their
+  This means that people wanting to have a quick look at their
   data won't be able to do that. A consequence will be that some use
   cases will not be ideal for Parquet, so people who have smaller
   datasets may not use our solutions.