diff --git a/why-parquet/index.qmd b/why-parquet/index.qmd
new file mode 100644
index 0000000..3dd0d78
--- /dev/null
+++ b/why-parquet/index.qmd
@@ -0,0 +1,235 @@
---
title: "Why Parquet"
description: |
  The file format used to store data directly impacts how the data can be
  used later. Parquet is a format that has very good compressed file sizes,
  is very fast to process and analyse, and is well integrated into the R
  and Python ecosystems.
date: "2025-03-10"
categories:
- database
- organise
---

## Context and problem statement

The core aim of Seedcase is to structure and organise data so that it is
modern, FAIR, and standardised, and can more easily be used and shared
later on. How we store the data directly affects how it can be used.
There is a large variety of file formats, so we need to decide on the one
that best fits our needs. So the question is:

*What file format should we use for storing data organised and
structured by Seedcase software?*

## Decision drivers

::: content-hidden
List some reasons for why we need to make this decision and what things
have arisen that impact work.
:::

- The format will mainly be used for storing data of various sizes.
- We don't do "transactional" data processing or data entry, so we don't
  need to worry as much about [ACID](https://en.wikipedia.org/wiki/ACID)
  compliance, nor about row-level write speeds.
- The format should store the schema within itself and be able to handle
  [schema evolution](https://en.wikipedia.org/wiki/Schema_evolution).
- It should be suitable for both local, single-node processing and
  remote, distributed processing.
- It should have good compressed file sizes.
- It should be relatively simple to use and understand for those doing
  data analysis, and integrate easily with software like R and Python.

## Considered options

Based on our needs, we considered the following options:

- [Parquet](https://parquet.apache.org/)
- [Avro](https://avro.apache.org/)
- [SQL (e.g. SQLite)](https://www.sqlite.org/)

Commonly used file formats that we don't consider are:

- CSV, JSON, and other text-based formats are some of the most commonly
  used file formats for data. However, their biggest disadvantage is that
  they don't store the data schema, nor do they compress well or perform
  well for eventual analysis, so we don't consider them.
- [ORC](https://orc.apache.org) is a format similar to Parquet, but it is
  used mostly within distributed computing and Big Data environments like
  [Hadoop](https://hadoop.apache.org/) or [Hive](https://hive.apache.org/)
  systems. We don't use these technologies, so we are not considering ORC.
- [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) is a
  common distributed file system used in Big Data environments, rather
  than a file format. We don't consider it because both Parquet and Avro
  are natively supported within HDFS systems, and because Hadoop is not a
  common use case for the problems we aim to solve.
- Non-embedded SQL engines (like [Postgres](https://www.postgresql.org/)
  or [MySQL](https://www.mysql.com/)), because they require a server to
  use and aren't stored as a single file on a local filesystem.
- Excel or similar spreadsheet programs, as they aren't technically open
  source, nor do they store any schema information.

### Parquet

[Parquet](https://parquet.apache.org) is a column-oriented binary file
format: data is stored column by column rather than row by row, as it is
in spreadsheet-style formats such as CSV. The format is optimized for
compression and storage as well as for very large-scale data processing
and analysis. It is designed to handle complex data structures and is
natively supported by many big data processing frameworks like
[Spark](https://spark.apache.org/) or [Hadoop](https://hadoop.apache.org/).

:::: columns
::: column
#### Benefits

- Designed for batch data processing, which is how most research data
  analysis is performed.
- Stores the schema within the format itself, and can also handle
  schema evolution.
- Of the options considered, it has the best compressed file sizes
  because it stores data by column, not by row, so similar values sit
  next to each other and compress very well.
- Integrates very easily into both the R and Python ecosystems for data
  analysis via popular packages like [arrow](https://arrow.apache.org/)
  (see the sketch below).
- Is natively supported by the powerful in-process analytical SQL engine
  [DuckDB](https://duckdb.org/), which is a great tool for data analysis.
- Has plugins for most common data processing tools, including SQLite.
- Can handle complex, nested data structures.
- Designed to handle datasets with many columns relative to the number
  of rows (though still lots of rows). Most research data falls under
  this category, for instance -omic type data, where there are often
  many hundreds of columns and maybe a few thousand rows.
:::

::: column
#### Drawbacks

- Not particularly well designed for inserting new rows into the
  dataset. This isn't strictly an issue for our purposes, since we
  don't do "transactional" data processing or entry.
- Is a binary format, so it is not human-readable. This is a drawback
  for people who want to quickly look at their data.
- Not very good at row-level scans or lookups, though this isn't a
  common use case in research data analysis.
:::
::::
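To make this concrete, below is a minimal sketch of a typical Parquet
workflow in Python. It assumes the pandas, pyarrow, and duckdb packages
are installed, and the file name and column names are made up for the
example; the same workflow is available in R through the arrow and
duckdb packages.

```python
import duckdb
import pandas as pd

# A small example table with a few rows and several columns, as is
# typical for research data.
data = pd.DataFrame({
    "participant_id": ["p01", "p02", "p03"],
    "age": [34, 56, 29],
    "glucose": [5.4, 6.1, 4.9],
})

# Write to Parquet; the schema is stored inside the file itself.
data.to_parquet("data.parquet")

# Read back only the columns needed for an analysis. Because data is
# stored column by column, the unused columns are never read.
subset = pd.read_parquet("data.parquet", columns=["participant_id", "glucose"])

# Query the file directly with DuckDB, without importing it first.
mean_glucose = duckdb.sql(
    "SELECT AVG(glucose) AS mean_glucose FROM 'data.parquet'"
).df()
print(mean_glucose)
```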
### Avro

[Avro](https://avro.apache.org/) is a row-oriented binary file format
that was designed as a compact and fast format for serializing data
within the Hadoop system. It is also designed as a data exchange format
for moving data between systems more effectively.

:::: columns
::: column
#### Benefits

- Can handle complex, nested data structures.
- Has the schema stored within the format itself (see the sketch below).
- Very good file size compression.
- Better at writing new rows to the dataset than Parquet.
- Has better schema evolution features than Parquet.
- For individual row lookups, Avro is faster than Parquet since it
  stores data by row, not by column.
:::

::: column
#### Drawbacks

- Not as fast as Parquet at reading data.
- Doesn't have as good compressed file sizes as Parquet because it
  stores data by row, not by column.
- Isn't as well integrated into common research analysis tools like R
  and Python.
:::
::::
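For comparison, here is a minimal sketch of writing and reading the same
kind of records with Avro in Python. It uses the third-party fastavro
package (not covered in this post) and made-up field names; note that the
schema is declared explicitly and stored in the file, and that records
are written row by row rather than column by column.

```python
from fastavro import parse_schema, reader, writer

# The schema is declared up front and stored inside the Avro file.
schema = parse_schema({
    "name": "Measurement",
    "type": "record",
    "fields": [
        {"name": "participant_id", "type": "string"},
        {"name": "glucose", "type": "double"},
    ],
})

# Records are written row by row, which suits appending and data
# exchange, but compresses less well than column-wise storage.
records = [
    {"participant_id": "p01", "glucose": 5.4},
    {"participant_id": "p02", "glucose": 6.1},
]

with open("data.avro", "wb") as out:
    writer(out, schema, records)

# Reading back iterates over the rows one at a time.
with open("data.avro", "rb") as infile:
    for record in reader(infile):
        print(record)
```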
### SQL (e.g. SQLite)

SQL is the standard language for relational databases used throughout
much of the internet, and [SQLite](https://sqlite.org) is a popular
embedded SQL database used in many applications. SQLite is file-based,
meaning the entire database is contained in a single file.

:::: columns
::: column
#### Benefits

- Is a well-established, classic relational, embedded SQL database
  format.
- Very fast at reading and writing data (though not as fast as Parquet
  or Avro for reading).
- Can handle unstructured data.
- Can integrate with a wide range of software tools.
- Great for row-level scans, insertions, and lookups. However, this
  isn't a common use case for our purposes.
:::

::: column
#### Drawbacks

- Is a relational database, which requires some knowledge of SQL to
  use.
- Is row-oriented, so it isn't as good a format for processing or
  analysing data compared to Parquet.
- Doesn't have as good compressed file sizes as Parquet or Avro.
- Doesn't integrate well with the [Frictionless Data
  Package](/why-frictionless-data/index.qmd) standard.
- While it can integrate with R and Python, it requires some additional
  knowledge and code compared to using a file format like Parquet.
:::
::::

## Decision outcome

We've decided to use the Parquet file format for storing our data, as it
best fits our needs: mainly storing and processing large amounts of data
for research purposes.

### Consequences

- Since Parquet is not a text-based format, it is not human-readable.
  This means that people who want to have a quick look at their data
  won't be able to do so directly. A consequence is that Parquet will
  not be ideal for some use cases, so people with smaller datasets may
  not use our solutions.

## Resources used for this post

::: content-hidden
List the resources used to write this post
:::

- [Big Data File Formats
  Explained](https://towardsdatascience.com/big-data-file-formats-explained-dfaabe9e8b33/)
- [The Architect’s Guide to Data and File
  Formats](https://thenewstack.io/the-architects-guide-to-data-and-file-formats/)
- [Big Data File Formats: A Comprehensive
  Guide](https://risingwave.com/blog/big-data-file-formats-a-comprehensive-guide/)
- [What are the pros and cons of the Apache Parquet format compared to
  other
  formats?](https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-the-apache-parquet-format-compared-to-other-format)
- [Performance comparison of different file formats and storage engines
  in the Hadoop
  ecosystem](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines)
- [Should you use
  Parquet?](https://blog.matthewrathbone.com/2019/12/20/parquet-or-bust.html)
- [CSV vs Parquet vs Avro: Choosing the Right Tool for the Right
  Job](https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8)
- [Avro vs.
  Parquet](https://www.snowflake.com/trending/avro-vs-parquet)