A list of papers, articles, and online resources I have found essential to understanding data-intensive systems and building new data systems. The list is curated and maintained by Sujith Jay Nair (@sujithjay). If you think a paper should be part of this list, please submit a pull request. I will add it to the list once I have read the paper. Please make sure the subject matter of the paper falls within either i) understanding data systems, or ii) building data systems.
Data systems are defined to include:
- Database systems
- Data processing systems
This list is inspired by Reynold Xin's list on Database Readings, and is a work in progress.
-
Linearizability: A Correctness Condition for Concurrent Objects (1990): Defines linearizability as a correctness condition for a register, as opposed to serializability, which is a correctness condition for the higher-level abstraction of a transaction.
-
On Scalable and Efficient Distributed Failure Detectors (2001): The gossip-inspired failure detection protocol behind Dynamo-family databases. It establishes the optimum worst-case network load for a distributed failure detection scheme, and provides an algorithm for such an optimum scheme.
-
Paxos Made Simple (2001): The consensus protocol behind many distributed systems explained in plain English.
-
Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources (2018): Explains the design of the Calcite project, which is a distributed query parser & optimizer for heterogeneous data sources. Calcite is used in a host of data processing systems, such as Apache Flink, Apache Drill and others. This paper is particularly useful for understanding the concepts around query parsing (and transformation into relational algebra), query optimizations (such as predicate pushdown & column pruning), and logical & physical plan generation. It is worthwhile to compare and contrast this with the paper on Spark SQL (listed below). Although this paper came after the Spark SQL paper, the work predates it.
-
Spark SQL: Relational Data Processing in Spark (2015): Explains the design of a distributed relational processing system in Apache Spark.
-
How to Architect a Query Compiler, Revisited (2018): A study on how to design a query compiler from a query interpreter. There are places where the lack of foundational background might hamper your progress in reading this paper. For this, I would suggest skimming Query Evaluation Techniques for Large Databases as a primer. Also, I would suggest reading the HyPer paper (a part of this list as well) before reading this one.
-
Efficiently Compiling Efficient Query Plans for Modern Hardware (2011): Also known as the HyPer paper, this paper introduced data-centric query evaluation as an alternative to the traditional iterator-based approach.
- Data in Flight (2010): Introduces a model of streams as a superset of the relational model. Streams introduce a notion of time (processing-time, IMO) to the relational model. I explore a similar idea in this post. In a relational table, data is persistent and query is transient; in a stream, query is persistent and data is transient.
-
Dynamo: Amazon’s Highly Available Key-value Store (2007): This paper on Dynamo (not to be confused with DynamoDB, which is 'built on the principles of Dynamo') is an excellent primer on understanding concepts behind high-availability storage systems; concepts such as Consistent Hashing, Sloppy Quorum, Anti-entropy processes, and Gossip.
-
Cassandra - A Decentralized Structured Storage System (2009): Cassandra is one of many data storage systems heavily influenced by Dynamo. However, important differences exist. I have written about them in this post.
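
As a companion to the Dynamo entry above, here is a minimal sketch of consistent hashing with virtual nodes. It is illustrative only and is not taken from either paper: the `HashRing` class, the `vnodes` parameter, and the use of MD5 for ring positions are all assumptions made for this example. The point it tries to show is that removing a node only reassigns the keys that node owned.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes (a sketch, not Dynamo's code)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each physical node is placed at several positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        positions = [pos for pos, _ in self._ring]
        # First virtual node clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(200)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("node-b")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, every key in `moved` was previously owned by `node-b`; keys owned by the other two nodes keep their placement. `lookup` rebuilds the position list on each call to keep the code short; a real implementation would cache it.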