Reading List in Data Systems

A list of papers, articles, and online resources I have found essential to understanding data-intensive systems and building new data systems. The list is curated and maintained by Sujith Jay Nair (@sujithjay). If you think a paper should be part of this list, please submit a pull request. I will add it once I have read the paper. Please make sure the subject matter of the paper falls within either i) understanding data systems, or ii) building data systems.

Data systems are defined to include:

  • Database systems
  • Data processing systems

This list is inspired by Reynold Xin's list on Database Readings, and is a work in progress.

Table of Contents

  1. Consistency and Consensus
  2. Query Processing
  3. State and Stream
  4. Database Design

Consistency and Consensus

Query Processing

State and Stream

  • Data in Flight (2010): Introduces a model of streams as a superset of the relational model. Streams add a notion of time (processing-time, IMO) to the relational model. I explore a similar idea in this post. In a relational table, the data is persistent and the query is transient; in a stream, the query is persistent and the data is transient.
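The table/stream duality in the last sentence can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the article: the names `query_table`, `standing_query`, and `arrivals`, and the sample rows, are all made up for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def query_table(table: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Transient query over persistent data: it runs once against stored rows."""
    return [row for row in table if predicate(row)]


def standing_query(stream: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Persistent query over transient data: each row is seen once and never stored."""
    for row in stream:
        if predicate(row):
            yield row


def is_large(row: Row) -> bool:
    return row["amount"] > 10


# Persistent data: the table outlives any single query.
table = [{"user": "a", "amount": 5}, {"user": "b", "amount": 50}]


def arrivals() -> Iterator[Row]:
    """Hypothetical source of rows arriving over time."""
    yield {"user": "c", "amount": 70}
    yield {"user": "d", "amount": 3}
    yield {"user": "e", "amount": 12}


if __name__ == "__main__":
    print(query_table(table, is_large))
    for match in standing_query(arrivals(), is_large):
        print(match)
```

The generator stays "installed" for as long as rows keep arriving, while the list comprehension finishes as soon as it has scanned the stored table.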

Database Design

  • Dynamo: Amazon’s Highly Available Key-value Store (2007): This paper on Dynamo (not to be confused with DynamoDB, which is 'built on the principles of Dynamo') is an excellent primer on the concepts behind high-availability storage systems, such as consistent hashing, sloppy quorums, anti-entropy processes, and gossip.

  • Cassandra - A Decentralized Structured Storage System (2009): Cassandra is one of many data storage systems heavily influenced by Dynamo. However, important differences exist. I have written about them in this post.
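As a rough illustration of consistent hashing, one of the concepts both papers rely on, here is a minimal Python sketch of a hash ring with virtual nodes. It is not taken from either paper: the class name `HashRing`, the `vnodes` parameter, the use of SHA-256, and the node names are assumptions made for the example.

```python
import bisect
import hashlib
from typing import List


class HashRing:
    """A toy consistent-hash ring; each physical node owns several tokens."""

    def __init__(self, nodes: List[str], vnodes: int = 8) -> None:
        # Sorted list of (token, node) pairs forming the ring.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # Hash function chosen for the sketch; any uniform hash would do.
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def preference_list(self, key: str, n: int = 3) -> List[str]:
        """Walk clockwise from the key's position and collect n distinct nodes."""
        start = bisect.bisect_right(self._tokens, self._hash(key))
        result: List[str] = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
    print(ring.preference_list("user:42"))
```

The property worth noticing is that removing a node which is not among a key's replicas leaves that key's replica list unchanged, so membership changes only move a small share of the keys.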
