# Big Data: Principles and best practices of scalable real-time data systems

This is a summary of the first two chapters of "Big Data: Principles and best practices of scalable real-time data systems" by Nathan Marz and James Warren.
    

## Chapter 1

*I thought it was important to include these definitions at the beginning, since these terms are used repeatedly throughout the two chapters.*

**What is data?** 

Data can be considered the rawest information you have: information you hold to be true simply because it exists. This is the definition the book uses for its own explanatory purposes, although most of the time people think of data as any type of information.

**What does a data system do?** 

A data system answers questions based on information that was acquired in the past up to the present. A general-purpose definition for a data system is the following: `query = function(all data)`.
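The definition above can be read quite literally. Here is a minimal sketch, assuming a tiny in-memory dataset; the names `run_query`, `pageviews`, and `count_home_views` are hypothetical and are not taken from the book.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_query(function: Callable[[Iterable[T]], R], all_data: Iterable[T]) -> R:
    """Answer a query by applying a function to the complete dataset."""
    return function(all_data)


# Hypothetical raw data: one record per pageview.
pageviews = [
    {"url": "/home", "user": "alice"},
    {"url": "/about", "user": "bob"},
    {"url": "/home", "user": "carol"},
]


def count_home_views(data):
    """One possible query: how many pageviews did /home receive?"""
    return sum(1 for record in data if record["url"] == "/home")


home_views = run_query(count_home_views, pageviews)
```

Running a function over all the data for every query is too slow at scale, which is why the rest of the chapter is about precomputing views.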

**What is big data?**

This is a definition provided by the Oxford dictionary: "Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions." Many other sites define 'big data' as the amount of information needed to be managed by non-traditional data systems. 
    
*This is exactly where the first chapter of the book starts, explaining how NoSQL systems were introduced as a consequence of traditional data systems being unscalable to larger amounts of data.*

**Who were the pioneers in creating big data systems?**

Google pioneered many Big Data systems (distributed filesystems, the MapReduce computation framework, distributed locking services). Amazon was also a pioneer, creating Dynamo (a distributed key/value store). Open source contributors introduced other technologies such as Hadoop, HBase, MongoDB, Cassandra and RabbitMQ.

*Traditional data systems are not meant to be used with large amounts of data that are also continuously being updated. The book walks through an example that hits the limits of traditional database technologies, which makes it clear why it is necessary to work with an adequate type of data system from the beginning.*

**What problems can arise when working with an RDBMS once the amount of data surpasses the capacity of the database system?**

It can start with timeout problems, which then lead to creating worker processes and to horizontal partitioning (sharding). Many other things can go wrong, such as the disks on the database machines failing or bugs being deployed through human error. The work never stops: there are always more shards, replicas, and resharding scripts to deal with. This is a good example of how complexity grows when a database built on a traditional database system gets bigger. All of these problems could be avoided by working with NoSQL technologies instead, for the reasons explained in the next question.


**What are some things guaranteed by NoSQL technologies that traditional systems are not able to cope with?**

In Big Data systems, distributed operations such as sharding and replication are internalized: the databases and computation systems rebalance the data automatically, and the developer only has to add new nodes. They also rely on immutability: storing the raw pageview information, rather than a value that is overwritten, provides much stronger human-fault tolerance.

*Even though NoSQL technologies are a more suitable way of addressing Big Data problems, there is no guarantee that people won't face any problems when working with them.*

**Why is NoSQL not the solution for all difficulties?**

Scalable data systems can be good at working with large amounts of data, but each tool brings its own specific problems. For example, Cassandra offers a limited data model that is mutable and can be very complex to work with, while Hadoop computations have high latency.

*The book describes the desired properties of a Big Data system. It explains each of them but only gives an overview of how to achieve them, since more detailed explanations follow in later chapters.*

**If robustness is about avoiding complexities to make it easy to reason about the system, in which way can it be achieved?**

It can be achieved by building immutability and recomputation into the core of the Big Data system, so the system can be innately resilient to human error by providing a clear and simple mechanism for recovery.
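To illustrate recovery through recomputation, here is a minimal sketch, assuming an append-only list of raw pageview events. The event records and both functions are hypothetical examples, not code from the book.

```python
from collections import Counter

# Append-only log of raw pageview events (hypothetical records).
raw_pageviews = [
    ("/home", "alice"),
    ("/home", "bob"),
    ("/about", "alice"),
]


def compute_counts(events):
    """Recompute the pageview counts from scratch from the raw events."""
    return Counter(url for url, _user in events)


def buggy_compute_counts(events):
    """A deployed bug: every pageview is counted twice."""
    return Counter({url: 2 * n for url, n in compute_counts(events).items()})


# The bug corrupts the derived view, but not the raw data.
corrupted_view = buggy_compute_counts(raw_pageviews)

# Recovery: the raw events were never overwritten, so rerun the fixed code.
recovered_view = compute_counts(raw_pageviews)
```

Had the system stored only a mutable counter, the doubled values would have overwritten the correct ones and there would be nothing to recompute from.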

**Is achieving low latency always the goal for any data system?**

Some applications require updates to propagate immediately, but in other applications a latency of a few hours is fine. Regardless, you need to be able to achieve low latency updates when you need them in your Big Data systems.

**What is one of the tricks of Lambda Architecture to avoid complexity and therefore minimize maintenance in systems?**

To push complexity out of the core components and into pieces of the system whose outputs are discardable after a few hours.

**What is the key for making the debuggability of any data system easier?**

To be able to trace, for each value in the system, exactly what caused it to have that value. It is accomplished in the Lambda Architecture through the functional nature of the batch layer and by preferring to use recomputation algorithms when possible. 

*The next questions are all about the problems with fully incremental architectures*

**Why is it so hard for people to realize the complexities of working with fully incremental architectures?**

It has nothing to do with relational versus non-relational data systems. Fully incremental architectures are more fundamental than that, since the majority of database deployments have been built this way for decades. The reason is that the complexities have become familiar, which explains why people do not even think about avoiding them.

**How can online compaction be conducted to avoid problems like server lockups and even a possible cascading failure?**

To manage compaction correctly, you have to schedule compactions on each node so that not too many nodes are affected at once. It is also important to be aware of the disk and cluster capacities so they don't become overloaded when resources are lost during compactions. In Lambda Architecture the primary databases don't require any online compaction.

**Is it possible for incremental architectures to achieve high availability as well as consistency at the same time?**

The CAP theorem shows that it is impossible to achieve both high availability and consistency in the same system in the presence of network partitions. Trying to achieve both leads to more complex tasks, which are more likely to end up causing errors in computations.

**What is the advantage of Lambda Architecture being built as a series of layers?**

The advantage is that each layer satisfies a subset of the properties and builds upon the functionality provided by the layers beneath it. 

**How is the process carried out by each layer?**

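First, the batch layer emits batch views as the result of its functions. These views are then loaded into the serving layer, which makes it possible to do random reads on them. Finally, the speed layer updates realtime views as new data arrives, covering only the recent data that the batch views do not yet include. Once the batch layer finishes precomputing a new batch view, the corresponding realtime results are no longer needed and can be discarded.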
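The interplay between the layers can be sketched in a few lines. This is a toy model, assuming the speed layer covers only events that the latest batch view does not yet include; the data and the function names (`compute_batch_view`, `query`) are hypothetical.

```python
from collections import Counter

# Master dataset already processed by the batch layer: (url, timestamp).
master_dataset = [("/home", 1), ("/home", 2), ("/about", 3)]

# Events that arrived after the last batch run started.
recent_events = [("/home", 4)]


def compute_batch_view(events):
    """Batch layer: recompute the view from all the data it has."""
    return Counter(url for url, _ts in events)


# The serving layer holds the batch view and answers random reads on it.
batch_view = compute_batch_view(master_dataset)

# Speed layer: update the realtime view incrementally as events arrive.
realtime_view = Counter()
for url, _ts in recent_events:
    realtime_view[url] += 1


def query(url):
    """Answer a query by merging the batch view and the realtime view."""
    return batch_view[url] + realtime_view[url]
```

When the next batch run absorbs `recent_events`, the realtime view for those events can simply be thrown away.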

**How does the functionality of the serving layer help reduce complexity?**

A serving layer database supports batch updates and random reads. Most notably, it doesn’t need to support random writes. By not supporting random writes, these databases are extremely simple. That simplicity makes them robust, predictable, easy to configure, and easy to operate.

**Why is it important to know what you are going to do before you choose the database you will work with?**

Because they all have different semantics and are meant to be used for specific purposes (e.g. Cassandra, HBase, MongoDB, Voldemort, Riak and CouchDB). They are not meant to be used for arbitrary data warehousing.


## Chapter 2

**Why is the master dataset so important in Lambda Architecture?**

The master dataset is the source of truth in the Lambda Architecture. It can be used to reconstruct your whole application, because everything else (the speed layer, the serving layer and the batch layer) is produced via functions on the master dataset.

*This chapter covers the data model of the master dataset, which is one of its two components, along with its physical storage.*

**As explained in the terms defined in the book, if someone provided their full name to a social network but they decided to only make visible their first name, what would be considered as data and what as information?**

The first name would be a view derived from the full name, therefore the first name would be considered as information and the full name would be data.

**Why is it pertinent to store as much raw data as possible?**

The more raw data you possess, the more information you can generate from it. Storing raw data is hugely valuable because you rarely know in advance all the questions you want answered. By keeping the rawest data possible, you maximize your ability to obtain new insights, whereas summarizing, overwriting, or deleting information limits what your data can tell you.

**When is it appropriate to store unstructured and structured data?**

If the algorithm for extracting the data is simple and accurate, like extracting an age from an HTML page, you should store the results of that algorithm (which will be structured data). If the algorithm is subject to change, due to improvements or broadening requirements, it is advisable to store the unstructured form of the data.

**Does more information mean rawer data?**

It all depends on how useful the retrieved information is for the general purposes of the application. If there is something valuable to get out of it, it should be included. If not, it should be ignored, as it would only add to the total number of bytes stored.

**Why did databases move from mutable to immutable?**

The traditional method—overwriting mutable files—originated during the era of smaller data when storage was expensive and all systems were transactional. Now disk storage is inexpensive, and enterprises have lots of data they never had before. They want to minimize the latency caused by overwriting data and to keep all data they write in a full-fidelity form so they can analyze and learn from it.

**Does immutability have any disadvantages?**

Immutability offers two vital advantages: human-fault tolerance and simplicity. The first comes from the fact that fixing the data system is just a matter of deleting the bad data units and recomputing the views built from the master dataset. The second comes from the fact that no index is required for the data, which is a huge simplification. The one drawback, which is better seen as a trade-off than a disadvantage, is that the immutable approach uses more storage than a mutable schema.

**If immutability is all about keeping record of each piece of data, does that mean data should never be deleted?**

There are two cases in which you would probably prefer to forget some information. The first is garbage collection: some data does not provide enough value to justify its storage cost. The second has to do with government regulations, which may require you to purge data from your databases.

**How is it possible to delete data from an immutable schema?**

Instead of deleting in place, a safer option is to create a copy of the master dataset with all the unwanted data filtered out.
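A minimal sketch of this filtered-copy approach, assuming the master dataset fits in a Python list. The records and the `purge` helper are hypothetical.

```python
# Hypothetical master dataset: (user_id, url) pageview records.
master_dataset = [
    ("alice", "/home"),
    ("bob", "/about"),
    ("alice", "/pricing"),
]


def purge(dataset, should_remove):
    """Return a filtered copy; the original dataset is left untouched."""
    return [record for record in dataset if not should_remove(record)]


# Example: remove everything that belongs to the user "alice".
purged_dataset = purge(master_dataset, lambda record: record[0] == "alice")
```

Because the original is only replaced once the copy has been verified, a mistake in the filter does not destroy data.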

*The following questions will be focused on explaining the fact-based model for representing data*

**What are facts?**

Facts are the fundamental units of data and have three properties: they are atomic, timestamped, and identifiable. Atomic means that a fact cannot be subdivided further into meaningful components. The timestamp is what makes each fact immutable and eternally true. Finally, identifiability is what prevents facts from being duplicated.
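The three properties can be sketched with a frozen dataclass. The field names below, including the `nonce` used to tell two otherwise identical pageviews apart, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a fact cannot be modified once created
class PageviewFact:
    user_id: str    # atomic: one user viewed one URL, nothing more
    url: str
    timestamp: int  # the moment at which the fact was true
    nonce: int      # hypothetical unique value that identifies the fact


facts = [
    PageviewFact("alice", "/home", 1700000000, nonce=1),
    PageviewFact("alice", "/home", 1700000000, nonce=1),  # same fact, written twice
    PageviewFact("alice", "/home", 1700000000, nonce=2),  # a distinct pageview
]

# Identifiable facts can be deduplicated safely.
unique_facts = set(facts)
```

Without the nonce, the second and third records would be indistinguishable, and there would be no way to know whether the user viewed the page once or twice.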

**What happens when you "update" and "delete" using a Lambda Architecture compared to relational databases?**

In Lambda Architecture “updates” and “deletes” are performed by adding new facts with more recent timestamps, but because no data is actually removed (as it is in relational databases), you can reconstruct the state of the world at the time specified by your query. 
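As a rough sketch of how an "update" works with timestamped facts, assume each fact is a `(user_id, location, timestamp)` tuple. The data and the `location_as_of` helper are hypothetical.

```python
# Each fact is (user_id, location, timestamp); all values are hypothetical.
location_facts = [
    ("alice", "Lima", 100),
    ("alice", "Quito", 200),  # an "update": a newer fact, nothing is overwritten
    ("bob", "Bogota", 150),
]


def location_as_of(facts, user_id, as_of):
    """Return the user's location according to the latest fact up to `as_of`."""
    known = [fact for fact in facts if fact[0] == user_id and fact[2] <= as_of]
    if not known:
        return None
    return max(known, key=lambda fact: fact[2])[1]


current = location_as_of(location_facts, "alice", as_of=300)
earlier = location_as_of(location_facts, "alice", as_of=150)
```

The same dataset answers both "where does alice live now?" and "where did she live at time 150?", which a mutable row could not do.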

**What does it mean to normalize your data?**

Normalization is the process of restructuring a relational database in accordance with a series of “normal” forms to improve data integrity. In simpler terms, normalization makes sure that all of your data looks and reads the same way across all records.

**How does Lambda Architecture integrate normalized and denormalized data into its structure?**

Normalization occurs in the master dataset (batch layer), where no data is stored redundantly. Denormalization occurs in the batch views (serving layer), where data is stored redundantly for efficient querying.

**How can batch processing benefit from parallel preprocessing?**

Steps that do not depend on each other can run on different threads. Steps where the processing of each item does not depend on the results of processing previous items can run on more than one thread.
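When the processing of each item is independent of the others, the work can be spread across threads. A minimal sketch using the standard library's `ThreadPoolExecutor`; the `normalize_url` step and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize_url(url):
    """A per-item step that does not depend on any other item."""
    return url.strip().lower().rstrip("/") or "/"


raw_urls = [" /Home/ ", "/ABOUT", "/"]

# Each URL is processed independently, so the calls can run on several threads.
# Executor.map returns the results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    normalized = list(pool.map(normalize_url, raw_urls))
```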

**How does Hadoop utilize MapReduce?**

It divides a large problem into sub-problems, applies the same function to each sub-problem (map), and finally combines the outputs of all the sub-problems (reduce).
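The idea can be sketched in plain Python with a word count. This is a single-process illustration of the map, group, and reduce steps, not Hadoop's actual API; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values emitted for one key."""
    return key, sum(values)


documents = ["big data systems", "big data big ideas"]

# Every document is mapped independently, so this step could run in parallel.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster, the map calls run on the machines that hold each piece of the data, and the grouping step moves values with the same key to the same reducer.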

*The following questions will be focused on graph schemas*

**What are the components of a graph schema?**

Nodes, edges and properties. Nodes are the entities in the system, for example a user identified by a user ID. Edges are relationships between the nodes, and properties are information about the entities.

**Why are JSON documents not the most appropriate text format for storing facts?**

Because they are too flexible. Human errors are likely with this type of file, because JSON documents can have inconsistent formats or missing data.
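A small sketch of the problem, assuming a pageview fact with three required fields. The field names and the `validate_fact` helper are hypothetical; they only illustrate the kind of check that a schema would enforce automatically.

```python
import json

# Hypothetical schema: required field names and their expected types.
REQUIRED_FIELDS = {"user_id": int, "url": str, "timestamp": int}


def validate_fact(raw_json):
    """Parse a JSON document and reject it if it does not match the schema."""
    record = json.loads(raw_json)
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if not isinstance(record[name], expected_type):
            raise ValueError(f"field {name} should be {expected_type.__name__}")
    return record


good = '{"user_id": 1, "url": "/home", "timestamp": 1700000000}'
missing = '{"user_id": 1, "url": "/home"}'
wrong_type = '{"user_id": "1", "url": "/home", "timestamp": 1700000000}'

valid_record = validate_fact(good)
```

All three strings are valid JSON, yet only the first is a usable fact. With raw JSON, nothing stops the other two from being written to the master dataset.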
