# Big Data: Principles and best practices of scalable real-time data systems

This is a summary of chapters 3 and 4 of "Big Data: Principles and best practices of scalable real-time data systems" by Nathan Marz.



## Chapter 3: Data model for Big Data: Illustration


**1. What schemaless format used for writing data can easily get corrupted?**
JSON files. They may seem appealing because of their simplicity, but using them can lead to problems: because of bugs or misunderstandings between developers, data corruption inevitably occurs.

**2. Why is data corruption one of the most time-consuming errors to debug?**
Because the problem is noticeable, but there is little context on how the corruption occurred. Typically you’ll only notice there’s a problem when there’s an error downstream in the processing—long after the corrupt data was written. 

**3. What are some advantages of working with enforceable schemas?**
When you create an enforceable schema, you get errors at the time of writing the data—giving you full context as to how and why the data became invalid (like a stack trace). In addition, the error prevents the program from corrupting the master dataset by writing that data.

**4. How can you make an enforceable schema?**
Serialization frameworks are an easy approach to making an enforceable schema. They generate code, in whatever languages you wish to use, for reading, writing, and validating objects that match your schema. However, they are limited when it comes to achieving a fully rigorous schema.
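
As a rough illustration of what "errors at the time of writing" means, the sketch below uses a hand-written Python class as a stand-in for the code a serialization framework would generate. The `PageView` type, its fields, `PAGE_VIEW_SCHEMA`, and `append_record` are all hypothetical names chosen for this example; they are not part of any framework.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical schema: field name -> required type.
PAGE_VIEW_SCHEMA = {"user_id": int, "url": str, "timestamp": int}


@dataclass(frozen=True)
class PageView:
    """Stand-in for a framework-generated record class."""
    user_id: int
    url: str
    timestamp: int

    def __post_init__(self):
        # Reject wrongly typed fields when the object is built,
        # before anything can reach the master dataset.
        for name, expected in PAGE_VIEW_SCHEMA.items():
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )


def append_record(dataset, record):
    """Serialize a validated record and append it to the dataset (a list here)."""
    dataset.append(json.dumps(asdict(record)))


master_dataset = []
append_record(master_dataset, PageView(user_id=42, url="/home", timestamp=1700000000))

rejected = None
try:
    # A bug upstream passes the user id as a string.
    append_record(master_dataset, PageView(user_id="42", url="/home", timestamp=1700000000))
except TypeError as err:
    rejected = str(err)  # the error surfaces here, with a stack trace at the write site
```

The invalid record never reaches `master_dataset`, and the exception points at the code that produced it rather than at a downstream job.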

**5. Why is it important that a serialization framework let schemas evolve over time?**
Letting schemas evolve over time is a crucial property, because as your business requirements change you’ll need to add new kinds of data, and you’ll want to do so as effortlessly as possible. 

**6. How do schemas evolve in Apache Thrift?**
The key to evolving Thrift schemas is the numeric identifiers associated with each field. Those IDs are used to identify fields in their serialized form. When you want to change the schema but still be backward compatible with existing data, you must obey the following rules: fields may be renamed; a field may be removed, but its field ID must never be reused; and only optional fields can be added to existing structs.

**7. Why is renaming fields allowed in serialization?**
Because the serialized form of an object uses the field IDs, not the names, to identify fields.
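
The role of field IDs can be illustrated with a toy model. The sketch below is not Thrift's actual wire format or API; it simply keys serialized values by a numeric ID, using two hypothetical schema versions, to show why a rename, a removal, and an added optional field all stay compatible with old data.

```python
# Hypothetical schemas: field ID -> field name. Only the IDs are serialized.
SCHEMA_V1 = {1: "name", 2: "age"}
# V2 renames field 1, removes field 2 (its ID is retired, never reused),
# and adds an optional field under a fresh ID.
SCHEMA_V2 = {1: "full_name", 3: "email"}


def serialize(record, schema):
    """Encode a record as {field_id: value}; names never reach the output."""
    name_to_id = {name: field_id for field_id, name in schema.items()}
    return {name_to_id[name]: value for name, value in record.items()}


def deserialize(payload, schema):
    """Decode by ID. Unknown IDs are skipped; absent optional fields become None."""
    return {name: payload.get(field_id) for field_id, name in schema.items()}


old_payload = serialize({"name": "Alice", "age": 30}, SCHEMA_V1)
decoded = deserialize(old_payload, SCHEMA_V2)
# decoded == {"full_name": "Alice", "email": None}
```

If ID 2 were reused for a new field, the old `age` value would be silently read back as that new field, which is why retired IDs must stay retired.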

**8. What are the limitations of serialization frameworks?**
Serialization frameworks only check that all required fields are present and are of the expected type. They’re unable to check richer properties like “Ages should be nonnegative” or “true-as-of timestamps should not be in the future.” Data not matching these properties would indicate a problem in your system, and you wouldn’t want them written to your master dataset. 

**9. How can you work around those limitations with a serialization framework like Apache Thrift?**
There are two approaches. You can wrap your generated code in additional code that checks the extra properties you care about, such as ages being nonnegative. You can also check those properties at the very beginning of your batch-processing workflow.
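
Both approaches might look roughly like the sketch below. Plain dicts stand in for framework-generated objects, and `InvalidRecord`, `check_person`, and `split_valid` are hypothetical names introduced only for this example.

```python
import time


class InvalidRecord(ValueError):
    """Raised when a record breaks a rule the schema cannot express."""


def check_person(record, now=None):
    """Approach 1: run the extra checks whenever a record is created or written."""
    now = time.time() if now is None else now
    if record["age"] < 0:
        raise InvalidRecord("age must be nonnegative")
    if record["true_as_of"] > now:
        raise InvalidRecord("true-as-of timestamp must not be in the future")
    return record


def split_valid(records, now=None):
    """Approach 2: screen a whole batch at the start of the batch workflow."""
    valid, invalid = [], []
    for record in records:
        try:
            valid.append(check_person(record, now))
        except InvalidRecord:
            invalid.append(record)
    return valid, invalid


batch = [
    {"name": "alice", "age": 31, "true_as_of": 900},
    {"name": "bob", "age": -4, "true_as_of": 900},      # negative age
    {"name": "carol", "age": 27, "true_as_of": 5000},   # timestamp in the future
]
# A fixed `now` keeps the example deterministic.
valid, invalid = split_valid(batch, now=1000)
```

Here `valid` holds only Alice's record, while the other two are set aside for inspection instead of being processed.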

**Why would you want to use serialization?**
The two most important reasons are to persist the state of an object to a storage medium so an exact copy can be re-created at a later stage, and to send the object by value from one application domain to another. For example, serialization is used to save session state in ASP.NET and to copy objects to the Clipboard in Windows Forms. It is also used by remoting to pass objects by value from one application domain to another.
(retrieved from: https://docs.microsoft.com/en-us/dotnet/standard/serialization/serialization-concepts)



## Chapter 4: Data storage on the batch layer


**1. Why is it important to support parallel processing in the batch layer?**
Constructing the batch views requires computing functions on the entire master dataset. The batch storage must consequently support parallel processing to handle large amounts of data in a scalable manner.

**2. What are the advantages and disadvantages (or trade-offs) of compressing your data?**
Compressing data helps you minimize expenses, since storage costs money. The trade-off is that decompressing the data during computations can hurt performance. That is why the batch layer should give you the flexibility to decide how to store and compress your data to suit your specific needs.
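
One way to get a feel for this trade-off is to measure it. The sketch below uses Python's standard `zlib` module on made-up, repetitive records; the data and the chosen compression levels are assumptions for illustration, and the timings will vary by machine.

```python
import time
import zlib

# Made-up dataset: 20,000 small, repetitive JSON-like records.
raw = b"\n".join(
    b'{"user_id": %d, "url": "/products/%d"}' % (i, i % 50) for i in range(20000)
)


def measure(level):
    """Return (compressed size, compress seconds, decompress seconds) for a zlib level."""
    start = time.perf_counter()
    packed = zlib.compress(raw, level)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = zlib.decompress(packed)
    decompress_s = time.perf_counter() - start

    if unpacked != raw:
        raise RuntimeError("round trip changed the data")
    return len(packed), compress_s, decompress_s


results = {level: measure(level) for level in (1, 6, 9)}

print(f"raw size: {len(raw)} bytes")
for level, (size, compress_s, decompress_s) in results.items():
    print(
        f"level {level}: {size} bytes, "
        f"compress {compress_s:.4f}s, decompress {decompress_s:.4f}s"
    )
```

The size column is the storage saving; the timing columns are the CPU cost paid on every write and every read.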

**3. How can you deal with the inherent mutability of computers so that it does not affect the immutability of your dataset?**
The best you can do is put checks in place to disallow mutable operations. These checks should prevent bugs or other random errors from trampling over existing data. 

**4. Why are key/value stores not considered a good option for batch layer storage?**
First, you cannot compress multiple key/value pairs together, so you’re severely limited in tuning the trade-off between storage costs and processing costs. Second, key/value stores are meant to be used as mutable stores, which is a problem because enforcing immutability is crucial for the master dataset. Finally, a key/value store comes with many things you do not need, such as random reads, random writes, and other services, which increase your storage costs and lower your performance when reading and writing data.

**5. Which technology is a perfect fit for batch layer storage?**
Filesystems. Unlike a key/value store, a filesystem gives you exactly what you need and no more, while also not limiting your ability to tune storage cost versus processing cost. On top of that, filesystems implement fine-grained permissions systems, which are perfect for enforcing immutability. 

**6. How is it possible to have filesystems spread their storage across a cluster of computers when regular filesystems are supposed to exist on just a single machine?**
This is possible because there is a class of technologies called distributed filesystems, which are quite similar to regular filesystems but spread their storage across more than one computer. Distributed filesystems are also designed to be fault tolerant: if you lose one machine, all your files and data will still be accessible.

**7. What are some of the differences between distributed filesystems and regular filesystems?**
The operations you can do with a distributed filesystem are often more limited than what you can do with a regular filesystem. For instance, you may not be able to write to the middle of a file or even modify a file at all after creation.

**8. How do distributed filesystems make storage tunable and help reduce processing costs?**
Just like regular filesystems, you have full control over how you store your data units within the files. You choose the file format for your data as well as the level of compression. You’re free to do individual record compression, block-level compression, or neither. 
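
The difference between record-level and block-level compression can be sketched with the standard `zlib` module. The records below are made up for illustration; the point is only that compressing many similar records together lets the compressor exploit redundancy across records, which per-record compression cannot do.

```python
import zlib

# Made-up records that share most of their structure.
records = [
    b'{"user_id": %d, "event": "pageview", "url": "/products/%d"}' % (i, i % 50)
    for i in range(1000)
]

# No compression: just the sum of the record sizes.
uncompressed_size = sum(len(r) for r in records)

# Individual record compression: each record is compressed on its own.
per_record_size = sum(len(zlib.compress(r)) for r in records)

# Block-level compression: all records are compressed together as one block.
block_size = len(zlib.compress(b"\n".join(records)))

print(f"uncompressed: {uncompressed_size} bytes")
print(f"per record:   {per_record_size} bytes")
print(f"one block:    {block_size} bytes")
```

Block compression gives the smallest output here, at the cost of having to decompress a whole block to read any single record in it.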

**9. How can you enforce immutability in a distributed filesystem?**
To enforce immutability, you can disable the ability to modify or delete files in the master dataset folder for the user with which your application runs. This redundant check will protect your previously existing data against bugs or other human mistakes.
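
A minimal sketch of the same idea on a local POSIX filesystem, used here as a stand-in for a distributed one. It assumes POSIX permission bits and strips write access for everyone rather than for one application user; `make_read_only` and the folder layout are hypothetical. A real deployment would use the permission system of the distributed filesystem itself.

```python
import os
import stat
import tempfile

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(folder):
    """Strip write permission from a folder tree (POSIX mode bits).

    Without write permission on a file it cannot be modified; without write
    permission on its directory it cannot be deleted or replaced.
    """
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            os.chmod(path, stat.S_IMODE(os.stat(path).st_mode) & ~WRITE_BITS)
        os.chmod(root, stat.S_IMODE(os.stat(root).st_mode) & ~WRITE_BITS)


with tempfile.TemporaryDirectory() as tmp:
    master = os.path.join(tmp, "master")
    os.mkdir(master)
    data_file = os.path.join(master, "part-00000")
    with open(data_file, "w") as f:
        f.write("record\n")

    make_read_only(master)
    file_writable = bool(os.stat(data_file).st_mode & WRITE_BITS)
    dir_writable = bool(os.stat(master).st_mode & WRITE_BITS)

    # Restore write access so the temporary directory can be cleaned up.
    os.chmod(master, 0o700)
    os.chmod(data_file, 0o600)
```

Note that a superuser bypasses these checks, which is one more reason to run the application under a restricted account.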

**10. What is vertical partitioning?**
The process of partitioning your data so that a function only accesses data relevant to its computation. 
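
On a filesystem, one common way to do this is to sort records into folders, for example one folder per day. The sketch below assumes that layout; the folder names, the `logins.txt` file name, the sample records, and both helper functions are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def append_record(base, date, record):
    """Write a record into the folder for its date, creating the folder if needed."""
    folder = os.path.join(base, date)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "logins.txt"), "a") as f:
        f.write(record + "\n")


def read_partition(base, date):
    """Read only the files under one date folder; other dates are never opened."""
    folder = os.path.join(base, date)
    records = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name)) as f:
            records.extend(line.rstrip("\n") for line in f)
    return records


with tempfile.TemporaryDirectory() as base:
    append_record(base, "2012-10-25", "alex 192.168.12.125")
    append_record(base, "2012-10-25", "bob 192.168.8.251")
    append_record(base, "2012-10-26", "charlie 192.168.12.82")

    # A function that only cares about one day touches only that day's folder.
    todays_logins = read_partition(base, "2012-10-26")
    partitions = sorted(os.listdir(base))
```

A computation over a single day scales with that day's data rather than with the size of the whole dataset.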

**What is the difference between a distributed file system and clustered file system?**
The difference lies in the model used for the underlying block storage. In a cluster filesystem such as GFS2, all of the nodes connect to the *same* block storage, with access mediated by locks or other synchronization primitives. In a distributed filesystem such as GlusterFS, each server has its own *private* block storage, which is only unified at a higher level.
(retrieved from: https://www.quora.com/What-is-the-difference-between-a-distributed-file-system-and-clustered-file-system)

