# Designing Data-Intensive Applications <a class="tocSkip">
## Notes on Chapter 1 <a class="tocSkip">

---

# Issues posed by data intensive systems


> A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

> + **Store data** so that they, or another application, can find it again later (`databases`)
> + Remember the **result of an expensive operation**, to speed up reads (`caches`)
> + Allow users to **search data by keyword or filter it** in various ways (`search indexes`)
> + Send a **message to another process, to be handled asynchronously** (`stream processing`)
> + Periodically **crunch a large amount of accumulated data** (`batch processing`)


Quoted from the book

---

# The old vs the new conceptions

## CAP Theorem (1999)

It is **impossible** for a distributed data store to simultaneously **provide more than two out of the following three guarantees**:

+ **`Consistency`**: Every read receives the most recent write or an error
+ **`Availability`**: Every request receives a (non-error) response – without the guarantee that it contains the most recent write
+ **`Partition tolerance`**: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
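
To make the trade-off concrete, here is a toy sketch (not a real database) of two replicas that can be cut off from each other. The class names, the `mode` labels `"CP"` and `"AP"`, and the rule that writes land on replica `a` while reads land on replica `b` are all assumptions made for illustration.

```python
class Replica:
    """One node holding a single value (a toy register)."""

    def __init__(self):
        self.value = None


class ToyStore:
    """Two replicas; `mode` is "CP" or "AP" (labels invented for this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when a and b cannot exchange messages

    def write(self, value):
        """Writes always arrive at replica a in this sketch."""
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than let the replicas diverge.
            raise RuntimeError("unavailable: cannot reach replica b")
        self.a.value = value
        if not self.partitioned:
            self.b.value = value  # synchronous replication while connected

    def read_from_b(self):
        """Reads always arrive at replica b in this sketch."""
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm the latest write")
        return self.b.value  # in AP mode this may be stale


ap = ToyStore("AP")
ap.write("v1")
ap.partitioned = True
ap.write("v2")            # accepted by replica a only
print(ap.read_from_b())   # prints "v1": a response, but not the latest write
```

During the partition the AP store keeps answering with stale data, while a `ToyStore("CP")` in the same situation would raise an error instead. Without a partition, both modes behave identically.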

## CAP Theorem (contd.) <a class="tocSkip">

To review and add from these sources:
+ [towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e](https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e)
+ [faculty.washington.edu/wlloyd/courses/tcss562/papers/ChoosingTheRightNoSQLDatabaseForTheJob-AQualityAttributeEvaluation.pdf](http://faculty.washington.edu/wlloyd/courses/tcss562/papers/ChoosingTheRightNoSQLDatabaseForTheJob-AQualityAttributeEvaluation.pdf)
+ [robertgreiner.com/2014/08/cap-theorem-revisited/](http://robertgreiner.com/2014/08/cap-theorem-revisited/)

## CAP Theorem (contd.) <a class="tocSkip">

![](https://www.researchgate.net/profile/Joao_Lourenco11/publication/282519669/figure/fig1/AS:281002732736529@1444007680733/CAP-theorem-with-databases-that-choose-CA-CP-and-AP.png)

From the Lourenço article quoted previously

## Blurring of boundaries

> They no longer neatly fit into traditional categories. For example:
> + datastores that are also used as message queues (Redis)
> + message queues with database-like durability guarantees (Apache Kafka)

> The boundaries between the categories are becoming blurred.

Quoted from the book

## There is no magic bullet

Increasingly many applications now **have such demanding or wide-ranging
requirements that a single tool can no longer meet all of its data processing and storage needs**.

Instead, **the work is broken down into tasks** that can be performed efficiently on a single tool, and **those different tools are stitched together using application code**.


## There is no magic bullet (contd.) <a class="tocSkip">

![](assets/ddia-fig1-1.png)
> Fig 1.1. One possible architecture for a data system that combines several components.

Figure from the book.

## There is no magic bullet (contd.) <a class="tocSkip">

As such, we are creating a:
> new, **special-purpose data system** from **smaller, general-purpose components**.

> Your composite data system **may provide certain guarantees**: e.g., that the **cache will be correctly invalidated or updated on writes** so that outside clients see consistent results.

Quote from the book.
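
A minimal sketch of such a guarantee, assuming a single process and plain dictionaries standing in for the database and the cache (the `CachedStore` name and its methods are hypothetical). The application code invalidates the cache entry on every write, so a later read cannot return the old value.

```python
class CachedStore:
    """Application code stitching a 'database' and a 'cache' together."""

    def __init__(self):
        self._db = {}      # stand-in for the primary database
        self._cache = {}   # stand-in for a cache in front of it

    def read(self, key):
        # Serve from the cache when possible, otherwise fall back to the db.
        if key in self._cache:
            return self._cache[key]
        value = self._db.get(key)
        if value is not None:
            self._cache[key] = value  # populate the cache on a miss
        return value

    def write(self, key, value):
        # Write to the database first, then invalidate the cached copy.
        self._db[key] = value
        self._cache.pop(key, None)


store = CachedStore()
store.write("profile:1", "Ada")
print(store.read("profile:1"))  # prints "Ada" (miss, then cached)
store.write("profile:1", "Grace")
print(store.read("profile:1"))  # prints "Grace" (stale entry was invalidated)
```

With concurrent writers or a cache on another machine, this simple ordering is no longer enough on its own, which is part of why the guarantee becomes the application's responsibility.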

---

# High-level concerns


>**`Reliability`**
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

>**`Scalability`**
As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

>**`Maintainability`**
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.

Quoted from the book.

---

# Reliability

Expectation of reliability can take many forms:

> + The application **performs the function that the user expected**.
> + It can **tolerate the user making mistakes** or **using the software in unexpected ways**.
> + Its **performance is good enough** for the required use case, under the expected load and data volume.
> + The system **prevents any unauthorized access and abuse**.

As such, reliability can be summed up as meaning **"continuing to work correctly, even when things go wrong"**.

Quoted from the book

## Managing fault-tolerance, preventing failures

Systems that anticipate faults and cope with them are called **fault-tolerant** or **resilient**.


There is a distinction to be made:
+ **`Faults`** are usually **one component of the system deviating** from its spec.
+ **`Failures`** are **when the system as a whole stops** providing the required service to the user.



Quoted from the book
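
The distinction can be sketched with redundant replicas: a single replica raising an error is a fault, and it only becomes a failure when no replica is left to answer. The function and replica names below are hypothetical, and replicas are modelled as plain callables.

```python
def fetch_with_failover(replicas, key):
    """Try each replica in turn and return the first successful answer."""
    faults = 0
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError:
            faults += 1  # a fault: one component deviated from its spec
    # Only when every replica has faulted does the caller see a failure.
    raise RuntimeError(f"failure: all {faults} replicas faulted")


def broken_replica(key):
    raise ConnectionError("disk unreachable")


def healthy_replica(key):
    return f"value-for-{key}"


# One fault is tolerated, so the system as a whole keeps working.
print(fetch_with_failover([broken_replica, healthy_replica], "user:1"))
```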

### Faults

#### Case Study: https://status.datacamp.com/ <a class="tocSkip">






![](assets/ddia-statuspage-datacamp-outage.png)


![](assets/ddia-statuspage-datacamp-time.png)

#### Case Study: https://www.githubstatus.com/ <a class="tocSkip">

![](assets/ddia-statuspage-github.png)

#### Faults - Solutions <a class="tocSkip">
    
Let users know:
+ whether there is a fault or degraded performance, so they can plan accordingly
+ where the problem is, so they know whether it affects them
+ what the status of the problem is, to assure them that a fix is on the way


### Failures


#### Send an encrypted stacktrace to be given to support <a class="tocSkip">

![](https://i.kym-cdn.com/photos/images/original/001/224/831/8ab.jpg)

#### Service Workers! <a class="tocSkip">

https://www.youtube.com/watch?v=ZiWnq7bYO5o

![](https://kinsta.com/wp-content/uploads/2017/12/airbnb-500-internal-server-error-1024x503.png)

#### Service Worker Lifecycle <a class="tocSkip">

![](https://scotch-res.cloudinary.com/image/upload/dpr_3,w_350,q_auto:good,f_auto/v1536590617/kxhshgxmbcl1aw7gncem.png)

#### Service Worker Registration <a class="tocSkip">

![](assets/ddia-service-worker-register-fail.png)

#### Service Worker Caching <a class="tocSkip">

![](https://cdn.netlify.com/9852865e142b6f8453d7c1ae083d2e342adc8c02/cbc3a/img/blog/service-worker-diagram.png)

#### Luckily I found a demo in one of my repos

[github.com/xR86/algo/tree/master/code-js](https://github.com/xR86/algo/tree/master/code-js)

+ http://localhost:8000/service-workers/
+ http://localhost:8000/workers/


## Deliberately inducing faults

>By deliberately inducing faults, you ensure 

> that the fault-tolerance machinery is continually exercised and tested,

> which can increase your confidence that faults will be handled correctly when they occur naturally.

Quote from the book.
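
A small sketch of the idea, assuming the fault-tolerance machinery is a retry wrapper: `flaky` injects `ConnectionError`s at a chosen rate, and `with_retries` is the machinery being exercised. Both helpers are hypothetical and written for this note; this is not how Chaos Monkey itself works.

```python
import random


def flaky(func, failure_rate, rng):
    """Wrap func so that it fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def with_retries(func, attempts):
    """Call func up to `attempts` times, re-raising the last ConnectionError."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except ConnectionError:
                if attempt == attempts - 1:
                    raise
    return wrapper


def lookup(user_id):
    return {"id": user_id}


# Inject faults into 30% of calls and count how many requests still fail
# after the retry logic has had its chance.
rng = random.Random(42)  # seeded so a run can be reproduced
resilient_lookup = with_retries(flaky(lookup, 0.3, rng), attempts=5)

failures = 0
for user_id in range(1000):
    try:
        resilient_lookup(user_id)
    except ConnectionError:
        failures += 1
print(f"{failures} of 1000 requests failed despite retries")
```

Running this kind of experiment regularly shows whether the retry path is actually taken and whether it behaves as intended.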

### Principles of Chaos
http://principlesofchaos.org/

### Netflix - Chaos Monkey

https://github.com/Netflix/chaosmonkey

![](https://raw.githubusercontent.com/Netflix/chaosmonkey/master/docs/logo.png)

### Agile Security Practices - GameDays

https://books.google.ro/books?id=Jco0DwAAQBAJ&pg=PA272&lpg=PA272&dq=gamedays+security&source=bl&ots=Iq1cPTgg_c&sig=ACfU3U2w9PODvfuhFeNrlfPYlbs-ho8Qtg&hl=ro&sa=X&ved=2ahUKEwiHiOKtlYThAhWGDewKHdXVCIEQ6AEwCXoECAIQAQ#v=onepage&q&f=false

GameDays are so widespread at AWS that they recently introduced a dedicated event: https://aws.amazon.com/gameday/

### So ... about faults

> Another class of fault is a systematic error within the system [8]. 

> Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [5]. Examples include:

> + A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel [9].
> + A runaway process that uses up some shared resource—CPU time, memory, disk space, or network bandwidth.


Quoted from the book.

### So ... solutions ?

> + Lots of small things can help:
>   + carefully thinking about assumptions and interactions in the system;
>   + thorough testing;
>   + process isolation;
>   + allowing processes to crash and restart;
>   + measuring, monitoring, and analyzing system behavior in production. 
  
> + If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found.

Quoted from the book.
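
The self-checking idea could look roughly like this for an in-memory queue. `AuditedQueue` and its counters are assumptions made for illustration, and the invariant used here also counts messages still waiting in the queue, a slight refinement of the "incoming equals outgoing" wording.

```python
from collections import deque


class AuditedQueue:
    """A queue that counts traffic so it can check itself while running."""

    def __init__(self):
        self._items = deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, item):
        self._items.append(item)
        self.enqueued += 1

    def get(self):
        item = self._items.popleft()
        self.dequeued += 1
        return item

    def check_invariant(self):
        """True when incoming == outgoing + messages still in the queue."""
        return self.enqueued == self.dequeued + len(self._items)


queue = AuditedQueue()
for message in ("a", "b", "c"):
    queue.put(message)
queue.get()
print(queue.check_invariant())  # prints True

# Simulate a bug that silently drops a message, bypassing the counters.
queue._items.pop()
if not queue.check_invariant():
    print("ALERT: message count discrepancy detected")
```

In production the check would run periodically and feed a monitoring system rather than print to the console.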

TODO: add monitoring screenshots from the AWS interface

### Human errors

> For example, one study of large internet services found that
configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages.

Quoted from the book.

#### Possible solutions <a class="tocSkip">

+ Well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.”
  + However, if the interfaces are too restrictive, people will work around them, negating their benefit, so this is a tricky balance to get right.
 
+ Decouple the places where people make the most mistakes from the places where they can cause failures.
  + In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.




---

# Scalability

---

# Maintainability

---

# Bibliography

## The old vs the new conceptions

+ [en.wikipedia.org/wiki/CAP_theorem](https://en.wikipedia.org/wiki/CAP_theorem)
+ [towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e](https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e)
+ [researchgate.net/publication/282519669_Choosing_the_right_NoSQL_database_for_the_job_a_quality_attribute_evaluation](https://www.researchgate.net/publication/282519669_Choosing_the_right_NoSQL_database_for_the_job_a_quality_attribute_evaluation)
  + fulltext here : [faculty.washington.edu/wlloyd/courses/tcss562/papers/ChoosingTheRightNoSQLDatabaseForTheJob-AQualityAttributeEvaluation.pdf](http://faculty.washington.edu/wlloyd/courses/tcss562/papers/ChoosingTheRightNoSQLDatabaseForTheJob-AQualityAttributeEvaluation.pdf)



---