Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Newer
Older
100644 277 lines (180 sloc) 13.845 kb
d63bcd4 @DracoBlue Initial commit
DracoBlue authored
1 # NOSQL in Dependable Systems
665ff0b @rmetzler added most words from the talk
authored
2
05434af @rmetzler startet writing abstract
authored
3 __Abstract:__
23e65e5 @rmetzler N/W/R
authored
4 The fault model for very large e-commerce websites like Amazon is fundamentally different from standard websites. These websites loose money when the aren't available (or just slow) for potential customers but can't risk to loose any data. The data has to be replicated between databases but traditional RDBMSs may not fit.
05434af @rmetzler startet writing abstract
authored
5 This paper discusses some of the better known NoSQL software products available today.
6
7 __Keywords:__
8 Fault Tolerance, CAP Theorem, NoSQL, Dynamo, Riak, Cassandra, MongoDB, CouchDB
9
10 __Authors:__
23e65e5 @rmetzler N/W/R
authored
11 Richard Metzler [@rmetzler](twitter.com/rmetzler "follow Richard Metzler on Twitter"),
12 Jan Schütze [@dracoblue](twitter.com/dracoblue "follow Jan Schütze on Twitter")
05434af @rmetzler startet writing abstract
authored
13
14
399c875 @DracoBlue Fixed indention
DracoBlue authored
15 # Fault Model
f20b0a8 @rmetzler todos
authored
16 On very large e-commerce websites like Amazon people order every minute _TODO: WRITE SOME FACTS_. Amazon has statistics showing a causal connection between response time of the amazon.com website and the time potential customers spend on the website. _TODO: SOURCE?_
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
17 The customer's shopping cart has to be always accessible for writes and the slightest outage has direct significant financial consequences.
bd30211 @rmetzler startet writing fault model
authored
18
19 But on the other side failures are the normal case, not an exception. Disks fail, the network experiences partitioning and whole data centers could become potentially unavailable because of natural disasters like hurricanes.
20
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
21 What our big e-commerce websites need is a datastore that is always read and write enabled, even in presence of network partitions. Data must be replicated across multiple datacenters and these datacenters may be located hundreds of kilometers away from each other and even on different continents.
665ff0b @rmetzler added most words from the talk
authored
22
23
399c875 @DracoBlue Fixed indention
DracoBlue authored
24 # Replication
65a75d3 @rmetzler replication
authored
25
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
26 _Replication_ is one of the fundamental ideas for fault tolerant systems. But replicating data across datacenters located several hundreds of kilometers away from each other takes time. Using a traditional RDBMS with ACID style transactions to replicate data in a distributed transaction may be slow and not very scalable. Synchronous atomic updates would not be tolerant towards network partitions.
65a75d3 @rmetzler replication
authored
27
28 Asynchronous updates can't be atomic, but they are potentially more resistant in case of network partitioning as these are usually transient faults.
665ff0b @rmetzler added most words from the talk
authored
29
23e65e5 @rmetzler N/W/R
authored
30
3f97fe9 @rmetzler Splitbrain/Quorum
authored
31 ## Split Brain
32 In distributed systems the interconnect between nodes is a weak spot. If it is broken, nodes are split into partitions unable to communicate and thus unable to share state. This scenario is called _split brain_. Nodes in split brain scenarios must be prevented from producing inconsistant state and one method to prevent inconsistency is the quorum consensus.
33
34 ## Quorum
35 As the system replica managers in different partitions cannot communicate with each other, the subgroup of replica managers within each partition must be able to decide independently whether they are allowed to carry out operations. A quorum is a subgroup of replica managers whose size gives it the right to carry out operations. __CITE: COULORIS__ One possible criteria for a quorum may be having a majority. Any other partition would be smaller than the majority partition and as a consequence only the majority partition would be the quorum.
4d69043 @rmetzler quorum device
authored
36 Another possible quorum criteria could be the availability of a _quorum device_. The partition that is able to access the quorum device is allowed to carry out operations.
3f97fe9 @rmetzler Splitbrain/Quorum
authored
37
399c875 @DracoBlue Fixed indention
DracoBlue authored
38 # Brewer's CAP Theorem
665ff0b @rmetzler added most words from the talk
authored
39
c259a6f @DracoBlue added CAP source and info
DracoBlue authored
40 In 2000 Eric Brewer at this time chief scientist of Inktomi hold a keynote at the "Principles of Distributed Computing" conference. He presented his assumption that was later proved in "Brewer‘s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" stating that atomic data consistency, high availability (i.e. performance) and network partition tolerance can't be achieved all together at any given time and you may get only two of these properties for every distributed operation. This is called the CAP Theorem after the acronym for __C__onsistency, __A__vailability and __P__artition tolerance.
665ff0b @rmetzler added most words from the talk
authored
41
90e774d @rmetzler ausdruck
authored
42 Because it is impossible to prevent network partitions in large networks the decision has to be between high availability and data consistency. As stated, large e-commerce websites usually go for high availability and trade consistency for that.
665ff0b @rmetzler added most words from the talk
authored
43
399c875 @DracoBlue Fixed indention
DracoBlue authored
44 # Eventual Consistent
665ff0b @rmetzler added most words from the talk
authored
45
298814c @rmetzler sources
authored
46 Werner Vogels, CTO at Amazon, presented in his article "Eventually Consistent" __CITE: eventually-consistent___ his idea of data being not consistent through atomic transactions but only eventually consistent. By trading ACID's atomicy and consistency for performance and partition tolerance it is possible to increase the response time and fault tolerance of websites. The database replications may not be fully consistent but a customer wouldn't usually experience any inconsistencies.
fa30c5e @rmetzler started CAP theorem
authored
47
43b9036 @DracoBlue Removed quotations from the bottom, because there are not recognized …
DracoBlue authored
48 He defined the __inconsistency window__ as "The period between the update and the moment when it is guaranteed that any observer will always see the updated value."
665ff0b @rmetzler added most words from the talk
authored
49
50
399c875 @DracoBlue Fixed indention
DracoBlue authored
51 # N / W / R Replica Configuration
23e65e5 @rmetzler N/W/R
authored
52
53 Vogels introduces the reader to a short notation for replication configuration for _quorum_ like systems:
54
f20b0a8 @rmetzler todos
authored
55 + __N__ is the number of nodes, that store replicas of the data
56 + __W__ is the number of replicas that acknowledge a write operation
57 + __R__ is the number of replicas contacted in a read operation
58
43fcc5a @rmetzler ring
authored
59 To avoid ties in failover scenarios usually an odd number is picked for N.
23e65e5 @rmetzler N/W/R
authored
60
61 With these numbers you are guaranteed _strong consistency_ if following condition holds:
62
63 + N < W + R
64
65 This is because the set of replicas for writing and reading the data overlap.
665ff0b @rmetzler added most words from the talk
authored
66
23e65e5 @rmetzler N/W/R
authored
67 If your replica configuration only holds the condition
78f2102 @rmetzler converted bullets
authored
68
23e65e5 @rmetzler N/W/R
authored
69 + N >= W + R
78f2102 @rmetzler converted bullets
authored
70
23e65e5 @rmetzler N/W/R
authored
71 it only guaranties weak or eventual consistency.
72
f20b0a8 @rmetzler todos
authored
73 A RDBMS is typically configured with {N = 2, W = 2, R = 1} while {N = 3, W = 2, R = 2} is a common configuration for fault tolerant systems.
23e65e5 @rmetzler N/W/R
authored
74
75
76 It is possible to deduce different attributes from these configuration properties.
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
77 Consistency over all nodes is reached if W = N .
23e65e5 @rmetzler N/W/R
authored
78
09d0f96 @DracoBlue Fixed issue with conversion to latex.
DracoBlue authored
79 Read optimized systems will use R = 1, while write optimized systems use W = 1.
23e65e5 @rmetzler N/W/R
authored
80
8a2ecc3 @DracoBlue fixed an other issue which kept markdown2pdf from working
DracoBlue authored
81 Cassandra is able to run in application specific N / W / R configuration. This helps Cassandra to recover from transient and permanent failures.
665ff0b @rmetzler added most words from the talk
authored
82
7f03083 @rmetzler ring
authored
83 # Ring topology
84
85 Cassandra and Riak are both heavily inspired by Amazon's Dynamo and both organize nodes in a ring topology. Also Cloudant's BigCouch __CITE: BIGCOUCH__ of CouchDB is closely modeled after Dynamo and features a ring to replicate CouchDB instances.
8ae7fc3 @rmetzler ring
authored
86 N consecutive nodes on the ring form one replica set and replica sets overlap. By distributing the nodes in the physical space a better fault tolerance should be obtained.
7f03083 @rmetzler ring
authored
87
399c875 @DracoBlue Fixed indention
DracoBlue authored
88 # Products
665ff0b @rmetzler added most words from the talk
authored
89
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
90 There are several NoSQL systems available. We focused on 4 of the major ones:
23e65e5 @rmetzler N/W/R
authored
91
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
92 * Riak (document oriented)
93 * Cassandra (column oriented)
94 * CouchDB (document oriented)
95 * MongoDB (document oriented)
23e65e5 @rmetzler N/W/R
authored
96
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
97 In terms of the CAP theorem: Riak, Cassandra and CouchDB provide availability
98 and partition tolerance. MongoDB on the other hand provides consistency and
99 partition tolerance.
23e65e5 @rmetzler N/W/R
authored
100
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
101 We installed those on multiple virtual machines, connected them with each other
102 and ran some very simple tests to figure out how they behave in case of a fault.
665ff0b @rmetzler added most words from the talk
authored
103
399c875 @DracoBlue Fixed indention
DracoBlue authored
104 ## Riak 0.11.0
665ff0b @rmetzler added most words from the talk
authored
105
e23e623 @rmetzler riak neu
authored
106 Riak is created and maintained by Basho (a company founded by Ex-Akamai employees)
107 and the de facto open source reference implementation of the dynamo paper.
108 It is released under the terms of Apache License 2.0 .
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
109
110 Riak is a document oriented key value store and also supports links between
111 them. Documents are stored in so called buckets.
31e5f87 @rmetzler lauter kleinigkeiten
authored
112 Riak is written entirely in Erlang and has an HTTP interface to read and write data.
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
113
114 More information about Riak may be found at <https://wiki.basho.com/display/RIAK/Riak>.
665ff0b @rmetzler added most words from the talk
authored
115
399c875 @DracoBlue Fixed indention
DracoBlue authored
116 ### Replication Config
665ff0b @rmetzler added most words from the talk
authored
117
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
118 One of Riaks features is the variable configuration of N W R:
665ff0b @rmetzler added most words from the talk
authored
119
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
120 * N can vary for each bucket
121 * R & W can vary for each operation (read/write/delete)
665ff0b @rmetzler added most words from the talk
authored
122
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
123 There are also additional quorums for:
124
125 * durable writes to disk
126 * deletes
665ff0b @rmetzler added most words from the talk
authored
127
399c875 @DracoBlue Fixed indention
DracoBlue authored
128 ## Cassandra 0.6.3
129cc39 @rmetzler fixed bullet points
authored
129
433bc8b @rmetzler cassandra neu
authored
130 Cassandra is a column oriented distributed database system written in Java.
131 It was created by Facebook and donated to the Apache Foundation. Cassandra
132 is released under the Apache License 2.0.
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
133
433bc8b @rmetzler cassandra neu
authored
134 Cassandra's architecture is inspired by Amazon's Dynamo and the Google BigTable.
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
135
433bc8b @rmetzler cassandra neu
authored
136 When developing applications with Cassandra the developer need to configure the
137 collumns before Cassandra can be started. Thus Cassandra is not schema less like
138 one may expect from the term NoSQL. Once configured you can search and order the
139 documents by these columns.
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
140
141 More information about Cassandra may be found at the official website at
142 <http://cassandra.apache.org/> or at the wiki at
143 <http://wiki.apache.org/cassandra/FrontPage>.
665ff0b @rmetzler added most words from the talk
authored
144
399c875 @DracoBlue Fixed indention
DracoBlue authored
145 ## MongoDB 1.4.3
665ff0b @rmetzler added most words from the talk
authored
146
80857d5 @rmetzler it's
authored
147 MongoDB is a document oriented distributed database system by 10gen. It is
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
148 available under the terms of GNU Affero General Public License.
149
150 It uses a custom TCP protocal (BSON) and is written in C++.
151
152 More information about MongoDB is available at the official website at
153 <http://www.mongodb.org/> or at the wiki at
154 <http://www.mongodb.org/display/DOCS/Home>.
665ff0b @rmetzler added most words from the talk
authored
155
399c875 @DracoBlue Fixed indention
DracoBlue authored
156 ## Replication
665ff0b @rmetzler added most words from the talk
authored
157
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
158 There is Master/Slave replication in MongoDB available. In this case one can
1e0deb2 @rmetzler mongodb replication
authored
159 read from the slaves and the master, but write only into the master.
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
160
161 Additionally there are Replica Pairs available. When using Replica Pairs only
1e0deb2 @rmetzler mongodb replication
authored
162 one of two nodes is the master at any time. Read and write is only possible on the
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
163 master of the Replica Pair.
164
165 In case one node of the Replica Pair fails the other one is made the new
1e0deb2 @rmetzler mongodb replication
authored
166 master. Deciding who is the master can be done by an external arbiter (Quorum
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
167 Device).
168
169 The MongoDB Team is currently working on Replica Sets, which are meant to allow
1e0deb2 @rmetzler mongodb replication
authored
170 more then 2 machines to be part of the replication configuration.
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
171
172 ## Crash
173
3b011d2 @rmetzler mongodb crash
authored
174 In case of a MongoDB crash, the entire database must be reindexed again. According to
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
175 David Mytton's blogpost <http://blog.boxedice.com/2010/02/28/notes-from-a-production-mongodb-deployment/>
176 this takes up to 72 hours for 664.000.000 database entries.
177
178 That's why MongoDB has an increased MTTR (Mean Time To Repair).
665ff0b @rmetzler added most words from the talk
authored
179
399c875 @DracoBlue Fixed indention
DracoBlue authored
180 ## CouchDB 0.11.0
665ff0b @rmetzler added most words from the talk
authored
181
f15cf2e @rmetzler an http
authored
182 CouchDB is a document oriented database written in Erlang with support for JavaScript views.
183 CouchDB is an Apache Project.
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
184
f15cf2e @rmetzler an http
authored
185 Access to the data is made available via an HTTP interface by exchanging JSON
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
186 objects.
187
188 More information about CouchDB is available at the official website at
189 <http://couchdb.apache.org> or at the wiki at
190 <http://wiki.apache.org/couchdb/FrontPage>.
665ff0b @rmetzler added most words from the talk
authored
191
399c875 @DracoBlue Fixed indention
DracoBlue authored
192 # Experiments
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
193
194 To test the fault tolerance features of the distributed database systems,
31e5f87 @rmetzler lauter kleinigkeiten
authored
195 we focused on the behavior in case of network splits and synchronization
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
196 after adding new nodes.
197
31e5f87 @rmetzler lauter kleinigkeiten
authored
198 For this purpose we set up the virtual machines "Alice" and "Bob". Both
199 run a vanilla Debian Squeeze release in Virtual Box. For the Cassandra
200 experiments we added a third identical machine called "Charly".
665ff0b @rmetzler added most words from the talk
authored
201
399c875 @DracoBlue Fixed indention
DracoBlue authored
202 ## Experiment 1
665ff0b @rmetzler added most words from the talk
authored
203
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
204 The first experiment is meant to show what happens if a new node joins the
205 distributed database.
206
207 For this purpose we set up the node Alice and pushed 1000 data records into
31e5f87 @rmetzler lauter kleinigkeiten
authored
208 Alice. Then Bob joined the network. To check if Bob already had all data
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
209 we frequently tried to read the 1000th entry from Bob. If this was possible
31e5f87 @rmetzler lauter kleinigkeiten
authored
210 we assumed that Bob was in sync or at least capable to answer in a consistent
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
211 way.
212
213 Results (Replicating to a new node):
214
215 * Riak: 1 second
216 * Cassandra (3 nodes): 20 seconds
217 * CouchDB: 1 second
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
218 * MongoDb: 2 seconds
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
219
399c875 @DracoBlue Fixed indention
DracoBlue authored
220 ## Experiment 2
665ff0b @rmetzler added most words from the talk
authored
221
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
222 To test how the distributed database system is able to manage a network
223 split, we set up the second experiment.
224
31e5f87 @rmetzler lauter kleinigkeiten
authored
225 The two nodes Alice and Bob were synchronized and connected. Then we deactivated
226 Bob's network and wrote 1000 data records into Alice. Because of the network partition
227 it is impossible for Bob to have the fresh data.
228 We turned on the network and timed how long it takes for Bob to receive all data entries
229 (by querying for the 1000th entry).
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
230
31e5f87 @rmetzler lauter kleinigkeiten
authored
231 Results (replication after network split):
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
232
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
233 * Riak: 6 seconds
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
234 * Cassandra (3 nodes): 20 seconds
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
235 * CouchDB: 8 seconds
236 * MongoDb: Failed, because reading from Slave returns an error
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
237
80857d5 @rmetzler it's
authored
238 When configured as replica pair it is not possible to read from the MongoDB Slave.
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
239
399c875 @DracoBlue Fixed indention
DracoBlue authored
240 ## Experiment 2b
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
241
242 Since we had MongoDB as Replica Pair in the experiment 2, it was impossible
31e5f87 @rmetzler lauter kleinigkeiten
authored
243 to read from the slave. That's why we made an experiment 2b with a slightly
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
244 different configuration.
245
31e5f87 @rmetzler lauter kleinigkeiten
authored
246 We set up Alice and Bob as Replica Pair. Another MongoDB instance "Charly" was
247 used as arbiter (Quorum Device). MongoDB chose the first one (Alice) to be the
248 master and Bob the slave. Then we pushed 1000 data records into Alice.
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
249
31e5f87 @rmetzler lauter kleinigkeiten
authored
250 We stopped Alice in three different ways:
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
251
31e5f87 @rmetzler lauter kleinigkeiten
authored
252 * by removing the network connection
253 * by stopping it gracefully
254 * by using "kill -9"
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
255
31e5f87 @rmetzler lauter kleinigkeiten
authored
256 After that we timed how long Bob needs to recognize that Alice has
257 disappeared and Bob become master.
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
258
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
259 Result: It took 1 second for Bob to become Master and thus allowing the client
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
260 to read from Bob.
261
31e5f87 @rmetzler lauter kleinigkeiten
authored
262 We noticed that stopping the node and kill -9 worked great. But Bob did not notice
263 the network split if we just removed the network connection. We assume that this
264 is because the virtual machine does not send any interupt like a physical network
265 interface would waiting for the TCP connection to time out.
08de5b8 @DracoBlue * Description for experiments added.
DracoBlue authored
266
399c875 @DracoBlue Fixed indention
DracoBlue authored
267 # Sources
59fe30f @DracoBlue Explanation of the Products.
DracoBlue authored
268 * Eric Brewer: "Towards Robust Distributed Systems"
269 <http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf>
270 * Gilbert, Lynch: "Brewer‘s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services"
271 * Werner Vogels: "Eventual Consistent"
272 * W. Vogels et all: "Dynamo: Amazon's highly Available Key-Value Store
273 * Lakshman, Malik: "Cassandra - A Decentralized Structured Storage System"
274 * David Mytton: "Notes from a production MongoDB deployment"
275 <http://blog.boxedice.com/2010/02/28/notes-from-a-production-mongodb-deployment/>
7f03083 @rmetzler ring
authored
276 * Coulouris et al: "Distributed Systems. Concepts and Design"
Something went wrong with that request. Please try again.