=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================


Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs for each of the
functional requirements, including network, disk, storage, and API usage costs.


Terminology
===========

*LAFS share*
   A Tahoe-LAFS share representing part of a file after encryption and
   erasure encoding.

*LAFS shareset*
   The set of shares stored by a LAFS storage server for a given storage index.
   The shares within a shareset are numbered by a small integer.

*Cloud storage service*
   A service that provides cloud storage, such as Amazon S3 `²`_, Rackspace
   Cloud Files `³`_, Google Cloud Storage `⁴`_, or Windows Azure `⁵`_.

*Cloud storage interface*
   A protocol interface supported by a cloud storage service, such as the
   S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
   Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
   multiple services implementing a given cloud storage interface. In this design,
   only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Cloud object*
   A file-like abstraction provided by a cloud storage service, storing a
   sequence of bytes. Cloud objects are mutable in the sense that the contents
   and metadata of the cloud object with a given name in a given cloud container
   can be replaced. Cloud objects are called “blobs” in the Azure interface,
   and “objects” in the other interfaces.

*Cloud container*
   A container for cloud objects provided by a cloud service. Cloud containers
   are called “buckets” in the S3 and Google Cloud Storage interfaces, and
   “containers” in the Azure and OpenStack Storage interfaces.

Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server, and the data is stored to the cloud
  storage service.

  * *Scalable shares*: there is no hard limit on the size of LAFS share
    that can be uploaded.

    If the cloud storage interface offers scalable files, then this could be
    implemented by using that feature of the specific cloud storage
    interface. Alternately, it could be implemented by mapping from the LAFS
    abstraction of an unlimited-size immutable share to a set of size-limited
    cloud objects.

  * *Streaming upload*: the size of the LAFS share that is uploaded
    can exceed the amount of RAM and even the amount of direct attached
    storage on the storage server. That is, the storage server is required
    to stream the data directly to the cloud storage service as it is
    processed, rather than buffering the data until the client has finished
    uploading and then transferring it to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

  * *Streaming download*: the size of the LAFS share that is
    downloaded can exceed the amount of RAM and even the amount of direct
    attached storage on the storage server. That is, the storage server is
    required to stream the data directly to the client as it is processed,
    rather than buffering the data until the cloud storage service has
    finished serving it and then transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternately, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  cloud objects.

  * *Efficient modify*: the size of the LAFS share being
    modified can exceed the amount of RAM and even the amount of direct
    attached storage on the storage server. That is, the storage server is
    required to download, patch, and upload only the segment(s) of the share
    that are being modified, rather than downloading, patching, and
    uploading the entire share.

* *Tracking leases*: the Tahoe-LAFS storage server is required to track when
  each share has its lease renewed so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each interface.


Mapping
=======

This section describes the mapping between LAFS shares and cloud objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
cloud object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)

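As an illustration of this arithmetic, a sketch in Python (the helper names
here are invented for exposition, not taken from the Tahoe-LAFS codebase)::

  def chunk_count(share_size, chunksize):
      """Number of chunks for a share of share_size bytes (share_size >= 1)."""
      return (share_size + chunksize - 1) // chunksize   # ceiling division

  def chunk_span(i, share_size, chunksize):
      """(start, length) of chunk i within the share."""
      start = i * chunksize
      length = min(chunksize, share_size - start)
      return (start, length)

  # For example, a share of 250 units with chunksize 100 is stored as three
  # chunks covering 100, 100, and 50 units respectively.
  assert chunk_count(250, 100) == 3
  assert chunk_span(2, 250, 100) == (200, 50)
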
For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

  *Rationale*: this design allows the `chunksize` parameter to be changed for
  new shares written via a particular storage interface, without breaking
  compatibility with existing stored shares. All cloud storage interfaces
  return the sizes of cloud objects with requests to list objects, and so
  the size of the first chunk can be determined without an additional request.

The name of the cloud object for chunk `i` > 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM` will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.0` is omitted.

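For illustration, the object name could be computed as follows (a sketch;
the actual implementation's helper names may differ)::

  def chunk_object_name(storageindex, shnum, i):
      """Cloud object name for chunk i of a share, per the layout above."""
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      if i == 0:
          return prefix + "%d" % (shnum,)   # same name as the prototype layout
      return prefix + "%d.%d" % (shnum, i)

  assert chunk_object_name("abcdef", 3, 0) == "shares/ab/abcdef/3"
  assert chunk_object_name("abcdef", 3, 2) == "shares/ab/abcdef/3.2"
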
  *Rationale*: this layout maintains compatibility with data stored by the
  prototype S3 backend, for which Least Authority Enterprises has existing
  customers. This prototype always used a single cloud object to store each
  share, with name

    shares/`ST`/`STORAGEINDEX`/`SHNUM`

  By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
  the storage server can obtain a list of cloud objects associated with a given
  shareset without having to know the layout in advance, and without having to
  make multiple API requests. This also simplifies sharing of test code between the
  disk and cloud backends.

Mutable and immutable shares will be “chunked” in the same way.


Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.


Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.


Network usage—bandwidth and number-of-round-trips
-------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing cloud objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.

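A sketch of how the backend might follow S3's pagination when listing a
shareset; `list_objects` stands for a hypothetical adapter method wrapping
the “GET Bucket” request::

  def list_shareset(adapter, storageindex):
      """Yield the names of all cloud objects in one shareset."""
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      marker = None
      while True:
          # Hypothetical adapter call; S3 returns at most 1000 keys per request.
          names, truncated = adapter.list_objects(prefix=prefix, marker=marker)
          for name in names:
              yield name
          if not truncated:
              break
          marker = names[-1]   # resume the listing after the last key returned
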
If the share is to be read, the client will make a number of calls, each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that cloud object. The server may also speculatively
make GET requests for cloud objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.

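A read of `length` bytes at `offset` must be mapped onto the chunk objects
that overlap that span. A sketch, reusing the `chunk_object_name` helper from
the earlier sketch and assuming a hypothetical `get_object` adapter method
that accepts an inclusive byte range (sent as an HTTP Range header)::

  def read_span(adapter, storageindex, shnum, offset, length, chunksize):
      """Read [offset, offset+length) of a share from its chunk objects."""
      pieces = []
      first = offset // chunksize
      last = (offset + length - 1) // chunksize
      for i in range(first, last + 1):
          # Portion of the requested span that falls within chunk i,
          # expressed relative to the start of the chunk.
          start = max(offset, i * chunksize) - i * chunksize
          end = min(offset + length, (i + 1) * chunksize) - i * chunksize
          name = chunk_object_name(storageindex, shnum, i)
          # "Range: bytes=<start>-<end-1>" fetches only the needed bytes.
          pieces.append(adapter.get_object(name, byte_range=(start, end - 1)))
      return b"".join(pieces)
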
If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory, and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining cloud objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these cloud objects have not been deleted successfully.

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.

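A sketch of this cleanup step when a mutable share shrinks, again using the
hypothetical adapter methods and `chunk_object_name` helper from the sketches
above::

  def delete_stale_chunks(adapter, storageindex, shnum, old_count, new_count):
      """Delete the chunk objects that the shrunken share no longer uses."""
      for i in range(new_count, old_count):
          name = chunk_object_name(storageindex, shnum, i)
          # If this DELETE fails, a stale object remains; readers must
          # tolerate it, as noted above.
          adapter.delete_object(name)

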
Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an SQLite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.


Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that for each allocation,
the maximum chunk data memory usage is the lesser of two chunks and the allocation
size. (There is some additional overhead, but it is small compared to the chunk
data.) If the maximum memory usage of a new allocation would exceed the memory
available, the allocation can be delayed or possibly denied, so that the total
memory usage is bounded.

It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.

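As an illustration of the bounding rule for immutable allocations described
above (not actual Tahoe-LAFS code), the per-allocation budget and admission
check might look like::

  def allocation_budget(allocated_size, chunksize):
      """Upper bound on chunk-data memory for one immutable allocation:
      the lesser of two chunks and the allocated size."""
      return min(2 * chunksize, allocated_size)

  def admit_allocation(allocated_size, chunksize, in_use, memory_limit):
      """Admit a new allocation only if the total bounded usage stays under
      the limit; otherwise the server can delay or deny the allocation."""
      return in_use + allocation_budget(allocated_size, chunksize) <= memory_limit

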
Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the size of shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_. (For example, with the default
3-of-10 encoding, the shares for a file total roughly 10/3 ≈ 3.33 times
the size of the file.)


API usage
---------

Cloud storage backends typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.

Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

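The adapter interface might look roughly like the following sketch (the
prototype's S3Bucket is the model, but these method names are assumptions
made here for exposition)::

  class CloudInterfaceAdapter:
      """Thin layer over one cloud storage interface. It operates only on
      whole cloud objects (chunks); chunking and the mutable/immutable
      distinction are handled by the generic cloud backend."""

      def put_object(self, name, data):
          raise NotImplementedError

      def get_object(self, name, byte_range=None):
          raise NotImplementedError

      def delete_object(self, name):
          raise NotImplementedError

      def list_objects(self, prefix, marker=None):
          raise NotImplementedError
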
For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.

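A sketch of such a retry wrapper (illustrative only; the real storage server
is asynchronous, so the actual code would not block in `time.sleep`)::

  import time

  class TransientFailure(Exception):
      """Raised by an adapter for errors worth retrying, e.g. HTTP 5xx."""

  def with_retries(operation, max_tries=3, delay=1.0):
      """Call operation(), retrying on transient failures with backoff.
      For a PUT, the chunk contents are still in memory, so the whole
      request can simply be reissued."""
      for attempt in range(max_tries):
          try:
              return operation()
          except TransientFailure:
              if attempt == max_tries - 1:
                  raise                              # give up; report upward
              time.sleep(delay * (2 ** attempt))     # exponential backoff
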
There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.


Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require changing
the contents of multiple chunks, and if the client fails or is disconnected
during the operation, the resulting state of the stored cloud objects may be
inconsistent—no longer containing all of the old version, but not yet containing
all of the new version. A mutable share can be left in an inconsistent state
even by the existing Tahoe-LAFS disk backend if it fails during a write, but
that has a smaller chance of occurrence because the current client behavior
leads to mutable shares being written to disk in a single system call.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
(ticket #1755).


References
==========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

   https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

   https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/performance.rst
