Updated description of short strings and added description of short arrays

Updated record sizes in the cache section
digitalstain committed Nov 7, 2011
1 parent c205da7 commit 23fb825
Showing 4 changed files with 63 additions and 18 deletions.
16 changes: 8 additions & 8 deletions kernel/src/docs/ops/cache.txt
@@ -35,15 +35,15 @@ Each Neo4j storage file contains uniform fixed size records of a particular type
| Store file | Record size | Contents
| nodestore | 9 B | Nodes
| relstore | 33 B | Relationships
| propstore | 25 B | Properties for nodes and relationships
| stringstore | 133 B | Values of string properties
| arraystore | 133 B | Values of array properties
| propstore | 41 B | Properties for nodes and relationships
| stringstore | 128 B | Values of string properties
| arraystore | 128 B | Values of array properties
|============================================

For strings and arrays, where data can be of variable length, data is stored in one or more 120B chunks, with 13B record overhead.
For strings and arrays, where data can be of variable length, data is stored in one or more 120B chunks, with 8B record overhead.
The sizes of these blocks can actually be configured when the store is created using the `string_block_size` and `array_block_size` parameters.
The size of each record type can also be used to calculate the storage requirements of a Neo4j graph or the appropriate cache size for each file buffer cache.
Note that some strings can be stored without using the string store, see <<short-strings>>.
Note that some strings and arrays can be stored without using the string store or the array store respectively, see <<short-strings>> and <<short-arrays>>.
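
As a rough illustration of the calculation mentioned above, the following Python sketch multiplies record counts by the record sizes from the table (using the updated values of 41 B for property records and 128 B for string and array records). The function name, the dictionary layout and the example counts are invented for this sketch and are not part of Neo4j. It also assumes that a store file is simply record count times record size, ignoring any file headers.

```python
# Record sizes in bytes, taken from the table above (default block sizes).
RECORD_SIZES = {
    "nodestore": 9,
    "relstore": 33,
    "propstore": 41,
    "stringstore": 128,
    "arraystore": 128,
}


def estimate_store_sizes(record_counts):
    """Return the estimated size in bytes of each store file.

    Assumption: file size = number of records * record size.
    """
    return {
        store: count * RECORD_SIZES[store]
        for store, count in record_counts.items()
    }


# Hypothetical graph: the counts below are invented for illustration.
counts = {"nodestore": 1_000_000, "relstore": 5_000_000, "propstore": 3_000_000}
sizes = estimate_store_sizes(counts)
total_bytes = sum(sizes.values())
```

An estimate like this can serve as a starting point when choosing how much memory to give each file buffer cache.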

Neo4j uses multiple file buffer caches, one for each different storage file.
Each file buffer cache divides its storage file into a number of equally sized windows.
@@ -85,12 +85,12 @@ Configuration
Specifies the block size for storing strings.
This parameter is only honored when the store is created, otherwise it is ignored.
Note that each character in a string occupies two bytes, meaning that a block size of 120 (the default size) will hold a 60 character long string before overflowing into a second block.
Also note that each block carries an overhead of 13 bytes.
This means that if the block size is 120, the size of the stored records will be 133 bytes.
Also note that each block carries an overhead of 8 bytes.
This means that if the block size is 120, the size of the stored records will be 128 bytes.
| array_block_size |
Specifies the block size for storing arrays.
This parameter is only honored when the store is created, otherwise it is ignored.
The default block size is 120 bytes, and the overhead of each block is the same as for string blocks, i.e., 13 bytes.
The default block size is 120 bytes, and the overhead of each block is the same as for string blocks, i.e., 8 bytes.
| dump_configuration | `true` or `false` | If set to `true` the current configuration settings will be written to the default system output, mostly the console or the logfiles.
|========================================================
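
The block arithmetic described for `string_block_size` can be sketched as follows. This is only an illustration under stated assumptions: the helper name `string_store_usage` is hypothetical, the two bytes per character and the 8 byte overhead are taken from the text above, and the sketch assumes the string actually goes to the string store (it is not inlined as a short string) and that even an empty string would occupy one block.

```python
import math

CHAR_BYTES = 2      # each character occupies two bytes (from the text above)
BLOCK_OVERHEAD = 8  # per-block overhead in bytes (from the text above)


def string_store_usage(num_chars, block_size=120):
    """Return (blocks, bytes_on_disk) for a string kept in the string store."""
    payload = num_chars * CHAR_BYTES
    # Assumption: at least one block is used, even for an empty string.
    blocks = max(1, math.ceil(payload / block_size))
    return blocks, blocks * (block_size + BLOCK_OVERHEAD)
```

With the default block size, `string_store_usage(60)` yields one 128 byte record, while a 61 character string overflows into a second block.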

2 changes: 2 additions & 0 deletions kernel/src/docs/ops/index.txt
@@ -23,6 +23,8 @@ include::filesystem.txt[]

include::short-strings.txt[]

include::short-arrays.txt[]

include::io-examples.txt[]

include::linux-performance.txt[]
23 changes: 23 additions & 0 deletions kernel/src/docs/ops/short-arrays.txt
@@ -0,0 +1,23 @@
[[short-arrays]]
Compressed storage of short arrays
===================================

Neo4j will try to store your primitive arrays in a compressed way, so as to save disk space and possibly an I/O operation.
To do that, it employs a "bit-shaving" algorithm that tries to reduce the number of bits required for storing the members
of the array. In particular:

1. For each member of the array, it determines the position of the leftmost set bit.
2. It determines the largest such position among all members of the array.
3. It reduces all members to that number of bits.
4. It stores those values, prefixed by a small header.

That means that when even a single negative value is included in the array, the natural size of the primitives will be used, since a negative value has its most significant (sign) bit set.

There is a possibility that the result can be inlined in the property record if:

* It is less than 24 bytes after compression
* It has fewer than 64 members

For example, the array long[] {0L, 1L, 2L, 4L} will be inlined, as the largest entry (4) requires 3 bits to store, so the whole array is stored in 4*3 = 12 bits. The array long[] {-1L, 1L, 2L, 4L},
however, requires the whole 64 bits for the -1 entry, so it needs 64*4 = 256 bits = 32 bytes and will end up in the dynamic store.
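
The steps and the two examples above can be reproduced with a small Python sketch. It is not the Neo4j implementation: the function names are hypothetical, the size of the "small header" is not given in the text and is therefore left out of the size check, and empty or all-zero arrays are handled in the simplest way rather than in any documented way.

```python
def bits_per_member(values, natural_bits=64):
    """Largest 'leftmost set bit' position among all members."""
    mask = (1 << natural_bits) - 1
    # Masking yields the two's complement bit pattern, so a negative
    # value keeps its sign bit and needs all natural_bits bits.
    return max(((v & mask).bit_length() for v in values), default=0)


def payload_bits(values, natural_bits=64):
    """Bits needed for the shaved members (header NOT included)."""
    return len(values) * bits_per_member(values, natural_bits)


def may_be_inlined(values, natural_bits=64):
    """Apply the two conditions from the text to the payload only.

    Assumption: the header size is unknown here and is ignored.
    """
    payload_bytes = (payload_bits(values, natural_bits) + 7) // 8
    return payload_bytes < 24 and len(values) < 64
```

For `[0, 1, 2, 4]` this gives 3 bits per member and a 12 bit payload, and for `[-1, 1, 2, 4]` it gives 64 bits per member and a 256 bit (32 byte) payload, matching the examples in the text.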

40 changes: 30 additions & 10 deletions kernel/src/docs/ops/short-strings.txt
@@ -2,15 +2,35 @@
Compressed storage of short strings
===================================

Neo4j will classify your strings and store them accordingly.
If a string is classified as a short string it will be stored without indirection in the property store.
This means that there will be no string records created for storing that string.
Additionally, when no string record is needed to store the property, it can be read and written in a single lookup.
This leads to improvements in performance and lower storage overhead.
Neo4j will try to classify each of your strings into one of a number of short string classes.
If it manages that, the string is stored without indirection: it is inlined in the property record,
so the dynamic string store is not involved in storing that value, which reduces the disk footprint.
Additionally, when no string record is needed to store the property, it can be read and written in a single lookup,
leading to performance improvements.

For a string to be classified as a short string, one of the following must hold:
The various classes for short strings are:

* It is encodable in UTF-8 or Latin-1, 7 bytes or less.
* It is alphanumerical, and 10 characters or less (9 if using accented european characters).
* It consists of only upper case, or only lower case characters, including the punctuation characters space, underscore, period, dash, colon, or slash. Then it is allowed to be up to 12 characters.
* It consists of only numerical characters, inlcuding the punctuation characters plus, comma, single quote, space, period, or dash. Then it is allowed to be up to 15 characters.
* Numerical, consisting of digits 0..9 and the punctuation space, period, dash, plus, comma and apostrophe.
* Date, consisting of digits 0..9 and the punctuation space, dash, colon, slash, plus and comma.
* Uppercase, consisting of uppercase letters A..Z, and the punctuation space, underscore, period, dash, colon and slash.
* Lowercase, like Uppercase but with lowercase letters a..z instead of uppercase.
* E-mail, consisting of lowercase letters a..z and the punctuation comma, underscore, period, dash, plus and the at sign (@).
* URI, consisting of lowercase letters a..z, digits 0..9 and most punctuation available.
* Alphanumerical, consisting of both upper and lowercase letters a..zA..Z, digits 0..9 and punctuation space and underscore.
* Alphasymbolical, consisting of both upper and lowercase letters a..zA..Z and the punctuation space, underscore, period, dash, colon, slash, plus, comma, apostrophe, at sign, pipe and semicolon.
* European, consisting of most accented European characters and digits plus punctuation space, dash, underscore and period - like Latin 1 but with less punctuation.
* Latin 1
* UTF-8

In addition to the string's contents, the number of characters also determines whether the string can be inlined. Each class has its own character count limit:

* For Numerical and Date, 54
* For Uppercase, Lowercase and E-mail, 43
* For URI, Alphanumerical and Alphasymbolical, 36
* For European, 31
* For Latin 1, 27
* For UTF-8, 14

That means that the largest inline-able string is 54 characters long and must be of the Numerical class, and also that all strings of 14 characters or fewer will always be inlined.

Also note that the above limits are for the default 41 byte PropertyRecord layout; if that size is changed, the limits have to be recalculated.
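
To make the interplay between character set and length limit concrete, here is a small Python sketch covering only four of the classes listed above (Numerical, Date, Uppercase and Lowercase). The alphabets and limits are copied from the lists in this section. The function name, the idea of trying the classes in listed order, and treating the limits as inclusive are assumptions of this sketch, not a description of how Neo4j classifies strings internally.

```python
import string

# (class name, allowed characters, character count limit), from the lists above.
# Only a subset of the classes is modelled in this sketch.
SHORT_STRING_CLASSES = [
    ("Numerical", set(string.digits + " .-+,'"), 54),
    ("Date", set(string.digits + " -:/+,"), 54),
    ("Uppercase", set(string.ascii_uppercase + " _.-:/"), 43),
    ("Lowercase", set(string.ascii_lowercase + " _.-:/"), 43),
]


def classify_short_string(text):
    """Return the first modelled class that fits, or None.

    Assumption: classes are tried in the order listed, and a string
    fits when all characters are allowed and the length is within the limit.
    None only means "not one of the four modelled classes".
    """
    for name, alphabet, limit in SHORT_STRING_CLASSES:
        if len(text) <= limit and set(text) <= alphabet:
            return name
    return None
```

For instance, `classify_short_string("2011-11-07 12:30")` falls into the Date class in this sketch because the colon is not part of the Numerical alphabet, while a mixed-case string such as `"Hello"` returns `None` here and would need one of the classes that the sketch leaves out.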
