-
Notifications
You must be signed in to change notification settings - Fork 272
Segmented Sequence Tree
Optimization for segmented sequences was needed which did not store base sequence offset for
every character but at the same did not impact random access to characters via charAt(n)
.
Some known approaches are an Interval tree for overlapping multiple intervals,
Segment tree - Wikipedia or Binary search tree. All have
In practice, search
Another desired optimization is to have each BasedSequence
provide its segment information
efficiently to allow construction of a new segmented sequences consisting of subsequences of the
segmented sequence. Being able to extract segment information using the offset and length of the
subsequence from parent structure is highly desired and will reduce construction time of
segmented sequences.
The tree structure is immutable like the sequences it represents so beyond cost the of creation and querying there is no concern for insert/delete/copy operation efficiency.
With query time of
An array of aggregated lengths int[]
provides the binary tree structure. The middle entry is
the root node for the search.
Advantage for compact storage is significant especially for long sequences with few segments.
Current implementation used int
for every character in the sequence, tripling the native Java
character storage. This huge memory footprint makes use of SegmentedSequence
for accumulation
of text output from HtmlRenderer
and Formatter
problematic for large files. In tests the
limit for Formatter
was about 250k of markdown source. Larger and the operation would fail
with out of memory exception. Running smaller files would require over 6GB of memory.
Query penalty for sequential access can be mitigated by per thread caching of last used segment information.
If a char index is within cached segment then the segment can be re-used. If it is after the end of the segment, then the next segment can be tried. If it is before start of segment, then previous segment can be tried.
If there is no match then first and last segments should be checked to optimize sequential char access of sequences from start or end of the sequence.
These tests are a quick index into int[]
so do not incur a heavy penalty. Additionally, failed
tests are not wasted because they serve to narrow the range of segments which must be searched.
If the requested index is before the previous segment then binary search should be done on [0, prevSeg) and if it is greater than the next segment's aggregated length then search is narrowed to (nextSeg, endSeg).
Because all segments are already ordered, the aggregated length array can be generated by a single iteration of the segments, which is already done to build the segmented sequence. Additional variable to aggregate lengths creates no overhead either. Therefore building of the binary tree is effectively free.
Subsequences can use the parent binary tree or a sub-section of it by defining an offset and length of their portion of it, eliminating the overhead of searching extra segments and copying of data. Effectively no penalty to subsequences for re-use of parent search structure.
Construction of segments from parent binary tree by subsequences is another feature which speeds up overall tree based segmented sequence use.
The total storage overhead for segmented sequences is (4 bytes * 2 + per segment overhead) * number of segments
. 4 bytes per integer, one being the aggregated length and the other an
offset of the segment in the serialized segment byte array.
Out of base characters have additional overhead for character storage and all have access penalty for de-serializing the segment data. This is relatively painless because each segment contains little data needed for class construction. Character data is not de-serialized and character access is done directly from serialized byte data as characters are accessed.
Deserialization is only performed once the segment is positively identified as one to be used. Even when trying to determine if previous or next segment contains an index, binary tree search data is used not the serialized segment data.
Segments are stored in a byte array and deserialized from consecutive bytes as needed.
Each type of segment is represented by a separate class all extending SeqSeg
. In Kotlin these
would be a sealed class. SeqSeg
are only deserialized from a byte[]
. Serialization is
performed by SegmentBuilder
from a Seg
element.
Serialized data is optimized to minimize number of bytes taken by each segment.
- Lengths and offsets of <16 require no extra information and fit into the first byte identifying the segment.
- Lengths and offsets are stored in 1, 2, 3 or 4 byte format as needed to fit the data
- Out of base text will use 1 byte per character if all characters in the segment have code < 256
- Repeated characters only store the character and repeat count
- Repeated spaces and
\n
only need length and if <16 then a single byte encodes all the information.
Numeric values for offsets and length can have only positive values and 0, with length of any segment limited to Integer.MAX_VALUE/4, ie. 0.5 GB which is plenty for the required application.
Segment types:
-
BASE
: base segment representing a range of base sequence characters. 1 to 9 bytes, in practice probably hitting an average of 3-4 bytes.-
ANCHOR
: base segment representing a range of 0 span of base sequence characters, 1 to 5 bytes, average most likely 2-3 bytes.
-
-
TEXT
: segment representing characters out of base sequence, general case 1-5 bytes + 2 bytes per character-
REPEATED_TEXT
: segment representing a repeated character span, 3-7 bytes.-
REPEATED_SPACE
: repeated space, 1 to 3 bytes. Most will probably be 1 or 2 bytes. The latter can handle up to 255 spaces.-
REPEATED_EOL
: repeated\n
, 1 to 3 bytes. Most will probably be 1 or 2 bytes. The latter can handle up to 255 EOLs.
-
-
REPEATED_ASCII
: repeated character with code < 256, 2-6 bytes, most likely 2 bytes for repeat <16, 3 bytes up to 255 repeats.
-
-
TEXT_ASCII
: segment where all characters code < 256, 1 to 5 bytes plus 1 byte per character.
-
First byte of serialized segment carries the information about the specifics and serialized format of the rest of the bytes:
-
SEG_TYPE
:- 0b0000_0000:
ANCHOR
:startOffset
- 0b0010_0000:
BASE
:startOffset
,length
- 0b0100_0000:
TEXT
:length
:char[]
- 0b0110_0000:
REPEATED_TEXT
:char
,length
- 0b1000_0000:
TEXT_ASCII
:length
,byte[]
- 0b1010_0000:
REPEATED_ASCII
:byte
,length
, not needed and should be freed up for other uses. - 0b1100_0000:
REPEATED_SPACE
:length
- 0b1110_0000:
REPEATED_EOL
:length
- 0b0000_0000:
-
SEG_BYTES
: encodes number of bytes used for segment's numeric values.- 0b0001_xxxx: no size/offset bytes. If has startOffset then offset = xxxx, length = 0, if has only length then length = xxxx, not valid if has both length and start offset
- 0b0000_0000: 1 byte for start
- 0b0000_0001: 2 bytes for start
- 0b0000_0010: 3 bytes for start
- 0b0000_0011: 4 bytes for start
- 0b0000_0000: 1 byte for length
- 0b0000_0100: 2 bytes for length
- 0b0000_1000: 3 bytes for length
- 0b0000_1100: 4 bytes for length