Skip to content

Requirements for Binning Directories

Enrico Seiler edited this page Feb 4, 2019 · 8 revisions

log2(k)

Shapes

Consecutive Shapes

Definition

Given a text T with |T| = n and a k-mer size k, the set of consecutive k-mers S is defined as
S = {[T0, T1, ..., Tk-1], [T1, T2, ..., Tk], ..., [Tn-k, Tn-k+1, ..., Tn-1]}

Example

k = 5
T =   GACTACGTAGC
S = [ GACTA,
       ACTAC,
        CTACG,
         TACGT,
          ACGTA,
           CGTAG,
            GTAGC ]

Thresholding

To determine whether a pattern P with |P| = m occurs in the text T with up to e errors, the well known k-mer lemma can be applied: A text T of length n contains n-k+1 k-mers and a occurrence of pattern P with |P| = m shares at least m-(e+1)*k-1 k-mers with T.


Offset Shapes

Definition

Given a text T with |T| = n, a k-mer size k and an offset o, the set of offset k-mers S = [ [T_0, T_1, ..., T_k-1], [T_o, T_o+1, ..., T_o+k-1], [T_2o, T_2o+1, ..., T_2o+k-1], ... [T_ceil((n-k+1)/o)-o, ..., T_ceil(ceil((n-k+1)/o)-o+k].

Thresholding

Given a text T of length n, a k-mer size k and an offset o, T contains ok = ceil((n-k+1)/o) many k-mers and a occurrence of pattern |P| with |P|=m shares at least ok - e*ceil(k/o) many k-mers.

Example

k = 5
o = 3
T =   GACTACGTAGC
S = [ GACTA,
         TACGT,
            GTAGC ]

Details

When looking for a pattern P in text T, all k-mers of T must be compared to the offset k-mers of P, since we don't know if P aligns to T with a multiple of offset o. In theory, it would also work the other way around (storing the offset k-mers of T and searching all k-mers of T). Storing the offset k-mers of T would reduce the memory needed to store the k-mers of T, while it (depending on how you look the k-mers up, this is the case for Binning Directories) wouldn't speed up the querying. For Binning Directories we usually store all k-mers of T and look up the offset k-mers of P to speed up the querying (usually by about a factor of o).

k = 3
o = 2
T =     GACTACGTAGC
S_T = [ GAC,
          CTA,
            ACG,
              GTA,
                AGC ]
P =      ACTACG
S_P = [  ACT,
           TAC ]

Even though P is a substring of T, no k-mers match.

k = 3
o = 2
T =     GACTACGTAGC
S_T = [ GAC,
         ACT,
          CTA,
           TAC,
            ACG,
             CGT,
              GTA,
               TAG,
                AGC ]
P =      ACTACG
S_P = [  ACT,
           TAC ]

Now S_P is contained in S_T.


Minimizer Shapes

Definition

Given a text T with |T| = n, a k-mer size k and a window size w, the set of minimizing k-mers over T is the set of the lexicographically smallest k-mers in consecutive substrings of length w of T and its reverse complement.

Thresholding

Heuristically we can look at the minimizer coverage of a pattern P and infer some consequences for the amount of minimizers destroyed by an error.

Example

[...]

Gapped Shapes

Definition

We have a mask.

Thresholding

NP-hard. There are heuristics.

Example

[...]

Note

In SeqAn2 the computation is done rather naively. We always go over the text and compute the hash (ignoring 0s and looking at the 1s of the mask). Since we know the mask at compile time we can probably devise a multiplication by a specific value that masks the values we want to ignore?


Dump

API:

  • insert/construct
  • update
  • select

Bitvector

  • Uncompressed
  • Compressed
  • Chunked

Configuration:

  • If possible at compile time
  • Things like k, w, o may be only determined at runtime (user set)
  • Same for Bitvector config

Also:

  • Resize bitvector (add bins) - required by third party (Ganon)

Decisions:

  • Since it is a vector, we may not enforce b (number bins) to be a multiple of the word size (64). This may cause a slow down for look ups though (check if that's the case).
  • Check if it is beneficial to use Fibonacci Hashing instead of a modulo to hash a k-mer to a position in the Binning directory. In this context also check if we always (especially if we enforce above constraint) always do a modulo by a power of 2 - this should be optimised by the compiler to a bit operation, but since the blockSize (the modulo operand) is decided at runtime this may not be the case (check) - in which case we could hard code the modulo as bit operation. Still, in this case, the Fibonacci hashing may have some advantaged regarding the distribution of the values ... yet again we have relatively uniform values to hash (arithmetic encoding of the k-mers, which we can assume as uniformly distributed), so Fibonacci Hashing may not have benefits in this regard.

Terms:

  • bins: number of bins
  • Size: the size of the bitvector
  • Metadatasize: the size of the bits that are appended to the bitvector to store information like k, b and so on. Deprecated with the use of Cereal for serialisation.
  • binWidth: how many words do we need to represent the bins (64 bins=1 64-bit word, 70 bins=2 64-bit words)
  • blockBitSize: How long is a single block representing the bins (binWidth * wordsize)
  • noOfBlocks: How many blocks do we have available/number of hash values (size/blockbitsize)
Clone this wiki locally