Skip to content

Bitsets

rgchris edited this page Nov 29, 2016 · 2 revisions

This page describes the BITSET! datatype as defined in R3.

Note: these changes apply to the next alpha release, not the current version.

Table of Contents

Major Changes

Various changes have been made to the BITSET! datatype. Here is a quick summary:

  • Supports Unicode characters and strings.
  • Bitsets are variable length for all cases (even when used for charsets).
  • New datatype action functions and features are supported.
  • Bit order of binary representation has changed.

Documentation

The official documentation for bitsets can be found here: Bitset Datatype

It is being revised and updated. If you see missing information, please add it.

Unicode Support

As you would expect, bitsets now support Unicode.

This means that both CHAR! and STRING! values used with bitsets will use the Unicode code-point value for the bit number.

Note that Unicode code-point values can be much larger than 8 bits, and this can effect the size of your bitset. For example, if you create a bitset that includes Unicode characters up to 800, then the bitset will require 100 bytes of storage.

Note: In general practice, this should not be a problem because we don't expect a large number of large bitsets to be defined in most applications. That's why we do not use sparse arrays for storage. In addition sparse arrays add structural and computational overhead.

Variable Length

You have always been able to define bitsets of any reasonable length, but now the length is more automatic:

  • Bitsets are only as long as needed.
  • Virtual bits outside physical range are valid.
  • Bitsets will auto expand if necessary.
For example, in R2 if you created a bitset:
bits: make bitset! 160

a 160 bit set would be created. However, if you wrote:

space: make bitset! " ^-^/"

a 256 bitset would be created (because it assumed you wanted an 8-bit character-based bitset).

In R3, this is no longer true. A bitset is only as long as needed to hold its maximum bit value. The line:

space: make bitset! " ^-^/"

creates a 40 bit set that looks like this:

make bitset! #{0060000080}

The bitset is only as long as it needs to be.

If you check for a bit outside its range, such a request is valid:

if find space "a" [print "found"]

In R2, this line would have caused an error, but it is fine in R3. All virtual bits are zero.

In addition, a bitset will auto-extend as needed. For example:

append space ".:;"

will expand the space bitset to be:

make bitset! #{006000008002}

Again, it is only as long as needed to hold the largest bit.

The above examples show only string bit values, but these rules apply to bitsets that are accessed with CHAR! and INTEGER! bit values as well.

Bitset Datatype Actions

These actions are defined:

MAKE, TO
creates new bitsets
COPY
copies the bitset
FIND, PICK
test bit values
COMPLEMENT, NEGATE
invert bit values
APPEND, INSERT, POKE
set bit values
REMOVE
clears specific bit values
CLEAR
clears all bit values
LENGTH?
returns the number of bits
AND, OR, XOR
perform bitwise logic operations (and return new result)
In addition, SAME?, EQUAL?, and NOT-EQUAL? work for bitsets.

The ZERO? action always returns FALSE. (But should it return TRUE if no bits are set?)

Notes:

  • APPEND, INSERT, POKE will expand bitset if needed.
  • FIND is TRUE if it matches ANY of the bit values.
  • REMOVE requires /PART to specify the target of the removal action.
  • INTERSECT does the same thing as AND.
  • UNION does the same thing as OR.

Bitset Specs

Most of the series access functions above allow the bits in a bitset to be specified with:

  • CHAR! - such as #" "
  • INTEGER! - such as 123
  • STRING! - such as "abc"
  • BINARY! - only for MAKE and TO
  • BLOCK! - for combinations
A block allows multiple bits to be specified using integers, chars, and strings. For example:
bits: make bitset! [#"-" #"a" - #"z" "!@#$" 201 - 220]

Note that characters and integers can specify inclusive ranges.

Other examples:

append bits "ABC"
append bits [#"A" - #"Z"]
find bits "abc"
find bits [#"C" - #"X"]
remove/part bits "abc"
remove/part bits [#"a" - #"z"]

Bitset Binary Order

The binary representation for bitsets has changed. It is now left-to-right bit continuous (as if bits were written in binary).

For example, in R3:

>> make bitset! "abcd"
== make bitset! #{00000000000000000000000078}

but, in R2:

>> make bitset! "abcd"
== make bitset! #{
0000000000000000000000001E00000000000000000000000000000000000000
}

This change is not a requirement, so let Carl know immediately if you think it may cause an important incompatibility.

Possible Features

If there are any other possible features that you need for bitsets, please note them here (and use 4 ~ chars to indicate your name/date).

  • FIND will return TRUE if ANY char is listed. Do we want PICK to require ALL chars listed, or it could be a refinement of FIND such as FIND/ALL (or reverse logic FIND/ANY). -Carl 01:22, 16 January 2008 (EST)
  • Make sure we can INSERT (APPEND, POKE) a bitset into a bitset, for an in-place UNION. -BrianH 14:35, 16 January 2008 (EST)
  • Make sure we can REMOVE/PART a bitset from a bitset, for an in-place INTERSECT. -BrianH 14:35, 16 January 2008 (EST)
  • I'm asking if we could use the 'third func on a bitset to get a pointer on the binary series (like for struct!). It could limit the memory overhead when using to-binary bitset! or to-bitset binary!. Steeve 09:10, 17 January 2008 (EST)
Bitsets can be used to store a list of positive integers too (we can see them like a list of index):
Index: make bitset! 65536

insert Index 250

insert Index 8000

insert Index 0

Each integer take only one bit in the bitset. It's a very good solution to store huge list of integers.

In a block each integer take 16 bytes (correct me if i'm wrong). So, to distinguish if we must use a bitset in replacement of a block, we can do this calculation:

V = value of the biggest integer.
N = Number of integers.

If the equation V/8 < N*16 is true, we should use a bitset instead of a block of integers!.

All native operations like insert, find, remove, are much faster when acting on bitsets (instead of blocks).

The only one missing feature is a fast way to enumerate all stored integers. i can do this:

repeat n 65536 [
    if find Index n - 1 [print n - 1]
]

But it's slow. We should have a thing like:

foreach int Index [
    print int
]

Any idea ? Steeve 09:10, 17 January 2008 (EST)

Clone this wiki locally