# Compact Indexing and Composite Ranges

In my recent post "A short introduction to vaex" I presented a problem of processing a lot of data that requires a significant amount of indexing.

In the documentation I've read, databases stick to sending an array of booleans to communicate what the mask looks like.

```
> mask = [1010111010111010111010111010111010111010111...1110101]
> len(mask)
42648123
```

42,648,123 bits = 5,331,016 bytes = 5.3 MB. So I wondered: could I somehow shrink that?
Could, for example, a few ranges treated as sets describe the same index?

```
> mask1 = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...1 1 1 1]  # range(1, len(mask), step=2)
> mask2 = [     1     1     1     1     1     1     1 ... 1     ]  # range(6, len(mask), step=6)
> mask1 + mask2
[1 1 111 1 111 1 111 1 111 1 111 1 111 1 111...111 1 1]
> mask1 + mask2 == mask
True
```
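To make the size argument concrete, here is a minimal sketch that rebuilds a small mask from two ranges and computes the bytes a packed bitmask needs. The helper names `mask_bytes` and `mask_from_ranges` and the toy length `n = 24` are assumptions for illustration, and positions are treated as plain 0-based list indices.

```python
def mask_bytes(n_bits):
    """Bytes needed to store n_bits as a packed bitmask (rounded up)."""
    return (n_bits + 7) // 8


def mask_from_ranges(n, *ranges):
    """Build a 0/1 list of length n with a 1 at every index in any range."""
    mask = [0] * n
    for r in ranges:
        for i in r:
            mask[i] = 1
    return mask


n = 24  # toy size; the real mask above has 42,648,123 entries
mask = mask_from_ranges(n, range(1, n, 2), range(6, n, 6))
print("".join(map(str, mask)))
print(mask_bytes(42_648_123))  # 5331016
```

The two ranges need only six integers between them, however long the mask gets.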

*Proposal*

Assuming for now that all the world's number theorists are onto some sort of theory of ranges, it seems
plausible to treat my indexing problem as sets.

> Represents a set of integers. Can be called as Range(stop), Range(start, stop), or Range(start, stop, step); when step is not given it defaults to 1.
> Range(stop) is the same as Range(0,stop,1) and the stop value (just as for Python ranges) is not included in the Range values.

While it annoys me that `stop` is not included, I appreciate that `__len__` only returns a sensible result if `stop` is excluded when `start` defaults to zero: `len(Range(n))` is then exactly `n`.
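Python's built-in `range` shows the convention being borrowed here. The snippet only uses the standard `range`, not the proposed `Range` class.

```python
r = range(5)            # same as range(0, 5, 1)
print(list(r))          # [0, 1, 2, 3, 4]  (5 itself is excluded)
print(len(r))           # 5

odds = range(1, 12, 2)  # start, stop, step
print(list(odds))       # [1, 3, 5, 7, 9, 11]
print(len(odds))        # 6
```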

I'll need to make range functions behave like sets, such that the following arithmetic is valid:

```
A,B = range, range
A in B
A is subset of B
A is superset of B
A==B
A!=B
A < B
A <= B
A.intersect(B)
```
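As a reference for what those operations should return, here is a brute-force sketch on built-in ranges. The function names `is_subset` and `intersect_values` are hypothetical. A real implementation would presumably reason about start, stop and step directly instead of iterating.

```python
def is_subset(a, b):
    """True if every member of range a is also a member of range b."""
    return all(x in b for x in a)  # int membership in a range is cheap


def intersect_values(a, b):
    """Members common to both ranges, in ascending order."""
    return sorted(x for x in a if x in b)


sixes = range(0, 30, 6)
evens = range(0, 30, 2)
threes = range(0, 30, 3)

print(is_subset(sixes, evens))          # True
print(is_subset(evens, sixes))          # False
print(intersect_values(evens, threes))  # [0, 6, 12, 18, 24]
```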

I will also need a way of representing a collection of sets, even though the members of the collection are never
enumerated. For example:

```
A,B = range(1,n,3), range(6,n,6)
C = A+B  # composite set where values present in both A and B appear only once.
```

This is of course only interesting if I can verify using `__iter__`:

```
list(C) == sorted(set(A) | set(B))
```
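One way to get that iteration order without materialising either range is a lazy merge. This is a sketch under the assumption that every range has a positive step (so each one is already ascending). `iter_union` is a hypothetical helper name.

```python
import heapq


def iter_union(*ranges):
    """Yield the members of all ranges in ascending order, each only once."""
    last = None
    for value in heapq.merge(*ranges):  # lazy k-way merge of sorted inputs
        if value != last:               # skip values shared by several ranges
            yield value
            last = value


n = 20
A, B = range(1, n, 3), range(6, n, 6)
print(list(iter_union(A, B)))  # [1, 4, 6, 7, 10, 12, 13, 16, 18, 19]
```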

Since I'm including addition, it seems valid to use subtraction as well:

```
B = C-A
```
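A brute-force sketch of the subtraction shows one caveat: `C - A` only gives back `B` when `A` and `B` share no members. The name `subtract` is hypothetical.

```python
def subtract(values, r):
    """Members of values that are not members of range r, order preserved."""
    return [x for x in values if x not in r]


n = 20
A, B = range(1, n, 3), range(6, n, 6)  # disjoint
C = sorted(set(A) | set(B))
print(subtract(C, A) == list(B))       # True

E, F = range(0, 10, 2), range(0, 10, 3)  # share 0 and 6
G = sorted(set(E) | set(F))
print(subtract(G, E))                    # [3, 9], not all of F
```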

Multiplication and division, however, do not make sense, and neither does repetition or chaining of sets.

This leaves the problem of inferring the range object from a list of values.

```
> C = Range.infer([1,3,4,5,7,8,9,11,12])
> C
CompositeRange(Range(1,12,2),Range(4,13,4))
```
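How the inference should work is still open. One possible approach is a greedy search that repeatedly peels off the longest arithmetic progression starting at the smallest value not yet covered. The sketch below is only that: `infer_ranges` and its `min_len` threshold are assumptions, it returns built-in `range` objects plus a list of leftovers, and its running time grows at least quadratically with the input.

```python
def infer_ranges(values, min_len=3):
    """Greedily cover values with ranges; return (ranges, leftovers)."""
    present = set(values)    # every value; ranges are allowed to overlap
    uncovered = set(values)  # values no range has claimed yet
    ranges, leftovers = [], []
    while uncovered:
        start = min(uncovered)
        best = None
        # Try every step implied by a larger value and keep the longest run.
        for other in sorted(v for v in present if v > start):
            step = other - start
            last = start
            while last + step in present:
                last += step
            candidate = range(start, last + 1, step)
            if best is None or len(candidate) > len(best):
                best = candidate
        if best is not None and len(best) >= min_len:
            ranges.append(best)
            uncovered -= set(best)
        else:
            leftovers.append(start)  # no worthwhile pattern from this value
            uncovered.discard(start)
    return ranges, leftovers


ranges, leftovers = infer_ranges([1, 3, 4, 5, 7, 8, 9, 11, 12])
print(ranges)     # [range(1, 12, 2), range(4, 13, 4)]
print(leftovers)  # []
```

The `min_len` threshold matters: three-term progressions turn up by accident in sequences with no real periodicity, so a low threshold can produce ranges that save nothing.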

I'm sure that someone will try to test these ideas with some auto-incrementing sequence, only to see whether the memory footprint
of a CompositeRange is worse than that of the source numbers stored as a plain set.

```
> L = []
> i, j=1, 3
> for _ in range(12):
>     L.append(i)
>     i,j = j,i+j
> L
[1,3,4,7,11,18,29,47,76,123,199,322]
>
> C = Range.infer(L)
CompositeRange({1,3,4,7,11,18,29,47,76,123,199,322})
```

So having an escape hatch for when there is no obvious pattern to detect seems only sensible.
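A minimal sketch of such an escape hatch is a size comparison made after inference. The cost model is an assumption: three integers per range (start, stop, step), one per leftover value, compared against one per value in a plain set. `choose_representation` is a hypothetical name.

```python
def choose_representation(values, ranges, leftovers):
    """Keep the ranges only if they need fewer integers than the raw values."""
    cost = 3 * len(ranges) + len(leftovers)
    if cost < len(values):
        return "ranges"
    return "set"  # escape hatch: store the values as they are


periodic = [1, 3, 4, 5, 7, 8, 9, 11, 12]
print(choose_representation(periodic, [range(1, 12, 2), range(4, 13, 4)], []))  # ranges

irregular = [1, 3, 4, 7, 11, 18, 29, 47, 76, 123, 199, 322]
print(choose_representation(irregular, [], irregular))  # set
```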


### limitations

I'll hold off on including floating-point numbers because of round-off error.

```
A,B = Range(-4,4.8,0.1), Range(0, math.tau, math.pi)  # Not included!
```
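The round-off problem is easy to reproduce with plain floats, which is why exact membership tests on a float-stepped range would be unreliable. The snippet uses only the standard library.

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1  # ten steps of 0.1 "should" land on 1.0

print(total == 1.0)              # False: round-off accumulates
print(math.isclose(total, 1.0))  # True: only approximately equal

steps = [i * 0.1 for i in range(5)]
print(0.3 in steps)              # False: 3 * 0.1 is not exactly 0.3
```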

Whilst trying to stick to the attributes of Python's `range`, I refuse to implement `count`. Yes, `range` actually has a `count(x)` method. As a range is a set of numbers,
this would only ever return 0 or 1, which `x in A` already tells me. I see no point.
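For reference, this is how the built-in `range.count` behaves: it returns 0 for a value that is absent and never raises, so it carries the same information as a membership test.

```python
evens = range(0, 10, 2)

print(evens.count(4))          # 1
print(evens.count(5))          # 0
print(4 in evens, 5 in evens)  # True False
```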


Time to write some code...

```
> dir(range(1))
['__bool__', '__class__', '__contains__', '__delattr__',
 '__dir__', '__doc__', '__eq__', '__format__',
 '__ge__', '__getattribute__', '__getitem__', '__gt__',
 '__hash__', '__init__', '__init_subclass__', '__iter__',
 '__le__', '__len__', '__lt__', '__ne__',
 '__new__', '__reduce__', '__reduce_ex__', '__repr__',
 '__reversed__', '__setattr__', '__sizeof__', '__str__',
 '__subclasshook__', 'count', 'index', 'start',
 'step', 'stop']
```

```
> dir(set())
['__and__', '__class__', '__class_getitem__', '__contains__',
 '__delattr__', '__dir__', '__doc__', '__eq__',
 '__format__', '__ge__', '__getattribute__', '__gt__',
 '__hash__', '__iand__', '__init__', '__init_subclass__',
 '__ior__', '__isub__', '__iter__', '__ixor__',
 '__le__', '__len__', '__lt__', '__ne__',
 '__new__', '__or__', '__rand__', '__reduce__',
 '__reduce_ex__', '__repr__', '__ror__', '__rsub__',
 '__rxor__', '__setattr__', '__sizeof__', '__str__',
 '__sub__', '__subclasshook__', '__xor__', 'add',
 'clear', 'copy', 'difference', 'difference_update',
 'discard', 'intersection', 'intersection_update', 'isdisjoint',
 'issubset', 'issuperset', 'pop', 'remove',
 'symmetric_difference', 'symmetric_difference_update', 'union', 'update']
```

```
class CompositeRange(object):
    def __init__(self, start=0, stop=None, step=1, *args):
        if None not in (start, stop, step) and not args:
            pass  # it's a range
        if args and stop is None:
            pass  # it's a composite construct
        self.start = start
        self.stop = stop
        self.step = step

    # Named index, as on range; __index__ is a different protocol that takes no argument.
    def index(self, value):
        pass  # returns the index of a number or raises an error if it is absent

    def __bool__(self):
        pass  # returns True when non-empty.
```
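To show where this skeleton could go, here is a small working sketch of the composite part only. It is not the finished class: the name `CompositeRangeSketch` is hypothetical, it wraps built-in `range` objects with positive steps, and `__len__` and `index` simply walk the merged values, which a real implementation would want to avoid.

```python
import heapq


class CompositeRangeSketch:
    """Union of ascending built-in ranges, behaving like a sorted set."""

    def __init__(self, *ranges):
        if any(r.step < 1 for r in ranges):
            raise ValueError("this sketch only handles positive steps")
        self.ranges = tuple(r for r in ranges if len(r))  # drop empty ranges

    def __contains__(self, value):
        return any(value in r for r in self.ranges)

    def __iter__(self):
        last = None
        for value in heapq.merge(*self.ranges):  # lazy merge, ascending
            if value != last:                    # emit shared values once
                yield value
                last = value

    def __len__(self):
        return sum(1 for _ in self)  # brute force; counts distinct members

    def __bool__(self):
        return bool(self.ranges)     # non-empty if any member range is

    def index(self, value):
        for position, member in enumerate(self):
            if member == value:
                return position
        raise ValueError(f"{value!r} is not in range")


C = CompositeRangeSketch(range(1, 12, 2), range(4, 13, 4))
print(list(C))     # [1, 3, 4, 5, 7, 8, 9, 11, 12]
print(C.index(4))  # 2
```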


