# Data Indexed Arrays

## Sets

We've now seen several implementations of the Set (or Map) ADT.

![](images/implement.png)

| Implementation | Worst Case Runtime <br> `contains(x)` | Worst Case Runtime <br> `add(x)` | Notes |
| --- | --- | --- | --- |
| `ArraySet` | $\Theta(N)$ | $\Theta(N)$ | |
| BST | $\Theta(N)$ | $\Theta(N)$ | Operations on randomly generated trees take $\Theta(\log N)$ time, but it is not safe to assume that our tree is random |
| `2-3 Tree` | $\Theta(\log N)$ | $\Theta(\log N)$ | Very good idea, very hard to implement |
| LLRB | $\Theta(\log N)$ | $\Theta(\log N)$ | Maintains a bijection (1-1 mapping) with a `2-3 tree`. Hard to implement |

## Limits of Search Tree Based Sets

Our search-tree-based sets require items to be comparable:
* We need to be able to ask "is `X < Y`?", which is not meaningful for all types
    * Some types in Java don't implement the `Comparable` interface
* Could we somehow avoid the need for objects to be comparable?
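
To make the comparability requirement concrete, here is a minimal sketch. It assumes `java.util.TreeSet` as a stand-in for a search-tree-based set, and `Planet` is a hypothetical type invented for this illustration that deliberately does not implement `Comparable`.

```java
import java.util.TreeSet;

class ComparableDemo {
    // Hypothetical type that does NOT implement Comparable.
    static class Planet {
        final String name;

        Planet(String name) {
            this.name = name;
        }
    }

    // Returns true if the tree-based set fails when asked to order Planets.
    static boolean treeSetRejectsPlanet() {
        TreeSet<Planet> planets = new TreeSet<>();
        try {
            // The tree has no way to decide "is X < Y?" for these items.
            planets.add(new Planet("Mars"));
            planets.add(new Planet("Venus"));
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("TreeSet rejected Planet: " + treeSetRejectsPlanet());
    }
}
```

The code compiles, because `TreeSet` accepts any element type, but the failure only shows up at runtime when the tree tries to compare two items.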

Our search tree sets have excellent performance, but could it be better?
* $\Theta(\log N)$ is amazing.
    * A balanced tree holding 1 billion items has a height of only about 30
* Could we somehow do better than this?
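
The "height of about 30" figure can be checked with a short sketch. The helper name `levelsNeeded` is hypothetical; it counts how many times 1 must double to reach $N$, which is $\lceil \log_2 N \rceil$ and roughly the height of a balanced binary tree on $N$ items.

```java
class TreeHeightDemo {
    // Smallest h such that 2^h >= n (that is, ceil(log2(n)) for n >= 1).
    static int levelsNeeded(long n) {
        int h = 0;
        long capacity = 1;
        while (capacity < n) {
            capacity *= 2; // each extra level doubles how many items fit
            h++;
        }
        return h;
    }

    public static void main(String[] args) {
        // 2^29 = 536,870,912 is too small; 2^30 = 1,073,741,824 is enough.
        System.out.println(levelsNeeded(1_000_000_000L)); // prints 30
    }
}
```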

## Using Data as an Index

Create an array of booleans indexed by data!
* Initially, set all values to `false`
* When an item is added, set the entry at the corresponding index to `true`


In [None]:
DataIndexedIntegerSet diis;
diis = new DataIndexedIntegerSet();

![](images/false.png)

In [None]:
diis.add(0);
diis.add(5);
diis.add(10);
diis.add(11);

![](images/11.png)

## DataIndexedIntegerSet Implementation

In [None]:
public class DataIndexedIntegerSet {
    private boolean[] present;

    // Create a huge array of booleans (every entry is false by default)
    public DataIndexedIntegerSet() {
        present = new boolean[20000000];
    }

    // When we add an element, set the entry at that element's index to true
    public void add(int i) {
        present[i] = true;
    }

    // Report whether the entry at the given index is true or false
    public boolean contains(int i) {
        return present[i];
    }
}

For `contains(x)` and `add(x)`, the runtime is constant $\Theta(1)$!
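
As a rough, self-contained illustration of the same idea, the sketch below uses a hypothetical `BoundedIntSet` whose capacity is passed to the constructor (an assumption made here only to keep the demo small). Each operation is a single array access, which is why the cost does not depend on how many items are stored. It also shows that values outside the array's range cannot be stored at all.

```java
class BoundedIntSetDemo {
    // Hypothetical variant of the data-indexed set with a configurable capacity.
    static class BoundedIntSet {
        private final boolean[] present;

        BoundedIntSet(int capacity) {
            present = new boolean[capacity]; // every entry starts out false
        }

        void add(int i) {
            present[i] = true; // one array write, independent of set size
        }

        boolean contains(int i) {
            return present[i]; // one array read, independent of set size
        }
    }

    // True if looking up i fails because i falls outside the array.
    static boolean isOutOfRange(BoundedIntSet set, int i) {
        try {
            set.contains(i);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        BoundedIntSet set = new BoundedIntSet(16);
        set.add(0);
        set.add(5);
        set.add(10);
        set.add(11);
        System.out.println(set.contains(5));       // prints true
        System.out.println(set.contains(6));       // prints false
        System.out.println(isOutOfRange(set, 16)); // prints true
        System.out.println(isOutOfRange(set, -1)); // prints true
    }
}
```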

## Downside of This Approach

* Extremely wasteful of memory
    * To support checking the presence of every non-negative `int`, we need more than 2 billion booleans
* Need a way to generalize beyond integers to other data types
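
The memory cost in the first point can be estimated with a little arithmetic. The sketch below assumes one byte per `boolean` array element, which is typical of common JVMs but not guaranteed by the language, and the helper names are hypothetical.

```java
class BooleanArrayCost {
    // Number of distinct non-negative int values: 0 through Integer.MAX_VALUE.
    static long nonNegativeIntCount() {
        return (long) Integer.MAX_VALUE + 1; // 2^31
    }

    // Estimated size in GiB, ASSUMING one byte per boolean element.
    static long estimatedGiB(long slots) {
        long bytes = slots; // assumption: 1 byte per slot
        return bytes / (1L << 30);
    }

    public static void main(String[] args) {
        long slots = nonNegativeIntCount();
        System.out.println(slots);               // prints 2147483648
        System.out.println(estimatedGiB(slots)); // prints 2
    }
}
```

Note that a Java array's length is itself an `int`, so a single array cannot have $2^{31}$ entries. Covering every non-negative `int` would need more than one array, before even considering negative values.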