# DataIndexedEnglishWordSet

## Generalizing the DataIndexedIntegerSet Idea

Ideally, we want a data indexed set that can store any type. For starters, let's try storing Strings.

Suppose we want to add `cat`. The key question, and a first idea:
* What's the `cat`th element of a list?
* An idea: use the first letter of the word as an index
    * `a` = 1, `b` = 2, `c` = 3, ..., `z` = 26
    
The problems with this approach:
* Other words start with the same letter
    * `cat` and `chupacabra` collide with each other
* Can't store strings that start with a symbol
    * e.g. can't store `"=98yaesfsad"`
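
To make both problems concrete, here is a minimal sketch of the first-letter idea. The function name `first_letter_index` is hypothetical (it does not come from the notes), and raising an error for non-letter characters is just one possible way to handle them.

```python
def first_letter_index(word):
    """Map a word to 1..26 using only its first letter (a=1, ..., z=26)."""
    first = word[0]
    if not ("a" <= first <= "z"):
        # Assumption: reject anything that does not start with a-z.
        raise ValueError(f"cannot index {word!r}: first character is not a-z")
    return ord(first) - ord("a") + 1


print(first_letter_index("cat"))         # 3
print(first_letter_index("chupacabra"))  # 3, the same slot as "cat"
```

Calling `first_letter_index("=98yaesfsad")` raises a `ValueError` in this sketch, which mirrors the second problem above.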

## Avoiding Collisions

Use every letter of the word: treat each letter as a digit and multiply it by a power of 27.
* `a=1`, `b=2`, `c=3`, ..., `z=26`
* Thus the index for `"cat"` is:
    * `c = 3`
    * `a = 1`
    * `t = 20`

$$ (3 \times 27^2) + (1 \times 27^1) + (20 \times 27^0)  = 2234$$

![](images/cat.png)
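
The calculation above might be sketched in code as follows. The name `english_to_int` and the `base` parameter are illustrative assumptions, not part of the notes. The loop multiplies the running total by the base before adding each new letter, which yields the same sum of powers without computing any power explicitly.

```python
def english_to_int(word, base=27):
    """Treat each letter as a digit (a=1, ..., z=26) in the given base."""
    total = 0
    for ch in word:
        if not ("a" <= ch <= "z"):
            # Assumption: only lowercase a-z is supported in this sketch.
            raise ValueError(f"unsupported character {ch!r} in {word!r}")
        # Shift the digits seen so far one place left, then add the new digit.
        total = total * base + (ord(ch) - ord("a") + 1)
    return total


print(english_to_int("cat"))  # 2234
```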

Notice the strange pattern of powers (2, 1, 0). Why this specific pattern?

## The Decimal Number System vs. Our System for Strings

In the decimal number system, we have 10 digits: `0, 1, 2, 3, 4, 5, 6, 7, 8, 9`. If we want numbers larger than 9, we use a sequence of digits, where each digit is multiplied by a power of 10 that grows from right to left. For example, `7091` in base `10`:

$$ 7091_{10} = (7 \times 10^3) + (0 \times 10^2) + (9 \times 10^1) + (1 \times 10^0) $$

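Our technique for converting strings to numbers works the same way: each letter is a digit, and the base is 27 instead of 10. That is why the powers run 2, 1, 0 from left to right.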
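
As a rough illustration of the parallel, one small helper can evaluate both a decimal number and a word, given a list of digit values and a base. The helper name `digits_to_int` is hypothetical.

```python
def digits_to_int(digits, base):
    """Evaluate a sequence of digit values, most significant first."""
    total = 0
    for d in digits:
        total = total * base + d
    return total


print(digits_to_int([7, 0, 9, 1], 10))  # 7091
print(digits_to_int([3, 1, 20], 27))    # 2234, the digits of "cat"
```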

## Test Understanding

If we convert the word `bee` into a number using the powers-of-27 technique (recall `b = 2` and `e = 5`), we get:

$$ (2 \times 27^2) + ( 5 \times 27^1) + ( 5 \times 27^0) = 1598_{10} $$

## Uniqueness

As long as we pick a base $\ge$ 26, this algorithm is guaranteed to give each lowercase English word a unique number. For example, with a base of `27`, no word other than `bee` gets the number 1598.

In other words, it's guaranteed that we'll never have a collision.
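
The guarantee can be spot-checked by brute force. The sketch below is not a proof: it only enumerates every lowercase string of up to three letters and looks for two that map to the same number. The helper names `word_to_int` and `has_collision` are hypothetical, as is the choice of a three-letter limit.

```python
from itertools import product
from string import ascii_lowercase


def word_to_int(word, base):
    """Treat each letter as a digit (a=1, ..., z=26) in the given base."""
    total = 0
    for ch in word:
        total = total * base + (ord(ch) - ord("a") + 1)
    return total


def has_collision(base, max_len=3):
    """Return True if two strings of length 1..max_len share a number."""
    seen = set()
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            n = word_to_int("".join(letters), base)
            if n in seen:
                return True
            seen.add(n)
    return False


print(has_collision(27))  # False
print(has_collision(26))  # False
print(has_collision(25))  # True: "z" and "aa" both map to 26
```

With a base of 25, the single letter `z` (value 26) and the word `aa` (value $1 \times 25 + 1 = 26$) land on the same number, which shows why the base must be at least 26.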