# Hash Tables - Handling Collisions

## Resolving Ambiguity

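By the pigeonhole principle, collisions are inevitable: a Java `int` hash code can take only $2^{32}$ (about 4.3 billion) distinct values, but there are more possible items than that, so some distinct items must share a hash code.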

Example: hash code for `"moo"` and `"Nep"` might both be 718.
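A collision is easy to observe with Java's built-in `String.hashCode()`. The sketch below is only an illustration: the class name `CollisionDemo` and the helper `collide` are hypothetical, and the strings `"Aa"` and `"BB"` are a well-known colliding pair for Java's string hash, not the scheme that gives 718 above.

```java
// Illustrative sketch: two different strings with the same hash code.
class CollisionDemo {
    // True if a and b are different strings that share a hash code.
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());      // 2112
        System.out.println("BB".hashCode());      // 2112
        System.out.println(collide("Aa", "BB"));  // true
    }
}
```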

Suppose `N` items have the same numerical representation `h`. Instead of storing `true/false` in position `h`, store a "bucket" of these `N` items at position `h`.

![](images/nep.png)

How do we implement "bucket"?
* Conceptually the simplest way: LinkedList
* We can also use an ArrayList or an ArraySet
* We'll see that it doesn't matter what implementation we use

## The Separate Chaining Data Indexed Array

Each bucket in our array is initially empty. When an item `x` is added at index `h`, if the bucket `h`...
* ...is empty, create a new list containing `x` and store it at index `h`
* ...is already a list and `x` is not in there, add `x` to this list

We might call this a "separate chaining data indexed array"
* Bucket `#h` is a "separate chain" of all items that have the same hash code `h`
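The `add` rule above can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the class name `SeparateChainingDIA`, the capacity `NUM_BUCKETS`, and the helper `toyHash` (which uses the string's length as a stand-in hash code so the array stays small) are all hypothetical and are not the hash function used in the figures.

```java
import java.util.LinkedList;

// Sketch of a separate chaining data indexed array of strings.
class SeparateChainingDIA {
    private static final int NUM_BUCKETS = 32; // hypothetical capacity

    // Every bucket starts out empty (null).
    @SuppressWarnings("unchecked")
    private final LinkedList<String>[] buckets = new LinkedList[NUM_BUCKETS];

    // ASSUMPTION: string length stands in for a real hash code.
    private static int toyHash(String x) {
        if (x.length() >= NUM_BUCKETS) {
            throw new IllegalArgumentException("too long for this sketch");
        }
        return x.length();
    }

    public void add(String x) {
        int h = toyHash(x);
        if (buckets[h] == null) {
            buckets[h] = new LinkedList<>(); // empty bucket: create the list
        }
        if (!buckets[h].contains(x)) {       // only add x if it is not there yet
            buckets[h].add(x);
        }
    }

    public boolean contains(String x) {
        int h = toyHash(x);
        return buckets[h] != null && buckets[h].contains(x);
    }

    // Number of items chained in bucket h.
    public int bucketSize(int h) {
        return buckets[h] == null ? 0 : buckets[h].size();
    }
}
```

Under this toy hash, `"abomamora"` and `"adevilish"` both have 9 characters, so they end up chained together in bucket 9.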

Initially, all buckets are empty.

![](images/empty.png)

```java
add("a");
```

![](images/length.png)

```java
add("abomamora");
```

![](images/abomamora.png)

```java
add("adevilish");
```

![](images/adevilish.png)

```java
contains("adevilish");
// Java will look at all items in bucket 111239443 to see whether
// "adevilish" is present
```

## Separate Chaining Performance

| Worst Case Runtime | `contains(x)` | `add(x)` |
| --- | --- | --- |
| Bushy BSTs | $\Theta(\log N)$ | $\Theta(\log N)$ |
| DataIndexedArray | $\Theta(1)$ | $\Theta(1)$ |
| Separate Chaining Data Indexed Array | $\Theta(Q)$ | $\Theta(Q)$ |

`Q` = length of the longest list
* Searching a list of length $Q$ takes time proportional to $Q$ in the worst case, since we may have to examine every item

Observation: the worst case runtime is proportional to the length of the longest list.

Note that `add` is $\Theta(Q)$ instead of $\Theta(1)$! Isn't it constant, since we only need to add the element?
* No. We must first check whether the element is already in the list, which takes $\Theta(Q)$ time in the worst case.

## Saving Memory Using Separate Chaining

We don't need billions of buckets.

![](images/go.png)


## Saving Memory Using Separate Chaining and Modulus

We can take the hash code modulo the number of buckets (using the `%` operator) to reduce the bucket count.
* With 10 buckets, put each item in bucket `hashCode % 10`

![](images/hash.png)

Downside of this approach: many different hash codes now share a bucket, so the list in each bucket can get long.
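One practical detail when doing this reduction in Java: hash codes can be negative, and Java's `%` keeps the sign of the left operand, so `hashCode % 10` can come out negative. The sketch below swaps in `Math.floorMod` to avoid that; the class name `BucketIndex` and the helper `reduce` are hypothetical names chosen for illustration.

```java
// Sketch: reducing a hash code to a bucket index.
class BucketIndex {
    // Reduce any int hash code to a valid index in [0, numBuckets).
    static int reduce(int hashCode, int numBuckets) {
        return Math.floorMod(hashCode, numBuckets);
    }

    public static void main(String[] args) {
        System.out.println(reduce(23487628, 10)); // 8
        System.out.println(-7 % 10);              // -7: not a valid index
        System.out.println(reduce(-7, 10));       // 3
    }
}
```

For non-negative hash codes, `Math.floorMod(hashCode, 10)` and `hashCode % 10` give the same result.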

## The Hash Table

What we've just created is called a **hash table**.
* Data is converted by a **hash function** into an integer representation called a **hash code**
* The **hash code** is then **reduced** to a valid **index**
    * Usually using the modulus `%` operator
        * e.g. `23487628 % 10 = 8`
        
![](images/caveat.png)
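Putting the pieces together (hash function, hash code, reduction to an index, and a bucket per index), a hash table can be sketched as below. This is a minimal sketch, assuming a fixed number of buckets and non-null items; the class name `SimpleHashSet` and the helper `indexOf` are hypothetical, and `Math.floorMod` is used instead of plain `%` so that negative hash codes still reduce to a valid index.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch of a fixed-size hash table that stores a set of items.
class SimpleHashSet<T> {
    private final List<LinkedList<T>> buckets = new ArrayList<>();

    SimpleHashSet(int numBuckets) {
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // Hash function -> hash code -> reduced to a valid index.
    private int indexOf(T x) {
        int hashCode = x.hashCode();
        return Math.floorMod(hashCode, buckets.size());
    }

    public void add(T x) {
        LinkedList<T> bucket = buckets.get(indexOf(x));
        if (!bucket.contains(x)) { // scan the chain before adding
            bucket.add(x);
        }
    }

    public boolean contains(T x) {
        return buckets.get(indexOf(x)).contains(x);
    }
}
```

For example, after `SimpleHashSet<String> s = new SimpleHashSet<>(10); s.add("moo");`, the call `s.contains("moo")` returns `true` and `s.contains("Nep")` returns `false`.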