# L5c: Associative and Uniqueness Collections: Conceptual Overview
Let's explore the fundamental concepts behind associative collections, such as dictionaries, and uniqueness-based collections, such as sets. These data structures are essential for efficient data management and retrieval in many applications. 

We have already made use of these collections in our previous work, but now we will delve deeper into their underlying mechanisms and how they can be effectively utilized.

> __What are they?__ Associative collections, such as dictionaries, allow us to store key-value pairs where each key is unique. This enables fast lookups, insertions, and deletions based on the key. Uniqueness-based collections, such as sets, store only unique elements and provide efficient membership tests and set operations (union, intersection, and difference).

Let's dive into the details of these collections, starting with dictionaries and then moving on to sets.

## Dictionaries
A dictionary `D{K, V}` associates each key of type `K` with a value of type `V`, enabling average-case constant-time $\mathcal{O}(1)$ lookup, insertion, and deletion.
* _Secret sauce_: Under the hood, dictionaries use **hashing** to turn each key into an index into an array (whose elements are called buckets). When we look up a key, the hash function is applied to the key to compute the index, and the value stored at that index is retrieved. In short, a dictionary is an array plus a hashing mechanism!
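
Before looking at the hashing itself, here is a minimal sketch of the three key-based operations (insert, lookup, delete) on a Julia `Dict`. The variable name `inventory` and its contents are made up for illustration.

```julia
# A small dictionary mapping item names (keys) to counts (values)
inventory = Dict{String, Int}()

inventory["apples"] = 3     # insert
inventory["pears"] = 7
inventory["apples"] = 5     # same key again: the old value is overwritten, keys stay unique

shares = inventory["pears"]                  # lookup by key
missing_shares = get(inventory, "kiwis", 0)  # lookup with a default for an absent key
delete!(inventory, "pears")                  # delete by key

println(inventory)                           # only "apples" => 5 remains
```

Each of these calls hashes the key to find the bucket, which is why none of them needs to scan the whole collection.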

Let's look at the classic $\texttt{djb2}$ (also known as "times 33") string-hash algorithm, originally written by Daniel J. Bernstein. This is a simplified version of what happens in a dictionary.

__Initialize__: Given a string $s$ and a table of size $n$, set the seed $R\gets{5381}$ and $\text{index}\gets{0}$.

1. Compute the length of the string: $L \gets \text{len}(s)$
2. Compute the integer values of the character array: $C \gets [\texttt{CodePoint}(c)\;\big|\;c \in s]$
3. For $i = {0}\;\text{to}\; L-1$ __do__:
   - Set $R\gets\underbrace{\left[(R << 5) + R\right]}_{=R \times 33} + C_{i}$ 
4. Return the index: $\text{index} \gets (R \mod n + n) \mod n$.

Let's implement this logic, test it with a few example strings, and then walk through what is going on.

In [1]:
index, teststring = let

    # initialize -
    string_to_hash = "This is a test string";
    L = length(string_to_hash);
    C = collect(string_to_hash) .|> x -> Int(x); # convert to Int
    R = 5381; # a large prime number
    n = 10000; # number of buckets in the hash table

    # hash function -
    for i ∈ 1:L
        R = (R << 5) + R + C[i]; # (R << 5) + R equals R * 33; then add the character code
    end
    index = (R % n + n) % n;

    index, string_to_hash # return
end

(7329, "This is a test string")
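
The last step, `(R % n + n) % n`, can look redundant. The sketch below shows why it is there, assuming Julia's default 64-bit `Int`: for strings longer than about ten characters the running value `R` exceeds the `Int64` range and silently wraps around, possibly to a negative number, and Julia's `%` keeps the sign of its left operand. The value `-7` is just a stand-in for a wrapped hash.

```julia
n = 10
R = -7   # stand-in for a hash value that wrapped around to a negative number

@assert R % n == -7             # `%` is `rem`: the result takes the sign of R
@assert (R % n + n) % n == 3    # adding n before the second `%` gives an index in 0:(n-1)
@assert mod(R, n) == 3          # `mod` takes the sign of n, so it does the same in one call

# 64-bit integer arithmetic overflows silently instead of raising an error
@assert typemax(Int64) + 1 == typemin(Int64)
```

Either form yields a zero-based index in `0:(n-1)`; `mod(R, n)` is simply the more compact way to write it in Julia.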

### What is going on with the bitwise operation `R << 5`?
The expression `R << 5` is a bitwise left shift operation. In this case, it shifts the bits of the integer `R` to the left by 5 positions. This is equivalent to multiplying `R` by $2^{5} = 32$ (as long as the result does not overflow). Let's show this in action with a simple example.

> __Idea__: If the `<<` operator is new to you, think of it as a fast way to multiply an integer by a power of two. For example, `x << 1` is the same as `x * 2`, `x << 2` is the same as `x * 4`, and so on. Thus, `R << 5` is the same as $R\times{2^5} = R\times{32}$. So, `1 << 5` is the same as `1 * 32`, which equals `32`.

Let's check this out in Julia:

In [2]:
@assert 1 << 5 == 32 # if true, you'll see no output. If incorrect, you'll see an error.
@assert 2 << 5 == 64 # if true, you'll see no output. If incorrect, you'll see an error.

So why write `(R << 5) + R` instead of multiplying by 33 directly? Historically, a shift and an add were cheaper than an integer multiplication on many processors, which mattered in code that runs as often as a hash function. Modern compilers usually make this substitution automatically, so today the shift form is largely a convention inherited from the original algorithm.
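
As a quick sanity check, the following sketch confirms that the shift-and-add form agrees with multiplication by 33 for a few sample values (chosen arbitrarily), and uses `bitstring` to show the bits moving.

```julia
# The shift-and-add form matches multiplication by 33
for R in (0, 1, 7, 5381, 177_682)
    @assert (R << 5) + R == 33 * R
end

# Binary view on an 8-bit integer: the single 1 bit moves five places to the left
println(bitstring(Int8(1)))        # 00000001
println(bitstring(Int8(1) << 5))   # 00100000
```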

### What is a collision?
A collision occurs when two different keys hash to the same index in the dictionary. How hash tables resolve collisions is a more advanced topic, but it's pretty easy to see how they arise.
> __Intuition__: with many possible keys and only a small number of buckets, some keys must share an index. For example, `cat` and `mat` land in the same bucket when the hash table has only 10 buckets. There is just not enough space to give every key its own index.

Let's see what happens when we hash the strings `cat` and `mat` with a small number of buckets (N = 10) in the hash table.

In [3]:
index, teststring = let

    # initialize -
    string_to_hash = "mat"; # now change this to "cat" and rerun - what happens?
    L = length(string_to_hash);
    C = collect(string_to_hash) .|> x -> Int(x); # convert to Int
    R = 5381; # a large prime number
    N = 10; # number of buckets in the hash table (make it small to force collisions)

    # hash function -
    for i ∈ 1:L
        R = (R << 5) + R + C[i]; # (R << 5) + R equals R * 33; then add the character code
    end
    index = (R % N + N) % N;

    index, string_to_hash # return
end

(5, "mat")
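
To compare both strings without editing and rerunning the cell, the same logic can be wrapped in a function. This is a sketch: `djb2_index` is a hypothetical helper name, and it uses `mod(R, n)`, which gives the same result as `(R % n + n) % n`.

```julia
# djb2 as a reusable function (hypothetical helper, same logic as the cell above)
function djb2_index(s::AbstractString, n::Int)
    R = 5381
    for c in s
        R = (R << 5) + R + Int(c)   # R * 33 + character code
    end
    return mod(R, n)                # zero-based bucket index in 0:(n-1)
end

# With 10 buckets, "cat" and "mat" collide at index 5
@assert djb2_index("cat", 10) == djb2_index("mat", 10) == 5

# With 10000 buckets, the same two strings land in different buckets
@assert djb2_index("cat", 10_000) == 8125
@assert djb2_index("mat", 10_000) == 9015
```

The collision at 10 buckets is no accident: `'m'` and `'c'` differ by exactly 10 code points, and that difference is multiplied by $33^2$ before the final reduction, so the two hash values differ by a multiple of 10.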

___

## Sets
Now that we understand dictionaries, we can explore sets. Sets leverage the same hash-table machinery as dictionaries, but discard the `value` and only track keys. Sets are interesting for a few reasons:
* __Unique elements__: When you add an element, its hash code points to a slot. If the slot is empty, the element is stored; if it already contains an equal key, the insertion is skipped, ensuring uniqueness. If the slot holds a different key (a collision), the table's collision-resolution logic looks for another slot, so uniqueness does not depend on the hash function being collision-free.
* __Membership tests__: To test membership, the element’s hash code points to a slot; if that key occupies it, the element is in the set. This runs in average-case $\mathcal{O}(1)$ time.
* __Under the hood__: In Julia, `Set{T}` is a thin wrapper around `Dict{T, Nothing}`; in Python, `set` is a standalone C-level hash table that holds only keys. Both rely on the same ingredients as dictionaries (hash functions, bucket arrays, and collision-resolution logic) to deliver average-case $\mathcal{O}(1)$ performance for inserts, deletes, and lookups.
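
To make the "dictionary without values" idea concrete, here is a rough sketch that mimics a set with a plain `Dict{Char, Nothing}`. It is an illustration of the idea only, not Julia's actual `Set` source code; the name `seen` is made up.

```julia
# Mimic a set with a dictionary whose values carry no information
seen = Dict{Char, Nothing}()
for c in "hello"
    seen[c] = nothing        # inserting 'l' a second time just overwrites the same key
end

@assert length(seen) == 4                  # 'h', 'e', 'l', 'o'
@assert haskey(seen, 'e')                  # a membership test is a key lookup
@assert !haskey(seen, 'z')
@assert Set(keys(seen)) == Set("hello")    # same elements as a real Set
```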

Let's build a set and explore its operations. Consider the `s::Set{Char}` example:

In [4]:
s = let

    s = Set{Char}(); # empty at this point
    push!(s, 'a'); # we add items to set using push!
    push!(s, 'b');
    push!(s, 'c');
    push!(s, 'd');
    push!(s, 'e');
    push!(s, 'e'); # notice: we try to add 'e' again, but it won't change the set since sets do not allow duplicates.

    s
end

Set{Char} with 5 elements:
  'a'
  'c'
  'd'
  'e'
  'b'

__Why was the character `'e'` not added twice?__ 

In a set, adding an element that already exists does not change the set. This is because sets are designed to hold only unique elements by using a hash table to track the elements. 

When we try to add `'e'` again, the set checks if it already exists in the hash table and finds that it does, so it does not add it again.
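
Beyond insertion, sets support the membership tests and set operations mentioned at the start of this section. A short sketch using Julia's built-in functions, with two small example sets `a` and `b` chosen for illustration:

```julia
a = Set(['a', 'b', 'c', 'd', 'e'])
b = Set(['d', 'e', 'f'])

@assert 'c' ∈ a                                  # membership test
@assert 'z' ∉ a
@assert union(a, b) == Set("abcdef")             # elements in a or b
@assert intersect(a, b) == Set(['d', 'e'])       # elements in both
@assert setdiff(a, b) == Set(['a', 'b', 'c'])    # elements in a but not in b
@assert length(push!(copy(a), 'e')) == 5         # adding a duplicate leaves the size unchanged
```

Note that sets are compared by their elements, not by insertion order, which is also why the printed order of a `Set` can differ from the order in which items were added.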