# Example: Fun with Hashing Functions
This example familiarizes students with the concept of [hash functions](https://en.wikipedia.org/wiki/Hash_function), i.e., functions that map data to an index of a fixed-size table. In particular, we'll look at the `myhash(...)` implementation below, which is an example of a polynomial rolling hash function (evaluated using Horner's method) that computes a hash value for a string $s$ with characters $c_1, c_2, \ldots, c_n$ using the formula:
$$
\begin{align*}
h(s) & = \left(\sum_{i=1}^{n} c_i \cdot \beta^{n-i}\right)~\text{mod}~{m}
\end{align*}
$$
where $c_i$ is the ASCII value (more generally, the Unicode code point) of the $i$-th character of the string $s$, $\beta$ is a base parameter (typically a prime number like 31), the operator $\text{mod}$ denotes the [modulo operation](https://en.wikipedia.org/wiki/Modulo), and $m$ is the size of the hash table. The function returns an index $h(s)\in\{0, 1, \ldots, m-1\}$. The size parameter $m$ strongly influences the likelihood of a collision: the fewer indices there are, the more keys must share them.
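
As a quick sanity check of the formula, here is a worked example for the key `cat` with $\beta = 31$ and $m = 1000$ (values picked here for illustration; they match the defaults of `myhash(...)` below). The ASCII codes are $c_1 = 99$ (`c`), $c_2 = 97$ (`a`), and $c_3 = 116$ (`t`):

$$
\begin{align*}
h(\texttt{cat}) & = \left(99\cdot 31^{2} + 97\cdot 31^{1} + 116\cdot 31^{0}\right)~\text{mod}~1000 \\
 & = \left(95139 + 3007 + 116\right)~\text{mod}~1000 \\
 & = 98262~\text{mod}~1000 = 262
\end{align*}
$$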

> __What is a collision?__: A collision occurs when the hash function produces the same index for different keys. How we respond to collisions is a topic for later, but it's easy to see how they happen: there are far more possible keys than buckets, so some keys must share an index. For example, with $\beta = 31$ the keys `cat` and `mat` hash to different indices when $m = 1000$ (262 and 872) but to the same index (2) when $m = 10$.

### Task Overview
* **Task 1**: Compute the hashcode of a string using the `myhash(...)` function to understand how dictionaries convert keys into array indices. We'll demonstrate the basic mechanics of hash computation with a sample string.
* **Task 2**: Explore hash collisions by testing how different hash table sizes affect collision frequency. We'll systematically examine substrings to observe how smaller table sizes increase the likelihood of different keys producing the same hash value.

Hashing is at the core of many data structures, including dictionaries and sets, and is a fundamental concept in computer science. Understanding how hashing works will help you understand how these data structures efficiently store and retrieve data.

So let's get started!
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

The [include command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

In [1]:
include("Include.jl");

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Implementations
The `myhash(...)` function implements a polynomial rolling hash using Horner's method to efficiently compute hash values for strings. 

__What is going on in this function?__

The function takes a string `key` and iterates through each character, building up the hash value incrementally by multiplying the current hash by the base parameter `β` (default 31) and adding the ASCII value of the current character. At each step, the modulo operation ensures the hash value stays within the bounds of the hash table size `size`. This approach processes characters from left to right, treating the string as a polynomial where each character's ASCII value serves as a coefficient and `β` acts as the base. 

The choice of 31 as the default base is common in hash function design because it's a prime number that provides good distribution properties while being computationally efficient (31 = 32 - 1, allowing for optimization via bit shifting in some implementations).
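
The shift identity mentioned above is easy to check. This is a minimal sketch that assumes values small enough not to overflow; the helper name `times31` is hypothetical and is not used anywhere else in this notebook:

```julia
# Hypothetical helper: multiply by 31 using a shift and a subtraction,
# relying on 31 = 32 - 1 = 2^5 - 1.
times31(h::Integer) = (h << 5) - h

# Compare against ordinary multiplication for a few small values
for h ∈ (0, 1, 7, 227, 98262)
    @assert times31(h) == 31 * h
end

times31(227)  # 7037
```

In `myhash(...)` the base `β` is a runtime parameter rather than a hard-coded constant, so the code simply multiplies.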

In [2]:
"""
    function myhash(key::String; β::Int64 = 31, size::Int64 = 1000) -> Int64

The `myhash` function computes a hash value for a given `key` string. 
The hash value is computed using the `β` and `size` parameters. The function returns the hash value as an `Int64`.

### Arguments
- `key::String`: A string to be hashed.
- `β::Int64`: A prime number used in the hash computation. Default is `31`.
- `size::Int64`: The size of the hash table. Default is `1000`.

### Returns
- `Int64`: The hash value for the given `key` string.

"""
function myhash(key::String; β::Int64 = 31, size::Int64 = 1000)::Int64

    # initialize -
    hash = 0

    # main loop -
    for c ∈ key
        # Horner step: multiply the running hash by β, add the character code, reduce mod size
        hash = mod(hash*β + convert(Int, c), size);
    end

    # return -
    return hash
end;
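
To connect the loop above back to the formula for $h(s)$, the sketch below evaluates the sum term by term and compares it with the incremental recurrence. The names `direct_polyhash` and `horner_polyhash` are hypothetical helpers written for this comparison (the second restates the `myhash` loop so the snippet runs on its own); `powermod` is Julia's built-in modular exponentiation.

```julia
# Hypothetical helper: evaluate (Σ c_i β^(n-i)) mod m term by term.
function direct_polyhash(s::String; β::Int64 = 31, m::Int64 = 1000)::Int64
    n = length(s)  # number of characters
    total = 0
    for (i, c) ∈ enumerate(s)
        # powermod(β, n - i, m) is β^(n-i) mod m, computed without overflow
        total = mod(total + Int(c) * powermod(β, n - i, m), m)
    end
    return total
end

# Hypothetical helper: the same recurrence used in myhash (Horner's method).
function horner_polyhash(s::String; β::Int64 = 31, m::Int64 = 1000)::Int64
    h = 0
    for c ∈ s
        h = mod(h * β + Int(c), m)
    end
    return h
end

# The two evaluations agree on a few sample keys
for s ∈ ["cat", "mat", "CHEME-143-M2-Fall-2026"]
    @assert direct_polyhash(s) == horner_polyhash(s)
end

direct_polyhash("cat")  # 262
```

The recurrence needs only one multiplication and one addition per character, whereas the term-by-term version computes a separate power of `β` for every character.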

### Horner's Method vs. djb2 Hash Function

You might be wondering how our `myhash(...)` implementation compares to other popular hash functions like $\texttt{djb2}$ that we discussed earlier. Both are polynomial rolling hash functions, but they have some key differences worth understanding.

> __djb2 Algorithm__: The $\texttt{djb2}$ hash function starts from the seed value 5381 and applies the update `hash = hash * 33 + c` for each character, where `c` is the character's ASCII value. It's known for its simplicity and good distribution properties. The choice of 33 as the multiplier is somewhat arbitrary but has proven effective in practice.

> __Our implementation (Horner's method with an explicit modulus)__: Our `myhash(...)` function uses the same kind of update, which is Horner's method for evaluating a polynomial, but it starts from 0, uses a configurable base parameter `β` (defaulting to 31) instead of a fixed multiplier, and applies the modulo operation at each step so that the running value stays below the table size and does not overflow.

The key differences are therefore the starting value (5381 versus 0), flexibility (our implementation lets you adjust both the base parameter and the table size), and how the result is reduced (typical $\texttt{djb2}$ implementations let a fixed-width integer wrap around and map the result to a table index afterwards, whereas `myhash(...)` reduces modulo the table size at every step). $\texttt{djb2}$ does less work per character, while our implementation makes the link to the polynomial formula explicit and offers more control over hash distribution through parameter tuning.
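
For comparison, here is a minimal sketch of a 32-bit $\texttt{djb2}$ variant in Julia. It assumes the commonly cited form of the algorithm (seed 5381, multiplier 33) and hashes the string's bytes; the names `djb2` and `djb2_index` and the choice of `UInt32` are assumptions made for this illustration, not code from the course package.

```julia
# Sketch of djb2 (assumed 32-bit variant): h = h * 33 + c, starting from 5381.
# UInt32 arithmetic in Julia wraps around modulo 2^32, which plays the role
# that the explicit mod(x, size) plays in myhash.
function djb2(s::String)::UInt32
    h = UInt32(5381)
    for b ∈ codeunits(s)              # iterate over the bytes of the string
        h = h * UInt32(33) + UInt32(b)
    end
    return h
end

# Hypothetical helper: map the 32-bit hash to a table index in 0:(m - 1)
djb2_index(s::String, m::Int64)::Int64 = Int64(mod(djb2(s), UInt32(m)))

djb2("a")  # 177670 (displayed by Julia as 0x0002b606)
```

Unlike `myhash(...)`, this version performs a single reduction to the table size at the very end, in `djb2_index`.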

___

## Task 1: Computing the hashcode of a String
One of the interesting (and amazing!) things about the [Dictionary type in Julia](https://docs.julialang.org/en/v1/base/collections/#Base.Dict) or [Python](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) is the ability to map a `key` to any value, where each `key` is unique. But how does this work?

> __Secret sauce:__ Behind the scenes, we use a `hash(...)` function to take a `key` and convert it into an integer index into an `Array{typeof(value),1}` that holds the value. Thus, the magic of a dictionary is just a clever way of computing an array index.
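
A minimal sketch of that idea, under some loud simplifying assumptions: `toy_hash`, `toy_index`, `table`, and `m` are hypothetical names, the hash is a compact stand-in for `myhash`, and collisions are ignored entirely (a later key would silently overwrite an earlier one in the same slot). Julia's real `Dict` is considerably more sophisticated.

```julia
# Hypothetical stand-in for myhash: polynomial hash with base β, reduced mod m
toy_hash(key::String; β::Int64 = 31, m::Int64 = 1000) =
    foldl((h, c) -> mod(h * β + Int(c), m), key; init = 0)

# The hash lies in 0:(m - 1); Julia arrays are 1-based, so shift by one
toy_index(key::String, m::Int64) = toy_hash(key; m = m) + 1

m = 1000
table = Vector{Union{Nothing, Float64}}(nothing, m)  # empty slots hold `nothing`

# "Insert": compute the index from the key, then write the value into that slot
table[toy_index("cat", m)] = 3.14

# "Lookup": recompute the same index from the key and read the slot
table[toy_index("cat", m)]  # 3.14
```

Both the insert and the lookup cost one hash computation plus one array access, regardless of how many slots the table has.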

Let's dig a little deeper. Specify a `my_test_string::String` with some data in it:

In [3]:
my_test_string = "This is a test string. In lecture ... wow!";

Next, let's use the `myhash(...)` function to compute the hashcode of the value stored in the `my_test_string::String` variable.

In [4]:
test_hashed_value = myhash(my_test_string, β = 31, size = 1000) # big size

227

## Task 2: OK, but does this trick always work?
The short answer is __no__! Sometimes, there may be collisions when generating the hashcode of a `key`. When the size parameter $m$ is small, there is a higher likelihood that two strings will get mapped to the same hashcode.

Let's check this out with another example. 

In [5]:
another_test_string = "CHEME-143-M2-Fall-2026";

We compute a hashcode for the `another_test_string::String` variable using the `myhash(...)` function with the default base `β = 31`, but with a deliberately small table size, `size = 10`.

In [6]:
another_hashed_value = myhash(another_test_string, β = 31, size = 10)

8

### Hmmmm. Can we generate a collision?
Let's chop up `another_test_string::String` and hash _each substring_ with small and large values of the `size::Int64` parameter. 

> __What do we expect?__ We expect that with small values of the size parameter, we will see collisions. However, as size becomes larger, the frequency of collisions should decrease.

Let's check out this intuition by writing a simple collision detection algorithm.

__What's happening in the code below?__

The collision detection algorithm tests for hash collisions by creating progressively longer substrings of our test string and computing their hash values. We iterate through each position from 1 to the string length, creating substrings that grow longer with each iteration (e.g., "C", "CH", "CHE", "CHEM", etc.). Each substring gets hashed using our `myhash(...)` function with a small table size (`size = 10`) to increase collision probability.

> __Test logic:__ For each new hash value, we check if any previously computed substring produced the same hash value. If so, we've found a collision between two different strings that map to the same index. We maintain a dictionary `hash_values` that stores each substring and its corresponding hash value, allowing us to identify exactly which strings collided when we find duplicate hash values.

This approach demonstrates how hash collisions occur in practice and shows that even with a well-designed hash function, collisions become inevitable when the hash table size is small relative to the number of keys being stored: here, 22 prefixes must share only 10 indices.

In [7]:
hash_values = let
    
    # initialize -
    N = length(another_test_string)
    hash_values = Dict{String, Int64}();
    collision_found = false;
    m = 10; # size of the hash table, can be adjusted
    
    for i ∈ 1:N
        
        test_string = another_test_string[1:i]
        test_value = myhash(test_string, β = 31, size = m)
    
        # Check if this hash value has been seen before with a different string
        for (existing_string, existing_hash)  ∈ hash_values
            if existing_hash == test_value && existing_string != test_string
                println("We have a collision: '$(existing_string)' and '$(test_string)' both hash to $(test_value)")
                collision_found = true;
            end
        end
        
        # Store this string and its hash value
        hash_values[test_string] = test_value;
    end

    hash_values;
end;

We have a collision: 'CH' and 'CHEME-' both hash to 9
We have a collision: 'CHE' and 'CHEME-1' both hash to 8
We have a collision: 'CHEME-143-M' and 'CHEME-143-M2' both hash to 3
We have a collision: 'CHE' and 'CHEME-143-M2-' both hash to 8
We have a collision: 'CHEME-1' and 'CHEME-143-M2-' both hash to 8
We have a collision: 'CHEME-143-M2-' and 'CHEME-143-M2-F' both hash to 8
We have a collision: 'CHE' and 'CHEME-143-M2-F' both hash to 8
We have a collision: 'CHEME-1' and 'CHEME-143-M2-F' both hash to 8
We have a collision: 'CHEM' and 'CHEME-143-M2-Fa' both hash to 5
We have a collision: 'CHEME-143-M2' and 'CHEME-143-M2-Fal' both hash to 3
We have a collision: 'CHEME-143-M' and 'CHEME-143-M2-Fal' both hash to 3
We have a collision: 'CHEME-143' and 'CHEME-143-M2-Fall' both hash to 1
We have a collision: 'CHEME-143-' and 'CHEME-143-M2-Fall-' both hash to 6
We have a collision: 'CHEME-143-M2-Fall-' and 'CHEME-143-M2-Fall-2' both hash to 6
We have a collision: 'CHEME-143-' and 'CHEME-143-
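
Part of the reason collisions are so frequent here is specific to the parameter choice: $31~\text{mod}~10 = 1$, so every power of $\beta$ is congruent to 1 and the hash reduces to the sum of the character codes modulo 10. The sketch below checks that observation and counts collisions for a given table size. The helper names (`polyhash`, `prefixes`, `count_collisions`) are hypothetical, and `polyhash` restates the `myhash` loop so the snippet runs on its own.

```julia
# Hypothetical stand-in for myhash (same recurrence, same defaults)
function polyhash(key::String; β::Int64 = 31, size::Int64 = 1000)::Int64
    h = 0
    for c ∈ key
        h = mod(h * β + Int(c), size)
    end
    return h
end

# All prefixes of an ASCII string: "C", "CH", "CHE", ...
prefixes(s::String) = [s[1:i] for i ∈ 1:length(s)]

# Number of prefixes that land in a bucket already used by an earlier prefix
function count_collisions(s::String, m::Int64)::Int64
    hashes = [polyhash(p; size = m) for p ∈ prefixes(s)]
    return length(hashes) - length(unique(hashes))
end

s = "CHEME-143-M2-Fall-2026"

# With β = 31 and m = 10 the hash is just the character-code sum mod 10
@assert all(polyhash(p; size = 10) == mod(sum(Int, p), 10) for p ∈ prefixes(s))

count_collisions(s, 10)     # 13: the 22 prefixes occupy only 9 of the 10 buckets
count_collisions(s, 1000)   # compare with a larger table
```

A table size $m$ for which $\beta~\text{mod}~m \neq 1$ (for example $m = 11$) avoids this particular degeneracy, although collisions remain unavoidable whenever there are more keys than buckets.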

## Summary
In this activity, we explored the fundamental concept of hash functions by implementing and testing a polynomial rolling hash function that converts strings into array indices. We discovered how the `myhash(...)` function uses Horner's method to efficiently compute hash values by iterating through string characters and building up a polynomial representation with base parameter β = 31.

Through practical experimentation, we demonstrated the core mechanics of how dictionaries work under the hood by computing hash values for sample strings. We then investigated the collision phenomenon by systematically testing substrings with different hash table sizes, confirming that smaller table sizes lead to higher collision frequencies while larger tables provide better distribution of hash values.

This hands-on exploration revealed the fundamental trade-off in hash table design between memory usage and collision avoidance, providing insight into why choosing appropriate table sizes and hash functions is crucial for efficient dictionary implementations.