# Chapter 5: hash tables

Hash tables are a data structure that allows values to be looked up by key in O(1) time on average.

## Hash functions

Put simply, a hash function is a function that maps an input (here, a string) to a number.  The mapping must be consistent: the same string always produces the same number.  Ideally, different strings also produce different numbers, although in practice this can't be guaranteed.  The number is used as an index into the array where the data are stored, which is what makes O(1) lookups possible: because the hash function generates the same number for the same string, we can go right to that position in the array and access the data without any searching involved.
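
To make the idea concrete, here is a minimal sketch of a toy hash function.  The name `toy_hash` and the table size of 8 are assumptions made for illustration; this is not how Python hashes strings internally.

```python
def toy_hash(key, table_size=8):
    """Toy hash: sum the character codes, then wrap into the table's index range."""
    total = sum(ord(ch) for ch in key)
    return total % table_size

# The same input always gives the same slot index (consistency)
print(toy_hash('apple') == toy_hash('apple'))

# Different strings are not guaranteed different slots:
# anagrams have the same character sum, so they share a slot
print(toy_hash('ab'), toy_hash('ba'))
```

Summing character codes is easy to follow but ignores the order of the characters, which is why a real hash function does something more elaborate.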

In Python, dictionaries are implemented as hash tables.

In [4]:
# Create new hash table for groceries
book = dict()

# Add a few items and prices
book['apple'] = 0.67
book['milk'] = 1.49
book['avocado'] = 1.49

for key, val in book.items():
    print(f'Item: {key}\nPrice: ${val}\n')

Item: apple
Price: $0.67

Item: milk
Price: $1.49

Item: avocado
Price: $1.49



Just out of curiosity, I want to see the numeric values that are generated with the hash function for each key in the dictionary.

In [5]:
for key in book.keys():
    print(f'Item: {key} --> {hash(key)}')

Item: apple --> 2153814947804476626
Item: milk --> -589866124921558863
Item: avocado --> -1750040582741651925
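
Two things are worth noting about these numbers.  By default, Python salts string hashes per process, so the values will differ from run to run (they are stable within a single run).  They are also far too large to be array indexes on their own.  The sketch below shows one simple way a raw hash could be reduced to a slot index with the modulo operator; the table size of 8 is a hypothetical value, and CPython's real dictionary uses a more involved scheme.

```python
table_size = 8  # hypothetical number of slots

for key in ['apple', 'milk', 'avocado']:
    raw = hash(key)          # large signed integer, stable within one run
    slot = raw % table_size  # with a positive divisor, % yields 0..table_size-1
    print(f'{key}: slot {slot}')
```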


## Exercises

For each question, describe whether or not each of the hash functions is consistent.

**5.1) `f(x) = 1`**

This would be a pretty bad hash function because although it's consistent, it just returns 1 for whatever `x` is, so we would not be able to discern the output from one string versus another.

**5.2) `f(x) = rand()`**

This wouldn't be a good hash function either, because it isn't consistent: it returns a different output each time the same string is used as input.

**5.3) `f(x) = next_empty_slot()`**

This would also be a pretty lousy hash function because it isn't consistent: the output depends on which slot happens to be empty rather than on the string itself, so the same string would get a different mapping each time.

**5.4) `f(x) = len(x)`**

This hash function is consistent, because it gives you the same output for a given string, but there's no way to discern two strings that are the same length.
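
These answers can be spot-checked with a rough sketch.  The helper `is_consistent` and the `f_*` names are hypothetical, and `next_empty_slot()` is modeled here with a simple counter, which is an assumption about how such a function would behave.  A check like this can only reveal inconsistency; passing it doesn't prove a function is consistent for every input.

```python
import random
from itertools import count

_slots = count()  # stands in for a table that fills up one slot at a time

def f_constant(x):
    return 1

def f_random(x):
    return random.random()

def f_next_empty_slot(x):
    return next(_slots)  # depends on the table's state, not on x

def f_length(x):
    return len(x)

def is_consistent(func, samples=('apple', 'milk'), trials=5):
    """Return True if func gave one output per input across repeated calls."""
    return all(len({func(s) for _ in range(trials)}) == 1 for s in samples)

for func in (f_constant, f_random, f_next_empty_slot, f_length):
    print(f'{func.__name__}: consistent = {is_consistent(func)}')
```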

## Use cases

Here, we'll simulate a phone book using a hash table.

In [1]:
# Initialize hash table
phone_book = dict()

# Populate
phone_book['jenny'] = 8675309
phone_book['emergency'] = 911

In [3]:
# Access Jenny's number
print(f"Jenny's phone number is: {phone_book['jenny']}")

Jenny's phone number is: 8675309


Implement a hash table to keep track of who has voted.  First, we'll initialize the hash table and prompt the user to type their first and last name (capitalization doesn't matter because we'll convert the name to lower case to avoid any mismatches).  Then, we'll check to see if the user's name is in the `voted` dictionary keys.  If it isn't already there, the user hasn't voted, so we'll add them to the dictionary and allow them to vote.  However, if their name is in the keys, we'll let them know that they've already voted.

In [11]:
# Initialize hash table
voted = {}
print('Please enter your first and last name, separated by a space.')
print('To quit, type "q"\n')

while True:
    
    user = input('First and last name: ' )
    
    if user == 'q':
        break
    else: 
        if user.lower() not in voted:
            voted[user.lower()] = True
            print(f'Thanks for voting, {user.title()}!\n')

        else: 
            print(f'{user.title()}-- you already voted!\n')

Please enter your first and last name, separated by a space.
To quit, type "q"

First and last name: stan piotrowski
Thanks for voting, Stan Piotrowski!

First and last name: frodo baggins
Thanks for voting, Frodo Baggins!

First and last name: samwise gamgee
Thanks for voting, Samwise Gamgee!

First and last name: han solo
Thanks for voting, Han Solo!

First and last name: stan piotrowski
Stan Piotrowski-- you already voted!

First and last name: q


In [12]:
# Get the names of the people that voted
print('The following people have voted:')
for name in voted.keys():
    print(f'{name.title()}')

The following people have voted:
Stan Piotrowski
Frodo Baggins
Samwise Gamgee
Han Solo


We can write this a slightly different way using the `get()` dictionary method, which returns `None` when the key isn't in the dictionary.

In [16]:
# Initialize new hash table
voted = {}
print('Please enter your first and last name, separated by a space.')
print('To quit, type "q"\n')

while True:
    
    user = input('First and last name: ')
    
    if user == 'q':
        break
    else: 
        user_info = voted.get(user.lower())  # None if this name hasn't voted
        if user_info is None:
            voted[user.lower()] = True
            print(f'Let {user.title()} vote!\n')
        else: 
            print(f'{user.title()} has already voted-- kick them out!\n')

Please enter your first and last name, separated by a space.
To quit, type "q"

First and last name: stan piotrowski
Let Stan Piotrowski vote!

First and last name: frodo baggins
Let Frodo Baggins vote!

First and last name: samwise gamgee
Let Samwise Gamgee vote!

First and last name: han solo
Let Han Solo vote!

First and last name: han solo
Han Solo has already voted-- kick them out!

First and last name: q


## Collisions

Collisions happen when multiple keys are mapped to the same slot in a hash table.  This can happen due to a bad hash function, and in general when this happens, the key-value pairs that share a slot are stored in a linked list at that slot.
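
Here is a minimal sketch of that idea, usually called chaining.  The class name `ChainedTable` is hypothetical, a Python list stands in for the linked list, and the hash (string length) is deliberately weak so that a collision is easy to trigger.

```python
class ChainedTable:
    """Tiny hash table that resolves collisions by chaining."""

    def __init__(self, size=4):
        # One bucket per slot; a Python list stands in for a linked list
        self.buckets = [[] for _ in range(size)]

    def _slot(self, key):
        # Deliberately weak hash (string length) so collisions are easy to see
        return len(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # new key: add it to the chain

    def get(self, key, default=None):
        # Walk the chain in this key's slot until the key is found
        for existing_key, value in self.buckets[self._slot(key)]:
            if existing_key == key:
                return value
        return default

table = ChainedTable()
table.put('milk', 1.49)
table.put('kiwi', 0.50)  # same length as 'milk', so it lands in the same slot
print(table.get('kiwi'))
print(table.buckets)
```

Both items remain retrievable, but finding `'kiwi'` means stepping past `'milk'` first, which is where the extra lookup cost comes from.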

## Performance

In reality, hash lookups are not instantaneous; rather, their average-case time complexity is O(1), meaning constant time.  This means that the lookup time stays the same regardless of the size of the hash table.  In the worst case, however, hash table operations take O(n) time, and how often that happens is influenced by the hash function and the load factor.

The load factor measures how full the hash table is: roughly, the average number of keys stored at each index of the underlying array.  More information about the load factor can be found [here](https://www.geeksforgeeks.org/load-factor-and-rehashing/).  If the load factor is low, there are relatively few keys at each index, so we rarely have to traverse a linked list.  If the load factor is high, we may have to traverse a lengthy linked list, which at worst takes O(n) time.

Hash tables store data in arrays and we can calculate the load factor by dividing the number of items in the hash table by the total number of positions in the array.  For example, the load factor for the following 3-element array `[ ,30, ]` would be 1/3, because it holds 1 item and has 3 positions in total.
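
The same calculation as a short sketch, assuming `None` marks an empty position and each position holds at most one item (the function name `load_factor` is my own):

```python
def load_factor(slots):
    """Fraction of positions that hold an item; None marks an empty position."""
    occupied = sum(1 for slot in slots if slot is not None)
    return occupied / len(slots)

# The 3-element example from the text: one item, three positions
print(load_factor([None, 30, None]))
```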

When the load factor is greater than 1, that means that there are more items in the hash table than slots in the array.  In these cases, we can increase the size of the array (resizing) and then re-hash the contents of the array.  The textbook recommends resizing when the load factor is greater than 0.7.
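
Resizing and re-hashing might look something like the sketch below.  The class name `ResizingTable`, the starting size of 4, and the choice to double the array are all assumptions; only the 0.7 threshold comes from the text.  Collisions are handled with lists, as in a chained table.

```python
class ResizingTable:
    """Chained hash table that doubles its array when the load factor passes 0.7."""

    MAX_LOAD = 0.7  # threshold taken from the text

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]
        self.count = 0  # number of items stored

    def load_factor(self):
        return self.count / len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite, item count unchanged
                return
        bucket.append((key, value))
        self.count += 1
        if self.load_factor() > self.MAX_LOAD:
            self._resize()

    def get(self, key, default=None):
        for existing_key, value in self.buckets[hash(key) % len(self.buckets)]:
            if existing_key == key:
                return value
        return default

    def _resize(self):
        old_items = [pair for bucket in self.buckets for pair in bucket]
        # Doubling is an assumed growth policy
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for key, value in old_items:
            # Re-hash: the slot depends on the new array length
            self.buckets[hash(key) % len(self.buckets)].append((key, value))

table = ResizingTable()
for i, name in enumerate(['apple', 'milk', 'avocado', 'pear', 'kiwi']):
    table.put(name, i)
    print(f'{name}: {len(table.buckets)} slots, load factor {table.load_factor():.2f}')
```

Every item has to be re-inserted because its slot is computed from the array length, so a larger array changes where each key belongs.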

The other piece to hash table performance is the hash function, which will ideally distribute the keys evenly throughout the array to avoid collisions.  A bad hash function will be biased to produce the same hashes for different strings, which will force us to store the keys that map to the same slot in a linked list.
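
One rough way to compare hash functions is to count how many keys each hash value receives.  In the sketch below, the helper `bucket_sizes` and the three candidate functions are illustrative assumptions, and the hash values are used directly as bucket labels rather than reduced to array indexes.

```python
from collections import Counter

def bucket_sizes(words, hash_func):
    """Count how many words each hash value receives."""
    return Counter(hash_func(word) for word in words)

words = ['apple', 'avocado', 'orange', 'pear',
         'banana', 'kiwi', 'peach', 'mango']

candidates = {
    'constant': lambda s: 1,
    'length': len,
    'first character': lambda s: s[0],
}

for label, func in candidates.items():
    sizes = bucket_sizes(words, func)
    # The largest bucket is the longest chain a lookup might have to walk
    print(f'{label}: {len(sizes)} buckets used, largest holds {max(sizes.values())}')
```

The more buckets a function uses and the smaller its largest bucket, the less time lookups spend walking a chain.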

## Exercises

For this set of exercises, we need to compare four different hash functions: 1) return "1" for all input; 2) use the length of the string as the index; 3) use the first character of the string as the index; and 4) map every letter to a prime number, sum the integers and modulo the hash size.  To compare each of these in the exercise questions, let's build each of the hash functions.

In [31]:
# Return 1 for all input
def constant_hash(string):
    """Hash function that returns 1 for all input."""
    return 1

In [32]:
# Test
assert constant_hash('some_string') == 1, 'Hash should be 1.'

In [35]:
# Return the length of the string as the index
def length_hash(string):
    """Hash function that returns the length of the string as the index."""
    return len(string)
        

In [36]:
# Test 
assert length_hash('some_string') == 11, 'Hash should be 11.'

In [38]:
# Use the first character of the string as the index
def first_char_hash(string):
    """Hash function that returns the first character of the string as the index."""
    return string[0]

In [43]:
# Test
assert first_char_hash('some_string') == 's', 'Hash should be s.'

# Test with a series of inputs
strings = ['apple', 'avocado', 'orange', 'pear', 
          'banana', 'kiwi', 'peach', 'mango']

string_dict = {}
for string in strings: 
    dict_key = first_char_hash(string) # generate hash
    if string_dict.get(dict_key) is None:
        # if not in the dictionary, create a list as the value and add the string to it
        string_dict[dict_key] = [string]
    else: 
        # If it's already in the dictionary, append the string to the value list
        string_dict[dict_key].append(string)
        
# Print the contents of the dictionary
print('Contents of hash table:')
for key, val in string_dict.items():
    print(f'Key: {key}\nValues: {val}\n')

Contents of hash table:
Key: a
Values: ['apple', 'avocado']

Key: o
Values: ['orange']

Key: p
Values: ['pear', 'peach']

Key: b
Values: ['banana']

Key: k
Values: ['kiwi']

Key: m
Values: ['mango']



In [69]:
# Define function to check for prime numbers
def check_prime(num):
    """
    Check if a number is prime or not.
    A prime number is a number greater than 1 that is only divisible by 1 and itself.
    If a number is prime, we return True.
    If it is not prime, we return False.
    """
    
    prime_flag = 0
    
    if num > 1:
        for i in range(2, int(num / 2) + 1):
            if num % i == 0:
                prime_flag = 1 # switch the prime flag if divisible by any number other than 1 or itself
                break
        if prime_flag == 0:
            return True # if prime flag hasn't switched, it's prime
        else: 
            return False
                
    else: 
        return False

In [84]:
# Prime number hash function
def prime_num_hash(string, hash_size):
    """
    Map every letter to a prime number, sum all numbers, modulo the size of the hash.
    """
    
    # Build letters list
    letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 
              'h', 'i', 'j', 'k', 'l', 'm', 'n', 
              'o', 'p', 'q', 'r', 's', 't', 'u', 
              'v', 'w', 'x', 'y', 'z']
    
    # Build list of primes
    primes = []
    num = 2  # 2 is the smallest prime, so start checking there
    
    # Collect exactly one prime per letter
    while len(primes) < len(letters):
        if check_prime(num):
            primes.append(num)
        num += 1
    
    # Build letter hash table
    lookup_dict = {}
    for letter, prime_num in zip(letters, primes):
        lookup_dict[letter] = prime_num
    
    # Create list of integers from lookup dictionary
    index_list = []
    for s in string:
        index_list.append(lookup_dict.get(s))
    
    # Return single index hash
    index = sum(index_list) % hash_size
    return index

In [86]:
# Test with hash size of 10
# b=3, a=2, g=17 --> (3 + 2 + 17) % 10 == 2
assert prime_num_hash('bag', 10) == 2, 'Hash should be 2.'

Now that we've built all of the hash functions, we want to find the best one for each of the examples below.  The best hash function will provide a good distribution, such that the keys aren't all mapped to the same position in the hash table.  

**5.5) Phonebook where the keys are names and values are phone numbers.  The names are Esther, Ben, Bob, and Dan.**

Right off the bat, the first hash function `constant_hash()` will always map all of the names to the same slot.

Now let's look at the second hash function, `length_hash()`.

In [90]:
names = ['esther', 'ben', 'bob', 'dan']
names_dict = {}

for name in names:
    name_hash = length_hash(name)
    if names_dict.get(name_hash) is None:
        names_dict[name_hash] = [name]
    else:
        names_dict[name_hash].append(name)
        
for key, val in names_dict.items():
    print(f'Key: {key}\nValues: {val}\n')

Key: 6
Values: ['esther']

Key: 3
Values: ['ben', 'bob', 'dan']



This function did a little better, but we still end up with three names mapped to the same location.  Let's try the third hash function.

In [91]:
names_dict = {}

for name in names:
    name_hash = first_char_hash(name)
    if names_dict.get(name_hash) is None:
        names_dict[name_hash] = [name]
    else:
        names_dict[name_hash].append(name)
        
for key, val in names_dict.items():
    print(f'Key: {key}\nValues: {val}\n')

Key: e
Values: ['esther']

Key: b
Values: ['ben', 'bob']

Key: d
Values: ['dan']



As expected, this function did slightly better than the last, but still ends up mapping two values to the same location.  Finally, let's try the fourth and final function.

In [93]:
names_dict = {}

for name in names:
    name_hash = prime_num_hash(name, 10)
    if names_dict.get(name_hash) is None:
        names_dict[name_hash] = [name]
    else:
        names_dict[name_hash].append(name)
        
for key, val in names_dict.items():
    print(f'Key: {key}\nValues: {val}\n')

Key: 0
Values: ['esther']

Key: 7
Values: ['ben']

Key: 3
Values: ['bob']

Key: 2
Values: ['dan']



As expected, the most complex out of the hash functions we've built is the best in terms of distribution with this simple example.

**5.6) A mapping from battery size to power (A, AA, AAA, AAAA).**

In this example, `constant_hash()` and `first_char_hash()` will both produce the same result-- all mappings will be to the same location, because every key starts with "A".  `length_hash()` and `prime_num_hash()` would provide good distribution here, because each key has a different length and therefore a different sum.
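
A quick check of how each function buckets the four battery names.  The functions below are compact restatements of the four hash functions built earlier (written as lambdas so the cell runs on its own), the helper `first_primes` is my own, and the keys are lowercased on the assumption that the prime lookup only covers lowercase letters.

```python
from string import ascii_lowercase

def first_primes(n):
    """Return the first n primes by trial division against earlier primes."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# a -> 2, b -> 3, c -> 5, ...
PRIME_FOR = dict(zip(ascii_lowercase, first_primes(26)))

hash_functions = {
    'constant_hash': lambda s: 1,
    'length_hash': len,
    'first_char_hash': lambda s: s[0],
    'prime_num_hash': lambda s: sum(PRIME_FOR[ch] for ch in s) % 10,
}

batteries = ['a', 'aa', 'aaa', 'aaaa']  # lowercased battery sizes

for name, func in hash_functions.items():
    slots = [func(b) for b in batteries]
    print(f'{name}: {slots} -> {len(set(slots))} distinct slot(s)')
```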

**5.7) A mapping from book titles to authors (Maus, Fun, Home, and Watchmen).**

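Here, the first function will be the worst (as is always the case).  `first_char_hash()` will work well, because every title starts with a different letter.  `length_hash()` does a little worse, because "Maus" and "Home" are both four characters long and end up in the same slot.  We may also run into a problem with `prime_num_hash()` if our hash size is 10.  Let's check.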

In [96]:
titles = ['maus', 'fun', 'home', 'watchmen']
titles_dict = {}

for title in titles:
    title_hash = prime_num_hash(title, 10)
    if titles_dict.get(title_hash) is None:
        titles_dict[title_hash] = [title]
    else:
        titles_dict[title_hash].append(title)
        
for key, val in titles_dict.items():
    print(f'Key: {key}\nValues: {val}\n')

Key: 3
Values: ['maus']

Key: 9
Values: ['fun']

Key: 8
Values: ['home']

Key: 5
Values: ['watchmen']



Turns out that this function will also work well here and we won't need to worry about resizing.