# Hash-Maps


Hash maps are among the most widely used and handy data structures. Their major advantage over many other data structures is that elements can be accessed by key in constant time on average, using a hash function.

Let's discuss a plain map first.


## Maps
A map is basically a set of keys, each corresponding to a value.

An array with its indexes already represents such a map. There is a big difference, however: the keys (aka the indexes) are generated automatically, based on the position of the value in the array.

Example: let's consider a Python list holding three names:

```names = ["Paul", "Duncan", "Leto"]```

The corresponding keys are given by the positions in the list, hence:

```
names[0] = "Paul"
names[1] = "Duncan"
names[2] = "Leto"
```

We could, of course, define our own keys and create a more flexible map that does not depend on the positions in the list.
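As a point of comparison, Python's built-in `dict` is such a map with self-chosen keys. The short sketch below is only an illustration; the role names used as keys are made up for this example.

```python
# A map with self-chosen keys, using Python's built-in dict.
# The role names used as keys are assumptions made for illustration.
names_by_role = {"duke": "Leto", "heir": "Paul", "swordmaster": "Duncan"}

# Lookup works by key rather than by position in a list.
print(names_by_role["heir"])  # Paul
```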

### A Simple Map Implementation

For a simple map, we do not really need much: a class with a key and a value attribute, plus a list to hold the items. Let's try it out.


In [6]:
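class KeyValueItem:
    """A single key-value pair."""
    def __init__(self, key, value):
        self.key = key
        self.value = value


class SimpleMap:
    def __init__(self):
        # list holding all key-value items in insertion order
        self.list_of_key_values = []

    def add(self, key, value):
        key_value_item = KeyValueItem(key, value)
        self.list_of_key_values.append(key_value_item)

    def get_item(self, key):
        # linear search: compare the key of every stored item
        for item in self.list_of_key_values:
            if item.key == key:
                return item.value
        return None  # key not found

# named simple_map to avoid shadowing Python's built-in map()
simple_map = SimpleMap()
simple_map.add("Paul", "Atreides")
simple_map.add("Duncan", "Idaho")

print(simple_map.get_item("Paul"))
print(simple_map.get_item("Duncan"))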

Atreides
Idaho


In order to store a flexible amount of data, we have created a list that holds the key-value pairs. To get the value corresponding to a key, we need to iterate through that list, which takes `O(n)` time in the worst case.
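To make the cost of that iteration visible, here is a minimal sketch of a list-backed map that counts key comparisons. The class name `CountingSimpleMap` and the `comparisons` counter are introduced only for this illustration.

```python
class CountingSimpleMap:
    """List-backed map that counts key comparisons (illustration only)."""

    def __init__(self):
        self.items = []          # list of (key, value) tuples
        self.comparisons = 0     # number of key comparisons made so far

    def add(self, key, value):
        self.items.append((key, value))

    def get_item(self, key):
        for stored_key, stored_value in self.items:
            self.comparisons += 1
            if stored_key == key:
                return stored_value
        return None


counting_map = CountingSimpleMap()
for number in range(1000):
    counting_map.add(f"key-{number}", number)

counting_map.get_item("key-999")   # the last key that was added
print(counting_map.comparisons)    # 1000
```

Looking up the last of 1000 keys touches all 1000 items, so the work grows linearly with the size of the map.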

Not very efficient, right?! Well, this is where hashing comes in handy.


## Hashing in Maps


Of all the data structures we have looked at in the previous notebooks, the array with its index is the most efficient one when it comes to accessing information, **provided we know the index in advance!**

Example: if we have the names list from above, accessing the value of "Paul" has complexity `O(1)` in case we know that the value is stored at index `0`. 
```
names[0] = "Paul"
```

There is simply no need for a search. Hence, no iteration through the list is necessary. 

Keeping this efficiency in mind, we can come up with a procedure for converting high-level information such as the name "Paul" into compact low-level information such as an index. **This procedure is called hashing.**


### A Simple Hash Algorithm
Let's say we have a big number and want to convert it into a smaller one. We could take its last two digits and divide the number they represent by a fixed number, keeping only the integer part. The result is a hash value that we can use as an index.
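For numbers, this idea could be sketched as follows. The function name `hash_from_number` and the divisor `7` are assumptions chosen for illustration.

```python
def hash_from_number(number, fixed_number=7):
    """Map a large non-negative integer to a small index (sketch)."""
    last_two_digits = number % 100           # e.g. 123456 -> 56
    return last_two_digits // fixed_number   # e.g. 56 // 7 -> 8


print(hash_from_number(123456))  # 8
print(hash_from_number(987699))  # 14
```

With these choices the largest possible result is `99 // 7 = 14`, so an array with 15 slots would be enough to hold every index.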

Let's try this out.
Idea: take the integer code (`ord`) of the last letter in a string and divide it by a constant, keeping only the integer part.

In [45]:
def generate_hash_value_from_string(string):
    fixed_number = 9
    return int((ord(string[-1])/fixed_number))

print(generate_hash_value_from_string("Paul"))


12


It seems that we have a usable index. Let's put this function into our HashMap.

We would need to adjust the code from above a little bit.
1. Instead of appending items into the array one by one, we will use the index represented by the hash
2. Since we now have fixed indexes, we need to initialize the array holding our key value items in advance


In [40]:
class KeyValueItem:
    def __init__(self, key, value):
        self.key = key
        self.value = value
    

class HashMap:
    def __init__(self, initial_list_size = 100):
        # initialize size of list
        self.list_of_key_values = [None for _ in range(initial_list_size)]

    @staticmethod
    def generate_hash_value_from_string(string):
        fixed_number = 9
        return int((ord(string[-1])/fixed_number))

    def add_item(self, key, value):
        key_value_item = KeyValueItem(key, value)
        hash_value_from_key = self.generate_hash_value_from_string(key)
        # NOTE: an item already stored at this index is overwritten
        self.list_of_key_values[hash_value_from_key] = key_value_item

    def get_item(self, key):
        hash_value_from_key = self.generate_hash_value_from_string(key)
        return self.list_of_key_values[hash_value_from_key].value

hash_map = HashMap()
hash_map.add_item("Paul", "Atreides")
hash_map.add_item("Duncan", "Idaho")

print(hash_map.get_item("Paul"))
print(hash_map.get_item("Duncan"))

Idaho
Idaho


It seems that we are doing something wrong, as we are overwriting values in our array.

A closer look reveals that the hash values for the keys "Paul" and "Duncan" are identical. This means that the second item overwrites the first one at index 12.

In [23]:
def generate_hash_value_from_string(string):
    fixed_number = 9
    return int((ord(string[-1])/fixed_number))

print(generate_hash_value_from_string("Paul"))
print(generate_hash_value_from_string("Duncan"))

12
12


### Collision Handling

The issue above is common when using hash values and is referred to as a collision. Handling collisions is an interesting topic on its own, and we will focus on the most basic approaches here.

One way to handle them is to **adjust the hash function**:

We could, of course, use more letters for our hash value. This would imply bigger indices and hence a bigger underlying array. Reserving many slots in an array just to make sure that we never overwrite an item is not beneficial, though.
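The trade-off can be sketched by comparing the last-letter hash with a hypothetical variant that uses the last two letters. The function `hash_from_last_two_letters` and its base of 128 are assumptions for illustration; it expects ASCII keys with at least two characters.

```python
def hash_from_last_letter(string, fixed_number=9):
    # same idea as the hash function used so far
    return ord(string[-1]) // fixed_number


def hash_from_last_two_letters(string):
    # Hypothetical variant: read the last two characters as a two-digit
    # number in base 128 (assumes ASCII keys with at least two characters).
    return ord(string[-2]) * 128 + ord(string[-1])


for key in ("Paul", "Duncan"):
    print(key, hash_from_last_letter(key), hash_from_last_two_letters(key))

# Paul 12 15084
# Duncan 12 12526
```

The collision between "Paul" and "Duncan" disappears, but the array would now need `128 * 128 = 16384` slots to cover every possible index.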

Another idea is to **accept these collisions and come up with a more general solution** (which we will discuss in the following).

In practice, we use a combination of both: advanced hash functions (implying small chances for collisions) as well as handling collisions in the hash map.


For the latter, we would need to re-think the implementation of our array holding the key value items. Why just use one array, when we can have a nested one?

The idea is simple: instead of storing each key-value item directly in ONE array at the index given by its hash value, we store a second array at every index that holds all items with that hash value. We call this nested array a "bucket".

It sounds a bit more complicated than it actually is. Let's just give it a try.

In [44]:
class KeyValueItem:
    def __init__(self, key, value):
        self.key = key
        self.value = value

class HashMap:
    def __init__(self, initial_list_size = 100):
        # initialize size of list with lists
        self.buckets = [[] for _ in range(initial_list_size)]

    @staticmethod
    def generate_hash_value_from_string(string):
        fixed_number = 9
        return int((ord(string[-1])/fixed_number))

    def add_item(self, key, value):
        hash_value_from_key = self.generate_hash_value_from_string(key)
        bucket = self.buckets[hash_value_from_key]
        # update the value if the key is already stored in this bucket
        for item in bucket:
            if item.key == key:
                item.value = value
                return
        bucket.append(KeyValueItem(key, value))

    def get_item(self, key):
        hash_value_from_key = self.generate_hash_value_from_string(key)
        for item in self.buckets[hash_value_from_key]:
            if item.key == key:
                return item.value
        return None  # key not found

hash_map = HashMap()
hash_map.add_item("Paul", "Atreides")
hash_map.add_item("Duncan", "Idaho")

print(hash_map.get_item("Paul"))
print(hash_map.get_item("Duncan"))

Atreides
Idaho


So, we do iterate after all?! Yes, but only over small lists that hopefully hold very few items each (depending on the hash function used).

Note that many implementations suggest a linked list for the nested part. Since we do nothing but iterate within the nested list, I left the linked list out of the implementation above.

Also note that in practice we use a combination of both approaches for collision handling: advanced hash functions (implying small chances for collisions) as well as handling collisions in the Hash Map using the bucket.
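As a rough sketch of what a more advanced hash function can look like in Python, the built-in `hash()` can be combined with a modulo so that the result always fits the bucket list. The helper name `bucket_index` is an assumption for this example. String hashes are salted per interpreter process by default, so the printed indexes can differ from run to run.

```python
def bucket_index(key, number_of_buckets=100):
    # hash() is Python's built-in hash function; the modulo keeps the
    # result inside the valid index range 0 .. number_of_buckets - 1.
    return hash(key) % number_of_buckets


for key in ("Paul", "Duncan"):
    print(key, bucket_index(key))
```

Collisions are still possible with this approach, which is why the bucket lists remain necessary.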

### Re-Hashing

An important aspect when working with hash maps is re-hashing, which becomes necessary when the map holds too many items for its number of buckets.

The number of buckets is fixed when the hash map is created. Since we want the hash map to grow dynamically, the bucket array may need re-sizing once we have placed a high number of elements into the map. As a rule of thumb, we re-hash once the load (items stored divided by the number of buckets) reaches about 70%.

Re-hashing is a straightforward procedure related to re-sizing fixed-size arrays:

1. keep a reference to the initial array
2. create a new array with increased size
3. re-insert the items of the initial array into the new array (recomputing each index if the hash depends on the array size)

This operation has a complexity of `O(n)`, which means that inserting into a hash map does not ALWAYS take constant time.
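The re-hashing steps can be sketched as below. This is a minimal illustration, not a reference implementation: the class name `ResizingHashMap`, the initial size of 4, the doubling strategy, and the use of the built-in `hash()` with a modulo are all assumptions; only the 70% threshold comes from the text.

```python
class ResizingHashMap:
    """Bucket-based hash map that re-hashes at a load of 70% (sketch)."""

    def __init__(self, initial_size=4, max_load=0.7):
        self.buckets = [[] for _ in range(initial_size)]
        self.max_load = max_load
        self.item_count = 0

    def _index(self, key, number_of_buckets):
        # the index depends on the number of buckets, so it changes on resize
        return hash(key) % number_of_buckets

    def _rehash(self):
        old_buckets = self.buckets                       # step 1
        new_size = 2 * len(old_buckets)
        self.buckets = [[] for _ in range(new_size)]     # step 2
        for bucket in old_buckets:                       # step 3
            for key, value in bucket:
                self.buckets[self._index(key, new_size)].append((key, value))

    def add_item(self, key, value):
        bucket = self.buckets[self._index(key, len(self.buckets))]
        for position, (stored_key, _) in enumerate(bucket):
            if stored_key == key:
                bucket[position] = (key, value)          # update existing key
                return
        bucket.append((key, value))
        self.item_count += 1
        if self.item_count / len(self.buckets) > self.max_load:
            self._rehash()

    def get_item(self, key):
        bucket = self.buckets[self._index(key, len(self.buckets))]
        for stored_key, stored_value in bucket:
            if stored_key == key:
                return stored_value
        return None


resizing_map = ResizingHashMap()
for number in range(10):
    resizing_map.add_item(f"key-{number}", number)

print(len(resizing_map.buckets))        # 16
print(resizing_map.get_item("key-7"))   # 7
```

Starting with 4 buckets, the third and the sixth insertion each push the load above 70%, so the bucket array doubles twice and ends up with 16 buckets.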

So, can we still expect constant complexity when accessing items in a hash map? Well, there is no straightforward answer to this, but in practice we DO expect constant complexity for the reasons mentioned previously:

1. most hash maps use advanced hash functions (implying very few collisions)
2. although the worst case is `O(n)`, re-hashing is a rare event whose cost, spread over many insertions, is usually neglected in complexity considerations.