# HASH TABLES / DICTIONARIES

### Motivation:
    * Store key-value pairs;
    * PROBLEM: Arrays can only be indexed by integers, so an arbitrary key must first be transformed into an index. This is why hash functions came to be;
    
### Basics:
    * Balanced BST -> O(log N) for several operations;
    * In hash tables we can achieve O(1) on average;
    * The hash function h() maps keys to indexes in an array;
    
    In general we have n items to be stored and m buckets in which we can store them.
    Problem: Keys are not always nonnegative integers, so we have to do prehashing in order to map other keys (strings, for example) to indexes of an array;
    
    Key Space -> Buckets;
    Relationship between the keys and the array slots;
    
    Hashing: We can map a key of any type to an array index that looks random but is always the same for the same key;
    
#### Hash Function
    - Distribute the keys uniformly into buckets;
    - n: number of keys;
    - m: number of buckets;
    - h(x) = x % m, where x is the (prehashed) integer key;
    - Choosing a prime number for m usually helps the generated indexes to be distributed uniformly;
    - String keys: Take the ASCII value of each character, add them up, then apply the modulo (%) operator;
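
The hashing steps above can be sketched in a few lines. This is a minimal illustration, not a production hash function: the helper names (`hash_int`, `prehash_string`, `hash_string`) and the bucket count `m = 11` are assumptions chosen for the example.

```python
def hash_int(key: int, m: int) -> int:
    """Map a non-negative integer key to a bucket index in [0, m)."""
    return key % m


def prehash_string(key: str) -> int:
    """Prehash a string by summing the code points of its characters."""
    return sum(ord(char) for char in key)


def hash_string(key: str, m: int) -> int:
    """Prehash the string, then reduce it modulo the number of buckets."""
    return hash_int(prehash_string(key), m)


m = 11  # hypothetical prime number of buckets
print(hash_int(42, m))        # 42 % 11 = 9
print(hash_string("car", m))  # (99 + 97 + 114) % 11 = 310 % 11 = 2

# The additive prehash ignores character order, so anagrams always collide
print(hash_string("listen", m) == hash_string("silent", m))  # True
```

The last line shows a weakness of the additive scheme: keys made of the same characters always end up in the same bucket, whatever value of m is chosen.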

### Collisions

   - Two keys mapped to the same bucket;
   - Solutions:
       1. **Chaining**: Store both values in the same bucket using a linked list;
       2. **Open addressing**: Generate a new index for the item;
   - Open addressing vs. chaining:
       - If there are many collisions, chaining degrades the O(1) time complexity because long chains have to be scanned. Moreover, it needs more memory due to the references.
   - **Open Addressing**: If a collision occurs we find an empty slot instead:
        * **Linear Probing**: Try the next slot until we find an empty one;
        * **Quadratic Probing**: Try the slots 1, 4, 9, 16, ... positions away (i² on the i-th attempt);
        * **Rehashing**: Hash the result again, typically with a second hash function (double hashing), in order to find an empty slot;
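
To make the three open-addressing strategies concrete, the sketch below generates the sequence of slots each one would try. It assumes the common textbook formulations (offset i for linear probing, i² for quadratic probing, i times a second hash value for rehashing); the function names and the table size `m = 11` are hypothetical.

```python
from itertools import islice


def linear_probe(start, m):
    """Yield start, start+1, start+2, ... (mod m)."""
    i = 0
    while True:
        yield (start + i) % m
        i += 1


def quadratic_probe(start, m):
    """Yield start, start+1, start+4, start+9, ... (mod m)."""
    i = 0
    while True:
        yield (start + i * i) % m
        i += 1


def double_hash_probe(start, step, m):
    """Yield start, start+step, start+2*step, ... (mod m).

    `step` stands in for the value of a second hash function; it must be
    non-zero, and coprime with m if every slot should be reachable.
    """
    i = 0
    while True:
        yield (start + i * step) % m
        i += 1


m = 11
print(list(islice(linear_probe(3, m), 5)))          # [3, 4, 5, 6, 7]
print(list(islice(quadratic_probe(3, m), 5)))       # [3, 4, 7, 1, 8]
print(list(islice(double_hash_probe(3, 5, m), 5)))  # [3, 8, 2, 7, 1]
```

Linear probing visits neighbouring slots, which tends to build clusters of occupied slots; the other two strategies spread the attempts across the table.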
        
|          |  AVERAGE  |  WORST CASE  |
| -------- | --------- | ------------ |
|  Space   |     O(N)  |     O(N)     |
|  Search  |     O(1)  |     O(N)     |
|  Insert  |     O(1)  |     O(N)     |
|  Delete  |     O(1)  |     O(N)     |
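
The O(N) worst case in the table occurs when every key lands in the same bucket. The following sketch of a chained hash table shows this; it uses Python lists as a stand-in for linked lists, and the class name `ChainedHashTable` and the default of 11 buckets are assumptions made for the example.

```python
class ChainedHashTable:
    """Hash table with separate chaining; each bucket is a Python list."""

    def __init__(self, m=11):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _index(self, key):
        # Additive string hash: sum of the character codes modulo m
        return sum(ord(char) for char in key) % self.m

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # update in place
                return
        bucket.append((key, value))

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return None


table = ChainedHashTable()
# Anagrams share the same character sum, so they all land in one bucket
for word in ["listen", "silent", "enlist", "tinsel", "inlets"]:
    table.put(word, len(word))

longest_chain = max(len(bucket) for bucket in table.buckets)
print(longest_chain)        # 5: every key collided
print(table.get("tinsel"))  # 6, found only after scanning the chain
```

With all N keys in one chain, search, insert and delete each have to walk up to N entries, which is where the worst-case column comes from.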

## Dynamic Resizing
   - **Load Factor**: Number of entries divided by the number of slots/buckets;
        - N/M
        - 0 if the hash table is empty;
        - 1 if the hash table is full;
        - **Load factor is approximately 1**: Nearly full, so performance decreases and the operations become slow;
        - **Load factor is approximately 0**: Nearly empty, so a lot of memory is wasted; \
        **SOLUTION**: DYNAMIC RESIZING
        
- **DYNAMIC RESIZING**:
    - Performance depends on the load factor;
    - The space-time tradeoff is important: the solution is to resize the table when its load factor exceeds a given threshold;
    - **Java**: Load factor > 0.75 -> the HashMap is automatically resized;
    - **Python**: The threshold is about 0.66 (2/3);
    - Indexes depend on the table's size, so the index of every entry changes when resizing and the algorithm can't just copy the data from the old storage to the new one;
    - Resizing takes **O(N)** to complete, where _N_ is the number of entries in the table. This may make dynamically resized hash tables inappropriate for real-time applications.
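
A minimal sketch of dynamic resizing, assuming a linear-probing table that doubles its capacity once the load factor exceeds 2/3. The class name, the initial capacity of 8 and the doubling policy are illustrative assumptions, not a description of how Java or Python implement it internally.

```python
class ResizingHashTable:
    """Linear-probing table that doubles its capacity above a load factor."""

    def __init__(self, capacity=8, max_load=2 / 3):
        self.capacity = capacity
        self.max_load = max_load  # must stay below 1 so a free slot exists
        self.count = 0
        self.slots = [None] * capacity  # each slot is None or (key, value)

    def load_factor(self):
        return self.count / self.capacity

    def _find_slot(self, key):
        """Return the index holding `key`, or the first empty slot for it."""
        index = hash(key) % self.capacity
        while self.slots[index] is not None and self.slots[index][0] != key:
            index = (index + 1) % self.capacity
        return index

    def _resize(self, new_capacity):
        # Indexes depend on the capacity, so every entry is re-inserted: O(N)
        old_entries = [entry for entry in self.slots if entry is not None]
        self.capacity = new_capacity
        self.slots = [None] * new_capacity
        for key, value in old_entries:
            self.slots[self._find_slot(key)] = (key, value)

    def put(self, key, value):
        index = self._find_slot(key)
        if self.slots[index] is None:
            self.count += 1
        self.slots[index] = (key, value)
        if self.load_factor() > self.max_load:
            self._resize(2 * self.capacity)

    def get(self, key):
        entry = self.slots[self._find_slot(key)]
        return entry[1] if entry is not None else None


table = ResizingHashTable()
for i in range(20):
    table.put(i, i * i)

print(table.capacity)       # 32 (grew 8 -> 16 -> 32)
print(table.load_factor())  # 0.625
print(table.get(7))         # 49
```

Most insertions stay cheap, but the ones that trigger `_resize` touch every stored entry, which is the O(N) pause mentioned above.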
        
- **APPLICATIONS**:
    - Databases: Sometimes search trees are the better choice, sometimes hashing;
    - Counting the occurrences of a given word in a particular document;
    - Storing data + lookup tables (password checks);
    - Lookup tables in huge networks (lookup for IP addresses);
    - The hashing technique can be used for substring search -> **_Rabin-Karp Algorithm_**.
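
As an illustration of the substring-search application, here is a compact sketch of the Rabin-Karp idea: compare a rolling hash of each text window with the hash of the pattern, and compare the actual characters only when the hashes match. The parameters `base=256` and `modulus=1_000_000_007` are common choices assumed for the example.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, k = len(text), len(pattern)
    if k == 0:
        return 0
    if k > n:
        return -1

    # Weight of the leading character in a window: base^(k-1) mod modulus
    high = pow(base, k - 1, modulus)

    pattern_hash = 0
    window_hash = 0
    for i in range(k):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % modulus
        window_hash = (window_hash * base + ord(text[i])) % modulus

    for start in range(n - k + 1):
        # Compare the strings only when the hashes match, which guards
        # against false positives caused by hash collisions
        if window_hash == pattern_hash and text[start:start + k] == pattern:
            return start
        if start < n - k:
            # Roll the window: drop text[start], append text[start + k]
            window_hash = (window_hash - ord(text[start]) * high) % modulus
            window_hash = (window_hash * base + ord(text[start + k])) % modulus
    return -1


print(rabin_karp("the quick brown fox", "brown"))  # 10
print(rabin_karp("the quick brown fox", "cat"))    # -1
```

Updating the window hash takes constant time per step, so the text is scanned once instead of re-reading k characters at every position.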
    

# Linear Probing Implementation

```python
class HashTable:
    """Fixed-size hash table that resolves collisions with linear probing."""

    def __init__(self):
        self.size = 10
        self.keys = [None] * self.size
        self.values = [None] * self.size

    def put(self, key, data):
        index = self.hashfunction(key)

        # Probe at most `size` slots so that a full table cannot loop forever
        for _ in range(self.size):
            if self.keys[index] is None:
                # Empty slot -> insert
                self.keys[index] = key
                self.values[index] = data
                return
            if self.keys[index] == key:
                # Key already present -> update
                self.values[index] = data
                return
            # Collision -> linear probing: try the next slot (with wrap-around)
            index = (index + 1) % self.size

        raise OverflowError("hash table is full")

    def hashfunction(self, key):
        """
        Return the index of the array slot for a string key.

        Sum the ASCII values of the characters and use the modulo operator
        to map the sum into the valid index range [0, size).

        A prime table size usually makes collisions less probable.
        """
        total = 0
        for char in key:
            total += ord(char)
        return total % self.size

    def get(self, key):
        index = self.hashfunction(key)

        # Stop at the first empty slot or after probing every slot once
        for _ in range(self.size):
            if self.keys[index] is None:
                break
            if self.keys[index] == key:
                return self.values[index]
            index = (index + 1) % self.size

        # The key is not present in the associative array
        return None
```

```python
table = HashTable()

table.put("apple", 10)
table.put("orange", 20)
table.put("car", 30)
table.put("table", 40)

# "apple", "car" and "table" all hash to slot 0, so "table" is found by probing
print(table.get("table"))  # 40
```


## Hashing Applications

1. Index generation in hashmaps and dictionaries:
    - We can achieve O(1) running time complexity for insertion and retrieval with a perfect hash function;
2. Hashes are important in cryptography -> Cryptographic fingerprints:
    - We can generate the hash of a given file and, for practical purposes, it will uniquely identify that document;
    - If anything changes then the hash will change as well;
3. Password verification:
    - On the server, in order to protect against attackers, we store the hashes of the passwords instead of the passwords themselves;
    - This way an attacker who steals the stored data cannot directly read or use a valid password;
    - Of course, when the user enters the password, the same hash function must be applied before comparing;
4. Blockchains: The identifiers of the blocks are SHA-256 hashes.
    - The blockchain itself is a linked list with hash-pointers;
    - Every node has 2 hash values: own hash and the hash value of the previous block;
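
The cryptographic applications above can be sketched with Python's standard `hashlib` and `hmac` modules. This is an illustrative sketch under stated assumptions: the helper names and the block layout are hypothetical, and the password example swaps in salted PBKDF2 (with an arbitrarily chosen 100,000 iterations) instead of a plain hash, because a single fast hash is generally considered too weak for storing passwords.

```python
import hashlib
import hmac
import os


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a password hash with salted PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Hash the entered password and compare it in constant time."""
    return hmac.compare_digest(hash_password(password, salt), stored)


def make_block(data: str, previous_hash: str) -> dict:
    """Create a block whose own hash covers its data and the previous hash."""
    own_hash = fingerprint((previous_hash + data).encode())
    return {"data": data, "previous_hash": previous_hash, "hash": own_hash}


# Fingerprints: a one-character change yields a different digest
print(fingerprint(b"hello") == fingerprint(b"hellp"))  # False

# Password verification: only the salt and the derived hash are stored
salt = os.urandom(16)
stored = hash_password("s3cret", salt)
print(verify_password("s3cret", salt, stored))  # True
print(verify_password("guess", salt, stored))   # False

# Hash pointers: each block stores the hash of its predecessor
genesis = make_block("genesis", "0" * 64)
second = make_block("second block", genesis["hash"])
print(second["previous_hash"] == genesis["hash"])  # True
```

Because each block's hash covers the previous block's hash, changing the data of an earlier block changes every hash after it, which is what makes tampering detectable.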