# Hash 

* Hashing searches for keys in less than O(log n) time; lookups take almost constant time.
* Linear search works on a table that is not sorted; it takes O(n) time.
* Binary search requires the table to be sorted; it takes O(log n) time.
* Hash table:
    * In the simplest (direct access) form, the key itself is used as the index, which allows fast search.
    * A lot of space is required and wasted.
    * Instead, keys should be mapped to the indices of a smaller hash table.
* Four relational mappings:
    1. one-one
    2. one-many
    3. many-one
    4. many-many
* Only one-one and many-one mappings are functions.
* We don't use the keys as indices directly; we use a hash function that gives us the index, and we store the key at that index.
* You provide the key to the hash function, and it returns the index.
____________________________________________________________________________________________________________
* A hash table is a form of associative array that maps keys to their associated values using a hash function.
* The hash function uses a key to compute an index into the slots of the hash table, where the value for that key is stored.
* Ideally, the hash function assigns each key to a unique slot in the table.
    * In reality, there are sometimes collisions, in which two separate keys map to the same slot.
    * The hash table therefore needs a way to handle collisions.
* Key-value mappings are unique: each key maps to exactly one value.
* On average, lookups are faster than in other table structures such as sorted arrays or search trees.
* Hash tables don't order entries in a predictable way.
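As a small illustration of the associative-array behaviour described above, here is a sketch using the standard library's `std::unordered_map`. The function name `demoUniqueKeys` and the sample keys are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical demo: inserts three entries, two of which share a key,
// and returns how many distinct keys the table ends up holding.
std::size_t demoUniqueKeys() {
    std::unordered_map<std::string, int> ages;
    ages["ann"] = 31;
    ages["bob"] = 27;
    ages["ann"] = 32;  // same key again: the old value is overwritten
    assert(ages.at("ann") == 32);
    return ages.size();
}
```

Calling `demoUniqueKeys()` returns 2, because each key maps to exactly one value. Iterating over the map would visit the entries in an unspecified order.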
    

## Hash tables


* The worst-case running time of map operations in an n-entry hash table is O(n).
* A hash table can usually perform these operations in O(1) expected time.
* In general, a hash table consists of two major components, 
    * A bucket array
    * A hash function 
* A **bucket array** for a hash table is an array A of size N, where each cell of A is thought of as a "bucket" and the integer N defines the capacity of the array.
* A **bucket** is a collection of key-value pairs. 
* A **hash function** maps each key k in the map to an integer in the range [0, N-1], where N is the capacity of the bucket array for this table
* The hash function value h(k) is used as the index into our bucket array A, instead of the key k itself.
* We store the entry (k,v) in the bucket ```A[h(k)]```.
* If two or more keys have the same hash value, then two different entries are mapped to the same bucket in A; in that case we say a collision has occurred.
* A good hash function maps the keys in our map so as to minimize collisions as much as possible.


* The evaluation of a hash function h(k) consists of two steps:
    * mapping the key k to an integer, called the hash code
    * mapping the hash code to an integer within the range of indices ```[0,N-1]``` of the bucket array, called the compression function
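The two steps can be sketched as follows. This is only an illustration: it assumes `std::hash` as the hash code and the division method (`code % N`) as the compression function, and the names `hashCode`, `compress`, and `bucketIndex` are made up here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Step 1: hash code. std::hash is used here purely as an assumption.
std::size_t hashCode(const std::string& key) {
    return std::hash<std::string>{}(key);
}

// Step 2: compression function (division method), mapping any
// hash code into the index range [0, N-1].
std::size_t compress(std::size_t code, std::size_t N) {
    assert(N > 0);
    return code % N;
}

// h(k) = compress(hashCode(k), N)
std::size_t bucketIndex(const std::string& key, std::size_t N) {
    return compress(hashCode(key), N);
}
```

For any string, `bucketIndex(key, 16)` is a value from 0 to 15, and equal keys always give the same index.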

    

#### Hash code
* The integer assigned to a key k is called the hash code for k.
* This integer need not be in the range ```[0,N-1]```.
* To be consistent across all our keys, the hash code we use for a key k should be the same as the hash code for any key that is equal to k.
* In C++, for the types char, short, and int, we can achieve a good hash code simply by casting the value to int.
* For the long type, one option is to cast it down to an int and then apply the integer hash code, but this ignores the high-order bits.
* A better hash code, which takes all the original bits into consideration, sums an integer representation of the high-order bits with an integer representation of the low-order bits.
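A minimal sketch of the two hash codes for a 64-bit key, assuming 32-bit hash codes. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hash code by casting: the high-order 32 bits are simply discarded.
std::uint32_t hashCodeByCast(std::uint64_t key) {
    return static_cast<std::uint32_t>(key);
}

// Hash code by summing the high-order and low-order 32-bit halves,
// so every bit of the key influences the result (wraps modulo 2^32).
std::uint32_t hashCodeBySum(std::uint64_t key) {
    std::uint32_t high = static_cast<std::uint32_t>(key >> 32);
    std::uint32_t low = static_cast<std::uint32_t>(key);
    return high + low;
}
```

Two keys that differ only in their high-order bits, such as `0x0000000100000005` and `0x0000000200000005`, get the same hash code from the cast (5) but different ones from the sum (6 and 7).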

#### Hashing technique

* Ideal hashing: the time taken for inserting, deleting, or searching an element is constant.
* The drawback of ideal hashing is that it needs a lot of space.
* The hash function provides the index.
* When two keys are mapped to the same place, a collision occurs.
* The ideal hash function is one-one.
* Modified (space-saving) hash functions often produce collisions, since they may be many-one rather than one-one.
* There are two ways to resolve collisions:
    1. open hashing (extra space beyond the hash table is consumed)
        * chaining
    2. closed hashing (the available space is used instead of adding space)
        * open addressing
            * linear probing
            * quadratic probing
            * double hashing
* Open addressing chooses another spot in the hash table if another key already occupies the target slot.


#### Chaining 
* Modulus hash function: $ h(x) = x \% 10 $, the remainder of dividing x by 10, so the values range from 0 to 9.
* The hash table is implemented as an array of pointers to chains.
* When two keys map to the same spot, the new key is added to the chain (a linked list, optionally kept sorted) at that spot.
* You can store as many keys as you want, but the chains keep growing.
* Time analysis:
    * loading factor ($\alpha$ or $\lambda$): (no. of keys / size of table)
    * analysis of hashing is always done in terms of the loading factor
    * if the loading factor is 10, there are on average 10 keys at each place in the table
    * using the hash function to get the index takes O(1) constant time
    * searching through a chain examines about $\lambda/2$ keys on average
    * successful search: about $1 + \lambda/2$ comparisons on average
    * unsuccessful search: about $1 + \lambda$ comparisons, since the whole chain must be scanned when it is unsorted
    * inserting and deleting take similar time to searching
* You can modify the hash function.
* The benefit of taking the mod is that it limits the size of the table.
* mod 10 means the indices go from 0 to 9.

* If you don't know how to select a proper hash function, you don't know hashing.
* For chaining, we need a linked list.

```cpp
// C++ program to implement hashing with chaining
#include <iostream>
#include <list>
using namespace std;

class Hash
{
    int BUCKET; // No. of buckets

    // Pointer to an array containing buckets
    list<int> *table;
public:
    Hash(int b); // Constructor
    ~Hash() { delete[] table; } // frees the bucket array

    // the class owns a raw array, so copying is disabled
    Hash(const Hash &) = delete;
    Hash &operator=(const Hash &) = delete;

    // inserts a key into hash table
    void insertItem(int key);

    // deletes a key from hash table
    void deleteItem(int key);

    // hash function that maps a key to a bucket index
    // (keys are assumed to be non-negative)
    int hashFunction(int x) {
        return (x % BUCKET);
    }

    void displayHash();
};

Hash::Hash(int b)
{
    this->BUCKET = b;
    table = new list<int>[BUCKET];
}

void Hash::insertItem(int key)
{
    int index = hashFunction(key);
    table[index].push_back(key);
}

void Hash::deleteItem(int key)
{
    // get the hash index of key
    int index = hashFunction(key);

    // find the key in the (index)th list
    list<int>::iterator i;
    for (i = table[index].begin(); i != table[index].end(); i++) {
        if (*i == key)
            break;
    }

    // if key is found in hash table, remove it
    if (i != table[index].end())
        table[index].erase(i);
}

// function to display hash table
void Hash::displayHash() {
    for (int i = 0; i < BUCKET; i++) {
        cout << i;
        for (auto x : table[i])
            cout << " --> " << x;
        cout << endl;
    }
}

// Driver program
int main()
{
    // array that contains keys to be mapped
    int a[] = {15, 11, 27, 8, 12};
    int n = sizeof(a)/sizeof(a[0]);

    // insert the keys into the hash table
    Hash h(10); // 10 is the count of buckets in the hash table
    for (int i = 0; i < n; i++)
        h.insertItem(a[i]);

    // delete 12 from hash table
    h.deleteItem(12);

    // display the Hash table
    h.displayHash();

    return 0;
}

// Output:
// 0
// 1 --> 11
// 2
// 3
// 4
// 5 --> 15
// 6
// 7 --> 27
// 8 --> 8
// 9
```

#### A simple implementation 

```cpp
#include<iostream>
#include<list>
#include<vector>
using namespace std; 

void printhash(const vector<list<int>> & h){
    // iterate by reference to avoid copying every chain
    for(const auto & a: h){
        for(const auto i: a){
            cout << i << "--> ";
        }
        cout << endl;
    }
}

int hashfunc(int x){
    return x % 10; 
}

void insert(vector<list<int>> & h, int data){
    int index = hashfunc(data); 
    h[index].push_back(data); 
}
int main(){

    // hash table with 10 buckets, one chain per bucket
    vector<list<int>> ht(10); 
    insert(ht, 4);
    insert(ht, 44);
    insert(ht, 5);
    insert(ht, 3);
    insert(ht, 24);

    printhash(ht); 

    return 0; 
}
```
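The chaining notes analyse search time in terms of the loading factor, but the implementations above only insert and print. The following sketch adds a search and a loading-factor calculation for the same `vector<list<int>>` layout. The names `Table`, `bucketOf`, `insertKey`, `contains`, and `loadFactor` are made up for this example, and the hash is adjusted so that negative keys also land in a valid bucket.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

using Table = std::vector<std::list<int>>;

// Modulus hash over the number of buckets, made safe for negative keys.
std::size_t bucketOf(const Table& t, int key) {
    assert(!t.empty());
    int n = static_cast<int>(t.size());
    return static_cast<std::size_t>(((key % n) + n) % n);
}

void insertKey(Table& t, int key) {
    t[bucketOf(t, key)].push_back(key);
}

// Search only walks the single chain the key hashes to.
bool contains(const Table& t, int key) {
    const std::list<int>& chain = t[bucketOf(t, key)];
    return std::find(chain.begin(), chain.end(), key) != chain.end();
}

// Loading factor = number of stored keys / number of buckets.
double loadFactor(const Table& t) {
    std::size_t keys = 0;
    for (const auto& chain : t) keys += chain.size();
    return static_cast<double>(keys) / static_cast<double>(t.size());
}
```

With 10 buckets and the keys 4, 44, 5, 3, 24 inserted, `contains(t, 44)` is true, `contains(t, 14)` is false, and `loadFactor(t)` is 0.5.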

#### Linear probing 
* We use another function $ h'(x) = (h(x)+f(i)) \% 10 $, where $ f(i) = i $ and $ i = 0, 1, 2, \dots $, in conjunction with $h(x) = x \% 10$.
* When a collision occurs, move to the next slot.
* Whenever there is a collision, try to use the next free space.
* Keep incrementing i and repeating the calculation until a free space is found.
* The increment is cyclic: after reaching the end of the table, continue from the beginning.
* The average time complexity is just over constant.
* While searching, stop as soon as you encounter an empty slot.
* An empty slot means the element is not in the table.
* Analysis is done in terms of the loading factor, not the size of the input.
* Loading factor is the number of elements divided by the size of the table.
* **The loading factor should be kept less than or equal to 0.5.**
* That means if the hash table size is 10, you should not store more than 5 keys.
* In this case, the hash table is at most half filled.

* Drawbacks of linear probing:
    * half of the table is kept empty, so space is wasted
    * many keys may cluster together and time is wasted going through them (primary clustering)
    * deleting a key leaves a hole that can cut off the keys stored after it, so those keys have to be moved
    * the solution to this problem is rehashing
    * for this reason, deletion is not recommended with linear probing
* Rehashing: take all the keys out and put them back.
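One way to make deletion work with linear probing is to rehash only the keys that follow the deleted slot in the same cluster, rather than the whole table. The sketch below assumes that 0 marks an empty slot and that keys are positive. The names `EMPTY`, `insertKey`, `findSlot`, and `removeKey` are made up for this example.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = 0;  // assumption: 0 marks an empty slot, so keys must be positive

// Insert with linear probing; returns false if the table is full.
bool insertKey(std::vector<int>& table, int key) {
    assert(key > 0);
    int n = static_cast<int>(table.size());
    for (int i = 0; i < n; ++i) {
        int slot = (key % n + i) % n;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return true;
        }
    }
    return false;
}

// Search stops at the first empty slot: the key cannot be stored beyond it.
int findSlot(const std::vector<int>& table, int key) {
    if (key <= 0) return -1;
    int n = static_cast<int>(table.size());
    for (int i = 0; i < n; ++i) {
        int slot = (key % n + i) % n;
        if (table[slot] == key) return slot;
        if (table[slot] == EMPTY) return -1;
    }
    return -1;
}

// Delete by emptying the slot, then taking out and re-inserting
// every key in the run that follows it (a local rehash).
void removeKey(std::vector<int>& table, int key) {
    int slot = findSlot(table, key);
    if (slot < 0) return;
    int n = static_cast<int>(table.size());
    table[slot] = EMPTY;
    for (int next = (slot + 1) % n; table[next] != EMPTY; next = (next + 1) % n) {
        int moved = table[next];
        table[next] = EMPTY;
        insertKey(table, moved);
    }
}
```

For example, in a table of size 10 the keys 12, 22, 32 occupy slots 2, 3, 4. After `removeKey(table, 12)`, the key 22 moves back to slot 2 and 32 to slot 3, so both can still be found.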


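```cpp
#include <iostream>

using namespace std;

const int SIZE = 10;   // table size
// 0 marks an empty slot, so keys must be positive

int hashf(int key){
    return key % SIZE;
}

// returns the first free slot on the probe sequence of key, or -1 if the table is full
int probe(const int h[], int key){
    int index = hashf(key);
    for(int i = 0; i < SIZE; i++){
        if(h[(index + i) % SIZE] == 0){
            return (index + i) % SIZE;
        }
    }
    return -1;
}

void insert(int h[], int key){
    int index = probe(h, key);
    if(index != -1){
        h[index] = key;
    }
}

// returns the slot holding key, or -1 if key is not in the table
int search(const int h[], int key){
    int index = hashf(key);
    for(int i = 0; i < SIZE; i++){
        int slot = (index + i) % SIZE;
        if(h[slot] == key) return slot;
        if(h[slot] == 0) return -1;   // empty slot: key is absent
    }
    return -1;
}

int main(){
    int ht[SIZE] = {0};   // all slots start empty
    insert(ht, 12);
    insert(ht, 25);
    insert(ht, 35);
    insert(ht, 26);
    cout << search(ht, 35) << endl;   // 35 collides with 25 at slot 5 and lands in slot 6
    return 0;
}

```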

* Changing "index + i" to "index + i*i" in the probe turns this into quadratic probing.

#### quadratic probing 
* $ h'(x) = (h(x)+f(i)) \% 10 $ , where $ f(i) = i^2 $
* Average successful search: $ -\frac{\ln(1-\lambda)}{\lambda} $ probes
* Average unsuccessful search: $ \frac{1}{1-\lambda} $ probes
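A minimal sketch of insertion with quadratic probing, assuming that 0 marks an empty slot and that keys are positive. The names `EMPTY` and `insertQuadratic` are made up here. Quadratic probing does not always visit every slot, so the function gives up after as many probes as there are slots.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = 0;  // assumption: 0 marks an empty slot

// Returns the slot where key was stored, or -1 if no free slot
// was found on the probe sequence h(x) + i^2.
int insertQuadratic(std::vector<int>& table, int key) {
    assert(key > 0);
    int n = static_cast<int>(table.size());
    int home = key % n;
    for (int i = 0; i < n; ++i) {
        int slot = (home + i * i) % n;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

With a table of size 10, inserting 23, 43, 13, 27 in that order places them at slots 3, 4, 7, and 8: 43 collides at 3 and moves by 1, 13 collides at 3 and 4 and moves by 4, and 27 then finds slot 7 taken and moves by 1.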

#### double hashing 
* There are two hash functions:
    * the basic hash function
    * a second hash function that determines the step size when a collision occurs in the first one
* $h_1(x) = x \% 10 $
* $h_2(x) = R-(x \% R) $, where R is a prime number smaller than the table size
* $h'(x) = (h_1(x) + i \cdot h_2(x)) \% 10 $, for $ i = 0, 1, 2, \dots $

* We use the regular hash function first; only when a collision occurs do we bring in the second hash function.
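The formulas above can be sketched as follows, assuming a table of size 10, R = 7, 0 as the empty marker, and positive keys. The constants and the name `insertDouble` are chosen only for this example. Because 10 is not prime, some step sizes (such as 2 or 5) cannot reach every slot, which is one reason a prime table size is usually preferred.

```cpp
#include <cassert>
#include <vector>

const int TABLE_SIZE = 10;  // assumed table size
const int R = 7;            // assumed prime smaller than TABLE_SIZE
const int EMPTY = 0;        // assumption: 0 marks an empty slot

// Returns the slot where key was stored, or -1 if the probe
// sequence h1(x) + i * h2(x) found no free slot.
int insertDouble(std::vector<int>& table, int key) {
    assert(key > 0);
    assert(static_cast<int>(table.size()) == TABLE_SIZE);
    int h1 = key % TABLE_SIZE;
    int h2 = R - (key % R);  // step size, always in 1..R
    for (int i = 0; i < TABLE_SIZE; ++i) {
        int slot = (h1 + i * h2) % TABLE_SIZE;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

Inserting 5, 25, 15, 35 places them at slots 5, 8, 1, and 2. All four keys hash to 5 under h1, but their different h2 values (3, 6, and 7 for the last three) send them to different slots.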


#### hashing function ideas 
* Mid-square hash function
    * whatever the key is, square it and take the middle digit(s) of the result
    * suppose the key is 11; its square is 121, the middle digit is 2, so 11 is stored at index 2
* Folding
    * split the key into groups of 2 or 3 digits, treat each group as a single number, and add them; the sum can be used as an index
    * if the keys are strings, each letter can be treated as its ASCII code
* The goal of the mapping should be to distribute the data as uniformly as possible.
* For chaining, the hash table size can be anything, but if you are using linear probing, the hash table size should be at least double the data size (#elements).
* For $ h(x) = x \% size $, it is preferred that size be a prime number.


* You can design your own hash functions, but you have to make sure that the results are consistent between inserting and searching.
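The mid-square and folding ideas can be sketched as below. The details are assumptions made for illustration: a single middle digit for mid-square, groups of two digits for folding, and a final `% tableSize` to keep the index in range. The function names are made up.

```cpp
#include <cassert>
#include <string>

// Mid-square: square the key and take the middle digit of the result.
int midSquare(int key) {
    assert(key >= 0);
    std::string digits = std::to_string(static_cast<long long>(key) * key);
    return digits[digits.size() / 2] - '0';
}

// Folding on digits: split the key into two-digit groups and add them.
int foldDigits(int key, int tableSize) {
    assert(key >= 0 && tableSize > 0);
    int sum = 0;
    while (key > 0) {
        sum += key % 100;  // lowest two digits form one group
        key /= 100;
    }
    return sum % tableSize;
}

// Folding on strings: add up the character codes.
int foldString(const std::string& s, int tableSize) {
    assert(tableSize > 0);
    int sum = 0;
    for (unsigned char c : s) sum += c;
    return sum % tableSize;
}
```

For instance, `midSquare(11)` gives 2 (from 121), `foldDigits(123456, 1000)` gives 102 (12 + 34 + 56), and `foldString("ab", 10)` gives 5 (97 + 98 = 195).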

#### Summary 
* array: search takes linear time
* sorted array: binary search takes O(log(n)) time; insertion and deletion are costly
* linked list: search takes linear time
* balanced binary search tree: search, insert, and delete all take O(log(n))
* Direct access table: all three operations take O(1)
    * but the size of the table is large
* Direct access table improved
    * using hashing
* A hash function maps a big number or string to a small integer that can be used as an index in the hash table.
* A good hash function:
    * is efficiently computable
    * uniformly distributes the keys
* Collision handling:
    1. chaining: the idea is to make each cell of the hash table point to a linked list of records that have the same hash function value
    2. open addressing: all elements are stored in the hash table itself
         * linear probing
         * quadratic probing
         * double hashing
* Hashing provides O(1) time on average for insert, search, and delete.
* Load factor: avg #keys per slot = #keys stored in table / #slots in table
* Expected time to insert/delete/search with chaining: $O(1 + \text{load factor})$
* Chaining: 
    * simple to implement 
    * hash table never fills up, we can always add more elements to the chain 
    * less sensitive to the hash function or load factors 
    * useful when number of elements or the frequency of insert and delete operations are unknown 
    * there is wastage of space 
    * if the chain becomes long, then search time can become O(n) in the worst case
* Open addressing: 
    * All items(keys) are stored in table itself
    * size of table >= No of keys 
    * Hash function specifies order of slots to probe (try) for a key (for insert/search/delete), not just one slot 
    * search, insert, and delete take $O\left(\frac{1}{1-\lambda}\right)$, where $\lambda$ is the load factor
* Linear probing:
    * `h_i(x) = (hash(x) + i) % hash_table_size`
    * Easy to implement
    * Best cache performance
    * Suffers from clustering
* Quadratic probing:
    * `h_i(x) = (hash(x) + i^2) % hash_table_size`
    * Average cache performance
    * Suffers from less clustering than linear probing
* Double hashing:
    * use another hash function hash2(x) and, in the i-th iteration, probe the slot $i \cdot hash2(x)$ positions past hash(x)
    * `h_i(x) = (hash(x) + i * hash2(x)) % hash_table_size`
    * Poor cache performance
    * No clustering
    * Requires more computation time
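To see how the two expected-cost expressions from this summary behave, they can be evaluated directly. This is just the arithmetic of $1 + \lambda$ and $\frac{1}{1-\lambda}$; the function names are made up.

```cpp
#include <cassert>

// Expected cost per operation with chaining: 1 + load factor.
double chainingCost(double loadFactor) {
    assert(loadFactor >= 0.0);
    return 1.0 + loadFactor;
}

// Expected cost per operation with open addressing: 1 / (1 - load factor).
// Only defined while the table is not full (load factor < 1).
double openAddressingCost(double loadFactor) {
    assert(loadFactor >= 0.0 && loadFactor < 1.0);
    return 1.0 / (1.0 - loadFactor);
}
```

At a load factor of 0.5 the costs are 1.5 and 2. At 0.9, chaining is still at 1.9 while open addressing is at about 10, which is why open addressing is sensitive to how full the table gets.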



#### Best practices 
* If the hash function does a good job of spreading objects across the underlying array and takes O(1) time to compute, then on average lookups, insertions, and deletions have **O(1+n/m)** time complexity, where n = # objects and m = array length.
* If the load n/m grows large, rehashing can be applied to the hash table.
* A new array with a larger number of locations is allocated, and the objects are moved to the new array.
* Rehashing is expensive, O(n+m) time, but if it is done infrequently (e.g. each time the number of entries doubles), its amortized cost is low.
* Keys do not appear in order.
* Randomization plays a central role.
* One hard requirement:
    * Equal keys should have equal hash codes.
* Two soft requirements:
    * The hash function should spread keys, i.e. the hash codes for a subset of objects should be uniformly distributed across the underlying array.
    * A hash function should be efficient to compute.
* A key that is present in a hash table must not be updated in place; otherwise lookups for that key will fail even though it is still in the hash table.
* If you have to update a key:
    * first remove it
    * update it
    * add it back
* Avoid using mutable objects as keys.
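The rehashing step described above (allocate a larger array and move the objects over when the load n/m grows) might look like this for a chained set of unsigned integers. The class name `ChainedSet`, the doubling policy, and the threshold n/m <= 1 are assumptions chosen for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

class ChainedSet {
public:
    explicit ChainedSet(std::size_t buckets = 4) : table_(buckets) {
        assert(buckets > 0);
    }

    void insert(unsigned key) {
        if (contains(key)) return;
        // Grow before the load n/m would exceed 1 (assumed threshold).
        if (size_ + 1 > table_.size()) rehash(table_.size() * 2);
        table_[key % table_.size()].push_back(key);
        ++size_;
    }

    bool contains(unsigned key) const {
        for (unsigned k : table_[key % table_.size()])
            if (k == key) return true;
        return false;
    }

    std::size_t size() const { return size_; }
    std::size_t bucketCount() const { return table_.size(); }

private:
    // O(n + m): allocate a larger array and move every key into it.
    void rehash(std::size_t newBuckets) {
        std::vector<std::list<unsigned>> bigger(newBuckets);
        for (const auto& chain : table_)
            for (unsigned k : chain) bigger[k % newBuckets].push_back(k);
        table_.swap(bigger);
    }

    std::vector<std::list<unsigned>> table_;
    std::size_t size_ = 0;
};
```

Starting with 4 buckets, inserting the keys 1 through 5 triggers one rehash on the fifth insert, leaving 8 buckets. Because the array doubles each time, a rehash happens only after the number of entries has doubled, which keeps the amortized cost per insert low.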




#### Hash function suitable for strings.
* First, the hash function should examine all the characters in the string.
* It should give a broad range of values and should not let one character dominate.
* A rolling hash function is one in which, if a character is deleted from the front of the string and another is added to the end, the new hash code can be computed in O(1) time.

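```cpp
#include <string>
using namespace std;

int stringhash(const string & s, int modulus){
    const int kmult = 997;
    long long val = 0;   // wide type so val * kmult cannot overflow
    for(char c: s){
        // treat each character as a value in 0..255, even where char is signed
        val = (val * kmult + static_cast<unsigned char>(c)) % modulus;
    }
    return static_cast<int>(val);
}
```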
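The rolling property mentioned above can be sketched with a polynomial hash. The base 256 and the modulus 1000000007 are assumptions for this example, as are the names `windowHash`, `leadWeight`, and `roll`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

const std::uint64_t kBase = 256;           // assumed base (one byte per character)
const std::uint64_t kMod = 1000000007ULL;  // assumed prime modulus

// Polynomial hash of the window s[start, start + len), computed from scratch.
std::uint64_t windowHash(const std::string& s, std::size_t start, std::size_t len) {
    assert(start + len <= s.size());
    std::uint64_t h = 0;
    for (std::size_t i = start; i < start + len; ++i)
        h = (h * kBase + static_cast<unsigned char>(s[i])) % kMod;
    return h;
}

// kBase^(len-1) mod kMod: the weight of the first character in a window.
std::uint64_t leadWeight(std::size_t len) {
    std::uint64_t w = 1;
    for (std::size_t i = 1; i < len; ++i) w = (w * kBase) % kMod;
    return w;
}

// O(1) update: drop `out` from the front and append `in` at the back.
std::uint64_t roll(std::uint64_t h, char out, char in, std::uint64_t weight) {
    std::uint64_t lead = (static_cast<unsigned char>(out) * weight) % kMod;
    h = (h + kMod - lead) % kMod;  // remove the leading character's contribution
    return (h * kBase + static_cast<unsigned char>(in)) % kMod;
}
```

For the string `"abcdef"` and a window of length 3, rolling the hash of `"abc"` with `out = 'a'` and `in = 'd'` gives the same value as hashing `"bcd"` from scratch.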




#### Write a program that takes as input a set of words and returns groups of anagrams for those words.
* Two words are anagrams if and only if they result in equal strings after sorting their characters.
* The key idea is to map strings to a representative.
* Given a string, its sorted version can be used as a unique identifier for the anagram group it belongs to.
* What we want is a map from a sorted string to the anagrams it corresponds to.
* A hash table is an excellent choice when you want to store a set of strings.
* The sorted strings are the keys, and the values are arrays of the corresponding strings from the original input.

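```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

vector<vector<string>> findAnagram(const vector<string> & dictionary){
    // key: the sorted letters of a word, value: all words with those letters
    unordered_map<string, vector<string>> sorted_str_to_an;
    for(const string &s: dictionary){
        string sorted_str(s);
        sort(sorted_str.begin(), sorted_str.end());
        sorted_str_to_an[sorted_str].emplace_back(s);
    }

    // keep only the groups that contain at least two words
    vector<vector<string>> anagram_grp;
    for(const auto & p: sorted_str_to_an){
        if(p.second.size() >= 2){
            anagram_grp.emplace_back(p.second);
        }
    }

    return anagram_grp;
}
```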

#### Rabin-karp algorithm (pattern searching)
* If two strings are equal, their hash values are also equal.
* String matching is thereby reduced to computing the hash value of the search pattern and then looking for substrings of the text with that hash value.
* Steps:
    * calculate the hash value of the pattern
    * create a window the size of the pattern over the text and calculate the hash value for that window
    * if the hash values of the window and the pattern are the same, compare each character of the window to the pattern to see whether they really match
* Hash function requirement: the hash at the next shift must be efficiently computable, in O(1), from the current hash value and the next character in the text.
* A hash that can be updated this way as the window slides is called a **rolling hash function**.
* When a window has the same hash code as the pattern but different characters, it is called a **spurious hit**.
* A strong hash function is necessary to keep spurious hits rare.
* With m the pattern length and base 10, the character codes are multiplied by 10^(m-1), 10^(m-2), and so on.
* The Rabin-Karp algorithm needs to calculate hash values for the following strings:
    1) The pattern itself.
    2) All the substrings of the text of length m.
* To do rehashing, we take off the most significant digit and add the new least significant digit to the hash value, which takes O(1) time.
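One common way to write the rehashing step, assuming a base $d$ and a prime modulus $q$ (neither is fixed by these notes), is $ t_{s+1} = (d \cdot (t_s - text[s] \cdot h) + text[s+m]) \% q $ with $ h = d^{m-1} \% q $. The sketch below follows the steps listed above with d = 256 and q = 1000000007 as assumed values; the function name `rabinKarp` is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns every index at which `pattern` occurs in `text`.
std::vector<std::size_t> rabinKarp(const std::string& text, const std::string& pattern) {
    const std::uint64_t d = 256;         // assumed base (alphabet size)
    const std::uint64_t q = 1000000007;  // assumed prime modulus
    std::vector<std::size_t> hits;
    const std::size_t n = text.size();
    const std::size_t m = pattern.size();
    if (m == 0 || m > n) return hits;

    std::uint64_t h = 1;  // d^(m-1) mod q, the weight of the leading character
    for (std::size_t i = 1; i < m; ++i) h = (h * d) % q;

    std::uint64_t p = 0;  // hash of the pattern
    std::uint64_t t = 0;  // hash of the current window of the text
    for (std::size_t i = 0; i < m; ++i) {
        p = (p * d + static_cast<unsigned char>(pattern[i])) % q;
        t = (t * d + static_cast<unsigned char>(text[i])) % q;
    }

    for (std::size_t s = 0; s + m <= n; ++s) {
        // Equal hashes may be a spurious hit, so confirm character by character.
        if (p == t && text.compare(s, m, pattern) == 0) hits.push_back(s);
        if (s + m < n) {
            // Rolling update: drop text[s], append text[s + m].
            std::uint64_t lead = (static_cast<unsigned char>(text[s]) * h) % q;
            t = ((t + q - lead) % q * d + static_cast<unsigned char>(text[s + m])) % q;
        }
    }
    return hits;
}
```

For example, searching for `"abra"` in `"abracadabra"` returns the indices 0 and 7, and overlapping matches are reported too (`"aa"` in `"aaaa"` gives 0, 1, 2).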
    

## Practice problems 

#### count pairs with given sum 
* Given: an array of integers, a number sum 
* to do: find the number of pairs of integers in the array whose sum is equal to "sum"
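One possible approach, not stated in the notes, is a single pass with a hash table that counts how often each value has been seen so far: for each element x, every earlier occurrence of sum - x forms a pair with it. The sketch below counts pairs of positions (i, j) with i < j, which is an assumption about how duplicates should be treated; the name `countPairs` is made up.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Counts index pairs (i, j), i < j, with a[i] + a[j] == sum.
long long countPairs(const std::vector<int>& a, int sum) {
    std::unordered_map<long long, long long> seen;  // value -> occurrences so far
    long long pairs = 0;
    for (int x : a) {
        // Every earlier occurrence of (sum - x) pairs up with x.
        auto it = seen.find(static_cast<long long>(sum) - x);
        if (it != seen.end()) pairs += it->second;
        ++seen[x];
    }
    return pairs;
}
```

For the array {1, 5, 7, -1} and sum 6 the result is 2 (the pairs 1 + 5 and 7 + -1). Each element costs one expected O(1) lookup and one update, so the whole pass runs in O(n) expected time.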


    