# Hashing
A hash function is any function that maps data of arbitrary size onto data of a fixed size. While the potential domain of the data could be huge, its practical domain (i.e., the values actually realized in the dataset) is often small. So if we are careful about how we hash the data, there is often a unique encoding for each value.

So what does a typical hash function look like? Here is a hash function that hashes strings to integer values:

In [1]:
import binascii
import random

# maximum integer value
MAXINT = 2**32-1

# The smallest prime number greater than MAXINT
NEXTPRIME = 4294967311

# Two random numbers that parametrize the hash (A must be nonzero,
# otherwise every input would map to B)
A = random.randint(1, MAXINT)
B = random.randint(0, MAXINT)

def hashcode(st):
    val = binascii.crc32(bytes(str(st),'utf-8')) & 0xffffffff
    return (A * val + B) % NEXTPRIME

If we run this hash function on a list of strings, we get:

In [2]:
for s in ['the','quick','brown','fox','the','lazy', 'dog']:
    print(hashcode(s))

803015294
3329900216
1598068164
3663318587
803015294
2477368713
3523361111


There are a couple of things to note in the output. First, both occurrences of the word 'the' get the same hash code (integer value), because the hash is deterministic. Second, we can restrict the codes to integers in a certain range (e.g., 0-49) by applying another remainder operation:

In [3]:
for s in ['the','quick','brown','fox','the','lazy', 'dog']:
    print(hashcode(s) % 50)

44
16
14
37
44
13
11


If the code space (the range) is too small, then you get "collisions", namely, two unequal values get the same hash code. In the output below, 'the' and 'brown' both map to 4.

In [4]:
for s in ['the','quick','brown','fox','the','lazy', 'dog']:
    print(hashcode(s) % 10)

4
6
4
7
4
3
1
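
To make the relationship between code space size and collisions concrete, here is a minimal sketch that counts how many distinct values share a bucket. The names `fixed_hashcode`, `buckets`, and `count_collisions` are hypothetical, and the parameters `A` and `B` are fixed to arbitrary constants (an assumption, so that runs are repeatable) rather than drawn at random as in the cell above.

```python
import binascii
from collections import defaultdict

# Arbitrary fixed parameters (assumed here for repeatability)
A, B = 1234567, 7654321
NEXTPRIME = 4294967311

def fixed_hashcode(st):
    val = binascii.crc32(bytes(str(st), 'utf-8')) & 0xffffffff
    return (A * val + B) % NEXTPRIME

def buckets(values, codespace):
    """Group the distinct values by hash code modulo `codespace`."""
    table = defaultdict(set)
    for v in values:
        table[fixed_hashcode(v) % codespace].add(v)
    return table

def count_collisions(values, codespace):
    """Count the distinct values that share a bucket with a different value."""
    return sum(len(b) for b in buckets(values, codespace).values() if len(b) > 1)

words = ['the', 'quick', 'brown', 'fox', 'the', 'lazy', 'dog']
for codespace in (1000, 50, 10, 5):
    print(codespace, count_collisions(words, codespace))
```

With six distinct words and only five buckets, at least two words must collide regardless of the hash function chosen. Repeated values such as 'the' do not count as collisions, since they are equal.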


## A More Efficient Matcher
We can revisit the match operator from the previous lectures with a significantly more efficient implementation that uses hashing. This is the basis of the hash join, a join algorithm used in database systems. Hash joins are typically more efficient for larger result sets than nested loops, but can only be used for equality joins.

In [5]:
class MatchOperator:


    def __init__(self, inputs, codespace=50):
        '''
        Takes in a tuple of input iterators (i1,i2) and the
        number of buckets (codespace) in the hash table
        '''
        self.in1, self.in2 = inputs
        self.codespace = codespace

        
    def __iter__(self):
        '''
        Initializes the iterators and fetches the first element
        '''

        self.it1 = iter(self.in1) # initialize the first input
        self.it2 = iter(self.in2) # initialize the second input
        
        self.hashtable = [[] for i in range(self.codespace)]
        
        # build the hash table: hash each value from the first
        # input and append it to the matching bucket
        for v in self.it1:
            self.hashtable[hashcode(v) % self.codespace].append(v)
            
        
        #keep track of the bucket number and next value
        self.nextval = next(self.it2)
        self.nextb = 0
        
        return self


    
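    def __next__(self):
        '''
        Returns the next matching pair; raises StopIteration
        once the second input is exhausted
        '''
        # loop instead of recursing, so that long runs without a
        # match cannot exceed Python's recursion limit
        while True:
            probe = self.hashtable[hashcode(self.nextval) % self.codespace]

            if len(probe) <= self.nextb:
                # bucket exhausted: move on to the next probe value
                self.nextval = next(self.it2)
                self.nextb = 0
            elif probe[self.nextb] == self.nextval:
                rtn = self.nextval
                self.nextb += 1
                return (rtn, rtn)
            else:
                # same bucket but unequal value (a collision): skip it
                self.nextb += 1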

When we run this code, the inputs and outputs are exactly the same as before, but the implementation is more efficient:

In [6]:
for i in MatchOperator(([1,2,4,5],[5,4,3,6])):
    print(i)

(5, 5)
(4, 4)
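
The difference between the two strategies can be sketched without the iterator machinery. This is a simplified illustration, not the operator above: `nested_loop_match` and `hash_match` are hypothetical names, and a Python `dict` stands in for the explicit bucket list, since it hashes its keys internally.

```python
from collections import defaultdict

def nested_loop_match(left, right):
    """Compare every pair: len(left) * len(right) comparisons."""
    return [(l, r) for r in right for l in left if l == r]

def hash_match(left, right):
    """Build a table on `left`, then probe it once per element of `right`."""
    table = defaultdict(list)
    for l in left:
        table[l].append(l)
    # .get avoids creating empty entries for values that have no match
    return [(l, r) for r in right for l in table.get(r, [])]

print(nested_loop_match([1, 2, 4, 5], [5, 4, 3, 6]))  # [(5, 5), (4, 4)]
print(hash_match([1, 2, 4, 5], [5, 4, 3, 6]))         # [(5, 5), (4, 4)]
```

Both functions return the same pairs in the same order. The hash version touches each input once, at the cost of holding the table for `left` in memory.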


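This approach is faster than the "nested loop" version because each value from the second input is compared against a single bucket rather than against the entire first input. The problem is that it requires one of the inputs to fit completely in memory. How do we get around this dilemma?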

## External Hashing
The key idea is to partition the data into chunks that fit into memory. This partitioning is also done with hashing. The structure of this algorithm is very similar to the external sorting we saw before. We recursively subdivide the data until it fits into memory.

In [10]:
from iosim import *

# A function to hash partition the data
def partition(infile, partitions):
    # one in-memory bucket per partition
    hashtable = [[] for _ in range(partitions)]
    output_files = []

    # assign every value in the input file to a bucket
    for v in Load(infile):
        hashtable[hashcode(v) % partitions].append(v)

    # write each non-empty bucket to its own file: infile.0, infile.1, ...
    for code in range(partitions):
        filename = infile + "." + str(code)

        if len(hashtable[code]) > 0:
            Flush(hashtable[code], filename)
            output_files.append((filename, Size(filename)))

    return output_files

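This breaks up the file into at most `partitions` smaller files. However, some of those files might still be very large, because hashing does not guarantee an even split and repeated values always land in the same partition.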

In [11]:
partition('input.txt', 3)

[('input.txt.0', 39), ('input.txt.1', 5), ('input.txt.2', 35)]
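
Since `iosim` is a course library whose implementation is not shown here, the following is a rough stand-in that uses only standard file I/O. It is a sketch under assumptions: `line_hash` and `partition_lines` are hypothetical names, `crc32` replaces the notebook's `hashcode`, sizes are reported as line counts, and empty partitions are written too. It streams each line straight to its partition file instead of buffering the buckets in memory.

```python
import binascii
import os
import tempfile

def line_hash(line):
    # crc32 is a stand-in for the notebook's hashcode function
    return binascii.crc32(line.encode('utf-8')) & 0xffffffff

def partition_lines(path, partitions):
    """Split the lines of `path` into files path.0 ... path.(partitions-1)."""
    names = [f"{path}.{code}" for code in range(partitions)]
    outputs = [open(name, 'w') for name in names]
    counts = [0] * partitions
    try:
        with open(path) as infile:
            for line in infile:
                value = line.rstrip('\n')
                code = line_hash(value) % partitions
                outputs[code].write(value + '\n')
                counts[code] += 1
    finally:
        for out in outputs:
            out.close()
    return list(zip(names, counts))

# Usage on a throwaway file in a temporary directory
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'input.txt')
with open(path, 'w') as f:
    f.write('\n'.join(['the', 'quick', 'brown', 'fox', 'the', 'lazy', 'dog']) + '\n')

parts = partition_lines(path, 3)
print(parts)
```

Every line ends up in exactly one partition file, and equal lines (the two occurrences of 'the') always land in the same one.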

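So we want to recursively subdivide these partitions until they all meet a given size limit. Note that the recursion only makes progress if a partition is split differently at each level; hashing with the same function and the same number of partitions would send all of its values to a single bucket again.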

In [13]:
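def passes(infile, limit, partitions):
    '''
    Takes a (filename, size) pair and returns the list of
    (filename, size) pairs that need no further partitioning
    '''
    file, size = infile
    rtn = []
    if size > limit:
        # too large: partition it and recurse on each piece
        for partfile in partition(file, partitions):
            rtn.extend(passes(partfile, limit, partitions))
    else:
        rtn = [infile]

    return rtn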

passes(("input.txt", Size("input.txt")), 10, 3)

[('input.txt', 82),
 ('input.txt.0', 39),
 ('input.txt.0.0', 33),
 ('input.txt.0.0.0', 34),
 ('input.txt.0.0.0.0', 8),
 ('input.txt.0.0.0.1', 0),
 ('input.txt.0.0.0.2', 0),
 ('input.txt.0.0.1', 0),
 ('input.txt.0.0.2', 0),
 ('input.txt.0.1', 7),
 ('input.txt.0.2', 0),
 ('input.txt.1', 5),
 ('input.txt.2', 35),
 ('input.txt.2.0', 0),
 ('input.txt.2.1', 0),
 ('input.txt.2.2', 5)]
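
A minimal in-memory sketch of the recursion, assuming each level mixes a salt (the recursion depth) into the hash so that an oversized partition is split differently at the next level. The names `salted_hash`, `recursive_partition`, and `max_depth` are hypothetical, and SHA-256 is used here only as a convenient salted hash, not because the notebook uses it. The depth cap is needed because a partition made up of copies of a single value can never be split.

```python
import hashlib

def salted_hash(value, salt):
    """Hash `value` together with a per-level salt."""
    digest = hashlib.sha256(f"{salt}:{value}".encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big')

def recursive_partition(values, limit, partitions, depth=0, max_depth=8):
    """Split `values` into chunks of at most `limit` items where possible."""
    # stop when the chunk fits, or when further splitting is hopeless
    if len(values) <= limit or depth >= max_depth:
        return [values]

    buckets = [[] for _ in range(partitions)]
    for v in values:
        buckets[salted_hash(v, depth) % partitions].append(v)

    chunks = []
    for bucket in buckets:
        if bucket:
            chunks.extend(
                recursive_partition(bucket, limit, partitions, depth + 1, max_depth))
    return chunks

values = [str(i) for i in range(100)]
chunks = recursive_partition(values, 10, 3)
print([len(c) for c in chunks])
```

Unlike the file-based version, this keeps everything in lists, so it only illustrates the control flow and the role of the salt.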

In the previous lecture, we considered a disk (or more generally "external memory") API, where the system could `load` and `flush` data. We will add a new primitive to this API called `seek`, which efficiently returns a single element at a given position in the disk file. The `seek` operation is crucial for implementing indexing. An index is a data structure used to quickly locate and access the data in a database table; it improves performance by minimizing the number of disk accesses required when a query is processed.

## Seek
We can think of any storage device as a large array that can be indexed. In our model, the disk is a set of files divided into lines. The `Seek` operator takes in an iterator of "indices" (think line numbers in a file) and selectively returns the line at each of those numbers:

In [9]:
from iosim import *

for val in Seek([2,0,1], 'test.csv'):
    print(val)
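
The implementation of `Seek` in `iosim` is not shown, so the following is only a sketch of how such an operator could work on a plain text file. The name `seek_lines`, the 0-based line numbering, and the sample file contents are all assumptions. It records the byte offset of every line once, then jumps directly to the requested lines with the file object's `seek` method.

```python
import os
import tempfile

def seek_lines(indices, path):
    """Yield the line at each requested 0-based line number, in request order."""
    # first pass: record the byte offset at which every line starts
    offsets = []
    with open(path, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    # second pass: jump straight to each requested line
    with open(path, 'rb') as f:
        for i in indices:
            f.seek(offsets[i])
            yield f.readline().decode('utf-8').rstrip('\r\n')

# Usage on a throwaway file with made-up contents
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'test.csv')
with open(path, 'w') as f:
    f.write('a,1\nb,2\nc,3\n')

for val in seek_lines([2, 0, 1], path):
    print(val)
# prints c,3 then a,1 then b,2
```

In a real system the offsets would be computed once and stored, so that later lookups do not need to scan the file.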


`Seek` takes a line number and returns a value. What if we wanted to do the reverse: take a value and return the line numbers at which it occurs? This is the basic concept of indexing. Given a value, determine whether it exists on disk and, if it does, return the whole line(s) that contain it.

## Hash Indexing
You will learn about more complicated types of indexes in a database systems class.
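
The reverse lookup described above (from a value to its line numbers) can be sketched with a hash table. This is a minimal in-memory illustration, not a definitive design: `build_hash_index` and `lookup` are hypothetical names, and a Python `dict` plays the role of the hash table.

```python
from collections import defaultdict

def build_hash_index(lines):
    """Map each value to the list of line numbers at which it occurs."""
    index = defaultdict(list)
    for lineno, value in enumerate(lines):
        index[value].append(lineno)
    return index

def lookup(index, value):
    """Return the line numbers for `value`, or an empty list if it is absent."""
    return index.get(value, [])

lines = ['the', 'quick', 'brown', 'fox', 'the', 'lazy', 'dog']
index = build_hash_index(lines)
print(lookup(index, 'the'))  # [0, 4]
print(lookup(index, 'cat'))  # []
```

The line numbers returned by `lookup` are exactly what a `Seek`-style operator needs in order to fetch the matching lines without scanning the whole file.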