### Data Set

In [1]:
# R consists of 15 pairs, each comprising two attributes (nominal and numeric)
R = [('Adele',8),('Bob',22),('Clement',16),('Dave',23),('Ed',11),
     ('Fung',25),('Goel',3),('Harry',17),('Irene',14),('Joanna',2),
     ('Kelly',6),('Lim',20),('Meng',1),('Noor',5),('Omar',19)]

# S consists of 9 pairs, each comprising two attributes (nominal and numeric)
S = [('Arts',8),('Business',15),('CompSc',2),('Dance',12),('Engineering',7),
     ('Finance',21),('Geology',10),('Health',11),('IT',18)]


### 1. Serial Join Algorithms

Let's first look at serial join algorithms, that is, join algorithms that run on a single, non-parallel machine. They matter for parallel processing too: parallel join algorithms achieve parallelism through data partitioning, so once the data has been partitioned, each processor runs a serial join algorithm on its own partition. Section 2 covers parallel join algorithms in more detail.

In this activity, we will consider the following three serial join algorithms:

   - Nested-loop join algorithm,
   - Sort-merge join algorithm,
   - Hash-based join algorithm

### 1.1 Nested-Loop Join Algorithm

Nested-loop join is the simplest form of join algorithm. For each record of the first table, it goes through all records of the second table. This is repeated for all records of the first table. It is called a nested loop because it consists of two levels of loops: inner loop (looping for the second table) and outer loop (looping for the first table).

Exercise: Understand and run the nested-loop join algorithm, using the numeric attribute of the two tables R and S as the join attribute. Then discuss the time complexity of this algorithm as well as its pros and cons.


In [2]:
def NL_join(T1, T2):
    """
    Perform the nested-loop join algorithm.
    The join attribute is the numeric attribute in the input tables T1 & T2

    Arguments:
    T1 & T2 -- Tables to be joined

    Return:
    result -- the joined table
    """
    result = []
    
    for r1 in T1:
        for r2 in T2:
            if r1[1] == r2[1]:
                result.append({", ".join([r1[0], str(r1[1]), r2[0]])})
    return result


In [3]:
NL_join(R,S)

[{'Adele, 8, Arts'}, {'Ed, 11, Health'}, {'Joanna, 2, CompSc'}]
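
As a starting point for the complexity discussion, the sketch below counts the comparisons a nested-loop join makes on R and S. The name `nl_join_count` is hypothetical and not part of the exercise, and it returns plain tuples instead of the formatted strings used above.

```python
# Illustrative sketch: a nested-loop join that also counts its comparisons.
R = [('Adele',8),('Bob',22),('Clement',16),('Dave',23),('Ed',11),
     ('Fung',25),('Goel',3),('Harry',17),('Irene',14),('Joanna',2),
     ('Kelly',6),('Lim',20),('Meng',1),('Noor',5),('Omar',19)]
S = [('Arts',8),('Business',15),('CompSc',2),('Dance',12),('Engineering',7),
     ('Finance',21),('Geology',10),('Health',11),('IT',18)]

def nl_join_count(T1, T2):
    """Return (joined records, number of comparisons performed)."""
    result, comparisons = [], 0
    for r1 in T1:              # outer loop over the first table
        for r2 in T2:          # inner loop over the second table
            comparisons += 1
            if r1[1] == r2[1]:
                result.append((r1[0], r1[1], r2[0]))
    return result, comparisons

matches, comparisons = nl_join_count(R, S)
print(len(R) * len(S), comparisons)  # both values are 135 for these tables
print(matches)
```

Every record of the first table is compared with every record of the second, so the work grows as |T1| × |T2| no matter how few records match. In return, the algorithm needs no sorting and no extra data structure.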


### 1.2 Sort-Merge Join Algorithm

Sort-merge join is based on sorting and merging operations. The first step of joining is to sort the two tables based on the joining attribute in an ascending order, and the second step is merging the two sorted tables. If the value of the joining attribute in R is smaller than that in S, it skips to the next value of the joining attribute in R. On the other hand, if the value of the joining attribute in R is greater than that in S, it skips to the next value of the joining attribute in S. When the two values match, the two corresponding records are concatenated and placed into the query result.

Exercise: Complete the sort-merge join algorithm based on the above definition by implementing the following code block between '### START CODE HERE ###' and '### END CODE HERE ###'. Discuss the time complexity of this algorithm in terms of its efficiency. Also, compare it with the nested-loop join algorithm.


In [4]:
def SM_join(T1, T2):
    """
    Perform the sort-merge join algorithm.
    The join attribute is the numeric attribute in the input tables T1 & T2

    Arguments:
    T1 & T2 -- Tables to be joined

    Return:
    result -- the joined table
    """
    result = []
    
    # sort T1 based on the join attribute
    s_T1 = sorted(T1, key=lambda rec: rec[1])
    
    # sort T2 based on the join attribute
    s_T2 = sorted(T2, key=lambda rec: rec[1])
   
    ### START CODE HERE ### 
    i = j = 0
    # Scan until either sorted table is exhausted (the last record of each is included)
    while i < len(s_T1) and j < len(s_T2):
        r = s_T1[i][1]
        s = s_T2[j][1]
        if r < s:
            # join attribute of s_T1[i] is smaller: move to the next record of s_T1
            i += 1
        elif r > s:
            # join attribute of s_T1[i] is greater: move to the next record of s_T2
            j += 1
        else:
            # match: put records s_T1[i] and s_T2[j] into the result and i++, j++
            # (advancing both pointers assumes join values are unique within each table)
            result.append({', '.join([s_T1[i][0], str(s_T1[i][1]), s_T2[j][0]])})
            i += 1
            j += 1

    ### END CODE HERE ###

    return result


In [5]:
SM_join(R,S)

[{'Joanna, 2, CompSc'}, {'Adele, 8, Arts'}, {'Ed, 11, Health'}]
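
To support the comparison with the nested-loop join, the sketch below counts how many steps the merge phase takes on R and S. The name `sm_join_count` is hypothetical, and the sketch assumes, like the description above, that join values are unique within each table.

```python
# Illustrative sketch: a sort-merge join that also counts its merge steps.
R = [('Adele',8),('Bob',22),('Clement',16),('Dave',23),('Ed',11),
     ('Fung',25),('Goel',3),('Harry',17),('Irene',14),('Joanna',2),
     ('Kelly',6),('Lim',20),('Meng',1),('Noor',5),('Omar',19)]
S = [('Arts',8),('Business',15),('CompSc',2),('Dance',12),('Engineering',7),
     ('Finance',21),('Geology',10),('Health',11),('IT',18)]

def sm_join_count(T1, T2):
    """Return (joined records, number of merge-loop iterations)."""
    s1 = sorted(T1, key=lambda rec: rec[1])
    s2 = sorted(T2, key=lambda rec: rec[1])
    result, steps = [], 0
    i = j = 0
    while i < len(s1) and j < len(s2):
        steps += 1
        if s1[i][1] < s2[j][1]:
            i += 1                      # value in s1 is smaller: advance s1
        elif s1[i][1] > s2[j][1]:
            j += 1                      # value in s1 is greater: advance s2
        else:
            result.append((s1[i][0], s1[i][1], s2[j][0]))
            i += 1
            j += 1
    return result, steps

merged, steps = sm_join_count(R, S)
print(steps, len(R) + len(S))  # merge steps never exceed |T1| + |T2|
print(merged)
```

Each iteration advances at least one pointer, so the merge phase takes at most |T1| + |T2| steps. The two sorts, at O(n log n) each, dominate the total cost, which is still well below the |T1| × |T2| comparisons of the nested-loop join for large tables.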

In [6]:
i = 0
j = 0
while (i == 2) or (j == 3):
    print('yes')
    i += 1
    j += 1

In [7]:
dic = dict()
dic['a'] = set(['s','d'])

In [8]:
dic['a'].add('c')

In [9]:
dic['a']


{'c', 'd', 's'}


### 1.3 Hash-Based Join Algorithm

A hash-based join is basically made up of two processes: hashing and probing. A hash table is created by hashing all records of the first table using a particular hash function. Records from the second table are also hashed with the same hash function and probed. If any match is found, the two records are concatenated and placed in the query result.

A decision must be made about which table is to be hashed and which table is to be probed. Since a hash table has to be created, it would be better to choose the smaller table for hashing and the larger table for probing.

Exercise: Complete the hash-based join algorithm by implementing the following code block between '### START CODE HERE ###' and '### END CODE HERE ###'. Discuss the time complexity of this algorithm in terms of its efficiency. Also, compare it with the above two join algorithms.


In [26]:
def H(r):
    """
    Hash function 'H' used in the hashing process. It works by summing
    the first and second digits of the hashed attribute, which in this
    case is the join attribute. A single-digit value hashes to itself.
    
    Arguments:
    r -- a record whose join attribute will be hashed

    Return:
    result -- the hash index of the record r
    """
    # Take the first digit and, if present, the second digit of the join attribute
    digits = [int(d) for d in str(r[1])[:2]]
    return sum(digits)

In [27]:
H(R[1])

4
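
The hashing step can be illustrated by building the hash table for S, the smaller table. This is a minimal sketch: `build_hash_table` is a hypothetical helper name, and `H` is restated compactly so the block runs on its own.

```python
from collections import defaultdict

S = [('Arts',8),('Business',15),('CompSc',2),('Dance',12),('Engineering',7),
     ('Finance',21),('Geology',10),('Health',11),('IT',18)]

def H(r):
    """Sum of the first two digits of the join attribute (same rule as H above)."""
    return sum(int(d) for d in str(r[1])[:2])

def build_hash_table(T):
    """Group the records of T into buckets keyed by their hash index."""
    table = defaultdict(list)
    for rec in T:
        table[H(rec)].append(rec)
    return dict(table)

buckets = build_hash_table(S)
for index in sorted(buckets):
    print(index, buckets[index])
```

Records with different join values can land in the same bucket: 2 and 11 both hash to 2, and 12 and 21 both hash to 3. The probing step therefore still has to compare the actual join values within a bucket.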

In [28]:
def HB_join(T1, T2):
    """
    Perform the hash-based join algorithm.
    The join attribute is the numeric attribute in the input tables T1 & T2

    Arguments:
    T1 & T2 -- Tables to be joined

    Return:
    result -- the joined table
    """
    pass
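
One possible way to complete the exercise is sketched below. It is not the reference solution: the name `hb_join_sketch` is hypothetical, it always hashes the second argument and probes with the first, and it leaves the choice of passing the smaller table second to the caller. `H` is restated so the block runs on its own.

```python
R = [('Adele',8),('Bob',22),('Clement',16),('Dave',23),('Ed',11),
     ('Fung',25),('Goel',3),('Harry',17),('Irene',14),('Joanna',2),
     ('Kelly',6),('Lim',20),('Meng',1),('Noor',5),('Omar',19)]
S = [('Arts',8),('Business',15),('CompSc',2),('Dance',12),('Engineering',7),
     ('Finance',21),('Geology',10),('Health',11),('IT',18)]

def H(r):
    """Sum of the first two digits of the join attribute (same rule as H above)."""
    return sum(int(d) for d in str(r[1])[:2])

def hb_join_sketch(T1, T2):
    """Hash T2 into buckets, then probe the buckets with each record of T1."""
    # Hashing: build the hash table from T2 (assumed to be the smaller table)
    table = {}
    for s in T2:
        table.setdefault(H(s), []).append(s)

    # Probing: hash each T1 record and compare only against its bucket
    result = []
    for r in T1:
        for s in table.get(H(r), []):
            if r[1] == s[1]:  # buckets may mix different join values
                result.append({", ".join([r[0], str(r[1]), s[0]])})
    return result

hb_result = hb_join_sketch(R, S)
print(hb_result)
```

Each record is hashed once and compared only with the records in one bucket, so with a hash function that spreads values evenly the cost is close to |T1| + |T2|. The digit-sum `H` produces few distinct indexes, so buckets grow as the tables grow and the benefit shrinks.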