# FIT5148 - Distributed Databases and Big Data

# Assignment 1 - Solution Workbook


**Instructions:**
- You will be using Python 3.
- Read the assignment instructions carefully and implement the algorithms in this workbook.
- You can use the datasets fireData and climateData (provided below) if you are aiming for Credit Task.
- For Distinction and High Distinction tasks, you are required to read the files FireData.csv and ClimateData.CSV provided with the assignment programmatically and prepare the data in the correct format so that it can be used in your algorithm.
- You can introduce new cells as necessary.

**Your details**
- Name: Boyu Zhang
- Student ID: 28491300

- Name:
- Student ID:

Let's get started!

In [1]:
#import multiprocessing as mp
import csv
from datetime import datetime
import multiprocessing as mp

In [2]:
firePath = './data/FireData.csv'
climatePath = './data/ClimateData.csv'

In [3]:
def read_to_list(path):
    with open(path,'r') as f:
        reader = csv.reader(f)
        return list(reader)

In [4]:
climateData = read_to_list(climatePath)[1:]
fireData = read_to_list(firePath)[1:]

In [5]:
# a quick look at the climate data: is it already sorted by date (column 1)?
climateData == sorted(climateData, key=lambda x : x[1]), climateData[0]

(True,
 ['948700',
  '2016-12-31',
  '19',
  '56.8',
  '7.9',
  '11.1',
  '   72.0*',
  '  61.9*',
  ' 0.00I'])

In [6]:
# a quick look at the fire data: is it already sorted by surface temperature (last column)?
fireData == sorted(fireData, key=lambda x : x[-1]), fireData[:2]

(False,
 [['-37.966',
   '145.051',
   '341.8',
   '2017-12-27T04:16:51',
   '26.7',
   '78',
   '2017-12-27',
   '68'],
  ['-35.541',
   '143.311',
   '336.3',
   '2017-12-27T00:02:15',
   '62',
   '82',
   '2017-12-27',
   '63']])

## Task 1 Parallel Search
#### 1. 
Write an algorithm to search climate data for the records on 15th December 2017.
Justify your choice of the data partition technique and search technique you have used.


**Justification**: The exploration above shows that the records in the `climateData` list are already sorted by the date column, which is exactly our search key.

Given this, the simplest partitioning method, round-robin, is sufficient: it splits the dataset evenly, which balances the load across processors, and it preserves the sort order within each partition because every partition receives its records in their original order.

Binary search is the natural choice of search technique when the data is already sorted, since it needs only O(log n) comparisons per partition.
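
The reasoning above can be checked on a toy example. The following is a minimal sketch, not the assignment solution: the helper names `rr_split`, `lower_bound` and `find_all_by_date` and the sample rows are hypothetical. It assumes the date is in column 1 and that a date may occur more than once, in which case a lower-bound binary search followed by a short forward scan returns every match in a partition.

```python
def rr_split(rows, n):
    """Hypothetical helper: round-robin split, row i goes to partition i % n."""
    return [rows[i::n] for i in range(n)]

def lower_bound(partition, date):
    """Binary search for the first index whose date (column 1) is >= `date`."""
    lo, hi = 0, len(partition)
    while lo < hi:
        mid = (lo + hi) // 2
        if partition[mid][1] < date:
            lo = mid + 1
        else:
            hi = mid
    return lo

def find_all_by_date(partition, date):
    """Return every row of a date-sorted partition whose date equals `date`."""
    start = lower_bound(partition, date)
    end = start
    while end < len(partition) and partition[end][1] == date:
        end += 1
    return partition[start:end]

# Made-up rows in the shape [station, date], already sorted by date.
rows = [
    ['s1', '2017-12-14'], ['s2', '2017-12-14'],
    ['s1', '2017-12-15'], ['s2', '2017-12-15'], ['s3', '2017-12-15'],
    ['s1', '2017-12-16'],
]
parts = rr_split(rows, 2)

# Round-robin keeps each partition sorted by date, so binary search stays valid.
still_sorted = all(p == sorted(p, key=lambda r: r[1]) for p in parts)

matches = [row for p in parts for row in find_all_by_date(p, '2017-12-15')]
print(still_sorted, matches)
```

ISO-formatted dates compare correctly as strings, which is why the sketch compares `partition[mid][1]` with `date` directly.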

In [7]:
#here starts the first half of task1
#first pick a partition method:
def rr_partition(data,n):
    """
    Perform a simple round robin partition on the given data set
    
    Parameters:
    data: the dataset to be partitioned, which is a list
    n: the number of groups that the dataset will be divided into
    
    Return:
    result: the partitioned subset of the dataset 
    """
    result = []
    for i in  range(n):
        result.append([])
    for index,element in enumerate(data):
        index_bin = index%n
        result[index_bin].append(element)
    return result
    
#then pick a search method:
def binary_search(data,key):
    """
    Perform binary search for a given key on data sorted by date
    
    Parameters:
    data: the input dataset, a list of records sorted by date (column 1)
    key: the date to search for, as a 'YYYY-MM-DD' string
    
    Return:
    found: the matched record, or None if no record matches
    """
    found = None
    upper = len(data) - 1
    lower = 0
    
    while lower <= upper and not found:
        mid = (upper + lower)//2
        if data[mid][1] == key:
            found = data[mid]
        elif data[mid][1] < key:
            lower = mid + 1
        else:
            upper = mid - 1     
    return found

#the complete parallel search:
from multiprocessing import Pool
def parallel_search_date(data,query,n_processor):
    """
    Perform a parallel search for a single date on the given dataset
    
    Parameters:
    data: the dataset to be searched, which is a list sorted by date
    query: the date to search for
    n_processor: the number of processors used to parallelize the search job
    
    Return:
    results: the header row followed by the search result of each processor
             (None for a partition without a match)
    """
    results = [read_to_list(climatePath)[0]]
    datasets = rr_partition(data, n_processor)
    # the with-block shuts the worker processes down once all results are collected
    with Pool(processes=n_processor) as pool:
        # submit every partition first so the searches run concurrently;
        # calling get() right after each apply_async would serialize them
        pending = [pool.apply_async(binary_search, args=(partition,query))
                   for partition in datasets]
        for result in pending:
            results.append(result.get())
    return results

In [8]:
#test the output
parallel_search_date(climateData,'2017-12-15',6)

[['Station',
  ' Date',
  '   Air Temperature(Celcius)',
  '  Relative Humidity',
  '  WindSpeed  (knots)',
  ' Max Wind Speed',
  '   MAX  ',
  '  MIN  ',
  'Precipitation '],
 ['948702',
  '2017-12-15',
  '18',
  '52',
  '7.1',
  '14',
  '   74.5*',
  '53.1',
  ' 0.00I'],
 None,
 None,
 None,
 None,
 None]


#### 2.
Write an algorithm to find the ``latitude``, ``longitude`` and ``confidence`` when the surface
temperature (°C) was between 65 °C and 100 °C. Justify your choice of the data partition
technique and search technique you have used.


**Justification**: The exploration above shows that the fire data is not sorted by surface temperature. Since the query is a range, range partitioning is the first method that comes to mind. However, if the partition boundaries match the query range, only one partition can contain matches and searching the others is pointless, so the search is no longer parallel; if the boundaries do not match the query range, range partitioning gives no advantage.

None of the remaining partitioning methods improves the performance of this parallel search either, so we again pick the simplest one, round-robin, which is the cheapest to compute and ensures load balance.

As for the search method, binary search needs data sorted on the search key and, as written above, returns a single match, so it does not suit a range query over unsorted data: it would not reduce the search cost and could return misleading output. We therefore use linear search.
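
The partitioning argument can be illustrated with a small experiment. This is only a sketch with made-up temperatures; the helpers `round_robin`, `range_partition` and `matches_per_bin` and the boundaries `[35, 65]` are assumptions for illustration. It counts how many matching values each partition would have to return for the query range 65 to 100.

```python
# Made-up surface temperatures, in no particular order.
temps = [68, 63, 67, 40, 90, 55, 72, 30, 99, 45, 81, 66]

def round_robin(values, n):
    """Value i goes to bin i % n."""
    return [values[i::n] for i in range(n)]

def range_partition(values, boundaries):
    """Bin k holds the values v with boundaries[k-1] <= v < boundaries[k]."""
    bins = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        k = sum(v >= b for b in boundaries)  # number of boundaries at or below v
        bins[k].append(v)
    return bins

def matches_per_bin(bins, low, high):
    """Count the values inside [low, high] in every bin."""
    return [sum(low <= v <= high for v in b) for b in bins]

rr = matches_per_bin(round_robin(temps, 3), 65, 100)
rp = matches_per_bin(range_partition(temps, [35, 65]), 65, 100)
print(rr)  # matches spread across all three bins
print(rp)  # every match lands in the last bin
```

With range boundaries aligned to the query, one partition holds every match and the other processors contribute nothing, which is the situation the justification describes.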

In [9]:
#here starts task1 part2
def linear_search(data,key):
    """
    Perform linear search on given dataset
    
    Parameters:
    data: the dataset to be searched
    key: the search range as [low, high]; low is inclusive, high is exclusive
    
    Return:
    result: a list of [latitude, longitude, confidence] for every matched record
    """
    result = []
    for record in data:
        # surface temperature is the last column, confidence the third from last
        if key[0] <= int(record[-1]) < key[1]:
            result.append(record[:2] + [record[-3]])
    return result

def parallel_search_temperature(data,query,n_processor):
    """
    Perform a parallel range search on surface temperature over the given dataset
    
    Parameters:
    data: the dataset to be searched, which is a list
    query: the temperature range as [low, high]
    n_processor: the number of processors used to parallelize the search job
    
    Return:
    results: the header row followed by the matches found by all processors
    """
    results = [['Latitude','Longitude','Confidence']]
    datasets = rr_partition(data, n_processor)
    # the with-block shuts the worker processes down once all results are collected
    with Pool(processes=n_processor) as pool:
        # submit every partition first so the searches run concurrently
        pending = [pool.apply_async(linear_search, args=(partition,query))
                   for partition in datasets]
        for result in pending:
            results += result.get()
    return results

In [11]:
#test function
parallel_search_temperature(fireData,[65,100],2)[:10]

[['Latitude', 'Longitude', 'Confidence'],
 ['-37.966', '145.051', '78'],
 ['-37.875', '142.51', '93'],
 ['-37.613', '149.305', '95'],
 ['-37.624', '149.314', '90'],
 ['-37.95', '142.366', '92'],
 ['-37.634', '149.237', '100'],
 ['-37.6', '149.325', '99'],
 ['-37.609', '149.32', '99'],
 ['-37.862', '144.175', '87']]


## Task 3 Parallel Sort
Write an algorithm to sort fire data based on surface temperature (°C) in an ascending order. Justify your choice of the data partition technique and sorting technique you have used.


**Justification**: The records in the fire data are all similar in size, so the processing time per record can be treated as constant. To make the best use of all processors, round-robin is the best data partitioning method here: it gives the most even load balance and has the lowest time and space overhead of the partitioning methods.

For the internal sorting algorithm I pick `merge sort`. It is a typical comparison-based sorting algorithm with O(n log n) time complexity even in the worst case, whereas quicksort is fast in practice but degrades to O(n^2) in the worst case.

For the parallel sort I therefore use merge sort as the internal sorting method and `merge-all` as the technique for merging the sorted lists from all processors.
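
The overall shape of this plan can be sketched in a few lines with the standard library. This is an illustration under stated assumptions, not the implementation used in this workbook: Python's built-in `sorted` stands in for the hand-written merge sort, the local sorts run serially instead of in a process pool, `heapq.merge` plays the role of the final merge-all step, and `temp_key`, `merge_all_sort` and the sample records are hypothetical.

```python
import heapq

def temp_key(record):
    """Sort key: surface temperature, assumed to be the last column."""
    return int(record[-1])

def merge_all_sort(records, n_parts):
    # 1. Round-robin partitioning: record i goes to partition i % n_parts.
    parts = [records[i::n_parts] for i in range(n_parts)]
    # 2. Local sort of every partition (serial here; one worker each in the real thing).
    sorted_parts = [sorted(p, key=temp_key) for p in parts]
    # 3. Merge-all: a single k-way merge of all sorted partitions.
    return list(heapq.merge(*sorted_parts, key=temp_key))

# Made-up records in the shape [id, surface_temperature].
sample = [['a', '68'], ['b', '63'], ['c', '67'], ['d', '28'], ['e', '53']]
result = merge_all_sort(sample, 2)
print(result)
```

In merge-all the final merge runs on a single processor, so it is the serial part of the algorithm however many processors sort the partitions.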


In [12]:
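#here is the internal sorting method --'mergesort'
def merge(a,b):
    """ Function to merge two lists, each already sorted by surface
    temperature (last column), into one sorted list
    """
    c = []
    i = j = 0
    # walk both lists with indexes instead of removing from the front,
    # which costs O(n) per removal and mutates the inputs
    while i < len(a) and j < len(b):
        if int(a[i][-1]) < int(b[j][-1]):
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
    # one of the two slices is empty; the other holds the remaining records
    c += a[i:]
    c += b[j:]
    return c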

def mergesort(x):
    """ Function to sort an array using merge sort algorithm """
    if len(x) == 0 or len(x) == 1:
        return x
    else:
        middle = len(x)//2
        a = mergesort(x[:middle])
        b = mergesort(x[middle:])
        return merge(a,b)

In [15]:
mergesort(fireData)[:3]

[['-37.886',
  '147.207',
  '302',
  '2017-07-02T04:28:42',
  '10.7',
  '50',
  '2017-07-02',
  '28'],
 ['-37.886',
  '147.207',
  '302',
  '2017-07-02T04:28:42',
  '10.7',
  '50',
  '2017-07-02',
  '28'],
 ['-37.062',
  '141.373',
  '303.1',
  '2017-07-01T13:11:41',
  '16.1',
  '53',
  '2017-07-01',
  '29']]

In [71]:
# Let's first look at 'k-way merging algorithm' that will be used 
# to merge sub-record sets in our external sorting algorithm.
import sys

# Find the smallest record
def find_min(records):    
    """ 
    Find the smallest record
    
    Arguments:
    records -- the input record set (must not be empty)

    Return:
    result -- the smallest record's index
    """
    # compare on the last column (surface temperature); records = [[],[],[]]
    m = int(records[0][-1])
    index = 0
    for i in range(len(records)):
        if int(records[i][-1]) < m:  
            index = i
            m = int(records[i][-1])
    return index

def k_way_merge(record_sets):#records_sets = [[[r1],[r2]],[[r3],[r4]],[[r5],[r6]]]
    """ 
    K-way merging algorithm
    
    Arguments:
    record_sets -- the set of multiple sorted sub-record sets

    Return:
    result -- the sorted and merged record set
    """
    
    # indexes will keep the indexes of sorted records in the given buffers
    indexes = []
    for x in record_sets:
        indexes.append(0) # initialisation with 0

    # final result will be stored in this variable
    result = []  
    
    # the merging unit (i.e. # of the given buffers)
    sub = []
    
    while(True):
        sub = [] # initialise the merging unit
        
        # This loop gets the current position of every buffer
        for i in range(len(record_sets)):
            if(indexes[i] >= len(record_sets[i])):
                sub.append([sys.maxsize])
            else:
                sub.append(record_sets[i][indexes[i]])  
                
        # find the smallest record 
        smallest = find_min(sub)
        # if we only have sys.maxsize on the tuple, we reached the end of every record set
        if(sub[smallest] == [sys.maxsize]):
            break

        # This record is the next on the merged list
        result.append(record_sets[smallest][indexes[smallest]])
        indexes[smallest] +=1
   
    return result

In [72]:
test = [[['-35.554', '143.307', '326.8', '2017-12-27T00:02:15', '23.8', '67', '2017-12-27', '53']], [['-35.541', '143.311', '336.3', '2017-12-27T00:02:15', '62', '82', '2017-12-27', '63']], [['-35.543', '143.316', '340.4', '2017-12-27T00:02:14', '84.2', '86', '2017-12-27', '67']]]

In [73]:
k_way_merge(test)

[['-35.554',
  '143.307',
  '326.8',
  '2017-12-27T00:02:15',
  '23.8',
  '67',
  '2017-12-27',
  '53'],
 ['-35.541',
  '143.311',
  '336.3',
  '2017-12-27T00:02:15',
  '62',
  '82',
  '2017-12-27',
  '63'],
 ['-35.543',
  '143.316',
  '340.4',
  '2017-12-27T00:02:14',
  '84.2',
  '86',
  '2017-12-27',
  '67']]

In [74]:
# The serial sorting method
def serial_sorting(dataset, buffer_size):
    """
    Perform a serial external sorting method based on sort-merge
    The buffer size determines the size of each sub-record set

    Arguments:
    dataset -- the entire record set to be sorted
    buffer_size -- the buffer size determining the size of each sub-record set

    Return:
    result -- the sorted record set
    """
    
    if (buffer_size <= 2):
        print("Error: buffer size should be greater than 2")
        return
    
    result = []

    ### START CODE HERE ### 
    
    # --- Sort Phase ---
    sorted_set = []
    
    # Read buffer_size pages at a time into memory and
    # sort them, and write out a sub-record set (i.e. variable: subset)
    start_pos = 0
    N = len(dataset)
    while True:
        if ((N - start_pos) > buffer_size):
            # read B-records from the input, where B = buffer_size
            subset = dataset[start_pos:start_pos + buffer_size] 
            # sort the subset (using mergesort defined above)
            sorted_subset = mergesort(subset) 
            sorted_set.append(sorted_subset)
            start_pos += buffer_size
        else:
            # read the last B-records from the input, where B is less than buffer_size
            subset = dataset[start_pos:] 
            # sort the subset (using mergesort defined above)
            sorted_subset = mergesort(subset) 
            sorted_set.append(sorted_subset)
            break
    
    # --- Merge Phase ---
    merge_buffer_size = buffer_size - 1
    dataset = sorted_set
    while True:
        merged_set = []

        N = len(dataset)
        start_pos = 0
        while True:
            if ((N - start_pos) > merge_buffer_size): 
                # read C-record sets from the merged record sets, where C = merge_buffer_size
                subset = dataset[start_pos:start_pos + merge_buffer_size]
                merged_set.append(k_way_merge(subset)) # merge lists in subset
                start_pos += merge_buffer_size
            else:
                # read C-record sets from the merged sets, where C is less than merge_buffer_size
                subset = dataset[start_pos:]
                merged_set.append(k_way_merge(subset)) # merge lists in subset
                break

        dataset = merged_set
        if (len(dataset) <= 1): # if the size of merged record set is 1, then stop 
            result = merged_set
            break
    ### END CODE HERE ###
    
    return result
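
The effect of `buffer_size` on the amount of work can be estimated with a short calculation. The sketch below mirrors the loop structure of `serial_sorting` as written above: the sort phase produces one sorted run per `buffer_size` records, and each merge pass combines `buffer_size - 1` runs at a time. The helper name `sort_merge_passes` and the record counts are hypothetical.

```python
import math

def sort_merge_passes(n_records, buffer_size):
    """Return (initial sorted runs, merge passes) for a sort-merge external sort.

    Assumes n_records >= 1 and buffer_size > 2, as serial_sorting requires.
    """
    initial_runs = math.ceil(n_records / buffer_size)
    fan_in = buffer_size - 1  # one buffer page is reserved for the output
    runs = initial_runs
    passes = 0
    while True:
        runs = math.ceil(runs / fan_in)  # each pass merges fan_in runs into one
        passes += 1
        if runs <= 1:
            break
    return initial_runs, passes

print(sort_merge_passes(100, 4))   # 25 runs, then 25 -> 9 -> 3 -> 1
print(sort_merge_passes(10, 4))    # 3 runs merged in a single pass
```

A larger buffer reduces both the number of initial runs and the number of merge passes, at the cost of more memory per processor.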

In [75]:
serial_sorting(fireData,4) 

[[['-37.886',
   '147.207',
   '302',
   '2017-07-02T04:28:42',
   '10.7',
   '50',
   '2017-07-02',
   '28'],
  ['-37.886',
   '147.207',
   '302',
   '2017-07-02T04:28:42',
   '10.7',
   '50',
   '2017-07-02',
   '28'],
  ['-36.943',
   '143.286',
   '302.7',
   '2017-11-11T15:08:00',
   '18.8',
   '51',
   '2017-11-11',
   '29'],
  ['-37.466',
   '148.1',
   '302.2',
   '2017-10-02T23:44:31',
   '10.9',
   '50',
   '2017-10-02',
   '29'],
  ['-37.062',
   '141.373',
   '303.1',
   '2017-07-01T13:11:41',
   '16.1',
   '53',
   '2017-07-01',
   '29'],
  ['-37.062',
   '141.373',
   '303.1',
   '2017-07-01T13:11:41',
   '16.1',
   '53',
   '2017-07-01',
   '29'],
  ['-37.38',
   '149.334',
   '304.5',
   '2017-11-30T15:38:32',
   '14.1',
   '61',
   '2017-11-30',
   '31'],
  ['-37.227',
   '141.146',
   '305.1',
   '2017-10-03T01:22:44',
   '41.2',
   '54',
   '2017-10-03',
   '31'],
  ['-35.646',
   '142.282',
   '305.6',
   '2017-12-24T13:12:01',
   '11.8',
   '65',
   '2017-12-24',


In [83]:
def parallel_merge_all_sorting(dataset, n_processor, buffer_size):
    """
    Perform a parallel merge-all sorting method

    Arguments:
    dataset -- entire record set to be sorted
    n_processor -- number of parallel processors
    buffer_size -- buffer size determining the size of each sub-record set

    Return:
    result -- the merged record set
    """
    if (buffer_size <= 2):
        print("Error: buffer size should be greater than 2")
        return
    
    result = []

    ### START CODE HERE ### 
    
    # Pre-requisite: Perform data partitioning using round-robin partitioning
    subsets = rr_partition(dataset, n_processor)
    
    # Pool: a Python method enabling parallel processing. 
    pool = mp.Pool(processes = n_processor)

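    # ----- Sort phase -----
    # Submit every partition before collecting any result so that the workers
    # sort concurrently; pool.apply blocks until each call has finished
    pending = [pool.apply_async(serial_sorting, [s, buffer_size]) for s in subsets]
    pool.close()
    pool.join()
    # serial_sorting returns a one-element list wrapping the sorted partition
    sorted_set = [p.get()[0] for p in pending]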
    
    # ---- Final merge phase ----
    result = k_way_merge(sorted_set)
    ### END CODE HERE ###
    
    return result

In [84]:
parallel_merge_all_sorting(fireData,4,4)

[['-37.886',
  '147.207',
  '302',
  '2017-07-02T04:28:42',
  '10.7',
  '50',
  '2017-07-02',
  '28'],
 ['-37.886',
  '147.207',
  '302',
  '2017-07-02T04:28:42',
  '10.7',
  '50',
  '2017-07-02',
  '28'],
 ['-36.943',
  '143.286',
  '302.7',
  '2017-11-11T15:08:00',
  '18.8',
  '51',
  '2017-11-11',
  '29'],
 ['-37.062',
  '141.373',
  '303.1',
  '2017-07-01T13:11:41',
  '16.1',
  '53',
  '2017-07-01',
  '29'],
 ['-37.466',
  '148.1',
  '302.2',
  '2017-10-02T23:44:31',
  '10.9',
  '50',
  '2017-10-02',
  '29'],
 ['-37.062',
  '141.373',
  '303.1',
  '2017-07-01T13:11:41',
  '16.1',
  '53',
  '2017-07-01',
  '29'],
 ['-37.227',
  '141.146',
  '305.1',
  '2017-10-03T01:22:44',
  '41.2',
  '54',
  '2017-10-03',
  '31'],
 ['-37.38',
  '149.334',
  '304.5',
  '2017-11-30T15:38:32',
  '14.1',
  '61',
  '2017-11-30',
  '31'],
 ['-36.779',
  '146.108',
  '305.3',
  '2017-07-01T03:46:08',
  '25.7',
  '61',
  '2017-07-01',
  '32'],
 ['-35.646',
  '142.282',
  '305.6',
  '2017-12-24T13:12:01',
