# FIT5148 - Distributed Databases and Big Data

# Assignment 1 - Solution Workbook


**Instructions:**
- You will be using Python 3.
- Read the assignment instructions carefully and implement the algorithms in this workbook.
- You can use the datasets fireData and climateData (provided below) if you are aiming for the Credit Task.
- For Distinction and High Distinction tasks, you are required to read the files FireData.csv and ClimateData.CSV provided with the assignment programmatically and prepare the data in the correct format so that it can be used in your algorithm.
- You can introduce new cells as necessary.

**Your details**
- Name: Boyu Zhang
- Student ID:28491300 

- Name:
- Student ID:

Let's get started!

In [1]:
#import multiprocessing as mp
import csv
from datetime import datetime

In [2]:
firePath = './data/FireData.csv'
climatePath = './data/ClimateData.csv'


In [3]:
def read_to_list(path):
    with open(path,'r') as f:
        reader = csv.reader(f)
        return list(reader)


In [4]:
climateData = read_to_list(climatePath)[1:]
fireData = read_to_list(firePath)[1:]

In [5]:
#a quick look at the climate data: is it already sorted by date (column 1)?
climateData == sorted(climateData, key=lambda x : x[1]), climateData[0]

(True,
 ['948700',
  '2016-12-31',
  '19',
  '56.8',
  '7.9',
  '11.1',
  '   72.0*',
  '  61.9*',
  ' 0.00I'])

In [6]:
#a quick look at the fire data: is it already sorted by the last column?
fireData == sorted(fireData, key=lambda x : x[-1]), fireData[:2]

(False,
 [['-37.966',
   '145.051',
   '341.8',
   '2017-12-27T04:16:51',
   '26.7',
   '78',
   '2017-12-27',
   '68'],
  ['-35.541',
   '143.311',
   '336.3',
   '2017-12-27T00:02:15',
   '62',
   '82',
   '2017-12-27',
   '63']])

## Task 1 Parallel Search
#### 1. 
Write an algorithm to search climate data for the records on ​15th December 2017​. 
Justify your choice of the data partition technique and search technique you have used.


**Justification**: The exploration above shows that the records in the climateData list are already sorted by the date column, which is exactly our search key.

Given this, we can pick the simplest partition method, round-robin, which splits the dataset evenly and keeps the load balanced at negligible cost. Because round-robin deals out the records in their original order, each partition remains sorted by date.

Binary search is therefore the natural choice: it needs sorted input, which every partition provides, and it takes O(log n) time per partition.
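
The argument above depends on one property: dealing out a sorted list in order leaves every partition sorted. The sketch below checks that property on made-up ISO date strings rather than the real climate records. The helper name `deal_round_robin` and the sample dates are hypothetical, and the standard-library `bisect` module stands in for a hand-written binary search.

```python
from bisect import bisect_left

def deal_round_robin(rows, n):
    """Deal rows into n partitions in turn (same idea as a round-robin partition)."""
    return [rows[i::n] for i in range(n)]

# Hypothetical, already sorted ISO date strings standing in for the date column.
dates = ['2017-12-%02d' % d for d in range(10, 20)]
parts = deal_round_robin(dates, 3)

# Each partition keeps the original relative order, so it is still sorted.
all_sorted = all(p == sorted(p) for p in parts)

# Binary search in each partition; only the partition holding the key reports a hit.
key = '2017-12-15'
hits = []
for p in parts:
    i = bisect_left(p, key)
    hits.append(i < len(p) and p[i] == key)
```

ISO dates written as `YYYY-MM-DD` compare correctly as plain strings, which is why the date column can be searched without converting it to `datetime` first.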

In [7]:
#here starts the first half of task1
#first pick a partition method:
def rr_partition(data,n):
    """
    Perform a simple round robin partition on the given data set
    
    Parameters:
    data: the dataset to be partitioned, which is a list
    n: the number of groups that the dataset will be divided into
    
    Return:
    result: the partitioned subset of the dataset 
    """
    result = []
    for i in  range(n):
        result.append([])
    for index,element in enumerate(data):
        index_bin = index%n
        result[index_bin].append(element)
    return result
    
#then pick a search method:
def binary_search(data,key):
    """
    Perform binary search on the date column (index 1) for the given key
    
    Parameters:
    data: the input dataset, a list of records sorted by date
    key: the date to search for, e.g. '2017-12-15'
    
    Return:
    found: the matched record, or None if not found
    """
    found = None
    upper = len(data) - 1
    lower = 0
    
    while lower <= upper and found is None:
        mid = (upper + lower)//2
        if data[mid][1] == key:
            found = data[mid]
        elif data[mid][1] < key:
            lower = mid + 1
        else:
            upper = mid - 1
    return found

#the complete parallel search:
from multiprocessing import Pool
def parallel_search_date(data,query,n_processor):
    """
    Search a dataset for a given date in parallel
    
    Parameters:
    data: the dataset to be searched, a list sorted by date
    query: the date to search for
    n_processor: the number of processors used to parallelize the search
    
    Return:
    results: the header row followed by one search result per partition
             (None for a partition without a match)
    """
    results = [read_to_list(climatePath)[0]]
    datasets = rr_partition(data, n_processor)
    with Pool(processes=n_processor) as pool:
        # submit every partition first so the searches can run concurrently
        tasks = [pool.apply_async(binary_search, args=(partition,query))
                 for partition in datasets]
        # then collect the results in partition order
        for task in tasks:
            results.append(task.get())
    return results

In [8]:
read_to_list(climatePath)[0]

['Station',
 ' Date',
 '   Air Temperature(Celcius)',
 '  Relative Humidity',
 '  WindSpeed  (knots)',
 ' Max Wind Speed',
 '   MAX  ',
 '  MIN  ',
 'Precipitation ']

In [9]:
#test the output
parallel_search_date(climateData,'2017-12-15',6)

[['Station',
  ' Date',
  '   Air Temperature(Celcius)',
  '  Relative Humidity',
  '  WindSpeed  (knots)',
  ' Max Wind Speed',
  '   MAX  ',
  '  MIN  ',
  'Precipitation '],
 ['948702',
  '2017-12-15',
  '18',
  '52',
  '7.1',
  '14',
  '   74.5*',
  '53.1',
  ' 0.00I'],
 None,
 None,
 None,
 None,
 None]

#### 2.
Write an algorithm to find the​ ``latitude``​, ​``longitude`` ​and ​``confidence`` ​when the surface
temperature (°C) was between ​65 °C​ and​ 100 °C​. Justify your choice of the data partition
technique and search technique you have used.


**Justification**: The exploration above shows that the fire data is not sorted by surface temperature. Since the query is a range, range partitioning is the first method to consider. However, if the partition range matches the query range, only that one partition needs to be searched and the other processors sit idle, so the search is no longer parallel. If the two ranges do not match, range partitioning gives no advantage.
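
A small sketch can illustrate the idle-processor problem. It range-partitions a handful of made-up temperatures and records which partitions a query range actually touches. The boundaries, the sample values, and the helper name `range_partition` are assumptions for illustration, not part of the assignment data.

```python
def range_partition(values, bounds):
    """Place each value in the first partition whose upper bound exceeds it.

    bounds: ascending, exclusive upper bounds; the last partition takes the rest.
    """
    parts = [[] for _ in range(len(bounds) + 1)]
    for v in values:
        for i, b in enumerate(bounds):
            if v < b:
                parts[i].append(v)
                break
        else:
            # no bound exceeded v, so it belongs to the final partition
            parts[-1].append(v)
    return parts

temps = [42, 68, 71, 55, 99, 63, 80, 30, 66, 47]   # hypothetical temperatures
parts = range_partition(temps, [50, 65])            # below 50, 50 to 64, 65 and above
low, high = 65, 100

# True for each partition that contains at least one value in the query range.
busy = [any(low <= v < high for v in p) for p in parts]
```

Here only the last partition holds matching values, so two of the three processors would have nothing to do for this query.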

None of the remaining partition methods improves the performance of this parallel search, so we again pick the simplest one, round-robin, which has the lowest cost and ensures load balance.

As for the search method, binary search is not well suited to a range query on unsorted data: it does not reduce the search time complexity and can even produce misleading output. We therefore use linear search.
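
As a rough illustration of the linear scan, the sketch below checks every record once and keeps only the requested columns. The sample rows and the name `scan_range` are hypothetical. The half-open bound mirrors the `range(key[0], key[1])` test used in the notebook's own code, so the upper limit itself is excluded.

```python
def scan_range(rows, low, high):
    """Return (lat, lon, confidence) for rows with low <= temperature < high."""
    return [(lat, lon, conf)
            for lat, lon, conf, temp in rows
            if low <= temp < high]

# Hypothetical (latitude, longitude, confidence, temperature in Celsius) rows,
# deliberately not sorted by temperature.
rows = [
    ('-37.9', '145.0', 78, 68),
    ('-35.5', '143.3', 82, 63),
    ('-37.6', '149.3', 95, 99),
    ('-36.9', '143.2', 89, 41),
]
matches = scan_range(rows, 65, 100)
```

Each record is visited exactly once, so the cost per partition is linear in the partition size regardless of how the rows are ordered.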

In [10]:
#here starts task1 part2
def linear_seach(data,key):
    """
    Perform linear search on the given dataset
    
    Parameters:
    data: the dataset to be searched
    key: the [low, high] temperature range used for searching;
         as with range(), high itself is excluded
    
    Return:
    result: a list of [latitude, longitude, confidence] for the matched records
    """
    result = []
    for record in data:
        # the last column is compared against the temperature range
        if key[0] <= int(record[-1]) < key[1]:
            # keep the first two columns and the third-from-last column
            result.append(record[:2] + [record[-3]])
    return result

def parallel_search_temperature(data,query,n_processor):
    """
    Search a dataset for a temperature range in parallel
    
    Parameters:
    data: the dataset to be searched, which is a list
    query: the [low, high] temperature range to search for
    n_processor: the number of processors used to parallelize the search
    
    Return:
    results: the header row followed by the matches from all partitions
    """
    results = [['Latitude','Longitude','Confidence']]
    datasets = rr_partition(data, n_processor)
    with Pool(processes=n_processor) as pool:
        # submit every partition first so the searches can run concurrently
        tasks = [pool.apply_async(linear_seach, args=(partition,query))
                 for partition in datasets]
        # then collect the results in partition order
        for task in tasks:
            results += task.get()
    return results

In [11]:
#test function
parallel_search_temperature(fireData,[65,100],2)

[['Latitude', 'Longitude', 'Confidence'],
 ['-37.966', '145.051', '78'],
 ['-37.875', '142.51', '93'],
 ['-37.613', '149.305', '95'],
 ['-37.624', '149.314', '90'],
 ['-37.95', '142.366', '92'],
 ['-37.634', '149.237', '100'],
 ['-37.6', '149.325', '99'],
 ['-37.609', '149.32', '99'],
 ['-37.862', '144.175', '87'],
 ['-36.94', '143.281', '89'],
 ['-37.477', '143.352', '100'],
 ['-37.69', '143.605', '97'],
 ['-36.098', '143.74', '92'],
 ['-37.247', '141.278', '91'],
 ['-37.294', '141.232', '100'],
 ['-38.127', '143.82', '96'],
 ['-37.46', '148.102', '88'],
 ['-37.406', '148.123', '100'],
 ['-37.452', '148.115', '68'],
 ['-37.45', '148.126', '64'],
 ['-37.444', '148.101', '73'],
 ['-37.379', '148.126', '58'],
 ['-37.399', '148.081', '100'],
 ['-37.4796', '141.9403', '94'],
 ['-38.3998', '147.064', '89'],
 ['-36.3774', '143.7079', '91'],
 ['-36.369', '143.7132', '97'],
 ['-36.7685', '142.7134', '91'],
 ['-37.9085', '141.1038', '98'],
 ['-38.4792', '146.3081', '89'],
 ['-37.253', '144.3681

## Task 3  Parallel Sort