# CP2410 Assignment 2
## Data Structure Analysis
### Dictionaries
Dictionaries, also known as maps, are among the most widely used data structures. This data type maps unique keys to associated values. Dictionaries share a number of characteristics with lists: both are mutable, both are dynamic, and both can be nested. However, unlike a list or standard array, the indices, which are referred to as keys, do not need to be consecutive or numeric. The main operations of this data structure are searching, inserting, and deleting. A map can be implemented as an unsorted list by storing the entries in a doubly-linked list. With this implementation, inserting is the most efficient operation and takes only O(1) time, because a new entry can be added at either end of the list without traversing it (assuming no check for an existing key is required). Searches and removals, however, take O(n) time, since in the worst case the entire list must be traversed to find the item with the given key. Due to these characteristics, the unsorted list implementation is only effective for small dictionaries, or for dictionaries which primarily perform insertions and rarely perform removals and searches.
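
The trade-off described above can be sketched in a few lines. This is only a minimal illustration, not the implementation used in the assignment: it swaps the doubly-linked list for a plain Python list to stay short, and the class name `UnsortedListMap`, its method names and the coordinate values are all hypothetical.

```python
class UnsortedListMap:
    """Map that keeps its entries in an unsorted list of (key, value) pairs."""

    def __init__(self):
        self._entries = []

    def __len__(self):
        return len(self._entries)

    def _find(self, key):
        # Linear scan over the entries: O(n) in the worst case.
        for position, (existing_key, _) in enumerate(self._entries):
            if existing_key == key:
                return position
        return -1

    def insert_new(self, key, value):
        # O(1) amortised append; assumes the caller knows the key is not present.
        self._entries.append((key, value))

    def put(self, key, value):
        # Safe insert: looks for an existing key first, so it costs O(n).
        position = self._find(key)
        if position >= 0:
            self._entries[position] = (key, value)
        else:
            self._entries.append((key, value))

    def get(self, key):
        position = self._find(key)
        if position < 0:
            raise KeyError(key)
        return self._entries[position][1]

    def remove(self, key):
        position = self._find(key)
        if position < 0:
            raise KeyError(key)
        # Order does not matter, so fill the gap with the last entry.
        self._entries[position] = self._entries[-1]
        self._entries.pop()


# Made-up coordinates, purely for illustration.
cities = UnsortedListMap()
cities.insert_new(0, (0.0, 0.0))
cities.insert_new(1, (3.0, 4.0))
cities.put(1, (6.0, 8.0))   # key 1 already exists, so its value is replaced
print(cities.get(1), len(cities))
```

The constant-time insert only holds for `insert_new`, which skips the duplicate check; as soon as uniqueness has to be verified (`put`), insertion costs a full scan like `get` and `remove`.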

### Binary Search Trees
A Binary Search Tree is a node-based binary tree data structure that is well suited to storing the items of a map. Its defining property is that the left subtree of a node contains only nodes with keys less than the node’s key, and the right subtree contains only nodes with keys greater than the node’s key. The internal nodes store keys, whereas the external nodes are empty. When searching for a key in a binary search tree, a downward path is traced from the root. The next node to visit is determined by comparing the desired key with the key of the current node. If an external node (leaf) is reached, the key is not in the tree. The standard operations of insertion, searching and deletion all take O(h) time, where h is the height of the tree. In the best case the tree is balanced, so h is O(logn) and each operation takes O(logn) time. In the worst case the height of the tree equals the number of nodes it contains, for example when the keys are inserted in sorted order, and each operation takes O(n) time.
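
To make the dependence on the height concrete, the sketch below builds two trees from the same seven keys, one inserted in a balanced order and one in sorted order. It is a minimal illustration under stated assumptions: the names `Node`, `bst_insert`, `bst_search`, `bst_height` and `build` are hypothetical, empty external nodes are represented by `None`, and the height is counted as the number of nodes on the longest root-to-leaf path.

```python
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.left = None    # None stands in for an empty external node
        self.right = None


def bst_insert(root, key, value):
    """Insert key (or update its value) and return the root of the tree."""
    if root is None:
        return Node(key, value)
    node = root
    while True:
        if key == node.key:
            node.value = value
            return root
        if key < node.key:
            if node.left is None:
                node.left = Node(key, value)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key, value)
                return root
            node = node.right


def bst_search(root, key):
    """Trace a downward path from the root; return the value or None."""
    node = root
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


def bst_height(root):
    """Number of nodes on the longest path from the root down to a leaf."""
    if root is None:
        return 0
    return 1 + max(bst_height(root.left), bst_height(root.right))


def build(keys):
    root = None
    for key in keys:
        root = bst_insert(root, key, str(key))
    return root


balanced = build([4, 2, 6, 1, 3, 5, 7])
chain = build([1, 2, 3, 4, 5, 6, 7])
print(bst_height(balanced), bst_height(chain))  # prints "3 7"
```

Both trees hold the same keys, but a search in the second one may have to visit all seven nodes, which is the O(n) worst case.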

## Algorithm Efficiency

### Merge-Sort
Merge-sort is an efficient sorting algorithm based on the divide-and-conquer design pattern, which consists of three steps: divide, conquer and combine. First, the n-element sequence is divided in half, producing two sub-sequences. Next, each sub-sequence is sorted by applying merge-sort to it recursively, which keeps dividing until sub-sequences of at most one element remain. Finally, the sorted sub-sequences are merged to produce the sorted sequence. The execution of this algorithm can be depicted as a binary tree in which the root is the initial call and every other node represents a recursive call of merge-sort. The merge-sort tree can be analysed to determine the running time. Since each recursive call divides the sequence in half, the height of the merge-sort tree is O(logn). At each depth of the tree the total amount of work done across the nodes is O(n), because merging handles each element once. Therefore the total running time of merge-sort is O(nlogn).
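
The tree argument can be checked on a small input by counting as the recursion runs. The following is a sketch written for this purpose only (it sorts plain numbers and returns a new list, unlike the in-place version in the implementation section); the function name `merge_sort_counted` and the `stats` dictionary are hypothetical.

```python
def merge_sort_counted(seq, depth=0, stats=None):
    """Merge-sort that also records the recursion depth and merge work."""
    if stats is None:
        stats = {"max_depth": 0, "merged": 0}
    stats["max_depth"] = max(stats["max_depth"], depth)

    # A sequence of length 0 or 1 is already sorted.
    if len(seq) <= 1:
        return list(seq), stats

    # Divide, then conquer each half recursively.
    mid = len(seq) // 2
    left, _ = merge_sort_counted(seq[:mid], depth + 1, stats)
    right, _ = merge_sort_counted(seq[mid:], depth + 1, stats)

    # Combine: merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])

    # Each merge writes every element of this node exactly once.
    stats["merged"] += len(merged)
    return merged, stats


result, stats = merge_sort_counted([8, 7, 6, 5, 4, 3, 2, 1])
print(result, stats)
```

For n = 8 the recursion reaches depth 3 (log2 of 8), and each of the three merging levels writes 8 elements, giving 24 element writes in total, which is n multiplied by log2 of n.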

### Insertion-Sort
Insertion-sort is a simple sorting algorithm that scans the array sequentially, maintaining a sorted sub-list at the front and inserting each unsorted item into it. The first element is treated as a sorted sub-list of length one. The second element is then considered: if it is smaller than the first element the two are swapped, otherwise it stays in place and the focus moves to the third element. That element is swapped with the element on its left until it reaches its proper position. This procedure is repeated for the remaining elements until the entire array is sorted. The nested loops in this algorithm lead to an O(n^2) running time in the worst case, which occurs when the initial array is in reverse order. However, insertion-sort works most effectively on sequences that are already almost sorted; in that case it runs in O(n) time because the inner loop performs very few iterations.
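
The gap between the best and worst case shows up directly in the number of inner-loop shifts. The sketch below counts them for a sorted and a reversed input; it works on plain numbers rather than city distances, and the name `insertion_sort_counted` is hypothetical.

```python
def insertion_sort_counted(seq):
    """Return a sorted copy of seq and the number of inner-loop shifts."""
    arr = list(seq)
    shifts = 0
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        # Shift larger elements one place right to make room for key.
        while j >= 0 and key < arr[j]:
            arr[j + 1] = arr[j]
            j -= 1
            shifts += 1
        arr[j + 1] = key
    return arr, shifts


n = 6
_, best = insertion_sort_counted(range(n))            # already sorted
_, worst = insertion_sort_counted(range(n, 0, -1))    # reverse order
print(best, worst)  # prints "0 15"
```

A sorted input needs no shifts at all, so only the n - 1 outer iterations remain, while a reversed input needs n(n - 1)/2 shifts (15 for n = 6), which grows quadratically.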

## Code Design
The solution presented for the Travelling Santa Problem stores the cities in a dictionary (map) and compares two sorting algorithms, merge-sort and insertion-sort. In both implementations the ‘cities.csv’ file is imported into a data frame using Pandas, and an additional column is created containing the distance between each city and the origin (the city with CityId 0). The extended data frame is then converted into a dictionary that maps each CityId to its X coordinate, Y coordinate and distance from the origin. Next, the ‘calcTotalDist’ function and either the ‘mergeSort’ or the ‘insertionSort’ function are defined. The ‘calcTotalDist’ function uses a simple for loop to iterate through the list of cities in order, calculating the distance between the current city and the next and adding it to the running total, which is returned.
The ‘mergeSort’ function, which implements the divide-and-conquer algorithm, uses an if statement to check whether the array contains more than one element. If so, the array is divided in half and ‘mergeSort’ is recursively applied to the two halves. A while loop then merges the two sorted halves back into the array, comparing the cities’ distances from the origin to decide which element to take next. The theoretical efficiency of this function is O(nlogn).
The ‘insertionSort’ function uses a for loop to iterate through the array. For each element, an inner while loop shifts larger elements one position to the right until the appropriate position for the element is found, resulting in a sorted array. This function has a theoretical efficiency of O(n^2).
The remaining code runs the appropriate functions and displays the results: the start distance, the sorted distance, the improvement as a percentage and the total time taken. A graph is also plotted showing the elapsed time against the number of iterations (recursive calls for merge-sort, outer-loop iterations for insertion-sort).


### Merge-Sort Implementation

In [None]:
import math
import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


data = pd.read_csv("../input/cities.csv")

data_use = 1

# Remove unwanted rows of data
data_cutoff = int(data.count(0)['X']*data_use)
data = data.drop(data.index[data_cutoff:])
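origin = data[data.CityId == 0]
# Extract scalars explicitly; calling float() on a one-row Series is deprecated in recent pandas
origin_x = float(origin['X'].iloc[0])
origin_y = float(origin['Y'].iloc[0])

# Straight-line distance from each city to the origin (CityId 0)
data['Distance'] = np.sqrt((data['X'] - origin_x) ** 2 + (data['Y'] - origin_y) ** 2)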


# print(data.head())

# Put data into dictionary
data_dict = {}
index_list = []


for index, row in data.iterrows():
    city_id = int(row['CityId'])
    # Value tuple: (X, Y, distance from origin)
    data_dict[city_id] = (float(row['X']), float(row['Y']), float(row['Distance']))
    index_list.append(city_id)


def calcTotalDist(arr):
    """Return the total path length when visiting the cities in arr in order."""
    total_distance = 0

    for i in range(0, len(arr) - 1):
        first_point = data_dict[arr[i]]
        second_point = data_dict[arr[i + 1]]

        # Euclidean distance between consecutive cities
        total_distance += math.sqrt(
            pow((second_point[0] - first_point[0]), 2) + pow((second_point[1] - first_point[1]), 2))
    return total_distance


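def mergeSort(arr):
    """Sort arr (a list of CityIds) in place by distance from the origin."""
    # Record the elapsed time at the start of every call
    # (relies on the globals time_array and start_time set before the first call)
    t1 = time.time()
    time_array.append(t1 - start_time)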

    if len(arr) > 1:
        mid = len(arr) // 2
        left_split = arr[:mid]
        right_split = arr[mid:]

        mergeSort(left_split)
        mergeSort(right_split)

        i = j = k = 0

        # Merge the two sorted halves back into arr, ordered by distance from the origin
        while i < len(left_split) and j < len(right_split):

            if data_dict[left_split[i]][2] < data_dict[right_split[j]][2]:
                arr[k] = left_split[i]
                i += 1
            else:
                arr[k] = right_split[j]
                j += 1
            k += 1

        # Copy any elements remaining in either half
        while i < len(left_split):
            arr[k] = left_split[i]
            i += 1
            k += 1

        while j < len(right_split):
            arr[k] = right_split[j]
            j += 1
            k += 1
            

# Run Function
time_array = []

start_dist = calcTotalDist(index_list)
print("Start Distance: ", start_dist)

start_time = time.time()
mergeSort(index_list)
finish_time = time.time()

total_time = finish_time-start_time
time_array.append(total_time)

sorted_dist = calcTotalDist(index_list)
print("Sorted Distance: ", sorted_dist)

print("Improvement: ", 100-(sorted_dist/start_dist)*100, "%")
print("Total Time: ", finish_time-start_time, "s")

# Time vs Recursion Count Graph
# One elapsed-time sample was recorded per mergeSort call, plus the final total
y_pos = np.arange(len(time_array))
plt.plot(y_pos, time_array)

plt.title('Merge Sort Time Graph')
plt.xlabel('No. of Recursive Calls')
plt.ylabel('Time (Sec)')

plt.show()


### Insertion Sort Implementation

In [None]:
import math
import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


data = pd.read_csv("../input/cities.csv")
# scriptPath = os.path.dirname(os.path.realpath("_file_"))
# print(scriptPath)
# file_name = scriptPath + "\cities.csv"
# data = pd.read_csv(file_name, index_col=False)

data_use = 1

# Remove unwanted rows of data
data_cutoff = int(data.count(0)['X']*data_use)
data = data.drop(data.index[data_cutoff:])
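origin = data[data.CityId == 0]
# Extract scalars explicitly; calling float() on a one-row Series is deprecated in recent pandas
origin_x = float(origin['X'].iloc[0])
origin_y = float(origin['Y'].iloc[0])

# Straight-line distance from each city to the origin (CityId 0)
data['Distance'] = np.sqrt((data['X'] - origin_x) ** 2 + (data['Y'] - origin_y) ** 2)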


# print(data.head())

# Put data into dictionary
data_dict = {}
index_list = []


for index, row in data.iterrows():
    city_id = int(row['CityId'])
    # Value tuple: (X, Y, distance from origin)
    data_dict[city_id] = (float(row['X']), float(row['Y']), float(row['Distance']))
    index_list.append(city_id)


def calcTotalDist(arr):
    """Return the total path length when visiting the cities in arr in order."""
    total_distance = 0

    for i in range(0, len(arr) - 1):
        first_point = data_dict[arr[i]]
        second_point = data_dict[arr[i + 1]]

        # Euclidean distance between consecutive cities
        total_distance += math.sqrt(
            pow((second_point[0] - first_point[0]), 2) + pow((second_point[1] - first_point[1]), 2))
    return total_distance


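def insertionSort(arr):
    """Sort arr (a list of CityIds) in place by distance from the origin."""

    # Iterate through the array, recording the elapsed time at each outer iteration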
    for i in range(1, len(arr)):

        t1 = time.time()
        time_array.append(t1 - start_time)

        key = arr[i]

        j = i - 1
        while j >= 0 and data_dict[key][2] < data_dict[arr[j]][2]:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key


# Run Function
time_array = []

start_dist = calcTotalDist(index_list)
print("Start Distance: ", start_dist)

start_time = time.time()
insertionSort(index_list)
finish_time = time.time()

sorted_dist = calcTotalDist(index_list)
print("Sorted Distance: ", sorted_dist)

print("Improvement: ", 100-(sorted_dist/start_dist)*100, "%")

print("Total time: ", finish_time - start_time, "s")
# Time vs Iteration Count Graph
# One elapsed-time sample was recorded per outer-loop iteration
y_pos = np.arange(len(time_array))
plt.plot(y_pos, time_array)

plt.title('Insertion Sort Time Graph')
plt.xlabel('No. of Iterations')
plt.ylabel('Time (Sec)')

plt.show()

## Results
The two implementations produced results consistent with the theory. The insertion-sort graph very clearly shows a parabolic trend, which matches the theoretical efficiency of O(n^2). Similarly, the merge-sort graph matches the trend expected from its theoretical efficiency of O(nlogn). Both sorting methods reduce the overall distance by the same amount, approximately 50% (51.73% to be exact), which is expected since both order the cities by the same key. However, the merge-sort function is significantly quicker, taking only 3.5 seconds, whereas the insertion-sort takes 3.07 hours (11069.5 seconds). Since the two algorithms achieve the same improvement in distance but differ enormously in running time, the merge-sort algorithm has been determined to be the better performing solution for the Travelling Santa Problem.