**CP2410 Assignment 1**  
  
***Introduction***  
  
This assignment uses data from the Santa Workshop Tour (SWT) 2019 challenge (https://www.kaggle.com/c/santa-workshop-tour-2019) to analyse various data structures and algorithms. The goal of SWT is to assign each family a day to visit Santa's workshop, based on their stated preferences, without overbooking any day and without incurring the hefty costs of assigning lower-preference days. The input is a list of families, each with their top 10 preferred days in order. The algorithm should return a matrix of the family number and their assigned day, along with a total cost for the preference letdowns.  
Below, the data structures and algorithms used are explained, then their running time and final score are evaluated to compare efficiency and success. Running time is evaluated using Big-Oh notation, which gives an upper bound on how the running time of an operation/algorithm grows with the input size (applied here to the worst case).  
  
***Algorithm 1 – Recursion (from chapter 04 of textbook)***  
  
Recursion is the process of a function calling itself in order to achieve a final result. Recursion can be used as a form of repetition that has memory. An application of recursion would be to calculate the nth Fibonacci number. This is detailed below in an example from chapter 4 of the textbook: Data Structures and Algorithms in Python, by Michael T. Goodrich, Roberto Tamassia, and Michael H. Goldwasser.  
  
def good_fibonacci(n):  
&nbsp;&nbsp;&nbsp;&nbsp;"""Return pair of Fibonacci numbers, F(n) and F(n-1)."""  
&nbsp;&nbsp;&nbsp;&nbsp;if n <= 1:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return (n,0)  
&nbsp;&nbsp;&nbsp;&nbsp;else:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(a, b) = good_fibonacci(n-1)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return (a+b, a)  
  
As demonstrated, values returned by another call of the function are used in the current call. In this case, 'a' is the previous result and 'b' is the result before 'a'. When run, this function will continue to call itself until hitting the base case (n<=1), where it will return a default response. At this point, the stack of function calls will be resolved one by one in the reverse order they were made (from the base case back up to n). Because this function is called 'n' times, we can say it has O(n) run-time.  
  
Despite the benefit of memory, there are a few significant issues with recursion. For one, without a base case, the function will continue to call itself without end and eventually run out of memory to hold all the function calls (Python guards against this by raising a RecursionError once its recursion limit is reached). Another issue is that it is very easy to implement recursion inefficiently. For example, if the Fibonacci function made two recursive calls per invocation (one for F(n-1) and one for F(n-2)) instead of one, the worst-case time for the algorithm would change from O(n) to O(2^n), which is exponential as opposed to linear.  
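To make the difference concrete, the sketch below counts the calls made by a linear version and a binary-recursive version of Fibonacci. The function names and the `calls` counter are illustrative additions for this demonstration, not part of the textbook code.

```python
# Illustrative sketch: count how many calls each Fibonacci variant makes.
# `calls` is a hypothetical counter added for demonstration only.
calls = {"linear": 0, "binary": 0}


def linear_fibonacci(n):
    """Return (F(n), F(n-1)) using one recursive call per level."""
    calls["linear"] += 1
    if n <= 1:
        return (n, 0)
    a, b = linear_fibonacci(n - 1)
    return (a + b, a)


def binary_fibonacci(n):
    """Return F(n) using two recursive calls per level."""
    calls["binary"] += 1
    if n <= 1:
        return n
    return binary_fibonacci(n - 1) + binary_fibonacci(n - 2)


linear_result = linear_fibonacci(20)[0]
binary_result = binary_fibonacci(20)
print(linear_result, binary_result)  # both compute F(20)
print(calls)                         # the binary version makes far more calls
```

Both functions return the same value, but the call count of the binary version grows exponentially with n while the linear version grows in step with n.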
  
In the case of this project, we can use recursion to compare the scores of different days to the previous best score to find the best overall solution.  
    
***Algorithm 2 - 'For' Loop***  
  
A for loop is a different method of repetition that iterates over a sequence (normally a list or a range) and performs a set of operations for each item. It is one of two types of loops (the other being a while-loop) and is normally used for executing a finite number of cycles.  
  
While-loops have an advantage over for-loops in that they stop as soon as their condition is no longer met, which can save work (although a for-loop can also be exited early with a break statement). Unlike recursion, a for-loop does not automatically carry a result from one iteration to the next. This can be achieved by updating a variable that is defined before the loop, as the loop variable itself is reassigned on every iteration.  
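As a small illustration of carrying state across iterations, the sketch below keeps the best value seen so far in variables defined before the loop. The function name `lowest_cost` and its `good_enough` parameter are hypothetical and used only for this example.

```python
# Illustrative sketch: carry the best value across iterations of a for-loop.
def lowest_cost(costs, good_enough=None):
    """Return (index, cost) of the lowest cost seen, stopping early if allowed."""
    best_index, best_cost = None, float("inf")  # state defined before the loop
    for index, cost in enumerate(costs):
        if cost < best_cost:
            best_index, best_cost = index, cost
        if good_enough is not None and best_cost <= good_enough:
            break  # leave the loop as soon as the result is acceptable
    return best_index, best_cost


print(lowest_cost([50, 86, 0, 200]))
print(lowest_cost([50, 86, 0, 200], good_enough=60))
```

The second call stops after the first item because 50 is already within the `good_enough` threshold, so it never sees the cheaper option later in the list.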
  
For loops can be used in the scope of this project to iterate through the different family preferences to try and find the best possible combination.  
  
***Data Structure 1 – Lists (from chapter 5 of textbook)***  
  
Primitive arrays are stored as a sequence of bytes in memory. Each element in the array uses the same number of bytes, so only the position of the first element is needed to locate any other: address of n<sup>th</sup> element = start_position + n \* element_byte_length. Often, the elements do not all need the same number of bytes, and padding every cell to the largest size can be wasteful when dealing with large data. Python addresses this by using referential arrays, or lists, that store object references. Not only do they benefit from fixed-size cells (a memory address is typically 64 bits on modern machines), but storing the elements themselves elsewhere means that a list can reference the same object twice, or multiple lists can reference the same object, at no extra cost to memory.  
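The address arithmetic and the reference sharing described above can be sketched with the standard library. The `array` module is used here as a stand-in for a primitive array; the helper `element_at` and the sample values are assumptions made for illustration.

```python
import sys
from array import array

# A compact array of C ints: every element uses the same number of bytes.
numbers = array("i", [10, 20, 30, 40])
raw = numbers.tobytes()      # the elements laid out back to back
width = numbers.itemsize     # bytes per element (platform dependent)


def element_at(n):
    """Recover element n from the raw bytes using start + n * width."""
    start = n * width
    chunk = raw[start:start + width]
    return int.from_bytes(chunk, sys.byteorder, signed=True)


print(element_at(2), numbers[2])  # both read the same element

# A list stores references instead, so one object can appear many times.
shared = [0, 0]
row_a = [shared, shared]
row_b = [shared]
print(row_a[0] is row_a[1], row_a[0] is row_b[0])
```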
  
This does come at a cost, however. Since each cell in the list is a reference, copying the list creates a new list with references to the same objects, meaning that if any of those objects is mutated, the change is visible through both lists. To avoid this, the list can be ‘hard-copied’ (a deep copy), which makes new instances of all the referenced objects. Copying takes at least O(n) time, which should be kept in mind when comparing and developing with lists. The running time of appending to the list should also be considered. While assigning to an existing index only takes O(1) time, appending when the underlying array is already full forces the list to expand. The textbook's dynamic array doubles its capacity when this happens (e.g. a full array of capacity 5 would grow to capacity 10), which takes O(n) time because every reference is copied across; CPython's built-in list grows by a smaller factor, but the idea is the same. To generalise the efficiency of appending to a list, amortised analysis can be used, which shows that the average cost per append is O(1).  
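The sketch below illustrates both points: the difference between a shallow and a deep copy, and why doubling keeps appends cheap on average. The function `copies_when_doubling` is a hypothetical simulation of a doubling dynamic array, not a measurement of Python's own list.

```python
import copy

# Shallow copy: a new list whose cells reference the same objects.
original = [[1, 2], [3, 4]]
shallow = original.copy()
deep = copy.deepcopy(original)

original[0].append(99)   # mutate an object that `shallow` also references
print(shallow[0])        # the shallow copy sees the change
print(deep[0])           # the deep copy does not


def copies_when_doubling(n_appends):
    """Simulate a doubling dynamic array; return total element copies made."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n_appends):
        if size == capacity:   # full: allocate double and copy everything
            copies += size
            capacity *= 2
        size += 1
    return copies


print(copies_when_doubling(1000))  # stays below 2 * 1000
```

Because the tables in this assignment hold plain integers, which are immutable, a shallow `list.copy()` is enough to trial a new arrangement safely.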
  
For this project, we can use lists to store all the data on families/preferences and modify each value (using the previously mentioned algorithms) to obtain the best score possible.  
  
***Data Structure 2 – Stacks (from chapter 6 of textbook)***   
  
Stacks are a very simple data type whose elements are accessed in a last-in first-out (LIFO) order. This makes them particularly useful for applications like tracking histories, for example, the undo mechanism in Microsoft Word, or the back button on a web browser. In Python, there is no built-in stack class, but a stack can be implemented using a list.  
  
Stack functions, such as push, pop, top, is_empty and len, all have a time complexity of O(1) (amortised O(1) for push on a list-backed stack), but in return it is awkward to access any element other than the top of the stack.  
For this project, the solution can be implemented using stack logic to try to cut down on the running time that general list operations incur.  
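A minimal sketch of a list-backed stack, in the spirit of the textbook's array-based stack, is shown below. The class name `ArrayStack` and the choice to raise `IndexError` on an empty stack are assumptions for this example.

```python
class ArrayStack:
    """Minimal LIFO stack backed by a Python list (illustrative sketch)."""

    def __init__(self):
        self._data = []

    def __len__(self):
        return len(self._data)

    def is_empty(self):
        return len(self._data) == 0

    def push(self, item):
        self._data.append(item)   # amortised O(1)

    def top(self):
        if self.is_empty():
            raise IndexError("top from empty stack")
        return self._data[-1]

    def pop(self):
        if self.is_empty():
            raise IndexError("pop from empty stack")
        return self._data.pop()   # O(1): removes the last element


history = ArrayStack()
for page in ["home", "search", "results"]:
    history.push(page)
print(history.pop(), history.top(), len(history))
```

The code later in this notebook uses a plain list with `append` and `pop` directly, which gives the same LIFO behaviour without the wrapper class.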

## Code Initialisation
Taken from https://www.kaggle.com/inversion/santa-s-2019-starter-notebook

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import time

fpath = '/kaggle/input/santa-workshop-tour-2019/family_data.csv'
data = pd.read_csv(fpath, index_col='family_id')
fpath = '/kaggle/input/santa-workshop-tour-2019/sample_submission.csv'
submission = pd.read_csv(fpath, index_col='family_id')
family_size_dict = data[['n_people']].to_dict()['n_people'] 
cols = [f'choice_{i}' for i in range(10)]
choice_dict = data[cols].to_dict()
days = list(range(100, 0, -1))

## Penalty Calculator
Below is my algorithm for calculating the penalty incurred by a given family/day arrangement. It is heavily based on https://www.kaggle.com/inversion/santa-s-2019-starter-notebook  
For all later parts of the code, each line carries a comment giving the number of primitive operations it performs; these counts are summed below each chunk of code.

In [None]:
def calc_penalty(table):
    penalty = 0    # 1
    people_scheduled = {k: 0 for k in days}    # n
    for family_id, day in enumerate(table):    # n
        number_of_people = family_size_dict[family_id]    # 2
        people_scheduled[day] += number_of_people    # 3
        # At most, this if statement has to do 10 evaluations, which all take 3
        if day == choice_dict['choice_0'][family_id]:
            penalty += 0    # 2
        elif day == choice_dict['choice_1'][family_id]:
            penalty += 50    # 2
        elif day == choice_dict['choice_2'][family_id]:
            penalty += 50 + 9 * number_of_people    # 4
        elif day == choice_dict['choice_3'][family_id]:
            penalty += 100 + 9 * number_of_people    # 4
        elif day == choice_dict['choice_4'][family_id]:
            penalty += 200 + 9 * number_of_people    # 4
        elif day == choice_dict['choice_5'][family_id]:
            penalty += 200 + 18 * number_of_people    # 4
        elif day == choice_dict['choice_6'][family_id]:
            penalty += 300 + 18 * number_of_people    # 4
        elif day == choice_dict['choice_7'][family_id]:
            penalty += 300 + 36 * number_of_people    # 4
        elif day == choice_dict['choice_8'][family_id]:
            penalty += 400 + 36 * number_of_people    # 4
        elif day == choice_dict['choice_9'][family_id]:
            penalty += 500 + 36 * number_of_people + 199 * number_of_people    # 6
        else:
            penalty += 500 + 36 * number_of_people + 398 * number_of_people    # 6

    for _, occupancy in people_scheduled.items():    # n
        if (occupancy < 125) or (occupancy > 300):    # 2 at worst
            # Use occupancy in penalty to incentivise picking under-occupied days
            penalty += (9999999999 - occupancy*10000)    # 4

    return penalty # 1

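Evaluating the above, with n as the number of families, the maximum number of operations is:  
Operations = 1 + n + n(2 + 3 + (10 \* 3) + 6) + n(2 + 4) + 1  
Operations = 48n + 2  
Which is O(n). (The two passes over the days always cover 100 entries; counting them as n overestimates the work for the 5000 families used here.)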
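The preference part of the penalty can also be expressed as a lookup table, which makes the cost of each choice easier to check by eye. This is a sketch only: `CHOICE_COSTS`, `NO_CHOICE_COST` and `preference_cost` are hypothetical names, with the values copied from the elif chain in calc_penalty.

```python
# Illustrative sketch: (fixed cost, cost per person) for choice_0 .. choice_9,
# copied from the elif chain in calc_penalty.
CHOICE_COSTS = [
    (0, 0), (50, 0), (50, 9), (100, 9), (200, 9),
    (200, 18), (300, 18), (300, 36), (400, 36), (500, 36 + 199),
]
NO_CHOICE_COST = (500, 36 + 398)  # day is not among the family's choices


def preference_cost(choice_rank, number_of_people):
    """Return the preference penalty for one family.

    choice_rank is 0-9 for choice_0..choice_9, or None if the assigned day
    is not one of the family's listed choices.
    """
    if choice_rank is None:
        fixed, per_person = NO_CHOICE_COST
    else:
        fixed, per_person = CHOICE_COSTS[choice_rank]
    return fixed + per_person * number_of_people


print(preference_cost(0, 4), preference_cost(2, 4), preference_cost(None, 4))
```

This does not change the O(n) result, since the rank of the assigned day still has to be found for each family.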

## Algorithm 1: Recursion attempt

In [None]:
submission = pd.read_csv(fpath, index_col='family_id')

def find_best_score(family, table, pick=0):
    day = choice_dict[f'choice_{pick}'][family]    # 3
    test_table = table.copy()    # n
    test_table[family] = day    # 2
    if pick == 9:    # 1
        return test_table    # 1
    else:
        new_table = find_best_score(family, test_table, pick + 1)    # 1
        test_score = calc_penalty(test_table)     # 48n + 3
        new_score = calc_penalty(new_table)    # 48n + 3 
        if new_score < test_score:    # 1
            return new_table    # 1
        else:
            return test_table    # 1


start = time.process_time()    # 1

table = submission['assigned_day'].tolist()    # n
new = table.copy()    # n
for fam_id, _ in enumerate(table):    # n
    current_score = calc_penalty(new)    # 48n + 3
    new_table = find_best_score(fam_id, new).copy()    # n(97n + 16)
    new_score = calc_penalty(new_table)    # 48n + 3
    if new_score < current_score:    # 1
        new = new_table.copy()    # n
submission['assigned_day'] = new    # n
score = calc_penalty(new)    # 48n + 3
print(f'Recursion Score: {score}')
print(f'Recursion Time: {time.process_time() - start}')
submission.to_csv(f'submission_{score}.csv')

For find_best_score(), the worst case is:  
Operations = 3 + n + 2 + 1 + 1 + 1 + 48n + 3 + 48n + 3 + 1 + 1  
Operations = 97n + 16  
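However, each top-level call recurses once per choice (a depth of 10 here). Treating the number of choices as growing with n, as the rest of this analysis does, gives O(n<sup>2</sup>); if the number of choices is held constant at 10, find_best_score() on its own is O(n).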

Therefore, for the worst case recursion attempt:  
Operations = 1 + n + n + n(48n + 3 + n(97n + 16) + 48n + 3 + 1 + n) + n + 48n + 3  
Operations = 97n<sup>3</sup> + 113n<sup>2</sup> + 58n + 4  
Which is O(n<sup>3</sup>)  
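An expansion like this is easy to get wrong by hand, so a quick numerical check can help. The sketch below compares the unexpanded and expanded operation counts for the recursion attempt; the function names are hypothetical.

```python
def unexpanded(n):
    """Operation count for the recursion attempt, written term by term."""
    per_family = (48*n + 3) + n*(97*n + 16) + (48*n + 3) + 1 + n
    return 1 + n + n + n*per_family + n + 48*n + 3


def expanded(n):
    """The simplified polynomial from the text."""
    return 97*n**3 + 113*n**2 + 58*n + 4


# A cubic is fixed by four points, so agreement on many values of n
# is strong evidence that the expansion is right.
assert all(unexpanded(n) == expanded(n) for n in range(50))
print(expanded(10))
```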

## Algorithm 2: For-loop attempt

In [None]:
submission = pd.read_csv(fpath, index_col='family_id')

start = time.process_time()    # 1

table = submission['assigned_day'].tolist()    # n
new2 = table.copy()    # n
for fam_id, _ in enumerate(new2):    # n
    current_score = calc_penalty(new2)    # 48n + 3
    trial = new2.copy()    # n
    new_scores = list(range(0, len(choice_dict)))    # 5 (3 for range, 1 for len, 1 for assignment)
    for i in new_scores:    # n
        trial[fam_id] = choice_dict[f'choice_{i}'][fam_id]    # 4
        new_scores[i] = calc_penalty(trial)    # 48n + 4
    if current_score > min(new_scores):    # n + 1
        best_choice = new_scores.index(min(new_scores))    # 2n + 1
        new2[fam_id] = choice_dict[f'choice_{best_choice}'][fam_id]    # 4

submission['assigned_day'] = new2    # n
score = calc_penalty(new2)    # 48n + 3
print(f'For Loop Score: {score}')
print(f'For Loop Time: {time.process_time() - start}')
submission.to_csv(f'submission_{score}.csv')

For the worst case for-loop attempt:  
Operations = 1 + n + n + n(48n + 3 + n + 5 + n(4 + 48n + 4) + n + 1 + 2n + 1 + 4) + n + 48n + 3  
Operations = 48n<sup>3</sup> + 60n<sup>2</sup> + 65n + 4  
Which is O(n<sup>3</sup>)
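The for-loop attempt calls `min` and then `index`, which scans the score list more than once. One possible alternative, sketched below with made-up scores, finds the position of the lowest score in a single pass. The values in `scores` and `current_score` are invented for illustration.

```python
# Illustrative sketch with made-up penalty scores for the 10 choices.
scores = [420, 380, 380, 910, 500, 640, 700, 815, 930, 990]
current_score = 400

# One pass: find the position of the lowest score (first one wins on ties).
best_choice = min(range(len(scores)), key=scores.__getitem__)

if scores[best_choice] < current_score:
    print(f'Switch to choice_{best_choice} with score {scores[best_choice]}')
```

This only trims constant factors from the lower-order terms; the leading term still comes from the repeated calls to calc_penalty.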

## Algorithm 3: For-loop with stacks attempt

In [None]:
submission = pd.read_csv(fpath, index_col='family_id')

# Stack reversal code was sourced from https://stackoverflow.com/questions/32975344/reversing-a-stack-in-python
def reverse(stack):    # Total: 3n + 1
    items = []    # 1
    while stack:    # n
        items.append(stack.pop())    # 2
    for item in items:    # n
        stack.append(item)    # 1

        
start = time.process_time()    # 1
table = submission['assigned_day'].tolist()    # n
answer = []    # 1
fam_id = 0    # 1
my_table = table.copy()    # n
reverse(my_table)    # 3n + 1

while my_table:    # n
    answer.append(my_table.pop())    # 2
    for i in choice_dict:    # n
        current_score = calc_penalty(answer)    # 48((n+1)/2) + 3 (answer grows from 1 to n, so n is replaced by its average, (n+1)/2)
        current = answer.pop()    # 2
        answer.append(choice_dict[i][fam_id])    # 3
        new_score = calc_penalty(answer)    # 48((n+1)/2) + 3 (same as above)
        if current_score < new_score:    # 1
            answer.pop()    # 1
            answer.append(current)    # 1
    fam_id += 1    # 2

submission['assigned_day'] = answer    # n
score = calc_penalty(answer)    # 48n + 3
print(f'For Loop (Stacks) Score: {score}')
print(f'For Loop (Stacks) Time: {time.process_time() - start}')
submission.to_csv(f'submission_{score}.csv')

For the worst case for-loop with stacks attempt:  
Operations = 1 + n + 1 + 1 + n + 3n + 1 + n(2 + n(48((n+1)/2) + 3 + 2 + 3 + 48((n+1)/2) + 3 + 1 + 1 + 1) + 2) + n + 48n + 3  
Operations = 48n<sup>3</sup> + 62n<sup>2</sup> + 58n + 7  
Which is O(n<sup>3</sup>)
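The (n+1)/2 substitution used in the code comments can be justified with a short sum. Assuming, as the analysis does, that calc_penalty costs 48k + 3 operations on a list of length k, and that the answer stack has length k during the k-th pass of the outer loop, the total over all n passes is:

```latex
\sum_{k=1}^{n} (48k + 3)
  = 48 \cdot \frac{n(n+1)}{2} + 3n
  = n \left( 48 \cdot \frac{n+1}{2} + 3 \right)
```

So multiplying the averaged cost 48((n+1)/2) + 3 by the n passes gives exactly the same total as summing the true cost of each pass.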

## Comparisons

All 3 algorithms have a Big-Oh of O(n<sup>3</sup>) when the 10 choices per family are counted as n, as in the analysis above. While this does mean that their running times grow in a similar shape as the number of inputs increases, it does not necessarily mean that these algorithms are equal. In the recorded runs, algorithm 1 took 551s, algorithm 2 took 302s and algorithm 3 took 134s. The operation equations for each of the algorithms have been plotted below:

In [None]:
# Import our modules that we are using
import matplotlib.pyplot as plt

# Create the vectors X and Y
x = np.array(range(100),dtype='int64')
y_1 = np.array(range(100),dtype='int64')
y_2 = np.array(range(100),dtype='int64')
y_3 = np.array(range(100),dtype='int64')

# Coefficients match the operation counts derived for each algorithm
for i in x:
    y_1[i] = 97*(i**3) + 113*(i**2) + 58*i + 4
    y_2[i] = 48*(i**3) + 60*(i**2) + 65*i + 4
    y_3[i] = 48*(i**3) + 62*(i**2) + 58*i + 7


# Create the plot
plt.plot(x,y_1,label='Recursion')
plt.plot(x,y_2,label='For loop')
plt.plot(x,y_3,label='For loop (with stacks)')


# Add a title
plt.title('Runtime comparison')

# Add X and y Label
plt.xlabel('Inputs')
plt.ylabel('# of primitive operations')

# Add a Legend
plt.legend()

The plot above shows that the recursion algorithm scales significantly worse with the number of inputs than the two for-loop algorithms, which agrees with the recorded times. However, the plotted for-loop equations are nearly identical, whereas the recorded times differ considerably. Algorithm 3 achieves a much lower running time in practice because the value of n that drives most of the terms in its equation is smaller: the answer stack starts empty and only reaches full size at the end of the run. While the n<sup>3</sup> terms are caused by nearly identical operations, the n<sup>2</sup> terms are caused by very different operations. In algorithm 2, one of the n<sup>2</sup> terms is caused by a list.copy() call used to trial new family/day arrangements, which means 5000 entries have to be copied every time that line is reached. By using a brand-new stack that is built up gradually throughout the program's execution, algorithm 3 bypasses this trial copy completely. However, this comes at the cost of running the calc_penalty function twice per trial. Even so, the smaller size of the answer stack near the beginning of the program's execution makes up for the time lost calling the function twice.

It also appears that starting from a fresh stack yielded a better result (963,859 vs. 573,618). This is most likely because calc_penalty heavily penalises any day outside the 125 to 300 occupancy range, so when starting from a full table a family cannot be moved, even temporarily, out of a day at minimum capacity or into a day at maximum capacity. Starting from an empty stack temporarily sidesteps this: every day begins below the minimum capacity, so the extra 'underbooking penalty' could effectively be ignored for the first few families.

To conclude, the recursive algorithm was much less effective than the for-loop algorithms, as it needed two calls to calc_penalty at every level to keep track of the current score through the recursive calls. The for-loop version was able to bypass this, as it kept the current best score intact in a variable outside the inner loop. Perhaps the recursive function could be improved by implementing something similar. Algorithms 2 and 3 appear functionally similar, but algorithm 3 runs much faster in practice. This is because algorithm 3 uses a brand-new stack instead of the O(n) list.copy() call to test the data. This also yielded a better overall result, as it bypassed some of the penalty function's limitations. Most of the stack operations (such as pop and append) also run in constant time, whereas the list-based version relied on functions such as min (which is linear) and on indexed lookups and assignments (which add constant time).
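One way the suggested improvement to the recursive function might look is sketched below: the best score found so far is passed down through the recursive calls, so each level evaluates the cost function once instead of twice. This is an untested idea against the real data; `toy_cost`, `best_pick` and the sample values are hypothetical stand-ins for calc_penalty, find_best_score and choice_dict.

```python
# Illustrative sketch with a toy cost function; `toy_cost` and the data below
# are hypothetical stand-ins for calc_penalty and choice_dict.
def toy_cost(table):
    """Stand-in penalty: sum of assigned day numbers."""
    return sum(table)


def best_pick(family, table, choices, cost, pick=0, best=None):
    """Return (score, table) for the best of this family's choices.

    The best (score, table) pair seen so far is passed down the recursion,
    so each level evaluates the cost function exactly once.
    """
    if pick == len(choices):
        return best
    trial = table.copy()
    trial[family] = choices[pick]
    score = cost(trial)   # one evaluation per level
    if best is None or score < best[0]:
        best = (score, trial)
    return best_pick(family, table, choices, cost, pick + 1, best)


table = [7, 7, 7]
score, new_table = best_pick(1, table, [9, 3, 5], toy_cost)
print(score, new_table)
```

Returning the score alongside the table also means the caller does not need to recompute the penalty of the winning arrangement.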