In [2]:
import numpy as np

### 1. Rescaling Data 


Consider our baseball player example. This time, we consider three variables: (1) height, (2) weight, and (3) salary. We consider 6 players with the corresponding information:

- height (inches): [72,78,69,71,76,79]
- weight (lbs): [180, 215, 210, 188, 176, 209]
- salary (USD): [2200000, 2500000, 2700000, 6500000, 2600000, 3500000]


Run the cell below to generate two lists. baseball_players stores the data, and features stores the feature names. 

In [3]:
baseball_players = [[72,78,69,71,76,79], 
                    [180, 215, 210, 188, 176, 209], 
                    [2200000, 2500000, 2700000, 6500000, 2600000, 3500000]]
features = ['height', 'weight', 'salary']


- Convert the 2D list to a 2D numpy array named baseball_players_arr.
- Use a loop to create a dictionary, dict_baseball_players, where the keys are the variable names (i.e., elements from features) and the values are the data for the corresponding feature name. 
    - Hint 1: Use dict_baseball_players = {} to create an empty dictionary before the loop.
    - Hint 2: You can use either the enumerate function or your own counter for the loop.
- (Discussion Question, Don't Code up Anything) Can you create a dictionary manually? Briefly discuss the advantage of using a loop when the number of variables is large.

In [4]:
# YOUR CODE HERE (convert 2d list to 2d numpy array)
baseball_players_arr=np.array(baseball_players)
baseball_players_arr


array([[     72,      78,      69,      71,      76,      79],
       [    180,     215,     210,     188,     176,     209],
       [2200000, 2500000, 2700000, 6500000, 2600000, 3500000]])

In [None]:
# "Create a dictionary" Hint: your output should be something like:
'''
{'height': [72, 78, 69, 71, 76, 79],
 'weight': [180, 215, 210, 188, 176, 209],
 'salary': [2200000, 2500000, 2700000, 6500000, 2600000, 3500000]}
'''

In [13]:
# YOUR CODE HERE (Use a loop to create dict_baseball_players dictionary)
dict_baseball_players = {}
for i, v in enumerate(features):
    dict_baseball_players[v] = baseball_players[i]
print(dict_baseball_players)

{'height': [72, 78, 69, 71, 76, 79], 'weight': [180, 215, 210, 188, 176, 209], 'salary': [2200000, 2500000, 2700000, 6500000, 2600000, 3500000]}
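
As an aside, the same dictionary can be built without an explicit counter by pairing each feature name with its row using zip. This is only a minimal sketch: the data is copied from the cells above so it runs on its own, and the name dict_from_zip is illustrative, not part of the assignment.

```python
# Data copied from the cell above so the sketch is self-contained
baseball_players = [[72, 78, 69, 71, 76, 79],
                    [180, 215, 210, 188, 176, 209],
                    [2200000, 2500000, 2700000, 6500000, 2600000, 3500000]]
features = ['height', 'weight', 'salary']

# zip pairs each feature name with the row at the same position
dict_from_zip = dict(zip(features, baseball_players))
print(dict_from_zip['height'])
```

Both approaches give the same result; the loop with enumerate makes the index explicit, which is what the exercise asks for.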


In [None]:
# YOUR DISCUSSION HERE
# Yes, we can create a dictionary manually by typing out each key-value pair.
# Advantages of using a loop:

# 1) A loop automates the repetitive work, which saves time.
# 2) With a large number of variables, the same few lines of code handle every feature, so the manual effort does not grow with the size of the data.



### Scaling Data - Continued

The range of values in raw data can vary widely. For example, if the heights of baseball players 
are measured in inches, the values are around 75. If we also have the annual income (measured in
dollars) of each baseball player, those values are in the millions. Many statistical models
compute the distance between two points (e.g., two baseball players) using the Euclidean
distance. If one of the dimensions has a much broader range of values, the distance will be
dominated by that dimension. Therefore, all features should be rescaled
so that each dimension contributes approximately proportionately to the final distance.
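
To make this concrete, here is a small sketch that computes the Euclidean distance between the first two players in our data using the raw, unscaled features. The variable names (player_a, player_b, salary_share) are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Two players as (height in inches, weight in lbs, salary in USD);
# values are the first two players from the data above
player_a = np.array([72, 180, 2200000], dtype=float)
player_b = np.array([78, 215, 2500000], dtype=float)

diff = player_a - player_b
distance = np.sqrt(np.sum(diff ** 2))   # Euclidean distance on raw features
salary_gap = abs(diff[2])               # difference in salary alone

# Fraction of the squared distance that comes from the salary dimension
salary_share = diff[2] ** 2 / np.sum(diff ** 2)
print(distance, salary_gap, salary_share)
```

The distance is almost exactly the salary gap, so the height and weight differences have practically no influence on it.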

Rescaling data is a very important pre-processing step in many data analytics tasks.  To this end, this question asks you to scale a variable based on the following equation: 

$$x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

This ensures that every scaled value lies in the range [0, 1], with the minimum mapped to 0 and the maximum mapped to 1.


- Create a 1D numpy array, height, that stores the height information.
- Scale height based on the equation above. Store the result as height_scaled. Print the result. *Hint: max and min can be obtained using np.max(x) and np.min(x), respectively.*

In [14]:
# YOUR CODE HERE (Create height)
height=np.array([72, 78, 69, 71, 76, 79])
print(height)

[72 78 69 71 76 79]


In [15]:
# YOUR CODE HERE (Compute height_scaled, print it)
min_height = np.min(height)
max_height = np.max(height)
# Element-wise min-max scaling; the result is a new numpy array
height_scaled = (height - min_height) / (max_height - min_height)
print(height_scaled)

[0.3 0.9 0.  0.2 0.7 1. ]



Usually we need to scale all of the variables, so it is convenient to wrap the process in a function. 
- Create a function called scale. The function should take a 1D numpy array and return a new 1D numpy array.
    - The output should be the scaled result of the input (i.e., the same min-max scaling applied to height above).
    - Note: In your code, you do not need to specifically check whether the input is a 1D numpy array or not.
    
- Create a new dictionary dict_baseball_players_scaled which is a scaled version of dict_baseball_players and print the result.
    - Use each feature name as a key, and a numpy array containing the corresponding scaled values as its value.
    - Use a loop and apply the function we wrote for the previous prompt.


In [17]:
# YOUR CODE HERE (Define scale function)
def scale(array):
    min_arr = np.min(array)
    max_arr = np.max(array)
    arr_scaled = (array - min_arr) / (max_arr - min_arr)
    return arr_scaled
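
One edge case worth knowing about: if every value in the input is identical, max - min is zero and the formula divides by zero. The sketch below is one possible way to guard against that. The name scale_safe and the choice to return all zeros for a constant feature are assumptions made for illustration, not requirements of the exercise.

```python
import numpy as np

def scale_safe(array):
    """Min-max scale a 1D array; return all zeros if every value is identical."""
    array = np.asarray(array, dtype=float)
    min_arr = np.min(array)
    value_range = np.max(array) - min_arr
    if value_range == 0:
        # Assumption: map a constant feature to 0 instead of dividing by zero
        return np.zeros_like(array)
    return (array - min_arr) / value_range

print(scale_safe([72, 78, 69, 71, 76, 79]))
print(scale_safe([5, 5, 5]))
```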

In [19]:
# YOUR CODE HERE (Create dict_baseball_players_scaled)
dict_baseball_players_scaled = {}
for x, z in dict_baseball_players.items():
    dict_baseball_players_scaled[x] = scale(z)
print(dict_baseball_players_scaled)

{'height': array([0.3, 0.9, 0. , 0.2, 0.7, 1. ]), 'weight': array([0.1025641 , 1.        , 0.87179487, 0.30769231, 0.        ,
       0.84615385]), 'salary': array([0.        , 0.06976744, 0.11627907, 1.        , 0.09302326,
       0.30232558])}
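
For comparison, the same scaling can be applied to the whole 2D array at once, without a loop, by taking the minimum and maximum along each row. This is a sketch under the assumption that each row of the array is one feature, as in baseball_players_arr; the names row_min, row_max, and scaled_arr are illustrative.

```python
import numpy as np

# Data copied from above so the sketch is self-contained; each row is one feature
baseball_players_arr = np.array([[72, 78, 69, 71, 76, 79],
                                 [180, 215, 210, 188, 176, 209],
                                 [2200000, 2500000, 2700000, 6500000, 2600000, 3500000]])

# Row-wise min and max, kept as column vectors so they broadcast across each row
row_min = baseball_players_arr.min(axis=1, keepdims=True)
row_max = baseball_players_arr.max(axis=1, keepdims=True)
scaled_arr = (baseball_players_arr - row_min) / (row_max - row_min)
print(scaled_arr)
```

Each row of scaled_arr should match the corresponding entry of dict_baseball_players_scaled.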


### 2. Automating Task Assignment

You have found yourself managing a new group at GenericDataScienceCo at a particularly busy time.  You have a backlog of tasks that need to be completed as soon as possible.  There are 4 employees in your group that you can assign these tasks to: Alice, Bob, Cheryl, and Daniel.  Below we have code giving the information for the tasks and a list of your employees' names.

Each task is a tuple, where the first entry of the tuple is a string corresponding to its name and the second entry of the tuple is a number, corresponding to how long the task takes to complete.

In [2]:
# Run this code to make these variables available
tasks = [("Task1", 10), ("Task2", 4), ("Task3", 3), ("Task4", 27), ("Task5", 14), ("Task6", 5), 
         ("Task7", 4), ("Task8", 25), ("Task9", 19), ("Task10", 40), ("Task11", 3), ("Task12", 27)]
employees = ["Alice", "Bob", "Cheryl", "Daniel"]

Your goal is to assign each task to a single employee so that all of the tasks are completed as soon as possible (assuming that each employee works on their tasks in parallel and with no interruptions).  

After thinking about it for a bit, you came up with the following greedy heuristic which might work well: Iterate over each task and assign it to the employee who currently has the *least* amount of work (breaking ties arbitrarily).  We want to implement a function called greedy_assignment which realizes this heuristic.  We will do this over a few steps in order to create a clean implementation.



We will use a dictionary to represent the assignment of tasks to workers.  Below I've given an example.

In [3]:
# Run the following code to make this variable available
example_assignment = {'Alice': [("Task1", 10), ("Task2", 4), ("Task3", 3)],
                      'Bob': [("Task4", 27), ("Task5", 14), ("Task6", 5)],
                      'Cheryl': [("Task7", 4), ("Task8", 25), ("Task9", 19)],
                      'Daniel': [("Task10", 40), ("Task11", 3), ("Task12", 27)]}

Given an assignment, we would like to keep track of each employee's workload (the total amount of time needed for the tasks assigned to them) and the total time needed to finish all of the tasks (the largest workload of any employee).

- Complete the function below, compute_workloads, which takes an assignment dictionary as input and should output a dictionary containing the workload of each employee.
    - The output of the function should be a dictionary workloads with the employee names as keys.
    - workloads[employee] should store the total time for all of the tasks assigned to employee.
    - e.g. workloads['Alice'] = 17
- Apply the function to example_assignment below and print out the time needed to finish all of the tasks under this assignment (the largest workload).

In [4]:
def compute_workloads(assignment):
    # YOUR CODE HERE
    #Computing the total workload of the respective employees
    workloads={}
    for employee, tasks in assignment.items():
        workloads[employee]=sum(task[1] for task in tasks)
    return workloads

# YOUR CODE HERE (Print out time needed to finish all tasks)   
example_workloads = compute_workloads(example_assignment)
print(example_workloads)
# Calculating the maximum workload
max_workload = max(example_workloads.values())
print("Maximum Workload is:", max_workload)


{'Alice': 17, 'Bob': 46, 'Cheryl': 48, 'Daniel': 70}
Maximum Workload is: 70



Below we have defined three functions which together will implement our assignment heuristic.  The first two functions below, initialize_assignment and assign_task are currently undefined, and it is your job to finish these.  The last function, greedy_assignment, is given and puts together the previous two functions to implement our heuristic.

- Complete the function initialize_assignment
    - This function should output two dictionaries (assignment and workloads) whose keys are the employee names
    - assignment[employee] should be the empty list for each employee
    - workloads[employee] should be 0 for each employee
    
- Complete the function assign_task
    - This function takes two dictionaries (assignment and workloads) and a tuple (task) as input.  
    - The function should modify assignment and workloads so that an employee with the smallest workload gets the task added to their list and their workload is updated to account for the new task
    - Example - Suppose that:
        - assignment = {'Alice': [("Task1", 10)], 'Bob': [("Task4", 27)], 'Cheryl': [("Task7", 4)], 'Daniel': [("Task10", 40)]}
        - workloads = {'Alice': 10, 'Bob': 27, 'Cheryl': 4, 'Daniel': 40}
        - task = ("Task2", 4).  
    - Then after executing assign_task(assignment, workloads, task) we should have:
        - assignment == {'Alice': [("Task1", 10)], 'Bob': [("Task4", 27)], 'Cheryl': [("Task7", 4), ("Task2", 4)], 'Daniel': [("Task10", 40)]}
        - workloads == {'Alice': 10, 'Bob': 27, 'Cheryl': 8, 'Daniel': 40}

In [5]:
def initialize_assignment(employees):
    assignment={employee: [] for employee in employees}
    workloads={employee: 0 for employee in employees}
    return assignment, workloads


In [6]:
def assign_task(assignment, workloads, task):
    employee_min_workload=min(workloads, key=workloads.get)
    assignment[employee_min_workload].append(task)
    workloads[employee_min_workload]+=task[1]
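
A quick way to check assign_task is to replay the example from the prompt. The sketch below repeats the function so it runs on its own and uses separate names (demo_assignment, demo_workloads), which are illustrative and chosen so that nothing else in the notebook is overwritten.

```python
def assign_task(assignment, workloads, task):
    # Same logic as the cell above, repeated so the check is self-contained
    least_busy = min(workloads, key=workloads.get)
    assignment[least_busy].append(task)
    workloads[least_busy] += task[1]

# State taken from the example in the prompt
demo_assignment = {'Alice': [("Task1", 10)], 'Bob': [("Task4", 27)],
                   'Cheryl': [("Task7", 4)], 'Daniel': [("Task10", 40)]}
demo_workloads = {'Alice': 10, 'Bob': 27, 'Cheryl': 4, 'Daniel': 40}

# Cheryl has the smallest workload, so she should receive the new task
assign_task(demo_assignment, demo_workloads, ("Task2", 4))
print(demo_assignment['Cheryl'], demo_workloads['Cheryl'])
```

Note that the function modifies both dictionaries in place and returns nothing, which is why greedy_assignment can call it without capturing a result.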


In [7]:
# The code below shows how we put together the two functions above.
def greedy_assignment(tasks, employees):
    # initialize assignment and workload dictionaries
    assignment, workloads = initialize_assignment(employees)
        
    # Iterate over each task to assign it
    for task in tasks:
        assign_task(assignment, workloads, task)
        
    return assignment, workloads

assignment, workloads = greedy_assignment(tasks, employees)


Finally let's write a function to cleanly output the results and apply this function to the result of our greedy_assignment function.  Is this the best possible way to assign the tasks to the employees?

- Complete the function print_assignment below.  Each employee's task list should be displayed on a separate line.  Next, print out the assignment length (time needed to finish all tasks) on a separate line.
- (Discussion question, no code necessary)  Is this assignment the best assignment (with respect to the amount of time it needs to finish)?  If not, how might we find a better (or even the best) assignment?



In [8]:
def print_assignment(assignment, workloads):
    # YOUR CODE HERE
    # Print each employee's task list on its own line
    for employee, tasks in assignment.items():
        print(employee, ":", tasks)
    # Employees work in parallel, so the finishing time is the largest workload
    assignment_length = max(workloads.values())
    print("Time needed to finish all the tasks is:", assignment_length)
    
print_assignment(assignment, workloads)

Alice : [('Task1', 10), ('Task8', 25)]
Bob : [('Task2', 4), ('Task6', 5), ('Task7', 4), ('Task9', 19)]
Cheryl : [('Task3', 3), ('Task5', 14), ('Task10', 40)]
Daniel : [('Task4', 27), ('Task11', 3), ('Task12', 27)]
Time needed to finish all the tasks is: 57


In [1]:
# YOUR DISCUSSION HERE 

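# No, this assignment is not the best one. The greedy heuristic is fast, but it can be suboptimal.
# Here the largest workload is 57 (Cheryl and Daniel), while the total work of 181 split across 4 employees gives a lower bound of 46 (181 / 4, rounded up).
# A simple improvement is to sort the tasks from longest to shortest before applying the same greedy rule.
# Methods for the classic one-task-per-employee assignment problem, such as the Hungarian algorithm or network flow algorithms, do not apply directly here, because each employee can take several tasks and we minimize the largest workload.
# Exact approaches such as integer programming or an exhaustive search over assignments can find the best assignment, but they become much slower as the number of tasks grows.
# To summarise, the choice of algorithm is a trade-off between the quality of the solution and the computation time.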
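
One inexpensive way to look for a better assignment is to keep the same greedy rule but process the tasks from longest to shortest. The sketch below is a hedged illustration of that idea, not the intended solution: the function name lpt_assignment and the variables lpt_plan, lpt_workloads, and lower_bound are assumptions introduced here, and the task data is copied from above so the block runs on its own.

```python
# Data copied from the cells above so the sketch is self-contained
tasks = [("Task1", 10), ("Task2", 4), ("Task3", 3), ("Task4", 27), ("Task5", 14), ("Task6", 5),
         ("Task7", 4), ("Task8", 25), ("Task9", 19), ("Task10", 40), ("Task11", 3), ("Task12", 27)]
employees = ["Alice", "Bob", "Cheryl", "Daniel"]

def lpt_assignment(tasks, employees):
    """Greedy assignment applied to tasks sorted from longest to shortest."""
    assignment = {employee: [] for employee in employees}
    workloads = {employee: 0 for employee in employees}
    for task in sorted(tasks, key=lambda t: t[1], reverse=True):
        # Same rule as before: give the task to the least busy employee
        least_busy = min(workloads, key=workloads.get)
        assignment[least_busy].append(task)
        workloads[least_busy] += task[1]
    return assignment, workloads

lpt_plan, lpt_workloads = lpt_assignment(tasks, employees)

# No assignment can beat the total work divided evenly (rounded up)
total_time = sum(duration for _, duration in tasks)
lower_bound = -(-total_time // len(employees))  # integer ceiling division

print("Longest-first greedy:", max(lpt_workloads.values()))
print("Lower bound:", lower_bound)
```

On this data the longest-first variant finishes in 47, compared with 57 for the original task order and a lower bound of 46. It is still a heuristic, so it does not guarantee the best assignment in general.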