## Project 3 - Scheduling and Decision Analysis with Uncertainty

*Deanna Schneider contributed the bulk of this project. THANK YOU Deanna. Don't blame her for the decision analysis though ... that was our idea.*

For the final project, we're going to combine concepts from Lesson 7 (Constraint Programming), Lesson 8 (Simulation), and Lesson 9 (Decision Analysis). We'll do this by revisiting the scheduling problem from Lesson 7. But, we're going to make it a little more true-to-life by acknowledging some of the uncertainty in our estimates, and using simulation to help us come up with better estimates. We'll use our estimated profits to construct a payoff table and make a decision about how to proceed with the building project.

When we originally created the problem, we used the following estimates for time that each task would take:

<img src='images/reliable_table.png' width="450"/>

But based on past experience, we know that these are just the most likely estimates of the time needed for each task. Here are our estimated ranges of values (in days instead of weeks) for each task:

<img src='images/reliable-estimate-ranges.png' width="450"/>

Further, we're going to consider the following factors:

* The base amount that Reliable will earn is \$5.4 million.
* If Reliable completes the project in 280 days or less, they will get a bonus of \$150,000.
* If Reliable misses the deadline of 329 days, there will be a \$25,000 penalty for each day over 329.
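
The three payment rules above can be sketched as a single function. This is only an illustrative helper (the name `project_profit` and its parameters are not part of the assignment), and it assumes the project length is given in whole days:

```python
def project_profit(days, base=5_400_000, bonus=150_000,
                   bonus_deadline=280, penalty_deadline=329,
                   penalty_per_day=25_000):
    """Return the profit in dollars for a project that takes `days` days."""
    profit = base
    if days <= bonus_deadline:
        # finished in 280 days or less: early-completion bonus
        profit += bonus
    elif days > penalty_deadline:
        # each day past day 329 costs a fixed penalty
        profit -= penalty_per_day * (days - penalty_deadline)
    return profit
```

For example, a 331-day project is two days late, so under these rules it would earn \$5.4 million minus \$50,000.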

### Part One

Create a simulation that uses a triangular distribution to estimate the duration for each of the activities. Use the Optimistic Estimate, Most Likely Estimate, and Pessimistic Estimate for the 3 parameters of your triangular distribution.   Use CP-SAT to find the minimal schedule length in each iteration.  Track the total weeks each simulation takes and the profit for the company.

Put your simulation code in the cell below.  Use at least 1000 iterations.  Check your simulation results to make sure the tasks are being executed in the correct order!
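
As a rough illustration of the sampling step, the sketch below draws one task's duration from a triangular distribution using only the standard library. The three values are the roofing estimates that appear in the `simulate` cell later in this notebook; the variable names are illustrative. Note that `random.triangular` takes its arguments as `(low, high, mode)`, whereas `np.random.triangular` takes `(left, mode, right)`.

```python
import random

rng = random.Random(6)  # fixed seed so the sketch is reproducible

# (optimistic, most likely, pessimistic) duration in days for one task
low, mode, high = 28, 35, 70

# stdlib argument order is (low, high, mode)
samples = [rng.triangular(low, high, mode) for _ in range(10_000)]

sample_mean = sum(samples) / len(samples)
# the mean of a triangular distribution is the average of its three parameters
theoretical_mean = (low + mode + high) / 3
```

Because the distribution is skewed toward the pessimistic end, the mean duration sits above the most likely value.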

<font color = "blue"> *** 8 points -  answer in cell below *** (don't delete this cell) </font>

### Define data

In [1]:
import numpy as np

activity_map = {
    'A': 'excavate',
    'B': 'lay_foundation',
    'C': 'rough_wall',
    'D': 'roof',
    'E': 'exterior_plumbing',
    'F': 'interior_plumbing',
    'G': 'exterior_siding',
    'H': 'exterior_painting',
    'I': 'electrical_work',
    'J': 'wallboard',
    'K': 'flooring',
    'L': 'interior_painting',
    'M': 'exterior_fixtures',
    'N': 'interior_fixtures'
}

activities = {
    'A': '',
    'B': 'A',
    'C': 'B',
    'D': 'C',
    'E': 'C',
    'F': 'E',
    'G': 'D',
    'H': ['E','G'],
    'I': 'C',
    'J': ['F','I'],
    'K': 'J',
    'L': 'J',
    'M': 'H',
    'N': ['K','L']
}

precedence_dict = {}

# populate precedence dict: map each task name to the tasks that must follow it
for pred in activities:
    for task, task_preds in activities.items():
        # normalize a single predecessor (or '') to a list
        if not isinstance(task_preds, list):
            task_preds = [task_preds]
        if pred in task_preds:
            precedence_dict.setdefault(activity_map[pred], []).append(activity_map[task])

### Define functions

In [2]:
def simulate(artifacts_found = True, buy_insurance = False, print_schedule = False):
    """
    function to simulate profit according to task duration and costs
    """
    # days in a week
    n = 7

    min_schedule_ls = []
    profit_ls = []

    np.random.seed(6)
    
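    # artifacts_found may be a bool or a probability; bool is a subclass of int,
    # so True/False are treated as probabilities 1.0/0.0
    prob_dist = None
    if isinstance(artifacts_found, (int, float)):
        prob_found = artifacts_found
        prob_dist = (1-prob_found, prob_found)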

    for j in range(1000):
        if prob_dist:
            artifacts_found = np.random.choice([0,1], 1, p=[*prob_dist])

        if artifacts_found == 1 or artifacts_found is True:
            time_dist = int(np.random.triangular(7/n, 15/n, 365/n))
        else:
            time_dist = int(np.random.triangular(7/n, 14/n, 21/n))

        task_duration_dict = {
            'excavate': time_dist,
            'lay_foundation': int(np.random.triangular(14/n, 21/n, 56/n)),
            'rough_wall': int(np.random.triangular(42/n, 63/n, 126/n)),
            'roof': int(np.random.triangular(28/n, 35/n, 70/n)),
            'exterior_plumbing': int(np.random.triangular(7/n, 28/n, 35/n)),
            'interior_plumbing': int(np.random.triangular(28/n, 35/n, 70/n)),
            'exterior_siding': int(np.random.triangular(35/n, 42/n, 77/n)),
            'exterior_painting': int(np.random.triangular(35/n, 56/n, 119/n)),
            'electrical_work': int(np.random.triangular(21/n, 49/n, 63/n)),
            'wallboard': int(np.random.triangular(21/n, 63/n, 63/n)),
            'flooring': int(np.random.triangular(21/n, 28/n, 28/n)),
            'interior_painting': int(np.random.triangular(7/n, 35/n, 49/n)),
            'exterior_fixtures': int(np.random.triangular(7/n, 14/n, 21/n)),
            'interior_fixtures': int(np.random.triangular(35/n, 35/n, 63/n))
        }
        task_names = list(task_duration_dict.keys())
        num_tasks = len(task_names)
        durations = list(task_duration_dict.values())

        task_name_to_number_dict = dict(zip(task_names, np.arange(0, num_tasks)))

        horizon = sum(task_duration_dict.values())

        from ortools.sat.python import cp_model
        model = cp_model.CpModel()

        start_vars = [
            model.NewIntVar(0, horizon, name=f'start_{t}') for t in task_names
        ]
        end_vars = [model.NewIntVar(0, horizon, name=f'end_{t}') for t in task_names]

        # `NewIntervalVar` creates both a variable and a constraint: it internally enforces start + duration = end
        intervals = [
            model.NewIntervalVar(start_vars[i],
                                 durations[i],
                                 end_vars[i],
                                 name=f'interval_{task_names[i]}')
            for i in range(num_tasks)
        ]

        # precedence constraints
        for before in list(precedence_dict.keys()):
            for after in precedence_dict[before]:
                before_index = task_name_to_number_dict[before]
                after_index = task_name_to_number_dict[after]
                model.Add(end_vars[before_index] <= start_vars[after_index])

        obj_var = model.NewIntVar(0, horizon, 'largest_end_time')
        model.AddMaxEquality(obj_var, end_vars)
        model.Minimize(obj_var)

        solver = cp_model.CpSolver()
        status = solver.Solve(model)
        
        # optimal schedule in days
        osl_days = solver.ObjectiveValue()*n
        
        # append optimal schedule to list
        min_schedule_ls.append(osl_days)
        
        import math
        
        base, bonus, penalty = 5400000, 0, 0

        # define insurance costs
        if buy_insurance == False:
            insurance_cost = 0
        else:
            insurance_cost = 500000

        # define artifact costs 
        if artifacts_found == 1 or artifacts_found == True:
            if buy_insurance == False:
                artifact_cost = np.random.exponential(scale=100000)
            else:
                artifact_cost = 0

        else:
            artifact_cost = 0

        # define bonus and penalty for days over deadline
        # (strict `<`: a schedule of exactly 280 days earns no bonus here)
        if math.ceil(osl_days) < 280:
            bonus = 150000
            
        elif math.ceil(osl_days) > 329:
            days_over = int(math.ceil(osl_days)) - 329
            penalty = days_over * 25000
        
        # calculate profit
        profit = base + bonus - (penalty + artifact_cost + insurance_cost)
        profit_ls.append(profit)
        
    if print_schedule is True:
        print(f'Optimal Schedule Length (weeks): {solver.ObjectiveValue()}')
        for i in range(num_tasks):
            print(
                f'{task_names[i]} start at {solver.Value(start_vars[i])} and end at {solver.Value(end_vars[i])}'
            )

    return profit_ls, min_schedule_ls

def return_stats(profit_ls, min_schedule_ls, show_summary = True):
    """
    function to return simulation statistics
    """    
    profit_ls = np.array(profit_ls)
    min_schedule_ls = np.array(min_schedule_ls)

    # calculate summary stats
    mean_profit = int(np.mean(profit_ls))
    less_than_280 = len(np.where(min_schedule_ls < 280)[0])/len(min_schedule_ls)
    between_280_and_329 = len(np.intersect1d(np.where(min_schedule_ls >= 280)[0],np.where(min_schedule_ls <= 329)[0]))/len(min_schedule_ls)
    over_329 = len(np.where(min_schedule_ls > 329)[0])/len(min_schedule_ls)

    # print summary stats else return them
    if show_summary is True:
        print(f"""Summary Stats:
        mean profit: ${mean_profit:,.2f}
        prob less than 280 days: {round(less_than_280*100,2)}%
        prob between 280 and 329 days: {round(between_280_and_329*100,2)}%
        prob over 329 days: {round(over_329*100,2)}%
        prob sum: {(less_than_280 + between_280_and_329 + over_329)*100}%""")

    else:
        return mean_profit

def show_payoff_table(artifacts_found = .30):
    """
    function to calculate payoff table for use in Bayes Decision Rule
    """ 
    import pandas as pd

    # define states, alternatives, and prior probs
    alternatives = {'Buy_Insurance': True,'No_Insurance': False}
    states =  {'Artifacts': True, 'No_Artifacts': False}
    df  = pd.DataFrame(columns = list(states.keys()), index=list(alternatives.keys()))
    prior_probs = [artifacts_found, 1-artifacts_found]

    # populate payoff table
    for alt_name, alt_val in alternatives.items():
        for state_name, state_value in states.items():
            profit_ls, min_schedule_ls = simulate(artifacts_found = state_value, buy_insurance = alt_val)
            mean_profit = return_stats(profit_ls, min_schedule_ls, show_summary = False)
            # single .loc call avoids chained assignment, which may not write back to df
            df.loc[alt_name, state_name] = round(mean_profit/1000000,1)

    return df, prior_probs


def bayes_calc(prior_probs, df):
    """
    function to calculate expected payoffs and best alternative
    """ 
    # create arrays of alternatives and prior probs
    alt_states = np.array([df.loc["Buy_Insurance"].tolist(),df.loc["No_Insurance"].tolist()])
    prior_probs = np.array(prior_probs)
    expected_payoffs = {}

    # calculate expected payoffs using Bayes' decision rule
    for i, alt in enumerate(alt_states):
        ep = sum(prior_probs * np.array(alt))
        expected_payoffs[df.index[i]] = ep

    # get maximum payoff and best alternative
    best_alt = max(expected_payoffs, key=expected_payoffs.get)
    max_val = expected_payoffs[best_alt]

    return best_alt, max_val

What is the probability that Reliable Company will finish the bid in less than 280 days, between 280 and 329 days, and over 329 days? What is their average profit?

Include code to answer these questions with output below:

<font color = "blue"> *** 2 points -  answer in cell below *** (don't delete this cell) </font>

In [3]:
profit_ls, min_schedule_ls = simulate(artifacts_found = False)
return_stats(profit_ls, min_schedule_ls)

Summary Stats:
        mean profit: $5,410,850.00
        prob less than 280 days: 27.3%
        prob between 280 and 329 days: 65.0%
        prob over 329 days: 7.7%
        prob sum: 100.0%
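
One way to sanity-check the task ordering: the CP-SAT model in `simulate` contains only precedence constraints, so its minimal schedule length should equal the longest path through the precedence graph. The sketch below computes that path directly for the most likely durations (the modes used in `simulate`, converted to weeks). The function name `earliest_finish` is illustrative and not part of the notebook.

```python
# most likely durations in weeks (modes from `simulate`, divided by 7)
durations = {'A': 2, 'B': 3, 'C': 9, 'D': 5, 'E': 4, 'F': 5, 'G': 6,
             'H': 8, 'I': 7, 'J': 9, 'K': 4, 'L': 5, 'M': 2, 'N': 5}

# immediate predecessors, as in the `activities` dict
predecessors = {'A': [], 'B': ['A'], 'C': ['B'], 'D': ['C'], 'E': ['C'],
                'F': ['E'], 'G': ['D'], 'H': ['E', 'G'], 'I': ['C'],
                'J': ['F', 'I'], 'K': ['J'], 'L': ['J'], 'M': ['H'],
                'N': ['K', 'L']}

def earliest_finish(durations, predecessors):
    """Earliest finish time of every task when only precedence matters."""
    finish = {}

    def visit(task):
        if task not in finish:
            # a task starts once all of its predecessors have finished
            start = max((visit(p) for p in predecessors[task]), default=0)
            finish[task] = start + durations[task]
        return finish[task]

    for task in durations:
        visit(task)
    return finish

finish = earliest_finish(durations, predecessors)
makespan_weeks = max(finish.values())
```

With these durations the longest path is 42 weeks (294 days), which falls between the two deadlines. A CP-SAT run on the same fixed durations should report the same objective value.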


### Part Two
From past experience, we know that special artifacts are sometimes found in the area where Reliable Construction is planning this building project.  When special artifacts are found, the excavation phase takes considerably longer and the entire project costs more - sometimes much more. They're never quite sure how much longer it will take, but it averages around an extra 15 days, and takes at least an extra 7 days. They've seen some sites where relocating the special artifacts took as much as 365 extra days (yes - a whole year)! 

In addition, there are usually unanticipated costs, such as fines. The accounting department suggests that we model those costs with an exponential distribution with mean (scale) \$100,000.
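
To get a feel for these two inputs, here is a small standard-library sketch. It assumes the extra excavation time is triangular with minimum 7, mode 15, and maximum 365 days (one possible reading of the description above), and that the cost is exponential with mean \$100,000. All variable names are illustrative.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility
n = 20_000

# extra excavation days: random.triangular takes (low, high, mode)
delays = [rng.triangular(7, 365, 15) for _ in range(n)]

# unanticipated costs: expovariate takes the rate, i.e. 1 / mean
costs = [rng.expovariate(1 / 100_000) for _ in range(n)]

mean_delay = sum(delays) / n
mean_cost = sum(costs) / n
```

Under this assumption 15 days is the most likely delay, not the average: the long right tail pulls the mean to about (7 + 15 + 365) / 3 = 129 days, which is worth keeping in mind when interpreting the simulated schedule lengths.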


Run a second simulation with these new parameters and using at least 1000 iterations.

Put your simulation code in the cell below.

<font color = "blue"> *** 8 points -  answer in cell below *** (don't delete this cell) </font>

In [4]:
profit_ls, min_schedule_ls = simulate(artifacts_found = True)

What is the probability of meeting the Under 280, 280-329 or over 329 cutoff points now? What's the average profit now?

Include code to answer these questions with output below:

<font color = "blue"> *** 2 points -  answer in cell below *** (don't delete this cell) </font>

In [5]:
return_stats(profit_ls, min_schedule_ls)

Summary Stats:
        mean profit: $3,187,411.00
        prob less than 280 days: 2.4%
        prob between 280 and 329 days: 18.8%
        prob over 329 days: 78.8%
        prob sum: 100.0%


### Part Three

Clearly, dealing with artifacts can be very costly for Reliable Construction.  It is known from past experience that about 30% of building sites in this area contain special artifacts.  Fortunately, they can purchase an insurance policy - a quite expensive insurance policy. The insurance policy costs \$500,000, but it covers all fines and penalties for delays in the event that special artifacts are found that require remediation. Effectively, this means that Reliable could expect the same profit they would get if no artifacts were found (minus the cost of the policy).

Given the estimated profit without artifacts, the estimated profit with artifacts, the cost of insurance, and the 30% likelihood of finding artifacts, create a payoff table and use Bayes' Decision Rule to determine what decision Reliable should make.  You should round the simulated costs to the nearest \$100,000 and use units of millions of dollars so that, for example, \$8,675,309 is 8.7 million dollars.

Provide appropriate evidence for the best decision such as a payoff table or picture of a suitable (small) decision tree.

<font color = "blue"> *** 6 points -  answer in cell below *** (don't delete this cell) </font>

In [6]:
df, prior_probs = show_payoff_table()
df.style.format('${0:,.1f}M')

| | Artifacts | No_Artifacts |
|---|---|---|
| Buy_Insurance | \$2.7M | \$4.9M |
| No_Insurance | \$3.2M | \$5.4M |


Describe, in words, the best decision and the reason for that decision:

<font color = "blue"> *** 2 points -  answer in cell below *** (don't delete this cell) </font>

In [7]:
best_alt, max_val = bayes_calc(prior_probs, df)
best_alt = best_alt.replace("_"," ")
print(f"""The best decision for Reliable Construction is to take {best_alt} with an expected payoff of ${max_val:.2f}M.
This is the best decision for Reliable C. given the simulated trials and prior probabilities of finding artifacts.""")

The best decision for Reliable Construction is to take No Insurance with an expected payoff of $4.74M.
This is the best decision for Reliable C. given the simulated trials and prior probabilities of finding artifacts.
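
The arithmetic behind this result can be written out by hand. The sketch below reuses the rounded payoffs from the table above (in millions of dollars) and the 30% prior; the dictionary layout is only one convenient way to hold them.

```python
# payoff table in millions of dollars, copied from the output above
payoffs = {
    'Buy_Insurance': {'Artifacts': 2.7, 'No_Artifacts': 4.9},
    'No_Insurance': {'Artifacts': 3.2, 'No_Artifacts': 5.4},
}
priors = {'Artifacts': 0.30, 'No_Artifacts': 0.70}

# Bayes' decision rule: weight each payoff by its prior and pick the largest total
expected = {
    alt: sum(priors[state] * payoff for state, payoff in row.items())
    for alt, row in payoffs.items()
}
best = max(expected, key=expected.get)
```

This gives 0.3 × 2.7 + 0.7 × 4.9 = 4.24 for buying insurance and 0.3 × 3.2 + 0.7 × 5.4 = 4.74 for going without, matching the value printed above.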


### Part Four
Reliable has been contacted by an archeological consulting firm. They assess sites and predict whether special artifacts are present. They have a pretty solid track record of being right when they predict that artifacts are present - they get it right about 86% of the time. Their track record is weaker when they predict there are no artifacts: they're right about 72% of the time.

First find the posterior probabilities and provide evidence for how you got them (Silver Decisions screenshot or ?).

<font color = "blue"> *** 6 points -  answer in cell below *** (don't delete this cell) </font>

In [8]:
# prior probs
pp1 = .30
pp2 = .70

f1 = (.86*pp1)+(.28*pp2)
f2 = (.14*pp1)+(.72*pp2)

# posterior probs
p_s1_f1 = (.86*pp1)/f1
p_s2_f1 = (.28*pp2)/f1

p_s1_f2 = (.14*pp1)/f2
p_s2_f2 = (.72*pp2)/f2

# summary
print(f"""Posterior probs are:
P(s1|f1) (Predicted Artifact - Artifact i.e. True Positive) = {round(p_s1_f1,3)}
P(s2|f1) (Predicted Artifact - No Artifact i.e. False Positive) = {round(p_s2_f1,3)}
P(s1|f2) (Predicted No Artifact - Artifact i.e. False Negative) = {round(p_s1_f2,3)}
P(s2|f2) (Predicted No Artifact - No Artifact i.e. True Negative)= {round(p_s2_f2,3)}""")

Posterior probs are:
P(s1|f1) (Predicted Artifact - Artifact i.e. True Positive) = 0.568
P(s2|f1) (Predicted Artifact - No Artifact i.e. False Positive) = 0.432
P(s1|f2) (Predicted No Artifact - Artifact i.e. False Negative) = 0.077
P(s2|f2) (Predicted No Artifact - No Artifact i.e. True Negative)= 0.923


The consulting fee for the site in question is \$50,000. 

Construct a decision tree to help Reliable decide if they should hire the consulting firm or not and if they should buy insurance or not.  Again, you should round the simulated costs to the nearest \$100,000 and use units of millions of dollars (e.g. 3.8 million dollars) in your decision tree.

Include a picture of the tree exported from Silver Decisions.

<font color = "blue"> *** 10 points -  answer in cell below *** (don't delete this cell) </font>

<img src='dt.png' width=100% height=100%/>

Summarize the optimal policy in words here:

<font color = "blue"> *** 2 points -  answer in cell below *** (don't delete this cell) </font>

<font color = "green">
The optimal policy for Reliable C. is not to hire a consulting firm and not to buy insurance.
</font>
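
The tree can also be rolled back numerically as a cross-check. This sketch assumes the rounded payoff table from Part Three, the likelihoods used in the posterior cell above (0.86 / 0.14 and 0.28 / 0.72), and the \$50,000 fee expressed as 0.05 million. The names `best_expected` and `likelihood` are illustrative.

```python
payoffs = {
    'Buy_Insurance': {'Artifacts': 2.7, 'No_Artifacts': 4.9},
    'No_Insurance': {'Artifacts': 3.2, 'No_Artifacts': 5.4},
}
priors = {'Artifacts': 0.30, 'No_Artifacts': 0.70}
# P(finding | state): f1 = predicts artifacts, f2 = predicts no artifacts
likelihood = {
    'Artifacts': {'f1': 0.86, 'f2': 0.14},
    'No_Artifacts': {'f1': 0.28, 'f2': 0.72},
}
fee = 0.05  # consulting fee in millions of dollars

def best_expected(probs):
    """Expected payoff of the best alternative under state probabilities `probs`."""
    return max(sum(probs[s] * row[s] for s in probs) for row in payoffs.values())

# branch 1: decide on insurance using the priors only
no_consultant = best_expected(priors)

# branch 2: pay the fee, observe the finding, then decide with the posteriors
with_consultant = -fee
for finding in ('f1', 'f2'):
    p_finding = sum(likelihood[s][finding] * priors[s] for s in priors)
    posterior = {s: likelihood[s][finding] * priors[s] / p_finding for s in priors}
    with_consultant += p_finding * best_expected(posterior)
```

With these inputs, skipping insurance is the better choice after either finding, so the consultant's report never changes the decision and the fee only lowers the expected payoff (4.69 versus 4.74).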

### Part Five

How confident do you feel about the results of your decision analysis? If you were being paid to complete this analysis, what further steps might you take to increase your confidence in your results?

<font color = "blue"> *** 4 points -  answer in cell below *** (don't delete this cell) </font>

<font color = "green">
I feel fairly confident in my analysis. To increase my confidence in the results, I would further consider:

* Running a sensitivity analysis
* Running a parameter analysis (like in Lesson 8) for 10 different probabilities of finding artifacts
* Including geographic/location data to calculate a rough probability of finding artifacts
* Running the exercise with different probability distributions for various variables and taking the average
* Running the exercise with perturbed posterior probabilities to see whether the decision changes
* Increasing simulation size

These are perhaps the primary ideas that come to mind.
</font>
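
As a starting point for the sensitivity analysis mentioned above, the sketch below sweeps the prior probability of finding artifacts from 0 to 1 and records the best alternative at each step. It assumes the rounded payoff table from Part Three; `expected_payoff` and `sweep` are illustrative names.

```python
# payoff table in millions of dollars, from Part Three
payoffs = {
    'Buy_Insurance': {'Artifacts': 2.7, 'No_Artifacts': 4.9},
    'No_Insurance': {'Artifacts': 3.2, 'No_Artifacts': 5.4},
}

def expected_payoff(alt, p_artifacts):
    """Expected payoff of `alt` when artifacts are found with probability `p_artifacts`."""
    row = payoffs[alt]
    return p_artifacts * row['Artifacts'] + (1 - p_artifacts) * row['No_Artifacts']

# evaluate the decision at p = 0.0, 0.1, ..., 1.0
sweep = []
for i in range(11):
    p = i / 10
    ep = {alt: expected_payoff(alt, p) for alt in payoffs}
    sweep.append((p, max(ep, key=ep.get)))
```

With this particular table the two rows differ by 0.5 in both states, so the same alternative wins at every probability. A sweep like this becomes more informative if the payoffs themselves are revised.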