# PART 1: DATA FABRICATION

**Objective:** As no publicly available datasets fit the scope of this project, this notebook executes a series of data fabrication steps to generate the datasets needed further down the project pipeline.

---

In [39]:
# Data Management
import pandas as pd

# Data Manipulation
import numpy as np
from scipy.stats import truncnorm

# Utils
import math
import datetime
from datetime import timedelta
from faker import Faker
from random import shuffle, choice, randrange

In [9]:
# Instantiate data fabricator 
faker = Faker()

---
---

## 1A: Clinic Information

**Objective:** Generate a dataset consisting of clinic names, locations, and distances.

---

**Methodology / Approach:** 

A total of 5 clinics located around the Denver area will be fabricated to include the following information:
- `branch_name` : Name of the clinic branch location (serves as index)
- `lat` : Latitude of clinic
- `lon` : Longitude of clinic
- `denver` : Shortest distance (miles) to arrive at Denver location from clinic
- `edgewater` : Shortest distance (miles) to arrive at Edgewater location from clinic
- `wheatridge` : Shortest distance (miles) to arrive at Wheatridge location from clinic
- `rino` : Shortest distance (miles) to arrive at RINO location from clinic
- `lakewood` : Shortest distance (miles) to arrive at Lakewood location from clinic
- `nearby_clinics` : The other clinics paired with their distances, sorted nearest first (the ~5 mile cutoff is currently disabled in the code)

The 5 urgent care branch locations were chosen around 5 known Denver areas, with the corresponding latitude / longitude / distance values retrieved from Google Maps. This data will also be used to feature engineer the `nearby_clinics` attribute, which will be useful for connecting nearby clinical staff members.

*Note: In production, the clinic location & distance data would be retrieved through Google Maps' official API. This would enable us to account for real-time directions / drive-time between locations and periodically update the "nearest" clinic info for better accuracy. However, that is a cost-based service, and therefore the information was manually compiled for the purposes of this project. The shortest distance will be used in lieu of drive-time as the basis for the "nearest" distinction.*
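As a free sanity check on the manually compiled values, the straight-line (great-circle) distance between two branches can be computed from their coordinates alone. The sketch below uses the haversine formula; `haversine_miles` is a hypothetical helper that is not part of the original pipeline, and a straight-line distance is only a lower bound on the road distance reported by Google Maps.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    earth_radius_miles = 3958.8  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

# Coordinates used in this notebook for the Denver and RINO branches
denver = (39.73906432357836, -104.98969659655802)
rino = (39.767327859566265, -104.98113186098168)

straight_line = haversine_miles(*denver, *rino)
print(round(straight_line, 1))
```

A result close to the roughly 2 miles recorded between these two branches suggests the manual entry is plausible; a straight-line value larger than the recorded road distance would indicate a data-entry error.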

#### Generate a clinic dataset with names, locations, and distances:

In [171]:
# Construct the clinic dataframe based on Google Maps data
clinics_df = pd.DataFrame({
    'branch_name': ['denver', 'edgewater', 'wheatridge', 'rino', 'lakewood'],
    'lat': [39.73906432357836, 39.753954449845445, 39.76685732722651, 39.767327859566265, 39.70455155721396],
    'lon': [-104.98969659655802, -105.06778796142915, -105.08198265044479, -104.98113186098168, -105.0798829449297],
    'denver': [0, 5, 6.3, 1.9, 7.5],      # shortest dist. (miles) to Denver branch from each clinic
    'edgewater': [5.1, 0, 2, 7.8, 4.3],   # shortest dist. (miles) to Edgewater branch from each clinic
    'wheatridge': [6.3, 2, 0, 7.5, 12],   # shortest dist. (miles) to Wheatridge branch from each clinic
    'rino':[2, 7.7, 7.8, 0, 11],          # shortest dist. (miles) to RINO branch from each clinic
    'lakewood':[8, 4.4, 4.8, 10.7, 0]     # shortest dist. (miles) to Lakewood branch from each clinic
}) 

clinics_df = clinics_df.set_index('branch_name', drop=True)

#### Feature engineer "nearby" clinic info based on distances:

In [181]:
distances = clinics_df[['denver', 'edgewater', 'wheatridge', 'rino', 'lakewood']].to_dict()

In [189]:
# Instantiate empty list to hold lists of nearby clinics for each location
nearby_clinics = []

# Iterate through each branch location
for index, row in clinics_df.iterrows(): 
    nearest = []
    dist_to_clinic = ['denver', 'edgewater', 'wheatridge', 'rino', 'lakewood']
    
    # Collect every other clinic with its distance (the ~5 mile cutoff is currently disabled)
    for dist in dist_to_clinic:
        if row[dist] > 0:  # and round(row[dist]) <= 5:
            nearest.append((dist, row[dist]))
    
    # Sort once per branch, nearest first
    nearest.sort(key=lambda x: x[1])
    nearby_clinics.append(nearest)
    

# Add nearby clinic info to dataframe
clinics_df['nearby_clinics'] = nearby_clinics

[('rino', 2.0), ('edgewater', 5.1), ('wheatridge', 6.3), ('lakewood', 8.0)]
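The commented-out condition in the loop above hints at a distance cutoff. A minimal sketch of how such a cutoff could be applied is shown here, using the Denver row as input. `clinics_within` is a hypothetical helper, and comparing unrounded distances is an assumption (the commented-out condition rounds the distance first, which would also admit 5.1 miles).

```python
# Distances (miles) from the Denver branch to the other branches, taken from the output above
from_denver = {'edgewater': 5.1, 'wheatridge': 6.3, 'rino': 2.0, 'lakewood': 8.0}

def clinics_within(distances, max_miles=5):
    """Return (clinic, distance) pairs within max_miles, nearest first."""
    close = [(name, miles) for name, miles in distances.items() if 0 < miles <= max_miles]
    return sorted(close, key=lambda pair: pair[1])

print(clinics_within(from_denver))       # [('rino', 2.0)]
print(clinics_within(from_denver, 5.5))  # [('rino', 2.0), ('edgewater', 5.1)]
```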

---
---

## 1B: Employee Records

**Objective:** Generate a dataset consisting of employee names, IDs, and roles.

---

**Methodology / Approach:** 

A dataset will be constructed for a total of 15 providers and 35 technicians and include the following features:
- `e_id` : Employee's ID (serves as index)
- `e_name` : Employee's name
- `e_role` : Employee's role (provider vs technician)

As there are 5 clinics, and 2 providers are generally assigned to a clinic per day, a total of 15 employees will be designated as providers (physicians / physician assistants). The remaining 35 employees in this dataset will be designated as 'technicians', who for the purposes of this project are movable among clinics. For simplicity and relevance, other employee types are excluded from this dataset as they won't be part of the analysis or modeling.

In [13]:
# Specify desired counts
num_docs = 15
num_techs = 35

### Employee IDs

In [14]:
def generate_ids(num_employees):
    """Generates 2-digit staff IDs based on input number of employees."""
    
    eids = list(range(11, num_employees+11))
    return eids

### Employee Names

In [15]:
def generate_names(eids):
    """Generates randomized employee names. Input employee IDs are used for seeding purposes."""

    e_names = []
    for eid in eids:
        Faker.seed(eid)  # for consistency
        e_names.append(faker.unique.name())
    return e_names

### Employee Roles

In [16]:
def generate_roles(num_docs, num_techs):
    """Generates employee roles based on pre-determined tally."""
    
    roles = []
    
    # Generate role title strings based on input tally
    for i in range(num_docs):
        roles.append('Provider')
    for j in range(num_techs):
        roles.append('Technician')
    
    shuffle(roles)
    return roles
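
Note that the names are seeded but the role shuffle is not, so role assignments change between runs. If reproducible roles are wanted, one option is a dedicated seeded generator. This is a sketch only: `generate_roles_seeded` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import random

def generate_roles_seeded(num_docs, num_techs, seed=0):
    """Build the role list and shuffle it with a private, seeded generator."""
    roles = ['Provider'] * num_docs + ['Technician'] * num_techs
    random.Random(seed).shuffle(roles)  # does not touch the global random state
    return roles

roles_a = generate_roles_seeded(15, 35)
roles_b = generate_roles_seeded(15, 35)
print(roles_a == roles_b)  # True: same seed, same order
```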

### Compile employee information and construct dataset:

In [17]:
# Generate employee IDs based on desired number of each role
eids = generate_ids(num_docs+num_techs)

employees_df = pd.DataFrame({
    'e_id': eids,
    'e_name': generate_names(eids),
    'e_role': generate_roles(num_docs, num_techs)
})
employees_df = employees_df.set_index('e_id', drop=True)

employees_df.sample(5)

| e_id | e_name | e_role |
|------|--------|--------|
| 35 | Nicole Burgess | Provider |
| 15 | Alexandra Fitzgerald | Technician |
| 11 | Justin Glass | Provider |
| 39 | James Lee | Provider |
| 24 | Sara White | Technician |


---
---

## 1C: Patient Records ("Past Year")

**Objective:** Generate a dataset consisting of past patient records including the location & date/time of visit.

---

**Methodology / Approach:** 

Past patient records will include the following information:
- `pt_id` : Patient's assigned ID number
- `pt_name` : Patient's name
- `pt_dob` : Patient's date of birth
- `pt_age` : Patient's age (engineered from dob)
- `visit_reason` : Patient's reason for visiting clinic
- `visit_location` : Clinic visited
- `visit_date` : Date of visit
- `checkin_time` : Time the patient checked-in to clinic
- `checkout_time` : Time the patient checked-out of clinic
<br>
Only information relevant to the study will be included in this dataset. Address, insurance, and official diagnosis / code will not be generated, as they are irrelevant to the purpose of the patient records, which is to inform the scheduler.

According to the American Academy of Urgent Care Medicine (AAUCM), the average urgent care sees about 60-80 patients per day. This guideline will therefore be used to fabricate a full year's worth of past patient records for each of the 5 chain clinics.

To simulate real-world data, the locations were chosen from areas that would see different volumes of incoming patients based on city size and location. Accordingly, the proportion of total yearly patients was set at 0.25 each for the more-populated Denver and Lakewood locations, 0.2 for Wheatridge, and 0.15 each for the smaller Edgewater and RINO areas. Based on the size computed from each proportion, a list of location names was constructed to serve as the visited location in the patient records.

First, a series of helper functions will be defined to generate the data for each attribute, along with additional functions that add "noise" to the data. Then, data will be constructed for each clinic and compiled together for the end output.

### Patient IDs

In [18]:
def generate_ids(branch, num_pts):
    """Generates 7-digit patient IDs based on input clinic location and patient count."""
    
    # Define starting ID code based on branch location
    start_digit = {
        'denver': 1000001, 
        'edgewater': 2000001, 
        'wheatridge': 3000001, 
        'rino': 4000001, 
        'lakewood': 5000001
    }
    
    # Generate IDs based on start digit and num_pts
    start_id = start_digit[branch]
    end_id = start_id + num_pts
    pids = list(range(start_id, end_id))
    
    return pids

Since more than 100,000 patient logs will be generated in total, each patient is given an ID for easier data handling. The ID is 7 digits long so that its length is standardized across all patients. As each location's data is built independently, each branch is given a different starting digit so that IDs never collide across locations.
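
One consequence of this scheme is that the branch can be recovered from the ID alone. The sketch below illustrates this; `branch_from_pid` and `BRANCH_BY_LEADING_DIGIT` are hypothetical names introduced only for this example.

```python
# Leading digit -> branch, mirroring the starting IDs defined above
BRANCH_BY_LEADING_DIGIT = {1: 'denver', 2: 'edgewater', 3: 'wheatridge', 4: 'rino', 5: 'lakewood'}

def branch_from_pid(pid):
    """Recover the branch name from the leading digit of a 7-digit patient ID."""
    leading_digit = pid // 1_000_000
    if leading_digit not in BRANCH_BY_LEADING_DIGIT:
        raise ValueError(f"ID {pid} does not belong to a known branch")
    return BRANCH_BY_LEADING_DIGIT[leading_digit]

print(branch_from_pid(1000001))  # denver
print(branch_from_pid(4012345))  # rino
```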

### Patient Names

In [19]:
def generate_names(pids): 
    """Generates randomized patient names. Input patient IDs are used for seeding purposes."""
    pt_names = []
    for pid in pids:
        Faker.seed(pid)  # for consistency
        pt_names.append(faker.unique.name())
    return pt_names

A function was set up to generate randomized names through the `Faker` module. A seed (based on the unique patient ID) is used so that each run yields exactly the same names.

### Patient DOBs / Ages

In [20]:
def generate_dobs(num_pts):
    """Generates date of births based on real-world distributions (with added noise)."""
    
    # Construct dict of age groups with corresponding age range and proportion
    age_probs = {
        'age_group_1': [0, 10, 0.14],
        'age_group_2': [11, 20, 0.15],
        'age_group_3': [21, 30, 0.18],
        'age_group_4': [31, 40, 0.16],
        'age_group_5': [41, 50, 0.13],
        'age_group_6': [51, 60, 0.11],
        'age_group_7': [61, 80, 0.13],
    }
    
    dobs = []
    
    # Iterate through each age group and generate DOBs
    for key, val in age_probs.items():
        
        # Assign appropriate naming for easy-to-follow data handling
        min_age, max_age, p = val[0], val[1], val[2]
        
        # Compute number of patients in this age group
        num_pts_in_group = int(num_pts * p)  # floor multiplication
        
        # Iterate through each patient in the current iteration of age group
        for i in range(num_pts_in_group):
            Faker.seed(i)
            dob = faker.date_of_birth(minimum_age=min_age, maximum_age=max_age)
            dobs.append(dob)
    
    # Account for patient-count discrepancy due to rounding/floor multiplication
    discrepancy = num_pts - len(dobs)
    for i in range(discrepancy):
        # Assign leftover patients to the most popular age group
        dob = faker.date_of_birth(minimum_age=21, maximum_age=30) 
        dobs.append(dob)    
    
    # Randomize order of DOBs (so not in order of age groups)
    shuffle(dobs) 
    
    return dobs

The function above is set up to take an input number of patients (so that it can be called individually for each branch location).

According to The Journal Of Urgent Care Medicine (JUCM), patient visits can be broken down by the following age proportions:
- Infant to 10: 13.8%
- 11 to 20: 14.8%
- 21 to 30: 18.3%
- 31 to 40: 15.9%
- 41 to 50: 12.8%
- 51 to 60: 11.2%
- 61+: 13.3%

Therefore, the function was designed to adhere to this distribution as closely as possible. The proportions were rounded to whole percentages, and the oldest group was capped at age 80 to account for the extremely rare 80+ patients. Lastly, it was important to set a seed on each iteration to yield a consistent set of dates for each run, and to assign any leftover patients (caused by floor multiplication of the proportions) to the largest age group.
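
Assigning every leftover patient to a single group slightly inflates that group. An alternative, shown here only as a sketch, is largest-remainder allocation, which spreads the leftover across the groups whose shares were cut the most by flooring. `allocate_counts` is a hypothetical helper, and it assumes the proportions sum to 1.

```python
def allocate_counts(total, proportions):
    """Split `total` into integer counts that follow `proportions` and sum exactly to `total`."""
    raw = {group: total * p for group, p in proportions.items()}
    counts = {group: int(value) for group, value in raw.items()}  # floor each share

    # Hand the leftover units to the groups with the largest fractional parts
    leftover = total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda group: raw[group] - counts[group], reverse=True)
    for group in by_remainder[:leftover]:
        counts[group] += 1
    return counts

# Same proportions as in generate_dobs, keyed by age range
age_props = {'0-10': 0.14, '11-20': 0.15, '21-30': 0.18, '31-40': 0.16,
             '41-50': 0.13, '51-60': 0.11, '61-80': 0.13}
counts = allocate_counts(103, age_props)
print(sum(counts.values()))  # 103
```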

In [21]:
def convert_dob(dob):
    """Converts input date of birth to age, based on today's date."""

    today = datetime.date.today()
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return age

The helper function above feature engineers each DOB value into age. Age is computed using the current date of script execution.
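
Because the result depends on the execution date, the same records produce different ages on different days. The sketch below is a variant that takes the reference date as a parameter so the behavior around a birthday can be checked; `age_on` is a hypothetical name, not a function from the original notebook.

```python
import datetime

def age_on(dob, today):
    """Age in whole years on `today`; one less if the birthday has not yet occurred this year."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

dob = datetime.date(2000, 6, 15)
print(age_on(dob, datetime.date(2022, 6, 14)))  # 21 (day before the birthday)
print(age_on(dob, datetime.date(2022, 6, 15)))  # 22 (on the birthday)
```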

### Reason for Visit (& Code)

In [22]:
def generate_reasons(num_pts):
    """Generates randomized list of reasons-of-visit based on sensible proportions."""
    
    reason_probs = {
        'cold/flu/fever': 12/80,
        'sore-throat': 5.5/80, 
        'cough': 5.8/80, 
        'chest-pain': 5/80,
        'stomach-pain': 4.5/80, 
        'diarrhea': 0.8/80, 
        'weakness/dizziness': 2.7/80,
        'headache': 2.6/80, 
        'UTI': 4/80, 
        'pink-eye': 0.7/80, 
        'ear-pain': 5.3/80, 
        'rash/allergy': 2.5/80, 
        'cuts/abscess': 1.5/80,
        'ache/pain': 5/80, 
        'injury/accident': 1.5/80, 
        'covid-test': 7.5/80, 
        'vaccination': 5.6/80, 
        'physical': 1/80, 
        'drug-test': 3/80, 
        'blood/lab-work': 3.5/80
    }
    
    reasons = []
    
    # Iterate through each common visit-reasons
    for reason, prob in reason_probs.items():
        
        # Compute number of patients from input total based on reason prob
        num_pts_for_reason = int(num_pts * prob)  # floor multiplication
        
        # Based on pt count for reason, add reasons to master list
        reasons.extend([reason for i in range(num_pts_for_reason)])
    
    # Account for patient-count discrepancy due to rounding/floor multiplication
    discrepancy = num_pts - len(reasons)
    for i in range(discrepancy):
        # Assign leftover patients to the most popular reason
        reasons.append('cold/flu/fever')    
    
    # Randomize order of reasons
    shuffle(reasons) 
    
    return reasons

The function above generates patient visit-reasons. 20 of the most common visit-reasons were chosen based on CDC data as well as personal work experience. Although literature was sought to guide the tuning of these proportions, only ER data was available, which isn't directly applicable to urgent cares. Therefore, for this segment, personal work experience in urgent cares was drawn upon to tune the share of each reason type. This variation in the data is constructed for more insightful EDA and to simulate the streaming as close to real life as possible. In production, these values would be replaced by the proportions observed in actual data, rather than pure fabrication.

In [23]:
def generate_severity_code(reason):
    """Outputs a severity-level for input reason of visit."""
    
    # Construct a dict of reasons and corresponding code
    reasons_code = {
        'cold/flu/fever': 4, 
        'sore-throat': 4, 
        'cough': 4,  
        'chest-pain': 5, 
        'stomach-pain': 5,
        'diarrhea': 5, 
        'weakness/dizziness': 5,
        'headache': 5, 
        'UTI': 4, 
        'pink-eye': 4,
        'ear-pain': 4,
        'rash/allergy': 5,
        'cuts/abscess': 5, 
        'ache/pain': 4, 
        'injury/accident': 5, 
        'covid-test': 4,  
        'vaccination': 3,
        'physical': 3, 
        'drug-test': 3, 
        'blood/lab-work': 3
    }
    
    # Output corresponding severity level / code
    code = reasons_code[reason]    
    return code

The helper function above was set up as a possible added layer of sophistication for patient data. It generates the severity code based on patient-visit reasons that would affect reaction time of technicians and other staff members when faced with multiple waiting patients. The set {3, 4, 5} was chosen based on a 3-tiered system followed at urgent care clinics in St. Louis, MO (based on work experience). This information will also be useful for exploratory visualizations to inform scheduling and modeling stages down the pipeline.

### Patient Visit-Dates

In [24]:
def generate_dates(ppd):
    """Generates dates data based on input patients-per-day array of past year."""
    
    # Get unique dates from past year
    past_yr_dates = pd.date_range(datetime.date(2021,5,1), periods=365).tolist()
    
    # Generate duplicate dates for each unique date based on patient-per-day records
    dates_all = []
    for i in range(len(past_yr_dates)):
        date = past_yr_dates[i]
        dates_all.extend([date] * ppd[i])
    
    return dates_all

The function above takes an input array consisting of the daily patient tally for an entire year for a single clinic. Based on that information, it fabricates the visit-date records for all patients belonging to that clinic. Note: the COVID-related reasons ("cold/flu/fever", etc.) are not representative of Denver's 2021-2022 waves. Rather, they are meant to reflect the later stages of a pandemic, when tests and cases are not at their peak, in keeping with the objective of this simulation project.
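
The expansion step can be illustrated on a tiny input. The sketch below uses only the standard library instead of `pd.date_range`; `expand_dates` is a hypothetical name, and the three-day tally is made up for the example.

```python
import datetime

def expand_dates(start, patients_per_day):
    """Repeat each calendar day once per patient seen on that day."""
    dates = []
    for offset, count in enumerate(patients_per_day):
        day = start + datetime.timedelta(days=offset)
        dates.extend([day] * count)
    return dates

# 2 patients on May 1, none on May 2, 1 on May 3
sample = expand_dates(datetime.date(2021, 5, 1), [2, 0, 1])
print(sample)
```

The output list has one entry per patient, so its length always equals the sum of the daily tally, which is what lets it line up with the other per-patient columns.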

### Patient Check-In Times

In [25]:
### Define possible specifications for each location's peak hours / "noise"

# Denver-Clinic: 
denver_ctime_specs = {
    'weekday_means1': [8, 8.25, 8.5, 8.75, 9],         # First weekday peak possibilities of Denver location
    'weekday_means2': [11, 11.25, 11.5, 11.75, 12],    # Second weekday peak possibilities of Denver location
    'weekday_means3': [16.5, 17, 17.5, 18, 18.5],      # Third weekday peak possibilities of Denver location
    'weekday_sigmas': [0.6, 0.8, 1.0, 1.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [10.5, 11, 11.5, 12],            # First weekend peak possibilities of Denver location
    'weekend_means2': [14, 14.5, 15, 15.5],            # Second weekend peak possibilities of Denver location
    'weekend_means3': [17, 17.25, 17.5, 17.75, 18],    # Third weekend peak possibilities of Denver location
    'weekend_sigmas': [0.6, 1.2, 1.8, 2.4]             # Possible weekend variations (standard-deviations)
}

# Edgewater-Clinic: 
edgewater_ctime_specs = {
    'weekday_means1': [9, 9.25, 9.5, 9.75, 10],        # First weekday peak possibilities of Edgewater location
    'weekday_means2': [13, 13.5, 14, 14.5, 15],        # Second weekday peak possibilities of Edgewater location
    'weekday_means3': [18, 18.5, 19, 19.5, 19.75],     # Third weekday peak possibilities of Edgewater location
    'weekday_sigmas': [0.6, 0.8, 1.0, 1.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [8, 8.25, 8.5, 8.75, 9],         # First weekend peak possibilities of Edgewater location
    'weekend_means2': [12, 12.25, 12.5, 12.75, 13],    # Second weekend peak possibilities of Edgewater location
    'weekend_means3': [16, 16.25, 16.5, 16.75, 17],    # Third weekend peak possibilities of Edgewater location
    'weekend_sigmas': [0.6, 1.2, 1.8, 2.4]             # Possible weekend variations (standard-deviations)
}
    
# Wheatridge-Clinic: 
wheatridge_ctime_specs = {
    'weekday_means1': [8, 8.25, 8.5, 8.75, 9],         # First weekday peak possibilities of Wheatridge location
    'weekday_means2': [11, 11.25, 11.5, 11.75, 12],    # Second weekday peak possibilities of Wheatridge location
    'weekday_means3': [16.5, 17, 17.5, 18, 18.5],      # Third weekday peak possibilities of Wheatridge location
    'weekday_sigmas': [0.6, 0.8, 1.0, 1.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [10.5, 11, 11.5, 12],            # First weekend peak possibilities of Wheatridge location
    'weekend_means2': [14, 14.5, 15, 15.5],            # Second weekend peak possibilities of Wheatridge location
    'weekend_means3': [17, 17.25, 17.5, 17.75, 18],    # Third weekend peak possibilities of Wheatridge location
    'weekend_sigmas': [0.6, 1.2, 1.8, 2.4]             # Possible weekend variations (standard-deviations)
}

# RINO-Clinic: 
rino_ctime_specs = {
    'weekday_means1': [9, 9.25, 9.5, 9.75, 10],        # First weekday peak possibilities of RINO location
    'weekday_means2': [13, 13.5, 14, 14.5, 15],        # Second weekday peak possibilities of RINO location
    'weekday_means3': [18, 18.5, 19, 19.5, 19.75],     # Third weekday peak possibilities of RINO location
    'weekday_sigmas': [0.6, 0.8, 1.0, 1.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [8, 8.25, 8.5, 8.75, 9],         # First weekend peak possibilities of RINO location
    'weekend_means2': [12, 12.25, 12.5, 12.75, 13],    # Second weekend peak possibilities of RINO location
    'weekend_means3': [16, 16.25, 16.5, 16.75, 17],    # Third weekend peak possibilities of RINO location
    'weekend_sigmas': [0.6, 1.2, 1.8, 2.4]             # Possible weekend variations (standard-deviations)
}

# Lakewood-Clinic: 
lakewood_ctime_specs = {
    'weekday_means1': [8.5, 9, 9.5, 10, 10.5],         # First weekday peak possibilities of Lakewood location
    'weekday_means2': [11.5, 12, 12.5, 13],            # Second weekday peak possibilities of Lakewood location
    'weekday_means3': [15.5, 16, 17, 17.5, 19],        # Third weekday peak possibilities of Lakewood location
    'weekday_sigmas': [0.6, 0.8, 1.0, 1.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [10.5, 11, 11.5, 12],            # First weekend peak possibilities of Lakewood location
    'weekend_means2': [15, 15.5, 16, 16.5],            # Second weekend peak possibilities of Lakewood location
    'weekend_means3': [19, 19.25, 19.5, 19.75],        # Third weekend peak possibilities of Lakewood location
    'weekend_sigmas': [0.6, 1.2, 1.8, 2.4]             # Possible weekend variations (standard-deviations)
}

"Peak-Hours" specifications for the tri-modal weekday and weekend models are defined above for each location. To best capture the diversity seen in real-world urgent care clinics' daily influx charts (according to Google Maps' "Popular Times" feature), different peaks and standard deviations are set up for each branch. These will be used and further tuned to generate the patient check-in times.

In [26]:
### WEEKDAY CHECK-IN TIME GENERATOR
def generate_weekday_ctimes(N, means, sigmas):
    """Generates N check-in times from tri-modal model based on input means & sigmas."""
    
    # Specify the lower and upper limit based on clinic operating times (8am-8pm / 8:00-20:00)
    low, upp = 8, 20
    
    # Split N into three groups
    leftover = N % 3
    N_vals = [int(N/3), int(N/3), int(N/3)+leftover]
    
    # Create a set of times distributed around the three input means
    X = []
    for i in range(3):
        vals = truncnorm((low - means[i]) / sigmas[i], (upp - means[i]) / sigmas[i], loc=means[i], scale=sigmas[i])
        vals = vals.rvs(N_vals[i])
        X.append(vals)
    
    # Concatenate the three weekday "peaks"
    X = np.concatenate([X[0],X[1],X[2]])
    
    return X

### WEEKEND CHECK-IN TIME GENERATOR
def generate_weekend_ctimes(N, means, sigmas):
    """Generates N check-in times from tri-modal model based on input means & sigmas."""
    
    # Specify the lower and upper limit based on clinic operating times (8am-8pm / 8:00-20:00)
    low, upp = 8, 20
    
    # Split N into three groups
    leftover = N % 3
    N_vals = [int(N/3), int(N/3), int(N/3)+leftover]
    
    # Create a set of times distributed around the three input means
    X = []
    for i in range(3):
        vals = truncnorm((low - means[i]) / sigmas[i], (upp - means[i]) / sigmas[i], loc=means[i], scale=sigmas[i])
        vals = vals.rvs(N_vals[i])
        X.append(vals)
    
    # Concatenate the three weekend "peaks"
    X = np.concatenate([X[0],X[1],X[2]])
    
    return X


### TIME GENERATOR
def generate_ctimes(ppd, branch_specs):
    """Calls on appropriate functions with corresponding branch specs to generate check-in times."""
    
    check_in_times = []
    
    # Define unique dates of last year
    dates = pd.date_range(datetime.date(2021,5,1), periods=365).tolist()

    for i in range(len(dates)):
        
        # Examine whether date is weekend or weekday
        day = dates[i].weekday()
        
        # Gather specs and generate check-in times for weekend
        if day == 5 or day == 6:
            means = [choice(branch_specs['weekend_means1']), choice(branch_specs['weekend_means2']), choice(branch_specs['weekend_means3'])]
            sigmas = [choice(branch_specs['weekend_sigmas']), choice(branch_specs['weekend_sigmas']), choice(branch_specs['weekend_sigmas'])]
            ctimes = generate_weekend_ctimes(ppd[i], means, sigmas)
            check_in_times.extend(ctimes)   
        
        # Gather specs and generate check-in times for weekdays
        else:
            means = [choice(branch_specs['weekday_means1']), choice(branch_specs['weekday_means2']), choice(branch_specs['weekday_means3'])]
            sigmas = [choice(branch_specs['weekday_sigmas']), choice(branch_specs['weekday_sigmas']), choice(branch_specs['weekday_sigmas'])]
            ctimes = generate_weekday_ctimes(ppd[i], means, sigmas)
            check_in_times.extend(ctimes)      
        
    return check_in_times
    

### HELPER FUNCTION
def convert_time(x):
    """Converts time from decimal hours to an 'HH:MM:SS' string."""
    
    # Grab each portion of input time
    hour = int(abs(x))
    leftover_decimal = x - hour
    minutes = int(leftover_decimal * 60)
    seconds = int(leftover_decimal * 60 * 60 % 60)
    
    # Format explicitly (rather than the locale-dependent '%X') to match later '%H:%M:%S' parsing
    time = datetime.time(hour, minutes, seconds).strftime('%H:%M:%S')
    
    return time

The first function above generates weekday check-in times from a tri-modal (3 "peaks") distribution, one peak for each part of the day. A second function does the same for weekends, and `generate_ctimes` picks the weekday or weekend specifications for each date. Lastly, a helper function converts each time value from a float to a formatted time string.
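
Each peak is a normal distribution truncated to the clinic's operating hours (8:00-20:00). The generators above do this with `scipy.stats.truncnorm`, which takes its bounds in standardized form, `(low - mean) / sigma` and `(upp - mean) / sigma`. The same idea can be sketched with the standard library alone through rejection sampling; this is a swapped-in technique for illustration, and `sample_truncated_normal` is a hypothetical helper.

```python
import random

def sample_truncated_normal(rng, mean, sigma, low=8.0, high=20.0):
    """Draw one value from N(mean, sigma), redrawing until it falls inside [low, high]."""
    while True:
        x = rng.gauss(mean, sigma)
        if low <= x <= high:
            return x

rng = random.Random(42)  # seeded for reproducibility
# An early-morning peak at 8:30: draws before opening time are rejected
morning_peak = [sample_truncated_normal(rng, mean=8.5, sigma=1.0) for _ in range(1000)]
print(min(morning_peak) >= 8.0, max(morning_peak) <= 20.0)  # True True
```

Rejection sampling is simple but wasteful when the mean sits far outside the bounds, which is one reason to prefer `truncnorm` in the actual generators.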

### Patient Check-Out Times

In [27]:
def generate_checkout_times(data):
    """Generates a checkout time (with added 'noise') based on check-in time and reason of visit."""
    
    # Unpack input data (check-in time, reason of visit) by position
    check_in_time, reason = data.iloc[0], data.iloc[1]
    
    # Construct a dict of reasons and expected time of stay
    reasons_time = {
        'cold/flu/fever': 45, 
        'sore-throat': 30, 
        'cough': 30,  
        'chest-pain': 60, 
        'stomach-pain': 60,
        'diarrhea': 60, 
        'weakness/dizziness': 60,
        'headache': 45, 
        'UTI': 45, 
        'pink-eye': 30,
        'ear-pain': 30,
        'rash/allergy': 45,
        'cuts/abscess': 45, 
        'ache/pain': 30, 
        'injury/accident': 60, 
        'covid-test': 20,  
        'vaccination': 20,
        'physical': 20, 
        'drug-test': 20, 
        'blood/lab-work': 30
    }
    
    # Possible variations in appointment time (in minutes)
    variations = [i for i in range(-5, 5, 1)]
    
    # Compute check-out time based on check-in time with added "noise"/variation
    check_in_time = datetime.datetime.strptime(check_in_time, '%H:%M:%S')
    checkout_time = check_in_time + timedelta(minutes=reasons_time[reason] + choice(variations))
    
    return checkout_time

The function above is set up to be used with a dataframe's apply/map functionality: it takes both the patient's check-in time and reason of visit and outputs a computed check-out time. As data is unavailable for tuning the length of visit per reason, personal experience working at urgent care facilities was once again relied upon to determine these values. To add extra "noise" to these times, each check-out time is offset by a variation of a few minutes.
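
The arithmetic can be checked on a single record. The sketch below is a standalone variant with a seeded generator; `checkout_after` is a hypothetical name. It draws the offset from -5 to +5 minutes inclusive, which is an assumption: `range(-5, 5)` in the function above stops at +4.

```python
import datetime
import random

def checkout_after(checkin, base_minutes, rng):
    """Add the expected stay plus a -5..+5 minute offset to an 'HH:MM:SS' check-in time."""
    start = datetime.datetime.strptime(checkin, '%H:%M:%S')
    jitter = rng.randint(-5, 5)  # inclusive on both ends
    return (start + datetime.timedelta(minutes=base_minutes + jitter)).time()

rng = random.Random(7)
# A 'cold/flu/fever' visit (45 expected minutes) checked in at 9:15
checkout = checkout_after('09:15:00', 45, rng)
print(checkout)
```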

### "Past" Patient Influx (per Clinic & per Day)

In [28]:
### Generate lists of patients-per-day at each location

ppd_denver = np.random.normal(-7, 7, 365)
ppd_denver = ppd_denver + 80
ppd_denver = ppd_denver.astype(int)

ppd_edgewater = np.random.normal(-5, 5, 365)
ppd_edgewater = ppd_edgewater + 55
ppd_edgewater = ppd_edgewater.astype(int)

ppd_wheatridge = np.random.normal(-6, 6, 365)
ppd_wheatridge = ppd_wheatridge + 65
ppd_wheatridge = ppd_wheatridge.astype(int)

ppd_rino = np.random.normal(-4, 4, 365)
ppd_rino = ppd_rino + 60
ppd_rino = ppd_rino.astype(int)

ppd_lakewood = np.random.normal(-6, 6, 365)
ppd_lakewood = ppd_lakewood + 70
ppd_lakewood = ppd_lakewood.astype(int)

# DEBUG / EXAMINATION
for ppd in [ppd_denver, ppd_edgewater, ppd_wheatridge, ppd_rino, ppd_lakewood]:
    # print(len(ppd))
    # print(pd.DataFrame(ppd).describe())
    # print(ppd)
    continue

The code block above constructs an array for each clinic branch, consisting of the patient counts for each day of the past year. To generate "noise" in this data fabrication step, counts were drawn from a normal distribution with a different mean per location, decided based on the population of each area. This replicates real-world variation as much as possible and paves the way for the remaining features of the patient dataset handled below.
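
Two details are worth making explicit: drawing from `normal(-7, 7)` and adding 80 is the same as drawing from a normal with mean 73 and standard deviation 7, and the draws are unseeded, so the counts differ on every run. The sketch below shows a seeded equivalent using the standard library's `random` module rather than NumPy; `patients_per_day` is a hypothetical helper, and clipping at zero is an added safeguard, not something the original code does.

```python
import random

def patients_per_day(rng, mean, sigma, days=365):
    """Draw a non-negative integer patient count for each day."""
    return [max(0, int(rng.gauss(mean, sigma))) for _ in range(days)]

rng = random.Random(2021)  # fixed seed so the tally is the same on every run
ppd_denver_sketch = patients_per_day(rng, mean=73, sigma=7)  # normal(-7, 7) + 80
print(len(ppd_denver_sketch), sum(ppd_denver_sketch))
```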

### Compile patient information for each clinic, feature-engineer desired attributes, and construct finalized dataset:

In [29]:
def rolling_stats(branch_df):
    """Generates rolling count of patients in clinic for each record for input branch's records."""
    
    rolling_ct = []
    rolling_severities = []
    
    # Convert check-out datetimes to plain times once, rather than once per date
    checkout_as_time = branch_df['checkout_time'].apply(lambda x: x.time())
    
    # Iterate through each visit date
    for date in branch_df['visit_date'].unique():
        
        # Grab the corresponding check-in & check-out times and severity codes for the date
        on_date = branch_df['visit_date'] == date
        checkins = branch_df.loc[on_date, 'checkin_time'].values
        checkouts = checkout_as_time[on_date].values
        severity_levels = branch_df.loc[on_date, 'visit_code'].values
        
        # Iterate through each check-in time
        for i in range(len(checkins)):
            
            # Current iteration of check-in time
            current_checkin = checkins[i]
            
            # Instantiate patient count at 1
            counter = 1
            current_severities = [0]
            
            # Iterate through all past check-out times (before current check-in time)
            for j in range(i):
                
                # If previous patient's check-out time is after current patient's check-in time, increment counter
                if checkouts[j] > current_checkin:
                    counter += 1
                    current_severities.append(severity_levels[j])
            
            # Average out visit codes for patients currently in clinic 
            if len(current_severities) == 1:
                mean_severity = 0
            else:
                mean_severity = np.mean(current_severities[1:])
            
            rolling_ct.append(counter)
            rolling_severities.append(round(mean_severity, 1))
            
            
    return rolling_ct, rolling_severities

The helper function above iterates through patient records and tracks how many patients are in a clinic at the time a new patient walks in. This will be useful for EDA and modeling purposes to better inform the scheduling process.
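
The nested loop compares every patient against all earlier patients of the same day. The same count can be obtained in a single pass by keeping the check-out times of patients still in the clinic in a min-heap. This is a sketch of an alternative, not the method used above; `occupancy_at_checkin` is a hypothetical name and the times are decimal hours made up for the example.

```python
import heapq

def occupancy_at_checkin(visits):
    """For (checkin, checkout) pairs sorted by check-in, count patients in clinic at each arrival.

    The arriving patient is included, matching the counter that starts at 1 above.
    """
    still_inside = []  # min-heap of check-out times of patients not yet gone
    counts = []
    for checkin, checkout in visits:
        # Drop everyone who checked out at or before this arrival
        while still_inside and still_inside[0] <= checkin:
            heapq.heappop(still_inside)
        counts.append(len(still_inside) + 1)
        heapq.heappush(still_inside, checkout)
    return counts

visits = [(8.0, 9.0), (8.5, 8.75), (8.8, 9.5), (9.2, 9.4)]
print(occupancy_at_checkin(visits))  # [1, 2, 2, 2]
```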

In [30]:
def generate_dataset(branch, ppd, specs):
    """Executes necessary functions to generate patient records for an input branch location."""
    
    # Grab total number of pts visiting input branch location in past year
    num_pts = ppd.sum() 

    # Execute functions to generate data for each attribute
    ids = generate_ids(branch, num_pts)
    names = generate_names(ids)
    dobs = generate_dobs(num_pts)
    ages = [convert_dob(dob) for dob in dobs]
    locations = [branch for i in range(num_pts)]
    reasons = generate_reasons(num_pts)
    codes = [generate_severity_code(reason) for reason in reasons]
    dates = generate_dates(ppd)
    checkin_times = generate_ctimes(ppd, specs)
    checkin_times = [convert_time(time) for time in checkin_times]

    # Construct dataframe from generated data
    pts_df = pd.DataFrame({
        'pt_id': ids,
        'pt_name': names,
        'pt_dob': dobs,
        'pt_age': ages,
        'visit_location': locations,
        'visit_reason': reasons,
        'visit_code': codes,
        'visit_date': dates,
        'checkin_time': checkin_times
    })

    # Create check-out times
    pts_df['checkout_time'] = pts_df[['checkin_time', 'visit_reason']].apply(generate_checkout_times, axis=1)

    # Convert check-in time strings into datetime.time objects
    pts_df['checkin_time'] = pts_df['checkin_time'].apply(lambda x: datetime.datetime.strptime(x, '%H:%M:%S').time())

    # Sort records based on visit_date and checkin_time
    pts_df = pts_df.sort_values(['visit_date', 'checkin_time', 'checkout_time'])
    
    # Feature-engineer a rolling count and a rolling severity level of patients in clinic
    pts_df['rolling_ct'], pts_df['rolling_code'] = rolling_stats(pts_df)
    
    # Adjust check-out time based on clinic's current rolling severity-code (10 extra minutes per severity point)
    pts_df['checkout_time'] = pts_df[['checkout_time', 'rolling_code']].apply(lambda x: x['checkout_time'] + datetime.timedelta(minutes=x['rolling_code'] * 10), axis=1)
    pts_df['checkout_time'] = pts_df['checkout_time'].apply(lambda x: x.time())
    
    # Feature-engineer day information based on dates
    pts_df['visit_day'] = pts_df['visit_date'].apply(lambda x: x.weekday()) \
            .map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})
    
    # Rearrange column order as desired
    pts_df = pts_df[[
        'pt_id', 'pt_name', 'pt_dob', 'pt_age',                                                     # Patient Info
        'visit_location', 'visit_reason', 'visit_code',                                             # Visit Info (Location & Reason)
        'visit_date', 'visit_day', 'checkin_time', 'checkout_time', 'rolling_ct', 'rolling_code'    # Visit Info (Day/Date/Time)
    ]]
    
    return pts_df    

In [31]:
### DENVER
denver_df = generate_dataset('denver', ppd_denver, denver_ctime_specs)

### EDGEWATER
edgewater_df = generate_dataset('edgewater', ppd_edgewater, edgewater_ctime_specs)

### WHEATRIDGE
wheatridge_df = generate_dataset('wheatridge', ppd_wheatridge, wheatridge_ctime_specs)

### RINO
rino_df = generate_dataset('rino', ppd_rino, rino_ctime_specs)

### LAKEWOOD
lakewood_df = generate_dataset('lakewood', ppd_lakewood, lakewood_ctime_specs)

### TOTAL
patients_df = pd.concat([denver_df, edgewater_df, wheatridge_df, rino_df, lakewood_df], axis=0)
patients_df = patients_df.set_index('pt_id', drop=True)
# patients_df

---
---

## 1D: New Patient Records (for "real-time" streaming)

**Objective:** Generate a dataset consisting of patient records that will be streamed for the real-time model.

---

**Methodology / Approach:** 

This dataset will consist of the same attributes that exist in the past patient records, as that is the information taken at registration when a patient enters the clinic.
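
As a rough illustration of what streaming such records could look like, the sketch below yields one record at a time in check-in order. Everything in it is an assumption made for illustration: the `stream_records` helper, the `REGISTRATION_COLS` subset, the `datetime` string column, and the two sample rows are hypothetical and are not generated by this notebook.

```python
import pandas as pd

# Assumed registration-time attributes (a hypothetical subset, for illustration only)
REGISTRATION_COLS = ['pt_name', 'visit_location', 'visit_reason', 'datetime']

def stream_records(records):
    """Yield one registration record at a time, ordered by check-in timestamp."""
    for pt_id, row in records.sort_values('datetime').iterrows():
        yield {'pt_id': pt_id, **row[REGISTRATION_COLS].to_dict()}

# Hypothetical records indexed by pt_id, deliberately out of order
new_patients = pd.DataFrame(
    {
        'pt_name': ['Patient B', 'Patient A'],
        'visit_location': ['edgewater', 'denver'],
        'visit_reason': ['cough', 'sore-throat'],
        'datetime': ['2022-05-01 08:30:00', '2022-05-01 08:05:00'],
    },
    index=pd.Index([2000001, 1000001], name='pt_id'),
)

streamed = list(stream_records(new_patients))
print(streamed[0]['pt_name'])  # Patient A
```

Sorting on an ISO-formatted timestamp string keeps the records in chronological order without any date parsing.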

---
---

## 1E: Output

**Objective:** Save constructed datasets into *CSV* files for use in the next stages of the project pipeline.
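
One detail matters for the downstream stages: a column holding Python lists, such as `nearby_clinics`, is written to CSV as its string representation, so it comes back as plain text unless it is parsed on load. The sketch below shows one possible round trip using an in-memory buffer and a hypothetical two-row stand-in for `clinics_df`. Parsing with `ast.literal_eval` is a suggestion, not something this notebook does.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row stand-in for clinics_df
clinics = pd.DataFrame(
    {
        'lat': [39.739064, 39.767328],
        'nearby_clinics': [[('rino', 2.0), ('edgewater', 5.1)], [('denver', 1.9)]],
    },
    index=pd.Index(['denver', 'rino'], name='branch_name'),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
clinics.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores branch_name as the index; the converter parses the stringified lists
restored = pd.read_csv(buffer, index_col=0, converters={'nearby_clinics': ast.literal_eval})
print(restored.loc['denver', 'nearby_clinics'])  # [('rino', 2.0), ('edgewater', 5.1)]
```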

---

In [32]:
clinics_df.to_csv('./uc_clinics.csv')

clinics_df

branch_name,lat,lon,to_denver,to_edgewater,to_wheatridge,to_rino,to_lakewood,nearby_clinics
denver,39.739064,-104.989697,0.0,5.1,6.3,2.0,8.0,"[(rino, 2.0), (edgewater, 5.1)]"
edgewater,39.753954,-105.067788,5.0,0.0,2.0,7.7,4.4,"[(wheatridge, 2.0), (lakewood, 4.4), (denver, ..."
wheatridge,39.766857,-105.081983,6.3,2.0,0.0,7.8,4.8,"[(edgewater, 2.0), (lakewood, 4.8)]"
rino,39.767328,-104.981132,1.9,7.8,7.5,0.0,10.7,"[(denver, 1.9)]"
lakewood,39.704552,-105.079883,7.5,4.3,12.0,11.0,0.0,"[(edgewater, 4.3)]"


In [33]:
employees_df.to_csv('./uc_employees.csv')

employees_df.sample(5)

e_id,e_name,e_role
22,Brenda Johnson,Technician
34,Laurie Bridges,Technician
42,Allison Hill,Technician
59,Timothy Bautista,Provider
32,Daniel Turner,Provider


In [34]:
patients_df.to_csv('./uc_past_patients.csv')
patients_df

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code
1000013,Kimberly Davies,2005-02-07,17,denver,sore-throat,4,2021-05-01,Saturday,08:09:54,08:39:54,1,0.0
1000010,Paul Hammond,1979-01-13,43,denver,covid-test,4,2021-05-01,Saturday,08:18:52,09:22:52,2,4.0
1000023,Christopher Wright,2001-10-31,20,denver,ear-pain,4,2021-05-01,Saturday,09:37:22,10:06:22,1,0.0
1000002,Joshua Oliver,2003-09-28,18,denver,covid-test,4,2021-05-01,Saturday,09:58:19,10:59:19,2,4.0
1000014,Janet Rowe,1999-03-11,23,denver,chest-pain,5,2021-05-01,Saturday,10:08:21,11:49:21,2,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...
5023063,Stacy Barrett,1959-01-06,63,lakewood,chest-pain,5,2022-04-30,Saturday,19:35:16,21:25:16,8,4.9
5023048,Jeffery Murphy MD,1952-06-24,69,lakewood,stomach-pain,5,2022-04-30,Saturday,19:36:15,21:29:15,9,4.9
5023044,Marissa Richardson,1998-04-18,24,lakewood,blood/lab-work,3,2022-04-30,Saturday,19:41:09,21:02:09,9,4.9
5023057,Brittney Spence,2004-07-10,17,lakewood,cold/flu/fever,4,2022-04-30,Saturday,19:45:34,21:18:34,8,4.7


In [35]:
# Peak number of patients in each clinic on each day
daily_df = patients_df.groupby(['visit_location', 'visit_date', 'visit_day']).max().reset_index(drop=False)[['visit_location', 'visit_date', 'rolling_ct']]

# Daily tech count derived from the peak: floor(peak / 3) + 1
daily_df['num_techs'] = daily_df.rolling_ct.apply(lambda x: int(x/3)+1)
daily_df[daily_df.visit_location == 'lakewood']

,visit_location,visit_date,rolling_ct,num_techs
730,lakewood,2021-05-01,13,5
731,lakewood,2021-05-02,12,5
732,lakewood,2021-05-03,11,4
733,lakewood,2021-05-04,11,4
734,lakewood,2021-05-05,13,5
...,...,...,...,...
1090,lakewood,2022-04-26,9,4
1091,lakewood,2022-04-27,8,3
1092,lakewood,2022-04-28,10,4
1093,lakewood,2022-04-29,12,5


In [36]:
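# Look up table: (visit_date, visit_location) -> daily tech count
techs_by_day = {
    (date, location): n_techs
    for date, location, n_techs in zip(daily_df.visit_date, daily_df.visit_location, daily_df.num_techs)
}

# Attach the daily tech count to every patient record
patients_df['num_techs'] = [
    techs_by_day.get((date, location))
    for date, location in zip(patients_df.visit_date, patients_df.visit_location)
]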
patients_df.tail(70)

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code,num_techs
5022978,Aaron Stone,1962-06-10,59,lakewood,cold/flu/fever,4,2022-04-29,Friday,18:00:21,19:33:21,7,4.7,5
5022980,Katherine Mccall MD,1994-11-23,27,lakewood,drug-test,3,2022-04-29,Friday,18:06:51,19:07:51,7,4.5,5
5022982,Jeanne King,1982-04-12,40,lakewood,ear-pain,4,2022-04-29,Friday,18:35:46,19:54:46,4,4.7,5
5022976,Joann Chavez,1945-04-28,77,lakewood,covid-test,4,2022-04-29,Friday,18:42:40,19:40:40,3,4.0,5
5023008,Stacy Chambers DDS,1994-01-07,28,lakewood,physical,3,2022-04-30,Saturday,09:05:34,09:29:34,1,0.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5023063,Stacy Barrett,1959-01-06,63,lakewood,chest-pain,5,2022-04-30,Saturday,19:35:16,21:25:16,8,4.9,4
5023048,Jeffery Murphy MD,1952-06-24,69,lakewood,stomach-pain,5,2022-04-30,Saturday,19:36:15,21:29:15,9,4.9,4
5023044,Marissa Richardson,1998-04-18,24,lakewood,blood/lab-work,3,2022-04-30,Saturday,19:41:09,21:02:09,9,4.9,4
5023057,Brittney Spence,2004-07-10,17,lakewood,cold/flu/fever,4,2022-04-30,Saturday,19:45:34,21:18:34,8,4.7,4


In [37]:
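# Combine visit date and check-in time into a single timestamp string
# (the list is not named `datetime`, which would shadow the imported `datetime` module)
checkin_datetimes = []
for visit_date, checkin_time in zip(patients_df.visit_date, patients_df.checkin_time):
    checkin_datetimes.append(str(visit_date.date()) + ' ' + str(checkin_time))

patients_df['datetime'] = checkin_datetimes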
patients_df.head()

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code,num_techs,datetime
1000013,Kimberly Davies,2005-02-07,17,denver,sore-throat,4,2021-05-01,Saturday,08:09:54,08:39:54,1,0.0,6,2021-05-01 08:09:54
1000010,Paul Hammond,1979-01-13,43,denver,covid-test,4,2021-05-01,Saturday,08:18:52,09:22:52,2,4.0,6,2021-05-01 08:18:52
1000023,Christopher Wright,2001-10-31,20,denver,ear-pain,4,2021-05-01,Saturday,09:37:22,10:06:22,1,0.0,6,2021-05-01 09:37:22
1000002,Joshua Oliver,2003-09-28,18,denver,covid-test,4,2021-05-01,Saturday,09:58:19,10:59:19,2,4.0,6,2021-05-01 09:58:19
1000014,Janet Rowe,1999-03-11,23,denver,chest-pain,5,2021-05-01,Saturday,10:08:21,11:49:21,2,4.0,6,2021-05-01 10:08:21


In [150]:
# Techs needed at check-in: one tech per three patients currently in the clinic, rounded up
patients_df['needed_techs'] = patients_df['rolling_ct'].apply(lambda x: math.ceil(x/3))
patients_df.head(50)

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code,num_techs,datetime,needed_techs
1000013,Kimberly Davies,2005-02-07,17,denver,sore-throat,4,2021-05-01,Saturday,08:09:54,08:39:54,1,0.0,6,2021-05-01 08:09:54,1
1000010,Paul Hammond,1979-01-13,43,denver,covid-test,4,2021-05-01,Saturday,08:18:52,09:22:52,2,4.0,6,2021-05-01 08:18:52,1
1000023,Christopher Wright,2001-10-31,20,denver,ear-pain,4,2021-05-01,Saturday,09:37:22,10:06:22,1,0.0,6,2021-05-01 09:37:22,1
1000002,Joshua Oliver,2003-09-28,18,denver,covid-test,4,2021-05-01,Saturday,09:58:19,10:59:19,2,4.0,6,2021-05-01 09:58:19,1
1000014,Janet Rowe,1999-03-11,23,denver,chest-pain,5,2021-05-01,Saturday,10:08:21,11:49:21,2,4.0,6,2021-05-01 10:08:21,1
1000015,Tracey Winters,2014-03-10,8,denver,weakness/dizziness,5,2021-05-01,Saturday,10:30:13,12:16:13,2,5.0,6,2021-05-01 10:30:13,1
1000009,Laura Anderson,1987-08-29,34,denver,chest-pain,5,2021-05-01,Saturday,10:47:37,12:36:37,3,5.0,6,2021-05-01 10:47:37,1
1000004,Christopher Clark Jr.,1989-02-11,33,denver,weakness/dizziness,5,2021-05-01,Saturday,10:51:27,12:42:27,4,5.0,6,2021-05-01 10:51:27,2
1000001,Kevin Howe,2006-06-09,15,denver,stomach-pain,5,2021-05-01,Saturday,10:51:36,12:42:36,5,5.0,6,2021-05-01 10:51:36,2
1000019,Jeffrey Landry,2005-10-12,16,denver,covid-test,4,2021-05-01,Saturday,10:59:16,12:05:16,6,5.0,6,2021-05-01 10:59:16,2


In [41]:
patients_df.to_csv('./uc_past_patients.csv')

In [141]:
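# Take a chronological slice of 200 records; copy it so later edits do not touch patients_df
test = patients_df.sort_values('datetime').iloc[200:400].copy()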

In [152]:
# Reduce each record's tech count by one
test['num_techs'] = test['num_techs'] - 1
test

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code,num_techs,datetime,needed_techs
1000026,Kara Mitchell,1956-06-23,65,denver,ear-pain,4,2021-05-01,Saturday,15:29:45,16:40:45,8,4.3,4,2021-05-01 15:29:45,3
1000031,Tina Clements DVM,1994-10-29,27,denver,chest-pain,5,2021-05-01,Saturday,15:31:11,17:16:11,8,4.1,4,2021-05-01 15:31:11,3
1000045,Melissa Russell MD,1999-02-05,23,denver,chest-pain,5,2021-05-01,Saturday,15:33:40,17:19:40,8,4.3,4,2021-05-01 15:33:40,3
5000041,Xavier Allen,1944-03-02,78,lakewood,cough,4,2021-05-01,Saturday,15:34:51,16:42:51,2,4.0,3,2021-05-01 15:34:51,1
1000027,Bradley Martinez,2019-12-04,2,denver,ache/pain,4,2021-05-01,Saturday,15:35:46,16:44:46,9,4.4,4,2021-05-01 15:35:46,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000090,Chad Morris,1999-07-27,22,wheatridge,sore-throat,4,2021-05-02,Sunday,11:40:46,12:47:46,8,4.1,2,2021-05-02 11:40:46,3
4000071,Scott Hayes,1971-03-09,51,rino,vaccination,3,2021-05-02,Sunday,11:44:01,12:50:01,8,4.6,1,2021-05-02 11:44:01,3
3000073,Monica Crane,1983-09-22,38,wheatridge,headache,5,2021-05-02,Sunday,11:48:57,13:11:57,9,4.1,2,2021-05-02 11:48:57,3
2000054,Christian Berry,2007-11-01,14,edgewater,ache/pain,4,2021-05-02,Sunday,11:49:06,13:02:06,2,4.0,2,2021-05-02 11:49:06,1


In [166]:
clinics_df

branch_name,lat,lon,to_denver,to_edgewater,to_wheatridge,to_rino,to_lakewood,nearby_clinics
denver,39.739064,-104.989697,0.0,5.1,6.3,2.0,8.0,"[(rino, 2.0), (edgewater, 5.1)]"
edgewater,39.753954,-105.067788,5.0,0.0,2.0,7.7,4.4,"[(wheatridge, 2.0), (lakewood, 4.4), (denver, ..."
wheatridge,39.766857,-105.081983,6.3,2.0,0.0,7.8,4.8,"[(edgewater, 2.0), (lakewood, 4.8)]"
rino,39.767328,-104.981132,1.9,7.8,7.5,0.0,10.7,"[(denver, 1.9)]"
lakewood,39.704552,-105.079883,7.5,4.3,12.0,11.0,0.0,"[(edgewater, 4.3)]"


{'lat': {'denver': 39.73906432357836,
  'edgewater': 39.753954449845445,
  'wheatridge': 39.76685732722651,
  'rino': 39.767327859566265,
  'lakewood': 39.70455155721396},
 'lon': {'denver': -104.98969659655802,
  'edgewater': -105.06778796142915,
  'wheatridge': -105.08198265044479,
  'rino': -104.98113186098168,
  'lakewood': -105.0798829449297},
 'to_denver': {'denver': 0.0,
  'edgewater': 5.0,
  'wheatridge': 6.3,
  'rino': 1.9,
  'lakewood': 7.5},
 'to_edgewater': {'denver': 5.1,
  'edgewater': 0.0,
  'wheatridge': 2.0,
  'rino': 7.8,
  'lakewood': 4.3},
 'to_wheatridge': {'denver': 6.3,
  'edgewater': 2.0,
  'wheatridge': 0.0,
  'rino': 7.5,
  'lakewood': 12.0},
 'to_rino': {'denver': 2.0,
  'edgewater': 7.7,
  'wheatridge': 7.8,
  'rino': 0.0,
  'lakewood': 11.0},
 'to_lakewood': {'denver': 8.0,
  'edgewater': 4.4,
  'wheatridge': 4.8,
  'rino': 10.7,
  'lakewood': 0.0},
 'nearby_clinics': {'denver': [('rino', 2.0), ('edgewater', 5.1)],
  'edgewater': [('wheatridge', 2.0), ('lak

In [200]:
import pprint
pp = pprint.PrettyPrinter(indent=0)

# Latest known staffing state per clinic; `flag` is set to 1 once a clinic has used all of its techs
d = {branch: {'checkin_time': None, 'num_techs': None, 'needed_techs': None, 'flag': 0, 'available_techs': None}
     for branch in ['denver', 'edgewater', 'wheatridge', 'rino', 'lakewood']}

for _, row in test.iterrows():
    location = row['visit_location']
    checkin_time = row['checkin_time']
    num_techs = row['num_techs']
    needed_techs = row['needed_techs']

    # Update the clinic's state with the newest check-in
    d[location]['checkin_time'] = checkin_time
    d[location]['num_techs'] = num_techs
    d[location]['needed_techs'] = needed_techs
    d[location]['available_techs'] = num_techs - needed_techs
    # pp.pprint(d[location])  # uncomment to inspect the clinic's state

    if d[location]['flag'] == 1:
        # Clinic was already at capacity: check whether it is now short-staffed
        if num_techs < needed_techs:
            print(f"{checkin_time}: {location} needs {needed_techs - num_techs} techs!")
            print("Check to see what's available:")
            # Clinics without any check-in so far have no known availability and are skipped
            avail = [(i, d[i]['available_techs']) for i in d.keys()
                     if d[i]['available_techs'] is not None and d[i]['available_techs'] > 0]
            print(avail)
            for clinic in clinics_df.loc[location, 'nearby_clinics']:
                print(clinic)
                avail_techs = d[clinic[0]]['available_techs']
                if avail_techs is not None and avail_techs > 0:
                    # TODO: reassign a tech from this clinic (not yet implemented)
                    pass
                print(f"{clinic[0]} has {avail_techs} available techs")
            print()
    else:
        # Flag the clinic once every tech is in use
        if num_techs == needed_techs:
            d[location]['flag'] = 1

# Open question: what if denver needs 2 techs and rino only has 1? Then the next-nearest location has to be checked.

15:55:50: denver needs 1 techs!
Check to see what's available:
[('wheatridge', 1), ('rino', 2), ('lakewood', 2)]
('rino', 2.0)
rino has 2 available techs
('edgewater', 5.1)
edgewater has -2 available techs
('wheatridge', 6.3)
wheatridge has 1 available techs
('lakewood', 8.0)
lakewood has 2 available techs

16:03:13: denver needs 1 techs!
Check to see what's available:
[('wheatridge', 1), ('rino', 1), ('lakewood', 1)]
('rino', 2.0)
rino has 1 available techs
('edgewater', 5.1)
edgewater has -2 available techs
('wheatridge', 6.3)
wheatridge has 1 available techs
('lakewood', 8.0)
lakewood has 1 available techs

16:03:41: denver needs 1 techs!
Check to see what's available:
[('wheatridge', 1), ('rino', 1), ('lakewood', 1)]
('rino', 2.0)
rino has 1 available techs
('edgewater', 5.1)
edgewater has -2 available techs
('wheatridge', 6.3)
wheatridge has 1 available techs
('lakewood', 8.0)
lakewood has 1 available techs

16:17:38: denver needs 1 techs!
Check to see what's available:
[('wheatri