# PART 1: DATA FABRICATION

**Objective:** As no publicly available datasets fit the scope of this project, this notebook executes a series of data fabrication steps to generate datasets that emulate the real world as closely as possible. Secondary research was used to guide the approach at each level. These datasets will serve as the raw data source further down the project pipeline. <br> (Estimated RunTime: ~3-4min)

---

In [1]:
# Data Management
import numpy as np
import pandas as pd

# Computations
from scipy.stats import truncnorm

# Utils
import datetime
from datetime import timedelta
from faker import Faker
from random import choice, choices, shuffle, randrange

In [2]:
# Instantiate data fabricator 
faker = Faker()

---
---

## 1A: Clinic Information

**Objective:** Generate a dataset consisting of clinic names, locations, and travel-time.

---

**Methodology / Approach:** 

A total of 5 clinics located around the Denver area will be fabricated to include the following information:
- `branch_name` : Name of the clinic branch location (serves as index)
- `lat` : Latitude of clinic
- `lon` : Longitude of clinic
- `to_denver` : Shortest avg. travel-time to arrive at Denver location from clinic
- `to_edgewater` : Shortest avg. travel-time to arrive at Edgewater location from clinic
- `to_wheatridge` : Shortest avg. travel-time to arrive at Wheatridge location from clinic
- `to_rino` : Shortest avg. travel-time to arrive at RINO location from clinic
- `to_lakewood` : Shortest avg. travel-time to arrive at Lakewood location from clinic
- `nearby_clinics` : Nearby clinics sorted by travel-time

The urgent care branch locations for the 5 clinics were chosen from 5 known municipalities in the Denver metropolitan area, with corresponding latitude / longitude / travel-time values retrieved from *Google Maps*. This data will also be used to feature engineer the `nearby_clinics` attribute, which will be useful for connecting nearby clinical staff members. 

*Note: In production, the clinic location and distance data would be retrieved through Google Maps' official API. This would let us account for real-time directions and drive-time between locations and periodically update the "nearest" clinic info for better accuracy. However, that is a paid service, so the information was compiled manually for the purposes of this project. The shortest travel-time will be used in lieu of drive-time as the basis for the "nearest" distinction. When these metrics are used during the navigation phase of the project, randomized variations will be added to these base-level travel times to mimic real-world fluctuations in drive-time and traffic.*
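
The randomized variation mentioned in the note could look something like the following minimal sketch. The function name `jitter_travel_time` and the `max_fraction` parameter are hypothetical and do not appear elsewhere in this notebook; the uniform scaling is an assumption, not the method used in the navigation phase.

```python
import random

def jitter_travel_time(base_minutes, max_fraction=0.25, rng=random):
    """Scale a base travel time by a random factor in [1 - f, 1 + f]."""
    if base_minutes == 0:
        return 0.0  # a clinic's travel time to itself stays zero
    factor = rng.uniform(1 - max_fraction, 1 + max_fraction)
    return round(base_minutes * factor, 1)

rng = random.Random(42)  # seeded so the sketch is reproducible
sample = jitter_travel_time(12, rng=rng)  # somewhere between 9.0 and 15.0 minutes
```

A multiplicative factor keeps short trips short and long trips long; an additive offset would be an equally reasonable choice.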

#### Generate a clinic dataset with names, locations, and travel-times:

In [3]:
# Construct the clinic dataframe based on Google Maps data
clinics_df = pd.DataFrame({
    'branch_name': ['denver', 'edgewater', 'wheatridge', 'rino', 'lakewood'],
    'lat': [39.73906432357836, 39.753954449845445, 39.76685732722651, 39.767327859566265, 39.70455155721396],
    'lon': [-104.98969659655802, -105.06778796142915, -105.08198265044479, -104.98113186098168, -105.0798829449297],
    'to_denver': [0, 14, 14, 6, 12],      # shortest time (minutes) to Denver branch from each clinic
    'to_edgewater': [12, 0, 5, 14, 8],   # shortest time (minutes) to Edgewater branch from each clinic
    'to_wheatridge': [14, 5, 0, 14, 8],   # shortest time (minutes) to Wheatridge branch from each clinic
    'to_rino':[7, 12, 10, 0, 12],          # shortest time (minutes) to RINO branch from each clinic
    'to_lakewood':[14, 9, 9, 14, 0]     # shortest time (minutes) to Lakewood branch from each clinic
}) 

clinics_df = clinics_df.set_index('branch_name', drop=True)

#### Feature engineer "nearby" clinic info based on travel-times:

In [4]:
# Instantiate empty list to hold tuples of nearby clinics for each location
nearby_clinics = []

# Iterate through each branch location
for index, row in clinics_df.iterrows():
    nearest = []
    dist_to_clinic = ['to_denver', 'to_edgewater', 'to_wheatridge', 'to_rino', 'to_lakewood']
    
    # Collect the other clinics (travel-time > 0 excludes the branch itself)
    for dist in dist_to_clinic:
        if row[dist] > 0:  
            city = dist.split('to_')[1]
            nearest.append((city, row[dist]))
    
    # Sort once by travel-time (shortest first)
    nearest.sort(key=lambda x: x[1])
    nearby_clinics.append(nearest)
    
# Add nearby clinic info to dataframe
clinics_df['nearby_clinics'] = nearby_clinics

In [5]:
clinics_df

| branch_name | lat | lon | to_denver | to_edgewater | to_wheatridge | to_rino | to_lakewood | nearby_clinics |
|---|---|---|---|---|---|---|---|---|
| denver | 39.739064 | -104.989697 | 0 | 12 | 14 | 7 | 14 | [(rino, 7.0), (edgewater, 12.0), (wheatridge, ... |
| edgewater | 39.753954 | -105.067788 | 14 | 0 | 5 | 12 | 9 | [(wheatridge, 5.0), (lakewood, 9.0), (rino, 12... |
| wheatridge | 39.766857 | -105.081983 | 14 | 5 | 0 | 10 | 9 | [(edgewater, 5.0), (lakewood, 9.0), (rino, 10.... |
| rino | 39.767328 | -104.981132 | 6 | 14 | 14 | 0 | 14 | [(denver, 6.0), (edgewater, 14.0), (wheatridge... |
| lakewood | 39.704552 | -105.079883 | 12 | 8 | 8 | 12 | 0 | [(edgewater, 8.0), (wheatridge, 8.0), (denver,... |
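
The loop above keeps every other clinic. If a cut-off were wanted, a threshold could be applied before sorting. The sketch below is a plain-dictionary illustration; `nearby_within` and `max_minutes` are hypothetical names, and the travel times are copied from the Denver row of the output.

```python
def nearby_within(travel_times, max_minutes=None):
    """Sort other clinics by travel time, optionally keeping only those within max_minutes."""
    # A travel time of 0 marks the clinic itself, so it is skipped
    others = [(name, t) for name, t in travel_times.items() if t > 0]
    if max_minutes is not None:
        others = [(name, t) for name, t in others if t <= max_minutes]
    return sorted(others, key=lambda pair: pair[1])

# Travel times from the Denver branch to each branch (minutes)
from_denver = {'denver': 0, 'edgewater': 12, 'wheatridge': 14, 'rino': 7, 'lakewood': 14}
close_to_denver = nearby_within(from_denver, max_minutes=12)
# [('rino', 7), ('edgewater', 12)]
```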


---
---

## 1B: Employee Records

**Objective:** Generate a dataset consisting of employee names, IDs, and roles.

---

**Methodology / Approach:** 

A dataset will be constructed for a total of 15 providers and 35 technicians and include the following features:
- `e_id` : Employee's ID (serves as index)
- `e_name` : Employee's name
- `e_role` : Employee's role (provider vs technician)

As there are 5 clinics, and 2 providers are generally assigned to a clinic per day, a total of 15 employees will be designated as providers (physicians / physician assistants). The remaining 35 employees in this dataset will be designated as 'technicians', an umbrella term for clinic techs, lab techs, and scribes who can be moved between the clinics. For simplicity and relevance, other employee types are not included in this dataset as they won't be part of the analysis or modeling.

In [6]:
# Specify desired counts
num_docs = 15
num_techs = 35

### Employee IDs

In [7]:
def generate_ids(num_employees):
    """Generates 2-digit staff IDs based on input number of employees."""
    
    eids = list(range(11, num_employees+11))
    return eids

### Employee Names

In [8]:
def generate_names(eids):
    """Generates randomized employee names. Input employee IDs are used for seeding purposes."""

    e_names = []
    for eid in eids:
        Faker.seed(eid)  # for consistency
        name = faker.first_name() + ' ' + faker.last_name()  # w/o prefix/suffix 
        e_names.append(name)
    return e_names

### Employee Roles

In [9]:
def generate_roles(num_docs, num_techs):
    """Generates employee roles based on desired counts."""
    
    roles = []
    
    # Generate role title strings based on desired counts
    for i in range(num_docs):
        roles.append('Provider')
    for j in range(num_techs):
        roles.append('Technician')
    
    shuffle(roles)
    return roles

### Compile employee information and construct dataset:

In [10]:
# Generate employee IDs based on desired number of each role
eids = generate_ids(num_docs+num_techs)

employees_df = pd.DataFrame({
    'e_id': eids,
    'e_name': generate_names(eids),
    'e_role': generate_roles(num_docs, num_techs)
})
employees_df = employees_df.set_index('e_id', drop=True)

employees_df.sample(5)

| e_id | e_name | e_role |
|---|---|---|
| 60 | Kyle Lane | Provider |
| 14 | Daniel Park | Provider |
| 44 | Ricky Weaver | Technician |
| 32 | Chris Lewis | Technician |
| 38 | Danielle Patterson | Technician |


---
---

## 1C: Patient Records

**Objective:** Generate a dataset consisting of patient records including the location & date/time of visit.

---

**Methodology / Approach:** 

Patient records will include the following information:
- `pt_id` : Patient's assigned ID number
- `pt_name` : Patient's name
- `pt_dob` : Patient's date of birth
- `pt_age` : Patient's age (engineered from DOB)
- `visit_reason` : Patient's reason for visiting clinic
- `visit_location` : Clinic visited
- `visit_date` : Date of visit
- `checkin_time` : Time the patient checked-in to clinic
- `checkout_time` : Time the patient checked-out of clinic
- `rolling_ct`: Number of patients in clinic when current patient is checking-in
- `rolling_code`: Clinic's average severity level when current patient is checking-in
<br>
Only information relevant to the study will be included in this dataset. Address, insurance, and official diagnosis / code will not be generated, as they are irrelevant to the purpose of the patient records, which is to provide insights for the modeling & navigation stage. In a client-provided dataset, these features would be dropped during pre-processing for this project objective, after any initial exploration.
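
To make the `rolling_ct` definition concrete, here is a minimal sketch of one way to count the patients in the clinic at each check-in. It is an illustration under stated assumptions, not the notebook's implementation: `rolling_counts` is a hypothetical name, times are decimal hours, and the patient checking in is assumed not to count themselves.

```python
def rolling_counts(checkins, checkouts):
    """For each patient, count the other patients checked in and not yet checked out."""
    counts = []
    for i, t in enumerate(checkins):
        in_clinic = sum(
            1
            for j, (cin, cout) in enumerate(zip(checkins, checkouts))
            if j != i and cin <= t < cout  # arrived by time t and still present
        )
        counts.append(in_clinic)
    return counts

checkins = [8.0, 8.5, 9.0]     # decimal hours
checkouts = [9.5, 8.75, 10.0]
counts = rolling_counts(checkins, checkouts)  # [0, 1, 1]
```

The same membership test could be reused for `rolling_code` by averaging the severity codes of the patients that pass it.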

According to the American Academy of Urgent Care Medicine (AAUCM), the average urgent care sees about 60-80 patients per day. This guideline is therefore used to fabricate a full year of past patient records for each of the chain's 5 clinics. 

To simulate real-world data, the locations were deliberately chosen from areas that would see different volumes of incoming patients based on city size and location. Accordingly, the proportion of total yearly patients was set to approximately 0.25 each for the more-populated Denver & Lakewood locations, 0.2 for Wheatridge, and 0.15 each for the smaller Edgewater and RINO areas. Based on the size computed from each proportion, a list of location names was constructed to serve as the visited locations in the patient records.
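
As a rough illustration of turning those proportions into a list of visited locations, consider the sketch below. The helper `build_location_list` is hypothetical, and sending any rounding remainder to the largest branch is an assumption made for the example.

```python
def build_location_list(total_pts, proportions):
    """Repeat each branch name in proportion to its share of total_pts."""
    locations = []
    for branch, share in proportions.items():
        locations.extend([branch] * int(total_pts * share))  # floor multiplication
    # Assumption: patients lost to flooring go to the first-listed largest branch
    largest = max(proportions, key=proportions.get)
    locations.extend([largest] * (total_pts - len(locations)))
    return locations

proportions = {'denver': 0.25, 'lakewood': 0.25, 'wheatridge': 0.2,
               'edgewater': 0.15, 'rino': 0.15}
visit_locations = build_location_list(1000, proportions)
```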

First, a series of helper functions will be defined to generate the data for each attribute. Additional functions that add "noise" to the data will also be set up. Then, data will be constructed for each clinic and compiled together for the final output.

### Patient IDs

In [11]:
def generate_ids(branch, num_pts):
    """Generates 7-digit patient IDs based on input clinic location and patient count."""
    
    # Define starting ID code based on branch location
    start_digit = {
        'denver': 1000001, 
        'edgewater': 2000001, 
        'wheatridge': 3000001, 
        'rino': 4000001, 
        'lakewood': 5000001
    }
    
    # Generate IDs based on start digit and num_pts
    start_id = start_digit[branch]
    end_id = start_id + num_pts
    pids = list(range(start_id, end_id))
    
    return pids

Since approximately 100,000 patient logs will be generated, each patient is given an ID for easier data handling. The ID is 7 digits long so that the length is standardized across all patients. As each location's data is built independently, each branch gets its own leading digit so that IDs never collide across locations.
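
One consequence of this scheme is that the branch can be recovered from the ID alone. A small sketch, where `BRANCH_BY_PREFIX` and `branch_from_pid` are hypothetical names that mirror the starting IDs defined in the function above:

```python
# Leading digit of the 7-digit ID -> branch, matching the starting IDs above
BRANCH_BY_PREFIX = {1: 'denver', 2: 'edgewater', 3: 'wheatridge', 4: 'rino', 5: 'lakewood'}

def branch_from_pid(pid):
    """Return the branch encoded in the leading digit of a 7-digit patient ID."""
    return BRANCH_BY_PREFIX[pid // 1_000_000]

first_denver = branch_from_pid(1000001)   # 'denver'
some_lakewood = branch_from_pid(5000123)  # 'lakewood'
```

This holds as long as no branch generates more than 999,999 patients, which is far above the volumes fabricated here.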

### Patient Names

In [12]:
def generate_names(pids): 
    """Generates randomized patient names. Input patient IDs are used for seeding purposes."""
    pt_names = []
    for pid in pids:
        Faker.seed(pid)  # for consistency
        # name = faker.unique.name()  # version includes prefix/suffix
        name = faker.first_name() + ' ' + faker.last_name()  # w/o prefix/suffix 
        pt_names.append(name)
    return pt_names

A function was set up to generate randomized names through the `Faker` module. A seed (based on the unique patient ID) is used to ensure each run yields exactly the same names. 

### Patient DOBs / Ages

In [13]:
def generate_dobs(num_pts):
    """Generates date of births based on real-world distributions (with added noise)."""
    
    # Construct dict of age groups with corresponding age range and proportion
    age_probs = {
        'age_group_1': [0, 10, 0.14],
        'age_group_2': [11, 20, 0.15],
        'age_group_3': [21, 30, 0.18],
        'age_group_4': [31, 40, 0.16],
        'age_group_5': [41, 50, 0.13],
        'age_group_6': [51, 60, 0.11],
        'age_group_7': [61, 80, 0.13],
    }
    
    dobs = []
    
    # Iterate through each age group and generate DOBs
    for key, val in age_probs.items():
        
        # Assign appropriate naming for easy-to-follow data handling
        min_age, max_age, p = val[0], val[1], val[2]
        
        # Compute number of patients in this age group
        num_pts_in_group = int(num_pts * p)  # floor multiplication
        
        # Iterate through each patient in the current iteration of age group
        for i in range(num_pts_in_group):
            Faker.seed(i)
            dob = faker.date_of_birth(minimum_age=min_age, maximum_age=max_age)
            dobs.append(dob)
    
    # Account for patient-count discrepancy due to rounding/floor multiplication
    discrepancy = num_pts - len(dobs)
    for i in range(discrepancy):
        # Assign leftover patients to the most popular age group
        dob = faker.date_of_birth(minimum_age=21, maximum_age=30) 
        dobs.append(dob)    
    
    # Randomize order of DOBs (so not in order of age groups)
    shuffle(dobs) 
    
    return dobs

The function above is set up to take an input number of patients (so that it can be called individually for each branch location).

According to The Journal Of Urgent Care Medicine (JUCM), patient visits can be broken down by the following age proportions:
- Infant to 10: 13.8%
- 11 to 20: 14.8%
- 21 to 30: 18.3%
- 31 to 40: 15.9%
- 41 to 50: 12.8%
- 51 to 60: 11.2%
- 61+: 13.3%

*Source: https://www.jucm.com/urgent-care-is-an-appropriate-setting-for-any-age-but-what-ages-are-showing-up-the-most/*

Therefore, the function was designed to adhere to this distribution as closely as possible. The proportions were rounded to two decimal places, and the oldest group was capped at 80 to set aside the extremely rare 80+ patients. Lastly, a seed is set for each iteration within an age group to keep the generated dates consistent across runs, and any patients left over from floor multiplication are assigned to the most common age group (21 to 30).
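
The size of the floor-multiplication discrepancy is easy to check by hand. The sketch below reuses the rounded proportions from `generate_dobs`; the helper name `floor_counts` and the example patient count of 73 are chosen only for illustration.

```python
age_shares = {'0-10': 0.14, '11-20': 0.15, '21-30': 0.18, '31-40': 0.16,
              '41-50': 0.13, '51-60': 0.11, '61-80': 0.13}

def floor_counts(num_pts, shares):
    """Floor each group's patient count and report how many patients are left over."""
    counts = {group: int(num_pts * p) for group, p in shares.items()}
    leftover = num_pts - sum(counts.values())
    return counts, leftover

counts, leftover = floor_counts(73, age_shares)
# counts sum to 70, so 3 patients remain to be assigned
```

With seven groups, the leftover can never exceed six patients per call, so its effect on the overall distribution is small.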

In [14]:
def convert_dob(dob):
    """Converts input date of birth to age, based on today's date."""

    today = datetime.date.today()
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return age

The helper function above feature engineers each DOB value into age. Age is computed using the current date of script execution.
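
The tuple comparison in `convert_dob` subtracts one year when the birthday has not yet occurred in the current year. A variant with an explicit reference date (the `today` parameter and the name `age_on` are additions for this sketch) makes that easy to verify:

```python
import datetime

def age_on(dob, today):
    """Age in whole years on a given reference date."""
    # (month, day) tuples compare lexicographically; True counts as 1
    before_birthday = (today.month, today.day) < (dob.month, dob.day)
    return today.year - dob.year - before_birthday

dob = datetime.date(2000, 6, 15)
day_before = age_on(dob, datetime.date(2022, 6, 14))  # 21
on_birthday = age_on(dob, datetime.date(2022, 6, 15))  # 22
```

Passing the reference date explicitly would also make the computed ages reproducible, since the original depends on the day the notebook is run.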

### Reason for Visit (& Code)

In [15]:
def generate_reasons(num_pts):
    """Generates randomized list of reasons-of-visit based on sensible proportions."""
    
    reason_probs = {
        'cold/flu/fever': 12/80,
        'sore-throat': 5.5/80, 
        'cough': 5.8/80, 
        'chest-pain': 5/80,
        'stomach-pain': 4.5/80, 
        'diarrhea': 0.8/80, 
        'weakness/dizziness': 2.7/80,
        'headache': 2.6/80, 
        'UTI': 4/80, 
        'pink-eye': 0.7/80, 
        'ear-pain': 5.3/80, 
        'rash/allergy': 2.5/80, 
        'cuts/abscess': 1.5/80,
        'ache/pain': 5/80, 
        'injury/accident': 1.5/80, 
        'covid-test': 7.5/80, 
        'vaccination': 5.6/80, 
        'physical': 1/80, 
        'drug-test': 3/80, 
        'lab-work': 3.5/80
    }
    
    reasons = []
    
    # Iterate through each common visit-reasons
    for reason, prob in reason_probs.items():
        
        # Compute number of patients from input total based on reason probability
        num_pts_for_reason = int(num_pts * prob)  # floor multiplication
        
        # Based on patient count for reason, add reasons to master list
        reasons.extend([reason for i in range(num_pts_for_reason)])
    
    # Account for patient count discrepancy due to rounding/floor multiplication
    discrepancy = num_pts - len(reasons)
    for i in range(discrepancy):
        # Assign leftover patients to the most popular reason
        reasons.append('cold/flu/fever')    
    
    # Randomize order of reasons
    shuffle(reasons) 
    
    return reasons

The function above generates patient visit-reasons. 20 of the most common visit-reasons were chosen based on CDC data as well as personal work experience. Although literature was sought to guide the tuning of these proportions, only ER data was available, which isn't directly applicable to urgent cares. Therefore, for this segment, personal work experience in urgent cares was used to tune the proportion of each reason type. This variation is built into the data for more insightful EDA and to simulate the streaming as close to real life as possible. In production, these values would be replaced by the proportions observed in actual data rather than fabricated ones.

*Sources Referenced:*
- *https://www.drtsbeck.com/blog/top-10-reasons-to-seek-an-urgent-care-visit*
- *https://inandoutexpresscare.com/top-9-reasons-for-an-urgent-care-visit/*
- *https://www.cdc.gov/nchs/data/hus/2012/fig23.pdf*
- *https://www.hcup-us.ahrq.gov/reports/statbriefs/sb286-ED-Frequent-Conditions-2018.pdf*
- *https://www.definitivehc.com/blog/top-20-most-common-er-diagnoses*
- *https://aspe.hhs.gov/sites/default/files/private/pdf/265086/ED-report-to-Congress.pdf* (trends, COVID discussion)
- *https://www.jucm.com/what-are-the-reporting-obligations-of-urgent-care-centers-for-covid-19-patients/* (high-level discussion of UrgentCare reporting requirements, including context to COVID)

From our research, we find that ER and urgent care visits aren't similar enough to draw on the more widely available ER data when fabricating the less commonly available urgent care data. Therefore, these sources were referenced for the common visit reasons, and proportions were tuned based on work experience as a technician in an urgent-care setting. Since this project aims to study the variation in patient traffic, it matters more that the data contains different reasons with varying proportions and visit lengths than what those exact values are.
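
Because the proportions were tuned by hand as patients out of an 80-patient day, a quick sanity check that they add up to the full day is worthwhile. The sketch below copies the numerators from `generate_reasons`; the dictionary name `reason_weights` is hypothetical.

```python
import math

# Numerators from generate_reasons (each is patients out of an 80-patient day)
reason_weights = {
    'cold/flu/fever': 12, 'sore-throat': 5.5, 'cough': 5.8, 'chest-pain': 5,
    'stomach-pain': 4.5, 'diarrhea': 0.8, 'weakness/dizziness': 2.7,
    'headache': 2.6, 'UTI': 4, 'pink-eye': 0.7, 'ear-pain': 5.3,
    'rash/allergy': 2.5, 'cuts/abscess': 1.5, 'ache/pain': 5,
    'injury/accident': 1.5, 'covid-test': 7.5, 'vaccination': 5.6,
    'physical': 1, 'drug-test': 3, 'lab-work': 3.5,
}

total = sum(reason_weights.values())
# isclose guards against floating-point rounding in the sum
weights_cover_full_day = math.isclose(total, 80)
```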

In [16]:
def generate_severity_code(reason):
    """Outputs a severity-level for input reason of visit."""
    
    # Construct a dict of reasons and corresponding code
    reasons_code = {
        'cold/flu/fever': 4, 
        'sore-throat': 4, 
        'cough': 4,  
        'chest-pain': 5, 
        'stomach-pain': 5,
        'diarrhea': 5, 
        'weakness/dizziness': 5,
        'headache': 5, 
        'UTI': 4, 
        'pink-eye': 4,
        'ear-pain': 4,
        'rash/allergy': 5,
        'cuts/abscess': 5, 
        'ache/pain': 4, 
        'injury/accident': 5, 
        'covid-test': 4,  
        'vaccination': 3,
        'physical': 3, 
        'drug-test': 3, 
        'lab-work': 3
    }
    
    # Output corresponding severity level / code
    code = reasons_code[reason]    
    return code

The helper function above was set up as a possible added layer of sophistication for patient data. It maps each visit reason to a severity code, which would affect the reaction time of technicians and other staff members when faced with multiple waiting patients. The set {3, 4, 5} was chosen based on a 3-tiered system followed at urgent care clinics in St. Louis, MO (based on work experience). This information will also be useful for exploratory visualizations to inform scheduling and modeling stages down the pipeline.

### Patient Visit-Dates

In [17]:
def generate_dates(ppd):
    """Generates dates data based on input patients-per-day array of past year."""
    
    # Get unique dates from past year
    past_yr_dates = pd.date_range(datetime.date(2021,5,1), periods=396).tolist()
    
    # Generate duplicate dates for each unique date based on patient-per-day records
    dates_all = []
    for i in range(len(past_yr_dates)):
        date = past_yr_dates[i]
        dates_all.extend([date] * ppd[i])
    
    return dates_all

The function above takes an input array consisting of the daily patient tally for an entire year for a single clinic. Based on that information, it fabricates visit-date records for all patients belonging to that clinic. Note: The proportions of COVID-related reasons ("cold/flu/fever", etc.) are not representative of Denver's 2021-2022 waves. Rather, they are meant to reflect the latter stages of a pandemic, when tests and cases are not at their peak, in keeping with the objective of this simulation project.
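
The date expansion can be illustrated without pandas. In the sketch below, `expand_dates` is a hypothetical stand-in for `generate_dates`, and the three-day tally is made up for the example. The last line shows which calendar day the 396-period range ends on.

```python
import datetime

def expand_dates(start, ppd):
    """Repeat each consecutive date once per patient seen on that day."""
    dates = []
    for offset, count in enumerate(ppd):
        day = start + datetime.timedelta(days=offset)
        dates.extend([day] * count)
    return dates

start = datetime.date(2021, 5, 1)
expanded = expand_dates(start, [2, 0, 1])  # two visits on May 1, none on May 2, one on May 3

# 396 daily periods starting 2021-05-01 run through 2022-05-31
last_day = start + datetime.timedelta(days=396 - 1)
```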

### Patient Check-In Times

In [18]:
### Define possible specifications for each location's peak hours / "noise"

# Denver-Clinic: 
denver_ctime_specs = {
    'weekday_means1': [8, 8.25, 8.5, 8.75, 9],         # First weekday peak possibilities of Denver location
    'weekday_means2': [11, 11.25, 11.5, 11.75],        # Second weekday peak possibilities of Denver location
    'weekday_means3': [16, 16.25, 16.5, 16.75],        # Third weekday peak possibilities of Denver location
    'weekday_sigmas': [1.8, 1.9, 2.1, 2.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [10.5, 11, 11.5, 12],            # First weekend peak possibilities of Denver location
    'weekend_means2': [14, 14.5, 15, 15.5],            # Second weekend peak possibilities of Denver location
    'weekend_means3': [17, 17.25, 17.5, 17.75],        # Third weekend peak possibilities of Denver location
    'weekend_sigmas': [1.8, 1.9, 2.1, 2.2]             # Possible weekend variations (standard-deviations)
}

# Edgewater-Clinic: 
edgewater_ctime_specs = {
    'weekday_means1': [9, 9.25, 9.5, 9.75],            # First weekday peak possibilities of Edgewater location
    'weekday_means2': [13, 13.25, 13.5, 13.75],        # Second weekday peak possibilities of Edgewater location
    'weekday_means3': [18.25, 18.5, 18.75],            # Third weekday peak possibilities of Edgewater location
    'weekday_sigmas': [1.8, 1.9, 2.1, 2.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [8, 8.25, 8.5, 8.75],            # First weekend peak possibilities of Edgewater location
    'weekend_means2': [12, 12.25, 12.5, 12.75],        # Second weekend peak possibilities of Edgewater location
    'weekend_means3': [16, 16.25, 16.5, 16.75],        # Third weekend peak possibilities of Edgewater location
    'weekend_sigmas': [1.8, 1.9, 2.1, 2.2]             # Possible weekend variations (standard-deviations)
}
    
# Wheatridge-Clinic: 
wheatridge_ctime_specs = {
    'weekday_means1': [8, 8.25, 8.5, 8.75, 9],         # First weekday peak possibilities of Wheatridge location
    'weekday_means2': [11, 11.25, 11.5, 11.75],        # Second weekday peak possibilities of Wheatridge location
    'weekday_means3': [16.5, 16.75, 17, 17.25],        # Third weekday peak possibilities of Wheatridge location
    'weekday_sigmas': [1.8, 1.9, 2.1, 2.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [10.5, 11, 11.5, 12],            # First weekend peak possibilities of Wheatridge location
    'weekend_means2': [14, 14.5, 15, 15.5],            # Second weekend peak possibilities of Wheatridge location
    'weekend_means3': [17, 17.25, 17.5, 17.75],        # Third weekend peak possibilities of Wheatridge location
    'weekend_sigmas': [1.8, 1.9, 2.1, 2.2]             # Possible weekend variations (standard-deviations)
}

# RINO-Clinic: 
rino_ctime_specs = {
    'weekday_means1': [9, 9.25, 9.5, 9.75, 10],        # First weekday peak possibilities of RINO location
    'weekday_means2': [13, 13.25, 13.5, 13.75],        # Second weekday peak possibilities of RINO location
    'weekday_means3': [18, 18.25, 18.5, 18.75],        # Third weekday peak possibilities of RINO location
    'weekday_sigmas': [1.8, 1.9, 2.1, 2.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [8, 8.25, 8.5, 8.75, 9],         # First weekend peak possibilities of RINO location
    'weekend_means2': [12, 12.25, 12.5, 12.75],        # Second weekend peak possibilities of RINO location
    'weekend_means3': [16, 16.25, 16.5, 16.75],        # Third weekend peak possibilities of RINO location
    'weekend_sigmas': [1.8, 1.9, 2.1, 2.2]             # Possible weekend variations (standard-deviations)
}

# Lakewood-Clinic: 
lakewood_ctime_specs = {
    'weekday_means1': [8, 8.25, 8.5, 8.75],            # First weekday peak possibilities of Lakewood location
    'weekday_means2': [11.25, 11.5, 11.75],            # Second weekday peak possibilities of Lakewood location
    'weekday_means3': [15, 15.25, 15.5, 15.75],        # Third weekday peak possibilities of Lakewood location
    'weekday_sigmas': [1.8, 1.9, 2.1, 2.2],            # Possible weekday variations (standard-deviations)
    'weekend_means1': [10, 10.25, 10.5, 10.75],        # First weekend peak possibilities of Lakewood location
    'weekend_means2': [15, 15.25, 15.5, 15.75],        # Second weekend peak possibilities of Lakewood location
    'weekend_means3': [19, 19.25, 19.5, 19.75],        # Third weekend peak possibilities of Lakewood location
    'weekend_sigmas': [1.8, 1.9, 2.1, 2.2]             # Possible weekend variations (standard-deviations)
}

"Peak-Hours" specifications for the multi-modal weekday and weekend models are defined above for each location. A variety of peaks and standard deviations are set up to capture the diversity seen in real-world urgent care clinics' daily influx charts (according to Google Maps' "Popular Times" feature). These will be used and further tuned to generate the patient check-in time data. 

In [19]:
### WEEKDAY CHECK-IN TIME GENERATOR
def generate_weekday_ctimes(N, means, sigmas):
    """Generates N check-in times from multi-modal distribution based on input mus & sigmas."""
    
    # Specify the lower and upper limit based on clinic operating times (8am-8pm / 8:00-20:00)
    low, upp = 8, 20
    
    # Split N into three groups
    leftover = N % 3
    N_vals = [int(N/3), int(N/3), int(N/3)+leftover]
    
    # Create a set of times distributed around the three input means
    X = []
    for i in range(3):
        vals = truncnorm((low - means[i]) / sigmas[i], (upp - means[i]) / sigmas[i], loc=means[i], scale=sigmas[i])
        vals = vals.rvs(N_vals[i])
        X.append(vals)
    
    # Concatenate the three weekday "peaks"
    X = np.concatenate([X[0],X[1],X[2]])
    
    return X

### WEEKEND CHECK-IN TIME GENERATOR
def generate_weekend_ctimes(N, means, sigmas):
    """Generates N check-in times from multi-modal distribution based on input mus & sigmas."""
    
    # Specify the lower and upper limit based on clinic operating times (8am-8pm / 8:00-20:00)
    low, upp = 8, 20
    
    # Split N into three groups
    leftover = N % 3
    N_vals = [int(N/3), int(N/3), int(N/3)+leftover]
    
    # Create a set of times distributed around the three input means
    X = []
    for i in range(3):
        vals = truncnorm((low - means[i]) / sigmas[i], (upp - means[i]) / sigmas[i], loc=means[i], scale=sigmas[i])
        vals = vals.rvs(N_vals[i])
        X.append(vals)
    
    # Concatenate the three weekend "peaks"
    X = np.concatenate([X[0],X[1],X[2]])
    
    return X


### TIME GENERATOR
def generate_ctimes(ppd, branch_specs):
    """Calls on appropriate functions with corresponding branch specs to generate check-in times."""
    
    check_in_times = []
    
    # Define unique dates of last year
    dates = pd.date_range(datetime.date(2021,5,1), periods=396).tolist()

    for i in range(len(dates)):
        
        # Examine whether date is weekend or weekday
        day = dates[i].weekday()
        
        # Gather specs and generate check-in times for weekend
        if day == 5 or day == 6:
            means = [choice(branch_specs['weekend_means1']), choice(branch_specs['weekend_means2']), choice(branch_specs['weekend_means3'])]
            sigmas = [choice(branch_specs['weekend_sigmas']), choice(branch_specs['weekend_sigmas']), choice(branch_specs['weekend_sigmas'])]
            ctimes = generate_weekend_ctimes(ppd[i], means, sigmas)
            check_in_times.extend(ctimes)   
        
        # Gather specs and generate check-in times for weekdays
        else:
            means = [choice(branch_specs['weekday_means1']), choice(branch_specs['weekday_means2']), choice(branch_specs['weekday_means3'])]
            sigmas = [choice(branch_specs['weekday_sigmas']), choice(branch_specs['weekday_sigmas']), choice(branch_specs['weekday_sigmas'])]
            ctimes = generate_weekday_ctimes(ppd[i], means, sigmas)
            check_in_times.extend(ctimes)      
        
    return check_in_times
    

### HELPER FUNCTION
def convert_time(x):
    """Converts time from decimal format to appropriate datetime object."""
    
    # Grab each portion of input time
    hour = int(abs(x))
    leftover_decimal = x - hour
    minutes = int(leftover_decimal * 60)
    seconds = int(leftover_decimal * 60 * 60 % 60)
    
    # Format as a time string (strftime returns a str, not a datetime object)
    time = datetime.time(hour, minutes, seconds).strftime('%X')
    
    return time

The first function above generates weekday check-in times from a tri-modal (3 "peaks") distribution representing different parts of the day (morning, afternoon, evening). A second function generates weekend check-in times in the same way, using the weekend peaks. `generate_ctimes` then picks that day's peaks and standard deviations from the weekday or weekend specs for each date. Lastly, a helper function converts each time value from a float to a formatted time string.
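
For readers without SciPy, the same idea can be sketched with the standard library by using rejection sampling in place of `truncnorm`: draw from a normal distribution and discard anything outside opening hours. This is a swapped-in technique for illustration, not the notebook's method, and the names `sample_truncated_normal` and `sample_checkins` are hypothetical.

```python
import random

def sample_truncated_normal(mean, sigma, low, upp, rng):
    """Draw from N(mean, sigma) and retry until the value falls in [low, upp]."""
    while True:
        x = rng.gauss(mean, sigma)
        if low <= x <= upp:
            return x

def sample_checkins(n, means, sigmas, low=8, upp=20, rng=None):
    """Split n patients across three peaks and sample a check-in time for each."""
    rng = rng or random.Random()
    sizes = [n // 3, n // 3, n // 3 + n % 3]  # remainder goes to the last peak
    times = []
    for size, mean, sigma in zip(sizes, means, sigmas):
        times.extend(sample_truncated_normal(mean, sigma, low, upp, rng) for _ in range(size))
    return times

times = sample_checkins(10, means=[8.5, 11.5, 16.5], sigmas=[1.8, 2.1, 1.9],
                        rng=random.Random(0))
```

Rejection sampling is simple but slows down when a peak sits far outside the allowed window; `truncnorm` avoids that by sampling the truncated distribution directly, with its bounds expressed in standard deviations from the mean, as in `(low - mean) / sigma`.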

### Patient Check-Out Times

In [20]:
def generate_checkout_times(data):
    """Generates a checkout time (with added 'noise') based on check-in time and reason for visit."""
    
    # Unpack input data
    check_in_time, reason = data[0], data[1]
    
    # Construct a dict of reasons and expected time of stay
    reasons_time = {
        'cold/flu/fever': 60, 
        'sore-throat': 30, 
        'cough': 45,  
        'chest-pain': 75, 
        'stomach-pain': 80,
        'diarrhea': 55, 
        'weakness/dizziness': 70,
        'headache': 50, 
        'UTI': 40, 
        'pink-eye': 25,
        'ear-pain': 35,
        'rash/allergy': 46,
        'cuts/abscess': 42, 
        'ache/pain': 38, 
        'injury/accident': 72, 
        'covid-test': 22,  
        'vaccination': 18,
        'physical': 24, 
        'drug-test': 28, 
        'lab-work': 37
    }
    
    # Possible variations in appointment time (in minutes): -5 to +4 inclusive
    variations = list(range(-5, 5))
    
    # Compute check-out time based on check-in time with added "noise"/variation
    check_in_time = datetime.datetime.strptime(check_in_time, '%H:%M:%S')
    checkout_time = check_in_time + timedelta(minutes=reasons_time[reason] + choice(variations))
    
    return checkout_time

Checkout times are generated from the expected length of stay for the given reason, with added variations to mimic the imperfections of real-world data. Although no source gives direct estimates of patient stay by reason, our research showed that most visits range from 20min to 1+hour (*https://www.healthline.com/health/right-care-right-time/know-before-you-go*). Starting from this base, personal work experience was once again used to tune the specific duration for each reason. As with the proportions of visit reasons, the actual numbers matter less than reproducing the diversity in checkout times across visit reasons, which we accomplish above. 

### "Past" Patient Influx (per Clinic & per Day)

In [21]:
### Generate lists of patients-per-day at each location

ppd_denver = np.random.normal(-7, 7, 396)
ppd_denver = ppd_denver + 80
ppd_denver = ppd_denver.astype(int)

ppd_edgewater = np.random.normal(-5, 5, 396)
ppd_edgewater = ppd_edgewater + 55
ppd_edgewater = ppd_edgewater.astype(int)

ppd_wheatridge = np.random.normal(-6, 6, 396)
ppd_wheatridge = ppd_wheatridge + 65
ppd_wheatridge = ppd_wheatridge.astype(int)

ppd_rino = np.random.normal(-4, 4, 396)
ppd_rino = ppd_rino + 60
ppd_rino = ppd_rino.astype(int)

ppd_lakewood = np.random.normal(-6, 6, 396)
ppd_lakewood = ppd_lakewood + 70
ppd_lakewood = ppd_lakewood.astype(int)

The code block above constructs an array for each clinic branch, holding the patient count for each of the 396 days covered by the fabricated records. To add noise, counts were drawn from a normal distribution with a location-specific mean and standard deviation (for example, Denver's counts are centered at 80 - 7 = 73 with a standard deviation of 7). The mean for each location was chosen based on the population of its area. This noise helps the data resemble real-world records and sets up the remaining features of the patient dataset handled below.

According to the American Academy of Urgent Care Medicine (AAUCM), urgent cares typically see 60-80 patients per day (*https://aaucm.org/faq/*), which aligns with personal work experience. These estimates serve as the base for each location, with slight deviations based on each area's population and randomized day-to-day variations.
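
The five blocks above share one pattern, so they could be folded into a single helper. The following is a minimal sketch rather than the notebook's actual code: the function name `simulate_ppd`, its parameters, and the fixed seed are assumptions introduced for illustration. It draws from a normal distribution with mean `-spread` and standard deviation `spread`, shifts the result up by `base`, and truncates to integers, mirroring the arithmetic above.

```python
import numpy as np

def simulate_ppd(base, spread, days=396, seed=None):
    """Hypothetical helper: daily patient counts for one clinic."""
    rng = np.random.default_rng(seed)
    # Same arithmetic as above: normal(-spread, spread) shifted up by `base`
    counts = rng.normal(loc=-spread, scale=spread, size=days) + base
    return counts.astype(int)

# Denver-like parameters (base 80, spread 7); the seed only makes the run repeatable
ppd_example = simulate_ppd(base=80, spread=7, seed=42)
```

Passing a seed is optional; it is shown here only because a repeatable run makes the fabricated parameters easier to compare between attempts.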

### Compile patient information for each clinic, feature-engineer desired attributes, and construct finalized dataset:

In [22]:
def rolling_stats(branch_df):
    """Generates, for each record of the input branch, the rolling patient count and mean severity code of patients in clinic at check-in."""
    
    rolling_ct = []
    rolling_severities = []
    
    # Convert check-out datetimes to time objects once, so they can be compared with check-in times
    checkout_as_time = branch_df['checkout_time'].apply(lambda x: x.time())
    
    # Iterate through each possible date
    for date in branch_df['visit_date'].unique():
        
        # Grab the corresponding check-in & check-out times and severity codes for the date
        on_date = branch_df.visit_date == date
        checkins = branch_df[on_date].checkin_time.values
        checkouts = checkout_as_time[on_date].values
        severity_levels = branch_df[on_date].visit_code.values
        
        # Iterate through each check-in time
        for i in range(len(checkins)):
            
            # Current iteration of check-in time
            current_checkin = checkins[i]
            
            # Instantiate patient count at 1
            counter = 1
            current_severities = [0]
            
            # Iterate through all past check-out times (before current check-in time)
            for j in range(i):
                
                # If previous patient's check-out time is after current patient's check-in time, increment counter
                if checkouts[j] > current_checkin:
                    counter += 1
                    current_severities.append(severity_levels[j])
            
            # Average out visit codes for patients currently in clinic 
            if len(current_severities) == 1:
                mean_severity = 0
            else:
                mean_severity = np.mean(current_severities[1:])
            
            rolling_ct.append(counter)
            rolling_severities.append(round(mean_severity, 1))
            
            
    return rolling_ct, rolling_severities

The helper function above iterates through a branch's patient records and, for each check-in, counts how many patients are in the clinic (including the new arrival) and averages the severity codes of the patients already present. These features will be useful for EDA and modeling to better inform the scheduling process.
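
To make the counting rule concrete, here is a stripped-down sketch of the same idea on five visits for a single clinic and date. The function name `rolling_count` is hypothetical, the times are illustrative values, and the severity averaging is left out to keep the example short.

```python
from datetime import time

def rolling_count(checkins, checkouts):
    """Count patients in the clinic (including the new arrival) at each check-in.

    Assumes both lists are ordered by check-in time.
    """
    counts = []
    for i, arrival in enumerate(checkins):
        # Earlier patients are still present if they check out after this arrival
        still_here = sum(1 for j in range(i) if checkouts[j] > arrival)
        counts.append(still_here + 1)
    return counts

# Five illustrative visits for one clinic on one date
checkins = [time(8, 26, 14), time(8, 35, 53), time(9, 33, 24), time(9, 40, 57), time(10, 8, 45)]
checkouts = [time(9, 7, 14), time(9, 13, 53), time(10, 24, 24), time(11, 10, 57), time(10, 35, 45)]

counts = rolling_count(checkins, checkouts)
print(counts)  # [1, 2, 1, 2, 3]
```

The third patient arrives after the first two have left, so the count drops back to 1 before climbing again.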

In [23]:
def generate_dataset(branch, ppd, specs):
    """Executes necessary functions to generate patient records for an input branch location."""
    
    # Grab total number of patients visiting input branch location in past year
    num_pts = ppd.sum() 

    # Execute functions to generate data for each attribute
    ids = generate_ids(branch, num_pts)
    names = generate_names(ids)
    dobs = generate_dobs(num_pts)
    ages = [convert_dob(dob) for dob in dobs]
    locations = [branch for i in range(num_pts)]
    reasons = generate_reasons(num_pts)
    codes = [generate_severity_code(reason) for reason in reasons]
    dates = generate_dates(ppd)
    checkin_times = generate_ctimes(ppd, specs)
    checkin_times = [convert_time(time) for time in checkin_times]

    # Construct dataframe from generated data
    pts_df = pd.DataFrame({
        'pt_id': ids,
        'pt_name': names,
        'pt_dob': dobs,
        'pt_age': ages,
        'visit_location': locations,
        'visit_reason': reasons,
        'visit_code': codes,
        'visit_date': dates,
        'checkin_time': checkin_times
    })

    # Create check-out times
    pts_df['checkout_time'] = pts_df[['checkin_time', 'visit_reason']].apply(generate_checkout_times, axis=1)

    # Ensure check-in times are in datetime format
    pts_df['checkin_time'] = pts_df['checkin_time'].apply(lambda x: datetime.datetime.strptime(x, '%H:%M:%S').time())

    # Sort records based on visit_date and checkin_time
    pts_df = pts_df.sort_values(['visit_date', 'checkin_time', 'checkout_time'])
    
    # Feature-engineer a rolling count and a rolling severity level of patients in clinic
    pts_df['rolling_ct'], pts_df['rolling_code'] = rolling_stats(pts_df)
    
    # Adjust check-out time based on clinic's current rolling severity-code (2 extra minutes per severity point)
    pts_df['checkout_time'] = pts_df[['checkout_time', 'rolling_code']].apply(lambda x: x['checkout_time'] + datetime.timedelta(minutes=x['rolling_code']*2), axis=1)
    pts_df['checkout_time'] = pts_df['checkout_time'].apply(lambda x: x.time())
    
    # Feature-engineer day information based on dates
    pts_df['visit_day'] = pts_df['visit_date'].apply(lambda x: x.weekday()) \
            .map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})
    
    # Rearrange column order as desired
    pts_df = pts_df[[
        'pt_id', 'pt_name', 'pt_dob', 'pt_age',                                                     # Patient Info
        'visit_location', 'visit_reason', 'visit_code',                                             # Visit Info (Location & Reason)
        'visit_date', 'visit_day', 'checkin_time', 'checkout_time', 'rolling_ct', 'rolling_code'    # Visit Info (Day/Date/Time)
    ]]
    
    return pts_df    

In [24]:
### DENVER
denver_df = generate_dataset('denver', ppd_denver, denver_ctime_specs)

### EDGEWATER
edgewater_df = generate_dataset('edgewater', ppd_edgewater, edgewater_ctime_specs)

### WHEATRIDGE
wheatridge_df = generate_dataset('wheatridge', ppd_wheatridge, wheatridge_ctime_specs)

### RINO
rino_df = generate_dataset('rino', ppd_rino, rino_ctime_specs)

### LAKEWOOD
lakewood_df = generate_dataset('lakewood', ppd_lakewood, lakewood_ctime_specs)

### TOTAL
patients_df = pd.concat([denver_df, edgewater_df, wheatridge_df, rino_df, lakewood_df], axis=0)
patients_df = patients_df.set_index('pt_id', drop=True)

In [25]:
# patients_df

Above, the separately designed functions were executed to compile all patient-record features into the patients dataset. The functionality was split up this way so that the fabrication parameters can be tuned in one place (without changing multiple moving parts of the code on each attempt) and so that the pipeline is easy to run.

---
---

## 1D: Employee Scheduler

**Objective:** Fabricate an employee schedule for the dates in the past patient records.

---

**Methodology / Approach:** 

From the perspective of the client (an urgent care clinic chain), the old method of building an employee schedule assigns techs based on the anticipated number of patients during peak hours for a given clinic and day. The client's acceptable threshold is a patient-to-technician ratio of at most 3:1, maintained as much as possible. The past patient records will therefore incorporate this staffing data, to be studied and optimized during modeling in later stages of the project pipeline.
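
The 3:1 rule amounts to a ceiling division of the peak patient count by three. A minimal sketch follows; the function name `techs_needed` and its `patients_per_tech` parameter are hypothetical and only serve the illustration.

```python
import math

def techs_needed(num_patients, patients_per_tech=3):
    """Smallest number of techs that keeps the patient:tech ratio at or below 3:1."""
    return math.ceil(num_patients / patients_per_tech)

# A peak of 10 patients needs 4 techs, while 7 or 9 patients need 3
staffing = {peak: techs_needed(peak) for peak in (0, 3, 7, 9, 10)}
print(staffing)  # {0: 0, 3: 1, 7: 3, 9: 3, 10: 4}
```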

#### Generate schedule based on peak number of patients:

In [26]:
# Retrieve technicians for assignment
techs_df = employees_df[employees_df.e_role == 'Technician']

# Grab unique dates for each location and max patients at peak time for schedule generation
schedule_df = patients_df.groupby(['visit_date', 'visit_location']).max()[['rolling_ct']].reset_index(drop=False)

# Assign techs based on 3:1 ratio at peak hours
schedule_df['assigned_num_techs'] = schedule_df.rolling_ct.apply(lambda x: int(x/3)+1 if x%3 != 0 else int(x/3))

# Assign 3 techs to each location per date (note: `choices` samples with replacement, so a tech can repeat within a day)
tech_names = []
for date in schedule_df.visit_date.unique():
    assigned_techs = choices(techs_df.e_name.tolist(), k=15)
    split_names = np.array_split(assigned_techs, 5)
    split_names = [i.tolist() for i in split_names]
    tech_names.extend(split_names)
schedule_df['assigned_techs'] = tech_names       

schedule_df

,visit_date,visit_location,rolling_ct,assigned_num_techs,assigned_techs
0,2021-05-01,denver,10,4,"[John Shaw, Amy Anderson, Chris Lewis]"
1,2021-05-01,edgewater,7,3,"[Pamela Ross, Erica Glover, Howard Stanley]"
2,2021-05-01,lakewood,11,4,"[Pamela Ross, Parker Hall, Sara Fry]"
3,2021-05-01,rino,9,3,"[Ricky Weaver, Ana Christensen, Thomas Sandoval]"
4,2021-05-01,wheatridge,9,3,"[Jonathon Jennings, Rebecca Parks, Christina Wu]"
...,...,...,...,...,...
1975,2022-05-31,denver,9,3,"[Sara Fry, Ricky Weaver, Sara Fry]"
1976,2022-05-31,edgewater,7,3,"[Thomas Sandoval, Thomas Sandoval, Ricky Weaver]"
1977,2022-05-31,lakewood,9,3,"[Danielle Johnson, Amy Anderson, Tammy Castro]"
1978,2022-05-31,rino,8,3,"[Robin Taylor, Nicholas Mcintosh, Ricky Weaver]"
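
In the sample output above, some technicians appear more than once on the same date (for example, Pamela Ross at both Edgewater and Lakewood on 2021-05-01), because `choices` samples with replacement. If strictly non-overlapping assignments are wanted, one option is to sample without replacement. The sketch below is an assumption-laden illustration, not the notebook's method: the helper name `assign_without_overlap` and the placeholder roster are hypothetical, and it keeps the fixed three techs per location used above.

```python
from random import sample

def assign_without_overlap(tech_names, locations, techs_per_location=3):
    """Hypothetical helper: give each location distinct techs for one date."""
    needed = len(locations) * techs_per_location
    if needed > len(tech_names):
        raise ValueError("not enough technicians to avoid overlaps")
    picked = sample(tech_names, k=needed)  # sampling without replacement
    return {
        loc: picked[i * techs_per_location:(i + 1) * techs_per_location]
        for i, loc in enumerate(locations)
    }

# Placeholder roster; in the notebook this would come from techs_df.e_name
roster = [f"tech_{n}" for n in range(20)]
locations = ['denver', 'edgewater', 'lakewood', 'rino', 'wheatridge']
assignment = assign_without_overlap(roster, locations)
```

This only works when the roster holds at least as many technicians as there are slots to fill on a given date, which is why the helper raises an error otherwise.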


#### Integrate number of working technicians in patient logs:

In [27]:
# Create dictionary object consisting of assigned number of techs per (date, location)
schedule_dict = dict(zip(
    zip(schedule_df.visit_date, schedule_df.visit_location),
    schedule_df.assigned_num_techs
))

# Assign techs in patient logs based on created dictionary
patients_df['assigned_num_techs'] = patients_df[['visit_date', 'visit_location']] \
    .apply(lambda x: (x['visit_date'], x['visit_location']), axis=1) \
    .map(schedule_dict)

# patients_df

#### Compute number of needed technicians at any given patient check-in time:

In [28]:
patients_df['needed_num_techs'] = patients_df.rolling_ct.apply(lambda x: int(x/3)+1 if x%3 != 0 else int(x/3))

In [29]:
patients_df

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code,assigned_num_techs,needed_num_techs
1000009,Zachary Duncan,1943-05-04,79,denver,cough,4,2021-05-01,Saturday,08:26:14,09:07:14,1,0.0,4,1
1000017,Adriana Johnston,2002-03-17,20,denver,sore-throat,4,2021-05-01,Saturday,08:35:53,09:13:53,2,4.0,4,1
1000013,Christine Ryan,1945-08-22,76,denver,headache,5,2021-05-01,Saturday,09:33:24,10:24:24,1,0.0,4,1
1000021,Brittany Mendoza,1956-12-11,65,denver,stomach-pain,5,2021-05-01,Saturday,09:40:57,11:10:57,2,5.0,4,1
1000016,Luke Williams,1998-12-15,23,denver,covid-test,4,2021-05-01,Saturday,10:08:45,10:35:45,3,5.0,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5025098,Matthew Reynolds,2001-09-11,20,lakewood,stomach-pain,5,2022-05-31,Tuesday,17:11:36,18:39:00,5,4.2,3,2
5025078,Kevin Lane,1972-03-12,50,lakewood,cough,4,2022-05-31,Tuesday,17:35:02,18:28:02,3,4.5,3,1
5025077,Stephanie Gonzalez,1970-04-20,52,lakewood,sore-throat,4,2022-05-31,Tuesday,18:13:17,18:49:17,3,4.5,3,1
5025088,Wanda French,1972-03-11,50,lakewood,ear-pain,4,2022-05-31,Tuesday,18:13:43,19:00:19,4,4.3,3,2


---
---

## 1E: Patient Records Partitioning (Past vs. New)

**Objective:** Separate past patient records from new "real-time" records based on dates from fabricated patient data.

---

**Methodology / Approach:** 

The new-patient dataset (all visits in May 2022) will consist of the same attributes as the past patient records, matching the information that would be taken at registration when a patient enters the clinic.

In [30]:
# Records from May 2022 are treated as "new"; everything else is "past"
is_new = (patients_df['visit_date'].dt.year == 2022) & (patients_df['visit_date'].dt.month == 5)
past_patients_df = patients_df[~is_new]
new_patients_df = patients_df[is_new]
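
Because the fabricated records end on 2022-05-31, the same partition can also be expressed as a single date cutoff. The sketch below shows this on a toy frame; `toy_df`, its made-up rows, and the `cutoff` variable are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for patients_df (values are made up)
toy_df = pd.DataFrame({
    'visit_date': pd.to_datetime(['2021-05-01', '2022-04-30', '2022-05-01', '2022-05-31']),
    'visit_location': ['denver', 'lakewood', 'denver', 'lakewood'],
})

cutoff = pd.Timestamp('2022-05-01')  # first day treated as "new"
past_toy = toy_df[toy_df['visit_date'] < cutoff]
new_toy = toy_df[toy_df['visit_date'] >= cutoff]

# The two partitions should be disjoint and cover every record
assert len(past_toy) + len(new_toy) == len(toy_df)
```

A cutoff would start to differ from the year-and-month filter only if records dated after May 2022 were added later.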

In [31]:
past_patients_df

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code,assigned_num_techs,needed_num_techs
1000009,Zachary Duncan,1943-05-04,79,denver,cough,4,2021-05-01,Saturday,08:26:14,09:07:14,1,0.0,4,1
1000017,Adriana Johnston,2002-03-17,20,denver,sore-throat,4,2021-05-01,Saturday,08:35:53,09:13:53,2,4.0,4,1
1000013,Christine Ryan,1945-08-22,76,denver,headache,5,2021-05-01,Saturday,09:33:24,10:24:24,1,0.0,4,1
1000021,Brittany Mendoza,1956-12-11,65,denver,stomach-pain,5,2021-05-01,Saturday,09:40:57,11:10:57,2,5.0,4,1
1000016,Luke Williams,1998-12-15,23,denver,covid-test,4,2021-05-01,Saturday,10:08:45,10:35:45,3,5.0,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5023128,Jeffrey White,1964-10-16,57,lakewood,ear-pain,4,2022-04-30,Saturday,19:20:41,20:05:41,3,4.0,4,1
5023142,Brenda Miller,2020-11-17,1,lakewood,lab-work,3,2022-04-30,Saturday,19:21:58,20:09:58,4,4.0,4,2
5023140,Jennifer Romero,2001-02-08,21,lakewood,covid-test,4,2022-04-30,Saturday,19:23:58,19:48:34,5,3.8,4,2
5023124,Ian Ortiz,1968-02-23,54,lakewood,ear-pain,4,2022-04-30,Saturday,19:33:44,20:17:20,6,3.8,4,2


In [32]:
new_patients_df

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code,assigned_num_techs,needed_num_techs
1026554,Teresa Waller,1943-12-12,78,denver,physical,3,2022-05-01,Sunday,08:56:28,09:16:28,1,0.0,5,1
1026552,Jeremy Simpson,1986-01-22,36,denver,UTI,4,2022-05-01,Sunday,09:13:38,09:56:38,2,3.0,5,1
1026548,Andrew Anderson,1979-04-16,43,denver,physical,3,2022-05-01,Sunday,10:03:01,10:24:01,1,0.0,5,1
1026568,Debra Bonilla,2005-07-14,16,denver,chest-pain,5,2022-05-01,Sunday,10:21:06,11:46:06,2,3.0,5,1
1026539,Jeffrey Wright,1976-11-16,45,denver,cold/flu/fever,4,2022-05-01,Sunday,10:27:48,11:34:48,2,5.0,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5025098,Matthew Reynolds,2001-09-11,20,lakewood,stomach-pain,5,2022-05-31,Tuesday,17:11:36,18:39:00,5,4.2,3,2
5025078,Kevin Lane,1972-03-12,50,lakewood,cough,4,2022-05-31,Tuesday,17:35:02,18:28:02,3,4.5,3,1
5025077,Stephanie Gonzalez,1970-04-20,52,lakewood,sore-throat,4,2022-05-31,Tuesday,18:13:17,18:49:17,3,4.5,3,1
5025088,Wanda French,1972-03-11,50,lakewood,ear-pain,4,2022-05-31,Tuesday,18:13:43,19:00:19,4,4.3,3,2


---
---

## 1F: Output

**Objective:** Save constructed datasets into *CSV* files for use in the next stages of the project pipeline.

---

In [33]:
clinics_df.to_csv('./uc_clinics.csv')

clinics_df

branch_name,lat,lon,to_denver,to_edgewater,to_wheatridge,to_rino,to_lakewood,nearby_clinics
denver,39.739064,-104.989697,0,12,14,7,14,"[(rino, 7.0), (edgewater, 12.0), (wheatridge, ..."
edgewater,39.753954,-105.067788,14,0,5,12,9,"[(wheatridge, 5.0), (lakewood, 9.0), (rino, 12..."
wheatridge,39.766857,-105.081983,14,5,0,10,9,"[(edgewater, 5.0), (lakewood, 9.0), (rino, 10...."
rino,39.767328,-104.981132,6,14,14,0,14,"[(denver, 6.0), (edgewater, 14.0), (wheatridge..."
lakewood,39.704552,-105.079883,12,8,8,12,0,"[(edgewater, 8.0), (wheatridge, 8.0), (denver,..."


In [34]:
employees_df.to_csv('./uc_employees.csv')

employees_df.sample(5)

e_id,e_name,e_role
48,Amy Anderson,Technician
31,Alexander Rodriguez,Technician
33,Ashley Chen,Provider
59,Jeffrey Cantu,Provider
22,Susan Taylor,Technician


In [35]:
past_patients_df.to_csv('./uc_past_patients.csv')
past_patients_df

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code,assigned_num_techs,needed_num_techs
1000009,Zachary Duncan,1943-05-04,79,denver,cough,4,2021-05-01,Saturday,08:26:14,09:07:14,1,0.0,4,1
1000017,Adriana Johnston,2002-03-17,20,denver,sore-throat,4,2021-05-01,Saturday,08:35:53,09:13:53,2,4.0,4,1
1000013,Christine Ryan,1945-08-22,76,denver,headache,5,2021-05-01,Saturday,09:33:24,10:24:24,1,0.0,4,1
1000021,Brittany Mendoza,1956-12-11,65,denver,stomach-pain,5,2021-05-01,Saturday,09:40:57,11:10:57,2,5.0,4,1
1000016,Luke Williams,1998-12-15,23,denver,covid-test,4,2021-05-01,Saturday,10:08:45,10:35:45,3,5.0,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5023128,Jeffrey White,1964-10-16,57,lakewood,ear-pain,4,2022-04-30,Saturday,19:20:41,20:05:41,3,4.0,4,1
5023142,Brenda Miller,2020-11-17,1,lakewood,lab-work,3,2022-04-30,Saturday,19:21:58,20:09:58,4,4.0,4,2
5023140,Jennifer Romero,2001-02-08,21,lakewood,covid-test,4,2022-04-30,Saturday,19:23:58,19:48:34,5,3.8,4,2
5023124,Ian Ortiz,1968-02-23,54,lakewood,ear-pain,4,2022-04-30,Saturday,19:33:44,20:17:20,6,3.8,4,2


In [36]:
new_patients_df.to_csv('./uc_new_patients.csv')
new_patients_df

pt_id,pt_name,pt_dob,pt_age,visit_location,visit_reason,visit_code,visit_date,visit_day,checkin_time,checkout_time,rolling_ct,rolling_code,assigned_num_techs,needed_num_techs
1026554,Teresa Waller,1943-12-12,78,denver,physical,3,2022-05-01,Sunday,08:56:28,09:16:28,1,0.0,5,1
1026552,Jeremy Simpson,1986-01-22,36,denver,UTI,4,2022-05-01,Sunday,09:13:38,09:56:38,2,3.0,5,1
1026548,Andrew Anderson,1979-04-16,43,denver,physical,3,2022-05-01,Sunday,10:03:01,10:24:01,1,0.0,5,1
1026568,Debra Bonilla,2005-07-14,16,denver,chest-pain,5,2022-05-01,Sunday,10:21:06,11:46:06,2,3.0,5,1
1026539,Jeffrey Wright,1976-11-16,45,denver,cold/flu/fever,4,2022-05-01,Sunday,10:27:48,11:34:48,2,5.0,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5025098,Matthew Reynolds,2001-09-11,20,lakewood,stomach-pain,5,2022-05-31,Tuesday,17:11:36,18:39:00,5,4.2,3,2
5025078,Kevin Lane,1972-03-12,50,lakewood,cough,4,2022-05-31,Tuesday,17:35:02,18:28:02,3,4.5,3,1
5025077,Stephanie Gonzalez,1970-04-20,52,lakewood,sore-throat,4,2022-05-31,Tuesday,18:13:17,18:49:17,3,4.5,3,1
5025088,Wanda French,1972-03-11,50,lakewood,ear-pain,4,2022-05-31,Tuesday,18:13:43,19:00:19,4,4.3,3,2
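
CSV stores every value as text, so the index and date columns need to be restored when these files are read back in later stages. The following is a minimal sketch, assuming the downstream notebooks load the files with `pd.read_csv`; the in-memory buffer and the two toy rows are placeholders standing in for the real files.

```python
import io
import pandas as pd

# Toy frame standing in for one of the saved patient datasets
toy = pd.DataFrame({
    'pt_id': [1000009, 1000017],
    'visit_date': pd.to_datetime(['2021-05-01', '2021-05-01']),
    'checkin_time': ['08:26:14', '08:35:53'],
}).set_index('pt_id')

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore the index and parse the date column on load
restored = pd.read_csv(buffer, index_col='pt_id', parse_dates=['visit_date'])
```

Time-of-day columns such as `checkin_time` come back as plain strings and would need their own conversion if time arithmetic is required downstream.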
