# Case Study - Patient Arrivals in Singapore’s Major Public Hospitals

## Learning Objectives:
1. Explain the basic attributes of DataFrame/Series
2. Manipulate data through indexers
3. Filter data through Boolean indexing  

<i><b>Background</b></i>: Understanding demand is a key issue in business operations. In healthcare management, patient arrivals are a key driver of the efficiency of hospital and clinic operations. Without enough healthcare professionals to serve patients, waiting times grow long and patients’ lives may be put at risk. Hiring more healthcare professionals shortens waiting times and raises patient satisfaction, but the corresponding labor cost becomes a heavy burden on operations. From a managerial point of view, it is important to balance operating cost against patient satisfaction. The first step is to characterize the pattern of patient arrivals as accurately as possible.

The file `EDdata.csv` contains records of Singaporeans’ arrivals at the emergency departments (EDs) of several major public hospitals in October 2011 and April 2012. Those hospitals are Tan Tock Seng Hospital, Singapore General Hospital, National University Hospital, Changi General Hospital, Alexandra Hospital, Khoo Teck Puat Hospital, and KK Women's and Children's Hospital. The data were retrieved from each hospital’s data warehouse system and are a random sample of all patients who arrived at those EDs during the study period. Please import `EDdata.csv` first and check the data.


In [None]:
import pandas as pd

df = pd.read_csv("EDdata.csv")  
df.head(10)

In [None]:
# You can use a column of the data set as the row index labels.
df = pd.read_csv("EDdata.csv", index_col = "Case")  
df.head(10)

In [None]:
# Select the first row by its index label (its Case number)
df.loc[92408]

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.reset_index(inplace = True)

In [None]:
df.iloc[0] # After resetting the index, select the first row by its integer position
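
The difference between label-based and position-based indexing can be sketched on a small table. The case numbers and values below are made up for illustration and are not taken from `EDdata.csv`.

```python
import pandas as pd

# Hypothetical toy data; the case numbers are invented for illustration.
toy = pd.DataFrame(
    {"Case": [501, 502, 503], "Gender": ["M", "F", "M"]}
).set_index("Case")

by_label = toy.loc[501]      # label-based: the row whose index label is 501
by_position = toy.iloc[0]    # position-based: the first row, whatever its label

toy_reset = toy.reset_index()            # "Case" becomes an ordinary column again
first_case = toy_reset.loc[0, "Case"]    # after the reset, label 0 is also position 0
print(by_label, first_case, sep="\n")
```

Here `.loc[501]` and `.iloc[0]` happen to return the same row because case 501 is stored first. Once the index is reset, the labels are simply 0, 1, 2, and so on.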

## Task 1-1
<i><b>Do male Singaporeans have preferences over different hospitals to attend in case of an emergency? </b></i>

Please remember to exclude visits to KKH from the data set. KKH is a women's and children's hospital, so a male patient will generally not be sent there in an emergency.

- Male Singaporeans

In [None]:
# filter out females and patients visiting KKH.
fil_1 = (df["Gender"] == "M") & (df["Hospital_Name"] != "KKH")
df_male = df.loc[fil_1].copy()   # subset data to meet the filtering condition

# find out the unique hospitals in the Hospital_Name column
hosp_name = df_male["Hospital_Name"].unique()   

out_table = {}  # Create an empty dictionary 
for hos in hosp_name:
    out_table.update({str(hos): list()})

print(out_table)

In [None]:
for hos in hosp_name:
    filter_hos = (df_male["Hospital_Name"] == hos)
    out_table[hos] = filter_hos.sum()

result = pd.Series(out_table)
result/result.sum()

- Female Singaporeans

In [None]:
fil_2 = (df["Gender"] == "F") & (df["Hospital_Name"] != "KKH") 
df_female = df.loc[fil_2,:].copy()   # subset data to meet the filtering condition

out_table2 = {}                                  # Create an empty dictionary 
for hos in hosp_name:
    out_table2.update({str(hos): list()})
    
for hos in hosp_name:
    filter_hos = (df_female["Hospital_Name"] == hos)
    out_table2[hos] = filter_hos.sum()   # number of female visits to this hospital
    
result2 = pd.Series(out_table2)
result2/result2.sum()
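
As a possible alternative to the loops above, pandas can compute the same proportions directly. The sketch below runs on made-up records. Only the column names and the `KKH` label follow the case study; the other hospital abbreviations are placeholders.

```python
import pandas as pd

# Hypothetical toy records; hospital labels other than "KKH" are placeholders.
toy = pd.DataFrame({
    "Gender": ["M", "M", "M", "F", "F", "F", "F", "M"],
    "Hospital_Name": ["TTSH", "SGH", "TTSH", "KKH", "SGH", "SGH", "NUH", "NUH"],
})

no_kkh = toy.loc[toy["Hospital_Name"] != "KKH"]   # drop visits to KKH

# Share of male visits per hospital (the proportions sum to 1)
is_male = no_kkh["Gender"] == "M"
male_share = no_kkh.loc[is_male, "Hospital_Name"].value_counts(normalize=True)

# Both genders at once: each row is normalised to sum to 1
share_by_gender = pd.crosstab(
    no_kkh["Gender"], no_kkh["Hospital_Name"], normalize="index"
)
print(share_by_gender)
```

Putting both genders in one table makes it easier to compare their preferences side by side.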

## Task 1-2

<i><b>Are the patients’ waiting time distributions similar across different public hospitals? </b></i>
- The waiting time is the gap between registration and triage. Two scenarios are possible:
    1. (Case 1) The triage time is later than the registration time on the same day (the normal case).
    2. (Case 2) The triage takes place after midnight, on the day after registration. Because the "sec" columns always count seconds from `00:00:00` of their own day, the triage value is then smaller than the registration value.
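
The two cases can be checked by hand on a pair of made-up records. This is a minimal sketch with invented values; only the column names `reg_sec` and `triage_sec` follow the data set.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical toy values, in seconds since 00:00:00
toy = pd.DataFrame({
    "reg_sec":    [36000, 85800],   # 10:00:00 and 23:50:00
    "triage_sec": [36900,   600],   # 10:15:00 and 00:10:00 (next day)
})

same_day = toy["triage_sec"] > toy["reg_sec"]       # Case 1
wait_sec = toy["triage_sec"] - toy["reg_sec"]
# Case 2: the raw difference is negative, so add one full day
wait_sec = wait_sec.where(same_day, wait_sec + SECONDS_PER_DAY)

toy["Wait_min"] = wait_sec / 60.0
print(toy)
```

The first patient waits 15 minutes. The second registers at 23:50 and is triaged at 00:10, so the wait is 20 minutes once the day boundary is accounted for.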

In [None]:
df_reg = df["reg_sec"]
df_tri = df["triage_sec"]

# Handle Case 2 first: triage happens after midnight, so add a full day (in seconds)
df_wait = (24*60*60) - df_reg + df_tri

# Handle Case 1: overwrite the rows where triage is on the same day as registration
filt_1 = df_tri > df_reg
df_wait[filt_1] = df_tri[filt_1] - df_reg[filt_1]

df["Wait_time"] = df_wait        # waiting time in seconds
df["Wait_min"] = df_wait/60.0    # waiting time in minutes

df.head()

In [None]:
import numpy as np
out_dic = {}                               # Create a dictionary to store the computation results
hosp_name = df["Hospital_Name"].unique()   # Find out the unique hospitals in the data set
for hos in hosp_name:
    out_dic.update({str(hos): np.zeros(5)}) # We just want to find out 5 summary statistics. Thus, create a 5-element array

print(out_dic)

In [None]:
for hos in hosp_name:
    hos_filter = (df["Hospital_Name"] == hos)  # Create a hospital-specific filter
    df_hos = df.loc[hos_filter, "Wait_min"]    # Subset the data to include the target hospital only
    out_dic[hos][0] = df_hos.mean()
    out_dic[hos][1] = df_hos.median()
    out_dic[hos][2:] = df_hos.quantile([.25, .75, .99]) # 25th, 75th and 99th percentiles
    
resultQ2 = pd.DataFrame(out_dic, index = ["mean", "median", "Q1", "Q3", "99%"])
resultQ2

<i><b>Do you notice any anomaly in the table generated? </b></i>

- Please filter the records with a waiting time longer than 300 minutes.

In [None]:
filter_check = df["Wait_min"] > 300
df_check = df.loc[filter_check, ["REGIS_TIME", "Triage Time", "reg_sec", "triage_sec", "Wait_time"]]
df_check.head()

In practice, anomalous data are common. Anomalous values usually stem from one of two causes:
1. The logic used to compute the values is incorrect. (Logical error!)
2. The data records themselves are incorrect. (Data entry error!)

## Task 1-3

To make a staffing plan, which sets the number of nurses and doctors serving patients, a deep understanding of patient arrivals is crucial. In practice, the staffing plan is made on an hourly basis (24 intervals) for every day. Thus, please create a new column, `REGIS_HOUR`, in `df`. Moreover, the patients’ arrival pattern may vary by day of the month, so please also create a new column, `REGIS_DAY`, in `df`.

In [None]:
np.zeros(3, dtype = int)

In [None]:
n_rows = df.shape[0]                  # df.shape is (number of rows, number of columns)
day = np.zeros(n_rows, dtype = int)
hour = np.zeros(n_rows, dtype = int)
year = np.zeros(n_rows, dtype = int)

for i in range(n_rows):
    dd, *_, yy = df.loc[i, 'REGIS_DATE'].split("/")   # first field is the day, last is the year
    hh, *_ = df.loc[i, 'REGIS_TIME'].split(":")       # first field is the hour
    day[i] = int(dd)
    hour[i] = int(hh)
    year[i] = int(yy)

df["REGIS_DAY"] = day
df["REGIS_HOUR"] = hour
df["REGIS_YEAR"] = year
df.head(10)
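
The row-by-row loop can also be written with vectorized string methods. The sketch below uses made-up rows and assumes dates are laid out as `dd/mm/yyyy` and times as `hh:mm:ss`, which is what the split logic above implies.

```python
import pandas as pd

# Hypothetical toy rows; the dd/mm/yyyy and hh:mm:ss layouts are assumed
toy = pd.DataFrame({
    "REGIS_DATE": ["01/10/2011", "15/04/2012"],
    "REGIS_TIME": ["23:50:00", "07:05:30"],
})

date_parts = toy["REGIS_DATE"].str.split("/", expand=True)   # columns 0, 1, 2
toy["REGIS_DAY"] = date_parts[0].astype(int)
toy["REGIS_YEAR"] = date_parts[2].astype(int)
toy["REGIS_HOUR"] = toy["REGIS_TIME"].str.split(":").str[0].astype(int)
print(toy)
```

On a large table this is usually faster than an explicit Python loop, and it does not depend on the row index running from 0 to n-1.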

## Task 1-4

Find out the average number of patient arrivals in each hour of the day. To answer this question, we assume the arrival pattern is similar across days and use only the 2011 data.

In [None]:
filt_year = (df['REGIS_YEAR'] == 2011)
df_2011 = df.loc[filt_year].copy()
df_2011.head(10)

In [None]:
df_2011.info() # show the numbers of rows and columns and all columns' data types at the same time  

In [None]:
filt = (df_2011.REGIS_DAY == 2) & (df_2011.REGIS_HOUR == 1)
filt.sum()

In [None]:
table_31by24 = np.zeros((31,24), dtype = float)

for i in range(31):
    for j in range(24):
        filt = (df_2011.REGIS_DAY == (i+1)) & (df_2011.REGIS_HOUR == j)
        table_31by24[i,j] = filt.sum()

#pd.set_option("display.max_columns", 24)      
pd.DataFrame(table_31by24)

In [None]:
df_table_31by24 = pd.DataFrame(table_31by24)
table_24 = df_table_31by24.mean()

print(table_24)
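
The day-by-hour count table can also be built without nested loops. The following sketch uses `pd.crosstab` on made-up arrivals over a 3-day window rather than the 31 days of the case study.

```python
import pandas as pd

# Hypothetical toy arrivals over a 3-day window
toy = pd.DataFrame({
    "REGIS_DAY":  [1, 1, 2, 3, 3, 3],
    "REGIS_HOUR": [0, 0, 0, 0, 5, 5],
})

counts = pd.crosstab(toy["REGIS_DAY"], toy["REGIS_HOUR"])
# Add back the days and hours with no arrivals so they count as zeros in the mean
counts = counts.reindex(index=range(1, 4), columns=range(24), fill_value=0)

hourly_mean = counts.mean()   # column-wise mean = average arrivals per hour
print(hourly_mean.head(6))
```

The `reindex` step matters: `crosstab` only lists combinations that occur in the data, so skipping it would silently drop empty hours and inflate the averages.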

## Task 1-5

The assumption that the arrival pattern is similar across different days is too strong to hold in practice. Let's examine the day-of-week effect (including Saturday and Sunday) on the arrival pattern of patients. Please create a `WEEKDAY` column in `df_2011`. For example, if a patient's arrival occurred on 01/10/2011, the corresponding value in the `WEEKDAY` column is Saturday.

In [None]:
df_2011.head(5)

In [None]:
# 1 October 2011 was a Saturday, so day 1 must map to code 6 (0 = Sunday, ..., 6 = Saturday)
weekday_check = (df_2011["REGIS_DAY"] + 5) % 7
weekday_check

In [None]:
df_2011['weekday_check'] = weekday_check
df_2011.head(5)

In [None]:
weekday={0: 'Sunday',
         1: 'Monday',
         2: 'Tuesday',
         3: 'Wednesday',
         4: 'Thursday',
         5: 'Friday',
         6: 'Saturday'}

df_2011['WEEKDAY']=df_2011.weekday_check.map(weekday)
df_2011.tail(20)
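
The modular arithmetic above works because all 2011 records fall in a single month. A more general route, sketched here on made-up dates, is to parse the date strings and let pandas name the weekday. The `dd/mm/yyyy` layout is an assumption.

```python
import pandas as pd

# Hypothetical toy dates; the dd/mm/yyyy layout is assumed
toy = pd.DataFrame({"REGIS_DATE": ["01/10/2011", "02/10/2011", "05/10/2011"]})

parsed = pd.to_datetime(toy["REGIS_DATE"], format="%d/%m/%Y")
toy["WEEKDAY"] = parsed.dt.day_name()
print(toy)
```

This version needs no month-specific offset, so it would carry over to records from other months unchanged.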

## Task 1-6
    
With the `WEEKDAY` column, please find out the average number of patient arrivals in each hour by weekday categories. Your answer should be a 7-by-24 table. 

In [None]:
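# Sunday-first order, so that row 0 is Sunday and row 6 is Saturday when dividing by the day counts
day_list = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']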
table_7by24 = np.zeros((7,24), dtype = float)

for i in range(7):
    for j in range(24):
        filt = (df_2011.WEEKDAY == day_list[i]) & (df_2011.REGIS_HOUR == j)
        table_7by24[i,j] = filt.sum()

In [None]:
table_7by24 = pd.DataFrame(table_7by24)
table_7by24

In [None]:
table_7by24.iloc[0,:] = table_7by24.iloc[0,:]/5  # 5 Sundays in October 2011  
table_7by24.iloc[1,:] = table_7by24.iloc[1,:]/5  # 5 Mondays in October 2011
table_7by24.iloc[2,:] = table_7by24.iloc[2,:]/4 
table_7by24.iloc[3,:] = table_7by24.iloc[3,:]/4
table_7by24.iloc[4,:] = table_7by24.iloc[4,:]/4
table_7by24.iloc[5,:] = table_7by24.iloc[5,:]/4
table_7by24.iloc[6,:] = table_7by24.iloc[6,:]/5

In [None]:
table_7by24
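
The number of times each weekday occurs in October 2011 can be computed instead of typed in by hand. This is a sketch under the assumption that the table rows are labelled with weekday names; the counts in `toy_counts` are made up.

```python
import pandas as pd

# How many times does each weekday occur in October 2011?
october_2011 = pd.date_range("2011-10-01", "2011-10-31", freq="D")
days_per_weekday = pd.Series(october_2011.day_name()).value_counts()
print(days_per_weekday)

# Hypothetical arrival counts for two weekdays and two hours (columns 0 and 1)
toy_counts = pd.DataFrame({0: [10, 8], 1: [5, 4]}, index=["Saturday", "Tuesday"])

# Divide each row by the number of such weekdays, matched by name
divisors = days_per_weekday.reindex(toy_counts.index)
toy_avg = toy_counts.div(divisors, axis=0)
print(toy_avg)
```

Matching rows by weekday name rather than by position removes the risk of dividing a row by the count that belongs to a different day.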