# Time-Series Integration
## Summary
In this Jupyter Notebook,
- The time-series tables are imported from CSV files and converted to dataframes
- The time-series data is further pre-processed (beyond the feature selection and filtering performed in other notebooks) to account for (1) time alignment, (2) handling missing values, (3) handling outliers, and (4) creating fixed windows, among other small pre-processing steps
- The relevant tables are joined together to create a single representation of the time-series data

## Integration of Tables
The representation of the data for the ANN will be in a sequence format with dimensions of `(# samples, timesteps, # features)`.

## Relevant DataFrames
- `hosp_labevents`
- `hosp_microbiologyevents`
- `icu_chartevents`
- `icu_outputevents`,
- `icu_procedureevents`

### Read in all tables

In [2]:
import pandas as pd
import csv
import numpy as np
import seaborn as sns
import os
import psycopg2
from psycopg2 import OperationalError, DatabaseError, sql
import fuzzywuzzy
from fuzzywuzzy import process

def read_csv(csv_file_path):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(csv_file_path)
    print(csv_file_path)
    print('Shape:', df.shape)
    print(df.head())
    return df

In [3]:
### Environment Variables for Connection ###
DB_NAME = 'smcdougall'
USERNAME = 'postgres'
PASSWORD = 'postgres'
HOST = 'localhost'
PORT = 5432 

def connect_to_postgres(db_name, username, password, host, port):
    connection = None
    try:
        connection = psycopg2.connect(
            dbname=db_name,
            user=username,
            password=password,
            host=host,
            port=port
        )
        print('Connected to db:', db_name)
        return connection
    except OperationalError as e:
        print('Received the following error:', e)
        return None

def verify_postgres_connection(connection):
    if connection is not None:
        try:
            cur = connection.cursor()
            cur.execute('SELECT version();')
            db_version = cur.fetchone()
            print('The Postgres database version is:', db_version)
            cur.close()
        except DatabaseError as e:
            print('Received the following error:', e)
    else:
        print('Connection to Postgres failed.')

def close_connection(connection):
    if connection is not None:
        connection.close()
        print('Postgres connection has been closed.')

connection = connect_to_postgres(DB_NAME, USERNAME, PASSWORD, HOST, PORT)
verify_postgres_connection(connection)
close_connection(connection)

Connected to db: smcdougall
The Postgres database version is: ('PostgreSQL 14.5 on aarch64-apple-darwin20.6.0, compiled by Apple clang version 12.0.5 (clang-1205.0.22.9), 64-bit',)
Postgres connection has been closed.


In [6]:
labevents_df = read_csv('dataframes/hosp_labevents.csv')

dataframes/hosp_labevents.csv
Shape: (3533581, 16)
   labevent_id  subject_id     hadm_id  specimen_id  itemid order_provider_id  \
0         2437    10000719         NaN     70783909   51221            P30FVI   
1         2444    10000719         NaN     70783909   51256            P30FVI   
2         2449    10000719  24558333.0      9035511   51221               NaN   
3         2458    10000719  24558333.0     93908058   51221               NaN   
4         2464    10000719         NaN     99456512   51221            P484YY   

             charttime            storetime value  valuenum valueuom  \
0  2139-09-14 15:10:00  2139-09-14 20:08:00  32.1      32.1        %   
1  2139-09-14 15:10:00  2139-09-14 20:08:00  75.6      75.6        %   
2  2140-04-15 00:22:00  2140-04-15 01:01:00  31.4      31.4        %   
3  2140-04-16 06:40:00  2140-04-16 07:54:00  32.6      32.6        %   
4  2140-11-14 17:08:00  2140-11-14 20:01:00  35.4      35.4        %   

   ref_range_lower  ref_range

In [4]:
microbiology_df = read_csv('dataframes/hosp_microbiologyevents.csv')

dataframes/hosp_microbiologyevents.csv
Shape: (47058, 6)
   microevent_id  subject_id  hadm_id  micro_specimen_id   chartdate  \
0             95    10000719      NaN            4691510  2140-03-28   
1            420    10001319      NaN            2654897  2135-06-10   
2            614    10001884      NaN            2893463  2125-12-01   
3            871    10002266      NaN            3980855  2124-08-28   
4            927    10002428      NaN            1704201  2156-05-11   

                                 org_name  
0  POSITIVE FOR GROUP B BETA STREPTOCOCCI  
1                  GRAM POSITIVE BACTERIA  
2  POSITIVE FOR INFLUENZA A VIRAL ANTIGEN  
3                  GRAM POSITIVE BACTERIA  
4                   LACTOBACILLUS SPECIES  


Note that for many of the microbiology rows, an `hadm_id` is not available. The documentation suggests that the nearest `hadm_id` should be used, meaning we should look at the `chartdate` value and join it with the nearest `hadm_id`

In [5]:
chartevents_df = read_csv('dataframes/icu_chartevents.csv')

dataframes/icu_chartevents.csv
Shape: (994991, 8)
   subject_id   hadm_id   stay_id            charttime  itemid  value  \
0    10001884  26184834  37510196  2131-01-18 19:00:00  220210   17.0   
1    10001884  26184834  37510196  2131-01-18 20:00:00  220210   16.0   
2    10001884  26184834  37510196  2131-01-18 21:00:00  220210   15.0   
3    10001884  26184834  37510196  2131-01-18 22:00:00  220210   13.0   
4    10001884  26184834  37510196  2131-01-18 23:00:00  220210   12.0   

   valuenum  valueuom  
0      17.0  insp/min  
1      16.0  insp/min  
2      15.0  insp/min  
3      13.0  insp/min  
4      12.0  insp/min  


In [6]:
outputevents_df = read_csv('dataframes/icu_outputevents.csv')

dataframes/icu_outputevents.csv
Shape: (338, 7)
   subject_id   hadm_id   stay_id            charttime  itemid  value valueuom
0    10553084  28481755  35401987  2199-10-17 20:06:00  226590  400.0       ml
1    10553084  28481755  35401987  2199-10-17 22:00:00  226590  450.0       ml
2    10553084  28481755  35401987  2199-10-18 04:00:00  226590   50.0       ml
3    10553084  28481755  35401987  2199-10-18 06:00:00  226590  100.0       ml
4    10553084  28481755  35401987  2199-10-18 12:00:00  226590  100.0       ml


In [7]:
procedure_df = read_csv('dataframes/icu_procedureevents.csv')

dataframes/icu_procedureevents.csv
Shape: (37436, 16)
   subject_id   hadm_id   stay_id            starttime              endtime  \
0    10001884  26184834  37510196  2131-01-12 21:30:00  2131-01-13 04:00:00   
1    10001884  26184834  37510196  2131-01-12 17:40:00  2131-01-12 17:41:00   
2    10001884  26184834  37510196  2131-01-19 18:44:00  2131-01-19 18:45:00   
3    10001884  26184834  37510196  2131-01-13 16:14:00  2131-01-13 16:15:00   
4    10001884  26184834  37510196  2131-01-13 16:14:00  2131-01-13 16:15:00   

   itemid  value valueuom locationcategory  orderid  linkorderid  \
0  225794  390.0      min          Unknown  4809276      4809276   
1  227194    1.0      NaN          Unknown  6470885      6470885   
2  228128    1.0      NaN          Unknown  9459863      9459863   
3  225401    1.0      NaN          Unknown  4595950      4595950   
4  225454    1.0      NaN          Unknown  5410081      5410081   

       ordercategoryname ordercategorydescription  patientweig

### Unit Standardization

Steps:
- for all the tables with `value` and `valueuom` fields, do a check that for the given item id, only one unit of measurement is being used. If not, map the item id to its description to determine the valid/common units of measurement and determine next steps
- Observe range of patient weight - docs say it is measured in kilograms

In [8]:
def load_lab_items_table(connection):
    cur = connection.cursor()
    cur.execute("""
        SELECT itemid, label
        FROM mimiciv_hosp.d_labitems
    """)
    rows = cur.fetchall()
    cur.close()
    df = pd.DataFrame(rows, columns=["itemid", "label"])
    return df

connection = connect_to_postgres(DB_NAME, USERNAME, PASSWORD, HOST, PORT)
lab_item_df = load_lab_items_table(connection)
lab_item_df.head()

Connected to db: smcdougall


Unnamed: 0,itemid,label
0,50801,Alveolar-arterial Gradient
1,50802,Base Excess
2,50803,"Calculated Bicarbonate, Whole Blood"
3,50804,Calculated Total CO2
4,50805,Carboxyhemoglobin


In [9]:
def count_unique_units(df):
    # get unique units of measurement for each item id in the dataframe
    non_null_df = df.dropna(subset=['valueuom'])
    result = non_null_df.groupby('itemid')['valueuom'].nunique().reset_index()
    result.columns = ['itemid', 'unique_valueuom_count']
    # return results with more than 1 unit of measurement recorded
    filtered_result = result[result['unique_valueuom_count'] > 1]
    return filtered_result

In [10]:
labevents_uom = count_unique_units(labevents_df)
print(labevents_uom)

    itemid  unique_valueuom_count
42   51464                      2


In [11]:
labevents_df[labevents_df['itemid'] == 51464]['valueuom'].value_counts()

valueuom
mg/dL    82878
EU/dL     1670
Name: count, dtype: int64

In [12]:
lab_item_df[lab_item_df['itemid'] == 51464]

Unnamed: 0,itemid,label
624,51464,Bilirubin


For Bilirubin, mg/dL (milligrams/decileter) is the common unit of measurement in US. EU/dL measures Endotoxin Units per milliliter -- these are not units of measurements that can be converted from one to the other.

In [13]:
chartevents_uom = count_unique_units(chartevents_df)
print(chartevents_uom)

Empty DataFrame
Columns: [itemid, unique_valueuom_count]
Index: []


In [14]:
outputevents_uom = count_unique_units(outputevents_df)
print(outputevents_uom)

Empty DataFrame
Columns: [itemid, unique_valueuom_count]
Index: []


In [15]:
procedure_uom = count_unique_units(procedure_df)
print(procedure_uom)

    itemid  unique_valueuom_count
0   224263                      3
1   224264                      2
2   224267                      2
4   224269                      2
5   224270                      2
8   224274                      3
9   224275                      3
10  224276                      3
11  224277                      3
12  224560                      2
14  225199                      2
15  225202                      3
17  225204                      3
18  225205                      2
19  225315                      2
20  225441                      2
21  225752                      3
22  225789                      2
23  225792                      3
24  225794                      3
25  225802                      3
26  225805                      2
28  227551                      2
29  227719                      2
33  228286                      2
34  229351                      3
40  229519                      2
44  229526                      2
46  229532    

In [16]:
def load_icu_lab_table(connection):
    cur = connection.cursor()
    cur.execute("""
        SELECT itemid, label
        FROM mimiciv_icu.d_items
    """)
    rows = cur.fetchall()
    cur.close()
    df = pd.DataFrame(rows, columns=["itemid", "label"])
    return df

connection = connect_to_postgres(DB_NAME, USERNAME, PASSWORD, HOST, PORT)
item_df = load_icu_lab_table(connection)
print(item_df.head())

Connected to db: smcdougall
   itemid                    label
0  220001             Problem List
1  220003       ICU Admission date
2  220045               Heart Rate
3  220046  Heart rate Alarm - High
4  220047   Heart Rate Alarm - Low


In [17]:
for procedure_id in procedure_uom['itemid'].tolist():
    label = item_df[item_df['itemid'] == procedure_id]['label']
    values = procedure_df[procedure_df['itemid'] == procedure_id]['valueuom'].unique().tolist()
    print(procedure_id)
    print(label)
    print(values)
    print()

224263
597    Multi Lumen
Name: label, dtype: object
['min', 'day', 'hour']

224264
598    PICC Line
Name: label, dtype: object
['min', 'day']

224267
599    Cordis/Introducer
Name: label, dtype: object
['min', 'day']

224269
601    CCO PAC
Name: label, dtype: object
['min', 'day']

224270
602    Dialysis Catheter
Name: label, dtype: object
['min', 'day']

224274
605    22 Gauge
Name: label, dtype: object
['min', 'day', 'hour']

224275
606    20 Gauge
Name: label, dtype: object
['min', 'day', 'hour']

224276
607    16 Gauge
Name: label, dtype: object
['min', 'day', 'hour']

224277
608    18 Gauge
Name: label, dtype: object
['min', 'day', 'hour']

224560
742    PA Catheter
Name: label, dtype: object
['min', 'day']

225199
1144    Triple Introducer
Name: label, dtype: object
['min', 'day']

225202
1145    Indwelling Port (PortaCath)
Name: label, dtype: object
['min', 'hour', 'day']

225204
1147    Midline
Name: label, dtype: object
['min', 'day', 'hour']

225205
1148    RIC
Name: label, 

We can see that all of the above are time measurements - min, hour, day. All of the items have "min" as a used value. Let's use "hour" as the standard, and convert "min" and "day" for the item ids that have length of time as the measurement. "Hour" is chosen as the common unit of measurement because the vital signs measurements will be converted to hourly time increments.

In [18]:
def transform_time_units_to_hour(df):
    # Make a copy of the DataFrame to avoid modifying the original
    df_transformed = df.copy()
    
    conversions = {
        'day': 24,    # 1 day = 24 hours
        'min': 1/60   # 1 minute = 1/60 hour
    }
    
    for uom, factor in conversions.items():
        mask = df_transformed['valueuom'] == uom
        df_transformed.loc[mask, 'value'] *= factor
        df_transformed.loc[mask, 'valueuom'] = 'hour'
    
    return df_transformed

procedure_df = transform_time_units_to_hour(procedure_df)
procedure_df['valueuom'].value_counts()

valueuom
hour    19675
Name: count, dtype: int64

In [19]:
procedure_df['patientweight'].describe()

count    37436.000000
mean        76.206259
std         27.832650
min          1.000000
25%         60.000000
50%         70.700000
75%         87.000000
max        648.800000
Name: patientweight, dtype: float64

In looking at patientweight, there are definitely outliers... but hard to determine whether they are actually in lb or kg. Make the assumption that they are all in kg, but we will need to account for outliers like the min of 1 kg (2lb) and 648 kg (1428 lb)

In [20]:
procedure_df[procedure_df['subject_id'] == 10199945]

Unnamed: 0,subject_id,hadm_id,stay_id,starttime,endtime,itemid,value,valueuom,locationcategory,orderid,linkorderid,ordercategoryname,ordercategorydescription,patientweight,continueinnextdept,statusdescription
573,10199945,26750128,39313825,2173-03-18 19:41:00,2173-03-19 03:31:00,225794,7.833333,hour,Unknown,2415853,2415853,Ventilation,ContinuousProcess,1.0,0,FinishedRunning
574,10199945,26750128,39313825,2173-03-18 22:03:00,2173-03-19 12:12:00,229351,14.15,hour,"Catheter, GU",5339997,5339997,Tubes,ContinuousProcess,1.0,0,FinishedRunning
575,10199945,26750128,39313825,2173-03-18 22:22:00,2173-03-19 04:15:00,224275,5.883333,hour,Peripheral,5673066,5673066,Peripheral Lines,ContinuousProcess,1.0,0,FinishedRunning
576,10199945,26750128,39313825,2173-03-18 22:23:00,2173-03-19 21:02:00,224263,22.65,hour,Unknown,2350702,2350702,Invasive Lines,ContinuousProcess,1.0,0,FinishedRunning


### Handling Outliers

Use inspiration from MIMIC-Extract:
- use `variable_ranges.csv` - list of clinically reasonable variable ranges provided in source code of Harutyunyan et. al
- "developed in conversation with clinical experts"
- if raw observed value is outside the threshold, treat as missing

In [21]:
variable_ranges_csv = pd.read_csv('variable_ranges.csv')
variable_ranges_csv.head()

Unnamed: 0,LEVEL2,LEVEL1,OUTLIER LOW,VALID LOW,IMPUTE,VALID HIGH,OUTLIER HIGH
0,Alanine aminotransferase,,0.0,2.0,34.0,10000.0,11000.0
1,Albumin,,0.0,0.6,3.1,6.0,60.0
2,Alkaline phosphate,,0.0,20.0,106.0,3625.0,4000.0
3,Anion Gap,,0.0,5.0,13.0,50.0,55.0
4,Asparate aminotransferase,,0.0,6.0,40.0,20000.0,22000.0


In [22]:
variable_ranges_csv[variable_ranges_csv['LEVEL2'] == 'Weight']

Unnamed: 0,LEVEL2,LEVEL1,OUTLIER LOW,VALID LOW,IMPUTE,VALID HIGH,OUTLIER HIGH
64,Weight,,0.0,0.0,81.8,250.0,250.0


Not sure what the units are - looks like the MIMIC Extract paper used "lbs" so assume 250 pounds is considered outlier high

Let's use a different criteria since outlier is 0 and this field instead uses 1... **Use Z-Scores**

In [23]:
def replace_weight_outliers(df, column):
    mean = df[column].mean()
    std_dev = df[column].std()
    
    df['z_score'] = (df[column] - mean) / std_dev
    # values of '1' (or around 1) are outliers if used for weight
    outliers = (df['z_score'].abs() > 3) | (df[column] < 20)
    
    df.loc[outliers, column] = np.nan
    df.drop(columns=['z_score'], inplace=True)
    return df

In [24]:
procedure_df['patientweight'].isna().sum()

0

In [25]:
procedure_df = replace_weight_outliers(procedure_df, 'patientweight')
print(procedure_df['patientweight'].isna().sum())
print(procedure_df['patientweight'].describe())

526
count    36910.000000
mean        74.846283
std         21.499190
min         23.000000
25%         60.000000
50%         70.400000
75%         86.000000
max        159.000000
Name: patientweight, dtype: float64


IQR left ~1400 empty records, which seemed to be very high - z-score is more flexible.

#### Handling Outliers for Vital Signs Data
The variable_ranges.csv file has variable ranges for different vital signs, which is preferred to be used over something like z-score or IQR as the criteria. The variables are listed out by name, so we need to map them back to their item ids.

In [26]:
import json

# Read JSON file
with open('mimiciv_vitalsigns_labels.json', 'r') as file:
    mimiciv_labels = json.load(file)

In [27]:
mimiciv_labels_list = list(mimiciv_labels.values())

def find_closest_label_match(label, choices):
    match, score = fuzzywuzzy.process.extractOne(label, choices)
    if score >= 95:
        return match

# Map LEVEL2 column to closest matches in mimiciv_labels
variable_ranges_csv['itemid_label'] = variable_ranges_csv['LEVEL2'].apply(lambda x: find_closest_label_match(x, mimiciv_labels_list))
# reverse mapping
label_to_id = {v: k for k, v in mimiciv_labels.items()}
variable_ranges_csv['itemid'] = variable_ranges_csv['itemid_label'].map(label_to_id)

print(variable_ranges_csv)

                       LEVEL2 LEVEL1  OUTLIER LOW  VALID LOW  IMPUTE  \
0    Alanine aminotransferase    NaN          0.0       2.00    34.0   
1                     Albumin    NaN          0.0       0.60     3.1   
2          Alkaline phosphate    NaN          0.0      20.00   106.0   
3                   Anion Gap    NaN          0.0       5.00    13.0   
4   Asparate aminotransferase    NaN          0.0       6.00    40.0   
..                        ...    ...          ...        ...     ...   
61                 Troponin-I    NaN          0.0       0.01     2.3   
62                 Troponin-T    NaN          0.0       0.01     0.1   
63               Urine output    NaN          0.0       0.00    80.0   
64                     Weight    NaN          0.0       0.00    81.8   
65     White blood cell count    NaN          0.0       0.00     9.9   

    VALID HIGH  OUTLIER HIGH               itemid_label itemid  
0     10000.00       11000.0   Alanine Aminotransferase  53084  
1    

In [28]:
variable_ranges_csv['itemid'].isna().sum()

27

In [29]:
def convert_to_int(value):
    try:
        # Check if the value is not NaN and is a valid number
        if pd.notna(value):
            return int(value)
        else:
            return np.nan
    except ValueError:
        return np.nan
variable_ranges_csv['itemid'] = variable_ranges_csv['itemid'].apply(convert_to_int)
procedure_df['itemid'] = procedure_df['itemid'].apply(convert_to_int)

In [30]:
def is_numeric(value):
    """Check if a value is numeric."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def convert_outliers(df, vitalsigns_df, itemid_col='itemid', low_col='OUTLIER LOW', high_col='OUTLIER HIGH'):
    # Merge original DataFrame with mapping DataFrame on the itemid column
    merged_df = pd.merge(df, vitalsigns_df, how='left', left_on=itemid_col, right_on=itemid_col)
    
    # convert to nan if outside of the acceptable range
    for itemid in merged_df[itemid_col].unique()[:1]:
        # get thresholds for the item
        thresholds = merged_df[merged_df[itemid_col] == itemid]
        if thresholds.empty:
            continue

        low_threshold = thresholds[low_col].values[0]
        high_threshold = thresholds[high_col].values[0]

        df['value'] = df.apply(
                lambda row: np.nan if is_numeric(row['value']) and row['itemid'] == itemid and (
                    float(row['value']) > high_threshold or
                    float(row['value']) < low_threshold
                ) else row['value'],
                axis=1
            )

#### procedureevents

In [31]:
procedure_df['value'].isna().sum()

0

In [32]:
convert_outliers(procedure_df, variable_ranges_csv, 'itemid')

In [33]:
procedure_df['value'].isna().sum()

0

None of the procedure ids are included in the variable ranges - makes sense since procedure data was not used in the paper where the ranges were pulled from.

#### chartevents

In [34]:
chartevents_df['value'].isna().sum()

0

In [35]:
chartevents_df['itemid'] = chartevents_df['itemid'].apply(convert_to_int)
convert_outliers(chartevents_df, variable_ranges_csv, 'itemid')

In [36]:
chartevents_df['value'].isna().sum()

1

#### outputevents

In [37]:
outputevents_df['value'].isna().sum()

0

In [38]:
convert_outliers(outputevents_df, variable_ranges_csv, 'itemid')

In [39]:
outputevents_df.head()

Unnamed: 0,subject_id,hadm_id,stay_id,charttime,itemid,value,valueuom
0,10553084,28481755,35401987,2199-10-17 20:06:00,226590,400.0,ml
1,10553084,28481755,35401987,2199-10-17 22:00:00,226590,450.0,ml
2,10553084,28481755,35401987,2199-10-18 04:00:00,226590,50.0,ml
3,10553084,28481755,35401987,2199-10-18 06:00:00,226590,100.0,ml
4,10553084,28481755,35401987,2199-10-18 12:00:00,226590,100.0,ml


#### labevents

In [40]:
labevents_df['itemid'] = labevents_df['itemid'].apply(convert_to_int)

In [41]:
labevents_df['value'].isna().sum()

113195

In [42]:
labevents_df.head()

Unnamed: 0,labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
0,2437,10000719,,70783909,51221,P30FVI,2139-09-14 15:10:00,2139-09-14 20:08:00,32.1,32.1,%,36.0,48.0,abnormal,ROUTINE,
1,2444,10000719,,70783909,51256,P30FVI,2139-09-14 15:10:00,2139-09-14 20:08:00,75.6,75.6,%,50.0,70.0,abnormal,ROUTINE,
2,2449,10000719,24558333.0,9035511,51221,,2140-04-15 00:22:00,2140-04-15 01:01:00,31.4,31.4,%,36.0,48.0,abnormal,STAT,
3,2458,10000719,24558333.0,93908058,51221,,2140-04-16 06:40:00,2140-04-16 07:54:00,32.6,32.6,%,36.0,48.0,abnormal,ROUTINE,
4,2464,10000719,,99456512,51221,P484YY,2140-11-14 17:08:00,2140-11-14 20:01:00,35.4,35.4,%,36.0,48.0,abnormal,ROUTINE,


In [43]:
convert_outliers(labevents_df, variable_ranges_csv, 'itemid')

In [44]:
labevents_df['value'].isna().sum()

113195

In [45]:
labevents_df.head()

Unnamed: 0,labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
0,2437,10000719,,70783909,51221,P30FVI,2139-09-14 15:10:00,2139-09-14 20:08:00,32.1,32.1,%,36.0,48.0,abnormal,ROUTINE,
1,2444,10000719,,70783909,51256,P30FVI,2139-09-14 15:10:00,2139-09-14 20:08:00,75.6,75.6,%,50.0,70.0,abnormal,ROUTINE,
2,2449,10000719,24558333.0,9035511,51221,,2140-04-15 00:22:00,2140-04-15 01:01:00,31.4,31.4,%,36.0,48.0,abnormal,STAT,
3,2458,10000719,24558333.0,93908058,51221,,2140-04-16 06:40:00,2140-04-16 07:54:00,32.6,32.6,%,36.0,48.0,abnormal,ROUTINE,
4,2464,10000719,,99456512,51221,P484YY,2140-11-14 17:08:00,2140-11-14 20:01:00,35.4,35.4,%,36.0,48.0,abnormal,ROUTINE,


In [46]:
labevents_df.isna().sum()

labevent_id                0
subject_id                 0
hadm_id              1518295
specimen_id                0
itemid                     0
order_provider_id    2653783
charttime                  0
storetime                  1
value                 113195
valuenum              130728
valueuom              175297
ref_range_lower       325158
ref_range_upper       325158
flag                 2534102
priority              184905
comments             3268595
dtype: int64

### Handling Time Series Conversions
1. make sure that all the timestamp fields are datetime type -- convert if that's not the case with pd.to_datetime
2. Merge all the dataframes together
3. Resample the data in pandas - set the timestamp as the index, group by subject_id and hadm_id, and resample to hourly intervals with `.resample('H')`
4. For each hourly interval, aggregate my mean/median
5. Use interpolation for handling missing intervals -- look into the paper that used the "look ahead" method/forward fill (or something similar to that...)


Try to merge on subject_id, hadm_id, and one of the timestamp fields -- might make sense to rename some of them first for consistency across tables
- Might make sense to do some kind of concatenation


Source for looking into how to do this: https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/

#### Relevant 'events' tables and their date fields
- labevents_df - `charttime`
- chartevents_df - `charttime`
- procedure_df - `starttime`, `endtime`
- microbiology_df - `chartdate`/`charttime`
- outputevents_df - `charttime`

#### Other tables to consider
- omr - `chartdate`
- services `transfertime`
- transfers - `intime`, `outtime`
- edstays - `intime`, `outtime`
- medrecon - `charttime`
- pyxis - `charttime`
- vitalsign - `charttime`
- prescriptions - `starttime`, `stoptime`
- hospital procedures - `chartdate`

#### Design Decision Regarding Datetimes
For LSTM model, each entry is represented by a single timestamp. We will use `charttime` as-is as the timestamp. For tables that have multiple timestamp fields (in/out, start/stop, start/end, etc.), we will use a single timestamp - the "start" timestamp, and create a new feature for "duration" that will cpature the length of time (stop time - start time). That way, we don't lose a lot of information and we abide by the data structure that is needed for the model.

In [47]:
##### read in tables that haven't been imported yet (non-event tables listed above)
omr_df = read_csv('dataframes/hosp_omr.csv')
services_df = read_csv('dataframes/hosp_services.csv')
transfers_df = read_csv('dataframes/hosp_transfers.csv')
edstays_df = read_csv('dataframes/ed_edstays.csv')
medrecon_df = read_csv('dataframes/ed_medrecon.csv')
pyxis_df = read_csv('dataframes/ed_pyxis.csv')
vitalsign_df = read_csv('dataframes/ed_vitalsign.csv')
prescription_df = read_csv('dataframes/hosp_prescriptions.csv')
hosp_procedure_df = read_csv('dataframes/hosp_procedures.csv')

dataframes/hosp_omr.csv
Shape: (899925, 5)
   subject_id   chartdate  seq_num     result_name result_value
0    10000719  2140-11-14        1  Blood Pressure       144/88
1    10000719  2140-11-14        1     BMI (kg/m2)         37.0
2    10000719  2140-11-14        1    Weight (Lbs)          236
3    10001472  2185-10-13        1  Blood Pressure       130/72
4    10001472  2185-10-13        1     BMI (kg/m2)         28.9
dataframes/hosp_services.csv
Shape: (60107, 5)
   subject_id   hadm_id         transfertime prev service curr_service
0    10000719  24558333  2140-04-15 00:15:12          NaN          OBS
1    10001319  23005466  2135-07-20 03:53:25          NaN          OBS
2    10001319  24591241  2138-11-09 20:30:59          NaN          OBS
3    10001319  29230609  2134-04-15 08:01:20          NaN          OBS
4    10001472  23506139  2186-01-10 00:26:41          NaN          OBS
dataframes/hosp_transfers.csv
Shape: (234354, 7)
   subject_id     hadm_id  transfer_id  eventtype  

Convert the timestamp fields to pandas datetime type:

In [48]:
def convert_columns_to_datetime(df, columns_list):
    for column in columns_list:
        if column in df.columns:
            df[column] = pd.to_datetime(df[column], errors='coerce')  # errors will appear as NaT
    return df

labevents_df = convert_columns_to_datetime(labevents_df, ['charttime'])
chartevents_df = convert_columns_to_datetime(chartevents_df, ['charttime'])
procedure_df = convert_columns_to_datetime(procedure_df, ['starttime', 'endtime'])
microbiology_df = convert_columns_to_datetime(microbiology_df, ['charttime'])
outputevents_df = convert_columns_to_datetime(outputevents_df, ['charttime'])



omr_df = convert_columns_to_datetime(omr_df, ['chartdate'])
services_df = convert_columns_to_datetime(services_df, ['transfertime'])
transfers_df = convert_columns_to_datetime(transfers_df, ['intime', 'outtime'])
edstays_df = convert_columns_to_datetime(edstays_df, ['intime', 'outtime'])
medrecon_df = convert_columns_to_datetime(medrecon_df, ['charttime'])
pyxis_df = convert_columns_to_datetime(sdf, ['charttime'])
vitalsign_df = convert_columns_to_datetime(vitalsign_df, ['charttime'])
prescription_df = convert_columns_to_datetime(prescription_df, ['starttime', 'stoptime'])
hosp_procedure_df = convert_columns_to_datetime(hosp_procedure_df, ['chartdate'])

Handle conversion of non-`charttime` fields to `charttime`:

In [49]:
procedure_df['duration'] = procedure_df['endtime'] - procedure_df['starttime']
procedure_df['charttime'] = procedure_df['starttime']
procedure_df = procedure_df.drop(columns=['endtime', 'starttime'])
procedure_df.head()

Unnamed: 0,subject_id,hadm_id,stay_id,itemid,value,valueuom,locationcategory,orderid,linkorderid,ordercategoryname,ordercategorydescription,patientweight,continueinnextdept,statusdescription,duration,charttime
0,10001884,26184834,37510196,225794,6.5,hour,Unknown,4809276,4809276,Ventilation,ContinuousProcess,65.0,0,FinishedRunning,0 days 06:30:00,2131-01-12 21:30:00
1,10001884,26184834,37510196,227194,1.0,,Unknown,6470885,6470885,Intubation/Extubation,Task,65.0,0,FinishedRunning,0 days 00:01:00,2131-01-12 17:40:00
2,10001884,26184834,37510196,228128,1.0,,Unknown,9459863,9459863,Communication,Task,65.0,0,FinishedRunning,0 days 00:01:00,2131-01-19 18:44:00
3,10001884,26184834,37510196,225401,1.0,,Unknown,4595950,4595950,Procedures,Task,65.0,0,FinishedRunning,0 days 00:01:00,2131-01-13 16:14:00
4,10001884,26184834,37510196,225454,1.0,,Unknown,5410081,5410081,Procedures,Task,65.0,0,FinishedRunning,0 days 00:01:00,2131-01-13 16:14:00


In [50]:
services_df['charttime'] = services_df['transfertime']
services_df = services_df.drop(columns=['transfertime'])
services_df.head()

Unnamed: 0,subject_id,hadm_id,prev service,curr_service,charttime
0,10000719,24558333,,OBS,2140-04-15 00:15:12
1,10001319,23005466,,OBS,2135-07-20 03:53:25
2,10001319,24591241,,OBS,2138-11-09 20:30:59
3,10001319,29230609,,OBS,2134-04-15 08:01:20
4,10001472,23506139,,OBS,2186-01-10 00:26:41


In [51]:
transfers_df['duration'] = transfers_df['outtime'] - transfers_df['intime']
transfers_df['charttime'] = transfers_df['intime']
transfers_df = transfers_df.drop(columns=['intime', 'outtime'])
transfers_df.head()

Unnamed: 0,subject_id,hadm_id,transfer_id,eventtype,care_unit_group,duration,charttime
0,10000719,24558333.0,31719052,discharge,Unknown,NaT,2140-04-18 12:41:26
1,10000719,24558333.0,32323060,admit,Labor & Delivery,1 days 02:28:36,2140-04-15 00:15:12
2,10000719,24558333.0,35042205,transfer,Labor & Delivery,2 days 09:57:38,2140-04-16 02:43:48
3,10001319,23005466.0,32828864,admit,Labor & Delivery,0 days 07:33:03,2135-07-20 03:53:25
4,10001319,23005466.0,33014199,transfer,Labor & Delivery,2 days 00:17:48,2135-07-20 11:26:28


In [52]:
# replace NaT with 0 for rows where eventtype is 'discharge'
transfers_df.loc[transfers_df['eventtype'] == 'discharge', 'duration'] = transfers_df.loc[transfers_df['eventtype'] == 'discharge', 'duration'].fillna(pd.Timedelta('0 days'))

In [53]:
transfers_df.head()

Unnamed: 0,subject_id,hadm_id,transfer_id,eventtype,care_unit_group,duration,charttime
0,10000719,24558333.0,31719052,discharge,Unknown,0 days 00:00:00,2140-04-18 12:41:26
1,10000719,24558333.0,32323060,admit,Labor & Delivery,1 days 02:28:36,2140-04-15 00:15:12
2,10000719,24558333.0,35042205,transfer,Labor & Delivery,2 days 09:57:38,2140-04-16 02:43:48
3,10001319,23005466.0,32828864,admit,Labor & Delivery,0 days 07:33:03,2135-07-20 03:53:25
4,10001319,23005466.0,33014199,transfer,Labor & Delivery,2 days 00:17:48,2135-07-20 11:26:28


In [54]:
edstays_df['duration'] = edstays_df['outtime'] - edstays_df['intime']
edstays_df['charttime'] = edstays_df['intime']
edstays_df = edstays_df.drop(columns=['intime', 'outtime'])
edstays_df.head()

Unnamed: 0,subject_id,hadm_id,stay_id,disposition,arrived_by_urgent_transport,duration,charttime
0,10001884,21192799.0,38708413,HOME,0,1 days 03:07:00,2130-10-05 11:58:00
1,10001884,22532141.0,38021228,HOME,0,0 days 16:57:00,2130-10-13 21:00:00
2,10001884,24325811.0,33281437,HOME,0,0 days 17:34:00,2126-11-03 19:15:00
3,10001884,24746267.0,35329716,ADMITTED,0,0 days 06:42:00,2130-12-27 15:48:00
4,10001884,24962904.0,31742950,ADMITTED,0,0 days 05:19:00,2130-12-06 16:46:00


In [55]:
prescription_df['duration'] = prescription_df['stoptime'] - prescription_df['starttime']
prescription_df['charttime'] = prescription_df['starttime']
prescription_df = prescription_df.drop(columns=['starttime', 'stoptime'])
prescription_df.head()

Unnamed: 0,subject_id,hadm_id,pharmacy_id,prod_strength,dose_val_rx,dose_unit_rx,doses_per_24_hrs,route,nonproprietaryname,duration,charttime
0,10001884,21268656,20911660,2 g / 50 mL Premix Bag,2,gm,1.0,IV,magnesium sulfate in water,0 days 23:00:00,2125-10-18 23:00:00
1,10001884,21577720,43978059,2 g / 50 mL Premix Bag,2,gm,1.0,IV,magnesium sulfate in water,0 days 11:00:00,2125-12-27 10:00:00
2,10001884,23594368,88623458,2 g / 50 mL Premix Bag,2,gm,1.0,IV,magnesium sulfate in water,0 days 09:00:00,2125-12-03 10:00:00
3,10001884,26170293,8650292,2 g / 50 mL Premix Bag,2,gm,1.0,IV,magnesium sulfate in water,0 days 23:00:00,2130-04-17 12:00:00
4,10001884,26184834,3491803,750 mg / 150 mL Premix Bag,750,mg,2.0,IV,vancomycin hydrochloride,1 days 11:00:00,2131-01-14 20:00:00


In [56]:
omr_df['charttime'] = omr_df['chartdate']
omr_df = omr_df.drop(columns=['chartdate'])
omr_df.head()

Unnamed: 0,subject_id,seq_num,result_name,result_value,charttime
0,10000719,1,Blood Pressure,144/88,2140-11-14
1,10000719,1,BMI (kg/m2),37.0,2140-11-14
2,10000719,1,Weight (Lbs),236,2140-11-14
3,10001472,1,Blood Pressure,130/72,2185-10-13
4,10001472,1,BMI (kg/m2),28.9,2185-10-13


In [57]:
microbiology_df['charttime'] = microbiology_df['chartdate']
microbiology_df = microbiology_df.drop(columns=['chartdate'])
microbiology_df = convert_columns_to_datetime(microbiology_df, ['charttime'])

In [58]:
hosp_procedure_df = hosp_procedure_df.rename(columns={'chartdate': 'charttime'})

### First Pass at Combining all the tables
#### Adding a 'source' column
This `source` column will be for debugging purposes so that I can better understand the data I am working with. **IMPORTANT**: This data should not be included into the final model because it could introduce bias or otherwise be irrelevant to the task.

In [59]:
# TODO: fill this in with a mapping of the dataframe name to the actual dataframe
# so that we can map the source for debugging purposes
dfs = {
    'labevents': labevents_df,
    'chartevents': chartevents_df,
    'procedureevents': procedure_df,
    'microbiologyevents': microbiology_df,
    'outputevents': outputevents_df,
    'omr': omr_df,
    'services': services_df,
    'transfers': transfers_df,
    'edstays': edstays_df,
    'medrecon': medrecon_df,
    'pyxis': pyxis_df,
    'vitalsign': vitalsign_df,
    'prescriptions': prescription_df,
    'hosp_procedures': hosp_procedure_df
}

# Add 'source' column 
for name, df in dfs.items():
    df['source'] = name

# verify success
labevents_df.head()

Unnamed: 0,labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments,source
0,2437,10000719,,70783909,51221,P30FVI,2139-09-14 15:10:00,2139-09-14 20:08:00,32.1,32.1,%,36.0,48.0,abnormal,ROUTINE,,labevents
1,2444,10000719,,70783909,51256,P30FVI,2139-09-14 15:10:00,2139-09-14 20:08:00,75.6,75.6,%,50.0,70.0,abnormal,ROUTINE,,labevents
2,2449,10000719,24558333.0,9035511,51221,,2140-04-15 00:22:00,2140-04-15 01:01:00,31.4,31.4,%,36.0,48.0,abnormal,STAT,,labevents
3,2458,10000719,24558333.0,93908058,51221,,2140-04-16 06:40:00,2140-04-16 07:54:00,32.6,32.6,%,36.0,48.0,abnormal,ROUTINE,,labevents
4,2464,10000719,,99456512,51221,P484YY,2140-11-14 17:08:00,2140-11-14 20:01:00,35.4,35.4,%,36.0,48.0,abnormal,ROUTINE,,labevents


#### Drop any additional irrelevant columns before combining

In [60]:
labevents_df = labevents_df.drop(columns=['specimen_id', 'storetime', 'comments', 'order_provider_id', 'valuenum'])

In [61]:
labevents_df['ref_range_upper'].isna().sum()

325158

In [62]:
# future function to reference range - don't apply yet
def determine_outside_ref_range(row):
    if pd.isna(row['ref_range_lower']) and pd.isna(row['ref_range_upper']):
        return 'unknown'
    elif pd.isna(row['ref_range_lower']):
        return 'outside' if row['value'] > row['ref_range_upper'] else 'inside'
    elif pd.isna(row['ref_range_upper']):
        return 'outside' if row['value'] < row['ref_range_lower'] else 'inside'
    else:
        return 'outside' if (row['value'] < row['ref_range_lower']) or (row['value'] > row['ref_range_upper']) else 'inside'

# labevents_df['outside_ref_range'] = labevents_df.apply(determine_outside_ref_range, axis=1)

In [63]:
chartevents_df.head()

Unnamed: 0,subject_id,hadm_id,stay_id,charttime,itemid,value,valuenum,valueuom,source
0,10001884,26184834,37510196,2131-01-18 19:00:00,220210,17.0,17.0,insp/min,chartevents
1,10001884,26184834,37510196,2131-01-18 20:00:00,220210,16.0,16.0,insp/min,chartevents
2,10001884,26184834,37510196,2131-01-18 21:00:00,220210,15.0,15.0,insp/min,chartevents
3,10001884,26184834,37510196,2131-01-18 22:00:00,220210,13.0,13.0,insp/min,chartevents
4,10001884,26184834,37510196,2131-01-18 23:00:00,220210,12.0,12.0,insp/min,chartevents


In [64]:
chartevents_df = chartevents_df.drop(columns=["valuenum"])

In [65]:
procedure_df.head()

Unnamed: 0,subject_id,hadm_id,stay_id,itemid,value,valueuom,locationcategory,orderid,linkorderid,ordercategoryname,ordercategorydescription,patientweight,continueinnextdept,statusdescription,duration,charttime,source
0,10001884,26184834,37510196,225794,6.5,hour,Unknown,4809276,4809276,Ventilation,ContinuousProcess,65.0,0,FinishedRunning,0 days 06:30:00,2131-01-12 21:30:00,procedureevents
1,10001884,26184834,37510196,227194,1.0,,Unknown,6470885,6470885,Intubation/Extubation,Task,65.0,0,FinishedRunning,0 days 00:01:00,2131-01-12 17:40:00,procedureevents
2,10001884,26184834,37510196,228128,1.0,,Unknown,9459863,9459863,Communication,Task,65.0,0,FinishedRunning,0 days 00:01:00,2131-01-19 18:44:00,procedureevents
3,10001884,26184834,37510196,225401,1.0,,Unknown,4595950,4595950,Procedures,Task,65.0,0,FinishedRunning,0 days 00:01:00,2131-01-13 16:14:00,procedureevents
4,10001884,26184834,37510196,225454,1.0,,Unknown,5410081,5410081,Procedures,Task,65.0,0,FinishedRunning,0 days 00:01:00,2131-01-13 16:14:00,procedureevents


In [66]:
procedure_df['continueinnextdept'].value_counts()

continueinnextdept
0    37425
1       11
Name: count, dtype: int64

In [67]:
procedure_df['statusdescription'].value_counts()

statusdescription
FinishedRunning    36844
Stopped              570
Paused                22
Name: count, dtype: int64

In [68]:
procedure_df = procedure_df.drop(columns=["orderid", "linkorderid", "continueinnextdept", "statusdescription"])

In [69]:
microbiology_df.head()

Unnamed: 0,microevent_id,subject_id,hadm_id,micro_specimen_id,org_name,charttime,source
0,95,10000719,,4691510,POSITIVE FOR GROUP B BETA STREPTOCOCCI,2140-03-28,microbiologyevents
1,420,10001319,,2654897,GRAM POSITIVE BACTERIA,2135-06-10,microbiologyevents
2,614,10001884,,2893463,POSITIVE FOR INFLUENZA A VIRAL ANTIGEN,2125-12-01,microbiologyevents
3,871,10002266,,3980855,GRAM POSITIVE BACTERIA,2124-08-28,microbiologyevents
4,927,10002428,,1704201,LACTOBACILLUS SPECIES,2156-05-11,microbiologyevents


In [70]:
microbiology_df = microbiology_df.drop(columns=["microevent_id", "micro_specimen_id"])

In [71]:
outputevents_df.head()

Unnamed: 0,subject_id,hadm_id,stay_id,charttime,itemid,value,valueuom,source
0,10553084,28481755,35401987,2199-10-17 20:06:00,226590,400.0,ml,outputevents
1,10553084,28481755,35401987,2199-10-17 22:00:00,226590,450.0,ml,outputevents
2,10553084,28481755,35401987,2199-10-18 04:00:00,226590,50.0,ml,outputevents
3,10553084,28481755,35401987,2199-10-18 06:00:00,226590,100.0,ml,outputevents
4,10553084,28481755,35401987,2199-10-18 12:00:00,226590,100.0,ml,outputevents


#### Applying the Concatenation!

In [72]:
labevents_df['value'].isna().sum()

113195

In [73]:
def create_timeseries_table(df_list):
    combined_df = pd.concat(df_list, ignore_index=True)
    combined_df = combined_df.sort_values(by='charttime')
    # combined_df = combined_df.set_index('charttime')
    # combined_df = combined_df.reset_index(drop=True)
    return combined_df

In [74]:
df_list = [labevents_df, chartevents_df, procedure_df, microbiology_df, outputevents_df]
timeseries_df = create_timeseries_table(df_list)

In [75]:
timeseries_df

Unnamed: 0,labevent_id,subject_id,hadm_id,itemid,charttime,value,valueuom,ref_range_lower,ref_range_upper,flag,priority,source,stay_id,locationcategory,ordercategoryname,ordercategorydescription,patientweight,duration,org_name
2199715,73658924.0,16224440,,50862.0,2109-04-11 16:15:00,4.5,g/dL,3.5,5.2,,ROUTINE,labevents,,,,,,NaT,
2650370,88787554.0,17508974,,50934.0,2109-05-13 11:41:00,15,,,,,ROUTINE,labevents,,,,,,NaT,
526661,18162220.0,11558642,,51256.0,2109-05-17 11:28:00,56.7,%,34.0,71.0,,ROUTINE,labevents,,,,,,NaT,
526660,18162213.0,11558642,,51221.0,2109-05-17 11:28:00,40.2,%,34.0,45.0,,ROUTINE,labevents,,,,,,NaT,
2329881,78058804.0,16597305,,50882.0,2109-06-23 11:09:00,25,mEq/L,22.0,32.0,,ROUTINE,labevents,,,,,,NaT,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
656634,23018590.0,11973788,,51006.0,2213-02-13 11:45:00,22,mg/dL,6.0,20.0,abnormal,STAT,labevents,,,,,,NaT,
656635,23018595.0,11973788,,50934.0,2213-02-13 11:45:00,5,,,,,STAT,labevents,,,,,,NaT,
656636,23018602.0,11973788,,51221.0,2213-02-13 11:45:00,40.0,%,34.0,45.0,,STAT,labevents,,,,,,NaT,
656623,23018572.0,11973788,,50862.0,2213-02-13 11:45:00,4.0,g/dL,3.5,5.2,,STAT,labevents,,,,,,NaT,


In [76]:
timeseries_df.isna().sum() / timeseries_df.shape[0]

labevent_id                 0.234062
subject_id                  0.000000
hadm_id                     0.336995
itemid                      0.010200
charttime                   0.000000
value                       0.034737
valueuom                    0.061096
ref_range_lower             0.304543
ref_range_upper             0.304543
flag                        0.783353
priority                    0.274142
source                      0.000000
stay_id                     0.776138
locationcategory            0.991885
ordercategoryname           0.991885
ordercategorydescription    0.991885
patientweight               0.991999
duration                    0.991885
org_name                    0.989800
dtype: float64

In [77]:
timeseries_df.shape

(4613404, 19)

In [78]:
timeseries_df.rename(columns={'flag': 'labevents_flag', 'priority': 'labevents_priority', 
                             'locationcategory': 'procedure_locationcategory',
                             'ordercategoryname': 'procedure_ordercategoryname',
                             'ordercategorydescription': 'procedure_ordercategorydescription',
                             'org_name': 'microbiology_orgname'}, inplace=True)

In [79]:
def replace_itemid_with_label(row):
    if row['source'] == 'labevents':
        label = lab_item_df.loc[lab_item_df['itemid'] == row['itemid'], 'label']
        if not label.empty:
            return label.values[0]
    elif row['source'] != 'microbiologyevents':
        label = item_df.loc[item_df['itemid'] == row['itemid'], 'label']
        if not label.empty:
            return label.values[0]
    return row['itemid']
print(timeseries_df['itemid'].isna().sum())
timeseries_df['itemid'] = timeseries_df.apply(replace_itemid_with_label, axis=1)
print(timeseries_df['itemid'].isna().sum())

47058
47058


In [80]:
timeseries_df.head()

Unnamed: 0,labevent_id,subject_id,hadm_id,itemid,charttime,value,valueuom,ref_range_lower,ref_range_upper,labevents_flag,labevents_priority,source,stay_id,procedure_locationcategory,procedure_ordercategoryname,procedure_ordercategorydescription,patientweight,duration,microbiology_orgname
2199715,73658924.0,16224440,,Albumin,2109-04-11 16:15:00,4.5,g/dL,3.5,5.2,,ROUTINE,labevents,,,,,,NaT,
2650370,88787554.0,17508974,,H,2109-05-13 11:41:00,15.0,,,,,ROUTINE,labevents,,,,,,NaT,
526661,18162220.0,11558642,,Neutrophils,2109-05-17 11:28:00,56.7,%,34.0,71.0,,ROUTINE,labevents,,,,,,NaT,
526660,18162213.0,11558642,,Hematocrit,2109-05-17 11:28:00,40.2,%,34.0,45.0,,ROUTINE,labevents,,,,,,NaT,
2329881,78058804.0,16597305,,Bicarbonate,2109-06-23 11:09:00,25.0,mEq/L,22.0,32.0,,ROUTINE,labevents,,,,,,NaT,


In [81]:
timeseries_df[timeseries_df['patientweight'].notna()]['subject_id'].nunique() / timeseries_df['subject_id'].nunique()

0.1368604466972913

In [82]:
omr_df[omr_df['result_name'] == 'BMI (kg/m2)']['subject_id'].nunique() 

14072

In [83]:
# unique_proc_weight = [id for id in timeseries_df[timeseries_df['patientweight'].notna()]['subject_id'].unique() if
#                id not in omr_df[omr_df['result_name'] == 'BMI (kg/m2)']['subject_id'].unique()]

In [84]:
# # number of patients for which we don't have bmi stored but have access to weight from procedureevents
# len(unique_proc_weight)

In [85]:
omr_df[omr_df['result_name'] == 'Weight (Lbs)']['subject_id'].nunique()

15519

In [86]:
# unique_proc_weight_2 = [id for id in timeseries_df[timeseries_df['patientweight'].notna()]['subject_id'].unique() if
#                id not in omr_df[omr_df['result_name'] == 'Weight (Lbs)']['subject_id'].unique()]
# print(len(unique_proc_weight_2))

In [87]:
# print(len(unique_proc_weight_2))

In [88]:
# len([id for id in omr_df[omr_df['result_name'] == 'BMI (kg/m2)']['subject_id'].unique() if
#                id not in omr_df[omr_df['result_name'] == 'Weight (Lbs)']['subject_id'].unique()])

#### Adding OMR dataframe to the existing timeseries dataframe
- the OMR dataframe has weight measurements for many patients that are not in the procedureevents table (which also captures patient weight). Makes sense to include the data from the OMR dataframe so that we can capture the data for as many patients as possible
- the OMR weight measurements are in lbs - convert to kg before combining with the existing timeseries dataframe
- eventually the `patientweight` and result name == weight rows for OMR will be combined into a single column so that we are not repeating anything

In [89]:
weight_mask = omr_df['result_name'] == 'Weight (Lbs)'
omr_df.loc[weight_mask, 'result_value'] = pd.to_numeric(omr_df.loc[weight_mask, 'result_value'], errors='coerce')
lb_to_kg_conversion = 0.453592
omr_df.loc[omr_df['result_name'] == 'Weight (Lbs)', 'result_value'] *= lb_to_kg_conversion

In [90]:
omr_df['result_name'] = omr_df['result_name'].replace('Weight (Lbs)', 'Weight (kg)')

In [91]:
omr_df.rename(columns={'result_name': 'itemid', 'result_value': 'value'}, inplace=True)

In [92]:
omr_df.head()

Unnamed: 0,subject_id,seq_num,itemid,value,charttime,source
0,10000719,1,Blood Pressure,144/88,2140-11-14,omr
1,10000719,1,BMI (kg/m2),37.0,2140-11-14,omr
2,10000719,1,Weight (kg),107.047712,2140-11-14,omr
3,10001472,1,Blood Pressure,130/72,2185-10-13,omr
4,10001472,1,BMI (kg/m2),28.9,2185-10-13,omr


In [93]:
# mmHg for blood pressure, kg/m2 for BMI, kg for weight
def get_uom(itemid):
    if itemid == 'Blood Pressure':
        return 'mmHg'
    elif itemid == 'Weight (kg)':
        return 'kg'
    elif itemid == 'BMI (kg/m2)':
        return 'kg/m2'
    else:
        return 'Unknown'

omr_df['valueuom'] = omr_df['itemid'].apply(get_uom)

In [94]:
omr_df.isna().sum()

subject_id    0
seq_num       0
itemid        0
value         0
charttime     0
source        0
valueuom      0
dtype: int64

In [95]:
omr_df = omr_df.drop(columns=['seq_num'])

In [96]:
timeseries_df_with_omr = pd.concat([timeseries_df, omr_df], ignore_index=False)
timeseries_df_with_omr = timeseries_df_with_omr.sort_values(by='charttime')
# timeseries_df_With_omr = timeseries_df_with_omr.set_index('charttime').reset_index(drop=True)

In [97]:
timeseries_df_with_omr['itemid'].isna().sum()

47058

Now attempt to combine the `patientweight` and `omr itemid == 'Weight (kg)'` columns together:

In [98]:
timeseries_df_with_omr['patientweight'].isna().sum()

5476419

In [99]:
timeseries_df_with_omr[timeseries_df_with_omr['itemid'] == 'Weight (kg)'].shape

(307570, 19)

In [100]:
# Update patientweight with values from the 'value' column where itemid is 'weight'
timeseries_df_with_omr['patientweight'] = timeseries_df_with_omr.apply(
    lambda row: row['value'] if row['itemid'] == 'Weight (kg)' and pd.notna(row['value']) else row['patientweight'],
    axis=1
)

In [101]:
timeseries_df_with_omr['patientweight'].isna().sum()

5168849

In [102]:
timeseries_df_with_omr['patientweight'].notna().sum()

344480

In [103]:
timeseries_df_with_omr.isna().sum()

labevent_id                           1979748
subject_id                                  0
hadm_id                               2454619
itemid                                  47058
charttime                                   0
value                                  160254
valueuom                               281861
ref_range_lower                       2304906
ref_range_upper                       2304906
labevents_flag                        4513850
labevents_priority                    2164653
source                                      0
stay_id                               4480564
procedure_locationcategory            5475893
procedure_ordercategoryname           5475893
procedure_ordercategorydescription    5475893
patientweight                         5168849
duration                              5475893
microbiology_orgname                  5466271
dtype: int64

#### Combine services df with running dataframe

In [104]:
services_df.head()

Unnamed: 0,subject_id,hadm_id,prev service,curr_service,charttime,source
0,10000719,24558333,,OBS,2140-04-15 00:15:12,services
1,10001319,23005466,,OBS,2135-07-20 03:53:25,services
2,10001319,24591241,,OBS,2138-11-09 20:30:59,services
3,10001319,29230609,,OBS,2134-04-15 08:01:20,services
4,10001472,23506139,,OBS,2186-01-10 00:26:41,services


In [105]:
services_df.isna().sum()

subject_id          0
hadm_id             0
prev service    56712
curr_service        0
charttime           0
source              0
dtype: int64

In [106]:
services_df.shape

(60107, 6)

In [107]:
# nearly all of these are NA
services_df = services_df.drop(columns=['prev service'])

In [108]:
timeseries_df_with_services = pd.concat([timeseries_df_with_omr, services_df], ignore_index=False)
timeseries_df_with_services = timeseries_df_with_services.sort_values(by='charttime')
# timeseries_df_with_services = timeseries_df_with_services.set_index('charttime').reset_index(drop=True)

In [109]:
timeseries_df_with_services.head()

Unnamed: 0,labevent_id,subject_id,hadm_id,itemid,charttime,value,valueuom,ref_range_lower,ref_range_upper,labevents_flag,labevents_priority,source,stay_id,procedure_locationcategory,procedure_ordercategoryname,procedure_ordercategorydescription,patientweight,duration,microbiology_orgname,curr_service
744854,,18300046,,Blood Pressure,2109-04-03,100/70,mmHg,,,,,omr,,,,,,NaT,,
744855,,18300046,,Weight (kg),2109-04-03,56.336126,kg,,,,,omr,,,,,56.336126,NaT,,
556851,,16224440,,Weight (kg),2109-04-07,52.798109,kg,,,,,omr,,,,,52.798109,NaT,,
556850,,16224440,,BMI (kg/m2),2109-04-07,19.6,kg/m2,,,,,omr,,,,,,NaT,,
556849,,16224440,,Blood Pressure,2109-04-07,100/70,mmHg,,,,,omr,,,,,,NaT,,


In [110]:
timeseries_df_with_services.shape

(5573436, 20)

#### Add on transfers

In [111]:
# transfer_id corresponds to stay_id for icu/ed, so keep for now when concatenating
transfers_df.head()

Unnamed: 0,subject_id,hadm_id,transfer_id,eventtype,care_unit_group,duration,charttime,source
0,10000719,24558333.0,31719052,discharge,Unknown,0 days 00:00:00,2140-04-18 12:41:26,transfers
1,10000719,24558333.0,32323060,admit,Labor & Delivery,1 days 02:28:36,2140-04-15 00:15:12,transfers
2,10000719,24558333.0,35042205,transfer,Labor & Delivery,2 days 09:57:38,2140-04-16 02:43:48,transfers
3,10001319,23005466.0,32828864,admit,Labor & Delivery,0 days 07:33:03,2135-07-20 03:53:25,transfers
4,10001319,23005466.0,33014199,transfer,Labor & Delivery,2 days 00:17:48,2135-07-20 11:26:28,transfers


In [112]:
timeseries_df_with_transfers = pd.concat([timeseries_df_with_services, transfers_df], ignore_index=False)
timeseries_df_with_transfers = timeseries_df_with_transfers.sort_values(by='charttime')
# timeseries_df_with_transfers = timeseries_df_with_transfers.set_index('charttime').reset_index(drop=True)

In [113]:
timeseries_df_with_transfers.shape

(5807790, 23)

In [114]:
timeseries_df_with_transfers.head()

Unnamed: 0,labevent_id,subject_id,hadm_id,itemid,charttime,value,valueuom,ref_range_lower,ref_range_upper,labevents_flag,...,procedure_locationcategory,procedure_ordercategoryname,procedure_ordercategorydescription,patientweight,duration,microbiology_orgname,curr_service,transfer_id,eventtype,care_unit_group
744854,,18300046,,Blood Pressure,2109-04-03,100/70,mmHg,,,,...,,,,,NaT,,,,,
744855,,18300046,,Weight (kg),2109-04-03,56.336126,kg,,,,...,,,,56.336126,NaT,,,,,
556851,,16224440,,Weight (kg),2109-04-07,52.798109,kg,,,,...,,,,52.798109,NaT,,,,,
556850,,16224440,,BMI (kg/m2),2109-04-07,19.6,kg/m2,,,,...,,,,,NaT,,,,,
556849,,16224440,,Blood Pressure,2109-04-07,100/70,mmHg,,,,...,,,,,NaT,,,,,


#### Add on edstays

In [115]:
# add all of the fields
edstays_df.head()

Unnamed: 0,subject_id,hadm_id,stay_id,disposition,arrived_by_urgent_transport,duration,charttime,source
0,10001884,21192799.0,38708413,HOME,0,1 days 03:07:00,2130-10-05 11:58:00,edstays
1,10001884,22532141.0,38021228,HOME,0,0 days 16:57:00,2130-10-13 21:00:00,edstays
2,10001884,24325811.0,33281437,HOME,0,0 days 17:34:00,2126-11-03 19:15:00,edstays
3,10001884,24746267.0,35329716,ADMITTED,0,0 days 06:42:00,2130-12-27 15:48:00,edstays
4,10001884,24962904.0,31742950,ADMITTED,0,0 days 05:19:00,2130-12-06 16:46:00,edstays


In [116]:
timeseries_df_with_ed = pd.concat([timeseries_df_with_transfers, edstays_df], ignore_index=False)
timeseries_df_with_ed = timeseries_df_with_ed.sort_values(by='charttime')
# timeseries_df_with_ed = timeseries_df_with_ed.set_index('charttime').reset_index(drop=True)

#### Add on medrecon

In [117]:
medrecon_df.head()

Unnamed: 0,subject_id,stay_id,charttime,nonproprietaryname,source
0,10001884,31306678,2130-10-19 14:38:00,acetaminophen,medrecon
1,10001884,31306678,2130-10-19 14:38:00,oxycodone hydrochloride,medrecon
2,10001884,31306678,2130-10-19 14:38:00,aspirin,medrecon
3,10001884,31306678,2130-10-19 14:38:00,aspirin,medrecon
4,10001884,31306678,2130-10-19 14:38:00,diltiazem hydrochloride,medrecon


In [118]:
medrecon_df = medrecon_df.rename(columns={'nonproprietaryname': 'medication_reconciliation'})

In [119]:
medrecon_df.head()

Unnamed: 0,subject_id,stay_id,charttime,medication_reconciliation,source
0,10001884,31306678,2130-10-19 14:38:00,acetaminophen,medrecon
1,10001884,31306678,2130-10-19 14:38:00,oxycodone hydrochloride,medrecon
2,10001884,31306678,2130-10-19 14:38:00,aspirin,medrecon
3,10001884,31306678,2130-10-19 14:38:00,aspirin,medrecon
4,10001884,31306678,2130-10-19 14:38:00,diltiazem hydrochloride,medrecon


In [120]:
timeseries_df_with_meds = pd.concat([timeseries_df_with_ed, medrecon_df], ignore_index=False)
timeseries_df_with_meds = timeseries_df_with_meds.sort_values(by='charttime')
# timeseries_df_with_meds = timeseries_df_with_meds.set_index('charttime').reset_index(drop=True)

#### Add pyxis

In [121]:
pyxis_df.head()

Unnamed: 0,subject_id,stay_id,charttime,med_rn,name,gsn_rn,gsn,source
0,10001884,31306678,2130-10-19 13:50:00,1,methylprednisolone sodium succ,1,6730.0,pyxis
1,10001884,31306678,2130-10-19 13:50:00,1,methylprednisolone sodium succ,2,51555.0,pyxis
2,10001884,31306678,2130-10-19 13:50:00,1,methylprednisolone sodium succ,3,65978.0,pyxis
3,10001884,31306678,2130-10-19 13:56:00,2,aspirin,1,4380.0,pyxis
4,10001884,31306678,2130-10-19 15:01:00,3,azithromycin,1,31452.0,pyxis


In [122]:
pyxis_df = pyxis_df.drop(columns=['med_rn', 'gsn_rn', 'gsn'])
pyxis_df = pyxis_df.drop_duplicates()

In [123]:
pyxis_df = pyxis_df.rename(columns={'name': 'medication_dispensation'})

In [124]:
pyxis_df.head()

Unnamed: 0,subject_id,stay_id,charttime,medication_dispensation,source
0,10001884,31306678,2130-10-19 13:50:00,methylprednisolone sodium succ,pyxis
3,10001884,31306678,2130-10-19 13:56:00,aspirin,pyxis
4,10001884,31306678,2130-10-19 15:01:00,azithromycin,pyxis
5,10001884,31306678,2130-10-19 15:01:00,albuterol,pyxis
6,10001884,31306678,2130-10-19 15:01:00,ipratropium bromide neb,pyxis


In [125]:
timeseries_df_with_pyxis = pd.concat([timeseries_df_with_meds, pyxis_df], ignore_index=False)
timeseries_df_with_pyxis = timeseries_df_with_pyxis.sort_values(by='charttime')
# timeseries_df_with_pyxis = timeseries_df_with_pyxis.set_index('charttime').reset_index(drop=True)

#### Add in vitalsign

In [126]:
vitalsign_df.head()

Unnamed: 0,subject_id,stay_id,charttime,temperature,heartrate,resprate,o2sat,sbp,dbp,rhythm,pain,source
0,10001884,31306678,2130-10-19 13:34:00,,,,79.0,,,,unable,vitalsign
1,10001884,31306678,2130-10-19 14:31:00,,75.0,15.0,100.0,133.0,68.0,,,vitalsign
2,10001884,31306678,2130-10-19 14:44:00,98.2,76.0,16.0,99.0,139.0,70.0,,0,vitalsign
3,10001884,31306678,2130-10-19 15:50:00,98.2,76.0,20.0,97.0,138.0,72.0,,0,vitalsign
4,10001884,31742950,2130-12-06 16:46:00,97.6,67.0,22.0,97.0,132.0,82.0,,0,vitalsign


In [127]:
vitalsign_df.isna().sum()

subject_id          0
stay_id             0
charttime           0
temperature     52369
heartrate        7199
resprate         9008
o2sat           13803
sbp              8119
dbp              8119
rhythm         149170
pain            44628
source              0
dtype: int64

In [128]:
vitalsign_df.shape

(154819, 12)

In [129]:
# over 80% of the rows are NA
vitalsign_df = vitalsign_df.drop(columns=['rhythm'])

In [130]:
vitalsign_df['pain'].value_counts()

pain
0                                                                           48369
8                                                                            7808
5                                                                            6998
10                                                                           6334
7                                                                            5923
                                                                            ...  
right forearm                                                                   1
unable to score left arm pain                                                   1
right arm BP                                                                    1
left arm BP                                                                     1
0                                                                               1
Name: count, Length: 646, dtype: int64

In [132]:
import re

def clean_pain_value(value):
    if pd.isna(value):
        return np.nan
    value = value.lower()
    # if the value is 'unable', we do not know if it was set to 'unable' because the patient could
    # not provide an answer, or if there was a data collection issue, so set to N/A
    if 'unable' in value or 'refused' in value:
        return np.nan
    if 'bad' in value:
        return 8

    value = re.sub(r'[^\d/.]', '', value)  # remove all non-numeric and non-slash characters
    value = re.sub(r'/.*$', '', value)     # remove anything after and including a slash (ex. 8/10)
    value = re.sub(r'^(\d+)', r'\1', value)  # keep only the leading digits
    
    try:
        numeric_value = float(value)
        # round down if the value is a decimal
        numeric_value = np.floor(numeric_value)
        if 0 <= numeric_value <= 10:
            return int(numeric_value)
        else:
            return np.nan
    except ValueError:
        return np.nan
vitalsign_df['pain'] = vitalsign_df['pain'].apply(clean_pain_value)

In [133]:
vitalsign_df['pain'].value_counts()

pain
0.0     49303
8.0      8008
5.0      7174
10.0     6482
7.0      6081
6.0      5791
4.0      5595
2.0      4730
3.0      4638
9.0      3528
1.0      1403
Name: count, dtype: int64

In [134]:
vitalsign_df['pain'].isna().sum()

52086

In [135]:
# from the docs - "some temperatures may be misrecorded as Celsius"
vitalsign_df['temperature'].describe()

count    102450.000000
mean         98.110383
std           6.585926
min           0.000000
25%          97.800000
50%          98.100000
75%          98.500000
max         988.000000
Name: temperature, dtype: float64

In [136]:
# use outlier variable ranges from MIMIC-Extract for detecting outliers
# < 14.2 degrees Celsius (58 degrees Fahrenheit) and > 47 degrees Celsius (117 degrees Fahrenheit)
vitalsign_df['temperature'] = pd.to_numeric(vitalsign_df['temperature'], errors='coerce')
filtered_df = vitalsign_df[(vitalsign_df['temperature'] < 58) | (vitalsign_df['temperature'] > 117)]

In [137]:
filtered_df

Unnamed: 0,subject_id,stay_id,charttime,temperature,heartrate,resprate,o2sat,sbp,dbp,pain,source
306,10011668,38526562,2134-05-17 02:34:00,9.2,71.0,18.0,99.0,137.0,79.0,8.0,vitalsign
743,10037928,34992024,2183-11-28 17:20:00,36.5,95.0,18.0,98.0,150.0,56.0,8.0,vitalsign
753,10037928,37036523,2177-09-04 06:36:00,31.7,53.0,20.0,95.0,116.0,43.0,,vitalsign
754,10037928,37036523,2177-09-04 06:47:00,31.6,47.0,18.0,93.0,107.0,47.0,,vitalsign
755,10037928,37036523,2177-09-04 07:01:00,31.4,50.0,19.0,94.0,111.0,44.0,,vitalsign
...,...,...,...,...,...,...,...,...,...,...,...
150333,19669999,32195387,2148-08-10 02:12:00,33.3,88.0,22.0,97.0,107.0,46.0,,vitalsign
150334,19669999,32195387,2148-08-10 02:27:00,33.5,90.0,22.0,,99.0,39.0,,vitalsign
150335,19669999,32195387,2148-08-10 02:41:00,33.5,89.0,26.0,98.0,101.0,59.0,,vitalsign
150336,19669999,32195387,2148-08-10 03:02:00,33.7,92.0,29.0,96.0,129.0,48.0,,vitalsign


In [138]:
filtered_df[filtered_df['temperature'] > 117]

Unnamed: 0,subject_id,stay_id,charttime,temperature,heartrate,resprate,o2sat,sbp,dbp,pain,source
19784,11312346,37186004,2152-05-14 18:21:00,162.79,59.0,16.0,96.0,162.0,79.0,,vitalsign
49041,13219876,32156356,2135-09-12 18:40:00,987.0,82.0,16.0,100.0,116.0,70.0,,vitalsign
61410,13994812,32646789,2182-04-21 06:39:00,988.0,73.0,15.0,100.0,97.0,58.0,,vitalsign
82779,15399588,39040242,2157-09-19 22:56:00,978.0,79.0,16.0,99.0,141.0,82.0,8.0,vitalsign
85421,15595239,33022179,2142-05-21 14:30:00,148.0,104.0,18.0,,,,,vitalsign
96078,16279137,32880686,2180-10-12 13:27:00,986.0,67.0,16.0,100.0,119.0,74.0,,vitalsign
109909,17147107,34223249,2178-01-07 23:15:00,130.1,117.0,17.0,100.0,165.0,88.0,0.0,vitalsign
141046,19050723,32186890,2141-12-25 13:45:00,134.5,72.0,16.0,100.0,110.0,75.0,,vitalsign


In [139]:
def clean_temperature(value):
    if pd.isna(value):
        return np.nan
    # if temperature between 14 and 58, it likely needs to be converted from Celsius to Fahrenheit
    if 14 <= value <= 58:
        # C --> F conversion
        return value * 9/5 + 32
    # if temperature between 900 and 1000, it's likely that a decimal needs to be inserted
    elif 900 <= value <= 1000:
        # insert a decimal so that temp is between 90 - 100
        return value / 10
    # these are considered outliers
    # hard to distinguish whether there is a typo or the wrong vital sign recorded (ex. heart rate as temp)
    elif value < 58 or value > 117:
        return np.nan
    # else:
    #     # otherwise, set to NA - hard to distinguish whether there is a typo or the wrong vital sign recorded (ex. heart rate as temp)
    #     return np.nan
vitalsign_df['temperature'] = vitalsign_df['temperature'].apply(clean_temperature)

In [140]:
vitalsign_df['temperature'].describe()

count    269.000000
mean      96.503717
std        5.489537
min       62.600000
25%       93.380000
50%       96.620000
75%       99.140000
max      136.400000
Name: temperature, dtype: float64

In [141]:
vitalsign_df.loc[vitalsign_df['temperature'] > 117, 'temperature'] = np.nan

In [142]:
vitalsign_df['heartrate'].describe()

count    147620.000000
mean         82.850842
std          17.371838
min           1.000000
25%          71.000000
50%          81.000000
75%          93.000000
max         825.000000
Name: heartrate, dtype: float64

In [143]:
# MIMIC Extract outliers are < 0 or > 390
vitalsign_df.loc[vitalsign_df['heartrate'] > 390, 'heartrate'] = np.nan

In [144]:
vitalsign_df['o2sat'].describe()

count    141016.000000
mean         98.214023
std          26.102178
min           0.000000
25%          97.000000
50%          99.000000
75%         100.000000
max        9712.000000
Name: o2sat, dtype: float64

In [145]:
# MIMIC Extract outliers are < 0 or > 150
vitalsign_df.loc[vitalsign_df['o2sat'] > 150, 'o2sat'] = np.nan

In [146]:
# all within range - outlier is less than 0 or greater than 375
vitalsign_df['sbp'].describe()

count    146700.000000
mean        125.257280
std          23.108332
min           8.000000
25%         109.000000
50%         122.000000
75%         139.000000
max         274.000000
Name: sbp, dtype: float64

In [147]:
vitalsign_df['dbp'].describe()

count    146700.00000
mean         70.89032
std         148.46845
min           0.00000
25%          59.00000
50%          69.00000
75%          79.00000
max       51989.00000
Name: dbp, dtype: float64

In [148]:
# MIMIC Extract outliers are < 0 or > 375
vitalsign_df.loc[vitalsign_df['dbp'] > 375, 'dbp'] = np.nan

In [149]:
timeseries_df_with_vitals = pd.concat([timeseries_df_with_pyxis, vitalsign_df], ignore_index=False)
timeseries_df_with_vitals = timeseries_df_with_vitals.sort_values(by='charttime')
# timeseries_df_with_vitals = timeseries_df_with_vitals.set_index('charttime').reset_index(drop=True)

In [150]:
timeseries_df_with_vitals.head()

Unnamed: 0,labevent_id,subject_id,hadm_id,itemid,charttime,value,valueuom,ref_range_lower,ref_range_upper,labevents_flag,...,arrived_by_urgent_transport,medication_reconciliation,medication_dispensation,temperature,heartrate,resprate,o2sat,sbp,dbp,pain
744854,,18300046,,Blood Pressure,2109-04-03,100/70,mmHg,,,,...,,,,,,,,,,
744855,,18300046,,Weight (kg),2109-04-03,56.336126,kg,,,,...,,,,,,,,,,
556851,,16224440,,Weight (kg),2109-04-07,52.798109,kg,,,,...,,,,,,,,,,
556850,,16224440,,BMI (kg/m2),2109-04-07,19.6,kg/m2,,,,...,,,,,,,,,,
556849,,16224440,,Blood Pressure,2109-04-07,100/70,mmHg,,,,...,,,,,,,,,,


#### Add in prescription

In [151]:
prescription_df.head()

Unnamed: 0,subject_id,hadm_id,pharmacy_id,prod_strength,dose_val_rx,dose_unit_rx,doses_per_24_hrs,route,nonproprietaryname,duration,charttime,source
0,10001884,21268656,20911660,2 g / 50 mL Premix Bag,2,gm,1.0,IV,magnesium sulfate in water,0 days 23:00:00,2125-10-18 23:00:00,prescriptions
1,10001884,21577720,43978059,2 g / 50 mL Premix Bag,2,gm,1.0,IV,magnesium sulfate in water,0 days 11:00:00,2125-12-27 10:00:00,prescriptions
2,10001884,23594368,88623458,2 g / 50 mL Premix Bag,2,gm,1.0,IV,magnesium sulfate in water,0 days 09:00:00,2125-12-03 10:00:00,prescriptions
3,10001884,26170293,8650292,2 g / 50 mL Premix Bag,2,gm,1.0,IV,magnesium sulfate in water,0 days 23:00:00,2130-04-17 12:00:00,prescriptions
4,10001884,26184834,3491803,750 mg / 150 mL Premix Bag,750,mg,2.0,IV,vancomycin hydrochloride,1 days 11:00:00,2131-01-14 20:00:00,prescriptions


In [152]:
# we aren't using pharmacy info, so drop pharmacy id
# drop dose information because we don't have access to it for the other medication-related fields
prescription_df = prescription_df.drop(columns=['pharmacy_id', 'prod_strength', 'dose_val_rx',
                                                'doses_per_24_hrs', 'route'])

In [153]:
prescription_df = prescription_df.rename(columns={"nonproprietaryname": "medication_prescription"})

In [154]:
timeseries_df_with_prescrip = pd.concat([timeseries_df_with_vitals, prescription_df], ignore_index=False)
timeseries_df_with_prescrip = timeseries_df_with_prescrip.sort_values(by='charttime')
# timeseries_df_with_prescrip = timeseries_df_with_prescrip.set_index('charttime').reset_index(drop=True)

#### Add on hospital procedures

In [155]:
hosp_procedure_df.head()

Unnamed: 0,subject_id,hadm_id,seq_num,charttime,icd_code_root,source
0,10000719,24558333,1,2140-04-16,0TQD,hosp_procedures
1,10001319,23005466,1,2135-07-20,10E0,hosp_procedures
2,10001319,24591241,1,2138-11-10,10E0,hosp_procedures
3,10001319,29230609,1,2134-04-15,0TQD,hosp_procedures
4,10001472,23506139,1,2186-01-11,0TQD,hosp_procedures


In [156]:
hosp_procedure_df = hosp_procedure_df.drop(columns='seq_num')
hosp_procedure_df = hosp_procedure_df.rename(columns={'icd_code_root': 'itemid'})

In [157]:
final_timeseries_df = pd.concat([timeseries_df_with_prescrip, hosp_procedure_df], ignore_index=False)
final_timeseries_df = final_timeseries_df.sort_values(by='charttime')
final_timeseries_df = final_timeseries_df

In [22]:
"""
Saves pandas DataFrame as a CSV file.
"""
def save_df_as_csv(df, csv_name, directory='dataframes'):
    if not os.path.exists(directory):
        os.makedirs(directory)

    file_path = os.path.join(directory, csv_name)
    df.to_csv(file_path, index=False)

    print(f'DataFrame has been saved as {file_path}')

In [159]:
save_df_as_csv(final_timeseries_df, 'timeseries.csv', 'final_dfs')

DataFrame has been saved as final_dfs/timeseries.csv
