<a href="https://colab.research.google.com/github/victormurcia/VCHAMPS/blob/main/VCHAMPS_Model_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, I'll show how I prepare the data for model training. I may also run the training here.

In [1]:
#General utilities
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm  # Import tqdm for the progress bar
import glob,shutil,os,warnings,math,time,sys,re
from typing import List
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

#For UUID generation
import uuid

#For Slider viz
import ipywidgets as widgets
from IPython.display import display, clear_output,HTML

#Enable data to be extracted and downloaded from my Google Drive
from google.colab import drive, files
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Specify the path to the desired directory
directory_path = r'/content/drive/MyDrive/VCHAMPS - Final Train Data'

# Change the current working directory to the desired directory
os.chdir(directory_path)

# Verify the current working directory
cwd = os.getcwd()

print(f"Current working directory: {cwd}")

Current working directory: /content/drive/MyDrive/VCHAMPS - Final Train Data


In [3]:
#get list of parquet files
parquet_files = glob.glob(os.path.join(directory_path, '*.parquet'))

# Extract filenames from file paths and use them for rearrangement
file_names = [os.path.basename(file_path) for file_path in parquet_files]

# Rearrange the list so that demographics static is next to last and inpatient admissions is at the end
rearranged_list = [fn for fn in file_names if fn != 'demographics_static.parquet' and fn != 'inpatient_admissions.parquet']
rearranged_list.append('demographics_static.parquet')
rearranged_list.append('inpatient_admissions.parquet')

#Add file paths back (reuse directory_path instead of repeating the literal path)
rearranged_file_paths = [os.path.join(directory_path, fn) for fn in rearranged_list]

rearranged_file_paths

['/content/drive/MyDrive/VCHAMPS - Final Train Data/conditions.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/demographics_event.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/death.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/ed_visits.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/immunizations.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/inpatient_location.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/medications_administered.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/outpatient_visits.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/lab_results.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/measurements.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/measurements_bp.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/inpatient_specialty.parquet',
 '/content/drive/MyDrive/VCHAMPS - Final Train Data/medications_ordered.p
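
Later cells pick files by position (for example `rearranged_file_paths[10]`), which depends on the order `glob.glob` happens to return. As a minimal sketch of an alternative, the paths could be indexed by file name instead. The helper `index_by_name` and the example paths below are hypothetical and only illustrate the idea.

```python
import os

def index_by_name(paths):
    """Map each file's base name (without extension) to its full path."""
    return {os.path.splitext(os.path.basename(p))[0]: p for p in paths}

# Hypothetical paths standing in for the glob result above
example_paths = [
    "/data/conditions.parquet",
    "/data/measurements_bp.parquet",
    "/data/inpatient_admissions.parquet",
]

files = index_by_name(example_paths)
bp_path = files["measurements_bp"]  # lookup no longer depends on list order
```

With the real list this would be `files = index_by_name(rearranged_file_paths)`, and each later `pd.read_parquet` call could name the table it wants.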

# Combining Encounters
The first thing I'll try is to combine the encounter dataframes (ED visits, outpatient visits, and inpatient admissions), since each of the rows here represents a hospitalization. I'll optimize the datatypes and remove any unnecessary features.
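
One thing worth guarding against when shrinking datatypes: casting to a small integer type does not necessarily raise when a value is out of range, so a bad value can be silently corrupted. The sketch below fills missing values with a sentinel and checks the range before casting. The helper name `to_small_int` is an assumption for illustration and is not part of this notebook's code.

```python
import numpy as np
import pandas as pd

def to_small_int(series, dtype="int8", sentinel=-1):
    """Fill NaNs with a sentinel and cast to a small integer dtype,
    raising if any value would not fit in that dtype."""
    filled = series.fillna(sentinel)
    info = np.iinfo(dtype)
    if filled.min() < info.min or filled.max() > info.max:
        raise ValueError(f"{series.name!r} does not fit in {dtype}")
    return filled.astype(dtype)

# Toy column: ages with one missing value
ages = pd.Series([64.0, np.nan, 71.0], name="Age at ed visit")
small = to_small_int(ages)
```

Note that -1 is only safe as a sentinel for columns whose real values are never negative, which is the assumption this notebook makes for ages and flags.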

In [4]:
# Extract the specific files into a new list
encounter_files = [
    file_path for file_path in rearranged_file_paths
    if 'ed_visits.parquet' in file_path
    or 'outpatient_visits.parquet' in file_path
    or 'inpatient_admissions.parquet' in file_path
]

# Load each Parquet file into separate Dask DataFrames
ed_visits = dd.read_parquet(encounter_files[0])
outpatient_visits = dd.read_parquet(encounter_files[1])
inpatient_admissions = dd.read_parquet(encounter_files[2])

dataframes = [ed_visits, outpatient_visits, inpatient_admissions]

# Automate the categorical conversion process for all DataFrames
for df in dataframes:
    # Find categorical columns in the DataFrame
    categorical_columns = [col for col in df.columns if df[col].dtype.name == 'category']

    # Convert categorical columns to non-categorical data types (object)
    for col in categorical_columns:
        df[col] = df[col].astype('object')

# Concatenate the DataFrames vertically into a new DataFrame
concatenated_df = dd.concat([ed_visits,outpatient_visits,inpatient_admissions])
#Drop datetime columns
datetime_columns = concatenated_df.select_dtypes(include='datetime').columns
concatenated_df = concatenated_df.drop(columns=datetime_columns)
concatenated_df = concatenated_df.compute()

#Modify float columns
float_columns = concatenated_df.select_dtypes(include='float').columns
for col in float_columns:
  concatenated_df[col] = concatenated_df[col].fillna(-1).astype('int16')

#Fill NaNs with -1
concatenated_df = concatenated_df.fillna(-1)

# Define the mapping dictionary
mapping = {'Yes': 1, 'No': 0}

# Replace 'Yes'/'No' with 1/0 in each flag column using replace() with regex=True
flag_columns = ['Agentorangeflag', 'Combatflag', 'Ionizingradiationflag',
                'Serviceconnectedflag', 'Swasiaconditionsflag']
for col in flag_columns:
    concatenated_df[col] = concatenated_df[col].replace(mapping, regex=True).astype('int8')

# Columns to be converted to int8
cols_to_convert = [
    'pre_hosp_any',
    'pre_hosp_cv',
    'readmit_allcause_30d',
    'readmit_allcause_90d',
    'readmit_allcause_180d',
    'readmit_allcause_365d',
    'readmit_CV_30d',
    'readmit_CV_90d',
    'readmit_CV_180d',
    'readmit_CV_365d',
    'mortality_inhosp_allcause',
    'Outpatientreferralflag',
    'cc Status_CC',
    'cc Status_MCC',
    'cc Status_NCC',
    'Discharge disposition_Death with autopsy',
    'Discharge disposition_Death without autopsy',
    'Discharge disposition_Irregular',
    'Discharge disposition_NBC or while ASIH',
    'Discharge disposition_Regular',
    'Discharge disposition_Transfer',
    'rehosp_allcause',
    'Age at admission',
    'Age at visit',
    'Age at ed visit',
    'Died during ed visit',
    'CV diagnosis'
]

# Convert columns to int8
for col in cols_to_convert:
    concatenated_df[col] = concatenated_df[col].astype('int8')

concatenated_df.drop(columns=['cc Status'], inplace=True)

cols_to_replace = ['code', 'Stop code', 'Discharging unit service', 'Admitting specialty', 'Discharging specialty','diagnosis','Admitting unit service']

for col in cols_to_replace:
    concatenated_df[col] = concatenated_df[col].replace(-1, 'NA')

concatenated_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,pre_hosp_any,pre_hosp_cv,readmit_allcause_30d,readmit_allcause_90d,readmit_allcause_180d,readmit_allcause_365d,readmit_CV_30d,readmit_CV_90d,readmit_CV_180d,readmit_CV_365d
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
435812,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,15,0,1,1,1,1,0,0,0,0
910894,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,16,0,1,1,1,1,0,0,0,0
910895,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,17,0,1,1,1,1,0,0,0,0
432541,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,0,0,-1,-1,-1,-1,0,0,0,0


In [5]:
concatenated_df.dtypes

Internalpatientid                               int32
Age at ed visit                                  int8
Died during ed visit                             int8
CV diagnosis                                     int8
code                                           object
Encounter ID                                   object
Age at visit                                     int8
Stop code                                      object
Agentorangeflag                                  int8
Combatflag                                       int8
Ionizingradiationflag                            int8
Serviceconnectedflag                             int8
Swasiaconditionsflag                             int8
diagnosis                                      object
Age at admission                                 int8
Admitting unit service                         object
Discharging unit service                       object
Admitting specialty                            object
Discharging specialty       

In [6]:
concatenated_df.to_parquet('/content/drive/MyDrive/VCHAMPS - Model DF/encounters.parquet')

# Merging conditions
Now that I have the encounters dataframe, I'll start merging it with other dataframes. Here I'll clean the conditions dataframe and then merge it with the encounter df and start building the model df.
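
A left merge keeps every encounter row, but it can also multiply rows when the right-hand table has repeated keys, and it leaves NaN where nothing matches. A small sketch of how this could be checked with pandas' `indicator=True` option is shown here. The toy frames are made up for illustration and are not the real data.

```python
import pandas as pd

# Toy stand-ins for the encounter and conditions tables
encounters = pd.DataFrame({
    "Internalpatientid": [1, 1, 2],
    "Encounter ID": ["a", "a", "b"],
    "code": ["I10", "E782", "M199"],
})
conditions = pd.DataFrame({
    "Internalpatientid": [1, 1],
    "Encounter ID": ["a", "a"],
    "code": ["I10", "I10"],  # duplicated key on purpose
    "DSNR_P": [1, 0],
})

keys = ["Internalpatientid", "Encounter ID", "code"]
merged = encounters.merge(conditions, on=keys, how="left", indicator=True)

# 'both' = matched rows, 'left_only' = encounters with no condition row
match_counts = merged["_merge"].value_counts()
# Rows added because a key appeared more than once on the right
extra_rows = len(merged) - len(encounters)
```

Running the same two checks on the real merge would show how many rows end up with missing `DSNR_*` values and whether the row count grew.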

In [7]:
conditions = pd.read_parquet(rearranged_file_paths[0])
#Drop datetime columns
datetime_columns = conditions.select_dtypes(include='datetime').columns
conditions = conditions.drop(columns=datetime_columns)
conditions = conditions.drop(columns=['cc Status'])

#Convert bool columns to int8
bool_columns = conditions.select_dtypes(include='bool').columns
for col in bool_columns:
  conditions[col] = conditions[col].fillna(-1).astype('int8')

#Rename column
conditions = conditions.rename(columns={'Diagnosis sequence number or rank': 'DSNR'})

#One hot encode the Diagnosis Sequence or Rank column
one_hot = pd.get_dummies(conditions['DSNR'], prefix='DSNR')
conditions = pd.concat([conditions, one_hot], axis=1)

#Drop the DSNR column
conditions.drop(columns=['DSNR'], inplace=True)
conditions

Unnamed: 0,Internalpatientid,Age at condition documentation,Diagnosis,Problem,code,Encounter ID,DSNR_1,DSNR_10,DSNR_11,DSNR_12,...,DSNR_25,DSNR_3,DSNR_4,DSNR_5,DSNR_6,DSNR_7,DSNR_8,DSNR_9,DSNR_P,DSNR_S
0,1,58,1,0,M159,36c34b45-5879-4cc7-be9b-d733061e30a3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,58,1,0,M199,333ca061-57d1-4fb1-ab99-f142ead909a1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,1,58,1,0,I10,63a9f5d1-c77c-452c-9b7f-80e86f19d423,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1,58,1,0,E782,3e246aaa-2d6d-4d78-a489-db9ef2373d65,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,1,59,1,0,E782,30ff0057-8fab-4c81-ac1b-034f8c8eedc1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9299995,118516,60,1,0,Z0389,3cf86993-a81b-57c3-964d-0857e5acd137,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9299996,118516,60,1,0,Z0389,ae641a0f-f803-5d7c-aade-62e93064e42b,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9299997,118516,61,1,0,Z0389,8e1c58fe-941f-521f-913e-5930bc72788f,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9299998,118516,62,1,0,E660,802eefc5-514a-54bc-8fbf-ad5b9236e637,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
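
The dummy columns in this notebook come out as `uint8`, but newer pandas versions return boolean dummies by default. A minimal sketch, assuming a small toy `DSNR` column, of how the dtype can be pinned explicitly so the result does not depend on the pandas version:

```python
import pandas as pd

# Toy values mimicking the mix of numeric ranks and P/S labels
dsnr = pd.Series(["P", "S", "1", "10", "2"], name="DSNR")

# dtype="uint8" fixes the output type regardless of the pandas default
one_hot = pd.get_dummies(dsnr, prefix="DSNR", dtype="uint8")
```

Because the ranks are stored as strings, the columns are ordered lexicographically (`DSNR_1`, `DSNR_10`, `DSNR_2`, ...), which matches the column order seen in the output above.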


In [8]:
conditions.dtypes

Internalpatientid                  int32
Age at condition documentation      int8
Diagnosis                           int8
Problem                             int8
code                              object
Encounter ID                      object
DSNR_1                             uint8
DSNR_10                            uint8
DSNR_11                            uint8
DSNR_12                            uint8
DSNR_13                            uint8
DSNR_14                            uint8
DSNR_15                            uint8
DSNR_16                            uint8
DSNR_17                            uint8
DSNR_18                            uint8
DSNR_19                            uint8
DSNR_2                             uint8
DSNR_20                            uint8
DSNR_21                            uint8
DSNR_22                            uint8
DSNR_23                            uint8
DSNR_24                            uint8
DSNR_25                            uint8
DSNR_3          

In [9]:
model_df = concatenated_df.merge(conditions,on=['Internalpatientid', 'Encounter ID', 'code'], how='left')
model_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,DSNR_25,DSNR_3,DSNR_4,DSNR_5,DSNR_6,DSNR_7,DSNR_8,DSNR_9,DSNR_P,DSNR_S
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18012027,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,,,,,,,,,,
18012028,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,,,,,,,,,,
18012029,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,,,,,,,,,,
18012030,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,,,,,,,,,,


In [11]:
#Modify float columns
float_columns = model_df.select_dtypes(include='float64').columns
for col in float_columns:
  model_df[col] = model_df[col].fillna(-1).astype('int8')

In [13]:
pd.set_option('display.max_rows', 10)
model_df.dtypes

Internalpatientid        int32
Age at ed visit           int8
Died during ed visit      int8
CV diagnosis              int8
code                    object
                         ...  
DSNR_7                    int8
DSNR_8                    int8
DSNR_9                    int8
DSNR_P                    int8
DSNR_S                    int8
Length: 74, dtype: object

In [14]:
model_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,DSNR_25,DSNR_3,DSNR_4,DSNR_5,DSNR_6,DSNR_7,DSNR_8,DSNR_9,DSNR_P,DSNR_S
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,0,0,0,0,0,0,0,0,1,0
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,0,0,0,0,0,0,0,0,0,1
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0,0,0,0,0,0,0,0,1,0
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0,0,0,0,0,0,0,0,0,1
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18012027,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
18012028,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
18012029,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
18012030,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1


In [15]:
model_df.to_parquet('/content/drive/MyDrive/VCHAMPS - Model DF/model.parquet')

# Merging to measurements_bp
Now I'll merge with the measurements bp dataframe. This dataframe has multiple measurements per encounter. However, the only features we need from this df are the map_avg and map_low variables we created. Therefore, I'll group this df by Encounter ID and then proceed to merge with the model df.

In [21]:
measurements_bp  = pd.read_parquet(rearranged_file_paths[10])
#Drop datetime columns
datetime_columns = measurements_bp.select_dtypes(include='datetime').columns
measurements_bp = measurements_bp.drop(columns=datetime_columns)
measurements_bp = measurements_bp.drop(columns=['Diastolic bp','Systolic bp','Age at measurement bp'])
measurements_bp

Unnamed: 0,Internalpatientid,Encounter ID,map,map_avg,map_low
0,83189,7f635061-7d06-532d-b462-3bc8a46e67b6,92.666667,81.884354,65.333333
1,83189,7f635061-7d06-532d-b462-3bc8a46e67b6,87.333333,81.884354,65.333333
2,83189,7f635061-7d06-532d-b462-3bc8a46e67b6,81.333333,81.884354,65.333333
3,83189,7f635061-7d06-532d-b462-3bc8a46e67b6,77.333333,81.884354,65.333333
4,83189,7f635061-7d06-532d-b462-3bc8a46e67b6,78.333333,81.884354,65.333333
...,...,...,...,...,...
9999995,73313,de85b144-0bbd-4423-85b3-08757a5e79e8,86.000000,86.000000,86.000000
9999996,144793,d761ab81-f7f9-4aad-ae62-db7e65da9626,83.000000,83.000000,83.000000
9999997,62421,ebcb336a-f816-4e01-b856-cdbeeee41cef,100.666667,100.666667,100.666667
9999998,141195,91d4f847-f532-4964-a229-5afc9fccd4f1,94.000000,94.000000,94.000000


I'll group this df by patient and Encounter ID and drop map, since map_avg and map_low already summarize it for each encounter.
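
Taking the first value per group is only safe if `map_avg` and `map_low` really are constant within an encounter. The following sketch, on made-up numbers, checks that assumption with `nunique` before collapsing to one row per encounter.

```python
import pandas as pd

# Toy blood pressure table: the summary columns repeat within an encounter
bp = pd.DataFrame({
    "Internalpatientid": [1, 1, 2],
    "Encounter ID": ["a", "a", "b"],
    "map_avg": [81.9, 81.9, 86.0],
    "map_low": [65.3, 65.3, 86.0],
})
keys = ["Internalpatientid", "Encounter ID"]

# Each group should hold at most one distinct value per summary column
per_group_unique = bp.groupby(keys)[["map_avg", "map_low"]].nunique()
is_constant = bool((per_group_unique <= 1).all().all())

# Collapse to one row per encounter
per_encounter = bp.groupby(keys, as_index=False).agg(
    map_avg=("map_avg", "first"),
    map_low=("map_low", "first"),
)
```

If `is_constant` came back `False` on the real data, an explicit aggregation such as a mean or minimum would be a safer choice than `first`.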

In [22]:
measurements_bp = measurements_bp.groupby(['Internalpatientid', 'Encounter ID']).agg({'map_avg': 'first','map_low': 'first'}).reset_index()
measurements_bp

Unnamed: 0,Internalpatientid,Encounter ID,map_avg,map_low
0,1,00e485bd-8982-5611-b3fb-fe574552fbed,110.333333,110.333333
1,1,047272c1-378c-502f-958f-ea11da51fab1,109.000000,109.000000
2,1,0ce62d49-ea64-5941-b70c-89b4a1c4cc8a,103.666667,103.666667
3,1,1523ee84-7595-585e-9caf-cbd954cc66b1,106.666667,106.666667
4,1,1bc600f4-fe18-532d-a7e5-0bb5e1de5232,96.666667,96.666667
...,...,...,...,...
5755866,169064,f97b45e1-15de-4094-ba56-8f106d876cda,100.000000,100.000000
5755867,169064,fc797eae-a5f3-4afc-825c-4e7beb359ee9,94.666667,94.666667
5755868,169064,fda04ee4-1480-4294-947d-e27315314c48,96.666667,96.666667
5755869,169064,fe4184b7-1167-4b84-af81-4f0bf703980a,71.666667,71.666667


In [23]:
model_df = model_df.merge(measurements_bp,on=['Internalpatientid', 'Encounter ID'], how='left')
model_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,DSNR_4,DSNR_5,DSNR_6,DSNR_7,DSNR_8,DSNR_9,DSNR_P,DSNR_S,map_avg,map_low
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,0,0,0,0,0,0,1,0,,
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,0,0,0,0,0,0,0,1,,
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0,0,0,0,0,0,1,0,76.333333,76.333333
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0,0,0,0,0,0,0,1,76.333333,76.333333
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,0,0,0,0,0,0,1,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18012027,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,75.333333,75.333333
18012028,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,102.000000,102.000000
18012029,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,102.000000,102.000000
18012030,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,79.000000,65.000000


In [24]:
model_df.to_parquet('/content/drive/MyDrive/VCHAMPS - Model DF/model.parquet')

# Merging to Measurements
Similar to measurements bp, this dataframe contains multiple entries per encounter. However, I've already constructed the variables of interest, so I just need to group the dataframe by encounter ID and then merge the result.
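
One behavior to keep in mind for this step: pandas' `GroupBy.first()` returns the first non-null value of each column, not the first row, so the result can combine values from different rows of the same encounter. A small illustration on made-up values:

```python
import numpy as np
import pandas as pd

# Two rows for one encounter, each missing a different feature
m = pd.DataFrame({
    "Encounter ID": ["a", "a"],
    "heart_hr_avg": [np.nan, 73.0],
    "BMI": [35.9, np.nan],
})

# first(): first non-null value per column within each group
first_non_null = m.groupby("Encounter ID").first()

# head(1): the literal first row of each group, NaNs included
first_row = m.groupby("Encounter ID").head(1).set_index("Encounter ID")
```

For features that were precomputed per encounter and are simply repeated or missing on individual rows, which is my reading of this table, filling from the first non-null value is probably the desired outcome.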

In [26]:
measurements  = pd.read_parquet(rearranged_file_paths[9])
#Drop datetime columns
datetime_columns = measurements.select_dtypes(include='datetime').columns
measurements = measurements.drop(columns=datetime_columns)
measurements = measurements.drop(columns=['Result numeric','Measurement','Age at measurement'])
measurements

Unnamed: 0,Internalpatientid,Encounter ID,heart_wt_admit,heart_wt_dc,heart_hr_high,heart_hr_avg,id_temp_high,id_temp_avg,id_temp_up,resp_rr_avg,resp_rr_high,discharge_hr,discharge_blood_pressure,BMI
0,1,0ca96b3d-9cca-461d-ba94-b1e8fe8ecfff,,,,,95.804066,,0.0,,,,,
1,9713,2395b56a-b7db-5a4b-869f-5ae3bcc11ee6,,,88.0,83.750000,101.642561,83.750000,1.0,18.500000,21.0,76.0,,
2,97124,a06f4802-c63a-562f-9160-4a65241f4d7e,241.440850,241.440850,66.0,66.000000,96.582804,66.000000,0.0,,,66.0,,
3,107718,e158e91c-6384-5b83-8178-dceb8893a91c,253.359525,254.649458,68.0,63.500000,95.979629,63.500000,0.0,17.000000,18.0,62.0,,
4,96334,8f78627d-d7da-5da5-bdf8-63bdc3d9e132,126.000000,119.626033,93.0,80.800000,99.667202,80.800000,0.0,16.750000,18.0,83.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27199995,56828,22babbdd-2398-5d74-80b9-d93f846b018f,188.290813,216.466102,149.0,73.090909,103.000000,73.090909,1.0,17.509804,23.0,58.0,,21.756812
27199996,128003,6d130e56-fc0b-5f5e-8897-0dc115483bb6,222.310952,241.781791,139.0,101.516129,101.409013,101.516129,1.0,21.592593,35.0,125.0,,
27199997,110343,66ac08ad-edd3-4e67-b875-efee1b188297,,,,,,,,,,,,
27199998,47388,7abb124e-78f9-5eab-8477-8b9386928933,157.357797,158.690935,80.0,66.750000,101.122829,66.750000,1.0,18.750000,21.0,70.0,,


In [27]:
measurements = measurements.groupby(['Internalpatientid', 'Encounter ID']).first().reset_index()
measurements

Unnamed: 0,Internalpatientid,Encounter ID,heart_wt_admit,heart_wt_dc,heart_hr_high,heart_hr_avg,id_temp_high,id_temp_avg,id_temp_up,resp_rr_avg,resp_rr_high,discharge_hr,discharge_blood_pressure,BMI
0,1,00143959-6e5e-4464-833e-44413fba817d,236.021404,236.021404,,,,,,,,,,35.883012
1,1,00238998-8913-4b06-8380-4d5ceed3f252,,,,,,,,18.0,18.0,,,
2,1,00e485bd-8982-5611-b3fb-fe574552fbed,,,73.0,73.0,98.102793,73.0,0.0,,,73.0,,
3,1,010792c4-1bac-42c7-b457-7f26ef5e99d8,,,,,,,,,,,,
4,1,0153602b-c7a6-4670-af53-38bb2c74d92a,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12694188,169064,a272b7a0-6764-4ec5-a0fa-a1ff3dccecbd,,,,,97.074750,,0.0,,,,,
12694189,169064,a446d553-25bd-436b-b474-e2b9ae5ed774,,,,,,,,17.0,17.0,,,
12694190,169064,b19f1a52-0e2a-4fef-8bca-62996c693a63,,,59.0,59.0,,59.0,,,,59.0,,
12694191,169064,c43271c3-2e7f-406b-8fca-513d444290e2,,,,,,,,,,,,


In [28]:
model_df = model_df.merge(measurements,on=['Internalpatientid', 'Encounter ID'], how='left')
model_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,heart_hr_high,heart_hr_avg,id_temp_high,id_temp_avg,id_temp_up,resp_rr_avg,resp_rr_high,discharge_hr,discharge_blood_pressure,BMI
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,,,,,,,,,,
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,,,,,,,,,,
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,102.0,82.285714,96.489682,82.285714,0.0,18.428571,27.0,102.0,,
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,102.0,82.285714,96.489682,82.285714,0.0,18.428571,27.0,102.0,,
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18012027,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,,,,,,,,,,
18012028,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,,,,,,,,,,
18012029,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,,,,,,,,,,
18012030,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,86.0,69.000000,102.467201,69.000000,1.0,17.800000,22.0,62.0,,


In [33]:
pd.set_option('display.max_rows', 10)
model_df.dtypes

Internalpatientid             int32
Age at ed visit                int8
Died during ed visit           int8
CV diagnosis                   int8
code                         object
                             ...   
resp_rr_avg                 float64
resp_rr_high                float64
discharge_hr                float64
discharge_blood_pressure    float64
BMI                         float64
Length: 88, dtype: object

In [32]:
model_df.to_parquet('/content/drive/MyDrive/VCHAMPS - Model DF/model.parquet')

In [34]:
model_df['Internalpatientid'].nunique()

105427

# Merge demographic info

In [38]:
demographics_event  = pd.read_parquet(rearranged_file_paths[1])
#Drop datetime columns
datetime_columns = demographics_event.select_dtypes(include='datetime').columns
demographics_event = demographics_event.drop(columns=datetime_columns)
demographics_event = demographics_event.drop(columns=['Age at update'])
#Modify bool columns
bool_columns = demographics_event.select_dtypes(include='bool').columns
for col in bool_columns:
  demographics_event[col] = demographics_event[col].fillna(-1).astype('int8')
demographics_event.rename(columns={'Unknown': 'MarriedStatus_Unknown'}, inplace=True)
demographics_event

Unnamed: 0,Internalpatientid,Divorced,Married,Never married,Separated,Single,MarriedStatus_Unknown,Widowed,Highly rural,Rural,Urban,Encounter ID
0,100028,0,1,0,0,0,0,0,0,0,1,e0f58601-833e-5982-b624-175beb0cfc77
1,100032,0,1,0,0,0,0,0,0,1,0,26960966-c254-5cdd-a22b-e4305f53dba4
2,100046,0,1,0,0,0,0,0,0,0,1,0c514ba1-9295-4acd-b397-a07f3b76fb37
3,100071,0,0,0,0,0,0,1,0,0,1,dd9f49d1-856a-4659-975a-6cb78622be7c
4,100091,0,0,1,0,0,0,0,0,0,1,b747ba8a-8696-44a2-8f35-57a046e551d2
...,...,...,...,...,...,...,...,...,...,...,...,...
133247,99898,0,1,0,0,0,0,0,0,0,1,3281f665-f906-42f4-a83a-272898de4763
133248,9995,0,1,0,0,0,0,0,0,1,0,302180f8-67b4-420e-a7cb-30efd7065455
133249,99950,0,1,0,0,0,0,0,0,0,1,3959fd06-b3f6-4982-9ec6-bd594de84193
133250,9998,1,0,0,0,0,0,0,0,0,1,8304a1cb-b7c6-4d7f-880d-4e671e386c84


In [39]:
model_df = model_df.merge(demographics_event,on=['Internalpatientid', 'Encounter ID'], how='left')
model_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,Divorced,Married,Never married,Separated,Single,MarriedStatus_Unknown,Widowed,Highly rural,Rural,Urban
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,,,,,,,,,,
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,,,,,,,,,,
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,,,,,,,,,,
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,,,,,,,,,,
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18012027,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,,,,,,,,,,
18012028,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,,,,,,,,,,
18012029,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,,,,,,,,,,
18012030,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,,,,,,,,,,


I'm going to clear some variables before moving forward to free up a bit of RAM.

In [42]:
# List of variables to delete (names that do not exist are skipped by the check below)
variables_to_delete = ['measurements_bp', 'demographics_event', 'conditions', 'concatenated_df', 'grouped_df',
                       'ed_visits', 'outpatient_visits', 'inpatient_admissions', 'dataframes', 'df',
                       'categorical_columns', 'col', 'datetime_columns', 'float_columns',
                       'mapping', 'cols_to_convert', 'cols_to_replace']

# Delete the specified variables
for var_name in variables_to_delete:
    if var_name in globals():
        del globals()[var_name]
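
Deleting a name only removes one reference to the object. As a hedged sketch, the same cleanup can be wrapped in a helper that also triggers a garbage collection pass. The helper name `drop_names` is hypothetical, and the dictionary below stands in for `globals()`.

```python
import gc

def drop_names(namespace, names):
    """Delete the given names from a namespace dict if present,
    then ask the garbage collector to reclaim unreachable objects."""
    removed = []
    for name in names:
        if name in namespace:
            del namespace[name]
            removed.append(name)
    gc.collect()
    return removed

# Toy namespace standing in for globals()
ns = {"big": list(range(1000)), "keep": 1}
removed = drop_names(ns, ["big", "missing"])
```

In a notebook this would be called as `drop_names(globals(), variables_to_delete)`. IPython also caches displayed results in `Out` and `_`, so a dataframe shown as a cell's last expression may stay in memory even after its name is deleted.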

In [52]:
#Convert columns to int8 and fill NaNs with -1
columns_to_convert = ['Divorced', 'Married', 'Never married', 'Separated', 'Single',
                      'MarriedStatus_Unknown', 'Widowed', 'Highly rural', 'Rural', 'Urban']

model_df[columns_to_convert] = model_df[columns_to_convert].fillna(-1).astype('int8')

In [53]:
model_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,Divorced,Married,Never married,Separated,Single,MarriedStatus_Unknown,Widowed,Highly rural,Rural,Urban
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18012027,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
18012028,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
18012029,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
18012030,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1


In [54]:
model_df.to_parquet('/content/drive/MyDrive/VCHAMPS - Model DF/model.parquet')

In [55]:
pd.set_option('display.max_rows', 10)
model_df.dtypes

Internalpatientid         int32
Age at ed visit            int8
Died during ed visit       int8
CV diagnosis               int8
code                     object
                          ...  
MarriedStatus_Unknown      int8
Widowed                    int8
Highly rural               int8
Rural                      int8
Urban                      int8
Length: 98, dtype: object

# Merging inpatient specialty

In [4]:
inpatient_specialty  = pd.read_parquet(rearranged_file_paths[11])
#Drop datetime columns
datetime_columns = inpatient_specialty.select_dtypes(include='datetime').columns
inpatient_specialty = inpatient_specialty.drop(columns=datetime_columns)
inpatient_specialty = inpatient_specialty.drop(columns=['Age at specialty','Specialty','counts'])
inpatient_specialty

Unnamed: 0,Internalpatientid,Encounter ID,genMed,hospice,homeCare,homelessRecovery,rehab,snf,psych,obs,...,loc_rehab,loc_snf,loc_psych,loc_obs,loc_drug,loc_stepdown,loc_icu,loc_other,ICU_days,stepdown_days
0,1,e8f395d3-c8d4-5cf4-a686-34352c9e47c3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
1,100095,98f745fb-5f07-4f4c-86f4-36979464dca3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
2,10064,437a3995-2112-5b1c-865b-6ba64c5bd0a1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
3,100751,63d9b053-ec7a-5c93-bc23-f52d522208c9,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
4,101245,33037e3c-db21-57dc-89b0-fc1e09ebd5fc,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40073402,99689,99375e10-7fdc-5328-b8cb-3f4fb64796fe,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,,
40073403,99689,99375e10-7fdc-5328-b8cb-3f4fb64796fe,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,,
40073404,58306,56ec12dd-131a-4e22-a0b1-d6026f6ca121,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,,
40073405,58306,b24abd9c-f525-442b-92eb-b712b2c626a8,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,,


In [5]:
inpatient_specialty = inpatient_specialty.groupby(['Internalpatientid', 'Encounter ID']).first().reset_index()
inpatient_specialty

Unnamed: 0,Internalpatientid,Encounter ID,genMed,hospice,homeCare,homelessRecovery,rehab,snf,psych,obs,...,loc_rehab,loc_snf,loc_psych,loc_obs,loc_drug,loc_stepdown,loc_icu,loc_other,ICU_days,stepdown_days
0,1,0ce62d49-ea64-5941-b70c-89b4a1c4cc8a,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
1,1,4ab4ae77-f40e-4ade-b4c1-c1574dc2041f,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,,
2,1,52d5e3bc-aced-53a4-b8a4-4a458e55601f,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
3,1,aee47e87-cab8-5ca7-9947-21cb2daf476b,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
4,1,d164c341-63a3-5729-9bf1-a6b54bae4a74,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
856199,169062,caba65f7-390e-40b2-8857-5a2ad2fc072a,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
856200,169062,da784870-55cd-51f3-84f4-9686c00291f9,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
856201,169062,e6460fae-0e3b-5fda-a016-f36cdd654819,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,,
856202,169062,f6956488-7e4c-5967-a95f-208aae167c25,0,0,0,0,0,0,0,1,...,0,0,0,1,0,1,0,0,,0
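
The deduplication above relies on `groupby(...).first()`, which returns the first non-null value per column within each group rather than the first row as a whole. The sketch below uses a made-up toy frame (the values are hypothetical; only the column names come from this notebook) to contrast that with `drop_duplicates` and with a `max` aggregation that keeps a flag set if any row in the encounter had it set. It is only an illustration of the difference, not a claim about which behaviour is right for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: encounter 'a' appears twice with different values
toy = pd.DataFrame({
    'Internalpatientid': [1, 1, 2],
    'Encounter ID': ['a', 'a', 'b'],
    'loc_icu': [0, 1, 0],
    'ICU_days': [np.nan, 3.0, np.nan],
})
keys = ['Internalpatientid', 'Encounter ID']

# First non-null value per column within each group
first_non_null = toy.groupby(keys).first().reset_index()

# First full row per group, NaNs included
first_row = toy.drop_duplicates(subset=keys, keep='first').reset_index(drop=True)

# Flag is 1 if any row of the encounter had it set
any_flag = toy.groupby(keys).max().reset_index()

print(first_non_null)
print(first_row)
print(any_flag)
```

If a 0/1 flag can differ between rows of the same encounter, `first()` may drop a 1 that appears on a later row, which is worth checking before relying on it.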


In [6]:
model_df = pd.read_parquet('/content/drive/MyDrive/VCHAMPS - Model DF/model.parquet')
model_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,Divorced,Married,Never married,Separated,Single,MarriedStatus_Unknown,Widowed,Highly rural,Rural,Urban
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18012027,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
18012028,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
18012029,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
18012030,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1


In [7]:
model_df = model_df.merge(inpatient_specialty,on=['Internalpatientid', 'Encounter ID'], how='left')
model_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,loc_rehab,loc_snf,loc_psych,loc_obs,loc_drug,loc_stepdown,loc_icu,loc_other,ICU_days,stepdown_days
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,,,,,,,,,,
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,,,,,,,,,,
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18012027,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,
18012028,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,,0
18012029,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,,0
18012030,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,
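
The left merge above leaves NaN in the specialty columns wherever an encounter has no match, and it also turns the integer flags into floats. A minimal sketch of how one might sanity-check such a merge is shown below, using pandas' `validate` and `indicator` options on a small hypothetical frame (the encounter IDs `e1` and `e2` are made up). It assumes the right-hand table has already been reduced to one row per key.

```python
import pandas as pd

# Hypothetical stand-ins for model_df and inpatient_specialty
left = pd.DataFrame({
    'Internalpatientid': [101689, 101689, 107210],
    'Encounter ID': ['e1', 'e1', 'e2'],
    'code': ['I502', 'N189', 'A419'],
})
right = pd.DataFrame({
    'Internalpatientid': [107210],
    'Encounter ID': ['e2'],
    'loc_other': [1],
})
keys = ['Internalpatientid', 'Encounter ID']

# validate raises MergeError if the right side has duplicate keys;
# indicator adds a '_merge' column recording where each row came from
merged = left.merge(right, on=keys, how='left',
                    validate='many_to_one', indicator=True)

# Rows of the left table that found no match on the right
n_unmatched = int((merged['_merge'] == 'left_only').sum())
merged = merged.drop(columns='_merge')

print(f"{n_unmatched} of {len(merged)} rows had no match")
print(merged.dtypes)
```

With `validate='many_to_one'` the row count of the result should equal that of the left table, so an unexpected blow-up in rows would surface as an error instead of passing silently.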


In [12]:
pd.set_option('display.max_rows', 10)
#model_df.dtypes[-26:]

In [13]:
# List of columns to convert to int8
columns_to_convert = ['genMed', 'hospice', 'homeCare', 'homelessRecovery', 'rehab', 'snf', 'psych', 'obs', 'drug',
                      'stepdown', 'icu', 'other', 'loc_genMed', 'loc_hospice', 'loc_homeCare', 'loc_homelessRecovery',
                      'loc_rehab', 'loc_snf', 'loc_psych', 'loc_obs', 'loc_drug', 'loc_stepdown', 'loc_icu', 'loc_other']

# Fill NaNs (encounters without an inpatient specialty match) with -1, then cast to int8
model_df[columns_to_convert] = model_df[columns_to_convert].fillna(-1).astype('int8')

In [16]:
columns_to_convert = ['ICU_days', 'stepdown_days']
# Fill NaNs with -1 (no dtype cast is applied to these day-count columns)
model_df[columns_to_convert] = model_df[columns_to_convert].fillna(-1)
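
The two cells above fill missing values with -1 and cast the 0/1 flags to `int8`, while the day-count columns are only filled. One thing to keep in mind is that `int8` only holds values from -128 to 127, so a day count could exceed it. The sketch below, on hypothetical toy values, shows one possible way to give the flags and the day counts different integer widths in a single call; the choice of `int16` for the day counts is an assumption for illustration, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; 300 would not fit in int8
toy = pd.DataFrame({
    'loc_icu': [np.nan, 0.0, 1.0],
    'ICU_days': [np.nan, 2.0, 300.0],
})

print(np.iinfo('int8').min, np.iinfo('int8').max)

# Fill NaNs with the -1 sentinel, then cast each column to a suitable width
converted = toy.fillna(-1).astype({'loc_icu': 'int8', 'ICU_days': 'int16'})

print(converted)
print(converted.dtypes)
```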

In [17]:
model_df

Unnamed: 0,Internalpatientid,Age at ed visit,Died during ed visit,CV diagnosis,code,Encounter ID,Age at visit,Stop code,Agentorangeflag,Combatflag,...,loc_rehab,loc_snf,loc_psych,loc_obs,loc_drug,loc_stepdown,loc_icu,loc_other,ICU_days,stepdown_days
0,101689,64,0,1,I502,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,101689,64,0,1,N189,2afc637f-1e93-5ce7-b817-0784cebc77a1,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
2,107210,71,0,0,A419,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0,0,0,0,0,0,0,1,-1,-1
3,107210,71,0,0,I959,0d3f5146-915f-5e92-b3e0-0ea9f4aa0141,-1,,-1,-1,...,0,0,0,0,0,0,0,1,-1,-1
4,107866,61,0,0,R918,efb622f7-8c68-5b26-ae9f-09147ff0dce7,-1,,-1,-1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18012027,169062,-1,-1,0,F101,e6460fae-0e3b-5fda-a016-f36cdd654819,-1,,0,-1,...,0,0,0,0,0,0,0,1,-1,-1
18012028,169062,-1,-1,0,E440,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,0,0,0,1,0,1,0,0,-1,0
18012029,169062,-1,-1,0,F101,f6956488-7e4c-5967-a95f-208aae167c25,-1,,0,-1,...,0,0,0,1,0,1,0,0,-1,0
18012030,169064,-1,-1,0,M199,0c52ac5b-8578-50a3-9f26-fbdf68dad075,-1,,0,-1,...,1,1,0,0,0,0,0,1,-1,-1


In [18]:
model_df.to_parquet('/content/drive/MyDrive/VCHAMPS - Model DF/model.parquet')

# Merging Inpatient Location
Is there anything else worth doing with this table? For now, I'll just one-hot encode the `Service` column.

In [22]:
inpatient_location  = pd.read_parquet(rearranged_file_paths[5])
#Drop datetime columns
datetime_columns = inpatient_location.select_dtypes(include='datetime').columns
inpatient_location = inpatient_location.drop(columns=datetime_columns)
inpatient_location = inpatient_location.rename(columns={'Service': 'inpatient_location_service'})
inpatient_location = inpatient_location.drop(columns=['Age at location'])
#One hot encode the inpatient_location_service column
one_hot = pd.get_dummies(inpatient_location['inpatient_location_service'], prefix='inpatient_location_service')
inpatient_location = pd.concat([inpatient_location, one_hot], axis=1)
#Drop the inpatient_location_service column
inpatient_location.drop(columns=['inpatient_location_service'], inplace=True)
inpatient_location

Unnamed: 0,Internalpatientid,Died at location,Encounter ID,inpatient_location_service_BLIND REHAB,inpatient_location_service_DOMICILIARY,inpatient_location_service_INTERMEDIATE MED,inpatient_location_service_MEDICINE,inpatient_location_service_NEUROLOGY,inpatient_location_service_NHCU,inpatient_location_service_NON-COUNT,inpatient_location_service_PSYCHIATRY,inpatient_location_service_REHAB MEDICINE,inpatient_location_service_SPINAL CORD INJURY,inpatient_location_service_SURGERY
0,100005,0,6acbda18-63a1-5086-a5f2-64bbb198632e,0,0,0,1,0,0,0,0,0,0,0
1,100008,0,ef5e041b-4e3a-5cba-afa8-de036ba335d4,0,0,0,0,0,0,0,0,0,0,1
2,100015,0,c90d9a73-08d3-5217-aa5f-69225425b1eb,0,0,0,1,0,0,0,0,0,0,0
3,100015,0,bdcc756b-a0be-509c-aedd-f366602ee3a9,0,0,0,1,0,0,0,0,0,0,0
4,100026,0,c8750041-50d2-5e6c-8587-a641615a1182,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
830348,99974,0,89bfbaf3-d368-4ca5-b5bd-cd8f08e891b7,0,0,0,1,0,0,0,0,0,0,0
830349,99981,0,72f1a0a9-3887-5b36-9c34-7fa4bc871441,0,0,0,0,0,0,0,0,0,0,1
830350,99983,0,ced196f3-133d-59a9-9040-9b3601fa2189,0,1,0,0,0,0,0,0,0,0,0
830351,9999,0,eaf7690b-8b9d-5a21-ab1b-ae476aa4b7ce,0,0,0,0,0,0,0,0,0,0,1


In [23]:
inpatient_location['Encounter ID'].nunique()

553924
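
Judging by the index, the encoded table has roughly 830k rows but only 553,924 distinct `Encounter ID` values, so some encounters span several location rows. If one row per encounter is wanted before merging, one possible approach is to take the per-column maximum of the one-hot flags, so that an encounter is marked with every service it touched. The sketch below shows this on a hypothetical toy frame (the encounter IDs `x` and `y` are made up); it is an assumption about a next step, not the method this notebook uses.

```python
import pandas as pd

# Hypothetical toy data: encounter 'x' has two location rows
toy = pd.DataFrame({
    'Internalpatientid': [100015, 100015, 100026],
    'Encounter ID': ['x', 'x', 'y'],
    'Service': ['MEDICINE', 'SURGERY', 'MEDICINE'],
})

# One-hot encode the service, forcing a compact integer dtype
one_hot = pd.get_dummies(toy['Service'],
                         prefix='inpatient_location_service', dtype='int8')
encoded = pd.concat([toy.drop(columns='Service'), one_hot], axis=1)

# Collapse to one row per encounter: a flag is 1 if any row had that service
per_encounter = encoded.groupby(
    ['Internalpatientid', 'Encounter ID'], as_index=False
).max()

print(per_encounter)
```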

In [20]:
loc_codes = pd.read_csv('/content/drive/MyDrive/inpatient_locations_codes.csv')
loc_codes

Unnamed: 0,Service,Counts
0,BLIND REHAB,3422
1,DOMICILIARY,33233
2,INTERMEDIATE MED,9132
3,MEDICINE,464201
4,NEUROLOGY,3516
...,...,...
6,NON-COUNT,73047
7,PSYCHIATRY,33988
8,REHAB MEDICINE,3747
9,SPINAL CORD INJURY,5095


# Merging Labs