<a href="https://colab.research.google.com/github/victormurcia/VCHAMPS/blob/main/VCHAMPS_Generic_Mapping_Encounter_ID.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to map patient encounters to the various dataframes in the VCHAMPS dataset using the Encounter IDs created from the encounters defined by the inpatient_admissions, ed_visits, and outpatient_visits datasets.

Encounter IDs are Universally Unique Identifiers (UUIDs).

The mapping function takes a row of the dataframe that we want to map an Encounter ID for and checks ed_visits for an entry with the same Internalpatientid whose start/end dates bracket the row's date and whose recorded age is no greater than the row's age. If it finds a match, it returns that Encounter ID. If it doesn't find a match, it follows the same process with inpatient_admissions, and then outpatient_visits. If no match is found in any of those dataframes, a new UUID is generated for that entry.
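The matching logic described above can be sketched on toy data. This is a minimal illustration, not the notebook's actual function: the table, its column names (`Age`, `Start`, `End`), and the helpers `find_encounter` and `map_row` are all hypothetical.

```python
import uuid
import pandas as pd

# Toy encounter table; column names here are illustrative, not the VCHAMPS schema
visits = pd.DataFrame({
    "Internalpatientid": [1, 1, 2],
    "Age": [58, 59, 70],
    "Start": pd.to_datetime(["2002-03-01", "2002-11-20", "2010-01-01"]),
    "End": pd.to_datetime(["2002-03-05", "2002-11-25", "2010-01-03"]),
    "Encounter ID": ["enc-a", "enc-b", "enc-c"],
})

def find_encounter(lookup, patient_id, age, date):
    """Return the first encounter matching patient, age, and date window, else None."""
    mask = (
        (lookup["Internalpatientid"] == patient_id)
        & (lookup["Start"] <= date)
        & (lookup["End"] >= date)
        & (lookup["Age"] <= age)
    )
    matches = lookup.loc[mask, "Encounter ID"]
    return matches.iloc[0] if not matches.empty else None

def map_row(lookups, patient_id, age, date):
    """Try each lookup table in order; fall back to a fresh UUID if none match."""
    for lookup in lookups:
        found = find_encounter(lookup, patient_id, age, date)
        if found is not None:
            return found
    return str(uuid.uuid4())

hit = map_row([visits], 1, 58, pd.Timestamp("2002-03-03 21:37:12"))
miss = map_row([visits], 2, 70, pd.Timestamp("2015-06-01"))
print(hit)   # an Encounter ID taken from the toy table
print(miss)  # a newly generated UUID string
```

Passing the lookup tables as an ordered list makes the priority between encounter types explicit and easy to change.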

# Running the Notebook

## Step 1. Load the modules below

In [None]:
#General utilities
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm  # Import tqdm for the progress bar
import glob,shutil,os,warnings,math,time,sys,re
from typing import List
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

#For UUID generation
import uuid

#For Slider viz
import ipywidgets as widgets
from IPython.display import display, clear_output,HTML

#Enable data to be extracted and downloaded from my Google Drive
from google.colab import drive, files
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Specify the path to the desired directory
directory_path = r'/content/drive/MyDrive/VCHAMPS - Train Cleaned'

# Change the current working directory to the desired directory
os.chdir(directory_path)

# Verify the current working directory
cwd = os.getcwd()

print(f"Current working directory: {cwd}")

Current working directory: /content/drive/MyDrive/VCHAMPS - Train Cleaned


## Step 2. Load Encounter DFs
In this example I'm loading the encounter dataframes that have already had UUIDs generated for them. There's another notebook that showcases how to do that.
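
For readers without that notebook at hand, here is a minimal sketch of one way to attach a UUID to every encounter row. It assumes random version 4 UUIDs and a toy `encounters` dataframe; the companion notebook may use a different scheme.

```python
import uuid
import pandas as pd

# Hypothetical stand-in for one of the encounter dataframes
encounters = pd.DataFrame({
    "Internalpatientid": [1, 1, 2],
    "Admission date": pd.to_datetime(["2002-03-01", "2003-05-10", "2010-01-01"]),
})

# One random UUID per row, stored as a string
encounters["Encounter ID"] = [str(uuid.uuid4()) for _ in range(len(encounters))]

print(encounters)
```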

In [None]:
#Load the Dataframes
ed_visits_df            = dd.read_parquet('/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/ed_visits.parquet')
inpatient_admissions_df = dd.read_parquet('/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/inpatient_admissions.parquet')
outpatient_visits_df    = dd.read_parquet('/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/outpatient_visits.parquet')

ed_visits_df = ed_visits_df.compute()
inpatient_admissions_df = inpatient_admissions_df.compute()
outpatient_visits_df = outpatient_visits_df.compute()

## Step 3. Load the Dataframe to be Mapped
Load the dataframe from the directory and compute it to turn it into a pandas dataframe. I've optimized the data types and partitioning of all the dataframes so that each one fits into memory.
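
The exact optimizations applied to the cleaned files are not shown in this notebook. As a rough sketch of what data-type optimization can look like in pandas, the example below downcasts integer columns and converts a low-cardinality string column to a categorical. The toy frame and the choice of columns are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for a much larger table
df = pd.DataFrame({
    "Internalpatientid": [1, 2, 3, 4],
    "Age": [58, 59, 96, 44],
    "cc Status": ["NCC", "MCC", "NCC", "NCC"],
})

optimized = df.copy()

# Downcast integers to the smallest unsigned type that holds the observed values
for col in ["Internalpatientid", "Age"]:
    optimized[col] = pd.to_numeric(optimized[col], downcast="unsigned")

# Low-cardinality strings are usually cheaper to store as categoricals
optimized["cc Status"] = optimized["cc Status"].astype("category")

print(optimized.dtypes)
print(df.memory_usage(deep=True).sum(), optimized.memory_usage(deep=True).sum())
```

On a frame this small the savings are negligible; the benefit shows up on tables with millions of rows.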

In [None]:
conditions_df = dd.read_parquet('/content/drive/MyDrive/VCHAMPS - Train Cleaned/conditions_train.parquet/*.parquet')
conditions_df = conditions_df.compute()
conditions_df = conditions_df.reset_index(drop=True)
conditions_df.columns

Index(['Internalpatientid', 'Age at condition documentation',
       'Condition documented date', 'Diagnosis sequence number or rank',
       'Diagnosis', 'Problem', 'code', 'cc Status'],
      dtype='object')

In [None]:
conditions_df

Unnamed: 0,Internalpatientid,Age at condition documentation,Condition documented date,Diagnosis sequence number or rank,Diagnosis,Problem,code,cc Status
0,1,58,2002-03-03 21:37:12,S,True,False,M159,NCC
1,1,58,2002-03-03 21:37:12,S,True,False,M199,NCC
2,1,58,2002-03-03 21:37:12,P,True,False,I10,NCC
3,1,58,2002-03-03 21:37:12,S,True,False,E782,NCC
4,1,59,2002-11-23 13:29:02,S,True,False,E782,NCC
...,...,...,...,...,...,...,...,...
84324356,99999,96,2013-03-19 17:47:55,P,True,False,Z7189,NCC
84324357,99999,96,2013-03-21 22:15:17,P,True,False,N186,MCC
84324358,99999,96,2013-03-24 16:11:04,P,True,False,N186,MCC
84324359,99999,96,2013-04-11 22:34:16,P,True,False,N186,MCC


Make a note of the age and date columns in the dataframe you are going to map since they will be needed for the mapping function.
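
The mapping compares the date column against encounter start and end dates, so it should hold a datetime dtype. Whether the cleaned Parquet files already store datetimes is not shown here, so the check below is a defensive sketch on a toy frame rather than a required step.

```python
import pandas as pd

df = pd.DataFrame({
    "Age at condition documentation": [58, 59],
    "Condition documented date": ["2002-03-03 21:37:12", "2002-11-23 13:29:02"],
})

date_col = "Condition documented date"

# Convert only if the column is not already a datetime dtype
if not pd.api.types.is_datetime64_any_dtype(df[date_col]):
    df[date_col] = pd.to_datetime(df[date_col])

print(df.dtypes)
```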

## Step 4. Instantiate the Mapping Function
The function below is a variant of one of the mapping functions I wrote previously. It is applied one row at a time, but each lookup uses vectorized pandas boolean masks, so it runs fairly fast. Keep in mind that we are working with some pretty large datasets, so some of them will still take many hours to process.

In [None]:
def map_encounter_id_vectorized(row, age_col, date_col):
    """
    Maps the encounter ID for a given row based on matching criteria in different dataframes.

    Args:
        row (pandas.Series): The row containing the data to be matched.
        age_col (str): The column name for the patient's age in the row.
        date_col (str): The column name for the date to match in the row.

    Returns:
        str: The matched encounter ID if found, or a newly generated UUID if no match is found.

    """
    patient_id = row['Internalpatientid']
    patient_age = row[age_col]
    date_to_match = row[date_col]

    # 1) ED visits: same patient, date inside the visit window, visit age <= row age
    filtered_ed_visits = ed_visits_df[ed_visits_df['Internalpatientid'] == patient_id]
    ed_visit_match = (
        (filtered_ed_visits['Ed visit start date'] <= date_to_match)
        & (filtered_ed_visits['Discharge date ed'] >= date_to_match)
        & (filtered_ed_visits['Age at ed visit'] <= patient_age)
    )
    if ed_visit_match.any():
        return filtered_ed_visits.loc[ed_visit_match, 'Encounter ID'].iloc[0]

    # 2) Inpatient admissions: same criteria against the admission/discharge dates
    filtered_inpatient_admissions = inpatient_admissions_df[inpatient_admissions_df['Internalpatientid'] == patient_id]
    inpatient_match = (
        (filtered_inpatient_admissions['Admission date'] <= date_to_match)
        & (filtered_inpatient_admissions['Discharge date'] >= date_to_match)
        & (filtered_inpatient_admissions['Age at admission'] <= patient_age)
    )
    if inpatient_match.any():
        return filtered_inpatient_admissions.loc[inpatient_match, 'Encounter ID'].iloc[0]

    # 3) Outpatient visits: same criteria against the visit start/end dates
    filtered_outpatient_visits = outpatient_visits_df[outpatient_visits_df['Internalpatientid'] == patient_id]
    outpatient_match = (
        (filtered_outpatient_visits['Visit start date'] <= date_to_match)
        & (filtered_outpatient_visits['Visit End Date'] >= date_to_match)
        & (filtered_outpatient_visits['Age at visit'] <= patient_age)
    )
    if outpatient_match.any():
        return filtered_outpatient_visits.loc[outpatient_match, 'Encounter ID'].iloc[0]

    # No matching encounter anywhere: generate a new UUID for this row
    return str(uuid.uuid4())
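
The function above filters each full encounter dataframe by patient ID on every call. One possible speed-up, which this notebook does not use, is to split each encounter table by patient once and look up the patient's group directly. The sketch below assumes a toy ED table, and the helper `match_in_group` is hypothetical.

```python
import pandas as pd

# Toy encounter table using the ED column names from the function above
ed = pd.DataFrame({
    "Internalpatientid": [1, 1, 2],
    "Age at ed visit": [58, 59, 70],
    "Ed visit start date": pd.to_datetime(["2002-03-01", "2002-11-20", "2010-01-01"]),
    "Discharge date ed": pd.to_datetime(["2002-03-05", "2002-11-25", "2010-01-03"]),
    "Encounter ID": ["enc-a", "enc-b", "enc-c"],
})

# Split once by patient so each lookup touches only that patient's rows
ed_by_patient = {pid: grp for pid, grp in ed.groupby("Internalpatientid")}

def match_in_group(groups, patient_id, age, date, start_col, end_col, age_col):
    """Apply the same matching rule to a pre-grouped encounter table."""
    grp = groups.get(patient_id)
    if grp is None:
        return None
    mask = (grp[start_col] <= date) & (grp[end_col] >= date) & (grp[age_col] <= age)
    return grp.loc[mask, "Encounter ID"].iloc[0] if mask.any() else None

result = match_in_group(
    ed_by_patient, 1, 59, pd.Timestamp("2002-11-23 13:29:02"),
    "Ed visit start date", "Discharge date ed", "Age at ed visit",
)
print(result)
```

Building the dictionary costs one pass over each encounter table and extra memory, so whether it pays off depends on how many rows are being mapped.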

## Step 5. Test the mapping function on a small sample of the dataframe
To ensure proper loading and functioning, I recommend testing the mapping function on a subset of the data as shown below.

In [None]:
#Test on a small sample (copy it so that adding a column does not write to a slice of conditions_df)
conditions_df_sm = conditions_df[:100].copy()

# Create an empty list to store the results
encounter_ids = []

# Iterate over rows and track progress using tqdm
for _, row in tqdm(conditions_df_sm.iterrows(), total=conditions_df_sm.shape[0], desc="Processing"):
    encounter_id = map_encounter_id_vectorized(row, 'Age at condition documentation', 'Condition documented date')
    encounter_ids.append(encounter_id)

# Assign the encounter IDs to the dataframe
conditions_df_sm['Encounter ID'] = encounter_ids

conditions_df_sm

Processing: 100%|██████████| 100/100 [00:02<00:00, 42.18it/s]


Unnamed: 0,Internalpatientid,Age at condition documentation,Condition documented date,Diagnosis sequence number or rank,Diagnosis,Problem,code,cc Status,Encounter ID
0,1,58,2002-03-03 21:37:12,S,True,False,M159,NCC,c9fd753c-60a6-4c6b-b808-b45b0e18806d
1,1,58,2002-03-03 21:37:12,S,True,False,M199,NCC,0498ddf2-fe62-4448-a3a1-18c3555d565a
2,1,58,2002-03-03 21:37:12,P,True,False,I10,NCC,2e80fb87-ba4d-48ae-8eb3-8baea6154441
3,1,58,2002-03-03 21:37:12,S,True,False,E782,NCC,90ac8422-96e4-4056-8814-080748b58578
4,1,59,2002-11-23 13:29:02,S,True,False,E782,NCC,47c16fe9-2531-44b8-b486-09b4c0f3fcc6
...,...,...,...,...,...,...,...,...,...
95,10000,65,2023-12-01 06:09:51,S,True,False,H90A,NCC,8def23ed-5cf3-54f2-bdd7-a9c7cdf63d9a
96,10000,65,2023-12-01 06:09:51,P,True,False,Z461,NCC,8def23ed-5cf3-54f2-bdd7-a9c7cdf63d9a
97,10000,65,2023-12-01 06:09:51,S,True,False,H90A,NCC,8def23ed-5cf3-54f2-bdd7-a9c7cdf63d9a
98,10000,65,2023-12-09 06:04:23,S,True,False,G473,NCC,f0ea0c2a-9094-5b62-8121-395178a0931b


The code also shows a progress bar that reports how far along the mapping is and estimates the time remaining.

## Step 6. Sample the dataframe for 10 million rows
Some of the dataframes have 100+ million rows, which would take about a month to process. From some basic initial testing, 10 million rows should take approximately 30 hours to map completely. Given our time constraints, I think this should be more than enough data to train our model while still remaining on schedule.
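
The time estimate is simple arithmetic on the processing rate that tqdm reports. A small sketch follows; the helper `estimate_hours` and the rates in the loop are placeholders, so substitute the rate from your own run.

```python
def estimate_hours(n_rows: int, rows_per_second: float) -> float:
    """Estimated wall-clock hours to process n_rows at a steady rate."""
    return n_rows / rows_per_second / 3600

# Rates below are placeholders; read the real rate off the tqdm bar for your run
for rate in (50, 100, 200):
    print(f"{rate} rows/s -> {estimate_hours(10_000_000, rate):.1f} hours for 10M rows")
```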

In [None]:
#I'll sample 10M rows of the dataframe to keep the mapping time manageable
sampled_conditions_df = conditions_df.sample(n=10000000, random_state=42)
sampled_conditions_df = sampled_conditions_df.reset_index(drop=True)
sampled_conditions_df

Unnamed: 0,Internalpatientid,Age at condition documentation,Condition documented date,Diagnosis sequence number or rank,Diagnosis,Problem,code,cc Status
0,133974,68,2002-01-09 15:23:43,S,True,False,I10,NCC
1,111102,51,2001-07-16 05:32:54,P,True,False,F431,NCC
2,10437,67,2013-07-04 11:21:30,S,True,False,E780,NCC
3,78733,75,2017-07-26 14:10:37,S,True,False,E785,NCC
4,115991,65,2015-05-19 14:32:10,S,True,False,F102,NCC
...,...,...,...,...,...,...,...,...
9999995,103533,44,2001-02-28 22:14:04,P,True,False,Z119,NCC
9999996,84012,61,2014-09-23 10:58:31,S,True,False,I509,NCC
9999997,53044,81,2004-06-02 15:41:10,S,True,False,Z7189,NCC
9999998,59981,73,2020-09-16 14:45:26,P,True,False,C61,NCC


## Step 7. Run the mapping function
There are three lines in the cell below that you need to modify to run the mapping function.

Line 5

Change both occurrences of 'your_df' to the name of the sampled dataframe you are processing.

Line 6

Change age_col to the column name in your dataframe that has the patient age.

Change date_col to the column name in your dataframe that has the date of interest.

Both of those entries are strings.

Line 10

Change 'your_df' to the name of the sampled dataframe you are processing.
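
As an alternative to editing three lines by hand, the loop can be wrapped in a function that takes the dataframe and column names as arguments. This is a sketch only: `add_encounter_ids` and `dummy_mapper` are hypothetical names, and the progress bar is omitted to keep the example self-contained.

```python
from typing import Callable
import pandas as pd

def add_encounter_ids(df: pd.DataFrame, age_col: str, date_col: str,
                      mapper: Callable) -> pd.DataFrame:
    """Return a copy of df with an 'Encounter ID' column filled by mapper(row, age_col, date_col)."""
    out = df.copy()
    out["Encounter ID"] = [mapper(row, age_col, date_col) for _, row in out.iterrows()]
    return out

# Stand-in mapper for demonstration only; in the notebook you would pass
# map_encounter_id_vectorized instead
def dummy_mapper(row, age_col, date_col):
    return f"{row['Internalpatientid']}-{row[age_col]}"

demo = pd.DataFrame({
    "Internalpatientid": [1, 2],
    "Age at condition documentation": [58, 70],
    "Condition documented date": pd.to_datetime(["2002-03-03", "2010-01-02"]),
})
mapped = add_encounter_ids(demo, "Age at condition documentation",
                           "Condition documented date", dummy_mapper)
print(mapped["Encounter ID"].tolist())
```

Working on a copy leaves the input dataframe untouched, which also avoids the chained-assignment warning that appears when a column is added to a slice.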

In [None]:
# Create an empty list to store the results
encounter_ids = []

# Iterate over rows and track progress using tqdm
for _, row in tqdm(your_df.iterrows(), total=your_df.shape[0], desc="Processing"):
    encounter_id = map_encounter_id_vectorized(row, age_col, date_col)
    encounter_ids.append(encounter_id)

# Assign the encounter IDs to the dataframe
your_df['Encounter ID'] = encounter_ids

Processing:   6%|▌         | 575484/10000000 [2:31:00<41:48:58, 62.61it/s]

In [None]:
# Save the mapped DataFrame as Parquet (this example saves demographics_event_df; substitute your own dataframe and file name)
demographics_event_df.to_parquet('/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/demographics_event.parquet', engine='pyarrow')