<a href="https://colab.research.google.com/github/victormurcia/VCHAMPS/blob/main/VCHAMPS_Lab_Results_Mapping_Encounter_IDipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to map patient encounters to the various dataframes in the VCHAMPS dataset using the Encounter IDs created from the encounters defined by the inpatient_admissions, ed_visits, and outpatient_visits datasets.

Encounter IDs are defined via a Universally Unique Identifier (UUID).
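
As a quick illustration, Python's standard `uuid` module can produce both random and name-based UUIDs. The fallback in the mapping function further down uses `uuid.uuid4()`. The namespace and key string passed to `uuid.uuid5()` here are made up for the example and are not taken from the notebook that generated the encounter IDs.

```python
import uuid

# Random (version 4) UUID: different on every call.
random_id = str(uuid.uuid4())

# Name-based (version 5) UUID: the same inputs always give the same ID.
# The namespace and the key format are assumptions for illustration only.
key = "23511|2013-05-13 19:58:45"
stable_id_a = str(uuid.uuid5(uuid.NAMESPACE_OID, key))
stable_id_b = str(uuid.uuid5(uuid.NAMESPACE_OID, key))

print(random_id)
print(stable_id_a == stable_id_b)  # True
```

One consequence worth keeping in mind: rows that fall back to `uuid4()` will get a different ID each time the mapping is rerun.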

The mapping function takes the dataframe that we want to map an Encounter ID for and, for each row, looks for an encounter with the same Internalpatientid whose start/end dates contain the row's date and whose recorded age is not greater than the age in the row. As written in Step 4, it checks ed_visits first; if it finds a match it returns that Encounter ID. If it doesn't find a match it follows the same process with inpatient_admissions, and then outpatient_visits. If no match is found in any of those dataframes then a new UUID is generated for that entry.
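
The lookup order can be sketched in plain Python on toy records. This is only an illustration of the control flow, not the pandas implementation used in Step 4: the tuples, the encounter IDs, and the helper name `find_encounter` are all made up, and the tables are checked in whatever order they are passed in.

```python
import uuid
from datetime import datetime

# Toy encounter tables: each entry is (patient_id, start, end, age, encounter_id).
# All values are made up for the example.
ed_visits = [(1, datetime(2013, 5, 12), datetime(2013, 5, 14), 66, "ed-001")]
inpatient = [(1, datetime(2013, 6, 20), datetime(2013, 6, 25), 66, "inp-001")]
outpatient = [(2, datetime(2001, 6, 24), datetime(2001, 6, 24, 23, 59), 51, "out-001")]

def find_encounter(patient_id, age, when, tables):
    """Return the first matching encounter ID, or a fresh UUID if none matches."""
    for table in tables:  # tables are checked in the order given
        for pid, start, end, enc_age, enc_id in table:
            if pid == patient_id and start <= when <= end and enc_age <= age:
                return enc_id
    return str(uuid.uuid4())

tables = [ed_visits, inpatient, outpatient]
print(find_encounter(1, 66, datetime(2013, 6, 21, 17, 21), tables))  # inp-001
print(find_encounter(3, 40, datetime(2010, 1, 1), tables))           # a new random UUID
```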

# Running the Notebook

## Step 1. Load the modules below

In [1]:
#General utilities
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm  # Import tqdm for the progress bar
import glob, shutil, os, warnings, math, time, sys, re
from typing import List
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

#For UUID generation
import uuid

#For Slider viz
import ipywidgets as widgets
from IPython.display import display, clear_output,HTML

#Enable data to be extracted and downloaded from my Google Drive
from google.colab import drive, files
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Specify the path to the desired directory
directory_path = r'/content/drive/MyDrive/VCHAMPS - Train Cleaned'

# Change the current working directory to the desired directory
os.chdir(directory_path)

# Verify the current working directory
cwd = os.getcwd()

print(f"Current working directory: {cwd}")

Current working directory: /content/drive/MyDrive/VCHAMPS - Train Cleaned


## Step 2. Load Encounter DFs
In this example I'm loading the encounter dataframes that have already had UUIDs generated for them. There's another notebook that showcases how to do that.

In [3]:
mapped_dir = '/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped'
#Load the Dataframes
ed_visits_df            = dd.read_parquet(mapped_dir + '/ed_visits.parquet')
inpatient_admissions_df = dd.read_parquet(mapped_dir + '/inpatient_admissions.parquet')
outpatient_visits_df    = dd.read_parquet(mapped_dir + '/outpatient_visits.parquet')

ed_visits_df = ed_visits_df.compute()
inpatient_admissions_df = inpatient_admissions_df.compute()
outpatient_visits_df = outpatient_visits_df.compute()

## Step 3. Load the Dataframe to be Mapped
Load the dataframe from the directory and compute it to turn it into a pandas dataframe. I've optimized and engineered all the dataframes in terms of their data typing and partitions so as to allow them to all fit into memory.

In [4]:
lab_results_df = dd.read_parquet(directory_path + '/lab_results.parquet/*.parquet')
lab_results_df = lab_results_df.compute()
lab_results_df = lab_results_df.reset_index(drop=True)
lab_results_df.columns

Index(['Internalpatientid', 'Age at lab test', 'Lab test date',
       'Result numeric', 'Specimen source', 'desc', 'concept', 'unit',
       'range_min', 'range_max'],
      dtype='object')

In [5]:
lab_results_df

Unnamed: 0,Internalpatientid,Age at lab test,Lab test date,Result numeric,Specimen source,desc,concept,unit,range_min,range_max
0,23511,66,2013-05-13 19:58:45,143.0,serum,ZSODIUM,na,mmol/L,136.0,145.0
1,23511,66,2013-06-21 17:21:08,144.0,serum,ZSODIUM,na,mmol/L,136.0,145.0
2,23256,51,2001-06-24 23:17:28,137.0,serum,ZSODIUM,na,mmol/L,136.0,148.0
3,23256,56,2006-05-20 02:08:17,149.0,serum,ZSODIUM,na,mmol/L,136.0,148.0
4,23256,60,2010-04-08 10:26:51,140.0,serum,ZSODIUM,na,mmol/L,136.0,148.0
...,...,...,...,...,...,...,...,...,...,...
44013180,31585,66,2002-05-07 11:00:57,23.0,serum,ZZCARBON DIOXIDE,bicarb,mmol/L,22.0,30.0
44013181,29243,72,2002-07-07 11:11:41,27.0,serum,ZZCARBON DIOXIDE,bicarb,mmol/L,22.0,30.0
44013182,32935,71,2000-10-25 04:55:38,27.0,serum,ZZCARBON DIOXIDE,bicarb,mmol/L,22.0,30.0
44013183,47851,44,1998-09-28 20:59:36,29.0,serum,ZZCARBON DIOXIDE,bicarb,mmol/L,22.0,30.0


Make a note of the age and date columns in the dataframe you are going to map since they will be needed for the mapping function.
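
A small sketch of one way to pick those columns out by name. It uses the column list printed in Step 3 and simple name-based heuristics, which are an assumption and may need adjusting for dataframes that name their columns differently.

```python
# Column names as printed in Step 3 for lab_results_df.
columns = ['Internalpatientid', 'Age at lab test', 'Lab test date',
           'Result numeric', 'Specimen source', 'desc', 'concept', 'unit',
           'range_min', 'range_max']

# Simple name-based heuristics; adjust them if a dataframe names its columns differently.
age_cols = [c for c in columns if c.lower().startswith('age')]
date_cols = [c for c in columns if 'date' in c.lower()]

print(age_cols)   # ['Age at lab test']
print(date_cols)  # ['Lab test date']
```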

## Step 4. Instantiate the Mapping Function
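The function below is a variant of one of the mapping functions I wrote previously. It is vectorized in the sense that each lookup uses boolean masks over the encounter dataframes, but it is still called once per row. It works fairly fast; keep in mind, though, that we are working with some pretty large datasets, so it will still take many hours to process some of them.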

In [5]:
def map_encounter_id_vectorized(row, age_col, date_col):
    """
    Maps the encounter ID for a given row based on matching criteria in different dataframes.

    Args:
        row (pandas.Series): The row containing the data to be matched.
        age_col (str): The column name for the patient's age in the row.
        date_col (str): The column name for the date to match in the row.

    Returns:
        str: The matched encounter ID if found, or a newly generated UUID if no match is found.

    """
    patient_id = row['Internalpatientid']
    patient_age = row[age_col]
    date_to_match = row[date_col]

    filtered_ed_visits = ed_visits_df[ed_visits_df['Internalpatientid'] == patient_id]
    ed_visit_match = (filtered_ed_visits['Ed visit start date'] <= date_to_match) & (filtered_ed_visits['Discharge date ed'] >= date_to_match) & (filtered_ed_visits['Age at ed visit'] <= patient_age)
    if ed_visit_match.any():
        return filtered_ed_visits.loc[ed_visit_match, 'Encounter ID'].iloc[0]

    filtered_inpatient_admissions = inpatient_admissions_df[inpatient_admissions_df['Internalpatientid'] == patient_id]
    inpatient_match = (filtered_inpatient_admissions['Admission date'] <= date_to_match) & (filtered_inpatient_admissions['Discharge date'] >= date_to_match) & (filtered_inpatient_admissions['Age at admission'] <= patient_age)
    if inpatient_match.any():
        return filtered_inpatient_admissions.loc[inpatient_match, 'Encounter ID'].iloc[0]

    filtered_outpatient_visits = outpatient_visits_df[outpatient_visits_df['Internalpatientid'] == patient_id]
    outpatient_match = (filtered_outpatient_visits['Visit start date'] <= date_to_match) & (filtered_outpatient_visits['Visit End Date'] >= date_to_match) & (filtered_outpatient_visits['Age at visit'] <= patient_age)
    if outpatient_match.any():
        return filtered_outpatient_visits.loc[outpatient_match, 'Encounter ID'].iloc[0]

    return str(uuid.uuid4())
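
For comparison, here is a sketch of how a date-interval match could be done for many rows at once with a pandas merge instead of a per-row call. It is not the method used in this notebook: the toy data, the encounter ID strings, and the helper name `match_interval` are made up, only the ED table is shown, and the age condition is left out for brevity. A merge on patient ID can produce a very large intermediate table, so on the real data it would likely need to be applied chunk by chunk.

```python
import uuid
import pandas as pd

# Toy stand-ins for the real tables; column names follow the notebook.
labs = pd.DataFrame({
    "Internalpatientid": [1, 1, 2],
    "Lab test date": pd.to_datetime(["2013-05-13", "2013-07-01", "2001-06-24"]),
})
ed = pd.DataFrame({
    "Internalpatientid": [1],
    "Ed visit start date": pd.to_datetime(["2013-05-12"]),
    "Discharge date ed": pd.to_datetime(["2013-05-14"]),
    "Encounter ID": ["ed-001"],
})

def match_interval(rows, encounters, date_col, start_col, end_col):
    """Return a Series of Encounter IDs, indexed like `rows`, for matched rows only."""
    left = rows[["Internalpatientid", date_col]].reset_index()  # keeps the row index as 'index'
    merged = left.merge(encounters, on="Internalpatientid", how="inner")
    inside = (merged[start_col] <= merged[date_col]) & (merged[date_col] <= merged[end_col])
    hits = merged.loc[inside].drop_duplicates("index")  # first match per original row
    return hits.set_index("index")["Encounter ID"]

matched = match_interval(labs, ed, "Lab test date", "Ed visit start date", "Discharge date ed")
labs["Encounter ID"] = matched.reindex(labs.index)

# Rows without a match get a fresh random UUID, as in the function above.
missing = labs["Encounter ID"].isna()
labs.loc[missing, "Encounter ID"] = [str(uuid.uuid4()) for _ in range(int(missing.sum()))]
print(labs)
```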

## Step 5. Test the mapping function on a small sample of the dataframe
To ensure proper loading and functioning, I recommend testing the mapping function on a subset of the data as shown below.

In [7]:
#Test on a small sample (copy the slice so the column assignment below does not trigger a SettingWithCopyWarning)
lab_results_df_sm = lab_results_df[:100].copy()

# Create an empty list to store the results
encounter_ids = []

# Iterate over rows and track progress using tqdm
for _, row in tqdm(lab_results_df_sm.iterrows(), total=lab_results_df_sm.shape[0], desc="Processing"):
    encounter_id = map_encounter_id_vectorized(row, 'Age at lab test',  'Lab test date')
    encounter_ids.append(encounter_id)

# Assign the encounter IDs to the dataframe
lab_results_df_sm['Encounter ID'] = encounter_ids

lab_results_df_sm

Processing: 100%|██████████| 100/100 [00:01<00:00, 55.46it/s]


Unnamed: 0,Internalpatientid,Age at lab test,Lab test date,Result numeric,Specimen source,desc,concept,unit,range_min,range_max,Encounter ID
0,23511,66,2013-05-13 19:58:45,143.0,serum,ZSODIUM,na,mmol/L,136.0,145.0,5575460f-724f-4855-bf0e-587891394a83
1,23511,66,2013-06-21 17:21:08,144.0,serum,ZSODIUM,na,mmol/L,136.0,145.0,f717e19b-5ae0-4cdc-9b0f-2bb36fb94c48
2,23256,51,2001-06-24 23:17:28,137.0,serum,ZSODIUM,na,mmol/L,136.0,148.0,b3668b28-653c-5488-bb46-f5dd14ba103a
3,23256,56,2006-05-20 02:08:17,149.0,serum,ZSODIUM,na,mmol/L,136.0,148.0,f29cb959-d880-5e6f-ba16-0d0bec5af545
4,23256,60,2010-04-08 10:26:51,140.0,serum,ZSODIUM,na,mmol/L,136.0,148.0,bdb1bb2d-215a-502b-b2b8-3b2cd2d51939
...,...,...,...,...,...,...,...,...,...,...,...
95,93185,57,2010-08-05 19:56:33,137.0,serum,ZSODIUM,na,mmol/L,136.0,145.0,6823b1fb-c937-4e6b-9a8c-6f0ceef24342
96,93185,58,2012-01-27 09:55:43,141.0,serum,ZSODIUM,na,mmol/L,136.0,145.0,7cb7f79a-a4f8-5805-a9b9-47cc402e38e6
97,93185,58,2012-02-09 18:31:25,138.0,serum,ZSODIUM,na,mmol/L,136.0,145.0,cec2f008-3f64-5b0e-9c3c-5b9d8a0cf1ab
98,93185,60,2014-06-03 04:16:49,133.0,serum,ZSODIUM,na,mmol/L,136.0,145.0,2e47d7b1-c668-56c6-a163-7ff2dbd52891


The code also shows a progress bar that indicates how much time remains until the mapping is finished.

## Step 6. Sample the dataframe for 10 million rows
Some of the dataframes have 100+ million rows, which would take about a month to process. From some basic initial testing I've done, 10 million rows should take approximately 30 hours to map completely. Given our time constraints, I think this should be more than sufficient data to train our model with while still remaining on schedule.
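
The estimate can be re-derived from whatever rate the progress bar reports. A minimal sketch, assuming the rate stays constant over the whole run (the helper name `estimate_hours` is made up here, and the sample rates are simply round numbers within the range of the progress bars shown in this notebook):

```python
def estimate_hours(n_rows, rows_per_second):
    """Estimated wall-clock hours to map n_rows at a steady rate."""
    return n_rows / rows_per_second / 3600

# Rates between roughly 55 and 77 rows/s appear in the progress bars in this notebook.
for rate in (55, 65, 75):
    print(f"{rate} rows/s -> {estimate_hours(10_000_000, rate):.1f} h for 10M rows")
```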

In [None]:
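#I'll sample 10M rows of the dataframe (see the time estimate above)
#Note: procedures_df is assumed to have been loaded already, in the same way as lab_results_df in Step 3
sampled_procedures_df = procedures_df.sample(n=10000000, random_state=42)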
sampled_procedures_df = sampled_procedures_df.reset_index(drop=True)
sampled_procedures_df

Unnamed: 0,Internalpatientid,Age at procedure,Procedure date,Procedure code,Procedure code description
0,49958,57,2012-12-17 15:45:00,85610,PROTHROMBIN TIME;
1,37454,78,2010-09-04 11:26:35,98960,EDUCATION AND TRAINING FOR PATIENT SELF-MANAGE...
2,80034,71,2008-08-07 16:29:16,93.94,RESPIRATORY MEDICATION ADMINISTERED BY NEBULIZER
3,140982,65,2014-04-08 21:21:02,85007,"BLOOD COUNT; BLOOD SMEAR, MICROSCOPIC EXAMINAT..."
4,75828,71,2004-10-30 12:58:57,99283,EMERGENCY DEPARTMENT VISIT FOR THE EVALUATION ...
...,...,...,...,...,...
9999995,135096,80,2014-02-11 05:46:06,85999,UNLISTED HEMATOLOGY AND COAGULATION PROCEDURE
9999996,50607,81,2000-07-05 06:34:03,83036,HEMOGLOBIN; GLYCOSYLATED (A1C)
9999997,50033,45,2002-08-15 07:02:28,92341,"FITTING OF SPECTACLES, EXCEPT FOR APHAKIA; BIF..."
9999998,5702,88,2018-04-23 09:11:25,V5011,FITTING/ORIENTATION/CHECKING OF HEARING AID


## Step 7. Run the mapping function
There are three lines you need to modify in the cell below to run the mapping function on your own dataframe.

Line 5

Change sampled_procedures_df (it appears twice on this line) to the name of the sampled dataframe you are processing.

Line 6

Change 'Age at procedure' to the column name in your dataframe that has the patient age.

Change 'Procedure date' to the column name in your dataframe that has the date of interest.

Both of those entries are strings.

Line 10

Change sampled_procedures_df to the name of the sampled dataframe you are processing.

In [None]:
# Create an empty list to store the results
encounter_ids = []

# Iterate over rows and track progress using tqdm
for _, row in tqdm(sampled_procedures_df.iterrows(), total=sampled_procedures_df.shape[0], desc="Processing"):
    encounter_id = map_encounter_id_vectorized(row, 'Age at procedure', 'Procedure date')
    encounter_ids.append(encounter_id)

# Assign the encounter IDs to the dataframe
sampled_procedures_df['Encounter ID'] = encounter_ids

Processing:   6%|▌         | 575484/10000000 [2:31:00<41:48:58, 62.61it/s]

## Step 8. Save the dataframe into a parquet file

In [None]:
# Save the mapped DataFrame as Parquet
sampled_procedures_df.to_parquet('/content/drive/MyDrive/Colab Notebooks/v-CHAMPS/VCHAMPS - Train Cleaned-Mapped/procedures.parquet', engine='pyarrow')

## Steps 6-8 (Reworked). Process dataframes in smaller chunks


In [None]:
save_path = '/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/lab_results'
# Define the chunk size
chunk_size = 100000

# Calculate the number of chunks
num_chunks = math.ceil(len(lab_results_df) / chunk_size)

# Create an empty list to store the encounter IDs
encounter_ids = []

# Iterate over chunks
for i in range(num_chunks):
    start_idx = i * chunk_size
    end_idx = (i + 1) * chunk_size

    # Get the chunk of dataframe
    chunk_df = lab_results_df[start_idx:end_idx]

    # Process the chunk and track progress using tqdm
    for _, row in tqdm(chunk_df.iterrows(), total=chunk_df.shape[0], desc=f"Processing Chunk {i+1}/{num_chunks}"):
        encounter_id = map_encounter_id_vectorized(row, 'Age at lab test',  'Lab test date')
        encounter_ids.append(encounter_id)

    # Create a new DataFrame with the chunk results
    chunk_results_df = chunk_df.copy()
    chunk_results_df['Encounter ID'] = encounter_ids[start_idx:end_idx]

    # Save the results of the chunk to Parquet file
    chunk_results_df.to_parquet(f'{save_path}/lab_results{i+1}.parquet', index=False)

Processing Chunk 1/441: 100%|██████████| 100000/100000 [23:54<00:00, 69.69it/s]
Processing Chunk 2/441: 100%|██████████| 100000/100000 [25:17<00:00, 65.88it/s]
Processing Chunk 3/441: 100%|██████████| 100000/100000 [26:15<00:00, 63.48it/s]
Processing Chunk 4/441: 100%|██████████| 100000/100000 [23:25<00:00, 71.16it/s]
Processing Chunk 5/441: 100%|██████████| 100000/100000 [26:55<00:00, 61.90it/s]
Processing Chunk 6/441: 100%|██████████| 100000/100000 [26:48<00:00, 62.16it/s]
Processing Chunk 7/441: 100%|██████████| 100000/100000 [28:42<00:00, 58.06it/s]
Processing Chunk 8/441: 100%|██████████| 100000/100000 [27:47<00:00, 59.97it/s]
Processing Chunk 9/441: 100%|██████████| 100000/100000 [28:33<00:00, 58.34it/s]
Processing Chunk 10/441: 100%|██████████| 100000/100000 [28:21<00:00, 58.76it/s]
Processing Chunk 11/441: 100%|██████████| 100000/100000 [28:35<00:00, 58.30it/s]
Processing Chunk 12/441: 100%|██████████| 100000/100000 [25:30<00:00, 65.35it/s]
Processing Chunk 13/441: 100%|███████
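
The chunk arithmetic used in the cell above can be checked on a small example. This is a sketch with a made-up helper name (`chunk_bounds`). It clips the last `end_idx` explicitly, whereas the notebook relies on pandas slicing, which tolerates an end index past the last row.

```python
import math

def chunk_bounds(n_rows, chunk_size):
    """Yield (chunk_number, start_idx, end_idx) with end_idx clipped to n_rows."""
    num_chunks = math.ceil(n_rows / chunk_size)
    for i in range(num_chunks):
        start_idx = i * chunk_size
        end_idx = min((i + 1) * chunk_size, n_rows)
        yield i + 1, start_idx, end_idx

bounds = list(chunk_bounds(250_000, 100_000))
print(bounds)  # [(1, 0, 100000), (2, 100000, 200000), (3, 200000, 250000)]
```

The chunk number is one-based, matching the `lab_results{i+1}.parquet` file names, while the loop index `i` is zero-based.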

In [None]:
chunk_results_df

In [None]:
# Directory path where the Parquet files are located
#directory = '/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped'
save_path = '/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/lab_results'

# Pattern to match files starting with 'lab_results' and ending with '.parquet'
pattern = 'lab_results*.parquet'

# Get a list of file paths that match the pattern
file_list = glob.glob(f'{save_path}/{pattern}')
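
`glob.glob` returns paths in arbitrary order, and a plain lexical sort puts `lab_results10` before `lab_results2`. If the original row order matters, the files can be sorted by their chunk number first. A sketch, with a made-up helper name (`chunk_number`) and example paths:

```python
import re

def chunk_number(path):
    """Extract the trailing chunk number from names like 'lab_results12.parquet'."""
    match = re.search(r'(\d+)\.parquet$', path)
    return int(match.group(1)) if match else -1

# Example paths (made up) in the arbitrary order glob might return them.
paths = ['lab_results10.parquet', 'lab_results2.parquet', 'lab_results1.parquet']
print(sorted(paths))                    # lexical order: 1, 10, 2
print(sorted(paths, key=chunk_number))  # numeric order: 1, 2, 10
```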

In [5]:
dfs = []

# Load each Parquet file into a DataFrame and append it to the list
for file_path in file_list:
    df = pd.read_parquet(file_path)
    dfs.append(df)

# Concatenate the DataFrames in the list into a single DataFrame
concatenated_df = pd.concat(dfs)

concatenated_df

Unnamed: 0,Internalpatientid,Age at lab test,Lab test date,Result numeric,Specimen source,desc,concept,unit,range_min,range_max,Encounter ID
0,23511,66,2013-05-13 19:58:45,143.000000,serum,ZSODIUM,na,mmol/L,136.0,145.0,75e105d9-27db-4639-bfd9-99214f43e737
1,23511,66,2013-06-21 17:21:08,144.000000,serum,ZSODIUM,na,mmol/L,136.0,145.0,4222f6a7-8023-40a5-ad10-58e043a47822
2,23256,51,2001-06-24 23:17:28,137.000000,serum,ZSODIUM,na,mmol/L,136.0,148.0,b3668b28-653c-5488-bb46-f5dd14ba103a
3,23256,56,2006-05-20 02:08:17,149.000000,serum,ZSODIUM,na,mmol/L,136.0,148.0,f29cb959-d880-5e6f-ba16-0d0bec5af545
4,23256,60,2010-04-08 10:26:51,140.000000,serum,ZSODIUM,na,mmol/L,136.0,148.0,bdb1bb2d-215a-502b-b2b8-3b2cd2d51939
...,...,...,...,...,...,...,...,...,...,...,...
99995,63355,59,2003-10-03 04:33:46,1.000000,plasma,CREATININE,cr,mg/dl,0.6,1.3,cff8c703-7c12-59d2-85f4-6fb1d74ff449
99996,63355,59,2003-10-06 05:04:25,1.000000,plasma,CREATININE,cr,mg/dl,0.6,1.3,cff8c703-7c12-59d2-85f4-6fb1d74ff449
99997,63355,60,2004-07-11 01:58:01,1.000000,plasma,CREATININE,cr,mg/dl,0.6,1.3,a3b0c739-6075-5baf-8a25-850347dd7b2f
99998,63355,64,2008-12-02 14:13:30,0.817920,plasma,CREATININE,cr,mg/dl,0.6,1.3,d16f5dff-1dcc-5626-ac46-7698d754bbd9


In [6]:
s_path = '/content/drive/MyDrive/VCHAMPS - Final Train Data'
concatenated_df.to_parquet(f'{s_path}/lab_results.parquet', index=False)

# Resume mapping at chunk 60
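Google Colab times out after 24 hours, so the mapping has to be restarted from the first chunk that has not been saved yet. The cells below resume at chunk 60 and then again at later chunks.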
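
Rather than editing the start index by hand after each timeout, the resume point could be read from the files already on disk. A sketch under the assumption that chunk files follow the `lab_results{i+1}.parquet` naming used in this notebook; the helper name `next_chunk_index` is made up, and the demo uses empty placeholder files in a temporary directory rather than real Parquet data.

```python
import glob
import os
import re
import tempfile

def next_chunk_index(save_dir, prefix='lab_results'):
    """Return the zero-based index of the first chunk with no saved file yet."""
    done = set()
    for path in glob.glob(os.path.join(save_dir, f'{prefix}*.parquet')):
        match = re.fullmatch(rf'{re.escape(prefix)}(\d+)\.parquet', os.path.basename(path))
        if match:
            done.add(int(match.group(1)))  # file names are one-based (i + 1)
    i = 0
    while (i + 1) in done:
        i += 1
    return i

# Demo in a temporary directory with empty placeholder files for chunks 1-3.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 3):
        open(os.path.join(tmp, f'lab_results{n}.parquet'), 'w').close()
    start = next_chunk_index(tmp)
    print(start)  # 3, i.e. resume with range(3, num_chunks)
```

This only checks that a file exists. A file that was interrupted mid-write would still count as done, so the last saved chunk may be worth re-checking before resuming.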

In [None]:
save_path = '/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/lab_results'
# Define the chunk size
chunk_size = 100000

# Calculate the number of chunks
num_chunks = math.ceil(len(lab_results_df) / chunk_size)

# Create an empty list to store the encounter IDs
encounter_ids = []
# Iterate over chunks starting from chunk 60
for i in range(59, num_chunks):
    start_idx = i * chunk_size
    end_idx = (i + 1) * chunk_size

    # Get the chunk of dataframe
    chunk_df = lab_results_df[start_idx:end_idx]

    # Create an empty list to store the encounter IDs for the current chunk
    chunk_encounter_ids = []

    # Process the chunk and track progress using tqdm
    for _, row in tqdm(chunk_df.iterrows(), total=chunk_df.shape[0], desc=f"Processing Chunk {i+1}/{num_chunks}"):
        encounter_id = map_encounter_id_vectorized(row, 'Age at lab test', 'Lab test date')
        chunk_encounter_ids.append(encounter_id)

    # Create a new DataFrame with the chunk results
    chunk_results_df = chunk_df.copy()
    chunk_results_df['Encounter ID'] = chunk_encounter_ids

    # Save the results of the chunk to Parquet file
    chunk_results_df.to_parquet(f'{save_path}/lab_results{i+1}.parquet', index=False)

Processing Chunk 60/441: 100%|██████████| 100000/100000 [22:57<00:00, 72.61it/s]
Processing Chunk 61/441: 100%|██████████| 100000/100000 [23:01<00:00, 72.38it/s]
Processing Chunk 62/441: 100%|██████████| 100000/100000 [23:09<00:00, 71.97it/s]
Processing Chunk 63/441: 100%|██████████| 100000/100000 [23:26<00:00, 71.12it/s]
Processing Chunk 64/441: 100%|██████████| 100000/100000 [23:14<00:00, 71.70it/s]
Processing Chunk 65/441: 100%|██████████| 100000/100000 [23:00<00:00, 72.43it/s]
Processing Chunk 66/441: 100%|██████████| 100000/100000 [23:13<00:00, 71.77it/s]
Processing Chunk 67/441: 100%|██████████| 100000/100000 [22:52<00:00, 72.86it/s]
Processing Chunk 68/441: 100%|██████████| 100000/100000 [23:01<00:00, 72.40it/s]
Processing Chunk 69/441: 100%|██████████| 100000/100000 [22:57<00:00, 72.60it/s]
Processing Chunk 70/441: 100%|██████████| 100000/100000 [22:59<00:00, 72.50it/s]
Processing Chunk 71/441: 100%|██████████| 100000/100000 [22:51<00:00, 72.90it/s]
Processing Chunk 72/441: 100

In [None]:
df = pd.read_parquet(file_path)

In [None]:
save_path = '/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/lab_results'
# Define the chunk size
chunk_size = 100000

# Calculate the number of chunks
num_chunks = math.ceil(len(lab_results_df) / chunk_size)

# Create an empty list to store the encounter IDs
encounter_ids = []
# Iterate over chunks starting from chunk 119
for i in range(118, num_chunks):
    start_idx = i * chunk_size
    end_idx = (i + 1) * chunk_size

    # Get the chunk of dataframe
    chunk_df = lab_results_df[start_idx:end_idx]

    # Create an empty list to store the encounter IDs for the current chunk
    chunk_encounter_ids = []

    # Process the chunk and track progress using tqdm
    for _, row in tqdm(chunk_df.iterrows(), total=chunk_df.shape[0], desc=f"Processing Chunk {i+1}/{num_chunks}"):
        encounter_id = map_encounter_id_vectorized(row, 'Age at lab test', 'Lab test date')
        chunk_encounter_ids.append(encounter_id)

    # Create a new DataFrame with the chunk results
    chunk_results_df = chunk_df.copy()
    chunk_results_df['Encounter ID'] = chunk_encounter_ids

    # Save the results of the chunk to Parquet file
    chunk_results_df.to_parquet(f'{save_path}/lab_results{i+1}.parquet', index=False)

Processing Chunk 119/441: 100%|██████████| 100000/100000 [22:06<00:00, 75.38it/s]
Processing Chunk 120/441: 100%|██████████| 100000/100000 [22:12<00:00, 75.03it/s]
Processing Chunk 121/441: 100%|██████████| 100000/100000 [22:23<00:00, 74.41it/s]
Processing Chunk 122/441: 100%|██████████| 100000/100000 [22:14<00:00, 74.96it/s]
Processing Chunk 123/441: 100%|██████████| 100000/100000 [22:02<00:00, 75.63it/s]
Processing Chunk 124/441: 100%|██████████| 100000/100000 [22:30<00:00, 74.06it/s]
Processing Chunk 125/441: 100%|██████████| 100000/100000 [23:08<00:00, 72.04it/s]
Processing Chunk 126/441: 100%|██████████| 100000/100000 [22:47<00:00, 73.12it/s]
Processing Chunk 127/441: 100%|██████████| 100000/100000 [22:04<00:00, 75.52it/s]
Processing Chunk 128/441: 100%|██████████| 100000/100000 [22:01<00:00, 75.68it/s]
Processing Chunk 129/441: 100%|██████████| 100000/100000 [22:07<00:00, 75.36it/s]
Processing Chunk 130/441: 100%|██████████| 100000/100000 [21:47<00:00, 76.48it/s]
Processing Chunk

In [None]:
save_path = '/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/lab_results'
# Define the chunk size
chunk_size = 100000

# Calculate the number of chunks
num_chunks = math.ceil(len(lab_results_df) / chunk_size)

# Create an empty list to store the encounter IDs
encounter_ids = []
# Iterate over chunks starting from chunk 181
for i in range(180, num_chunks):
    start_idx = i * chunk_size
    end_idx = (i + 1) * chunk_size

    # Get the chunk of dataframe
    chunk_df = lab_results_df[start_idx:end_idx]

    # Create an empty list to store the encounter IDs for the current chunk
    chunk_encounter_ids = []

    # Process the chunk and track progress using tqdm
    for _, row in tqdm(chunk_df.iterrows(), total=chunk_df.shape[0], desc=f"Processing Chunk {i+1}/{num_chunks}"):
        encounter_id = map_encounter_id_vectorized(row, 'Age at lab test', 'Lab test date')
        chunk_encounter_ids.append(encounter_id)

    # Create a new DataFrame with the chunk results
    chunk_results_df = chunk_df.copy()
    chunk_results_df['Encounter ID'] = chunk_encounter_ids

    # Save the results of the chunk to Parquet file
    chunk_results_df.to_parquet(f'{save_path}/lab_results{i+1}.parquet', index=False)

Processing Chunk 181/441: 100%|██████████| 100000/100000 [22:22<00:00, 74.48it/s]
Processing Chunk 182/441: 100%|██████████| 100000/100000 [22:08<00:00, 75.28it/s]
Processing Chunk 183/441: 100%|██████████| 100000/100000 [22:02<00:00, 75.61it/s]
Processing Chunk 184/441: 100%|██████████| 100000/100000 [22:21<00:00, 74.56it/s]
Processing Chunk 185/441: 100%|██████████| 100000/100000 [22:12<00:00, 75.07it/s]
Processing Chunk 186/441: 100%|██████████| 100000/100000 [22:16<00:00, 74.82it/s]
Processing Chunk 187/441: 100%|██████████| 100000/100000 [21:47<00:00, 76.46it/s]
Processing Chunk 188/441: 100%|██████████| 100000/100000 [21:45<00:00, 76.58it/s]
Processing Chunk 189/441: 100%|██████████| 100000/100000 [21:44<00:00, 76.64it/s]
Processing Chunk 190/441: 100%|██████████| 100000/100000 [21:34<00:00, 77.25it/s]
Processing Chunk 191/441: 100%|██████████| 100000/100000 [21:35<00:00, 77.18it/s]
Processing Chunk 192/441: 100%|██████████| 100000/100000 [21:28<00:00, 77.63it/s]
Processing Chunk

In [None]:
save_path = '/content/drive/MyDrive/VCHAMPS - Train Cleaned-Mapped/lab_results'
# Define the chunk size
chunk_size = 100000

# Calculate the number of chunks
num_chunks = math.ceil(len(lab_results_df) / chunk_size)

# Create an empty list to store the encounter IDs
encounter_ids = []
# Iterate over chunks starting from chunk 240
for i in range(239, num_chunks):
    start_idx = i * chunk_size
    end_idx = (i + 1) * chunk_size

    # Get the chunk of dataframe
    chunk_df = lab_results_df[start_idx:end_idx]

    # Create an empty list to store the encounter IDs for the current chunk
    chunk_encounter_ids = []

    # Process the chunk and track progress using tqdm
    for _, row in tqdm(chunk_df.iterrows(), total=chunk_df.shape[0], desc=f"Processing Chunk {i+1}/{num_chunks}"):
        encounter_id = map_encounter_id_vectorized(row, 'Age at lab test', 'Lab test date')
        chunk_encounter_ids.append(encounter_id)

    # Create a new DataFrame with the chunk results
    chunk_results_df = chunk_df.copy()
    chunk_results_df['Encounter ID'] = chunk_encounter_ids

    # Save the results of the chunk to Parquet file
    chunk_results_df.to_parquet(f'{save_path}/lab_results{i+1}.parquet', index=False)

Processing Chunk 240/441: 100%|██████████| 100000/100000 [25:36<00:00, 65.08it/s]
Processing Chunk 241/441: 100%|██████████| 100000/100000 [26:02<00:00, 64.01it/s]
Processing Chunk 242/441: 100%|██████████| 100000/100000 [26:07<00:00, 63.78it/s]
Processing Chunk 243/441: 100%|██████████| 100000/100000 [26:05<00:00, 63.88it/s]
Processing Chunk 244/441: 100%|██████████| 100000/100000 [26:13<00:00, 63.56it/s]
Processing Chunk 245/441: 100%|██████████| 100000/100000 [26:08<00:00, 63.75it/s]
Processing Chunk 246/441: 100%|██████████| 100000/100000 [25:32<00:00, 65.24it/s]
Processing Chunk 247/441: 100%|██████████| 100000/100000 [25:59<00:00, 64.10it/s]
Processing Chunk 248/441: 100%|██████████| 100000/100000 [25:57<00:00, 64.21it/s]
Processing Chunk 249/441: 100%|██████████| 100000/100000 [26:10<00:00, 63.65it/s]
Processing Chunk 250/441: 100%|██████████| 100000/100000 [26:11<00:00, 63.64it/s]
Processing Chunk 251/441: 100%|██████████| 100000/100000 [25:49<00:00, 64.54it/s]
Processing Chunk