<a href="https://colab.research.google.com/github/victormurcia/VCHAMPS/blob/main/VCHAMPS_Encounter_Mapping_Function_Development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# VCHAMPS EDA on Quality Check Dataset
*Made by Victor M. Murcia on 7/2/2023*

This notebook showcases the development of the encounter mapping function. Encounters are identified with universally unique identifiers (UUIDs). The mapping function was applied to 13 different dataframes.

The files this mapping function was applied to are:

 1. inpatient_specialty_qual.csv
 2. immunization_qual.csv
 3. conditions_qual.csv
 4. measurements_blood_pressure_qual.csv
 5. ed_visits_qual.csv
 6. procedures_qual.csv
 7. medications_ordered_qual.csv
 8. lab_results_qual.csv
 9. outpatient_visits_qual.csv
 10. measurements_qual.csv
 11. medications_administered_qual.csv
 12. inpatient_location_qual.csv
 13. inpatient_admissions_qual.csv

The date column used for each file, listed in the same order as the files above:

val_cols_for_mapping = ['Specialty start date','Immunization date','Condition documented date','Measurement date','Ed visit start date','Procedure date','Order date', 'Lab test date','Visit start date','Measurement date','Administration date', 'Location.start.date','Admission.date']

***Basic Idea:***

1. A dictionary serves as a hash table. Its keys are (Internalpatientid, date) pairs and its values are UUIDs.
2. For each dataframe, one specific date column was chosen to supply the date part of the key.
3. The routine then goes through every row of every dataframe and looks up that row's (Internalpatientid, date) pair. If the pair is not in the dictionary yet, it is given a new UUID. If it is already there, the existing UUID is reused.

Because the key depends only on the patient ID and the start date/time, the same patient and date map to the same UUID in every dataframe, regardless of the category of record.
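
The idea can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual cell: the two small dataframes, their values, and the helper name `assign_encounter_ids` are invented for the example. Only `Internalpatientid` and the date column names come from the real files.

```python
import uuid
import pandas as pd

# Hypothetical toy data; only the column names mirror the real files.
visits = pd.DataFrame({
    "Internalpatientid": [1, 1, 2],
    "Visit start date": ["2020-01-01", "2020-01-01", "2020-03-05"],
})
labs = pd.DataFrame({
    "Internalpatientid": [1, 2],
    "Lab test date": ["2020-01-01", "2020-04-09"],
})

# Shared lookup table: (patient id, date) -> UUID string
uuid_dict = {}

def assign_encounter_ids(df, date_col, mapping):
    """Return one encounter ID per row, reusing IDs for known (patient, date) pairs."""
    ids = []
    for key in zip(df["Internalpatientid"], df[date_col]):
        if key not in mapping:
            mapping[key] = str(uuid.uuid4())  # first time this pair is seen
        ids.append(mapping[key])
    return ids

visits["Encounter ID"] = assign_encounter_ids(visits, "Visit start date", uuid_dict)
labs["Encounter ID"] = assign_encounter_ids(labs, "Lab test date", uuid_dict)
```

In this sketch the first lab row shares its Encounter ID with the first two visit rows, because all three have patient 1 and the same date string. Note that the match is exact: two values that denote the same day but are formatted differently would be treated as different encounters.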

# Required Python Libraries and Modules
Below are the various libraries and modules I used to carry out this initial cursory analysis.

In [1]:
#General utilities
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from matplotlib.lines import Line2D
import seaborn as sns
from tqdm import tqdm  # Import tqdm for the progress bar
import glob, os, warnings, uuid
from typing import List

#For Slider viz
import ipywidgets as widgets
from IPython.display import display, clear_output,HTML

#Enable data to be extracted and downloaded from my Google Drive
from google.colab import drive, files
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
def load_csvs(path2data: str) -> List[str]:
  """
  Load and return a list of CSV file paths from the specified directory.

  Args:
      path2data (str): The directory path containing the CSV files.

  Returns:
      List[str]: A list of CSV file paths.

  """
  csv_files = glob.glob(path2data + '/*.csv')
  return csv_files

def make_df_list(csv_files: List[str]) -> List[pd.DataFrame]:
  """
  Read CSV files from the provided list of file paths and return a list of DataFrames.

  Args:
      csv_files (List[str]): A list of CSV file paths.

  Returns:
      List[pd.DataFrame]: A list of DataFrames read from the CSV files.

  """
  df_list = []
  # Read the CSV file
  for csv in csv_files:
    df = pd.read_csv(csv)
    df_list.append(df)

  return df_list

def clean_filenames(csv_files: List[str]) -> List[str]:
  """
  Clean the file names by removing directory path and the .csv extension.

  Args:
      csv_files (List[str]): A list of CSV file paths.

  Returns:
      List[str]: A list of cleaned file names without directory path and file extension.

  """
  #Get list of file names without directory junk and remove .csv extension from name
  file_names = []

  for file_path in csv_files:
      file_name = os.path.basename(file_path)  # Get the file name with extension
      file_name = os.path.splitext(file_name)[0]  # Remove the file extension
      file_names.append(file_name)
  return file_names

#Define data location
path2data = '/content/drive/MyDrive/Quality Check'
#Load the .csv files into memory
csv_files  = load_csvs(path2data)
#Create list of dataframes from csvs
df_list    = make_df_list(csv_files)
#Clean the names of .csv files
file_names = clean_filenames(csv_files)

In [3]:
csv_files

['/content/drive/MyDrive/Quality Check/demographics_static_qual.csv',
 '/content/drive/MyDrive/Quality Check/inpatient_specialty_qual.csv',
 '/content/drive/MyDrive/Quality Check/immunization_qual.csv',
 '/content/drive/MyDrive/Quality Check/conditions_qual.csv',
 '/content/drive/MyDrive/Quality Check/measurements_blood_pressure_qual.csv',
 '/content/drive/MyDrive/Quality Check/demographics_event_qual.csv',
 '/content/drive/MyDrive/Quality Check/ed_visits_qual.csv',
 '/content/drive/MyDrive/Quality Check/procedures_qual.csv',
 '/content/drive/MyDrive/Quality Check/medications_ordered_qual.csv',
 '/content/drive/MyDrive/Quality Check/lab_results_qual.csv',
 '/content/drive/MyDrive/Quality Check/outpatient_visits_qual.csv',
 '/content/drive/MyDrive/Quality Check/measurements_qual.csv',
 '/content/drive/MyDrive/Quality Check/medications_administered_qual.csv',
 '/content/drive/MyDrive/Quality Check/inpatient_location_qual.csv',
 '/content/drive/MyDrive/Quality Check/inpatient_admissions_q

# Widget for Rapid Dataframe Inspection
I made this widget to quickly inspect the generated dataframes from the provided .csv files. This allows us to quickly see all the different attributes present in the data.

In [15]:
def visualize_dataframes(df_list: List[pd.DataFrame], file_names: List[str]) -> None:
    """
    Visualizes a list of DataFrames with an interactive slider.

    Args:
        df_list (List[pd.DataFrame]): A list of pandas DataFrames to be visualized.
        file_names (List[str]): A list of corresponding file names for the DataFrames.

    Returns:
        None
    """
    # Set the maximum number of rows to be displayed
    pd.set_option("display.max_rows", 10)

    # Create an index slider
    index_slider = widgets.IntSlider(min=0, max=len(df_list)-1, value=0, description='DataFrame Index')

    # Function to display the selected dataframe
    output = widgets.Output()

    def display_dataframe(index: int) -> None:
        """
        Displays the selected DataFrame based on the index.

        Args:
            index (int): Index of the DataFrame to be displayed.

        Returns:
            None
        """
        df = df_list[index]
        current_file = file_names[index]
        with output:
            clear_output(wait=True)
            print('Filename: ',current_file)
            display(df)

    # Show the output area that holds the selected dataframe
    display(output)

    # Link the slider value to the display_dataframe function
    widgets.interactive_output(display_dataframe, {'index': index_slider})

    # Display the slider
    display(index_slider)

In [11]:
#Mapping function
dfs_for_mapping = [df_list[1],df_list[2],df_list[3],df_list[4],df_list[6],df_list[7],df_list[8],df_list[9],
                   df_list[10],df_list[11],df_list[12],df_list[13],df_list[14]]
val_cols_for_mapping = ['Specialty start date','Immunization date','Condition documented date',
                        'Measurement date','Ed visit start date','Procedure date','Order date',
                        'Lab test date','Visit start date','Measurement date','Administration date',
                        'Location.start.date','Admission.date']

def get_uuid(row, value_column):
    """Return the UUID for this row's (patient ID, date) pair, creating it if needed."""
    key = (row['Internalpatientid'], row[value_column])
    if key not in uuid_dict:
        uuid_dict[key] = str(uuid.uuid4())
    return uuid_dict[key]

# Instantiate the UUID dictionary
uuid_dict = {}

total_iterations = len(dfs_for_mapping)

# Pair each DataFrame with its date column and process them in order
for df, value_column in tqdm(zip(dfs_for_mapping, val_cols_for_mapping), total=total_iterations, desc="Processing DataFrames"):
    # Apply the get_uuid function row by row to create the Encounter ID column
    df['Encounter ID'] = df.apply(lambda row: get_uuid(row, value_column), axis=1)

Processing DataFrames: 100%|██████████| 13/13 [01:02<00:00,  4.79s/it]


In [10]:
visualize_dataframes(df_list, file_names)

Output()

IntSlider(value=0, description='DataFrame Index', max=14)

In [13]:
df = df_list[2]
num_unique_encounter_ids = df['Encounter ID'].nunique()
print("Number of unique Encounter IDs:", num_unique_encounter_ids)

Number of unique Encounter IDs: 11432
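
By construction, this count should equal the number of distinct (Internalpatientid, date) pairs in the dataframe. The check below is a rough sketch on invented data, assuming no missing dates. On the real data it would use `df_list[2]` and its `Immunization date` column instead of the toy frame.

```python
import uuid
import pandas as pd

# Hypothetical toy frame standing in for immunization_qual; values are invented.
df = pd.DataFrame({
    "Internalpatientid": [10, 10, 11, 12],
    "Immunization date": ["2021-05-01", "2021-05-01", "2021-05-01", "2021-06-15"],
})

# Assign IDs following the same rule as the mapping cell:
# one UUID per (patient, date) pair.
uuid_dict = {}
keys = list(zip(df["Internalpatientid"], df["Immunization date"]))
for key in keys:
    if key not in uuid_dict:
        uuid_dict[key] = str(uuid.uuid4())
df["Encounter ID"] = [uuid_dict[key] for key in keys]

# Compare the number of IDs with the number of distinct key pairs.
n_ids = df["Encounter ID"].nunique()
n_pairs = len(df[["Internalpatientid", "Immunization date"]].drop_duplicates())
print("Unique Encounter IDs:", n_ids, "| unique (patient, date) pairs:", n_pairs)
```

If the two numbers differ on real data, the likely places to look are missing dates or inconsistent date formatting in the key column.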
