# Tutorial for DataProcessorUrothelial

The DataProcessorUrothelial package streamlines the processing of Flatiron Health's advanced urothelial cancer datasets. It provides specialized functions to clean and standardize CSV files containing clinical data (e.g., demographics, vitals, labs, medications, ICD codes). Each processing function handles format-specific requirements and common data quality issues, and outputs a standardized dataframe. These dataframes can then be merged into a single, comprehensive dataset ready for analysis.

In [1]:
# Development setup only 
# These lines are only needed when running this notebook from the example folder in the repository
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

## Setup 
To begin using the package, import the required modules and initialize the processor.

In [2]:
from flatiron_cleaner.urothelial import DataProcessorUrothelial
from flatiron_cleaner.merge_utils import merge_dataframes

import pandas as pd

In [3]:
# Initialize class 
processor = DataProcessorUrothelial()

In [4]:
# Load the line-of-therapy data, which provides the index date of interest (StartDate) for each PatientID
df = pd.read_csv('../data_uro/LineOfTherapy.csv')

In [5]:
# For this example, select patients receiving first-line pembrolizumab or carboplatin plus gemcitabine.
df = (
    df
    .query('LineNumber == 1')
    .query('LineName == "Pembrolizumab" or LineName == "Carboplatin,Gemcitabine"')
    [['PatientID', 'StartDate']]
)
ids = df.PatientID.to_list()

In [6]:
df.shape

(3712, 2)
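
The same selection can be written with boolean masks and `isin`, followed by a check that each patient appears only once. The sketch below runs on made-up rows rather than Flatiron data. It assumes the downstream functions expect one index date per `PatientID`, which the tutorial does not state explicitly.

```python
import pandas as pd

# Toy stand-in for LineOfTherapy.csv; all values are made up for illustration.
toy_lot = pd.DataFrame({
    "PatientID": ["P1", "P1", "P2", "P3"],
    "LineNumber": [1, 2, 1, 1],
    "LineName": [
        "Pembrolizumab",
        "Carboplatin,Gemcitabine",
        "Carboplatin,Gemcitabine",
        "Cisplatin,Gemcitabine",
    ],
    "StartDate": ["2020-01-15", "2020-09-01", "2019-06-03", "2021-02-10"],
})

regimens = ["Pembrolizumab", "Carboplatin,Gemcitabine"]

# Keep first-line rows for the regimens of interest, and only the two columns needed.
toy_cohort = toy_lot.loc[
    (toy_lot["LineNumber"] == 1) & toy_lot["LineName"].isin(regimens),
    ["PatientID", "StartDate"],
]

# Assumption: one row (one index date) per patient is what later steps rely on.
assert toy_cohort["PatientID"].is_unique
print(toy_cohort.shape)
```

Restricting to `LineNumber == 1` is what makes the uniqueness check pass here: patient `P1` has a second-line row that would otherwise create a duplicate.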

## Cleaning CSV Files 
Process individual data files using specialized functions. Each function handles data cleaning and standardization specific to the CSV file. 

In [7]:
# Process Enhanced_AdvUrothelial.csv
enhanced_df = processor.process_enhanced(file_path = '../data_uro/Enhanced_AdvUrothelial.csv', 
                                         patient_ids = ids)

2025-03-02 20:35:59,941 - INFO - Successfully read Enhanced_AdvUrothelial.csv file with shape: (13129, 13) and unique PatientIDs: 13129
2025-03-02 20:35:59,941 - INFO - Filtering for 3712 specific PatientIDs
2025-03-02 20:35:59,943 - INFO - Successfully filtered Enhanced_AdvUrothelial.csv file with shape: (3712, 13) and unique PatientIDs: 3712
2025-03-02 20:35:59,954 - INFO - Successfully processed Enhanced_AdvUrothelial.csv file with final shape: (3712, 13) and unique PatientIDs: 3712


In [8]:
# Process Demographics.csv 
demographics_df = processor.process_demographics(file_path = '../data_uro/Demographics.csv',
                                                 index_date_df = df,
                                                 index_date_column = 'StartDate')

2025-03-02 20:35:59,967 - INFO - Successfully read Demographics.csv file with shape: (13129, 6) and unique PatientIDs: 13129
2025-03-02 20:35:59,977 - INFO - Successfully processed Demographics.csv file with final shape: (3712, 6) and unique PatientIDs: 3712


In [9]:
# Process Practice.csv
practice_df = processor.process_practice(file_path = '../data_uro/Practice.csv',
                                         patient_ids = ids)

2025-03-02 20:35:59,989 - INFO - Successfully read Practice.csv file with shape: (14181, 4) and unique PatientIDs: 13129
2025-03-02 20:35:59,989 - INFO - Filtering for 3712 specific PatientIDs
2025-03-02 20:35:59,991 - INFO - Successfully filtered Practice.csv file with shape: (4020, 4) and unique PatientIDs: 3712
2025-03-02 20:36:00,036 - INFO - Successfully processed Practice.csv file with final shape: (3712, 2) and unique PatientIDs: 3712


In [10]:
# Process Enhanced_AdvUrothelialBiomarkers.csv
biomarkers_df = processor.process_biomarkers(file_path = '../data_uro/Enhanced_AdvUrothelialBiomarkers.csv',
                                             index_date_df = df, 
                                             index_date_column = 'StartDate',
                                             days_before = None, 
                                             days_after = 14)

2025-03-02 20:36:00,055 - INFO - Successfully read Enhanced_AdvUrothelialBiomarkers.csv file with shape: (9924, 19) and unique PatientIDs: 4251
2025-03-02 20:36:00,064 - INFO - Successfully merged Enhanced_AdvUrothelialBiomarkers.csv df with index_date_df resulting in shape: (3886, 20) and unique PatientIDs: 1614
2025-03-02 20:36:00,096 - INFO - Successfully processed Enhanced_AdvUrothelialBiomarkers.csv file with final shape: (3712, 4) and unique PatientIDs: 3712


In [11]:
# Process ECOG.csv
ecog_df = processor.process_ecog(file_path = '../data_uro/ECOG.csv', 
                                 index_date_df = df,
                                 index_date_column = 'StartDate',
                                 days_before = 90,
                                 days_after = 0,
                                 days_before_further = 180)

2025-03-02 20:36:00,164 - INFO - Successfully read ECOG.csv file with shape: (184794, 4) and unique PatientIDs: 9933
2025-03-02 20:36:00,197 - INFO - Successfully merged ECOG.csv df with index_date_df resulting in shape: (66093, 5) and unique PatientIDs: 3110
2025-03-02 20:36:00,252 - INFO - Successfully processed ECOG.csv file with final shape: (3712, 3) and unique PatientIDs: 3712
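
Several functions take `days_before` and `days_after`, which appear to define a window around each patient's index date. The sketch below shows one way such a window could be applied with pandas. It is an illustration on made-up data, not the package's implementation: the column names `EcogDate` and `EcogValue`, the inclusive boundaries, and the "closest record to the index date" rule are all assumptions.

```python
import pandas as pd

# Made-up index dates and ECOG records; column names are hypothetical.
index_dates = pd.DataFrame({
    "PatientID": ["P1", "P2"],
    "StartDate": pd.to_datetime(["2020-01-15", "2019-06-03"]),
})
ecog = pd.DataFrame({
    "PatientID": ["P1", "P1", "P1", "P2"],
    "EcogDate": pd.to_datetime(["2019-09-01", "2019-12-20", "2020-02-01", "2018-01-01"]),
    "EcogValue": [0, 1, 2, 1],
})

days_before, days_after = 90, 0

# Attach each record to its patient's index date and compute the signed offset in days.
merged = ecog.merge(index_dates, on="PatientID", how="inner")
merged["days_from_index"] = (merged["EcogDate"] - merged["StartDate"]).dt.days

# Assumption: the window is inclusive on both ends.
in_window = merged[merged["days_from_index"].between(-days_before, days_after)]

# Assumption: keep the record closest to the index date for each patient.
closest = (
    in_window
    .assign(abs_days=in_window["days_from_index"].abs())
    .sort_values("abs_days")
    .drop_duplicates("PatientID")
)

# Left-join back onto the cohort so patients with no record in the window are kept as missing.
result = index_dates[["PatientID"]].merge(
    closest[["PatientID", "EcogValue"]], on="PatientID", how="left"
)
print(result)
```

Keeping patients with no record in the window is consistent with the logs above, where the merged ECOG data covers 3110 patients but the final dataframe still has 3712 rows.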


In [12]:
# Process Vitals.csv
vitals_df = processor.process_vitals(file_path = '../data_uro/Vitals.csv',
                                     index_date_df = df,
                                     index_date_column = 'StartDate',
                                     weight_days_before = 90,
                                     days_after = 0,
                                     vital_summary_lookback = 180, 
                                     abnormal_reading_threshold = 1)

2025-03-02 20:36:03,784 - INFO - Successfully read Vitals.csv file with shape: (3604484, 16) and unique PatientIDs: 13109
2025-03-02 20:36:05,146 - INFO - Successfully merged Vitals.csv df with index_date_df resulting in shape: (1122931, 17) and unique PatientIDs: 3712
2025-03-02 20:36:05,672 - INFO - Successfully processed Vitals.csv file with final shape: (3712, 8) and unique PatientIDs: 3712


In [13]:
# Process Lab.csv
labs_df = processor.process_labs(file_path = '../data_uro/Lab.csv',
                                 index_date_df = df,
                                 index_date_column = 'StartDate',
                                 additional_loinc_mappings = {'crp': ['1988-5']},
                                 days_before = 90,
                                 days_after = 0,
                                 summary_lookback = 180)

2025-03-02 20:36:18,032 - INFO - Successfully read Lab.csv file with shape: (9373598, 17) and unique PatientIDs: 12700
2025-03-02 20:36:20,496 - INFO - Successfully merged Lab.csv df with index_date_df resulting in shape: (3090751, 18) and unique PatientIDs: 3683
2025-03-02 20:36:27,302 - INFO - Successfully processed Lab.csv file with final shape: (3712, 81) and unique PatientIDs: 3712


In [14]:
# Process MedicationAdministration.csv
medications_df = processor.process_medications(file_path = '../data_uro/MedicationAdministration.csv',
                                               index_date_df = df,
                                               index_date_column = 'StartDate',
                                               days_before = 90,
                                               days_after = 0)

2025-03-02 20:36:28,512 - INFO - Successfully read MedicationAdministration.csv file with shape: (997836, 11) and unique PatientIDs: 10983
2025-03-02 20:36:28,795 - INFO - Successfully merged MedicationAdministration.csv df with index_date_df resulting in shape: (297479, 12) and unique PatientIDs: 3656
2025-03-02 20:36:28,833 - INFO - Successfully processed MedicationAdministration.csv file with final shape: (3712, 9) and unique PatientIDs: 3712


In [15]:
# Process Diagnosis.csv
diagnosis_df = processor.process_diagnosis(file_path = '../data_uro/Diagnosis.csv',
                                           index_date_df = df,
                                           index_date_column = 'StartDate',
                                           days_before = None,
                                           days_after = 0)

2025-03-02 20:36:29,232 - INFO - Successfully read Diagnosis.csv file with shape: (625348, 6) and unique PatientIDs: 13129
2025-03-02 20:36:29,324 - INFO - Successfully merged Diagnosis.csv df with index_date_df resulting in shape: (167901, 7) and unique PatientIDs: 3712
2025-03-02 20:36:29,917 - INFO - Successfully processed Diagnosis.csv file with final shape: (3712, 40) and unique PatientIDs: 3712


In [16]:
# Process Insurance.csv 
insurance_df = processor.process_insurance(file_path = '../data_uro/Insurance.csv',
                                           index_date_df = df,
                                           index_date_column = 'StartDate',
                                           days_before = None,
                                           days_after = 0,
                                           missing_date_strategy = 'liberal')

2025-03-02 20:36:29,999 - INFO - Successfully read Insurance.csv file with shape: (53709, 14) and unique PatientIDs: 12391
2025-03-02 20:36:30,036 - INFO - Successfully merged Insurance.csv df with index_date_df resulting in shape: (15340, 15) and unique PatientIDs: 3519
2025-03-02 20:36:30,081 - INFO - Successfully processed Insurance.csv file with final shape: (3712, 5) and unique PatientIDs: 3712


In [17]:
# Process Enhanced_Mortality_V2.csv and use visit, telemedicine, biomarkers, oral, and progression data to determine censoring date 
mortality_df = processor.process_mortality(file_path = '../data_uro/Enhanced_Mortality_V2.csv',
                                           index_date_df = df, 
                                           index_date_column = 'StartDate',
                                           visit_path = '../data_uro/Visit.csv', 
                                           telemedicine_path = '../data_uro/Telemedicine.csv', 
                                           biomarkers_path = '../data_uro/Enhanced_AdvUrothelialBiomarkers.csv', 
                                           oral_path = '../data_uro/Enhanced_AdvUrothelial_Orals.csv',
                                           progression_path = '../data_uro/Enhanced_AdvUrothelial_Progression.csv',
                                           drop_dates = True)

ValueError: index_date_column not found in index_date_df
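
The `ValueError` above reports that the index-date column could not be found in `index_date_df`. One way to surface this kind of problem before calling a processing function is a small pre-flight check. The helper below, `check_index_date_df`, is hypothetical and not part of `flatiron_cleaner`. It only illustrates the idea on made-up data and does not diagnose the specific failure shown here.

```python
import pandas as pd


def check_index_date_df(index_date_df, index_date_column):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    # Both the patient identifier and the index-date column must be present.
    for column in ("PatientID", index_date_column):
        if column not in index_date_df.columns:
            problems.append(f"missing column: {column}")
    # Duplicate patients would make the index date ambiguous.
    if "PatientID" in index_date_df.columns and index_date_df["PatientID"].duplicated().any():
        problems.append("duplicate PatientIDs")
    return problems


toy = pd.DataFrame({
    "PatientID": ["P1", "P2"],
    "StartDate": ["2020-01-15", "2019-06-03"],
})

ok = check_index_date_df(toy, "StartDate")
bad = check_index_date_df(toy.rename(columns={"StartDate": "start_date"}), "StartDate")
print(ok)
print(bad)
```

Returning a list of problems rather than raising on the first one makes it easier to see every issue with the dataframe in a single pass.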

## Merge Processed Dataframes
Merge the processed dataframes into a single analysis-ready dataset.

In [None]:
merged_data = merge_dataframes(enhanced_df, 
                               demographics_df, 
                               practice_df, 
                               biomarkers_df, 
                               ecog_df, 
                               vitals_df,
                               labs_df,
                               medications_df, 
                               diagnosis_df, 
                               insurance_df,
                               mortality_df)
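
Conceptually, combining per-patient dataframes amounts to repeated joins on `PatientID`. The sketch below shows that idea with `functools.reduce` and `DataFrame.merge` on made-up data. It is not the implementation of `merge_dataframes`, and the choice of an outer join is an assumption.

```python
from functools import reduce

import pandas as pd

# Made-up per-patient dataframes, one row per PatientID each.
demo = pd.DataFrame({"PatientID": ["P1", "P2"], "age": [71, 64]})
ecog = pd.DataFrame({"PatientID": ["P1", "P2"], "ecog_index": [1, 0]})
labs = pd.DataFrame({"PatientID": ["P2", "P1"], "albumin": [38.0, 41.5]})

# Fold the list into one dataframe by joining each frame onto the running result.
toy_merged = reduce(
    lambda left, right: left.merge(right, on="PatientID", how="outer"),
    [demo, ecog, labs],
)

# If every input has one row per patient, the result should too.
assert toy_merged["PatientID"].is_unique
print(toy_merged)
```

Row order in the inputs does not matter, because the join aligns rows on `PatientID` rather than on position.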

In [None]:
merged_data.columns.to_list()

In [None]:
for col, dtype in merged_data.dtypes.items():
    print(f"{col}: {dtype}")
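
When a merged dataset has many columns, it can be easier to read the dtypes alongside a missing-value count in one table. The sketch below builds such a summary on a made-up dataframe. The column names are illustrative and are not taken from the package's output.

```python
import pandas as pd

# Made-up stand-in for a merged dataset.
toy = pd.DataFrame({
    "PatientID": ["P1", "P2"],
    "age": [71, 64],
    "weight": [80.5, None],
})

# One row per column: its dtype (as text) and how many values are missing.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
})
print(summary)
```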